
Data Visualization

Data Preparation

Sourcing Data

The first step in creating a visualization is to source data with a research question in mind. Data can be either created through research and experimentation or sourced from a number of UT Library repositories.

Consider using these UT Library resources for guidance on finding or creating data:

  • Contact the subject librarian for your field for advice on relevant sources and data repositories: search for librarians by subject.

Clean, Transform, & Integrate Data

Once you have acquired your data, whether from your own research or from a repository, you must clean and prepare it for use with a visualization tool. The exact steps depend on the data you are working with; however, cleaning and transforming the data will almost always be the longest part of the process.

Cleaning steps to consider (a short sketch follows the list):

  • Address duplicate records and empty cells in the data table.

  • Remove variables / fields that are unnecessary for your research question.

  • Account for outliers or invalid data that may skew the observable trends in a visualization.

  • Standardize names / values for machine use.

  • Check for typos, mislabeling, misaligned columns, and other clerical errors that may affect your processing.
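Many of these steps can be scripted. Below is a minimal sketch using pandas; the file name and the column names (species, notes, site, weight) are hypothetical placeholders for your own data.

    import pandas as pd

    # Hypothetical input file; substitute your own data.
    df = pd.read_csv("survey_raw.csv")

    df = df.drop_duplicates()                 # address duplicate rows
    df = df.dropna(subset=["species"])        # drop rows missing a key field
    df = df.drop(columns=["notes"])           # remove fields unrelated to the question
    df["site"] = df["site"].str.strip().str.lower()   # standardize names for machine use

    # Flag possible outliers for review rather than silently deleting them.
    outliers = df[df["weight"] > df["weight"].quantile(0.99)]
    print(len(outliers), "possible outliers to inspect")

    df.to_csv("survey_clean.csv", index=False)        # write the cleaned copy to a new file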

The cleaned data can then be integrated with other data sets and with the rest of the information informing your visualization.

Recommended Tools for Data Cleaning

OpenRefine is an open-source desktop program used for “data wrangling”: it can clean and manage data, export it, and convert it between formats used by a number of desktop and online tools. When working with spreadsheets or other data sets that may contain inaccurate or missing information, OpenRefine can clean them without damaging or altering the original file. For information on how to install and use OpenRefine, visit the OpenRefine documentation & guide pages.

Python is a programming language that can be used to clean data through a variety of packages. Using the Natural Language Toolkit (NLTK) or spaCy, you can tokenize text so that elements are easy to remove or replace when cleaning data. For more information on Python, as well as a number of recommended books, please visit our Python Library Guide.
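As a minimal sketch of this kind of token-level cleaning with NLTK (the sample sentence and the choice of English stopwords are illustrative assumptions):

    import nltk

    # One-time downloads of tokenizer models and stopword lists;
    # newer NLTK releases may also ask for "punkt_tab".
    nltk.download("punkt")
    nltk.download("stopwords")

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords

    text = "The  quick   brown fox, the FOX!"
    tokens = word_tokenize(text.lower())      # split the text into tokens
    stop = set(stopwords.words("english"))

    # Keep only alphabetic tokens that are not stopwords.
    cleaned = [t for t in tokens if t.isalpha() and t not in stop]
    print(cleaned)   # ['quick', 'brown', 'fox', 'fox']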

RStudio is a desktop environment for the R language, which excels at statistical computing and data visualization. RStudio supports packages like tidytext for a number of tokenizing and NLP processes. You can learn more about R and RStudio, as well as find a number of recommended books, in our R Library Guide & our RStudio Library Guide.

Analyze & Determine the Best Kind of Visualization

There are many possible visualization types, and the best one for your project will depend on the data you possess. You may choose a map for geospatial data or a timeline for temporal data. If you are comparing two data sets, you may use a more traditional chart, like a bar graph or a box-and-whisker plot.

Several online resources can help you to discover the best option for your work:

  • DataVizCatalogue - determine which visualization to use by choosing among categories like ‘comparison,’ ‘relationship,’ ‘location,’ and more. DataVizProject also hosts examples & descriptions for different types of visualizations.

  • Consider color accessibility using web tools like colorsafe.co & WebAIM’s Contrast Checker. Appropriate contrast makes your visualization easier to read and accessible to more people.

Once you have determined the type of visualization you’d like to create, consult the Tools page of this guide and find a software tool that is capable of both ingesting your data and generating the desired output. Also keep in mind the format of your data: some tools may require specific file types or layouts for ingestion.
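For instance, here is a minimal sketch of pairing data with a chart type: a bar chart, suited to categorical comparisons, built with pandas and matplotlib. The file and column names continue the hypothetical example above.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical cleaned data from the earlier sketch.
    df = pd.read_csv("survey_clean.csv")
    counts = df["site"].value_counts()        # a categorical comparison

    counts.plot(kind="bar")                   # bar charts suit category comparisons
    plt.xlabel("Site")
    plt.ylabel("Observations")
    plt.title("Observations per site")
    plt.tight_layout()
    plt.savefig("site_counts.png", dpi=150)   # export the figure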

Data Processing Tips for Spreadsheets

Sets of data often come in the form of spreadsheets. Below are tips for maintaining clean data within a spreadsheet to ensure it can be utilized with a variety of tools and retains its accuracy.

Cardinal Rules for Spreadsheets

  • Put all your variables in columns.
  • Don't combine multiple pieces of information in one cell.
  • Put each observation on its own row. 
  • Leave the raw data raw; don't mess with it!
  • Export the cleaned data to a text-based format like CSV (see the sketch below).
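A short sketch of the last two rules, assuming a hypothetical .xlsx workbook (reading Excel files with pandas requires the openpyxl package):

    import pandas as pd

    # Read the original spreadsheet but never write back to it.
    raw = pd.read_excel("observations.xlsx", sheet_name="Sheet1")

    clean = raw.copy()   # work on a copy; the raw file stays raw
    clean.columns = [c.strip().lower().replace(" ", "_") for c in clean.columns]

    # Export to a text-based format under a new name.
    clean.to_csv("observations_clean.csv", index=False)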

Common Spreadsheet Issues

  • Multiple tables
  • Multiple tabs
  • Not filling in zeros (leaving a cell blank when the value is actually zero)
  • Using bad null values (e.g., 0 or -999 instead of a consistent blank or NA)
  • Using formatting to convey information
  • Using formatting to make the data sheet look pretty
  • Placing comments or units in cells
  • More than one piece of information in a cell
  • Field name problems
  • Special characters in data
  • Inclusion of metadata in data table
  • Inconsistent date formatting (see the sketch below)
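Two of these issues, bad null values and date formatting, can be handled at read time. Here is a minimal sketch with pandas, where the null markers and column names are assumptions about what a messy sheet might contain:

    import pandas as pd

    df = pd.read_csv(
        "field_data.csv",
        na_values=["-999", "n/a", "NA", "missing"],   # map ad-hoc null markers to NaN
    )

    # Parse dates; values that cannot be parsed become NaT for review.
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    print(df["date"].isna().sum(), "unparseable dates to review")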

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.