Many of the sources gathered in this guide are large public datasets that can feel a little overwhelming at first. As you explore them, consider the questions in the "Before you Begin" section, such as:
Does this dataset collect information at the neighborhood, city, county, state, or national level? Will the level of granularity work for your project?
Does this dataset cover the time period you need? Is the data collected frequently enough for your needs? (See the sketch below for a quick way to check once you have a sample file in hand.)
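If you have already downloaded a sample file, a few lines of code can help answer both questions. Here is a minimal sketch in Python using pandas; the file name and column names (acs_sample.csv, year, geography_level) are hypothetical placeholders standing in for whatever your dataset actually uses:

```python
import pandas as pd

# Hypothetical file and column names; substitute your own.
df = pd.read_csv("acs_sample.csv")

# What time period does the data actually cover, and how often
# was it collected?
print(df["year"].min(), "through", df["year"].max())
print(df["year"].value_counts().sort_index())

# At what geographic level is the data reported?
print(df["geography_level"].unique())
```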
It can take a while to navigate the variable, geography, and timeframe options in large dataset user interfaces. It can also take time to search through documentation and record information you may need later, such as the question text for surveys and other variable details that may not be obvious in the downloaded data file. Give yourself time, go slowly, and keep good records.
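One way to keep good records is a simple codebook file that lives alongside your data: one row per variable, noting the details that will not appear in the data file itself. The sketch below builds one in Python; the variable names, question text, and sources shown are illustrative placeholders, not drawn from any particular dataset:

```python
import csv

# A running codebook: one row per variable, recording details
# (like survey question text) that won't appear in the data file.
codebook = [
    {"variable": "hincp",
     "question_text": "Household income (past 12 months)",
     "source": "ACS 2022 1-year",
     "notes": "Top-coded; see documentation"},
    {"variable": "tenure",
     "question_text": "Is this house owned or rented?",
     "source": "ACS 2022 1-year",
     "notes": ""},
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "question_text", "source", "notes"])
    writer.writeheader()
    writer.writerows(codebook)
```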
And remember, your librarians are here to help!
The data we want to work with rarely comes ready-to-play. It often needs reorganizing and cleaning, and it is often scattered across multiple places, which means working with multiple files. The work of getting clean data ready for analysis is often called "wrangling." It can involve cleaning data, combining datasets from multiple sources, and normalizing that data so the pieces can work together.
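As a concrete illustration, here is a minimal Python/pandas sketch of one common wrangling step: combining two hypothetical county-level files (county_health.csv and county_population.csv) whose county names are spelled inconsistently, so the join key has to be normalized first. The file and column names are assumptions for the example:

```python
import pandas as pd

# Hypothetical files: county-level health data from one source,
# population data from another.
health = pd.read_csv("county_health.csv")
pop = pd.read_csv("county_population.csv")

# Normalize the join key so the two sources can work together:
# e.g., one file spells counties in uppercase, the other in title case.
health["county"] = health["county"].str.strip().str.title()
pop["county"] = pop["county"].str.strip().str.title()

# Combine the two sources into one table.
merged = health.merge(pop, on="county", how="inner")

# A derived value that needs both sources at once.
merged["cases_per_1000"] = merged["cases"] / merged["population"] * 1000
```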
Before you end up with 4,378 data files with long alphanumeric strings for names in your downloads folder, let's talk about organizing your workflow. Consider creating a dedicated home folder for each project. Title it clearly and use it to store that project's data. Use a consistent, clear file naming system that lets you know at a glance what a file contains.
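For example, a few lines of Python can set up a clearly titled project folder and generate file names that follow a consistent pattern. The folder layout and source-content-date naming scheme below are just one reasonable convention, not a standard:

```python
from pathlib import Path
from datetime import date

# One home folder per project, with subfolders for raw downloads,
# cleaned data, and documentation.
project = Path("~/projects/county-health-2024").expanduser()
for sub in ("raw", "clean", "docs"):
    (project / sub).mkdir(parents=True, exist_ok=True)

# A consistent naming pattern: source, content, and download date,
# so a file's contents are clear at a glance.
filename = f"cdc_county-asthma-rates_{date.today():%Y-%m-%d}.csv"
print(project / "raw" / filename)
```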
Data cleaning is the process of fixing errors, duplications, and inconsistencies in your dataset. The exact process varies from dataset to dataset, but the general principles remain the same: