Many of the sources gathered in this guide are large public datasets that can feel a little overwhelming at first. As you explore them, consider the questions in the "Before you Begin" section, such as:
Does this dataset collect information at the neighborhood, city, county, state, or national level? Will the level of granularity work for your project?
Does this dataset cover the time period you need? Is the data collected frequently enough for your needs? (See the sketch below for a quick way to check once you have a sample file in hand.)
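If you have already downloaded a sample file, a few lines of code can help answer both questions. Here is a minimal sketch in Python using pandas; the file name and column names (acs_sample.csv, year, geography_level) are hypothetical placeholders standing in for whatever your dataset actually uses:

```python
import pandas as pd

# Hypothetical file and column names; substitute your own.
df = pd.read_csv("acs_sample.csv")

# What time period does the data actually cover, and how often
# was it collected?
print(df["year"].min(), "through", df["year"].max())
print(df["year"].value_counts().sort_index())

# At what geographic level is the data reported?
print(df["geography_level"].unique())
```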
It can take a while to navigate the variable, geography, and timeframe options in large dataset user interfaces. It can also take time to search through documentation and record information you may need later, such as the question text for surveys and other variable details that may not be obvious in the downloaded data file. Give yourself time, go slowly, and keep good records.
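One way to keep good records is a simple codebook file that lives alongside your data: one row per variable, noting the details that will not appear in the data file itself. The sketch below builds one in Python; the variable names, question text, and sources shown are illustrative placeholders, not drawn from any particular dataset:

```python
import csv

# A running codebook: one row per variable, recording details
# (like survey question text) that won't appear in the data file.
codebook = [
    {"variable": "hincp",
     "question_text": "Household income (past 12 months)",
     "source": "ACS 2022 1-year",
     "notes": "Top-coded; see documentation"},
    {"variable": "tenure",
     "question_text": "Is this house owned or rented?",
     "source": "ACS 2022 1-year",
     "notes": ""},
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "question_text", "source", "notes"])
    writer.writeheader()
    writer.writerows(codebook)
```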
And remember, your librarians are here to help!
The data we want to work with rarely comes ready-to-play. It often needs reorganizing and cleaning, and it is often scattered across multiple places, which means working with multiple files. The work of getting clean data ready for analysis is often called "wrangling." It can involve cleaning data, combining datasets from multiple sources, and normalizing that data so the pieces can work together.
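As a concrete illustration, here is a minimal Python/pandas sketch of one common wrangling step: combining two hypothetical county-level files (county_health.csv and county_population.csv) whose county names are spelled inconsistently, so the join key has to be normalized first. The file and column names are assumptions for the example:

```python
import pandas as pd

# Hypothetical files: county-level health data from one source,
# population data from another.
health = pd.read_csv("county_health.csv")
pop = pd.read_csv("county_population.csv")

# Normalize the join key so the two sources can work together:
# e.g., one file spells counties in uppercase, the other in title case.
health["county"] = health["county"].str.strip().str.title()
pop["county"] = pop["county"].str.strip().str.title()

# Combine the two sources into one table.
merged = health.merge(pop, on="county", how="inner")

# A derived value that needs both sources at once.
merged["cases_per_1000"] = merged["cases"] / merged["population"] * 1000
```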
Before you end up with 4,378 data files with long alphanumeric strings for names in your downloads folder, let's talk about organizing your workflow. Consider creating a dedicated home folder for each project. Title it clearly and use it to store that project's data. Use a consistent, clear file naming system that lets you know at a glance what a file contains.
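For example, a few lines of Python can set up a clearly titled project folder and generate file names that follow a consistent pattern. The folder layout and source-content-date naming scheme below are just one reasonable convention, not a standard:

```python
from pathlib import Path
from datetime import date

# One home folder per project, with subfolders for raw downloads,
# cleaned data, and documentation.
project = Path("~/projects/county-health-2024").expanduser()
for sub in ("raw", "clean", "docs"):
    (project / sub).mkdir(parents=True, exist_ok=True)

# A consistent naming pattern: source, content, and download date,
# so a file's contents are clear at a glance.
filename = f"cdc_county-asthma-rates_{date.today():%Y-%m-%d}.csv"
print(project / "raw" / filename)
```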
Data cleaning is the process of fixing errors, duplications, and inconsistencies in your dataset. The exact process varies from dataset to dataset, but the general principles remain the same: