Datasets and their Manipulation



Summary

Manipulating datasets in spreadsheet software is essential for transforming raw data into actionable insights. It supports informed decision-making, improves efficiency through automation, organizes data for clarity, and uncovers trends for problem-solving. Spreadsheet manipulation enhances communication by making visualizations and summaries more interpretable. Spreadsheet’s versatility across industries highlights their critical role in ensuring accurate analysis and better outcomes.

Learning Outcomes

  • Modify open access data in spreadsheet softwares for data evaluation
  • Prepare the dataset content in spreadsheet softwares for data visualization
  • Present preliminary relationships with the data using spreadsheet softwares

Downloading Datasets


A common access point for large datasets is an online database. Often, an online database will provide a search interface that allows users to filter a dataset and access or download only the data they need for their research. 

Download one or more of the datasets below in order to become familiar with the process.

Electric vehicle population data: https://catalog.data.gov/dataset/electric-vehicle-population-data

Energy Hourly consumption data: https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption

Greenhouse gas emissions data: https://www.kaggle.com/datasets/unitednations/international-greenhouse-gas-emissions

Bike sharing data: https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Bike%20Sharing

Data Cleaning in Spreadsheet Softwares


The video or instructions below will show how to:

  1. Import data into Google Sheets
  2. Rename columns headers
  3. Delete columns
  4. Combine categories

Reference: [1]

Preparing Data with Spreadsheet Softwares

Often, the dataset you download in its original form doesn’t fit your data analysis needs exactly. You may want to make surface-level changes to this “raw” dataset so that it’s easier to work with. This process is called “cleaning” your data, or preparing it for data analysis.

It is important to remember that cleaning data should not change the values in any way.

  • Renaming column headers

Remember, when naming column headers, we want to avoid using spaces or special characters (! , * are examples of special characters) so that it is easy for the computer to read.

  • Removing unnecessary columns

There may be columns in the dataset that you do not plan to use. You can delete those so that the size of the file is smaller and easier to work with.

  • Combining categories

Sometimes the dataset is more specific than we need.

Exploratory Data Analysis


Google Sheets and Microsoft Excel have a tool called PivotTables that help you easily calculate, summarize, and analyze data. You see comparisons, patterns, and trends in your data and then use those visualizations in reports or papers.

Reference: [2]

Python


Many people use spreadsheet applications like Google Sheets or Microsoft Excel to store, analyze, and visualize data. They are relatively easy to learn and use and have built-in advanced data analysis capabilities. Spreadsheet applications use a graphic user interface with drop-down menus and toolbars, so they can be more familiar to many of us.

Python is an open-source programming language that can also be used for data analysis purposes. There can be a steep learning curve at first, but it is a powerful tool.  Python and other programming languages work well for:

  • Doing research with large datasets
  • Automating or repeating the same processes many times
  • Showing your work so others can reproduce your research

Reference: [3]


References

[1] Cahoon, C. Data Cleaning in Google Sheets. Panopto. https://ncsu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=056353fa-6ab0-4386-99f4-ad42014a2ce7&start=0 (accessed 2025-01-22).

[2] Cahoon, C. Exploratory Data Analysis in Google Sheets. Panopto. https://ncsu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=1e1886b6-ce2b-4687-bcd6-ad4701266857&start=0 (accessed 2025-01-22).

[3] Cahoon, C. Using Google Colabs to work with Python. Panopto. https://ncsu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=1fc37086-39ba-4841-a012-ad440148be27&start=0 (accessed 2025-01-22).