Exercise 6 - AQI Data Preparation#
This lesson was developed with assistance from GPT-4-1106-preview, claude-2.1, and GPT-3.5-turbo.
1. Problem definition#
What is the impact of the COVID-19 lockdown orders on air quality in Miami?
2. Data requirements#
We need to collect daily air quality data for parameters such as PM2.5, PM10, NO2, SO2, and CO in CSV format from 2019 to 2021, that is, one year before and one year after the COVID-19 lockdown orders of April 2020. EPA Air Data contains air quality data collected at outdoor monitors across the US.
3. Data preparation#
Data discovery and profiling: Exploring the content, quality, and structure of the dataset
Data collection: Gathering raw data from various sources for analysis
Data cleaning: Standardizing inconsistent data formats, correcting errors, and handling missing values
Data structuring: Organizing data for efficient storage and retrieval
Data enrichment: Enhancing the dataset with additional information from external sources to improve analysis
Data transformation: Optimizing the content and structure of data for further analysis
Data validation: Ensuring the accuracy, completeness, and reliability of the data
Data integration: Combining data from multiple sources to create a unified dataset for analysis
3.1 Data discovery and profiling#
In this step we explore the content, quality, and structure of the dataset. From this step we can learn the following (a small profiling sketch follows the list).
EPA Air Data contains air quality data collected at outdoor monitors across the US.
For this exercise we want to use Pre-Generated Data Files - Daily Data.
For each year, each air quality parameter (e.g., NO2, SO2, CO) is recorded in a separate CSV file.
The dataset has 16 air quality parameters.
For three years, there are 48 CSV files (16 parameters × 3 years), each about 60 MB, which is approximately 3 GB of data.
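As a minimal profiling sketch (assuming one daily file, for example daily_44201_2019.csv, has already been downloaded and unzipped; the file name is only an example), we can peek at the available columns with pandas:
import pandas as pd

# Read only the first few rows of one daily file to profile its structure.
sample = pd.read_csv('daily_44201_2019.csv', nrows=5)

print(sample.columns.tolist())  # column names available in the daily files
print(sample.head())            # a quick look at the first few records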
3.2 Data collection#
In this step we want to gather raw data for analysis. The breakdown of this step is as follows.
We need to collect data for 16 air quality parameters for three years from Pre-Generated Data Files - Daily Data, that is, 48 files.
Each file is in CSV format and zipped.
We can use Python tools for making HTTP requests, such as the requests package or urllib.request.
We can use Python tools for unzipping files, such as the zipfile package.
Each file is larger than 60 MB, so downloading and processing the raw data can take a long time (a download-and-unzip sketch follows this list).
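A minimal sketch of this download-and-unzip pattern, assuming the example file daily_44201_2023.zip from the EPA address introduced below and using only the standard library:
import zipfile
from urllib.request import urlretrieve

base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
filename = 'daily_44201_2023.zip'   # example: Ozone data for 2023

# Download the zip archive, then extract the csv file it contains.
urlretrieve(base_url + filename, filename)
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall()            # produces daily_44201_2023.csv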
3.3 Data cleaning#
This step involves standardizing inconsistent data formats, correcting errors, and handling missing values. For this step we will do the following:
For this EPA data, cleaning is minimal.
We will record the files that have no records for our study area and period (a small bookkeeping sketch follows this list).
We do not need to do anything about missing data at this stage.
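As a minimal sketch of that bookkeeping (the empty_files list is a hypothetical helper name; the file name and CBSA value follow the examples used later in this lesson):
import pandas as pd

CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
empty_files = []                    # hypothetical list of files with no records for the study area

filename = 'daily_44201_2019.csv'   # example file, assumed already downloaded and unzipped
raw_data = pd.read_csv(filename, low_memory=False)

# If the file has no rows for the study region, record its name for later reference.
if (raw_data['CBSA Name'] == CBSA_Name_Value).sum() == 0:
    empty_files.append(filename)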
3.4 Data structuring#
This step involves organizing and formatting the data in a way that is optimal for storage and retrieval. For this step we need to do the following tasks.
To reduce storage space, we will filter out rows outside our study region.
We will only keep the columns relevant for our record and analysis, that is: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’ (a minimal filtering sketch follows the note below).
This will reduce the storage space per file from ~60 MB to ~0.4 MB.
Note: A core-based statistical area (CBSA) is a geographic region of the U.S. as defined by the Office of Management and Budget. A CBSA is “one or more adjacent counties or county equivalents that have at least one urban core area of at least 10,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties” (Wikipedia).
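A minimal sketch of this filtering for one daily file (the file name is an example; the column names are those used throughout this lesson):
import pandas as pd

CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
keep_columns = ['County Code', 'Parameter Name', 'Date Local',
                'Units of Measure', 'Arithmetic Mean', 'CBSA Name']

raw_data = pd.read_csv('daily_44201_2019.csv', low_memory=False)   # example file

# Keep only rows inside the study region and only the relevant columns.
raw_data = raw_data[raw_data['CBSA Name'] == CBSA_Name_Value]
raw_data = raw_data[keep_columns]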
After conducting an initial data exploration and profiling, we can formulate a prompt for our language model to assist with data collection, cleaning, and structuring. Here is a prompt:
Python code with rich annotations:
This code collects air quality data in a region
for one year before and one year after the lockdown order on April 1, 2020.
These are the FileIDs of the air quality data that we want to download:
| Category | Parameter | FileID | Unit | Records |
|---|---|---|---|---|
| Criteria Gases | Ozone | 44201 | na | 0 |
| Criteria Gases | SO2 | 42401 | na | 0 |
| Criteria Gases | CO | 42101 | na | 0 |
| Criteria Gases | NO2 | 42602 | na | 0 |
| Particulates | PM2.5 FRM/FEM Mass | 88101 | na | 0 |
| Particulates | PM2.5 non FRM/FEM Mass | 88502 | na | 0 |
| Particulates | PM10 Mass | 81102 | na | 0 |
| Particulates | PMc Mass | 86101 | na | 0 |
| Meteorological | Winds | WIND | na | 0 |
| Meteorological | Temperature | TEMP | na | 0 |
| Meteorological | Barometric Pressure | PRESS | na | 0 |
| Meteorological | RH and Dewpoint | RH_DP | na | 0 |
| Toxics, Precursors, and Lead | HAPs | HAPS | na | 0 |
| Toxics, Precursors, and Lead | VOCs | VOCS | na | 0 |
| Toxics, Precursors, and Lead | NONOxNOy | NONOxNOy | na | 0 |
| Toxics, Precursors, and Lead | Lead | LEAD | na | 0 |
The daily data file is named ‘daily_FileID_Year.zip’.
For example, ‘daily_44201_2023.zip’ has Ozone data for 2023
The file address is: https://aqs.epa.gov/aqsweb/airdata/daily_44201_2023.zip
For example,
filename = 'daily_44201_2023.zip'
base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
download_from = base_url + filename
will be the address of the Ozone data for 2023.
Each zip file contains a csv file. For example, ‘daily_44201_2023.zip’ contains ‘daily_44201_2023.csv’
Additional information
# These files will be downloaded to:
save_to = 'Data/L2_Structured'
# Region of interest
region = 'Miami'
# CBSA Name of the region of interest
CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
Write a Python code to:
Use the information in the table above to create an Attributes list or dictionary for ‘Category’, ‘Parameter’, ‘FileID’, ‘Unit’, ‘Records’
Define variables region and CBSA_Name_Value
Count each download with variable name download_count
Get files for years 2019, 2020, 2021 for all the FileIDs
Show a message about download progress after each file as download_count : year : parameter : filename
Unzip each file
Read the csv file with Pandas as a raw_data DataFrame
Delete the zip file ‘daily_FileID_Year.zip’ only if it exists, and if the file is open, close it. For example, ‘daily_44201_2023.zip’
Delete the csv file ‘daily_FileID_Year.csv’ only if it exists, and if the file is open, close it. For example, ‘daily_44201_2023.csv’
In raw_data, address this warning: “Columns (13) have mixed types. Specify dtype option on import or set low_memory=False”.
In Attributes, update ‘Unit’ with the first value in column ‘Units of Measure’ in raw_data. For example, for FileID 44201, ‘na’ will be replaced with ‘Parts per million’. If the first value does not exist, update ‘Unit’ in the Attributes list with ‘empty_file’
In raw_data, keep only rows where the value of column ‘CBSA Name’ is equal to CBSA_Name_Value
In raw_data, keep only columns with labels: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’
In raw_data, convert the ‘Date Local’ column to datetime format, set it as the index, and sort by index
In Attributes, update ‘Records’ with the number of rows in raw_data plus the current value of ‘Records’
Save raw_data at the save_to location with file name ‘daily_{FileID}_{year}_{region}.csv’, and create the folder if it does not exist
At the end, from Attributes create a df_Attributes DataFrame
Save df_Attributes at save_to as ‘Attributes.csv’
Display the df_Attributes DataFrame
Annotate the code with detailed and rich annotations describing each task and sub-task inside and outside the loop
Download = False  # Change to True if you want to download the data
#ChatGPT 3.5 Turbo Code
import pandas as pd
import os
import zipfile
from urllib.request import urlretrieve
# 1. Define Attributes dictionary for air quality data
Attributes = {
'Category': ['Criteria Gases', 'Criteria Gases', 'Criteria Gases', 'Criteria Gases',
'Particulates', 'Particulates', 'Particulates', 'Particulates',
'Meteorological', 'Meteorological', 'Meteorological', 'Meteorological',
'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead',
'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead'],
'Parameter': ['Ozone', 'SO2', 'CO', 'NO2', 'PM2.5 FRM/FEM Mass', 'PM2.5 non FRM/FEM Mass',
'PM10 Mass', 'PMc Mass', 'Winds', 'Temperature', 'Barometric Pressure',
'RH and Dewpoint', 'HAPs', 'VOCs', 'NONOxNOy', 'Lead'],
'FileID': [44201, 42401, 42101, 42602, 88101, 88502, 81102, 86101, 'WIND', 'TEMP',
'PRESS', 'RH_DP', 'HAPS', 'VOCS', 'NONOxNOy', 'LEAD'],
'Unit': ['na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na'],
'Records': [0] * 16
}
# 2. Define region and CBSA_Name_Value
region = 'Miami'
CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
# 3. Initialize download_count
download_count = 0
# 4. Loop through each year and download files (runs only when the Download flag above is True)
if not Download:
    raise SystemExit('Set Download = True to download and process the raw data.')
for year in range(2019, 2022):
for i, file_id in enumerate(Attributes['FileID']):
filename = f'daily_{file_id}_{year}.zip'
base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
download_from = base_url + filename
# 5. Show download progress
download_count += 1
print(f"Download {download_count}: {year} : {Attributes['Parameter'][i]} : {filename}")
# Download file
urlretrieve(download_from, filename)
# 6. Unzip file
with zipfile.ZipFile(filename, 'r') as zip_ref:
zip_ref.extractall()
# 7. Read csv file
raw_data = pd.read_csv(f'daily_{file_id}_{year}.csv', low_memory=False)
# 8. Delete zip file if exists
if os.path.exists(filename):
os.remove(filename)
# 9. Delete csv file if exists
if os.path.exists(f'daily_{file_id}_{year}.csv'):
os.remove(f'daily_{file_id}_{year}.csv')
# 10. Handle warning in raw_data
pd.set_option('mode.chained_assignment', None)
# 11. Update 'Unit' in Attributes
Attributes['Unit'][i] = raw_data['Units of Measure'][0] if len(raw_data) > 0 else 'empty_file'
# 12. Filter by CBSA_Name_Value
raw_data = raw_data[raw_data['CBSA Name'] == CBSA_Name_Value]
# 13. Keep required columns
raw_data = raw_data[['County Code', 'Parameter Name', 'Date Local', 'Units of Measure',
'Arithmetic Mean', 'CBSA Name']]
# 14. Convert 'Date Local' to datetime and set as index
raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
raw_data.set_index('Date Local', inplace=True)
raw_data.sort_index(inplace=True)
# 15. Update 'Records' in Attributes
Attributes['Records'][i] += len(raw_data)
# 16. Save raw_data
save_folder = os.path.join('Data', 'L2_Structured')
if not os.path.exists(save_folder):
os.makedirs(save_folder)
raw_data.to_csv(os.path.join(save_folder, f'daily_{file_id}_{year}_{region}.csv'))
# 17. Create DataFrame from Attributes
df_Attributes = pd.DataFrame(Attributes)
# 18. Save df_Attributes
df_Attributes.to_csv(os.path.join(save_folder, 'Attributes.csv'))
# 19. Display df_Attributes
display(df_Attributes)
3.5 Data integration#
This step involves combining data from multiple sources to create a unified dataset for analysis.
We need to combine our data into a 2D array that has our 16 parameters (also known as features or variables) for the whole study period.
We can give the code outline below to our language model to complete.
region = 'Miami'
# read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame
# create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31
#and column names are the values of 'FileID' column in `df_Attributes`
# loop from year 2019 to 2021
#Loop for each value in 'FileID'
#Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame
#and set column 'Date Local' as datetime index
#resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`
#Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data` with the same fileID
#Dates must match. For example, the value in 2019-01-01 in resampled_arithmetic_mean would be copied to
#the first row of `data` with index equal to 2019-01-01
# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create the folder if it does not exist
#display `data` DataFrame using `display()` not `print()`
# Add rich code annotation
#code only. no introduction and conclusion.
#ChatGPT 3.5 Turbo Code
import pandas as pd
import os
region = 'Miami'
# Read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame
df_Attributes = pd.read_csv('Data/L2_Structured/Attributes.csv')
# Create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31
# and with columns such that the column labels are the 'FileID' column in `df_Attributes`
start_date = '2019-01-01'
end_date = '2021-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
data = pd.DataFrame(index=date_range, columns=df_Attributes['FileID'])
# Loop from year 2019 to 2021
for year in range(2019, 2022):
# Loop for each value in 'FileID'
for fileID in df_Attributes['FileID']:
# Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame
file_path = f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv'
if os.path.exists(file_path):
raw_data = pd.read_csv(file_path)
# Set column 'Date Local' as datetime index
raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
raw_data.set_index('Date Local', inplace=True)
# Resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`
resampled_arithmetic_mean = raw_data['Arithmetic Mean'].resample('D').max()
# Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data`
# with the same fileID. Dates must match.
data.loc[resampled_arithmetic_mean.index, fileID] = resampled_arithmetic_mean.values
# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create folder if it does not exist
output_folder = 'Data/L3_Integrated'
os.makedirs(output_folder, exist_ok=True)
data.to_csv(os.path.join(output_folder, f'data_{region}.csv'))
# Display `data` DataFrame using `display()`
display(data)
FileID | 44201 | 42401 | 42101 | 42602 | 88101 | 88502 | 81102 | 86101 | WIND | TEMP | PRESS | RH_DP | HAPS | VOCS | NONOxNOy | LEAD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-01-01 | 0.035412 | 0.6625 | 0.477667 | 8.416667 | 7.158333 | 53.795833 | 18.0 | NaN | 258.291667 | 73.0 | NaN | 86.833333 | NaN | NaN | 18.820833 | NaN |
2019-01-02 | 0.029471 | 0.754167 | 0.55 | 18.217647 | 6.404348 | 5.125 | 18.0 | NaN | 262.916667 | 74.75 | NaN | 82.958333 | NaN | NaN | 49.576471 | NaN |
2019-01-03 | 0.026882 | 0.725 | 0.56725 | 18.270833 | 9.929167 | 6.979167 | 22.0 | 10.1 | 307.708333 | 75.291667 | NaN | 87.375 | 0.72 | 2.76 | 50.520833 | NaN |
2019-01-04 | 0.031882 | 0.740909 | 0.533333 | 18.1625 | 12.4 | 10.7875 | 28.0 | NaN | 155.083333 | 72.875 | NaN | 88.166667 | NaN | NaN | 46.6125 | NaN |
2019-01-05 | 0.034765 | 0.683333 | 0.408333 | 17.983333 | 12.016667 | 11.370833 | 24.0 | NaN | 100.166667 | 70.541667 | NaN | 82.75 | NaN | NaN | 32.495833 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2021-12-27 | 0.042882 | 0.805882 | 0.7375 | 24.579167 | 9.929167 | 8.045833 | 20 | 6.970833 | 202.208333 | 70.208333 | NaN | 84.958333 | 0.01 | NaN | 42.8375 | NaN |
2021-12-28 | 0.040647 | 0.683333 | 0.816667 | 27.979167 | 8.865217 | 7.663158 | 21 | 6.904167 | 254.875 | 70.166667 | NaN | 87.625 | NaN | NaN | 85.0875 | NaN |
2021-12-29 | 0.034059 | 0.691667 | 0.665375 | 24.183333 | 8.183333 | 6.970833 | 17 | 7.8125 | 269.5 | 71.958333 | NaN | 88.25 | NaN | NaN | 77.915 | NaN |
2021-12-30 | 0.025706 | 0.679167 | 0.804167 | 23.5375 | 8.320833 | 6.708333 | 16 | 7.066667 | 268.958333 | 73.5 | NaN | 89.75 | 0.007 | NaN | 83.466667 | NaN |
2021-12-31 | 0.02925 | 0.5375 | 0.717125 | 15.479167 | 12.591667 | 16.475 | 14 | 6.170833 | 261.416667 | 73.583333 | NaN | 88.416667 | NaN | NaN | 41.7125 | NaN |
1096 rows × 16 columns
3.6 Data validation#
Data validation includes ensuring the accuracy, completeness, and reliability of the data. Here we will simply cross-reference our collected data in the data DataFrame with the data in the raw data files to ensure that we did not make any mistakes. We can display any rows and columns of the data DataFrame and cross-check them against the raw data and the csv files. Here is one example.
columns_to_show = ['86101', 'WIND','TEMP']
data.loc['2019-12-30':'2020-01-01', columns_to_show]
FileID | 86101 | WIND | TEMP |
---|---|---|---|
2019-12-30 | 10.45 | 220.916667 | 75.833333 |
2019-12-31 | 6.995833 | 170.458333 | 71.333333 |
2020-01-01 | 6.233333 | 200.875 | 68.333333 |
We can also use the JupyterLab variableInspector extension to get an overview of the data before additional preprocessing:
review each parameter in the CSV file to ensure there are no errors.
note that the parameters “NONOxNOy”, “VOCs”, and “HAPs” need special handling due to their multiple-parameter nature.
acknowledge this issue for the current analysis. No further action is needed, as these parameters won’t be used in our analysis.
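As a minimal sketch of a programmatic cross-check (assuming the structured file daily_86101_2019_Miami.csv produced above exists and the integrated data DataFrame from the previous step is still in memory):
import pandas as pd

# Recompute the daily maximum of 'Arithmetic Mean' from one structured file.
check = pd.read_csv('Data/L2_Structured/daily_86101_2019_Miami.csv',
                    index_col='Date Local', parse_dates=True)
daily_max = check['Arithmetic Mean'].resample('D').max()

# Compare one date against the integrated DataFrame (column labels are FileIDs).
print(daily_max.loc['2019-12-30'])       # value recomputed from the structured file
print(data.loc['2019-12-30', '86101'])   # value stored in the integrated `data` DataFrame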
3.7 Data enrichment#
This step involves enhancing the dataset with additional information from external sources to improve analysis.
No data enrichment is required at this stage.
3.8 Data transformation#
This step involves converting data into a form suitable for analysis. Data enrichment and data transformation are closely related but distinct from data structuring. While data structuring primarily involves data wrangling for efficient storage and retrieval, data transformation focuses on preparing data for optimal analysis.
Our analysis will focus on PM2.5, PM10, NO2, SO2, CO, and Ozone. The table below summarizes the collected parameters, their units, and the number of records.
| Parameter | FileID | Unit | Records |
|---|---|---|---|
| PM2.5 FRM/FEM Mass | 88101 | Micrograms/cubic meter (LC) | 14924 |
| PM2.5 non FRM/FEM Mass | 88502 | Micrograms/cubic meter (LC) | 7987 |
| PM10 Mass | 81102 | Micrograms/cubic meter (25 C) | 4130 |
| NO2 | 42602 | Parts per billion | 5056 |
| SO2 | 42401 | Parts per billion | 4253 |
| CO | 42101 | Parts per million | 6256 |
| Ozone | 44201 | Parts per million | 8420 |
| Winds | WIND | Knots | 2192 |
| Temperature | TEMP | Degrees Fahrenheit | 1796 |
| RH and Dewpoint | RH_DP | Percent relative humidity | 1787 |
| HAPs | HAPS | Micrograms/cubic meter (LC) | 6536 |
| VOCs | VOCS | Parts per billion Carbon | 7433 |
To streamline this process, we will:
Combine ‘PM2.5 FRM/FEM Mass’ and ‘PM2.5 non FRM/FEM Mass’ for both samplers into one ‘PM2.5’ set.
Select parameters: PM2.5, PM10, NO2, SO2, CO, and Ozone.
Rename column labels from FileIDs to parameter names for clarity (e.g., ‘81102’ to ‘PM10’).
Convert the DataFrame to a NumPy representation using .values for analysis with NumPy.
Save the data as a NumPy array for analysis.
We can ask our AI code assistant to do this.
Python code with rich annotations
region = 'Miami'
Read ‘data_{region}.csv’ as a data DataFrame with datetime index from ‘Data/L3_Integrated’
In data, create column ‘PM2.5’ which is the mean of columns ‘88101’ and ‘88502’
Create a new DataFrame aqi_data as a copy of data which shall include the following columns from data
columns_from_data= [
'PM2.5', #PM2.5
'81102', #PM10
'42602', #NO2
'42401', #SO2
'42101', #CO
'44201', #O3
]
In the aqi_data DataFrame, rename column labels as follows:
'PM2.5': 'PM2.5'
'81102': 'PM10'
'42602': 'NO2'
'42401': 'SO2'
'42101': 'CO'
'44201': 'O3'
In aqi_data, get a NumPy representation of the DataFrame using the .values method
To be later opened by NumPy, save the aqi_data DataFrame as ‘aqi_data_{region}.csv’ at ‘Data/L4_Transformed’ and create the folder if it does not exist
import pandas as pd
import os
region = 'Miami'
# 1. Read 'data_{}.csv' as `data` DataFrame with datetime index from 'Data/L3_Integrated'
data = pd.read_csv(f'Data/L3_Integrated/data_{region}.csv', index_col=0, parse_dates=True)
# 2. In `data`, create column 'PM2.5' which is mean of columns '88101', '88502'
data['PM2.5'] = data[['88101', '88502']].mean(axis=1)
# 3. Create a new DataFrame `aqi_data` as a copy of `data` which shall include specific columns
columns_from_data = ['PM2.5', '81102', '42602', '42401', '42101', '44201']
aqi_data = data[columns_from_data].copy()
# 4. Rename columns labels in `aqi_data` DataFrame
aqi_data.columns = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
# 5. Get a NumPy representation of `aqi_data` DataFrame using `.values` method
aqi_data_values = aqi_data.values
# 6. Save the `aqi_data` DataFrame as 'aqi_data_{region}.csv' in 'Data/L4_Transformed' and create folder if it does not exist
os.makedirs('Data/L4_Transformed', exist_ok=True)
aqi_data.to_csv(f'Data/L4_Transformed/aqi_data_{region}.csv')
After this transformation, our data is prepared for analysis. The next step is to utilize NumPy to calculate the air quality index before and after the lockdown.
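As a minimal sketch of how the saved file could later be opened with NumPy (assuming the path produced above; the six parameter columns follow the date index column in the saved CSV):
import numpy as np

# Load the transformed data as a NumPy array: skip the header row and the
# first (date) column, keeping the six parameter columns.
aqi_array = np.genfromtxt('Data/L4_Transformed/aqi_data_Miami.csv',
                          delimiter=',', skip_header=1, usecols=range(1, 7))

print(aqi_array.shape)   # (days, parameters); empty cells are read as NaN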
4. Summary#
| Step(s) | File Name | Number of files | File size | Data size | Comments |
|---|---|---|---|---|---|
| 1) Data discovery and profiling | ‘daily_{FileID}_{year}.csv’ | 48 | ~60 MB | ~3 GB | 16 parameters and 3 years |
| 2) Data collection, cleaning, and structuring | ‘daily_{FileID}_{year}_Miami.csv’ | 48 | ~0.4 MB | ~10 MB | 16 parameters and 3 years for Miami and selected columns |
| | ‘attributes.csv’ | 1 | 2 KB | | Parameter details and file information |
| 5) Data integration | ‘data_Miami.csv’ | 1 | 125 KB | 125 KB | Integrated data with 16 parameters |
| 6) Data validation | ‘data_Miami.csv’ | | | | NONOxNOy, HAPs and VOCs columns need additional treatment |
| 7) Data enrichment | ‘data_Miami.csv’ | | | | No data enrichment |
| 8) Data transformation | ‘aqi_data_Miami.csv’ | 1 | 67 KB | 67 KB | Data for AQI analysis with NumPy representation |