Exercise 6 - AQI Data Preparation#

This lesson was developed with assistance from GPT-4-1106-preview, claude-2.1, and GPT-3.5-turbo.


1. Problem definition#

What is the impact of COVID-19 lockdown orders on air quality in Miami?

2. Data requirements#

We need to collect daily air quality data for parameters such as PM2.5, PM10, NO2, SO2, and CO in csv format from 2019 to 2021, that is, one year before and one year after the COVID-19 lockdown orders of April 2020. EPA Air Data contains air quality data collected at outdoor monitors across the US.

3. Data preparation#

  • Data discovery and profiling: Exploring the content, quality, and structure of the dataset

  • Data collection: Gathering raw data from various sources for analysis

  • Data cleaning: Standardizing inconsistent data formats, correcting errors, and handling missing values

  • Data structuring: Organizing data for efficient storage and retrieval

  • Data enrichment: Enhancing dataset with additional information from external sources to improve analysis

  • Data transformation: Optimizing the content and structure of data for further analysis

  • Data validation: Ensuring the accuracy, completeness, and reliability of the data

  • Data integration: Combining data from multiple sources to create a unified dataset for analysis

3.1 Data discovery and profiling#

In this step we are exploring the content, quality, and structure of the dataset. From this step we can learn the following.

  • EPA Air Data contains air quality data collected at outdoor monitors across the US.

  • For this exercise we want to use Pre-Generated Data Files - Daily Data.

  • For each year, each air quality parameter (e.g., NO2, SO2, and CO) is recorded in its own csv file.

  • The dataset has 16 air quality parameters.

  • For three years, there are 48 csv files, each about 60 MB. That is approximately 3 GB of data.

3.2 Data collection#

In this step we gather the raw data for analysis. The breakdown of this step is as follows.

  • We need to collect data for 16 air quality parameters for three years from Pre-Generated Data Files - Daily Data, that is, 48 files

  • Each file is in csv format and zipped

  • We can use Python tools for making HTTP requests, such as the requests package

  • We can use Python tools for unzipping files, such as the zipfile package (see the minimal sketch after this list)

  • Each file is larger than 60 MB, so downloading and processing the raw data can take a long time
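
Before prompting for the full pipeline, we can test the approach on a single file. Below is a minimal sketch, assuming the requests package is installed; the URL pattern follows the EPA Pre-Generated Data Files naming described later in this section, and the file name is only an example.

# Minimal sketch: download one pre-generated daily file and extract its csv.
# The file name is an example (Ozone, 2019); network access is required.
import io
import zipfile

import requests

base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
filename = 'daily_44201_2019.zip'

response = requests.get(base_url + filename, timeout=120)
response.raise_for_status()  # stop early if the download failed

# Extract the csv contained in the zip archive into the current folder
with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
    zip_ref.extractall()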

3.3 Data cleaning#

This step involves standardizing inconsistent data formats, correcting errors, and handling missing values. For this step we will do the following:

  • For this EPA data, cleaning is minimal

  • We will record the files that have no records for our study area and period (see the sketch after this list)

  • We do not need to do anything about missing data at this stage
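
A minimal sketch of how an empty file could be flagged, assuming the csv has already been downloaded and unzipped as in the collection step; the helper function and the empty_files list are illustrative, not part of the EPA data or the generated code below.

# Minimal sketch: record files that have no rows for the study region.
import pandas as pd

def has_no_region_records(csv_path, cbsa_name):
    """Return True when a daily file has no rows for the given CBSA Name."""
    raw_data = pd.read_csv(csv_path, usecols=['CBSA Name'], low_memory=False)
    return raw_data[raw_data['CBSA Name'] == cbsa_name].empty

# Illustrative bookkeeping while looping over the downloaded files
empty_files = []
csv_path = 'daily_44201_2019.csv'  # produced by the download/unzip step
if has_no_region_records(csv_path, 'Miami-Fort Lauderdale-West Palm Beach, FL'):
    empty_files.append(csv_path)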

3.4 Data structuring#

This step involves organizing and formatting the data in a way that is optimal for storage and retrieval. For this step we need to do the following tasks.

  • To reduce storage space, we will filter out rows outside our study region

  • We will only keep the columns relevant for our record and analysis: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’ (a minimal sketch follows the note below)

  • This will reduce the storage space per file from ~60 MB to ~0.4 MB.

Note: A core-based statistical area (CBSA) is a geographic region of the U.S. as defined by the Office of Management and Budget. A CBSA is “one or more adjacent counties or county equivalents that have at least one urban core area of at least 10,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties” (Wikipedia).
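
The prompt and generated code below handle all 48 files in one loop; as a standalone illustration, a minimal sketch of this structuring step for a single downloaded file could look like this (file names are examples):

# Minimal sketch: keep only the rows for the study region and the relevant columns.
import pandas as pd

cbsa_name = 'Miami-Fort Lauderdale-West Palm Beach, FL'
keep_columns = ['County Code', 'Parameter Name', 'Date Local',
                'Units of Measure', 'Arithmetic Mean', 'CBSA Name']

raw_data = pd.read_csv('daily_44201_2019.csv', low_memory=False)
structured = raw_data.loc[raw_data['CBSA Name'] == cbsa_name, keep_columns]
structured.to_csv('daily_44201_2019_Miami.csv', index=False)  # ~0.4 MB instead of ~60 MB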

After conducting an initial data exploration and profiling, we can formulate a prompt for our language model to assist with data collection, cleaning, and structuring. Here is the prompt:

Python code with rich annotations:
This code collects air quality data in a region for one year before and one year after the lockdown order on April 1, 2020.

These are the FileIDs of the air quality data that we want to download:

| Category | Parameter | FileID | Unit | Records |
|---|---|---|---|---|
| Criteria Gases | Ozone | 44201 | na | 0 |
| Criteria Gases | SO2 | 42401 | na | 0 |
| Criteria Gases | CO | 42101 | na | 0 |
| Criteria Gases | NO2 | 42602 | na | 0 |
| Particulates | PM2.5 FRM/FEM Mass | 88101 | na | 0 |
| Particulates | PM2.5 non FRM/FEM Mass | 88502 | na | 0 |
| Particulates | PM10 Mass | 81102 | na | 0 |
| Particulates | PMc Mass | 86101 | na | 0 |
| Meteorological | Winds | WIND | na | 0 |
| Meteorological | Temperature | TEMP | na | 0 |
| Meteorological | Barometric Pressure | PRESS | na | 0 |
| Meteorological | RH and Dewpoint | RH_DP | na | 0 |
| Toxics, Precursors, and Lead | HAPs | HAPS | na | 0 |
| Toxics, Precursors, and Lead | VOCs | VOCS | na | 0 |
| Toxics, Precursors, and Lead | NONOxNOy | NONOxNOy | na | 0 |
| Toxics, Precursors, and Lead | Lead | LEAD | na | 0 |

The daily data file is named ‘daily_FileID_Year.zip’.
For example, ‘daily_44201_2023.zip’ has Ozone data for 2023

The file address is: https://aqs.epa.gov/aqsweb/airdata/daily_44201_2023.zip
For example,

filename = 'daily_44201_2023.zip' 
base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
download_from = base_url + filename

will be the address of the Ozone data for 2023.

Each zip file contains a csv file. For example, ‘daily_44201_2023.zip’ contains ‘daily_44201_2023.csv’

Additional information

#These files will be downloaded to:
save_to = 'Data/L2_Structured'

# Region of interest
region = 'Miami'

# CBSA Name of the region of interest 
CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'

Write a Python code to:

  1. Use the information in the table above to create an Attributes list or dictionary for ‘Category’, ‘Parameter’, ‘FileID’, ‘Unit’, ‘Records’

  2. Define variables region and CBSA_Name_Value

  3. Count each download with variable name download_count

  4. Get files for years 2019, 2020, 2021 for all the FileIDs

  5. Show a message about download progress after each file as download_count : year : parameter : filename

  6. Unzip each file

  7. Read csv file with Pandas as raw_data DataFrame

  8. Delete the zip file ‘daily_FileID_Year.zip’ only if it exists; if the file is open, close it first. For example, ‘daily_44201_2023.zip’

  9. Delete the csv file ‘daily_FileID_Year.csv’ only if it exists; if the file is open, close it first. For example, ‘daily_44201_2023.csv’

  10. In raw_data, address this warning: “Columns (13) have mixed types. Specify dtype option on import or set low_memory=False”.

  11. In Attributes, update ‘Unit’ with the first value in column ‘Units of Measure’ in raw_data. For example, for fileid = 44201, ‘na’ will be replaced with ‘Parts per million’. If the first value does not exist, update ‘Unit’ in the Attributes list with ‘empty_file’

  12. In raw_data, keep only rows where the value of Column ‘CBSA Name’ is equal to CBSA_Name_Value

  13. In raw_data, keep only columns with labels: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’

  14. In raw_data, convert ‘Date Local’ column to datetime format, set as index, and sort by index

  15. In Attributes, update ‘Records’ with the number of rows in the raw_data plus the current value of ‘Records’

  16. Save raw_data at the save_to location with file name ‘daily_{FileID}_{year}_{region}.csv’, and create the folder if it does not exist

  17. At the end, create the df_Attributes DataFrame from Attributes

  18. Save df_Attributes at save_to as ‘Attributes.csv’

  19. Display df_Attributes DataFrame

  20. Annotate code with detailed and rich annotations describing each task and sub-task inside and outside the loop

Download = False  # Change to True if you want to download the data

if Download: 
    #ChatGPT 3.5 Turbo Code    
    import pandas as pd
    import os
    import zipfile
    from urllib.request import urlretrieve
    
    # 1. Define Attributes dictionary for air quality data
    Attributes = {
        'Category': ['Criteria Gases', 'Criteria Gases', 'Criteria Gases', 'Criteria Gases',
                     'Particulates', 'Particulates', 'Particulates', 'Particulates',
                     'Meteorological', 'Meteorological', 'Meteorological', 'Meteorological',
                     'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead',
                     'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead'],
        'Parameter': ['Ozone', 'SO2', 'CO', 'NO2', 'PM2.5 FRM/FEM Mass', 'PM2.5 non FRM/FEM Mass',
                      'PM10 Mass', 'PMc Mass', 'Winds', 'Temperature', 'Barometric Pressure',
                      'RH and Dewpoint', 'HAPs', 'VOCs', 'NONOxNOy', 'Lead'],
        'FileID': [44201, 42401, 42101, 42602, 88101, 88502, 81102, 86101, 'WIND', 'TEMP',
                   'PRESS', 'RH_DP', 'HAPS', 'VOCS', 'NONOxNOy', 'LEAD'],
        'Unit': ['na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na'],
        'Records': [0] * 16
    }
    
    # 2. Define region and CBSA_Name_Value
    region = 'Miami'
    CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
    
    # 3. Initialize download_count
    download_count = 0
    
    # 4. Loop through each year and download files
    for year in range(2019, 2022):
        for i, file_id in enumerate(Attributes['FileID']):
            filename = f'daily_{file_id}_{year}.zip'
            base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
            download_from = base_url + filename
            
            # 5. Show download progress
            download_count += 1
            print(f"Download {download_count}: {year} : {Attributes['Parameter'][i]} : {filename}")
            
            # Download file
            urlretrieve(download_from, filename)
            
            # 6. Unzip file
            with zipfile.ZipFile(filename, 'r') as zip_ref:
                zip_ref.extractall()
            
            # 7. Read csv file
            raw_data = pd.read_csv(f'daily_{file_id}_{year}.csv', low_memory=False)
            
            # 8. Delete zip file if exists
            if os.path.exists(filename):
                os.remove(filename)
            
            # 9. Delete csv file if exists
            if os.path.exists(f'daily_{file_id}_{year}.csv'):
                os.remove(f'daily_{file_id}_{year}.csv')
            
            # 10. Handle warning in raw_data
            pd.set_option('mode.chained_assignment', None)
            
            # 11. Update 'Unit' in Attributes
            Attributes['Unit'][i] = raw_data['Units of Measure'][0] if len(raw_data) > 0 else 'empty_file'
            
            # 12. Filter by CBSA_Name_Value
            raw_data = raw_data[raw_data['CBSA Name'] == CBSA_Name_Value]
            
            # 13. Keep required columns
            raw_data = raw_data[['County Code', 'Parameter Name', 'Date Local', 'Units of Measure',
                                 'Arithmetic Mean', 'CBSA Name']]
            
            # 14. Convert 'Date Local' to datetime and set as index
            raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
            raw_data.set_index('Date Local', inplace=True)
            raw_data.sort_index(inplace=True)
            
            # 15. Update 'Records' in Attributes
            Attributes['Records'][i] += len(raw_data)
            
            # 16. Save raw_data
            save_folder = os.path.join('Data', 'L2_Structured')
            if not os.path.exists(save_folder):
                os.makedirs(save_folder)
            raw_data.to_csv(os.path.join(save_folder, f'daily_{file_id}_{year}_{region}.csv'))
            
    
    # 17. Create DataFrame from Attributes
    df_Attributes = pd.DataFrame(Attributes)
    
    # 18. Save df_Attributes
    df_Attributes.to_csv(os.path.join(save_folder, 'Attributes.csv'))
    
    # 19. Display df_Attributes
    display(df_Attributes)

3.5 Data integration#

This step involves combining data from multiple sources to create a unified dataset for analysis.

  • We need to combine our data into a 2D array that has our 16 parameters (also known as features or variables) for the whole study period.

We can give the code outline below to our LM to complete.

region = 'Miami'

# read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame

# create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31 
#and column names are the values of 'FileID' column in `df_Attributes`
 
# loop from year 2019 to 2021
    #Loop for each value in 'FileID'
        #Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame 
        #and set column 'Date Local' as datetime index

       #resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`

       #Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data` with the same fileID
       #Dates must match. For example, the value at 2019-01-01 in resampled_arithmetic_mean would be copied to 
       #the first row of `data` with index equal to 2019-01-01 

# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create the folder if it does not exist

#display `data` DataFrame using  `display()` not `print()`

# Add rich code annotation

#code only. no introduction and conclusion.
#ChatGPT 3.5 Turbo Code 
import pandas as pd
import os

region = 'Miami'

# Read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame
df_Attributes = pd.read_csv('Data/L2_Structured/Attributes.csv')

# Create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31 
# and with columns such that the column labels are the 'FileID' column in `df_Attributes`
start_date = '2019-01-01'
end_date = '2021-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
data = pd.DataFrame(index=date_range, columns=df_Attributes['FileID'])

# Loop from year 2019 to 2021
for year in range(2019, 2022):
    # Loop for each value in 'FileID'
    for fileID in df_Attributes['FileID']:
        # Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame
        file_path = f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv'
        if os.path.exists(file_path):
            raw_data = pd.read_csv(file_path)
            # Set column 'Date Local' as datetime index
            raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
            raw_data.set_index('Date Local', inplace=True)
            
            # Resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`
            resampled_arithmetic_mean = raw_data['Arithmetic Mean'].resample('D').max()
            
            # Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data`
            # with the same fileID. Dates must match.
            data.loc[resampled_arithmetic_mean.index, fileID] = resampled_arithmetic_mean.values

# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create folder if it does not exist
output_folder = 'Data/L3_Integrated'
os.makedirs(output_folder, exist_ok=True)
data.to_csv(os.path.join(output_folder, f'data_{region}.csv'))

# Display `data` DataFrame using `display()`
display(data)
FileID 44201 42401 42101 42602 88101 88502 81102 86101 WIND TEMP PRESS RH_DP HAPS VOCS NONOxNOy LEAD
2019-01-01 0.035412 0.6625 0.477667 8.416667 7.158333 53.795833 18.0 NaN 258.291667 73.0 NaN 86.833333 NaN NaN 18.820833 NaN
2019-01-02 0.029471 0.754167 0.55 18.217647 6.404348 5.125 18.0 NaN 262.916667 74.75 NaN 82.958333 NaN NaN 49.576471 NaN
2019-01-03 0.026882 0.725 0.56725 18.270833 9.929167 6.979167 22.0 10.1 307.708333 75.291667 NaN 87.375 0.72 2.76 50.520833 NaN
2019-01-04 0.031882 0.740909 0.533333 18.1625 12.4 10.7875 28.0 NaN 155.083333 72.875 NaN 88.166667 NaN NaN 46.6125 NaN
2019-01-05 0.034765 0.683333 0.408333 17.983333 12.016667 11.370833 24.0 NaN 100.166667 70.541667 NaN 82.75 NaN NaN 32.495833 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-12-27 0.042882 0.805882 0.7375 24.579167 9.929167 8.045833 20 6.970833 202.208333 70.208333 NaN 84.958333 0.01 NaN 42.8375 NaN
2021-12-28 0.040647 0.683333 0.816667 27.979167 8.865217 7.663158 21 6.904167 254.875 70.166667 NaN 87.625 NaN NaN 85.0875 NaN
2021-12-29 0.034059 0.691667 0.665375 24.183333 8.183333 6.970833 17 7.8125 269.5 71.958333 NaN 88.25 NaN NaN 77.915 NaN
2021-12-30 0.025706 0.679167 0.804167 23.5375 8.320833 6.708333 16 7.066667 268.958333 73.5 NaN 89.75 0.007 NaN 83.466667 NaN
2021-12-31 0.02925 0.5375 0.717125 15.479167 12.591667 16.475 14 6.170833 261.416667 73.583333 NaN 88.416667 NaN NaN 41.7125 NaN

1096 rows × 16 columns

3.6 Data validation#

Data validation includes ensuring the accuracy, completeness, and reliability of the data. Here we will simply cross-reference our collected data in the data DataFrame with the data in the raw data files to ensure that we did not make any mistakes. We can display any rows and columns of our data DataFrame and cross-check them with the raw data and the csv files. Here is one example.

columns_to_show = ['86101', 'WIND','TEMP']
data.loc['2019-12-30':'2020-01-01', columns_to_show]
FileID 86101 WIND TEMP
2019-12-30 10.45 220.916667 75.833333
2019-12-31 6.995833 170.458333 71.333333
2020-01-01 6.233333 200.875 68.333333

We can also use the JupyterLab variableInspector extension to get an overview of the data before additional preprocessing:

  • review each parameter in the CSV file to ensure there are no errors (a quick completeness check is sketched after this list).

  • note that the parameters “NONOxNOy”, “VOCs”, and “HAPs” need special handling due to their multiple-parameter nature.

  • acknowledge this issue for the current analysis. No further action is needed, as these parameters won’t be used in our analysis.
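
As a quick complement to the spot checks above, we can also summarize completeness per parameter. This is a minimal sketch that assumes the data DataFrame from the integration step is still in memory:

# Count the non-missing daily values per parameter over the 1096-day study period
records_per_parameter = data.notna().sum()
display(records_per_parameter.sort_values())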

3.7 Data enrichment#

This step involves enhancing the dataset with additional information from external sources to improve analysis.

  • No data enrichment is required at this stage.

3.8 Data transformation#

This step involves converting data into a form suitable for analysis. Data enrichment and data transformation are closely related but distinct from data structuring. While data structuring primarily involves data wrangling for efficient storage and retrieval, data transformation focuses on preparing data for optimal analysis.

Our analysis will focus on PM2.5, PM10, NO2, SO2, CO, and O3.

| Parameter | FileID | Unit | Records |
|---|---|---|---|
| PM2.5 FRM/FEM Mass | 88101 | Micrograms/cubic meter (LC) | 14924 |
| PM2.5 non FRM/FEM Mass | 88502 | Micrograms/cubic meter (LC) | 7987 |
| PM10 Mass | 81102 | Micrograms/cubic meter (25 C) | 4130 |
| NO2 | 42602 | Parts per billion | 5056 |
| SO2 | 42401 | Parts per billion | 4253 |
| CO | 42101 | Parts per million | 6256 |
| Ozone | 44201 | Parts per million | 8420 |
| Winds | WIND | Knots | 2192 |
| Temperature | TEMP | Degrees Fahrenheit | 1796 |
| RH and Dewpoint | RH_DP | Percent relative humidity | 1787 |
| HAPs | HAPS | Micrograms/cubic meter (LC) | 6536 |
| VOCs | VOCS | Parts per billion Carbon | 7433 |

To streamline this process, we will:

  • Combine ‘PM2.5 FRM/FEM Mass’ and ‘PM2.5 non FRM/FEM Mass’ for both samplers into one ‘PM2.5’ set.

  • Select parameters: PM2.5, PM10, NO2, SO2, CO, and Ozone.

  • Rename column labels from FileIDs to parameter names for clarity (e.g., ‘81102’ to ‘PM10’).

  • Convert the DataFrame to a NumPy representation using .values for analysis with NumPy

  • Save the data as a NumPy array for analysis.

We can ask our AI code assistant to do this.

Python code with rich annotations

region = ‘Miami’

  1. Read ‘data_{region}.csv’ as the data DataFrame with a datetime index from ‘Data/L3_Integrated’

  2. In data, create column ‘PM2.5’ which is mean of columns ‘88101’, ‘88502’

  3. Create a new DataFrame aqi_data as a copy of data which shall include the following columns from data

columns_from_data= [
'PM2.5', #PM2.5
'81102', #PM10
'42602', #NO2
'42401', #SO2
'42101', #CO
'44201',  #O3
]
  4. In the aqi_data DataFrame rename the column labels as follows:

'PM2.5': 'PM2.5'
'81102': 'PM10'
'42602': 'NO2'
'42401': 'SO2'
'42101': 'CO'
'44201': 'O3'
  5. In aqi_data, get a NumPy representation of the DataFrame using the .values method

  6. To be later opened by NumPy, save the aqi_data DataFrame as ‘aqi_data_{region}.csv’ at ‘Data/L4_Transformed’ and create the folder if it does not exist

import pandas as pd
import os
region = 'Miami'

# 1. Read 'data_{}.csv' as `data` DataFrame with datetime index from 'Data/L3_Integrated'
data = pd.read_csv(f'Data/L3_Integrated/data_{region}.csv', index_col=0, parse_dates=True)

# 2. In `data`, create column 'PM2.5' which is mean of columns '88101', '88502'
data['PM2.5'] = data[['88101', '88502']].mean(axis=1)

# 3. Create a new DataFrame `aqi_data` as a copy of `data` which shall include specific columns
columns_from_data = ['PM2.5', '81102', '42602', '42401', '42101', '44201']

aqi_data = data[columns_from_data].copy()

# 4. Rename columns labels in `aqi_data` DataFrame
aqi_data.columns = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']


# 5. Get a NumPy representation of `aqi_data` DataFrame using `.values` method
aqi_data_values = aqi_data.values

# 6. Save the `aqi_data` DataFrame as 'aqi_data_{region}.csv' in 'Data/L4_Transformed' and create folder if it does not exist
os.makedirs('Data/L4_Transformed', exist_ok=True)
aqi_data.to_csv(f'Data/L4_Transformed/aqi_data_{region}.csv')

After this transformation, our data is prepared for analysis. The next step is to utilize NumPy to calculate the air quality index before and after the lockdown.
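
As a quick check that the saved file can indeed be opened by NumPy later, here is a minimal sketch; it assumes the column order written above (a date index column followed by PM2.5, PM10, NO2, SO2, CO, and O3):

import numpy as np

# Load the transformed csv directly with NumPy, skipping the header row and
# the date index column; the remaining six columns are the AQI parameters
aqi_array = np.genfromtxt('Data/L4_Transformed/aqi_data_Miami.csv',
                          delimiter=',', skip_header=1, usecols=range(1, 7))
print(aqi_array.shape)  # expected (1096, 6)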

4. Summary#

| Step(s) | File Name | Number of files | File size | Data size | Comments |
|---|---|---|---|---|---|
| 1) Data discovery and profiling | ‘daily_{FileID}_{year}.csv’ | 48 | ~60 MB | ~3 GB | 16 parameters and 3 years |
| 2-4) Data collection, cleaning, and structuring | ‘daily_{FileID}_{year}_Miami.csv’ | 48 | ~0.4 MB | ~10 MB | 16 parameters and 3 years for Miami and selected columns |
| | ‘Attributes.csv’ | 1 | 2 KB | | Parameter details and file information |
| 5) Data integration | ‘data_Miami.csv’ | 1 | 125 KB | 125 KB | Integrated data with 16 parameters |
| 6) Data validation | ‘data_Miami.csv’ | | | | NONOxNOy, HAPs, and VOCs columns need additional treatment |
| 7) Data enrichment | ‘data_Miami.csv’ | | | | No data enrichment |
| 8) Data transformation | ‘aqi_data_Miami.csv’ | 1 | 67 KB | 67 KB | Data for AQI analysis with NumPy representation |