Exercise 6 - AQI Data Preparation#

This lesson was developed with assistance from GPT-4-1106-preview, claude-2.1, and GPT-3.5-turbo.


1. Problem definition#

What is the impact of COVID-19 lockdown orders on air quality in Miami?

2. Data requirements#

We need to collect daily air quality data for parameters such as PM2.5, PM10, NO2, SO2, and CO in csv format from 2019 to 2021, that is, one year before and one year after the COVID-19 lockdown orders of April 2020. EPA Air Data contains air quality data collected at outdoor monitors across the US.

3. Data preparation#

  • Data discovery and profiling: Exploring the content, quality, and structure of the dataset

  • Data collection: Gathering raw data from various sources for analysis

  • Data cleaning: Standardizing inconsistent data formats, correcting errors, and handling missing values

  • Data structuring: Organizing data for efficient storage and retrieval

  • Data enrichment: Enhancing dataset with additional information from external sources to improve analysis

  • Data transformation: Optimizing the content and structure of data for further analysis

  • Data validation: Ensuring the accuracy, completeness, and reliability of the data

  • Data integration: Combining data from multiple sources to create a unified dataset for analysis

3.1 Data discovery and profiling#

In this step we are exploring the content, quality, and structure of the dataset. From this step we can learn the following.

  • EPA Air Data contains air quality data collected at outdoor monitors across the US.

  • For this exercise we want to use Pre-Generated Data Files - Daily Data.

  • For each year, each air quality parameter (e.g., NO2, SO2, and CO) is recorded in its own csv file.

  • The dataset has 16 air quality parameters.

  • For three years, there are 48 csv files, each about 60 MB. That is approximately 3 GB of data.

3.2 Data collection#

In this step we gather the raw data for analysis. The breakdown of this step is as follows.

  • We need to collect data for 16 air quality parameters for three years from Pre-Generated Data Files - Daily Data, that is, 48 files

  • Each file is in csv format and zipped

  • We can use Python tools for making HTTP requests, such as the requests package

  • We can use Python tools for unzipping files, such as the zipfile package (see the minimal sketch after this list)

  • Each file is larger than 60 MB, so downloading and processing the raw data can take a long time
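
Before prompting for the full pipeline, we can test the approach on a single file. Below is a minimal sketch, assuming the requests package is installed; the URL pattern follows the EPA Pre-Generated Data Files naming described later in this section, and the file name is only an example.

# Minimal sketch: download one pre-generated daily file and extract its csv.
# The file name is an example (Ozone, 2019); network access is required.
import io
import zipfile

import requests

base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
filename = 'daily_44201_2019.zip'

response = requests.get(base_url + filename, timeout=120)
response.raise_for_status()  # stop early if the download failed

# Extract the csv contained in the zip archive into the current folder
with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
    zip_ref.extractall()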

3.3 Data cleaning#

This step involves standardizing inconsistent data formats, correcting errors, and handling missing values. For this step we will do the following:

  • For this EPA data, cleaning is minimal

  • We will record the files that have no records for our study area and period (see the sketch after this list)

  • We do not need to do anything about missing data at this stage
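
A minimal sketch of how an empty file could be flagged, assuming the csv has already been downloaded and unzipped as in the collection step; the helper function and the empty_files list are illustrative, not part of the EPA data or the generated code below.

# Minimal sketch: record files that have no rows for the study region.
import pandas as pd

def has_no_region_records(csv_path, cbsa_name):
    """Return True when a daily file has no rows for the given CBSA Name."""
    raw_data = pd.read_csv(csv_path, usecols=['CBSA Name'], low_memory=False)
    return raw_data[raw_data['CBSA Name'] == cbsa_name].empty

# Illustrative bookkeeping while looping over the downloaded files
empty_files = []
csv_path = 'daily_44201_2019.csv'  # produced by the download/unzip step
if has_no_region_records(csv_path, 'Miami-Fort Lauderdale-West Palm Beach, FL'):
    empty_files.append(csv_path)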

3.4 Data structuring#

This step involves organizing and formatting the data in a way that is optimal for storage and retrieval. For this step we need to do the following tasks.

  • To reduce storage space, we will filter out rows outside our study region

  • We will only keep the columns relevant for our record and analysis: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’ (a minimal sketch follows the note below)

  • This will reduce the storage space per file from ~60 MB to ~0.4 MB.

Note: A core-based statistical area (CBSA) is a geographic region of the U.S. as defined by the Office of Management and Budget. A CBSA is “one or more adjacent counties or county equivalents that have at least one urban core area of at least 10,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties” (Wikipedia).
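
The prompt and generated code below handle all 48 files in one loop; as a standalone illustration, a minimal sketch of this structuring step for a single downloaded file could look like this (file names are examples):

# Minimal sketch: keep only the rows for the study region and the relevant columns.
import pandas as pd

cbsa_name = 'Miami-Fort Lauderdale-West Palm Beach, FL'
keep_columns = ['County Code', 'Parameter Name', 'Date Local',
                'Units of Measure', 'Arithmetic Mean', 'CBSA Name']

raw_data = pd.read_csv('daily_44201_2019.csv', low_memory=False)
structured = raw_data.loc[raw_data['CBSA Name'] == cbsa_name, keep_columns]
structured.to_csv('daily_44201_2019_Miami.csv', index=False)  # ~0.4 MB instead of ~60 MB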

After conducting an initial data exploration and profiling, we can formulate a prompt for our language model to assist with data collection, cleaning, and structuring. Here is the prompt:

Python code with rich annotations:
This code collects air quality data in a region for one year before and one year after the lockdown order on April 1, 2020.

These are the FileIDs of the air quality data that we want to download:

| Category | Parameter | FileID | Unit | Records |
|---|---|---|---|---|
| Criteria Gases | Ozone | 44201 | na | 0 |
| Criteria Gases | SO2 | 42401 | na | 0 |
| Criteria Gases | CO | 42101 | na | 0 |
| Criteria Gases | NO2 | 42602 | na | 0 |
| Particulates | PM2.5 FRM/FEM Mass | 88101 | na | 0 |
| Particulates | PM2.5 non FRM/FEM Mass | 88502 | na | 0 |
| Particulates | PM10 Mass | 81102 | na | 0 |
| Particulates | PMc Mass | 86101 | na | 0 |
| Meteorological | Winds | WIND | na | 0 |
| Meteorological | Temperature | TEMP | na | 0 |
| Meteorological | Barometric Pressure | PRESS | na | 0 |
| Meteorological | RH and Dewpoint | RH_DP | na | 0 |
| Toxics, Precursors, and Lead | HAPs | HAPS | na | 0 |
| Toxics, Precursors, and Lead | VOCs | VOCS | na | 0 |
| Toxics, Precursors, and Lead | NONOxNOy | NONOxNOy | na | 0 |
| Toxics, Precursors, and Lead | Lead | LEAD | na | 0 |

The daily data file is named ‘daily_FileID_Year.zip’.
For example, ‘daily_44201_2023.zip’ has Ozone data for 2023

The file address is: https://aqs.epa.gov/aqsweb/airdata/daily_44201_2023.zip
For example,

filename = 'daily_44201_2023.zip' 
base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
download_from = base_url + filename

will be the address of the Ozone data for 2023.

Each zip file contains a csv file. For example, ‘daily_44201_2023.zip’ contains ‘daily_44201_2023.csv’

Additional information

#These files will be downloaded to:
save_to = 'Data/L2_Structured'

# Region of interest
region = 'Miami'

# CBSA Name of the region of interest 
CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'

Write a Python code to:

  1. Use the information in the table above to create an Attributes list or dictionary for ‘Category’, ‘Parameter’, ‘FileID’, ‘Unit’, ‘Records’

  2. Define variables region and CBSA_Name_Value

  3. Count each download with variable name download_count

  4. Get files for years 2019, 2020, 2021 for all the FileIDs

  5. Show a message about download progress after each file as download_count : year : parameter : filename

  6. Unzip each file

  7. Read csv file with Pandas as raw_data DataFrame

  8. Delete the zip file ‘daily_FileID_Year.zip’ only if it exists; if the file is open, close it first. For example, ‘daily_44201_2023.zip’

  9. Delete the csv file ‘daily_FileID_Year.csv’ only if it exists; if the file is open, close it first. For example, ‘daily_44201_2023.csv’

  10. In raw_data, address this warning: “Columns (13) have mixed types. Specify dtype option on import or set low_memory=False”.

  11. In Attributes, update ‘Unit’ with the first value in column ‘Units of Measure’ in raw_data. For example, for fileid = 44201, ‘na’ will be replaced with ‘Parts per million’. If the first value does not exist, update ‘Unit’ in the Attributes list with ‘empty_file’

  12. In raw_data, keep only rows where the value of Column ‘CBSA Name’ is equal to CBSA_Name_Value

  13. In raw_data, keep only columns with labels: ‘County Code’, ‘Parameter Name’, ‘Date Local’, ‘Units of Measure’, ‘Arithmetic Mean’, ‘CBSA Name’

  14. In raw_data, convert ‘Date Local’ column to datetime format, set as index, and sort by index

  15. In Attributes, update ‘Records’ with the number of rows in the raw_data plus the current value of ‘Records’

  16. Save raw_data at the save_to location with file name ‘daily_{FileID}_{year}_{region}.csv’, and create the folder if it does not exist

  17. At the end, create the df_Attributes DataFrame from Attributes

  18. Save df_Attributes at save_to as ‘Attributes.csv’

  19. Display df_Attributes DataFrame

  20. Annotate code with detailed and rich annotations describing each task and sub-task inside and outside the loop

Download = False  # Change to True if you want to download the data

if Download: 
    #ChatGPT 3.5 Turbo Code    
    import pandas as pd
    import os
    import zipfile
    from urllib.request import urlretrieve
    
    # 1. Define Attributes dictionary for air quality data
    Attributes = {
        'Category': ['Criteria Gases', 'Criteria Gases', 'Criteria Gases', 'Criteria Gases',
                     'Particulates', 'Particulates', 'Particulates', 'Particulates',
                     'Meteorological', 'Meteorological', 'Meteorological', 'Meteorological',
                     'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead',
                     'Toxics, Precursors, and Lead', 'Toxics, Precursors, and Lead'],
        'Parameter': ['Ozone', 'SO2', 'CO', 'NO2', 'PM2.5 FRM/FEM Mass', 'PM2.5 non FRM/FEM Mass',
                      'PM10 Mass', 'PMc Mass', 'Winds', 'Temperature', 'Barometric Pressure',
                      'RH and Dewpoint', 'HAPs', 'VOCs', 'NONOxNOy', 'Lead'],
        'FileID': [44201, 42401, 42101, 42602, 88101, 88502, 81102, 86101, 'WIND', 'TEMP',
                   'PRESS', 'RH_DP', 'HAPS', 'VOCS', 'NONOxNOy', 'LEAD'],
        'Unit': ['na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na', 'na'],
        'Records': [0] * 16
    }
    
    # 2. Define region and CBSA_Name_Value
    region = 'Miami'
    CBSA_Name_Value = 'Miami-Fort Lauderdale-West Palm Beach, FL'
    
    # 3. Initialize download_count
    download_count = 0
    
    # 4. Loop through each year and download files
    for year in range(2019, 2022):
        for i, file_id in enumerate(Attributes['FileID']):
            filename = f'daily_{file_id}_{year}.zip'
            base_url = 'https://aqs.epa.gov/aqsweb/airdata/'
            download_from = base_url + filename
            
            # 5. Show download progress
            download_count += 1
            print(f"Download {download_count}: {year} : {Attributes['Parameter'][i]} : {filename}")
            
            # Download file
            urlretrieve(download_from, filename)
            
            # 6. Unzip file
            with zipfile.ZipFile(filename, 'r') as zip_ref:
                zip_ref.extractall()
            
            # 7. Read csv file
            raw_data = pd.read_csv(f'daily_{file_id}_{year}.csv', low_memory=False)
            
            # 8. Delete zip file if exists
            if os.path.exists(filename):
                os.remove(filename)
            
            # 9. Delete csv file if exists
            if os.path.exists(f'daily_{file_id}_{year}.csv'):
                os.remove(f'daily_{file_id}_{year}.csv')
            
            # 10. Handle warning in raw_data
            pd.set_option('mode.chained_assignment', None)
            
            # 11. Update 'Unit' in Attributes
            Attributes['Unit'][i] = raw_data['Units of Measure'][0] if len(raw_data) > 0 else 'empty_file'
            
            # 12. Filter by CBSA_Name_Value
            raw_data = raw_data[raw_data['CBSA Name'] == CBSA_Name_Value]
            
            # 13. Keep required columns
            raw_data = raw_data[['County Code', 'Parameter Name', 'Date Local', 'Units of Measure',
                                 'Arithmetic Mean', 'CBSA Name']]
            
            # 14. Convert 'Date Local' to datetime and set as index
            raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
            raw_data.set_index('Date Local', inplace=True)
            raw_data.sort_index(inplace=True)
            
            # 15. Update 'Records' in Attributes
            Attributes['Records'][i] += len(raw_data)
            
            # 16. Save raw_data
            save_folder = os.path.join('Data', 'L2_Structured')
            if not os.path.exists(save_folder):
                os.makedirs(save_folder)
            raw_data.to_csv(os.path.join(save_folder, f'daily_{file_id}_{year}_{region}.csv'))
            
    
    # 17. Create DataFrame from Attributes
    df_Attributes = pd.DataFrame(Attributes)
    
    # 18. Save df_Attributes
    df_Attributes.to_csv(os.path.join(save_folder, 'Attributes.csv'))
    
    # 19. Display df_Attributes
    display(df_Attributes)

3.5 Data integration#

This step involves combining data from multiple sources to create a unified dataset for analysis.

  • We need to combine our data into a 2D array that has our 16 parameters (also known as features or variables) for the whole study period.

We can give the code outline below to our LM to complete.

region = 'Miami'

# read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame

# create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31 
#and column names are the values of 'FileID' column in `df_Attributes`
 
# loop from year 2019 to 2021
    #Loop for each value in 'FileID'
        #Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame 
        #and set column 'Date Local' as datetime index

       #resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`

       #Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data` with the same fileID
       #Dates must match. For example, the value at 2019-01-01 in resampled_arithmetic_mean would be copied to 
       #the first row of `data` with index equal to 2019-01-01 

# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create the folder if it does not exist

#display `data` DataFrame using  `display()` not `print()`

# Add rich code annotation

#code only. no introduction and conclusion.
#ChatGPT 3.5 Turbo Code 
import pandas as pd
import os

region = 'Miami'

# Read 'Data/L2_Structured/Attributes.csv' as `df_Attributes` DataFrame
df_Attributes = pd.read_csv('Data/L2_Structured/Attributes.csv')

# Create `data` DataFrame with daily index from 2019-01-01 to 2021-12-31 
# and with columns such that the column labels are the 'FileID' column in `df_Attributes`
start_date = '2019-01-01'
end_date = '2021-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
data = pd.DataFrame(index=date_range, columns=df_Attributes['FileID'])

# Loop from year 2019 to 2021
for year in range(2019, 2022):
    # Loop for each value in 'FileID'
    for fileID in df_Attributes['FileID']:
        # Read f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv' as `raw_data` DataFrame
        file_path = f'Data/L2_Structured/daily_{fileID}_{year}_{region}.csv'
        if os.path.exists(file_path):
            raw_data = pd.read_csv(file_path)
            # Set column 'Date Local' as datetime index
            raw_data['Date Local'] = pd.to_datetime(raw_data['Date Local'])
            raw_data.set_index('Date Local', inplace=True)
            
            # Resample the column 'Arithmetic Mean' in `raw_data` with .max() as `resampled_arithmetic_mean`
            resampled_arithmetic_mean = raw_data['Arithmetic Mean'].resample('D').max()
            
            # Copy the values of resampled_arithmetic_mean for each fileID into its corresponding column in `data`
            # with the same fileID. Dates must match.
            data.loc[resampled_arithmetic_mean.index, fileID] = resampled_arithmetic_mean.values

# Save `data` DataFrame as 'data_{region}.csv' at this path 'Data/L3_Integrated' and create folder if it does not exist
output_folder = 'Data/L3_Integrated'
os.makedirs(output_folder, exist_ok=True)
data.to_csv(os.path.join(output_folder, f'data_{region}.csv'))

# Display `data` DataFrame using `display()`
display(data)
FileID 44201 42401 42101 42602 88101 88502 81102 86101 WIND TEMP PRESS RH_DP HAPS VOCS NONOxNOy LEAD
2019-01-01 0.035412 0.6625 0.477667 8.416667 7.158333 53.795833 18.0 NaN 258.291667 73.0 NaN 86.833333 NaN NaN 18.820833 NaN
2019-01-02 0.029471 0.754167 0.55 18.217647 6.404348 5.125 18.0 NaN 262.916667 74.75 NaN 82.958333 NaN NaN 49.576471 NaN
2019-01-03 0.026882 0.725 0.56725 18.270833 9.929167 6.979167 22.0 10.1 307.708333 75.291667 NaN 87.375 0.72 2.76 50.520833 NaN
2019-01-04 0.031882 0.740909 0.533333 18.1625 12.4 10.7875 28.0 NaN 155.083333 72.875 NaN 88.166667 NaN NaN 46.6125 NaN
2019-01-05 0.034765 0.683333 0.408333 17.983333 12.016667 11.370833 24.0 NaN 100.166667 70.541667 NaN 82.75 NaN NaN 32.495833 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-12-27 0.042882 0.805882 0.7375 24.579167 9.929167 8.045833 20 6.970833 202.208333 70.208333 NaN 84.958333 0.01 NaN 42.8375 NaN
2021-12-28 0.040647 0.683333 0.816667 27.979167 8.865217 7.663158 21 6.904167 254.875 70.166667 NaN 87.625 NaN NaN 85.0875 NaN
2021-12-29 0.034059 0.691667 0.665375 24.183333 8.183333 6.970833 17 7.8125 269.5 71.958333 NaN 88.25 NaN NaN 77.915 NaN
2021-12-30 0.025706 0.679167 0.804167 23.5375 8.320833 6.708333 16 7.066667 268.958333 73.5 NaN 89.75 0.007 NaN 83.466667 NaN
2021-12-31 0.02925 0.5375 0.717125 15.479167 12.591667 16.475 14 6.170833 261.416667 73.583333 NaN 88.416667 NaN NaN 41.7125 NaN

1096 rows × 16 columns

3.6 Data validation#

Data validation includes ensuring the accuracy, completeness, and reliability of the data. Here we will simply cross-reference our collected data in the data DataFrame with the data in the raw data files to ensure that we did not make any mistakes. We can display any rows and columns of our data DataFrame and cross-check them with the raw data and the csv files. Here is one example.

columns_to_show = ['86101', 'WIND','TEMP']
data.loc['2019-12-30':'2020-01-01', columns_to_show]
FileID 86101 WIND TEMP
2019-12-30 10.45 220.916667 75.833333
2019-12-31 6.995833 170.458333 71.333333
2020-01-01 6.233333 200.875 68.333333

We can also use the JupyterLab variableInspector extension to get an overview of the data before additional preprocessing:

  • review each parameter in the CSV file to ensure there are no errors (a quick completeness check is sketched after this list).

  • note that the parameters “NONOxNOy”, “VOCs”, and “HAPs” need special handling due to their multiple-parameter nature.

  • acknowledge this issue for the current analysis. No further action is needed, as these parameters won’t be used in our analysis.
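
As a quick complement to the spot checks above, we can also summarize completeness per parameter. This is a minimal sketch that assumes the data DataFrame from the integration step is still in memory:

# Count the non-missing daily values per parameter over the 1096-day study period
records_per_parameter = data.notna().sum()
display(records_per_parameter.sort_values())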

3.7 Data enrichment#

This step involves enhancing the dataset with additional information from external sources to improve analysis.

  • No data enrichment is required at this stage.

3.8 Data transformation#

This step involves converting data into a form suitable for analysis. Data enrichment and data transformation are closely related but distinct from data structuring. While data structuring primarily involves data wrangling for efficient storage and retrieval, data transformation focuses on preparing data for optimal analysis.

Our analysis will focus on PM2.5, PM10, NO2, SO2, CO, and O3.

| Parameter | FileID | Unit | Records |
|---|---|---|---|
| PM2.5 FRM/FEM Mass | 88101 | Micrograms/cubic meter (LC) | 14924 |
| PM2.5 non FRM/FEM Mass | 88502 | Micrograms/cubic meter (LC) | 7987 |
| PM10 Mass | 81102 | Micrograms/cubic meter (25 C) | 4130 |
| NO2 | 42602 | Parts per billion | 5056 |
| SO2 | 42401 | Parts per billion | 4253 |
| CO | 42101 | Parts per million | 6256 |
| Ozone | 44201 | Parts per million | 8420 |
| Winds | WIND | Knots | 2192 |
| Temperature | TEMP | Degrees Fahrenheit | 1796 |
| RH and Dewpoint | RH_DP | Percent relative humidity | 1787 |
| HAPs | HAPS | Micrograms/cubic meter (LC) | 6536 |
| VOCs | VOCS | Parts per billion Carbon | 7433 |

To streamline this process, we will:

  • Combine ‘PM2.5 FRM/FEM Mass’ and ‘PM2.5 non FRM/FEM Mass’ for both samplers into one ‘PM2.5’ set.

  • Select parameters: PM2.5, PM10, NO2, SO2, CO, and Ozone.

  • Rename column labels from FileIDs to parameter names for clarity (e.g., ‘81102’ to ‘PM10’).

  • Convert the DataFrame to a NumPy representation using .values for analysis with NumPy

  • Save the data as a NumPy array for analysis.

We can ask our AI code assistant to do this.

Python code with rich annotations

region = ‘Miami’

  1. Read ‘data_{region}.csv’ as the data DataFrame with a datetime index from ‘Data/L3_Integrated’

  2. In data, create column ‘PM2.5’ which is mean of columns ‘88101’, ‘88502’

  3. Create a new DataFrame aqi_data as a copy of data which shall include the following columns from data

columns_from_data= [
'PM2.5', #PM2.5
'81102', #PM10
'42602', #NO2
'42401', #SO2
'42101', #CO
'44201',  #O3
]
  4. In the aqi_data DataFrame rename the column labels as follows:

'PM2.5': 'PM2.5'
'81102': 'PM10'
'42602': 'NO2'
'42401': 'SO2'
'42101': 'CO'
'44201': 'O3'
  5. In aqi_data, get a NumPy representation of the DataFrame using the .values method

  6. To be later opened by NumPy, save the aqi_data DataFrame as ‘aqi_data_{region}.csv’ at ‘Data/L4_Transformed’ and create the folder if it does not exist

import pandas as pd
import os
region = 'Miami'

# 1. Read 'data_{}.csv' as `data` DataFrame with datetime index from 'Data/L3_Integrated'
data = pd.read_csv(f'Data/L3_Integrated/data_{region}.csv', index_col=0, parse_dates=True)

# 2. In `data`, create column 'PM2.5' which is mean of columns '88101', '88502'
data['PM2.5'] = data[['88101', '88502']].mean(axis=1)

# 3. Create a new DataFrame `aqi_data` as a copy of `data` which shall include specific columns
columns_from_data = ['PM2.5', '81102', '42602', '42401', '42101', '44201']

aqi_data = data[columns_from_data].copy()

# 4. Rename columns labels in `aqi_data` DataFrame
aqi_data.columns = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']


# 5. Get a NumPy representation of `aqi_data` DataFrame using `.values` method
aqi_data_values = aqi_data.values

# 6. Save the `aqi_data` DataFrame as 'aqi_data_{region}.csv' in 'Data/L4_Transformed' and create folder if it does not exist
os.makedirs('Data/L4_Transformed', exist_ok=True)
aqi_data.to_csv(f'Data/L4_Transformed/aqi_data_{region}.csv')

After this transformation, our data is prepared for analysis. The next step is to utilize NumPy to calculate the air quality index before and after the lockdown.
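
As a quick check that the saved file can indeed be opened by NumPy later, here is a minimal sketch; it assumes the column order written above (a date index column followed by PM2.5, PM10, NO2, SO2, CO, and O3):

import numpy as np

# Load the transformed csv directly with NumPy, skipping the header row and
# the date index column; the remaining six columns are the AQI parameters
aqi_array = np.genfromtxt('Data/L4_Transformed/aqi_data_Miami.csv',
                          delimiter=',', skip_header=1, usecols=range(1, 7))
print(aqi_array.shape)  # expected (1096, 6)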

4. Summary#

| Step(s) | File Name | Number of files | File size | Data size | Comments |
|---|---|---|---|---|---|
| 1) Data discovery and profiling | ‘daily_{FileID}_{year}.csv’ | 48 | ~60 MB | ~3 GB | 16 parameters and 3 years |
| 2-4) Data collection, cleaning, and structuring | ‘daily_{FileID}_{year}_Miami.csv’ | 48 | ~0.4 MB | ~10 MB | 16 parameters and 3 years for Miami and selected columns |
| | ‘Attributes.csv’ | 1 | 2 KB | | Parameter details and file information |
| 5) Data integration | ‘data_Miami.csv’ | 1 | 125 KB | 125 KB | Integrated data with 16 parameters |
| 6) Data validation | ‘data_Miami.csv’ | | | | NONOxNOy, HAPs, and VOCs columns need additional treatment |
| 7) Data enrichment | ‘data_Miami.csv’ | | | | No data enrichment |
| 8) Data transformation | ‘aqi_data_Miami.csv’ | 1 | 67 KB | 67 KB | Data for AQI analysis with NumPy representation |