Exercise 7 NumPy: Air quality data analysis#



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

1. Problem statement#

This assignment involves analyzing air quality data for the Miami core-based statistical area (CBSA) to evaluate the impact of COVID-19 lockdown orders on air quality in Miami. The analysis is based on EPA daily air quality data for PM2.5, PM10, NO2, SO2, CO, and O3 from April 1, 2019 to March 31, 2021, covering one year before and one year after the lockdown order of April 1, 2020. The tasks include:

  • Calculating the mean concentrations for these 6 air quality parameters before and after the lockdown order.

  • Conducting hypothesis testing to determine if there is a statistically significant difference in the pre and post data.

To complete this exercise, you will apply the skills learned in the Data Science Workflow and NumPy lessons.

2. Prepare data#

2.1 Read data as NumPy array#

Read the data from ‘aqi_data_Miami.csv’ into a NumPy array data

# Read the CSV file into a NumPy mixed type array with dtype object
# Column names: [datetime-empty], 'PM2.5', 'PM10',  'NO2', 'SO2',  'CO', 'O3'
data = pd.read_csv('data/aqi_data_Miami.csv').to_numpy()

# Convert the first column in the NumPy array to datetime format

# Display data
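As a sketch of the intended pattern, reading the CSV into a mixed-type NumPy array and converting the first column to datetime could look like the following. This example uses a small inline CSV with hypothetical values rather than the real data/aqi_data_Miami.csv file, so it runs even without the data file:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row sample mimicking the structure of aqi_data_Miami.csv
csv_text = (",PM2.5,PM10,NO2,SO2,CO,O3\n"
            "2019-04-01,8.2,20.0,12.0,1.0,0.3,0.04\n"
            "2019-04-02,9.1,22.0,14.0,1.2,0.4,0.05\n")

# Read the CSV into a mixed-type NumPy array (dtype object)
data = pd.read_csv(io.StringIO(csv_text)).to_numpy()

# Convert the first column to datetime format
data[:, 0] = pd.to_datetime(data[:, 0])

# Display data
print(data.dtype, data.shape)
```

With the real file, only the `pd.read_csv` argument would change to the file path.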

2.2 Extract dates and values#

Extract the dates and values from the NumPy array data into separate NumPy arrays dates and values

# Extracting first column as datetime (you can do it in one line)


# Extract numerical values as numbers (you can do it in one line)
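A minimal sketch of this extraction, assuming a small hypothetical stand-in for the mixed-type array data:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type array like `data`: one datetime column, six pollutant columns
data = np.array([[pd.Timestamp('2019-04-01'), 8.2, 20.0, 12.0, 1.0, 0.3, 0.04],
                 [pd.Timestamp('2019-04-02'), 9.1, 22.0, 14.0, 1.2, 0.4, 0.05]],
                dtype=object)

# Extract the first column as datetime (one line)
dates = data[:, 0].astype('datetime64[ns]')

# Extract the numerical values as floats (one line)
values = data[:, 1:].astype(float)

print(dates.dtype, values.dtype, values.shape)
```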

2.3 Remove rows with NaNs#

Remove rows with NaN values from data, dates, and values.

#Original dataset information

#Remove NaNs

#Cleaned dataset information
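One way to drop NaN rows is a boolean mask built with np.isnan, sketched here on small hypothetical arrays:

```python
import numpy as np

# Hypothetical values array with a NaN in the second row, plus matching dates
values = np.array([[8.2, 20.0], [np.nan, 22.0], [9.1, 21.0]])
dates = np.array(['2019-04-01', '2019-04-02', '2019-04-03'], dtype='datetime64[D]')

# Keep only rows where no value is NaN
mask = ~np.isnan(values).any(axis=1)
dates, values = dates[mask], values[mask]

print(values.shape)  # (2, 2)
```

The same mask is applied to both arrays so dates and values stay aligned row by row.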

2.4 Extracting pre and post-lockdown data#

Extract data based on the lockdown order on April 1, 2020:

  • Extract data one year before lockdown as pre_dates and pre_values,

  • Extract data one year after lockdown as post_dates and post_values,

  • Extract data one year before and after lockdown combined as pre_post_dates and pre_post_values.

# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')


# Filter the NumPy array based on the date range for pre-post-covid19 data


# Filter the NumPy array based on the date range for pre-covid19 data


# Filter the NumPy array based on the date range for post-covid19 data
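Comparing a datetime64 array against pandas Timestamps yields boolean masks that select each date range. A sketch with two hypothetical rows, one on each side of the lockdown date:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned arrays: one pre-lockdown row, one post-lockdown row
dates = pd.to_datetime(['2019-06-01', '2020-06-01']).to_numpy()
values = np.array([[10.0], [12.0]])

# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')

# Boolean masks select rows inside each date range
pre_mask = (dates >= one_year_before) & (dates < lockdown_start)
post_mask = (dates >= lockdown_start) & (dates < one_year_after)

pre_dates, pre_values = dates[pre_mask], values[pre_mask]
post_dates, post_values = dates[post_mask], values[post_mask]
print(len(pre_dates), len(post_dates))  # 1 1
```

The combined pre_post arrays can be built the same way with `pre_mask | post_mask`.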

2.5 Save NumPy arrays of dates and values (extra)#

Let us save our pre-processed data

  • pre_dates one year datetime data before lockdown

  • pre_values one year concentration data before lockdown for PM2.5, PM10, NO2, SO2, CO, O3

  • post_dates one year datetime data after lockdown

  • post_values one year concentration data after lockdown for PM2.5, PM10, NO2, SO2, CO, O3

because we will use it in the Matplotlib exercise. We do not need the allow_pickle=True argument because these arrays do not contain mixed-type data.

# Save the data arrays to a file 
np.save('data/pre_dates.npy', pre_dates)
np.save('data/post_dates.npy', post_dates)
np.save('data/pre_values.npy', pre_values)
np.save('data/post_values.npy', post_values)

Let us double-check that we can load and use the saved data

# Load the array from file
pre_dates = np.load('data/pre_dates.npy')
pre_values = np.load('data/pre_values.npy')
print("pre_dates:", pre_dates.dtype, pre_dates.shape,
      "pre_values:", pre_values.dtype, pre_values.shape)

2.6 Additional information#

Here is additional information that can be helpful to our analysis. We can refer to EPA document EPA 454/B-18-007 for concentration breakpoints indicating unhealthy levels for sensitive groups.

# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')

# Air quality parameter information
parameters = [        'PM2.5', 'PM10',  'NO2', 'SO2',  'CO', 'O3']
units = [             'µg/m³', 'µg/m³', 'ppb', 'ppb', 'ppm', 'ppm']
limits = [            35,     155,    100,   50,     9,  0.07]  # Unhealthy levels for sensitive groups

3. Mean values for each parameter before and after lockdown#

Before proceeding with hypothesis testing, let’s analyze the mean values of each air quality parameter before and after the lockdown.

  • For a simple analysis, we can compare the mean concentrations before and after the lockdown for each parameter.

  • For a more sophisticated analysis, we can use an Air Quality Index (AQI) calculation method to compute the AQI for each parameter, and then calculate the mean AQI values before and after the lockdown.

Let us try both analyses.

3.1 Simple approach#

We could calculate the Air Quality Index (AQI) for PM2.5, PM10, etc. using an AQI calculation method, which can be complex. For simplicity, let’s just compare the mean concentrations of pre- and post-lockdown data for each parameter.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

# Calculate mean concentrations for pre-lockdown and post-lockdown


# Display results in a table form with `parameters` as index,
# and first and second column values are `pre_mean` and `post_mean` 
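A sketch of this comparison, using small hypothetical stand-ins for pre_values and post_values (rows are days, columns are the six parameters):

```python
import numpy as np
import pandas as pd

# Six air quality parameters
parameters = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']

# Hypothetical stand-ins for pre_values and post_values
pre_values = np.array([[8.0, 20.0, 12.0, 1.0, 0.3, 0.04],
                       [10.0, 24.0, 14.0, 1.4, 0.5, 0.06]])
post_values = np.array([[7.0, 18.0, 10.0, 0.8, 0.2, 0.03],
                        [9.0, 22.0, 12.0, 1.2, 0.4, 0.05]])

# Column-wise means over time (axis=0 averages over days)
pre_mean = np.mean(pre_values, axis=0)
post_mean = np.mean(post_values, axis=0)

# Display results in table form with `parameters` as index
df = pd.DataFrame({'Pre Mean': np.round(pre_mean, 2),
                   'Post Mean': np.round(post_mean, 2)},
                  index=parameters)
print(df)
```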

The pre and post means seem similar. Let us confirm this with some statistical analysis. Before we do this, let us try a simple version of the AQI calculation method.

3.2 AQI approach (optional)#

3.2.1 Calculate AQI values#

We can compile concentration breakpoints and their corresponding AQI values for various pollutants (EPA 454/B-18-007). Here is an example:

| Parameter | Unit  | Good AQI (0-50) | Moderate AQI (51-100) | Unhealthy for Sensitive Groups AQI (101-150) | Unhealthy AQI (151-200) |
|-----------|-------|-----------------|-----------------------|----------------------------------------------|-------------------------|
| PM2.5     | µg/m³ | 0.0 - 12.0      | 12.1 - 35.4           | 35.5 - 55.4                                  | 55.5 - 150.4            |
| PM10      | µg/m³ | 0 - 54          | 55 - 154              | 155 - 254                                    | 255 - 354               |
| NO2       | ppb   | 0 - 40          | 41 - 100              | 101 - 200                                    | 201 - 250               |
| SO2       | ppb   | 0 - 20          | 21 - 50               | 51 - 100                                     | 101 - 304               |
| CO        | ppm   | 0.0 - 4.4      | 4.5 - 9.4             | 9.5 - 12.4                                   | 12.5 - 15.4             |
| O3        | ppm   | 0.000 - 0.054   | 0.055 - 0.07          | 0.071 - 0.085                                | 0.086 - 0.115           |

The “Good” category ranges from 0 to 50 on the AQI, indicating satisfactory air quality. “Moderate” falls between 51 and 100, generally acceptable but may affect sensitive groups. “Unhealthy for Sensitive Groups” is 101-150, suggesting limits on outdoor activities for at-risk individuals. “Unhealthy” is 151-200, signaling adverse health effects for everyone, requiring outdoor limits.

We can create a dictionary for each pollutant containing its respective concentration breakpoints and corresponding AQI values.

#Dictionary for breakpoints of 6 parameters
pollutants = {
    'PM2.5': {'breakpoints': [0, 12.1, 35.5, 55.5, 150.4]},
    'PM10': {'breakpoints': [0, 55, 155, 255, 354]},
    'NO2': {'breakpoints': [0, 41, 101, 201, 250]},
    'SO2': {'breakpoints': [0, 21, 51, 101, 304]},
    'CO': {'breakpoints': [0, 4.5, 9.5, 12.5, 15.4]},
    'O3': {'breakpoints': [0, 0.055, 0.071, 0.086, 0.115]}}

With linear interpolation, we can calculate the AQI for a given concentration.

def calculate_aqi(pollutant, concentration):
    poll_data = pollutants[pollutant]
    bp = poll_data['breakpoints']
    aqi = [0, 51, 101, 151, 200]

    if concentration <= bp[1]:
        aqi_val = ((aqi[1]-aqi[0])/(bp[1]-bp[0])) *(concentration-bp[0])+ aqi[0]
    elif concentration <= bp[2]:
        aqi_val = ((aqi[2]-aqi[1])/(bp[2]-bp[1])) *(concentration-bp[1])+ aqi[1]
    elif concentration <= bp[3]:
        aqi_val = ((aqi[3]-aqi[2])/(bp[3]-bp[2])) *(concentration-bp[2])+ aqi[2]
    else:
        aqi_val = ((aqi[4]-aqi[3])/(bp[4]-bp[3])) *(concentration-bp[3])+ aqi[3]
        
    return aqi_val
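As a quick check of the function above: for in-range concentrations, the piecewise-linear mapping is equivalent to np.interp over the breakpoint and AQI anchor lists. Here we try a hypothetical PM2.5 concentration of 20 µg/m³:

```python
import numpy as np

# PM2.5 breakpoints and AQI anchors, as in the pollutants dictionary above
bp = [0, 12.1, 35.5, 55.5, 150.4]
aqi = [0, 51, 101, 151, 200]

# Linear interpolation between the surrounding breakpoints
aqi_val = np.interp(20.0, bp, aqi)
print(round(aqi_val, 2))  # 67.88
```

Note that np.interp clips values above the last breakpoint to 200, whereas the function above extrapolates, so they only agree within the tabulated range.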

Then we can call this function to calculate the AQI value for each sample.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

#Calculate aqi for pre and post data
for dataset in ['pre', 'post']:

    #Initialize aqi_data to store the calculated AQI values
    if dataset == 'pre':
        aqi_data = pre_values.copy()
    elif dataset == 'post':
        aqi_data = post_values.copy()
    
    #Loop through each row and each element in the row to calculate the AQI
    for row_index, row in enumerate(aqi_data):
        for col_index, element in enumerate(row):
            
            #Get parameter name
            parameter= parameters[col_index]
    
            #Pass parameter name and value to calculate_aqi
            #Update aqi_data with the calculated AQI for that parameter and date
            aqi_data[row_index,col_index] = calculate_aqi(parameter, element)

    #Display aqi
    print(dataset, aqi_data.dtype, aqi_data.shape)
    print(aqi_data[0])
    
    # Save aqi for pre and post
    if dataset == 'pre':
        pre_aqi = aqi_data
    elif dataset == 'post':
        post_aqi = aqi_data

3.2.2 Save AQI values as NumPy arrays (extra)#

This step is not needed in this exercise, but we want to use this data in the Matplotlib exercise.

# Save the data arrays to a file 
np.save('data/pre_aqi.npy', pre_aqi)
np.save('data/post_aqi.npy', post_aqi)

Load saved data for double checking.

pre_aqi = np.load('data/pre_aqi.npy')
post_aqi = np.load('data/post_aqi.npy')
print("pre_aqi:", pre_aqi.dtype, pre_aqi.shape)
print("post_aqi:", post_aqi.dtype, post_aqi.shape)

3.2.3 Calculate mean AQI values for each parameter#

We calculated the AQI for PM2.5, PM10, etc. Now let us compare the mean AQI of pre- and post-lockdown data for each parameter.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

# Calculate mean AQI values for pre-lockdown and post-lockdown
# (use new names so pre_aqi and post_aqi are not overwritten)
pre_mean_aqi = np.mean(pre_aqi, axis=0)
post_mean_aqi = np.mean(post_aqi, axis=0)

# Display results in a table form with `parameters` as index,
# and the first and second columns as the pre and post mean AQI values
df = pd.DataFrame({'Pre Mean AQI': np.round(pre_mean_aqi, 2), 
                   'Post Mean AQI': np.round(post_mean_aqi, 2)}, 
                  index=parameters)

df

The pre and post means seem similar. Let us confirm this with some statistical analysis.

4. Hypothesis testing#

4.1 Introduction#

We analyze one year of air quality data before and after the April 2020 lockdown order in Miami, Florida. The air quality parameters considered are PM2.5, PM10, NO2, SO2, CO, and O3. We employ hypothesis testing for the analysis. To learn about hypothesis testing, you can refer to this Statistical Vignettes presentation of Project EDDIE.

Hypothesis testing is a statistical method used to compare sample data and make conclusions about a population. With air quality data before and after a COVID-19 lockdown, we are investigating differences in air quality parameters. We typically set up null hypothesis \(H_0\) and alternative hypothesis \(H_a\). For instance, for PM2.5 concentration:

  • Null hypothesis \(H_0\): Mean PM2.5 before and after the lockdown are equal.

  • Alternative hypothesis \(H_a\): Mean PM2.5 before and after the lockdown differ.

Then, to determine whether there is a statistically significant difference between the mean PM2.5 concentrations before and after the lockdown, we can calculate the test statistic and p-value.

  • We choose a statistical test, such as a t-test or Mann-Whitney U test, based on the distribution of our data (e.g., whether it is normally distributed).

  • We set the significance level to 0.05, indicating that we are willing to accept a 5% chance of mistakenly rejecting the null hypothesis when it is actually true.

  • If the p-value is below 0.05, we reject the null hypothesis, indicating a significant difference in PM2.5 concentrations pre and post COVID-19 lockdown.

We can repeat the same for the other air quality parameters.

Info

The t-test is a statistical test used to compare the means of two independent groups, assuming that the data is normally distributed and the variances of the two groups are equal. We can relax the variance condition; for example, in the ttest_ind function of the SciPy library, we can set equal_var=False. For non-normally distributed data or small samples, the Mann-Whitney U test compares two groups based on ranks by checking if one group tends to have higher values.
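One common way to decide between the two tests is a normality check such as the Shapiro-Wilk test, available as scipy.stats.shapiro. A sketch on synthetic, roughly normal data:

```python
import numpy as np
from scipy import stats

# Synthetic sample drawn from a normal distribution (hypothetical data)
rng = np.random.default_rng(0)
sample = rng.normal(10, 2, 100)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
stat, p = stats.shapiro(sample)

# A large p-value means we fail to reject normality at the 5% level
normal = p > 0.05
print(normal)
```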

4.2 Method#

Hypothesis testing steps for each parameter:

  1. Hypotheses:

    • \(H_0\): No lockdown impact

    • \(H_1\): Lockdown impact present

  2. Significance Level: Set at 0.05

  3. Data Preparation: Organize pre and post data

  4. Test Statistic Selection: Choose t-test or Mann-Whitney U test for mean comparison

  5. Testing: Use scipy.stats to calculate p-values

  6. Interpretation: Reject \(H_0\) if p-value < 0.05, indicating significant lockdown impact on Miami’s air quality

Perform a t-test (or Mann-Whitney U test) to assess the statistical significance of the mean differences.

# Air quality parameters
parameters = ['PM2.5', 'PM10',  'NO2', 'SO2',  'CO', 'O3']

# Loop over all parameters in the pre and post data


    # Step 3: Data preparation

    # Step 4: Select test
    # "Student t-test" or "Mann-Whitney U test"

    # Step 5: Perform test
        # Perform a t-test for the current parameter

        # Perform a Mann-Whitney U test


    # Step 6: Check if the result is significant
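The six steps above can be sketched as follows. This example uses synthetic data in place of pre_values and post_values (and only two parameters), so the numbers are hypothetical:

```python
import numpy as np
from scipy import stats

parameters = ['PM2.5', 'PM10']  # shortened list for this sketch

# Hypothetical stand-ins for pre_values and post_values (365 days, 2 parameters)
rng = np.random.default_rng(42)
pre_values = rng.normal(10, 2, size=(365, 2))
post_values = rng.normal(9.5, 2, size=(365, 2))

# Step 2: significance level
alpha = 0.05

for i, parameter in enumerate(parameters):
    # Step 3: data preparation
    pre, post = pre_values[:, i], post_values[:, i]

    # Step 4: select test based on normality of both samples (Shapiro-Wilk)
    normal = stats.shapiro(pre)[1] > alpha and stats.shapiro(post)[1] > alpha

    # Step 5: perform the chosen test
    if normal:
        stat, p = stats.ttest_ind(pre, post, equal_var=False)
        test_name = "Student t-test"
    else:
        stat, p = stats.mannwhitneyu(pre, post)
        test_name = "Mann-Whitney U test"

    # Step 6: check if the result is significant
    verdict = "significant" if p < alpha else "not significant"
    print(f"{parameter}: {test_name}, p = {p:.4f} ({verdict})")
```

With the real arrays, only the parameter list and the stand-in data would change.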

Info

In the example above, we utilized SciPy, a Python library for scientific computing that encompasses tools for optimization, integration, interpolation, linear algebra, statistics, and more. While we have not covered SciPy or statistical tests in this class, a key learning objective is to independently explore and apply new libraries and methods.

4.3 Results#

Present your main findings here.

4.4 Discussion#

Critically discuss your hypothesis testing results here.

Instruction:

  • Discuss p-values showing understanding of the statistical method used

  • Explain expected and unexpected results with references to relevant studies

  • Discuss analysis limitations and suggest areas for further analysis to enhance understanding

  • Ensure a well-organized, coherent response within 150-300 words

  • Use appropriate citations to support arguments

The learning objective of this question is to enhance your critical thinking in data analysis, your use of references to support arguments, and your clarity in presenting findings. This will help you prepare for your final project.