Exercise 7 NumPy: Air quality data analysis#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

1. Problem statement#

This assignment involves analyzing air quality data for the Miami core-based statistical area (CBSA) to evaluate the impact of COVID-19 lockdown orders on air quality in Miami. The analysis is based on EPA daily air quality data for PM2.5, PM10, NO2, SO2, CO, and O3 from April 1, 2019 to March 31, 2021, covering one year before and one year after the lockdown order of April 1, 2020. The tasks include:

  • Calculating the mean concentrations for these 6 air quality parameters before and after the lockdown order.

  • Conducting hypothesis testing to determine if there is a statistically significant difference in the pre and post data.

To complete this exercise, you will apply the skills learned in the Data Science Workflow and NumPy lessons.

2. Prepare data#

2.1 Read data as NumPy array#

Read the data from ‘aqi_data_Miami.csv’ into a NumPy array named data.

# Read the CSV file into a NumPy mixed type array with dtype object
# Column names: [datetime-empty], 'PM2.5', 'PM10',  'NO2', 'SO2',  'CO', 'O3'
data = pd.read_csv('data/aqi_data_Miami.csv').to_numpy()

# Convert the first column in the NumPy array to datetime format
data[:, 0] = pd.to_datetime(data[:, 0])

# Display data
print("data:", data.dtype, data.shape)
print(data[0])
data: object (1096, 7)
[Timestamp('2019-01-01 00:00:00') 30.477083 18.0 8.416667 0.6625 0.477667
 0.035412]
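
For reference, a similar mixed-type read can be done without pandas using np.genfromtxt. This is a minimal sketch on a small inline sample (hypothetical values, same column layout as the CSV), not the actual file:

```python
import io
import numpy as np

# Small inline sample mimicking the layout of 'aqi_data_Miami.csv'
# (values are hypothetical, for illustration only)
sample = io.StringIO(
    ",PM2.5,PM10,NO2,SO2,CO,O3\n"
    "2019-01-01,30.48,18.0,8.42,0.66,0.48,0.035\n"
    "2019-01-02,20.10,15.5,7.90,0.60,0.45,0.030\n"
)

# dtype=None lets genfromtxt infer a structured dtype (string date + floats)
raw = np.genfromtxt(sample, delimiter=',', skip_header=1, dtype=None, encoding='utf-8')

rows = raw.tolist()  # list of (date_str, v1, ..., v6) tuples
dates = np.array([r[0] for r in rows], dtype='datetime64[D]')
values = np.array([r[1:] for r in rows], dtype=np.float64)
print(dates.dtype, dates.shape, values.dtype, values.shape)
```

The pandas route above is usually more convenient for real CSV files; genfromtxt avoids the pandas dependency at the cost of manual column handling.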

2.2 Extract dates and values#

Extract the dates and values from the NumPy array data as separate NumPy arrays dates and values.

# Extract the first column as datetime (you can do it in one line)
dates = data[:, 0]  # object
dates = np.asarray(dates, dtype=np.datetime64)  # Convert object to datetime
print("dates:", dates.dtype, dates.shape)

# Extract numerical values as numbers (you can do it in one line)
values = data[:, 1:] # object
values = np.asarray(values, dtype=np.float64) # Convert object to float
print("values:", values.dtype, values.shape)
dates: datetime64[us] (1096,)
values: float64 (1096, 6)

2.3 Remove rows with NaNs#

Remove rows with NaN values from data, dates, and values.

#Original dataset information
print("Original data shape: data", data.shape, 
      "dates", dates.shape,"values", values.shape)

#Remove NaNs
nan_rows = np.any(np.isnan(values), axis=1) # Find rows with NaN values in float columns
cleaned_data = data[~nan_rows] # Remove rows with NaN values
cleaned_dates = dates[~nan_rows] # Remove rows with NaN values
cleaned_values = values[~nan_rows] # Remove rows with NaN values

#Cleaned dataset information
print("Cleaned data shape: data", cleaned_data.shape, 
      "dates", cleaned_dates.shape,"values", cleaned_values.shape)
Original data shape: data (1096, 7) dates (1096,) values (1096, 6)
Cleaned data shape: data (1088, 7) dates (1088,) values (1088, 6)
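
On a toy array, the row-wise NaN mask used above works like this:

```python
import numpy as np

# Toy example of the row-wise NaN masking used above
toy = np.array([[1.0, 2.0],
                [np.nan, 3.0],
                [4.0, 5.0]])

nan_rows = np.any(np.isnan(toy), axis=1)  # True where a row contains any NaN
print(nan_rows)         # [False  True False]
print(toy[~nan_rows])   # keeps only the fully valid rows
```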

2.4 Extracting pre and post-lockdown data#

Extract data based on the lockdown order on April 1, 2020:

  • Extract data one year before lockdown as pre_dates and pre_values,

  • Extract data one year after lockdown as post_dates and post_values,

  • Extract data one year before and after lockdown combined as pre_post_dates and pre_post_values.

# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')

#Cleaned dataset information
print("Cleaned data shape: data", cleaned_data.shape, 
      "dates", cleaned_dates.shape,"values", cleaned_values.shape,"\n")

# Filter the NumPy array based on the date range for pre-covid19 data
pre_mask = (cleaned_dates >= one_year_before) & (cleaned_dates < lockdown_start)
pre_data = cleaned_data[pre_mask] 
pre_dates = cleaned_dates[pre_mask] 
pre_values = cleaned_values[pre_mask] 
print("Pre_data: pre_data", pre_data.shape, 
      "pre_dates", pre_dates.shape,"pre_values", pre_values.shape)
print(pre_dates[0],pre_dates[-1],"\n")

# Filter the NumPy array based on the date range for post-covid19 data
post_mask = (cleaned_dates >= lockdown_start) & (cleaned_dates < one_year_after)
post_data = cleaned_data[post_mask] 
post_dates = cleaned_dates[post_mask] 
post_values = cleaned_values[post_mask] 
print("Post_data: post_data", post_data.shape, 
      "post_dates", post_dates.shape,"post_values", post_values.shape)
print(post_dates[0],post_dates[-1],"\n")

# Filter the NumPy array based on the date range for pre-post-covid19 data
pre_post_mask = (cleaned_dates >= one_year_before) & (cleaned_dates < one_year_after)
pre_post_data = cleaned_data[pre_post_mask] 
pre_post_dates = cleaned_dates[pre_post_mask] 
pre_post_values = cleaned_values[pre_post_mask] 
print("Pre_post_data: pre_post_data", pre_post_data.shape, 
      "pre_post_dates", pre_post_dates.shape,"pre_post_values", pre_post_values.shape)
print(pre_post_dates[0],pre_post_dates[-1])
Cleaned data shape: data (1088, 7) dates (1088,) values (1088, 6) 

Pre_data: pre_data (358, 7) pre_dates (358,) pre_values (358, 6)
2019-04-01T00:00:00.000000 2020-03-31T00:00:00.000000 

Post_data: post_data (365, 7) post_dates (365,) post_values (365, 6)
2020-04-01T00:00:00.000000 2021-03-31T00:00:00.000000 

Pre_post_data: pre_post_data (723, 7) pre_post_dates (723,) pre_post_values (723, 6)
2019-04-01T00:00:00.000000 2021-03-31T00:00:00.000000
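
The date-range filtering relies on element-wise comparison between a datetime64 array and a Timestamp; a toy example:

```python
import numpy as np
import pandas as pd

# Toy example of date-range masking with datetime64 comparisons
toy_dates = np.array(['2020-03-30', '2020-04-01', '2020-04-02'], dtype='datetime64[D]')
lockdown_start = pd.Timestamp('2020-04-01')

mask = toy_dates >= lockdown_start  # element-wise boolean comparison
print(mask)             # [False  True  True]
print(toy_dates[mask])  # dates on or after the lockdown
```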

2.5 Save NumPy arrays of dates and values (extra)#

Let us save our pre-processed data

  • pre_dates one year datetime data before lockdown

  • pre_values one year concentration data before lockdown for PM2.5, PM10, NO2, SO2, CO, O3

  • post_dates one year datetime data after lockdown

  • post_values one year concentration data after lockdown for PM2.5, PM10, NO2, SO2, CO, O3

because we will use them in the Matplotlib exercise. We do not need the allow_pickle=True argument because these arrays do not contain mixed-type (object) data.

# Save the data arrays to a file 
np.save('data/pre_dates.npy', pre_dates)
np.save('data/post_dates.npy', post_dates)
np.save('data/pre_values.npy', pre_values)
np.save('data/post_values.npy', post_values)

Let us double-check that we can load and use the saved data.

# Load the array from file
pre_dates = np.load('data/pre_dates.npy')
pre_values = np.load('data/pre_values.npy')
print("pre_dates:", pre_dates.dtype, pre_dates.shape,
      "pre_values:", pre_values.dtype, pre_values.shape)
pre_dates: datetime64[us] (358,) pre_values: float64 (358, 6)
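
An alternative worth knowing: np.savez bundles several related arrays into one .npz archive with named keys. A small sketch with synthetic stand-in arrays (the file name here is illustrative, not part of the exercise):

```python
import numpy as np

# Synthetic stand-ins for the real pre-lockdown arrays
pre_dates = np.array(['2019-04-01', '2019-04-02'], dtype='datetime64[D]')
pre_values = np.array([[9.3, 21.7], [9.9, 23.8]])

# Bundle both arrays into one archive with named keys
np.savez('pre_lockdown.npz', dates=pre_dates, values=pre_values)

loaded = np.load('pre_lockdown.npz')
print(loaded['dates'].shape, loaded['values'].shape)  # (2,) (2, 2)
```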

2.6 Additional information#

Here is additional information that can be helpful to our analysis. We can refer to EPA document EPA 454/B-18-007 for concentration breakpoints indicating unhealthy levels for sensitive groups.

# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')

# Air quality parameter information
parameters = ['PM2.5', 'PM10',  'NO2', 'SO2',  'CO',  'O3']
units      = ['µg/m³', 'µg/m³', 'ppb', 'ppb', 'ppm', 'ppm']
limits     = [     35,     155,   100,    50,     9, 0.07]  # Unhealthy levels for sensitive groups

3. Mean values for each parameter before and after lockdown#

Before proceeding with hypothesis testing, let’s analyze the mean values of each air quality parameter before and after the lockdown.

  • For a simple analysis, we can compare the mean concentrations before and after the lockdown for each parameter.

  • For a more sophisticated analysis, we can use an Air Quality Index (AQI) calculation method to calculate the AQI for each parameter, and then compare the mean AQI values before and after the lockdown.

Let us try both analyses.

3.1 Simple approach#

We could calculate the Air Quality Index (AQI) for PM2.5, PM10, etc. using an AQI calculation method, which can be complex. For simplicity, let’s just compare the mean pre- and post-lockdown concentrations for each parameter.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

# Calculate mean concentrations for pre-lockdown and post-lockdown
pre_mean = np.mean(pre_values, axis=0)
post_mean = np.mean(post_values, axis=0)

# Display results in a table form with `parameters` as index,
# and first and second column values are `pre_mean` and `post_mean` 
df = pd.DataFrame({ 'Unit': units,
                    'Pre Mean': np.round(pre_mean,2), 
                   'Post Mean': np.round(post_mean,2), 
                  'Healthy limit': limits},
                  index=parameters)

df
Unit Pre Mean Post Mean Healthy limit
PM2.5 µg/m³ 9.28 9.94 35.00
PM10 µg/m³ 21.69 23.81 155.00
NO2 ppb 15.01 13.12 100.00
SO2 ppb 1.73 0.38 50.00
CO ppm 0.47 0.48 9.00
O3 ppm 0.03 0.03 0.07

The pre- and post-lockdown means seem similar. Let us confirm this with some statistical analysis. Before we do, let us try a simple version of the AQI calculation method.

3.2 AQI approach (optional)#

3.2.1 Calculate AQI values#

We can compile concentration breakpoints and their corresponding AQI values for various pollutants (EPA 454/B-18-007). Here is an example:

Parameter   Unit    Good           Moderate       Unhealthy for      Unhealthy
                    AQI (0-50)     AQI (51-100)   Sensitive Groups   AQI (151-200)
                                                  AQI (101-150)
PM2.5       µg/m³   0.0 - 12.0     12.1 - 35.4    35.5 - 55.4        55.5 - 150.4
PM10        µg/m³   0 - 54         55 - 154       155 - 254          255 - 354
NO2         ppb     0 - 40         41 - 100       101 - 200          201 - 250
SO2         ppb     0 - 20         21 - 50        51 - 100           101 - 304
CO          ppm     0.0 - 4.4      4.5 - 9.4      9.5 - 12.4         12.5 - 15.4
O3          ppm     0.000 - 0.054  0.055 - 0.07   0.071 - 0.085      0.086 - 0.115

The “Good” category ranges from 0 to 50 on the AQI, indicating satisfactory air quality. “Moderate” falls between 51 and 100, generally acceptable but may affect sensitive groups. “Unhealthy for Sensitive Groups” is 101-150, suggesting limits on outdoor activities for at-risk individuals. “Unhealthy” is 151-200, signaling adverse health effects for everyone, requiring outdoor limits.

We can create a dictionary for each pollutant containing its respective concentration breakpoints and corresponding AQI values.

#Dictionary for breakpoints of 6 parameters
pollutants = {
    'PM2.5': {'breakpoints': [0, 12.1, 35.5, 55.5, 150.4]},
    'PM10': {'breakpoints': [0, 55, 155, 255, 354]},
    'NO2': {'breakpoints': [0, 41, 101, 201, 250]},
    'SO2': {'breakpoints': [0, 21, 51, 101, 304]},
    'CO': {'breakpoints': [0, 4.5, 9.5, 12.5, 15.4]},
    'O3': {'breakpoints': [0, 0.055, 0.071, 0.086, 0.115]}}

With linear interpolation we can calculate the AQI for a given concentration.

def calculate_aqi(pollutant, concentration):
    poll_data = pollutants[pollutant]
    bp = poll_data['breakpoints']
    aqi = [0, 51, 101, 151, 200]

    if concentration <= bp[1]:
        aqi_val = (aqi[1] - aqi[0]) / (bp[1] - bp[0]) * (concentration - bp[0]) + aqi[0]
    elif concentration <= bp[2]:
        aqi_val = (aqi[2] - aqi[1]) / (bp[2] - bp[1]) * (concentration - bp[1]) + aqi[1]
    elif concentration <= bp[3]:
        aqi_val = (aqi[3] - aqi[2]) / (bp[3] - bp[2]) * (concentration - bp[2]) + aqi[2]
    else:
        # Linearly extrapolates for concentrations above the last breakpoint
        aqi_val = (aqi[4] - aqi[3]) / (bp[4] - bp[3]) * (concentration - bp[3]) + aqi[3]

    return aqi_val
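
Since the breakpoints and AQI anchor values define a piecewise-linear mapping, the whole if/elif chain can be replaced by a single np.interp call. A sketch using the PM2.5 breakpoints (note that np.interp clamps concentrations above the last breakpoint to 200, whereas the else branch above extrapolates):

```python
import numpy as np

# Same PM2.5 breakpoints as in the `pollutants` dictionary above
breakpoints = [0, 12.1, 35.5, 55.5, 150.4]
aqi_anchors = [0, 51, 101, 151, 200]

def calculate_aqi_interp(concentration, bp=breakpoints, aqi=aqi_anchors):
    """Piecewise-linear AQI via np.interp (clamps beyond the last breakpoint)."""
    return float(np.interp(concentration, bp, aqi))

print(calculate_aqi_interp(6.05))  # 25.5 (halfway through the 'Good' range)
```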

Then we can call this function to calculate the AQI value for each sample.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

#Calculate aqi for pre and post data
for dataset in ['pre', 'post']:

    # Initiate variable aqi_data to store calculated AQI values
    if dataset == 'pre':
        aqi_data = pre_values.copy()
    elif dataset == 'post':
        aqi_data = post_values.copy()
    
    # Loop through each row and each element in the row to calculate AQI
    for row_index, row in enumerate(aqi_data):
        for col_index, element in enumerate(row):
            
            #Get parameter name
            parameter= parameters[col_index]
    
            # Pass parameter name and value to calculate_aqi
            # Update aqi_data with the calculated AQI for that parameter and date
            aqi_data[row_index, col_index] = calculate_aqi(parameter, element)

    #Display aqi
    print(dataset, aqi_data.dtype, aqi_data.shape)
    print(aqi_data[0])
    
    # Save aqi for pre and post
    if dataset == 'pre':
        pre_aqi = aqi_data
    elif dataset == 'post':
        post_aqi = aqi_data
pre float64 (358, 6)
[32.64505364 16.69090909 25.71768293  2.35918471  5.38333333 33.54501818]
post float64 (365, 6)
[45.88946281 22.25454545 11.16920773  2.01224386  2.94612267 42.0546    ]
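
Because np.interp accepts arrays, the nested Python loops can be vectorized into one call per column. A sketch with synthetic concentration data for two of the six parameters:

```python
import numpy as np

# Breakpoints for two of the six parameters (same as the dictionary above)
pollutants = {
    'PM2.5': {'breakpoints': [0, 12.1, 35.5, 55.5, 150.4]},
    'PM10':  {'breakpoints': [0, 55, 155, 255, 354]},
}
parameters = ['PM2.5', 'PM10']
aqi_anchors = [0, 51, 101, 151, 200]

# Synthetic concentrations: 3 days x 2 parameters
values = np.array([[30.5,  18.0],
                   [ 6.1,  60.0],
                   [40.0, 160.0]])

# One np.interp call per column replaces the element-wise double loop
aqi = np.column_stack([
    np.interp(values[:, j], pollutants[p]['breakpoints'], aqi_anchors)
    for j, p in enumerate(parameters)
])
print(aqi.shape)  # (3, 2)
```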

3.2.2 Save AQI values as NumPy arrays (extra)#

This step is not needed in this exercise, but we want to use this data in the Matplotlib exercise.

# Save the data arrays to a file 
np.save('data/pre_aqi.npy', pre_aqi)
np.save('data/post_aqi.npy', post_aqi)

Load saved data for double checking.

pre_aqi = np.load('data/pre_aqi.npy')
post_aqi = np.load('data/post_aqi.npy')
print("pre_aqi:", pre_aqi.dtype, pre_aqi.shape)
print("post_aqi:", post_aqi.dtype, post_aqi.shape)
pre_aqi: float64 (358, 6)
post_aqi: float64 (365, 6)

3.2.3 Calculate mean AQI values for each parameter#

We calculated the AQI for PM2.5, PM10, etc. Now let us compare the pre- and post-lockdown mean AQI for each parameter.

# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']

# Calculate mean AQI values for pre-lockdown and post-lockdown
# (use new names so the original AQI arrays are not overwritten)
pre_aqi_mean = np.mean(pre_aqi, axis=0)
post_aqi_mean = np.mean(post_aqi, axis=0)

# Display results in a table form with `parameters` as index,
# and `pre_aqi_mean` and `post_aqi_mean` as the two columns
df = pd.DataFrame({'Pre Mean AQI': np.round(pre_aqi_mean, 2), 
                   'Post Mean AQI': np.round(post_aqi_mean, 2)}, 
                  index=parameters)

df
Pre Mean AQI Post Mean AQI
PM2.5 38.13 40.37
PM10 20.09 21.85
NO2 18.67 16.31
SO2 4.21 0.92
CO 5.38 5.42
O3 29.46 28.74

The pre- and post-lockdown means seem similar. Let us confirm this with some statistical analysis.

4. Hypothesis testing#

4.1 Introduction#

We analyze one year of air quality data before and after the April 2020 lockdown order in Miami, Florida. The air quality parameters considered are PM2.5, PM10, NO2, SO2, CO, and O3. We employ hypothesis testing for the analysis. To learn about hypothesis testing, you can refer to this Statistical Vignettes presentation of Project EDDIE.

Hypothesis testing is a statistical method used to compare sample data and draw conclusions about a population. With air quality data before and after a COVID-19 lockdown, we are investigating differences in air quality parameters. We typically set up a null hypothesis \(H_0\) and an alternative hypothesis \(H_a\). For instance, for the PM2.5 concentration:

  • Null hypothesis \(H_0\): Mean PM2.5 before and after the lockdown are equal.

  • Alternative hypothesis \(H_a\): Mean PM2.5 before and after the lockdown differ.

Then, to determine whether there is a statistically significant difference between the mean PM2.5 concentrations before and after the lockdown, we can calculate the test statistic and p-value.

  • We choose a statistical test, such as a t-test or Mann-Whitney U test, based on the distribution of our data (e.g., whether it is normally distributed).

  • We set the significance level to 0.05, indicating that we are willing to accept a 5% chance of mistakenly rejecting the null hypothesis when it is actually true.

  • If the p-value is below 0.05, we reject the null hypothesis, indicating a significant difference in PM2.5 concentrations pre and post COVID-19 lockdown.

We can repeat the same for the other air quality parameters.

Info

The t-test is a statistical test used to compare the means of two independent groups, assuming that the data is normally distributed and the variances of the two groups are equal. We can relax the equal-variance condition: for example, the t-test function ttest_ind in the SciPy library accepts equal_var=False (Welch’s t-test). For non-normally distributed data or small samples, the Mann-Whitney U test compares two groups based on ranks, checking whether one group tends to have higher values.
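
One way to decide between the two tests is a normality check, e.g. the Shapiro-Wilk test in SciPy. A sketch on synthetic, deliberately skewed data (not the air quality data itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed_sample = rng.exponential(scale=2.0, size=300)  # synthetic, right-skewed data

# Shapiro-Wilk: null hypothesis is that the sample comes from a normal distribution
stat, p = stats.shapiro(skewed_sample)

# A small p-value flags non-normality, pointing to the rank-based test
test_name = "Student t-test" if p >= 0.05 else "Mann-Whitney U test"
print(f"Shapiro-Wilk p-value {p:.2e} -> use {test_name}")
```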

4.2 Method#

Hypothesis testing steps for each parameter:

  1. Hypotheses:

    • \(H_0\): No lockdown impact

    • \(H_a\): Lockdown impact present

  2. Significance Level: Set at 0.05

  3. Data Preparation: Organize pre and post data

  4. Test Statistic Selection: Choose t-test or Mann-Whitney U test for mean comparison

  5. Testing: Use scipy.stats to calculate p-values

  6. Interpretation: Reject \(H_0\) if p-value < 0.05, indicating significant lockdown impact on Miami’s air quality

Performing a statistical test to assess the significance of the mean differences.

# Air quality parameters
parameters = ['PM2.5', 'PM10',  'NO2', 'SO2',  'CO', 'O3']

# Loop over all parameters in pre_values and post_values
for index, parameter in enumerate(parameters):

    # Step 3 Data preparation
    pre_parameter = pre_values[:,index]
    pre_parameter_mean = np.mean(pre_parameter)
    post_parameter = post_values[:,index]
    post_parameter_mean = np.mean(post_parameter)
    
    # Step 4 Select test
    
    # "Student t-test", "Mann-Whitney U test", "ANOVA", etc.
    Select_Test = "ANOVA"

    # Step 5 Perform test
    if Select_Test == "Student t-test":
        # Perform a t-test for PM2.5
        statistic, p_value = stats.ttest_ind(post_parameter, pre_parameter, equal_var=False)
   
    elif Select_Test == "Mann-Whitney U test":
        # Perform Mann-Whitney U test
        statistic, p_value = stats.mannwhitneyu(post_parameter, pre_parameter, alternative='two-sided')
    elif Select_Test == "ANOVA":
        # Perform ANOVA (equivalent to a t-test for two groups)
        statistic, p_value = stats.f_oneway(post_parameter, pre_parameter)

    

    # Step 6 Check if the result is significant
    check = ''
    if p_value >= 0.05:
        check = 'not '

    print(f"Change in {parameter} mean concentration "
          f"(pre {pre_parameter_mean:.2f} vs post {post_parameter_mean:.2f}) "
          f"is {check}statistically significant (p-value {p_value:.3f})")
Change in PM2.5 mean concentration (pre 9.28 vs post 9.94) is statistically significant (p-value 0.017)
Change in PM10 mean concentration (pre 21.69 vs post 23.81) is statistically significant (p-value 0.008)
Change in NO2 mean concentration (pre 15.01 vs post 13.12) is statistically significant (p-value 0.000)
Change in SO2 mean concentration (pre 1.73 vs post 0.38) is statistically significant (p-value 0.000)
Change in CO mean concentration (pre 0.47 vs post 0.48) is not statistically significant (p-value 0.708)
Change in O3 mean concentration (pre 0.03 vs post 0.03) is not statistically significant (p-value 0.250)

Info

In the example above, we utilized SciPy, a Python library for scientific computing that encompasses tools for optimization, integration, interpolation, linear algebra, statistics, and more. While we have not covered SciPy or statistical tests in this class, a key learning objective is to independently explore and apply new libraries and methods.

4.3 Results#

Present your main findings here.

The lockdown had varied effects on different pollutants in Miami: some showed improvement while others remained relatively stable, as follows.

  • During the lockdown, PM2.5 and PM10 concentrations increased, indicating worsened air quality. Mean PM2.5 rose from 9.28 µg/m³ to 9.94 µg/m³, and mean PM10 from 21.69 µg/m³ to 23.81 µg/m³. Both changes were statistically significant (p = 0.017 and 0.008, respectively), suggesting factors beyond random chance influenced the increases.

  • Conversely, NO2 and SO2 concentrations decreased during the lockdown, signifying improved air quality for these pollutants. Mean NO2 dropped from 15.01 ppb to 13.12 ppb, and mean SO2 decreased from 1.73 ppb to 0.38 ppb. Both changes were statistically significant (p < 0.001), indicating real improvement.

  • Changes in CO and O3 concentrations were not statistically significant, indicating no significant air quality changes for these pollutants during the lockdown.

These key findings suggest that the air quality in Miami, Florida, did not universally improve during the lockdown, unlike what we might expect.

4.4 Discussion#

Critically discuss your hypothesis testing results here.

Instruction:

  • Discuss p-values showing understanding of the statistical method used

  • Explain expected and unexpected results with references to relevant studies

  • Discuss analysis limitations and suggest areas for further analysis to enhance understanding

  • Ensure a well-organized, coherent response within 150-300 words

  • Use appropriate citations to support arguments

The learning objective of this question is to enhance your critical thinking in data analysis, your use of references to support arguments, and your clarity in presenting findings. This will help you prepare for your final project.

While the air quality for some pollutants did not improve, the levels remained within acceptable limits, suggesting that Miami’s air quality was already relatively good prior to the lockdown, limiting the potential for further improvements. The reasons for the varying pollutant levels during the lockdown could be multifaceted:

  • Increased PM2.5 and PM10 may be due to factors unaffected by the lockdown, such as construction activities, meteorological factors, dust, biomass burning, or sea spray (Benchari et al., 2021; Hassan et al., 2022).

  • Decreased NO2 and SO2 could result from reduced vehicular emissions, industrial activities, and shipping (Otmani et al., 2020).

  • Unchanged CO levels may reflect an offset between reduced vehicular emissions and increased industrial/commercial emissions.

  • Unchanged O3 levels could be due to reductions in NO2 and other precursors offsetting meteorological conditions favoring ozone formation, such as higher temperatures and sunlight (Vanvir et al., 2021).

Without specific local data, pinpointing the exact causes is challenging. While the data suggests no improvement in PM2.5 and PM10 levels, further analysis with local information is needed to understand the complex interplay between human activities, meteorological conditions, and air quality.