Exercise 7 NumPy: Air quality data analysis#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
1. Problem statement#
This assignment involves analyzing air quality data for the Miami core-based statistical area (CBSA) to evaluate the impact of COVID-19 lockdown orders on air quality in Miami. The analysis is based on EPA daily air quality data for PM2.5, PM10, NO2, SO2, CO, and O3 from April 1, 2019 to March 31, 2021, covering one year before and after the lockdown order on April 1, 2020. The tasks include:
Calculating the mean concentrations for these 6 air quality parameters before and after the lockdown order.
Conducting hypothesis testing to determine if there is a statistically significant difference in the pre and post data.
To complete this exercise, you will apply the skills learned in the Data Science Workflow and NumPy lessons.
2. Prepare data#
2.1 Read data as NumPy array#
Read data from `aqi_data_Miami.csv` as NumPy array `data`.
# Read the CSV file into a NumPy mixed-type array with dtype object
# Column names: [datetime-empty], 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3'
data = pd.read_csv('data/aqi_data_Miami.csv').to_numpy()

# Convert the first column in the NumPy array to datetime format
data[:, 0] = pd.to_datetime(data[:, 0])

# Display data
data
2.2 Extract dates and values#
Extract dates and values from NumPy array `data` as NumPy arrays `dates` and `values`.
# Extracting first column as datetime (you can do it in one line)
# Extract numerical values as numbers (you can do it in one line)
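A minimal sketch of these one-liners, assuming the datetime values sit in the first column and the six concentrations in the remaining columns; converting to datetime64 also keeps the arrays pickle-free for the `np.save` step in section 2.5:

dates = pd.to_datetime(data[:, 0]).to_numpy()   # datetime64[ns] array
values = data[:, 1:].astype(float)              # (n, 6) float array
print("dates:", dates.dtype, dates.shape, "values:", values.dtype, values.shape)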
2.3 Remove rows with NaNs#
Remove rows with NaN values from `data`, `dates`, and `values`.
# Original dataset information
# Remove NaNs
# Cleaned dataset information
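One possible completion, assuming the `dates` and `values` arrays from section 2.2:

# Original dataset information
print("Original:", values.shape)
# Keep only rows where none of the six parameters is NaN
mask = ~np.isnan(values).any(axis=1)
data, dates, values = data[mask], dates[mask], values[mask]
# Cleaned dataset information
print("Cleaned:", values.shape)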
2.4 Extracting pre and post-lockdown data#
Extract data based on the lockdown order on April 1, 2020:
- Extract data one year before lockdown as `pre_dates` and `pre_values`,
- Extract data one year after lockdown as `post_dates` and `post_values`,
- Extract data one year before and after lockdown combined as `pre_post_dates` and `pre_post_values`.
# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')
# Filter the NumPy array based on the date range for pre-post-covid19 data
# Filter the NumPy array based on the date range for pre-covid19 data
# Filter the NumPy array based on the date range for post-covid19 data
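A hedged sketch of the filtering, assuming `dates` is the datetime64 array from section 2.2 (elementwise comparisons with `pd.Timestamp` objects yield boolean masks):

# Boolean masks for the two one-year windows
pre_mask = (dates >= one_year_before) & (dates < lockdown_start)
post_mask = (dates >= lockdown_start) & (dates < one_year_after)

# Pre-covid19 data
pre_dates, pre_values = dates[pre_mask], values[pre_mask]
# Post-covid19 data
post_dates, post_values = dates[post_mask], values[post_mask]
# Combined pre-post-covid19 data
pre_post_dates = dates[pre_mask | post_mask]
pre_post_values = values[pre_mask | post_mask]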
2.5 Save NumPy arrays of dates and values (extra)#
Let us save our pre-processed data:

- `pre_dates`: one year of datetime data before lockdown
- `pre_values`: one year of concentration data before lockdown for PM2.5, PM10, NO2, SO2, CO, O3
- `post_dates`: one year of datetime data after lockdown
- `post_values`: one year of concentration data after lockdown for PM2.5, PM10, NO2, SO2, CO, O3

because we will use them in the Matplotlib exercise. We do not need the `allow_pickle=True` argument because these arrays do not contain mixed-type data.
# Save the data arrays to a file
np.save('data/pre_dates.npy', pre_dates)
np.save('data/post_dates.npy', post_dates)
np.save('data/pre_values.npy', pre_values)
np.save('data/post_values.npy', post_values)
Let us double check that we can load and use the saved data
# Load the array from file
pre_dates = np.load('data/pre_dates.npy')
pre_values = np.load('data/pre_values.npy')
print("pre_dates:", pre_dates.dtype, pre_dates.shape,
"pre_values:", pre_values.dtype, pre_values.shape)
2.6 Additional information#
Here is additional information that can be helpful to our analysis. We can refer to EPA document EPA 454/B-18-007 for concentration breakpoints indicating unhealthy levels for sensitive groups.
# Define date ranges
lockdown_start = pd.Timestamp('2020-04-01')
one_year_before = pd.Timestamp('2019-04-01')
one_year_after = pd.Timestamp('2021-04-01')
# Air quality parameter information
parameters = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
units = ['µg/m³', 'µg/m³', 'ppb', 'ppb', 'ppm', 'ppm']
limits = [35, 155, 100, 50, 9, 0.07]  # Unhealthy levels for sensitive groups
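As a quick illustration of how this metadata can be used (a sketch assuming the `pre_values` array from section 2.4), we can count how many pre-lockdown days exceeded each sensitive-group limit:

# Count pre-lockdown days above each sensitive-group limit
exceedances = (pre_values > np.array(limits)).sum(axis=0)
for p, u, n in zip(parameters, units, exceedances):
    print(f"{p}: {n} days above limit ({u})")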
3. Mean values for each parameter before and after lockdown#
Before proceeding with hypothesis testing, let’s analyze the mean values of each air quality parameter before and after the lockdown.
For a simple analysis, we can compare the mean concentrations before and after the lockdown for each parameter.
For a more sophisticated analysis, we can use an Air Quality Index (AQI) calculation method to compute the AQI for each air quality parameter, and then calculate the mean AQI values before and after the lockdown for each parameter.
Let us try both analyses.
3.1 Simple approach#
We can calculate the Air Quality Index (AQI) for PM2.5, PM10, etc. using an AQI calculation method, which can be complex. For simplicity, let's just compare the mean concentrations of pre- and post-lockdown data for each parameter.
# Six air quality parameters
parameters = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
# Calculate mean concentrations for pre-lockdown and post-lockdown
# Display results in a table form with `parameters` as index,
# and first and second column values are `pre_mean` and `post_mean`
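A possible completion of this cell, assuming the `pre_values` and `post_values` arrays from section 2.4:

# Mean concentration of each parameter over each one-year window
pre_mean = np.mean(pre_values, axis=0)
post_mean = np.mean(post_values, axis=0)

# Table with parameters as index and the two means as columns
df = pd.DataFrame({'pre_mean': np.round(pre_mean, 2),
                   'post_mean': np.round(post_mean, 2)},
                  index=parameters)
df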
The pre and post means seem similar. Let us confirm this with some statistical analysis. Before we do that, let us try a simple version of the AQI calculation method.
3.2 AQI approach (optional)#
3.2.1 Calculate AQI values#
We can compile concentration breakpoints and their corresponding AQI values for various pollutants (EPA 454/B-18-007). Here is an example:
| Parameter | Unit | Good | Moderate | Unhealthy for Sensitive Groups | Unhealthy |
|---|---|---|---|---|---|
| PM2.5 | µg/m³ | 0.0 - 12.0 | 12.1 - 35.4 | 35.5 - 55.4 | 55.5 - 150.4 |
| PM10 | µg/m³ | 0 - 54 | 55 - 154 | 155 - 254 | 255 - 354 |
| NO2 | ppb | 0 - 40 | 41 - 100 | 101 - 200 | 201 - 250 |
| SO2 | ppb | 0 - 20 | 21 - 50 | 51 - 100 | 101 - 304 |
| CO | ppm | 0.0 - 4.4 | 4.5 - 9.4 | 9.5 - 12.4 | 12.5 - 15.4 |
| O3 | ppm | 0.000 - 0.054 | 0.055 - 0.07 | 0.071 - 0.085 | 0.086 - 0.115 |
The “Good” category ranges from 0 to 50 on the AQI, indicating satisfactory air quality. “Moderate” falls between 51 and 100, generally acceptable but may affect sensitive groups. “Unhealthy for Sensitive Groups” is 101-150, suggesting limits on outdoor activities for at-risk individuals. “Unhealthy” is 151-200, signaling adverse health effects for everyone, requiring outdoor limits.
We can create a dictionary for each pollutant containing its respective concentration breakpoints and corresponding AQI values.
#Dictionary for breakpoints of 6 parameters
pollutants = {
'PM2.5': {'breakpoints': [0, 12.1, 35.5, 55.5, 150.4]},
'PM10': {'breakpoints': [0, 55, 155, 255, 354]},
'NO2': {'breakpoints': [0, 41, 101, 201, 250]},
'SO2': {'breakpoints': [0, 21, 51, 101, 304]},
'CO': {'breakpoints': [0, 4.5, 9.5, 12.5, 15.4]},
'O3': {'breakpoints': [0, 0.055, 0.071, 0.086, 0.115]}}
With linear interpolation we can calculate the AQI for a given concentration.
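The calculation follows the standard EPA linear interpolation formula:

\[
\mathrm{AQI} = \frac{I_{\mathrm{hi}} - I_{\mathrm{lo}}}{BP_{\mathrm{hi}} - BP_{\mathrm{lo}}}\,(C - BP_{\mathrm{lo}}) + I_{\mathrm{lo}}
\]

where \(C\) is the measured concentration, \(BP_{\mathrm{lo}}\) and \(BP_{\mathrm{hi}}\) are the breakpoints that bracket \(C\), and \(I_{\mathrm{lo}}\) and \(I_{\mathrm{hi}}\) are the AQI values corresponding to those breakpoints.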
def calculate_aqi(pollutant, concentration):
    # Concentration breakpoints for this pollutant
    poll_data = pollutants[pollutant]
    bp = poll_data['breakpoints']
    # AQI values corresponding to the concentration breakpoints
    aqi = [0, 51, 101, 151, 200]
    # Linearly interpolate within the segment that brackets the concentration
    if concentration <= bp[1]:
        aqi_val = (aqi[1] - aqi[0]) / (bp[1] - bp[0]) * (concentration - bp[0]) + aqi[0]
    elif concentration <= bp[2]:
        aqi_val = (aqi[2] - aqi[1]) / (bp[2] - bp[1]) * (concentration - bp[1]) + aqi[1]
    elif concentration <= bp[3]:
        aqi_val = (aqi[3] - aqi[2]) / (bp[3] - bp[2]) * (concentration - bp[2]) + aqi[2]
    else:
        aqi_val = (aqi[4] - aqi[3]) / (bp[4] - bp[3]) * (concentration - bp[3]) + aqi[3]
    return aqi_val
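As a quick sanity check, a PM2.5 concentration of 20 µg/m³ lies in the Moderate breakpoint range (12.1 - 35.4 µg/m³), so its AQI should fall between 51 and 100:

print(calculate_aqi('PM2.5', 20.0))   # ≈ 67.9, within the Moderate band (51-100)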
Then we can call this function to calculate the AQI value for each sample.
# Six air quality parameters
parameters =['PM2.5','PM10','NO2','SO2','CO','O3']
# Calculate AQI for pre and post data
for dataset in ['pre', 'post']:
    # Initialize variable aqi_data to store calculated AQI values
    if dataset == 'pre':
        aqi_data = pre_values.copy()
    elif dataset == 'post':
        aqi_data = post_values.copy()
    # Loop through each row and each element in the row to calculate AQI
    for row_index, row in enumerate(aqi_data):
        for col_index, element in enumerate(row):
            # Get parameter name
            parameter = parameters[col_index]
            # Pass parameter name and value to calculate_aqi and update
            # aqi_data with the calculated AQI for that parameter and date
            aqi_data[row_index, col_index] = calculate_aqi(parameter, element)
    # Display AQI
    print(dataset, aqi_data.dtype, aqi_data.shape)
    print(aqi_data[0])
    # Save AQI for pre and post
    if dataset == 'pre':
        pre_aqi = aqi_data
    elif dataset == 'post':
        post_aqi = aqi_data
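As an optional design note (not required by the exercise), the same piecewise-linear mapping can be vectorized with `np.interp`, which interpolates a whole column at once. Unlike `calculate_aqi`, values above the top breakpoint are clamped at 200 rather than extrapolated:

# Vectorized AQI for one column, e.g. PM2.5 (assumes pre_values from section 2.4)
bp = pollutants['PM2.5']['breakpoints']
aqi_breaks = [0, 51, 101, 151, 200]
pm25_aqi = np.interp(pre_values[:, 0], bp, aqi_breaks)
print(pm25_aqi[:5])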
3.2.2 Save AQI values as NumPy arrays (extra)#
This step is not needed in this exercise, but we want to use this data in the Matplotlib exercise.
# Save the data arrays to a file
np.save('data/pre_aqi.npy', pre_aqi)
np.save('data/post_aqi.npy', post_aqi)
Load saved data for double checking.
pre_aqi = np.load('data/pre_aqi.npy')
post_aqi = np.load('data/post_aqi.npy')
print("pre_aqi:", pre_aqi.dtype, pre_aqi.shape)
print("post_aqi:", post_aqi.dtype, pre_aqi.shape)
3.2.3 Calculate mean AQI values for each parameter#
We calculated the AQI for PM2.5, PM10, etc. Now let us compare the mean AQI for pre- and post-lockdown data for each parameter.
# Six air quality parameters
parameters = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
# Calculate mean AQI values for pre-lockdown and post-lockdown
pre_mean_aqi = np.mean(pre_aqi, axis=0)
post_mean_aqi = np.mean(post_aqi, axis=0)
# Display results in a table form with `parameters` as index,
# and first and second column values are `pre_mean` and `post_mean`
df = pd.DataFrame({'Pre Mean AQI': np.round(pre_mean_aqi, 2),
                   'Post Mean AQI': np.round(post_mean_aqi, 2)},
                  index=parameters)
df
The pre and post means seem similar. Let us confirm this with some statistical analysis.
4. Hypothesis testing#
4.1 Introduction#
We analyze one year of air quality data before and after the April 2020 lockdown order in Miami, Florida. The air quality parameters considered are PM2.5, PM10, NO2, SO2, CO, and O3. We employ hypothesis testing for the analysis. To learn about hypothesis testing, you can refer to this Statistical Vignettes presentation of Project EDDIE.
Hypothesis testing is a statistical method used to compare sample data and make conclusions about a population. With air quality data before and after a COVID-19 lockdown, we are investigating differences in air quality parameters. We typically set up null hypothesis \(H_0\) and alternative hypothesis \(H_a\). For instance, for PM2.5 concentration:
Null hypothesis \(H_0\): Mean PM2.5 before and after the lockdown are equal.
Alternative hypothesis \(H_a\): Mean PM2.5 before and after the lockdown differ.
Then, to determine whether there is a statistically significant difference between the mean PM2.5 concentrations before and after the lockdown, we can calculate the test statistic and p-value.
We choose a statistical test, such as a t-test or Mann-Whitney U test, based on the distribution of our data (e.g., whether it is normally distributed).
We set the significance level to 0.05, indicating that we are willing to accept a 5% chance of mistakenly rejecting the null hypothesis when it is actually true.
If the p-value is below 0.05, we reject the null hypothesis, indicating a significant difference in PM2.5 concentrations pre and post COVID-19 lockdown.
We can repeat the same for the other air quality parameters.
Info
The t-test is a statistical test used to compare the means of two independent groups, assuming that the data is normally distributed and the variances of the two groups are equal. We can relax the variance condition: for example, in the SciPy t-test function `ttest_ind`, we can set `equal_var=False`. For non-normally distributed data or small samples, the Mann-Whitney U test compares two groups based on ranks by checking whether one group tends to have higher values.

4.2 Method#
Hypothesis testing steps for each parameter:

1. Hypotheses:
   - \(H_0\): No lockdown impact
   - \(H_a\): Lockdown impact present
2. Significance level: set at 0.05
3. Data preparation: organize pre and post data
4. Test statistic selection: choose a t-test or Mann-Whitney U test for mean comparison
5. Testing: use `scipy.stats` to calculate p-values
6. Interpretation: reject \(H_0\) if the p-value < 0.05, indicating a significant lockdown impact on Miami's air quality
Performing a t-test to assess the statistical significance of mean differences.
# Air quality parameters
parameters = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
# Loop over all parameters in the pre and post data
# Step 3 Data preparation
# Step 4 Select test
#"Student t-test", "Mann-Whitney U test"
# Step 5 Perform test
# Perform a t-test for PM2.5
# Perform Mann-Whitney U test
# Step 6 Check if the result is significant
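One possible completion of this skeleton is sketched below; it assumes the `pre_values` and `post_values` arrays from section 2.4 and uses a Shapiro-Wilk normality check (an assumption, not prescribed above) to choose between the two tests:

alpha = 0.05
for i, parameter in enumerate(parameters):
    # Step 3: data preparation -- one column per parameter
    pre_sample = pre_values[:, i]
    post_sample = post_values[:, i]

    # Step 4: select test -- t-test if both samples look normal
    # (Shapiro-Wilk), otherwise Mann-Whitney U test
    _, p_pre = stats.shapiro(pre_sample)
    _, p_post = stats.shapiro(post_sample)

    # Step 5: perform the selected test
    if p_pre > alpha and p_post > alpha:
        test_name = "Student t-test"
        _, p_value = stats.ttest_ind(pre_sample, post_sample, equal_var=False)
    else:
        test_name = "Mann-Whitney U test"
        _, p_value = stats.mannwhitneyu(pre_sample, post_sample)

    # Step 6: check if the result is significant
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"{parameter}: {test_name}, p-value = {p_value:.4f} ({verdict})")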
Info
In the example above, we utilized SciPy, a Python library for scientific computing that encompasses tools for optimization, integration, interpolation, linear algebra, statistics, and more. While we have not covered SciPy or statistical tests in this class, a key learning objective is to independently explore and apply new libraries and methods.

4.3 Results#
Present your main findings here.
4.4 Discussion#
Critically discuss your hypothesis testing results here.
Instructions:
Discuss p-values showing understanding of the statistical method used
Explain expected and unexpected results with references to relevant studies
Discuss analysis limitations and suggest areas for further analysis to enhance understanding
Ensure a well-organized, coherent response within 150-300 words
Use appropriate citations to support arguments
The learning objective of this question is to enhance your critical thinking in data analysis, your use of references to support arguments, and your clarity in presenting findings. This will help you prepare for your final project.