Exercise 4 - Red tides data analysis#


Dataset#

Red tides are caused by Karenia brevis harmful algal blooms. For Karenia brevis cell count data, you can use the current dataset of physical and biological data collected along the Texas, Mississippi, Alabama, and Florida Gulf coasts in the Gulf of Mexico as part of the Harmful Algal BloomS Observing System from 1953-08-19 to 2023-07-06 (NCEI Accession 0120767). For direct data download, you can use this data link and this data documentation link. Alternatively, FWRI documents Karenia brevis blooms from 1953 to the present. That dataset has more than 200,000 records and is updated daily. To request this dataset, email HABdata@MyFWC.com. To learn more about this data, check the FWRI Red Tide Current Status page.

Study areas#

Conduct your analysis in Tampa Bay and the Charlotte Harbor estuary. For Tampa Bay, restrict the Karenia brevis measurements from 27° N to 28° N and from 85° W to the coast. For the Charlotte Harbor estuary, restrict the Karenia brevis measurements from 25.5° N to less than 27° N and from 85° W to the coast.

Problem statement#

For each of the two regions, plot the maximum Karenia brevis concentration (i.e., cellcount) per month for the whole study period and for the last 10 years.

Exercise#

Perform the 10 tasks below. Use of an LLM is permitted.

1. Import libraries#

import pandas as pd

2. Read data from csv file#

# Read the csv file with Pandas; low_memory=False avoids mixed-dtype warnings on this large file
df = pd.read_csv('habsos_20230714.csv', low_memory=False)

3. Filter to columns needed for analysis#

Select columns that include State ID, description, latitude, longitude, sample date, and cellcount.

#Display the column labels
df.columns
Index(['STATE_ID', 'DESCRIPTION', 'LATITUDE', 'LONGITUDE', 'SAMPLE_DATE',
       'SAMPLE_DEPTH', 'GENUS', 'SPECIES', 'CATEGORY', 'CELLCOUNT',
       'CELLCOUNT_UNIT', 'CELLCOUNT_QA', 'SALINITY', 'SALINITY_UNIT',
       'SALINITY_QA', 'WATER_TEMP', 'WATER_TEMP_UNIT', 'WATER_TEMP_QA',
       'WIND_DIR', 'WIND_DIR_UNIT', 'WIND_DIR_QA', 'WIND_SPEED',
       'WIND_SPEED_UNIT', 'WIND_SPEED_QA'],
      dtype='object')
# Select column names that you need for your analysis (e.g., 'LATITUDE', 'LONGITUDE', 'SAMPLE_DATE', 'CELLCOUNT', etc.) 
selected_columns = ['STATE_ID','DESCRIPTION','LATITUDE', 'LONGITUDE', 'SAMPLE_DATE', 'CELLCOUNT']

# Filter the DataFrame to include only the selected columns
df = df[selected_columns]

# Display your DataFrame
df
STATE_ID DESCRIPTION LATITUDE LONGITUDE SAMPLE_DATE CELLCOUNT
0 FL Bay Dock (Sarasota Bay) 27.331600 -82.577900 11/30/2022 18:50 388400000
1 FL Bay Dock (Sarasota Bay) 27.331600 -82.577900 12/9/1994 20:30 358000000
2 FL Siesta Key; 8 mi off mkr 3A at 270 degrees 27.277200 -82.722300 2/22/1996 0:00 197656000
3 TX Windsurfing Flats, Pinell Property, south Padr... 26.162420 -97.182580 10/10/2005 21:21 40000
4 FL Lido Key, 2.5 miles WSW of 27.300000 -82.620000 1/2/2019 20:30 186266667
... ... ... ... ... ... ...
205547 MS 5-9A 30.361850 -88.850067 8/25/2020 0:00 0
205548 MS Katrina Key 30.356869 -88.839592 9/30/2020 0:00 0
205549 MS Sample* Long Beach 30.346020 -89.141030 1/25/2021 0:00 0
205550 MS 10-Jun 30.343900 -88.602667 11/15/2021 0:00 0
205551 MS 10-Jun 30.343900 -88.602667 12/21/2021 0:00 0

205552 rows × 6 columns
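As an optional aside, you can also select the columns while reading the file. A minimal sketch, assuming the same file name as above:

# Alternative: read only the needed columns directly with usecols
df = pd.read_csv('habsos_20230714.csv',
                 usecols=['STATE_ID', 'DESCRIPTION', 'LATITUDE', 'LONGITUDE',
                          'SAMPLE_DATE', 'CELLCOUNT'],
                 low_memory=False)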

4. Assign region name based on latitude and longitude values#

First create a new column REGION with the default value ‘Other’. Then use a Boolean mask to change ‘Other’ in each row to ‘Tampa Bay’ or ‘Charlotte Harbor’ based on the latitude and longitude values. You can assign the values, i.e., ‘Tampa Bay’ or ‘Charlotte Harbor’, using, for example,

df.loc[mask, 'column_name'] = 'your_value'

or any other method of your choice. Do it with this method, and ask an LLM to suggest other methods to you; that is a very effective way of learning. One alternative an LLM might suggest is sketched after the output below.

# Create a new column 'region' with default value 'Other'
df['REGION'] = 'Other'

# Mask for dicing: Define mask for Tampa Bay region based on latitude and longitude values
tampa_bay_mask = (df['LATITUDE'] >= 27) & (df['LATITUDE'] <= 28) & (df['LONGITUDE'] >= -85)

# Assign value 'Tampa Bay' to rows matching Tampa Bay mask 
df.loc[tampa_bay_mask, 'REGION'] = 'Tampa Bay'

# Mask for dicing: Define mask for Charlotte Harbor estuary region based on latitude and longitude values
charlotte_harbor_mask = (df['LATITUDE'] >= 25.5) & (df['LATITUDE'] < 27) & (df['LONGITUDE'] >= -85)

# Assign value 'Charlotte Harbor' to rows matching Charlotte Harbor mask
df.loc[charlotte_harbor_mask, 'REGION'] = 'Charlotte Harbor'

#Display dataframe
df
STATE_ID DESCRIPTION LATITUDE LONGITUDE SAMPLE_DATE CELLCOUNT REGION
0 FL Bay Dock (Sarasota Bay) 27.331600 -82.577900 11/30/2022 18:50 388400000 Tampa Bay
1 FL Bay Dock (Sarasota Bay) 27.331600 -82.577900 12/9/1994 20:30 358000000 Tampa Bay
2 FL Siesta Key; 8 mi off mkr 3A at 270 degrees 27.277200 -82.722300 2/22/1996 0:00 197656000 Tampa Bay
3 TX Windsurfing Flats, Pinell Property, south Padr... 26.162420 -97.182580 10/10/2005 21:21 40000 Other
4 FL Lido Key, 2.5 miles WSW of 27.300000 -82.620000 1/2/2019 20:30 186266667 Tampa Bay
... ... ... ... ... ... ... ...
205547 MS 5-9A 30.361850 -88.850067 8/25/2020 0:00 0 Other
205548 MS Katrina Key 30.356869 -88.839592 9/30/2020 0:00 0 Other
205549 MS Sample* Long Beach 30.346020 -89.141030 1/25/2021 0:00 0 Other
205550 MS 10-Jun 30.343900 -88.602667 11/15/2021 0:00 0 Other
205551 MS 10-Jun 30.343900 -88.602667 12/21/2021 0:00 0 Other

205552 rows × 7 columns
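As an example of another method an LLM might suggest, here is a minimal sketch using numpy.select. It assumes the two masks defined above and is not required for the exercise:

import numpy as np

# Assign all three labels in one call; conditions are checked in order
conditions = [tampa_bay_mask, charlotte_harbor_mask]
choices = ['Tampa Bay', 'Charlotte Harbor']
df['REGION'] = np.select(conditions, choices, default='Other')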

5. Set column with date information as index#

Change SAMPLE_DATE to datetime format and set it as the index.

# Convert the "SAMPLE_DATE" column to datetime format using pd.to_datetime()
df['SAMPLE_DATE'] = pd.to_datetime(df['SAMPLE_DATE'])

# Set the "SAMPLE_DATE" column as the index of your DataFrame
df.set_index('SAMPLE_DATE', inplace=True)

Sort the index and display the minimum and maximum index values.

#Sort index (optional step)
df.sort_index(inplace=True)

#Display min index value
display(df.index.min())

#Display max index value
display(df.index.max())
Timestamp('1953-08-19 00:00:00')
Timestamp('2023-07-06 16:06:00')

6. Select the data for each region#

Create a new DataFrame for each region: charlotte_harbor_data and tampa_bay_data. Each one should contain the maximum cellcount per month for the whole period.

Start with the Charlotte Harbor region.

#Get rows for charlotte harbor region
charlotte_harbor_data = df[df['REGION'] == 'Charlotte Harbor'].copy()

#For these rows, find the maximum cellcount per month ('ME' = month end) for the study period
charlotte_harbor_data = charlotte_harbor_data['CELLCOUNT'].resample('ME').max()

# display data
charlotte_harbor_data
SAMPLE_DATE
1953-08-31      116000.0
1953-09-30           NaN
1953-10-31           NaN
1953-11-30           NaN
1953-12-31           NaN
                 ...    
2023-02-28    55655327.0
2023-03-31    12085102.0
2023-04-30      240128.0
2023-05-31      184500.0
2023-06-30         333.0
Freq: ME, Name: CELLCOUNT, Length: 839, dtype: float64
#Get rows for tampa bay region
tampa_bay_data = df[df['REGION'] ==  'Tampa Bay'].copy()

#For these rows, find the maximum cellcount per month for the study period
tampa_bay_data = tampa_bay_data['CELLCOUNT'].resample('ME').max()

# display data
tampa_bay_data
SAMPLE_DATE
1953-08-31      178000.0
1953-09-30     2444000.0
1953-10-31      140000.0
1953-11-30      392000.0
1953-12-31      580000.0
                 ...    
2023-02-28    35286933.0
2023-03-31     6363000.0
2023-04-30     1023333.0
2023-05-31       89000.0
2023-06-30         667.0
Freq: ME, Name: CELLCOUNT, Length: 839, dtype: float64
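As an optional aside, both regions can also be processed in one step with groupby and pd.Grouper. A minimal sketch, assuming the same df with its datetime index:

# Monthly maximum cellcount for both regions at once
monthly_max = (
    df[df['REGION'] != 'Other']
    .groupby(['REGION', pd.Grouper(freq='ME')])['CELLCOUNT']
    .max()
    .unstack('REGION')   # one column per region
)
monthly_max.head()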

7. Plot the data for each region#

Use the Pandas plot function to plot the data for each region. If you put the plot commands for each region together in the same code cell, both of them will be plotted on the same graph.

# Plot charlotte harbor data
charlotte_harbor_data.plot();
[Figure: monthly maximum Karenia brevis cell count, Charlotte Harbor]
# Plot tampa bay data
tampa_bay_data.plot();
[Figure: monthly maximum Karenia brevis cell count, Tampa Bay]
# Plot data for both regions by placing the two plot commands for each region below one another
charlotte_harbor_data.plot();
tampa_bay_data.plot();
[Figure: monthly maximum cell count, both regions plotted on the same graph]

Now we have a problem because we do not know which line is which. Ask an LLM how to add a legend for two Pandas plots.

Note

Pandas plot is a quick option. Later we will learn how to use the matplotlib library to make sophisticated and customized plots. Your LLM may suggest that you use matplotlib, because Pandas plot alone may not be able to do this. In that case, you may need to install matplotlib with pip install before you use it.
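For reference, one common way to install it from a notebook cell (run once; the exact command depends on your environment) is:

# Install matplotlib (only needed if it is not already installed)
%pip install matplotlib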

8. Asking an LLM about adding a legend to two Pandas plots#

Ask an LLM how to add a legend for two plots when using the Pandas plotting tool.

This is the prompt that I provided to ChatGPT 3.5 Turbo:

I am doing these two plots with pandas, how to add legend

charlotte_harbor['CELLCOUNT'].resample('Y').max().plot()
tampa_bay['CELLCOUNT'].resample('Y').max().plot()

Here is the response that I got:

To add a legend to your plots in Pandas, you can set the label parameter within the plot() function for each plot, and then call the legend() method. Here’s how you can do it:


import matplotlib.pyplot as plt

# Plot the data
charlotte_harbor_data.plot(label='Charlotte Harbor')
tampa_bay_data.plot(label='Tampa Bay')

# Add legend
plt.legend()

# Show the plot
plt.show()

In this code:

  • Each plot is labeled using the label parameter within the plot() function.

  • After plotting both series, plt.legend() is called to display the legend.

  • Finally, plt.show() is used to show the plot with the legend. Make sure to have import matplotlib.pyplot as plt at the beginning of your script or Jupyter Notebook cell.

Now let us try this solution

import matplotlib.pyplot as plt

# Plot the data
charlotte_harbor_data.plot(label='Charlotte Harbor')
tampa_bay_data.plot(label='Tampa Bay')

# Add legend
plt.legend()

# Show the plot
plt.show()
[Figure: monthly maximum cell count for both regions, with a legend]

9. You can do some customization with Pandas plot#

Set the size of the figure to (12, 4) inches and add a y-label. Check Section 4.15.4 Customize your Plot in the Pandas lesson for details.

# Plot the data with customization
charlotte_harbor_data.plot(label='Charlotte Harbor',figsize=(12, 4));
tampa_bay_data.plot(label='Tampa Bay',figsize=(12, 4),ylabel='Cells/Liter');

# Add legend
plt.legend()

# Show the plot
plt.show()
[Figure: monthly maximum cell count for both regions, figsize (12, 4), y-label "Cells/Liter", with legend]

10. Which region experienced more severe red tides in the last 10 years?#

To answer this question, let us plot the data for the last 10 years. You can do this with plot customization. Check Section 4.15.4 Customize your Plot in the Pandas Primer lesson for details.

# Plot the data with customization
charlotte_harbor_data.plot(label='Charlotte Harbor',figsize=(12, 4),xlim=('2014-01-01', '2024-01-01'));
tampa_bay_data.plot(label='Tampa Bay',figsize=(12, 4),ylabel='Cells/Liter', xlim=('2014-01-01', '2024-01-01'));

# Add legend
plt.legend()

# Show the plot
plt.show()
[Figure: monthly maximum cell count for both regions, last 10 years (2014 to 2024)]

Which region experienced more severe red tides in the last 10 years?

From the above figure, it is visually unclear which region experienced more severe red tides in the last 10 years. To accurately assess the significance of the data, we also need additional information about the relationship between cell counts and bloom severity. Moreover, we need to improve our analysis method: instead of simply finding the maximum concentration per month, we need a more rigorous method to characterize bloom events, that is, one that gives deeper insight into the dynamics of red tide occurrences and accounts for data gaps. Finally, we may need statistical methods that can help identify patterns, trends, and correlations within the data. Addressing these limitations will allow for a better understanding of red tide occurrences and their potential impacts.
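As an illustration of one possible next step, here is a minimal sketch that counts how many months per year exceeded a threshold in each region. The 100,000 cells/L threshold is an arbitrary choice for demonstration, not an established severity definition.

# Illustrative threshold (an assumption for demonstration only)
bloom_threshold = 100_000

# Count, per year, how many months exceeded the threshold in each region
charlotte_bloom_months = (charlotte_harbor_data > bloom_threshold).resample('YE').sum()
tampa_bloom_months = (tampa_bay_data > bloom_threshold).resample('YE').sum()

# Compare the two regions over the last 10 years
pd.DataFrame({'Charlotte Harbor': charlotte_bloom_months,
              'Tampa Bay': tampa_bloom_months}).loc['2014':]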

Note

The Plotly library in Python enables interactive plotting, allowing users to create figures where they can zoom in and out on specific parts of the data. While we will not cover Plotly this semester, now that you are a Python user, you have the opportunity to explore Plotly's capabilities independently.
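If you want to try it, here is a minimal sketch (assuming Plotly is installed, e.g., with pip install plotly) that plots the two monthly series interactively:

import plotly.express as px

# Combine the two monthly series into one DataFrame for plotting
monthly = pd.DataFrame({'Charlotte Harbor': charlotte_harbor_data,
                        'Tampa Bay': tampa_bay_data})

# Interactive line plot; zooming and panning are built in
fig = px.line(monthly, labels={'value': 'Cells/Liter', 'variable': 'Region'})
fig.show()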