Homework 3

Late Submission Policy

Problem 1: Exploratory data analysis with Pandas (100 points)

Tasks

Perform the following tasks:

  1. Repeat the steps in the Lessons 10 - 14 Pandas Primer using a dataset of your interest
  2. Apply a Pandas method that was not introduced in class in your data analysis
  3. Present noteworthy information uncovered from your data exploration
  4. Document your use of AI-LLM in improving your learning and productivity

Learning objectives

As outlined in the syllabus, this course emphasizes project-based learning and self-directed study opportunities. This problem gives you the chance to explore a dataset of personal interest while practicing the exploratory data analysis skills listed in the tasks above.

Dataset selection

Select a dataset that aligns with the learning objectives for this assignment and is rich enough to support all of the tasks above.

The instructor is available to provide guidance if you need help selecting an appropriate dataset. Please reach out with any questions!

Rubric

Task 1. Data exploration (85 points)

Using any (big) dataset of your interest, repeat Section 4.0 of Lessons 10 - 14.

Task | Criteria | Points
1. Data Selection | Appropriateness of dataset and proper citation | 8
2. Read CSV File | Correctly read into DataFrame with Pandas | 1
3. Display DataFrame | Properly displayed with head/tail/info | 1
4. Filter Columns by Labels | Accurate filtering, demonstrated understanding | 5
5. Filter Rows by Keyword | Accurate filtering, demonstrated understanding | 5
6. Filter Rows by Value | Correct filtering by numeric and non-numeric values | 5
7. Datetime Index* | Appropriate datetime conversion and indexing, or alternative if no datetime | 5
8. Descriptive Statistics | Comprehensive descriptive statistics provided | 5
9. Resampling of Time-Series Data* | Correct application of resampling techniques, or alternative if no datetime | 5
10. Groupby | Correct usage of groupby for data aggregation | 5
11. Slicing with loc & iloc | Accurate slicing techniques demonstrated | 5
12. Dicing | Correct dicing of DataFrame to obtain smaller portions | 5
13. Slicing and Dicing Together | Effective combination of slicing and dicing to extract data | 5
14. Datetime Column Subsetting* | Correct subsetting using Datetime, or alternative if no datetime | 5
15. Quick Plots of Data | Creation of insightful plots that aid in data understanding | 5
16. Operations on DataFrame | Appropriate and effective operations applied to DataFrame | 5
17. Iterating over Rows (Bonus) | Successful iteration over DataFrame rows for additional insights | 1 (bonus)
18. Save & Load DataFrame | Correctly saved to and loaded from file | 5
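As a sketch of these rubric steps end to end, here is a minimal, hedged example on a tiny made-up dataset (the file name and every column name below are invented for illustration; substitute your own CSV and columns):

```python
from io import StringIO
import os
import tempfile

import pandas as pd

# Hypothetical mini-dataset standing in for your own CSV file.
csv_text = """date,city,species,count
2023-01-02,Tampa,heron,5
2023-01-09,Tampa,egret,3
2023-01-09,Miami,heron,7
2023-01-16,Miami,egret,2
"""

# 2. Read CSV into a DataFrame (use pd.read_csv("your_file.csv") for a real file)
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])

# 3. Display the DataFrame
print(df.head())
df.info()

# 4. Filter columns by labels
subset_cols = df[["city", "count"]]

# 5. Filter rows by keyword
tampa_rows = df[df["city"].str.contains("Tampa")]

# 6. Filter rows by value
big_counts = df[df["count"] > 3]

# 7. Datetime index
df = df.set_index("date").sort_index()

# 8. Descriptive statistics
stats = df["count"].describe()

# 9. Resample time-series data (weekly sum of counts)
weekly = df["count"].resample("W").sum()

# 10. Groupby aggregation
per_city = df.groupby("city")["count"].sum()

# 11-13. Slicing and dicing with loc / iloc
slice_loc = df.loc["2023-01-09", ["city", "count"]]
slice_iloc = df.iloc[0:2, 0:2]

# 18. Save and reload the DataFrame
path = os.path.join(tempfile.gettempdir(), "my_data.csv")
df.to_csv(path)
df_back = pd.read_csv(path, index_col="date", parse_dates=True)
```

Your own notebook should, of course, expand each step with commentary on what it reveals about your dataset.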

Task 2. New Pandas method exploration (5 points)

Explore a Pandas method that was not covered in class but piques your interest or aligns with your dataset analysis. Demonstrate its usage with examples from your dataset. Some suggestions: the .agg(), .pivot_table(), .stack() and .unstack(), .merge() and .join(), .cut() and .qcut(), .explode(), .shift(), .rolling(), and .duplicated() and .drop_duplicates() methods, among many others.
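For instance, a short sketch of .pivot_table(), one of the suggested methods, on made-up data (the column names are placeholders for your dataset's own):

```python
import pandas as pd

# Hypothetical long-format data; swap in columns from your own dataset.
df = pd.DataFrame({
    "region": ["Tampa Bay", "Tampa Bay", "Charlotte Harbor", "Charlotte Harbor"],
    "year":   [2022, 2023, 2022, 2023],
    "cellcount": [1_000, 40_000, 250_000, 0],
})

# .pivot_table() reshapes long data into a region-by-year grid,
# aggregating any duplicate cells with the given function (here: max).
table = df.pivot_table(index="region", columns="year",
                       values="cellcount", aggfunc="max")
print(table)
```

Whatever method you pick, explain in your own words what it does and why it fits your data.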

Task | Criteria | Points
19. New Method | Successful demonstration of a new Pandas method not covered in class, with clear explanation and proper application to the dataset | 5

Task 3. Information discoveries (5 points)

Report any valuable information and findings derived from your data analysis.

Task | Criteria | Points
20. Information Discoveries | Clear and informative report outlining valuable findings from the analysis, including patterns, anomalies, or other relevant observations | 5

Task 4. LLM usage (5 points)

This task highlights the integration of advanced AI technologies in data analysis workflows and showcases your ability to leverage cutting-edge tools for effective problem-solving. In this section, document the specific instances where you utilized a Large Language Model (LLM), such as GPT-3.5, for problem-solving during this assignment, and discuss your overall experience of using the LLM.

Task | Criteria | Points
21. LLM Usage | Thorough documentation of LLM usage in the analysis process, including specific examples and reflection on the experience | 5

Additional Information:

For Task 3, Information discoveries (5 points), if you are interested, you can learn about the data-information-knowledge-wisdom hierarchy at the science-policy interface.

Problem 2 - Water quality analysis (10 points)

Dataset

Red tides are caused by Karenia brevis harmful algal blooms. For Karenia brevis cell count data, you can use the current dataset of Physical and biological data collected along the Texas, Mississippi, Alabama, and Florida Gulf coasts in the Gulf of Mexico as part of the Harmful Algal BloomS Observing System from 1953-08-19 to 2023-07-06 (NCEI Accession 0120767). For direct data download, you can use this data link and this data documentation link. Alternatively, FWRI documents Karenia brevis blooms from 1953 to the present. That dataset has more than 200,000 records and is updated daily. To request it, email HABdata@MyFWC.com. To learn more about this data, check the FWRI Red Tide Current Status.

Study areas

Conduct your analysis in Tampa Bay and the Charlotte Harbor estuary. For Tampa Bay, restrict the Karenia brevis measurements to 27° N to 28° N and from 85° W to the coast. For the Charlotte Harbor estuary, restrict the Karenia brevis measurements to 25.5° N up to (but not including) 27° N and from 85° W to the coast.
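One way to encode these bounding boxes is with boolean masks. This is a sketch on a few made-up rows; the LATITUDE and LONGITUDE column names follow the sample DataFrame shown later in this problem:

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
df = pd.DataFrame({
    "LATITUDE":  [27.33, 26.50, 30.36],
    "LONGITUDE": [-82.58, -82.10, -88.85],
})

# Tampa Bay: 27°N to 28°N, and east of 85°W (longitude >= -85, toward the coast)
tampa_mask = (df["LATITUDE"] >= 27) & (df["LATITUDE"] <= 28) & (df["LONGITUDE"] >= -85)

# Charlotte Harbor estuary: 25.5°N up to (but not including) 27°N, east of 85°W
charlotte_mask = (df["LATITUDE"] >= 25.5) & (df["LATITUDE"] < 27) & (df["LONGITUDE"] >= -85)

df["REGION"] = "Other"
df.loc[tampa_mask, "REGION"] = "Tampa Bay"
df.loc[charlotte_mask, "REGION"] = "Charlotte Harbor"
print(df)
```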

Problem statement

Task 1: Plot the maximum cell count of Karenia brevis (cells per liter) per week over the whole dataset for each of the two regions, Tampa Bay and the Charlotte Harbor estuary.
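A minimal sketch of the weekly-maximum computation, using a made-up series in place of the real CELLCOUNT column of one region:

```python
import pandas as pd

# Hypothetical toy series; in practice use the CELLCOUNT column of one region,
# with the sample date as a DatetimeIndex.
idx = pd.to_datetime(["2023-01-02", "2023-01-04", "2023-01-10", "2023-01-17"])
cells = pd.Series([5_000, 120_000, 800, 2_000_000], index=idx, name="CELLCOUNT")

# Weekly maximum: .resample("W") groups samples by week, .max() keeps the peak.
weekly_max = cells.sort_index().resample("W").max()
print(weekly_max)

# A log y-axis helps, since counts span several orders of magnitude:
# weekly_max.plot(logy=True, marker="o", ylabel="cells/L")
```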

Task 2: FWRI classifies Karenia brevis abundance based on cell counts as described here:

Index | Description | K. brevis abundance | Possible effects (K. brevis only)
0 | NOT PRESENT - BACKGROUND | background levels of 1,000 cells/L or less | no effects anticipated
1 | VERY LOW | > 1,000 - 10,000 cells/L | possible respiratory irritation; shellfish harvesting closures when cell abundance equals or exceeds 5,000 cells/L
2 | LOW | > 10,000 - 100,000 cells/L | respiratory irritation; shellfish harvesting closures; possible fish kills; probable detection of chlorophyll by satellites at upper range of cell abundance
3 | MEDIUM | > 100,000 - 1,000,000 cells/L | respiratory irritation; shellfish harvesting closures; probable fish kills; detection of surface chlorophyll by satellites
4 | HIGH | > 1,000,000 cells/L | as above, plus water discoloration

Given the data in the above table, plot a histogram for each region to show the frequencies of the maximum cellcount per week according to the above classification. The histogram should only include 4 bins for the cases of ‘very low’, ‘low’, ‘medium’, and ‘high’.

Here is one solution strategy that you can follow.

(1) After you read and clean your data into a DataFrame, let us say df, as you did in Task 1, create a new column BLOOM_CLASS. Given the above table, classify the bloom (i.e., 'no bloom', 'very low bloom', 'low bloom', 'medium bloom', and 'high bloom') for the whole dataset. For example, if the max cell count in row 1 in the table below is 388400000 cells/L, then according to the table above this is a HIGH bloom, and the first row in BLOOM_CLASS will have the value 4 (i.e., the index value in the table above). If the max concentration in a given row is 0, then the index will be 0 and that row's BLOOM_CLASS will have the value 0. Here is an example of how your DataFrame df will look:

SAMPLE_DATE | STATE_ID | DESCRIPTION | LATITUDE | LONGITUDE | CELLCOUNT | REGION | BLOOM_CLASS
2022-11-30 18:50:00 | FL | Bay Dock (Sarasota Bay) | 27.331600 | -82.577900 | 388400000 | Tampa Bay | 4
1994-12-09 20:30:00 | FL | Bay Dock (Sarasota Bay) | 27.331600 | -82.577900 | 358000000 | Tampa Bay | 4
1996-02-22 00:00:00 | FL | Siesta Key; 8 mi off mkr 3A at 270 degrees | 27.277200 | -82.722300 | 197656000 | Tampa Bay | 4
2005-10-10 21:21:00 | TX | Windsurfing Flats, Pinell Property, south Padr… | 26.162420 | -97.182580 | 40000 | Other | 2
2019-01-02 20:30:00 | FL | Lido Key, 2.5 miles WSW of | 27.300000 | -82.620000 | 186266667 | Tampa Bay | 4
2020-08-25 00:00:00 | MS | 5-9A | 30.361850 | -88.850067 | 0 | Other | 0
2020-09-30 00:00:00 | MS | Katrina Key | 30.356869 | -88.839592 | 0 | Other | 0
2021-01-25 00:00:00 | MS | Sample* Long Beach | 30.346020 | -89.141030 | 0 | Other | 0
2021-11-15 00:00:00 | MS | 10-Jun | 30.343900 | -88.602667 | 0 | Other | 0
2021-12-21 00:00:00 | MS | 10-Jun | 30.343900 | -88.602667 | 0 | Other | 0

205552 rows × 7 columns

This is similar to Exercise 4, where you created a new column REGION and used a mask and dicing to fill in this new column with values (i.e., 'Tampa Bay', 'Charlotte Harbor', and 'Other') based on a latitude and longitude mask. You can do the same here: create a new column BLOOM_CLASS, and use a mask and dicing to fill it in with values of 0, 1, 2, 3, or 4 based on a K. brevis abundance mask given the ranges in the table above.
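As an alternative sketch to the mask-and-dice approach, pd.cut can bin CELLCOUNT directly on the FWRI thresholds (this assumes the CELLCOUNT column name from the sample table above):

```python
import pandas as pd

# Hypothetical cell counts spanning all five FWRI classes.
df = pd.DataFrame({"CELLCOUNT": [0, 500, 5_000, 50_000, 500_000, 388_400_000]})

# Bin edges follow the FWRI table: <=1,000 is background (0); then
# >1e3-1e4 (1), >1e4-1e5 (2), >1e5-1e6 (3), >1e6 (4) cells/L.
bins = [-1, 1_000, 10_000, 100_000, 1_000_000, float("inf")]
df["BLOOM_CLASS"] = pd.cut(df["CELLCOUNT"], bins=bins,
                           labels=[0, 1, 2, 3, 4]).astype(int)
print(df)
```

Either approach is acceptable; pd.cut is simply more compact when the classes come from contiguous numeric ranges.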

(2) From your original DataFrame df, create two new DataFrames (as copies), one for each region: charlotte_harbor_hist_data and tampa_bay_hist_data. You can use these DataFrames to do resampling for each region.
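A sketch of step (2), assuming REGION and BLOOM_CLASS were already filled in as described above:

```python
import pandas as pd

# Hypothetical df with REGION and BLOOM_CLASS columns already populated.
df = pd.DataFrame({
    "REGION": ["Tampa Bay", "Charlotte Harbor", "Other", "Tampa Bay"],
    "BLOOM_CLASS": [4, 2, 0, 1],
})

# .copy() gives independent DataFrames, so later edits (and resampling)
# do not trigger SettingWithCopyWarning on views of df.
tampa_bay_hist_data = df[df["REGION"] == "Tampa Bay"].copy()
charlotte_harbor_hist_data = df[df["REGION"] == "Charlotte Harbor"].copy()
```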

(3) Resample to find the maximum bloom class per week. Remember from class that you can only resample numeric columns, so make sure that you select only the BLOOM_CLASS column. It is always a good idea to sort the index before resampling. After sorting and weekly resampling, your data should look like this for Charlotte Harbor, for example:

SAMPLE_DATE | BLOOM_CLASS
1953-08-23 | 3.0
1953-08-30 | 0.0
1953-09-06 | NaN
1953-09-13 | NaN
1953-09-20 | NaN
... | ...
2023-06-04 | 0.0
2023-06-11 | 0.0
2023-06-18 | 0.0
2023-06-25 | 0.0
2023-07-02 | 0.0

3646 rows × 1 columns

(4) Create a histogram plot for only the index values 1 to 4 in your BLOOM_CLASS column for each region. You can have one figure for each region.
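A sketch of step (4) on made-up weekly maxima; the commented .plot.hist call shows one way to get exactly four bins:

```python
import pandas as pd

# Hypothetical weekly maxima of BLOOM_CLASS for one region (NaN = no samples that week).
weekly = pd.Series([3.0, 0.0, None, 4.0, 1.0, 0.0, 2.0, 1.0], name="BLOOM_CLASS")

# Keep only bloom weeks (classes 1-4); NaN and 0 both fail the comparison.
blooms = weekly[weekly >= 1]

# Frequency of each bloom class across the kept weeks.
counts = blooms.value_counts().sort_index()
print(counts)

# blooms.plot.hist(bins=[0.5, 1.5, 2.5, 3.5, 4.5], rwidth=0.8) draws the
# 4-bin histogram; relabel the x-ticks 'very low', 'low', 'medium', 'high'.
```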

Hints:

Final answer

This is a solution: histogram1953

This is an extra figure in case you are curious about the last 10 years: histogram

Rubric

The student’s performance on this homework problem will be evaluated based on the ability to collect and organize the data, perform data analysis and visualization, interpret the results, and communicate the findings in a clear and concise manner as follows.

  1. Data Collection and Preparation (1 point)
    • Correctly downloaded and imported the dataset from the provided data link or requested the dataset from FWRI as instructed.
    • Successfully filtered and subsetted the data for Tampa Bay and Charlotte Harbor estuary regions based on the provided latitude and longitude constraints.
  2. Data Analysis and Visualization (6 points)
    • Accurately extracted the maximum concentration of Karenia brevis cell counts per week from the dataset for both Tampa Bay and Charlotte Harbor estuary.
    • Created a clear and informative plot(s) of the maximum concentration of Karenia brevis cell counts per week for the whole dataset and for each region.
    • Accurately classified the bloom per week for both Tampa Bay and Charlotte Harbor estuary.
    • Created a clear and informative histogram plot(s) for the bloom impact per week for each region.
  3. Interpretation and Conclusion (1 point)
    • Provided a brief interpretation of the plots, including any noticeable patterns, trends, or anomalies between the two regions.
    • Discussed relevant assumptions or limitations.
  4. Code Quality and Documentation (2 points)
    • Submitted well-structured and commented Python code, demonstrating a good understanding of Pandas and good coding practices.
    • Proper citation of the used AI-LLM and data source.