Perform the following tasks:
As outlined in the syllabus, this course emphasizes project-based learning and self-directed study opportunities. This problem provides you with the chance to explore a dataset of personal interest. The objectives of this problem are to:
Select a dataset that aligns with the learning objectives for this assignment. When choosing a dataset, consider the following:
The instructor is available to provide guidance if you need help selecting an appropriate dataset. Please reach out with any questions!
Using any (big) dataset of interest to you, repeat Section 4.0 in Lessons 10-14.
Task | Criteria | Points |
---|---|---|
1. Data Selection | Appropriateness of dataset and proper citation | 8 |
2. Read CSV File | Correctly read into DataFrame with Pandas | 1 |
3. Display DataFrame | Properly displayed with head/tail/info | 1 |
4. Filter Columns by Labels | Accurate filtering, demonstrated understanding | 5 |
5. Filter Rows by Keyword | Accurate filtering, demonstrated understanding | 5 |
6. Filter Rows by Value | Correct filtering by numeric and non-numeric values | 5 |
7. Datetime Index* | Appropriate datetime conversion and indexing, or alternative if no datetime | 5 |
8. Descriptive Statistics | Comprehensive descriptive statistics provided | 5 |
9. Resampling of Time-Series Data* | Correct application of resampling techniques, or alternative if no datetime | 5 |
10. Groupby | Correct usage of groupby for data aggregation | 5 |
11. Slicing with loc & iloc | Accurate slicing techniques demonstrated | 5 |
12. Dicing | Correct dicing of DataFrame to obtain smaller portions | 5 |
13. Slicing and Dicing Together | Effective combination of slicing and dicing to extract data | 5 |
14. Datetime Column Subsetting* | Correct subsetting using Datetime, or alternative if no datetime | 5 |
15. Quick Plots of Data | Creation of insightful plots that aid in data understanding | 5 |
16. Operations on DataFrame | Appropriate and effective operations applied to DataFrame | 5 |
17. Iterating over Rows | (Bonus) Successful iteration over DataFrame rows for additional insights | 1 (bonus) |
18. Save & Load DataFrame | Correctly saved to and loaded from file | 5 |
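The basic tasks in the rubric above (reading, displaying, filtering, datetime indexing, and descriptive statistics) can be sketched as follows. This is a minimal, illustrative example: the CSV content and the column names `date`, `description`, and `value` are hypothetical stand-ins for your own dataset.

```python
import pandas as pd
from io import StringIO

# Stand-in for a real CSV file; replace StringIO(...) with your file path.
csv_data = StringIO(
    "date,description,value\n"
    "2023-01-01,red tide sample,50\n"
    "2023-01-08,clear water,150\n"
    "2023-01-15,red tide bloom,300\n"
)

# Task 2: read the CSV into a DataFrame.
df = pd.read_csv(csv_data)

# Task 3: display the DataFrame.
print(df.head())
df.info()

# Task 4: filter columns by labels.
subset = df[["date", "value"]]

# Task 5: filter rows by keyword in a text column.
tide = df[df["description"].str.contains("red tide", case=False, na=False)]

# Task 6: filter rows by value.
high = df[df["value"] > 100]

# Task 7: datetime conversion and indexing.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

# Task 8: descriptive statistics.
print(df["value"].describe())
```

The same pattern extends to the remaining tasks (resampling, groupby, slicing with `loc`/`iloc`, plotting, and saving with `to_csv`).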
Explore a Pandas method that was not covered in class but piques your interest or aligns with your dataset analysis. Demonstrate its usage with examples from your dataset. Here are some suggestions: the `.agg()` method, the `.pivot_table()` method, the `.stack()` and `.unstack()` methods, the `.merge()` and `.join()` methods, the `.cut()` and `.qcut()` methods, the `.explode()` method, the `.shift()` method, the `.rolling()` method, the `.duplicated()` and `.drop_duplicates()` methods, and many more.
Task | Criteria | Points |
---|---|---|
19. New Method | Successful demonstration of a new Pandas method not covered in class, with clear explanation and proper application to the dataset | 5 |
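As one possible choice from the suggestion list, `pd.cut()` bins continuous values into labeled categories. A minimal sketch on made-up cell-count values (the bin edges here happen to match the FWRI abundance classes used later in this assignment):

```python
import pandas as pd

# Hypothetical cell counts (cells/L).
counts = pd.Series([500, 5_000, 50_000, 500_000, 5_000_000])

# pd.cut assigns each value to a labeled bin; bins are right-closed by default.
labels = pd.cut(
    counts,
    bins=[0, 1_000, 10_000, 100_000, 1_000_000, float("inf")],
    labels=["background", "very low", "low", "medium", "high"],
)
print(labels.tolist())
# -> ['background', 'very low', 'low', 'medium', 'high']
```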
Report any valuable information and findings derived from your data analysis.
Task | Criteria | Points |
---|---|---|
20. Information Discoveries | Clear and informative report outlining valuable findings from the analysis, including patterns, anomalies, or other relevant observations | 5 |
Highlight the integration of advanced AI technologies in data-analysis workflows and showcase your ability to leverage cutting-edge tools for effective problem-solving. In this section, document the specific instances where you utilized a Large Language Model (LLM) such as GPT-3.5 for problem-solving during this assignment, and discuss your overall experience of using the LLM.
Task | Criteria | Points |
---|---|---|
21. LLM Usage | Thorough documentation of LLM usage in the analysis process, including specific examples and reflection on the experience | 5 |
Additional Information:
For the Information Discoveries task (5 points), if you are interested, you can learn about the data-information-knowledge-wisdom hierarchy at the science-policy interface.
Red tides are caused by Karenia brevis harmful algal blooms. For Karenia brevis cell count data, you can use the current dataset of Physical and biological data collected along the Texas, Mississippi, Alabama, and Florida Gulf coasts in the Gulf of Mexico as part of the Harmful Algal BloomS Observing System from 1953-08-19 to 2023-07-06 (NCEI Accession 0120767). For direct data download, you can use this data link and this data documentation link. Alternatively, FWRI documents Karenia brevis blooms from 1953 to the present. That dataset has more than 200,000 records and is updated daily. To request this dataset, email: HABdata@MyFWC.com. To learn more about this data, check the FWRI Red Tide Current Status.
Conduct your analysis in Tampa Bay and Charlotte Harbor estuary. For Tampa Bay, restrict the Karenia brevis measurements from 27° N to 28° N and 85° W to coast. For Charlotte Harbor estuary, restrict the Karenia brevis measurements from 25.5° N to less than 27° N and 85° W to coast.
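The two bounding boxes above can be applied with boolean masks. This sketch interprets "85° W to coast" as longitudes east of −85° (i.e., `LONGITUDE > -85`), and uses toy rows with the `LATITUDE`/`LONGITUDE` column names from the example table below; adapt the names to your dataset.

```python
import pandas as pd

# Toy coordinates (one Tampa Bay station, one Texas station, one Mississippi station).
df = pd.DataFrame({
    "LATITUDE": [27.33, 26.16, 30.36],
    "LONGITUDE": [-82.58, -97.18, -88.85],
})

df["REGION"] = "Other"

# Tampa Bay: 27°N to 28°N, east of 85°W toward the coast.
tampa = df["LATITUDE"].between(27, 28) & (df["LONGITUDE"] > -85)
df.loc[tampa, "REGION"] = "Tampa Bay"

# Charlotte Harbor estuary: 25.5°N to less than 27°N, east of 85°W.
charlotte = (df["LATITUDE"] >= 25.5) & (df["LATITUDE"] < 27) & (df["LONGITUDE"] > -85)
df.loc[charlotte, "REGION"] = "Charlotte Harbor"

print(df["REGION"].tolist())
# -> ['Tampa Bay', 'Other', 'Other']
```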
Task 1: Plot the maximum cell count of Karenia brevis (cell counts per liter) per week for the whole dataset for each of the regions of Tampa Bay and Charlotte Harbor estuary.
Task 2: FWRI classifies Karenia brevis abundance based on cell counts as described here as follows:
Index | Description | K. brevis abundance | Possible effects (K. brevis only) |
---|---|---|---|
0 | NOT PRESENT - BACKGROUND | background levels of 1,000 cells/L or less | no effects anticipated |
1 | VERY LOW | > 1,000 - 10,000 cells/L | possible respiratory irritation; shellfish harvesting closures when cell abundance equals or exceeds 5,000 cells/L |
2 | LOW | > 10,000 - 100,000 cells/L | respiratory irritation; shellfish harvesting closures; possible fish kills; probable detection of chlorophyll by satellites at upper range of cell abundance |
3 | MEDIUM | > 100,000 - 1,000,000 cells/L | respiratory irritation; shellfish harvesting closures; probable fish kills; detection of surface chlorophyll by satellites |
4 | HIGH | > 1,000,000 cells/L | as above, plus water discoloration |
Given the data in the above table, plot a histogram for each region to show the frequencies of the maximum cellcount per week according to the above classification. The histogram should only include 4 bins for the cases of ‘very low’, ‘low’, ‘medium’, and ‘high’.
Here is one solution strategy that you can follow.
(1) After you read and clean your data into a DataFrame, say `df`, as you did in Task 1, create a new column `BLOOM_CLASS`. Given the above table, classify the bloom (i.e., 'no bloom', 'very low bloom', 'low bloom', 'medium bloom', and 'high bloom') for the whole dataset. For example, if the max cell count in row 1 of the table below is 388,400,000 cells/L, then according to the table above this is a HIGH bloom, so the first row of `BLOOM_CLASS` will have the value 4 (i.e., the index value in the table above). If the max concentration in a given row is 0, then the index will be 0 and that row's `BLOOM_CLASS` will have a value of 0. Here is an example of how your DataFrame `df` will look:
 | STATE_ID | DESCRIPTION | LATITUDE | LONGITUDE | CELLCOUNT | REGION | BLOOM_CLASS |
---|---|---|---|---|---|---|---|
2022-11-30 18:50:00 | FL | Bay Dock (Sarasota Bay) | 27.331600 | -82.577900 | 388400000 | Tampa Bay | 4 |
1994-12-09 20:30:00 | FL | Bay Dock (Sarasota Bay) | 27.331600 | -82.577900 | 358000000 | Tampa Bay | 4 |
1996-02-22 00:00:00 | FL | Siesta Key; 8 mi off mkr 3A at 270 degrees | 27.277200 | -82.722300 | 197656000 | Tampa Bay | 4 |
2005-10-10 21:21:00 | TX | Windsurfing Flats, Pinell Property, south Padr… | 26.162420 | -97.182580 | 40000 | Other | 2 |
2019-01-02 20:30:00 | FL | Lido Key, 2.5 miles WSW of | 27.300000 | -82.620000 | 186266667 | Tampa Bay | 4 |
… | … | … | … | … | … | … | … |
2020-08-25 00:00:00 | MS | 5-9A | 30.361850 | -88.850067 | 0 | Other | 0 |
2020-09-30 00:00:00 | MS | Katrina Key | 30.356869 | -88.839592 | 0 | Other | 0 |
2021-01-25 00:00:00 | MS | Sample* Long Beach | 30.346020 | -89.141030 | 0 | Other | 0 |
2021-11-15 00:00:00 | MS | 10-Jun | 30.343900 | -88.602667 | 0 | Other | 0 |
2021-12-21 00:00:00 | MS | 10-Jun | 30.343900 | -88.602667 | 0 | Other | 0 |
205552 rows × 7 columns
This is similar to Exercise 4. In Exercise 4 you created a new column `REGION` and used a mask and dicing to fill in this new column with values (i.e., 'Tampa Bay', 'Charlotte Harbor', and 'Other') based on the latitude and longitude mask. You can do the same here: create a new column `BLOOM_CLASS`, and use a mask and dicing to fill in the new column with values of 0, 1, 2, 3, or 4 based on the K. brevis abundance ranges in the table above.
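One way to sketch this mask-and-dice classification, using made-up `CELLCOUNT` values; each successive mask overwrites the class for rows above that threshold, so the final value matches the FWRI ranges:

```python
import pandas as pd

# Toy CELLCOUNT values (cells/L); in practice this column comes from your cleaned data.
df = pd.DataFrame({"CELLCOUNT": [0, 800, 40_000, 388_400_000]})

df["BLOOM_CLASS"] = 0  # 0: not present / background (<= 1,000 cells/L)

# Apply the FWRI abundance thresholds with boolean masks and .loc.
df.loc[df["CELLCOUNT"] > 1_000, "BLOOM_CLASS"] = 1        # very low
df.loc[df["CELLCOUNT"] > 10_000, "BLOOM_CLASS"] = 2       # low
df.loc[df["CELLCOUNT"] > 100_000, "BLOOM_CLASS"] = 3      # medium
df.loc[df["CELLCOUNT"] > 1_000_000, "BLOOM_CLASS"] = 4    # high

print(df["BLOOM_CLASS"].tolist())
# -> [0, 0, 2, 4]
```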
(2) As copies of your original DataFrame `df`, create two new DataFrames, one for each region, as follows: `charlotte_harbor_hist_data` and `tampa_bay_hist_data`. You can use these DataFrames to do resampling for each region.
(3) Resample to find the maximum bloom class per week. Remember from class that you can only resample numeric columns, so make sure that you select only the `BLOOM_CLASS` column. It is always a good idea to sort the index before resampling. After sorting and weekly resampling, your data should look like this for Charlotte Harbor, for example:
SAMPLE_DATE | BLOOM_CLASS |
---|---|
1953-08-23 | 3.0 |
1953-08-30 | 0.0 |
1953-09-06 | NaN |
1953-09-13 | NaN |
1953-09-20 | NaN |
… | … |
2023-06-04 | 0.0 |
2023-06-11 | 0.0 |
2023-06-18 | 0.0 |
2023-06-25 | 0.0 |
2023-07-02 | 0.0 |
3646 rows × 1 columns
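Steps (2) and (3) can be sketched as follows on a toy DataFrame with a `DatetimeIndex` and the `REGION` and `BLOOM_CLASS` columns (the dates below are made up to land in the first two weeks of the example output above):

```python
import pandas as pd

# Toy data standing in for the full classified DataFrame.
idx = pd.to_datetime(["1953-08-19", "1953-08-21", "1953-08-28"])
df = pd.DataFrame(
    {"REGION": ["Charlotte Harbor"] * 3, "BLOOM_CLASS": [1, 3, 0]},
    index=idx,
)

# Step (2): one copy per region (repeat for tampa_bay_hist_data).
charlotte_harbor_hist_data = df[df["REGION"] == "Charlotte Harbor"].copy()

# Step (3): select only the numeric column, sort the index,
# then take the weekly maximum ("W" bins end on Sundays).
weekly_max = (
    charlotte_harbor_hist_data["BLOOM_CLASS"]
    .sort_index()
    .resample("W")
    .max()
)
print(weekly_max)
```

With these toy dates, the two samples from 1953-08-19 and 1953-08-21 collapse into the week ending 1953-08-23 (maximum class 3), matching the first row of the example output; weeks with no samples come out as `NaN`.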
(4) Create a histogram plot for only the index values 1 to 4 in your `BLOOM_CLASS` column for each region. You can have one figure per region.
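A sketch of this histogram step, assuming `weekly_max` is the resampled series from the previous step (the values below are made up); the four bins are centered on the integer class values so each class gets exactly one bar:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when working interactively
import matplotlib.pyplot as plt
import pandas as pd

# Toy weekly-maximum BLOOM_CLASS series (stand-in for the resampled data).
weekly_max = pd.Series([0.0, 1.0, 2.0, 2.0, 4.0, float("nan"), 3.0])

# Keep only classes 1-4 (drop background weeks and NaN gaps).
blooms = weekly_max[weekly_max.between(1, 4)]

# Four bins, one per class.
fig, ax = plt.subplots()
ax.hist(blooms, bins=[0.5, 1.5, 2.5, 3.5, 4.5])
ax.set_xticks([1, 2, 3, 4])
ax.set_xticklabels(["very low", "low", "medium", "high"])
ax.set_xlabel("K. brevis abundance class")
ax.set_ylabel("number of weeks")
ax.set_title("Charlotte Harbor: weekly max bloom class")
fig.savefig("charlotte_harbor_hist.png")
```

Repeat with `tampa_bay_hist_data` for the second figure.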
Hints:
This is a solution:
This is an extra figure in case you are curious about the last 10 years:
The student's performance on this homework problem will be evaluated based on the ability to collect and organize the data, perform data analysis and visualization, interpret the results, and communicate the findings in a clear and concise manner, as laid out in the rubric tables above.