Lessons 10-14: Pandas Primer#
This lesson is modified from Introduction to Pandas by Project Pythia, and from Exploring data using Pandas, Processing data with Pandas, Processing data with Pandas II by Geo-Python
Overview#
From the official documentation, Pandas “is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.” Pandas is a powerful library for working with tabular data. You can think of it as a programmable spreadsheet. By the end of this lesson you will be able to:
install and import Pandas
import a csv file with Pandas
use Pandas for exploratory data analysis
filter and query through data to perform data analysis
Cheat Sheets#
Advanced: 2-page cheat sheet by Pandas Developers
Comprehensive: 12-page cheat sheet by Lee at Univ of Idaho
1. Installation#
If you are using Anaconda, you do not need to install Pandas. If you installed only Python, then you need to install common packages such as Pandas and NumPy. You can install Pandas using pip:
pip install pandas
Before you run the above command in a new code cell, you can update your pip and install a dependency.
Just run these cells one by one. After each installation is complete, make sure to restart the kernel from the menu Kernel: Restart Kernel...
#pip install --upgrade pip
#pip install pyarrow
#pip install pandas
You need to install these only once. Then you can comment out the installation commands (as they are already commented above) or delete these cells to avoid repeating this every time you run this notebook.
2. Imports#
You will often see the nickname pd used as an abbreviation for Pandas in the import statement, just like numpy is often imported as np.
import pandas as pd
Note: as a good coding practice, we generally import all the packages that we will need in the first cell of our notebook. However, we are importing Pandas here in the middle of our notebook for educational purposes.
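If you want to confirm that the import worked (and see which Pandas version you have), a quick check is:
```Python
# Print the installed Pandas version
import pandas as pd
print(pd.__version__)
```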
3. Basic terminologies#
The Pandas DataFrame and the Pandas Series are labeled data structures.
The Pandas DataFrame (a 2-dimensional data structure) is used for storing and manipulating table-like data (data with rows and columns) in Python. You can think of a Pandas DataFrame as a programmable spreadsheet.
The first column in dark gray, referred to as the index, contains information characterizing each row.
The first row in dark gray, referred to as the header, contains the column labels.
The Pandas Series (a 1-dimensional data structure) is used for storing and manipulating a sequence of values. A Pandas Series is like a list, but more clever. One row or one column of a Pandas DataFrame is actually a Pandas Series.
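To make these two terms concrete, here is a minimal sketch (with made-up data) that builds a small DataFrame and pulls one column out as a Series:
```Python
import pandas as pd

# Build a small DataFrame from a dictionary (hypothetical data)
df_example = pd.DataFrame({'city': ['Fort Myers', 'Naples'], 'tmax': [88, 90]})

# Selecting one column of a DataFrame returns a Series
col = df_example['tmax']
print(type(df_example))  # <class 'pandas.core.frame.DataFrame'>
print(type(col))         # <class 'pandas.core.series.Series'>
```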
4. Learning Pandas by an example#
Let us learn Pandas by an example. This is mainly to
gain an understanding of how to effectively use the library in your own projects
showcase the wide range of capabilities
accelerate the learning process
The idea here is to empower you to adapt these tools to various data analysis tasks.
4.1 Collecting weather data#
Let us go to NOAA to collect weather data for the Fort Myers area from Jan 1, 2020 to Dec 31, 2023. Go to Browse Datasets and then select Normals Daily: Search Tool. Then make the following selections:
Select Weather Observation Type/Dataset : Daily Summaries
Select Date Range : 2020-01-01 to 2023-12-31
Search For : Cities
Enter a Search Term : Fort Myers
Then add ‘Fort Myers, FL US’ to your cart and check out your data as ‘Custom GHCN-Daily CSV’.
At checkout, select the following:
Station Name
Geographic Location
Include Data Flags
Show All / Hide All | Select All / Deselect All
Precipitation
Air Temperature
Wind
Weather Type
Then submit your order. Wait a minute or so until your data is ready, then download it. It is always a good idea to also download the data documentation to get a better understanding of your data.
4.2 Importing a csv file#
We start by reading in some data in comma-separated value (.csv) format.
Once we have a valid path to a data file that Pandas knows how to read, we can open it with:
dataframe_name = pd.read_csv('filepath/your_file.csv')
# Read a csv file with Pandas
df = pd.read_csv('3593504.csv')
4.3 Displaying a DataFrame#
4.3.1 Display DataFrame#
If you are using a Jupyter notebook, do not use print(df). Jupyter can display a nicely rendered table if you just type your dataframe_name or use display(dataframe_name)
#Display your DataFrame
df
STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | DAPR | MDPR | PRCP | ... | TMIN | WDF2 | WDF5 | WSF2 | WSF5 | WT01 | WT02 | WT03 | WT05 | WT08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-01 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-02 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-03 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-04 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-05 | NaN | NaN | NaN | 0.28 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14584 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-27 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14585 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-28 | NaN | NaN | NaN | 0.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14586 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-29 | NaN | NaN | NaN | 0.18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14587 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-30 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14588 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-31 | NaN | NaN | NaN | 0.01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14589 rows × 24 columns
To avoid flooding your screen, Jupyter did not print the whole table. You can control how much you want to view. For example, try playing with these DataFrame methods: df.head(), df.head(3), df.tail(8)
# Show the first few rows in your DataFrame
df.head()
STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | DAPR | MDPR | PRCP | ... | TMIN | WDF2 | WDF5 | WSF2 | WSF5 | WT01 | WT02 | WT03 | WT05 | WT08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-01 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-02 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-03 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-04 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-05 | NaN | NaN | NaN | 0.28 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 24 columns
# Show the first three rows in your DataFrame
df.head(3)
STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | DAPR | MDPR | PRCP | ... | TMIN | WDF2 | WDF5 | WSF2 | WSF5 | WT01 | WT02 | WT03 | WT05 | WT08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-01 | NaN | NaN | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-02 | NaN | NaN | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 26.642262 | -81.848839 | 3.7 | 2020-01-03 | NaN | NaN | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 24 columns
#Show the last few rows in your DataFrame
df.tail(8)
STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | DAPR | MDPR | PRCP | ... | TMIN | WDF2 | WDF5 | WSF2 | WSF5 | WT01 | WT02 | WT03 | WT05 | WT08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14581 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-24 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14582 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-25 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14583 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-26 | NaN | NaN | NaN | 2.72 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14584 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-27 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14585 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-28 | NaN | NaN | NaN | 0.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14586 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-29 | NaN | NaN | NaN | 0.18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14587 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-30 | NaN | NaN | NaN | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14588 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 26.637205 | -81.877153 | 2.7 | 2023-12-31 | NaN | NaN | NaN | 0.01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 rows × 24 columns
4.3.2 Display the size of DataFrame#
You can learn the number of rows and columns in a DataFrame using df.shape, the number of rows using df.shape[0], and the number of columns using df.shape[1]
#Show number of rows and columns in your DataFrame
df.shape
(14589, 24)
#Show number of rows in your DataFrame
df.shape[0]
14589
#Show number of columns in your DataFrame
df.shape[1]
24
4.3.3 Display different data types in your DataFrame#
Let's check the data type of the columns ‘PRCP’, ‘STATION’, and ‘DATE’ using df['column_label'].dtypes or df.column_label.dtypes
The data type dtype(‘O’) indicates that the Pandas column contains objects, typically strings, rather than numerical or datetime values.
#Data type for column PRCP
df['PRCP'].dtypes
dtype('float64')
#Data type for column STATION
df['STATION'].dtypes
dtype('O')
#Data type for column DATE
df.DATE.dtypes
dtype('O')
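To inspect all columns at once rather than one at a time, you can also use df.dtypes, which returns a Series of data types indexed by column label:
```Python
# Show the data type of every column at once
df.dtypes
```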
4.3.4 Change data type in your DataFrame#
Pandas did not recognize DATE as a datetime format. To convert a column in a Pandas DataFrame to datetime format, you can use the pd.to_datetime() function:
df['column_name'] = pd.to_datetime(df['column_name'])
Here's how you can do it:
# Convert format of DATE to datetime format
df.DATE = pd.to_datetime(df.DATE)
#Data type for column DATE
df.DATE.dtypes
dtype('<M8[ns]')
Note
Changing data types in Pandas is not uncommon because sometimes Pandas cannot automatically format your data as you expect, so you need to convert the data format manually as we did above.
4.4 Filter columns by column labels#
My data has 24 columns and I want to focus only on these columns:
Column Label |
Description |
---|---|
STATION |
Station identification code (17 characters). |
NAME |
Name of the station (usually city/airport name, max 50 characters). |
DATE |
Year of the record (4 digits) followed by month (2 digits) and day (2 digits). |
PRCP |
Precipitation (mm or inches as per user preference, inches to hundredths on Daily Form pdf file) |
TMAX |
Maximum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file) |
TMIN |
Minimum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file) |
AWND |
Average daily wind speed (meters per second or miles per hour as per user preference) |
You can use the statement df = df[selected_columns] to select and update the DataFrame df so that it contains only the columns specified in the list selected_columns.
# List of column names to focus on
selected_columns = ['STATION', 'NAME', 'DATE', 'PRCP','TMAX', 'TMIN', 'AWND']
# Filter the DataFrame to include only the selected columns
df = df[selected_columns]
# Display new DataFrame
df
STATION | NAME | DATE | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|---|
0 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 2020-01-01 | 0.00 | NaN | NaN | NaN |
1 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 2020-01-02 | 0.00 | NaN | NaN | NaN |
2 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 2020-01-03 | 0.00 | NaN | NaN | NaN |
3 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 2020-01-04 | 0.00 | NaN | NaN | NaN |
4 | US1FLLE0037 | FORT MYERS 0.8 N, FL US | 2020-01-05 | 0.28 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
14584 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 2023-12-27 | 0.00 | NaN | NaN | NaN |
14585 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 2023-12-28 | 0.23 | NaN | NaN | NaN |
14586 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 2023-12-29 | 0.18 | NaN | NaN | NaN |
14587 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 2023-12-30 | 0.00 | NaN | NaN | NaN |
14588 | US1FLLE0072 | FORT MYERS 1.7 WNW, FL US | 2023-12-31 | 0.01 | NaN | NaN | NaN |
14589 rows × 7 columns
4.5 Filter rows by a keyword#
I only want stations at airports. You can filter a DataFrame based on whether a particular column contains a specific string using the df[column_name].str.contains(keyword) method in Pandas.
In other words, I want to find the rows where the column df['NAME'] contains the keyword 'AIRPORT', retain only those rows, and filter out all other rows. Here is how to do this:
# Specify the keyword you want to filter for in the 'NAME' column
keyword = 'AIRPORT'
#Use Pandas method str.contains() to create a boolean mask with name 'airport_mask'
# which is `True` for rows where the `NAME` column contains the specified keyword.
airport_mask = df['NAME'].str.contains(keyword)
# Display your mask for understanding
display(airport_mask)
# Use this mask to filter the DataFrame; that is to update it with the desired condition
df = df[airport_mask]
# Print the resulting DataFrame
df
0 False
1 False
2 False
3 False
4 False
...
14584 False
14585 False
14586 False
14587 False
14588 False
Name: NAME, Length: 14589, dtype: bool
STATION | NAME | DATE | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|---|
8216 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-01-01 | 0.00 | 77.0 | 53.0 | 4.92 |
8217 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-01-02 | 0.00 | 84.0 | 57.0 | 7.61 |
8218 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-01-03 | 0.00 | 84.0 | 69.0 | 9.62 |
8219 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-01-04 | 0.15 | 83.0 | 64.0 | 10.51 |
8220 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-01-05 | 0.00 | 68.0 | 49.0 | 7.16 |
... | ... | ... | ... | ... | ... | ... | ... |
12599 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-27 | 0.00 | 76.0 | 58.0 | 5.37 |
12600 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-28 | 0.35 | 69.0 | 57.0 | 4.03 |
12601 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-29 | 0.00 | 71.0 | 55.0 | 6.04 |
12602 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-30 | 0.00 | 67.0 | 51.0 | 5.14 |
12603 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-31 | 0.00 | 70.0 | 50.0 | NaN |
2922 rows × 7 columns
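Two practical notes on str.contains(): matching is case-sensitive by default, and rows with a missing NAME would produce NaN in the mask. Both can be handled with its optional arguments; a short sketch:
```Python
# Case-insensitive match that also treats missing station names as False
airport_mask = df['NAME'].str.contains('airport', case=False, na=False)
df[airport_mask]
```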
Showing unique values in a column#
How many stations do I have in that column after updating my DataFrame with that keyword?
To find the unique values in a specific column, in this case the NAME column of a DataFrame, you can use the .unique() method in Pandas.
#Display unique station names in 'NAME'
df['NAME'].unique()
array(['FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US',
'FORT MYERS PAGE FIELD AIRPORT, FL US'], dtype=object)
#Display unique station IDs in 'STATION'
df['STATION'].unique()
array(['USW00012894', 'USW00012835'], dtype=object)
4.6 Filter rows by a specific value#
If you want to filter your DataFrame to only include rows where the STATION column has the value ‘USW00012835’, you can also use boolean indexing:
df['STATION'] == 'USW00012835'
Let us do this here and save the result as a new DataFrame one_station by using .copy().
#Use Pandas to create a boolean mask named station_mask
#for all rows where the column `STATION` has the value 'USW00012835'
station_mask = df['STATION'] == 'USW00012835'
# Use the station_mask to filter DataFrame for rows where 'Station' is 'USW00012835' and save it to a new DataFrame 'one_station'
one_station = df[station_mask].copy()
# Print the resulting DataFrame
one_station
STATION | NAME | DATE | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|---|
11143 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-01-01 | 0.00 | 78.0 | 54.0 | 4.70 |
11144 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-01-02 | 0.00 | 85.0 | 59.0 | 7.16 |
11145 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-01-03 | 0.00 | 86.0 | 70.0 | 8.28 |
11146 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-01-04 | 0.17 | 82.0 | 65.0 | 10.07 |
11147 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-01-05 | 0.00 | 70.0 | 50.0 | 8.95 |
... | ... | ... | ... | ... | ... | ... | ... |
12599 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-27 | 0.00 | 76.0 | 58.0 | 5.37 |
12600 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-28 | 0.35 | 69.0 | 57.0 | 4.03 |
12601 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-29 | 0.00 | 71.0 | 55.0 | 6.04 |
12602 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-30 | 0.00 | 67.0 | 51.0 | 5.14 |
12603 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2023-12-31 | 0.00 | 70.0 | 50.0 | NaN |
1461 rows × 7 columns
Note
When using .copy(), you explicitly create a new DataFrame, ensuring that modifications to the new DataFrame do not affect the original DataFrame df. In other words, you clearly tell Pandas that you want an independent copy of the data and not just a different view.
Now I have the data for one station for my study period. While my header is informative, my index is not. We generally make the date-time column the index, as shown next.
4.7 Make a datetime column the DataFrame index#
To make the DATE column the index of your DataFrame, you can use:
# Note: inplace=True argument will modify the DataFrame directly, instead of creating a new one
df.set_index('Date_Column_Name',inplace=True)
However, before we set our DATE column as the index, let us change its format to datetime format. We already did this conversion above, but let us do it again.
# Convert the "DATE" column to datetime format using pd.to_datetime()
one_station['DATE'] = pd.to_datetime(one_station['DATE'])
# Set the "DATE" column as the index of your DataFrame
one_station.set_index('DATE', inplace=True)
#Display your DataFrame with `DATE` as index
display(one_station)
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-01-01 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 78.0 | 54.0 | 4.70 |
2020-01-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 85.0 | 59.0 | 7.16 |
2020-01-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 86.0 | 70.0 | 8.28 |
2020-01-04 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.17 | 82.0 | 65.0 | 10.07 |
2020-01-05 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | 8.95 |
... | ... | ... | ... | ... | ... | ... |
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37 |
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03 |
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN |
1461 rows × 6 columns
Note
Pandas may or may not understand the format of the DATE column as a date. Thus, it is generally good practice to convert the column to datetime format before setting it as the index, especially if you plan to perform time-based operations later on. If the column is not in datetime format, some time-related functionality may not work correctly.
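As an aside, you can often skip the two-step conversion entirely by parsing dates while reading the file. A minimal sketch (df_dates is a hypothetical name for illustration):
```Python
# Parse DATE while reading the CSV and set it as the index in one step
df_dates = pd.read_csv('3593504.csv', parse_dates=['DATE'], index_col='DATE')
```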
Check the index#
Now let us see how the index of our original DataFrame, df.index, compares with the index of the one we just modified, one_station.index
#Index of the original DataFrame df
df.index
Index([ 8216, 8217, 8218, 8219, 8220, 8221, 8222, 8223, 8224, 8225,
...
12594, 12595, 12596, 12597, 12598, 12599, 12600, 12601, 12602, 12603],
dtype='int64', length=2922)
#Index of the DataFrame one_station after we changed its index to be the DATE column
one_station.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10',
...
'2023-12-22', '2023-12-23', '2023-12-24', '2023-12-25',
'2023-12-26', '2023-12-27', '2023-12-28', '2023-12-29',
'2023-12-30', '2023-12-31'],
dtype='datetime64[ns]', name='DATE', length=1461, freq=None)
You can also show the range of your index with the .index.min() and .index.max() methods.
Try it out:
#print min index value in DataFrame df
print(df.index.min())
#Print max index value in DataFrame df
print(df.index.max())
#print min index value in DataFrame one_station
print(one_station.index.min())
#Print max index value in DataFrame one_station
print(one_station.index.max())
8216
12603
2020-01-01 00:00:00
2023-12-31 00:00:00
Note
You might be wondering why the index of df does not start from 0. That is because we have already removed some rows from it, and the retained rows keep their original index values. If you do not want this, you can reset your index using df = df.reset_index().
Try looking at the column labels by using one_station.columns
#Print column labels in DataFrame one_station
one_station.columns
Index(['STATION', 'NAME', 'PRCP', 'TMAX', 'TMIN', 'AWND'], dtype='object')
4.8 Descriptive statistics#
Pandas DataFrames contain useful methods for getting summary statistics. Available methods include describe(), info(), count(), mean(), median(), min(), max(), and std() (the standard deviation). You can also do df['column_name'].describe()
#Descriptive statistics for one_station Dataframe
one_station.describe()
PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|
count | 1459.000000 | 1457.000000 | 1457.000000 | 1447.000000 |
mean | 0.156854 | 86.536719 | 68.231297 | 7.100104 |
std | 0.456492 | 7.159212 | 8.460276 | 2.668980 |
min | 0.000000 | 48.000000 | 35.000000 | 2.010000 |
25% | 0.000000 | 83.000000 | 63.000000 | 5.140000 |
50% | 0.000000 | 88.000000 | 70.000000 | 6.710000 |
75% | 0.040000 | 92.000000 | 75.000000 | 8.500000 |
max | 4.770000 | 98.000000 | 82.000000 | 29.970000 |
You can also do it per column
#Descriptive statistics for the column 'PRCP' in the one_station DataFrame (dot notation)
one_station.PRCP.describe()
count 1459.000000
mean 0.156854
std 0.456492
min 0.000000
25% 0.000000
50% 0.000000
75% 0.040000
max 4.770000
Name: PRCP, dtype: float64
You can also use bracket (dictionary-style) notation one_station['PRCP'] instead of dot notation one_station.PRCP.
#Descriptive statistics for the column 'PRCP' in the one_station DataFrame (bracket notation)
one_station['PRCP'].describe()
count 1459.000000
mean 0.156854
std 0.456492
min 0.000000
25% 0.000000
50% 0.000000
75% 0.040000
max 4.770000
Name: PRCP, dtype: float64
Note
The dot notation will not work if the column name is not a valid variable name (e.g., contains a space, starts with a number, etc.).
You can also compute statistics for a group of columns:
# List of columns to show
columns_group = ['PRCP', 'AWND']
#Descriptive statistics for a selected columns in one_station Dataframe
one_station[columns_group].describe()
PRCP | AWND | |
---|---|---|
count | 1459.000000 | 1447.000000 |
mean | 0.156854 | 7.100104 |
std | 0.456492 | 2.668980 |
min | 0.000000 | 2.010000 |
25% | 0.000000 | 5.140000 |
50% | 0.000000 | 6.710000 |
75% | 0.040000 | 8.500000 |
max | 4.770000 | 29.970000 |
4.9 Resampling of time-series data#
Remember, Pandas is powerful because it is a programmable spreadsheet. For example, you can find the min, max, mean, or median value per month, week, year, or any other criterion. You can use the method
df.resample("Frequency").aggregation_method()
Frequency: The target frequency you want to resample to. Pandas supports many frequencies:
“D” - Calendar day
“B” - Business day
“M” - Monthly frequency
“W” - Weekly frequency
“h” - Hourly frequency
“min” - Minutely frequency
and many more as shown in Pandas documentation.
Aggregation Methods: When downsampling, you often want to summarize groups of data using Pandas-provided methods such as:
mean() - Calculates the mean
sum() - Calculates the sum
min() - Finds the minimum value
max() - Finds the maximum value
std() - Calculates the standard deviation
… and others
For example, find the max PRCP, TMAX, TMIN and AWND for each week in your dataset
#Max PRCP, TMAX, TMIN and AWND in every week of my dataset using .resample() method
one_station.resample("W").max()
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-01-05 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.17 | 86.0 | 70.0 | 10.07 |
2020-01-12 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 88.0 | 70.0 | 11.63 |
2020-01-19 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 85.0 | 68.0 | 10.07 |
2020-01-26 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 79.0 | 61.0 | 14.09 |
2020-02-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.47 | 78.0 | 60.0 | 9.17 |
... | ... | ... | ... | ... | ... | ... |
2023-12-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 86.0 | 71.0 | 9.40 |
2023-12-10 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.18 | 84.0 | 68.0 | 10.07 |
2023-12-17 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 1.23 | 77.0 | 68.0 | 15.66 |
2023-12-24 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 78.0 | 59.0 | 10.51 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 2.01 | 80.0 | 67.0 | 8.50 |
209 rows × 6 columns
Find the min PRCP, TMAX, TMIN and AWND for each year in my dataset
#Min PRCP, TMAX, TMIN and AWND for each year using the .resample() method
one_station.resample("Y").min()
C:\Users\aelshall\AppData\Local\Temp\ipykernel_3728\152910108.py:2: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
one_station.resample("Y").min()
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 59.0 | 37.0 | 2.01 |
2021-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 61.0 | 41.0 | 2.24 |
2022-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 48.0 | 35.0 | 2.46 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 62.0 | 40.0 | 2.68 |
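Notice the FutureWarning above: recent Pandas versions prefer the alias 'YE' (year-end) over the deprecated 'Y'. Assuming you are on such a version, the same call without the warning would be:
```Python
# 'YE' (year-end) replaces the deprecated 'Y' alias
one_station.resample("YE").min()
```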
This is mainly a quick overview of the .resample() method. To learn this method in detail, check the Pandas user guide on resampling.
4.10 Groupby#
While resample is specifically designed to change the frequency of the time series, groupby is a more general-purpose tool for grouping data based on some criteria, which can include time-related criteria but is not limited to time-series data. It can be used with any categorical or numerical criteria for grouping. It’s more flexible and allows for custom aggregation functions.
4.10.1 Groupby for categorical data#
For example, what if we want to find the monthly mean of TMAX and TMIN for the two weather stations at the two airports? Let us first create the DataFrame two_stations for these two weather stations.
#Create a new DataFrame containing only the stations at airports
two_stations = df.copy()
# Convert the "DATE" column to datetime format
two_stations['DATE'] = pd.to_datetime(two_stations['DATE'])
# Set the "DATE" column as the index of your DataFrame
two_stations.set_index('DATE', inplace=True)
#Display data for two stations
two_stations
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-01-01 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 77.0 | 53.0 | 4.92 |
2020-01-02 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 84.0 | 57.0 | 7.61 |
2020-01-03 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 84.0 | 69.0 | 9.62 |
2020-01-04 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.15 | 83.0 | 64.0 | 10.51 |
2020-01-05 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 68.0 | 49.0 | 7.16 |
... | ... | ... | ... | ... | ... | ... |
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37 |
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03 |
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN |
2922 rows × 6 columns
We can see that we have two stations, “USW00012894” and “USW00012835”.
We want to find the monthly mean of our data. However, if you try to apply a method such as mean() or resample() as we did before in Section 4.9, you will get an error
TypeError: agg function failed [how->mean,dtype->object]
suggesting that you have non-numeric columns in your DataFrame, and the mean() operation is not applicable to those columns.
#This commented code will give you an error because two_stations has non-numeric data
#two_stations.mean()
Note
In case you are wondering why the .mean() method worked with the one_station DataFrame but not with the two_stations DataFrame, even though one_station also contained non-numeric values in the columns NAME and STATION: in the one_station DataFrame the values in the non-numeric columns do not change because there is only one station. But in the two_stations DataFrame the values in the non-numeric columns NAME and STATION are changing because we have two stations, so we cannot apply the .mean() method. In other words, what is the mean of "USW00012894" and "USW00012835"?
Thus, before using resample() and mean(), it's a good idea to ensure that all columns you're aggregating are numeric.
Let us filter out STATION and NAME, which are the non-numeric columns, and keep only the numeric columns PRCP, TMAX, TMIN, and AWND. For this filtering, you can do two_stations[columns_to_keep] as we learned before in Section 4.4. Then we can apply .mean(), .resample('Y').mean(), or whatever method you need to use.
#Select the numeric columns that you want
columns_to_keep = ['PRCP', 'TMAX', 'TMIN', 'AWND']
#Apply mean to numeric columns
two_stations[columns_to_keep].resample('Y').mean()
C:\Users\aelshall\AppData\Local\Temp\ipykernel_3728\1940048899.py:5: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
two_stations[columns_to_keep].resample('Y').mean()
PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|
DATE | ||||
2020-12-31 | 0.149454 | 85.946721 | 67.554036 | 7.337937 |
2021-12-31 | 0.130947 | 85.502740 | 66.902740 | 6.733370 |
2022-12-31 | 0.187269 | 86.472452 | 67.749311 | 6.953064 |
2023-12-31 | 0.126401 | 86.517808 | 68.080822 | 7.132926 |
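As an alternative to listing the numeric columns by hand, you can let Pandas pick them for you with select_dtypes(). A short sketch, using the newer 'YE' alias:
```Python
# Automatically keep only the numeric columns, then resample by year
two_stations.select_dtypes(include='number').resample('YE').mean()
```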
Either approach works, but using
df.groupby(groupby_column)[columns_to_show].method_you_need()
can be more powerful because it will show these values for each category, which in this case means for each station.
#Select columns only
columns_to_keep = ['PRCP', 'TMAX', 'TMIN', 'AWND']
# For the two_stations DataFrame group by 'STATION' then apply .resample('Y').mean() to the selected columns only
# groupby -> select columns -> apply method
two_stations.groupby(['STATION'])[columns_to_keep].resample('Y').mean()
C:\Users\aelshall\AppData\Local\Temp\ipykernel_3728\579840131.py:6: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
two_stations.groupby(['STATION'])[columns_to_keep].resample('Y').mean()
PRCP | TMAX | TMIN | AWND | ||
---|---|---|---|---|---|
STATION | DATE | ||||
USW00012835 | 2020-12-31 | 0.156913 | 86.120219 | 68.005464 | 7.369235 |
2021-12-31 | 0.148301 | 86.112329 | 67.654795 | 6.754658 | |
2022-12-31 | 0.192775 | 86.808864 | 68.307479 | 6.998244 | |
2023-12-31 | 0.129451 | 87.109589 | 68.958904 | 7.275152 | |
USW00012894 | 2020-12-31 | 0.141995 | 85.773224 | 67.101370 | 7.306639 |
2021-12-31 | 0.113544 | 84.893151 | 66.150685 | 6.712082 | |
2022-12-31 | 0.181717 | 86.139726 | 67.197260 | 6.909370 | |
2023-12-31 | 0.123352 | 85.926027 | 67.202740 | 6.991479 |
Try it again, but this time group by ‘NAME’ or by both columns.
#Select columns only
columns_to_keep = ['PRCP', 'TMAX', 'TMIN', 'AWND']
# For the two_stations DataFrame, group by 'NAME' then apply .resample('Y').mean() to the selected columns only
# groupby -> select columns -> apply method
two_stations.groupby(['NAME'])[columns_to_keep].resample('Y').mean()
C:\Users\aelshall\AppData\Local\Temp\ipykernel_3728\1860762282.py:6: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
two_stations.groupby(['NAME'])[columns_to_keep].resample('Y').mean()
PRCP | TMAX | TMIN | AWND | ||
---|---|---|---|---|---|
NAME | DATE | ||||
FORT MYERS PAGE FIELD AIRPORT, FL US | 2020-12-31 | 0.156913 | 86.120219 | 68.005464 | 7.369235 |
2021-12-31 | 0.148301 | 86.112329 | 67.654795 | 6.754658 | |
2022-12-31 | 0.192775 | 86.808864 | 68.307479 | 6.998244 | |
2023-12-31 | 0.129451 | 87.109589 | 68.958904 | 7.275152 | |
FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 2020-12-31 | 0.141995 | 85.773224 | 67.101370 | 7.306639 |
2021-12-31 | 0.113544 | 84.893151 | 66.150685 | 6.712082 | |
2022-12-31 | 0.181717 | 86.139726 | 67.197260 | 6.909370 | |
2023-12-31 | 0.123352 | 85.926027 | 67.202740 | 6.991479 |
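groupby can also apply several aggregations at once through .agg(). A minimal sketch:
```Python
# Per-station mean and max of each selected column in one call
columns_to_keep = ['PRCP', 'TMAX', 'TMIN', 'AWND']
two_stations.groupby('STATION')[columns_to_keep].agg(['mean', 'max'])
```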
4.11 Slicing with loc and iloc#
Slicing is a method used to extract a subset of data from a larger data structure, such as a list, array, or DataFrame, by specifying a range or specific indices.
The following two ways to access data are preferred by Pandas over other methods:
For .loc["row_start":"row_end", "col_start":"col_end"]:
.loc[] : Access data by label.
"row_start":"row_end" : Specify the range of rows by their labels, inclusively.
"col_start":"col_end" : Specify the range of columns by their labels, inclusively.
For .iloc[row_start:row_end, col_start:col_end]:
.iloc[] : Access data by index position.
row_start:row_end : Specify the range of rows by their index positions, exclusively.
col_start:col_end : Specify the range of columns by their index positions, exclusively.
The .loc[] method, which accesses data by label, is inclusive, meaning it includes the specified start and end labels.
The .iloc[] method, which accesses data by index position, is exclusive, meaning it includes the start index but excludes the end index.
The above is the general syntax, but there are many special cases.
4.11.1 Slicing with loc#
Let us slice from index label 2023-12-30 to 2023-12-31 and column labels STATION to PRCP.
Note
In Pandas, to perform a label-based partial slicing operation, your DataFrame index needs to be sorted in ascending order. You can do this using .sort_index(inplace=True).
# Sort the two_stations DataFrame by the 'DATE' index in ascending order
two_stations.sort_index(inplace=True)
# View last 5 rows in two_stations DataFrame
display(two_stations.tail(5))
# View slice from '2023-12-30' to '2023-12-31', and from 'STATION' to 'PRCP' using loc (inclusive)
display(two_stations.loc['2023-12-30':'2023-12-31', 'STATION':'PRCP'])
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2023-12-29 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 70.0 | 54.0 | 4.92 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 67.0 | 51.0 | 5.14 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 51.0 | 4.70 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 69.0 | 48.0 | 2.01 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 70.0 | 50.0 | NaN |
STATION | NAME | PRCP | |
---|---|---|---|
DATE | |||
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 |
You can also do '2023-12-30': to retrieve all rows from the specified start date 2023-12-30 to the last row. You can do the same for columns with 'STATION': if needed.
# In two_stations DataFrame, view slice from '2023-12-30' to the last row, and from 'STATION' to the last column using loc
display(two_stations.loc['2023-12-30':, 'STATION':])
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 67.0 | 51.0 | 5.14 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 51.0 | 4.70 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 69.0 | 48.0 | 2.01 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 70.0 | 50.0 | NaN |
You can also specify exactly which columns you need, e.g., ['NAME','STATION','PRCP'], instead of using a range.
# In two_stations DataFrame, view slice from '2023-12-30' to the last row, for columns 'NAME', 'STATION', and 'PRCP' in that order
display(two_stations.loc['2023-12-30':, ['NAME','STATION','PRCP']])
NAME | STATION | PRCP | |
---|---|---|---|
DATE | |||
2023-12-30 | FORT MYERS PAGE FIELD AIRPORT, FL US | USW00012835 | 0.0 |
2023-12-30 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | USW00012894 | 0.0 |
2023-12-31 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | USW00012894 | 0.0 |
2023-12-31 | FORT MYERS PAGE FIELD AIRPORT, FL US | USW00012835 | 0.0 |
Special case for a datetime index#
If your index is a datetime index you can do even more, like slicing a certain year.
For example, let us say I want to slice the data of 2023. It is very simple: .loc['2023']
#Slice 2023 data in two_stations DataFrame using .loc[]
two_stations.loc['2023']
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2023-01-01 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 87.0 | 67.0 | 3.13 |
2023-01-01 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 86.0 | 67.0 | 2.91 |
2023-01-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 87.0 | 67.0 | 6.49 |
2023-01-02 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 85.0 | 65.0 | 6.26 |
2023-01-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 86.0 | 66.0 | 6.93 |
... | ... | ... | ... | ... | ... | ... |
2023-12-29 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 70.0 | 54.0 | 4.92 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 67.0 | 51.0 | 5.14 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 51.0 | 4.70 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 69.0 | 48.0 | 2.01 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 70.0 | 50.0 | NaN |
730 rows × 6 columns
4.11.2 Slicing with iloc#
Do the same with iloc, slicing from rows 2023-12-30 to 2023-12-31 and columns STATION to PRCP.
First find the shape of this DataFrame to get an idea about the index number for the last four rows.
#Find the shape of two_stations DataFrame first
two_stations.shape
(2922, 6)
# View slice from '2023-12-30' to '2023-12-31', and from 'STATION' to 'PRCP' using iloc (exclusive)
display(two_stations.iloc[2918:2922, 0:3])
STATION | NAME | PRCP | |
---|---|---|---|
DATE | |||
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 |
You can also do 2918: to retrieve all rows from the specified index to the last row.
# View slice from '2023-12-30' to the last row, and from 'STATION' to the last column using iloc (exclusive)
display(two_stations.iloc[2918:, 0:])
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 67.0 | 51.0 | 5.14 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 51.0 | 4.70 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 69.0 | 48.0 | 2.01 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 70.0 | 50.0 | NaN |
In the above example, since you are starting from column index 0, you can just write : which selects all columns.
# View slice from '2023-12-30' to the last row, and from 'STATION' to the last column using iloc (exclusive)
display(two_stations.iloc[2918:, :])
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 67.0 | 51.0 | 5.14 |
2023-12-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 51.0 | 4.70 |
2023-12-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 69.0 | 48.0 | 2.01 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 70.0 | 50.0 | NaN |
There are so many ways for slicing through a DataFrame. For more examples and details, check Introduction to Pandas by Project Pythia.
4.12 Dicing#
While slicing involves selecting specific rows and columns based on their indices or labels, dicing goes a step further by selecting specific elements based on some condition or criteria. It looks something like this:
# for one condition and multiple columns to show
df.loc[condition_on_rows, columns_to_show]
# for multiple conditions and multiple columns to show
df.loc[(condition_on_rows) & (condition_on_rows), columns_to_show]
You can combine conditions with the logical operators & (and) and | (or), and build each condition with comparison operators such as <, <=, >, >=, !=, and ==.
Let us do a simple example investigating extreme values. Find the days where 'TMIN'<=36
#Condition 'TMIN'<=36 (dot notation)
condition= two_stations.TMIN<=36
#Retrieving all data with the above condition
two_stations.loc[condition]
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-01-22 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 61.0 | 35.0 | 10.29 |
2022-01-24 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 66.0 | 36.0 | 3.58 |
2022-01-30 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 60.0 | 35.0 | 5.59 |
2022-01-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.0 | 62.0 | 35.0 | 6.93 |
2022-01-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.0 | 70.0 | 36.0 | 5.14 |
What we did in the above example is create a mask with a condition that returns the Boolean value True when the condition is met and False otherwise.
Now let us look at more extreme values by using two conditions: 'PRCP'>0 and 'TMAX'>=98
#Conditions of 'PRCP'>0 and 'TMAX'>=98 (dot notation)
conditions= (two_stations.PRCP>0) & (two_stations.TMAX>=98)
#Retrieving all data with the above conditions
two_stations.loc[conditions]
STATION | NAME | PRCP | TMAX | TMIN | AWND | |
---|---|---|---|---|---|---|
DATE | ||||||
2020-07-31 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.01 | 98.0 | 77.0 | 5.14 |
2022-08-22 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 1.11 | 98.0 | 76.0 | 5.82 |
One more argument, ["column_to_select1", "column_to_select2", ...], lists the columns that you want to retrieve. For example, suppose we want to know the wind speed and station ID when the two conditions 'PRCP'>0 and 'TMAX'>=98 are met.
# Conditions of 'PRCP'>0 and 'TMAX'>=98 (dot notation)
conditions= (two_stations.PRCP>0) & (two_stations.TMAX>=98)
# Columns to show
columns_to_show = ['AWND','STATION']
#Retrieving all data with the above conditions and showing selected columns
two_stations.loc[conditions, columns_to_show ]
AWND | STATION | |
---|---|---|
DATE | ||
2020-07-31 | 5.14 | USW00012894 |
2022-08-22 | 5.82 | USW00012894 |
In the above examples, we used dot notation, but we can also use bracket notation. Check Section 4.8 for more information about the difference between dot and bracket notation.
# Conditions of 'PRCP'>0 and 'TMAX'>=98 (bracket notation)
conditions= (two_stations['PRCP']>0) & (two_stations['TMAX']>=98)
# Columns to show
columns_to_show = ['AWND','STATION']
#Retrieving all data with the above conditions and showing selected columns
two_stations.loc[conditions, columns_to_show ]
AWND | STATION | |
---|---|---|
DATE | ||
2020-07-31 | 5.14 | USW00012894 |
2022-08-22 | 5.82 | USW00012894 |
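As a side note, the same dicing can be written with the .query() method, which takes the conditions as a string; whether you prefer it over boolean masks is a matter of style:
```Python
# Equivalent dicing expressed as a query string
two_stations.query("PRCP > 0 and TMAX >= 98")[['AWND', 'STATION']]
```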
4.13 Slicing and dicing together#
You can first slice and then dice, or vice versa, using method chaining. For example, df.loc[dice].loc[slice].
For example, retrieve wind and station data with 'TMAX'>=96 and 'PRCP'>0 in 2023.
#conditions of 'PRCP'>0 and 'TMAX'>=96
conditions= (two_stations["PRCP"] > 0) & (two_stations["TMAX"]>=96)
#columns to show
columns_to_show= ['STATION','AWND']
# Dice to my condition and columns to show -> slice to 2023 that is from '2023-01-01' to '2023-12-31'
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31']
STATION | AWND | |
---|---|---|
DATE | ||
2023-07-04 | USW00012835 | 3.58 |
2023-07-20 | USW00012835 | 5.14 |
2023-07-21 | USW00012835 | 4.25 |
2023-08-12 | USW00012894 | 4.03 |
2023-08-12 | USW00012835 | 3.80 |
2023-08-15 | USW00012835 | 4.03 |
Challenge your understanding#
Find the mean wind speed when 'TMAX'>=96 and 'PRCP'>0 in 2023. This is the same as the above example, but then you need to select the column AWND using .AWND, ['AWND'], or .loc[:,'AWND']. Then add one more chain with the .mean() method.
#conditions of 'PRCP'>0 and 'TMAX'>=96
conditions= (two_stations["PRCP"] > 0) & (two_stations["TMAX"]>=96)
#columns to show
columns_to_show= ['STATION','AWND']
# Approach 1
# Dice to my condition and columns to show -> slice to 2023
# -> select AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31'].AWND.mean()
np.float64(4.138333333333334)
You can also try the two other indexing approaches suggested above
# Approach 2
# Dice to my condition and columns to show -> slice to 2023
# -> index to AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31']['AWND'].mean()
np.float64(4.138333333333334)
# Approach 3
# Dice to my condition and columns to show -> slice to 2023
# -> index to AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31'].loc[:,'AWND'].mean()
np.float64(4.138333333333334)
Ask an LLM (optional)#
Below are three approaches for finding the mean AWND under the extreme conditions PRCP>0 and TMAX>=96 in 2023. Ask an LLM if there are other approaches to find that mean. Here is the prompt that I provided:
Below are three approaches to find that mean of AWND. Can you suggest another approach?
```Python
#conditions of 'PRCP'>0 and 'TMAX'>=96
conditions= (two_stations["PRCP"] > 0) & (two_stations["TMAX"]>=96)
#columns to show
columns_to_show= ['STATION','AWND']
# Approach 1
# Dice to my condition and columns to show -> slice to 2023
# -> index to AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31'].AWND.mean()
# Approach 2
# Dice to my condition and columns to show -> slice to 2023
# -> index to AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31']['AWND'].mean()
#Approach 3
# Dice to my condition and columns to show -> slice to 2023
# -> index to AWND column -> use mean() method
two_stations.loc[conditions, columns_to_show].loc['2023-01-01':'2023-12-31'].loc[:,'AWND'].mean()
```
I am writing in Jupyter notebook using the Python 3 (system-wide) kernel.
I provided the above prompt to ChatGPT 3.5. Here are the responses that I got:
ChatGPT 3.5 provided an approach to filter the data and calculate the mean, and suggested to “choose the method that best fits your coding style and preference.”
# Approach 4
# Use loc to apply conditions, slice to 2023, select the 'AWND' column, and calculate the mean directly
two_stations.loc[conditions & (two_stations.index.year == 2023), 'AWND'].mean()
np.float64(4.138333333333334)
ChatGPT suggested subsetting using the datetime index with .index.year. This is what we will learn next. You can also ask your LLM to explain the solution to you.
4.14 Subsetting using Datetime Column#
Slicing as shown in Section 4.11 is a useful technique for subsetting a DataFrame, but there are also other options that can be equally useful.
Before we start, let us first create a fresh copy of our DataFrame, just to review and remember what we did before.
#Create a new DataFrame containing only the stations at airports
two_stations = df.copy()
# Convert the "DATE" column to datetime format
two_stations['DATE'] = pd.to_datetime(two_stations['DATE'])
# Set the "DATE" column as the index of your DataFrame
two_stations.set_index('DATE', inplace=True)
If your DataFrame uses datetime values for its index, you can select data from a single day using df.index.day, a single month using df.index.month, a single year using df.index.year, and so on.
For example, the expression
two_stations.index.month == 12
will create a boolean mask where each element indicates whether the month component of that row's index equals 12. In other words, this expression gives True for rows where the month is December (month number 12) and False otherwise. This boolean mask can then be used to filter rows to those in December, for example
two_stations.loc[two_stations.index.month == 12]
Let us say that we want to calculate the mean of the TMIN and TMAX columns for the month of December across all years in our two_stations DataFrame. How do we approach this task? Here is one strategy:
Filter the rows of the two_stations DataFrame based on the December condition described above
Select the columns TMIN and TMAX from the DataFrame using [['TMIN','TMAX']] or .loc[:, ['TMIN','TMAX']]
Apply the .mean() function to the filtered DataFrame
You can also do step 2 (column selection) before step 1 (filtering by month using the datetime subset).
Let us try to test both of them:
# Approach 1 Datetime subset -> Select
#Subset condition with datetime index: month == 12 (December)
condition = two_stations.index.month == 12
#Subset given my condition -> select columns .loc[:,['TMIN','TMAX']]-> apply .mean() function
two_stations.loc[condition,:].loc[:,['TMIN','TMAX']].mean()
TMIN 58.584677
TMAX 77.197581
dtype: float64
Test your understanding#
Redo the above example, but apply step 2 (column selection) before step 1 (filtering for the month of December).
# Approach 2 Select -> Datetime subset
#Subset condition with datetime index is the month == 12 (December)
condition = two_stations.index.month == 12
# Select columns .loc[:,['TMIN','TMAX']] -> Subset given my condition -> apply .mean() function
two_stations.loc[:,['TMIN','TMAX']].loc[condition,:].mean()
TMIN 58.584677
TMAX 77.197581
dtype: float64
Another example with an LLM (optional)#
Write a for loop to find the mean TMIN and TMAX from Jan to Dec over the period 2020-01-01 to 2023-12-31. For each iteration, print the month, TMIN, and TMAX in a table format. To print in table format you can install the package tabulate, which can do this for you. We did not cover this package, but you can ask an LLM to solve this problem for you.
Here is one prompt:
Do a for loop to find the mean TMIN and TMAX for the study period, using this code
```Python
#Subset to data in month 12 (December) and get the mean TMIN and TMAX in December throughout the study period
two_stations.loc[:,['TMIN','TMAX']].loc[two_stations.index.month == 12].mean()
```
For each iteration print the month name, TMIN, and TMAX in a table format.
This is the solution that I got from ChatGPT 3.5.
#Install tabulate if you are using it for the first time by uncommenting the pip install command
#pip install tabulate
# ChatGPT 3.5 Solution
#Import two modules
import calendar
from tabulate import tabulate
# Initialize a list to store the results
results = []
# Iterate over each month
for month in range(1, 13):
#Subset condition with datetime index: month equals the current month in the loop
condition= two_stations.index.month == month
# Subset data for the current month and calculate mean TMIN and TMAX
monthly_data = two_stations.loc[:, ['TMIN', 'TMAX']].loc[condition,:]
mean_TMIN = monthly_data['TMIN'].mean()
mean_TMAX = monthly_data['TMAX'].mean()
# Get month name
month_name = calendar.month_name[month]
# Append the results to the list
results.append([month_name, round(mean_TMIN, 2), round(mean_TMAX, 2)])
# Print results in table format
print(tabulate(results, headers=["Month", "Mean TMIN", "Mean TMAX"], tablefmt="pretty"))
+-----------+-----------+-----------+
| Month | Mean TMIN | Mean TMAX |
+-----------+-----------+-----------+
| January | 56.1 | 76.36 |
| February | 60.26 | 81.19 |
| March | 62.72 | 84.81 |
| April | 66.59 | 87.55 |
| May | 69.44 | 89.94 |
| June | 74.18 | 90.94 |
| July | 76.18 | 92.62 |
| August | 76.21 | 92.82 |
| September | 74.6 | 90.66 |
| October | 70.92 | 87.29 |
| November | 64.87 | 81.77 |
| December | 58.58 | 77.2 |
+-----------+-----------+-----------+
A comprehensive example (advanced)#
Let us look at an example where we can apply many of the methods that we have learned. Let us resolve the above problem without using loops. We can use something that we learned before: groupby from Section 4.10. Remember, groupby is a tool for grouping data based on some criteria, which can include time-related criteria but is not limited to time-series data.
Here is a solution strategy:
Use something like .index.month to find the month number for each row
Add the month number for each row as a new column, e.g., 'month'
Select the columns 'TMIN', 'TMAX', 'month': [['TMIN','TMAX','month']] or .loc[:, ['TMIN','TMAX','month']]
Group by 'month': .groupby('month')
Apply the .mean() function
So far we have used the DataFrame as-is, without adding or removing any columns. Let us now add a new column 'month' to two_stations. The value returned by df.index.month is used to obtain the data for this new column:
#Add new column 'month' with its values obtained from index
two_stations['month'] = two_stations.index.month
#Display new DataFrame
two_stations
STATION | NAME | PRCP | TMAX | TMIN | AWND | month | |
---|---|---|---|---|---|---|---|
DATE | |||||||
2020-01-01 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 77.0 | 53.0 | 4.92 | 1 |
2020-01-02 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 84.0 | 57.0 | 7.61 | 1 |
2020-01-03 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 84.0 | 69.0 | 9.62 | 1 |
2020-01-04 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.15 | 83.0 | 64.0 | 10.51 | 1 |
2020-01-05 | USW00012894 | FORT MYERS SW FLORIDA REGIONAL AIRPORT, FL US | 0.00 | 68.0 | 49.0 | 7.16 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37 | 12 |
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03 | 12 |
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04 | 12 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14 | 12 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN | 12 |
2922 rows × 7 columns
Select the columns that you want to apply groupby to. Then group by month and apply .mean().
# Select columns -> Groupby month -> Apply mean to find the mean for each month
two_stations.loc[:, ['TMIN','TMAX','month']].groupby('month').mean()
month | TMIN | TMAX
---|---|---
1 | 56.100806 | 76.362903 |
2 | 60.261062 | 81.185841 |
3 | 62.721774 | 84.810484 |
4 | 66.591667 | 87.550000 |
5 | 69.435484 | 89.935484 |
6 | 74.175732 | 90.941667 |
7 | 76.177419 | 92.625000 |
8 | 76.213710 | 92.822581 |
9 | 74.599156 | 90.658228 |
10 | 70.923077 | 87.287449 |
11 | 64.870833 | 81.766667 |
12 | 58.584677 | 77.197581 |
In the above example, I solved the problem in five steps, adding the new column month for illustration purposes. You can solve it in only three steps, without adding the new column month, as follows:
# Select columns -> Groupby month -> Apply mean to find the mean for each month
two_stations.loc[:, ['TMIN','TMAX']].groupby(two_stations.index.month).mean()
DATE | TMIN | TMAX
---|---|---
1 | 56.100806 | 76.362903 |
2 | 60.261062 | 81.185841 |
3 | 62.721774 | 84.810484 |
4 | 66.591667 | 87.550000 |
5 | 69.435484 | 89.935484 |
6 | 74.175732 | 90.941667 |
7 | 76.177419 | 92.625000 |
8 | 76.213710 | 92.822581 |
9 | 74.599156 | 90.658228 |
10 | 70.923077 | 87.287449 |
11 | 64.870833 | 81.766667 |
12 | 58.584677 | 77.197581 |
How can you customize your table to look like the table you got in the above example with the loop? I asked ChatGPT 3.5, and here is what I got:
import calendar
# Select columns, group by month, and apply mean to find the mean for each month
monthly_means = two_stations.loc[:, ['TMIN', 'TMAX']].groupby(two_stations.index.month).mean()
# Map month numbers to month names
monthly_means.index = monthly_means.index.map(lambda x: calendar.month_name[x])
# Round mean values to two decimal points
monthly_means = monthly_means.round(2)
# Rename the columns
monthly_means = monthly_means.rename(columns={'TMIN': 'Mean TMIN', 'TMAX': 'Mean TMAX'})
# Print the resulting DataFrame
display(monthly_means)
DATE | Mean TMIN | Mean TMAX
---|---|---
January | 56.10 | 76.36 |
February | 60.26 | 81.19 |
March | 62.72 | 84.81 |
April | 66.59 | 87.55 |
May | 69.44 | 89.94 |
June | 74.18 | 90.94 |
July | 76.18 | 92.62 |
August | 76.21 | 92.82 |
September | 74.60 | 90.66 |
October | 70.92 | 87.29 |
November | 64.87 | 81.77 |
December | 58.58 | 77.20 |
In the code above we learned everything except for .map() and the lambda function. You can ask an LLM to learn more about these. However, if you understood this much, you have a very good understanding of Pandas and Python in general.
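Here is a minimal sketch, with made-up data, of how .map() and a lambda function work together on a small Series, mirroring the month-name mapping above:
```Python
import calendar
import pandas as pd

# A small Series of month numbers (made-up example data)
months = pd.Series([1, 2, 3])

# .map() applies a function to every element of the Series;
# lambda defines a small unnamed one-line function
print(months.map(lambda x: x * 10))                  # 10, 20, 30

# The same idea as above: map month numbers to month names
print(months.map(lambda x: calendar.month_name[x]))  # January, February, March
```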
Note
Prompt engineering entails crafting precise prompts to guide a large language model (LLM) to generate the desired outputs. Mastering prompt engineering, through providing clear guidance and constraints, will help you to effectively harness the capabilities of LLMs for tasks ranging from problem-solving to creative writing. To learn prompt engineering, you can check, for example, the Prompt Engineering: A Practical Example tutorial by Real Python.
4.15 Quick Plots of Your Data#
A good way to explore your data is by making a simple plot. Pandas contains its own .plot method, which allows us to plot a Pandas Series or DataFrame without importing matplotlib directly.
4.15.1 Line plot#
This is as simple as:
df[columns_to_plot].plot(options_as_needed)
In this example, we plot the TMAX and TMIN series of our DataFrame with a line plot.
#Select data to plot: TMAX and TMIN
data_to_plot=['TMAX','TMIN']
#Plot method
one_station[data_to_plot].plot();

Note
When using .plot() in Pandas, appending a semicolon ; at the end of the statement suppresses additional output, such as the <Axes: xlabel='DATE'> text that would otherwise be displayed above the plot.
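If you prefer, you can instead assign the returned Axes object to a variable; in a notebook, the assignment also suppresses the printed text. A minimal sketch:
```Python
# Assigning the returned Axes object to a variable also suppresses
# the '<Axes: ...>' text in a notebook cell
ax = one_station[data_to_plot].plot()
```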
4.15.2 Histogram plot#
In the previous example we called .plot()
, which generated a single line plot. Line plots can be helpful for understanding some types of data, but there are other types of data that can be better understood with different plot types. For example, if you want to see the distribution of your data, you can better do this with a histogram plot.
Let us do something more advanced than the previous example to build on what we learned before. Let us say we want to see the TMAX distributions for the years 2022 and 2023. That is tricky. Let us think about this step by step. First we need to create two new columns, 2022_TMAX and 2023_TMAX, such that 2022_TMAX will contain TMAX data for 2022, and 2023_TMAX will contain TMAX data for 2023. Then we can plot the data of these two new columns.
We have learned how to do this before with Slicing and with Subsetting. Here is how to do this with Slicing
#Slicing data for 2022 and 2023 and adding it to the DataFrame
one_station['2022_TMAX'] = one_station.TMAX.loc['2022']
one_station['2023_TMAX'] = one_station.TMAX.loc['2023']
# Display updated DataFrame to show the two new columns '2022_TMAX' and '2023_TMAX'
one_station
DATE | STATION | NAME | PRCP | TMAX | TMIN | AWND | 2022_TMAX | 2023_TMAX
---|---|---|---|---|---|---|---|---
2020-01-01 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 78.0 | 54.0 | 4.70 | NaN | NaN |
2020-01-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 85.0 | 59.0 | 7.16 | NaN | NaN |
2020-01-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 86.0 | 70.0 | 8.28 | NaN | NaN |
2020-01-04 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.17 | 82.0 | 65.0 | 10.07 | NaN | NaN |
2020-01-05 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | 8.95 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37 | NaN | 76.0 |
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03 | NaN | 69.0 |
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04 | NaN | 71.0 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14 | NaN | 67.0 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN | NaN | 70.0 |
1461 rows × 8 columns
Let us delete these two columns and add them again with the subsetting method.
```Python
# Drop the two new columns from the DataFrame
one_station.drop(columns=['2022_TMAX', '2023_TMAX'], inplace=True)
#Display updated DataFrame to confirm the columns were removed
one_station
```
DATE | STATION | NAME | PRCP | TMAX | TMIN | AWND
---|---|---|---|---|---|---
2020-01-01 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 78.0 | 54.0 | 4.70
2020-01-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 85.0 | 59.0 | 7.16
2020-01-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 86.0 | 70.0 | 8.28
2020-01-04 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.17 | 82.0 | 65.0 | 10.07
2020-01-05 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | 8.95
... | ... | ... | ... | ... | ... | ...
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN
1461 rows × 6 columns
Now add these two columns back using the subsetting method
#Subsetting 2022 and 2023 data and adding it to the DataFrame
one_station['2022_TMAX'] = one_station.TMAX.loc[one_station.index.year == 2022]
one_station['2023_TMAX'] = one_station.TMAX.loc[one_station.index.year == 2023]
# Display updated DataFrame to show the two new columns '2022_TMAX' and '2023_TMAX'
one_station
DATE | STATION | NAME | PRCP | TMAX | TMIN | AWND | 2022_TMAX | 2023_TMAX
---|---|---|---|---|---|---|---|---
2020-01-01 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 78.0 | 54.0 | 4.70 | NaN | NaN |
2020-01-02 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 85.0 | 59.0 | 7.16 | NaN | NaN |
2020-01-03 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 86.0 | 70.0 | 8.28 | NaN | NaN |
2020-01-04 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.17 | 82.0 | 65.0 | 10.07 | NaN | NaN |
2020-01-05 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | 8.95 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2023-12-27 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 76.0 | 58.0 | 5.37 | NaN | 76.0 |
2023-12-28 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.35 | 69.0 | 57.0 | 4.03 | NaN | 69.0 |
2023-12-29 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 71.0 | 55.0 | 6.04 | NaN | 71.0 |
2023-12-30 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 67.0 | 51.0 | 5.14 | NaN | 67.0 |
2023-12-31 | USW00012835 | FORT MYERS PAGE FIELD AIRPORT, FL US | 0.00 | 70.0 | 50.0 | NaN | NaN | 70.0 |
1461 rows × 8 columns
Now we have our data ready for plotting.
The code for plotting a histogram differs slightly from the code for a line plot. After the .plot method, we call an additional method, .hist, which converts the plot into a histogram. We are also calling .hist with two optional parameters. The first parameter, alpha, adjusts the transparency; for example, alpha=0.5 sets the transparency level of the plotted histogram to 50%. The second parameter, bins, specifies the number of bins used to divide the range of the data into equal intervals for plotting the histogram.
#Select data to plot
data_to_plot=['2022_TMAX','2023_TMAX']
#Histogram method (alpha parameter for transparency)
one_station[data_to_plot].plot.hist(alpha=0.5, bins=20);

4.15.3 Box plot#
The histogram plot helped us to look at the data differently. It seems that the two datasets are from the same distribution. To understand this data even better, it may also be helpful to create a box plot. This can be done using the same code, with one change: we call the .box method instead of .hist.
#Select data to plot
data_to_plot=['2022_TMAX','2023_TMAX']
#Box method
one_station[data_to_plot].plot.box();

Just like the histogram plot, this box plot indicates no clear difference between the two distributions. Using multiple plot types in this way can be useful for checking large datasets. The Pandas plotting methods can create many different types of plots. To see how to use the plotting methods to generate each type of plot, please review the Pandas plot documentation.
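As one more example, a scatter plot is among the many plot types Pandas can produce; here is a sketch plotting daily TMIN against TMAX for the same station:
```Python
# Scatter plot of daily TMIN vs. TMAX (one of many available plot types);
# alpha makes overlapping points easier to see
one_station.plot.scatter(x='TMIN', y='TMAX', alpha=0.3);
```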
4.15.4 Customize your Plot#
The Pandas plotting methods are, in fact, wrappers for similar methods in matplotlib. This means that you can customize Pandas plots by including keyword arguments to the plotting methods. These keyword arguments, for the most part, are equivalent to their matplotlib counterparts.
#Plain plotting without customization
one_station[['TMAX','TMIN']].plot();

```Python
#Here are some customizations that you can do
one_station[['TMAX', 'TMIN']].plot(
    color=['red', 'blue'],        # Set different colors for TMAX and TMIN
    linewidth=1,                  # Set the width of the lines to 1
    xlabel='Date',                # Set the label for the x-axis
    ylabel='Temperature (deg F)', # Set the label for the y-axis
    figsize=(12, 4),              # Set the size of the figure to (12, 4) inches
    title='Max and Min Temperature Over Time',  # Add a title to the plot
    grid=False,                   # Do not show gridlines on the plot
    legend=True,                  # Show a legend
    style=['-', '-'],             # Set solid lines for TMAX and TMIN
    alpha=0.8,                    # Set the transparency of the lines to 0.8
    xlim=('2020-01-01', '2024-01-01'),  # Set the limits for the x-axis
    ylim=(30, None),              # Set the lower limit for the y-axis to 30
    rot=0,                        # Rotate x-axis labels by 0 degrees
    fontsize=12,                  # Set font size for tick labels
);
```

4.16 Applying operations to a DataFrame#
One of the most commonly used features in Pandas is performing calculations on multiple data values in a DataFrame simultaneously. Let's first look at a function that converts temperature values from degrees Fahrenheit to Celsius.
```Python
def fahrenheit_to_celsius(temp_fahr):
    """
    Converts from degrees Fahrenheit to Celsius.
    """
    return (temp_fahr - 32) * 5/9

#Test your function here by checking the freezing point
fahrenheit_to_celsius(32)
```
0.0
Let us create a new DataFrame for the temperature data and apply the fahrenheit_to_celsius function to convert the data from Fahrenheit to Celsius. Then add the new data as two new columns, TMAX_C and TMIN_C. This can take the form:
df[new_columns] = my_function(original_columns)
Also, rename the original TMAX
to TMAX_F
and TMIN
to TMIN_F
. For renaming you can use the rename function:
df = df.rename(columns={'old_name1': 'new_name1', 'old_name2': 'new_name2'})
Here is how to do this:
# Create a new DataFrame for temp values TMAX and TMIN
temp=one_station[['TMAX','TMIN']].copy()
# Convert temp from F to C and save to new columns TMAX_C and TMIN_C
temp[['TMAX_C','TMIN_C']]= fahrenheit_to_celsius(temp)
#Change the name of TMAX to TMAX_F and TMIN to TMIN_F
temp = temp.rename(columns={'TMAX': 'TMAX_F', 'TMIN': 'TMIN_F'})
# Display new DataFrame
display(temp)
DATE | TMAX_F | TMIN_F | TMAX_C | TMIN_C
---|---|---|---|---
2020-01-01 | 78.0 | 54.0 | 25.555556 | 12.222222 |
2020-01-02 | 85.0 | 59.0 | 29.444444 | 15.000000 |
2020-01-03 | 86.0 | 70.0 | 30.000000 | 21.111111 |
2020-01-04 | 82.0 | 65.0 | 27.777778 | 18.333333 |
2020-01-05 | 70.0 | 50.0 | 21.111111 | 10.000000 |
... | ... | ... | ... | ... |
2023-12-27 | 76.0 | 58.0 | 24.444444 | 14.444444 |
2023-12-28 | 69.0 | 57.0 | 20.555556 | 13.888889 |
2023-12-29 | 71.0 | 55.0 | 21.666667 | 12.777778 |
2023-12-30 | 67.0 | 51.0 | 19.444444 | 10.555556 |
2023-12-31 | 70.0 | 50.0 | 21.111111 | 10.000000 |
1461 rows × 4 columns
Another way of doing this is to use .apply()
such that
df[new_columns] = df[original_columns].apply(my_function)
Try it out:
# Create a new DataFrame for temp values TMAX and TMIN
temp=one_station[['TMAX','TMIN']].copy()
# Convert temp from F to C and save to new columns TMAX_C and TMIN_C
temp[['TMAX_C','TMIN_C']]= temp.apply(fahrenheit_to_celsius)
#Change the name of TMAX to TMAX_F and TMIN to TMIN_F
temp = temp.rename(columns={'TMAX': 'TMAX_F', 'TMIN': 'TMIN_F'})
# Display new DataFrame
display(temp)
DATE | TMAX_F | TMIN_F | TMAX_C | TMIN_C
---|---|---|---|---
2020-01-01 | 78.0 | 54.0 | 25.555556 | 12.222222 |
2020-01-02 | 85.0 | 59.0 | 29.444444 | 15.000000 |
2020-01-03 | 86.0 | 70.0 | 30.000000 | 21.111111 |
2020-01-04 | 82.0 | 65.0 | 27.777778 | 18.333333 |
2020-01-05 | 70.0 | 50.0 | 21.111111 | 10.000000 |
... | ... | ... | ... | ... |
2023-12-27 | 76.0 | 58.0 | 24.444444 | 14.444444 |
2023-12-28 | 69.0 | 57.0 | 20.555556 | 13.888889 |
2023-12-29 | 71.0 | 55.0 | 21.666667 | 12.777778 |
2023-12-30 | 67.0 | 51.0 | 19.444444 | 10.555556 |
2023-12-31 | 70.0 | 50.0 | 21.111111 | 10.000000 |
1461 rows × 4 columns
This is just the beginning. You can convert a Pandas DataFrame to a NumPy array for more advanced array operations. You can do this conversion using the .values attribute.
Check the type of your DataFrame.
type(temp)
pandas.core.frame.DataFrame
Now use temp.values and check the type again.
type(temp.values)
numpy.ndarray
We will later learn NumPy, which handles multi-dimensional arrays with an extensive collection of mathematical functions, facilitating high-performance numerical computations.
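As a small taste of what this enables, here is a sketch that applies a NumPy function to the converted array; np.nanmean computes the mean of each column while ignoring any missing values:
```Python
import numpy as np

# Convert the DataFrame to a NumPy array and compute each column's mean,
# skipping NaN values
np.nanmean(temp.values, axis=0)
```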
4.17 Iterating over rows (optional)#
We can apply a function one row at a time using a for loop and the iterrows() method, which lets us repeat a given process for each row in a Pandas DataFrame. Please note that iterating over rows is a rather inefficient approach, but it is your last resort when the other methods cannot do what you need.
Let's start with a simple for loop that goes through each row in our DataFrame:
```Python
# Create a new DataFrame for temp values TMAX and TMIN
temp = one_station.loc['2023-12-29':'2023-12-31', ['TMAX','TMIN']].copy()

# Iterate over the rows
for idx, row in temp.iterrows():
    # Print the index value
    display(idx)
    # Print the row values
    display(row['TMAX'])
    display(row['TMIN'])
    break #Break is used to show only the first iteration of the for loop for testing purposes
```
Timestamp('2023-12-29 00:00:00')
np.float64(71.0)
np.float64(55.0)
Now for each value let us apply our fahrenheit_to_celsius function.
```Python
# Create a new DataFrame for temp values TMAX and TMIN
temp = one_station[['TMAX','TMIN']].copy()

# Iterate over the rows
for idx, row in temp.iterrows():
    # Convert TMAX from Fahrenheit to Celsius
    temp.loc[idx, 'TMAX_C'] = fahrenheit_to_celsius(row['TMAX'])
    # Convert TMIN from Fahrenheit to Celsius
    temp.loc[idx, 'TMIN_C'] = fahrenheit_to_celsius(row['TMIN'])

#Change the name of TMAX to TMAX_F and TMIN to TMIN_F
temp = temp.rename(columns={'TMAX': 'TMAX_F', 'TMIN': 'TMIN_F'})

# Display the resulting DataFrame with temperatures converted to Celsius
display(temp)
```
DATE | TMAX_F | TMIN_F | TMAX_C | TMIN_C
---|---|---|---|---
2020-01-01 | 78.0 | 54.0 | 25.555556 | 12.222222 |
2020-01-02 | 85.0 | 59.0 | 29.444444 | 15.000000 |
2020-01-03 | 86.0 | 70.0 | 30.000000 | 21.111111 |
2020-01-04 | 82.0 | 65.0 | 27.777778 | 18.333333 |
2020-01-05 | 70.0 | 50.0 | 21.111111 | 10.000000 |
... | ... | ... | ... | ... |
2023-12-27 | 76.0 | 58.0 | 24.444444 | 14.444444 |
2023-12-28 | 69.0 | 57.0 | 20.555556 | 13.888889 |
2023-12-29 | 71.0 | 55.0 | 21.666667 | 12.777778 |
2023-12-30 | 67.0 | 51.0 | 19.444444 | 10.555556 |
2023-12-31 | 70.0 | 50.0 | 21.111111 | 10.000000 |
1461 rows × 4 columns
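As a quick sanity check, here is a sketch comparing the loop result with the vectorized approach from Section 4.16; both should produce the same DataFrame:
```Python
# Vectorized equivalent of the iterrows() loop above (see Section 4.16)
temp_vec = one_station[['TMAX','TMIN']].copy()
temp_vec[['TMAX_C','TMIN_C']] = fahrenheit_to_celsius(temp_vec)
temp_vec = temp_vec.rename(columns={'TMAX': 'TMAX_F', 'TMIN': 'TMIN_F'})

# Compare with the loop-based result; prints True if they are identical
print(temp.equals(temp_vec))
```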
4.18 Save and load your DataFrame#
In this final example, we demonstrate the use of the to_csv method to save a DataFrame as a .csv file. This example also demonstrates the read_csv method, which reads .csv files into Pandas DataFrames.
4.18.1 Save a DataFrame as a .csv file#
The to_csv method saves a DataFrame as a .csv file:
df.to_csv('file_path/file_name.csv')
Try to save your temp DataFrame to 'temp_data.csv':
# Save your temp data in F and C to a CSV file 'temp_data.csv'
temp.to_csv('temp_data.csv')
4.18.2 Read .csv files into DataFrames#
We want to:
1. read the file as a new DataFrame called df_temp,
2. set our first column as our index column,
3. and change the format of our index to datetime format.
Let us break it down part by part.
First, you can use the read_csv method, which reads .csv files into Pandas DataFrames:
df_name = pd.read_csv('file_path/file_name.csv', index_col = 0, parse_dates = True)
Second, set the first column, which contains the datetime data, as our index using
index_col = 0
which specifies that the first column of the CSV file should be used as the index of the DataFrame. If you want, for example, to use your third column as the index of your DataFrame, you can use index_col = 2, and so on.
Third, parse any date-like columns as datetime objects using
parse_dates = True
which instructs Pandas to attempt to parse any datetime columns, including the index, as datetime objects.
Here is how to do this:
# Read file as a new DataFrame 'df_temp', and set first column as index with datetime format
df_temp= pd.read_csv('temp_data.csv', index_col = 0, parse_dates = True)
#Display your new DataFrame
display(df_temp)
DATE | TMAX_F | TMIN_F | TMAX_C | TMIN_C
---|---|---|---|---
2020-01-01 | 78.0 | 54.0 | 25.555556 | 12.222222 |
2020-01-02 | 85.0 | 59.0 | 29.444444 | 15.000000 |
2020-01-03 | 86.0 | 70.0 | 30.000000 | 21.111111 |
2020-01-04 | 82.0 | 65.0 | 27.777778 | 18.333333 |
2020-01-05 | 70.0 | 50.0 | 21.111111 | 10.000000 |
... | ... | ... | ... | ... |
2023-12-27 | 76.0 | 58.0 | 24.444444 | 14.444444 |
2023-12-28 | 69.0 | 57.0 | 20.555556 | 13.888889 |
2023-12-29 | 71.0 | 55.0 | 21.666667 | 12.777778 |
2023-12-30 | 67.0 | 51.0 | 19.444444 | 10.555556 |
2023-12-31 | 70.0 | 50.0 | 21.111111 | 10.000000 |
1461 rows × 4 columns
What we have not learned so far is how to create a DataFrame from given data rather than reading it from a file. You can learn this on your own. Ask an LLM, for example, and guide yourself through the learning process.
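To get you started, here is a minimal sketch, with made-up values, that builds a small DataFrame from a Python dictionary:
```Python
import pandas as pd

# Keys become column names and lists become column values (made-up data)
df = pd.DataFrame(
    {'TMAX': [78.0, 85.0, 86.0], 'TMIN': [54.0, 59.0, 70.0]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
)
df.index.name = 'DATE'  # name the index like our weather data
print(df)
```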
Summary#
Pandas is a very powerful tool for working with tabular (i.e., spreadsheet-style) data
There are multiple ways of subsetting your Pandas dataframe or series
Pandas allows you to refer to subsets of data by label, which generally makes code more readable and more robust
Pandas can be helpful for exploratory data analysis, including plotting and basic statistics
One can apply calculations to Pandas dataframes and save the output via csv files
What’s Next?#
To learn Pandas, you do not need to read more, but to practice more. Start working on your homework. When encountering challenges, refer back to this notebook and seek assistance from an LLM. If the LLM provides a new method that you do not understand, ask for an explanation. When you write code that does not work, ask for debugging help. Practice as much as you can to learn this powerful data analysis tool.