Lessons 25-26 : Xarray Basics#
This lesson is modified from Introduction to Xarray by Project Pythia, and Xarray Fundamentals by Earth and Environmental Data Science.
Overview#
Xarray is a Python library for handling labeled multi-dimensional arrays, commonly used for geospatial data such as climate, oceanographic, and remote sensing data in formats like netCDF. By the end of this lesson, you will be able to:
explain essential features of Xarray
load, manipulate, analyze, and visualize data in netCDF format
leverage Xarray for your water and environmental data science projects
Installation and import#
#pip install xarray
#pip install cartopy
#pip install netCDF4
import xarray as xr
import numpy as np
import pandas as pd
import hvplot.xarray
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import warnings
warnings.filterwarnings('ignore') # Suppress warnings issued by Cartopy when downloading data files
1. Basic Concepts#
1.1 Label dimensions#
Xarray borrows from pandas. Xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, which makes working with arrays simple and clear.
Let us look at a simple example of a 4D array representing sea surface height (zos) data with dimensions:
latitude,
longitude,
depth,
time
We can add as many labeled dimensions. We can use Xarray to create a DataArray or DataSet with labeled dimensions and coordinates, and include metadata such as units and a description for better understanding and documentation.
1.2 DataArray and DataSet#
A Dataset holds many DataArrays which potentially can share coordinates. In analogy to pandas:
pandas.Series : pandas.Dataframe :: xarray.DataArray : xarray.Dataset
2. Creating a DataArray and Dataset#
2.1 Creating a DataArray#
Xarray handles multidimensional and spatiotemporal data using labeled dimensions and coordinates and maintain metadata. Let us look at a simple example of a 4D array representing sea surface temperature (tos) data with
latitude,
longitude,
depth,
date
We will store metadata according to the CF Conventions.
# Create coordinates for the four labeled dimensions
dates = pd.date_range('2024-01-03', periods=2) # Generating a date range for the data
depths = [10, 20, 30] # List of depth values
latitudes = [24.5, 26.5, 28.5, 30.5] # latitude values
longitudes = [-87, -84.5, -83, -81.5, -79] # longitude values
# Create a 4D Numpy Array representing sea surface height (zos) values for a 2x3x4x5 grid
size=(len(dates), len(depths), len(latitudes), len(longitudes))
zos = np.round(np.random.uniform(-1, 1, size= size), 2) #sea surface height [m]
# Create a DataArray with zos values
# with labeled dimensions and coordinates
da = xr.DataArray(zos, # zos values
dims=['date', 'depth', 'lat', 'lon'], # dimensions
coords={ # coordinates
'date': dates,
'depth': depths,
'lat': latitudes,
'lon': longitudes
})
# Add metadata to the DataArray
da.attrs['Long Name'] = 'sea_surface_height_above_geoid'
da.attrs['Short Name'] = 'zos'
da.attrs['Units'] = 'meters'
da.attrs['Description'] = "Sea surface height is the height of the sea surface above the geoid, \
which is an equipotential surface that approximates mean sea level.\
The sea surface height can be influenced by various factors\
such as ocean currents, tides, winds, and temperature."
# Display DataArray with labeled dimensions and coordinates, and metadata
display(da)
<xarray.DataArray (date: 2, depth: 3, lat: 4, lon: 5)> Size: 960B array([[[[-0.57, -0.26, 0.48, 0.5 , 0.34], [-0.23, 0.59, -1. , -0.83, -0.63], [ 0.98, 0.29, 0.19, 0.72, 0.85], [ 0.87, -0.39, -0.43, -0.96, 0.97]], [[-0.35, 0.13, -0.53, 0.38, 0.8 ], [ 0.8 , 0.55, 0.63, 0.27, -0.79], [-0.16, -0.01, -0.41, 0.97, 0.67], [ 0.72, 0.75, 0.13, -0.22, -0.02]], [[ 0.37, -0.4 , -0.13, -0.46, 0.7 ], [ 0.1 , -0.07, -0.09, 0.38, -0.5 ], [-0.14, -0.69, -0.33, -0.21, 0.54], [ 0.09, -0.21, -0.88, 0.91, -0.44]]], [[[ 0.11, -0.63, -0.4 , -0.47, 0.55], [ 0.77, -0.79, 0.37, -0.4 , -0.29], [ 0.43, 0.69, -0.56, 0.75, -0.06], [ 0.52, -0.11, 0.61, 0.15, 0.2 ]], [[ 0. , 0.38, -0.76, -0.2 , -0.9 ], [ 0.34, -0.16, -0.27, -0.46, 0.56], [-0.63, -0.95, 0.31, -0.01, 0.45], [ 0.71, -0.05, 0.11, -0.78, 0.81]], [[ 0.89, -0.2 , -0.36, -0.07, -0.33], [-0.77, 0.33, -0.9 , -0.62, 0.01], [ 0.31, -0.47, 0.06, 0.46, -0.85], [-0.34, 0.55, -0.25, -0.59, -0.25]]]]) Coordinates: * date (date) datetime64[ns] 16B 2024-01-03 2024-01-04 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
2.2 Creating a Dataset with one variable#
A Dataset holds many DataArrays which potentially can share coordinates. However, we can also create a Dataset with just one variable. Let use the above data to create a DataSet.
# Create coordinates for the four labels
dates = pd.date_range('2024-01-01', periods=2) # Generating a time range for the data
depths = [10, 20, 30] # List of depth values
latitudes= [24.5, 26.5, 28.5, 30.5] # latitude values
longitudes = [-87, -84.5, -83, -81.5, -79] # longitude values
# Create a 4D Numpy Array representing values for a 2x3x4x5 grid
size=(len(dates), len(depths), len(latitudes), len(longitudes))
zos = np.round(np.random.uniform(-1, 1, size= size), 2) #sea surface height [m]
# Create a DataArray with zos values
ds = xr.Dataset(
{
'zos': (('date', 'depth', 'lat', 'lon'), zos), #zos values and its dimensions
},
coords={'date': dates, 'depth': depths, 'lat': latitudes, 'lon': longitudes}) #Coordinates
# Add metadata to the DataArray
ds.zos.attrs['Long Name'] = 'sea_surface_height_above_geoid'
ds.zos.attrs['Short Name'] = 'zos'
ds.zos.attrs['Units'] = 'meters'
ds.zos.attrs['Description'] = "Sea surface height is the height of the sea surface above the geoid, \
which is an equipotential surface that approximates mean sea level.\
The sea surface height can be influenced by various factors\
such as ocean currents, tides, winds, and temperature."
# Displaying DataSet with labeled dimensions and coordinates, and metadata
display(ds)
<xarray.Dataset> Size: 1kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: zos (date, depth, lat, lon) float64 960B -0.22 0.83 -0.3 ... 0.67 0.47
2.3 Creating a Dataset with multiple variables#
Let us put together a Dataset with sea surface temperature, and salinty
# Create coordinates for the four labels
dates = pd.date_range('2024-01-01', periods=2) # Generating a time range for the data
depths = [10, 20, 30] # List of depth values
lat= [24.5, 26.5, 28.5, 30.5] # latitude values
lon = [-87, -84.5, -83, -81.5, -79] # longitude values
# Create a 4D Numpy Array representing values for a 2x3x4x5 grid
size=(len(dates), len(depths), len(lat), len(lon))
tos = np.round(np.random.uniform(26, 30, size=size), 2) #sea surface temperature [C]
sos = np.round(np.random.uniform(33, 36, size=size), 2) #sea surface salinty (PSU)
# Create a DataArray with zos values
ds2 = xr.Dataset(
{
'tos': (('date', 'depth', 'lat', 'lon'), tos), #tos values and its dimensions
'sos': (('date', 'depth', 'lat', 'lon'), sos), #sos values and its dimensions
},
coords={'date': dates, 'depth': depths, 'lat': latitudes, 'lon': longitudes})
# Add metadata to the DataArray
ds2.tos.attrs['Long Name'] = 'sea_surface_temperature'
ds2.tos.attrs['Short Name'] = 'tos'
ds2.tos.attrs['Units'] = 'Degree Celsius'
ds2.sos.attrs['Long Name'] = 'sea_surface_salinty'
ds2.sos.attrs['Short Name'] = 'sos'
ds2.sos.attrs['Units'] = 'Practical salinity units (PSU)'
# Displaying DataSet with labeled dimensions and coordinates, and metadata
display(ds2)
<xarray.Dataset> Size: 2kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: tos (date, depth, lat, lon) float64 960B 28.51 27.06 ... 26.0 27.7 sos (date, depth, lat, lon) float64 960B 34.57 35.12 ... 33.16 33.33
3. Understanding DataArray and DataSet#
2.3.1 Coordinates and indexes#
The coordinates are used to create indexes, which work similar to a Pandas index.
# Accessing the index objects associated with the dimensions of the DataArray
ds.indexes
Indexes:
date DatetimeIndex(['2024-01-01', '2024-01-02'], dtype='datetime64[ns]', name='date', freq='D')
depth Index([10, 20, 30], dtype='int64', name='depth')
lat Index([24.5, 26.5, 28.5, 30.5], dtype='float64', name='lat')
lon Index([-87.0, -84.5, -83.0, -81.5, -79.0], dtype='float64', name='lon')
DataArrays have coordinates associated with them. For example, lat
is one such coordinate variable. We can access lat
coordinate variable from the DataArray ds
by using ds.lat
.
# Retrieving the values of the `lat` coordinate from the DataArray `ds`.
ds.lat
<xarray.DataArray 'lat' (lat: 4)> Size: 32B array([24.5, 26.5, 28.5, 30.5]) Coordinates: * lat (lat) float64 32B 24.5 26.5 28.5 30.5
2.3.2 Cooridnates versus variables#
Data variables can be changed using arithmetic operations or other functions, while the coordinates remain unchanged.
da+1
<xarray.DataArray (date: 2, depth: 3, lat: 4, lon: 5)> Size: 960B array([[[[0.43, 0.74, 1.48, 1.5 , 1.34], [0.77, 1.59, 0. , 0.17, 0.37], [1.98, 1.29, 1.19, 1.72, 1.85], [1.87, 0.61, 0.57, 0.04, 1.97]], [[0.65, 1.13, 0.47, 1.38, 1.8 ], [1.8 , 1.55, 1.63, 1.27, 0.21], [0.84, 0.99, 0.59, 1.97, 1.67], [1.72, 1.75, 1.13, 0.78, 0.98]], [[1.37, 0.6 , 0.87, 0.54, 1.7 ], [1.1 , 0.93, 0.91, 1.38, 0.5 ], [0.86, 0.31, 0.67, 0.79, 1.54], [1.09, 0.79, 0.12, 1.91, 0.56]]], [[[1.11, 0.37, 0.6 , 0.53, 1.55], [1.77, 0.21, 1.37, 0.6 , 0.71], [1.43, 1.69, 0.44, 1.75, 0.94], [1.52, 0.89, 1.61, 1.15, 1.2 ]], [[1. , 1.38, 0.24, 0.8 , 0.1 ], [1.34, 0.84, 0.73, 0.54, 1.56], [0.37, 0.05, 1.31, 0.99, 1.45], [1.71, 0.95, 1.11, 0.22, 1.81]], [[1.89, 0.8 , 0.64, 0.93, 0.67], [0.23, 1.33, 0.1 , 0.38, 1.01], [1.31, 0.53, 1.06, 1.46, 0.15], [0.66, 1.55, 0.75, 0.41, 0.75]]]]) Coordinates: * date (date) datetime64[ns] 16B 2024-01-03 2024-01-04 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0
2.3.3 Dataset with aligined coordinates (advanced)#
Let us assume that we want to add ‘longitude’ coordinate that aligns with the existing ‘latitude’ coordinate in the Dataset. This means that the ‘longitude’ values will be associated with each ‘latitude’. longitudes. This is an array-like object containing the actual longitude values corresponding to each latitude coordinate in the Dataset.
Let us add a coordinate variable ‘longitude’ that will align with the existing ‘latitude’ coordinate, such that the values provided in the longitudes array will be associated with each latitude value in the Dataset.
# Create coordinates for the four labels
dates = pd.date_range('2024-01-01', periods=2) # Generating a time range for the data
latitudes= [24.5, 26.5, 28.5] # latitude values
longitudes = [-87, -84.5, -83] # longitude values
# Create a 4D Numpy Array representing values for a 2x3x4x5 grid
size=(len(dates), len(latitudes))
zos = np.round(np.random.uniform(-1, 1, size= size), 2) #sea surface height [m]
# Create a DataArray with zos values
ds_aliginded = xr.Dataset(
{'zos': (('date', 'latitude'), zos)}, #zos values and its dimensions
coords={'date': dates, 'latitude': latitudes}) #Coordinates
# Add metadata to the DataArray
ds_aliginded.zos.attrs['Description'] = "Longitude coordinates align with the latitude coordinates"
# Add a new coordinate variable 'longitude' to the xarray Dataset 'ds'
# The new coordinate will be aligned with the existing 'latitude' coordinate
# The values for the 'longitude' coordinate are provided in the 'longitudes' array
ds_aliginded.coords['longitude'] = ('latitude', longitudes)
# Displaying DataSet with labeled dimensions and coordinates, and metadata
display(ds_aliginded)
<xarray.Dataset> Size: 112B Dimensions: (date: 2, latitude: 3) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * latitude (latitude) float64 24B 24.5 26.5 28.5 longitude (latitude) float64 24B -87.0 -84.5 -83.0 Data variables: zos (date, latitude) float64 48B -0.6 -0.54 -0.51 -0.92 -0.62 -0.02
Note
The 'longitude' variable is specified under the 'latitude' coordinate. That is the 'longitude' values are aligned or linked to each specific 'latitude' value. This kind of hierarchical structure is commonly used in geospatial datasets where longitude and latitude are interrelated.4. Exploring Data#
4.1 Selecting data (indexing)#
When indexing a DataArray or Dataset in xarray using the syntax ds[:,:, ...]
we are selecting data along dimensions, which is equivealent to using the isel()
method as shown later.
Let us use indexing ds.zos[:,:,:,:]
to select ‘zos’ values for the:
first ‘date’ value,
third ‘depth’ level,
first three ‘lat’ values,
last two ‘lon’ values
and assign this slice to variable subset
. Here is how to do it:
# Select a subset of the 'zos' DataArray from the Dataset 'ds'
# The subset includes the first time step, the third depth level,
# the first 3 latitude values, and the last 2 longitude values.
subset = ds.zos[0, 2, 0:3, -3:-1]
subset
<xarray.DataArray 'zos' (lat: 3, lon: 2)> Size: 48B array([[-0.47, -0.75], [-0.6 , 0.98], [-0.75, -0.49]]) Coordinates: date datetime64[ns] 8B 2024-01-01 depth int64 8B 30 * lat (lat) float64 24B 24.5 26.5 28.5 * lon (lon) float64 16B -83.0 -81.5 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
You can also retrieve the coordinate values. For example, retieve the lat values for the above selection ds.zos[].lat
# Select the latitude values corresponding to the subset obtained from the 'zos' DataArray in the Dataset 'ds'
# The subset includes the first time step, the third depth level,
# the first 3 latitude values, and the last 2 longitude values.
subset_lat = ds.zos[0, 2, 0:3, -3:-1].lat
subset_lat
<xarray.DataArray 'lat' (lat: 3)> Size: 24B array([24.5, 26.5, 28.5]) Coordinates: date datetime64[ns] 8B 2024-01-01 depth int64 8B 30 * lat (lat) float64 24B 24.5 26.5 28.5
4.2 Plotting data#
Similar to Pandas, Xarray has built-in plotting for quick visualization.
Let us do a 1d plot zos values along all ‘lat’ values, for the first ‘date’ value, first ‘depth’, and last ‘lon’ value. First we need to select this data as we did above and the call the .plot
method similar to Pandas.
# Plotting the values of the 'zos' DataArray from the Dataset 'ds'
# The plot is for the first time step, the first depth level,
# all latitude values, and the last longitude value.
# Each data point is marked with a red circle marker ('o').
ds.zos[0, 0, :, -1].plot(marker='o', markerfacecolor='red', markeredgecolor= 'red');

Let us do a 2d plot of zos values along ‘lat’ and ‘lon’ values for the third ‘depth’ level of the second ‘date’ value.
# Plotting a 2D plot from the 'zos' DataArray in the Dataset 'ds'
ds.zos[1, 2].plot()
<matplotlib.collections.QuadMesh at 0x1174b420380>

4.3 Integer-based indexing with isel#
The .sel()
and .isel()
methods are powerful for data selection. The .sel()
method is used for selection based on labels, while the .isel()
method is used for selection based on integer positions within the dimensions of the data array. For example, ds.zos.isel(time=0, lat=slice(0,3))
selects the first time step and the first three latitude values from DataArray ds.zos
.
Let us select the:
first ‘date’ value,
first ‘depth’ level,
second to fourth ‘lat’ values
using .isel()
method.
# Selects zos data at depth index 1, date index 0, and latitude indices from 1 to 3
# from the xarray dataset `ds` using the intger-based selection method
ds.zos.isel(depth=1, date=0, lat=slice(1,4))
<xarray.DataArray 'zos' (lat: 3, lon: 5)> Size: 120B array([[ 0.58, -0.28, -0.17, 0.28, -0.64], [ 0.24, -0.78, 0.61, -0.61, -0.17], [ 0.92, -0.68, -0.96, -0.15, 0.83]]) Coordinates: date datetime64[ns] 8B 2024-01-01 depth int64 8B 20 * lat (lat) float64 24B 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
Note
Slicing along coordinates can be done using theslice
method with labels in sel
and with integers in isel
.
4.4 Label-based indexing with sel#
Now let us select a given index values using .sel(key1='text_value', key2=num_value, method='nearest')
method to show why this method is so powerful.
Let us select ‘2024-01-01’ and ‘lat’ value 28. We have a problem since our lat coordinats has only ‘26.5’ and ‘28.5’, so our label ‘28’ does not exist. This is not a problem because .sel()
has the method
parameter to perform selection methods as follows.
Method |
Description |
---|---|
nearest |
Selects the nearest valid values to the specified coordinates |
ffill |
Carries the last valid value forward to fill gaps |
bfill |
Carries the next valid value backward to fill gaps |
You can use these methods in the .sel()
function based on your analysis need.
# label-based selection to extract 'zos' data from the 'ds' DataSet.
# Select data for the date '2024-01-01',
# and latitude equal to 28 using the 'nearest' selection method for interpolation.
ds.zos.sel(
date='2024-01-01', # Selecting data for the specified date
lat=28, method='nearest', # Selecting latitude nearest to 28
)
<xarray.DataArray 'zos' (depth: 3, lon: 5)> Size: 120B array([[ 0.2 , 0.22, 0.99, -0.11, -0.91], [ 0.24, -0.78, 0.61, -0.61, -0.17], [-0.45, -0.21, -0.75, -0.49, 0.6 ]]) Coordinates: date datetime64[ns] 8B 2024-01-01 * depth (depth) int64 24B 10 20 30 lat float64 8B 28.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
When using isel
for integer-based indexing and sel
for label-based indexing, we can also:
Use the
slice
method for selecting a range of valuesPerform approximate selection with methods like
nearest
Let us see an example of creating a subset based on specific latitude and longitude ranges. This is particularly handy when working with global datasets and you need to extract a regional subset.
print("Latitude values of original data")
display(ds.lat.values)
print("Longitude values of original data")
display(ds.lon.values)
sliced_ds= ds.zos.sel(
date='2024-01-01', # Selecting data for the specified date
depth=10, # Seecting data for the specified depth
lat=slice(26,30), lon=slice(-84, -81) #Selecting a slice from lat and lon
)
print("Latitude values of sliced data")
display(sliced_ds.lat.values)
print("Longitude values of sliced data")
display(sliced_ds.lon.values)
sliced_ds.plot();
Latitude values of original data
array([24.5, 26.5, 28.5, 30.5])
Longitude values of original data
array([-87. , -84.5, -83. , -81.5, -79. ])
Latitude values of sliced data
array([26.5, 28.5])
Longitude values of sliced data
array([-83. , -81.5])

4.6 Data selection with loc attribute (Optional)#
When extracting data from a DataArray, besides utilizing the .sel()
method, we have the option to employ the .loc
attribute. Every DataArray includes a .loc
attribute, which allows us to choose data by specifying a coordinate value within square brackets similar to NumPy and Pandas. When using the .loc
attribute, we can specify data slices using a syntax :
, similar to NumPy and Pandas, or we can use slice function. Both of these slicing techniques are illustrated above.
Let use use .loc
to select all ‘date’ values, ‘depth’ from 20 to 30, ‘lat’ of 24.5, and ‘lon’ from -87 to -83.
# Selecting data based on Date, depth, latitude, and longitude
ds.zos.loc[:, 20:30, 24.5, slice(-87, -81.5)]
<xarray.DataArray 'zos' (date: 2, depth: 2, lon: 4)> Size: 128B array([[[-0.43, 0.18, -0.72, -0.22], [ 0.8 , 0.52, -0.47, -0.75]], [[-0.29, -0.55, -0.75, -0.07], [ 0.75, 0.18, -0.87, 0.93]]]) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 16B 20 30 lat float64 8B 24.5 * lon (lon) float64 32B -87.0 -84.5 -83.0 -81.5 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
Note
When using.loc
, we can specify the values for each coordinate; however, we cannot specify the dimension names. Thus, we need to provide the dimensions in the correct sequence, and this sequence must be predetermined. Also, similar to Pandas, .loc
is inclusive while integer-based indexing da[:,;]
` or isel
is execlusive.
5. Computation#
Various mathematical and statistical computations can be applied to analyze and manipulate DataArrays or DataSets with tasks such as data transformation, data aggregation, and data interpolation. Here are examples of few computation operations that can be done on the DataArrays and DataSets.
5.1 Data transformation#
DataArrays and DataSets work seamlessly with arithmetic operators and numpy array functions. Here is an example:
tos_new = ds2.tos + 273.15
tos_new[0,0,0,:].plot()
[<matplotlib.lines.Line2D at 0x1174b4af950>]

Let use do one more example. Let us change the longitude convention 180° with Eastern longitudes from 0° to 180° and Western longitudes from 0° to -180°, to the longitude convention 360° from 0° to 360°.
ds3= ds2.copy()
ds3['lon'] = ds3['lon']+360
ds3
<xarray.Dataset> Size: 2kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B 273.0 275.5 277.0 278.5 281.0 Data variables: tos (date, depth, lat, lon) float64 960B 28.51 27.06 ... 26.0 27.7 sos (date, depth, lat, lon) float64 960B 34.57 35.12 ... 33.16 33.33
We can also combine multiple xarray datasets in arithemtic operations
g = 9.8
buoyancy = g * (2e-4 * ds2.tos - 7e-4 * ds2.sos)
buoyancy[0,0].plot(yincrease=False)
<matplotlib.collections.QuadMesh at 0x1174da507d0>

5.2 Data aggregation#
5.2.1 Data aggregation methods#
These are examples of commonly used aggregation methods in Xarray
Aggregation |
Description |
---|---|
|
Total number of items |
|
Mean and median |
|
Minimum and maximum |
|
Standard deviation and variance |
|
Compute product of elements |
|
Compute sum of elements |
|
Find index of minimum and maximum value |
|
Compute the cumulative sum of elements |
|
Compute the cumulative product of elements |
|
Compute the cumulative minimum and maximum values |
|
Compute the difference of elements along a dimension |
|
Compute the quantile of elements |
|
Compute the mean absolute deviation from the mean |
Let us, for example, find the mean zos along all depth values
# Calculate the mean of the variable 'zos' along the 'depth' dimension in the DataSet 'ds'
ds.zos.mean(dim=('depth'))
<xarray.DataArray 'zos' (date: 2, lat: 4, lon: 5)> Size: 320B array([[[ 0.05 , 0.51 , -0.49666667, -0.39333333, 0.17333333], [ 0.23333333, 0.28666667, 0.00333333, 0.18666667, 0.40666667], [-0.00333333, -0.25666667, 0.28333333, -0.40333333, -0.16 ], [ 0.38666667, -0.48666667, -0.36666667, -0.17333333, 0.04 ]], [[ 0.14333333, -0.33333333, -0.61333333, 0.39333333, 0.18666667], [ 0.01666667, -0.54333333, -0.09333333, 0.26 , -0.84333333], [ 0.31666667, 0.08333333, -0.05333333, 0.52666667, -0.41333333], [-0.11666667, -0.00333333, -0.21333333, 0.80333333, 0.24333333]]]) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0
For example, let us find the mean zoz for the whole area at each date and depth value
# Calculate the mean of the variable 'zos' along the 'lat' and 'lon' dimensions in the DataSet 'ds'
ds.zos.mean(dim=('lat','lon'))
<xarray.DataArray 'zos' (date: 2, depth: 3)> Size: 48B array([[ 0.043, -0.071, 0.001], [ 0.003, -0.057, 0.016]]) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30
We can calculate the max ‘zos’ values for each depth
ds.zos.max(dim=['lat', 'lon','date'])
<xarray.DataArray 'zos' (depth: 3)> Size: 24B array([0.99, 0.96, 0.98]) Coordinates: * depth (depth) int64 24B 10 20 30
5.2.2 Resample#
When we receive data in daily or monthly intervals and need to summarize it into monthly or yearly aggregates, we can utilize the .resample
function similar to what we did in Pandas.
Here is an example. Where we will combine .mean
and .resample
.
# Create coordinates for the four labels
dates = pd.date_range('2024-01-01', periods=240, freq='M') # Generating a time range for the data with monthly frequency
latitudes= [24.5, 26.5, 28.5, 30.5] # latitude values
# Create a 2D Numpy Array representing
size=(len(dates), len(latitudes))
zos = np.round(np.random.uniform(-1, 1, size= size), 2) #sea surface height [m]
# Create a DataArray with zos values
ds_monthly = xr.Dataset({'zos': (('date', 'lat',), zos)}, coords={'date': dates, 'lat': latitudes})
# Add metadata to the DataArray
ds_monthly.zos.attrs['Description'] = "Monthly zos data from 2024 to 2064"
# Displaying
display(ds_monthly)
# Find mean across lat values and plot
ds_monthly.zos.mean(dim=('lat')).plot()
# Create a new DataSet 'ds_annually' as a copy of the original DataSet 'ds'
ds_annually=ds_monthly.copy()
# Resample the 'pr' variable to calculate annual averages
ds_annually = ds_monthly.resample(date='Y').mean(dim='date')
display(ds_annually)
# Find mean across lat values and plot
ds_annually.zos.mean(dim=('lat')).plot()
<xarray.Dataset> Size: 10kB Dimensions: (date: 240, lat: 4) Coordinates: * date (date) datetime64[ns] 2kB 2024-01-31 2024-02-29 ... 2043-12-31 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 Data variables: zos (date, lat) float64 8kB -0.85 -0.83 -0.39 0.71 ... -0.45 0.66 0.3
<xarray.Dataset> Size: 832B Dimensions: (date: 20, lat: 4) Coordinates: * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * date (date) datetime64[ns] 160B 2024-12-31 2025-12-31 ... 2043-12-31 Data variables: zos (date, lat) float64 640B -0.2925 -0.09667 ... -0.085 -0.01083
[<matplotlib.lines.Line2D at 0x1174d8ac740>]

5.3 Data interpolation and extrapolation#
In Section 4.4 Label-based indexing, we explored using the nearest
method to select a value close to a specified coordinate that may not be present in our DataArray. Alternatively, we can interpolate this value using methods like interp
.
We can an interpolate or extrapolate values along dimensions using functions like interp
or reindex
. We can use interp
to interpolate data along a specified dimension using linear interpolation, cubic interpolation, or other methods.
Method |
Description |
---|---|
linear |
Performs linear interpolation to estimate values at the specified coordinates |
cubic |
Performs cubic interpolation to estimate values at the specified coordinates |
nearest |
Performs quadratic interpolation to estimate values at the specified coordinates |
We can use reindex
to change the index of the data along one or more dimensions, potentially adding or removing indices. Typically, interp
is used when we want to interpolate data values along a specific dimension to fill in values or create a smoother representation of the data. The reindex
is more commonly used when we want to align the indices of multiple DataArrays or Datasets, potentially filling missing values with NaNs or other fill values.
For example, we can interpolate ‘zos’ values at:
‘lat’ of 27,
‘lon’ of -84,
‘date’ of 2024-01-01,
‘depth’ of 10
and return the results as a NumPy array instead of a DataArray. Our data already contains ‘date’ of 2024-01-01, ‘depth’ of 10 so we need to interpolate for ‘lat’ and ‘lon’ values
# Interpolate 'zos' data in the 'ds' DataSet for latitude 27 and longitude -84,
# then select data for depth of 30 meters
# Return results as a NumPy array instead of a DataArray
ds.zos.interp(lat=27, lon=-84, method='linear').sel(depth=10, date='2024-01-01').values
array(0.54916667)
Note
The attribute.values
extracts the values of the resulting DataArray as a NumPy array.
Now let use combine everything together that is:
.sel
for label-based indexing,slice
to create a subset based on specific ‘lat’ and ‘lat’ rangeslinear
to interpolate for a depth value of 11
and plot our new slice. Note the zos values in this slice is different than the previous slice in Section 4.4 Label-based indexing, because it is at a slightly different depth.
ds.zos.sel(date='2024-01-01', #Select a date
lat=slice(26,30), #Select a lat slice
lon=slice(-84, -81) #Select a long slice
).interp(depth=11, method ='linear' #Interpolate at depth 11
).plot()
<matplotlib.collections.QuadMesh at 0x1174d8df2c0>

These are just a few examples of the operations we can perform. The xarray library provides a wide range of functionalities for working with labeled multi-dimensional data like this.
6. Combing Data#
There are various methods to combine arrays. Here are a few:
xr.concat
: Concatenates arrays along specified dimensions, creating a larger array with shared dimensions (e.g., combining two zos arrays with different date values).xr.merge
: Combines distinct arrays (e.g., zos and tos) into one dataset by aligning them based on their coordinates, accommodating varying dimensions or coordinates.xr.combine_nested
: Merges arrays into a dataset while preserving their nested structure, simplifying handling of hierarchical data layouts.
You can learn these functions and similar ones as needed. Here are a few examples to illustrate them.
6.1 Combing data with concat#
We can use xr.concat
function to combine data along a common dimension.
For example,
the
da
DataArray has ‘zos’ data for ‘2024-01-03’ and ‘2024-01-04’ along the ‘date’ dimension,the
ds
DataSet has ‘zos’ data for ‘2024-01-01’ and ‘2024-01-02’ along the ‘date’ dimension.
Let use combine this data to have an extended DataArray from ‘2024-01-01’ and ‘2024-01-04’ using xr.concat
function. After we combine the data we will use .sortby()
method to sort the dates in ascending order.
# Concatenate 'da' and 'ds.zos' along the 'date' dimension, then sort the result by 'date'
ds_date_extended=xr.concat([ds.zos, da], dim='date').sortby('date')
ds_date_extended
<xarray.DataArray 'zos' (date: 4, depth: 3, lat: 4, lon: 5)> Size: 2kB array([[[[-0.22, 0.83, -0.3 , -0.21, 0.37], [ 0.33, 0.47, 0.78, -0.7 , 0.97], [ 0.2 , 0.22, 0.99, -0.11, -0.91], [ 0.13, -0.92, -0.4 , -0.04, -0.62]], [[-0.43, 0.18, -0.72, -0.22, 0.75], [ 0.58, -0.28, -0.17, 0.28, -0.64], [ 0.24, -0.78, 0.61, -0.61, -0.17], [ 0.92, -0.68, -0.96, -0.15, 0.83]], [[ 0.8 , 0.52, -0.47, -0.75, -0.6 ], [-0.21, 0.67, -0.6 , 0.98, 0.89], [-0.45, -0.21, -0.75, -0.49, 0.6 ], [ 0.11, 0.14, 0.26, -0.33, -0.09]]], [[[-0.03, -0.63, -0.22, 0.32, 0.8 ], [-0.11, -0.52, -0.11, 0.67, -0.72], [ 0.27, 0.54, -0.95, 0.8 , -1. ], [ 0.04, 0.29, 0.33, 0.78, -0.49]], ... [[ 0.37, -0.4 , -0.13, -0.46, 0.7 ], [ 0.1 , -0.07, -0.09, 0.38, -0.5 ], [-0.14, -0.69, -0.33, -0.21, 0.54], [ 0.09, -0.21, -0.88, 0.91, -0.44]]], [[[ 0.11, -0.63, -0.4 , -0.47, 0.55], [ 0.77, -0.79, 0.37, -0.4 , -0.29], [ 0.43, 0.69, -0.56, 0.75, -0.06], [ 0.52, -0.11, 0.61, 0.15, 0.2 ]], [[ 0. , 0.38, -0.76, -0.2 , -0.9 ], [ 0.34, -0.16, -0.27, -0.46, 0.56], [-0.63, -0.95, 0.31, -0.01, 0.45], [ 0.71, -0.05, 0.11, -0.78, 0.81]], [[ 0.89, -0.2 , -0.36, -0.07, -0.33], [-0.77, 0.33, -0.9 , -0.62, 0.01], [ 0.31, -0.47, 0.06, 0.46, -0.85], [-0.34, 0.55, -0.25, -0.59, -0.25]]]]) Coordinates: * date (date) datetime64[ns] 32B 2024-01-01 2024-01-02 ... 2024-01-04 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
# Calculating and plotting the zos mean along the 'lat', 'lon', and depth dimensions
ds_date_extended.mean(dim=['lat', 'lon','depth']).plot(marker='o');

Warning
Xarray will not check the values of the coordinates beforeconcat
. It will just stick everything together into a new array. This can result overlapping data.
6.2 Combining data with merge#
We can merge both DataArrays and Datasets. If the data are not aligned, they will be aligned before merge.
For example, let us combine
‘zos’ data from the
ds_date_extended
DataSet that has four date valueswith the ‘tos’ and ‘sos’ data from the
ds2
DataSet that has two date values.
This will result in a DataSet with ‘zos’, ‘tos’, and ‘sos’ data.
# Merging two datasets, ds_date_extended and ds2, using xr.merge
ds3=xr.merge([ds_date_extended, ds2])
ds3
<xarray.Dataset> Size: 6kB Dimensions: (date: 4, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 32B 2024-01-01 2024-01-02 ... 2024-01-04 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: zos (date, depth, lat, lon) float64 2kB -0.22 0.83 -0.3 ... -0.59 -0.25 tos (date, depth, lat, lon) float64 2kB 28.51 27.06 28.44 ... nan nan sos (date, depth, lat, lon) float64 2kB 34.57 35.12 34.65 ... nan nan Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
NaN will be assigned to any missing values in the resulting dataset (i.e., ‘tos’ and ‘sos’ with no corresponding ‘zos’ values). We can plot mean values ds3.sos
across time to check this.
#Calculate the mean of 'sos' data along the dimensions 'lat', 'lon', and 'depth'
# in the dataset ds3, and plots the result
ds3.sos.mean(dim=['lat', 'lon','depth']).plot(marker='o');

We can alo specify the join options in merge
. For example, we can select a slice before we apply the merge
function.
# After selecting data for dates '2024-01-01' and '2024-01-02'
# and depths in the range of 10 to 30 from each dataset,
# merge two xarray datasets, ds_date_extended and ds2,
xr.merge([
ds_date_extended.sel(depth=slice(10, 30), date=['2024-01-01', '2024-01-02']), #Select this slice of data
ds2.sel(depth=slice(10, 30), date=['2024-01-01', '2024-01-02']) #Select this slice of data
])
<xarray.Dataset> Size: 3kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: zos (date, depth, lat, lon) float64 960B -0.22 0.83 -0.3 ... 0.67 0.47 tos (date, depth, lat, lon) float64 960B 28.51 27.06 ... 26.0 27.7 sos (date, depth, lat, lon) float64 960B 34.57 35.12 ... 33.16 33.33 Attributes: Long Name: sea_surface_height_above_geoid Short Name: zos Units: meters Description: Sea surface height is the height of the sea surface above t...
7. Broadcasting (advanced)#
Broadcasting in xarray allows for operations between arrays with different shapes by automatically aligning dimensions. This feature simplifies the handling of data with varying dimensions, enabling seamless computation and manipulation across multiple arrays without the need for manual alignment or iteration.
This topic is beyond the scope of this lesson. If you are curious, I added a broadcasting example in Exerecise 9.
8. Write and read data to a netCDF file#
NetCDF is a file format widely used in scientific and engineering fields to store multidimensional data efficiently. Let us explore how to write our DataArray to a netCDF file for data preservation and learn to read data from existing netCDF files for analysis.
Info
NetCDF, which stands for Network Common Data Form, is a set of software libraries and data formats for array-oriented scientific data. NetCDF is commonly used in the geoscience, climate, and meteorology fields, among others, to store and distribute gridded data. NetCDF data is stored in a self-describing format, meaning that it includes metadata along with the actual data, making it easier for users to understand the data.8.1 Write to a NetCDF file using .to_netcdf
#
The .to_netcdf
method is useful for exporting data structures in xarray to NetCDF file format.
# Write the DataArray to a netCDF file
ds.to_netcdf('Data/ds/zos.nc')
ds2.to_netcdf('Data/ds/tos_sos.nc')
8.2 Read a NetCDF file using xr.open_dataset
#
The xr.open_dataset
method is used to open a single NetCDF file as an xarray Dataset.
It is suitable for working with a single dataset stored in a single file. It returns an xarray Dataset object.
# Read the netCDF file back into a Dataset
ds = xr.open_dataset('Data/ds/zos.nc')
# Displaying the temperature_data DataArray with labeled dimensions, coordinates, and metadata
display(ds)
# Close the file after reading the data
ds.close()
<xarray.Dataset> Size: 1kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: zos (date, depth, lat, lon) float64 960B ...
8.3 Read multiple netCDF files using xr.open_mfdataset
#
The xr.open_mfdataset
is used to open multiple NetCDF files as a single xarray Dataset.
It is useful when you have a dataset spread across multiple files (e.g., one file per time step) and you want to combine them into a single logical dataset. It returns an xarray Dataset object that combines the data from all the input files.
# Use xr.open_mfdataset with the file paths of the two datasets
ds_combined = xr.open_mfdataset("Data/ds/*.nc", combine="by_coords")
# Displaying the temperature_data DataArray with labeled dimensions, coordinates, and metadata
display(ds_combined)
# Close the file after reading the data
ds_combined.close()
<xarray.Dataset> Size: 3kB Dimensions: (date: 2, depth: 3, lat: 4, lon: 5) Coordinates: * date (date) datetime64[ns] 16B 2024-01-01 2024-01-02 * depth (depth) int64 24B 10 20 30 * lat (lat) float64 32B 24.5 26.5 28.5 30.5 * lon (lon) float64 40B -87.0 -84.5 -83.0 -81.5 -79.0 Data variables: tos (date, depth, lat, lon) float64 960B dask.array<chunksize=(2, 3, 4, 5), meta=np.ndarray> sos (date, depth, lat, lon) float64 960B dask.array<chunksize=(2, 3, 4, 5), meta=np.ndarray> zos (date, depth, lat, lon) float64 960B dask.array<chunksize=(2, 3, 4, 5), meta=np.ndarray>
Summary#
Xarray extends Pandas’ labeled-data features for N-dimensional data, making it popular for analyzing gridded datasets. It enables easy access to metadata in NetCDF files, simplifying code writing and readability.
What’s next?#
If you want to use Xarray you can learn about these topics:
Remote data access with OPeNDAP, which facilitates accessing data stored remotely on servers. It allows for efficient retrieval of specific subsets of data without downloading the entire dataset, making it ideal for working with large datasets.
Advanced visualization with Cartopy, which we will learn next.
Resources and references#
Most basic questions and issues with Xarray can be resolved with help from the material in the Xarray documentation. Some of the most popular sections include Why Xarray, Quick overview, and Example gallery
Another resource you may find useful is this Xarray Tutorial collection, created from content hosted on GitHub.
Nasa Gesdisc-Tutorials contain a lot of examples and case studies
Python Tutorial Seminar Series - Xarray Part 1 and Xarray Part 2 provide recordings introducing the Python Package
xarray
.Project Pythia Notebooks on Xarray contains tutorials on using Xarray.