In this post, I go through the steps to prepare data from the ERA5 weather dataset for machine learning (in particular, weather forecasting and climate modeling).
Obtain ERA5 Dataset
The official website for the ERA5 dataset is located here, which contains weather data from the 1940s to the present. At the most granular level, the data is provided at a 1-hour interval, on a $0.25 \times 0.25$ degree grid for the atmosphere and a $0.5 \times 0.5$ degree grid for ocean waves.
- Note that since the Earth’s latitude ranges from -90 to 90 degrees and longitude ranges from -180 to 180 degrees, the spatial discretization at full resolution corresponds to a $721 \times 1440$ grid (the 721 latitude rows include both poles); see the quick check after this list.
- Different machine learning models and different hardware configurations (which determine how much memory capacity we have) determine what resolution we can work with.
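As a quick sanity check, the following sketch (plain Python, no dependencies) computes the grid sizes implied by each resolution. ERA5's native 0.25° product carries one extra latitude row (721 instead of 720) because it includes both poles, while the regridded datasets used below follow the simple 180/res by 360/res counts.

# Quick check of the grid sizes implied by each resolution.
# ERA5's native 0.25 deg grid has 721 latitude rows (both poles included);
# the regridded WeatherBench grids use the plain 180/res x 360/res counts.
for res in (0.25, 1.40625, 2.8125, 5.625):
    n_lat = int(round(180 / res))   # latitude spans -90 to 90 degrees
    n_lon = int(round(360 / res))   # longitude spans -180 to 180 degrees
    print(f"{res:>8} deg -> {n_lat} x {n_lon} grid")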
The ERA5 dataset can be conveniently obtained through two means (there are other ways, but the two most straightforward for the ML community are):
- Approach 1: Through WeatherBench 2, a weather forecasting benchmark project that hosts the data on Google Cloud, in particular at 'gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_1440x721.zarr' (a short sketch of reading this store follows this list).
- Approach 2: Through WeatherBench, which points you to the data bucket hosted by TU Munich. This is the approach we use in this tutorial.
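For completeness, here is a minimal sketch of Approach 1: opening the WeatherBench 2 store directly from Google Cloud with xarray. This assumes xarray, zarr, and gcsfs are installed and that anonymous access to the public bucket works; the variable name queried at the end is an assumption, so check print(ds) first.

import xarray as xr

# Open the WeatherBench 2 store directly from Google Cloud Storage (requires gcsfs).
ds = xr.open_zarr(
    "gs://weatherbench2/datasets/era5-hourly-climatology/1990-2019_6h_1440x721.zarr",
    storage_options={"token": "anon"},   # anonymous access to the public bucket
)
print(ds)                      # list available variables and coordinates
print(ds["2m_temperature"])    # variable name assumed; verify against print(ds)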
For the second approach, the data is provided on spatial grids at different granularities (1.40625°, 2.8125°, and 5.625°); for simplicity, we first examine the 5.625° dataset. Within each resolution subdirectory, one can choose to download all features at once or just specific features of choice. In our case, we start with only 2m_temperature, the temperature of the air at 2 m above the surface of land, sea, or inland waters, which is 2.2 GB in size. Note that with a sufficient hardware configuration, you may ideally want to use all features at once. For the list of features and their physical meaning, refer to the official ERA5 documentation page.
The following command downloads the data for 2m_temperature:
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg%2F2m_temperature&files=2m_temperature_5.625deg.zip" -O 2m_temperature_5.625deg.zip --no-check-certificate
Note that the certificate check has to be turned off (hence the --no-check-certificate flag) because of how the TUM data server serves these files.
Besides the 2m_temperature data, we also need to download the constants file, constants_5.625deg.nc, using:
wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg%2Fconstants&files=constants_5.625deg.nc" -O constants_5.625deg.nc --no-check-certificate
This file contains constant fields such as the land-sea mask, latitude, longitude, and so on.
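As a quick check, you can open the constants file with xarray and list its contents. This is only a sketch; the exact variable names (e.g., lsm for the land-sea mask, orography) are assumptions and may differ depending on the dataset version.

import xarray as xr

# Inspect the constants file downloaded above.
constants = xr.open_dataset("constants_5.625deg.nc")
print(constants.data_vars)      # constant fields, e.g., land-sea mask, orography, ...
print(constants.coords)         # lat/lon coordinates of the 5.625-deg grid

# Example: pull out the land-sea mask as a numpy array (name "lsm" is assumed).
if "lsm" in constants:
    lsm = constants["lsm"].values
    print(lsm.shape)            # expected (32, 64) on the 5.625-deg grid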
Examine the Dataset
After unzipping the file, you should obtain NetCDF files named 2m_temperature_{year}_5.625deg.nc, where {year} ranges over the years covered by the dataset (1979 to 2018 in this coarse-grained case). These *.nc files can be converted to numpy format using the processing script from ClimaX, in particular the function nc2np.
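Before converting, it is worth opening one of the yearly files to see what is inside. The sketch below assumes the variable stored in the file is named t2m (a common NetCDF short name for 2 m temperature); verify this against the printed dataset summary.

import xarray as xr

# Inspect a single yearly file before running the nc2np conversion.
ds = xr.open_dataset("2m_temperature_2018_5.625deg.nc")
print(ds)             # dimensions, coordinates, and variables

t2m = ds["t2m"]       # variable name assumed; check print(ds) above
print(t2m.shape)      # expected (8760, 32, 64): hourly snapshots on the 5.625-deg grid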
Note that to adapt this script to your own use, pay attention to the --root_dir and --variables input arguments to the main function. In this case:
python nc2np_equally_era5.py --variables=2m_temperature --root_dir=<your root dir for data>
This script by default splits the training, validation, and test data by years:
- Data from 1979 to 2016 forms the training dataset.
- Data from 2016 to 2017 forms the validation dataset.
- Data from 2017 to 2018 forms the test dataset.
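Once the script finishes, you can inspect the processed numpy files directly. The directory layout, file naming, and array shape below are assumptions based on how the ClimaX preprocessing typically shards data by year, so adjust the paths and keys to whatever the script actually produced.

import glob
import numpy as np

# Inspect the processed numpy shards (paths, names, and shapes are assumptions).
root_dir = "<your root dir for data>"
shards = sorted(glob.glob(f"{root_dir}/train/*.npz"))
print(f"found {len(shards)} training shards")

data = np.load(shards[0])
print(list(data.keys()))                 # expected: one array per variable, e.g. "2m_temperature"
print(data["2m_temperature"].shape)      # expected (num_snapshots, 1, 32, 64)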
Now that the numpy files are generated, we can use them to train any deep learning model of choice.
For turning the numpy files into PyTorch DataLoaders, ClimaX again has some neat code that handles it using PyTorch Lightning.
Simply by using the code here, one can obtain the train, validation, and test data loaders, respectively; the following sample code can be used to examine the shape of the data coming out of the loaders (here I set up a config file to hold the dataset information):
import yaml

# Import path assumed from the ClimaX repository layout; adjust to your installation.
from climax.global_forecast.datamodule import GlobalForecastDataModule

if __name__ == "__main__":
    # Load dataset paths and variable names from a small config file.
    with open("../../configs/dataset.yaml", "r") as f:
        config_data = yaml.safe_load(f)

    data_module = GlobalForecastDataModule(
        root_dir=config_data["dataset"]["root_dir"],            # directory that stores your processed np files
        variables=config_data["dataset"]["variables"],          # 2m_temperature
        out_variables=config_data["dataset"]["out_variables"],  # 2m_temperature
        buffer_size=config_data["dataset"]["buffer_size"],
    )
    data_module.setup()
    train_loader = data_module.train_dataloader()
    val_loader = data_module.val_dataloader()
    test_loader = data_module.test_dataloader()

    # Examine what's inside the data loader.
    # Shape of x and y:
    #   x : [B, Vi, H, W], e.g., a single input variable gives (64, 1, 32, 64)
    #   y : [B, Vo, H, W], e.g., a single output variable gives (64, 1, 32, 64)
    # lead_times tells how many time steps x is ahead of y.
    for idx, data in enumerate(train_loader):
        print(f"idx = {idx}")
        print(len(data))
        print(f"x = {data[0].shape}")
        print(f"y = {data[1].shape}")
        print(f"lead_times = {data[2]}")
        print(f"variables = {data[3]}")
        print(f"out_variables = {data[4]}")
        break
where the input to the GlobalForecastDataModule specifies the features (variables) taken from the input data and the target variables (out_variables) to be forecast.
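For reference, the dataset.yaml read at the top of the snippet might look like the following. The keys mirror exactly what the snippet accesses; the buffer_size value and the paths are placeholders of my own choosing, not anything prescribed by ClimaX.

import yaml

# A sketch of the config file read by the snippet above; values are placeholders.
config = {
    "dataset": {
        "root_dir": "<your root dir for data>",   # directory with the processed np files
        "variables": ["2m_temperature"],
        "out_variables": ["2m_temperature"],
        "buffer_size": 10000,                     # shuffle-buffer size; pick what fits in memory
    }
}
with open("../../configs/dataset.yaml", "w") as f:
    yaml.safe_dump(config, f)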
As the comment shows, the data has shape $(64, 1, 32, 64)$, where:
- the first dimension is the batch size,
- the second dimension is the number of input channels (in this case we only have one feature, so this value is one),
- the last two dimensions are the numbers of spatial discretizations along latitude ($180/5.625 = 32$) and longitude ($360/5.625 = 64$).
The predicted target is the same feature, 2m_temperature. The lead_times value specifies the (normalized) number of time stamps between the input snapshot x and the output snapshot y, i.e., simply the $\Delta t$ between the two snapshots; a large lead time corresponds to forecasting over a long time horizon.
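If you want the lead time back in physical units, you have to undo the normalization. The sketch below assumes the lead time is stored as hours divided by 100, which is how I read the ClimaX code; verify that constant against your version before relying on it.

# Convert the normalized lead_times back to hours.
# ASSUMPTION: lead time is stored as hours / 100; check your ClimaX version.
NORMALIZATION_HOURS = 100.0

for x, y, lead_times, variables, out_variables in train_loader:
    lead_hours = lead_times * NORMALIZATION_HOURS
    print(f"lead time (normalized) = {lead_times[0].item():.2f}, in hours = {lead_hours[0].item():.1f}")
    break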
The data module creates snapshots as simple input-output pairs, so it does not provide a time-series style analysis.