Where to get Climate Datasets
Overview
Teaching: 15 min
Exercises: 5 min

Questions
- Where do I find climate datasets?

Objectives
- Learn to find climate datasets on the web
- Learn how and where to download a dataset from the web
- Learn how to be a good citizen of /homes
Where can I find Climate Datasets?
Many climate datasets are available on the COLA servers, but how do I find this data? Where would I look to find a dataset that is not on the COLA servers? As I go forward in my research, how do I find datasets I might want to use?
- COLA Datasets Catalog: We are in the process of cataloging all the datasets on the COLA servers. Not everything is there yet, but this is a good place to start if you want to know what data are available locally.
- NOAA/Physical Sciences Lab: Many climate datasets are here, with lots of information and searching capabilities.
- IRI/LDEO Climate Data Library: Another great resource for finding climate datasets.
- NCAR Climate Data Guide: A great resource for getting expert advice on which datasets you should use for your specific application.
Where should I put Climate Datasets?
There are two main ways to access and use climate datasets that are available on the web.
1. You can download a copy of the dataset to your local computer system and analyze it there.
2. You can access, subset, and (depending on the data server) even analyze the data remotely, keeping only the result on your local computer system.
Since some climate datasets can be very large, and you may need to use many different ones, option #1 may require a sizeable amount of disk space to store the data. However, once you have a local copy of a dataset, you own it and can easily use it over and over.
Option #2 saves space on your computer system, but may slow down calculations, depending on Internet speed and the load on the remote data server. It also requires re-accessing the data remotely every time you change your calculation.
Which option is the better choice depends on your situation.
Being a good COLA computer citizen
If you choose to download datasets to the COLA servers, do not store them in your home directory! The /homes disk is a limited, shared, and critical resource for all users. If you fill up the /homes disk by downloading too many large datasets there, the system will stop working and no one will be able to use it!

There are three categories of disks, denoted by three different top-level directories, where large datasets should be stored:
- /scratch: for temporary or non-critical data. This disk is not backed up, and old files may be scrubbed (deleted) if the disk becomes full.
- /project: for most working datasets. Each research project at COLA may have one or more project disks. These disks are regularly backed up to tape, so they can be restored if there is a hardware failure or other problem. Check with your advisor for access to the project disks relevant to your work.
- /shared: for long-term, non-volatile datasets. These disks are where “final versions” of datasets are kept: datasets downloaded from sources like those above that are deemed essential enough to have local copies, or datasets produced by COLA scientists. Only new files get backed up, and it is expected that files will rarely if ever change once placed here.
Downloading Climate Datasets with wget
The most common way to download datasets from the web is to use the Unix command wget.
From a terminal window logged in to one of the COLA servers, change to the /scratch directory.
If you already have your own subdirectory there, go to that; if you do not, make one:
$ cd /scratch
$ mkdir <your_username>
$ cd <your_username>
At one of the data repository websites listed above, let’s find a dataset to download. In a browser, go to: https://psl.noaa.gov/data/gridded/ and scroll down to the entry: NOAA Extended Reconstructed SST V5. There you will find a web page with a nice description of the dataset.
In the section of the page called “Download/Plot Data”, in the “download file” column you will see two files listed.
Don't click, but right-click on “sst.mon.ltm.1981-2010.nc” and choose “copy link address” to put the URL on your clipboard.
Then paste the link into your terminal after typing wget:
$ wget ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
You will get some text on your screen like the following, which reports on the wget process:
--2021-09-04 14:47:31-- ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
=> ‘sst.mon.ltm.1981-2010.nc’
Resolving ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)... 140.172.38.117
Connecting to ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)|140.172.38.117|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /Datasets/noaa.ersst.v5 ... done.
==> SIZE sst.mon.ltm.1981-2010.nc ... 1159964
==> PASV ... done. ==> RETR sst.mon.ltm.1981-2010.nc ... done.
Length: 1159964 (1.1M) (unauthoritative)
100%[===================================================================================>] 1,159,964 --.-K/s in 0.1s
2021-09-04 14:47:37 (10.8 MB/s) - ‘sst.mon.ltm.1981-2010.nc’ saved [1159964]
Note that the URL is not http:// or https:// but ftp://. FTP stands for “file transfer protocol”.
It is an old, robust, but insecure protocol for moving data. It works well for public data servers because it has an “anonymous” mode that does not require a user to log in to retrieve files.
There is a secure version called sftp that uses ssh and requires passwords.
sftp or scp (the secure copy command) are preferred over ftp for moving files between private systems (e.g., your COLA account and an account you might have at a supercomputing center).
Key Points
- Don’t download big datasets to your /homes directory!
Data File Formats
Overview
Teaching: 30 min
Exercises: 0 min

Questions
- What are the common file formats for climate datasets?
- What is NetCDF?
- How can we open and access data in NetCDF files?

Objectives
- Become familiar with some of the varieties of data formats used for climate data
- Learn how to peruse and parse a NetCDF file
- An introduction to xarray
Formats of Datasets
There are many data formats in use in the field of climate research, but there are a few that predominate:
- NetCDF: the Network Common Data Form (originally called the Network Climate Data Format). Typical suffixes: .nc, .nc4. This binary format was developed by the climate community, but has been adopted by many other communities thanks to its utility and self-describing design (i.e., the data file also includes metadata describing its contents: spatial, temporal, and any other dimensions, properties, and attributes). Beginning with version 4, data compression is an option. It has become the most common format for climate data files.
- GRIB: GRIdded Binary format. Typical suffix: .grb. GRIB was developed by the World Meteorological Organization (WMO) as an efficient compressed binary format for the exchange of weather data. The original format (GRIB1) is only semi-self-describing, as it requires external look-up tables to decode the indices used. GRIB2 is truly self-describing. Unlike NetCDF, GRIB allows the compression to be tailored for each variable in a file.
- Flat binary files, including those produced as output from Fortran programs, and the data files produced by GrADS. No standard suffix. Flat binary files contain no metadata, and usually cannot be understood without additional documentation. The native GrADS file format pairs one or more flat binary data files with a “data descriptor file”, usually ending with the suffix .ctl, that contains all the metadata of a dataset in a human-writable and human-readable form.
- ASCII files, which include .csv (“comma-separated values”) files, a simple format for storing spreadsheets. ASCII stands for “American Standard Code for Information Interchange”, and is the text or “string” format standard for many programming languages. ASCII is easy for humans to read, but is much less efficient at storing data than binary formats. It is advisable only for small files.
- Excel files, a spreadsheet format from Microsoft with a wide range of specialized formatting extensions. Typical suffixes: .xls, .xlsx (the latter uses the Extensible Markup Language, XML, to encode metadata).
- GIS (Geographic Information System) files come in a variety of formats depending on the application and software used. Often the information for a single dataset is split across multiple files, particularly for “shapefiles” containing polygonal data.
- Orthorectified imagery formats such as .tiff are binary image files used to store geo-located data (also often with metadata in separate files), common for some observational data, especially imagery-derived data from remote sensing or aircraft.
In this course, we will focus on the top part of this list, but know that Python has libraries available to read all these and many more dataset formats.
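As a small taste of what those libraries look like in practice, here is a minimal sketch (the file names are hypothetical placeholders, and it assumes the xarray, cfgrib, and pandas packages are installed):

import xarray as xr
import pandas as pd

ds_nc = xr.open_dataset("example_data.nc")                      # NetCDF: self-describing, opens directly
ds_grib = xr.open_dataset("example_data.grb", engine="cfgrib")  # GRIB: needs a decoding engine
df_csv = pd.read_csv("example_table.csv")                       # ASCII .csv: tabular text via pandas
df_xls = pd.read_excel("example_table.xlsx")                    # Excel: also via pandas (needs an engine such as openpyxl)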
NetCDF
NetCDF has an associated software library called NCO (netCDF Operators) that, independent of Python or any other software, can perform a variety of operations on NetCDF datasets.
The NCO executables (each function acts like its own Unix command with options and arguments; they even have man and info pages) can be very handy for doing basic operations on data in the bash shell on a Unix system, such as compressing (or “deflating”, in NetCDF-speak) a large uncompressed NetCDF dataset you downloaded, in order to conserve disk space.
Perhaps the most useful and often-used of these command-line tools is ncdump (strictly speaking, ncdump ships with the netCDF library itself rather than with NCO). Without options, it will dump the entire contents of a NetCDF file to the screen as ASCII numbers and text. The -h option shows only the “header” information and not the contents of variables, basically showing you only the metadata.
Returning to that data file you downloaded to your scratch directory, give it a try:
$ ncdump -h sst.mon.ltm.1981-2010.nc
A lot of text is sent to your screen, starting with a list of the dimensions of the dataset in the file, then the variables, including a listing of the attributes of each variable, and finally the attributes of the dataset itself (the global attributes).
NetCDF metadata
The ncdump -h command lists metadata in a human-readable format that is fairly easy to interpret. Find the following information:
- The names of the dimension variables in the dataset and the size of each
- The meaning of each dimension variable
- The data variables (the ones that vary in both space and time dimensions)
- What is this a dataset of? (peruse the “global attributes”)
Solution
- lon = 180; lat = 89; time = 12; nbnds = 2
- lon is longitude (˚E); lat is latitude (˚N); time is months (although that is not terribly obvious from the metadata); nbnds is time boundaries (also a bit mysterious at this point)
- sst is sea surface temperature [˚C]; valid_yr_count is the “count of non-missing values used in mean”, i.e., number of years of good data used.
- A monthly SST climatology averaged over 1981-2010, combining several sources of data. It’s called “NOAA Extended Reconstructed SST V5”, and there is a J. Climate paper describing it.
There are other useful software libraries similar to NCO. In particular, there is the Climate Data Operators (CDO) software package that has much of the same functionality as NCO, but also works with other data formats such as GRIB.
Introduction to xarray
To open and use NetCDF datasets in Python, we will use the xarray module.
xarray allows for opening, manipulating, and parsing multi-dimensional datasets that include one or more variables and their associated metadata.
In particular, xarray integrates with the dask parallel computing library, which can scale and parallelize operations on large datasets, making calculations fast and efficient.
It provides a very nice balance between ease of use and efficiency when analyzing climate datasets.
Open a new Jupyter Notebook (name it something like “Plot_netcdf.ipynb”) and in the first code cell, type the following three import statements:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
Note: there is nothing special about the choice of these abbreviations for the three modules - they just happen to be the ones most people use. You could use any abbreviation you like, or none at all.
In a new cell, let’s define the path to the dataset we downloaded and open it with xarray:
file = "/scratch/<your_username>/sst.mon.ltm.1981-2010.nc"
ds = xr.open_dataset(file)
Now query the object ds by typing its name and pressing <return>:
ds
You will get something that looks much like the result of ncdump -h, but even more helpful.
You will also be able to view the contents of any variable, and expand or collapse the view of any attributes.
For very large or multi-file datasets (xr.open_mfdataset() can open multiple files at once and link them together in a single object), you will also be shown how the data are “chunked”, i.e., how they are organized when loaded into computer memory.
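For example, a minimal sketch of a multi-file open (the wildcard path is a hypothetical placeholder, not a file from this lesson):

# Open many NetCDF files as one logical dataset, linked along their
# shared coordinates; the data are loaded lazily, in chunks.
ds_all = xr.open_mfdataset("/scratch/<your_username>/sst.*.nc", combine="by_coords")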
Next, let’s make a plot of the data. To access the variable sst in ds, we can either say:
ds.sst
…or…
ds['sst']
The latter is more flexible, and it also makes clear that ds is not a module with a function called sst, which makes it less confusing for a human to read.
You can see that the dump of ds['sst'] shows more focused information, plus a sample of the data in the array.
There are 3 dimensions: time, latitude, and longitude (in that sequence - and the sequence is important!).
To make a 2-D plot, we will choose the first (0th) time step:
plt.contourf(ds['sst'][0,:,:])
Things to note:
- contourf is one of the many plotting functions of matplotlib.pyplot, and is especially good for plotting gridded environmental data.
- The [0,:,:] construct tells what to do with each dimension of ds['sst'], in order: [time, lat, lon].
- An integer (or a variable containing an integer) specifies a single index value for that dimension.
- A : by itself means the entire range.
- Sub-ranges are specified with a mix of integers and colons, e.g., 1: means all but the 0th element, and 2:5 includes elements 2-4 (remember it is “up to but not including” the last number).
- A second colon can be used to indicate the step, e.g., :10:2 includes elements 0, 2, 4, 6, and 8 (but not 10).
- The map is upside down! That is because, although there are latitudes and longitudes associated with the last two dimensions of the array, we did not convey that information to the plotting function, so it treats the data like a mathematical array with the origin at the lower left. We can also see that the axes are labeled with array indices, not latitudes and longitudes.
We could flip the plot over by changing the indexing of the latitude dimension:
plt.contourf(ds['sst'][0,-1::-1,:])
What does that indexing mean?
- The first -1 means the last element in the latitude dimension. Negative indices are a shortcut for counting from the back instead of the front; the next-to-last element is -2, and so on.
- The second -1 is the step. -1::-1 means start at the back and count toward the front (in this case, start in the south and count northward) by ones.
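Putting these indexing rules together, here are a few illustrative slices (a sketch using the ds opened above; the slice choices themselves are arbitrary examples):

sst = ds['sst']              # dimensions: [time, lat, lon]
first = sst[0, :, :]         # the first (0th) time step, all latitudes and longitudes
flipped = sst[0, -1::-1, :]  # the same map with the latitude axis reversed
subset = sst[5:8, :, ::2]    # time steps 5-7, every other longitude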
The question mark, and <tab><tab>
In Python, and particularly in Jupyter Notebooks, there are many sources of help as you are writing code. Two that are especially useful:
- If you place a question mark immediately after an object (Python is an “object-oriented” programming language, and everything in Python is an “object”), in most cases you will get a brief, helpful description of it. The degree of detail will vary, but it can help you to keep straight and understand what is what.
- If you are typing the name of a function and halfway through you don’t quite remember the spelling or syntax, you can hit <tab> twice and a list of auto-complete options will come up to help you. You can click on the one you want to finish typing it.
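For example, in a notebook code cell (a sketch; the ? syntax works in Jupyter/IPython, not in a plain Python script):

# Run one of these at a time to see a description pop up:
plt.contourf?
ds?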
Key Points
- NetCDF is now the most common climate data format.
- xarray wants to be your best friend - let it!
OPeNDAP and GRIB
Overview
Teaching: 20 min
Exercises: 0 min

Questions
- How can we use OPeNDAP to access datasets directly across the web?
- How do we open GRIB files?

Objectives
- Learn how to use xarray to read remote datasets
- Learn how to use xarray to read GRIB files
OPeNDAP
OPeNDAP (Open-source Project for a Network Data Access Protocol) is a data server architecture that allows access to remote datasets from within many different software packages, including Python.
Opening a remote file via OPeNDAP is as easy as opening a file on local disk, except a URL to the dataset is supplied instead of a path on local disk.
Regardless of the native data file format, OPeNDAP presents it to the client software in a NetCDF-style format.
When a file is opened via OPeNDAP, only the metadata is initially passed to the client. Just the specific slices of variables used in a calculation are sent over the Internet, not the entire dataset. Thus, it can be much more efficient than downloading a large dataset and only using a small part of it. However, performance of code using OPeNDAP depends on the speed of the Internet connection to the data server.
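Conceptually, the access pattern looks like this (a sketch with placeholder names; the real URL and variable for this lesson appear below):

ds = xr.open_dataset("https://example.gov/thredds/dodsC/some_dataset.nc")  # fetches metadata only
subset = ds["some_var"][-1, :, :]   # still lazy: just describes a slice
values = subset.values              # only now is that slice transferred over the network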
OPeNDAP in xarray
To open and use remote datasets in Python served via OPeNDAP, we use the xarray function open_dataset() in exactly the same way as before.
This time, let’s look at a different dataset from the NOAA PSL repository: the Palmer Drought Severity Index. Go to its page and select the OPeNDAP file name for the self-calibrated version.
In a new code cell, type the following (you can copy and paste the dataset path from the web page):
url = "https://psl.noaa.gov/thredds/dodsC/Datasets/dai_pdsi/pdsi.mon.mean.selfcalibrated.nc"
dd = xr.open_dataset(url)
This probably took a few seconds, whereas opening the local file was nearly instantaneous. There was some communication over the Internet to establish a link between your Python process running in a Jupyter Notebook on the COLA computer and the remote data server (which sits in Boulder, Colorado).
As before, query the new object dd by typing its name and pressing <return>:
dd
In the same format as before, you will see the metadata pertinent to this dataset. There is only one data variable; the horizontal resolution is a bit lower (the spatial dimensions of the arrays are smaller), but the time dimension is much larger.
The dimensions of this dataset are again in the order [time, lat, lon] (always check before you move forward!), so we can make a quick plot. This time, let’s plot the last time step:
plt.contourf(dd["pdsi"][-1,:,:]) ; plt.colorbar()
Things to note:
- There was more hesitation: although the spatial grid is smaller than that of the SST data we downloaded to disk, we had to retrieve the data over the Internet. If we were only interested in the last month, or the last year, of this 164-year dataset, this would be more efficient than downloading the whole file first. However, if we wanted to do a number of different calculations using the whole time series, it might be better in the long run to download it.
- It’s not upside down! If you noticed when you queried the metadata, the latitudes start at -58.75 and count up to 76.25. So we also see it is not a full global grid: it excludes Antarctica and areas north of the Arctic Circle.
- We added a colorbar using a different pyplot function.
- Also note that we put two separate Python commands on the same line, separated by a semicolon. This is valid syntax in Python - you do not have to have one command per line.
Opening a GRIB file in xarray
GRIB format is used by operational weather forecast centers around the world for forecast model output. Let’s go back to our terminal window and copy the following file to your scratch directory:
$ cd /scratch/<your_username>
$ cp /shared/working/rean/era-interim/daily/data/2014/ei.oper.an.pl.regn128cm.2014020800 test.grb
Back in our Jupyter notebook, let’s open a new code cell and proceed to open this GRIB file containing data from the ERA-Interim reanalysis from ECMWF:
gribfile = "/scratch/<your_username>/test.grb"
dg = xr.open_dataset(gribfile,engine='cfgrib')
We had to specify an “engine” that knows how to read a GRIB file.
If you look in your scratch directory, you will find that a new file has been created.
Because GRIB files are extremely compressed, down to the bit level, an index file (ending in .idx) is generated from the metadata. It holds a mapping of the exact places on disk where each grid starts, along with specific information on how to uncompress each grid. This speeds up navigating and processing the dataset.
Examine the file’s metadata:
dg
Note that here we again have 3 dimensions, but time is not one of them.
These are global grids at a single date and time, but across 37 pressure levels (the coordinate isobaricInhPa).
Expand the attributes of any data variable and have a look.
Finally, make a plot of one of the variables at a particular pressure level (something you might be familiar with, like t, temperature, or u, zonal wind).
Do you remember the syntax? Can you add a colorbar?
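One possible answer, as a sketch (this assumes the dimension order is [level, lat, lon], so the 0 picks an arbitrary first pressure level; check the metadata first):

plt.contourf(dg['t'][0,:,:]) ; plt.colorbar()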
Key Points
- OPeNDAP allows remote access to datasets across the Internet without downloading files.
- xarray renders the differences between local and remote data, and between different data file formats, nearly imperceptible.