Climate Data - Where? How?

Where to get Climate Datasets

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • Where do I find Climate Datasets?

Objectives
  • Learning to find climate datasets on the web

  • How and where to download a dataset from the web

  • How to be a good citizen of /homes

Where can I find Climate Datasets?

Many climate datasets are available on the COLA servers, but how do I find this data? Where would I look to find a dataset that is not on the COLA servers? As I go forward in my research, how do I find datasets I might want to use?

  1. COLA Datasets Catalog We are in the process of cataloging all the datasets on the COLA servers. Not everything is there yet, but this is a good place to start if you want to know what data are available locally.

  2. NOAA/Physical Sciences Lab Many climate datasets are here with lots of information and searching capabilities.

  3. IRI/LDEO Climate Data Library This is another great resource for finding climate datasets.

  4. NCAR Climate Data Guide A great resource for getting expert advice on which datasets to use for your specific application.

Where should I put Climate Datasets?

There are two main ways to access and use climate datasets that are available on the web.

  1. You can download a copy of the dataset to your local computer system, and analyze it there.

  2. You can access, subset, and (depending on the data server) even analyze the data remotely, having only the result on your local computer system.

Since some climate datasets can be very large, and you may need to use many different ones, option #1 may require a sizeable amount of disk space to store the data. However, once you have a copy of the dataset locally, you own it and you can easily use it over and over.

Option #2 will save space on your computer system, but may also slow down calculations, depending on Internet speed and the load on the remote data servers. It requires reaccessing the data remotely every time you make a change to your calculation.

Thus, there is a decision to be made: which option is better depends on your situation - the size of the dataset, how much of it you need, and how often you will use it.

Being a good COLA computer citizen

If you choose to download datasets to the COLA servers, do not store them in your home directory!! The /homes disk is a limited, critical resource shared among all users. If you fill up the /homes disk by downloading too many large datasets there, the system will stop working and no one will be able to use it!

There are three categories of disks, denoted by three different top-level directories, where large datasets should be stored:

  1. /scratch - for temporary or non-critical data. This disk is not backed up, and old files may be scrubbed (deleted) if the disk becomes full.

  2. /project - for most working datasets. Each research project at COLA may have one or more project disks. These disks are regularly backed up to tape, so they can be restored if there is a hardware failure or other problem. Check with your advisor for access to project disks relevant to your use.

  3. /shared - for long-term, non-volatile datasets. These disks are where “final versions” of datasets are kept. These may be datasets downloaded from sources like those above that are deemed essential enough to keep local copies of, or datasets produced by COLA scientists. Only new files are backed up, and files are expected to change rarely, if ever, once placed here.

Downloading Climate Datasets with wget

The most common way to download datasets from the web is to use the Unix command wget.

From a terminal window logged in to one of the COLA servers, change to the /scratch directory. If you already have your own subdirectory there, go to that. If you do not, make one:

$ cd /scratch
$ mkdir <your_username>
$ cd <your_username>

At one of the data repository websites listed above, let’s find a dataset to download. In a browser, go to: https://psl.noaa.gov/data/gridded/ and scroll down to the entry: NOAA Extended Reconstructed SST V5. There you will find a web page with a nice description of the dataset.

In the section of the page called “Download/Plot Data”, in the “download file” column you will see two files listed. Don't click, but right-click on “sst.mon.ltm.1981-2010.nc” and choose “copy link address” to put the URL on your clipboard. Then paste the link into your terminal after typing wget:

$ wget ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc

You will get some text on your screen like the following, reporting on the wget process:

--2021-09-04 14:47:31--  ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
           => ‘sst.mon.ltm.1981-2010.nc’
Resolving ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)... 140.172.38.117
Connecting to ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)|140.172.38.117|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Datasets/noaa.ersst.v5 ... done.
==> SIZE sst.mon.ltm.1981-2010.nc ... 1159964
==> PASV ... done.    ==> RETR sst.mon.ltm.1981-2010.nc ... done.
Length: 1159964 (1.1M) (unauthoritative)

100%[===================================================================================>] 1,159,964   --.-K/s   in 0.1s    

2021-09-04 14:47:37 (10.8 MB/s) - ‘sst.mon.ltm.1981-2010.nc’ saved [1159964]

Note that the URL is not http:// or https:// but ftp://. ftp stands for “file transfer protocol”. It is an old, robust, but insecure protocol for moving data that works well from websites because it has an “anonymous” mode that does not require a user to log in to retrieve files. There is a secure version called sftp that uses ssh and requires passwords. sftp or scp (the secure copy command) are preferred over ftp for moving files between private sources (e.g. your COLA account and an account you might have at a supercomputing center).
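
As a sketch of such a private transfer with scp (the remote host name and destination path here are placeholders, not real addresses):

$ scp sst.mon.ltm.1981-2010.nc <your_username>@hpc.example.edu:/path/to/destination/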

Key Points

  • Don’t download big datasets to your /homes directory!


Data File Formats

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What are the common file formats for Climate Datasets?

  • What is NetCDF?

  • How can we open and access data in NetCDF files?

Objectives
  • Become familiar with some of the varieties of data formats used for climate data

  • Learn how to peruse and parse a NetCDF file

  • Introduction to xarray

Formats of Datasets

There are many data formats in use in the field of climate research, but there are a few that predominate:

  1. NetCDF: The Network Common Data Form (originally called the Network Climate Data Format). Typical suffixes: .nc, .nc4. This binary format was developed by the climate community, but has been adopted in many other communities due to its utility and self-describing format (i.e., the data file also includes metadata to describe the contents, its spatial, temporal and any other dimensions, properties and attributes). Beginning with version 4, data compression has become an option. It has become the most common format for climate data files.

  2. GRIB: GRIdded Binary format. Typical suffix: .grb. GRIB was developed by the World Meteorological Organization (WMO) as an efficient compressed binary format for the exchange of weather data. The original format (GRIB1) is only semi-self-describing, as it requires external look-up tables to decode the indices used. GRIB2 is truly self-describing. Unlike NetCDF, GRIB allows for tailored compression for each variable in a file.

  3. Flat binary files, including those produced as output from FORTRAN programs, and the data files produced by GrADS. No standard suffix. Flat binary files contain no metadata, and usually cannot be understood without additional documentation. The native GrADS file format pairs one or more flat binary data files with a “data descriptor file”, usually ending with the suffix .ctl, that contains all the metadata of a dataset in a human-writable and readable form.

  4. ASCII files, which include .csv or “comma separated values” files, which are a simple form of storing spreadsheets. ASCII stands for “American Standard Code for Information Interchange”, and is the text or “string” format standard for many programming languages. ASCII is considered easy to read by humans, but is much less efficient at storing data than binary formats. It is advisable only for small files.

  5. Excel files, a spreadsheet format with a wide range of specialized formatting extensions from Microsoft. Typical suffixes: .xls, .xlsx (the latter uses the Extensible Markup Language, XML, to encode metadata).

  6. GIS (Geographic Information System) files come in a variety of different formats depending on the application and software used. Often information for a single dataset is split into multiple files, particularly for “shapefiles” containing polygonal data.

  7. Orthorectified Imagery Formats such as .tiff files are binary image files used to store geo-located data (also often with metadata in separate files), common for some observational data, especially imagery-derived data from remote sensing or aircraft.

In this course, we will focus on the top part of this list, but know that Python has libraries available to read all these and many more dataset formats.
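
For example, reading a .csv file into Python takes only a line or two with the pandas library (the file name here is hypothetical):

import pandas as pd

df = pd.read_csv("station_data.csv")   # hypothetical CSV file
print(df.head())                       # show the first few rows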

NetCDF

NCO (netCDF Operators) is a software package that, independent of Python or any other software, can perform a variety of operations on NetCDF datasets. The NCO executables (each function acts like its own Unix command with options and arguments - they even have man and info pages) can be very handy for doing basic operations on data in the bash shell on a Unix system, such as compressing (or “deflating” in NetCDF-speak) a large uncompressed NetCDF dataset you downloaded, in order to conserve disk space.
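
As a sketch of that deflation use case (the file names here are placeholders), the ncks operator can rewrite a file in NetCDF-4 format (-4) with a chosen compression level from 0-9 (-L):

$ ncks -4 -L 5 uncompressed.nc compressed.nc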

Perhaps the most useful and most often used of these command-line tools is ncdump (strictly speaking, it is distributed with the NetCDF library itself rather than NCO). Without options, it will dump the entire contents of a NetCDF file to the screen as ASCII numbers and text. The -h option shows only the “header” information and not the contents of variables, basically showing you only the metadata.

Returning to that data file you downloaded to your scratch directory… give it a try:

$ ncdump -h sst.mon.ltm.1981-2010.nc

A lot of text is sent to your screen, starting with a list of the dimensions of the dataset in the file, then the variables including a listing of the attributes of each variable, and finally the attributes of the dataset itself (global attributes).

NetCDF metadata

The ncdump -h command lists metadata in a human-readable format that is fairly easy to interpret.

Find the following information:

  1. The names of the dimension variables in the dataset and the size of each
  2. The meaning of each dimension variable
  3. The data variables (the ones that vary in both space and time dimensions)
  4. What is this a dataset of? (peruse the “global attributes”)

Solution

  1. lon = 180; lat = 89; time = 12; nbnds = 2
  2. lon is longitude (˚E); lat is latitude (˚N); time is months (although that is not terribly obvious from the metadata); nbnds is time boundaries (also a bit mysterious at this point)
  3. sst is sea surface temperature [˚C]; valid_yr_count is the “count of non-missing values used in mean”, i.e., number of years of good data used.
  4. A monthly SST climatology averaged over 1981-2010, combining several sources of data. It’s called “NOAA Extended Reconstructed SST V5” and there is a J. Climate paper describing it.

There are other useful software libraries similar to NCO. In particular, there is the Climate Data Operators (CDO) software package that has much of the same functionality as NCO, but also works with other data formats such as GRIB.
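
For example, assuming CDO is installed on the system, it can print a summary of a file’s contents, much like ncdump -h:

$ cdo sinfo sst.mon.ltm.1981-2010.nc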

Introduction to xarray

To open and use NetCDF datasets in Python, we will use the xarray module. xarray allows for opening, manipulating and parsing multi-dimensional datasets that include one or more variables and their associated metadata. In particular, xarray integrates with the dask parallel computing library, which chunks and parallelizes operations on large datasets, making calculations fast and efficient. It provides a very nice balance between ease of use and efficiency when analyzing climate datasets.

Open a new Jupyter Notebook (name it something like “Plot_netcdf.ipynb”) and in the first code cell, type the following three import statements

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr

Note: there is nothing special about the choice of these abbreviations for these three modules - but they happen to be the most commonly used ones by most people. You could use any abbreviation you like, or none at all.

In a new cell, let’s define the path to the dataset we downloaded and open it with xarray:

file = "/scratch/<your_username>/sst.mon.ltm.1981-2010.nc"
ds = xr.open_dataset(file)

Now query the object name ds by typing its name and <return>:

ds

You will get something that looks much like the result of ncdump -h, but even more helpful.

[Screenshot: xarray metadata view of the SST dataset]

You will also be able to view the contents of any variable, and expand or collapse a view of any attributes. For very large or multi-file datasets (xr.open_mfdataset() can open multiple files at once and link them together in a single object - a sketch follows), you will also be shown how the data are “chunked” - i.e., organized when loaded into computer memory.
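
For example, a minimal sketch of opening a hypothetical set of files as one dataset (the file pattern here is a placeholder):

ds_all = xr.open_mfdataset("/scratch/<your_username>/sst.*.nc", combine="by_coords")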

Next, let’s make a plot of the data. To access the variable sst in ds, we can either say:

ds.sst

…or…

ds['sst']

The latter is more flexible and also makes it clear that ds is not a module with a function called sst – it is less confusing for a human to read.

You can see that the dump of ds['sst'] shows more focused information, and a sample of the data in the arrays. There are 3 dimensions: time, latitude and longitude (in that sequence - the sequence is important!). To make a 2-D plot, we will choose the first (0th) time step:

plt.contourf(ds['sst'][0,:,:])

Things to note:

  1. contourf is one of the many plotting functions of matplotlib.pyplot, and especially good for plotting gridded environmental data.
  2. The [0,:,:] construct tells what to do with each dimension of ds['sst'] in order: [time,lat,lon] (see the short sketch after this list).
    • An integer (or variable containing an integer) specifies one index value for that dimension.
    • A : by itself means the entire range.
    • Sub-ranges are specified with a mix of integers and colons, e.g.: 1: means all but the 0th element, 2:5 would include elements 2-4 (remember it is “up to but not including” the last number).
    • A second colon could be used to indicate the step, e.g., :10:2 would include elements 0,2,4,6,8 (but not 10).
  3. The map is upside down! That is because although there are latitudes and longitudes associated with the last two dimensions of the array, we did not convey that information to the plotting function. Thus, it treats it like a mathematical array with the origin at the lower left. We can also see that the axes are labeled by the indices and not latitudes and longitudes.
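
Here is the short sketch of slice indexing promised above, using the same variable (each line is independent, not a sequence of steps):

ds['sst'][0, :, :]     # first time step, all latitudes and longitudes
ds['sst'][0, 2:5, :]   # first time step, latitude elements 2, 3 and 4
ds['sst'][0, :, ::2]   # first time step, every other longitude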

We could flip the plot over by changing the indexing of the latitude dimension:

plt.contourf(ds['sst'][0,-1::-1,:])

What does that indexing mean?

The question mark, and <tab><tab>

In Python, and particularly in Jupyter Notebooks, there are many sources of help as you are writing code. Two that are especially useful:

  • If you place a question mark immediately after an object (Python is an “object-oriented” programming language, and everything in Python is an “object”), in most cases you will get a brief, helpful description of it. The degree of detail will vary, but it can help you keep track of what is what.
  • If you are typing the name of a function and halfway through you don’t quite remember the spelling or syntax, you can hit <tab> twice and a list of auto-complete options will come up; click on the one you want to finish typing it.
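
For example, in a notebook cell (the trailing ? is IPython/Jupyter syntax, not plain Python):

plt.contourf?    # displays the docstring for contourf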

Key Points

  • NetCDF is now the most common climate data format.

  • xarray wants to be your best friend - let it!


OPeNDAP and GRIB

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How can we use OPeNDAP to access datasets directly across the web?

  • How do we open GRIB files?

Objectives
  • Learn how to use xarray to read remote datasets

  • Learn how to use xarray to read GRIB files

OPeNDAP

OPeNDAP (Open-source Project for a Network Data Access Protocol) is a data server architecture that allows access to remote datasets within many different software packages, including Python.
Opening a remote file via OPeNDAP is as easy as opening a file on local disk, except a URL to the dataset is supplied instead of a path on local disk. Regardless of the native data file format, OPeNDAP presents it to the client software in a NetCDF-style format.

When a file is opened via OPeNDAP, only the metadata is initially passed to the client. Just the specific slices of variables used in a calculation are sent over the Internet, not the entire dataset. Thus, it can be much more efficient than downloading a large dataset and only using a small part of it. However, performance of code using OPeNDAP depends on the speed of the Internet connection to the data server.
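
A minimal sketch of that workflow, using the xarray syntax from the previous episode (the URL and variable name here are placeholders; a real example follows below):

remote = xr.open_dataset("https://example.edu/thredds/dodsC/some_dataset.nc")  # placeholder URL
subset = remote["some_var"].isel(time=-1)   # lazy: no data transferred yet
values = subset.load()                      # only this one time slice crosses the Internet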

OPeNDAP in xarray

To open and use remote datasets in Python served via OPeNDAP, we use the xarray function open_dataset() in exactly the same way as before. This time, let’s look at a different dataset from the NOAA PSL repository. Let’s look at the Palmer Drought Severity Index page and select the OPeNDAP file name for the self-calibrated version.

In a new code cell, type the following (you can copy and paste the dataset path from the web page):

url = "https://psl.noaa.gov/thredds/dodsC/Datasets/dai_pdsi/pdsi.mon.mean.selfcalibrated.nc"
dd = xr.open_dataset(url)

This probably took a few seconds, whereas opening the local file was nearly instantaneous. There was some communication over the Internet to establish a link between your Python process running in a Jupyter Notebook on the COLA computer and the remote data server (which sits in Boulder, Colorado).

As before, query the new object dd by typing its name and <return>:

dd

In the same format as before, you will see the metadata pertinent to this dataset. There is only one data variable; the horizontal resolution is a bit lower (the spatial dimensions of the arrays are smaller), but the time dimension is much larger.

[Screenshot: xarray metadata view of the PDSI dataset]

The dimensions of this data are again in the order: [time, lat, lon] (always check before you move forward!), so we may make a quick plot of this data. This time, let’s plot the last time step:

plt.contourf(dd["pdsi"][-1,:,:]) ; plt.colorbar()

Things to note:

  1. There was more hesitation - although the spatial grid is smaller than the SST data we downloaded to disk, we had to retrieve the data over the Internet. If we were only interested in the last month, or last year, of this 164-year dataset, this would be more efficient than downloading the whole file first. However, if we wanted to do a number of different calculations using the whole time series, it might be better in the long run to download it.
  2. It’s not upside down! If you noticed when you queried the metadata, the latitudes start at -58.75 and count up to 76.25. So we also see it is not a full global grid: it excludes Antarctica and areas poleward of about 76˚N.
  3. We added a colorbar using a different pyplot function.
  4. Also note that we put two separate Python commands on the same line separated by a semicolon. This is valid syntax in Python - you do not have to have one command per line.

Opening a GRIB file in xarray

GRIB format is used by operational weather forecast centers around the world for forecast model output. Let’s go back to our terminal window and copy the following file to your scratch directory:

$ cd /scratch/<your_username>
$ cp /shared/working/rean/era-interim/daily/data/2014/ei.oper.an.pl.regn128cm.2014020800 test.grb

Back in our Jupyter notebook, let’s open a new code cell and proceed to open this GRIB file containing data from the ERA-Interim reanalysis from ECMWF:

gribfile = "/scratch/<your_username>/test.grb"
dg = xr.open_dataset(gribfile,engine='cfgrib')

We had to specify an “engine” that knows how to read a GRIB file. If you look in your scratch directory, you will find a new file has been created. Because GRIB files are extremely compressed, down to the bit level, an index file (ending in .idx) is generated from the metadata; it holds a mapping of the exact places on disk where each grid starts, plus specific information on how to uncompress each grid. This speeds up the navigation and processing of the dataset.
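
As an aside, if you would rather not have the index file written (for example, when the data directory is read-only), cfgrib accepts a backend option to disable it; a sketch:

dg = xr.open_dataset(gribfile, engine='cfgrib',
                     backend_kwargs={'indexpath': ''})   # skip writing the .idx file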

Examine the file’s metadata:

dg

[Screenshot: xarray metadata view of the ERA-Interim dataset]

Note that here we again have 3 dimensions, but time is not one of them. These are global grids at a single date/time, but across 37 pressure levels (the coordinate isobaricInhPa).

Expand the attributes of any data variable and have a look.

Finally, make a plot of one of the variables at a particular pressure level (something you might be familiar with, like t, temperature, or u, zonal wind). Do you remember the syntax? Can you add a colorbar? One possible solution is sketched below.
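
One possible solution, assuming t (temperature) is among the data variables and 500 hPa is one of the 37 levels:

plt.contourf(dg['t'].sel(isobaricInhPa=500)) ; plt.colorbar()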

Key Points

  • OPeNDAP allows remote access to datasets across the Internet without downloading files.

  • xarray makes the differences between local and remote data, and between different data file formats, nearly imperceptible.