Where to get Climate Datasets

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • Where do I find Climate Datasets?

Objectives
  • Learning to find climate datasets on the web

  • How and where to download a dataset from the web

  • How to be a good citizen of /homes

Where can I find Climate Datasets?

Many climate datasets are available on the COLA servers, but how do I find this data? Where would I look to find a dataset that is not on the COLA servers? As I go forward in my research, how do I find datasets I might want to use?

  1. COLA Datasets Catalog: We are in the process of cataloging all the datasets on the COLA servers. Not everything is there yet, but this is a good place to start if you want to know what data are available locally.

  2. NOAA/Physical Sciences Lab: Many climate datasets are here, with lots of information and search capabilities.

  3. IRI/LDEO Climate Data Library: This is another great resource for finding climate datasets.

  4. NCAR Climate Data Guide: A great resource for getting expert advice on which datasets you should use for your specific application.

Where should I put Climate Datasets?

There are two main ways to access and use climate datasets that are available on the web.

  1. You can download a copy of the dataset to your local computer system, and analyze it there.

  2. You can access, subset, and (depending on the data server) even analyze the data remotely, having only the result on your local computer system.

Since some climate datasets can be very large, and you may need to use many different ones, option #1 may require a sizeable amount of disk space to store the data. However, once you have a local copy of the dataset, you can reuse it as often as you like without downloading it again.

Option #2 will save space on your computer system, but may also slow down calculations, depending on Internet speed and the load on the remote data servers. It also requires accessing the data remotely again every time you make a change to your calculation.

Which option is the better choice depends on your situation: how large the dataset is, how much disk space you have, and how often you expect to reuse the data.

Being a good COLA computer citizen

If you choose to download datasets to the COLA servers, do not store them in your home directory! The /homes disk is a limited, critical resource shared among all users. If you fill up the /homes disk by downloading too many large datasets there, the system will stop working and no one will be able to use it!

There are three categories of disks, denoted by three different top-level directories, where large datasets should be stored:

  1. /scratch - for temporary or non-critical data. This disk is not backed up, and old files may be scrubbed (deleted) if the disk becomes full.

  2. /project - for most working datasets. Each research project at COLA may have one or more project disks. These disks are regularly backed up to tape, so they can be restored if there is a hardware failure or other problem. Check with your advisor for access to project disks relevant to your use.

  3. /shared - for long-term, non-volatile datasets. These disks are where “final versions” of datasets are kept. These may be datasets downloaded from sources like those above, that are deemed essential enough to have local copies, or datasets produced by COLA scientists. Only new files get backed up, and it is expected they will be rarely if ever changed once placed here.
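
Before downloading a large dataset to any of these disks, it can help to check how much free space the disk has and how much a directory already holds. The following is a minimal sketch using the standard Unix commands df and du. The function name disk_report is our own invention for illustration, not something provided on the COLA servers.

```shell
# Report free space on a disk and how much a directory already uses.
# Usage: disk_report <directory>
# (disk_report is a hypothetical helper name, not an existing command.)
disk_report() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "No such directory: $dir" >&2
        return 1
    fi
    # df -h: size, used, and available space of the filesystem holding $dir
    df -h "$dir"
    # du -sh: total size of everything under $dir (can be slow on big trees)
    du -sh "$dir"
}

# Example calls (uncomment to run):
#   disk_report "$HOME"
#   disk_report /scratch
```

If the report for your home directory shows a large total, that is a sign that some datasets should be moved to one of the three disks listed above.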

Downloading Climate Datasets with wget

The most common way to download datasets from the web is to use the Unix command wget.

From a terminal window logged in to one of the COLA servers, change to the /scratch directory. If you already have your own subdirectory there, change to it. If you do not, make one:

$ cd /scratch
$ mkdir <your_username>
$ cd <your_username>
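
The three commands above can also be written so that they are safe to repeat, since mkdir -p does not complain when the directory already exists. This is only a sketch: it assumes that the $USER environment variable holds your username, and the function name enter_workdir is hypothetical.

```shell
# Create (if needed) and enter a personal working directory under a base disk.
# Usage: enter_workdir <base_directory>
# (enter_workdir is a hypothetical helper name used for illustration.)
enter_workdir() {
    base="$1"
    # $USER normally holds your login name; fall back to id -un if it is unset
    name="${USER:-$(id -un)}"
    # -p succeeds even when the directory already exists
    mkdir -p "$base/$name" && cd "$base/$name"
}

# Example call on the COLA servers (uncomment to run):
#   enter_workdir /scratch
```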

At one of the data repository websites listed above, let’s find a dataset to download. In a browser, go to: https://psl.noaa.gov/data/gridded/ and scroll down to the entry: NOAA Extended Reconstructed SST V5. There you will find a web page with a nice description of the dataset.

In the section of the page called “Download/Plot Data”, in the “download file” column you will see two files listed. Do not click the link; instead, right-click on “sst.mon.ltm.1981-2010.nc” and choose “copy link address” to put the URL on your clipboard. Then paste the link into your terminal after typing wget:

$ wget ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc

You will get some text on your screen like the following, that reports on the wget process:

--2021-09-04 14:47:31--  ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
           => ‘sst.mon.ltm.1981-2010.nc’
Resolving ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)... 140.172.38.117
Connecting to ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)|140.172.38.117|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Datasets/noaa.ersst.v5 ... done.
==> SIZE sst.mon.ltm.1981-2010.nc ... 1159964
==> PASV ... done.    ==> RETR sst.mon.ltm.1981-2010.nc ... done.
Length: 1159964 (1.1M) (unauthoritative)

100%[===================================================================================>] 1,159,964   --.-K/s   in 0.1s    

2021-09-04 14:47:37 (10.8 MB/s) - ‘sst.mon.ltm.1981-2010.nc’ saved [1159964]
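
The log above reports the file size in bytes (1159964), so one quick sanity check after a download is to compare the size of the file on disk with that number. The sketch below does this with wc -c. The function name check_size is hypothetical, and a matching size does not prove the contents are intact; it only catches truncated or missing files.

```shell
# Compare a file's size in bytes with the size reported in the wget log.
# Usage: check_size <file> <expected_bytes>
# (check_size is a hypothetical helper name used for illustration.)
check_size() {
    file="$1"
    expected="$2"
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    # wc -c counts bytes; reading from stdin keeps the file name out of the output
    actual=$(wc -c < "$file")
    # arithmetic expansion strips the padding some versions of wc add
    actual=$((actual))
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $file is $actual bytes"
    else
        echo "size mismatch: $file is $actual bytes, expected $expected"
        return 1
    fi
}

# Example call for the file downloaded above (uncomment to run):
#   check_size sst.mon.ltm.1981-2010.nc 1159964
```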

Note that the URL is not http:// or https:// but is ftp://. ftp stands for “file transfer protocol”. It is an old, robust but insecure protocol for moving data. It works well for public data sites because it has an “anonymous” mode that does not require a user to log in to retrieve files. There is a secure alternative called sftp that runs over ssh and requires authentication (a password or an ssh key). sftp or scp (the secure copy command) are preferred over ftp for moving files between private sources (e.g. your COLA account and an account you might have at a supercomputing center).
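
For reference, an scp command has the general shape: scp, then the file to copy, then user@host:directory for the destination. The sketch below only prints such a command rather than running it, because the account name, host name, and remote directory shown are hypothetical placeholders, as is the function name show_scp.

```shell
# Print (not run) an scp command that would copy a file to a remote account.
# Usage: show_scp <local_file> <user> <host> <remote_dir>
# (show_scp is a hypothetical helper name used for illustration.)
show_scp() {
    printf 'scp %s %s@%s:%s/\n' "$1" "$2" "$3" "$4"
}

# Hypothetical account, host, and directory names, for illustration only
show_scp sst.mon.ltm.1981-2010.nc your_username login.example.org /path/on/remote
```

To perform a real copy, you would type the printed command yourself with your own account and host names, and scp would then ask you to authenticate.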

Key Points

  • Don’t download big datasets to your /homes directory!