Where to get Climate Datasets
Overview
Teaching: 15 min
Exercises: 5 min
Questions
Where do I find Climate Datasets?
Objectives
Learning to find climate datasets on the web
How and where to download a dataset from the web
How to be a good citizen of /homes
Where can I find Climate Datasets?
Many climate datasets are available on the COLA servers, but how do I find this data? Where would I look to find a dataset that is not on the COLA servers? As I go forward in my research, how do I find datasets I might want to use?
- COLA Datasets Catalog: We are in the process of cataloging all the datasets on the COLA servers. Not everything is there yet, but this is a good place to start if you want to know what data are available locally.
- NOAA/Physical Sciences Lab: Many climate datasets are here, with lots of information and search capabilities.
- IRI/LDEO Climate Data Library: This is another great resource for finding climate datasets.
- NCAR Climate Data Guide: A great resource for getting expert advice on which datasets you should use for your specific application.
Where should I put Climate Datasets?
There are two main ways to access and use climate datasets that are available on the web.
1. You can download a copy of the dataset to your local computer system, and analyze it there.
2. You can access, subset, and (depending on the data server) even analyze the data remotely, having only the result on your local computer system.
Since some climate datasets can be very large, and you may need to use many different ones, option #1 may require a sizeable amount of disk space to store the data. However, once you have a copy of the dataset locally, you own it and you can easily use it over and over.
Option #2 will save space on your computer system, but may also slow down calculations, depending on Internet speed and the load on the remote data servers. It requires reaccessing the data remotely every time you make a change to your calculation.
Which option is the better choice therefore depends on your situation: how large the dataset is, how often you will reuse it, and how fast and reliable your connection to the remote server is.
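One practical input to this decision is how much room a dataset would take up compared with what is free on the disk you plan to use. The sketch below is one way to check, not a COLA-specific tool: TARGET_DIR and DATASET_BYTES are placeholder names invented for this example, and the size shown is that of the small example file downloaded later in this lesson.

```shell
# Placeholders for illustration: the directory you would download into,
# and the size of the dataset in bytes.
TARGET_DIR="."
DATASET_BYTES=1159964

# df -P prints one line per file system in a fixed POSIX layout and -k
# reports 1024-byte blocks; field 4 of line 2 is the space still available.
free_kb=$(df -Pk "$TARGET_DIR" | awk 'NR==2 {print $4}')

# Round the dataset size up to whole kilobytes.
need_kb=$(( (DATASET_BYTES + 1023) / 1024 ))

if [ "$need_kb" -lt "$free_kb" ]; then
    echo "fits: need ${need_kb} KB, ${free_kb} KB free"
else
    echo "too large: need ${need_kb} KB, only ${free_kb} KB free"
fi
```

If the dataset would take up a large share of the free space, remote access (option #2) may be the safer choice.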
Being a good COLA computer citizen
If you choose to download datasets to the COLA servers, do not store them in your home directory! The /homes disk is a limited, critical resource shared among all users. If you fill up the /homes disk by downloading too many large datasets there, the system will stop working and no one will be able to use it!
There are three categories of disks, denoted by three different top-level directories, where large datasets should be stored:
/scratch - for temporary or non-critical data. This disk is not backed up, and old files may be scrubbed (deleted) if the disk becomes full.
/project - for most working datasets. Each research project at COLA may have one or more project disks. These disks are regularly backed up to tape, so they can be restored if there is a hardware failure or other problem. Check with your advisor for access to project disks relevant to your use.
/shared - for long-term, non-volatile datasets. These disks are where “final versions” of datasets are kept. These may be datasets downloaded from sources like those above that are deemed essential enough to have local copies, or datasets produced by COLA scientists. Only new files get backed up, and it is expected that files will rarely, if ever, be changed once placed here.
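Before downloading anything, it can help to see how much space a directory already uses. The following is a minimal sketch using the standard du command; CHECK_DIR is a placeholder name for this example, and on the COLA servers you might point it at your home directory or your own /scratch subdirectory.

```shell
# Placeholder for illustration: the directory whose usage you want to check.
CHECK_DIR="."

# du -s gives a single total for the directory; -k reports it in kilobytes.
used_kb=$(du -sk "$CHECK_DIR" | awk '{print $1}')
echo "$CHECK_DIR uses $used_kb KB in total"

# Show the five largest directories in the tree, sizes in KB.
# The total for CHECK_DIR itself appears as the last line.
du -k "$CHECK_DIR" | sort -n | tail -n 5
```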
Downloading Climate Datasets with wget
The most common way to download datasets from the web is to use the Unix command wget.
From a terminal window logged in to one of the COLA servers, change to the /scratch directory.
If you already have your own subdirectory there, go to that. If you do not, make one:
$ cd /scratch
$ mkdir <your_username>
$ cd <your_username>
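The three commands above can also be combined into a form that is safe to repeat, since mkdir -p does nothing if the directory already exists. This is a sketch rather than an official COLA procedure: SCRATCH_ROOT and WORKDIR are names invented here, and SCRATCH_ROOT falls back to a temporary directory so the example can be tried on any machine. On the COLA servers you would set it to /scratch.

```shell
# Assumption: SCRATCH_ROOT stands in for /scratch. It falls back to the
# system temporary directory so the sketch can run anywhere.
SCRATCH_ROOT="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"

# id -un prints the name of the current user.
WORKDIR="$SCRATCH_ROOT/$(id -un)"

# -p creates the directory if needed and succeeds if it already exists.
mkdir -p "$WORKDIR"

# Stop here if the directory cannot be entered.
cd "$WORKDIR" || exit 1
pwd
```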
At one of the data repository websites listed above, let’s find a dataset to download. In a browser, go to: https://psl.noaa.gov/data/gridded/ and scroll down to the entry: NOAA Extended Reconstructed SST V5. There you will find a web page with a nice description of the dataset.
In the section of the page called “Download/Plot Data”, in the “download file” column you will see two files listed.
Don't click, but right-click on “sst.mon.ltm.1981-2010.nc” and choose “copy link address” to put the URL on your clipboard.
Then paste the link into your terminal after typing wget:
$ wget ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
You will get some text on your screen like the following, which reports on the wget process:
--2021-09-04 14:47:31-- ftp://ftp.cdc.noaa.gov/Datasets/noaa.ersst.v5/sst.mon.ltm.1981-2010.nc
=> ‘sst.mon.ltm.1981-2010.nc’
Resolving ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)... 140.172.38.117
Connecting to ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)|140.172.38.117|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /Datasets/noaa.ersst.v5 ... done.
==> SIZE sst.mon.ltm.1981-2010.nc ... 1159964
==> PASV ... done. ==> RETR sst.mon.ltm.1981-2010.nc ... done.
Length: 1159964 (1.1M) (unauthoritative)
100%[===================================================================================>] 1,159,964 --.-K/s in 0.1s
2021-09-04 14:47:37 (10.8 MB/s) - ‘sst.mon.ltm.1981-2010.nc’ saved [1159964]
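The last line of the log reports the number of bytes saved, which gives a simple way to confirm that a download completed. The sketch below compares a file's size on disk with an expected byte count. The function name check_size is made up for this example, and a matching size is only a basic sanity check, not proof that the contents are intact.

```shell
# check_size FILE EXPECTED_BYTES
# Succeeds only if FILE exists and is exactly EXPECTED_BYTES long.
check_size() {
    file=$1
    expected=$2
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    # wc -c counts bytes; tr strips the padding some systems add.
    actual=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $file is $actual bytes"
    else
        echo "size mismatch: $file is $actual bytes, expected $expected"
        return 1
    fi
}

# The expected size comes from the "saved [1159964]" line in the log above.
# "|| true" keeps a surrounding script going if the file is not present.
check_size sst.mon.ltm.1981-2010.nc 1159964 || true
```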
Note that the URL does not begin with http:// or https:// but with ftp://. ftp stands for “file transfer protocol”.
It is an old, robust, but insecure protocol for moving data. It works well for public data sites because it has an “anonymous” mode that does not require a user to log in to retrieve files.
There is a secure alternative called sftp that runs over ssh and requires you to log in.
sftp or scp (the secure copy command) are preferred over ftp for moving files between private sources (e.g. your COLA account and an account you might have at a supercomputing center).
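As an illustration of the secure alternative, the sketch below builds an scp command for copying the downloaded file to another machine and prints it instead of running it. Every name in it (REMOTE_USER, REMOTE_HOST, REMOTE_DIR, and the host login.example.edu) is a hypothetical placeholder, so substitute the details of your own account before running the printed command yourself.

```shell
# Hypothetical placeholders: replace these with your own details.
LOCAL_FILE="sst.mon.ltm.1981-2010.nc"
REMOTE_USER="your_username"
REMOTE_HOST="login.example.edu"
REMOTE_DIR="/scratch/your_username"

# General form: scp SOURCE USER@HOST:DESTINATION
SCP_CMD="scp ${LOCAL_FILE} ${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_DIR}/"

# Print the command rather than run it, since the host above does not exist.
echo "$SCP_CMD"
```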
Key Points
Don’t download big datasets to your /homes directory!