Setup

This example uses riri for data downloading and some pre-processing under the hood. It also uses the tidyverse family of packages to facilitate post-processing and visualization of the data.

library(riri)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

You can find the source code for the riri package on GitHub: https://github.com/ashiklom/riri. It is still in the early stages of development, so expect it to change frequently. I welcome comments and critiques in the issues, and, of course, pull requests!

Introduction to the IRI/LDEO library

The IRI/LDEO library contains a variety of data sources, each accessible at a specific URL. The library can also preprocess data sources on the server side, for example subsetting to specific coordinates and times, averaging over specific durations, and performing more advanced analyses like correlations and empirical orthogonal functions (EOFs). These operations are requested by appending the appropriate tags to the end of the URL.

The functions in riri are designed around this principle of progressively extending a string, and are therefore best understood via the magrittr pipe (%>%) operator.

base_url()
## [1] "http://iridl.ldeo.columbia.edu"
base_url() %>% ncep_var('air_temperature')
## [1] "http://iridl.ldeo.columbia.edu" "SOURCES"                       
## [3] ".NOAA"                          ".NCEP"                         
## [5] ".CPC"                           ".GHCN_CAMS"                    
## [7] ".gridded"                       ".deg0p5"                       
## [9] ".temp"
base_url() %>% ncep_var('precipitation')
## [1] "http://iridl.ldeo.columbia.edu" "SOURCES"                       
## [3] ".NOAA"                          ".NCEP"                         
## [5] ".CPC"                           ".PRECL"                        
## [7] ".v1p0"                          ".deg0p5"                       
## [9] ".rain"

These functions build up a character vector, which generate_url then concatenates into a single URL.

base_url() %>% ncep_var('air_temperature') %>% generate_url()
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp"
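Conceptually, generate_url does little more than collapse the character vector with slashes. A sketch of the idea in base R (not necessarily the actual implementation):

parts <- c("http://iridl.ldeo.columbia.edu", "SOURCES", ".NOAA", ".NCEP",
           ".CPC", ".GHCN_CAMS", ".gridded", ".deg0p5", ".temp")
paste(parts, collapse = "/")
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp"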

These examples produce the root URL of each dataset. You can paste those URLs into a browser to confirm that they work.

Although you can download a global dataset if you wish, most analyses will focus on a specific site or region. To filter a dataset down to a specific point, use the filter_point function.

mysite <- base_url() %>% 
    ncep_var('air_temperature') %>% 
    filter_point(lat = 45.67, lon = -85.553)
mysite
##  [1] "http://iridl.ldeo.columbia.edu"             
##  [2] "SOURCES"                                    
##  [3] ".NOAA"                                      
##  [4] ".NCEP"                                      
##  [5] ".CPC"                                       
##  [6] ".GHCN_CAMS"                                 
##  [7] ".gridded"                                   
##  [8] ".deg0p5"                                    
##  [9] ".temp"                                      
## [10] "Y/(45.670000N) VALUES/X/(85.553000W) VALUES"
generate_url(mysite)
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp/Y/(45.670000N) VALUES/X/(85.553000W) VALUES"

Note the IRI-specific syntax in the final element, which selects the values at the given latitude (Y) and longitude (X).
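One practical aside: the generated URL contains spaces and parentheses, which most HTTP clients require to be percent-encoded. Presumably retrieve_data (below) handles this internally, but if you want to use the URL directly, base R can encode it for you:

utils::URLencode("Y/(45.670000N) VALUES/X/(85.553000W) VALUES")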

Downloading data

To actually download a netCDF file containing the data, use the retrieve_data function.

mysite %>% 
    generate_url %>% 
    retrieve_data

By default, this downloads to a temporary file. Optionally, you can also provide a specific file path.

mysite %>% 
    generate_url %>% 
    retrieve_data('mysite.nc')

The output of retrieve_data is the filename itself.

mysite_file <- mysite %>% 
    generate_url %>% 
    retrieve_data()

mysite_nc <- ncdf4::nc_open(mysite_file)
mysite_nc
## File /tmp/RtmpobrJn6/file7fed23fbfd91 (NC_FORMAT_CLASSIC):
## 
##      1 variables (excluding dimension variables):
##         float temp[X,Y,T]   
##             gribfield: 1
##             units: Kelvin_scale
##             standard_name: air_temperature
##             long_name: Air Temperature
##             missing_value: 9.99900026055401e+20
##             expires: 1493596800
## 
##      3 dimensions:
##         T  Size:831
##             calendar: 360
##             pointwidth: 1
##             standard_name: time
##             expires: 1493596800
##             gridtype: 0
##             units: months since 1960-01-01
##         Y  Size:1
##             pointwidth: 0.5
##             standard_name: latitude
##             gridtype: 0
##             units: degree_north
##         X  Size:1
##             pointwidth: 0.5
##             standard_name: longitude
##             period: 360
##             modulus: 360
##             gridtype: 0
##             units: degree_east

Preprocessing data

You may deal with the resultant netCDF files on your own if you wish. However, riri includes some functions to facilitate getting the data into a workable form first.
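For example, extracting the same data by hand with ncdf4 looks roughly like this (a sketch of the manual route that riri automates):

nc <- ncdf4::nc_open(mysite_file)
temp <- ncdf4::ncvar_get(nc, "temp")        # the data variable
time <- ncdf4::ncvar_get(nc, "T")           # the time dimension
temp_units <- ncdf4::ncatt_get(nc, "temp", "units")$value
ncdf4::nc_close(nc)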

read_nc2list will read the contents of a netCDF file into a list, with each item carrying its full set of netCDF attributes. Here, we want to grab the variable and only the time dimension (T), since we are dealing with a time series at a single point.

mysite_list <- read_nc2list(mysite_file, dims = 'T')
str(mysite_list)
## List of 2
##  $ T   : atomic [1:831] -144 -142 -142 -140 -140 ...
##   ..- attr(*, "calendar")= chr "360"
##   ..- attr(*, "pointwidth")= num 1
##   ..- attr(*, "standard_name")= chr "time"
##   ..- attr(*, "expires")= int 1493596800
##   ..- attr(*, "gridtype")= int 0
##   ..- attr(*, "units")= chr "months since 1960-01-01"
##  $ temp: atomic [1:831] 263 264 269 278 282 ...
##   ..- attr(*, "gribfield")= int 1
##   ..- attr(*, "units")= chr "Kelvin_scale"
##   ..- attr(*, "standard_name")= chr "air_temperature"
##   ..- attr(*, "long_name")= chr "Air Temperature"
##   ..- attr(*, "missing_value")= num 1e+21
##   ..- attr(*, "expires")= int 1493596800

A list format is convenient because, when the dimensions align, it can be easily made into a data.frame (or, in this case, a tibble).

as_data_frame(mysite_list)
## # A tibble: 831 × 2
##         T   temp
##     <dbl>  <dbl>
## 1  -143.5 262.94
## 2  -142.5 264.19
## 3  -141.5 269.14
## 4  -140.5 278.23
## 5  -139.5 281.92
## 6  -138.5 287.79
## 7  -137.5 292.25
## 8  -136.5 291.93
## 9  -135.5 289.40
## 10 -134.5 281.50
## # ... with 821 more rows

Note that the time is in a weird format. You can extract the units from mysite_list via the object’s attributes.

attributes(mysite_list[['T']])
## $calendar
## [1] "360"
## 
## $pointwidth
## [1] 1
## 
## $standard_name
## [1] "time"
## 
## $expires
## [1] 1493596800
## 
## $gridtype
## [1] 0
## 
## $units
## [1] "months since 1960-01-01"
attr(mysite_list[['T']], 'units')
## [1] "months since 1960-01-01"

“Months since 1960” is a weird date format. Fortunately, riri provides a conversion function that uses some lubridate magic to convert this to a proper date and optionally (by default) drop the original T variable.

mysite_list2 <- process_date(mysite_list)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
str(mysite_list2)
## List of 2
##  $ temp: atomic [1:831] 263 264 269 278 282 ...
##   ..- attr(*, "gribfield")= int 1
##   ..- attr(*, "units")= chr "Kelvin_scale"
##   ..- attr(*, "standard_name")= chr "air_temperature"
##   ..- attr(*, "long_name")= chr "Air Temperature"
##   ..- attr(*, "missing_value")= num 1e+21
##   ..- attr(*, "expires")= int 1493596800
##  $ date: Date[1:831], format: "1948-01-16" "1948-02-16" ...
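The conversion itself amounts to adding the (fractional, mid-month-centered) month offsets to the 1960-01-01 origin. A rough sketch of the idea with lubridate (not the actual process_date implementation):

library(lubridate)
t_val <- -143.5  # months since 1960-01-01, centered mid-month
as.Date("1960-01-01") %m+% months(floor(t_val)) + days(15)
## [1] "1948-01-16"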

Putting it all together

All of these operations can be conveniently wrapped into a single long pipeline, which in turn can be wrapped in a function for retrieving one kind of data. Let’s create a function for retrieving NCEP monthly temperature, called ncep_temp.

ncep_temp <- function(lat, lon) {
    base_url() %>% 
        ncep_var('air_temperature') %>% 
        filter_point(lat = lat, lon = lon) %>% 
        generate_url() %>% 
        retrieve_data() %>% 
        read_nc2list(dims = 'T') %>% 
        process_date() %>% 
        as_data_frame()
}
site1_temp <- ncep_temp(43.63, -89.76)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
site1_temp
## # A tibble: 831 × 2
##      temp       date
##     <dbl>     <date>
## 1  261.63 1948-01-16
## 2  265.27 1948-02-16
## 3  271.86 1948-03-16
## 4  283.20 1948-04-16
## 5  285.82 1948-05-16
## 6  291.73 1948-06-16
## 7  295.31 1948-07-16
## 8  294.43 1948-08-16
## 9  290.98 1948-09-16
## 10 281.76 1948-10-16
## # ... with 821 more rows
site2_temp <- ncep_temp(71.27, -156.65)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
site2_temp
## # A tibble: 831 × 2
##      temp       date
##     <dbl>     <date>
## 1  245.66 1948-01-16
## 2  247.52 1948-02-16
## 3  245.28 1948-03-16
## 4  256.32 1948-04-16
## 5  262.90 1948-05-16
## 6  272.46 1948-06-16
## 7  278.06 1948-07-16
## 8  275.51 1948-08-16
## 9  269.34 1948-09-16
## 10 261.68 1948-10-16
## # ... with 821 more rows

We can use some tidyverse magic to efficiently extend this to a bunch of sites.

First, create a tibble containing site information.

coords_df <- tribble(
    ~site_id, ~lat, ~lon,
    1, 43.63330,  -89.75830,
    2, 71.27576, -156.64139,
    3, -2.85000,  -54.96667,
    4, 33.52174, -116.15635,
    5, 32.81471, -115.44200,
    6, 29.70000,  -82.16670,
    7, 42.49500,  -71.79810,
    8, 42.53100,  -72.19000,
    9, 45.22220,  -68.73560,
    10, 47.47000,   -0.56000
    )
coords_df
## # A tibble: 10 × 3
##    site_id      lat        lon
##      <dbl>    <dbl>      <dbl>
## 1        1 43.63330  -89.75830
## 2        2 71.27576 -156.64139
## 3        3 -2.85000  -54.96667
## 4        4 33.52174 -116.15635
## 5        5 32.81471 -115.44200
## 6        6 29.70000  -82.16670
## 7        7 42.49500  -71.79810
## 8        8 42.53100  -72.19000
## 9        9 45.22220  -68.73560
## 10      10 47.47000   -0.56000

Then, use the purrr::map2 function to apply our new ncep_temp function to each site’s latitude and longitude.

temp_data <- coords_df %>% 
    mutate(temp_dat = map2(lat, lon, ncep_temp))
temp_data
## # A tibble: 10 × 4
##    site_id      lat        lon           temp_dat
##      <dbl>    <dbl>      <dbl>             <list>
## 1        1 43.63330  -89.75830 <tibble [831 × 2]>
## 2        2 71.27576 -156.64139 <tibble [831 × 2]>
## 3        3 -2.85000  -54.96667 <tibble [831 × 2]>
## 4        4 33.52174 -116.15635 <tibble [831 × 2]>
## 5        5 32.81471 -115.44200 <tibble [831 × 2]>
## 6        6 29.70000  -82.16670 <tibble [831 × 2]>
## 7        7 42.49500  -71.79810 <tibble [831 × 2]>
## 8        8 42.53100  -72.19000 <tibble [831 × 2]>
## 9        9 45.22220  -68.73560 <tibble [831 × 2]>
## 10      10 47.47000   -0.56000 <tibble [831 × 2]>

Note that the temperature data are stored as nested tibbles, i.e. a list-column of data frames. Read more about nested tibbles in the excellent book R for Data Science by Garrett Grolemund and Hadley Wickham.

Now, let’s unnest this data and plot it.

temp_data_unnested <- unnest(temp_data)
ggplot(temp_data_unnested) + 
    aes(x = date, y = temp) +
    geom_line(color = 'grey60') + 
    geom_smooth() + 
    facet_wrap(~site_id, scales = 'free_y')
## `geom_smooth()` using method = 'loess'

It may be more instructive to look at some summary statistics. To make this plot a little easier to look at, let’s filter it down to just three of the sites.

summaries <- temp_data_unnested %>% 
    mutate(year = lubridate::year(date)) %>% 
    filter(year < 2017) %>% 
    group_by(year, site_id) %>% 
    summarize_at(vars(temp), funs(mean, min, max))
summaries
## Source: local data frame [690 x 5]
## Groups: year [?]
## 
##     year site_id     mean    min    max
##    <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
## 1   1948       1 280.4892 261.63 295.31
## 2   1948       2 259.0500 242.52 278.06
## 3   1948       3 298.8500 298.23 299.76
## 4   1948       4 289.2542 280.16 298.76
## 5   1948       5 294.4108 283.70 305.17
## 6   1948       6 293.9608 283.88 299.54
## 7   1948       7 281.7508 266.06 294.78
## 8   1948       8 280.6858 263.90 294.06
## 9   1948       9 278.8225 262.35 293.09
## 10  1948      10 284.6550 278.65 290.59
## # ... with 680 more rows
ggplot(summaries %>% 
       gather(stat, value, mean, min, max) %>% 
       filter(site_id %in% c(3, 6, 10))) + 
    aes(x = year, y = value) + 
    geom_line() + 
    geom_smooth() + 
    facet_wrap(~site_id + stat, scales = 'free', ncol = 3)
## `geom_smooth()` using method = 'loess'