This example uses the riri package for data downloading and some pre-processing under the hood. It also uses the tidyverse package to facilitate post-processing and visualization of the data.
library(riri)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
You can find the source code for the riri package on GitHub: https://github.com/ashiklom/riri. It is still in the early stages of development, so expect it to change frequently. I welcome comments and critiques in the issues, and, of course, pull requests!
The IRI/LDEO library contains a variety of data sources, each accessible via a specific URL. The library can also do its own preprocessing of data sources, such as subsetting to specific coordinates and times, averaging over specific durations, and more advanced analyses like correlations and empirical orthogonal functions (EOFs). These analyses are invoked by appending the appropriate tags to the end of the URL.
The functions in riri are designed around this principle of progressively extending a string, and are therefore best understood via the magrittr pipe (%>%) operator.
base_url()
## [1] "http://iridl.ldeo.columbia.edu"
base_url() %>% ncep_var('air_temperature')
## [1] "http://iridl.ldeo.columbia.edu" "SOURCES"
## [3] ".NOAA" ".NCEP"
## [5] ".CPC" ".GHCN_CAMS"
## [7] ".gridded" ".deg0p5"
## [9] ".temp"
base_url() %>% ncep_var('precipitation')
## [1] "http://iridl.ldeo.columbia.edu" "SOURCES"
## [3] ".NOAA" ".NCEP"
## [5] ".CPC" ".PRECL"
## [7] ".v1p0" ".deg0p5"
## [9] ".rain"
base_url() %>% ncep_var('air_temperature')
## [1] "http://iridl.ldeo.columbia.edu" "SOURCES"
## [3] ".NOAA" ".NCEP"
## [5] ".CPC" ".GHCN_CAMS"
## [7] ".gridded" ".deg0p5"
## [9] ".temp"
These functions create a character vector, which is concatenated into a URL at the end by generate_url.
base_url() %>% ncep_var('air_temperature') %>% generate_url()
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp"
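Judging from the output above, the final step amounts to joining the accumulated segments with `/` separators. A minimal base-R sketch of that idea (an illustration of the pattern, not riri's actual implementation, which may differ):

```r
# Accumulate path segments in a character vector, then collapse into a URL.
parts <- c("http://iridl.ldeo.columbia.edu", "SOURCES", ".NOAA", ".NCEP",
           ".CPC", ".GHCN_CAMS", ".gridded", ".deg0p5", ".temp")
url <- paste(parts, collapse = "/")
url
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp"
```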
These examples provide the root URLs of each dataset. You can browse to those URLs to confirm that they work.
Although you can download a global dataset if you wish, most analyses will focus on a specific site or region. To filter a dataset down to a specific point, use the filter_point function.
mysite <- base_url() %>%
  ncep_var('air_temperature') %>%
  filter_point(lat = 45.67, lon = -85.553)
mysite
## [1] "http://iridl.ldeo.columbia.edu"
## [2] "SOURCES"
## [3] ".NOAA"
## [4] ".NCEP"
## [5] ".CPC"
## [6] ".GHCN_CAMS"
## [7] ".gridded"
## [8] ".deg0p5"
## [9] ".temp"
## [10] "Y/(45.670000N) VALUES/X/(85.553000W) VALUES"
generate_url(mysite)
## [1] "http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp/Y/(45.670000N) VALUES/X/(85.553000W) VALUES"
Note the resultant IRI-specific syntax.
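Judging by the clause in the URL above, the point filter is a formatted string in which the hemisphere letter follows the absolute value of the coordinate. A hedged sketch of how such a clause could be constructed (point_clause is a hypothetical helper for illustration, not part of riri's API):

```r
# Build an IRI-style point-selection clause from signed decimal coordinates.
# Hemisphere letters flip with the sign; "%f" gives the six decimal places
# seen in the URL above.
point_clause <- function(lat, lon) {
  lat_str <- sprintf("%f%s", abs(lat), if (lat >= 0) "N" else "S")
  lon_str <- sprintf("%f%s", abs(lon), if (lon >= 0) "E" else "W")
  sprintf("Y/(%s) VALUES/X/(%s) VALUES", lat_str, lon_str)
}

point_clause(45.67, -85.553)
## [1] "Y/(45.670000N) VALUES/X/(85.553000W) VALUES"
```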
To actually download a netCDF file containing the data, use the retrieve_data function.
mysite %>%
  generate_url %>%
  retrieve_data
By default, this downloads to a temporary file. Optionally, you can also provide a specific file path.
mysite %>%
  generate_url %>%
  retrieve_data('mysite.nc')
The output of retrieve_data is the filename itself.
mysite_file <- mysite %>%
  generate_url %>%
  retrieve_data()
mysite_nc <- ncdf4::nc_open(mysite_file)
mysite_nc
## File /tmp/RtmpobrJn6/file7fed23fbfd91 (NC_FORMAT_CLASSIC):
##
## 1 variables (excluding dimension variables):
## float temp[X,Y,T]
## gribfield: 1
## units: Kelvin_scale
## standard_name: air_temperature
## long_name: Air Temperature
## missing_value: 9.99900026055401e+20
## expires: 1493596800
##
## 3 dimensions:
## T Size:831
## calendar: 360
## pointwidth: 1
## standard_name: time
## expires: 1493596800
## gridtype: 0
## units: months since 1960-01-01
## Y Size:1
## pointwidth: 0.5
## standard_name: latitude
## gridtype: 0
## units: degree_north
## X Size:1
## pointwidth: 0.5
## standard_name: longitude
## period: 360
## modulus: 360
## gridtype: 0
## units: degree_east
You may deal with the resulting netCDF files on your own if you wish. However, riri includes some functions to facilitate getting the data into a workable form first. read_nc2list reads the contents of a netCDF file into a list, with each item carrying the full set of netCDF attributes. Here, we want to grab the variable and only the time dimension (T), since we are dealing with a time series at a single site.
mysite_list <- read_nc2list(mysite_file, dims = 'T')
str(mysite_list)
## List of 2
## $ T : atomic [1:831] -144 -142 -142 -140 -140 ...
## ..- attr(*, "calendar")= chr "360"
## ..- attr(*, "pointwidth")= num 1
## ..- attr(*, "standard_name")= chr "time"
## ..- attr(*, "expires")= int 1493596800
## ..- attr(*, "gridtype")= int 0
## ..- attr(*, "units")= chr "months since 1960-01-01"
## $ temp: atomic [1:831] 263 264 269 278 282 ...
## ..- attr(*, "gribfield")= int 1
## ..- attr(*, "units")= chr "Kelvin_scale"
## ..- attr(*, "standard_name")= chr "air_temperature"
## ..- attr(*, "long_name")= chr "Air Temperature"
## ..- attr(*, "missing_value")= num 1e+21
## ..- attr(*, "expires")= int 1493596800
A list format is convenient because, when the dimensions align, it can easily be made into a data.frame (or, in this case, a tibble).
as_data_frame(mysite_list)
## # A tibble: 831 × 2
## T temp
## <dbl> <dbl>
## 1 -143.5 262.94
## 2 -142.5 264.19
## 3 -141.5 269.14
## 4 -140.5 278.23
## 5 -139.5 281.92
## 6 -138.5 287.79
## 7 -137.5 292.25
## 8 -136.5 291.93
## 9 -135.5 289.40
## 10 -134.5 281.50
## # ... with 821 more rows
Note that the time is in a weird format. You can extract the units from mysite_list via the object’s attributes.
attributes(mysite_list[['T']])
## $calendar
## [1] "360"
##
## $pointwidth
## [1] 1
##
## $standard_name
## [1] "time"
##
## $expires
## [1] 1493596800
##
## $gridtype
## [1] 0
##
## $units
## [1] "months since 1960-01-01"
attr(mysite_list[['T']], 'units')
## [1] "months since 1960-01-01"
Months since 1960 is a weird date format. Fortunately, riri provides a conversion function that uses some lubridate magic to convert this to a date, and optionally drops the original T variable.
mysite_list2 <- process_date(mysite_list)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
str(mysite_list2)
## List of 2
## $ temp: atomic [1:831] 263 264 269 278 282 ...
## ..- attr(*, "gribfield")= int 1
## ..- attr(*, "units")= chr "Kelvin_scale"
## ..- attr(*, "standard_name")= chr "air_temperature"
## ..- attr(*, "long_name")= chr "Air Temperature"
## ..- attr(*, "missing_value")= num 1e+21
## ..- attr(*, "expires")= int 1493596800
## $ date: Date[1:831], format: "1948-01-16" "1948-02-16" ...
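For intuition, the conversion that process_date performs can be approximated in base R. This sketch assumes the 360-day calendar reported in the file’s metadata (every month 30 days long, with T values falling at mid-month); it is an illustration, not riri’s actual code:

```r
# Convert "months since 1960-01-01" (360-day calendar) to a calendar date.
months_to_date <- function(t, origin_year = 1960) {
  whole <- floor(t)                    # completed months since the origin
  yr <- origin_year + whole %/% 12
  mo <- whole %% 12 + 1
  dy <- 1 + round((t - whole) * 30)    # fractional month -> day of month
  as.Date(sprintf("%04d-%02d-%02d", yr, mo, dy))
}

months_to_date(-143.5)  # the first T value in this series
## [1] "1948-01-16"
```

This reproduces the first few dates reported by process_date above.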
All of these operations can be conveniently wrapped into a single long pipeline, which in turn can be wrapped in a function for retrieving one kind of data. Let’s create a function for retrieving NCEP monthly temperature, called ncep_temp.
ncep_temp <- function(lat, lon) {
  base_url() %>%
    ncep_var('air_temperature') %>%
    filter_point(lat = lat, lon = lon) %>%
    generate_url() %>%
    retrieve_data() %>%
    read_nc2list(dims = 'T') %>%
    process_date() %>%
    as_data_frame()
}
site1_temp <- ncep_temp(43.63, -89.76)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
site1_temp
## # A tibble: 831 × 2
## temp date
## <dbl> <date>
## 1 261.63 1948-01-16
## 2 265.27 1948-02-16
## 3 271.86 1948-03-16
## 4 283.20 1948-04-16
## 5 285.82 1948-05-16
## 6 291.73 1948-06-16
## 7 295.31 1948-07-16
## 8 294.43 1948-08-16
## 9 290.98 1948-09-16
## 10 281.76 1948-10-16
## # ... with 821 more rows
site2_temp <- ncep_temp(71.27, -156.65)
## Added date column. First few dates are: 1948-01-16, 1948-02-16, 1948-03-16, 1948-04-16, 1948-05-16, 1948-06-16
site2_temp
## # A tibble: 831 × 2
## temp date
## <dbl> <date>
## 1 245.66 1948-01-16
## 2 247.52 1948-02-16
## 3 245.28 1948-03-16
## 4 256.32 1948-04-16
## 5 262.90 1948-05-16
## 6 272.46 1948-06-16
## 7 278.06 1948-07-16
## 8 275.51 1948-08-16
## 9 269.34 1948-09-16
## 10 261.68 1948-10-16
## # ... with 821 more rows
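Each call to ncep_temp performs a network download, which can fail partway through a batch. Before mapping over many sites, it can help to wrap the fetch in an error handler. A base-R sketch (safely_fetch is a hypothetical helper, demonstrated here with a stand-in function rather than a real download):

```r
# Return NULL instead of aborting when a single fetch fails.
safely_fetch <- function(fetch, lat, lon) {
  tryCatch(fetch(lat, lon), error = function(e) {
    message("Fetch failed for (", lat, ", ", lon, "): ", conditionMessage(e))
    NULL
  })
}

# Stand-in that always fails, to show the behavior without a download:
failing_fetch <- function(lat, lon) stop("no network")
is.null(safely_fetch(failing_fetch, 45.67, -85.553))
## [1] TRUE
```

A NULL entry can then be detected and retried later, rather than losing an entire batch to one bad request.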
We can use some tidyverse magic to efficiently extend this to a bunch of sites. First, create a tibble containing site information.
coords_df <- tribble(
  ~site_id, ~lat, ~lon,
  1, 43.63330, -89.75830,
  2, 71.27576, -156.64139,
  3, -2.85000, -54.96667,
  4, 33.52174, -116.15635,
  5, 32.81471, -115.44200,
  6, 29.70000, -82.16670,
  7, 42.49500, -71.79810,
  8, 42.53100, -72.19000,
  9, 45.22220, -68.73560,
  10, 47.47000, -0.56000
)
coords_df
## # A tibble: 10 × 3
## site_id lat lon
## <dbl> <dbl> <dbl>
## 1 1 43.63330 -89.75830
## 2 2 71.27576 -156.64139
## 3 3 -2.85000 -54.96667
## 4 4 33.52174 -116.15635
## 5 5 32.81471 -115.44200
## 6 6 29.70000 -82.16670
## 7 7 42.49500 -71.79810
## 8 8 42.53100 -72.19000
## 9 9 45.22220 -68.73560
## 10 10 47.47000 -0.56000
Then, use the purrr::map2 function to apply our new ncep_temp function to each site’s latitude and longitude.
temp_data <- coords_df %>%
  mutate(temp_dat = map2(lat, lon, ncep_temp))
temp_data
## # A tibble: 10 × 4
## site_id lat lon temp_dat
## <dbl> <dbl> <dbl> <list>
## 1 1 43.63330 -89.75830 <tibble [831 × 2]>
## 2 2 71.27576 -156.64139 <tibble [831 × 2]>
## 3 3 -2.85000 -54.96667 <tibble [831 × 2]>
## 4 4 33.52174 -116.15635 <tibble [831 × 2]>
## 5 5 32.81471 -115.44200 <tibble [831 × 2]>
## 6 6 29.70000 -82.16670 <tibble [831 × 2]>
## 7 7 42.49500 -71.79810 <tibble [831 × 2]>
## 8 8 42.53100 -72.19000 <tibble [831 × 2]>
## 9 9 45.22220 -68.73560 <tibble [831 × 2]>
## 10 10 47.47000 -0.56000 <tibble [831 × 2]>
Note that the temperature data is stored as a nested tibble. Read more about nested tibbles in the excellent book R for Data Science by Garrett Grolemund and Hadley Wickham.
Now, let’s unnest this data and plot it.
temp_data_unnested <- unnest(temp_data)
ggplot(temp_data_unnested) +
  aes(x = date, y = temp) +
  geom_line(color = 'grey60') +
  geom_smooth() +
  facet_wrap(~site_id, scales = 'free_y')
## `geom_smooth()` using method = 'loess'
It may be more instructive to look at some summary statistics. To make the plot easier to read, let’s filter it down to just three of the sites.
summaries <- temp_data_unnested %>%
  mutate(year = lubridate::year(date)) %>%
  filter(year < 2017) %>%
  group_by(year, site_id) %>%
  summarize_at(vars(temp), funs(mean, min, max))
summaries
## Source: local data frame [690 x 5]
## Groups: year [?]
##
## year site_id mean min max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1948 1 280.4892 261.63 295.31
## 2 1948 2 259.0500 242.52 278.06
## 3 1948 3 298.8500 298.23 299.76
## 4 1948 4 289.2542 280.16 298.76
## 5 1948 5 294.4108 283.70 305.17
## 6 1948 6 293.9608 283.88 299.54
## 7 1948 7 281.7508 266.06 294.78
## 8 1948 8 280.6858 263.90 294.06
## 9 1948 9 278.8225 262.35 293.09
## 10 1948 10 284.6550 278.65 290.59
## # ... with 680 more rows
ggplot(summaries %>%
         gather(stat, value, mean, min, max) %>%
         filter(site_id %in% c(3, 6, 10))) +
  aes(x = year, y = value) +
  geom_line() +
  geom_smooth() +
  facet_wrap(~site_id + stat, scales = 'free', ncol = 3)
## `geom_smooth()` using method = 'loess'