earthdatalogin seeks to streamline access to NASA data from the EarthData
cloud program from anywhere. Because Amazon Web Services (AWS) typically
charges egress fees whenever network traffic leaves the data center that
hosts it, NASA restricts S3-protocol access to its Amazon-hosted data to
AWS servers running in the same data center (us-west-2). However, NASA
also makes this cloud data publicly available to any machine over a
standard HTTPS connection. Both cases require requesting and managing
credentials tied to a registered user name. This package merely makes
that process easier.
earthdatalogin is now on CRAN and can be installed with
install.packages("earthdatalogin")
Or you can install the development version of earthdatalogin from GitHub:
# install.packages("devtools")
devtools::install_github("boettiger-lab/earthdatalogin")
Access to EarthData is free, but users are required to register. Currently,
earthdatalogin provides its own default credentials for a quick start.
Users are still encouraged to register their own credentials!
library(earthdatalogin)
The easiest and most portable method of access is the netrc basic
authentication protocol for HTTP. Call edl_netrc() to set this up given
your username and password (passed as optional arguments or read from
environment variables). If neither provides credentials, earthdatalogin
will supply its own, but you may hit rate limits more readily:
edl_netrc()
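To use your own registered credentials rather than the package defaults, one option is to supply them through environment variables before calling edl_netrc(). The sketch below assumes the variable names EARTHDATA_USER and EARTHDATA_PASSWORD and the argument names username and password; check ?edl_netrc for what your installed version expects. The credential values are placeholders.

```r
library(earthdatalogin)

# Placeholder credentials: replace with your registered EarthData account.
# In practice, set these in ~/.Renviron rather than in a script.
# (Variable names are an assumption; see ?edl_netrc.)
Sys.setenv(
  EARTHDATA_USER = "my_username",
  EARTHDATA_PASSWORD = "my_password"
)

# With the variables set, no arguments are needed.
edl_netrc()

# Alternatively, pass the credentials explicitly (argument names assumed).
edl_netrc(
  username = Sys.getenv("EARTHDATA_USER"),
  password = Sys.getenv("EARTHDATA_PASSWORD")
)
```

Keeping credentials in environment variables rather than in the script itself means the same code can be shared or committed without exposing a password.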
If edl_netrc() has been called, most existing spatial data packages in R
can seamlessly access NASA EarthData over HTTP links.
url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"
terra::rast(url, vsi=TRUE)
#> class : SpatRaster
#> dimensions : 3660, 3660, 1 (nrow, ncol, nlyr)
#> resolution : 30, 30 (x, y)
#> extent : 199980, 309780, 7190200, 7300000 (xmin, xmax, ymin, ymax)
#> coord. ref. : WGS 84 / UTM zone 56N (EPSG:32656)
#> source : HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif
#> name : HLS.L30.T56JKT.2023246T235950.v2.0.SAA
Note that no special earthdatalogin functions are needed in the rest of
our code. This is important because it lets the user take advantage of
any existing R packages or tutorials without modification, as if there
were no authentication barrier to NASA EarthData in the first place. If
we had not called edl_netrc() for authentication, this would throw an
error that the file does not exist. This call needs to be made only once
per session (i.e., at the start of a script).
Most R spatial packages (terra, sf, stars, and others) access spatial
data through an underlying C++ library called GDAL [1]. GDAL is also used
under the hood of many other spatial tools, from Python (geopandas,
rasterio, and others) to QGIS and Google Earth Engine. earthdatalogin
sets a collection of config files and environment variables that GDAL
uses to find authentication credentials. Crucially, netrc-based
authentication works just as well from a laptop as from inside AWS
compute in us-west-2, such as the popular Openscapes 2i2c hub. This
portability does not hold for the other mechanisms: S3-based login, in
the case of NASA EarthData, only works from inside AWS-based compute,
while the bearer token mechanism only works from outside AWS-based
compute. The earthdatalogin package does provide functions for these
other authentication mechanisms (see edl_s3_token() and
edl_set_token()), but discourages their use because they are less
portable and offer no performance advantage [2].
edl_set_token() takes care of managing tokens for you. If you don’t
have any tokens, it will request that one be minted. If your user name
already has tokens, it will look them up and reuse them. (EDL will issue
at most two tokens per user, and tokens expire after 6 months, but users
shouldn’t need to worry about this since edl_set_token() handles it.)
The function also sets the token as a GDAL environment variable. This
means that popular R packages such as terra, sf, or stars, which all
involve bindings to GDAL, can automatically use this token for any
operation reading over HTTPS (using the vsicurl prefix).
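As a rough illustration of the token workflow described above, the sketch below calls edl_set_token() with no arguments and then opens a remote file through GDAL. It assumes that the function falls back to the same default credentials as edl_netrc() when none are supplied (see ?edl_set_token), and it reuses the example URL from earlier in this document.

```r
library(earthdatalogin)

# Request (or look up) a bearer token and register it for GDAL.
# Calling with no arguments relies on default credentials (an assumption).
edl_set_token()

url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"

# vsi = TRUE asks terra to read over the network through GDAL's vsicurl
# handler instead of downloading the whole file first.
r <- terra::rast(url, vsi = TRUE)
```

Because the bearer token mechanism only works from outside AWS-based compute, edl_netrc() remains the more portable choice for scripts that may run in either setting.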
Because NASA EarthData is also the first introduction to cloud-hosted
data for many researchers, NASA's practice of minimizing egress charges
by restricting S3 access to requests from the AWS us-west-2 compute
center sometimes gives the false impression that accessing data “from
the cloud” also requires paying Amazon Web Services for compute time.
This impression is mistaken. For instance, NOAA also provides an
extensive set of regularly updated data products on AWS without this
restriction, which can be accessed over a standard HTTPS connection or
using the S3 protocol as an anonymous user (with no keys or tokens). To
maximize performance, heavy users of NOAA data will frequently choose to
access that data from AWS compute in the same region (mostly us-east-1
for NOAA), but this is not required. Technically speaking, the
vernacular phrase “accessing cloud data” usually refers to network-based
access of data using HTTP range requests: the ability to ask a web
server to transfer some range of bytes from an individual file rather
than transferring the entire file across the network.
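To make the idea of a range request concrete, here is a minimal sketch using the curl package. The helper names range_header() and fetch_bytes() are hypothetical and not part of earthdatalogin. The sketch assumes a publicly readable URL on a server that honors range requests; protected EarthData URLs additionally need the authentication settings that GDAL picks up after edl_netrc().

```r
# Build the value of an HTTP Range header for `n` bytes starting at `start`.
# Byte offsets are zero-based and the end offset is inclusive.
range_header <- function(start, n) {
  stopifnot(start >= 0, n > 0)
  sprintf("bytes=%.0f-%.0f", start, start + n - 1)
}

# Fetch only the requested bytes of a remote file (hypothetical helper).
# Requires the curl package.
fetch_bytes <- function(url, start, n) {
  h <- curl::new_handle()
  curl::handle_setheaders(h, Range = range_header(start, n))
  resp <- curl::curl_fetch_memory(url, handle = h)
  # 206 Partial Content means the server returned just the requested range.
  if (resp$status_code != 206) {
    stop("Server did not honor the range request (status ",
         resp$status_code, ")")
  }
  resp$content
}

range_header(0, 1024)
#> [1] "bytes=0-1023"
```

Libraries such as GDAL issue requests of this kind automatically, so in everyday use there is no need to construct them by hand.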
Note that we can now successfully access this file over HTTPS from
any machine with an internet connection, and with no further
authentication steps. That URL could have been obtained in a variety of
ways, including https://search.earthdata.nasa.gov/search, searching
individual DAACs, or programmatically using the EarthData STAC catalog.
The point here is that, despite the EarthData login barrier, the R code
required for cloud-native access now matches the standard strategies we
would use for cloud-native access of any other data source.
A key feature that makes ‘cloud native’ access fast is that this access is lazy. Although these individual files can be quite large, our request has not downloaded the entire file. Instead, it has used its knowledge of spatial data formats to read just those bytes of the file that provide critical metadata such as extent, projection, bands, and coordinate ranges. Using that information, we can extract just the bits of data (locations, variables) we care about without having to download everything else. This saves RAM and disk space on our computer and speeds up downloads. Without these techniques, processing the massive amounts of data coming from modern earth observation methods would be impractical.
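As a sketch of what lazy access looks like in practice, the example below opens the same file used earlier and crops a small window from it. The window coordinates are illustrative values chosen to fall inside the extent printed above; they are not from the original text.

```r
library(earthdatalogin)
library(terra)

edl_netrc()  # needed only once per session

url <- "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T56JKT.2023246T235950.v2.0/HLS.L30.T56JKT.2023246T235950.v2.0.SAA.tif"

# Opening the raster reads metadata only, not the pixel values.
r <- rast(url, vsi = TRUE)

# A 3 km x 3 km window in the raster's own projected coordinates (metres),
# aligned to the top-left corner of the extent. Illustrative values.
window <- ext(199980, 202980, 7297000, 7300000)

# crop() should only need to fetch the parts of the file that overlap
# the window, rather than the full 3660 x 3660 grid.
r_small <- crop(r, window)
dim(r_small)
```

The same pattern applies to larger workflows: define the region and variables of interest first, and let GDAL fetch only what those operations require.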
However, not all data formats are equally amenable to this
approach. Requesting a few bytes from a file across hundreds of miles of
network connection is not the same thing as requesting a few bytes
across the six inches of PCI connection between your processor and your
hard drive. More recent formats like “Cloud Optimized GeoTIFF” (COG)
files are, as the name suggests, optimized for this. Complex older
formats like HDF5 or GRIB are much less efficient. Network-based range
requests are not possible on some older (but still widely used) formats,
like HDF4. In this last case, we will need to download the file to a
local disk (a POSIX filesystem, not a hyperscale Object Store) to read
it. Use edl_download() to handle authenticated downloads in this case.
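A minimal sketch of the download fallback follows. The URL is a hypothetical placeholder rather than a real granule, and the dest argument name is an assumption; consult ?edl_download for the exact signature.

```r
library(earthdatalogin)

# Hypothetical placeholder URL for an HDF4 granule; substitute a real link.
href <- "https://example.com/path/to/granule.hdf"

# Download to a local file using the stored credentials.
# (The `dest` argument name is assumed; see ?edl_download.)
local_file <- file.path(tempdir(), basename(href))
edl_download(href, dest = local_file)

# Once the file is on local disk, it can be opened with the usual tools.
r <- terra::rast(local_file)
```

Writing to tempdir() keeps the example tidy; for repeated use, a persistent cache directory avoids downloading the same file again.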
To facilitate cloud-native access of NASA EarthData from R, this
package also includes a series of vignettes illustrating the use of some
popular R packages in their (often less well-known) cloud-native modes.
In each of these vignettes, we will take care to leverage “lazy
evaluation” to avoid downloading more than we have to. With the
exception of the vignette on S3 access from within us-west-2, these
vignettes should run almost anywhere, but will be most effective on
machines with fast network access. Many university networks, and any
cloud-hosted platform, such as GitHub Codespaces, offer excellent
network performance for this purpose.
[1] Some R users have heard that the rgdal package is being deprecated.
Don’t confuse this with the GDAL C++ library being deprecated: rgdal was
only one of many R packages that used the GDAL C++ libraries, and was
deprecated in favor of the same bindings to GDAL being available in sf.
[2] NASA’s own documentation often points users to the S3-based access protocol when working on AWS compute. Note that the S3 tokens NASA’s AWS setup provides expire every hour, and are specific to each DAAC, making it very difficult for users to work across data products from multiple DAACs in a single workflow. Note also that this authentication mechanism provides essentially no advantage in terms of speed or cost to the user or to NASA, especially for large data.