Integrating NCAR’s data infrastructure with the OSDF

Project Website

Project Overview

Data-intensive research, including data analytics, machine learning,
and data assimilation, continues to drive innovation and discovery
across the geosciences. An obstacle to scientific discovery is that
critical research datasets are distributed and stored across many
disparate locations, making it challenging for researchers to
access data outside of their home environment and investigate
cross-disciplinary relationships such as those explored at NCAR and NEON.
The Open Science Data Federation (OSDF,
https://osg-htc.org/services/osdf.html) is working to overcome this
challenge by providing a unified view of datasets stored across
autonomous facilities, integrated with the high-throughput
computational resources of the Open Science Pool (OSPool,
https://osg-htc.org/services/open_science_pool.html). We propose to
integrate NCAR’s curated research data collections with the OSDF by
acquiring, operating, and maintaining OSDF Origin and OSDF Cache nodes
and by providing research, consulting, community engagement, and
training services to: 1) broaden community access to NCAR’s
model-generated (climate projections and historical reanalysis) and
observing-facility-produced datasets on NSF national
cyberinfrastructure resources; 2) explore, develop, and publish example
workflows that leverage OSDF/OSPool resources to support investigation
of reference research use cases and identify future needs in the
OSDF/OSPool infrastructure; and 3) engage and train researchers on how
their research workflows can leverage the capabilities of the OSDF,
including how they can develop and run workflows on OSPool resources
and share their personal datasets through the OSDF for reuse by others.

Example research use cases

We have developed the following example research use cases and documented them on GitHub.

  • Access CESM2 LENS data from the AWS OpenData origin and the NCAR data origin to:
    • a) Bias-correct surface temperature using ERA5 reanalysis.
    • b) Compute surface ocean heat content.
  • Access NOAA SONAR data from the AWS origin to plot echograms
  • Compute the climatological average of daily temperature data using the geocat-comp package
  • Run the temperature bias-correction workflow on various compute platforms, such as
    • a) Texas Advanced Computing Center's Stampede3
    • b) NCAR's Casper
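
To make the climatological-average use case in the list above concrete, here is a minimal sketch of the underlying arithmetic. It deliberately does not call geocat-comp or read any OSDF data: it uses only the Python standard library, assumes the climatology is grouped by calendar month, and runs on synthetic values invented for illustration. The function name `monthly_climatology` is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import fmean


def monthly_climatology(daily):
    """Average daily values by calendar month across all years.

    `daily` is an iterable of (date, value) pairs.
    Returns a dict mapping month number (1-12) to the mean value.
    """
    by_month = defaultdict(list)
    for day, value in daily:
        by_month[day.month].append(value)
    return {month: fmean(values) for month, values in sorted(by_month.items())}


# Synthetic data (not real observations): two non-leap years in which
# January days are 1.0 in 2001 and 3.0 in 2002, and all other days are 10.0.
start = date(2001, 1, 1)
series = []
for offset in range(730):
    day = start + timedelta(days=offset)
    if day.month == 1:
        value = 1.0 if day.year == 2001 else 3.0
    else:
        value = 10.0
    series.append((day, value))

clim = monthly_climatology(series)
print(clim[1], clim[7])  # prints: 2.0 10.0
```

In the documented workflows the same grouping would presumably be applied to gridded temperature arrays rather than a list of scalar values.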

These use cases demonstrate the ingestion of data from both the NCAR
origin and the AWS OpenData origin. They are available at:
https://github.com/NCAR/osdf_examples
Datasets from the NCAR origin are now accessible.
All NCAR datasets are accessible from NCAR's OSDF origin under:
https://osdf-director.osg-htc.org/ncar/rda/<dnnnnnn> where <dnnnnnn>
maps to a unique dataset identifier
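
As a small illustration of the URL pattern above, the following sketch builds the OSDF location for a given dataset identifier. The base URL comes from the text; the helper name `rda_dataset_url` is hypothetical, and the check that an identifier is a "d" followed by six digits is an assumption read off the <dnnnnnn> placeholder. The identifier used in the example is a made-up placeholder, not a real dataset.

```python
import re

# Base URL taken from the text above; everything else is illustrative.
OSDF_NCAR_RDA_BASE = "https://osdf-director.osg-htc.org/ncar/rda"

# Assumption: identifiers are "d" followed by six digits, as the
# <dnnnnnn> placeholder suggests.
_DATASET_ID = re.compile(r"d\d{6}")


def rda_dataset_url(dataset_id: str) -> str:
    """Return the OSDF URL prefix for one dataset (hypothetical helper)."""
    if not _DATASET_ID.fullmatch(dataset_id):
        raise ValueError(
            f"expected an identifier like d123456, got {dataset_id!r}"
        )
    return f"{OSDF_NCAR_RDA_BASE}/{dataset_id}"


# "d123456" is a placeholder identifier used only for this example.
print(rda_dataset_url("d123456"))
```

Validating the identifier before building the URL is a design choice for this sketch; it catches typos early, at the cost of rejecting any identifier that does not follow the assumed format.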

Future Plans

We plan to develop additional example research use cases and will
host a hackathon in summer 2025 as part of the Pythia Cook-off
series to generate community-developed example use cases that
leverage OSDF resources.