Description of RDA Dataset Collection Curation Levels
A routinely updated inventory of all RDA datasets, including each dataset's curation level assignment, can be accessed on the RDA Documentation Page. Please refer to the ingest-to-dissemination workflow description for additional details on RDA dataset curation levels.
1. Basic Curation
- All dataset collections maintained in the RDA must adhere to the Basic Curation standard. Requirements for Basic Curation include:
- Required metadata fields are completed to describe dataset collections. A list of required metadata fields can be found in the metadata manager documentation.
- MD5 checksums are computed on data archive files stored on RDA Dataset Collection Disk and the Quasar Tape Backup system.
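The checksum requirement above can be illustrated with a minimal sketch, assuming Python's standard `hashlib` module. The function names `md5_of_file` and `copies_match` are hypothetical and are not part of any RDA tooling; the sketch only shows how an MD5 digest might be computed for a large archive file and compared between a disk copy and a backup copy.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archive files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(disk_path, backup_path):
    """Compare the checksums of two copies of the same archive file."""
    return md5_of_file(disk_path) == md5_of_file(backup_path)
```

Storing the digest alongside each file allows later copies to be verified without access to the original.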
2. Enhanced Curation
- Selected dataset collections require enhanced curation. Enhanced curation is determined on a case-by-case basis by the RDA staff. The goal of enhanced curation is to better support end research use cases, long-term curation, and accessibility. Selected use cases include:
- Native data structure and format do not align with broad research use cases.
- Climate research use case: native model data are often structured in files of time-slice snapshots that include all output parameters.
- Model output files are restructured from time-slice snapshots of all parameters to time-series structures organized by parameter, which better supports climate research by significantly reducing the amount of data accessed to examine a long-term trend, e.g. air temperature over 40 years.
- File format and associated metadata are translated from GRIB 2 to CF-NetCDF, which is more broadly used in the climate research community. CF-NetCDF also better supports long-term data curation since it is a self-describing format.
- A reference copy of the native data is maintained offline to assure reproducibility and validation of translation, if necessary. A copy of this data can be provided to users upon request.
- Native data structure and format do not align with community-supported formats or conventions.
- Observational data use case: the native data are often provided in a proprietary ASCII data file format, which is not compatible with community-supported data analysis tools.
- File format and associated metadata are converted from proprietary ASCII to CF-compliant NetCDF so that community data analysis tools can read the data. CF-NetCDF also better supports long-term data curation since it is a self-describing format.
- A reference copy of the native data is maintained offline to assure reproducibility and validation of translation, if necessary. A copy of this data can be provided to users upon request.
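The time-slice to time-series restructuring described above can be sketched in a few lines. This is a simplified illustration using plain Python structures rather than GRIB or NetCDF files; the function name `to_time_series`, the record layout, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def to_time_series(time_slices):
    """Regroup time-slice records into one time-ordered series per parameter.

    Each input record is (timestamp, {parameter_name: value}).
    """
    series = defaultdict(list)
    # Sort on the timestamp only, so snapshots may arrive in any order.
    for timestamp, fields in sorted(time_slices, key=lambda rec: rec[0]):
        for name, value in fields.items():
            series[name].append((timestamp, value))
    return dict(series)


# Hypothetical sample: three snapshots, each holding every parameter.
slices = [
    ("2000-01-01T06", {"air_temperature": 271.9, "surface_pressure": 101250.0}),
    ("2000-01-01T00", {"air_temperature": 272.4, "surface_pressure": 101300.0}),
    ("2000-01-01T12", {"air_temperature": 273.1, "surface_pressure": 101190.0}),
]
by_parameter = to_time_series(slices)

# A trend study now reads one series instead of every snapshot.
temperature_trend = by_parameter["air_temperature"]
```

Once the data are organized by parameter, examining a long-term trend in one variable no longer requires reading the other parameters stored in each snapshot.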
3. Data-level Curation
- Selected dataset collections require data-level curation. Data-level curation is determined on a case-by-case basis by the RDA staff. The goal of data-level curation is to fix problems discovered in data or metadata, and to better support end research use cases, long-term curation, and accessibility. Selected use cases include:
- Native data are stored on unique grid types that are difficult for the broader research community to work with.
- Reanalysis and operational model data use case: native data are organized in spectral space and on reduced Gaussian grids. This presents computational challenges to a large number of users, as most users are only familiar with data structured on regular latitude/longitude grids.
- Data are interpolated into regular latitude/longitude space and stored in CF-compliant NetCDF to better support ease of use, community tool access, and long-term curation. All processing steps and components are described in the attribute fields provided by the NetCDF format to support provenance.
- A reference copy of the native data is maintained offline to assure reproducibility and validation of translation, if necessary. A copy of this data can be provided to users upon request.
- Documentation describing all data processing components is archived with the dataset collection.
- A systematic problem is detected in metadata or data files for a dataset collection.
- Reanalysis use case: RDA data ingest processing checks determine that the native data files contain incorrect descriptive metadata.
- The provider is notified of the systematic metadata issue.
- Metadata is corrected by RDA staff for all impacted data files.
- The corrected version of the data files is published to archive and made available for public access.
- A reference copy of the native data is maintained offline to assure reproducibility and validation of translation, if necessary. A copy of this data can be provided to users upon request.
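As a rough illustration of the grid interpolation use case above, the sketch below regrids the rows of a reduced grid (which has fewer longitude points near the poles) onto a common set of longitudes using periodic linear interpolation, and records the processing step as attributes. It is a simplified assumption-laden example, not the RDA processing chain: it handles the longitude direction only, it does not cover spectral transforms or remapping of Gaussian latitudes, and the function names, sample values, and the `interpolation_method` attribute name are hypothetical.

```python
def regrid_row(values, n_out):
    """Linearly interpolate one latitude row onto n_out equally spaced longitudes.

    Input points are assumed equally spaced in longitude starting at 0 degrees,
    and the row is treated as periodic (it wraps at 360 degrees).
    """
    n_in = len(values)
    out = []
    for j in range(n_out):
        pos = j * n_in / n_out       # fractional index into the input row
        left = int(pos) % n_in
        right = (left + 1) % n_in    # wraps around for periodicity
        weight = pos - int(pos)
        out.append((1.0 - weight) * values[left] + weight * values[right])
    return out


def regrid_reduced_rows(rows, n_lon):
    """Regrid every latitude row of a reduced grid to n_lon longitudes."""
    return [regrid_row(row, n_lon) for row in rows]


# Hypothetical reduced grid: fewer longitude points near the poles.
reduced = [
    [1.0, 3.0],               # 2 points near one pole
    [0.0, 10.0, 20.0, 10.0],  # 4 points nearer the equator
    [5.0, 7.0],               # 2 points near the other pole
]
regular = regrid_reduced_rows(reduced, 4)

# Provenance kept as attributes; "interpolation_method" is an illustrative name.
provenance = {
    "history": "interpolated from reduced grid rows to 4 regular longitudes",
    "interpolation_method": "linear, periodic in longitude",
}
```

After regridding, every row has the same number of longitude points, so the field can be stored as a rectangular array, with the processing steps carried in the file attributes.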