Scholar Data

Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”

This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for ML models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma separated value (csv) files in this data package and descriptions for each.

The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can be different branch-to-branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also be different branch-to-branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.

There is also one hidden directory, “.github/workflows”. This hidden directory contains information for how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
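As a rough illustration of the feature permutation importance idea mentioned in the description, the following minimal sketch shuffles one input column at a time and records how much a model's error grows. It is not the authors' implementation: the function names, the toy model, and the synthetic data are all hypothetical, and the archived workflow applies the technique to a stacked ensemble rather than to a fixed formula.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[Row], float],
    X: List[Row],
    y: Sequence[float],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, breaking its link to the target.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row: Row) -> float:
    """Hypothetical stand-in for a trained model: strong dependence on
    feature 0, weak dependence on feature 1, none on feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


# Synthetic data (an assumption for illustration, not WHONDRS data).
data_rng = random.Random(42)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("features ranked by importance:", ranking)
```

In an iterative selection loop, one plausible next step would be to drop the lowest-ranked features, retrain, and repeat; the exact stopping rule used for the archived models is documented in the repository rather than here.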

Authors

Gary, Stefan ;
Scheibe, Timothy D. ;
Rexer, Em ;
Wilde, Michael ;
Vidal Torreira, Alvaro ;
Garayburu-Caruso, Vanessa A. ;
Goldman, Amy E. ;
Stegen, James C.

2 Citations | 0 Mentions | 15% FAIR | 0.8 Dataset Index

10.15485/2318723 | January 2024

Larval dispersal histogram data used for ATLAS deliverable D1.6: Biologically realistic Lagrangian dispersal and connectivity

Larval dispersal histogram data for ATLAS deliverable D1.6 "Biologically realistic Lagrangian connectivity" (https://www.eu-atlas.org/resources/atlas-partners-document-area/atlas-deliverables/455-d1-6-biologically-realistic-lagrangian-connectivity/file). Tar archive files are ordered by ATLAS case study source region, with folders by larval behaviour type. The numbered behaviour types are described in deliverable D1.6.

Each netCDF histogram file (e.g., hists_age_21.nc) contains the histogram for larvae of a single age; the two-digit number in the file name counts age in 5-day steps, from 00 (0 days) to 37 (185 days). Within each histogram file, particle counts in each Viking20 model grid cell are contained in a 4-d array with dimensions (launch month, launch year, model gridsquare y index, model gridsquare x index). The Viking20 grid in the North Atlantic is the ORCA tripolar grid. Details of the model mesh are in the included file viking20_mesh_mask.tgz.

Histograms are in netCDF files:

============================
$ ncdump -h hists_age_00.nc
netcdf hists_age_00 {
dimensions:
    coordinate = 4 ;
    coordinate_1 = 50 ;
    coordinate_2 = 1719 ;
    coordinate_3 = 1784 ;
variables:
    int64 coordinate(coordinate) ;
        coordinate:units = "month" ;
        coordinate:long_name = "Launch month" ;
    int64 coordinate_1(coordinate_1) ;
        coordinate_1:units = "year" ;
        coordinate_1:long_name = "Launch year" ;
    int64 coordinate_2(coordinate_2) ;
        coordinate_2:units = "index" ;
        coordinate_2:long_name = "J index" ;
    int64 coordinate_3(coordinate_3) ;
        coordinate_3:units = "index" ;
        coordinate_3:long_name = "I index" ;
    int64 data(coordinate, coordinate_1, coordinate_2, coordinate_3) ;
        data:long_name = "particle count" ;

// global attributes:
        :Conventions = "CF-1.6" ;
}
============================
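The file naming and array layout described above can be sketched with a few helper functions. This is only an illustration under stated assumptions: the function and constant names are hypothetical, the 5-day step and the 00 to 37 index range are taken from the description, and the array shape is copied from the ncdump header for hists_age_00.nc.

```python
import math

AGE_STEP_DAYS = 5    # each file index advances larval age by 5 days
MAX_AGE_INDEX = 37   # last file index, i.e. 185 days

# (launch month, launch year, J index, I index), from the ncdump header.
DATA_SHAPE = (4, 50, 1719, 1784)


def hist_filename(age_days: int) -> str:
    """Name of the histogram file holding larvae of the given age in days."""
    if age_days % AGE_STEP_DAYS != 0:
        raise ValueError("age must be a multiple of 5 days")
    index = age_days // AGE_STEP_DAYS
    if not 0 <= index <= MAX_AGE_INDEX:
        raise ValueError("age must lie between 0 and 185 days")
    return f"hists_age_{index:02d}.nc"


def age_days_from_filename(name: str) -> int:
    """Larval age in days encoded in a name such as 'hists_age_21.nc'."""
    stem = name.rsplit(".", 1)[0]            # drop the '.nc' extension
    index = int(stem.rsplit("_", 1)[1])      # trailing two-digit index
    return index * AGE_STEP_DAYS


def uncompressed_bytes(shape=DATA_SHAPE, bytes_per_value: int = 8) -> int:
    """Size of the int64 'data' array if loaded fully into memory."""
    return math.prod(shape) * bytes_per_value


print(hist_filename(105))
print(age_days_from_filename("hists_age_37.nc"))
print(f"{uncompressed_bytes() / 1e9:.1f} GB per fully loaded array")
```

Because a fully loaded array is large, it may be preferable to read one launch month or launch year at a time with whichever netCDF reader is in use.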

Authors

Fox, Alan D. ;
Gary, Stefan F.

2 Citations | 0 Mentions | 73% FAIR | 2.6 Dataset Index

10.5281/zenodo.3548344 | November 2019

Larval dispersal histogram data used for ATLAS deliverable D1.6: Biologically realistic Lagrangian dispersal and connectivity

Authors

Fox, Alan D. ;
Gary, Stefan F.

0 Citations | 0 Mentions | 73% FAIR | 1.8 Dataset Index

10.5281/zenodo.3548343 | November 2019

Full depth ocean properties in the eastern subpolar North Atlantic, Cruise DY052, Extended Ellett Line, 2016, link to raw data in NetCDF format

The Extended Ellett Line is a hydrographic section between Iceland and Scotland that is occupied annually by scientists from the National Oceanography Centre (NOC) and the Scottish Association for Marine Science (SAMS), UK. The measurement programme began as a seasonally-occupied hydrographic section in the Rockall Trough in 1975, building on early surface observations made underway from ocean weather ships. In 1996 the section was extended to Iceland, sampling three basins: the Rockall Trough, the Hatton-Rockall Basin and the Iceland Basin. These three basins form the main routes through which warm saline Atlantic water flows northwards into the Nordic Seas and Arctic Ocean.

The section crosses the eastern North Atlantic subpolar gyre; in addition to the net northward flow, there is a large recirculation of the upper layers as part of the wind-driven gyre. During its passage through the region, the warm saline water is subjected to significant modification by exchange of heat and freshwater with the atmosphere. The two deep basins (Rockall Trough and Iceland Basin) contain southward flowing dense northern overflow waters, and Labrador Sea Water in the intermediate layers.

The specific objectives of the 2016 Extended Ellett Line cruise are:
- To complete the annual Extended Ellett Line CTD section;
- To collect water samples for measuring biogeochemical properties including dissolved oxygen, nutrients, carbon, and trace metals;
- To collect underway measurements of surface currents, surface temperature and salinity, bathymetry, and surface meteorology;
- To complete epibenthic sled tows at a deep location in the central Rockall Trough;
- To capture water column and sea floor video with a downward-looking camera attached to the CTD;
- To listen for whales and dolphins with a towed hydrophone; and
- To deploy Argo floats provided by the UK Met Office as a contribution to the International Argo Project.

Authors

Gary, Stefan F

0 Citations | 0 Mentions | 92% FAIR | 2.0 Dataset Index

10.1594/pangaea.881182 | January 2017

Automated Author Profile
Gary, Stefan F
0000-0003-3525-5786
