<?xml version='1.0' encoding='UTF-8'?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <idinfo>
    <citation>
      <citeinfo>
        <origin>J. William Lund</origin>
        <origin>Scott Hamshaw</origin>
        <pubdate>20251119</pubdate>
        <title>4 - Machine Learning application to predict and interpret streamflow event hysteresis in the Delaware and Illinois River Basins</title>
        <geoform>.csv, .ipynb, .yaml, .html</geoform>
        <pubinfo>
          <pubplace>St. Paul, MN</pubplace>
          <publish>U.S. Geological Survey</publish>
        </pubinfo>
        <onlink>https://doi.org/10.5066/P1492CKA</onlink>
      </citeinfo>
    </citation>
    <descript>
      <abstract>4_Hysteresis_ML is the fourth child item of this Data Release and includes annotated Python scripts to train and test multiclass XGBoost classification models using hysteresis index data from streamflow events at select sites in the Delaware and Illinois River Basins (DRB and IRB). Included are individual scripts and markdown (.html) output for each of the 5-, 7-, and 9-class Hysteresis Index - Concentration Index (HI-CI) methods, output files that contain the final features used after recursive feature elimination and the final model tuning parameters, and an environment.yaml file. Additional information can be found in the "Entity and Attribute" section.</abstract>
      <purpose>This research aims to improve understanding of the complex sediment transport process. It can lead to more informed inferences about sediment dynamics, hydrology, source, and in-channel processes, as well as improved predictions of sediment transport.

Suspended sediment transport is critical to understanding future states of water quality and represents an important Integrated Water Science (IWS) basin effort. The Delaware River Basin (DRB) and Illinois River Basin (IRB) are two of the IWS basins, with a wide range of environmental, hydrologic, and landscape settings and human stressors on water resources. USGS stream gaging sites at 35 watersheds within these two basins were selected to evaluate storm events and the corresponding sediment and hydrodynamic observations.

Machine learning has shown the ability to learn complex non-linear interactions within data and was deployed to help improve our understanding of process drivers of sediment transport and improve future modeling prediction capabilities.</purpose>
    </descript>
    <timeperd>
      <timeinfo>
        <rngdates>
          <begdate>20070101</begdate>
          <enddate>20231231</enddate>
        </rngdates>
      </timeinfo>
      <current>ground condition</current>
    </timeperd>
    <status>
      <progress>Complete</progress>
      <update>None planned</update>
    </status>
    <spdom>
      <bounding>
        <westbc>-92.3730</westbc>
        <eastbc>-71.5430</eastbc>
        <northbc>44.6530</northbc>
        <southbc>36.3151</southbc>
      </bounding>
    </spdom>
    <keywords>
      <theme>
        <themekt>ISO 19115 Topic Category</themekt>
        <themekey>inlandWaters</themekey>
        <themekey>climatologyMeteorologyAtmosphere</themekey>
        <themekey>geoscientificInformation</themekey>
        <themekey>environment</themekey>
      </theme>
      <theme>
        <themekt>None</themekt>
        <themekey>Machine Learning</themekey>
        <themekey>Daymet</themekey>
        <themekey>Principal Component Analysis</themekey>
        <themekey>Linear Discriminant Analysis</themekey>
      </theme>
      <theme>
        <themekt>USGS Thesaurus</themekt>
        <themekey>sediment transport</themekey>
        <themekey>turbidity</themekey>
        <themekey>datasets</themekey>
        <themekey>streamflow</themekey>
      </theme>
      <theme>
        <themekt>USGS Metadata Identifier</themekt>
        <themekey>USGS:6827b54ad4be02693eeabdfc</themekey>
      </theme>
      <place>
        <placekt>Common geographic areas</placekt>
        <placekey>Illinois</placekey>
        <placekey>Indiana</placekey>
        <placekey>New York</placekey>
        <placekey>New Jersey</placekey>
        <placekey>Pennsylvania</placekey>
        <placekey>Delaware</placekey>
      </place>
    </keywords>
    <accconst>No access constraints. Please see 'Distribution Information' for details.</accconst>
    <useconst>These data are marked with a Creative Commons CC0 1.0 Universal License. These data are in the public domain and do not have any use constraints. Users are advised to read the dataset's metadata thoroughly to understand appropriate use and data limitations.</useconst>
    <ptcontac>
      <cntinfo>
        <cntperp>
          <cntper>John (William) Lund</cntper>
          <cntorg>USGS - MIDCONTINENT REGION</cntorg>
        </cntperp>
        <cntpos>Hydrologist</cntpos>
        <cntaddr>
          <addrtype>mailing and physical</addrtype>
          <address>1992 Folwell Ave</address>
          <address>Upper Midwest Water Science Center - University of Minnesota</address>
          <city>St. Paul</city>
          <state>MN</state>
          <postal>55108</postal>
        </cntaddr>
        <cntvoice>763-272-8690</cntvoice>
        <cntemail>jlund@usgs.gov</cntemail>
      </cntinfo>
    </ptcontac>
    <native>AWS SageMaker Distribution 2.2.1, using an ml.c6i.32xlarge instance.</native>
  </idinfo>
  <dataqual>
    <attracc>
      <attraccr>Verified that attribute values were within expected bounds.</attraccr>
    </attracc>
    <logic>Verified that data values were consistent with the value ranges of the source data.</logic>
    <complete>Data set is considered complete for the information presented, as described in the abstract. Users are advised to read the rest of the metadata record carefully for additional details.</complete>
    <posacc>
      <horizpa>
        <horizpar>A formal accuracy assessment of the horizontal positional information in the data set has not been conducted.</horizpar>
      </horizpa>
      <vertacc>
        <vertaccr>A formal accuracy assessment of the vertical positional information in the data set has either not been conducted, or is not applicable.</vertaccr>
      </vertacc>
    </posacc>
    <lineage>
      <procstep>
        <procdesc>Jupyter notebook Python scripts were run on AWS SageMaker Distribution 2.2.1, using an ml.c6i.32xlarge instance, with one script for each of the 5-, 7-, and 9-class Hysteresis Index - Concentration Index (HI-CI) methods. Each script loads the data file WQP_HI_EventSummary_Wiecz_Final.csv from child item 2_EventSeparation_HysteresisIndex, organizes the data for training a multiclass classification machine learning model using the xgboost and sklearn packages, and performs boosted recursive feature elimination using BoostRFE() and hyperparameter tuning using BoostSearch() from the shap-hypetune package. Model accuracy is assessed using a confusion matrix and classification report from the sklearn package, and the final model is interpreted using the SHAP package to produce plots of important features for each HI-CI class. Additional details are provided in the script comments and .html output.</procdesc>
        <procdate>20240930</procdate>
        <proccont>
          <cntinfo>
            <cntperp>
              <cntper>John (William) Lund</cntper>
              <cntorg>USGS - MIDCONTINENT REGION</cntorg>
            </cntperp>
            <cntpos>Hydrologist</cntpos>
            <cntaddr>
              <addrtype>mailing and physical</addrtype>
              <address>1992 Folwell Ave</address>
              <address>Upper Midwest Water Science Center - University of Minnesota</address>
              <city>St. Paul</city>
              <state>MN</state>
              <postal>55108</postal>
            </cntaddr>
            <cntvoice>763-272-8690</cntvoice>
            <cntemail>jlund@usgs.gov</cntemail>
          </cntinfo>
        </proccont>
      </procstep>
    </lineage>
  </dataqual>
  <eainfo>
    <detailed>
      <enttyp>
        <enttypl>4_Hysteresis_ML\environment.yaml</enttypl>
        <enttypd>Python environment file for [5,7,9]class\HI_[5,7,9]class_ML.ipynb scripts.</enttypd>
        <enttypds>U.S. Geological Survey</enttypds>
      </enttyp>
    </detailed>
    <detailed>
      <enttyp>
        <enttypl>4_Hysteresis_ML\[5,7,9]class\HI_[5,7,9]class_ML.ipynb</enttypl>
        <enttypd>Annotated Python scripts for each of the 5-, 7-, and 9-class HI-CI methods. Each script organizes data for training a multiclass classification machine learning model using the xgboost and sklearn packages, and outputs the final features used and the tuning parameters along with a markdown .html record of the script.</enttypd>
        <enttypds>U.S. Geological Survey</enttypds>
      </enttyp>
    </detailed>
    <detailed>
      <enttyp>
        <enttypl>4_Hysteresis_ML\[5,7,9]class\Output\HI_[5,7,9]class_ML_rfe_features.csv</enttypl>
        <enttypd>Comma-separated file containing the list of selected features from recursive feature elimination.</enttypd>
        <enttypds>U.S. Geological Survey</enttypds>
      </enttyp>
      <attr>
        <attrlabl>Selected Features</attrlabl>
        <attrdef>List of selected features from recursive feature elimination</attrdef>
        <attrdefs>U.S. Geological Survey</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
    </detailed>
    <detailed>
      <enttyp>
        <enttypl>4_Hysteresis_ML\[5,7,9]class\Output\HI_[5,7,9]class_ML_tune_params.csv</enttypl>
        <enttypd>Comma-separated file containing the final XGBoost tuning parameters from grid search.</enttypd>
        <enttypds>U.S. Geological Survey</enttypds>
      </enttyp>
      <attr>
        <attrlabl>colsample_bytree</attrlabl>
        <attrdef>The fraction of columns to be randomly sampled for each tree</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>learning_rate</attrlabl>
        <attrdef>Step size shrinkage used in update to prevent overfitting</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>max_depth</attrlabl>
        <attrdef>Maximum depth of tree</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>max_leaves</attrlabl>
        <attrdef>Maximum number of nodes to be added.</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>min_child_weight</attrlabl>
        <attrdef>Minimum sum of weights of all observations required in a child</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>subsample</attrlabl>
        <attrdef>The fraction of observations to be randomly sampled for each tree</attrdef>
        <attrdefs>Producer Defined</attrdefs>
        <attrdomv>
          <udom>See Column Definition</udom>
        </attrdomv>
      </attr>
    </detailed>
    <detailed>
      <enttyp>
        <enttypl>4_Hysteresis_ML\[5,7,9]class\Output\HI_[5,7,9]class_ML.html</enttypl>
        <enttypd>Markdown output as record of the script run.</enttypd>
        <enttypds>U.S. Geological Survey</enttypds>
      </enttyp>
    </detailed>
  </eainfo>
  <distinfo>
    <distrib>
      <cntinfo>
        <cntorgp>
          <cntorg>U.S. Geological Survey - ScienceBase</cntorg>
        </cntorgp>
        <cntaddr>
          <addrtype>mailing address</addrtype>
          <address>Denver Federal Center</address>
          <address>Building 810</address>
          <address>Mail Stop 302</address>
          <city>Denver</city>
          <state>CO</state>
          <postal>80225</postal>
        </cntaddr>
        <cntvoice>1-888-275-8747</cntvoice>
        <cntemail>sciencebase@usgs.gov</cntemail>
      </cntinfo>
    </distrib>
    <distliab>Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data for other purposes, nor on all computer systems, nor shall the act of distribution constitute any such warranty.</distliab>
    <stdorder>
      <digform>
        <digtinfo>
          <formname>Digital Data</formname>
        </digtinfo>
        <digtopt>
          <onlinopt>
            <computer>
              <networka>
                <networkr>https://doi.org/10.5066/P1492CKA</networkr>
              </networka>
            </computer>
          </onlinopt>
        </digtopt>
      </digform>
      <fees>None</fees>
    </stdorder>
  </distinfo>
  <metainfo>
    <metd>20251119</metd>
    <metc>
      <cntinfo>
        <cntperp>
          <cntper>John (William) Lund</cntper>
          <cntorg>USGS - MIDCONTINENT REGION</cntorg>
        </cntperp>
        <cntpos>Hydrologist</cntpos>
        <cntaddr>
          <addrtype>mailing and physical</addrtype>
          <address>1992 Folwell Ave</address>
          <address>Upper Midwest Water Science Center - University of Minnesota</address>
          <city>St. Paul</city>
          <state>MN</state>
          <postal>55108</postal>
        </cntaddr>
        <cntvoice>763-272-8690</cntvoice>
        <cntemail>jlund@usgs.gov</cntemail>
      </cntinfo>
    </metc>
    <metstdn>FGDC Content Standard for Digital Geospatial Metadata</metstdn>
    <metstdv>FGDC-STD-001-1998</metstdv>
  </metainfo>
</metadata>
