Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters

As more hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into training and test sets using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset.
Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
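
The per-province "balanced accuracy" figures quoted above can be read with the help of a small sketch. It assumes the common one-vs-rest definition, the mean of sensitivity and specificity for each class; the abstract does not state which definition the authors used, so this is an illustration and not the code in this data release. The function name and the toy labels are hypothetical.

```python
def per_class_balanced_accuracy(y_true, y_pred):
    """Return {label: (sensitivity + specificity) / 2}, treating each
    label in turn as the positive class (one-vs-rest).

    Assumed definition; not taken from the data release itself.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        tn = n - tp - fn - fp
        # Guard against classes with no positives or no negatives.
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        scores[label] = (sensitivity + specificity) / 2
    return scores


# Toy labels only (hypothetical, not values from the USGS PWGD).
truth = ["Gulf Coast", "Gulf Coast", "Raton", "Raton", "Anadarko", "Anadarko"]
predicted = ["Gulf Coast", "Raton", "Raton", "Raton", "Gulf Coast", "Gulf Coast"]

scores = per_class_balanced_accuracy(truth, predicted)
for province, score in scores.items():
    print(f"{province}: {score:.3f}")
```

Under this definition, a province whose samples are never predicted correctly can still score near 50% when few other samples are misassigned to it, which is one way to read the values at the low end of the reported range.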

Get Data and Metadata
Author(s) Mary R. Croke, Jenna L Shelton, Aaron M Jubb, Samuel W Saxe, Emil D Attanasi, Philip A Freeman, Madalyn S Blondes
Publication Date 2021-07-26
Beginning Date of Data 2021
Ending Date of Data 2021
Data Contact
DOI https://doi.org/10.5066/P95G2SZC
Citation Croke, M.R., Shelton, J.L., Jubb, A.M., Saxe, S.W., Attanasi, E.D., Freeman, P.A., and Blondes, M.S., 2021, Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters: U.S. Geological Survey data release, https://doi.org/10.5066/P95G2SZC.
Metadata Contact
Metadata Date 2021-07-26
Related Publication
Citations of these data No citations of these data are known at this time.
Access public
License http://www.usa.gov/publicdomain/label/1.0/
Harvest Source: ScienceBase
Harvest Date: 2021-11-19T04:42:53.907Z