Mark A. Engle
20190320
Codebook vectors and predicted rare earth potential from a trained emergent self-organizing map displaying multivariate topology of geochemical and reservoir temperature data from produced and geothermal waters of the United States
tabular digital data
Reston, VA
U.S. Geological Survey
https://doi.org/10.5066/P9GCYKG0
This data release consists of three products relating to an 82 x 50 neuron Emergent Self-Organizing Map (ESOM), which describes the multivariate topology of reservoir temperature and geochemical data for 190 samples of produced and geothermal waters from across the United States. Variables included in the ESOM are coordinates derived from reservoir temperature and concentrations of Sc, Nd, Pr, Tb, Lu, Gd, Tm, Ce, Yb, Sm, Ho, Er, Eu, Dy, F, alkalinity as bicarbonate, Si, B, Br, Li, Ba, Sr, sulfate, H (derived from pH), K, Mg, Ca, Cl, and Na converted to units of proportion. The concentration data were converted to isometric log-ratio coordinates (following Hron et al., 2010), where the first ratio has Sc serving as the denominator to the geometric mean of all of the remaining elements (Nd to Na), the second ratio has Nd serving as the denominator to the geometric mean of all of the remaining elements (Pr to Na), and so on, until the final ratio, which is Na to Cl. Both the temperature and the log-ratio coordinates of the concentration data were normalized to a mean of zero and a sample standard deviation of one. The first table contains the mean and standard deviation of each variable in the training dataset, which are used to standardize the data. The second table contains the codebook vectors from the trained ESOM, where all variables were standardized and compositional data were converted to isometric log-ratio coordinates. The final table provides rare earth element potentials predicted for a subset of the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3 (Blondes et al., 2017) through the use of the ESOM. The original source data used to create the ESOM all come from the U.S. Department of Energy Resources Geothermal Data Repository and are detailed in Engle (2019).
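The log-ratio construction described above can be sketched in Python as follows. This is a minimal illustration, not the robCompositions code used for the release. The function name ilr_pivot is hypothetical, and the normalizing constant sqrt((D - i) / (D - i + 1)) is assumed from the pivot-coordinate form in Hron et al. (2010). The sign convention (part i in the denominator, geometric mean of the remaining parts in the numerator) follows the description in this abstract.

```python
import math

def ilr_pivot(parts):
    """Isometric log-ratio (pivot) coordinates of one composition.

    `parts` lists strictly positive values in the fixed part order.
    Coordinate i is the scaled log of (geometric mean of the parts after i)
    divided by part i. The scaling constant is an assumption (see lead-in).
    """
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(len(parts) - 1):
        rest = logs[i + 1:]                      # logs of the remaining parts
        log_gmean_rest = sum(rest) / len(rest)   # log of their geometric mean
        scale = math.sqrt(len(rest) / (len(rest) + 1))
        coords.append(scale * (log_gmean_rest - logs[i]))
    return coords
```

A composition of D parts yields D - 1 coordinates, and multiplying every part by the same constant leaves the coordinates unchanged, which is why the data can be expressed in units of proportion before transformation.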
This data release is provided to: 1) allow users to map new sample sources to the ESOM using a minimum distance measurement (such as Euclidean distance) through an algorithm such as k-nearest neighbor and 2) provide predicted rare earth element potential output from the exercise for produced and geothermal waters of the United States. Any data sets used for mapping to the trained ESOM need to be isometrically log-ratio transformed and standardized (using means and standard deviations from the first table) using the exact same formulation as the training dataset used to create this matrix. This can be useful both for data classification and for non-linear estimation. In the case of the latter, missing values (i.e., those in need of estimation) can be imputed from the codebook vector for the best match unit (i.e., the neuron with the smallest multivariate distance to the point being estimated). The imputed value can then be converted back into the original units through the inverse of data standardization and, for concentration data, the inverse of the isometric log-ratio transformation (Hron et al., 2010). Note that for concentration data, the results are in units of proportion and can be converted back into the original units by multiplying each row by the sum of the compositional data in the original dataset.
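A minimal Python sketch of the mapping and imputation step is given below. It assumes that each sample is already transformed and standardized, that missing coordinates are marked with None, and that the distance is computed over the reported coordinates only. The function names are hypothetical and are not part of this release.

```python
import math

def best_matching_unit(sample, codebook):
    """Return the index of the codebook vector nearest to `sample`.

    Euclidean distance is computed over the non-missing entries only
    (an assumption about how partial samples are handled).
    """
    best_idx, best_dist = -1, math.inf
    for idx, vec in enumerate(codebook):
        dist = math.sqrt(sum((s - v) ** 2
                             for s, v in zip(sample, vec) if s is not None))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def impute_from_bmu(sample, codebook):
    """Fill missing entries of `sample` from its best matching unit."""
    bmu = codebook[best_matching_unit(sample, codebook)]
    return [v if s is None else s for s, v in zip(sample, bmu)]

def unstandardize(values, means, sds):
    """Invert standardization: value * standard deviation + mean."""
    return [v * sd + m for v, m, sd in zip(values, means, sds)]
```

In this sketch the codebook is a list of rows from the second table, and the means and standard deviations come from the first table.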
20180101
20181231
observed
None planned
-169.1015625
-64.33593750000001
71.74643171904148
22.59372606392931
USGS Thesaurus
economic geology
rare earth elements
natural resource exploration
geochemistry
USGS Metadata Identifier
USGS:5c5049ebe4b0708288f86ee1
Geographic Names Information System (GNIS); https://geonames.usgs.gov/apex/f?p=138:1:1468614298857
United States
None
None
None
None
None. Please see 'Distribution Info' for details.
None. However, users are advised to read the data set's metadata thoroughly to understand appropriate use and data limitations.
Mark A Engle
U.S. Geological Survey, Midwest Region
Research Geologist
Mailing and Physical
UTEP - Dept. of Geological Sciences
El Paso
TX
79968
United States
915-747-5503
engle@usgs.gov
The dataset was created using R (version 3.4.4) with the following packages used for the critical steps: UMatrix (version 3.1) for ESOM creation and robCompositions (2.0.9) for implementation of isometric log-ratio coordinates. The operating system used was OS X Version 10.11.6 (El Capitan).
M. S. Blondes
K. D. Gans
M. A. Engle
Y. K. Kharaka
M. E. Reidy
V. Sarawathula
J. J. Thordsen
E. L. Rowan
E. A. Morrissey
20171208
U.S. Geological Survey National Produced Waters Geochemical Database, version 2.3
tabular digital data
Reston, VA
U.S. Geological Survey
https://energy.usgs.gov/EnvironmentalAspects/EnvironmentalAspectsofEnergyProductionandUse/ProducedWaters.aspx#3822349-data
K. Hron
M. Templ
P. Filzmoser
20101201
Imputation of missing values for compositional data using classical and robust methods
publication
Computational Statistics and Data Analysis
Elsevier
https://www.sciencedirect.com/science/article/pii/S0167947309004368
No formal attribute accuracy tests were conducted. The tables included here are secondary products created from data that the USGS did not produce.
The transformed data used in the creation of the emergent self-organizing map from which the codebook vectors were produced were examined for univariate and multivariate outliers. Univariate investigation included generation and inspection of an exploratory data analysis (EDA) plot for each ilr coordinate. The plots include a density trace, a histogram, a 1-dimensional scatter plot, a Tukey box plot, and an empirical cumulative distribution function plot, which allow for examination of normality, multiple populations, and extreme values. Multiple populations were observed for ilr coordinates z16 (F in the denominator), z24 (SO4 in the denominator), and z27 (Mg in the denominator), and likely represent clear geochemical differences between geothermal and produced waters. Visual examination of the EDA plots for all ilr-transformed data showed no evidence of significant outliers or extreme univariate values.
An adaptive chi-square threshold on Mahalanobis distances, estimated with the minimum covariance determinant method from the ilr coordinates, was applied to complete cases of the REE data in the input dataset. No significant multivariate outliers were identified that would suggest analytical or data entry problems.
The dataset from which the ESOM was generated consists of only 185 samples available from the U.S. Department of Energy Geothermal Data Repository that contain rare earth element concentrations, other major and minor constituents, and water quality parameters, and for which all of the rare earth element data came from the same analytical laboratory (Idaho National Laboratory) and were analyzed by the same separation procedure.
The dataset that was used to predict rare earth element potential in produced and geothermal waters across the United States is Version 2.3 of the U.S. Geological Survey Produced Waters Database (Blondes et al., 2017). In order to produce a model with reasonable predictive power, only samples with reported concentrations for at least 8 parameters were included: alkalinity and HCO3, Ca, Cl, K, Mg, Na, pH as H+, SO4, and Sr. This greatly reduced the number of samples in the U.S. Geological Survey Produced Waters Database from roughly 115,000 to 3,688 data points. Despite this, the dataset still provides broad coverage of data from all major on-shore oil and gas basins in the U.S.
No formal positional accuracy tests were conducted
No formal positional accuracy tests were conducted
Scott A. Quillinan
20181231
Aqueous Rare Earth Element Patterns and Concentration in Thermal Brines Associated with Oil and Gas Production
tabular digital data
https://gdr.openei.org/submissions/930
Digital and/or Hardcopy
20180101
20181231
2019
REE concentrations in produced and thermal waters
This is the data used to create the ESOM.
Cross-validation methods were applied in an attempt to quantify the uncertainty of using the ESOM to estimate REE values in unknown samples. Rows from the input dataset were randomly split between the training dataset (85% of rows; n=190) and the cross-validation dataset (15% of rows; n=34). The former was used to train the ESOM model and is the basis for all prediction. The latter was used to check the ability of the ESOM model to accurately predict REE concentrations for “blind” samples. The cross-validation data were mapped to the trained ESOM (described in the next section) to generate predicted REE concentrations. Because a minimum of 8 compositional parameters (alkalinity and HCO3, Ca, Cl, K, Mg, Na, pH as H+, SO4, and Sr) was used as the threshold for predicting REE potential for the U.S. Geological Survey Produced Waters Geochemical Database, only these same parameters were used in the cross-validation dataset. The predicted concentrations were compared against the known REE concentrations for the cross-validation dataset, allowing for calculation of model error.
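The random split can be illustrated with a short Python sketch. The total of 224 rows is inferred here from the two reported counts (190 + 34). The function name, the seed, and the use of rounding to set the training size are assumptions, so the sketch does not reproduce the actual row assignment used for the release.

```python
import random

def split_rows(n_rows, train_fraction, seed=0):
    """Randomly split row indices into training and cross-validation sets."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_rows)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, cv_idx = split_rows(224, 0.85)
print(len(train_idx), len(cv_idx))  # prints "190 34"
```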
20181210
Starting with our normalized data for temperature and the ilr coordinates (a total of m columns), a codebook vector of length m was created for each neuron and filled with random values. The general SOM algorithm was then run as follows (details in the associated report):
1) An input vector ẑi is selected at random from the training dataset and the Euclidean distances between it and all codebook vectors for all the neurons on the map are computed. Note that the Euclidean distance of ilr-transformed variables is equal to the Aitchison distance of untransformed compositional data.
2) The input vector ẑi is assigned to the codebook vector of the neuron that is closest to it. This neuron is known as the best matching unit (BMU). A Gaussian function is used to define the neighborhood of nearby neurons around the BMU. With each iteration of the algorithm, the radius of the neighborhood decreases.
3) The codebook vectors of those neurons within the neighborhood are re-weighted to be more similar to ẑi using one of several possible functions. Typically, codebook vectors of neurons closer to the BMU are more heavily re-weighted than those more distal. The amount of re-weighting (learning weight) also decreases over time.
4) The next input vector, ẑi+1, is randomly selected and steps 1–3 are repeated. Once all the input vectors have been mapped (defined as 1 epoch in the ESOM algorithm), they are removed from the map and the process is repeated starting back at step 1. The learning is continued for a set number of epochs. Because the neighborhood and re-weighting function both decrease with each epoch, the map stabilizes with an increasing number of iterations.
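Steps 1 through 4 can be sketched in Python as shown below. This is a simplified illustration and not the UMatrix implementation used for the release. The planar grid, the linear decay of radius and learning weight, the floor of 0.5 on the radius, and all function and parameter names are assumptions made for the example.

```python
import math
import random

def train_som(data, rows, cols, epochs, radius0, rate0, seed=0):
    """Train a small SOM and return {(row, col): codebook vector}."""
    rng = random.Random(seed)
    m = len(data[0])
    # Fill one codebook vector of length m per neuron with random values.
    codebook = {(r, c): [rng.gauss(0.0, 1.0) for _ in range(m)]
                for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # shrinks toward 0 over time
        radius = max(radius0 * frac, 0.5)      # neighborhood radius
        rate = rate0 * frac                    # learning weight
        order = list(range(len(data)))
        rng.shuffle(order)                     # random presentation order
        for k in order:                        # one pass over data = 1 epoch
            x = data[k]
            # Step 1-2: the BMU is the neuron with the nearest codebook vector.
            bmu = min(codebook, key=lambda rc: sum(
                (a - b) ** 2 for a, b in zip(x, codebook[rc])))
            # Step 3: Gaussian-weighted update of the BMU and its neighbors.
            for (r, c), w in codebook.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
                for j in range(m):
                    w[j] += h * (x[j] - w[j])
    return codebook
```

Each update moves a codebook vector a fraction h of the way toward the input vector, so neurons near the BMU on the grid move the most and distant neurons move very little.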
20181210
Data from the U.S. Geological Survey Produced Waters Database, Version 2.3 (Blondes et al., 2017), were loaded into R. The compositional parameters were converted to isometric log-ratio coordinates using the identical construction to that of the training dataset for the ESOM. Reservoir temperature was estimated using the Li-Mg geothermometer for samples where concentrations for Li and Mg were reported. All variables were then standardized using the mean and standard deviation of the corresponding variables in the training dataset for the ESOM. The data set was reduced to only those samples that had, at a minimum, reported values for alkalinity and HCO3, Ca, Cl, K, Mg, Na, pH as H+, SO4, and Sr. The normalized and transformed data in this abridged version of the USGS database were mapped to the trained ESOM by finding the neuron whose codebook vector has the shortest distance to the input vector ẑi for each sample (i.e., the BMU). The missing values for each input vector were then taken from the corresponding element in the codebook vector of the respective BMU. The resulting data were un-normalized using the mean and standard deviation from the training dataset, and the compositional parameters were back-transformed into the original variables. The resulting compositional data are proportional; to convert them back into units of mg/L, each row was multiplied by the sum of the compositional data in the original dataset. Finally, to convert the data into predicted REE potential, the predicted concentration values were normalized by data for bulk seawater (i.e., North Pacific Deep Water).
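The back-transformation and normalization at the end of this process step can be sketched in Python as follows. The inverse assumes the same pivot-coordinate convention described for the training data (part i in the denominator, with scaling constant sqrt((D - i) / (D - i + 1))), which is an assumption about the exact formulation. All function names are hypothetical, and the numbers in the usage lines are placeholders, not values from this release or from any seawater reference.

```python
import math

def ilr_pivot_inverse(coords):
    """Recover proportions (summing to 1) from pivot ilr coordinates."""
    n_parts = len(coords) + 1
    logs = [0.0] * n_parts                 # log of the last part fixed at 0
    for i in range(n_parts - 2, -1, -1):   # solve backwards, last ratio first
        rest = logs[i + 1:]
        scale = math.sqrt(len(rest) / (len(rest) + 1))
        logs[i] = sum(rest) / len(rest) - coords[i] / scale
    parts = [math.exp(v) for v in logs]
    total = sum(parts)
    return [p / total for p in parts]      # closure to unit sum

def to_original_units(proportions, row_sum):
    """Rescale proportions by the sum of the row in the original dataset."""
    return [p * row_sum for p in proportions]

def ree_potential(predicted, seawater):
    """Predicted concentration divided by the seawater reference value."""
    return [p / s for p, s in zip(predicted, seawater)]

# Placeholder usage with made-up numbers (not real data):
props = ilr_pivot_inverse([0.0, 0.0])          # three equal parts
conc = to_original_units(props, 300.0)         # each part is about 100.0
potential = ree_potential(conc, [50.0, 50.0, 50.0])
```

Because the coordinates are unchanged by rescaling, the inverse can fix the log of the last part at zero and close the result to a sum of one afterwards.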
20181210
Geographic Names Information System (GNIS) placenames; (https://geonames.usgs.gov/apex/f?p=138:1:0:::::)
Point
Raw var mean and sd.csv
Comma Separated Value (CSV) file containing data.
Producer defined
Variable
List of variables in the training data set used in the creation of the ESOM
Producer defined
Temp
Reservoir Temperature
Producer defined
z1
First isometric log-ratio coordinate
Producer defined
z2
Second isometric log-ratio coordinate
Producer defined
z3
Third isometric log-ratio coordinate
Producer defined
z4
Fourth isometric log-ratio coordinate
Producer defined
z5
Fifth isometric log-ratio coordinate
Producer defined
z6
Sixth isometric log-ratio coordinate
Producer defined
z7
Seventh isometric log-ratio coordinate
Producer defined
z8
Eighth isometric log-ratio coordinate
Producer defined
z9
Ninth isometric log-ratio coordinate
Producer defined
z10
Tenth isometric log-ratio coordinate
Producer defined
z11
Eleventh isometric log-ratio coordinate
Producer defined
z12
Twelfth isometric log-ratio coordinate
Producer defined
z13
Thirteenth isometric log-ratio coordinate
Producer defined
z14
Fourteenth isometric log-ratio coordinate
Producer defined
z15
Fifteenth isometric log-ratio coordinate
Producer defined
z16
Sixteenth isometric log-ratio coordinate
Producer defined
z17
Seventeenth isometric log-ratio coordinate
Producer defined
z18
Eighteenth isometric log-ratio coordinate
Producer defined
z19
Nineteenth isometric log-ratio coordinate
Producer defined
z20
Twentieth isometric log-ratio coordinate
Producer defined
z21
Twenty-first isometric log-ratio coordinate
Producer defined
z22
Twenty-second isometric log-ratio coordinate
Producer defined
z23
Twenty-third isometric log-ratio coordinate
Producer defined
z24
Twenty-fourth isometric log-ratio coordinate
Producer defined
z25
Twenty-fifth isometric log-ratio coordinate
Producer defined
z26
Twenty-sixth isometric log-ratio coordinate
Producer defined
z27
Twenty-seventh isometric log-ratio coordinate
Producer defined
z28
Twenty-eighth isometric log-ratio coordinate
Producer defined
z29
Twenty-ninth isometric log-ratio coordinate
Producer defined
Mean
Mean value for this parameter from the training dataset
Producer defined
-14.60644189
85.75569938
Sample Standard Deviation
Sample standard deviation for this parameter from the training dataset
Producer defined
0.916808589
54.74892281
Codebook vectors.csv
Comma Separated Value (CSV) file containing data.
Producer defined
Temp
Estimated reservoir temperature, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.043514689
1.755131613
z1
First isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-5.445396929
3.060465659
z2
Second isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.217206465
2.853141376
z3
Third isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.592499197
2.961289936
z4
Fourth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.651777914
2.743547304
z5
Fifth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.467036529
2.293378065
z6
Sixth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.312231969
3.246556492
z7
Seventh isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.690440767
2.992472747
z8
Eighth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.363151595
2.561723291
z9
Ninth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.02419728
3.493410099
z10
Tenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.845414478
2.719127748
z11
Eleventh isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.009726733
3.123616511
z12
Twelfth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.4652381133
2.963290808
z13
Thirteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.882979256
2.833179607
z14
Fourteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.667280639
2.669894317
z15
Fifteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.846581937
3.031999976
z16
Sixteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.128740452
1.89837255
z17
Seventeenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.96685552
1.494072751
z18
Eighteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.887365217
2.134596936
z19
Nineteenth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.929114842
2.086253284
z20
Twentieth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-5.030476365
1.787186429
z21
Twenty-first isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.573888453
2.293587065
z22
Twenty-second isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.292584472
2.357489008
z23
Twenty-third isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-1.944694958
3.141813366
z24
Twenty-fourth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-5.051799776
1.260953967
z25
Twenty-fifth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.473741638
1.908503719
z26
Twenty-sixth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-3.28550767
2.072246454
z27
Twenty-seventh isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.264043115
1.770191234
z28
Twenty-eighth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.698781778
2.659533327
z29
Twenty-ninth isometric log-ratio coordinate, standardized to a mean of zero and a standard deviation of one.
Producer defined
-2.775283221
2.498712025
REE potential for US.csv
Comma Separated Value (CSV) file containing data.
Producer defined
IDUSGS
Unique sample ID corresponding to the sample in the U.S. Geological Survey Produced Waters Database, Version 2.3
U.S. Geological Survey Produced Waters Database, Version 2.3
177
114915
LATITUDE
Sample latitude
U.S. Geological Survey Produced Waters Database, Version 2.3
25.2863
71.26119
Decimal degrees
LONGITUDE
Sample longitude
U.S. Geological Survey Produced Waters Database, Version 2.3
-156.60581
-75.88
Decimal degrees
FORMATION
Geologic formation sample was produced from, as provided by the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3
U.S. Geological Survey Produced Waters Database, Version 2.3
Geologic formation sample was produced from, as provided by the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3
DEPTHWELL
Reported total depth of well as provided by the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3
U.S. Geological Survey Produced Waters Database, Version 2.3
170.0
16958.0
feet
La.potential
Predicted La potential
Producer defined
0.000294002
33604.46932
unitless
Ce.potential
Predicted Ce potential
Producer defined
0.00507206
346462.4793
unitless
Pr.potential
Predicted Pr potential
Producer defined
0.000401598
33238.966
unitless
Nd.potential
Predicted Nd potential
Producer defined
0.000216055
29078.79606
unitless
Sm.potential
Predicted Sm potential
Producer defined
0.000856029
26884.62355
unitless
Eu.potential
Predicted Eu potential
Producer defined
0.019764142
47592.86925
unitless
Gd.potential
Predicted Gd potential
Producer defined
0.000656786
22061.79084
unitless
Tb.potential
Predicted Tb potential
Producer defined
0.001335604
14911.32564
unitless
Dy.potential
Predicted Dy potential
Producer defined
0.001665052
10573.21685
unitless
Ho.potential
Predicted Ho potential
Producer defined
0.001020264
7779.651371
unitless
Er.potential
Predicted Er potential
Producer defined
0.000577907
5999.874949
unitless
Tm.potential
Predicted Tm potential
Producer defined
0.001076244
3462.498276
unitless
Yb.potential
Predicted Yb potential
Producer defined
0.000935646
2840.553835
unitless
Lu.potential
Predicted Lu potential
Producer defined
0.002209805
2465.62607
unitless
This data release consists of 3 products:
1) A file of the sample means and standard deviations of the data used in the training dataset for the ESOM.
2) The codebook vectors for each neuron from the trained ESOM. If new data are to be mapped to the trained ESOM, the compositional variables must be converted to isometric log-ratio coordinates using the identical approach applied to the training dataset. The resulting coordinates and reservoir temperatures must then be standardized to the sample mean and standard deviation from the training dataset (that is, the first product in this release).
3) Predicted REE potential for a subset of the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3. The first 5 columns in this file are taken directly from that database. The remaining columns are predicted REE potential for each sample. REE potential is defined as the predicted concentration from the ESOM model, normalized by the value for seawater (that is, North Pacific Deep Water).
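The standardization required in item 2 can be sketched in Python as follows. The sketch assumes that the means and sample standard deviations are read from the first product and that the values in each row follow the same variable order as that file. The function names are hypothetical.

```python
import statistics

def column_stats(column):
    """Mean and sample standard deviation (n - 1 denominator) of a column."""
    return statistics.mean(column), statistics.stdev(column)

def standardize(row, means, sds):
    """Standardize one row: (value - training mean) / training sample sd."""
    return [(v - m) / s for v, m, s in zip(row, means, sds)]
```

New samples should be standardized with the training statistics from the first product, not with statistics recomputed from the new data, so that they remain comparable to the codebook vectors.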
Engle, M.A., 2019, Codebook vectors and predicted rare earth potential from a trained emergent self-organizing map displaying multivariate topology of geochemical and reservoir temperature data from produced and geothermal waters of the United States: U.S. Geological Survey data release, https://doi.org/10.5066/P9GCYKG0.
GS ScienceBase
U.S. Geological Survey
Mailing and Physical
Denver Federal Center, Building 810, Mail Stop 302
Denver
CO
80225
United States
1-888-275-8747
sciencebase@usgs.gov
Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data on any other system or for general or scientific purposes, nor shall the act of distribution constitute any such warranty.
Digital Data
https://doi.org/10.5066/P9GCYKG0
None
The file contains data available in comma separated value (.csv) file format. The user must have software capable of opening and viewing a .csv file.
20200819
Eric A Morrissey
U.S. Geological Survey, Midwest Region
Info Tech Spec (Internet)
Mailing and Physical
Mail Stop 956, 12201 Sunrise Valley Dr
Reston
VA
20192
United States
703-648-6409
703-648-6419
emorriss@usgs.gov
Content Standard for Digital Geospatial Metadata
FGDC-STD-001-1998