Gila spp. (bonytail chub, humpback chub, and roundtail chub) genomic data from various locations including the Colorado River in the Grand Canyon (2020-2024)

Rob Massatti Maria C Dzul 20250304 Gila spp. (bonytail chub, humpback chub, and roundtail chub) genomic data from various locations including the Colorado River in the Grand Canyon (2020-2024) tabular data Flagstaff, AZ U.S. Geological Survey Additional information about Originators: Massatti, R, https://orcid.org/0000-0001-5854-5597; Dzul, Maria C, https://orcid.org/0000-0002-4798-5930 https://doi.org/10.5066/F7251GFZ Maria C Dzul Rob Massatti Charles B Yackulic Emily Omana-Smith Kirk Young 20250722 Genetic analysis of Humpback Chub (Gila cypha) in Grand Canyon in light of recent population expansion journal manuscript M C Dzul, R Massatti, C B Yackulic, E Omana Smith, K Young, Genetic structure of an expanding population of Humpback Chub in Grand Canyon, North American Journal of Fisheries Management, 2025, vqaf060, https://doi.org/10.1093/najfmt/vqaf060 https://doi.org/10.1093/najfmt/vqaf060 These data were compiled to investigate genomic patterns in humpback chub (Gila cypha) sampled from various locations (New Mexico and Utah) including the Colorado River in the Grand Canyon, AZ. Specifically, our objectives were to understand 1) if unique genetic variation persisted in the western Grand Canyon during the period when humpback chub were restricted to warm water refugia and 2) if a behavioral difference between migratory and resident fish in the Little Colorado River led to assortative mating. These data represent anonymous genomic sequences generated as a result of next-generation sequencing. The data release consists of two tab-delimited text files that may be used to infer population structure and genetic diversity statistics (gicy_pop4.vcf, gicy_pop5.vcf). These data represent genetic variation on an individual level. These files may be opened and edited in a text editor program, such as Notepad++ (PC) or BBEdit (Mac). The .vcf files can be loaded into the Stacks populations program (Catchen et al. 
2013) to calculate genetic diversity statistics or to convert the data to other file formats (e.g., .stru or .ped). The sequencing data were collected in 2021 and leveraged fin clips resulting from sampling between 2015 and 2020. The data were collected by scientists and technicians at the U.S. Geological Survey Southwest Biological Science Center by sequencing a library on an Illumina NovaSeq 6000 (Illumina, San Diego, CA USA). These data can be used to investigate patterns of genetic diversity and structure to illuminate the ecological and evolutionary forces influencing humpback chub. The purpose of these data is to describe the population structure, intraspecific relationships, genetic diversity, and history of Gila cypha in the Grand Canyon, New Mexico, and Utah. These data were created to investigate the history of the species and the role of a behavioral difference in influencing population genetic patterns. The datasets archived herein can be used to further investigate evolutionary or ecological processes that affect Gila cypha, or they may be used to reconfirm the patterns reported in Dzul et al. (2025; see Larger Work Citation). Ultimately, these data will be used to inform conservation management decisions regarding the recovery of Gila cypha. The datasets provided herein are based on a modified RADseq library preparation protocol, and as such, genetic information represents anonymous loci from throughout the Gila genomes. The benefits and drawbacks of this approach are much discussed in the literature, and users of these data should read about and understand potential issues before analyzing the data or interpreting results. Data users should read the entire metadata record and acquire the manuscript identified as the ‘Larger Work Citation’ and any manuscripts identified as ‘Cross Reference’ to have a complete understanding of how these data were created and used. 
The data are specific to the uses identified above, as described in the ‘Larger Work Citation’, and any other use of these data would be inappropriate. See 'Distribution liability' statements for more information. 2020 2024 observed Complete None planned -113.9392 -104.3302 39.1035 33.1797 ISO 19115 Topic Category biota inlandWaters USGS Thesaurus biogeography biological population management demographics endangered species evolution genetic diversity genetics native species phylogeny population and community ecology population dynamics USGS Metadata Identifier USGS:58f7aab4e4b0b7ea5451f611 USGS information products data release None bonytail chub Gila cypha Gila elegans Gila robusta humpback chub roundtail chub single nucleotide polymorphism Geographic Names Information System (GNIS) Arizona Colorado Colorado River Desolation Canyon Grand Canyon Gray Canyon Little Colorado River New Mexico Utah No access constraints No use constraints. License, Creative Commons Zero v1.0 Universal. Robert T Massatti U.S. Geological Survey Research Ecologist mailing and physical

2255 North Gemini Drive

Flagstaff AZ 86001 US 928-523-5036 rmassatti@usgs.gov We thank Steve Mussmann (USFWS) for providing samples of Bonytail Chub and Humpback Chub from the Desolation-Gray population. Additionally, we thank Tiffany Love-Chezem (USFWS), Mariah Giardina (USFWS), Michael Yard (USGS), and David Ward (USFWS) for helping with sample collection. We thank Erica Sukovich (USGS) for helping prepare samples for genetic analysis and Erica Byerley (USGS) for GIS support. Julian Catchen Paul A. Hohenlohe Susan Bassham Angel Amores William A. Cresko 2013 Stacks: an analysis tool set for population genomics Wiley Online Library Molecular Ecology https://doi.org/10.1111/mec.12354 Daniel Falush Matthew Stephens Jonathan K. Pritchard 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies Genetics Society of America (online) Genetics https://doi.org/10.1093/genetics/164.4.1567 Stéphane Guindon Jean-François Dufayard Vincent Lefort Maria Anisimova Wim Hordijk Olivier Gascuel 2010 New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0 Oxford Academic (online) Systematic Biology https://doi.org/10.1093/sysbio/syq010 Josephine R. Paris Jamie R. Stevens Julian M. Catchen 2017 Lost in parameter space: a road map for stacks British Ecological Society (online) Methods in Ecology and Evolution https://doi.org/10.1111/2041-210X.12775 Nicolas C Rochette Julian M Catchen 2017 Deriving genotypes from RAD-seq short-read data using Stacks Nature (online) Nature Protocols https://doi.org/10.1038/nprot.2017.123 The unique values for each attribute field were reviewed and checked for spelling, consistency of terms, accuracy, adherence to controlled vocabularies, and completeness. Attribute values are within expected ranges. Range queries were conducted to confirm that numerical values were not outside a reasonable range for a particular field. 
Outlier checks were performed by plotting numerical values bounded within a range. The data set is considered complete for the information presented, as described in the abstract. Users are advised to read the rest of the metadata record carefully for additional details. Development of gicy_pop4.vcf and gicy_pop5.vcf: The process_radtags script in the Stacks package (Catchen et al. 2013) was used to exclude reads containing more than four low-quality sites or adapter contamination. Parameters affecting the assembly were assessed based on how parameter combinations affected r80 loci (i.e., those found in 80% of samples or more); the optimal parameter set associated with the plateau of the number of r80 loci was selected (Paris et al. 2017; Rochette & Catchen 2017). Values used in the final assembly were: reads required to initiate a new putative allele (-m in ustacks) = 3; number of mismatches allowed between the two alleles of a heterozygote sample (-M in ustacks) = 5; mismatches allowed between any two alleles of the population (-n in cstacks) = 5. The populations program in Stacks was executed under the settings: minimum percentage of individuals in a population required to process a locus for that population (-r) = 0.5, minimum minor allele frequency required to process a nucleotide site at a locus (--min_maf) = 0.05, maximum observed heterozygosity required to process a nucleotide site at a locus (--max_obs_het) = 0.7, and the correction applied to FST values (--fst_correction) = p_value. The Larger Work Citation contains greater detail regarding DNA extraction, library preparation, and data processing. We leveraged three approaches to estimate genetic structure: 1) Bayesian clustering implemented in STRUCTURE v2.3.4 (Falush et al. 2003), 2) principal components analysis (PCA) in R v4.0.5 (R Core Team 2021), and 3) maximum-likelihood phylogenetic inference in PhyML 3.0 (Guindon et al. 2010). 
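The r80 criterion and the site-level filters named above (--min_maf = 0.05, --max_obs_het = 0.7) can be illustrated with a minimal Python sketch. This is not the Stacks implementation: the function names and input structures are hypothetical, sites are assumed to be biallelic with VCF-style GT strings, and the per-population -r filter is not modeled.

```python
from typing import Dict, Optional, Sequence, Tuple


def count_r80_loci(presence: Dict[str, Sequence[bool]], threshold: float = 0.8) -> int:
    """Count loci found in at least `threshold` of samples (r80 when 0.8).

    `presence` maps a hypothetical locus name to one flag per sample
    saying whether that locus was assembled in the sample.
    """
    count = 0
    for flags in presence.values():
        if len(flags) > 0 and sum(flags) / len(flags) >= threshold:
            count += 1
    return count


def site_stats(genotypes: Sequence[str]) -> Optional[Tuple[float, float]]:
    """Return (minor allele frequency, observed heterozygosity) for one site.

    Genotypes are strings such as '0/0', '0/1', '1/1'; './.' marks a
    missing call and is ignored. Returns None if no genotype was called.
    """
    called = [g for g in genotypes if g not in ("./.", ".")]
    if not called:
        return None
    alt_alleles = 0
    heterozygotes = 0
    for genotype in called:
        first, second = genotype.replace("|", "/").split("/")
        alt_alleles += (first == "1") + (second == "1")
        heterozygotes += first != second
    alt_freq = alt_alleles / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    obs_het = heterozygotes / len(called)
    return maf, obs_het


def passes_site_filters(genotypes: Sequence[str],
                        min_maf: float = 0.05,
                        max_obs_het: float = 0.7) -> bool:
    """Apply the minor allele frequency and observed heterozygosity cutoffs."""
    stats = site_stats(genotypes)
    if stats is None:
        return False
    maf, obs_het = stats
    return maf >= min_maf and obs_het <= max_obs_het
```

In this sketch a site where every called individual is heterozygous is rejected by the heterozygosity cutoff, and a site with a single alternative allele among 20 diploid individuals is rejected by the minor allele frequency cutoff.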
We executed twenty independent runs in STRUCTURE for each K value ranging from 1 to 9; each run used an admixture model with correlated allele frequencies, where population membership was not assigned a priori, and consisted of 100,000 burn-in and 250,000 Markov chain Monte Carlo iterations. Process date: 2024. Data Quality Assessment and Quality Control (QAQC): FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess the quality of data. Low-quality sites and adapter contamination were excluded using the process_radtags script in Stacks (Catchen et al. 2013). Parameters were selected using the r80 protocol (Paris et al. 2017; Rochette & Catchen 2017). Process date: 2024. Finalize Data for Dissemination: Data sent to the Southwest Biological Science Center Data Steward for dissemination and preservation per USGS Data Management Policies SM 502.6, SM 502.7, SM 502.8 and SM 502.9 (1 October 2016). Process date: 2025. gicy_pop4.vcf data table This text file is in the variant call format (VCF), which is used in bioinformatics to store DNA sequence variations (for example, see www.internationalgenome.org/formats). This data table contains data for all samples that were included in the study and passed quality control procedures. The first 14 rows are headers and contain descriptive metadata only. Subsequent rows contain the genetic data - each row is a unique single nucleotide polymorphism. The purpose of this data table is to answer evolutionary questions about the species using population genetics techniques as implemented in STRUCTURE or principal components analysis. Producer defined #CHROM Each unique locus, or sequenced, homologous portion of the genome, receives a unique number in this column. Producer defined 6 92859 integer number POS The reference position of the base pair within the locus, with the first base in the first locus having position 1. Positions are sorted numerically, in increasing order. 
Producer defined 2 117 integer number ID Unique identifier for each locus. Producer defined The first number represents the locus number in the database, while the second number after the colon represents the position within the read after which the polymorphism is found. Thus, all polymorphisms with the same first number come from the same locus. Data values = 6:104 to 92859:101. REF Values represent nucleobases that compose the units of DNA. The data values in this column give the reference allele of a single nucleotide polymorphism. Producer defined A adenine nucleotide Producer defined C cytosine nucleotide Producer defined G guanine nucleotide Producer defined T thymine nucleotide Producer defined ALT Values represent nucleobases that compose the units of DNA. The data values in this column give the alternative allele of a single nucleotide polymorphism. Producer defined A adenine nucleotide Producer defined C cytosine nucleotide Producer defined G guanine nucleotide Producer defined T thymine nucleotide Producer defined QUAL This attribute denotes a phred-scaled quality score for the assertion made in ALT. The Phred quality score (Q) is logarithmically related to the error probability (E). Producer defined These data values are an estimate of accuracy, e.g., the probability that the base was identified correctly by the sequencer. All data values in this file are "." (missing), so no quality score is reported; where a score is reported, Q = -10 log10(E). FILTER The value states the outcome of quality filtering. Producer defined PASS The data value indicates that the locus passed all quality filters, i.e., a call is made at this position. Producer defined INFO This attribute represents additional information. Producer defined Data values = NS=10;AF=0.050 to NS=99;AF=0.475. NS = the number of individuals for which data are reported; AF = the allele frequency for each ALT allele in the same order as listed. 
FORMAT This value denotes the format (data types and order) of the data in the following columns. Producer defined Data values = GT:DP:AD:GQ:GL. Specifically, Genotype:Read Depth:Allele Depth:Genotype Quality:Genotype Likelihood. bony_1 to bony_5 This attribute in the data table represents labels for 5 unique Gila elegans (Bonytail chub) individuals. Producer defined These fish are referred to as bonytail chub (BTC) in the Larger Work Citation. The attribute values represent unique fish samples. fallcyn1 to fallcyn40 This attribute in the data table represents labels for 35 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as western Grand Canyon (WGC) in the Larger Work Citation. The attribute values represent unique fish samples. havas_1 to havas_29 This attribute in the data table represents labels for 25 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as Havasu Creek (HAV) in the Larger Work Citation. The attribute values represent unique fish samples. jcm_1 to jcm_39 This attribute in the data table represents labels for 38 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as fish that migrate into and out of the Little Colorado River (LCR-MIG) in the Larger Work Citation. The attribute values represent unique fish samples. lcr_1 to lcr_61 This attribute in the data table represents labels for 60 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as resident fish in the Little Colorado River (LCR-RES) in the Larger Work Citation. The attribute values represent unique fish samples. round_1 to round_4 This attribute in the data table represents labels for 3 unique Gila robusta (Roundtail chub) individuals. Producer defined These fish are referred to as RTC in the Larger Work Citation. The attribute values represent unique fish samples. 
up_col_1 to up_col_15 This attribute in the data table represents labels for 15 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as DGP in the Larger Work Citation. The attribute values represent unique fish samples. gicy_pop5.vcf data table This text file is in the variant call format (VCF), which is used in bioinformatics to store DNA sequence variations (for example, see www.internationalgenome.org/formats). This data table contains data for the Grand Canyon humpback chub samples that were included in the study and passed quality control procedures. The first 14 rows are headers and contain descriptive metadata only. Subsequent rows contain the genetic data - each row is a unique single nucleotide polymorphism. The purpose of this data table is to answer evolutionary questions about the species using population genetics techniques as implemented in STRUCTURE or principal components analysis. Producer defined #CHROM Each unique locus, or sequenced, homologous portion of the genome, receives a unique number in this column. Producer defined 8 97445 integer number POS The reference position of the base pair within the locus, with the first base in the first locus having position 1. Positions are sorted numerically, in increasing order. Producer defined 2 117 integer number ID Unique identifier for each locus. Producer defined The first number represents the locus number in the database, while the second number after the colon represents the position within the read after which the polymorphism is found. Thus, all polymorphisms with the same first number come from the same locus. Data values = 8:38 to 97445:35. REF Values represent nucleobases that compose the units of DNA. The data values in this column give the reference allele of a single nucleotide polymorphism. 
Producer defined A adenine nucleotide Producer defined C cytosine nucleotide Producer defined G guanine nucleotide Producer defined T thymine nucleotide Producer defined ALT Values represent nucleobases that compose the units of DNA. The data values in this column give the alternative allele of a single nucleotide polymorphism. Producer defined A adenine nucleotide Producer defined C cytosine nucleotide Producer defined G guanine nucleotide Producer defined T thymine nucleotide Producer defined QUAL This attribute denotes a phred-scaled quality score for the assertion made in ALT. The Phred quality score (Q) is logarithmically related to the error probability (E). Producer defined These data values are an estimate of accuracy, e.g., the probability that the base was identified correctly by the sequencer. All data values in this file are "." (missing), so no quality score is reported; where a score is reported, Q = -10 log10(E). FILTER The value states the outcome of quality filtering. Producer defined PASS The data value indicates that the locus passed all quality filters, i.e., a call is made at this position. Producer defined INFO This attribute represents additional information. Producer defined Data values = NS=10;AF=0.050 to NS=99;AF=0.475. NS = the number of individuals for which data are reported; AF = the allele frequency for each ALT allele in the same order as listed. FORMAT This value denotes the format (data types and order) of the data in the following columns. Producer defined Data values = GT:DP:AD:GQ:GL. Specifically, Genotype:Read Depth:Allele Depth:Genotype Quality:Genotype Likelihood. fallcyn1 to fallcyn40 This attribute in the data table represents labels for 35 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as western Grand Canyon (WGC) in the Larger Work Citation. The attribute values represent unique fish samples. 
havas_1 to havas_29 This attribute in the data table represents labels for 25 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as Havasu Creek (HAV) in the Larger Work Citation. The attribute values represent unique fish samples. jcm_1 to jcm_39 This attribute in the data table represents labels for 38 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as fish that migrate into and out of the Little Colorado River (LCR-MIG) in the Larger Work Citation. The attribute values represent unique fish samples. lcr_1 to lcr_61 This attribute in the data table represents labels for 60 unique Gila cypha (Humpback chub) individuals. Producer defined These fish are referred to as resident fish in the Little Colorado River (LCR-RES) in the Larger Work Citation. The attribute values represent unique fish samples. GS ScienceBase U.S. Geological Survey mailing and physical

Denver Federal Center, Building 810, Mail Stop 302

Denver CO 80225 United States 1-888-275-8747 sciencebase@usgs.gov Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data for other purposes, nor on all computer systems, nor shall the act of distribution constitute any such warranty. The data are distributed as text files in the variant call format (VCF), which is used in bioinformatics to store DNA sequence variations (for example, see www.internationalgenome.org/formats). The user must have software capable of displaying the data table. 20250724 Robert T Massatti U.S. Geological Survey Research Ecologist mailing and physical

2255 North Gemini Drive

Flagstaff AZ 86001 US 928-523-5036 rmassatti@usgs.gov Content Standard for Digital Geospatial Metadata FGDC-STD-001-1998
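
The column layout described for gicy_pop4.vcf and gicy_pop5.vcf (nine fixed VCF columns followed by one column per fish) can be read with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the example record is invented to match the documented layout and is not taken from the data files, and a dedicated VCF library may be preferable for real analyses.

```python
from typing import Dict, List

# Hypothetical example lines that follow the documented layout
# (values are made up for illustration, not copied from the data release).
EXAMPLE_HEADER = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "lcr_1", "lcr_2"]
)
EXAMPLE_LINE = "\t".join(
    ["6", "105", "6:104", "A", "G", ".", "PASS", "NS=2;AF=0.250",
     "GT:DP:AD:GQ:GL",
     "0/1:12:7,5:40:-10.1,-0.0,-12.3",
     "0/0:9:9,0:38:-0.0,-3.1,-20.5"]
)


def parse_info(info: str) -> Dict[str, str]:
    """Split an INFO string such as 'NS=10;AF=0.050' into key/value pairs."""
    pairs = [item.split("=", 1) for item in info.split(";") if "=" in item]
    return {key: value for key, value in pairs}


def parse_record(header_line: str, data_line: str) -> dict:
    """Parse one VCF data row using the '#CHROM ...' column header row.

    The first nine columns are kept as strings (INFO becomes a dict), and
    each sample column is split according to the FORMAT keys, for example
    GT:DP:AD:GQ:GL.
    """
    columns: List[str] = header_line.rstrip("\n").split("\t")
    fields: List[str] = data_line.rstrip("\n").split("\t")
    record: dict = dict(zip(columns[:9], fields[:9]))
    record["INFO"] = parse_info(record["INFO"])
    format_keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(columns[9:], fields[9:])
    }
    return record


if __name__ == "__main__":
    parsed = parse_record(EXAMPLE_HEADER, EXAMPLE_LINE)
    print(parsed["ID"], parsed["INFO"], parsed["samples"]["lcr_1"]["GT"])
```

When reading the actual files, rows beginning with "##" are descriptive headers and would be skipped before the "#CHROM" column header row is passed to parse_record.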