Campylorhynchus brunneicapillus single nucleotide polymorphism genotype data from southern California, USA

Vandergast, A.G. Wood, D.A. Smith, J.G. Mitelberg, A. 20250623 Campylorhynchus brunneicapillus single nucleotide polymorphism genotype data from southern California, USA tabular digital data (.csv and .vcf) ScienceBase U.S. Geological Survey data release https://doi.org/10.5066/P18WGRYU This dataset contains 184 single nucleotide polymorphism (SNP) genotypes of Campylorhynchus brunneicapillus (Cactus Wren) sampled in southern California. Genomic markers were generated from ddRAD loci and analyzed using the Stacks v2.60 pipeline. The genotypes for all samples are provided in a VCF file with 40,707 independent loci. A companion sample data file is provided with sample names, location, and other data. These files may be opened and edited in a text editor program, such as Notepad (PC) or BBEdit (Mac). The .vcf file can be loaded into the Stacks population program to calculate genetic diversity statistics, or loaded into R, using vcfR, for further analysis. These data were collected to assess the distribution of genetic diversity within and among occurrences of cactus wren in southern California 20110503 20210602 observed Complete None planned Southern California -119.0372 -115.0237 35.6997 32.5866 ISO 19115 Topic Category biota USGS Thesaurus birds genotype genetic diversity genetics native species USGS Metadata Identifier USGS:683dcb8fd4be0234870fc967 Common Geographic Areas San Diego Orange Imperial Riverside San Bernardino Ventura Los Angeles Kern Integrated Taxonomic Information System (ITIS) Campylorhynchus brunneicapillus U.S. 
Geological Survey 2013 Integrated Taxonomic Information System (ITIS) Online Database https://doi.org/10.5066/F7KH0KBK www.itis.gov expert identifier Kingdom Animalia Subkingdom Bilateria Infrakingdom Deuterostomia Phylum Chordata Subphylum Vertebrata Infraphylum Gnathostomata Superclass Tetrapoda Class Aves Order Passeriformes Family Troglodytidae Genus Campylorhynchus Species Campylorhynchus brunneicapillus TSN: 178587 Cactus Wren No access constraints. Please see 'Distribution Info' for details. No use constraints. Questions pertaining to appropriate use or assistance with understanding limitations or interpretation of the data are to be directed to the individuals/organization listed in the Point of Contact section. U.S. Geological Survey, Western Ecological Research Center Data Manager mailing address

3020 State University Drive, Modoc Hall, suite 4004

Sacramento CA 95819 US 279-782-0904 gs-b-werc_data_management@usgs.gov Stacks 2.60 was used to analyze genomic markers for these data. Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography. https://catchenlab.life.illinois.edu/stacks/ Stacks is implemented in C++ with wrapper programs written in Perl. Stacks should build on any standard UNIX-like environment (Apple OS X, Linux, etc.). Stacks is an independent pipeline and can be run without any additional external software. Stacks uses a standard autotools installation script. Visit https://catchenlab.life.illinois.edu/stacks/manual/#install to view the standard autotools installation script. Visit https://catchenlab.life.illinois.edu/stacks to download the Stacks package. View the INSTALL file within the package for specific installation instructions. Catchen, J. Hohenlohe, P. A. Bassham, S. Amores, A. Cresko, W. A. 2013 Stacks: An analysis tool set for population genomics Publication Raw sequence demultiplexing, quality filtering, and genotyping were performed using Stacks v2.60 on the USGS Yeti High Performance Computing platform. Decommissioned in 2023, Yeti was the first supercomputer available to all USGS research staff. https://doi.org/10.5066/F7D798MJ Hovenweep is a USGS on-premises supercomputer that replaced Yeti as the workhorse USGS supercomputer for general-purpose HPC workloads. For more information or to get started using McKinley, Tallgrass, or Hovenweep supercomputers, contact the USGS High Performance Computing team at hpc@usgs.gov. Falgout, J.T. Gordon, J. 2023 USGS Yeti Supercomputer: U.S. Geological Survey supercomputer Molecular Ecology vol. 
22, issue 11, pages 3124-3140 n/a Wiley https://doi.org/10.5066/F7D798MJ https://doi.org/10.1111/mec.12354 The .vcf file can be loaded into R using vcfR to calculate genetic diversity statistics. vcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. https://knausb.github.io/vcfR_documentation/ https://github.com/knausb/vcfR/blob/master/README.md vcfR is available on the Comprehensive R Archive Network. For download instructions, visit https://github.com/knausb/vcfR/blob/master/README.md. Knaus, B.J. Grünwald, N.J. 2017 vcfR: a package to manipulate and visualize variant call format data in R. publication Molecular Ecology Resources vol 17, issue 1, Special Issue: Population Genomics with R n/a Wiley https://doi.org/10.1111/1755-0998.12549 FastQC is a quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ A Java Runtime Environment is required to run FastQC. FastQC can be downloaded by visiting https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Andrews, S. 2010 FastQC: A Quality Control Tool for High Throughput Sequence Data Tool https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ The quality control tool for high throughput sequence data, FastQC, was used to assess the attribute quality of data and remove samples that had greater than 30 percent missing data. A minor allele frequency cutoff of 5% was also applied. It was verified that there were no duplicate samples and that all included samples passed quality filters. A minimum of 9X coverage at a locus was required, and samples with > 30% missing data were removed. Data set is considered complete for the information presented, as described in the abstract. Users are advised to read the rest of the metadata record carefully for additional details. No formal positional accuracy tests were conducted. Handheld GPS units were used to collect location data. Paris, J.R. Stevens, J.R. Catchen, J.M. 
2017 Lost in parameter space: a road map for stacks publication Methods in Ecology and Evolution vol. 8, issue 10 n/a British Ecological Society https://doi.org/10.1111/2041-210X.12775 Digital and/or Hardcopy 2017 publication Stacks parameters Clustering, assembly, and filtering parameters were optimized using a subset of individuals following the r80 method by Paris et al. Catchen, J.M. Hohenlohe, P.A. Bassham, S. Amores, A. Cresko, W.A. 2013 Stacks: an analysis tool set for population genomics publication Molecular Ecology vol. 22, issue 11, Special Issue: Genotyping by Sequencing in Ecological and Conservation Genomics n/a Wiley https://doi.org/10.1111/mec.12354 Digital and/or Hardcopy 2013 publication Population genomics functions The following final parameters were used to create a locus catalog (cstacks) for the full Stacks genotyping pipeline: minimum number of raw reads required to form a stack (putative allele), m = 3; maximum number of mismatches allowed when matching stacks (putative alleles within loci) within samples, M = 2; maximum number of mismatches between stacks (putative loci) of the samples and the catalog of loci among all samples, n = 1; minimum percentage of individuals across populations required to process a locus, R = 80. Peterson, B.K. Weber, J.N. Kay, E.H. Fisher, H.S. Hoekstra, H.E. 2012 Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species publication PLoS One 7(5):e37135 n/a PLoS One https://doi.org/10.1371/journal.pone.0037135 Digital and/or Hardcopy 2012 publication ddRAD The double-digest restriction-associated DNA (ddRAD) sequencing protocol developed in Peterson et al. was followed when extracting genomic DNA. Genetic blood samples were collected from Cactus Wrens at California desert sites. Unknown Genomic DNA was extracted from blood cards using the Gentra Puregene extraction kit and following the manufacturer's protocol. 
We followed the double-digest restriction-associated DNA (ddRAD) sequencing protocol developed in Peterson et al. (2012), using restriction enzymes EcoRI and MseI. Fragments of 250-400 base pairs were size-selected and then pooled for sequencing (150 bp paired-end reads) on Illumina HiSeqX and NovaSeq platforms at MedGenome, Inc. (Foster City, CA). ddRAD 2022 Dustin Wood U.S. Geological Survey, Western Ecological Research Center Geneticist mailing and physical

4165 Spruance Road, suite 200

San Diego CA 92101 USA 619-225-6432 dawood@usgs.gov Development of Campylorhynchus brunneicapillus genomic marker dataset from ddRADseq library prep: Raw sequence demultiplexing, quality filtering, and genotyping were performed using Stacks v2.60 on the USGS Yeti High Performance Computing platform. Clustering, assembly, and filtering parameters were optimized using a subset of individuals following the r80 method by Paris et al. (2017). This subset included 70 individuals that were distributed across collection locations and had read coverages that fell within one standard deviation of overall mean retained reads per sample. This subset was also used to create a locus catalog (cstacks) for the full Stacks genotyping pipeline. The following final parameters were used: minimum number of raw reads required to form a stack (putative allele), m = 3; maximum number of mismatches allowed when matching stacks (putative alleles within loci) within samples, M = 2; maximum number of mismatches between stacks (putative loci) of the samples and the catalog of loci among all samples, n = 1; minimum percentage of individuals across populations required to process a locus, R = 80 (Catchen et al. 2013). We further filtered the number of samples and loci by requiring a minimum depth of coverage of 9x and less than 30% missing data per sample. One single nucleotide polymorphism (SNP) was randomly chosen from each locus and exported in a variant call format (VCF) file. Stacks parameters Population genomics functions 2022 0.0197727549 0.0237726774 Decimal seconds WGS_1984 WGS 84 6378137.0 298.257223563 CACW_SNP_sampledata.csv Comma Separated Value (CSV) file containing data. 
Producer Defined sample genetic sample unique identifier Producer Defined alpha numeric codes pop_code sampling location code Producer Defined three letter site code vcf_order order of samples in the .vcf file Producer Defined 1 187 site_name full name of site Producer Defined researcher designated site names for collection locations latitude latitude in decimal degrees, WGS 84 datum Producer Defined 32.58655 35.69969 decimal degrees longitude longitude in decimal degrees, WGS 84 datum Producer Defined -119.0372 -115.02366 decimal degrees collection_date dates birds were sampled in the field Producer Defined na no date collected Producer defined collection dates formatted month/day/year field sample number sample code or bird band number given to bird sampled. Producer Defined na not available Producer defined numeric bird bands (USGS bird band lab); or alpha numeric code given to individual birds at the time of field sampling clus1 proportion of genetic assignment to cluster 1 Producer Defined 0.00001 0.99998 none clus2 proportion of genetic assignment to cluster 2 Producer Defined 0.00001 0.99998 none clus3 proportion of genetic assignment to cluster 3 Producer Defined 0.00001 0.968875 none assignment cluster assignment Producer Defined pop2 majority of assignment to pop 2 Producer defined pop1 majority of assignment to pop 1 Producer defined pop3 majority of assignment to pop 3 Producer defined CACW_populations.snps.vcf This is a tab delimited text file in the variant call format (VCF). The first 15 rows are headers and contain descriptive metadata only. Subsequent rows contain the genetic data - each row is a unique single nucleotide polymorphism. These data can be used to answer evolutionary questions about the species. Producer defined CHROM A unique identifier from the reference genome Producer defined 10 441387 POS Position of the SNP in the reference sequence CHROM. Positions are sorted numerically. 
Producer defined 6 151 ID Semicolon-separated list of unique identifiers within the RAD locus Producer defined 10:14 to 441387:114 REF Values represent nucleobases that compose the units of DNA. The value in this column is the reference single nucleotide polymorphism. Producer defined G guanine Producer defined C cytosine Producer defined T thymine Producer defined A adenine Producer defined ALT Values represent nucleobases that compose the units of DNA. The value in this column is the alternative non-reference single nucleotide polymorphism. Producer defined T thymine Producer defined A adenine Producer defined C cytosine Producer defined G guanine Producer defined QUAL Value of '.' denotes that locus passed internal quality filters. All values are the same (i.e., all values passed quality control). Producer defined . passed internal quality filters Producer defined FILTER Value is the verbal determination of the quality filtering (i.e., PASS means the locus passed quality filtering). Producer defined PASS passed quality filtering Producer defined INFO NS equals the number of individuals for which data are reported; AF equals the allele frequency of the alternative non-reference allele (see ALT) Producer defined NS equals 148; AF equals 0.051 FORMAT This value denotes the format of the data in the following columns, specifically Genotype:Total Read Depth:Depth of each allele:Genotype Quality:Genotype Likelihood for the biallelic data [AA,AB,BB] Producer defined GT:DP:AD:GQ:GL Genotype:Total Read Depth:Depth of each allele:Genotype Quality:Genotype Likelihood Producer defined ABS08_r to WEL04 These are the actual data, in the format specified in the previous column (i.e., GT:DP:AD:GQ:GL). Values correspond to the individual names in row 15. Producer defined An example of the encoding is: 0/1:51:15,36:40:-120.01,-0.00,-41.81. 
This individual has 2 alleles (0 and 1) representing REF and ALT, a total read depth of 51 with 15 reads of the REF allele and 36 reads of the ALT allele, a phred score of 40 and likelihoods of -120.01, 0, and -41.81 for the different possible genotypes (00, 01, 11). U.S. Geological Survey - ScienceBase mailing address

Denver Federal Center, Building 810, Mail Stop 302

Denver CO 80225 United States 1-888-275-8747 sciencebase@usgs.gov Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data for other purposes, nor on all computer systems, nor shall the act of distribution constitute any such warranty. Digital Data https://doi.org/10.5066/P18WGRYU None 20250623 Amy Vandergast U.S. Geological Survey, Western Ecological Research Center Research Geneticist mailing and physical

4165 Spruance Road, Suite 200

San Diego CA 92101 USA 619-225-6445 avandergast@usgs.gov FGDC Biological Data Profile of the Content Standard for Digital Geospatial Metadata FGDC-STD-001.1-1999
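
Illustrative note on the VCF sample columns: the GT:DP:AD:GQ:GL encoding described for CACW_populations.snps.vcf can be unpacked with a few lines of Python. The sketch below is not part of the data release; the function name parse_sample_field is hypothetical, and it assumes that missing calls appear as "./." or "." (possibly with trailing fields omitted), which should be checked against the actual file before use.

```python
from typing import Any, Dict


def parse_sample_field(format_spec: str, sample_value: str) -> Dict[str, Any]:
    """Split one VCF sample entry into typed values keyed by FORMAT tag.

    Hypothetical helper for illustration only; not part of Stacks or vcfR.
    """
    keys = format_spec.split(":")
    values = sample_value.split(":")
    # zip() stops at the shorter list, so a truncated entry such as "./."
    # yields fewer keys instead of raising an error.
    raw = dict(zip(keys, values))

    parsed: Dict[str, Any] = {}
    for key, text in raw.items():
        if key == "GT":
            # Alleles are separated by "/" (unphased) or "|" (phased).
            alleles = text.replace("|", "/").split("/")
            parsed[key] = [None if a == "." else int(a) for a in alleles]
        elif text == ".":
            parsed[key] = None  # assumed marker for a missing value
        elif key in ("DP", "GQ"):
            parsed[key] = int(text)
        elif key == "AD":
            parsed[key] = [None if x == "." else int(x) for x in text.split(",")]
        elif key == "GL":
            parsed[key] = [None if x == "." else float(x) for x in text.split(",")]
        else:
            parsed[key] = text  # leave unrecognized tags as plain text
    return parsed


# The example entry quoted in the attribute description above.
call = parse_sample_field("GT:DP:AD:GQ:GL", "0/1:51:15,36:40:-120.01,-0.00,-41.81")
is_heterozygous = len(set(call["GT"])) > 1
print(call["GT"], call["DP"], call["AD"], call["GQ"], is_heterozygous)
```

For the quoted example, the two allele depths (15 and 36) sum to the total read depth of 51, which is a quick consistency check on any parsed entry. For whole-file analysis, an established reader such as vcfR (named in this record) is likely a better choice than hand-rolled parsing.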