Tissue sampling and DNA isolation
T. californicus of mixed sex and life stages were collected from high intertidal rocky pools in San Diego (32° 44′ 44′′ N, 117° 15′ 18′′ W) and kept en masse in 1 l beakers with filtered seawater (0.2 µm pore filter) at 20 °C. To reduce the contribution of gut contents and bacteria to the DNA extract, the seawater was treated with an antibiotic mixture (ampicillin, 200 mg l–1; streptomycin, 50 mg l–1; spectinomycin, 50 mg l–1; kanamycin, 50 mg l–1; penicillin, 200 units ml–1; neomycin, 50 mg l–1; and chloramphenicol, 20 mg l–1) and the animals were moved to clean filtered seawater daily for 3 days, without food, before extraction. DNA was isolated from groups of hundreds of copepods using a Qiagen DNeasy Blood and Tissue Kit, and samples were pooled to ~100 µg of DNA.
Genome sequencing and assembly
Cofactor Genomics generated 3 short-insert libraries (200, 500 and 800 bp) and 2 long-insert libraries (2 and 5 kb), and then barcoded and sequenced the libraries in 1 lane using the Illumina HiSeq 2000 platform to obtain 2× 100 bp paired-end (short inserts) or mate-paired sequences (Supplementary Table 1). Illumina sequences from the five libraries were first filtered using an internal quality control script to remove the low-quality reads. For the 200 and 500 bp insert libraries, which have high coverage, the quality control parameters were more stringent. All raw reads with unknown base 'N' were removed. In addition, raw reads were removed if the sum of per-base error probability across all bases was greater than two. For other libraries, whose coverage is relative lower, reads with 'N' were kept. However, reads were deleted if the sum of per-base error probability was greater than three. We only kept read pairs for which both read ends passed the quality filter. Duplicated read pairs were identified by CD-HIT-DUP in the CD-HIT package with the parameters '-u 50 –e'. Here, the read pairs having near-identical sequences with up to 2 mismatches within the first 50 bases on both ends were considered duplicated. The duplicated read pairs with lower-quality scores were removed. The processed reads from these five libraries were assembled using AllPaths-LG (release 46212) following AllPaths’ standard workflow, including initial input file preparation and final assembly runs. The assembled scaffold sequences were further extended using the GapCloser programme in the SOAPdenovo package.
To improve contiguity of the T. californicus assembly, proximity ligation libraries based on the Chicago and Hi-C methods were prepared by Dovetail Genomics as described previously
In an effort to fill assembly gaps, 1.68 million PacBio reads were obtained for the San Diego population and used to improve the final HiRise assembly using PBJellySupplementary Note) and removing those that appeared to be bacterial. Over 1.5 million PacBio reads with a mean read length of 5,470 bp remained. Next, these filtered reads were mapped to the HiRise assembly using BLASR via PBJelly, and gaps were filled or ends extended for regions for which there was a depth of at least ten PacBio reads. Overfilled gaps (gaps that were larger than the estimated size in the HiRise assembly) were set to an arbitrary size of 51 bp when they could not be completely filled with this set of PacBio reads at this minimum coverage threshold of 10. The resulting assembly then went through two more rounds of gap filling and end extension using this procedure, yielding our current T. californicus assembly, SDv2.1. These three rounds of PBJelly dropped the percentage of Ns from 4.67 to 1.97%, decreased the number of contigs in scaffolds from 7,105 to 4,789, and increased the overall size of the assembly by 4.7 megabase pairs (Mb) while increasing the sequence in contigs by 9.7 Mb.
Genome annotation
We used the pipeline in MAKER2 (ref. T. californicus assembly. First, we generated a de novo repeat library for T. californicus using the programme RepeatModelerT. californicus transcriptsT. californicus from the same populationT. californicus RNA-Seq reads and transcripts, and using the Benchmarking Universal Single-Copy Orthologs approach
Functional annotation of gene models was performed by BLASTP searches of the UniProt/Swiss-Prot database, followed by assignment of Gene Ontology terms and identification of protein motifs and domains from InterProScanSupplementary Note.
Orthology among arthropod models
We employed clustering algorithms to determine orthology relationships among T. californicus and metazoan model genomes, and to detect possible gene family expansions in the copepod. In OrthoVennDrosophila melanogaster and Apis mellifera), and a branchiopod crustacean (D. pulex) (Supplementary Table 12). To assess orthology among many taxa (Supplementary Table 13), we used OrthoMCLT. californicus proteins to pre-clustered orthologous groups curated in OrthoMCL-DB version 5 (ref. –5 and 1.5, respectively).
Sequencing and assembly of additional T. californicus populations
Copepods were collected from intertidal pools in San Roque (27° 11′ 12′′ N, 114° 23′ 52′′ W), Bird Rock (32° 48′ 54′′ N, 117° 16′ 23′′ W), Abalone Cove (33° 44′ 16′′ N, 118° 22′ 31′′ W), Catalina Island (33° 26.8′ N, 118° 28.6′ W), Santa Cruz (36° 56′ 58′′ N, 122° 02′ 49′′ W), Bodega Bay (38° 19′ 4′′ N, 123° 4′ 23′′ W) and Friday Harbor (48° 32′ 47′′ N, 123° 0′ 35′′ W). Animals were maintained as described previously. For each population, 250–300 adult individuals were pooled and DNA was isolated using the Qiagen DNeasy Blood and Tissue Kit or a phenol–chloroform procedure
Samples were prepared and sequenced by the University of North Carolina High Throughput Genomic Sequencing Facility as 100-bp paired-end libraries on the Illumina HiSeq 2000. Reads were trimmed for quality using PoPoolation
Assembly and annotation of mitochondrial genomes
Reads of mitochondrial origin were identified by mapping the Illumina samples above to the geographically nearest published Tigriopus mitochondrial genomeT. californicus mitochondrial genomes
Intraspecific evolution of nuclear coding regions
For each of the eight T. californicus populations, we used the ‘extractfeat’ script within GenomeTools
We used the programme PRANK-codon5). Poorly aligned regions were then removed with Gblocks version 0.91b (ref. dN/dS using the model M0 in the programme CODEML within PAML version 4.7 (ref. dN/dS estimates to test for general patterns consistent with the hypothesis of compensatory evolution. Using the RNA-Seq mapping performed above, we quantified transcription levels across the San Diego genes, and used ANCOVAs to account for expression in the hypothesis tests above.
We screened the genome for signals of positive selection on amino acid changes by employing codon-level evolutionary models of nucleotide substitution implemented in CODEML. We tested the ‘sites’ hypothesis by applying models M7 and M8 on each coding sequence alignment, and we compared the resulting log-likelihood ratios with a chi-squared distribution with two degrees of freedomSupplementary Note.
Molecular evolution and population genetic analysis of mtDNA
Multiple sequence alignment of each mtDNA gene was done in MEGA6 (ref. π, weighted for gene length, for each protein-coding gene using PoPoolation
Patterns of positive selection across the genus Tigriopus
We downloaded the annotated gene set of T. kingsejongensisT. japonicus (accession numbers SAMN05933003 and SAMN05933004) generated previouslyT. japonicus using TrinityTigriopus species, including T. japonicus, T. kingsejongensis, our reference T. californicus from San Diego and the reproductively isolated T. californicus from San Roque. Therefore, the orthologues identified above were aligned across the four taxa, then quality-trimmed with PRANKdS for all genes and discarded those with dS > 5 in order to reduce the influence of site saturation or poor alignments. The final set of alignments contained 4,863 genes.
We concatenated 20 randomly picked alignments and estimated a maximum likelihood phylogeny in MEGA6 (ref. dN/dS to vary among sites and among branches, and tested for evidence of positive selection on each of the 4,863 aligned coding sequences along each internal branch and tip. For each gene and each foreground branch, we calculated a log-likelihood ratio between a model allowing for positive selection and a null model in which dN/dS is fixed at 1, and these ratios were compared with a chi-squared distribution with one degree of freedom. After the analysis, we applied an FDR correction for multiple testing
Reporting Summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Raw Illumina and PacBio DNA sequencing reads have been deposited in the NCBI Sequence Read Archive database (San Diego reference: SRX469409–SRX469413; additional populations: SRX2746698–SRX2746704; PacBio reads for the San Diego reference: SRX3778522–SRX3778523). The T. californicus annotated genome assembly was deposited in the i5k Workspace of the National Agricultural Library (US Department of Agriculture), where it can be browsed, searched and downloaded: https://i5k.nal.usda.gov/Tigriopus_californicus.