SBS8 is depleted in exons and enriched in heterochromatin
We analyzed mutational signatures associated with somatic point mutations identified from whole-genome sequencing data for 18 cancer cohorts from the International Cancer Genome Consortium (ICGC) (Fig. 1b; Supplementary Data 1). The selected cancer types have diverse tissues of origin and different exposures to endogenous and exogenous mutagenic processes, which allows us to decouple tissue-dependent and context-dependent effects. SBS8 was present with a sufficient footprint in most of the cohorts.
Since the mechanisms of endo- and exogenous DNA damage and repair preferences depend on local sequence, chromatin, and nuclear contexts
Among different genomic contexts, SBS8 was depleted in telomeric and exonic regions in nearly all cancer types analyzed (Fig. 1c–k). At the level of whole genes, SBS8 was detected, but its contribution came primarily from intronic regions. We observed consistent results in most cancer types, including those represented by multiple independent cohorts. SBS8 was relatively over-represented in repeats compared to exons (Wilcoxon rank sum test; combined p value across all cancer types <1e−05), although the SBS8 mutational signature did not indicate any specific preference for homopolymeric tracts or specific repeat motifs (Fig. 1a).
We next focused on chromatin and nuclear localization contexts, which unlike the genomic contexts, are tissue dependent. Since the cell of origin is not known for many cancer types and/or relevant tissue-specific epigenetic data is available for limited tissue types, we first used tissue-invariant chromatin and nuclear localization data1c–k; Spearman correlation; combined p value across all cancer types <1e−05). Similarly, it was significantly more over-represented in lamina-proximal regions in the nuclear periphery than inter-lamina regions in the nuclear interior (Fig. 1c–k; Wilcoxon rank sum test; combined p value across all cancer types <1e−05). We did not observe any specific enrichment for SBS8 in fragile sites (Supplementary Fig. 1). The results were consistent across the cancer cohorts, including those representing similar cancer types.
We considered the possibility that the number of mutations attributed to a mutational signature (signature weight × number of mutations/Mb) could actually be higher in a given context even when the relative proportion of that signature appears to decrease because of an excess of other signatures. We found no evidence that this possibility confounds our conclusions about the preference of SBS8 for heterochromatin over euchromatin. In fact, the somatic mutation rate in gene-rich euchromatin is lower than that in heterochromatin regions.
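The check described above can be made concrete with a small sketch. It computes the absolute burden attributed to a signature as signature weight × number of mutations/Mb for two contexts, and shows how a lower relative weight can still correspond to a higher absolute burden. The function name, context labels, and all numbers are hypothetical and chosen for illustration only; they are not values from this study.

```python
def signature_burden(weight, n_mutations, region_mb):
    """Absolute burden (mutations/Mb) attributed to one signature in one context.

    weight      : relative proportion of the signature in that context (0..1)
    n_mutations : total somatic mutations observed in that context
    region_mb   : size of the context in megabases
    """
    if region_mb <= 0:
        raise ValueError("region_mb must be positive")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * n_mutations / region_mb


# Hypothetical inputs (illustration only, not data from the study).
contexts = {
    "context_a": {"weight": 0.20, "n_mutations": 1000, "region_mb": 1000.0},
    "context_b": {"weight": 0.15, "n_mutations": 4000, "region_mb": 1000.0},
}

burden = {name: signature_burden(**c) for name, c in contexts.items()}

# context_b has the lower relative weight but the higher absolute burden,
# which is the scenario that has to be ruled out as a confounder.
lower_weight_higher_burden = (
    contexts["context_b"]["weight"] < contexts["context_a"]["weight"]
    and burden["context_b"] > burden["context_a"]
)
```

Comparing both the relative weight and the absolute burden per context is one simple way to verify that a conclusion drawn from proportions is not an artifact of other signatures being in excess.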
Composite epigenomic context preference of SBS8
The nucleotide, genomic, and epigenomic features are not independent, and combinatorically influence DNA damage and repair2a). The HMM approach allowed us to describe combinatorial patterns of relevant epigenomic features using a small number of composite contexts that are prevalent in the genome, and flexibly determine the resolution of the context-map by adjusting the number of such contexts. This offered a distinct advantage over considering exhaustive combinations of features, because the number of possible combinations increases exponentially with an increase in the number of features considered, and some combinatorial contexts are rarely observed in mammalian genomes.
Fig. 2: Composite epigenomic context analysis of SBS8.
a A schematic representation of the Hidden Markov Model used to identify mutagenesis-related epigenomic (MRE) states integrating genomic, epigenomic, and cellular process features relevant for mutagenesis and DNA repair. b Enrichment score of the features for the MRE states in a 20-state model. Descriptions of the MRE states are provided in Supplementary Data 1. Enrichment values of exon for E9–10 and 14–16 are >9. Contrast is saturated for values >3.5. c Relationship between MRE states from the 10, 20, and 30 state models. d Annotation of chromosome 21:27–30 Mb regions using 10, 20, and 30 state models and UCSC Genes are shown in breast epithelial cell type. Genomic coordinates of the MRE annotations from the 20-state model are provided in Supplementary Fig. 2 and Supplementary Data 3. e Relationship between MRE states and ChromHMM chromatin states. f Boxplot showing distributions of weight of SBS8 in different MRE contexts for multiple cancer types. See Supplementary Data 1 for description of the cancer cohorts including the number of samples.
We jointly annotated mutagenesis-related epigenomic (MRE) states for multiple cell types from the ENCODE project2b). Likewise, E9–10 and E14–16 are exonic regions, but differ in terms of their chromatin, nuclear localization, telomere, and replication contexts. Joint annotation of MRE states across cell types meant that the interpretation of the MRE state is invariant across cell types, but genomic segments attributed to that state might differ between cell types, primarily due to difference in cell-type dependent epigenomic makeups. A predominantly parent–child relationship between the MRE states in the lower and higher order models was observed, such that the MRE states are mostly subclassified into finer sub-states in corresponding higher order models (Fig. 2c), which would allow us to control resolution of the context-map by selecting appropriate state model if necessary. For instance, a single state (10E9) in the 10-state model was subdivided into E18 and E19 in the 20-state model. Interpretation of the contexts and their genome-wide prevalence in different cell types are provided in Supplementary Data 2 and 3 respectively, while an example of MRE annotations from the 10, 20, and 30 state models for chromosome 21 in breast epithelial cell type are shown in Fig. 2d and Supplementary Fig. 2. Our approach is conceptually similar to that adopted to identify chromHMM states2e). The composite MRE states are more broadly distributed genome-wide than the chromHMM states which show variations primarily around coding and regulatory regions which cover only about 2–5% of the genome.
Across multiple cancer types, SBS8 was over-represented in MRE state E20 (Fig. 2f), which corresponds to late replicating heterochromatin, and also in states E6 and E17, which showed similar contextual composition. In liver cancer, SBS8 was also common in the E18 and E19 contexts (Fig. 2f), which shared the late replication pattern. Although there were minor variations between the cancer types, SBS8 was prominently present in late replicating heterochromatin and depleted in early replicating euchromatin in all cancer types analyzed. Taken together, the tissue-invariant feature-by-feature analyses and the tissue-specific composite context analyses consistently indicate, in all cancer types, that SBS8 is more prevalent in late replicating, repeat-rich, heterochromatic regions than in early replicating, gene-rich, euchromatic regions.
Inference of etiology of SBS8 per exclusionem
SBS8 was present in multiple cancer cohorts, including those not attributed to environmental exposure and its nucleotide substitution pattern did not overlap with any known exogenous mutagen. This suggests that it is unlikely to occur due to external agents, and might arise via endogenous processes. The context-guided analysis further indicates that SBS8 rarely occurs in certain epigenomic contexts, allowing us to exclude certain classes of mutagenic processes from consideration. Unlike other mutational signatures (e.g., SBS4, SBS12, SBS16, and SBS19) that are specifically associated with transcription-coupled DNA damage and repair, SBS8 was depleted in exons (Fig. 1c–k) and did not have strong transcriptional strand bias
Replication context preference of SBS8
Using Repli-seq data3a). Although replication speed showed regional variations, in general, it increased towards very late replication in all cell types analyzed (Fig. 3b, Supplementary Fig. 3). This is in agreement with reports that late replication is marked by low origin density but higher replication speed (1.5–2.3 kb/min) than that of early replication domains (1.1–1.2 kb/min)
Fig. 3: Replication context analysis of SBS8.
a Schematic representation of inference of replication timing, direction of fork progression, and replication speed from Repli-seq data. b Scatterplot showing changes in replication speed with replication timing in MCF7 breast cancer cell line, which shows an increase in replication speed late during replication. Similar results are observed for other cell lines. c Boxplot showing distributions of weight of Signature 8 in replication timing contexts in breast cancer (BRCA-EU), ovarian cancer (OV-AU), and lymphoma (MALY-DE). d Boxplot showing distributions of weight of Signature 8 in combinations of replication timing and speed contexts in breast cancer (BRCA-EU), ovarian cancer (OV-AU), and lymphoma (MALY-DE). p Values for comparisons between fast and slow replication speed in late replication contexts are listed; combined p value for the three cohorts using Fisher's method is 3.45e−09. e Boxplot showing distributions of weight of Signature 8 in combinations of replication timing, speed, and direction contexts in breast cancer (BRCA-EU), ovarian cancer (OV-AU), and lymphoma (MALY-DE). See Supplementary Fig. 3 for similar results for other cancer cohorts. p Values for comparisons between left and right replication direction were not significant when analyzed in the context of combinations of replication timing and speed in the cohorts. See Supplementary Data 1 for description of the cancer cohorts including the number of samples.
Analyzing the proportion of SBS8-associated somatic mutations in tumor genomes in the replication contexts from closely related cell types, we found that late replicating regions had significant excess of SBS8 compared to early replicating regions in cancer (Fig. 3c, Supplementary Fig. 3; Wilcoxon rank sum test; combined p value <1e−05), and within late replication timing contexts, high replication speed was associated with increased burden of SBS8 (Fig. 3d; Wilcoxon rank sum test; combined p value <1e−05). We also found similar results using tissue-invariant replication timing data on all cancer cohorts (Supplementary Data 4), and our findings are consistent with the observations in breast cancer3e), that is in agreement with previous reports that SBS8 displays no major replication strand bias1), especially relative to late replicating regions in general—indicating that replication fork collapse may not be a major source of SBS8.
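The legend of Fig. 3 reports a p value combined across cohorts with Fisher's method. As a minimal sketch of how such a combination works, the function below computes the Fisher statistic and its p value using the closed-form chi-square survival function for even degrees of freedom, so only the standard library is needed. It assumes the per-cohort tests are independent; the function name and the example p values are hypothetical and are not the values from this study.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the survival function has a closed form:
        P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    Returns (statistic, combined_p).
    """
    if not p_values:
        raise ValueError("at least one p value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p values must lie in (0, 1]")

    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical per-cohort p values (illustration only).
example_statistic, example_combined = fisher_combined_p([0.01, 0.01, 0.01])
```

With a single input the combined p value equals that input, which is a convenient sanity check for the implementation.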
Overall, Fig. 3d suggests that replication speed and timing likely have independent effects, although replication timing might have a proportionally larger effect size. It is known that average replication fork speed increases markedly in presence of A + T)4) and drive mutation spectrum that favors AT nucleotides at late S-phase
SBS8 and genome maintenance
Uncorrected replication errors have the potential to stall replication, trigger checkpoint activation, and promote genomic instability. ATR-mediated DNA damage sensing for single strand breaks and CHEK1/2-mediated checkpoint activation are tightly coupled, such that mis-incorporated bases trigger DNA damage sensing and checkpoint activation. Checkpoint defects are common in cancer genomes, which might allow the cells to proceed through the cell cycle without appropriate repair of these lesions, resulting in mutations. Therefore, if SBS8 is indeed due to replication errors, we should detect additional evidence in genomic and cell cycle contexts. To this end, we divided the tumors in each cohort into three groups based on purity-adjusted ATR expression: low (0–33%), middle (33–67%), and high (67–100%). ATR-high tumors indeed had a high proportion of SBS8 among somatic mutations accumulated in late replicating domains (Supplementary Fig. 5). In contrast, when the tumors were grouped according to purity-adjusted CHEK1 or CHEK2 expression, low checkpoint gene expression was associated with a high proportion of SBS8 in late replicating domains (Supplementary Fig. 5). In fact, tumors with both high ATR and low CHEK1 or CHEK2 expression had proportionally more SBS8 than other combinations (Supplementary Fig. 6). These observations are consistent with a model in which checkpoint defects are associated with high prevalence of SBS8.
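The expression-based grouping described above can be sketched as follows. The code ranks tumors by expression of one gene, splits them into lower, middle, and upper thirds, and then averages the SBS8 proportion within each group. This is an illustrative sketch only: the function names, sample identifiers, and all numbers are hypothetical, the purity adjustment is assumed to have been applied upstream, and the summary statistic actually used in the study may differ.

```python
from statistics import mean


def tertile_groups(expression):
    """Assign each sample to 'low', 'middle', or 'high' by expression rank.

    expression: dict mapping sample id -> (purity-adjusted) expression value.
    Integer arithmetic on ranks avoids floating-point boundary issues.
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    groups = {}
    for rank, sample in enumerate(ranked):
        if 3 * rank < n:
            groups[sample] = "low"
        elif 3 * rank < 2 * n:
            groups[sample] = "middle"
        else:
            groups[sample] = "high"
    return groups


def mean_by_group(groups, values):
    """Mean of `values` (sample id -> number) within each expression group."""
    summary = {}
    for label in ("low", "middle", "high"):
        members = [values[s] for s, g in groups.items() if g == label]
        if members:
            summary[label] = mean(members)
    return summary


# Hypothetical inputs (illustration only, not data from the study).
atr_expression = {"T1": 2.1, "T2": 5.4, "T3": 3.3, "T4": 8.0, "T5": 1.2, "T6": 6.7}
sbs8_late = {"T1": 0.05, "T2": 0.08, "T3": 0.07, "T4": 0.15, "T5": 0.04, "T6": 0.12}

atr_groups = tertile_groups(atr_expression)
mean_sbs8 = mean_by_group(atr_groups, sbs8_late)
```

The same two functions can be applied to CHEK1 or CHEK2 expression, and the group labels from two genes can be combined per sample to examine joint categories such as high ATR with low CHEK1.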
We note that the tumor transcriptome changes with time, such that current expression of those genes is a poor proxy for their past expression, and it is not possible to obtain expression data from the time point when dividing cells accumulated the observed somatic mutations. Moreover, components of DNA repair pathways are regulated at transcriptional and post-translational levels, so correlative data need to be interpreted with these caveats in mind. Thus, we next analyzed data on acquired mutations in clonally derived cell lines with checkpoint defects, i.e., that had no functional copy of multiple DNA repair pathway genes including CHEK2. CHEK2−/− clones had predominantly background genome maintenance signature (dubbed BG signature) while contribution from homology repair defect signature (SBS3 like) was minimal
Crosstalk between SBS8 and other mutational signatures
We investigated association between SBS8 and other mutational signatures within and across genomic and epigenomic contexts to understand context-dependent interplay between these signatures for a number of reasons. First, mutagenesis and DNA repair do not occur in isolation and there is crosstalk between different mutagenic, DNA damage sensing, and repair pathways4a), which also have broad-spectrum nucleotide substitution patterns and are often discussed together.
Fig. 4: Crosstalk between SBS8 and other mutational signatures.
a PCA plot showing different mutational signatures projected based on their trinucleotide frequencies. Cosine similarity is shown to the right. b PCA plot showing different mutational signatures projected based on their weights in different epigenomic contexts. SBS8 is marked with an arrow. Cosine similarity is shown to the right. c Effect size of selected mutational signatures SBS1, SBS3, SBS5, and SBS8 in late replication contexts relative to early replication contexts in different cohorts. Negative values indicate preferential occurrence in early replication contexts. d Scatterplot showing mean proportion of each signature in late replication against its effect size between early and late replication contexts. SBS8, SBS40, and SBS12 are marked. Whiskers indicate the maximum and minimum values across the cancer cohorts. See Supplementary Data 1 for description of the cancer cohorts including the number of samples.
Nonetheless, when the epigenomic and replication context preferences were analyzed, differences among the signatures became evident. We used a PCA plot to compare epigenomic context-based proportions of different signatures (Fig. 4b) based on the genomic, chromatin, and nuclear localization features from Fig. 1; SBS8 showed similarity with SBS40 and was distinct from other broad-spectrum nucleotide substitution signatures such as SBS1, SBS3, and SBS5. Like SBS8, SBS40 is a broad-spectrum substitution-based signature with unknown etiology. Among the closely related broad-spectrum substitution-based signatures, only SBS8 showed a consistent and significant preference for late replication, while SBS1, SBS3, and SBS5 were consistently depleted in the late replication context in all cancer cohorts, including those representing similar cancer types (Fig. 4c; Supplementary Fig. 7). We observed similar results using cell type-dependent replication timing data. In a pan-signature analysis, among the signatures with sufficient presence (>5% proportion) in the cohorts, SBS8 showed the highest effect size in discriminating early and late replication contexts (Fig. 4d). Apart from SBS8, only SBS40 and, to some extent, SBS12 had a high proportional contribution among somatic mutations in late replicating regions in all cancer types, together with a high effect size in discriminating early and late replication contexts (Fig. 4d; Supplementary Fig. 8). SBS12 is a NER signature marked by excess of T > C, which is distinctly different from SBS8, but the etiology of SBS40 is unknown. Based on their similarity both at the trinucleotide level and in their presence in different epigenomic and replication contexts, we argue that SBS8 and SBS40 might be related.
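The legend of Fig. 4 reports cosine similarity between signatures, computed either on trinucleotide frequencies or on context-wise weights. A minimal sketch of that measure is given below. The function name is hypothetical, and the short vectors are toy profiles invented for illustration; real trinucleotide profiles have 96 channels and are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric profiles.

    For non-negative profiles such as signature frequencies, the value lies
    between 0 (no shared channels) and 1 (identical shape).
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("profiles must not be all zeros")
    return dot / (norm_a * norm_b)


# Toy profiles (illustration only, not real signature definitions).
profile_x = [0.40, 0.30, 0.20, 0.10]
profile_y = [0.35, 0.30, 0.25, 0.10]   # similar shape to profile_x
profile_z = [0.05, 0.05, 0.10, 0.80]   # dominated by a different channel

similar_pair = cosine_similarity(profile_x, profile_y)
distinct_pair = cosine_similarity(profile_x, profile_z)
```

Because cosine similarity depends only on the shape of a profile and not on its scale, the same function applies whether the inputs are trinucleotide frequencies or signature weights across epigenomic contexts.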
Next, we investigated whether SBS8 in the late replication context correlated with any other mutational signature in the complementary early replication context, especially signatures known to be genome maintenance-related. Association of SBS8 with other mutational signatures was cancer type-dependent and context-dependent (Supplementary Data 3). The proportion of different NER signatures (e.g., SBS7, SBS19, and SBS32) in early replicating regions correlated with the proportion of SBS8 in late replication in multiple cohorts. However, no single signature, NER-related or otherwise, correlated with SBS8 within and across epigenomic contexts in a majority of the cancer types tested. In liver cancer, the proportion of SBS8 in the late replication context correlated with SBS5 in early replication, while in breast and ovarian cancer, the proportion of SBS8 in late replication correlated significantly with the proportions of SBS3 and SBS1 in early replicating contexts. Our observations are consistent with reports that tumors with BRCA1/BRCA2 deficiency have high burden of SBS8
Replication errors have potentials to cause DNA double strand breaks, rearrangements, and genomic instability, and the burden of genomic structural variations in cancer genomes is known to be high in late replicating heterochromatin domains9); we observed similar results based on proportion of SBS8 in late replicating regions in these cohorts. Our observations are consistent with that based on SBS3 published reports in breast cancer
SBS8 is uncommon in nonmalignant tissues
Analyzing de novo germ line mutations from whole genome sequencing of 250 parent–offspring familiesp value > 0.05; Supplementary Fig. 10). This was not due to modest mutation count per sample; we observed similar results when the analysis was performed at the cohort-level after mutations from all samples were pooled. Likewise, mutational signature analysis of nonmalignant somatic did not show any substantial contribution of SBS8 at a genome-wide level10). Therefore, SBS8 mutational signature appears to be rare in nonmalignant cells, but likely arises during cancer progression.