To investigate the role of chromatin architecture in the Unc5b/Slc29a3 locus we generated two mouse lines with deleted TAD boundary forming CTCF-bs (Fig. 1D).
First, we deleted two centromeric sites by introducing ~ 5kbp deletion (chr10:60,755,585 − 60,761,088, mm10). As they are placed in intergenic regions and do not overlap any known regulatory or coding elements, we apply the conventional CRISPR/Cas9 system to introduce two double-strand breaks, supplemented with Homology-Directed Repair (HDR) template, stimulating deletions formation (Fig. 1E). This results in the generation of animals with deletion of a 5 kb region containing both CTCF-bs. From 19 animals carrying the target modification, we picked a single male mouse for backcrosses and obtained the homozygous line. PCR-genotyping and Sanger sequencing confirmed that deletion occurred between target Cas9 cut sites and likely represented the HDR product (Fig. 1F).
Fig. 1Spatial organization of the Unc5b/Slc29a3 locus and its gene editing. A - Hi-C map and gene localisation of mouse Unc5b/Slc29a3 locus. From [58]. B - Hi-C map and gene localisation of human UNC5B/SLC29A3 locus. From [59]. C - Hi-C map and gene localisation of western clawed frog Unc5b/Slc29a3 locus. From [60]. D - CTCF-bs cluster at the mouse Slc29a3/Unc5b locus and CTCF ChIP-Seq tracks from mouse liver tissue (Wild-type and Mutant variant inherited from mouse #16). Mutations coordinates are shown by red rectangles, and CTCF-bs orientations shown by blue triangles. E - Gene editing design employed for deleting the genomic region with two CTCF-bs. F - Sanger sequencing of the obtained deletion breakpoint regions. Obtained sequence aligned to the expected sequence of deletion. G - gene editing design for telomeric CTCF-binding sites knock-out
Further, we attempted to disrupt three telomeric CTCF-bs located within Unc5b introns. We refer to these telomeric binding sites as Left, Middle, and Right sites, according to their arrangement on the centromere to telomere axis (Fig. 1D). Close location to Unc5b gene elements prohibits the usage of simple editing designs such as removing all three CTCF-bs via single extended deletion. Thus, we constructed three CRISPR/Cas9 sgRNAs that provided cleavage within the predicted CTCF motifs. To minimize the risk of unintended deletions between the cleavage sites [61], we engineered single-stranded oligodeoxynucleotides (ssODNs) to guide the repair process. This design ensures the replacement of the CTCF core motif with a HindIII restriction site. This modification serves two purposes: it disrupts CTCF binding and simplifies the genotyping of the modified animals. We based our approach on the assumption that during the repair process, a few nucleotides at the ends of the double-strand breaks (DSBs) are resected before new synthesis occurs using the ssODN as a donor of homology (Fig. 1G).
In this experiment, we obtained 19 animals that were genotyped by PCR plus restriction fragment length polymorphism (PCR-RFLP) with enzymatic cleavage with the HindIII enzyme. This analysis showed that the expected mutation variant was obtained only in three mice (#1, 2, and 9) out of 19, and only in the one, middle CTCF site (Supplementary Table 2).
The observed mutation rate was significantly lower than anticipated, based on our previous studies [7]. Consequently, we hypothesized that the introduction of mutations through HDR was insufficiently effective. Therefore, we proceeded to investigate the presence of INDELs that might have resulted from other repair pathways. To achieve this, we analyzed PCR amplicons that encompass the CRISPR/Cas9 target sites using next-generation sequencing.
The analysis, performed for each target site and for each animal individually, did not reveal any new HDR outcomes in addition to those detected by PCR-RFLP; however, we identified multiple alleles harboring INDELs at target regions (Fig. 2A-C). Based on the presence of short homology motifs flanking the INDELs, we assumed that many of the observed deletions were generated via the microhomology end joining (MMEJ) repair pathway. We observed 24 MMEJ mutation occurrences (counting each variant from each animal), 23 NHEJ, and only 4 HDR occurrences among 19 animals and three mutation points. We also found that mutation outcomes agreed well with inDelphi [62] software predictions.
To predict how mutations affect CTCF binding, we assessed the CTCF motif score of all generated sequences. We found that the majority of unexpected NHEJ and MMEJ deletions had low CTCF binding scores, comparable with binding scores predicted for intended HDR-mediated mutations. We chose the mouse #16, carrying a compound of 5,5-kbp deletion of intergenic CTCF-bs with Left and Middle mutations that have low CTCF binding scores, as the most satisfying to derive a homozygous line.
Next, we confirmed that mutations of CTCF-bs affect CTCF binding. For this aim, we focused on liver tissue, where all five CTCF-bs in Unc5b/Slc29a3 locus are bound by CTCF according to the public ENCODE data. Based on the ChIP-Seq experiment results on the homozygous animals livers, we confirmed the absence of CTCF binding for the mutated sites (Fig. 1D).
Thus, we obtained homozygous mouse line carrying deletions [mm10] chr10:60,755,585 − 60,761,088 (of two intergenic CTCF-bs, 5504 bp), chr10:60,775,689 − 60,775,698 (deletion of 9 bp at Left site of telomeric cluster), and chr10:60,778,725 − 60,778,738 (deletion of 13 bp with insertion of -AA- dinucleotide at Middle site of telomeric cluster), disrupting CTCF binding in this region.
Fig. 2A-C - NGS genotyping results for Left (A), Middle (B), and Right (C) sites. Potential microhomology motifs at deletion borders are highlighted by magenta. Variants chosen for homozygous line derivation (mouse #16) are highlighted in bold
CTCF binding site deletions reconfigure local Spatial contactsTo investigate the introduced mutations outcomes on spatial interactions of locus, we prepared capture Hi-C (cHi-C) libraries for cerebellum, kidney, and liver, using mice with two or four CTCF-sites deleted and wild-type controls (Fig. 3A-D).
In the wild-type state, the patterns of spatial contact were almost identical in all analyzed tissues. The centromeric TAD contains multiple loops. The three upstream loop anchors include: a distal TAD boundary (located between the terminal end of the Cdh23 gene and Psap gene), and two anchors within the Cdh23 gene body. These three sites are forming the loops with downstream anchors, which are in the Cdh23 promoter and within the TAD boundary located between Slc29a3 and Unc5b genes. In the telomeric TAD, the loops are formed between the TAD boundary and Unc5b promoter and between two TAD boundaries. The distal boundary is placed near the cluster of long noncoding RNA genes between Unc5b and Sgpl1 genes. It is noteworthy that almost all described loops are formed by convergent CTCF binding sites. This locus has only a single exception in this case: the CTCF-binding sites at the distal TAD boundary of the Slc29a3 TAD align in the same direction as those at the anchors it interacts with.
The spatial maps of chromatin contacts in animals with two and four CTCF-bs deletions displayed similar subtle sub-TAD perturbations across all tissues examined. The similarity of contact patterns indicate that truncating the boundary by the deletion of centromeric CTCF-bs is sufficient to drive the detectable changes of chromatin contacts, whereas excising the additional pair does not impose any further, large-scale changes on the three-dimensional organization of the locus (Fig. 3C-E). In both genotypes we confirmed that loops formed by deleted CTCF-bs and directed into the Slc29a3 TAD side were disturbed in all tissues (Fig. 3C, D, black-filled arrowheads). In wild-type animals, these loops connect the TAD boundary region with CTCF-bs in the Cdh23 gene body. At the same time, loops directed towards the opposite domain were preserved.
The chromosomal segment extending from the boundary to the Cdh23 promoter, encompassed whole Slc29a3 gene, acquired new contacts with the entire Unc5b TAD. This alteration gradually becomes more noticeable from the liver to cerebellum tissues, coinciding with overall locus activity. It is apparent that the TAD boundary has shifted to the Cdh23 promoter site, transferring Slc29a3 gene from Cdh23 and Vsir to the spatial proximity with the Unc5b gene.
In addition to disruption of some loops, we noted the formation of new, ectopic long-range interactions. These new loops occur between the inner CTCF-binding sites of the Cdh23 gene and the distant border of the Unc5b TAD, an area containing the cluster of non-coding RNA genes, commonly silenced. Notably, these interactions cross the insulator TAD boundary. Furthermore, the pattern of these new loops differs significantly between tissues (Fig. 3C, D, white-filled arrowhead).
Overall, our cHi-C analysis shows that alterations of CTCF binding sites in the Slc29a3/Unc5b locus causes changes in local chromatin architecture, disruption of loops, and formation of novel long-range interactions between gene promoters and regulatory sequences.
Fig. 3Alterations in Spatial Chromatin Architecture at the Slc29a3/Unc5b locus. A - Gene locations and their activity status are indicated by a color scale from green (active expression) to red (repressed state). B - Location and orientation of CTCF binding sites. C, D - cHi-C Interaction Maps, displaying spatial chromatin interactions within the Slc29a3/Unc5b locus (chr10:60,103,000–61,356,000, mm10) in the liver, cerebellum, and kidney tissues for obtained model mice and wild-type controls. Maps above the main diagonal represent mice with deletions of two (C) or four (D) CTCF-bs, while maps below the diagonal show data from wild-type mice. Black-filled arrowheads indicate chromatin loops that were disturbed in mutation genotype, white-filled arrowhead indicate differing between wild-type and CTCF-bs deletion conditions. E– Insulation score for wild-type (black), two (blue) or four (red) CTCF-bs deleted. F– Chip-seq profiles of the Unc5b/Slc29a3 locus according to ENCODE data. Each column shows data for one of three organs: Liver, Cerebellum, and Kidney. Red vertical line represent 5-kb deletion of centromeric CTCF-bs. G - Bar plots representing expression of genes within Unc5b/Slc29a3 TADs across tissues. Data is shown in TPM units, whiskers represent standard error. H, I - statistically significant changes in gene expression relative to the wild-type levels for mice with deletions of two (H) or four (I) CTCF binding sites
Reorganization of local Spatial contacts leads to changes in the transcriptional activity of nearby genesReorganization of spatial contacts caused by deletions of CTCF binding sites may result in alteration of gene expression in this locus. To test this hypothesis, we performed expression analysis in the obtained mice (Fig. 4B, C). Previous studies have suggested that the magnitude of gene expression changes caused by TAD boundary perturbation might be relatively small. To obtain precision necessary to detect small expression changes, we developed advanced methodology based on allelic imbalance measurement. Specifically, we assessed gene expression alterations by utilizing hybrid models from F1 offspring, produced by crossing genetically modified mice with CTCF-bs deletions maintained on C57BL/6 background with wild-type CAST mouse strains. These hybrids possess distinct alleles that differ in exonic SNPs, allowing allelic transcripts discrimination. Thus, the effects of CTCF binding sites deletion can be studied in the same tissue samples by comparing the expression of CAST and C57BL/6 alleles. This design largely protects us from many biases since alleles share trans-regulating factors, so the only factor responsible for the differences in allele expression is the cis-environment.
Transcripts were quantified by targeted RNA-seq incorporating unique molecular identifiers (UMIs). Reverse transcription was primed using gene-specific oligonucleotides bearing UMI sequences, ensuring that each cDNA molecule acquired a unique tag. After PCR amplification and sequencing, a SNP-aware analysis leveraged these UMIs to digitally count individual allelic transcript molecules (Fig. 4A). Using this strategy, we profiled three groups of male F1 hybrids, obtained by crossing CAST females with (1) males from wild-type C57BL/6 mice, and (2) C57BL/6 mice with two centromeric or (3) four (two centromeric and two telomeric) CTCF-binding sites deleted at the Slc29a3/Unc5b locus.
Using this approach, we quantified the expression of six genes within the targeted locus (Unc5b, Slc29a3, Psap, Vsir, Cdh23, and Sgpl1) in five organs (bladder, cerebellum, kidney, liver, and olfactory bulb) (Fig. 4B, C). We identified 20 cases of significant gene expression changes caused by CTCF-bs deletions among 60 analyzed cases (FDR = 0.1). As we expected from previous studies, the alterations in gene transcription observed were modest, not surpassing a 50% change. Intriguingly, the nature of these changes showed a tissue-specific pattern with variation not just in magnitude but also in the direction of changes. For example, the Slc29a3 gene decreases its expression in the kidney, while increases expression levels in the cerebellum. To confirm the observed changes in gene expression levels, we performed digital PCR on the samples with four CTCF-bs deleted (Fig. 4C).
Fig. 4Transcription changes at the Slc29a3/Unc5b locus. A - graphical scheme illustrating the allele imbalance quantification in hybrids using UMI-assisted targeted RNA-seq. B, C - Heatmap representations of locus gene expression changes observed via UMI-assisted targeted RNA-seq for two (B) or four (C) CTCF-bs deletion versus wild-type alleles. D - Heatmap representations of locus gene expression changes measured via digital PCR for four CTCF-bs deletion mice versus wild-type mice. E, F - Enformer in silico predictions of the transcriptional effect of deletions of two (D) or four (E) CTCF-bs versus wild-type. G, H– AlphaGenome in silico predictions of the transcriptional effect of deletions of two (F) or four (G) CTCF-bs versus wild-type
For all heatmaps expression is shown as a percentage of changes compared to wild-type, NS indicates non-significant results (FDR > 0.1).
Genomic foundation models fails to predict expression changes caused by CTCF mutationsNext, we aimed to compare our experimental data with predictions made by the novel deep learning model Enformer [63] and AlphaGenome [64], which has been developed to predict epigenetic characteristics, including gene expression level, from DNA sequences. Our objective was to evaluate the applicability of such models for studying the regulatory function of CTCF-binding sites. The authors of the Enformer paper demonstrated that enhancer and insulator sequences contribute most significantly to the transcription level prediction accuracy. We hypothesized that applying the Enformer and AlphaGenome models would enable us to predict gene expression changes resulting from CTCF-binding site (CTCF-bs) mutations across various mouse cell types.
We used both models to simulate the deletion of CTCF‑binding sites and to predict how gene expression would change across the locus. Since Enformer accepts only a 200 kb sequence window, we could predict expression levels for just three genes, which lie closest to the engineered deletions (Cdh23, Slc29a3, and Unc5b). AlphaGenome handles a 1 Mb window, so we generated predictions for all genes within that span.
We predicted changes in gene expression for all genes and organs under two types of mutations: 2 CTCF-bs deletions and 4 CTCF-bs deletions (Fig. 4E, F for Enformer, Fig. 4G, H for AlphaGenome). The differences in predicted gene expression changes between these two mutation types were moderate for both used models. While Enformer yields largely similar predictions across cell types, consistently predicting down‑regulation of every gene in all cell types, with only the size of the decrease varying, AlphaGenome predict changes that vary in both magnitude and direction, demonstrating its greater sensitivity to cell‑type–specific effects.
To evaluate the predictive accuracy of the models, we performed Pearson’s chi-squared test to assess how well the models predicted the direction of gene expression changes among different genes and cell types. For Enformer, the test yielded a p-value of 0.24, and 0.33 for AlphaGenome, indicating low concordance between the predicted and experimental data.
Notably, recent studies [65,66,67] have also reported that Enformer does not always correctly predict the direction and magnitude of gene transcription changes in individuals. These researchers assessed Enformer’s predictive accuracy for specific genotypes using the RosMAP and Geuvadis databases, which include whole genome sequencing and gene transcriptional activity data from a single cell type for individuals in the 1000 Genomes Project. The Enformer model often failed to capture minor variations in gene expression between healthy individuals.
While AlphaGenome is a clear improvement over previous models, it retains some of their limitations. It still cannot fully account for the cell type specificity or the effects of individual genetic variation, as the authors themselves acknowledge [64]. Moreover, the mutations we study affect long-range regulatory elements, making them inherently difficult to model and interpret compared with intragenic variants. Based on our data and published findings, we conclude that the current genomic foundation models are not suitable for predicting moderate changes in gene expression.
Transcriptional effects rarely align with known mechanismsWe next aimed to fit observed in NGS-experiment differences in gene expression to the known models of interactions between promoters and regulatory elements. Since we observed tissue-specific patterns of gene expression changes, we assume that promoter-enhancer interactions in this locus are influenced by the local epigenetic context and are presumably determined by the gene’s transcription activity in the particular tissue.
Case I. Expression of genes separated by TAD boundary becomes more similar after CTCF-bs deletion. The pattern of expression changes of closely located Slc29a3 and Unc5b genes suggests that their cis-regulatory landscapes may interfere with each other after TAD boundary disruption. In the kidney, we detected almost a 40% decrease in Slc29a3 expression (Figs. 3H and I and 4B and C). In the same tissue, we observe the upregulation of Unc5b. The increase of Unc5b expression ranges from 20% in animals with 2 CTCF-bs deleted to 40% in animals with 4 CTCF-bs detected. Notably, in this organ Slc29a3 is normally activated, while the Unc5b is repressed, as suggested by its wild-type levels of expression (Fig. 3G). Thus, we assume that when the insulator boundary between these genes is disturbed, they are influenced by the regulatory elements of each other. As a result, the expression of these genes is “averaging”.
Case II. Enhancer hijacking causes upregulation of low expressed genes. We detected approximately a 15% increase in Slc29a3 expression in the cerebellum, where the expression of the Unc5b gene is significantly higher compared to its average level across mouse tissues. This suggests that Slc29a3 expression in this tissue is influenced by active cis-factors, for example Unc5b active super enhancer signatures within its first intron according to the ENCODE database (Fig. 3F). The Slc29a3 gene, by entering into spatial interaction with the Unc5b gene, also interacts with its chromatin environment, which explains the increase in its activity. In contrast to the aforementioned expression “averaging” observed in the kidney, there is no detectable decrease in the expression of the Unc5b gene, which can be attributed to the specific mode of its enhancer action or the small scales of these changes.
In addition to the two cases described above, there are many other examples of gene expression changes that do not fit well into known mechanisms of gene regulation.
In the liver, both Unc5b and Slc29a3 genes are repressed; however, we observe weak Slc29a3 expression increase upon boundary disruption. Increased Slc29a3 expression can not be explained by the interaction of the Slc29a3 gene with regulatory elements from the Unc5b spatial environment, because there are no candidate elements based on available ChIP-seq data, and it is not clear why if these elements exist in the Unc5b environment they do not cause active expression of the Unc5b gene.
A possible explanation for the increased Slc29a3 transcription is that centromeric CTCF-bs may exert a CTCF-mediated repressive effect, given the limited evidence that CTCF itself can act as a repressor. However, in this case, the distant sites are located much further from the Slc29a3 promoter than in documented examples of CTCF-mediated repression, and the observed effects in other tissues do not fully support this mechanism. Therefore, we find insufficient evidence for a CTCF repression model in this context.
In the cerebellum, we observed a significant reduction of the Cdh23 expression. There are two known mechanisms of gene silencing caused by TAD boundary disruption: disconnection of the promoter from the regulatory element and spreading of heterochromatin. We did not find any obvious candidate element with an enhancer signature that changes contact frequency with Cdh23 promoter upon TAD disruption.
As shown by Hi-C experiments, the disruption of the TAD boundary led to the emergence of ectopic long-range interactions between the Cdh23 gene body and Unc5b TAD distal boundary. Noteworthy that these loops are observed only for tissues where Cdh23 is expressed (for cerebellum and kidney, but not for the liver). We speculate that these novel interactions may be associated with the propagation of polycomb-mediated chromatin states (H3K27me3), which are known to form analogous loops in neural tissue [68]. This mechanism could potentially account for the downregulation of Cdh23 in the cerebellum. However, we did not detect any repressive chromatin marks at the anchors of the newly identified Cdh23 loop (Fig. 3F).
Among the genes we examined, Psap showed the most intriguing behavior. Although it is the farthest from the mutated TAD boundary, Psap is down-regulated in the kidney and up-regulated in the liver of animals with two deleted CTCF-bs. By contrast, its expression is normal on the allele lacking four CTCF-bs. Interestingly, this pattern closely mirrors that of Slc29a3. A similar shift may also occur for Vsir and Sgpl1 in the kidney, liver, and olfactory bulb. These pronounced, non-linear differences between the two-site and four-site deletions are puzzling, yet our cHi-C maps provide no direct mechanistic explanation. Finally, the changes observed for Unc5b and Sgpl1 in the bladder, and for Cdh23 in the olfactory bulb, remain unexplained with the data currently available.
In general, changes in gene expression tend to be opposite to their baseline activity level in a given tissue. Thus, in cases where expression levels change, active genes typically decrease their expression, while repressed genes show an increase.
Comments (0)