Enhlink infers distal and context-specific enhancer–promoter linkages

Human heart snATAC-seq dataset processing

We downloaded the Human Heart snATAC-seq dataset from the portal (http://ns104190.ip-147-135-44.us/CARE_portal/ATAC_data_and_download.html) described in the publication [21]. From the portal, we downloaded the matrix file (all.npz), the cell index file (all.index), the OCR feature index file (all.merged.ygi), the genome reference (Homo_sapiens.GRCh38.99.TSS.2K.bed), the cluster file (all.cluster), and the all.group file containing the library ID of each cell. The matrix in the all.npz file is a scipy sparse matrix of 79,515 cells and 287,415 OCRs with boolean values indicating whether at least one read mapped to the cell falls within the boundaries of the corresponding OCR. From the genome reference and feature index files, we defined the KCNH2 promoter regions as the following OCRs: "chr7:150,976,584–150,977,120", "chr7:150,978,193–150,978,805", and "chr7:150,978,915–150,979,661". The KCNH2 enhancer was defined as the genomic region "chr7:150,955,147–150,956,502". We also defined the MYL2 promoter regions as "chr12:110,920,282–110,920,944" and "chr12:110,921,386–110,921,633", and three MYL2 enhancer regions as "chr12:110,931,149–110,931,877", "chr12:110,928,658–110,929,096", and "chr12:110,907,461–110,909,456". We then focused on the atrial (aCM) and ventricular (vCM) cardiomyocyte cell types, since KCNH2 is expressed only in these cell types, and defined a KCNH2 promoter boolean vector for aCM and for vCM by merging (with a logical OR) the three promoter region vectors from the feature matrix, using the cell index of aCM or vCM, respectively. We followed the same strategy for the MYL2 promoter, but restricted to the vCM cell type since MYL2 is expressed only in vCM. We also downloaded the bigwig tracks for the aCM and vCM cell types from the same portal. Finally, we computed the Enhlink co-accessibilities for these two promoters in aCM and vCM following the workflow described below.
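The promoter-vector construction described above can be sketched as follows; the matrix, OCR column indices, and cell indices below are stand-ins for the real ones derived from all.npz, all.merged.ygi, and all.index:

```python
# Sketch of merging promoter OCR columns with a logical OR, restricted to
# one cell type. All names and dimensions here are stand-ins, not the
# actual portal files.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
mat = csr_matrix(rng.random((1000, 50)) < 0.05)   # stand-in for all.npz (cells x OCRs)
promoter_idx = [3, 7, 11]                         # stand-in promoter OCR columns
acm_cells = np.arange(400)                        # stand-in aCM cell index

# Logical OR across the three promoter OCR columns, one boolean per aCM cell
sub = mat[acm_cells][:, promoter_idx].toarray().astype(bool)
promoter_vec = sub.any(axis=1)
print(promoter_vec.shape)                         # (400,)
```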

Co-accessibility signals from KCNH2 and MYL2

We used the KCNH2 promoter vectors of aCM and vCM as ground-truth labels (accessible or not for a given cell) and the KCNH2 enhancer vectors as estimated labels to compute the precision, recall, and f1-scores. We then drew a random subset of 500 enhancers among the 287,415 features of the feature matrix to compute baseline distributions for the precision, recall, and f1-score, and derived a p-value for each computed metric with regard to its baseline distribution. We followed the same strategy for the MYL2 promoter and its associated enhancers.
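The baseline comparison can be sketched as below, with random stand-in vectors in place of the real promoter and enhancer vectors; the "+1" smoothing of the empirical p-value is our own convention, not necessarily the paper's:

```python
# Sketch: observed f1-score of a promoter/enhancer pair vs. a baseline
# distribution built from 500 random enhancer vectors (all data are stand-ins).
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(1)
promoter = rng.random(5000) < 0.3                  # ground-truth labels
enhancer = promoter & (rng.random(5000) < 0.8)     # correlated stand-in enhancer

prec = precision_score(promoter, enhancer)
rec = recall_score(promoter, enhancer)
obs_f1 = f1_score(promoter, enhancer)

# Baseline: f1-scores of 500 random enhancer vectors
baseline = np.array([
    f1_score(promoter, rng.random(5000) < 0.3) for _ in range(500)
])
# Empirical p-value: fraction of random enhancers scoring at least as high
p_value = (np.sum(baseline >= obs_f1) + 1) / (len(baseline) + 1)
print(round(p_value, 4))
```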

Mouse islet and adipose scATAC-seq

We used two scATAC-seq datasets from the mouse islet and adipose tissues whose generation and processing we previously described (Poirion et al., 2024). The islet dataset comprised 100,720 cells, 295,089 OCR features, and 10 cell populations including alpha, beta, and delta cells, and was processed using 18 10X Genomics sequencing libraries. The adipose dataset comprised 60,229 cells, 311,645 OCR features, and 9 cell populations including adipocytes and macrophages, and was processed using 24 10X Genomics sequencing libraries. Cells from both datasets were labeled with their mouse strain/genotype ID (CAST, NZO, B6), diet (4% or 44% fat), and sex (M/F).

Mouse (RNA/ATAC) multi-omic single-cell collection

Mouse striatum was dissected from 12-week-old mice from eight inbred strains: A/J (The Jackson Laboratory Stock 000646), C57BL/6J (000664), 129S1/SvImJ (002448), NOD/ShiLtJ (001976), NZO/H1LtJ (002105), CAST/EiJ (000928), PWK/PhJ (003715), and WSB/EiJ (001145). Striatum was collected from two males and two females for each strain. Striatum samples were flash-frozen in liquid nitrogen and stored at −80 °C until processing. After collection, the striatum samples were processed in four batches of eight over 2 days. Each batch consisted of four male and four female samples, and each strain was represented in each batch.

Multi-omic library generation

For single-nuclei preparation, frozen striatum was placed into 500 µl Miltenyi Nuclei Extraction Buffer (Miltenyi 130-128-024) plus 0.8 U/µl RiboLock RNase Inhibitor (ThermoFisher EO0382) in a gentleMACS C tube (Miltenyi 130-093-237) on ice, and then nuclei were immediately extracted in batches of four through the “4C_nuclei_1” program on the MACS Dissociator. Nuclei were placed on ice and filtered through a 70-µm pluriStrainer Mini into a pre-chilled 2-ml tube, and the C tube was washed with an additional 500 µl cold Miltenyi Nuclei Extraction Buffer and filtered. Nuclei were spun down at 350 rcf at 4 °C for 5 min and resuspended in 80 µl cold PBS with 2% BSA and 1 U/µl RiboLock. Nuclei were counted (~4000 nuclei/µl) and 60,000 nuclei from each of the eight samples were mixed in a chilled 1.5-ml Eppendorf tube. The pooled nuclei were spun at 300 rcf at 4 °C for 3 min and resuspended in 30 µl 10× Genomics Nuclei Buffer with 1 U/µl RNase inhibitor. The pooled nuclei were counted once more to confirm counts (~10,000 nuclei/µl), single-cell suspension, and lack of debris.

Nuclei viability was assessed on a LUNA-FX7 automated cell counter (Logos Biosystems), and up to 40,000 nuclei (~5,000 from each sample) were loaded onto each of 4 lanes of a 10× Chromium microfluidic chip. Single-nuclei capture and library preparation were performed using the 10× Chromium platform according to the manufacturer’s protocol (#CG000388 Chromium Next GEM Single Cell Multiome ATAC + Gene Expression). Because the 10× chip was superloaded, the number of PCR cycles was adjusted from the 10× protocol to reduce duplicate reads arising from multiple PCR steps. After GEM generation, barcoded cDNA and transposed DNA fragments were pre-amplified using 6 cycles. The pre-amplified sample was divided and used for two separate steps: an additional 6 cycles were used for adding a sample index for ATAC library construction, and 12 PCR cycles for gene expression library construction. cDNA and ATAC libraries were checked for quality on an Agilent 4200 TapeStation and a ThermoFisher Qubit Fluorometer, and quantified by KAPA qPCR before sequencing. Each gene expression library was sequenced on a NovaSeq 6000 S4 v1.5 200-cycle flow cell lane, with a dual-index scRNA-seq asymmetric read configuration of 28-10-10-90, targeting 20,000 nuclei with an average sequencing depth of 50,000 read pairs per nucleus. Each ATAC library was sequenced on an Illumina NovaSeq 6000 S2 v1.5 100-cycle flow cell lane, with a 50-8-24-49 read configuration, also targeting 20,000 nuclei with an average sequencing depth of 50,000 reads per cell.

Multi-omic library sequencing

Illumina base call files for all libraries were demultiplexed and converted to FASTQ files using bcl2fastq 2.20.0.422 (Illumina). A filtered joint digital gene expression and chromatin accessibility matrix was generated against the 10× Genomics mm10-2020-A reference build (version 2020-A, Assembly: GRCm38, ENSEMBL release 98; Annotations: Gencode vM23) using a modified 10× Genomics CellRanger-ARC count pipeline (v2.0.0), which had the 20,000-cell limit of cell calling removed.

scRNA-seq analysis for the striatum dataset

For each batch of the libraries, the output from Cell Ranger ARC consisted of both ATAC and gene expression BAM files. These two BAM files were used as inputs for Demuxlet [44] to determine the mouse strain identities of individual cells and to detect doublets. Only cells that were identified as singlets with consistent mouse strain identities in both assays were retained for further analysis.

To annotate and visualize the retained single cells, we performed downstream gene expression analysis on them using Seurat (version 4.0.3, [45]). For each batch, we removed low-quality cells and multiplets by filtering out cells with fewer than 200 or more than 7500 detected genes and those with greater than 15% mitochondrial counts. We then used DoubletFinder [46] to further exclude any remaining doublets. Next, we merged the Seurat objects from the four batches and performed normalization, highly variable feature identification, scaling, and linear dimensionality reduction using the NormalizeData(), FindVariableFeatures(), ScaleData(), and RunPCA() commands, respectively, with default parameters. To remove batch effects, we integrated the single-cell datasets from the four batches using Harmony [47]. Using the first 40 principal components, determined manually from the elbow plot, we conducted unsupervised cell clustering through the Louvain algorithm on the K-nearest neighbor (KNN) graph with resolution set to 0.2, which resulted in 22 cell clusters. We visualized the output in a 2D Uniform Manifold Approximation and Projection (UMAP) embedding using the same PCs used for cell clustering. We excluded four cell clusters that (1) co-expressed neuron markers Drd1/Drd2 and oligodendrocyte marker Aspa; (2) co-expressed neuron markers Drd1/Drd2 and astrocyte marker Gja1; (3) co-expressed neuron markers Drd1/Drd2 and macrophage marker C1qa; or (4) highly expressed mitochondrial genes. We then re-processed and integrated the remaining cells using the same steps as before, yielding 18 cell clusters. We annotated these 18 cell clusters using marker genes identified with the FindAllMarkers() command and the DropViz database [48].

We processed the snATAC-seq data separately from the scRNA-seq data and followed the same preprocessing procedure as the one described for the mouse islet and adipose snATAC-seq dataset (Poirion et al., 2023). Briefly, we aligned the reads to the mm10 genome with Cell Ranger V6 with default parameters. We inferred the peaks using MACS2 [49].

Enhlink analytical workflow

The analytical procedure carried out by Enhlink involves multiple steps, illustrated in Figure S2 and explained in greater detail in Additional File S1. It can be summarized as follows: (a) create a feature matrix (i.e., an OCR × cell matrix) and a response vector (i.e., single-cell promoter accessibility or target gene expression) for each target genomic region, (b) model the response vector as a function of the feature matrix and identify the significant features associated with the target region, (c) optionally perform a secondary analysis to detect biological covariates associated with the linkage, and (d) in the case of multi-omics data, conduct a linkage analysis for each omic and infer consensus links found in all omics.

At a minimum, Enhlink requires a boolean sparse matrix indicating the OCRs of each cell and a list of genomic regions, typically the promoters, defining the linkage targets. In addition, Enhlink optionally takes a sparse matrix containing the values of the target regions (such as a single-cell expression matrix in the case of multi-omics data), a file containing the covariates of each cell, and the cluster IDs of each cell. Enhlink constructs a feature matrix \(M_n\) for each target region by iterating through the OCRs surrounding it (±250 kb by default). If the feature matrix contains fewer than 100 features (the default threshold), Enhlink supplements it with random features derived from the existing features. After constructing the feature matrix for each target region, Enhlink incorporates one-hot-encoded covariates and generates a response vector \(V_g\), representing either the boolean accessibility or the expression of the gene/target region g. If \(V_g\) is continuous (e.g., representing gene expression), it is binarized using the mean of the non-null values as the threshold; this is the case when processing multi-omics data, for which gene expression from the RNA-seq is used to create the target regions. Enhlink then models \(V_g = f(M_n)\) for each cluster (or for all cells if no clusters are provided) and identifies features from \(M_n\) that significantly predict \(V_g\). Enhlink employs a strategy similar to that of a random forest classifier (Breiman, 2001), performing 100 iterations by default. In each iteration, a random sample of cells and features is selected, and a decision tree [50] is used to recursively reduce the number of samples and features based on the top feature selected at each level of the tree. The top features are selected using a modified score derived from the information gain (IG) (see Additional File S1), which favors informative, positively correlated, and accessible features. For each feature, Enhlink calculates a p-value testing whether its score differs from 0 across iterations, using Student’s t-test from the Gonum distuv package (https://pkg.go.dev/gonum.org/v1/gonum/stat/distuv), and corrects the feature p-values using the Benjamini–Hochberg false discovery rate (FDR) procedure.
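A heavily simplified sketch of the iteration scheme described above, with one planted linked feature; sklearn's Gini-based feature importance stands in for Enhlink's modified IG score, which is an assumption of this sketch, not the actual implementation:

```python
# Sketch: 100 iterations of (random cell/feature subsample -> shallow decision
# tree -> per-feature score), then a one-sample t-test per feature and BH FDR.
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier
from statsmodels.stats.multitest import fdrcorrection

rng = np.random.default_rng(2)
n_cells, n_feats = 2000, 30
M = (rng.random((n_cells, n_feats)) < 0.2).astype(int)        # stand-in M_n
V = (M[:, 0] | (rng.random(n_cells) < 0.05)).astype(int)      # feature 0 is the planted link

scores = np.zeros((100, n_feats))
for it in range(100):
    cells = rng.choice(n_cells, n_cells // 2, replace=False)  # bootstrap of cells
    feats = rng.choice(n_feats, n_feats // 2, replace=False)  # random feature subset
    tree = DecisionTreeClassifier(max_depth=2, random_state=it)
    tree.fit(M[np.ix_(cells, feats)], V[cells])
    scores[it, feats] = tree.feature_importances_             # stand-in for the IG-based score

# One-sample t-test per feature: is the mean score different from 0?
pvals = np.array([stats.ttest_1samp(scores[:, j], 0).pvalue for j in range(n_feats)])
rejected, qvals = fdrcorrection(np.nan_to_num(pvals, nan=1.0))
print(int(np.argmin(qvals)))
```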

Estimating the expected accuracy of inferred links through simulation

Enhlink provides an option to estimate the expected accuracy of the inferred links for a given target region by generating simulated enhancers and target regions. This simulation procedure aims to replicate the observed correlations between experimentally validated enhancers and promoters. More detailed information on this procedure can be found in Additional File S1. Briefly, Enhlink first simulates a promoter by shuffling a target region. Then, to simulate an associated enhancer, Enhlink duplicates the simulated promoter and introduces two types of random noise. The noise is controlled by two hyperparameters, λ_open and λ_close, which model the scenarios where the enhancer is not accessible in a given cell (λ_open) or the target region is not accessible (λ_close). The values of λ_open and λ_close are estimated from the heart snATAC data, using the experimentally validated enhancer of KCNH2 and the enhancers of MYL2, as described below.
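As a schematic illustration only (the exact mapping from λ_open/λ_close to per-cell flip probabilities is an assumption here, not Enhlink's documented noise model), the simulation can be pictured as:

```python
# Schematic: simulate a promoter by shuffling a real target vector, then a
# linked enhancer as a noisy duplicate. The flip-probability mapping below
# is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(3)
target = rng.random(10000) < 0.25            # stand-in observed target region

sim_promoter = rng.permutation(target)       # shuffled copy = simulated promoter
sim_enhancer = sim_promoter.copy()

lam_open, lam_close = 0.25, 1.25
p_open = min(lam_open, 1.0) * 0.2            # assumed flip rate: enhancer closed, promoter open
p_close = min(lam_close, 1.0) * 0.2          # assumed flip rate: enhancer open, promoter closed

on = sim_enhancer & (rng.random(10000) < p_open)
off = ~sim_enhancer & (rng.random(10000) < p_close)
sim_enhancer[on] = False
sim_enhancer[off] = True

agreement = float((sim_enhancer == sim_promoter).mean())
print(agreement)   # noisy but still correlated with the simulated promoter
```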

Inferring biological context-specific enhancer–promoter interactions

Enhlink can optionally infer biological context-specific linkages, with the details of the procedure found in Additional File S1 and summarized here. Each context is represented by a cell-level categorical covariate. Note that Enhlink one-hot encodes categorical variables into boolean features and cannot currently process continuous covariates. In this configuration, briefly, Enhlink first infers the set of enhancers and covariates linked to a target region g; then, for each of these enhancers e, Enhlink computes \(V_{eg} = V_e \circ V_g\), the Hadamard product between the target vector \(V_g\) and the enhancer vector \(V_e\). \(V_{eg}\) is a boolean vector indicating when both the enhancer e and the target region g are accessible within a cell. \(V_{eg}\) is then used as a new target vector to find the covariates significantly associated with it.
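In code, the Hadamard product of two boolean vectors reduces to an element-wise AND (stand-in vectors):

```python
# V_eg = V_e o V_g for boolean vectors: 1 only where both the enhancer
# and the target region are accessible in the same cell.
import numpy as np

V_g = np.array([1, 1, 0, 1, 0], dtype=bool)   # target accessibility
V_e = np.array([1, 0, 0, 1, 1], dtype=bool)   # enhancer accessibility

V_eg = V_e & V_g
print(V_eg.astype(int))                        # [1 0 0 1 0]
```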

Addressing class imbalance in datasets with unequal covariate distributions

Enhlink provides an option to mitigate class imbalance in datasets with varying covariate distributions, which is beneficial when one or more covariates are under- or over-represented. In such cases, the bootstrap samples may not adequately represent the covariate space. Enhlink addresses this issue by generating a more balanced distribution of covariates through an incremental process. Specifically, Enhlink iteratively selects subsets of cells according to their covariates to obtain a near-uniform distribution of the covariates. While this strategy can be helpful in achieving a more uniform distribution of covariates, it is important to note that it may produce biased results if one or more covariates are vastly underrepresented. Therefore, it is recommended to thoroughly assess the distribution of covariates and to consider alternative approaches such as covariate removal if necessary to ensure more representative results. More details are given in Additional File S1.
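One plausible way to sketch such near-uniform covariate sampling (the actual incremental scheme is described in Additional File S1; this stratified draw is an assumption, not the real implementation):

```python
# Sketch: draw the same number of cells from each covariate level, capped
# by the rarest level's size, to obtain a near-uniform covariate distribution.
import numpy as np

rng = np.random.default_rng(4)
covariate = rng.choice(["F", "M"], size=1000, p=[0.9, 0.1])   # imbalanced covariate

levels, counts = np.unique(covariate, return_counts=True)
n_per_level = counts.min()                                    # limited by the rarest level
balanced_idx = np.concatenate([
    rng.choice(np.where(covariate == lvl)[0], n_per_level, replace=False)
    for lvl in levels
])
_, new_counts = np.unique(covariate[balanced_idx], return_counts=True)
print(new_counts)   # equal counts per level
```

As the text notes, when one level is vastly underrepresented (here, 10%), the balanced subset is small, which is exactly the bias risk the paragraph warns about.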

Enhlink hyperparameters

The Enhlink inference workflow is governed by multiple hyperparameters, summarized in Table S1. The main hyperparameters governing regularization are max_features, which controls the maximum number of features a tree can use, and depth, which controls the maximum depth of each tree. depth and max_features are intertwined, since depth also indirectly limits the number of features; however, max_features is the more intuitive of the two. depth influences the speed of Enhlink and is set to 2 by default, but needs to be increased if max_features is set higher. secondOrderMaxFeat (2 by default) is the equivalent of max_features for the inference of biological context-specific enhancer–promoter interactions, which is triggered with the secondOrder hyperparameter. n_boot defines the number of trees to build; it controls the false positive rate and influences the speed of Enhlink, as does min_matsize, which controls the minimum number of features of \(M_n\). The hyperparameters downsample (the size of the bootstrap) and maxFeatType (the number of features to include) can be used to increase the speed of the procedure, but lower values might lead to lower accuracy. Finally, threshold defines the p-value cutoff and can be used to balance precision and recall.

Enhlink software suite

Enhlink is an analytical framework developed in Go (https://go.dev/) and compiled into three executables: enhlink, enhgrid, and enhtools. The command-line manual and arguments of each executable can be accessed using the -h flag (e.g., enhlink -h). enhlink is the main executable that launches the Enhlink pipeline, while enhgrid launches Enhlink over a range of input values for any hyperparameter accepting a numerical value. enhgrid is useful, for example, for automating a grid search over combinations of multiple hyperparameters or for testing different noise levels. enhtools intersects results from multiple runs and outputs either the common links or the links unique to a particular run. It also computes the accuracy between two runs (f1-score, precision, recall). Finally, enhtools can filter out links that are not within specific regions defined in an input BED file. This functionality can be used, for example, to filter out links that fall outside topologically associated domains (TADs). The Go source code, the manual, and the tutorial are available at https://gitlab.com/Grouumf/enhlinktools.

Simulation studies of the performance impact of hyperparameters

We conducted a simulation study to estimate the impact of the different hyperparameters using three scATAC-seq datasets: the mouse islet, the mouse adipose, and the snATAC-seq from the multi-omic striatum dataset mentioned earlier. We aimed to investigate the impact of various hyperparameters and experimental conditions, such as dataset origin and read depth, on Enhlink’s expected accuracy (see Fig. 2C). To accomplish this, we first subset the datasets by selecting specific cell types, such as beta cells for the islet, adipocytes for the adipose, and Drd1 neurons for the striatum. We then utilized Enhlink’s simulation framework, described in Additional file S1, to estimate the expected accuracy of Enhlink analysis at a given promoter. We used a consistent grid of hyperparameters, as detailed in Table S2, for each dataset and applied the enhgrid tool to perform Enhlink on the hyperparameter grid for all three datasets. Furthermore, we conducted the Enhlink analysis on a random subset of 40 genes from the features list of the three datasets for each combination of hyperparameters in the grid.

Estimating hyperparameter importance from the simulation studies

From the hyperparameter simulation study described above, we estimated the most influential hyperparameters by fitting a random forest regressor from the scikit-learn library [51], using the f1-score as the dependent variable and Enhlink’s hyperparameters as predictor variables (Figure S6). The simulation generated the true positive rate (TPR), false positive rate (FPR), and false negative rate (FNR) for each gene tested and for each hyperparameter combination. We then computed the f1-score as \(f_1 = \frac{\text{TPR}}{\text{TPR} + 0.5\,(\text{FPR} + \text{FNR})}\) and modeled it as a function of the hyperparameters, \(f_1 = f(\text{hyperparameters})\). We one-hot encoded all the hyperparameters and their values, removing the first value to avoid collinearity, and used them as binary variables. We also added the dataset ID as an additional variable. We then collected the feature importances of the model, computed as the Gini importance (an impurity-based measure), using the feature_importances_ attribute of the scikit-learn class. For each simulated promoter, we also computed the accessibility ratio as the ratio between the number of cells having the promoter accessible and the total number of cells. We then plotted the influence of the promoter accessibility ratio on the f1-score using the Python library seaborn.
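The importance analysis can be sketched as follows, with stand-in grid results in place of the real simulation outputs:

```python
# Sketch: one-hot encode hyperparameter settings (dropping the first level to
# avoid collinearity), fit a random-forest regressor on the f1-scores, and
# read the Gini importances. All values below are stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
grid = pd.DataFrame({
    "max_features": rng.choice([5, 10, 20], 300),
    "n_boot": rng.choice([50, 100], 300),
    "dataset": rng.choice(["islet", "adipose", "striatum"], 300),
})
tp = rng.integers(1, 40, 300)
fp = rng.integers(0, 20, 300)
fn = rng.integers(0, 20, 300)
f1 = tp / (tp + 0.5 * (fp + fn))                 # f1 definition used in the text

X = pd.get_dummies(grid.astype(str), drop_first=True)   # one-hot, first level dropped
model = RandomForestRegressor(random_state=0).fit(X, f1)
importances = dict(zip(X.columns, model.feature_importances_))
print(max(importances, key=importances.get))
```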

Generating reference datasets for methods comparison

We generated two simulated datasets, using either the snATAC-seq only from the mouse islet or the snATAC-seq + scRNA-seq from the striatum dataset, containing either simulated enhancer–promoter co-accessibilities (ATAC-seq only) or correlated enhancer–gene links (ATAC + RNA). From the mouse islet dataset, we used the delta cell subset and a random set of 400 genes to simulate 400 promoters and 1800 enhancers. For each random promoter, we generated a random number of simulated enhancers (between 2 and 7) with the noise parameters λ_close = 1.25 and λ_open = 0.25. We created dummy genomic coordinates for these enhancers so that they fall in the vicinity of their matching promoter. These simulated enhancers/promoters were injected into a matrix of 10,100 cells (the delta cells) and 295,089 peaks (the total number of peaks). From the striatum dataset, we used the Drd1 neuron cell types and a random set of 897 genes to simulate between 2 and 7 enhancers for each gene, for a total of 4090 enhancers. We used the same noise parameters (λ_close = 1.25, λ_open = 0.25) and binarized the randomized gene vectors using the mean of their non-null elements before generating their associated enhancers (see above). The simulated enhancers were further injected into a matrix of 10,000 cells and 259,720 peaks.

Methods comparison procedure

We processed the ATAC simulated matrix with the Enhlink, Chi2, Chi2 + FDR, Cicero, and ArchR workflows and processed the ATAC + RNA simulated matrix with the Enhlink, Signac, SnapATAC, and ArchR workflows. We used the same genomic window of ±250 kb around the promoter for all workflows. We then computed \(\text{precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}\), \(\text{recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}\), and \(f_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}\) for each of the simulated promoters/genes, using the set of simulated enhancers as the true positives. We computed for each workflow and dataset the overall accuracy scores with their standard deviation (sd) using the mean and the sd of the three metrics. Finally, we also plotted the correlation between the accuracy scores and the mean of each simulated gene or promoter using the lmplot function of the Python seaborn library.

Alternative workflows implementation

Chi2

A straightforward approach for inferring the co-accessibility of a promoter region g with the set E of its surrounding enhancers is to perform a Chi2 test between g and e for each \(e \in E\). Since \(V_g\) and \(V_e\), i.e., the accessibility vectors of g and e, were binary, we constructed a 2 × 2 contingency table for each (g, e) pair, containing the occurrences of the following conditions: \(V_g == 1 \cap V_e == 1\), \(V_g == 0 \cap V_e == 1\), \(V_g == 0 \cap V_e == 0\), and \(V_g == 1 \cap V_e == 0\). We used the chi2_contingency function from the Python library Scipy [52] to infer the p-value.
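For a single (g, e) pair, the contingency-table construction and test look like this (stand-in vectors):

```python
# Sketch: 2x2 contingency table over the four joint accessibility states,
# then scipy's chi2_contingency, as in the text.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(6)
V_g = rng.random(5000) < 0.3                 # promoter accessibility
V_e = V_g & (rng.random(5000) < 0.7)         # correlated stand-in enhancer

table = np.array([
    [np.sum(V_g & V_e),  np.sum(V_g & ~V_e)],
    [np.sum(~V_g & V_e), np.sum(~V_g & ~V_e)],
])
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value < 0.05)
```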

Chi2 + FDR

A more refined approach than the Chi2 method described above is to correct for multiple hypothesis testing by applying the Benjamini–Hochberg false discovery rate procedure [25] to the set of p-values obtained when applying the Chi2 test on E. The method consists of computing an expected p-value (fdr), under the assumption of a false positive, based on the rank of the feature; the corrected p-value equals the p-value multiplied by the fdr. We used the fdrcorrection method from the Python Statsmodels library [52] to perform the FDR corrections.
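The correction step, on a stand-in vector of p-values:

```python
# Benjamini-Hochberg correction of a set of Chi2 p-values with statsmodels'
# fdrcorrection, as in the text (the p-values below are stand-ins).
import numpy as np
from statsmodels.stats.multitest import fdrcorrection

pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.90])
rejected, qvals = fdrcorrection(pvals, alpha=0.05)
print(rejected)   # only the smallest p-values survive the correction
```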

Cicero

We downloaded the latest version of the Cicero algorithm [9] using the R library developed by the authors (https://cole-trapnell-lab.github.io/cicero-release/docs/). To parallelize the workflow and reduce the shared memory used, we processed each chromosome independently. For each chromosome, we created a UMAP embedding with the following steps: (i) we performed a TF-IDF transformation using the TfidfTransformer class from Scikit-Learn; (ii) we used singular value decomposition (SVD) via the TruncatedSVD class to embed the data into 25 components; (iii) we transformed the 25 components with the Harmony algorithm, correcting for the library ID; (iv) we used the UMAP class from the umap Python library with “correlation” as the metric, 2.0 as repulsion_strength, and 0.01 as min_dist. Our Cicero workflow consisted of the following steps: (a) load the sparse matrix and create a Cicero data object with make_atac_cds and binarize set to TRUE, (b) aggregate the raw count data with make_cicero_cds and k = 50, where the cell similarity used to bin cells is determined in the Harmony-transformed UMAP embedding, (c) estimate the distance parameter with estimate_distance_parameter using window = 250,000, maxit = 100, sample_num = 100, and distance_constraint = 25,000, and compute the mean of the distance parameters, (d) generate the Cicero models with generate_cicero_models using the mean distance parameter and window = 250,000, and (e) assemble the connections with assemble_connections and save all the Cicero connections.
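Steps (i) and (ii) of the embedding can be sketched in Python as below; the Harmony and UMAP steps (iii)-(iv) are omitted for brevity, and the matrix is a stand-in:

```python
# Sketch: TF-IDF transform of a binary cells x peaks matrix followed by
# truncated SVD into 25 components, as in embedding steps (i)-(ii).
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(7)
counts = csr_matrix((rng.random((500, 2000)) < 0.05).astype(float))  # stand-in binary matrix

tfidf = TfidfTransformer().fit_transform(counts)
embedding = TruncatedSVD(n_components=25, random_state=0).fit_transform(tfidf)
print(embedding.shape)   # (500, 25)
# Steps (iii)-(iv) would then apply Harmony and UMAP to `embedding`.
```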

Signac

The Signac methodology to identify enhancers significantly correlated with the expression of a given gene was described in the Peak-to-gene methods section of the published study [14]. The Signac methodology consists of four steps: (i) compute the Pearson correlation between the gene expression and the accessibility of each peak within 500 kb; (ii) for each peak, compute a background distribution using 200 random peaks matching the GC content, the accessibility, and the sequence length of the peak, identified with the MatchRegionStats R function (https://github.com/stuart-lab/signac/blob/HEAD/R/utilities.R); (iii) compute a z-test using the z-score obtained with the mean and variance of the background distribution; (iv) retain links with p-value < 0.05 and |Pearson score| > 0.05. In order to process the same simulated matrix as the other methods, we reimplemented these four steps in a Python method: (a) we first used a 250-kb window instead of 500 kb, to be consistent with our chosen simulation setup, and computed the Pearson correlation using the correlation function of the Python NumPy library. (b) In order to randomly select a set of enhancers \(E_r\) with accessibility similar to a given enhancer \(e_{query}\) with accessibility mean \(m_{query}\), we then computed the probability \(P(e \mid e_{query})\) of each \(e \in E\), with corresponding accessibility mean \(m_e\), to be in \(E_r\), using the following normed kernel function with sigma = 50:

$$P(e \mid e_{query}) = \frac{e^{-(\sigma \, (m_e - m_{query}))^{2}}}{\sum_{e' \in E} e^{-(\sigma \, (m_{e'} - m_{query}))^{2}}}$$

We randomly selected 200 enhancers using these probabilities and computed the Pearson score of each \(e \in E_r\). All the peaks used in the simulation had the same length. (c) We transformed the Pearson correlations (of the enhancers in the random set \(E_r\), or of the query enhancer, with the target gene expression) into a p-value using the cumulative distribution function norm.cdf from the Scipy.stats package, and excluded links with |Pearson| < 0.05 or p-value > 0.05.
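Step (b) can be sketched as follows, under our reading of the normed kernel as a Gaussian weighting of the accessibility-mean differences scaled by sigma (an assumption; all values are stand-ins):

```python
# Sketch: accessibility-matched background sampling of 200 peaks, weighted
# by a normalized Gaussian kernel on the accessibility means.
import numpy as np

rng = np.random.default_rng(8)
acc_means = rng.random(3000) * 0.5        # stand-in accessibility mean per peak
m_query = 0.2                             # accessibility mean of the query enhancer
sigma = 50

weights = np.exp(-(sigma * (acc_means - m_query)) ** 2)
probs = weights / weights.sum()           # the normed kernel P(e | e_query)
background = rng.choice(len(acc_means), size=200, replace=False, p=probs)
print(float(np.abs(acc_means[background] - m_query).mean()))  # close to 0
```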

SnapATAC

The SnapATAC methodology was described in the methods section of the SnapATAC paper [16]. It consists of considering the expression of a gene g as the predictor in a univariate logistic regression model of the binary accessibility status of each enhancer within a 1-Mb window flanking g. For each enhancer e from the set of flanking enhancers E, the method built a univariate logistic model e = Logit(g) using the glm function with family = binomial from the R software and used a p-value cutoff of 5e-8. We reimplemented the SnapATAC workflow in a Python function using the Logit class with its fit method from the statsmodels Python library [52]. In the original study, a label-transfer procedure was conducted in order to identify the matching cells between two separate snATAC-seq and scRNA-seq datasets. However, this procedure was not necessary in our case because we derived the simulated matrices from a multi-omic snRNA-/snATAC-seq dataset for which we had matching cell IDs between the two modalities.

ArchR

The ArchR methodology to infer enhancer–promoter links from scATAC-seq and enhancer–gene correlations from single-cell multi-omic data was described in the supplementary methods of the original study [15]. The method first created low-overlapping aggregates of cells, excluding any pair of cells within an aggregate that shared more than 80% of accessible peaks. This was done by first computing a K-nearest neighbor clustering (K = 100 with the Euclidean distance) for a subset of 500 random reference cells, using a 2D embedding of the cells as input. The method iteratively removed cells having > 80% similarity with their KNN neighbors and aggregated the cell vectors of the remaining cells based on their KNN neighbors, with the values further scaled and log-transformed. The Pearson correlation was then computed between the target enhancer and promoter accessibilities, and a p-value was inferred by (a) transforming the correlation score r into a t-statistic, \(t = \frac{|r|\,\sqrt{n-2}}{\sqrt{1-r^{2}}}\) (where n is the number of aggregates), and (b) using the cumulative distribution function of Student’s t distribution to convert the t-statistic into a p-value. Finally, an FDR correction was applied using the Benjamini–Hochberg procedure. The R source code of the addCoAccessibility method is available at https://rdrr.io/github/GreenleafLab/ArchR/src/R/IntegrativeAnalysis.R and the C++ source code of the iterative aggregation strategy at https://github.com/GreenleafLab/ArchR/blob/master/src/KNN_Utils.cpp. In order to process the same matrices under the same conditions as the other methods, we reimplemented the ArchR strategy in a Python script with the following modifications: (a) we used a Harmony + UMAP embedding instead of the iterative LSI strategy used in the original ArchR workflow; the embedding process was the same as described for Cicero. (b) We reimplemented the aggregation strategy from the R and C++ scripts, used the NearestNeighbors class from the Scikit-Learn Python library with the cosine distance to compute the K-neighbors, and iteratively computed the Jaccard similarity to exclude cells from the aggregates. We used the pairwise_distances function from Scikit-learn to compute the Pearson scores and the fdrcorrection function from the Statsmodels library to compute the FDR.
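The correlation-to-p-value conversion can be sketched as below (the two-sided convention and the stand-in values are our assumptions):

```python
# Sketch: convert a Pearson correlation into a t-statistic with n - 2 degrees
# of freedom, then into a p-value via the Student's t distribution.
import numpy as np
from scipy import stats

r = 0.35          # stand-in Pearson correlation between aggregates
n = 80            # stand-in number of cell aggregates

t_stat = abs(r) * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(t_stat, df=n - 2)   # two-sided p-value
print(round(float(p_value), 4))
```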

Robustlink

The Robustlink methodology was described in the original study [17]. It requires single-cell RNA-seq data along with either single-cell ATAC-seq or single-cell methylation data. If the data are from the same sample but are not multi-omic, the first step is to identify similar cells with scFusion. Second, Robustlink infers meta-cells by first performing PCA dimension reduction, followed by a KNN graph construction (K = 30), and then applies the Leiden algorithm to find cell communities. Each community constitutes a meta-cell, for which the pseudo-bulk of the cells within the community is used. The RNA-seq raw counts of each community are normalized with log(CPM + 1) and the ATAC-seq raw counts are normalized with log(TPM + 1). Robustlink then computes the Spearman correlation between each relevant enhancer and gene and assesses significance by constructing two null distributions, shuffling either the meta-cells or the regions. We installed Robustlink using the instructions from the Git package and applied it to our artificial ATAC + RNA dataset with the following protocol. After loading the single-cell RNA and ATAC cell × feature matrices, we projected the RNA matrix onto a new embedding with its first 25 components using the TruncatedSVD function from the Scikit-learn Python library. We did not need scFusion since our data were multi-omic. We then computed a KNN graph with K = 30 and the Euclidean metric using the NearestNeighbors function from Scikit-Learn. We then partitioned the graph using the Leiden algorithm with RBConfigurationVertexPartition as the partition type from the leidenalg Python package. We then created meta-cells from the ATAC and RNA matrices by aggregating the cells within each community into a pseudo-bulk and normalizing the raw read counts. Finally, we called compute_enh_gene_corrs and get_significance_stats from the robustlink Python package with dist_th = 1e6, bins = 501, and an FDR cutoff of 0.20. Furthermore, we tested an array of resolutions from 1.0 to 90.0. We also tested using all cells instead of inferring meta-cells; in that case, each cell is its own meta-cell. The Robustlink workflow used is available as a Python script (see Software availability).

Intersecting the inferred links with the PCHi-C data

The PCHi-C data are available on GEO (see “Availability of data and materials”) with restricted access under accession ID GSE214107. We downloaded the ibed files (similar to bedpe files, with the first six columns describing two genomic locations) for each strain, merged them, and retained only the unique links.

Estimating accuracy with the PCHi-C data

We used the physical enhancer–promoter interactions from the PCHi-C datasets of the islet and adipose tissues as references to compute the overall accuracy scores (precision, recall, F1-score) of the links inferred with the Enhlink, Cicero, and Chi2 + FDR workflows. For both the islet and adipose datasets, we analyzed each cell population independently with each workflow. We then combined all the links obtained for a given dataset and workflow and estimated the overall precision, recall, and F1-score using the set of PCHi-C links as true positives. We used the enhtools software with the -intersect3 option from the Enhlink software suite (see above) to find the intersecting links between the reference set and the results of the different workflows.
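Treating each link as a hashable (enhancer, promoter) pair, the overall scores can be computed as below. This is a simplified sketch in which links match exactly, whereas enhtools matches genomic intervals by overlap:

```python
def link_scores(inferred, reference):
    """Precision, recall, and F1 of inferred enhancer-promoter links against
    a reference set; links are hashable (enhancer, promoter) identifiers."""
    inferred, reference = set(inferred), set(reference)
    tp = len(inferred & reference)  # true positives: links found in both sets
    precision = tp / len(inferred) if inferred else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

# Illustrative link sets (coordinates are invented)
inferred = {("chr7:100-200", "chr7:5000-5200"), ("chr7:300-400", "chr7:5000-5200")}
pchic = {("chr7:100-200", "chr7:5000-5200"), ("chr7:900-950", "chr7:5000-5200")}
print(link_scores(inferred, pchic))  # (0.5, 0.5, 0.5)
```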

Enrichment analysis with EnhancerAtlas datasets

We obtained six reference datasets from EnhancerAtlas 2.0 [26], comprising reference enhancers for the human heart, human left ventricle, mouse striatum, and mouse islet, as well as reference enhancer–gene interactions for the human left ventricle and mouse striatum. Subsequently, we converted the genomic coordinates from mm9 to mm10 and from hg19 to hg38 using the LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) from the UCSC genome browser [53]. For each relevant cell type, we employed enhtools to tally the number of links inferred from that cell type by Enhlink, Cicero, or the Chi2 + FDR procedure in which the linked enhancer overlapped with at least one reference enhancer, or in which both the promoter and the linked enhancer overlapped with a reference enhancer–gene interaction. The enrichment score was then defined as the ratio between the number of intersecting links and the total number of links. We intersected the heart dataset with the nine CARE cell populations, the left ventricle datasets with the vCM CARE population, the islet dataset with the ten islet cell populations, and the striatum datasets with the combined Drd1 + Drd2 populations.
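The enrichment score described above amounts to the fraction of inferred links whose enhancer overlaps at least one reference enhancer. A minimal sketch, with illustrative coordinates rather than EnhancerAtlas data:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def enrichment_score(links, reference_enhancers):
    """Fraction of links whose enhancer overlaps at least one reference
    enhancer (intersecting links / total links)."""
    hits = sum(
        any(overlaps(enh, ref) for ref in reference_enhancers)
        for enh, _gene in links
    )
    return hits / len(links)

# Toy links (enhancer interval, linked gene) and a toy reference enhancer set
links = [(("chr1", 100, 200), "GeneA"), (("chr1", 1000, 1100), "GeneB")]
ref = [("chr1", 150, 400)]
print(enrichment_score(links, ref))  # 0.5
```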

Estimating batch effect with entropy measurements

We quantified the impact of technical batch effects by first reasoning that a true link should be distributed over sequencing batches (that are otherwise biologically similar). Following information theory, we computed the Shannon entropy of each link with regard to the sequencing library IDs of the dataset. The library IDs of the islet and adipose datasets were the IDs of the 10× Genomics sequencing runs and were regarded as variables associated with the batch effect. For each inferred link from an enhancer e to a gene g, we first computed the link vector \(v_l\) such that

$$v_l = v(e \cap g) = v_e \circ v_g$$

We then computed the Shannon entropy of \(v_l\) with regard to the set of library IDs B:

$$E(v_l \mid B) = -\sum_{b \in B} \frac{v_l \cdot v_b}{\vert v_l \vert} \log\left(\frac{v_l \cdot v_b}{\vert v_l \vert}\right)$$

Here, \(v_e\) and \(v_g\) are the binary accessibility vectors of the enhancer and of the gene promoter, \(\circ\) denotes the element-wise product, \(v_b\) is a binary vector indicating which cells belong to b, and \(\vert v_l \vert\) is the number of non-zero entries of \(v_l\). Finally, we plotted the batch effect entropy distribution using the seaborn Python library.
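The batch-effect entropy of a link can be computed directly from its supporting cells and their library IDs. A minimal numpy sketch on toy vectors, using the natural logarithm:

```python
import numpy as np

def link_entropy(v_l, batches):
    """Shannon entropy of a link's supporting cells over sequencing batches.
    v_l: binary vector (cells where both enhancer and promoter are accessible);
    batches: one library/batch label per cell."""
    labels = np.asarray(batches)[np.asarray(v_l, dtype=bool)]
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # fraction of supporting cells per batch
    return float(-(p * np.log(p)).sum())

v_l = np.array([1, 1, 0, 1, 1])
batches = ["lib1", "lib1", "lib2", "lib2", "lib2"]
# 2 supporting cells in lib1 and 2 in lib2 -> maximal entropy log(2)
print(link_entropy(v_l, batches))
```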

Inferring Enhlinks atlases for the islet and adipose tissues

For each tissue, we processed each cell type independently using the library ID, sex, diet, and genotype/strain as covariates. We used 0.01 as the p-value cutoff, with the secondOrder and uniformSampling options to infer the covariate-specific linkages from a uniform distribution of the covariates within each bootstrap sample. For each bootstrap sample, we used a random subset of 66% of the features, a maximum of 4 explanatory features, a depth of 2, and a downsampling size of 15,000 cells. We created a binary cell × promoter sparse matrix by considering all the promoter regions of each gene and giving the value 1 for a given cell if at least one read was found in any of them, and 0 otherwise. This matrix was then used as the promoter matrix (see Additional file S1). We used the ATACMatUtils command with the -use_symbol option from the ATACdemultiplex package (https://gitlab.com/Grouumf/ATACdemultiplex) to create the matrix from the BED file containing the reads and barcode IDs. We then intersected the obtained linkages of each cell type with the PCHi-C links of either the adipose or the islet tissue using the enhtools software with the -intersect3 option.
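The cell × promoter matrix construction (a logical OR over the promoter peaks of each gene) can be sketched with scipy as follows. The matrix and gene-to-peak mapping are toy examples; in practice ATACMatUtils performs this step:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary cell x peak matrix and a mapping from genes to the column
# indices of their promoter peaks (names and indices are illustrative)
cell_by_peak = csr_matrix(np.array([
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]))
gene_to_peaks = {"GeneA": [0, 1], "GeneB": [2, 3]}

# Binary cell x promoter matrix: 1 if at least one read falls within any
# promoter region of the gene (logical OR == max over its promoter peaks)
cols = [
    cell_by_peak[:, idx].max(axis=1).toarray().ravel()
    for idx in gene_to_peaks.values()
]
cell_by_promoter = np.column_stack(cols)
print(cell_by_promoter)
```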

Inferring Enhlinks for the striatum tissue

We restricted the Enhlink analysis to the two neuron cell types expressing either Drd1 or Drd2. We first performed two Enhlink analyses on these two neuron subtypes using (a) the scATAC-seq matrix alone and (b) the scATAC-seq matrix for the enhancer features together with the scRNA-seq matrix, to infer enhancers linked to gene expression. We used the default parameters with the library ID as a covariate when using the scATAC-seq matrix only, but extended the max_features and depth parameters to 6 and 4, respectively. We then intersected the ATAC and ATAC + RNA linkages of the Drd1 and Drd2 cell types using the enhtools executable with the -intersect option. In a second analysis, aimed at identifying Drd1- or Drd2-specific linkages, we processed with Enhlink all the cells from either the Drd1 or the Drd2 neurons and used the library ID together with the cell type (Drd1 or Drd2) as covariates. We performed two processing steps using either the ATAC-seq data alone or the ATAC-seq data combined with the RNA-seq data, which we further intersected with enhtools. We also processed the same ATAC-seq dataset with Cicero using the protocol described above.

Preparing eQTL collections

Bulk RNA-sequencing data for eQTL analysis were accessed and downloaded from the Churchill Lab QTL viewer (https://qtlviewer.jax.org/viewer/CheslerStriatum, accessed 02/17/2023). This dataset represents striatum samples collected from individual mice (n = 368) from the Diversity Outbred genetic reference population (The Jackson Laboratory catalog #009376). The downloaded package includes a gene expression estimate matrix, genotype probabilities, a kinship matrix, and metadata including sex. We performed eQTL mapping restricted to the set of genes differentially expressed between Drd1 and Drd2 neurons (n = 96 and 103, respectively). eQTL mapping was performed using a linear mixed model to account for kinship on normalized, transformed gene-level expression values with the “scan1” function in r/qtl2 [54], including sex as an additive covariate. Genes passing a filter requiring a local eQTL with a LOD score greater than 8 (n = 162) were further included for SNP association using the “scan1snps” function in r/qtl2. All variants within a 1.5 LOD confidence interval were included for association analysis. The resulting variants with a LOD score drop of at most 1.5 from the maximum were retained, along with their strain distribution patterns, to intersect with significant Enhlink links and compare with the haplotype effect pattern of the linked eQTL gene. The haplotype effects of the QTL for the candidate genes Kcnb2, Gulp1, and Col25a1 were estimated using “scan1blup” from r/qtl2.

Intersecting Enhlinks from the striatum with eQTL

We formatted the results of the eQTL analysis described above as a BED file listing the genomic coordinates of the SNPs and intersected these regions with the enhancers of the Drd1 and Drd2 neuron subtypes that were correlated with both the promoter accessibility and the expression of their corresponding genes (see above). We then computed a p-value using the Mann–Whitney procedure to test whether the overlapping enhancer presented differential accessibility between the genotypes. We used the p-values and the LOD scores to identify the enhancers of Kcnb2, Gulp1, and Col25a1. For each enhancer and its associated promoter and gene, we plotted barplots of accessibility (for the enhancer and promoter) or gene expression, using the four 10× Genomics libraries as individual measurements.
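The Mann–Whitney test on enhancer accessibility between two genotype groups can be sketched with scipy; the binary accessibility values below are simulated, not the striatum data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Simulated binary accessibility of an overlapping enhancer in the cells of
# two genotype groups (accessibility rates 0.6 vs 0.3 are illustrative)
genotype_a = rng.binomial(1, 0.6, size=200)
genotype_b = rng.binomial(1, 0.3, size=200)

# Two-sided Mann-Whitney U test for differential accessibility
stat, pval = mannwhitneyu(genotype_a, genotype_b)
print(pval < 0.05)
```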

Evaluating Enhlink and Cicero with the striatum dataset

We intersected the linkages obtained from Cicero and Enhlink for the Drd1/Drd2 neuron cell types from the ATAC-seq data with the 96 and 103 marker genes of the Drd1 and Drd2 neurons, respectively. For each gene and its inferred enhancer, we computed a univariate logistic regression, Ve = f(Vg), with Ve the boolean vector of the enhancer e, of length equal to the number of cells, indicating whether e is accessible in each cell, and Vg the numerical vector indicating the scaled gene expression of g in each cell. We used the Logit class with its fit function from the Python statsmodels library to infer the p-value of each model. We then compared the -log10(p-value) distributions of the enhancers from Enhlink and from Cicero for different cutoffs: 0.0, 0.1, 0.2, and 0.28. We chose 0.28 as a cutoff to obtain a number of links (704) similar to the number obtained with Enhlink (802). We used the Mann–Whitney test to assess the difference between the p-value distributions of the Enhlink and Cicero links. Since no difference was observed between Enhlink and Cicero with 0.28 as the cutoff, we applied a t-test in this case, assuming normality of the distributions.
