This study utilized the GSE185051 data set, which contains 52 NAFLD samples and 5 control samples. The data were first standardized to ensure consistency among samples. Through differential expression analysis, significantly DEGs between NAFLD and control samples were identified, and GSEA was performed to identify gene sets and pathways associated with NAFLD. After identifying the DEGs, the focus shifted to genes related to energy metabolism, screening these genes and conducting detailed expression difference analysis. Subsequently, GO analysis and KEGG pathway enrichment analysis were performed on the EMRDEGs to reveal their roles in biological processes and signaling pathways. PPI analysis was conducted to identify the interaction relationships between these genes, to further understand their role in NAFLD. To gain deeper insights into the regulatory network of genes and potential drug targets, correlation analyses were also performed between mRNA and miRNA, mRNA and RBP, mRNA and TF, as well as mRNA and drugs. Finally, CIBERSORT was used to analyze immune cell infiltration, providing insights into the changes in the immune microenvironment in NAFLD. The workflow of bioinformatics analysis of the present study was shown in Fig. 1.
Fig. 1The workflow of bioinformatics analysis
Standardization and Differential Analysis of Pediatric NAFLDWhile all the samples in the GSE185051 dataset originate from the same cohort (pediatric NAFLD and healthy controls), technical variations during high-throughput sequencing data processing can introduce batch effects, we standardized the pediatric NAFLD data set GSE185051 using the R package “limma” (Fig. 2A–D). The dataset consisted of 57 samples in the GSE185051 data set: including 5 healthy liver samples in the Control group and 52 samples from pediatric NAFLD patients in the NAFLD group (Fig. 2A–B). As illustrated in Fig. 2A–B, the expression profile data of the pediatric NAFLD data set GSE185051 were effectively normalized, resulting in consistent expression patterns of the sample data. Next, we assessed the effectiveness of batch effect removal using principal component analysis (PCA). As evident from comparing the two figures (Fig. 2C–D), the corrected plot showed a tighter clustering of samples across different experimental groups, indicating that batch effects have been mitigated to a certain extent. This ensured the accuracy of subsequent differential expression analyses.
Fig. 2Standardization processing of pediatric NAFLD dataset. A–B Boxplot diagrams of GSE185051 data set before (A) and after (B) normalization. The Y-axis represented the gene expression values (after log2 transformation). C–D PCA plots of GSE185051 before (C) and after (D) batch effect removal. E Volcano plot of DEGS between the NAFLD group and the Control group of the GSE185051 data set. F Venn diagram of DEGs and EMRGs in the dataset. G Complex numerical heatmap of EMRDEGs in the GSE185051 data set
To examine the disparities in gene expression between the NAFLD group and the Control group, we used the “limma” software to conduct differential analysis on the GSE185051 data set, resulting in the identification of DEGs.
Overall, 22,808 genes were obtained from the data set GSE185051, among which 3175 genes matched the threshold of |logFC|> 1.5 and P. adj. < 0.05. Under this threshold, 1439 genes exhibited higher expression in the NAFLD group (with positive logFC indicating lower expression in the Control group), while 1736 genes showed lower expression in the NAFLD group (with negative logFC, indicating higher expression in the Control group). A volcano plot was displayed to visualize the outcomes of the differential analysis of the data sets. (Fig. 2E).
We identified 16 EMRDEGs in pediatric NAFLD, and a Venn diagram was constructed to illustrate their overlap (Fig. 2F). These 16 EMRDEGs included ACSL4, C1QBP, CACNB2, CD36, FOXO1, GNG12, HIF1A, MLXIPL, PKLR, PPARGC1A, PPP2R1B, PRKAA2, PRKAR2A, RAPGEF3, SLC2A2, and UCP2.
According to the findings from the Venn diagram, we examined the differential expression of these 16 EMRDEGs in the NAFLD group compared to the Control group within the GSE185051 data set (Fig. 2G). Subsequently, we created a heatmap using the R package “pheatmap” to visualize the expression patterns of these 16 EMRDEGs.
Expression Difference Analysis of EMRDEGsTo further investigate the expression differences of the 16 EMRDEGs, we conducted an analysis of their expression levels in the GSE185051 dataset and their association with the relationship between the concentrated expression level of EMRDEGs in the GSE185051 data set and the NAFLD group and Control group (Fig. 3A).
Fig. 3Expression difference analysis of EMRDEGs. A Group comparison chart of the expression difference analysis of EMRDEGs in the GSE185051 data set. The “value” on the Y-axis represented the normalized expression value of the gene. B–O ROC curve analysis results for ACSL4 (B), C1QBP (C), CACNB2 (D), CD36 (E), FOXO1 (F), GNG12 (G), HIF1A (H), PKLR (I), PPARGC1A (J), PRKAA2 (K), PRKAR2A (L), RAPGEF3 (M), SLC2A2 (N), UCP2 (O) in the GSE185051 data set. The symbol ns was equal to P ≥ 0.05, not statistically significant; the symbol * was equal to P < 0.05, statistically significant; the symbol ** was equal to P < 0.01, highly statistically significant; the symbol *** was equal to P < 0.001, very statistically significant
Firstly, we assessed the expression differences of the 16 EMRDEGs in the GSE185051 data set between the NAFLD group and the Control group using the Wilcoxon signed rank test. The results showed that 14 EMRDEGs (ACSL4, C1QBP, CACNB2, CD36, FOXO1, GNG12, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, RAPGEF3, SLC2A2, and UCP2) in the GSE185051 data set exhibited statistically significant differences in expression between the NAFLD group and the Control group (Fig. 3A) (Symbol * was equivalent to P < 0.05, symbol ** was equivalent to P < 0.01, symbol *** was equivalent to P < 0.001. The significant differential expression of these genes in NAFLD may be associated with the occurrence and progression of the disease, deserving further investigation and research).
Overall, 14 EMRDEGs exhibited statistically significant expression differences in the GSE185051 data set and the results were displayed (Fig. 3B–O). The ROC curve analysis revealed the diagnostic accuracy of these EMRDEGs for pediatric NAFLD. Notably, the expression levels of CACNB2 (AUC = 0.965, Fig. 3D), CD36 (AUC = 1.000, Fig. 3E), GNG12 (AUC = 0.965, Fig. 3G), HIF1A (AUC = 0.946, Fig. 3H), PKLR (AUC = 0.992, Fig. 3I), PRKAA2 (AUC = 0.988, Fig. 3K), PRKAR2A (AUC = 1.000, Fig. 3L), and SLC2A2 (AUC = 0.973, Fig. 3N) demonstrated high diagnostic accuracy for pediatric NAFLD. Additionally, ACSL4 (AUC = 0.823, Fig. 3B), C1QBP (AUC = 0.881, Fig. 3C), FOXO1 (AUC = 0.881, Fig. 3F), PPARGC1A (AUC = 0.862, Fig. 3J), RAPGEF3 (AUC = 0.808, Fig. 3M), UCP2 (AUC = 0.792, Fig. 3O) demonstrated a reasonable level of diagnostic accuracy for pediatric NAFLD.
EMRDEGs GO and KEGG AnalysisTo elucidate the relationship between BP, MF, CC, and other biological pathways associated with 14 EMRDEGs (ACSL4, C1QBP, CACNB2, CD36, FOXO1, GNG12, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, RAPGEF3, SLC2A2, and UCP2) and pediatric NAFLD, we conducted a GO analysis on the 14 EMRDEGs (Table 1). Enrichment items with a P. adj. < 0.1 and an FDR value (q.value) < 0.05 were considered statistically significant. The results indicated that these 14 EMRDEGs were primarily enriched in BP related to energy homeostasis, cellular glucose homeostasis, positive regulation of carbohydrate metabolic process, responses to oxidative stress, responses to muscle activity, and other BP in pediatric NAFLD. Furthermore, they were associated with CC, such as the brush border, plasma membrane raft, protein kinase complex, cluster of actin-based cell projections, and cAMP-dependent protein kinase complex, among others. Additionally, these genes were found to be involved in MF, including enrichment in ubiquitin protein ligase binding, ubiquitin-like protein ligase binding, cAMP binding, cyclic nucleotide binding, and chromatin DNA binding, among others. The results of the GO analysis were visually presented in a bubble plot (Fig. 4A) and network maps (Fig. 4B–D), illustrating the results of the BP, CC, and MF pathways.
Table 1 GO enrichment analysis results of hub genesFig. 4GO analysis and KEGG analysis of EMRDEGs. A Bubble diagram display of GO functional enrichment analysis results of EMRDEGs. B–D Circular network diagrams of BP pathway (B), CC pathway (C), and MF pathway (D) in the GO functional enrichment analysis results of EMRDEGs. E–F The results of KEGG pathway enrichment analysis of EMRDEGs were shown in bar graphs (E) and network graphs (F). The ordinate in the bubble diagram (A) was GO terms, and the bubble color indicated the P. adj. value of the pathway. Blue dots in network diagrams (B, C, D, F) represented specific genes, and red circles represented specific pathways. The screening criteria for GO and KEGG enrichment items were P.adj. < 0.1 and FDR value (q.value) < 0.05
Next, we conducted a KEGG analysis (Table 2) on these 14 EMRDEGs. The analysis revealed significant enrichment of these genes in pathways such as Insulin resistance, Insulin signaling, Adipocytokine signaling, Glucagon signaling, and AMPK signaling pathways. We used histograms (Fig. 4E) and network diagrams (Fig. 4F) to display the gene expression in these five enriched KEGG pathways.
Table 2 KEGG enrichment analysis results of EMRDEGsGSEA of Pediatric NAFLD Data SetTo investigate the influence of gene expression levels on the risk of pediatric NAFLD, we used GSEA to assess the BP, affected cell types, and the expression of individual genes within the GSE185051 data set. We employed stringent screening criteria, requiring a P. adj. < 0.05 and an FDR value (q. value) < 0.25. According to the findings, the GSEA of the GSE185051 data set had five main biological characteristics (Fig. 5A), the DEGs were considerably enriched in Biosynthesis of Unsaturated Fatty Acids (Fig. 5B), Glycolysis Gluconeogenesis (Fig. 5C), Omega-9 Fatty Acid Synthesis (Fig. 5D), Fatty Acid Metabolism (Fig. 5E), Hedgehog Ligand Biogenesis (Fig. 5F) and various other pathways (Fig. 5B–F, Table 3). These findings offer insights into how specific biological processes and pathways may contribute to the risk of pediatric NAFLD based on gene expression patterns.
Fig. 5GSEA of pediatric NAFLD data set GSE185051. A The GSEA of the GSE185051 dataset revealed five main biological characteristics. The X-axis represented the LogFC (Log2 Fold Change) value of gene expression changes. B–E The DEGs in the GSE185051 data set were significantly enriched in Biosynthesis of Unsaturated Fatty Acids (B), Glycolysis Gluconeogenesis (C), Omega-9 Fatty Acid Synthesis (D), Fatty Acid Metabolism (E), Hedgehog Ligand Biogenesis (F) pathway. The significant enrichment screening criteria for GSEA enrichment analysis were P. adj. < 0.05 and FDR value (q.value) < 0.25
Table 3 GSEA analysis results of dataset GSE185051EMRDEGs PPI Network, mRNA-miRNA, mRNA-RBP, mRNA-TF, and mRNA-Drugs Interaction NetworkWe used the STRING database to analyze 14 EMRDEGs (ACSL4, C1QBP, CACNB2, CD36, FOXO1, GNG12, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, RAPGEF3, SLC2A2, and UCP2) for PPI network (minimum required interaction score: medium confidence (0.400)). We retained only EMRDEGs connected to other nodes, resulting in a PPI network comprising 12 EMRDEGs (CACNB2, CD36, FOXO1, GNG12, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, RAPGEF3, SLC2A2, and UCP2) all considered as hub genes. We visualized this PPI network using Cytoscape software (Fig. 6A).
Fig. 6EMRDEGs PPI network. A PPI network of EMRDEGs. B–E MCC (B), DMNC (C), and MNC (D) algorithm top 10 node networks of the PPI network of EMRDEGs. E The top 10 node Venn diagram results of the four algorithms MCC, DMNC, MNC, and Degree in the EMRDEGs PPI network were displayed
To identify the hub genes within the PPI network, we analyzed the nodes connected to other nodes through the cytoHubba plug-in in the Cytoscape software. The cytoHubba [29] plug-in is used to identify hub genes in PPI networks. This plug-in implements various algorithms to evaluate the importance of genes in PPI networks. Through the cytoHubba plug-in, we used three algorithms: Matthews correlation coefficient metric (Fig. 6B), differential metabolic network construction (Fig. 6C), maximal neighborhood coefficient (Fig. 6D). Matthews Correlation Coefficient is a statistical metric that takes into account true positives, true negatives, false positives, and false negatives to measure the performance of binary classification models in classification tasks. It is also a measurement method to evaluate the importance of nodes in a network. Its value ranges from -1 to 1, where 1 represents complete correlation, 0 indicates no correlation, and -1 represents negative correlation. Through ranking by scores, genes with high centrality and importance in PPI networks can be identified. Differential metabolic network reconstruction is a method that identifies significantly changed network nodes and edges in a metabolic network by comparing different conditions (such as healthy and pathological groups) to determine the changes that occur under specific diseases or physiological states. In this study, it was utilized to analyze and identify hub genes involved in the pathological process of NAFLD. The maximal neighborhood coefficient is an indicator used to evaluate the importance of network nodes. It is calculated based on the neighborhood density surrounding a node. Maximal neighborhood coefficient measures the degree of interconnection between a node and its neighboring nodes. A high maximal neighborhood coefficient indicates that the node has higher connectivity in the network, potentially playing a more important biological function. By combining these three methods, we were able to identify hub genes with biological importance under specific conditions in the PPI network. The top 10 EMRDEGs with the highest scores were selected for each algorithm (Fig. 6B–D). These scores gradually increased from yellow to red in the figures. We then identified the intersection of the top 10 EMRDEGs obtained by each algorithm (Matthews correlation coefficient, differential metabolic network construction, maximal neighborhood coefficient), respectively, to acquire the hub genes of the EMRDEGs PPI network and displayed the results in a Venn diagram (Fig. 6E). Overall, we identified 9 hub genes: CD36, FOXO1, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, and UCP2.
For miRNA interactions with these 9 hub genes (CD36, FOXO1, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, UCP2), we utilized the mRNA-miRNA data from the miRDB database. The mRNA-miRNA interaction network was then displayed using Cytoscape software (Fig. 7A). The sky blue circles in the mRNA-miRNA interaction network represented mRNAs, and the green circles represented miRNAs. The mRNA-miRNA interaction network revealed that our network included 5 hub genes (FOXO1, HIF1A, PPARGC1A, PRKAA2, and PRKAR2A), 117 pairs of mRNA-miRNA interactions, and 101 miRNA molecules. The results were shown in Table S2.
Fig. 7Hub genes mRNA-miRNA, mRNA-RBP, mRNA-TF, and mRNA-drugs interaction network. A–D mRNA-miRNA (A), mRNA-RBP (B), mRNA-TF (C), mRNA-drugs (D) interaction network of hub genes. The sky blue circles in the mRNA-miRNA (A) interaction network were mRNAs; the green circles were miRNAs. In the mRNA-RBP (B) interaction network, the sky blue circles were mRNAs; the orange circles were RBPs. The sky blue circles in the mRNA-TF (C) interaction network were mRNAs; the purple circles were TFs. In the mRNA-drugs (D) interaction network, the sky blue circles were mRNAs; the pink circles were drugs. RBP RNA-Binding Protein, TF Transcription Factors
Additionally, we predicted RBPs interacting with 9 hub genes (CD36, FOXO1, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, UCP2) via the ENCORI database, retaining mRNA-RBP interaction pairs with clusterNum > 4 and clipExpNum > 4, then visualized the mRNA-RBP interaction network by Cytoscape software (Fig. 7B). This network included 7 hub genes (CD36, FOXO1, HIF1A, PPARGC1A, PRKAA2, PRKAR2A, and UCP2), 40 RBP molecules and 102 pairs of mRNA-RBP interaction relationships. Notably, hub gene HIF1A had interactions with 31 RBP molecules. The specific mRNA-RBP interaction relationships were shown in Table S3.
We identified TFs that bind to hub genes through the CHIPBase database (version 2.0) and the hTFtarget database. These interactions overlapped with 9 hub genes, resulting in 7 hub genes (FOXO1, HIF1A, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, and UCP2) and 100 TFs, which we mapped out using Cytoscape (Fig. 7C). The sky blue circles were mRNAs and the purple circles were TFs. In the mRNA-TF interaction network, we found 65 pairs of significant mRNA-TF interactions particularly involving the hub gene UCP2. Table S4 displayed the precise interactions between mRNAs and TFs.
To find potential medications or chemical compounds targeting 9 hub genes, as shown in the mRNA-drugs interaction network (Fig. 7D), we searched 76 potential drugs or molecular compounds corresponding to 8 hub genes (CD36, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, and UCP2) through the CTD database. The sky blue circles represented mRNAs, and the pink circles represented drugs. We identified 43 medications or chemical substances that target the CD36 gene among them. Various mRNA-drug interaction correlations were shown in Table S5.
Immune Infiltration Analysis of Pediatric NAFLD Data Set (CIBERSORT)We used the CIBERSORT package and the Pearson algorithm [30] to evaluate the correlation between 22 distinct immune cell types and the expression profile data of pediatric NAFLD data set GSE185051. A graphical representation of the immune cell infiltration in each sample of the GSE185051 data set was created (Fig. 8A), focusing on 21 immune cell types where the cumulative infiltration abundance was greater than 0.
Fig. 8Immune infiltration analysis of pediatric NAFLD (CIBERSORT). A Histogram display of immune infiltration results of immune cells in the pediatric NAFLD data set GSE185051. B Correlation heatmap analysis results of immune cell infiltration abundance in the GSE185051 data set. C Correlation dot plots of immune cells and hub genes in the GSE185051 data set. The absolute value of the correlation coefficient (r) in the correlation heat map was strong if the absolute value was above 0.8; if the absolute value was 0.5–0.8, it was moderately correlated; if the absolute value was 0.3–0.5, it was weak; if the absolute value was below 0.3, it was weak or not relevant
The association between the infiltration abundance of 21 immune cell types and the expression profiles within the pediatric NAFLD data set GSE185051 was computed and presented (Fig. 8B). The findings revealed that the majority of the infiltration abundances among these 21 immune cell categories exhibited a negative correlation.
Simultaneously, we investigated the relationships between the expression levels of 9 hub genes (CD36, FOXO1, HIF1A, PKLR, PPARGC1A, PRKAA2, PRKAR2A, SLC2A2, and UCP2) and the infiltration abundance of 21 immune cell types (Fig. 8C). Notably, mast cells had a significantly positive correlation with CD36 and PRKAR2A, while T follicular helper cells exhibited positive correlation with FOXO1 and PPARGC1A. Conversely, B cells memory had a significant negative correlation with CD36, HIF1A, PRKAR2A, and UCP2, and T follicular helper cells showed negative correlation with CD36, HIF1A, PKLR, PRKAR2A, and UCP2.
Comments (0)