Enrichment Analysis

Traditional Enrichment Analysis

Traditional gene set enrichment analysis involves testing whether a set of genes is statistically overrepresented in predefined functional categories such as Gene Ontology terms, KEGG pathways, or other curated gene sets. This approach typically:

Takes a query gene set and background gene set as input
Identifies functional categories containing genes from the query set
Uses statistical tests (e.g., hypergeometric test, Fisher’s exact test) to assess overrepresentation
Applies multiple testing correction (e.g., Benjamini-Hochberg) to control false discovery rate
Reports enriched terms with their statistical significance
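
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used by any particular tool. The function names (`hypergeom_sf`, `benjamini_hochberg`, `enrich`) and the toy gene and term names are hypothetical and chosen for this example.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of seeing at least k category genes when drawing
    N genes from a background of M genes, n of which are in the category."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def enrich(query, background, categories):
    """Test each category for overrepresentation in the query gene set."""
    background = set(background)
    query = set(query) & background
    raw = []
    for term, genes in categories.items():
        members = set(genes) & background
        overlap = len(query & members)
        if overlap == 0:
            continue  # only categories containing query genes are tested
        p = hypergeom_sf(overlap, len(background), len(members), len(query))
        raw.append((term, overlap, p))
    adjusted = benjamini_hochberg([p for _, _, p in raw])
    rows = [(t, o, p, q) for (t, o, p), q in zip(raw, adjusted)]
    return sorted(rows, key=lambda row: row[3])

# Toy data (hypothetical gene and term names).
background = [f"g{i}" for i in range(1, 21)]
query = ["g1", "g2", "g3", "g4", "g5"]
categories = {
    "termA": ["g1", "g2", "g3", "g4"],
    "termB": ["g10", "g11", "g12", "g13", "g14", "g15"],
}

for term, overlap, p, q in enrich(query, background, categories):
    print(term, overlap, p, q)
```

In this sketch, categories with no query genes are skipped before correction, so the number of tests equals the number of categories with at least one hit. Whether to correct over all categories instead is a design choice that affects the adjusted values.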

While powerful, this approach has limitations:

Relies on predefined functional categories that may be incomplete or outdated
Returns numerous overlapping terms that are difficult to interpret holistically
Often misses biological connections that span multiple databases
Requires manual curation to identify meaningful patterns

GeneInsight’s Enhanced Approach

GeneInsight enhances traditional enrichment analysis through a gene-centric approach:

Gene-Level Querying: Rather than analyzing the entire gene set at once, GeneInsight queries the STRING database for each gene individually, creating a comprehensive corpus of gene-specific annotations
Topic-Based Enrichment: After topic modeling identifies coherent biological themes, GeneInsight links each theme back to genes via their associated descriptions
Statistical Validation: Hypergeometric testing with false discovery rate correction validates which biological themes are significantly enriched within the original gene set
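
As a rough illustration of the theme-to-gene linking step, the sketch below assumes that each gene comes with a list of description strings and that a topic model has already assigned each description to a theme. Both data structures, the function name `genes_per_theme`, and the toy values are assumptions made for this example and are not taken from GeneInsight's code.

```python
from collections import defaultdict

def genes_per_theme(gene_descriptions, description_theme):
    """Map each theme to the set of genes whose descriptions were assigned to it.

    gene_descriptions: dict of gene -> list of description strings (assumed input)
    description_theme: dict of description string -> theme label (assumed input)
    """
    theme_genes = defaultdict(set)
    for gene, descriptions in gene_descriptions.items():
        for text in descriptions:
            theme = description_theme.get(text)
            if theme is not None:  # descriptions without a theme are ignored
                theme_genes[theme].add(gene)
    return dict(theme_genes)

# Toy data (hypothetical genes, descriptions, and themes).
gene_descriptions = {
    "geneA": ["DNA repair factor", "cell cycle checkpoint"],
    "geneB": ["DNA repair factor"],
    "geneC": ["lipid transport"],
}
description_theme = {
    "DNA repair factor": "genome maintenance",
    "cell cycle checkpoint": "genome maintenance",
    "lipid transport": "lipid metabolism",
}

for theme, genes in genes_per_theme(gene_descriptions, description_theme).items():
    print(theme, sorted(genes))
```

The resulting per-theme gene sets play the same role as predefined categories in a conventional analysis: the number of query genes in each set is what a hypergeometric test would then evaluate.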

Semantic Overlap Metrics

To evaluate relationships between biological term sets, GeneInsight employs a semantic overlap metric defined as:

\[SC(A,B) = \sum_{a \in A} \max_{b \in B}(\cos(e_a, e_b))\]

where \(e_a\) and \(e_b\) represent embedding vectors of terms from sets A and B respectively, and \(\cos\) denotes cosine similarity.

This metric allows semantically similar phrases to be counted as overlapping, rather than requiring exact matches. For comparison across datasets of varying sizes, a normalized form is implemented:

\[NSC(A,B) = \frac{SC(A,B)}{|A|}\]

where \(|A|\) represents the cardinality of set A.
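
The two formulas can be computed directly from embedding vectors. The sketch below uses plain Python lists as embeddings and hypothetical function names. How the embeddings are produced is outside its scope, and the toy vectors are illustrative only.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def semantic_overlap(A, B):
    """SC(A, B): for each embedding in A, take its best match in B, then sum."""
    return sum(max(cosine(a, b) for b in B) for a in A)

def normalized_semantic_overlap(A, B):
    """NSC(A, B): SC(A, B) divided by the number of terms in A."""
    return semantic_overlap(A, B) / len(A)

# Toy 2-dimensional embeddings (illustrative values only).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [1.0, 1.0]]

print(semantic_overlap(A, B))
print(normalized_semantic_overlap(A, B))
```

Because the sum runs over A and the maximum over B, the metric is not symmetric: SC(A, B) and SC(B, A) generally differ.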

Top-k Semantic Similarity Analysis

To evaluate how well summaries preserve core themes from source terms, GeneInsight employs a recall-based similarity metric that captures nuanced relationships beyond exact-match approaches:

\[\text{TopK}(i,B) = \max_k\{\cos(e_{A_i}, e_{B_j}) \mid j \in [1,m]\}\]

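\[S(A,B) = \frac{1}{n}\sum_{i=1}^{n} \text{mean}(\text{TopK}(i,B))\]

where \(n = |A|\) is the number of source terms, \(m = |B|\) is the number of summary terms, and \(\max_k\) selects the \(k\) largest cosine similarities between source term \(A_i\) and the terms in \(B\).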

This approach:

Focuses on each source term’s strongest matches rather than averaging over all pairwise comparisons
Prevents dilution of the most meaningful relationships
Improves robustness to noise and outliers
Effectively handles imbalanced sets where many terms in a larger set may not have close equivalents in the summary
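
A minimal sketch of this score follows, again with plain Python lists standing in for embeddings. The function name `top_k_similarity` and the default `k=3` are assumptions for illustration, since the text does not state which value of k is used. If B has fewer than k terms, the sketch averages over all available similarities.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def top_k_similarity(A, B, k=3):
    """S(A, B): mean over terms in A of the mean of their k best matches in B."""
    per_term = []
    for a in A:
        sims = sorted((cosine(a, b) for b in B), reverse=True)
        top = sims[:k]  # the k largest similarities for this source term
        per_term.append(sum(top) / len(top))
    return sum(per_term) / len(per_term)

# Toy 2-dimensional embeddings (illustrative values only).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

for k in (1, 2, 3):
    print(k, top_k_similarity(A, B, k=k))
```

With `k=1` the score reduces to the normalized best-match overlap, while larger k values reward source terms that are supported by several related terms in B.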

Performance Evaluation

As demonstrated in the manuscript, GeneInsight’s enrichment approach offers several advantages over traditional methods:

Identifies two to three times more enriched terms than STRING-DB at equivalent statistical thresholds
Shows strong positive correlation with STRING-DB results (r = 0.69 to 0.87), indicating consistency
Demonstrates significantly greater semantic diversity in identified terms
Maintains robust semantic similarity across different summary lengths