Topic Modeling

Fundamentals of Topic Modeling

Topic modeling is an unsupervised machine learning technique that discovers abstract “topics” within document collections. In GeneInsight, documents are gene annotations from sources like the STRING database, literature, and ontologies.

The basic premise is that documents contain mixtures of topics, and each topic is a probability distribution over words. Topic modeling algorithms identify word co-occurrence patterns to reveal underlying themes without requiring predefined categories.

BERTopic Implementation

GeneInsight uses BERTopic, which leverages BERT embeddings and clustering for topic creation through these steps:

Document Embedding: Converting gene annotations into vector representations using SentenceTransformer
Dimensionality Reduction: Using UMAP to project vectors into lower-dimensional space
Clustering: Applying HDBSCAN to group similar documents
Topic Representation: Extracting key terms for each topic using class-based TF-IDF

This approach improves upon traditional methods by:

Capturing semantic relationships between terms
Creating more coherent topics
Handling polysemy and synonymy
Working effectively with short gene annotations
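
As a rough illustration of the Topic Representation step, the sketch below scores terms per cluster with a class-based TF-IDF weighting: term frequency within a cluster, multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. This follows the formula described in the BERTopic documentation as far as we understand it, not GeneInsight's actual code. The function names, the whitespace tokenizer, and the toy annotations are assumptions made for the example.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_cluster):
    """Score terms per cluster with a class-based TF-IDF weighting.

    docs_by_cluster maps a cluster label to a list of document strings.
    Returns a dict mapping each label to {term: score}.
    """
    # Treat each cluster as one long "class document" and count its terms.
    counts = {
        label: Counter(word for doc in docs for word in doc.lower().split())
        for label, docs in docs_by_cluster.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for label, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[label] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in cluster_counts.items()
        }
    return scores


def top_terms(scores, label, k=3):
    """Return the k highest-scoring terms for one cluster (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy annotations, invented for illustration only.
clusters = {
    0: ["dna repair pathway", "dna damage response", "dna repair protein"],
    1: ["immune response signaling", "cytokine signaling pathway"],
}
scores = c_tf_idf(clusters)
print(top_terms(scores, 0))
print(top_terms(scores, 1))
```

Terms that appear in several clusters (here "response" and "pathway") receive a lower weight than terms unique to one cluster, which is what makes the top terms descriptive of a single topic.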

LLM Integration for Theme Generation

GeneInsight connects topic modeling with biological interpretation through Large Language Models (LLMs):

Representative terms and documents from each cluster create prompts for the LLM
Biological context guides the LLM toward meaningful interpretations
The LLM generates interpretable biological themes from the topics
Structured outputs include both concise summaries and detailed biological explanations
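
The steps above can be pictured as assembling a prompt from a topic's key terms and a few representative annotations. The sketch below is a minimal, hypothetical version: the function name `build_theme_prompt`, the prompt wording, and the two output fields are assumptions for illustration, not GeneInsight's actual prompt. No LLM call is made; the resulting string would be passed to whichever client is in use.

```python
def build_theme_prompt(top_terms, representative_docs, context="gene set analysis"):
    """Assemble a plain-text prompt from one topic's terms and example documents."""
    term_line = ", ".join(top_terms)
    doc_lines = "\n".join(f"- {doc}" for doc in representative_docs)
    return (
        f"You are assisting with {context}.\n"
        f"Key terms for this topic: {term_line}\n"
        f"Representative annotations:\n{doc_lines}\n"
        "Respond with two fields:\n"
        "summary: a short theme title\n"
        "explanation: a few sentences on the underlying biology\n"
    )


# Illustrative inputs only.
prompt = build_theme_prompt(
    ["dna", "repair", "damage"],
    ["DNA repair protein RAD51", "Response to DNA damage stimulus"],
)
print(prompt)
```

Asking for named fields is one simple way to obtain both the concise summary and the detailed explanation in a form that is easy to parse.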

This LLM layer transforms statistical clusters into actionable biological insights for enrichment analysis, leveraging broader biomedical knowledge to create themes that are both statistically robust and biologically meaningful.

Multi-run Convergence Strategy

To ensure robust results, GeneInsight performs multiple independent rounds of topic modeling with different random seeds, addressing the inherent variability in dimensionality reduction and clustering.

The system measures consistency between topics from different runs, focusing on persistent themes. Validation metrics include:

Normalized Soft Cardinality: Measuring semantic overlap between terms
Rank-Based Metrics: Evaluating topic ranking stability
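
To give a feel for the first metric, here is a minimal sketch of soft cardinality, which counts items in a collection while discounting near-duplicates: each item contributes 1 divided by the sum of its similarities to every item, itself included. The token-level Jaccard similarity stands in for an embedding-based similarity, and dividing by the plain item count is one simple normalization. Both are assumptions for this example and may differ from what GeneInsight computes.

```python
def token_jaccard(a, b):
    """Toy similarity: Jaccard overlap of lowercase tokens (inputs must be non-empty)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def soft_cardinality(items, sim, p=1.0):
    """Sum over items of 1 / sum_j sim(item, item_j) ** p.

    Assumes sim(x, x) == 1, so every denominator is at least 1.
    """
    return sum(1.0 / sum(sim(a, b) ** p for b in items) for a in items)


def normalized_soft_cardinality(items, sim, p=1.0):
    """Soft cardinality divided by the item count; 1.0 means all items are distinct."""
    return soft_cardinality(items, sim, p) / len(items)


# Illustrative theme titles pooled from several runs.
themes = ["dna repair", "dna repair pathway", "immune response"]
print(soft_cardinality(themes, token_jaccard))
print(normalized_soft_cardinality(themes, token_jaccard))
```

A value well below the item count indicates that different runs keep producing overlapping themes, which is the kind of persistence the convergence check looks for.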

Research shows the approach converges after approximately 5 sampling rounds, with minimal gains from additional runs.

Topic Clustering and Hierarchical Organization

After generating topics across multiple runs, GeneInsight performs secondary clustering to:

Ensure biological distinctiveness of themes
Create a hierarchical organization for user navigation

The process:

Computes semantic similarity between topic pairs
Applies hierarchical clustering to identify topic groups
Selects representative topics from each cluster
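
The three steps above can be sketched in plain Python. This example assumes a precomputed topic-by-topic similarity matrix, uses average-linkage agglomerative clustering with a similarity threshold, and picks each group's medoid as its representative. The linkage choice, the threshold, and the function names are assumptions for illustration rather than GeneInsight's confirmed settings.

```python
def average_similarity(sim, group_a, group_b):
    """Mean pairwise similarity between two groups of topic indices."""
    pairs = [(i, j) for i in group_a for j in group_b]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def cluster_topics(sim, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Repeatedly merges the two most similar groups and stops once no pair
    of groups has an average similarity of at least `threshold`.
    """
    groups = [[i] for i in range(len(sim))]
    while len(groups) > 1:
        best_score, (a, b) = max(
            (average_similarity(sim, groups[a], groups[b]), (a, b))
            for a in range(len(groups))
            for b in range(a + 1, len(groups))
        )
        if best_score < threshold:
            break
        merged = groups[a] + groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
    return sorted(sorted(g) for g in groups)


def representative(sim, group):
    """Pick the medoid: the topic with the highest total similarity to its group."""
    return max(group, key=lambda i: sum(sim[i][j] for j in group))


# Toy similarity matrix for four topics (values invented for illustration).
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
groups = cluster_topics(sim, threshold=0.5)
reps = [representative(sim, g) for g in groups]
print(groups, reps)
```

Recording the order of merges, rather than only the final groups, would yield the tree that a hierarchical view can be built from.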

Cluster quality is evaluated using:

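Davies-Bouldin Index: Measuring the average similarity between each cluster and its most similar neighbor (lower is better)
Calinski-Harabasz Score: Evaluating the ratio of between-cluster to within-cluster dispersion (higher is better)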
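
Both metrics are simple enough to compute by hand, and the sketch below does so in plain Python to show what each one measures. It assumes Euclidean distance on small coordinate vectors, and the toy points are invented. In practice, scikit-learn provides `davies_bouldin_score` and `calinski_harabasz_score` in `sklearn.metrics`. Whether GeneInsight uses those functions is not stated here.

```python
import math
from collections import defaultdict


def _centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def _group(points, labels):
    """Split points into one list per cluster label."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return list(clusters.values())


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = _group(points, labels)
    centroids = [_centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, cen) for p in c) / len(c)
        for c, cen in zip(clusters, centroids)
    ]
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        )
        for i in range(len(clusters))
    ]
    return sum(worst) / len(worst)


def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion, each per degree of freedom."""
    clusters = _group(points, labels)
    overall = _centroid(points)
    between = sum(len(c) * math.dist(_centroid(c), overall) ** 2 for c in clusters)
    within = sum(math.dist(p, _centroid(c)) ** 2 for c in clusters for p in c)
    k, n = len(clusters), len(points)
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well-separated toy clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
print(davies_bouldin(points, labels))
print(calinski_harabasz(points, labels))
```

For well-separated clusters like these, the Davies-Bouldin value is small and the Calinski-Harabasz value is large. Both functions need at least two clusters and more points than clusters.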

This hierarchical organization helps users:

Explore major biological themes at the top level
Drill down into related sub-themes
Navigate between connected concepts
Locate specific biological processes efficiently

This structure transforms a flat list of terms into an organized knowledge framework reflecting the natural hierarchy of biological processes.

Downstream Enrichment Analysis

The LLM-generated themes form the foundation for enrichment analysis:

Themes are linked back to original genes through their annotations
Hypergeometric testing identifies statistically enriched themes
False discovery rate correction controls for multiple testing
Enriched themes provide meaningful biological interpretation of the gene set
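
The statistical core of these steps can be sketched with the standard library alone: an upper-tail hypergeometric p-value per theme, followed by Benjamini-Hochberg adjustment across themes. Benjamini-Hochberg is assumed here as the false discovery rate procedure because the text does not name one. The function names and the example counts are likewise illustrative.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for draws made without replacement.

    population: number of background genes
    successes:  background genes linked to the theme
    draws:      size of the query gene set
    k:          query genes linked to the theme
    """
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(population, draws)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative counts: 5 of 5 query genes hit a theme covering 5 of 10 background genes.
p = hypergeom_sf(5, population=10, successes=5, draws=5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(p, adjusted)
```

Themes whose adjusted p-value falls below the chosen threshold would be the ones reported as enriched.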

This approach produces human-readable summaries in the HTML report, with the LLM translating findings into clear insights that biologists can readily understand and apply.