System Architecture

GeneInsight employs a two-stage approach to extract and organize biological information from gene sets.

Figure 1: GeneInsight system architecture and workflow

Biological Theme Generation Stage

  1. Annotation Collection: The system collects functional annotations from the STRING database for each input gene, creating a collection of gene-specific descriptions.

  2. Cluster-based Topic Modeling: Cluster-based topic modeling groups similar annotations in this corpus into clusters (topics) and identifies key terms for each cluster.

  3. LLM Theme Generation: A large language model (LLM) then converts representative annotations from each cluster into interpretable biological themes.

  4. Gene-Theme Linking: These biological themes are linked back to genes via their associated descriptions.

  5. Statistical Validation: The system performs hypergeometric testing with false discovery rate correction to identify which biological themes are significantly enriched within the original gene set.
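The statistical validation step can be illustrated with a minimal sketch. It uses only the Python standard library, and it assumes a one-sided hypergeometric test followed by Benjamini-Hochberg correction; the source does not say which FDR procedure GeneInsight uses, so that choice is an assumption. The function names, theme names, counts, and the 0.05 cutoff are all hypothetical.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of seeing at least k theme-linked genes when N genes
    are drawn from a background of M genes, n of which carry the theme."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts per theme:
# (query genes linked to the theme, background genes linked to the theme)
background_size = 20000
query_size = 100
themes = {
    "DNA repair": (12, 300),
    "Lipid metabolism": (2, 400),
    "Cell cycle": (8, 500),
}

names = list(themes)
pvals = [
    hypergeom_sf(themes[t][0], background_size, themes[t][1], query_size)
    for t in names
]
qvals = benjamini_hochberg(pvals)
significant = [t for t, q in zip(names, qvals) if q < 0.05]
print(significant)
```

In this toy input, "Lipid metabolism" appears about as often as chance predicts, so it does not pass the cutoff, while the other two themes do.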

Summarization Stage

  1. Theme Refinement: A second round of cluster-based topic modeling identifies key themes and measures how consistently each one appears as a cluster representative across multiple runs.

  2. Hierarchical Structuring: The software builds the final summary by selecting themes according to the user-defined summary length.

  3. Interactive Report Generation: A large language model creates a hierarchical summary where major biological themes appear as main headings with related subheadings grouped beneath them.

  4. Cross-reference Integration: The final interactive HTML report links each theme description to its corresponding gene annotations, so researchers can navigate between overarching biological processes and their specific components.
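The first two summarization steps can be sketched as follows. This is an illustration and not GeneInsight's actual code: it assumes that consistency is measured as the fraction of runs in which a theme is chosen as a cluster representative, and that the user-defined length is a maximum number of themes. The function names, the `max_themes` parameter, and the example runs are hypothetical.

```python
from collections import Counter

def representative_frequency(runs):
    """Fraction of runs in which each theme was a cluster representative."""
    counts = Counter()
    for representatives in runs:
        counts.update(set(representatives))  # count each theme once per run
    return {theme: n / len(runs) for theme, n in counts.items()}

def select_themes(frequencies, max_themes):
    """Keep the most consistent themes, breaking ties alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [theme for theme, _ in ranked[:max_themes]]

# Hypothetical cluster representatives from four topic-modeling runs.
runs = [
    ["DNA repair", "Cell cycle", "Apoptosis"],
    ["DNA repair", "Cell cycle", "Lipid metabolism"],
    ["DNA repair", "Apoptosis", "Cell cycle"],
    ["DNA repair", "Chromatin remodeling", "Cell cycle"],
]

freq = representative_frequency(runs)
summary = select_themes(freq, max_themes=3)
print(summary)
```

Themes that appear in only one run rank last and are dropped first when the requested summary is short.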

Technical Implementation

GeneInsight is implemented in Python (3.9+) and distributed as a Docker container to ensure reproducible deployment across platforms. The core computational components include:

  • BERTopic (v0.15.0) for topic modeling

  • SentenceTransformer for generating dense vector representations

  • Optuna (v3.3.0) for hyperparameter optimization

  • Snakemake (v7.32.4) for workflow management

The system runs multiple rounds of topic modeling with different random seeds and identifies stable topics by measuring cosine similarity between iterations, so that theme identification does not depend on any single run.
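A minimal sketch of the stability check, using only the standard library. It assumes each topic is represented by one vector per run and that a topic counts as stable when every other run contains a topic whose cosine similarity to it meets a threshold. The 0.9 threshold, the function names, and the toy three-dimensional vectors are hypothetical; in practice the vectors would presumably be SentenceTransformer embeddings.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def stable_topics(reference, other_runs, threshold=0.9):
    """Topics in `reference` with a close match in every other run."""
    stable = []
    for name, vec in reference.items():
        matched_everywhere = all(
            any(cosine_similarity(vec, other) >= threshold for other in run)
            for run in other_runs
        )
        if matched_everywhere:
            stable.append(name)
    return stable

# Toy topic vectors from one reference run and two runs with other seeds.
reference = {
    "topic_a": [1.0, 0.0, 0.2],
    "topic_b": [0.0, 1.0, 0.0],
}
other_runs = [
    [[0.9, 0.1, 0.2], [0.7, 0.7, 0.0]],
    [[1.0, 0.0, 0.1], [0.1, 0.2, 1.0]],
]

stable = stable_topics(reference, other_runs)
print(stable)
```

Here `topic_a` has a close counterpart in both other runs, while `topic_b` has none above the threshold, so only `topic_a` is kept.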