System Architecture
===================

GeneInsight employs a two-stage approach to extract and organize biological information from gene sets.

.. figure:: /_static/GeneInsight.png
   :alt: GeneInsight system architecture and workflow
   :width: 600px

   Figure 1: GeneInsight system architecture and workflow

Biological Theme Generation Stage
--------------------------------

1. **Annotation Collection**:
   The system collects functional annotations from the STRING database for each input gene, creating a collection of gene-specific descriptions.

2. **Cluster-based Topic Modeling**:
   This textual corpus is subjected to cluster-based topic modeling, which groups similar annotations into clusters (topics) and identifies key terms for each cluster.

3. **LLM Theme Generation**:
   A large language model (LLM) then converts representative annotations from each cluster into interpretable biological themes.

4. **Gene-Theme Linking**:
   These biological themes are linked back to genes via their associated descriptions.

5. **Statistical Validation**:
   The system performs hypergeometric testing with false discovery rate correction to identify which biological concepts are significantly enriched within the original gene set.

Summarization Stage
-----------------

1. **Theme Refinement**:
   Another round of cluster-based topic modeling identifies key themes, measuring how consistently they appear as cluster representatives across multiple runs.

2. **Hierarchical Structuring**:
   The software extracts the final summary by selecting themes based on user-defined length preferences.

3. **Interactive Report Generation**:
   A large language model creates a hierarchical summary where major biological themes appear as main headings with related subheadings grouped beneath them.

4. **Cross-reference Integration**:
   The final interactive HTML report seamlessly links theme descriptions to their corresponding gene annotations, enabling researchers to navigate between overarching biological processes and their specific components.

Technical Implementation
----------------------

GeneInsight is implemented in Python (3.9+) and distributed as a Docker container to ensure reproducible deployment across platforms. The core computational components include:

* **BERTopic (v0.15.0)** for topic modeling
* **SentenceTransformer** for generating dense vector representations
* **Optuna (v3.3.0)** for hyperparameter optimization
* **Snakemake (v7.32.4)** for workflow management

The system employs multiple rounds of topic modeling with different random seeds to ensure robust theme identification, identifying stable topics through cosine similarity measurements between iterations.