System Components

This section details the key components and algorithms that constitute the GeneInsight system.

GeneInsight Framework Diagram

Figure 1: GeneInsight workflow framework showing the two main phases: Theme Generation (top) and Summarisation (bottom). Theme Generation processes gene sets through STRING database and topic modeling before identifying themes via hypergeometric testing. Summarisation organizes themes hierarchically and produces an interactive HTML report.

Data Sources

STRING Database

The STRING database serves as a fundamental data source for GeneInsight, providing comprehensive protein-protein interaction (PPI) data across multiple organisms. GeneInsight utilizes STRING version 12.0, which integrates:

  • Direct physical interactions

  • Indirect functional associations

  • Experimental data

  • Pathway knowledge

  • Co-expression patterns

  • Text mining of scientific literature

For the analysis pipeline, GeneInsight leverages STRING’s functional enrichment API, which provides statistically validated functional annotations for protein sets. Each returned term is accompanied by:

  • False discovery rate (FDR)

  • p-value

  • Number of proteins mapped to that term

Core Algorithms

Semantic Embedding

GeneInsight generates dense vector representations of biological terms using the SentenceTransformer framework with the paraphrase-MiniLM-L6-v2 model. This produces 384-dimensional embeddings that capture contextual relationships between terms.

The embedding process enables:

  • Measuring semantic similarity between terms

  • Identifying conceptually related annotations

  • Clustering similar biological concepts

  • Quantifying term diversity in enrichment results

BERTopic Topic Modeling

The implementation uses BERTopic (v0.15.0) with customized parameters for biological text:

  1. Document Vectorization: Converting gene annotations to semantic vectors

  2. Dimensionality Reduction: Using UMAP to create a lower-dimensional representation

  3. Document Clustering: Applying HDBSCAN to identify coherent topics

  4. Topic Representation: Extracting characteristic terms using c-TF-IDF

Hyperparameter optimization focuses on two critical aspects:

  • Biological theme identification parameters

  • Rank-based clustering iterations

Statistical Framework

GeneInsight employs several statistical methods for validation and evaluation:

  1. Hypergeometric Testing: Validating the significance of biological themes

  2. Benjamini-Hochberg Correction: Controlling false discovery rate in multiple comparisons

  3. Clustering Quality Metrics: * Davies-Bouldin Index for measuring cluster separation * Calinski-Harabasz Score for evaluating clustering quality

  4. Semantic Similarity Metrics: * Cosine similarity for comparing embedding vectors * Word Mover’s Distance for orthogonal validation * TopK recall for evaluating theme preservation

Language Model Integration

GeneInsight interfaces with modern language models to enhance biological interpretation:

API Services

The system supports multiple API services:

  • OpenAI API: Default option that works with models like gpt-4o-mini

  • Ollama API: Local option for running models on your own hardware

Prompt Engineering

GeneInsight’s prompting strategy includes:

  1. Context Provision: Including relevant background on biological concepts

  2. Task Specification: Clear instructions for theme interpretation

  3. Output Formatting: Guidelines for creating consistent, structured responses

  4. Few-Shot Examples: Demonstration of desired output format and quality

Output Processing

After receiving API responses, GeneInsight:

  1. Validates the structural integrity of the response

  2. Extracts the thematic content and metadata

  3. Integrates the interpreted themes with statistical results

  4. Structures the information for the final report

Visualization Framework

The interactive HTML report is built using:

  • JavaScript: For interactive elements and dynamic content

  • D3.js: For data visualization components

  • Bootstrap: For responsive layout and styling

Key visualizations include:

  1. Topic Map: 2D representation of theme relationships

  2. Gene-Theme Network: Interactive graph showing connections

  3. Heatmaps: Visualizing gene presence across themes

File System and Storage

GeneInsight organizes analysis outputs in a structured directory:

  • CSV files for detailed data export

  • Interactive HTML for visualization

  • Compressed archives for easy sharing of both the report and data

  • Full enrichment results and API call information in single CSV files for easy access