Helix Insight

Relevance Scoring

Each candidate publication is scored from 0.0 to 1.0 using a six-component weighted system optimized for ACMG evidence assessment. The total score determines the publication's rank in search results.

Six Scoring Components

30% Phenotype

Measures how many of the patient’s HPO terms are mentioned in the publication. Uses morphological matching (stemming) so that "seizure" matches "seizures" and "epileptic" matches "epilepsy". Score equals the fraction of patient HPO terms found in the title and abstract.

20% Publication Type

Prioritizes study types with the highest clinical relevance for variant interpretation. Case reports receive the highest score because they directly describe patient phenotypes and variant associations.

15% Gene Centrality

Measures how prominently the query gene is discussed in the publication, based on mention frequency. A paper mentioning the gene 20+ times is likely focused on that gene, while a single mention may be incidental.

15% Functional Data

Detects the presence of functional studies -- animal models (zebrafish, mouse), knockout experiments, cell line assays, and molecular biology techniques. Functional evidence is critical for ACMG PS3 criterion assessment.

10% Variant Match

Scores whether the variant itself appears in the publication. An exact match on the variant notation scores 1.0 and indicates that the specific variant has been studied; a gene-only match scores 0.3 and indicates relevance at the gene level.

10% Recency

More recent publications are scored higher using linear decay over 10 years. A 2025 publication scores 1.0; a 2015 publication scores approximately 0.0. This reflects the evolving understanding of variant pathogenicity.
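
The variant-match and recency rules above can be sketched as follows. This is a minimal illustration, not the platform's code: the function names, the `reference_year` parameter, the plain substring matching, and the 0.0 score for a publication that mentions neither the variant nor the gene are all assumptions. The 10-year window and the 1.0 / 0.3 values come from the descriptions above.

```python
def recency_score(pub_year: int, reference_year: int = 2025, window: int = 10) -> float:
    """Linear decay: 1.0 at reference_year, 0.0 at reference_year - window or older."""
    age = reference_year - pub_year
    # Clamp so future-dated or very old publications stay within [0.0, 1.0].
    return min(1.0, max(0.0, 1.0 - age / window))


def variant_match_score(text: str, variant: str, gene: str) -> float:
    """1.0 if the exact variant notation appears, 0.3 if only the gene symbol does."""
    # Plain substring checks keep the sketch short; a real matcher would
    # likely normalize notation and respect word boundaries.
    if variant in text:
        return 1.0
    if gene in text:
        return 0.3
    return 0.0  # assumption: the documentation does not state this case
```

With 2025 as the reference year, a 2020 publication falls halfway through the window and scores 0.5.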

Publication Type Scoring

Publication Type    Score    Rationale
Case Report         1.0      Directly describes patient phenotypes and variant associations
Clinical Trial      0.9      Strong clinical evidence with structured methodology
Research Article    0.7      Original research contributing new findings
Journal Article     0.5      General scientific publication
Review              0.3      Secondary source, lower novelty for classification

Gene Centrality Scoring

Mention Count    Score
20 or more       1.0
10-19            0.8
5-9              0.6
2-4              0.4
1                0.2
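
The mention-count buckets can be expressed as a short threshold lookup. The sketch below is illustrative only: the function name, the whole-word case-sensitive counting, and the 0.0 score for zero mentions are assumptions, since the documentation does not say how mentions are counted.

```python
import re

# (minimum mentions, score) pairs taken from the Gene Centrality table,
# ordered from the highest threshold down.
_CENTRALITY_BUCKETS = ((20, 1.0), (10, 0.8), (5, 0.6), (2, 0.4), (1, 0.2))


def gene_centrality_score(text: str, gene: str) -> float:
    """Map the number of gene-symbol mentions in text to a centrality score."""
    # Count whole-word, case-sensitive occurrences of the symbol.
    mentions = len(re.findall(rf"\b{re.escape(gene)}\b", text))
    for threshold, score in _CENTRALITY_BUCKETS:
        if mentions >= threshold:
            return score
    return 0.0  # assumption: no mentions, not covered by the table
```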

Morphological Matching

Phenotype matching uses stemming (NLTK SnowballStemmer, English) to handle morphological variations in clinical language. The stem of each HPO term name is compared against stemmed words in the publication title and abstract. This ensures that "seizure" matches "seizures", "epileptic" matches "epilepsy", and "developmental" matches "development" without requiring exact string matches.
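
The matching flow can be sketched as below. To keep the example dependency-free, a deliberately crude suffix stripper (`toy_stem`) stands in for the NLTK SnowballStemmer and handles far fewer word forms than the real stemmer. The function names, and the rule that a multi-word HPO term counts as found only when every one of its stemmed words appears, are assumptions for illustration.

```python
import re


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: drops a plural 's' and a trailing 'e'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def phenotype_score(hpo_terms, title: str, abstract: str) -> float:
    """Fraction of patient HPO term names found in the title and abstract."""
    if not hpo_terms:
        return 0.0
    # Stem every word of the publication text once, then test terms against the set.
    doc_stems = {toy_stem(w) for w in re.findall(r"[A-Za-z]+", f"{title} {abstract}")}

    def term_found(term: str) -> bool:
        words = re.findall(r"[A-Za-z]+", term)
        # Assumption: all words of a multi-word term must be present.
        return bool(words) and all(toy_stem(w) in doc_stems for w in words)

    return sum(term_found(t) for t in hpo_terms) / len(hpo_terms)
```

For a patient with the terms "Seizure" and "Ataxia", a publication titled "Recurrent seizures in a child" matches one of the two terms and scores 0.5.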

Score Interpretation

0.7 - 1.0    Highly relevant -- strong phenotype match, relevant study type, prominent gene discussion
0.4 - 0.7    Moderately relevant -- partial phenotype overlap or relevant gene without phenotype match
0.1 - 0.4    Low relevance -- gene mentioned but limited clinical overlap. Included for completeness.
Below 0.1    Filtered out -- insufficient relevance for clinical review
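
Combining the six component scores and mapping the result to a band might look like the sketch below. It assumes the total is a plain weighted sum using the percentages listed under Six Scoring Components, and that a score exactly on a boundary (0.7, 0.4, 0.1) belongs to the higher band; neither detail is confirmed by the documentation, and the names are illustrative.

```python
# Component weights from the Six Scoring Components section (they sum to 1.0).
WEIGHTS = {
    "phenotype": 0.30,
    "publication_type": 0.20,
    "gene_centrality": 0.15,
    "functional_data": 0.15,
    "variant_match": 0.10,
    "recency": 0.10,
}


def total_score(components: dict) -> float:
    """Weighted sum of component scores; missing components count as 0.0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())


def interpret(score: float) -> str:
    """Map a total score to its interpretation band (boundaries go to the higher band)."""
    if score >= 0.7:
        return "Highly relevant"
    if score >= 0.4:
        return "Moderately relevant"
    if score >= 0.1:
        return "Low relevance"
    return "Filtered out"


example = {
    "phenotype": 0.5,         # half of the patient's HPO terms found
    "publication_type": 1.0,  # case report
    "gene_centrality": 0.6,   # 5-9 mentions
    "functional_data": 0.0,   # no functional study detected
    "variant_match": 0.3,     # gene-only match
    "recency": 0.5,           # five years old
}
band = interpret(total_score(example))
```

Here the weighted sum is 0.15 + 0.20 + 0.09 + 0.00 + 0.03 + 0.05 = 0.52, which falls in the moderately relevant band.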

Parallel Scoring

Relevance scoring runs across 16 parallel workers, enabling the platform to score thousands of candidate publications in under 500 milliseconds. Each worker is pre-initialized with the stemming engine to eliminate per-request overhead.
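
The pre-initialized worker pattern can be illustrated with Python's standard `concurrent.futures` module. This is a sketch under stated assumptions, not the platform's implementation: it uses threads so it runs anywhere, `str.lower` is a placeholder for the stemming engine, and the scoring function is a trivial stand-in. For CPU-bound scoring in Python, `ProcessPoolExecutor`, which accepts the same `initializer` argument, would be the more typical choice.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-worker state, created once when each worker starts.
_worker = threading.local()


def _init_worker() -> None:
    # A real worker would build its stemmer here; str.lower is a placeholder.
    _worker.stem = str.lower


def _score_one(abstract: str) -> float:
    # Placeholder scoring: share of words that normalize to "seizure".
    words = [_worker.stem(w) for w in abstract.split()]
    return words.count("seizure") / len(words) if words else 0.0


def score_all(abstracts, workers: int = 16):
    """Score abstracts in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers, initializer=_init_worker) as pool:
        return list(pool.map(_score_one, abstracts))
```

Because the initializer runs once per worker rather than once per publication, the setup cost is paid at most 16 times regardless of how many candidates are scored.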