Reference Databases

Helix Insight uses eight reference databases for variant annotation and ACMG classification. All databases are stored locally on EU-based infrastructure in Helsinki, Finland. No variant data is sent to external APIs during processing.

Database versions are fixed per deployment. Each version undergoes validation testing before production deployment to ensure consistency with expected classification outcomes. The current versions and their roles in ACMG classification are documented below.
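As an illustration of version pinning, the following minimal sketch compares the versions found on a deployment against a pinned manifest. The version strings come from the Database Summary table below; the manifest layout and the `version_mismatches` function are hypothetical and are not part of Helix Insight.

```python
# Pinned versions, copied from the Database Summary table.
# The dict layout and function name are assumptions for illustration only.
PINNED_VERSIONS = {
    "gnomAD": "v4.1.0",
    "ClinVar": "2025-01",
    "dbNSFP": "4.9c",
    "Ensembl VEP": "Release 113",
}


def version_mismatches(installed, pinned=PINNED_VERSIONS):
    """Return {database: (pinned, installed)} for every entry that differs."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


# Example: a deployment where ClinVar has drifted from the pinned version.
installed = dict(PINNED_VERSIONS, ClinVar="2025-02")
print(version_mismatches(installed))
```

A check of this kind could run at startup so that a deployment refuses to process variants when any database differs from its validated version.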

Zero External API Calls

During variant processing, Helix Insight makes zero external API calls. Every reference database is read from local storage, and Ensembl VEP runs against a local cache. No patient data leaves the server at any processing stage.

Database Summary

Database	Version	Records	Primary Use	ACMG Criteria
gnomAD	v4.1.0	~759M variants	Population frequencies	BA1, BS1, BS2, PM2
ClinVar	2025-01	~4.1M variants	Clinical significance	PS1, PP5, BP6, ClinVar override
dbNSFP	4.9c	~80.6M sites	Functional predictions	PP3, BP4 (BayesDel_noAF)
SpliceAI	MANE R113	All coding variants	Splice impact	PP3_splice, BP7 guard
gnomAD Constraint	v4.1.0	~18.2K genes	Gene-level tolerance	PVS1, PP2, BP1
HPO	Latest release	~320K associations	Gene-phenotype mapping	PP4
ClinGen	Latest release	~1.6K genes	Dosage sensitivity	BS1, BP2
Ensembl VEP	Release 113	All consequences	Variant effect prediction	PVS1, PM1, PM4, BP1, BP3, BP7

Annotation Pipeline

Reference data is loaded into each variant record during Stage 4 of the processing pipeline. After annotation, every variant carries all reference columns directly, so no database lookups are needed during classification or clinical review. The databases are applied in the following order:

gnomAD v4.1

Population allele frequencies. Positional match on chromosome, position, reference allele, and alternate allele. Loads 6 columns.

ClinVar

Clinical significance assertions. Uses the same positional match as gnomAD (chromosome, position, reference allele, and alternate allele). Loads 7 columns, including review stars and disease associations.

dbNSFP 4.9c

Functional predictions from SIFT, AlphaMissense, MetaSVM, DANN, and BayesDel, plus conservation scores. Loads 9 columns, aggregating duplicate variant rows.

gnomAD Constraint

Gene-level tolerance metrics. Joined on gene symbol. Loads 4 columns: pLI, LOEUF, o/e LoF, and missense Z-score.

HPO

Gene-phenotype associations. Joined on gene symbol with deduplication and aggregation. Loads 6 columns.

ClinGen

Dosage sensitivity scores. Joined on gene symbol. Loads 2 columns: haploinsufficiency and triplosensitivity.

Ensembl VEP runs as a separate stage (Stage 3) before database annotation, providing consequence predictions and transcript selection that the annotation phases then build upon. SpliceAI scores are accessed from precomputed data during VEP annotation.
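The two kinds of join used in the annotation steps above, a positional match and a gene-symbol join with deduplication, can be sketched as follows. This is a minimal illustration using in-memory dictionaries; the table contents, column names, gene symbol, phenotype identifiers, and function names are all hypothetical placeholders and do not reflect the actual Helix Insight schema or storage layer.

```python
from collections import defaultdict

# Hypothetical stand-ins for local reference tables (illustrative values only).
GNOMAD = {
    # Key: (chromosome, position, reference allele, alternate allele)
    ("1", 12345, "A", "G"): {"gnomad_af": 0.0012},
}
CONSTRAINT = {
    # Key: gene symbol
    "GENE1": {"pli": 0.99, "loeuf": 0.21},
}
HPO_ROWS = [
    ("GENE1", "HP:EXAMPLE1"),
    ("GENE1", "HP:EXAMPLE1"),  # duplicate association, removed during aggregation
    ("GENE1", "HP:EXAMPLE2"),
]


def aggregate_hpo(rows):
    """Collapse (gene, term) rows into one sorted, deduplicated list per gene."""
    terms = defaultdict(set)
    for gene, term in rows:
        terms[gene].add(term)
    return {gene: sorted(ids) for gene, ids in terms.items()}


def annotate(variant, hpo_by_gene):
    """Return a copy of the variant with reference columns attached."""
    out = dict(variant)
    # Positional match: exact key on chromosome, position, ref, and alt.
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    out.update(GNOMAD.get(key, {"gnomad_af": None}))
    # Gene-level joins: keyed on gene symbol only.
    gene = variant.get("gene")
    out.update(CONSTRAINT.get(gene, {"pli": None, "loeuf": None}))
    out["hpo_terms"] = hpo_by_gene.get(gene, [])
    return out


variant = {"chrom": "1", "pos": 12345, "ref": "A", "alt": "G", "gene": "GENE1"}
annotated = annotate(variant, aggregate_hpo(HPO_ROWS))
print(annotated)
```

Because every reference column is copied onto the variant record, later stages can read fields such as `gnomad_af` directly from the record without touching the reference tables again. Variants with no match receive empty values rather than being dropped.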

In This Section

gnomAD

Population allele frequencies from 807,162 individuals across 8 genetic ancestry groups.

ClinVar

Clinical significance assertions from submitting laboratories worldwide.

dbNSFP

Functional predictions and conservation scores for all possible coding SNVs.

HPO

Gene-phenotype associations from the Human Phenotype Ontology.

ClinGen

Gene dosage sensitivity curation from the Clinical Genome Resource.

Ensembl VEP

Variant Effect Predictor for consequence annotation and transcript selection.

SpliceAI Precomputed

Precomputed splice impact delta scores for all coding variants.

Update Policy

How and when reference databases are updated, validated, and versioned.

For details on how these databases are combined during ACMG classification, see the Criteria Reference.