Helix Insight

PubMed Coverage

The literature database is derived from the complete NCBI PubMed corpus, filtered to retain only genetics-relevant publications. This page describes what is included, what is excluded, and how the database is maintained.

Source

Source: NCBI PubMed (ftp.ncbi.nlm.nih.gov)
Total PubMed: Approximately 35 million articles
After Filtering: Approximately 2-3 million genetics-relevant articles
Filter Pass Rate: 7-8% of total PubMed corpus
Date Range: 1990 to present
Update Frequency: Daily (PubMed update files, approximately 15 minutes)

Genetics Relevance Filter

Articles must have at least one MeSH descriptor from the following curated set of genetics-relevant terms:

Mutation

Polymorphism, Single Nucleotide

Genetic Variation

Sequence Analysis, DNA

Exome Sequencing

Whole Genome Sequencing

Genome-Wide Association Study

Pharmacogenetics

MeSH descriptors are assigned by NLM (National Library of Medicine) indexers and represent curated, high-confidence topic annotations, which makes them the most reliable signal for genetics relevance.
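The "at least one descriptor" rule can be illustrated with a minimal Python sketch. The function name and the case-sensitive exact string comparison are assumptions made for illustration; only the descriptor set comes from the list above.

```python
# Curated descriptor set, as listed in this section.
GENETICS_MESH_TERMS = frozenset({
    "Mutation",
    "Polymorphism, Single Nucleotide",
    "Genetic Variation",
    "Sequence Analysis, DNA",
    "Exome Sequencing",
    "Whole Genome Sequencing",
    "Genome-Wide Association Study",
    "Pharmacogenetics",
})


def is_genetics_relevant(mesh_descriptors):
    """Return True if at least one descriptor is in the curated set.

    Hypothetical helper: comparison is exact and case-sensitive,
    which is an assumption rather than documented behavior.
    """
    return not GENETICS_MESH_TERMS.isdisjoint(mesh_descriptors)


print(is_genetics_relevant(["Humans", "Mutation"]))       # True
print(is_genetics_relevant(["Humans", "Heart Failure"]))  # False
```

An article with no MeSH descriptors at all fails the check, since an empty list shares nothing with the curated set.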

Excluded Publication Types

The following publication types are excluded regardless of MeSH descriptors, as they typically do not contain original clinical or functional data:

Editorials

Letters

News articles

Published errata

Comments

Retracted publications
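As a rough sketch of how the exclusion takes precedence over the MeSH check, the Python below rejects an article whenever any of its publication types is on the excluded list. The strings follow PubMed's publication type labels, but the exact values used by the pipeline, and the function names, are assumptions.

```python
# Publication type labels corresponding to the list above (assumed spellings).
EXCLUDED_PUBLICATION_TYPES = frozenset({
    "Editorial",
    "Letter",
    "News",
    "Published Erratum",
    "Comment",
    "Retracted Publication",
})


def is_excluded(publication_types):
    """Return True if any publication type is on the excluded list."""
    return not EXCLUDED_PUBLICATION_TYPES.isdisjoint(publication_types)


def keep_article(has_genetics_mesh, publication_types):
    """Hypothetical combined check: exclusion wins regardless of MeSH."""
    return has_genetics_mesh and not is_excluded(publication_types)


print(keep_article(True, ["Journal Article"]))            # True
print(keep_article(True, ["Journal Article", "Comment"]))  # False
```

Note that a single excluded type is enough to drop the article, even when it also carries an accepted type such as "Journal Article".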

What Is NOT Covered

Very recent publications

Publications indexed by PubMed within the last 1-2 days may not yet be available. Daily updates typically process new entries within 24 hours.

Preprints

Preprint servers (bioRxiv, medRxiv) are not included. Only peer-reviewed publications indexed in PubMed are covered.

Non-genetics literature

Publications without genetics-relevant MeSH descriptors are excluded. A cardiology paper without genetic context will not appear even if it mentions a gene incidentally.

Pre-1990 publications

Articles published before 1990 are excluded. While some older studies remain relevant, the vast majority of clinically actionable genetics literature has been published since 1990.

Full-text content

Only titles, abstracts, and MeSH descriptors are indexed. Full-text articles are not downloaded. The PubMed Central ID is provided for publications with open-access full text.
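To make the indexed fields concrete, the record below is one possible shape for a stored article. The class and field names are hypothetical and the identifier values are placeholders; the source states only that titles, abstracts, MeSH descriptors, and (where available) the PubMed Central ID are kept.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IndexedArticle:
    """Hypothetical record holding only the fields described above."""
    pmid: int
    title: str
    abstract: str
    mesh_descriptors: Tuple[str, ...] = ()
    # Present only for publications with open-access full text.
    pmcid: Optional[str] = None

    @property
    def has_open_access_full_text(self):
        return self.pmcid is not None


# Placeholder values, not real identifiers.
article = IndexedArticle(
    pmid=12345678,
    title="Example title",
    abstract="Example abstract.",
    mesh_descriptors=("Mutation",),
    pmcid="PMC0000000",
)
print(article.has_open_access_full_text)  # True
```

The record carries no full-text field, which mirrors the statement that full-text articles are not downloaded.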

Gene Validation

Gene symbols extracted from publications are validated against the human protein-coding gene database. This strict validation uses a three-layer filter:

Blacklist filtering

Common abbreviations (DNA, RNA, PCR, MRI, HIV, ELISA) and non-gene entities are rejected before validation.

Protein-coding gene verification

Each candidate symbol is verified as a human protein-coding gene. Pseudogenes, antisense transcripts, and non-coding RNA genes are excluded.

Exact symbol matching

The extracted symbol must exactly match an HGNC-approved gene symbol. Partial matches and aliases are rejected to prevent false associations.
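The three layers can be sketched in Python as follows. The lookup table is a tiny hand-written stand-in for an HGNC-derived gene table, and the function name, locus-type labels, and case-sensitive matching are assumptions; only the blacklist entries and the order of checks come from the description above.

```python
# Layer 1 data: abbreviations named in this section.
BLACKLIST = frozenset({"DNA", "RNA", "PCR", "MRI", "HIV", "ELISA"})

# Illustrative stand-in for an HGNC-derived table:
# approved symbol -> locus type (labels are assumed, not HGNC's exact wording).
APPROVED_SYMBOLS = {
    "BRCA1": "protein-coding",
    "TP53": "protein-coding",
    "PTENP1": "pseudogene",
    "HOTAIR": "non-coding RNA",
}


def validate_gene_symbol(candidate):
    """Return True only if the candidate passes all three layers."""
    # Layer 1: reject blacklisted abbreviations before any lookup.
    if candidate in BLACKLIST:
        return False
    # Layer 3 (exact match): a dict lookup accepts no aliases or partials.
    locus_type = APPROVED_SYMBOLS.get(candidate)
    if locus_type is None:
        return False
    # Layer 2: only protein-coding genes are kept.
    return locus_type == "protein-coding"


for symbol in ["BRCA1", "DNA", "PTENP1", "p53", "BRCA"]:
    print(symbol, validate_gene_symbol(symbol))
```

In this sketch the alias "p53" and the partial symbol "BRCA" are both rejected because neither is an exact key in the table, while "PTENP1" matches exactly but fails the protein-coding check.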

Database Updates

The literature database uses a blue-green deployment pattern: daily updates are ingested into a separate database, verified, and then atomically promoted to production. This ensures zero-downtime updates and enables instant rollback if an update introduces issues. The baseline database is rebuilt periodically to incorporate PubMed's retroactive corrections and retractions.
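The promote-and-rollback flow can be illustrated with a small in-memory Python sketch. It models each database as a dictionary and the promotion as a single reference swap; the class, its method names, and the verification callback are hypothetical, and the real promotion mechanism is not described here.

```python
class BlueGreenStore:
    """Toy model of blue-green promotion using in-memory dicts."""

    def __init__(self, baseline):
        self._live = dict(baseline)  # serves reads
        self._previous = None        # kept for rollback after a promotion

    def read(self, pmid):
        return self._live.get(pmid)

    def apply_update(self, update, verify):
        # Ingest into a separate copy so the live data is never half-updated.
        staged = {**self._live, **update}
        # Verify before promotion; a failed check leaves production untouched.
        if not verify(staged):
            return False
        # Promote: readers switch from the old copy to the new one in one step.
        self._previous, self._live = self._live, staged
        return True

    def rollback(self):
        # Restore the copy that was live before the last promotion.
        if self._previous is None:
            return False
        self._live, self._previous = self._previous, None
        return True


store = BlueGreenStore({1: "article one"})
store.apply_update({2: "article two"}, verify=lambda db: len(db) == 2)
print(store.read(2))  # article two
store.rollback()
print(store.read(2))  # None
```

Because the update is built and checked off to the side, a failed verification costs nothing in production, and keeping the previous copy around is what makes rollback immediate.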