PubMed Coverage
The literature database is derived from the complete NCBI PubMed corpus, filtered to retain only genetics-relevant publications. This page describes what is included, what is excluded, and how the database is maintained.
Source
Genetics Relevance Filter
Articles must have at least one MeSH descriptor from the following curated set of genetics-relevant terms:
Mutation
Polymorphism, Single Nucleotide
Genetic Variation
Sequence Analysis, DNA
Exome Sequencing
Whole Genome Sequencing
Genome-Wide Association Study
Pharmacogenetics
MeSH descriptors are assigned by National Library of Medicine (NLM) indexers and represent curated, high-confidence topic annotations, which makes them the most reliable available signal for genetics relevance.
Excluded Publication Types
The following publication types are excluded regardless of MeSH descriptors, as they typically do not contain original clinical or functional data:
Editorials
Letters
News articles
Published errata
Comments
Retracted publications
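The two rules above (at least one genetics-relevant MeSH descriptor, and no excluded publication type) can be sketched as a single predicate. This is a minimal illustration, not the actual implementation: the function name is hypothetical, and the publication-type strings are assumed to follow PubMed's own labels for the categories listed above.

```python
from typing import Iterable

# MeSH descriptors listed above as genetics-relevant.
GENETICS_MESH = frozenset({
    "Mutation",
    "Polymorphism, Single Nucleotide",
    "Genetic Variation",
    "Sequence Analysis, DNA",
    "Exome Sequencing",
    "Whole Genome Sequencing",
    "Genome-Wide Association Study",
    "Pharmacogenetics",
})

# Assumption: PubMed publication-type labels for the excluded categories.
EXCLUDED_PUBLICATION_TYPES = frozenset({
    "Editorial",
    "Letter",
    "News",
    "Published Erratum",
    "Comment",
    "Retracted Publication",
})


def passes_coverage_filter(
    mesh_descriptors: Iterable[str],
    publication_types: Iterable[str],
) -> bool:
    """Return True if an article would be retained under the rules above."""
    # Exclusion wins regardless of MeSH descriptors.
    if EXCLUDED_PUBLICATION_TYPES.intersection(publication_types):
        return False
    # Otherwise at least one genetics-relevant descriptor is required.
    return bool(GENETICS_MESH.intersection(mesh_descriptors))
```

Checking the exclusion first mirrors the "regardless of MeSH descriptors" wording: an editorial tagged with Mutation is still rejected. Matching here is exact and case-sensitive, which is an assumption of the sketch.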
What Is NOT Covered
Very recent publications
Publications indexed by PubMed within the last 1-2 days may not yet be available. Daily updates typically process new entries within 24 hours.
Preprints
Preprint servers (bioRxiv, medRxiv) are not included. Only peer-reviewed publications indexed in PubMed are covered.
Non-genetics literature
Publications without genetics-relevant MeSH descriptors are excluded. A cardiology paper without genetic context will not appear even if it mentions a gene incidentally.
Pre-1990 publications
Articles published before 1990 are excluded. While some older studies remain relevant, the vast majority of clinically actionable genetics literature was published from 1990 onward.
Full-text content
Only titles, abstracts, and MeSH descriptors are indexed. Full-text articles are not downloaded. For publications with open-access full text, the PubMed Central ID (PMCID) is provided.
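To make the indexed fields and the date cutoff concrete, the following sketch models one indexed article. The class, field, and function names are hypothetical and chosen only for illustration; the real schema is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Articles published before this year are excluded.
EARLIEST_PUBLICATION_YEAR = 1990


@dataclass(frozen=True)
class IndexedArticle:
    """Hypothetical record holding only the fields described as indexed."""

    title: str
    abstract: str
    mesh_descriptors: tuple[str, ...]
    publication_year: int
    # Set only when open-access full text exists; the text itself is not stored.
    pmcid: Optional[str] = None

    def has_open_access_full_text(self) -> bool:
        return self.pmcid is not None


def within_date_coverage(article: IndexedArticle) -> bool:
    """Return True if the article falls inside the 1990-onward window."""
    return article.publication_year >= EARLIEST_PUBLICATION_YEAR
```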
Gene Validation
Gene symbols extracted from publications are validated against the human protein-coding gene database. This strict validation uses a three-layer filter:
Blacklist filtering
Common abbreviations (DNA, RNA, PCR, MRI, HIV, ELISA) and non-gene entities are rejected before validation.
Protein-coding gene verification
Each candidate symbol is verified as a human protein-coding gene. Pseudogenes, antisense transcripts, and non-coding RNA genes are excluded.
Exact symbol matching
The extracted symbol must exactly match an HGNC-approved gene symbol. Partial matches and aliases are rejected to prevent false associations.
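The three layers can be sketched as one function. This is an illustration under stated assumptions: `APPROVED_SYMBOLS` is a tiny hypothetical stand-in for the gene reference data, the locus-type labels are invented for the example, and the blacklist contains only the abbreviations named above.

```python
from typing import Optional

# Abbreviations named above; a production blacklist would be longer.
BLACKLIST = frozenset({"DNA", "RNA", "PCR", "MRI", "HIV", "ELISA"})

# Hypothetical reference table: approved symbol -> locus type.
APPROVED_SYMBOLS = {
    "BRCA1": "protein-coding",
    "TP53": "protein-coding",
    "CFTR": "protein-coding",
    "BRCA1P1": "pseudogene",
    "MALAT1": "non-coding RNA",
}


def validate_gene_symbol(candidate: str) -> Optional[str]:
    """Return the symbol if it passes all three layers, otherwise None."""
    # Layer 1: reject common abbreviations before any lookup.
    if candidate in BLACKLIST:
        return None
    # Exact symbol matching: no aliases, no partial or case-insensitive matches.
    locus_type = APPROVED_SYMBOLS.get(candidate)
    if locus_type is None:
        return None
    # Protein-coding verification: drop pseudogenes and non-coding RNA genes.
    if locus_type != "protein-coding":
        return None
    return candidate
```

In this sketch the exact-match lookup runs before the protein-coding check because the locus type comes from the same table; the accepted set is the same either way. An alias such as `p53` or a partial symbol such as `BRCA` returns `None`, as does the pseudogene `BRCA1P1`.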
Database Updates
The literature database uses a blue-green deployment pattern: daily updates are ingested into a separate database, verified, and then atomically promoted to production. This ensures zero-downtime updates and enables instant rollback if an update introduces issues. The baseline database is rebuilt periodically to incorporate PubMed's retroactive corrections and retractions.
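One common way to get an atomic promotion step is to have readers open a symlink and swap its target with a single rename. The sketch below assumes SQLite database files on a POSIX filesystem and an `articles` table; none of these details come from this page, and the function names are hypothetical.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Optional


def verify_database(db_path: Path, min_articles: int) -> bool:
    """Hypothetical verification: integrity check passes and row count is plausible."""
    with closing(sqlite3.connect(db_path)) as conn:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return integrity == "ok" and count >= min_articles


def promote(candidate: Path, live_link: Path) -> Optional[Path]:
    """Atomically repoint live_link at candidate and return the previous target."""
    previous = Path(os.readlink(live_link)) if live_link.is_symlink() else None
    tmp_link = live_link.with_name(live_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()  # clear a leftover from an interrupted run
    os.symlink(candidate, tmp_link)
    # Renaming over the existing link is atomic on POSIX, so readers
    # always see either the old database or the new one.
    os.replace(tmp_link, live_link)
    return previous
```

A daily run would build the candidate file, call `verify_database` on it, and call `promote` only if verification succeeds. Rollback is then `promote(previous, live_link)` using the path returned by the earlier promotion.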