Database Queries
Helix AI can query two databases through natural language: the patient's classified variant database and the biomedical literature database. The assistant automatically determines which database to query based on your question, generates SQL, executes it, and incorporates the results into its response.
Two Databases
| Database | Content | Scale |
|---|---|---|
| Patient Variants | Classified variants with ACMG annotations, population frequencies, functional predictions, phenotype matches, and screening scores. | ~2.3M variants, 70 columns per variant |
| Biomedical Literature | PubMed publications with gene mentions, variant mentions, abstracts, MeSH terms, and publication metadata. | 1M+ publications, 400K gene mentions, 100K variant mentions |
How Queries Work
Question Analysis
The assistant determines whether the question requires a database query. Questions about specific patient data trigger a query; general genetics knowledge is answered directly.
SQL Generation
The question is sent to a specialized SQL generation module that translates natural language into DuckDB SQL. The generator has access to the complete database schema and uses a low sampling temperature (0.1), which keeps its output precise and close to deterministic.
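As a rough illustration of this step, the sketch below assembles a generation request from a schema description and a question. The prompt wording, the `build_sql_request` helper, the `variants` table name, and the four-column schema excerpt are all assumptions made for the example; only the temperature of 0.1 and the DuckDB target come from this page.

```python
SQL_TEMPERATURE = 0.1  # value stated in this documentation

# Hypothetical schema excerpt; the real schema has 70 columns per variant.
SCHEMA = {
    "variants": ["gene_symbol", "acmg_class", "gnomad_af", "priority_score"],
}


def build_sql_request(question, schema=SCHEMA):
    """Assemble a generation request: schema text, question, low temperature."""
    schema_lines = [
        f"TABLE {table} ({', '.join(columns)})"
        for table, columns in schema.items()
    ]
    prompt = (
        "Translate the question into a single DuckDB SELECT statement.\n"
        "Schema:\n" + "\n".join(schema_lines) + "\n"
        f"Question: {question}\nSQL:"
    )
    # The request format is illustrative, not the product's actual payload.
    return {"prompt": prompt, "temperature": SQL_TEMPERATURE}


request = build_sql_request("How many pathogenic variants are in BRCA1?")
print(request["prompt"])
```

Including the schema in every request is what lets the generator refer to real column names instead of guessing them.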
Execution
The SQL query runs against a read-only DuckDB connection with a 30-second timeout. Because the connection itself is read-only, the assistant cannot modify patient data.
Result Filtering
For detail queries (specific variants), results are filtered to 20 clinically essential columns out of 70, reducing token usage by approximately 70%. Aggregation queries preserve all columns.
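A minimal sketch of this filtering step, assuming results arrive as a list of row dictionaries. The `filter_result` helper and the `is_aggregation` flag are hypothetical, and `ESSENTIAL_COLUMNS` is only an illustrative subset, since this page does not list the 20 retained columns. Keeping 20 of 70 columns retains about 29% of the fields, which is in line with the stated reduction of roughly 70%.

```python
# Hypothetical subset; the documentation does not name the 20 essential columns.
ESSENTIAL_COLUMNS = [
    "gene_symbol",
    "hgvs_protein",
    "acmg_class",
    "gnomad_af",
    "clinvar_significance",
    "priority_score",
]


def filter_result(rows, is_aggregation, essential=ESSENTIAL_COLUMNS):
    """Keep only essential columns for detail queries; pass aggregations through."""
    if is_aggregation:
        return rows
    return [
        {col: row[col] for col in essential if col in row}
        for row in rows
    ]


detail_rows = [
    {
        "gene_symbol": "GENE_A",
        "acmg_class": "Pathogenic",
        "gnomad_af": 0.00001,
        "biotype": "protein_coding",  # not in the essential set, so dropped
        "dann_score": 0.99,           # not in the essential set, so dropped
    }
]
print(filter_result(detail_rows, is_aggregation=False))

aggregate_rows = [{"acmg_class": "Pathogenic", "variant_count": 12}]
print(filter_result(aggregate_rows, is_aggregation=True))
```

Aggregation results are passed through unchanged because their columns (counts, averages, group keys) are produced by the query itself and would not appear in a fixed allow-list.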
Response Integration
The assistant receives the query results and incorporates them into its clinical response, suggesting a visualization when the data is suitable for a chart.
Queryable Data
The patient variant database contains 70 columns per variant. The most commonly queried fields include:
| Category | Fields |
|---|---|
| Identity | gene_symbol, chromosome, position, hgvs_protein, hgvs_cdna, rsid, transcript_id |
| Classification | acmg_class, acmg_criteria, confidence_score |
| Consequence | consequence, impact, biotype, exon_number, domains |
| Population Frequency | gnomad_af, gnomad_popmax, gnomad_popmax_af, gnomad_hom |
| ClinVar | clinvar_significance, clinvar_review_status, stars |
| Functional Predictions | sift_prediction, alphamissense_prediction, metasvm_prediction, dann_score |
| Gene Constraint | gene_pli, gene_oe_lof, gene_loeuf |
| Phenotype | hpo_terms, hpo_count, hpo_phenotypes |
| Screening | priority_score, priority_tier |
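To show how these fields might combine in a generated query, the sketch below answers a question such as "Which rare pathogenic or likely pathogenic variants have the highest priority?". The column names come from the table above. The `variants` table name, the `acmg_class` label values, and the sample rows are placeholders invented for the example. It runs against an in-memory `sqlite3` database as a stand-in for DuckDB, since this particular statement uses only standard SQL.

```python
import sqlite3

EXAMPLE_SQL = """
SELECT gene_symbol, hgvs_protein, acmg_class, gnomad_af, priority_score
FROM variants
WHERE acmg_class IN ('Pathogenic', 'Likely pathogenic')
  AND gnomad_af < 0.001
ORDER BY priority_score DESC
LIMIT 10
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "gene_symbol TEXT, hgvs_protein TEXT, acmg_class TEXT, "
    "gnomad_af REAL, priority_score REAL)"
)
# Placeholder rows, not real patient or variant data.
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("GENE_A", "p.Arg10Ter", "Pathogenic", 0.00001, 0.95),
        ("GENE_B", "p.Gly20Ser", "Benign", 0.12, 0.10),
        ("GENE_C", "p.Leu30Pro", "Likely pathogenic", 0.0002, 0.80),
    ],
)

rows = conn.execute(EXAMPLE_SQL).fetchall()
for row in rows:
    print(row)
conn.close()
```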
Query Performance
| Operation | Typical Latency |
|---|---|
| SQL generation | 1-3 seconds |
| Variant database query | Under 200 milliseconds |
| Literature database query | Under 500 milliseconds |
| Total (generation + execution) | 2-4 seconds |
Safety
All database access is strictly read-only. The assistant cannot insert, update, or delete any data. Query execution has a 30-second timeout to prevent runaway queries. Results are capped at a safe size limit to maintain responsive conversation flow.
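The result cap could look something like the sketch below. The `cap_rows` helper and the limit of 100 rows are assumptions for illustration, as this page does not state the actual limit or whether it is measured in rows, bytes, or tokens.

```python
MAX_RESULT_ROWS = 100  # hypothetical value; the real limit is not documented here


def cap_rows(rows, max_rows=MAX_RESULT_ROWS):
    """Return at most `max_rows` rows and a flag saying whether any were dropped."""
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


all_rows = [{"gene_symbol": f"GENE_{i}"} for i in range(250)]
capped, was_truncated = cap_rows(all_rows)
print(len(capped), was_truncated)
```

Returning the truncation flag alongside the rows lets the response state that the result set was cut off, so a partial list is not presented as complete.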