Semantic Similarity

The matching algorithm uses semantic similarity, not exact string matching. "Seizure" and "Epilepsy" are recognized as related concepts because they share common ancestors in the HPO ontology graph. The algorithm quantifies how closely related any two clinical terms are.

Lin Similarity

Helix Insight uses Lin similarity, a well-established measure from information theory that produces scores on a 0 to 1 scale:

Lin(term1, term2) = 2 × IC(MICA) / (IC(term1) + IC(term2))

IC

Information Content -- how specific or rare a term is. "Focal clonic seizure" has higher IC than "Abnormality of the nervous system" because it is more specific.

MICA

Most Informative Common Ancestor -- the most specific term that is an ancestor of both terms in the ontology graph.

Range

0.0 (completely unrelated) to 1.0 (identical terms).
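
The formula and definitions above can be sketched in a few lines of Python. This is a minimal illustration, not Helix Insight's implementation: the term labels are real HPO names, but the tiny hierarchy in `PARENTS`, the values in `IC`, and the helper names `ancestors`, `mica`, and `lin` are all assumptions made up for the example.

```python
# Toy ontology: each term maps to its direct parents. The hierarchy and the
# IC values below are invented for illustration; real values come from HPO
# and disease annotations.
PARENTS = {
    "Phenotypic abnormality": set(),
    "Abnormality of the nervous system": {"Phenotypic abnormality"},
    "Seizure": {"Abnormality of the nervous system"},
    "Focal clonic seizure": {"Seizure"},
    "Febrile seizure": {"Seizure"},
    "Microcephaly": {"Phenotypic abnormality"},
}
IC = {
    "Phenotypic abnormality": 0.0,
    "Abnormality of the nervous system": 1.0,
    "Seizure": 3.0,
    "Focal clonic seizure": 5.0,
    "Febrile seizure": 4.0,
    "Microcephaly": 3.5,
}


def ancestors(term):
    """Return the term itself plus every ancestor reachable via PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(term1, term2):
    """Most Informative Common Ancestor: the shared ancestor with highest IC."""
    common = ancestors(term1) & ancestors(term2)
    return max(common, key=IC.get) if common else None


def lin(term1, term2):
    """Lin similarity: 2 * IC(MICA) / (IC(term1) + IC(term2))."""
    ancestor = mica(term1, term2)
    denominator = IC[term1] + IC[term2]
    if ancestor is None or denominator == 0:
        return 0.0  # no shared ancestor, or both terms carry no information
    return 2 * IC[ancestor] / denominator


print(round(lin("Focal clonic seizure", "Febrile seizure"), 2))
print(lin("Seizure", "Seizure"))
print(lin("Seizure", "Microcephaly"))
```

In this sketch a term counts as its own ancestor, which is what makes identical terms score 1.0. Terms whose only shared ancestor is the zero-IC root score 0.0.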

Information Content

Information Content (IC) measures how specific a term is based on its frequency in disease annotations. A term associated with many diseases (like "Abnormality of the nervous system") has low IC because it carries little diagnostic information. A term associated with few diseases (like "Agenesis of the corpus callosum") has high IC because observing it significantly narrows the differential diagnosis.

Helix Insight derives IC values from OMIM disease annotations, ensuring that information content reflects clinical significance rather than annotation frequency in other databases.
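
As a rough illustration of how frequency maps to IC, the sketch below uses the standard information-theoretic definition, IC = -ln(p), where p is the fraction of diseases annotated with a term. Whether Helix Insight uses exactly this formula is an assumption, and the disease counts and the function name `information_content` are hypothetical.

```python
import math


def information_content(disease_count, total_diseases):
    """IC = -ln(p), with p the fraction of diseases annotated with the term.

    Assumed formula: the documentation only states that IC is derived from
    annotation frequency, not the exact expression.
    """
    if disease_count <= 0 or disease_count > total_diseases:
        raise ValueError("disease_count must be between 1 and total_diseases")
    return -math.log(disease_count / total_diseases)


TOTAL_DISEASES = 8000  # hypothetical size of the annotated disease set

# Hypothetical annotation counts: a broad term versus a specific one.
broad = information_content(4000, TOTAL_DISEASES)   # annotated to half of all diseases
specific = information_content(80, TOTAL_DISEASES)  # annotated to 1% of diseases

print(f"broad term IC:    {broad:.2f}")
print(f"specific term IC: {specific:.2f}")
```

Under this definition a term annotated to every disease has an IC of zero, and IC grows as the term becomes rarer.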

Set Similarity: Best-Match Average

A patient has multiple HPO terms and a gene may be associated with dozens of HPO terms. The algorithm compares these two sets using a best-match average approach:

1. For each patient HPO term, compute Lin similarity against every gene HPO term.
2. Select the highest similarity score for each patient term (the best match in the gene's profile).
3. Average all best-match scores across the patient's HPO terms.
4. Normalize to a 0-100 scale for the final phenotype match score.

A patient term counts as a significant match when its best-match similarity score exceeds 0.5. The total number of significant matches is reported alongside the overall score.
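
The best-match average procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `best_match_average` is a hypothetical name, the `similarity` argument stands in for any pairwise Lin function, the placeholder terms and scores in `PAIR_SCORES` are invented, and significance is counted on each patient term's best-match score.

```python
def best_match_average(patient_terms, gene_terms, similarity, threshold=0.5):
    """Return (score on a 0-100 scale, number of significant matches).

    similarity(patient_term, gene_term) must return a value in [0, 1].
    """
    if not patient_terms or not gene_terms:
        return 0.0, 0
    # For each patient term, compare against every gene term and keep the best.
    best_scores = [
        max(similarity(p, g) for g in gene_terms) for p in patient_terms
    ]
    # Average the best matches, then normalize to 0-100.
    score = 100 * sum(best_scores) / len(best_scores)
    # Count patient terms whose best match exceeds the significance threshold.
    significant = sum(1 for s in best_scores if s > threshold)
    return score, significant


# Invented pairwise scores for two patient terms and two gene terms.
PAIR_SCORES = {
    ("P1", "G1"): 0.9,
    ("P1", "G2"): 0.4,
    ("P2", "G1"): 0.2,
    ("P2", "G2"): 0.3,
}


def lookup(patient_term, gene_term):
    """Stand-in for a real Lin similarity function."""
    return PAIR_SCORES.get((patient_term, gene_term), 0.0)


score, significant = best_match_average(["P1", "P2"], ["G1", "G2"], lookup)
print(round(score, 1), significant)
```

Note that the average runs over patient terms only, so extra gene terms that match nothing in the patient do not lower the score.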

Worked Example

Patient HPO terms: Seizure (HP:0001250), Intellectual disability (HP:0001249), Microcephaly (HP:0000252). Gene SCN1A HPO profile includes: Febrile seizure (HP:0002373), Epileptic encephalopathy (HP:0200134), Global developmental delay (HP:0001263), and 12 other terms.

Patient Term	Best Gene Match	Lin Score
Seizure	Febrile seizure	0.82
Intellectual disability	Global developmental delay	0.71
Microcephaly	(no close match)	0.15

Average: (0.82 + 0.71 + 0.15) / 3 = 0.56. Normalized score: 56/100. Significant matches: 2 of 3.

Why Lin Similarity?

Lin similarity normalizes the Resnik score, IC(MICA), by the information content of the two terms being compared. This bounds the result to a 0-1 range and makes it directly comparable across term pairs regardless of their position in the ontology hierarchy. Raw Resnik scores, by contrast, have no fixed upper bound and are difficult to interpret clinically. The OMIM-based information content ensures that IC values reflect clinical significance rather than database annotation practices.
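
A small numerical sketch makes the comparability point concrete. It compares each term with itself, where the MICA is the term itself, using invented IC values. The function names are hypothetical.

```python
def resnik_score(ic_mica):
    """Resnik similarity is simply the IC of the MICA."""
    return ic_mica


def lin_score(ic_term1, ic_term2, ic_mica):
    """Lin similarity: Resnik score normalized by the two terms' IC."""
    denominator = ic_term1 + ic_term2
    return 2 * ic_mica / denominator if denominator else 0.0


# Invented IC values for a broad term and a specific term.
for label, ic in [("broad term", 1.0), ("specific term", 5.0)]:
    # Comparing a term with itself: MICA is the term, so IC(MICA) == IC(term).
    print(label, "Resnik:", resnik_score(ic), "Lin:", lin_score(ic, ic, ic))
```

Both comparisons are perfect matches, yet the Resnik score differs with the depth of the term, while the Lin score is 1.0 in both cases.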

Reference

Lin D. "An information-theoretic definition of similarity." Proceedings of the 15th International Conference on Machine Learning (ICML). 1998;296-304.