Using gene and phenotype ontologies to uncover hidden drug side effects that standard safety reporting systems miss.
Think of it like this: When you take a medication, it's designed to hit one specific target in your body. But drugs are imprecise: they can also interact with other proteins and biological systems, causing unexpected side effects. These are called off-target effects. Our project builds a system that can predict which side effects are likely and why they happen, using biology as the explanation.
The FDA's FAERS database tracks which side effects get reported after a drug is approved — but it only tells us what happened, not why. A drug causing 500 liver injury reports doesn't tell us which biological pathway caused it.
Some dangerous side effects are rarely reported, not simply because they don't happen, but because they're hard to connect to a drug, or simply overlooked. Our framework uses biological pathway knowledge to surface these hidden signals before they become clinical problems.
Can we use structured knowledge about genes and biological processes to predict drug side effects, and to explain why they happen, beyond what frequency-based reporting alone can reveal?
In simpler terms: instead of just counting how often side effects are reported, can we trace the molecular chain of events, from a drug binding to a protein, to a biological process going wrong, to a patient experiencing harm?
We built a knowledge graph that maps the full chain from a drug's molecular interactions to real-world side effects. Instead of treating these as unrelated data points, we trace the biological route that connects them.
The map analogy: Imagine navigating a city. You could guess traffic patterns by counting cars on a road (FAERS frequency). Or you could use an actual map of streets, highways, and traffic lights (our knowledge graph). The map tells you why the traffic is there — not just that it is.
Each arrow represents a known biological relationship. By chaining them together, we can reason about why a drug might cause a particular side effect, not just correlate the two.
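The chaining idea can be sketched in a few lines of Python. This is a hypothetical miniature of the knowledge graph, not the project's actual schema: node names, relation labels, and the fixed drug-to-protein-to-process-to-theme depth are illustrative assumptions.

```python
# Hypothetical miniature knowledge graph: adjacency lists of (relation, target)
# pairs. All entity names and relation labels are illustrative placeholders.
EDGES = {
    "Pralsetinib": [("binds", "RET"), ("binds", "JAK2")],
    "RET": [("participates_in", "GO:apoptotic_process")],
    "JAK2": [("participates_in", "GO:cytokine_signaling")],
    "GO:apoptotic_process": [("maps_to", "Cell Death/Apoptosis")],
    "GO:cytokine_signaling": [("maps_to", "Immune Toxicity")],
}

def mechanistic_paths(drug):
    """Enumerate drug -> protein -> GO process -> toxicity theme chains."""
    paths = []
    for _rel1, protein in EDGES.get(drug, []):
        for _rel2, process in EDGES.get(protein, []):
            for _rel3, theme in EDGES.get(process, []):
                paths.append((drug, protein, process, theme))
    return paths

for path in mechanistic_paths("Pralsetinib"):
    print(" -> ".join(path))
```

Each returned tuple is one candidate explanation: the drug binds a protein, that protein participates in a biological process, and that process maps to an observable toxicity theme.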
We combined four publicly available databases to build our reasoning network and validate our findings, each contributing a different layer of knowledge.
Tells us which proteins Pralsetinib binds to inside the body. Think of proteins as locks: DrugBank maps which locks this drug's key fits into, including ones it wasn't designed for. We identified 11 protein targets for Pralsetinib.
A standardized biological dictionary that categorizes what each protein does, whether it's involved in cell death, immune signaling, or DNA repair. We parsed 38,736 GO terms and 13,909 parent-child relationships to build the knowledge graph.
A database that maps individual human proteins to specific GO biological process terms. While GO defines what processes exist, GOA tells us which proteins participate in each process. Used for both the knowledge graph (159 GO terms across 3 primary targets) and as the statistical background for the mechanistic enrichment analysis (34,737 annotated human proteins).
The FDA's real-world side effect database. After a drug hits the market, doctors, patients, and companies submit reports of observed adverse events. We extracted 2,196 reports across 200 distinct adverse events for Pralsetinib; this is our ground-truth frequency signal.
This interactive visualization shows Pralsetinib's full web of connections: which proteins it touches, which biological processes those proteins control, and which side effects emerge at the end of those chains. Click any node to explore its neighborhood. Colors represent different types of entities in the network.
We built a progressive pipeline starting from a simple frequency count, adding biological reasoning, then directly measuring how independent those two signals are. A fourth model uses the full knowledge graph topology to predict drug-target relationships using graph neural network embeddings.
"What side effects are reported most often?"
We rank side effect themes purely by how many reports appear in the FDA database. This is the standard approach: useful, but ultimately limited. High counts may reflect reporting bias, not actual biological risk.
"What does biology say should be risky?"
We combine FAERS report frequency with the number of biological pathway connections in our knowledge graph. A side effect with few reports but many biological links gets elevated, revealing potentially underreported signals.
"Are biology and reporting measuring different things?"
We fit two logistic regression models: one using only knowledge graph features, one adding FAERS frequency. We evaluated both with leave-one-out cross-validation (LOOCV). The near-chance LOOCV AUC of the KG-only model showed that the scientific value of KG features lies in their structural independence from FAERS, not in replacing it.
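A minimal sketch of the LOOCV setup, assuming scikit-learn is available. The features and labels below are random stand-ins for the real KG features and seriousness labels; with only 13 themes, each fold trains on 12 and predicts the held-out one, and the AUC is computed over the pooled held-out predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-ins: 13 themes x 3 KG features (the real pipeline uses
# features like path_count and go_overlap_ratio; these are random placeholders).
rng = np.random.default_rng(0)
X = rng.random((13, 3))
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1])  # placeholder labels

# Hold out each theme in turn, train on the other 12, score the held-out one.
probs = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]

auc = roc_auc_score(y, probs)  # near 0.5 expected for uninformative features
print(f"LOOCV AUC: {auc:.3f}")
```

With random features, the pooled AUC hovers near chance, which is exactly the diagnostic behavior the KG-only model exhibited.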
"Can the full graph topology predict drug-target relationships?"
A graph convolutional network is trained on the full knowledge graph, learning node embeddings from the entire topology of drug–protein–pathway–disease connections. It predicts drug-target relationships using the structure of the graph itself, rather than hand-crafted path features.
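The core GCN propagation rule can be illustrated without a deep learning framework. This NumPy sketch performs one symmetrically normalized message-passing step (in the style of Kipf and Welling) over a toy four-node graph and scores a candidate drug-protein link by an embedding dot product; the graph, feature sizes, and untrained random weights are all illustrative assumptions.

```python
import numpy as np

# Toy 4-node graph (0: drug, 1: protein, 2: GO term, 3: theme), undirected.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
A_hat = A + np.eye(4)                       # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
H = rng.random((4, 8))                      # initial node features
W = rng.random((8, 2))                      # weights (random, untrained here)
H_next = np.maximum(norm @ H @ W, 0.0)      # propagate + ReLU

# Link-prediction score for the drug-protein pair: embedding dot product.
score = float(H_next[0] @ H_next[1])
print(f"drug-protein link score: {score:.3f}")
```

In the real model the weights are learned by gradient descent on observed edges; the trade-off noted above is that these dense embeddings, unlike explicit paths, do not directly read out as a mechanistic explanation.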
Independent validation via Mechanistic Enrichment Analysis (Fisher's Exact Test against 34,737 GOA proteins):
The knowledge graph encodes four node types: Drug (Pralsetinib, CHEMBL4582651), Protein targets (JAK2, FLT3, RET — 11 total from DrugBank), GO Biological Process terms (159 terms), and Toxicity Themes (13 categories). Model 2 uses a Bayesian noisy-OR framework: posterior ∝ prior(FAERS) × noisy-OR(KG paths), with base_prob=0.18 and alpha_prior=1.5. Model 3A uses KG-only features (path_count, go_overlap_ratio, max_path_score, mean_path_score, theme_specificity, n_proteins, has_direct_maps_to); Model 3B adds log_faers. LOOCV holds out each of the 13 themes in turn. Model 4 applies a graph convolutional network over the full knowledge graph topology to learn node embeddings and predict drug-target relationships; interpretability is reduced compared to path-based approaches. Independent mechanistic validation applies one-sided Fisher's Exact Test against 34,737 annotated human proteins from the GO Annotation (GOA) database, with Benjamini-Hochberg FDR correction (threshold: FDR < 0.05).
Report frequency and biological pathway evidence are nearly uncorrelated (Spearman ρ ≈ 0.18) across 13 toxicity themes. This is the project's central finding: our biology-based approach isn't replicating FAERS; it's discovering genuinely new information.
A 2×2 quadrant framework classifies every theme by KG evidence vs. FAERS frequency:
Cell Death/Apoptosis was nominated by KG path scoring (rank 5, 12 paths, score 0.55) despite ranking 13th in FAERS. A completely separate Fisher's Exact Test enrichment analysis, testing Pralsetinib's 11 protein targets against 34,737 human proteins in the GO background, independently confirmed it:
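The enrichment test amounts to a one-sided Fisher's Exact Test on a 2x2 table per GO process, followed by Benjamini-Hochberg correction across processes. This sketch assumes SciPy; the contingency counts are illustrative placeholders, not the project's actual results, and the BH correction is implemented directly rather than via a library.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for one GO process (counts illustrative only):
#                       in process        not in process
# drug targets (11)          6                   5
# other proteins        k_in_proc - 6    34737 - 11 - (k_in_proc - 6)
k_in_proc = 800                 # assumed: background proteins in the process
table = [[6, 5],
         [k_in_proc - 6, 34737 - 11 - (k_in_proc - 6)]]
odds_ratio, p = fisher_exact(table, alternative="greater")

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values: q_i = min_{j>=i} p_(j) * m / j."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end        # 1-based rank of this p-value
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

qs = bh_qvalues([0.0004, 0.03, 0.2, 0.6])   # p-values across GO processes
print(odds_ratio, p, qs)
```

A process survives if its q-value stays below the FDR < 0.05 threshold; because the test only uses GOA annotations, it is independent of the knowledge graph's path structure.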
The LOOCV logistic regression produced near-chance AUC on KG-only features, which is expected with only 13 themes from a single drug. But this is a diagnostic finding, not a failure: when FAERS frequency is added as a feature, it captures over 95% of the model weight, completely masking the KG signal.
KG-derived mechanistic features and post-marketing report frequency are near-uncorrelated (ρ ≈ 0.18). Two independent analyses, KG path scoring and GO enrichment testing, both nominate Cell Death/Apoptosis as an underreported risk despite its rank-13 FAERS position. Together, these findings demonstrate that ontology integration surfaces complementary signal that frequency-based pharmacovigilance structurally cannot.
Most AI tools in healthcare produce a probability such as "70% chance of side effect X." This framework produces an explanation: "this drug causes side effect X because it binds protein Y, which disrupts biological process Z."
That distinction matters for clinicians, regulators, and patients who need to understand and trust safety predictions and not just receive them.
Known constraints of this study:
A reference for anyone new to the biology or data science concepts in this project.
A side effect caused by a drug interacting with proteins it wasn't designed to interact with. Like a key accidentally opening the wrong lock.
FDA Adverse Event Reporting System. A database of post-market side effect reports submitted by doctors, patients, and pharmaceutical companies after a drug is approved.
A network of connected facts represented as nodes (entities) and edges (relationships). Enables structured reasoning across large datasets by following chains of relationships.
A standardized biological dictionary that classifies what proteins do — grouping them into categories like "cell death," "immune response," or "DNA repair."
A sequence of molecular events inside a cell — like a chain reaction where protein A activates protein B, which triggers effect C. Disrupting one step can cause downstream harm.
An AI approach combining neural networks (which find patterns in data) with symbolic reasoning (which follows logical rules). It gains interpretability from the symbolic component.
Area Under the Receiver Operating Characteristic curve. A measure of classifier performance: 0.5 = random guessing, 1.0 = perfect. An AUC of 0.644 is meaningfully better than chance.
A statistical measure of how closely two rankings agree. Values near 0 mean they're essentially independent signals; near 1 means they move together perfectly.
False Discovery Rate. When running many statistical tests, FDR adjustment controls how often we'd expect a "significant" result by pure chance. A q-value < 0.05 is considered reliable.
A targeted cancer drug approved for certain types of lung and thyroid cancer. It works by blocking a specific mutated protein (RET kinase). Our case study drug.
A machine learning model that predicts a binary outcome (serious / not serious) by finding which input features best separate the two groups. Interpretable and well-suited for medical data.
Leave-one-out cross-validation holds out one data point at a time as the test case and trains on the rest. The most exhaustive form of cross-validation, used here because we only had 13 toxicity themes to work with.
A statistical test that checks whether a group of proteins is over-represented in a biological process compared to all human proteins. An odds ratio of 9.1 means Pralsetinib targets have roughly 9× higher odds of being involved in cell death than a random set of proteins.
A network where arrows flow in one direction and never loop back. The Gene Ontology is structured as a DAG — "apoptosis" is a child of "cell death," which is a child of broader biological processes. This lets us traverse the hierarchy to collect all related terms.
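Traversing the GO hierarchy to collect all related terms is a simple upward walk over the DAG. This sketch uses a hypothetical fragment of the is_a hierarchy stored as a child-to-parents map; real GO term IDs and the full parent table are omitted.

```python
# Hypothetical fragment of the GO is_a hierarchy (child -> list of parents).
PARENTS = {
    "apoptotic process": ["cell death"],
    "cell death": ["cellular process"],
    "cellular process": ["biological_process"],
}

def ancestors(term):
    """Collect every ancestor of a GO term by walking is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:      # guard against revisiting shared parents
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("apoptotic process"))
```

Because GO is acyclic, the walk always terminates, and the `seen` set handles the diamond shapes a DAG allows (one term reachable through multiple parents).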
A statistical analysis testing whether a drug's protein targets cluster within specific biological processes more than expected by chance. Provides independent validation that a predicted biological mechanism is real — not just an artifact of the knowledge graph structure.
DSC 180B Capstone · UC San Diego · 2025–2026