A large language model framework for literature-based disease–gene association prediction
Journal
Briefings in Bioinformatics
Journal Volume
26
Journal Issue
1
ISSN
1467-5463
1477-4054
Date Issued
2025-02-25
Author(s)
Abstract
With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome the obstacles of LLM development in biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology with LLMs that models literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. LORE captured essential gene pathogenicity information when applied to PubMed abstracts for large-scale understanding of disease–gene relationships. We demonstrated that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
Subjects
biomedical relation extraction
knowledge graph
large language model
literature mining
NLP
Publisher
Oxford University Press (OUP)
Description
Article number bbaf070
Type
journal article
