A large language model framework for literature-based disease–gene association prediction

Peng-Hsuan Li; Yih-Yun Sun; HSUEH-FEN JUAN; CHIEN-YU CHEN; Huai-Kuang Tsai; Jia-Hsin Huang

doi:10.1093/bib/bbaf070

Journal

Briefings in Bioinformatics

Journal Volume

26

Journal Issue

1

ISSN

1467-5463

1477-4054

Date Issued

2025-02-25

Author(s)

Peng-Hsuan Li

Yih-Yun Sun

HSUEH-FEN JUAN

CHIEN-YU CHEN

Huai-Kuang Tsai

Jia-Hsin Huang

DOI

10.1093/bib/bbaf070

URI

https://www.scopus.com/record/display.uri?eid=2-s2.0-85219524015&origin=resultslist

https://scholars.lib.ntu.edu.tw/handle/123456789/726033

Abstract

With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome the obstacles of LLM development in biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology with LLMs that models literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. LORE captured essential gene pathogenicity information when applied to PubMed abstracts for large-scale understanding of disease–gene relationships. We demonstrated that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
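For readers unfamiliar with the metric quoted in the abstract, the sketch below shows one common way to compute mean average precision over per-disease gene rankings. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the toy disease and gene identifiers, and the choice to normalize by the number of relevant genes are all hypothetical, and this record does not describe the exact protocol used in the paper.

```python
from typing import Dict, List, Set


def average_precision(ranked_genes: List[str], relevant: Set[str]) -> float:
    """Average precision for one disease.

    Averages precision@k over the ranks k at which a relevant gene appears,
    normalized by the number of relevant genes (an assumed convention).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Mean of per-disease average precision over diseases with at least one label."""
    scores = [
        average_precision(rankings.get(disease, []), genes)
        for disease, genes in labels.items()
        if genes
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: ranked candidate genes per disease and the
# genes labeled as relevant (e.g. by a curated database).
toy_rankings = {
    "disease_A": ["G1", "G2", "G3", "G4"],
    "disease_B": ["G5", "G6", "G7"],
}
toy_labels = {
    "disease_A": {"G1", "G3"},
    "disease_B": {"G6"},
}

toy_map = mean_average_precision(toy_rankings, toy_labels)
print(f"Toy MAP: {toy_map:.4f}")
```

In this toy case, disease_A has relevant genes at ranks 1 and 3 and disease_B has one at rank 2, so the score rewards rankings that place known disease genes near the top of each list.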

Subjects

biomedical relation extraction

knowledge graph

large language model

literature mining

NLP

SDGs

[SDGs]SDG3

[SDGs]SDG4

Publisher

Oxford University Press (OUP)

Description

Article number bbaf070

Type

journal article
