variant2literature: full text literature search for genetic variants
Journal
bioRxiv
Date Issued
2019-03-21
Author(s)
Lin, Yin-Hung
Lu, Yu-Chen
Chen, Ting-Fu
Lee, Ko-Han
Cheng, Yi-Wei
Fan, Jhih-Sheng
Tu, Chien-Ta
Hsu, Chen-Ming
Chou, Chih-Chen
Tu, Yi-Chin Ethan
Abstract
Motivation
Whole genome sequencing (WGS) by next-generation sequencing identifies millions of variants in a single individual. Retrieving the biomedical literature for such a large number of genetic variants remains challenging, because in many cases the variants appear only in tables embedded as images or in supplementary documents, which come in diverse file formats.
Results
The proposed tool, variant2literature, part of TaiGenomics (Toolkits for AI genomics), addresses this problem by combining text recognition with image processing. Beyond image-based text retrieval, variant normalization further improves recall when searching for literature that contains the variants of interest: different variant representations are transformed into chromosome coordinates (standard VCF format), so that false negatives can be largely avoided. variant2literature is available in two ways. First, a web-based interface searches all the literature in the PMC Open Access Subset. Second, a command-line executable can be downloaded so that users can search all the files in a specified local directory.
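To illustrate the idea of variant normalization, the following is a minimal sketch and not the implementation used by variant2literature. It assumes only single-nucleotide substitutions written in a few genomic notations, ignores the genome build, and does not handle transcript-level (c.) or protein-level (p.) descriptions, which would require an annotation database. The names `Variant` and `normalize` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    """A VCF-style key: chromosome, 1-based position, reference and alternate base."""
    chrom: str
    pos: int
    ref: str
    alt: str


# HGVS genomic notation with a chromosome name, e.g. "chr7:g.140453136A>T".
_HGVS_G = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|[XY])\s*:\s*g\.(?P<pos>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$",
    re.IGNORECASE,
)

# HGVS genomic notation with a RefSeq chromosome accession,
# e.g. "NC_000007.13:g.140453136A>T".
_REFSEQ_G = re.compile(
    r"^NC_0*(?P<num>\d+)\.\d+\s*:\s*g\.(?P<pos>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$",
    re.IGNORECASE,
)

# Loose VCF-like strings, e.g. "7-140453136-A-T" or "7:140453136 A/T".
_VCF_LIKE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|[XY])[:\-\s]+(?P<pos>\d+)[:\-\s]+"
    r"(?P<ref>[ACGT])\s*[>/\-:]\s*(?P<alt>[ACGT])$",
    re.IGNORECASE,
)


def _chrom_from_accession(number: int) -> str:
    # Human RefSeq accessions NC_000001..NC_000022 are chromosomes 1..22;
    # NC_000023 and NC_000024 are X and Y.
    return {23: "X", 24: "Y"}.get(number, str(number))


def normalize(text: str) -> Optional[Variant]:
    """Map a supported variant string to a Variant, or None if unrecognized."""
    s = text.strip()
    m = _REFSEQ_G.match(s)
    if m:
        chrom = _chrom_from_accession(int(m["num"]))
    else:
        m = _HGVS_G.match(s) or _VCF_LIKE.match(s)
        if m is None:
            return None
        chrom = m["chrom"].upper()
    return Variant(chrom, int(m["pos"]), m["ref"].upper(), m["alt"].upper())


if __name__ == "__main__":
    mentions = [
        "chr7:g.140453136A>T",
        "NC_000007.13:g.140453136A>T",
        "7-140453136-A-T",
        "7:140453136 A/T",
    ]
    # All four spellings collapse to one key, so a search on any one of them
    # would also match documents that use the others.
    print({normalize(m) for m in mentions})
```

Because every spelling reduces to the same tuple, an index keyed on that tuple can match a query against documents that write the variant differently, which is how normalization reduces false negatives.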
Availability
http://variant2literature.taigenomics.com/
Contact
chienyuchen@ntu.edu.tw
Publisher
Cold Spring Harbor Laboratory
Type
journal article
