Helmholtz Gemeinschaft

PEDL: extracting protein-protein associations using deep language models and distant supervision

PDF (Original Article), 456kB
PDF (Supplementary Data), 844kB

Item Type:Article
Title:PEDL: extracting protein-protein associations using deep language models and distant supervision
Creators Name:Weber, L. and Thobe, K. and Migueles Lozano, O.A. and Wolf, J. and Leser, U.
Abstract:MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. CONTACT: leser@informatik.hu-berlin.de or jana.wolf@mdc-berlin.de
Keywords:Language, Proteins, Publications, Research Design
Source:Bioinformatics
ISSN:1367-4803
Publisher:Oxford University Press
Volume:36
Number:Suppl 1
Page Range:i490-i498
Date:July 2020
Official Publication:https://doi.org/10.1093/bioinformatics/btaa430

Open Access
MDC Library