Helmholtz Gemeinschaft


Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models

Item Type: Article
Title: Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models
Creators Name: Weber, L. and Sänger, M. and Garda, S. and Barth, F. and Alt, C. and Leser, U.
Abstract: The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
Keywords: Factual Databases, Language, Proteins, Toxicogenetics
Publisher: Oxford University Press
Page Range: baac098
Date: 18 November 2022
Official Publication: https://doi.org/10.1093/database/baac098

Open Access
MDC Library