Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models

Item Type:	Article
Title:	Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models
Creators Name:	Weber, L., Sänger, M., Garda, S., Barth, F., Alt, C. and Leser, U.
Abstract:	The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
Keywords:	Factual Databases, Language, Proteins, Toxicogenetics
Source:	Database
ISSN:	1758-0463
Publisher:	Oxford University Press
Volume:	2022
Page Range:	baac098
Date:	18 November 2022
Official Publication:	https://doi.org/10.1093/database/baac098
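
The abstract reports that ensembling was crucial for performance but does not say how the ensemble members' predictions are combined. As a minimal sketch of one common approach, assuming each model outputs a probability per relation label and the ensemble simply averages them, the idea could look like the following. The function names, label names and probability values are illustrative placeholders and are not taken from the paper or its repository.

```python
from typing import Dict, List


def average_ensemble(member_probs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-label probabilities across ensemble members.

    Assumption: every member reports a probability for the same label set.
    """
    if not member_probs:
        raise ValueError("need at least one member prediction")
    labels = member_probs[0].keys()
    n = len(member_probs)
    return {label: sum(p[label] for p in member_probs) / n for label in labels}


def predict_label(member_probs: List[Dict[str, float]]) -> str:
    """Return the label with the highest averaged probability."""
    avg = average_ensemble(member_probs)
    return max(avg, key=avg.get)


# Made-up outputs of three hypothetical models for one chemical-protein pair.
predictions = [
    {"INHIBITOR": 0.6, "ACTIVATOR": 0.3, "NONE": 0.1},
    {"INHIBITOR": 0.2, "ACTIVATOR": 0.5, "NONE": 0.3},
    {"INHIBITOR": 0.7, "ACTIVATOR": 0.2, "NONE": 0.1},
]

print(predict_label(predictions))
```

In this toy example the second model disagrees with the other two, and averaging lets the majority view prevail. Other combination schemes, such as majority voting or weighted averaging, would fit the abstract's description equally well.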