Helmholtz Gemeinschaft


Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families

Item Type:Article
Title:Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families
Creators Name:Andrade, M.A. and Valencia, A.
Abstract:MOTIVATION: Annotation of the biological function of different protein sequences is a time-consuming process currently performed by human experts. Genome analysis tools encounter great difficulty in performing this task. Database curators, developers of genome analysis tools and biologists in general could benefit from access to tools able to suggest functional annotations and facilitate access to functional information. APPROACH: We present here the first prototype of a system for the automatic annotation of protein function. The system is triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain-specific background distribution. Simultaneously, the most representative sentences and MEDLINE abstracts are selected and presented to the end-user. Evolutionary information is considered as a predominant characteristic in the domain of protein function. Our system consequently extracts domain-specific information from the analysis of a set of protein families. RESULTS: The system has been tested with different protein families, of which three examples are discussed in detail here: 'ataxia-telangiectasia associated protein', 'ran GTPase' and 'carbonic anhydrase'. We found generally good correlation between the amount of information provided to the system and the quality of the annotations. Finally, the current limitations and future developments of the system are discussed.
Keywords:Abstracting and Indexing as Topic, Algorithms, Automation, Forecasting, Internet, MEDLINE, Mathematical Computing, Proteins
Page Range:600-607
Official Publication:https://doi.org/10.1093/bioinformatics/14.7.600
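
The abstract states that keywords are selected by their relative accumulation compared with a domain-specific background built from a set of protein families. The sketch below shows one possible reading of that idea, not the authors' implementation. It assumes that a word's frequency is the fraction of a family's abstracts containing it, and that "relative accumulation" can be approximated by a z-score of that frequency against the same word's frequencies across all families. The function names, the tokenizer, and the toy texts are hypothetical and are not taken from the article.

```python
import re
from statistics import mean, pstdev

# Toy, invented texts standing in for MEDLINE abstracts (not real data).
families = {
    "ran": ["ran gtpase controls nuclear transport",
            "nuclear import requires ran"],
    "carbonic": ["carbonic anhydrase binds zinc",
                 "zinc is required by the enzyme"],
    "atm": ["the kinase responds to dna damage",
            "dna repair requires the kinase"],
}

def word_frequencies(abstracts):
    """Fraction of abstracts in which each word appears at least once."""
    counts = {}
    for text in abstracts:
        for word in set(re.findall(r"[a-z]+", text.lower())):
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(abstracts) for word, n in counts.items()}

def score_keywords(families, target):
    """Rank words of the target family by z-score against all families."""
    freqs = {name: word_frequencies(docs) for name, docs in families.items()}
    scores = {}
    for word, freq in freqs[target].items():
        # Background: the word's frequency in every family (0 if absent).
        background = [freqs[name].get(word, 0.0) for name in families]
        sigma = pstdev(background)
        if sigma > 0:  # skip words that are equally frequent everywhere
            scores[word] = (freq - mean(background)) / sigma
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for word, z in score_keywords(families, "carbonic"):
    print(f"{word:10s} {z:6.2f}")
```

In this toy setting, a word that occurs only in the target family (such as "zinc") scores higher than a word that is also common elsewhere (such as "the"). A real background would need many more families and abstracts to be meaningful.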

