Benchmarking the next generation of homology inference tools

Saripella, G.V.; Sonnhammer, E.L.L.; Forslund, K.

Item Type:	Article
Title:	Benchmarking the next generation of homology inference tools
Creators:	Saripella, G.V., Sonnhammer, E.L.L. and Forslund, K. ORCID: https://orcid.org/0000-0003-4285-6993
Abstract:	Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the 'next generation' of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile-based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics were calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CS-BLAST and PHMMER had the highest overall accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs, but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are available at http://sonnhammer.org/download/Homology_benchmark. Contact: forslund@embl.de. Supplementary information: Supplementary data are available at Bioinformatics online.
Keywords:	Amino Acid Sequence Homology, Benchmarking, High-Throughput Nucleotide Sequencing, Protein Databases, Proteins, Sequence Homology
Source:	Bioinformatics
ISSN:	1367-4803
Volume:	32
Number:	17
Page Range:	2636-2641
Date:	1 September 2016
Official Publication:	https://doi.org/10.1093/bioinformatics/btw305
PubMed:	View item in PubMed
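
The abstract's Method step (score labelled homologous and non-homologous pairs with a tool, then compute performance metrics) can be illustrated with a minimal sketch. It is not the authors' benchmark code: the abstract does not say which metrics were used, so sensitivity and specificity at an E-value cutoff are assumed here as typical examples, and the `ScoredPair` type, its fields, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ScoredPair:
    """One benchmark protein pair (hypothetical record layout)."""
    query: str
    target: str
    evalue: float      # E-value reported by the search tool (assumed score)
    homologous: bool   # benchmark label derived from domain architecture


def confusion_counts(pairs: Iterable[ScoredPair],
                     evalue_cutoff: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), calling a pair homologous if evalue <= cutoff."""
    tp = fp = tn = fn = 0
    for pair in pairs:
        predicted = pair.evalue <= evalue_cutoff
        if predicted and pair.homologous:
            tp += 1
        elif predicted and not pair.homologous:
            fp += 1
        elif not predicted and not pair.homologous:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(pairs: Iterable[ScoredPair],
                            evalue_cutoff: float) -> Tuple[float, float]:
    """Sensitivity = tp / (tp + fn); specificity = tn / (tn + fp)."""
    tp, fp, tn, fn = confusion_counts(pairs, evalue_cutoff)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up example data, for illustration only.
example_pairs = [
    ScoredPair("P1", "P2", 1e-30, True),
    ScoredPair("P1", "P3", 1e-4, True),
    ScoredPair("P1", "P4", 5.0, True),    # remote homolog missed at strict cutoffs
    ScoredPair("P2", "P5", 1e-3, False),  # non-homolog with a low E-value
    ScoredPair("P2", "P6", 10.0, False),
    ScoredPair("P3", "P7", 2.0, False),
]

if __name__ == "__main__":
    sens, spec = sensitivity_specificity(example_pairs, evalue_cutoff=1e-3)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Sweeping `evalue_cutoff` over a range of values and recording the resulting pairs of rates would give the points of a ROC-style curve for comparing tools.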
