Item Type: Article
Title: Leveraging large language models for data analysis automation
Creators Name: Jansen, J.A., Manukyan, A., Al Khoury, N. and Akalin, A.
Abstract: Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and for developing new treatments and diagnostics. One possible solution to this shortage of experts is to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, questions remain about their accuracy when prompted with domain-expert questions, such as those arising in omics data analysis. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this system on a range of genomics data analysis tasks. Our primary goal is to enable researchers to conduct data analysis by simply describing, in plain text, their objectives and the desired analyses for specific datasets. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results for human review. Our evaluation reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex tasks. The best performance was achieved with the self-correction mechanism, which increased the percentage of executable code over the simple prompting strategy by 22.5% for tasks of complexity 2, and by 52.5%, 27.5% and 15% for tasks of complexity 3, 4 and 5, respectively. A chi-squared test showed significant differences between the prompting strategies. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for the effective integration of LLMs into data analysis workflows.
Keywords: Automation, Computational Biology, Data Analysis, Genomics, Programming Languages, Software, Workflow
Source: PLoS ONE
ISSN: 1932-6203
Publisher: Public Library of Science
Volume: 20
Number: 2
Page Range: e0317084
Date: 21 February 2025
Official Publication: https://doi.org/10.1371/journal.pone.0317084
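
The abstract above describes a text-to-analysis loop: a plain-text objective is sent to an LLM, the returned code is executed, and any execution errors are fed back to the model for self-correction. A minimal sketch of that loop with mergen might look as follows; the function names (setupAgent, selfcorrect, extractCode, executeCode) follow the package's documented interface, but the model choice, the prompt, and the return-value fields used here are assumptions that should be verified against the current mergen documentation.

```r
# Minimal sketch of the mergen workflow described in the abstract.
# Assumes an OpenAI chat model and an API key in an environment variable;
# the dataset and column names in the prompt are hypothetical.
library(mergen)

# Set up an LLM agent.
agent <- setupAgent(name = "openai", type = "chat", model = "gpt-4",
                    ai_api_key = Sys.getenv("OPENAI_API_KEY"))

# Describe the analysis goal in plain text.
prompt <- "Read 'counts.csv' and plot a histogram of the 'expression' column."

# selfcorrect() implements the error-feedback mechanism highlighted in the
# abstract: generated code is executed, and any error messages are sent back
# to the LLM for up to `attempts` rounds of correction.
answer <- selfcorrect(agent, prompt, attempts = 3)

# Extract the final code block and execute it, keeping the results for
# human review. (The `final.response` field name follows the package docs
# but should be double-checked.)
code   <- extractCode(answer$final.response)$code
result <- executeCode(code)
```

The self-correction loop is the design choice behind the reported gains in executable code: rather than re-prompting from scratch, execution errors are appended to the conversation so the model can repair its own output across successive attempts.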