Exploring the Performance of Large Language Models for Data Analysis Tasks Through the CRISP-DM Framework

Research output: Chapter in Book/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper investigates the impact of Large Language Models (LLMs), specifically GPT, on data analysis tasks within the framework of CRISP-DM (Cross-Industry Standard Process for Data Mining). In order to assess the efficiency of text-to-code language models in data-related tasks, we systematically examine the performance of LLMs in the stages of the data mining process. GPT models are tested against a series of Python programming and SQL tasks derived from a Master’s program’s curriculum. The tasks focus on data exploration, visualization, preprocessing, and advanced analytical tasks like association rule mining and classification. The findings show that GPT models exhibit proficiency in Python programming across various CRISP-DM stages, particularly in Data Understanding, Preparation, and Modeling. They adeptly utilize Python libraries for data manipulation and visualization, demonstrating potential as effective tools in data science. However, the study also uncovers areas where the GPT Text-to-code model shows partial correctness, highlighting the need for human oversight in complex data analysis scenarios. This research contributes to understanding how AI can augment traditional data analysis methods, particularly under the CRISP-DM framework. It reveals the potential of LLMs in automating stages of data analysis, suggesting an acceleration in analytical processes and decision-making. The study provides valuable insights for organizations integrating AI into data analysis, balancing AI strengths with human expertise.
Original language: English
Title of host publication: Good Practices and New Perspectives in Information Systems and Technologies - WorldCIST 2024
Subtitle of host publication: WorldCIST 2024, Volume 5
Editors: Álvaro Rocha, Hojjat Adeli, Gintautas Dzemyda, Fernando Moreira, Aneta Poniszewska-Maranda
Publisher: Springer, Cham
Pages: 56-65
Volume: 5
Edition: 1
ISBN (Electronic): 978-3-031-60227-6
ISBN (Print): 978-3-031-60226-9
DOIs
Publication status: Published - 16 May 2024
MoE publication type: A4 Article in a conference publication
Event: World Conference on Information Systems and Technologies - Lodz, Poland
Duration: 26 Mar 2024 → 28 Mar 2024
Conference number: 12

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 989
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: World Conference on Information Systems and Technologies
Abbreviated title: WorldCIST'24
Country/Territory: Poland
City: Lodz
Period: 26/03/24 → 28/03/24

Keywords

  • Large Language Models
  • GPT
  • CRISP-DM
  • Decision Support
