Polygence Scholar 2025

Aditya Rotte

Class of 2026, Cumming, Georgia

About

Projects

"Using LLMs for data cleaning" with mentor Jim (Sept. 9, 2025)

Aditya's Symposium Presentation

Project Portfolio

Using LLMs for data cleaning

Started May 20, 2025

Abstract or project description

The rapid growth of AI presents a significant opportunity to streamline data science workflows, in which the Extract, Transform, Load (ETL) process is particularly time-consuming. While commercial AI platforms promise to accelerate ETL tasks, their actual accuracy and capabilities on complex, multi-skill, real-world data science problems remain largely untested. This research addresses that gap by evaluating how effectively four popular AI platforms complete ETL tasks: Julius AI, Gemini, GPT, and DeepSeek. My central hypothesis was that only platforms specifically fine-tuned for data science, such as Julius AI, would consistently complete ETL tasks with high accuracy. To test this, each platform was subjected to three unique tests using real-world VEX Robotics datasets. These tests were designed to assess a range of essential ETL skills, including data calculations, joining, blending, text mining, Natural Language Processing (NLP), noise reduction, and dimensionality reduction through feature extraction. Performance was evaluated on accuracy, process feasibility, time spent prompting, and the ease of correcting any errors the platforms made. The results revealed that Julius AI, the data-science-specialized platform, consistently delivered accurate results with a sound process across all tests. Surprisingly, the general-purpose model Gemini also performed exceptionally well, achieving high accuracy despite some minor inconsistencies in handling large datasets. GPT struggled to complete the tasks effectively, a result potentially influenced by the limitations of the free-tier version used. DeepSeek also struggled, primarily because it could not handle large amounts of data.
These findings partially support the hypothesis: while the data-science-trained platform was the most reliable, the strong performance of a general-purpose model like Gemini suggests that data-science-specific training is not required for success. With continued advancements, general-purpose AI could become an equally powerful tool for complex ETL tasks in the data science field.
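
To make the ETL skills named above more concrete, the following is a minimal sketch of a join plus a data calculation in Python. It is an illustration only: the team names, match rows, column names, and the helper functions `extract`, `transform`, and `load` are all hypothetical, and they are not the datasets, prompts, or methods used in the project.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw inputs; names and values are illustrative only,
# not taken from the project's real datasets.
TEAMS_CSV = """team_id,team_name
1234A, Alpha Bots
5678B,Beta Builders
"""

MATCHES_CSV = """match,team_id,score
Q1,1234A,20
Q1,5678B,15
Q2,1234a,30
Q2,5678B,
Q3,9999Z,12
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(teams, matches):
    """Transform: clean both tables, join on team_id, average the scores."""
    # Normalize keys so "1234a" and "1234A" refer to the same team.
    names = {
        row["team_id"].strip().upper(): row["team_name"].strip()
        for row in teams
    }
    scores = defaultdict(list)
    for row in matches:
        team_id = row["team_id"].strip().upper()
        raw = row["score"].strip()
        # Noise reduction: skip rows with no score or no matching team
        # (the effect is an inner join between the two tables).
        if not raw or team_id not in names:
            continue
        scores[team_id].append(int(raw))
    return [
        {"team_id": tid, "team_name": names[tid], "avg_score": mean(vals)}
        for tid, vals in sorted(scores.items())
    ]


def load(rows):
    """Load: serialize the cleaned rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["team_id", "team_name", "avg_score"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result = transform(extract(TEAMS_CSV), extract(MATCHES_CSV))
print(load(result))
```

A hand-written reference like this is also one possible way to check a platform's output for accuracy: the cleaned table an AI platform returns can be compared row by row against `result`.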