

Laura Shang
Class of 2026
Sammamish, Washington
About
Hello! My name is Laura and I have not decided what my Polygence project is on yet.
Projects
- "Balancing Speed and Precision: A Comparative Analysis of Lightweight LLMs on SAT Reading and Writing Tasks" with mentor Jesus (May 31, 2025)
Project Portfolio
Balancing Speed and Precision: A Comparative Analysis of Lightweight LLMs on SAT Reading and Writing Tasks
Started Oct. 30, 2024
Abstract or project description
This study evaluates six cost-efficient large language models (LLMs) on SAT Reading and Writing multiple-choice questions: ChatGPT 4.1 mini, Gemini 2.0 Flash, Qwen3 235B-A22B, Llama 3.3 70B Instruct, Claude 3.5 Haiku, and DeepSeek V3. Using a structured pipeline built with LangChain, we assessed 90 questions across three difficulty levels (easy, medium, hard) and four skill domains (Craft and Structure, Expression of Ideas, Information and Ideas, Standard English Conventions). Command of Evidence questions that include a graphical representation of data were excluded. ChatGPT 4.1 mini and DeepSeek V3 were the top performers (91.1% accuracy), closely followed by Gemini 2.0 Flash (88.9%), while Qwen3 235B-A22B lagged far behind (32.2%). Accuracy declined as question difficulty increased (e.g., ChatGPT 4.1 mini dropped from 96.7% on easy questions to 83.3% on hard questions), and all models struggled most with Standard English Conventions (16.7–72.2% accuracy), particularly grammar tasks such as boundaries (44.4% average accuracy). Gemini 2.0 Flash delivered the best speed-accuracy balance (88.9% accuracy in 94.51 seconds). DeepSeek V3 matched ChatGPT 4.1 mini's accuracy (91.1%), with reported latencies of 221.84 s and 184.5 s, respectively. Models demonstrated moderate-to-high consistency (variability = 1.00–1.39). These results suggest that smaller LLMs are viable for automated SAT-style assessment in comprehension (Information and Ideas: 96.3% accuracy) and analysis (Craft and Structure: 100% for top models) but still need substantial improvement in grammatical precision and complex reasoning.
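As a rough illustration of how per-difficulty or per-skill accuracy figures like those above could be tallied, here is a minimal Python sketch. It is not the project's actual pipeline: the record layout, the `accuracy_by_group` helper, and the toy data are all hypothetical. The closing arithmetic assumes each percentage comes from a single pass over the 90 questions, with 30 questions per difficulty level, which is consistent with the reported figures but is not stated in the abstract.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: percent correct} for graded records grouped by `key`.

    Hypothetical helper: each record is a dict holding the grouping field,
    the answer key ("answer"), and the model's choice ("predicted").
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["answer"])
    return {g: round(100 * correct[g] / totals[g], 1) for g in totals}


# Toy data for illustration only; not taken from the study.
records = [
    {"difficulty": "easy", "answer": "A", "predicted": "A"},
    {"difficulty": "easy", "answer": "C", "predicted": "C"},
    {"difficulty": "hard", "answer": "B", "predicted": "D"},
    {"difficulty": "hard", "answer": "D", "predicted": "D"},
]
by_difficulty = accuracy_by_group(records, "difficulty")
print(by_difficulty)

# Arithmetic check under the assumptions stated above:
overall_top = round(100 * 82 / 90, 1)   # 82 of 90 correct -> 91.1
easy_top = round(100 * 29 / 30, 1)      # 29 of 30 correct -> 96.7
hard_top = round(100 * 25 / 30, 1)      # 25 of 30 correct -> 83.3
print(overall_top, easy_top, hard_top)
```

Under these assumptions, 91.1% overall corresponds to 82 correct answers out of 90, and the drop from 96.7% to 83.3% corresponds to four more misses on the hard questions than on the easy ones.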