Polygence Scholar 2023

Juliana Li

Class of 2025, Saratoga, CA



  • "Machine Learning-Based Detection of Autism Spectrum Disorder Using Linguistic Features" with mentor Cristina (Aug. 9, 2023)

Project Portfolio

Machine Learning-Based Detection of Autism Spectrum Disorder Using Linguistic Features

Started Feb. 23, 2023

Abstract or project description

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects communication, social interaction, and behavior, with individuals affected often displaying speech idiosyncrasies (Chi et al., 2022). In recent years, machine learning (ML) and natural language processing (NLP) techniques have shown promising results in detecting ASD from speech and language samples, and the use of these systems has helped to shorten and improve the time-consuming and extensive process of ASD diagnosis (Ramesh & Assaf, 2021).

In this study, we investigate the potential of ML and NLP techniques to identify ASD from speech transcripts based on several linguistic components. We aim to improve the accuracy of existing ASD detection methods and to provide an accessible at-home tool to aid in the diagnosis of ASD. The main research gap in ML- and NLP-based ASD detection is a lack of focus on the specific linguistic features that differentiate children with ASD from typically developing (TD) children, so our objective is to improve upon the accuracy of previous studies by focusing on our chosen linguistic components.

We analyze speech transcripts of 64 children aged 3 to 6 from the data banks CHILDES and ASDBank, including 30 children with ASD and 34 TD controls. We first examine the annotations of the transcripts, then extract various linguistic features from the texts. We train several machine learning algorithms, such as logistic regression, naive Bayes, and random forests, to determine whether a child has ASD based on the characteristics of the transcripts analyzed. To assess the performance of our models, we use k-fold cross-validation and evaluation metrics such as receiver operating characteristic (ROC) curves, the area under the ROC curve (AUC), accuracy, precision, and recall.
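As a rough illustration of the feature-extraction and cross-validation steps described above, the following is a minimal sketch using only the Python standard library. The abstract does not name the linguistic features or the number of folds used in the project, so the two features here (mean length of utterance in words and type-token ratio), the function names, and the choice of k are all assumptions made for the example. It also assumes each child's transcript has already been reduced to a list of plain-text utterances with the annotation codes removed.

```python
import random
import re


def linguistic_features(utterances):
    """Return two simple features for one child's list of utterance strings.

    Hypothetical features, not necessarily the ones used in the project:
    - mlu_words: mean length of utterance, counted in words
    - type_token_ratio: distinct words divided by total words
    """
    tokenized = [re.findall(r"[a-z']+", u.lower()) for u in utterances]
    tokenized = [t for t in tokenized if t]  # drop utterances with no words
    tokens = [w for t in tokenized for w in t]
    if not tokens:
        return {"mlu_words": 0.0, "type_token_ratio": 0.0}
    return {
        "mlu_words": len(tokens) / len(tokenized),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

Each child would contribute one feature dictionary, and every child would appear in the test fold exactly once, so a classifier is always scored on transcripts it was not trained on. With only 64 children, a stratified split that keeps the ASD/TD ratio similar in each fold may be preferable; this sketch does not do that.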

We aimed to achieve an accuracy of at least 80% to be comparable to or improve upon other models in the literature. Our most accurate model, a multilayer perceptron, scored 80-85% on all of the evaluation metrics using only the most relevant linguistic features.
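For readers unfamiliar with the evaluation metrics named above, here is a minimal sketch of how accuracy, precision, recall, and AUC can be computed from a model's outputs. It is a standard-library illustration, not the project's own evaluation code. The label convention (1 for ASD, 0 for TD) and the function names are assumptions made for the example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = ASD, 0 = TD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Report 0.0 instead of dividing by zero when a count is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Precision and recall are reported alongside accuracy because they separate two kinds of error: flagging a TD child as having ASD lowers precision, while missing a child with ASD lowers recall.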