Statistical Model for Identifying Unclear and Doubtfully Restored Signs of the Indus Script

Project by Polygence alum Varun

Project's result

Varun presented his project at the Seventh Symposium of Rising Scholars.

Summary

A writing system developed in the Indus Valley civilization of the Indian subcontinent between 2500 and 1800 BCE remains undeciphered. The Indus script texts recovered so far from archaeological digs are limited in number, and many come from damaged artifacts with unclear or missing signs. Identifying these signs, and thereby extending the text corpus, would benefit further research. This work aims to predict the missing and unclear signs with n-gram Markov chain models built on the ICIT Indus text corpus. First, we analyzed patterns and concordances of single signs, pairs, triplets, and longer n-grams to discover how signs behave with respect to their positions in the texts. With that understanding, we built n-gram Markov chain language models augmented with positional probabilities. Because a sign can be missing at any position in a text, we devised and implemented sign fill-in models on top of these Markov chain models. Using the language models and the sign fill-in models, we identified missing single signs in a test dataset and tuned the model parameters, reaching a match accuracy of about 63%. We then filled in the actual unclear texts with our predicted signs. We hope that the statistical models and results from this work add to the Indus text corpus, aid in understanding the Indus script, and contribute to the decipherment effort.
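To make the fill-in idea concrete, here is a minimal sketch of how a single missing sign might be ranked using a Markov chain model combined with positional probability. It is not the project's implementation. The bigram order, the add-alpha smoothing, the three position buckets, the `weight` exponent, and all function names are assumptions chosen for illustration. The sign labels are placeholders, not ICIT sign numbers.

```python
from collections import Counter, defaultdict

BOS, EOS, MISSING = "<s>", "</s>", "?"  # hypothetical boundary and gap markers


def position_bucket(i, n):
    """Coarse position of index i in a text of length n (an assumed scheme)."""
    if i == 0:
        return "first"
    if i == n - 1:
        return "last"
    return "middle"


def train(texts):
    """Count sign bigrams and sign-by-position frequencies from complete texts."""
    bigrams = defaultdict(Counter)    # bigrams[previous][current]
    positions = defaultdict(Counter)  # positions[bucket][sign]
    vocab = set()
    for text in texts:
        padded = [BOS] + list(text) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev][cur] += 1
        for i, sign in enumerate(text):
            positions[position_bucket(i, len(text))][sign] += 1
            vocab.add(sign)
    return bigrams, positions, vocab


def cond_prob(table, context, sign, vocab_size, alpha=0.1):
    """Add-alpha smoothed estimate of P(sign | context)."""
    counts = table[context]
    return (counts[sign] + alpha) / (sum(counts.values()) + alpha * vocab_size)


def fill_in(text, model, weight=0.5, top_k=3):
    """Rank candidate signs for the single MISSING slot in text."""
    bigrams, positions, vocab = model
    idx = text.index(MISSING)
    padded = [BOS] + list(text) + [EOS]
    left, right = padded[idx], padded[idx + 2]  # neighbours of the gap
    bucket = position_bucket(idx, len(text))
    size = len(vocab) + 1  # +1 for the end marker
    scores = {}
    for cand in vocab:
        # Chain term: the candidate must fit both its left and right neighbour.
        chain = cond_prob(bigrams, left, cand, size) * cond_prob(bigrams, cand, right, size)
        # Positional term: how often the candidate occurs in this position bucket.
        pos = cond_prob(positions, bucket, cand, size)
        scores[cand] = chain * pos ** weight
    return sorted(vocab, key=lambda s: (-scores[s], s))[:top_k]


# Toy corpus with placeholder sign labels.
corpus = [["A", "B", "C"], ["A", "B", "D"], ["E", "B", "C"], ["A", "B", "C"]]
model = train(corpus)
print(fill_in(["A", MISSING, "C"], model))
```

In a setup like this, accuracy on a test set could be estimated by masking one known sign per text and checking whether the top-ranked candidate matches it. Parameters such as `weight` and `alpha` would then be the kind of values one tunes.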