Among the most enduring mysteries of the ancient world is the language of the Indus people. The surest way of learning a lost civilization’s language is through its writings. Back in the day, people left behind seals and inscriptions just as today we leave behind books and billboards. If an inscription is bilingual and one of its scripts is already familiar, deciphering the unknown one becomes far easier. The Rosetta Stone is arguably the most celebrated example of such an artifact. It helped us decipher Egyptian hieroglyphs a good 2,000 years after they were carved. Greek helped us there. But not all inscriptions offer that convenience.
The Indus civilization remains a mystery largely because it has left behind very little writing. Everything we know so far is educated guesswork, logical extrapolation. But there’s a great deal we still don’t know even with such extrapolations. For instance, we’re yet to confidently ascertain what they imported. We know they exported carnelian, bronze, fowls, buffalos, and other exotic items to the Mesopotamians, but we don’t know much about what they got back in return. A correct decipherment of their script could help answer a long list of similar questions.
With over 4,000 inscriptions recovered so far, it’s hard to believe that they left behind insufficient writing. But while the number of inscriptions is large, their average length is not. The longest runs about 30-35 characters, the shortest just one. The average hovers around 5. That’s not a whole lot to work with. But that doesn’t mean attempts haven’t been made.
As is the case with the Aryans, the Harappans too come with immense political and ideological baggage. The Indus civilization is primordial India. Whoever gets to claim it gets to claim India. Much ideological debate in the subcontinent revolves around the ethnic identity of the Meluhhans or Harappans, with two major contentions: Dravidian or Aryan. South Indians want the civilization to be Proto-Dravidian so that they can claim primacy over the country’s civilizational heritage. This ties in well with their notion that they are natives and the Aryans, or North Indians, foreign invaders. North Indians claim the converse.
This politics colors scholarship. Many scholars are also ideologues who enter the study not to find the truth but to confirm something they already hold as the truth. In the simplest terms, this debate boils down to language. South Indian ideologues want the Indus language to be some kind of Proto-Dravidian dialect, so as to fortify their claims of nativity. North Indians, on the other hand, want the Indus language to be Sanskrit, for the exact opposite reasons. Who is right, then?
This article does not answer that question. Hundreds of scholars and experts all over the world are working hard to crack this code, and that effort is certainly not something an article like this could even begin to cover. What this article does endeavor to do is pick one of the myriad hypotheses and examine it objectively for viability. We’re talking about a 2024 paper titled “A Cryptanalytic Decipherment of the Indus Script” that makes a bold claim in favor of Sanskrit. The paper, which is yet to be peer-reviewed and published in a journal, uses an impressive array of cryptographic novelties to arrive at its conclusions. Reading it thoroughly before proceeding is highly recommended, even if not an absolute necessity.
This piece is going to be very different from all others on this site, because this one involves a good deal of math and a little bit of computing, in addition to linguistics and history. Cryptography is a math-intensive discipline and therefore hard to keep purely equation-free. As for computing, the only concept involved is regex, and not even the whole concept, just a couple of real-world examples. Nothing that should deter a sincere reader. The only caveat here is that the Indus glyphs are mentioned using their short English descriptions rather than the characters themselves. This is because the Indus font is not supported on this platform yet, so the glyphs would render as unreadable blocks. So, instead of typing out the jar character, we’ll just spell out “jar.” We’ll start with some elementary jargon and relevant math and then wiggle our way into the meat and potatoes of Yajnadevam’s decipherment.
The Mathematics of Decipherment
Entropy is randomness. It measures the unpredictability in a system. The higher the predictability, the lower the entropy, and vice versa. This is not just some vague abstraction, but a tangible, quantifiable value that can be derived and expressed in real numbers. Mathematically, this randomness is expressed in bits. Why? Because the bit is the smallest unit of information. For instance, the answer to a basic yes/no question only needs two states, and those can be expressed as either 0 or 1, i.e. using a single bit. If a question can be answered one of four ways, a single bit isn’t sufficient. Two are: 00, 01, 10, and 11. The number of bits required to express a piece of information can be calculated as:

$$H = \log_2(i)$$

Where $H$ is the number of bits and $i$ is the number of possible values the information can take. So, for a yes/no question, $i$ is 2, which can be expressed using $\log_2(2) = 1$ bit. For a question that can be answered 10 different ways, $\log_2(10) \approx 3.32$ bits would be needed.
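If it helps to see this in code, here’s a minimal Python sketch of the same calculation; the phrasing of the output is just illustrative:

```python
import math

# Bits needed to distinguish i equally likely answers: log2(i)
for i in (2, 4, 10):
    print(f"{i} possible answers -> {math.log2(i):.2f} bits")

# Output:
# 2 possible answers -> 1.00 bits
# 4 possible answers -> 2.00 bits
# 10 possible answers -> 3.32 bits
```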
For languages, entropy tells us how much information is packed into each symbol (like letters). If a language is very predictable (e.g., “th_” often becomes “the”), its entropy is lower. In this context, what matters is not just how many symbols exist but how probable each one is. We know that not all letters are equally common. “E” is far more frequent than, say, “X,” and so on. Thus, for a letter, its frequency or probability is what governs its contribution to entropy. This probability is a quantifiable value. Thus, the above entropy formula can be extended to calculate the overall entropy for an entire language system:

$$H = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
Where:
$p_i$ = probability of the $i$-th symbol (e.g., how often “E” appears in English)
$n$ = total number of symbols (e.g., 26 for English)
This formula sums up the contributions of all symbols based on their probabilities. Once again, the final value is expressed in bits. Let’s illustrate this with a calculation. English has 26 letters, but they don’t appear equally often. For simplicity, let’s use approximate probabilities for common letters:
“E” appears 13% of the time, hence $p_E = 0.13$ and $H_E = -p_E \cdot \log_2(p_E) \approx 0.38$ bits.
“T” appears 9% of the time, hence $p_T = 0.09$ and $H_T = -p_T \cdot \log_2(p_T) \approx 0.31$ bits.
Other letters have lower probabilities.
For “E” and “T” alone, these contributions add up to 0.69 bits. If we run this calculation for all 26 letters in English, using their actual probabilities (which vary widely), the total comes to just over 4 bits per letter. And because letters are not independent, each one partly predictable from its neighbors, the effective information content drops further, to roughly 1.5 bits per letter on average.
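For readers who’d rather verify this in code, here’s a minimal Python sketch; the two-letter table holds just the illustrative values above, not a full English frequency model:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Per-letter contributions for the two letters worked out above
for letter, p in {"E": 0.13, "T": 0.09}.items():
    print(f"H_{letter} = {-p * math.log2(p):.2f} bits")
# H_E = 0.38 bits
# H_T = 0.31 bits

# Fed a full 26-letter frequency table, shannon_entropy() yields the
# single-letter entropy of English: just over 4 bits per letter.
```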