Shannon Entropy Analysis Across Languages
Background
Claude Shannon’s 1951 paper introduced methods to estimate the entropy and redundancy of language by analyzing how predictable each character is given the preceding text (Shannon, 1951). Entropy captures the average information per character, while redundancy reflects how much the structure of a language makes character sequences predictable.
This project extends Shannon’s methodology using KenLM, a modern language-modeling toolkit that builds n-gram models for predicting character sequences. By applying these models to multiple English corpora, European languages from the Europarl corpus, and the Linear B script (an ancient writing system used for Mycenaean Greek), the project explores linguistic predictability across different writing systems and language families.
Methodology
Load and Format Corpus (a minimal sketch follows this list):
- Language-specific character filtering
- Support for diacritics and special characters
- Unicode handling for Linear B (U+10000 to U+100FF)
- Deduplication and cleaning
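The filtering step can be sketched as below. The character classes, file handling, and helper names are illustrative assumptions rather than the project’s exact code; they only show the intent of language-specific filtering, Unicode handling for Linear B, and deduplication.

```python
import re
import unicodedata

# Illustrative per-language character classes; the project's actual inventories may differ.
CHAR_CLASSES = {
    "english": "a-z",
    "german": "a-zäöüß",
    "french": "a-zàâæçéèêëîïôœùûüÿ",
    # Linear B syllabograms and ideograms occupy U+10000-U+100FF.
    "linear_b": "\\U00010000-\\U000100FF",
}

def clean_corpus(text: str, language: str) -> str:
    """Normalize, lowercase, and keep only the language's graphemes plus spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(f"[^{CHAR_CLASSES[language]} ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(lines):
    """Drop exact duplicate lines while preserving their original order."""
    seen, unique = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique
```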
Build KenLM Model (a sketch of this step follows the list):
- 8-gram language models for sequence prediction
- Modified Kneser-Ney smoothing for robust probability estimates
- Both ARPA and binary model formats
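A sketch of the model-building step, assuming the KenLM `lmplz` and `build_binary` executables are on the PATH and the `kenlm` Python bindings are installed; the file names are placeholders. `lmplz` applies modified Kneser-Ney smoothing by default, and each character is written as a separate whitespace-delimited token so that the word-level toolkit produces a character-level model.

```python
import subprocess
import kenlm

def train_char_model(corpus_path: str, order: int = 8) -> kenlm.Model:
    """Train a character-level n-gram model and return it loaded from the binary format."""
    # KenLM tokenizes on whitespace, so spell each line out character by character,
    # representing the space character itself as an underscore so it survives.
    with open(corpus_path, encoding="utf-8") as src, \
         open("chars.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join("_" if ch == " " else ch for ch in line.strip()) + "\n")

    # lmplz estimates the n-gram probabilities (modified Kneser-Ney) and writes an ARPA file;
    # --discount_fallback keeps estimation from aborting on small corpora.
    with open("chars.txt") as stdin, open("model.arpa", "w") as stdout:
        subprocess.run(["lmplz", "-o", str(order), "--discount_fallback"],
                       stdin=stdin, stdout=stdout, check=True)

    # build_binary converts the ARPA file into KenLM's fast binary format.
    subprocess.run(["build_binary", "model.arpa", "model.binary"], check=True)
    return kenlm.Model("model.binary")
```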
Calculate Entropy Measures: This analysis uses progressively more sophisticated entropy measures to capture different levels of character predictability:
Zero-order Entropy ($H_0$): Represents the theoretical maximum entropy if all characters were equally likely:

$$H_0 = \log_2 N$$

where $N$ is the number of unique characters. This serves as our baseline.
First-order Entropy ($H_1$): Incorporates actual character frequencies, showing how individual character probabilities affect entropy:

$$H_1 = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the probability of character $i$. $H_1 \le H_0$ always, with equality only if all characters are equally frequent.
Second-order/Collision Entropy ($H_2$): Measures the probability that two characters drawn independently from the text coincide (the Rényi entropy of order 2):

$$H_2 = -\log_2 \sum_{i} p_i^2$$

$H_2$ is always less than or equal to $H_1$, with the difference revealing how strongly the distribution is dominated by a few frequent characters.
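A minimal sketch of the first three measures, assuming the input is an already-cleaned corpus string (the function and variable names are illustrative):

```python
from collections import Counter
from math import log2

def entropy_profile(text: str) -> tuple[float, float, float]:
    """Return (H0, H1, H2) in bits per character for a cleaned corpus string."""
    counts = Counter(text)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]

    h0 = log2(len(counts))                    # log2 of the grapheme inventory size
    h1 = -sum(p * log2(p) for p in probs)     # Shannon entropy of character frequencies
    h2 = -log2(sum(p * p for p in probs))     # collision (Renyi order-2) entropy
    return h0, h1, h2
```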
Third-order Entropy ($H_3$): Uses KenLM with modified Kneser-Ney smoothing to capture longer-range dependencies. It is the average number of bits the 8-gram model needs to predict each character from its preceding context:

$$H_3 = -\frac{1}{T} \sum_{t=1}^{T} \log_2 P\!\left(c_t \mid c_{t-n+1}, \ldots, c_{t-1}\right), \qquad n = 8$$

where the probability is calculated using interpolated Kneser-Ney smoothing:

$$P_{\mathrm{KN}}\!\left(c_t \mid c_{t-n+1}^{\,t-1}\right) = \frac{\max\!\left(\operatorname{count}\!\left(c_{t-n+1}^{\,t}\right) - D,\, 0\right)}{\operatorname{count}\!\left(c_{t-n+1}^{\,t-1}\right)} + \lambda\!\left(c_{t-n+1}^{\,t-1}\right) P_{\mathrm{KN}}\!\left(c_t \mid c_{t-n+2}^{\,t-1}\right)$$

with $D$ a count-dependent discount and $\lambda$ the back-off weight that renormalizes the distribution (modified Kneser-Ney uses separate discounts for n-grams seen once, twice, and three or more times). This captures the most sophisticated level of character prediction, with the drop from $H_2$ to $H_3$ showing the additional structure captured by longer sequences.
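KenLM reports a total $\log_{10}$ probability, which can be converted to bits per character as sketched below. Whether $H_3$ is measured on held-out text or on the training corpus is a protocol choice the sketch leaves open, and the space-to-underscore convention must match the one used when the model was trained.

```python
from math import log2
import kenlm

def char_cross_entropy(model: kenlm.Model, text: str) -> float:
    """Average bits per character the model assigns to `text`."""
    # Mirror the training format: one token per character, spaces as underscores.
    tokens = " ".join("_" if ch == " " else ch for ch in text)
    # score() returns the total log10 probability, including the </s> transition.
    log10_prob = model.score(tokens, bos=True, eos=True)
    n_predictions = len(text) + 1   # every character plus the end-of-sentence token
    return -log10_prob * log2(10) / n_predictions
```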
Calculate Redundancy: Finally, we quantify how much of the theoretical maximum entropy is reduced by the language’s structure:

$$\text{Redundancy} = \left(1 - \frac{H_3}{H_0}\right) \times 100\%$$

This percentage represents how much of the potential information capacity is used for structural patterns rather than conveying new information. Higher redundancy indicates more predictable character sequences.
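As a quick sanity check against the results below, plugging the German row’s values ($H_0 = 4.91$, $H_3 = 1.39$) into this formula reproduces the reported figure:

$$\text{Redundancy}_{\text{German}} = \left(1 - \frac{1.39}{4.91}\right) \times 100\% \approx 71.7\%$$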
Each entropy measure ($H_0$ through $H_3$) reveals a different aspect of the writing system’s structure, with the decreasing sequence $H_0 \ge H_1 \ge H_2 \ge H_3$ showing how each level of analysis captures additional predictability in the text.
Results
European Languages and Linear B
Language | Grapheme Inventory | $H_0$ | $H_1$ | $H_2$ | $H_3$ | Redundancy |
---|---|---|---|---|---|---|
Linear B | 86 | 6.43 | 5.74 | 5.46 | 2.34 | 63.54% |
English (Europarl) | 26 | 4.70 | 4.14 | 3.89 | 1.60 | 65.94% |
French | 39 | 5.29 | 4.13 | 3.85 | 1.63 | 69.08% |
German | 30 | 4.91 | 4.17 | 3.78 | 1.39 | 71.68% |
Italian | 35 | 5.13 | 4.02 | 3.76 | 1.62 | 68.46% |
Greek | 24 | 4.58 | 4.16 | 3.96 | 1.80 | 60.64% |
Spanish | 33 | 5.04 | 4.14 | 3.85 | 1.64 | 67.45% |
Dutch | 28 | 4.81 | 4.09 | 3.70 | 1.40 | 70.82% |
English Corpora
Corpus | Token Count | Vocab Count | $H_0$ | $H_1$ | $H_2$ | $H_3$ | Redundancy |
---|---|---|---|---|---|---|---|
Brown | 4,369,721 | 46,018 | 4.70 | 4.18 | 3.93 | 1.63 | 65.39% |
Reuters | 5,845,812 | 28,835 | 4.75 | 4.19 | 3.95 | 1.80 | 62.08% |
Webtext | 1,193,886 | 16,303 | 5.13 | 4.27 | 4.06 | 1.72 | 66.50% |
Inaugural | 593,092 | 9,155 | 4.75 | 4.15 | 3.88 | 1.63 | 65.81% |
State Union | 1,524,983 | 12,233 | 4.81 | 4.16 | 3.91 | 1.67 | 65.17% |
Gutenberg | 8,123,136 | 41,350 | 4.91 | 4.16 | 3.91 | 1.83 | 62.70% |
Analysis
Language Family Patterns
Germanic Languages
The highest redundancy rates are observed in German (71.68%) and Dutch (70.82%), suggesting strongly structured character sequences and highly predictable patterns. This predictability produces the sharpest entropy reductions from $H_0$ to $H_3$, which may point to orthographies that rely heavily on established character combinations.
Romance Languages
French, Italian, and Spanish exhibit moderate redundancy levels (67–69%). This cluster balances structure and information density, possibly due to phonological transparency and shared Latin roots.
Modern Greek
With the lowest redundancy rate at 60.64%, Greek shows an efficient information encoding system, using a small grapheme inventory (24 characters) to maintain high information density and minimal redundancy.
Writing System Complexity
Linear B (Syllabary)
Despite its large grapheme inventory (86 symbols), Linear B has a redundancy similar to alphabetic systems (63.54%). This suggests an optimized information encoding despite the use of a syllabary, which traditionally encodes larger linguistic units. The high absolute entropy ($H_0$ of 6.43) reflects the character set size rather than redundancy, pointing to adaptability across writing system types.
Alphabetic Systems
The alphabetic languages display $H_0$ values ranging from 4.58 (Greek) to 5.29 (French), while redundancy varies significantly (60–72%). This indicates that languages optimize orthographic efficiency based on linguistic and structural needs, balancing character set size, information density, and redundancy.
Historical Implications
Linear B’s redundancy rate of roughly 63% aligns with modern alphabetic systems, implying a universal constraint on redundancy in writing systems that balances clarity with processing demands. This consistency across time and across writing-system types suggests that writing systems evolve toward readable yet efficient information encoding.
Conclusion
These findings highlight a striking convergence in redundancy rates across diverse writing systems, with redundancy clustering between roughly 60% and 72% across ancient and modern languages. This suggests an underlying principle in how human writing systems evolve to optimize readability and efficiency, irrespective of structural differences such as syllabaries versus alphabets. The observed patterns suggest that successful writing systems may converge toward an optimal balance of information density, where approximately two-thirds redundancy ensures efficient yet accessible communication.
This convergence, seen in English (62.52%) and Linear B (63.54%), implies that redundancy rates of around 63% may represent a universal optimization point in human communication, transcending linguistic, cultural, and structural differences. These insights are valuable for understanding the evolution of writing systems and may also have applications in developing new or enhanced systems for digital and human communication.
Source Code
References
Citation
@online{winstead2024,
author = {Winstead, John},
title = {Shannon {Entropy} {Analysis} {Across} {Languages}},
date = {2024-10-01},
url = {https://jhnwnstd.github.io/projects/shannon-entropy/},
langid = {en}
}