Shannon Entropy Analysis Across Languages
Background
Claude Shannon’s 1951 paper introduced methods to estimate the entropy and redundancy of language by analyzing how predictable each character is given the preceding text (Shannon, 1951). Entropy captures the average information per character, while redundancy reflects how much the language’s structure makes character sequences predictable.
This project extends Shannon’s methodology using KenLM, a modern language modeling tool that generates n-gram models to predict character sequences. Applied to multiple English corpora, European languages from the Europarl corpus, and the Linear B script (an ancient Mycenaean Greek writing system), this project explores linguistic predictability across different writing systems and language families.
Methodology
Load and Format Corpus (see the sketch after this list):
- Language-specific character filtering
- Support for diacritics and special characters
- Unicode handling for Linear B (U+10000 to U+100FF)
- Deduplication and cleaning
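A minimal sketch of this loading step in Python, assuming plain-text corpus files; the character classes, file handling, and deduplication below are illustrative rather than the project’s actual filtering rules:

```python
import re
from pathlib import Path

# Illustrative character classes; the project's actual language-specific
# filters may differ. Linear B spans the Syllabary (U+10000-U+1007F) and
# Ideograms (U+10080-U+100FF) Unicode blocks.
CHAR_CLASSES = {
    "english": r"a-z",
    "french": r"a-zàâçéèêëîïôûùüÿœæ",
    "linear_b": r"\U00010000-\U000100FF",
}

def load_and_clean(path: str, language: str) -> list[str]:
    """Read a corpus, lowercase it, keep only the language's graphemes,
    normalize whitespace, and drop duplicate lines."""
    allowed = CHAR_CLASSES[language]
    seen, lines = set(), []
    for raw in Path(path).read_text(encoding="utf-8").lower().splitlines():
        line = re.sub(rf"[^{allowed} ]", "", raw)   # language-specific filtering
        line = re.sub(r" +", " ", line).strip()     # normalize spacing
        if line and line not in seen:               # deduplication
            seen.add(line)
            lines.append(line)
    return lines
```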
Build KenLM Model (see the sketch after this list):
- 8-gram language models for sequence prediction
- Modified Kneser-Ney smoothing for robust probability estimates
- Both ARPA and binary model formats
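A sketch of the model-building step, assuming KenLM’s lmplz and build_binary executables are on the PATH and the kenlm Python bindings are installed; spacing out characters (with an underscore standing in for the space) is one common convention for character-level models, not necessarily the one used in this project:

```python
import subprocess
import kenlm  # Python bindings for KenLM

def train_char_model(lines: list[str], prefix: str, order: int = 8) -> kenlm.Model:
    """Train a character-level n-gram model with lmplz and compile it to binary."""
    # KenLM treats whitespace-separated tokens as "words", so each character
    # becomes a token; an underscore stands in for the space character.
    with open(f"{prefix}.txt", "w", encoding="utf-8") as f:
        for line in lines:
            f.write(" ".join(line.replace(" ", "_")) + "\n")

    # lmplz estimates a modified Kneser-Ney smoothed model and writes an ARPA file.
    subprocess.run(
        f"lmplz -o {order} --discount_fallback < {prefix}.txt > {prefix}.arpa",
        shell=True, check=True,
    )
    # build_binary converts the ARPA file to KenLM's faster binary format.
    subprocess.run(["build_binary", f"{prefix}.arpa", f"{prefix}.bin"], check=True)
    return kenlm.Model(f"{prefix}.bin")
```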
Calculate Entropy Measures: This analysis uses progressively more sophisticated entropy measures to capture different levels of character predictability:
Zero-order Entropy (H_0): Represents the theoretical maximum entropy if all characters were equally likely: H_0 = \log_2(N) where N is the number of unique characters. This serves as our baseline.
First-order Entropy (H_1): Incorporates actual character frequencies, showing how individual character probabilities affect entropy: H_1 = -\sum_{i=1}^N p_i \log_2(p_i) where p_i is the probability of character i. H_1 \leq H_0 always, with equality only if all characters are equally frequent.
Second-order/Collision Entropy (H_2): The Rényi entropy of order 2, based on how likely two independently drawn characters are to coincide: H_2 = -\log_2 \sum_{i=1}^N p_i^2. H_2 is always less than or equal to H_1, with the gap widening as the distribution becomes more dominated by its most frequent characters.
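The three closed-form measures above can be computed directly from unigram character frequencies; the helper below is a minimal illustration:

```python
import math
from collections import Counter

def entropy_measures(text: str) -> tuple[float, float, float]:
    """Return (H_0, H_1, H_2) in bits per character from unigram frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]

    h0 = math.log2(len(counts))                     # H_0 = log2(N)
    h1 = -sum(p * math.log2(p) for p in probs)      # H_1 = -sum p_i log2 p_i
    h2 = -math.log2(sum(p * p for p in probs))      # H_2 = -log2 sum p_i^2
    return h0, h1, h2
```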
Third-order Entropy (H_3): Uses the KenLM model with modified Kneser-Ney smoothing to capture longer-range dependencies: H_3 = -\frac{1}{T}\sum_{i=1}^T \log_2 P(x_i|x_{i-k},\ldots,x_{i-1}) where T is the number of characters scored and the context length k is n-1 (7 for the 8-gram models used here).
The probability P is calculated using Kneser-Ney smoothing: P_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^i) - D, 0)}{\sum_w c(w_{i-n+1}^{i-1}w)} + \lambda(w_{i-n+1}^{i-1})\,P_{KN}(w_i|w_{i-n+2}^{i-1}) where c(\cdot) counts n-gram occurrences, D is the discount, and \lambda is the normalizing back-off weight.
This is the most sophisticated level of character prediction in the analysis, with H_3 \leq H_2 reflecting the additional structure exposed by longer contexts.
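A sketch of how H_3 can be estimated with the kenlm Python bindings, reusing the character-spacing convention from the training sketch; counting the end-of-sentence token toward T is a convention choice, and the helper name is illustrative:

```python
import math
import kenlm

LOG2_10 = math.log2(10)  # kenlm reports log10 probabilities

def third_order_entropy(model: kenlm.Model, lines: list[str]) -> float:
    """Estimate H_3 as the average negative log2 probability per character."""
    total_log10, total_tokens = 0.0, 0
    for line in lines:
        spaced = " ".join(line.replace(" ", "_"))   # same tokenization as training
        # score() returns the log10 probability of the whole sequence,
        # including the end-of-sentence transition when eos=True.
        total_log10 += model.score(spaced, bos=True, eos=True)
        total_tokens += len(spaced.split()) + 1     # characters plus </s>
    return -(total_log10 * LOG2_10) / total_tokens
```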
Calculate Redundancy: Finally, we quantify how much of the theoretical maximum entropy is reduced by the language’s structure: \text{Redundancy} = (1 - H_3 / H_0) \times 100
This percentage represents how much of the potential information capacity is used for structural patterns rather than conveying new information. A higher redundancy indicates more predictable character sequences.
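As a quick worked check using the English (Europarl) row from the results below: with H_0 = 4.70 and H_3 = 1.60, \text{Redundancy} = (1 - 1.60 / 4.70) \times 100 \approx 66.0\%, which agrees with the tabulated 65.94% up to rounding of the displayed entropy values.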
Each entropy measure (H_0 through H_3) reveals a different aspect of the writing system’s structure, with the sequence H_0 \geq H_1 \geq H_2 \geq H_3 showing how each level of analysis captures additional predictability in the text.
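For completeness, a hypothetical end-to-end run chaining the sketches above; the corpus path and all helper names are the illustrative ones defined earlier, not the project’s actual API:

```python
# Hypothetical end-to-end run over one corpus.
lines = load_and_clean("corpora/europarl_en.txt", "english")
model = train_char_model(lines, prefix="europarl_en", order=8)

# Note: the space character is included here; the tables below count letters only.
text = " ".join(lines)
h0, h1, h2 = entropy_measures(text)
h3 = third_order_entropy(model, lines)
redundancy = (1 - h3 / h0) * 100

print(f"H0={h0:.2f}  H1={h1:.2f}  H2={h2:.2f}  H3={h3:.2f}  Redundancy={redundancy:.2f}%")
```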
Results
European Languages
Language | Grapheme Inventory | H_0 | H_1 | H_2 | H_3 | Redundancy |
---|---|---|---|---|---|---|
Linear B | 86 | 6.43 | 5.74 | 5.46 | 2.34 | 63.54% |
English (Europarl) | 26 | 4.70 | 4.14 | 3.89 | 1.60 | 65.94% |
French | 39 | 5.29 | 4.13 | 3.85 | 1.63 | 69.08% |
German | 30 | 4.91 | 4.17 | 3.78 | 1.39 | 71.68% |
Italian | 35 | 5.13 | 4.02 | 3.76 | 1.62 | 68.46% |
Greek | 24 | 4.58 | 4.16 | 3.96 | 1.80 | 60.64% |
Spanish | 33 | 5.04 | 4.14 | 3.85 | 1.64 | 67.45% |
Dutch | 28 | 4.81 | 4.09 | 3.70 | 1.40 | 70.82% |
English Corpora
Corpus | Token Count | Vocab Count | H_0 | H_1 | H_2 | H_3 | Redundancy |
---|---|---|---|---|---|---|---|
Brown | 4,369,721 | 46,018 | 4.70 | 4.18 | 3.93 | 1.63 | 65.39% |
Reuters | 5,845,812 | 28,835 | 4.75 | 4.19 | 3.95 | 1.80 | 62.08% |
Webtext | 1,193,886 | 16,303 | 5.13 | 4.27 | 4.06 | 1.72 | 66.50% |
Inaugural | 593,092 | 9,155 | 4.75 | 4.15 | 3.88 | 1.63 | 65.81% |
State Union | 1,524,983 | 12,233 | 4.81 | 4.16 | 3.91 | 1.67 | 65.17% |
Gutenberg | 8,123,136 | 41,350 | 4.91 | 4.16 | 3.91 | 1.83 | 62.70% |
Analysis
Language Family Patterns
Germanic Languages
German (71.68%) and Dutch (70.82%) show the highest redundancy rates, indicating strongly structured, predictable character sequences. They also show the sharpest entropy reductions from H_0 to H_3, which may point to orthographies that rely heavily on a limited set of recurring character combinations.
Romance Languages
French, Italian, and Spanish exhibit moderate redundancy levels (67-69%). This cluster balances structure and information density, possibly due to phonological transparency and shared Latin roots.
Modern Greek
With the lowest redundancy rate at 60.64%, Greek shows an efficient information encoding system, using a small grapheme inventory (24 characters) to maintain high information density and minimal redundancy.
Writing System Complexity
Linear B (Syllabary)
Despite its large grapheme inventory (86 symbols), Linear B has a redundancy similar to alphabetic systems (63.54%). This suggests an optimized information encoding despite the use of a syllabary, which traditionally encodes larger linguistic units. The high absolute entropy (H_0 of 6.43) reflects the character set size rather than redundancy, pointing to adaptability across writing system types.
Alphabetic Systems
The alphabetic scripts in the sample display H_0 values ranging from 4.58 (Greek) to 5.29 (French), while redundancy varies substantially (61-72%). This indicates that languages optimize orthographic efficiency based on linguistic and structural needs, balancing character set size, information density, and redundancy.
Historical Implications
Linear B’s redundancy rate of about 63.5% aligns with modern alphabetic systems, implying a shared constraint on redundancy in writing systems that balances clarity with processing demands. This consistency across time and writing-system types suggests that writing systems evolve under a common pressure to keep information encoding both readable and efficient.
Conclusion
These findings highlight a striking convergence in redundancy across diverse writing systems, with values clustering roughly between 60% and 72% for both ancient and modern languages. This suggests an underlying principle in how written communication systems evolve to optimize readability and efficiency, irrespective of structural differences such as syllabaries versus alphabets. The observed patterns suggest that successful writing systems may converge toward an optimal balance of information density, where roughly two-thirds redundancy keeps communication efficient yet accessible.
This convergence, seen in the English corpora (62-66%) and Linear B (63.54%), implies that redundancy rates in the low-to-mid sixties may represent a near-universal optimization point in human communication, transcending linguistic, cultural, and structural differences. These insights are valuable for understanding the evolution of writing systems and may also inform the design of new or enhanced systems for digital and human communication.
Source Code
References
Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50-64.
Citation
@online{winstead2024,
author = {Winstead, John},
title = {Shannon {Entropy} {Analysis} {Across} {Languages}},
date = {2024-10-01},
url = {https://jhnwnstd.github.io/projects/shannon-entropy/},
langid = {en}
}