

Shannon Entropy Analysis Across Languages

Information Theory
Computational Linguistics
Quantifying linguistic entropy across ancient and modern writing systems via n-gram modeling
Published

October 1, 2024

Background

Claude Shannon’s 1951 paper introduced methods to estimate entropy and redundancy in language by analyzing the predictability of characters based on preceding text (Shannon, 1951). Entropy captures the average information per character, while redundancy reflects how much the structure of a language makes its character sequences predictable.

This project extends Shannon’s methodology using KenLM, a modern language-modeling toolkit, to build n-gram models that predict character sequences. The models are applied to multiple English corpora, European languages from the Europarl corpus, and Linear B (the script used to write Mycenaean Greek), in order to compare linguistic predictability across writing systems and language families.

Methodology

  1. Load and Format Corpus (the full pipeline is sketched in code after this list):

    • Language-specific character filtering
    • Support for diacritics and special characters
    • Unicode handling for Linear B (U+10000 to U+100FF)
    • Deduplication and cleaning
  2. Build KenLM Model:

    • 8-gram language models for sequence prediction
    • Modified Kneser-Ney smoothing for robust probability estimates
    • Both ARPA and binary model formats
  3. Calculate Entropy Measures: This analysis uses progressively more sophisticated entropy measures to capture different levels of character predictability:

    • Zero-order Entropy (H_0): Represents the theoretical maximum entropy if all characters were equally likely: H_0 = \log_2(N) where N is the number of unique characters. This serves as our baseline.

    • First-order Entropy (H_1): Incorporates actual character frequencies, showing how individual character probabilities affect entropy: H_1 = -\sum_{i=1}^N p_i \log_2(p_i) where p_i is the probability of character i. H_1 \leq H_0 always, with equality only if all characters are equally frequent.

    • Second-order/Collision Entropy (H_2): The Rényi entropy of order 2, which measures the probability that two independently drawn characters coincide and therefore weights frequent characters more heavily: H_2 = -\log_2 \sum_{i=1}^N p_i^2. H_2 is always less than or equal to H_1, with the gap widening as the character distribution becomes more skewed toward a few frequent symbols.

    • Third-order Entropy (H_3): Uses KenLM with modified Kneser-Ney smoothing to capture longer-range dependencies: H_3 = -\frac{1}{T}\sum_{i=1}^T \log_2 P(x_i|x_{i-k},\ldots,x_{i-1}) where T is the number of characters scored and k is the context length (up to 7 preceding characters for the 8-gram models used here).

      The probability P is estimated with modified Kneser-Ney smoothing: P_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^i) - D, 0)}{\sum_w c(w_{i-n+1}^{i-1}w)} + \lambda(w_{i-n+1}^{i-1})P_{KN}(w_i|w_{i-n+2}^{i-1})

This captures the most sophisticated level of character prediction, with H_3 \leq H_2 showing additional structure captured by longer sequences.
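
As a rough illustration of steps 1 through 3, the sketch below prepares a character-level corpus, trains an 8-gram model, and estimates H_3 as the average number of bits per character under that model. This is a minimal sketch rather than the project’s actual implementation: it assumes the KenLM command-line tools (lmplz, build_binary) and the kenlm Python bindings are installed, and the file names and English-only character filter are placeholders.

```python
import math
import re
import subprocess
from pathlib import Path

import kenlm  # Python bindings for KenLM

# --- Step 1: load and format the corpus (placeholder path and filter) ---
text = Path("corpus.txt").read_text(encoding="utf-8").lower()
text = re.sub(r"[^a-z ]", "", text)  # hypothetical filter: keep a-z and space

# KenLM expects whitespace-separated tokens, so write the corpus as
# space-separated characters, with "_" standing in for a real space.
chars = " ".join("_" if ch == " " else ch for ch in text)
Path("chars.txt").write_text(chars + "\n", encoding="utf-8")

# --- Step 2: train an 8-gram model with modified Kneser-Ney smoothing ---
# lmplz applies modified Kneser-Ney by default; --discount_fallback helps on
# character-level data, where discount estimation can otherwise fail.
with open("chars.txt") as src, open("model.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "8", "--discount_fallback"],
                   stdin=src, stdout=arpa, check=True)
subprocess.run(["build_binary", "model.arpa", "model.binary"], check=True)

# --- Step 3: estimate H_3 as average bits per character under the model ---
model = kenlm.Model("model.binary")
log10_prob = model.score(chars, bos=False, eos=False)  # log10 P(sequence)
n_chars = len(chars.split())
h3 = -log10_prob / n_chars / math.log10(2)  # convert log10 to bits per character
print(f"H_3 ≈ {h3:.2f} bits per character")
```

In practice the corpus would be scored line by line rather than as one long string, and the character filter would be adapted per language (including the U+10000 to U+100FF range for Linear B).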

  4. Calculate Redundancy: Finally, we quantify how much of the theoretical maximum entropy is reduced by the language’s structure: \text{Redundancy} = (1 - H_3 / H_0) \times 100

This percentage represents how much of the potential information capacity is used for structural patterns rather than conveying new information. A higher redundancy indicates more predictable character sequences.

Each entropy measure (H_0 through H_3) reveals a different aspect of the writing system’s structure, with the sequence H_0 \geq H_1 \geq H_2 \geq H_3 showing how each level of analysis captures additional predictability in the text.
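
The unigram-based measures (H_0, H_1, H_2) and the redundancy figure follow directly from character counts once H_3 is available. A minimal sketch, with the input text and the H_3 value as placeholders:

```python
import math
from collections import Counter

def entropy_ladder(text: str, h3: float) -> dict:
    """Compute H_0, H_1, H_2 and redundancy for a character sequence.

    h3 is assumed to come from an n-gram model, e.g. the KenLM estimate
    sketched above.
    """
    counts = Counter(text)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]

    h0 = math.log2(len(counts))                 # uniform baseline over unique characters
    h1 = -sum(p * math.log2(p) for p in probs)  # unigram (first-order) entropy
    h2 = -math.log2(sum(p * p for p in probs))  # collision (Renyi order-2) entropy
    redundancy = (1 - h3 / h0) * 100            # percent of capacity spent on structure

    return {"H0": h0, "H1": h1, "H2": h2, "H3": h3, "redundancy": redundancy}

# Placeholder usage with an arbitrary sentence and an assumed H_3 value.
print(entropy_ladder("the quick brown fox jumps over the lazy dog", h3=1.60))
```

As a sanity check against the tables below, the English (Europarl) values give 1 − 1.60/4.70 ≈ 0.66, i.e. roughly 66% redundancy, in line with the 65.94% reported.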

Results

European Languages

| Language           | Grapheme Inventory | H_0  | H_1  | H_2  | H_3  | Redundancy |
|--------------------|--------------------|------|------|------|------|------------|
| Linear B           | 86                 | 6.43 | 5.74 | 5.46 | 2.34 | 63.54%     |
| English (Europarl) | 26                 | 4.70 | 4.14 | 3.89 | 1.60 | 65.94%     |
| French             | 39                 | 5.29 | 4.13 | 3.85 | 1.63 | 69.08%     |
| German             | 30                 | 4.91 | 4.17 | 3.78 | 1.39 | 71.68%     |
| Italian            | 35                 | 5.13 | 4.02 | 3.76 | 1.62 | 68.46%     |
| Greek              | 24                 | 4.58 | 4.16 | 3.96 | 1.80 | 60.64%     |
| Spanish            | 33                 | 5.04 | 4.14 | 3.85 | 1.64 | 67.45%     |
| Dutch              | 28                 | 4.81 | 4.09 | 3.70 | 1.40 | 70.82%     |

[Figure: Shannon Entropy Analysis]

English Corpora

| Corpus      | Token Count | Vocab Count | H_0  | H_1  | H_2  | H_3  | Redundancy |
|-------------|-------------|-------------|------|------|------|------|------------|
| Brown       | 4,369,721   | 46,018      | 4.70 | 4.18 | 3.93 | 1.63 | 65.39%     |
| Reuters     | 5,845,812   | 28,835      | 4.75 | 4.19 | 3.95 | 1.80 | 62.08%     |
| Webtext     | 1,193,886   | 16,303      | 5.13 | 4.27 | 4.06 | 1.72 | 66.50%     |
| Inaugural   | 593,092     | 9,155       | 4.75 | 4.15 | 3.88 | 1.63 | 65.81%     |
| State Union | 1,524,983   | 12,233      | 4.81 | 4.16 | 3.91 | 1.67 | 65.17%     |
| Gutenberg   | 8,123,136   | 41,350      | 4.91 | 4.16 | 3.91 | 1.83 | 62.70%     |

[Figure: English Entropy Analysis]

Analysis

Language Family Patterns

Germanic Languages

The highest redundancy rates are observed in German (71.68%) and Dutch (70.82%), suggesting strongly structured character sequences and highly predictable orthographic patterns. These two languages also show the sharpest entropy reductions from H_0 to H_3, which may point to orthographies that rely heavily on established character combinations.

Romance Languages

French, Italian, and Spanish exhibit moderate redundancy levels (roughly 67-69%). This cluster balances structure and information density, possibly due to phonological transparency and shared Latin roots.

Modern Greek

With the lowest redundancy rate at 60.64%, Greek shows an efficient information encoding system, using a small grapheme inventory (24 characters) to maintain high information density and minimal redundancy.

Writing System Complexity

Linear B (Syllabary)

Despite its large grapheme inventory (86 symbols), Linear B has a redundancy similar to alphabetic systems (63.54%). This suggests an optimized information encoding despite the use of a syllabary, which traditionally encodes larger linguistic units. The high absolute entropy (H_0 of 6.43) reflects the character set size rather than redundancy, pointing to adaptability across writing system types.

Alphabetic Systems

The alphabetic languages in this sample display H_0 values ranging from 4.58 (Greek) to 5.29 (French), while redundancy varies considerably (roughly 60-72%). This indicates that languages optimize orthographic efficiency based on linguistic and structural needs, balancing character-set size, information density, and redundancy.

Historical Implications

Linear B’s redundancy rate of close to 63% aligns with modern alphabetic systems, implying a universal constraint on redundancy in writing systems that balances clarity with processing demands. This consistency across time and across writing-system types suggests that writing systems evolve toward encodings that remain both readable and efficient.

Conclusion

These findings highlight a striking convergence in redundancy rates across diverse writing systems, with redundancy clustering between roughly 60% and 72% across ancient and modern languages. This suggests an underlying principle in how human writing systems evolve to optimize readability and efficiency, irrespective of structural differences such as syllabaries versus alphabets. The observed patterns suggest that successful writing systems may converge toward an optimal balance of information density, where roughly two-thirds redundancy ensures efficient yet accessible communication.

This convergence, seen in the English corpora (roughly 62-67%) and Linear B (63.54%), implies that redundancy rates in the low-to-mid 60-percent range may represent a universal optimization point in human communication, transcending linguistic, cultural, and structural differences. These insights are valuable for understanding the evolution of writing systems and may also have applications in developing new or enhanced systems for digital and human communication.

Source Code

Implementation available at: github.com/jhnwnstd/shannon

References

Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50–64. https://doi.org/10.1002/j.1538-7305.1951.tb01366.x

Citation

BibTeX citation:
@online{winstead2024,
  author = {Winstead, John},
  title = {Shannon {Entropy} {Analysis} {Across} {Languages}},
  date = {2024-10-01},
  url = {https://jhnwnstd.github.io/projects/shannon-entropy/},
  langid = {en}
}
For attribution, please cite this work as:
Winstead, J. (2024, October 1). Shannon Entropy Analysis Across Languages. https://jhnwnstd.github.io/projects/shannon-entropy/

© 2025, John Winstead
