Using Gini index and Entropy analysis to determine writing direction
Published
November 15, 2024
Overview
Writing systems do more than reflect language: they also encode directional properties in the statistical patterns of characters below the word level. This research demonstrates how analyzing character distributions at word boundaries can determine a text’s writing direction, whether left-to-right (LTR) or right-to-left (RTL), without prior knowledge of the language or script. The project builds on the methods outlined in The “handedness” of language (Ashraf & Sinha, 2018), combining them with computational tools to validate and refine directionality detection for writing systems.
Historical Context and Significance
Writing direction has evolved differently across civilizations, reflecting cultural practices and cognitive constraints. While modern Indo-European languages predominantly use left-to-right writing, many historical and contemporary writing systems run right-to-left or top-to-bottom. Being able to determine the correct direction of writing, and therefore of reading, would prove indispensable to historical linguistics and to the decipherment of ancient languages.
Research Methodology
Data Collection Process
The study analyzed texts from multiple language families, ensuring:
Representative samples across writing systems
Consistent sample sizes (maximum number of tokens from the respective corpora)
Clean, standardized text preprocessing
Key Innovation
This approach leverages two mathematical properties:
Character Distribution Inequality (Gini coefficient)
Measures statistical dispersion in character usage
Reveals systemic constraints at word boundaries
Provides a quantitative measure of positional rules
Character Randomness (Entropy)
Quantifies predictability of character sequences
Captures linguistic constraints in writing systems
Reflects underlying phonological patterns
By analyzing these properties at word boundaries, the method determined writing direction with 100% accuracy across the tested languages.
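As a minimal sketch of what "analyzing these properties at word boundaries" could look like in practice, the snippet below tallies word-initial and word-final characters. The function name `boundary_counts`, the whitespace tokenization, and the choice to ignore non-alphabetic characters are all assumptions for illustration; the source does not specify its tokenization or preprocessing.

```python
from collections import Counter


def boundary_counts(text):
    """Count word-initial and word-final characters (hypothetical helper).

    Assumes words are separated by whitespace and that only alphabetic
    characters count as boundary characters.
    """
    initial, final = Counter(), Counter()
    for word in text.split():
        # Drop punctuation and digits so they are not treated as boundary characters
        letters = [ch for ch in word if ch.isalpha()]
        if letters:
            initial[letters[0]] += 1
            final[letters[-1]] += 1
    return initial, final


initial, final = boundary_counts("the cat sat on the mat")
print(initial.most_common())
print(final.most_common())
```

The two counters are the raw material for both measures: each statistic is computed once over the initial-position counts and once over the final-position counts.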
Statistical Measures
Gini Coefficient
The Gini coefficient measures distribution inequality:
$$G = \frac{\sum_{i=1}^n\sum_{j=1}^n|x_i - x_j|}{2n\sum_{i=1}^nx_i}$$
The coefficient reveals crucial patterns:
Higher values indicate uneven distribution
Computed separately for initial and final positions
Key indicator of character constraints
Reflects writing system optimization
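The Gini formula above can be sketched directly in Python. This is an illustrative implementation, not necessarily the one used in the project: the function name `gini` is hypothetical, and treating each x_i as the count of one observed character (so characters that never occur in a position are left out) is an assumption.

```python
def gini(counts):
    """Gini coefficient using the mean-absolute-difference form of the formula."""
    x = list(counts)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0  # no data: treat as perfectly even
    # Double sum over all ordered pairs |x_i - x_j|
    diff_sum = sum(abs(a - b) for a in x for b in x)
    return diff_sum / (2 * n * total)


print(gini([5, 5, 5, 5]))   # perfectly even usage
print(gini([10, 0, 0, 0]))  # one character dominates
```

The double sum is quadratic in the number of distinct characters, which is fine for alphabets of a few dozen symbols; a sorted, linear-time form would be preferable for much larger inventories.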
Shannon Entropy
Entropy quantifies randomness in character usage:
$$H = -\sum_{i=1}^n p_i\log_2(p_i)$$
This measure provides essential insights:
Higher values indicate more varied distribution
Measures character choice flexibility
Critical for direction determination
Captures linguistic constraints
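A short sketch of the entropy formula above, computed from raw character counts. The function name `shannon_entropy` is hypothetical, and estimating each p_i as a character's relative frequency in the sample is an assumption about how the probabilities are obtained.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits, with p_i estimated as count_i / total."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # -p * log2(p) is written as p * log2(1/p); zero counts contribute nothing
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


print(shannon_entropy([1, 1, 1, 1]))  # four equally likely characters
print(shannon_entropy([4]))           # a single character, no uncertainty
```

With four equally frequent characters the result is 2 bits; with a single character it is 0, matching the reading that higher values indicate a more varied distribution.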
Pattern Analysis
Left-to-Right Scripts
LTR Characteristics
My analysis revealed clear patterns in LTR writing systems. Words begin with high entropy (varied characters) and low Gini coefficients (even distribution), showing flexibility in word-initial positions. Conversely, word endings show low entropy with high Gini coefficients, indicating restricted character sets and stronger morphological constraints.
Finnish exemplifies this pattern perfectly: initial entropy of 3.8126 versus final entropy of 2.3642. This stark difference provides a reliable signal for detecting LTR writing direction.
Right-to-Left Scripts
RTL Characteristics
RTL scripts invert these patterns. Word beginnings show low entropy and high Gini coefficients, reflecting strict constraints on initial characters. Word endings display high entropy with low Gini coefficients, allowing more varied character combinations.
Arabic demonstrates this clearly: initial entropy of 3.5498 rises to 3.8864 at word endings. This reverse pattern, typical of Semitic scripts, provides a clear statistical signature for RTL writing systems.
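The LTR and RTL signatures described above suggest a simple comparison rule. The sketch below is a deliberate simplification: it uses entropy alone, the function name `guess_direction` is hypothetical, and the source does not state its exact decision rule or how the Gini coefficient is combined with entropy. The inputs are the entropy values reported above for Finnish and Arabic.

```python
def guess_direction(initial_entropy, final_entropy):
    """Guess writing direction from boundary entropies (illustrative rule only).

    Higher entropy at word beginnings than at word endings is read as LTR,
    the reverse as RTL, following the patterns described in the text.
    """
    if initial_entropy > final_entropy:
        return "LTR"
    if initial_entropy < final_entropy:
        return "RTL"
    return "undetermined"


print(guess_direction(3.8126, 2.3642))  # Finnish values from the text
print(guess_direction(3.5498, 3.8864))  # Arabic values from the text
```

A fuller version would presumably also check that the Gini coefficients point the same way (low where entropy is high) before committing to an answer.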
This research enables several key advances in script analysis. The statistical patterns discovered can automatically detect writing direction in unknown scripts by analyzing character distributions at word boundaries, which is particularly valuable for ancient texts where the direction is not immediately apparent. For ancient writing systems, these metrics provide objective evidence for directional hypotheses and can highlight potential transcription errors where character patterns deviate unexpectedly.
Future Work
Primary Research Directions
Complex Scripts
Vertical writing systems
Mixed-direction texts
Hybrid writing systems
Advanced Methods
Enhanced statistical measures
Pattern learning systems
New Applications
Ancient script analysis
Unknown writing systems
Methodological Extensions
Future work will expand this research in several critical directions. I plan to analyze both smaller and larger text corpora across more languages to validate the patterns I have found, while incorporating additional statistical measures beyond Gini and entropy to capture subtler directional features. I aim to test the method’s robustness through cross-script validation, especially with scripts that mix directions or use unconventional layouts.
Ashraf, Md. I., & Sinha, S. (2018). The “handedness” of language: Directional symmetry breaking of sign usage in words. PLoS ONE, 13(1), e0190735. https://doi.org/10.1371/journal.pone.0190735
Citation
BibTeX citation:
@online{winstead2024,
author = {Winstead, John},
title = {Writing {Direction} {Detection}},
date = {2024-11-15},
url = {https://jhnwnstd.github.io/projects/writing-direction/},
langid = {en}
}