Writing Direction Detection

Writing Systems

Information Theory

Computational Linguistics

Using the Gini index and entropy analysis to determine writing direction

Published

November 15, 2024

Overview

Writing systems do more than reflect language: the statistical patterns of their characters at the sub-word level also encode directional properties. This research demonstrates how analyzing character distributions at word boundaries can determine a text’s writing direction, whether left-to-right (LTR) or right-to-left (RTL), without prior knowledge of the language or script. The project builds on the methods of The “handedness” of language (Ashraf & Sinha, 2018) and pairs them with computational tools to validate and refine direction detection for writing systems.

Historical Context and Significance

Writing direction has evolved differently across civilizations, reflecting cultural practices and cognitive constraints. While modern Indo-European languages predominantly use left-to-right writing, many historical and contemporary writing systems run right-to-left or top-to-bottom. Being able to determine the correct direction of writing, and therefore of reading, would prove indispensable to historical linguistics and the decipherment of ancient languages.

Research Methodology

Data Collection Process

The study analyzed texts from multiple language families, ensuring:

Representative samples across writing systems
Consistent sample sizes (maximum number of tokens from the respective corpora)
Clean, standardized text preprocessing

Key Innovation

This approach leverages two mathematical properties:

Character Distribution Inequality (Gini coefficient)
- Measures statistical dispersion in character usage
- Reveals systemic constraints at word boundaries
- Provides a quantitative measure of positional rules
Character Randomness (Entropy)
- Quantifies predictability of character sequences
- Captures linguistic constraints in writing systems
- Reflects underlying phonological patterns

By analyzing these properties at word boundaries, the method determined writing direction correctly for all 13 tested languages (100% accuracy).
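As a minimal sketch of what analyzing word boundaries can look like in practice, the Python snippet below tallies the first and last letter of each whitespace-delimited token. The function name `boundary_counts`, the whitespace tokenization, and the decision to drop non-letters are assumptions made for illustration, not details taken from the project's source code. "First" and "last" here refer to the order in which characters are stored in the string.

```python
from collections import Counter


def boundary_counts(text):
    """Count the first and last letter of every whitespace-delimited token.

    Hypothetical helper: tokenization and filtering rules are assumptions.
    """
    initial, final = Counter(), Counter()
    for token in text.split():
        # Keep letters only, so punctuation is not counted as a boundary sign.
        letters = [ch for ch in token.lower() if ch.isalpha()]
        if letters:
            initial[letters[0]] += 1
            final[letters[-1]] += 1
    return initial, final


initial, final = boundary_counts("The cat sat on the mat.")
print(initial.most_common(1))  # [('t', 2)]
print(final.most_common(1))    # [('t', 3)]
```

The two counters are the raw material for the inequality and randomness measures: each is computed once on the word-initial counts and once on the word-final counts.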

Statistical Measures

Gini Coefficient

The Gini coefficient measures distribution inequality:

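G = \frac{\sum_{i=1}^n\sum_{j=1}^n|x_i - x_j|}{2n\sum_{i=1}^nx_i}

Here x_i is the frequency of the i-th character at a given word position and n is the number of characters. The coefficient reveals crucial patterns: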

Higher values indicate uneven distribution
Computed separately for initial and final positions
Key indicator of character constraints
Reflects writing system optimization
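The formula can be transcribed almost directly into Python. The sketch below is a plain O(n^2) version, assuming the inputs are non-negative character counts for one word position. `gini` is a hypothetical helper name rather than a function from the project's repository.

```python
def gini(values):
    """Gini coefficient: sum of all pairwise absolute differences,
    divided by 2 * n * (sum of values)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # Convention chosen here for empty or all-zero input.
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * n * total)


print(gini([5, 5, 5, 5]))   # 0.0  (every character equally frequent)
print(gini([0, 0, 0, 20]))  # 0.75 (one character takes every occurrence)
```

With this formula the maximum for n values is (n - 1) / n, so the second example reaches the upper bound for four characters. Whether characters that never occur at a position are included as zeros changes the result, and that choice is left open here.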

Shannon Entropy

Entropy quantifies randomness in character usage:

H = -\sum_{i=1}^n p_i\log_2(p_i)

Here p_i is the relative frequency of the i-th character at a given word position. This measure provides essential insights:

Higher values indicate more varied distribution
Measures character choice flexibility
Critical for direction determination
Captures linguistic constraints
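A short Python sketch of the entropy formula follows, assuming the input is a list of character counts that are normalized into probabilities inside the function. The name `shannon_entropy` is illustrative and not taken from the project's code.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts contribute nothing, since p * log2(p) tends to 0 as p -> 0.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


print(shannon_entropy([5, 5, 5, 5]))  # 2.0 (four equally likely characters)
print(shannon_entropy([10, 10]))      # 1.0 (two equally likely characters)
```

A uniform distribution over n characters gives log2(n) bits, which is the largest value possible for that alphabet size; a position dominated by a few characters scores lower.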

Pattern Analysis

Left-to-Right Scripts

LTR Characteristics

My analysis revealed clear patterns in LTR writing systems. Words begin with high entropy (varied characters) and low Gini coefficients (even distribution), showing flexibility in word-initial positions. Conversely, word endings show low entropy with high Gini coefficients, indicating restricted character sets and stronger morphological constraints.

Finnish exemplifies this pattern perfectly: initial entropy of 3.8126 versus final entropy of 2.3642. This stark difference provides a reliable signal for detecting LTR writing direction.

Right-to-Left Scripts

RTL Characteristics

RTL scripts invert these patterns. Word beginnings show low entropy and high Gini coefficients, reflecting strict constraints on initial characters. Word endings display high entropy with low Gini coefficients, allowing more varied character combinations.

Arabic demonstrates this clearly: initial entropy of 3.5498 rises to 3.8864 at word endings. This reverse pattern, typical of Semitic scripts, provides a clear statistical signature for RTL writing systems.
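One way to write the two signatures as a decision rule is shown below. It assumes the differences are taken as initial minus final, which matches the sign of the values quoted above. How the project's implementation actually combines the two measures is not stated here, so the rule is an illustration rather than the definitive method.

\Delta H = H_{\text{initial}} - H_{\text{final}}, \qquad \Delta G = G_{\text{initial}} - G_{\text{final}}

\text{direction} = \begin{cases} \text{LTR} & \text{if } \Delta H > 0 \text{ and } \Delta G < 0 \\ \text{RTL} & \text{if } \Delta H < 0 \text{ and } \Delta G > 0 \end{cases}

\text{Finnish: } \Delta H = 3.8126 - 2.3642 = 1.4484 > 0 \qquad \text{Arabic: } \Delta H = 3.5498 - 3.8864 = -0.3366 < 0

Cases where the two signs agree with each other fall outside both branches and would need a tie-breaking convention.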

Results

Performance by Language

Language	Initial Gini	Final Gini	Initial Entropy	Final Entropy	Gini Diff (Initial - Final)	Entropy Diff (Initial - Final)	Direction
Danish	0.4391	0.4843	4.1322	3.3451	-0.0452	0.7871	LTR
Dutch	0.4505	0.5055	3.9709	3.0675	-0.0550	0.9034	LTR
English	0.3716	0.5193	3.9666	3.4086	-0.1477	0.5581	LTR
Finnish	0.3359	0.4985	3.8126	2.3642	-0.1626	1.4484	LTR
French	0.4012	0.6053	4.0665	2.9424	-0.2041	1.1241	LTR
German	0.4223	0.5609	4.0246	2.9083	-0.1385	1.1163	LTR
Greek	0.4911	0.5529	4.0549	3.0230	-0.0618	1.0319	LTR
Italian	0.4742	0.5219	3.7092	2.4568	-0.0477	1.2524	LTR
Portuguese	0.4183	0.5974	3.9604	2.6653	-0.1790	1.2951	LTR
Spanish	0.4682	0.5281	3.7903	2.6029	-0.0599	1.1874	LTR
Swedish	0.4063	0.5230	4.2511	3.3237	-0.1167	0.9275	LTR
Arabic	0.5121	0.4185	3.5498	3.8864	0.0935	-0.3366	RTL
Hebrew	0.5265	0.4938	3.3217	3.5849	0.0327	-0.2633	RTL
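The difference columns and the Direction column can be re-derived from the four measured columns. The sketch below copies the table rows into a string, recomputes each difference as initial minus final, and predicts direction from the sign of the entropy difference alone. The helper names, the tolerance of 1.5e-4 (to absorb four-decimal rounding), and the use of the entropy sign as the predictor are assumptions made for illustration.

```python
TABLE = """
Danish 0.4391 0.4843 4.1322 3.3451 -0.0452 0.7871 LTR
Dutch 0.4505 0.5055 3.9709 3.0675 -0.0550 0.9034 LTR
English 0.3716 0.5193 3.9666 3.4086 -0.1477 0.5581 LTR
Finnish 0.3359 0.4985 3.8126 2.3642 -0.1626 1.4484 LTR
French 0.4012 0.6053 4.0665 2.9424 -0.2041 1.1241 LTR
German 0.4223 0.5609 4.0246 2.9083 -0.1385 1.1163 LTR
Greek 0.4911 0.5529 4.0549 3.0230 -0.0618 1.0319 LTR
Italian 0.4742 0.5219 3.7092 2.4568 -0.0477 1.2524 LTR
Portuguese 0.4183 0.5974 3.9604 2.6653 -0.1790 1.2951 LTR
Spanish 0.4682 0.5281 3.7903 2.6029 -0.0599 1.1874 LTR
Swedish 0.4063 0.5230 4.2511 3.3237 -0.1167 0.9275 LTR
Arabic 0.5121 0.4185 3.5498 3.8864 0.0935 -0.3366 RTL
Hebrew 0.5265 0.4938 3.3217 3.5849 0.0327 -0.2633 RTL
"""


def parse(table):
    """Split each row into (language, six numeric columns, reported direction)."""
    rows = []
    for line in table.strip().splitlines():
        name, *nums, direction = line.split()
        rows.append((name, [float(x) for x in nums], direction))
    return rows


def check(rows, tol=1.5e-4):
    """Recompute both differences and compare them with the reported columns."""
    report = []
    for name, (ig, fg, ie, fe, gini_rep, entropy_rep), direction in rows:
        gini_diff = ig - fg
        entropy_diff = ie - fe
        predicted = "LTR" if entropy_diff > 0 else "RTL"
        arithmetic_ok = (abs(gini_diff - gini_rep) <= tol
                         and abs(entropy_diff - entropy_rep) <= tol)
        signs_opposed = gini_diff * entropy_diff < 0
        report.append((name, predicted, direction, arithmetic_ok, signs_opposed))
    return report


report = check(parse(TABLE))
for row in report:
    print(row)
```

For every row the recomputed differences agree with the reported ones to within rounding, the two differences have opposite signs, and the entropy sign reproduces the Direction column.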

Performance by Language Family

Family	Languages	Average Entropy Diff	Average Gini Diff	Accuracy
Germanic	Danish, Dutch, English, German, Swedish	0.86	-0.12	100%
Romance	French, Italian, Portuguese, Spanish	1.21	-0.13	100%
Semitic	Arabic, Hebrew	-0.30	0.06	100%
Finnic	Finnish	1.45	-0.16	100%
Hellenic	Greek	1.03	-0.06	100%

Notable Patterns

Strongest Directional Signals

Left-to-Right Languages
- Finnish: Highest entropy difference (1.4484)
  - Exceptional clarity in directional signal
  - Strong morphological constraints
- French: Largest Gini difference (-0.2041)
  - Clear positional character preferences
  - Robust statistical separation
- Portuguese: Strong combined signal
  - High entropy difference (1.2951)
  - Significant Gini contrast (-0.1790)
Right-to-Left Languages
- Arabic: Clear reverse pattern
  - Negative entropy difference (-0.3366)
  - Positive Gini difference (0.0935)
  - Strong RTL characteristics
- Hebrew: Consistent RTL signal
  - Aligned with Arabic patterns
  - Weaker but clear direction markers

Language Family Characteristics

Germanic

Average entropy difference: 0.86
Average Gini difference: -0.12
Key Features:
- Consistent LTR pattern across languages
- English shows moderate signal strength
- Swedish and Danish display similar patterns
Notable Variations:
- Dutch and German show stronger entropy signals than English
- Consistent family-level trends

Romance

Average entropy difference: 1.21
Average Gini difference: -0.13
Distinguishing Features:
- Strongest signal among the multi-language families
- French shows an exceptionally clear pattern
- Italian and Spanish share characteristics
Pattern Consistency:
- High uniformity across languages
- Strong morphological markers

Semitic

Average entropy difference: -0.30
Average Gini difference: 0.06
Distinctive Traits:
- Clear reverse pattern from Indo-European
- Arabic shows a stronger signal than Hebrew
- Consistent RTL characteristics
Structural Implications:
- Root-based morphology effects
- Consistent writing traditions

Validation Results

Cross-Validation
- Reversed text analysis confirms detection method
- Signal strength preserved in reverse
- Consistent direction detection
Statistical Significance
- Clear separation between LTR and RTL patterns
- No ambiguous or indeterminate cases
- Robust across sample sizes
Error Analysis
- No false positives
- No false negatives
- Perfect classification accuracy
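The reversed-text check rests on a simple symmetry: reversing every word swaps the word-initial and word-final distributions, so the entropy difference changes sign while keeping its magnitude. The sketch below demonstrates this on a toy sentence. It assumes that reversal means reversing the characters within each token, and the helper names are hypothetical. The sample is far too small to say anything about English itself.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy in bits of a Counter of character occurrences."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def entropy_diff(tokens):
    """Initial-position entropy minus final-position entropy."""
    initial = Counter(t[0] for t in tokens)
    final = Counter(t[-1] for t in tokens)
    return entropy(initial) - entropy(final)


tokens = "writing systems encode directional properties in their patterns".split()
reversed_tokens = [t[::-1] for t in tokens]  # each word spelled backwards

forward = entropy_diff(tokens)
backward = entropy_diff(reversed_tokens)
print(forward, backward)  # same magnitude, opposite sign
```

The same argument applies to the Gini difference, so a detector based on these signs should report the opposite direction for reversed input.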

Discussion

Practical Applications

This research enables several key advances in script analysis. The statistical patterns discovered can automatically detect writing direction in unknown scripts by analyzing character distributions at word boundaries, which is particularly valuable for ancient texts where the direction is not immediately apparent. For ancient writing systems, these metrics provide objective evidence for directional hypotheses and can highlight potential transcription errors where character patterns deviate unexpectedly.

Future Work

Primary Research Directions

Complex Scripts
- Vertical writing systems
- Mixed-direction texts
- Hybrid writing systems
Advanced Methods
- Enhanced statistical measures
- Pattern learning systems
New Applications
- Ancient script analysis
- Unknown writing systems

Methodological Extensions

Future work will expand this research in several critical directions. I plan to analyze both smaller and larger text corpora across more languages to validate the patterns I have found, while incorporating additional statistical measures beyond Gini and entropy to capture subtler directional features. I aim to test the method’s robustness through cross-script validation, especially with scripts that mix directions or use unconventional layouts.

Source Code

Implementation available at: github.com/jhnwnstd/writing_direction

References

Ashraf, Md. I., & Sinha, S. (2018). The “handedness” of language: Directional symmetry breaking of sign usage in words. PLoS ONE, 13(1), e0190735. https://doi.org/10.1371/journal.pone.0190735

Citation

BibTeX citation:

@online{winstead2024,
  author = {Winstead, John},
  title = {Writing {Direction} {Detection}},
  date = {2024-11-15},
  url = {https://jhnwnstd.github.io/projects/writing-direction/},
  langid = {en}
}

For attribution, please cite this work as:

Winstead, J. (2024, November 15). Writing Direction Detection. https://jhnwnstd.github.io/projects/writing-direction/