John Winstead


Writing Direction Detection

Writing Systems
Information Theory
Computational Linguistics
Using the Gini index and entropy analysis to determine writing direction

Published: November 15, 2024

Overview

Writing systems do more than reflect language: they also encode directional properties in the statistical patterns of characters at the sub-word level. This research demonstrates how analyzing character distributions at word boundaries can determine a text’s writing direction, whether left-to-right (LTR) or right-to-left (RTL), without prior knowledge of the language or script. The project builds on the methods outlined in The “handedness” of language (Ashraf & Sinha, 2018), combining them with computational tools to validate and refine directionality detection for writing systems.

Historical Context and Significance

Writing direction has evolved differently across civilizations, reflecting cultural practices and cognitive constraints. While modern Indo-European languages predominantly use left-to-right writing, many historical and contemporary writing systems run right-to-left or top-to-bottom. Being able to determine the correct direction of writing, and therefore of reading, would prove indispensable to historical linguistics and to the decipherment of ancient languages.

Research Methodology

Data Collection Process

The study analyzed texts from multiple language families, ensuring:

  • Representative samples across writing systems
  • Consistent sample sizes (maximum number of tokens from the respective corpora)
  • Clean, standardized text preprocessing

Key Innovation

This approach leverages two mathematical properties:

  1. Character Distribution Inequality (Gini coefficient)
    • Measures statistical dispersion in character usage
    • Reveals systemic constraints at word boundaries
    • Provides a quantitative measure of positional rules
  2. Character Randomness (Entropy)
    • Quantifies predictability of character sequences
    • Captures linguistic constraints in writing systems
    • Reflects underlying phonological patterns

Analyzing these two properties at word boundaries identified the writing direction correctly for all 13 tested languages (11 LTR, 2 RTL).
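Both measures are computed over the characters found at word boundaries, so the first step is to count them. The following is a minimal sketch of that counting step, not the project's implementation: the function name `boundary_counts`, the whitespace tokenization, and the filtering to alphabetic characters are all assumptions made for illustration. It works on characters in the order they are stored in the string.

```python
from collections import Counter


def boundary_counts(text):
    """Count the first and last letter of every whitespace-separated word.

    Hypothetical helper: tokenization and filtering choices are assumptions.
    """
    initial, final = Counter(), Counter()
    for word in text.split():
        # Keep only alphabetic characters so punctuation does not enter the counts.
        letters = [ch for ch in word.lower() if ch.isalpha()]
        if letters:
            initial[letters[0]] += 1
            final[letters[-1]] += 1
    return initial, final


initial, final = boundary_counts("The cat sat on the mat.")
print(initial)
print(final)
```

The two counters are the raw distributions from which a Gini coefficient and an entropy can then be derived for each position.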

Statistical Measures

Gini Coefficient

The Gini coefficient measures distribution inequality:

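$$
G = \frac{\sum_{i=1}^n\sum_{j=1}^n|x_i - x_j|}{2n\sum_{i=1}^nx_i}
$$

Here x_i is the frequency of the i-th character at a given word position and n is the number of distinct characters. The coefficient reveals crucial patterns: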

  • Higher values indicate uneven distribution
  • Computed separately for initial and final positions
  • Key indicator of character constraints
  • Reflects writing system optimization
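The formula above translates directly into a few lines of code. This is a minimal sketch that assumes the input is a list of non-negative character counts for one word position; the function name `gini` and the choice to return 0.0 for empty input are illustrative assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative counts, using the pairwise-difference formula."""
    values = list(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        # Assumption: treat an empty or all-zero distribution as perfectly even.
        return 0.0
    # Sum of |x_i - x_j| over all ordered pairs (i, j).
    abs_diffs = sum(abs(a - b) for a in values for b in values)
    return abs_diffs / (2 * n * total)


print(gini([1, 1, 1, 1]))  # evenly used characters
print(gini([0, 0, 0, 4]))  # one character dominates
```

The double loop is quadratic in the number of distinct characters, which is acceptable for alphabets of a few dozen symbols.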

Shannon Entropy

Entropy quantifies randomness in character usage:

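$$
H = -\sum_{i=1}^n p_i\log_2(p_i)
$$

Here p_i is the relative frequency of the i-th character at a given word position and n is the number of distinct characters. This measure provides essential insights: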

  • Higher values indicate more varied distribution
  • Measures character choice flexibility
  • Critical for direction determination
  • Captures linguistic constraints
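The entropy formula above can be sketched in the same way. The example assumes raw character counts as input and converts them to relative frequencies; the function name `shannon_entropy` is an illustrative assumption.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    values = [c for c in counts if c > 0]  # zero counts contribute nothing
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in values)


print(shannon_entropy([1, 1, 1, 1]))  # four equally likely characters
print(shannon_entropy([4]))           # a single possible character
```

Four equally likely characters give 2 bits, while a position that always holds the same character gives 0 bits.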

Pattern Analysis

Left-to-Right Scripts

LTR Characteristics

My analysis revealed clear patterns in LTR writing systems. Words begin with high entropy (varied characters) and low Gini coefficients (even distribution), showing flexibility in word-initial positions. Conversely, word endings show low entropy with high Gini coefficients, indicating restricted character sets and stronger morphological constraints.

Finnish exemplifies this pattern: its initial entropy is 3.8126 against a final entropy of 2.3642, a difference of 1.4484 and the largest in the sample. This stark difference provides a reliable signal for detecting LTR writing direction.

Right-to-Left Scripts

RTL Characteristics

RTL scripts invert these patterns. Word beginnings show low entropy and high Gini coefficients, reflecting strict constraints on initial characters. Word endings display high entropy with low Gini coefficients, allowing more varied character combinations.

Arabic demonstrates this clearly: its entropy rises from 3.5498 at word beginnings to 3.8864 at word endings. This reversed pattern, also seen in Hebrew, provides a clear statistical signature for RTL writing systems.
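The two patterns suggest a simple sign-based decision rule. The sketch below is one plausible reading of that rule, not necessarily the exact criterion used in the project: the function name `classify_direction` and the "undetermined" fallback for mixed signals are assumptions. The input values are the Finnish and Arabic figures reported in this article.

```python
def classify_direction(initial_entropy, final_entropy, initial_gini, final_gini):
    """Classify direction from the signs of the entropy and Gini differences.

    Assumed rule: LTR when entropy falls and Gini rises from word start to
    word end, RTL when both trends are reversed.
    """
    entropy_diff = initial_entropy - final_entropy
    gini_diff = initial_gini - final_gini
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    return "undetermined"  # the two measures disagree or show no difference


finnish = classify_direction(3.8126, 2.3642, 0.3359, 0.4985)
arabic = classify_direction(3.5498, 3.8864, 0.5121, 0.4185)
print(finnish, arabic)
```

Requiring both measures to agree is a conservative design choice: a text where they point in opposite directions is reported as undetermined rather than forced into a class.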

Results

Performance by Language

| Language | Initial Gini | Final Gini | Initial Entropy | Final Entropy | Gini Diff | Entropy Diff | Direction |
|---|---|---|---|---|---|---|---|
| Danish | 0.4391 | 0.4843 | 4.1322 | 3.3451 | -0.0452 | 0.7871 | LTR |
| Dutch | 0.4505 | 0.5055 | 3.9709 | 3.0675 | -0.0550 | 0.9034 | LTR |
| English | 0.3716 | 0.5193 | 3.9666 | 3.4086 | -0.1477 | 0.5581 | LTR |
| Finnish | 0.3359 | 0.4985 | 3.8126 | 2.3642 | -0.1626 | 1.4484 | LTR |
| French | 0.4012 | 0.6053 | 4.0665 | 2.9424 | -0.2041 | 1.1241 | LTR |
| German | 0.4223 | 0.5609 | 4.0246 | 2.9083 | -0.1385 | 1.1163 | LTR |
| Greek | 0.4911 | 0.5529 | 4.0549 | 3.0230 | -0.0618 | 1.0319 | LTR |
| Italian | 0.4742 | 0.5219 | 3.7092 | 2.4568 | -0.0477 | 1.2524 | LTR |
| Portuguese | 0.4183 | 0.5974 | 3.9604 | 2.6653 | -0.1790 | 1.2951 | LTR |
| Spanish | 0.4682 | 0.5281 | 3.7903 | 2.6029 | -0.0599 | 1.1874 | LTR |
| Swedish | 0.4063 | 0.5230 | 4.2511 | 3.3237 | -0.1167 | 0.9275 | LTR |
| Arabic | 0.5121 | 0.4185 | 3.5498 | 3.8864 | 0.0935 | -0.3366 | RTL |
| Hebrew | 0.5265 | 0.4938 | 3.3217 | 3.5849 | 0.0327 | -0.2633 | RTL |

Both difference columns are computed as the initial value minus the final value.

Performance by Language Family

| Family | Languages | Average Entropy Diff | Average Gini Diff | Accuracy |
|---|---|---|---|---|
| Germanic | Danish, Dutch, English, German, Swedish | 0.86 | -0.12 | 100% |
| Romance | French, Italian, Portuguese, Spanish | 1.21 | -0.13 | 100% |
| Semitic | Arabic, Hebrew | -0.30 | 0.06 | 100% |
| Finnic | Finnish | 1.45 | -0.16 | 100% |
| Hellenic | Greek | 1.03 | -0.06 | 100% |

Notable Patterns

Strongest Directional Signals
  1. Left-to-Right Languages
    • Finnish: Highest entropy difference (1.4484)
      • Exceptional clarity in directional signal
      • Strong morphological constraints
    • French: Largest Gini difference (-0.2041)
      • Clear positional character preferences
      • Robust statistical separation
    • Portuguese: Strong combined signal
      • High entropy difference (1.2951)
      • Significant Gini contrast (-0.1790)
  2. Right-to-Left Languages
    • Arabic: Clear reverse pattern
      • Negative entropy difference (-0.3366)
      • Positive Gini difference (0.0935)
      • Strong RTL characteristics
    • Hebrew: Consistent RTL signal
      • Aligned with Arabic patterns
      • Weaker but clear direction markers

Language Family Characteristics

  • Germanic Family
    • Average entropy difference: 0.86
    • Average Gini difference: -0.12
    • Key Features:
      • Consistent LTR pattern across languages
      • English shows moderate signal strength
      • Swedish and Danish display similar patterns
    • Notable Variations:
      • Dutch and German show larger entropy differences than English
      • Consistent family-level trends
  • Romance Family
    • Average entropy difference: 1.21
    • Average Gini difference: -0.13
    • Distinguishing Features:
      • Strongest signal among the multi-language families
      • French shows an exceptionally clear pattern
      • Italian and Spanish share characteristics
    • Pattern Consistency:
      • High uniformity across languages
      • Strong morphological markers
  • Semitic Family
    • Average entropy difference: -0.30
    • Average Gini difference: 0.06
    • Distinctive Traits:
      • Clear reversal of the pattern seen in the LTR languages
      • Arabic shows a stronger signal than Hebrew
      • Consistent RTL characteristics
    • Structural Implications:
      • Root-based morphology effects
      • Consistent writing traditions

Validation Results

  1. Cross-Validation
    • Reversed text analysis confirms detection method
    • Signal strength preserved in reverse
    • Consistent direction detection
  2. Statistical Significance
    • Clear separation between LTR and RTL patterns
    • No ambiguous or indeterminate cases
    • Robust across sample sizes
  3. Error Analysis
    • No false positives
    • No false negatives
    • Perfect classification accuracy
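The reversed-text check in item 1 can be illustrated on a toy input. This sketch is an assumption about how such a check might look, not the project's validation code: it reverses every word, recomputes the entropy difference, and confirms that the sign flips while the magnitude is preserved. The helper names and the ten-word sample are hypothetical.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy in bits of a Counter of character occurrences."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def entropy_diff(words):
    """Initial-position entropy minus final-position entropy."""
    initial = Counter(w[0] for w in words)
    final = Counter(w[-1] for w in words)
    return entropy(initial) - entropy(final)


words = "the cat sat on the mat and the dog ran".split()
reversed_words = [w[::-1] for w in words]  # simulate the opposite writing direction

forward = entropy_diff(words)
backward = entropy_diff(reversed_words)
print(round(forward, 4), round(backward, 4))
```

Because reversing each word swaps its first and last character, the two distributions trade places and the difference changes sign exactly, which is why signal strength is preserved in reverse.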

Discussion

Practical Applications

This research has direct applications in script analysis. The statistical patterns identified here can be used to detect writing direction automatically in unknown scripts by analyzing character distributions at word boundaries, which is particularly valuable for ancient texts where the direction is not immediately apparent. For ancient writing systems, these metrics provide objective evidence for directional hypotheses and can highlight potential transcription errors where character patterns deviate unexpectedly.
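Put together, automatic detection could look roughly like the sketch below. It is an illustrative pipeline under stated assumptions, not the published implementation: the function `detect_direction`, the whitespace tokenization, and the sign-based rule are all hypothetical, and word boundaries are assumed to be already marked in the input. A ten-word sample is far too small for a reliable verdict and only shows the mechanics.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy in bits of a list of positive counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def _gini(counts):
    """Gini coefficient of a non-empty list of positive counts."""
    n, total = len(counts), sum(counts)
    return sum(abs(a - b) for a in counts for b in counts) / (2 * n * total)


def detect_direction(text):
    """Return 'LTR', 'RTL', or 'undetermined' (assumed sign-based rule)."""
    words = text.split()
    if not words:
        return "undetermined"
    initial = list(Counter(w[0] for w in words).values())
    final = list(Counter(w[-1] for w in words).values())
    entropy_diff = _entropy(initial) - _entropy(final)
    gini_diff = _gini(initial) - _gini(final)
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    return "undetermined"


result = detect_direction("the cat sat on the mat and the dog ran")
print(result)  # LTR for this toy sample
```

In practice the input would be a large corpus, and an "undetermined" result would be a prompt to gather more text rather than a final answer.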

Future Work

Primary Research Directions

  1. Complex Scripts
    • Vertical writing systems
    • Mixed-direction texts
    • Hybrid writing systems
  2. Advanced Methods
    • Enhanced statistical measures
    • Pattern learning systems
  3. New Applications
    • Ancient script analysis
    • Unknown writing systems

Methodological Extensions

Future work will expand this research in several critical directions. I plan to analyze both smaller and larger text corpora across more languages to validate the patterns I have found, while incorporating additional statistical measures beyond Gini and entropy to capture subtler directional features. I aim to test the method’s robustness through cross-script validation, especially with scripts that mix directions or use unconventional layouts.

Source Code

Implementation available at: github.com/jhnwnstd/writing_direction

References

Ashraf, Md. I., & Sinha, S. (2018). The “handedness” of language: Directional symmetry breaking of sign usage in words. PLoS ONE, 13(1), e0190735. https://doi.org/10.1371/journal.pone.0190735

Citation

BibTeX citation:
@online{winstead2024,
  author = {Winstead, John},
  title = {Writing {Direction} {Detection}},
  date = {2024-11-15},
  url = {https://jhnwnstd.github.io/projects/writing-direction/},
  langid = {en}
}
For attribution, please cite this work as:
Winstead, J. (2024, November 15). Writing Direction Detection. https://jhnwnstd.github.io/projects/writing-direction/

© 2025, John Winstead
