Writing direction detection

Writing Systems
Information Theory
Computational Linguistics
Inferring writing direction from word-boundary character distributions using Gini and entropy.
Published November 15, 2024

Overview

Writing systems encode directional information in their sub-word statistics. The distribution of characters at word-initial and word-final positions differs in ways that reflect the reading direction. This project tests whether two summary statistics, the Gini coefficient and Shannon entropy, are sufficient to recover writing direction from a text alone, without prior knowledge of the language or script. The approach builds on Ashraf & Sinha (2018) and extends it computationally to a wider language sample.

Why direction matters

Writing direction varies across civilizations. Most scripts used for modern European languages run left to right. Many historical and contemporary writing systems run right to left, top to bottom, or in mixed orientations. Recovering the correct reading direction from text alone is useful in historical linguistics and in decipherment, where direction is often the first property that must be established before other analysis can begin.

Methodology

The method has three components.

Data

The corpus consists of texts from five language families: Germanic, Romance, Hellenic, Finnic, and Semitic. Each language's sample is matched in size to the smallest available corpus, so that differences in the entropy and Gini estimates are not driven by sample size.
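A minimal sketch of one way to match sample sizes, assuming each corpus is a list of word tokens and that matching means truncating every corpus to the token count of the smallest one. The function name `match_sample_sizes` and the truncation strategy are illustrative assumptions, not taken from the project's implementation.

```python
from typing import Dict, List


def match_sample_sizes(corpora: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Truncate every corpus to the token count of the smallest one.

    Assumption: keeping the first tokens is acceptable here; random
    sampling would be an equally plausible choice.
    """
    if not corpora:
        return {}
    smallest = min(len(tokens) for tokens in corpora.values())
    return {lang: tokens[:smallest] for lang, tokens in corpora.items()}


# Toy corpora, for illustration only.
corpora = {
    "english": "the cat sat on the mat again".split(),
    "danish": "en kat sad på måtten".split(),
}
matched = match_sample_sizes(corpora)
```

After matching, both toy corpora hold the same number of tokens, so any difference in the positional statistics cannot come from one language simply having more text.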

Two statistics at two positions

For each language, compute the Gini coefficient and Shannon entropy of character frequency separately at word-initial and word-final positions.
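The positional counts that feed both statistics can be sketched as follows. This assumes whitespace-tokenized words and treats the first and last code points of each token as the word-initial and word-final characters; the helper name `boundary_counts` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def boundary_counts(words: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count the characters seen at the first and last position of each word.

    Assumption: one code point is one character. Scripts with combining
    marks would need grapheme-aware segmentation instead.
    """
    initial: Counter = Counter()
    final: Counter = Counter()
    for word in words:
        if not word:
            continue  # skip empty tokens
        initial[word[0]] += 1
        final[word[-1]] += 1
    return initial, final


initial, final = boundary_counts("the cat sat on the mat".split())
```

Which end of a stored string counts as "initial" depends on how the text is encoded. The sketch takes each token's first and last characters as given.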

The Gini coefficient measures inequality of character usage:

G = \frac{\sum_{i=1}^n\sum_{j=1}^n|x_i - x_j|}{2n\sum_{i=1}^nx_i}

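Here x_i is the frequency of character i at the position and n is the number of characters counted. Higher values indicate that a few characters dominate the position. Lower values indicate an even spread.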
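A direct transcription of the Gini formula, as a sketch. It assumes the input is a list of per-character counts for one position, and returning 0.0 for empty or all-zero input is a convention chosen for this example.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient: sum of |x_i - x_j| over all pairs, over 2 * n * sum(x)."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0  # convention for this sketch: no data means no inequality
    abs_diff_sum = sum(abs(x - y) for x in counts for y in counts)
    return abs_diff_sum / (2 * n * total)


even = gini([5, 5, 5, 5])     # every character equally frequent
skewed = gini([17, 1, 1, 1])  # one character dominates the position
```

The double loop is quadratic in the number of characters, which is negligible for alphabet-sized inputs. Whether characters with zero count at a position are included in `counts` changes the value, so that choice needs to be made consistently for both positions.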

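Shannon entropy, measured in bits, quantifies how unpredictable the character at a position is:

H = -\sum_{i=1}^n p_i\log_2(p_i)

Here p_i is the relative frequency of character i at the position. Higher values indicate more varied character use. Lower values indicate constraint.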
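The entropy formula can be sketched in the same style, again assuming a list of per-character counts for one position. Zero counts are skipped, following the usual convention that 0 log 0 contributes nothing.

```python
import math
from typing import Sequence


def shannon_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy in bits of the distribution implied by `counts`."""
    total = sum(counts)
    if total == 0:
        return 0.0  # convention for this sketch: no data means zero entropy
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts
        if c > 0  # zero counts contribute nothing
    )


uniform = shannon_entropy([1, 1, 1, 1])  # four equally likely characters
peaked = shannon_entropy([97, 1, 1, 1])  # one character dominates
```

For n equally likely characters the entropy reaches its maximum of log2(n) bits, so four equally likely characters give exactly 2 bits.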

Direction inference

In a left-to-right script, word-initial positions admit more variation (higher entropy, lower Gini) than word-final positions, which carry morphological endings that constrain character choice. A right-to-left script inverts the pattern: the constrained position still corresponds to the phonological end of the word, but that end is now measured as the initial position, so the signs of both differences flip.

Direction is inferred from the signs of two differences:

  • Entropy difference: H_{\text{initial}} - H_{\text{final}}.
  • Gini difference: G_{\text{initial}} - G_{\text{final}}.

A positive entropy difference paired with a negative Gini difference indicates LTR. The reverse pair indicates RTL.
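The sign rule can be sketched as a small function. The name `infer_direction` and the "ambiguous" label for disagreeing or zero differences are assumptions made for this example; the text above only defines the LTR and RTL cases.

```python
def infer_direction(
    h_initial: float, h_final: float, g_initial: float, g_final: float
) -> str:
    """Classify direction from the signs of the entropy and Gini differences."""
    entropy_diff = h_initial - h_final
    gini_diff = g_initial - g_final
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    # Assumption: any other sign combination is left undecided.
    return "ambiguous"


# Illustrative values only, not measurements.
ltr_case = infer_direction(4.0, 3.0, 0.40, 0.55)
rtl_case = infer_direction(3.0, 4.0, 0.55, 0.40)
```

Requiring both statistics to agree makes the rule conservative: a text where entropy and Gini point in different directions is reported as undecided rather than forced into one class.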

Results

Per-language results

| Language | Initial Gini | Final Gini | Initial Entropy (bits) | Final Entropy (bits) | Gini Diff | Entropy Diff | Direction |
|---|---|---|---|---|---|---|---|
| Danish | 0.4391 | 0.4843 | 4.1322 | 3.3451 | -0.0452 | 0.7871 | LTR |
| Dutch | 0.4505 | 0.5055 | 3.9709 | 3.0675 | -0.0550 | 0.9034 | LTR |
| English | 0.3716 | 0.5193 | 3.9666 | 3.4086 | -0.1477 | 0.5581 | LTR |
| Finnish | 0.3359 | 0.4985 | 3.8126 | 2.3642 | -0.1626 | 1.4484 | LTR |
| French | 0.4012 | 0.6053 | 4.0665 | 2.9424 | -0.2041 | 1.1241 | LTR |
| German | 0.4223 | 0.5609 | 4.0246 | 2.9083 | -0.1385 | 1.1163 | LTR |
| Greek | 0.4911 | 0.5529 | 4.0549 | 3.0230 | -0.0618 | 1.0319 | LTR |
| Italian | 0.4742 | 0.5219 | 3.7092 | 2.4568 | -0.0477 | 1.2524 | LTR |
| Portuguese | 0.4183 | 0.5974 | 3.9604 | 2.6653 | -0.1790 | 1.2951 | LTR |
| Spanish | 0.4682 | 0.5281 | 3.7903 | 2.6029 | -0.0599 | 1.1874 | LTR |
| Swedish | 0.4063 | 0.5230 | 4.2511 | 3.3237 | -0.1167 | 0.9275 | LTR |
| Arabic | 0.5121 | 0.4185 | 3.5498 | 3.8864 | 0.0935 | -0.3366 | RTL |
| Hebrew | 0.5265 | 0.4938 | 3.3217 | 3.5849 | 0.0327 | -0.2633 | RTL |
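As a quick consistency check, the sign rule can be applied to the Gini and entropy differences reported in the table. The rows below are copied from the table; the `classify` helper and its "ambiguous" fallback are assumptions made for this sketch.

```python
# (language, Gini diff, entropy diff, reported direction), copied from the table.
ROWS = [
    ("Danish", -0.0452, 0.7871, "LTR"),
    ("Dutch", -0.0550, 0.9034, "LTR"),
    ("English", -0.1477, 0.5581, "LTR"),
    ("Finnish", -0.1626, 1.4484, "LTR"),
    ("French", -0.2041, 1.1241, "LTR"),
    ("German", -0.1385, 1.1163, "LTR"),
    ("Greek", -0.0618, 1.0319, "LTR"),
    ("Italian", -0.0477, 1.2524, "LTR"),
    ("Portuguese", -0.1790, 1.2951, "LTR"),
    ("Spanish", -0.0599, 1.1874, "LTR"),
    ("Swedish", -0.1167, 0.9275, "LTR"),
    ("Arabic", 0.0935, -0.3366, "RTL"),
    ("Hebrew", 0.0327, -0.2633, "RTL"),
]


def classify(gini_diff: float, entropy_diff: float) -> str:
    """Apply the sign rule; disagreeing signs are left undecided (assumption)."""
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    return "ambiguous"


# Languages whose rule-based label differs from the reported direction.
mismatches = [
    name for name, g_diff, h_diff, reported in ROWS
    if classify(g_diff, h_diff) != reported
]
```

An empty `mismatches` list means every reported direction follows from the signs of the two differences in its row.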

Family-level summary

| Family | Languages | Average Entropy Diff | Average Gini Diff | Accuracy |
|---|---|---|---|---|
| Germanic | Danish, Dutch, English, German, Swedish | 0.86 | -0.10 | 100% |
| Romance | French, Italian, Portuguese, Spanish | 1.21 | -0.12 | 100% |
| Semitic | Arabic, Hebrew | -0.30 | 0.06 | 100% |
| Finnic | Finnish | 1.45 | -0.16 | 100% |
| Hellenic | Greek | 1.03 | -0.06 | 100% |

Strongest signals

The strongest LTR signals are Finnish (entropy difference 1.45) and French (Gini difference -0.20). Both languages have rich final-position morphology that constrains the word-final character distribution. Portuguese also shows a strong combined signal (entropy difference 1.30, Gini difference -0.18).

The two RTL languages in the sample, Arabic (entropy difference -0.34, Gini difference 0.09) and Hebrew (entropy difference -0.26, Gini difference 0.03), both follow the predicted reverse pattern, with Arabic showing the stronger signal.

Family characteristics

Germanic (Danish, Dutch, English, German, Swedish). Mean entropy difference 0.86, mean Gini difference -0.10. LTR throughout. English has the smallest entropy difference in the family (0.56) but the largest Gini difference (-0.15). German has the largest entropy difference (1.12).

Romance (French, Italian, Portuguese, Spanish). Mean entropy difference 1.21, mean Gini difference -0.12. The strongest signal among the families with more than one language; only Finnish, the sole Finnic language, is stronger on both measures.

Semitic (Arabic, Hebrew). Mean entropy difference -0.30, mean Gini difference 0.06. Both languages reverse the pattern of the LTR languages.

Validation

Reversing each text and re-running the classifier flips the predicted direction in every case, which indicates that the signal is positional rather than an artifact of overall character frequency or sample size. All 13 languages are classified correctly: there are no ambiguous classifications, no false positives, and no false negatives.
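The reversal check can be sketched end to end on a toy word list. Everything here is an assumption made for illustration: the `direction` helper, the choice to compute Gini over the union of characters seen at either position, and the toy vocabulary itself.

```python
import math
from collections import Counter
from typing import List


def direction(words: List[str]) -> str:
    """Classify a word list from its word-initial and word-final statistics."""
    initial = Counter(w[0] for w in words if w)
    final = Counter(w[-1] for w in words if w)
    # Assumption: both positions share one alphabet, so a character that
    # never occurs at a position counts as zero there.
    alphabet = sorted(set(initial) | set(final))
    if not alphabet:
        return "ambiguous"

    def entropy(counter: Counter) -> float:
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    def gini(counter: Counter) -> float:
        xs = [counter[ch] for ch in alphabet]
        return sum(abs(a - b) for a in xs for b in xs) / (2 * len(xs) * sum(xs))

    entropy_diff = entropy(initial) - entropy(final)
    gini_diff = gini(initial) - gini(final)
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    return "ambiguous"


# Toy vocabulary: varied first letters, mostly the same last letter.
words = ["cats", "dogs", "birds", "trees", "maps", "hen", "fox"]
reversed_words = [w[::-1] for w in words]

original_label = direction(words)
reversed_label = direction(reversed_words)
```

In this sketch, reversing a token swaps its first and last characters, so both differences change sign and the predicted label flips with them.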

Discussion

For unknown scripts where direction is uncertain, the Gini-and-entropy approach gives an objective signal that does not depend on language identification or character semantics. The same metrics can flag candidate transcription errors when a portion of text deviates from the rest of its family.

The method has known limits. Vertical scripts and mixed-direction texts fall outside a binary LTR/RTL classification. Scripts without clear word boundaries, including continuous-script writing and abjads with limited segmentation, need a separate tokenization step before the metrics apply.

Future work

  • Vertical scripts and mixed-direction texts.
  • Continuous-script writing systems.
  • Smaller and larger corpora to bound the sample-size sensitivity of the entropy and Gini estimates.
  • Additional position-aware statistics beyond Gini and entropy.

Source code

Implementation: github.com/jhnwnstd/writing_direction.

References

Ashraf, Md. I., & Sinha, S. (2018). The “handedness” of language: Directional symmetry breaking of sign usage in words. PLoS ONE, 13(1), e0190735. https://doi.org/10.1371/journal.pone.0190735

Citation

BibTeX citation:
@online{winstead2024,
  author = {Winstead, John},
  title = {Writing Direction Detection},
  date = {2024-11-15},
  url = {https://jhnwnstd.github.io/research/writing-direction/},
  langid = {en}
}

For attribution, please cite this work as:
Winstead, J. (2024, November 15). Writing direction detection. https://jhnwnstd.github.io/research/writing-direction/

© 2026 John Winstead
