Writing direction detection

Writing Systems

Information Theory

Computational Linguistics

Inferring writing direction from word-boundary character distributions using Gini and entropy.

Published

November 15, 2024

Overview

Writing systems encode directional information in their sub-word statistics. The distribution of characters at word-initial and word-final positions differs in ways that reflect the reading direction. This project tests whether two summary statistics, the Gini coefficient and Shannon entropy, are sufficient to recover writing direction from a text alone, without prior knowledge of the language or script. The approach builds on Ashraf & Sinha (2018) and extends it computationally to a wider language sample.

Why direction matters

Writing direction varies across civilizations. Most modern Indo-European scripts run left to right. Many historical and contemporary writing systems run right to left, top to bottom, or in mixed orientations. Recovering the correct reading direction from text alone is useful in historical linguistics and in decipherment, where direction is often the first property that must be established before other analysis can begin.

Methodology

The method has three components.

Data

The corpus consists of texts from multiple language families. Sample sizes are matched to the smallest available corpus, so that differences in the entropy and Gini estimates are not driven by sample size.
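The matching step can be sketched as follows. The source does not say whether matching is done by random sampling or truncation, or whether size is counted in words or characters, so this sketch assumes random sampling of word tokens. The function name `match_sample_sizes` and the toy corpora are hypothetical.

```python
import random


def match_sample_sizes(corpora, seed=0):
    """Downsample every word list to the size of the smallest one.

    Assumption: sizes are counted in word tokens and the reduction is a
    random sample without replacement. The project may do this differently.
    """
    rng = random.Random(seed)
    target = min(len(words) for words in corpora.values())
    return {lang: rng.sample(words, target) for lang, words in corpora.items()}


# Toy word lists, for illustration only.
corpora = {
    "lang_a": ["alpha", "beta", "gamma", "delta", "epsilon"],
    "lang_b": ["uno", "dos", "tres"],
}
matched = match_sample_sizes(corpora)
print({lang: len(words) for lang, words in matched.items()})
```

A fixed seed keeps the sample reproducible between runs.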

Two statistics at two positions

For each language, compute the Gini coefficient and Shannon entropy of character frequency separately at word-initial and word-final positions.
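A minimal sketch of the positional counting, assuming whitespace tokenization with no case folding or punctuation stripping. The project's own preprocessing may differ, and the function name `boundary_counts` is hypothetical.

```python
from collections import Counter


def boundary_counts(text):
    """Count the characters found at the first and last position of each token.

    Assumption: tokens are whitespace-separated. A single-character token
    contributes to both the initial and the final counts.
    """
    words = text.split()
    initial = Counter(word[0] for word in words)
    final = Counter(word[-1] for word in words)
    return initial, final


initial, final = boundary_counts("the cat sat on the mat")
print(initial["t"])  # 2: "the" occurs twice
print(final["t"])    # 3: "cat", "sat", "mat"
```

The two counters are the inputs to the Gini and entropy calculations.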

The Gini coefficient measures inequality of character usage:

G = \frac{\sum_{i=1}^n\sum_{j=1}^n|x_i - x_j|}{2n\sum_{i=1}^nx_i}

Here x_i is the count of the i-th character at the position and n is the number of characters. Higher values indicate that a few characters dominate the position. Lower values indicate an even spread.
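The formula translates directly into code. This is a sketch, not the project's implementation. One choice is left to the caller: whether characters with a zero count at the position are included in the list. That choice changes n and therefore G, and the source does not specify it.

```python
def gini(counts):
    """Gini coefficient of a list of counts, using the pairwise-difference formula.

    Returns 0.0 for an empty list or an all-zero list (a convention assumed here).
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    abs_diff_sum = sum(abs(a - b) for a in counts for b in counts)
    return abs_diff_sum / (2 * n * total)


print(gini([5, 5, 5, 5]))    # 0.0: perfectly even usage
print(gini([0, 0, 0, 20]))   # 0.75: one character takes every occurrence
```

With n characters the maximum value is (n - 1) / n, so the second example reaches the upper bound for n = 4.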

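Shannon entropy measures how unpredictable the character at a position is, in bits:

H = -\sum_{i=1}^n p_i\log_2(p_i)

Here p_i is the relative frequency of the i-th character at the position. Higher values indicate more varied character use. Lower values indicate constraint.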
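A matching sketch for the entropy, again illustrative rather than the project's own code. Zero counts are skipped, following the usual convention that 0 log 0 = 0.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution implied by a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # p * log2(1 / p) equals -p * log2(p), written with total / c to stay non-negative.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


print(shannon_entropy([1, 1, 1, 1]))  # 2.0: four equally likely characters
print(shannon_entropy([4, 0, 0, 0]))  # 0.0: a single character, no uncertainty
```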

Direction inference

In a left-to-right script, word-initial positions admit more variation (higher entropy, lower Gini) than word-final positions, which carry morphological endings that constrain character choice. A right-to-left script inverts the pattern: the constrained distribution shows up at the measured word-initial position, because in such a script that position corresponds to the phonological end of the word.

Direction is inferred from the signs of two differences:

Entropy difference: H_{\text{initial}} - H_{\text{final}}.
Gini difference: G_{\text{initial}} - G_{\text{final}}.

A positive entropy difference paired with a negative Gini difference indicates LTR. The reverse pair indicates RTL.
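The decision rule can be written as a small function. The example inputs are the English and Arabic rows of the per-language results. The "ambiguous" label for disagreeing or zero differences is an assumption of this sketch, since the text only defines the two agreeing cases. The function name `infer_direction` is hypothetical.

```python
def infer_direction(h_initial, h_final, g_initial, g_final):
    """Classify direction from the signs of the entropy and Gini differences."""
    entropy_diff = h_initial - h_final
    gini_diff = g_initial - g_final
    if entropy_diff > 0 and gini_diff < 0:
        return "LTR"
    if entropy_diff < 0 and gini_diff > 0:
        return "RTL"
    # Assumed fallback: the two statistics disagree or a difference is zero.
    return "ambiguous"


print(infer_direction(3.9666, 3.4086, 0.3716, 0.5193))  # English row: LTR
print(infer_direction(3.5498, 3.8864, 0.5121, 0.4185))  # Arabic row: RTL
```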

Results

Per-language results

Language	Initial Gini	Final Gini	Initial Entropy	Final Entropy	Gini Diff	Entropy Diff	Direction
Danish	0.4391	0.4843	4.1322	3.3451	-0.0452	0.7871	LTR
Dutch	0.4505	0.5055	3.9709	3.0675	-0.0550	0.9034	LTR
English	0.3716	0.5193	3.9666	3.4086	-0.1477	0.5581	LTR
Finnish	0.3359	0.4985	3.8126	2.3642	-0.1626	1.4484	LTR
French	0.4012	0.6053	4.0665	2.9424	-0.2041	1.1241	LTR
German	0.4223	0.5609	4.0246	2.9083	-0.1385	1.1163	LTR
Greek	0.4911	0.5529	4.0549	3.0230	-0.0618	1.0319	LTR
Italian	0.4742	0.5219	3.7092	2.4568	-0.0477	1.2524	LTR
Portuguese	0.4183	0.5974	3.9604	2.6653	-0.1790	1.2951	LTR
Spanish	0.4682	0.5281	3.7903	2.6029	-0.0599	1.1874	LTR
Swedish	0.4063	0.5230	4.2511	3.3237	-0.1167	0.9275	LTR
Arabic	0.5121	0.4185	3.5498	3.8864	0.0935	-0.3366	RTL
Hebrew	0.5265	0.4938	3.3217	3.5849	0.0327	-0.2633	RTL

Family-level summary

Family	Languages	Average Entropy Diff	Average Gini Diff	Accuracy
Germanic	Danish, Dutch, English, German, Swedish	0.86	-0.10	100%
Romance	French, Italian, Portuguese, Spanish	1.21	-0.12	100%
Semitic	Arabic, Hebrew	-0.30	0.06	100%
Finnic	Finnish	1.45	-0.16	100%
Hellenic	Greek	1.03	-0.06	100%

Strongest signals

The strongest LTR signals are Finnish (entropy difference 1.45) and French (Gini difference -0.20). Both languages have rich final-position morphology that constrains the word-final character distribution. Portuguese also shows a strong combined signal (entropy difference 1.30, Gini difference -0.18).

Both RTL languages in the sample follow the predicted reverse pattern: Arabic (entropy difference -0.34, Gini difference 0.09) and Hebrew (entropy difference -0.26, Gini difference 0.03). The Arabic signal is stronger than the Hebrew one on both measures.

Family characteristics

Germanic (Danish, Dutch, English, German, Swedish). Mean entropy difference 0.86, mean Gini difference -0.10. LTR throughout. English shows the smallest entropy difference in the family (0.56), although its Gini difference (-0.15) is the largest. German shows the largest entropy difference (1.12).

Romance (French, Italian, Portuguese, Spanish). Mean entropy difference 1.21, mean Gini difference -0.12. The strongest signal among the multi-language families in the sample. Only Finnish, the sole Finnic language, is stronger.

Semitic (Arabic, Hebrew). Mean entropy difference -0.30, mean Gini difference 0.06. Both languages reverse the pattern seen in every LTR language in the sample, including Finnish, which is not Indo-European.

Validation

Reversing each text and re-running the classifier flips the predicted direction in every case. Reversal leaves the overall character frequencies and the sample size unchanged, so the signal is positional rather than an artifact of either. All 13 languages are classified correctly, with no ambiguous cases.
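The reversal check can be reproduced end to end on a toy example. The text below is constructed for illustration and is not from the project's data, and the function name `classify` is hypothetical. Two further assumptions are made: whitespace tokenization, and a Gini computed over the union of characters seen at either position, so that both positions share the same n.

```python
import math
from collections import Counter


def gini(counts):
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        return 0.0
    return sum(abs(a - b) for a in counts for b in counts) / (2 * n * total)


def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def classify(text):
    """Infer direction from word-initial and word-final character counts."""
    words = text.split()
    initial = Counter(w[0] for w in words)
    final = Counter(w[-1] for w in words)
    # Assumption: both positions are scored over one shared alphabet.
    alphabet = sorted(set(initial) | set(final))
    init_counts = [initial[ch] for ch in alphabet]
    final_counts = [final[ch] for ch in alphabet]
    h_diff = entropy(init_counts) - entropy(final_counts)
    g_diff = gini(init_counts) - gini(final_counts)
    if h_diff > 0 and g_diff < 0:
        return "LTR"
    if h_diff < 0 and g_diff > 0:
        return "RTL"
    return "ambiguous"


# Toy text: varied first letters, a single shared ending.
toy = "bats cats dogs eels fans guns"
print(classify(toy))        # LTR
print(classify(toy[::-1]))  # RTL: reversing swaps the two positions
```

Reversing the string swaps the initial and final counters, so both differences change sign.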

Discussion

For unknown scripts where direction is uncertain, the Gini-and-entropy approach gives an objective signal that does not depend on language identification or character semantics. The same metrics can flag candidate transcription errors when a portion of text deviates from the rest of its family.

The method has known limits. Vertical scripts and mixed-direction texts fall outside a binary LTR/RTL classification. Scripts without clear word boundaries, including continuous-script writing and abjads with limited segmentation, need a separate tokenization step before the metrics apply.

Future work

Vertical scripts and mixed-direction texts.
Continuous-script writing systems.
Smaller and larger corpora to bound the sample-size sensitivity of the entropy and Gini estimates.
Additional position-aware statistics beyond Gini and entropy.

Source code

Implementation: github.com/jhnwnstd/writing_direction.

References

Ashraf, Md. I., & Sinha, S. (2018). The “handedness” of language: Directional symmetry breaking of sign usage in words. PLoS ONE, 13(1), e0190735. https://doi.org/10.1371/journal.pone.0190735

Citation

BibTeX citation:

@online{winstead2024,
  author = {Winstead, John},
  title = {Writing Direction Detection},
  date = {2024-11-15},
  url = {https://jhnwnstd.github.io/research/writing-direction/},
  langid = {en}
}

For attribution, please cite this work as:

Winstead, J. (2024, November 15). Writing direction detection. https://jhnwnstd.github.io/research/writing-direction/