Suxotin’s vowel identification algorithm

Writing Systems

Computational Linguistics

Automatic vowel detection in alphabetic writing systems from adjacency statistics alone.

Published

September 1, 2024

Background

Suxotin’s algorithm identifies vowels in alphabetic text from letter-adjacency statistics alone, without prior knowledge of the language or writing system (Apresjan, 1973; Guy, 1991). The premise is that vowels appear next to a more diverse set of characters than consonants do: consonants tend to have vowels as neighbors, while vowels neighbor both consonants and other vowels. The algorithm exploits this asymmetry. It picks the character with the highest adjacency sum, marks it as a vowel, discounts the remaining characters’ adjacencies to every identified vowel, and repeats until every character is classified.

This project tests the algorithm on seven European languages (German, French, Spanish, Italian, Dutch, Greek, English) and on two English corpora (the Sherlock Holmes texts and the NLTK Gutenberg corpus). The implementation adds a threshold-based reclassification step that improves recall on languages with extensive diacritics.

Methodology

Corpus preparation

Each corpus is preprocessed identically:

Convert text to lowercase.
Strip non-alphabetic characters while keeping diacritics.
Standardize special characters and diacritical marks.
Retain spacing so adjacency is computed within the original word boundaries.
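
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function name `preprocess` is hypothetical, "standardize diacritical marks" is assumed to mean Unicode NFC composition, and normalization is applied before stripping so that an accented letter is a single code point when the filter runs.

```python
import re
import unicodedata


def preprocess(text: str) -> list[str]:
    """Lowercase, compose diacritics, keep letters only, split on spaces."""
    # Assumption: NFC turns "e" + combining acute into the single letter "é",
    # so accented vowels survive as one character each.
    text = unicodedata.normalize("NFC", text.lower())
    # Replace runs of non-letters (punctuation, digits, underscores,
    # whitespace) with a single space; Unicode letters are kept.
    text = re.sub(r"[\W\d_]+", " ", text)
    # Splitting on whitespace preserves word boundaries for adjacency counts.
    return text.split()


print(preprocess("Café, à 3 heures!"))
```

Returning a list of words rather than one long string keeps adjacency from being counted across word boundaries in later steps.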

Adjacency matrix

I build a frequency matrix A in which a_{ij} is the count of character i adjacent to character j. The row sum s_i = \sum_j a_{ij} is character i’s total adjacency frequency. Row sums are normalized against overall character frequency so high-frequency characters do not dominate by mere abundance.
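
As an illustration, the matrix and its row sums could be built as follows. This is a sketch under stated assumptions: `adjacency_matrix` is a hypothetical name, adjacency is counted symmetrically (each neighboring pair increments both cells), doubled letters are skipped so the diagonal stays zero, and the frequency normalization is left out because its exact form is not given here.

```python
def adjacency_matrix(words):
    """Return (alphabet, matrix) where matrix[i][j] counts how often
    alphabet[i] and alphabet[j] are neighbors within a word."""
    alphabet = sorted({ch for word in words for ch in word})
    index = {ch: k for k, ch in enumerate(alphabet)}
    n = len(alphabet)
    matrix = [[0] * n for _ in range(n)]
    for word in words:
        for left, right in zip(word, word[1:]):
            if left == right:
                continue  # assumption: doubled letters are not counted
            i, j = index[left], index[right]
            matrix[i][j] += 1
            matrix[j][i] += 1
    return alphabet, matrix


alphabet, A = adjacency_matrix(["banana", "panama"])
# Row sum s_i: total adjacency frequency of each character.
row_sums = {ch: sum(A[k]) for k, ch in enumerate(alphabet)}
print(row_sums)  # {'a': 10, 'b': 1, 'm': 2, 'n': 6, 'p': 1}
```

Even on two words, the vowel has by far the largest row sum, which is the asymmetry the classification step relies on.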

Iterative classification

The classification has three phases.

First, identify the character with the highest adjacency sum and mark it as a vowel. Second, for each newly identified vowel, adjust the adjacency sums of the remaining characters:

\text{new\_sum}_i = \text{current\_sum}_i - 2(a_{ij} + a_{ji})

where j is the newly identified vowel and a_{ij} is the adjacency count between characters i and j. The factor of 2 accounts for both left and right adjacency. Third, apply threshold-based reclassification:

\text{threshold} = \min(\text{vowel\_sums}) \times 2

The threshold multiplier of 2 was set empirically across the test corpora. The procedure continues until every character is classified and all sums are fully adjusted.
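
The first two phases can be sketched as a greedy loop. This is a hedged illustration of the classic procedure rather than the project's implementation: the names `symmetric_counts` and `classify` are hypothetical, the symmetric count `sym[i][j]` is read as the a_{ij} + a_{ji} term in the update rule, the loop is assumed to stop once no remaining character has a positive sum, and the threshold-based reclassification is omitted because the rule for which characters it moves is not spelled out above.

```python
from collections import Counter, defaultdict


def symmetric_counts(words):
    """sym[i][j]: how often i and j are neighbors, in either order."""
    sym = defaultdict(Counter)
    for word in words:
        for left, right in zip(word, word[1:]):
            if left != right:  # assumption: doubled letters are skipped
                sym[left][right] += 1
                sym[right][left] += 1
    return sym


def classify(sym):
    """Greedy loop; returns (vowels in pick order, sorted consonants)."""
    sums = {ch: sum(row.values()) for ch, row in sym.items()}
    remaining = set(sums)
    vowels = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=sums.get)
        if sums[best] <= 0:
            break  # assumed stopping rule: no positive sum is left
        vowels.append(best)
        remaining.remove(best)
        for ch in remaining:
            # Discount each remaining character by twice its adjacency
            # count with the newly identified vowel.
            sums[ch] -= 2 * sym[ch][best]
    return vowels, sorted(remaining)


toy_words = ["pat", "pit", "bat", "bit", "tab", "tip", "ab"]
vowels, consonants = classify(symmetric_counts(toy_words))
print(vowels, consonants)  # ['a', 'i'] ['b', 'p', 't']
```

On this toy vocabulary, "t" starts with the same sum as "i" (6), but the discount after picking "a" drops it to 0, so "i" is chosen next and "t" ends up a consonant. Very small inputs can still mislabel frequent consonants, so the toy list is illustrative only.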

Results

Cross-linguistic performance

The algorithm was evaluated on seven languages with distinct orthographic patterns and character sets:

Language	True Positives (TP)	False Positives (FP)	False Negatives (FN)	Precision	Recall	F1 Score
German	12	1	0	0.923	1.000	0.960
French	17	1	1	0.944	0.944	0.944
Spanish	11	2	0	0.846	1.000	0.917
Italian	12	2	4	0.857	0.750	0.800
Dutch	12	0	1	1.000	0.923	0.960
Greek	20	3	2	0.870	0.909	0.889
English	9	0	0	1.000	1.000	1.000

Precision ranges from 0.846 (Spanish) to 1.000 (Dutch and English). Recall is at or above 0.90 for every language except Italian (0.75). F1 ranges from 0.80 (Italian) to 1.00 (English), with German and Dutch at 0.96, French at 0.94, Spanish at 0.92, and Greek at 0.89. The Italian recall gap is concentrated in vowels with diacritics that are treated as distinct characters from their unaccented forms; this is a preprocessing decision, not a property of the algorithm.
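
The precision, recall, and F1 columns follow directly from the TP, FP, and FN counts in the table. The short sketch below recomputes them using the standard definitions; the helper name `prf` is hypothetical, and the counts are copied from the table above.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (TP, FP, FN) per language, as reported in the table.
counts = {
    "German": (12, 1, 0),
    "French": (17, 1, 1),
    "Spanish": (11, 2, 0),
    "Italian": (12, 2, 4),
    "Dutch": (12, 0, 1),
    "Greek": (20, 3, 2),
    "English": (9, 0, 0),
}

for language, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    print(f"{language}\t{p:.3f}\t{r:.3f}\t{f1:.3f}")
```

For Italian, for example, 12 true positives against 2 false positives and 4 false negatives give precision 12/14 ≈ 0.857, recall 12/16 = 0.750, and F1 = 0.800.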

English corpora

The algorithm scores perfectly on both English corpora, including correct classification of accented characters that appear in loanwords:

Text Corpus	Classified Vowels	Classified Consonants
Sherlock Holmes Text	['a', 'e', 'i', 'o', 'u', 'à', 'â', 'æ', 'è', 'é', 'œ']	['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']
NLTK Gutenberg Corpus	['a', 'e', 'i', 'o', 'u', 'æ', 'è', 'é', 'î']	['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']

Text Corpus	Precision	Recall	F1 Score
Sherlock Holmes Text	1.0	1.0	1.0
NLTK Gutenberg Corpus	1.0	1.0	1.0

Accented vowels (à, â, æ, è, é, œ, î) are classified correctly. No consonants are misclassified.

Conclusion

The threshold-based variant of Suxotin’s algorithm identifies vowels in all seven languages tested, with F1 between 0.80 and 1.00. Errors are concentrated in languages where diacritic vowels are treated as orthographically distinct from their unaccented variants. This is a property of the preprocessing decision, not of the adjacency-diversity heuristic.

The result is consistent with the algorithm’s original premise. Vowel and consonant adjacency asymmetries are stable enough across alphabetic systems to support automatic vowel detection without language-specific information.

Source code

Implementation: github.com/jhnwnstd/suxotin.

References

Apresjan, J. D. (1973). Decipherment Models. In Principles and Methods in Contemporary Structural Linguistics (pp. 135–166). De Gruyter. https://doi.org/10.1515/9783110877908-010

Guy, J. B. M. (1991). Vowel Identification: an Old (But Good) Algorithm. Cryptologia, 15(3), 258–262. https://doi.org/10.1080/0161-119191865920

Citation

BibTeX citation:

@online{winstead2024,
  author = {Winstead, John},
  title = {Suxotin’s Vowel Identification Algorithm},
  date = {2024-09-01},
  url = {https://jhnwnstd.github.io/research/vowel-identification/},
  langid = {en}
}

For attribution, please cite this work as:

Winstead, J. (2024, September 1). Suxotin’s vowel identification algorithm. https://jhnwnstd.github.io/research/vowel-identification/