Suxotin’s vowel identification algorithm
Background
Suxotin’s algorithm identifies vowels in alphabetic text from letter-adjacency statistics alone, without prior knowledge of the language or writing system (Apresjan, 1973; Guy, 1991). The premise is that vowels appear next to a more diverse set of characters than consonants do, because consonants tend to take vowel neighbors while vowels take both. The algorithm exploits this asymmetry: it picks the character with the highest adjacency diversity, marks it as a vowel, discounts the neighbors of every identified vowel, and continues until every character is classified.
This project tests the algorithm on seven European languages (German, French, Spanish, Italian, Dutch, Greek, English) and on two English corpora (the Sherlock Holmes texts and the NLTK Gutenberg corpus). The implementation adds a threshold-based reclassification step that improves recall on languages with extensive diacritics.
Methodology
Corpus preparation
Each corpus is preprocessed identically:
- Convert text to lowercase.
- Strip non-alphabetic characters while keeping diacritics.
- Standardize special characters and diacritical marks.
- Retain spacing so adjacency is computed within the original word boundaries.
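The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess` is hypothetical, and reading "standardize" as Unicode NFC normalization is an assumption.

```python
import unicodedata


def preprocess(text: str) -> str:
    """Lowercase, normalize to NFC, keep letters, and preserve word boundaries.

    Hypothetical helper; NFC normalization is an assumed reading of
    "standardize special characters and diacritical marks".
    """
    # NFC composes a base letter and a combining mark into one code point
    # where a precomposed form exists, so "e" + U+0301 becomes "é".
    text = unicodedata.normalize("NFC", text.lower())
    # Replace every non-letter with a space so word boundaries survive.
    kept = "".join(ch if ch.isalpha() else " " for ch in text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(kept.split())


print(preprocess("Café, DÉJÀ vu! 42"))
```

Because `str.isalpha` is true for accented letters, diacritics survive the filter while digits and punctuation become word breaks.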
Adjacency matrix
I build a frequency matrix A in which a_{ij} is the count of character i adjacent to character j. The row sum s_i = \sum_j a_{ij} is character i’s total adjacency frequency. Row sums are normalized against overall character frequency so high-frequency characters do not dominate by mere abundance.
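A sketch of the counting step, under two assumptions that the description above does not settle: counts are stored symmetrically (each adjacent pair increments both a_{ij} and a_{ji}), and doubled letters are skipped so the diagonal stays at zero. The names `adjacency_counts` and `row_sums` are hypothetical, and the frequency normalization is left out because its exact form is not given here.

```python
from collections import Counter


def adjacency_counts(text: str) -> Counter:
    """Count adjacent letter pairs within words, in both directions."""
    counts = Counter()
    for word in text.split():
        # zip(word, word[1:]) yields each pair of neighboring characters.
        for left, right in zip(word, word[1:]):
            if left == right:
                continue  # assumption: doubled letters are not counted
            counts[(left, right)] += 1
            counts[(right, left)] += 1
    return counts


def row_sums(counts: Counter) -> Counter:
    """Sum each character's adjacency counts (s_i = sum over j of a_ij)."""
    sums = Counter()
    for (i, _j), n in counts.items():
        sums[i] += n
    return sums


counts = adjacency_counts("banana nab")
print(row_sums(counts).most_common())
```

Splitting on spaces before pairing keeps adjacency inside word boundaries, so the last letter of one word is never counted as a neighbor of the first letter of the next.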
Iterative classification
The classification has three phases.
First, identify the character with the highest adjacency sum and mark it as a vowel. Second, for each newly identified vowel j, adjust the adjacency sum of every remaining character i:
\text{new\_sum}_i = \text{current\_sum}_i - 2(a_{ij} + a_{ji})
where a_{ij} is the adjacency count between the remaining character i and the vowel j. The factor of 2 accounts for both left and right adjacency. Third, apply threshold-based reclassification:
\text{threshold} = \min(\text{vowel\_sums}) \times 2
The factor of 2 was set empirically across the test corpora. The procedure continues until every character is classified and all sums are fully adjusted.
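For orientation, here is a sketch of the textbook Suxotin loop rather than the threshold variant used in this project. It assumes symmetric counts c_{ij} with a zero diagonal, subtracts 2c_{ij} for each new vowel j, and stops when no remaining sum is positive. The threshold step is omitted because the description above does not say how sums are compared against it. The function name `suxotin` and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict


def suxotin(words):
    """Textbook Suxotin loop on symmetric adjacency counts (a sketch)."""
    # c[i][j] = number of times i and j are neighbors, in either order.
    c = defaultdict(lambda: defaultdict(int))
    for word in words:
        for left, right in zip(word, word[1:]):
            if left != right:  # assumption: zero diagonal
                c[left][right] += 1
                c[right][left] += 1

    letters = {ch for word in words for ch in word}
    sums = {ch: sum(c[ch].values()) for ch in letters}

    vowels = []
    remaining = set(letters)
    while remaining:
        # Highest remaining sum wins; sorting first breaks ties alphabetically.
        best = max(sorted(remaining), key=lambda ch: sums[ch])
        if sums[best] <= 0:
            break  # assumed stop rule: everything left is a consonant
        vowels.append(best)
        remaining.remove(best)
        # Discount each remaining character by its adjacency to the new vowel.
        for ch in remaining:
            sums[ch] -= 2 * c[ch][best]
    return vowels, sorted(remaining)


words = ["sator", "arepo", "tenet", "opera", "rotas"]
vowels, consonants = suxotin(words)
print(vowels, consonants)
```

On this five-word toy input the loop marks e, a, and o as vowels, in that order, and leaves n, p, r, s, and t as consonants. Real corpora are far larger, and the result on small inputs depends on the tie-break.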
Results
Cross-linguistic performance
The algorithm was evaluated on seven languages with distinct orthographic patterns and character sets:
| Language | True Positives (TP) | False Positives (FP) | False Negatives (FN) | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| German | 12 | 1 | 0 | 0.923 | 1.000 | 0.960 |
| French | 17 | 1 | 1 | 0.944 | 0.944 | 0.944 |
| Spanish | 11 | 2 | 0 | 0.846 | 1.000 | 0.917 |
| Italian | 12 | 2 | 4 | 0.857 | 0.750 | 0.800 |
| Dutch | 12 | 0 | 1 | 1.000 | 0.923 | 0.960 |
| Greek | 20 | 3 | 2 | 0.870 | 0.909 | 0.889 |
| English | 9 | 0 | 0 | 1.000 | 1.000 | 1.000 |
Precision ranges from 0.846 (Spanish) to 1.000 (Dutch and English). Recall is at or above 0.90 for every language except Italian (0.75). F1 ranges from 0.80 (Italian) to 1.00 (English), with German and Dutch at 0.96, French at 0.94, Spanish at 0.92, and Greek at 0.89. The Italian recall gap is concentrated in vowels with diacritics that are treated as distinct characters from their unaccented forms; this is a preprocessing decision, not a property of the algorithm.
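The metric columns follow directly from the TP, FP, and FN counts in the table. A short sketch that recomputes them is given here; the helper name `prf` is hypothetical, and the counts are copied from the table above.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw classification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 written in count form; equal to 2PR / (P + R).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per language, copied from the results table.
table = {
    "German": (12, 1, 0),
    "French": (17, 1, 1),
    "Spanish": (11, 2, 0),
    "Italian": (12, 2, 4),
    "Dutch": (12, 0, 1),
    "Greek": (20, 3, 2),
    "English": (9, 0, 0),
}

for language, (tp, fp, fn) in table.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{language:8s} P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Running this reproduces the table's rounded values, including the lowest precision of 0.846 for Spanish and the lowest recall of 0.750 for Italian.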
English corpora
The algorithm scores perfectly on both English corpora, including correct classification of accented characters that appear in loanwords:
| Text Corpus | Classified Vowels | Classified Consonants |
|---|---|---|
| Sherlock Holmes Text | ['a', 'e', 'i', 'o', 'u', 'à', 'â', 'æ', 'è', 'é', 'œ'] | ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z'] |
| NLTK Gutenberg Corpus | ['a', 'e', 'i', 'o', 'u', 'æ', 'è', 'é', 'î'] | ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z'] |
| Text Corpus | Precision | Recall | F1 Score |
|---|---|---|---|
| Sherlock Holmes Text | 1.0 | 1.0 | 1.0 |
| NLTK Gutenberg Corpus | 1.0 | 1.0 | 1.0 |
Accented vowels (à, â, è, é, î) and the ligatures æ and œ are classified correctly. No consonants are misclassified.
Conclusion
The threshold-based variant of Suxotin’s algorithm identifies vowels in all seven languages tested, with F1 between 0.80 and 1.00. The largest gap, Italian recall at 0.75, arises where accented vowels are treated as orthographically distinct from their unaccented variants. This reflects the preprocessing decision, not the adjacency-diversity heuristic.
The result is consistent with the algorithm’s original premise. Vowel and consonant adjacency asymmetries are stable enough across alphabetic systems to support automatic vowel detection without language-specific information.
References
Citation
@online{winstead2024,
author = {Winstead, John},
title = {Suxotin’s Vowel Identification Algorithm},
date = {2024-09-01},
url = {https://jhnwnstd.github.io/research/vowel-identification/},
langid = {en}
}