Suxotin’s Vowel Identification Algorithm

Writing Systems

Computational Linguistics

An algorithmic approach to automatic vowel detection in alphabetic writing systems

Published

September 1, 2024

Background

B.V. Suxotin’s algorithm is a method for identifying vowels in text based on letter adjacency patterns. It requires no prior knowledge of the language or writing system other than its alphabetic nature (Apresjan, 1973; Guy, 1991). Recent enhancements, particularly threshold-based reclassification, have improved its accuracy across diverse languages and orthographic conventions, including those with accented vowels. By leveraging universal linguistic principles (such as vowels’ tendency to exhibit higher adjacency diversity than consonants), the algorithm adapts to varied linguistic contexts. This study evaluates its effectiveness across multiple languages, including German, French, Spanish, Italian, Dutch, Greek, and English, as well as within specific English corpora such as the Sherlock Holmes texts and the NLTK Gutenberg corpus.

Methodology

1. Load and Format Corpus

Each language corpus undergoes systematic preprocessing to ensure consistent analysis:

- Converting all text to lowercase for uniformity
- Removing non-alphabetic characters while preserving diacritics
- Standardizing special characters and diacritical marks
- Retaining spacing for adjacency analysis
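The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess` is hypothetical, "standardizing" diacritics is assumed to mean Unicode NFC normalization, and "retaining spacing" is assumed to mean keeping word boundaries so that letters in different words are never treated as adjacent.

```python
import unicodedata


def preprocess(text: str) -> list[str]:
    """Lowercase, normalize to NFC, drop non-letters, and split into words."""
    # NFC composes a base letter plus combining mark into one code point,
    # so an accented vowel is always a single character (assumed convention).
    text = unicodedata.normalize("NFC", text.lower())
    # Letters (including accented ones) are kept; anything else becomes a
    # space, which preserves word boundaries.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text)
    return cleaned.split()


print(preprocess("Où est le café?"))  # ['où', 'est', 'le', 'café']
```

Returning a list of words rather than one long string makes it easy to count adjacencies within words only.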

2. Build Adjacency Matrix

Create a frequency matrix A, where each element a_{ij} represents the count of character i appearing adjacent to character j
Calculate row sums s_i = \sum_j a_{ij} to measure each character’s total adjacency frequency
Normalize adjacency patterns relative to overall character frequency
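One way to build the adjacency counts is sketched below. It rests on assumptions not stated in the text: a_{ij} is stored as a directed count (i immediately followed by j), each character's sum counts both its left and right neighbours, and doubled letters are ignored, as in the classic formulation of the algorithm. The names `adjacency_counts` and `adjacency_sums` are hypothetical, and the normalization step is not shown.

```python
from collections import Counter


def adjacency_counts(words):
    """counts[(x, y)] = number of times x is immediately followed by y."""
    counts = Counter()
    for word in words:
        for x, y in zip(word, word[1:]):
            counts[(x, y)] += 1
    return counts


def adjacency_sums(counts):
    """Total adjacency per character, over left and right neighbours."""
    sums = Counter()
    for (x, y), n in counts.items():
        if x != y:  # assumption: doubled letters do not count
            sums[x] += n
            sums[y] += n
    return sums


counts = adjacency_counts(["banana"])
print(adjacency_sums(counts).most_common())  # [('a', 5), ('n', 4), ('b', 1)]
```

In "banana" the letter a touches other letters five times, which is why it ends up with the highest sum.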

3. Iterative Classification

Initial Classification: Identify the character with the highest adjacency sum as the first vowel.
Adjustment Phase: For each newly identified vowel j, update the sum of every remaining character i: \text{new\_sum}_i = \text{current\_sum}_i - 2(a_{ij} + a_{ji}), where a_{ij} represents the adjacency count between characters i and j. The factor of 2 accounts for both left and right adjacencies.
Threshold-Based Reclassification: Apply an optimized threshold: \text{threshold} = \min(\text{vowel\_sums}) \times 2. This factor of 2 was chosen empirically to capture vowel characteristics across various corpora.
Termination: Continue until all characters are classified and sums are fully adjusted.
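The initial classification and adjustment phases can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: `counts` holds directed pair counts so that a_{ij} + a_{ji} is the total adjacency between i and j, the loop stops when no remaining character has a positive sum (the stopping rule of the classic formulation), ties are broken alphabetically, and the threshold-based reclassification step is omitted because the text does not fully specify it. The function name `suxotin` is hypothetical.

```python
def suxotin(counts):
    """Split characters into (vowels, consonants) by iterative sum adjustment."""

    def pair(i, j):
        # Total adjacency between i and j in either order: a_ij + a_ji
        return counts.get((i, j), 0) + counts.get((j, i), 0)

    chars = sorted({c for p in counts for c in p})
    # Initial sums; doubled letters (i == j) are ignored by assumption
    sums = {i: sum(pair(i, j) for j in chars if j != i) for i in chars}

    vowels = []
    consonants = set(chars)  # every character starts as a consonant
    while consonants:
        # Highest remaining sum; alphabetical order breaks ties
        best = max(sorted(consonants), key=lambda c: sums[c])
        if sums[best] <= 0:  # assumed stopping rule
            break
        vowels.append(best)
        consonants.remove(best)
        for i in consonants:
            # new_sum_i = current_sum_i - 2 * (a_ij + a_ji)
            sums[i] -= 2 * pair(i, best)
    return vowels, sorted(consonants)


# Directed pair counts for the single word "banana"
print(suxotin({("b", "a"): 1, ("a", "n"): 2, ("n", "a"): 2}))
# (['a'], ['b', 'n'])
```

In this toy case a starts with the highest sum (5). Once it is marked as a vowel, the sums for b and n fall to -1 and -4, so the loop stops with a as the only vowel.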

Results and Analysis

Cross-Linguistic Performance

The algorithm was evaluated across seven languages with distinct orthographic patterns and character sets. Precision, recall, and F1 scores demonstrate high accuracy in identifying vowels and consonants:

Language	True Positives (TP)	False Positives (FP)	False Negatives (FN)	Precision	Recall	F1 Score
German	12	1	0	0.923	1.000	0.960
French	17	1	1	0.944	0.944	0.944
Spanish	11	2	0	0.846	1.000	0.917
Italian	12	2	4	0.857	0.750	0.800
Dutch	12	0	1	1.000	0.923	0.960
Greek	20	3	2	0.870	0.909	0.889
English	9	0	0	1.000	1.000	1.000
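For reference, the precision, recall, and F1 columns follow from the TP, FP, and FN counts by the standard definitions. The short sketch below recomputes two rows of the table; the helper name `prf` is hypothetical.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, rounded to three decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


print(prf(12, 2, 4))  # Italian row: (0.857, 0.75, 0.8)
print(prf(12, 1, 0))  # German row:  (0.923, 1.0, 0.96)
```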

Performance Summary

Strong Precision

The algorithm achieved good precision across all seven languages, with English and Dutch showing perfect precision (1.000). Every other language maintained precision above 0.84: French (0.944), German (0.923), Greek (0.870), Italian (0.857), and Spanish (0.846). Vowel identification remained reliable in the more complex cases involving a non-Latin alphabet (Greek) and diacritics.

Varied Recall

The algorithm achieved mixed recall scores, ranging from perfect (1.000 for English, Spanish, and German) down to 0.750 for Italian. Six of the seven languages maintained recall above 0.90, demonstrating good but not perfect detection of vowel patterns, including accented variants.

F1 Scores

The F1 scores show strong overall performance:

- Perfect score for English (1.0)
- Excellent performance for German and Dutch (0.960)
- Strong results for French (0.944) and Spanish (0.917)
- Good performance for Greek (0.889)
- Solid performance for Italian (0.800)

This suggests the algorithm is effective at recognizing vowel patterns across different writing systems. Five of the six Latin-alphabet languages score 0.917 or higher, while Italian (0.800) and Greek (0.889) score lower. These lower scores are likely attributable to challenges with diacritical marks and character variants rather than to fundamental limitations of the algorithm.

English Corpora Results

The algorithm also performed well on English corpora, such as the Sherlock Holmes texts and the NLTK Gutenberg corpus, as shown below:

Text Corpus	Classified Vowels	Classified Consonants
Sherlock Holmes Text	['a', 'e', 'i', 'o', 'u', 'à', 'â', 'æ', 'è', 'é', 'œ']	['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']
NLTK Gutenberg Corpus	['a', 'e', 'i', 'o', 'u', 'æ', 'è', 'é', 'î']	['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']

Text Corpus	Precision	Recall	F1 Score
Sherlock Holmes Text	1.0	1.0	1.0
NLTK Gutenberg Corpus	1.0	1.0	1.0

These results confirm the algorithm’s ability to classify vowels accurately, including accented characters (e.g., à, â, æ, è, é, œ, î) while correctly identifying consonants.

Conclusion

The algorithm demonstrates robust cross-linguistic performance through its threshold-based approach (using a factor of 2), identifying vowel patterns by exploiting universal vowel-consonant adjacency tendencies. Its success across seven languages and two scripts supports its theoretical foundations and makes it a useful tool for linguistic analysis and decipherment tasks.

Source Code

Implementation available at: github.com/jhnwnstd/suxotin

References

Apresjan, J. D. (1973). Decipherment Models. In Principles and Methods in Contemporary Structural Linguistics (pp. 135–166). De Gruyter. https://doi.org/10.1515/9783110877908-010

Guy, J. B. M. (1991). Vowel Identification: an Old (But Good) Algorithm. Cryptologia, 15(3), 258–262. https://doi.org/10.1080/0161-119191865920

Citation

BibTeX citation:

@online{winstead2024,
  author = {Winstead, John},
  title = {Suxotin’s {Vowel} {Identification} {Algorithm}},
  date = {2024-09-01},
  url = {https://jhnwnstd.github.io/projects/vowel-identification/},
  langid = {en}
}

For attribution, please cite this work as:

Winstead, J. (2024, September 1). Suxotin’s Vowel Identification Algorithm. https://jhnwnstd.github.io/projects/vowel-identification/