Suxotin’s Vowel Identification Algorithm
Background
B.V. Suxotin’s algorithm identifies vowels in text from letter adjacency patterns alone, requiring no prior knowledge of the language or writing system beyond the fact that it is alphabetic (Apresjan, 1973; Guy, 1991). Recent enhancements, particularly threshold-based reclassification, have improved its accuracy across diverse languages and orthographic conventions, including those with complex vowel accents. By leveraging a general linguistic tendency (vowels tend to exhibit higher adjacency diversity than consonants), the algorithm adapts to varied linguistic contexts. This study validates its effectiveness across seven languages (German, French, Spanish, Italian, Dutch, Greek, and English) as well as within specific English corpora: the Sherlock Holmes texts and the NLTK Gutenberg corpus.
Methodology
1. Load and Format Corpus
Each language corpus undergoes systematic preprocessing to ensure consistent analysis:
- Converting all text to lowercase for uniformity
- Removing non-alphabetic characters while preserving diacritics
- Standardizing special characters and diacritical marks
- Retaining spacing for adjacency analysis
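The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess` is hypothetical, and the choices to use Unicode NFC normalization for "standardizing" diacritics and to delete (rather than space out) non-letters are assumptions.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Lowercase the text and keep only letters and single spaces."""
    # Assumption: NFC composes a base letter plus combining accent into one
    # code point where possible, so "e" + U+0301 becomes a single "é".
    text = unicodedata.normalize("NFC", text).lower()
    # str.isalpha() is True for accented letters, so diacritics survive;
    # digits and punctuation are dropped, whitespace is kept for now.
    kept = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    # Collapse whitespace runs so word boundaries are single spaces.
    return re.sub(r"\s+", " ", kept).strip()


print(preprocess("Élan vital, 2024!"))
```

Deleting punctuation outright means a word such as "don't" becomes "dont"; replacing punctuation with a space instead would split it into two words and change which letters count as adjacent.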
2. Build Adjacency Matrix
- Create a frequency matrix $A$, where each element $a_{ij}$ represents the count of character $i$ appearing adjacent to character $j$
- Calculate row sums $s_i = \sum_j a_{ij}$ to measure each character’s total adjacency frequency
- Normalize adjacency patterns relative to overall character frequency
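A minimal sketch of the matrix construction, under stated assumptions: "adjacent" is read here as "character i is immediately followed by character j within a word", pairs are not counted across spaces, and the function name `adjacency_matrix` is hypothetical. The normalization step is left out because the text does not specify its exact form.

```python
from collections import Counter


def adjacency_matrix(text: str):
    """Return (alphabet, matrix, row_sums) for a preprocessed text."""
    pairs = Counter()
    for word in text.split():
        # Count each ordered pair of neighbouring letters inside a word.
        pairs.update(zip(word, word[1:]))
    alphabet = sorted({ch for ch in text if not ch.isspace()})
    # matrix[r][c] is a_ij for i = alphabet[r], j = alphabet[c];
    # a Counter returns 0 for pairs that never occurred.
    matrix = [[pairs[(i, j)] for j in alphabet] for i in alphabet]
    # s_i = sum over j of a_ij
    row_sums = [sum(row) for row in matrix]
    return alphabet, matrix, row_sums


alphabet, matrix, row_sums = adjacency_matrix("banana")
print(alphabet)
print(matrix)
print(row_sums)
```

For "banana" the letter pairs are b-a, a-n, n-a, a-n, n-a, so the rows for a, b, and n sum to 2, 1, and 2 respectively.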
3. Iterative Classification
- Initial Classification: Identify the character with the highest adjacency sum as the first vowel.
- Adjustment Phase: For each newly identified vowel $j$, update the sum of every remaining character $i$: $\text{new\_sum}_i = \text{current\_sum}_i - 2(a_{ij} + a_{ji})$, where $a_{ij}$ represents the adjacency count between characters $i$ and $j$. The factor of 2 accounts for both left and right adjacencies.
- Threshold-Based Reclassification: Apply an optimized threshold: $\text{threshold} = \min(\text{vowel\_sums}) \times 2$. This factor of 2 empirically captures vowel characteristics across various corpora.
- Termination: Continue until all characters are classified and sums are fully adjusted.
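The initial classification and adjustment phases can be sketched as below. This is a simplified illustration of the textbook form of Suxotin's procedure, not the project's implementation. Assumptions: sums are taken over both directions of adjacency ($a_{ij} + a_{ji}$), doubled letters are ignored, the loop stops once no remaining sum is positive, and ties are broken alphabetically. The threshold-based reclassification step is omitted because the text does not specify how the threshold is applied. The name `suxotin_vowels` is hypothetical.

```python
from collections import Counter


def suxotin_vowels(text: str) -> list[str]:
    """Return letters classified as vowels, in order of discovery."""
    directed = Counter()
    for word in text.split():
        directed.update(zip(word, word[1:]))
    letters = sorted({ch for ch in text if not ch.isspace()})

    def both(i: str, j: str) -> int:
        # a_ij + a_ji; assumption: a letter next to itself is not counted.
        return 0 if i == j else directed[(i, j)] + directed[(j, i)]

    sums = {i: sum(both(i, j) for j in letters) for i in letters}
    vowels: list[str] = []
    remaining = set(letters)
    while remaining:
        # Highest remaining sum; sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda ch: sums[ch])
        if sums[best] <= 0:
            break  # assumption: everything left is a consonant
        vowels.append(best)
        remaining.remove(best)
        for i in remaining:
            # new_sum_i = current_sum_i - 2 * (a_ij + a_ji)
            sums[i] -= 2 * both(i, best)
    return vowels


print(suxotin_vowels("banana"))
```

On a toy input like this the result only shows the mechanics: "a" starts with the highest sum and the subtraction pushes "b" and "n" below zero. The method relies on corpus-scale statistics, so very short texts can easily produce misclassifications.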
Results and Analysis
Cross-Linguistic Performance
The algorithm was evaluated across seven languages with distinct orthographic patterns and character sets. Precision, recall, and F1 scores, computed with vowels as the positive class, demonstrate high accuracy in identifying vowels and consonants:
Language | True Positives (TP) | False Positives (FP) | False Negatives (FN) | Precision | Recall | F1 Score |
---|---|---|---|---|---|---|
German | 12 | 1 | 0 | 0.923 | 1.000 | 0.960 |
French | 17 | 1 | 1 | 0.944 | 0.944 | 0.944 |
Spanish | 11 | 2 | 0 | 0.846 | 1.000 | 0.917 |
Italian | 12 | 2 | 4 | 0.857 | 0.750 | 0.800 |
Dutch | 12 | 0 | 1 | 1.000 | 0.923 | 0.960 |
Greek | 20 | 3 | 2 | 0.870 | 0.909 | 0.889 |
English | 9 | 0 | 0 | 1.000 | 1.000 | 1.000 |
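The precision, recall, and F1 columns follow from the TP, FP, and FN counts by the standard definitions. A short sketch for checking a row (the helper name `precision_recall_f1` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Italian row of the table: TP = 12, FP = 2, FN = 4
p, r, f1 = precision_recall_f1(12, 2, 4)
print(f"{p:.3f} {r:.3f} {f1:.3f}")
```

For the Italian row this gives 12/14 for precision and 12/16 for recall, matching the 0.857, 0.750, and 0.800 reported in the table.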
Performance Summary
Strong Precision
The algorithm achieved good precision across all seven languages, with English and Dutch both showing perfect precision (1.000). Every other language maintained precision above 0.84: French (0.944), German (0.923), Greek (0.870), Italian (0.857), and Spanish (0.846). The algorithm maintained reliable vowel identification even in complex cases involving multiple alphabets and diacritics.
Varied Recall
The algorithm achieved mixed recall scores, ranging from perfect (1.000 for English, Spanish, and German) down to 0.750 for Italian. Every language except Italian maintained recall above 0.90, demonstrating good but not perfect detection of vowel patterns, including accented variants.
F1 Scores
The F1 scores show strong overall performance:
- Perfect score for English (1.0)
- Excellent performance for German and Dutch (0.960)
- Strong results for French (0.944) and Spanish (0.917)
- Good performance for Greek (0.889)
- Solid performance for Italian (0.800)
This suggests the algorithm is effective at recognizing vowel patterns across different writing systems: six of the seven languages reach an F1 of 0.889 or higher, including Greek, the only non-Latin script tested. The lower scores in some languages, most notably Italian, may be attributable to challenges with diacritical marks and character variants rather than to fundamental limitations of the algorithm.
English Corpora Results
The algorithm also performed excellently on English corpora, such as the Sherlock Holmes texts and the NLTK Gutenberg corpus, as illustrated below:
Text Corpus | Classified Vowels | Classified Consonants |
---|---|---|
Sherlock Holmes Text | [‘a’, ‘e’, ‘i’, ‘o’, ‘u’, ‘à’, ‘â’, ‘æ’, ‘è’, ‘é’, ‘œ’] | [‘b’, ‘c’, ‘d’, ‘f’, ‘g’, ‘h’, ‘j’, ‘k’, ‘l’, ‘m’, ‘n’, ‘p’, ‘q’, ‘r’, ‘s’, ‘t’, ‘v’, ‘w’, ‘x’, ‘y’, ‘z’] |
NLTK Gutenberg Corpus | [‘a’, ‘e’, ‘i’, ‘o’, ‘u’, ‘æ’, ‘è’, ‘é’, ‘î’] | [‘b’, ‘c’, ‘d’, ‘f’, ‘g’, ‘h’, ‘j’, ‘k’, ‘l’, ‘m’, ‘n’, ‘p’, ‘q’, ‘r’, ‘s’, ‘t’, ‘v’, ‘w’, ‘x’, ‘y’, ‘z’] |
Text Corpus | Precision | Recall | F1 Score |
---|---|---|---|
Sherlock Holmes Text | 1.0 | 1.0 | 1.0 |
NLTK Gutenberg Corpus | 1.0 | 1.0 | 1.0 |
These results confirm the algorithm’s ability to classify vowels accurately, including accented characters (e.g., à, â, æ, è, é, œ, î) while correctly identifying consonants.
Conclusion
The algorithm demonstrates robust cross-linguistic performance through its threshold-based approach (using a factor of 2), effectively identifying vowel patterns by exploiting universal vowel-consonant adjacency tendencies. This success across diverse writing systems validates its theoretical foundations and establishes it as a valuable tool for linguistic analysis and decipherment tasks.
References
Citation
@online{winstead2024,
author = {Winstead, John},
title = {Suxotin’s {Vowel} {Identification} {Algorithm}},
date = {2024-09-01},
url = {https://jhnwnstd.github.io/projects/vowel-identification/},
langid = {en}
}