Suxotin’s Vowel Identification Algorithm

Writing Systems
Computational Linguistics
An algorithmic approach to automatic vowel detection in alphabetic writing systems
Published

September 1, 2024

Background

B.V. Suxotin’s algorithm identifies vowels in a text from letter adjacency patterns alone, requiring no prior knowledge of the language or writing system beyond the fact that it is alphabetic (Apresjan, 1973; Guy, 1991). Recent enhancements, particularly threshold-based reclassification, have improved its accuracy across diverse languages and orthographic conventions, including those with accented vowels. By relying on a general linguistic tendency (vowels tend to exhibit higher adjacency diversity than consonants), the algorithm adapts to varied linguistic contexts. This study evaluates it on seven languages (German, French, Spanish, Italian, Dutch, Greek, and English) and on two specific English corpora: the Sherlock Holmes texts and the NLTK Gutenberg corpus.

Methodology

1. Load and Format Corpus

Each language corpus undergoes systematic preprocessing to ensure consistent analysis:

  • Converting all text to lowercase for uniformity
  • Removing non-alphabetic characters while preserving diacritics
  • Standardizing special characters and diacritical marks
  • Retaining spacing for adjacency analysis
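The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess` is hypothetical, and two choices are assumptions made here, namely using Unicode NFC normalization to standardize diacritics and replacing non-alphabetic characters with a space (rather than deleting them) so that word boundaries survive.

```python
import unicodedata


def preprocess(text: str) -> str:
    """Lowercase the text and keep only letters and single spaces."""
    # NFC composes a base letter plus combining mark into one code point
    # where possible, so "e" + U+0301 is treated as the single letter "é".
    text = unicodedata.normalize("NFC", text).lower()
    # Keep alphabetic characters (accented letters included); everything
    # else becomes a space. This replacement is an assumption of the sketch.
    kept = "".join(ch if ch.isalpha() else " " for ch in text)
    # Collapse runs of whitespace so words are separated by one space.
    return " ".join(kept.split())


print(preprocess("Café, déjà-vu! 42"))  # café déjà vu
```

Replacing punctuation with a space means a form such as "don't" is split into two words; deleting the character instead would join them, and which behaviour is preferable depends on the corpus.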

2. Build Adjacency Matrix

  • Create a frequency matrix A, where each element a_{ij} represents the count of character i appearing adjacent to character j
  • Calculate row sums s_i = \sum_j a_{ij} to measure each character’s total adjacency frequency
  • Normalize adjacency patterns relative to overall character frequency
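As a rough illustration of the matrix and row sums, the sketch below builds A as a nested list from preprocessed text. It assumes a symmetric matrix with a zero diagonal, as in the formulation given by Guy (1991), and counts adjacency only inside words; both are assumptions, as is the name `adjacency_matrix`. The normalization step is left out because its exact form is not specified here.

```python
def adjacency_matrix(text: str):
    """Return (letters, A) where A[i][j] counts how often letters[i]
    and letters[j] stand next to each other inside a word."""
    words = text.split()
    letters = sorted(set("".join(words)))
    index = {ch: k for k, ch in enumerate(letters)}
    n = len(letters)
    A = [[0] * n for _ in range(n)]
    for word in words:
        for left, right in zip(word, word[1:]):
            if left == right:
                continue  # keep the diagonal at zero (assumption)
            i, j = index[left], index[right]
            A[i][j] += 1  # count the contact in both directions,
            A[j][i] += 1  # which makes the matrix symmetric
    return letters, A


letters, A = adjacency_matrix("banana")
row_sums = [sum(row) for row in A]
print(letters)   # ['a', 'b', 'n']
print(row_sums)  # [5, 1, 4]
```

In "banana" the letter a touches b once and n four times, so its row sum of 5 is the largest, which is what the first classification step relies on.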

3. Iterative Classification

  • Initial Classification: Identify the character with the highest adjacency sum as the first vowel.
  • Adjustment Phase: For each newly identified vowel j, update the sum of every remaining character i: \text{new\_sum}_i = \text{current\_sum}_i - 2(a_{ij} + a_{ji}) where a_{ij} represents the adjacency count between characters i and j. The factor of 2 accounts for both left and right adjacencies.
  • Threshold-Based Reclassification: Apply an optimized threshold: \text{threshold} = \min(\text{vowel\_sums}) \times 2 This factor of 2 empirically captures vowel characteristics across various corpora.
  • Termination: Continue until all characters are classified and sums are fully adjusted.
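The core loop can be sketched as below. This follows the classical procedure as described by Guy (1991), not necessarily the implementation used for the results here: every character starts as a consonant, the unclassified character with the highest positive sum becomes a vowel, and each remaining sum is reduced by twice its contact count with that vowel. The matrix A is symmetric, so `A[i][j]` plays the role of a_{ij} + a_{ji} if a_{ij} in the update formula is read as a one-directional count; that reading is an assumption. The threshold-based reclassification step is not included, and the name `suxotin_vowels` is hypothetical.

```python
def suxotin_vowels(text: str) -> set:
    """Classical Suxotin loop on preprocessed, space-separated text."""
    words = text.split()
    letters = sorted(set("".join(words)))
    index = {ch: k for k, ch in enumerate(letters)}
    n = len(letters)
    # Symmetric contact counts with a zero diagonal (assumption).
    A = [[0] * n for _ in range(n)]
    for word in words:
        for left, right in zip(word, word[1:]):
            if left != right:
                i, j = index[left], index[right]
                A[i][j] += 1
                A[j][i] += 1
    sums = [sum(row) for row in A]
    consonants = set(range(n))  # everything starts as a consonant
    vowels = set()
    while consonants:
        # sorted() makes tie-breaking deterministic (first in alphabet wins).
        best = max(sorted(consonants), key=lambda i: sums[i])
        if sums[best] <= 0:
            break  # stopping rule from the classical formulation
        consonants.remove(best)
        vowels.add(best)
        for i in consonants:
            sums[i] -= 2 * A[i][best]  # discount contacts with the new vowel
    return {letters[i] for i in vowels}


print(sorted(suxotin_vowels("banana tomato potato")))  # ['a', 'o']
```

On very short inputs the procedure can just as easily pick a frequent consonant first, since it only detects alternation between two classes; realistic corpus sizes are needed for the sums to separate vowels reliably.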

Results and Analysis

Cross-Linguistic Performance

The algorithm was evaluated on seven languages with distinct orthographic patterns and character sets. The precision, recall, and F1 scores below show high accuracy in separating vowels from consonants:

| Language | True Positives (TP) | False Positives (FP) | False Negatives (FN) | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| German | 12 | 1 | 0 | 0.923 | 1.000 | 0.960 |
| French | 17 | 1 | 1 | 0.944 | 0.944 | 0.944 |
| Spanish | 11 | 2 | 0 | 0.846 | 1.000 | 0.917 |
| Italian | 12 | 2 | 4 | 0.857 | 0.750 | 0.800 |
| Dutch | 12 | 0 | 1 | 1.000 | 0.923 | 0.960 |
| Greek | 20 | 3 | 2 | 0.870 | 0.909 | 0.889 |
| English | 9 | 0 | 0 | 1.000 | 1.000 | 1.000 |
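For reference, the precision, recall, and F1 columns follow from the TP, FP, and FN counts by the standard definitions. The short sketch below recomputes the Italian row; the helper name `prf` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Italian row: TP = 12, FP = 2, FN = 4
p, r, f = prf(12, 2, 4)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.857 0.75 0.8
```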

[Figure: F1 Scores]

Performance Summary

Strong Precision

The algorithm achieved high precision in every language. English and Dutch reached perfect precision (1.000), and all other languages stayed above 0.84: French (0.944), German (0.923), Greek (0.870), Italian (0.857), and Spanish (0.846). Vowel identification remained reliable even in complex cases involving multiple alphabets and diacritics.

Varied Recall

Recall was more varied, ranging from perfect (1.000 for English, Spanish, and German) down to 0.750 for Italian. Every other language stayed above 0.90 (French 0.944, Dutch 0.923, Greek 0.909), indicating good but not perfect detection of vowels, including accented variants.

F1 Scores

The F1 scores show strong overall performance:

  • Perfect score for English (1.000)
  • Excellent performance for German and Dutch (0.960)
  • Strong results for French (0.944) and Spanish (0.917)
  • Good performance for Greek (0.889)
  • Solid performance for Italian (0.800)

This suggests the algorithm is effective at recognizing vowel patterns across different writing systems. The five highest F1 scores all belong to Latin-alphabet languages, although Italian (0.800) falls below Greek (0.889), so script alone does not explain the differences. The lower scores in some languages may stem from diacritical marks and character variants rather than from fundamental limitations of the algorithm.

English Corpora Results

The algorithm also performed excellently on English corpora, such as the Sherlock Holmes texts and the NLTK Gutenberg corpus, as illustrated below:

| Text Corpus | Classified Vowels | Classified Consonants |
|---|---|---|
| Sherlock Holmes Text | ['a', 'e', 'i', 'o', 'u', 'à', 'â', 'æ', 'è', 'é', 'œ'] | ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z'] |
| NLTK Gutenberg Corpus | ['a', 'e', 'i', 'o', 'u', 'æ', 'è', 'é', 'î'] | ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z'] |

| Text Corpus | Precision | Recall | F1 Score |
|---|---|---|---|
| Sherlock Holmes Text | 1.0 | 1.0 | 1.0 |
| NLTK Gutenberg Corpus | 1.0 | 1.0 | 1.0 |

These results confirm the algorithm’s ability to classify vowels accurately, including accented characters (e.g., à, â, æ, è, é, œ, î) while correctly identifying consonants.

Conclusion

The algorithm demonstrates robust cross-linguistic performance through its threshold-based approach (using a factor of 2), effectively identifying vowel patterns by exploiting universal vowel-consonant adjacency tendencies. This success across diverse writing systems validates its theoretical foundations and establishes it as a valuable tool for linguistic analysis and decipherment tasks.

Source Code

Implementation available at: github.com/jhnwnstd/suxotin

References

Apresjan, J. D. (1973). Decipherment Models. In Principles and Methods in Contemporary Structural Linguistics (pp. 135–166). De Gruyter. https://doi.org/10.1515/9783110877908-010
Guy, J. B. M. (1991). Vowel Identification: an Old (But Good) Algorithm. Cryptologia, 15(3), 258–262. https://doi.org/10.1080/0161-119191865920

Citation

BibTeX citation:
@online{winstead2024,
  author = {Winstead, John},
  title = {Suxotin’s {Vowel} {Identification} {Algorithm}},
  date = {2024-09-01},
  url = {https://jhnwnstd.github.io/projects/vowel-identification/},
  langid = {en}
}
For attribution, please cite this work as:
Winstead, J. (2024, September 1). Suxotin’s Vowel Identification Algorithm. https://jhnwnstd.github.io/projects/vowel-identification/

© 2025, John Winstead
