A language's morphological profile (how much it marks, and which categories it marks) might interact with the nearest-neighbor profiles of its word embeddings, especially when the corpus is small.
Test this on corpora of similar size to the Voynich manuscript, one for each of the following languages:
- Latin
- Greek
- French
- Italian
- Navajo
- Mandarin
- Arabic
- Turkish
- German
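
One way to set up this comparison is sketched below. It is a minimal illustration, not a fixed method: `truncate_tokens`, `cooccurrence_vectors`, `cosine`, and `nearest_neighbors` are hypothetical helper names, the count-based context vectors stand in for whatever embedding model is actually trained, and `target_size` is a parameter to be set from the Voynich token count rather than a value assumed here.

```python
import math
from collections import Counter, defaultdict


def truncate_tokens(tokens, target_size):
    """Keep only the first `target_size` tokens so every corpus has the same length."""
    return tokens[:target_size]


def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(vectors, word, k=5):
    """Return the `k` (word, similarity) pairs closest to `word`, best first."""
    target = vectors.get(word, Counter())  # .get avoids adding keys to the defaultdict
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    # Sort by descending similarity, then alphabetically to break ties.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


if __name__ == "__main__":
    # Toy corpus only; real input would be one tokenized corpus per language.
    tokens = "the cat sat on the mat the dog sat on the rug".split()
    sample = truncate_tokens(tokens, target_size=12)
    vectors = cooccurrence_vectors(sample, window=1)
    print(nearest_neighbors(vectors, "cat", k=3))
```

Running the same pipeline over each language's truncated corpus gives one set of neighbor lists per language, which can then be compared across morphological types.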
Repeat the test with de-voweled versions of the same corpora, i.e. with all vowels removed.
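
De-voweling could be sketched as follows for Latin-script text. This is a minimal sketch under stated assumptions: `devowel` and `devowel_tokens` are hypothetical helper names, and `LATIN_VOWELS` is an assumed vowel inventory (it omits `y` and non-decomposable letters such as Turkish `ı`). What counts as a vowel depends on the language and script, so Greek, Arabic, Mandarin, and the others would each need their own definition, possibly applied to a romanization.

```python
import unicodedata

# Assumption: a basic Latin-script vowel inventory; 'y' is deliberately left out.
LATIN_VOWELS = frozenset("aeiouAEIOU")


def devowel(text, vowels=LATIN_VOWELS):
    """Remove vowels, and any diacritics attached to them, from `text`."""
    out = []
    skipping = False  # True while dropping a vowel plus its combining marks
    # NFD splits accented letters into base letter + combining marks,
    # so 'é' is seen as 'e' followed by a combining acute accent.
    for ch in unicodedata.normalize("NFD", text):
        if ch in vowels:
            skipping = True
            continue
        if skipping and unicodedata.combining(ch):
            continue  # diacritic that belonged to the vowel just dropped
        skipping = False
        out.append(ch)
    # Recompose what is left, e.g. 'c' + cedilla back into 'ç'.
    return unicodedata.normalize("NFC", "".join(out))


def devowel_tokens(tokens, vowels=LATIN_VOWELS):
    """De-vowel each token and drop tokens that become empty (e.g. French 'à')."""
    stripped = (devowel(tok, vowels) for tok in tokens)
    return [tok for tok in stripped if tok]


if __name__ == "__main__":
    print(devowel("arma virumque cano"))
    print(devowel_tokens(["à", "la", "maison"]))
```

Tokens that consist only of vowels vanish entirely, which shortens the corpus; whether to truncate to the target size before or after de-voweling is a design choice that should be made the same way for every language.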