Morphology consists of inflection and word formation. In language teaching, it occurs mainly in the form of inflectional paradigms.
Standardized resources are key components for the development of applications related to human language technology. It is therefore important to adopt standardization when designing lexical resources, especially for less-resourced languages such as Amazigh. This language is spoken by many North African communities, including in Morocco. Due to historical, geographical and sociolinguistic factors, the Amazigh language is characterized by the proliferation of many intervarieties, which has led to a complex morphology. Given the scarcity of Amazigh language resources dealing with morpheme encoding, orthographic changes, and morphotactic variations, the elaboration of a standardized lexical resource will certainly ensure broad exchange and exploitation. Amazigh poses a significant challenge to NLP tasks, especially since it belongs to the Afro-Asiatic (Hamito-Semitic) language family, known for its non-concatenative morphology based on roots and patterns. In this context, this paper describes ongoing work on building a morphological lexicon, based on inflected forms, for the standard Moroccan Amazigh language.
Due to the fast pace of life and online communications, and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing. Restoring diacritics and correcting spelling is important for proper language use and disambiguation of texts for both humans and downstream algorithms. However, both of these problems are typically addressed separately, i.e., state-of-the-art diacritics restoration methods do not tolerate other typos. In this work, we tackle both problems at once by employing newly developed ByT5 byte-level transformer models. Our simultaneous diacritics restoration and typo correction approach demonstrates near state-of-the-art performance in 13 languages, reaching >96% alpha-word accuracy. We also perform diacritics restoration alone on 12 benchmark datasets, with an additional one for the Lithuanian language. The experimental investigation shows that our approach achieves results (>98%) comparable to those previously reported, despite being trained on less data. Our approach is also able to restore diacritics in words not seen during training with >76% accuracy. We also show that the accuracies improve further with longer training. All this shows the great real-world application potential of our suggested methods for more data, languages, and error classes.
Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and the disambiguation of texts for both humans and downstream algorithms. However, both of these problems are typically addressed separately: state-of-the-art diacritics restoration methods do not tolerate other typos, and classical spellcheckers cannot deal adequately with text in which all the diacritics are missing. In this work, we tackle both problems at once by employing the newly developed universal ByT5 byte-level seq2seq transformer model, which requires no language-specific model structures. For comparison, we perform diacritics restoration on benchmark datasets of 12 languages, with the addition of Lithuanian. The experimental investigation shows that our approach achieves results (>98%) comparable to the previous state of the art, despite being trained less and on less data. Our simultaneous diacritics restoration and typo correction approach reaches >94% alpha-word accuracy on the 13 languages. It has no direct competitors and strongly outperforms classical spell-checking and dictionary-based approaches. We also demonstrate that all the accuracies improve further with more training. Taken together, this shows the great real-world application potential of our suggested methods for more data, languages, and error classes.
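To make the diacritics restoration task more concrete, the following minimal Python sketch shows how undiacritized input could be derived from clean text, how a byte-level model would see that text, and how a word-level score over alphabetic words might be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the byte offset of 3 reflects our understanding of how ByT5-style tokenizers reserve the first ids for special tokens, and the definition of alpha-word accuracy used here (exact matches over whitespace-separated, purely alphabetic reference words) is an assumption, since the abstract does not define the metric.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Simulate undiacritized typing by removing combining marks.

    Letters that have no canonical decomposition (for example, the
    Polish l with stroke) are left unchanged by this simple approach.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def byte_ids(text: str, offset: int = 3) -> list[int]:
    """Map text to byte-level ids: one id per UTF-8 byte, shifted by `offset`.

    The offset of 3 is an assumption that mirrors reserving ids 0-2
    for special tokens, as ByT5-style tokenizers are understood to do.
    """
    return [b + offset for b in text.encode("utf-8")]


def alpha_word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of alphabetic reference words reproduced exactly.

    Assumed definition: both strings are split on whitespace, must have
    the same number of words, and only purely alphabetic reference
    words are scored.
    """
    pred_words = predicted.split()
    ref_words = reference.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same word count")
    pairs = [(p, r) for p, r in zip(pred_words, ref_words) if r.isalpha()]
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)
```

For instance, `strip_diacritics` applied to the Lithuanian word "ąžuolas" yields "azuolas", which is the kind of input a restoration model would have to map back to the original. Because a byte-level model consumes UTF-8 bytes directly, a diacritized letter such as "ą" occupies two input positions while its plain counterpart "a" occupies one, and no language-specific vocabulary is needed.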