Designing a Tagset for Annotating the Tuvan National Corpus
Authors: Bayyr-Ool Aziyana, Voinov Vitaly
Source: International Journal of Language Studies, 2012, Volume 6, Issue 4, pp. 1-24
Abstract: This paper examines various aspects of designing a part-of-speech (POS) tagset for annotating a textual corpus in the Tuvan language of Siberia (Turkic family). The issues raised are relevant by extension to designing tagsets in other languages. Preliminary issues discussed are Tuvan linguistic structure, the rationale for preferring a POS tagset at the initial stages of corpus design, the metalanguage and orthography of the tagset, and the potential usefulness of existing tagsets for designing a new one. The paper then presents the specific linguistic attributes that are encoded in the Tuvan tagset, using the three-level model of major class > subclass > features. Difficulties involved in deciding whether a specific type of word is a major class or a subclass are illustrated with Tuvan language data. The actual structure of the individual tags to be used in the tagset is also discussed, examining several existing models that differ in terms of transparency and level of linguistic detail. Sample Tuvan words that have been tagged using the system laid out in the paper are provided to illustrate how this tagset design facilitates searching for decomposable morphosyntactic elements relevant to the grammatical structure of Tuvan (as well as that of other Turkic languages).
Keywords: Corpus Annotation, Tagset Design, Tuvan, Turkic
Affiliations: Institute of Philology, Russia; University of Texas at Arlington, USA
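
Illustrative note: the three-level model named in the abstract (major class > subclass > features) can be pictured as a tag that splits into independently searchable parts. The sketch below is only an assumption-laden illustration, not the tagset defined in the paper: the dot-separated tag format, the labels (`N`, `COM`, `PL`, `ACC`, and so on), the placeholder tokens, and the helper names `Tag`, `parse_tag`, and `search` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A three-level tag: major class > subclass > features (hypothetical format)."""
    major: str
    subclass: str = ""
    features: tuple = ()

    def __str__(self):
        # Rebuild the dot-separated form, skipping an empty subclass.
        parts = [self.major]
        if self.subclass:
            parts.append(self.subclass)
        parts.extend(self.features)
        return ".".join(parts)


def parse_tag(text):
    """Split an assumed 'MAJOR.SUBCLASS.FEAT1.FEAT2' string into its levels."""
    parts = text.split(".")
    subclass = parts[1] if len(parts) > 1 else ""
    return Tag(parts[0], subclass, tuple(parts[2:]))


def search(tagged, major=None, subclass=None, feature=None):
    """Return tokens whose tag matches every criterion that was given."""
    return [
        token
        for token, tag in tagged
        if (major is None or tag.major == major)
        and (subclass is None or tag.subclass == subclass)
        and (feature is None or feature in tag.features)
    ]


# Placeholder tokens with hypothetical tags; not data from the paper.
tagged = [
    ("word1", parse_tag("N.COM.PL.ACC")),
    ("word2", parse_tag("V.FIN.PST")),
    ("word3", parse_tag("N.PROP.DAT")),
]
```

Because each level is stored separately, a query can target any one of them, for example `search(tagged, major="N")` for all nouns or `search(tagged, feature="PL")` for all plural forms regardless of class. This is the kind of decomposable search the abstract describes, under the stated assumptions.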
