Resources of a Digital Text Corpus in Analysing Heroic Epics of the Shors
Author: Funk, Dmitry
Affiliation: Institute of Ethnology and Anthropology, Russian Academy of Sciences

The Shors are one of Siberia’s smaller populations, numbering 12,888 according to the 2010 Russian census. They live mainly in the south of Western Siberia, and most of them (about 73% of the entire Shor population) live in towns in the southern part of the Kemerovo region. This ethnic group is especially well known for its rich epic tradition, examples of which have been recorded over the last 150 years by W. Radloff, A.V. Adrianov, N.P. Dyrenkova, G.F. Babushkin, A.I. Smerdov, O.I. Blagoveshchenskaya, A.I. Chudoyakov, and many other scholars and enthusiasts. According to the author’s data, at least 280 Shor epic texts are stored in various archives and/or private collections. Of these rich materials, only 26 epic texts were published in the original (Shor) language between 1861 and 2010, and only 17 of them in full. To make matters worse, most of the earlier publications by Radloff (1861), Dyrenkova (1940), and Babushkin (1940) are hardly accessible, especially for the Shors in their places of residence.

To make a significant part of the unknown materials available to researchers and to others able to read Siberian native languages, in 2011 I initiated the project “Corpus of Folklore Texts in the Languages of Indigenous Peoples of Siberia”, which is being developed by researchers and postgraduate students of the Department of Northern and Siberian Studies, Institute for Ethnology and Anthropology of the Russian Academy of Sciences (RAS), with support from the RAS Presidium Corpus Linguistics research initiative.

The Corpus can store originals (images of handwritten pages, audio or video recordings with or without transcriptions, etc.) as well as an orthographically standardised version of any text.

The Shor part of the Corpus now comprises 155,957 orthographic words in 30 epic texts (there are also some texts of other genres) and is supplemented by an oral sub-corpus of approximately 28 thousand words, which is currently accessible in audio form only. The included texts represent both the Kondoma (kondomskiy) and Mrass (mrasskiy) dialects of the Shor language. It is noteworthy that 26 epic texts are unique to this Corpus: they form part of the author’s personal collection and have not previously been published in any form (apart from six that were published recently, see Funk, 2010, 2012). By the end of the project (2014) we plan to add at least 40 more epic texts to the Corpus, making them available for the first time.

There is no other freely accessible corpus of Shor of comparable size. It is important to stress that our Corpus is not merely a repository. The web interface (in Russian only at the time of writing) provides means for:

•  looking up specific word-forms and their context;

•  finding word-forms adjacent to a given one on the left or right, as well as co-occurrences of given word-forms at a certain distance within a sentence;

•  collecting statistical data on the frequency of any word-form, and comparing lists of word-forms from any number of texts in the same language;

•  comparing sentences from any two texts in order to identify recurring expressions (or loci communes).
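
The kinds of query listed above can be illustrated with a minimal Python sketch. It is not the Corpus's actual implementation, which is not described here; all function names are hypothetical, the tokeniser is a deliberately naive assumption, and the sample sentences are invented English placeholders rather than lines from any Shor text.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Naive assumption: a word-form is any run of Unicode word characters.
    return re.findall(r"\w+", sentence.lower())

def kwic(sentences, target, window=2):
    """Look up a word-form and return (left context, word, right context)."""
    hits = []
    for sent in sentences:
        tokens = tokenize(sent)
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                hits.append((left, tok, right))
    return hits

def neighbours(sentences, target, side="right"):
    """Count word-forms immediately to the left or right of the target."""
    offset = 1 if side == "right" else -1
    counts = Counter()
    for sent in sentences:
        tokens = tokenize(sent)
        for i, tok in enumerate(tokens):
            j = i + offset
            if tok == target and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

def cooccurrences(sentences, first, second, max_distance=3):
    """Return sentences where both word-forms occur within max_distance tokens."""
    hits = []
    for sent in sentences:
        tokens = tokenize(sent)
        pos_a = [i for i, t in enumerate(tokens) if t == first]
        pos_b = [i for i, t in enumerate(tokens) if t == second]
        if any(0 < abs(i - j) <= max_distance for i in pos_a for j in pos_b):
            hits.append(sent)
    return hits

def frequencies(sentences):
    """Frequency list of all word-forms in a text."""
    return Counter(tok for sent in sentences for tok in tokenize(sent))

def shared_ngrams(text_a, text_b, n=3):
    """Word n-grams found in both texts: a crude proxy for recurring expressions."""
    def ngrams(sentences):
        grams = set()
        for sent in sentences:
            toks = tokenize(sent)
            grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        return grams
    return ngrams(text_a) & ngrams(text_b)

# Invented placeholder "texts" (lists of sentences), for illustration only.
text_a = ["The hero rode his white horse over the mountain.",
          "The hero fought for seven days."]
text_b = ["The khan rode his white horse to the river.",
          "For seven days the feast went on."]

if __name__ == "__main__":
    print(kwic(text_a, "horse"))
    print(neighbours(text_a + text_b, "white", side="right"))
    print(frequencies(text_a).most_common(3))
    print(sorted(shared_ngrams(text_a, text_b)))
```

Exact n-gram overlap only catches verbatim repetition. Formulaic expressions that vary in wording or inflection would need lemmatisation or fuzzy matching, which this sketch does not attempt.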

All this gives researchers a unique opportunity to analyse epic texts in many ways, making the Corpus especially valuable for linguists, folklorists, and cultural anthropologists alike. While presenting the main results achieved, I will address the basic principles and problems of finding, recording, standardising, and publishing epic texts of the Shors and of the Sayan-Altai Turkic groups in general.