Russian in East Siberia and Kamchatka

Storage and documentation of the data

Since 1997, over 130 hours of DAT recordings have been collected from members of the four target communities. They include narratives of wide stylistic variation (monologues, dialogues, interviews and the like). Given the diversity and amount of data that can be drawn from the recordings (not only for linguistics, but also for ethnographic, historical and social studies), the critical question is how to store the whole content and keep it accessible. The plan is to use a digital technology for archiving acoustic data developed in the Linguistic Laboratory of the Ruhr University. The core of the technology is a method of processing DAT recordings that stores long recordings as a series of short segments (tracks) on an audio CD. Each track lasts about 1–2 minutes and contains a relatively self-contained stretch of context. A CD of this type allows extracts of any length to be played back without pauses between tracks, and it allows quick and easy reference to a particular position in the text and in the corpus. The latter is provided by detailed cataloguing, in which each track has its own index and is supplied with pertinent data: date and place of the recording, information about the speaker (e.g. age, education, origin), type of narration (dialogue, monologue, etc.), topic(s), linguistic remarks, surrounding circumstances and technical conditions of the recording. The standard audio format, compatible with the vast majority of computers and CD players, makes the archive accessible to every interested user. The clear advantage of this technique is maximum exposure of the data, since each extract included in a database, used for research or cited as an illustration will carry an exact reference to an audio CD number, track number and time position within the track.
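The cataloguing idea can be illustrated with a minimal Python sketch. It is not the Laboratory's actual software: the class TrackRecord, its field names and the function locate are hypothetical, chosen only to show how a position in a long recording might be resolved to a track and a time offset once each track's duration is catalogued.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from itertools import accumulate


@dataclass
class TrackRecord:
    """One catalogued track (hypothetical layout, not the project's schema)."""
    cd_number: int
    track_number: int
    duration_s: float                              # track length in seconds
    recorded_on: str = ""                          # date of the recording
    place: str = ""                                # place of the recording
    speaker: dict = field(default_factory=dict)    # e.g. age, education, origin
    narration_type: str = ""                       # e.g. dialogue, monologue
    topics: list = field(default_factory=list)


def locate(tracks, position_s):
    """Map a position (seconds from the start of the recording)
    to the track that contains it and the offset inside that track."""
    if not tracks:
        raise ValueError("no tracks catalogued")
    # Cumulative end time of each track, in playing order.
    ends = list(accumulate(t.duration_s for t in tracks))
    if not 0 <= position_s < ends[-1]:
        raise ValueError("position lies outside the recording")
    i = bisect_right(ends, position_s)
    start = ends[i - 1] if i > 0 else 0
    return tracks[i], position_s - start
```

With three tracks of 90, 75 and 110 seconds, a position 100 seconds into the recording resolves to the second track at an offset of 10 seconds.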

The retrieval system: markup

The markup process has two aims: 1) to create an efficient mechanism for managing and using the archive, which should make it widely available to the academic and other interested communities; 2) to lay the foundation for the online acoustic database of language contacts and contact-induced changes in the speech of outpost Slavic settlements. Hence the corpora will be supplied with two types of annotation. The first provides general guidelines and involves extra-linguistic and linguistic marks, such as information about the person(s) from whom the text was recorded (age, origin, education, family, etc.), type of narration (monologue or dialogue), type of discourse, pragmatic features and subject (with reference to each track). Some of the features made explicit through this markup are also important for the linguistic study of contact phenomena. The second type of annotation is problem-oriented and will reflect contact-induced changes in a particular system or subsystem.
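As a rough illustration of how the two annotation types could sit side by side on a track, consider the following Python sketch. The names TrackAnnotation and find_tracks, the dictionary keys and the feature labels used in the usage note are all assumptions made for the example; the project's real markup scheme is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class TrackAnnotation:
    """Annotation attached to one track (hypothetical structure)."""
    cd_number: int
    track_number: int
    # First annotation type: general extra-linguistic and linguistic marks.
    general: dict = field(default_factory=dict)
    # Second annotation type: problem-oriented tags for contact-induced changes.
    contact_features: list = field(default_factory=list)


def find_tracks(annotations, feature=None, **general_filters):
    """Return (CD number, track number) pairs whose annotation carries the
    given contact feature (if any) and matches every general mark given."""
    hits = []
    for a in annotations:
        if feature is not None and feature not in a.contact_features:
            continue
        if all(a.general.get(k) == v for k, v in general_filters.items()):
            hits.append((a.cd_number, a.track_number))
    return hits
```

A call such as find_tracks(annotations, feature="feature-A", narration="dialogue") would then combine both layers in one query; "feature-A" and the key "narration" are placeholders, not real descriptors from the project.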

The acoustic database

Profile. The acoustic database (AD) is a problem-oriented database focused on the linguistic results of long-term interethnic contacts within outpost communities. Within the framework of the project, the AD is expected to present data from four ethnic sub-groups of mixed Slavic-indigenous origin; in future it may be extended with data from other isolated communities.

Contents. The goal of the AD is to present the most typical linguistic features of a given contact area (such as an outpost community). Hence the focus of the database will be contact-induced changes observed in the speech of the given community. To provide a systematic approach to a contact situation, a common network of classifiers is currently being worked out. It will cover the major linguistic and relevant extra-linguistic categories. The linguistic tagging is being carried out on a number of representative texts (texts from Russkoye Ustye are currently being processed). The ultimate goal of this work is to present the linguistic situation within the communities as a whole, with the necessary degree of generalization, and to enable comparative analysis of various contact areas through a shared set of descriptors.

Acoustic citation. Each record will be associated with audio files that illustrate the feature described. The number and size of the associated files will vary with the illustrated feature (for example, prosodic phenomena require a broader context than segmental phonetics). Files extracted from annotated corpora will carry an exact reference to the archive item from which they are taken, including CD number, track number and time position. Each record will also refer to similar features noted in annotated texts, for users who would like to work with both the database and the archive.
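One possible way to represent such a citation is sketched below in Python. The classes AcousticCitation and FeatureRecord, and the printed reference format, are invented for illustration; the text only states that a reference consists of a CD number, a track number and a time position.

```python
from dataclasses import dataclass, field


def _mmss(seconds):
    """Format a number of seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


@dataclass(frozen=True)
class AcousticCitation:
    """Pointer from a database record back into the archive (hypothetical)."""
    cd_number: int
    track_number: int
    start_s: float   # start of the extract within the track, in seconds
    end_s: float     # end of the extract within the track, in seconds

    def __post_init__(self):
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")

    def label(self):
        return (f"CD {self.cd_number}, track {self.track_number}, "
                f"{_mmss(self.start_s)}-{_mmss(self.end_s)}")


@dataclass
class FeatureRecord:
    """A database record for one feature, with its illustrating extracts."""
    descriptor: str                                  # placeholder feature name
    citations: list = field(default_factory=list)

    def references(self):
        return [c.label() for c in self.citations]
```

A record can hold as many citations as the feature needs, so a prosodic feature could point to longer extracts than a segmental one without any change to the structure.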

Accessibility. The database and the electronic archive will rely as far as possible on standard software and hence will be widely available to the academic and other interested communities. Standard data formats and generally accepted Web technology should provide access that is, in principle, independent of the platform.

Users will be able to download the audio files presented online for further analysis. Moreover, the proposed cataloguing system gives easy access to audio data not presented online: tracks from audio texts can be sent as attached files over the Internet.

Thus the database will, on the one hand, become a hierarchically organized description of restricted contact areas and, on the other, provide access to the electronic archive (annotated audio texts) representing the speech of the four target communities.

Web-master Georgy Krasovitsky.
Last updated: 18-04-04.
