Storage and documentation of the data
Since 1997, over 130 hours of DAT recordings have been collected from members of
the four target communities. They include narratives of wide stylistic variation
(monologues, dialogues, interviews and the like). Given the diversity and amount
of data that can be drawn from the recordings (not only for linguistics, but
also for ethnographic, historical and social studies), the critical question is
the storage and accessibility of the whole content. The plan is to use a digital
technology for archiving acoustic data developed in the Linguistic Laboratory
of the Ruhr University. The core of the technology is the
processing of DAT recordings, which allows long recordings to be stored as a
number of short segments (tracks) on an audio CD. Each track lasts about 1–2
minutes and contains a relatively self-contained stretch of text. A CD of this
type allows extracts of any length to be played back without pauses between
tracks, and it gives quick and easy reference to a given position in the text
and in the corpus.
The latter is provided by a detailed catalogue in which each track has its own
index and is supplied with pertinent data, including the date and place of the
recording, information about the speaker (e.g. age, education, origin), type of
narration (dialogue, monologue, etc.), topic(s), linguistic remarks,
surrounding circumstances and the technical conditions of the recording. The
standard audio format, compatible with the vast majority of computers and CD
players, makes the archive accessible to every interested user. The apparent
advantage of this technique is maximum exposure of the data, since each extract
included in a database, used for research or as an illustration will have an
exact reference to an audio CD number, track number and time position within
the track.
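The cataloguing scheme described above can be pictured as one record per track
plus a small lookup routine. The following is only a minimal sketch in Python,
not the project's actual software: the names TrackEntry, Speaker, locate and
format_reference, and the exact field layout, are assumptions made for
illustration. It shows how a position in a long recording could be resolved to
a track and a time position, and how the resulting reference could be written
out.

```python
from dataclasses import dataclass, field


@dataclass
class Speaker:
    # Speaker details named in the text; stored as free text in this sketch.
    age: str = ""
    education: str = ""
    origin: str = ""


@dataclass
class TrackEntry:
    """One catalogue entry for a single track on an audio CD (hypothetical layout)."""
    cd_number: int
    track_number: int
    duration_s: float              # tracks last roughly 60 to 120 seconds
    recording_date: str = ""
    recording_place: str = ""
    speaker: Speaker = field(default_factory=Speaker)
    narration_type: str = ""       # e.g. "dialogue", "monologue"
    topics: list = field(default_factory=list)
    linguistic_remarks: str = ""
    circumstances: str = ""
    technical_conditions: str = ""


def locate(tracks, offset_s):
    """Map an offset within one continuous recording to (entry, position in track)."""
    elapsed = 0.0
    for entry in tracks:
        if offset_s < elapsed + entry.duration_s:
            return entry, offset_s - elapsed
        elapsed += entry.duration_s
    raise ValueError("offset lies beyond the end of the recording")


def format_reference(entry, position_s):
    """Write a reference as CD number, track number and time position (mm:ss)."""
    if not 0 <= position_s <= entry.duration_s:
        raise ValueError("position lies outside the track")
    minutes, seconds = divmod(int(position_s), 60)
    return f"CD {entry.cd_number}, track {entry.track_number}, {minutes:02d}:{seconds:02d}"
```

For example, with two consecutive tracks of 90 and 100 seconds on CD 1, an
offset of 120 seconds falls 30 seconds into the second track, and
format_reference returns "CD 1, track 2, 00:30".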
The retrieval system: markup
The markup process aims (1) to create an efficient mechanism for managing and
using the archive, which should make it widely available to the academic and
other interested communities, and (2) to lay the foundation for the online
acoustic database of language contacts and contact-induced changes in the
speech of outpost Slavic settlements. Hence the corpora will be supplied with
two types of annotation. The first provides general orientation and involves
extra-linguistic and linguistic marks, such as information about the person(s)
from whom the text was recorded (age, origin, education, family, etc.), type of
narration (monologue or dialogue), type of discourse, pragmatic features, and
subject (with reference to each track). Some of the features made explicit by
this markup are also important for the linguistic study of contact phenomena.
The second type of annotation is problem-oriented and will reflect
contact-induced changes in a particular system or subsystem.
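As a rough illustration of how the two types of annotation might sit side by
side, the sketch below attaches a general layer and a list of problem-oriented
tags to each track and selects tracks by either layer. It is an assumption-laden
sketch in Python: the class names, field names and the subsystem labels used in
the comments are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field


@dataclass
class GeneralAnnotation:
    # First type: general orientation data attached to each track.
    narration_type: str                  # "monologue" or "dialogue"
    discourse_type: str = ""
    subject: str = ""
    pragmatic_features: list = field(default_factory=list)


@dataclass
class ContactTag:
    # Second type: one contact-induced change observed in a track.
    subsystem: str                       # illustrative labels, e.g. "phonetics", "syntax"
    feature: str                         # free-text description of the change
    position_s: float                    # time position within the track


@dataclass
class AnnotatedTrack:
    cd_number: int
    track_number: int
    general: GeneralAnnotation
    contact_tags: list = field(default_factory=list)


def find_tracks(tracks, narration_type=None, subsystem=None):
    """Select tracks by a general property, a problem-oriented property, or both."""
    result = []
    for track in tracks:
        if narration_type and track.general.narration_type != narration_type:
            continue
        if subsystem and not any(tag.subsystem == subsystem for tag in track.contact_tags):
            continue
        result.append(track)
    return result
```

Keeping the two layers separate means that a user interested only in, say,
dialogues can query the general layer without touching the problem-oriented
tags, while a contact linguist can combine both.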
The acoustic database
Profile.
The acoustic database (AD) is a problem-oriented database focused on the
linguistic results of long-term interethnic contacts within outpost communities.
Within the framework of the project the AD is expected to present data from
four ethnic sub-groups of mixed Slavic-indigenous origin; in future it may be
extended with data from other isolated communities.
Contents.
The goal of the AD is to present the most typical linguistic features of a
given contact area (such as an outpost community). Hence the focus of the
database will be the contact-induced changes observed in the speech of the given
community. To provide a systematic approach to a contact situation, a common net
of classifiers is currently being worked out. This net will cover the major
linguistic and relevant extra-linguistic categories. The linguistic tagging is
being implemented on a number of representative texts (texts from Russkoye
Ustye are currently being processed). The ultimate goal of this work is to
present the linguistic situation within the communities as a whole, with the
necessary degree of generalization, and to allow comparative analysis of
different contact areas through a shared set of descriptors.
Acoustic citation.
Each record will be associated with audio files that illustrate the feature
described. The number and size of the associated files will vary depending on
the feature illustrated (for example, prosodic phenomena require a broader
context than segmental phonetics). Each file extracted from the annotated
corpora will carry an exact reference to the archive item from which it is
taken, including CD number, track number and time position. The record will
also refer to similar features noted in the annotated texts, for users who
would like to work with both the database and the archive.
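The dependence of extract size on the kind of feature can be sketched as
follows. This Python fragment is illustrative only: the context margins in
CONTEXT_S are placeholder figures, not values given by the project, and the
names AudioCitation and cite are hypothetical. For simplicity the extract is
clipped to a single track, although the archive itself allows playback across
track boundaries.

```python
from dataclasses import dataclass

# Seconds of context kept on each side of the cited position.
# Placeholder figures chosen for illustration, not project values.
CONTEXT_S = {"segmental": 2.0, "prosodic": 10.0}


@dataclass
class AudioCitation:
    cd_number: int
    track_number: int
    start_s: float
    end_s: float


def cite(cd_number, track_number, position_s, track_duration_s, kind):
    """Build a citation whose window depends on the kind of feature."""
    margin = CONTEXT_S[kind]
    start = max(0.0, position_s - margin)               # do not start before the track
    end = min(track_duration_s, position_s + margin)    # do not run past its end
    return AudioCitation(cd_number, track_number, start, end)
```

With these placeholder margins, a prosodic feature at 4 seconds into a
90-second track yields an extract from 0 to 14 seconds, whereas a segmental
feature at the same position yields an extract from 2 to 6 seconds.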
Accessibility.
The database and the electronic archive will rely as far as possible on
standard software and hence will be widely available to the academic and other
interested communities. Standard data formats and generally accepted Web
technology should provide access that is, in principle, independent of the
platform.
Users will be able to download the audio files presented online for further
analysis. Moreover, the proposed cataloguing system gives easy access to audio
data not presented online: tracks from audio texts may be received as attached
files via the Internet.
The database will thus, on the one hand, become a hierarchically organized
description of restricted contact areas and, on the other, provide access to
the electronic archive (annotated audio texts) representing the speech of the
four target communities.