Lingvističke aktuelnosti

Filed under: Projects, Issue 6

 

THE CORPUS OF SERBIAN LANGUAGE

HISTORY

Work on the Corpus of Serbian Language (CSL) started in 1957 at the Institute for Experimental Phonetics and Speech Pathology in Belgrade. The CSL project was initiated and led by Prof. Đorđe Kostić as part of a broader project, also led by him, whose initial goal was automatic speech and text recognition and machine translation. Work on the CSL continued until 1962, when it was suspended. About 400 collaborators (80 experts in linguistics and related fields, together with more than 300 technical staff) participated in the CSL project. Given the level of technology in the 1950s, all work on the CSL was done manually. The first step was to create a system of annotation consisting of approximately 2,000 distinct codes to capture all inflected forms in the Serbian language. Once the system was established, each word in the sample was manually tagged for its grammatical status, and multiple frequency dictionaries were compiled (more than 27,000 pages).

In 1996, through the joint efforts of the Institute for Experimental Phonetics and Speech Pathology and the Laboratory for Experimental Psychology, University of Belgrade, most of the material was converted into electronic format and a partial update of the system of grammatical tagging was initiated. As of 2001, this phase of the project is still in progress.

 

HOW IT WAS MADE

Due to technological limitations in the late 1950s, most of the work on the CSL was done manually. The final goal was to compile a number of frequency dictionaries that would serve as a basis for automatic speech and text recognition and machine translation. Compiling a frequency dictionary involved 27 distinct operations. Here we outline the most important ones.

a. Within each book included in a sample, the lines on each page were numbered.

b. An A4 sheet of paper was divided into 16 frames, and a word from the book was transcribed into each frame. For each word, its page and line numbers in the original book were recorded as well.

c. Once the whole text had been transcribed and the lines and pages recorded, each word was specified for its grammatical status at the level of inflected morphology.

d. The grammatical tagging was subsequently checked by a group of linguistic experts who randomly sampled about 10% of each text. If the error rate exceeded 2%, the grammatical tagging was repeated until the required criterion was met. Sometimes the procedure had to be carried out 3 or 4 times to reach the required standard of reliability (i.e. an error rate of no more than 2%).

e. Once grammatical tagging was complete, the A4 sheet was cut into 16 frames, one for each word. The word frames were then sorted into alphabetical order.

f. The different grammatical forms (i.e. frames) of each word were sorted according to a specified order (e.g. for the word HOUSE, all nominatives singular were put together, then all genitives singular etc.).

g. The reliability of the sorting by grammatical code was checked.

h. For each word, the frequency of each grammatical form was counted, as was the total number of occurrences of the word entry (e.g. HOUSE appeared 15 times in the nominative singular, 5 times in the genitive singular etc., while the word HOUSE irrespective of its grammatical form, i.e. the lemma, appeared 75 times).

i. The counting of grammatical forms and word entries was also checked for reliability.

j. The resulting frequency counts were transcribed onto a specialized form and then typed on an A4 sheet. In its final form each frequency dictionary had two distinct versions: one with word entries and the grammatical forms of each word sorted in alphabetical order, and the other with word entries sorted by frequency rank.

k. By the time the project was suspended, more than 27,000 pages of various frequency dictionaries had been compiled.
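The counting and sorting in steps h and j can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of the original manual procedure: the tokens, the tag names (`nom.sg`, `gen.sg`, `acc.sg`) and the function names are hypothetical and are not the CSL's actual codes.

```python
from collections import Counter, defaultdict

# Hypothetical tagged tokens as (lemma, grammatical form) pairs.
# The tag names are illustrative only, not the CSL's real codes.
tagged_tokens = [
    ("kuća", "nom.sg"),
    ("kuća", "nom.sg"),
    ("kuća", "gen.sg"),
    ("škola", "acc.sg"),
    ("kuća", "nom.sg"),
    ("škola", "nom.sg"),
]


def build_frequency_dictionary(tokens):
    """Count occurrences per grammatical form and per lemma (step h)."""
    form_counts = defaultdict(Counter)
    for lemma, form in tokens:
        form_counts[lemma][form] += 1
    lemma_counts = {
        lemma: sum(forms.values()) for lemma, forms in form_counts.items()
    }
    return form_counts, lemma_counts


def alphabetical_version(lemma_counts):
    """Entries sorted alphabetically (step j, first version).

    Plain code-point order is used here; proper Serbian collation
    would need a locale-aware sort key.
    """
    return sorted(lemma_counts.items(), key=lambda item: item[0])


def rank_version(lemma_counts):
    """Entries sorted by descending frequency (step j, second version)."""
    return sorted(lemma_counts.items(), key=lambda item: (-item[1], item[0]))


form_counts, lemma_counts = build_frequency_dictionary(tagged_tokens)
```

With the toy data above, `form_counts["kuća"]` holds the per-form counts for one lemma, while `rank_version(lemma_counts)` lists the lemmas from most to least frequent.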

 

SAMPLE

The Corpus of Serbian Language is based on a sample of about 11,000,000 words and is chronologically divided into five samples. The first four samples (about 4,000,000 words) draw from the Serbian language from the 12th to 20th centuries. The fifth sample includes contemporary language (about 7,000,000 words).

a. 12th to 18th century: The sample includes writings of the most prominent authors from that period (e.g. Domentijan, Teodosije, Archbishop Danilo etc.) as well as the old Serbian charters and letters. The text is given in its original form and orthography (i.e. Serbian-Slavonic).

b. 18th and early 19th century: The sample includes writings of Milovan Vidaković, Gerasim Zelić, Joakim Vujić and other authors from that period. The text is given in its original form and orthography.

c. Complete works of Vuk St. Karadžić: The sample is divided into several subsamples (Serbian national poems and stories, translation of the New Testament, the Serbian Dictionary, as well as Karadžić’s correspondence, linguistic, ethnological and historical writings).

d. Second half of the 19th century: The sample includes complete works of Branko Radičević, Marko Miljanov, Petar Petrović-Njegoš, Jovan Jovanović-Zmaj and Đura Jakšić.

e. Contemporary language: The sample is divided into six parts: a. novels and essays (140 books), b. poetry (204 books), c. daily press (Politika), d. scientific literature (135 books), e. political prose and f. texts of Belgrade surrealists.

 

SAMPLING CRITERIA

There were two principal sampling criteria in building the Corpus of Serbian Language. The first criterion was that the corpus should include all relevant periods in the development of the Serbian language and encompass all relevant genres of the Serbian written language. The second criterion concerned the overall size of the corpus and the size of its subsamples. Inspection of the documentation suggests that sampling constituted an important part of the project and was approached with the utmost care and consideration. The fact that several studies on sample size and sample reliability (i.e. corpus size and its reliability) were written by the most prominent statisticians of that time (B. Ivanović and B. Bajšanski) indicates that the sample segments and their sizes were not chosen at random. Thus far, these original studies have not been found, although we know their titles. Likewise, inspection of the authors and books that constitute the subsamples of the Serbian language from the 12th to the 20th century suggests clear sampling criteria, which are elaborated in more detail in the following paragraphs.

A. Criteria for determining the size of the sample and the subsamples

1. General considerations: The minimal (or optimal) size of a corpus that will assure its reliability is an empirical rather than an intuitive matter. It could be argued that reliability with respect to corpus size depends heavily on the aspects of language being investigated. However, to our knowledge there are no systematic statistical studies that suggest an optimal corpus size for a particular aspect of language. As a consequence, there are no clear empirical criteria for the size required to assure corpus reliability. Our intention is to carry out a systematic statistical investigation of the CSL in the near future and to establish quantitative norms for the stability of probabilities for different aspects of language as a function of corpus size.

2. Why the CSL has 11,000,000 words: At this point we do not know why the Corpus is the size it is. What we do know is that the size of the corpus and its subsamples was not determined arbitrarily and was a matter of serious study for the two most prominent statisticians in Yugoslavia in the mid-1950s. The size of each subsample for the period up to the 20th century ranges from half a million to more than one and a half million items. Thus, for example, each of the subsamples of the old Serbian literature (12th–17th and 18th century) has approximately half a million words. The size of the subsample of the complete works of Vuk St. Karadžić was determined by the amount of published material (about 1,700,000 words), while the subsample covering the second half of the 19th century contains about 1,300,000 words. The contemporary language sample contains about 7,000,000 words. Interestingly, its subsamples are of approximately the same size, about 1,400,000 words each. As noted, at this point it is not clear which criterion was used to determine the subsample size, although this may be clarified when the studies concerning the sample size are found or when we carry out statistical research on corpus reliability.
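The figures quoted above can be checked with a short calculation. The sketch below only restates the approximate numbers given in the text; it assumes that the figure of about 1,400,000 words refers to the contemporary language being split into the five subsamples named in the sampling criteria (prose, daily press, scientific literature, poetry and political texts). The variable names are illustrative.

```python
# Approximate subsample sizes in words, as quoted in the text.
historical = {
    "12th-17th century": 500_000,
    "18th century": 500_000,
    "complete works of Vuk St. Karadžić": 1_700_000,
    "second half of the 19th century": 1_300_000,
}
contemporary_total = 7_000_000

# Assumption: five contemporary subsamples (prose, daily press,
# scientific literature, poetry, political texts).
contemporary_subsamples = 5

historical_total = sum(historical.values())
corpus_total = historical_total + contemporary_total
mean_contemporary = contemporary_total // contemporary_subsamples

print(f"historical samples:      {historical_total:,}")
print(f"whole corpus:            {corpus_total:,}")
print(f"mean contemporary part:  {mean_contemporary:,}")
```

Under these assumptions the four historical subsamples add up to about 4,000,000 words, the whole corpus to about 11,000,000, and an even five-way split of the contemporary sample gives about 1,400,000 words per subsample, which agrees with the totals stated elsewhere in the text.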

B. Criteria for the choice of periods and authors

1. Criteria for diachronic sampling: Given that the corpus is diachronic, two considerations are relevant: a) which historical periods should be included, and b) which segments (genres) should be considered representative of the contemporary Serbian language. Scholars of old Serbian literature agree that there are three distinct periods in the development of the Serbian written language: a) the period from the 12th century to the end of the 17th century, which is characterized by the Serbian-Slavonic language; b) the period from the 18th century to the first part of the 19th century, when radical reforms were introduced by Vuk St. Karadžić; and c) the second half of the 19th century, when Karadžić’s reforms prevailed and linguistic standards, in both written and spoken language, became generally accepted. The part of the Corpus that covers the Serbian language up to the 20th century is divided into four distinct subsamples. The first subsample covers the period between the 12th and 18th centuries and includes two distinct types of material: a) the lives of Serbian saints, which constitute a distinct genre written according to specified rules and may in this respect be considered typical literary texts of that period, and b) old Serbian charters and letters, which are closer to everyday language. By including these two types of material in the sample, both literary and popular (national, i.e. spoken by ordinary people) language are represented, thus covering all relevant forms of the Serbian language between the 12th and 18th centuries.

The second subsample covers the language from the end of the 17th century to the reforms introduced by Vuk St. Karadžić. This period is characterized by a dramatic absence of linguistic and orthographic standards and by various influences that were not treated systematically. As a consequence, authors from that period used somewhat idiosyncratic orthography, vocabulary and grammar. The included authors represent all forms of this variation in the usage of the Serbian language, making the subsample as a whole representative of the period.

A distinct part of the sample of the Serbian language up to the 20th century is the complete works of Vuk St. Karadžić. There are several reasons why Karadžić has been included in full. The first and most important reason is that Karadžić introduced radical reforms in both Serbian orthography and linguistic standards. His work is a turning point in the development of the Serbian written and spoken language. However, Karadžić was not only a reformer of the Serbian language. He also collected Serbian national poems, proverbs and stories, translated the New Testament into Serbian, compiled the first Serbian language dictionary, wrote the first primer and the first Serbian language grammar, wrote a number of linguistic, ethnological, geographical and historical studies, and corresponded extensively with the most prominent people in Europe of that time. Thus, Karadžić’s complete works encompass various aspects of the Serbian language, ranging from Serbian national poetry and proverbs to his personal correspondence. This allows for a number of comparisons between the different historical segments of the Serbian national language, on the one hand, and the language of Vuk St. Karadžić himself, on the other. Likewise, this subsample allows the changes that followed Karadžić’s reforms to be traced in detail.

The fourth subsample covers the language of the second half of the 19th century and includes authors who adopted Karadžić’s reforms. This subsample includes the complete works of Branko Radičević, Marko Miljanov, Đura Jakšić, Petar Petrović-Njegoš and Jovan Jovanović-Zmaj, and one text by Laza Kostić. These six authors are not only among the most prominent figures in Serbian literature; they also cover all genres of 19th century Serbian literature. Thus, for example, Branko Radičević was one of the first poets to adopt Karadžić’s reforms, while the writings of Marko Miljanov resemble the spoken language of the end of the 19th century. The complete works of Njegoš represent a specific subsample because, in addition to “Gorski Vijenac” and “Luča Mikrokozma”, two of the most prominent works written in the Serbian language, they include his personal correspondence. Đura Jakšić made significant contributions in different literary genres, thus allowing these genres to be compared within a single author. This is to some extent also true of Jovan Jovanović-Zmaj. The writings of Laza Kostić included in the corpus are representative of the literary criticism of that period.

2. Sampling criteria for the contemporary language: The contemporary Serbian language is represented by five distinct subsamples (prose, daily press, scientific literature, poetry, and political texts), each of them a distinct genre of written language. It is hard to think of an additional genre of written language that would enhance the representativeness of the sample. All items included in the corpus (with few exceptions) were written between 1945 and 1957. It could be argued that this makes the material unrepresentative of the contemporary language. This issue is discussed below. The subsample of prose includes novels, essays, literary criticism and polemics. The daily press subsample consists of Belgrade’s daily newspaper “Politika”, which was considered to be a broadsheet of the highest language standards. This subsample was divided into three distinct periods that allow for the statistical investigation of stable and variable aspects of language across a somewhat restricted time span of 12 years. The scientific literature covers a number of scientific disciplines, giving insight into the idiosyncrasies of language use across different scientific fields, on the one hand, and into the contrast between the language of scientific literature and that of other domains of written language, on the other. The motivation for including poetical works is that poetic vocabulary is often richer than that encountered in other genres. Finally, political texts were included because it was believed that they have distinct properties that are uncommon in other genres. It should be emphasized that the criterion for including a particular item in the corpus was not its literary or scientific value.

3. Is the CSL outdated? It could be argued that the sample of the contemporary language is not reliable because the selected items are almost half a century old. Again, this claim requires empirical evaluation. At this point we do not know what changes have taken place over the last few decades. Moreover, the subsamples of the contemporary language differ in how susceptible they are to change over time. Intuitively, we can say that the language of prose and poetry has not changed much. In contrast, the language of the daily press seems to be subject to greater change, in particular the personal names, places and terms specific to a particular period. Likewise, the vocabulary of scientific literature and political texts has also changed over time. However, the claim that the CSL represents an outdated sample of the contemporary language relies on pure intuition; it is a matter of statistical evaluation to find out what percentage of the vocabulary is stable and how this stability depends on the type of material.
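One very simple way to put a number on vocabulary stability is to measure what share of the distinct words in an older sample also occurs in a more recent one, separately for each type of material. The sketch below is only an illustration under that assumption; it is not a method described in the CSL documentation, and the function name and toy word lists are hypothetical.

```python
def stable_vocabulary_share(old_tokens, new_tokens):
    """Return the fraction of distinct old word types found in the new sample."""
    old_vocab = set(old_tokens)
    new_vocab = set(new_tokens)
    if not old_vocab:
        raise ValueError("the older sample contains no tokens")
    return len(old_vocab & new_vocab) / len(old_vocab)


# Hypothetical toy samples; real input would be lemmatized corpus text.
old_press = ["kuća", "škola", "zadruga", "kuća"]
new_press = ["kuća", "škola", "internet"]

share = stable_vocabulary_share(old_press, new_press)
print(f"stable share of the older vocabulary: {share:.0%}")
```

Running the same function on prose, poetry, daily press, scientific and political texts would give one figure per genre, which could then be compared. A fuller evaluation would also weight words by frequency and control for sample size, since larger samples naturally share more vocabulary.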

 

STRUCTURE OF THE MATERIAL

The Corpus of Serbian Language (CSL) consists of:

a. Annotated text: Each word in the Corpus is tagged for its grammatical status, number of phonemes and phonological structure.

b. Frequency dictionaries: For each sample, a series of frequency dictionaries has been compiled, or is being compiled, at all relevant levels, from the level of a single book to the level of a whole sample (e.g. contemporary language). The frequency dictionaries contain the probabilities of each word entry and of the grammatical forms of a given entry, as well as the number of graphemes, the number of syllables and the phonological structure of each word.

c. Probability matrices: The CSL will contain probability matrices for all grammatical forms of the Serbian language, as well as for phonemes and phonemic co-occurrences and for syllables and syllabic co-occurrences. Matrices will be given at all levels of potential analysis, from the level of a book to the level of a sample.

The material will be offered in a format that is easy to transfer into any standard statistical package. At present the following is available: grammatically annotated text for most samples, frequency dictionaries of the contemporary Serbian language compiled from the daily press and poetry (2,000,000 words, 65,000 lemmata and 240,000 grammatical forms), and more than 200 individual frequency dictionaries of poetical works.
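To illustrate what a co-occurrence probability matrix might look like, the sketch below estimates, for each unit, the probability of the unit that follows it within a word. It is a minimal example under stated assumptions and not the CSL's actual procedure: each letter is treated as one phoneme (Serbian orthography is largely phonemic, but Latin-script digraphs such as lj, nj and dž would need to be merged first), and the word list and function name are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_probabilities(words):
    """Estimate P(next unit | current unit) from adjacent pairs within words."""
    pair_counts = defaultdict(Counter)
    for word in words:
        # Assumption: one character corresponds to one phoneme.
        for first, second in zip(word, word[1:]):
            pair_counts[first][second] += 1

    matrix = {}
    for first, followers in pair_counts.items():
        total = sum(followers.values())
        matrix[first] = {
            second: count / total for second, count in followers.items()
        }
    return matrix


# Hypothetical toy word list; real input would be the annotated corpus.
matrix = cooccurrence_probabilities(["kuća", "kula", "škola"])
print(matrix["k"])
```

Each row of the resulting nested dictionary sums to 1. The same function could be applied to sequences of syllables or of grammatical codes instead of letters, and computed separately for a book, a subsample or a whole sample.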

 

SYSTEM OF ANNOTATION

Each word in the CSL is manually tagged for its grammatical status. The system of tagging includes about 2,000 distinct grammatical (inflected) forms. Each word in the CSL has a defined entry, number of graphemes and syllables, and phonological structure. The beginning and the end of each sentence and paragraph are also tagged. Prof. Đorđe Kostić and a group of linguists established the system of coding in the mid-1950s. In 1999 the system was updated, and the changes will be applied in the near future.


Example of the annotated text (the old version)

 

* Petar je otišao u školu. – Peter went to school.
* A – word entry, B – the original text, C – grammatical code, D – numeric grammatical code, E – number of graphemes, F – number of syllables, G – phonological structure.
* i – noun, gl – verb, pre – preposition, nom – nominative, a – accusative, prez – present tense, perf – past tense, s – as part of, 3l – third person, J – singular, mu – masculine, ž – feminine, rp – past participle

 

TECHNICAL DETAILS

The majority of the material (the original annotated text and frequency dictionaries) has been transferred into electronic format (a relational database in MS Visual FoxPro). Currently, the material is structured as a table, with each word and its codes in a single row. All of the material is also available in ASCII format and can be transferred into any standard spreadsheet or statistical package. In its final version the CSL will be in XML format. Conversion to XML is underway and is expected to be completed by the end of 2001.

In addition to the main corpus, the final version of the CSL will contain a series of probability matrices that will capture all aspects of the Serbian language, ranging from the level of phonology and syllabic structure to the level of inflected morphology.
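As a rough illustration of the one-word-per-row layout and of how such rows could be exported, the sketch below writes two rows to delimited text and to XML using Python's standard library. Everything specific in it is an assumption made for the example: the column names are loosely based on the fields A to G of the annotation legend, the grammatical codes are combined from the legend's abbreviations, the numeric code `0000` is a placeholder, the phonological structure is shown as a consonant/vowel pattern, and the XML element names are hypothetical rather than the CSL's actual schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names, loosely following fields A-G of the legend.
FIELDS = ["entry", "text", "gram_code", "numeric_code",
          "graphemes", "syllables", "phon_structure"]

# Illustrative rows only: the codes and the "0000" placeholder are assumptions.
ROWS = [
    {"entry": "Petar", "text": "Petar", "gram_code": "i nom J mu",
     "numeric_code": "0000", "graphemes": 5, "syllables": 2,
     "phon_structure": "CVCVC"},
    {"entry": "škola", "text": "školu", "gram_code": "i a J ž",
     "numeric_code": "0000", "graphemes": 5, "syllables": 2,
     "phon_structure": "CCVCV"},
]


def rows_to_csv(rows):
    """Serialize rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_xml(rows):
    """Serialize rows as XML, one <word> element per row (hypothetical schema)."""
    root = ET.Element("corpus")
    for row in rows:
        word = ET.SubElement(root, "word")
        for name in FIELDS:
            ET.SubElement(word, name).text = str(row[name])
    return ET.tostring(root, encoding="unicode")


csv_text = rows_to_csv(ROWS)
xml_text = rows_to_xml(ROWS)
print(csv_text)
print(xml_text)
```

The delimited output can be opened directly in a spreadsheet or statistical package, while the XML form keeps the same fields as child elements of each word.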

For more information about this project see www.serbian-corpus.edu.yu.

 
