Concordances and Frequency Lists


Summary

Representativeness; Practical applications in language corpora; Concordances; Frequency lists; Reverse lists; Content analysis.


Homework

HW # 5 (due end of Week 8): Create two corpora from two different newspaper sections (e.g., Sports and World) from a newspaper of your choice. Download at least 500K of text from each section. Create a concordance and frequency list for each corpus. Analyze the data and compose a report stating your findings. Attach the top 50 ranks from the frequency list to your report and e-mail it to me.


Corpora

Corpus = any collection of texts used in research and teaching

Take a look at this course on corpus linguistics. The corpus linguistics resources by Michael Barlow may also be of interest.

When making inferences about frequency counts in your corpus, you may want to compare them with the frequencies in the general lexicon. If your corpus is Russian, here is a solution for you: Russian frequency dictionary. The Institut fuer Deutsche Sprache offers a list of the 30K most frequent German words. Here is a Serbian corpus, and here is a Croatian one. Finally, the corpora from the land of Tuborg and Andersen.
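To compare a count in your corpus with a general frequency list, the two counts first have to be put on the same scale, because the corpora differ in size. A minimal sketch, assuming you know the total number of tokens in both your corpus and the reference corpus (the function names and all the numbers below are made up for illustration):

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def frequency_ratio(count, corpus_size, ref_count, ref_size):
    """Relative frequency in your corpus divided by that in the reference."""
    return per_million(count, corpus_size) / per_million(ref_count, ref_size)


# Hypothetical figures: a word occurs 120 times in a 90,000-token corpus
# and 4,000 times in a 10,000,000-token reference corpus.
ratio = frequency_ratio(120, 90_000, 4_000, 10_000_000)
print(f"{ratio:.2f} times as frequent as in the reference corpus")
```

A ratio well above 1 suggests the word is characteristic of your corpus; a ratio near 1 suggests it is simply a common word of the language.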

Types of corpora:

Sample vs. population
Random sampling
Structured sampling (sampling frame)
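The difference between the two sampling methods can be sketched in a few lines. This is only an illustration: it assumes that "structured sampling" means drawing a fixed number of texts from each category of a sampling frame, and the frame, the file names, and both function names are hypothetical.

```python
import random


def random_sample(texts, k, seed=0):
    """Draw k texts from the whole population, ignoring categories."""
    rng = random.Random(seed)
    return rng.sample(texts, k)


def structured_sample(frame, k_per_category, seed=0):
    """Draw k texts from every category listed in the sampling frame.

    frame maps a category name to the list of texts in that category.
    """
    rng = random.Random(seed)
    return {name: rng.sample(texts, k_per_category)
            for name, texts in frame.items()}


# Hypothetical sampling frame: file names grouped by newspaper section.
frame = {
    "Sports": [f"sports_{i}.txt" for i in range(100)],
    "World": [f"world_{i}.txt" for i in range(20)],
}
population = frame["Sports"] + frame["World"]

simple = random_sample(population, 10)
structured = structured_sample(frame, 5)
```

With random sampling the small World section may end up with very few texts or none; the structured sample guarantees five from each section.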

Types of annotations

Uses of corpora:


Corpus analysis

Qualitative vs. quantitative

Concordances

Frequency Lists

Collocations

KWIC (key word in context): the most common type of concordance

LHS (left-hand side)|keyword|RHS (right-hand side)|location
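A KWIC line in the layout above can be produced with very little code. This is a minimal sketch, not how any particular concordancer works: it assumes that a token is any run of letters or digits, that the context is measured in words, and that the location is the token's position in the text. The function name kwic and the window parameter are made up for illustration.

```python
import re


def kwic(text, keyword, window=4):
    """Return (LHS, keyword, RHS, location) for every hit of keyword.

    window is the number of context words kept on each side;
    location is the zero-based token index of the hit.
    """
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            lhs = " ".join(tokens[max(0, i - window):i])
            rhs = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((lhs, token, rhs, i))
    return rows


sample = "The cat sat on the mat. The dog saw the cat."
for lhs, kw, rhs, loc in kwic(sample, "cat", window=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{lhs:>20}|{kw}|{rhs:<20}|{loc}")
```

Real concordancers usually report the location as a file name and line number rather than a token index.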

Lemmatized vs. non-lemmatized concordances
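The effect of lemmatization is easy to see on counts. In the sketch below the lemma table is a tiny hand-made dictionary, purely hypothetical; a real lemmatizer needs a full morphological lexicon for the language.

```python
from collections import Counter

# Hypothetical toy lemma table: word form -> dictionary form.
LEMMAS = {"goes": "go", "went": "go", "going": "go", "gone": "go"}


def count_forms(tokens, lemmatize=False):
    """Count tokens as they appear, or grouped under their lemma."""
    if lemmatize:
        tokens = [LEMMAS.get(token, token) for token in tokens]
    return Counter(tokens)


tokens = ["go", "went", "goes", "home"]
plain = count_forms(tokens)                      # each form counted separately
lemmatized = count_forms(tokens, lemmatize=True) # all forms of "go" merged
```

In the non-lemmatized count, "go", "went", and "goes" are three separate entries; in the lemmatized count they are a single entry with frequency 3.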


Concordancers

Numerous concordancers are available on the Internet, either as shareware or as fully functional demo versions with a time limit. Take a look at the following pieces of software:
You can download one or two of these and see how they work.
Take a look at my on-line Concordancer available at the AMU Slavic Department:
Type in a few sentences and generate their concordance and frequency lists.
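If you want to see what happens behind such a tool, a ranked frequency list takes only a few lines. The sketch below is an illustration under simple assumptions: words are lowercased, a token is any run of letters or digits, and a "reverse list" is taken to mean the word list sorted alphabetically from the ends of the words. Both function names are made up.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties are broken alphabetically so the output is reproducible.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, n) for rank, (word, n) in enumerate(ranked, start=1)]


def reverse_list(words):
    """Sort distinct words by their spelling read from right to left."""
    return sorted(set(words), key=lambda word: word[::-1])


for rank, word, n in frequency_list("the cat and the dog and the bird")[:3]:
    print(rank, word, n)

print(reverse_list(["walking", "talking", "walked", "talked"]))
```

The reverse list groups words with the same ending together, which is handy for looking at suffixes and inflections.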

See this nice collection of corpus linguistics resources and tools. Click here to pick up CLarkSystem, an XML-based system for corpus development. You need to install the Java Runtime Environment to use this program.


Example

Last year's project