Representativeness; Practical applications in language corpora; Concordances; Frequency lists; Reverse lists; Content analysis.
HW # 5 (due end of Week 8): Create two corpora from two different newspaper sections (e.g., Sports and World) from a newspaper of your choice. Download at least 500K of text from each section. Create a concordance and frequency list for each corpus. Analyze the data and compose a report stating your findings. Attach the top 50 ranks from the frequency list to your report and e-mail it to me.
Corpus = any collection of texts used in research and teaching
Take a look at this course on corpus linguistics. The corpus linguistics resources by Michael Barlow may also be of interest.
When making inferences about frequency counts in your corpus, you may want to compare them to the frequencies in the general lexicon. If your corpus is Russian, here is a solution for you: Russian frequency dictionary. Institut fuer Deutsche Sprache offers a list of the 30K most frequent German words. Here is a Serbian corpus, and here is a Croatian one. Finally, the corpora from the land of Tuborg and Andersen.
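Raw counts from corpora of different sizes are not directly comparable, so a common first step is to normalize them to a rate per million tokens before comparing with a reference frequency list. The following is a minimal sketch of that arithmetic; the function names and all numbers are made up for illustration and do not come from any of the dictionaries mentioned above.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def frequency_ratio(count, corpus_size, reference_per_million):
    """How many times more (or less) frequent a word is in your corpus
    than in a general-language reference list."""
    return per_million(count, corpus_size) / reference_per_million


# Hypothetical figures: a word occurs 120 times in an 80,000-token corpus
# and 300 times per million tokens in the reference list.
rate = per_million(120, 80_000)
ratio = frequency_ratio(120, 80_000, 300)
print(rate, ratio)
```

A ratio well above 1 suggests the word is characteristic of your corpus; a ratio near 1 suggests it is simply common in the language as a whole.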
Types of corpora:
Sample vs. population
Random sampling
Structured sampling (sampling frame)
Types of annotations
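The difference between the two sampling approaches can be sketched in a few lines. This example assumes that "structured sampling" means drawing a fixed number of texts from each category of a sampling frame (often called stratified sampling); the function names and the toy frame are hypothetical.

```python
import random


def random_sample(texts, k, seed=0):
    """Simple random sampling: every text has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(texts, k)


def structured_sample(frame, k_per_category, seed=0):
    """Structured sampling: draw up to k texts from each category of the frame."""
    rng = random.Random(seed)
    return {
        category: rng.sample(texts, min(k_per_category, len(texts)))
        for category, texts in frame.items()
    }


# Toy sampling frame: text identifiers grouped by newspaper section.
frame = {
    "Sports": ["s1", "s2", "s3", "s4"],
    "World": ["w1", "w2", "w3"],
}
all_texts = [t for texts in frame.values() for t in texts]

print(random_sample(all_texts, 3))
print(structured_sample(frame, 2))
```

Simple random sampling may by chance over- or under-represent a section; sampling within each category of the frame guarantees that every section is present in the corpus.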
Uses of corpora:
Qualitative vs. quantitative
Concordances
Frequency Lists
Collocations
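A frequency list and a simple collocate count can both be built with Python's standard library. This is only a sketch: the tokenizer is deliberately crude (lowercased runs of word characters), the window size is an arbitrary choice, and the function names are illustrative rather than taken from any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, top=50):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)


def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


tokens = tokenize("The cat sat on the mat and the dog sat on the cat")
for rank, (word, count) in enumerate(frequency_list(tokens, top=3), start=1):
    print(rank, word, count)
print(collocates(tokens, "sat", window=1))
```

Raw co-occurrence counts like these favor very frequent words; serious collocation work usually applies an association measure on top of them.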
KWIC (key word in context): the most common type of concordance
LHS (left-hand side)|keyword|RHS (right-hand side)|location
Lemmatized vs. non-lemmatized concordances
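The KWIC layout above can be sketched as follows. Several things here are assumptions for illustration: the location is taken to be the token index, the context width is counted in words, and the lemma table is a tiny hand-made dictionary rather than a real lemmatizer. Without a lemma table the search matches the exact word form only (non-lemmatized); with one, all forms mapped to the same lemma are matched (lemmatized).

```python
import re


def kwic(text, keyword, width=3, lemmas=None):
    """Return concordance lines in the form LHS|keyword|RHS|location."""
    tokens = re.findall(r"\w+", text.lower())
    lemmas = lemmas or {}  # empty table means a non-lemmatized search
    target = lemmas.get(keyword.lower(), keyword.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if lemmas.get(tok, tok) == target:
            lhs = " ".join(tokens[max(0, i - width):i])
            rhs = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{lhs}|{tok}|{rhs}|{i}")
    return lines


text = "She runs daily. He ran home. They run fast."
toy_lemmas = {"runs": "run", "ran": "run"}  # hypothetical lemma table

print(kwic(text, "run", width=2))                     # exact form only
print(kwic(text, "run", width=2, lemmas=toy_lemmas))  # all forms of "run"
```

The non-lemmatized search finds one hit, while the lemmatized search also returns the lines for "runs" and "ran".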
See this nice collection of corpus linguistic resources and tools. Click here to pick up CLarkSystem, an XML-based system for corpus development. You need to install the Java Runtime Environment to use this program.