A collection of Chinese corpora and frequency lists, including frequencies in LCMC.
Note that tokenisation of texts into words follows the rules used in each corpus. Sometimes the results of tokenisation are not compatible, and some "words" in the frequency list of the Internet corpus can be parts of "real" Chinese words.

Chinese learners frequently ask about the frequency of individual characters (as this helps to order them in a reasonable sequence for learning). Numerous lists of common characters are available in various dictionaries (Oxford Dictionary, Wenlin or various online sources). They are often treated as absolute, although they obviously depend on the corpus (the list in the Oxford Dictionary, for example, is skewed towards newspaper texts). The Chinese Internet corpus is a snapshot of the Chinese Web from 2005. The frequency list of characters coming from it might be more general (though still not ideal). The list of characters is available from here. The first column is the rank, the second one is the frequency, which has been normalised per million characters.

The three corpora listed above are:

Chinese Internet Corpus, 280 million words (tokens). This corpus was compiled by Serge Sharoff from the Internet in February 2005 along with other Internet corpora (for English, German and Russian).

The Lancaster Corpus of Mandarin Chinese, created by Richard Xiao and Tony McEnery.

Chinese Business Corpus, 30 million words (tokens). This corpus was compiled by Serge Sharoff from the Internet in 2008 along with other business corpora (for English and Russian).
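To make the "normalised per million characters" column of the character list concrete, here is a minimal sketch of how such a ranked list could be derived from raw text. It is an illustration only, not the procedure actually used for the Internet corpus. The function names, the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the third output field holding the character itself are all assumptions.

```python
from collections import Counter


def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as a character.
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(text):
    """Return (rank, frequency per million characters, character) rows."""
    chars = [ch for ch in text if is_cjk(ch)]
    total = len(chars)
    if total == 0:
        return []
    counts = Counter(chars)
    # Most frequent first; ties are broken by code point so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, count * 1_000_000 / total, ch)
        for rank, (ch, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    for rank, per_million, ch in char_frequencies("的的的一一是"):
        print(rank, round(per_million, 2), ch)
```

Because the frequencies are relative to the total number of characters counted, lists built this way from corpora of different sizes can be compared directly, though the ranking still reflects the composition of each corpus.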