Calculate word frequencies in text corpora
To install WordFrequenciesCounter, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):
Metacello new
  baseline: 'WordFrequenciesCounter';
  repository: 'github://olekscode/WordFrequenciesCounter/src';
  load.
If you want to add a dependency on WordFrequenciesCounter to your project, include the following lines in your baseline method:
spec
  baseline: 'WordFrequenciesCounter'
  with: [ spec repository: 'github://olekscode/WordFrequenciesCounter/src' ].
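For context, a complete baseline method might look like the sketch below. The names BaselineOfMyProject and MyProject are hypothetical placeholders for your own baseline class and package; only the WordFrequenciesCounter lines come from the snippet above. The `Class >> selector` header is the usual notation for showing a method and is not meant to be pasted into the Playground.

```smalltalk
"Hypothetical baseline class BaselineOfMyProject with a hypothetical package MyProject"
BaselineOfMyProject >> baseline: spec
    <baseline>
    spec for: #common do: [
        "Declare the external dependency"
        spec
            baseline: 'WordFrequenciesCounter'
            with: [ spec repository: 'github://olekscode/WordFrequenciesCounter/src' ].

        "Make your own package require it, so it is loaded first"
        spec
            package: 'MyProject'
            with: [ spec requires: #('WordFrequenciesCounter') ] ]
```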
First, we need to select a text corpus on which we will calculate the word frequencies. A corpus is simply a file containing a very long text, or a combination of texts, that is used to train language models. For example, the Gutenberg Corpus contains the full texts of hundreds of English books, and the Brown Corpus contains English texts from different categories such as Press, Hobbies, Science, Religion, and Fiction. Leipzig WortSchatz provides corpora in many different languages, including Wikipedia articles, news, and web corpora.
By analysing a text corpus, we can learn about the language that is used in it. More specifically, with WordFrequenciesCounter, we can calculate the word frequencies in a corpus, which will be representative of the word frequencies of its language.
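To make the idea concrete, here is a minimal sketch of what counting word frequencies means, using only standard Pharo collections rather than this library. The sample sentence and the whitespace-only tokenization are assumptions made for illustration; WordFrequenciesCounter itself may tokenize and normalize text differently.

```smalltalk
| text words counts |
"Illustrative sample text, not taken from any corpus"
text := 'the jury said the election was fair'.

"Split on whitespace; a real tokenizer would also handle punctuation and letter case"
words := text substrings.

"A Bag keeps each distinct element together with its number of occurrences"
counts := words asBag.

"Absolute frequency of 'the' in the sample"
counts occurrencesOf: 'the'. "2"

"Relative frequency: occurrences divided by the total number of words"
(counts occurrencesOf: 'the') / words size. "(2/7)"
```

Evaluate the last two expressions with Print-it to see the results in the Playground.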
We download the selected corpus into a .txt file and create a file reference in Pharo:
corpusFile := '/Users/oleks/Documents/Data/brown.txt' asFileReference.
The contents of that file may look like this:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place. The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, deserves the praise and thanks of the City of Atlanta for the manner in which the election was conducted. The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible irregularities in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. Only a relative handful of such reports was received, the jury said, considering the widespread interest in the election, the number of voters and the size of this city. The jury said it did find that many of Georgia's registration and election laws are outmoded or inadequate and often ambiguous ...
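Next, we create a counter for the English alphabet and calculate the word frequencies in the corpus file:
counter := WordFrequenciesCounter withAlphabet: Alphabet english.
frequencies := counter wordFrequenciesInFile: corpusFile.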
The calculated frequencies can then be saved to a CSV file:
brownFrequenciesFile := '/Users/oleks/Documents/Data/brown-frequencies.csv' asFileReference.
counter
  saveFrequencies: frequencies
  toCsv: brownFrequenciesFile.
Alternatively, we can save only the top 10000 entries:
counter
  saveTop: 10000
  frequencies: frequencies
  toCsv: brownFrequenciesFile.