WordFrequenciesCounter

Calculate word frequencies in text corpora

How to install it

To install WordFrequenciesCounter, open the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

Metacello new
  baseline: 'WordFrequenciesCounter';
  repository: 'github://olekscode/WordFrequenciesCounter/src';
  load.

How to depend on it

If you want to add a dependency on WordFrequenciesCounter to your project, add the following lines to your baseline method:

spec
  baseline: 'WordFrequenciesCounter'
  with: [ spec repository: 'github://olekscode/WordFrequenciesCounter/src' ].
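
For context, a complete baseline method might look like the sketch below. The class BaselineOfMyProject and the package name MyProject are hypothetical placeholders; only the WordFrequenciesCounter lines come from the snippet above.

```smalltalk
"Method defined in BaselineOfMyProject, a hypothetical subclass of BaselineOf."
baseline: spec
    <baseline>
    spec for: #common do: [
        "Declare the external dependency."
        spec
            baseline: 'WordFrequenciesCounter'
            with: [ spec repository: 'github://olekscode/WordFrequenciesCounter/src' ].
        "Hypothetical package of your own project that requires it."
        spec
            package: 'MyProject'
            with: [ spec requires: #('WordFrequenciesCounter') ] ]
```

Listing the baseline name in `requires:` makes Metacello load WordFrequenciesCounter before the package that depends on it.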

How to use it

1. Selecting a text corpus

First, we need to select a text corpus on which to calculate the word frequencies. A corpus is a large text, or a collection of texts, often used to train language models. For example, the Gutenberg Corpus contains the full texts of hundreds of English books, and the Brown Corpus contains English texts from different categories such as Press, Hobbies, Science, Religion, and Fiction. Leipzig WortSchatz provides corpora in many different languages, including Wikipedia articles, news, and web corpora.

By analysing a text corpus, we can learn about the language used in it. More specifically, with WordFrequenciesCounter, we can calculate the word frequencies in a corpus, which serve as an estimate of the word frequencies of its language.

We download a selected corpus into a .txt file and create a file reference in Pharo:

corpusFile := '/Users/oleks/Documents/Data/brown.txt' asFileReference.

The contents of that file may look like this:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place. The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, deserves the praise and thanks of the City of Atlanta for the manner in which the election was conducted. The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible irregularities in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. Only a relative handful of such reports was received, the jury said, considering the widespread interest in the election, the number of voters and the size of this city. The jury said it did find that many of Georgia's registration and election laws are outmoded or inadequate and often ambiguous ...
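
Before counting, it can help to confirm that the file reference points at an existing file and to peek at its beginning. The following is a minimal sketch using standard Pharo file messages in recent Pharo versions; it is not part of WordFrequenciesCounter, and the preview length of 200 characters is an arbitrary choice.

```smalltalk
corpusFile := '/Users/oleks/Documents/Data/brown.txt' asFileReference.

"Stop early with a readable error if the path is wrong."
corpusFile exists
    ifFalse: [ self error: 'Corpus file not found: ', corpusFile fullName ].

"Read only the first 200 characters instead of loading the whole corpus."
corpusFile readStreamDo: [ :stream | stream next: 200 ].
```

Evaluate the last expression with Print-it or Inspect-it to see the preview.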

2. Creating an instance of WordFrequenciesCounter

counter := WordFrequenciesCounter withAlphabet: Alphabet english.

3. Calculating word frequencies

frequencies := counter wordFrequenciesInFile: corpusFile.
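
As a rough illustration of what counting word frequencies means, the sketch below counts words in a short string with a plain Pharo Bag. This is an assumption-laden toy, not the WordFrequenciesCounter implementation: it ignores the alphabet, splits only on whitespace, and the sample text is made up.

```smalltalk
"Toy example: count whitespace-separated words with a Bag."
text := 'the cat and the dog and the bird'.
words := text asLowercase substrings.
bag := Bag withAll: words.

"Number of times a given word occurs."
bag occurrencesOf: 'the'.    "3"

"Associations of count -> word, most frequent first."
bag sortedCounts first: 2.
```

The structure of the object returned by wordFrequenciesInFile: is not documented here, so inspect it in your image before relying on a particular shape.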

4. Saving the frequencies table into a CSV file

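brownFrequenciesFile := '/Users/oleks/Documents/Data/brown-frequencies.csv' asFileReference.

"Save the full frequencies table."
counter
    saveFrequencies: frequencies
    toCsv: brownFrequenciesFile.

"Or save only the top 10000 entries (presumably the most frequent words).
Both calls here target the same file, so use one of them or pass a different file."
counter
    saveTop: 10000
    frequencies: frequencies
    toCsv: brownFrequenciesFile.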
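
To check the result, the saved file can be read back with standard Pharo file messages. This is a minimal sketch that makes no assumption about the CSV column layout; it only shows the first few raw lines, and the limit of 5 is arbitrary.

```smalltalk
brownFrequenciesFile := '/Users/oleks/Documents/Data/brown-frequencies.csv' asFileReference.

"Load the file and split it into lines."
lines := brownFrequenciesFile contents lines.

"Take at most the first 5 lines, even if the file is shorter."
lines first: (5 min: lines size).
```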
