
This project forked from connormayer/distributional_learning


Code for learning phonological classes from a corpus

License: GNU General Public License v3.0



An algorithm for learning phonological classes from distributional similarity

This repository contains the code used in Mayer, C. (accepted) An algorithm for learning phonological classes from distributional information. Phonology. I hope that by making the code publicly available, researchers will be able to both extend the algorithm and apply it to their own data sets. A brief description of the components and their usage is given below. See the paper for more details.


code

This folder contains the code used in the paper.

Minimum Python requirements (earlier versions may work, but have not been tested):

  • Python 3 (3.6.5)
  • numpy package (1.13.3)
  • nltk package (3.2.5)
  • sklearn package (0.19.1)

Python files

Most Python files can be called from the command line. You can add --help to these commands to get a description of the arguments.

  • HMM.py: A group of classes that implement a simple Hidden Markov Model that can be used to generate toy language corpora with specific transition and emission probabilities. Has no command line interface. See generate_parupa_corpora.py for an example of its use.
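    The core idea can be sketched in a few lines. This is illustrative only and does not reflect HMM.py's actual class names or API: states carry transition probabilities to other states and emission probabilities over segments, and a word is generated by walking the chain until an end state is reached.

    ```python
    import random

    # Minimal sketch of an HMM-based corpus generator.
    # Names and structure are hypothetical, not HMM.py's actual API.
    class ToyHMM:
        def __init__(self, transitions, emissions, start_state, end_state):
            self.transitions = transitions  # state -> {next_state: prob}
            self.emissions = emissions      # state -> {segment: prob}
            self.start_state = start_state
            self.end_state = end_state

        def _sample(self, dist):
            r, cumulative = random.random(), 0.0
            for outcome, p in dist.items():
                cumulative += p
                if r <= cumulative:
                    return outcome
            return outcome  # guard against float rounding

        def generate_word(self):
            state, segments = self.start_state, []
            while True:
                state = self._sample(self.transitions[state])
                if state == self.end_state:
                    return ' '.join(segments)
                segments.append(self._sample(self.emissions[state]))

    # A toy two-state CV language: words strictly alternate C and V.
    hmm = ToyHMM(
        transitions={'START': {'C': 1.0}, 'C': {'V': 1.0},
                     'V': {'C': 0.5, 'END': 0.5}},
        emissions={'C': {'p': 0.5, 'r': 0.5}, 'V': {'a': 0.5, 'u': 0.5}},
        start_state='START', end_state='END')
    word = hmm.generate_word()
    ```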

  • generate_parupa_corpora.py: Generates one or more corpora for the toy language Parupa. This script can be called from the command line with the following arguments.

    • Required positional argument(s): A space-separated list of noise values between 0 and 1. The noise value reflects the percentage of generated tokens that do not follow the phonotactic constraints of Parupa. This option combined with corpora_per_level determines how many corpora will be generated in total.
    • --corpora_per_level: The number of corpora that will be generated at each noise level. Optional, default 10.
    • --corpus_size: The number of tokens to generate in each corpus. Optional, default 50,000.
    • --output_dir: The directory to save the corpora in. Optional, default ../corpora/noisy_parupa/.

    An example of usage is:

    python3 generate_parupa_corpora.py 0 0.25 0.5 0.75 1 --corpora_per_level 5 --corpus_size 10000 --output_dir /my/great/dir/

    This command will generate 25 corpora: 5 at noise level 0, 5 at noise level 0.25, etc.

  • VectorModelBuilder.py: Generates a vector embedding of a corpus file. The input file consists of one word per line, with the segments in the word separated by a space. Because segments are space-separated, multi-character representations of a segment can be used. See the files in the corpora directory for formatting examples.

    The class generates three output files:

    • .data file: contains the vector representations of each segment in the input corpus.
    • .sounds file: contains the labels of the sounds in the same order as their vectors in the .data file.
    • .contexts file: contains the labels of the contexts (columns) of the vectors in the .data file.

    This class can be called from the command line or instantiated in a Python script.

    Command line arguments:

    • Required positional argument: The path to the corpus file to vectorize.

    Optional arguments:

    • --count_method: The counting method to use when creating the vectors. The program currently supports only the ngram method. Default: ngram.
    • --n: The value of n to use when the count_method == ngram. Default: 3.
    • --weighting: The weighting method to use on the raw counts when creating the vectors. Options include probability, conditional_probability, pmi, ppmi, and none. Note that if you use unigrams (n == 1), ppmi and pmi will weight all counts to 0 (because there is only a single context with a probability of 1.0), and conditional probability and probability weightings will be equivalent. Default: ppmi.
    • --outfile: The base filename to save the output files as. Optional, if not specified the base filename will be the same as the input corpus file.
    • --outdir: The directory to save the output files in. Optional, default ../vector_data/.

    An example of usage is:

    python3 VectorModelBuilder.py ../corpora/parupa.txt --n 3 --weighting ppmi --outfile my_vectors --outdir ../vector_data/
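    The corpus format and the ppmi weighting can be illustrated with a small sketch. This is not VectorModelBuilder.py's actual implementation; it simply shows the idea: count each segment's trigram contexts (left neighbour, right neighbour), then reweight the counts by positive pointwise mutual information.

    ```python
    import math
    from collections import Counter, defaultdict

    # Sketch of trigram-context counting with PPMI weighting.
    # Illustrative only; VectorModelBuilder.py may differ in detail.
    corpus = ["p a r u p a", "r u p a"]  # one word per line, space-separated segments

    counts = defaultdict(Counter)  # segment -> Counter of (left, right) contexts
    for word in corpus:
        segs = ['#'] + word.split() + ['#']  # pad with word boundaries
        for i in range(1, len(segs) - 1):
            counts[segs[i]][(segs[i - 1], segs[i + 1])] += 1

    total = sum(sum(c.values()) for c in counts.values())
    seg_totals = {s: sum(c.values()) for s, c in counts.items()}
    ctx_totals = Counter()
    for c in counts.values():
        ctx_totals.update(c)

    def ppmi(seg, ctx):
        p_joint = counts[seg][ctx] / total
        if p_joint == 0:
            return 0.0
        p_seg = seg_totals[seg] / total
        p_ctx = ctx_totals[ctx] / total
        return max(0.0, math.log2(p_joint / (p_seg * p_ctx)))

    contexts = sorted(ctx_totals)  # column labels (the .contexts file)
    sounds = sorted(counts)        # row labels (the .sounds file)
    vectors = [[ppmi(s, c) for c in contexts] for s in sounds]  # the .data file
    ```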

  • clusterer.py: Takes a vector embedding as input and generates classes of sounds using the combination of PCA and k-means clustering. Will print the discovered classes to the console and save them to a text file.

    Command line arguments:

    • Required positional argument: The stem of the set of input files generated by VectorModelBuilder.py. For example, if your input files are parupa_trigram_ppmi.data, parupa_trigram_ppmi.sounds, and parupa_trigram_ppmi.contexts, this argument should be parupa_trigram_ppmi.

    • Required positional argument: Path to the file where the discovered classes will be saved.

    • --v_scalar: A parameter that controls what proportion of variance a principal component must account for to be used in clustering. The threshold is (this value * the average amount of variance).

    • --no_constrain_initial_partition: A parameter that removes a restriction on the initial partition of the data set: namely, the restriction that the first partition of the full set of sounds must be into exactly two classes (e.g., consonants vs. vowels, voiced vs. voiceless, etc.).

    • --no_constrain_initial_pcs: A parameter that removes a restriction on the initial partition of the data set: namely, the restriction that only the first principal component is considered. Passing this flag will detect the same classes as the default behaviour, but with additional partitions of the data set potentially discovered as well. Similar results can be achieved by increasing the variance scalar, but that applies to all recursive calls to the clusterer rather than just the top-level call.
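    The PCA-plus-k-means procedure described above can be sketched as follows. This is illustrative only, not clusterer.py's actual code (it assumes scikit-learn and NumPy, and takes `vectors` as a NumPy array): project the vectors onto principal components, keep components whose variance exceeds the scaled average, split the sounds in two along each kept component with k-means, and recurse on each discovered class.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Sketch of recursive PCA + k-means class discovery.
    # Illustrative only; clusterer.py's implementation may differ.
    def partition(vectors, sounds, v_scalar=1.0, found=None):
        if found is None:
            found = []
        if len(sounds) < 3:
            return found
        pca = PCA()
        projected = pca.fit_transform(vectors)
        # Keep components explaining more than v_scalar * average variance.
        threshold = v_scalar * np.mean(pca.explained_variance_)
        for i in np.flatnonzero(pca.explained_variance_ > threshold):
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(projected[:, [i]])
            for k in (0, 1):
                cls = [s for s, lab in zip(sounds, labels) if lab == k]
                if 1 < len(cls) < len(sounds):
                    found.append(cls)
                    idx = [j for j, lab in enumerate(labels) if lab == k]
                    partition(vectors[idx], cls, v_scalar, found)
        return found
    ```

    On well-separated data the first split recovers the two top-level classes (e.g., consonants vs. vowels), and each recursive call then partitions within a class.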

  • vectorize_dir.py: A convenience script that produces vector representations for all corpora in a directory.

    The command line arguments for this script are essentially identical to those for VectorModelBuilder.py. The only differences are that the --outfile argument has been removed, and the required positional argument specifying the corpus file has been replaced with an optional argument specifying the directory of corpora:

    • --indir: The directory of corpus files that will be vectorized. Default: ../corpora/noisy_parupa.
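    The wrapper's behaviour amounts to a loop like the following sketch. It is hypothetical: `build_vectors` stands in for whatever the real script calls on each corpus, and the `.txt` filter is an assumption about how corpus files are named.

    ```python
    import os

    # Sketch of vectorizing every corpus file in a directory.
    # build_vectors is a hypothetical stand-in for VectorModelBuilder's entry point.
    def vectorize_dir(indir, outdir, build_vectors):
        for name in sorted(os.listdir(indir)):
            if not name.endswith('.txt'):
                continue
            corpus_path = os.path.join(indir, name)
            stem = os.path.splitext(name)[0]  # output base name matches the corpus
            build_vectors(corpus_path, os.path.join(outdir, stem))
    ```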

R files

R files can be run from an IDE like RStudio. Configurable variables are given in upper case at the tops of the files, and have accompanying comments specifying their use.

  • plot_embeddings.R: Plots and saves 2D PCAs of the full vector embedding, as well as 2D embeddings of the first partition into two by k-means clustering (in general, consonants vs. vowels). This was used to generate many of the figures in the paper.

corpora

This directory contains the corpora used in the paper.


vector_data

This directory contains the vector embeddings of the corpora used in the paper.


found_classes

This directory will hold .txt files containing the classes discovered by clusterer.py.


plot_data

This directory will hold plots of the vector embeddings generated by plot_embeddings.R.

