
This project forked from connormayer/distributional_learning


Code for learning phonological classes from a corpus

License: GNU General Public License v3.0



An algorithm for learning phonological classes from distributional similarity

This repository contains the code used in Mayer, C. (accepted) An algorithm for learning phonological classes from distributional information. Phonology. I hope that by making the code publicly available, researchers will be able to both extend the algorithm and apply it to their own data sets. A brief description of the components and their usage is given below. See the paper for more details.


code

This folder contains the code used in the paper.

Minimum Python requirements (earlier versions may work, but have not been tested):

  • Python 3 (3.6.5)
  • numpy package (1.13.3)
  • nltk package (3.2.5)
  • sklearn package (0.19.1)

Python files

Most Python files can be called from the command line. You can add --help to these commands to get a description of the arguments.

  • HMM.py: A group of classes that implement a simple Hidden Markov Model that can be used to generate toy language corpora with specific transition and emission probabilities. Has no command line interface. See generate_parupa_corpora.py for an example of its use.
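    The core idea can be sketched in a few lines. This is illustrative only and does not reflect HMM.py's actual class names or API: states carry transition probabilities to other states and emission probabilities over segments, and a word is generated by walking the chain until an end state is reached.

    ```python
    import random

    # Minimal sketch of an HMM-based corpus generator.
    # Names and structure are hypothetical, not HMM.py's actual API.
    class ToyHMM:
        def __init__(self, transitions, emissions, start_state, end_state):
            self.transitions = transitions  # state -> {next_state: prob}
            self.emissions = emissions      # state -> {segment: prob}
            self.start_state = start_state
            self.end_state = end_state

        def _sample(self, dist):
            r, cumulative = random.random(), 0.0
            for outcome, p in dist.items():
                cumulative += p
                if r <= cumulative:
                    return outcome
            return outcome  # guard against float rounding

        def generate_word(self):
            state, segments = self.start_state, []
            while True:
                state = self._sample(self.transitions[state])
                if state == self.end_state:
                    return ' '.join(segments)
                segments.append(self._sample(self.emissions[state]))

    # A toy two-state CV language: words strictly alternate C and V.
    hmm = ToyHMM(
        transitions={'START': {'C': 1.0}, 'C': {'V': 1.0},
                     'V': {'C': 0.5, 'END': 0.5}},
        emissions={'C': {'p': 0.5, 'r': 0.5}, 'V': {'a': 0.5, 'u': 0.5}},
        start_state='START', end_state='END')
    word = hmm.generate_word()
    ```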

  • generate_parupa_corpora.py: Generates one or more corpora for the toy language Parupa. This script can be called from the command line with the following arguments.

    • Required positional argument(s): A space-separated list of noise values between 0 and 1. The noise value reflects the percentage of generated tokens that do not follow the phonotactic constraints of Parupa. This option combined with corpora_per_level determines how many corpora will be generated in total.
    • --corpora_per_level: The number of corpora that will be generated at each noise level. Optional, default 10.
    • --corpus_size: The number of tokens to generate in each corpus. Optional, default 50,000.
    • --output_dir: The directory to save the corpora in. Optional, default ../corpora/noisy_parupa/.

    An example of usage is:

    python3 generate_parupa_corpora.py 0 0.25 0.5 0.75 1 --corpora_per_level 5 --corpus_size 10000 --output_dir /my/great/dir/

    This command will generate 25 corpora: 5 at noise level 0, 5 at noise level 0.25, etc.

  • VectorModelBuilder.py: Generates a vector embedding of a corpus file. The input file consists of one word per line, with the segments in the word separated by a space. Because segments are space-separated, multi-character representations of a segment can be used. See the files in the corpora directory for formatting examples.

    The class generates three output files:

    • .data file: contains the vector representations of each segment in the input corpus.
    • .sounds file: contains the labels of the sounds in the same order as their vectors in the .data file.
    • .contexts file: contains the labels of the contexts (columns) of the vectors in the .data file.

    This class can be called from the command line or instantiated in a Python script.

    Command line arguments:

    • Required positional argument: The path to the corpus file to vectorize.

    Optional arguments:

    • --count_method: The counting method to use when creating the vectors. The program currently supports only the ngram method. Default: ngram.
    • --n: The value of n to use when the count_method == ngram. Default: 3.
    • --weighting: The weighting method to use on the raw counts when creating the vectors. Options include probability, conditional_probability, pmi, ppmi, and none. Note that if you use unigrams (n == 1), ppmi and pmi will weight all counts to 0 (because there is only a single context with a probability of 1.0), and conditional probability and probability weightings will be equivalent. Default: ppmi.
    • --outfile: The base filename to save the output files as. Optional, if not specified the base filename will be the same as the input corpus file.
    • --outdir: The directory to save the output files in. Optional, default ../vector_data/.

    An example of usage is:

    python3 VectorModelBuilder.py ../corpora/parupa.txt --n 3 --weighting ppmi --outfile my_vectors --outdir ../vector_data/
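    The corpus format and the ppmi weighting can be illustrated with a small sketch. This is not VectorModelBuilder.py's actual implementation; it simply shows the idea: count each segment's trigram contexts (left neighbour, right neighbour), then reweight the counts by positive pointwise mutual information.

    ```python
    import math
    from collections import Counter, defaultdict

    # Sketch of trigram-context counting with PPMI weighting.
    # Illustrative only; VectorModelBuilder.py may differ in detail.
    corpus = ["p a r u p a", "r u p a"]  # one word per line, space-separated segments

    counts = defaultdict(Counter)  # segment -> Counter of (left, right) contexts
    for word in corpus:
        segs = ['#'] + word.split() + ['#']  # pad with word boundaries
        for i in range(1, len(segs) - 1):
            counts[segs[i]][(segs[i - 1], segs[i + 1])] += 1

    total = sum(sum(c.values()) for c in counts.values())
    seg_totals = {s: sum(c.values()) for s, c in counts.items()}
    ctx_totals = Counter()
    for c in counts.values():
        ctx_totals.update(c)

    def ppmi(seg, ctx):
        p_joint = counts[seg][ctx] / total
        if p_joint == 0:
            return 0.0
        p_seg = seg_totals[seg] / total
        p_ctx = ctx_totals[ctx] / total
        return max(0.0, math.log2(p_joint / (p_seg * p_ctx)))

    contexts = sorted(ctx_totals)  # column labels (the .contexts file)
    sounds = sorted(counts)        # row labels (the .sounds file)
    vectors = [[ppmi(s, c) for c in contexts] for s in sounds]  # the .data file
    ```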

  • clusterer.py: Takes a vector embedding as input and generates classes of sounds using the combination of PCA and k-means clustering. Will print the discovered classes to the console and save them to a text file.

    Command line arguments:

    • Required positional argument: The stem of the set of input files generated by VectorModelBuilder.py. For example, if your input files are parupa_trigram_ppmi.data, parupa_trigram_ppmi.sounds, and parupa_trigram_ppmi.contexts, this argument should be parupa_trigram_ppmi.

    • Required positional argument: Path to the file where the discovered classes will be saved.

    • --v_scalar: A parameter that controls what proportion of variance a principal component must account for to be used in clustering. The threshold is (this value * the average amount of variance).

    • --no_constrain_initial_partition: A parameter that removes a restriction on the initial partition of the data set: namely, the restriction that the first partition of the full set of sounds must be into exactly two classes (e.g., consonants vs. vowels, voiced vs. voiceless, etc.).

    • --no_constrain_initial_pcs: A parameter that removes a restriction on the initial partition of the data set: namely, the restriction that only the first principal component is considered. Passing this flag will detect the same classes as the default behaviour, but with additional partitions of the data set potentially discovered as well. Similar results can be achieved by increasing the variance scalar, but that applies to all recursive calls to the clusterer rather than just the top-level call.
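    The PCA-plus-k-means procedure described above can be sketched as follows. This is illustrative only, not clusterer.py's actual code (it assumes scikit-learn and NumPy, and takes `vectors` as a NumPy array): project the vectors onto principal components, keep components whose variance exceeds the scaled average, split the sounds in two along each kept component with k-means, and recurse on each discovered class.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Sketch of recursive PCA + k-means class discovery.
    # Illustrative only; clusterer.py's implementation may differ.
    def partition(vectors, sounds, v_scalar=1.0, found=None):
        if found is None:
            found = []
        if len(sounds) < 3:
            return found
        pca = PCA()
        projected = pca.fit_transform(vectors)
        # Keep components explaining more than v_scalar * average variance.
        threshold = v_scalar * np.mean(pca.explained_variance_)
        for i in np.flatnonzero(pca.explained_variance_ > threshold):
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(projected[:, [i]])
            for k in (0, 1):
                cls = [s for s, lab in zip(sounds, labels) if lab == k]
                if 1 < len(cls) < len(sounds):
                    found.append(cls)
                    idx = [j for j, lab in enumerate(labels) if lab == k]
                    partition(vectors[idx], cls, v_scalar, found)
        return found
    ```

    On well-separated data the first split recovers the two top-level classes (e.g., consonants vs. vowels), and each recursive call then partitions within a class.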

  • vectorize_dir.py: A convenience script that produces vector representations for all corpora in a directory.

    The command line arguments for this script are essentially identical to those for VectorModelBuilder.py. The only differences are that the --outfile argument has been removed, and the required positional argument specifying the corpus file has been replaced with an optional argument specifying the directory of corpora:

    • --indir: The directory of corpus files that will be vectorized. Default: ../corpora/noisy_parupa.
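    The wrapper's behaviour amounts to a loop like the following sketch. It is hypothetical: `build_vectors` stands in for whatever the real script calls on each corpus, and the `.txt` filter is an assumption about how corpus files are named.

    ```python
    import os

    # Sketch of vectorizing every corpus file in a directory.
    # build_vectors is a hypothetical stand-in for VectorModelBuilder's entry point.
    def vectorize_dir(indir, outdir, build_vectors):
        for name in sorted(os.listdir(indir)):
            if not name.endswith('.txt'):
                continue
            corpus_path = os.path.join(indir, name)
            stem = os.path.splitext(name)[0]  # output base name matches the corpus
            build_vectors(corpus_path, os.path.join(outdir, stem))
    ```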

R files

R files can be run from an IDE like RStudio. Configurable variables are given in upper case at the tops of the files, and have accompanying comments specifying their use.

  • plot_embeddings.R: Plots and saves 2D PCAs of the full vector embedding, as well as 2D embeddings of the first partition into two by k-means clustering (in general, consonants vs. vowels). This was used to generate many of the figures in the paper.

corpora

This directory contains the corpora used in the paper.


vector_data

This directory contains the vector embeddings of the corpora used in the paper.


found_classes

This directory will hold .txt files containing the classes discovered by clusterer.py.


plot_data

This directory will hold plots of the vector embeddings generated by plot_embeddings.R.

