
Quantitative modelling of inflection

Home Page: https://qumin.readthedocs.io/

License: GNU General Public License v3.0

Languages: Python 96.73%, JavaScript 0.64%, CSS 0.23%, Makefile 2.40%
Topics: inflection, linguistics, morphology, paradigms

qumin's Introduction


Qumin (QUantitative Modelling of INflection) is a package for the computational modelling of the inflectional morphology of languages. It was initially developed for Sacha Beniamine's PhD dissertation.

Contributors: Sacha Beniamine, Jules Bouton.

Documentation: https://qumin.readthedocs.io/

Github: https://github.com/XachaB/Qumin

This is version 2, which has been significantly updated since the publications cited below. These updates do not affect results; they focused on bugfixes, the command line interface, Paralex compatibility, workflow improvements and overall tidiness.

For more detail, you can refer to Sacha's dissertation (in French, Beniamine 2018).

Citing

If you use Qumin in your research, please cite Sacha's dissertation (Beniamine 2018), as well as the relevant paper for the specific actions used (see below). To appear in the publications list, send Sacha an email with the reference of your publication at s.<last name>@surrey.ac.uk

Quick Start

Install

Install the Qumin package using pip:

pip install qumin

Data

Qumin works from full paradigm data in phonemic transcription.

The package expects Paralex datasets, containing at least a forms and a sounds table. Note that the sounds file may sometimes require editing, as Qumin imposes more constraints on sound definitions than Paralex does.

Scripts

Note

We now rely on Hydra to manage the command-line interface and configuration. Hydra will create a folder outputs/<yyyy-mm-dd>/<hh-mm-ss>/ containing all results. A subfolder outputs/<yyyy-mm-dd>/<hh-mm-ss>/.hydra/ contains details of the configuration as it was when the script was run. Hydra permits much more configuration. For example, any of the following scripts can accept a verbose argument of the form hydra.verbose=True, and the output directory can be customized with hydra.run.dir="./path/to/output/dir".

More details on configuration:

/$ qumin --help

Patterns

Alternation patterns serve as a basis for all the other scripts. An early version of the patterns algorithm is described in Beniamine (2017). An updated description appears in Beniamine, Bonami and Luís (2021).

The default action for Qumin is to compute patterns only, so these two commands are identical:

/$ qumin data=<dataset.package.json>
/$ qumin action=patterns data=<dataset.package.json>

By default, Qumin will ignore defective lexemes and overabundant forms.

For paradigm entropy, it is possible to explicitly keep defective lexemes:

/$ qumin pats.defective=True data=<dataset.package.json>

For inflection class lattices, both can be kept:

/$ qumin pats.defective=True pats.overabundant=True data=<dataset.package.json>

Microclasses

To visualize the microclasses and their similarities, one can compute a microclass heatmap:

/$ qumin action=heatmap data=<dataset.package.json>

This will compute the patterns, then the heatmap. To use pre-computed patterns, pass the file path:

/$ qumin action=heatmap patterns=<path/to/patterns.csv> data=<dataset.package.json>

It is also possible to pass class labels to facilitate comparisons with another classification:

/$ qumin action=heatmap label=inflection_class patterns=<path/to/patterns.csv> data=<dataset.package.json>

The label key is the name of the column in the Paralex lexemes table to use as labels.

A few more parameters can be changed:

heatmap:
    cmap: null               # colormap name
    exhaustive_labels: False # by default, seaborn shows only some labels on
                             # the heatmap for readability.
                             # This forces seaborn to print all labels.

Paradigm entropy

An early version of this software was used in Bonami and Beniamine (2016), and a more recent one in Beniamine, Bonami and Luís (2021).

By default, this will start by computing patterns. To work with pre-computed patterns, pass their path with patterns=<path/to/patterns.csv>.

Computing entropies from one cell

/$ qumin action=H data=<dataset.package.json>

Computing entropies for other numbers of predictors:

/$ qumin action=H  n=2 data=<dataset.package.json>
/$ qumin action=H  n="[2,3]" data=<dataset.package.json>

Warning

With n > 2 the computation can get quite long on large datasets, and it might be better to run Qumin on a server.

Predicting with known lexeme-wise features (such as gender or inflection class) is also possible. This feature was used in Pellegrini (2023). To use features, pass the name of any column(s) from the lexemes table:

/$ qumin action=H  feature=inflection_class patterns=<patterns.csv> data=<dataset.package.json>
/$ qumin action=H  feature="[inflection_class,gender]" patterns=<patterns.csv> data=<dataset.package.json>

The config file contains the following keys, which can be set through the command line:

patterns: null        # pre-computed patterns
entropy:
  n:                  # Compute entropy for prediction with n predictors.
    - 1
  features: null      # Feature column(s) in the lexemes table.
                      # Features will be considered known in conditional probabilities: P(X~Y|X,f1,f2...)
  importFile: null    # Import an entropy file with n-1 predictors (speeds up the computation with n predictors).
  merged: False       # Whether identical columns are merged in the input.
  stacked: False      # Whether to stack results in long form.

For bipartite systems, it is possible to pass two values to both patterns and data, e.g.:

/$ qumin action=H  patterns="[<patterns1.csv>,<patterns2.csv>]" data="[<dataset1.package.json>,<dataset2.package.json>]"

Visualizing results

Since Qumin 2.0, results are shipped as long tables. This makes it possible to store several metrics in the same file, along with results for several runs. Results files now look like this:

predictor,predicted,measure,value,n_pairs,n_preds,dataset
<cell1>,<cell2>,cond_entropy,0.39,500,1,<dataset_name>
<cell1>,<cell2>,cond_entropy,0.35,500,1,<dataset_name>
<cell1>,<cell2>,cond_entropy,0.2,500,1,<dataset_name>
<cell1>,<cell2>,cond_entropy,0.43,500,1,<dataset_name>
<cell1>,<cell2>,cond_entropy,0.6,500,1,<dataset_name>
<cell1>,<cell2>,cond_entropy,0.1,500,1,<dataset_name>

All results are in the same file, including different numbers of predictors (indicated in the n_preds column) and different measures (indicated in the measure column).

To facilitate a quick overview of the results, we output an entropy heatmap in wide matrix format. This behaviour can be disabled by passing entropy.heatmap=False. The heatmap takes advantage of the Paralex features-values table to sort the cells in a canonical order. The heatmap.order setting specifies which feature should have higher priority in the sorting:

/$ qumin action=H data=<dataset.package.json> heatmap.order="[number, case]"

It is also possible to draw an entropy heatmap without running entropy computations:

/$ qumin action=ent_heatmap entropy.importFile=<entropies.csv>

The config file contains the following keys, which can be set through the command line:

heatmap:
  cmap: null               # colormap name
  exhaustive_labels: False # by default, seaborn shows only some labels on
                           # the heatmap for readability.
                           # This forces seaborn to print all labels.
  dense: False             # Use initials instead of full labels (only for entropy heatmap)
  annotate: False          # Display values on the heatmap. (only for entropy heatmap)
  order: False             # Priority list for sorting features (for entropy heatmap),
                           # e.g. [number, case]. If no features-values file is available,
                           # it should contain an ordered list of the cells to display.
entropy:
  heatmap: True        # Whether to draw a heatmap.

Macroclass inference

Our work on automatic inference of macroclasses was published in Beniamine, Bonami and Sagot (2018).

By default, this will start by computing patterns. To work with pre-computed patterns, pass their path with patterns=<path/to/patterns.csv>.

Inferring macroclasses

/$ qumin action=macroclasses data=<dataset.package.json>

Lattices

By default, this will start by computing patterns. To work with pre-computed patterns, pass their path with patterns=<path/to/patterns.csv>.

This software was used in Beniamine (2021).

Inferring a lattice of inflection classes, with (default) HTML output:

/$ qumin action=lattice pats.defective=True pats.overabundant=True data=<dataset.package.json>

Further config options:

lattice:
  shorten: False      # Drop redundant columns altogether.
                      #  Useful for big contexts, but loses information.
                      # The lattice shape and stats will be the same.
                      # Avoid using with --html
  aoc: False          # Only attribute and object concepts
  stat: False         # Output stats about the lattice
  html: False         # Export to html
  ctxt: False         # Export as a context
  pdf: True           # Export as pdf
  png: False          # Export as png


qumin's Issues

[Lattices] unexpected keyword 'collections'

Got the following error on branch RemoveAliases, just by running qumin.lattice on very basic test patterns.
Looks like collections (in kwargs) is passed to table_to_context, but I don't know whether this error is normal behaviour, or whether collections should theoretically be an argument for table_to_context. Note also the very weird package imports at the top of the file qumin/lattice/lattice.py.

I am not familiar with this part of the code, so I am just filing it here:

Traceback (most recent call last):
  File "PATH/Qumin/bin/qumin.lattice", line 33, in <module>
    sys.exit(load_entry_point('qumin', 'console_scripts', 'qumin.lattice')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/Qumin/qumin/make_lattice.py", line 164, in lattice_command
    main(args)
  File "PATH/Qumin/qumin/make_lattice.py", line 70, in main
    lattice = ICLattice(pat_table.loc[list(microclasses), :], microclasses,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/Qumin/qumin/lattice/lattice.py", line 156, in __init__
    self.context = table_to_context(dataframe, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: table_to_context() got an unexpected keyword argument 'collections'

[data reading] Qumin always re-segments wordforms, even in the presence of spaces

The behaviour is that of Qumin v1: always re-segment. However, it might be better to make it possible to respect the given segmentation. Unfortunately, Qumin has quirks regarding phonology, and the segmentation it needs tends to differ from the one I use in later datasets (where I write tiered information, such as length, tones and stress, on the syllable's vowel).

Note that all paralex datasets now MUST have space separation:

"The value of the phon_form MUST be a sequence of space-separated segments, e.g. not "dominoːrum" but "d o m i n oː r u m"."

I think we need the following cases:

  • There are no spaces => throw an error, not Paralex compliant (although we do have the means to still parse things... should we?)
  • There are spaces => by default, split on spaces
  • There are spaces, but a user-defined config parameter asks to re-split => re-split (current behaviour)

@JPapir: what do you think? Is that reasonable?

The line doing the splitting is here:

tokens = Inventory._segmenter.findall(string)

[Patterns] regression in v2.0.0 on pats.defective=False

Investigating why #30 was failing when pats.defective=False, it turns out that v2.0 introduced a regression: whatever the value of pats.defective, we get the same result, including defective lexemes. Here is a sample patterns file with defective entries and pats.defective=False; lexemes p and q should be dropped.

lexeme,"('first', 'second')","('first', 'third')","('second', 'third')"
k,s_ ⇌ _p / p_s_ <2>,s_ ⇌ _k / p_s_ <1>,p ⇌ k / ps_ <1>
p,s_ ⇌ _p / p_s_ <2>,,
q,,,k ⇌ t / ps_ <1>
s,s ⇌  / ssp_ <1>,ss_ ⇌ _s / _p_s <1>,_p ⇌ p_ / _ss_ <1>

I never use pats.defective=False, so this explains why I didn't spot it before.

I suspect these 2 lines used to delete lexemes, whereas they only delete forms now that we use long format tables.

Do you confirm?

Remove the versioning code which tries to get an svn version

Remove svn use in:

def get_repository_version():
    """Return an ID for the current git or svn revision.

    If the directory isn't under git or svn, the function returns an empty str.

    Returns:
        (str): svn/git version or ''.
    """
    import subprocess
    try:
        kwargs = {"universal_newlines": True}
        try:
            no_svn = subprocess.call(["svn", "info"], **kwargs)
        except FileNotFoundError:
            no_svn = 1
        if no_svn == 0:
            version = subprocess.check_output(["svnversion"], **kwargs)
        else:
            version = subprocess.check_output(["git", "describe"], **kwargs)
        return version.strip("\n ")
    except:
        return ''

[interface] handling of long file names

Hi Sacha! I thought it could be useful to write down here some practical issues I found, just so we remember they have to be fixed. This doesn't mean that I expect you to do it, I can also handle it! It's just for tracking.

When working on NTFS file systems, even on Linux-like OSes, file names are limited to 255 characters.
However, Qumin stores files with really long names. In such cases, it can raise an OSError:

OSError: [Errno 36] File name too long: '../Results/JointPred/20240115/vlexique_short-20171031_paradigms.csv_v1.1.0-63-g6e7431d_20240115_12h03__patternsPhonsim.csv_v1.1.0-63-g6e7431d_20240115_12h14_onePredEntropies_beta0.5.csv'

There should be a fallback in case of such an error, in order to save the computation results, plus a WARNING to the user.

[PyPi] missing js / css files for lattice script

(to fix for #22)

When installing from PyPI, the following files are missing from the package:

If mpld3 is not installed, this is fine. But if mpld3 is available, it raises the following error when running qumin.lattice --html:

    JAVASCRIPT = _load_external_text("HighlightSubTrees.js")
  File "/venv/lib/python3.9/site-packages/qumin/lattice/lattice.py", line 34, in _load_external_text
    open(join(dirname(__file__), filename), "r", encoding="utf-8").readlines())
FileNotFoundError: [Errno 2] No such file or directory: '/venv/lib/python3.9/site-packages/qumin/lattice/HighlightSubTrees.js'

Remove unused code

There is still a lot of code which is not used at all and was written "just in case". It should be removed to facilitate maintenance, since very few new features should be implemented from now on.

Add a Frequencies class

The Frequencies class (former Weights class from the feature-overabundance branch) should have methods to:

  • return a frequency for any form_id, lexeme_id or cell_id
  • (a bit tricky) read the frequencies from:
    1. a Paralex forms table (resp. lexemes, cells)
    2. a Paralex frequencies table if no form frequencies are available
    3. a default uniform distribution otherwise
  • return relative frequencies (based on grouping on a parameter) and absolute frequencies for any of the three tables (form, lexeme, cell)

Any other suggestions for this class can be added here.

Better heuristic for the insertion cost in alignments

I keep changing the insertion cost in the phonologically aware alignment function because it is hard to find a correct value.

The problem

Intuitively, it needs to be lower than half a bad substitution, but higher than half a good substitution. Indeed, if [i] and [u] are considered close, we might prefer a single substitution:

alignment:  xyi
            xyu
operations: SSS

to two insertions:

alignment:  xyi-
            xy-u
operations: SSII

In the case of two segments which are very distinct, for example [t] and [a], we should prefer:

alignment:  xyt-
            xy-a
operations: SSII

to:

alignment:  xyt
            xya
operations: SSS

A lower cost favors many insertions and results in many concurrent alignments. A high cost forces alignments of dissimilar segments, and may result in not finding the right alignment.

Heuristics

The current heuristic is to multiply the mean cost over the entire segment inventory by a constant proportion, currently set to 1/3, but which has been set to 2 in the past.

Then we do:

alignment.PHON_INS_COST = alignment.get_mean_cost()*INS_PROP

Proposition

While looking at the actual matrices of segment similarity, I noticed that there is often a vaguely bimodal distribution, with many similar pairs, and many dissimilar pairs. As a result, I believe the median is better than the mean.

We need the result to be lower than half the mean: a good default for INS_PROP might be to go back to 1/3 for the proportion. Since it is a constant, the program should accept it as a keyword argument in case the user wants to provide a custom value.

[Patterns] Merge alternations when the only difference is a blank

Some alternations differ only by the number of blanks, e.g. in Portuguese verbs:

[+son +stress -C]__ ⇌ [+son -C -stress]_ˈɐm_ʃ / X*_C+_u_
[+son +stress -C]_ ⇌ [+son -C -stress]ˈɐm_ʃ / X*[-lat]_u_	

These could be expressed as:

[+son +stress -C]__ ⇌ [+son -C -stress]_ˈɐm_ʃ / X*_C*_u_

  • Detect these cases.
  • Merge them.

Replace the alias system with coding on integers

Currently, I segment forms, then replace each sequence of characters with a 1-char alias which looks vaguely like the original sequence. This is a lot of complicated code for something simple. It would be better (and less confusing) to code strings as integers for internal calculations. All external outputs will be unchanged.

Separator "-" is sometimes missing from alternations in exported patterns

When the alternation involves a regular phonological alternation, but the segments do not form a natural class, the separator "-" is missing in the exported patterns.

Signalled by Olivier:

"I've come across an oddity in the current version of Qumin: the convention for the use of square brackets is not the same in the context part and in the change part. Example below. No doubt it's an oversight? I'm flagging it because, depending on the rest of your code, it may create non-trivial problems..."

Example:
[iuy] ⇌ [jwɥ]ɔ̃ / [E-O-a-b-d-f-g-i-j-k-l-m-n-p-s-t-u-v-w-y-z-Ø-ɑ̃-ɔ̃-ŋ-œ̃-ɥ-ɲ-ʁ-ʃ-ʒ-ɛ̃]*[b-d-f-g-k-l-m-n-p-s-t-v-z-ŋ-ɲ-ʁ-ʃ-ʒ]_ <281>

[Patterns] Pretty printing generalized alternations should co-ordinate the print style

I don't want things like:

[+son +stress -C][a-o-u-ɐ̃-õ-ɐ-ɔ]ʃ ⇌ [+son -C -stress][+J +dipht +son +stress -C] / X*C_

It's either a list of phonemes, as in [a-o-u-ɐ̃-õ-ɐ-ɔ], or a list of features, as in [+J +dipht +son +stress -C]. This might be a good occasion to rewrite the way I handle regular phonological changes in the alternation.

[Patterns] Generalization in the alternation should generalize minimally

Currently, we generalize the entire alternation when this allows merging more than one pattern. However, there may be several sites where generalization is possible, and only some of them actually require it. For example:

Verbo PresIndic:2 PretPerfIndic:1
criar kɾˈiɐʃ kɾiˈɐj
continuar kõtinˈuɐʃ kõtinuˈɐj
apoiar ɐpˈojɐʃ ɐpojˈɐj

These are captured at first by three alternations:

ˈiɐʃ ⇌ iˈɐj
ˈuɐ ⇌ uˈɐj
ˈojɐʃ  ⇌ ojˈɐj 

We find two possible generalizations in the alternations:

  1. /ˈi/ to /i/, /ˈu/ to /u/ and /ˈoj/ to /oj/ can all be expressed by a change from [-stress] to [+stress].
  2. /ɐ/ to /ˈɐj/ is a change from [-stress -diph -J] to [+stress +diph +J].

We notice that generalizing can allow us to merge these three patterns, so we end up with:

[+son +stress -C][a-o-u-ɐ̃-õ-ɐ-ɔ]ʃ ⇌ [+son -C -stress][+J +dipht +son +stress -C] / X*C_

But only the first of the two generalizations was necessary, since all of these words change exactly /ɐ/ to /ˈɐj/. We want to generalize only what is necessary to obtain a single pattern:

[+son +stress -C]ɐʃ ⇌ [+son -C -stress]ˈɐj / X*C_

Update documentation and set up generation

Currently, the doc lives on a server at the LLF, and the last generated version is old. I need to update the docs so they are current, re-generate them, and set up somewhere to host the generated files. Maybe readthedocs?

[PyPi] Deprecation warning

When installing Qumin from PyPI, I get the following warning:

DEPRECATION: qumin is being installed using the legacy 'setup.py install' method, 
because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. 
pip 23.1 will enforce this behaviour change. A possible replacement is to enable 
the '--use-pep517' option. 
Discussion can be found at https://github.com/pypa/pip/issues/8559

If I understand it well, we should use a pyproject.toml file instead of setup.py? I don't know if this is project-related or just a wrong setup on my side. Do not hesitate to close if irrelevant.
