kawu / concraft

A morphosyntactic disambiguation library based on constrained conditional random fields

License: BSD 2-Clause "Simplified" License

Haskell 100.00%

concraft's Issues

Text -> AttrVal

Replace Text with AttrVal where appropriate.

Problem with installation

During installation of Nerf's dependency (monad-par) cabal is not able to correctly determine version numbers of individual packages. The following solution seems to work:

rm -r -f ~/.cabal
rm -r -f ~/.ghc
cabal update
cabal install parallel
cabal install nerf

(I don't know whether the first two commands are necessary, but they are certainly dangerous: they delete the user's entire local cabal and GHC package state.)

Check if this problem occurs with newer cabal/GHC versions.

Consider different feature storage methods

Right now, the "phi" function is the bottleneck and slows down computation substantially. The main reason is the representation of the (feature -> feature index) map: we use the standard "Data.Map" container for it, which is too slow (profiling reported that the program spends ~30% of its time in the "phi" function).

Potential solutions:

- Use a hashmap (or similar) container instead.
  - Unfortunately, the hashmap containers available on Hackage do not provide Binary instances.
- Use an array (or a set of arrays).
  - This is a viable solution only when the number of labels is limited within individual tagging layers.
- Best of all: support different model representation back-ends.
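
A minimal sketch of the last option. The names `FeatIx`, `fromMap` and `fromDense` are hypothetical and are not part of Concraft or its CRF libraries. The idea is that the model only talks to a small lookup interface, so the Data.Map and array representations become interchangeable back-ends. The dense back-end assumes that a feature can be encoded as a pair of small label identifiers, which is realistic only when the number of labels per layer is bounded.

```haskell
import Data.Array.Unboxed (UArray, accumArray, bounds, (!))
import Data.Ix (inRange)
import qualified Data.Map.Strict as M

-- | Hypothetical interface: maps a feature to its parameter index, if any.
newtype FeatIx f = FeatIx { lookupIx :: f -> Maybe Int }

-- | Back-end 1: the current approach, a balanced tree with O(log n) lookups.
fromMap :: Ord f => M.Map f Int -> FeatIx f
fromMap m = FeatIx (`M.lookup` m)

-- | Back-end 2: a dense table for features that are pairs of label
-- identifiers in the range [0, n).  Missing features are stored as -1.
-- All keys in the input list must lie within that range.
fromDense :: Int -> [((Int, Int), Int)] -> FeatIx (Int, Int)
fromDense n xs = FeatIx look
  where
    table :: UArray (Int, Int) Int
    table = accumArray (\_ new -> new) (-1) ((0, 0), (n - 1, n - 1)) xs
    look k
      | inRange (bounds table) k, table ! k >= 0 = Just (table ! k)
      | otherwise = Nothing
```

Code that computes "phi" would then take a `FeatIx` argument instead of a concrete map, so back-ends can be compared under the profiler without touching the model code.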

Tagging runtime error

Running Concraft on data with no disambiguation annotations results in the following runtime error: concraft: mkProb: no elements with positive probability.
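
One way to avoid the crash is to make the normalization step total, so that the caller decides what to do with input that carries no positive weights. The sketch below is an illustration only: `mkProbMaybe` is a hypothetical name, and the actual implementation of mkProb may differ.

```haskell
import qualified Data.Map.Strict as M

-- | Hypothetical total variant of a "make probability distribution" helper.
-- Keeps only positive weights and normalizes them so that they sum to 1.
-- Returns Nothing instead of calling `error` when nothing is left.
mkProbMaybe :: Ord a => [(a, Double)] -> Maybe (M.Map a Double)
mkProbMaybe xs
  | total > 0 = Just (M.map (/ total) positive)
  | otherwise = Nothing
  where
    -- Weights of repeated elements are summed.
    positive = M.fromListWith (+) [(x, w) | (x, w) <- xs, w > 0]
    total    = sum (M.elems positive)
```

With such a helper, a word without disambiguation annotations could be skipped or reported with a readable message rather than aborting the whole run.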

Guessing strategies

The library should provide different guessing strategies, for example:

- Guess interpretations of unknown words.
- Narrow down interpretations of known words.
- Choose tags with probabilities higher than some arbitrary threshold.
- Combine different strategies.
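
The strategies listed above could be modelled as plain functions that are easy to compose. This is a sketch under assumptions: `Strategy`, `Guess` and all function names below are hypothetical and do not correspond to the existing Concraft API.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Tag = String

-- | Tag probabilities proposed by the guesser for one word (hypothetical).
type Guess = M.Map Tag Double

-- | A strategy decides which tags a word ends up with, given the tags
-- supplied by morphosyntactic analysis and the guesser's proposal.
type Strategy = S.Set Tag -> Guess -> S.Set Tag

-- | Guess the k best interpretations, but only for unknown words
-- (words with an empty set of analysed tags).
guessUnknown :: Int -> Strategy
guessUnknown k known guess
  | S.null known = S.fromList (take k (bestFirst guess))
  | otherwise    = known

-- | Narrow down a known word to the tags the guesser rates above `th`;
-- falls back to the original set if that would leave nothing.
narrowKnown :: Double -> Strategy
narrowKnown th known guess
  | S.null kept = known
  | otherwise   = kept
  where
    kept = S.filter (\t -> M.findWithDefault 0 t guess > th) known

-- | Add every guessed tag whose probability exceeds the threshold.
aboveThreshold :: Double -> Strategy
aboveThreshold th known guess =
  known `S.union` M.keysSet (M.filter (> th) guess)

-- | Combine two strategies: feed the result of the first into the second.
andThen :: Strategy -> Strategy -> Strategy
andThen s1 s2 known guess = s2 (s1 known guess) guess

-- | Guessed tags ordered from the most to the least probable.
bestFirst :: Guess -> [Tag]
bestFirst = map fst . sortOn (Down . snd) . M.toList
```

For example, ``guessUnknown 10 `andThen` narrowKnown 0.05`` would first guess tags for OOV words and then drop the improbable ones.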

Bump dependencies

As with kawu/sgd#1, the dependency binary must be bumped. The same applies to array.

Guessing and disambiguation in one module

Concraft should provide one entry module and a command line tool for joint guessing/disambiguation functionality.

Installation failure

Reported by Adam Radziszewski:

The command cabal install concraft-pl spent quite a long time compiling various dependencies and eventually failed on crf-chain2-tiers-0.1.1:

src/Data/CRF/Chain2/Tiers/Dataset/Codec.hs:31:8:
Could not find module `Control.Comonad.Trans.Store'
It is a member of the hidden package `comonad-4.0'.
Perhaps you need to add `comonad' to the build-depends in your .cabal file.
Use -v to see a list of the files searched for.
cabal: Error: some packages failed to install:
concraft-0.8.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
concraft-pl-0.3.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
crf-chain2-tiers-0.1.1 failed during the building phase. The exception was:
ExitFailure 1

Just in case, I am attaching what gets printed to stdout with the -v option.
I am using 64-bit Ubuntu 12.04. Perhaps some of the packages here are very old and that is where the problem comes from. Still, maybe you will be able to suggest something; I do not even know the jargon (e.g. "hidden package").

Change the `Analyse` type

Currently, the Analyse type allows weights to be attached to the analysis results, which seems unfounded. The analysis function doesn't have to know anything about the weights of individual tags.

On the other hand, the weights could perhaps be used as a priori probability distributions. If you decide to implement this, don't forget to describe it in the documentation.

Combine guessing results with disambiguation data properly

Results of guessing are not presented correctly in the output file:

.........
Łostowickim space
    None ign
    None adj:sg:loc:m3:pos disamb
    None interj disamb
    None subst:sg:gen:f disamb
    None subst:sg:inst:m1 disamb
    .........
.........

Therefore, information about disambiguation is lost.

Add info that the training output is only "pictorial"

In particular, it is overoptimistic, because the correct tags are always added to the intermediary dataset (the one that results from guessing tags for OOV words).

Discard the set of unknown tags

The guessing model (see the NLP.Concraft.DAG.Guess module) includes the unkTagSet attribute, which is no longer needed. It is now replaced by the "label complexification" function, which allows for better performance.

Note that once this attribute is removed, a new model will have to be trained (or, alternatively, the old model will have to be adapted).

Replace OOV in documentation with `Word.oov`

It will make it immediately obvious what we mean by OOV in the documentation.

DAG: discard unlikely tags

It should be possible to discard unlikely tags at various stages of processing.

This is related to the strategy of coarse-to-fine pruning. The most unlikely tags could already be discarded at the guessing stage or, at the latest, at the segmentation stage. Guessing may be too early, given the first-order nature of the underlying model, unless we can better control the model's confidence and let it make only those decisions of which it is 100% sure.

Note that discarded tags should still be visible in the tagging output.
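
A minimal sketch of how pruning could keep discarded tags visible: instead of deleting a tag, attach a status to it. The `Status` type, the function names and the threshold parameter are all hypothetical and only illustrate the idea.

```haskell
import qualified Data.Map.Strict as M

type Tag = String

-- | Hypothetical status attached to each tag instead of deleting it.
data Status = Active | Pruned
  deriving (Eq, Show)

-- | Mark tags whose probability falls below `eps` as pruned.  Pruned tags
-- stay in the map, so they can still be printed in the tagging output.
prune :: Double -> M.Map Tag Double -> M.Map Tag (Double, Status)
prune eps = M.map (\p -> (p, if p < eps then Pruned else Active))

-- | Tags that later, more expensive stages still need to consider.
active :: M.Map Tag (Double, Status) -> [Tag]
active m = [t | (t, (_, Active)) <- M.toList m]
```

Later stages would iterate over `active` only, while the output writer would walk the whole map.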

Move model implementation from crf-chain2-generic library to here

The implementation of the pair model used within Concraft lives in the external crf-chain2-generic library. This is wrong: a generic CRF library should not contain concrete model implementations, apart from simple, basic ones.

Stack space overflow

Training the new, tiered model (i.e. a model with an arbitrary number of layers) on big data results in a stack space overflow.

Improve README

The README should contain the information needed to install, configure and use the Concraft tool.

DAG: show min/max model values when training

This would allow the user to see when things go wrong.
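
A sketch of what such a report could compute, assuming the model parameters are available as a list of Double values after each iteration. `Summary`, `summarize` and `report` are hypothetical names, not existing Concraft functions. NaN and infinite values are counted separately, since they are usually the first sign that training is diverging.

```haskell
import Data.List (foldl')

-- | Summary of the parameter vector after a training iteration.
data Summary = Summary
  { minVal   :: !Double  -- ^ smallest finite value
  , maxVal   :: !Double  -- ^ largest finite value
  , badCount :: !Int     -- ^ number of NaN or infinite values
  } deriving (Eq, Show)

-- | One strict pass over the parameters.  For an empty or all-bad input,
-- minVal stays at +Infinity and maxVal at -Infinity.
summarize :: [Double] -> Summary
summarize = foldl' step (Summary (1 / 0) (-1 / 0) 0)
  where
    step (Summary lo hi bad) w
      | isNaN w || isInfinite w = Summary lo hi (bad + 1)
      | otherwise               = Summary (min lo w) (max hi w) bad

-- | One line of training output for a given iteration number.
report :: Int -> [Double] -> String
report iter ws =
  "iter " ++ show iter
    ++ ": min=" ++ show (minVal s)
    ++ " max=" ++ show (maxVal s)
    ++ " bad=" ++ show (badCount s)
  where
    s = summarize ws
```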

Abstract format

The core library should not be restricted to a particular data format (currently the plain format) but should rather work with an abstract format representation, which concrete formats can implement later.
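
One possible shape of such an abstraction is a record of conversion functions against which the core is written once. Everything below is a hypothetical sketch: the `Token` and `Format` types are simplified, and `toyFormat` is a made-up format for illustration, not the actual plain format.

```haskell
type Tag = String

-- | Minimal sentence model the core would work with (hypothetical).
data Token = Token { orth :: String, tags :: [Tag] }
  deriving (Eq, Show)

type Sent = [Token]

-- | A concrete data format is just a pair of conversion functions.
data Format = Format
  { parseDoc :: String -> [Sent]
  , showDoc  :: [Sent] -> String
  }

-- | Core functionality is written once, against 'Format'.
tagWith :: Format -> (Sent -> Sent) -> String -> String
tagWith fmt tagger = showDoc fmt . map tagger . parseDoc fmt

-- | Toy format: one token per line as "orth<TAB>tag1 tag2 ...",
-- with an empty line after each sentence.
toyFormat :: Format
toyFormat = Format parse render
  where
    parse = map (map token) . filter (not . null) . splitOn null . lines
    token l =
      let (o, rest) = break (== '\t') l
      in  Token o (words (drop 1 rest))
    render = unlines . concatMap (\s -> map line s ++ [""])
    line (Token o ts) = o ++ "\t" ++ unwords ts

-- | Split a list on the elements that satisfy the predicate.
splitOn :: (a -> Bool) -> [a] -> [[a]]
splitOn p xs = case break p xs of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn p rest
```

A new format would then only need to supply its own `Format` value; the tagging code would stay unchanged.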

logFloat: argument out of range

Command:

concraft-guess train train.plain -e eval.plain --igntag ign

sometimes ends with the following error message after a few iterations:

concraft-guess: Data.Number.LogFloat.logFloat: argument out of range
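
According to the logfloat documentation, logFloat throws on negative and NaN inputs. The issue does not identify where such a value comes from; plausible candidates (an assumption, not a diagnosis) are a tiny negative number produced by floating-point rounding or a NaN from diverging parameters. The sketch below uses only base and hypothetical names (`Bad`, `checkProb`, `firstBad`) to show how intermediate values could be validated before conversion, so that the offending value and its position are reported.

```haskell
-- | Why a value would be rejected by a non-negative, non-NaN check.
data Bad = IsNaN | IsNegative Double
  deriving (Eq, Show)

-- | Validate a normal-domain value before converting it to the log domain.
checkProb :: Double -> Either Bad Double
checkProb x
  | isNaN x   = Left IsNaN
  | x < 0     = Left (IsNegative x)
  | otherwise = Right x

-- | Find the first offending value, together with its position, in a
-- list of (hypothetical) intermediate results.
firstBad :: [Double] -> Maybe (Int, Bad)
firstBad xs =
  case [(i, b) | (i, Left b) <- zip [0 ..] (map checkProb xs)] of
    []      -> Nothing
    (h : _) -> Just h
```

If the culprit turns out to be rounding noise, clamping with `max 0` before the conversion would be one option; a NaN would instead point at the training procedure itself.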

Extract core functionality into a separate library

Core functionality should be extracted into a separate core library. Right now, abstractions and Polish-specific functions are mixed together in one library.

Motivation: when designing a version of Concraft adapted to another language, it should be possible to rely only on the core library and not on dependencies and functions specific to the Polish language.

Consider extending interpretations of known words

Motivation: we could easily correct some of the false-negative errors made by the morphosyntactic analysis function.

It should be relatively easy to change the behavior of the Concraft.Guess.include function so that not only OOV words are extended with guessed interpretations, but also (in some specific situations, e.g. when a guessed interpretation is much more probable than the given ones?) known words.

More generally, it should be possible to specify how the guessing results are merged with the input data.
