kawu / concraft
A morphosyntactic disambiguation library based on constrained conditional random fields
License: BSD 2-Clause "Simplified" License
Replace Text with AttrVal where appropriate.
During installation of Nerf's dependency (monad-par) cabal is not able to correctly determine version numbers of individual packages. The following solution seems to work:
rm -r -f ~/.cabal
rm -r -f ~/.ghc
cabal update
cabal install parallel
cabal install nerf
(I don't know whether the first two commands are necessary, but they are certainly dangerous: they delete the user's entire local cabal and GHC package state.)
Check whether this problem still occurs with newer cabal/GHC versions.
Right now, the "phi" function is the bottleneck that slows down computations substantially. The main reason lies in the representation of the (feature -> feature index) map. We use the standard "Data.Map" container for that, but it is too slow (profiling reported that the program spends ~30% of its time in the "phi" function).
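To make the cost concrete, here is a minimal sketch of a (feature -> feature index) map kept in Data.Map. The Feature type, mkIxMap and weightOf are hypothetical names invented for this illustration and are not taken from the library; the point is only that every lookup walks a balanced tree and compares feature keys at each step, which a function called once per feature would pay repeatedly.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical feature type (an assumption, not the library's definition).
data Feature
    = OFeature Int Int   -- observation, label
    | TFeature Int Int   -- previous label, current label
    deriving (Show, Eq, Ord)

type FeatIx = Int

-- Number the distinct features in ascending order.
mkIxMap :: [Feature] -> M.Map Feature FeatIx
mkIxMap fs = M.fromList (zip (S.toAscList (S.fromList fs)) [0 ..])

-- Weight of a single feature: one O(log n) Data.Map lookup (with key
-- comparisons on Feature) followed by an IntMap lookup of the parameter.
-- Features missing from the map get weight 0.
weightOf :: M.Map Feature FeatIx -> IM.IntMap Double -> Feature -> Double
weightOf ixMap params ft = case M.lookup ft ixMap of
    Nothing -> 0
    Just ix -> IM.findWithDefault 0 ix params
```

Profiling a function shaped like weightOf would attribute most of its time to M.lookup, which matches the kind of hotspot described above.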
Potential solutions:
Running Concraft on data with no disambiguation annotations results in the following runtime error: concraft: mkProb: no elements with positive probability.
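The message suggests that a probability distribution is being built from a collection in which no element has positive weight. As a minimal sketch of how such a case could be reported to the caller instead of crashing, the hypothetical mkProbSafe below returns Nothing for that input. It is an illustration under that assumption, not the library's mkProb.

```haskell
import qualified Data.Map.Strict as M

-- | Normalise weights into a probability distribution.
-- Non-positive (and NaN) weights are dropped. Returns Nothing when no
-- element has positive weight, instead of raising a runtime error.
mkProbSafe :: Ord a => [(a, Double)] -> Maybe (M.Map a Double)
mkProbSafe xs
    | total > 0 = Just (M.map (/ total) positive)
    | otherwise = Nothing
  where
    positive = M.fromListWith (+) (filter ((> 0) . snd) xs)
    total    = sum (M.elems positive)
```

A caller could then decide what to do with unannotated input, for example skip the sentence or print a clear error naming the offending word.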
Library should provide different guessing strategies, for example:
As with kawu/sgd#1, the dependency binary must be bumped. The same applies to array.
Concraft should provide one entry module and a command line tool for joined guessing/disambiguation functionality.
Reported by Adam Radziszewski:
The command cabal install concraft-pl spent quite a long time compiling various dependencies and finally failed on crf-chain2-tiers-0.1.1:
src/Data/CRF/Chain2/Tiers/Dataset/Codec.hs:31:8:
Could not find module `Control.Comonad.Trans.Store'
It is a member of the hidden package `comonad-4.0'.
Perhaps you need to add `comonad' to the build-depends in your .cabal file.
Use -v to see a list of the files searched for.
cabal: Error: some packages failed to install:
concraft-0.8.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
concraft-pl-0.3.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
crf-chain2-tiers-0.1.1 failed during the building phase. The exception was:
ExitFailure 1
Just in case, I am attaching what is printed to stdout with the -v option.
I am using 64-bit Ubuntu 12.04. The packages here may be very old, and that may be the cause. Still, perhaps you will be able to suggest something; I don't even know the jargon (e.g. "hidden package").
Currently, the Analyse type allows weights to be added to the analysis results, which seems unfounded. The analysis function doesn't need to know anything about the weights of the individual tags.
On the other hand, perhaps weights could be used as a priori probability distributions. If you decide to implement this, don't forget to add appropriate info in docs.
Results of guessing are not presented correctly in the output file:
.........
Łostowickim space
None ign
None adj:sg:loc:m3:pos disamb
None interj disamb
None subst:sg:gen:f disamb
None subst:sg:inst:m1 disamb
.........
.........
Therefore, information about disambiguation is lost.
In particular, it is overoptimistic, because correct tags are always added to the intermediary dataset (which arises from guessing tags for OOV words).
The guessing model (see the NLP.Concraft.DAG.Guess module) includes the unkTagSet attribute, which is no longer needed. It is now replaced by the "label complexification" function, which allows for better performance.
Note that once this attribute is removed, a new model will have to be trained (or, alternatively, the old model will have to be adapted).
It will make it immediately obvious what we mean by OOV in the documentation.
The possibility of discarding unlikely tags at various stages of processing should be allowed.
This is related to the strategy of coarse-to-fine pruning. The most unlikely tags could already be discarded at the guessing stage, or at least at the segmentation stage (the guessing stage could be too early, given the first-order nature of the underlying model, unless we can better control the model's confidence and let it make only those decisions of which it is 100% sure).
Note that discarded tags should still be visible in the tagging output.
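One way to reconcile pruning with the requirement that discarded tags stay visible is to mark tags rather than delete them. The sketch below assumes a per-tag probability is available at the pruning stage; Scored, prune and survivors are hypothetical names introduced only for this illustration.

```haskell
-- | A tag with its probability and a flag telling whether it survived pruning.
data Scored t = Scored
    { tag    :: t
    , prob   :: Double
    , active :: Bool
    } deriving (Show, Eq)

-- | Mark tags whose probability falls below the threshold as inactive.
-- Nothing is removed, so the tagging output can still list every tag.
prune :: Double -> [(t, Double)] -> [Scored t]
prune threshold = map mark
  where
    mark (t, p) = Scored t p (p >= threshold)

-- | Tags that later, more expensive stages should still consider.
survivors :: [Scored t] -> [t]
survivors xs = [tag x | x <- xs, active x]
```

Later stages would work on survivors only, while the output writer would iterate over the full list and print inactive tags as non-selected.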
The implementation of the pair model used within Concraft lives in the external crf-chain2-generic library. This is wrong: a generic CRF library should not contain concrete implementations, apart from simple and basic ones.
Training the new, tiered model (i.e. model with arbitrary number of layers) on big data results in stack space overflow.
README should contain some information needed to install, configure and use the Concraft tool.
This would allow the user to see when things go wrong.
The core library should not be restricted to a particular data format (currently the plain format) but should rather work with an abstract format representation that concrete formats can implement later.
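A minimal sketch of what such an abstraction could look like: the core code is written against a record of parsing and printing functions, and each concrete format supplies its own record. Token, Format, simpleFormat and processWith are all hypothetical, and the tab-separated toy format below is invented for the example; it is not Concraft's plain format.

```haskell
-- | Hypothetical minimal view of a word: its orthographic form and tags.
data Token = Token
    { orth :: String
    , tags :: [String]
    } deriving (Show, Eq)

-- | Operations a concrete format has to provide.
data Format = Format
    { parseSent :: String -> [Token]
    , showSent  :: [Token] -> String
    }

-- | Toy line-based format: one word per line, "orth<TAB>tag1 tag2 ...".
simpleFormat :: Format
simpleFormat = Format
    { parseSent = map parseLine . lines
    , showSent  = unlines . map showLine
    }
  where
    parseLine l = case break (== '\t') l of
        (o, rest) -> Token o (words (drop 1 rest))
    showLine (Token o ts) = o ++ "\t" ++ unwords ts

-- | Format-independent processing: works with any Format value.
processWith :: Format -> ([Token] -> [Token]) -> String -> String
processWith fmt f = showSent fmt . f . parseSent fmt
```

A type class would serve equally well; a plain record is used here because it lets a command-line tool choose the format at runtime.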
Command:
concraft-guess train train.plain -e eval.plain --igntag ign
sometimes ends with the following error message after a few iterations:
concraft-guess: Data.Number.LogFloat.logFloat: argument out of range
Core functionality should be extracted into a separate core library. Right now, abstractions and Polish-specific functions are mixed together in one library.
Motivation: when designing a version of Concraft adapted to another language, it should be possible to rely only on the core library and not on dependencies and functions specific to Polish.
Motivation: we could easily correct some of the false-negative errors made by the morphosyntactic analysis function.
It should be relatively easy to change the behavior of the Concraft.Guess.include function so that guessed interpretations are added not only to OOV words but also, in some specific situations, to known words (e.g. when a guessed interpretation is much more probable than the given ones?).
More generally, it should be possible to specify how the guessing results are merged with the input data.
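The request can be pictured as turning the merge step into a parameter. The sketch below is an assumption about how that might look, with hypothetical names (Merge, oovOnly, confident) and a simplified view of a word as a list of tags plus a guessed tag distribution; it does not reproduce the actual signature of Concraft.Guess.include.

```haskell
import qualified Data.Map.Strict as M

type Tag  = String
type Dist = M.Map Tag Double   -- guessed tag -> probability

-- | How guessed tags are combined with the tags already on a word.
-- Arguments: is the word OOV, the given tags, the guessed distribution.
type Merge = Bool -> [Tag] -> Dist -> [Tag]

-- | Only OOV words receive guessed tags; known words are left unchanged.
oovOnly :: Merge
oovOnly isOov given guessed
    | isOov     = M.keys guessed
    | otherwise = given

-- | Known words are additionally extended with guessed tags whose
-- probability reaches the given minimum.
confident :: Double -> Merge
confident minProb isOov given guessed
    | isOov     = M.keys guessed
    | otherwise = given ++
        [t | (t, p) <- M.toList guessed, p >= minProb, t `notElem` given]
```

With this shape, a caller would pass oovOnly to keep the OOV-only behaviour or confident 0.8 to let highly probable guesses correct false negatives of the analyser; the 0.8 threshold is purely illustrative.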