concraft's Issues

Problem with installation

During installation of Nerf's dependency (monad-par) cabal is not able to correctly determine version numbers of individual packages. The following solution seems to work:

rm -r -f ~/.cabal
rm -r -f ~/.ghc
cabal update
cabal install parallel
cabal install nerf

(I don't know whether the first two commands are necessary, but they are certainly dangerous: they wipe the user's cabal directory and GHC package database.)

Check if this problem occurs with newer cabal/GHC versions.

Consider different feature storage methods

Right now, the "phi" function is the bottleneck that slows down computations substantially. The main reason lies in the representation of the (feature -> feature index) map. We use the standard "Data.Map" container for that, but it is too slow (profiling reported that the program spends ~30% of its time in the "phi" function).

Potential solutions:

  • Use a hashmap (or similar) container instead.
    • Unfortunately, the hashmap containers available on Hackage do not provide Binary instances.
  • Use an array (or a set of arrays).
    • This is a viable solution only when the number of labels is limited within each individual tagging layer.
  • Best of all: handle different model representation back-ends.
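The missing Binary instances need not rule out the hashmap option. A minimal sketch, assuming the unordered-containers, hashable and binary packages: a newtype wrapper serialises the map through its association list. The names FeatMap, mkFeatMap, index and roundTrip are hypothetical and do not come from the Concraft code base.

```haskell
import           Data.Binary         (Binary (..), decode, encode)
import qualified Data.HashMap.Strict as H
import           Data.Hashable       (Hashable)

-- | Hypothetical wrapper around a (feature -> feature index) hash map.
newtype FeatMap f = FeatMap (H.HashMap f Int)

-- | Serialise through the association list, so no Binary instance
-- for HashMap itself is required.
instance (Eq f, Hashable f, Binary f) => Binary (FeatMap f) where
  put (FeatMap m) = put (H.toList m)
  get             = FeatMap . H.fromList <$> get

-- | Assign consecutive indices (starting from 0) to the given features.
-- If a feature occurs more than once, its last index wins.
mkFeatMap :: (Eq f, Hashable f) => [f] -> FeatMap f
mkFeatMap fs = FeatMap (H.fromList (zip fs [0 ..]))

-- | Look up the index of a feature, if it is known.
index :: (Eq f, Hashable f) => FeatMap f -> f -> Maybe Int
index (FeatMap m) f = H.lookup f m

-- | Encode and decode again; useful for checking the instance.
roundTrip :: (Eq f, Hashable f, Binary f) => FeatMap f -> FeatMap f
roundTrip = decode . encode
```

Whether this is actually faster than Data.Map for the "phi" function would have to be confirmed by profiling; the sketch only shows that serialisation is not a blocker.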

Tagging runtime error

Running Concraft on data with no disambiguation annotations results in the following runtime error: concraft: mkProb: no elements with positive probability.
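The actual definition of mkProb is not shown here, so the following is only a hypothetical sketch of how a normalisation step could report this situation instead of crashing. The name mkProbSafe and its type are assumptions made for illustration.

```haskell
import qualified Data.Map.Strict as M

-- | Hypothetical total variant of a "make distribution" function:
-- normalise the positive weights so that they sum to 1, or return
-- Nothing when no element has a positive weight (for example, when
-- the input carries no disambiguation annotations).
mkProbSafe :: Ord a => [(a, Double)] -> Maybe (M.Map a Double)
mkProbSafe xs
  | total > 0 = Just (M.map (/ total) positive)
  | otherwise = Nothing
  where
    -- Sum the weights of repeated elements, then drop non-positive ones.
    positive = M.filter (> 0) (M.fromListWith (+) xs)
    total    = sum (M.elems positive)
```

A caller could then turn Nothing into a readable message that names the offending sentence or word.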

Guessing strategies

The library should provide different guessing strategies, for example:

  • Guess interpretations of unknown words.
  • Narrow down interpretations of known words.
  • Choose tags with probabilities higher than some arbitrary threshold.
  • Combine different strategies.
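The strategies listed above could be modelled as plain functions, which makes combining them a matter of function composition. This is a minimal sketch using only the containers package; the Strategy type and all function names are hypothetical, and an empty tag set is assumed to mean "unknown word".

```haskell
import           Data.List       (sortBy)
import qualified Data.Map.Strict as M
import           Data.Ord        (Down (..), comparing)
import qualified Data.Set        as S

-- | Probabilities assigned to tags by the guesser.
type Dist t = M.Map t Double

-- | A strategy maps the tags proposed by the analyser and the
-- guesser's distribution to a new set of tags.
type Strategy t = S.Set t -> Dist t -> S.Set t

-- | The k most probable tags of a distribution.
topK :: Ord t => Int -> Dist t -> S.Set t
topK k = S.fromList . map fst . take k
       . sortBy (comparing (Down . snd)) . M.toList

-- | Guess interpretations of unknown words only.
guessUnknown :: Ord t => Int -> Strategy t
guessUnknown k known dist
  | S.null known = topK k dist
  | otherwise    = known

-- | Narrow down the interpretations of a known word; keep the
-- original set if nothing reaches the threshold.
narrowKnown :: Ord t => Double -> Strategy t
narrowKnown thr known dist
  | S.null kept = known
  | otherwise   = kept
  where
    kept = S.filter (\t -> M.findWithDefault 0 t dist >= thr) known

-- | Choose every tag whose probability reaches the threshold.
aboveThreshold :: Ord t => Double -> Strategy t
aboveThreshold thr _ dist = M.keysSet (M.filter (>= thr) dist)

-- | Combine two strategies: run the first, then the second on its result.
andThen :: Strategy t -> Strategy t -> Strategy t
andThen f g known dist = g (f known dist) dist
```

For instance, guessUnknown 10 `andThen` narrowKnown 0.05 would first guess tags for unknown words and then prune low-probability tags for every word.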

Installation failure

Reported by Adam Radziszewski:

The command cabal install concraft-pl spent quite a long time compiling various dependencies, until it finally failed on crf-chain2-tiers-0.1.1:

src/Data/CRF/Chain2/Tiers/Dataset/Codec.hs:31:8:
Could not find module `Control.Comonad.Trans.Store'
It is a member of the hidden package `comonad-4.0'.
Perhaps you need to add `comonad' to the build-depends in your .cabal file.
Use -v to see a list of the files searched for.
cabal: Error: some packages failed to install:
concraft-0.8.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
concraft-pl-0.3.0 depends on crf-chain2-tiers-0.1.1 which failed to install.
crf-chain2-tiers-0.1.1 failed during the building phase. The exception was:
ExitFailure 1

Just in case, I am attaching what goes to stdout with the -v option.
I am using 64-bit Ubuntu 12.04. Perhaps the packages here are very old and that is the cause. Nevertheless, maybe you will be able to suggest something; I do not even know the jargon (e.g. "hidden package").

Change the `Analyse` type

Currently, the Analyse type allows weights to be attached to the analysis results, which seems unjustified. The analysis function doesn't need to know anything about the weights of individual tags.

On the other hand, perhaps weights could be used as a priori probability distributions. If you decide to implement this, don't forget to add appropriate info in docs.

Combine guessing results with disambiguation data properly

Results of guessing are not presented correctly in the output file:

.........
Łostowickim space
    None ign
    None adj:sg:loc:m3:pos disamb
    None interj disamb
    None subst:sg:gen:f disamb
    None subst:sg:inst:m1 disamb
    .........
.........

Therefore, information about disambiguation is lost.

Discard the set of unknown tags

The guessing model (see the NLP.Concraft.DAG.Guess module) includes the unkTagSet attribute, which is no longer needed. It is now replaced by the "label complexification" function, which allows for better performance.

Note that once this attribute is removed, a new model will have to be trained (or, alternatively, the old model will have to be adapted).

DAG: discard unlikely tags

It should be possible to discard unlikely tags at various stages of processing.

This is related to the strategy of coarse-to-fine pruning. The most unlikely tags could already be discarded at the guessing stage, or at least at the segmentation stage. Guessing could be too early, given the first-order nature of the underlying model, unless we can better control the model's confidence and let it make only those decisions of which it is 100% sure.

Note that discarded tags should still be visible in the tagging output.
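One way to keep discarded tags visible is to mark them rather than delete them. The sketch below assumes tags are stored in a map from tag to probability; the Status type and the functions prune and active are hypothetical names, not part of the NLP.Concraft.DAG modules.

```haskell
import qualified Data.Map.Strict as M

-- | Whether a tag still takes part in later processing stages.
data Status = Active | Discarded
  deriving (Eq, Show)

-- | Mark every tag whose probability is below the threshold as
-- discarded. Nothing is removed, so the output can still list
-- all tags together with their probabilities.
prune :: Double -> M.Map t Double -> M.Map t (Double, Status)
prune thr = M.map mark
  where
    mark p
      | p < thr   = (p, Discarded)
      | otherwise = (p, Active)

-- | Tags that later stages (e.g. disambiguation) should consider.
active :: M.Map t (Double, Status) -> [t]
active m = [t | (t, (_, Active)) <- M.toList m]
```

Later stages would iterate over active only, while the output writer would print the whole map.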

Stack space overflow

Training the new, tiered model (i.e. a model with an arbitrary number of layers) on large datasets results in a stack space overflow.

Improve README

The README should explain how to install, configure and use the Concraft tool.

Abstract format

The core library should not be restricted to a particular data format (the plain format right now) but should rather work with an abstract format representation, which can be implemented by concrete formats later.

logFloat: argument out of range

Command:

concraft-guess train train.plain -e eval.plain --igntag ign

sometimes ends with the following error message after a few iterations:

concraft-guess: Data.Number.LogFloat.logFloat: argument out of range
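As far as I understand, logFloat raises this error when its argument is negative or NaN, so a first debugging step could be to validate values before they reach the log-domain constructor. The helper below is a hypothetical sketch that uses only the Prelude; checkProb is not part of Concraft or of the logfloat package, and rejecting infinite values is a deliberate extra choice, since they usually signal an overflow during training.

```haskell
-- | Validate a value that is about to be converted to the log domain.
-- The label identifies where the value comes from, so that the error
-- message points at the computation that went wrong.
checkProb :: String -> Double -> Either String Double
checkProb label x
  | isNaN x      = Left (label ++ ": NaN")
  | isInfinite x = Left (label ++ ": infinite value")
  | x < 0        = Left (label ++ ": negative value " ++ show x)
  | otherwise    = Right x
```

Wrapping the suspicious computations with such a check should at least reveal which quantity goes out of range after a few iterations.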

Extract core functionality into a separate library

Core functionality should be extracted into a separate core library. Right now, abstractions and Polish-specific functions are mixed together in one library.

Motivation: when designing a version of Concraft adapted to another language, it should be possible to rely only on the core library and not on dependencies and functions specific to the Polish language.

Consider extending interpretations of known words

Motivation: we could easily correct some of the false-negative errors made by the morphosyntactic analysis function.

It should be relatively easy to change the behavior of the Concraft.Guess.include function so that not only OOV words are extended with guessed interpretations, but also (in some specific situations, e.g. when a guessed interpretation is much more probable than the given ones?) known words.

More generally, it should be possible to specify how the guessing results are merged with the input data.
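A configurable merge could look roughly as follows. This is a sketch under assumptions: the MergePolicy type and mergeGuessed function are hypothetical, an empty set of given tags stands for an OOV word, and "much more probable" is interpreted as exceeding the best given tag's probability by a fixed ratio.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- | How guessed interpretations are merged with the input data.
data MergePolicy
  = OovOnly                      -- ^ extend OOV words only
  | ExtendIfMoreProbable Double  -- ^ also extend known words (ratio)

-- | Merge the guesser's distribution into the given interpretations.
mergeGuessed :: Ord t => MergePolicy -> S.Set t -> M.Map t Double -> S.Set t
mergeGuessed policy given guessed
  | S.null given = M.keysSet guessed  -- OOV word: take all guessed tags
  | otherwise    = case policy of
      OovOnly                    -> given
      ExtendIfMoreProbable ratio ->
        S.union given (M.keysSet (M.filter (> ratio * best) extra))
  where
    prob t = M.findWithDefault 0 t guessed
    -- Probability of the most likely tag among the given ones.
    best   = maximum (0 : map prob (S.toList given))
    -- Guessed tags that are not among the given interpretations.
    extra  = M.withoutKeys guessed given
```

With ExtendIfMoreProbable 5, a known word gains a guessed tag only if that tag is more than five times as probable as the best of its given tags. Note that when every given tag has probability 0, any guessed tag with positive probability is added.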
