condor's Issues

Straightforward way to import parsed examples

Reasoning

Instead of having to explicitly perform parsing whenever we want to work with examples, it would be nice if that could all happen automatically under the hood, so that we could simply import a corpus just as we would any other module.

There are two primary options: either each corpus subdirectory gets its own index.ts, or we have a single interface in corpora/index.ts. The latter means less work when we add a corpus.

The latter is probably preferable in many ways, but it would be difficult or impossible to replicate the ability to import a corpus directly by name. If we provide a scaffolding script that adds the index.ts to each corpus, that extra work may not be a problem.
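For concreteness, the two import styles would compare roughly like this (the conversations corpus name and the loadCorpus helper are hypothetical illustrations):

// Option one: each corpus has its own index.ts, so it can be imported by name.
import conversations from "./corpora/conversations";

// Option two: everything goes through a single interface in corpora/index.ts.
import { loadCorpus } from "./corpora";
const conversationExamples = loadCorpus("conversations");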

Proposal

The first option looks something like the following: each corpus subdirectory gets an index.ts file along these lines:

import { parse, ParsingError } from "../parser";

// Collect the streamed examples into an array and expose them as a promise,
// so the corpus can be imported like any other module.
export default new Promise<unknown[]>(function (resolve, reject) {
  try {
    const data: unknown[] = [];
    const stream = parse(__dirname);
    stream.on("data", (example: unknown) => data.push(example));
    stream.on("end", () => resolve(data));
    stream.on("error", reject);

    // Give up if parsing has not finished within 15 seconds.
    setTimeout(() => reject(new ParsingError("timeout")), 15000);
  } catch (e) {
    reject(e);
  }
});

Possibly this code should be wrapped up in the parse function itself, so that we only need to supply the __dirname value.
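As a sketch of that refactor, a hypothetical parseCorpus wrapper in parser.ts would hide the stream plumbing, and each corpus index.ts would shrink to one line:

// parser.ts — hypothetical promise-returning wrapper around the existing stream API.
export function parseCorpus(dir: string): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const data: unknown[] = [];
    const stream = parse(dir);
    stream.on("data", (example: unknown) => data.push(example));
    stream.on("end", () => resolve(data));
    stream.on("error", reject);
  });
}

// corpora/<name>/index.ts
import { parseCorpus } from "../parser";
export default parseCorpus(__dirname);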

Convert dictionary modules to typed constants, interfaces and enums

Reasoning

Using native TypeScript features like enums, typed constants, and interfaces will reduce errors and provide values that are simple to parse. In particular, Common.js and Capitalization.js would benefit from the TypeScript treatment.

Proposal

export enum PartsOfSpeech {
  DETERMINER,
  VERB,
  // ...
}
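The enum could be paired with interfaces and typed constants for the dictionary data itself; the DictionaryEntry and COMMON_WORDS names below are hypothetical illustrations, not existing modules:

export interface DictionaryEntry {
  word: string;
  partOfSpeech: PartsOfSpeech;
}

// Typed constant: the compiler rejects malformed entries and unknown tags.
export const COMMON_WORDS: ReadonlyArray<DictionaryEntry> = [
  { word: "the", partOfSpeech: PartsOfSpeech.DETERMINER },
  { word: "run", partOfSpeech: PartsOfSpeech.VERB },
];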

Improved training corpus data format

Reasoning

Having looked at several projects in the space, plain part-of-speech tagging typically comes as a sequence of pairs, e.g. [["Hello", "INTERJECTION"], ["world", "NOUN"]]. In addition, it may be an option to store a version of each word that has undergone either lemmatization or word-stemming.

Lemmatization does things properly and looks up root words. Word-stemming simply chops off endings; it achieves some of the benefits of lemmatization with less accuracy, but it can be automated.

Instead of storing this information, one option is to perform word-stemming (with a memoized function) on the fly as we parse training data. We cannot do the same for lemmatization, since one word can have multiple root words depending on context, e.g. refuse -> ["to decline", "waste items"].
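A minimal sketch of on-the-fly memoized stemming; the naive suffix-chopping stem function below is a stand-in for a real stemmer:

const stemCache = new Map<string, string>();

// Placeholder stemmer: a real implementation would use e.g. the Porter algorithm.
function stem(word: string): string {
  return word.replace(/(ing|ed|s)$/, "");
}

export function memoizedStem(word: string): string {
  const cached = stemCache.get(word);
  if (cached !== undefined) return cached;
  const result = stem(word);
  stemCache.set(word, result);
  return result;
}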

Given that the PoS data will be provided manually when entering examples, it would be fair enough to provide lemmas the same way.

There is, however, a strong argument for using the TSV format: we can add new columns as we choose to store more data, the files stay easy to edit, and if we write the parsing correctly, additional columns won't break (or even affect) the existing code, so we can record extra information for some time before we start using it. The remaining question is whether we store examples in entirely separate TSV files, or use e.g. blank rows to separate examples.

An example could also simply be a sentence, in which case it would be nice to have multiple examples in one file, so that related sentences stay in one place as we add them.

Proposal

A file of tab-separated values where sentences (examples) are separated by blank rows.

word    part of speech    lemma
car     NN                AUTOMOBILE
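A sketch of a parser for this layout, under the assumptions above: blank rows delimit examples, a single header row is skipped, and extra columns are ignored so that added data cannot break existing code. The Token shape is hypothetical:

interface Token {
  word: string;
  partOfSpeech: string;
  lemma?: string;
}

export function parseTsvCorpus(contents: string): Token[][] {
  const examples: Token[][] = [];
  let current: Token[] = [];
  for (const line of contents.split("\n").slice(1)) { // skip the header row
    if (line.trim() === "") {
      // Blank row: close off the current example, if any.
      if (current.length > 0) {
        examples.push(current);
        current = [];
      }
      continue;
    }
    // Take only the columns we know about; extra columns are ignored.
    const [word, partOfSpeech, lemma] = line.split("\t");
    current.push({ word, partOfSpeech, lemma });
  }
  if (current.length > 0) examples.push(current);
  return examples;
}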

Tool to automatically generate additional training corpus

Reasoning

In addition to having high-quality, manually tagged examples, it would be really useful to have the ability to generate a larger corpus. Obviously, if we use lemmatization and part-of-speech tagging tools to make predictions about our examples, this corpus will contain quite a few errors, but as a whole it is likely still more useful than not. It would also be interesting to compare performance between the two corpora.

It would also be important to have an easy way to feed in the free text to transform, from different sources: perhaps as a command line argument, with the ability to use unix pipes as well. We could even run it as an ongoing service that processes incoming data; that way a scraping service could, for example, continuously serve up new text data.
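For the input side, a sketch of accepting either a file path argument or piped stdin (Node-style):

import { readFileSync } from "fs";

// If a path is given as the first CLI argument, read it; otherwise read
// stdin (file descriptor 0), which is what a unix pipe delivers.
const source = process.argv[2];
const text = source ? readFileSync(source, "utf8") : readFileSync(0, "utf8");
console.log(`received ${text.length} characters of free text`);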

Proposal

  • Web page + service to enter free text in a form, or POST directly. Fires off a Resque job which will execute a terminal command to run nltk using spawn, exec or system and create corpus files (a sketch of this step follows the list).
  • A Resque (resque-scheduler) job to grab articles from some sources (TBD) every 5 minutes and submit them to the parsing service.
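Whatever runs the job, the terminal-command step of the first bullet might look roughly like this (shown in Node/TypeScript for consistency with the rest of the project; the tag_corpus.py script name and its arguments are hypothetical):

import { spawn } from "child_process";

// Run the (hypothetical) nltk tagging script in a child process.
const child = spawn("python3", ["tag_corpus.py", "--out", "corpora/generated"]);
child.stdout.on("data", (chunk) => process.stdout.write(chunk));
child.stderr.on("data", (chunk) => process.stderr.write(chunk));
child.on("close", (code) => {
  if (code !== 0) {
    console.error(`corpus generation failed with exit code ${code}`);
  }
});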
