condor's Issues

Straightforward way to import parsed examples

Reasoning

Instead of having to explicitly perform parsing whenever we want to work with examples, it would be nice if that could all happen automatically under the hood, so that we could simply import a corpus just as we would any other module.

There are two primary options: either each corpus subdirectory gets its own index.ts, or we have a single interface in corpora/index.ts. The latter means less work when we add a corpus.

The latter is probably preferable in many ways, but it would be difficult or impossible to replicate the ability to import a corpus directly by name. If we provide a scaffolding script that adds the index.ts to each corpus, that extra work may not be a problem.
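For concreteness, the two import styles would compare roughly like this (the conversations corpus name and the loadCorpus helper are hypothetical illustrations):

// Option one: each corpus has its own index.ts, so it can be imported by name.
import conversations from "./corpora/conversations";

// Option two: everything goes through a single interface in corpora/index.ts.
import { loadCorpus } from "./corpora";
const conversationExamples = loadCorpus("conversations");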

Proposal

The first option looks something like the following: each corpus subdirectory gets an index.ts file along these lines:

import { parse, ParsingError } from "../parser";

// Collect the streamed examples into an array and expose them as a promise,
// so the corpus can be imported like any other module.
export default new Promise<unknown[]>(function (resolve, reject) {
  try {
    const data: unknown[] = [];
    const stream = parse(__dirname);
    stream.on("data", (example: unknown) => data.push(example));
    stream.on("end", () => resolve(data));
    stream.on("error", reject);

    // Give up if parsing has not finished within 15 seconds.
    setTimeout(() => reject(new ParsingError("timeout")), 15000);
  } catch (e) {
    reject(e);
  }
});

Possibly this code should be wrapped up in the parse function itself, so that we only need to supply the __dirname value.
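As a sketch of that refactor, a hypothetical parseCorpus wrapper in parser.ts would hide the stream plumbing, and each corpus index.ts would shrink to one line:

// parser.ts — hypothetical promise-returning wrapper around the existing stream API.
export function parseCorpus(dir: string): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const data: unknown[] = [];
    const stream = parse(dir);
    stream.on("data", (example: unknown) => data.push(example));
    stream.on("end", () => resolve(data));
    stream.on("error", reject);
  });
}

// corpora/<name>/index.ts
import { parseCorpus } from "../parser";
export default parseCorpus(__dirname);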

Convert dictionary modules to typed constants, interfaces and enums

Reasoning

Using native TypeScript features like enums, typed constants, and interfaces will reduce errors and provide values that are simple to parse. In particular, Common.js and Capitalization.js would benefit from the TypeScript treatment.

Proposal

export enum PartsOfSpeech {
  DETERMINER,
  VERB,
  // ...
}
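The enum could be paired with interfaces and typed constants for the dictionary data itself; the DictionaryEntry and COMMON_WORDS names below are hypothetical illustrations, not existing modules:

export interface DictionaryEntry {
  word: string;
  partOfSpeech: PartsOfSpeech;
}

// Typed constant: the compiler rejects malformed entries and unknown tags.
export const COMMON_WORDS: ReadonlyArray<DictionaryEntry> = [
  { word: "the", partOfSpeech: PartsOfSpeech.DETERMINER },
  { word: "run", partOfSpeech: PartsOfSpeech.VERB },
];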

Improved training corpus data format

Reasoning

Having looked at several projects in the space, plain part-of-speech tagging typically comes as a sequence of pairs, e.g. [["Hello", "INTERJECTION"], ["world", "NOUN"]]. In addition, it may be an option to store a version of each word that has undergone either lemmatization or word-stemming.

Lemmatization does things properly and looks up root words. Word-stemming simply chops off endings; it achieves some of the benefits of lemmatization with less accuracy, but it can be automated.

Instead of storing this information, one option is to perform word-stemming (with a memoized function) on the fly as we parse training data. We cannot do the same for lemmatization, since one word can have multiple root words depending on context, e.g. refuse -> ["to decline", "waste items"].
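A minimal sketch of on-the-fly memoized stemming; the naive suffix-chopping stem function below is a stand-in for a real stemmer:

const stemCache = new Map<string, string>();

// Placeholder stemmer: a real implementation would use e.g. the Porter algorithm.
function stem(word: string): string {
  return word.replace(/(ing|ed|s)$/, "");
}

export function memoizedStem(word: string): string {
  const cached = stemCache.get(word);
  if (cached !== undefined) return cached;
  const result = stem(word);
  stemCache.set(word, result);
  return result;
}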

Given that the PoS data will be provided manually when entering examples, it would be fair enough to provide lemmas the same way.

There is, however, a strong argument for using the TSV format: we can add new columns as we choose to store more data, the files stay easy to edit, and if we write the parsing correctly, additional columns won't break (or even affect) the existing code, so we can record extra information for some time before we start using it. The remaining question is whether we store examples in entirely separate TSV files, or use e.g. blank rows to separate examples.

An example could also simply be a sentence, in which case it would be nice to have multiple examples in one file, so that related sentences stay in one place as we add them.

Proposal

A file of tab-separated values where sentences (examples) are separated by blank rows.

word    part of speech    lemma
car     NN                AUTOMOBILE
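A sketch of a parser for this layout, under the assumptions above: blank rows delimit examples, a single header row is skipped, and extra columns are ignored so that added data cannot break existing code. The Token shape is hypothetical:

interface Token {
  word: string;
  partOfSpeech: string;
  lemma?: string;
}

export function parseTsvCorpus(contents: string): Token[][] {
  const examples: Token[][] = [];
  let current: Token[] = [];
  for (const line of contents.split("\n").slice(1)) { // skip the header row
    if (line.trim() === "") {
      // Blank row: close off the current example, if any.
      if (current.length > 0) {
        examples.push(current);
        current = [];
      }
      continue;
    }
    // Take only the columns we know about; extra columns are ignored.
    const [word, partOfSpeech, lemma] = line.split("\t");
    current.push({ word, partOfSpeech, lemma });
  }
  if (current.length > 0) examples.push(current);
  return examples;
}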

Tool to automatically generate additional training corpus

Reasoning

In addition to having high-quality, manually tagged examples, it would be really useful to have the ability to generate a larger corpus. Obviously, if we use lemmatization and part-of-speech tagging tools to make predictions about our examples, this corpus will contain quite a few errors, but as a whole it is likely still more useful than not. It would also be interesting to compare performance between the two corpora.

It would also be important to have an easy way to feed in the free text to transform, from different sources: perhaps as a command line argument, with the ability to use unix pipes as well. We could even run it as an ongoing service that processes incoming data; that way a scraping service could, for example, continuously serve up new text data.
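For the input side, a sketch of accepting either a file path argument or piped stdin (Node-style):

import { readFileSync } from "fs";

// If a path is given as the first CLI argument, read it; otherwise read
// stdin (file descriptor 0), which is what a unix pipe delivers.
const source = process.argv[2];
const text = source ? readFileSync(source, "utf8") : readFileSync(0, "utf8");
console.log(`received ${text.length} characters of free text`);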

Proposal

  • Web page + service to enter free text in a form, or POST directly. Fires off a Resque job which will execute a terminal command to run nltk using spawn, exec or system and create corpus files (a sketch of this step follows the list).
  • A Resque (resque-scheduler) job to grab articles from some sources (TBD) every 5 minutes and submit them to the parsing service.
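Whatever runs the job, the terminal-command step of the first bullet might look roughly like this (shown in Node/TypeScript for consistency with the rest of the project; the tag_corpus.py script name and its arguments are hypothetical):

import { spawn } from "child_process";

// Run the (hypothetical) nltk tagging script in a child process.
const child = spawn("python3", ["tag_corpus.py", "--out", "corpora/generated"]);
child.stdout.on("data", (chunk) => process.stdout.write(chunk));
child.stderr.on("data", (chunk) => process.stderr.write(chunk));
child.on("close", (code) => {
  if (code !== 0) {
    console.error(`corpus generation failed with exit code ${code}`);
  }
});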
