mikehopcroft / tokenflow
Data driven tokenizer for entities, attributes, quantifiers, numbers and intents in NLP scenarios.
License: MIT License
The query "I'd like a horn" doesn't match "very loud train horn".
Use AJV for schema validation. This should replace code in src/utilities/type-checking.ts.
Use s.split(/\s+/) instead of s.split(' ') so that tabs and runs of spaces are treated as a single separator.
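A minimal sketch of the difference between the two calls. The helper names `splitOnSpace` and `splitOnWhitespace` are hypothetical and not part of tokenflow, and the `trim()` step is an added assumption that avoids empty terms at the ends of the string.

```typescript
// Splitting on a single space: runs of spaces produce empty strings,
// and tabs are not treated as separators at all.
function splitOnSpace(s: string): string[] {
  return s.split(' ');
}

// Splitting on any run of whitespace. Trimming first (an assumption, not
// stated in the issue) keeps leading or trailing whitespace from
// producing empty terms.
function splitOnWhitespace(s: string): string[] {
  const trimmed = s.trim();
  return trimmed === '' ? [] : trimmed.split(/\s+/);
}

const input = 'very  loud\ttrain horn';
console.log(splitOnSpace(input));      // [ 'very', '', 'loud\ttrain', 'horn' ]
console.log(splitOnWhitespace(input)); // [ 'very', 'loud', 'train', 'horn' ]
```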
generateAliases should trim spaces from options
Stopwords should use a YAML file to allow for inline comments.
query = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 8589934592, 11, 2, 13, 14, 15, 16];
prefix = [ 1, 2, 3, 4, 5, 6 ];
expected match = [ 1, 2, 3, 4, 5, 6 ];
observed match = [1, 2, 13, 14, 15, 16];
The problem here is that a sequence of replacements after token 8589934592 involving terms [11, 2, 13, 14, 15, 16] can be cheaper than deleting [8589934592, 11, 2, 13, 14, 15, 16].
Remove readline-sync dependency. Let's just use the async version for the repl. This will make it easier for webpack to consume this package.
The PatternRecognizer constructor should take an Iterable, not a Map. The Map functionality is only used by the CreateFooRecognizer functions. PatternRecognizer just needs an Iterable of .
Might also want to come up with a less generic name for the Item interface.
Right now, a Recognizer splits a single UndefinedToken into an array of Tokens. Consider two changes:
pipelineDemo('can I get four four cars') causes NumberRecognizer to throw because it checks to see if the output of wordsToNumbers('four four') is a number. The output is '4 4' which is a string.
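One possible way to avoid the throw is to treat a non-numeric conversion result as "no single number here" rather than as an error. The sketch below assumes the conversion step returns either a number or a string, as described above for 'four four'. The helper `toSingleNumber` is hypothetical and is not the NumberRecognizer implementation.

```typescript
// Hypothetical helper, not part of tokenflow: accepts the output of a
// words-to-numbers conversion and returns a number only when the output
// represents exactly one non-negative integer.
function toSingleNumber(converted: string | number): number | undefined {
  if (typeof converted === 'number') {
    return Number.isFinite(converted) ? converted : undefined;
  }
  const text = converted.trim();
  // A string such as '4 4' contains two numbers, so it is rejected
  // instead of causing an exception.
  return /^\d+$/.test(text) ? Number(text) : undefined;
}

console.log(toSingleNumber(4));     // 4
console.log(toSingleNumber('4 4')); // undefined
```

A caller could then fall back to emitting one token per number, or leave the text unrecognized, when the result is undefined.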
Currently, contributed words are provided as sets that can be unioned together. It would be nice if they could be provided as predicates that can be chained. A predicate would be better for NumberRecognizer, since its set of contributed terms is technically the set of all integers.
One issue with this approach is that the matcher is based on term hashes, not term text. The tokenizers would need to utilize the same stemming and hashing if they were to provide a hash-based predicate, instead of a text-based predicate.
Without this fix, NumberRecognizer is forced to return a small, hard-coded list of integers.
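A rough sketch of what chained predicates could look like. All names here (`TermPredicate`, `fromSet`, `isIntegerTerm`, `anyOf`) are hypothetical and do not exist in tokenflow. The sketch is text-based, so it does not address the hashing concern raised above; a hash-based version would need the same stemming and hashing as the matcher.

```typescript
// Hypothetical type: a predicate over term text, standing in for a set.
type TermPredicate = (term: string) => boolean;

// Wraps an existing set of contributed words as a predicate.
function fromSet(terms: Iterable<string>): TermPredicate {
  const set = new Set(terms);
  return (term) => set.has(term);
}

// Matches any non-negative integer written in digits, which a finite
// hard-coded list cannot do.
const isIntegerTerm: TermPredicate = (term) => /^\d+$/.test(term);

// Chains predicates: the equivalent of a set union.
function anyOf(...predicates: TermPredicate[]): TermPredicate {
  return (term) => predicates.some((p) => p(term));
}

const contributed = anyOf(fromSet(['horn', 'train']), isIntegerTerm);
console.log(contributed('8589934592')); // true
console.log(contributed('loud'));       // false
```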
Rename badWords concept. Consider using contributedTerms terminology.
Also rename Recognizer.terms().
Might consider using a different package or implementing our own.
See commit dff201339d2fe5314483f82eaf3e6083b4c0bcb9 in short-order.