Rasa NLU Examples

This repository contains Rasa-compatible machine learning components. They are open sourced to encourage experimentation and to offer support for more tools quickly. Because they are hosted here, they do not need to go through the same vetting process as the components in Rasa, which we hope makes it easier for people to contribute new ideas.

The components in this repository are not officially supported. There will be unit tests as well as documentation, but this should be considered a community project, not part of core Rasa. If a component here turns out to be useful to the larger Rasa community, we might port features from this repository to Rasa.

Contribute

There are many ways you can contribute to this project.

You can suggest new features.
You can help review new features.
You can submit new components.
You can let us know if there are bugs.
You can share the results of an experiment you ran using these tools.
You can let us know if the components in this library help you.

Feel free to start the discussion by opening an issue on this repository. Before submitting code, it helps to create an issue first so that the maintainers can discuss the changes you would like to contribute. A more in-depth contribution guide can be found here.

Documentation

You can find the documentation for this project here.

Compatibility

This project currently supports components for Rasa 2.0. For older versions, see the list below.

version 0.1.3 is the final release for Rasa 1.10

Features

The following components are implemented:

Tokenizers

Tokenizers split the input text into tokens. Depending on the tokenizer you pick, you can also choose to apply lemmatization. For languages with rich grammatical features, this might help reduce the number of distinct tokens.
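To make the effect of lemmatization on vocabulary size concrete, here is a minimal toy sketch. It does not use any of the tokenizers in this repository: the `HYPOTHETICAL_LEMMAS` table and both helper functions are invented for illustration, whereas a real tokenizer would rely on a trained language model.

```python
# Toy illustration only: real tokenizers use trained language models,
# not a hand-written lookup table like this hypothetical one.
HYPOTHETICAL_LEMMAS = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "dogs": "dog",
}


def tokenize(text):
    """Split text into lowercase tokens on whitespace."""
    return text.lower().split()


def lemmatize(tokens, lemmas=HYPOTHETICAL_LEMMAS):
    """Replace each token with its lemma when the table knows one."""
    return [lemmas.get(token, token) for token in tokens]


utterances = ["the dog runs", "the dogs ran", "a dog is running"]

# Distinct tokens without and with lemmatization.
raw_vocab = {tok for text in utterances for tok in tokenize(text)}
lemma_vocab = {tok for text in utterances for tok in lemmatize(tokenize(text))}

print(len(raw_vocab), sorted(raw_vocab))
print(len(lemma_vocab), sorted(lemma_vocab))
```

In this toy corpus the lemmatized vocabulary is smaller because the inflected forms of "run" and "dog" collapse into a single token each.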

StanzaTokenizer

rasa_nlu_examples.tokenizers.StanzaTokenizer docs

We support a tokenizer based on Stanza. This tokenizer offers part-of-speech tagging as well as lemmatization for many languages that spaCy currently does not support. These features might help your ML pipelines in those situations.

ThaiTokenizer

rasa_nlu_examples.tokenizers.ThaiTokenizer docs

We support a Thai tokenizer based on PyThaiNLP.

Dense Featurizers

Dense featurizers attach dense numeric features to each token as well as to the entire utterance. These features are picked up by intent classifiers and entity detectors later in the pipeline.
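The idea of "features per token as well as for the entire utterance" can be sketched in plain Python. Everything here is an assumption made for illustration: the `TOY_EMBEDDINGS` table is invented, and averaging token vectors is just one common way to obtain an utterance-level vector, not necessarily what each featurizer in this repository does.

```python
# Hypothetical 3-dimensional embedding table; real featurizers load
# pretrained vectors with many more dimensions.
TOY_EMBEDDINGS = {
    "hello": [1.0, 0.0, 2.0],
    "world": [3.0, 2.0, 0.0],
}
DIM = 3


def token_features(tokens, table=TOY_EMBEDDINGS, dim=DIM):
    """Return one dense vector per token; unknown tokens get zeros."""
    return [list(table.get(token, [0.0] * dim)) for token in tokens]


def utterance_features(token_vectors, dim=DIM):
    """Average the token vectors into one utterance-level vector."""
    if not token_vectors:
        return [0.0] * dim
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


per_token = token_features(["hello", "world"])
whole_utterance = utterance_features(per_token)
print(per_token)
print(whole_utterance)
```

A downstream classifier would then consume both kinds of vectors: the per-token ones for entity detection and the utterance-level one for intent classification.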

FastTextFeaturizer

rasa_nlu_examples.featurizers.dense.FastTextFeaturizer docs

These are the pretrained embeddings from FastText; see here for more info. They are available in 157 languages; see here.

BytePairFeaturizer

rasa_nlu_examples.featurizers.dense.BytePairFeaturizer docs

These BytePair embeddings are specialized subword embeddings that are built to be lightweight. See this link for more information. These are available in 227 languages and you can specify the subword vocabulary size as well as the dimensionality.

GensimFeaturizer

rasa_nlu_examples.featurizers.dense.GensimFeaturizer docs

A benefit of the gensim library is that it is very easy to train your own word embeddings. It's typically only about 5 lines of code. That means that you could train your own word embeddings and then easily use them in a Rasa pipeline. This can be useful if you have specific jargon you'd like to capture.
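As a rough sketch of what those few lines might look like, the helper below trains a small Word2Vec model with gensim and saves its word vectors. The function names, the whitespace tokenization, and the hyperparameter values are assumptions made for illustration; check the GensimFeaturizer docs for the file format the component actually expects.

```python
def build_corpus(texts):
    """Turn raw strings into the list-of-token-lists format gensim trains on."""
    return [text.lower().split() for text in texts]


def train_and_save(texts, path, dim=25):
    """Train Word2Vec on the texts and save the word vectors to `path`."""
    # Imported lazily so build_corpus works even without gensim installed.
    from gensim.models import Word2Vec

    # `vector_size` is the gensim 4.x argument name; gensim 3.x calls it `size`.
    model = Word2Vec(
        sentences=build_corpus(texts),
        vector_size=dim,
        window=3,
        min_count=1,
        workers=1,
    )
    # Save only the word vectors (KeyedVectors), not the full training state.
    model.wv.save(path)
    return model.wv
```

A call such as `train_and_save(["reset my password", "unlock my account"], "custom.kv")` would write the vectors to a hypothetical `custom.kv` file. A realistic corpus needs far more text than this for the embeddings to be meaningful.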

Another benefit of the tool is that it has made it easy for community members to train custom embeddings for many languages. Here's a list of resources:

AraVec has embeddings for Arabic trained on Twitter and/or Wikipedia.

Fallback Classifiers

Fallback classifiers are models that can override previously predicted intents. Rasa NLU includes an NLU Fallback Classifier that can "fall back" whenever the main classifier isn't confident about its prediction. This repository also hosts a few such models so that you can handle specific situations with a custom model too. These models are meant to be used in combination with a RulePolicy.

FasttextLanguage

rasa_nlu_examples.fallback.FasttextLanguageFallbackClassifier docs

This fallback classifier is based on FastText. It can detect when a user is speaking in an unintended language so that you can create a rule to respond appropriately.
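The override logic behind a language-based fallback can be sketched as follows. This is not the component's implementation: the function name, its parameters, the default threshold, and the `non_expected_language` intent name are all hypothetical, and the language detector's output is simply passed in as arguments.

```python
def apply_language_fallback(
    intent,
    detected_language,
    detection_confidence,
    expected_language="en",
    threshold=0.7,
    fallback_intent="non_expected_language",
):
    """Return the intent to use after checking the detected language.

    All names and default values here are hypothetical placeholders.
    """
    is_other_language = detected_language != expected_language
    # Only override when the detector is reasonably sure about its guess.
    if is_other_language and detection_confidence >= threshold:
        return fallback_intent
    return intent


print(apply_language_fallback("greet", "nl", 0.9))
print(apply_language_fallback("greet", "en", 0.99))
print(apply_language_fallback("greet", "nl", 0.3))
```

A rule can then map the fallback intent to a response that tells the user which language the assistant supports.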

Usage

You can install the examples from this repo via pip:

pip install git+https://github.com/RasaHQ/rasa-nlu-examples

Once installed, you can add tools to your config.yml file. Here's an example:

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25
- name: DIETClassifier
  epochs: 200

An example config for using the Thai tokenizer would look like:

language: th
pipeline:
  - name: rasa_nlu_examples.tokenizers.ThaiTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200

You can use this file to run benchmarks. From the root folder of the project, that typically means running something like:

rasa test nlu --config basic-bytepair-config.yml \
          --cross-validation --runs 1 --folds 2 \
          --out gridresults/basic-bytepair-config

Open an Issue

If you've spotted a bug, you can submit an issue here. GitHub issues allow us to keep track of conversations about this repository and are the preferred communication channel for bugs related to this project.
