What is okgraph

okgraph is a Python 3 library that performs unsupervised natural-language understanding (NLU).

It currently focuses on the following tasks:

set expansion: given one or a short set of words, continues this set with a list of other 'same-type' words (co-hyponyms);
relation expansion: given one or a short set of word tuples, continues this set with a list of tuples sharing the same implicit relation as the given tuples;
set labeling: given one or a short set of words, returns a list of short strings (labels) describing the given set (its type or hyperonym);
relation labeling: given one or a short set of word tuples, returns a list of short strings (labels) describing the implicit relation of the tuples in the given set.
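
To make the idea of set expansion concrete, here is a minimal sketch that ranks candidate words by cosine similarity to the centroid of the seed vectors. It is not okgraph's implementation: the two-dimensional vectors are invented for illustration, and expand_set and cosine are hypothetical helper names.

```python
import math

# Toy 2-D "embeddings", invented for illustration only.
TOY_VECTORS = {
    "italy": (0.9, 0.1),
    "france": (0.8, 0.2),
    "germany": (0.85, 0.15),
    "spain": (0.88, 0.12),
    "rome": (0.1, 0.9),
    "berlin": (0.2, 0.8),
}


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_set(seed, vectors, k=15):
    """Rank the non-seed words by similarity to the centroid of the seed."""
    dims = len(next(iter(vectors.values())))
    centroid = tuple(
        sum(vectors[word][i] for word in seed) / len(seed) for i in range(dims)
    )
    candidates = [word for word in vectors if word not in seed]
    candidates.sort(key=lambda word: cosine(vectors[word], centroid), reverse=True)
    return candidates[:k]


print(expand_set(["italy", "france"], TOY_VECTORS, k=2))
```

With these toy vectors the country-like words rank above the city-like ones; real embeddings have hundreds of dimensions, but the ranking step has the same shape.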

Being unsupervised, it only takes a free (untagged) text corpus as input, in any space-separated language. Scriptio-continua corpora and languages need third-party tokenization techniques (e.g. micter).

How to use the okgraph library

How to install

Creating the virtual environment

Please ensure that a Python version between 3.7 and 3.9 is being used. If you're using Windows, ensure the Windows 10 SDK has been installed through the Visual Studio Installer.
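
The version requirement can also be checked from Python before the virtual environment is created. The sketch below is not part of okgraph: is_supported is a hypothetical helper, and it assumes that the 3.7 to 3.9 range is inclusive at both ends.

```python
import sys

MIN_VERSION = (3, 7)
MAX_VERSION = (3, 9)


def is_supported(version=None):
    """Return True if the (major, minor) pair lies between 3.7 and 3.9."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if not is_supported():
    # Only a warning: the interpreter running this check may differ from
    # the one that will be used for the virtual environment.
    print("Warning: expected Python 3.7 to 3.9, found", sys.version.split()[0])
```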

After cloning the repository (git clone https://github.com/atzori/okgraph.git && cd okgraph), run the following commands from the root directory of the downloaded project to install okgraph for development:

> python -m venv venv  # be sure you are referring to an acceptable python version using
> source venv/bin/activate
(venv) > python -m pip install --upgrade pip setuptools devtools
(venv) > pip install -r requirements.txt  # this may take several minutes

If you want to use it as a library inside one of your projects, just install it from your environment with:

(venv) > pip install ../path/to/downloaded/okgraph

Acquiring the test data

To run the tests some text corpora are required. The following script will provide the required text corpora, along with their word-embeddings, corpus indexes and corpus dictionaries:

$ python tests/get_test_corpus_and_resources.py

This procedure may take a while: it downloads the wiki-english-20171001 corpus from gensim-data and uses it to generate three corpora: text7.txt, text8.txt and text9.txt (obtained respectively from the first 10⁷, 10⁸ and 10⁹ bytes of the wiki-english-20171001 corpus). Then all the related resources (embeddings, index and dictionary) are created. The newly created corpora and their related resources can be found in tests/data under the project directory.
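
The byte-prefix step described above can be sketched as follows. This is only an illustration of cutting a corpus to its first n bytes, not the code of tests/get_test_corpus_and_resources.py; write_prefix and SIZES are hypothetical names.

```python
def write_prefix(source, target, n_bytes, chunk_size=1 << 20):
    """Copy at most the first n_bytes of source into target.

    Returns the number of bytes actually written, which is smaller than
    n_bytes when the source file is shorter.
    """
    written = 0
    with open(source, "rb") as src, open(target, "wb") as dst:
        while written < n_bytes:
            chunk = src.read(min(chunk_size, n_bytes - written))
            if not chunk:  # reached the end of the source file
                break
            dst.write(chunk)
            written += len(chunk)
    return written


# Exponents 7, 8 and 9 give the sizes of text7.txt, text8.txt and text9.txt.
SIZES = {f"text{exp}.txt": 10 ** exp for exp in (7, 8, 9)}
print(SIZES)
```

Cutting at a fixed byte offset can split the last word (or a multi-byte character) in two, which is usually negligible for corpora of this size.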

Generating the docs

This library uses Sphinx to automatically integrate the in-code comments within the library documentation. To obtain the code documentation run the following:

$ python docs/make_docs.py

The README, modules and packages are automatically parsed to produce HTML documentation, which can be found at docs/build/html/index.html under the project directory.

Loading a corpus

The first step is to create an OKgraph instance by loading a text corpus. Any OKgraph instance will also require a word-embedding model, a corpus index and a corpus dictionary, but specifying any value for them is optional.

The word-embedding model can be obtained by processing the specified corpus, or can be a pre-existing model. The word-embedding model is available through one of the extensions of the okgraph.embeddings.WordEmbeddings abstract class, which introduces the word-embeddings and the operations that can be done with them. The okgraph.embeddings.MagnitudeWordEmbeddings class extends the okgraph.embeddings.WordEmbeddings class and is currently the only available implementation of the word-embeddings, based on the Magnitude library.

The corpus index and corpus dictionary are strictly related to the corpus itself and are always obtained from its processing.

Specifying a corpus

This example creates an OKgraph instance based on the text8.txt corpus:

from okgraph.core import OKgraph
okg = OKgraph("text8.txt")

The file text8.txt will be set as the corpus file, and the word-embeddings, corpus index and corpus dictionary will be searched for under their default names, starting from the same directory as the corpus (text8.magnitude, indexdir/ and dictTotal.npy). If found, they will be used as they are; otherwise they will be generated automatically by processing the corpus.
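
The default locations named above can be derived from the corpus path alone. The following sketch assumes that the default embeddings file reuses the corpus file name with the .magnitude extension and that all three resources sit next to the corpus; default_resource_paths is a hypothetical helper, not an okgraph function.

```python
from pathlib import Path


def default_resource_paths(corpus):
    """Return the default resource locations next to the corpus file."""
    corpus_path = Path(corpus)
    base = corpus_path.parent
    return {
        "embeddings": base / (corpus_path.stem + ".magnitude"),
        "index_dir": base / "indexdir",
        "dictionary_file": base / "dictTotal.npy",
    }


paths = default_resource_paths("data/text8.txt")
for name, path in paths.items():
    # A missing resource is the case in which okgraph would generate it.
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```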

Specifying a corpus and model

This example creates an OKgraph instance based on the text8.txt corpus with a specified word-embedding model:

from okgraph.core import OKgraph
okg = OKgraph("text8.txt", "model_file")

or equivalently:

import okgraph
okg = okgraph.OKgraph(corpus="text8.txt", embeddings="model_file")

The file text8.txt will be set as the corpus file and the word-embeddings will be searched for, starting from the same directory as the corpus, in a file named model_file.magnitude (when the extension is not provided, .magnitude is automatically appended). If found, the word-embeddings will be loaded; otherwise they will be generated by processing the corpus and stored as model_file.magnitude.

If, instead of a file name, the word-embeddings argument is a URL (starting with 'http://' or 'https://'), a remote version of the file will be used (to be fixed):

from okgraph.core import OKgraph
okg = OKgraph(corpus="text8.txt", embeddings="https://model_file")

The stream argument allows the model to be streamed instead of downloaded:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text8.txt", embeddings="https://model_file", stream=True)
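
The rules for the embeddings argument (URL detection and the implicit .magnitude extension) can be summarised in a few lines. This is a sketch of the behaviour described in this section, not okgraph's code: describe_embeddings_argument is a hypothetical helper, and it assumes that "no extension" means a file name without any suffix.

```python
from pathlib import Path


def describe_embeddings_argument(embeddings, corpus):
    """Classify the embeddings argument as a remote URL or a local file."""
    if embeddings.startswith(("http://", "https://")):
        return ("remote", embeddings)
    name = Path(embeddings)
    if name.suffix == "":
        # No extension given: append the default one.
        name = name.with_name(name.name + ".magnitude")
    # Local names are resolved starting from the corpus directory.
    return ("local", (Path(corpus).parent / name).as_posix())


print(describe_embeddings_argument("model_file", "data/text8.txt"))
print(describe_embeddings_argument("https://model_file", "data/text8.txt"))
```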

Specifying all the resources

This example creates an OKgraph instance based on the text8.txt corpus with a specified value for the word-embeddings, the corpus index and corpus dictionary:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text8.txt", embeddings="model_file",
              index_dir="corpus_index/", dictionary_file="corpus_dictionary.npy")

The file text8.txt will be set as the corpus file, and the word-embeddings, corpus index and corpus dictionary will be searched for under their specified names, starting from the same directory as the corpus. If found, they will be used as they are; otherwise they will be generated automatically by processing the corpus and stored with the specified names.

Forcing the resources generation

When an OKgraph instance is created by loading a corpus, the word-embeddings, corpus index and corpus dictionary are searched for starting from the corpus directory and, if they exist, loaded as they are. To prevent these resources from being reused and force their regeneration from the corpus, set the force_init argument to True:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text8.txt", force_init=True)

This code forces the OKgraph constructor to regenerate the resources, overwriting them if they already exist.
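
The load-or-generate behaviour, including the effect of force_init, follows a common caching pattern. The sketch below shows that pattern for a single text file; it is an assumption about the general logic, not okgraph's code, and ensure_resource is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def ensure_resource(path, build, force_init=False):
    """Reuse the file at path if it exists; build it if missing or forced."""
    path = Path(path)
    if force_init or not path.exists():
        path.write_text(build())  # overwrites any previous content
        return "generated"
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "resource.txt"
    print(ensure_resource(target, lambda: "built once"))
    print(ensure_resource(target, lambda: "ignored"))
    print(ensure_resource(target, lambda: "rebuilt", force_init=True))
```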

Preparing a corpus

The classes and methods in okgraph.preprocessing.* are useful to parse and prepare a text corpus before creating the OKgraph instance. For instance, these methods help convert XML (MediaWiki) and HTML formats into clean free text usable by okgraph.

Some methods are also useful for lowercasing, stemming, co-occurrence (n-gram) tokenization, stop-word removal and idiom identification.

Some of these functions are taken from the Gensim preprocessing module and the Gensim phrases module.

(to be done)

Executing a task

Once an OKgraph object has been instantiated, it can be used to execute the four tasks through its four methods: set_expansion(), relation_expansion(), set_labeling() and relation_labeling().

Every method takes four arguments:

seed: the input items from which the results are computed. It is a list of strings in the set_expansion and set_labeling tasks, and a list of string tuples in the relation_expansion and relation_labeling tasks;
k (optional): an integer limiting the number of results returned (setting it to -1 removes the limit). It is set to 15 by default;
algo (optional): a string naming the algorithm chosen to implement the task. Every task has its own default algorithm with default arguments, so this argument can be omitted together with the options argument;
options (optional): a dictionary of the form {'argument': value} containing the values of the arguments required by the chosen algorithm. Every task has its own default algorithm with default arguments, so this argument can be omitted together with the algo argument.
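
The conventions for seed and k can be illustrated with two small helpers. They are hypothetical and only mirror the description in the list: limit_results applies the k limit (with -1 meaning no limit), and seed_kind tells a set seed from a relation seed.

```python
def limit_results(results, k=15):
    """Keep the first k results; k == -1 means that no limit is applied."""
    results = list(results)
    return results if k == -1 else results[:k]


def seed_kind(seed):
    """Classify a seed as 'set' (strings) or 'relation' (string tuples)."""
    if seed and all(isinstance(item, str) for item in seed):
        return "set"
    if seed and all(
        isinstance(item, tuple) and item and all(isinstance(w, str) for w in item)
        for item in seed
    ):
        return "relation"
    raise ValueError("seed must be a non-empty list of strings or string tuples")


print(seed_kind(["Italy", "France", "Germany"]))
print(seed_kind([("Italy", "Rome"), ("Germany", "Berlin")]))
print(len(limit_results(range(100))), len(limit_results(range(100), k=-1)))
```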

To correctly execute a task it's important to know which implementations are available for each task, and therefore which values can be assigned to the algo and options arguments. Every implementation has its own package, inside the respective task package, containing a module with the same name that implements the method task(). The task() method is the one called by the OKgraph instance with the unpacked dictionary of arguments **options. The algo argument has to be the name of one of the packages and the options argument has to contain the packed arguments for the respective task() method.
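
The naming convention above (task package, algorithm package, module with the same name, task() method) lends itself to a dynamic import. The sketch below shows one way such a dispatch could look; it is an assumption, not okgraph's actual code. In particular, algorithm_module_name and run_algorithm are hypothetical, all arguments are assumed to travel through options, and a stand-in importer replaces the real import so that the example runs without okgraph installed.

```python
import importlib
from types import SimpleNamespace


def algorithm_module_name(task_name, algo, root="okgraph.task"):
    """Build the dotted module path: <root>.<task>.<algo>.<algo>."""
    return f"{root}.{task_name}.{algo}.{algo}"


def run_algorithm(task_name, algo, options, importer=importlib.import_module):
    """Import the algorithm module and call its task() with **options."""
    module = importer(algorithm_module_name(task_name, algo))
    return module.task(**options)


def fake_importer(name):
    """Stand-in for importlib.import_module, used only for this demo."""
    print("would import:", name)
    return SimpleNamespace(task=lambda **kwargs: sorted(kwargs))


print(run_algorithm("set_expansion", "centroid",
                    {"seed": ["Italy"], "k": 15}, fake_importer))
```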

Executing a set expansion algorithm

All the set expansion algorithms can be found in the okgraph.task.set_expansion package.

This is an example of usage with default values (using embeddings):

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.set_expansion(["Italy", "France", "Germany"])

> e.g.: ["Spain", "Portugal", "Belgium", ...]

And another example using a specific algorithm (using embeddings):

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.set_expansion(
    seed=["Italy", "France", "Germany"],
    k=15,
    algo='centroid_boost',
    options={"embeddings": okg.embeddings,
             "step": 2,
             "fast": False}
)

> e.g.: ["Spain", "Portugal", "Belgium", ...]

If you want to use a pretrained masked model you can use the fill-mask algorithm, for example:

okg.set_expansion(
    seed=('italy', 'france', 'germany'),
    k=20,
    algo="fill_mask",
    options={}
)

Executing a relation expansion algorithm

All the relation expansion algorithms can be found in the okgraph.task.relation_expansion package.

This is an example of usage with default values:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.relation_expansion([("Italy", "Rome"), ("Germany", "Berlin")])

> e.g.: [("Spain", "Madrid"),("Belgium", "Brussels"), ...]

And another example using a specific algorithm:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.relation_expansion(
    seed=[("Italy", "Rome"), ("Germany", "Berlin")],
    k=15,
    algo="centroid",
    options={"embeddings": okg.embeddings,
             "set_expansion_algo": "centroid",
             "set_expansion_options": {"embeddings": okg.embeddings},
             "set_expansion_k": 15}
)

> e.g.: [("Spain", "Madrid"),("Belgium", "Brussels"), ...]

Executing a set labeling algorithm

All the set labeling algorithms can be found in the okgraph.task.set_labeling package.

This is an example of usage with default values:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.set_labeling(["Italy", "France", "Germany"])

> e.g.: ["country", "state", "nation", ...]

And another example using a specific algorithm:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.set_labeling(
    seed=["Italy", "France", "Germany"],
    k=15,
    algo='intersection',
    options={"dictionary": okg.dictionary,
             "index": okg.index}
)

> e.g.: ["country", "state", "nation", ...]

Executing a relation labeling algorithm

All the relation labeling algorithms can be found in the okgraph.task.relation_labeling package.

This is an example of usage with default values:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.relation_labeling([("Italy", "Rome"), ("Germany", "Berlin")])

> e.g.: ["capital", "capital_of", "soccer_team", ...]

And another example using a specific algorithm:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

okg.relation_labeling(
    seed=[("Italy", "Rome"), ("Germany", "Berlin")],
    k=15,
    algo='intersection',
    options={"dictionary": okg.dictionary,
             "index": okg.index}
)

> e.g.: ["capital", "capital_of", "soccer_team", ...]

Evaluation

The classes and methods in okgraph.evaluation.* evaluate the performance of algorithms in each task based on several benchmarks.

(to be done)

Embeddings operations

The following are examples of how the embeddings can be used in okgraph:

from okgraph.core import OKgraph
okg = OKgraph(corpus="text9.txt")

# This is a WordEmbeddings class, specifically a MagnitudeWordEmbeddings
# instance
emb = okg.embeddings

# Obtain the vector representation of "town"
emb.w2v("town")

# Obtain the 5 closest words to "town"
emb.w2w("town", 5)

# Obtain the 5 words closest to the given vector representation (compatible with
# the embeddings)
v = emb.w2v("town")  # any vector with the embeddings' dimensionality works
emb.v2w(v, 5)

# Obtain the vector representations of the 5 words closest to the given vector
# representation (compatible with the embeddings)
emb.v2v(v, 5)

More can be found in the okgraph.embeddings module.
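
To show what the four lookups mean without loading a real model, here is a toy in-memory class that reuses the same method names. It is not the okgraph.embeddings implementation: the vectors are invented, similarity is plain cosine, and excluding the query word from the w2w results is an assumption.

```python
import math


class ToyEmbeddings:
    """Tiny stand-in exposing w2v, w2w, v2w and v2v on hand-made vectors."""

    def __init__(self, vectors):
        self.vectors = dict(vectors)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    def w2v(self, word):
        """Word to vector."""
        return self.vectors[word]

    def v2w(self, vector, n=1):
        """Vector to the n most similar words."""
        ranked = sorted(
            self.vectors,
            key=lambda word: self._cosine(self.vectors[word], vector),
            reverse=True,
        )
        return ranked[:n]

    def w2w(self, word, n=1):
        """Word to its n nearest neighbours, excluding the word itself."""
        neighbours = self.v2w(self.w2v(word), n + 1)
        return [w for w in neighbours if w != word][:n]

    def v2v(self, vector, n=1):
        """Vector to the vectors of the n most similar words."""
        return [self.vectors[word] for word in self.v2w(vector, n)]


emb = ToyEmbeddings({
    "town": (1.0, 0.2),
    "city": (0.9, 0.3),
    "village": (0.8, 0.1),
    "banana": (0.0, 1.0),
})
print(emb.w2w("town", 2))
print(emb.v2w((0.0, 1.0), 1))
```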

How to contribute

Tools that may be useful (not mandatory):

hatch (commands reference)
git flow (simple guide) and also git-flow-completion

To send a contribution:

git checkout master
git flow init -d (to set the default settings)
git flow feature start my-cool-feature (use an appropriate feature name, for bugs use git flow bugfix start ...)
add, commit and push your work (it will be in branch feature/my-cool-feature)
follow the link suggested after the push to create a new pull request to the "develop" branch and start a discussion with the maintainer
the maintainer will merge your work into develop (or master in case of new releases)

Implementing a task

To implement a new algorithm named new_implementation for a task, create the new_implementation package inside the package of that task. The new_implementation package must contain the new_implementation.py module, whose task() method holds the actual algorithm.

For example, to implement a new relation labeling algorithm named my_rel_label_alg, the following path will have to exist: okgraph/task/relation_labeling/my_rel_label_alg/my_rel_label_alg.py.

Look at the existing algorithms for practical examples, e.g.: okgraph/task/set_expansion/centroid/centroid.py
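
The required layout can be created with a short script. This is only a convenience sketch: scaffold_algorithm is a hypothetical helper, the __init__.py file is an assumption about how the packages are declared, and the task() signature in the stub is a placeholder that should be replaced with the one used by the existing algorithms.

```python
from pathlib import Path

# Placeholder module content; the signature is an assumption.
STUB = '''"""Hypothetical stub for the {algo} algorithm."""


def task(seed, k, **options):
    """Placeholder: follow the signature of the existing algorithms."""
    raise NotImplementedError
'''


def scaffold_algorithm(project_root, task_name, algo):
    """Create okgraph/task/<task_name>/<algo>/<algo>.py under project_root."""
    package = Path(project_root) / "okgraph" / "task" / task_name / algo
    package.mkdir(parents=True, exist_ok=True)
    (package / "__init__.py").touch()
    module = package / f"{algo}.py"
    if not module.exists():  # never overwrite existing work
        module.write_text(STUB.format(algo=algo))
    return module
```

For the example above, scaffold_algorithm(".", "relation_labeling", "my_rel_label_alg") run from the project root would create okgraph/task/relation_labeling/my_rel_label_alg/my_rel_label_alg.py.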

Documenting your work

The whole project is documented with in-code Google Style Python Docstrings. Sphinx extracts these docstrings to generate the library documentation automatically. To update the documentation after your changes, run the following script:

$ python docs/make_docs.py

Testing

To run the tests, from the root directory, run:

python -m unittest discover tests/ -v

If the data required to run the tests has not been acquired yet, the tests/get_test_corpus_and_resources.py script will be executed before testing.
