onto-gen

A module for generating ontologies from a text corpus.

I. Preparing the corpus

The current pipeline supports only PDF files. To use other file types, convert them to txt and place them in the "corpus/txt" directory.

Place the PDF files in the "corpus/pdf" directory, then run:
make

When the process finishes, the corpus file is placed in "corpus/final/final.txt".

II. Creating topics for Latent Semantic Indexing (LSI).

1. Creating a dictionary
./topics/analyser.py -i INPUT-CORPUS-FILE

This command creates a dictionary file in the same directory as INPUT-CORPUS-FILE.
INPUT-CORPUS-FILE is a corpus file with one sentence per line, for example the corpus created in the previous section.
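
If the corpus comes from somewhere other than the "make" pipeline, the one-sentence-per-line layout has to be produced by hand. The following is a minimal sketch, assuming a naive split on sentence-ending punctuation is good enough. The function names to_sentence_lines and write_corpus are hypothetical and not part of onto-gen, and the project's own pipeline may split sentences differently.

```python
import re


def to_sentence_lines(text):
    """Return the sentences of `text`, one per list item.

    Naive rule (an assumption, not onto-gen's own logic): a sentence ends
    at '.', '!' or '?' followed by whitespace.
    """
    # Collapse line breaks and repeated spaces so wrapped lines rejoin.
    flat = " ".join(text.split())
    if not flat:
        return []
    # Split on whitespace that directly follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?])\s+", flat)


def write_corpus(text, path):
    """Write `text` to `path` with one sentence per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in to_sentence_lines(text):
            handle.write(sentence + "\n")
```

Abbreviations such as "e.g." will be split incorrectly by this rule, so a proper sentence tokenizer is likely a better choice for real corpora.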

2. Creating LSI models
./topics/models.py -i INPUT-PATH
INPUT-PATH is the path to the dictionary created in the first step.
This command creates the LSI models in the same directory as the dictionary.

III. Creating an inverted index

1. Creating the index
./search/search_manager.py -c PATH-TO-SCHEMA -i PATH-TO-INDEX
PATH-TO-INDEX is the path to the directory where the index will be created.
PATH-TO-SCHEMA is the path to a schema file in the format described below.
Example schema format (means: store title, store full text, allow full text search):
title	TEXT	True
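
To check a schema file before building the index, a small parser can help. The following is a minimal sketch, assuming each line holds exactly three tab-separated columns as in the example line. The names SchemaField and parse_schema are hypothetical, and the column meanings (field name, field type, boolean flag) are guesses based on the example, not taken from search_manager.py.

```python
from typing import List, NamedTuple


class SchemaField(NamedTuple):
    # Column meanings are assumptions inferred from the example line.
    name: str        # e.g. "title"
    field_type: str  # e.g. "TEXT"
    flag: bool       # e.g. True


def parse_schema(text: str) -> List[SchemaField]:
    """Parse tab-separated schema lines into SchemaField tuples."""
    fields = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated columns"
            )
        name, field_type, flag = (part.strip() for part in parts)
        fields.append(SchemaField(name, field_type, flag == "True"))
    return fields
```

A line that uses spaces instead of tabs raises ValueError, which catches a common editing mistake early.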

2. Reading corpus files into the index
To read a corpus file, use:
./search/search_manager.py -i PATH-TO-INDEX -q -af PATH-TO-FILE
PATH-TO-INDEX is the path to the directory where the index was created in the previous step.
PATH-TO-FILE is the path to the corpus file you want to read into the index.

IV. Creating an ontology

WARNING! All distance-matrix calculations are stored in the "./temp/" directory to speed up subsequent runs. Remember to clear this directory (rm ./temp/*) after changing the INPUT-TERMS list.
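
Forgetting to clear "./temp/" is easy, so the check can be automated. The following is a minimal sketch, assuming it is acceptable to keep a digest of the last used terms file next to that file. The helper names and the ".sha256" sidecar file are hypothetical and not part of onto-gen.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def clear_temp_if_terms_changed(terms_path, temp_dir="./temp"):
    """Delete cached files in `temp_dir` when the terms file has changed.

    The digest of the last seen terms file is kept in a hypothetical
    sidecar file named `<terms_path>.sha256`. Returns True if the cache
    was cleared, False if the terms file is unchanged.
    """
    stamp = Path(str(terms_path) + ".sha256")
    current = file_digest(terms_path)
    if stamp.exists() and stamp.read_text().strip() == current:
        return False  # same terms list: cached results are still valid
    temp = Path(temp_dir)
    if temp.is_dir():
        for entry in temp.iterdir():
            if entry.is_file():
                entry.unlink()  # mirrors `rm ./temp/*` (files only)
    stamp.write_text(current)
    return True
```

Calling clear_temp_if_terms_changed on the INPUT-TERMS file before each run of ontology_factory.py would clear the cache only when the terms list actually differs from the previous run.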

1. Creating a new ontology
./ontology_factory.py -i INDEX-PATH -c CORPUS-PATH -t INPUT-TERMS -o OUTPUT-OWL-FILE
Additionally, you can use the -l option to lemmatize the input terms.

INDEX-PATH is the path to the directory in which the inverted index was created in the previous section.
CORPUS-PATH is the path to the topic dictionary created in section II.
OUTPUT-OWL-FILE is the path for the generated OWL ontology.
INPUT-TERMS is the path to a file with one term per line.

2. Extending an existing ontology
./ontology_factory.py -i INDEX-PATH -c CORPUS-PATH -t INPUT-TERMS -o OUTPUT-OWL-FILE -g INPUT-OWL
Additionally, you can use the -l option to lemmatize the input terms.

INDEX-PATH is the path to the directory in which the inverted index was created in the previous section.
CORPUS-PATH is the path to the topic dictionary created in section II.
OUTPUT-OWL-FILE is the path for the generated OWL ontology.
INPUT-TERMS is the path to a file with one term per line.
INPUT-OWL is the path to the OWL file you want to extend.
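
When the ontology step is driven from a script, the two invocations above differ only in the optional -g and -l flags. The following is a minimal sketch that builds the argument list without running anything. The function name build_factory_command is hypothetical, while the flags are the ones listed in this section.

```python
def build_factory_command(index_path, corpus_path, terms_path, output_owl,
                          input_owl=None, lemmatize=False):
    """Return the argument list for ./ontology_factory.py.

    Passing `input_owl` adds -g (extend an existing ontology);
    `lemmatize=True` adds -l.
    """
    command = [
        "./ontology_factory.py",
        "-i", index_path,
        "-c", corpus_path,
        "-t", terms_path,
        "-o", output_owl,
    ]
    if lemmatize:
        command.append("-l")
    if input_owl is not None:
        command += ["-g", input_owl]
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root. Building a list rather than a single string avoids quoting problems with paths that contain spaces.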
