pyTextClassification

Training and using classifiers for textual documents

General

pyTextClassification is a simple Python library for training and using text classifiers. Classifiers are trained on a corpus of text documents organized in folders, with each folder corresponding to a different content class.

Installation and dependencies

  • pip dependencies:
pip install numpy matplotlib scipy scikit-learn nltk

Train a classifier

To train a classifier on a dataset, use the following command:

python textClassification.py trainFromDirs -i <datasetPath> --method <svm or knn or randomforest or gradientboosting or extratrees> --methodname <modelFileName>

<datasetPath> is the path of the training corpus. This path must contain one folder per content class. Each folder contains the files (no extension assumed) of the documents that belong to that class.

<modelFileName> is the path where the trained model is stored.
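Before training, it can help to check that the corpus follows the folder-per-class layout described above. The following is a minimal sketch and not part of pyTextClassification; the function name summarize_corpus is hypothetical. It counts the files in each class folder and ignores stray files at the top level.

```python
from pathlib import Path


def summarize_corpus(dataset_path):
    """Return {class_name: number_of_documents} for a folder-per-class corpus."""
    root = Path(dataset_path)
    summary = {}
    # Each sub-folder of the corpus root is treated as one content class.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Every regular file inside a class folder counts as one document.
        summary[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return summary
```

Calling summarize_corpus("moviePlotsSmall/") would return one entry per class folder, which makes empty or missing classes easy to spot before running trainFromDirs.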

Feature extraction is done using a set of predefined (static) dictionaries, stored in the myDicts/ folder. For each dictionary, a separate feature value is extracted.
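To illustrate the idea of one feature value per dictionary, here is a rough sketch. It is an assumption, not the library's actual implementation: the format of the files in myDicts/ and the exact scoring are not described here, so this example simply treats each dictionary as a set of words and uses the share of document tokens found in it as the feature. The function name and the example dictionaries are hypothetical.

```python
import re


def dictionary_features(text, dictionaries):
    """Return one feature value per dictionary: the share of tokens it contains."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in dictionaries}
    return {
        name: sum(1 for t in tokens if t in words) / len(tokens)
        for name, words in dictionaries.items()
    }


# Hypothetical dictionaries, for illustration only.
example_dicts = {
    "action": {"hero", "fights", "villain"},
    "romance": {"love", "kiss"},
}
features = dictionary_features("The hero fights the villain", example_dicts)
```

With this scheme, a document is represented by a fixed-length vector whose length equals the number of dictionaries, regardless of the document's size.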

Example:

python textClassification.py trainFromDirs -i moviePlotsSmall/ --method svm --methodname svmMoviesPlot7Classes
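When training is launched from another script, the command line can be assembled programmatically. The sketch below is a hypothetical helper, not part of the library. It uses only the subcommand and flags shown above and rejects method names other than the five listed.

```python
import sys

# Method names accepted by --method, as listed in the training syntax.
METHODS = ("svm", "knn", "randomforest", "gradientboosting", "extratrees")


def build_train_command(dataset_path, method, model_name):
    """Build the argument list for the trainFromDirs subcommand."""
    if method not in METHODS:
        raise ValueError("unknown method %r, expected one of %s" % (method, ", ".join(METHODS)))
    return [
        sys.executable, "textClassification.py", "trainFromDirs",
        "-i", dataset_path,
        "--method", method,
        "--methodname", model_name,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, which avoids shell quoting problems with paths that contain spaces.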

Apply a classifier

Given a trained model and an unknown document, use the following command to classify the document:

python textClassification.py classifyFile -i <pathToUnknownDocument> --methodname <modelFileName>

This repository already contains a trained SVM model (svmMoviesPlot7Classes) that discriminates between 7 classes of movie plots. The files samples/sample_pulpFiction, samples/sample_forestgump and samples/sample_lordoftherings contain three plot examples that can be used as unknown documents for testing.

To classify these three files using svmMoviesPlot7Classes, run the following commands:

python textClassification.py classifyFile -i samples/sample_pulpFiction --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_forestgump --methodname svmMoviesPlot7Classes

python textClassification.py classifyFile -i samples/sample_lordoftherings --methodname svmMoviesPlot7Classes

Each of the above commands returns the most dominant content classes along with their normalized probabilities, sorted from highest to lowest.
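The normalization and sorting step can be sketched as follows. This is an illustration under assumptions: how the tool computes its raw class scores is not described here, and the function name and class labels are hypothetical. The sketch divides non-negative scores by their sum so that they add up to 1, then orders them from highest to lowest.

```python
def rank_classes(scores):
    """Normalize non-negative class scores to sum to 1 and sort them descending."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    normalized = {label: value / total for label, value in scores.items()}
    # Highest probability first, matching the order described above.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for two classes.
ranking = rank_classes({"Drama": 1.0, "Crime": 3.0})
```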

Contributors

tyiannak
