stuff-classifier

No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact me so that I can place a link to it here.

Description

A library for classifying text into multiple categories.

Currently provided classifiers:

  • Naive Bayes
  • Tf-Idf

I ran a benchmark on 1345 items that I had previously classified by hand, each into multiple categories. Here is the rate at which each of the 2 algorithms correctly detected one of those categories:

  • Bayes: 79.26%
  • Tf-Idf: 81.34%

I prefer the Naive Bayes approach because, while it scores lower on this benchmark, it seems to make better decisions than I did in many cases. For example, an item titled "Paintball Session, 100 Balls and Equipment" was classified as "Activities" by me, but the Bayes classifier identified it as "Sports", at which point I had an intellectual orgasm. Also, the Tf-Idf classifier seems to do better on clear-cut cases, but doesn't seem to handle uncertainty as well. Of course, these are just quick tests I made, and I have no idea which is really better.
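The benchmark metric above counts a prediction as correct when it matches any one of the categories assigned by hand. A minimal sketch of that computation follows. The helper name `hit_rate`, the toy items and the stand-in predictor are all assumptions made for illustration; none of them come from the original benchmark or from the gem.

```ruby
# Hypothetical helper (not part of stuff-classifier): the fraction of items
# whose predicted category is among the manually assigned categories.
def hit_rate(items, &predict)
  return 0.0 if items.empty?
  hits = items.count { |item| item[:categories].include?(predict.call(item[:text])) }
  hits.to_f / items.size
end

# Toy data, invented for this example only.
items = [
  { text: "Paintball Session, 100 Balls and Equipment", categories: [:activities, :sports] },
  { text: "Dinner for two", categories: [:food] },
  { text: "Spa weekend", categories: [:wellness] },
  { text: "City tour", categories: [:travel] }
]

# Stand-in predictor; with the gem installed this could be cls.classify(text).
predictor = ->(text) { text.include?("Paintball") ? :sports : :food }

rate = hit_rate(items, &predictor)
puts format("%.2f%%", rate * 100)  # prints 50.00%
```

Because an item can carry several manual labels, this metric is more forgiving than exact-match accuracy, which is worth keeping in mind when comparing the two percentages.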

Install

gem install stuff-classifier

Usage

You either instantiate one class or the other. Both have the same signature:

require 'stuff-classifier'

# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")

# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")

# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words, to
# disable or to come up with your own list of stop words, on a
# classifier instance you can do this:
cls.ignore_words = [ 'the', 'my', 'i', 'dont' ]
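To illustrate what stop-word filtering does to the input before it reaches the classifier, here is a rough sketch. The `tokenize` function is hypothetical and is not the gem's actual parsing code; it only lowercases, splits on non-letters, and drops the ignored words, and it skips stemming altogether.

```ruby
# Hypothetical tokenizer (an assumption, not the gem's implementation):
# lowercase the text, extract words, strip apostrophes, drop stop words.
def tokenize(text, ignore_words)
  text.downcase
      .scan(/[a-z']+/)
      .map { |word| word.delete("'") }
      .reject { |word| word.empty? || ignore_words.include?(word) }
end

ignore_words = ['the', 'my', 'i', 'dont']
tokens = tokenize("I don't like the way my cat stares", ignore_words)
p tokens  # => ["like", "way", "cat", "stares"]
```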

Training the classifier:

cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

And finally, classifying stuff:

cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?") 
#=> :dog
cls.classify("What pet will I love more?")    
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.") 
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog

Persistence

The following layers for saving the training data between sessions are implemented:

  • in memory (by default)
  • on disk
  • Redis
  • (coming soon) in an RDBMS

To persist the data in Redis, you can do this:

# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)

# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})

To persist the data on disk, you can do this:

store = StuffClassifier::FileStorage.new(@storage_path)

# global setting
StuffClassifier::Base.storage = store

# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)

# after training is done, to persist the data ...
cls.save_state

# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
  # when done, save_state is called on END
end

# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)

The name you give your classifier is important, because the data is loaded and saved based on it. For instance, the following 3 classifiers are stored in different buckets and are independent of each other.

cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")	
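The idea of name-keyed buckets can be sketched with a toy store. This is only an illustration under stated assumptions: `ToyFileStore`, its `read` and `write` methods and the file layout are invented here and do not reflect how `StuffClassifier::FileStorage` is actually implemented.

```ruby
require 'tmpdir'

# Hypothetical store (not the gem's FileStorage): a single file holding a
# hash that maps each classifier name to its training data.
class ToyFileStore
  def initialize(path)
    @path = path
    @buckets = File.exist?(path) ? Marshal.load(File.binread(path)) : {}
  end

  # Returns the data saved under this name, or an empty hash if none exists.
  def read(name)
    @buckets.fetch(name, {})
  end

  # Saves data under this name without touching the other buckets.
  def write(name, data)
    @buckets[name] = data
    File.binwrite(@path, Marshal.dump(@buckets))
  end
end

loaded = Dir.mktmpdir do |dir|
  path = File.join(dir, "classifiers.bin")
  store = ToyFileStore.new(path)
  store.write("Cats or Dogs", { dog: 5, cat: 4 })
  store.write("Spam or Ham", { spam: 10 })

  # A fresh instance sees the saved buckets, each under its own name.
  reopened = ToyFileStore.new(path)
  { cats: reopened.read("Cats or Dogs"),
    spam: reopened.read("Spam or Ham"),
    missing: reopened.read("True or False") }
end

p loaded[:cats]
p loaded[:missing]
```

Reusing a name therefore means reusing its saved training data, which is why two unrelated classifiers should not share one.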

License

MIT Licensed. See LICENSE.txt for details.

stuff-classifier's People

Contributors

alexandru, denniskuczynski, mauidude, niclin, oliviergg, railsmechanic, tjmullicani

stuff-classifier's Issues

Why I can't call RedisStorage

[1] pry(main)> store = StuffClassifier::RedisStorage.new("key_classifier")
NameError: uninitialized constant StuffClassifier::RedisStorage

unclassify?

Would be nice if you allowed us to unclassify something. I may do a pull request for that but if anyone else wants to tackle it go ahead!

load_state wrong?

There is something wrong with load_state. I tried it with your code and it is broken there as well. classify gives wrong results if you load the saved state.

How to test it:

git clone the repo
build and install the gem
Bayes.open (make sure that the saved state file doesn't exist)
train it with two categories
run classifier.cat_scores for a category
Bayes.open with the same name as above, now with an already saved state file
classifier.cat_scores for the same category gives me a different value

set threshold

Hi,

I would like to set a threshold for the classify method, so that a default category is returned when all categories match only with a low probability. I saw that the classify method accepts a default that I can pass, but it is only used if no category was found (the check is prob > 0.0, which is too relaxed). Is there any way to fall back to the default when the probability is smaller than X?

My idea: pass a max_prob option in the opts of the Bayes initializer. For me it would be something like 0.043. I would use this value in the classify method. I can do a pull request for it. I would just like to know whether there is another way to do this today, without modifying the code.
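One way to picture the request is a small wrapper applied to the scores after they are computed. The sketch below is an assumption, not existing gem behavior: `classify_with_threshold` is a hypothetical helper, and it takes a plain hash of category => probability rather than calling into the library.

```ruby
# Hypothetical helper (not part of stuff-classifier): pick the best-scoring
# category, but fall back to a default when its probability is below threshold.
def classify_with_threshold(scores, threshold, default)
  best_cat, best_prob = scores.max_by { |_cat, prob| prob }
  return default if best_cat.nil? || best_prob < threshold
  best_cat
end

low  = classify_with_threshold({ cat: 0.02, dog: 0.01 }, 0.043, :unknown)
high = classify_with_threshold({ cat: 0.30, dog: 0.05 }, 0.043, :unknown)
p low   # => :unknown
p high  # => :cat
```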

Redis Persistence

Whenever I try to initialize the Redis persistence storage through the StuffClassifier::RedisStorage class, I get an "uninitialized constant StuffClassifier::RedisStorage" error. Is there anything else I have to install other than the stuff-classifier gem?

Storage improvement

  • Factor the common code of inMemoryStorage and FileStorage out into a new class, Storage
  • Use JSON as the format for the saved data
  • Other params need to be saved: Language, ignore_word, ...
  • New tests need to be written in 005_inMemoryStorage

nil error while classifying

NoMethodError: undefined method `map' for nil:NilClass
from /Users/william/.rvm/gems/ruby-1.9.3-p547/gems/stuff-classifier-0.5/lib/stuff-classifier/bayes.rb:13:in `doc_prob'

Gem doesn't work - Ruby version

I think I introduced some mistakes in my previous commit.
It's about the Ruby version.
I ran 'rake test' with Ruby 1.9 and it was OK, but I released something that doesn't work with Ruby 1.8.
Now the gemspec needs rcov, which is only available with Ruby 1.8 and not with Ruby 1.9,
so rake test on the master version doesn't work.
I'll fix this soon (and create a development branch), but we have to choose a Ruby version. I would recommend Ruby 1.9.
What about you?
