Semantix

The Semantix crawler (work in progress).

Semantix implements a Naive Bayes classifier using the NLTK library.

It takes the crawled HTML pages of an entire website and classifies the website by business type, such as restaurant or medical.

Based on the business type, it then classifies the website's content into relevant data, such as hours of operation, location, and menu items for restaurants.
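To illustrate the idea behind the first classification step, here is a minimal plain-Python sketch of a multinomial Naive Bayes text classifier. It is not the project's code and does not use NLTK; the class name, the helper function, and the training snippets are all made up for illustration, and the real project would train on text extracted from crawled pages.

```python
from __future__ import division, print_function

import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes(object):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # every word seen in training

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.label_counts):
            # Score is log P(label) plus the sum of log P(word | label).
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets standing in for crawled page text.
TRAINING = [
    ("menu appetizers entrees dessert reservations", "restaurant"),
    ("lunch specials dinner menu wine list", "restaurant"),
    ("patients clinic appointments doctor insurance", "medical"),
    ("physician pediatric care clinic health", "medical"),
]

classifier = NaiveBayes()
classifier.train(TRAINING)
print(classifier.classify("Book a dinner reservation and see our menu"))
```

With this toy data the last line prints restaurant. Working in log space avoids numeric underflow on long pages, and add-one smoothing keeps an unseen word from zeroing out a label's score.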

Installation

  1. Install Python 2.7.3.
  2. Clone the project and navigate into it.
  3. Install virtualenv, then create and activate a virtual environment. Install all Python libraries while the environment is active.
  4. Install Flask.
  5. Install BeautifulSoup.
  6. Install NLTK.
  7. Obtain the crawled website data from someone on the team.

Quick Installation Commands

  1. Install Python 2.7.3.
  2. git clone https://github.com/rhuang/semantix.git
  3. sudo pip install virtualenv
  4. virtualenv venv
  5. . venv/bin/activate
  6. pip install -U Flask beautifulsoup4 pyyaml nltk
  7. Run ./start or python semantix.py

Windows

  1. To run locally, first start the environment by running winStart.bat.
  2. Then run python semantix.py.
  3. In your browser, go to 127.0.0.1:5001.

Mac

  1. Run ./start.

Notes

We activate a virtual environment so that the project runs with its own Python interpreter and libraries and is not affected by other Python versions installed on the machine. Flask is likewise installed into the virtual environment, not globally.
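If you are unsure whether the virtual environment is active before installing libraries, a quick check like the following can help. This is a small sketch rather than part of the project; the function name in_virtualenv is made up for illustration.

```python
from __future__ import print_function

import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases record the system prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases set sys.base_prefix instead;
    # it differs from sys.prefix only inside an environment.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtual environment active:", in_virtualenv())
```

Run it with the same python command you use for the project. If it reports False, activate the environment before running pip install.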

You can also run python app/main.py to try the main algorithms without starting Flask.

OCR

OCR is done using the Tesseract OCR engine.

  1. brew install tesseract

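Usage:
tesseract [image_name] [output_file]

By default, Tesseract appends .txt to the output name, so the recognized text ends up in [output_file].txt.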
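The same command can be driven from Python with the standard subprocess module. The sketch below assumes the tesseract binary is on your PATH; the function names and the example file names are hypothetical and not taken from the project.

```python
from __future__ import print_function

import io
import subprocess


def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract [image_name] [output_file]."""
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run Tesseract on an image and return the recognized text."""
    # check_call raises CalledProcessError if tesseract exits with an error.
    subprocess.check_call(tesseract_command(image_path, output_base))
    # Tesseract writes its default text output to <output_base>.txt.
    with io.open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Example with hypothetical file names (requires tesseract to be installed):
#     text = run_ocr("menu.png", "menu")
print(" ".join(tesseract_command("menu.png", "menu")))
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems with file names that contain spaces.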
