python-lsa's Introduction

Latent Semantic Analysis in Python

This project performs latent semantic analysis (LSA) on large document sets.

We first build a document-term matrix and then compute its singular value decomposition (SVD).

The document-term matrix uses tf-idf weighting.
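The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in scripts/: the function name `build_tfidf_matrix`, the whitespace tokenizer, the tf-idf variant (term count divided by document length, times log(N / document frequency)), and the terms-as-rows orientation are all assumptions.

```python
import math
from collections import Counter

import numpy as np


def build_tfidf_matrix(documents):
    """Return (matrix, vocabulary); rows are terms, columns are documents."""
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    matrix = np.zeros((len(vocabulary), n_docs))
    for j, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            matrix[index[term], j] = tf * idf
    return matrix, vocabulary


documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
matrix, vocabulary = build_tfidf_matrix(documents)

# Thin SVD: matrix == U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
print(matrix.shape, s.shape)
```

A dense array is fine for a toy corpus like this one. At the scale mentioned in the notes below, a sparse matrix and a truncated SVD would likely be needed instead.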

To run it, set your current working directory to scripts/ and run the file located there.

Notes to @rrish:

  • This does work on the entire Jeopardy dataset, with all 200,000 documents and 100,000 unique words. Warning: running it on the full dataset needs about 2 GB of memory to store everything, so be careful.
  • The global WORKERS variable sets how many worker processes to create. Feel free to experiment with it for performance (I haven't yet).
  • As it stands, it can analyze all 200,000 documents and create the document-term matrix in about 45-50 seconds on my machine (your mileage may vary with core count, etc.).
  • It currently uses basic tf-idf weighting. We may wish to adjust this later.
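For anyone experimenting with WORKERS, here is a rough sketch of how a worker-count setting can drive a process pool for the per-document counting step. Only the name WORKERS comes from the notes above. Its value here, the helper names `count_terms` and `count_all`, and the chunking heuristic are assumptions and may not match what the script does.

```python
from collections import Counter
from multiprocessing import Pool

WORKERS = 4  # assumed value; the script's actual default is not stated here


def count_terms(document):
    """Count lowercased, whitespace-separated terms in one document."""
    return Counter(document.lower().split())


def count_all(documents, workers=WORKERS):
    """Return one Counter per document, in input order."""
    if workers <= 1:
        # Serial path: a useful baseline when timing different settings.
        return [count_terms(doc) for doc in documents]
    with Pool(processes=workers) as pool:
        # Batch documents so each task sent to a worker carries several items.
        chunksize = max(1, len(documents) // (workers * 4))
        return pool.map(count_terms, documents, chunksize=chunksize)


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat", "cats and dogs"] * 1000
    counts = count_all(docs)
    print(len(counts), counts[0])
```

Timing `count_all` with a few different `workers` values, including 1, is a simple way to see where adding processes stops helping on a given machine.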

The SVD_using_LSA.m file is a MATLAB implementation of the second half of the LSA algorithm, which runs once the document-term matrix has been constructed and its SVD has been calculated. It calculates the new word matrix and doc matrix, then takes a query and calculates the cosine distance between the query and each document (the columns of the doc matrix, saved into a new array called "docs"). Finally, it ranks the documents by their relevance to the query word or words.
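For readers without MATLAB, the same query step can be approximated in NumPy. This is a sketch under assumptions, not a port of SVD_using_LSA.m: the function name `rank_documents`, the rank parameter `k`, and the choice to project the query with the transposed truncated word matrix are illustrative. It ranks by cosine similarity, highest first, which gives the same order as ranking by cosine distance, lowest first.

```python
import numpy as np


def rank_documents(matrix, query_vector, k):
    """Rank the columns of a term-by-document matrix against a query.

    Returns (ranking, similarities); ranking lists document indices from
    most to least similar in a rank-k latent space.
    """
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Document coordinates in the latent space, one column per document.
    docs = np.diag(s_k) @ Vt_k
    # Project the query from term space into the same latent space.
    query = U_k.T @ query_vector

    norms = np.linalg.norm(docs, axis=0) * np.linalg.norm(query)
    # Guard against zero-length vectors: their similarity is left at 0.
    similarities = np.divide(
        docs.T @ query, norms, out=np.zeros_like(norms), where=norms > 0
    )
    ranking = np.argsort(-similarities)
    return ranking, similarities


# Toy term-by-document matrix: 4 terms (rows), 3 documents (columns).
toy = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
])
# A query containing the second and fourth terms.
ranking, similarities = rank_documents(toy, np.array([0.0, 1.0, 0.0, 1.0]), k=2)
print(ranking, similarities)
```

In practice the query vector would be weighted the same way as the document columns (tf-idf here) before projection, and `k` would be far smaller than the vocabulary size.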

python-lsa's People

Contributors

rrish
