mo-inkhan / tf-idf-vectorizer

Compute the TF-IDF matrix from a collection of documents to measure the importance of words for text analysis and information retrieval tasks.

License: MIT License

TF-IDF Vectorizer

This Python-based TF-IDF Vectorizer is a simple implementation of the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. It takes a list of text documents as input and computes a TF-IDF matrix that scores how important each term is to each document. The matrix can be used for Natural Language Processing (NLP), Machine Learning, and information retrieval tasks.

Features

  • Tokenization of input documents
  • Calculation of Term Frequency (TF) for each term in each document
  • Calculation of Inverse Document Frequency (IDF) for each term in the corpus
  • Calculation of TF-IDF matrix from input documents
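The four steps above can be sketched as follows. This is a minimal illustration, not the code in tf_idf_vectorizer.py: the function names, the whitespace tokenizer, and the formulas (TF as count divided by document length, IDF as the natural log of N / df) are assumptions based on one common textbook definition, and the repository may weight terms differently.

```python
import math
from collections import Counter


def tokenize(document):
    # Assumption: lowercase and split on whitespace; punctuation stays attached.
    return document.lower().split()


def term_frequency(tokens):
    # TF = occurrences of the term / total number of tokens in the document.
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    # IDF = ln(N / df), where df counts the documents that contain the term.
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf_sketch(documents):
    # Returns (sorted terms, matrix) with one row per document
    # and one column per term.
    tokenized = [tokenize(doc) for doc in documents]
    idf = inverse_document_frequency(tokenized)
    terms = sorted(idf)
    matrix = []
    for tokens in tokenized:
        tf = term_frequency(tokens)
        matrix.append([tf.get(term, 0.0) * idf[term] for term in terms])
    return terms, matrix


terms, matrix = tf_idf_sketch(["the cat sat", "the dog barked"])
print(terms)
print(matrix)
```

With this particular IDF, a term that occurs in every document (here "the") scores zero everywhere, which is why smoothed variants are often used in practice.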

Usage

  1. Import the get_tf_idf_with_terms function from the tf_idf_vectorizer.py module:
from tf_idf_vectorizer import get_tf_idf_with_terms
  2. Prepare a list of documents (strings) to pass to get_tf_idf_with_terms:
documents = [
    "A young wizard discovers his magical heritage and begins his studies at a prestigious school for wizards.",
    "A group of astronauts embark on a dangerous mission to save Earth by entering a wormhole in search of a new habitable planet.",
    "In a post-apocalyptic world, a father and son journey through a desolate landscape while trying to survive and find hope for humanity.",
    "An aspiring musician enters a magical world to find his true passion and learn what it means to live a fulfilled life.",
]
  3. Call the function. It returns a tuple containing the unique terms (the column labels of the matrix) and the TF-IDF matrix as a list of lists, one row per document:
unique_terms, tf_idf_matrix = get_tf_idf_with_terms(documents)

print("Unique terms:", unique_terms)
for i, doc in enumerate(tf_idf_matrix):
    print(f"Document {i+1}: {doc}")
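Because the terms act as column keys, each row can be paired with them to see which terms score highest in a document. The sketch below assumes that the order of unique_terms matches the column order of each row. The helper top_terms is hypothetical and not part of the module, and the numbers are hand-written stand-ins rather than real output of get_tf_idf_with_terms.

```python
def top_terms(unique_terms, row, k=3):
    # Pair each column label with its score, then sort by score, highest first.
    scored = sorted(zip(unique_terms, row), key=lambda pair: pair[1], reverse=True)
    # Drop zero scores, which mean the term carries no weight in this document.
    return [(term, score) for term, score in scored if score > 0][:k]


# Hand-written stand-in values, not real output of get_tf_idf_with_terms.
example_terms = ["magical", "wizard", "wormhole"]
example_matrix = [
    [0.04, 0.08, 0.0],
    [0.0, 0.0, 0.06],
]

for i, row in enumerate(example_matrix):
    print(f"Document {i + 1}: {top_terms(example_terms, row, k=2)}")
```

The same helper could be applied to the unique_terms and tf_idf_matrix values returned in step 3, provided the column-order assumption holds.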

Example

See example.py for a sample use case.

Limitations

This implementation is intended for educational purposes and may not be as efficient or robust as more advanced libraries. It does not remove stopwords, strip punctuation, or apply stemming, so words such as "wizards." and "wizard" are treated as different terms. A more advanced implementation would likely need these preprocessing steps.
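One way to work around the missing punctuation and stopword handling is to clean the documents before passing them in. The sketch below is an assumption about how that could look, not part of the repository: preprocess and STOPWORDS are hypothetical names, and the stopword list is a tiny illustrative sample. Stemming is left out because it needs a dedicated rule set or library.

```python
import string

# Illustrative stopword list only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "at", "for", "his", "in", "of", "the", "to"}


def preprocess(document):
    # Lowercase, then delete ASCII punctuation characters.
    # Note: this also removes hyphens, so "post-apocalyptic" becomes
    # "postapocalyptic".
    cleaned = document.lower().translate(str.maketrans("", "", string.punctuation))
    # Drop stopwords and rejoin so the result is still a plain string.
    return " ".join(word for word in cleaned.split() if word not in STOPWORDS)


cleaned_documents = [
    preprocess(doc)
    for doc in [
        "A young wizard discovers his magical heritage.",
        "In a post-apocalyptic world, a father and son journey on.",
    ]
]
print(cleaned_documents)
```

The cleaned strings can then be used in place of the raw documents list from the Usage section.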

Contributing

All contributions are welcome. For a feature request or bug, please open an issue first. Then fork the repository, create a branch, make your changes, and open a pull request. That's it! Thanks!

License

TF-IDF Vectorizer is released under the MIT License. Check out the full license here.
