
lprtk / pytctk


Python Text Cleaning ToolKit library (pyTCTK)

License: MIT License

Python 66.80% Jupyter Notebook 33.20%
data-preparation library nlp nlp-library preprocessing python text-cleaning text-mining


pyTCTK: Python Text Cleaning ToolKit



Overview

The goal is to provide tools for preparing text data without having to install anything. Some text cleaning libraries cannot be used on work computers because they need to download files from servers or URLs that are blocked by internet proxies. With pyTCTK, you only need Python and access to GitHub to clean your text data, so the library can be used anywhere, including on a work computer.

Content

For now, three classes with several methods are available:

  • The TextNet class implements the general functions for cleaning text (removing punctuation, uppercase, email addresses, URLs, HTML tags, etc.);

  • The WordNet class implements the functions for more precise cleaning at the word level (removing stopwords, or applying lemmatization or stemming);

  • The Tokenize class implements the two functions for tokenizing and detokenizing the words in your text.
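
To make the kind of cleaning described above concrete, here is a minimal sketch that uses only the standard library. It is not pyTCTK's API: the function names `clean_text`, `tokenize` and `detokenize` and the regular expressions are assumptions chosen for illustration, and the real classes may behave differently.

```python
import re
import string

# Hypothetical patterns for illustration; pyTCTK's own rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HTML_TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """General cleaning: strip HTML tags, URLs, emails, uppercase, punctuation."""
    text = HTML_TAG_RE.sub(" ", text)   # remove HTML tags
    text = URL_RE.sub(" ", text)        # remove URLs
    text = EMAIL_RE.sub(" ", text)      # remove email addresses
    text = text.lower()                 # remove uppercase
    text = text.translate(PUNCT_TABLE)  # remove ASCII punctuation
    return " ".join(text.split())       # collapse repeated whitespace


def tokenize(text: str) -> list[str]:
    """Split a cleaned text into words on whitespace."""
    return text.split()


def detokenize(tokens: list[str]) -> str:
    """Join words back into a single string."""
    return " ".join(tokens)


if __name__ == "__main__":
    raw = "<p>Hello, World! Visit https://example.com or mail me@example.com.</p>"
    cleaned = clean_text(raw)
    print(cleaned)
    print(tokenize(cleaned))
```

The order of the steps matters in this sketch: URLs and email addresses are removed before punctuation, because stripping punctuation first would break the patterns that detect them.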

Requirements

  • Python version 3.9.7
  • Install requirements.txt
$ pip install -r requirements.txt 
  • Libraries used
import numpy as np
import os
import pandas as pd
import re
from urllib import request

File details

  • requirements
    This folder contains a .txt file with all the packages and versions needed to run the project.
  • pyTCTK
    This folder contains a .py file with all the classes, functions and methods.
  • example
    This folder contains an example notebook that shows how to use the different classes and functions, and what they output.
  • ressources
    This folder contains several subfolders holding the .txt vocabulary files used to process and clean the texts.

Here is the project structure:

- project
    > pyTCTK
        > requirements
            - requirements.txt
        > codefile 
            - pyTCTK.py
        > example 
            - pyTCTK.ipynb
        > ressources 
            >stopwords
                - english.txt
                - french.txt
            >lemme
                - english.txt
                - french.txt
            >stemme
                - english.txt
                - french.txt
            >accents
                - accents.txt
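
As an illustration of how local vocabulary files like the ones in the ressources folder can be used without any network access, here is a minimal sketch. It assumes each .txt file lists one word per line, which the README does not confirm, and the helper names `load_vocabulary` and `remove_stopwords` are hypothetical rather than part of pyTCTK.

```python
from pathlib import Path


def load_vocabulary(path) -> set[str]:
    """Read a vocabulary file, ASSUMING one word per line (format not confirmed)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Ignore blank lines and normalise case so lookups are case-insensitive.
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]
```

With the layout above, and assuming the script runs from the pyTCTK folder, a call could look like `remove_stopwords(tokens, load_vocabulary("ressources/stopwords/english.txt"))`. Loading the words into a set keeps each lookup constant-time, which matters when cleaning large corpora.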

Features

