jfilter / clean-text

🧹 Python package for text cleaning

License: Other

Python 100.00%
python natural-language-processing text-cleaning text-normalization text-preprocessing python-package nlp user-generated-content scraping

clean-text's People

Contributors

jfilter, sadra-barikbin


clean-text's Issues

add a removal of stop words

Hi,

This is a very convenient tool. It would be great to add an option to remove stop words (e.g., from the NLTK stop-word list).

Thanks.
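Until such an option exists, stop words can be filtered in a post-processing step. A minimal sketch with a hypothetical helper (the tiny stop set below stands in for a real list such as NLTK's English stop words):

```python
# Hypothetical workaround: filter stop words after clean() has lower-cased
# the text. The small set below stands in for a real list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "it", "to"}

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(remove_stop_words("it is a very convenient tool"))  # -> "very convenient tool"
```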

emojis removed with unidecode

When cleantext is installed with unidecode, emojis are removed, while they are preserved without unidecode. As emojis are an important part of informal communication nowadays, I think they should always be preserved (or an additional parameter should control this). This inconsistency should also be documented somewhere.
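As a caller-side workaround, emojis can be shielded from the ASCII-folding pass. A rough sketch: the codepoint check is deliberately crude (common pictographs only), and `fold` stands in for unidecode:

```python
# Sketch: stash emojis behind placeholders, run the ASCII-folding step
# (e.g. unidecode), then restore them. The range check only catches
# common pictographs from U+1F300 upwards.
def fold_preserving_emoji(text, fold):
    saved = []
    out = []
    for ch in text:
        if ord(ch) >= 0x1F300:
            out.append(f"\x00{len(saved)}\x00")
            saved.append(ch)
        else:
            out.append(ch)
    folded = fold("".join(out))
    for i, ch in enumerate(saved):
        folded = folded.replace(f"\x00{i}\x00", ch)
    return folded
```

Calling this with fold=unidecode.unidecode would keep emojis intact while still transliterating accented letters.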

Localhost URLs are not removed

Hi, I really like your package. I just found that localhost URLs are not correctly removed:

For example:

url="http://localhost:8080"

The missing top-level domain might be the cause of this.
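If that is the cause, a pre-pass can catch hosts without a top-level domain before clean() runs. A sketch with a hypothetical helper (this is not clean-text's own URL pattern):

```python
import re

# Hypothetical pre-pass: match localhost URLs, which seem to slip through
# because they lack a top-level domain, and replace them before clean().
LOCALHOST_URL = re.compile(r"https?://localhost(?::\d+)?(?:/\S*)?")

def strip_localhost(text, replace_with="<URL>"):
    return LOCALHOST_URL.sub(replace_with, text)

print(strip_localhost("docs at http://localhost:8080/api"))  # -> "docs at <URL>"
```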

Import name confusion

The documentation states: "This package is named clean-text and not cleantext." This is confusing because both clean-text and cleantext are imported with the same line:

import cleantext

Is there a way to specify which of the two I intend to import?
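The import line alone cannot distinguish them, since both distributions install a module named cleantext; the choice can only be made at install time by keeping just one of the two. A small probe (assuming Python 3.8+'s importlib.metadata) shows which distribution is present:

```python
from importlib import metadata

# Both PyPI distributions ("clean-text" and "cleantext") install a module
# called `cleantext`, so check which distribution is installed instead:
def installed_variant():
    for dist in ("clean-text", "cleantext"):
        try:
            return f"{dist} {metadata.version(dist)}"
        except metadata.PackageNotFoundError:
            continue
    return None
```

If both are installed, uninstall the unwanted one; whichever was installed last owns the cleantext module on disk.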

no_numbers does not remove numbers in the middle of words

When cleaning text with numbers, a number attached to a word is not removed.

For clean("A fr1ie45nd 23 is a sec6on7d self", no_numbers=True, replace_with_number="") the expected output is "a friend is a second self" but the output is "a fr1ie45nd is a sec6on7d self"
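A workaround sketch: since no_numbers only matches free-standing numbers, deleting every digit first (which is what the package's no_digits option targets) yields the expected text. The helper below is a plain-regex stand-in, not clean-text code:

```python
import re

# Stand-in for no_digits: delete every digit, including those embedded
# inside words, then collapse the leftover whitespace.
def strip_digits(text):
    return re.sub(r"\d", "", text)

cleaned = " ".join(strip_digits("A fr1ie45nd 23 is a sec6on7d self").split())
print(cleaned.lower())  # -> "a friend is a second self"
```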

doesn't work with dataframe (csv file)

Hi,
I have a CSV file with multiple columns: post_id, post_text.
I am trying to clean the post_text column, which I read from the CSV file into a dataframe. The problem is that the clean method doesn't process the full text; it only picks up some words from every line of the dataframe!
Please help.
Attached is a txt file (CSV files are not allowed here) and two screenshots, one of the real data and one of the output of clean().


posts.txt
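clean() expects one string at a time; passing a whole column mangles the output. Apply it per row instead. A sketch with a stand-in cleaner (with pandas, the equivalent would presumably be df["post_text"].apply(clean)):

```python
# clean() works on one string at a time; map it over the rows rather than
# handing it the whole column. clean_stub stands in for cleantext.clean.
def clean_stub(text):
    return " ".join(str(text).lower().split())

posts = ["First POST!!", "Second   post"]
cleaned = [clean_stub(p) for p in posts]
print(cleaned)  # -> ['first post!!', 'second post']
```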

clean function error

I just followed the example in the README and got this error:

>>> clean("he;;p", fix_unicode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: clean() got an unexpected keyword argument 'fix_unicode'
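This error usually suggests a different clean() than clean-text's is being imported (for instance, from the unrelated cleantext package that shares the module name). A quick diagnostic, using a stand-in function here since neither package may be installed:

```python
import inspect

# Check whether a callable accepts a given keyword argument. Running
# has_param(clean, "fix_unicode") after `from cleantext import clean`
# tells you whether the imported clean() is the one you expect.
def has_param(func, name):
    return name in inspect.signature(func).parameters

def demo_clean(text, fix_unicode=True):  # stand-in with clean-text's keyword
    return text

print(has_param(demo_clean, "fix_unicode"))  # -> True
```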

Add multiprocessing

Given that cleaning text can be very time-consuming when the amount of data is huge, it would be really good if clean-text provided a built-in multiprocessing ability.

It could be as simple as providing a flag and adding an option to pass a list of texts instead of a single text.

What do you think?
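Until such a flag exists, the parallelism can live on the caller's side. A sketch with concurrent.futures (clean_stub stands in for cleantext.clean, and the worker/chunksize numbers are arbitrary):

```python
from concurrent.futures import ProcessPoolExecutor

def clean_stub(text):  # stand-in for cleantext.clean; a pure top-level function pickles cleanly
    return " ".join(text.lower().split())

def clean_many(texts, workers=4):
    # Fan the texts out across processes; chunking amortizes pickling cost.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_stub, texts, chunksize=256))

if __name__ == "__main__":
    print(clean_many(["Hello   WORLD", "FOO  bar"]))
```

A process pool (rather than threads) is the natural choice here because text cleaning is CPU-bound.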

TypeError: clean() got an unexpected keyword argument 'no_urls'

processed_cmts = []
for cmt in df_user_max['comment_body']:
    processed_text = clean(cmt,
                           no_urls=True, no_emails=True, no_numbers=True, no_digits=True,
                           no_currency_symbols=True, no_punct=True,
                           replace_with_url="<URL>",
                           replace_with_email="<EMAIL>",
                           replace_with_phone_number="<PHONE>",
                           replace_with_number="<NUMBER>",
                           replace_with_digit="0",
                           replace_with_currency_symbol="<CUR>",
                           lang="en")
    processed_cmts.append(processed_text)

print(len(processed_cmts))

Error:

TypeError                                 Traceback (most recent call last)
TypeError: clean() got an unexpected keyword argument 'no_urls'

Numbers not fully removed

Hi!

With a string like Add more staff,2), after punctuation is removed the string turns into Add more staff2, and then the number 2 isn't removed :/
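A workaround sketch: turning punctuation into spaces before removing numbers keeps the 2 free-standing so it can still be matched. Plain regex stand-ins, not clean-text internals:

```python
import re

# Replace punctuation with a space *before* number removal, so "staff,2)"
# becomes "staff 2 " and the free-standing 2 still matches.
def punct_to_space(text):
    return re.sub(r"[^\w\s]", " ", text)

spaced = punct_to_space("Add more staff,2)")
no_numbers = re.sub(r"\b\d+\b", "", spaced)
print(" ".join(no_numbers.split()))  # -> "Add more staff"
```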

replace punctuation (at least sometimes) with whitespace

For at least some punctuation marks, like .!?, it might be better to replace them with whitespace instead of an empty string, so that when there is no whitespace between two sentences, the neighboring words don't get concatenated. Maybe an additional parameter could specify what punctuation marks are replaced with?
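A sketch of the suggested option (the parameter names here are hypothetical, not clean-text's):

```python
import re

# Replace a configurable set of punctuation marks with a configurable
# string, defaulting to sentence punctuation -> single space.
def replace_punct(text, punct=".!?", replace_with=" "):
    return re.sub(f"[{re.escape(punct)}]", replace_with, text)

print(replace_punct("First sentence.Second sentence"))  # -> "First sentence Second sentence"
```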

whitespace between emojis

Thanks for this great library! :)

To facilitate tokenisation, it would be great if additional whitespace could be added before and after each emoji.
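A regex sketch of the idea; the codepoint range below is a crude approximation of the emoji blocks, not a complete emoji detector:

```python
import re

# Pad each emoji with spaces so downstream tokenizers split it off,
# then collapse any doubled-up whitespace.
EMOJI = re.compile("([\U0001F300-\U0001FAFF])")

def pad_emoji(text):
    return " ".join(EMOJI.sub(r" \1 ", text).split())

print(pad_emoji("great\U0001F60Athanks"))  # -> "great 😊 thanks"
```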

Improve scikit-learn compatibility

Thank you for building and open-sourcing this!

Unfortunately there are still some compatibility issues in CleanTransformer, especially when using it within Pipeline and FeatureUnion objects.

  1. y argument missing in fit.
  2. Missing partial_fit method.
  3. Missing get_feature_names_out method.

I implemented these in #31. Would be happy to discuss any changes or additional functionality + tests.
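The three additions can be sketched as follows, assuming scikit-learn's estimator conventions (the real class subclasses BaseEstimator/TransformerMixin, and the transform body here is only a stand-in for clean()):

```python
# Sketch of the three missing pieces, following scikit-learn conventions.
class CleanTransformerSketch:
    def fit(self, X, y=None):          # 1. accept y so Pipeline can pass labels through
        return self

    def partial_fit(self, X, y=None):  # 2. the transformer is stateless, so this is a no-op
        return self

    def transform(self, X):            # stand-in for calling clean() on each document
        return [" ".join(str(x).lower().split()) for x in X]

    def get_feature_names_out(self, input_features=None):  # 3. for FeatureUnion introspection
        return ["cleaned_text"]
```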

Add Scikit-learn compatible api for function clean

Hi there!

I propose adding a scikit-learn-compatible API for the function clean. It's as easy as making a class that subclasses scikit-learn's TransformerMixin and implements the fit and transform methods.

This API has the advantage of letting the user integrate the clean function into a preprocessing pipeline and examine the effects, on the downstream task, of the various options clean provides.

Not up to date with Emoji module

The emoji module's newest release no longer provides "UNICODE_EMOJI", so importing clean-text results in an error. Having looked into the emoji repo, it now only has "UNICODE_DATA".

emails not working properly

email_addresses = [
    "[email protected]",
    "mustermann(at)fh-aachen.de",
    "[email protected]",
    "m.mustermann(at)fh-aachen.de",
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]"
]

for i, email in enumerate(email_addresses):
    print(f"{i}: {text_cleaner.transform(email)}")
0: <email>
1: mustermann(at)fh-aachen.de
2: <email>
3: m.mustermann(at)fh-aachen.de
4: <email>-aachen.de
5: <email>-aachen.com
6: <email>
7: <email>.com

I expect that the email addresses with (at) (instead of @) won't work, but all the others should; some of them actually exist in a similar form.
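The truncated replacements suggest the email pattern stops at a hyphen inside the domain. A sketch of a pattern that allows hyphenated domain labels (this is an assumption about the bug, not clean-text's actual regex, and the address below is made up):

```python
import re

# Allow hyphens inside domain labels so a domain like fh-aachen.de is
# consumed as a whole instead of being cut at the hyphen.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

print(EMAIL.sub("<email>", "write to j.doe@fh-aachen.de today"))  # -> "write to <email> today"
```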

Issue with 'fix_unicode' (unexpected keyword argument)

I'm encountering the following error any time I try to run clean:

clean() got an unexpected keyword argument 'fix_unicode'

When I remove the 'fix_unicode' step and move on to the next one, the same error occurs each time. I was able to use this successfully as recently as a few months ago, but now I no longer can. The only difference in usage I can think of is that I'm now working on an M2 Mac (which has caused issues with other packages, as many have pointed out), but I'm unsure whether that matters here.

Has anyone else encountered this or have a sense of how to address it?

