jfilter / clean-text

🧹 Python package for text cleaning

License: Other

Python 100.00%
python natural-language-processing text-cleaning text-normalization text-preprocessing python-package nlp user-generated-content scraping

clean-text's People

Contributors

jfilter, sadra-barikbin


clean-text's Issues

add a removal of stop words

Hi,

This is a very convenient tool. It would be great to add an option to remove stop words (e.g., from the NLTK stop-word list).

Thanks.
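Until such an option exists, stop words can be filtered in a post-processing step. A minimal sketch with a hypothetical helper (the tiny stop set below stands in for a real list such as NLTK's English stop words):

```python
# Hypothetical workaround: filter stop words after clean() has lower-cased
# the text. The small set below stands in for a real list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "it", "to"}

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(remove_stop_words("it is a very convenient tool"))  # -> "very convenient tool"
```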

emojis removed with unidecode

When cleantext is installed with unidecode, emojis are removed, while they are preserved without unidecode. As emojis are an important part of informal communication nowadays, I think they should always be preserved (or an additional parameter should control this). This inconsistency should also be documented somewhere.
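As a caller-side workaround, emojis can be shielded from the ASCII-folding pass. A rough sketch: the codepoint check is deliberately crude (common pictographs only), and `fold` stands in for unidecode:

```python
# Sketch: stash emojis behind placeholders, run the ASCII-folding step
# (e.g. unidecode), then restore them. The range check only catches
# common pictographs from U+1F300 upwards.
def fold_preserving_emoji(text, fold):
    saved = []
    out = []
    for ch in text:
        if ord(ch) >= 0x1F300:
            out.append(f"\x00{len(saved)}\x00")
            saved.append(ch)
        else:
            out.append(ch)
    folded = fold("".join(out))
    for i, ch in enumerate(saved):
        folded = folded.replace(f"\x00{i}\x00", ch)
    return folded
```

Calling this with fold=unidecode.unidecode would keep emojis intact while still transliterating accented letters.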

Localhost URLs are not removed

Hi, I really like your package. I just found that localhost URLs are not correctly removed:

For example:

url="http://localhost:8080"

The missing top-level domain might be the cause of this.
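If that is the cause, a pre-pass can catch hosts without a top-level domain before clean() runs. A sketch with a hypothetical helper (this is not clean-text's own URL pattern):

```python
import re

# Hypothetical pre-pass: match localhost URLs, which seem to slip through
# because they lack a top-level domain, and replace them before clean().
LOCALHOST_URL = re.compile(r"https?://localhost(?::\d+)?(?:/\S*)?")

def strip_localhost(text, replace_with="<URL>"):
    return LOCALHOST_URL.sub(replace_with, text)

print(strip_localhost("docs at http://localhost:8080/api"))  # -> "docs at <URL>"
```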

Import name confusion

The documentation states: "This package is named clean-text and not cleantext." This is confusing because both clean-text and cleantext are imported with the same line:

import cleantext

Is there a way to specify which of the two I intend to import?
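The import line alone cannot distinguish them, since both distributions install a module named cleantext; the choice can only be made at install time by keeping just one of the two. A small probe (assuming Python 3.8+'s importlib.metadata) shows which distribution is present:

```python
from importlib import metadata

# Both PyPI distributions ("clean-text" and "cleantext") install a module
# called `cleantext`, so check which distribution is installed instead:
def installed_variant():
    for dist in ("clean-text", "cleantext"):
        try:
            return f"{dist} {metadata.version(dist)}"
        except metadata.PackageNotFoundError:
            continue
    return None
```

If both are installed, uninstall the unwanted one; whichever was installed last owns the cleantext module on disk.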

no_numbers does not remove numbers in the middle of words

When cleaning text with numbers, a number attached to a word is not removed.

For clean("A fr1ie45nd 23 is a sec6on7d self", no_numbers=True, replace_with_number="") the expected output is "a friend is a second self" but the output is "a fr1ie45nd is a sec6on7d self"
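A workaround sketch: since no_numbers only matches free-standing numbers, deleting every digit first (which is what the package's no_digits option targets) yields the expected text. The helper below is a plain-regex stand-in, not clean-text code:

```python
import re

# Stand-in for no_digits: delete every digit, including those embedded
# inside words, then collapse the leftover whitespace.
def strip_digits(text):
    return re.sub(r"\d", "", text)

cleaned = " ".join(strip_digits("A fr1ie45nd 23 is a sec6on7d self").split())
print(cleaned.lower())  # -> "a friend is a second self"
```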

doesn't work with dataframe (csv file)

Hi,
I have a CSV file with multiple columns: post_id, post_text.
I am trying to clean the post_text column, which I read from the CSV file into a dataframe. The problem is that the clean method doesn't process the full text; it only picks up some words from every line of the dataframe!
Please help.
Attached is a txt file (CSV files are not allowed here) and two screenshots, one of the real data and one of the output of clean().


posts.txt
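clean() expects one string at a time; passing a whole column mangles the output. Apply it per row instead. A sketch with a stand-in cleaner (with pandas, the equivalent would presumably be df["post_text"].apply(clean)):

```python
# clean() works on one string at a time; map it over the rows rather than
# handing it the whole column. clean_stub stands in for cleantext.clean.
def clean_stub(text):
    return " ".join(str(text).lower().split())

posts = ["First POST!!", "Second   post"]
cleaned = [clean_stub(p) for p in posts]
print(cleaned)  # -> ['first post!!', 'second post']
```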

clean function error

I just followed the example in the README and got this error:

>>> clean("he;;p", fix_unicode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: clean() got an unexpected keyword argument 'fix_unicode'
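This error usually suggests a different clean() than clean-text's is being imported (for instance, from the unrelated cleantext package that shares the module name). A quick diagnostic, using a stand-in function here since neither package may be installed:

```python
import inspect

# Check whether a callable accepts a given keyword argument. Running
# has_param(clean, "fix_unicode") after `from cleantext import clean`
# tells you whether the imported clean() is the one you expect.
def has_param(func, name):
    return name in inspect.signature(func).parameters

def demo_clean(text, fix_unicode=True):  # stand-in with clean-text's keyword
    return text

print(has_param(demo_clean, "fix_unicode"))  # -> True
```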

Add multiprocessing

Given that cleaning text can be very time-consuming when the amount of data is huge, it would be really good if clean-text provided a built-in multiprocessing ability.

It could be as simple as providing a flag and adding an option to pass a list of texts instead of a single text.

What do you think?
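Until such a flag exists, the parallelism can live on the caller's side. A sketch with concurrent.futures (clean_stub stands in for cleantext.clean, and the worker/chunksize numbers are arbitrary):

```python
from concurrent.futures import ProcessPoolExecutor

def clean_stub(text):  # stand-in for cleantext.clean; a pure top-level function pickles cleanly
    return " ".join(text.lower().split())

def clean_many(texts, workers=4):
    # Fan the texts out across processes; chunking amortizes pickling cost.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_stub, texts, chunksize=256))

if __name__ == "__main__":
    print(clean_many(["Hello   WORLD", "FOO  bar"]))
```

A process pool (rather than threads) is the natural choice here because text cleaning is CPU-bound.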

TypeError: clean() got an unexpected keyword argument 'no_urls'

processed_cmts = []
for cmt in df_user_max['comment_body']:
    processed_text = clean(cmt,
                           no_urls=True, no_emails=True, no_numbers=True, no_digits=True,
                           no_currency_symbols=True, no_punct=True,
                           replace_with_url="<URL>",
                           replace_with_email="<EMAIL>",
                           replace_with_phone_number="<PHONE>",
                           replace_with_number="<NUMBER>",
                           replace_with_digit="0",
                           replace_with_currency_symbol="<CUR>",
                           lang="en")
    processed_cmts.append(processed_text)

print(len(processed_cmts))

Error:

TypeError                                 Traceback (most recent call last)
TypeError: clean() got an unexpected keyword argument 'no_urls'

Numbers not fully removed

Hi!

With a string like Add more staff,2), after punctuation is removed the string turns into Add more staff2, and then the number 2 isn't removed :/
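A workaround sketch: turning punctuation into spaces before removing numbers keeps the 2 free-standing so it can still be matched. Plain regex stand-ins, not clean-text internals:

```python
import re

# Replace punctuation with a space *before* number removal, so "staff,2)"
# becomes "staff 2 " and the free-standing 2 still matches.
def punct_to_space(text):
    return re.sub(r"[^\w\s]", " ", text)

spaced = punct_to_space("Add more staff,2)")
no_numbers = re.sub(r"\b\d+\b", "", spaced)
print(" ".join(no_numbers.split()))  # -> "Add more staff"
```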

replace punctuation (at least sometimes) with whitespace

For at least some punctuation marks, like .!?, it might be better to replace them with whitespace instead of an empty string, so that when there is no whitespace between two sentences, the neighboring words don't get concatenated. Maybe an additional parameter could specify what punctuation marks are replaced with?
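A sketch of the suggested option (the parameter names here are hypothetical, not clean-text's):

```python
import re

# Replace a configurable set of punctuation marks with a configurable
# string, defaulting to sentence punctuation -> single space.
def replace_punct(text, punct=".!?", replace_with=" "):
    return re.sub(f"[{re.escape(punct)}]", replace_with, text)

print(replace_punct("First sentence.Second sentence"))  # -> "First sentence Second sentence"
```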

whitespace between emojis

Thanks for this great library! :)

To facilitate tokenisation, it would be great if additional whitespace could be added before and after each emoji.
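A regex sketch of the idea; the codepoint range below is a crude approximation of the emoji blocks, not a complete emoji detector:

```python
import re

# Pad each emoji with spaces so downstream tokenizers split it off,
# then collapse any doubled-up whitespace.
EMOJI = re.compile("([\U0001F300-\U0001FAFF])")

def pad_emoji(text):
    return " ".join(EMOJI.sub(r" \1 ", text).split())

print(pad_emoji("great\U0001F60Athanks"))  # -> "great 😊 thanks"
```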

Improve scikit-learn compatibility

Thank you for building and open-sourcing this!

Unfortunately there are still some compatibility issues in CleanTransformer, especially when using it within Pipeline and FeatureUnion objects.

  1. y argument missing in fit.
  2. Missing partial_fit method.
  3. Missing get_feature_names_out method.

I implemented these in #31. Would be happy to discuss any changes or additional functionality + tests.
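The three additions can be sketched as follows, assuming scikit-learn's estimator conventions (the real class subclasses BaseEstimator/TransformerMixin, and the transform body here is only a stand-in for clean()):

```python
# Sketch of the three missing pieces, following scikit-learn conventions.
class CleanTransformerSketch:
    def fit(self, X, y=None):          # 1. accept y so Pipeline can pass labels through
        return self

    def partial_fit(self, X, y=None):  # 2. the transformer is stateless, so this is a no-op
        return self

    def transform(self, X):            # stand-in for calling clean() on each document
        return [" ".join(str(x).lower().split()) for x in X]

    def get_feature_names_out(self, input_features=None):  # 3. for FeatureUnion introspection
        return ["cleaned_text"]
```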

Add Scikit-learn compatible api for function clean

Hi there!

I propose adding a scikit-learn-compatible API for the function clean. It's as easy as making a class that subclasses scikit-learn's TransformerMixin and implements the fit and transform methods.

This API has the advantage of letting the user integrate the clean function into a preprocessing pipeline and examine the effects, on the downstream task, of the various options clean provides.

Not up to date with Emoji module

The emoji module's newest release no longer provides "UNICODE_EMOJI", so importing clean-text results in an error. Having looked into the emoji repo, it now only has "UNICODE_DATA".

emails not working properly

email_addresses = [
    "[email protected]",
    "mustermann(at)fh-aachen.de",
    "[email protected]",
    "m.mustermann(at)fh-aachen.de",
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]"
]

for i, email in enumerate(email_addresses):
    print(f"{i}: {text_cleaner.transform(email)}")
0: <email>
1: mustermann(at)fh-aachen.de
2: <email>
3: m.mustermann(at)fh-aachen.de
4: <email>-aachen.de
5: <email>-aachen.com
6: <email>
7: <email>.com

I expect that the email addresses with (at) (instead of @) won't work, but all the others should; some of them actually exist in a similar form.
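The truncated replacements suggest the email pattern stops at a hyphen inside the domain. A sketch of a pattern that allows hyphenated domain labels (this is an assumption about the bug, not clean-text's actual regex, and the address below is made up):

```python
import re

# Allow hyphens inside domain labels so a domain like fh-aachen.de is
# consumed as a whole instead of being cut at the hyphen.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

print(EMAIL.sub("<email>", "write to j.doe@fh-aachen.de today"))  # -> "write to <email> today"
```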

Issue with 'fix_unicode' (unexpected keyword argument)

I'm encountering the following error any time I try to run clean:

clean() got an unexpected keyword argument 'fix_unicode'

When I remove the 'fix_unicode' step and move on to the next one, the same error occurs each time. I was able to use this successfully as recently as a few months ago, but now I no longer can. The only difference in usage I can think of is that I'm now working on an M2 Mac (which has caused issues with other packages, as many have pointed out), but I'm unsure whether that matters here.

Has anyone else encountered this or have a sense of how to address it?

