jfilter / clean-text
🧹 Python package for text cleaning
License: Other
Hi,
This is a very convenient tool. It would be great to add an option to remove stop words (e.g., from the nltk stop word list).
Thanks.
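A minimal sketch of what such an option could look like, assuming nltk is installed and using its English stop word list (the helper name is hypothetical, not part of clean-text):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the list once
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    # drop every whitespace-separated token that is a stop word
    return " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)

remove_stopwords("this is a very convenient tool")  # -> 'convenient tool'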
This: +1 123 1548690 is correctly identified as a phone number, but this is not: +49 123 1548690.
Supporting international country codes like this would be invaluable; it's a pretty common use case.
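A minimal reproduction using the no_phone_numbers option of clean() (behaviour as described in the report; note that clean() lowercases by default, so the placeholder casing may differ):

from cleantext import clean

clean("this: +1 123 1548690", no_phone_numbers=True,
      replace_with_phone_number="<PHONE>")   # US prefix is replaced
clean("this: +49 123 1548690", no_phone_numbers=True,
      replace_with_phone_number="<PHONE>")   # German prefix is not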
text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
cleantext.replace_urls(text, "XXX")
output:
郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈哈http://t.c
Expected:
郭麒麟打卡,且听他分享防疫小知识XXX哈哈XXX哈哈哈哈XXX
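One possible workaround until this is fixed: a looser, hand-rolled pattern (not the package's own regex) that stops at CJK characters and tolerates the missing colon in http//:

import re

# hypothetical fallback pattern: optional colon, stop at whitespace or CJK
URL_RE = re.compile(r"https?:?//[^\s\u4e00-\u9fff]+")
URL_RE.sub("XXX", text)  # -> the expected output above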
In a cleantext install with unidecode, emojis are removed, while they are preserved without unidecode. As emojis are an important part of informal communication nowadays, I think they should always be preserved (or there should be an additional parameter to specify this). Furthermore, this inconsistency should probably be documented somewhere...
Hi, I really like your package. I just found that localhost URLs are not correctly removed:
For example:
url="http://localhost:8080"
The missing top-level domain might be the cause of this.
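A hedged supplementary pattern that catches scheme-plus-localhost URLs (hand-rolled for illustration, not the package's regex):

import re

LOCALHOST_RE = re.compile(r"https?://localhost(?::\d+)?(?:/\S*)?")
LOCALHOST_RE.sub("<URL>", "see http://localhost:8080 for details")
# -> 'see <URL> for details'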
The clean-text[sklearn] version 0.6.0 pins the Pandas dependency to <2.0.0.
Please relax the Pandas dependency.
The documentation states that "This package is named clean-text and not cleantext." This is confusing, because both clean-text and cleantext are imported with the same statement:
import cleantext
Is there a way to specify which of the two I intend to import?
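One way to see which distribution actually provided the module you imported (both ship a top-level module named cleantext, so whichever was installed last wins):

import cleantext
print(cleantext.__file__)  # the install path hints at the distribution

from importlib import metadata
for dist in ("clean-text", "cleantext"):
    try:
        print(dist, metadata.version(dist))
    except metadata.PackageNotFoundError:
        print(dist, "not installed")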
When cleaning text with numbers, a number that is joined to a word is not removed.
For clean("A fr1ie45nd 23 is a sec6on7d self", no_numbers=True, replace_with_number="")
the expected output is "a friend is a second self"
but the actual output is "a fr1ie45nd is a sec6on7d self"
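A hedged workaround with the current API: no_digits targets individual digits rather than standalone numbers, so it should also reach digits embedded in words:

from cleantext import clean

clean("A fr1ie45nd 23 is a sec6on7d self",
      no_digits=True, replace_with_digit="")
# should yield 'a friend is a second self'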
It would be useful to add a feature that filters out code snippets and file paths.
Hi,
I have a CSV file with multiple columns: post_id, post_text.
I read it into a dataframe and am trying to clean the post_text column. The problem is that the clean method doesn't process the full text; it only takes some words from every line of the dataframe!
Please help.
You can find a txt file attached (CSV files are not allowed here) and two screenshots, one showing the real data and the other the output of clean().
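This is likely because clean() received the whole column (or its string representation) rather than each post: clean() expects a single string. A minimal sketch, assuming the file and column names from the report:

import pandas as pd
from cleantext import clean

df = pd.read_csv("posts.csv")            # columns: post_id, post_text
# apply clean() to each post individually, not to the column object
df["post_text"] = df["post_text"].astype(str).apply(clean)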
I just followed the example in the readme and got this issue
>>> clean("he;;p", fix_unicode=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: clean() got an unexpected keyword argument 'fix_unicode'
>>> import cleantext
>>> cleantext.clean("všetko")
'vsetko'
>>> cleantext.clean("Všetko")
'va!etko'
>>> cleantext.__version__
'0.4.0'
Hello, can I use asyncio with this library?
Given that cleaning text can be very time consuming when the amount of text is huge, it would be really good if clean-text provided built-in multiprocessing.
It could be really simple: a flag, plus an option to pass a list of texts instead of a single text.
What do you think?
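Until something like that lands, a minimal sketch using only the standard library (the parameter choices are illustrative):

from functools import partial
from multiprocessing import Pool

from cleantext import clean

texts = ["First text, see http://example.com", "Second text 123"]
clean_one = partial(clean, no_urls=True, no_numbers=True, lang="en")

if __name__ == "__main__":
    with Pool() as pool:               # one worker per CPU core by default
        cleaned = pool.map(clean_one, texts)
    print(cleaned)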
processed_cmts = []
for cmt in df_user_max['comment_body']:
    processed_text = clean(cmt,
                           no_urls=True, no_emails=True, no_numbers=True, no_digits=True,
                           no_currency_symbols=True, no_punct=True,
                           replace_with_url="<URL>",
                           replace_with_email="<EMAIL>",
                           replace_with_phone_number="<PHONE>",
                           replace_with_number="<NUMBER>",
                           replace_with_digit="0",
                           replace_with_currency_symbol="<CUR>",
                           lang="en")
    processed_cmts.append(processed_text)
print(len(processed_cmts))
Error:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     12                        replace_with_digit="0",
     13                        replace_with_currency_symbol="",
---> 14                        lang="en")
     15 processed_cmts.append(processed_text)
     16 print(len(processed_cmts))
TypeError: clean() got an unexpected keyword argument 'no_urls'
Hi!
With a string like "Add more staff,2)", after punctuation is removed the string turns into "Add more staff2", and then the number 2 isn't removed :/
At least for some punctuation signs like .!? it might be better to replace them with whitespace instead of an empty string, so that when there is no whitespace between two sentences, the neighboring words aren't concatenated. Maybe there could be an additional parameter to specify what punctuation signs are replaced with?
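For what it's worth, the clean() signature already includes a replace_with_punct parameter; assuming it accepts a space and that punctuation replacement runs before number removal (worth testing), the example above would work out:

from cleantext import clean

clean("Add more staff,2)", no_punct=True, replace_with_punct=" ",
      no_numbers=True, replace_with_number="")
# the comma becomes a space, so '2' stands alone and is removed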
Hi! Your project is very useful!
Line 86 in d32aa94
I was wondering why this function is applied unconditionally instead of being guarded by an option.
Is fixing whitespace absolutely required?
Thanks!
Thanks for this great library! :)
To facilitate tokenisation it would be great if additional whitespace could be added before and after each emoji.
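A small sketch of how this could be done with the emoji package (version 2.x, where replace_emoji accepts a callable); the helper name is hypothetical:

import emoji

def pad_emojis(text):
    # re-insert each emoji surrounded by spaces so tokenisers split on it
    return emoji.replace_emoji(text, replace=lambda chars, data: f" {chars} ")

pad_emojis("good morning☀️have fun")  # -> 'good morning ☀️ have fun'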
I'd like to remove punctuation from the text but keep "-".
For example, "text---cleaning" should become "text cleaning", but "drive-thru" should still be "drive-thru" after cleaning.
Thank you for building and open-sourcing this!
Unfortunately there are still some compatibility issues in CleanTransformer, especially when using it within Pipeline and FeatureUnion objects:
- the y argument is missing in fit
- missing partial_fit method
- missing get_feature_names_out method
I implemented these in #31. Would be happy to discuss any changes or additional functionality + tests.
Hi there!
I propose adding a scikit-learn compatible API for the function clean. It's as easy as making a class that subclasses scikit-learn's TransformerMixin and implements the fit and transform methods, as sketched below.
This API has the advantage of letting the user integrate the clean function into a preprocessing pipeline and examine, in the downstream task, the effects of the various options that clean provides.
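A minimal sketch of the proposal (the class name is hypothetical; the package has since gained a CleanTransformer, discussed above):

from sklearn.base import BaseEstimator, TransformerMixin
from cleantext import clean

class CleanTextTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper exposing clean() as a scikit-learn transformer."""

    def __init__(self, clean_kwargs=None):
        self.clean_kwargs = clean_kwargs

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        kwargs = self.clean_kwargs or {}
        return [clean(text, **kwargs) for text in X]

A transformer like this can then sit in a Pipeline in front of, e.g., a vectorizer.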
The newest version of the emoji module no longer provides "UNICODE_EMOJI", so importing clean-text results in an error. Having looked into the emoji repo, it now only has "UNICODE_DATA".
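A hedged compatibility shim until clean-text catches up. Note the report says "UNICODE_DATA", while the emoji 2.x releases I'm aware of expose the new structure as EMOJI_DATA; the reconstruction below assumes that name and shape:

try:
    from emoji import UNICODE_EMOJI            # emoji < 2.0
except ImportError:
    from emoji import EMOJI_DATA               # emoji >= 2.0 (assumption, see above)
    # rebuild the old {lang: {emoji: shortcode}} shape for code that expects it
    UNICODE_EMOJI = {"en": {e: d["en"] for e, d in EMOJI_DATA.items() if "en" in d}}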
email_addresses = [
    "[email protected]",
    "mustermann(at)fh-aachen.de",
    "[email protected]",
    "m.mustermann(at)fh-aachen.de",
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]"
]
for i, email in enumerate(email_addresses):
    print(f"{i}: {text_cleaner.transform(email)}")
0: <email>
1: mustermann(at)fh-aachen.de
2: <email>
3: m.mustermann(at)fh-aachen.de
4: <email>-aachen.de
5: <email>-aachen.com
6: <email>
7: <email>.com
I expect that the email addresses written with (at) instead of @ won't work, but all the others should; some of them actually exist in a similar form.
I'm encountering the following error any time I try to run clean:
clean() got an unexpected keyword argument 'fix_unicode'
When I try to remove the 'fix_unicode' step and just go on to the next one, the same error occurs each time. I was able to use this successfully as recently as a few months ago, but now I'm no longer able to. The only difference in usage I can think of is that I'm now working on an M2 Mac (which has caused issues with other packages, as many have pointed out), but I'm unsure whether that would have an impact here.
Has anyone else encountered this or have a sense of how to address it?