snguyenthanh / better_profanity

Blazingly fast cleaning swear words (and their leetspeak) in strings

License: MIT License

profanity python words censor censorship censored-words leetspeak

better_profanity's Introduction

better_profanity

Blazingly fast cleaning swear words (and their leetspeak) in strings


Currently there is a performance issue with the latest version (0.7.0). It is recommended to use the last stable version, 0.6.1.

Inspired by Ben Friedland's profanity package, this library is significantly faster than the original by using string comparison instead of regex.

It supports modified spellings (such as p0rn, h4NDjob, handj0b and b*tCh).

Requirements

This package works with Python 3.5+ and PyPy3.

Installation

pip3 install better_profanity

Unicode characters

Only Unicode characters from categories Ll, Lu, Mc and Mn are added. More on Unicode categories can be found here.

Not all languages are supported yet, such as Chinese.

Usage

from better_profanity import profanity

if __name__ == "__main__":
    profanity.load_censor_words()

    text = "You p1ec3 of sHit."
    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

All modified spellings of words in profanity_wordlist.txt will be generated. For example, the word handjob would be loaded into:

'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b'

The full mapping of the library can be found in profanity.py.
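The variant generation described above can be sketched with itertools.product over a per-character substitution table. The mapping below is an assumed toy subset for illustration only; the library's real table lives in profanity.py:

```python
from itertools import product

# Toy subset of a leetspeak substitution table (assumed for illustration).
CHARS_MAPPING = {
    "a": ("a", "@", "*", "4"),
    "o": ("o", "*", "0", "@"),
}

def generate_variants(word):
    """Yield every spelling produced by substituting mapped characters."""
    pools = [CHARS_MAPPING.get(ch, (ch,)) for ch in word]
    for combo in product(*pools):
        yield "".join(combo)

variants = list(generate_variants("handjob"))
print(len(variants))  # 16 (4 choices for 'a' times 4 choices for 'o')
```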

1. Censor swear words from a text

By default, profanity replaces each swear word with 4 asterisks ****.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

2. Censor doesn't care about word dividers

The function .censor() also hides words separated not only by spaces but also by other dividers, such as underscores (_), commas (,) and periods (.). The exceptions are @, $, *, " and '.

from better_profanity import profanity

if __name__ == "__main__":
    text = "...sh1t...hello_cat_fuck,,,,123"

    censored_text = profanity.censor(text)
    print(censored_text)
    # "...****...hello_cat_****,,,,123"

3. Censor swear words with custom character

The second parameter of .censor() sets the censoring character; each swear word is replaced with 4 instances of it.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text, '-')
    print(censored_text)
    # You ---- of ----.

4. Check if the string contains any swear words

Function .contains_profanity() returns True if any word in the given string exists in the wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    dirty_text = "That l3sbi4n did a very good H4ndjob."

    profanity.contains_profanity(dirty_text)
    # True

5. Censor swear words with a custom wordlist

5.1. Wordlist as a List

Function load_censor_words takes a List of strings as censored words. The provided list will replace the default wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.load_censor_words(custom_badwords)

    print(profanity.censor("Have a merry day! :)"))
    # Have a **** day! :)

5.2. Wordlist as a file

Function load_censor_words_from_file takes the path of a text file containing one word per line.

from better_profanity import profanity

if __name__ == "__main__":
    profanity.load_censor_words_from_file('/path/to/my/project/my_wordlist.txt')

6. Whitelist

Functions load_censor_words and load_censor_words_from_file take a keyword argument whitelist_words to ignore words from a wordlist.

It is best used when there are only a few words that you would like to ignore in the wordlist.

# Use the default wordlist
profanity.load_censor_words(whitelist_words=['happy', 'merry'])

# or with your custom words as a List
custom_badwords = ['happy', 'jolly', 'merry']
profanity.load_censor_words(custom_badwords, whitelist_words=['merry'])

# or with your custom words as a text file
profanity.load_censor_words_from_file('/path/to/my/project/my_wordlist.txt', whitelist_words=['merry'])

7. Add more censor words

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.add_censor_words(custom_badwords)

    print(profanity.censor("Happy you, fuck!"))
    # **** you, ****!

Limitations

  1. As the library compares each word character by character, the censor can easily be bypassed by adding any character(s) to the word:
profanity.censor('I just have sexx')
# returns 'I just have sexx'

profanity.censor('jerkk off')
# returns 'jerkk off'
  2. Any word in the wordlist that contains non-space separators, such as s & m, cannot be recognised and therefore won't be filtered out. This problem was raised in #5.
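One hedged workaround for the repeated-character bypass above is to also check a normalized copy of the text in which runs of repeated letters are collapsed. This is only a preprocessing sketch, not part of better_profanity:

```python
import re

def collapse_letter_runs(text):
    """Collapse any run of a repeated letter to a single letter,
    e.g. 'sexx' -> 'sex' and 'jerkk' -> 'jerk'. Note this also mangles
    legitimate doubles ('off' -> 'of'), so in practice you would check
    BOTH the original and the collapsed text, not just the latter."""
    return re.sub(r"([A-Za-z])\1+", r"\1", text)

print(collapse_letter_runs("I just have sexx"))  # I just have sex
print(collapse_letter_runs("jerkk off"))         # jerk of
```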

Testing

python3 tests.py

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Special thanks to

Acknowledgments

better_profanity's People

Contributors

andriybeats, bakert, jcbrockschmidt, korfor, snguyenthanh


better_profanity's Issues

Some default words not censored

As of 0.7.0, the words "shi+" and "sh!+" have been added to the default wordlist. But they are not censored. Should we...

  1. Remove them from the word list.
  2. Add "+" to ALLOWED_CHARACTERS (and optionally add "+" to CHARS_MAPPING for "t").

Note that if we go with option 2, profanity separated by "+" (e.g. "fuck+fuck") will no longer be censored.
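The trade-off in option 2 can be illustrated with plain string splitting. This is a toy sketch, not the library's actual parser:

```python
import re

# If "+" remains a separator, "fuck+fuck" splits into two tokens that can
# each be checked; if "+" becomes an allowed (substitutable) character so
# that "shi+" can match, it no longer splits words apart.
with_plus_as_separator = re.split(r"[+\s]+", "fuck+fuck")
with_plus_as_letter = re.split(r"\s+", "fuck+fuck")

print(with_plus_as_separator)  # ['fuck', 'fuck']
print(with_plus_as_letter)     # ['fuck+fuck']
```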

Quoted profanities aren't censored

Here's an example of what I am seeing.

>>> from better_profanity import profanity
>>> profanity.contains_profanity('I have to go pee')
True
>>> profanity.contains_profanity('I have to go "pee"')
False

I looked around and it doesn't seem like this behavior is intentional.

swear words with space isn't registered

On social media, commenting with spaces between letters is very common, so I believe this bug should be patched with priority.

from better_profanity import profanity

profanity.contains_profanity('shit')
# True

profanity.contains_profanity('s h i t')
# False
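A possible client-side workaround, pending a fix, is to join runs of single characters before calling the library. This is an assumed preprocessing step, not part of better_profanity, and it can occasionally join legitimate sequences of one-letter words:

```python
import re

def join_spaced_letters(text):
    """Join runs of single word-characters separated by single spaces,
    so 's h i t' becomes 'shit' before the profanity check."""
    return re.sub(r"\b(?:\w )+\w\b",
                  lambda m: m.group(0).replace(" ", ""),
                  text)

print(join_spaced_letters("s h i t"))              # shit
print(join_spaced_letters("please s t o p that"))  # please stop that
```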

Get all possible leet words from my txt file and store it as list variable.

Hello 😊,
I want to get all possible leet words from my txt file and store them in a list.

Suppose I have a file a.txt that contains the following:

handjob

and I want to print all possible modified spellings of the words, stored in a list:

[ 'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b' ]

Can you please share the code for this? That would be great!

Thanks, and have a nice day! ❤️

Add supports for other languages

Currently, better_profanity only supports a limited number of languages, from categories Ll, Lu, Mc and Mn (More on Unicode categories can be found here).

Please feel free to create a PR for Unicode categories for your wanted languages.

Does not work for Python 3.5

I get an error when running on Python 3.5: better-profanity requires Python '>3.6' but the running Python is 3.5.0

"hell" issue

If the word "hell" is sent to profanity.censor(), nothing happens. However if I were to add a space at the end like this: "hell ", it gets censored.
Was this intentional?

Remove some swears

I personally don't consider "gay" and "lesbian" profanity. How can I disable them?

Version 0.7.0 significantly slower than 0.6.1

We've been using this great module in a larger analysis for months: thank you for making it available! Here's the relevant code:

corpus = reviews['review_text']
profanity.add_censor_words([x.lower() for x in other_stop_words])
corpus = corpus.apply(profanity.censor, args=(' ',))

The statement corpus.apply(profanity.censor, args=(' ',)) is taking a couple orders of magnitude longer using version 0.7.0 than 0.6.1. Here are some timings with everything the same other than the better_profanity version. "Time to apply profanity" is for just corpus = corpus.apply(profanity.censor, args=(' ',))

better_profanity=0.7.0

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 95.60132622718811
Time it takes to run this LDA model: 97.7805495262146
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 29.865559816360474
Time it takes to run this LDA model: 30.556731939315796
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 710.3287241458893
Time it takes to run this LDA model: 726.3966491222382
{'pocket', 'comfortable', 'feel', 'waist', 'fit', 'see', 'color', 'soft'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 10.014711618423462
Time it takes to run this LDA model: 10.347350835800171
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 121.31187510490417
Time it takes to run this LDA model: 123.6646056175232
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

better_profanity=0.6.1

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 0.19291996955871582
Time it takes to run this LDA model: 4.058649063110352
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 0.05718731880187988
Time it takes to run this LDA model: 0.7385601997375488
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 1.264852523803711
Time it takes to run this LDA model: 17.08655619621277
{'pocket', 'size', 'comfortable', 'waist', 'fit', 'around', 'color', 'amazing'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 0.018624067306518555
Time it takes to run this LDA model: 0.34430885314941406
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 0.2005002498626709
Time it takes to run this LDA model: 2.5792129039764404
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

could this be faster with Set instead of List

My colleague was working with this library for some NLP stuff, and he was trying to manipulate the CENSOR_WORDS for reasons not particularly important for this question.

It got me wondering: wouldn't this all go a lot faster if CENSOR_WORDS were a set()? Forgive me if I'm wasting your time; I didn't fully trace the code.

It seems to me that a lookup against a very large set of words or phrases would always be faster if you had a Set because it works as a hash table under the python covers.
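The intuition about hash-based lookup can be checked with a quick micro-benchmark; the exact numbers are environment-dependent, but the gap is consistently large:

```python
import timeit

words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)

probe = "word9999"  # worst case for the list: the last element
list_time = timeit.timeit(lambda: probe in words_list, number=1_000)
set_time = timeit.timeit(lambda: probe in words_set, number=1_000)

# A list membership test scans linearly (O(n)); a set hashes in O(1)
# on average, so set_time is typically orders of magnitude smaller.
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```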

Want to see the list of censored words for one word i input

The readme is confusing; I am unsure what the file should look like. I want to run a script that asks me for a profane phrase like "handjob" and prints the list of censored words, like:
handj*b
handj0b
handj@b
h@ndjob
...

to CLI output or to a file. If it cannot ask interactively, I would write a bash script that inserts my phrase into a file, and the .py script would then read that file and print the censored words one per line.

Consider a trie based approach to maybe increase the overall performance of this package

Hello,

In the context of my application, I developed a simple profanity checker. I compared it to yours and mine runs 10,000X faster.

Note: What I do not take into consideration are varying swear words (sex, s3x, etc.). However, building the Trie can incorporate these variations. In my context, I do not do it because I check well written book titles and descriptions.

Note 2: I only use a contains-based approach. However, this approach can also use a censor method.

Here is the implementation:

import re

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class ProfanityChecker:
    def __init__(self):
        with open("profanity.txt", "r") as f:
            self.root = self.build_trie(f.read().splitlines())

    # Building the trie: O(m * k),
    # where m is the number of words in the dictionary and k is the average length of words.
    def build_trie(self, dictionary):
        root = TrieNode()
        for word in dictionary:
            node = root
            for char in word:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.is_end_of_word = True
        return root

    # Searching each word in the text: O(n * k), where n is the number of words in the text.
    def contains_profanity(self, text):
        def search(remaining_text, node=self.root):
            for i, char in enumerate(remaining_text):
                if char in node.children:
                    node = node.children[char]
                    if node.is_end_of_word and (i == len(remaining_text) - 1 or remaining_text[i + 1] in (' ', '-')):
                        return True
                else:
                    break  # Stop searching if the character is not in the trie
            return False

        # Remove punctuation and convert to lowercase before tokenizing
        text = re.sub(r'[^\w\s]', '', text)
        words = text.split()

        for word in words:
            if search(word.lower()):
                return True
        return False


if __name__ == '__main__':
    pc = ProfanityChecker()
    print(pc.contains_profanity("my assessment"))  # Output: False
    print(pc.contains_profanity("my ass essment"))  # Output: True
    print(pc.contains_profanity("my ass-essment"))  # Output: True

censorship error

There is a bug in which '455' is reported as profanity (contains_profanity returns True).
I don't know whether that's a curse word, because I don't speak English; if it is, what does it mean?

Regarding use of regex

Instead of generating distorted profane words from the list of profane words, can't we use regex?

Why "xXx" is profanity?

The result of such constructions is often TRUE:
print(profanity.contains_profanity('xxx_x'))
or print(profanity.contains_profanity('xXx')) and so on.
But I don't understand why.
Thanks!

Excessive memory consumption for large enough words

When a word or phrase is sufficiently long, the method Profanity.load_censor_words_from_file will consume all of a system's memory.

The cause appears to be the use of product in Profanity._generate_patterns_from_word. As the number of substitutable characters in a word grows linearly, the number of possible patterns grows exponentially. In other words, it has a memory footprint of O(n^x), where n is the approximate number of substitutes per character and x is the number of substitutable characters. At some point, x becomes so large that an entire system's memory is consumed.

You can use these words in a text file to pinpoint where this threshold is:

eeeeeeeeeeee
eeeeeeeeeeeee
eeeeeeeeeeeeee
eeeeeeeeeeeeeee

The second-to-last line of 15 characters will take relatively long to process. Then the final word of 16 characters will consume all of a system's available memory (up to 16 GB). Here's a table of how much memory python3.6 takes up as a process given the length of our test words,

e's   len(product(...))   Memory (MB)
13    1594323             216
14    4782969             625
15    14348907            1411
16    43046721            ...

(len(product(...)) is calculated as 3**x, since "e" can be substituted with "e", "*", or "3".)

For context, python3.6 takes up only about 60 Mb when the default 320 length wordlist is loaded.

Either the underlying data structure needs to be changed, or words for which the predicted output of product exceeds a certain amount need to be ignored.
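The growth shown in the table can be reproduced arithmetically: with 3 substitutes per character ("e", "*", "3"), a run of n e's expands into 3**n patterns:

```python
# Number of patterns product() would emit for a run of n "e"s,
# each substitutable by "e", "*", or "3".
def pattern_count(n, substitutes_per_char=3):
    return substitutes_per_char ** n

for n in (13, 14, 15, 16):
    print(n, pattern_count(n))
# 13 -> 1594323, 14 -> 4782969, 15 -> 14348907, 16 -> 43046721
```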

problem with PyInstaller

I'm packaging a script as an executable via PyInstaller.

better_profanity always looks for a file in this directory: C:\\Users\\<USERNAME>\\AppData\\Local\\Temp\\_MEI199002\\better_profanity\\alphabetic_unicode.json

Is there a way to select the file for that manually as we can for the wordlist?

Traceback:
Traceback (most recent call last):
  File "gui.py", line 15, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "app\app.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\__init__.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\better_profanity.py", line 5, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\constants.py", line 14, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\\\AppData\\Local\\Temp\\_MEI199002\\better_profanity\\alphabetic_unicode.json'

Incorrect results for short words with numbers.

Hi, thank you for your excellent work.
I noticed issues related to the short words that include numbers.

version 0.7.0
Example:

from better_profanity import profanity

print(profanity.contains_profanity("73rd"))    # True
print(profanity.contains_profanity("73rdem"))  # False

Problems with separators / non-allowed characters

Both issues I've encountered appear to be parser related:

  1. If a swear word is at the very end of a string and its last character is preceded by a non-allowed character, it will not be censored. So f_u_c_k won't be censored, but f_u_c_k_ and f_u_ck will be.
  2. If too many characters in a swear word have separators between them, the swear word won't be censored. This appears to happen when the number of separations exceeds MAX_NUMBER_COMBINATIONS. So w i l l i es_ will be censored but not w i l l i e s_ (underscore included to avoid issue described in 1).
