snguyenthanh / better_profanity

Blazingly fast cleaning swear words (and their leetspeak) in strings

License: MIT License

profanity python words censor censorship censored-words leetspeak

better_profanity's Introduction

better_profanity

Blazingly fast cleaning swear words (and their leetspeak) in strings


Currently there is a performance issue with the latest version (0.7.0). It is recommended to use the last stable version, 0.6.1.

Inspired by Ben Friedland's profanity package, this library is significantly faster than the original by using string comparison instead of regex.

It supports modified spellings (such as p0rn, h4NDjob, handj0b and b*tCh).

Requirements

This package works with Python 3.5+ and PyPy3.

Installation

pip3 install better_profanity

Unicode characters

Only Unicode characters from categories Ll, Lu, Mc and Mn are added. More on Unicode categories can be found here.

Not all languages are supported yet, such as Chinese.

Usage

from better_profanity import profanity

if __name__ == "__main__":
    profanity.load_censor_words()

    text = "You p1ec3 of sHit."
    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

All modified spellings of words in profanity_wordlist.txt will be generated. For example, the word handjob would be loaded into:

'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b'

The full mapping of the library can be found in profanity.py.
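The variant generation described above can be sketched with itertools.product over a per-character substitution table. The mapping below is an assumed toy subset for illustration only; the library's real table lives in profanity.py:

```python
from itertools import product

# Toy subset of a leetspeak substitution table (assumed for illustration).
CHARS_MAPPING = {
    "a": ("a", "@", "*", "4"),
    "o": ("o", "*", "0", "@"),
}

def generate_variants(word):
    """Yield every spelling produced by substituting mapped characters."""
    pools = [CHARS_MAPPING.get(ch, (ch,)) for ch in word]
    for combo in product(*pools):
        yield "".join(combo)

variants = list(generate_variants("handjob"))
print(len(variants))  # 16 (4 choices for 'a' times 4 choices for 'o')
```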

1. Censor swear words from a text

By default, profanity replaces each swear word with 4 asterisks ****.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

2. Censor doesn't care about word dividers

The function .censor() also hides words separated not only by spaces but also by other dividers, such as underscores (_), commas (,) and periods (.). The exceptions are @, $, *, " and '.

from better_profanity import profanity

if __name__ == "__main__":
    text = "...sh1t...hello_cat_fuck,,,,123"

    censored_text = profanity.censor(text)
    print(censored_text)
    # "...****...hello_cat_****,,,,123"

3. Censor swear words with custom character

The second parameter of .censor() sets the censoring character; each swear word is replaced with 4 instances of it.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text, '-')
    print(censored_text)
    # You ---- of ----.

4. Check if the string contains any swear words

Function .contains_profanity() returns True if any word in the given string exists in the wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    dirty_text = "That l3sbi4n did a very good H4ndjob."

    profanity.contains_profanity(dirty_text)
    # True

5. Censor swear words with a custom wordlist

5.1. Wordlist as a List

Function load_censor_words takes a List of strings as censored words. The provided list will replace the default wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.load_censor_words(custom_badwords)

    print(profanity.censor("Have a merry day! :)"))
    # Have a **** day! :)

5.2. Wordlist as a file

Function load_censor_words_from_file takes the path of a text file containing one word per line.

from better_profanity import profanity

if __name__ == "__main__":
    profanity.load_censor_words_from_file('/path/to/my/project/my_wordlist.txt')

6. Whitelist

Functions load_censor_words and load_censor_words_from_file take a keyword argument whitelist_words to ignore words from a wordlist.

It is best used when there are only a few words that you would like to ignore in the wordlist.

# Use the default wordlist
profanity.load_censor_words(whitelist_words=['happy', 'merry'])

# or with your custom words as a List
custom_badwords = ['happy', 'jolly', 'merry']
profanity.load_censor_words(custom_badwords, whitelist_words=['merry'])

# or with your custom words as a text file
profanity.load_censor_words_from_file('/path/to/my/project/my_wordlist.txt', whitelist_words=['merry'])

7. Add more censor words

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.add_censor_words(custom_badwords)

    print(profanity.censor("Happy you, fuck!"))
    # **** you, ****!

Limitations

  1. As the library compares each word character by character, the censor can easily be bypassed by adding any character(s) to the word:
profanity.censor('I just have sexx')
# returns 'I just have sexx'

profanity.censor('jerkk off')
# returns 'jerkk off'
  2. Any word in the wordlist that contains non-space separators, such as s & m, cannot be recognised and therefore won't be filtered out. This problem was raised in #5.
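One hedged workaround for the repeated-character bypass above is to also check a normalized copy of the text in which runs of repeated letters are collapsed. This is only a preprocessing sketch, not part of better_profanity:

```python
import re

def collapse_letter_runs(text):
    """Collapse any run of a repeated letter to a single letter,
    e.g. 'sexx' -> 'sex' and 'jerkk' -> 'jerk'. Note this also mangles
    legitimate doubles ('off' -> 'of'), so in practice you would check
    BOTH the original and the collapsed text, not just the latter."""
    return re.sub(r"([A-Za-z])\1+", r"\1", text)

print(collapse_letter_runs("I just have sexx"))  # I just have sex
print(collapse_letter_runs("jerkk off"))         # jerk of
```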

Testing

python3 tests.py

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Special thanks to

Acknowledgments

better_profanity's People

Contributors

andriybeats, bakert, jcbrockschmidt, korfor, snguyenthanh


better_profanity's Issues

Some default words not censored

As of 0.7.0, the words "shi+" and "sh!+" have been added to the default wordlist. But they are not censored. Should we...

  1. Remove them from the word list.
  2. Add "+" to ALLOWED_CHARACTERS (and optionally add "+" to CHARS_MAPPING for "t").

Note that if we go with option 2, profanity separated by "+" (e.g. "fuck+fuck") will no longer be censored.
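The trade-off in option 2 can be illustrated with plain string splitting. This is a toy sketch, not the library's actual parser:

```python
import re

# If "+" remains a separator, "fuck+fuck" splits into two tokens that can
# each be checked; if "+" becomes an allowed (substitutable) character so
# that "shi+" can match, it no longer splits words apart.
with_plus_as_separator = re.split(r"[+\s]+", "fuck+fuck")
with_plus_as_letter = re.split(r"\s+", "fuck+fuck")

print(with_plus_as_separator)  # ['fuck', 'fuck']
print(with_plus_as_letter)     # ['fuck+fuck']
```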

Quoted profanities aren't censored

Here's an example of what I am seeing.

>>> from better_profanity import profanity
>>> profanity.contains_profanity('I have to go pee')
True
>>> profanity.contains_profanity('I have to go "pee"')
False

I looked around and it doesn't seem like this behavior is intentional.

swear words with space isn't registered

On social media, commenting with spaces between letters is very common, so I believe this bug should be patched with priority.

from better_profanity import profanity

profanity.contains_profanity('shit')
# True

profanity.contains_profanity('s h i t')
# False
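A possible client-side workaround, pending a fix, is to join runs of single characters before calling the library. This is an assumed preprocessing step, not part of better_profanity, and it can occasionally join legitimate sequences of one-letter words:

```python
import re

def join_spaced_letters(text):
    """Join runs of single word-characters separated by single spaces,
    so 's h i t' becomes 'shit' before the profanity check."""
    return re.sub(r"\b(?:\w )+\w\b",
                  lambda m: m.group(0).replace(" ", ""),
                  text)

print(join_spaced_letters("s h i t"))              # shit
print(join_spaced_letters("please s t o p that"))  # please stop that
```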

Get all possible leet words from my txt file and store it as list variable.

Hello 😊,
I want to get all possible leet words from my txt file and store them in a list.

Suppose I have a file a.txt that contains the following:

handjob

and I want to print all possible modified spellings of the words, stored in a list:

[ 'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b' ]

Can you please share the code for this? That would be great!

Thanks, and have a nice day! ❤️

Add supports for other languages

Currently, better_profanity only supports a limited number of languages, from categories Ll, Lu, Mc and Mn (More on Unicode categories can be found here).

Please feel free to create a PR for Unicode categories for your wanted languages.

Does not work for Python 3.5

I get an error when running on Python 3.5: better-profanity requires Python '>3.6' but the running Python is 3.5.0

"hell" issue

If the word "hell" is sent to profanity.censor(), nothing happens. However if I were to add a space at the end like this: "hell ", it gets censored.
Was this intentional?

Remove some swears

I personally don't consider "gay" and "lesbian" profanity. How can I disable them?

Version 0.7.0 significantly slower than 0.6.1

We've been using this great module in a larger analysis for months: thank you for making it available! Here's the relevant code:

corpus = reviews['review_text']
profanity.add_censor_words([x.lower() for x in other_stop_words])
corpus = corpus.apply(profanity.censor, args=(' ',))

The statement corpus.apply(profanity.censor, args=(' ',)) is taking a couple orders of magnitude longer using version 0.7.0 than 0.6.1. Here are some timings with everything the same other than the better_profanity version. "Time to apply profanity" is for just corpus = corpus.apply(profanity.censor, args=(' ',))

better_profanity=0.7.0

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 95.60132622718811
Time it takes to run this LDA model: 97.7805495262146
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 29.865559816360474
Time it takes to run this LDA model: 30.556731939315796
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 710.3287241458893
Time it takes to run this LDA model: 726.3966491222382
{'pocket', 'comfortable', 'feel', 'waist', 'fit', 'see', 'color', 'soft'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 10.014711618423462
Time it takes to run this LDA model: 10.347350835800171
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 121.31187510490417
Time it takes to run this LDA model: 123.6646056175232
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

better_profanity=0.6.1

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 0.19291996955871582
Time it takes to run this LDA model: 4.058649063110352
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 0.05718731880187988
Time it takes to run this LDA model: 0.7385601997375488
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 1.264852523803711
Time it takes to run this LDA model: 17.08655619621277
{'pocket', 'size', 'comfortable', 'waist', 'fit', 'around', 'color', 'amazing'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 0.018624067306518555
Time it takes to run this LDA model: 0.34430885314941406
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 0.2005002498626709
Time it takes to run this LDA model: 2.5792129039764404
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

could this be faster with Set instead of List

My colleague was working with this library for some NLP stuff, and he was trying to manipulate the CENSOR_WORDS for reasons not particularly important for this question.

It got me wondering: wouldn't this all go a lot faster if CENSOR_WORDS were a set()? Forgive me if I'm wasting your time; I didn't fully trace the code.

It seems to me that a lookup against a very large set of words or phrases would always be faster if you had a Set because it works as a hash table under the python covers.
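The intuition about hash-based lookup can be checked with a quick micro-benchmark; the exact numbers are environment-dependent, but the gap is consistently large:

```python
import timeit

words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)

probe = "word9999"  # worst case for the list: the last element
list_time = timeit.timeit(lambda: probe in words_list, number=1_000)
set_time = timeit.timeit(lambda: probe in words_set, number=1_000)

# A list membership test scans linearly (O(n)); a set hashes in O(1)
# on average, so set_time is typically orders of magnitude smaller.
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```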

Want to see the list of censored words for one word i input

The readme is confusing; I am unsure what the file should look like. I want to run a script that asks me for a profane phrase like "handjob" and prints the list of censored words, like:
handj*b
handj0b
handj@b
h@ndjob
...

to CLI output or to a file. If it cannot ask interactively, I would write a bash script that inserts my phrase into a file, and the .py script would then read that file and print the censored words one per line.

Consider a trie based approach to maybe increase the overall performance of this package

Hello,

In the context of my application, I developed a simple profanity checker. I compared it to yours and mine runs 10,000X faster.

Note: What I do not take into consideration are varying swear words (sex, s3x, etc.). However, building the Trie can incorporate these variations. In my context, I do not do it because I check well written book titles and descriptions.

Note 2: I only use a contains-based approach. However, this approach can also use a censor method.

Here is the implementation:

import re

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class ProfanityChecker:
    def __init__(self):
        with open("profanity.txt", "r") as f:
            self.root = self.build_trie(f.read().splitlines())

    # Building the trie: O(m * k),
    # where m is the number of words in the dictionary and k is the average length of words.
    def build_trie(self, dictionary):
        root = TrieNode()
        for word in dictionary:
            node = root
            for char in word:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.is_end_of_word = True
        return root

    # Searching each word in the text: O(n * k), where n is the number of words in the text.
    def contains_profanity(self, text):
        def search(remaining_text, node=self.root):
            for i, char in enumerate(remaining_text):
                if char in node.children:
                    node = node.children[char]
                    if node.is_end_of_word and (i == len(remaining_text) - 1 or remaining_text[i + 1] in (' ', '-')):
                        return True
                else:
                    break  # Stop searching if the character is not in the trie
            return False

        # Remove punctuation and convert to lowercase before tokenizing
        text = re.sub(r'[^\w\s]', '', text)
        words = text.split()

        for word in words:
            if search(word.lower()):
                return True
        return False


if __name__ == '__main__':
    pc = ProfanityChecker()
    print(pc.contains_profanity("my assessment"))  # Output: False
    print(pc.contains_profanity("my ass essment"))  # Output: True
    print(pc.contains_profanity("my ass-essment"))  # Output: True

censorship error

There is a bug in which '455' is reported as profanity (contains_profanity returns True).
I don't know whether that's a curse word, because I don't speak English; if it is, what does it mean?

Regarding use of regex

Instead of generating distorted profane words from the list of profane words, can't we use regex?

Why "xXx" is profanity?

The result of such constructions is often TRUE:
print(profanity.contains_profanity('xxx_x'))
or print(profanity.contains_profanity('xXx')) and so on.
But I don't understand why.
Thanks!

Excessive memory consumption for large enough words

When a word or phrase is sufficiently long, the method Profanity.load_censor_words_from_file will consume all of a system's memory.

The cause appears to be the use of product in Profanity._generate_patterns_from_word. As the number of substitutable characters in a word grows linearly, the number of possible patterns grows exponentially. In other words, it has a memory footprint of O(n^x), where n is the approximate number of substitutes per character and x is the number of substitutable characters. At some point, x becomes so large that an entire system's memory is consumed.

You can use these words in a text file to pinpoint where this threshold is:

eeeeeeeeeeee
eeeeeeeeeeeee
eeeeeeeeeeeeee
eeeeeeeeeeeeeee

The second-to-last line of 15 characters will take relatively long to process. Then the final word of 16 characters will consume all of a system's available memory (up to 16 GB). Here's a table of how much memory python3.6 takes up as a process given the length of our test words,

e's   len(product(...))   Memory (MB)
13    1594323             216
14    4782969             625
15    14348907            1411
16    43046721            ...

(len(product(...)) is calculated as 3**x, since "e" can be substituted with "e", "*", or "3".)

For context, python3.6 takes up only about 60 Mb when the default 320 length wordlist is loaded.

Either the underlying data structure needs to be changed, or words for which the predicted output of product exceeds a certain amount need to be ignored.
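The growth shown in the table can be reproduced arithmetically: with 3 substitutes per character ("e", "*", "3"), a run of n e's expands into 3**n patterns:

```python
# Number of patterns product() would emit for a run of n "e"s,
# each substitutable by "e", "*", or "3".
def pattern_count(n, substitutes_per_char=3):
    return substitutes_per_char ** n

for n in (13, 14, 15, 16):
    print(n, pattern_count(n))
# 13 -> 1594323, 14 -> 4782969, 15 -> 14348907, 16 -> 43046721
```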

problem with PyInstaller

I'm packaging a script as an executable via PyInstaller.

better_profanity always looks for a file in this directory: C:\\Users\\<USERNAME>\\AppData\\Local\\Temp\\_MEI199002\\better_profanity\\alphabetic_unicode.json

Is there a way to select the file for that manually as we can for the wordlist?

Traceback:
Traceback (most recent call last):
  File "gui.py", line 15, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "app\app.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\__init__.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\better_profanity.py", line 5, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 546, in exec_module
  File "better_profanity\constants.py", line 14, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\\\AppData\\Local\\Temp\\_MEI199002\\better_profanity\\alphabetic_unicode.json'

Incorrect results for short words with numbers.

Hi, thank you for your excellent work.
I noticed issues related to the short words that include numbers.

version 0.7.0
Example:

from better_profanity import profanity

print(profanity.contains_profanity("73rd"))    # True
print(profanity.contains_profanity("73rdem"))  # False

Problems with separators / non-allowed characters

Both issues I've encountered appear to be parser related:

  1. If a swear word is at the very end of a string and its last character is preceded by a non-allowed character, it will not be censored. So f_u_c_k won't be censored, but f_u_c_k_ and f_u_ck will be.
  2. If too many characters in a swear word have separators between them, the swear word won't be censored. This appears to happen when the number of separations exceeds MAX_NUMBER_COMBINATIONS. So w i l l i es_ will be censored but not w i l l i e s_ (underscore included to avoid issue described in 1).
