Coder Social home page Coder Social logo

better_profanity's Introduction

better_profanity

Blazingly fast cleaning swear words (and their leetspeak) in strings

release Build Status python license

Inspired from package profanity of Ben Friedland, this library is significantly faster than the original one, by using string comparison instead of regex.

It supports modified spellings (such as p0rn, h4NDjob, handj0b and b*tCh).

Requirements

This package only works with Python 3.

Installation

$ pip install better_profanity

Unicode characters

Only Unicode characters from categories Ll, Lu, Mc and Mn are added. More on Unicode categories can be found here.

Not all languages are supported yet, such as Chinese.

Wordlist

Most of the words in the default wordlist are referred from Full List of Bad Words and Top Swear Words Banned by Google.

The wordlist contains a total of 106,992 words, including 317 words from the default profanity_wordlist.txt and their variants by modified spellings.

Its total size in memory is 10.49+MB.

Usage

It is highly recommended to call profanity.load_censor_words() at initialization, to reduce the runtime for the first profanity.censor() call.

from better_profanity import profanity

if __name__ == "__main__":
    profanity.load_censor_words()

    text = "You p1ec3 of sHit."
    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

All modified spellings of words in profanity_wordlist.txt will be generated. For example, the word handjob would be loaded into:

'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b'

The full mapping of the library can be found in profanity.py.

1. Censor swear words from a text

By default, profanity replaces each swear words with 4 asterisks ****.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

2. Censor doesn't care about word dividers

The function .censor() also hide words separated not just by an empty space but also other dividers, such as _, , and .. Except for @, $, *, ", '.

from better_profanity import profanity

if __name__ == "__main__":
    text = "...sh1t...hello_cat_fuck,,,,123"

    censored_text = profanity.censor(text)
    print(censored_text)
    # "...****...hello_cat_****,,,,123"

3. Censor swear words with custom character

4 instances of the character in second parameter in .censor() will be used to replace the swear words.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text, '-')
    print(censored_text)
    # You ---- of ----.

4. Check if the string contains any swear words

Function .contains_profanity() return True if any words in the given string has a word existing in the wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    dirty_text = "That l3sbi4n did a very good H4ndjob."

    profanity.contains_profanity(dirty_text)
    # True

5. Censor swear words with a custom wordlist

Function .load_censor_words() takes a List of strings as censored words. The provided list will replace the default wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.load_censor_words(custom_badwords)

    print(profanity.contains_profanity("Fuck you!"))
    # Fuck you

    print(profanity.contains_profanity("Have a merry day! :)"))
    # Have a **** day! :)

6. Censor Unicode characters

No extra steps needed!

from better_profanity import profanity

if __name__ == "__main__":
    bad_text = "Эффекти́вного противоя́дия от я́да фу́гу не существу́ет до сих пор"
    profanity.load_censor_words(["противоя́дия"])

    censored_text = profanity.censor(text)
    print(censored_text)
    # Эффекти́вного **** от я́да фу́гу не существу́ет до сих пор

Limitations

  1. As the library compares each word by characters, the censor could easily be bypassed by adding any character(s) to the word:
profanity.censor('I just have sexx')
# returns 'I just have sexx'

profanity.censor('jerkk off')
# returns 'jerkk off'
  1. Any word in wordlist that have non-space separators cannot be recognised, such as s & m, and therefore, it won't be filtered out. This problem was raised in #5.

Testing

$ python tests.py

Versions

  • v0.4.0 - Add compatibility to all versions of Python 3.
  • v0.3.4 - Add significantly more swear words.
  • v0.3.3 - Fix incompatibility with Python 3.5.
  • v0.3.2 - Fix a typo in documentation.
  • v0.3.1 - Remove unused dependencies.
  • v0.3.0 - Add support for Unicode characters (Categories: Ll, Lu, Mc and Mn) #2.
  • v0.2.0 - Bug fix + faster censoring
  • v0.1.0 - Initial release

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Special thanks to

Acknowledgments

better_profanity's People

Contributors

snguyenthanh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.