Comments (8)

Priler commented on June 3, 2024

Yeah, I see how self-censoring is implemented.
Then I should add N variations of the same word in order to include such cases.

As for accents, I meant that there should be some way to extend replacements, for example.
Then I could just add a custom table and ö would be replaced with Cyrillic о.

And yes, I do understand that the same ö can be replaced with ASCII o.
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, e.g.).
This is how it's implemented in py-censure (via the lang argument).

Because, AFAIK, the current implementation of rustrict is closely tied to ASCII/English profanity filtering.
It lacks localization options.

P.S. These are my thoughts and suggestions on how rustrict could be improved.
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).

from rustrict.

finnbear commented on June 3, 2024

I am open to expanding rustrict to additional languages, to the extent that it doesn't add too much complexity or overhead*.

*adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.

As for accents, I meant that there should be some way to extend replacements

I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).

Then I could just add a custom table and ö would be replaced with Cyrillic о.

The umlaut would (along with all other accents) be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules).
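To illustrate the point about accents, here is a minimal sketch of an accent-stripping step. It is not rustrict's code: `strip_accent` and `normalize` are hypothetical names, and the hand-written table is a tiny stand-in for real Unicode normalization data.

```rust
/// Hypothetical stand-in for Unicode decomposition: folds a few precomposed
/// letters to their base letter and drops combining marks. A real filter
/// would rely on full Unicode normalization tables instead.
fn strip_accent(c: char) -> Option<char> {
    match c {
        // Precomposed Latin letters with accents.
        '\u{00F6}' | '\u{00F3}' | '\u{00F2}' | '\u{00F4}' => Some('o'),
        '\u{00E4}' | '\u{00E1}' | '\u{00E0}' | '\u{00E2}' => Some('a'),
        '\u{00EB}' | '\u{00E9}' | '\u{00E8}' | '\u{00EA}' => Some('e'),
        // Combining diacritical marks are removed entirely.
        '\u{0300}'..='\u{036F}' => None,
        other => Some(other),
    }
}

fn normalize(input: &str) -> String {
    input.chars().filter_map(strip_accent).collect()
}

fn main() {
    // Precomposed ö (U+00F6) and 'o' + combining diaeresis both become 'o'.
    assert_eq!(normalize("\u{00F6}"), "o");
    assert_eq!(normalize("o\u{0308}"), "o");
    // Non-accented characters, including Cyrillic, pass through unchanged.
    assert_eq!(normalize("\u{043E}"), "\u{043E}");
}
```

Note that the result is a Latin 'o', so a Cyrillic wordlist entry would still not match unless later replacement rules handle it.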

While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о' but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.

So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, e.g.).
This is how it's implemented in py-censure (via the lang argument).
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).

It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute out the wordlist (or compose multiple wordlists). The main obstacle is finding false-positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).
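A rough sketch of what such a false-positive scan could look like, under the assumption that it boils down to checking every dictionary entry for an embedded profanity. The function name `false_positives` and the tiny inputs are hypothetical; the real process runs over a full dictionary and is not described in detail here.

```rust
/// Returns dictionary entries that contain a profanity without being one.
/// Whitespace is removed first so that phrases spanning a word boundary
/// are caught as well.
fn false_positives<'a>(dictionary: &[&'a str], profanities: &[&str]) -> Vec<&'a str> {
    dictionary
        .iter()
        .copied()
        .filter(|entry| {
            let squashed: String = entry.chars().filter(|c| !c.is_whitespace()).collect();
            profanities
                .iter()
                .any(|&p| squashed.contains(p) && squashed != p)
        })
        .collect()
}

fn main() {
    let dictionary = ["assassin", "push it", "hello"];
    let profanities = ["ass", "shit"];
    for entry in false_positives(&dictionary, &profanities) {
        println!("possible false positive: {entry}");
    }
}
```

With a complete dictionary the same loop visits every word and phrase, which is why this kind of check fits a build step better than runtime.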


mkadirtan commented on June 3, 2024

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?

My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in the word lists. This allows for Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.
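As a sketch of the idea (not the actual bad-words-next format, and not an existing rustrict API), a per-language pack could bundle a wordlist with its own lookalike table. `LanguagePack`, `russian_pack`, and the single Latin-to-Cyrillic mapping are all hypothetical; the word is the placeholder used elsewhere in this thread, written with escapes so every letter is unambiguously Cyrillic.

```rust
use std::collections::HashMap;

/// Hypothetical per-language bundle: words plus replacements that fold
/// lookalikes into the alphabet the words are written in.
struct LanguagePack {
    words: Vec<&'static str>,
    lookalikes: HashMap<char, char>,
}

impl LanguagePack {
    fn contains_profanity(&self, input: &str) -> bool {
        let folded: String = input
            .chars()
            .map(|c| self.lookalikes.get(&c).copied().unwrap_or(c))
            .collect();
        self.words.iter().any(|&w| folded.contains(w))
    }
}

fn russian_pack() -> LanguagePack {
    LanguagePack {
        // Placeholder entry, all Cyrillic letters.
        words: vec!["\u{043F}\u{043B}\u{043E}\u{0445}\u{043E}\u{0435}\u{0441}\u{043B}\u{043E}\u{0432}\u{043E}"],
        // Latin 'o' is treated as Cyrillic 'о' (U+043E) for this language only.
        lookalikes: HashMap::from([('o', '\u{043E}')]),
    }
}

fn main() {
    let ru = russian_pack();
    // Same placeholder with every Cyrillic 'о' swapped for Latin 'o'.
    let mixed = "\u{043F}\u{043B}o\u{0445}o\u{0435}\u{0441}\u{043B}o\u{0432}o";
    assert!(ru.contains_profanity(mixed));
    assert!(!ru.contains_profanity("hello"));
}
```

Censoring in a single language then amounts to consulting only that language's pack.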


finnbear commented on June 3, 2024

Thanks for the issue!

except self-censoring
assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false

The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains fuk which should also cover fu*k. You could, for example, add плхеслво тест to your wordlist. The exception is ASCII vowels (a, e, i, etc.), which are handled automatically e.g. fuck should cover f*ck.
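A simplified model of that behavior, for illustration only: `matches_censored` is a hypothetical helper that compares a whole input against one wordlist entry, letting '*' stand in for a single ASCII vowel and otherwise ignoring it. The real filter does considerably more than this.

```rust
fn is_ascii_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Whole-string comparison of `input` against a wordlist entry `word`.
fn matches_censored(input: &str, word: &str) -> bool {
    let mut word_chars = word.chars().peekable();
    for ic in input.chars() {
        if ic == '*' {
            // A '*' may replace one ASCII vowel; otherwise it is skipped.
            if word_chars.peek().copied().map_or(false, is_ascii_vowel) {
                word_chars.next();
            }
            continue;
        }
        if word_chars.next() != Some(ic) {
            return false;
        }
    }
    word_chars.next().is_none()
}

fn main() {
    // The vowel is covered automatically.
    assert!(matches_censored("f*ck", "fuck"));
    // A censored consonant needs the shortened variant in the wordlist.
    assert!(!matches_censored("fu*k", "fuck"));
    assert!(matches_censored("fu*k", "fuk"));
    // Cyrillic 'о' (U+043E) is not an ASCII vowel, so '*' does not cover it.
    assert!(!matches_censored("\u{043F}*", "\u{043F}\u{043E}"));
}
```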

except accents
assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

Using a Unicode inspector reveals that the о is Cyrillic but the ö is Latin. Making the filter consider the possibility of Cyrillic letters every time it sees Latin letters would make it much slower. I recommend trying to use ASCII in your wordlist if you want both self-censoring-rejection and accent-rejection to work better (e.g. use nnoxoecnobo instead of плохоеслово).
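To show why an ASCII wordlist entry helps, here is a small sketch that folds Cyrillic letters to ASCII lookalikes. The table in `to_ascii_lookalike` is hypothetical, chosen only to reproduce the nnoxoecnobo example; it is not rustrict's actual replacement list.

```rust
/// Hypothetical lookalike table: Cyrillic letter -> ASCII lookalike.
fn to_ascii_lookalike(c: char) -> char {
    match c {
        '\u{043F}' | '\u{043B}' => 'n', // Cyrillic pe, el
        '\u{043E}' => 'o',              // Cyrillic o
        '\u{0445}' => 'x',              // Cyrillic ha
        '\u{0435}' => 'e',              // Cyrillic ie
        '\u{0441}' => 'c',              // Cyrillic es
        '\u{0432}' => 'b',              // Cyrillic ve
        other => other,
    }
}

fn fold(input: &str) -> String {
    input.chars().map(to_ascii_lookalike).collect()
}

fn main() {
    let wordlist = ["nnoxoecnobo"];
    // The placeholder word, written entirely in Cyrillic.
    let cyrillic = "\u{043F}\u{043B}\u{043E}\u{0445}\u{043E}\u{0435}\u{0441}\u{043B}\u{043E}\u{0432}\u{043E}";
    // The same word after accent removal left Latin 'o' in place of 'ö'.
    let mixed = "\u{043F}\u{043B}o\u{0445}o\u{0435}\u{0441}\u{043B}o\u{0432}o";
    assert!(wordlist.contains(&fold(cyrillic).as_str()));
    assert!(wordlist.contains(&fold(mixed).as_str()));
}
```

Both spellings fold to the same ASCII string, so a single wordlist entry covers them.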


mkadirtan commented on June 3, 2024

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?


finnbear commented on June 3, 2024

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?


finnbear commented on June 3, 2024

This allows for Cyrillic alphabet conversions.

One of the barriers between rustrict and better Cyrillic support is indeed alphabet conversions. Right now, most rustrict lookalike characters are targeted at ASCII letters. In other words, a Cyrillic А can be interpreted as a Latin A in a Latin profanity but a Latin A won't be interpreted as a Cyrillic А in a Cyrillic profanity.

If every character that looks like A had to reference every other character that looks like A (and same thing for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this but none of them are particularly appealing.
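A back-of-the-envelope sketch of that growth, with made-up numbers: 52 letters and 100 lookalikes per letter are assumptions for illustration, not measurements of the real replacement list.

```rust
/// Entries needed when every lookalike maps to one canonical letter.
fn canonical_entries(letters: usize, lookalikes_per_letter: usize) -> usize {
    letters * lookalikes_per_letter
}

/// Entries needed when every lookalike references every other lookalike
/// of the same letter.
fn all_pairs_entries(letters: usize, lookalikes_per_letter: usize) -> usize {
    letters * lookalikes_per_letter * lookalikes_per_letter.saturating_sub(1)
}

fn main() {
    let (letters, lookalikes) = (52, 100);
    println!("canonical: {}", canonical_entries(letters, lookalikes)); // 5200
    println!("all pairs: {}", all_pairs_entries(letters, lookalikes)); // 514800
}
```

The canonical scheme grows linearly with the number of lookalikes, while the all-pairs scheme grows quadratically.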

Also, you can explicitly censor in a single language with this approach

Indeed 👌


mkadirtan commented on June 3, 2024

the replacement list would take too much memory.

So this is a memory problem. Maybe only convert the most common variations of the letter A, i.e. only the variations that can be typed with a keyboard?
