When adding a custom non-English word, everything works fine except self-censoring and accents.

Self-censoring & accents does not work with custom non English words

finnbear commented on June 3, 2024 1

Self-censoring & accents does not work with custom non English words

from rustrict.

Comments (8)

Priler commented on June 3, 2024 1

Yeah, I see how self-censoring is implemented.
Then I should add N variations of the same word in order to include such cases.

As for accents, I wanted to say that there should be some way to extend replacements, for example.
Then I could just add a custom table, and ö would be replaced with Cyrillic о.

And yes, I do understand that the same ö can be replaced with ASCII o.
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
This is how it's implemented in py-censure (via the lang argument).

AFAIK, the current implementation of rustrict is closely tied to ASCII/English profanity filtering.
It lacks localization options.

P.S. These are my thoughts and suggestions on how rustrict could be improved.
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).

finnbear commented on June 3, 2024 1

I am open to expanding rustrict to additional languages, to the extent that it doesn't add too much complexity or overhead*.

*adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.

> As for accents, I wanted to say that there should be some way to extend replacements

I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).

> Then I could just add a custom table, and ö would be replaced with Cyrillic о.

The umlaut would (along with all other accents) be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules).

While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о' but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.
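
To illustrate the many-to-one idea, here is a minimal sketch. It is not rustrict's actual code: the function `canonicalize` and its tiny tables are hypothetical and cover only a few characters, whereas a real filter would rely on full Unicode normalization and a far larger lookalike table.

```rust
/// Hypothetical helper: strips a few accents and folds a few lookalikes
/// into ASCII. Only a handful of characters are covered for illustration.
fn canonicalize(input: &str) -> String {
    input
        .chars()
        // Step 1: a stand-in for Unicode normalization. Drop combining
        // marks (U+0300..=U+036F) and decompose a precomposed 'ö' by hand.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .map(|c| match c {
            '\u{00F6}' => 'o', // Latin 'ö' loses its umlaut
            other => other,
        })
        // Step 2: many-to-one lookalike replacement, always towards ASCII.
        .map(|c| match c {
            '\u{043E}' => 'o', // Cyrillic 'о'
            '\u{03BF}' => 'o', // Greek 'ο'
            other => other,
        })
        .collect()
}
```

Under these assumptions, a wordlist entry spelled with ASCII 'o' matches input typed with 'ö' or with Cyrillic 'о', while an entry spelled with Cyrillic 'о' would not match, because nothing is ever mapped back to Cyrillic.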

> So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
> This is how it's implemented in py-censure (via the lang argument).
> I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).

It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute out the wordlist (or compose multiple wordlists). The main obstacle is finding false positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).

mkadirtan commented on June 3, 2024 1

> > Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

> Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

> Are you trying to remove languages you don't care about to make the filter more efficient?

My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in the word lists. This allows for Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.

finnbear commented on June 3, 2024

Thanks for the issue!

> except self-censoring
> assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false

The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains fuk, which should also cover fu*k. You could, for example, add плхеслво тест to your wordlist. The exception is ASCII vowels (a, e, i, etc.), which are handled automatically, e.g. fuck should cover f*ck.
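
The two rules can be sketched roughly as follows. This is only an illustration under stated assumptions, not how rustrict is implemented: `covers`, `covers_from`, and `is_ascii_vowel` are hypothetical names, and the sketch assumes an asterisk is either ignored or stands in for exactly one ASCII vowel.

```rust
/// ASCII vowels that a '*' is allowed to stand in for in this sketch.
fn is_ascii_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Hypothetical helper: true if `input` could be a self-censored spelling
/// of the wordlist entry `word`.
fn covers(word: &str, input: &str) -> bool {
    let w: Vec<char> = word.chars().collect();
    let i: Vec<char> = input.chars().collect();
    covers_from(&w, &i)
}

fn covers_from(w: &[char], i: &[char]) -> bool {
    match (w.first(), i.first()) {
        // Both exhausted at the same time: the entry covers the input.
        (None, None) => true,
        (next, Some('*')) => {
            // Either the asterisk is ignored entirely...
            covers_from(w, &i[1..])
                // ...or it stands in for one ASCII vowel of the entry.
                || (matches!(next, Some(&c) if is_ascii_vowel(c))
                    && covers_from(&w[1..], &i[1..]))
        }
        // Ordinary characters must match exactly.
        (Some(a), Some(b)) if a == b => covers_from(&w[1..], &i[1..]),
        _ => false,
    }
}
```

In this sketch, `covers("fuck", "f*ck")` and `covers("fuk", "fu*k")` both hold, but `covers("fuck", "fu*k")` does not, which is why the shortened entry is needed. A Cyrillic entry gets no help from the vowel rule, so only the explicitly shortened variant covers the censored input.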

> except accents
> assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

Using a Unicode inspector reveals that the о is Cyrillic but the ö is Latin. Making the filter consider the possibility of Cyrillic letters every time it sees Latin letters would make it much slower. I recommend trying to use ASCII in your wordlist if you want both self-censoring-rejection and accent-rejection to work better (e.g. use nnoxoecnobo instead of плохоеслово).
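
If it helps, converting a wordlist entry this way could be sketched as below. The helper `to_ascii_lookalikes` is hypothetical (it is not part of rustrict), and its table contains only the letters needed for the example word, with the pairings taken from the nnoxoecnobo example above.

```rust
/// Hypothetical helper: rewrites a Cyrillic wordlist entry using ASCII
/// letters that look similar. The table covers only the letters needed
/// for the example word; it is not a complete transliteration scheme.
fn to_ascii_lookalikes(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            '\u{043F}' => 'n', // п
            '\u{043B}' => 'n', // л
            '\u{043E}' => 'o', // о
            '\u{0445}' => 'x', // х
            '\u{0435}' => 'e', // е
            '\u{0441}' => 'c', // с
            '\u{0432}' => 'b', // в
            other => other,    // leave everything else untouched
        })
        .collect()
}
```

The converted entry would then be added to the wordlist in place of the Cyrillic original.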

mkadirtan commented on June 3, 2024

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

finnbear commented on June 3, 2024

> Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?

finnbear commented on June 3, 2024

> This allows for Cyrillic alphabet conversions.

One of the barriers between rustrict and better Cyrillic support is indeed alphabet conversions. Right now, most rustrict lookalike characters are targeted at ASCII letters. In other words, a Cyrillic А can be interpreted as a Latin A in a Latin profanity but a Latin A won't be interpreted as a Cyrillic А in a Cyrillic profanity.

If every character that looks like A had to reference every other character that looks like A (and same thing for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this but none of them are particularly appealing.
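
A back-of-the-envelope sketch of the difference in table size, with made-up inputs: the function names are hypothetical, and the counts passed in are assumptions chosen only to show the growth, not measurements of rustrict's tables.

```rust
/// Entries needed when every lookalike maps to one canonical letter.
fn many_to_one_entries(lookalikes_per_letter: usize, letters: usize) -> usize {
    lookalikes_per_letter * letters
}

/// Entries needed when every lookalike references every other lookalike
/// of the same letter (ordered pairs, excluding a character with itself).
fn all_pairs_entries(lookalikes_per_letter: usize, letters: usize) -> usize {
    lookalikes_per_letter * lookalikes_per_letter.saturating_sub(1) * letters
}
```

Assuming 100 lookalikes for each of 52 letters, that is 5,200 entries for the many-to-one table versus 514,800 for the all-pairs table.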

> Also, you can explicitly censor in a single language with this approach

Indeed 👌

mkadirtan commented on June 3, 2024

> the replacement list would take too much memory.

So this is a memory problem. Maybe only convert the most common variations of the letter A, i.e. only variations that can be typed on a keyboard?
