Comments (8)
Yeah, I see how self-censoring is implemented.
So I would have to add N variations of the same word to cover such cases.
As for accents, I meant that there should be some way to extend the replacements, for example.
Then I could just add a custom table, and ö would be replaced with Cyrillic о.
And yes, I do understand that the same ö can be replaced with ASCII o.
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
This is how it's implemented in py-censure (via the lang argument).
Because, AFAIK, the current implementation of rustrict is closely tied to ASCII/English profanity filtering.
It lacks localization options.
P.S. These are my thoughts and suggestions on how rustrict could be improved.
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).
from rustrict.
I am open to expanding rustrict to additional languages, to the extent that it doesn't add too much complexity or overhead*.
*Adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.
As for accents, I wanted to say that there should be some way to extend replacements
I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).
Then I could just add a custom table, and ö would be replaced with Cyrillic о.
The umlaut would (along with all other accents) be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules).
While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о' but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.
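The idea of collapsing every 'o' lookalike onto ASCII 'o' can be sketched as follows. This is only an illustration using the standard library: the function names and the tiny table are hypothetical, and the real filter relies on Unicode normalization and a far larger replacement list.

```rust
/// Hypothetical, hand-rolled table: maps a few 'o' lookalikes to ASCII 'o'
/// and drops combining accents. Not rustrict's actual replacement list.
fn to_ascii_lookalike(c: char) -> Option<char> {
    match c {
        // Combining diacritical marks (e.g. a decomposed umlaut) are removed.
        '\u{0300}'..='\u{036F}' => None,
        // Latin o with diaeresis, Latin o with acute, Cyrillic o, digit zero.
        '\u{00F6}' | '\u{00F3}' | '\u{043E}' | '0' => Some('o'),
        // Everything else passes through unchanged.
        other => Some(other),
    }
}

/// Rewrites the input so that every known lookalike becomes its ASCII form.
fn normalize(input: &str) -> String {
    input.chars().filter_map(to_ascii_lookalike).collect()
}

fn main() {
    // Three spellings that differ only in which 'o' lookalikes they use.
    let variants = ["foo", "f\u{00F6}\u{043E}", "fo\u{0308}0"];
    for v in variants {
        assert_eq!(normalize(v), "foo");
    }
    println!("all variants normalize to \"foo\"");
}
```

With a table like this, a single ASCII wordlist entry is enough to match all of the spellings, which is why the wordlist itself is kept in ASCII.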
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
This is how it's implemented in py-censure (via the lang argument).
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, because it is spoken in many countries, not only in Russia itself).
It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute the wordlist (or compose multiple wordlists). The main obstacle is finding false positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).
from rustrict.
Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?
Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)
Are you trying to remove languages you don't care about to make the filter more efficient?
My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in its word lists. This allows for Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.
from rustrict.
Thanks for the issue!
except self-censoring
assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false
The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains fuk, which should also cover fu*k. You could, for example, add плхеслво тест to your wordlist. The exception is ASCII vowels (a, e, i, etc.), which are handled automatically, e.g. fuck should cover f*ck.
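To illustrate why ASCII vowels can be handled automatically while other censored letters need an explicit wordlist entry, here is a minimal sketch. It is not rustrict's actual algorithm: the function names are hypothetical, and it assumes a simplified rule in which a `*` in the input may stand in for exactly one ASCII vowel of the wordlist entry.

```rust
/// Returns true for the lowercase ASCII vowels treated as censorable here.
fn is_ascii_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Hypothetical matcher: `input` matches `entry` if every character is equal,
/// or if the input has '*' where the entry has an ASCII vowel.
fn matches_censored(entry: &str, input: &str) -> bool {
    let mut entry_chars = entry.chars();
    let mut input_chars = input.chars();
    loop {
        match (entry_chars.next(), input_chars.next()) {
            // Both strings ended at the same time: full match.
            (None, None) => return true,
            // Same character, or a star hiding a vowel: keep going.
            (Some(e), Some(i)) if e == i || (i == '*' && is_ascii_vowel(e)) => continue,
            // Length mismatch, or a star hiding a non-vowel.
            _ => return false,
        }
    }
}

fn main() {
    // A star over the vowel is accepted without any extra wordlist entry.
    assert!(matches_censored("darn", "d*rn"));
    // A star over a consonant is not, so that variant would have to be
    // added to the wordlist by hand.
    assert!(!matches_censored("darn", "da*n"));
    println!("vowel stars match, consonant stars do not");
}
```

Under this simplified rule, a Cyrillic entry gets no automatic coverage at all, because none of its letters are ASCII vowels.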
except accents
assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false
Using a Unicode inspector reveals that the о is Cyrillic but the ö is Latin. Making the filter consider the possibility of Cyrillic letters every time it sees Latin letters would make it much slower. I recommend trying to use ASCII in your wordlist if you want both self-censoring-rejection and accent-rejection to work better (e.g. use nnoxoecnobo instead of плохоеслово).
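The ASCII rewrite of a Cyrillic wordlist entry suggested above can be sketched like this. The helper name is hypothetical and the table covers only the seven letters that appear in the example; the letter choices are read off the nnoxoecnobo spelling rather than taken from rustrict's own replacement data.

```rust
/// Hypothetical helper: rewrites a Cyrillic word using ASCII lookalikes,
/// covering only the letters needed for the example in this thread.
fn cyrillic_to_ascii(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            '\u{043F}' => 'n', // Cyrillic pe
            '\u{043B}' => 'n', // Cyrillic el
            '\u{043E}' => 'o', // Cyrillic o
            '\u{0445}' => 'x', // Cyrillic ha
            '\u{0435}' => 'e', // Cyrillic ie
            '\u{0441}' => 'c', // Cyrillic es
            '\u{0432}' => 'b', // Cyrillic ve
            other => other,    // Unmapped characters are left as they are.
        })
        .collect()
}

fn main() {
    // The example word from the thread, written with escapes so that the
    // code points are unambiguous.
    let word = "\u{043F}\u{043B}\u{043E}\u{0445}\u{043E}\u{0435}\u{0441}\u{043B}\u{043E}\u{0432}\u{043E}";
    let ascii = cyrillic_to_ascii(word);
    assert_eq!(ascii, "nnoxoecnobo");
    println!("{word} -> {ascii}");
}
```

Running a converter of this kind over a Cyrillic wordlist once, ahead of time, keeps the cost out of the filter's hot path.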
from rustrict.
This allows for Cyrillic alphabet conversions.
One of the barriers between rustrict and better Cyrillic support is indeed alphabet conversions. Right now, most rustrict lookalike characters are targeted at ASCII letters. In other words, a Cyrillic А can be interpreted as a Latin A in a Latin profanity but a Latin A won't be interpreted as a Cyrillic А in a Cyrillic profanity.
If every character that looks like A had to reference every other character that looks like A (and the same for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this but none of them are particularly appealing.
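A rough count shows why the all-to-all approach grows so quickly. The numbers below are assumptions chosen for illustration (52 letters, 100 lookalikes per letter), not measurements of rustrict's tables, and the function names are hypothetical.

```rust
/// Entries needed when every lookalike maps to one canonical ASCII letter.
fn canonical_entries(letters: usize, lookalikes_per_letter: usize) -> usize {
    letters * lookalikes_per_letter
}

/// Entries needed when every lookalike references every other lookalike
/// of the same letter.
fn pairwise_entries(letters: usize, lookalikes_per_letter: usize) -> usize {
    letters * lookalikes_per_letter * lookalikes_per_letter.saturating_sub(1)
}

fn main() {
    // Assumed sizes, for illustration only.
    let letters = 52;
    let lookalikes = 100;

    let canonical = canonical_entries(letters, lookalikes);
    let pairwise = pairwise_entries(letters, lookalikes);

    assert_eq!(canonical, 5_200);
    assert_eq!(pairwise, 514_800);
    println!("canonical: {canonical} entries, pairwise: {pairwise} entries");
}
```

The canonical table grows linearly with the number of lookalikes per letter, while the pairwise table grows quadratically.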
Also, you can explicitly censor in a single language with this approach
Indeed 👌
from rustrict.
the replacement list would take too much memory.
So this is a memory problem. Maybe only convert the most common variations of the letter A? Only variations that can be typed on a keyboard?
from rustrict.