Coder Social home page Coder Social logo

dystudio / toolgood.words Goto Github PK

View Code? Open in Web Editor NEW

This project forked from toolgood/toolgood.words

0.0 2.0 0.0 19.8 MB

一款高性能非法词(敏感词)检测组件,附带繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母等功能。

License: Apache License 2.0

C# 100.00%

toolgood.words's Introduction

ToolGood.Words

支持繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母,非法词(敏感词)检测(字符串搜索)。

文件夹说明:

ToolGood.PinYin.Build:      生成词的拼音
ToolGood.PinYin.WordsBuild: 生成多音词的拼音
ToolGood.Words.Contrast:    字符串搜索对比
ToolGood.Words.Test:        单元测试
ToolGood.Words:             本项目源代码

繁体简体互换、全角半角互换、数字转成中文大写、拼音操作

    // 转成简体
    WordsHelper.ToSimplifiedChinese("壹佰贰拾叁億肆仟伍佰陆拾柒萬捌仟玖佰零壹元壹角贰分");
    // 转成繁体
    WordsHelper.ToTraditionalChinese("壹佰贰拾叁亿肆仟伍佰陆拾柒万捌仟玖佰零壹元壹角贰分");
    // 转成全角
    WordsHelper.ToSBC("abcABC123");
    // 转成半角
    WordsHelper.ToDBC("abcABC123");
    // 数字转成中文大写
    WordsHelper.ToChineseRMB(12345678901.12);
    // 获取全拼
    WordsHelper.GetPinYin("我爱**");//WoAiZhongGuo
    // 获取首字母
    WordsHelper.GetFirstPinYin("我爱**");//WAZG
    // 获取全部拼音
    WordsHelper.GetAllPinYin('传');//Chuan,Zhuan

非法词(敏感词)检测(字符串搜索)

非法词(敏感词)检测类:StringSearchWordsSearchIllegalWordsSearchIllegalWordsQuickSearch;

  • StringSearch: 搜索FindFirst方法返回结果为string类型。
  • WordsSearch: 搜索FindFirst方法返回结果为WordsSearchResult类型, WordsSearchResult不仅仅有关键字,还有关键字的开始位置、结束位置,关键字序号等。
  • IllegalWordsSearch: 过滤非法词(敏感词)专用类,可设置跳字长度,默认繁体转简体,全角转半角,忽略大小写,转化特殊数字, 搜索FindFirst方法返回为IllegalWordsSearchResult,有关键字,开始位置、结束位置。
  • IllegalWordsQuickSearch: IllegalWordsSearch简化版本。
  • 共同方法有:SetKeywordsContainsAnyFindFirstFindAllReplace
    string s = "**|国人|zg人";
    string test = "我是**人";

    StringSearch iwords = new StringSearch();
    iwords.SetKeywords(s.Split('|'));
    
    var b = iwords.ContainsAny(test);
    Assert.AreEqual(true, b);

    var f = iwords.FindFirst(test);
    Assert.AreEqual("**", f);

    var all = iwords.FindAll(test);
    Assert.AreEqual("**", all[0]);
    Assert.AreEqual("国人", all[1]);
    Assert.AreEqual(2, all.Count);

    var str = iwords.Replace(test, '*');
    Assert.AreEqual("我是***", str);
性能对比

执行10万次性能对比,结果如下:

10W次性能对比

注:C#自带正则很慢,StringSearch.ContainsAnyRegex.IsMatch效率的1.5万倍。

Regex.Matches的运行方式跟IQueryable的类似,只返回MatchCollection,还没有计算。

TrieFilter,FastFilter来源: http://www.cnblogs.com/yeerh/archive/2011/10/20/2219035.html

在 Find All测试中,

FastFilter只能检测出7个: [0]: "主席" [1]: "赵洪祝" [2]: "**" [3]: "铁道部" [4]: "党" [5]: "胡锦涛" [6]: "倒台"

StringSearch检测出14个: [0]: "党" [1]: "党委" [2]: "西藏" [3]: "党" [4]: "党委" [5]: "主席" [6]: "赵洪祝" [7]: "**" [8]: "铁道部" [9]: "党" [10]: "胡锦涛" [11]: "锦涛" [12]: "倒台" [13]: "黑社会"

IllegalWordsSearch检出15个: [0]: {63|党} [1]: {63|党委} [2]: {81|西藏} [3]: {83|党} [4]: {83|党委} [5]: {143|主席} [6]: {185|赵洪祝} [7]: {204|**} [8]: {221|铁道部} [9]: {235|党} [10]: {244|胡锦涛} [11]: {245|锦涛} [12]: {323|倒台} [13]: {324|台} [14]: {364|黑社会}

IllegalWordsSearchStringSearch多一个,原因:关键字繁体转简体

插曲:在细查Regex.Matches神奇3ms,我发现Regex.Matches有一个小问题,

Regex.Matches只能检测出11个: [0]: "党" [1]: "西藏" [2]: "党" [3]: "主席" [4]: "赵洪祝" [5]: "**" [6]: "铁道部" [7]: "党" [8]: "胡锦涛" [9]: "倒台" [10]: "黑社会"

toolgood.words's People

Contributors

toolgood avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.