wenet-e2e / wetextprocessing

Text Normalization & Inverse Text Normalization

License: Apache License 2.0

Python 80.03% CMake 3.52% C++ 14.41% C 0.63% Java 1.41%
normalization production-ready text-processing

wetextprocessing's People

Contributors

chimo3333, day9011, jackiexiao, kfoodie, lingji-yidong, ma-dan, pengzhendong, robin1001, weimeng23, xingchensong, y00281951, zhuzizyf

wetextprocessing's Issues

another badcase

In [14]: >>> from tn.chinese.normalizer import Normalizer
...: >>> normalizer = Normalizer()
...: >>> normalizer.normalize("12.5平方电线")
Out[14]: '十二月五日平方电线'

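Error when using the tool

Hello, this was resolved last week, but the same error has appeared again this week. The wetextprocessing package is installed: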
(wenetITN) liuhangchen@G08:~/WeNetITN$ weitn --text "二点五平方电线"
Traceback (most recent call last):
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/bin/weitn", line 8, in <module>
sys.exit(main())
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/main.py", line 43, in main
normalizer = InverseNormalizer(
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 42, in __init__
self.build_fst('zh_itn', cache_dir, overwrite_cache)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/tn/processor.py", line 74, in build_fst
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 45, in build_tagger
tagger = (add_weight(Date().tagger, 1.02)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 25, in __init__
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 29, in build_tagger
digit = string_file('itn/chinese/data/number/digit.tsv') # 1 ~ 9
File "extensions/_pynini.pyx", line 1033, in _pynini.string_file
File "extensions/_pynini.pyx", line 1109, in _pynini.string_file
_pywrapfst.FstIOError: Read failed
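
The traceback ends in `string_file('itn/chinese/data/number/digit.tsv')`, which is a relative path. One thing worth checking (an assumption, not something confirmed in this thread) is whether that path exists relative to the directory the command was launched from. A minimal sketch; `check_relative_path` is a hypothetical helper, not part of WeTextProcessing:

```python
import os


def check_relative_path(relative_path, base_dir=None):
    """Return (absolute_path, exists) for a path resolved against base_dir.

    base_dir defaults to the current working directory, which is what a
    relative path passed to a file-reading call is resolved against.
    """
    base_dir = os.getcwd() if base_dir is None else base_dir
    absolute_path = os.path.abspath(os.path.join(base_dir, relative_path))
    return absolute_path, os.path.isfile(absolute_path)


if __name__ == "__main__":
    path, exists = check_relative_path("itn/chinese/data/number/digit.tsv")
    print(path, "exists" if exists else "is missing")
```

If the file is reported missing, running the command from a directory where the relative path resolves would be the first thing to try.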

bad case, 幺三二六二号 -> 1326~2号

Hello, I'd like to report a bad case. Could you take a look at why it is matched this way?

weitn --text "哎您好这里市中心幺二三四五工号幺三二六二号"
char { value: "哎" } char { value: "您" } char { value: "好" } char { value: "这" } char { value: "里" } char { value: "市" } char { value: "中" } char { value: "心" } cardinal { value: "12345" } char { value: "工" } char { value: "号" } cardinal { value: "132" } measure { value: "62号" }
哎您好这里市中心12345工号1326~2号

Where does the "~" in "1326~2" come from?

New feature request: restoring abbreviated text

Restore partially abbreviated words to their full text:

100的j格对比它的妆效 -> 100的价格对比它的妆效
皮肤易m感 -> 皮肤易敏感

badcase

image
The rule was overridden by the numeric symbol ":".

[Install] Installing directly from the GitHub source fails

When running python setup.py install, the following two lines raise an error:

version = sys.argv[-1].split('=')[1]

sys.argv = sys.argv[0:len(sys.argv) - 1]

error:
Traceback (most recent call last):
File "setup.py", line 24, in <module>
setup(
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/chenshuaiqi03/install/envs/a100_py38/lib/python3.8/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/chenshuaiqi03/install/envs/a100_py38/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 140, in setup
attrs['script_name'] = os.path.basename(sys.argv[0])
IndexError: list index out of range
Another error also occurs:
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help

error: no commands supplied

After commenting out these two lines and setting version on line 26 to 1.2, the installation succeeds.
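
The two lines quoted above assume that the last command-line argument always has the form `version=...`. A hedged sketch of a more tolerant variant, which only consumes that argument when it is actually present; the helper name `pop_version_arg` and the fallback value are illustrative assumptions, not the project's actual fix:

```python
def pop_version_arg(argv, default="0.0.0"):
    """Split a trailing 'version=X' argument off an argv-style list.

    Returns (version, remaining_argv). If the last argument is not of the
    form 'version=X', the list is returned unchanged with the default.
    """
    if len(argv) > 1 and argv[-1].startswith("version="):
        return argv[-1].split("=", 1)[1], list(argv[:-1])
    return default, list(argv)


if __name__ == "__main__":
    import sys

    # setup.py would call setup(version=version, ...) after this line.
    version, sys.argv = pop_version_arg(sys.argv)
    print(version, sys.argv)
```

With this guard, `python setup.py install` keeps `install` in `sys.argv` instead of stripping it, which is what leads to the "no commands supplied" message above.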

[ITN] English ITN crashes

Hello!

I'm trying to use the C++ implementation for English ITN with the tagger and verbalizer from the NeMo library.
The problem is that the verbalizer crashes because of TokenParser.
Here is an example.

input:
this is the first test

tagger output:
tokens { name: "this" } tokens { name: "is" } tokens { name: "the" } tokens { ordinal { integer: "1" } } tokens { name: "test" }

parser output:
tokens { name: "this" } tokens { name: "is" } tokens { name: "the" } tokens { ordinal: "{ integer: " : "1" } { } tokens { name: "test" }

The parser output looks odd: it adds stray quotation marks and brackets.
Removing the parser before the verbalizer makes everything work as expected (credit to @ezerhouni for pointing this out).

Could you help solve this more cleanly on the parser side?

Thanks in advance!

BUG: 啊 -> Invalid start state

code

text="啊"
# text="啊一二" # works fine
import os
from itn.chinese.inverse_normalizer import InverseNormalizer
cache_dir = os.path.join(os.getcwd(),'utils/itn')
invnormalizer = InverseNormalizer(
    cache_dir= cache_dir,
    enable_standalone_number=True,
    enable_0_to_9=False)

replacedText = invnormalizer.normalize(text)
print(replacedText)

exception log:

ERROR: StringFstToOutputLabels: Invalid start state
---------------------------------------------------------------------------
FstOpError                                Traceback (most recent call last)
Cell In[5], line 10
      4 cache_dir = os.path.join(os.getcwd(),'utils/itn')
      5 invnormalizer = InverseNormalizer(
      6     cache_dir= cache_dir,
      7     enable_standalone_number=True,
      8     enable_0_to_9=False)
---> 10 replacedText = invnormalizer.normalize(text)
     11 print(replacedText)

File ~/miniconda3/envs/asr/lib/python3.10/site-packages/tn/processor.py:94, in Processor.normalize(self, input)
     93 def normalize(self, input):
---> 94     return self.verbalize(self.tag(input))

File ~/miniconda3/envs/asr/lib/python3.10/site-packages/tn/processor.py:82, in Processor.tag(self, input)
     80 input = escape(input)
     81 lattice = input @ self.tagger
---> 82 return shortestpath(lattice, nshortest=1, unique=True).string()

File extensions/_pynini.pyx:462, in _pynini.Fst.string()

File extensions/_pynini.pyx:507, in _pynini.Fst.string()

FstOpError: Operation failed
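
Until the underlying failure is fixed, a caller could guard against inputs the tagger cannot handle by falling back to the original text. A minimal sketch, assuming that returning the input unchanged is acceptable for the application; `safe_normalize` is a hypothetical wrapper, and it catches `Exception` broadly only so that the example runs without pynini installed (the log above shows the real error is `FstOpError`):

```python
def safe_normalize(normalize, text):
    """Call normalize(text); return text unchanged if the call raises."""
    try:
        return normalize(text)
    except Exception:
        # In real use this could be narrowed to the FstOpError seen in the log.
        return text


if __name__ == "__main__":
    def failing_normalize(text):
        # Stand-in for a normalizer call that fails on this input.
        raise RuntimeError("Operation failed")

    print(safe_normalize(failing_normalize, "啊"))
    print(safe_normalize(str.strip, "  ok  "))
```

In the reproduction above this would be called as `safe_normalize(invnormalizer.normalize, text)`.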

Error when running main

Hi all, without installing via pip, running python itn/main.py --text "上午8点半" produces the error below. What is the cause?

Traceback (most recent call last):
File "/data/WeTextProcessing-master/itn/main.py", line 59, in <module>
main()
File "/data/WeTextProcessing-master/itn/main.py", line 43, in main
normalizer = InverseNormalizer(
File "/data/WeTextProcessing-master/itn/chinese/inverse_normalizer.py", line 41, in __init__
self.build_fst('zh_itn', cache_dir, overwrite_cache)
File "/data/WeTextProcessing-master/itn/processor.py", line 74, in build_fst
self.build_tagger()
File "/data/WeTextProcessing-master/itn/chinese/inverse_normalizer.py", line 44, in build_tagger
tagger = (add_weight(Date().tagger, 1.02)
File "/data/WeTextProcessing-master/itn/chinese/rules/date.py", line 25, in __init__
self.build_tagger()
File "/data/WeTextProcessing-master/itn/chinese/rules/date.py", line 29, in build_tagger
digit = string_file('itn/chinese/data/number/digit.tsv') # 1 ~ 9
File "extensions/_pynini.pyx", line 1033, in _pynini.string_file
File "extensions/_pynini.pyx", line 1109, in _pynini.string_file
_pywrapfst.FstIOError: Read failed

Looking forward to your advice!
Best wishes!

TN fails on the string "145[=xx]"

(base) ➜ WeTextProcessing git:(master) ✗ python normalize.py --text "145[=xx]"

ERROR: StringFstToOutputLabels: Invalid start state
Traceback (most recent call last):
  File "normalize.py", line 43, in <module>
    main()
  File "normalize.py", line 33, in main
    print(normalizer.tag(args.text))
  File "/data/jackgeek/WeTextProcessing/tn/processor.py", line 67, in tag
    return shortestpath(lattice, nshortest=1, unique=True).string()
  File "extensions/_pynini.pyx", line 462, in _pynini.Fst.string
  File "extensions/_pynini.pyx", line 507, in _pynini.Fst.string
_pywrapfst.FstOpError: Operation failed

Some badcase

ITN:
我要一百 -> 我要1百
我要一百两 -> 我要1百2

TN:
我要1.3和2.7这两个数值 -> 我要幺点三和二点七这两个数值 (here 一 is normally read yi1)

far files loading error using the runtime

Hi,

I'm trying to use the runtime but get an error with the FAR file.

Running the command from README.md:

 [root@localhost runtime]# ./build/bin/processor_main --far zh_tn_normalizer.far --text "2.5平方电线"
ERROR: unknown command line flag 'far'

Then, after changing the flag to --tagger:

 [root@localhost runtime]# ./build/bin/processor_main --tagger zh_tn_normalizer.far --text "2.5平方电线"
F1207 19:19:45.189105 59646 processor_main.cc:32] Please provide the tagger and verbalizer fst files.
*** Check failure stack trace: ***
    @           0x48138d  google::LogMessage::Fail()
    @           0x4833c4  google::LogMessage::SendToLog()
    @           0x480e8b  google::LogMessage::Flush()
    @           0x483cc9  google::LogMessageFatal::~LogMessageFatal()
    @           0x40ac84  main
    @     0x7fdfadf0f555  __libc_start_main
    @           0x40d904  (unknown)
    @              (nil)  (unknown)
Aborted

How can I provide both the tagger and verbalizer FST files?

Regards

from tn.chinese.normalizer import Normalizer fails

Windows, Python 3.6:
pip install WeTextProcessing succeeds.
from tn.chinese.normalizer import Normalizer fails with:
from pynini import cdrewrite, cross, difference, escape, shortestpath, union ImportError: cannot import name 'cdrewrite'
Is this because pynini does not support Windows?

TN runs very slowly

Parsing a single sentence in Python takes about 200 ms on my machine, and the C++ runtime is slow as well.
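
When measuring this, it may help to time object construction (the tracebacks elsewhere on this page show the constructor calling `build_fst`) separately from the per-sentence call. A small timing sketch using only the standard library; `average_latency_ms` is a hypothetical helper, and the callable passed in stands for whatever is being measured:

```python
import time


def average_latency_ms(func, arg, repeats=10):
    """Average wall-clock time of func(arg) in milliseconds over repeats calls."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return (time.perf_counter() - start) * 1000.0 / repeats


if __name__ == "__main__":
    # str.upper is only a stand-in for normalizer.normalize.
    print(f"{average_latency_ms(str.upper, 'abc'):.4f} ms per call")
```

Creating the normalizer once outside the timed loop and passing `normalizer.normalize` as `func` would show whether the 200 ms is spent per call or in setup.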

Price issue

价值704元的专属回购礼 -> 价值七零四元的专属回购礼
It should be:
价值七百零四元的专属回购礼

[TN/measure] xx余件

normalizer.normalize("仅2015年,**巡视组就发现反映领导干部问题线索3000余件、四风问题400余件,督促查处450余名中管干部违纪违法问题.")

Result: 仅二零一五年,**巡视组就发现反映领导干部问题线索三零零零余件、四风问题四零零余件,督促查处四五零余名中管干部违纪违法问题.

Expected: 三千余件, 四百余件, 四百五十余名

IP

十点一七二点三点二:10点172.3.2

add new language

I hope to use NeMo resources to add new languages to this project. Can you give me some guidelines?

Thanks a lot :)

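Token cache is not cleared between consecutive ITN calls

I'm not sure whether I'm calling it incorrectly, but when I run ITN repeatedly on partial_result and then on final_result, the results keep accumulating.
For example: 今天天气怎么样 -> 今天今天天气今天天气怎么样
Should tokens be cleared after reorder?
std::string TokenParser::reorder(const std::string& input)
After line 153, add tokens.clear()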
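
The reported behavior is what one would expect from a parser that keeps its token list as member state and never resets it. The following is a Python analogue of that pattern, not the project's C++ `TokenParser`; both class names are made up for illustration:

```python
class AccumulatingParser:
    """Keeps tokens from earlier calls, so outputs pile up."""

    def __init__(self):
        self.tokens = []

    def reorder(self, text):
        self.tokens.extend(text.split())
        return " ".join(self.tokens)


class ResettingParser(AccumulatingParser):
    """Clears the token list after each call, as the issue proposes."""

    def reorder(self, text):
        try:
            return super().reorder(text)
        finally:
            self.tokens.clear()


if __name__ == "__main__":
    leaky, fixed = AccumulatingParser(), ResettingParser()
    for partial in ["today", "today weather"]:
        print(repr(leaky.reorder(partial)), repr(fixed.reorder(partial)))
```

The first parser repeats earlier partial results in later outputs, while the second returns only the tokens of the current input.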

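Label-to-ID mapping in FSTs

I'm learning how to write rules and have two questions. Thanks.

  1. I printed the intermediate result, but the correspondence between the input and output labels is unclear to me. My guess is that 48 corresponds to “零”. How can I find the mapping between characters and IDs?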

from pynini import string_file
zero = string_file('tn/chinese/data/number/zero.tsv')
print(zero)
0 1 48 0
1 2 0 0
2 3 0 233
3 4 0 155
4 5 0 182
5

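  2. I only want to change “一” to “幺” in mobile phone numbers: if a number has 11 digits, replace “一” with “幺”, and leave all other cases unchanged. Is there a good way to implement this? I saw that an earlier version had “幺”, but it seemed to apply to every 一/幺: 5462c72

Thanks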
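
On the first question: assuming the FST was compiled with pynini's default byte token type, arc labels are UTF-8 byte values rather than per-character IDs. Under that assumption, 48 is the ASCII digit "0" on the input side, and 233, 155, 182 on the output side are the three UTF-8 bytes of "零"; label 0 is epsilon. This can be checked in plain Python without pynini (both helper names are illustrative):

```python
def utf8_labels(text):
    """Return the UTF-8 byte values of text as a list of ints."""
    return list(text.encode("utf-8"))


def labels_to_text(labels):
    """Decode byte labels back into a string, skipping 0 (epsilon)."""
    return bytes(label for label in labels if label != 0).decode("utf-8")


if __name__ == "__main__":
    print(utf8_labels("0"))                       # [48]
    print(utf8_labels("零"))                      # [233, 155, 182]
    print(labels_to_text([0, 0, 233, 155, 182]))  # 零
```

Read this way, the printed FST maps the single input byte 48 ("0") to the three output bytes of "零", spread over consecutive arcs.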

Is Release build not supported on Windows?

Running the following commands:

cmake -DCMAKE_BUILD_TYPE=Release .. -G "Visual Studio 17 2022" -DBUILD_SHARED_LIBS=0 -DCMAKE_CXX_FLAGS="/utf-8"
cmake --build . --config Release

produces this error:

error MSB6006: "CL.exe" exited with code -1073740791. [F:\Works\WeTextProcessing\runtime\build\processor\processor.vcxproj]

Building in Debug mode instead:

cmake --build .

compiles successfully.

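How do I change the position of the time-of-day word?

Hi, I'd like to ask how to change where the time-of-day word appears.
For example: 早上七点二十五叫我
After ITN it becomes: 7:25 a.m 叫我
I changed all the English in the time files under data to Chinese.
The result then became: 7:25早上叫我
Now I want to change the order so that it becomes: 早上7:25叫我

Following the code to time.py, I changed
verbalizer = (hour + addcolon + minute + (addcolon + second).ques + noon.ques)
to
verbalizer = (noon.ques+hour + addcolon + minute + (addcolon + second).ques )
This raises an error. Is it because the tagger was not changed as well?
How should I go about changing this order? Any pointers would be appreciated.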

TN for mixed Chinese and English text

from tn.chinese.normalizer import Normalizer
normalizer = Normalizer()
This removes the spaces between English words.

Error when running tests

Hello, I want to run normalizer_test.py under WeTextProcessing-master/itn/chinese/test, but it fails with the following error:
Traceback (most recent call last):
File "normalizer_test.py", line 23, in <module>
class TestNormalizer:
File "normalizer_test.py", line 25, in TestNormalizer
normalizer = InverseNormalizer(
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 41, in __init__
self.build_fst('zh_itn', cache_dir, overwrite_cache)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/tn/processor.py", line 74, in build_fst
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 44, in build_tagger
tagger = (add_weight(Date().tagger, 1.02)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 25, in __init__
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 29, in build_tagger
digit = string_file('itn/chinese/data/number/digit.tsv') # 1 ~ 9
File "extensions/_pynini.pyx", line 1033, in _pynini.string_file
File "extensions/_pynini.pyx", line 1109, in _pynini.string_file
_pywrapfst.FstIOError: Read failed

How can I fix this? Thanks!

Bad case with mixed Chinese and English text

string.h:123] StringFstToOutputLabels: Invalid start state

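Incorrect normalization of PC/pc?

English words that contain the letters pc can trigger incorrect normalization: for example, pca is normalized to P Ca and PCA to P CA, wherever pc appears in the word (beginning, middle, or end).
Suspected cause: is pc turned into p c with a plain replace? Shouldn't it first check that pc is a standalone word (delimited by spaces or punctuation)?

This behavior does not seem reasonable. Please fix it when you have time.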
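
The standalone-word check suggested above can be illustrated with a regular expression. This is a plain-Python sketch of the idea, not how the project's FST rules are written; `spell_out_pc` is a hypothetical name, and ASCII lookarounds are used instead of `\b` so that adjacent Chinese characters still count as word boundaries:

```python
import re

# Match "pc" (any case) only when it is not glued to other ASCII letters.
_PC = re.compile(r"(?<![A-Za-z])(pc)(?![A-Za-z])", re.IGNORECASE)


def spell_out_pc(text):
    """Insert a space between the letters of a standalone 'pc'."""
    return _PC.sub(lambda m: " ".join(m.group(1)), text)


if __name__ == "__main__":
    for sample in ["pca", "PCA", "一台pc机", "a PC here"]:
        print(sample, "->", spell_out_pc(sample))
```

Here pca and PCA are left untouched, while a pc that stands alone between Chinese characters or spaces is still split.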

Ethical issue

For whatever reason, it seems you chose to fork the work from Zhenxiang (undergraduate intern at SpeechColab) rather than improving the code directly in NeMo; this is OK. But erasing SpeechColab from the credits section and avoiding any mention of the original work in the NeMo project is not ethical conduct for an open-source project.

[Known Issue] The "delete" function in the Verbalizer introduces additional null ('\0') characters.

When using the UTF-8 "string2chars" function, these null ('\0') characters come out as separate elements. When printing with standard "std::cout", however, they are filtered out automatically.

If you do want to use "string2chars", a simple solution is to remove the null characters in post-processing:

// TODO(zhendong.peng): Figured out this!
std::string output = Verbalize(input);
output.erase(std::remove(output.begin(), output.end(), '\0'), output.end());
std::vector<std::string> chars;
string2chars(output, &chars);
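
For callers working in Python, the same post-processing idea can be sketched as follows. This is an analogue of the C++ snippet above, not code from the project; `split_chars` is a hypothetical stand-in for `string2chars`:

```python
def split_chars(text):
    """Drop NUL characters, then split the string into single characters."""
    return list(text.replace("\0", ""))


if __name__ == "__main__":
    print(split_chars("二\0点\0五"))
```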

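ITN for expressions containing numerals

Expressions that contain numerals, such as “三心二意”, should not go through ITN. I took a quick look at your code, but it requires quite a lot of background knowledge and I'm not sure where to start. Would it be possible to apply ITN only to numbers that are immediately followed by a unit?

Bug report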

[root@localhost WeTextProcessing]# python -m itn --text "二十一到二十五摄氏度"
math { value: "21~2" } measure { value: "15°C" }
21~215°C

The output is 21~215°C, which is not what is expected.

Error when running the example

(wenetITN) liuhangchen@G08:~/WeNetITN/WeTextProcessing-master/itn/chinese$ weitn --text "二点五平方电线"
Traceback (most recent call last):
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/bin/weitn", line 8, in <module>
sys.exit(main())
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/main.py", line 43, in main
normalizer = InverseNormalizer(
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 41, in __init__
self.build_fst('zh_itn', cache_dir, overwrite_cache)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/tn/processor.py", line 74, in build_fst
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/inverse_normalizer.py", line 44, in build_tagger
tagger = (add_weight(Date().tagger, 1.02)
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 25, in __init__
self.build_tagger()
File "/storage1/liuhangchen/anaconda3/envs/wenetITN/lib/python3.8/site-packages/itn/chinese/rules/date.py", line 29, in build_tagger
digit = string_file('itn/chinese/data/number/digit.tsv') # 1 ~ 9
File "extensions/_pynini.pyx", line 1033, in _pynini.string_file
File "extensions/_pynini.pyx", line 1109, in _pynini.string_file
_pywrapfst.FstIOError: Read failed
