
g2pm's People

Contributors

kyubyong, seanie12


g2pm's Issues

Two suggestions

  1. The SOS (BOS) and EOS tokens have no effect on the results and can be removed.
  2. There are some label errors in the data ('儿' r5 -> er5, '樘' cheng3 -> cheng1, '骑' ji4 -> qi2); a sketch of applying such corrections follows this list. After excluding monosyllabic words, the actual number of effective polyphonic sentences is 94,857 lines.
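
A minimal sketch of how the corrections above could be applied, assuming the CPP-style layout in which each line of the .sent file marks the target character with '▁' delimiters and the matching line of the .lb file carries that character's pinyin label (file names here are hypothetical):

    # A sketch, not official tooling: apply the (character, wrong label) -> fixed
    # label corrections listed above to a CPP-style label file.
    corrections = {('儿', 'r5'): 'er5', ('樘', 'cheng3'): 'cheng1', ('骑', 'ji4'): 'qi2'}

    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        fixed = []
        for sent, label in zip(f_sent, f_lb):
            sent, label = sent.strip(), label.strip()
            target = sent.split('▁')[1]              # the marked polyphonic character
            fixed.append(corrections.get((target, label), label))

    with open('train_fixed.lb', 'w', encoding='utf-8') as out:
        out.write('\n'.join(fixed) + '\n')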

Some polyphonic readings are missing

Hello, I used g2pM to convert some Chinese sentences and found that some polyphonic readings are missing from the cedict. For example, "一" has only "yi1" in the cedict, but it can actually be pronounced "yi1", "yi2", or "yi4". Does this mean the dict, dataset, and model should be made more general and updated to fix this?
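A quick way to observe this, assuming the g2pM usage shown elsewhere in these issues (from g2pM import G2pM), is to run the model on contexts where "一" commonly takes different tones:

    # Not an official test: print g2pM's output for '一' in a few contexts.
    # If cedict lists only 'yi1' for '一', readings such as 'yi2' or 'yi4'
    # can never appear in the output.
    from g2pM import G2pM

    model = G2pM()
    for sentence in ['一个人', '一天', '第一名']:
        print(sentence, model(sentence))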

What is the special pinyin "xx5" used for?

Hi all,
Thanks for the good work. I found there is a special pinyin "xx5" in class2idx, but no corpus entry is labeled with this pinyin. What is this pinyin class used for? Is there anything special about it?
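One way to check this, sketched below with a hypothetical label-file path, is to count how often each pinyin label actually occurs in the corpus:

    # Count label frequencies in a CPP-style .lb file (one pinyin label per line);
    # if the observation above is right, 'xx5' should not appear at all.
    from collections import Counter

    with open('train.lb', encoding='utf-8') as f:
        counts = Counter(line.strip() for line in f)

    print(counts.get('xx5', 0))      # expected: 0
    print(counts.most_common(10))    # most frequent labels, for context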

Suggestion to change some pinyin spellings

Hi,
Here is a suggestion: some pinyin spellings in the CPP should be changed, like this:
女: [nu:3] -> [nv3]
略: [lu:e4] -> [lve4]
The latter is the spelling commonly used in China now. Every "u:" in the pinyin can be changed to "v".
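A minimal sketch of the proposed normalization (the function name is ours, not part of g2pM):

    # Rewrite the 'u:' spelling used in CPP to the 'v' spelling common in China,
    # e.g. 'nu:3' -> 'nv3', 'lu:e4' -> 'lve4'.
    def normalize_umlaut(pinyin: str) -> str:
        return pinyin.replace('u:', 'v')

    assert normalize_umlaut('nu:3') == 'nv3'
    assert normalize_umlaut('lu:e4') == 'lve4'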

Can you provide the complete code for training?

Hi,
Thanks for the good work.
I used Chinese BERT to do this task with your dataset, but I couldn't get results as good as yours. I want to study your code and learn how to train a model on this dataset, but the code you currently offer only does prediction. Could you provide the complete training code for your model, or for the BERT baseline in your paper?
Thank you for your help.

Incorrect output for the example sentence in the paper

    model = G2pM()
    model('今天来的目的是什么?')
    output: ['jin1', 'tian1', 'lai2', 'de5', 'mu4', 'de5', 'shi4', 'shen2', 'me5', '?']

The second '的' (in '目的') should be di4, not de5. Is this an installation problem?
The g2pm version is 0.1.2.4.

Polyphone classification

Hi, I see that for any polyphonic character the network's output is an id, and the pinyin is then looked up in idx2class. I want to know how you ensure that the output id is one of that character's valid classes. Thank you.
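One common way to enforce this (a sketch of the general technique, not necessarily the exact mechanism g2pM uses) is to mask the logits so that only the current character's candidate readings can win the argmax:

    import numpy as np

    # char2candidates maps a polyphonic character to its possible readings, and
    # class2idx / idx2class map between labels and output indices; all three are
    # illustrative stand-ins here. logits is a 1-D float array of class scores.
    def predict_pinyin(logits, char, char2candidates, class2idx, idx2class):
        candidate_ids = [class2idx[p] for p in char2candidates[char]]
        masked = np.full_like(logits, -np.inf)
        masked[candidate_ids] = logits[candidate_ids]   # keep only valid readings
        return idx2class[int(np.argmax(masked))]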

Why is the count of polyphonic characters in cedict larger than that in the corpus?

Hi,
I found that the number of polyphonic characters in the corpus is 623, while the number in cedict is over 700. What is the reason?
I mean, at prediction time a polyphonic character in a sentence may be outside the set of 623 but inside the set of 700+. How will the model predict its pinyin then?
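A rough way to quantify the gap (a sketch; both the cedict.json layout of {character: [readings]} and the CPP-style .sent/.lb pair are assumptions, not files from the repo):

    # Collect the polyphonic characters actually covered by the corpus and compare
    # them with the polyphonic characters in a dictionary of candidate readings.
    import json
    from collections import defaultdict

    with open('cedict.json', encoding='utf-8') as f:     # hypothetical {char: [readings]}
        cedict = json.load(f)

    corpus_readings = defaultdict(set)
    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        for sent, label in zip(f_sent, f_lb):
            target = sent.strip().split('▁')[1]          # the marked character
            corpus_readings[target].add(label.strip())

    corpus_polys = set(corpus_readings)                  # characters seen as targets
    cedict_polys = {c for c, readings in cedict.items() if len(readings) > 1}
    print(len(corpus_polys), len(cedict_polys), len(cedict_polys - corpus_polys))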

Pronunciation of "A"

On Wikipedia and in g2pc, "诶" is transcribed as ēi / ei2,
but g2pM outputs pan1. Is this an error?


Training Data Explanation

If you open the .lb file, there is only one pinyin per line, while the corresponding line in the .sent file has a whole string of characters. Shouldn't the .lb file also have a string of pronunciations?
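If the layout is one marked target character per .sent line, each .lb line labels just that character. The short inspection below (file names assumed) makes the pairing easy to eyeball:

    # Print a few aligned (sentence, label) pairs from a CPP-style split so the
    # one-label-per-marked-character pairing is visible.
    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        for _, (sent, label) in zip(range(5), zip(f_sent, f_lb)):
            print(sent.strip(), '->', label.strip())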

Cannot reproduce the result

I want to compare the performance of several g2p systems, so I downloaded the CPP dataset and tried to reproduce the results shown in this repo, but I got much worse accuracy.

For g2pM v0.1.2.5, I got 92.9% on the train set, 92.1% on the dev set, and 91.6% on the test set. Even ignoring tone information, the accuracies are 96.6%, 96.1%, and 96.0% for the train, dev, and test sets.

For pypinyin v0.36.0, I got 79.2%, 78.7%, and 79.1% with tone, and 89.4%, 89.1%, and 89.3% without tone.

To be more specific:

  1. The full sentence was fed to each system to get the pinyin result.
  2. The prediction was then extracted as re.findall(r'▁ ([a-z0-9:]+) ▁', pinyin)[0].
  3. Finally, the accuracy was computed as np.array([i == j for i, j in zip(pred, gt)]).mean(); a sketch of this procedure follows this list.
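A sketch of that evaluation loop under the same assumptions (CPP-style test.sent/test.lb files with '▁'-marked targets, and the g2pM API shown elsewhere in these issues); indexing the marked position directly is equivalent to the regex extraction in step 2:

    # A sketch of the evaluation described above (file names and the '▁'-marked
    # layout are assumptions; swapping in pypinyin would follow the same pattern).
    import numpy as np
    from g2pM import G2pM

    model = G2pM()
    pred, gt = [], []
    with open('test.sent', encoding='utf-8') as f_sent, \
         open('test.lb', encoding='utf-8') as f_lb:
        for sent, label in zip(f_sent, f_lb):
            sent, label = sent.strip(), label.strip()
            idx = len(sent.split('▁')[0])            # position of the marked character
            pinyins = model(sent.replace('▁', ''))   # feed the plain full sentence
            pred.append(pinyins[idx])
            gt.append(label)

    acc = np.array([p == g for p, g in zip(pred, gt)]).mean()
    print(f'accuracy with tone: {acc:.3f}')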

I'd like to know how you computed the accuracy values.

Attached is my prediction for the test set.

If there is any mistake in my computation, please point it out. Thanks.

How to use the BERT model?

Hello, I have trained the BERT model according to your code. How do I use the trained BERT model for pinyin annotation? :)
