
g2pm's People

Contributors

kyubyong, seanie12


g2pm's Issues

Two suggestions

  1. The SOS (BOS) and EOS tokens have no effect on the results and can be removed.
  2. There are some label errors in the data ('儿' r5 -> er5, '樘' cheng3 -> cheng1, '骑' ji4 -> qi2); a sketch of applying such corrections follows this list. After excluding monosyllabic words, the actual number of effective polyphonic sentences is 94,857 lines.
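
A minimal sketch of how the corrections above could be applied, assuming the CPP-style layout in which each line of the .sent file marks the target character with '▁' delimiters and the matching line of the .lb file carries that character's pinyin label (file names here are hypothetical):

    # A sketch, not official tooling: apply the (character, wrong label) -> fixed
    # label corrections listed above to a CPP-style label file.
    corrections = {('儿', 'r5'): 'er5', ('樘', 'cheng3'): 'cheng1', ('骑', 'ji4'): 'qi2'}

    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        fixed = []
        for sent, label in zip(f_sent, f_lb):
            sent, label = sent.strip(), label.strip()
            target = sent.split('▁')[1]              # the marked polyphonic character
            fixed.append(corrections.get((target, label), label))

    with open('train_fixed.lb', 'w', encoding='utf-8') as out:
        out.write('\n'.join(fixed) + '\n')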

Some polyphonic readings are missing

Hello, I used g2pM to convert some Chinese sentences and found that some polyphonic readings are missing from the cedict. For example, "一" has only "yi1" in the cedict, but it can actually be pronounced "yi1", "yi2", or "yi4". Does this mean the dict, dataset, and model should be made more general and updated to fix this?
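A quick way to observe this, assuming the g2pM usage shown elsewhere in these issues (from g2pM import G2pM), is to run the model on contexts where "一" commonly takes different tones:

    # Not an official test: print g2pM's output for '一' in a few contexts.
    # If cedict lists only 'yi1' for '一', readings such as 'yi2' or 'yi4'
    # can never appear in the output.
    from g2pM import G2pM

    model = G2pM()
    for sentence in ['一个人', '一天', '第一名']:
        print(sentence, model(sentence))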

What is the special pinyin "xx5" used for?

Hi all,
Thanks for the good work. I found there is a special pinyin "xx5" in class2idx, but no corpus entry is labeled with this pinyin. What is this pinyin class used for? Is there anything special about it?
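One way to check this, sketched below with a hypothetical label-file path, is to count how often each pinyin label actually occurs in the corpus:

    # Count label frequencies in a CPP-style .lb file (one pinyin label per line);
    # if the observation above is right, 'xx5' should not appear at all.
    from collections import Counter

    with open('train.lb', encoding='utf-8') as f:
        counts = Counter(line.strip() for line in f)

    print(counts.get('xx5', 0))      # expected: 0
    print(counts.most_common(10))    # most frequent labels, for context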

Suggestion to change some pinyin spellings

Hi,
Here is a suggestion: some pinyin spellings in the CPP should be changed, like this:
女: [nu:3] -> [nv3]
略: [lu:e4] -> [lve4]
The latter is the spelling commonly used in China now. Every "u:" in the pinyin can be changed to "v".
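A minimal sketch of the proposed normalization (the function name is ours, not part of g2pM):

    # Rewrite the 'u:' spelling used in CPP to the 'v' spelling common in China,
    # e.g. 'nu:3' -> 'nv3', 'lu:e4' -> 'lve4'.
    def normalize_umlaut(pinyin: str) -> str:
        return pinyin.replace('u:', 'v')

    assert normalize_umlaut('nu:3') == 'nv3'
    assert normalize_umlaut('lu:e4') == 'lve4'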

Can you provide the complete code for training?

Hi,
Thanks for the good work.
I used Chinese BERT to do this task with your dataset, but I couldn't get results as good as yours. I want to study your code and learn how to train a model on this dataset, but the code you currently offer only does prediction. Could you provide the complete training code for your model, or for the BERT baseline in your paper?
Thank you for your help.

Incorrect output for the example sentence in the paper

    model = G2pM()
    model('今天来的目的是什么?')
    output: ['jin1', 'tian1', 'lai2', 'de5', 'mu4', 'de5', 'shi4', 'shen2', 'me5', '?']

The second '的' (in '目的') should be di4, not de5. Is this an installation problem?
The g2pm version is 0.1.2.4.

Polyphone classification

Hi, I see that for any polyphonic character the network's output is an id, and the pinyin is then looked up in idx2class. I want to know how you ensure that the output id is one of that character's valid classes. Thank you.
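One common way to enforce this (a sketch of the general technique, not necessarily the exact mechanism g2pM uses) is to mask the logits so that only the current character's candidate readings can win the argmax:

    import numpy as np

    # char2candidates maps a polyphonic character to its possible readings, and
    # class2idx / idx2class map between labels and output indices; all three are
    # illustrative stand-ins here. logits is a 1-D float array of class scores.
    def predict_pinyin(logits, char, char2candidates, class2idx, idx2class):
        candidate_ids = [class2idx[p] for p in char2candidates[char]]
        masked = np.full_like(logits, -np.inf)
        masked[candidate_ids] = logits[candidate_ids]   # keep only valid readings
        return idx2class[int(np.argmax(masked))]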

Why is the count of polyphonic characters in cedict larger than that in the corpus?

Hi,
I found that the number of polyphonic characters in the corpus is 623, while the number in cedict is over 700. What is the reason?
I mean, at prediction time a polyphonic character in a sentence may be outside the set of 623 but inside the set of 700+. How will the model predict its pinyin then?
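A rough way to quantify the gap (a sketch; both the cedict.json layout of {character: [readings]} and the CPP-style .sent/.lb pair are assumptions, not files from the repo):

    # Collect the polyphonic characters actually covered by the corpus and compare
    # them with the polyphonic characters in a dictionary of candidate readings.
    import json
    from collections import defaultdict

    with open('cedict.json', encoding='utf-8') as f:     # hypothetical {char: [readings]}
        cedict = json.load(f)

    corpus_readings = defaultdict(set)
    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        for sent, label in zip(f_sent, f_lb):
            target = sent.strip().split('▁')[1]          # the marked character
            corpus_readings[target].add(label.strip())

    corpus_polys = set(corpus_readings)                  # characters seen as targets
    cedict_polys = {c for c, readings in cedict.items() if len(readings) > 1}
    print(len(corpus_polys), len(cedict_polys), len(cedict_polys - corpus_polys))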

Pronunciation of "A"

On Wikipedia and in g2pc, "诶" is transcribed as ēi / ei2,
but g2pM outputs pan1. Is this an error?


Training Data Explanation

If you open the .lb file, there is only one pinyin per line, while the corresponding line in the .sent file has a whole string of characters. Shouldn't the .lb file also have a string of pronunciations?
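If the layout is one marked target character per .sent line, each .lb line labels just that character. The short inspection below (file names assumed) makes the pairing easy to eyeball:

    # Print a few aligned (sentence, label) pairs from a CPP-style split so the
    # one-label-per-marked-character pairing is visible.
    with open('train.sent', encoding='utf-8') as f_sent, \
         open('train.lb', encoding='utf-8') as f_lb:
        for _, (sent, label) in zip(range(5), zip(f_sent, f_lb)):
            print(sent.strip(), '->', label.strip())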

Cannot reproduce the result

I want to compare the performance of several g2p systems, so I downloaded the CPP dataset and tried to reproduce the results shown in this repo, but I got much worse accuracy.

For g2pM v0.1.2.5, I got 92.9% on the train set, 92.1% on the dev set, and 91.6% on the test set. Even ignoring tone information, the accuracies are 96.6%, 96.1%, and 96.0% for the train, dev, and test sets.

For pypinyin v0.36.0, I got 79.2%, 78.7%, and 79.1% with tone, and 89.4%, 89.1%, and 89.3% without tone.

To be more specific:

  1. The full sentence was fed to each system to get the pinyin result.
  2. The prediction was then extracted as re.findall(r'▁ ([a-z0-9:]+) ▁', pinyin)[0].
  3. Finally, the accuracy was computed as np.array([i == j for i, j in zip(pred, gt)]).mean(); a sketch of this procedure follows this list.
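A sketch of that evaluation loop under the same assumptions (CPP-style test.sent/test.lb files with '▁'-marked targets, and the g2pM API shown elsewhere in these issues); indexing the marked position directly is equivalent to the regex extraction in step 2:

    # A sketch of the evaluation described above (file names and the '▁'-marked
    # layout are assumptions; swapping in pypinyin would follow the same pattern).
    import numpy as np
    from g2pM import G2pM

    model = G2pM()
    pred, gt = [], []
    with open('test.sent', encoding='utf-8') as f_sent, \
         open('test.lb', encoding='utf-8') as f_lb:
        for sent, label in zip(f_sent, f_lb):
            sent, label = sent.strip(), label.strip()
            idx = len(sent.split('▁')[0])            # position of the marked character
            pinyins = model(sent.replace('▁', ''))   # feed the plain full sentence
            pred.append(pinyins[idx])
            gt.append(label)

    acc = np.array([p == g for p, g in zip(pred, gt)]).mean()
    print(f'accuracy with tone: {acc:.3f}')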

I'd like to know how you computed the accuracy values.

Attached is my prediction for the test set.

If there is any mistake in my computation, please point it out. Thanks.

How to use the BERT model?

Hello, I have trained the BERT model according to your code. How do I use the trained BERT model for pinyin annotation? :)
