destwang / ctc2021

License: Apache License 2.0

Python 99.22% Shell 0.78%

ctc2021's People

myboyliu hfxunlp adzhua 90217 curiousal spurscoder wangleiai ganzf886 pang4254 unipus-ai meverystrong tuzeao mqy9787 young1993 xgl0626

ctc2021's Issues

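Are the training hyperparameters the same as those in the provided bash script?

The results of the model we trained while reproducing the baseline differ considerably from the baseline results. Are the hyperparameters the same as those provided in the bash script?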

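Was an exchange session held at the end of the competition?

Was an exchange session held at the end of the competition? I would like to learn about other teams' approaches.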

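Baseline model performance

Hello, was the baseline model obtained by training BERT on the CTC training set? How does the baseline model perform on the test set or the validation set? Thank you.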

Have the details of this dataset and the baseline performance results been released as a paper?

Could someone please share a link to the paper?

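Could the validation and test sets be provided?

I only recently came across this Chinese text correction competition and have already missed the registration deadline. The official website also fails to open. Could the test set and validation set be provided?

Is there any plan to release the validation and test sets publicly?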

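Training the baseline

Following the instructions, I segmented the dataset with the provided segement.py and converted it into a gector file, then trained with the ctc2021_baseline model, leaving the other parameters unchanged. During the first epoch, the reported Accuracy starts at about 0.4 and quickly reaches 0.94, and it also reaches 0.96 on the validation set. However, when I run inference on the training set again, the results are very poor. The same thing happens with other datasets and other pretrained language models. What is causing this?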

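Differences between the two releases of the validation set

Content

The validation file qua_input.txt sent by email on August 6 contains 972 sentences, which does not match the 985 sentences in the qualification-round qua_input file downloaded from the CodaLab competition website.
In addition, a comparison shows that, apart from the 13 missing sentences, other sentences were also modified.
In the figure below, the left side is the August 6 data and the right side is the data downloaded from the competition website.

Questions: 1. Which sentence count does the qua_input file behind the qualification-round leaderboard have?
2. Could you provide the code that computes the F1 score? While writing our own evaluation code, we cannot tell whether a discrepancy comes from the data differences or from a bug in our code, so we cannot locate problems in the evaluation code.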
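A comparison like the one described in this issue can be sketched in a few lines of Python. This is only an illustration: the function name `compare_line_sets` and the toy lines are made up here, and the real qua_input files would be read from disk instead of being written inline.

```python
from collections import Counter


def compare_line_sets(old_lines, new_lines):
    """Return the lines that appear in only one of two versions of a file."""
    # Count non-empty lines so that duplicated sentences are not lost.
    old_counts = Counter(line.strip() for line in old_lines if line.strip())
    new_counts = Counter(line.strip() for line in new_lines if line.strip())
    only_old = sorted((old_counts - new_counts).elements())
    only_new = sorted((new_counts - old_counts).elements())
    return only_old, only_new


# Toy stand-ins for the two versions (hypothetical content, not the real data).
old = ["s1 unchanged", "s2 original"]
new = ["s1 unchanged", "s2 edited", "s3 added"]

only_old, only_new = compare_line_sets(old, new)
print(len(old), len(new))  # sentence counts of each version
print(only_old)            # lines missing from, or changed in, the new version
print(only_new)            # lines added to, or changed in, the new version
```

A modified sentence shows up once in each list, while a sentence that was simply added or dropped shows up in only one of them.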

Question about the index in submitted results

For example:
Incorrect sentence: Reddit：喧布，
Correct sentence: Reddit：宣布，
Should the index then be
7,别字,
or
2,别字,
?
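The two candidate values are consistent with two different 0-based counting conventions, which the sketch below makes concrete. It is a guess at what the question means, not a statement of what the scorer expects: the helper names are hypothetical, and the token rule (a run of ASCII letters or digits counts as one token, every other character is its own token) is an assumption.

```python
import re


def char_index(sentence, wrong_char):
    """0-based position of the wrong character, counted in characters."""
    return sentence.index(wrong_char)


def token_index(sentence, wrong_char):
    """0-based position counted in tokens (assumed tokenization rule)."""
    # A run of ASCII letters/digits is one token; any other
    # non-space character is a token by itself.
    tokens = re.findall(r"[A-Za-z0-9]+|\S", sentence)
    return tokens.index(wrong_char)


wrong_sentence = "Reddit：喧布，"
print(char_index(wrong_sentence, "喧"))   # 7
print(token_index(wrong_sentence, "喧"))  # 2
```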

Submission score is -0.999999

I submitted results in the required format, but the score is -0.999999.

The scoring output log shows:

======= Set 1 (Qua): score(set1_score)=ERROR =======
list index out of range

Training

What kind of segmentation is meant here? If I simply encode with BertTokenizer and then decode, [UNK] tokens appear.

FileNotFoundError: [Errno 2] No such file or directory: 'qualification_input.txt.output'

Has the author encountered “FileNotFoundError: [Errno 2] No such file or directory: 'qualification_input.txt.output'”? If so, could you tell me how to solve it?

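A small question about the qualification-round gold labels and the evaluation script

The qualification-round qua_labels.txt file contains entries such as the following: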
`pid=00265, 87, 别字, 象, 像, 87, 别字, 象, 还,`
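That is, the same position can have more than one correction.
In the latest evaluation script, the read_label_file(pid_to_text, label_file) function runs an assertion check before it returns: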
`assert len(error_set) == len(det_set) == len(cor_set)`

Given this, the detection set 'det_set' contains a duplicate (pid, loc, wrong) entry, so the three sets end up with different sizes.
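The size mismatch can be reproduced with a minimal sketch. The parser below is hypothetical and is not the evaluation script's read_label_file. It assumes that each label line is a pid followed by groups of four fields (loc, type, wrong, correct), and that the correction set stores (pid, loc, wrong, correct) tuples. Only the (pid, loc, wrong) shape of det_set comes from the issue text.

```python
def parse_label_line(line):
    """Split one label line into (pid, loc, type, wrong, correct) tuples."""
    fields = [field.strip() for field in line.split(",")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field left by the trailing comma
    pid, rest = fields[0], fields[1:]
    if len(rest) % 4 != 0:
        raise ValueError(f"expected groups of 4 fields, got {len(rest)}")
    edits = []
    for i in range(0, len(rest), 4):
        loc, kind, wrong, correct = rest[i:i + 4]
        edits.append((pid, int(loc), kind, wrong, correct))
    return edits


line = "pid=00265, 87, 别字, 象, 像, 87, 别字, 象, 还,"
edits = parse_label_line(line)

# Detection ignores the correction, so both edits collapse into one entry.
det_set = {(pid, loc, wrong) for pid, loc, _, wrong, _ in edits}
# Assumed shape for the correction set: the correction is part of the key.
cor_set = {(pid, loc, wrong, correct) for pid, loc, _, wrong, correct in edits}

print(len(det_set), len(cor_set))  # 1 2
```

With two corrections at one position, an equality check on the set sizes fails even though the label line itself is well formed.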

Which pretrained model is the provided baseline model based on?

Which pretrained model were the model weights provided in the link trained from? Was it bert-base-chinese?
