ctc2021's People

Contributors

destwang, nalanwaner

ctc2021's Issues

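Baseline model performance

Hello, was the baseline model obtained by training BERT on the CTC training set? How does the baseline model perform on the test or validation set? Thanks.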

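Could the validation and test sets be provided?

I only recently came across this Chinese text correction competition and have already missed the registration deadline. The official website also will not open. Could the test and validation sets be provided here?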

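Training the baseline

Following the instructions, I segmented the dataset with the provided segement.py and converted it into a gector file, then trained with the ctc2021_baseline model, leaving all other parameters unchanged. During the first epoch, the reported Accuracy starts at around 0.4 and quickly reaches 0.94, and it reaches 0.96 on the validation set. However, when I run inference again on the training set, the results are very poor. The same thing happens with other datasets and other pretrained language models. What could be causing this?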

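Differences between the two releases of the validation data

Content

The qua_input.txt file in the validation data sent by email on August 6 contains 972 sentences, which does not match the 985 sentences in the qualifying-round qua_input file downloaded from the CodaLab competition site.
In addition, a comparison shows that besides the 13 missing sentences, some other sentences have been modified.
In the figure below, the left side is the August 6 data and the right side is the data downloaded from the competition site.
image

Questions: 1. Which version of the qua_input file (and which sentence count) does the qualifying-round leaderboard use?
2. Could you provide the code that computes the F1 score? While writing our own evaluation code, we cannot tell whether a discrepancy comes from the data differences or from a bug in our code, so we cannot pinpoint problems in the evaluation code.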
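
The kind of comparison described above can be reproduced with a few lines of Python. This is a minimal sketch, not the organizers' tooling: the function name `compare_versions` and the sample sentences are hypothetical stand-ins, and the real files would have to be read into lists of lines first.

```python
def compare_versions(old_lines, new_lines):
    """Return the lines found in only one of the two versions, ignoring order."""
    old_set, new_set = set(old_lines), set(new_lines)
    only_old = sorted(old_set - new_set)
    only_new = sorted(new_set - old_set)
    return only_old, only_new


# Hypothetical stand-ins for the emailed file and the file from the competition site.
email_version = ["句子一", "句子二改", "句子三"]
site_version = ["句子一", "句子二", "句子三", "句子四"]

only_email, only_site = compare_versions(email_version, site_version)
print("line counts:", len(email_version), len(site_version))
print("only in the emailed version:", only_email)
print("only in the site version:", only_site)
```

A sentence that was edited between releases appears in both result lists (once in each form), while a sentence that was simply added or dropped appears in only one, which helps separate the two kinds of difference.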

Which index convention should submitted results use?

For example:
Erroneous sentence: Reddit:喧布,
Correct sentence: Reddit:宣布,
Should the index then be
7,别字,
or
2,别字,
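
The two candidate values can be reproduced as follows. This is only a sketch of where the numbers 7 and 2 may come from; which convention the competition expects is not stated in this thread. The helper names and the tokenization rule (a run of ASCII letters or digits counts as one token, every other character is its own token) are assumptions made for illustration.

```python
import re


def char_index(wrong, correct):
    """0-based index of the first differing character (equal-length sentences)."""
    for i, (w, c) in enumerate(zip(wrong, correct)):
        if w != c:
            return i
    return -1


def token_index(wrong, correct):
    """0-based index of the first differing token.

    Assumed tokenization: a run of ASCII letters/digits is one token,
    any other character is a token by itself.
    """
    pattern = r"[A-Za-z0-9]+|."
    wrong_tokens = re.findall(pattern, wrong)
    correct_tokens = re.findall(pattern, correct)
    for i, (w, c) in enumerate(zip(wrong_tokens, correct_tokens)):
        if w != c:
            return i
    return -1


wrong = "Reddit:喧布,"
correct = "Reddit:宣布,"
print(char_index(wrong, correct))   # 7: counting every character from 0
print(token_index(wrong, correct))  # 2: "Reddit" counted as a single token
```

Under these assumptions, 7 is a 0-based character offset and 2 is a 0-based token offset in which the English word is one unit.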

Submission scored -0.999999

I submitted results in the required format, but the score was -0.999999.

The scoring output log shows:

======= Set 1 (Qua): score(set1_score)=ERROR =======
list index out of range
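
The log does not say which line or field triggered the error. One plausible cause is a result line with fewer fields than the scorer reads. The sketch below is a hypothetical pre-submission check, not the official scorer: it assumes the layout seen in qua_labels.txt elsewhere on this page (a pid followed by comma-separated groups of four fields: position, type, wrong text, corrected text). The convention for sentences without errors is not shown here, so lines for such sentences may need separate handling.

```python
def check_line(line):
    """Return a description of the problem, or None if the line looks well formed.

    Assumed layout: pid, then groups of (position, type, wrong, correct).
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field produced by a trailing comma
    if not fields or not fields[0]:
        return "empty line"
    rest = fields[1:]
    if len(rest) % 4 != 0:
        return f"{len(rest)} fields after the pid, not a multiple of 4"
    for i in range(0, len(rest), 4):
        if not rest[i].lstrip("-").isdigit():
            return f"position field {rest[i]!r} is not an integer"
    return None


# Hypothetical sample lines: the second one is missing its corrected text.
samples = [
    "pid=00265, 87, 别字, 象, 像, 87, 别字, 象, 还,",
    "pid=00001, 7, 别字, 喧,",
]
for sample in samples:
    print(sample, "->", check_line(sample) or "ok")
```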

Training

image
What kind of segmentation is meant here, exactly? If I simply encode with BertTokenizer and then decode, [UNK] tokens appear.
image

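A small question about the qualifying-round gold labels and the evaluation script

The qualifying-round qua_labels.txt file contains entries such as:
pid=00265, 87, 别字, 象, 像, 87, 别字, 象, 还,
That is, the same position can have more than one correction.
In the latest evaluation script, the read_label_file(pid_to_text, label_file) function runs an assertion check before returning:
assert len(error_set) == len(det_set) == len(cor_set)

Given this, the detection set 'det_set' receives the same (pid, loc, wrong) entry twice, so the three sets end up with different sizes.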
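
The size mismatch can be reproduced in isolation. This is a minimal sketch, not the evaluation script itself: `parse_label_line` is a hypothetical helper, and the tuple shape used for `cor_set` is an assumption; only the (pid, loc, wrong) shape of `det_set` comes from the description above.

```python
from collections import defaultdict


def parse_label_line(line):
    """Split a label line into (pid, loc, type, wrong, correct) tuples."""
    fields = [f.strip() for f in line.strip().split(",")]
    if fields[-1] == "":
        fields.pop()  # drop the empty field produced by a trailing comma
    pid, rest = fields[0], fields[1:]
    return [
        (pid, int(rest[i]), rest[i + 1], rest[i + 2], rest[i + 3])
        for i in range(0, len(rest), 4)
    ]


line = "pid=00265, 87, 别字, 象, 像, 87, 别字, 象, 还,"
entries = parse_label_line(line)

# Detection keys collapse because both entries share pid, position and wrong text.
det_set = {(pid, loc, wrong) for pid, loc, _type, wrong, _cor in entries}
# Assumed shape: correction keys also include the corrected text, so they stay distinct.
cor_set = {(pid, loc, wrong, cor) for pid, loc, _type, wrong, cor in entries}
print(len(entries), len(det_set), len(cor_set))  # 2 1 2

# One possible way to represent this case: group the accepted corrections per detection key.
alternatives = defaultdict(set)
for pid, loc, _type, wrong, cor in entries:
    alternatives[(pid, loc, wrong)].add(cor)
print(dict(alternatives))
```

Grouping the corrections under one detection key is only one option; whether the scorer should accept any of the listed corrections or require a specific one is a decision for the organizers.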
