alibaba-nlp / multi-cpr

Stars: 150 · Watchers: 3 · Forks: 17 · Size: 239.51 MB

[SIGIR 2022] Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

Python 95.84% Shell 4.16%
dataset passage-retrieval dense-retrieval passage-ranking pytorch question-answering text-ranking

multi-cpr's People

Contributors

ahxgw, dingkun-ldk

multi-cpr's Issues

Garbled text when opening the two files data/medial/corpus_split_3(4).tsv

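This looks like an encoding problem. split1 and split2 display correctly, but split3 and split4 are garbled. The first two are UTF-8, while the latter two are reported as ANSI (although which encoding "ANSI" refers to seems to depend on the system?). I tried converting the latter two to UTF-8, but they are still garbled. How can this be fixed?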
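
One way to investigate this is to read the raw bytes and try a few candidate encodings with strict decoding. The sketch below is only an illustration: the helper names and the candidate list are assumptions, not part of the repository, and nothing here confirms which encoding the affected files actually use. GB18030 is included as a guess because "ANSI" on a Chinese-locale Windows system usually means the GBK code page, which GB18030 extends.

```python
from typing import Optional, Sequence

# Candidate encodings to try, in order. This list is an assumption;
# extend it (e.g. with "big5") if neither candidate fits.
CANDIDATES = ("utf-8", "gb18030")


def detect_encoding(data: bytes,
                    candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes `data` without errors."""
    for enc in candidates:
        try:
            data.decode(enc)  # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        return enc
    return None


def to_utf8(data: bytes,
            candidates: Sequence[str] = CANDIDATES) -> bytes:
    """Decode with the detected encoding and re-encode as UTF-8."""
    enc = detect_encoding(data, candidates)
    if enc is None:
        raise ValueError("none of the candidate encodings decoded the data")
    return data.decode(enc).encode("utf-8")


# Hypothetical usage on one of the affected files:
#   raw = open("corpus_split_3.tsv", "rb").read()
#   print(detect_encoding(raw))
#   open("corpus_split_3.utf8.tsv", "wb").write(to_utf8(raw))
```

A conversion only helps if the bytes are decoded with their real source encoding first. Re-saving a file as UTF-8 from an editor that has already misread it preserves the garbled characters. If every candidate fails, the file may be truncated or corrupted rather than differently encoded.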

Domain-specific pretrained models

The paper compares results for these models, but I could not find the corresponding pretrained models in the code.

Questions about data collection

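Hello. Looking at the data, I noticed that in the labeled pairs each query is attached to only one doc. For e-commerce data, assuming the labeled data was collected from exposure (impression) logs, I have two questions.

  1. Why was an annotation format with one relevant doc per query used, rather than multiple docs per query?
  2. How was the data collected? Are the queries those saved over some historical window, such as 30 days, and is the single attached doc the item with the highest click frequency for that query?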

General-domain dataset

Would it be possible to provide the general-domain dataset mentioned in the paper, or the corresponding conversion procedure?
