
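Does pre-training tokenization use the "##"-prefixed tokens in the later part of the Chinese vocab? If so, can a downstream task skip word segmentation when using a language model pre-trained with whole word masking? about chinese-bert-wwm CLOSED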

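dr-GitHub-account commented on May 26, 2024

Does pre-training tokenization use the "##"-prefixed tokens in the later part of the Chinese vocab? If so, when a language model pre-trained with whole word masking is used in a downstream task, can the downstream task skip word segmentation?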

from chinese-bert-wwm.

Comments (4)

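ymcui commented on May 26, 2024

It has nothing to do with whether a token carries ##. Once the WordPiece tokenizer has loaded vocab.txt, whatever it produces is what gets used. Usage is the same as with the original BERT, and no extra steps are needed.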
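To make the answer above concrete, here is a minimal pure-Python sketch of BERT-style tokenization: CJK characters are first isolated with whitespace (as the original BERT basic tokenizer does), then each whitespace-separated word goes through greedy longest-match WordPiece. The toy vocabulary, its ids, and the function names are assumptions for illustration only; they are not taken from the released vocab.txt or from the repository's code.

```python
def is_cjk(ch):
    # Simplification: only the main CJK Unified Ideographs block is checked.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def basic_tokenize(text):
    # Put spaces around every CJK character so each one becomes its own "word".
    spaced = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    return spaced.split()


def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first; non-initial pieces are looked up with "##".
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    return [p for w in basic_tokenize(text) for p in wordpiece(w, vocab)]


# Hypothetical toy vocabulary; real ids come from the line order of vocab.txt.
TOY_VOCAB = {"[UNK]": 0, "今": 1, "天": 2, "##天": 3, "play": 4, "##ing": 5}

tokens = tokenize("今天 playing", TOY_VOCAB)
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['今', '天', 'play', '##ing']
print(ids)     # [1, 2, 4, 5]
```

In this sketch "天" never becomes "##天", because every CJK character is already a separate word before WordPiece runs, while the English word "playing" still splits into "play" and "##ing".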


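dr-GitHub-account commented on May 26, 2024

Thank you for the answer!

You are right that ## is not the crux of the question.

What I am curious about in question 1 is this: after word segmentation, "今天" is treated as a single word, and "天" is a non-initial character. In vocab.txt it should therefore correspond to "##天" at line 14979 (## marks a non-initial character), not to "天" at line 1922. After tokenization, would the input ids then contain an id as large as 14978 instead of the 1921 obtained without word segmentation? I once read the source code of a Chinese MLM project and checked the input ids it produced after word segmentation; non-initial characters did map to the large ids.

If my understanding in question 1 is correct, it leads to question 2. Suppose the model saw large ids such as 14978 as input during pre-training, while the downstream task does nothing extra and "今天" still maps to [791, 1921]. Does that mean the mappings from large ids (such as 14978) to vectors stored in the pre-trained bert.embed are never used? If so, would running word segmentation in the downstream task, so that those mappings are used, make better use of the pre-trained model? This is only a conjecture, of course; whether word segmentation helps downstream still needs to be verified experimentally. I will give it a try.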
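The line numbers and ids quoted above differ by one (line 1922 versus id 1921, line 14979 versus id 14978) because BERT-style vocab files assign ids by zero-based line position. A minimal sketch of that mapping follows; the six-line toy vocabulary and the `load_vocab` helper are hypothetical and stand in for the real vocab.txt.

```python
import io


def load_vocab(lines):
    # The id of a token is its zero-based line index in the vocab file.
    vocab = {}
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


# Hypothetical toy file standing in for vocab.txt.
toy_file = io.StringIO("[PAD]\n[UNK]\n今\n天\n##今\n##天\n")
vocab = load_vocab(toy_file)

for token in ("天", "##天"):
    token_id = vocab[token]
    print(f"{token}: line {token_id + 1}, id {token_id}")
```

The plain character and its "##" counterpart are two separate vocabulary entries with two separate ids, which is why the question of which one appears in the input matters.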
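The mechanism behind question 2 can be sketched with a toy embedding table: a lookup reads only the rows whose ids appear in the input, so rows for ids that never occur are simply not touched. The table size, vectors, ids, and helper names below are hypothetical, and the sketch says nothing about which rows were actually trained in the released model; the thread does not settle that.

```python
def embed(input_ids, table):
    # An embedding lookup is plain row indexing: one row per input id.
    return [table[i] for i in input_ids]


def untouched_rows(input_ids, table):
    # Rows whose ids never appear in the input are not read by the lookup.
    return sorted(set(range(len(table))) - set(input_ids))


# Hypothetical 6-row table of 2-dimensional vectors.
toy_table = [[float(i), float(i) * 0.5] for i in range(6)]

# Hypothetical ids: 2 and 3 for the plain characters, 5 for a "##" entry.
without_segmentation = [2, 3]
with_segmentation = [2, 5]

print(embed(without_segmentation, toy_table))
print(untouched_rows(without_segmentation, toy_table))
print(untouched_rows(with_segmentation, toy_table))
```

Whether feeding the "##" ids helps depends on how the model was pre-trained, which is the experiment the poster proposes to run.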


stale commented on May 26, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


stale commented on May 26, 2024

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

