benchmark-word2vec

Performance comparison of related-word search over word vectors

Background:

In practice, word vectors are mostly accessed through the get_vector and similar_by_word interfaces provided by gensim.

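When the model's vocabulary reaches the millions, a single similar_by_word call takes too long. For example, with a 6M-scale model, looking up the near-synonyms of a single word takes about 0.35s.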
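
To illustrate why the cost grows with vocabulary size, here is a minimal NumPy sketch of an exhaustive cosine-similarity lookup. It is an assumption about the general shape of such a query, not gensim's actual implementation; the function name brute_force_most_similar and the random toy vectors are hypothetical.

```python
import numpy as np

def brute_force_most_similar(vectors, query, topn=10):
    """Return (row indices, cosine similarities) of the topn rows closest to query."""
    # Normalise rows and query so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    # One dot product per vocabulary entry: the cost is O(N * d) for every query.
    sims = unit @ q
    # Full sort kept for clarity; it adds O(N log N) on top of the scan.
    top = np.argsort(-sims)[:topn]
    return top, sims[top]

# Toy stand-in for a vocabulary: 1000 random 50-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 50)).astype(np.float32)
idx, sims = brute_force_most_similar(vectors, vectors[42], topn=5)
```

Unlike similar_by_word, this sketch does not drop the query row from its own result list, so the first hit is the query itself.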

In real projects, retrieval latency has to come down. One solution is to implement similar_by_word with faiss.

This project gives a simple comparison of gensim and faiss performance when looking up near-synonyms.

Experiments and results:

20K-scale word vector model

This corresponds to benchmark.py in the project; run run.sh to obtain the results.

Data:

  • faiss[Flat]: load index, 0.82s; search 100 times by word, 1.08s; search 100 times by vec, 1.06s
  • gensim: load index, 5.80s; search 100 times by word, 1.64s; search 100 times by vec, 1.62s
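
The "search 100 times" figures are total wall-clock times for a batch of queries. A timing harness of that kind might be sketched as follows; time_queries and fake_search are hypothetical names, and whether the original benchmark uses 100 distinct words is an assumption.

```python
import time

def time_queries(search_fn, queries):
    """Run search_fn once per query and return the total elapsed seconds."""
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    return time.perf_counter() - start

# Placeholder search function; swap in a gensim or faiss lookup to measure it.
calls = []

def fake_search(word):
    calls.append(word)
    return []

queries = ["w%d" % i for i in range(100)]
elapsed = time_queries(fake_search, queries)
print(f"search {len(queries)} times by word, {elapsed:.2f}s")
```

time.perf_counter is used because it is a monotonic, high-resolution clock, which suits short interval measurements better than time.time.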

Conclusion: running in brute-force mode, faiss returns results consistent with gensim and performs slightly better than gensim.
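
One way to check that two engines return consistent results is to compare their top-k lists for the same query. The helpers below are an illustrative sketch; the function names and the example words are hypothetical, not output from the benchmark.

```python
def topk_overlap(reference, candidate):
    """Fraction of the reference hits that also appear in the candidate hits."""
    ref = list(reference)
    if not ref:
        return 1.0
    cand = set(candidate)
    return sum(1 for w in ref if w in cand) / len(ref)

def same_ranking(reference, candidate):
    """True only if both lists contain the same words in the same order."""
    return list(reference) == list(candidate)

# Made-up result lists for one query word.
gensim_hits = ["king", "queen", "prince"]
faiss_hits = ["king", "prince", "queen"]

overlap = topk_overlap(gensim_hits, faiss_hits)   # same set of words
ordered = same_ranking(gensim_hits, faiss_hits)   # but a different order
```

Set overlap tolerates reordering among near-ties, which can arise from floating-point differences between two exact implementations; same_ranking is the stricter check.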

6M-scale word vector model

This corresponds to benchmark_1M.py in the project.

Data:

  • faiss[Flat]: load index, 31.92s; search 100 times by word, 209.59s; search 100 times by vec, 215.94s
  • faiss[IMI2x10,Flat; nprobe=8192]: load index, 53.94s; search 100 times by word, 4.36s; search 100 times by vec, 4.22s; train+store, 197.05s
  • gensim: load index, 208.36s; search 100 times by word, 394.81s; search 100 times by vec, 423.10s
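
For easier comparison, the "search 100 times by word" totals above can be turned into per-query latencies and ratios relative to gensim. The snippet only does arithmetic on the reported numbers; the variable names are illustrative.

```python
# Total seconds for 100 searches by word, copied from the list above.
totals = {
    "faiss[Flat]": 209.59,
    "faiss[IMI2x10,Flat; nprobe=8192]": 4.36,
    "gensim": 394.81,
}
n_queries = 100

# Average seconds per query for each engine.
per_query = {name: total / n_queries for name, total in totals.items()}

# Time relative to gensim (values below 1 mean faster than gensim).
ratio = {name: total / totals["gensim"] for name, total in totals.items()}

for name in totals:
    print(f"{name}: {per_query[name]:.4f}s per query, {ratio[name]:.3f}x gensim")
```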

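Conclusions:

  • In brute-force mode, faiss returns results consistent with gensim and takes about 0.5x the time gensim does.
  • In faiss IMI2x10,Flat mode, training takes about 200s. Raising nprobe at query time improves retrieval recall. With nprobe=256 the results are acceptable; with nprobe=8192 they are essentially the same as gensim's, while taking only about 0.01x the time gensim does.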

Acknowledgements:

Contributors: fengrk