
thunlp / wantwords


An open-source online reverse dictionary.

Home Page: https://wantwords.net/

Python 15.01% JavaScript 44.07% HTML 40.54% CSS 0.38%
reverse-dictionary word nlp natural-language-processing

wantwords's Introduction


WantWords Logo

An Open-source Online Reverse Dictionary [link]

News

The WantWords MiniProgram has been launched. Scan the QR code below to try it!

MiniProgram QR code

What Is a Reverse Dictionary?

In contrast to a regular (forward) dictionary, which provides definitions for query words, a reverse dictionary returns words that semantically match a query description.

rd_example
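
To make the idea concrete, here is a toy sketch of a reverse lookup. It ranks words by the word overlap (Jaccard similarity) between the query description and each word's definition. This is only an illustration and not how WantWords works: the system uses the neural model described under "Core Model". The `definitions` mapping and the function name `reverse_lookup` are hypothetical.

```python
from typing import Dict, List


def reverse_lookup(description: str, definitions: Dict[str, str], top_k: int = 3) -> List[str]:
    """Rank words by Jaccard overlap between the description and each definition."""
    query = set(description.lower().split())
    scored = []
    for word, definition in definitions.items():
        tokens = set(definition.lower().split())
        union = query | tokens
        score = len(query & tokens) / len(union) if union else 0.0
        scored.append((score, word))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_k] if score > 0]


# Hypothetical three-word dictionary, for illustration only.
definitions = {
    "river": "a large natural stream of water",
    "lake": "a large body of water surrounded by land",
    "desert": "a dry area of land with little rain",
}
ranking = reverse_lookup("a large stream of water", definitions)
print(ranking)
```

Pure word overlap fails as soon as the description uses different wording from the definition, which is why a real reverse dictionary needs a semantic model.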

What Can a Reverse Dictionary Do?

  • Solve the tip-of-the-tongue problem, the phenomenon of failing to retrieve a word from memory
  • Help new language learners
  • Help with word selection (or word dictionary) for anomia patients, people who can recognize and describe an object but fail to name it due to a neurological disorder

Our System

Workflow

workflow

Core Model

The core model of WantWords is based on our proposed Multi-channel Reverse Dictionary Model [paper] [code], as illustrated in the following figure.

model

Pre-trained Models and Data

To reproduce the system, download the pre-trained models and data and decompress them to BASE_PATH/website_RD/.

Key Requirements

  • Django==2.2.5
  • django-cors-headers==3.5.0
  • numpy==1.17.2
  • pytorch-transformers==1.2.0
  • requests==2.22.0
  • scikit-learn==0.22.1
  • scipy==1.4.1
  • thulac==0.2.0
  • torch==1.2.0
  • urllib3==1.25.6
  • uWSGI==2.0.18
  • uwsgitop==0.11

Cite

If the code or data help you, please cite the following two papers.

@inproceedings{qi2020wantwords,
  title={WantWords: An Open-source Online Reverse Dictionary System},
  author={Qi, Fanchao and Zhang, Lei and Yang, Yanhui and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  pages={175--181},
  year={2020}
}

@inproceedings{zhang2020multi,
  title={Multi-channel reverse dictionary model},
  author={Zhang, Lei and Qi, Fanchao and Liu, Zhiyuan and Wang, Yasheng and Liu, Qun and Sun, Maosong},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  pages={312--319},
  year={2020}
}

wantwords's People

Contributors

fanchao-qi, whoisleilei


wantwords's Issues

Missing data file from given link in README.md

Hi syheliel,
First of all, your project is amazing. However, the data downloaded from the link in README.md is not sufficient to run the project (e.g., 'wordTrans_Ch_En_Sort.json', 'wordTrans_En_Ch_Sort.json', 'word2synset_synset.txt', etc. are missing).
It would be great if you could provide a link to the missing files.

Thank you so much.

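The vocabulary is too outdated

Chinese-to-English reverse lookup:
Searching for “新冠病毒” (the novel coronavirus) returns the following: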
1.retrovirus
2.HIV
3.viral
4.cytomegalovirus
5.virus
6.herpes zoster
7.antiviral
8.acyclovir
9.H1N1
10.virology

The correct result should be: Coronavirus

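Too few characters are recognized

I just checked whether each Chinese character from 0x4e00 to 0x9fff is recognized (a character counts as recognized if related near-synonyms are returned, and as unrecognized if the response is {"error": 1}).

Verification code (may take several hours):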

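import requests

URL = "https://wantwords.thunlp.org/ChineseRD/"

total_failed = 0  # unrecognized characters overall
block_failed = 0  # unrecognized characters in the current 256-character block
# Start at 0x4e00 (not 0x4e01) so that every character in the stated range is tested.
for code in range(0x4E00, 0x9FFF + 1):
    # Passing params lets requests URL-encode the character.
    resp = requests.get(URL, params={"description": chr(code), "mode": "CC"})
    if resp.text == '{"error": 1}':
        total_failed += 1
        block_failed += 1
    # Report after the last character of each 256-character block has been checked.
    if code % 256 == 255:
        print(f"There are {block_failed} unrecognizable characters in 256 characters({hex(code - 255)}~{hex(code)}).")
        block_failed = 0
print(total_failed)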

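The result: of all 20,992 characters, 9,033 cannot be recognized, and only 11,959 can!

I therefore think the software supports too few Chinese characters (coverage of the basic CJK set is only 57%, and the extension blocks fare even worse); many characters that are not particularly rare cannot be recognized. Please consider expanding the vocabulary (this is certainly feasible, since some of the unsupported characters can be found with a Baidu search).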

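Display suggestion: put very low-relevance words in a separate section

Taking “老婆” (wife) as an example, the results that come after “老公” (husband) are not strongly relevant. Perhaps the display could be optimized for non-expert users: split the results into "high relevance", "medium relevance", and "low relevance" sections (or three columns), and explain the different confidence levels.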

婆娘
女人
妻子
媳妇儿
太太
娘儿们
妻
妻室
娘子
爱人
婆姨
老伴
老婆子
夫人
老小
内助
老公
小老婆
丈母娘
*
小姨子
儿媳妇
岳母
*货
二奶
一男半女
二婚
小姑子
外遇
*
戴绿帽子
小叔子
公婆
媳妇
大老婆
嫂子
闺女
女朋友
三妻四妾
公爹
婊子
大姨子
绿帽子
老娘
前妻
独守空房
情夫
沾花惹草
舅妈
贤惠
打光棍
丈夫
拈花惹草
富婆
贱货
男方
奶子
贤妻良母
荡妇
娶
嫁人
偷情
女方
守寡
守活寡
红杏出墙
娇妻
女友
鬼混
糟糠之妻
上床
淫妇
偷人
色鬼
知冷知热
男友
后妈
鸡巴
姐夫
妞
前夫
吃醋
孙媳妇
百依百顺
少奶奶
臭钱
嫂嫂
养老送终
女婿
小姑
花心
情妇
复婚
弟媳
爸
老大不小
再婚
家务活
千依百顺
婶子
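
The tiered display suggested above could be sketched as follows. This assumes each returned word comes with a numeric relevance score, which the issue does not confirm; the function name `split_by_relevance`, the thresholds, and the sample scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_relevance(
    results: List[Tuple[str, float]],
    high: float = 0.8,
    medium: float = 0.5,
) -> Dict[str, List[str]]:
    """Group (word, score) pairs into high / medium / low relevance tiers."""
    tiers: Dict[str, List[str]] = {"high": [], "medium": [], "low": []}
    for word, score in results:
        if score >= high:
            tiers["high"].append(word)
        elif score >= medium:
            tiers["medium"].append(word)
        else:
            tiers["low"].append(word)
    return tiers


# Made-up scores, for illustration only.
sample = [("妻子", 0.93), ("太太", 0.85), ("老公", 0.55), ("家务活", 0.12)]
tiers = split_by_relevance(sample)
print(tiers)
```

Fixed thresholds are the simplest choice; cutting at the largest gaps between consecutive scores would adapt better to queries whose scores cluster differently.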

The output of the pre-trained model is different from the results on the website

I am not sure whether I am running the model correctly, but I have noticed a significant difference between the output of the pre-trained model and the results on your website. Has the model been updated? If so, is there any way to get access to the updated model? Thank you!

BiLSTM

How is the BiLSTM in Model.py configured? It seems that it has to be passed in as the encoder argument. I tried the related code from MultiRD, but it raises an error.

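Please consider adding support for patent documents and specialized technical vocabulary

Hello, I work in intellectual property (and was previously a junior engineer). My colleagues and I all believe this project can meet an important need in the day-to-day work of the IP industry. Could you consider adding support for patent documents and specialized technical vocabulary? Thank you.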

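Suggestion: develop a desktop application as soon as possible

At present the tool can only be used by running the code, which greatly limits the promotion, optimization, and upgrading of 万词王 (WantWords). To quote Xi Jinping: “The times set the questions, we answer them, and the people grade the answers.”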

Installation questions

Could you add some tutorials on installation and usage?

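Two suggestions about the server

I just ran a script that repeatedly sent GET requests to https://wantwords.thunlp.org/EnglishRD/?description=test&mode=EE
1,000 times in a row, with no pause in between (each request was sent as soon as the previous response arrived).
I expected to be rejected after a few requests for sending them too frequently, but all 1,000 completed without a single rejection.
I therefore suggest adding request rate limiting on the server to guard against malicious attacks.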
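
One way to implement the rate limit suggested above is a per-client sliding window. The sketch below is framework-independent and purely illustrative: the class name `SlidingWindowLimiter`, the limits, and the idea of keying on a client identifier such as an IP address are assumptions, not part of the WantWords code base.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds interval."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller should reject, e.g. with HTTP 429
        hits.append(now)
        return True


# Example: at most 3 requests per second per client, driven by a fake clock.
fake_time = [0.0]
limiter = SlidingWindowLimiter(3, 1.0, clock=lambda: fake_time[0])
decisions = [limiter.allow("client-a") for _ in range(4)]
print(decisions)
```

This in-memory version only works within a single process; a deployment with several worker processes would need a shared store or a limit enforced at the reverse proxy.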

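Second, during this test I noticed that although only 100 words are displayed, the GET request actually returned 377 words (some queries return even more; “啊”, for example, returned 441 results), so the remaining 277 are never shown.
I suggest adding pagination (or, as many sites do, a disclosure-triangle button that shows or hides the extra content). The later words are mostly irrelevant, but a few are still useful, and they can also help users expand their vocabulary.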
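
The pagination idea can be sketched in a few lines. The function name `paginate` and the shape of the returned dictionary are hypothetical; the page size of 100 matches the number of words the issue says is currently displayed.

```python
from typing import Dict, List


def paginate(words: List[str], page: int, page_size: int = 100) -> Dict[str, object]:
    """Return one page of results plus the total page count (pages are 1-indexed)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    total_pages = max(1, -(-len(words) // page_size))  # ceiling division
    start = (page - 1) * page_size
    return {
        "page": page,
        "total_pages": total_pages,
        "items": words[start:start + page_size],
    }


# 377 placeholder results, the count reported in the issue.
words = [f"word{i}" for i in range(377)]
first = paginate(words, 1)
last = paginate(words, 4)
print(first["total_pages"], len(first["items"]), len(last["items"]))
```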

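Could some word filtering be added?

I happened to test the character “爽”. I expected answers such as “秋高气爽” (crisp, clear autumn weather), “舒适” (comfortable), or “心旷神怡” (relaxed and happy), but the actual results came as quite a shock. Could those words be filtered out?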

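The open-source license used by WantWords

I could not find a license file in the repository. Which open-source license does the project use, and are there any restrictions on using the repository's code commercially?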

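Support for popular slang

Please add support for popular slang.
For example, queries such as “形容一个人很菜” (“describes someone who is really bad at something”) or “你真狗” (roughly, “you're such a dog”).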

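Request for an API

Please provide a server API that returns results directly as JSON, to make it easier to build on top of the service (modifying the source files directly may be somewhat difficult, and I am uneasy doing so without an open-source license).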
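
Until an official API exists, the endpoints mentioned in other issues here can be queried directly. The sketch below is unofficial and rests on assumptions: the host and the `description` / `mode` parameters are copied from URLs quoted in those issues and may change, and the success payload format is not documented here, so the parsed JSON is returned unchanged. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Host taken from the URLs quoted in other issues; it may no longer be valid.
BASE_URL = "https://wantwords.thunlp.org"


def build_query_url(description: str, mode: str = "EE", endpoint: str = "EnglishRD") -> str:
    """Build a request URL in the same form as the ones quoted in the issues."""
    query = urlencode({"description": description, "mode": mode})
    return f"{BASE_URL}/{endpoint}/?{query}"


def parse_response(body: str):
    """Parse a response body, raising if the server returned an error object."""
    data = json.loads(body)
    if isinstance(data, dict) and "error" in data:
        raise ValueError(f"server reported error {data['error']}")
    return data  # success format is undocumented here, so it is returned as-is


url = build_query_url("test")
print(url)
```

The URL can then be fetched with any HTTP client, for example `requests.get(url).text`, and the body passed to `parse_response`.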

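Add support for internet slang

Entering common internet slang such as ‘奥利给’, ‘神马’, or '木有' does not return answers that fit the context. I suggest trying to add reverse lookup for internet slang used in positive contexts.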

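The open-source version does not seem to include the sensitive-word handling code

When a query contains sensitive words, the online deployment returns {"details":"The input description contains sensitive words","error":8,"message":"content error"}. This logic does not appear in the open-source version.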
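
Since the deployed filter is not published, the following is only a guess at what such a check might look like: a plain substring match against a word list, returning the same error object quoted above. The function name `check_description` and the word list are hypothetical.

```python
from typing import Iterable, Optional


def check_description(description: str, sensitive_words: Iterable[str]) -> Optional[dict]:
    """Return the error payload if the description contains a listed word, else None."""
    lowered = description.lower()
    for word in sensitive_words:
        if word and word.lower() in lowered:
            # Same payload as the one the online deployment is reported to return.
            return {
                "details": "The input description contains sensitive words",
                "error": 8,
                "message": "content error",
            }
    return None


# Placeholder word list, for illustration only.
blocked = ["forbidden"]
rejected = check_description("a Forbidden topic", blocked)
accepted = check_description("a kind of fruit", blocked)
print(rejected, accepted)
```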

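Would it be possible to tolerate misspellings and similar variants?

For example, “比萨” (pizza) is easily miswritten as “披萨”, and searching for “披萨” gives far worse results than searching for “比萨”.
Could common misspellings and alternative names for the same thing trigger a redirect or a hint?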
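
The redirect-or-hint behavior could start from a simple alias table that maps variant spellings to a canonical form before the query reaches the model. The table contents, the function name `normalize_query`, and the hint wording are hypothetical; only the 披萨 / 比萨 pair comes from the issue.

```python
from typing import Dict, Optional, Tuple

# Variant spelling -> canonical form. Only this one pair comes from the issue.
ALIASES: Dict[str, str] = {"披萨": "比萨"}


def normalize_query(query: str, aliases: Dict[str, str] = ALIASES) -> Tuple[str, Optional[str]]:
    """Return (query to search for, optional hint to show the user)."""
    cleaned = query.strip()
    canonical = aliases.get(cleaned)
    if canonical is None:
        return cleaned, None
    return canonical, f"Showing results for “{canonical}” instead of “{cleaned}”"


redirected = normalize_query(" 披萨 ")
unchanged = normalize_query("比萨")
print(redirected, unchanged)
```

A hand-maintained table only covers known variants; an edit-distance or pinyin-based matcher would be needed to catch misspellings that nobody has listed.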
