lost-person / information-retrieval

Information Retrieval Course Project

Python 16.32% Makefile 3.44% C++ 0.52% C 79.72%

information-retrieval's Introduction

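Information Retrieval Course Project

Assignment Requirements

Implement a retrieval model over patient medical records (20 points)
Interface program (no specific requirements; implement the basic functionality; a plain bash command-line interface is suggested)
Experiment report (10 points)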

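Assignment Content

Patient medical record database: provided in XML and TXT formats. XML is the official standard dataset format; TXT is provided for easier processing. The official documentation allows either, but XML is authoritative!
Queries: see topic.xml and extra_topics2017.pdf. The usual approach is to use the disease field as the query and the other fields as auxiliary information.
Submission format: <query ID> Q0 <document ID> <document rank> <document score> <system ID>
Evaluation metric: P@10. You may implement the computation yourself or use the trec_eval script.
5-fold cross-validation: 3 parts for training, 1 part for validation, 1 part for testing
Average the test results across folds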
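
The result format, P@10, and the 3/1/1 fold split described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names, the default system ID `my_run`, the four-decimal score formatting, and the round-robin fold assignment are all assumptions made for the example.

```python
def format_run_line(query_id, doc_id, rank, score, system_id="my_run"):
    """Format one result as: <query ID> Q0 <document ID> <rank> <score> <system ID>."""
    return f"{query_id} Q0 {doc_id} {rank} {score:.4f} {system_id}"


def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k  # divides by k even if fewer than k documents were returned


def five_fold_splits(query_ids):
    """Yield (train, validation, test) lists: 3 folds train, 1 validation, 1 test."""
    folds = [query_ids[i::5] for i in range(5)]  # round-robin assignment (assumption)
    for i in range(5):
        val_idx = (i + 1) % 5
        train = [q for j in range(5) if j not in (i, val_idx) for q in folds[j]]
        yield train, folds[val_idx], folds[i]
```

Averaging `precision_at_k` over the test queries of each fold, and then over the five folds, gives the final reported figure.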

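Tasks

Build the inverted index (required; already copied from Kang Zhezhou (康哲舟), but it is only a partial inverted index; Zhang Lu (张路): copy the inverted index and program from Kang Zhezhou)
BM25 model (Qi Yatao (戚亚涛), done)
Interface (Zhang Jiarui (张家瑞), done)
Stemming (Qi Yatao, required, done)
Find a medical corpus (Zhang Lu, done)
Query expansion (Zhang Lu, optimization, done)
Further optimization of query expansion (Zhang Lu; obtain a larger corpus; done)
Program polish (standard result-file format, precision computation, etc.; Zhang Lu, Qi Yatao; done)
Experiment report writing (Shi Ruicong (石瑞聪), Lu Lijing (卢丽婧))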

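Experiment Pipeline

Document model: BM25
Word vectors: trained with the CBOW model
Query expansion: use the pretrained word vectors to return the top k nearest words for each term in the original query
Weights: the disease, gene, demographic, and other fields of the original query receive successively lower weights; expansion terms have weight 0.9
Note: see the code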
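
The pipeline above can be illustrated with a small, self-contained sketch: a weighted BM25 scorer plus a nearest-neighbour query expansion over word vectors. This is not the code in bm25.py or query.py. The function names, the `k1` and `b` defaults, the idf variant, the concrete field weights passed in, and the plain-dict representation of word vectors are all assumptions; only the field ordering and the 0.9 expansion weight come from the description above.

```python
import math
from collections import Counter


def bm25_score(weighted_query, doc_tokens, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum weight * idf * saturated term frequency over the query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term, weight in weighted_query.items():
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += weight * idf * tf[term] * (k1 + 1) / norm
    return score


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def expand_query(field_terms, field_weights, vectors, k=2, expansion_weight=0.9):
    """Return {term: weight}, adding the k nearest neighbours of each original term."""
    weighted = {}
    for field, terms in field_terms.items():
        for term in terms:
            weighted[term] = max(weighted.get(term, 0.0), field_weights[field])
    for term in list(weighted):  # snapshot: expansion terms are not expanded again
        if term not in vectors:
            continue
        neighbours = sorted(
            (w for w in vectors if w != term),
            key=lambda w: cosine(vectors[term], vectors[w]),
            reverse=True,
        )[:k]
        for w in neighbours:
            weighted.setdefault(w, expansion_weight)  # never overrides an original term
    return weighted
```

The expanded, weighted query is passed directly to `bm25_score`, so a document that matches only an expansion term still receives a (down-weighted) score.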

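File Descriptions

main.py Main program file
bm25.py BM25 model
query.py Query handling
word2vec.py Trains the word vectors
data_helpers.py Reads the data files
util.py Utility functions
SPIMI.py Builds the inverted index
test.py Scratch file for trying out programming ideas
clinical_trials.judgments.2017.csv Gold-standard query results
w2v.model Word vectors (used for query expansion)
w2v.model.trainables.syn1neg.npy Word-vector auxiliary file 1
w2v.model.wv.vectors.npy Word-vector auxiliary file 2
vocab.pkl Vocabulary file
trec_eval The trec_eval tool (used to compute precision); run it with: ./trec_eval/trec_eval ./eval/qrels.txt ./eval/res.txt
qrels.txt Ground-truth relevant documents
res.txt Predicted relevant documents
IR interface Run interface.jar, click the start button, and open the results folder to display the content of the relevant documents; font and background are selectable (although it looks ugly)
topic.xml Query file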
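
For readers unfamiliar with the technique that SPIMI.py is named after, the following is a minimal sketch of SPIMI-style (single-pass in-memory) index construction. It is not the repository's implementation: the function names and the `max_postings_per_block` limit are hypothetical, and the blocks are kept in memory here instead of being written to disk, which a real SPIMI indexer would do before merging.

```python
from collections import defaultdict


def spimi_blocks(token_stream, max_postings_per_block):
    """Yield sorted blocks [(term, [doc_ids])] from a stream of (term, doc_id) pairs."""
    block, count = defaultdict(list), 0
    for term, doc_id in token_stream:
        postings = block[term]
        if not postings or postings[-1] != doc_id:  # skip repeats within a document
            postings.append(doc_id)
            count += 1
        if count >= max_postings_per_block:  # block "full": emit it and start a new one
            yield sorted(block.items())
            block, count = defaultdict(list), 0
    if block:
        yield sorted(block.items())


def merge_blocks(blocks):
    """Merge per-block postings lists into one inverted index {term: [doc_ids]}."""
    index = defaultdict(set)
    for block in blocks:
        for term, doc_ids in block:
            index[term].update(doc_ids)
    return {term: sorted(ids) for term, ids in sorted(index.items())}
```

Calling `merge_blocks(spimi_blocks(stream, n))` on a stream of `(term, doc_id)` pairs yields a dictionary from each term to its sorted list of document IDs.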
