yongyehuang / zhihu-text-classification

[2017 Zhihu Kanshan Cup multi-label text classification] Solution by team ye (6th place)

Home Page: https://biendata.com/competition/zhihu/

Python 8.48% Shell 0.02% Jupyter Notebook 91.50%

multi-label text-classification tensorflow lstm textcnn han

zhihu-text-classification's Introduction

2017 Zhihu Kanshan Cup: Multi-Label Text Classification

Competition write-up: 2017 Zhihu Kanshan Cup Summary (Multi-Label Text Classification)

1. Environment

These are the dependencies used in my experiments. The versions are for reference only.

Environment/Library	Version
Ubuntu	14.04.5 LTS
python	2.7.12
jupyter notebook	4.2.3
tensorflow-gpu	1.2.1
numpy	1.12.1
pandas	0.19.2
matplotlib	2.0.0
word2vec	0.9.1
tqdm	4.11.2

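2. File structure

3. Data preprocessing

Extract all the data provided by the competition into the raw_data/ directory.
Run each .py script in order, without any arguments.
Alternatively, run all of them from the current directory with:
dos2unix run_all_data_process.sh # convert the script to Unix line endings with dos2unix (e.g. from Cygwin)
sh run_all_data_process.sh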

3.1 embed2ndarray.py

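The organizers provide word and character embeddings in txt format. This script converts each embedding matrix to an np.ndarray and saves them as data/word_embedding.npy and data/char_embedding.npy. The row index (id) of each word (character) in the embedding matrix is stored in a pd.Series, saved to data/sr_word2id.pkl and data/sr_char2id.pkl.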
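The conversion can be sketched roughly as follows. This is a minimal illustration, not the original script: it assumes the word2vec text layout (a "<vocab_size> <dim>" header line, then one token and its vector per line), and the function name `load_txt_embedding` and the in-memory demo data are hypothetical. The real script may differ, for example by reserving extra rows for special tokens.

```python
import io

import numpy as np
import pandas as pd


def load_txt_embedding(lines):
    """Parse word2vec-style text lines into (matrix, token -> row id Series).

    Assumes the first line is a "<vocab_size> <dim>" header; this layout is an
    assumption about the competition files, not taken from the original script.
    """
    lines = iter(lines)
    _vocab_size, dim = (int(x) for x in next(lines).split())
    tokens, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        tokens.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    embedding = np.asarray(vectors, dtype=np.float32)
    # The Series index is the token, the value is its row in the matrix.
    sr_token2id = pd.Series(range(len(tokens)), index=tokens)
    return embedding, sr_token2id


# Toy stand-in for the embedding file (hypothetical contents).
demo = io.StringIO("3 2\nw11 0.1 0.2\nw54 0.3 0.4\nw6 0.5 0.6\n")
embedding, sr_word2id = load_txt_embedding(demo)
print(embedding.shape, sr_word2id['w54'])
```

In a real run the results would then be written out with `np.save('data/word_embedding.npy', embedding)` and `sr_word2id.to_pickle('data/sr_word2id.pkl')`.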

3.2 question_and_topic_2id.py

Converts questions and topics to ids and saves the mappings to data/sr_question2id.pkl and data/sr_id2question.pkl.

3.3 char2id.py

Using the sr_char2id mapping from above, converts the characters of every question to their ids and saves the results as:
data/ch_train_title.npy
data/ch_train_content.npy
data/ch_eval_title.npy
data/ch_eval_content.npy
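The lookup step can be sketched as below. It assumes that each title or content field is a comma-separated string of anonymized tokens and that out-of-vocabulary tokens map to id 1 (the value one of the issues below reports for get_id()). The toy mapping and the helper `tokens_to_ids` are hypothetical.

```python
import pandas as pd

UNKNOWN_ID = 1  # assumption: id used for out-of-vocabulary tokens

# Toy stand-in for the Series loaded from data/sr_char2id.pkl.
sr_char2id = pd.Series([2, 3, 4], index=['c1', 'c2', 'c3'])
# A plain dict lookup avoids the per-call overhead of Series indexing.
char2id = sr_char2id.to_dict()


def get_id(token):
    """Return the embedding row for a token, or UNKNOWN_ID if it is missing."""
    return char2id.get(token, UNKNOWN_ID)


def tokens_to_ids(text, sep=','):
    """Convert a separator-joined token string into a list of ids."""
    return [get_id(tok) for tok in text.split(sep)]


titles = ['c1,c2', 'c3,c9,c1']  # 'c9' is not in the toy vocabulary
ids = [tokens_to_ids(t) for t in titles]
print(ids)
```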

3.4 word2id.py

Same as char2id.py, but for words.

3.5 creat_batch_data.py

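Packs all the data into batches of batch_size (128). With a fixed seed, 100,000 samples are randomly selected as the validation set. Each batch is stored as one npz file containing two arrays, X and y. All sequences are truncated to a fixed length, and shorter ones are padded with 0 up to that length.
Output locations: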
wd_train_path = '../data/wd-data/data_train/'
wd_valid_path = '../data/wd-data/data_valid/'
wd_test_path = '../data/wd-data/data_test/'
ch_train_path = '../data/ch-data/data_train/'
ch_valid_path = '../data/ch-data/data_valid/'
ch_test_path = '../data/ch-data/data_test/'
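The truncate, pad, split, and save steps can be sketched as follows. This is a simplified illustration under assumptions: the helper names, the seed value, the toy data, and the `<n>.npz` file naming are hypothetical, and the real script works on the full dataset with batch_size 128 and a 100,000-sample validation set.

```python
import os
import tempfile

import numpy as np


def pad_or_truncate(seq, max_len):
    """Cut seq to max_len and right-pad with 0 when it is shorter."""
    seq = list(seq)[:max_len]
    return seq + [0] * (max_len - len(seq))


def save_batches(X, y, out_dir, batch_size=128):
    """Write consecutive slices of (X, y) as 0.npz, 1.npz, ... in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    n_batches = 0
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        path = os.path.join(out_dir, '%d.npz' % n_batches)
        np.savez(path, X=X[start:stop], y=y[start:stop])
        n_batches += 1
    return n_batches


rng = np.random.RandomState(13)  # fixed seed; the value here is arbitrary
samples = [[5, 6, 7], [8], [9, 10, 11, 12, 13], [14, 15]]
X = np.asarray([pad_or_truncate(s, 4) for s in samples])
y = np.asarray([[1], [2], [3], [4]])

order = rng.permutation(len(X))  # shuffle once, then split
valid_idx, train_idx = order[:1], order[1:]

out_dir = tempfile.mkdtemp()
n_saved = save_batches(X[train_idx], y[train_idx], out_dir, batch_size=2)
first = np.load(os.path.join(out_dir, '0.npz'))
print(n_saved, first['X'].shape, first['y'].shape)
```

Storing one file per batch means training only ever has to load a single small npz at a time instead of holding the whole dataset in memory.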

3.6 creat_batch_seg.py

Same as creat_batch_data.py, except that the content is split into sentences, for use by the hierarchical models. Lengths with sentence splitting:
wd_title_len = 30, wd_sent_len = 30, wd_doc_len = 10 (the content is split into 10 sentences of 30 words each).
ch_title_len = 52, ch_sent_len = 52, ch_doc_len = 10.
Lengths without sentence splitting:
wd_title_len = 30, wd_content_len = 150.
ch_title_len = 52, ch_content_len = 300.
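One way to turn a flat id sequence into a fixed (doc_len, sent_len) matrix is sketched below. It assumes sentences end at certain punctuation token ids; the delimiter id 99 and both helper names are hypothetical, and the repository's own splitting logic in data_helpers.py may differ.

```python
import numpy as np


def split_sentences(token_ids, delimiter_ids):
    """Split a flat id sequence into sentences, ending each at a delimiter."""
    sentences, current = [], []
    for tok in token_ids:
        current.append(tok)
        if tok in delimiter_ids:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence with no closing delimiter
        sentences.append(current)
    return sentences


def to_doc_matrix(token_ids, delimiter_ids, doc_len=10, sent_len=30):
    """Return a (doc_len, sent_len) int array, truncated and zero-padded."""
    doc = np.zeros((doc_len, sent_len), dtype=np.int64)
    sentences = split_sentences(token_ids, delimiter_ids)[:doc_len]
    for i, sent in enumerate(sentences):
        sent = sent[:sent_len]
        doc[i, :len(sent)] = sent
    return doc


DELIMS = {99}  # hypothetical id of a sentence-ending punctuation token
doc = to_doc_matrix([4, 5, 99, 6, 7, 8, 99, 9], DELIMS, doc_len=3, sent_len=4)
print(doc)
```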

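4. Model training

Change to the model's directory, then train and predict. For example:

cd zhihu-text-classification/models/wd-1-1-cnn-concat/
# train
python train.py [--max_epoch 1 --max_max_epoch 6 --lr 1e-3 --decay_rate 0.65 --decay_step 15000 --last_f1 0.4]
# predict
python predict.py

Only some of the models are included here, and all of them use word embeddings. To use character embeddings instead, just change the model's input and sequence lengths.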

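5. Model ensembling

Linear weighted ensembling, with the weights searched by a strategy that mimics gradient descent. See local_ensemble.ipynb. Notes:

This method may overfit the validation set, so the result should be checked further on the test set. It works better when there are many models.
The weights need to be initialized by hand according to each single model's performance. char and word models cannot be compared directly: the char single models perform worse on their own, but they improve the ensemble considerably.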
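A weight search of this flavor might look like the sketch below. It is not the code from local_ensemble.ipynb: it uses coordinate-wise hill climbing with a shrinking step as a stand-in for the "gradient-descent-like" strategy, and plain top-1 accuracy on synthetic scores as a stand-in for the competition metric. All names and data here are hypothetical.

```python
import numpy as np


def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class is the true class."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))


def search_weights(model_scores, labels, init_weights, step=0.1, n_rounds=20):
    """Coordinate-wise hill climbing over linear ensemble weights."""
    weights = np.asarray(init_weights, dtype=np.float64)
    stacked = np.stack(model_scores)  # (n_models, n_samples, n_classes)

    def evaluate(w):
        # Weighted sum of the per-model score matrices.
        return top1_accuracy(np.tensordot(w, stacked, axes=1), labels)

    best = evaluate(weights)
    for _ in range(n_rounds):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = weights.copy()
                trial[i] += delta
                score = evaluate(trial)
                if score > best:  # keep a move only if validation improves
                    weights, best, improved = trial, score, True
        if not improved:
            step /= 2.0  # refine the search, like a decaying learning rate
    return weights, best


# Synthetic validation scores for two models of different quality.
rng = np.random.RandomState(0)
labels = rng.randint(0, 5, size=200)
one_hot = np.eye(5)[labels]
model_a = one_hot + rng.normal(scale=1.0, size=one_hot.shape)
model_b = one_hot + rng.normal(scale=1.5, size=one_hot.shape)

weights, best = search_weights([model_a, model_b], labels, init_weights=[1.0, 0.5])
print(weights, best)
```

Because every accepted move is judged on the validation set, the resulting weights are tuned to that set, which is why the notes above recommend a further check on held-out data.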

zhihu-text-classification's Issues

char2id fails with AttributeError: Can't get attribute 'get_id' on <module '__main__' >

@yongyehuang Hello! Have you ever run into the following error when running char2id?
Processing eval data.
test question number 217360
There are 0 test questions without title.
100%|██████████| 217360/217360 [00:01<00:00, 151202.44it/s]
There are 55179 test questions without content.
100%|██████████| 55179/55179 [00:00<00:00, 56958.01it/s]
Exception in thread Thread-11:
Traceback (most recent call last):
  File "C:\Users\liks\AppData\Local\Continuum\anaconda3\envs\tensorflow\lib\threading.py", line 914, in _bootstrap_inner
    self.run()
  File "C:\Users\liks\AppData\Local\Continuum\anaconda3\envs\tensorflow\lib\threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\liks\AppData\Local\Continuum\anaconda3\envs\tensorflow\lib\multiprocessing\pool.py", line 463, in _handle_results
    task = get()
  File "C:\Users\liks\AppData\Local\Continuum\anaconda3\envs\tensorflow\lib\multiprocessing\connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'get_id' on <module '__main__' from 'C:\Users\liks\AppData\Local\Continuum\anaconda3\envs\tensorflow\lib\site-packages\spyder\utils\ipython\start_kernel.py'>

When splitting into sentences, where do the delimiter symbols come from?

Here: https://github.com/yongyehuang/zhihu-text-classification/blob/master/data_helpers.py#L236
I have a question: how were these values obtained?
Thanks.

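Problem during full training

Hello, when I run the wd_1_1_cnn_concat model, data preprocessing and the test run both work fine. But during the full training run I hit a very serious problem: after training for a while, the entire zhihu-text-classification folder gets deleted automatically, with the following error:
IOError: [Errno 2] No such file or directory: '../../data/wd-data/data_train/6317.npz'
The first time this happened I thought someone had deleted my folder, but when I retrained, the same thing happened again: the whole zhihu-text-classification folder deleted itself.
Did you ever encounter this problem in your experiments?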

Where do the values of na_title_indexs in char2id.py come from?

na_title_indexs = [328877, 422123, 633584, 768738, 818616, 876828, 1273673, 1527297, 1636237, 1682969, 2052477, 2628516, 2657464, 2904162, 2993517]

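Error when running creat_batch_seg.py

Hello! When I run your code, the following error occurs in creat_batch_seg.py during data preprocessing:
AttributeError: Can't get attribute 'get_id' on <module 'pydevd' from 'D:\software\PyCharm 2017.2.4\helpers\pydev\pydevd.py'>
From what I can tell, the line train_title = np.load('../data/wd_train_title.npy') raises the error, and I don't know why. If you have time, could you give a brief answer? Thanks.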

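[Bug fix] Correction to the code for jointly fine-tuning models

Update 2017-10-26: corrected part 2 of the model ensembling code

2. Loading multiple models

What was wrong in the previous version: by mistake, when I wrote this example the two models were trained together, so the savers of model1 and model2 each saved all of the other model's parameters as well. Restoring the models later therefore worked without problems. In practice, however, the two models are trained separately; with the original code, restoring fails because each ckpt does not contain the other model's variables.
How the current version fixes it: define two separate savers, passing each one only its own model's variables.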

task_specific_attention

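Hi,
I learned a lot from reading your code, thank you very much. There is one thing I don't quite understand: in the Bi-GRU model, is the task_specific_attention step what adds the attention mechanism? I can't quite follow it. Could you point me to a paper or blog post? Thanks!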

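Multi-label classification: how to decide how many top-probability labels to take as the classes

In multi-label classification, how do you decide how many of the highest-probability labels to take as the predicted classes? For example, you take the 5 with the highest probability. How is this 5 determined, and why not take the top 3?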
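For reference, selecting the k highest-scoring labels per sample can be sketched as below. The helper `top_k_labels` and the toy scores are hypothetical. Here k is simply a parameter; the thread does not state how the value 5 was chosen, so comparing a validation metric for several values of k is one way to examine the question empirically.

```python
import numpy as np


def top_k_labels(scores, k=5):
    """Return the indices of the k highest scores per row, best first."""
    scores = np.asarray(scores)
    # Negating the scores makes argsort order them from high to low.
    return np.argsort(-scores, axis=1)[:, :k]


scores = np.array([[0.1, 0.7, 0.05, 0.9, 0.3],
                   [0.6, 0.2, 0.8, 0.1, 0.4]])
top3 = top_k_labels(scores, k=3)
print(top3)
```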

UNKNOWN word index?

Dear yongye:
Thank you for providing such excellent code.
I encountered a problem when I ran the word2id.py file. The function get_id() uses index 1 for unknown words. However, index 0 is used for unknown words when building the sr_word2id variable. Is something wrong?

char2id.py is too slow

data_process/char2id.py

It is far too slow; it actually took 7 days.
for na_index in na_title_indexs:
    df_eval.loc[na_index, 'char_title'] = df_eval.loc[na_index, 'char_content']

After changing loc to at, it finished in just a few seconds.
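The change described above can be sketched on a toy frame as follows. The column names follow the snippet in this issue; the data and row labels are made up for illustration.

```python
import pandas as pd

df_eval = pd.DataFrame({
    'char_title': [None, 'c1,c2', None],
    'char_content': ['c3,c4', 'c5', 'c6,c7,c8'],
})
na_title_indexs = [0, 2]  # toy row labels of questions without a title

for na_index in na_title_indexs:
    # .at reads and writes a single cell by label, which avoids the
    # general-purpose indexing work that .loc does on every call.
    df_eval.at[na_index, 'char_title'] = df_eval.at[na_index, 'char_content']

print(df_eval['char_title'].tolist())
```

`.at` only accepts a single row label and a single column label, so it fits this loop exactly; `.loc` remains the right tool when selecting slices or multiple rows.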