zhanzecheng / sohu_competition

Sohu's 2018 content recognition competition, 1st place solution (搜狐内容识别大赛第一名解决方案)

Home Page: https://biendata.com/competition/sohu2018/

Python 19.68% Jupyter Notebook 79.88% C++ 0.01% Shell 0.05% Cuda 0.39%
nlp ensembling stacking competition sohu


Introduction

This is the solution of team LuckyRabbit, winner of the 2nd Sohu Content Recognition Competition. For details on the contest and an in-depth explanation, see the write-up document.

Code Pipeline

The code is organized into five stages: data preprocessing, feature extraction, single models, stacking-based model ensembling, and tricks (pipeline diagram).

Input

The input data consists of news articles in HTML format plus their attached images:

```html
<title>惠尔新品 | 冷色系实木多层地板系列</title> <p>  </p> <br/><p>  <span style="font-size: 16px;">冷色系实木多层系列全新上市</span></p>	P0000001.JPEG;P0000002.JPEG;
```
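Before feature extraction, raw HTML like the sample above has to be reduced to plain text. A minimal sketch using only the Python standard library (the repository's actual preprocessing code may differ):

```python
import re
from html import unescape

def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace from a raw news document."""
    text = re.sub(r"<[^>]+>", " ", html)   # drop all tags
    text = unescape(text)                  # decode entities like &nbsp;
    return re.sub(r"\s+", " ", text).strip()

sample = '<title>Title</title> <p><span style="font-size: 16px;">body text</span></p>'
print(html_to_text(sample))  # Title body text
```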

Preprocessing

  • Text augmentation by back-translation: translate the Chinese text to English, then translate it back to Chinese. This step is not included in the code; call a translation API yourself.
  • Image augmentation: rotation, translation, added noise, and oversampling.
  • jieba is used as the basic word segmentation component.
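The image augmentations listed above can be sketched with NumPy; the exact transforms and parameters used in the competition are not specified in this README, so the choices below are illustrative:

```python
import numpy as np

def rotate(img: np.ndarray, k: int = 1) -> np.ndarray:
    """Rotate by k * 90 degrees (simple lossless rotation)."""
    return np.rot90(img, k)

def translate(img: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift the image by (dy, dx) pixels, wrapping around the edges."""
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def add_noise(img: np.ndarray, sigma: float = 10.0, seed: int = 0) -> np.ndarray:
    """Add Gaussian pixel noise and clip back to the valid byte range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

img = np.zeros((32, 32), dtype=np.uint8)
augmented = [rotate(img), translate(img, 4, -2), add_noise(img)]
```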

Feature Extraction

  • 300-dimensional word vectors trained with gensim (Baidu Cloud link)
  • TF-IDF features, reduced in dimensionality with SVD
  • Character-vector features
  • Hand-crafted features, e.g. whether the text contains a phone number or WeChat ID
  • OCR: text extracted from the images supplements the input to text classification (diagram)
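The TF-IDF + SVD step can be sketched with scikit-learn; the documents and the number of SVD components here are illustrative, as the competition's actual settings are not given in this README:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "wood floor new product cold tone",
    "cold tone multilayer wood series on sale",
    "phone number wechat contact promotion",
    "news article about sports match result",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # sparse matrix, (n_docs, vocab_size)

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)         # dense matrix, (n_docs, 2)
print(X_reduced.shape)                   # (4, 2)
```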

Single Models

As an example, take one of the classic models: the OCR-extracted text and the news text are fed separately into a shared embedding layer, and the resulting representations are then concatenated for classification (model diagram). The scores of the individual models are as follows.

| Model / method | F1-measure |
| --- | --- |
| catboost | 0.611 |
| xgboost | 0.621 |
| lightgbm | 0.625 |
| dnn | 0.621 |
| textCNN | 0.617 |
| capsule | 0.625 |
| covlstm | 0.630 |
| dpcnn | 0.626 |
| lstm+gru | 0.635 |
| lstm+gru+attention | 0.640 |
(Note: since the competition's scoring system has closed, individual model scores may be slightly inaccurate.)
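The F1-measure used above can be computed with scikit-learn. A small sketch on toy labels; whether the competition used macro or another averaging scheme is not stated here, so macro averaging is assumed:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Macro F1: per-class F1 scores averaged with equal class weight.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 3))  # 0.822
```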

Model Ensembling

Stacking

A good introduction to stacking-based model ensembling can be found here. The stacking architecture we used in the competition is shown below (diagram).
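A minimal sketch of the stacking idea: out-of-fold (OOF) predictions from the base models become the features of a second-level model. The base learners and data below are stand-ins, not the models actually used in the competition:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
base_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

# OOF predictions: each sample is predicted by a model that never saw
# it during training, which avoids label leakage into the second level.
oof = np.zeros((len(X), len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict_proba(X[val_idx])[:, 1]

# The second-level model learns how to combine the base predictions.
meta = LogisticRegression().fit(oof, y)
print(round(meta.score(oof, y), 2))
```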

Snapshot Ensemble

In the second stacking level we also added snapshot ensembling, a deep-model ensembling method (paper).
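Snapshot ensembling trains a single network with a cyclic cosine learning rate and saves a checkpoint at each cycle minimum; the cheap checkpoints are then ensembled. A sketch of the schedule from the snapshot-ensembles paper (Huang et al., 2017):

```python
import numpy as np

def snapshot_lr(t: int, T: int, M: int, lr0: float) -> float:
    """Cyclic cosine annealing: t = current iteration, T = total
    iterations, M = number of snapshots (cycles), lr0 = initial LR."""
    cycle_len = T // M
    return lr0 / 2 * (np.cos(np.pi * (t % cycle_len) / cycle_len) + 1)

# The LR restarts to lr0 at the start of each cycle and anneals toward 0;
# a snapshot is saved just before each restart.
lrs = [snapshot_lr(t, T=300, M=3, lr0=0.1) for t in range(300)]
print(lrs[0], lrs[99], lrs[100])
```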

Pseudo-Labeling

Another trick we used is pseudo-labeling, which is applicable to any competition where the test set is provided up front (tutorial).
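A sketch of the pseudo-labeling loop: train on the labeled data, keep only high-confidence test predictions as extra labels, then retrain. The models, data, and the 0.9 confidence threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=1)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Keep only confident test predictions as pseudo-labels.
proba = model.predict_proba(X_test)
confident = proba.max(axis=1) > 0.9          # threshold is an assumption
pseudo_X, pseudo_y = X_test[confident], proba[confident].argmax(axis=1)

# Retrain on the enlarged training set.
model.fit(np.vstack([X_train, pseudo_X]), np.concatenate([y_train, pseudo_y]))
print(len(pseudo_X), "pseudo-labeled samples added")
```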

The results:

| Model / method | F1-measure |
| --- | --- |
| single model | 0.642 |
| stacking | 0.647 |
| stacking + tricks | 0.652 |

Code Structure

|- SOHU_competition
|  |- data
|  |  |- result              # model outputs
|  |  |- ···
|  |- ckpt                   # saved models
|  |- img                    # documentation images
|  |- src                    # source code
|  |  |- model               # models
|  |  |  |- model_basic          # training routines etc.
|  |  |  |- attention_model      # model definitions
|  |  |  |- ···
|  |  |- preprocess
|  |  |  |- EDA&Extract.ipynb    # feature processing and extraction
|  |  |  |- ···
|  |  |- ocr
|  |  |- train&predict.ipynb     # single-model training and testing
|  |  |- stacking.ipynb          # model ensembling

Usage:

  • git clone https://github.com/zhanzecheng/SOHU_competition.git
  • Download the news text training files and put them under ./data/ (link). The image files are too large to host here; download them from the official site.
  • Download the OCR model files and put them under ./src/ocr/ctpn/checkpoints/ (model link).
  • pip3 install -r requirement.txt
  • Download the word vectors and put them under ./data.
  • Run EDA&Extract.ipynb
  • Run train&predict.ipynb
  • Run stacking.ipynb

Acknowledgements

Thanks to my two awesome teammates, HiYellowC and yupeihua.

Our final-presentation slides (PPT) are also available here; download them if needed.


Issues

File not found

Hello, thank you very much for sharing your code. However, several files referenced in the code could not be found. Sorry to bother you, and thanks for any guidance!

Is the stacking OOF feature incomplete?

Hello, I have a question about the following OOF-feature code:

    # load oof train and oof test
    filenames = glob.glob('../data/result/*oof*')
    for filename in filenames:
        oof_filename.append(filename)
        test_filename.append(filename.replace('_oof_', '_oof_'))

    oof_data = []
    test_data = []

    for tra, tes in zip(oof_filename, test_filename):
        with open(tra, 'rb') as f:
            oof_data.extend(pickle.load(f)[:len(train_x)])
        with open(tes, 'rb') as f:
            test_data.extend(pickle.load(f)[:len(test_x)])
            
    train_x = np.concatenate((train_x, train_ocr_x, features, ocr_features, oof_data[:len(train_x)]), axis=-1)
    test_x = np.concatenate((test_x, test_ocr_x, test_features, ocr_test_features, test_data[:len(test_x)]), axis=-1)

Specifically, in

 train_x = np.concatenate((train_x, train_ocr_x, features, ocr_features, oof_data[:len(train_x)]), axis=-1)

doesn't `oof_data[:len(train_x)]` truncate `oof_data` down to the OOF data of a single model?

help

Hello, in your PPT I saw that you combined character vectors and word vectors via attention, but I could not find the corresponding code in the repository. I would appreciate your help.
