This project forked from englishbook/product-matching

The code of Team Rhinobird for Mining the Web of HTML-embedded Product Data Task One at ISWC2020

Home Page: https://ir-ischool-uos.github.io/mwpd/index.html

product-matching

Task one: Product Matching

The product matching task aims to determine whether two product offers from different websites refer to the same product.

Datasets

In the SWC2020 challenge product matching task, the Task one dataset is sampled from the WDC product data corpus. Products in the corpus are described by these properties: id, cluster id, category, title, description, brand, price, and specification table. Our models are mainly trained on two different matching datasets:

  • The Computers dataset, provided by the challenge organizers, contains only products from the Computers & Accessories category.

  • The All dataset contains products from all four categories (Computers & Accessories, Camera & Photo, Watches, and Shoes).

Input

Although products are described by many attributes, most of the fields contain NULL values. Considering the fill rate and the input length, we use only the title and description attributes and ignore the others.
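As a rough illustration of this input choice, the sketch below builds the text for a pair of offers from the title and description fields only, skipping NULL or empty values. The helper names `offer_text` and `pair_text` and the `max_words` cut-off are hypothetical and are not taken from the repository; the real code may truncate at the tokenizer level instead.

```python
from typing import Dict, Optional, Tuple


def offer_text(offer: Dict[str, Optional[str]],
               use_description: bool = True,
               max_words: int = 128) -> str:
    """Join the non-empty title and description of one offer into one string.

    max_words is an assumed, crude stand-in for a token-length limit.
    """
    fields = ["title", "description"] if use_description else ["title"]
    parts = []
    for name in fields:
        value = offer.get(name)
        # Skip NULL (None) and whitespace-only fields.
        if value and value.strip():
            parts.append(value.strip())
    words = " ".join(parts).split()
    return " ".join(words[:max_words])


def pair_text(left: Dict[str, Optional[str]],
              right: Dict[str, Optional[str]],
              use_description: bool = True) -> Tuple[str, str]:
    """Return the two text segments for a candidate product pair."""
    return (offer_text(left, use_description),
            offer_text(right, use_description))


left = {"title": "ThinkPad X1 Carbon", "description": None, "brand": "Lenovo"}
right = {"title": " Lenovo X1 Carbon 14in ", "description": "Ultrabook with 16GB RAM"}
text_a, text_b = pair_text(left, right)
```

Setting `use_description=False` corresponds to the title-only input variant; other attributes such as brand are ignored in both cases.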

Model

We use BERT_base as the main module of our matching model. Focal loss is adopted to alleviate the class imbalance problem.
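For readers unfamiliar with focal loss, here is a minimal per-example sketch in plain Python of the standard binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and the default values `gamma=2.0` and `alpha=0.25` are assumptions for illustration; the hyperparameters actually used in `train.py` are not stated here.

```python
import math


def binary_focal_loss(p: float, y: int,
                      gamma: float = 2.0,
                      alpha: float = 0.25,
                      eps: float = 1e-7) -> float:
    """Focal loss for one example.

    p is the predicted probability of the positive class, y is 0 or 1.
    gamma and alpha are assumed defaults, not values from the repository.
    """
    # Clamp to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)   # confident and correct
hard = binary_focal_loss(0.1, 1)   # confident and wrong
```

With `gamma=0` the expression reduces to alpha-weighted cross-entropy; larger `gamma` shifts the training signal toward hard examples, which is why the loss is commonly used when one class dominates.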

Please download the dataset and BERT weights first.

Then run train.py to train all the models we used in the challenge:

python train.py

After obtaining the model parameters, run predict.py to combine the predictions of the different models and produce the final results:

python predict.py
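How predict.py combines the models is not described here. One common way to combine several binary classifiers is to average their predicted probabilities and apply a threshold, which the sketch below illustrates. The function `ensemble_average`, the 0.5 threshold, and the sample scores are all assumptions, not the repository's implementation.

```python
from typing import List, Sequence


def ensemble_average(model_probs: Sequence[Sequence[float]],
                     threshold: float = 0.5) -> List[int]:
    """Average per-pair match probabilities across models, then threshold.

    model_probs[m][i] is model m's probability that pair i is a match.
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_pairs = len(model_probs[0])
    if any(len(p) != n_pairs for p in model_probs):
        raise ValueError("all models must score the same number of pairs")
    labels = []
    for scores in zip(*model_probs):
        mean = sum(scores) / len(scores)
        labels.append(1 if mean >= threshold else 0)
    return labels


# Made-up scores from three models for three test pairs.
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.3],
    [0.7, 0.1, 0.5],
]
labels = ensemble_average(probs)
```

Majority voting over hard labels would be an equally plausible choice; averaging simply keeps more information from each model's confidence.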

Post-processing

For test pairs that are predicted as a match (label 1) but whose two products belong to different categories, we directly set the prediction to 0 in the post-processing phase.
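The rule can be sketched as follows. The function name and the category strings are hypothetical, and leaving pairs with a missing (NULL) category unchanged is an assumption made here, since the handling of missing categories is not specified.

```python
from typing import List, Optional, Sequence


def postprocess(predictions: Sequence[int],
                left_categories: Sequence[Optional[str]],
                right_categories: Sequence[Optional[str]]) -> List[int]:
    """Flip predicted matches to 0 when the two categories differ."""
    corrected = []
    for pred, left, right in zip(predictions, left_categories, right_categories):
        # Assumption: a missing category gives no evidence, so keep the prediction.
        both_known = left is not None and right is not None
        if pred == 1 and both_known and left != right:
            corrected.append(0)
        else:
            corrected.append(pred)
    return corrected


preds = [1, 1, 0, 1]
left = ["Computers", "Shoes", "Shoes", None]
right = ["Computers", "Watches", "Watches", "Shoes"]
fixed = postprocess(preds, left, right)
```

Only positive predictions are ever changed, so the rule can raise precision but cannot recover missed matches.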

Results

Validation

Single model:

| Model | Input | Dataset | F1 | F1 after post-processing |
|---|---|---|---|---|
| Bert_focal | title | All | 0.9481 | 0.9496 |
| Bert_focal | title+description | All | 0.9384 | 0.9411 |
| Bert_focal | title+description | Computers | 0.9700 | 0.9700 |

Test

In the final evaluation, we ensemble these three models:

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Our model | 0.8063 | 0.9200 | 0.8594 |
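As a quick consistency check on the reported test figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two values in the table; the function name is ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


test_f1 = f1_score(0.8063, 0.9200)
print(round(test_f1, 4))  # 0.8594
```

The recomputed value agrees with the reported F1 to four decimal places.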
