Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces (ACL 2019)

Motivation:

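  1. Current bilingual lexicon induction (BLI) methods rely on an aligned bilingual dictionary or on word distributions, and usually assume that the two embedding spaces are isometric. This assumption may not hold in some settings: for example, for etymologically distant language pairs, the embedding spaces are also further apart. This work therefore proposes a quantitative estimate of the distance between two embedding spaces, together with a method, Bilingual Lexicon Induction with Semi-Supervision (BLISS).

Introduction: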

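  2. For BLI, a common and effective approach is to learn an orthogonal mapping between the two word embedding spaces, which assumes that the embedding spaces of the two languages are isometric.

  3. Prior work has shown that this strong assumption may not hold, so the authors:

    • First use the Gromov-Hausdorff (GH) distance to check how well the orthogonality assumption holds; it is violated particularly for etymologically and typologically distant language pairs.
    • Propose BLISS, which jointly optimizes for supervised embedding alignment, unsupervised distribution matching, and a weak orthogonality constraint in the form of a back-translation loss.

Methods:
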
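  4. Isometry of Embedding Spaces: the Gromov-Hausdorff (GH) distance is used to measure how well the word embedding spaces of two languages align under an orthogonal transformation.

    • Hausdorff distance: measures how similar two spaces are, as the largest distance from a point in either space to the nearest point in the other (https://www.cnblogs.com/yhlx125/p/5478147.html).
    • Gromov-Hausdorff (GH) distance: the minimum of the Hausdorff distance over all orthogonal transformations, which gives a quantitative estimate of how close the two spaces are to being isometric.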
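
To make the Hausdorff distance concrete, here is a minimal sketch for two finite point sets, using the standard definition (the larger of the two directed distances). The function names and the toy points are illustrative assumptions, not taken from the paper. The GH distance would additionally minimize this quantity over transformations of the spaces, which this sketch does not attempt.

```python
import math

def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Toy 2-D points (made up for illustration, not real word embeddings).
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]

print(hausdorff_distance(X, Y))  # 3.0
```

Every point of X also appears in Y, so the X-to-Y direction contributes 0; the extra point (1.0, 3.0) in Y lies 3.0 away from its nearest point in X, which sets the distance.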

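  5. Semi-supervised Framework

    • Drawback of supervised methods: they only align the words in the dictionary and do not exploit the rich information contained in the embeddings.
    • Drawback of unsupervised methods: they rely on word distributions, so they only achieve a coarse-grained alignment; fine-grained alignment is poor.
    • Semi-supervised framework:
      • Unsupervised Distribution Matching (MUSE): learn a mapping matrix W from source to target while training a discriminator D to tell mapped source embeddings WX apart from target embeddings Y. The loss pushes the distributions of the two embedding spaces to match as closely as possible. L_{W|D} lets the model exploit the distributional information available in both embedding spaces, and hence all available monolingual data.
      • Aligning Known Word Pairs: for the seed pairs S and a similarity function f_s, the goal is to maximize f_s over matched word pairs, i.e. to minimize the negative similarity. L_{W|S} allows the correct alignment of labeled pairs to be supplied in the form of a small seed dictionary.
      • Weak Orthogonality Constraint: a consistency loss is defined on the embedding space X that maximizes the similarity f_a between x and W^T W x. L_{W|O} encourages the jointly optimized matrix W to be orthogonal.
      • Nearest Neighbor Retrieval: CSLS.
      • Iterative Procrustes Refinement and Hubness Mitigation: the usual unsupervised procedure iteratively expands the dictionary and re-optimizes the mapping matrix, typically with the Procrustes method. Hubness problem: the refinement can get stuck in a local optimum. Hubness filtering mechanism: filter out target-domain words that act as hubs, e.g. during iterative dictionary expansion, skip target words that have more than a threshold number of neighbors in the source domain.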
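
The non-adversarial pieces of the framework above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: cosine similarity is assumed for both f_s and f_a, the hub filter counts how many source words pick a target word as their nearest neighbor (one possible reading of the threshold rule), all function names are hypothetical, and the data is synthetic. The adversarial discriminator behind L_{W|D} is omitted.

```python
import numpy as np

def normalize(E):
    """Scale each row to unit length so that dot products are cosines."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def supervised_loss(W, X_seed, Y_seed):
    """Negative mean cosine similarity between W x_i and y_i over seed pairs."""
    mapped = X_seed @ W.T  # row i holds (W x_i)^T
    return -np.sum(normalize(mapped) * normalize(Y_seed), axis=1).mean()

def weak_orthogonality_loss(W, X):
    """Negative mean cosine similarity between x and W^T W x."""
    back = X @ W.T @ W  # row i holds (W^T W x_i)^T
    return -np.sum(normalize(X) * normalize(back), axis=1).mean()

def procrustes(X_seed, Y_seed):
    """Orthogonal W minimizing ||X_seed W^T - Y_seed||_F, via SVD."""
    U, _, Vt = np.linalg.svd(Y_seed.T @ X_seed)
    return U @ Vt

def csls(mapped_src, tgt, k=10):
    """CSLS score matrix: 2*cos(x, y) - r_T(x) - r_S(y)."""
    sims = normalize(mapped_src) @ normalize(tgt).T       # (n_src, n_tgt)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)    # mean of k best targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)    # mean of k best sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

def filter_hubs(nn_index, max_neighbors):
    """Drop pairs whose target is the nearest neighbor of too many source words."""
    counts = np.bincount(nn_index)
    return [(int(s), int(t)) for s, t in enumerate(nn_index)
            if counts[t] <= max_neighbors]

# Synthetic data: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Y = X @ Q.T

W = procrustes(X[:10], Y[:10])                # fit on 10 "seed" pairs
scores = csls(X @ W.T, Y, k=5)
pairs = filter_hubs(scores.argmax(axis=1), max_neighbors=2)
print(np.allclose(W, Q), supervised_loss(W, X, Y), weak_orthogonality_loss(W, X))
```

Because the synthetic target space is an exact rotation of the source, the Procrustes step recovers the rotation and both losses sit at their minimum of about -1. On real embeddings the spaces are only approximately isometric, which is the case the weak orthogonality term is meant to handle.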

Experimental Setup

Main Results
