bangliu / articlepairmatching
The code of ACL 2019 paper: Matching Article Pairs with Graphical Decomposition and Convolutions
License: Other
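Hello, I would like to ask what the labeling criteria for the dataset are, and whether the following labels affect the results.
The paper says that an event_pair describes the same event, while a story_pair covers a set of related events (for example, a shared topic).
In the labeled same_story_doc_pair dataset,
some related events are not labeled as related, for example: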
0|9501|14721|“ 出轨 后 ” 的 宋 喆 买 豪宅 母亲 背 名牌 包 马蓉 这边 却 惨不忍睹|
马蓉 出轨 宋 喆 后 一直 没 露面 , 这次 终于 要 露面 了|
0|10706|10751|详解 特朗普 就职 典礼 全程 安排 具有 多 个 看点|
名流 大腕 拒绝 出席 总统 就职 典礼 特朗普 : 我 想 要 人民|
0|14176|14297|" 台风 "" 海马 "" 本周 或 带来 严重 风雨 影响 "|
“ 海马 ” 或 直 扑 闽粤 19日 至 23日 将 带来 严重 的 风雨 影响|
Some unrelated events are labeled as related, for example:
1|13109|13110|肇庆 这 部分 路段 封闭 施工 , 车主 请 绕行 !|
肇庆 打掉 一特大 贩毒 犯罪 团伙 缴 毒 6000 多 克|
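Will the code for the term-based similarity in the aggregation layer and for the BERT encoding layer be released?
Also, was PyTorch's built-in DataLoader not considered for data loading? Without it, the model's batch size is hard to adjust.
Running feature_extractor.py fails with the following error: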
Traceback (most recent call last):
File "feature_extractor.py", line 9, in
from ccig import *
File "/home/ubuntu/Desktop/CIG-GCN/ArticlePairMatching-master/src/models/CCIG/data/ccig.py", line 13, in
IDF = load_IDF("event_story")
File "/home/ubuntu/Desktop/CIG-GCN/ArticlePairMatching-master/src/models/CCIG/data/resource_loader.py", line 18, in load_IDF
"|", "|", keep_header=False)
File "/home/ubuntu/Desktop/CIG-GCN/ArticlePairMatching-master/src/models/CCIG/util/pd_utils.py", line 11, in export_columns
df = pd.read_csv(fin, sep=sep_in)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in init
self._make_engine(self.engine)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '../../../../data/raw/event-story-cluster/event_story_cluster.txt'
Where is this missing file? Does it need to be generated?
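The traceback shows that load_IDF opens a path relative to the current working directory, so a quick check of which expected inputs exist can help narrow the problem down. A minimal sketch, assuming the script is started from the directory that contains feature_extractor.py; `find_missing` is a hypothetical helper and not part of the repository:

```python
import os

# Raw input path copied from the traceback above. It is resolved relative
# to the current working directory, not to the location of the script.
REQUIRED_FILES = [
    "../../../../data/raw/event-story-cluster/event_story_cluster.txt",
]


def find_missing(paths, base_dir="."):
    """Return the absolute form of every path that is not an existing file."""
    missing = []
    for rel in paths:
        full = os.path.abspath(os.path.join(base_dir, rel))
        if not os.path.isfile(full):
            missing.append(full)
    return missing


if __name__ == "__main__":
    for path in find_missing(REQUIRED_FILES):
        print("missing:", path)
```

Printing the absolute path makes it easier to see whether the file is truly absent from the checkout or whether the script was simply launched from a different directory.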
Hello, I ran into the following problem when running git clone:
Error downloading object: data/raw/event-story-cluster/same_event_doc_pair.txt (e5c2482): Smudge error: Error downloading data/raw/event-story-cluster/same_event_doc_pair.txt (e5c2482c410f19418256839d7158d18244e9630466f262b064fb5a69e6f7dddc): batch response: Post https://github.com/BangLiu/ArticlePairMatching.git/info/lfs/objects/batch: dial tcp: lookup github.com: no such host
How can this be resolved?
Hello, roughly how long does it take to finish processing the test set data?
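How were labels assigned to the document pairs?
By crowdsourcing or by some other method?
Hello, when I use draw_graph from graph_tool to draw figures, the resulting images cannot display Chinese characters.
I see that your code uses vertex_font_family="STKaiti". Is this caused by my graph_tool version?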
$ conda config --add channels conda-forge
$ conda config --add channels ostrokach-forge
$ conda install graph-tool
Hi, the data generation program has been running for more than 36 hours, but I still have not gotten the output file same_event_doc_pair.cd.json.
I would be very grateful for any advice from those who have run it.
The following error appears when running feature_extractor.py:
python: symbol lookup error: /home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/graph_tool/draw/libgraph_tool_draw.so: undefined symbol: _ZN5Cairo7Context16select_font_faceERKSsNS_9FontSlantENS_10FontWeightE
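How can this be resolved?
Suppose I want to use this model for article search, that is, finding similar articles. After extracting features with modules such as TextRank and NER, would I then need to run the model once against every article already in the library? Wouldn't that be too slow?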
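One common way to avoid scoring a query against every stored article is to shortlist candidates cheaply first and run the pairwise model only on the shortlist. The sketch below is an assumption about how that could be done with the extracted keywords, not something the repository provides; `build_inverted_index` and `shortlist` are hypothetical names:

```python
from collections import defaultdict


def build_inverted_index(doc_keywords):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for kw in keywords:
            index[kw].add(doc_id)
    return index


def shortlist(query_keywords, index, top_k=10):
    """Rank stored documents by the number of keywords shared with the query."""
    overlap = defaultdict(int)
    for kw in set(query_keywords):
        for doc_id in index.get(kw, ()):
            overlap[doc_id] += 1
    # Highest overlap first; ties are broken by document id for determinism.
    ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Illustrative library: document id -> keywords extracted offline.
library = {
    "d1": ["台风", "海马"],
    "d2": ["特朗普", "就职"],
    "d3": ["台风", "广东"],
}
index = build_inverted_index(library)
candidates = shortlist(["台风", "海马"], index)  # only these go to the model
```

Documents that share no keyword with the query never reach the expensive matching step, so the cost depends on the shortlist size rather than on the size of the library.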
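Hello, I am cloning the repository with git-lfs to get the dataset, but I ran into some problems during the clone. Could you publish the dataset in some other way? Thank you.
Hello, feature_extractor.py contains the following statements:
if __name__ == "__main__":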
#debug with a few lines
dataset2featurefile(
"../../../../data/raw/event-story-cluster/same_event_doc_pair.txt",
"../../../../data/processed/event-story-cluster/same_event_doc_pair.cd.debug.json",
"label", "category1", "time1", "time2", "content1", "content2",
["keywords1", "ner_keywords1"], ["keywords2", "ner_keywords2"],
col_title1=None, col_title2=None, use_cd=True,
draw_fig=True, parallel=False, extract_range=range(2), print_fig=True)
# process data
dataset2featurefile(
"../../../../data/raw/event-story-cluster/same_event_doc_pair.txt",
"../../../../data/processed/event-story-cluster/same_event_doc_pair.cd.json",
"label", "category1", "time1", "time2", "content1", "content2",
["keywords1", "ner_keywords1"], ["keywords2", "ner_keywords2"],
col_title1="title1", col_title2="title2", use_cd=True,
draw_fig=False, parallel=True, extract_range=None,
betweenness_threshold_coef=1.0, max_c_size=6, min_c_size=2)
dataset2featurefile(
"../../../../data/raw/event-story-cluster/same_story_doc_pair.txt",
"../../../../data/processed/event-story-cluster/same_story_doc_pair.cd.json",
"label", "category1", "time1", "time2", "content1", "content2",
["keywords1", "ner_keywords1"], ["keywords2", "ner_keywords2"],
col_title1="title1", col_title2="title2", use_cd=True,
draw_fig=False, parallel=True, extract_range=None,
betweenness_threshold_coef=1.0, max_c_size=6, min_c_size=2)
dataset2featurefile(
"../../../../data/raw/event-story-cluster/same_event_doc_pair.txt",
"../../../../data/processed/event-story-cluster/same_event_doc_pair.no_cd.json",
"label", "category1", "time1", "time2", "content1", "content2",
["keywords1", "ner_keywords1"], ["keywords2", "ner_keywords2"],
col_title1="title1", col_title2="title2", use_cd=False,
draw_fig=False, parallel=True, extract_range=None,
betweenness_threshold_coef=1.0, max_c_size=6, min_c_size=2)
dataset2featurefile(
"../../../../data/raw/event-story-cluster/same_story_doc_pair.txt",
"../../../../data/processed/event-story-cluster/same_story_doc_pair.no_cd.json",
"label", "category1", "time1", "time2", "content1", "content2",
["keywords1", "ner_keywords1"], ["keywords2", "ner_keywords2"],
col_title1="title1", col_title2="title2", use_cd=False,
draw_fig=False, parallel=True, extract_range=None,
betweenness_threshold_coef=1.0, max_c_size=6, min_c_size=2)
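The dataset2featurefile method is called several times here. Apart from the first call, which sets extract_range to range(2), the later calls are all the same. Is this necessary? Could I keep only one of the calls when actually running it, and if so, should I keep the one with extract_range=range(2) or one with extract_range=None? Thank you!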
Hi, resource_loader.py refers to 'event_story_cluster.txt', but this file is not provided and I don't know how it is generated.
Hi, I ran into a problem when running
"python main.py --data_type "event" --use_vfeatures --use_siamese --use_gfeatures --use_gcn --use_cd"
The error is as follows:
Traceback (most recent call last):
File "main.py", line 297, in <module>
train(args, fout)
File "main.py", line 248, in train
step = train_epoch(epoch, step)
File "main.py", line 190, in train_epoch
output = model(w2v_idxs_l, w2v_idxs_r, v_feature, adj, g_feature, g_vertice) # what if batch > 1 ?
File "//users10/yhwu/miniconda/envs/match_article/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/users10/yhwu/Project/match_article/src/models/CCIG/models/se_gcn.py", line 124, in forward
x_siamese = self.gc_w2v[n_l](x_siamese, adj)
File "//users10/yhwu/miniconda/envs/match_article/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/users10/yhwu/Project/match_article/src/models/CCIG/models/layers.py", line 60, in forward
output = SparseMM()(adj, support)
File "//users10/yhwu/miniconda/envs/match_article/lib/python3.8/site-packages/torch/autograd/function.py", line 159, in __call__
raise RuntimeError(
RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
Searching online, I found that this problem may be due to the PyTorch version (for PyTorch versions above 1.3, the forward method of a torch.autograd.Function must be static). So I tried to solve it by changing the class "SparseMM" from non-static to static. My modified code is as follows:
class SparseMM(torch.autograd.Function):
    """
    Sparse x dense matrix multiplication with autograd support.
    Implementation by Soumith Chintala:
    https://discuss.pytorch.org/t/
    does-pytorch-support-autograd-on-sparse-matrix/6156/7
    """
    @staticmethod
    def forward(ctx, matrix1, matrix2):
        ctx.save_for_backward(matrix1, matrix2)
        return torch.mm(matrix1, matrix2)
    @staticmethod
    def backward(ctx, grad_output):
        matrix1, matrix2 = ctx.saved_tensors
        grad_matrix1 = grad_matrix2 = None
        if ctx.needs_input_grad[0]:
            grad_matrix1 = torch.mm(grad_output, matrix2.t())
        if ctx.needs_input_grad[1]:
            grad_matrix2 = torch.mm(matrix1.t(), grad_output)
        return grad_matrix1, grad_matrix2
However, this did not solve the problem, and the error message is unchanged.
I don't know what to do next (other than changing the PyTorch version). I hope the author or others can help. Thanks a lot!
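In the PyTorch versions that raise this error, the message is triggered by instantiating and calling the Function object, as in `SparseMM()(adj, support)` at layers.py line 60 in the traceback, and not only by a non-static `forward`. New-style functions are invoked through the class method `apply`. A minimal sketch, assuming that call site is the only other place that needs to change; the small dense tensors are illustrative stand-ins for `adj` and `support`:

```python
import torch


class SparseMM(torch.autograd.Function):
    """Matrix multiplication with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, matrix1, matrix2):
        ctx.save_for_backward(matrix1, matrix2)
        return torch.mm(matrix1, matrix2)

    @staticmethod
    def backward(ctx, grad_output):
        matrix1, matrix2 = ctx.saved_tensors
        grad_matrix1 = grad_matrix2 = None
        if ctx.needs_input_grad[0]:
            grad_matrix1 = torch.mm(grad_output, matrix2.t())
        if ctx.needs_input_grad[1]:
            grad_matrix2 = torch.mm(matrix1.t(), grad_output)
        return grad_matrix1, grad_matrix2


# Illustrative inputs (dense here for simplicity); in the repository these
# would be the adjacency matrix and the support matrix.
adj = torch.eye(3)
support = torch.randn(3, 4, requires_grad=True)

# Call through .apply instead of SparseMM()(adj, support).
output = SparseMM.apply(adj, support)
output.sum().backward()
```

With this change the class is never instantiated, so the legacy code path that raises the RuntimeError is not reached.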
def load_IDF(data):
    if data == "event_story":
        datafile = "../../../../data/raw/event-story-cluster/event_story_cluster.txt"
        contentfile = "../../../../data/processed/event-story-cluster/content.txt"
        idffile = "../../../../data/processed/event-story-cluster/IDF.txt"
I get errors for these three .txt files: no such file or directory.
As mentioned before in #2, installing "graph_tool" can be very troublesome.
Here is how I installed "graph_tool" on Ubuntu 16.04, in the hope that it helps those still using Ubuntu 16.04 as a server.
The official installation guide for Debian and Ubuntu only lists Ubuntu 18.04 (bionic), Ubuntu 18.10 (cosmic) and Ubuntu 19.04 (disco) below the instructions.
I opened the source URL https://downloads.skewed.de/apt/ in my browser and found that there is still a folder "xenial", which is for Ubuntu 16.04.
So I replaced DISTRIBUTION with xenial in the following lines and added them to /etc/apt/sources.list.
deb http://downloads.skewed.de/apt/DISTRIBUTION DISTRIBUTION universe
deb-src http://downloads.skewed.de/apt/DISTRIBUTION DISTRIBUTION universe
Then I followed the official guide and finally installed it successfully.
Note: the code may raise an error on import cairo, so graph drawing will not work. Everything else works normally as long as it is not used for visualization.
begin loading DATA............../../../data/processed/event-story-cluster/same_event_doc_pair.cd.json
Traceback (most recent call last):
File "main.py", line 102, in
labels, idx_train, idx_val, idx_test = load_graph_data(path, word_to_ix, MAX_LEN, args.num_data)
File "/home/wting/Documents/code/ArticlePairMatching-master/src/models/CCIG/loader.py", line 108, in load_graph_data
sent_idx = right_pad_zeros_1d([word_to_ix[w.decode("utf-8")] for w in val], max_len)
File "/home/wting/Documents/code/ArticlePairMatching-master/src/models/CCIG/loader.py", line 108, in
sent_idx = right_pad_zeros_1d([word_to_ix[w.decode("utf-8")] for w in val], max_len)
KeyError: ')'
device is cpu
I hit this KeyError and found that the JSON contains "v_texts_mat": [[") 之所以 说 圆满 , 是 因为 这 是 各方 都 能 接受 的 方案 笔者 认为 , 这 也 算 另外 一 种 意义 的 混 改 吧", ""], ["对于 王石 为首 的 万科 来说 ,"
I wonder why the JSON generation failed here. Thank you for any explanation.
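The KeyError means that the token `)` is not a key of `word_to_ix`, so the direct lookup at loader.py line 108 fails on the first out-of-vocabulary token. A minimal sketch of a tolerant lookup, assuming that mapping unknown tokens to a single fallback index is acceptable for the model; `tokens_to_indices`, `find_unknown_tokens` and the choice of `unk_index=0` are hypothetical and not taken from the repository:

```python
def tokens_to_indices(tokens, word_to_ix, unk_index=0):
    """Look up each token, falling back to unk_index for unknown tokens."""
    return [word_to_ix.get(tok, unk_index) for tok in tokens]


def find_unknown_tokens(tokens, word_to_ix):
    """Return the distinct tokens that are missing from the vocabulary."""
    return sorted({tok for tok in tokens if tok not in word_to_ix})


# Illustrative vocabulary and sentence; the real ones come from the
# embedding file and the processed JSON.
word_to_ix = {"万科": 1, "王石": 2}
tokens = [")", "王石", "万科"]

indices = tokens_to_indices(tokens, word_to_ix)
unknown = find_unknown_tokens(tokens, word_to_ix)
```

Listing the unknown tokens first is useful for diagnosis: a handful of stray punctuation marks points to tokenization noise, while a very large list suggests that the embedding vocabulary did not load as expected.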
I get this error.
WARNING **: 01:45:59.539: Failed to load shared library 'libgdk-3.so.0' referenced by the typelib: libXcursor.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "feature_extractor.py", line 8, in
from graph_tool.all import *
File "/opt/conda/lib/python3.6/site-packages/graph_tool/all.py", line 34, in
from graph_tool.draw import *
File "/opt/conda/lib/python3.6/site-packages/graph_tool/draw/init.py", line 835, in
from . cairo_draw import graph_draw, cairo_draw,
File "/opt/conda/lib/python3.6/site-packages/graph_tool/draw/cairo_draw.py", line 1496, in
from gi.repository import Gtk, Gdk, GdkPixbuf
File "", line 971, in _find_and_load
File "", line 955, in _find_and_load_unlocked
File "", line 656, in _load_unlocked
File "", line 626, in _load_backward_compatible
File "/opt/conda/lib/python3.6/site-packages/gi/importer.py", line 144, in load_module
importlib.import_module('gi.repository.' + dep.split("-")[0])
File "/opt/conda/lib/python3.6/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 955, in _find_and_load_unlocked
File "", line 656, in _load_unlocked
File "", line 626, in _load_backward_compatible
File "/opt/conda/lib/python3.6/site-packages/gi/importer.py", line 145, in load_module
dynamic_module = load_overrides(introspection_module)
File "/opt/conda/lib/python3.6/site-packages/gi/overrides/init.py", line 118, in load_overrides
override_mod = importlib.import_module(override_package_name)
File "/opt/conda/lib/python3.6/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/opt/conda/lib/python3.6/site-packages/gi/overrides/Gdk.py", line 83, in
Color = override(Color)
File "/opt/conda/lib/python3.6/site-packages/gi/overrides/init.py", line 195, in override
assert g_type != TYPE_NONE
AssertionError
Is there similar work in English?
Is this dataset available for English as well?
Dr. Liu, thanks for your great work!
When I tried to apply it in an application, I found it quite slow and not suitable for computing the similarity of massive numbers of sentence pairs (it takes around 6 hours to process 30k sentence pairs with use_cd=False on an i7-8700K CPU).
I would appreciate it if you could suggest a way to speed up this process.
Hi! I love your work! I want to use it on another dataset, but I don't know how you obtained the NER words. Could you tell me? Thanks!
The paper says: "Given a document D, we first extract the named entities and keywords by TextRank".
Did you use TextRank to get the keywords and then check whether each keyword appears in your own word-type dictionary?
For example, 广东 appears in your dictionary as "广东-site". So when "广东" comes out as a keyword, do you look it up and find that it is a "site"? Or do you use other NER tools? I need a good and fast NER method. Could you tell me about that?
Thanks for your time.
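The dictionary lookup described in the question above can be sketched as follows. This only illustrates the approach the question proposes; the source does not confirm that the authors obtained their NER keywords this way, and `tag_keywords` and the example dictionary are hypothetical:

```python
def tag_keywords(keywords, type_dict):
    """Split keywords into typed entities and plain keywords.

    type_dict maps a word to its type, for example {"广东": "site"}.
    """
    entities = []
    plain = []
    for kw in keywords:
        if kw in type_dict:
            entities.append((kw, type_dict[kw]))
        else:
            plain.append(kw)
    return entities, plain


# Illustrative input: keywords as they might come out of TextRank.
type_dict = {"广东": "site"}
entities, plain = tag_keywords(["广东", "台风"], type_dict)
```

A plain dictionary lookup is fast, since each keyword costs one hash lookup, but it can only recognize entities that are already in the dictionary.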
ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/ArticlePairMatching/src/models/CCIG$ python3.6 main.py --data_type "event" --use_gfeatures
device is cpu
begin loading W2V............
Company
W2V loaded!
Vocab size: 2, Embedding size: 200
Namespace(adjacent='tfidf', beta1=0.8, beta2=0.999, betweenness_threshold_coef=1.0, combine_type='separate', data_type='event', dropout_siamese=0.1, dropout_vfeat=0.1, ema_decay=0.9999, epochs=10, gcn_type='valina', gfeatures_type='features', hidden_final=16, hidden_siamese=128, hidden_vfeat=16, inputdata='event-story-cluster/same_event_doc_pair.no_cd.json', lr=0.001, lr_warm_up_num=1000, max_c_size=6, max_grad_norm=5.0, min_c_size=2, no_cuda=False, no_grad_clip=False, num_data=1000000000, num_gcn_layers=2, outputresult='event-story-cluster/same_event_doc_pair.no_cd.result.txt', pool_type='mean', seed=42, use_cd=False, use_ema=False, use_gcn=False, use_gfeatures=True, use_siamese=False, use_vfeatures=False, vertice='pagerank')
begin loading DATA............../../../data/processed/event-story-cluster/same_event_doc_pair.no_cd.json
Traceback (most recent call last):
File "main.py", line 102, in
labels, idx_train, idx_val, idx_test = load_graph_data(path, word_to_ix, MAX_LEN, args.num_data)
File "/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/ArticlePairMatching/src/models/CCIG/loader.py", line 81, in load_graph_data
fin = open(path, "r", encoding="utf-8")
FileNotFoundError: [Errno 2] No such file or directory: '../../../data/processed/event-story-cluster/same_event_doc_pair.no_cd.json'
ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/ArticlePairMatching/src/models/CCIG$
Could you tell me which versions of graph-tool and torch you used?
Hi, BangLiu,
I fine-tuned BERT: I take the first 256 words of each of the two long documents in a pair, feed them to BERT, and feed the output to a classification layer.
My batch_size is 6 with 2 epochs, but the training result is accuracy = 0.5519683, global_step = 517, loss = 0.72958165, precision = 0.49228394, recall = 0.2029262, which differs greatly from the paper's results.
So I would like to know how you used BERT for text matching.
I have changed this
pos = sfdp_layout(g)
graph_draw(g, pos=pos,
vertex_text=g.vertex_properties["name"],
vertex_fill_color=c,
vertex_font_family="STKaiti",
vertex_font_size=18,
edge_font_family="STKaiti",
edge_font_size=10,
edge_text=g.edge_properties["name"],
output_size=(1000, 1000),
output=fig_name)
to this
pos = gt.sfdp_layout(g)
gt.graph_draw(g, pos=pos,
vertex_text=g.vertex_properties["name"],
vertex_fill_color=c,
vertex_font_family="STKaiti",
vertex_font_size=18,
edge_font_family="STKaiti",
edge_font_size=10,
edge_text=g.edge_properties["name"],
output_size=(1000, 1000),
output=fig_name)
Even though I have imported everything from graph_tool, I get this error. I also imported graph_tool as gt.
Kindly help me with this.
Traceback (most recent call last):
File "main.py", line 102, in
labels, idx_train, idx_val, idx_test = load_graph_data(path, word_to_ix, MAX_LEN, args.num_data)
File "/content/ArticlePairMatching/src/models/CCIG/loader.py", line 81, in load_graph_data
fin = open(path, "r", encoding="utf-8")
FileNotFoundError: [Errno 2] No such file or directory: '../../../data/processed/event-story-cluster/same_event_doc_pair.no_cd.json'
regressor.0.weight : torch.Size([16, 0]) 0
regressor.0.bias : torch.Size([16]) 16
regressor.2.weight : torch.Size([1, 16]) 16
File "H:\EXPERIMENTS\ArticlePairMatching-master\src\models\CCIG\models\se_gcn.py", line 151, in forward
out = self.regressor(x)
File "C:\Users\ediso\anaconda3\envs\articlepair\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\ediso\anaconda3\envs\articlepair\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "C:\Users\ediso\anaconda3\envs\articlepair\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\ediso\anaconda3\envs\articlepair\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Users\ediso\anaconda3\envs\articlepair\lib\site-packages\torch\nn\functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x5 and 0x16)
As proposed in your paper, you generate local matching vectors for each concept. But what if a concept only contains sentences from one document? Do you just ignore such concepts?