
jiachengli1995 / uctopic

42 stars, 2 watchers, 3 forks, 3.68 MB

An easy-to-use tool for phrase encoding and topic mining (unsupervised aspect extraction); Code base for ACL 2022 paper, UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.

License: MIT License

Shell 0.51% Python 99.49%
topic-modeling contrastive-learning phrase-embeddings aspect-extraction

uctopic's People

Contributors

jiachengli1995


uctopic's Issues

About Entity Clustering Labels

First, thank you for presenting this nice paper.

While reading it, I had the following question.

When you evaluate entity clustering, how did you construct the label for each entity?

  1. Do the datasets already provide a label for each entity,
  2. did you use pseudo labels as the labels, or
  3. were the entity types used as the labels?

Thank you!

Question about model input at training and inference time.

Let's take the example "Allie drove to Boston for a meeting."

When I pretrain UCTopic, the model takes input_ids of [0, 50264, 324, 4024, 7, 2278, 13, 10, 529, 4, 2] ("Allie" may be left unchanged with some probability) and entity_ids of [2] (the mask token of the entity embedding).

Then the model computes the contrastive losses using the hidden state of the entity_ids token [2].

However, LUKE's entity embedding does not cover all entities. Moreover, it only contains information about entities, not about general noun phrases.

  1. Therefore, I suspect that this hidden state is weak when an unseen entity or a general noun phrase is given as input. Is that right? ("Allie" also does not appear in LUKE's entity vocabulary.)

However, when I analyzed your code, I noticed that the model always takes entity_ids of [2] at inference time (clustering or topic mining) as well as during training.

  1. So, just as BERT's cls token represents all tokens in a sentence, does the token [2] (the mask token) represent the entity tokens in input_ids?
  2. Also, since the model only uses the mask token from the entity vocabulary, can it handle unseen entities or general noun phrases (so that the first question is not a concern)?

Thank you.
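
For reference, a minimal inspection sketch (assuming the released JiachengLi/uctopic-base tokenizer) that shows what the model actually receives as entity input for this example; the character span for "Allie" is chosen here only for illustration:

from uctopic import UCTopicTokenizer

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')

text = "Allie drove to Boston for a meeting."
entity_spans = [(0, 5)]  # character-based span of "Allie"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
print(inputs["input_ids"])   # word-level token ids
print(inputs["entity_ids"])  # per the discussion above, expected to contain the entity mask id [2]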

Device error occurs when the model is moved to GPU

I just tried the example code:

from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)

It works well on CPU (which is the default). But when I try to move the model to the GPU (I want to use this model as an encoder in a contrastive model that will be trained on GPU):

model = model.to("cuda:0")

Then the model can no longer encode the previous example; the error is as follows:

outputs, phrase_repr = model(**inputs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Maybe this is caused by the LUKE model (part of the error traceback):

D:\Anaconda\envs\pytorch\lib\site-packages\transformers\models\luke\modeling_luke.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    915 
    916         # First, compute word embeddings
--> 917         word_embedding_output = self.embeddings(
    918             input_ids=input_ids,
    919             position_ids=position_ids,

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

D:\Anaconda\envs\pytorch\lib\site-packages\transformers\models\luke\modeling_luke.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    248 
    249         if inputs_embeds is None:
--> 250             inputs_embeds = self.word_embeddings(input_ids)
    251 
    252         position_embeddings = self.position_embeddings(position_ids)

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
    156 
    157     def forward(self, input: Tensor) -> Tensor:
--> 158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
    160             self.norm_type, self.scale_grad_by_freq, self.sparse)

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2042         # remove once script supports set_grad_enabled
   2043         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2044     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
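
A minimal sketch of a possible workaround, assuming the mismatch is simply that the tokenized inputs stay on the CPU while the model sits on cuda:0: move the inputs onto the model's device before the forward pass (the BatchEncoding returned by the tokenizer supports .to(device)).

import torch
from uctopic import UCTopicTokenizer, UCTopic

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base').to(device)

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)]  # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
inputs = inputs.to(device)  # move input_ids, entity_ids, attention masks, etc. onto the same device

outputs, phrase_repr = model(**inputs)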
  

About using all data when fine-tuning

While reading your paper, I had the following question.

In the Finetuning Setup of Section 4.2 (Entity Clustering), there is the sentence: "Because UCTOPIC is an unsupervised method, we use all data to finetune and evaluate."

Even though UCTOPIC is an unsupervised model, is it inappropriate to use the training + test (validation) data for training?
Do other or related papers use the same setup?

Thank you!

Topical Phrase Mining Dataset

In Section 4.3 (Topical Phrase Mining), spaCy was used for dataset construction.

Could you provide the processed datasets (Gest, KP20k, KPTimes) with the annotated phrases?

Thank you.
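
For context, a minimal sketch (not necessarily the authors' exact pipeline) of how candidate phrases and their character spans can be extracted with spaCy noun chunks, as Section 4.3 describes; the en_core_web_sm model and the example sentence are assumptions for illustration:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with a parser works
doc = nlp("The restaurant serves great sushi and fresh sashimi.")

# Noun chunks yield candidate phrases with character offsets,
# which matches the span format used elsewhere in this repo.
phrases = [(chunk.text, chunk.start_char, chunk.end_char) for chunk in doc.noun_chunks]
print(phrases)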

Errors while running topic_mining

Hello, thanks for making this repo available. I'm trying the topic_mining example from the repo's Overview section. I haven't made any code changes on my end, yet the code runs into errors. Here is a Colab notebook reproducing it. I understand that n_clusters is set to [15, 25] in the example, which is more than the number of sentences (5); however, even after lowering it to n_clusters=[2, 3], the code still throws an error. Could you please take a look? Thanks!

Here is the error log without changing anything in the example:

Normalize phrases: 100%|██████████| 5/5 [00:00<00:00, 97.04it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-41ec3e394e7f> in <module>
     13 # len(sentences) is equal to len(spans)
     14 output_data, topic_phrase_dict = topic_tool.topic_mining(sentences, spans, \
---> 15                                                    n_clusters=[15, 25])

2 frames
/usr/local/lib/python3.7/dist-packages/uctopic/kmeans.py in fit_predict(self, X, centroids, verbose)
    151         start_time = time()
    152         if centroids is None:
--> 153             self.centroids = X[np.random.choice(batch_size, size=[self.n_clusters], replace=False)]
    154         else:
    155             self.centroids = centroids

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
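
For reference, the ValueError itself comes from numpy: np.random.choice with replace=False cannot draw more samples than the population, which is what kmeans.py hits when a value in n_clusters exceeds the number of phrase vectors being clustered. A minimal sketch reproducing the constraint (whether the same limit also explains the failure with n_clusters=[2, 3] depends on how many phrase instances actually reach the clustering step):

import numpy as np

population = 5  # e.g. only 5 phrase vectors available
print(np.random.choice(population, size=[3], replace=False))   # works: 3 <= 5
print(np.random.choice(population, size=[15], replace=False))  # ValueError, as in the log above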
