
jiachengli1995 / uctopic

42 stars, 2 watchers, 3 forks, 3.68 MB

An easy-to-use tool for phrase encoding and topic mining (unsupervised aspect extraction); Code base for ACL 2022 paper, UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.

License: MIT License

Shell 0.51% Python 99.49%
topic-modeling contrastive-learning phrase-embeddings aspect-extraction

uctopic's People

Contributors

jiachengli1995


uctopic's Issues

About Entity Clustering Labels

First, thank you for presenting this nice paper.

While reading it, I had the following question.

When you evaluate entity clustering, how did you construct the label for each entity?

  1. Do the datasets already provide a label for each entity,
  2. did you use pseudo labels as the labels, or
  3. were the entity types used as the labels?

Thank you!

Question about model input at training and inference time.

Let's take the example "Allie drove to Boston for a meeting."

When I pretrain UCTopic, the model takes input_ids of [0, 50264, 324, 4024, 7, 2278, 13, 10, 529, 4, 2] ("Allie" may be left unchanged with some probability) and entity_ids of [2] (the mask token of the entity embedding).

Then the model computes the contrastive losses using the hidden state of the entity_ids token [2].

However, LUKE's entity embedding does not cover all entities. Moreover, it only contains information about entities, not about general noun phrases.

  1. Therefore, I suspect that this hidden state is weak when an unseen entity or a general noun phrase is given as input. Is that right? ("Allie" also does not appear in LUKE's entity vocabulary.)

However, when I analyzed your code, I noticed that the model always takes entity_ids of [2] at inference time (clustering or topic mining) as well as during training.

  1. So, just as BERT's cls token represents all tokens in a sentence, does the token [2] (the mask token) represent the entity tokens in input_ids?
  2. Also, since the model only uses the mask token from the entity vocabulary, can it handle unseen entities or general noun phrases (so that the first question is not a concern)?

Thank you.
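
For reference, a minimal inspection sketch (assuming the released JiachengLi/uctopic-base tokenizer) that shows what the model actually receives as entity input for this example; the character span for "Allie" is chosen here only for illustration:

from uctopic import UCTopicTokenizer

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')

text = "Allie drove to Boston for a meeting."
entity_spans = [(0, 5)]  # character-based span of "Allie"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
print(inputs["input_ids"])   # word-level token ids
print(inputs["entity_ids"])  # per the discussion above, expected to contain the entity mask id [2]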

Device error occurs when the model is moved to GPU

I just tried the example code:

from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)

It works well on CPU (which is the default). But when I try to move the model to the GPU (I want to use this model as an encoder in a contrastive model that will be trained on GPU):

model = model.to("cuda:0")

Then the model can no longer encode the previous example; the error is as follows:

outputs, phrase_repr = model(**inputs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Maybe this is caused by the LUKE model (part of the error traceback):

D:\Anaconda\envs\pytorch\lib\site-packages\transformers\models\luke\modeling_luke.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    915 
    916         # First, compute word embeddings
--> 917         word_embedding_output = self.embeddings(
    918             input_ids=input_ids,
    919             position_ids=position_ids,

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

D:\Anaconda\envs\pytorch\lib\site-packages\transformers\models\luke\modeling_luke.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    248 
    249         if inputs_embeds is None:
--> 250             inputs_embeds = self.word_embeddings(input_ids)
    251 
    252         position_embeddings = self.position_embeddings(position_ids)

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
    156 
    157     def forward(self, input: Tensor) -> Tensor:
--> 158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
    160             self.norm_type, self.scale_grad_by_freq, self.sparse)

D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2042         # remove once script supports set_grad_enabled
   2043         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2044     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
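
A minimal sketch of a possible workaround, assuming the mismatch is simply that the tokenized inputs stay on the CPU while the model sits on cuda:0: move the inputs onto the model's device before the forward pass (the BatchEncoding returned by the tokenizer supports .to(device)).

import torch
from uctopic import UCTopicTokenizer, UCTopic

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base').to(device)

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)]  # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
inputs = inputs.to(device)  # move input_ids, entity_ids, attention masks, etc. onto the same device

outputs, phrase_repr = model(**inputs)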
  

About using all data when fine-tuning

While reading your paper, I had the following question.

In the Finetuning Setup of Section 4.2 (Entity Clustering), there is the sentence: "Because UCTOPIC is an unsupervised method, we use all data to finetune and evaluate."

Even though UCTOPIC is an unsupervised model, is it inappropriate to use the training + test (validation) data for training?
Do other or related papers use the same setup?

Thank you!

Topical Phrase Mining Dataset

In Section 4.3 (Topical Phrase Mining), spaCy was used for dataset construction.

Could you provide the processed datasets (Gest, KP20k, KPTimes) with the annotated phrases?

Thank you.
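
For context, a minimal sketch (not necessarily the authors' exact pipeline) of how candidate phrases and their character spans can be extracted with spaCy noun chunks, as Section 4.3 describes; the en_core_web_sm model and the example sentence are assumptions for illustration:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with a parser works
doc = nlp("The restaurant serves great sushi and fresh sashimi.")

# Noun chunks yield candidate phrases with character offsets,
# which matches the span format used elsewhere in this repo.
phrases = [(chunk.text, chunk.start_char, chunk.end_char) for chunk in doc.noun_chunks]
print(phrases)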

Errors while running topic_mining

Hello, thanks for making this repo available. I'm trying the topic_mining example from the repo's Overview section. I haven't made any code changes on my end, yet the code runs into errors. Here is a Colab notebook reproducing it. I understand that n_clusters is set to [15, 25] in the example, which is more than the number of sentences (5); however, even after lowering it to n_clusters=[2, 3], the code still throws an error. Could you please take a look? Thanks!

Here is the error log without changing anything in the example:

Normalize phrases: 100%|██████████| 5/5 [00:00<00:00, 97.04it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-41ec3e394e7f> in <module>
     13 # len(sentences) is equal to len(spans)
     14 output_data, topic_phrase_dict = topic_tool.topic_mining(sentences, spans, \
---> 15                                                    n_clusters=[15, 25])

2 frames
/usr/local/lib/python3.7/dist-packages/uctopic/kmeans.py in fit_predict(self, X, centroids, verbose)
    151         start_time = time()
    152         if centroids is None:
--> 153             self.centroids = X[np.random.choice(batch_size, size=[self.n_clusters], replace=False)]
    154         else:
    155             self.centroids = centroids

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
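
For reference, the ValueError itself comes from numpy: np.random.choice with replace=False cannot draw more samples than the population, which is what kmeans.py hits when a value in n_clusters exceeds the number of phrase vectors being clustered. A minimal sketch reproducing the constraint (whether the same limit also explains the failure with n_clusters=[2, 3] depends on how many phrase instances actually reach the clustering step):

import numpy as np

population = 5  # e.g. only 5 phrase vectors available
print(np.random.choice(population, size=[3], replace=False))   # works: 3 <= 5
print(np.random.choice(population, size=[15], replace=False))  # ValueError, as in the log above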
