autophrasex's Introduction

ZhouYang Luo

Deep Learning and NLP enthusiast.

autophrasex's People

Contributors

jianfengzhai, luozhouyang

autophrasex's Issues

IndexError: index 1 is out of bounds for axis 0 with size 1

def _predict_proba(self, phrases):
    features = [self._compose_feature(phrase) for phrase in phrases]
    pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]  # IndexError is raised at prob[1] on this line
    pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
    return pairs
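
A likely cause, assuming `self.classifier` is a scikit-learn style estimator: `predict_proba` returns one column per class seen during `fit`, in the order of `classes_`, so a model trained on a single label yields rows of length 1 and `prob[1]` fails. The sketch below is not the library's code; `positive_probs` is a hypothetical helper that finds the positive column through the class list instead of hard-coding index 1.

```python
def positive_probs(proba_rows, classes, positive_label=1):
    """Return P(positive_label) for each row of a predict_proba-style result.

    proba_rows: sequence of rows, one probability per entry in `classes`.
    classes: labels in column order (what scikit-learn exposes as `classes_`).
    """
    classes = list(classes)
    if positive_label not in classes:
        # The model never saw a positive example, so it cannot assign one.
        return [0.0 for _ in proba_rows]
    column = classes.index(positive_label)
    return [float(row[column]) for row in proba_rows]


# Two classes seen during training: column 1 holds the positive probability.
print(positive_probs([[0.8, 0.2], [0.1, 0.9]], classes=[0, 1]))
# Only the negative class seen: rows have a single column, and no IndexError occurs.
print(positive_probs([[1.0], [1.0]], classes=[0]))
```

Returning 0.0 when the positive class is absent is one possible policy; raising an error that points at the empty positive pool would be just as defensible.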

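Import error; also, could you provide the data used by the example code?

My environment is Windows 10, Core i7, Python 3.

This import statement fails: from autophrasex import AutoPhrase, BaiduLacTokenizer, Strategy

Could the LAC module be replaced with another package that offers the same functionality? Users in the Baidu LAC repository have also reported import errors.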

...autophrasex_demo.py", line 11, in <module>
    from autophrasex import AutoPhrase, BaiduLacTokenizer, Strategy
  File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\__init__.py", line 3, in <module>
    from .autophrase import AutoPhrase
  File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\autophrase.py", line 8, in <module>
    from .strategy import AbstractStrategy
  File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\strategy.py", line 6, in <module>
    from LAC import LAC
  File "D:\ProgramData\Anaconda3\lib\site-packages\LAC\__init__.py", line 23, in <module>
    from .lac import LAC
  File "D:\ProgramData\Anaconda3\lib\site-packages\LAC\lac.py", line 28, in <module>
    import paddle.fluid as fluid
  File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\__init__.py", line 29, in <module>
    from .fluid import monkey_patch_variable
  File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\__init__.py", line 35, in <module>
    from . import framework
  File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 34, in <module>
    from .proto import framework_pb2
  File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\proto\framework_pb2.py", line 11, in <module>
    from google.protobuf import descriptor_pb2
  File "D:\ProgramData\Anaconda3\lib\site-packages\google\protobuf\descriptor_pb2.py", line 1840, in <module>
    __module__ = 'google.protobuf.descriptor_pb2'
TypeError: expected bytes, Descriptor found
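
The traceback shows `autophrasex/strategy.py` importing `LAC` at module load, so a broken `paddle`/`protobuf` install makes every `from autophrasex import ...` fail, even for users who never touch `BaiduLacTokenizer`. One possible mitigation, sketched here and not taken from the library, is to defer the import until the tokenizer is actually used. `LazyBackend` is a hypothetical name.

```python
import importlib


class LazyBackend:
    """Import an optional dependency on first use instead of at package import."""

    def __init__(self, module_name):
        self.module_name = module_name
        self._module = None  # nothing is imported yet

    def load(self):
        if self._module is None:
            try:
                self._module = importlib.import_module(self.module_name)
            except Exception as exc:  # a broken install can raise more than ImportError
                raise ImportError(
                    f"optional dependency {self.module_name!r} could not be imported: {exc}"
                ) from exc
        return self._module


# Creating the wrapper never fails, even if the dependency is missing or broken.
backend = LazyBackend("json")  # stand-in module so the sketch runs anywhere
print(backend.load().dumps({"ok": True}))
```

With this pattern a tokenizer backed by another segmenter could be used without LAC being importable at all.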

IndexError: index 1 is out of bounds for axis 0 with size 1

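Hello, what format should the input file have? I hit the bug below when running. My guess is that a problem with the input features caused the output, which should be two-dimensional, to end up one-dimensional. My input file has one text per line with no spaces, for example: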

我是一个人。
哈哈哈哈哈。
那里有个苹果。
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-097136691c2f> in <module>
      6         LoggingCallback(),
      7         ConstantThresholdScheduler(),
----> 8         EarlyStopping(patience=2, min_delta=3)
      9     ])

/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in mine(self, corpus_files, quality_phrase_files, N, epochs, callbacks, topk, filter_fn, **kwargs)
    122 
    123             callback.on_epoch_reorganize_phrase_pools_begin(epoch, pos_pool, neg_pool)
--> 124             pos_pool, neg_pool = self._reorganize_phrase_pools(pos_pool, neg_pool, **kwargs)
    125             callback.on_epoch_reorganize_phrase_pools_end(epoch, pos_pool, neg_pool)
    126 

/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in _reorganize_phrase_pools(self, pos_pool, neg_pool, **kwargs)
    157         new_pos_pool.extend(deepcopy(pos_pool))
    158 
--> 159         pairs = self._predict_proba(neg_pool)
    160         pairs = sorted(pairs, key=lambda x: x[1], reverse=True)
    161         # print(pairs[:10])

/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in _predict_proba(self, phrases)
    184     def _predict_proba(self, phrases):
    185         features = [self._compose_feature(phrase) for phrase in phrases]
--> 186         pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]
    187         pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
    188         return pairs

/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in <listcomp>(.0)
    184     def _predict_proba(self, phrases):
    185         features = [self._compose_feature(phrase) for phrase in phrases]
--> 186         pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]
    187         pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
    188         return pairs

IndexError: index 1 is out of bounds for axis 0 with size 1
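
The last frame indexes `prob[1]` on a row of length 1. In scikit-learn style classifiers (an assumption about `self.classifier`) that happens when only one label was present at training time, which is plausible when the corpus is only a few short sentences. A minimal guard, with the hypothetical name `check_training_labels`, that fails before training with a clearer message:

```python
from collections import Counter


def check_training_labels(labels, required=(0, 1)):
    """Raise ValueError unless every required label appears at least once."""
    counts = Counter(labels)
    missing = [label for label in required if counts[label] == 0]
    if missing:
        raise ValueError(
            f"training data has no examples for label(s) {missing}; "
            "the positive or negative phrase pool is probably empty"
        )
    return counts


# Labels as they might be passed to classifier.fit(x, y).
print(check_training_labels([1, 0, 0, 1, 0]))
```

Calling it on the label list right before `fit` would turn the late IndexError into an early, descriptive error.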

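Extraction quality on English falls short of the original paper

Hello, I ran your code on an English dataset, but the results are not good: the highest score is only a little over 0.3. I have already replaced wiki_quality.txt with an English version and switched the tokenizer to a spaCy implementation, and I cannot tell where the problem is. Have you tested on an English dataset, and is there corresponding code you could share? Thank you!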
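
One thing that may be worth checking for English, based only on the `pmi_of` frame quoted in the ZeroDivisionError report on this page (an older version of the code, so it may no longer apply): frequencies are looked up under `''.join(ngram.split(' '))`. Dropping the separator is harmless for Chinese but changes English phrases, so lookups could miss if the counts were stored with spaces. The snippet only illustrates the string behaviour; the frequency table is hypothetical and nothing here reproduces the library.

```python
def lookup_key(ngram):
    """Key construction as quoted in the pmi_of frame: tokens joined without spaces."""
    return ''.join(ngram.split(' '))


# Hypothetical frequency table keyed by the surface form of each phrase.
freq = {"知识图谱": 12, "machine learning": 30}

for ngram in ["知识 图谱", "machine learning"]:
    key = lookup_key(ngram)
    print(repr(ngram), "->", repr(key), "count:", freq.get(key, 0))
```

If the installed version behaves this way, the English phrase would be looked up under a key with the space removed and its count would come back as 0.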

ZeroDivisionError: float division by zero

2021-04-16 11:12:39,442    INFO        autophrase.py   33] Load quality phrases finished. There are 10386 quality phrases in total.
2021-04-16 11:12:39,937    INFO        autophrase.py   36] Selected 1000 frequent phrases.
2021-04-16 11:12:39,938    INFO        autophrase.py   39] Size of initial positive pool: 118
2021-04-16 11:12:39,939    INFO        autophrase.py   40] Size of initial negative pool: 782
2021-04-16 11:12:39,940    INFO        autophrase.py   46] Starting to train model at epoch 1 ...
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-28-7f2482f60f66> in <module>
      4     strategy=strategy,
      5     N=4,
----> 6     epochs=10)
      7 
      8 for pred in predictions:

~/.pylib/Lib/site-packages/autophrasex/autophrase.py in mine(self, input_doc_files, quality_phrase_files, strategy, N, **kwargs)
     45         for epoch in range(kwargs.pop('epochs', 5)):
     46             logging.info('Starting to train model at epoch %d ...', epoch + 1)
---> 47             x, y = strategy.compose_training_data(pos_pool, neg_pool, **kwargs)
     48             self.classifier.fit(x, y)
     49             logging.info('Finished to train model at epoch %d', epoch + 1)

~/.pylib/Lib/site-packages/autophrasex/strategy.py in compose_training_data(self, pos_pool, neg_pool, **kwargs)
    155         for p in pos_pool:
    156             p = ' '.join(self.tokenizer.tokenize(p))
--> 157             examples.append((self.build_input_features(p), 1))
    158         for p in neg_pool:
    159             p = ' '.join(self.tokenizer.tokenize(p))

~/.pylib/Lib/site-packages/autophrasex/strategy.py in build_input_features(self, phrase, **kwargs)
    210         doc_freq = self.idf_callback.doc_freq_of(phrase)
    211         idf = self.idf_callback.idf_of(phrase)
--> 212         pmi = self.ngrams_callback.pmi_of(phrase)
    213         left_entropy = self.entropy_callback.left_entropy_of(phrase)
    214         right_entropy = self.entropy_callback.right_entropy_of(phrase)

~/.pylib/Lib/site-packages/autophrasex/callbacks.py in pmi_of(self, ngram)
     79         ngram_total_occur = sum(self.ngrams_freq[n].values())
     80         freq = self.ngrams_freq[n].get(''.join(ngram.split(' ')), 0)
---> 81         return self._pmi_of(ngram, n, freq, unigram_total_occur, ngram_total_occur)
     82 
     83     def pmi(self):

~/.pylib/Lib/site-packages/autophrasex/callbacks.py in _pmi_of(self, ngram, n, freq, unigram_total_occur, ngram_total_occur)
     61         indep_prob = reduce(
     62             mul, [self.ngrams_freq[1][unigram] for unigram in ngram.split(' ')]) / (unigram_total_occur ** n)
---> 63         pmi = math.log((joint_prob + self.epsilon) / (indep_prob + self.epsilon), 2)
     64         return pmi
     65 

ZeroDivisionError: float division by zero

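After debugging, I found that this is caused by epsilon defaulting to 0 in the callback. I suggest changing the default to a very small value, or passing something like epsilon=1e-9 explicitly in the getting-started example, which would be friendlier to users.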
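
The effect of the smoothing term can be reproduced in isolation. The function below mirrors the expression in the `_pmi_of` frame above, but it is a standalone sketch: the signature and the `1e-9` default are assumptions for illustration, not the library's API.

```python
import math
from functools import reduce
from operator import mul


def pmi(phrase_count, unigram_counts, ngram_total, unigram_total, epsilon=1e-9):
    """Smoothed pointwise mutual information (base 2) of an n-gram."""
    n = len(unigram_counts)
    joint_prob = phrase_count / ngram_total
    # Probability of the tokens co-occurring if they were independent.
    indep_prob = reduce(mul, unigram_counts) / (unigram_total ** n)
    return math.log((joint_prob + epsilon) / (indep_prob + epsilon), 2)


# A phrase whose first token was never counted: indep_prob is 0.0.
try:
    pmi(0, [0, 5], ngram_total=100, unigram_total=200, epsilon=0.0)
except ZeroDivisionError as exc:
    print("epsilon=0 fails:", exc)

# The same input with a small epsilon gives a finite value.
print("epsilon=1e-9:", pmi(0, [0, 5], ngram_total=100, unigram_total=200))
```

With `epsilon=0`, a phrase that is seen as individual tokens but never as a whole would fail differently, since `math.log(0.0, 2)` raises `ValueError`; a positive epsilon covers both cases.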

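Usage of the parameters 'corpus_files' and 'quality_phrase_files'

Hello, I have some questions about the parameters 'corpus_files' and 'quality_phrase_files' that came up in practice.

  1. If I want to use AutoPhraseX in a for loop (for example, the corpus is split into n parts and each part is mined in turn), can corpus_files only work on files? I tried passing an array or a string instead, and it raised an error. When it is inconvenient to write the processed corpus to txt files (the corpus is split into n parts and n is large), how should I use AutoPhraseX in a loop? Thank you very much!
  2. When I use a simple quality_phrase_files='userDic.txt' (for example, userDic.txt contains "知识图谱"), "知识图谱" does not appear in the mined results. When I then delete "知识图谱" from userDic.txt, it does appear in the results. After trying several examples, I got the impression that quality_phrase_files acts as a stop-word list. I am not sure whether this is because the corpus is small or because I am using it incorrectly.

The code I used is as follows:

from autophrasex import *

# Build the AutoPhrase instance
autophrase = AutoPhrase(
    reader=DefaultCorpusReader(tokenizer=JiebaTokenizer()),
    selector=DefaultPhraseSelector(),
    extractors=[
        NgramsExtractor(N=4),
        IDFExtractor(),
        EntropyExtractor()
    ]
)

# Start mining
predictions = autophrase.mine(
    corpus_files=['answers.txt'],
    quality_phrase_files='userDic.txt',  # quality_phrase_files?? behaves like a stop-word list
    callbacks=[
        LoggingCallback(),
        ConstantThresholdScheduler(),
        EarlyStopping(patience=2, min_delta=3)
        # EarlyStopping()
    ]
)

# Print the mining results
for pred in predictions:
    print(pred)

Thank you all very much for your help!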
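
For the first question, one workaround that needs no change to the library is to write each in-memory chunk to a temporary file and pass that path, since the call above already takes `corpus_files` as a list of paths. This is a sketch under that assumption: `mine_in_chunks` and `mine_fn` are hypothetical names, and `mine_fn` stands for whatever wraps `autophrase.mine(corpus_files=[path], ...)`.

```python
import os
import tempfile


def mine_in_chunks(chunks, mine_fn):
    """Call mine_fn(path) once per chunk, where each chunk is a list of text lines."""
    results = []
    for lines in chunks:
        # delete=False so the file can be reopened by name on Windows as well.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".txt", encoding="utf-8", delete=False
        ) as tmp:
            tmp.write("\n".join(lines) + "\n")
            path = tmp.name
        try:
            results.append(mine_fn(path))
        finally:
            os.remove(path)  # clean up even if mining fails


    return results


def count_lines(path):
    # Stand-in for the real mining call so the sketch runs without autophrasex.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


print(mine_in_chunks([["第一段语料", "第二段语料"], ["第三段语料"]], count_lines))
```

Each temporary file is removed as soon as its chunk has been processed, so a large n does not leave n files on disk.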
