
smoothnlp / smoothnlp

619 stars · 21 watchers · 113 forks · 6.87 MB

An NLP Toolset With A Focus on Explainable Inference

License: GNU General Public License v3.0

Java 51.64% Python 6.01% Dockerfile 0.04% Jupyter Notebook 13.40% HTML 28.90%
nlp tokenizer postagging dependency-parsing python nlp-pipeline

smoothnlp's Introduction



Author Email
Victor [email protected]
Yinjun [email protected]
海蜇 [email protected]

Install

Install via pip (quote the requirement so the shell does not treat ">" as a redirect):

pip install "smoothnlp>=0.4.0"

Install the latest version from source:

git clone https://github.com/smoothnlp/SmoothNLP.git
cd SmoothNLP
python setup.py install

Knowledge Graph

Only available in SmoothNLP V0.3.0 and later; the example below uses the V0.4+ API:

Usage Example & Visualization

from smoothnlp.algorithm import kg
from kgexplore import visual

# extract candidate knowledge n-grams from a list of raw sentences
ngrams = kg.extract_ngram(["SmoothNLP在V0.3版本中正式推出知识抽取功能",
                           "SmoothNLP专注于可解释的NLP技术",
                           "SmoothNLP支持Python与Java",
                           "SmoothNLP将帮助工业界与学术界更加高效的构建知识图谱",
                           "SmoothNLP是上海文磨网络科技公司的开源项目",
                           "SmoothNLP在V0.4版本中推出对图谱节点的分类功能",
                           "KGExplore是SmoothNLP的一个子项目"])
# render the extracted graph (width/height control the figure size)
visual.visualize(ngrams, width=12, height=10)

[Figure: SmoothNLP_KG_Demo, a sample knowledge-graph visualization]

Feature Notes

  • Edge types supported in V0.4 include: event trigger (事件触发), state description (状态描述), attribute description (属性描述), and numeric description (数值描述).
  • Node types supported in V0.4 include: product (产品), region (地区), company & brand (公司与品牌), goods (货品), organization (机构), person (人物), modifier phrase (修饰短语), and other (其他).

Basic NLP Pipelines

1. Tokenization

>>> import smoothnlp
>>> smoothnlp.segment('欢迎在Python中使用SmoothNLP')
['欢迎', '在', 'Python', '中', '使用', 'SmoothNLP']

2. Part-of-Speech Tagging

For POS tag definitions, see the wiki.

>>> smoothnlp.postag('欢迎在Python中使用SmoothNLP')
[{'token': '欢迎', 'postag': 'VV'},
 {'token': '在', 'postag': 'P'},
 {'token': 'Python', 'postag': 'NN'},
 {'token': '中', 'postag': 'LC'},
 {'token': '使用', 'postag': 'VV'},
 {'token': 'SmoothNLP', 'postag': 'NN'}]

3. Named Entity Recognition

>>> smoothnlp.ner("中国平安2019年度长期服务计划于2019年5月7日至5月14日通过二级市场完成购股")
[{'charStart': 0, 'charEnd': 4, 'text': '中国平安', 'nerTag': 'COMPANY_NAME', 'sTokenList': {'1': {'token': '中国平安', 'postag': None}}, 'normalizedEntityValue': '中国平安'},
{'charStart': 4, 'charEnd': 9, 'text': '2019年', 'nerTag': 'NUMBER', 'sTokenList': {'2': {'token': '2019年', 'postag': 'CD'}}, 'normalizedEntityValue': '2019年'},
{'charStart': 17, 'charEnd': 26, 'text': '2019年5月7日', 'nerTag': 'DATETIME', 'sTokenList': {'8': {'token': '2019年5月', 'postag': None}, '9': {'token': '7日', 'postag': None}}, 'normalizedEntityValue': '2019年5月7日'},
{'charStart': 27, 'charEnd': 32, 'text': '5月14日', 'nerTag': 'DATETIME', 'sTokenList': {'11': {'token': '5月', 'postag': None}, '12': {'token': '14日', 'postag': None}}, 'normalizedEntityValue': '5月14日'}]

4. Financial Entity Recognition

>>> smoothnlp.company_recognize("旷视科技预计将在今年9月在港IPO")
[{'charStart': 0,
  'charEnd': 4,
  'text': '旷视科技',
  'nerTag': 'COMPANY_NAME',
  'sTokenList': {'1': {'token': '旷视科技', 'postag': None}},
  'normalizedEntityValue': '旷视科技'}]

5. Dependency Parsing

Note: in the output of smoothnlp.dep_parsing, index 0 refers to a dummy root token.

For dependency label definitions, see the wiki.

>>> smoothnlp.dep_parsing("特斯拉是全球最大的电动汽车制造商。")
[{'relationship': 'top', 'dependentIndex': 2, 'targetIndex': 1},
  {'relationship': 'root', 'dependentIndex': 0, 'targetIndex': 2},
  {'relationship': 'dep', 'dependentIndex': 5, 'targetIndex': 3},
  {'relationship': 'advmod', 'dependentIndex': 5, 'targetIndex': 4},
  {'relationship': 'ccomp', 'dependentIndex': 2, 'targetIndex': 5},
  {'relationship': 'cpm', 'dependentIndex': 5, 'targetIndex': 6},
  {'relationship': 'amod', 'dependentIndex': 8, 'targetIndex': 7},
  {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 8},
  {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 9},
  {'relationship': 'punct', 'dependentIndex': 2, 'targetIndex': 10}]
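
To make the index convention concrete, here is a small sketch that maps each arc back to token text. It assumes, as the sample output suggests, that dependentIndex is the head position and targetIndex the dependent, and that smoothnlp.segment produces the same tokenization the parser uses:

import smoothnlp

text = "特斯拉是全球最大的电动汽车制造商。"
tokens = ["<ROOT>"] + smoothnlp.segment(text)   # index 0 is the dummy root token
for arc in smoothnlp.dep_parsing(text):
    dependent = tokens[arc['targetIndex']]      # the modifying token
    head = tokens[arc['dependentIndex']]        # the token it attaches to (assumed)
    print(dependent, '-' + arc['relationship'] + '->', head)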

6. Sentence Splitting

>>> smoothnlp.split2sentences("句子1!句子2!")
['句子1!', '句子2!']

7. Multithreading Support

SmoothNLP uses 2 threads for service calls by default; this can be changed via:

from smoothnlp import config
config.setNumThreads(2)

8. Logging

from smoothnlp import config
config.setLogLevel("DEBUG")  # set the log level

Unsupervised Learning

New Word Discovery

Algorithm introduction | Usage guide
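
As a quick orientation, here is a minimal sketch of a phrase-extraction call; the entry point smoothnlp.algorithm.phrase.extract_phrase and the parameter names follow the usage guide linked above, and the file name is a placeholder:

from smoothnlp.algorithm.phrase import extract_phrase

# corpus can be a file object, a database cursor, or a list of strings
corpus = open('news_corpus.txt', 'r', encoding='utf-8')  # placeholder file name
new_words = extract_phrase(corpus,
                           top_k=200,           # int = number of phrases; float = proportion
                           chunk_size=1000000,  # how much of the corpus to read per chunk
                           min_n=2,             # shortest n-gram considered
                           max_n=4,             # longest n-gram considered
                           min_freq=5)          # discard candidates rarer than this
print(new_words[:20])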

Event Clustering

This feature is currently offered only as a commercial solution and online service. For details, contact [email protected]

Demo

[
  {
    "url": "https://36kr.com/p/5167309",
    "title": "Facebook第三次数据泄露,可能导致680万用户私人照片泄露",
    "pub_ts": 1544832000
  },
  {
    "url": "https://www.pencilnews.cn/p/24038.html",
    "title": "热点 | Facebook将因为泄露700万用户个人照片 面临16亿美元罚款",
    "pub_ts": 1544832000
  },
  {
    "url": "https://finance.sina.com.cn/stock/usstock/c/2018-12-15/doc-ihmutuec9334184.shtml",
    "title": "Facebook再曝新数据泄露 6800万用户或受影响",
    "pub_ts": 1544844120
  }
]

Aside: the Sina editor's figures are wrong and exaggerated; in reality, Facebook did not leak 68 million photos.

Supervised Learning

(News) Event Classification

This feature is currently offered only as a commercial solution and online service; the online service supports API output. For details, contact [email protected]

Performance

Event Type                                   AUC    Precision
Investment & M&A (投资并购)                   0.996  0.982
Corporate Cooperation (企业合作)              0.977  0.885
Executive & Board Changes (董监高管)          0.982  0.940
Revenue Reports (营收报导)                    0.994  0.960
Contract Signing (企业签约)                   0.993  0.904
Business Expansion (商业拓展)                 0.968  0.869
Product Coverage (产品报道)                   0.977  0.911
Industry Policy (产业政策)                    0.990  0.879
Business Distress (经营不善)                  0.981  0.765
Violations & Regulatory Interviews (违规约谈)  0.951  0.890

References


Tutorial

Service Description

Statement

  1. SmoothNLP provides complete REST text-parsing and related services through cloud microservices. For open-source enthusiasts and other general users, we currently support qps <= 5; for commercial users, we offer unrestricted cloud accounts or on-premise deployment.
  2. Basic NLP tasks, including tokenization, POS tagging, and dependency parsing, are implemented in Java under the folder smoothnlp_maven and can be compiled and packaged with Maven.
  3. If you are looking for a commercial NLP or knowledge-graph solution, email [email protected]

Pro Edition

SmoothNLP Pro provides stable, reliable support for enterprise users (see the usage docs); for a trial or purchase, contact [email protected]

FAQ

  1. Note: since the 0.2.20 release, the basic pipeline functions above only limit input string length (at most 200 characters). To process a longer corpus, first use smoothnlp.split2sentences to split it into sentences (see the sketch after the code block below).
  2. The knowledge-graph visualization (before V0.4) defaults to the SimHei font, and matplotlib does not support Chinese fonts in most environments; we provide a download link for the font package. You can register SimHei with matplotlib by running:
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

# register the SimHei font files with matplotlib
font_dirs = ['simhei/']
font_files = font_manager.findSystemFonts(fontpaths=font_dirs)
font_list = font_manager.createFontList(font_files)  # removed in matplotlib >= 3.3; use fontManager.addfont(path) there
font_manager.fontManager.ttflist.extend(font_list)
plt.rcParams['font.family'] = "SimHei"
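
And for FAQ item 1, a minimal sketch of pre-splitting a long text before tokenization, using only the pipeline calls documented above:

import smoothnlp

long_text = "句子1!句子2!" * 200                     # longer than the 200-character limit
tokens = []
for sent in smoothnlp.split2sentences(long_text):   # split into sentences first ...
    tokens.extend(smoothnlp.segment(sent))          # ... then tokenize each one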

Easter Eggs

  1. If you have suggestions for this project, or want to become a co-developer, feel free to open an issue or pull request; in return, we offer data sharing or a free trial of kgexplore data.
  2. If you are interested in NLP algorithms or application scenarios but lack data, we provide free data support; see the download link.
  3. If you are a university student looking for NLP or knowledge-graph research material, or even an internship, email [email protected]

smoothnlp's People

Contributors

chauncy-guo, jamesbear, siam1991, victorzhrn, yjun1989, yvette-wang


smoothnlp's Issues

Questions on prediction by different models

I've trained the RNTN model on a new corpus with 6 sentiment categories, and used both the new model and the built-in model to predict on the test dataset. Happily, I got two different prediction results. But when I dug deeper into the predictions, I found that the sentiment categories predicted by the model total 8, even though the training args for the new corpus specify 6 and the built-in model defaults to no more than 5. So I am confused by the predicted sentiment categories.
new model: java -mx8g -cp corenlp-chinese-smoothnlp-0.1-with-dependencies.jar edu.stanford.nlp.sentiment.SentimentTraining -numHid 20 -trainPath train_ready.txt -numClasses 6 -classNames "Very negative, Negative, Neutral, Positive, Very positive, Unknown"
result:
[screenshot: Screenshot_20190312_111054]

The 8 sentiment categories:
[screenshot: Screenshot_20190312_111239]

Question about new word discovery

When segmenting, does this package find qualifying words by expanding the boundary once a condition is met? For example, suppose "马克思派的帝国主义" is the word we want, with min_n=4 and max_n=10: is "马克思派" found first, and then the right boundary expanded to find "马克思派的"? I'm not clear on the package's mechanism; I'd appreciate some pointers.

Meta Extraction Failure

"C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\python.exe" "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1.1\plugins\python-ce\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 54150 --file C:/Users/16413/Documents/GitHub/LostXmas/seq2seq/data/augmentation/rationalities.py
pydev debugger: process 86948 is connecting
Connected to pydev debugger (build 201.7846.77)
WARNING:SmoothNLP:HTTPConnectionPool(host='api.smoothnlp.com', port=80): Max retries exceeded with url: /nlp/query?text=%E6%AC%A2%E8%BF%8E%E4%BD%BF%E7%94%A8smoothnlp%E7%9A%84Python%E6%8E%A5%E5%8F%A3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001B6BC5E1948>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1.1\plugins\python-ce\helpers\pydev\pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1.1\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/16413/Documents/GitHub/LostXmas/seq2seq/data/augmentation/rationalities.py", line 3, in <module>
    result = smoothnlp.postag('欢迎使用smoothnlp的Python接口')
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\smoothnlp\__init__.py", line 25, in postag
    return nlp.postag(text)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\smoothnlp\server\__init__.py", line 183, in postag
    tokens = extract_meta(result, "tokens")
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\site-packages\smoothnlp\server\__init__.py", line 91, in extract_meta
    raise ValueError("Meta Extraction Failure")
ValueError: Meta Extraction Failure

code:

import smoothnlp

result = smoothnlp.postag('欢迎使用smoothnlp的Python接口')

print(result)

Choosing the corpus size for new word discovery

Thanks for the suggestion! Unsupervised new-word-discovery methods are hard-pressed to perform well on small corpora. We chose this method hoping to decide which words to filter based on the type and volume of text, reducing the annotation cost of the discovery results. We will add related comments in the code~

Originally posted by @Yvette-Wang in #32 (comment)

Hello. Since small corpora rarely give good results, what corpus size (or at least how much text) do you recommend for the new-word-discovery feature, so that we can weigh time against quality?

Text preprocessing in phrase_extraction (new word discovery)

In ngram_utils.py and phrase_extraction.py, I noticed that the initial text processing is:

  1. re.split(r'[;;.。,,!\n!??]',corpus), which first splits on the punctuation marks in this list
  2. re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", corpus), which keeps only Chinese characters, digits, and upper/lowercase English letters (other symbols and meaningless characters are removed, but no further splitting happens)

Two small questions:

A. Why is the punctuation set [;;.。,,!\n!??] handled separately? In other words, why not split at every punctuation mark?

B. In step 2, after meaningless characters are removed, the text on either side gets concatenated;

for example, "动物防疫法(修订版)_全文" becomes "动物防疫法修订版全文".

  • the extra left/right adjacencies around removed characters inflate the entropy estimates;

  • unreasonable n-gram candidates appear ("法修", "订版全", "版全"...).

How should this be handled?
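
To make problem B concrete, here is a two-line reproduction using the exact regexes quoted above; the comment shows the concatenated string that then feeds n-gram generation:

import re

corpus = "动物防疫法(修订版)_全文"
pieces = re.split(r'[;;.。,,!\n!??]', corpus)
cleaned = [re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", p)
           for p in pieces]
print(cleaned)  # ['动物防疫法修订版全文']: '法修' and '订版全' now appear as candidates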

New word discovery

In SmoothNLP-master/smoothnlp/algorithm/phrase/ngram_utils.py, line 75, in _process_corpus_chunk:
ngram_keys[ni] = (ngram_keys[ni] | nigram_freq.keys())
raises MemoryError.
Is the problem that ngram_keys grows too large? Is there any way to optimize this?

Sentence-length limits when tokenizing

Hello, and thank you very much for your work.
I want to run mixed Chinese-English tokenization, but overly long sentences raise an error. If I split at a fixed length before tokenizing, I may break the sentence's continuity. Is there a way around this?

Frequent error in tokenization and NER: TypeError: string indices must be integers

text = '香港(简称港,雅称香江;英语:Hong Kong,缩写作HK、HKSAR)是中华人民共和国两个特别行政区之一,位于南海北岸、珠江口东侧,北接广东省深圳市,西面与邻近的澳门特别行政区相距63公里,其余两面与南海邻接。全境由香港岛、九龙和新界组成,其中香港岛北部最为发达;'
result = smoothnlp.ner(text)

The following error occurs:
Traceback (most recent call last):
File "F:/Research/Github/deep-learning-and-nlp/pkuseg/smoothnlp_try.py", line 71, in
result = smoothnlp.ner(text)
File "D:\Programs\Continuum\anaconda3\lib\site-packages\smoothnlp\utils_init_.py", line 26, in trycatch
return func(text)
File "D:\Programs\Continuum\anaconda3\lib\site-packages\smoothnlp\utils_init_.py", line 35, in toJson
res = func(text)
File "D:\Programs\Continuum\anaconda3\lib\site-packages\smoothnlp_init_.py", line 64, in ner
return nlp.ner(text)
File "D:\Programs\Continuum\anaconda3\lib\site-packages\smoothnlp\server_init_.py", line 22, in ner
return self.result['entities']
TypeError: string indices must be integers

New word discovery

When computing PMI, why is it P('电影院')/(P('电')*P('影')*P('院')) rather than P('电影院')/max(P('电影')*P('院'), P('电')*P('影院'))? With the latter, there would be no need for the final step that handles high-frequency first and last characters.
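
For reference, here is a small sketch contrasting the two normalizations discussed in this issue; the probability lookup p is illustrative (frequencies estimated from a corpus), not the library's internals:

import math

def pmi_char_product(p, ngram):
    # current scheme: P(ngram) / (P(c1) * P(c2) * ... * P(cn))
    denom = 1.0
    for ch in ngram:
        denom *= p[ch]
    return math.log(p[ngram] / denom)

def pmi_best_split(p, ngram):
    # proposed scheme: P(ngram) / max over binary splits of P(left) * P(right)
    best = max(p[ngram[:i]] * p[ngram[i:]] for i in range(1, len(ngram)))
    return math.log(p[ngram] / best)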

Can the thresholds for neighbor-character richness and internal cohesion be customized in new word discovery?

The usage guide does not expose parameters for these two thresholds.
Are they fixed?

The documented parameters are:

corpus: required; a file from open(), a database connection, or a list
  example: corpus = open(file_name, 'r', encoding='utf-8')
           corpus = conn.execute(query)
           corpus = list(***)
top_k: float or int; the proportion or number of phrases to extract
chunk_size: int; the chunk size used to read the file
min_n: int; extract n-grams of this length and above
max_n: int; extract n-grams of this length and below
min_freq: int; the minimum frequency of extraction targets

Is top_k ranked by word frequency, or by neighbor-character richness or internal cohesion?

String input to knowledge-graph (N-tuple) extraction raises an error

from smoothnlp import kg
kg.extract("SmoothNLP在V0.3版本中正式推出知识抽取功能")

A pile of errors is thrown:

Expecting value: line 1 column 1 (char 0)
An invalid response was received from the upstream server

[the two lines above repeat four times in total]

JSONDecodeError                           Traceback (most recent call last)
~\Documents\Anaconda3\lib\site-packages\smoothnlp\server\__init__.py in _request_single(text, path, counter, max_size_limit, other_params)
     44     try:
---> 45         result = r.json()
     46     except (json.decoder.JSONDecodeError,Exception) as e:

~\Documents\Anaconda3\lib\site-packages\requests\models.py in json(self, **kwargs)
    891                     pass
--> 892         return complexjson.loads(self.text, **kwargs)
    893 

~\Documents\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
    355     if cls is None:

~\Documents\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()

~\Documents\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-15-d45ce311a5bb> in <module>()
----> 1 kg.extract("SmoothNLP在V0.3版本中正式推出知识抽取功能")

~\Documents\Anaconda3\lib\site-packages\smoothnlp\algorithm\kg\__init__.py in extract(text, pretty)
     58         raise TypeError(" Unsupported type for text parameter: {}".format(text))
     59     all_kgs = []
---> 60     sentkgs = extract_all_kg(text = sents, pretty = pretty)
     61     for sentkg in sentkgs:
     62         all_kgs+=sentkg

~\Documents\Anaconda3\lib\site-packages\smoothnlp\algorithm\kg\__init__.py in extract_all_kg(text, pretty)
     23     :return:
     24     """
---> 25     kg_result = _request(text,path="/kg/query",other_params={'pretty':pretty})
     26     return kg_result
     27 

~\Documents\Anaconda3\lib\site-packages\smoothnlp\server\__init__.py in _request(text, path, max_size_limit, other_params)
     75         config.logger.info(
     76             "request parameter: NUM_THREAD = {}, POOL_TYPE = {}".format(config.NUM_THREADS, config.POOL_TYPE))
---> 77         return _request_concurent(text,path,max_size_limit,other_params)
     78     elif isinstance(text,str):
     79         return _request_single(text,path = path,counter=0,max_size_limit=max_size_limit,other_params=other_params)

~\Documents\Anaconda3\lib\site-packages\smoothnlp\server\__init__.py in _request_concurent(texts, path, max_size_limit, other_params)
     67         pool = ThreadPool(config.NUM_THREADS)
     68     params = [(text,path,0,max_size_limit,other_params) for text in texts]
---> 69     result = pool.starmap(_request_single,params)
     70     pool.close()
     71     return result

~\Documents\Anaconda3\lib\multiprocessing\pool.py in starmap(self, func, iterable, chunksize)
    272         `func` and (a, b) becomes func(a, b).
    273         '''
--> 274         return self._map_async(func, iterable, starmapstar, chunksize).get()
    275 
    276     def starmap_async(self, func, iterable, chunksize=None, callback=None,

~\Documents\Anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

~\Documents\Anaconda3\lib\multiprocessing\pool.py in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    117         job, i, func, args, kwds = task
    118         try:
--> 119             result = (True, func(*args, **kwds))
    120         except Exception as e:
    121             if wrap_exception and func is not _helper_reraises_exception:

~\Documents\Anaconda3\lib\multiprocessing\pool.py in starmapstar(args)
     45 
     46 def starmapstar(args):
---> 47     return list(itertools.starmap(args[0], args[1]))
     48 
     49 #

~\Documents\Anaconda3\lib\site-packages\smoothnlp\server\__init__.py in _request_single(text, path, counter, max_size_limit, other_params)
     48         print(r.text)
     49         counter +=3
---> 50         return _request_single(text, path=path, counter=counter, max_size_limit=max_size_limit)
     51     if r.status_code==429:  ## qps超限制
     52         counter += 1

[... the same _request_single retry frame repeats over 30 more times ...]

~\Documents\Anaconda3\lib\site-packages\smoothnlp\server\__init__.py in _request_single(text, path, counter, max_size_limit, other_params)
     27     if counter > 99:
     28         raise Exception(
---> 29             " exceed maximal attemps for parsing. ")
     30     if config.apikey is not None and isinstance(config.apikey,str):   ## pro 版本支持 apikey 调用
     31         other_params['apikey'] = config.apikey

Exception:  exceed maximal attemps for parsing. 

Question about frequency counting

Hello, I'm puzzled by get_ngram_freq_info in ngram_utils and would like to ask:
why is the check that a word's frequency exceeds min_freq performed inside _process_corpus_chunk?
Suppose each of 10 chunks contains one occurrence of X; then even with min_freq set to 2, X will never be counted.
Does min_freq only apply to the current chunk's counts? Shouldn't it apply to the whole corpus?
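
For illustration, here is a sketch of the two aggregation orders being contrasted; the function names are illustrative, not the library's internals:

from collections import Counter

def freq_filter_global(chunks, min_freq=2):
    # merge counts across all chunks first, then filter: an ngram that
    # appears once in each of 10 chunks survives min_freq=2
    total = Counter()
    for chunk in chunks:              # chunk: an iterable of ngrams
        total.update(chunk)
    return {g: c for g, c in total.items() if c >= min_freq}

def freq_filter_per_chunk(chunks, min_freq=2):
    # filter inside each chunk, as the reporter believes the code does:
    # the same thinly spread ngram is dropped in every chunk
    total = Counter()
    for chunk in chunks:
        counts = Counter(chunk)
        total.update({g: c for g, c in counts.items() if c >= min_freq})
    return dict(total)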

Two questions after reading the source code

Question 1:

DONE: in candidate ngrams, filter out those whose first or last character appears especially often, e.g. remove "XX的, 美丽的, 漂亮的" from the dictionary

target_ngrams = word_info_scores.keys()
start_chars = Counter([n[0] for n in target_ngrams])
end_chars = Counter([n[-1] for n in target_ngrams])
threshold = min(2000, int(len(target_ngrams) * 0.001))
threshold = max(50, threshold)
logger.info("~~~ Threshold used for removing start end char: {} ~~~~".format(threshold))
invalid_start_chars = set([char for char, count in start_chars.items() if count > threshold])
invalid_end_chars = set([char for char, count in end_chars.items() if count > threshold])

invalid_target_ngrams = set([n for n in target_ngrams if (n[0] in invalid_start_chars or n[-1] in invalid_end_chars)])

for n in invalid_target_ngrams:  # drop entries with a bad start or end character
    word_info_scores.pop(n)

What was the rationale for adding this step? A quick experiment suggests that filtering stopwords achieves a similar effect, though I haven't experimented much.

Tokenization problem

Hello. I have some sentences I want to segment, but the results are completely different from what I expected. There are over 10,000 sentences; here are a few excerpts:
【之嚆矢。故其民族帝国主义
说。其於孕育民族帝国主义
第一要著。此近世帝国主义
义之公德。此近世帝国主义
变为民族主义。由民族主义而变为民族帝国主义
族主义而变为民族帝国主义
诸国中择其有代表帝国主义
国之籍。故英人之帝国主义
。其最能发挥现世帝国主义
德国若也。德人行帝国主义
起今皇维廉第二之帝国主义
同心戮力。以实行帝国主义
要而论之。德人之帝国主义
斯 俄罗斯之帝国主义
昌耳。然则俄国之帝国主义
前驱。然则谓俄人帝国主义
由此观之。俄人之帝国主义
  麦坚尼之帝国主义
自由无碍以实行帝国主义】
Words such as 民族帝国主义 and 俄国之帝国主义 appear no fewer than several hundred times in these sentences, yet segmentation did not pick them out, and I don't know why. The words it did select are these:
【军阀
资本
而为
英美
阶级
革命
所谓
政府
实行
打倒
反对
民军
是英
反抗
麦端尼
布尔塞维
马克思派
段祺瑞
蒋介石
马克思
马克思派的】
Were they filtered out? I hope you can help explain. Thanks.

pip install SmoothNLP failed

(pyEnv37) C:\Users*>pip install smoothNlp
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Collecting smoothNlp
Downloading https://mirrors.aliyun.com/pypi/packages/5e/64/e9d7e18e51a5ae3f7c3a6791d862cbbf65a8cb3ecaaee281b0e96eccca2b/SmoothNLP-0.3.1.tar.gz (16 kB)
ERROR: Command errored out with exit status 1:
command: 'C:\Users*\wwwFlask\Scripts\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\\AppData\Local\Temp\pip-install-5exu66t9\smoothNlp\setup.py'"'"'; __file__='"'"'C:\Users\\AppData\Local\Temp\pip-install-5exu66t9\smoothNlp\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users*\AppData\Local\Temp\pip-install-5exu66t9\smoothNlp\pip-egg-info'
cwd: C:\Users*\AppData\Local\Temp\pip-install-5exu66t9\smoothNlp
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users********\AppData\Local\Temp\pip-install-5exu66t9\smoothNlp\setup.py", line 5, in <module>
long_description = open(os.path.join(rootdir, 'README.md')).read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa3 in position 690: illegal multibyte sequence
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Question about left/right entropy

The left/right entropy this package computes is not the entropy of a candidate word's neighbors in its actual context, but the entropy over words produced by recombining n-grams. For example, take an n-gram range of 2-4 and the sentence abcd: the generated candidates are ab, bc, cd, abc, bcd, abcd. When judging whether bc is an independent word, its left entropy is computed from ab and its right entropy from cd. I think this is unreasonable, because that is not the word's concrete textual context, so the computed results come out wrong: by the time we reach abcd there is no right-neighbor entropy at all, and whether abcd is even a valid word is unclear. I hope the developers can explain this from the standpoint of the algorithm's principles.
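
For reference, here is a minimal sketch of right-neighbor entropy computed from actual sentence context, which is the behavior the reporter expects, rather than from recombined n-grams:

import math
from collections import Counter

def right_entropy(word, sentences):
    # entropy of the characters that actually follow `word` in context
    neighbors = Counter()
    for s in sentences:
        start = s.find(word)
        while start != -1:
            end = start + len(word)
            if end < len(s):
                neighbors[s[end]] += 1
            start = s.find(word, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())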

Questions on training and using corenlp-chinese-smoothnlp-0.1-with-dependencies.jar

I've trained corenlp-chinese-smoothnlp-0.1-with-dependencies.jar on a new corpus, ./sentiment_output_train_combined.txt, which has 6 sentiment categories.
(terminal): java -mx8g -cp corenlp-chinese-smoothnlp-0.1-with-dependencies.jar edu.stanford.nlp.sentiment.SentimentTraining -numHid 20 -trainPath sentiment_output_train_combined.txt -train -model model.ser.gz -numClasses 6 -classNames "Very negative,Negative,Neutral,Positive,Very positive,Unknown"
This produced the trained model ./model.ser.gz.
But when I used it, the results seemed to come from the original built-in model rather than the updated one, which means the updated model didn't take effect.
(terminal): java -jar corenlp-chinese-smoothnlp-0.1-with-dependencies.jar sentiment.model model.ser.gz
When I used it in a Python notebook, the result seemed unchanged:
[screenshot: screenshot_20190304_135505]
The sentimentDistribution of the test text under the new model is exactly the same as in NLP_Utils/demo.ipynb, and its sentiment category count is 5, not 6.

docker

Error: Unable to access jarfile smoothnlp-0.2-exec.jar

I don't know how to fix this.

The last n-gram of every sentence is dropped when processing the corpus

When processing the corpus and iterating over sentences to extract characters and words, the last n-gram of each sentence is always missed.
The following code in ngram_utils.py appears to cause the problem:

```
def generate_ngram(corpus, n: int = 2):
    def generate_ngram_str(text: str, n):
        for i in range(0, len(text) - n):
            yield text[i:i+n]
```

range(min, max) never yields the value max.
Sample: 无人货架启动和运营成本貌似最低
Results:
ngram=1 ['无', '人', '货', '架', '启', '动', '和', '运', '营', '成', '本', '貌', '似', '最']
ngram=2 ['无人', '人货', '货架', '架启', '启动', '动和', '和运', '运营', '营成', '成本', '本貌', '貌似', '似最']
ngram=3 ['无人货', '人货架', '货架启', '架启动', '启动和', '动和运', '和运营', '运营成', '营成本', '成本貌', '本貌似', '貌似最']
It should be changed to: for i in range(0, len(text)-n+1)
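
For clarity, here is a corrected sketch of the inner generator with the off-by-one fixed, run on the sample above:

def generate_ngram_str(text: str, n: int):
    # range's end is exclusive, so len(text) - n + 1 is needed
    # to include the final n-gram of the sentence
    for i in range(0, len(text) - n + 1):
        yield text[i:i+n]

print(list(generate_ngram_str("无人货架启动和运营成本貌似最低", 2)))
# the final 2-gram '最低' is now included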

New word discovery

In the new-word-discovery tests, how are the top-100 and top-500 lists ranked? Why do my results on the same dataset differ from yours, and what reference was the reported precision measured against?
