hidadeng / cntext

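A text analysis package that supports word counts, readability, document similarity, sentiment analysis, and other text analysis methods. Chinese text sentiment analysis.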

License: MIT License

Python 65.27% HTML 32.43% TeX 2.29%

cntext's Introduction

Hi there,

I'm DaDeng, a PhD candidate at Harbin Institute of Technology, China.

I major in management science.
I'm familiar with Python programming.
I'm currently interested in text analysis (unstructured data).

cntext's Issues

Older pandas versions cannot load the sentiment dictionaries

python 3.7
pandas 1.1.5

import cntext as ct

concreteness_df = ct.load_pkl_dict('concreteness.pkl')

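After running the code from the README:
Can't get attribute 'new_block' on

After upgrading pandas to the latest version, 1.3.5, the problem was resolved. Should the package constrain the pandas version?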
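
One way to act on the suggestion above is a guard that fails early with a readable message when the installed pandas is too old. The sketch below is only an illustration: `parse_version`, `check_pandas_version`, and `MIN_PANDAS` are hypothetical names, and the threshold `(1, 3, 0)` is an assumption drawn from this report (1.1.5 fails, 1.3.5 works), not a documented requirement of cntext.

```python
import re

# Assumed lower bound: the report only establishes that 1.1.5 fails and 1.3.5 works.
MIN_PANDAS = (1, 3, 0)

def parse_version(version):
    """Turn '1.3.5' (or '2.1.0rc1') into a tuple of leading integers."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def check_pandas_version(installed, minimum=MIN_PANDAS):
    """Raise RuntimeError if `installed` is older than `minimum`."""
    if parse_version(installed) < minimum:
        wanted = ".".join(str(n) for n in minimum)
        raise RuntimeError(
            f"pandas {installed} is too old to unpickle the bundled "
            f"dictionaries; please upgrade to pandas>={wanted}"
        )
    return True
```

A caller could run `check_pandas_version(pandas.__version__)` before loading any `.pkl` dictionary, so the failure names the likely cause instead of a pickle internals error.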

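Could the counting take semantic reversal into account? For example, if a positive word is immediately preceded by a negation word, count it as negative, and vice versa.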
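
The behaviour requested above could look roughly like the following sketch. It is not how cntext counts words: `count_with_negation` is a hypothetical helper, the word sets are toy examples, and the input is assumed to be already tokenized (for instance by jieba). Only the token immediately before a sentiment word is checked for negation.

```python
def count_with_negation(tokens, pos_words, neg_words, negations):
    """Count positive/negative hits, flipping polarity after a negation token."""
    counts = {"pos_num": 0, "neg_num": 0}
    for i, token in enumerate(tokens):
        if token in pos_words:
            polarity = "pos_num"
        elif token in neg_words:
            polarity = "neg_num"
        else:
            continue
        # Flip only when the immediately preceding token is a negation word.
        if i > 0 and tokens[i - 1] in negations:
            polarity = "neg_num" if polarity == "pos_num" else "pos_num"
        counts[polarity] += 1
    return counts

tokens = ["我", "不", "高兴"]  # pre-tokenized toy sentence
result = count_with_negation(tokens, {"高兴"}, {"难过"}, {"不", "没"})
print(result)  # {'pos_num': 0, 'neg_num': 1}
```

A one-token window is the simplest possible rule; it misses cases such as double negation or a negation separated from the sentiment word by an adverb.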

The list for '怒' (anger) in the DUTIR dictionary is empty

print(ct.__version__)
diction = ct.load_pkl_dict("DUTIR.pkl")
for key in diction['DUTIR'].keys():
    print(key, len(diction['DUTIR'][key]))
print(diction['DUTIR']['怒'])

1.8.4
乐 1967
好 11107
怒 0
哀 2314
惧 1179
恶 10282
惊 228
[]

Incompatible dependencies

ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 import cntext as ct
2 help(ct)

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/cntext/__init__.py:3
1 __version__ = "1.8.4"
----> 3 from cntext.dictionary import SoPmi, W2VModels, co_occurrence_matrix, Glove
4 from cntext.similarity import jaccard_sim, minedit_sim, simple_sim, cosine_sim
5 from cntext.stats import load_pkl_dict, dict_pkl_list, term_freq, readability, sentiment, sentiment_by_valence, sentiment_by_weight

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/cntext/dictionary.py:3
1 import jieba.posseg as pseg
2 import math,time
----> 3 from gensim.models import word2vec
4 from pathlib import Path
5 from nltk.tokenize import word_tokenize

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/__init__.py:11
7 __version__ = '4.3.1'
9 import logging
---> 11 from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
14 logger = logging.getLogger('gensim')
15 if not logger.handlers: # To ensure reload() doesn't add another one

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/corpora/__init__.py:6
1 """
2 This package contains implementations of various streaming corpus I/O format.
3 """
5 # bring corpus classes directly into package namespace, to save some typing
----> 6 from .indexedcorpus import IndexedCorpus # noqa:F401 must appear before the other classes
8 from .mmcorpus import MmCorpus # noqa:F401
9 from .bleicorpus import BleiCorpus # noqa:F401

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/corpora/indexedcorpus.py:14
10 import logging
12 import numpy
---> 14 from gensim import interfaces, utils
16 logger = logging.getLogger(__name__)
19 class IndexedCorpus(interfaces.CorpusABC):

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/interfaces.py:19
7 """Basic interfaces used across the whole Gensim package.
8
9 These interfaces are used for building corpora, model transformation and similarity queries.
(...)
14
15 """
17 import logging
---> 19 from gensim import utils, matutils
22 logger = logging.getLogger(__name__)
25 class CorpusABC(utils.SaveLoad):

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/matutils.py:1030
1025 return 1. - float(len(set1 & set2)) / float(union_cardinality)
1028 try:
1029 # try to load fast, cythonized code if possible
-> 1030 from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
1032 except ImportError:
1033 def logsumexp(x):

File ~/anaconda3/envs/dadeng/lib/python3.8/site-packages/gensim/_matutils.pyx:1, in init gensim._matutils()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
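
This ValueError commonly appears when a compiled extension (here gensim's `_matutils`) was built against a different NumPy release than the one currently installed. A first diagnostic step is to record which versions are actually present. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, and the package list is just the set of names that appear in the traceback.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(["numpy", "gensim", "cntext"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Including this report in an issue makes it easier to see whether NumPy and gensim were installed at different times or from different sources.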

A bug in the readability function

There is a bug in the readability function for Chinese: the variable on line 120 of stats.py should be "adv_conj_num_per_sent", not "adv_conj_mum_per_sent".

Installation failed

Some of the output:
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for numpy
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects

I tried Python 3.10, 3.9, and 3.8, and the installation failed on all of them. What is the cause?

Usage error

ModuleNotFoundError: No module named 'pdfdocx'

ValueError: Only callable can be used as callback

When running sentiment analysis with the example code from readme.md:
import cntext as ct

text = '我今天得奖了，很高兴，我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
             lang='chinese')
I get the error shown in the title. What is the cause? Thanks.

Slow sentiment analysis

Hello, I ran sentiment analysis with DUTIR on a dataset of 80,000 texts and it took more than 40 minutes, which is very slow compared to other packages. Could you show me how to optimize it? Is there any documentation that would help?

Thanks. Regards
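
A traceback elsewhere on this page shows `ct.sentiment` calling `jieba.add_word` on dictionary words inside the call itself, and the README-style usage reloads the pickle on every call, so per-document setup work is one plausible source of the slowdown. The sketch below illustrates the general remedy of building a lookup once and reusing it. `build_lookup` and `count_categories` are hypothetical helpers, not part of cntext, the dictionary is a toy stand-in, and documents are assumed to be tokenized already.

```python
def build_lookup(diction):
    """Invert {category: [words]} into {word: [categories]} once, up front."""
    lookup = {}
    for category, words in diction.items():
        for word in words:
            lookup.setdefault(word, []).append(category)
    return lookup

def count_categories(tokens, lookup, categories):
    """Count category hits for one tokenized document using the shared lookup."""
    counts = {category: 0 for category in categories}
    for token in tokens:
        for category in lookup.get(token, ()):
            counts[category] += 1
    return counts

# Toy stand-in for a loaded dictionary such as DUTIR.
toy_diction = {"乐": ["高兴", "快乐"], "哀": ["难过"]}
lookup = build_lookup(toy_diction)  # built once, outside the loop

docs = [["我", "很", "高兴"], ["他", "难过", "又", "快乐"]]
results = [count_categories(doc, lookup, toy_diction) for doc in docs]
print(results)  # [{'乐': 1, '哀': 0}, {'乐': 1, '哀': 1}]
```

This only reproduces the per-category counts; fields such as `word_num` or `stopword_num` in cntext's output would need to be added separately.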

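Is the parameter name for the Chinese readability metric wrong?

The second parameter of ct.readability() is zh_adjconj=None.

It denotes a custom dictionary of adverbs and conjunctions, and the parameter description also says "Chinese conjunctions and adverbs, receive list data type. By default, the built-in dictionary of cntext is used."

Why, then, does the parameter name use "adj", the abbreviation for adjective?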

The 'DUTIR' emotion dictionary contains dirty data

import cntext as ct
res = ct.load_pkl_dict('DUTIR.pkl')

for k,v in res['DUTIR'].items():
    print(k,len(v))
    print("开心" in v)

You will find that '开心' (happy) appears in both the '乐' (joy) and '惧' (fear) categories.
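
The manual check above can be generalized to list every word that sits in more than one category. The following is a minimal sketch on a toy dictionary; `find_cross_category_words` is a hypothetical helper, and the input is assumed to have the same `{category: [words]}` shape as `res['DUTIR']`.

```python
from collections import defaultdict

def find_cross_category_words(diction):
    """Return {word: sorted categories} for words listed under 2+ categories."""
    seen = defaultdict(set)
    for category, words in diction.items():
        for word in words:
            seen[word].add(category)
    return {w: sorted(c) for w, c in seen.items() if len(c) > 1}

# Toy data mirroring the report: '开心' is listed under two categories.
toy = {"乐": ["开心", "快乐"], "惧": ["开心", "害怕"]}
print(find_cross_category_words(toy))  # {'开心': ['乐', '惧']}
```

Whether a word in several categories is an error or an intended multi-label entry depends on the dictionary's design, so the output is best treated as a list of candidates to review.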

Error when using the DUTIR dictionary

Code run:

import cntext as ct

text = '我今天得奖了，很高兴，我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
             lang='chinese')

Error:

Traceback (most recent call last):
  File "d:\PythonProject\test\test_cntext.py", line 5, in <module>
    ct.sentiment(text=text,
  File "D:\Miniconda3\envs\py38\lib\site-packages\cntext\stats.py", line 159, in sentiment
    jieba.add_word(w)
  File "D:\Miniconda3\envs\py38\lib\site-packages\jieba\__init__.py", line 426, in add_word
    word = strdecode(word)
  File "D:\Miniconda3\envs\py38\lib\site-packages\jieba\_compat.py", line 79, in strdecode
    sentence = sentence.decode('utf-8')
AttributeError: 'int' object has no attribute 'decode'

If another dictionary is used instead of DUTIR, the code runs normally, for example:

import cntext as ct

text = '我今天得奖了，很高兴，我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('HOWNET.pkl')['HOWNET'],
             lang='chinese')

Result:

{'deny_num': 0,
 'ish_num': 0,
 'more_num': 0,
 'neg_num': 0,
 'pos_num': 3,
 'very_num': 1,
 'stopword_num': 8,
 'word_num': 14,
 'sentence_num': 1}
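
The last frames of the traceback show jieba's `strdecode` calling `.decode` on an `int`, which suggests that at least one non-string entry reached `jieba.add_word`. Whether DUTIR.pkl really contains such entries is not confirmed here. As a sketch of a possible workaround, the hypothetical helper `split_non_strings` below separates string entries from everything else before a dictionary is passed on; the data is a toy example.

```python
def split_non_strings(diction):
    """Separate str entries from everything else, per category."""
    clean, rejected = {}, {}
    for category, words in diction.items():
        clean[category] = [w for w in words if isinstance(w, str)]
        bad = [w for w in words if not isinstance(w, str)]
        if bad:
            rejected[category] = bad
    return clean, rejected

# Toy dictionary with one stray integer entry.
toy = {"乐": ["高兴", 0, "快乐"], "怒": []}
clean, rejected = split_non_strings(toy)
print(clean)     # {'乐': ['高兴', '快乐'], '怒': []}
print(rejected)  # {'乐': [0]}
```

Inspecting `rejected` first shows what would be dropped, which helps decide whether the entries are noise or values that should have been converted to strings.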

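In the valence function test, the file concreteness.pkl does not exist; I checked, and the downloaded folder for version 1.7.6 does not contain it

As the title says.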

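Building a dictionary with word2vec

Hello! For the seed words of each sentiment class, is the first line the name of the class, or is it one of that class's seed words?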

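Sentiment calculation question

Hello! I'm back with another question. Sentiment analysis only returns the number of words in each sentiment category; how can I get a sentiment classification and a sentiment score? Even the sentiment dictionaries with valence don't seem to produce a sentiment score.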
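
One common convention, offered here only as an illustration and not as cntext's own method, derives a score from the counts as (pos - neg) / (pos + neg) and reads the label from its sign. The sketch assumes a count dictionary with `pos_num` and `neg_num` keys, as in the HOWNET output shown elsewhere on this page; `polarity_score` is a hypothetical helper.

```python
def polarity_score(counts):
    """Return (label, score) with score = (pos - neg) / (pos + neg) in [-1, 1]."""
    pos = counts.get("pos_num", 0)
    neg = counts.get("neg_num", 0)
    total = pos + neg
    score = 0.0 if total == 0 else (pos - neg) / total
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score

# Counts copied from the HOWNET output shown elsewhere on this page.
print(polarity_score({"pos_num": 3, "neg_num": 0}))  # ('positive', 1.0)
```

For multi-category dictionaries such as DUTIR, a comparable rule would be to pick the category with the largest count, with the same caveat that it is a heuristic.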

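cntext cannot be installed on Arch Linux

Hello, installation fails on Arch Linux with a message that building gensim failed:
ERROR: Failed building wheel for gensim
All software is at the latest version (Arch Linux's default update mechanism).
python: 3.10
pip: pip 22.1.2 from /home/XXX/.local/lib/python3.10/site-packages/pip (python 3.10)
All packages are installed in the user directory (~/.local/lib).
If this installation method is wrong, please suggest a more suitable system version (environment), such as Ubuntu or Debian.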

Loading ChineseEmoBank.pkl fails

Other dictionaries (e.g. HOWNET) work normally.

print(ct.__version__)
# Load the pkl dictionary file
print(ct.load_pkl_dict('ChineseEmoBank.pkl'))

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[51], line 3
      1 print(ct.__version__)
      2 # Load the pkl dictionary file
----> 3 print(ct.load_pkl_dict('ChineseEmoBank.pkl'))

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\cntext\stats.py:32, in load_pkl_dict(file, is_builtin)
     30 else:
     31     dict_f = open(file, 'rb')
---> 32 dict_obj = pickle.load(dict_f)
     33 return dict_obj

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\blocks.py:2400, in new_block(values, placement, ndim, refs)
   2388 def new_block(
   2389     values,
   2390     placement: BlockPlacement,
   (...)
   2397     # - check_ndim/ensure_block_shape already checked
   2398     # - maybe_coerce_values already called/unnecessary
   2399     klass = get_block_type(values.dtype)
-> 2400     return klass(values, ndim=ndim, placement=placement, refs=refs)

TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got slice)

Environment:

Windows 11 x64
python 3.11.4
pandas 2.1.3
cntext 1.8.8
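
Both pandas-related reports on this page fail inside `pickle.load` with errors that mention `new_block`, a pandas internal, which is consistent with pickled pandas objects being sensitive to the pandas version used to read them. One way to sidestep that class of problem is to store dictionaries as plain Python types in a text format. The sketch below uses JSON from the standard library; `save_dict_json` and `load_dict_json` are hypothetical helpers, the values are made-up toy numbers, and this is not how cntext ships its dictionaries.

```python
import json
import os
import tempfile

def save_dict_json(diction, path):
    """Write a dictionary of plain Python types as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(diction, f, ensure_ascii=False)

def load_dict_json(path):
    """Read the dictionary back; no pandas version is involved."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Toy valence table with made-up values, for illustration only.
toy = {"valence": {"高兴": 7.2, "难过": 2.1}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dict.json")
    save_dict_json(toy, path)
    restored = load_dict_json(path)
print(restored == toy)  # True
```

A table that must remain a DataFrame could be rebuilt from such a file after loading, at the cost of one extra conversion step.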

Could the empty '怒' (anger) list in the Dalian University of Technology (DUTIR) dictionary be fixed?

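I'd like to cite your paper, but I can't find the original text

Hello, with the citation format you provide, the original paper cannot be found even via the DOI. If convenient, could you please provide the specific URL of the paper?
Also, is there an article that explains the underlying algorithm of sentiment_by_valence()? If so, please share that as well.
Thank you!
Best wishes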

@misc{YourReferenceHere,
author = {Deng, Xudong and Nan, Peng},
doi = {10.5281/zenodo.7063523},
month = {9},
title = {cntext: a Python tool for text mining},
url = {https://github.com/hiDaDeng/cntext},
year = {2022}
}