rkcosmos / deepcut

License: MIT License
deepcut's Introduction

Deepcut


A Thai word tokenization library using Deep Neural Network.

[Figure: model structure of the deepcut neural network]

What's new

  • v0.7.0 Migrated from Keras to TensorFlow 2.0
  • v0.6.0 Allows excluding stop words and custom dictionaries; updated weights with semi-supervised learning
  • v0.5.2 Better pretrained weight matrix
  • v0.5.1 Faster tokenization through code refactoring
  • The examples folder provides a starter script for a Thai text classification problem
  • DeepcutJS: you can try tokenizing Thai text in a web browser here

Performance

The convolutional neural network is trained on 90% of NECTEC's BEST corpus (which consists of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results below are calculated on the 'true' class only.

Precision Recall F1
97.8% 98.5% 98.1%
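To make the begin-of-word framing concrete, here is a minimal sketch (illustrative only, not deepcut's actual training pipeline) of how one binary label per character can be derived from a gold segmentation:

# Illustrative sketch of the begin-of-word labeling described above;
# not the library's actual training code.
words = ['ตัดคำ', 'ได้', 'ดี', 'มาก']   # a gold segmentation

labels = []
for word in words:
    labels.extend([1] + [0] * (len(word) - 1))   # 1 = character begins a word

text = ''.join(words)
print(list(zip(text, labels)))   # one binary label per character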

Installation

Install the stable release (TensorFlow 2.0) using pip:

pip install deepcut

For the latest development release (recommended):

pip install git+git://github.com/rkcosmos/deepcut.git

If you want to use TensorFlow 1.x with standalone Keras, you will need

pip install deepcut==0.6.1

Docker

First, install and run Docker on your machine. Then you can build and run deepcut as follows:

docker build -t deepcut:dev .    # build the docker image
docker run --rm -it deepcut:dev  # run it; -it makes it interactive, --rm removes the container and its file system on exit

This will open a shell for us to play with deepcut.
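Inside that shell, a quick sanity check (a minimal sketch, assuming the image puts python on the PATH) is:

python -c "import deepcut; print(deepcut.tokenize('ตัดคำได้ดีมาก'))"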

Usage

import deepcut
deepcut.tokenize('ตัดคำได้ดีมาก')

The output is a list of tokens:

['ตัดคำ','ได้','ดี','มาก']

Bag-of-word transformation

We implemented a tokenizer that works similarly to CountVectorizer from scikit-learn. Here is an example usage:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix
print(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, column index of sparse matrix

X_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the built tokenizer vocabulary to transform new text
print(X_test.shape) # 2 x 6 CSR sparse matrix

tokenizer.save_model('tokenizer.pickle') # save the tokenizer to use later

You can load the saved tokenizer later:

tokenizer = deepcut.load_model('tokenizer.pickle')
X_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])
print(X_sample.shape) # getting the same 2 x 6 CSR sparse matrix as X_test

Custom Dictionary

Users can add a custom dictionary by providing the path to a .txt file containing one word per line, like the following.

ขี้เกียจ
โรงเรียน
ดีมาก

The file path can be passed as the custom_dict argument of the tokenize function, e.g.

deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, provide the custom dictionary as a list of words

Notes

Some texts might not be segmented as we would expect (e.g. 'โรงเรียน' -> ['โรง', 'เรียน']). This is because:

  • The BEST corpus (the training data) tokenizes words this way (it uses 'compound words' as a criterion for segmentation)
  • They are unseen/new words. Ideally this would be cured by a better corpus, but that is not very practical, so I am thinking of doing semi-supervised learning to incorporate new examples.

Any suggestions and comments are welcome; please post them in the issue section.


Citations

If you use deepcut in your project or publication, please cite the library as follows:

Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,
Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.
(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707

or BibTeX entry:

@misc{Kittinaradorn2019,
    author       = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},
    title        = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},
    month        = sep,
    year         = 2019,
    doi          = {10.5281/zenodo.3457707},
    version      = {1.0},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3457707}
}

Partner Organizations

  • True Corporation

We are open to contribution and collaboration.

deepcut's People

Contributors

bact, bluenex, chanwit, kittinan, p16i, phatsriwichai, rkcosmos, tienlemanh, titipata, wannaphong, wisticejent


deepcut's Issues

some question about deepcut

First, thanks for your contribution to Thai NLP. The performance of deepcut is incredible. It makes me curious: what training data does deepcut use?

Pruning model

Hi,

I've done some analysis investigating whether some layers in DeepCut are redundant. There are some big layers that we can remove while DeepCut's tokenization performance drops only slightly.

Given this result, you might consider pruning or quantising the model. As a result, DeepCut would become smaller and (hopefully) faster.

Please find more information in this notebook.

multi-thread

How do I run deepcut with multiple threads?

Can't use tokenizer.fit_tranform

I copied the example and pasted it into VSCode, but got an error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 4 CSR sparse matrix
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 278, in fit_tranform
    X = self.transform(raw_documents, new_document=True)
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 267, in transform
    self.max_features)
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 214, in _limit_features
    if not kept_indices:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

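The error comes from truth-testing a NumPy array. A hedged sketch of a local fix (against the _limit_features code shown in the traceback above; not an official patch) is to test the array's size explicitly:

# In deepcut/deepcut.py, _limit_features (sketch of a possible fix):
kept_indices = np.where(mask)[0]
if kept_indices.size == 0:   # `not kept_indices` is ambiguous for arrays
    raise ValueError("After pruning, no terms remain. Try a lower"
                     " min_df or a higher max_df.")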

Adding document-term matrix to the library

@rkcosmos, can I add a document-term matrix calculation to deepcut (similar to CountVectorizer in scikit-learn, but using deepcut to tokenize the Thai words in the given documents)? I think this would be helpful for people who want to use a document-term matrix for topic modeling such as Latent Dirichlet Allocation or Latent Semantic Analysis.

I proposed something as follows:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1), max_df=0.8, min_df=0.1)
X = tokenizer.fit_transform(documents) # document-term matrix of Thai text

This is still not as complete as scikit-learn's count vectorizer, but good enough for practical use :)

Custom words dictionary

Thank you for this good work. I have two questions about using this tool. First let me briefly explain my use case:

I am translating Buddhist texts from Thai to English for the Mahachulalangkornraachawitayaalay (MCU). The source material is images, so I must first do OCR (with tesseract) and then edit into markdown format. After that I can translate to English using Google Translate. During OCR, some characters and annotations are missed or misinterpreted. I hope that deepcut can help me correct the words that OCR misrepresents. For example, the correct word is 'ประจําบท' but OCR misses the sara am and returns 'ประจาบท'.

  1. Can deepcut help in this case?
  2. If there are new or unseen words in the text, how can I add these words to deepcut for identification in the future?

AttributeError: 'list' object has no attribute 'ravel'

I was tokenizing 100,000 lines of Thai text and got this:
Using TensorFlow backend.
Traceback (most recent call last):
File "deepv1.py", line 64, in
x.append(deep(line))
File "deepv1.py", line 24, in deep
z =deepcut.tokenize(list)
File "/home/essz/anaconda3/lib/python3.6/site-packages/deepcut/deepcut.py", line 45, in tokenize
y_predict = (y_predict.ravel() > 0.5).astype(int)
AttributeError: 'list' object has no attribute 'ravel'
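One hedged workaround, assuming the installed version only accepts a single string: tokenize line by line instead of passing a list (the file name below is hypothetical):

import deepcut

# Tokenize a large file line by line; avoids passing a list into tokenize()
results = []
with open('input.txt', encoding='utf-8') as f:   # hypothetical input file
    for line in f:
        results.append(deepcut.tokenize(line.strip()))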

After loading a saved model, DeepcutTokenizer has no function "todense()"

Environment

  • deepcut==0.7.0
  • Python 3.7
  • Colab notebook

saving part

tokenizer = DeepcutTokenizer(ngram_range=(1,3), max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(tokenized_df) 

X_test = tokenizer.transform(test_df[:10]) # use the built tokenizer vocabulary to transform new text
print(X_test.shape)

tokenizer.save_model(dirpath+'tokenizer.pickle') # save the tokenizer to use later

load part

X = deepcut.load_model(dirpath+'tokenizer.pickle')
# plot stuff 
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow(X.todense(), interpolation='nearest')
plt.tight_layout()

error found

AttributeError                            Traceback (most recent call last)
<ipython-input-14-0de8817a548e> in <module>()
      1 fig, ax = plt.subplots(figsize=(12, 12))
----> 2 ax.imshow(X.todense(), interpolation='nearest')
      3 plt.tight_layout()

AttributeError: 'DeepcutTokenizer' object has no attribute 'todense'
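The likely cause is that deepcut.load_model returns the DeepcutTokenizer itself, not a matrix, so transform() must be called first. A minimal sketch of the intended flow, reusing dirpath and test_df from the report above:

import deepcut
import matplotlib.pyplot as plt

tokenizer = deepcut.load_model(dirpath + 'tokenizer.pickle')   # returns the tokenizer
X = tokenizer.transform(test_df[:10])                          # CSR sparse matrix
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow(X.todense(), interpolation='nearest')                # todense() lives on X
plt.tight_layout()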

Using Keras fit_generator to train the model

It might be useful to use a generator (see the example here) to train the model.

We should also pass a callback_list while training, e.g. ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, to the fit_generator function in order to save the best model (dividing the training set into training and development sets).

I can work on it and send the PR over the weekend :)

Semi-supervised training

Curious as to what's your plan to further enhance this by doing semi-supervised training. Anything specific you have in mind already? Any help that might be needed?

About multiprocessing

I tried using multiprocessing with asynchronous calls and it got stuck in an infinite loop or something; the program makes no progress. Here is a sample:

from multiprocessing import Pool
import deepcut as dc

def f(text):
    return dc.tokenize(text)

if __name__ == '__main__':
    texts = ["ง่ายๆแค่นี้เอง", "ลองใหม่สิ", "เอางี้นะ", "เอาอีกรอบหนึ่ง"]
    pool = Pool(processes=4)

    multiple_results = [pool.apply_async(f, (i,)) for i in texts]
    for func in multiple_results:
         print(func.get())

I opened the Activity Monitor on macOS and it is not using the processor. I think it started happening after the update that made the program faster.
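A commonly suggested workaround for this kind of hang (a sketch, not a confirmed fix) is to import deepcut inside the worker function, so each forked process builds its own TensorFlow state instead of inheriting one from the parent:

from multiprocessing import Pool

def f(text):
    import deepcut   # per-process import: each worker loads its own model
    return deepcut.tokenize(text)

if __name__ == '__main__':
    texts = ["ง่ายๆแค่นี้เอง", "ลองใหม่สิ", "เอางี้นะ", "เอาอีกรอบหนึ่ง"]
    with Pool(processes=4) as pool:
        print(pool.map(f, texts))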

AttributeError: partially initialized module 'deepcut' has no attribute 'tokenize'

Hi,

First of all, Thanks for a really useful library.

I'm trying to use Deepcut library and I'm facing this error:

AttributeError: partially initialized module 'deepcut' has no attribute 'tokenize' (most likely due to a circular import)

I've installed deepcut with pip (pip install deepcut) and then written a very simple Python script:

import deepcut
print(deepcut.tokenize('ตัดคำได้ดีมาก'))

And I got that error message.

My environment:
Python: 3.10.12, 3.7.17 (tried both)
pip: 23.2.1
OS: MacOS
Chip: M2

Getting the indices of the words

I want to use the library and get the indices of the extracted words instead of the extracted word strings.

For example: a list that tells me where each word starts and ends in the input string, i.e. [(0,2), (2,5), (6,8)].

Is there a way to do that?
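The library does not expose offsets directly, but since the returned tokens are substrings of the input in order, a small wrapper can recover spans. A sketch (tokenize_with_spans is a hypothetical helper, and it assumes the tokens concatenate back to the original string):

import deepcut

def tokenize_with_spans(text):
    """Return (start, end) character offsets for each token (hypothetical helper)."""
    spans, pos = [], 0
    for token in deepcut.tokenize(text):
        spans.append((pos, pos + len(token)))
        pos += len(token)
    return spans

print(tokenize_with_spans('ตัดคำได้ดีมาก'))   # [(0, 5), (5, 8), (8, 10), (10, 13)]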

Questions about access to BEST corpus for class project on deepcut

Hi! I am a master's student in Computational Social Science at UChicago. I am taking a computational linguistics class here with Allyson Ettinger, and I have to conduct a final project. I am very interested in Thai text tokenization, and I am planning to use deepcut as the starting point for my final project. I am at a very amateur level in the field of computational linguistics, so I am very likely not going to make any significant contribution to your already-well-built deepcut. However, I still want to have hands-on experience with neural networks on a topic that I personally care about (Thai tokenization). Therefore, is it possible for me to use the deepcut skeleton and explore some possible tweaks with it, as well as to get access to the BEST corpus you use, so that I can train and evaluate my version against your original deepcut on the BEST corpus? Of course, I am going to cite deepcut in my project properly!

Thank you!
มิ้น

deepcut with user dictionary support ?

Dear K.Rakpong krub,

May I make a suggestion about deepcut supporting a user dictionary?

Following your example of "โรงเรียน", there are many words which shouldn't be cut, such as "ขี้เกียจ".

And I have some specific words which I don't want deepcut to cut, such as "หูอื้อ" in the middle of a sentence (if the input is only "หูอื้อ", it works fine!).

My workaround for this issue is to enclose every such word ("ขี้เกียจ" / "หูอื้อ") in "#" and replace them back in the final step.

I don't know if this is the best way to achieve this, but it would be great if we could supply a dictionary of words which shouldn't be cut. I think many users will have their own specific vocabulary which they don't want cut in their own projects, so it would be nice to have this feature.

Best Regards,

Ping

Performance of DeepcutTokenizer

Why is DeepcutTokenizer so much slower than CountVectorizer?
CountVectorizer's fit_transform takes about 2-3 seconds on 1,000 sentences,
but DeepcutTokenizer's fit_tranform takes about 5 minutes on the same data set in the same environment.

DeepCut Tokenize Error

I have a problem with using deepcut.tokenize. It works fine if I use it outside of my project, but I need to call deepcut.tokenize before starting my application; otherwise I get this error: "ValueError: Tensor Tensor("dense_14/Sigmoid:0", shape=(?, 1), dtype=float32) is not an element of this graph". I am using deepcut 0.6.0.0 and tensorflow 1.10.0.

cut words one by one

hi @rkcosmos

When I ran the example, I got a different result.

>>> import deepcut
Using TensorFlow backend.
2017-10-30 18:32:12.384678: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384717: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384726: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384732: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384739: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.679174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6325
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.50GiB
2017-10-30 18:32:12.910366: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x5583d11dd3f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-30 18:32:12.911434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6325
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-30 18:32:12.912357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-10-30 18:32:12.912376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y
2017-10-30 18:32:12.912415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y
2017-10-30 18:32:12.912429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0)
2017-10-30 18:32:12.912439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0)
>>> print(" ".join(deepcut.tokenize(u'ตัดคำได้ดีมาก')))
    
>>> print(deepcut.tokenize('ตัดคำได้ดีมาก'))
['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']

I installed deepcut via pip. Am I missing some configuration?

Zero padding problem: possible (0, 0) padding

In model.py, there is a case where the padding can be (0, 0), which causes a problem when we want to convert the model for use in TensorFlow JS:

out = ZeroPadding1D(padding=(0, window-1))(out)

We probably have to modify the following:

if window != 1:
    out = ZeroPadding1D(padding=(0, window-1))(out)

eval data

Can you provide some Thai segmentation data for training or evaluating the model?

Problems with numbers

Numbers containing commas are split at the commas; for example, 2,000 is tokenized as '2', ',' and '000'.
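Until this is addressed in the model, a hedged post-processing sketch (merge_numbers is a hypothetical helper, not part of deepcut) can re-join digit and comma tokens after tokenization:

import re
import deepcut

NUMERIC = re.compile(r'^[\d,.]+$')

def merge_numbers(tokens):
    """Re-join runs like '2', ',', '000' into '2,000' (hypothetical post-processing)."""
    merged = []
    for token in tokens:
        # Naive rule: also merges a sentence-final comma into a preceding
        # number, so a real implementation would need stricter patterns.
        if merged and NUMERIC.match(token) and NUMERIC.match(merged[-1]):
            merged[-1] += token
        else:
            merged.append(token)
    return merged

print(merge_numbers(deepcut.tokenize('ราคา 2,000 บาท')))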

Transform method in DeepcutTokenizer

CountVectorizer has a transform method but DeepcutTokenizer doesn't. Will deepcut implement this method in the future? I basically have to switch from CountVectorizer to DeepcutTokenizer, and I need to transform the test data before predicting.
(I am not good with English grammar; if you don't understand, I can explain in Thai. Thanks.)

Using batch generator instead of train all training set

@kittinan, here is a workaround script to train using fit_generator instead of fit. You can replace lines 184-193 in train.py with the following:

import itertools
import numpy as np

def generator(l1, l2, l3, batch_size=128):
    gen1 = itertools.cycle(l1)
    gen2 = itertools.cycle(l2)
    gen3 = itertools.cycle(l3)
    while True:
        yield ([np.vstack([next(gen1) for _ in range(batch_size)]),
                np.vstack([next(gen2) for _ in range(batch_size)])],
               np.vstack([next(gen3) for _ in range(batch_size)]))

batch_size = 128
gen_batch_train = generator(x_train_char, x_train_type, y_train, batch_size=batch_size)
gen_batch_val = generator(x_val_char, x_val_type, y_val, batch_size=batch_size)
model.fit_generator(gen_batch_train, steps_per_epoch=len(x_train_char) // batch_size, 
                    epochs=10, verbose=verbose,
                    validation_data=gen_batch_val,
                    validation_steps=len(x_val_char) // batch_size,
                    callbacks=callbacks_list)

New installation of latest release fails (via pip)

sudo python3.6 -m pip install deepcut
WARNING: Running pip install with root privileges is generally not a good idea. Try __main__.py install --user instead.
Collecting deepcut
Using cached https://files.pythonhosted.org/packages/ef/f3/ecda1d7dc51da0689b2df3d002541d0d04ac4db02c5d148eca48c8e3d219/deepcut-0.7.0.0-py3-none-any.whl
Requirement already satisfied: h5py in /usr/local/lib64/python3.6/site-packages (from deepcut)
Collecting tensorflow>=2.0.0 (from deepcut)
Could not find a version that satisfies the requirement tensorflow>=2.0.0 (from deepcut) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow>=2.0.0 (from deepcut)

Remove duplicated files

There are 3 large folders/files in the repository as follows:

  • package
  • weight
  • deepcut/weight

Is there a way we could reduce these to a single location somehow?

Visualisation of DeepCut's architecture

Hi,

I'm currently trying to understand how DeepCut actually works in terms of computation, so I've made this visualisation to see how DeepCut's computational graph is constructed.

[Figure: deepcut-architecture]

Could you please confirm whether the visualisation is correct?

How to using custom_dict ?

I'm sorry, I'm not good at English. I have a question about custom_dict, based on an example Python test of deepcut.

I don't understand the result of test_deepcut_custom_dict_B. What is the logic by which custom_dict affects deepcut's segmentation?

thank you.

Problem with 'save_model' attribute

Firstly, I appreciate your work; it is really useful and not difficult to set up. However, I found an issue while running it in a Jupyter notebook. Below is the code I typed, followed by the error messages.

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
max_df = 1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้','ฉันกินข้าว','ฉันอยากบิน'])
print(tokenizer.vocabulary_)

X_test = tokenizer.transform(['ฉันกิน','ฉันไม่อยากบิน'])
print(X_test.shape)

tokenizer.save_model('tokenizer.pickle')


{'บิน': 0, 'ข้าว': 1, 'ได้': 2, 'อยาก': 3, 'ฉัน': 4, 'กิน': 5}
(2, 6)


AttributeError Traceback (most recent call last)
in
10 print(X_test.shape)
11
---> 12 tokenizer.save_model('tokenizer.pickle')

AttributeError: 'DeepcutTokenizer' object has no attribute 'save_model'


I tried it several times, thinking it might be a typo, but it is not. I am a newbie in ML/DL. Please guide me on fixing this problem. Thank you.

Regards,

Teddy

Document is outdated

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
deepcut.tokenize('ตัดคำได้ดีมาก')
----------------------------------------------------------------------
ValueError                           Traceback (most recent call last)
<ipython-input-5-9855ac2d86cd> in <module>
----> 1 X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in fit_tranform(self, raw_documents)
    276         sparse CSR format (see scipy)
    277         """
--> 278         X = self.transform(raw_documents, new_document=True)
    279         return X
    280 

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in transform(self, raw_documents, new_document)
    265                                                 max_doc_count,
    266                                                 min_doc_count,
--> 267                                                 self.max_features)
    268         self.vocabulary_ = vocabulary
    269 

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in _limit_features(self, X, vocabulary, high, low, limit)
    212                 removed_terms.add(term)
    213         kept_indices = np.where(mask)[0]
--> 214         if not kept_indices:
    215             raise ValueError("After pruning, no terms remain. Try a lower"
    216                              " min_df or a higher max_df.")

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

deepcut==0.6.1.0
Python3.6.6

ValueError: Tensor is not an element of this graph.

I have used deepcut.tokenize as the analyzer in a CountVectorizer and it raises an error:

File "/model/CountVectorizer.py", line 14, in cutWord
    return [word for word in deepcut.tokenize(original_text) if word not in stop_list]
  File "/usr/local/lib/python3.6/dist-packages/deepcut/deepcut.py", line 60, in tokenize
    y_predict = model.predict([x_char, x_type])
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1832, in predict
    self._make_predict_function()
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1029, in _make_predict_function
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2502, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2445, in __init__
    with tf.control_dependencies(self.outputs):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 4863, in control_dependencies
    return get_default_graph().control_dependencies(control_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 4481, in control_dependencies
    c = self.as_graph_element(c)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3478, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _as_graph_element_locked
    raise ValueError("Tensor %s is not an element of this graph." % obj)
ValueError: Tensor Tensor("dense_14/Sigmoid:0", shape=(?, 1), dtype=float32) is not an element of this graph.

So I don't know whether this is a problem with deepcut's use of Keras or not.
Reference issue: keras-team/keras#2397

Instruction to install from repository for the latest version

Maybe you can also allow users to install directly from the repository. Something like the following:

Install using pip for stable release

pip install deepcut

Latest release

pip install git+git://github.com/rkcosmos/deepcut.git

We can change to the stable one later. It's now in active development so I think it makes sense to have this instruction.

custom_dict parameter didn't work on custom dictionary?

I created a custom dictionary text file named "test.txt" and passed its location as the custom_dict parameter. It didn't work: the function returned the same result as with the default custom_dict value. How can I fix this?
The expected result is ['วิชา', 'การเขียนโปรแกรม', 'มี', 'ใคร', 'สอน', 'บ้าง'].
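One thing worth checking (a guess, since the screenshots are not reproduced here): custom_dict expects the path to the .txt file itself, not the directory containing it, or alternatively a list of words, as in this sketch:

import deepcut

# custom_dict should point at the file itself (or be a list of words)
deepcut.tokenize('วิชาการเขียนโปรแกรมมีใครสอนบ้าง',
                 custom_dict='/path/to/test.txt')
deepcut.tokenize('วิชาการเขียนโปรแกรมมีใครสอนบ้าง',
                 custom_dict=['การเขียนโปรแกรม'])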

