rkcosmos / deepcut

License: MIT License
deepcut's Introduction

Deepcut


A Thai word tokenization library using Deep Neural Network.

[Figure: model structure of the deepcut neural network]

What's new

  • v0.7.0 Migrated from Keras to TensorFlow 2.0
  • v0.6.0 Allows excluding stop words and custom dictionaries; updated weights with semi-supervised learning
  • v0.5.2 Better pretrained weight matrix
  • v0.5.1 Faster tokenization through code refactoring
  • The examples folder provides a starter script for a Thai text classification problem
  • DeepcutJS: you can try tokenizing Thai text in a web browser here

Performance

The convolutional neural network is trained on 90% of NECTEC's BEST corpus (which consists of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results below are calculated on the 'true' class only.

Precision Recall F1
97.8% 98.5% 98.1%
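To make the begin-of-word framing concrete, here is a minimal sketch (illustrative only, not deepcut's actual training pipeline) of how one binary label per character can be derived from a gold segmentation:

# Illustrative sketch of the begin-of-word labeling described above;
# not the library's actual training code.
words = ['ตัดคำ', 'ได้', 'ดี', 'มาก']   # a gold segmentation

labels = []
for word in words:
    labels.extend([1] + [0] * (len(word) - 1))   # 1 = character begins a word

text = ''.join(words)
print(list(zip(text, labels)))   # one binary label per character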

Installation

Install the stable release (TensorFlow 2.0) using pip:

pip install deepcut

For the latest development release (recommended):

pip install git+git://github.com/rkcosmos/deepcut.git

If you want to use TensorFlow 1.x with standalone Keras, you will need

pip install deepcut==0.6.1

Docker

First, install and run Docker on your machine. Then you can build and run deepcut as follows:

docker build -t deepcut:dev .    # build the docker image
docker run --rm -it deepcut:dev  # run it; -it makes it interactive, --rm removes the container and its file system on exit

This will open a shell for us to play with deepcut.
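Inside that shell, a quick sanity check (a minimal sketch, assuming the image puts python on the PATH) is:

python -c "import deepcut; print(deepcut.tokenize('ตัดคำได้ดีมาก'))"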

Usage

import deepcut
deepcut.tokenize('ตัดคำได้ดีมาก')

The output is a list of tokens:

['ตัดคำ','ได้','ดี','มาก']

Bag-of-word transformation

We implemented a tokenizer that works similarly to CountVectorizer from scikit-learn. Here is an example usage:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix
print(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, column index of sparse matrix

X_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the built tokenizer vocabulary to transform new text
print(X_test.shape) # 2 x 6 CSR sparse matrix

tokenizer.save_model('tokenizer.pickle') # save the tokenizer to use later

You can load the saved tokenizer later:

tokenizer = deepcut.load_model('tokenizer.pickle')
X_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])
print(X_sample.shape) # getting the same 2 x 6 CSR sparse matrix as X_test

Custom Dictionary

Users can add a custom dictionary by providing the path to a .txt file containing one word per line, like the following.

ขี้เกียจ
โรงเรียน
ดีมาก

The file path can be passed as the custom_dict argument of the tokenize function, e.g.

deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, provide the custom dictionary as a list of words

Notes

Some texts might not be segmented as we would expect (e.g. 'โรงเรียน' -> ['โรง', 'เรียน']). This is because:

  • The BEST corpus (the training data) tokenizes words this way (it uses 'compound words' as a criterion for segmentation)
  • They are unseen/new words. Ideally this would be cured by a better corpus, but that is not very practical, so I am thinking of doing semi-supervised learning to incorporate new examples.

Any suggestions and comments are welcome; please post them in the issue section.


Citations

If you use deepcut in your project or publication, please cite the library as follows:

Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,
Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.
(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707

or BibTeX entry:

@misc{Kittinaradorn2019,
    author       = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},
    title        = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},
    month        = sep,
    year         = 2019,
    doi          = {10.5281/zenodo.3457707},
    version      = {1.0},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3457707}
}

Partner Organizations

  • True Corporation

We are open to contribution and collaboration.

deepcut's People

Contributors

bact, bluenex, chanwit, kittinan, p16i, phatsriwichai, rkcosmos, tienlemanh, titipata, wannaphong, wisticejent


deepcut's Issues

some question about deepcut

First, thanks for your contribution to Thai NLP. The performance of deepcut is incredible. It makes me curious: what training data does deepcut use?

Pruning model

Hi,

I've done some analysis investigating whether some layers in DeepCut are redundant. There are some big layers that we can remove while DeepCut's tokenization performance drops only slightly.

Given this result, you might consider pruning or quantising the model. As a result, DeepCut would become smaller and (hopefully) faster.

Please find more information in this notebook.

multi-thread

How do I run deepcut with multiple threads?

Can't use tokenizer.fit_tranform

I copied the example and pasted it into VSCode, but got an error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 4 CSR sparse matrix
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 278, in fit_tranform
    X = self.transform(raw_documents, new_document=True)
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 267, in transform
    self.max_features)
  File "/home/winnie/anaconda3/envs/tensorflow/lib/python3.5/site-packages/deepcut/deepcut.py", line 214, in _limit_features
    if not kept_indices:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

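The error comes from truth-testing a NumPy array. A hedged sketch of a local fix (against the _limit_features code shown in the traceback above; not an official patch) is to test the array's size explicitly:

# In deepcut/deepcut.py, _limit_features (sketch of a possible fix):
kept_indices = np.where(mask)[0]
if kept_indices.size == 0:   # `not kept_indices` is ambiguous for arrays
    raise ValueError("After pruning, no terms remain. Try a lower"
                     " min_df or a higher max_df.")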

Adding document-term matrix to the library

@rkcosmos, can I add a document-term matrix calculation to deepcut (similar to CountVectorizer in scikit-learn, but using deepcut to tokenize the Thai words in the given documents)? I think this would be helpful for people who want to use a document-term matrix for topic modeling such as Latent Dirichlet Allocation or Latent Semantic Analysis.

I proposed something as follows:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1), max_df=0.8, min_df=0.1)
X = tokenizer.fit_transform(documents) # document-term matrix of Thai text

This is still not as complete as scikit-learn's count vectorizer, but good enough for practical use :)

Custom words dictionary

Thank you for this good work. I have two questions about using this tool. First let me briefly explain my use case:

I am translating Buddhist texts from Thai to English for the Mahachulalangkornraachawitayaalay (MCU). The source material is images, so I must first do OCR (with tesseract) and then edit into markdown format. After that I can translate to English using Google Translate. During OCR, some characters and annotations are missed or misinterpreted. I hope that deepcut can help me correct the words that OCR misrepresents. For example, the correct word is 'ประจําบท' but OCR misses the sara am and returns 'ประจาบท'.

  1. Can deepcut help in this case?
  2. If there are new or unseen words in the text, how can I add these words to deepcut for identification in the future?

AttributeError: 'list' object has no attribute 'ravel'

I was tokenizing 100,000 lines of Thai text and got this:
Using TensorFlow backend.
Traceback (most recent call last):
File "deepv1.py", line 64, in
x.append(deep(line))
File "deepv1.py", line 24, in deep
z =deepcut.tokenize(list)
File "/home/essz/anaconda3/lib/python3.6/site-packages/deepcut/deepcut.py", line 45, in tokenize
y_predict = (y_predict.ravel() > 0.5).astype(int)
AttributeError: 'list' object has no attribute 'ravel'
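One hedged workaround, assuming the installed version only accepts a single string: tokenize line by line instead of passing a list (the file name below is hypothetical):

import deepcut

# Tokenize a large file line by line; avoids passing a list into tokenize()
results = []
with open('input.txt', encoding='utf-8') as f:   # hypothetical input file
    for line in f:
        results.append(deepcut.tokenize(line.strip()))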

After loading a saved model, DeepcutTokenizer has no function "todense()"

Environment

  • deepcut==0.7.0
  • Python 3.7
  • Colab notebook

saving part

tokenizer = DeepcutTokenizer(ngram_range=(1,3), max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(tokenized_df) 

X_test = tokenizer.transform(test_df[:10]) # use the built tokenizer vocabulary to transform new text
print(X_test.shape)

tokenizer.save_model(dirpath+'tokenizer.pickle') # save the tokenizer to use later

load part

X = deepcut.load_model(dirpath+'tokenizer.pickle')
# plot stuff 
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow(X.todense(), interpolation='nearest')
plt.tight_layout()

error found

AttributeError                            Traceback (most recent call last)
<ipython-input-14-0de8817a548e> in <module>()
      1 fig, ax = plt.subplots(figsize=(12, 12))
----> 2 ax.imshow(X.todense(), interpolation='nearest')
      3 plt.tight_layout()

AttributeError: 'DeepcutTokenizer' object has no attribute 'todense'
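The likely cause is that deepcut.load_model returns the DeepcutTokenizer itself, not a matrix, so transform() must be called first. A minimal sketch of the intended flow, reusing dirpath and test_df from the report above:

import deepcut
import matplotlib.pyplot as plt

tokenizer = deepcut.load_model(dirpath + 'tokenizer.pickle')   # returns the tokenizer
X = tokenizer.transform(test_df[:10])                          # CSR sparse matrix
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow(X.todense(), interpolation='nearest')                # todense() lives on X
plt.tight_layout()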

Using Keras fit_generator to train the model

It might be useful to use a generator (see the example here) to train the model.

We should also pass a callback_list while training, e.g. ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, to the fit_generator function in order to save the best model (dividing the training set into training and development sets).

I can work on it and send the PR over the weekend :)

Semi-supervised training

Curious as to what's your plan to further enhance this by doing semi-supervised training. Anything specific you have in mind already? Any help that might be needed?

About multiprocessing

I tried using multiprocessing with asynchronous calls and it got stuck in an infinite loop or something; the program makes no progress. Here is a sample:

from multiprocessing import Pool
import deepcut as dc

def f(text):
    return dc.tokenize(text)

if __name__ == '__main__':
    texts = ["ง่ายๆแค่นี้เอง", "ลองใหม่สิ", "เอางี้นะ", "เอาอีกรอบหนึ่ง"]
    pool = Pool(processes=4)

    multiple_results = [pool.apply_async(f, (i,)) for i in texts]
    for func in multiple_results:
         print(func.get())

I opened the Activity Monitor on macOS and it is not using the processor. I think it started happening after the update that made the program faster.
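A commonly suggested workaround for this kind of hang (a sketch, not a confirmed fix) is to import deepcut inside the worker function, so each forked process builds its own TensorFlow state instead of inheriting one from the parent:

from multiprocessing import Pool

def f(text):
    import deepcut   # per-process import: each worker loads its own model
    return deepcut.tokenize(text)

if __name__ == '__main__':
    texts = ["ง่ายๆแค่นี้เอง", "ลองใหม่สิ", "เอางี้นะ", "เอาอีกรอบหนึ่ง"]
    with Pool(processes=4) as pool:
        print(pool.map(f, texts))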

AttributeError: partially initialized module 'deepcut' has no attribute 'tokenize'

Hi,

First of all, Thanks for a really useful library.

I'm trying to use Deepcut library and I'm facing this error:

AttributeError: partially initialized module 'deepcut' has no attribute 'tokenize' (most likely due to a circular import)

I've installed deepcut with pip (pip install deepcut) and then written a very simple Python script:

import deepcut
print(deepcut.tokenize('ตัดคำได้ดีมาก'))

And I got that error message.

My environment:
Python: 3.10.12, 3.7.17 (tried both)
pip: 23.2.1
OS: MacOS
Chip: M2

Getting the indices of the words

I want to use the library and get the indices of the extracted words instead of the extracted word strings.

For example: a list that tells me where each word starts and ends in the input string, i.e. [(0,2), (2,5), (6,8)].

Is there a way to do that?
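The library does not expose offsets directly, but since the returned tokens are substrings of the input in order, a small wrapper can recover spans. A sketch (tokenize_with_spans is a hypothetical helper, and it assumes the tokens concatenate back to the original string):

import deepcut

def tokenize_with_spans(text):
    """Return (start, end) character offsets for each token (hypothetical helper)."""
    spans, pos = [], 0
    for token in deepcut.tokenize(text):
        spans.append((pos, pos + len(token)))
        pos += len(token)
    return spans

print(tokenize_with_spans('ตัดคำได้ดีมาก'))   # [(0, 5), (5, 8), (8, 10), (10, 13)]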

Questions about access to BEST corpus for class project on deepcut

Hi! I am a master's student in Computational Social Science at UChicago. I am taking a computational linguistics class here with Allyson Ettinger, and I have to conduct a final project. I am very interested in Thai text tokenization, and I am planning to use deepcut as the starting point for my final project. I am at a very amateur level in the field of computational linguistics, so I am very likely not going to make any significant contribution to your already-well-built deepcut. However, I still want to have hands-on experience with neural networks on a topic that I personally care about (Thai tokenization). Therefore, is it possible for me to use the deepcut skeleton and explore some possible tweaks with it, as well as to get access to the BEST corpus you use, so that I can train and evaluate my version against your original deepcut on the BEST corpus? Of course, I am going to cite deepcut in my project properly!

Thank you!
มิ้น

deepcut with user dictionary support ?

Dear K.Rakpong krub,

May I make a suggestion about deepcut supporting a user dictionary?

Following your example of "โรงเรียน", there are many words which shouldn't be cut, such as "ขี้เกียจ".

And I have some specific words which I don't want deepcut to cut, such as "หูอื้อ" in the middle of a sentence (if the input is only "หูอื้อ", it works fine!).

My workaround for this issue is to enclose every such word ("ขี้เกียจ" / "หูอื้อ") in "#" and replace them back in the final step.

I don't know if this is the best way to achieve this, but it would be great if we could supply a dictionary of words which shouldn't be cut. I think many users will have their own specific vocabulary which they don't want cut in their own projects, so it would be nice to have this feature.

Best Regards,

Ping

Performance of DeepcutTokenizer

Why is DeepcutTokenizer so much slower than CountVectorizer?
CountVectorizer's fit_transform takes about 2-3 seconds on 1,000 sentences,
but DeepcutTokenizer's fit_tranform takes about 5 minutes on the same data set in the same environment.

DeepCut Tokenize Error

I have a problem with using deepcut.tokenize. It works fine if I use it outside of my project, but I need to call deepcut.tokenize before starting my application; otherwise I get this error: "ValueError: Tensor Tensor("dense_14/Sigmoid:0", shape=(?, 1), dtype=float32) is not an element of this graph". I am using deepcut 0.6.0.0 and tensorflow 1.10.0.

cut words one by one

hi @rkcosmos

When I ran the example, I got a different result.

>>> import deepcut
Using TensorFlow backend.
2017-10-30 18:32:12.384678: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384717: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384726: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384732: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.384739: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-30 18:32:12.679174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6325
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.50GiB
2017-10-30 18:32:12.910366: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x5583d11dd3f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-30 18:32:12.911434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6325
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-30 18:32:12.912357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-10-30 18:32:12.912376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y
2017-10-30 18:32:12.912415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y
2017-10-30 18:32:12.912429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0)
2017-10-30 18:32:12.912439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0)
>>> print(" ".join(deepcut.tokenize(u'ตัดคำได้ดีมาก')))
    
>>> print(deepcut.tokenize('ตัดคำได้ดีมาก'))
['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']

I installed deepcut via pip. Am I missing some configuration?

Zero padding problem: possible (0, 0) padding

In model.py, there is a case where the padding can be (0, 0), which causes a problem when we want to convert the model for use in TensorFlow JS:

out = ZeroPadding1D(padding=(0, window-1))(out)

We probably have to modify the following:

if window != 1:
    out = ZeroPadding1D(padding=(0, window-1))(out)

eval data

Can you provide some Thai segmentation data for training or evaluating the model?

Problems with numbers

Numbers containing commas are split at the commas; for example, 2,000 is tokenized as '2', ',' and '000'.
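Until this is addressed in the model, a hedged post-processing sketch (merge_numbers is a hypothetical helper, not part of deepcut) can re-join digit and comma tokens after tokenization:

import re
import deepcut

NUMERIC = re.compile(r'^[\d,.]+$')

def merge_numbers(tokens):
    """Re-join runs like '2', ',', '000' into '2,000' (hypothetical post-processing)."""
    merged = []
    for token in tokens:
        # Naive rule: also merges a sentence-final comma into a preceding
        # number, so a real implementation would need stricter patterns.
        if merged and NUMERIC.match(token) and NUMERIC.match(merged[-1]):
            merged[-1] += token
        else:
            merged.append(token)
    return merged

print(merge_numbers(deepcut.tokenize('ราคา 2,000 บาท')))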

Transform method in DeepcutTokenizer

CountVectorizer has a transform method but DeepcutTokenizer doesn't. Will deepcut implement this method in the future? I basically have to switch from CountVectorizer to DeepcutTokenizer, and I need to transform the test data before predicting.
(I am not good with English grammar; if you don't understand, I can explain in Thai. Thanks.)

Using batch generator instead of train all training set

@kittinan, here is a workaround script to train using fit_generator instead of fit. You can replace lines 184-193 in train.py with the following:

import itertools
import numpy as np

def generator(l1, l2, l3, batch_size=128):
    gen1 = itertools.cycle(l1)
    gen2 = itertools.cycle(l2)
    gen3 = itertools.cycle(l3)
    while True:
        yield ([np.vstack([next(gen1) for _ in range(batch_size)]),
                np.vstack([next(gen2) for _ in range(batch_size)])],
               np.vstack([next(gen3) for _ in range(batch_size)]))

batch_size = 128
gen_batch_train = generator(x_train_char, x_train_type, y_train, batch_size=batch_size)
gen_batch_val = generator(x_val_char, x_val_type, y_val, batch_size=batch_size)
model.fit_generator(gen_batch_train, steps_per_epoch=len(x_train_char) // batch_size, 
                    epochs=10, verbose=verbose,
                    validation_data=gen_batch_val,
                    validation_steps=len(x_val_char) // batch_size,
                    callbacks=callbacks_list)

New installation of latest release fails (via pip)

sudo python3.6 -m pip install deepcut
WARNING: Running pip install with root privileges is generally not a good idea. Try __main__.py install --user instead.
Collecting deepcut
Using cached https://files.pythonhosted.org/packages/ef/f3/ecda1d7dc51da0689b2df3d002541d0d04ac4db02c5d148eca48c8e3d219/deepcut-0.7.0.0-py3-none-any.whl
Requirement already satisfied: h5py in /usr/local/lib64/python3.6/site-packages (from deepcut)
Collecting tensorflow>=2.0.0 (from deepcut)
Could not find a version that satisfies the requirement tensorflow>=2.0.0 (from deepcut) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow>=2.0.0 (from deepcut)

Remove duplicated files

There are 3 large folders/files in the repository as follows:

  • package
  • weight
  • deepcut/weight

Is there a way we could reduce these to a single location somehow?

Visualisation of DeepCut's architecture

Hi,

I'm currently trying to understand how DeepCut actually works in terms of computation, so I've made this visualisation to see how DeepCut's computational graph is constructed.

[Figure: deepcut-architecture]

Could you please confirm whether the visualisation is correct?

How to using custom_dict ?

I'm sorry, I'm not good at English. I have a question about custom_dict, based on an example Python test of deepcut.

I don't understand the result of test_deepcut_custom_dict_B. What is the logic by which custom_dict affects deepcut's segmentation?

thank you.

Problem with 'save_model' attribute

Firstly, I appreciate your work; it is really useful and not difficult to set up. However, I found an issue while running it in a Jupyter notebook. Below is the code I typed, followed by the error messages.

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
max_df = 1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้','ฉันกินข้าว','ฉันอยากบิน'])
print(tokenizer.vocabulary_)

X_test = tokenizer.transform(['ฉันกิน','ฉันไม่อยากบิน'])
print(X_test.shape)

tokenizer.save_model('tokenizer.pickle')


{'บิน': 0, 'ข้าว': 1, 'ได้': 2, 'อยาก': 3, 'ฉัน': 4, 'กิน': 5}
(2, 6)


AttributeError Traceback (most recent call last)
in
10 print(X_test.shape)
11
---> 12 tokenizer.save_model('tokenizer.pickle')

AttributeError: 'DeepcutTokenizer' object has no attribute 'save_model'


I tried it several times, thinking it might be a typo, but it is not. I am a newbie in ML/DL. Please guide me on fixing this problem. Thank you.

Regards,

Teddy

Document is outdated

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
deepcut.tokenize('ตัดคำได้ดีมาก')
----------------------------------------------------------------------
ValueError                           Traceback (most recent call last)
<ipython-input-5-9855ac2d86cd> in <module>
----> 1 X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน'])

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in fit_tranform(self, raw_documents)
    276         sparse CSR format (see scipy)
    277         """
--> 278         X = self.transform(raw_documents, new_document=True)
    279         return X
    280 

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in transform(self, raw_documents, new_document)
    265                                                 max_doc_count,
    266                                                 min_doc_count,
--> 267                                                 self.max_features)
    268         self.vocabulary_ = vocabulary
    269 

~/.pyenv/versions/3.6.6/envs/surasak/lib/python3.6/site-packages/deepcut/deepcut.py in _limit_features(self, X, vocabulary, high, low, limit)
    212                 removed_terms.add(term)
    213         kept_indices = np.where(mask)[0]
--> 214         if not kept_indices:
    215             raise ValueError("After pruning, no terms remain. Try a lower"
    216                              " min_df or a higher max_df.")

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

deepcut==0.6.1.0
Python3.6.6

ValueError: Tensor is not an element of this graph.

I have used deepcut.tokenize as the analyzer in a CountVectorizer and it raises an error:

File "/model/CountVectorizer.py", line 14, in cutWord
    return [word for word in deepcut.tokenize(original_text) if word not in stop_list]
  File "/usr/local/lib/python3.6/dist-packages/deepcut/deepcut.py", line 60, in tokenize
    y_predict = model.predict([x_char, x_type])
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1832, in predict
    self._make_predict_function()
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1029, in _make_predict_function
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2502, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2445, in __init__
    with tf.control_dependencies(self.outputs):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 4863, in control_dependencies
    return get_default_graph().control_dependencies(control_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 4481, in control_dependencies
    c = self.as_graph_element(c)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3478, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _as_graph_element_locked
    raise ValueError("Tensor %s is not an element of this graph." % obj)
ValueError: Tensor Tensor("dense_14/Sigmoid:0", shape=(?, 1), dtype=float32) is not an element of this graph.

So I don't know whether this is a problem with deepcut's use of Keras or not.
Reference issue: keras-team/keras#2397

Instruction to install from repository for the latest version

Maybe you can also allow users to install directly from the repository. Something like the following:

Install using pip for stable release

pip install deepcut

Latest release

pip install git+git://github.com/rkcosmos/deepcut.git

We can change to the stable one later. It's now in active development so I think it makes sense to have this instruction.

custom_dict parameter didn't work on custom dictionary?

I created a custom dictionary text file named "test.txt" and passed its location as the custom_dict parameter. It didn't work: the function returned the same result as with the default custom_dict value. How can I fix this?
The expected result is ['วิชา', 'การเขียนโปรแกรม', 'มี', 'ใคร', 'สอน', 'บ้าง'].
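One thing worth checking (a guess, since the screenshots are not reproduced here): custom_dict expects the path to the .txt file itself, not the directory containing it, or alternatively a list of words, as in this sketch:

import deepcut

# custom_dict should point at the file itself (or be a list of words)
deepcut.tokenize('วิชาการเขียนโปรแกรมมีใครสอนบ้าง',
                 custom_dict='/path/to/test.txt')
deepcut.tokenize('วิชาการเขียนโปรแกรมมีใครสอนบ้าง',
                 custom_dict=['การเขียนโปรแกรม'])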

