hyunwoongko / kochat

Opensource Korean chatbot framework

License: Apache License 2.0

Python 100.00%

chatbot deeplearning deep-learning korean korean-chatbot sentence-classification sequance-tagging web-crawler

kochat's Introduction

Kochat

Not satisfied with chatbot builders and want to build your own deep learning chatbot application?
With Kochat, you can easily build your own deep learning chatbot application.

# 1. Create the dataset object
dataset = Dataset(ood=True)

# 2. Create the embedding processor
emb = GensimEmbedder(model=embed.FastText())

# 3. Create the intent classifier
clf = DistanceClassifier(
    model=intent.CNN(dataset.intent_dict),
    loss=CenterLoss(dataset.intent_dict)
)

# 4. Create the named entity recognizer
rcn = EntityRecognizer(
    model=entity.LSTM(dataset.entity_dict),
    loss=CRFLoss(dataset.entity_dict)
)

# 5. Train and build the deep learning chatbot RESTful API
kochat = KochatApi(
    dataset=dataset,
    embed_processor=(emb, True),
    intent_classifier=(clf, True),
    entity_recognizer=(rcn, True),
    scenarios=[
        weather, dust, travel, restaurant
    ]
)

# 6. Connect the view source file
@kochat.app.route('/')
def index():
    return render_template("index.html")

# 7. Start the chatbot application server
if __name__ == '__main__':
    kochat.app.template_folder = kochat.root_dir + 'templates'
    kochat.app.static_folder = kochat.root_dir + 'static'
    kochat.app.run(port=8080, host='0.0.0.0')

Why Kochat?

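Kochat is the first open-source deep learning chatbot framework that supports Korean. (It is not a chatbot builder.)
It supports a variety of pre-built models and loss functions, so you can build a chatbot without knowing much about NLP.
You can plug in your own custom models and loss functions, which makes it even more useful for NLP experts.
It provides every part a chatbot needs: data preprocessing, models, training pipeline, and RESTful API.
There is no pricing to worry about, and it will continue to be provided as an open-source project.
It provides a variety of performance evaluation metrics and powerful visualization features.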

Documentation

Reference

License

Copyright 2020 Hyunwoong Ko.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

kochat's Issues

matplotlib-related error

Exception ignored in: <function Image.__del__ at 0x000001FA80FD7DC8>
Traceback (most recent call last):
  File "C:\Users\USER\anaconda3\envs\kochat-test\lib\tkinter\__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (os_windows_x86.cpp:144), pid=12176, tid=0x00000000000009a0
#  guarantee(result == EXCEPTION_CONTINUE_EXECUTION) failed: Unexpected result from topLevelExceptionFilter
#
# JRE version: Java(TM) SE Runtime Environment (8.0_191-b12) (build 1.8.0_191-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode windows-amd64 compressed oops)
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# D:\pyproject\kochat-test\hs_err_pid12176.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

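Hello, I ran into this error while running the demo locally.
After some googling, I found that it is a matplotlib-related error.

It occurs after GensimEmbedder finishes training.
The fix is to add the following wherever matplotlib is used:

import matplotlib
matplotlib.use('Agg')

That resolves it, but..

It works fine when run from PyCharm, but the error occurs when run from a terminal.
It seems to be a matplotlib backend issue. If you could add this code,
I think it would save a lot of trouble for people running this on a server.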
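
The workaround reported above can be sketched as follows. This is a minimal sketch, not the project's code: `save_curve` is a hypothetical helper, and it assumes the plots only need to be written to files. The point is that the non-interactive Agg backend is selected before `matplotlib.pyplot` is imported, so no Tk-based backend gets loaded.

```python
import os

# Select the non-interactive Agg backend before matplotlib is imported anywhere.
# setdefault keeps a backend the user has already chosen via the environment.
os.environ.setdefault("MPLBACKEND", "Agg")


def save_curve(values, path):
    """Hypothetical helper: plot a list of numbers and save the figure to a file."""
    import matplotlib
    matplotlib.use("Agg")  # explicit, in case another backend was configured
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(values)
    fig.savefig(path)
    plt.close(fig)  # release the figure as soon as it has been written
```

Calling `save_curve([0.1, 0.5, 0.9], "accuracy.png")` writes the image without opening a window, which is usually what a headless server needs.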

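I have a question about Kochat datasets.

I am trying to build a civil-complaint service chatbot model with kochat. Does the dataset have to be provided as CSV files only?

Also, is it possible to train with a pretrained model (e.g., BERT)?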

classification_report() called in kochat/utils/metrics.py raises `ValueError: Number of classes, 1, does not match size of target_names, 2. Try specifying the labels parameter`

Intro

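Hello @hyunwoongko! I needed a Korean chatbot framework, and this one is really well made.
I was impressed as I read through the code and the detailed docs. Thanks to it, I think I can build a chatbot with the features I want.

Problem

After [DistanceClassifier] finishes training, the following error occurs. (It seems to happen while generating the classification metrics report file using OOD data.)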

...
[DistanceClassifier] Epoch : 10, ETA : 4.3569 sec 
Traceback (most recent call last):
  File "application.py", line 26, in <module>
    kochat = KochatApi(
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/kochat/app/kochat_api.py", line 56, in __init__
    self.__fit_intent()
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/kochat/app/kochat_api.py", line 153, in __fit_intent
    self.intent_classifier.fit(self.dataset.load_intent(self.embed_processor))
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/kochat/proc/intent_classifier.py", line 44, in fit
    report, _ = self.metrics.report(['in_dist', 'out_dist'], mode='ood')
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/sklearn/utils/_testing.py", line 317, in wrapper
    return fn(*args, **kwargs)
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/kochat/utils/metrics.py", line 86, in report
    classification_report(
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/workspace/.pyenv_mirror/user/3.8.19/lib/python3.8/site-packages/sklearn/metrics/_classification.py", line 1950, in classification_report
    raise ValueError(
ValueError: Number of classes, 1, does not match size of target_names, 2. Try specifying the labels parameter

My thoughts

Looking at Metrics.report() in kochat/utils/metrics.py, it calls classification_report().

class Metrics:

    ...

    def report(self, label_dict: dict, mode: str) -> tuple:
        """
        Outputs the classification report and confusion matrix.
        These include Precision, Recall, F1 Score, Accuracy, and so on.

        :return: model performance measured with various metrics
        """

        ...

        report = DataFrame(
            classification_report(
                y_true=label,
                y_pred=predict,
                target_names=list(label_dict),
                output_dict=True
            )
        )

        ...

The definition of classification_report() is shown below. The error is raised at the very end of this code.

def classification_report(y_true, y_pred, *, labels=None, target_names=None,
                          sample_weight=None, digits=2, output_dict=False,
                          zero_division="warn"):
    """Build a text report showing the main classification metrics.

    Read more in the :ref:`User Guide <classification_report>`.

    Parameters
    ----------
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.

    y_pred : 1d array-like, or label indicator array / sparse matrix
        Estimated targets as returned by a classifier.

    labels : array, shape = [n_labels]
        Optional list of label indices to include in the report.

    target_names : list of strings
        Optional display names matching the labels (same order).

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights.

    digits : int
        Number of digits for formatting output floating point values.
        When ``output_dict`` is ``True``, this will be ignored and the
        returned values will not be rounded.

    output_dict : bool (default = False)
        If True, return output as dict

        .. versionadded:: 0.20

    zero_division : "warn", 0 or 1, default="warn"
        Sets the value to return when there is a zero division. If set to
        "warn", this acts as 0, but warnings are also raised.

    Returns
    -------
    report : string / dict
        Text summary of the precision, recall, F1 score for each class.
        Dictionary returned if output_dict is True. Dictionary has the
        following structure::

            {'label 1': {'precision':0.5,
                         'recall':1.0,
                         'f1-score':0.67,
                         'support':1},
             'label 2': { ... },
              ...
            }

        The reported averages include macro average (averaging the unweighted
        mean per label), weighted average (averaging the support-weighted mean
        per label), and sample average (only for multilabel classification).
        Micro average (averaging the total true positives, false negatives and
        false positives) is only shown for multi-label or multi-class
        with a subset of classes, because it corresponds to accuracy otherwise.
        See also :func:`precision_recall_fscore_support` for more details
        on averages.

        Note that in binary classification, recall of the positive class
        is also known as "sensitivity"; recall of the negative class is
        "specificity".

    See also
    --------
    precision_recall_fscore_support, confusion_matrix,
    multilabel_confusion_matrix

    Examples
    --------
    >>> from sklearn.metrics import classification_report
    >>> y_true = [0, 1, 2, 2, 2]
    >>> y_pred = [0, 0, 2, 2, 1]
    >>> target_names = ['class 0', 'class 1', 'class 2']
    >>> print(classification_report(y_true, y_pred, target_names=target_names))
                  precision    recall  f1-score   support
    <BLANKLINE>
         class 0       0.50      1.00      0.67         1
         class 1       0.00      0.00      0.00         1
         class 2       1.00      0.67      0.80         3
    <BLANKLINE>
        accuracy                           0.60         5
       macro avg       0.50      0.56      0.49         5
    weighted avg       0.70      0.60      0.61         5
    <BLANKLINE>
    >>> y_pred = [1, 1, 0]
    >>> y_true = [1, 1, 1]
    >>> print(classification_report(y_true, y_pred, labels=[1, 2, 3]))
                  precision    recall  f1-score   support
    <BLANKLINE>
               1       1.00      0.67      0.80         3
               2       0.00      0.00      0.00         0
               3       0.00      0.00      0.00         0
    <BLANKLINE>
       micro avg       1.00      0.67      0.80         3
       macro avg       0.33      0.22      0.27         3
    weighted avg       1.00      0.67      0.80         3
    <BLANKLINE>
    """

    y_type, y_true, y_pred = _check_targets(y_true, y_pred)

    labels_given = True
    if labels is None:
        labels = unique_labels(y_true, y_pred)  # this is where labels is defined
        labels_given = False
    else:
        labels = np.asarray(labels)

    # labelled micro average
    micro_is_accuracy = ((y_type == 'multiclass' or y_type == 'binary') and
                         (not labels_given or
                          (set(labels) == set(unique_labels(y_true, y_pred)))))

    if target_names is not None and len(labels) != len(target_names):
        if labels_given:
            warnings.warn(
                "labels size, {0}, does not match size of target_names, {1}"
                .format(len(labels), len(target_names))
            )
        else:
            raise ValueError(
                "Number of classes, {0}, does not match size of "
                "target_names, {1}. Try specifying the labels "
                "parameter".format(len(labels), len(target_names))
            )  # the error is raised here!
    ...

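In other words, the error appears to occur because labels and target_names have different lengths. It looks like classification_report() is deliberately called without a labels argument so that it stays None, which means labels is defined as unique_labels(y_true, y_pred).

The example in the unique_labels() docstring is as follows.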

    Examples
    --------
    >>> from sklearn.utils.multiclass import unique_labels
    >>> unique_labels([3, 5, 5, 5, 7, 7])
    array([3, 5, 7])
    >>> unique_labels([1, 2, 3, 4], [2, 2, 3, 4])
    array([1, 2, 3, 4])
    >>> unique_labels([1, 2, 10], [5, 11])
    array([ 1,  2,  5, 10, 11])

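So unique_labels(y_true, y_pred) appears to compute the union of the labels in y_true and y_pred.

The problem arises when y_true and y_pred both contain only the same label, 1, i.e. out_dist. (Insufficient training is part of the problem, but shouldn't training still proceed even if everything is classified as OOD?)

Printing y_true and y_pred gives [1 1 1 ... 1 1 1] and [1 1 1 ... 1 1 1] respectively, with the same length.

How can I resolve this error? I did my best to write up my trial and error, and I apologize if it is disorganized. Thank you once again for sharing such a great framework.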
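
One way to avoid the mismatch described above is to follow the hint in the error message and pass `labels` explicitly, so the report no longer depends on which classes happen to appear in the data. The sketch below is a possible workaround, not the maintainer's fix. `report_kwargs` and `ood_report` are hypothetical helper names, and the code assumes that class indices run from 0 to n-1 in the same order as `label_dict` (in_dist = 0, out_dist = 1 in the issue).

```python
def report_kwargs(label_dict):
    """Build keyword arguments that keep labels and target_names the same length."""
    names = list(label_dict)
    return {
        "labels": list(range(len(names))),  # assumed mapping: index i <-> names[i]
        "target_names": names,
        "output_dict": True,
        "zero_division": 0,  # classes with no samples score 0 without a warning
    }


def ood_report(y_true, y_pred, label_dict):
    """Hypothetical wrapper around sklearn's classification_report."""
    from sklearn.metrics import classification_report

    return classification_report(y_true, y_pred, **report_kwargs(label_dict))
```

With `labels=[0, 1]` supplied, a run in which every sample is labeled 1 still produces a two-row report; the in_dist row simply has a support of 0.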

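Hello. I am studying development using this open-source chatbot.
I was able to train intents on up to about 5,000 sentences.
(1. The sentences I chose are relatively short, within 20 bytes each.
2. In kochat_config, word batch size = 128 and (mini) batch size = 128.
3. I have no GPU, so I train on the CPU.)

When I tried more than that (about 7,000),
process finished with exit code 0xC000005 appeared and
the program kept terminating without any error log.

Since this started when I increased the number of sentences, I suspect a memory problem.
To reduce memory usage,
I lowered the word vector size and the mini-batch size, but
the error still occurs when there are many sentences.

(When training on 5,000 sentences,
memory usage in Task Manager rose to a maximum of 80-90%, about 210GB.)

I did not use the kochat open source as-is:
I replaced the part that loads data from CSV
with loading from a database (the dataset data format is unchanged).

I have no idea how many sentences chatbots are typically trained on,
roughly how much memory is usually needed for a given number of sentences,
whether the problem is caused by my changes to the source (poor memory management),
or whether this much memory is normally required, so I am asking.

I would really appreciate an answer on whatever you know!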
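
As a rough sanity check for the memory question above, the size of the embedded training set can be estimated by hand. The sketch below is back-of-the-envelope arithmetic only: it counts a dense float32 tensor of embedded sentences and ignores model parameters, gradients, and Python overhead. The function names and the value of 8 tokens per sentence are assumptions for illustration, not values taken from kochat.

```python
def embedded_dataset_bytes(n_sentences, max_tokens, vector_size, bytes_per_value=4):
    """Bytes needed for a dense [n_sentences, max_tokens, vector_size] float32 tensor."""
    return n_sentences * max_tokens * vector_size * bytes_per_value


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 ** 2)


# 7,000 sentences, 8 tokens each (assumed), 128-dimensional vectors
size = embedded_dataset_bytes(7000, 8, 128)
print(f"{to_mib(size):.2f} MiB")
```

If an estimate like this comes out far below the observed usage, the embedded data itself is probably not what exhausts memory, and the loading code or the model is the more likely place to look.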

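When building for the first time

I downloaded it with git and built it, but
nothing runs...

What do I need to do to run the demo?

Could you share the previous version?

Hello, I am trying to study natural language processing.
I was looking for a tutorial on intent detection and came here after hearing there was good material,
but I saw that the code does not run because it is currently being updated.

If possible, could you share the code for the previous version?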

[BUG] FileNotFoundError

Hello, I am a student studying natural language processing. Thank you for sharing such good code.

I ran it both on my laptop and on Google Colaboratory, and the same error occurred in both.

IntentClassifier

intent_classifier=(clf, True)

/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:253: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
test
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-24-9c1b6a9984f6> in <module>()
     18     entity_recognizer=(rcn, False),
     19     scenarios=[
---> 20         weather
     21     ]
     22     # scenarios=[

5 frames
/content/drive/My Drive/kochat/kochat/utils/visualizer.py in __load_txt(self, mode)
    279         """
    280 
--> 281         f = open(self.model_dir + 'temp{_}{mode}.txt'.format(_=self.delimeter, mode=mode), 'r')
    282         file = f.read()
    283         file = re.sub('\\[', '', file)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/kochat/saved/IntentClassifier/temp/train_accuracy.txt'

EntityRecognizer

entity_recognizer=(rcn, True)

/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:253: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
test
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-23-f8dc28e96af9> in <module>()
     18     entity_recognizer=(rcn, True),
     19     scenarios=[
---> 20         weather
     21     ]
     22     # scenarios=[

5 frames
/content/drive/My Drive/kochat/kochat/utils/visualizer.py in __load_txt(self, mode)
    279         """
    280 
--> 281         f = open(self.model_dir + 'temp{_}{mode}.txt'.format(_=self.delimeter, mode=mode), 'r')
    282         file = f.read()
    283         file = re.sub('\\[', '', file)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/kochat/saved/EntityRecognizer/temp/train_accuracy.txt'

The code below is the Google Colaboratory code.

from flask import render_template

from kochat.app import KochatApi
from kochat.data import Dataset
from kochat.loss import CRFLoss, CosFace, CenterLoss, COCOLoss, CrossEntropyLoss
from kochat.model import intent, embed, entity
from kochat.proc import IntentClassifier, GensimEmbedder, EntityRecognizer, SoftmaxClassifier
from flask import Flask
# from demo.scenrios import restaurant, travel, dust, weather
from demo.scenrios import weather

dataset = Dataset(ood=True)
emb = GensimEmbedder(model=embed.FastText())

clf = IntentClassifier(
    model=intent.CNN(dataset.intent_dict),
    loss=CenterLoss(dataset.intent_dict),
)

rcn = EntityRecognizer(
    model=entity.LSTM(dataset.entity_dict),
    loss=CRFLoss(dataset.entity_dict)
)

kochat = KochatApi(
    dataset=dataset,
    embed_processor=(emb, True),
    intent_classifier=(clf, True),
    entity_recognizer=(rcn, True),
    scenarios=[
        weather
    ]
    # scenarios=[
    #     weather, dust, travel, restaurant
    # ]
)

# added
kochat.app = Flask(__name__)
run_with_ngrok(kochat.app)   #starts ngrok when the app is run

@kochat.app.route('/')
def index():
    return render_template("index.html")


if __name__ == '__main__':
    kochat.app.template_folder = '/content/drive/My Drive/kochat/' + 'templates'
    kochat.app.static_folder = '/content/drive/My Drive/kochat/' + 'static'
    # kochat.app.template_folder = kochat.root_dir + 'templates'
    # kochat.app.static_folder = kochat.root_dir + 'static'
    kochat.app.run()

When I ran the code as above, I found that the train-related save data was not written to the temp folders of IntentClassifier and EntityRecognizer under saved. (The Test_*.txt files all seem to have been saved.)

IntentClassifier

EntityRecognizer

Ignoring this part,

kochat = KochatApi(
    dataset=dataset,
    embed_processor=(emb, False),
    intent_classifier=(clf, False),
    entity_recognizer=(rcn, False),
    scenarios=[
        weather
    ]
    # scenarios=[
    #     weather, dust, travel, restaurant
    # ]
)

and running the demo with True changed to False as above,

the following server connection failure occurs.

The log is as follows.

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Running on http://1b6ad93801ad.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040
127.0.0.1 - - [07/Jul/2020 06:13:45] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:46] "GET /static/css/bootstrap.css HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:46] "GET /static/css/main.css HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:46] "GET /static/js/jquery.js HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:46] "GET /static/js/bootstrap.js HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:46] "GET /static/js/main.js HTTP/1.1" 200 -
127.0.0.1 - - [07/Jul/2020 06:13:48] "GET /favicon.ico HTTP/1.1" 404 -

Running on http://1b6ad93801ad.ngrok.io -> this is where I ran the demo.

I would really appreciate it if you could tell me how to solve the problem caused by the train_*.txt files not being created.
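
Before the visualizer is invoked, a quick check like the one below can show which metric files are actually missing. This is a diagnostic sketch only: `missing_metric_files` is a hypothetical helper, the file naming (`<mode>.txt` inside a temp directory) is inferred from the traceback above, and any mode name other than `train_accuracy` is an assumption.

```python
from pathlib import Path


def missing_metric_files(temp_dir, modes):
    """Return the expected '<mode>.txt' paths that do not exist under temp_dir."""
    temp_dir = Path(temp_dir)
    return [
        str(temp_dir / f"{mode}.txt")
        for mode in modes
        if not (temp_dir / f"{mode}.txt").is_file()
    ]


# Directory taken from the error message; adjust to the local checkout.
for path in missing_metric_files("saved/IntentClassifier/temp", ["train_accuracy"]):
    print("missing:", path)
```

An empty result means the files exist and the path used by the visualizer is the next thing to inspect.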

Making KoBART lighter?

We could try distillation, but it takes too many resources,
so we could try something like quantization or pruning instead.
Feel free to write down your thoughts~

Configuration file

kochat_config.zip

I got FileNotFoundError ...

Hi,

I got the following error message while running main.py:

(py6) dweom@soynet:~/prog/Chatbot$ python main.py
...
AI is awakening now...
Provided Feature : 날씨, 뉴스, 달력, 맛집, 미세먼지, 명언, 번역, 시간, 위키, 음악, 이슈, 인물
...
Traceback (most recent call last):
  File "main.py", line 27, in <module>
    main()
  File "main.py", line 8, in main
    import application as app
  File "/nipa/home/dweom/prog/Chatbot/application.py", line 10, in <module>
    from src.intent.classifier import get_intent
  File "/nipa/home/dweom/prog/Chatbot/src/intent/classifier.py", line 13, in <module>
    configs = IntentConfigs()
  File "/nipa/home/dweom/prog/Chatbot/src/intent/configs.py", line 20, in __init__
    self.data = pd.read_csv(self.root_path + 'train_intent.csv')
  File "/home/dweom/.conda/envs/py6/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/dweom/.conda/envs/py6/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/dweom/.conda/envs/py6/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/dweom/.conda/envs/py6/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/dweom/.conda/envs/py6/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'intent/train_intent.csv' does not exist: b'intent/train_intent.csv'

My test environment

OS : Ubuntu 18.04 LTS
Python : 3.6.7

To run, I did as follows,

~/prog$ git clone https://github.com/gusdnd852/Chatbot
~/prog$ cd Chatbot
~/prog/Chatbot$ cp src/main.py ./
~/prog/Chatbot$ cp src/application.py ./
~/prog/Chatbot$ vi main.py

def main():
    #import src.application as app
    import application as app

~/prog/Chatbot$ python main.py

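TypeError: __init__() got an unexpected keyword argument 'errors'

Hello.
I would like to download and try the kochat library you made,
but an error occurs when creating the Dataset object.
I was working in Colab.
I installed it with pip and followed the tutorial, got this error, then cloned the repository and ran it from demo and got the same error. Could you please take a look?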

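Running the demo example takes a considerable amount of time.

I am running the demo example in Colab, and it takes quite a long time for a demo.

Which parameters should I change to make it take less time?

DistanceClassifier() in particular takes a long time.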

numpy version error

Hello. First of all, thank you for developing and sharing such a good framework.
While installing the libraries required to use kochat, I ran into what looks like a library version problem, so I am opening this issue.
My machine and versions are as follows.

OS: macOS Monterey 12.6.1
Python: 3.8.16
torch: 1.13.0

Other than that, I am using the same library versions as kochat's Requirements.
The error occurs when importing kochat:
ImportError: numpy.core.multiarray failed to import
I assumed the numpy version was mismatched, but ironically, upgrading to the latest numpy causes a dependency conflict with kochat.

I would appreciate it if you could share how to resolve this error!
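
This ImportError usually points to a compiled package that was built against a different numpy than the one installed, so a useful first step is to list the versions actually present in the environment. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). `installed_versions` is a hypothetical helper, and the package list is an example rather than kochat's official requirement set.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package list (an assumption); compare the output with the requirements file.
for name, version in installed_versions(["numpy", "torch", "gensim", "kochat"]).items():
    print(f"{name}: {version}")
```

Comparing this output with the pinned requirements shows which package pulled in an incompatible numpy.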

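The file src/entity/restaurant/train_entity.csv appears to be corrupted.

Hello. I am studying the chatbot source you shared.
There seems to be an error during training, so I am reporting this issue.

The file src/entity/restaurant/train_entity.csv appears to be corrupted.
It cannot be viewed in the browser, and it cannot be viewed after downloading either.
Could you please share the file?
Thank you.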

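Can this be integrated with an Android app?

I want to put a chatbot inside a weather app.

I would like to ask whether I can build a model and serve it per session inside an Android app.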

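Failed to connect to the server.

Following an earlier issue, I reduced the model size in config and training completed...
but when I run it, it says the server connection failed.

Why could this be?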

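Using demo/data

Hello, while looking for references for research and lectures on Korean dialogue models,
I came across hyunwoongko's kochat.

I would like to use the data in hyunwoongko's demo/data.
May I use it if I cite the source and the repository address?

Thank you.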

README images

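Integration with Apache

I am trying to use Flask together with Apache.
I have finished testing the Apache-Flask integration, but an Apache error occurs where the kochat module is used.
The error is FileNotFoundError: [Errno 2] No such file or directory: '/data/raw/'.
The data folder is located elsewhere, but it insists on looking in /data/...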
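
One plausible cause, which the thread does not confirm, is that the data directory is resolved relative to the process working directory. Under Apache that directory is often `/`, which would turn `data/raw/` into `/data/raw/`. The sketch below shows the general technique of anchoring paths to a known base directory instead. `resolve_data_dir` is a hypothetical helper and is not part of kochat.

```python
import os
from pathlib import Path


def resolve_data_dir(base_dir, *parts):
    """Return an absolute directory path under base_dir, with a trailing separator."""
    return str(Path(base_dir).resolve().joinpath(*parts)) + os.sep


# In a WSGI entry point one would typically anchor on the file's own location,
# e.g. base = Path(__file__).resolve().parent, rather than on os.getcwd().
print("cwd-based root:", os.path.abspath(os.curdir))
print("data dir:", resolve_data_dir(".", "data", "raw"))
```

If the first line prints `/` when run under Apache, changing the working directory in the WSGI script (for example with `os.chdir`) before kochat is imported may be enough.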

How can I use this repository??

Program says it's not yet ready to answer..

NLG team / NLU team discussion

If you are on the NLG team and also want to work on NLU,
or on the NLU team and also want to work on NLG, please let me know!
(Very welcome.)

Using the GPU

torch.cuda.is_available() returns True, so the GPU is available in my environment.
But why is the demo training so slow?
How long does it usually take?

Also, how do I use the model with the saved weights?

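kochat 2.0 development

Project overview

Project goal: Improve the sloppy Kochat I built back when I knew nothing about NLP.
Project period: Nothing fixed. I will probably be developing it throughout 2021 (in spare time after work).
Project members: I will be the lead for now. I plan to look for people who want to join. (Up to 8 people seems about right.)
Libraries: Probably based on pytorch-lightning + transformers (most comfortable for me).
Development resources: For now, the 2070 I got when I was running a robotics business should be enough for me (since we will only fine-tune).
- If teammates join, they will handle their own resources...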

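Project list

1. UX/UI project based on Web

Provide a web-based UX/UI that non-experts can use.
It will be built as a dashboard and will support a RESTful API.
The user's computer acts as the server, so I will not be operating any service myself.
Data should be easy to write, and training should start when the start-training button is pressed.
To be implemented with an SPA framework such as react.js and the Spring framework.
It would be fun to build it together with the web developer living in my room (for studying React...).
The goal is to let users build a chatbot even if they cannot code!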

2. NLU Dataset project

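Translate well-known foreign Intent Detection datasets into Korean and correct mistranslations through human correction. (Google Translate, Papago, or a similar service can be used as the translator.)
Each sentence will be tagged with NER tags so that the data can be used for joint learning.
- First, quickly convert the single most famous dataset into Korean and build KoChat 2.0 on top of it.
- Other datasets will be worked on gradually later.
Through this, a large-scale Korean Intent + Entity Joint Detection dataset will be built and released separately.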

3. NLU Modeling project

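Basically developed as Joint Detection, performing NER and Intent Detection at the same time.
- A single BERT backbone is shared, with an Intent Classifier and an Entity Tagger attached, and their losses are summed and minimized.
- The BIOES scheme is used; sentences with no entities can be tagged entirely as O so that only the intent is classified.
Following the philosophy of Kochat 1.0, a lot of attention will go into fallback checking.
- When classifying each intent, perform metric learning such as Center Loss or Cosface.
- Unlike Kochat 1.0, which searches over all examples, store the center point of each class and classify to the nearest class.
- Threshold setting must be automated. Previously this was done by feeding in OOD data, which is a very inconvenient approach. (It makes no sense.)
- The threshold can be set automatically using the variances of the distances between user input data and each class center. (I have already experimented with this and it was fairly successful.)
Implemented on top of DistilKoBERT.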
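
The centroid-based fallback idea above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the planned implementation: the vectors stand in for sentence embeddings, and the threshold rule (mean distance plus two standard deviations) is just one hypothetical way to derive a cutoff from the spread of distances. The plan itself only says that distance variances are used.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit(examples):
    """examples: dict intent -> list of vectors. Returns (centroids, threshold)."""
    centroids = {intent: centroid(vs) for intent, vs in examples.items()}
    dists = [euclidean(v, centroids[intent])
             for intent, vs in examples.items() for v in vs]
    # Assumed rule: mean training distance plus two population standard deviations.
    threshold = statistics.mean(dists) + 2 * statistics.pstdev(dists)
    return centroids, threshold


def classify(vector, centroids, threshold):
    """Return the nearest intent, or 'fallback' if even that one is too far away."""
    intent, dist = min(((name, euclidean(vector, c)) for name, c in centroids.items()),
                       key=lambda pair: pair[1])
    return intent if dist <= threshold else "fallback"


examples = {"weather": [[0, 0], [0, 2]], "dust": [[10, 10], [10, 12]]}
centroids, threshold = fit(examples)
print(classify([0, 1.5], centroids, threshold))
print(classify([5, 6], centroids, threshold))
```

Only one centroid per class has to be stored, so inference cost grows with the number of intents rather than with the number of training examples.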

4. QA project

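Depending on user needs, a QA module can be very useful.
For example, a user may want to be given specific knowledge, and this is hard to implement with the NLU module alone.
For example, an input such as "Until what time do you deliver?" may come in, and implementing every such case with the Intent-Entity approach is not reasonable.
So an open-domain QA model will be added here. It works as follows.
- First, let the chatbot's owner enter documents in advance. (e.g. "Delivery is available until 10 o'clock, and ~~ ")
- When storing a document, split it into sentences. This is so that the search technique described below can be applied.
- When the chatbot's user enters a query, compute the BLEU score between the query and each stored sentence and find the sentences with the highest BLEU score. (The search could also be done with SBERT-based similarity search, but to reduce modeling complexity a naive n-gram-based approach is adopted for now, and the rest can be left as future work.)
- Extract the top-3 to top-5 sentences with the highest BLEU score, and perform QA using the query and the documents that contain them.
Training data will be KorQuAD 2.0, and the model will be DistilKoBERT.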
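
The sentence retrieval step above can be sketched with a simplified BLEU-style score. The code below is an illustration, not the project's implementation: it uses whitespace tokens, n-grams up to length 2, add-one smoothing, and the usual brevity penalty, treating the query as the candidate and each stored sentence as the reference. For Korean text a morpheme tokenizer would likely work better than whitespace splitting. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(candidate, reference, max_n=2):
    """Simplified BLEU: smoothed n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r_counts[gram]) for gram, count in c_counts.items())
        total = sum(c_counts.values())
        log_sum += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_sum / max_n)


def top_k(query, sentences, k=3):
    """Return the k stored sentences that score highest against the query."""
    return sorted(sentences, key=lambda s: bleu_like(query, s), reverse=True)[:k]


sentences = [
    "delivery time is until 10 pm",
    "we accept credit cards",
    "parking is free on weekends",
]
print(top_k("what is the delivery time", sentences, k=1))
```

The returned sentences would then be mapped back to their source documents, which together with the query form the input to the QA model.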

5. Paraphrase Generation project

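A module that helps paraphrase data.
The biggest problem with chatbot builders is that it is hard for users to enter data.
When one intent is entered, generate many paraphrases of the sentences belonging to that intent. (Using beam search and sampling strategies, more than 100 paraphrases can be generated from a single input sentence.)
Apply a filtering mechanism: run the paraphrases through the Intent Detection module and include only those that pass.
To be implemented on top of SKT's KoBART, and the checkpoint will be released.
Once it is done, it would also be nice to put the model into a certain project I am working on with others at 카브.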
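
The filtering mechanism described above can be sketched as follows. This is a minimal sketch: `predict_intent` stands in for the Intent Detection module, and `toy_predict` is a keyword rule used only so that the example runs; both names are hypothetical.

```python
def filter_paraphrases(candidates, target_intent, predict_intent):
    """Keep unique candidates whose predicted intent matches the target intent."""
    seen = set()
    kept = []
    for sentence in candidates:
        if sentence in seen:
            continue  # generators often emit duplicates
        seen.add(sentence)
        if predict_intent(sentence) == target_intent:
            kept.append(sentence)
    return kept


def toy_predict(sentence):
    """Stand-in classifier for the example: a keyword rule, not a real model."""
    return "weather" if "rain" in sentence else "restaurant"


candidates = [
    "will it rain today",
    "is it going to rain",
    "book a table",
    "will it rain today",
]
print(filter_paraphrases(candidates, "weather", toy_predict))
```

Passing the classifier in as a callable keeps the filter independent of whichever intent model is eventually used.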

6. Dialogue Management (DST & DP) project

Kochat 1.0 handled DST and DP in a rule-based way, but I think parts like DST and DP can also be model-based.
There is no concrete plan yet. Some literature review is needed.

7. NLG project

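Take the resulting dialogue state + DP as input and generate an appropriate response.
There is no dataset, so one has to be built from scratch; this is expected to be the hardest task.
- It does not need to be large-scale like the NLU dataset; just one will be built, based on the first NLU dataset to be completed.
DP and dialogue state will differ for each user's data; the key question is whether a generative model can handle that.
SKT's KoBART should work for the implementation.
Depending on the quality: if it is decent, NLG becomes the default; if not, the template-based approach is used.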

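Deliverables

Papers: No particular venue is targeted, but the first goal is to post several papers (probably 3 to 4) on arXiv. When the opportunity arises, submit to top international conferences such as EMNLP 2021 and ACL 2021. (Accepted or not, let's submit.)
- (1) The translated NLU dataset (similar in spirit to KorNLI and KorSTS).
- (2) The Kochat library itself can be written up as a paper.
- (3) On the NLG side, I have not seen Next Utterance Generation from State + DP.
- (4) I think the way the NLU module does fallback checking has some degree of novelty.
Open-source software: Many people will be able to use a free, permanent chatbot builder.
Datasets: At least two datasets can be released (KoChat NLU dataset, KoChat NLG dataset).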

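How to change chatbot answers (API)

Hello, I am a beginner coder.
I have a question that came up while building a chatbot with the Kochat framework you developed.

I am filling the intent and entity datasets with my own data, but when I modify a scenario's entities, I get an error saying they do not match the API's parameters.

To fix that error and add the answers I want, it seems I need to change the API. Where and how do I change the API? Is it not included in the demo folder? If I have to build it from scratch, what should the skeleton look like?

Sorry for the basic question. I would appreciate an answer if you have the time.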

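Training on new data

Hello. First of all, thank you.
I was reading several books to implement a chatbot and was at a loss as to how to get started with a Korean chatbot, so the code you shared has been very helpful.
I have confirmed that the demo runs with the code you posted.
Next, I am trying to train it on a domain other than travel (e.g., emotions) and have a conversation with it.
The documentation does not seem to describe the procedure for creating new data,
so I looked at the intent_data, entity_data, and raw data in the data folder, modified them, and tried to proceed,
unicodedecodeerror: 'utf-8' codec can't decode byte 0xc1 in position 16: invalid start byte
but the error above occurred.
Are there any rules to follow when creating the data?
Thank you once again.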
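
A decode error like the one above commonly appears when a CSV edited in a spreadsheet tool on Korean Windows is saved as CP949 (EUC-KR) rather than UTF-8. That is an assumption about this report, not something the thread confirms. If it applies, re-encoding the file is a possible fix. The sketch below uses only the standard library, and the helper names are hypothetical.

```python
from pathlib import Path


def to_utf8_bytes(raw, source_encoding="cp949"):
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def reencode_file(src, dst, source_encoding="cp949"):
    """Rewrite src as a UTF-8 file at dst, working on bytes to leave newlines untouched."""
    Path(dst).write_bytes(to_utf8_bytes(Path(src).read_bytes(), source_encoding))


# In-memory demonstration with one CSV-like row saved as CP949.
row = "날씨,오늘 날씨 알려줘"
print(to_utf8_bytes(row.encode("cp949")).decode("utf-8") == row)
```

If decoding with `cp949` also fails, the file is in some other encoding and should be inspected before conversion.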