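Hello. First of all, thank you very much for sharing such a great open-source project. While going through the repository I became interested in the training side and tried to follow along...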

Question about training (koalpaca issue, closed)

beomi commented on July 16, 2024

I have a question about training.

from koalpaca.

Comments (3)

Beomi commented on July 16, 2024

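As mentioned in the thread,

you need to merge Instruction + Answer into a single column named text!

I left this part empty on purpose (so that everyone can handle it in their own way), because many different prompts can be added here.

Referring to app.py, train with JSON records of the form {"text":"### 질문: <question text> ### 답변: <answer text>"} (질문 = question, 답변 = answer) <== this is the approach I actually used for training :)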
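
The merge described above can be sketched as a small preprocessing script. This is only a minimal sketch, not the script used for the actual training: the key name `output` for the answer field and the helper names `to_text_record` and `convert_jsonl` are assumptions made for illustration, so adjust them to match your own data.

```python
import json

# Prompt layout quoted in the thread: question marker, then answer marker.
PROMPT_TEMPLATE = "### 질문: {question} ### 답변: {answer}"


def to_text_record(record, question_key="instruction", answer_key="output"):
    """Merge one instruction/answer pair into a single {"text": ...} record.

    NOTE: answer_key="output" is an assumption; use your dataset's key.
    """
    text = PROMPT_TEMPLATE.format(
        question=record[question_key].strip(),
        answer=record[answer_key].strip(),
    )
    return {"text": text}


def convert_jsonl(src_path, dst_path):
    """Rewrite a JSON Lines file so that every line has only a text field."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = to_text_record(json.loads(line))
            # ensure_ascii=False keeps the Korean text readable in the file
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `convert_jsonl("KoAlpaca_v1.1.jsonl", "train_text.jsonl")` (the output file name is hypothetical) would then produce a file with a single text column.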


Choiuijin1125 commented on July 16, 2024

This may not be an exact answer, but I ran into a similar situation, so I am sharing what I found.

If you look at tokenize_function in run_clm.py, tokenization uses only the column_names[0] column when the dataset has no text column.

For a file like KoAlpaca_v1.1.jsonl, column_names[0] is instruction, so only the instruction values are used during tokenization.

In my case, following issue #31, I preprocessed the data into the format of
https://github.com/Beomi/easy-lm-trainer/blob/main/data/text_ko_alpaca_data.jsonl and then trained.

Modifying tokenize_function would be another option ~

  if training_args.do_train:
      column_names = list(raw_datasets["train"].features)
  else:
      column_names = list(raw_datasets["validation"].features)
  text_column_name = "text" if "text" in column_names else column_names[0]

  # since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function
  tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

  def tokenize_function(examples):
      with CaptureLogger(tok_logger) as cl:
          output = tokenizer(examples[text_column_name])
      # clm input could be much much longer than block_size
      if "Token indices sequence length is longer than the" in cl.out:
          tok_logger.warning(
              "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
              " before being passed to the model."
          )
      return output
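
The alternative mentioned above, modifying tokenize_function instead of preprocessing the file, might look roughly like the following. This is a hedged sketch rather than the code in run_clm.py: `build_texts` and `make_tokenize_function` are hypothetical helpers, the answer key `output` is an assumption, and the logging capture from the original function is left out for brevity.

```python
# Same prompt layout as quoted earlier in the thread.
PROMPT_TEMPLATE = "### 질문: {question} ### 답변: {answer}"


def build_texts(examples, question_key="instruction", answer_key="output"):
    """Build one prompt string per row from a batched dict of columns.

    `examples` maps column names to lists, as passed to a batched map call.
    NOTE: answer_key="output" is an assumption; use your dataset's key.
    """
    return [
        PROMPT_TEMPLATE.format(question=q, answer=a)
        for q, a in zip(examples[question_key], examples[answer_key])
    ]


def make_tokenize_function(tokenizer):
    """Return a tokenize_function that tokenizes the merged prompt strings."""
    def tokenize_function(examples):
        # Tokenize instruction and answer together, not column_names[0] alone.
        return tokenizer(build_texts(examples))
    return tokenize_function
```

With this variant the data file keeps its original columns, and the prompt format lives in the training script, which makes it easy to try other prompts.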


ccsweets commented on July 16, 2024

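Thank you for the answers.

Following them, training went well. I changed two things:

1. Referring to app.py, I trained with JSON records of the form {"text":"### 질문: <question text> ### 답변: <answer text>"}.
2. On runpod.io, running with the --fp16 parameter made deepspeed print an overflow! message and training stopped, so I switched to --bf16 and continued training.

I still do not fully understand what the problem in item 2 was, but training now works as expected. Thank you.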
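
The thread does not confirm the cause of the overflow in item 2. One plausible factor is numeric range: fp16 cannot represent values above 65504, while bf16 keeps the exponent range of float32 at the cost of precision. The sketch below illustrates only that difference using the standard library; `fits_in_fp16` and `to_bf16` are hypothetical helpers, and the bf16 conversion is emulated by truncation, which is a simplification of real hardware rounding.

```python
import struct


def fits_in_fp16(value):
    """Return True if value can be packed as IEEE 754 half precision."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bf16(value):
    """Emulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


print(fits_in_fp16(65504.0))  # largest finite fp16 value
print(fits_in_fp16(70000.0))  # outside the fp16 range
print(to_bf16(70000.0))       # representable in bf16, with reduced precision
```

If range is indeed the issue, large gradients or activations would overflow under --fp16 but not under --bf16, which would be consistent with what was observed.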
