
dwzhu-pku / pose


Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024)

Home Page: https://arxiv.org/abs/2309.10400

License: MIT License

Shell 2.93% Python 97.07%

pose's Issues

LM Evaluate

Thank you so much for your work and for open-sourcing the code.
However, some problems occurred when running lm_eval; I suspect they are version-related.

  1. The current lm-evaluation-harness version no longer has a main function, so I modified the script to run via lm-eval instead;
  2. The task truthfulqa_mc no longer exists and has been replaced by truthfulqa_mc1 and truthfulqa_mc2; I assume truthfulqa_mc1 is the intended one;
  3. The hf-causal-experimental model type appears to be custom to the authors, because the original lm-evaluation-harness does not provide it.

Please correct me if I have misunderstood anything.

Example Training data

Hi, this seems like an amazing breakthrough to me, and I'm not sure why it isn't getting more attention.

Anyway, I was looking at your docs and in particular your scripts. Do you have links to the training/testing jsonl files you used? I wanted to look at them to get an idea of what to pass in.

Why is the training loss 0, and why does the run exit with "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."?

bash script/run_train_baichuan.sh 64 yarn

factor=$1
rope_type=$2

debug_mode="-m debugpy --listen 127.0.0.1:6679 --wait-for-client"
# Debugging launcher, kept as an alternative to the deepspeed launcher below:
# python -m torch.distributed.run --nproc_per_node=1 ${debug_mode} src/train_baichuan.py

deepspeed src/train_baichuan.py \
    --model_name_or_path ./baichuan2-7b-base \
    --train_data_path ./data/pile/train_00_long_10w.jsonl \
    --valid_data_path ./data/pile/val_long.jsonl \
    --test_data_path ./data/pile/test_pg19.jsonl \
    --output_dir ./skipos/results/baichuan2/4k-$((factor*4))k-${rope_type} \
    --max_steps 1000 \
    --model_max_position_embeddings 4096 \
    --rope_scaling_type ${rope_type} \
    --rope_scaling_factor $factor \
    --inference_length 16384 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --do_train True \
    --do_eval True \
    --do_predict True \
    --evaluation_strategy "steps" \
    --eval_steps 50 \
    --save_strategy "steps" \
    --save_steps 500 \
    --warmup_steps 0 \
    --learning_rate 2e-5 \
    --logging_steps 10 \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --fp16 True \
    --deepspeed src/configs/deepspeed_config.json
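
For context on the error in this issue's title: with --fp16 True, DeepSpeed uses a dynamic loss scaler, and repeated overflows shrink the scale until it hits its minimum, which aborts the run with "Current loss scale already at minimum - cannot decrease scale anymore." The sketch below shows the kind of fp16 block such a DeepSpeed config typically contains; the values are illustrative assumptions, not the contents of this repo's src/configs/deepspeed_config.json.

    # Illustrative fp16 settings for a DeepSpeed config (assumed values; check the repo's
    # actual src/configs/deepspeed_config.json). With dynamic loss scaling ("loss_scale": 0),
    # repeated fp16 overflows lower the scale until min_loss_scale is reached.
    deepspeed_fp16_config = {
        "fp16": {
            "enabled": True,             # mirrors --fp16 True in the launch script
            "loss_scale": 0,             # 0 = dynamic loss scaling
            "initial_scale_power": 16,   # initial scale = 2**16
            "loss_scale_window": 1000,   # stable steps before the scale is raised again
            "hysteresis": 2,             # overflows tolerated before lowering the scale
            "min_loss_scale": 1,         # floor; hitting it triggers the error above
        }
    }
    # This dict (extended with optimizer/ZeRO settings) can be passed as
    # transformers.TrainingArguments(deepspeed=deepspeed_fp16_config) or saved to a JSON file.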

Comparative experiments with PI directly at 2k length

Hi @dwzhu-pku, have you ever tried performing PI directly at a length of 2k and compared it with using PoSE to perform PI at a length of 2k, also fine-tuning for 1000 steps with the same parameters?

My results show that there is not much difference between the two, but I did not find any relevant comparative experiments in the paper.

Please correct me if I'm wrong.

How long does training take?

Hi, thanks for the nice work! I see you mentioned you use 8x V100 for training. I wonder how long training takes, e.g., for Llama2? And is there any modification I need to make if I want to use FlashAttention during training with A100 GPUs? Many thanks!
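
For the FlashAttention part of the question, a hedged sketch of how it is commonly enabled with a recent transformers release follows; whether this interacts cleanly with the PoSE training script and its position-index manipulation is not something this sketch verifies.

    # Hedged sketch: loading a model with FlashAttention-2 on A100 GPUs using a recent
    # transformers version (>= 4.36). Not taken from this repo's training code.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",               # example checkpoint; substitute your own
        torch_dtype=torch.bfloat16,               # A100s support bf16, avoiding fp16 loss-scale issues
        attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
    )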

Training process

Hello. My understanding is that to extend to a length of 4096, a 4096-token example is split into two 2048-token examples, so one example becomes two (with some terms added to link the two, of course). But during training, a single sample is still only 2048 tokens long; the model never actually sees 4096 tokens. How, then, can extension to arbitrarily long lengths be guaranteed?
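
For intuition on the question above: PoSE does not feed 4096 tokens to the model. It keeps a 2048-token sample and manipulates the position indices so that relative distances up to the target window still occur during training. A minimal numeric sketch of that idea (illustrative only, not the repo's exact code):

    # Illustrative: a 2048-token sample whose position indices cover relative distances up to 4095.
    train_len = 2048               # tokens actually fed to the model
    target_len = 4096              # context window being trained for

    half = train_len // 2          # split the position indices into two chunks of 1024
    skip = target_len - train_len  # largest skipping bias for this target (2048)
    position_ids = list(range(0, half)) + list(range(half + skip, train_len + skip))

    print(position_ids[0], position_ids[half - 1])  # 0 1023
    print(position_ids[half], position_ids[-1])     # 3072 4095
    # Largest relative distance seen: 4095, from only 2048 tokens. Sampling the bias
    # (and chunk lengths) randomly per example covers the intermediate distances too.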

A question about data preprocessing

chunked_ids = ids[lt1:rt1] + ids[lt2:rt2]

Hi, at line 172 of the file, it appears that two non-adjacent fragments are cut from the original training data, and each fragment is then assigned continuous position indices in subsequent processing. Why is this done?
In a real SFT scenario, the raw input is often a continuous fragment. What is the problem with using continuous fragments and applying PoSE?
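
As background for the question, here is a hedged sketch of what the chunking around that line appears to be doing (variable names follow the quoted code; the exact sampling in the repo may differ): two non-adjacent token fragments are concatenated into one short sample, and the gap between them is preserved in the position indices, so the model trains on long relative distances without long inputs.

    import random

    # Hedged sketch of skip-wise chunking; not the repo's exact logic around line 172.
    def make_pose_example(ids, train_len=2048, target_len=4096):
        # Assumes len(ids) >= train_len; longer documents allow larger skips.
        c1 = random.randint(1, train_len - 1)      # length of the first fragment
        c2 = train_len - c1                        # length of the second fragment
        lt1, rt1 = 0, c1                           # first fragment: ids[0:c1]
        lt2 = random.randint(rt1, min(len(ids), target_len) - c2)
        rt2 = lt2 + c2                             # second fragment, skipped ahead
        chunked_ids = ids[lt1:rt1] + ids[lt2:rt2]  # train_len tokens in total
        # Each fragment keeps continuous position indices, but the jump between the
        # fragments preserves the skip, so relative distances up to ~target_len appear.
        position_ids = list(range(lt1, rt1)) + list(range(lt2, rt2))
        return chunked_ids, position_ids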

To create a model for textual similarity tasks involving JSON-structured data

My intention is to use PoSE to pre-train/fine-tune an LLM on diverse structured data such as JSON and XML documents.
Unlike natural language text, structured data does not have meaningful units such as sentences; instead it consists of key-value pairs, nested objects, arrays, etc. Besides that, I am dealing with very large documents, such as XML/JSON files with 100k+ tokens.
Do you think I can use PoSE for that?

Regarding the script code: I am struggling to match the code with the paper.
Figure 6: Python Code used for calculating coverage probability of each relative position in Figure 5.

Could you briefly explain what these methods are doing?
train_pose.py

  1. smart_tokenizer_and_embedding_resize (a rough sketch of this one is given after this issue)
  2. DataCollatorForSupervisedDataset
  3. train_preprocess_function_randomized
  4. train_preprocess_function_pose
  5. Specifically for train_preprocess_function_pose, is it correct that the variable lt1 = 0 is assigned once and never changed?

Thank you in advance.
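
As a partial pointer for item 1 above: in Alpaca-style training scripts, which train_pose.py appears to follow, smart_tokenizer_and_embedding_resize usually adds any missing special tokens (e.g. a pad token) and resizes the embedding matrices, initializing the new rows to the mean of the existing embeddings. A rough sketch of that common pattern (not copied from this repo):

    # Common Alpaca-style helper; the repo's version may differ in details.
    def smart_tokenizer_and_embedding_resize(special_tokens_dict, tokenizer, model):
        num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
        model.resize_token_embeddings(len(tokenizer))
        if num_new_tokens > 0:
            input_embeddings = model.get_input_embeddings().weight.data
            output_embeddings = model.get_output_embeddings().weight.data
            # Initialize the new token rows to the average of the existing embeddings.
            input_embeddings[-num_new_tokens:] = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
            output_embeddings[-num_new_tokens:] = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

In the same style of code, DataCollatorForSupervisedDataset typically pads input_ids and labels to the batch maximum and masks the padding positions in the labels; for the exact behavior of the two train_preprocess_function_* variants, the repo code itself is the authority.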
