
llm-sagemaker-sample's Introduction

End-to-End LLMOps for open LLMs on Amazon SageMaker

This repository provides an end-to-end example of applying LLMOps practices to open large language models (LLMs) on Amazon SageMaker. It demonstrates a sample pipeline for training, optimizing, deploying, monitoring, and managing LLMs on SageMaker using infrastructure-as-code principles.

Currently implemented:

  • Training and deploying LLMs on SageMaker
  • Optimizing LLMs with Quantization (coming soon)
  • LLMOps pipeline for training, optimizing, and deploying LLMs on SageMaker (coming soon)
  • Monitoring and managing LLMs with CloudWatch (coming soon)

Contents

The repository currently contains:

  • scripts/: Scripts for training and deploying LLMs on SageMaker
  • notebooks/: Examples and tutorials for using the pipeline

Pre-requisites

Before you start, make sure you have met the following requirements:

  • An AWS account with sufficient quota for the SageMaker instance types used (e.g., ml.g5)
  • AWS CLI installed
  • An AWS IAM user configured in the CLI with permission to create and manage EC2 instances
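
A minimal sketch of the session and role setup the notebooks assume (the fallback role lookup by name is an assumption for running outside SageMaker Studio or notebook instances):

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    role = sagemaker.get_execution_role()  # works inside SageMaker Studio / notebook instances
except ValueError:
    iam = boto3.client("iam")
    # hypothetical role name -- replace with your own SageMaker execution role
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")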

Contributions

Contributions are welcome! Please open issues and pull requests.

License

This repository is licensed under the MIT License.

llm-sagemaker-sample's People

Contributors

aniszakari, philschmid


llm-sagemaker-sample's Issues

VRAM Requirements

Hi, thanks for publishing this example.

With Mixtral + TGI, is it actually required to fit the full model in VRAM? Or, is it possible to opt for 100GB+ of system memory with lower GPU capacity?

ml.g5.48xlarge instances are quite expensive, so I'm looking for options to reduce deployment costs.
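
For context, as far as I know TGI keeps the weights in GPU memory and does not offload them to system RAM, so the model still has to fit in aggregate VRAM; quantized weights on a smaller multi-GPU instance are the usual way to cut cost. A minimal sketch (instance choice, TGI version, and the GPTQ setting are assumptions to verify, not the repo's tested config):

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri("huggingface", version="1.4.0")  # TGI DLC; version is an assumption

config = {
    "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "SM_NUM_GPUS": json.dumps(4),           # ml.g5.12xlarge has 4x A10G (96 GB GPU memory in total)
    "HF_MODEL_QUANTIZE": "gptq",            # assumes GPTQ support in this TGI version; ~4-bit weights
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
}

llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)  # role as defined in the notebook

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",   # cheaper than ml.g5.48xlarge; whether Mixtral fits needs verification
    container_startup_health_check_timeout=600,
)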

Datasets version 2.13.0 leads to conflicts

In the train-deploy-llm.ipynb notebook, when running dataset = load_dataset("databricks/databricks-dolly-15k", split="train"),
I came across the following error:

Error:
ValueError: Invalid pattern: '**' can only be an entire path component

Solution:
I was able to resolve this issue by updating the datasets version:
!pip install --upgrade datasets
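
If you prefer pinning over a blanket upgrade, something like the following should also work (the exact lower bound is an assumption; any recent datasets release that handles the newer fsspec glob behaviour avoids this error):

!pip install "datasets>=2.16.0" --upgrade  # assumed lower bound; resolves the '**' glob pattern error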

Complete Error Log:

ValueError                                Traceback (most recent call last)
Cell In[4], line 5
      2 from random import randrange
      4 # Load dataset from the hub
----> 5 dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
      7 print(f"dataset size: {len(dataset)}")
      8 print(dataset[randrange(len(dataset))])

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/load.py:1773, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1768 verification_mode = VerificationMode(
   1769     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   1770 )
   1772 # Create a dataset builder
-> 1773 builder_instance = load_dataset_builder(
   1774     path=path,
   1775     name=name,
   1776     data_dir=data_dir,
   1777     data_files=data_files,
   1778     cache_dir=cache_dir,
   1779     features=features,
   1780     download_config=download_config,
   1781     download_mode=download_mode,
   1782     revision=revision,
   1783     use_auth_token=use_auth_token,
   1784     storage_options=storage_options,
   1785     **config_kwargs,
   1786 )
   1788 # Return iterable dataset in case of streaming
   1789 if streaming:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/load.py:1502, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, storage_options, **config_kwargs)
   1500     download_config = download_config.copy() if download_config else DownloadConfig()
   1501     download_config.use_auth_token = use_auth_token
-> 1502 dataset_module = dataset_module_factory(
   1503     path,
   1504     revision=revision,
   1505     download_config=download_config,
   1506     download_mode=download_mode,
   1507     data_dir=data_dir,
   1508     data_files=data_files,
   1509 )
   1511 # Get dataset builder class from the processing script
   1512 builder_cls = import_main_class(dataset_module.module_path)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/load.py:1219, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1214             if isinstance(e1, FileNotFoundError):
   1215                 raise FileNotFoundError(
   1216                     f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1217                     f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1218                 ) from None
-> 1219             raise e1 from None
   1220 else:
   1221     raise FileNotFoundError(
   1222         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory."
   1223     )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/load.py:1203, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1188         return HubDatasetModuleFactoryWithScript(
   1189             path,
   1190             revision=revision,
   (...)
   1193             dynamic_modules_path=dynamic_modules_path,
   1194         ).get_module()
   1195     else:
   1196         return HubDatasetModuleFactoryWithoutScript(
   1197             path,
   1198             revision=revision,
   1199             data_dir=data_dir,
   1200             data_files=data_files,
   1201             download_config=download_config,
   1202             download_mode=download_mode,
-> 1203         ).get_module()
   1204 except (
   1205     Exception
   1206 ) as e1:  # noqa: all the attempts failed, before raising the error we should check if the module is already cached.
   1207     try:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/load.py:769, in HubDatasetModuleFactoryWithoutScript.get_module(self)
    759 def get_module(self) -> DatasetModule:
    760     hfh_dataset_info = HfApi(config.HF_ENDPOINT).dataset_info(
    761         self.name,
    762         revision=self.revision,
    763         token=self.download_config.use_auth_token,
    764         timeout=100.0,
    765     )
    766     patterns = (
    767         sanitize_patterns(self.data_files)
    768         if self.data_files is not None
--> 769         else get_data_patterns_in_dataset_repository(hfh_dataset_info, self.data_dir)
    770     )
    771     data_files = DataFilesDict.from_hf_repo(
    772         patterns,
    773         dataset_info=hfh_dataset_info,
    774         base_path=self.data_dir,
    775         allowed_extensions=ALL_ALLOWED_EXTENSIONS,
    776     )
    777     split_modules = {
    778         split: infer_module_for_data_files(data_files_list, use_auth_token=self.download_config.use_auth_token)
    779         for split, data_files_list in data_files.items()
    780     }

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/data_files.py:658, in get_data_patterns_in_dataset_repository(dataset_info, base_path)
    656 resolver = partial(_resolve_single_pattern_in_dataset_repository, dataset_info, base_path=base_path)
    657 try:
--> 658     return _get_data_files_patterns(resolver)
    659 except FileNotFoundError:
    660     raise EmptyDatasetError(
    661         f"The dataset repository at '{dataset_info.id}' doesn't contain any data files"
    662     ) from None

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/data_files.py:223, in _get_data_files_patterns(pattern_resolver)
    221 try:
    222     for pattern in patterns:
--> 223         data_files = pattern_resolver(pattern)
    224         if len(data_files) > 0:
    225             non_empty_splits.append(split)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/datasets/data_files.py:471, in _resolve_single_pattern_in_dataset_repository(dataset_info, pattern, base_path, allowed_extensions)
    469 else:
    470     base_path = "/"
--> 471 glob_iter = [PurePath(filepath) for filepath in fs.glob(PurePath(pattern).as_posix()) if fs.isfile(filepath)]
    472 matched_paths = [
    473     filepath
    474     for filepath in glob_iter
   (...)
    481     )
    482 ]  # ignore .ipynb and __pycache__, but keep /../
    483 if allowed_extensions is not None:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/fsspec/spec.py:606, in AbstractFileSystem.glob(self, path, maxdepth, **kwargs)
    602         depth = None
    604 allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
--> 606 pattern = glob_translate(path + ("/" if ends_with_sep else ""))
    607 pattern = re.compile(pattern)
    609 out = {
    610     p: info
    611     for p, info in sorted(allpaths.items())
   (...)
    618     )
    619 }

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/fsspec/utils.py:734, in glob_translate(pat)
    732     continue
    733 elif "**" in part:
--> 734     raise ValueError(
    735         "Invalid pattern: '**' can only be an entire path component"
    736     )
    737 if part:
    738     results.extend(_translate(part, f"{not_sep}*", not_sep))

ValueError: Invalid pattern: '**' can only be an entire path component

Error when deploying mixtral

I get a very non-specific error when deploying mixtral to sagemaker:

Traceback (most recent call last):
  File "XXX", line 47, in <module>
    huggingface_model.deploy(
  File "XXX", line 315, in deploy
    return super(HuggingFaceModel, self).deploy(
  File "/XXX", line 1654, in deploy
    self.sagemaker_session.endpoint_from_production_variants(
  File "/XXX", line 5380, in endpoint_from_production_variants
    return self.create_endpoint(
  File "XXX", line 4291, in create_endpoint
    self.wait_for_endpoint(endpoint_name, live_logging=live_logging)
  File "XXX", line 5023, in wait_for_endpoint
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error hosting endpoint XXX: Failed. Reason: Request to service failed. If failure persists after retry, contact customer support.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html

AWS has not created a log group in CloudWatch at this time.

Is anyone else experiencing the same problem?

Can not redeploy the model

Hi @philschmid,

Thanks for making this repo, it was a huge help!
I successfully trained and deployed the model to a SageMaker endpoint. However, after I deleted the endpoint when I was done with it and wanted to recreate it, I could not do so.

For context, I manually retrieved the S3 URL of my model and passed it as the model data path, roughly as in the sketch below.
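
A minimal sketch of that setup (paths, role, and image version are placeholders or assumptions, not the actual values; note that the error further down shows an empty role string, which is worth double-checking):

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# placeholders / assumptions, not the actual values from the failing deployment
model_s3_path = "s3://<bucket>/huggingface-qlora-mistralai-Mistral-7B--2023-10-06-11-27-09-016/output/model/"
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")  # assumed TGI DLC version

huggingface_model = HuggingFaceModel(
    # dict form for uncompressed model artifacts (requires a recent sagemaker SDK)
    model_data={"S3DataSource": {"S3Uri": model_s3_path, "S3DataType": "S3Prefix", "CompressionType": "None"}},
    role=role,
    image_uri=llm_image,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)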

I am unable to figure out why I am not able to deploy the model even though the s3 path is pointing to the correct location and my role has all the required permissions.

I get the following error:

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not access model data at /huggingface-qlora-mistralai-Mistral-7B--2023-10-06-11-27-09-016/output/model/. Please ensure that the role "" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in eu-west-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission.

Truly would appreciate your help!

[Question] SFT task

Hi, probably a dumb question, but in your Mistral fine-tuning notebook example, is the next-token prediction objective applied to the entire instruction+context+answer prompt rather than only to the portion that corresponds to the answer?

It seems like the former, because the whole prompt is created at once and I don't see any information being given to the model about where the question and context portions are in order to mask them out of the loss calculation. I want to make sure I understand the training objective here. Thanks!
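
If the goal were answer-only loss, the usual approach is to mask the prompt tokens out of the labels with -100; a minimal sketch (model id and helper are assumptions, not the repo's code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed model id

def build_example(prompt: str, answer: str, max_len: int = 2048):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    # positions labelled -100 are ignored by the loss, so only the answer tokens contribute
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}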

training zephyr-7b

Hey @philschmid, I have trained zephyr-7b with QLoRA, but when I try to deploy the LLM I run into a problem:
[screenshot attached]
So I couldn't create the endpoint.
Can anyone explain why this is happening and how I can solve this issue?

notebooks/deploy-mixtral.ipynb issue

This is not really an issue, but I couldn't find any other way to contact you. I was trying to follow your instructions on https://www.philschmid.de/sagemaker-deploy-mixtral and ended up in this repository.

I tried to follow the deployment instructions, but the deployment was not successful. I got the following error logs on the inference endpoint:

2023-12-15T20:06:10.216+01:00
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 161, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 310, in get_model
    return FlashMixtral(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 21, in __init__
    super(FlashMixtral, self).__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 318, in __init__
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

The HF image that I ended up using was 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.3.1-gpu-py310-cu121-ubuntu20.04

Looking into TGI issues, I found this thread. It seems to be fixed by a commit mentioned there.
But I don't know how I can get the 1.3.3 DLC image for a SageMaker deployment, because when I specify that version in image_uris.retrieve or in get_huggingface_llm_image_uri, it complains:

ValueError: Unsupported huggingface-llm version: 1.3.3. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface-llm versions. Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 1.0.3, 1.1.0, 1.2.0, 1.3.1, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3. 

I don't know the procedure for getting the latest version into the AWS DLC registry, or how to use a custom-built DLC image when deploying to SageMaker.
Can you help in any way, or explain how your deployment works?
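
For what it's worth, one workaround that should be possible (an assumption, not something tested here) is to skip the SDK's version lookup entirely and pass the container URI directly, since HuggingFaceModel accepts an explicit image_uri; the tag below is hypothetical and needs to be checked against the published DLC tags:

from sagemaker.huggingface import HuggingFaceModel

# hypothetical tag -- verify against the available huggingface-pytorch-tgi-inference images
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.3.3-gpu-py310-cu121-ubuntu20.04"

llm_model = HuggingFaceModel(
    role=role,            # execution role from the notebook
    image_uri=llm_image,  # explicit URI bypasses get_huggingface_llm_image_uri's version allow-list
    env=config,           # same TGI config as in the notebook
)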

Thanks in advance

"trust_remote_code" script parameter is not handled in scripts/run_qlora.py training_function

Models that require trust_remote_code=True are not supported by the script. E.g., I get the following error when trying to train Phi-1.5.

Error:

ErrorMessage "EOFError
EOF when reading a line

During handling of the above exception, another exception occurred
Traceback (most recent call last):
  File "/opt/ml/code/run_qlora.py", line 194, in <module>
    main()
  File "/opt/ml/code/run_qlora.py", line 190, in main
    training_function(script_args, training_args)
  File "/opt/ml/code/run_qlora.py", line 97, in training_function
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 525, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1037, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "/opt/conda/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 608, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for microsoft/phi-1_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/phi-1_5.
Please pass the argument trust_remote_code=True to allow custom code to be run."

======

The following is the estimator used:

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_qlora.py',    # train script
    source_dir           = '../scripts',      # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache1" },  # set env variable to cache models in /tmp
    disable_output_compression = True,        # don't compress output to save training time and cost
    trust_remote_code    = True
)
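
A minimal sketch of how the script could be extended to honour such a flag (argument names and wiring are assumptions, not the repo's actual run_qlora.py); the flag would then be passed through hyperparameters rather than as an estimator argument:

from dataclasses import dataclass, field
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class ScriptArguments:
    model_id: str = field(default="microsoft/phi-1_5")  # hypothetical default
    trust_remote_code: bool = field(default=False)      # exposed as a training hyperparameter

def load_model_and_tokenizer(args: ScriptArguments):
    # forwarding the flag lets models with custom modelling code (e.g. Phi-1.5) load non-interactively
    model = AutoModelForCausalLM.from_pretrained(args.model_id, trust_remote_code=args.trust_remote_code)
    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=args.trust_remote_code)
    return model, tokenizer

# in the notebook: hyperparameters = {..., "trust_remote_code": True}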

Fine Tuning Mixtral 8x7b

Hi there,

Thanks for the scripts and posts! I am interested in fine-tuning Mixtral 8x7b on sagemaker. The task I have requires around 8k token length.

I have tried running training following this tutorial: https://solano-todeschini.medium.com/fine-tune-mixtral-8x7b-on-aws-sagemaker-and-deploy-to-runpod-6bbb79981d7b#31b4, but using this updated script instead https://www.philschmid.de/sagemaker-train-evalaute-llms-2024.

The first post uses a ml.g5.24xlarge instance, which, funnily enough, has no sharding or fsdp parameter set up. When I try running the same setup with increased context length, I get an OOM. I went up to a ml.g5.48xlarge instance with 192 GB VRAM, but nothing changed.

I also looked into this: https://www.philschmid.de/sagemaker-fsdp-gpt and tried setting up FSDP by adding 'fsdp': '"full_shard auto_wrap"', roughly as in the sketch below.
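
For reference, a sketch of the FSDP-related hyperparameters (values are assumptions, not a known-good config). Note that activation memory grows roughly linearly with sequence length (worse without FlashAttention), which is why longer contexts can OOM even when the weights fit, so gradient checkpointing matters as much as sharding:

hyperparameters = {
    "model_id": "mistralai/Mixtral-8x7B-v0.1",   # assumed model id
    "max_seq_length": 8192,
    "fsdp": '"full_shard auto_wrap"',            # shard parameters, gradients and optimizer state across GPUs
    "gradient_checkpointing": True,              # recompute activations to cut memory at longer sequence lengths
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
}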

According to this chart https://github.com/hiyouga/LLaMA-Factory#hardware-requirement, the estimated memory requirement should be around 30-60 GB for the model? How much does the context length affect this?

I also saw that here https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/sagemaker-notebook.ipynb you are using a much larger instance, but I'm not sure whether that's just because it is older.

PS: I am using pretty much the same parameters you do in the post except the max_seq_len, with the addition of fsdp

Any insight would be greatly appreciated.

not enforcing datasets version leads to conflicts

Hey Philipp, thank you for your amazing work !

In the train-deploy-llm.ipynb notebook, when running huggingface_estimator.fit(data, wait=True),
I noticed that if we don't pin the datasets version, it leads to conflicts during the installation of requirements.txt:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tokenizers 0.14.1 requires huggingface_hub<0.18,>=0.16.4, but you have huggingface-hub 0.20.3 which is incompatible.

What I suggest is pinning the datasets version in requirements.txt:

# Requirements.txt
transformers==4.34.0
datasets==2.14.0
peft==0.4.0
accelerate==0.23.0
bitsandbytes==0.41.1
safetensors>=0.3.3
packaging
ninja

Error with disable_output_compression = True

@philschmid thanks for sharing your sagemaker guides on fine-tuning.

In the code for the Mistral 7B fine-tune with FlashAttention 2, there is a parameter at the end of the HF Estimator:
https://www.philschmid.de/sagemaker-mistral

disable_output_compression = True

This generates an error on SageMaker:
ParamValidationError: Parameter validation failed: Unknown parameter in OutputDataConfig: "CompressionType", must be one of: KmsKeyId, S3OutputPath

I tried adding you on LD to message you, but you didn't reply, so I thought sending you this here might help.
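
For what it's worth, that ParamValidationError usually means the installed SageMaker Python SDK / botocore predates the CompressionType support in OutputDataConfig; a quick check (the exact minimum version is an assumption, but upgrading generally resolves it):

import sagemaker
print(sagemaker.__version__)  # if it's old, upgrade with: pip install -U sagemaker boto3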

CodeLLama support in SageMaker HF integration

Hey Phil, thanks for putting together these tutorials.

I am trying to fine-tune CodeLlama using HF SageMaker, but I am facing errors with the tokenizer. I think the provided transformers images have version 4.28, which may not support CodeLlama. I find this weird, as Mistral is a more recent model and you made it work without problems.

Do you have any tips on how to solve this?

I managed to train the model, but had to skip the tokenizer.save_pretrained(...) part, and now I can't deploy because I need to load the tokenizer back...

Having a greater chunk length than 2048 in packing leads to OOM error

Hi @philschmid,

When I try to increase the chunk length beyond 2048, the training fails with an OOM error on g5.4xlarge.
It totally makes sense why this happens; my question is how you would recommend using the g5.12xlarge instance, which has 4x the GPUs and consequently 4x the VRAM, to train the model.

I found this resource on HF for model parallelism: https://huggingface.co/docs/sagemaker/train#distributed-training. However, when I tried using it with the following config,

mpi_options = {
    "enabled" : True,
    "processes_per_host" : 4
}

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "ddp_dist_backend": "auto", #OR "nccl" to disable SMDDP Collectives
        "partitions": 2,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

I ran into the following error

UnexpectedStatusException: Error for Training job huggingface-qlora-HuggingFaceH4-zephyr--2023-11-03-16-29-02-663: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 134 ErrorMessage "ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/flash_attn-0.2.8.dist-info/'

Is there any way to solve this? And is model parallelism the method you would recommend for making use of the g5.12xlarge instance?
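
A hedged alternative sketch: plain data parallelism (one QLoRA replica per GPU) on ml.g5.12xlarge via the estimator's torch_distributed launcher instead of SageMaker model parallelism; this is an assumption about what could work, not the repo's tested setup. Note that data parallelism replicates the model on every GPU, so it does not give a single replica a larger memory budget; for a chunk length above 2048 you would still lean on gradient checkpointing or on sharding (FSDP / DeepSpeed ZeRO).

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point          = "run_qlora.py",
    source_dir           = "../scripts",
    instance_type        = "ml.g5.12xlarge",
    instance_count       = 1,
    transformers_version = "4.28",
    pytorch_version      = "2.0",
    py_version           = "py310",
    role                 = role,                 # as defined in the notebook
    hyperparameters      = hyperparameters,      # as defined in the notebook
    distribution         = {"torch_distributed": {"enabled": True}},  # launches one process per GPU
)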
