
fcc-intro-to-llms's Introduction

FreeCodeCamp - Building LLMs from Scratch

Dependencies (assuming Windows): pip install pylzma numpy ipykernel jupyter torch --index-url https://download.pytorch.org/whl/cu118

If you don't have an NVIDIA GPU, the device parameter will default to 'cpu', since device = 'cuda' if torch.cuda.is_available() else 'cpu'. If device defaults to 'cpu', that is fine; you will just experience slower runtimes.
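
For reference, a minimal, self-contained sketch of that device selection (everything beyond the quoted line is illustrative):

import torch

# Falls back to the CPU automatically when no CUDA-capable GPU is present.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Tensors and models are then moved onto whichever device was selected.
x = torch.randn(2, 3).to(device)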

All the links you should need are in this repo. I will add detailed explanations as questions and issues are posted.

Visual Studio 2022 (for lzma compression algo) - https://visualstudio.microsoft.com/downloads/

OpenWebText Download

Socials

Twitter / X - https://twitter.com/elliotarledge

My YouTube Channel - https://www.youtube.com/channel/UCjlt_l6MIdxi4KoxuMjhYxg

How to SSH from Mac to Windows - https://www.youtube.com/watch?v=7hBeAb6WyIg&t=

How to Setup Jupyter Notebooks in 5 minutes or less - https://www.youtube.com/watch?v=eLmweqU5VBA&t=

Linkedin - https://www.linkedin.com/in/elliot-arledge-a392b7243/

Join My Discord Server - https://discord.gg/pV7ByF9VNm

Schedule a 1-on-1: https://calendly.com/elliot-ayxc/60min

Research Papers:

Attention is All You Need - https://arxiv.org/pdf/1706.03762.pdf

A Survey of LLMs - https://arxiv.org/pdf/2303.18223.pdf

QLoRA: Efficient Finetuning of Quantized LLMs - https://arxiv.org/pdf/2305.14314.pdf

fcc-intro-to-llms's People

Contributors

infatoshi


fcc-intro-to-llms's Issues

file not found

The train_split.txt file is not available, so I can't run the code properly.

RuntimeError: The size of tensor a (64) must match the size of tensor b (65) at non-singleton dimension 2

When I try to run chatbot.py, it spits out this error while generating the response:

Traceback (most recent call last):
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 199, in
generated_chars = decode(m.generate(context.unsqueeze(0), max_new_tokens=150)[0].tolist())
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 176, in generate
logits, loss = self.forward(index_cond)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 156, in forward
x = self.blocks(x) # (B,T,C)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
input = module(input)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 121, in forward
y = self.sa(x)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 88, in forward
out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 88, in
out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 67, in forward
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
RuntimeError: The size of tensor a (64) must match the size of tensor b (65) at non-singleton dimension 2

Any idea how to fix this, please?
GPU: 4070 12 GB
torch: 2.3.0+cu121
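
One likely cause, offered here as an assumption based on the traceback rather than a confirmed diagnosis: the sequence length T reaching the attention head exceeds the block_size used to register the tril buffer, so self.tril[:T, :T] ends up one element smaller than wei along that dimension. A minimal sketch of the usual remedy, cropping the running context to the last block_size tokens before each forward pass (block_size = 64 is a hypothetical value; the real one is set in chatbot.py):

import torch

block_size = 64  # hypothetical; must match the value the model was trained with

# During generation the context keeps growing, so crop it before each forward
# pass; otherwise T exceeds the (block_size, block_size) tril buffer and
# masked_fill sees mismatched shapes.
context = torch.zeros((1, 65), dtype=torch.long)  # one token longer than block_size
index_cond = context[:, -block_size:]
print(index_cond.shape)  # torch.Size([1, 64])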

GPU and pylzma issues

I have two questions.
First off, my graphics card is an AMD Radeon RX 580 2048SP, so nothing CUDA-capable. Can I still build this? And if so, how?

That leads to my next question. Is the pylzma install dependent on CUDA, and is that why it won't work? I tried running pip install pylzma (and pip3 as well) without being in the (cuda) environment, from (base) D:\Python Testing\, and it wouldn't install then either.

Is there a different option? Can I bypass this install? The error I get is that io.h is not found. I installed the Visual Studio build tools and selected the Windows 10 SDK. Please share how to fix this error in baby steps for noobs like me. Nothing has worked so far, and there's nothing on Stack Overflow either.
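
A hedged side note, not a confirmed fix from this thread: the extraction script further down this page only ever reads .xz files, and Python's standard-library lzma module can do that without any compiled extension, which sidesteps the io.h build error entirely:

import lzma

# lzma ships with CPython 3.3+, so no compiler or Windows SDK is required.
# "example.xz" is a hypothetical placeholder for one of the OpenWebText files.
with lzma.open("example.xz", "rt", encoding="utf-8") as f:
    print(f.read(200))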

Cuda on Mac M1

When I attempted to run torch.cuda.is_available() on my Mac, I got 'cpu' as the device. After briefly searching on Google, I learned that CUDA is not supported on Mac. Is there some workaround for this?
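
Two hedged notes, not from the original thread: torch.cuda.is_available() itself returns a boolean, so the 'cpu' value comes from the device-selection line in the README. CUDA is NVIDIA-only and will never be available on an M1, but recent PyTorch builds (1.12+) expose Apple's Metal backend ('mps'), which the same pattern can check for:

import torch

# CUDA is NVIDIA-only, so it is never available on Apple Silicon.
# PyTorch 1.12+ exposes the Metal backend ("mps") instead.
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
print(f"Using device: {device}")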

Issues with extracting the data from openwebtext

I'm using the following code from the video:

import os
import lzma
from tqdm import tqdm

def xf_files_in_dr(directory):
  # Collect every .xz file name in the given directory.
  files = []
  for filename in os.listdir(directory):
    if filename.endswith(".xz") and os.path.isfile(os.path.join(directory, filename)):
      files.append(filename)
  return files

folder_path="X:/projects/python/llm-python/openwebtext/subsets/urlsf_subset02/openwebtext"
output_file_train = "train_split.txt"
output_file_val = "val_split.txt"

vocab_file = "vocab.txt"
split_files = int(input("how many files would you like to split this into? "))

files = xf_files_in_dr(folder_path)
total_files = len(files)

split_index = int(total_files * 0.9)  # 90/10 train/val split
files_train = files[:split_index]
files_val = files[split_index:]

max_count = total_files // split_files if split_files != 0 else total_files

vocab = set()

# Write the training split and collect its characters for the vocab.
with open(output_file_train, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_train, total=len(files_train)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)


# Write the validation split, also collecting characters for the vocab.
with open(output_file_val, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_val, total=len(files_val)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)

with open(vocab_file,"w", encoding="utf-8") as vfile:
  for char in vocab:
    vfile.write(char + '\n')

The folder path is a little different from the video because OpenWebText now downloads as smaller subset folders instead of one huge folder, but no issues there so far.

The problem comes when you actually create the vocab.txt file and, I assume, the train_split.txt and val_split.txt as well: all I get is Chinese characters, and I am not sure why. The code is simple and obvious enough that it is not the issue, and the .xz files are in English text. I'm copying my vocab.txt for reference:
vocab.txt

I tried using the new data_extraction.py file, but that caused more issues that I don't want to troubleshoot at the moment.

Any suggestions as to what may be going on?
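
A quick sanity check that may help narrow this down (a suggestion, not a confirmed diagnosis): count how many characters in the generated vocab.txt fall outside the ASCII range. A handful of non-ASCII characters is normal for OpenWebText, but a vocab that is mostly non-ASCII would point at the source .xz files or the decompression step rather than at this script:

# Inspect the vocab file produced by the extraction script above.
with open("vocab.txt", encoding="utf-8") as f:
    chars = [line.rstrip("\n") for line in f if line.rstrip("\n")]

non_ascii = [c for c in chars if ord(c) > 127]
print(f"{len(non_ascii)} of {len(chars)} vocab characters are non-ASCII")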

request

Hi.
Thank you for your effort.
How can I set up this code for the Persian language?
Thanks a lot.
