
fcc-intro-to-llms's Introduction

FreeCodeCamp - Building LLMs from Scratch

Dependencies (assuming Windows): pip install pylzma numpy ipykernel jupyter torch --index-url https://download.pytorch.org/whl/cu118

If you don't have an NVIDIA GPU, the device parameter will default to 'cpu', since device = 'cuda' if torch.cuda.is_available() else 'cpu'. If device defaults to 'cpu', that is fine; you will just experience slower runtimes.
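
For reference, a minimal, self-contained sketch of that device selection (everything beyond the quoted line is illustrative):

import torch

# Falls back to the CPU automatically when no CUDA-capable GPU is present.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Tensors and models are then moved onto whichever device was selected.
x = torch.randn(2, 3).to(device)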

All the links you should need are in this repo. I will add detailed explanations as questions and issues are posted.

Visual Studio 2022 (for lzma compression algo) - https://visualstudio.microsoft.com/downloads/

OpenWebText Download

Socials

Twitter / X - https://twitter.com/elliotarledge

My YouTube Channel - https://www.youtube.com/channel/UCjlt_l6MIdxi4KoxuMjhYxg

How to SSH from Mac to Windows - https://www.youtube.com/watch?v=7hBeAb6WyIg&t=

How to Setup Jupyter Notebooks in 5 minutes or less - https://www.youtube.com/watch?v=eLmweqU5VBA&t=

Linkedin - https://www.linkedin.com/in/elliot-arledge-a392b7243/

Join My Discord Server - https://discord.gg/pV7ByF9VNm

Schedule a 1-on-1: https://calendly.com/elliot-ayxc/60min

Research Papers:

Attention is All You Need - https://arxiv.org/pdf/1706.03762.pdf

A Survey of LLMs - https://arxiv.org/pdf/2303.18223.pdf

QLoRA: Efficient Finetuning of Quantized LLMs - https://arxiv.org/pdf/2305.14314.pdf

fcc-intro-to-llms's People

Contributors

infatoshi


fcc-intro-to-llms's Issues

file not found

The train_split.txt file is not available, so I can't run the code properly.

RuntimeError: The size of tensor a (64) must match the size of tensor b (65) at non-singleton dimension 2

When I try to run chatbot.py, it spits out this error while generating the response:

Traceback (most recent call last):
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 199, in
generated_chars = decode(m.generate(context.unsqueeze(0), max_new_tokens=150)[0].tolist())
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 176, in generate
logits, loss = self.forward(index_cond)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 156, in forward
x = self.blocks(x) # (B,T,C)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
input = module(input)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 121, in forward
y = self.sa(x)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 88, in forward
out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 88, in
out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "L:\Projects\Python\cuda\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "L:\Projects\Python\GitHub-Repos\fcc-intro-to-llms\chatbot.py", line 67, in forward
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
RuntimeError: The size of tensor a (64) must match the size of tensor b (65) at non-singleton dimension 2

Any idea how to fix this, please?
GPU: 4070 12 GB
torch: 2.3.0+cu121
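
One likely cause, offered here as an assumption based on the traceback rather than a confirmed diagnosis: the sequence length T reaching the attention head exceeds the block_size used to register the tril buffer, so self.tril[:T, :T] ends up one element smaller than wei along that dimension. A minimal sketch of the usual remedy, cropping the running context to the last block_size tokens before each forward pass (block_size = 64 is a hypothetical value; the real one is set in chatbot.py):

import torch

block_size = 64  # hypothetical; must match the value the model was trained with

# During generation the context keeps growing, so crop it before each forward
# pass; otherwise T exceeds the (block_size, block_size) tril buffer and
# masked_fill sees mismatched shapes.
context = torch.zeros((1, 65), dtype=torch.long)  # one token longer than block_size
index_cond = context[:, -block_size:]
print(index_cond.shape)  # torch.Size([1, 64])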

GPU and pylzma issues

I have two questions.
First off, my graphics card is an AMD Radeon RX 580 2048SP, so nothing CUDA-capable. Can I still build this? And if so, how?

That leads to my next question. Is the pylzma install dependent on CUDA, and is that why it won't work? I tried running pip install pylzma (and pip3 as well) without being in the (cuda) environment, from (base) D:\Python Testing\, and it wouldn't install then either.

Is there a different option? Can I bypass this install? The error I get is that io.h is not found. I installed the Visual Studio build tools and selected the Windows 10 SDK. Please share how to fix this error in baby steps for noobs like me. Nothing has worked so far, and there's nothing on Stack Overflow either.
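
A hedged side note, not a confirmed fix from this thread: the extraction script further down this page only ever reads .xz files, and Python's standard-library lzma module can do that without any compiled extension, which sidesteps the io.h build error entirely:

import lzma

# lzma ships with CPython 3.3+, so no compiler or Windows SDK is required.
# "example.xz" is a hypothetical placeholder for one of the OpenWebText files.
with lzma.open("example.xz", "rt", encoding="utf-8") as f:
    print(f.read(200))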

Cuda on Mac M1

When I attempted to run torch.cuda.is_available() on my Mac, I got 'cpu' as the device. After briefly searching on Google, I learned that CUDA is not supported on Mac. Is there some workaround for this?
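
Two hedged notes, not from the original thread: torch.cuda.is_available() itself returns a boolean, so the 'cpu' value comes from the device-selection line in the README. CUDA is NVIDIA-only and will never be available on an M1, but recent PyTorch builds (1.12+) expose Apple's Metal backend ('mps'), which the same pattern can check for:

import torch

# CUDA is NVIDIA-only, so it is never available on Apple Silicon.
# PyTorch 1.12+ exposes the Metal backend ("mps") instead.
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
print(f"Using device: {device}")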

Issues with extracting the data from openwebtext

I'm using the following code from the video:

import os
import lzma
from tqdm import tqdm

def xf_files_in_dr(directory):
  # Collect every .xz file name in the given directory.
  files = []
  for filename in os.listdir(directory):
    if filename.endswith(".xz") and os.path.isfile(os.path.join(directory, filename)):
      files.append(filename)
  return files

folder_path="X:/projects/python/llm-python/openwebtext/subsets/urlsf_subset02/openwebtext"
output_file_train = "train_split.txt"
output_file_val = "val_split.txt"

vocab_file = "vocab.txt"
split_files = int(input("how many files would you like to split this into? "))

files = xf_files_in_dr(folder_path)
total_files = len(files)

split_index = int(total_files * 0.9)  # 90/10 train/val split
files_train = files[:split_index]
files_val = files[split_index:]

max_count = total_files // split_files if split_files != 0 else total_files

vocab = set()

# Write the training split and collect its characters for the vocab.
with open(output_file_train, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_train, total=len(files_train)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)


# Write the validation split, also collecting characters for the vocab.
with open(output_file_val, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_val, total=len(files_val)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)

with open(vocab_file,"w", encoding="utf-8") as vfile:
  for char in vocab:
    vfile.write(char + '\n')

The folder path is a little different from the video because OpenWebText now downloads as smaller subset folders instead of one huge folder, but no issues there so far.

The problem comes when you actually create the vocab.txt file and, I assume, the train_split.txt and val_split.txt as well: all I get is Chinese characters, and I am not sure why. The code is simple and obvious enough that it is not the issue, and the .xz files are in English text. I'm copying my vocab.txt for reference:
vocab.txt

I tried using the new data_extraction.py file, but that caused more issues that I don't want to troubleshoot at the moment.

Any suggestions as to what may be going on?
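
A quick sanity check that may help narrow this down (a suggestion, not a confirmed diagnosis): count how many characters in the generated vocab.txt fall outside the ASCII range. A handful of non-ASCII characters is normal for OpenWebText, but a vocab that is mostly non-ASCII would point at the source .xz files or the decompression step rather than at this script:

# Inspect the vocab file produced by the extraction script above.
with open("vocab.txt", encoding="utf-8") as f:
    chars = [line.rstrip("\n") for line in f if line.rstrip("\n")]

non_ascii = [c for c in chars if ord(c) > 127]
print(f"{len(non_ascii)} of {len(chars)} vocab characters are non-ASCII")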

request

Hi.
Thank you for your effort.
How can I set up this code for the Persian language?
Thanks a lot.
