rasbt / llms-from-scratch

Implementing a ChatGPT-like LLM from scratch, step by step

Home Page: https://www.manning.com/books/build-a-large-language-model-from-scratch

License: Other

Jupyter Notebook 80.48% Python 19.49% Dockerfile 0.02%

chatgpt gpt large-language-models llm python pytorch

llms-from-scratch's Issues

Inconsistencies in unsqueeze operation description in the book and in notebook and its necessity (3.6.2 Implementing multi-head attention with weight splits)

Hi @rasbt,

I found that implementation of the MultiHeadAttention class has the following line:

mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)

But there is only one unsqueeze operation in the notebook:

mask_unsqueezed = mask_bool.unsqueeze(0)

But as I understand it, we can skip the unsqueeze operation altogether, because the masked_fill_() method supports broadcasting.
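
A minimal sketch of the broadcasting point, assuming attention scores laid out as (batch, heads, tokens, tokens); the sizes are illustrative and not taken from the book. It fills the same scores once with the plain 2-D mask and once with the doubly unsqueezed mask, then compares the results.

```python
import torch

batch, heads, num_tokens = 2, 3, 4  # illustrative sizes (assumption)

# Attention scores in the (batch, heads, tokens, tokens) layout
attn_scores = torch.zeros(batch, heads, num_tokens, num_tokens)

# Causal mask: True above the diagonal marks the positions to hide
mask_bool = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

# 2-D mask, relying on broadcasting over the batch and head dimensions
plain = attn_scores.clone().masked_fill_(mask_bool, -torch.inf)

# Same mask with two explicit leading dimensions: shape (1, 1, tokens, tokens)
unsqueezed = attn_scores.clone().masked_fill_(
    mask_bool.unsqueeze(0).unsqueeze(0), -torch.inf
)

same_result = torch.equal(plain, unsqueezed)
print(same_result)
```

If both calls agree, the unsqueeze calls are a readability choice (they make the intended shape explicit) rather than a requirement.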

Thank you.

Probably a typo in multi-head attention description (3.6.1 Stacking multiple single-head attention layers)

Hi @rasbt,

I found the following statement in the mentioned section:

Figure 3.24 illustrates the structure of a multi-head attention module, which consists of
multiple single-head attention modules, as previously depicted in Figure 3.24, stacked on
top of each other.

Did you mean Figure 3.18 in the second case?

Thank you.

Wrong number of token ids specified in the notebook (2.7 Creating token embeddings)

Hi @rasbt,

There is the following description in this section:

Previously, we have seen how to convert a single token ID into a three-dimensional
embedding vector. Let's now apply that to all four input IDs we defined earlier (torch.tensor([5, 1, 3, 2])):

But probably there is a typo in the notebook and you specified only 3 tokens for the same code (after cell [47]):

To embed all three input_ids values above, we do

Thank you.

class MHAPyTorchScaledDotProduct

Thanks for the great work. I have several questions about class MHAPyTorchScaledDotProduct in mha-implementations.ipynb:

class MHAPyTorchScaledDotProduct(nn.Module):
    def __init__(self, d_in, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()

        assert d_out % num_heads == 0, "embed_dim is indivisible by num_heads"

        self.num_heads = num_heads
        self.context_length = context_length
        self.head_dim = d_out // num_heads
        self.d_out = d_out

        self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)
        self.proj = nn.Linear(d_in, d_out)
        self.dropout = dropout

        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        batch_size, num_tokens, embed_dim = x.shape

        # (b, num_tokens, embed_dim) --> (b, num_tokens, 3 * embed_dim)
        qkv = self.qkv(x)

        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)

        # (b, num_tokens, 3, num_heads, head_dim) --> (3, b, num_heads, num_tokens, head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)

        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv.unbind(0)

        use_dropout = 0. if not self.training else self.dropout
        context_vec = nn.functional.scaled_dot_product_attention(
            queries, keys, values, attn_mask=None, dropout_p=use_dropout, is_causal=True)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec

I am not sure which one is better: .reshape() or .view()?

        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)

        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.view(batch_size, num_tokens, 3, self.num_heads, self.head_dim)

.unbind(0) does not seem necessary (the shapes of queries, keys, and values are the same without it). Is it there for speed?

        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv.unbind(0)

        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv

According to the equivalent implementation in F.scaled_dot_product_attention(), it seems like self.proj() is missing at the end:

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        context_vec = self.proj(context_vec)

        return context_vec

Again, I am not sure which one is better: .reshape() or .view()?

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec
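
A small sketch for the .view() versus .reshape() and .unbind(0) questions above, under the assumption that the attention output is a fresh contiguous tensor of shape (b, num_heads, num_tokens, head_dim). The tensor `attn_out` is a random stand-in for the result of scaled_dot_product_attention, and the sizes are made up.

```python
import torch

batch_size, num_tokens, num_heads, head_dim = 2, 5, 4, 8  # illustrative sizes

# A fresh linear-layer output is contiguous, so .view() and .reshape() agree here
qkv = torch.randn(batch_size, num_tokens, 3 * num_heads * head_dim)
viewed = qkv.view(batch_size, num_tokens, 3, num_heads, head_dim)
reshaped = qkv.reshape(batch_size, num_tokens, 3, num_heads, head_dim)
same_split = torch.equal(viewed, reshaped)

# Unpacking a tensor iterates over its first dimension, like .unbind(0)
permuted = viewed.permute(2, 0, 3, 1, 4)
q1, k1, v1 = permuted.unbind(0)
q2, k2, v2 = permuted
same_unpack = torch.equal(q1, q2) and torch.equal(k1, k2) and torch.equal(v1, v2)

# Stand-in for the attention output: (b, num_heads, num_tokens, head_dim)
attn_out = torch.randn(batch_size, num_heads, num_tokens, head_dim)
context = attn_out.transpose(1, 2)  # non-contiguous after the transpose

# .view() refuses a non-contiguous layout; .reshape() copies when it has to
try:
    context.view(batch_size, num_tokens, num_heads * head_dim)
    view_failed = False
except RuntimeError:
    view_failed = True

merged = context.reshape(batch_size, num_tokens, num_heads * head_dim)
same_merge = torch.equal(
    merged, context.contiguous().view(batch_size, num_tokens, num_heads * head_dim)
)
print(same_split, same_unpack, view_failed, same_merge)
```

In this sketch .reshape() is interchangeable with .contiguous().view(), so the choice between them looks like a matter of style; .view() alone only works where the memory layout already allows it.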

Question about number of tokens in ChatGPT (2.5 Byte pair encoding)

Hi @rasbt,

Could you please clarify this sentence:

In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3,
and ChatGPT has a total vocabulary size of 50,257, with <|endoftext|> being assigned
the largest token ID.

Which model do you mean by 'ChatGPT'?
I saw different definitions of this term, and based on these definitions there are different vocabulary sizes:

text-davinci-003 (50k tokens)
gpt-3.5-turbo or/and gpt-4 (100k tokens)

Thank you.

Inconsistencies in MHA Wrapper Implementation Between Chapter 3 Main Content and Bonus Material

In the notebook ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb, the parameter d_out is not divided by num_heads. As a result, the shape differs from other implementations: [8, 1024, 9216] versus [8, 1024, 768]. Additionally, the implementation lacks the final projection.

It is correctly implemented in ch03/01_main-chapter-code/multihead-attention.ipynb, cells 6 and 7.

This inconsistency leads to a significant performance gap in the subsequent cells.
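
The two reported widths are consistent with a simple arithmetic check, sketched here under the assumption that the wrapper concatenates one d_out-wide output per head. The head count is inferred from the numbers in the report, not read from the notebook.

```python
d_out = 768           # requested output width
concat_width = 9216   # last dimension reported for the bonus-notebook wrapper

# Concatenating one d_out-wide output per head scales the width by the head count
implied_heads = concat_width // d_out

# Giving each head d_out // num_heads columns keeps the concatenated width at d_out
head_dim = d_out // implied_heads
fixed_width = implied_heads * head_dim

print(implied_heads, head_dim, fixed_width)  # 12 64 768
```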

Output of the cell without variable specified (Embedding Layers and Linear Layers)

Hi @rasbt,

There is a cell [28] in this notebook where there is an output but no variable to output is specified (probably it was linear.weight which was deleted after cell execution):

torch.manual_seed(123)
linear = torch.nn.Linear(num_idx, out_dim, bias=False)
---
Parameter containing:
tensor([[-0.2039,  0.0166, -0.2483,  0.1886],
        [-0.4260,  0.3665, -0.3634, -0.3975],
        [-0.3159,  0.2264, -0.1847,  0.1871],
        [-0.4244, -0.3034, -0.1836, -0.0983],
        [-0.3814,  0.3274, -0.1179,  0.1605]], requires_grad=True)

Thank you.

Missing encoder.json and vocab.bpe for running bpe_openai_gpt2 (02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)

A FileNotFoundError occurred when trying to instantiate the encoder from bpe_openai_gpt2 as follows:

--------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[20], line 1
----> 1 orig_tokenizer = get_encoder(model_name="gpt2", models_dir=".")

File ~/localdev/python/LLMs-from-scratch/ch02/02_bonus_bytepair-encoder/bpe_openai_gpt2.py:140, in get_encoder(model_name, models_dir)
    139 def get_encoder(model_name, models_dir):
--> 140     with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
    141         encoder = json.load(f)
    142     with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:

FileNotFoundError: [Errno 2] No such file or directory: './gpt2/encoder.json'
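
A quick way to confirm which files are absent before calling get_encoder is sketched below. The helper `missing_vocab_files` is hypothetical (it is not part of bpe_openai_gpt2.py); it only rebuilds the two paths that appear in the traceback and checks whether they exist.

```python
import os

def missing_vocab_files(models_dir, model_name,
                        filenames=("encoder.json", "vocab.bpe")):
    """Hypothetical helper: list expected vocabulary files that do not exist."""
    expected = [os.path.join(models_dir, model_name, name) for name in filenames]
    return [path for path in expected if not os.path.isfile(path)]

missing = missing_vocab_files(models_dir=".", model_name="gpt2")
if missing:
    print("Files to fetch before calling get_encoder:", missing)
else:
    print("Both vocabulary files are present.")
```

Another issue on this page imports a download_vocab function from the same module, which presumably fetches these files; that is an inference from the import line, not something this report confirms.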

Offering Chinese Translation for 'Build a Large Language Model From Scratch'

Dear Dr. Sebastian Raschka,

Greetings! I am a researcher passionate about machine learning and artificial intelligence. As a native Chinese speaker, I would like to extend my deepest respect and gratitude for the open-source repository of "Build a Large Language Model From Scratch" that you have made available on GitHub. This book is not only comprehensive and beautifully illustrated but also organized in such a manner that beginners like myself find it both intuitive and easy to understand. Your work showcases profound expertise while being incredibly accessible to newcomers, from which I have greatly benefited.

Above all, I am inspired by your passion for AI and open-source software. Motivated by this passion, I have embarked on a project to translate your book and its associated code into Chinese. This effort aims to assist Chinese-speaking learners, like me, in better understanding the process of building large language models. To date, I have completed the translation of the first four chapters. During this process, I have made a concerted effort to clarify any contextual differences and added some foundational knowledge to help beginners grasp the material more effectively.

I am eager to contribute my translated version to the project and wonder if it would be possible to do so by including a link to my forked version in the official GitHub repository's readme or through another method you deem appropriate. My forked version is located at Intelligence-Manifesto/LLMs-from-scratch, which contains the translation work completed so far.

With this letter, I wish to express not only my admiration and thanks for this invaluable book but also seek your guidance and assistance on how I might integrate my work into this admirable open-source project in a suitable manner. How might I contribute my translation so that more Chinese readers can benefit?

Thank you again for your outstanding work and contributions to the open-source community. I look forward to your response.

Sincerely,
Intelligence-Manifesto

Make it clear in README.md what this repository is for

When reading the README.md for this repository, it's not immediately clear what this repository contains or what it is for. I think this should be clarified.

Chapter 5 - Context Size and the DataLoaders

First off, great book!

Second, I noticed a small issue in Section 5.1.1 that stumped me for a bit.

"ctx_len": 256, # Shortened context length (orig: 1024)

If this is set to 1024, the val_loader will fail to load with the train_ratio of 0.90. Adjusting to 0.80 will load the data but the shape is mismatched.

Restoring the ctx_len to 256 fixes the issue.

I'm curious as to why this is occurring?
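
One plausible explanation, not confirmed in this thread, is that the validation split becomes shorter than a single context window. The sketch below assumes the dataset uses a loop of the form `for i in range(0, n_tokens - max_length, stride)` and that the corpus has about 5,145 tokens; both are assumptions, so substitute your own numbers.

```python
def count_windows(n_tokens, max_length, stride):
    """Number of (input, target) samples a sliding window yields.

    Assumes the loop `for i in range(0, n_tokens - max_length, stride)`:
    each sample needs max_length inputs plus one extra token for the targets.
    """
    return len(range(0, n_tokens - max_length, stride))

n_tokens = 5145  # assumed corpus size; replace with your own token count

results = {}
for train_pct, ctx_len in [(90, 256), (90, 1024), (80, 1024)]:
    n_val = n_tokens - n_tokens * train_pct // 100   # tokens left for validation
    results[(train_pct, ctx_len)] = count_windows(n_val, ctx_len, stride=ctx_len)
    print(f"train={train_pct}% ctx_len={ctx_len}: "
          f"{n_val} val tokens, {results[(train_pct, ctx_len)]} val samples")
```

Under these assumptions a 90% split with a 1,024-token context leaves no complete validation sample, while a 256-token context leaves a couple. A loader that drops incomplete batches could turn even a single sample into an empty loader.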

Encoding/decoding transformation of the text (2.3 Converting tokens into token IDs)

Hi @rasbt,

I noticed that when we decode the following encoded sentence:

"It's the last he painted, you know," Mrs. Gisburn said with
pardonable pride.

We get additional spaces at the start of the sentence and after the apostrophe in the word It's (rendered as It' s):

 "It' s the last he painted, you know," Mrs. Gisburn said with
pardonable pride.

Formally, this does not matter for our case, because we do not take into account spaces, but in general, here we do not precisely restore the original text, right?
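
A minimal round-trip sketch of why the text is not restored exactly. The two regular expressions are stand-ins for a simple split-then-join tokenizer and are assumptions here, not quoted from this issue: splitting discards the original whitespace, and joining with single spaces followed by a punctuation clean-up cannot tell where spaces used to be.

```python
import re

text = ('"It\'s the last he painted, you know," '
        'Mrs. Gisburn said with pardonable pride.')

# Encode side (assumed rule): split on punctuation and whitespace, drop empties
pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
tokens = [piece.strip() for piece in pieces if piece.strip()]

# Decode side (assumed rule): join with spaces, then remove spaces before punctuation
joined = " ".join(tokens)
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(decoded)
print("round trip is exact:", decoded == text)
```

The clean-up step only removes whitespace in front of punctuation, so the space inserted after an opening quote or an apostrophe survives, which matches the kind of difference described above.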

Could you please tell me whether you are interested in minor feedback like this, or whether it is not worth a note or a new issue?

Thank you.

tiktoken is not running in jupyter notebook

Hello Razbt,
Nice to meet you! I've been enjoying your book so far (LLMs from scratch), but I find the examples hard to follow because the versions of some of the tools are not mentioned. I tried to follow along, but packages like tiktoken and pytorch refuse to work, or even to install. I tried using conda to create environments with both Python 3.9 and 3.10. Both successfully install tiktoken but fail to import it in the Jupyter notebook. The command I ran to attempt installation was pip install tiktoken.

Can you let me know which version of Python / tiktoken / pytorch you were using? Is there any intermediate step I missed?

I am running Windows 11 and a (non-NVIDIA) GPU.
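
One common cause of "installs but will not import" is that pip installed into a different environment than the one the notebook kernel runs in; whether that applies here is an assumption. The diagnostic sketch below can be run in a notebook cell and uses only the standard library.

```python
import importlib.util
import sys

# Interpreter that is running this notebook kernel or script
print("Python executable:", sys.executable)
print("Python version:", sys.version.split()[0])

# find_spec returns None when a package is not importable from this interpreter
status = {}
for package in ("tiktoken", "torch"):
    status[package] = importlib.util.find_spec(package) is not None
    print(f"{package}: {'found' if status[package] else 'not found'}")

# Command that installs into exactly this interpreter's environment
install_cmd = f'"{sys.executable}" -m pip install tiktoken'
print(install_cmd)
```

If the printed executable is not inside the conda environment where pip ran, running the printed command (or selecting the matching kernel) is the usual fix.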

Contributions for Chinese simplified version

hi, @rasbt~
This project is awesome and the tutorial structure is rather clear, I was able to get up and running quickly and I'm learning a lot from it. Really appreciate your work! Would you be interested in having a Chinese version of your project? So that LLM learners from China can refer to your work more efficiently. Maybe I can begin with README-zh.md?

Question about implementation of CausalAttention class (3.5.3 Implementing a compact causal self-attention class)

Hi @rasbt,

This notebook contains the following implementation of CausalAttention:

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec

I have a question - why do we need the following 2 lines in the forward() method implementation:

def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        ...
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        ...

Can we remove the first line and just replace the second line with the following code:

attn_scores.masked_fill_(self.mask.bool(), -torch.inf)

As I understand num_tokens = batch_size and we provide batch_size value as the argument, so neither calculating x.shape nor indexing [:num_tokens, :num_tokens] is required.
Is it correct?
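
A small sketch of what the slice does, assuming the mask was registered for a block_size larger than the current sequence length. Note that x.shape unpacks to (batch size, number of tokens, embedding size), so num_tokens is the sequence length rather than the batch size. The sizes below are illustrative.

```python
import torch

block_size = 6                  # maximum context length the mask was built for
batch_size, num_tokens = 2, 4   # a batch of 2 sequences, each 4 tokens long

mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
attn_scores = torch.zeros(batch_size, num_tokens, num_tokens)

# The full (6, 6) mask cannot broadcast against (2, 4, 4) scores
try:
    attn_scores.clone().masked_fill_(mask.bool(), -torch.inf)
    full_mask_ok = True
except RuntimeError:
    full_mask_ok = False

# Slicing the mask to the current sequence length makes the shapes line up
masked = attn_scores.clone().masked_fill_(
    mask.bool()[:num_tokens, :num_tokens], -torch.inf
)
print(full_mask_ok, tuple(masked.shape))
```

The unsliced mask only works when every input is exactly block_size tokens long; the slice keeps the class usable for shorter inputs.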

Thank you.

Difference btwn book and repo

Hi @rasbt - very much enjoying your book! Just a heads up about a difference between the book and repo I found. Results in the same value and code in the repo is what I expected. Screenshot attached. I think d_k = keys.shape[1].

About endoftext in ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py

In the file https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py#L95, the '<|endoftext|>' token always ends up in val_data_set, and train_data_set never contains it.

RuntimeError: size mismatch - ch05/03_bonus_pretraining_on_gutenberg

I have an issue running pretraining_simple.py. I have downloaded ca. 50% of the files from Project Gutenberg via the gutenberg repo and then ran your scripts:

The text data preparation works fine so far:

prepare_dataset.py

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python prepare_dataset.py --data_dir gutenberg/data --max_size_mb 500 --output_dir gutenberg_preprocessed
16697 file(s) to process.

But when trying to train the model, it comes to a shape mismatch. It seems like the data will not be trained batch-wise:

pretraining_simple.py

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python pretraining_simple.py --data_dir "gutenberg_preprocessed" --n_epochs 1 --batch_size 4 --output_dir model_checkpoints
Total files: 16
Tokenizing file 1 of 16: gutenberg_preprocessed/combined_1.txt
Training ...
Traceback (most recent call last):
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py", line 200, in <module>
train_losses, val_losses, tokens_seen = train_model_simple(
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py", line 110, in train_model_simple
loss = calc_loss_batch(input_batch, target_batch, model, device)
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/previous_chapters.py", line 247, in calc_loss_batch
loss = torch.nn.functional.cross_entropy(logits.flatten(0, -1), target_batch.flatten())
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: size mismatch (got input: [205852672], target: [4096])

I believe the issue comes from the flatten func. In calc_loss_batch() in previous_chapters.py, what do you think about exchanging flatten() with using view()?

loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), target_batch.view(-1))

Please double-check if this idea and output is correct.
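
The sizes in the error message can be checked with plain arithmetic. The sketch below uses a hypothetical helper, `flatten_shape`, written to mirror how flattening a range of dimensions changes a shape. The logits shape (4, 1024, 50257) is an assumption that happens to reproduce both numbers in the traceback.

```python
from math import prod

def flatten_shape(shape, start_dim=0, end_dim=-1):
    """Hypothetical helper: shape after merging dims start_dim..end_dim into one.

    Only non-negative start_dim values are handled in this sketch.
    """
    end_dim = end_dim % len(shape)
    merged = prod(shape[start_dim:end_dim + 1])
    return shape[:start_dim] + (merged,) + shape[end_dim + 1:]

logits_shape = (4, 1024, 50257)  # (batch, tokens, vocab); assumed sizes
target_shape = (4, 1024)

all_merged = flatten_shape(logits_shape, 0, -1)   # every dimension merged
rows_merged = flatten_shape(logits_shape, 0, 1)   # only batch and token dims merged
targets_flat = flatten_shape(target_shape)

print(all_merged, rows_merged, targets_flat)
```

Merging only the first two dimensions gives one row of class scores per target, which is the same shape that logits.view(-1, logits.size(-1)) produces.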

I have run the updated script locally on my RTX 3080 Ti, the output is:

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python pretraining_simple.py --data_dir "gutenberg_preprocessed" --n_epochs 1 --batch_size 4 --output_dir model_checkpoints
Total files: 16
Tokenizing file 1 of 16: gutenberg_preprocessed/combined_1.txt
Training ...
Ep 1 (Step 0): Train loss 9.952, Val loss 9.663
Every effort moves you
Ep 1 (Step 100): Train loss 6.567, Val loss 6.906
Ep 1 (Step 200): Train loss 6.468, Val loss 6.637
Ep 1 (Step 300): Train loss 6.170, Val loss 6.578
Ep 1 (Step 400): Train loss 5.560, Val loss 6.485
Ep 1 (Step 500): Train loss 5.874, Val loss 6.381
Ep 1 (Step 600): Train loss 5.481, Val loss 6.449
Ep 1 (Step 700): Train loss 5.620, Val loss 6.314
...

Solution for Exercise 3.2 is included in the notebook with main code (3.6.1 Stacking multiple single-head attention layers)

Hi @rasbt,

It seems that cell [36] in the notebook with main code contains solution to Exercise 3.2.

Thank you.

stride value caused skipping one word

dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=5, shuffle=False)
This code does skip one word, which differs from the text in the book, which says that we neither skip a word nor overlap. Setting stride=4 makes it consistent with the book.
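
The effect can be reproduced on a toy sequence. This sketch assumes a sliding-window loop in which the targets are the inputs shifted by one token; `sliding_windows` and `skipped_inputs` are illustrative helpers, and the token IDs are the numbers 0 to 11 so that positions are easy to read.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) chunks; targets are the inputs shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

def skipped_inputs(token_ids, max_length, stride):
    """Tokens that fall between the first and last input chunk but are in no input."""
    pairs = sliding_windows(token_ids, max_length, stride)
    covered = {tok for inputs, _ in pairs for tok in inputs}
    last_input_token = pairs[-1][0][-1]
    return [tok for tok in token_ids if tok <= last_input_token and tok not in covered]

token_ids = list(range(12))  # stand-in token IDs 0..11

for stride in (4, 5):
    inputs_only = [inputs for inputs, _ in sliding_windows(token_ids, 4, stride)]
    print(f"stride={stride}: inputs={inputs_only}, "
          f"skipped={skipped_inputs(token_ids, 4, stride)}")
```

With stride=5 the token at position 4 appears only as a target and never as an input; with stride=4 the input chunks tile the sequence without gaps or overlap.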

Do you have a doc for hardware specs?

Incorrect code output in the book (2.2 Tokenizing text)

Hi @rasbt,

I found that in the latest book version (v5) there is an incorrect code output in the section "2.2 Tokenizing text":

result = re.split(r'([,.]|\s)', text)
print(result)
We can see that the words and punctuation characters are now separate list entries just
as we wanted:
['Hello', ',', '', ' ', 'world.', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test.']

and

The resulting whitespace-free output looks like as follows:
['Hello', ',', 'world.', 'This', ',', 'is', 'a', 'test.']

But if we execute provided notebook, the output is correct.
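
For reference, the split can be re-run on its own. The example sentence is an assumption, since the `text` variable is not shown above.

```python
import re

text = "Hello, world. This, is a test."  # assumed example sentence

# The capturing group keeps the delimiters (comma, period, whitespace) as entries
result = re.split(r'([,.]|\s)', text)

# Drop empty and whitespace-only entries
tokens = [item for item in result if item.strip()]

print(result)
print(tokens)
```

With this pattern the periods come out as separate entries ('world', '.'), unlike the 'world.' and 'test.' entries shown in the quoted book output.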

P.S. It is a great pleasure to explore your next new book, especially about LLMs, thank you! :)

Thank you.

requirements.txt

Hi,

Can you please add a requirements.txt to the repo as well (to set the environment for book in one go, without needing to install every package manually)?

Error in the code in Listing A.13 (DDP-script.py)

Hi @rasbt,

I tried to run your DDP script and found that there is an error while executing this script "as-is":

PyTorch version: 2.2.1+cu121
CUDA available: True
Number of GPUs available: 2
Traceback (most recent call last):
  File "/home/user/app/DDP-script.py", line 178, in <module>
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/user/app/DDP-script.py", line 128, in main
    features, labels = features.to(rank), labels.to(rank) # New: use rank
AttributeError: 'int' object has no attribute 'to'

The reason is the following incorrect line:

for features, labels in enumerate(train_loader):

which should be like that:

for idx, (features, labels) in enumerate(train_loader):

or like that (because idx was not used):

for features, labels in train_loader:
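
The failure mode can be reproduced without PyTorch. In this sketch a plain list of (features, labels) pairs stands in for the DataLoader, which is an assumption made only to keep the example self-contained.

```python
# Stand-in for a DataLoader: any iterable of (features, labels) pairs
train_loader = [([0.1, 0.2], 0), ([0.3, 0.4], 1)]

# Buggy form: enumerate yields (index, batch), so "features" is the integer index
for features, labels in enumerate(train_loader):
    buggy_types = (type(features).__name__, type(labels).__name__)
    break

# Fixed form: unpack the batch explicitly
for idx, (features, labels) in enumerate(train_loader):
    fixed_types = (type(features).__name__, type(labels).__name__)
    break

print(buggy_types, fixed_types)
```

In the buggy loop the name features is bound to an int, which is why calling .to(rank) on it raises the AttributeError shown in the traceback.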

Thank you.

Solution of Exercise 2.1 is included in both main code and solution notebooks (2.5 Byte pair encoding)

Hi @rasbt,

I found that the solution to Exercise 2.1 also exists in the notebook with the main code (section "Experiments with unknown words").

Thank you.

Inconsistencies in output for dropout section (3.5.2 Masking additional attention weights with dropout)

Hi @rasbt,

I am trying to explore and reproduce Chapter 3 and found that I can't reproduce results that you specified in the notebook and the book, even if I download notebook and run without any changes.
The difference appears only starting with the following 2 cells (I haven't checked the next cells yet):

Cell [31]

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
example = torch.ones(6, 6) # create a matrix of ones

print(dropout(example))

Your output

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

My output

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])

Cell [32]

torch.manual_seed(123)
print(dropout(attn_weights))

Your output

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)

My output

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.8966, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4921, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4350, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.0000, 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)

Thank you.

{Q}: Replacing the Hugging Face LlamaDecoderLayer class with the new LongNet

Feedback: Strip output from notebook

This book is a wonderful read. I just wanted to submit one small comment on the notebooks, which could just be personal learning style. It's nice to have to run the actual notebook to get the output, so that block by block it's easier to focus without being distracted by output that is already rendered. So maybe there could be two notebooks per chapter, a clean one and a completed one? In the meantime I'm just using nbstripout locally, but wanted to pass along the feedback.

The definition of stride is confusing in 2.6

Hi @rasbt, what an amazing job. But the definition of stride confuses me, as follows:

We use a sliding window approach where we slide the window one word at a time (this is also known as stride=1):

An example using stride equal to the context length (here: 4) as shown below:

I think stride is the separation distance between two inputs. In fig 1, the distance between the two inputs is actually four words. But now stride marks the distance between input and target.

book feedback

hi @rasbt : fantastic work - and code which is clean and readable.

One small piece of feedback / issue I noticed with the "early access book" is that in chapter 3, the manual seed of 789 is missing, which is what brought me here :)

In 3.3.1, there seems to be a missing image between "The attention weights and context vector calculation are summarized in the figure below:" and "The code below walks through the figure above step by step."

By convention, the unnormalized attention weights are referred to as "attention scores" whereas the normalized attention scores, which sum to 1, are referred to as "attention weights"
The attention weights and context vector calculation are summarized in the figure below:

Perhaps the sentence needs to be modified

Solution for Exercise 3.3 is included in the notebook with main code (3.6.2 Implementing multi-head attention with weight splits)

Hi @rasbt,

It seems that cell [40] in the notebook with main code contains solution to Exercise 3.3.

Thank you.

Incorrect description of function torch.arange() (2.8 Encoding word positions)

Hi @rasbt,

There is probably a typo in the description of the torch.arange() function here:

As shown in the preceding code example, the input to the pos_embeddings is usually a
placeholder vector torch.arange(block_size), which contains a sequence of numbers
1, 2, ..., up to the maximum input length.

I think you mean the range 0, 1, ..., up to the maximum input length - 1?

Thank you.

Suggestion: adding torch.profiler

I just checked out the code of appendix-A/01_main-chapter-code/DDP-script.py. How about adding:

from torch.profiler import profile

with profile() as prof:
    ...  # the main training code goes here
if rank == 0:  # rank is provided by the surrounding DDP script
    print("exporting trace")
    prof.export_chrome_trace("trace_ddp_simple.json")

Then we can view the exported trace JSON file in Google Chrome.

Inconsistencies between the code in the book and the notebooks (2.6 Data sampling with a sliding window)

Hi @rasbt,

I noticed that in the book you provide the following code with function name create_dataloader and the argument stride = max_length + 1 to avoid overlap in data even for targets:

dataloader = create_dataloader(raw_text, batch_size=8, max_length=4,
stride=5)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

But in the cell of the jupyter notebook with main code (cell [43]) and jupyter notebook with only dataloader (cell [2]) you use function with name create_dataloader_v1 and argument stride = max_length.

Could you please tell me whether I understand correctly that we need to use stride = max_length + 1 to avoid overfitting? Does the overlap in targets (when stride = max_length) seriously increase the risk of overfitting?

Thank you.

Several package requirements from bonus material are not specified in requirements.txt (Tokenizers comparison)

Hi @rasbt,

I don't know if packages from the notebooks with bonus materials like this notebook with tokenizers comparison are intended to be included in requirements.txt, but there are 2 missing libraries:

tqdm (which is required by import from bpe_openai_gpt2 import get_encoder, download_vocab)
transformers

To simplify the work with the control of the libraries used for this project I use poetry which is great to track all explicit and implicit dependencies, so if you want I can send you my configuration for it.

Thank you.
