
GPT Model Behind the Scene: Exploring it from scratch with PyTorch

CheeKean · Published in Artificial Intelligence in Plain English · 18 min read · Apr 22

Discover the intricacies of building and exploring a GPT model from scratch with an in-depth explanation that covers technical details and code implementation.

Photo by D koi on Unsplash

Introduction

ChatGPT is taking the world by storm, thanks to its ability to generate strikingly accurate, human-like responses. But what makes ChatGPT so remarkable? The answer lies in its powerful architecture, which is backed by none other than the world-renowned Generative Pre-trained Transformer (GPT) model. In this blog, we're going to dive deep into the technical architecture and implementation of GPT.

Getting Started

NanoGPT
While GPT may seem like a complex model at first glance, at its core it’s simply a
transformer that takes in a sequence of indices and outputs a probability
distribution over the next index in the sequence. In order to make our
implementation of GPT more accessible and educational, we’ve decided to use
NanoGPT, a lightweight PyTorch re-implementation of the model. Unlike some of
the more complex GPT implementations out there, NanoGPT is clean, interpretable,
and easy to understand.
Environment and Libraries
Now that we’ve covered the basics of NanoGPT and its capabilities, let’s roll up our
sleeves and get ready to dive into the exciting world of GPT!

pip install transformers tiktoken

Before we delve into the code, it is crucial to ensure that we have all the required libraries installed and imported in our Integrated Development Environment (IDE).

import os
import json
import regex as re
import requests
import numpy as np
import tiktoken                       # used later for BPE tokenization
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from collections import defaultdict  # used later by the Trainer class
import pickle
import math
import time

Following that, tokenization is done using the tiktoken library and the GPT-2 BPE algorithm. In the context of the GPT model, tokenization is necessary to convert text into a sequence of integers that can be used as input to the transformer. BPE (Byte Pair Encoding) is a data compression technique that represents frequently occurring sequences of characters in a text as a single symbol or token. For instance, consider the sentence

“the cat sat on the mat.”

The BPE tokenizer would first split this sentence into individual characters, as
follows:

“t h e c a t s a t o n t h e m a t .”

Next, it would find the most frequent pair of characters and merge them into a new symbol or token. Let's say "th" is the most frequent pair in this sentence, so it would be replaced with a new token "@@". The sentence would now look like

"@@e cat sat on @@e mat."

This process is repeated until a desired vocabulary size is reached, or until all character pairs have been replaced with tokens. This way, the tokenizer can handle rare and misspelled words by breaking them down into smaller units.
Finally, the integer tokens are saved as binary files for the training and validation splits, respectively.
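
As a minimal sketch of this preprocessing step (the file names, input path and the 90/10 split ratio here are assumptions for illustration, not taken from the article):

import numpy as np
import tiktoken

# load the raw text corpus (path is illustrative)
with open('input.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# GPT-2 BPE tokenizer from tiktoken
enc = tiktoken.get_encoding("gpt2")

# split into training and validation text (ratio is an assumption)
n = len(data)
train_ids = enc.encode_ordinary(data[:int(n * 0.9)])
val_ids = enc.encode_ordinary(data[int(n * 0.9):])

# save the integer tokens as binary files for later loading
np.array(train_ids, dtype=np.uint16).tofile('train.bin')
np.array(val_ids, dtype=np.uint16).tofile('val.bin')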
Model Configuration
The configuration of the model involves setting hyperparameters such as the
number of layers, number of heads, hidden size, and sequence length. The specific
values chosen for these hyperparameters depend on the size of the machine being
used and the specific task being performed. For example, a larger machine with
more memory and computational power might be able to handle a larger number of
layers or heads, which would allow for more complex modeling of the input
sequence.

class GPTConfig:
    def __init__(self, vocab_size, **kwargs):
        self.vocab_size = vocab_size
        for key, value in kwargs.items():
            setattr(self, key, value)

class CustomConfig(GPTConfig):
    # model architecture
    n_layer = 8
    n_head = 8
    n_embd = 256
    # dropout rates (digits were lost in the source; 0.1 is a typical default)
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1
    dropout = 0.1
    # runtime
    compile = True
    device = 'cuda'
    num_workers = 0
    # optimization (exact values were truncated in the source; common GPT defaults shown)
    max_iters = 2e4
    batch_size = 4
    block_size = 64
    learning_rate = 6e-4
    betas = (0.9, 0.95)
    weight_decay = 1e-1
    grad_norm_clip = 1.0

vocab_size = len(train_ids)
config = CustomConfig(vocab_size=vocab_size)

vocab_size : the size of the vocabulary
n_layer : the number of layers in the model
n_head : the number of attention heads in each layer
n_embd : the size of the embedding layer
embd_pdrop : dropout probability for the embedding layer
resid_pdrop : dropout probability for the residual connections
attn_pdrop : dropout probability for the attention weights
dropout : global dropout probability
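
As a quick illustration of how this configuration object behaves (the keyword override shown here is a hypothetical example), class-level attributes act as defaults and any keyword argument passed to the constructor overrides them:

# class attributes are defaults; kwargs passed to the constructor override them
demo_config = CustomConfig(vocab_size=50257, block_size=128)   # example values
print(demo_config.n_layer, demo_config.n_head, demo_config.block_size)  # -> 8 8 128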

Dataset Preparation

class ShakespeareDataset(Dataset):
    # NOTE: the constructor was cut off in the source; this reconstruction assumes
    # it stores the split, block size and device, as the usage below implies.
    def __init__(self, split, block_size, config):
        self.block_size = block_size
        self.device_type = config.device
        self.data = train_data if split == 'train' else val_data

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # x is a block of tokens, y is the same block shifted by one position (the next-token targets)
        x = torch.from_numpy(self.data[idx : idx + self.block_size].astype(np.int64))
        y = torch.from_numpy(self.data[idx + 1 : idx + 1 + self.block_size].astype(np.int64))
        if self.device_type == 'cuda':
            # pin arrays x, y, which allows us to move them to GPU asynchronously (non_blocking=True)
            x, y = x.pin_memory().to('cuda', non_blocking=True), y.pin_memory().to('cuda', non_blocking=True)
        else:
            x, y = x.to('cpu'), y.to('cpu')
        return x, y

# create dataset and dataloader
train_dataset = ShakespeareDataset('train', config.block_size, config)
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, num_workers=config.num_workers)
test_dataset = ShakespeareDataset('test', config.block_size, config)
test_loader = DataLoader(test_dataset, batch_size=config.batch_size, num_workers=config.num_workers)

We create train_dataset and test_dataset instances from the ShakespeareDataset class, then wrap them with PyTorch's DataLoader; the train_loader and test_loader instances are used to load the training and validation data, respectively.
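
As a quick sanity check (a sketch assuming the objects defined above), we can fetch one batch and inspect its shape:

# fetch a single batch and verify the shapes
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)    # expected: torch.Size([4, 64]) for both (batch_size, block_size)
print(xb[0, :5])             # first few token ids of the first sequence
print(yb[0, :5])             # the same tokens shifted by one position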

Model Implementation

In this step, we will implement the core features of the GPT model, including the
self-attention mechanism, GELU activation function and GPT blocks.
GELU Activation Function

Activation function comparison (from OpenGenus)

The GELU (Gaussian Error Linear Units) activation function is a non-linear
activation function that was introduced in 2016 by Hendrycks and Gimpel. It is a
smooth approximation of the ReLU activation function and has been shown to
perform better than the ReLU function in some deep learning models.

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in the Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

The GELU function has several desirable properties: it is smooth and differentiable everywhere, and unlike ReLU it allows small negative outputs while remaining unbounded above. It has been shown to improve the training speed and accuracy of deep learning models, particularly in natural language processing tasks.
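
A tiny standalone comparison (illustrative only, using the NewGELU module defined above):

gelu = NewGELU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))        # small negative inputs produce small negative outputs
print(torch.relu(x))  # ReLU clamps every negative input to exactly zero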
Causal Self Attention

    # (excerpt from CausalSelfAttention.forward; the constructor defining the qkv
    # projection c_attn, output projection c_proj, dropouts and the causal mask
    # buffer is not shown in this excerpt)
    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        # project input to query, key, value for all heads in one go, then split
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels (PyTorch >= 2.0)
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True,
            )
        else:
            # (b, h, seq_len, d_k) matmul (b, h, d_k, seq_len) --> (b, h, seq_len, seq_len)
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            # diagonal mask: fill the zero positions with a very large negative number
            # so they contribute nothing to the softmax; mask shape (1, 1, seq_len, seq_len)
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            # (b, h, seq_len, seq_len) matmul (b, h, seq_len, d_k) --> (b, h, seq_len, d_k)
            y = att @ v

        # (b, h, seq_len, d_k) --> (b, seq_len, h, d_k) --> (b, seq_len, d_model)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

The code above defines a class CausalSelfAttention that implements the causal self-attention mechanism in the GPT model.
The forward method is where the actual computation takes place. It receives a tensor x of shape (batch_size, seq_len, emb_dim) as input.
It splits the input x into query, key, and value tensors for all heads and reshapes them accordingly. It then computes the attention score matrix using either the fast Flash Attention kernel (available in PyTorch ≥ 2.0) or the slower dot-product method, depending on the PyTorch version.
In the case of dot-product attention, the attention score matrix is computed using matrix multiplication between the query and key tensors, followed by scaling by the square root of the key tensor's dimension.

Scaled Dot-Product Attention architecture (introduced by Vaswani et al. in Attention Is All You Need)

A mask is then applied to ensure that attention is only applied to the left in the input sequence. In GPT, the masking is done using a triangular mask that blocks the model from attending to any word that comes after the current word in the sequence. To achieve this, we use torch.tril(torch.ones(n, n)) to create a lower-triangular matrix of ones; the tril function zeros out all elements above the diagonal, which correspond to future tokens, as the short demonstration below illustrates.
The masked self-attention ensures that the model cannot look ahead in the sequence and only uses the previous tokens for prediction. This also means that the model does not need to learn a separate representation of the input sequence, making an encoder unnecessary.
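
Here is a tiny standalone illustration (toy tensor sizes, not part of the model code) of how the scaling and the triangular mask work together:

import torch
import math

T, d_k = 5, 16
q, k = torch.randn(T, d_k), torch.randn(T, d_k)

# scaled dot-product scores, then the causal (lower-triangular) mask
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float('-inf'))

# after softmax, each row attends only to itself and earlier positions
probs = torch.softmax(scores, dim=-1)
print(probs)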

class Block(nn.Module):
    """ GPT decoder block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            act     = NewGELU(),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))  # MLP forward

    def forward(self, x):
        # (batch_size, seq_len, emb_dim)
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x

In the implementation of a single decoder block, it takes in an input tensor x of
shape (batch_size, seq_len, emb_dim).
The block first applies layer normalization ( ln_1 ) to the input tensor. Then it
applies a causal self-attention mechanism ( attn ) to the normalized input, which
allows the model to only attend to the previous tokens and prevents information
leakage from future tokens.
The resulting tensor is added to the original input tensor (i.e., a residual connection) to obtain the first intermediate tensor.

Decoder Block (from research)

Next, the intermediate tensor is passed through a multi-layer perceptron ( mlp ).
The MLP is composed of four layers: a linear layer ( c_fc ) that expands the input
dimension by a factor of 4, a non-linear activation function ( act ), a second
linear layer ( c_proj ) that compresses the dimension back to emb_dim , and a
dropout layer ( dropout ) to regularize the model.
The output of the MLP is added to the first intermediate tensor (i.e., another residual connection) and returned as the final output of the decoder block.
Overall, this decoder block enables the GPT model to generate new sequences
autoregressively by predicting the probability distribution of the next token given
the previous tokens.
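
As a small sanity check (illustrative only, assuming the CustomConfig and Block classes defined above), a single block preserves the shape of its input tensor:

cfg = CustomConfig(vocab_size=50257)            # vocab_size here is just an example value
block = Block(cfg)
x = torch.randn(4, cfg.block_size, cfg.n_embd)  # (batch_size, seq_len, emb_dim)
print(block(x).shape)                           # torch.Size([4, 64, 256]) -- unchanged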
Don't be intimidated by the lengthy code below. We will break down each function block in detail, and you'll see that it's not as complex as you may initially think.
The constructor ( __init__ ) initializes the GPT model with the given configuration. The GPT model combines several components: the embedding layer for word tokens wte, the embedding layer for positional encoding wpe, the decoder blocks Block, and finally a layer normalization ln_f applied to the output of the transformer.
Meanwhile, the constructor initializes the weights of the GPT model using a special scaled initialization technique, as described in the GPT-2 paper. It also sets up an optimizer for training the model, with separate weight decay settings for different parts of the model.

class GPT(nn.Module):
    """ GPT Language Model """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # init all weights, and apply a special scaled init to the residual projections, per the GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters (note we don't count the lm_head parameters in this count)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)

    def configure_optimizers(self, train_config):
        # separate out all parameters into those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn  # full param name
                # random note: because named_modules and named_parameters are recursive,
                # we will see the same tensors p many many times, but doing it this way
                # allows us to know which parent module any tensor p belongs to
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        # positional indices, shape (1, t)
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        # (the lines below complete the forward pass as described in the text:
        # decoder blocks, final layer norm, LM head, and optional cross-entropy loss)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if targets are given, also compute the cross-entropy loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

The forward method first applies the word embedding layer to the input token indices and the positional encoding layer to the position indices. It then applies the transformer layers to the resulting tensor.
Next, it applies the language model head to the output of the transformer to obtain logits, an unnormalized distribution over the vocabulary.
Lastly, it computes the cross-entropy loss between the predicted distribution and the target distribution.
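
As a brief illustrative usage sketch (assuming the config, model, and dataloader defined above):

# instantiate the model and run a single forward pass
model = GPT(config).to(config.device)
xb, yb = next(iter(train_loader))
logits, loss = model(xb, yb)
print(logits.shape)   # (batch_size, block_size, vocab_size)
print(loss.item())    # scalar cross-entropy loss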
Word Generation
GPT is an auto-regressive language model that takes in a conditioning sequence of
indices and then generates new text one token at a time. The model generates each
token based on the preceding tokens in the sequence.
The generate function is a method in the GPT class that generates new text
based on a given input sequence. It takes in a conditioning sequence of indices
idx of shape (batch size, sequence length). The function then completes the
sequence max_new_tokens times, feeding the predictions back into the model
each time.
It forward-passes the model to get the logits for each index in the sequence. The logits represent the unnormalized probability distribution over the vocabulary of possible tokens.
Next, the function plucks the logits at the final step and scales them by a desired
temperature. The temperature is used to control the randomness of the
generated output. Higher temperatures lead to more diverse and random
outputs, while lower temperatures lead to more conservative and predictable
outputs.
Then, it applies softmax to convert the logits to normalized probabilities. The
probabilities represent the likelihood of each token in the vocabulary to be the
next token in the generated sequence.
Finally, the function either samples the next index from the probability distribution using torch.multinomial() or simply picks the most likely token. It then appends the sampled index to the running sequence and continues the loop until max_new_tokens tokens have been generated.
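
The article does not reproduce the full method here, so the following is a minimal sketch of such a generate method following the steps just described (argument names and the block-size cropping are assumptions):

@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0):
    # a method of the GPT class; idx is a (batch_size, seq_len) tensor of conditioning token indices
    for _ in range(max_new_tokens):
        # crop the context so it never exceeds the model's block size (assumed behaviour)
        idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
        # forward pass; take the logits at the final position and scale by temperature
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature
        # convert logits to normalized probabilities and sample the next token
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        # append the sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)
    return idx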

Trainer

After thoroughly exploring the GPT model and analyzing the provided source code,
we are now equipped with the knowledge and understanding necessary to train the
model using Shakespearean data. We can now confidently hit the “begin” button to
initiate the training process and watch as the model learns to generate
Shakespearean text.

class Trainer:

    def __init__(self, config, model, train_dataset):
        self.config = config
        self.model = model
        self.optimizer = None
        self.train_dataset = train_dataset
        self.callbacks = defaultdict(list)
        self.device = config.device
        self.model = self.model.to(self.device)

        # variables that will be assigned to the trainer class later, for logging and such
        self.iter_num = 0
        self.iter_time = 0.0
        self.iter_dt = 0.0

    def add_callback(self, onevent: str, callback):
        self.callbacks[onevent].append(callback)

    def set_callback(self, onevent: str, callback):
        self.callbacks[onevent] = [callback]

    def trigger_callbacks(self, onevent: str):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

    def run(self):
        model, config = self.model, self.config

        # setup the optimizer
        self.optimizer = model.configure_optimizers(config)

        # setup the dataloader
        train_loader = DataLoader(
            self.train_dataset,
            sampler=torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)),
            shuffle=False,
            batch_size=config.batch_size,
            num_workers=config.num_workers,
        )
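
The remainder of run() is not included above; a typical training loop for this kind of Trainer might look like the following sketch (an assumption-based outline, not the author's exact code):

        # -- sketch of a typical training loop body for run() (assumed, not from the article) --
        model.train()
        self.iter_num = 0
        self.iter_time = time.time()
        data_iter = iter(train_loader)
        while True:
            # fetch the next batch (x, y), re-initializing the iterator when exhausted
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                batch = next(data_iter)
            x, y = [t.to(self.device) for t in batch]

            # forward the model, then backpropagate and clip gradients
            logits, self.loss = model(x, y)
            model.zero_grad(set_to_none=True)
            self.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
            self.optimizer.step()

            self.trigger_callbacks('on_batch_end')

            # simple timing / iteration bookkeeping
            self.iter_num += 1
            tnow = time.time()
            self.iter_dt = tnow - self.iter_time
            self.iter_time = tnow

            # termination condition based on the configured number of iterations
            if config.max_iters is not None and self.iter_num >= config.max_iters:
                break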
