
GPT Model Behind the Scene: Exploring it from scratch with PyTorch

CheeKean · Published in Artificial Intelligence in Plain English · 18 min read · Apr 22

Discover the intricacies of building and exploring a GPT model from scratch with an in-depth explanation that covers technical details and code implementation.

Photo by D koi on Unsplash

Introduction

ChatGPT is taking the world by storm, thanks to its ability to generate strikingly accurate, human-like responses. But what makes ChatGPT so remarkable? The answer lies in its powerful architecture, which is backed by none other than the world-renowned Generative Pre-trained Transformer (GPT) model. In this blog, we're going to dive deep into the technical architecture and implementation of GPT.

Getting Started

NanoGPT
While GPT may seem like a complex model at first glance, at its core it’s simply a
transformer that takes in a sequence of indices and outputs a probability
distribution over the next index in the sequence. In order to make our
implementation of GPT more accessible and educational, we’ve decided to use
NanoGPT, a lightweight PyTorch re-implementation of the model. Unlike some of
the more complex GPT implementations out there, NanoGPT is clean, interpretable,
and easy to understand.
Environment and Libraries
Now that we’ve covered the basics of NanoGPT and its capabilities, let’s roll up our
sleeves and get ready to dive into the exciting world of GPT!

pip install transformers tiktoken

Before we delve into the code, it is crucial to ensure that we have all the required libraries installed and imported in our Integrated Development Environment (IDE).

import os
import json
import regex as re
import requests
import numpy as np
import tiktoken                       # used later for BPE tokenization
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from collections import defaultdict  # used later by the Trainer class
import pickle
import math
import time

Following that, tokenization is done using the tiktoken library and the GPT-2 BPE algorithm. In the context of the GPT model, tokenization is necessary to convert text into a sequence of integers that can be used as input to the transformer. BPE (Byte Pair Encoding) is a data compression technique that represents frequently occurring sequences of characters in a text as a single symbol or token. For instance, consider the sentence

“the cat sat on the mat.”

The BPE tokenizer would first split this sentence into individual characters, as
follows:

“t h e c a t s a t o n t h e m a t .”

Next, it would find the most frequent pair of characters and merge them into a new symbol or token. Let's say "th" is the most frequent pair in this sentence, so it would be replaced with a new token "@@". The sentence would now look like

"@@e cat sat on @@e mat."

This process is repeated until a desired vocabulary size is reached, or until all character pairs have been replaced with tokens. This way, the tokenizer can handle rare and misspelled words by breaking them down into smaller units.
Finally, the integer tokens are saved as binary files for the training and validation splits, respectively.
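
As a minimal sketch of this preprocessing step (the file names, input path and the 90/10 split ratio here are assumptions for illustration, not taken from the article):

import numpy as np
import tiktoken

# load the raw text corpus (path is illustrative)
with open('input.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# GPT-2 BPE tokenizer from tiktoken
enc = tiktoken.get_encoding("gpt2")

# split into training and validation text (ratio is an assumption)
n = len(data)
train_ids = enc.encode_ordinary(data[:int(n * 0.9)])
val_ids = enc.encode_ordinary(data[int(n * 0.9):])

# save the integer tokens as binary files for later loading
np.array(train_ids, dtype=np.uint16).tofile('train.bin')
np.array(val_ids, dtype=np.uint16).tofile('val.bin')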
Model Configuration
The configuration of the model involves setting hyperparameters such as the
number of layers, number of heads, hidden size, and sequence length. The specific
values chosen for these hyperparameters depend on the size of the machine being
used and the specific task being performed. For example, a larger machine with
more memory and computational power might be able to handle a larger number of
layers or heads, which would allow for more complex modeling of the input
sequence.

class GPTConfig:
    def __init__(self, vocab_size, **kwargs):
        self.vocab_size = vocab_size
        for key, value in kwargs.items():
            setattr(self, key, value)

class CustomConfig(GPTConfig):
    # model architecture
    n_layer = 8
    n_head = 8
    n_embd = 256
    # dropout rates (digits were lost in the source; 0.1 is a typical default)
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1
    dropout = 0.1
    # runtime
    compile = True
    device = 'cuda'
    num_workers = 0
    # optimization (exact values were truncated in the source; common GPT defaults shown)
    max_iters = 2e4
    batch_size = 4
    block_size = 64
    learning_rate = 6e-4
    betas = (0.9, 0.95)
    weight_decay = 1e-1
    grad_norm_clip = 1.0

vocab_size = len(train_ids)
config = CustomConfig(vocab_size=vocab_size)

vocab_size : the size of the vocabulary
n_layer : the number of layers in the model
n_head : the number of attention heads in each layer
n_embd : the size of the embedding layer
embd_pdrop : dropout probability for the embedding layer
resid_pdrop : dropout probability for the residual connections
attn_pdrop : dropout probability for the attention weights
dropout : global dropout probability
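
As a quick illustration of how this configuration object behaves (the keyword override shown here is a hypothetical example), class-level attributes act as defaults and any keyword argument passed to the constructor overrides them:

# class attributes are defaults; kwargs passed to the constructor override them
demo_config = CustomConfig(vocab_size=50257, block_size=128)   # example values
print(demo_config.n_layer, demo_config.n_head, demo_config.block_size)  # -> 8 8 128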

Dataset Preparation

class ShakespeareDataset(Dataset):
    # NOTE: the constructor was cut off in the source; this reconstruction assumes
    # it stores the split, block size and device, as the usage below implies.
    def __init__(self, split, block_size, config):
        self.block_size = block_size
        self.device_type = config.device
        self.data = train_data if split == 'train' else val_data

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # x is a block of tokens, y is the same block shifted by one position (the next-token targets)
        x = torch.from_numpy(self.data[idx : idx + self.block_size].astype(np.int64))
        y = torch.from_numpy(self.data[idx + 1 : idx + 1 + self.block_size].astype(np.int64))
        if self.device_type == 'cuda':
            # pin arrays x, y, which allows us to move them to GPU asynchronously (non_blocking=True)
            x, y = x.pin_memory().to('cuda', non_blocking=True), y.pin_memory().to('cuda', non_blocking=True)
        else:
            x, y = x.to('cpu'), y.to('cpu')
        return x, y

# create dataset and dataloader
train_dataset = ShakespeareDataset('train', config.block_size, config)
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, num_workers=config.num_workers)
test_dataset = ShakespeareDataset('test', config.block_size, config)
test_loader = DataLoader(test_dataset, batch_size=config.batch_size, num_workers=config.num_workers)

We create train_dataset and test_dataset instances from the ShakespeareDataset class, then wrap them with PyTorch's DataLoader; the train_loader and test_loader instances are used to load the training and validation data, respectively.
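
As a quick sanity check (a sketch assuming the objects defined above), we can fetch one batch and inspect its shape:

# fetch a single batch and verify the shapes
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)    # expected: torch.Size([4, 64]) for both (batch_size, block_size)
print(xb[0, :5])             # first few token ids of the first sequence
print(yb[0, :5])             # the same tokens shifted by one position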

Model Implementation

In this step, we will implement the core features of the GPT model, including the
self-attention mechanism, GELU activation function and GPT blocks.
GELU Activation Function

Activation function comparison (from OpenGenus)

The GELU (Gaussian Error Linear Units) activation function is a non-linear
activation function that was introduced in 2016 by Hendrycks and Gimpel. It is a
smooth approximation of the ReLU activation function and has been shown to
perform better than the ReLU function in some deep learning models.

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in the Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

The GELU function has several desirable properties: it is smooth and differentiable everywhere, and unlike ReLU it allows small negative outputs while remaining unbounded above. It has been shown to improve the training speed and accuracy of deep learning models, particularly in natural language processing tasks.
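
A tiny standalone comparison (illustrative only, using the NewGELU module defined above):

gelu = NewGELU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))        # small negative inputs produce small negative outputs
print(torch.relu(x))  # ReLU clamps every negative input to exactly zero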
Causal Self Attention

    # (excerpt from CausalSelfAttention.forward; the constructor defining the qkv
    # projection c_attn, output projection c_proj, dropouts and the causal mask
    # buffer is not shown in this excerpt)
    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        # project input to query, key, value for all heads in one go, then split
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels (PyTorch >= 2.0)
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True,
            )
        else:
            # (b, h, seq_len, d_k) matmul (b, h, d_k, seq_len) --> (b, h, seq_len, seq_len)
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            # diagonal mask: fill the zero positions with a very large negative number
            # so they contribute nothing to the softmax; mask shape (1, 1, seq_len, seq_len)
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            # (b, h, seq_len, seq_len) matmul (b, h, seq_len, d_k) --> (b, h, seq_len, d_k)
            y = att @ v

        # (b, h, seq_len, d_k) --> (b, seq_len, h, d_k) --> (b, seq_len, d_model)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

The code above defines a class CausalSelfAttention that implements the causal self-attention mechanism in the GPT model.
The forward method is where the actual computation takes place. It receives a tensor x of shape (batch_size, seq_len, emb_dim) as input.
It splits the input x into query, key, and value tensors for all heads and reshapes them accordingly. It then computes the attention score matrix using either the fast Flash Attention kernel (available in PyTorch ≥ 2.0) or the slower dot-product method, depending on the PyTorch version.
In the case of dot-product attention, the attention score matrix is computed using matrix multiplication between the query and key tensors, followed by scaling by the square root of the key tensor's dimension.

Scaled Dot-Product Attention architecture (introduced by Vaswani et al. in Attention Is All You Need)

A mask is then applied to ensure that attention is only applied to the left in the input sequence. In GPT, the masking is done using a triangular mask that blocks the model from attending to any word that comes after the current word in the sequence. To achieve this, we use torch.tril(torch.ones(n, n)) to create a lower-triangular matrix of ones; the tril function zeros out all elements above the diagonal, which correspond to future tokens, as the short demonstration below illustrates.
The masked self-attention ensures that the model cannot look ahead in the sequence and only uses the previous tokens for prediction. This also means that the model does not need to learn a separate representation of the input sequence, making an encoder unnecessary.
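
Here is a tiny standalone illustration (toy tensor sizes, not part of the model code) of how the scaling and the triangular mask work together:

import torch
import math

T, d_k = 5, 16
q, k = torch.randn(T, d_k), torch.randn(T, d_k)

# scaled dot-product scores, then the causal (lower-triangular) mask
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float('-inf'))

# after softmax, each row attends only to itself and earlier positions
probs = torch.softmax(scores, dim=-1)
print(probs)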

class Block(nn.Module):
    """ GPT decoder block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            act     = NewGELU(),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))  # MLP forward

    def forward(self, x):
        # (batch_size, seq_len, emb_dim)
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x

In the implementation of a single decoder block, it takes in an input tensor x of
shape (batch_size, seq_len, emb_dim).
The block first applies layer normalization ( ln_1 ) to the input tensor. Then it
applies a causal self-attention mechanism ( attn ) to the normalized input, which
allows the model to only attend to the previous tokens and prevents information
leakage from future tokens.
The resulting tensor is added to the original input tensor (i.e., a residual connection) to obtain the first intermediate tensor.

Decoder Block (from research)

Next, the intermediate tensor is passed through a multi-layer perceptron ( mlp ).
The MLP is composed of four layers: a linear layer ( c_fc ) that expands the input
dimension by a factor of 4, a non-linear activation function ( act ), a second
linear layer ( c_proj ) that compresses the dimension back to emb_dim , and a
dropout layer ( dropout ) to regularize the model.
The output of the MLP is added to the first intermediate tensor (i.e., another residual connection) and returned as the final output of the decoder block.
Overall, this decoder block enables the GPT model to generate new sequences
autoregressively by predicting the probability distribution of the next token given
the previous tokens.
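
As a small sanity check (illustrative only, assuming the CustomConfig and Block classes defined above), a single block preserves the shape of its input tensor:

cfg = CustomConfig(vocab_size=50257)            # vocab_size here is just an example value
block = Block(cfg)
x = torch.randn(4, cfg.block_size, cfg.n_embd)  # (batch_size, seq_len, emb_dim)
print(block(x).shape)                           # torch.Size([4, 64, 256]) -- unchanged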
Don't be intimidated by the lengthy code below. We will break down each function block in detail, and you'll see that it's not as complex as you may initially think.
The constructor ( __init__ ) initializes the GPT model with the given configuration. The GPT model combines several components: the embedding layer for word tokens wte, the embedding layer for positional encoding wpe, the decoder blocks Block, and finally a layer normalization ln_f applied to the output of the transformer.
Meanwhile, the constructor initializes the weights of the GPT model using a special scaled initialization technique, as described in the GPT-2 paper. It also sets up an optimizer for training the model, with separate weight decay settings for different parts of the model.

class GPT(nn.Module):
    """ GPT Language Model """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # init all weights, and apply a special scaled init to the residual projections, per the GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters (note we don't count the lm_head parameters in this count)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)

    def configure_optimizers(self, train_config):
        # separate out all parameters into those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn  # full param name
                # random note: because named_modules and named_parameters are recursive,
                # we will see the same tensors p many many times, but doing it this way
                # allows us to know which parent module any tensor p belongs to
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        # positional indices, shape (1, t)
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        # (the lines below complete the forward pass as described in the text:
        # decoder blocks, final layer norm, LM head, and optional cross-entropy loss)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if targets are given, also compute the cross-entropy loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

The forward method first applies the word embedding layer to the input token indices and the positional encoding layer to the position indices. It then applies the transformer layers to the resulting tensor.
Next, it applies the language model head to the output of the transformer to obtain logits, an unnormalized distribution over the vocabulary.
Lastly, it computes the cross-entropy loss between the predicted distribution and the target distribution.
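
As a brief illustrative usage sketch (assuming the config, model, and dataloader defined above):

# instantiate the model and run a single forward pass
model = GPT(config).to(config.device)
xb, yb = next(iter(train_loader))
logits, loss = model(xb, yb)
print(logits.shape)   # (batch_size, block_size, vocab_size)
print(loss.item())    # scalar cross-entropy loss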
Word Generation
GPT is an auto-regressive language model that takes in a conditioning sequence of
indices and then generates new text one token at a time. The model generates each
token based on the preceding tokens in the sequence.
The generate function is a method in the GPT class that generates new text
based on a given input sequence. It takes in a conditioning sequence of indices
idx of shape (batch size, sequence length). The function then completes the
sequence max_new_tokens times, feeding the predictions back into the model
each time.
It forward-passes the model to get the logits for each index in the sequence. The logits represent the unnormalized probability distribution over the vocabulary of possible tokens.
Next, the function plucks the logits at the final step and scales them by a desired
temperature. The temperature is used to control the randomness of the
generated output. Higher temperatures lead to more diverse and random
outputs, while lower temperatures lead to more conservative and predictable
outputs.
Then, it applies softmax to convert the logits to normalized probabilities. The
probabilities represent the likelihood of each token in the vocabulary to be the
next token in the generated sequence.
Finally, the function either samples the next index from the probability distribution using torch.multinomial() or simply picks the most likely token. It then appends the sampled index to the running sequence and continues the loop until max_new_tokens tokens have been generated.
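
The article does not reproduce the full method here, so the following is a minimal sketch of such a generate method following the steps just described (argument names and the block-size cropping are assumptions):

@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0):
    # a method of the GPT class; idx is a (batch_size, seq_len) tensor of conditioning token indices
    for _ in range(max_new_tokens):
        # crop the context so it never exceeds the model's block size (assumed behaviour)
        idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
        # forward pass; take the logits at the final position and scale by temperature
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature
        # convert logits to normalized probabilities and sample the next token
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        # append the sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)
    return idx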

Trainer

After thoroughly exploring the GPT model and analyzing the provided source code,
we are now equipped with the knowledge and understanding necessary to train the
model using Shakespearean data. We can now confidently hit the “begin” button to
initiate the training process and watch as the model learns to generate
Shakespearean text.

class Trainer:

    def __init__(self, config, model, train_dataset):
        self.config = config
        self.model = model
        self.optimizer = None
        self.train_dataset = train_dataset
        self.callbacks = defaultdict(list)
        self.device = config.device
        self.model = self.model.to(self.device)

        # variables that will be assigned to the trainer class later, for logging and such
        self.iter_num = 0
        self.iter_time = 0.0
        self.iter_dt = 0.0

    def add_callback(self, onevent: str, callback):
        self.callbacks[onevent].append(callback)

    def set_callback(self, onevent: str, callback):
        self.callbacks[onevent] = [callback]

    def trigger_callbacks(self, onevent: str):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

    def run(self):
        model, config = self.model, self.config

        # setup the optimizer
        self.optimizer = model.configure_optimizers(config)

        # setup the dataloader
        train_loader = DataLoader(
            self.train_dataset,
            sampler=torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)),
            shuffle=False,
            batch_size=config.batch_size,
            num_workers=config.num_workers,
        )
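
The remainder of run() is not included above; a typical training loop for this kind of Trainer might look like the following sketch (an assumption-based outline, not the author's exact code):

        # -- sketch of a typical training loop body for run() (assumed, not from the article) --
        model.train()
        self.iter_num = 0
        self.iter_time = time.time()
        data_iter = iter(train_loader)
        while True:
            # fetch the next batch (x, y), re-initializing the iterator when exhausted
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                batch = next(data_iter)
            x, y = [t.to(self.device) for t in batch]

            # forward the model, then backpropagate and clip gradients
            logits, self.loss = model(x, y)
            model.zero_grad(set_to_none=True)
            self.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
            self.optimizer.step()

            self.trigger_callbacks('on_batch_end')

            # simple timing / iteration bookkeeping
            self.iter_num += 1
            tnow = time.time()
            self.iter_dt = tnow - self.iter_time
            self.iter_time = tnow

            # termination condition based on the configured number of iterations
            if config.max_iters is not None and self.iter_num >= config.max_iters:
                break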
