A full tutorial on turning GPT-2 into a Conversational AI, with NanoGPT and NanoChatGPT

28 min readAug 31, 2023

Representing The LLM moat with Stable diffusion

This tutorial can easily be used by anyone with python experience and a basic knowledge of machine learning terms. This tutorial uses Pytorch.

A couple days ago, I built NanoChatGPT, a model fine tuned on
GPT-2-medium model. When most people see GPT-2 they think of an autocomplete, maybe a couple sentences, but with fine tuning and good data, GPT-2 can write dialogue, make stories, and chat, all in 354 million parameters.

The number 354 million is really important. The ChatGPT we all love has 175 billion parameters. We are attempting to Chat with 0.002% of the power in GPT-3. That's why this is geared towards short responses, and It isn’t as good at giving contextual responses or staying on topic. But it is great at speaking in english and making natural sentences.

What does it sound like?

Here's a random Chat excerpt I had with the Model:

...           
User: Talking about pythons, whats your favorite animal?
Bot: Not knowing, but it is my favorite animal.   I had been thinking about it for a long time, and I guess I liked the name.  
User: Which animal though, whats your favorite?
Bot: I like it when they play but I love birds too.  
User: Birds huh? My favorite is the Peacock
Bot: No other bird is my favorite though.                
User: What about the flamingo or the parrot
Bot: Mm, they are both, but the parrot is only in my group.               
User: Do you like macaws or someother parrots
Bot: Mm, I like them both.                   <endOf Text>  "The past is dead: it will never be
User: Really? What about Mammals like tigers
Bot: Mammals like tigers. </p>   "I will have this dog today. It's actually quite good.   "                   
User: What do you like, Lions or tigers?
Bot: I like when I hear a song, I like that it's interesting.  Thanks for the reply.   > I have a taste for Black Forest Wildcat  
User: Wildcats? What about PUMA
Bot: So again, what is your opinion?       
User: Jaguar, the king of the amazon
Bot: Mm, it is a fantastic looking animal, it's not as big as a jaguar, it's not as hard to spot as a jaguar, it's a feline like animal.        
User: What animal are you talking about?
Bot: Mm. My friend in the car, who I wasn't seeing, just said he was a black bear, but he's not a bear.        is pretty hard to spot (A bear was mentioned but I couldn
User: He saw a black bear, I saw a Grizzly
Bot: I haven't seen him yet, I'll call him tomorrow if I do.  
User: I grizzly bear????
Bot: I wouldn't say that, I just thought that was interesting and was curious about it.    
User: Have you ever been to yellowstone national park??
...

As you can see, it's definitely nothing like a regular chatbot :) This is a small excerpt of a conversation with bot, where we smoothly transition from python the programming language to python the animal, and other animals. It gives competent responses and stays on topic for around 10 uer-bot turns, transitioning the topic every 10 turns or so can lead to a better chat experience.

Something funny that used happen with the AI, is that it decides that “It’s not worth it anymore” and just sends blank gaps for messages. I have no idea where this came from, but when I decided to send it motivational messages for fun, It started talking again. When I added more data to the dataset, It Stopped the blank spacing, but it still sends messages like it, and may occasionally have too many spaces. but now, let's get into the code

Setup

This is all based off the google colab, but with slight changes to make sure we remain commercial friendly. Make sure you have a GPU.

To begin working with NanoChatGPT, clone the Github repo.

!git clone https://github.com/VatsaDev/nanoChatGPT.git
%cd /content/nanoChatGPT

The next step is optional, but if you use google colab as a cloud ai environment, then mounting to google drive can be a great idea for a place to store your ckpt.pt file

from google.colab import drive
drive.mount('/content/drive')

Install all dependencies

!pip install torch numpy transformers datasets tiktoken wandb tqdm # needed to run the model

Make slight edits to prepare.py. We are only using one input file from the dataset NanoChatGPT was trained on, making it commercially friendly and fast to finetune. This input file was from the ubuntu dialogue corpus, a good quality multi turn conversational dataset.

How does this code work? well, looking inside prepare.py

import os
import requests
import tiktoken
import numpy as np

train_ids=[]
val_ids=[]
enc = tiktoken.get_encoding("gpt2")

def download_file(url):
  response = requests.get(url)
  if response.status_code == 200:
    with open('dataset.txt', 'wb') as f:
      f.write(response.content)
      print("downloaded dataset, tokenizing")
  else:
    print('Error downloading file:', response.status_code)

download_file('https://raw.githubusercontent.com/VatsaDev/nanoChatGPT/main/data/Chat/input15.txt')

def split_file(filename, output_dir, chunk_size):
  if not os.path.exists(output_dir):
    os.mkdir(output_dir)

  with open(filename, 'r') as f:
    lines = f.readlines()

  n_chunks = len(lines) // chunk_size
  for i in range(n_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(lines))

    chunk_lines = lines[start:end]

    output_filename = os.path.join(output_dir, f'{i}-dataset.txt')
    with open(output_filename, 'w') as f:
      f.writelines(chunk_lines)

split_file('dataset.txt', 'output', 10000)

def is_numbers(string):
  two_chars = string[:1]

  try:
    int(two_chars)
    return True
  except ValueError:
    return False

for filename in os.listdir('output'):
  if filename.endswith('.txt'):
    if is_numbers(filename) == True:
      if int(filename[:1]) <= 7:
        with open(f'output/{filename}', 'r') as f:
          data = f.read()
        train_ids = train_ids+enc.encode_ordinary(data)
      if int(filename[:1]) > 7:
        with open(f'output/{filename}', 'r') as f:
          data = f.read()
        val_ids = val_ids+enc.encode_ordinary(data)

print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

import os
import requests
import tiktoken
import numpy as np

Breaking this code down part by part, The imports are OS, request, tiktoken, and numpy. OS and request are used to manipulate files, tiktoken is the tokenizer, a way to turn the text we give GPT-2 into numbers it can work with.

train_ids=[]
val_ids=[]
enc = tiktoken.get_encoding("gpt2")

Then we define the train and val arrays, which are used to store tokenized content, along with the enc variable, which is just a way to store the tiktoken encoding value.

def download_file(url):
  response = requests.get(url)
  if response.status_code == 200:
    with open('dataset.txt', 'wb') as f:
      f.write(response.content)
      print("downloaded dataset, tokenizing")
  else:
    print('Error downloading file:', response.status_code)

download_file('https://raw.githubusercontent.com/VatsaDev/nanoChatGPT/main/data/Chat/input15.txt')

After that, we have the download_file function which gets the dataset from our desired location. The function itself is rather simple, we make a file called dataset.txt, request an online file, and write that to our local file.

def split_file(filename, output_dir, chunk_size):
  if not os.path.exists(output_dir):
    os.mkdir(output_dir)

  with open(filename, 'r') as f:
    lines = f.readlines()

  n_chunks = len(lines) // chunk_size
  for i in range(n_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(lines))

    chunk_lines = lines[start:end]

    output_filename = os.path.join(output_dir, f'{i}-dataset.txt')
    with open(output_filename, 'w') as f:
      f.writelines(chunk_lines)

split_file('dataset.txt', 'output', 10000)

The next function, split_file, is rather useful, and literally necessary at large dataset sizes. It takes the file we downloaded, and splits them into a bunch of smaller files in an output directory. It splits them by the number of lines in the text file. For example, here, the dataset is 100,000 lines of text, and chunk_size is 10,000, so the total number output files is just 100,000/10,000 = 10 chunk files.

def is_numbers(string):
  two_chars = string[:1]

  try:
    int(two_chars)
    return True
  except ValueError:
    return False

is_numbers is a simple utility function, which just checks to see if the first char of a filename is a number.

for filename in os.listdir('output'):
  if filename.endswith('.txt'):
    if is_numbers(filename) == True:
      if int(filename[:1]) <= 7:
        with open(f'output/{filename}', 'r') as f:
          data = f.read()
        train_ids = train_ids+enc.encode_ordinary(data)
      if int(filename[:1]) > 7:
        with open(f'output/{filename}', 'r') as f:
          data = f.read()
        val_ids = val_ids+enc.encode_ordinary(data)

This is the real tokenization loop. In this implementation, we loop through all the txt files in the output directory, then check if they are chunks with the is_numbers function, then send them to train if they begin with 7 or below, and val if they begin with a number higher than 7. In our case, with 10 chunk files, 0–7 go to train, and 8 and 9 go to val. An 80/20 split between train and val, on parts of the same dataset, great!

Another thing I would like to mention is that the way this was implemented is super important for you, if you scale dataset size. In the original Nano GPT repo, The fine tuning was based of downloading a dataset file off github, splitting the dataset into 90% train and 10% val strings. loading those into memory and tokenizing them. As dataset size grows, loading a lot of data like that onto system ram instantly crashes the PC, let alone tokenizing the data. Chunking the data allows for much smoother processing experience, being able to handle much more data, and tokenizing in chunks is faster, as I discovered, as when I switched to this process, my processing time went from around ~3min 30sec to ~2min.

print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

The last piece of code, simply printing out your train tokens and val tokens, using np.array to turn all our 32 bit integers into 16 bit integers, making the code useable by ML algorithms that can’t handle negative numbers and making the array take up less memory.

Then, running the first piece of real code, to prepare the train and val bin files,

!python data/Chat/prepare.py # makes train and val

you should see,

train has 1,853,650 tokens
val has 443,904 tokens

From here, all you would have to do to get the model running is to use chat.py with the Huggingface option, but since we're not here to do that, lets look into the the next block of code.

Now we get into the real work, finetuning the model. The code block we use,

!python train.py config/finetune-gpt2.py

the config, finetune-gpt2.py contains all the hyperparameters for the model. There are many hyperparameters, but the most important ones are, eval_interval, max_iters, init_from, dataset, batch_size, and learning_rate.

eval_interval determines every how many iterations we check the validation loss, then save the checkpoint to the ckpt.pt file which represents our model. we also have a hyperparameter known as always_save_checkpoint. Which is kept false, and since it’s false, the model only saves a checkpoint if the validation loss improves, which means our model gets better overtime.

In our case the eval_interval is set to 5. setting it lower will mean more evaluations, and possibly a lower val loss, while training will be slower. If we set eval_interval higher, the model might train faster, but the val loss might be higher than what it could have been. Changing hyperparameters is a tradeoff, but see what works best for you.

max_iters is the amount of iterations a model trains. A longer number of iterations is better for the model, as it has a longer time to train and learn, but it also results in a longer training time. for our model, we have 50 iters, and it takes ~30 mins to train. too much training time might also not be useful, due to the fact that over too many iters, training stabilizes out, and the val loss might change by inconsequential amounts. I have yet to try to train model for more than 100 iters, but the results in val loss and model outputs were quite similar to 50 iters, so I left it ait 50.

init_from is the location for the model weights. Here it’s GPT-2-medium, but you could use other options like the other models of the gpt-2 series, or a model you trained yourself already on this project, you could even finetune a model you’ve already finetuned this way.

dataset is the location of your train.bin and val.bin files, which here is chat, because the train.bin and val.bin files were saved in the chat directory.

batch_size is very important, as it's essential to both you’re pc not crashing, and your models training time. 1 batch is how many tokens your model can take in at a time. For GPT 2, this is 1024 tokens, or ~4096 chars. A higher batch size can make your models training time a lot faster, as the model can see multiple batches and move through the dataset faster, but also make computational usage jump, because your loading and processing more in memory. I chose a batch size of 4 because it was my GPUs limit.

learning_rate is the rate at which a model learns information. the larger the dataset, the lower the learning rate, to prevent the model from overfitting, or learning the wrong patterns in the training data. for an extremely large llm like chatGPT, the learning rate is 1e-4 . But in the whole NanoChatGPT model, the learning rate is 2e-5, because we have a much smaller dataset. for the purposes of this tutorial, with a dataset size of a couple mb, you could set the learning rate really high, like 3e-4, and still have no overfit issues.

The training loop

Coming from the original NanoGPT model, You don’t need to know this part to finetune or run the model, skip over this section if thats not what you’re interested in.

The actual training script, train.py, looks like this:

import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

from model import GPTConfig, GPT

# -----------------------------------------------------------------------------
# default config values designed to train a gpt2 (124M) on OpenWebText
# I/O
out_dir = 'out'
eval_interval = 2000
log_interval = 1
eval_iters = 200
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = True # if True, always save a checkpoint after each eval
init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'
# wandb logging
wandb_log = False # disabled by default
wandb_project = 'owt'
wandb_run_name = 'gpt2' # 'run' + str(time.time())
# data
dataset = 'openwebtext'
gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 1024
# model
n_layer = 12
n_head = 12
n_embd = 768
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 6e-4 # max learning rate
max_iters = 600000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 2000 # how many steps to warm up for
lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla
min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
# DDP settings
backend = 'nccl' # 'nccl', 'gloo', etc.
# system
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = True # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open('configurator.py').read()) # overrides from command line or config file
config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------

# various inits, derived attributes, I/O setup
ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
    seed_offset = ddp_rank # each process gets a different seed
    # world_size number of processes will be training simultaneously, so we can scale
    # down the desired gradient accumulation iterations per process proportionally
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size
else:
    # if not ddp, we are running on a single gpu, and one process
    master_process = True
    seed_offset = 0
    ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

if master_process:
    os.makedirs(out_dir, exist_ok=True)
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# poor man's data loader
data_dir = os.path.join('data', dataset)
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

# init these up here, can override if init_from='resume' (i.e. from a checkpoint)
iter_num = 0
best_val_loss = 1e9

# attempt to derive vocab_size from the dataset
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # force these config attributes to be equal otherwise we can't even resume training
    # the rest of the attributes (e.g. dropout) can stay as desired from command line
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # create the model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix the keys of the state dictionary :(
    # honestly no idea how checkpoints sometimes get this prefix, have to debug more
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value
model.to(device)

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
if init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

# wrap model into DDP container
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# logging
if wandb_log and master_process:
    import wandb
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)

# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model.module if ddp else model # unwrap DDP container if needed
running_mfu = -1.0
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if wandb_log:
            wandb.log({
                "iter": iter_num,
                "train/loss": losses['train'],
                "val/loss": losses['val'],
                "lr": lr,
                "mfu": running_mfu*100, # convert to percentage
            })
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            # in DDP training we only need to sync gradients at the last micro step.
            # the official way to do this is with model.no_sync() context manager, but
            # I really dislike that this bloats the code and forces us to repeat code
            # looking at the source of that context manager, it just toggles this variable
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

if ddp:
    destroy_process_group()

Breaking this code down, we get

import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

from model import GPTConfig, GPT

these imports give are used to get the GPT model configuration, the ability to run torch training processes on 1 or more GPUs, the ability to work with pickle files, numpy, etc.

After that comes all the possible hyperparameters and the ability to modify them with the command line configurator.

ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
    seed_offset = ddp_rank # each process gets a different seed
    # world_size number of processes will be training simultaneously, so we can scale
    # down the desired gradient accumulation iterations per process proportionally
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size
else:
    # if not ddp, we are running on a single gpu, and one process
    master_process = True
    seed_offset = 0
    ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

This code checks whether or not we are running in DDP (split over multiple GPUs) or on a single GPU. After that, it prints the amount of tokens it will go through in one iter. For an LLM with massive amounts of data, you probably won’t go through all the data, so don’t worry if the number of iters with tokens per iter doesn’t match with total tokens.

if master_process:
    os.makedirs(out_dir, exist_ok=True)
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

this code checks to see if its the main process, then sets up a directory where our model’s ckpt.pt goes. Then we have a pytorch seed to make our responses reproducible, but you could remove that for variation in answers when working with the model multiple times. Then we allow for TF32 for matrix multiplication and cuDNN. This helps improve performance. Then we check for cuda and GPU, then we setup a dictionary to match data types with pytorchs datatypes. Then we ctx, which is nothing if the device is a cpu, but is an AMP context otherwise, another performance improver.

data_dir = os.path.join('data', dataset)
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

This code gets the train.bin and val.bin files, and makes batches that are batch_size large, and converts the numpy arrays to pytorch tensors. It then moves them to GPU if possible, and finally returns a batch of data as a tuple of (x, y), where x is the input data and y is the output data.

meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

This code attempts to find the vocab size of a custom meta.pkl if you provided one, which we don’t, as we use the gpt 2 encodings.

model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # force these config attributes to be equal otherwise we can't even resume training
    # the rest of the attributes (e.g. dropout) can stay as desired from command line
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # create the model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix the keys of the state dictionary :(
    # honestly no idea how checkpoints sometimes get this prefix, have to debug more
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)

# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value
model.to(device)

The code begins with the model arguments, which are used to initiate the model, the arguments include the number of layers, the number of heads, the number of embedding dimensions, the block size, the bias setting, and the vocabulary size.

The code then checks the value of the init_from variable. If init_from is set to scratch, then the code creates a new GPT-2 model from scratch. If init_from is set to resume, then the code resumes training from a checkpoint. If init_from is set to a string that starts with gpt2, then the code initializes the model from the pre-trained GPT-2 weights.

then we chop blocks to block_size if they happen to be to large, and then moves model to gpu if available.

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
if init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

# wrap model into DDP container
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

Then Initiate a GradScaler, a tool in AMP training which speeds up training, then makes an optimizer or checks for one. If compile is set to true, we use torch.compile, which can make inferencing the model faster, and If ddp is being used, the model is wrapped in a ddp container.

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

This function gives us the loss estimates we see during training, The function is decorated with the torch.no_grad() decorator, which tells PyTorch not to compute gradients for the function. This is important because the function is only estimating the loss, and there is no need to compute gradients for this purpose.

the model enters eval mode, checks the losses over the splits, gives the mean loss, and sets the model back to training.

def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

The code above deals with the learning rate, when the model is in its warm up iterations, the learning rate gets a linear warmup. This means that the learning rate starts at 0 and increases linearly to the learning rate over the warmup iterations.

If the iteration number is greater than the warmup iterations, then the function checks if it is greater than the learning rate decay iterations. If it is, then the function returns the minimum learning rate.

If the iteration number is between the warmup iterations and the learning rate decay iterations, then the function uses a cosine decay schedule. This means that the learning rate starts at the specified learning rate and decays to the minimum learning rate using a cosine function.

# training loop
X, Y = get_batch('train')
t0 = time.time()
local_iter_num = 0 
raw_model = model.module if ddp else model 
running_mfu = -1.0
while True:

    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if wandb_log:
            wandb.log({
                "iter": iter_num,
                "train/loss": losses['train'],
                "val/loss": losses['val'],
                "lr": lr,
                "mfu": running_mfu*100,
            })
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    for micro_step in range(gradient_accumulation_steps):
        if ddp:
           model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps 
        X, Y = get_batch('train')
        scaler.scale(loss).backward()
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5:
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    if iter_num > max_iters:
        break

if ddp:
    destroy_process_group()

The code above is the real training loop for the model. It begins by fetching the very first batch, then setting up the iteration counter, then it determines and sets the learning rate for the iteration. Then the code write down the train and val loss, and saves the checkpoint if the val loss is the best so far. The code then performs a forward and backward pass of the model, which involves sending input to get an output in the forward pass, and the process of computing the gradients of the loss function is the backward pass.

The code then clips the gradients, The code then updates the GradScaler, which is used to scale the gradients. Finally, the code zeroes the gradients so that they are not accumulated for the next iteration.

Then the code prints the loss, time, and memory utilization (mfu). The code then increments the iteration counter and repeats the loop. The code finally destroys the process group if the model is using distributed training.

Chatting with the Bot

Now that we have a dataset, and a model, it's time to chat, or inference the model.

For NanoChatGPT, we do this with

!python chat.py --out_dir=/content/drive/MyDrive/Model --init_from=resume --context="<human>Hello how are you?<endOfText><bot>Hello, I'm fine how about you?<endOfText>"

This gets our model, starts it off with a context, a great way to keep the bot more focused on the topic at hand. You can now chat with the model. Now, looking at the Chat.py file,

"""
Chat with a trained model
"""
import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT
import requests

# -----------------------------------------------------------------------------
init_from = 'resume'
out_dir = 'out-chat' # where finetuned model lives
num_samples = 1 # no samples. 1 for 1 chat at a time
max_new_tokens = 100
temperature = 0.8 
top_k = 5 # retain only the top_k most likely tokens, clamp others to have 0 probability
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
context="<human>Hello, how are you?<endOfText><bot>Thanks, Im good, what about you?<endOfText><human>Im great thanks, My names James, and I'm from the UK, wbu?<endOfText><bot>Hi James, I'm Conner, and im from america. <endOftext>" # a little context for better chat responses
exec(open('configurator.py').read()) # overrides from command line, only for out_dir location, if you store the ckpt.pt elsewhere, like gdrive, to escape finetuning everytime you run the colab
# -----------------------------------------------------------------------------

torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

def download_ckpt(url):
  response = requests.get(url)
  if response.status_code == 200:
    with open('ckpt.pt', 'wb') as f:
      f.write(response.content)
  else:
    print('Error downloading file:', response.status_code)

# gets model
# init from a model saved in a specific directory
if init_from == 'huggingface':
  if os.path.isfile('ckpt.pt'):
    # init from huggingface model
    ckpt_path = 'ckpt.pt'
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict) 
  else:
    # init from huggingface model
    download_ckpt('https://huggingface.co/VatsaDev/ChatGpt-nano/resolve/main/ckpt.pt')
    ckpt_path = 'ckpt.pt'
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict) 
elif init_from == 'resume':
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)

model.eval()
model.to(device)
if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
decode = lambda l: enc.decode(l)

def respond(input, samples): # generation function
    x = (torch.tensor(encode(input), dtype=torch.long, device=device)[None, ...]) #get the last 4000 chars of the input, make sure gpt2 give no errors, and that its the most recent context
    with torch.no_grad():
        with ctx:
            for k in range(samples):
                generated = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
                output = decode(generated[0].tolist())

                # replace context
                output = output.replace(input,'')
                # remove any human response
                output =  output.partition('<human>')
                # if the bot has anything left afterwards, the endOfText token is put to use
                output_text =  output[0].rpartition('<endOftext>')
                output_text = output[0] + output[1]
                # label removing
                output_text = output_text.replace('<human>',' ')
                output_text = output_text.replace('<bot>',' ')
                output_text = output_text.replace('<endOfText>',' ')
                return output_text

# chat loop
while True:
    # get input from user
    start_input = input('User: ')
    start = '<human>'+start_input+'<endOfText><bot>'

    # context
    context=context+start
    
    out = respond(context, num_samples)
    context=context+out+'<endOfText>'
    print('Bot: '+ out)

The code begins with stuff we’ve seen in the other files,

import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT
import requests

# -----------------------------------------------------------------------------
init_from = 'resume'
out_dir = 'out-chat' # where finetuned model lives
num_samples = 1 # no samples. 1 for 1 chat at a time
max_new_tokens = 100
temperature = 0.8 
top_k = 5 # retain only the top_k most likely tokens, clamp others to have 0 probability
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
context="<human>Hello, how are you?<endOfText><bot>Thanks, Im good, what about you?<endOfText><human>Im great thanks, My names James, and I'm from the UK, wbu?<endOfText><bot>Hi James, I'm Conner, and im from america. <endOftext>" # a little context for better chat responses
exec(open('configurator.py').read()) # overrides from command line, only for out_dir location, if you store the ckpt.pt elsewhere, like gdrive, to escape finetuning everytime you run the colab
# -----------------------------------------------------------------------------

torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

Imports for everything, the models hyperparameters, then the device, matrix multiplication, cuDNN, ctx, etc.

def download_ckpt(url):
  response = requests.get(url)
  if response.status_code == 200:
    with open('ckpt.pt', 'wb') as f:
      f.write(response.content)
  else:
    print('Error downloading file:', response.status_code)

The download ckpt function is used to download the ckpt.pt file off huggingface when needed

if init_from == 'huggingface':
  if os.path.isfile('ckpt.pt'):
    # init from huggingface model
    ckpt_path = 'ckpt.pt'
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict) 
  else:
    # init from huggingface model
    download_ckpt('https://huggingface.co/VatsaDev/ChatGpt-nano/resolve/main/ckpt.pt')
    ckpt_path = 'ckpt.pt'
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict) 
elif init_from == 'resume':
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)

This code checks whether you want to use your own model or get the NanoChatGPT model on huggingface If you already have the file, it will just run inference on it, or download it if you don’t. If you pick resume, it will get the model from where you would have saved it.

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
decode = lambda l: enc.decode(l)

Then we initiate the gpt 2 encoding.

def respond(input, samples): # generation function
    x = (torch.tensor(encode(input), dtype=torch.long, device=device)[None, ...]) 
    with torch.no_grad():
        with ctx:
            for k in range(samples):
                generated = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
                output = decode(generated[0].tolist())

                # replace context
                output = output.replace(input,'')
                # remove any human response
                output =  output.partition('<human>')
                # if the bot has anything left afterwards, the endOfText token is put to use
                output_text =  output[0].rpartition('<endOftext>')
                output_text = output[0] + output[1]
                # label removing
                output_text = output_text.replace('<human>',' ')
                output_text = output_text.replace('<bot>',' ')
                output_text = output_text.replace('<endOfText>',' ')
                return output_text

The respond function is there to separate content from generation. The respond function takes input, encodes it and makes a torch.tensor. Then we actually generate the content, replace the context we gave it in the input, remove any response generated with the human token in it, as we only want the bot response. we also check to see if there's anything after the <endOfText> token, just in case it decided to generate something, even though it shouldn’t. and finally, we remove any syntax tags before the output is generated.

# chat loop
while True:
    # get input from user
    start_input = input('User: ')
    start = '<human>'+start_input+'<endOfText><bot>'

    # context
    context=context+start
    
    out = respond(context, num_samples)
    context=context+out+'<endOfText>'
    print('Bot: '+ out)

Finally the chat loop, which runs till the user quits it, and a context containing the whole chat with syntax tags.

Thats how you turn GPT 2 into a chatbot. Keep reading for limitations and improvements.

Features, and Future improvements

While this isn’t the full NanoChatGPT, the full version has may things this version doesn’t, including,

Medium Dataset(~700mb), full of a variety of conversations, and a little arithmetic
Model and datasets available on Huggingface
(at best), it can talk to you on a variety of topics and smoothly switch between topics, feeling like a texting human.

In terms of future improvements, these are things that could be done for this model, but I will probably not spend time trying to make this for this model. If interested, make a PR.

Math and Logical Reasoning → While there are datasets for this, it's a lot of stuff to add and it's pretty different from the rest of the dataset, so I might add them, but it would be worse off than model with more parameters learning that.

Short term memory → I haven’t found a great dataset yet but one could make a short term memory in this format,

<human> Dogecoin is cool <endOfText>
<bot> indeed, very shibe <endOfText>
<human> what were we just talking about? <endOfText>
<bot> {summary} <endOfText>

That's it for this tutorial, I hope you found this interesting, and that you build on NanoChatGPT for your own cool stuff.

Github → https://github.com/VatsaDev/nanoChatGPT