Fine-Tune the GPT-2 Model on the Works of Shakespeare
GPT-2
GPT-2 (Generative Pre-trained Transformer 2) is a large-scale language model developed by OpenAI.
GPT-2 is designed to generate human-like text by predicting the next word in a sequence of words, based on the words that came before it. It uses a type of deep learning called transformer architecture, which allows it to understand and model the complex relationships between words and their context.
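To make this concrete, here is a minimal sketch of next-word prediction with the pre-trained GPT-2 model, assuming the Hugging Face transformers and PyTorch packages are installed (the prompt is just an illustrative example):
# minimal sketch: ask pre-trained GPT-2 for the most likely next token of a prompt
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = "All the world's a"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    logits = model(input_ids).logits

# the logits at the last position score every possible next token; take the highest-scoring one
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))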
Fine-tuning
Fine-tuning a GPT-2 model refers to the process of taking a pre-trained GPT-2 model and training it further on a specific task or dataset to improve its performance on that particular task.
Let’s say you work for a customer support company that receives a large number of customer inquiries via email. Your job is to develop an automated system that can understand and respond to these inquiries, in order to reduce the workload of your customer support team.
One way to approach this problem is to use a GPT-2 model that has been fine-tuned on a dataset of customer inquiries and corresponding responses. The idea is to train the model to generate appropriate responses based on the content of the incoming emails.
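As an illustrative sketch of how such a dataset could be turned into plain training text for GPT-2, the inquiry/response pairs might be flattened into a simple prompt-and-answer pattern (the separator format and the example pairs below are assumptions, not from a real support dataset):
# hypothetical example: flatten inquiry/response pairs into training text for GPT-2
pairs = [
    ("Where is my order?", "You can track your order from the link in your confirmation email."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
]

def to_training_text(inquiry, response):
    # the model learns to continue the "Inquiry: ... Response:" pattern with an answer
    return f"Inquiry: {inquiry}\nResponse: {response}\n"

training_text = "".join(to_training_text(q, a) for q, a in pairs)
print(training_text)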
Preprocessing Data
1. Creating a Tokenizer: A tokenizer is a tool that splits text into individual tokens (words, punctuation marks, etc.) that can be used as input to a machine learning model. You can use various tokenizers, such as the WordPiece tokenizer, Byte Pair Encoding (BPE) tokenizer, or SentencePiece tokenizer.
#pip install transformers
from transformers import AutoTokenizer
import tensorflow as tf
# create the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# tokenize the sentence
text = "All the world's a stage, and all the men and women merely players."
tokens = tokenizer.tokenize(text)
tokens
['all', 'the', 'world', "'", 's', 'a', 'stage', ',', 'and', 'all', 'the', 'men', 'and', 'women', 'merely', 'players', '.']
2. Encoding the Text: Encoding is the process of converting the text into a numerical form that can be fed into a machine learning model. In the case of NLP, this typically involves converting each token into its corresponding integer index.
# encode the tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
[2035, 1996, 2088, 1005, 1055, 1037, 2754, 1010, 1998, 2035, 1996, 2273, 1998, 2308, 6414, 2867, 1012]
3. Batching the Data: Batching is the process of grouping the encoded text into batches of a fixed size. This is done to speed up the training process and make the best use of available computational resources. Here’s how to batch the encoded data:
# encode the full sentence again to also get the attention mask and token type ids
encoding = tokenizer(text, return_tensors='tf')

# batch the encoded text
batch_size = 1 # set the batch size to 1 to create a single example
dataset = tf.data.Dataset.from_tensor_slices((encoding['input_ids'], encoding['attention_mask'], encoding['token_type_ids']))
dataset = dataset.batch(batch_size)
Hugging Face Transformers
Let’s say you work for a social media monitoring company that helps businesses track mentions of their brand on social media platforms. Your job is to develop a machine-learning model that can classify social media posts as either positive or negative, based on the sentiment expressed in the post.
One way to approach this problem is to use the Transformers library from Hugging Face, which provides a wide range of pre-trained models for natural language processing tasks, including sentiment analysis.
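As a rough sketch (assuming the transformers package is installed; the default sentiment model is downloaded automatically the first time this runs), the pipeline API can classify posts out of the box:
# rough sketch: sentiment classification with the Hugging Face pipeline API
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

posts = [
    "I love the new update, great job!",
    "Worst customer service I have ever experienced.",
]
for post, result in zip(posts, classifier(posts)):
    print(post, '->', result['label'], round(result['score'], 3))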
Fine-Tune GPT-2
Here are the general steps you can follow to fine-tune a GPT-2 model using the Shakespeare Dataset:
- Load the Shakespeare Dataset using the datasets library.
- Preprocess the data by creating a tokenizer, encoding the text, and batching the data.
- Load a pre-trained GPT-2 model using the transformers library.
- Set up the training loop, including defining the optimizer and loss function.
- Train the model on the Shakespeare Dataset.
- Generate some text using the trained model.
import os
import random
import torch
from torch.optim import AdamW
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup

# set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# load text file as dataset
with open('/content/shakespear.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# initialize GPT2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# tokenize the text and split the token ids into fixed-length blocks of 512 tokens
block_size = 512
token_ids = tokenizer.encode(text)
blocks = [token_ids[i:i + block_size] for i in range(0, len(token_ids) - block_size + 1, block_size)]
input_ids = torch.tensor(blocks).to(device)

# set training parameters
train_batch_size = 4
num_train_epochs = 3
learning_rate = 5e-5

# initialize optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(input_ids) * num_train_epochs // train_batch_size
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# train the model
model.train()
for epoch in range(num_train_epochs):
    epoch_loss = 0.0
    num_batches = 0
    for i in range(0, len(input_ids), train_batch_size):
        # slice the input ids tensor to get the current batch
        batch_input_ids = input_ids[i:i + train_batch_size]
        # GPT2LMHeadModel shifts the labels internally when computing the loss,
        # so the labels are simply a copy of the input ids
        batch_labels = batch_input_ids.clone()
        # clear gradients
        optimizer.zero_grad()
        # forward pass
        outputs = model(input_ids=batch_input_ids, labels=batch_labels)
        loss = outputs.loss
        # backward pass
        loss.backward()
        epoch_loss += loss.item()
        num_batches += 1
        # clip gradients to prevent the exploding gradients problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        scheduler.step()
    print('Epoch: {}, Loss: {:.4f}'.format(epoch + 1, epoch_loss / num_batches))

# save the trained model
output_dir = './results/'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
Here’s a detailed explanation of the code:
- The code first imports the os module, which provides operating-system-dependent functionality such as reading from and writing to the file system, and the random module, which provides functions for generating random numbers.
- The torch module is imported for tensor computations with strong GPU acceleration, along with the AdamW optimizer from torch.optim.
- The GPT2Tokenizer and GPT2LMHeadModel classes and the get_linear_schedule_with_warmup function are imported from the transformers module, a library of state-of-the-art natural language processing (NLP) models.
- The next two lines set the random seed for reproducibility.
- The open() function opens a file named shakespear.txt, which contains the text data. The 'r' argument tells Python to open the file in read mode, and encoding='utf-8' specifies the character encoding. The text is read and stored in a variable called text.
- The GPT2Tokenizer and GPT2LMHeadModel classes are initialized from the pre-trained 'gpt2' checkpoint. They are used to tokenize the text data and train the language model, respectively.
- The device variable is set to use the GPU if available, otherwise the CPU, and the model is moved to that device.
- The text is tokenized with tokenizer.encode(), and the resulting token ids are split into fixed-length blocks of 512 tokens. The blocks are stacked into the input_ids tensor and moved to the selected device.
- The training parameters are set with a batch size of 4, 3 epochs, and a learning rate of 5e-5.
- The optimizer and scheduler are initialized with the AdamW optimizer and a linear learning rate schedule with warm-up.
- The model.train() method is called to set the model to training mode.
- The code then loops through each epoch and each batch of training data. The labels for each batch are a copy of the input ids, since GPT2LMHeadModel shifts them internally when computing the loss. For each batch, the loss is calculated in the forward pass, the gradients are computed in the backward pass and clipped to prevent exploding gradients, and the model parameters are updated with the optimizer.step() method. The learning rate is adjusted with the scheduler.step() method. The average loss for the epoch is printed.
- The trained model and tokenizer are saved in the ./results/ directory, which is created if it does not exist, using the model.save_pretrained() and tokenizer.save_pretrained() methods.
Overall, this code trains a GPT-2 language model on a text dataset and saves the trained model and tokenizer for later use.
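The last step in the list above, generating text, is not shown in the training script. Here is a minimal sketch, assuming the model and tokenizer were saved to the ./results/ directory as above (the prompt is only an example):
# minimal sketch: generate text with the fine-tuned model saved in ./results/
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('./results/')
model = GPT2LMHeadModel.from_pretrained('./results/')
model.eval()

prompt = "Shall I compare thee"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# sample up to 50 tokens of continuation from the fine-tuned model
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=50,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))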