Notes from Andrej Karpathy's NanoGPT codealong

  • ChatGPT
    • generates text left to right
    • is a probabilistic system
    • stands for "Chat Generative Pre-trained Transformer"
    • is a language model: it models sequences of characters, words, or tokens, i.e. it predicts how characters/words/tokens follow each other in a language
    • given a question/prompt, ChatGPT is completing the sequence
    • is based on the Transformer architecture (see the 2017 landmark paper "Attention Is All You Need")

  • NanoGPT
    • trained on OpenWebText
    • reproduces the GPT-2 124M-parameter model

Codealong: NanoGPT
  • is a character-level language model
  • trained on Tiny Shakespeare
  • generates infinite Shakespeare

Tokenization

  • character level
    • used in the codealong
  • word level
  • sub-word level
    • Google SentencePiece
    • OpenAI tiktoken (used in GPT)

Trade-off between codebook size and sequence length: with word-level or sub-word tokenization the vocabulary (codebook) is much larger, but the encoding of a given text is much more compact (shorter sequences); character-level tokenization has a tiny vocabulary but very long sequences.
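A minimal sketch of the character-level tokenization used in the codealong (the `input.txt` path and variable names are illustrative):

```python
# Character-level tokenizer sketch: the codebook is the set of unique characters
# in the training corpus (e.g. the Tiny Shakespeare file loaded as a string).
text = open('input.txt', 'r', encoding='utf-8').read()

chars = sorted(set(text))      # every unique character in the corpus
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}   # integer id -> character

encode = lambda s: [stoi[c] for c in s]              # string -> list of ints
decode = lambda ids: ''.join(itos[i] for i in ids)   # list of ints -> string

print(encode("hii there"))
print(decode(encode("hii there")))  # round-trips back to "hii there"
```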

Training

  • Training happens on:
    • chunks of data of a given block size (also called context length)
      • each chunk of block_size tokens contains block_size individual training examples ("contexts"), from a context of 1 token up to block_size tokens
      • the block size gives the 'time' (T) dimension of the training tensor
    • batches of batch_size chunks (see the sketch after this list)
      • the chunks in a batch are processed independently; batching exists purely for efficiency on the GPU
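A rough sketch of how such batches can be sampled, in the spirit of the lecture (the `get_batch` helper and the constants are illustrative; `data` is assumed to be a 1-D tensor of token ids):

```python
import torch

block_size = 8    # context length (the 'time' dimension T)
batch_size = 4    # number of independent chunks processed in parallel

def get_batch(data):
    """Sample a batch of (input, target) chunks from a 1-D tensor of token ids."""
    # random starting offsets, leaving room for block_size + 1 tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs  (B, T)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets (B, T), shifted by one
    return x, y

# Each position t in x predicts y[:, t], so one chunk yields block_size training contexts.
```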

Bigram Language Model

  • See the makemore series for background; a minimal sketch of the model follows below
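A minimal sketch of a bigram language model along the lines of the one built in the lecture, assuming PyTorch and the `vocab_size` from the tokenizer sketch above:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Each token directly looks up the logits for the next token from an embedding table."""

    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the (unnormalized) next-token distribution following token i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token ids in the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)        # next-token distribution
            idx_next = torch.multinomial(probs, num_samples=1) # sample one token per sequence
            idx = torch.cat((idx, idx_next), dim=1)            # append and continue
        return idx
```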

Jupyter Notebooks

To Do

  • walk through the makemore playlist
  • revisit this lecture from the 25-minute timestamp