LLM Primer

This section provides a high-level overview of key concepts related to Large Language Models (LLMs). If you are familiar with LLMs, you may want to proceed to LLM Ops.

Large Language Models (LLMs) are built on the Transformer architecture and serve as a foundational component in enabling computers to understand and generate human language. Drawing from large datasets, they learn linguistic structures, semantics, and context through a process called pre-training.

In their most sophisticated form (e.g., GPT-4), these models produce remarkably human-like text: they can answer questions, translate languages, and even handle tasks such as document analysis or SQL query generation.


Model Foundations

Large language models are built by training on massive datasets of tokenized text to learn patterns and relationships between words. Through an intensive compute process, models ingest sequences of tokens and learn to predict the next token in context. The training dataset is what makes each model unique and plays the largest role in model performance.

As models train, their parameters are tuned to generate human-like responses. The size of a model, measured in parameters, typically determines its power and performance. However, a trend toward highly capable smaller models began in 2023 H2, driven by innovations in training and inference. While state-of-the-art models have hundreds of billions of parameters, open-source models trend from 7 to 65 billion parameters.

  • Token (Numerical representation of language)

    Tokens are a unique numerical representation of a word or partial word. Tokenization allows LLMs to handle and process text data; see the tokenization sketch after this list.

    Most LLMs have a roughly 1.3:1 ratio of tokens to English words.

  • Parameters (Internalized knowledge)

    Parameters represent the learned patterns and relationships between tokens within the training data. ML engineers convert massive datasets into tokenized training data for training.

    Commonly used datasets include The Pile, CommonCrawl, OpenAssistant Conversations, and websites such as Reddit, StackOverflow, and GitHub.

  • Training (Model computation)

    Training is the process of converting tokenized content into model parameters; the result is a reusable model for inference. The model is fed sequences of training tokens and learns to predict the next token in each sequence, optimizing its parameters for accurate and contextually appropriate responses. See the training sketch after this list.

  • Model Size (Number of parameters in training)

    Number of parameters is the typical measurement for model size. State-of-the-art models (GPT-4, PaLM 2) trend in the hundreds of billions to trillions of parameters, while emerging open-source models (MPT, Falcon) trend between 7B and 65B parameters.
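
To make the token-to-word ratio concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (an assumed choice; any tokenizer behaves similarly) to count the tokens in a sentence:

    import tiktoken  # pip install tiktoken

    # cl100k_base is the encoding used by GPT-3.5/GPT-4 era models.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Large language models convert text into tokens before processing."
    tokens = enc.encode(text)   # list of integers, one per token

    print(len(text.split()))    # 9 words
    print(len(tokens))          # token count, typically ~1.3x the word count
    print(enc.decode(tokens))   # decoding round-trips back to the original text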

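The next-token objective itself is compact. Below is a minimal, illustrative PyTorch training step on a toy model (an illustration only; real pre-training uses deep Transformer stacks and vastly more data):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 1000, 64

    # Toy "language model": a token embedding followed by a linear head
    # that predicts a distribution over the next token.
    model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                          nn.Linear(embed_dim, vocab_size))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token sequences
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict the next token

    logits = model(inputs)                           # (batch, seq, vocab)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                       targets.reshape(-1))
    loss.backward()
    optimizer.step()                                 # parameters absorb the learned patterns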


Model Inference

Inference is the process of using a trained LLM to generate predictions from new input. The model is loaded into memory, new data is presented in the form of a prompt, and the model generates a completion. The size of the context window has a significant impact on the depth of the LLM's predictions.

  • Context Window (Total tokens at inference)

    The context window is the total number of tokens used during inference, including both the input prompt and the generated output.

    Early versions of GPT-3 and most open-source models have a context window of 2,048-4,096 tokens. GPT-3.5 Turbo supports up to 16k tokens, GPT-4 Turbo 128k, and Claude 2.1 200k.

  • Prompt (Initial model input)

    A prompt provides the initial input that steers the model's response in a particular direction. Like setting the stage, prompts focus the model on a specific topic, style, or genre, narrowing the model's internal search space.

  • Completion (Model-generated response)

    The completion is the text generated by the model in response to the prompt. The length and variability of the completion depend on the prompt and on model configuration parameters such as temperature and max tokens; see the sketch after this list.
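
As a concrete illustration, here is a minimal sketch using the OpenAI Python client (one possible provider; the prompt/completion pattern is the same elsewhere) showing how the prompt, temperature, and max-token settings shape the completion:

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Summarize the role of tokens in LLMs in one sentence."}],
        temperature=0.2,  # low temperature: focused, less variable output
        max_tokens=60,    # cap on generated tokens; prompt + output must fit the context window
    )

    print(response.choices[0].message.content)  # the completion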


Hardware Requirements

Large language models require substantial computational resources for both training and inference. A cloud GPU server costs between $100 and $1,000 per day, depending on the number of GPUs and the amount of memory.

Training LLMs typically necessitates multiple high-end GPUs, such as NVIDIA's A100 or H100, which are renowned for their large memory capacity (80 GB per card) and robust processing capabilities. Meta's LLaMA model (65B parameters) was trained on 2,048 80 GB A100 GPUs over 21 days. Renting 2,048 A100 GPUs from AWS (256 P4DE instances) for 21 days would cost approximately $3.8M.
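
The arithmetic behind that estimate is straightforward. A back-of-the-envelope sketch, assuming an effective rate of roughly $29.50 per instance-hour (an assumption below on-demand p4de.24xlarge pricing, e.g., with committed-use discounts):

    instances = 2048 // 8      # 256 p4de.24xlarge instances, 8 A100s each
    hourly_rate = 29.50        # assumed effective $/instance-hour (discounted)
    hours = 21 * 24            # 21 days of training

    total = instances * hourly_rate * hours
    print(f"${total:,.0f}")    # ~$3.8M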

Inference breakthroughs enable running LLMs locally on Apple M1/M2 hardware.

  • Training (Millions of dollars)

    Training hardware requirements are a function of parameter count, batch size, and training time. The compute used in the largest training runs has doubled roughly every 3.4 months, resulting in significant breakthroughs each year. GPT-3 cost roughly $5M to train in 2020 but would cost around $500k in 2023, primarily due to software improvements.

  • Finetuning (Thousands of dollars)

    Finetuning hardware requirements are a function of model size, batch size, and finetuning time.

    Finetuning takes a pre-trained model and trains it further on a new dataset; it is significantly faster and cheaper than training from scratch. Finetuning requires roughly 12x the GPU memory of the model size, though methods like LoRA (Low-Rank Adaptation) reduce this dramatically and enable finetuning a 65B-parameter model on a new instruction-following dataset in hours to days.

  • Inference (Hundreds of dollars)

    Inference hardware requirements are a function of model size, context window, and the number of concurrent inference requests.

    Inference typically requires 2.1x the GPU memory of the model size, due to the need to store both the model parameters and intermediate activations during the forward pass. See the memory sketch after this list.

    Quantization breakthroughs enable running LLMs on significantly smaller hardware, requiring only 70% of the GPU memory of the model size.
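
To make these rules of thumb concrete, here is a small sketch (assuming fp16 weights at 2 bytes per parameter, and applying the 2.1x and 70% multipliers from above) that estimates GPU memory for a given model size:

    BYTES_PER_PARAM_FP16 = 2  # fp16/bf16 weights: 2 bytes per parameter

    def gpu_memory_gb(params_billions: float) -> dict:
        """Rule-of-thumb GPU memory estimates for a model of the given size."""
        weights_gb = params_billions * 1e9 * BYTES_PER_PARAM_FP16 / 1e9
        return {
            "weights": weights_gb,          # parameters alone
            "inference": 2.1 * weights_gb,  # + activations during the forward pass
            "quantized": 0.7 * weights_gb,  # with quantization breakthroughs
        }

    # A 65B-parameter model: ~130 GB of weights, ~273 GB for standard
    # inference, and ~91 GB quantized.
    print(gpu_memory_gb(65))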