In the previous part of Understanding Foundation Models, we covered the following:
Attention Mechanism: How models understand sentence structure
Model Size: How the size of a model impacts its output
Post-Training: How post-training is used to align the model with human preferences through supervised fine-tuning and reinforcement learning.
This part explores why an AI model's output is probabilistic and how different sampling strategies yield different results.
When an input is given to a language model, it calculates the probability of every possible next token. For example, given the input "Which is the most difficult subject for a grade 8 student?", the model computes a probability distribution over all tokens in its vocabulary to generate the next token.
The most difficult subject for a Grade 8 student is ______
Math: 50%
Science: 20%
Art: 10%
the: 5%
none: 1%
The most intuitive answer would be to pick the token with the highest probability, which is Math. But this makes the output boring and predictable.
The most difficult subject for a Grade 8 student is Math
Let us see what other outputs the model can generate if it chooses, say, “the” or “none”:
The most difficult subject for a Grade 8 student is the one the student spends the least amount of time on.
The most difficult subject for a Grade 8 student is none other than the most dreaded subject: Math.
This is what makes LLM output probabilistic. Instead of always selecting the token with the highest probability, the model chooses Math 50% of the time, Art 10% of the time, none 1% of the time, and so on. Always picking the highest-probability token remains a good fit for classification tasks, where consistency matters more than variety.
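To make this concrete, here is a minimal Python sketch contrasting the two approaches. The probabilities are the illustrative numbers from the example above, not real model output; an actual model assigns a probability to every token in a vocabulary of tens of thousands.

```python
import random

# Illustrative next-token probabilities from the example above (assumed
# numbers, not real model output). A real model covers its entire
# vocabulary, so these values do not sum to 1 on their own.
next_token_probs = {
    "Math": 0.50,
    "Science": 0.20,
    "Art": 0.10,
    "the": 0.05,
    "none": 0.01,
}

# Greedy decoding: always pick the single most likely token.
print(max(next_token_probs, key=next_token_probs.get))  # always "Math"

# Probabilistic sampling: pick tokens in proportion to their probability,
# so "Math" comes out ~50% of the time, "Science" ~20%, and so on.
tokens = list(next_token_probs)
weights = list(next_token_probs.values())
print(random.choices(tokens, weights=weights, k=1)[0])  # varies per run
```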
Sampling Strategies
In the previous section, we saw how the model calculates probabilities over the next token and how always picking the highest probability token (greedy decoding) can make the output dull and repetitive. This section explores different sampling strategies that allow the model to be more creative, diverse, or focused — depending on the goal.
Let us take a fitness-related example.
Input: What is a good 10-minute morning workout?
The model might assign the following probabilities to the next token:
Jumping jacks - 40%
Push-ups - 25%
Plank - 15%
Burpees - 10%
Meditation - 5%
Deadlift - 1%
If we apply greedy decoding, the model will always pick Jumping jacks, the token with the highest probability. This is useful for consistency, but it will keep giving the same response.
To get more variety and make the output feel more human, we can use different sampling strategies:
1. Temperature: Controlling Creativity
Temperature is a parameter that controls how sharp or "soft" the probability distribution is.
Low temperature (e.g., 0.2): Makes the model more confident. High probability tokens have even higher chance of being picked. Output becomes predictable and focused.
High temperature (e.g., 1.5): Softens the distribution. Lower-probability tokens have a higher chance of being picked. Output becomes more diverse and creative. (A temperature of exactly 1.0 leaves the model's raw distribution unchanged.)
In our example, with high temperature, the model might sometimes suggest burpees or even meditation, adding freshness to the answer.
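Below is a minimal sketch of how temperature reshapes a distribution, assuming the illustrative workout probabilities above. In practice, temperature divides the model's raw logits before the softmax; that is equivalent to raising each probability to the power 1/T and re-normalizing, which is what this sketch does.

```python
import math

# Illustrative workout probabilities (assumed, as in the example above).
probs = {
    "Jumping jacks": 0.40, "Push-ups": 0.25, "Plank": 0.15,
    "Burpees": 0.10, "Meditation": 0.05, "Deadlift": 0.01,
}

def apply_temperature(probs, temperature):
    # Divide log-probabilities by T and re-normalize: lower T sharpens
    # the distribution, higher T flattens it.
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

for t in (0.2, 1.0, 2.0):
    print(t, {tok: round(p, 3) for tok, p in apply_temperature(probs, t).items()})
# At T=0.2, Jumping jacks climbs to ~91%; at T=2.0 the distribution
# flattens and Meditation's share roughly doubles to ~10%.
```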
2. Top-k Sampling: Limiting to k Choices
Top-k sampling only considers the top k tokens with the highest probabilities.
If k=3, the model will randomly pick among:
Jumping jacks (40%), Push-ups (25%), and Plank (15%).
This keeps the responses within a safe, high-quality range and avoids unusual or irrelevant outputs like deadlifts in a 10-minute morning workout.
However, because k is fixed, the cutoff doesn't adapt to context: when the distribution is flat, the top k tokens may exclude reasonable options, and when it is sharply peaked, they may include poor ones.
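Here is a minimal sketch of top-k sampling under the same assumed probabilities:

```python
import random

# Illustrative workout probabilities (assumed, as above).
probs = {
    "Jumping jacks": 0.40, "Push-ups": 0.25, "Plank": 0.15,
    "Burpees": 0.10, "Meditation": 0.05, "Deadlift": 0.01,
}

def top_k_sample(probs, k):
    # Keep only the k most likely tokens; the rest are discarded.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    # random.choices treats the weights as relative, which implicitly
    # re-normalizes the surviving probabilities.
    return random.choices(tokens, weights=weights, k=1)[0]

print(top_k_sample(probs, k=3))  # always Jumping jacks, Push-ups, or Plank
```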
3. Top-p (Nucleus) Sampling: Adaptive and Context-Aware
Instead of choosing a fixed number of top tokens, top-p sampling selects the smallest set of tokens whose cumulative probability reaches a certain threshold p (e.g., 0.9).
From our example, if:
Jumping jacks: 40%
Push-ups: 25%
Plank: 15%
Burpees: 10%
Meditation: 5%
Then with p=0.9, the model will consider Jumping jacks + Push-ups + Plank + Burpees (40+25+15+10 = 90%) and ignore the rest.
This leads to more natural, in-flow, human-like outputs.
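And a minimal sketch of top-p sampling with the same assumed probabilities:

```python
import random

# Illustrative workout probabilities (assumed, as above).
probs = {
    "Jumping jacks": 0.40, "Push-ups": 0.25, "Plank": 0.15,
    "Burpees": 0.10, "Meditation": 0.05, "Deadlift": 0.01,
}

def top_p_sample(probs, p):
    # Walk down the tokens in probability order and keep the smallest
    # prefix whose cumulative probability reaches p; drop the tail.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

print(top_p_sample(probs, p=0.9))
# With p=0.9 the nucleus is Jumping jacks + Push-ups + Plank + Burpees
# (0.40 + 0.25 + 0.15 + 0.10 = 0.90); Meditation and Deadlift are dropped.
```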
Each strategy has its role. Greedy decoding is good for deterministic tasks like classification. But for natural, conversational, or creative outputs — top-k, top-p, and temperature allow the model to feel more human, varied, and engaging.
Knowing when to use which sampling strategy is a critical part of designing AI products.