Understanding Foundation Models - Part 1
Why you need to understand foundation models before your product uses them
Let’s say you're building a product to simplify visa applications. You want to create an AI assistant that guides users in Bengali, predicts missing documents, and answers tough immigration questions.
You reach for GPT.
It gives vague answers. It doesn't understand regional nuances. And worse - it starts hallucinating policies from other countries. The model doesn't work the way you want because its foundation wasn't built on the context you're using it in. And unless you understand what these models were trained on, how they work, and what they weren't built for, they'll fail your product in subtle but costly ways.
What are Foundation Models?
Foundation models are massive, pre-trained AI systems - like GPT-4, Gemini, or Claude - that learn from web-scale data and can perform hundreds of tasks without custom training for each one. They're called foundation models because they're meant to be the base layer for everything else.
The strength and the weakness of these models is the same thing: they are only as strong as the data they were trained on. If that data is skewed, shallow, or missing key domains or languages, the model becomes like a smart student who aced English and flunked Geography because nobody taught her that part.
Common Crawl and the English-centric web
Most of these models are trained on a dataset called Common Crawl - a giant scrape of the internet. In 2023, over 45% of that dataset was in English. The next language? Russian, at around 6%.
Here’s where it gets problematic:
Punjabi speakers make up over 1% of the world’s population.
But Punjabi contributes just 0.0061% of Common Crawl.
That’s a 230x under-representation.
GPT-4 is brilliant in English. Mediocre in Hindi. And nearly blind in Marathi or Telugu.
Your AI product might promise inclusivity or automation - but under the hood, it performs well only if you speak English.
So when the model hasn't been trained on the context or language of your product, it behaves unpredictably. Say you're building:
A travel app that generates custom itineraries
A transport tool to track real-time freight routes in India
A healthcare assistant that explains symptoms in Kannada
Even if you prompt GPT perfectly, it won't work reliably if the language isn't well represented in training, if the domain (e.g., Indian railways, local climate, dengue symptoms) wasn't in the training set, or if the format (travel itineraries, prescriptions, bus routes) is unfamiliar to the model. All because the model was never taught your problem.
Foundation models don't understand; they recognize patterns. If the patterns didn't exist in the data, the model is guessing. And that guesswork is where bad UX, hallucinations, and trust issues begin.
Using domain-specific models for your product
If your product lives in a niche domain or serves a local audience, you probably don’t need a general-purpose model. You need a domain-specific model. This could mean:
Fine-tuning GPT on your data (quotes, chats, itineraries, etc.)
Training a small model from scratch on a focused dataset
Using tools like LoRA, RAG, or adapters to specialize responses (see the RAG sketch below)
Some examples already exist:
Med-PaLM (Google) for medical queries
AlphaFold (DeepMind) for protein folding
BioNeMo (NVIDIA) for drug discovery
CroissantLLM for French speakers
PhoGPT for Vietnamese
And dozens of LLMs trained specifically on Arabic, Chinese, and even Hindi
Companies are building these models because it's the only way to get performance that aligns with user expectations in high-risk, localized, or domain-specific scenarios. This also opens a real opportunity to build models for hyper-niche markets where it makes business sense. Another opportunity is creating datasets for specific domains and languages, which are in high demand because of their use in model training.
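To make the RAG option above concrete, here is a minimal sketch of the idea: retrieve the most relevant documents you already own and ground the model's answer in them. The visa-checklist snippets, the query, and the use of TF-IDF in place of a neural embedding model are all illustrative assumptions, not a production recipe.

```python
# Minimal RAG sketch: retrieve the most relevant documents you already own,
# then ground the model's answer in them. TF-IDF stands in for a proper
# embedding model; the documents and query are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain corpus: visa checklists your product already has.
documents = [
    "Schengen visa applications require travel medical insurance of at least EUR 30,000.",
    "A US B1/B2 interview requires the DS-160 confirmation page and the appointment letter.",
    "Tourist visa applications usually need bank statements covering the last six months.",
]

query = "Which documents am I missing for a Schengen application?"

# 1. Vectorize the corpus and the user query.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# 2. Retrieve the top-k most similar documents.
scores = cosine_similarity(query_vector, doc_vectors)[0]
top_k = scores.argsort()[::-1][:2]
context = "\n".join(documents[i] for i in top_k)

# 3. Build a grounded prompt for whichever LLM you use.
prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, "
    f"say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}"
)
print(prompt)  # In a real product, this prompt goes to your chosen model API.
```

Swapping the TF-IDF step for a proper embedding model and a vector database is the usual next step, but the shape of the pipeline - retrieve, then ground the prompt - stays the same.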
The issue with translation
All of this naturally tempts you to think translation is the way around the language problem - translate user inputs to English, run GPT, then translate back. That's like taking a call in a regional language, repeating it in English to an intern, and relaying their answer back in the same regional language.
It works, but languages don't map onto each other cleanly - even French and Spanish behave differently from English, let alone Bengali or Kannada:
You lose nuance, especially relationship markers, gendered terms, and cultural context.
You increase latency and cost, since token usage balloons in translation.
You can't control the quality, especially for safety-critical applications.
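To see where the latency and cost come from, here is a minimal sketch of that round-trip pipeline. The translate and call_llm functions are hypothetical placeholders for whichever translation and LLM services you use; the point is that every message now makes three hops.

```python
# Sketch of the translate -> English LLM -> translate-back pipeline.
# translate() and call_llm() are hypothetical placeholders for whichever
# translation and LLM services you use; the point is the three hops.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a machine-translation API call."""
    return text  # a real implementation would call a translation service

def call_llm(prompt: str) -> str:
    """Placeholder for a general-purpose LLM API call."""
    return "..."  # a real implementation would call your model provider

def answer_in_bengali(user_message: str) -> str:
    english_query = translate(user_message, source="bn", target="en")   # hop 1
    english_answer = call_llm(english_query)                            # hop 2
    return translate(english_answer, source="en", target="bn")          # hop 3

# Every user message now pays for three round trips: latency stacks up,
# token usage roughly doubles or triples, and honorifics, gendered terms,
# and cultural context can be lost in either direction.
print(answer_in_bengali("<user's Bengali question about visa documents>"))
```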
Instead of bandaging translation onto GPT, consider owning the model's foundation for your product's core domain.
Data quality over data quantity
More data does not necessarily mean the model will perform better. Clean, domain-specific data beats large, noisy data every time.
If you're collecting data from your product - chats, tickets, user flows, forms - you might already have what you need to start training or fine-tuning a better model for your use case.
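As a rough sketch of what that looks like in practice, here is how resolved support conversations might be converted into a fine-tuning dataset. The ticket contents and the chat-style JSONL schema are illustrative assumptions; check the exact format your fine-tuning provider expects.

```python
# Rough sketch: turning resolved support tickets into a chat-style JSONL
# fine-tuning dataset. The tickets and the record schema are illustrative
# assumptions - check the exact format your fine-tuning provider expects.
import json

tickets = [
    {
        "question": "Which documents do I need for a Schengen visa on an Indian passport?",
        "resolved_answer": "You need a valid passport, travel medical insurance of at least EUR 30,000, flight and hotel bookings, and recent bank statements.",
    },
    # ... a few thousand more examples collected from your product
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ticket in tickets:
        record = {
            "messages": [
                {"role": "system", "content": "You are a visa assistant for Indian applicants."},
                {"role": "user", "content": ticket["question"]},
                {"role": "assistant", "content": ticket["resolved_answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```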
What should you do as a Product Manager?
Ask yourself these 4 questions:
What language do my users speak? Is that language underrepresented in Common Crawl? If yes, general-purpose models may underperform.
What domain does my product live in? Is it something the average web user talks about (like movies or news)? Or is it niche (like local buses, Ayurveda, or regional taxes)?
Can I collect good-quality data from my product today? Even 5,000 good examples of itineraries, prescriptions, or complaints can be valuable.
Is translation good enough? Or do I need native understanding? If mistakes could cost money, trust, or lives, translation isn't enough.
If you answer these honestly, you'll start seeing where foundation models help and where you need to build your own.
Summary
Foundation models are powerful but trained mostly on English web data.
Most Indian languages and niche domains are underrepresented.
Translation is not a true fix. It’s a stopgap with risks.
Domain-specific models are increasingly essential for real-world products.
Clean, focused data beats more data.
If your product needs accuracy, trust, and cultural nuance - you may need to specialize your model.