Claude 3.5 is highly skilled at writing flawless Spanish, detecting intentions, and generating coherent responses across 47 languages. However, ask it what "dejar en visto" means in Mexico versus Argentina, or why "guay" differs between Madrid and Barcelona, and you'll notice a common stumble among LLMs: they mistake linguistic proficiency for cultural understanding. Interestingly, the issue isn't their language performance but that these models operate on statistical patterns, not lived experiences.
Photo: Igor Omilaev on Unsplash
This is crucial because, by 2026, many startups integrate LLMs without realizing they're creating culturally blind products. A chatbot might speak neutral Spanish but fail to recognize that a Colombian client might be offended if addressed as "vos" instead of "usted." It's not a translation bug but an architectural issue: transformers learn token correlations, not contextual meaning rooted in history, geography, or collective identity.
Transformers Process Patterns, Not Experiences
The transformer architecture, the foundation of Claude 3.5, GPT-4, and other models, predicts the most probable next token based on an input sequence. This is effective for syntax, grammar, narrative coherence, and some logical reasoning. That said, cultural nuance isn't based on probabilities but on accumulated context that only makes sense by understanding why a human group uses certain expressions at specific moments.
When training a model with billions of tokens from the internet, it's striking how often it's fed predominantly American English texts, followed by European Spanish, French from France, and Mandarin Chinese. Latin American Spanish appears fragmented into variants: Mexican, Argentine, Chilean, Colombian. These are treated as statistical noise within a language when they are actually cultural systems with their own codes.
Claude can reproduce phrases like "qué chévere" or "qué padre," but without knowing when to use them correctly. In February 2026, a Colombian fintech rolled back its Claude 3.5 deployment after three days: users claimed the bot "sounded like a gringo trying to be Colombian." The model learned vocabulary, not situated use.
Geographic Bias in Training Isn't Accidental
Photo: Jo Lin on Unsplash
The datasets feeding these models aren't neutral samples of human language. They reflect biases of economic power, digital infrastructure, and content production. The United States, for example, generates more indexable text per capita than any Latin American country. Spain produces more digital content in Spanish than Mexico, despite having a fifth of its population.
This means that when Claude 3.5 processes queries in Spanish, it operates with a representation where Madrid Spanish weighs more than Buenos Aires Spanish. Not because it's better, but because there are more data points, and the models don't question their sources: they optimize to minimize loss in the training set.
A concrete case: in March 2026, a Peruvian educational platform using Claude to generate pedagogical content found that 40% of the cultural analogies referred to European or American experiences. "The model suggested comparing fractions to pizza slices when in rural Cusco schools, kids learn with quinoa and potatoes," the CTO explained in a LinkedIn post. It's not malice; the model learns from what it sees, and it disproportionately sees a single type of experience.
The Illusion of "Neutral" Spanish
Many startups opt for neutral Spanish, an aseptic language that avoids localisms in an attempt to be universally comprehensible. The problem is that neutral Spanish is also a cultural construct that has historically favored peninsular forms. Saying "ordenador" instead of "computadora" might sound neutral in Spain but foreign in Mexico or Chile. Frankly, it underestimates the richness of our variants.
Claude 3.5 leans towards neutral Spanish because its human evaluators likely operated under the logic of avoiding conflict by choosing the most "standard." But in contexts like customer service, education, or mental health, this neutrality is perceived as emotional distance. Does an Argentine user want to be spoken to like an airport announcer or like a fellow Argentine?
Models Don't Grasp Situated Irony or Contextual Humor
Humor is a treacherous ground for LLMs, as it depends on timing, shared social context, and a calculated breach of expectations. Claude can recognize obvious sarcasm ("Oh, great, just what I needed: another email"), but it fails with subtle irony or humor requiring knowledge of local references.
A whole dilemma arises when asking Claude to explain the Mexican joke: "How do you say 'pésame' in Chinese? Chin-gao." It can recognize the play on words but doesn't understand the cultural weight of "chingao" or why this type of humor works in certain Mexican contexts and is incomprehensible in Spain. The joke isn't in the phonetics; it's in the transgressive use of colloquial language in a formal context.
In January 2026, a Brazilian marketing agency used Claude to generate localized ad copy. The model produced technically correct texts in Brazilian Portuguese but without the "malice," the cheeky charm characterizing much local advertising. It was as if a robot had studied Portuguese in Coimbra and then tried to sell beer in Rio. The issue: the model was trained with too much European Portuguese and technical documentation in Brazilian Portuguese but little with the informal, playful, and suggestive register that works in local advertising.
Idioms as Orphan Tokens
Idioms are problematic; their meaning isn't compositional. "Estar en las nubes" doesn't just derive from "estar," "en," "las," and "nubes." They are cultural meaning units the model must have seen enough to learn their situated use.
The frequency of idioms in training datasets is irregular. "Estar en las nubes" appears a lot; it's pan-Hispanic and features in literature, subtitles, articles. But "estar con el JesĂșs en la boca" (Mexican, being terrified) or "estar hasta la pija" (Argentine vulgar for being fed up) are less frequent or filtered as inappropriate.
The result: Claude uses common idioms but not regional or vulgar ones. In many cultures, the vulgar isn't peripheral; it's the register where authentic emotional connection occurs. Isn't it fascinating how language reflects identity?
The Problem of Hierarchies and Formal Treatment
Latin America has subtle codes of formality that vary by country, region, social class, and generational context. In Colombia, "usted" can be respect or cold distance depending on the tone. In Argentina, "vos" is universal, and using "tĂș" sounds pretentious. In Mexico, "tĂș" is standard but can sound invasive with strangers in formal contexts.
Claude 3.5 doesn't navigate this; its training didn't include metadata about social relationships. The model sees texts, not situated interactions. It can detect that "usted" is used in some contexts and "tĂș" in others but can't infer when to switch registers without clear information about the speaker's relationship.
A Chilean HR startup used Claude to automate preliminary interviews in 2026. Technically it worked well, but candidates felt uncomfortable: the bot oscillated between "usted" and "tĂș" inconsistently. It wasn't a programming error; the model lacked a mental model of the user relationship.
The Architecture Lacks Cultural Memory
Transformers have context windows (in Claude 3.5, up to 200K tokens), but this memory is ephemeral: it only lasts during the conversation. There's no persistence of cultural learning between sessions. Each new conversation starts from scratch in terms of cultural adaptation.
Even if you teach Claude that your Mexican audience prefers a certain tone, that calibration doesn't carry over to the next session. There's no cumulative "cultural memory." Each inference is statistically independent, informed only by the base training and immediate context.
Some companies use RAG (Retrieval-Augmented Generation), injecting documents with cultural style guides before each query. But this is a patch: forcing the model to "remember" information that should be integrated into its weights, not retrieved ad-hoc.
Fine-Tuning Doesn't Fix the Root Problem
Fine-tuning with specific cultural datasets seems like the obvious solution. It works to a certain extent: it improves vocabulary, common phrases, stylistic preferences. But it doesn't teach social context comprehension if the base model lacks architecture to represent it.
Transformers encode information as vector embeddings in high-dimensional spaces. They can learn that "chévere" is close to "cool." But they don't encode that "chévere" is a marker of Colombian identity, that in Spain it might sound condescending, or that in Argentina it's unserious.
In April 2026, Mercado Libre reported on their experiments with fine-tuning Claude for regionalized customer service. They trained variants for Mexico, Argentina, Brazil, and Chile with millions of real transcripts. The results improved marginally, but they concluded that "the model remained an enhanced translator, not a native speaker." Is that enough for your needs?
The Regional Corpus Dilemma
Even if you wanted to do serious fine-tuning with specific cultural data, there's a logistical problem: insufficient quality digital text in many non-dominant cultures. Regional dialects, minority languages, and informal registers are underrepresented online because the communities using them don't produce indexable content at the same rate.
This creates a vicious circle: models don't understand non-dominant cultures because there are no data â companies don't invest in collecting those data because current models don't justify them â non-dominant cultures remain excluded from AI advances.
What's Next: Culturally Situated Models or Context Adapters?
Some research points to alternative architectures. Anthropic is experimenting with "culture embeddings": vector representations of cultural context that can be injected during inference to modulate model behavior. The idea is that instead of one model per culture, there's a base model capable of "activating" cultural awareness based on input metadata.
OpenAI is exploring constitutional AI with culturally specific values: instead of aligning all models to the same universal principles (often Western in practice), allowing different deployments to have different ethical and cultural constitutions. It's intriguing, though it raises complex questions about who decides a culture's values, especially in plural societies.
Another direction is hybrid RAG: models that not only retrieve documents but cultural context in real-time. If a Mexican user interacts with your system, the model automatically queries a knowledge base of Mexican cultural usages before responding. It's more computationally expensive but more flexible than fine-tuning.
The reality is we probably need all three: better data, more sophisticated architectures, and hybrid systems combining general models with specific cultural knowledge. But meanwhile, if you're building a product with Claude 3.5 or any LLM, assume the model doesn't understand your culture. It can mimic it if you provide enough explicit context, but it doesn't live it.
In Conclusion: Syntax Isn't Semantics, and Semantics Isn't Culture
The true limit of models like Claude 3.5 isn't technical in the narrow sense. The transformer architecture captures linguistic patterns with astonishing power. The limit is epistemological: these models learn from text, and text is always a partial representation of culture. Culture lives in gestures, voice tones, meaningful silences, shared historical context, and embodied experiences that never reach the internet.
You can make Claude sound more natural with careful prompts, contextual RAG, and specific fine-tuning. But you can't make it understand why a Chilean might be offended by "huevĂłn" in the wrong tone, or why a Mexican might affectionately call you "pendejo," while from a stranger it's an insult. That requires having lived those dynamics, not just read about them.
So the question isn't when LLMs will master cultural nuances, but whether the current architecture can even do it, or if we need something fundamentally different. Is your startup building for culturally diverse users? How are you validating that the model truly understands them, or is it just mimicking them?