Claude 3.5 Sonnet has impressive capabilities: it can code, analyze complex architectures, and hold lengthy conversations. However, when asked to recommend products in a mid-sized e-commerce store, it struggles, because the model wasn't designed for the challenges of genuine personalization. Claude is not a weak model. Its training data is dominated by giants like Amazon and Alibaba, whose patterns don't reflect the realities of your store.
Curiously, the problem isn't technical in the usual sense. It's deeper, almost structural. LLMs are trained on datasets in which large e-commerce platforms dominate, so the patterns they learn are the ones those platforms produce, not the ones found in smaller catalogs. Applying this knowledge to a store with 5,000 products and tight margins almost inevitably leads to failure. Why aren't we talking more about this?
The Invisible Bias of Planetary-Scale Training
Claude 3.5, like GPT-4 and Gemini, faces a similar dilemma. Their training data is heavily biased towards purchasing behaviors on massive platforms. These platforms have extensive catalogs, advanced search engines, and massive machine learning budgets.
When you expect Claude to personalize the experience in your store, what does it do? It applies patterns learned in very different settings. While a quick purchase might be common on Amazon, buying a $1,200 table from your store may take weeks. In my experience, these models simply aren't prepared for such long decision cycles with smaller catalogs.
In October 2025, Anthropic reported that Claude 3.5 reached 92% accuracy in information retrieval. However, these figures are based on public benchmarks, not the ambiguous realities of e-commerce where interactions require more complex inferences.
Real Example: A European sustainable fashion startup used Claude for its chatbot. The problem arose when the model recommended out-of-stock products, prioritizing semantic matches over availability. Is it a model error? No, it's the logical result of its training on generic datasets, and the model has no view of live inventory unless the system supplies it.
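One way to avoid this failure mode is to treat availability as a hard filter applied before anything reaches the user, rather than something the model is expected to infer. A minimal sketch, assuming each candidate is a dict with hypothetical `sku`, `score`, and `stock` fields (illustrative names, not the startup's actual schema):

```python
from typing import Dict, List


def filter_in_stock(candidates: List[Dict], min_stock: int = 1) -> List[Dict]:
    """Drop candidates whose stock is below min_stock, then order by score.

    The field names (sku, score, stock) are assumptions for illustration.
    """
    available = [c for c in candidates if c.get("stock", 0) >= min_stock]
    # The semantic score only decides the order among items that can be bought.
    return sorted(available, key=lambda c: c["score"], reverse=True)


candidates = [
    {"sku": "linen-shirt", "score": 0.93, "stock": 0},
    {"sku": "hemp-tee", "score": 0.88, "stock": 12},
    {"sku": "wool-scarf", "score": 0.71, "stock": 3},
]
recommended = filter_in_stock(candidates)
```

Here the best semantic match (`linen-shirt`) is dropped because it cannot be purchased, and the remaining items keep their score order.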
The Hell of Scattered Context and Selective Memory
Claude 3.5 can handle context windows of 200K tokens. In theory, this is enough to include a complete user history. However, there are practical problems.
First, context is expensive. At $3 per million input tokens, maintaining continuous context for each user becomes unsustainable. Imagine the cost of 15,000 daily calls with large windows; it quickly becomes prohibitive even before the model generates a single response.
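To make the arithmetic concrete, here is a rough estimate of the input-side cost alone. The call volume and per-million price come from the paragraph above; the tokens-per-call values are assumptions chosen for illustration, and output tokens are not counted.

```python
INPUT_PRICE_PER_MILLION_USD = 3.00  # input-token price quoted in the text


def daily_input_cost(calls_per_day: int, tokens_per_call: int,
                     price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Cost of input tokens only; output tokens are billed separately."""
    return calls_per_day * tokens_per_call * price_per_million / 1_000_000


# tokens_per_call values below are assumptions, not measurements.
moderate = daily_input_cost(15_000, 50_000)       # 50K tokens of history per call
full_window = daily_input_cost(15_000, 200_000)   # the entire 200K window per call
```

Under these assumptions, `moderate` comes to $2,250 per day and `full_window` to $9,000 per day, before any response is generated.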
Second, the model doesn't always prioritize properly in large windows. Anthropic documented a curious phenomenon in March 2026: when critical information is buried in the middle of a long prompt, there's a 40% higher chance it gets ignored. How do we handle this in e-commerce, where relevant context is often buried?
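One mitigation that is often suggested is to keep critical facts out of the middle of the prompt: state them at the top and repeat them right before the question. The sketch below only shows the assembly logic; the section labels and the `build_prompt` helper are hypothetical, and whether this helps in a given store should be measured rather than assumed.

```python
from typing import List


def build_prompt(critical_facts: List[str], background: List[str], question: str) -> str:
    """Place critical facts at the top and restate them just before the question."""
    critical_block = "\n".join(f"- {fact}" for fact in critical_facts)
    background_block = "\n".join(background)
    return (
        f"Key facts:\n{critical_block}\n\n"
        f"Background:\n{background_block}\n\n"
        f"Reminder of key facts:\n{critical_block}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    critical_facts=["Item X is out of stock"],
    background=["Order history line 1", "Order history line 2"],
    question="What should we recommend?",
)
```

The long, low-priority background sits in the middle, where being overlooked costs the least.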
What surprises me most is that agents need explicit memory architectures. While RAG (Retrieval-Augmented Generation) helps, it also introduces latency and complications. Implementing it means keeping vector stores synchronized and embedding pipelines updated.
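Part of that synchronization burden is knowing which products need re-embedding after a catalog change. A minimal sketch, assuming embeddings are built from hypothetical `title` and `description` fields and that a hash of that text is stored alongside each vector:

```python
import hashlib
from typing import Dict, List


def content_hash(product: Dict) -> str:
    """Hash only the fields that feed the embedding (assumed: title, description)."""
    text = f"{product['title']}\n{product['description']}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(products: List[Dict], indexed_hashes: Dict[str, str]) -> List[str]:
    """Return ids of products that are new or whose embedded text has changed."""
    return [p["id"] for p in products if indexed_hashes.get(p["id"]) != content_hash(p)]


catalog = [
    {"id": "p1", "title": "Oak table", "description": "Seats six"},
    {"id": "p2", "title": "Oak table XL", "description": "Seats eight"},
    {"id": "p3", "title": "Linen chair", "description": "Set of two"},
]
# Hashes recorded when each product was last embedded; p2 has changed since, p3 is new.
indexed = {
    "p1": content_hash(catalog[0]),
    "p2": content_hash({"title": "Oak table XL", "description": "Seats ten"}),
}
stale = find_stale(catalog, indexed)
```

Only `p2` and `p3` would be sent back through the embedding pipeline, which keeps re-indexing proportional to what actually changed.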
The Fallacy of "Automatic Personalization"
It's said that agents automatically learn user preferences. In e-commerce, however, the signals are honestly chaotic:
- A user viewing 50 products might be researching for someone else.
- 34% of searches are exploratory, with no intent to buy.
- Abandoned carts can have many reasons.
Without the variables that truly matter, Claude can't understand the user's true intent. Amazon achieves this with years of data, but your store doesn't have that luxury. Do you really think you can compete with just 18 months of data?
Why Fine-Tuning Isn't the Magic Solution
The apparent solution seems to be: "fine-tune Claude to your data." In theory, this sounds good. However, a DeepMind paper in February 2026 showed that for significant improvement, you need at least 100,000 examples of successful interactions. How many mid-sized stores have this?
Moreover, fine-tuning freezes the model's knowledge at training time. In a world where your catalog constantly changes, retraining is a logistical and financial nightmare.
The Hidden Cost of Fine-Tuning
Fine-tuning isn't cheap. Between processing costs and GPU time, a modest store might pay up to $1,620 annually just to keep the model updated. Is this investment worth it for a startup with limited revenue?
The Architecture You Really Need (and No One Is Building)
If LLMs fail, what's the alternative? It's not about discarding them but using them wisely.
Hybrid Architecture: combine traditional recommendations with LLMs for contextualization.
- Classic recommendation engine for the base. This understands specific and local patterns.
- LLM as a reasoning layer for complex queries or edge cases, such as translating an ambiguous intent into something concrete.
- Vector store with product embeddings, balancing structured and semantic attributes.
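Reserving the LLM for complex queries implies some kind of router in front of it. The sketch below uses two deliberately crude heuristics, both of which are assumptions for illustration (the SKU format and the word-count threshold are hypothetical and would need tuning against real traffic):

```python
import re

# Hypothetical heuristics; tune or replace them for a real catalog.
SKU_PATTERN = re.compile(r"^[A-Z]{2,4}-\d{3,6}$")
MAX_SIMPLE_WORDS = 3


def needs_llm(query: str) -> bool:
    """Return True when a query looks ambiguous enough to justify an LLM call."""
    q = query.strip()
    if SKU_PATTERN.match(q.upper()):
        return False  # exact product code: a plain lookup is enough
    if len(q.split()) <= MAX_SIMPLE_WORDS:
        return False  # short keyword query: classic search and ranking
    return True  # longer natural-language request: let the LLM interpret it


routes = {q: needs_llm(q) for q in [
    "TBL-1200",
    "oak dining table",
    "something sturdy for a small flat where six people eat on weekends",
]}
```

Only the last query would be routed to the model; the first two stay on the cheaper classic path.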
Viable Stack Example:
    # Simplified pseudocode: embed(), apply_business_rules(), and
    # collaborative_system are placeholders for your own components.
    from anthropic import Anthropic
    from pinecone import Pinecone

    client = Anthropic(api_key="your-key")
    pc = Pinecone(api_key="your-pinecone-key")
    index = pc.Index("product-embeddings")

    def hybrid_recommend(user_query, user_history, business_rules):
        # Step 1: Claude interprets intent
        prompt = f"""Analyze this query: '{user_query}'
    Relevant history: {user_history}
    Extract:
    - Relevant categories
    - Implicit price range
    - Usage occasion
    - Purchase urgency"""
        interpretation = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        # The reply text carries the extracted intent
        intent = interpretation.content[0].text

        # Step 2: Vector store searches for candidates, using the
        # interpreted intent alongside the raw query
        candidates = index.query(
            vector=embed(f"{user_query}\n{intent}"),
            top_k=50,
            filter=apply_business_rules(business_rules),
        )

        # Step 3: Classic recommendation system ranks
        ranked = collaborative_system.predict(
            user=user_history,
            items=candidates.matches,
            business_weights={'margin': 0.3, 'stock': 0.4, 'popularity': 0.3},
        )
        return ranked[:10]
This architecture offers:
- Control over essential business rules
- Controlled latency (Claude only processes interpretation)
- Manageable costs by limiting calls to specific cases
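The control over business rules comes largely from the ranking step, where weights such as margin 0.3, stock 0.4, and popularity 0.3 blend commercial signals. One simple way to apply them is a weighted sum. This is a sketch under the assumption that each signal is already normalized to the range 0 to 1; the item fields are hypothetical, and a real ranker would combine this with the collaborative score.

```python
from typing import Dict, List

BUSINESS_WEIGHTS = {"margin": 0.3, "stock": 0.4, "popularity": 0.3}


def business_score(item: Dict, weights: Dict[str, float] = BUSINESS_WEIGHTS) -> float:
    """Weighted sum of signals that are assumed to be pre-normalized to [0, 1]."""
    return sum(weight * item[name] for name, weight in weights.items())


def rerank(items: List[Dict], top_n: int = 10) -> List[Dict]:
    """Order items by business score, highest first, and keep the top_n."""
    return sorted(items, key=business_score, reverse=True)[:top_n]


items = [
    {"sku": "a", "margin": 0.9, "stock": 0.1, "popularity": 0.5},
    {"sku": "b", "margin": 0.4, "stock": 0.9, "popularity": 0.6},
    {"sku": "c", "margin": 0.2, "stock": 0.5, "popularity": 0.9},
]
ranked = rerank(items)
```

With these weights, the high-margin but nearly sold-out item `a` ranks last, behind `b` and `c`, which is the kind of trade-off a store owner can inspect and adjust directly.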
The Real Problem is Selling Simple Solutions to Complex Problems
In 2026, the narrative is that "AI agents personalize everything." But this only holds true in environments with millions of interactions and stable patterns. What about mid-sized e-commerce? Nothing is that simple. Automatic personalization is just an illusion.
Agents fail because we don't use them properly. They're not the complete solution, but pieces in a larger puzzle.
The sustainable fashion startup I mentioned found its solution. They didn't replace Claude but redesigned their system. Now, Claude interprets ambiguous queries and generates personalized descriptions, while the actual recommendations come from a classic system. Conversion improved by 23% in three months.
Is your e-commerce making the most of AI agents? Or are you just patching an architecture problem?