Memory Routing Model
Fine-tuned an 8B Llama model on memory routing for marketing conversations using prompt distillation and reinforcement learning on Tinker.
Training an 8B Memory Routing Model on Tinker
I have spent two years building marketing agent systems, context orchestration pipelines, and AI personas. I have written a lot of system prompts, built evaluation frameworks, and shipped production agents. I had not trained a model.
This project closes that gap. Using Claude Opus 4.5 in Cursor and Tinker (by Thinking Machines) as the training platform, I fine-tuned an 8B Llama on a narrow classification task: memory routing for marketing conversations. On a 50-scenario benchmark the fine-tuned 8B model scores higher than the 104B teacher used to generate its training data. Below is the full pipeline, what broke, and what I would do differently.
The motivating problem: when an agent embodies an expert marketing strategist, one of the core decisions is memory routing — what from a conversation deserves long-term storage, and at what scope.
User: “Our brand voice is professional but approachable. Think Harvard Business Review meets Slack.”
This is `company.brand_core`: store it forever. It defines how every piece of content should sound.
User: “Can you check if the Q3 deck is ready?”
This is `none`: transactional, no long-term value.
User: “I prefer bullet points over paragraphs. Get to the point quickly.”
This is `user.communication_style`: a personal preference that should persist across sessions.
But what about:
User: “Our philosophy is ‘measure twice, cut once’. We’d rather delay a launch than ship something half-baked.”
Is this `company.strategic_signatures` (a decision framework) or `company.brand_core` (a value statement)? Both are valid. The model needs to understand the distinction.
I had 13 categories across company and user scopes, plus `none` for irrelevant content. Multi-label is allowed. Persistence horizons range from 2 weeks to 1+ years. This is a challenging classification problem.
I could call an LLM for every routing decision. But:
- Latency: Memory routing happens on every conversation turn. 200ms+ API calls add up.
- Cost: At scale, per-token pricing for classification is wasteful.
- Control: I wanted deterministic behavior. General-purpose models are too variable.
- Privacy: Some client data can’t leave the infrastructure.
The solution: train a small, specialized model that runs fast and cheap while matching or exceeding large model quality.
The Approach: Prompt Distillation with Tinker
Prompt distillation uses a large model to generate training data, then trains a smaller model to replicate (and ideally exceed) that behavior.
- Data Generation: Cohere Command-R-Plus (104B) generates 2,001 labelled marketing conversations (nothing special about this model; I just had Cohere credits and needed a model without rate-limit issues)
- SFT (Supervised Fine-Tuning): Train Llama-3.1-8B with LoRA to match the teacher’s outputs
- RL (Reinforcement Learning): Optimize for exact category matching with a custom reward function
- Evaluation: Benchmark against the teacher on held-out scenarios
I got access to Tinker and used it as the training platform because:
- Async API design: Overlap forward/backward passes with optimizer steps
- Built-in loss functions: `cross_entropy` for SFT, `importance_sampling` for RL
- LoRA support: rank 32 adapters, no full fine-tuning required
- Checkpoint management: Save and resume training seamlessly
- Excellent documentation: Every API explained with examples and edge cases
The documentation was genuinely easy to follow.
Phase 1: Synthetic Data Generation
The First Attempt (And Why It Failed)
My initial prompt to Cohere was too generic:
Generate a marketing conversation that demonstrates [category].
Include realistic details.
The result: 1,000 conversations about sustainable fashion brands and eco-friendly packaging. The model latched onto “marketing” and generated variations of the same scenario.
The Fix: Structured Diversity
I built a generation pipeline with explicit diversity controls:
INDUSTRIES = ["fintech", "healthcare", "SaaS", "e-commerce", "agency",
              "CPG", "media", "education", "legal", "manufacturing"]
USER_ROLES = ["CMO", "VP Growth", "Brand Manager", "Performance Marketer",
              "Content Lead", "Demand Gen", "Product Marketing"]
TURN_COUNTS = [3, 4, 5, 6, 7, 8]  # Conversation length variation
Each generation randomly samples from these pools. The prompt explicitly instructs:
Create a realistic marketing conversation between a {role} at a {industry}
company and their AI assistant. The conversation should be {turns} turns long.
CRITICAL: This is a MID-CONVERSATION excerpt. No greetings, no "Hi, how can
I help you today?" Start in the middle of a substantive discussion.
The “mid-conversation” constraint was crucial. Early data was polluted with greeting patterns that added no signal.
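As a minimal sketch of how each generation request gets assembled (the helper name `build_generation_prompt` is mine, not from the actual pipeline):

```python
import random

def build_generation_prompt() -> str:
    # Sample one value per diversity pool so consecutive requests
    # vary in industry, role, and conversation length.
    role = random.choice(USER_ROLES)
    industry = random.choice(INDUSTRIES)
    turns = random.choice(TURN_COUNTS)
    return (
        f"Create a realistic marketing conversation between a {role} "
        f"at a {industry} company and their AI assistant. "
        f"The conversation should be {turns} turns long.\n"
        "CRITICAL: This is a MID-CONVERSATION excerpt. No greetings. "
        "Start in the middle of a substantive discussion."
    )
```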
Balancing Category Distribution
My first dataset was severely imbalanced:
| Category | Percentage |
|---|---|
| user.strategic_approach | 34.5% |
| company.brand_core | 18.2% |
| company.tools_config | 0.5% |
| none | 3.1% |
The model learned to predict `user.strategic_approach` for everything.
I regenerated with explicit category targets:
CATEGORY_TARGETS = {
    "company.brand_core": 77,
    "company.strategic_signatures": 77,
    "company.tools_config": 77,  # Previously under-represented
    "none": 154,                 # 2x weight for negative examples
    # ... etc
}
The `none` category needed extra weight. Models hate saying “nothing here”: they want to find patterns. Robust negative training requires more examples.
Temperature and Diversity
Even with structured prompts, outputs were too similar. The fix:
temperature=0.95 # High creativity
Plus removing prescriptive examples from prompts. Instead of:
Example: "Our brand is built on transparency and trust..."
I used:
The conversation MUST strongly exemplify {category}.
Definition: {category_definition}
Invent SPECIFIC and UNIQUE details relevant to a {industry} company.
Let the model be creative within constraints.
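Putting it together, the generation call looked roughly like this (a sketch: the Cohere SDK surface and parameter names are from memory and may not match the exact version used):

```python
import cohere

co = cohere.Client()  # API key read from the CO_API_KEY environment variable
response = co.chat(
    model="command-r-plus",
    message=build_generation_prompt(),  # sampling sketch from earlier
    temperature=0.95,                   # high creativity for diverse outputs
)
conversation = response.text
```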
Phase 2: Supervised Fine-Tuning with Tinker
The Training Loop
Tinker’s async API lets you overlap computation:
for step in range(num_steps):
    # Submit the forward-backward pass
    fwd_bwd_future = await training_client.forward_backward_async(
        batch_data,
        loss_fn="cross_entropy",
    )
    # Submit the optimizer step (can overlap with the pass above)
    optim_future = await training_client.optim_step_async(adam_params)

    # Wait for both to complete
    fwd_bwd_result = await fwd_bwd_future.result_async()
    optim_result = await optim_future.result_async()
This matters for throughput. You’re not waiting for one operation to complete before starting the next.
Hyperparameters
Tinker’s cookbook provides `get_lr()`, which returns the LoRA-optimized learning rate for your model:
from tinker_cookbook.hyperparam_utils import get_lr
learning_rate = get_lr("meta-llama/Llama-3.1-8B") # Returns ~2.86e-4
No manual tuning. The docs explain the scaling: LoRA adapters need higher LRs than full fine-tuning because you’re updating fewer parameters.
Key settings:
- LoRA rank: 32 (Tinker default for classification)
- Batch size: 128
- Steps: 100 (with early stopping)
- Optimizer: Adam (β1=0.9, β2=0.95, ε=1e-8), assembled in the sketch below
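These settings feed the `adam_params` passed to `optim_step_async` in the loop above. A minimal construction sketch, assuming Tinker's `AdamParams` type matches my reading of the docs (treat the field names as approximate):

```python
import tinker
from tinker_cookbook.hyperparam_utils import get_lr

adam_params = tinker.AdamParams(
    learning_rate=get_lr("meta-llama/Llama-3.1-8B"),  # ~2.86e-4 for LoRA
    beta1=0.9,
    beta2=0.95,
    eps=1e-8,
)
```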
The save_state vs save_weights_for_sampler Bug
This cost me hours. Tinker has two checkpoint methods:
- `save_weights_for_sampler()`: saves weights for inference only
- `save_state()`: saves full training state (weights + optimizer + step count)
I used `save_weights_for_sampler()` for my SFT checkpoint. When I started RL, the training client couldn’t load optimizer state. RL started from scratch, ignoring all SFT progress.
The fix:
# For SFT final checkpoint that RL will continue from:
sft_checkpoint = training_client.save_state(name="sft_final")
# For intermediate checkpoints you'll only use for inference:
sampling_checkpoint = training_client.save_weights_for_sampler(name="sft_step_50")
The docs explain this, but I skimmed too fast. Claude caught it when debugging why RL showed 0% accuracy for 5 iterations.
SFT Results
Loss dropped from 5.47 to 0.26 (95% reduction) over 100 steps:

[Figure: SFT training loss curve, dropping from 5.47 to 0.26 over 100 steps]
The model learned the task quickly. By step 20, test loss was already at 0.35. The remaining steps refined edge cases.
Phase 3: Reinforcement Learning
SFT gets you 80% of the way. The model outputs plausible categories. But it doesn’t optimize for exact matching - it optimizes for token-level cross-entropy.
RL fixes this by directly rewarding correct classification.
The Reward Function
R_total = 0.6 * R_F1 + 0.2 * R_temp + 0.1 * R_parity + 0.1 * R_eff
| Component | Weight | Description |
|---|---|---|
| R_F1 | 60% | F1 score vs gold labels |
| R_temp | 20% | Persistence horizon alignment (long/medium/short) |
| R_parity | 10% | Company/user scope correctness |
| R_eff | 10% | Storage efficiency (≤3 categories preferred) |
The weights encode our priorities: get the categories right (F1), but also respect temporal semantics and avoid over-storing.
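As a minimal sketch of this reward (my paraphrase of the table above; the horizon and scope helpers are simplified stand-ins for the real implementation):

```python
def route_reward(pred: set[str], gold: set[str],
                 pred_horizon: str, gold_horizon: str) -> float:
    # R_F1: set-overlap F1 between predicted and gold categories
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    r_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    # R_temp: persistence horizon alignment (long/medium/short)
    r_temp = 1.0 if pred_horizon == gold_horizon else 0.0

    # R_parity: company/user scope correctness
    def scopes(cats: set[str]) -> set[str]:
        return {c.split(".")[0] for c in cats if c != "none"}
    r_parity = 1.0 if scopes(pred) == scopes(gold) else 0.0

    # R_eff: storage efficiency, preferring at most 3 categories
    r_eff = 1.0 if len(pred) <= 3 else 0.0

    return 0.6 * r_f1 + 0.2 * r_temp + 0.1 * r_parity + 0.1 * r_eff
```

A perfect prediction, e.g. `route_reward({"company.brand_core"}, {"company.brand_core"}, "long", "long")`, scores 1.0; partial matches degrade gracefully through the F1 term.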
Importance Sampling
Tinker uses importance sampling for policy gradient:
fwd_bwd_future = await training_client.forward_backward_async(
    train_data,
    loss_fn="importance_sampling",
)
This requires careful construction of the training data. Each datum needs:
- `input_tokens`: the prompt
- `target_tokens`: the model’s generated response
- `logprobs`: log probabilities of each target token
- `advantages`: reward signal (centered within groups)
The lengths must match exactly. I hit this error repeatedly:
tinker.BadRequestError: input sequence, target_tokens, logprobs, and
advantages at index 0 must have the same length
The fix was ensuring tokenization consistency between sampling and training.
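A sketch of the invariant (plain dicts for illustration; the real Tinker datum type and its exact token-alignment rules are defined in the docs, so treat this layout as illustrative only):

```python
def build_rl_datum(prompt_tokens: list[int],
                   gen_tokens: list[int],
                   gen_logprobs: list[float],
                   advantage: float) -> dict:
    # All per-token fields must align, per the error above. Prompt
    # positions carry no learning signal, so they get zero logprob
    # and zero advantage; the per-sample advantage is broadcast
    # across the generated tokens.
    full = prompt_tokens + gen_tokens
    logprobs = [0.0] * len(prompt_tokens) + gen_logprobs
    advantages = [0.0] * len(prompt_tokens) + [advantage] * len(gen_tokens)
    assert len(full) == len(logprobs) == len(advantages)
    return {
        "input_tokens": full,
        "target_tokens": full,
        "logprobs": logprobs,
        "advantages": advantages,
    }
```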
KL Divergence Monitoring
Tinker’s RL hyperparameters docs emphasize KL monitoring for training stability:
import torch

# Two KL estimators per Tinker RL docs
delta = old_logprobs - new_logprobs
kl_v1 = delta.mean()                           # can be negative
kl_v2 = (torch.exp(delta) - 1 - delta).mean()  # always non-negative
- Target: KL < 0.005
- Warning: KL 0.005-0.01
- Critical: KL > 0.01
My first RL runs had KL around -0.01 (negative). This is mathematically impossible for a true KL divergence. The bug: I was computing `new - old` instead of `old - new`.
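With the sign fixed, a small guard in the training loop enforces the thresholds above (a sketch):

```python
kl = kl_v2.item()
if kl > 0.01:
    raise RuntimeError(f"KL {kl:.4f} exceeded the critical threshold; halt and inspect")
if kl > 0.005:
    print(f"warning: KL {kl:.4f} above the 0.005 target")
```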
Advantage Computation
Rewards must be centered within groups for stable gradients:
# Wrong: use raw rewards
advantages = rewards

# Right: center within each group (all completions sampled for the same prompt)
group_mean = rewards.mean()
advantages = rewards - group_mean
Without centering, the model receives inconsistent gradient signals. Tinker’s docs explain this in the RL hyperparameters section.
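With several sampled completions per prompt, the same idea applied per group looks like this (a sketch; shapes assumed):

```python
import torch

# One row per prompt group, one column per sampled completion.
rewards = torch.tensor([[0.8, 0.4, 0.6],
                        [1.0, 0.9, 0.2]])

# Subtract each group's mean so the gradient compares completions
# of the same prompt rather than across unrelated prompts.
advantages = rewards - rewards.mean(dim=1, keepdim=True)
```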
RL Results
30 iterations improved mean reward from 0.73 to 0.93:

[Figure: mean RL reward rising from 0.73 to 0.93 over 30 iterations]
The reward components tell the story:
| Component | Start | End |
|---|---|---|
| R_F1 | 0.64 | 0.90 |
| R_temp | 0.81 | 0.95 |
| R_parity | 0.86 | 1.00 |
| R_eff | 1.00 | 1.00 |
F1 improved most (+40%). The model learned to get exact category matches, not just plausible outputs.
Evaluation: Student vs Teacher
I created a 50-scenario benchmark with challenging marketing cases across 7 domains:
| Domain | Scenarios |
|---|---|
| Brand & Positioning | 8 |
| Strategic Decisions | 8 |
| Performance & Metrics | 8 |
| Tools & Integrations | 6 |
| User Preferences | 10 |
| Business Priorities | 6 |
| Knowledge Artifacts | 4 |
Each scenario has difficulty ratings (easy/medium/hard) and expected categories.
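For concreteness, a scenario record might look like this (field names are hypothetical; the actual schema lives in the repo):

```python
scenario = {
    "domain": "Brand & Positioning",
    "difficulty": "medium",
    "conversation": [
        "User: Our brand voice is professional but approachable. "
        "Think Harvard Business Review meets Slack.",
    ],
    "expected_categories": ["company.brand_core"],
}
```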
Results
| Model | Size | Avg F1 | Exact Match |
|---|---|---|---|
| Llama-8B + LoRA (this work) | 8B | 0.68 | 60% |
| Cohere Command-R-Plus | 104B | 0.61 | 26% |
The 8B model scores 11% higher on F1 and 2.3x on exact match. Important caveat: see the difficulty breakdown below — most of the lift comes from easy and medium cases, and the 104B teacher still wins on hard multi-label scenarios.
Performance by Difficulty
| Difficulty | Our Model | Cohere | Delta |
|---|---|---|---|
| Easy | 0.86 | 0.48 | +79% |
| Medium | 0.65 | 0.64 | +2% |
| Hard | 0.50 | 0.72 | -31% |
The 8B model wins on easy cases and matches on medium. The 104B teacher still wins on hard multi-label scenarios where three or more categories apply. The aggregate lift is real, but the comparison favors the student because most benchmark cases are not hard.
Why does the student outperform on aggregate?
Three reasons:
1. Specialization. The 8B model does one thing. The teacher is a general-purpose 104B model optimized for many tasks.
2. RL objective. Cross-entropy (SFT) rewards plausible outputs. The reward function used here explicitly rewards exact category matching, which the teacher was never optimized for.
3. Clean data. Synthetic labels have no annotator disagreement, so the model learns one consistent decision boundary instead of the average of conflicting human labels.
This is the well-known specialization vs. generalization trade-off; the result is not unique to this task, and it does not transfer to harder distributions.
Working with Claude
I had not used PyTorch for training, written a reward function, or debugged gradient issues before this. What I had was two years of building AI systems: enough to decompose problems, iterate on failures, and read documentation carefully. Claude filled the implementation gap.
Three things made this productive:
1. Framing prompts as architectural questions, not codegen requests
Instead of “write me an RL training loop,” prompts like:
“What would you check first when RL shows 0% accuracy for 5 iterations? Challenge my assumptions about checkpoint state.”
surfaced the `save_state` vs `save_weights_for_sampler` bug by reasoning about what state RL needs to resume training.
2. Reading docs in the loop
Tinker’s documentation is dense but good. Paste a section, describe the bug, and Claude could connect them:
- When async operations can overlap vs must be sequential
- Why LoRA needs different learning rates
- How importance sampling differs from vanilla policy gradient
- What KL divergence thresholds indicate training instability
3. Iteration count
Roughly 50 iterations end to end:
- Synthetic data too homogeneous → fix prompt diversity.
- Category distribution imbalanced → fix generation targets.
- KL divergence negative → fix computation order.
- Advantages not centered → fix group normalization.
- RL started from scratch → fix checkpoint method.
Each cycle: run, observe, diagnose, fix.
Lessons Learned
On Prompt Distillation
- A specialized small model can outperform a generalist teacher on a narrow distribution. This is well-known and tells you nothing about general capability.
- Data quality matters more than quantity. The first 1,000 examples were too similar. The second 1,000, with higher temperature and diverse prompts, helped more than doubling the dataset would have.
- Negative examples are important. Models bias toward finding patterns. The `none` category needed roughly 2x the examples to learn robust rejection.
- RL is powerful but finicky. SFT gets you most of the way. RL handles the last bit but adds more failure modes: reward hacking, KL explosion, gradient instability.
On Tinker
- The documentation is dense but accurate, with examples and edge cases for most APIs.
- The async API improves throughput by letting forward/backward overlap with optimizer steps.
- Checkpoint semantics matter: know the difference between `save_state()` and `save_weights_for_sampler()`.
- Monitor KL divergence. If it exceeds 0.01, something is wrong upstream.
On AI-Assisted Development
- Treat the assistant as a reviewer first, code generator second. Ask it to challenge assumptions and identify failure modes.
- Iterate fast. Most experiments fail; learning velocity matters more than first-attempt success.
- Give it the full stack trace, not just the final line.
- Log metrics every step. If something goes wrong, you need to trace exactly where.
What’s Next
The model is open-source:
- HuggingFace: MuratcanKoylan/Marketing-Memory-Routing-8B
- GitHub: muratcankoylan/memory-routing-agent
Roadmap:
- Train on larger dataset (10k+ examples)
- Test different base models (Qwen, Mistral)
- Add per-category evaluation metrics
- Deploy to production
Final Thoughts
The gap between using AI and training AI is narrowing. With Tinker, an assistant like Claude, and the willingness to iterate through failures, training a specialized model is now accessible to engineers who are not ML researchers.
This model is narrow. It still loses to the teacher on hard multi-label cases. It is 13x smaller and good enough for the routing task it was built for. If you are thinking about training your first model: pick a narrow problem, generate some data, train something, see what breaks, and iterate.
Thanks to Thinking Machines for Tinker. Thanks to Cohere for the teacher model, Meta for Llama, and Anthropic for Claude.