
Karpathy's Auto-Research + Claude Code

Andrej Karpathy just open-sourced a framework that turns an AI into a self-improving research machine. Here's how to apply it to your business — and have it running 24/7 on autopilot.

Published March 13, 2026 · 10 min read

#AutoResearch #ClaudeCode #AIAutomation

An AI agent that runs experiments on your business around the clock, keeps what works, and greets you in the morning with better numbers? That's not science fiction anymore. Andrej Karpathy — one of the most respected names in AI research — just open-sourced a project called autoresearch that does exactly that. And when you pair it with Claude Code, you can apply the same self-improving loop to almost any business metric you care about.

No, this isn't hype. This is a real GitHub repo with real practical applications. Let's break down what it actually does, how the pattern works, and how to set it up for your business.

What Is Karpathy's Auto-Research Pattern?

Karpathy built autoresearch while training his own language model. His idea: instead of manually tweaking hyperparameters and waiting hours for results, why not let the AI do the experimentation itself?

Here's the core concept, straight from the repo:

The Idea

"Give an AI agent a small but real LLM training setup and just let it experiment autonomously overnight. It'll modify the code, train for five minutes, check if the results improved, keep or discard, and repeat. You wake up in the morning to a log of experiments and hopefully a better model."

The agent runs a tight loop: form a hypothesis → run the experiment → measure a metric → keep what works → iterate. In Karpathy's case, the metric was validation loss (how well the model predicts text). But the pattern is universal.

📝 Hypothesis
🧪 Experiment
📊 Measure
✅ Keep / Discard
🔁 Repeat

The auto-research loop — runs continuously with zero human involvement
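The loop above can be sketched in a few lines of Python. The function names here are illustrative; in Karpathy's repo the "experiment" is a short training run and the metric is validation loss, but nothing in the structure depends on that:

```python
def auto_research_loop(baseline, propose, run_experiment, n_rounds=5):
    """Karpathy-style loop: hypothesis -> experiment -> measure -> keep/discard -> repeat."""
    best, best_score = baseline, run_experiment(baseline)
    log = []
    for _ in range(n_rounds):
        challenger = propose(best)           # form a hypothesis (a new variant)
        score = run_experiment(challenger)   # run it and measure the metric
        kept = score > best_score            # keep only if the metric improved
        log.append((challenger, score, kept))
        if kept:
            best, best_score = challenger, score
    return best, best_score, log
```

A toy run makes the keep/discard behavior visible: start at 0, propose "+1" each round, and score variants by closeness to 3. The loop climbs to 3, then discards the overshoots.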

Why This Changes Everything for Business

Here's the thing — you don't need to be training language models to use this pattern. The loop applies to anything with three ingredients:

1. An Objective Metric

A number you can track automatically via API. Reply rate, conversion rate, CTR — something measurable without subjectivity.

2. An Input to Change

Email copy, landing page headlines, ad creatives, pricing text — something the AI can modify between experiments.

3. API Access

A way for the agent to deploy its changes AND retrieve the results automatically. No manual copy-paste required.

Once you have those three things, you can build an optimization loop that runs 24 hours a day — without you being in the loop at all.
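Those three ingredients amount to a small interface contract. Here is a minimal sketch of it in Python; the class and method names are illustrative, and the toy client is just an in-memory stand-in for a real API-backed one (e.g. a client wrapping Instantly):

```python
from typing import Protocol

class OptimizationTarget(Protocol):
    """The three ingredients the loop needs, expressed as one interface."""

    def read_metric(self) -> float:
        """Ingredient 1: an objective number fetched via API (reply rate, CTR...)."""
        ...

    def deploy(self, variant: str) -> None:
        """Ingredients 2 + 3: push a changed input (copy, headline...) via API."""
        ...

class ColdEmailTarget:
    """Toy stand-in for a real platform client; numbers are hard-coded for illustration."""
    def __init__(self):
        self.live_copy = "baseline email copy"
        self._replies, self._sends = 12, 500

    def read_metric(self) -> float:
        return self._replies / self._sends   # reply rate = replies / sends

    def deploy(self, variant: str) -> None:
        self.live_copy = variant
```

Anything that satisfies this contract — an email tool, a CMS, an ads account — can be dropped into the same loop unchanged.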

The Real Advantage

A human optimizer might run 2–3 experiments per day. An AI agent running on a 4-hour loop runs 6 experiments per day. On a 1-hour loop? 24 experiments. Tighten the feedback loop, multiply the learning rate.
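The arithmetic behind that claim is simple enough to write down:

```python
def experiments_per_day(loop_hours: float) -> int:
    """How many full keep/discard cycles fit in 24 hours at a given loop interval."""
    return int(24 // loop_hours)

# A human runs maybe 2-3 manual tests a day; the agent's rate is set by the cron interval.
rates = {hours: experiments_per_day(hours) for hours in (4, 1)}
```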

Real-World Use Cases Beyond Machine Learning

| Use Case | Metric to Optimize | Input to Change | API Source |
| --- | --- | --- | --- |
| Cold Email | Reply rate | Email copy / subject line | Instantly API |
| Landing Pages | Conversion rate | Headlines, CTAs, layout | Webflow / CMS API |
| Ad Creatives | CTR / CVR | Copy, headlines, CTA | Meta / Google Ads API |
| Chatbot Scripts | Customer satisfaction score | Response templates | CRM / Support API |
| Product Listings | Sales / Click-through | Descriptions, titles | Chrome DevTools MCP |
| YouTube Titles | Click-through rate | Title variations | YouTube Data API v3 |
| Email Newsletters | Open rate / CTR | Subject lines | Mailchimp / Beehiiv API |
| Pricing Pages | Plan upgrade rate | Copy, feature order | Webflow / Next.js API |

What You'll Need

- Claude Code installed and working locally
- A GitHub account (the loop runs on GitHub Actions)
- An Anthropic API key
- API access to the platform you want to optimize (e.g. Instantly, Webflow, Meta Ads)

Step 1: Clone the Auto-Research Repo

Open Claude Code in a new project folder and give it this command:

Clone the autoresearch repo from https://github.com/karpathy/autoresearch
into the current working directory.

Claude Code will clone the repo, read the documentation, and load all the context it needs before asking what you want to build.

Step 2: Define Your Goal, Metric & Test Method

This is the most important step. Once Claude Code has the repo context, describe exactly what you want to optimize. Here's the prompt structure that works best:

Use the context in the auto-research folder to help me build a similar
pipeline, except instead of testing for validation loss and iterating on
a machine learning model, I want to do this for [YOUR USE CASE].

The metric I want to optimize is: [YOUR METRIC]
The platform I'm using is: [YOUR PLATFORM]
The thing that changes between experiments: [YOUR VARIABLE]
API credentials will be provided in a moment.

Put this on the cloud using GitHub Actions, running every [X] hours.

Cold Email Example

Use the auto-research folder to help me build a similar pipeline, except instead of testing for validation loss, I want to optimize cold email reply rate. The platform is Instantly (I'll give you API credentials). The variable is the email copy. Run this on GitHub Actions every 4 hours.

Step 3: What Claude Code Builds for You

Claude Code will scaffold the entire system. Here's what it typically generates:

1. orchestrator.py

The top-level agent that coordinates everything. It reads the research log, generates a new hypothesis, deploys the experiment, and harvests results. This is the "brain" of the loop.

2. platform_client.py (e.g. instantly_client.py)

API integration layer. All the API calls your orchestrator needs — querying metrics, deploying variants, creating campaigns, purging old data.

3. baseline.md + resource.md

baseline.md records your first experiment. resource.md is a living document where the agent records what it has learned — "shorter subject lines work better", "leading with risk reversal improves reply rate", etc. This knowledge compounds over time.

4. GitHub Actions workflow (.github/workflows/optimize.yml)

A cron job that triggers the orchestrator on your chosen schedule. Claude Code will configure the workflow and tell you exactly which secrets to add in your GitHub repo settings.
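To make the moving parts concrete, here is a stripped-down sketch of what one orchestrator cycle might look like. The file names follow the list above, but the client object and the propose function are stand-ins I've invented for the generated API layer and the Claude call — what Claude Code actually scaffolds will differ:

```python
from pathlib import Path

RESOURCE = Path("resource.md")   # accumulated learnings
BASELINE = Path("baseline.md")   # current champion copy

def load_notes() -> str:
    return RESOURCE.read_text() if RESOURCE.exists() else ""

def record_learning(note: str) -> None:
    with RESOURCE.open("a") as f:
        f.write(f"- {note}\n")

def run_once(client, propose) -> bool:
    """One cycle: hypothesis -> deploy -> harvest -> keep or discard."""
    champion = BASELINE.read_text()
    challenger = propose(champion, load_notes())   # e.g. a Claude API call
    client.deploy(challenger)
    new_rate = client.read_metric(challenger)
    old_rate = client.read_metric(champion)
    if new_rate > old_rate:
        BASELINE.write_text(challenger)            # challenger becomes the new baseline
        record_learning(f"win: {new_rate:.1%} vs {old_rate:.1%}")
        return True
    record_learning(f"loss: {new_rate:.1%} vs {old_rate:.1%}")
    return False
```

The GitHub Actions cron job simply calls `run_once` on schedule; everything the agent knows lives in the two Markdown files between runs.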

Step 4: Deploy to GitHub Actions (Run 24/7)

Once Claude Code has generated all the files, push to GitHub and configure your secrets:

git init
git add .
git commit -m "feat: auto-research optimizer"
git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO.git
git push -u origin main

Then in your GitHub repo, go to Settings → Secrets and variables → Actions and add:

- ANTHROPIC_API_KEY: your Anthropic API key, used by the orchestrator to generate hypotheses
- PLATFORM_API_KEY: the key for the platform you're optimizing (e.g. your Instantly key)

The GitHub Actions workflow will look like this:

name: Auto-Research Optimizer

on:
  schedule:
    - cron: '0 */4 * * *'  # Run every 4 hours
  workflow_dispatch:        # Allow manual runs

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python orchestrator.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PLATFORM_API_KEY: ${{ secrets.PLATFORM_API_KEY }}

Step 5: Monitor with Slack Notifications

Since the loop runs completely autonomously, you'll want visibility into what's happening. Ask Claude Code to add a Slack webhook integration:

Add a Slack webhook notification to the orchestrator. 
Send a message every time:
- A new challenger variant is created (include the copy)
- A harvest completes (include which variant won and the metric scores)

Slack webhook URL: [YOUR_SLACK_WEBHOOK_URL]

You'll get Slack pings like:

Example Slack Notification

🧪 New challenger created
Hypothesis: Baseline is too long, burying the offer. Testing a sub-75-word version with risk reversal upfront and a concrete time ask.

📊 Harvest complete
Challenger wins: 3.1% reply rate vs baseline 2.4% → New baseline updated.
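Slack's incoming-webhook API accepts a JSON body with a `text` field, so the notifier is small. A stdlib-only sketch (function names are illustrative; the webhook URL comes from your Slack app settings):

```python
import json
import urllib.request

def slack_payload(text: str) -> bytes:
    """Build the JSON body Slack incoming webhooks expect."""
    return json.dumps({"text": text}).encode()

def notify_slack(webhook_url: str, text: str) -> int:
    """POST a message to a Slack incoming webhook; returns the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=slack_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# e.g. notify_slack(os.environ["SLACK_WEBHOOK_URL"],
#                   "📊 Harvest complete: challenger 3.1% vs baseline 2.4%")
```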

What Good Metrics Look Like

Not all metrics are created equal. The auto-research pattern works best when your metric is:

✅ Good Metrics

Reply rate · Conversion rate · Click-through rate · Customer satisfaction score · Sales volume · Open rate · Validation loss

❌ Bad Metrics (Too Fuzzy)

Brand warmth · Content quality · Visual appeal · "How good it feels" — anything subjective that can't be measured automatically
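One caveat even good metrics carry: the keep/discard rule in this article compares raw rates, and at cold-email volumes a raw comparison can be noise. The autoresearch repo doesn't prescribe a statistical test, but one simple guard you could ask Claude Code to add is a one-sided two-proportion z-test before crowning a challenger:

```python
import math

def reply_rate_z(wins_a: int, n_a: int, wins_b: int, n_b: int):
    """One-sided two-proportion z-test: is A's rate really higher than B's?"""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # P(observing this gap if rates are equal)
    return z, p_value
```

At 1,000 sends per arm, 3.1% vs 2.4% is suggestive but not yet conclusive by this test, so "collect more sends before harvesting" is a rule worth building into the loop.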

When Auto-Research Doesn't Work

Be honest about these limitations before you start:

- Slow feedback loops. If your metric takes weeks to move (e.g. enterprise sales cycles), the loop can't iterate fast enough to learn anything.
- Low volume. With only a handful of conversions per variant, most "wins" are noise; you need enough traffic for the metric to be meaningful.
- Fuzzy metrics. Anything that can't be measured automatically via API — brand feel, visual appeal — can't drive the loop.
- No human review. An unsupervised agent will ship whatever scores well, so put constraints around anything customer-facing.

The Bigger Picture

This is what every major AI lab in the world is already doing — running thousands of experiments overnight to make their models better. Karpathy just open-sourced the pattern and made it accessible to everyone.

The compounding effect is the point. In early experiments, most challengers lose to the baseline. But over hundreds of runs, the agent builds a resource.md of accumulated knowledge — "shorter subject lines with personal openers outperform generic intros by 40%" — and starts generating genuinely better variants.

Run this for a year, and you end up with an optimization system that's orders of magnitude better than anything you could build manually.

Key Insight

Every challenger becomes the new baseline when it wins. Every failure is recorded in resource.md. The agent gets smarter with every loop — not because the model is fine-tuned, but because it has better context about what works.
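The mechanism behind "better context" is plain in-context learning: the notes from resource.md are injected into the next hypothesis request. A sketch of what that might look like (the prompt structure here is illustrative, not what the repo generates):

```python
def build_prompt(task: str, notes: list[str]) -> str:
    """Feed resource.md learnings back into the next hypothesis request."""
    learned = "\n".join(f"- {note}" for note in notes)
    return (
        f"{task}\n\n"
        f"What past experiments showed:\n{learned}\n\n"
        "Propose ONE new variant that exploits these learnings."
    )
```

No weights change between runs; the model simply sees a longer, sharper record of what has already been tried.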

Frequently Asked Questions

What is Karpathy's auto-research pattern?

Auto-research is an open-source framework by Andrej Karpathy that lets an AI agent autonomously run experiments in a tight loop. The agent modifies code or content, measures a metric, keeps what works, discards what doesn't, and repeats — all without human intervention.

Can I use auto-research for business purposes (not just ML)?

Yes. The pattern applies to anything with an objective metric and API access. Common business use cases include cold email reply rates, landing page conversion rates, ad creative CTR, chatbot satisfaction scores, and product description click-through rates.

How do I use auto-research with Claude Code?

Clone the autoresearch repo from GitHub, open it in your Claude Code environment, and describe your goal, metric, and the API you'll use to measure results. Claude Code will scaffold the orchestrator, API integrations, and GitHub Actions workflow to run the loop automatically.

What metrics work best with auto-research?

The best metrics are objective, measurable via API, and return results quickly. Examples: email reply rate (via Instantly API), landing page conversion rate, ad CTR. Avoid fuzzy metrics like "warmth" or "brand feel" — they can't be tracked automatically.

How often should the auto-research loop run?

It depends on your feedback loop. For cold email, every 4 hours works well. For ML training like Karpathy's setup, every 5 minutes is feasible. The tighter your loop, the faster you converge on the optimal solution.

⚠️ Important Disclaimer

This article contains experimental code and autonomous AI patterns. Before running anything:

- Review every script the agent will execute before deploying it.
- Use API keys with the narrowest scopes and spend limits you can set.
- Trigger the workflow manually (workflow_dispatch) and verify the results before enabling the cron schedule.

Proceed with caution. When in doubt, don't give agents access to anything you can't afford to lose.

Want to Build Stuff Like This?

Join Vibe Coding Academy — where we turn builders like you into AI-powered developers who ship real products. Step-by-step, from Claude Code basics to autonomous agent pipelines.

Join vibecodingacademy.club →