
Karpathy's Auto-Research + Claude Code

Andrej Karpathy just open-sourced a framework that turns an AI into a self-improving research machine. Here's how to apply it to your business — and have it running 24/7 on autopilot.

Published March 13, 2026 · 10 min read

#AutoResearch #ClaudeCode #AIAutomation

An AI agent that runs experiments on your business around the clock, keeps what works, and greets you in the morning with better numbers? That's not science fiction anymore. Andrej Karpathy — one of the most respected names in AI research — just open-sourced a project called autoresearch that does exactly that. And when you pair it with Claude Code, you can apply the same self-improving loop to almost any business metric you care about.

No, this isn't hype. This is a real GitHub repo with real practical applications. Let's break down what it actually does, how the pattern works, and how to set it up for your business.

What Is Karpathy's Auto-Research Pattern?

Karpathy built autoresearch while training his own language model. His idea: instead of manually tweaking hyperparameters and waiting hours for results, why not let the AI do the experimentation itself?

Here's the core concept, straight from the repo:

The Idea

"Give an AI agent a small but real LLM training setup and just let it experiment autonomously overnight. It'll modify the code, train for five minutes, check if the results improved, keep or discard, and repeat. You wake up in the morning to a log of experiments and hopefully a better model."

The agent runs a tight loop: form a hypothesis → run the experiment → measure a metric → keep what works → iterate. In Karpathy's case, the metric was validation loss (how well the model predicts text). But the pattern is universal.

📝 Hypothesis
🧪 Experiment
📊 Measure
✅ Keep / Discard
🔁 Repeat

The auto-research loop — runs continuously with zero human involvement
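The loop above can be sketched in a few lines of Python. The function names here are illustrative; in Karpathy's repo the "experiment" is a short training run and the metric is validation loss, but nothing in the structure depends on that:

```python
def auto_research_loop(baseline, propose, run_experiment, n_rounds=5):
    """Karpathy-style loop: hypothesis -> experiment -> measure -> keep/discard -> repeat."""
    best, best_score = baseline, run_experiment(baseline)
    log = []
    for _ in range(n_rounds):
        challenger = propose(best)           # form a hypothesis (a new variant)
        score = run_experiment(challenger)   # run it and measure the metric
        kept = score > best_score            # keep only if the metric improved
        log.append((challenger, score, kept))
        if kept:
            best, best_score = challenger, score
    return best, best_score, log
```

A toy run makes the keep/discard behavior visible: start at 0, propose "+1" each round, and score variants by closeness to 3. The loop climbs to 3, then discards the overshoots.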

Why This Changes Everything for Business

Here's the thing — you don't need to be training language models to use this pattern. The loop applies to anything with three ingredients:

1. An Objective Metric

A number you can track automatically via API. Reply rate, conversion rate, CTR — something measurable without subjectivity.

2. An Input to Change

Email copy, landing page headlines, ad creatives, pricing text — something the AI can modify between experiments.

3. API Access

A way for the agent to deploy its changes AND retrieve the results automatically. No manual copy-paste required.

Once you have those three things, you can build an optimization loop that runs 24 hours a day — without you being in the loop at all.
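Those three ingredients amount to a small interface contract. Here is a minimal sketch of it in Python; the class and method names are illustrative, and the toy client is just an in-memory stand-in for a real API-backed one (e.g. a client wrapping Instantly):

```python
from typing import Protocol

class OptimizationTarget(Protocol):
    """The three ingredients the loop needs, expressed as one interface."""

    def read_metric(self) -> float:
        """Ingredient 1: an objective number fetched via API (reply rate, CTR...)."""
        ...

    def deploy(self, variant: str) -> None:
        """Ingredients 2 + 3: push a changed input (copy, headline...) via API."""
        ...

class ColdEmailTarget:
    """Toy stand-in for a real platform client; numbers are hard-coded for illustration."""
    def __init__(self):
        self.live_copy = "baseline email copy"
        self._replies, self._sends = 12, 500

    def read_metric(self) -> float:
        return self._replies / self._sends   # reply rate = replies / sends

    def deploy(self, variant: str) -> None:
        self.live_copy = variant
```

Anything that satisfies this contract — an email tool, a CMS, an ads account — can be dropped into the same loop unchanged.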

The Real Advantage

A human optimizer might run 2–3 experiments per day. An AI agent running on a 4-hour loop runs 6 experiments per day. On a 1-hour loop? 24 experiments. Tighten the feedback loop, multiply the learning rate.
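The arithmetic behind that claim is simple enough to write down:

```python
def experiments_per_day(loop_hours: float) -> int:
    """How many full keep/discard cycles fit in 24 hours at a given loop interval."""
    return int(24 // loop_hours)

# A human runs maybe 2-3 manual tests a day; the agent's rate is set by the cron interval.
rates = {hours: experiments_per_day(hours) for hours in (4, 1)}
```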

Real-World Use Cases Beyond Machine Learning

| Use Case | Metric to Optimize | Input to Change | API Source |
| --- | --- | --- | --- |
| Cold Email | Reply rate | Email copy / subject line | Instantly API |
| Landing Pages | Conversion rate | Headlines, CTAs, layout | Webflow / CMS API |
| Ad Creatives | CTR / CVR | Copy, headlines, CTA | Meta / Google Ads API |
| Chatbot Scripts | Customer satisfaction score | Response templates | CRM / Support API |
| Product Listings | Sales / Click-through | Descriptions, titles | Chrome DevTools MCP |
| YouTube Titles | Click-through rate | Title variations | YouTube Data API v3 |
| Email Newsletters | Open rate / CTR | Subject lines | Mailchimp / Beehiiv API |
| Pricing Pages | Plan upgrade rate | Copy, feature order | Webflow / Next.js API |

What You'll Need

- Claude Code installed and working locally
- A GitHub account (the loop runs on GitHub Actions)
- An Anthropic API key
- API access to the platform you want to optimize (e.g. Instantly, Webflow, Meta Ads)

Step 1: Clone the Auto-Research Repo

Open Claude Code in a new project folder and give it this command:

Clone the autoresearch repo from https://github.com/karpathy/autoresearch
into the current working directory.

Claude Code will clone the repo, read the documentation, and load all the context it needs before asking what you want to build.

Step 2: Define Your Goal, Metric & Test Method

This is the most important step. Once Claude Code has the repo context, describe exactly what you want to optimize. Here's the prompt structure that works best:

Use the context in the auto-research folder to help me build a similar
pipeline, except instead of testing for validation loss and iterating on
a machine learning model, I want to do this for [YOUR USE CASE].

The metric I want to optimize is: [YOUR METRIC]
The platform I'm using is: [YOUR PLATFORM]
The thing that changes between experiments: [YOUR VARIABLE]
API credentials will be provided in a moment.

Put this on the cloud using GitHub Actions, running every [X] hours.

Cold Email Example

Use the auto-research folder to help me build a similar pipeline, except instead of testing for validation loss, I want to optimize cold email reply rate. The platform is Instantly (I'll give you API credentials). The variable is the email copy. Run this on GitHub Actions every 4 hours.

Step 3: What Claude Code Builds for You

Claude Code will scaffold the entire system. Here's what it typically generates:

1. orchestrator.py

The top-level agent that coordinates everything. It reads the research log, generates a new hypothesis, deploys the experiment, and harvests results. This is the "brain" of the loop.

2. platform_client.py (e.g. instantly_client.py)

API integration layer. All the API calls your orchestrator needs — querying metrics, deploying variants, creating campaigns, purging old data.

3. baseline.md + resource.md

baseline.md records your first experiment. resource.md is a living document where the agent records what it has learned — "shorter subject lines work better", "leading with risk reversal improves reply rate", etc. This knowledge compounds over time.

4. GitHub Actions workflow (.github/workflows/optimize.yml)

A cron job that triggers the orchestrator on your chosen schedule. Claude Code will configure the workflow and tell you exactly which secrets to add in your GitHub repo settings.
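To make the moving parts concrete, here is a stripped-down sketch of what one orchestrator cycle might look like. The file names follow the list above, but the client object and the propose function are stand-ins I've invented for the generated API layer and the Claude call — what Claude Code actually scaffolds will differ:

```python
from pathlib import Path

RESOURCE = Path("resource.md")   # accumulated learnings
BASELINE = Path("baseline.md")   # current champion copy

def load_notes() -> str:
    return RESOURCE.read_text() if RESOURCE.exists() else ""

def record_learning(note: str) -> None:
    with RESOURCE.open("a") as f:
        f.write(f"- {note}\n")

def run_once(client, propose) -> bool:
    """One cycle: hypothesis -> deploy -> harvest -> keep or discard."""
    champion = BASELINE.read_text()
    challenger = propose(champion, load_notes())   # e.g. a Claude API call
    client.deploy(challenger)
    new_rate = client.read_metric(challenger)
    old_rate = client.read_metric(champion)
    if new_rate > old_rate:
        BASELINE.write_text(challenger)            # challenger becomes the new baseline
        record_learning(f"win: {new_rate:.1%} vs {old_rate:.1%}")
        return True
    record_learning(f"loss: {new_rate:.1%} vs {old_rate:.1%}")
    return False
```

The GitHub Actions cron job simply calls `run_once` on schedule; everything the agent knows lives in the two Markdown files between runs.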

Step 4: Deploy to GitHub Actions (Run 24/7)

Once Claude Code has generated all the files, push to GitHub and configure your secrets:

git init
git add .
git commit -m "feat: auto-research optimizer"
git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO.git
git push -u origin main

Then in your GitHub repo, go to Settings → Secrets and variables → Actions and add:

- ANTHROPIC_API_KEY: your Anthropic API key, used by the orchestrator to generate hypotheses
- PLATFORM_API_KEY: the key for the platform you're optimizing (e.g. your Instantly key)

The GitHub Actions workflow will look like this:

name: Auto-Research Optimizer

on:
  schedule:
    - cron: '0 */4 * * *'  # Run every 4 hours
  workflow_dispatch:        # Allow manual runs

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python orchestrator.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PLATFORM_API_KEY: ${{ secrets.PLATFORM_API_KEY }}

Step 5: Monitor with Slack Notifications

Since the loop runs completely autonomously, you'll want visibility into what's happening. Ask Claude Code to add a Slack webhook integration:

Add a Slack webhook notification to the orchestrator. 
Send a message every time:
- A new challenger variant is created (include the copy)
- A harvest completes (include which variant won and the metric scores)

Slack webhook URL: [YOUR_SLACK_WEBHOOK_URL]

You'll get Slack pings like:

Example Slack Notification

🧪 New challenger created
Hypothesis: Baseline is too long, burying the offer. Testing a sub-75-word version with risk reversal upfront and a concrete time ask.

📊 Harvest complete
Challenger wins: 3.1% reply rate vs baseline 2.4% → New baseline updated.
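Slack's incoming-webhook API accepts a JSON body with a `text` field, so the notifier is small. A stdlib-only sketch (function names are illustrative; the webhook URL comes from your Slack app settings):

```python
import json
import urllib.request

def slack_payload(text: str) -> bytes:
    """Build the JSON body Slack incoming webhooks expect."""
    return json.dumps({"text": text}).encode()

def notify_slack(webhook_url: str, text: str) -> int:
    """POST a message to a Slack incoming webhook; returns the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=slack_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# e.g. notify_slack(os.environ["SLACK_WEBHOOK_URL"],
#                   "📊 Harvest complete: challenger 3.1% vs baseline 2.4%")
```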

What Good Metrics Look Like

Not all metrics are created equal. The auto-research pattern works best when your metric is:

✅ Good Metrics

Reply rate · Conversion rate · Click-through rate · Customer satisfaction score · Sales volume · Open rate · Validation loss

❌ Bad Metrics (Too Fuzzy)

Brand warmth · Content quality · Visual appeal · "How good it feels" — anything subjective that can't be measured automatically
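One caveat even good metrics carry: the keep/discard rule in this article compares raw rates, and at cold-email volumes a raw comparison can be noise. The autoresearch repo doesn't prescribe a statistical test, but one simple guard you could ask Claude Code to add is a one-sided two-proportion z-test before crowning a challenger:

```python
import math

def reply_rate_z(wins_a: int, n_a: int, wins_b: int, n_b: int):
    """One-sided two-proportion z-test: is A's rate really higher than B's?"""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # P(observing this gap if rates are equal)
    return z, p_value
```

At 1,000 sends per arm, 3.1% vs 2.4% is suggestive but not yet conclusive by this test, so "collect more sends before harvesting" is a rule worth building into the loop.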

When Auto-Research Doesn't Work

Be honest about these limitations before you start:

- Slow feedback loops. If your metric takes weeks to move (e.g. enterprise sales cycles), the loop can't iterate fast enough to learn anything.
- Low volume. With only a handful of conversions per variant, most "wins" are noise; you need enough traffic for the metric to be meaningful.
- Fuzzy metrics. Anything that can't be measured automatically via API — brand feel, visual appeal — can't drive the loop.
- No human review. An unsupervised agent will ship whatever scores well, so put constraints around anything customer-facing.

The Bigger Picture

This is what every major AI lab in the world is already doing — running thousands of experiments overnight to make their models better. Karpathy just open-sourced the pattern and made it accessible to everyone.

The compounding effect is the point. In early experiments, most challengers lose to the baseline. But over hundreds of runs, the agent builds a resource.md of accumulated knowledge — "shorter subject lines with personal openers outperform generic intros by 40%" — and starts generating genuinely better variants.

Run this for a year, and you end up with an optimization system that's orders of magnitude better than anything you could build manually.

Key Insight

Every challenger becomes the new baseline when it wins. Every failure is recorded in resource.md. The agent gets smarter with every loop — not because the model is fine-tuned, but because it has better context about what works.
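The mechanism behind "better context" is plain in-context learning: the notes from resource.md are injected into the next hypothesis request. A sketch of what that might look like (the prompt structure here is illustrative, not what the repo generates):

```python
def build_prompt(task: str, notes: list[str]) -> str:
    """Feed resource.md learnings back into the next hypothesis request."""
    learned = "\n".join(f"- {note}" for note in notes)
    return (
        f"{task}\n\n"
        f"What past experiments showed:\n{learned}\n\n"
        "Propose ONE new variant that exploits these learnings."
    )
```

No weights change between runs; the model simply sees a longer, sharper record of what has already been tried.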

Frequently Asked Questions

What is Karpathy's auto-research pattern?

Auto-research is an open-source framework by Andrej Karpathy that lets an AI agent autonomously run experiments in a tight loop. The agent modifies code or content, measures a metric, keeps what works, discards what doesn't, and repeats — all without human intervention.

Can I use auto-research for business purposes (not just ML)?

Yes. The pattern applies to anything with an objective metric and API access. Common business use cases include cold email reply rates, landing page conversion rates, ad creative CTR, chatbot satisfaction scores, and product description click-through rates.

How do I use auto-research with Claude Code?

Clone the autoresearch repo from GitHub, open it in your Claude Code environment, and describe your goal, metric, and the API you'll use to measure results. Claude Code will scaffold the orchestrator, API integrations, and GitHub Actions workflow to run the loop automatically.

What metrics work best with auto-research?

The best metrics are objective, measurable via API, and return results quickly. Examples: email reply rate (via Instantly API), landing page conversion rate, ad CTR. Avoid fuzzy metrics like "warmth" or "brand feel" — they can't be tracked automatically.

How often should the auto-research loop run?

It depends on your feedback loop. For cold email, every 4 hours works well. For ML training like Karpathy's setup, every 5 minutes is feasible. The tighter your loop, the faster you converge on the optimal solution.

⚠️ Important Disclaimer

This article contains experimental code and autonomous AI patterns. Before running anything:

- Review every script the agent will execute before deploying it.
- Use API keys with the narrowest scopes and spend limits you can set.
- Trigger the workflow manually (workflow_dispatch) and verify the results before enabling the cron schedule.

Proceed with caution. When in doubt, don't give agents access to anything you can't afford to lose.

Want to Build Stuff Like This?

Join Vibe Coding Academy — where we turn builders like you into AI-powered developers who ship real products. Step-by-step, from Claude Code basics to autonomous agent pipelines.

Join vibecodingacademy.club →