I Built the “Self-Healing AI” Paper in Python: Here’s What Actually Happened


Self-Healing AI: A hands-on implementation of autonomous prompt optimization, LLM self-improvement without fine-tuning, and automated harness engineering using the Anthropic Claude API


Visit author profile: https://linkedin.com/in/animesh-kumar-sinha-56792119


What if an AI could read its own failures, rewrite its own instructions, and get better — without anyone touching its weights?

That’s the question behind a recent paper that stopped me mid-scroll: Self-Harness (https://arxiv.org/abs/2606.09498), a framework for autonomous LLM improvement that requires zero retraining, zero human feedback, and zero fine-tuning. I spent a weekend building it from scratch as a proof of concept in Python. This is what I learned.

The Idea That Hooked Me

Most people think of improving an AI model in one of two ways: fine-tuning (expensive, slow, needs data) or better prompting (manual, one-off, doesn’t compound).

The Self-Harness paper proposes a third way: let the model automatically improve its own prompts in a loop, with validation by tests.

No human in the loop. No retraining. Just a fixed model that gets progressively better instructions — instructions it wrote for itself. This is what the AI community is starting to call self-referential prompt engineering or automated prompt optimization — and it’s one of the cleaner ideas I’ve seen in the recent wave of agentic AI research.

What “Harness” Actually Means

Before diving in, let me clear up the terminology because it confused me, too.

The harness is not the model. It’s everything around the model — the system prompt, output format rules, a checklist, and a strategy block. In LLM agent frameworks, this is sometimes referred to as the scaffold, wrapper, or meta-prompt. In this project, it’s represented as four named text blocks:

harness = {
    "role":          "You are a careful Python programmer.",
    "strategy":      "",        # empty at start — room to grow
    "output_format": "Return ONLY a fenced ```python code block.",
    "checklist":     "",        # empty at start
}

These four blocks get concatenated into the system prompt on every model call:

def _harness_system(harness: dict) -> str:
    order = ["role", "strategy", "output_format", "checklist"]
    parts = [harness[k] for k in order if harness.get(k)]
    return "\n\n".join(parts)
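To see what the solver actually receives, here is a small usage sketch of the function above. The starting harness values are the ones from this post; the checklist entry in the second call is a made-up example of mine, not something the model produced.

```python
FENCE = "`" * 3  # three backticks, built indirectly so this example stays readable


def _harness_system(harness: dict) -> str:
    order = ["role", "strategy", "output_format", "checklist"]
    parts = [harness[k] for k in order if harness.get(k)]
    return "\n\n".join(parts)


harness = {
    "role": "You are a careful Python programmer.",
    "strategy": "",
    "output_format": f"Return ONLY a fenced {FENCE}python code block.",
    "checklist": "",
}

start_prompt = _harness_system(harness)
print(start_prompt)  # two paragraphs: role, then output_format

# Hypothetical checklist entry, only to show a filled block joining the prompt.
grown_prompt = _harness_system({**harness, "checklist": "- Handle empty input."})
print(grown_prompt)  # three paragraphs now
```

Empty blocks are skipped entirely, so the starting prompt is just the role and the output format, and the prompt only grows when an accepted edit fills a block.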

The Model Wrapper

Before the loop can run, we need a reliable way to call the API. The entire Claude API integration lives in one function:

def call_model(
    system: str,
    user: str,
    model: str = "claude-haiku-4-5-20251001",
    max_tokens: int = 1024,
    temperature: float = 0.2,
    max_retries: int = 3,
) -> str:
    client = _get_client()
    delay = 2.0
    for attempt in range(max_retries):
        try:
            msg = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system,
                messages=[{"role": "user", "content": user}],
            )
            return msg.content[0].text
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2          # exponential backoff: 2s, then 4s with the default max_retries=3
        except anthropic.APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                time.sleep(delay)
                delay *= 2
            else:
                raise

Two things are worth noting here. First, temperature=0.2 is the default: low enough that the solver role is near-deterministic, so delta measurements reflect harness changes rather than sampling noise. Second, the backoff only retries on rate limits and 5xx server errors, not on other 4xx client errors; those fail fast.

The same function is called everywhere. The only thing that changes is what text you pass as system and user, which is the whole point.


The Three-Stage Loop

The system runs for T rounds. Each round has three stages.
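Before going stage by stage, here is a rough skeleton of how the rounds fit together. This is my own simplification, not the paper's code: every stage is passed in as a callable, and the toy_* functions are hypothetical stand-ins so the skeleton runs without an API key.

```python
def run_rounds(harness, evaluate, mine, propose, gate, merge, rounds=3):
    """Skeleton of the T-round loop; every stage is passed in as a callable."""
    for round_no in range(1, rounds + 1):
        records = evaluate(harness)                      # solve the held-in tasks
        failures = [r for r in records if not r["passed"]]
        if not failures:
            print(f"round {round_no}: all held-in tasks passed, stopping early")
            break
        patterns = mine(failures)                        # stage 1: weakness mining
        edits = propose(harness, patterns)               # stage 2: harness proposal
        accepted = gate(harness, edits)                  # stage 3: validation gate
        harness = merge(harness, accepted)
    return harness


# Toy stand-ins (hypothetical): the single task passes once the checklist is non-empty.
def toy_evaluate(h):
    return [{"task_id": "task_01", "passed": bool(h["checklist"])}]


def toy_mine(failures):
    return [{"name": "forgets empty input", "task_ids": [f["task_id"] for f in failures]}]


def toy_propose(h, patterns):
    return [{"block": "checklist", "op": "append", "text": "- Handle empty input."}]


def toy_gate(h, edits):
    return edits  # accepts everything; the real gate re-runs the tests


def toy_merge(h, accepted):
    out = dict(h)
    for e in accepted:
        out[e["block"]] = (out[e["block"]] + "\n" + e["text"]).strip()
    return out


start = {"role": "You are a careful Python programmer.", "checklist": ""}
final = run_rounds(start, toy_evaluate, toy_mine, toy_propose, toy_gate, toy_merge)
print(final["checklist"])
```

In the toy run, round 1 finds one failure and adds a checklist line, and round 2 stops early because nothing fails any more.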

Stage 1 — Weakness Mining: How an AI Diagnoses Its Own Failures

The model solves a set of coding tasks. The failures get collected and fed back to the same model with a different hat on.

First, failure evidence is assembled from the task records:

evidence = "\n\n".join(
    f"[{r.task_id}]\n"
    f"error: {r.error or 'none'}\n"
    f"code snippet:\n{(r.extracted_code or r.model_output)[:300]}"
    for r in failed_records
)

This evidence is then sent to the model as a self-reflective prompt, one of the core techniques in modern LLM introspection research:

prompt = f"""You are reviewing failures from a coding agent. Below are {len(failed_records)} failed tasks.
{evidence}
Cluster these failures into at most {max_patterns} distinct, named failure patterns.
Focus on ROOT CAUSE patterns (e.g., "missing edge case for empty input", "wrong return type").
Respond with ONLY a JSON array. Each element must have:
  "name": short label (≤8 words)
  "task_ids": list of task_ids that fit this pattern
  "description": one sentence describing the root cause
  "example_error": the most illustrative error string from the tasks (or null)
Return only the JSON array, no prose."""

The model responds with structured failure analysis:

[
  {
    "name": "forgets empty input",
    "task_ids": ["task_04", "task_07"],
    "description": "Solution does not handle the case where the input list is empty",
    "example_error": "IndexError: list index out of range"
  },
  {
    "name": "off-by-one on ranges",
    "task_ids": ["task_11"],
    "description": "Loop boundary is exclusive where inclusive is required",
    "example_error": "AssertionError: expected 10, got 9"
  }
]

The mining call uses temperature=0.3, slightly warmer than the solver to allow grouping flexibility, but still controlled. The result gets parsed with a regex before json.loads as a safety net against any prose the model sneaks in before the array.

If the model returns malformed JSON, the code falls back to a deterministic clustering by error string — a reliable degradation path every production LLM pipeline should have.
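A minimal sketch of that parse-then-fall-back path, under my own assumptions: parse_patterns is a hypothetical name, and failed records are plain dicts with task_id and error keys here rather than the record objects used in the project.

```python
import json
import re
from collections import defaultdict


def parse_patterns(raw: str, failed: list[dict]) -> list[dict]:
    """Pull a JSON array out of `raw`; fall back to grouping by error string."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)   # greedy: first '[' to last ']'
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                return parsed
        except json.JSONDecodeError:
            pass  # malformed JSON: use the fallback below
    # Deterministic fallback: one pattern per distinct error string.
    groups = defaultdict(list)
    for rec in failed:
        groups[rec.get("error") or "unknown"].append(rec["task_id"])
    return [
        {"name": err[:60], "task_ids": ids, "description": err, "example_error": err}
        for err, ids in sorted(groups.items())
    ]


failed = [
    {"task_id": "task_04", "error": "IndexError: list index out of range"},
    {"task_id": "task_07", "error": "IndexError: list index out of range"},
    {"task_id": "task_11", "error": "AssertionError: expected 10, got 9"},
]

chatty = 'Here you go:\n[{"name": "forgets empty input", "task_ids": ["task_04"]}]'
from_json = parse_patterns(chatty, failed)
from_fallback = parse_patterns("Sorry, I could not cluster these.", failed)
print(from_json)
print(from_fallback)
```

The fallback is crude (identical error strings share a pattern), but it means a bad mining response costs some precision instead of the whole round.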

Stage 2 — Harness Proposal: The Model Writes Instructions for Itself

The failure patterns plus the current harness get sent to the model again, this time playing the role of a prompt engineer. It is asked for K candidate edits, each naming a harness block, an operation (replace or append), and the new text.
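A proposal prompt for this stage could look something like the sketch below. The wording is my own guess, not the prompt from the paper or the repo; build_proposal_prompt is a hypothetical helper, and the block/op/text keys simply mirror what apply_edit reads later in this post.

```python
import json


def build_proposal_prompt(harness: dict, patterns: list[dict], k: int = 3) -> str:
    """Hypothetical stage-2 prompt: ask for up to k candidate edits as a JSON array."""
    return (
        "You are a prompt engineer improving the instructions of a coding agent.\n\n"
        "Current harness blocks:\n"
        f"{json.dumps(harness, indent=2)}\n\n"
        "Observed failure patterns:\n"
        f"{json.dumps(patterns, indent=2)}\n\n"
        f"Propose up to {k} edits that address these patterns.\n"
        "Respond with ONLY a JSON array. Each element must have:\n"
        f'  "block": one of {sorted(harness)}\n'
        '  "op": "append" or "replace"\n'
        '  "text": the instruction text to add\n'
        "Return only the JSON array, no prose."
    )


# Abbreviated harness and a single pattern, just to render the prompt.
demo_harness = {
    "role": "You are a careful Python programmer.",
    "strategy": "",
    "output_format": "Return ONLY a fenced python code block.",
    "checklist": "",
}
demo_patterns = [{"name": "forgets empty input", "task_ids": ["task_04", "task_07"]}]
proposal_prompt = build_proposal_prompt(demo_harness, demo_patterns)
print(proposal_prompt)
```

Only the held-in failure patterns go into this prompt, which keeps the held-out tasks hidden from the proposal step.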

This call uses temperature=0.7, warmer still, to generate K diverse proposals rather than K minor variations of the same idea. Once parsed, each edit is validated structurally before it even reaches the acceptance gate.
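The structural check can be as small as the sketch below. The rules are assumptions on my part: "append" is my name for the non-replace operation, the length cap is arbitrary, and is_valid_edit is a hypothetical helper rather than the project's function.

```python
ALLOWED_BLOCKS = ("role", "strategy", "output_format", "checklist")
ALLOWED_OPS = ("append", "replace")   # "append" is an assumed name for the non-replace case
MAX_EDIT_CHARS = 500                  # arbitrary cap so one edit cannot balloon the prompt


def is_valid_edit(edit) -> bool:
    """Cheap shape check that runs before any model call or test run is spent."""
    if not isinstance(edit, dict):
        return False
    text = edit.get("text")
    return (
        edit.get("block") in ALLOWED_BLOCKS
        and edit.get("op") in ALLOWED_OPS
        and isinstance(text, str)
        and 0 < len(text.strip()) <= MAX_EDIT_CHARS
    )


candidates = [
    {"block": "checklist", "op": "append", "text": "- Handle empty input."},
    {"block": "weights", "op": "append", "text": "Retrain yourself."},   # unknown block
    {"block": "strategy", "op": "replace", "text": "   "},               # empty text
    "not even a dict",
]
verdicts = [is_valid_edit(c) for c in candidates]
print(verdicts)
```

Rejecting malformed edits here is cheap; every edit that survives costs a full re-evaluation on both splits.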

Applying an edit is a pure function — it never mutates the original harness:

import copy

def apply_edit(harness: dict, edit: dict) -> dict:
    h = copy.deepcopy(harness)      # work on a copy; the caller's harness is untouched
    block = edit["block"]
    text = edit["text"].strip()
    if edit["op"] == "replace":
        h[block] = text
    else:
        # any other op is treated as an append
        existing = h.get(block, "").strip()
        h[block] = f"{existing}\n{text}".strip() if existing else text
    return h

Stage 3 — The Strict Validation Gate: Preventing Overfitting in Autonomous AI

This is the elegant part. Before accepting any edit, the system applies it, re-evaluates on both splits, and computes d_in and d_out, the change in pass rate on the held-in and held-out splits relative to the current harness. The edit is accepted only if all three conditions hold:

  • d_in >= 0 — no regression on tasks used for mining
  • d_out >= 0 — no regression on unseen tasks (the generalization check)
  • max(d_in, d_out) > 0 — at least one split strictly improved
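The three conditions above translate almost directly into code. This is a sketch with hypothetical names (pass_rate, accept), and it assumes each split's results arrive as a list of pass/fail booleans.

```python
def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results) if results else 0.0


def accept(base_in, base_out, cand_in, cand_out):
    """Return (accepted, d_in, d_out) for one candidate edit."""
    d_in = pass_rate(cand_in) - pass_rate(base_in)      # change on the mining split
    d_out = pass_rate(cand_out) - pass_rate(base_out)   # change on the unseen split
    ok = d_in >= 0 and d_out >= 0 and max(d_in, d_out) > 0
    return ok, d_in, d_out


base_in = [True, False, False, True]     # 50% held-in
base_out = [True, True, False, True]     # 75% held-out

# Helps held-in, leaves held-out alone: accepted.
helps = accept(base_in, base_out, [True, True, False, True], base_out)
# Perfect on held-in but breaks a held-out task: rejected as overfitting.
overfits = accept(base_in, base_out, [True, True, True, True], [True, False, False, True])
# Changes nothing: rejected, since no split strictly improved.
no_op = accept(base_in, base_out, base_in, base_out)

print(helps, overfits, no_op)
```

The second case is the one the gate exists for: an edit that looks great on the tasks it was mined from and quietly costs a task it never saw.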

The held-out split is the key anti-gaming mechanism. The model can’t just memorize the held-in tasks because the held-out split is never exposed during mining or proposal — it only appears at the validation gate. This is the same train/test split logic used in classical machine learning model evaluation, applied here to prompt engineering.


How Pass/Fail Actually Works: Automated LLM Evaluation Without a Judge


This part surprised me in its simplicity. Each task is a Python function problem:

Write a Python function called `solution(s)` that reverses the string `s` and returns it.
Example: solution("hello") → "olleh"

The model returns a code block. The system extracts it with a regex cascade — a robust pattern common in LLM output parsing:

def _extract_code(response: str) -> str | None:
    # 1. Fenced python block (ideal)
    m = re.search(r"```python\s*(.*?)\s*```", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    # 2. Any fenced block (fallback)
    m = re.search(r"```\s*(.*?)\s*```", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    # 3. Bare function definition (last resort)
    m = re.search(r"(def solution\b.*?)(?=\ndef |\Z)", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    return None   # model returned prose → failure, error="no_code_block"

If _extract_code returns None — the model returned prose without a code block — that error string “no_code_block” becomes evidence for the mining stage, which might then propose strengthening the output_format block. The failure feeds the improvement. That’s the loop working as intended.

The extracted code runs in a sandboxed subprocess — a critical safety measure when executing model-generated code:

import subprocess
import sys
import tempfile
from pathlib import Path

def _run_in_sandbox(code, tests_path, timeout):
    with tempfile.TemporaryDirectory() as tmpdir:
        (Path(tmpdir) / "solution.py").write_text(code)
        (Path(tmpdir) / "tests.py").write_text(tests_path.read_text())
        try:
            result = subprocess.run(
                [sys.executable, "tests.py"],
                capture_output=True,
                text=True,
                timeout=timeout,      # hard kill at 10s, handles infinite loops
                cwd=tmpdir,           # throwaway working directory, not full filesystem isolation
            )
        except subprocess.TimeoutExpired:
            # subprocess.run raises on timeout; "timeout" is a placeholder error label
            return False, "timeout"
        if result.returncode == 0:
            return True, None
        stderr = (result.stderr or result.stdout or "").strip()
        return False, stderr[:500]

Each task has a tests.py along these lines:

from solution import solution
assert solution([]) == []
assert solution([1, 2, 3]) == [3, 2, 1]
assert solution("hello") == "olleh"

Exit code 0 means PASS. A non-zero exit code means FAIL, and the captured stderr becomes example_error in the mining stage. No LLM-as-judge, no embeddings, no human annotation. Pure deterministic unit testing, the same approach used in competitive programming judges and automated code evaluation benchmarks like HumanEval and MBPP.


The Merge Step: Deterministic Conflict Resolution

When multiple edits are accepted in the same round, they need to be applied without conflict. The merge is deterministic — sorted by combined delta, applied in sequence:

def merge(harness, accepted_results):
    if not accepted_results:
        return harness
    sorted_results = sorted(
        accepted_results,
        key=lambda r: r["d_in"] + r["d_out"],
        reverse=True,       # best combined improvement goes first
    )
    h = harness
    for r in sorted_results:
        h = apply_edit(h, r["edit"])
    return h

If two edits both append to the checklist, both get appended in order of their score. No edit is silently dropped. The resulting harness is snapshotted to disk after every round — giving you a full audit trail of how the prompt evolved.
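Here is a quick check of that ordering behavior, reusing apply_edit and merge as shown in this post. The two accepted results and their deltas are invented for the demonstration.

```python
import copy


def apply_edit(harness: dict, edit: dict) -> dict:
    h = copy.deepcopy(harness)
    block = edit["block"]
    text = edit["text"].strip()
    if edit["op"] == "replace":
        h[block] = text
    else:
        existing = h.get(block, "").strip()
        h[block] = f"{existing}\n{text}".strip() if existing else text
    return h


def merge(harness, accepted_results):
    if not accepted_results:
        return harness
    ordered = sorted(accepted_results, key=lambda r: r["d_in"] + r["d_out"], reverse=True)
    h = harness
    for r in ordered:
        h = apply_edit(h, r["edit"])
    return h


harness = {"role": "You are a careful Python programmer.", "checklist": ""}
accepted = [  # made-up deltas, listed in the "wrong" order on purpose
    {"edit": {"block": "checklist", "op": "append", "text": "- Check loop bounds."},
     "d_in": 0.05, "d_out": 0.0},
    {"edit": {"block": "checklist", "op": "append", "text": "- Handle empty input."},
     "d_in": 0.10, "d_out": 0.05},
]

merged = merge(harness, accepted)
print(merged["checklist"])
print(repr(harness["checklist"]))  # the original harness is left untouched
```

One caveat the sketch makes visible: with the merge as shown, two accepted replace edits on the same block would leave only the later, lower-scored text in place, so the ordering matters more for replace than for append.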


What Happened When I Ran It

First run: credit balance too low. The API key I generated didn’t have credits behind it — a reminder that the Anthropic API is billed separately from a Claude chat subscription.

Second run (after topping up): 100% pass rate on the very first baseline evaluation.

[ Baseline ]
  held-in: 100%  held-out: 100%
[ Round 1 ]
  All held-in tasks passed — stopping early.

The loop terminated immediately. There was nothing to improve because the model — Claude Haiku — solved all 20 tasks with the minimal starting harness on its first attempt.

This is the honest result. It’s not a failure of the implementation — the code runs correctly end-to-end. It’s a calibration problem: the tasks I used are too easy for a 2025-era model. The self-harness loop only has work to do when the model is failing at a meaningful rate (~50–70% baseline pass rate is the sweet spot, according to the paper).


What I’d Do Next

The interesting experiment is not “does the loop run” — it’s “what instructions does the model write for itself when it’s genuinely struggling.”

To get there, you need tasks hard enough to produce failures:

  • Algorithmic problems with tricky edge cases
  • Functions with strict type contracts
  • Problems requiring multi-step reasoning

The loop then becomes a lens into the model’s blind spots. The failure patterns it names, the checklist items it writes for itself — that’s the real output worth reading. It’s essentially automated red-teaming of your own prompt, driven by the model that failed.


The Bigger Picture: Why This Architecture Matters


What I find compelling about this paper isn’t the performance numbers. It’s the architecture.

You have one model playing three roles, each with a different temperature:

  • Solver — attempts the task, temperature=0.2 for consistency
  • Analyst — diagnoses why it failed, temperature=0.3 for flexible grouping
  • Engineer — writes better instructions to fix the failure, temperature=0.7 for diverse proposals

The only glue is text and a validation gate that runs real unit tests. No special training, no RAG pipeline, no vector database, no external memory — just careful prompt engineering, a held-out evaluation split, and a subprocess sandbox.
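One way to keep the three roles straight in code is a small lookup table. This is my own arrangement, not necessarily how the repo organizes it: Role, ROLES, and call_as are hypothetical names, the temperatures are the ones quoted above, and fake_call_model stands in for the real wrapper so the sketch runs offline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    temperature: float
    purpose: str


ROLES = {
    "solver": Role(0.2, "attempt the task"),
    "analyst": Role(0.3, "cluster failures into patterns"),
    "engineer": Role(0.7, "propose harness edits"),
}


def call_as(role_name: str, system: str, user: str, call_model) -> str:
    """Route a call through the shared model function at the role's temperature."""
    return call_model(system=system, user=user, temperature=ROLES[role_name].temperature)


# Stand-in for the real API wrapper; it only echoes the temperature it was given.
def fake_call_model(system: str, user: str, temperature: float) -> str:
    return f"T={temperature}"


replies = {name: call_as(name, "system text", "user text", fake_call_model) for name in ROLES}
print(replies)
```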

That’s a clean idea. And clean ideas are worth building, even when the first run terminates in 30 seconds because your model is too good at your tasks.


The full implementation is on GitHub. Built with Python, the Anthropic Claude API, and about a weekend of curiosity.

  1. https://github.com/AnimeshKumar-Sinha/Projects/tree/main/Animesh-llm-proj
