April 8, 2026 · 6 min read

AutoResearch Through the Greyforge Systems Lens

A public technical note on why Greyforge does not plan to adopt karpathy/autoresearch, and why compact research loops are not the same thing as strong research architecture.

Public Technical Note · Audited Repo · OpenForge

Greyforge Position

AutoResearch is a clean benchmark loop. It is not, by itself, serious research architecture.

Greyforge is already working at the harder layer: supervision, control, bounded execution, durable artifacts, and governed research systems.

Public Executive Verdict

Greyforge does not plan to integrate karpathy/autoresearch and does not currently plan to borrow patterns from it. The reason is not that the repo fails at its own goal. The reason is that the goal is too narrow for the class of system Greyforge is already building.

A compact self-editing loop can be elegant and still fail to move a broader architecture forward. That is the situation here.

What the Repo Actually Solves

The core loop is clear and easy to explain: the human edits program.md, the agent edits train.py, the run gets a fixed budget, and the metric decides whether a change is kept or discarded. That is a strong demo. It is also a narrow one.
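The keep-or-discard shape of that loop can be sketched in a few lines. This is a toy illustration, not code from karpathy/autoresearch: `Candidate`, `run_with_budget`, `propose_edit`, and `research_loop` are hypothetical names, the "agent" is a random perturbation, and the "training run" is a stand-in function that returns a loss-like number where lower is better.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one edited version of train.py."""
    learning_rate: float


def run_with_budget(candidate: Candidate, budget_steps: int) -> float:
    """Toy 'run' returning a loss-like metric (lower is better).

    A real run would train for a fixed budget. This stand-in accepts
    budget_steps only to show where the budget would apply and ignores
    it; the metric is a quadratic with its minimum at 0.3.
    """
    return (candidate.learning_rate - 0.3) ** 2


def propose_edit(current: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the agent editing train.py: perturb one setting."""
    return Candidate(learning_rate=current.learning_rate + rng.uniform(-0.1, 0.1))


def research_loop(start: Candidate, iterations: int, budget_steps: int, seed: int = 0):
    """Propose, run, then keep the change only if the metric improves."""
    rng = random.Random(seed)
    best = start
    best_metric = run_with_budget(best, budget_steps)
    history = [best_metric]  # best metric after each iteration
    for _ in range(iterations):
        candidate = propose_edit(best, rng)
        metric = run_with_budget(candidate, budget_steps)
        if metric < best_metric:
            best, best_metric = candidate, metric  # keep the change
        # otherwise the change is discarded and best stays as it was
        history.append(best_metric)
    return best, best_metric, history


if __name__ == "__main__":
    best, best_metric, history = research_loop(Candidate(0.9), 50, budget_steps=100)
    print(best, best_metric)
```

Everything the sketch decides is local to one metric and one file. It holds no notion of who reviews a kept change, where results are stored, or what happens across many runs, which is the gap the rest of this note is concerned with.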

Greyforge is interested in the architecture above that loop: supervision, routing, bounded execution, review surfaces, durable research artifacts, memory discipline, and promotion control. Those are the parts that determine whether a research system remains useful once it stops being a demo.

The Greyforge Adoption Bar

Greyforge uses a simple filter for outside patterns. A candidate should solve a live bottleneck, generalize beyond one narrow setup, and add a capability Greyforge cannot already assemble from existing primitives. AutoResearch does not currently clear that bar, for four reasons:

It solves a smaller problem than Greyforge currently cares about.

It is too tightly shaped around one narrow workflow to generalize cleanly.

Its instruction and control model is simpler than Greyforge's existing authority stack.

It does not address the broader governance layer Greyforge treats as central.
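The three-part filter described above can be written down as an explicit checklist. The sketch below is illustrative only: `PatternAssessment`, `failed_criteria`, and `clears_adoption_bar` are hypothetical names, not Greyforge tooling, and the values assigned to AutoResearch are one reading of this note rather than a published scorecard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternAssessment:
    """Hypothetical record of how an outside pattern scores on the filter."""
    name: str
    solves_live_bottleneck: bool
    generalizes_beyond_one_setup: bool
    adds_capability_not_already_available: bool


def failed_criteria(assessment: PatternAssessment) -> list[str]:
    """Return the labels of every criterion the candidate does not meet."""
    checks = {
        "solves a live bottleneck": assessment.solves_live_bottleneck,
        "generalizes beyond one narrow setup": assessment.generalizes_beyond_one_setup,
        "adds a capability not already assemblable": assessment.adds_capability_not_already_available,
    }
    return [label for label, passed in checks.items() if not passed]


def clears_adoption_bar(assessment: PatternAssessment) -> bool:
    """A candidate clears the bar only if it meets all three criteria."""
    return not failed_criteria(assessment)


# Assumed values: one interpretation of the reasons listed in this note.
autoresearch = PatternAssessment(
    name="karpathy/autoresearch",
    solves_live_bottleneck=False,
    generalizes_beyond_one_setup=False,
    adds_capability_not_already_available=False,
)

if __name__ == "__main__":
    print(clears_adoption_bar(autoresearch))
    for label in failed_criteria(autoresearch):
        print("not met:", label)
```

Requiring all three criteria at once makes the filter deliberately conservative: a pattern that is elegant but fails any one of them is declined.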

What Greyforge Already Does Better

Greyforge is already operating at the broader systems layer. That includes research supervision, provider and capability routing, bounded execution, governed control, durable artifacts, and memory surfaces designed to preserve signal instead of accumulating noise.

This is the real comparison. A self-editing loop is not the frontier if the surrounding system remains thin. Greyforge is already putting more weight on the harder question: can the full research environment stay legible, governable, and cumulative over time?

What Might Still Merit PR Attention

The review surface could be stronger once many runs accumulate.
Platform and capability fit could be surfaced more clearly.
Benchmark accounting should be treated as sacred infrastructure.
Packaging and environment friction still look higher than they should for a public reference project.

Public Source and Related Greyforge Work

The markdown version of this note is published directly so the reasoning can be inspected without exposing private system detail. Public Greyforge work on adjacent problems includes capability scanning, memory filtering, and durable SQLite snapshotting.

Public Technical Note

Direct markdown version of the public audit note.

devcap

Capability scanning for development environments and tool surfaces.

memory-quality-gate

Deterministic filtering for memory candidates before long-term storage.
