AutoResearch Through the Greyforge Systems Lens
A scrubbed public audit note on why Greyforge does not plan to adopt karpathy/autoresearch, and why compact research loops are not the same thing as strong research architecture.

AutoResearch is a clean benchmark loop. It is not, by itself, serious research architecture.
Greyforge is already working at the harder layer: supervision, control, bounded execution, durable artifacts, and operator-governed research systems.
Public Executive Verdict
Greyforge does not plan to integrate karpathy/autoresearch and does not currently plan to borrow patterns from it. The reason is not that the repo fails at its own goal. The reason is that the goal is too narrow for the class of system Greyforge is already building.
A compact self-editing loop can be elegant and still fail to move a broader architecture forward. That is the situation here.
What the Repo Actually Solves
The core loop is clear and easy to explain: the human edits program.md, the agent edits train.py, the run gets a fixed budget, and the metric decides whether a change is kept or discarded. That is a strong demo. It is also a narrow one.
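The shape of that loop can be sketched in a few lines. This is a toy illustration, not code from karpathy/autoresearch: the names `run_with_budget`, `propose`, and `keep_or_discard_loop` are hypothetical, the "edit" is reduced to changing a single learning rate, and the "training run" is a handful of gradient steps on f(x) = x^2. Only the control flow matches the description above: propose a change, run it under a fixed budget, and let the metric decide whether it is kept.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trial:
    candidate: float  # the proposed setting (stands in for an edit to train.py)
    metric: float     # loss after the fixed-budget run; lower is better
    kept: bool        # whether the change replaced the current best


def run_with_budget(lr: float, budget_steps: int) -> float:
    """Toy 'training run': gradient descent on f(x) = x^2 for a fixed number of steps."""
    x = 1.0
    for _ in range(budget_steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return x * x


def propose(current_best: float, step: int) -> float:
    """Hypothetical stand-in for the agent's edit: alternately double and halve the setting."""
    return current_best * (2.0 if step % 2 == 0 else 0.5)


def keep_or_discard_loop(
    initial: float,
    propose_fn: Callable[[float, int], float],
    evaluate_fn: Callable[[float, int], float],
    budget_steps: int,
    max_trials: int,
) -> Tuple[float, float, List[Trial]]:
    best = initial
    best_metric = evaluate_fn(best, budget_steps)
    history: List[Trial] = []
    for step in range(max_trials):
        candidate = propose_fn(best, step)
        metric = evaluate_fn(candidate, budget_steps)  # every run gets the same budget
        kept = metric < best_metric                    # the metric alone decides
        if kept:
            best, best_metric = candidate, metric
        history.append(Trial(candidate, metric, kept))
    return best, best_metric, history


best, best_metric, history = keep_or_discard_loop(
    initial=0.1, propose_fn=propose, evaluate_fn=run_with_budget,
    budget_steps=5, max_trials=5,
)
for trial in history:
    print(f"lr={trial.candidate:.2f} loss={trial.metric:.3g} kept={trial.kept}")
```

Everything the sketch does fits inside one function. Supervision, routing, review, and promotion control have no place in it, which is the sense in which the loop is narrow.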
Greyforge is interested in the architecture above that loop: supervision, routing, bounded execution, review surfaces, durable research artifacts, memory discipline, and promotion control. Those are the parts that determine whether a research system remains useful once it stops being a demo.
The Greyforge Adoption Bar
Greyforge applies a simple three-part filter to outside patterns. A candidate should solve a live bottleneck, generalize beyond one narrow setup, and add a capability Greyforge cannot already assemble from existing primitives. AutoResearch does not currently clear that bar.
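Read as a checklist, the filter is a conjunction: a candidate is adopted only if every criterion holds. The sketch below is an illustration under that assumption, not Greyforge tooling. The class and field names are hypothetical, and the example candidate is invented for the demonstration rather than taken from the audit.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidatePattern:
    name: str
    solves_live_bottleneck: bool
    generalizes_beyond_one_setup: bool
    adds_capability_not_already_assemblable: bool


def failed_criteria(candidate: CandidatePattern) -> List[str]:
    """Return the names of the criteria the candidate does not meet."""
    checks = {
        "solves a live bottleneck": candidate.solves_live_bottleneck,
        "generalizes beyond one narrow setup": candidate.generalizes_beyond_one_setup,
        "adds a capability not already assemblable": candidate.adds_capability_not_already_assemblable,
    }
    return [label for label, passed in checks.items() if not passed]


def clears_adoption_bar(candidate: CandidatePattern) -> bool:
    """A candidate clears the bar only when no criterion fails."""
    return not failed_criteria(candidate)


# Hypothetical candidate, invented for illustration only.
example = CandidatePattern(
    name="example-pattern",
    solves_live_bottleneck=True,
    generalizes_beyond_one_setup=False,
    adds_capability_not_already_assemblable=True,
)
print(clears_adoption_bar(example), failed_criteria(example))
```

Returning the failed criteria alongside the verdict keeps a rejection inspectable instead of reducing it to a bare yes or no.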
What Greyforge Already Does Better
Greyforge is already operating at the broader systems layer. That includes research supervision, provider and capability routing, bounded execution, operator-governed control, durable artifacts, and memory surfaces designed to preserve signal instead of accumulating noise.
This is the real comparison. A self-editing loop is not the frontier if the surrounding system remains thin. Greyforge is already putting more weight on the harder question: can the full research environment stay legible, governable, and cumulative over time?
What Might Still Merit PR Attention
- The review surface could be stronger once many runs accumulate.
- Platform and capability fit could be surfaced more clearly.
- Benchmark accounting should be treated as sacred infrastructure.
- Packaging and environment friction still look higher than they should for a public reference project.
Public Source and Related Greyforge Work
The scrubbed markdown source for this note is published directly so the reasoning can be inspected without private system detail. Greyforge public work that touches adjacent problems includes capability scanning, memory filtering, and durable SQLite snapshotting.