ARC PRIZE

Slug: arc-prize

Subtask 1 — System Role & Run Flow
Objective: Define the pipeline goal and top-level control flow.
Inputs: ARC JSON challenges; RunConfig; env vars (keys, MAX_CONCURRENCY, flags).
Procedure / Prompt Spec or Algorithm: Use src/run.py::run() → run_from_json() to load challenges and choose a preset, then orchestrate solve_challenges(), which schedules solve_challenge(); within it, call get_answer_grids() to generate and score instructions and finalize guesses.
Outputs / Artifacts: Attempts JSON in attempts/arc-prize-20XX/...; per-task temp_solutions/*.json.
Dependencies / Anchors: README.md (How the system works, Running a solve); src/run.py::{run, run_from_json, solve_challenges, solve_challenge, get_answer_grids, return_answer}.
Acceptance Criteria: Running python src/run.py produces an attempts file and (if truth is present) printed accuracy; no unhandled exceptions.
Failure Modes & Mitigations: Missing env keys → set per README; rate limits → lower concurrency flags.

::: Subtask 2 — Data Models & Grid Representation
Objective: Normalize grid IO for prompts and scoring.
Inputs: Challenge, Example, Input models; integer grids; COLOR_MAP.
Procedure / Prompt Spec or Algorithm: Render grids via Challenge.grid_to_str; optionally embed base64 PNG via Challenge.grid_to_base64 / viz.base64_from_grid (see the rendering/packing sketch after Subtask 4).
Outputs / Artifacts: Text matrices; optional images.
Dependencies / Anchors: src/models.py::{Challenge, Example, Input, grid_to_str, grid_to_base64, COLOR_MAP}; src/viz.py::{base64_from_grid}.
Acceptance Criteria: Stringified grids are rectangular and integer-only; base64 generation does not raise.
Failure Modes & Mitigations: Image encoding errors → log and fall back to text-only (see main.contents_from_grid try/except).

::: Subtask 3 — Prompt Packing
Objective: Build consistent message content for LLMs.
Inputs: Training examples, optional attempts, test inputs; toggles: include_base64, use_diffs.
Procedure / Prompt Spec or Algorithm: Use contents_from_grid, contents_from_example, contents_from_challenge to produce a list of typed parts with labeled sections and (optionally) diffs.
Outputs / Artifacts: Structured message content lists.
Dependencies / Anchors: src/main.py::{contents_from_grid, contents_from_example, contents_from_challenge}; diff text via src/run.py::generate_grid_diff.
Acceptance Criteria: Message parts are ordered training examples → test inputs; when diffs are enabled, mismatches include an ASCII diff.
Failure Modes & Mitigations: Context bloat → disable images/diffs; reduce step sizes.

::: Subtask 4 — Instruction Synthesis
Objective: Derive a general rule as stepwise instructions.
Inputs: INTUITIVE_PROMPT; packed examples/tests.
Procedure / Prompt Spec or Algorithm: Call get_next_structure(InstructionsResponse, model=step.instruction_model, messages=…) from get_instructions_from_challenge.
Outputs / Artifacts: InstructionsResponse.instructions.
Dependencies / Anchors: src/main.py::{INTUITIVE_PROMPT, InstructionsResponse}; src/run.py::{get_instructions_from_challenge}.
Acceptance Criteria: Non-empty string; avoids example-specific indices per prompt guidance.
Failure Modes & Mitigations: Empty/invalid → retry via step sampling; move to revision/pooling.
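
The sketch below illustrates Subtasks 2–3 under stated assumptions: it renders an integer grid as digit rows and packs training examples followed by test inputs into a list of typed text parts. The exact formats of Challenge.grid_to_str and contents_from_* are not reproduced here; render_grid, pack_example_contents, and the "type"/"text" part layout are hypothetical stand-ins.

```python
# Minimal sketch for Subtasks 2-3, assuming a row-per-line digit rendering;
# the real Challenge.grid_to_str / contents_from_example may differ.
from typing import Any

Grid = list[list[int]]

def render_grid(grid: Grid) -> str:
    """Render an integer grid as text, one space-separated row per line (assumed format)."""
    assert all(len(row) == len(grid[0]) for row in grid), "grid must be rectangular"
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def pack_example_contents(examples: list[tuple[Grid, Grid]],
                          test_inputs: list[Grid],
                          include_base64: bool = False) -> list[dict[str, Any]]:
    """Build an ordered list of typed message parts: training examples first, then test inputs."""
    parts: list[dict[str, Any]] = []
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append({"type": "text", "text": f"Example {i} input:\n{render_grid(inp)}"})
        parts.append({"type": "text", "text": f"Example {i} output:\n{render_grid(out)}"})
        # include_base64 would append an image part per grid here; omitted to stay text-only.
    for j, inp in enumerate(test_inputs, start=1):
        parts.append({"type": "text", "text": f"Test input {j}:\n{render_grid(inp)}"})
    return parts
```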
::: Subtask 5 — Executor (Follow Instructions)
Objective: Apply instructions to a test grid to produce an output grid.
Inputs: Instructions; training examples (as reference); one test input; flags is_perfect, include_base64, use_diffs.
Procedure / Prompt Spec or Algorithm: Use AGENT_FOLLOW_INSTRUCTIONS_PROMPT and optional PERFECT_PROMPT; call output_grid_from_instructions, which invokes get_next_structure(GridResponse, …).
Outputs / Artifacts: GridResponse.grid (2D ints).
Dependencies / Anchors: src/main.py::{AGENT_FOLLOW_INSTRUCTIONS_PROMPT, PERFECT_PROMPT, GridResponse, output_grid_from_instructions}.
Acceptance Criteria: Grid shape matches the target's shape during scoring; integers only.
Failure Modes & Mitigations: Free text instead of a grid → schema enforcement via get_next_structure.

::: Subtask 6 — Structured Output Enforcement
Objective: Ensure parseable, schema-validated LLM outputs.
Inputs: Pydantic schemas; provider selection.
Procedure / Prompt Spec or Algorithm: Route via get_next_structure, which dispatches to _get_next_structure_* per model (OpenAI/Anthropic/Gemini/DeepSeek/xAI/OpenRouter), using JSON schema/object mode or tool use as supported (see the schema sketch after Subtask 9).
Outputs / Artifacts: Parsed Pydantic instances.
Dependencies / Anchors: src/llms/structured.py::{get_next_structure, _get_next_structure_openai, _get_next_structure_anthropic, _get_next_structure_gemini, _get_next_structure_deepseek, _get_next_structure_xai, _get_next_structure_openrouter}; src/llms/models.py::Model.
Acceptance Criteria: Successful parse or controlled failure with retries; no downstream string parsing needed.
Failure Modes & Mitigations: Model lacks json_schema support → use the json_object path; retry with backoff.

::: Subtask 7 — Scoring (Leave-One-Out)
Objective: Quantify instruction generalization.
Inputs: Candidate instructions; full training set; follow_model.
Procedure / Prompt Spec or Algorithm: For each training example i, hold out i, execute on its input, and compare to its output using get_grid_similarity (exact cell-wise match proportion); average over examples (see the scoring sketch after Subtask 9).
Outputs / Artifacts: InstructionsScore with example_scores and aggregate score.
Dependencies / Anchors: src/run.py::{get_example_score, get_grid_similarity, score_instructions_on_challenge}.
Acceptance Criteria: Score in [0, 1]; 1.0 iff all cells match on all held-out examples.
Failure Modes & Mitigations: Dimensional mismatch → similarity 0; optional viz when VIZ=1.

::: Subtask 8 — Revision (Self-Repair)
Objective: Improve weak instructions using feedback from mismatches.
Inputs: Prior InstructionsScore; diffs/attempts; StepRevision config.
Procedure / Prompt Spec or Algorithm: Prompt with REVISION_PROMPT via InstructionsScore.get_revised_instructions; rescore the revised instructions.
Outputs / Artifacts: New InstructionsScore items.
Dependencies / Anchors: src/run.py::{REVISION_PROMPT, InstructionsScore.get_revised_instructions}; src/configs/models.py::StepRevision.
Acceptance Criteria: Best revised score ≥ previous best; log improvement metrics.
Failure Modes & Mitigations: No gains → move to pooling; adjust sampling counts.

::: Subtask 9 — Pooling (Synthesis Across Attempts)
Objective: Fuse strengths from several near-miss instruction sets.
Inputs: Top InstructionsScore samples; StepRevisionPool config.
Procedure / Prompt Spec or Algorithm: Use SYNTHESIS_PROMPT via get_pooling_instruction_from_scores; rescore; merge with existing candidates.
Outputs / Artifacts: Pooled instruction texts and scores.
Dependencies / Anchors: src/run.py::{SYNTHESIS_PROMPT, get_pooling_instruction_from_scores}; src/configs/models.py::StepRevisionPool.
Acceptance Criteria: At least one pooled instruction's score ≥ the prior top score.
Failure Modes & Mitigations: Convergence to a weak consensus → increase diversity (times); keep multiple candidates.
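
For Subtasks 4–6, a minimal sketch of schema enforcement, assuming Pydantic models shaped like the InstructionsResponse.instructions and GridResponse.grid fields named above. The real get_next_structure dispatches to per-provider adapters; here the provider call is abstracted as an injected callable, and get_structured/call_model are hypothetical names.

```python
# Sketch of schema enforcement for Subtasks 4-6: validate raw model output against
# a Pydantic schema and retry on parse failure. The provider-specific JSON-mode or
# tool-use call is injected rather than reproduced here.
from typing import Callable, Type, TypeVar
from pydantic import BaseModel, ValidationError

class InstructionsResponse(BaseModel):
    instructions: str

class GridResponse(BaseModel):
    grid: list[list[int]]

T = TypeVar("T", bound=BaseModel)

def get_structured(schema: Type[T],
                   call_model: Callable[[], str],
                   max_attempts: int = 3) -> T:
    """Call the model up to max_attempts times until its JSON parses into `schema`."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = call_model()  # provider-specific structured-output call goes here
        try:
            return schema.model_validate_json(raw)
        except ValidationError as err:
            last_error = err  # malformed or off-schema output; try again
    raise RuntimeError(f"no valid {schema.__name__} after {max_attempts} attempts") from last_error
```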
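
And a sketch of the Subtask 7 scoring rule as described above: exact cell-wise match proportion, 0.0 on any shape mismatch, averaged with each training example held out in turn. grid_similarity and leave_one_out_score are illustrative names, not the repository's get_grid_similarity/get_example_score; the execute callable stands in for output_grid_from_instructions.

```python
# Sketch of leave-one-out scoring (Subtask 7). Similarity is the exact cell-wise
# match proportion, 0.0 on shape mismatch, averaged over held-out examples.
from typing import Callable

Grid = list[list[int]]

def grid_similarity(predicted: Grid, target: Grid) -> float:
    """Fraction of cells that match exactly; 0.0 if the shapes differ."""
    if len(predicted) != len(target) or any(len(p) != len(t) for p, t in zip(predicted, target)):
        return 0.0
    total = sum(len(row) for row in target)
    matches = sum(p == t for prow, trow in zip(predicted, target) for p, t in zip(prow, trow))
    return matches / total if total else 0.0

def leave_one_out_score(instructions: str,
                        examples: list[tuple[Grid, Grid]],
                        execute: Callable[[str, list[tuple[Grid, Grid]], Grid], Grid]) -> float:
    """Average similarity over examples, each scored with itself held out as the test case."""
    scores = []
    for i, (inp, expected) in enumerate(examples):
        context = examples[:i] + examples[i + 1:]   # hold out example i
        predicted = execute(instructions, context, inp)
        scores.append(grid_similarity(predicted, expected))
    return sum(scores) / len(scores) if scores else 0.0
```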
::: Subtask 10 — Step Orchestration & Sampling
Objective: Execute the configured Step/Revision/Pool sequence.
Inputs: RunConfig.steps with per-step models, counts, timeouts, flags.
Procedure / Prompt Spec or Algorithm: In get_answer_grids, generate candidates per step, rescore, sort descending, and short-list for subsequent steps; exit early on a perfect score.
Outputs / Artifacts: Sorted candidate list.
Dependencies / Anchors: src/run.py::{get_instruction_scores, get_score_from_instructions, get_answer_grids}; src/configs/models.py::{RunConfig, Step, StepRevision, StepRevisionPool}.
Acceptance Criteria: Deterministic step flow with early exit on score == 1.
Failure Modes & Mitigations: Empty candidate set → log error and halt (exception path).

::: Subtask 11 — Finalization & Diversity
Objective: Produce up to two distinct final guesses per test grid.
Inputs: Top candidates; final_follow_model; final_follow_times.
Procedure / Prompt Spec or Algorithm: Use get_diverse_attempts to generate multiple outputs; prefer perfect-score instructions; otherwise split attempts between the top two candidates; ensure diversity when possible.
Outputs / Artifacts: Two Guess objects; attempts JSON updated.
Dependencies / Anchors: src/run.py::{get_diverse_attempts, return_answer, Guess}.
Acceptance Criteria: Each test input yields ≤2 outputs; if all generated outputs are identical, duplicates are allowed per code.
Failure Modes & Mitigations: Lack of diversity → explicitly bias attempts across the top-2 instruction sources (as implemented).

::: Subtask 12 — Concurrency Control
Objective: Prevent provider overload and monitor saturation.
Inputs: Env MAX_CONCURRENCY; config max_concurrent_tasks.
Procedure / Prompt Spec or Algorithm: Use MonitoredSemaphore in both the run loop and get_next_structure (API semaphore); log active/available permits (see the semaphore sketch after Subtask 13).
Outputs / Artifacts: Saturation logs; orderly task scheduling.
Dependencies / Anchors: src/async_utils/semaphore_monitor.py::MonitoredSemaphore; usage in src/run.py::solve_challenges and src/llms/structured.py::API_SEMAPHORE.
Acceptance Criteria: No storm of RESOURCE_EXHAUSTED errors; visible saturation-percentage logs.
Failure Modes & Mitigations: Starvation/deadlocks → minimal critical sections; staggered starts in solve_challenges.

::: Subtask 13 — Provider Adapters & Retries
Objective: Uniform structured calls across providers with robust retries.
Inputs: Model enum; messages; schemas.
Procedure / Prompt Spec or Algorithm: Dispatch in get_next_structure; per-provider adapters enforce structured output; retry_with_backoff wraps transient failures (xAI/OpenRouter paths) with jitter and caps (see the backoff sketch below).
Outputs / Artifacts: Parsed outputs; usage logs.
Dependencies / Anchors: src/llms/structured.py::{retry_with_backoff, get_next_structure, _get_next_structure_*}; src/llms/models.py::Model.
Acceptance Criteria: Recovers from transient failures with bounded retries; attempts are logged.
Failure Modes & Mitigations: Non-retryable errors → fail immediately with error logs.
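
A minimal sketch of the Subtask 12 idea, assuming MonitoredSemaphore behaves like asyncio.Semaphore plus a saturation log line; the real class in src/async_utils/semaphore_monitor.py may track and report permits differently. MAX_CONCURRENCY is read with a default so a missing env var cannot raise KeyError (cf. Subtask 17).

```python
# Sketch of a monitored asyncio semaphore (Subtask 12): same acquire/release
# semantics as asyncio.Semaphore, plus a log of how saturated the permit pool is.
import asyncio
import logging
import os

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("arc.concurrency")

MAX_CONCURRENCY = int(os.getenv("MAX_CONCURRENCY", "10"))  # default avoids KeyError

class MonitoredSemaphore(asyncio.Semaphore):
    """asyncio.Semaphore that additionally logs permit-pool saturation."""

    def __init__(self, value: int, name: str = "api") -> None:
        super().__init__(value)
        self._capacity = value
        self._in_use = 0
        self._name = name

    async def acquire(self) -> bool:
        ok = await super().acquire()
        self._in_use += 1
        log.debug("%s semaphore: %d/%d permits in use (%.0f%% saturated)",
                  self._name, self._in_use, self._capacity,
                  100 * self._in_use / self._capacity)
        return ok

    def release(self) -> None:
        self._in_use -= 1
        super().release()

async def _demo() -> None:
    sem = MonitoredSemaphore(MAX_CONCURRENCY)
    async with sem:              # __aenter__/__aexit__ call acquire()/release()
        await asyncio.sleep(0)

if __name__ == "__main__":
    asyncio.run(_demo())
```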
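
And a sketch of the Subtask 13 retry pattern: exponential backoff with jitter and a delay cap, applied only to errors the caller classifies as transient. Parameter names and defaults are illustrative, not the repository's retry_with_backoff signature.

```python
# Sketch of retry with exponential backoff, jitter, and a cap (Subtask 13).
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_with_backoff(call: Callable[[], Awaitable[T]],
                             is_transient: Callable[[Exception], bool],
                             max_attempts: int = 5,
                             base_delay: float = 1.0,
                             max_delay: float = 30.0) -> T:
    """Retry `call` on transient errors, sleeping base_delay * 2**attempt plus jitter, capped."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception as err:
            if not is_transient(err) or attempt == max_attempts - 1:
                raise                                   # non-retryable, or retry budget exhausted
            delay = min(max_delay, base_delay * (2 ** attempt))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))   # add jitter
    raise RuntimeError("unreachable")  # loop always returns or raises
```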
::: Subtask 14 — Logging & Trace Context
Objective: Record spans and mirror them to a local file while scrubbing sensitive data.
Inputs: LOGFIRE_API_KEY, LOCAL_LOGS_ONLY, LOG_LEVEL.
Procedure / Prompt Spec or Algorithm: Patch logfire methods to inject run_id/task_id; write a rotating file at logs/arc.log; wrap spans with start/end/error messages.
Outputs / Artifacts: Local log file; optional remote telemetry.
Dependencies / Anchors: src/logging_config.py::{generate_run_id, set_task_id, logfire patches}; src/log.py::log facade.
Acceptance Criteria: Every major operation creates span logs; no secret leakage (scrubbing callback).
Failure Modes & Mitigations: No token/network → LOCAL_LOGS_ONLY=1 path.

::: Subtask 15 — Persistence (Files & Optional DB)
Objective: Store attempts and, optionally, DB rows for analysis.
Inputs: Guesses and instruction scores; NEON_DSN (optional).
Procedure / Prompt Spec or Algorithm: Write attempts/temp files; if the DB is configured, InstructionsScore.save_to_db and Guess.save_to_db insert rows (JSONB fields) with metadata.
Outputs / Artifacts: JSON files; DB rows when enabled.
Dependencies / Anchors: src/run.py::{solve_challenge, Guess.save_to_db, InstructionsScore.save_to_db}.
Acceptance Criteria: Files are valid JSON; DB insert does not raise.
Failure Modes & Mitigations: DB failure → continue with file artifacts only.

::: Subtask 16 — Visualization & Base64
Objective: Aid debugging and multimodal prompting.
Inputs: VIZ, LOG_GRIDS env toggles; color map.
Procedure / Prompt Spec or Algorithm: When enabled, viz_many shows mismatches; base64_from_grid generates PNGs for embedding.
Outputs / Artifacts: On-screen figures; base64 strings.
Dependencies / Anchors: src/viz.py::{viz_many, base64_from_grid}; hooks in src/run.py and src/main.py.
Acceptance Criteria: No GUI errors when disabled; images created when requested.
Failure Modes & Mitigations: Headless errors → keep the defaults off.

::: Subtask 17 — Environment & Runbook
Objective: Prepare the environment and execute.
Inputs: .env with provider keys; MAX_CONCURRENCY; Python 3.12 target in tooling.
Procedure / Prompt Spec or Algorithm: Load env early (src/__init__.py); install deps per pyproject.toml; run python src/run.py; adjust run() to select a preset (grok_config_prod, gpt_config_prod, oss_config, mini_config).
Outputs / Artifacts: Successful run start; log output including the run id.
Dependencies / Anchors: README.md (Requirements, Environment configuration, Running a solve); src/__init__.py; src/configs/*.
Acceptance Criteria: Smoke test runs with defaults; no KeyError for MAX_CONCURRENCY.
Failure Modes & Mitigations: Missing var → set it in .env per README.

::: Subtask 18 — Evaluation
Objective: Compute accuracy when ground truth is provided.
Inputs: Attempts JSON; solutions JSON.
Procedure / Prompt Spec or Algorithm: Use evaluate_solutions to compare attempts against the truth; count either attempt as correct; report aggregate accuracy (see the sketch below).
Outputs / Artifacts: Printed accuracy; logs.
Dependencies / Anchors: src/run.py::{evaluate_solutions}.
Acceptance Criteria: Accuracy matches competition scoring semantics (either attempt counts).
Failure Modes & Mitigations: Schema mismatch → validate inputs before evaluation.
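
A sketch of the Subtask 18 scoring semantics, assuming attempts and solutions are keyed by task id with one entry per test input; the actual attempts-file schema may differ, but the rule is the same: a test input counts as solved if either attempt matches the ground truth exactly.

```python
# Sketch of "either attempt counts" accuracy (Subtask 18). Data shapes are
# illustrative stand-ins for the attempts/solutions JSON files.
Grid = list[list[int]]

def evaluate(attempts: dict[str, list[list[Grid]]],
             solutions: dict[str, list[Grid]]) -> float:
    """attempts[task_id][test_index] lists candidate grids; solutions[task_id][test_index] is the truth."""
    total = 0
    correct = 0
    for task_id, truth_grids in solutions.items():
        for test_index, truth in enumerate(truth_grids):
            total += 1
            candidates = attempts.get(task_id, [])
            guesses = candidates[test_index] if test_index < len(candidates) else []
            if any(guess == truth for guess in guesses):   # either attempt counts
                correct += 1
    return correct / total if total else 0.0
```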
URL: https://ib.bsb.br/arc-prize