ARC PRIZE

Slug: arc-prize

Subtask 1 — System Role & Run Flow
Objective: Define the pipeline goal and top-level control flow.
Inputs: ARC JSON challenges; RunConfig; env vars (keys, MAX_CONCURRENCY, flags).
Procedure / Prompt Spec or Algorithm: Use src/run.py::run() → run_from_json() to load challenges and choose a preset, then orchestrate solve_challenges(), which schedules solve_challenge(); within it, call get_answer_grids() to generate and score instructions and finalize guesses.
Outputs / Artifacts: Attempts JSON in attempts/arc-prize-20XX/...; per-task temp_solutions/*.json.
Dependencies / Anchors: README.md (How the system works, Running a solve); src/run.py::{run, run_from_json, solve_challenges, solve_challenge, get_answer_grids, return_answer}.
Acceptance Criteria: Running python src/run.py produces an attempts file and (if truth is present) printed accuracy; no unhandled exceptions.
Failure Modes & Mitigations: Missing env keys → set per README; rate limits → lower concurrency flags.

::: Subtask 2 — Data Models & Grid Representation
Objective: Normalize grid IO for prompts and scoring.
Inputs: Challenge, Example, Input models; integer grids; COLOR_MAP.
Procedure / Prompt Spec or Algorithm: Render grids via Challenge.grid_to_str; optionally embed base64 PNG via Challenge.grid_to_base64 / viz.base64_from_grid (see the rendering/packing sketch after Subtask 4).
Outputs / Artifacts: Text matrices; optional images.
Dependencies / Anchors: src/models.py::{Challenge, Example, Input, grid_to_str, grid_to_base64, COLOR_MAP}; src/viz.py::{base64_from_grid}.
Acceptance Criteria: Stringified grids are rectangular and integer-only; base64 generation does not raise.
Failure Modes & Mitigations: Image encoding errors → log and fall back to text-only (see main.contents_from_grid try/except).

::: Subtask 3 — Prompt Packing
Objective: Build consistent message content for LLMs.
Inputs: Training examples, optional attempts, test inputs; toggles: include_base64, use_diffs.
Procedure / Prompt Spec or Algorithm: Use contents_from_grid, contents_from_example, contents_from_challenge to produce a list of typed parts with labeled sections and (optionally) diffs.
Outputs / Artifacts: Structured message content lists.
Dependencies / Anchors: src/main.py::{contents_from_grid, contents_from_example, contents_from_challenge}; diff text via src/run.py::generate_grid_diff.
Acceptance Criteria: Message parts are ordered training examples → test inputs; when diffs are enabled, mismatches include an ASCII diff.
Failure Modes & Mitigations: Context bloat → disable images/diffs; reduce step sizes.

::: Subtask 4 — Instruction Synthesis
Objective: Derive a general rule as stepwise instructions.
Inputs: INTUITIVE_PROMPT; packed examples/tests.
Procedure / Prompt Spec or Algorithm: Call get_next_structure(InstructionsResponse, model=step.instruction_model, messages=…) from get_instructions_from_challenge.
Outputs / Artifacts: InstructionsResponse.instructions.
Dependencies / Anchors: src/main.py::{INTUITIVE_PROMPT, InstructionsResponse}; src/run.py::{get_instructions_from_challenge}.
Acceptance Criteria: Non-empty string; avoids example-specific indices per prompt guidance.
Failure Modes & Mitigations: Empty/invalid → retry via step sampling; move to revision/pooling.
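
The sketch below illustrates Subtasks 2–3 under stated assumptions: it renders an integer grid as digit rows and packs training examples followed by test inputs into a list of typed text parts. The exact formats of Challenge.grid_to_str and contents_from_* are not reproduced here; render_grid, pack_example_contents, and the "type"/"text" part layout are hypothetical stand-ins.

```python
# Minimal sketch for Subtasks 2-3, assuming a row-per-line digit rendering;
# the real Challenge.grid_to_str / contents_from_example may differ.
from typing import Any

Grid = list[list[int]]

def render_grid(grid: Grid) -> str:
    """Render an integer grid as text, one space-separated row per line (assumed format)."""
    assert all(len(row) == len(grid[0]) for row in grid), "grid must be rectangular"
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def pack_example_contents(examples: list[tuple[Grid, Grid]],
                          test_inputs: list[Grid],
                          include_base64: bool = False) -> list[dict[str, Any]]:
    """Build an ordered list of typed message parts: training examples first, then test inputs."""
    parts: list[dict[str, Any]] = []
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append({"type": "text", "text": f"Example {i} input:\n{render_grid(inp)}"})
        parts.append({"type": "text", "text": f"Example {i} output:\n{render_grid(out)}"})
        # include_base64 would append an image part per grid here; omitted to stay text-only.
    for j, inp in enumerate(test_inputs, start=1):
        parts.append({"type": "text", "text": f"Test input {j}:\n{render_grid(inp)}"})
    return parts
```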
::: Subtask 5 — Executor (Follow Instructions)
Objective: Apply instructions to a test grid to produce an output grid.
Inputs: Instructions; training examples (as reference); one test input; flags is_perfect, include_base64, use_diffs.
Procedure / Prompt Spec or Algorithm: Use AGENT_FOLLOW_INSTRUCTIONS_PROMPT and optional PERFECT_PROMPT; call output_grid_from_instructions, which invokes get_next_structure(GridResponse, …).
Outputs / Artifacts: GridResponse.grid (2D ints).
Dependencies / Anchors: src/main.py::{AGENT_FOLLOW_INSTRUCTIONS_PROMPT, PERFECT_PROMPT, GridResponse, output_grid_from_instructions}.
Acceptance Criteria: Grid shape matches the target's shape during scoring; integers only.
Failure Modes & Mitigations: Free text instead of a grid → schema enforcement via get_next_structure.

::: Subtask 6 — Structured Output Enforcement
Objective: Ensure parseable, schema-validated LLM outputs.
Inputs: Pydantic schemas; provider selection.
Procedure / Prompt Spec or Algorithm: Route via get_next_structure, which dispatches to _get_next_structure_* per model (OpenAI/Anthropic/Gemini/DeepSeek/xAI/OpenRouter), using JSON schema/object mode or tool use as supported (see the schema sketch after Subtask 9).
Outputs / Artifacts: Parsed Pydantic instances.
Dependencies / Anchors: src/llms/structured.py::{get_next_structure, _get_next_structure_openai, _get_next_structure_anthropic, _get_next_structure_gemini, _get_next_structure_deepseek, _get_next_structure_xai, _get_next_structure_openrouter}; src/llms/models.py::Model.
Acceptance Criteria: Successful parse or controlled failure with retries; no downstream string parsing needed.
Failure Modes & Mitigations: Model lacks json_schema support → use the json_object path; retry with backoff.

::: Subtask 7 — Scoring (Leave-One-Out)
Objective: Quantify instruction generalization.
Inputs: Candidate instructions; full training set; follow_model.
Procedure / Prompt Spec or Algorithm: For each training example i, hold out i, execute on its input, and compare to its output using get_grid_similarity (exact cell-wise match proportion); average over examples (see the scoring sketch after Subtask 9).
Outputs / Artifacts: InstructionsScore with example_scores and aggregate score.
Dependencies / Anchors: src/run.py::{get_example_score, get_grid_similarity, score_instructions_on_challenge}.
Acceptance Criteria: Score in [0, 1]; 1.0 iff all cells match on all held-out examples.
Failure Modes & Mitigations: Dimensional mismatch → similarity 0; optional viz when VIZ=1.

::: Subtask 8 — Revision (Self-Repair)
Objective: Improve weak instructions using feedback from mismatches.
Inputs: Prior InstructionsScore; diffs/attempts; StepRevision config.
Procedure / Prompt Spec or Algorithm: Prompt with REVISION_PROMPT via InstructionsScore.get_revised_instructions; rescore the revised instructions.
Outputs / Artifacts: New InstructionsScore items.
Dependencies / Anchors: src/run.py::{REVISION_PROMPT, InstructionsScore.get_revised_instructions}; src/configs/models.py::StepRevision.
Acceptance Criteria: Best revised score ≥ previous best; log improvement metrics.
Failure Modes & Mitigations: No gains → move to pooling; adjust sampling counts.

::: Subtask 9 — Pooling (Synthesis Across Attempts)
Objective: Fuse strengths from several near-miss instruction sets.
Inputs: Top InstructionsScore samples; StepRevisionPool config.
Procedure / Prompt Spec or Algorithm: Use SYNTHESIS_PROMPT via get_pooling_instruction_from_scores; rescore; merge with existing candidates.
Outputs / Artifacts: Pooled instruction texts and scores.
Dependencies / Anchors: src/run.py::{SYNTHESIS_PROMPT, get_pooling_instruction_from_scores}; src/configs/models.py::StepRevisionPool.
Acceptance Criteria: At least one pooled instruction's score ≥ the prior top score.
Failure Modes & Mitigations: Convergence to a weak consensus → increase diversity (times); keep multiple candidates.
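
For Subtasks 4–6, a minimal sketch of schema enforcement, assuming Pydantic models shaped like the InstructionsResponse.instructions and GridResponse.grid fields named above. The real get_next_structure dispatches to per-provider adapters; here the provider call is abstracted as an injected callable, and get_structured/call_model are hypothetical names.

```python
# Sketch of schema enforcement for Subtasks 4-6: validate raw model output against
# a Pydantic schema and retry on parse failure. The provider-specific JSON-mode or
# tool-use call is injected rather than reproduced here.
from typing import Callable, Type, TypeVar
from pydantic import BaseModel, ValidationError

class InstructionsResponse(BaseModel):
    instructions: str

class GridResponse(BaseModel):
    grid: list[list[int]]

T = TypeVar("T", bound=BaseModel)

def get_structured(schema: Type[T],
                   call_model: Callable[[], str],
                   max_attempts: int = 3) -> T:
    """Call the model up to max_attempts times until its JSON parses into `schema`."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = call_model()  # provider-specific structured-output call goes here
        try:
            return schema.model_validate_json(raw)
        except ValidationError as err:
            last_error = err  # malformed or off-schema output; try again
    raise RuntimeError(f"no valid {schema.__name__} after {max_attempts} attempts") from last_error
```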
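
And a sketch of the Subtask 7 scoring rule as described above: exact cell-wise match proportion, 0.0 on any shape mismatch, averaged with each training example held out in turn. grid_similarity and leave_one_out_score are illustrative names, not the repository's get_grid_similarity/get_example_score; the execute callable stands in for output_grid_from_instructions.

```python
# Sketch of leave-one-out scoring (Subtask 7). Similarity is the exact cell-wise
# match proportion, 0.0 on shape mismatch, averaged over held-out examples.
from typing import Callable

Grid = list[list[int]]

def grid_similarity(predicted: Grid, target: Grid) -> float:
    """Fraction of cells that match exactly; 0.0 if the shapes differ."""
    if len(predicted) != len(target) or any(len(p) != len(t) for p, t in zip(predicted, target)):
        return 0.0
    total = sum(len(row) for row in target)
    matches = sum(p == t for prow, trow in zip(predicted, target) for p, t in zip(prow, trow))
    return matches / total if total else 0.0

def leave_one_out_score(instructions: str,
                        examples: list[tuple[Grid, Grid]],
                        execute: Callable[[str, list[tuple[Grid, Grid]], Grid], Grid]) -> float:
    """Average similarity over examples, each scored with itself held out as the test case."""
    scores = []
    for i, (inp, expected) in enumerate(examples):
        context = examples[:i] + examples[i + 1:]   # hold out example i
        predicted = execute(instructions, context, inp)
        scores.append(grid_similarity(predicted, expected))
    return sum(scores) / len(scores) if scores else 0.0
```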
::: Subtask 10 — Step Orchestration & Sampling
Objective: Execute the configured Step/Revision/Pool sequence.
Inputs: RunConfig.steps with per-step models, counts, timeouts, flags.
Procedure / Prompt Spec or Algorithm: In get_answer_grids, generate candidates per step, rescore, sort descending, and short-list for subsequent steps; exit early on a perfect score.
Outputs / Artifacts: Sorted candidate list.
Dependencies / Anchors: src/run.py::{get_instruction_scores, get_score_from_instructions, get_answer_grids}; src/configs/models.py::{RunConfig, Step, StepRevision, StepRevisionPool}.
Acceptance Criteria: Deterministic step flow with early exit on score == 1.
Failure Modes & Mitigations: Empty candidate set → log error and halt (exception path).

::: Subtask 11 — Finalization & Diversity
Objective: Produce up to two distinct final guesses per test grid.
Inputs: Top candidates; final_follow_model; final_follow_times.
Procedure / Prompt Spec or Algorithm: Use get_diverse_attempts to generate multiple outputs; prefer perfect-score instructions; otherwise split attempts between the top two candidates; ensure diversity when possible.
Outputs / Artifacts: Two Guess objects; attempts JSON updated.
Dependencies / Anchors: src/run.py::{get_diverse_attempts, return_answer, Guess}.
Acceptance Criteria: Each test input yields ≤2 outputs; if all generated outputs are identical, duplicates are allowed per code.
Failure Modes & Mitigations: Lack of diversity → explicitly bias attempts across the top-2 instruction sources (as implemented).

::: Subtask 12 — Concurrency Control
Objective: Prevent provider overload and monitor saturation.
Inputs: Env MAX_CONCURRENCY; config max_concurrent_tasks.
Procedure / Prompt Spec or Algorithm: Use MonitoredSemaphore in both the run loop and get_next_structure (API semaphore); log active/available permits (see the semaphore sketch after Subtask 13).
Outputs / Artifacts: Saturation logs; orderly task scheduling.
Dependencies / Anchors: src/async_utils/semaphore_monitor.py::MonitoredSemaphore; usage in src/run.py::solve_challenges and src/llms/structured.py::API_SEMAPHORE.
Acceptance Criteria: No storm of RESOURCE_EXHAUSTED errors; visible saturation-percentage logs.
Failure Modes & Mitigations: Starvation/deadlocks → minimal critical sections; staggered starts in solve_challenges.

::: Subtask 13 — Provider Adapters & Retries
Objective: Uniform structured calls across providers with robust retries.
Inputs: Model enum; messages; schemas.
Procedure / Prompt Spec or Algorithm: Dispatch in get_next_structure; per-provider adapters enforce structured output; retry_with_backoff wraps transient failures (xAI/OpenRouter paths) with jitter and caps (see the backoff sketch below).
Outputs / Artifacts: Parsed outputs; usage logs.
Dependencies / Anchors: src/llms/structured.py::{retry_with_backoff, get_next_structure, _get_next_structure_*}; src/llms/models.py::Model.
Acceptance Criteria: Recovers from transient failures with bounded retries; attempts are logged.
Failure Modes & Mitigations: Non-retryable errors → fail immediately with error logs.
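
A minimal sketch of the Subtask 12 idea, assuming MonitoredSemaphore behaves like asyncio.Semaphore plus a saturation log line; the real class in src/async_utils/semaphore_monitor.py may track and report permits differently. MAX_CONCURRENCY is read with a default so a missing env var cannot raise KeyError (cf. Subtask 17).

```python
# Sketch of a monitored asyncio semaphore (Subtask 12): same acquire/release
# semantics as asyncio.Semaphore, plus a log of how saturated the permit pool is.
import asyncio
import logging
import os

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("arc.concurrency")

MAX_CONCURRENCY = int(os.getenv("MAX_CONCURRENCY", "10"))  # default avoids KeyError

class MonitoredSemaphore(asyncio.Semaphore):
    """asyncio.Semaphore that additionally logs permit-pool saturation."""

    def __init__(self, value: int, name: str = "api") -> None:
        super().__init__(value)
        self._capacity = value
        self._in_use = 0
        self._name = name

    async def acquire(self) -> bool:
        ok = await super().acquire()
        self._in_use += 1
        log.debug("%s semaphore: %d/%d permits in use (%.0f%% saturated)",
                  self._name, self._in_use, self._capacity,
                  100 * self._in_use / self._capacity)
        return ok

    def release(self) -> None:
        self._in_use -= 1
        super().release()

async def _demo() -> None:
    sem = MonitoredSemaphore(MAX_CONCURRENCY)
    async with sem:              # __aenter__/__aexit__ call acquire()/release()
        await asyncio.sleep(0)

if __name__ == "__main__":
    asyncio.run(_demo())
```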
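
And a sketch of the Subtask 13 retry pattern: exponential backoff with jitter and a delay cap, applied only to errors the caller classifies as transient. Parameter names and defaults are illustrative, not the repository's retry_with_backoff signature.

```python
# Sketch of retry with exponential backoff, jitter, and a cap (Subtask 13).
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_with_backoff(call: Callable[[], Awaitable[T]],
                             is_transient: Callable[[Exception], bool],
                             max_attempts: int = 5,
                             base_delay: float = 1.0,
                             max_delay: float = 30.0) -> T:
    """Retry `call` on transient errors, sleeping base_delay * 2**attempt plus jitter, capped."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception as err:
            if not is_transient(err) or attempt == max_attempts - 1:
                raise                                   # non-retryable, or retry budget exhausted
            delay = min(max_delay, base_delay * (2 ** attempt))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))   # add jitter
    raise RuntimeError("unreachable")  # loop always returns or raises
```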
::: Subtask 14 — Logging & Trace Context
Objective: Record spans and mirror them to a local file while scrubbing sensitive data.
Inputs: LOGFIRE_API_KEY, LOCAL_LOGS_ONLY, LOG_LEVEL.
Procedure / Prompt Spec or Algorithm: Patch logfire methods to inject run_id/task_id; write a rotating file at logs/arc.log; wrap spans with start/end/error messages.
Outputs / Artifacts: Local log file; optional remote telemetry.
Dependencies / Anchors: src/logging_config.py::{generate_run_id, set_task_id, logfire patches}; src/log.py::log facade.
Acceptance Criteria: Every major operation creates span logs; no secret leakage (scrubbing callback).
Failure Modes & Mitigations: No token/network → LOCAL_LOGS_ONLY=1 path.

::: Subtask 15 — Persistence (Files & Optional DB)
Objective: Store attempts and, optionally, DB rows for analysis.
Inputs: Guesses and instruction scores; NEON_DSN (optional).
Procedure / Prompt Spec or Algorithm: Write attempts/temp files; if the DB is configured, InstructionsScore.save_to_db and Guess.save_to_db insert rows (JSONB fields) with metadata.
Outputs / Artifacts: JSON files; DB rows when enabled.
Dependencies / Anchors: src/run.py::{solve_challenge, Guess.save_to_db, InstructionsScore.save_to_db}.
Acceptance Criteria: Files are valid JSON; DB insert does not raise.
Failure Modes & Mitigations: DB failure → continue with file artifacts only.

::: Subtask 16 — Visualization & Base64
Objective: Aid debugging and multimodal prompting.
Inputs: VIZ, LOG_GRIDS env toggles; color map.
Procedure / Prompt Spec or Algorithm: When enabled, viz_many shows mismatches; base64_from_grid generates PNGs for embedding.
Outputs / Artifacts: On-screen figures; base64 strings.
Dependencies / Anchors: src/viz.py::{viz_many, base64_from_grid}; hooks in src/run.py and src/main.py.
Acceptance Criteria: No GUI errors when disabled; images created when requested.
Failure Modes & Mitigations: Headless errors → keep the defaults off.

::: Subtask 17 — Environment & Runbook
Objective: Prepare the environment and execute.
Inputs: .env with provider keys; MAX_CONCURRENCY; Python 3.12 target in tooling.
Procedure / Prompt Spec or Algorithm: Load env early (src/__init__.py); install deps per pyproject.toml; run python src/run.py; adjust run() to select a preset (grok_config_prod, gpt_config_prod, oss_config, mini_config).
Outputs / Artifacts: Successful run start; log output including the run id.
Dependencies / Anchors: README.md (Requirements, Environment configuration, Running a solve); src/__init__.py; src/configs/*.
Acceptance Criteria: Smoke test runs with defaults; no KeyError for MAX_CONCURRENCY.
Failure Modes & Mitigations: Missing var → set it in .env per README.

::: Subtask 18 — Evaluation
Objective: Compute accuracy when ground truth is provided.
Inputs: Attempts JSON; solutions JSON.
Procedure / Prompt Spec or Algorithm: Use evaluate_solutions to compare attempts against the truth; count either attempt as correct; report aggregate accuracy (see the sketch below).
Outputs / Artifacts: Printed accuracy; logs.
Dependencies / Anchors: src/run.py::{evaluate_solutions}.
Acceptance Criteria: Accuracy matches competition scoring semantics (either attempt counts).
Failure Modes & Mitigations: Schema mismatch → validate inputs before evaluation.
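
A sketch of the Subtask 18 scoring semantics, assuming attempts and solutions are keyed by task id with one entry per test input; the actual attempts-file schema may differ, but the rule is the same: a test input counts as solved if either attempt matches the ground truth exactly.

```python
# Sketch of "either attempt counts" accuracy (Subtask 18). Data shapes are
# illustrative stand-ins for the attempts/solutions JSON files.
Grid = list[list[int]]

def evaluate(attempts: dict[str, list[list[Grid]]],
             solutions: dict[str, list[Grid]]) -> float:
    """attempts[task_id][test_index] lists candidate grids; solutions[task_id][test_index] is the truth."""
    total = 0
    correct = 0
    for task_id, truth_grids in solutions.items():
        for test_index, truth in enumerate(truth_grids):
            total += 1
            candidates = attempts.get(task_id, [])
            guesses = candidates[test_index] if test_index < len(candidates) else []
            if any(guess == truth for guess in guesses):   # either attempt counts
                correct += 1
    return correct / total if total else 0.0
```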
URL: https://ib.bsb.br/arc-prize