Project Page

Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

Eastern Institute of Technology, Ningbo and The Hong Kong Polytechnic University

C3 turns terminal-only feedback in multi-agent LLM collaboration into decision-level credit through fixed-context replay and a leave-one-out baseline.

Status Now on arXiv as 2603.06859

Read the official abstract on arXiv or download the paper here.

Setting 2-agent Reasoner -> Actor collaboration

Only the final actor output is scored, which makes upstream credit assignment the central learning bottleneck.

C3 project logo
Main math result 82.80%

MATH500 greedy accuracy on Qwen3-4B-Instruct-2507.

Pass@10 91.44%

MATH500 pass@10 under matched evaluation resources.

Token efficiency 418M

Training-token Pareto point on the Qwen3 math suite.

Mechanistic signal 0.270

Highest reported credit-fidelity correlation in the paper diagnostics.

Abstract

Why C3 exists

Terminal-only feedback diffuses credit across multi-agent LLM trajectories. C3 freezes transcript-derived context and estimates local causal credit with fixed-context replay plus a leave-one-out baseline, outperforming MAPPO and MAGRPO under matched budgets.

Problem

Terminal-only supervision diffuses reward across the whole trajectory, obscuring which message actually mattered.

Key idea

Turn collaboration into targeted causal interventions over frozen transcript contexts instead of learning a critic for the whole episode.

Outcome

Better terminal performance and cleaner internal optimization signatures under matched evaluator budgets.

Method

Replay the right decision, not the whole episode

Overview of the C3 mechanism

Figure 1 from the paper. C3 records replay states from collaborative trajectories, samples context-matched alternatives, and estimates per-decision marginal value with fixed-continuation replay plus a leave-one-out baseline.

01

Freeze the transcript-derived context

The algorithm logs replay states at visited decision occurrences, preserving the exact observable context of a Reasoner or Actor turn.

02

Evaluate context-matched alternatives

Instead of rolling out fresh full episodes, C3 restarts from the frozen replay state and compares alternatives under the same continuation distribution.

03

Compute low-variance local credit

A leave-one-out baseline removes context-level difficulty, producing unbiased marginal advantages that can feed directly into policy-gradient updates.
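Steps 01–03 can be condensed into a few lines. The function below is an illustrative sketch, not the repository's implementation (the name `loo_advantages` is an assumption): given the terminal returns of K alternatives replayed from the same frozen context, each alternative's baseline is the mean return of the other K-1, so context-level difficulty cancels out of the advantage.

```python
def loo_advantages(returns):
    """Leave-one-out advantages for K alternatives evaluated from the
    same frozen replay state.

    returns: list of terminal rewards, one per alternative action
    sampled under the identical frozen context and continuation.
    """
    k = len(returns)
    if k < 2:
        raise ValueError("need at least two alternatives for a LOO baseline")
    total = sum(returns)
    # Baseline for alternative i is the mean of the other k-1 returns;
    # subtracting it removes shared context-level difficulty while the
    # estimate stays centered: the advantages always sum to zero.
    return [r - (total - r) / (k - 1) for r in returns]
```

Because every alternative shares the frozen context, the subtraction isolates the within-context contribution of the chosen message, which is exactly the quantity a policy-gradient update needs.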

Protocol

Reasoner -> Actor, terminal evaluation only

In the paper's primary setup, two LLM agents collaborate asynchronously. The Reasoner proposes a concise plan, the Actor produces the final answer or code, and only the final Actor output is scored by the external evaluator. That asymmetry is exactly why decision-level credit matters.
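The asymmetry above can be made concrete with a minimal sketch (all names here, `ReplayState`, `run_episode`, and the callables, are hypothetical and not the repository's API): the Reasoner emits a plan that is never scored directly, the Actor produces the final answer, and only that answer reaches the evaluator.

```python
from dataclasses import dataclass

@dataclass
class ReplayState:
    """Frozen observable context at one decision point."""
    prompt: str
    transcript: tuple  # immutable record of upstream messages

def run_episode(task, reasoner, actor, evaluator):
    """Two-agent episode with terminal-only feedback.

    reasoner/actor are callables text -> text; evaluator maps the
    Actor's final output to a scalar reward.  The Reasoner's message
    receives no direct score, which is the credit-assignment problem.
    """
    plan = reasoner(task)                 # upstream message, unscored
    state = ReplayState(prompt=task, transcript=(plan,))
    answer = actor(task + "\n" + plan)    # downstream terminal output
    reward = evaluator(answer)            # the only feedback signal
    return state, answer, reward
```

Logging `state` at each decision point is what lets C3 later restart from the exact same frozen context and compare alternative Reasoner messages fairly.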

Results

Performance gains with matched budgets

Qwen3 math 82.80%

MATH500 greedy accuracy, compared with 74.52% for MAGRPO and 69.28% for MAPPO.

Qwen3 math 91.44%

MATH500 pass@10 under the same evaluation-resource accounting.

Qwen2.5 math 87.01%

GSM8K greedy accuracy with Qwen2.5-3B-Instruct.

Qwen2.5 code 7.98

MBPP+ pass metric with Qwen2.5-Coder-3B-Instruct.

Learning dynamics comparison across methods
Figure 2. C3 reaches a higher return plateau earlier and shows tighter uncertainty bands across seeds.
Training efficiency Pareto frontier
Figure 3. C3 lies on a favorable training-token Pareto frontier, reaching strong returns with 418M training tokens.

Benchmarks

Five mathematical and coding evaluation suites

The paper evaluates C3 on MATH500, CMATH, GSM8K, MBPP-test, and MBPP+ with matched budgets and five random seeds. Mathematical tasks use Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507; coding tasks use Qwen2.5-Coder-3B-Instruct.


Mechanistic Validation

Evidence beyond benchmark accuracy

Credit fidelity 0.270

Spearman correlation with the target within-context advantage in the shared replay-bucket diagnostics.

Within-context variance 0.00513

The LOO baseline reduces variance, stabilizing gradient estimates across tasks.

Inter-agent influence 0.187

Higher conditional mutual information suggests stronger downstream responsiveness to upstream reasoning interventions.
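The credit-fidelity number is a Spearman rank correlation between estimated and target within-context advantages. For readers reproducing that diagnostic without SciPy, here is a minimal tie-free sketch (an illustration under the no-ties assumption, not the paper's diagnostic code):

```python
def spearman(x, y):
    """Spearman rank correlation for tie-free samples of equal length."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    # Tie-free closed form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because it compares ranks rather than raw values, the metric rewards an estimator that orders decisions correctly even when its advantage magnitudes are miscalibrated.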

Ablations support the causal story

Removing fixed-context replay lowers terminal accuracy from 90.7% to 86.5%, and removing the LOO baseline reduces the influence metric from 0.187 to 0.149. Both components matter: context matching sharpens the comparison, while LOO removes context-level shifts that would otherwise blur collaboration credit.

Reproduce

Built to connect the paper to runnable code

Quickstart

python -m pip install -r requirements.txt --no-build-isolation
bash scripts/data/prepare_all.sh --out_dir data
bash scripts/reproduce/smoke.sh --task math --limit 1 --print_example 0

Core paths

  • c3/credit/c3/ for credit assignment logic
  • openrlhf/trainer/ppo_utils/experience_maker.py for paper-facing integration
  • configs/main_results_registry.yaml for experiment references
  • scripts/reproduce/ for smoke tests, training, and analysis

Citation

Use C3 in your work

If C3 or this repository helps your research, please cite the official arXiv paper below.

@misc{chen2026contextualcounterfactualcreditassignment,
  title={Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration},
  author={Yanjun Chen and Yirong Sun and Hanlin Wang and Xinming Zhang and Xiaoyu Shen and Wenjie Li and Wei Zhang},
  year={2026},
  eprint={2603.06859},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.06859}
}