Abstract
Why C3 exists
Terminal-only feedback diffuses credit across multi-agent LLM trajectories. C3 freezes transcript-derived context and estimates local causal credit with fixed-context replay plus a leave-one-out baseline, outperforming MAPPO and MAGRPO under matched budgets.
Problem
Terminal-only supervision diffuses reward across the whole trajectory, obscuring which message actually mattered.
Key idea
Turn collaboration into targeted causal interventions over frozen transcript contexts instead of learning a critic for the whole episode.
Outcome
Better terminal performance and cleaner internal optimization signatures under matched evaluator budgets.
Method
Replay the right decision, not the whole episode
Figure 1 from the paper. C3 records replay states from collaborative trajectories, samples context-matched alternatives, and estimates per-decision marginal value with fixed-continuation replay plus a leave-one-out baseline.
Freeze the transcript-derived context
The algorithm logs replay states at visited decision occurrences, preserving the exact observable context of a Reasoner or Actor turn.
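The logging step above can be sketched as follows. `ReplayState` and `log_replay_states` are illustrative names, not the repository's actual API; the point is that each record freezes the exact observable context of one decision occurrence so alternatives can later be compared under identical conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayState:
    """Hypothetical record of one decision occurrence in a transcript."""
    episode_id: str
    turn_index: int       # which decision occurrence this is
    agent_role: str       # e.g. "reasoner" or "actor"
    context: str          # exact transcript-derived context at this turn
    chosen_message: str   # the message the policy actually emitted


def log_replay_states(transcript, episode_id="ep-0"):
    """Collect a frozen ReplayState for every visited decision occurrence."""
    return [
        ReplayState(episode_id=episode_id, turn_index=i,
                    agent_role=role, context=context, chosen_message=message)
        for i, (role, context, message) in enumerate(transcript)
    ]
```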
Evaluate context-matched alternatives
Instead of rolling out fresh full episodes, C3 restarts from the frozen replay state and compares alternatives under the same continuation distribution.
Compute low-variance local credit
A leave-one-out baseline removes context-level difficulty, producing unbiased marginal advantages that can feed directly into policy-gradient updates.
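Concretely, given terminal scores for K context-matched alternatives sampled at the same frozen state, the leave-one-out advantage of each alternative is its score minus the mean of the other K-1. A minimal sketch (the function name is hypothetical):

```python
def loo_advantages(returns):
    """Leave-one-out advantages for K >= 2 alternative scores.

    Each alternative is compared against the mean of the others sampled
    from the same frozen context, so context-level difficulty cancels.
    """
    k = len(returns)
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]
```

A useful property of this baseline is that the advantages always sum to zero within a replay bucket, so the shared context-level offset cannot leak into the policy-gradient signal.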
Protocol
Reasoner -> Actor, terminal evaluation only
In the paper's primary setup, two LLM agents collaborate asynchronously. The Reasoner proposes a concise plan, the Actor produces the final answer or code, and only the final Actor output is scored by the external evaluator. That asymmetry is exactly why decision-level credit matters.
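The protocol above can be sketched as a two-call loop; the callables below stand in for the paper's LLM agents and external evaluator, and the names are illustrative:

```python
def run_episode(reasoner, actor, task, evaluator):
    """Sketch of the Reasoner -> Actor protocol with terminal-only reward.

    `reasoner` and `actor` stand in for LLM agents; `evaluator` scores
    only the final Actor output.
    """
    plan = reasoner(task)        # concise plan; never scored directly
    answer = actor(task, plan)   # final answer or code
    reward = evaluator(answer)   # single terminal signal
    # This one scalar must supervise both turns -- the credit-assignment gap.
    return plan, answer, reward
```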
Results
Performance gains with matched budgets
- MATH500 greedy accuracy, compared with 74.52 for MAGRPO and 69.28 for MAPPO.
- MATH500 pass@10 under the same evaluation-resource accounting.
- GSM8K greedy accuracy with Qwen2.5-3B-Instruct.
- MBPP+ pass metric with Qwen2.5-Coder-3B-Instruct.
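The pass@10 metric above is presumably computed with the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct; the source does not state this, so the convention is an assumption. A sketch:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn, c: samples that pass, k: budget per problem.
    If fewer than k samples fail, every k-subset contains a pass.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```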
Benchmarks
Five mathematical and coding evaluation suites
The paper evaluates C3 on MATH500, CMATH, GSM8K, MBPP-test, and MBPP+ with matched budgets and five random seeds. Mathematical tasks use Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507; coding tasks use Qwen2.5-Coder-3B-Instruct.
Mechanistic Validation
Evidence beyond benchmark accuracy
- Spearman correlation to the target within-context advantage in the shared replay-bucket diagnostics.
- The LOO baseline reduces variance, stabilizing gradient estimates across tasks.
- Higher conditional mutual information suggests stronger downstream responsiveness to upstream reasoning interventions.
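The variance-reduction claim can be illustrated on synthetic scores: a context-level difficulty shift shared within a replay bucket dominates the variance of raw returns, but cancels in leave-one-out advantages. All names and numbers below are illustrative, not from the paper:

```python
import random
import statistics


def loo_advantages(returns):
    """Each score minus the mean of the other scores in its bucket."""
    k = len(returns)
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


random.seed(0)
raw, centered = [], []
for _ in range(200):
    difficulty = random.gauss(0, 5)  # context-level shift, shared in a bucket
    scores = [difficulty + random.gauss(0, 1) for _ in range(4)]
    raw.extend(scores)
    centered.extend(loo_advantages(scores))

# Context-level variance dominates raw scores; LOO cancels it.
print(statistics.pvariance(raw), statistics.pvariance(centered))
```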
Ablations support the causal story
Removing fixed-context replay lowers terminal accuracy from 90.7% to 86.5%, and removing the LOO baseline reduces the influence metric from 0.187 to 0.149. Both components matter: context matching sharpens the comparison, while LOO removes context-level shifts that would otherwise blur collaboration credit.
Reproduce
Built to connect the paper to runnable code
Quickstart
python -m pip install -r requirements.txt --no-build-isolation
bash scripts/data/prepare_all.sh --out_dir data
bash scripts/reproduce/smoke.sh --task math --limit 1 --print_example 0
Core paths
- c3/credit/c3/ for credit assignment logic
- openrlhf/trainer/ppo_utils/experience_maker.py for paper-facing integration
- configs/main_results_registry.yaml for experiment references
- scripts/reproduce/ for smoke tests, training, and analysis
Release hygiene
bash scripts/audit/release_gate.sh
bash scripts/reproduce/preflight_repro.sh --task math
- Release checklist
- Paper-to-code audit
Citation
Use C3 in your work
If C3 or this repository helps your research, please cite the official arXiv paper below.
@misc{chen2026contextualcounterfactualcreditassignment,
title={Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration},
author={Yanjun Chen and Yirong Sun and Hanlin Wang and Xinming Zhang and Xiaoyu Shen and Wenjie Li and Wei Zhang},
year={2026},
eprint={2603.06859},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.06859}
}