Project Page

Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

Eastern Institute of Technology, Ningbo and The Hong Kong Polytechnic University

C3 turns terminal-only feedback in multi-agent LLM collaboration into decision-level credit through fixed-context replay and a leave-one-out baseline.

Status Now on arXiv as 2603.06859

Read the official abstract on arXiv or download the paper here.

Setting 2-agent Reasoner -> Actor collaboration

Only the final actor output is scored, which makes upstream credit assignment the central learning bottleneck.

C3 project logo
Main math result 82.80%

MATH500 greedy accuracy on Qwen3-4B-Instruct-2507.

Pass@10 91.44%

MATH500 pass@10 under matched evaluation resources.

Token efficiency 418M

Training-token Pareto point on the Qwen3 math suite.

Mechanistic signal 0.270

Highest reported credit-fidelity correlation in the paper diagnostics.

Abstract

Why C3 exists

Terminal-only feedback diffuses credit across multi-agent LLM trajectories. C3 freezes transcript-derived context and estimates local causal credit with fixed-context replay plus a leave-one-out baseline, outperforming MAPPO and MAGRPO under matched budgets.

Problem

Terminal-only supervision diffuses reward across the whole trajectory, obscuring which message actually mattered.

Key idea

Turn collaboration into targeted causal interventions over frozen transcript contexts instead of learning a critic for the whole episode.

Outcome

Better terminal performance and cleaner internal optimization signatures under matched evaluator budgets.

Method

Replay the right decision, not the whole episode

Overview of the C3 mechanism

Figure 1 from the paper. C3 records replay states from collaborative trajectories, samples context-matched alternatives, and estimates per-decision marginal value with fixed-continuation replay plus a leave-one-out baseline.

01

Freeze the transcript-derived context

The algorithm logs replay states at visited decision occurrences, preserving the exact observable context of a Reasoner or Actor turn.

02

Evaluate context-matched alternatives

Instead of rolling out fresh full episodes, C3 restarts from the frozen replay state and compares alternatives under the same continuation distribution.

03

Compute low-variance local credit

A leave-one-out baseline removes context-level difficulty, producing unbiased marginal advantages that can feed directly into policy-gradient updates.
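Steps 01–03 can be condensed into a few lines. The function below is an illustrative sketch, not the repository's implementation (the name `loo_advantages` is an assumption): given the terminal returns of K alternatives replayed from the same frozen context, each alternative's baseline is the mean return of the other K-1, so context-level difficulty cancels out of the advantage.

```python
def loo_advantages(returns):
    """Leave-one-out advantages for K alternatives evaluated from the
    same frozen replay state.

    returns: list of terminal rewards, one per alternative action
    sampled under the identical frozen context and continuation.
    """
    k = len(returns)
    if k < 2:
        raise ValueError("need at least two alternatives for a LOO baseline")
    total = sum(returns)
    # Baseline for alternative i is the mean of the other k-1 returns;
    # subtracting it removes shared context-level difficulty while the
    # estimate stays centered: the advantages always sum to zero.
    return [r - (total - r) / (k - 1) for r in returns]
```

Because every alternative shares the frozen context, the subtraction isolates the within-context contribution of the chosen message, which is exactly the quantity a policy-gradient update needs.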

Protocol

Reasoner -> Actor, terminal evaluation only

In the paper's primary setup, two LLM agents collaborate asynchronously. The Reasoner proposes a concise plan, the Actor produces the final answer or code, and only the final Actor output is scored by the external evaluator. That asymmetry is exactly why decision-level credit matters.
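The asymmetry above can be made concrete with a minimal sketch (all names here, `ReplayState`, `run_episode`, and the callables, are hypothetical and not the repository's API): the Reasoner emits a plan that is never scored directly, the Actor produces the final answer, and only that answer reaches the evaluator.

```python
from dataclasses import dataclass

@dataclass
class ReplayState:
    """Frozen observable context at one decision point."""
    prompt: str
    transcript: tuple  # immutable record of upstream messages

def run_episode(task, reasoner, actor, evaluator):
    """Two-agent episode with terminal-only feedback.

    reasoner/actor are callables text -> text; evaluator maps the
    Actor's final output to a scalar reward.  The Reasoner's message
    receives no direct score, which is the credit-assignment problem.
    """
    plan = reasoner(task)                 # upstream message, unscored
    state = ReplayState(prompt=task, transcript=(plan,))
    answer = actor(task + "\n" + plan)    # downstream terminal output
    reward = evaluator(answer)            # the only feedback signal
    return state, answer, reward
```

Logging `state` at each decision point is what lets C3 later restart from the exact same frozen context and compare alternative Reasoner messages fairly.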

Results

Performance gains with matched budgets

Qwen3 math 82.80%

MATH500 greedy accuracy, compared with 74.52% for MAGRPO and 69.28% for MAPPO.

Qwen3 math 91.44%

MATH500 pass@10 under the same evaluation-resource accounting.

Qwen2.5 math 87.01%

GSM8K greedy accuracy with Qwen2.5-3B-Instruct.

Qwen2.5 code 7.98

MBPP+ pass metric with Qwen2.5-Coder-3B-Instruct.

Learning dynamics comparison across methods
Figure 2. C3 reaches a higher return plateau earlier and shows tighter uncertainty bands across seeds.
Training efficiency Pareto frontier
Figure 3. C3 lies on a favorable training-token Pareto frontier, reaching strong returns with 418M training tokens.

Benchmarks

Five mathematical and coding evaluation suites

The paper evaluates C3 on MATH500, CMATH, GSM8K, MBPP-test, and MBPP+ with matched budgets and five random seeds. Mathematical tasks use Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507; coding tasks use Qwen2.5-Coder-3B-Instruct.


Mechanistic Validation

Evidence beyond benchmark accuracy

Credit fidelity 0.270

Spearman correlation with the target within-context advantage in the shared replay-bucket diagnostics.

Within-context variance 0.00513

The LOO baseline reduces variance, stabilizing gradient estimates across tasks.

Inter-agent influence 0.187

Higher conditional mutual information suggests stronger downstream responsiveness to upstream reasoning interventions.
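The credit-fidelity number is a Spearman rank correlation between estimated and target within-context advantages. For readers reproducing that diagnostic without SciPy, here is a minimal tie-free sketch (an illustration under the no-ties assumption, not the paper's diagnostic code):

```python
def spearman(x, y):
    """Spearman rank correlation for tie-free samples of equal length."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    # Tie-free closed form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because it compares ranks rather than raw values, the metric rewards an estimator that orders decisions correctly even when its advantage magnitudes are miscalibrated.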

Ablations support the causal story

Removing fixed-context replay lowers terminal accuracy from 90.7% to 86.5%, and removing the LOO baseline reduces the influence metric from 0.187 to 0.149. Both components matter: context matching sharpens the comparison, while LOO removes context-level shifts that would otherwise blur collaboration credit.

Reproduce

Built to connect the paper to runnable code

Quickstart

python -m pip install -r requirements.txt --no-build-isolation
bash scripts/data/prepare_all.sh --out_dir data
bash scripts/reproduce/smoke.sh --task math --limit 1 --print_example 0

Core paths

  • c3/credit/c3/ for credit assignment logic
  • openrlhf/trainer/ppo_utils/experience_maker.py for paper-facing integration
  • configs/main_results_registry.yaml for experiment references
  • scripts/reproduce/ for smoke tests, training, and analysis

Citation

Use C3 in your work

If C3 or this repository helps your research, please cite the official arXiv paper below.

@misc{chen2026contextualcounterfactualcreditassignment,
  title={Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration},
  author={Yanjun Chen and Yirong Sun and Hanlin Wang and Xinming Zhang and Xiaoyu Shen and Wenjie Li and Wei Zhang},
  year={2026},
  eprint={2603.06859},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.06859}
}