Overview
Large Language Model (LLM) multi-agent systems are gaining traction for tackling complex tasks through collaborative workflows. Yet failures are common, and pinpointing which agent failed, and at which step, is notoriously difficult. Manually trawling through thousands of interaction logs is like finding a needle in a haystack, slowing down debugging and optimization.

To solve this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University, introduced the problem of automated failure attribution. They built the first dedicated benchmark dataset, Who&When, and developed several attribution methods. This work was accepted as a Spotlight presentation at ICML 2025. The code and dataset are fully open-source.
This guide walks you through the core concepts, prerequisites, and practical steps to implement automated failure attribution using the Who&When dataset and proposed techniques. By the end, you'll understand how to programmatically determine which agent caused a failure and when it happened, dramatically reducing manual debugging effort.
Prerequisites
- Python 3.8+ installed on your system
- Basic knowledge of LLMs and multi-agent systems (e.g., how agents communicate via structured prompts)
- Familiarity with Hugging Face datasets and common ML libraries (torch, transformers)
- Access to a GPU (recommended) for running baseline attribution methods efficiently
- Git to clone the repository
Step-by-Step Instructions
Step 1: Understand the Who&When Dataset
The Who&When dataset (hosted on Hugging Face) contains multi-agent interaction logs. Each log includes a sequence of agent messages, a ground-truth label for the failure-responsible agent (the "who"), and the step at which the failure occurred (the "when"). Tasks range from reasoning to code generation, with failures caused by single-agent errors, miscommunication, or information-cascade breakdowns.
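To make that structure concrete, here is a small hand-written record in the shape this guide assumes (illustrative only; the field names match the schema used in Step 3, and the arithmetic mistake is deliberate):

```python
# Hypothetical Who&When-style log record (illustrative; see Step 3 for real data).
sample_log = {
    "messages": [
        {"agent_id": 0, "content": "Task: compute 17 * 24.", "timestamp": 0},
        {"agent_id": 1, "content": "17 * 24 = 398", "timestamp": 1},  # wrong: 408
        {"agent_id": 2, "content": "Final answer: 398", "timestamp": 2},
    ],
    "failure_agent": 1,  # the "who": agent 1 introduced the error
    "failure_step": 1,   # the "when": the error first appears at step 1
}
```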
Step 2: Set Up the Environment
- Clone the official repository:

```bash
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution
```

- Create a virtual environment and install dependencies:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Download the dataset (if not automatically loaded):

```bash
python download_dataset.py
```
Step 3: Load and Explore the Data
Use the Hugging Face datasets library to load the dataset:
```python
from datasets import load_dataset

dataset = load_dataset("Kevin355/Who_and_When")
print(dataset)

# View a sample interaction
sample = dataset['train'][0]
print(sample['messages'])       # List of agent utterances
print(sample['failure_agent'])  # Index of the agent that caused the failure
print(sample['failure_step'])   # Step at which the failure occurred
```
Each sample has three key fields:
- messages: list of dicts with agent_id, content, and timestamp (or sequential order)
- failure_agent: integer index of the agent that caused the failure
- failure_step: integer step (message index) where the failure occurred
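As a quick sanity check on these fields (a minimal sketch assuming the schema above), you can pretty-print a sample transcript and mark the ground-truth failure step:

```python
def print_transcript(sample):
    """Print each message with its agent, flagging the labeled failure step."""
    for step, msg in enumerate(sample['messages']):
        marker = "  <-- labeled failure" if step == sample['failure_step'] else ""
        print(f"[step {step}] agent {msg['agent_id']}: {msg['content']}{marker}")

print_transcript(dataset['train'][0])
```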
Step 4: Implement a Baseline Attribution Method
The paper proposes several attribution methods. We'll implement the simplest, called Trace-to-Failure here: scan the conversation backwards for the last agent message that directly contains the final erroneous output, and blame that agent and step.
```python
def trace_to_failure(messages, final_error):
    """Heuristic: blame the last agent whose message contains the error string."""
    # Walk backwards by index so duplicate messages resolve to the latest occurrence.
    for step in range(len(messages) - 1, -1, -1):
        if final_error in messages[step]['content']:
            return messages[step]['agent_id'], step
    return None, None
```
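For example, on a single test sample (this assumes the final_error field that Step 5 also relies on):

```python
sample = dataset['test'][0]
who, when = trace_to_failure(sample['messages'], sample['final_error'])
print(f"Predicted failing agent: {who}, predicted failure step: {when}")
```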
More sophisticated methods (e.g., counterfactual reasoning, causal graphs) are in the repository under methods/.
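To give a flavor of the counterfactual idea, here is a minimal sketch under strong assumptions, not the repository's implementation; rerun_fn is a hypothetical callable you would supply to replay the task with one agent's messages removed and report whether it then succeeds:

```python
def counterfactual_attribution(messages, agent_ids, rerun_fn):
    """Blame the first agent whose removal flips the task outcome to success.

    rerun_fn is hypothetical: it must replay the task from the ablated
    transcript and return True if the task now succeeds.
    """
    for agent in agent_ids:
        ablated = [m for m in messages if m['agent_id'] != agent]
        if rerun_fn(ablated):  # removing this agent fixes the run
            return agent
    return None  # no single-agent ablation changed the outcome
```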
Step 5: Evaluate Against Ground Truth
Run the baseline on a subset and compute accuracy:
```python
correct_who = 0
correct_when = 0
total = 0

for sample in dataset['test']:
    pred_agent, pred_step = trace_to_failure(sample['messages'], sample['final_error'])
    if pred_agent == sample['failure_agent']:
        correct_who += 1
    if pred_step == sample['failure_step']:
        correct_when += 1
    total += 1

print(f"Who Accuracy: {correct_who/total:.2%}")
print(f"When Accuracy: {correct_when/total:.2%}")
```
Step 6: Visualize Results
Create a confusion matrix for agent attribution and a histogram of step errors. The code includes plotting utilities:
```python
from utils.visualization import plot_confusion_matrix

# agent_names: list of display labels, one per agent index
plot_confusion_matrix(predictions, ground_truth, labels=agent_names)
```
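If those utilities are unavailable in your environment, the same confusion matrix can be drawn with standard scikit-learn and matplotlib APIs:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Uses the predictions/ground_truth lists collected in Step 5.
# (Filter out None predictions first if the heuristic found no match.)
ConfusionMatrixDisplay.from_predictions(ground_truth, predictions)
plt.title("Failing-agent attribution: true vs. predicted")
plt.show()
```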
Common Mistakes
- Ignoring information chains: A failure may propagate from an earlier step; only blaming the last agent leads to false attribution.
- Assuming single cause: Some failures stem from interaction between multiple agents. The dataset currently labels only one agent per sample, but real systems may have combined faults.
- Not normalizing timestamps: Ensure message order is consistent. Some logs have non-sequential timestamps; always sort by timestamp before analysis (see the snippet after this list).
- Overfitting to simple heuristics: The baseline methods may perform well only on simple cases. For robust attribution, use the proposed methods (e.g., causal_graph_attribution) from the repository.
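For the timestamp pitfall above, a minimal normalization pass (assuming the message schema from Step 3):

```python
def normalize_order(messages):
    """Return messages sorted by timestamp so step indices are meaningful."""
    return sorted(messages, key=lambda m: m['timestamp'])
```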
Summary
Automated failure attribution is a crucial step toward reliable LLM multi-agent systems. By using the Who&When dataset and the open-source tools, you can now systematically identify which agent caused a failure and at what point in the interaction, replacing manual log archaeology with a reproducible, data-driven approach. Start experimenting with the provided code and adapt the attribution methods to your own multi-agent architectures.