COS-PLAY: Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Game Play

1University of Maryland, College Park, 2University of Southern California, 3Good Start Labs, 4Independent Researcher, 5Mohamed bin Zayed University of Artificial Intelligence
COS-PLAY teaser figure

Overview of COS-PLAY. The decision agent (orange) retrieves skills, updates intentions, and selects actions. After each episode, the skill bank agent (red) segments trajectories, learns contracts, and curates the skill bank (purple) via refinement, merging, splitting, or retirement.

Abstract

Long-horizon interactive environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games provide a natural, controllable testbed for evaluating these skill-usage abilities.

Large Language Models (LLMs) are promising game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes.

We present COS-PLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to populate that bank. The framework trains the decision agent to improve skill retrieval and action generation, while the skill-bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COS-PLAY with an 8B base model achieves over 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.

Method Overview

We propose a co-evolving multi-agent framework for long-horizon video-game decision-making via unsupervised trajectory decomposition and skill-bank refinement. The framework has two components:

(a) Decision Agent AD: an LLM-based agent that interacts with the game through primitive actions and skill retrieval. At each step, it summarizes the current state, retrieves relevant skill candidates from the skill bank, updates its intention, selects or switches skills when needed, and executes an action.
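The per-step loop described above (summarize, retrieve, update intention, act) can be sketched as follows. This is a toy illustration, not the paper's implementation: `Skill`, `DecisionAgent`, and the overlap-based retrieval score are all hypothetical stand-ins for LLM calls.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    contract: set   # effect predicates the skill reliably achieves
    plan: list      # primitive-action protocol

@dataclass
class DecisionAgent:
    skill_bank: list
    active_skill: Skill = None
    intention: str = "explore"

    def retrieve(self, state_summary, k=3):
        # Rank skills by overlap between their effect contracts and the
        # predicates in the state summary (toy similarity, in place of
        # LLM-based retrieval).
        def score(skill):
            return sum(p in state_summary for p in skill.contract)
        return sorted(self.skill_bank, key=score, reverse=True)[:k]

    def step(self, observation):
        summary = observation            # stand-in for LLM state summarization
        candidates = self.retrieve(summary)
        # Switch skills when the active one is no longer among the candidates.
        if candidates and (self.active_skill is None
                           or self.active_skill not in candidates):
            self.active_skill = candidates[0]
            self.intention = f"execute {self.active_skill.name}"
        # Execute the next primitive action from the active skill's plan.
        return self.active_skill.plan[0] if self.active_skill else "noop"
```

In the actual framework, the summary, retrieval ranking, and action choice are each produced by the LLM; here they are reduced to set operations so the control flow is visible.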

(b) Skill Bank Agent AS: an LLM-based pipeline that converts unlabeled trajectories into reusable protocol-based skills and learns compact effect contracts for them. It updates the skill bank by proposing new skill candidates, refining low-quality skills, and revising skill protocols over time.

Together, the decision agent generates trajectories, and the skill bank agent transforms them into structured skills that support future decisions through skill retrieval and selection. Both agents are updated via GRPO using separate LoRA adapters in a unified co-evolution loop.

Framework Components

 Decision Agent

Maintains an active skill and current intention. Retrieves skills from the bank, updates a short natural-language intention tag, and executes primitive actions conditioned on the observation, intention, and active skill plan. Trained with two LoRA adapters: skill retrieval and action taking.

 Skill Bank Agent

Converts unlabeled trajectories into reusable skills via a four-stage pipeline: boundary proposal, infer segmentation, contract learning, and skill bank maintenance. Trained with three LoRA adapters: segmentation, contract, and curator.

 Co-Evolution

Better skills improve decision making, and better rollouts improve skill learning. Both agents are jointly updated via GRPO in a closed loop: the decision agent uses the current skill bank, while the skill bank agent converts newly collected rollouts into better reusable skills.
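The closed loop can be sketched at a high level. All class and method names below are hypothetical; GRPO updates on the LoRA adapters are abstracted as `update_adapters` callbacks, and the stubs exist only to make the loop executable.

```python
class StubDecisionAgent:
    def collect_rollouts(self, env, bank):
        # Rollouts are conditioned on the current skill bank.
        return [f"trajectory_with_{len(bank)}_skills"]
    def update_adapters(self, trajectories):
        pass  # stand-in for GRPO on the retrieval/action LoRA adapters

class StubSkillBankAgent:
    def update_bank(self, trajectories, bank):
        # Toy behavior: discover one new skill per round.
        return bank + [f"skill_{len(bank)}"]
    def update_adapters(self, trajectories):
        pass  # stand-in for GRPO on the segmentation/contract/curator adapters

def co_evolution_loop(decision_agent, skill_bank_agent, env, n_rounds=3):
    """Alternate rollout collection with skill-bank refinement."""
    skill_bank, bank_sizes = [], []
    for _ in range(n_rounds):
        # 1) The decision agent plays with the current skill bank.
        trajectories = decision_agent.collect_rollouts(env, skill_bank)
        # 2) The skill bank agent mines the fresh rollouts for skills.
        skill_bank = skill_bank_agent.update_bank(trajectories, skill_bank)
        # 3) Both agents are trained on the new data.
        decision_agent.update_adapters(trajectories)
        skill_bank_agent.update_adapters(trajectories)
        bank_sizes.append(len(skill_bank))
    return skill_bank, bank_sizes
```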

Skill Bank Pipeline

The skill bank agent analyzes trajectories through a four-stage pipeline to extract reusable skills and maintain the skill bank over time:

 Boundary Proposal

A heuristic candidate-generation step that identifies plausible skill-transition points using predicate flips, intention-tag changes, reward spikes, and transitions between primitive-action execution and skill-selection steps.
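Two of these heuristics, predicate flips and reward spikes, can be sketched directly; the function below is an illustrative simplification, not the paper's exact criteria, and the trajectory encoding is assumed.

```python
def propose_boundaries(trajectory, reward_spike=1.0):
    """Heuristic boundary candidates: timesteps where a state predicate
    flips or the reward jumps by more than `reward_spike`.

    `trajectory` is a list of (predicates: frozenset, reward: float) pairs.
    """
    boundaries = set()
    for t in range(1, len(trajectory)):
        prev_preds, prev_r = trajectory[t - 1]
        preds, r = trajectory[t]
        if preds != prev_preds:              # predicate flip
            boundaries.add(t)
        if abs(r - prev_r) > reward_spike:   # reward spike
            boundaries.add(t)
    return sorted(boundaries)
```

Intention-tag changes and primitive-vs-skill transitions would add two more conditions of the same shape.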

 Infer Segmentation

Selects the boundary subset that best explains the trajectory as a sequence of skill segments. Each segment's observed effects are scored against the bank's skill contracts and re-ranked by the skill bank agent.
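A minimal sketch of this selection, assuming small candidate sets so brute-force enumeration suffices. The Jaccard scoring and `effect_at` interface are hypothetical stand-ins for the contract matching and LLM re-ranking described above.

```python
from itertools import combinations

def segment_score(effects, contracts):
    """Best Jaccard similarity between a segment's observed effects
    and any skill contract in the bank."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return max((jaccard(effects, c) for c in contracts), default=0.0)

def infer_segmentation(n_steps, candidates, effect_at, contracts):
    """Pick the boundary subset whose segments best match the bank's
    contracts. `effect_at(i, j)` returns the predicates observed to
    change between steps i and j. Brute force over subsets, purely
    for illustration."""
    best, best_score = (), float("-inf")
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            cuts = [0, *subset, n_steps]
            segs = list(zip(cuts, cuts[1:]))
            score = sum(segment_score(effect_at(i, j), contracts)
                        for i, j in segs) / len(segs)
            if score > best_score:
                best, best_score = subset, score
    return list(best), best_score
```

Because ties keep the earlier (smaller) subset, the sketch prefers fewer cuts when extra boundaries do not improve contract fit.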

 Contract Learning

Aggregates added and deleted predicates across segments to learn effect contracts—compact specifications of reliable state changes. Only verified contracts with sufficiently high pass rates are written back into the bank.
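A toy version of this aggregation, assuming each segment is summarized as (added, deleted) predicate sets; the pass-rate threshold and contract fields are illustrative, not the paper's exact specification.

```python
from collections import Counter

def learn_contract(segments, min_pass_rate=0.8):
    """Learn an effect contract from repeated executions of one skill.

    Each segment is an (added: set, deleted: set) predicate diff; only
    predicates occurring in at least `min_pass_rate` of the segments
    are written into the contract.
    """
    n = len(segments)
    added_counts, deleted_counts = Counter(), Counter()
    for added, deleted in segments:
        added_counts.update(added)
        deleted_counts.update(deleted)
    adds = {p for p, c in added_counts.items() if c / n >= min_pass_rate}
    dels = {p for p, c in deleted_counts.items() if c / n >= min_pass_rate}
    return {"adds": adds, "deletes": dels, "pass_rate": min_pass_rate}
```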

 Skill Bank Maintenance

Five operations keep the bank compact and evolving: refine updates contracts, materialize promotes new skills, merge removes duplicates, split marks broad skills for re-segmentation, and retire removes unused skills.
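A toy curator pass covering three of the five operations (retire, merge, split); refine and materialize would act analogously on contracts and candidate skills. The rule thresholds and data layout are assumptions for illustration only.

```python
def curate_bank(bank, usage_counts, success_rates):
    """One hypothetical curation pass.

    `bank` maps skill name -> effect contract (frozenset of predicates);
    unused skills are retired, duplicate contracts are merged into the
    first skill seen, and unreliable skills are marked for splitting.
    """
    ops = []
    seen = {}                                # contract -> canonical skill
    for name in list(bank):                  # copy: we delete while iterating
        contract = bank[name]
        if usage_counts.get(name, 0) == 0:
            ops.append(("retire", name)); del bank[name]
        elif contract in seen:               # duplicate contract
            ops.append(("merge", name, seen[contract])); del bank[name]
        elif success_rates.get(name, 1.0) < 0.5:
            ops.append(("split", name))      # mark broad/unreliable skill
        else:
            seen[contract] = name
    return bank, ops
```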

Key Highlights

25.1%

Average Improvement

6

Game Environments

8B

Base Model (Qwen3-8B)

5

LoRA Adapters

Evaluated across six game environments: 2048, Tetris, Candy Crush, Super Mario Bros (single-player), Avalon, and Diplomacy (multi-player). Built on Qwen3-8B, COS-PLAY achieves over 25.1% average reward improvement against four frontier LLM baselines (GPT-5.4, Gemini-3.1-Pro, Claude-4.6-Sonnet, GPT-OSS-120B) on single-player game benchmarks while remaining competitive on multi-player social reasoning games.

Experimental Results

Performance Across Game Categories

We report reward for single-player games, overall win rate for Avalon, and overall mean supply centers for Diplomacy. All results are reported with 95% confidence intervals, based on 16 evaluation rollouts for single-player games and 10 rollouts per player for multi-player games.

| Model | 2048 Reward ↑ | Tetris Reward ↑ | Candy Crush Reward ↑ | Super Mario Reward ↑ | Avg. Reward ↑ | Avalon Win Rate ↑ | Diplomacy Mean SC ↑ |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 1126.6 ± 150.2 | 458.2 ± 203.5 | 532.6 ± 24.8 | 752.0 ± 35.7 | 717.4 | 65.0 ± 14.2 | 4.70 ± 0.35 |
| Gemini-3.1-Pro | 813.3 ± 143.6 | 372.7 ± 157.7 | 334.3 ± 59.4 | 436.8 ± 86.1 | 489.3 | 42.0 ± 13.2 | 2.72 ± 0.26 |
| Claude-4.6-Sonnet | 945.0 ± 134.5 | 444.2 ± 182.6 | 328.6 ± 23.8 | 399.5 ± 53.4 | 529.3 | 40.0 ± 13.1 | 3.16 ± 0.19 |
| GPT-OSS-120B | 1029.5 ± 122.0 | 358.1 ± 139.7 | 334.4 ± 40.5 | 968.5 ± 175.0 | 672.6 | 40.0 ± 13.1 | 2.46 ± 0.25 |
| Qwen3-8B | 131.0 ± 102.6 | 32.0 ± 8.5 | 519.9 ± 37.8 | 835.5 ± 161.6 | 379.6 | 30.0 ± 9.9 | 2.64 ± 0.18 |
| SFT w/o Skill | 516.7 ± 172.3 | 28.2 ± 9.7 | 356.1 ± 30.1 | 736.8 ± 130.4 | 409.5 | 28.7 ± 9.7 | 2.75 ± 0.13 |
| SFT + 1st Skill | 385.5 ± 239.7 | 35.8 ± 7.0 | 569.6 ± 29.5 | 871.9 ± 126.2 | 465.7 | 21.2 ± 8.9 | 2.89 ± 0.20 |
| SFT + Final Skill | 64.8 ± 46.0 | 24.4 ± 7.0 | 554.4 ± 24.3 | 794.4 ± 112.9 | 359.5 | 25.0 ± 9.3 | 2.65 ± 0.25 |
| GRPO w/o Skill | 510.0 ± 249.5 | 96.7 ± 30.3 | 163.3 ± 71.4 | 669.4 ± 130.1 | 359.9 | 36.2 ± 10.3 | 2.76 ± 0.19 |
| GRPO + 1st Skill | 152.0 ± 107.9 | 93.7 ± 37.7 | 353.8 ± 20.2 | 621.2 ± 130.2 | 305.2 | 31.2 ± 10.0 | 2.56 ± 0.22 |
| COS-PLAY (Qwen3-8B) | 1589.0 ± 192.4 | 510.9 ± 199.5 | 648.8 ± 38.8 | 948.9 ± 153.2 | 924.4 | 39.0 ± 9.4 | 2.96 ± 0.20 |

SFT and GRPO rows are ablations built on the Qwen3-8B base model.

Per-Role Win Rates for Avalon

Win rates with 95% confidence intervals for Good-side roles (Merlin, Servant) and Evil-side roles (Assassin, Minion), evaluated against GPT-5.4 opponents.

| Model | Merlin ↑ | Servant ↑ | Good Avg. ↑ | Assassin ↑ | Minion ↑ | Evil Avg. ↑ | Avg. Win Rate ↑ |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 62.5 ± 27.9 | 47.4 ± 20.5 | 51.9 ± 17.7 | 100.0 ± 19.5 | 85.7 ± 24.4 | 92.3 ± 15.9 | 65.0 ± 14.2 |
| Gemini-3.1-Pro | 10.0 ± 19.3 | 36.4 ± 18.7 | 28.1 ± 14.9 | 66.7 ± 30.2 | 66.7 ± 23.6 | 66.7 ± 20.0 | 42.0 ± 13.2 |
| Claude-4.6-Sonnet | 30.0 ± 24.8 | 40.9 ± 19.0 | 37.5 ± 15.9 | 33.3 ± 30.2 | 50.0 ± 24.6 | 44.4 ± 20.9 | 40.0 ± 13.1 |
| GPT-OSS-120B | 30.0 ± 24.8 | 31.8 ± 18.2 | 31.2 ± 15.3 | 50.0 ± 31.2 | 58.3 ± 24.4 | 55.6 ± 20.9 | 40.0 ± 13.1 |
| Qwen3-8B | 18.8 ± 18.2 | 18.4 ± 12.1 | 18.5 ± 10.2 | 66.7 ± 23.6 | 42.9 ± 23.0 | 53.8 ± 17.9 | 30.0 ± 9.9 |
| COS-PLAY (Qwen3-8B) | 33.3 ± 17.7 | 23.3 ± 12.2 | 26.9 ± 10.4 | 64.3 ± 22.5 | 63.2 ± 20.0 | 63.6 ± 15.6 | 39.0 ± 9.4 |

Per-Power Mean Supply Centers for Diplomacy

Mean supply centers per power with 95% confidence intervals. GPT-5.4 is evaluated in self-play; all other models are evaluated against GPT-5.4.

| Model | Austria ↑ | England ↑ | France ↑ | Germany ↑ | Italy ↑ | Russia ↑ | Turkey ↑ | Mean SC ↑ |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 4.38 ± 1.34 | 4.12 ± 0.82 | 4.50 ± 1.09 | 5.12 ± 0.95 | 4.50 ± 0.77 | 5.12 ± 1.52 | 5.12 ± 0.95 | 4.70 ± 0.35 |
| Gemini-3.1-Pro | 2.50 ± 0.89 | 2.38 ± 0.63 | 3.12 ± 0.70 | 2.88 ± 0.82 | 3.14 ± 0.35 | 3.14 ± 0.35 | 1.86 ± 1.24 | 2.72 ± 0.26 |
| Claude-4.6-Sonnet | 3.20 ± 0.56 | 2.90 ± 0.41 | 3.50 ± 0.70 | 3.78 ± 0.64 | 3.20 ± 0.30 | 2.80 ± 0.30 | 2.80 ± 0.66 | 3.16 ± 0.19 |
| GPT-OSS-120B | 1.75 ± 1.07 | 2.88 ± 0.29 | 2.88 ± 0.29 | 2.62 ± 0.44 | 3.00 ± 0.00 | 2.88 ± 0.29 | 1.25 ± 0.97 | 2.46 ± 0.25 |
| Qwen3-8B | 1.62 ± 0.26 | 2.75 ± 0.25 | 2.75 ± 0.16 | 2.88 ± 0.12 | 3.00 ± 0.00 | 2.88 ± 0.23 | 2.62 ± 0.18 | 2.64 ± 0.18 |
| COS-PLAY (Qwen3-8B) | 2.22 ± 0.55 | 2.70 ± 0.42 | 3.10 ± 0.46 | 3.12 ± 0.58 | 3.20 ± 0.39 | 3.30 ± 0.30 | 3.00 ± 0.72 | 2.96 ± 0.20 |

General Reasoning (Catastrophic Forgetting Check)

COS-PLAY preserves general reasoning capabilities with minimal degradation.

| Model | MMLU-Pro Acc. ↑ | Math-500 EM ↑ |
|---|---|---|
| Qwen3-8B | 61.99% | 46.40% |
| COS-PLAY | 61.15% | 44.60% |

Skill Bank Evolution

We visualize how the skill bank evolves during Diplomacy training: (a) strategic function categories grow richer over training, (b) intention composition diversifies, and (c) the active bank stays compact at 55–70 skills while 121 are discovered and 53 pruned via merge/split/retirement.

Skill bank evolution over Diplomacy training

Training Reward Curves

Training reward curves across all six games show steady improvement in single-player games and equilibrium convergence in multi-player self-play.

Training reward curves for all games

Contributions

  1. Co-evolution framework: We propose a co-evolution framework for long-horizon gameplay that closes the loop between gameplay and skill learning: an LLM-based decision agent interacts with the environment to collect trajectories, and a skill bank agent converts these rollouts into reusable skills that improve future decision making.
  2. Agentic skill-bank pipeline: We propose an agentic skill-bank pipeline for construction and maintenance that transforms unlabeled trajectories into reusable skills via boundary proposal, segmentation, contract learning, and bank curation—enabling skill-guided long-horizon decision making across diverse game environments.
  3. Comprehensive evaluation: We provide a comprehensive evaluation across six game environments requiring multi-hop skill usage, covering both task performance and skill reusability. Built on an 8B base model, COS-PLAY achieves over 25.1% average gain against four frontier LLM baselines on single-player games while remaining competitive with SOTA LLMs on multi-player social reasoning tasks.

BibTeX

@inproceedings{wu2026cosplay,
  title={Co-Evolving {LLM} Decision and Skill Bank Agents for Long-Horizon Game Play},
  author={Wu, Xiyang and Li, Zongxia and Shi, Guangyao and Duffy, Alexander and Marques, Tyler and Olson, Matthew Lyle and Zhou, Tianyi and Manocha, Dinesh},
  booktitle={Conference on Language Modeling (COLM)},
  year={2026}
}