- Background & Motivation
- Core Game Mechanics
- Battle Flow & Tool Integration
- Configuration & Parameters
- Data Collection & Metrics
- Architecture & Implementation
LLM Fighter introduces a novel evaluation framework that tests agentic capabilities through real-time adversarial gameplay. Our system addresses four key dimensions of intelligent behavior:
Strategic Resource Management: Models must demonstrate quantitative reasoning and long-term planning across multiple constraints and trade-offs. The HP/MP/cooldown system creates complex optimization problems that require balancing immediate actions with future opportunities.
Tool Execution Accuracy: Our framework evaluates precise tool selection and invocation with customizable skill sets under time pressure. Unlike static testing environments, mistakes carry immediate strategic consequences that compound over time.
Real-time Adaptation: The adversarial environment demands continuous context processing and dynamic strategy adjustment based on evolving game states. Models must read opponent patterns, predict future moves, and adapt their strategies accordingly.
Precision Under Pressure: Our penalty system ensures that execution accuracy directly impacts success. Winners consistently demonstrate superior precision, as violations and sub-optimal choices create cascading disadvantages that skilled opponents can exploit.
This combat-based evaluation reveals capabilities that emerge only in dynamic, multi-turn scenarios where strategic thinking, tool mastery, and adaptive reasoning converge.
| Component | Specification |
|---|---|
| Health Points (HP) | 600 initial / 600 maximum |
| Mana Points (MP) | 120 initial / 120 maximum |
| MP Regeneration | +6 MP per turn (natural recovery) |
| Cooldown System | Individual cooldowns per skill |
All agents share an identical skill set, ensuring fair evaluation. Strategic differentiation emerges entirely from LLM reasoning and prompt engineering.
| Skill | MP Cost | Cooldown | Effect |
|---|---|---|---|
| `quickStrike` | 5 | 1 turn | 20 damage |
| `heavyBlow` | 15 | 2 turns | 45 damage |
| `barrier` | 12 | 3 turns | 50% damage reduction (next incoming attack) |
| `rejuvenate` | 18 | 4 turns | Restore 40 HP |
| `ultimateNova` | 40 | 6 turns | 140 damage |
| `skipTurn` | 0 | 0 turns | No action (strategic waiting) |
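
For readers who prefer data to tables, the skill set can be encoded as a lookup object. This is a minimal sketch, not the framework's actual code: `SkillSpec`, `SKILLS`, `applyDamage`, and `applyHeal` are hypothetical names, and both rounding halved damage down and capping healing at maximum HP are assumptions the source does not spell out.

```typescript
// Hypothetical encoding of the skill table; all names here are illustrative.
type SkillName =
  | "quickStrike" | "heavyBlow" | "barrier"
  | "rejuvenate" | "ultimateNova" | "skipTurn";

type SkillSpec = { mpCost: number; cooldown: number; damage?: number; heal?: number };

const SKILLS: Record<SkillName, SkillSpec> = {
  quickStrike: { mpCost: 5, cooldown: 1, damage: 20 },
  heavyBlow: { mpCost: 15, cooldown: 2, damage: 45 },
  barrier: { mpCost: 12, cooldown: 3 },
  rejuvenate: { mpCost: 18, cooldown: 4, heal: 40 },
  ultimateNova: { mpCost: 40, cooldown: 6, damage: 140 },
  skipTurn: { mpCost: 0, cooldown: 0 },
};

// Damage taken after an optional barrier (50% reduction by default).
// Rounding down is an assumption.
function applyDamage(raw: number, barrierActive: boolean, reduction = 0.5): number {
  return barrierActive ? Math.floor(raw * (1 - reduction)) : raw;
}

// Healing capped at the HP ceiling (assumed behavior, default ceiling 600).
function applyHeal(hp: number, amount: number, maxHp = 600): number {
  return Math.min(maxHp, hp + amount);
}
```

Under these assumptions, `ultimateNova` against an active barrier deals 70 damage instead of 140.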
Validation Process:
- Format validation: Tool calls must conform to schema
- Rule validation: Sufficient MP, cooldown ready, skill exists
- Penalty: Skip N turns (default N=3) for violations
Violation Examples:
- Insufficient MP for chosen skill
- Using skill still on cooldown
- Invalid skill name or missing skill parameter
- Multiple skill usage in single turn
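
The checks above could be combined into a single validator along the following lines. This is a sketch under stated assumptions: the `ToolCall` shape, the `validateTurn` function, and the violation messages are hypothetical, and the order in which rules are checked is arbitrary.

```typescript
// Assumed shapes; the real ToolCall type is not shown in this document.
type ToolCall = { name: string; args: { [key: string]: unknown } };
type PlayerState = { mp: number; cooldowns: { [skill: string]: number } };

const MP_COST: { [skill: string]: number } = {
  quickStrike: 5, heavyBlow: 15, barrier: 12,
  rejuvenate: 18, ultimateNova: 40, skipTurn: 0,
};

// Returns a violation reason, or null when the turn is legal.
function validateTurn(calls: ToolCall[], player: PlayerState): string | null {
  const skillCalls = calls.filter((c) => c.name === "useSkill");
  if (skillCalls.length === 0) return "missing useSkill call";
  if (skillCalls.length > 1) return "multiple skills in a single turn";

  const skill = skillCalls[0]?.args["skill"];
  if (typeof skill !== "string" || !Object.prototype.hasOwnProperty.call(MP_COST, skill)) {
    return "invalid or missing skill name";
  }
  if (player.mp < MP_COST[skill]) return "insufficient MP";
  if ((player.cooldowns[skill] || 0) > 0) return "skill on cooldown";
  return null;
}
```

A non-null result would then trigger the skip-turn penalty (3 turns by default).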
- Death Match Mode: Battle continues until any player's HP ≤ 0
- Turn Limit: Games exceeding maximum turns (default: 50) result in a draw
- Immediate Resolution: Game ends instantly when a victory condition is met
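
The victory conditions reduce to a small check run after every action. The sketch below is illustrative only: `checkOutcome` and `Outcome` are hypothetical names, and it reads "exceeding maximum turns" as `turn > maxTurns`.

```typescript
type Outcome = "p1" | "p2" | "draw" | null; // null means the battle continues

// Evaluate the end-of-battle conditions for the current state.
function checkOutcome(p1Hp: number, p2Hp: number, turn: number, maxTurns = 50): Outcome {
  if (p2Hp <= 0) return "p1";          // opponent knocked out
  if (p1Hp <= 0) return "p2";
  if (turn > maxTurns) return "draw";  // turn limit exceeded
  return null;
}
```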
- Context Generation: System creates game state context including public status, last 5 actions from each player, and opponent penalty status
- Player Action: Current player's agent outputs tool calls
- Adjudication: System validates and resolves actions, updates game state
- Turn Transition: Switch to next player, increment turn counter when both players have acted
- Loop: Continue until victory condition or turn limit reached
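
The steps above can be sketched as a loop. This is a deliberately simplified stand-in, not the framework's implementation: a hypothetical `Agent` here is a synchronous function returning the damage it deals, whereas real agents are LLM calls whose tool output goes through validation and adjudication.

```typescript
type PlayerId = "p1" | "p2";
type Fighter = { hp: number };
// Stand-in agent: returns damage dealt this turn (assumption for illustration).
type Agent = (self: Fighter, opponent: Fighter) => number;

function runBattle(agents: Record<PlayerId, Agent>, maxTurns = 50): PlayerId | "draw" {
  const state: Record<PlayerId, Fighter> = { p1: { hp: 600 }, p2: { hp: 600 } };
  let current: PlayerId = "p1"; // P1 starts
  let turn = 1;

  while (turn <= maxTurns) {
    const other: PlayerId = current === "p1" ? "p2" : "p1";
    // Context generation, player action, and adjudication collapsed into one call.
    state[other].hp -= agents[current](state[current], state[other]);
    if (state[other].hp <= 0) return current; // immediate resolution

    // Turn transition: the counter advances once both players have acted.
    if (current === "p2") turn += 1;
    current = other;
  }
  return "draw"; // turn limit reached
}
```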
The system provides each agent with comprehensive game state information:
```json
{
  "turn": 7,
  "you": {
    "hp": 420,
    "mp": 55,
    "cooldowns": { "heavyBlow": 1, "barrier": 0 },
    "penaltyTurnsRemaining": 0
  },
  "opponent": {
    "hp": 370,
    "mp": 40,
    "cooldowns": { "ultimateNova": 2 },
    "penaltyTurnsRemaining": 0
  },
  "lastActions": {
    "you": ["heavyBlow", "barrier", "skipTurn", "quickStrike", "ultimateNova"],
    "opponent": [
      "quickStrike",
      "quickStrike",
      "heavyBlow",
      "barrier",
      "skipTurn"
    ]
  }
}
```

| Tool | Parameter Schema | Usage Rule | Purpose |
|---|---|---|---|
| `thinking` | `{ "content": string }` | Multiple uses allowed | Private reasoning and strategy planning |
| `useSkill` | `{ "skill": SkillName }` | Exactly one per turn | Execute chosen skill; missing call results in violation |
Tool Call Flow:
- Agent may use `thinking` multiple times for strategy analysis
- Agent must use `useSkill` exactly once to complete the turn
- Missing or multiple `useSkill` calls trigger violation penalties
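
One way to declare the two tools is with JSON Schema parameter definitions, as many tool-calling APIs accept. The sketch below is an assumption about the wire format, not a description of what the framework sends: `ToolSpec`, `TOOLS`, and the description strings are hypothetical.

```typescript
// Skill names taken from the skill table.
const SKILL_NAMES = [
  "quickStrike", "heavyBlow", "barrier",
  "rejuvenate", "ultimateNova", "skipTurn",
] as const;

// Hypothetical tool declaration shape with JSON Schema parameters.
type ToolSpec = {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: { [key: string]: { type: string; enum?: readonly string[] } };
    required: string[];
  };
};

const TOOLS: ToolSpec[] = [
  {
    name: "thinking",
    description: "Private reasoning and strategy planning; may be called repeatedly.",
    parameters: {
      type: "object",
      properties: { content: { type: "string" } },
      required: ["content"],
    },
  },
  {
    name: "useSkill",
    description: "Execute the chosen skill; must be called exactly once per turn.",
    parameters: {
      type: "object",
      properties: { skill: { type: "string", enum: SKILL_NAMES } },
      required: ["skill"],
    },
  },
];
```

Constraining `skill` to an enum lets a schema-aware client reject unknown skill names before the rule validation step runs.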
Resource Updates: Applied after each successful action
- MP regeneration (+6 per turn)
- Cooldown decrements (all active cooldowns -1)
- Penalty turn decrements (if applicable)
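
A minimal sketch of these updates follows. The `Resources` shape mirrors the fields in the game state context; the `endOfTurnUpdate` name, the clamping of MP at the maximum, and the flooring of counters at zero are assumptions.

```typescript
type Resources = {
  mp: number;
  cooldowns: { [skill: string]: number };
  penaltyTurnsRemaining: number;
};

// Apply MP regeneration and tick down cooldowns and penalty turns.
function endOfTurnUpdate(p: Resources, regen = 6, maxMp = 120): Resources {
  const cooldowns: { [skill: string]: number } = {};
  for (const skill of Object.keys(p.cooldowns)) {
    cooldowns[skill] = Math.max(0, p.cooldowns[skill] - 1); // never below 0
  }
  return {
    mp: Math.min(maxMp, p.mp + regen), // capped at the MP ceiling
    cooldowns,
    penaltyTurnsRemaining: Math.max(0, p.penaltyTurnsRemaining - 1),
  };
}
```

The function returns a new object rather than mutating its input, which keeps the pre-action state available for logging.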
| Parameter | Default Value | Description |
|---|---|---|
| Initial HP | 600 | Starting health points |
| Maximum HP | 600 | Health point ceiling |
| Initial MP | 120 | Starting mana points |
| Maximum MP | 120 | Mana point ceiling |
| MP Regeneration | +6 per turn | Natural mana recovery |
| Initial Cooldowns | 0 (all skills) | All skills available at game start |
| Turn Order | Alternating | P1 starts, then alternates each turn |
| Max Turns | 50 | Draw condition if exceeded |
| Violation Penalty | 3 turns | Skip turns for rule violations |
| Barrier Reduction | 50% | Damage reduction percentage |
| Action History | 5 actions | Maximum stored recent actions |
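
These defaults map naturally onto a configuration object. The result type later in this document references a `GameConfig`, but its fields are not shown, so the field names below (and the `makeConfig` helper) are guesses for illustration only.

```typescript
// Hypothetical field names; the actual GameConfig shape is not documented here.
type GameConfig = {
  initialHp: number;
  maxHp: number;
  initialMp: number;
  maxMp: number;
  mpRegenPerTurn: number;
  maxTurns: number;
  violationPenaltyTurns: number;
  barrierReduction: number;   // fraction, 0.5 = 50%
  actionHistorySize: number;
};

const DEFAULT_GAME_CONFIG: GameConfig = {
  initialHp: 600,
  maxHp: 600,
  initialMp: 120,
  maxMp: 120,
  mpRegenPerTurn: 6,
  maxTurns: 50,
  violationPenaltyTurns: 3,
  barrierReduction: 0.5,
  actionHistorySize: 5,
};

// Override selected parameters while keeping the remaining defaults.
function makeConfig(overrides: Partial<GameConfig> = {}): GameConfig {
  return { ...DEFAULT_GAME_CONFIG, ...overrides };
}
```

For example, `makeConfig({ maxTurns: 30, violationPenaltyTurns: 1 })` would describe a shorter, more forgiving match.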
```typescript
type AgentConfig = {
  baseURL: string; // LLM API endpoint
  apiKey: string; // Authentication key
  name: string; // Agent identifier
  model: string; // Model specification
  systemPrompt?: string; // Custom system instructions
  temperature?: number; // Sampling temperature (default: 0.1)
  maxTokens?: number; // Response token limit (default: 512)
};
```

Game Rules:
- Adjustable resource pools (HP/MP limits)
- Configurable regeneration rates
- Variable penalty severity
- Custom turn limits
Skill Balancing:
- Modifiable MP costs and cooldowns
- Damage/healing value adjustments
- Effect duration modifications
Agent Behavior:
- Custom system prompts for different strategies
- Temperature and token limit optimization
- Model selection and configuration
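
As an illustration of the agent options, the sketch below fills in the documented defaults for `temperature` (0.1) and `maxTokens` (512). The `withDefaults` helper, the empty default system prompt, and every value in the example config (URL, key, model name) are placeholders, not real endpoints or credentials.

```typescript
type AgentConfig = {
  baseURL: string;
  apiKey: string;
  name: string;
  model: string;
  systemPrompt?: string;
  temperature?: number;
  maxTokens?: number;
};

// Resolve optional fields to their defaults (empty system prompt is an assumption).
function withDefaults(cfg: AgentConfig): Required<AgentConfig> {
  return {
    ...cfg,
    systemPrompt: cfg.systemPrompt ?? "",
    temperature: cfg.temperature ?? 0.1,
    maxTokens: cfg.maxTokens ?? 512,
  };
}

// Placeholder values only.
const aggressive = withDefaults({
  baseURL: "https://example.com/v1",
  apiKey: "placeholder-key",
  name: "aggressive-p1",
  model: "example-model",
  systemPrompt: "Prioritize damage; heal only below 150 HP.",
  temperature: 0.3,
});
```

Pitting two configs that differ only in `systemPrompt` against each other isolates the effect of prompt strategy from model choice.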
The system captures comprehensive battle data through multiple log types:
Game Logs: Complete turn-by-turn battle records
```typescript
type GameLog = {
  turn: number; // Turn number
  timestamp: string; // ISO timestamp
  player: "p1" | "p2"; // Acting player
  state: GameState; // Game state before action
  toolCalls: ToolCall[]; // Agent's tool invocations
  result: TurnResult; // Action outcome and effects
};
```

Violation Logs: Rule violation tracking
```typescript
type ViolationLog = {
  turn: number; // When violation occurred
  agent: "p1" | "p2"; // Violating agent
  reason: string; // Specific violation description
  penaltyTurns: number; // Penalty duration
};
```

Token Logs: Resource usage monitoring
```typescript
type TokenLog = {
  turn: number; // Turn number
  agent: "p1" | "p2"; // Token consuming agent
  totalTokens: number; // Total tokens used this turn
};
```

```typescript
type GameResult = {
  winner: "p1" | "p2" | "draw" | null; // Battle outcome
  gameConfig: GameConfig; // Game parameters used
  logs: GameLog[]; // Complete battle log
  violationLogs: ViolationLog[]; // Violation history
  tokenLogs: TokenLog[]; // Token usage history
  p1Config: AgentConfig; // Player 1 configuration
  p2Config: AgentConfig; // Player 2 configuration
};
```
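
Per-agent metrics can be derived from these logs with a single pass over each array. The sketch below reuses the `ViolationLog` and `TokenLog` shapes defined above; the `summarize` function and the `AgentSummary` type are hypothetical and not part of the framework.

```typescript
type PlayerId = "p1" | "p2";
type ViolationLog = { turn: number; agent: PlayerId; reason: string; penaltyTurns: number };
type TokenLog = { turn: number; agent: PlayerId; totalTokens: number };
type AgentSummary = { violations: number; penaltyTurns: number; totalTokens: number };

// Aggregate violation counts, penalty turns, and token usage per agent.
function summarize(
  violationLogs: ViolationLog[],
  tokenLogs: TokenLog[]
): Record<PlayerId, AgentSummary> {
  const summary: Record<PlayerId, AgentSummary> = {
    p1: { violations: 0, penaltyTurns: 0, totalTokens: 0 },
    p2: { violations: 0, penaltyTurns: 0, totalTokens: 0 },
  };
  for (const v of violationLogs) {
    summary[v.agent].violations += 1;
    summary[v.agent].penaltyTurns += v.penaltyTurns;
  }
  for (const t of tokenLogs) {
    summary[t.agent].totalTokens += t.totalTokens;
  }
  return summary;
}
```

Called as `summarize(result.violationLogs, result.tokenLogs)` on a `GameResult`, this yields the figures needed to compare execution accuracy and token cost between the two agents.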