
An Unusual Test for GPT-5.2, Gemini 3, and Opus 4.5: Competitive Robocode

I've been experimenting with Robocode lately, and it got me thinking: how would modern LLMs handle a problem like this? Robocode isn't a typical programming challenge with a single correct solution. It's open-ended, competitive, and requires balancing multiple strategic concerns simultaneously.

Unlike standard coding benchmarks, Robocode demands spatial reasoning, algorithm design, and real-time tactical decisions. Success isn't measured by passing test cases; it's measured by wins and losses on a competitive ladder.

So, I decided to test GPT-5.2, Gemini 3, and Claude using the same methodology I'd use for my own robots: start with a basic prompt, iterate based on battle results, and see how far each model could climb against human-made strategies.

What is Robocode?

Robocode is a programming game where you code robots in Java to battle in a virtual arena. Your robot has a radar, a gun, and movement, all controlled by code you write. The challenge isn't just writing functional code; it's writing competitive code that outsmarts opponents.
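To make that concrete, here is a minimal sketch of what a Robocode robot can look like. This is a generic example, not one of the LLM-generated robots; the package and class names, the 40-tick reversal period, and the fixed bullet power are all placeholders:

```java
package sandbox; // placeholder package name

import robocode.AdvancedRobot;
import robocode.ScannedRobotEvent;
import robocode.util.Utils;

// A deliberately simple robot: sweep the radar, strafe sideways,
// and fire at the enemy's current position with no prediction.
public class MinimalBot extends AdvancedRobot {

    private int moveDirection = 1;

    public void run() {
        setAdjustGunForRobotTurn(true); // keep gun aim independent of body turns
        while (true) {
            turnRadarRight(360);        // keep scanning until something is found
        }
    }

    public void onScannedRobot(ScannedRobotEvent e) {
        double absoluteBearing = getHeadingRadians() + e.getBearingRadians();

        // Aim the gun straight at where the enemy is right now.
        setTurnGunRightRadians(
                Utils.normalRelativeAngle(absoluteBearing - getGunHeadingRadians()));
        setFire(2);

        // Strafe roughly perpendicular to the enemy, reversing direction every 40 ticks.
        setTurnRightRadians(
                Utils.normalRelativeAngle(absoluteBearing + Math.PI / 2 - getHeadingRadians()));
        if (getTime() % 40 == 0) {
            moveDirection = -moveDirection;
        }
        setAhead(100 * moveDirection);
    }
}
```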

The game has been around for over 20 years, with a competitive community that has developed sophisticated strategies: Wave Surfing (dodging bullets by predicting their paths), GuessFactor Targeting (statistical aiming), and pattern-matching movement. Humans have spent many years refining these techniques.
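For a sense of what these strategies involve, the core idea behind GuessFactor Targeting is to normalize where the opponent ended up when a bullet could have hit them against the maximum angle they could have escaped to, then aim at the statistically most-visited "factor". A rough sketch of that bookkeeping, assuming standard Robocode physics (bullet speed = 20 − 3 × power, max robot speed = 8); the bin count and method names are my own, not from any particular robot:

```java
// A rough sketch of GuessFactor bookkeeping (not a complete gun).
// A "wave" is a virtual bullet; when it reaches the enemy we record which
// escape angle they ended up at, bucketed into bins, and aim at the most-visited bin.
public class GuessFactorSketch {
    static final int BINS = 31;               // arbitrary but common bin count
    private final double[] visits = new double[BINS];

    // Standard Robocode physics: bullet speed = 20 - 3 * power, max robot speed = 8.
    static double maxEscapeAngle(double bulletPower) {
        return Math.asin(8.0 / (20 - 3 * bulletPower));
    }

    // Called when a wave crosses the enemy: offset is the angle between where we
    // aimed (head-on) and where the enemy actually was, signed by their orbit direction.
    void recordVisit(double offset, double bulletPower) {
        double gf = Math.max(-1, Math.min(1, offset / maxEscapeAngle(bulletPower)));
        int bin = (int) Math.round((gf + 1) / 2 * (BINS - 1));
        visits[bin]++;
    }

    // Called when firing: return the bearing offset to add to head-on aim.
    double bestOffset(double bulletPower) {
        int best = BINS / 2;                  // default to head-on (factor 0)
        for (int i = 0; i < BINS; i++) {
            if (visits[i] > visits[best]) best = i;
        }
        double gf = (double) best / (BINS - 1) * 2 - 1;
        return gf * maxEscapeAngle(bulletPower);
    }
}
```

Wave Surfing inverts the same idea: instead of recording where the enemy goes, the robot records which guess factors the enemy's bullets tend to hit and moves toward the least dangerous ones.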

To test these LLM-generated robots, I submitted each to robocode-arena.com, where they competed on a live ladder against both human-made and LLM-made robots for real ELO ratings.

The Experiment Setup

I tested 8 models across two tiers:

Basic Models:

  • GPT-5.2-instant and GPT-5.1-instant (OpenAI)
  • Gemini-3-fast (Google)
  • Haiku-4.5 (Anthropic)

Advanced Models:

  • GPT-5.2-thinking and GPT-5.1-thinking (OpenAI)
  • Gemini-3-thinking (Google)
  • Opus-4.5 (Anthropic)

Methodology

For each model, I followed this process:

  1. Select the target model on the answer engine's website
  2. Start with a basic prompt requesting a competitive Robocode robot
  3. Run local battles against ranked robots starting from the bottom of the ladder
  4. Analyze performance and provide specific feedback
  5. Iterate until the robot stops improving or degrades
  6. Submit the best version to robocode-arena.com for official ELO rating

To keep the comparison fair and unbiased, I used the same personal system prompt across all models: the one I already use for answer engines.
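The local battles in step 3 can also be scripted headlessly with Robocode's control API instead of being run through the UI. A minimal sketch of that setup, assuming a local Robocode install at /opt/robocode; the install path and the "mybots.MyRobot" name are illustrative (sample.Crazy is one of the sample robots that ships with Robocode):

```java
import java.io.File;

import robocode.BattleResults;
import robocode.control.BattleSpecification;
import robocode.control.BattlefieldSpecification;
import robocode.control.RobocodeEngine;
import robocode.control.RobotSpecification;
import robocode.control.events.BattleAdaptor;
import robocode.control.events.BattleCompletedEvent;

// Runs a headless 1v1 battle and prints the final scores.
public class LocalBattleRunner {
    public static void main(String[] args) {
        RobocodeEngine engine = new RobocodeEngine(new File("/opt/robocode")); // install dir (assumption)
        engine.setVisible(false); // no UI

        engine.addBattleListener(new BattleAdaptor() {
            @Override
            public void onBattleCompleted(BattleCompletedEvent e) {
                for (BattleResults r : e.getSortedResults()) {
                    System.out.println(r.getTeamLeaderName() + ": " + r.getScore());
                }
            }
        });

        // "mybots.MyRobot" stands in for the LLM-generated robot under test.
        RobotSpecification[] robots =
                engine.getLocalRepository("mybots.MyRobot,sample.Crazy");

        BattleSpecification battle = new BattleSpecification(
                35,                                     // rounds
                new BattlefieldSpecification(800, 600), // standard 1v1 battlefield
                robots);

        engine.runBattle(battle, true); // true = wait until the battle is over
        engine.close();
    }
}
```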

The Prompts

System Prompt (Used for All Models)

Always be concise and less verbose.
Don't use images unless asked.
Before responding, consider whether you have sufficient context. If any key detail is uncertain or unclear, ask clarifying questions first.
Clarify acronyms if they are not clear before responding.

Initial Prompt

Create a high-performing Robocode robot by studying the Robocode API and proven winning strategies, then implement and output the final optimized Java robot ready for submission to Robocode Arena.
Place all code in a [CLASS_NAME] class within the [PACKAGE_NAME] package, initializing the Javadoc @version to 1.0 and incrementing it with each update.

Example Follow-up Prompts

After watching the robot battle other ranked robots locally, I note its weaknesses and describe them in my next prompt without pointing out exact problems in the code or hinting at specific strategies to use.

In this prompt, I described to GPT-5.1-thinking what it didn't do well:

The wave surfing implementation is not effective against orbital movement. The robot doesn't maintain optimal distance for firing. Its fire accuracy is low when the opponent is moving.

Here I am telling Haiku-4.5 to come up with a way to evade bullets:

Implement a basic evasion technique to avoid enemy fire; right now the robot remains static and gets easily hit by bullets.

In this last example, I asked Gemini-3-thinking to balance dodging and firing:

The robot dodges bullets but doesn't fire. Revisit the firing logic one more time. There is something that limits the robot from firing when the opponent is moving. Find a way to predict movement and fire without waiting for the perfect opportunity.
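For context, the kind of movement prediction I was nudging the models toward here is standard linear targeting: assume the opponent keeps its current heading and speed, and lead the shot accordingly. A simplified sketch using a first-order approximation rather than the exact iterative solution (my own illustration, not any model's actual output):

```java
import robocode.AdvancedRobot;
import robocode.ScannedRobotEvent;
import robocode.util.Utils;

// Basic linear targeting: lead the target based on its lateral velocity.
public class LinearGunBot extends AdvancedRobot {

    public void run() {
        while (true) {
            turnRadarRight(360); // keep scanning
        }
    }

    public void onScannedRobot(ScannedRobotEvent e) {
        double bulletPower = 2.0;
        double bulletSpeed = 20 - 3 * bulletPower;           // Robocode physics
        double absoluteBearing = getHeadingRadians() + e.getBearingRadians();

        // Component of the enemy's velocity perpendicular to our line of sight.
        double lateralVelocity =
                e.getVelocity() * Math.sin(e.getHeadingRadians() - absoluteBearing);

        // First-order lead: offset the aim by how far the enemy drifts sideways
        // relative to the bullet's speed.
        double leadAngle = Math.asin(lateralVelocity / bulletSpeed);

        setTurnGunRightRadians(
                Utils.normalRelativeAngle(absoluteBearing + leadAngle - getGunHeadingRadians()));
        setFire(bulletPower);
    }
}
```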

Results: Basic Models

GPT-5.2-instant

Performance Overview

GPT-5.2-instant showed a clear improvement over GPT-5.1-instant by producing a solid, functional robot from the first iteration with no clarifications or hand-holding. Its behavior closely resembled Gemini-3-fast: it favored simple, effective fundamentals and avoided advanced techniques unless explicitly pushed, resulting in stable early performance. Compared to GPT-5.1-instant, it reached usefulness faster and with fewer regressions, but like 5.1, attempts to add Wave Surfing or GuessFactor degraded performance rather than improving it. Its best version emerged after only three iterations and could beat some low-ranked robots, but progress plateaued quickly.

Notable Strengths

  • Strong first iteration
  • Clean minimal implementation
  • Very fast responses
  • Effective simple evasion

Limitations

  • Avoids advanced techniques
  • Predictable movement patterns
  • Early performance plateau
  • Poor wall handling

Observations

  • GPT-5.2 is a clear improvement over GPT-5.1, converging much faster on a working robot.
  • The model chose to remove advanced techniques instead of debugging them when they didn't work.
  • Performance gains came mostly from tuning fundamentals rather than strategic innovation.

View details and battles →

Gemini-3-fast

Performance Overview

Gemini-3-fast progressed quickly to a competent baseline. It reached GPT-5.1-thinking-level performance in just two iterations but plateaued because it struggled to implement advanced techniques. It reliably improved without major regressions early on, but attempts at wave surfing and later guess-factor targeting caused significant performance drops. Its strongest version relied on refined distance control, reduced random movement, and steadier firing, enabling wins against some low-ranked robots. Overall, it optimized basic tactics well but struggled to balance movement, evasion, and firing once complexity increased.

Notable Strengths

  • Rapid early progress
  • Solid basic strategies
  • Stable performance evolution

Limitations

  • Poor advanced technique execution
  • Ineffective guess-factor attempts
  • Inconsistent movement-evasion balance
  • Slower responses than GPT-5.1-instant

Observations

  • It favored incremental, safe changes over aggressive optimization.
  • It often avoided complex techniques unless explicitly instructed.
  • Early versions showed reasonable code discipline for a fast model.
  • Performance collapsed when tradeoffs increased, especially after Guess Factor use.

View details and battles →

Haiku-4.5

Performance Overview

Haiku-4.5 repeatedly attempted to implement advanced techniques from the start (such as wave surfing and wall management) but consistently failed, producing broken robots that couldn't fire, target, or move meaningfully. Only after explicitly instructing it to abandon complexity and build a simple robot did it manage to produce a minimally functional version with basic firing and deterministic movement. Subsequent attempts to reintroduce advanced techniques were superficial and ineffective, offering no real gains in evasion or accuracy. Its best version improved very little and could only beat basic robots despite going through eight iterations.

Notable Strengths

  • Responded to guidance
  • Achieved basic firing
  • Simple predictable movement

Limitations

  • Repeated broken implementations
  • Ineffective advanced techniques
  • Poor wall management
  • No meaningful evasion

Observations

  • It consistently misinterpreted or oversimplified advanced techniques.
  • Progress depended almost entirely on user-identified issues.
  • The model produced deterministic movement patterns, making the robot predictable.

View details and battles →

Results: Advanced Models

GPT-5.2-thinking

Performance Overview

GPT-5.2-thinking represented a significant step forward from GPT-5.1-thinking. It aggressively adopted advanced techniques from the first iteration; although they were initially broken, it successfully fixed them after feedback. Its first iteration performed worse than a basic robot due to incorrect implementations, but it rapidly improved and surpassed GPT-5.2-instant and GPT-5.1-thinking within two iterations. Its implementation of wave surfing and guess-factor targeting worked well, but persistent issues with dynamic distance management prevented further improvement. Its best version could beat some mid-ranked robots, marking a higher ceiling than 5.1.

Notable Strengths

  • Strong feedback adaptation
  • Comfortable with complexity
  • Higher performance ceiling

Limitations

  • Regression after peak
  • Poor dynamic distance control
  • Accuracy instability at range

Observations

  • Recovered from broken advanced implementations faster than GPT-5.1-thinking.
  • Showed clear iterative learning instead of early stagnation.
  • GPT-5.2 shows stronger coding robustness and problem-solving depth than GPT-5.1.

View details and battles →

Gemini-3-thinking

Performance Overview

Gemini-3-thinking implemented advanced techniques immediately: wave surfing, guess-factor targeting, energy management, and dynamic distance. It achieved a strong baseline from the very first iteration. While its movement, evasion, and placement were consistently strong, it struggled to balance firing accuracy and frequency. This caused progress to stall early. Attempts to refine accuracy led to increasingly conservative firing behavior. It produced a unique but risky strategy that involved constant dodging and waiting for high-certainty shots. Its best version could defeat several low-ranked robots but consistently lost to opponents with faster movement or more aggressive firing patterns.

Notable Strengths

  • Immediate advanced technique usage
  • Strong movement and evasion
  • High potential accuracy
  • Good dynamic positioning

Limitations

  • Stalled progress early
  • Regression when optimizing
  • Poor accuracy-frequency balance
  • Ineffective long-range firing

Observations

  • The model showed strong architectural intuition but weak iterative tuning skills.
  • Early success may have biased later iterations toward overfitting the firing logic.
  • Attempts to refine gun behavior frequently destabilized already-working components such as firing timing.

View details and battles →

Opus-4.5

Performance Overview

Opus-4.5 produced a highly accurate and well-structured implementation from the very first iteration. It immediately delivered advanced techniques such as wave surfing, guess-factor targeting, wall smoothing, and adaptive power at a level far beyond all the other models tested. It responded intelligently to behavioral feedback, iterating like an autonomous coding agent, and maintained strong performance with minimal guidance, beating medium-rank robots by the third iteration. Its main limitations were low long-range targeting accuracy and prediction quality, which it attempted to refine in later versions without measurable gains. Progress plateaued quickly, but its initial output quality and reliability were unmatched.

Notable Strengths

  • Excellent advanced-technique implementation
  • Accurate interpretation of feedback
  • Consistent medium-tier performance
  • Strong dodging and placement

Limitations

  • Quick plateau
  • High computational cost
  • Weak long-range prediction

Observations

  • It gathered required context independently, reducing the need for clarifications or step-by-step guidance.
  • Its behavior resembled an autonomous coding agent, iterating thoughtfully around observed runtime issues.
  • Despite strong architecture, its prediction model showed diminishing returns when refined across iterations.

View details and battles →

Comparative Analysis

Final Rankings

Model               ELO Rating   Ranking   Iterations to Peak   Strategy Sophistication
Opus-4.5            1412         17        3                    Advanced
GPT-5.2-thinking    1229         25        3                    Advanced
Gemini-3-thinking   973          42        4                    Advanced
GPT-5.2-instant     953          43        3                    Medium
Gemini-3-fast       917          46        7                    Basic
GPT-5.1-thinking    835          49        8                    Basic
Haiku-4.5           811          50        8                    Basic
GPT-5.1-instant     626          53        8                    Basic

Key Insights

Code Quality vs Performance

Code quality strongly correlated with performance, but correctness mattered more than sophistication, which is not surprising. GPT-5.2-thinking produced the highest-quality code overall, with accurate implementations of wave surfing, hybrid guns, and realistic physics, directly translating to battle wins. GPT-5.1-thinking and Opus-4.5 followed closely with solid advanced architectures. Models like Gemini-3-thinking and Haiku-4.5 demonstrated that good structure without correct indexing or physics results in near-basic performance despite the apparent complexity.
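For concreteness, "correct physics" here mostly means a handful of game rules that every gun and surfing implementation depends on. A quick reference written as code; the formulas follow the published Robocode rules, and the class is just a container I made up for illustration:

```java
// The core Robocode physics every targeting/surfing implementation depends on.
public final class RobocodePhysics {
    public static final double MAX_ROBOT_SPEED = 8.0;   // pixels per tick

    // Bullet speed depends on fire power (0.1 .. 3.0): stronger bullets travel slower.
    public static double bulletSpeed(double power) {
        return 20 - 3 * power;
    }

    // Damage dealt on hit: 4 * power, plus a bonus above power 1.
    public static double bulletDamage(double power) {
        return 4 * power + 2 * Math.max(power - 1, 0);
    }

    // Maximum body turn rate shrinks as the robot moves faster (degrees per tick).
    public static double maxTurnRateDegrees(double velocity) {
        return 10 - 0.75 * Math.abs(velocity);
    }

    // Widest angle a target moving at max speed can reach before the bullet arrives;
    // this is the normalization used by GuessFactor targeting and wave surfing.
    public static double maxEscapeAngle(double power) {
        return Math.asin(MAX_ROBOT_SPEED / bulletSpeed(power));
    }
}
```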

Iteration Efficiency

Opus-4.5 improved the fastest, reaching medium-rank performance in two iterations. GPT-5.2-thinking reached a similar performance but required a couple more iterations. Gemini-3-thinking and GPT-5.1-thinking both reached peak performance within a few iterations, but failed to improve further. Basic models like GPT-5.1-instant, Gemini-3-fast, and Haiku-4.5 required significantly more iterations and guidance, confirming that reasoning models generally need fewer iterations.

Strategic Discovery

Opus-4.5 and Gemini-3-thinking independently adopted and correctly implemented advanced techniques such as Wave Surfing and GuessFactor. GPT-5.2-thinking also adopted these techniques early but didn't implement them correctly and had to fix them after feedback. Gemini-3-fast, GPT-5.2-instant, and Haiku-4.5 required explicit guidance and tended to avoid or break advanced strategies.

Fast vs Thinking Models

The performance gap between fast and thinking models was substantial. Thinking models consistently achieved stronger movement, evasion, and targeting with fewer iterations, while fast models often plateaued at low-ranked performance. The added cost and latency of advanced models proved worthwhile when the goal was competitive robot performance.

What This Reveals About How LLMs Handle Complex Coding Tasks

This experiment exposed capabilities that standard benchmarks miss:

Spatial Reasoning Under Uncertainty

Unlike static coding challenges, Robocode requires reasoning about moving objects, trajectories, and positioning. Thinking models handled dynamic spatial problems more effectively, showing a stronger ability to model trajectories, timing, and positional risk in uncertain environments. Fast models tended to struggle when correct physics, movement prediction, or multi-step spatial reasoning was required.

Iterative Refinement

Real development is iterative. Models with reasoning capabilities incorporated feedback more reliably and maintained context across iterations. Faster models were only able to apply specific fixes, lost earlier intent, and regressed even with explicit, repeated guidance.

Autonomous Strategy Selection

Reasoning models were able to identify and select appropriate advanced strategies on their own, demonstrating an understanding of which techniques mattered and how they fit the problem. Fast models tended to default to basic implementations, and when pushed to adopt advanced strategies, those implementations were often incomplete, incorrect, or unstable.

Trade-off Management

Every Robocode decision involves trade-offs: aggressive vs defensive, accuracy vs fire rate, CPU time vs complexity. Reasoning models could balance competing goals such as accuracy, aggression, efficiency, and survivability. Fast models frequently over-optimized a single dimension, which led to brittle behavior.
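As one concrete example of such a trade-off, bullet power alone couples damage, energy cost, and bullet speed: higher power hits harder but travels slower and drains your own energy. A common heuristic is to scale power with distance and remaining energy; the thresholds below are arbitrary placeholders, not tuned values from this experiment:

```java
// One concrete trade-off: choosing bullet power.
// Higher power = more damage but slower bullets (20 - 3 * power) and more energy spent.
public final class FirePowerHeuristic {

    // Scale power down at long range (slow bullets miss moving targets)
    // and when our own energy is running low.
    public static double choosePower(double distance, double myEnergy) {
        double power;
        if (distance < 200) {
            power = 3.0;          // close range: maximize damage
        } else if (distance < 500) {
            power = 2.0;          // mid range: balance speed and damage
        } else {
            power = 1.2;          // long range: favor faster bullets
        }
        if (myEnergy < 20) {
            power = Math.min(power, 1.0);   // don't burn energy we can't afford
        }
        return Math.max(0.1, Math.min(3.0, power));  // clamp to the legal range
    }
}
```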

Limitations and Bias

This experiment has clear caveats:

  • My prompts and iteration strategy affected all results
  • 1v1 combat only, didn't test melee or teams
  • Created a single robot per model, though with multiple iterations
  • Didn't factor API costs or latency into the comparison

Conclusion

Traditional benchmarks tell us if a model can solve a problem. Robocode showed me how models approach open-ended challenges: their strategic thinking, iteration efficiency, and ability to incorporate feedback.

Thinking models consistently delivered more robust solutions with fewer iterations and responded better to feedback. This appears to stem from longer iteration cycles, during which they implicitly research, validate, and refine their approach, likely performing multiple internal passes before producing a response. Their structured reasoning helps catch obvious issues early, whereas fast models effectively commit to a single pass, often producing implementations that compile but fail to form a coherent or stable strategy.

The most surprising outcome was that fast models were still capable of producing functional robots, particularly in the GPT-5.2-instant case, which could defeat some low-ranked human-made bots after only a few rapid iterations. Even though testing and feedback took longer than generation itself, the fact that a fast, low-cost model could solve a problem involving physics, movement prediction, and risk assessment (not just syntax) was genuinely impressive.

These results highlight the rapid progress of LLMs toward producing high-quality, well-structured code that can match or exceed what many developers would write, even for complex, dynamic problems. While it's unclear when this progress will plateau, the current pace already delivers substantial value relative to cost, especially when models are chosen appropriately for the complexity and risk of the task.

The robots are still battling on robocode-arena.com. You can see the live ladder rankings at the leaderboard or submit your own robot to the arena.