An Unusual Test for GPT-5.2, Gemini 3, and Opus 4.5: Competitive Robocode
I've been experimenting with Robocode lately, and it got me thinking: how would modern LLMs handle a problem like this? Robocode isn't a typical programming challenge with a single correct solution. It's open-ended, competitive, and requires balancing multiple strategic concerns simultaneously.
Unlike standard coding benchmarks, Robocode demands spatial reasoning, algorithm design, and real-time tactical decisions. Success isn't measured by passing test cases; it's measured by wins and losses on a competitive ladder.
So, I decided to test GPT-5.2, Gemini 3, and Claude using the same methodology I'd use for my own robots: start with a basic prompt, iterate based on battle results, and see how far each model could climb against human-made strategies.
What is Robocode?
Robocode is a programming game where you code robots in Java to battle in a virtual arena. Your robot has a radar, a gun, and movement, all controlled by code you write. The challenge isn't just writing functional code; it's writing competitive code that outsmarts opponents.
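To make that concrete, here is a minimal sketch of what a robot looks like (the class and package names are placeholders; the API calls come from the standard robocode library):

```java
package sample; // placeholder package name

import robocode.AdvancedRobot;
import robocode.ScannedRobotEvent;
import robocode.util.Utils;

// A deliberately simple robot: spin the radar, and when an enemy is scanned,
// point the gun at it, fire, and keep moving so we aren't a sitting target.
public class MinimalBot extends AdvancedRobot {

    @Override
    public void run() {
        while (true) {
            setTurnRadarRight(360); // keep the radar sweeping so onScannedRobot keeps firing
            execute();              // commit all pending set* actions for this turn
        }
    }

    @Override
    public void onScannedRobot(ScannedRobotEvent e) {
        double absoluteBearing = getHeadingRadians() + e.getBearingRadians();
        // Turn the gun toward the enemy's current position (no lead, no prediction).
        setTurnGunRightRadians(
                Utils.normalRelativeAngle(absoluteBearing - getGunHeadingRadians()));
        setFire(2);    // medium bullet power
        setAhead(100); // basic movement so we never stand still
    }
}
```

Competitive robots replace each of those naive choices (where to move, how to aim, when to fire) with far more deliberate logic, which is where the strategies below come in.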
The game has been around for over 20 years, with a competitive community that has developed sophisticated strategies: Wave Surfing (dodging bullets by predicting their paths), GuessFactor Targeting (statistical aiming), and pattern-matching movement. Humans have spent many years refining these techniques.
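As a rough illustration of the statistical idea behind GuessFactor Targeting, here is a heavily simplified sketch (the class name, bin count, and structure are my own; real implementations track "waves" to know when to record each observation). The gun buckets the enemy's observed escape angles into bins and aims at the most-visited one:

```java
package sample; // placeholder package name

// Simplified GuessFactor bookkeeping: count how often the enemy ends up at each
// escape angle (expressed as a guess factor in [-1, 1]) and aim at the hottest bin.
public class GuessFactorGun {
    private static final int BINS = 31;       // odd so the middle bin means "head-on"
    private final int[] hits = new int[BINS]; // visit counts per guess-factor bin

    // Called when a tracked wave reaches the enemy; guessFactor is in [-1, 1].
    public void onWaveHit(double guessFactor) {
        int index = (int) Math.round((guessFactor + 1) / 2 * (BINS - 1));
        hits[Math.max(0, Math.min(BINS - 1, index))]++;
    }

    // Firing-angle offset for a given bullet power and lateral direction (+1 or -1).
    public double bestOffset(double bulletPower, int lateralDirection) {
        int best = BINS / 2;
        for (int i = 0; i < BINS; i++) {
            if (hits[i] > hits[best]) {
                best = i;
            }
        }
        double guessFactor = (double) (best - (BINS - 1) / 2) / ((BINS - 1) / 2);
        // Robocode physics: bullet speed = 20 - 3 * power, max robot speed = 8.
        double maxEscapeAngle = Math.asin(8.0 / (20 - 3 * bulletPower));
        return guessFactor * maxEscapeAngle * lateralDirection;
    }
}
```

Wave Surfing applies the same statistics in reverse: the robot estimates where the enemy is most likely to shoot and moves to the least-visited angle instead.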
To test these LLM-generated robots, I submitted each to robocode-arena.com, where they competed on a live ladder against both human-made and LLM-made robots for real ELO ratings.
The Experiment Setup
I tested 8 models across two tiers:
Basic Models:
- GPT-5.2-instant and GPT-5.1-instant (OpenAI)
- Gemini-3-fast (Google)
- Haiku-4.5 (Anthropic)
Advanced Models:
- GPT-5.2-thinking and GPT-5.1-thinking (OpenAI)
- Gemini-3-thinking (Google)
- Opus-4.5 (Anthropic)
Methodology
For each model, I followed this process:
- Select the target model on the answer engine's website
- Start with a basic prompt requesting a competitive Robocode robot
- Run local battles against ranked robots starting from the bottom of the ladder
- Analyze performance and provide specific feedback
- Iterate until the robot stops improving or degrades
- Submit the best version to robocode-arena.com for official ELO rating
I used the same personal system prompt that I use for answer engines across all models to keep the comparison fair and unbiased.
The Prompts
System Prompt (Used for All Models)
Always be concise and less verbose.
Don't use images unless asked.
Before responding, consider whether you have sufficient context. If any key detail is uncertain or unclear, ask clarifying questions first.
Clarify acronyms if they are not clear before responding.
Initial Prompt
Create a high-performing Robocode robot by studying the Robocode API and proven winning strategies, then implement and output the final optimized Java robot ready for submission to Robocode Arena.
Place all code in a [CLASS_NAME] class within the [PACKAGE_NAME] package, initializing the Javadoc @version to 1.0 and incrementing it with each update.
Example Follow-up Prompts
After watching the robot battle other ranked robots locally, I note its weaknesses and describe them in my next prompt without pointing out exact problems in the code or hinting at specific strategies to use.
In this prompt, I described to GPT-5.1-thinking what it didn't do well:
The wave surfing implementation is not effective against orbital movement. The Robot doesn't maintain optimal distance for firing. Its fire accuracy is low when the opponent is moving.
Here I am telling Haiku-4.5 to come up with a way to evade bullets:
Implement a basic evasion technique to avoid enemy fire, right now the robot remains static and gets easily hit by bullets
In this last example, I asked Gemini-3-thinking to balance dodging and firing:
The robot dodges bullets but doesn't fire. Revisit the firing logic one more time. There is something that limits the robot from firing when the opponent is moving. Find a way to predict movement and fire without waiting for the perfect opportunity.
Results: Basic Models
GPT-5.2-instant
Performance Overview
GPT-5.2-instant showed a clear improvement over GPT-5.1-instant by producing a solid, functional robot from the first iteration with no clarifications or hand-holding. Its behavior closely resembled Gemini-3-fast: it favored simple, effective fundamentals and avoided advanced techniques unless explicitly pushed, resulting in stable early performance. Compared to GPT-5.1-instant, it reached usefulness faster and with fewer regressions, but like 5.1, attempts to add Wave Surfing or GuessFactor degraded performance rather than improving it. Its best version emerged after only three iterations and could beat some low-ranked robots, but progress plateaued quickly.
Notable Strengths
- Strong first iteration
- Clean minimal implementation
- Very fast responses
- Effective simple evasion
Limitations
- Avoids advanced techniques
- Predictable movement patterns
- Early performance plateau
- Poor wall handling
Observations
- GPT-5.2 is a clear improvement over GPT-5.1, converging much faster on a working robot.
- The model chose to remove advanced techniques instead of debugging them when they didn't work.
- Performance gains came mostly from tuning fundamentals rather than strategic innovation.
Gemini-3-fast
Performance Overview
Gemini-3-fast progressed quickly to a competent baseline. It reached GPT-5.1-thinking-level performance in just two iterations but plateaued because it struggled to implement advanced techniques. It reliably improved without major regressions early on, but attempts at wave surfing and later guess-factor targeting caused significant performance drops. Its strongest version relied on refined distance control, reduced random movement, and steadier firing, enabling wins against some low-ranked robots. Overall, it optimized basic tactics well but struggled to balance movement, evasion, and firing once complexity increased.
Notable Strengths
- Rapid early progress
- Solid basic strategies
- Stable performance evolution
Limitations
- Poor advanced technique execution
- Ineffective guess-factor attempts
- Inconsistent movement-evasion balance
- Slower responses than GPT-5.1-instant
Observations
- It favored incremental, safe changes over aggressive optimization.
- It often avoided complex techniques unless explicitly instructed.
- Early versions showed reasonable code discipline for a fast model.
- Performance collapsed when tradeoffs increased, especially after adding GuessFactor targeting.
Haiku-4.5
Performance Overview
Haiku-4.5 repeatedly attempted to implement advanced techniques from the start (such as wave surfing and wall management) but consistently failed, producing broken robots that couldn't fire, target, or move meaningfully. Only after explicitly instructing it to abandon complexity and build a simple robot did it manage to produce a minimally functional version with basic firing and deterministic movement. Subsequent attempts to reintroduce advanced techniques were superficial and ineffective, offering no real gains in evasion or accuracy. Its best version improved very little and could only beat basic robots despite going through eight iterations.
Notable Strengths
- Responded to guidance
- Achieved basic firing
- Simple predictable movement
Limitations
- Repeated broken implementations
- Ineffective advanced techniques
- Poor wall management
- No meaningful evasion
Observations
- It consistently misinterpreted or oversimplified advanced techniques.
- Progress depended almost entirely on user-identified issues.
- The model produced deterministic movement patterns, making the robot predictable.
Results: Advanced Models
GPT-5.2-thinking
Performance Overview
GPT-5.2-thinking represented a significant step forward from GPT-5.1-thinking. It aggressively adopted advanced techniques from the first iteration; although they were initially broken, it successfully fixed them after feedback. Its first iteration performed worse than a basic robot due to incorrect implementations, but it rapidly improved and surpassed GPT-5.2-instant and GPT-5.1-thinking within two iterations. Its implementation of wave surfing and guess-factor targeting worked well, but persistent issues with dynamic distance management prevented further improvement. Its best version could beat some mid-ranked robots, marking a higher ceiling than 5.1.
Notable Strengths
- Strong feedback adaptation
- Comfortable with complexity
- Higher performance ceiling
Limitations
- Regression after peak
- Poor dynamic distance control
- Accuracy instability at range
Observations
- Recovered from broken advanced implementations faster than GPT-5.1-thinking.
- Showed clear iterative learning instead of early stagnation.
- GPT-5.2 showed stronger coding robustness and problem-solving depth than GPT-5.1.
Gemini-3-thinking
Performance Overview
Gemini-3-thinking implemented advanced techniques immediately: wave surfing, guess-factor targeting, energy management, and dynamic distance. It achieved a strong baseline from the very first iteration. While its movement, evasion, and placement were consistently strong, it struggled to balance firing accuracy and frequency. This caused progress to stall early. Attempts to refine accuracy led to increasingly conservative firing behavior. It produced a unique but risky strategy that involved constant dodging and waiting for high-certainty shots. Its best version could defeat several low-ranked robots but consistently lost to opponents with faster movement or more aggressive firing patterns.
Notable Strengths
- Immediate advanced technique usage
- Strong movement and evasion
- High potential accuracy
- Good dynamic positioning
Limitations
- Stalled progress early
- Regression when optimizing
- Poor accuracy-frequency balance
- Ineffective long-range firing
Observations
- The model showed strong architectural intuition but weak iterative tuning skills.
- Early success may have biased later iterations toward overfitting the firing logic.
- Attempts to refine gun behavior frequently destabilized already-working components such as firing timing.
Opus-4.5
Performance Overview
Opus-4.5 produced a highly accurate and well-structured implementation from the very first iteration, immediately delivering advanced techniques such as wave surfing, guess-factor targeting, wall smoothing, and adaptive power at a level far beyond all other models tested. It responded intelligently to behavioral feedback, iterating like an autonomous coding agent, and maintained strong performance with minimal guidance, beating medium-rank robots by the third iteration. Its main limitations were low long-range targeting accuracy and prediction quality, which it attempted to refine in later versions without measurable gains. Progress plateaued quickly, but its initial output quality and reliability were unmatched.
Notable Strengths
- Excellent advanced-technique implementation
- Accurate interpretation of feedback
- Consistent medium-tier performance
- Strong dodging and placement
Limitations
- Quick plateau
- High computational cost
- Weak long-range prediction
Observations
- It gathered required context independently, reducing the need for clarifications or step-by-step guidance.
- Its behavior resembled an autonomous coding agent, iterating thoughtfully around observed runtime issues.
- Despite strong architecture, its prediction model showed diminishing returns when refined across iterations.
Comparative Analysis
Final Rankings
| Model | ELO Rating | Ladder Rank | Iterations to Peak | Strategy Sophistication |
|---|---|---|---|---|
| Opus-4.5 | 1412 | 17 | 3 | Advanced |
| GPT-5.2-thinking | 1229 | 25 | 3 | Advanced |
| Gemini-3-thinking | 973 | 42 | 4 | Advanced |
| GPT-5.2-instant | 953 | 43 | 3 | Medium |
| Gemini-3-fast | 917 | 46 | 7 | Basic |
| GPT-5.1-thinking | 835 | 49 | 8 | Basic |
| Haiku-4.5 | 811 | 50 | 8 | Basic |
| GPT-5.1-instant | 626 | 53 | 8 | Basic |
Key Insights
Code Quality vs Performance
Code quality strongly correlated with performance, but correctness mattered more than sophistication, which is not surprising. GPT-5.2-thinking produced the highest-quality code overall, with accurate implementations of wave surfing, hybrid guns, and realistic physics, directly translating to battle wins. GPT-5.1-thinking and Opus-4.5 followed closely with solid advanced architectures. Models like Gemini-3-thinking and Haiku-4.5 demonstrated that good structure without correct indexing or physics results in near-basic performance despite the apparent complexity.
Iteration Efficiency
Opus-4.5 improved the fastest, reaching medium-rank performance by its third iteration. GPT-5.2-thinking reached similar performance but started from a broken first attempt and needed more feedback to get there. Gemini-3-thinking and GPT-5.1-thinking both reached peak performance within a few iterations but failed to improve further. Basic models like GPT-5.1-instant, Gemini-3-fast, and Haiku-4.5 required significantly more iterations and guidance, confirming that reasoning models generally need fewer iterations.
Strategic Discovery
Opus-4.5 and Gemini-3-thinking independently adopted and correctly implemented advanced techniques such as Wave Surfing and GuessFactor. GPT-5.2-thinking also adopted these techniques early but didn't implement them correctly and had to fix them after feedback. Gemini-3-fast, GPT-5.2-instant, and Haiku-4.5 required explicit guidance and tended to avoid or break advanced strategies.
Fast vs Thinking Models
The performance gap between fast and thinking models was substantial. Thinking models consistently achieved stronger movement, evasion, and targeting with fewer iterations, while fast models often plateaued at low-ranked performance. The added cost and latency of advanced models proved worthwhile when the goal was competitive robot performance.
What This Reveals About How LLMs Handle Complex Coding Tasks
This experiment exposed capabilities that standard benchmarks miss:
Spatial Reasoning Under Uncertainty
Unlike static coding challenges, Robocode requires reasoning about moving objects, trajectories, and positioning. Thinking models handled dynamic spatial problems more effectively, showing a stronger ability to model trajectories, timing, and positional risk in uncertain environments. Fast models tended to struggle when correct physics, movement prediction, or multi-step spatial reasoning was required.
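Even the simplest useful gun illustrates this: it has to step a moving target forward in time until a bullet could reach it. Here is a hedged sketch under Robocode's physics (bullet speed is 20 - 3 × power; the helper name and structure are mine, not any model's output):

```java
package sample; // placeholder package name

// Illustrative helper: iterative linear prediction of where a straight-moving
// target will be when a bullet of the given power reaches it.
public final class LinearPredictor {

    public static double linearFiringAngle(double myX, double myY,
                                           double targetX, double targetY,
                                           double targetHeading, double targetVelocity,
                                           double bulletPower) {
        double bulletSpeed = 20 - 3 * bulletPower; // Robocode bullet speed rule
        double predictedX = targetX;
        double predictedY = targetY;

        // Advance the target one tick at a time until a bullet fired now
        // would have covered the distance to the predicted position.
        for (int tick = 0;
             tick * bulletSpeed < Math.hypot(predictedX - myX, predictedY - myY);
             tick++) {
            predictedX += Math.sin(targetHeading) * targetVelocity; // Robocode: 0 rad = north, clockwise
            predictedY += Math.cos(targetHeading) * targetVelocity;
        }

        // Absolute firing angle (in Robocode's convention) toward the predicted position.
        return Math.atan2(predictedX - myX, predictedY - myY);
    }
}
```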
Iterative Refinement
Real development is iterative. Models with reasoning capabilities incorporated feedback more reliably and maintained context across iterations. Fast models applied only narrow, specific fixes, lost earlier intent, and regressed even with explicit, repeated guidance.
Autonomous Strategy Selection
Reasoning models were able to identify and select appropriate advanced strategies on their own, demonstrating an understanding of which techniques mattered and how they fit the problem. Fast models tended to default to basic implementations, and when pushed to adopt advanced strategies, those implementations were often incomplete, incorrect, or unstable.
Trade-off Management
Every Robocode decision involves trade-offs: aggressive vs defensive, accuracy vs fire rate, CPU time vs complexity. Reasoning models could balance competing goals such as accuracy, aggression, efficiency, and survivability. Fast models frequently over-optimized a single dimension, which led to brittle behavior.
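Bullet power is a concrete example: heavier bullets deal more damage but travel slower and cost more energy, so they only pay off at close range. A minimal sketch of one common way to encode that trade-off (the thresholds are illustrative, not taken from any model's output):

```java
package sample; // placeholder package name

// Illustrative bullet-power rule: fire hard when close, conserve energy when
// far away or running low. Thresholds are made up for this sketch.
public final class PowerSelector {
    public static double chooseBulletPower(double distanceToEnemy, double myEnergy) {
        if (myEnergy < 15) {
            return 0.5; // survival mode: cheap shots only
        } else if (distanceToEnemy < 150) {
            return 3.0; // point blank: maximum damage
        } else if (distanceToEnemy < 400) {
            return 2.0; // mid-range: balance damage against bullet speed
        }
        return 1.0;     // long range: faster, cheaper bullets hit more often
    }
}
```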
Limitations and Bias
This experiment has clear caveats:
- My prompts and iteration strategy affected all results
- 1v1 combat only, didn't test melee or teams
- Created a single robot per model, though with multiple iterations
- Didn't factor API costs or latency into the comparison
Conclusion
Traditional benchmarks tell us if a model can solve a problem. Robocode showed me how models approach open-ended challenges: their strategic thinking, iteration efficiency, and ability to incorporate feedback.
Thinking models consistently delivered more robust solutions with fewer iterations and responded better to feedback. This appears to stem from longer iteration cycles, during which they implicitly research, validate, and refine their approach, likely performing multiple internal passes before producing a response. Their structured reasoning helps catch obvious issues early, whereas fast models tend to generate code token-by-token, often resulting in implementations that compile but fail to form a coherent or stable strategy.
The most surprising outcome was that fast models were still capable of producing functional robots, particularly in the GPT-5.2-instant case, which could defeat some low-ranked human-made bots after only a few rapid iterations. Even though testing and feedback took longer than generation itself, the fact that a fast, low-cost model could solve a problem involving physics, movement prediction, and risk assessment (not just syntax) was genuinely impressive.
These results highlight the rapid progress of LLMs toward producing high-quality, well-structured code that can match or exceed what many developers would write, even for complex, dynamic problems. While it's unclear when this progress will plateau, the current pace already delivers substantial value relative to cost, especially when models are chosen appropriately for the complexity and risk of the task.
The robots are still battling on robocode-arena.com. You can see the live ladder rankings at the leaderboard or submit your own robot to the arena.