Beware the IDEs of March: Cursor vs Windsurf vs Claude Code

Rob Balian
CTO @ Reprompt
Our eng team used Cursor, Windsurf, and Claude Code on real-world code. Here's what we learned.
Assessment notes: I'm primarily talking about the agent mode of each tool. Autocomplete on all of them is more or less the same.
Cursor: Recent Regression
Cursor 0.46 came out alongside Claude 3.7 Sonnet, and it introduced major issues that made agent mode unusable. I burned 5 hours debugging the terrible code it produced before exploring other options. What happened:
They reduced the amount of code they put into the context window (presumably to save tokens). Instead, they rely on their agent to look at code 50 lines at a time. As you can imagine, this doesn't work that well.
The community is super annoyed, and we're finding that it's easy to swap between IDEs.
They've probably released a fix by now, but I haven't seen it because I churned.
Reading the tea leaves: Cursor engineers saw that Windsurf had a better agent mode that used fewer tokens, tried to replicate it, failed, and ironically pushed more people to other IDEs.
Windsurf: Quite good
Sadly, it took all of 2 minutes to move from Cursor to Windsurf. They're both based on VS Code, and Windsurf auto-imported all my Cursor settings (which were, of course, originally imported from VS Code proper).
Why we liked Windsurf:
Agent mode works and seems to pull in context more intelligently than 50 lines of code at a time. It works the way an actual developer would: find a function whose purpose is unclear, grep the codebase for it, understand the context, continue.
Feels faster and more responsive than Cursor (Cursor's chat window has a barely perceptible delay on each keystroke, and you only notice it once you try a faster version).
Claude Code: Least Control, Highest Cost, Max Scalability
Claude Code isn't an IDE; it's a terminal app, so you still need an IDE open if you want to make edits yourself. Still, it performed quite well:
Uses more tokens and costs more (you pay by the token), and tends to take longer between human inputs

Biggest difference: Because it's a terminal app, you can run multiple agents simultaneously!
Examples of how we’ve used multiple Claude Code instances so far:
One agent does client, one does server, and one does CSS
One agent per bug (which I’ve filed using MCP while coding or talking to customers)
One agent writes the tests, one writes the code and then runs those tests until they all pass
One agent reviews the code that another wrote, while a third starts a new task.
We haven't tried a top-level agent that orchestrates multiple Claude Code agents yet. A rough sketch of launching several agents in parallel is below.
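For the curious, here's a minimal sketch of what kicking off agents in parallel can look like. It assumes the claude CLI accepts a one-shot prompt via a -p/--print flag (check claude --help on your version); the task split, directories, and prompts are made up.

```python
import subprocess

# Hypothetical task split: one agent per area of the codebase.
tasks = [
    ("client", "Fix the broken pagination on the locations table"),
    ("server", "Speed up the slow locations query"),
    ("client", "Clean up the CSS for the upload modal"),
]

# Launch one Claude Code process per task. Assumes `claude -p "<prompt>"`
# runs a single prompt non-interactively; verify the flag with `claude --help`.
procs = [
    subprocess.Popen(["claude", "-p", prompt], cwd=directory)
    for directory, prompt in tasks
]

# Wait for every agent to finish and report its exit code.
for (directory, prompt), proc in zip(tasks, procs):
    proc.wait()
    print(f"[{directory}] {prompt!r} -> exit {proc.returncode}")
```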
Key Finding: AI still sucks at debugging and refactoring

Debugging with AI is significantly more difficult than writing new code. And refactoring isn't as great as we expected.
We're experimenting with repomix to quickly pack the codebase into an LLM-readable format and send it to Claude/o1 for recommendations. But the dream is telling Claude: "Please refactor this code to be simpler."
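Here's roughly what that flow looks like as a sketch. It assumes repomix's --output flag and the Anthropic Python SDK; the output filename and model name are placeholders.

```python
import subprocess
from anthropic import Anthropic  # pip install anthropic

# Pack the repo into one LLM-readable file (output flag assumed; see the repomix docs).
subprocess.run(["npx", "repomix", "--output", "packed-repo.xml"], check=True)
with open("packed-repo.xml") as f:
    packed = f.read()

# Ask Claude for refactoring recommendations on the packed codebase.
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder; use whatever model you have access to
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": f"Please refactor this code to be simpler:\n\n{packed}",
    }],
)
print(response.content[0].text)
```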
NB: All of these agents tend to add tons of additional complexity the first time around. About 40% of the time, I have to re-prompt with "There has to be a simpler way" or "Please retry with MAX simplicity" and then they do a better job.
MCPs
MCP (Model Context Protocol) is a fancy name for integrations with LLMs. They're overhyped right now but still pretty cool. Examples I found useful (with a minimal server sketch after them):
GitHub: "File an issue for this", "review PR #7128", "Write a PR for issue #7129"
Postgres: "How many locations did org ID abd-123 upload in the past week? Split by country"
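And if you want to roll your own, the official MCP Python SDK keeps a toy server small. A minimal sketch with a made-up tool that stubs out the Postgres query above; you'd point your IDE (or Claude Code) at it through its MCP config:

```python
# pip install mcp  (the official Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("reprompt-tools")  # hypothetical server name

@mcp.tool()
def locations_uploaded(org_id: str, days: int = 7) -> str:
    """Count locations an org uploaded in the last N days (stubbed for the sketch)."""
    # A real server would run the SQL against Postgres; hard-coded here.
    return f"org {org_id}: 42 locations uploaded in the last {days} days"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default, which the IDEs can connect to
```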