Hey HN, Anders and Tom here. We had a post about our AI test automation framework 2 months ago that got a decent amount of traction (https://news.ycombinator.com/item?id=43796003).
We got some great feedback from the community, with the most positive response being about our vision-first approach used in our browser agent. However, many wanted to use the underlying agent outside the testing domain. So today, we're releasing our fully featured AI browser automation framework.
You can use it to automate tasks on the web, integrate between apps without APIs, extract data, test your web apps, or as a building block for your own browser agents.
Traditionally, browser automation could only be done via the DOM, even though that’s not how humans use browsers. Most browser agents are still stuck in this paradigm. With a vision-first approach, we avoid relying on flaky DOM navigation and perform better on complex interactions found in a broad variety of sites, for example:
- Drag and drop interactions
- Data visualizations, charts, and tables
- Legacy apps with nested iframes
- Canvas and WebGL-heavy sites (like design tools or photo editing)
- Remote desktops streamed into the browser
To interact accurately with the browser, we use visually grounded models to execute precise actions based on pixel coordinates. The model used by Magnitude must be smart enough to plan out actions but also able to execute them. Not many models are both smart *and* visually grounded. We highly recommend Claude Sonnet 4 for the best performance, but if you prefer open source, we also support Qwen-2.5-VL 72B.
Most browser agents never make it to production. This is because of (1) the flaky DOM navigation mentioned above, and (2) the lack of control most browser agents offer. The dominant paradigm is that you give the agent a high-level task + tools and hope for the best. This quickly falls apart for production automations that need to be reliable and specific. With Magnitude, you have fine-grained control over the agent with our `act()` and `extract()` syntax, and can mix it with your own code as needed. You also have full control of the prompts at both the action and agent level.
```ts
import { z } from 'zod';

// Assumes `agent` is an already-started Magnitude browser agent

// Magnitude can handle high-level tasks
await agent.act('Create an issue', {
  // Optionally pass data that the agent will use where appropriate
  data: {
    title: 'Use Magnitude',
    description: 'Run "npx create-magnitude-app" and follow the instructions',
  },
});

// It can also handle low-level actions
await agent.act('Drag "Use Magnitude" to the top of the in progress column');

// Intelligently extract data based on the DOM content matching a provided zod schema
const tasks = await agent.extract(
  'List in progress issues',
  z.array(z.object({
    title: z.string(),
    description: z.string(),
    // Agent can extract existing data or new insights
    difficulty: z.number().describe('Rate the difficulty between 1-5'),
  })),
);
```
We have a setup script that makes it trivial to get started with an example: just run "npx create-magnitude-app". We’d love to hear what you think!
Glad you were able to get it set up quickly!
We are currently optimizing for reliability and quality, which is why we suggest Claude - but it can get expensive in some cases. Using Qwen 2.5-VL-72B will be significantly cheaper, though it may not always be as reliable.
Most of our usage right now is for running test cases, and people often seem to prefer Qwen for that use case, since test cases typically make it clearer how each step should be executed.
Something that is top of mind for us is figuring out a good way to "cache" the workflows that get taken. This way you can repeat automations either with no LLM or with a smaller/cheaper LLM. This would enable deterministic, repeatable flows that are also very affordable and fast. So even if each step on the first run is only 95% reliable - if it gets through it, it could repeat it with 100% reliability.
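To put the 95% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification and not a measured property of any agent; `flowSuccessRate` is just an illustrative helper name.

```ts
// Chance that a flow of `steps` steps completes end to end,
// ASSUMING each step succeeds independently with probability `perStep`.
function flowSuccessRate(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

// Print the end-to-end success rate for a few flow lengths at 95% per step.
for (const steps of [1, 5, 10, 20]) {
  const rate = flowSuccessRate(0.95, steps);
  console.log(`${steps} steps: ${(rate * 100).toFixed(1)}%`);
}
```

Under that assumption a 10-step flow completes only about 60% of the time on a fresh run, which is why replaying a run that already got through is attractive.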
I am desperately waiting for someone to write exactly this! Use the LLM to write the repeatable, robust script. If the script fails, THEN fall back to an LLM to recover and fix the script.
Yes, I wish we could combine browser use, stagehand, director.ai, and playwright. Even better if I could record my session (mouse movements, clicks, DOM inspection, screen sharing, and my voice explaining what I want to do) and then have an LLM generate a scraper for each task and recover when a scraping task breaks at some point.
https://github.com/browser-use/workflow-use
Yeah, I think it's a little tricky to do this well + automatically, but it is essentially our goal - not necessarily literally writing a script, but storing the actions taken by the LLM and being able to repeat them, adapting only when needed.
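A minimal sketch of that "store the actions, replay them, adapt only when needed" idea might look like the following. Everything here is hypothetical and for illustration only: `RecordedAction`, `Executor`, and `createWorkflowCache` are made-up names, not part of Magnitude's API, and the sketch assumes the executor can tell when a replayed action no longer matches the page.

```ts
// Hypothetical shapes, not Magnitude's API.
type RecordedAction =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string };

interface Executor {
  // Ask the LLM to carry out a step and return the actions it took.
  plan(step: string): Promise<RecordedAction[]>;
  // Re-run one stored action without an LLM; resolves to false if the
  // page no longer looks the way it did when the action was recorded.
  replay(action: RecordedAction): Promise<boolean>;
}

function createWorkflowCache(executor: Executor) {
  const recorded = new Map<string, RecordedAction[]>();

  // Replay stored actions in order, stopping at the first mismatch.
  async function replayAll(actions: RecordedAction[]): Promise<boolean> {
    for (const action of actions) {
      if (!(await executor.replay(action))) return false;
    }
    return true;
  }

  // Try the cheap deterministic path first; fall back to the LLM only
  // when there is no recording yet or the recording no longer works.
  async function run(step: string): Promise<'replayed' | 'planned'> {
    const cached = recorded.get(step);
    if (cached !== undefined && (await replayAll(cached))) return 'replayed';
    const actions = await executor.plan(step);
    recorded.set(step, actions); // overwrite the stale recording
    return 'planned';
  }

  return { run };
}
```

One thing this sketch glosses over: if a replay fails partway through, the page is left in a half-finished state, so the fallback has to plan from wherever the replay stopped rather than from a clean start.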