Improving Claude Computer Use
Anthropic's recent release of Computer Use for claude-3-5-sonnet-20241022 is super exciting—we're finally getting good abstractions for "bring your own compute" with SOTA LLMs. Hopefully this pushes OpenAI to do something similar, alongside the more "behind the scenes compute" features they're building. At Aneta, we're working on automating the old tools scientists use every day, and letting Claude go click around a screen is very tempting.
I wanted to dive in and play right away, but a few things were stopping me. So, I built a library (surprise surprise) that re-implements all the required tools without requiring that everything run in one container.
Demo to Production
The quickstart for Computer Use has a handy-dandy Dockerfile and hosted image (ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest) that anyone can pull down and get started with. This is neat for quick demos, but it had three major problems for us: spinning up random containers doesn't mesh well with the rest of our infra, keeping sandboxes running persistently is expensive, and there's no state management built in. Luckily, we have a good chunk of our infra on Modal, which provides pretty good primitives for most of this stuff.
Porting the tool code to Modal could have been as simple as spinning up a function using the above hosted image as the source, but we'd still have all the same problems. Instead, I wanted everything to feel Modal-native. There are three major tools to port: Bash, Edit, and Computer. There's also a bunch of logic needed to maintain sandboxes and map them to requests.
To handle state management, I built a coordination server that attempts to stay alive for the life of an interaction. It stores the state needed to continue using the same tools in the same sandbox, request after request. However, this is a distributed system, so there's always a chance it goes down, and I had to be careful to make any in-memory state easily restorable. This has a nice bonus: the coordination server, and even the sandboxes, can scale to zero when not being used and are easily brought back when needed.
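For a concrete (if simplified) picture, the durable bits can live in a modal.Dict so that any fresh replica of the coordination server can pick a request back up. The Dict name and record shape below are hypothetical, not the library's actual schema; it's just a sketch of the idea.

import modal

# Hypothetical schema: everything a fresh coordination server needs to resume a request.
sessions = modal.Dict.from_name("computer-use-sessions", create_if_missing=True)

def save_session(request_id: str, sandbox_id: str, bash_pid: int) -> None:
    sessions[request_id] = {
        "sandbox_id": sandbox_id,  # which modal.Sandbox to re-attach to
        "bash_pid": bash_pid,      # the long-lived bash session inside it
    }

def restore_session(request_id: str):
    # Returns None for a brand-new request (or one whose record was cleared).
    return sessions.get(request_id)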
Tool Implementations
Bash is, as it sounds, a tool that lets the model run commands in an open bash session, with options to start and restart sessions. In the Anthropic implementation, a reference to an asyncio.subprocess.Process instance is kept when the session is first started and used for bidirectional communication with the shell. In an attempt to get something a bit more robust, my implementation simply requires three things to identify an existing session: the request_id, the ID of the modal.Sandbox, and the PID of the bash process running in the sandbox. When the model first makes a request with the Bash tool, a new session is created and these three identifiers are returned. After that, any server managing that sandbox for the LLM can submit the identifiers along with the command to run; the sandbox is looked up, the process is located, and the stdin, stdout, and stderr streams are obtained. Even if the coordination server goes down, the session stays alive as long as the sandbox does!
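To make the reattach flow concrete, here's a minimal sketch. The function name and liveness check are mine, and for brevity it execs the command directly rather than feeding the long-lived bash process's stdin the way the real tool does; the point is that the three identifiers are enough to resume from any server.

import modal

def run_in_session(sandbox_id: str, pid: int, command: str) -> str:
    sb = modal.Sandbox.from_id(sandbox_id)  # look the sandbox back up from any server
    # Check that the session's bash process is still alive inside the sandbox.
    probe = sb.exec("sh", "-c", f"kill -0 {pid} 2>/dev/null && echo alive || echo dead")
    if probe.stdout.read().strip() != "alive":
        raise RuntimeError("bash session is gone; start a new one")
    # The real tool writes to the session's stdin; this sketch just runs the
    # command in the same sandbox and returns its output.
    proc = sb.exec("bash", "-lc", command)
    proc.wait()
    return proc.stdout.read()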
Edit is a tool for making filesystem changes. It mostly wraps a modal.NetworkFileSystem that is created when the sandbox is. The model is instructed to only use the tool to read and write files within a given directory, where the NFS is mounted. Changes are synced back to the filesystem as they're made, so the same file states are visible to the LLM (via the sandbox) and the coordination server. The only durable state needed is the edit history for the "undo" command, which we simply store in a modal.Dict, keyed by the request_id.
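As an illustration of how little state that is, the undo bookkeeping can be as simple as a per-request stack of previous file contents in that modal.Dict. The names and record shape here are mine, not the library's:

import modal

history = modal.Dict.from_name("edit-history", create_if_missing=True)  # hypothetical name

def record_edit(request_id: str, path: str, previous_contents: str) -> None:
    # Push the pre-edit contents so "undo" can restore them later.
    stack = history.get(request_id, [])
    stack.append({"path": path, "previous": previous_contents})
    history[request_id] = stack  # write the whole stack back; values must be serializable

def undo_last_edit(request_id: str):
    # Pop the most recent edit; the caller writes last["previous"] back to last["path"].
    stack = history.get(request_id, [])
    if not stack:
        return None
    last = stack.pop()
    history[request_id] = stack
    return last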
Computer is the big boy; it allows the LLM to move, click, and drag the mouse around as desired. It mostly works on a coordinate system, which is super neat—whatever black magic Anthropic is doing to map the pixels received by the model to consistent coordinates works really well. This one was the trickiest to implement and debug, but didn't actually entail much complexity, since it is totally stateless.
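To see why it can be stateless, here's roughly what an action boils down to: a one-shot command run in the sandbox, with nothing to remember between calls. This sketch assumes xdotool and an X display at :1, as in Anthropic's reference demo, and isn't necessarily what the library does internally.

import shlex

import modal

def click(sandbox_id: str, x: int, y: int) -> None:
    # Move the mouse and left-click at the given coordinates.
    sb = modal.Sandbox.from_id(sandbox_id)
    sb.exec("bash", "-c", f"DISPLAY=:1 xdotool mousemove {x} {y} click 1").wait()

def type_text(sandbox_id: str, text: str) -> None:
    # Type a string into whatever currently has focus.
    sb = modal.Sandbox.from_id(sandbox_id)
    sb.exec("bash", "-c", f"DISPLAY=:1 xdotool type -- {shlex.quote(text)}").wait()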
Perf!
After finishing Computer, I realized my implementation was much slower than the example. This makes sense: instead of everything running locally inside one container, we're making network calls and serializing/deserializing at each tool call. I wasn't satisfied with this, so I set out to improve the speed. I noticed a few things, some unique to Modal and some about the original code, that could be improved.
- The first thing the model often does is try to open Firefox, and the first launch of the browser was sloooow in the sandbox, I'm guessing because of some profile initialization and update logic. A simple timeout 30 firefox -headless during the image build step fixed this and greatly sped up first-request latency.
- Firefox was super slow rendering pages, sometimes even freezing up for particularly heavy ones. Adding a cheap GPU (Nvidia T4) to the sandbox environment fixed that nicely.
- Screenshots were being resized (with convert) inside the sandbox, which was slowing things down a lot: the sandbox isn't a particularly beefy machine, and it's running on multiple layers of virtualization. By moving this logic to the coordination server, the sandbox can send the screenshots over as-is, allowing the server to post-process and freeing the sandbox up for the next command.
- The model would often fail several times to use the Edit tool when it was replacing chunks including newlines or lots of spaces (try "print a cow in bash, then write it to a file, and add a fun joke to the top of the file using the edit tool" in the original demo repo). By adding a very conservative fuzzy-matching scheme (sketched below), the model can be off by a Levenshtein distance of up to 3 and still have its request succeed. Yes, this is a tradeoff that opens up some edge cases, but from what I saw it was clearly worth it.
- The model spends a lot of time apt-get installing things. Installing apt-fast and instructing the model to always use it was a nice win.
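For the fuzzy matching, here's the kind of conservative scheme I mean. This is a sketch rather than the library's exact code: it only accepts a replacement target if some same-length window of the file is within an edit distance of 3 of the text the model asked for, and gives up otherwise.

MAX_DISTANCE = 3  # conservative threshold

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def find_fuzzy_match(file_contents: str, target: str):
    # Exact matches always win; otherwise slide a target-sized window over the file
    # and take the closest candidate, but only if it's close enough.
    if target in file_contents:
        return target
    n = len(target)
    best, best_dist = None, MAX_DISTANCE + 1
    for start in range(len(file_contents) - n + 1):
        candidate = file_contents[start:start + n]
        dist = levenshtein(candidate, target)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best if best_dist <= MAX_DISTANCE else None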
Wrapping it Up
With the Modal-centric implementation and the few performance tweaks, we're now playing with Computer Use in our production product. I think this approach is pretty solid for getting a more resilient implementation up and running, so I wrapped it up into a library here. If you use Modal, having Computer Use tools for your LLM is as simple as:
$ pip install computer-use-modal
$ modal deploy computer_use_modal
followed by:
from uuid import uuid4

from modal import Cls

server = Cls.lookup("anthropic-computer-use-modal", "ComputerUseServer")
response = await server.messages_create.remote.aio(  # inside an async context
    request_id=uuid4().hex,
    user_messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
)
print(response)
{
"role": "assistant",
"content": [
BetaTextBlock(
text="According to the National Weather Service, the current weather in San Francisco is:\n\nTemperature: 65°F (18°C)\nHumidity: 53%\nDewpoint: 48°F (9°C)\nLast update: October 23, 2:43 PM PDT\n\nThe website shows the forecast details as well. Would you like me to provide the extended forecast for the coming days?",
type="text",
)
]
}
Check out the repo and let me know what you think!