AI has moved at an incredible pace in the last few years. Scaling up Transformers has led to remarkable capabilities in language (e.g., GPT-3, PaLM, Chinchilla), code (e.g., Codex, AlphaCode), and image generation (e.g., DALL-E, Imagen).
At Adept, we are building the next frontier of models that can take actions in the digital world—that’s why we’re excited to introduce our first large model, Action Transformer (ACT-1).
Why are we so excited about this?
First, we believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal, and ACT-1 is our first step in this direction.
Second, the next era of computing will be defined by natural language interfaces that allow us to tell our computers what we want directly, rather than doing it by hand. We hope these snippets of ACT-1 will give you a window into the next frontier of computing as we see it!
Sign up here to join the waitlist for the upcoming alpha release of our first product built around ACT-1.
Capability preview
ACT-1 is a large-scale Transformer trained to use digital tools. Among other things, we recently taught it how to use a web browser. Right now, it’s hooked up to a Chrome extension that allows ACT-1 to observe what’s happening in the browser and take certain actions, such as clicking, typing, and scrolling. The observation is a custom “rendering” of the browser viewport that’s meant to generalize across websites, and the action space is the UI elements available on the page.
There’s a lot of room to make ACT-1 faster, both on the modeling side and on the software side, so we expect future systems will have latency that’s largely imperceptible to humans. These videos have been sped up to make them easier for you to view. An upcoming technical post will go into much more detail on all of these topics.
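To make the observe-and-act setup described above concrete, here is a minimal Python sketch of such a loop. It is an illustration under our own assumptions, not ACT-1's actual interface: `UIElement`, `Observation`, `Action`, `FakeBrowser`, `scripted_policy`, and `run_episode` are all hypothetical names, the toy browser stands in for the Chrome extension, and a hard-coded policy stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class UIElement:
    element_id: int
    role: str   # e.g. "textbox" or "button"
    label: str


@dataclass
class Observation:
    # Hypothetical stand-in for the "rendering" of the viewport:
    # just the list of UI elements currently available on the page.
    elements: List[UIElement]


@dataclass
class Action:
    kind: str                         # "click", "type", "scroll", or "done"
    element_id: Optional[int] = None  # target element, if any
    text: str = ""                    # text to enter for "type" actions


class FakeBrowser:
    """Toy page with one search box and one button (assumption, not a real browser)."""

    def __init__(self) -> None:
        self.query = ""
        self.submitted = False

    def observe(self) -> Observation:
        return Observation([
            UIElement(0, "textbox", "Search"),
            UIElement(1, "button", "Go"),
        ])

    def apply(self, action: Action) -> None:
        if action.kind == "type" and action.element_id == 0:
            self.query = action.text
        elif action.kind == "click" and action.element_id == 1:
            self.submitted = True


Policy = Callable[[str, Observation, List[Action]], Action]


def scripted_policy(goal: str, obs: Observation, history: List[Action]) -> Action:
    """Placeholder for the model: type the goal, click "Go", then stop."""
    if len(history) == 0:
        return Action("type", element_id=0, text=goal)
    if len(history) == 1:
        return Action("click", element_id=1)
    return Action("done")


def run_episode(goal: str, browser: FakeBrowser, policy: Policy,
                max_steps: int = 10) -> List[Action]:
    """Alternate observations and actions until the policy says "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = browser.observe()
        action = policy(goal, obs, history)
        if action.kind == "done":
            break
        # The action space is restricted to elements present in the observation.
        valid_ids = {element.element_id for element in obs.elements}
        if action.element_id is not None and action.element_id not in valid_ids:
            raise ValueError(f"unknown element id: {action.element_id}")
        browser.apply(action)
        history.append(action)
    return history


browser = FakeBrowser()
history = run_episode("weather in Paris", browser, scripted_policy)
```

In this sketch the policy sees the goal, the current observation, and the actions taken so far, which is one simple way to carry context across many steps. A real system would replace `scripted_policy` with a learned model and `FakeBrowser` with a live browser.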
Here are some cool things ACT-1 can do!
ACT-1 can take a high-level user request and execute it. The user simply types a command into the text box and ACT-1 does the rest. In this example, fulfilling a single goal requires repeatedly taking actions and making observations over a long time horizon.
https://player.vimeo.com/video/749413832?h=15f094bbb9&title=0&byline=0&portrait=0
This can be especially powerful for manual tasks and complex tools. In this example, what might ordinarily take 10+ clicks in Salesforce can now be done with just a sentence.
https://player.vimeo.com/video/749413804?h=15f094bbb9&title=0&byline=0&portrait=0
Working in-depth in tools like spreadsheets, ACT-1 demonstrates real-world knowledge, infers what we mean from context, and can help us do things we may not even know how to do.
https://player.vimeo.com/video/749413815?h=15f094bbb9&title=0&byline=0&portrait=0
The model can also complete tasks that require composing multiple tools together; most things we do on a computer span multiple programs. In the future, we expect ACT-1 to be even more helpful by asking for clarifications about what we want.
https://player.vimeo.com/video/749413825?h=15f094bbb9&title=0&byline=0&portrait=0
The internet contains a lot of knowledge about the world! When the model doesn’t know something, it knows how to just look up the information online (seen here in voice mode).
https://player.vimeo.com/video/749413798?h=15f094bbb9&title=0&byline=0&portrait=0
ACT-1 doesn’t know how to do everything, but it’s highly coachable. With a single piece of human feedback, it can correct mistakes, becoming more useful with each interaction.
https://player.vimeo.com/video/749597375?h=15f094bbb9&title=0&byline=0&portrait=0
Looking ahead
Natural language interfaces, powered by action transformers like ACT-1, will dramatically expand what people can do in front of a computer/phone/internet-connected device. A few years from now, we believe:
- Most interaction with computers will be done using natural language, not GUIs. We’ll tell our computer what to do, and it’ll do it. Today’s user interfaces will soon seem as archaic as landline phones do to smartphone users.
- Beginners will become power users, no training required. Anyone who can articulate their ideas in language can implement them, regardless of expertise. Software will become even more powerful as advanced features become accessible to everyone and are no longer constrained by the length of a drop-down menu.
- Documentation, manuals, and FAQs will be for models, not for people. No longer will we need to learn the quirky language of every individual software tool in order to be effective at a task. We will never search through forums for “how to do X in Salesforce or Unity or Figma” — the model will do that work, allowing us to focus on the higher-order task at hand.
- Breakthroughs across all fields will be accelerated with AI as our teammate. Action transformers will work with us to bring about advances in drug design, engineering, and more. Collaborating with these models will make us more efficient, energized, and creative.
While we’re excited that these systems can transform what people can do on a computer, we clearly see that they have the potential to cause harm if misused or misaligned with user preferences. Our goal is to build a company with large-scale human feedback at the center — models will be evaluated on how well they satisfy user preferences, and we will iteratively evaluate how well this is working as our product becomes more sophisticated and load-bearing. To combat misuse, we plan to use a combination of machine learning techniques and careful, staged deployment.
What we’ve shown above is only scratching the surface — we’re making great progress towards Adept being able to do arbitrary things on a computer. We have ambitious goals in both the short and long term, and we’re hiring visionary and talented people across roles to make it happen — you can apply here.