ACT-1: Transformer for Actions

AI has moved at an incredible pace in the last few years. Scaling up Transformers has led to remarkable capabilities in language (e.g., GPT-3, PaLM, Chinchilla), code (e.g., Codex, AlphaCode), and image generation (e.g., DALL-E, Imagen).

At Adept, we are building the next frontier of models that can take actions in the digital world—that’s why we’re excited to introduce our first large model, Action Transformer (ACT-1).

Why are we so excited about this?

First, we believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal, and ACT-1 is our first step in this direction.

Second, the next era of computing will be defined by natural language interfaces that allow us to tell our computers what we want directly, rather than doing it by hand. We hope these snippets of ACT-1 will give you a window into the next frontier of computing as we see it!

Capability preview

ACT-1 is a large-scale Transformer trained to use digital tools; among other things, we recently taught it how to use a web browser. Right now, it’s hooked up to a Chrome extension that lets ACT-1 observe what’s happening in the browser and take certain actions, such as clicking, typing, and scrolling. The observation is a custom “rendering” of the browser viewport that’s meant to generalize across websites, and the action space is the set of UI elements available on the page.
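To make the observe-and-act setup concrete, here is a minimal sketch of such a loop in Python. It is not Adept's implementation: every name in it (`Element`, `Action`, `FakeBrowser`, `scripted_policy`, `run_episode`) is hypothetical, the "browser" is an in-memory stand-in for the extension, and a hand-written rule plays the role of the model. It only illustrates the shape described above, where the observation is a list of UI elements and each action targets one of them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    """One UI element in the (hypothetical) rendered viewport."""
    element_id: int
    kind: str  # e.g. "input", "button", "text"
    text: str


@dataclass
class Action:
    """An action on one element; kind is "click" or "type" in this toy."""
    kind: str
    element_id: int
    text: str = ""


# A policy maps (goal, observation) to the next action, or None when done.
Policy = Callable[[str, List[Element]], Optional[Action]]


class FakeBrowser:
    """In-memory stand-in for the browser: a search box and a button."""

    def __init__(self) -> None:
        self.query = ""
        self.submitted = False

    def observe(self) -> List[Element]:
        elements = [
            Element(0, "input", self.query),
            Element(1, "button", "Search"),
        ]
        if self.submitted:
            # A results line appears once the search has been submitted.
            elements.append(Element(2, "text", f"Results for {self.query}"))
        return elements

    def apply(self, action: Action) -> None:
        if action.kind == "type" and action.element_id == 0:
            self.query = action.text
        elif action.kind == "click" and action.element_id == 1:
            self.submitted = True


def scripted_policy(goal: str, observation: List[Element]) -> Optional[Action]:
    """Toy rule-based policy standing in for the model (an assumption)."""
    if any(e.kind == "text" for e in observation):
        return None  # results are visible, so the goal is met
    box, button = observation[0], observation[1]
    if box.text != goal:
        return Action("type", box.element_id, goal)
    return Action("click", button.element_id)


def run_episode(goal: str, browser: FakeBrowser, policy: Policy,
                max_steps: int = 10) -> List[Action]:
    """Alternate observing and acting until the policy stops or steps run out."""
    trace: List[Action] = []
    for _ in range(max_steps):
        action = policy(goal, browser.observe())
        if action is None:
            break
        browser.apply(action)
        trace.append(action)
    return trace
```

Calling `run_episode("houses in Houston", FakeBrowser(), scripted_policy)` yields a two-step trace: type the query, then click the button. The `max_steps` cap is a design choice in this sketch so that a policy which never signals completion cannot loop forever.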

There’s a lot of room to make it faster, both on the modeling side and on the software side, so we expect future systems to have latency that’s largely imperceptible to humans. The videos below have been sped up to make them easier to view. An upcoming technical post will go into much more detail on all of these topics.

Here are some cool things ACT-1 can do!

ACT-1 can take a high-level user request and execute it. The user simply types a command into the text box and ACT-1 does the rest. In this example, fulfilling a single goal requires repeatedly taking actions and observations over a long time horizon.

https://player.vimeo.com/video/749413832?h=15f094bbb9&title=0&byline=0&portrait=0

This can be especially powerful for manual tasks and complex tools. In this example, what might ordinarily take 10+ clicks in Salesforce can now be done with just a sentence.

https://player.vimeo.com/video/749413804?h=15f094bbb9&title=0&byline=0&portrait=0

Working in-depth in tools like spreadsheets, ACT-1 demonstrates real-world knowledge, infers what we mean from context, and can help us do things we may not even know how to do.

https://player.vimeo.com/video/749413815?h=15f094bbb9&title=0&byline=0&portrait=0

The model can also complete tasks that require composing multiple tools together; most things we do on a computer span multiple programs. In the future, we expect ACT-1 to be even more helpful by asking for clarifications about what we want.

https://player.vimeo.com/video/749413825?h=15f094bbb9&title=0&byline=0&portrait=0

The internet contains a lot of knowledge about the world! When the model doesn’t know something, it knows how to just look up the information online (seen here in voice mode).

https://player.vimeo.com/video/749413798?h=15f094bbb9&title=0&byline=0&portrait=0

ACT-1 doesn’t know how to do everything, but it’s highly coachable. With a single piece of human feedback, it can correct mistakes, becoming more useful with each interaction.

https://player.vimeo.com/video/749597375?h=15f094bbb9&title=0&byline=0&portrait=0

Looking ahead

Natural language interfaces, powered by action transformers like ACT-1, will dramatically expand what people can do in front of a computer/phone/internet-connected device. A few years from now, we believe:

Most interaction with computers will be done using natural language, not GUIs. We’ll tell our computer what to do, and it’ll do it. Today’s user interfaces will soon seem as archaic as landline phones do to smartphone users.
Beginners will become power users, no training required. Anyone who can articulate their ideas in language can implement them, regardless of expertise. Software will become even more powerful as advanced features become accessible to everyone and are no longer constrained by the length of a drop-down menu.
Documentation, manuals, and FAQs will be for models, not for people. No longer will we need to learn the quirky language of every individual software tool in order to be effective at a task. We will never search through forums for “how to do X in Salesforce or Unity or Figma” — the model will do that work, allowing us to focus on the higher-order task at hand.
Breakthroughs across all fields will be accelerated with AI as our teammate. Action transformers will work with us to bring about advances in drug design, engineering, and more. Collaborating with these models will make us more efficient, energized, and creative.

While we’re excited that these systems can transform what people can do on a computer, we clearly see that they have the potential to cause harm if misused or misaligned with user preferences. Our goal is to build a company with large-scale human feedback at the center — models will be evaluated on how well they satisfy user preferences, and we will iteratively evaluate how well this is working as our product becomes more sophisticated and load-bearing. To combat misuse, we plan to use a combination of machine learning techniques and careful, staged deployment.

What we’ve shown above is only scratching the surface — we’re making great progress towards Adept being able to do arbitrary things on a computer. We have ambitious goals in both the short and long term, and we’re hiring visionary and talented people across roles to make it happen — you can apply here.


