The Large Action Model (LAM) is a new foundation model that understands human intentions on computers using neuro-symbolic techniques.
The Large Action Model, or LAM, models human intentions expressed through actions on computers and, by extension, in the physical world. Our key observation is that the inherent structure of human-computer interactions differs from that of natural language or vision: an application is expressed in a form that is more structured than a rasterized image, yet more verbose and noisy than a sentence or a paragraph. The characteristics we desire from a LAM also differ from those of a foundation model that understands language or vision alone: while we may want an intelligent chatbot to be creative, the actions a LAM learns on applications should be highly regular, minimalistic (per Occam's razor), stable, and explainable.
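To make the "more structured than pixels, noisier than a sentence" point concrete, here is a minimal sketch of an interface represented as a tree of elements. The `UIElement` schema, roles, and labels are illustrative assumptions, not an actual LAM data format.

```python
# Hypothetical sketch: an application's interface as a structured tree,
# in contrast to a rasterized screenshot (pixels) or a flat text dump.
# The schema and element names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class UIElement:
    role: str                      # e.g. "button", "textfield", "form"
    label: str = ""                # human-readable label, often noisy
    children: List["UIElement"] = field(default_factory=list)

    def count(self) -> int:
        """Total number of elements in this subtree."""
        return 1 + sum(c.count() for c in self.children)

# A tiny search form: richer structure than an image, more verbose than text.
search_form = UIElement("form", "Search flights", [
    UIElement("textfield", "From"),
    UIElement("textfield", "To"),
    UIElement("button", "Search"),
])

print(search_form.count())  # 4 elements in the tree
```

Even this toy tree carries the kind of hierarchy and role information that a screenshot flattens away and a sentence rarely spells out.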
These fresh perspectives allowed us to develop unique formulations and models that are surprisingly effective on the benchmarks we care about. We designed the stack from the ground up, from the data-collection platform to a new network architecture that combines transformer-style attention and graph-based message passing with symbolic algorithms.
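As a rough intuition for how attention and graph-based message passing can coexist in one layer, here is a minimal numpy sketch. The shapes, the chain-shaped adjacency, and the additive combination are all assumptions for illustration; this is not the actual LAM architecture.

```python
# Minimal numpy sketch of a layer combining transformer-style attention
# with graph-based message passing over UI-element features. This is an
# illustrative assumption, not the actual LAM architecture.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Self-attention: every element can attend to every other element."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def message_pass(X, A, W):
    """Message passing: each element averages features over its neighbors."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    return (A @ X) / deg @ W

rng = np.random.default_rng(0)
n, d = 5, 8                          # 5 UI elements with 8 features each
X = rng.normal(size=(n, d))          # e.g. embeddings of roles and labels
W = [rng.normal(size=(d, d)) for _ in range(4)]

A = np.zeros((n, n))                 # chain-shaped UI hierarchy (assumed)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1

H = attention(X, W[0], W[1], W[2]) + message_pass(X, A, W[3])
print(H.shape)  # (5, 8)
```

Attention lets any element influence any other, while message passing respects the explicit graph structure of the interface; summing the two is one simple way to use both signals.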
LAM's modeling approach is rooted in imitation, or learning by demonstration: it observes a human using an interface and aims to reliably replicate the process, even if the interface is later presented differently or slightly changed. Instead of a black-box model uncontrollably outputting actions and adapting to the application at inference time, LAM's "recipe" is observable: once a demonstration is provided, the synthesized routine runs directly on the target application without a busy loop of "observations" or "thoughts," and any technically trained human should be able to inspect the recipe and reason about its inner workings. Both symbolic and neural components contribute to this process: neural networks are used to understand language and vision and to perform zero-shot reasoning, while symbolic algorithms extract salient substructures and propose action sequences on formalized representations of target applications. As LAM accumulates knowledge from demonstrations over time, it gains a deep understanding of every aspect of an interface exposed by an application and builds a "conceptual blueprint" of the underlying service the application provides. LAM can be seen as a bridge, connecting users to these services through the application's interface.
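A toy sketch of what such a "recipe" might look like: the demonstrated actions reference interface elements symbolically (by role and label) rather than by pixel position, so the routine still replays after the layout changes. The function names, dictionary schema, and action vocabulary are hypothetical.

```python
# Hedged sketch of a learned "recipe": a demonstrated action sequence that
# targets elements by symbolic description rather than screen coordinates,
# so it survives small interface changes. All names are illustrative.

def find_element(ui, role, label):
    """Depth-first search for an element matching a symbolic description."""
    if ui["role"] == role and label.lower() in ui.get("label", "").lower():
        return ui
    for child in ui.get("children", []):
        hit = find_element(child, role, label)
        if hit:
            return hit
    return None

def run_recipe(ui, recipe):
    """Replay a demonstration on a (possibly rearranged) interface."""
    log = []
    for action, role, label in recipe:
        target = find_element(ui, role, label)
        if target is None:
            raise LookupError(f"no {role} matching {label!r}")
        log.append(f"{action} {role} '{target['label']}'")
    return log

# Demonstrated once, then replayed even though the layout changed.
recipe = [("type", "textfield", "to"), ("click", "button", "search")]
ui = {"role": "form", "label": "Flights", "children": [
    {"role": "button", "label": "Search"},      # moved above the field
    {"role": "textfield", "label": "To"},
]}
print(run_recipe(ui, recipe))
```

Because the recipe is just data, a technically trained reader can inspect every step, in contrast to a black-box policy emitting actions at inference time.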
We believe that in the long run, LAM will exhibit its own version of "scaling laws," in which the actions it learns generalize to applications of all kinds, even generative ones. As we invest in more computational power, LAM could become increasingly helpful in solving complex problems that span multiple apps and require professional skills to operate.
By utilizing neuro-symbolic techniques in the loop, LAM sits at the frontier of interdisciplinary scientific research in language modeling (LM), programming languages (PL), and formal methods (FM). Traditionally, the PL/FM community has focused on symbolic techniques: solver technologies that rely on the logical principles of induction and deduction and on heuristic search. While these symbolic techniques can be highly explainable and come with strong guarantees, they face a scalability limit. By contrast, recent innovations in the LM community are grounded in machine learning and neural techniques: while highly scalable, they lack explainability and come with no guarantees about the output produced. Inspired by the success of machine learning and neural techniques, the PL/FM community has recently made waves of progress on neuro-symbolic methods: by combining neural techniques (such as LLMs) with symbolic ones, one gets the best of both worlds, making scalable and explainable learning agents feasible. Yet to date, no one has put cutting-edge neuro-symbolic techniques into production; LAM seeks to pioneer this direction.
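The neuro-symbolic pattern described above can be sketched as propose-and-check: a scalable but unguaranteed proposer generates candidate action plans, and a symbolic checker accepts only those provable against a formal model of the application. The transition table, state names, and hard-coded proposer (standing in for an LLM) are all illustrative assumptions.

```python
# Toy propose-and-check loop: a "neural" proposer (stubbed with a fixed
# candidate list standing in for an LLM) suggests action sequences, and a
# symbolic checker verifies them against a small formal model of the app.
# Everything here is an illustrative assumption.

# Formal model: allowed transitions of a hypothetical login screen.
TRANSITIONS = {
    ("logged_out", "enter_password"): "ready",
    ("ready", "click_login"): "logged_in",
}

def check(plan, start="logged_out", goal="logged_in"):
    """Symbolic verification: does the plan provably reach the goal?"""
    state = start
    for action in plan:
        state = TRANSITIONS.get((state, action))
        if state is None:
            return False            # action not allowed in this state
    return state == goal

def propose():
    """Stub for a neural proposer: scalable but offers no guarantees."""
    return [
        ["click_login"],                        # plausible but invalid
        ["enter_password", "click_login"],      # valid
    ]

verified = [plan for plan in propose() if check(plan)]
print(verified)  # only the provably correct plan survives
```

The proposer contributes scalability and the checker contributes guarantees and explainability, which is the division of labor the paragraph above attributes to neuro-symbolic methods.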