Goodbye Chatbots, Hello Multimodality and Autonomous Agents: What OpenAI and Google’s Latest Releases Mean for Business

By Leon Gauhman, chief strategy officer of digital product consultancy Elsewhen

If OpenAI’s ChatGPT turned the business world on its head, the releases keep rolling. OpenAI has just launched GPT-4o, which combines faster GPT-4-level intelligence with a “refreshed” interface that “reasons” across images, video, voice, code, and text. A day later, Google unveiled Project Astra, a Gemini-powered multimodal AI agent that can likewise answer questions in real time via audio, video, and text.

“We are looking at the future of interaction of ourselves and machines,” declared OpenAI chief technology officer Mira Murati.  “GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural and far easier.”

Where OpenAI and Google lead, rival AI platforms are bound to follow. So, how will this next generation of AI tech influence a new chapter in business growth and performance?

Here are five transformational trends businesses can expect to see:

Goodbye chatbot, hello 3D work assistant: ChatGPT’s original breakthrough was applying a consumer-friendly interface to an LLM. In these latest releases, voice is a key interface, with GPT-4o and Project Astra introducing real-time, human-style conversational speech. The real game changer, however, is expanding the interface to include vision. The GPT-4o and Project Astra demos both showed the AI models watching someone’s screen, including live video, and responding in real time. This is a significant breakthrough: these AIs can now see the human user, including their facial expressions and surroundings, and interpret what they see as input. As a result, they can develop far more meaningful interactions with human users than chatbots limited to text or voice inputs.

A key challenge for businesses is to build on these new voice and vision capabilities, developing more imaginative, interpersonal and intuitive employee tools and interfaces that are easy and enjoyable to use and that improve productivity.
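To make the shift from text box to camera concrete, here is a minimal sketch of what “vision as input” looks like at the API level, using OpenAI’s Python SDK to send a still image alongside a text question to GPT-4o. The image URL and the question are placeholders, and a production assistant would stream live voice and video rather than a single frame.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o to interpret an image (placeholder URL) alongside a text question.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is on this whiteboard, and what should the team tackle first?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard-photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```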

Multimodality for better context: Project Astra and GPT-4o show what happens when multimodality – the ability to take in and reason over images, audio, video and text within a single system – is built into an AI. These AIs can now understand visual and verbal context together, combining them into a richer, more sophisticated sense of the world around them.

This level of multimodality represents a step change in productivity, allowing companies to give their AI a much wider body of information and context to reason with. Ultimately, it is a move towards capturing and querying an organization’s entire corpus of knowledge and experience.

Autonomous agents for task flow: GPT-4o and Project Astra give an idea of what an independent AI agent could look like. In the demos, both AIs create a sense of continuity across tasks, but they lack real agency. Even so, their enhanced capabilities and ability to “reason” independently are a step closer to the era of autonomous agents. Autonomous agents will have greater agency over allocated tasks: they will be able to execute multiple steps and act on results without requiring human supervision. BabyAGI, developed by the open-source AI community, is an early example.

Although BabyAGI was initially built to automate venture capital tasks, autonomous agents will impact research and workflow management far more broadly. The tech can learn, understand and carry out tasks independently, adding and reprioritising tasks as it goes.
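BabyAGI is an open-source Python project; the snippet below is not its actual code but a simplified, hypothetical illustration of the loop such agents run: pull a task from a queue, execute it with an LLM, then generate and reprioritise follow-up tasks based on the result. The `llm` argument stands in for any callable that sends a prompt to a model and returns its text response.

```python
from collections import deque

def run_agent(objective: str, llm, max_steps: int = 5) -> list[str]:
    """Simplified BabyAGI-style loop: execute a task, then create and reprioritise tasks."""
    tasks = deque(["Draft an initial plan for the objective."])
    results: list[str] = []

    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()

        # 1. Execute the current task in light of the objective and earlier results.
        result = llm(f"Objective: {objective}\nTask: {task}\nEarlier results: {results}")
        results.append(result)

        # 2. Ask the model for follow-up tasks, one per line ('none' if finished).
        follow_ups = llm(
            f"Objective: {objective}\nLast result: {result}\n"
            "List any new tasks needed, one per line, or reply 'none'."
        )
        for line in follow_ups.splitlines():
            if line.strip() and line.strip().lower() != "none":
                tasks.append(line.strip())

        # 3. Reprioritise the remaining queue against the objective.
        if tasks:
            reordered = llm(
                f"Objective: {objective}\nReorder these tasks by priority, one per line:\n"
                + "\n".join(tasks)
            )
            tasks = deque(t.strip() for t in reordered.splitlines() if t.strip())

    return results
```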

Context windows for reliability: The bigger an AI model’s context window, the more information it can take in at once, which is why context length has become one of AI’s biggest battlegrounds. Google’s latest model, Gemini 1.5 Pro, now boasts a context window of two million tokens (up from one million), allowing it to process far more text, audio, video and code in a single prompt.

The business value implications of larger context windows are huge. If you can give an AI model a complete repository of context about your company (e.g., competitor data and market trends) with every prompt, it will be able to reason in a way that is closely aligned with your business.
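As a rough illustration of what “giving the model the complete repository with every prompt” involves, here is a minimal, hypothetical sketch of context packing: concatenating a company’s background documents into each prompt and checking the result against the model’s advertised window. The four-characters-per-token figure is only a rule of thumb; a real system would use the provider’s own tokenizer and would summarise or retrieve selectively rather than simply fail.

```python
CONTEXT_WINDOW_TOKENS = 2_000_000  # e.g. Gemini 1.5 Pro's advertised two-million-token window

def estimate_tokens(text: str) -> int:
    """Crude heuristic (~4 characters per token); not a real tokenizer."""
    return len(text) // 4

def build_prompt(question: str, documents: list[str]) -> str:
    """Pack company documents (competitor data, market trends, etc.) into one prompt."""
    context = "\n\n".join(documents)
    prompt = f"Company context:\n{context}\n\nQuestion: {question}"
    if estimate_tokens(prompt) > CONTEXT_WINDOW_TOKENS:
        raise ValueError("Context pack exceeds the model's window; trim, summarise or retrieve selectively.")
    return prompt
```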

The ultimate brainstorming copilot: When asked what he most likes about GPT-4, OpenAI’s Sam Altman said he values its brainstorming input, saying there is “a glimmer of something amazing in there.” During the GPT-4o launch, OpenAI talked up human-to-machine collaboration, yet Google arguably did a better job of showcasing AI’s potential as a collaboration tool. The demo of its upgraded, Gemini Pro-powered NotebookLM tool showed the user – a parent – asking the AI to redesign physics study material so that it catered both to his teenager’s preference for audio learning and to the teen’s love of basketball. In response, the AI transformed the uploaded study material into a highly personalized, podcast-like discussion about the physics of basketball. If this level of collaboration can be delivered at scale – an AI model working closely with people and customizing its output to their highly individualized requirements – it has transformative implications for how we learn, teach and work. Businesses must consider how best to train and deploy AI copilots to maximize employees’ newly liberated potential and creativity.

Final thought

Along with new developments in other foundation models such as Claude 3 and Llama 3, the recent “updates” from OpenAI and Google show that AI transformation is accelerating. The key for businesses is to experiment with small steps and adopt a “thought partnership” approach to AI, focussing on what AI and humans can do together rather than apart.