Google’s Gemini for mobile will have better multi-modal AI than this year’s AI gadgets

Google teases advancements in Gemini’s multi-modal AI capabilities.


At its much-anticipated annual I/O event, Google announced exciting new functionality for its Gemini AI model, particularly its multi-modal capabilities, in a pre-recorded video demo. The demo, part of what Google has dubbed “Project Astra,” showed how the company has been hard at work developing Gemini’s ability to respond to visual and audio context in real time, a capability it’s calling Gemini Live.

Although it sounds a lot like the Instagram or TikTok feature, “Live” for Gemini refers to the ability to “show” Gemini your view via your camera and have a two-way conversation with the AI in real time. It’s sort of like FaceTiming with a friend who knows everything about everything.
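For readers curious what this kind of multi-modal exchange looks like programmatically, here is a minimal sketch using Google’s publicly available google-generativeai Python SDK, sending a single camera frame plus a question and then following up in the same chat session. To be clear, this is an illustration of the concept, not the Gemini Live feature itself: the model choice, file name, and API key are placeholder assumptions, and the real feature works over continuous video and audio rather than single frames.

# pip install google-generativeai pillow
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# gemini-1.5-flash is one of Google's publicly available multi-modal
# models; whether Gemini Live runs the same model is an assumption.
model = genai.GenerativeModel("gemini-1.5-flash")

# A single saved camera frame stands in for the live video feed.
frame = PIL.Image.open("camera_frame.jpg")  # hypothetical file

chat = model.start_chat()
reply = chat.send_message([frame, "What am I looking at right now?"])
print(reply.text)

# Because this is a chat session, a follow-up keeps the visual context:
reply = chat.send_message("What could I use it for?")
print(reply.text)

The gap between this single-frame sketch and what Google demoed, responding to a continuous audiovisual stream quickly enough to feel conversational, is exactly the engineering leap Project Astra is meant to showcase.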

Also: Everything announced at Google I/O 2024: Gemini, Search, Project Astra, and more

This year has seen this kind of AI technology appear in a host of other devices, like the Rabbit R1 and the Humane AI Pin, two non-smartphone devices that launched this spring to a flurry of hopeful curiosity but ultimately didn’t do much to challenge the smartphone’s supremacy.

Now that these devices have had their moment in the sun, Google’s Gemini AI has taken the stage with its snappy, conversational multi-modal AI and brought the focus squarely back to the smartphone. 

Google teased this functionality the day before I/O in a tweet that showed off Gemini correctly identifying the stage at I/O, then giving additional context to the event and asking follow-up questions of the user. 

In the demo video at I/O, the user turns on their smartphone’s camera and pans around the room, asking Gemini to identify what it sees and provide context on the surroundings. What was most impressive was not just the responses Gemini gave, but how quickly they were generated, resulting in the natural, conversational interaction Google has been trying to convey.


Also: 3 new Gemini Advanced features unveiled at Google I/O 2024

Google’s so-called Project Astra centers on bringing this cutting-edge AI technology down to the scale of the smartphone, which is part of why Google says it built Gemini with multi-modal capabilities from the beginning. But getting the AI to respond and ask follow-up questions in real time has apparently been the biggest challenge.

During its R1 launch demo in April, Rabbit showed off similar multi-modal AI technology that many lauded as an exciting feature. Google’s teaser video shows the company has been hard at work developing similar functionality for Gemini that, from the looks of it, might even be better.

Also: What is Gemini Live? A first look at Google’s new real-time voice AI bot

Google isn’t alone in its multi-modal AI breakthroughs. Just a day earlier, OpenAI showed off its own updates during its Spring Update livestream, including GPT-4o, its newest AI model, which now enables ChatGPT to “see, hear, and speak.” During the demo, presenters showed the AI a host of different objects and scenarios via their smartphone’s camera, including a handwritten math problem and a presenter’s facial expressions, all of which the AI correctly identified through a similar conversational back-and-forth.

Also: Google’s new ‘Ask Photos’ AI solves a problem I have every day

When Google updates Gemini on mobile later this year with this feature, the company’s technology could jump to the front of the pack in the AI assistant race, particularly given Gemini’s exceedingly natural-sounding cadence and follow-up questions. Although the exact breadth of its capabilities is yet to be seen, this development positions Gemini as perhaps the best-integrated multi-modal AI assistant.


Folks who attended Google’s I/O event in person had a chance to demo Gemini’s multi-modal AI for mobile in a controlled “sandbox” environment at the event, but we can expect broader hands-on access later this year.
