Commentary

What Google, Microsoft AI Can See

The companies behind the top two search engines have begun rolling out multimodal AI models with vision capabilities, with user permission, of course. Google and Microsoft have been demonstrating the features over the past year, and they are now beginning to show what the technology can do and how it will change search.

There is no word yet on whether Google and Microsoft will give advertisers an AI vision feature that lets them test ads across search.

Wouldn't it be useful if Google's Project Astra and Microsoft's Copilot Vision could test ads in real time and then adjust them based on how consumers react?

Here's what these AI vision-based agents can do now.

Project Astra is Google’s vision for AI agents. The company first demonstrated the technology in 2024 and recently began rolling out the feature for smartphones running the Android operating system, integrating Project Astra's capabilities with Gemini.


In real time, the feature lets Gemini see what is on the user's screen and answer questions about it.

Astra also lets Gemini use the phone’s front-facing camera, so a user can, for example, ask Gemini to help with a task such as choosing a paint color.

In December 2024, Google explained that the Project Astra prototype was powered by Gemini 2.0 at the time and used an Android app or prototype glasses to capture the world as a person sees it.

In a demo video, Astra summarized what it saw and answered questions, pulling content from Google services such as Search, Maps, Lens and Gemini.

Microsoft Copilot Vision, which can see what people do on the web, was announced in October 2024 and released in preview that December. I first received a notice about it earlier this month. The tool can see, explain, and give context to websites.

This past week, Microsoft released Copilot Vision for Android devices for Pro subscribers in the United States.

Copilot Vision lets users point their Android phone’s camera at something, share what they see with Copilot, and talk about it to get help.

Access to Copilot Vision in Edge has expanded to all users in the U.S. for free. Copilot can browse alongside the user and instantly scan, analyze, and offer insights based on what it sees on the page they are viewing.

The tool can understand the full context of what someone does online.

When someone chooses to enable Copilot Vision, it sees the page they are looking at, reads along, and can talk through a problem the user may face. Vision makes browsing an interactive experience.

Copilot can now answer questions with responses that include images or videos where appropriate. These enhancements are currently available on desktop and will roll out to mobile in the coming weeks.

In February, Microsoft introduced Magma, a model that integrates visual perception with language comprehension to help AI-powered assistants or robots understand surroundings they have not been trained on.

The model can suggest appropriate actions for new tasks, such as using a tool or navigating a website and clicking a button to execute a command. Microsoft said this is a significant step toward AI agents that can serve as versatile, general-purpose assistants.

Shelly Palmer, CEO of The Palmer Group, a consulting practice, has provided details on how to think about and then write or code an AI agent. He outlines product requirements, from defining the problem to focusing on the outcome. Keep it simple and pay attention to security requirements, he explains.

"The gap between 'I want an agent' and 'I have an agent' is bridged by a clear Product Requirements Document (PRD)," he wrote, showing how to create one.

 
