Siri Reference Resolution
Upgrading Siri for conversational AI

COMPANY: APPLE

CATEGORY: CONVERSATIONAL AI

Siri

Conversational awareness is one of the next big leaps for AI. It allows users to ask follow-up questions related to the context at hand. While this may seem like a basic skill in human conversation, it is an incredibly complex task for an AI. We reference content in a multitude of ways: through the countless forms found in language, through what we're looking at on screen, and even through what we're doing on our device at a given moment. Harry Simmonds, the founder of Studio Elros, previously led the team responsible for shipping Siri's ability to understand complex verbal and on-screen references.

What was done

  • Referencing on-screen entities through voice (iOS 15)

  • Referring to people, places, and things in conversation 

01.

Understanding core language used in verbal references

Often, references in language are vague, quickly pronounced in speech, and applicable to an ambiguous set of things in context, making them extremely challenging for an AI that lacks the same contextual awareness. We've all heard "What was that?!" and generally know what someone is talking about because of a clear signal in the real world, but unravelling this context is a phenomenal challenge. The team built models to help understand these references, considering the use of pronouns like "it", "they", "she", and "that" alongside the user's previous utterances. For example, "How old is Barack Obama?" might be followed by "How tall is he?" Voice assistants can also struggle with quickly pronounced references, since most people don't emphasise them when they say things like "When was she born?"
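
As a rough illustration of the turn-based half of this problem, here is a minimal Python sketch, entirely hypothetical and not Apple's implementation, that resolves a pronoun against entities mentioned in previous utterances using simple gender and number agreement:

```python
from dataclasses import dataclass

# Purely illustrative sketch: resolve a pronoun against entities
# mentioned in earlier turns using gender/number agreement. A production
# system would use learned models over far richer signals.

@dataclass
class Entity:
    name: str
    gender: str  # "male", "female", or "neutral"
    number: str  # "singular" or "plural"

# Rough pronoun feature table; real coverage is much larger.
PRONOUN_FEATURES = {
    "he":   ("male", "singular"),
    "she":  ("female", "singular"),
    "it":   ("neutral", "singular"),
    "they": ("neutral", "plural"),
}

def resolve_pronoun(pronoun: str, history: list[Entity]) -> Entity | None:
    """Return the most recently mentioned entity that agrees in features."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    for entity in reversed(history):  # prefer the most recent mention
        if entity.gender == gender and entity.number == number:
            return entity
    return None

# "How old is Barack Obama?" followed by "How tall is he?"
history = [Entity("Barack Obama", "male", "singular")]
referent = resolve_pronoun("he", history)
print(referent.name if referent else "unresolved")  # -> Barack Obama
```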

02.

Associating references to entities

To understand what the user wants, it's important to consider the surrounding context, which can include on-screen content. For example, if a user is looking at a phone number on a website and says "Call this number," how can we extract that information? This is a relatively straightforward example, but what if we want to be able to refer to almost any entity available?

Creating the largest possible set of contextualised entities is a significant challenge, and the team worked on an architecture that could accomplish this.
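
To make the idea concrete, here is a hedged Python sketch of the general pattern: extract typed entities from on-screen text, then bind a deictic reference like "this number" to the best-matching one. The entity types, regex, and helper names are illustrative assumptions, not Apple's architecture:

```python
import re
from dataclasses import dataclass

# Hypothetical sketch: build a set of contextual entities from on-screen
# text, then bind a reference ("this number") to a matching entity.

@dataclass
class ScreenEntity:
    kind: str   # e.g. "phone_number", "address", "contact"
    value: str

def extract_entities(screen_text: str) -> list[ScreenEntity]:
    """Very simplified extraction; a real system covers many entity types."""
    entities = []
    for match in re.finditer(r"\+?\d[\d\s\-().]{7,}\d", screen_text):
        entities.append(ScreenEntity("phone_number", match.group()))
    return entities

def resolve_reference(requested_kind: str,
                      entities: list[ScreenEntity]) -> ScreenEntity | None:
    """Pick the first on-screen entity whose type matches the request."""
    return next((e for e in entities if e.kind == requested_kind), None)

# User says "Call this number" while viewing a web page.
screen = "Contact us: +1 (555) 010-9999, Mon-Fri 9am-5pm"
target = resolve_reference("phone_number", extract_entities(screen))
print(target.value if target else "no matching entity")
```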

03.

Expanding references to various modes of interaction

Human-to-human conversation involves a wealth of non-verbal cues, such as head and body gestures, eye movements, and facial expressions. In an ideal scenario, an AI would be able to fully comprehend and interpret these cues. We've all heard friends vent their frustrations with voice assistants like Google Assistant and Siri, and some of the errors can seem silly to us; as engineers, though, we need to consider all the possible things Siri could have misinterpreted when hearing a simple command like "Play it."

The challenge lies in developing technology that can capture and process all of this contextual information, giving the AI access to these different cues so it can learn from them. Harry's team was part of building the intelligence behind this future, with the goal of creating an even more nuanced way for AI to understand conversational references.
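
As a closing illustration, here is a purely hypothetical Python sketch of what disambiguating "Play it" might involve: gather candidate referents from different signals (dialogue history, screen content, app state) and rank them by salience. The sources, weights, and scores are invented for the example:

```python
from dataclasses import dataclass

# Hypothetical sketch: candidate referents for "Play it" arrive from
# different contextual signals, each with a salience score. Weights and
# values are made up for illustration.

@dataclass
class Candidate:
    description: str
    source: str      # "dialogue", "screen", or "app_state"
    salience: float  # 0.0 - 1.0, prominence of this signal right now

SOURCE_WEIGHTS = {"app_state": 1.2, "screen": 1.0, "dialogue": 0.8}

def best_referent(candidates: list[Candidate]) -> Candidate:
    """Rank candidates by weighted salience and return the top one."""
    return max(candidates, key=lambda c: c.salience * SOURCE_WEIGHTS[c.source])

candidates = [
    Candidate("song mentioned two turns ago", "dialogue", 0.4),
    Candidate("album open in Music app", "screen", 0.7),
    Candidate("paused podcast episode", "app_state", 0.9),
]
print(best_referent(candidates).description)  # -> paused podcast episode
```

In practice such weighting would be learned rather than hand-tuned, but the shape of the problem is the same: the assistant has to pick one referent from several competing contexts.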

Media Credit: Apple Inc.

Let’s talk

Subscribe to our upcoming newsletter for occasional updates! We only send emails when we have something to say.

© 2023 Studio Elros
