Exam AI-103 Topic 1 Question 48 Discussion

Actual exam question for Microsoft's AI-103 exam
Question #: 48
Topic #: 1

You are creating an agent workflow in a Microsoft Foundry project to support natural voice interactions.
The agent must receive continuous audio input, convert the input into text for reasoning, and then return spoken responses to a user. The workflow must meet the following requirements:
- Support turn-taking dynamics, where the agent begins to generate the speech output before the user finishes speaking.
- Operate with low latency to maintain a conversational experience.
You need to enable both speech to text and text to speech in a real-time agent interaction.
What should you do?

A. Use an embeddings model to encode the audio, and then decode the audio into text and speech.
B. Use batch transcription to convert the audio input and return text responses from the agent.
C. Use speech translation to convert the audio into another language and return the translated text.
D. Use real-time speech to text for incoming audio and text to speech for agent responses.

Suggested Answer: D

The correct answer is D. Use real-time speech to text for incoming audio and text to speech for agent responses. The workflow requires continuous audio input, low-latency transcription for reasoning, and spoken output back to the user. Real-time speech to text in Azure Speech in Foundry Tools transcribes streaming audio as it arrives, which covers the incoming-audio side of the interaction. Text to speech covers the outbound side by speaking the response once the agent has generated it.
This pattern aligns with Microsoft's real-time voice-agent architecture. The Voice Live API overview explains that low-latency speech-to-speech systems integrate speech recognition, generative reasoning, and text to speech to create natural voice experiences. It also identifies contact centers as a key scenario and highlights low perceived latency for end users.
The other options do not meet the requirements. Embeddings (A) produce vector representations of content; they do not decode audio into text or conversational speech. Batch transcription (B) works on stored audio files, so it introduces file-oriented delay, is not suitable for turn-taking, and returns text rather than speech. Speech translation (C) is only appropriate when translating between languages and does not provide the required reasoning-plus-spoken-response loop.
Reference topics: Azure Speech in Foundry Tools, real-time speech to text, text to speech, voice agents, low-latency interaction, and conversational turn-taking.
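As a rough illustration of option D, the sketch below wires continuous speech recognition to speech synthesis with the Azure Speech SDK for Python (the azure-cognitiveservices-speech package). It is a minimal sketch under stated assumptions, not the exam's reference implementation: agent_reply is a hypothetical placeholder for the agent's reasoning step, and the SPEECH_KEY and SPEECH_REGION environment variable names are assumptions. It replies after each recognized phrase and does not by itself implement overlapping turn-taking.

```python
import os


def agent_reply(user_text: str) -> str:
    """Hypothetical placeholder for the agent's reasoning step."""
    return f"You said: {user_text.strip()}"


def run_voice_loop(seconds: float = 30.0) -> None:
    """Listen on the default microphone and speak a reply to each phrase."""
    # Imported here so the module can load without the SDK installed.
    import time
    import azure.cognitiveservices.speech as speechsdk

    # Assumed environment variable names for the Speech resource.
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["SPEECH_KEY"],
        region=os.environ["SPEECH_REGION"],
    )

    # Real-time speech to text: audio is streamed from the microphone.
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
    )

    # Text to speech: replies are played on the default speaker.
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config,
        audio_config=speechsdk.audio.AudioOutputConfig(use_default_speaker=True),
    )

    def on_recognized(evt) -> None:
        # Fires once per finalized phrase while recognition keeps running.
        result = evt.result
        if result.reason == speechsdk.ResultReason.RecognizedSpeech and result.text:
            # .get() blocks this callback until synthesis completes.
            synthesizer.speak_text_async(agent_reply(result.text)).get()

    recognizer.recognized.connect(on_recognized)
    recognizer.start_continuous_recognition()
    try:
        time.sleep(seconds)  # keep listening for a fixed window
    finally:
        recognizer.stop_continuous_recognition()
```

Calling run_voice_loop(10) would listen for about ten seconds, assuming the SDK is installed and a Speech resource key and region are set. Continuous recognition with event callbacks is what separates this from batch transcription: phrases are delivered while the session is still open instead of after a file has been processed.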

by Tiffany at Jun 29, 2026, 10:38 AM
