Exam AI-103 Topic 1 Question 48 Discussion
Actual exam question for Microsoft's AI-103 exam
Question #: 48
Topic #: 1
You are creating an agent workflow in a Microsoft Foundry project to support natural voice interactions.
The agent must receive continuous audio input, convert the input into text for reasoning, and then return spoken responses to a user. The workflow must meet the following requirements:
- Support turn-taking dynamics, where the agent begins to generate the speech output before the user finishes speaking.
- Operate with low latency to maintain a conversational experience.
You need to enable both speech to text and text to speech in a real-time agent interaction.
What should you do?
Suggested Answer: D
The correct answer is D: use real-time speech to text for incoming audio and text to speech for agent responses. The workflow requires continuous audio input, low-latency transcription for reasoning, and spoken output back to the user. Real-time speech to text in Azure Speech in Foundry Tools transcribes streaming audio as it arrives, which covers the incoming-audio side of the interaction. Text to speech covers the outbound side by turning the agent's generated answer into spoken audio.
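The loop described above can be sketched in plain Python. This is only an illustration of the pattern, not the Azure Speech SDK: `streaming_stt`, `text_to_speech`, `run_voice_loop`, and the `<silence>` marker are hypothetical stand-ins, and each "audio chunk" is simply a word so the example runs without a microphone or a service key.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: none of these names come from the Azure Speech SDK.
# Each "audio chunk" is a word; "<silence>" marks the end of an utterance.
SILENCE = "<silence>"


def streaming_stt(chunks: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (transcript_so_far, is_final) after every chunk.

    Interim results appear while the user is still speaking; a final
    result appears when a pause is detected.
    """
    words: List[str] = []
    for chunk in chunks:
        if chunk == SILENCE:
            if words:
                yield " ".join(words), True
                words = []
        else:
            words.append(chunk)
            yield " ".join(words), False
    if words:  # audio ended without a trailing pause
        yield " ".join(words), True


def text_to_speech(text: str) -> bytes:
    """Pretend to synthesize audio; a real system would return waveform data."""
    return f"<audio:{text}>".encode("utf-8")


def run_voice_loop(chunks: Iterable[str], agent: Callable[[str], str]) -> List[bytes]:
    """Speech to text -> agent reasoning -> text to speech, one turn at a time."""
    spoken: List[bytes] = []
    for transcript, is_final in streaming_stt(chunks):
        if is_final:  # only finished utterances are sent to the agent here
            spoken.append(text_to_speech(agent(transcript)))
    return spoken


audio = ["book", "a", "table", SILENCE, "for", "two", SILENCE]
replies = run_voice_loop(audio, lambda text: f"You said: {text}")
```

In this sketch the agent only acts on final results. A system that must start responding before the user finishes speaking would also act on the interim results that `streaming_stt` yields.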
This pattern aligns with Microsoft's real-time voice-agent architecture. The Voice Live API overview explains that low-latency speech-to-speech systems integrate speech recognition, generative reasoning, and text-to-speech functionality to create natural voice experiences. It also identifies contact centers as a key scenario and highlights low perceived latency for end users.
Other approaches do not meet the requirements. Embeddings represent content as vectors; they neither transcribe audio nor produce speech. Batch transcription works on complete audio files, so its results arrive too late for turn-taking. Speech translation is only appropriate when translating between languages and does not provide the required reasoning-plus-spoken-response loop.
Reference topics: Azure Speech in Foundry Tools, real-time speech to text, text to speech, voice agents, low-latency interaction, and conversational turn-taking.
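The point about batch transcription can be made concrete with a small simulation. This is a sketch under stated assumptions rather than a measurement of any real service: `batch_transcribe`, `streaming_transcribe`, and `chunks_before_first_text` are hypothetical helpers, and each audio chunk is again modeled as a single word.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical helpers for illustration only; not part of any Azure SDK.


def batch_transcribe(chunks: Iterable[str]) -> Iterator[str]:
    """File-oriented: the whole recording is read before any text is produced."""
    words = list(chunks)
    yield " ".join(words)


def streaming_transcribe(chunks: Iterable[str]) -> Iterator[str]:
    """Real-time: a partial transcript is produced after every chunk."""
    words: List[str] = []
    for chunk in chunks:
        words.append(chunk)
        yield " ".join(words)


def chunks_before_first_text(
    transcriber: Callable[[Iterable[str]], Iterator[str]], audio: List[str]
) -> int:
    """Count how many chunks must be consumed before the first text appears."""
    consumed = 0

    def counted() -> Iterator[str]:
        nonlocal consumed
        for chunk in audio:
            consumed += 1
            yield chunk

    next(transcriber(counted()))  # wait for the first piece of text only
    return consumed


audio = ["what", "time", "do", "you", "close"]
streaming_wait = chunks_before_first_text(streaming_transcribe, audio)
batch_wait = chunks_before_first_text(batch_transcribe, audio)
```

With this five-chunk input, the streaming transcriber needs one chunk before text is available, while the batch transcriber needs all five, which is why a file-oriented approach cannot support turn-taking.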
by Tiffany at Jun 29, 2026, 10:38 AM