Exam NCA-GENM Topic 1 Question 375 Discussion

Actual exam question for NVIDIA's NCA-GENM exam
Question #: 375
Topic #: 1

You're training a multimodal model for image and text retrieval. Given an image, the model should retrieve the most relevant text description from a database, and vice versa. You're using a dual-encoder architecture, where one encoder processes images and the other processes text, projecting them into a shared embedding space. What is the most effective way to train the model to ensure that semantically similar images and texts have close embeddings, while dissimilar ones have distant embeddings?

A. Train the encoders independently using separate supervised tasks for image and text classification.
B. Use a contrastive loss function that minimizes the distance between embeddings of matching image-text pairs and maximizes the distance between embeddings of non-matching pairs. Example: Triplet Loss, InfoNCE.
C. Use a reconstruction loss that forces the model to reconstruct the input image from its text embedding and vice versa.
D. Apply adversarial training to make the embeddings indistinguishable between the two modalities.
E. Use a simple L1 loss between the image and text embeddings.

Suggested Answer: B

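Contrastive loss functions are designed for learning embeddings in which similarity is expressed as distance: they pull the embeddings of matching image-text pairs together and push those of non-matching pairs apart. Training the encoders independently (A) never ties the two modalities together, so nothing aligns their embedding spaces. Reconstruction loss (C) optimizes for regenerating the input, not for similarity between embeddings. Adversarial training (D) aims to make the two modalities' embeddings indistinguishable, which does not by itself place matching pairs close together. A plain L1 loss (E) is a basic distance measure that only pulls matching pairs together; with no term pushing non-matching pairs apart, it is less effective than contrastive losses for learning semantic similarity.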
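To make option B concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of paired embeddings. It assumes NumPy, that row i of each matrix belongs to the same image-text pair, and that every other row in the batch serves as a negative. The function names and the temperature value of 0.07 are illustrative choices, not something the question specifies.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Hypothetical helper: InfoNCE averaged over both retrieval directions.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matching pair, and all other rows act as in-batch negatives.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # logits[i, j] = similarity of image i and text j, sharpened by temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should assign most probability to its diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same objective along the columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check: correctly paired embeddings should score a lower loss than
# the same embeddings with the text rows shifted by one position.
emb = np.eye(4)
aligned_loss = symmetric_info_nce(emb, emb)
mismatched_loss = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
```

In an actual training loop the two inputs would come from the image and text encoders, and the loss would be minimized with an autodiff framework; this NumPy version only computes the value.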

by Eileen at Oct 13, 2025, 10:43 PM
