Exam NCA-GENL Topic 1 Question 21 Discussion
Actual exam question for NVIDIA's NCA-GENL exam
Question #: 21
Topic #: 1
Question #: 21
Topic #: 1
What is a Tokenizer in Large Language Models (LLM)?
Suggested Answer: C Vote an answer
A tokenizer in the context of large language models (LLMs) is a tool that splits text into smaller units called tokens (e.g., words, subwords, or characters) for processing by the model. NVIDIA's NeMo documentation on NLP preprocessing explains that tokenization is a critical step in preparing text data, with algorithms like WordPiece, Byte-Pair Encoding (BPE), or SentencePiece breaking text into manageable units to handle vocabulary constraints and out-of-vocabulary words. For example, the sentence "I love AI" might be tokenized into ["I", "love", "AI"] or subword units like ["I", "lov", "##e", "AI"]. Option A is incorrect, as removing stop words is a separate preprocessing step. Option B is wrong, as tokenization is not a predictive algorithm. Option D is misleading, as converting text to numerical representations is the role of embeddings, not tokenization.
References:
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp
/intro.html
References:
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp
/intro.html
by Cheryl at May 21, 2025, 12:49 PM
0
0
0
10
Comments
Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.
Report Comment
Commenting
You can sign-up / login (it's free).