Exam Databricks-Machine-Learning-Associate Topic 2 Question 1 Discussion

Actual exam question for Databricks's Databricks-Machine-Learning-Associate exam
Question #: 1
Topic #: 2

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column's median value.
They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.
Which reason describes why the code block is not accomplishing the imputation task?

A. It does not impute both the training and test data sets.
B. The inputCols and outputCols need to be exactly the same.
C. The fit method needs to be called instead of transform.
D. It does not fit the imputer on the data to create an ImputerModel.

Suggested Answer: D

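In the provided code block, the Imputer object is created but never fitted on the data to generate an ImputerModel. Imputer is an estimator: it holds only configuration (the strategy and the column names), and the per-column medians are computed when fit is called, which returns an ImputerModel. Calling transform directly on the Imputer therefore cannot work, because the median values needed for imputation have not been computed yet. The correct approach is to fit the imputer on the dataset first, then call transform on the resulting ImputerModel.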
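To make the estimator/model split concrete, here is a minimal plain-Python sketch of the same pattern. It is a toy, not Spark's implementation: the classes ToyImputer and ToyImputerModel and the sample rows are hypothetical, and Spark's own median computation may differ in detail.

```python
from statistics import median


class ToyImputerModel:
    """Result of fitting: stores the learned medians and can transform rows."""

    def __init__(self, medians):
        self.medians = medians

    def transform(self, rows):
        # Return new rows with None replaced by the stored column median.
        out = []
        for row in rows:
            new_row = dict(row)
            for col, med in self.medians.items():
                if new_row[col] is None:
                    new_row[col] = med
            out.append(new_row)
        return out


class ToyImputer:
    """Configuration only: knows which columns to impute, not their medians."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        # The medians are learned here, from the non-missing values.
        medians = {
            col: median(r[col] for r in rows if r[col] is not None)
            for col in self.input_cols
        }
        return ToyImputerModel(medians)


# Hypothetical sample data with one missing value per column.
rows = [
    {"age": 20.0, "income": 100.0},
    {"age": None, "income": 300.0},
    {"age": 40.0, "income": None},
]

model = ToyImputer(["age", "income"]).fit(rows)  # step 1: learn the medians
imputed = model.transform(rows)                  # step 2: apply them
```

The unfitted ToyImputer has no medians to apply, which mirrors why the Spark Imputer has to be fitted before any data can be transformed.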
Corrected code:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    strategy="median",
    inputCols=input_columns,
    outputCols=output_columns,
)
imputer_model = imputer.fit(features_df)  # Fit the imputer to the data
imputed_features_df = imputer_model.transform(features_df)  # Transform the data using the fitted imputer

Reference:
PySpark ML Documentation
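The question asks for every numeric column to be imputed, so input_columns has to list exactly those columns. One possible way to build it, sketched below, is to filter the (name, type string) pairs that a PySpark DataFrame exposes through its dtypes attribute. The helper numeric_columns, the NUMERIC_TYPES set, the sample pairs, and the "_imputed" suffix are all assumptions made for illustration; which numeric types Imputer accepts depends on the Spark version, so the set may need adjusting.

```python
# Spark type strings treated as numeric here (an assumption; decimal types
# are deliberately left out and can be added if your Spark version allows).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the column names whose type string is in NUMERIC_TYPES.

    `dtypes` is a list of (name, type_string) pairs, the same shape as
    the `dtypes` attribute of a PySpark DataFrame.
    """
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]


# Hypothetical stand-in for features_df.dtypes.
sample_dtypes = [
    ("id", "string"),
    ("age", "double"),
    ("visits", "int"),
    ("signup", "timestamp"),
    ("income", "float"),
]

input_columns = numeric_columns(sample_dtypes)
# Hypothetical naming scheme for the imputed output columns.
output_columns = [f"{name}_imputed" for name in input_columns]
```

With a real DataFrame, the call would be numeric_columns(features_df.dtypes), and the two resulting lists would be passed as inputCols and outputCols.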

by Aurora at Apr 24, 2025, 12:12 AM
