Databricks Certified Machine Learning Professional - Databricks-Machine-Learning-Professional Free Exam Dumps Questions & Answers

Question 1

A Data Scientist is tasked with developing models to forecast product demand. The company offers 5000 different product types, and the Data Scientist must generate weekly forecasts for each type. They have access to two years of historical purchase data and are given ample project budget.
For their next project, they want to build 5000 separate Random Forest models, one for each product type. They aim to train all the models as quickly as possible with minimal setup.
Which approach meets these requirements?

A. Use the DeepSpeed library to distribute the data by product across different nodes to enable the parallel training of multiple models.
B. Leverage the pandas function API (Grouped map) to group the data by product type and apply a custom model training function to each group.
C. Use the RandomForest method from MLlib. This will leverage Spark's parallel processing capability to train 5000 different models.
D. Create a Databricks Workflow with 5000 tasks. Each task is configured to accept a product ID as a parameter which will then train a model based on the specified product ID.

Correct Answer: B
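
The grouped-map pattern in option B amounts to grouping the data by product and applying one training function per group. The sketch below mimics that shape in plain Python so it runs without a cluster. The rows, the `train_one` helper, and the mean-based "model" are hypothetical stand-ins for real purchase data and a Random Forest. In PySpark the same shape is typically written as `df.groupBy("product_id").applyInPandas(train_fn, schema=...)`, where `product_id` and `train_fn` are assumed names.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weekly sales rows: (product_id, week, units_sold).
rows = [
    ("p1", 1, 10), ("p1", 2, 14), ("p1", 3, 12),
    ("p2", 1, 3), ("p2", 2, 5),
]


def train_one(product_id, history):
    """Stand-in for per-product training.

    A real job would fit a Random Forest on this product's history;
    here the "model" is just the mean of past weekly sales.
    """
    units = [u for _, u in sorted(history)]
    return {"product_id": product_id, "n_weeks": len(units), "forecast": mean(units)}


def grouped_apply(rows, fn):
    """Group rows by product, then call fn once per group."""
    groups = defaultdict(list)
    for product_id, week, units in rows:
        groups[product_id].append((week, units))
    return [fn(pid, hist) for pid, hist in sorted(groups.items())]


models = grouped_apply(rows, train_one)
for m in models:
    print(m)
```

Because each group is handled independently, Spark can schedule the per-product calls across executors, which is what makes this approach fast with little setup.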

Question 2

A Data Scientist is building a machine learning pipeline to classify raw text using a Logistic Regression model in Spark using Spark MLlib's Pipelines. This pipeline has three stages: the Tokenizer (to split the raw text in tokens), a HashingTF (to transform tokens into hashes) and the Logistic Regression itself (to perform the classification of texts). The Spark DataFrame with the training data is called trainingDF and the one with the test data is called testDF.
In order to do this, they use the following incomplete piece of code:

Which option correctly states:
(i) The complete command to run model training;
(ii) The complete command to execute the prediction on test data;
(iii) The object type of the model object returned by the model training command.

A.

B.

C.

D.

Correct Answer: D

Question 3

A machine learning engineer is manually refreshing a model in an existing machine learning pipeline. The pipeline uses the MLflow Model Registry model "project". The machine learning engineer would like to add a new version of the model to "project". Which MLflow operation can the machine learning engineer use to accomplish this task?

A. mlflow.add_model_version
B. The machine learning engineer needs to create an entirely new MLflow Model Registry model
C. MlflowClient.update_registered_model
D. mlflow.register_model
E. MlflowClient.get_model_version

Correct Answer: D

Question 4

A machine learning engineer would like to compute predictions on inference data as it becomes available through the pipeline in microbatches. The predictions should be stored in a table for later querying. Which deployment strategy can the engineer use?

A. Edge/on-device
B. Streaming
C. Real-time
D. Batch

Correct Answer: B

Question 5

A data scientist is building a model to predict which communication channel (Phone, SMS, Email, or Post) is most likely to be effective for a given customer. Which model type is suited to this task?

A. Linear Regression
B. ARIMA
C. Softmax Classifier
D. Logistic Regression

Correct Answer: C
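
A softmax classifier turns one score per class into a probability distribution over all classes, which fits a four-way choice among channels. The sketch below shows only that final step. The raw scores are made-up numbers, and `predict_channel` is a hypothetical helper rather than part of any library.

```python
import math

CHANNELS = ["Phone", "SMS", "Email", "Post"]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_channel(scores):
    """Return the most probable channel and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CHANNELS[best], probs


# Hypothetical per-channel scores for one customer.
channel, probs = predict_channel([0.2, 1.5, 2.1, -0.3])
print(channel, [round(p, 3) for p in probs])
```

Plain (binary) logistic regression produces a single probability, so it would need a one-vs-rest wrapper to cover four channels, whereas softmax handles them in one model.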

Question 6

A Machine Learning Engineer has deployed a fraud detection model in Databricks Model Serving to detect fraudulent transactions. The engineer wants to compare the model's predictions with the actual fraud classifications from the Fraud Ops team to monitor model performance. The Fraud Ops team uses a unique transaction_id to investigate fraudulent activity and persists its findings to a fraud_findings table. The engineer enabled inference tables on the endpoint, but is not sure how to map the model's predictions to the Fraud Ops team's classifications. How can the engineer uniquely join the model's predictions to the fraud_findings table with the fewest code changes?

A. Store databricks_request_id returned from each model serving request and persist it to the fraud_findings table. Join the inference table with the fraud_findings table using databricks_request_id as the join key.
B. Join the inference table with the fraud_findings table using timestamp_ms as the join key.
C. Modify the model to include an additional input: transaction_id. Log, register and deploy the new model. In the model serving request body, add transaction_id as an additional input feature. Join the inference table with the fraud_findings table using transaction_id as the join key.
D. Populate the client_request_id field with the transaction_id in the model serving request body. Join the inference table with the fraud_findings table using client_request_id (which contains the transaction_id) as the join key.

Correct Answer: D
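
A rough sketch of option D follows, assuming the request body accepts a top-level `client_request_id` next to `dataframe_records` and that the inference table stores the model output as a JSON string in a `response` column. The feature names, the `is_fraud` column, and both helper functions are hypothetical. The rows are simplified in-memory stand-ins for the two tables.

```python
import json


def build_request_body(transaction_id, features):
    """Serving request payload that carries the transaction ID as client_request_id."""
    return json.dumps({
        "client_request_id": transaction_id,
        "dataframe_records": [features],
    })


def join_predictions_to_findings(inference_rows, findings_rows):
    """Inner-join inference rows to fraud_findings rows on the transaction ID."""
    findings_by_id = {row["transaction_id"]: row for row in findings_rows}
    joined = []
    for row in inference_rows:
        finding = findings_by_id.get(row["client_request_id"])
        if finding is not None:
            joined.append({
                "transaction_id": row["client_request_id"],
                "predicted_fraud": json.loads(row["response"])["predictions"][0],
                "actual_fraud": finding["is_fraud"],
            })
    return joined


body = build_request_body("txn-001", {"amount": 912.5, "merchant_category": "electronics"})

# Hypothetical, simplified rows standing in for the inference and fraud_findings tables.
inference_rows = [
    {"client_request_id": "txn-001", "response": '{"predictions": [1]}'},
    {"client_request_id": "txn-002", "response": '{"predictions": [0]}'},
]
findings_rows = [{"transaction_id": "txn-001", "is_fraud": 1}]

joined = join_predictions_to_findings(inference_rows, findings_rows)
print(joined)
```

Only the caller changes here: the model, its signature, and the fraud_findings table stay as they are, which is why this option involves the fewest code changes.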

Question 7

A Machine Learning Engineer is training a large-scale gradient boosting model using SparkML on a cluster of machines. The training job fails due to memory overflow on a single executor node after processing several iterations. The cluster resources are limited to executor nodes with 16 CPU cores and 64 GB RAM each. The engineer wants to continue training the model without changing hyperparameters or reducing the dataset size. They know Spark's architecture well and want to take advantage of its benefits. Which approach will allow the Machine Learning Engineer to solve this issue?

A. Increase the number of executor nodes and implement model parallelism by splitting the gradient boosting model across executors so each node trains a part of the model.
B. Increase the number of executor nodes and implement data parallelism by partitioning the dataset across executors so each node trains the model on a subset of data.
C. Increase the number of executor nodes and replicate the entire model on each node, then implement data parallelism to train on different mini-batches simultaneously.
D. Increase the number of executor nodes and replicate the entire model on each node, then apply model parallelism to train different parts of the model in parallel.

Correct Answer: B

Question 8

Which MLflow command logs a trained model?

A. mlflow.run()
B. mlflow.start_run()
C. mlflow.register_model()
D. mlflow.log_model()

Correct Answer: D

Question 9

A data scientist wants to remove the star_rating column from the Delta table at the location path.
To do this, they need to load in data and drop the star_rating column. Which of the following code blocks accomplishes this task?

A. spark.sql("SELECT * EXCEPT star_rating FROM path")
B. Delta tables cannot be modified
C. spark.read.format("delta").table(path).drop("star_rating")
D. spark.read.table(path).drop("star_rating")
E. spark.read.format("delta").load(path).drop("star_rating")

Correct Answer: E

Question 10

A Machine Learning Engineer wants to monitor the quality and stability of their machine learning model's predictions over time. They have a Delta table, retail_inference_log, which records each model prediction along with input features, a timestamp, and (when available) the true label. They need to detect data drift and monitor model performance trends using Databricks Lakehouse Monitoring, ensuring that alerts are triggered if the distribution of predictions or input features changes significantly. Which approach will set up monitoring for this use case?

A. Create a monitor with the Inference profile on the retail_inference_log table, and specify a recent batch of production data as the baseline table for drift detection. Use this recent production data to compare against new data for drift and performance monitoring.
B. Create a monitor with the Snapshot profile on the retail_inference_log table, so that metrics are calculated over the entire table each time the monitor runs and therefore is able to compare new values with previous ones to compute data drift.
C. Create a monitor with the Time Series profile on the retail_inference_log table, specifying the timestamp column and including model input, prediction columns and the true label column. This will track drift in features and predictions over time, and model performance could also be tracked using a custom metric.
D. Create a monitor with the Inference profile on the retail_inference_log table, specifying the timestamp column and the columns for model inputs, predictions, and labels. Configure the monitor to compute drift and performance metrics over time windows.

Correct Answer: D

Question 11

A Data Scientist needs to analyze drift detection results from Databricks Lakehouse Monitoring.
The system has generated both profile metrics and drift metrics tables. The scientist needs to identify baseline drift in numerical features by comparing current data against a baseline from 6 months ago. Which combination of table columns and values indicates baseline drift in a numerical feature?

A. log_type = "BASELINE" in profile metrics table with population_stability_index > 0.2 in drift metrics table.
B. drift_type = "BASELINE", ks_test.p_value < 0.05, and wasserstein_distance > 0.1.
C. window_cmp pointing to baseline time window and tv_distance > 0.5 with any drift_type value.
D. drift_type = "BASELINE", chi_squared_test.p_value < 0.05, and js_distance > 0.2.

Correct Answer: B
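
The Kolmogorov-Smirnov test and the Wasserstein distance both compare two numerical distributions, which is why they appear together for numerical features (the chi-squared test and JS distance are more typical for categorical ones). The sketch below computes both from scratch on made-up samples to show what the thresholds in option B are applied to. It is an illustration under stated assumptions, not the implementation used by Lakehouse Monitoring. The p-value uses a standard asymptotic approximation, and `has_baseline_drift` is a hypothetical helper.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sample KS p-value (Kolmogorov series approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # series converges slowly here; the true value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * s))


def wasserstein_distance(a, b):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(a + b)
    total = 0.0
    for left, right in zip(points, points[1:]):
        fa = bisect_right(a, left) / len(a)
        fb = bisect_right(b, left) / len(b)
        total += abs(fa - fb) * (right - left)
    return total


def has_baseline_drift(baseline, current, p_threshold=0.05, distance_threshold=0.1):
    """Apply the two thresholds quoted in option B."""
    d = ks_statistic(baseline, current)
    p = ks_p_value(d, len(baseline), len(current))
    w = wasserstein_distance(baseline, current)
    return p < p_threshold and w > distance_threshold


# Hypothetical feature values: the current window is shifted upward by 50.
baseline = [float(i) for i in range(100)]
current = [x + 50.0 for x in baseline]

d = ks_statistic(baseline, current)
p = ks_p_value(d, len(baseline), len(current))
w = wasserstein_distance(baseline, current)
print(d, p, w, has_baseline_drift(baseline, current))
```

Note that the Wasserstein distance is expressed in the feature's own units, so a fixed cutoff such as 0.1 only makes sense relative to the feature's scale.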

Question 12

Why is Delta Lake time travel useful in ML pipelines?

A. Reproducible training datasets
B. Faster model inference
C. Smaller datasets
D. Model tuning

Correct Answer: A

Question 13

A machine learning engineer is in the process of implementing a concept drift monitoring solution.
They are planning to use the following steps:
1. Deploy a model to production and compute predicted values
2. Obtain the observed (actual) label values
3. _____
4. Run a statistical test to determine if there are changes over time
Which of the following should be completed as Step #3?

A. Obtain the observed values (actual) feature values
B. None of these should be completed as Step #3
C. Measure the latency of the prediction time
D. Compute the evaluation metric using the observed and predicted values
E. Retrain the model

Correct Answer: D
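
The four steps can be sketched end to end with an evaluation metric computed from the observed and predicted values in the third slot. The example below uses accuracy as the metric and a two-proportion z-test as the statistical test. Both are assumptions chosen for simplicity, and the labels and predictions are made-up data for two time windows.

```python
import math


def accuracy(y_true, y_pred):
    """Evaluation metric from observed and predicted labels (step 3)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def two_proportion_p_value(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracies (step 4)."""
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (correct_a / n_a - correct_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Steps 1 and 2: predictions and observed labels for two windows (hypothetical).
y_true_early = [1] * 200
y_pred_early = [1] * 180 + [0] * 20   # 90% correct
y_true_late = [1] * 200
y_pred_late = [1] * 150 + [0] * 50    # 75% correct

# Step 3: evaluation metric per window.
acc_early = accuracy(y_true_early, y_pred_early)
acc_late = accuracy(y_true_late, y_pred_late)

# Step 4: test whether the metric changed between windows.
p_value = two_proportion_p_value(180, 200, 150, 200)
print(acc_early, acc_late, p_value, p_value < 0.05)
```

A small p-value suggests that the relationship between features and labels has shifted enough to degrade the model, which is the signal a concept drift monitor is looking for.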

Question 14

A data scientist would like to switch from manually using MLflow logging to MLflow Autologging for all machine learning libraries used in a notebook.
They begin by adding mlflow.autolog() to the top of the below code block:

The data scientist is now trying to determine which line of code will kick off the MLflow Autologging process.
Which line of code within the above code block will start the MLflow Autologging process?

A. rf_model = rf.fit(X_train, y_train)
B. mlflow.sklearn.log_model(rf_model, "rf_model")
C. mlflow.log_metric("training_mse", mse)
D. with mlflow.start_run(run_name="ML Project") as run:

Correct Answer: A

Question 15

A machine learning engineer has deployed a model recommender using MLflow Model Serving.
They now want to query the version of that model that is in the Production stage of the MLflow Model Registry. Which of the following model URLs can be used to query the described model version?

A. https:///model/recommender/stage-production/invocations
B. https:///model/recommender/Production/invocations
C. https:///model-serving/recommender/Production/invocations
D. The version number of the model version in Production is necessary to complete this task.

Correct Answer: B