Databricks Databricks Machine Learning Associate Übungsprüfungen
Zuletzt aktualisiert am 26.04.2025- Prüfungscode: Databricks Machine Learning Associate
- Prüfungsname: Databricks Certified Machine Learning Associate Exam
- Zertifizierungsanbieter: Databricks
- Zuletzt aktualisiert am: 26.04.2025
A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:
Which of the following pieces of code can be used to fill in the above blank to complete the task?
- A . applyInPandas
- B . mapInPandas
- C . predict
- D . train_model
- E . groupedApplyIn
Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?
- A . Random Search
- B . Halving Random Search
- C . Tree of Parzen Estimators
- D . Grid Search
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
- A . spark_df[spark_df["price"] > 0]
- B . spark_df.filter(col("price") > 0)
- C . SELECT * FROM spark_df WHERE price > 0
- D . spark_df.loc[spark_df["price"] > 0,:]
- E . spark_df.loc[:,spark_df["price"] > 0]
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?
- A . The second model is much more accurate than the first model
- B . The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE
- C . The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE
- D . The first model is much more accurate than the second model
- E . The RMSE is an invalid evaluation metric for regression problems
A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.
In which situation will the machine learning engineer be correct?
- A . When the new solution requires if-else logic determining which model to use to compute each prediction
- B . When the new solution’s models have an average latency that is larger than the size of the original model
- C . When the new solution requires the use of fewer feature variables than the original model
- D . When the new solution requires that each model computes a prediction for every record
- E . When the new solution’s models have an average size that is larger than the size of the original model
A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).
Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?
A)
B)
C)
D)
- A . Option A
- B . Option B
- C . Option C
- D . Option D
A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete.
Which of the following approaches can be used to speed up the model tuning process?
- A . Implement MLflow Experiment Tracking
- B . Scale up with Spark ML
- C . Enable autoscaling clusters
- D . Parallelize with Hyperopt
A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.
Which of the following compute tools is best suited for this use case?
- A . Single Node cluster
- B . Standard cluster
- C . SQL Warehouse
- D . None of these compute tools support this task
Which approach can preserve information about the original state of data when imputing missing values?
- A . Using mean instead of median
- B . Removing variables with missing data
- C . Creating a binary indicator for imputed values
- D . A, B, and C
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?
- A . Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values
- B . Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values
- C . Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data
- D . Utilize the Pipeline API to standardize the training data according to the test data’s summary statistics
- E . Utilize the Pipeline API to standardize the test data according to the training data’s summary statistics