Data Science Screening Test

By mfh.officials@gmail.com / January 7, 2025

1 / 20

Which of the following is the most appropriate way to deal with missing values in a dataset?

Replace missing values with zeros

Drop all rows with missing values

Replace missing values with the mean or median of the column

Use a model-based imputation method
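
Note: as a minimal sketch of what the simpler options above do mechanically, the snippet below applies zero-filling, row dropping, and mean/median filling to a toy column. The data and variable names are assumptions made for illustration and are not part of the quiz; model-based imputation is not shown because it depends on the model chosen.

```python
from statistics import mean, median

# Toy column with missing entries marked as None (hypothetical data).
column = [4.0, None, 10.0, 1.0, None, 5.0]

# Values that were actually observed.
observed = [v for v in column if v is not None]

# Option: replace missing values with zeros.
zero_filled = [0.0 if v is None else v for v in column]

# Option: drop every row that has a missing value.
dropped = list(observed)

# Option: replace missing values with the column mean or median.
mean_filled = [mean(observed) if v is None else v for v in column]
median_filled = [median(observed) if v is None else v for v in column]

print("zero-filled:  ", zero_filled)
print("dropped:      ", dropped)
print("mean-filled:  ", mean_filled)
print("median-filled:", median_filled)
```

Each strategy yields a different column, which is why the choice usually depends on how much data is missing and why it is missing.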

2 / 20

Which of the following is TRUE about ensemble methods?

They combine multiple weak models to create a stronger model

They cannot be used with decision trees

They are less prone to overfitting than single models

They are always more computationally expensive than single models

3 / 20

What is the "bias-variance tradeoff"?

Increasing model complexity reduces bias and increases variance

Increasing model complexity does not affect bias or variance

Increasing model complexity reduces variance and increases bias

Increasing model complexity reduces both bias and variance

4 / 20

What is the purpose of the Adam optimizer in neural networks?

To adjust the learning rate during training

To reduce the loss function

To calculate the gradient of the loss function

To perform backpropagation

5 / 20

Which of the following techniques is used for dimensionality reduction?

Naive Bayes

Decision Trees

Principal Component Analysis (PCA)

K-means clustering

6 / 20

The "curse of dimensionality" refers to:

The difficulty of visualizing data in high-dimensional spaces

The tendency of high-dimensional data to become sparse

The increasing complexity of models with more features

The difficulty in finding a suitable machine learning model

7 / 20

Which of the following is a hyperparameter for the k-means clustering algorithm?

Learning rate

Activation function

Regularization strength

Number of clusters (k)

8 / 20

In random forests, what is the primary advantage over a single decision tree?

It uses more training data

It reduces variance and improves accuracy by averaging predictions

It always uses shallow trees

It is more interpretable

9 / 20

Which of the following algorithms is most appropriate for predicting a continuous outcome variable?

Decision tree classification

K-means clustering

K-nearest neighbors (classification)

Linear regression

10 / 20

In the context of model evaluation, what does the "ROC curve" stand for?

Root Output Curve

Recurrent Operations Curve

Receiver Operating Characteristic Curve

Residual Output Curve

11 / 20

Which of the following is a disadvantage of the k-nearest neighbors (KNN) algorithm?

It requires a large amount of training data

It assumes linearity in the data

It cannot handle multi-class classification

It is difficult to interpret

12 / 20

Which of the following is an example of unsupervised learning?

Logistic Regression

Linear Regression

Principal Component Analysis (PCA)

Decision Trees

13 / 20

Cross-validation is primarily used to:

Tune hyperparameters

Split data into training and testing sets

Reduce overfitting and assess model performance

Visualize data
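
Note: for readers unfamiliar with the mechanics, the following is a minimal sketch of how k-fold cross-validation partitions sample indices. The helper name `k_fold_indices` is hypothetical, and the sketch assumes contiguous folds without shuffling; library implementations typically offer shuffling and stratification as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in a test fold exactly once, so a model can be fit k times and scored each time on data it did not see during fitting.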

14 / 20

What is the purpose of regularization in machine learning?

To speed up training

To reduce overfitting by penalizing large coefficients

To improve model interpretability

To increase model complexity

15 / 20

Which evaluation metric is most appropriate for imbalanced classification problems?

Accuracy

Mean Squared Error

R-squared

Precision and Recall
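
Note: the classification metrics named above can be computed from confusion-matrix counts, as in the minimal sketch below. The labels and predictions are made-up toy data (5 positives out of 100 samples), and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Toy imbalanced data (assumption): 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95
# A hypothetical classifier that finds one positive and raises one false alarm.
y_pred = [1, 0, 0, 0, 0] + [1] + [0] * 94

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when no positives are predicted or present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
```

Comparing the three printed values on this toy data shows how differently the metrics can behave when one class is rare.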

16 / 20

Which of the following metrics is used to evaluate clustering models?

Precision

F1-score

Silhouette score

ROC-AUC

17 / 20

Which of the following libraries is primarily used for deep learning?

pandas

Matplotlib

scikit-learn

TensorFlow

18 / 20

Which of the following is a key assumption of the linear regression model?

Multicollinearity

Independence of dependent variables

Non-linearity between independent and dependent variables

Homoscedasticity

19 / 20

In a decision tree, the split criterion is typically based on:

Sum of squared errors

Variance

Root mean squared error

Information gain or Gini impurity

20 / 20

In a time series forecasting problem, which of the following is most commonly used to check for stationarity?

ACF/PACF plots

Augmented Dickey-Fuller (ADF) test

Durbin-Watson test

Shapiro-Wilk test
