TOP Machine Learning Interview Questions and Answers

Q1 - What are some of the most popular algorithms for decision trees?

Below is a list of the most popular algorithms used for creating decision trees:

CART (Classification and Regression Trees)
ID3 (Iterative Dichotomiser 3)
C4.5 (Successor of ID3)

Q2 - What are the benefits of using decision trees?

The main advantage of decision trees is that they are simple to understand and explain, and they can be visualized. They require little data pre-processing and can handle both numerical and categorical data.

Q3 - Explain the basic structure of a decision tree.

A decision tree is a flowchart-like structure made up of internal nodes, branches, and leaf nodes. Each internal node tests one attribute, each branch corresponds to an outcome of that test, and each leaf node holds a prediction. To predict a new sample, follow the path from the root along the branches that match the sample's attributes until a leaf is reached; that leaf gives the prediction.

Q4 - What is a Bag of Words?

Bag of Words is a commonly used text representation that relies on word frequencies or occurrences. It builds an occurrence matrix for documents or sentences, ignoring their grammatical structure and word order.
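
A minimal sketch of such an occurrence matrix, using only the Python standard library. The function name `bag_of_words` and the whitespace tokenization are assumptions made for brevity, not part of any particular library.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost: "the dog saw the cat" would produce the same row as the second document.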

Q5 - What’s the difference between BatchNorm and LayerNorm?

BatchNorm computes the mean and variance of each feature across the samples in a minibatch, whereas LayerNorm computes the mean and variance across the features of each sample independently, so it does not depend on the batch size. Batch normalization allows higher learning rates, which speeds up training, because it reduces the instability caused by the initial weights.
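
The difference comes down to which axis the statistics are taken over. Here is a rough sketch in plain Python; the learnable scale and shift parameters and the running averages used at inference time are left out, and the helper names are illustrative.

```python
from statistics import fmean, pvariance

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and roughly unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the minibatch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

# Two samples (rows) with three features (columns).
batch = [[1.0, 10.0, 100.0],
         [3.0, 30.0, 300.0]]
bn = batch_norm(batch)  # every column now has mean 0
ln = layer_norm(batch)  # every row now has mean 0
```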

Q6 - What’s the difference between Type I and Type II errors?

A Type I error is a false positive (FP), while a Type II error is a false negative (FN). Neither is desirable: a Type I error means claiming something has happened when it hasn't, while a Type II error means claiming nothing is happening when in fact something is.
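
As a small illustration, both error types can be counted from actual and predicted labels. The function name and the toy labels below are made up for the example.

```python
def count_errors(actual, predicted):
    """Return (false_positives, false_negatives) for boolean labels."""
    # Type I: predicted positive although the truth is negative.
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    # Type II: predicted negative although the truth is positive.
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return fp, fn

actual    = [True, False, False, True, False]
predicted = [True, True, False, False, False]
type_1, type_2 = count_errors(actual, predicted)
print(type_1, type_2)  # 1 1
```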

Q7 - When would you use gradient descent over stochastic gradient descent?

Gradient descent theoretically minimizes the error function better than stochastic gradient descent (SGD), because every step uses the gradient computed over the full dataset. However, SGD converges much faster once the dataset becomes large enough. Gradient descent is preferable for small datasets, while SGD is a good option for large datasets.

Q8 - Explain the Bias-Variance Tradeoff in simple terms.

Predictive models usually trade off bias (the error from overly simple assumptions that stop the model from fitting the data well) against variance (how much the model changes when the training data changes). Simpler models like linear regression are stable (low variance) but tend to have high bias. More complex models are more prone to overfitting (high variance) but usually have low bias.

Q9 - What is text Summarization?

Text summarization is the process of shortening a long piece of text while keeping its meaning and effect intact. The resulting summary outlines the main points of the given document.

Q10 - What is Pragmatic Ambiguity in NLP?

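Pragmatic ambiguity in NLP arises when a sentence can be interpreted in more than one way and the intended meaning depends entirely on the context in which it is used. Words with more than one meaning are a common source of such multiple interpretations, as with "bank" in the example below (strictly speaking, ambiguity that comes from a single word's meanings is called lexical ambiguity).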

Example -

I am going to the bank
I was walking on the bank of a river

Q11 - What is Latent Semantic Indexing?

Latent semantic indexing (LSI) is a mathematical technique for extracting information from unstructured text. It applies singular value decomposition to a term-document matrix, on the principle that words used in the same contexts tend to have similar meanings. It is computation-heavy compared to other models.

Q12 - What is NLU?

NLU stands for Natural Language Understanding. It is a subfield of NLP concerned with giving machines reading-comprehension skills. Applications of NLU include machine translation (MT), news gathering, and text categorization.

Q13 - What is the difference between Lemmatization and Stemming?

Stemming just removes the last few characters of a word, often leading to incorrect meanings and spellings.

Examples:

eating -> eat
caring -> car

Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.

Examples:

eating -> eat
caring -> care
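
The contrast can be reproduced with a deliberately crude toy. Both functions below are hypothetical illustrations: the stemmer only strips a trailing "ing", and the lemma table is a tiny hand-written stand-in for the dictionary and part-of-speech information a real lemmatizer would use.

```python
def naive_stem(word):
    """Crude stemmer: strip a trailing 'ing' without checking the result is a word."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

# Hypothetical lemma table, just large enough for the examples above.
LEMMAS = {"eating": "eat", "caring": "care"}

def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["eating", "caring"]:
    print(w, "->", naive_stem(w), "(stem),", toy_lemmatize(w), "(lemma)")
```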

Q14 - What’s the difference between a generative and discriminative model?

For example, in a classification scenario, a generative model learns how the data in each category is distributed, while a discriminative model learns the boundary that separates the categories. Discriminative models often outperform generative models on classification tasks when enough labeled data is available.

Q15 - What should we do when a model has low bias and high variance?

Low bias combined with high variance indicates overfitting. In such a situation, techniques such as regularization can be used, or the model can be simplified by reducing the number of features in the dataset.

Q16 - Describe the core idea of K-means.

The K-means algorithm groups similar data points around the centroid of each cluster. Finding the centroids is an iterative process driven by the distribution of the data points: initially the centroids are picked at random, each point is then assigned to its nearest centroid, and each centroid is recomputed as the mean of its assigned points, repeating until the centroids stop changing. A new data point belongs to the cluster whose centroid is nearest.
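
The loop can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: the function names, the fixed random seed, the iteration cap, and the choice to leave an empty cluster's centroid unchanged are all assumptions.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) and return the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # empty cluster: keep the old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids

def predict(point, centroids):
    """Return the index of the centroid nearest to a new point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (9.0, 9.0), (9.5, 9.5), (10.0, 10.0)]
centroids = k_means(points, k=2)
print(sorted(centroids))  # one centroid near (1, 1), the other near (9.5, 9.5)
```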

Q17 - What is K in the K-means algorithm and what is its significance?

K is the number of clusters into which the data points are grouped. K is a hyperparameter and is specific to each dataset. The algorithm performs best at the optimal value of K; for other values, performance varies depending on the properties of the data.

Q18 - What is the Dunn Index?

The Dunn Index is a measure of clustering quality; the higher, the better. It is calculated as the smallest inter-cluster distance (for example, the shortest distance between any two cluster centroids) divided by the greatest intra-cluster distance (that is, the greatest distance between any two points within a single cluster).
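
A short sketch of this calculation, using the centroid-based inter-cluster distance described above (other variants of the index define the inter-cluster distance differently). The function names are illustrative, and the code assumes at least two clusters and at least one cluster with two or more points.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster given as a list of tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def dunn_index(clusters):
    centroids = [centroid(c) for c in clusters]
    # Smallest distance between any two cluster centroids.
    min_inter = min(math.dist(a, b) for a, b in combinations(centroids, 2))
    # Largest distance between any two points inside the same cluster.
    max_intra = max(math.dist(a, b) for c in clusters for a, b in combinations(c, 2))
    return min_inter / max_intra

clusters = [[(0.0, 0.0), (0.0, 1.0)],
            [(5.0, 0.0), (5.0, 1.0)]]
dunn = dunn_index(clusters)
print(dunn)  # 5.0
```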

Q19 - What is your understanding of a data pipeline?

Data pipelines are the bread and butter of machine learning engineers, who take data science models and find ways to automate and scale them. There are tools and platforms to build data pipelines. An example can be using Airflow in the AWS environment to build a data pipeline for all steps in a data science project.

Q20 - What are your thoughts on ChatGPT?

The interviewer asks this question to understand how up to date you are with what is happening in the world of data science. Your answer should not sound like a layperson's, such as "It's a threat to jobs" or "It's not good for the economy". Give it a data scientist's flavor: for example, talk about the strengths and weaknesses of language models and where this space is heading.

Q21 - What is QML?

Quantum machine learning (QML) is a new discipline that combines quantum computing and machine learning, with the aim of solving certain complex problems faster than is possible on classical computers. As we produce and process more and more data, this is one of the most interesting areas going forward.

Q22 - Is logistic regression a generative or a discriminative model?

Logistic regression is a discriminative model. It learns to classify by finding which features differentiate two or more classes of objects. For example, to classify between an orange and a banana, it will learn that the orange is orange in color and the banana is yellow.

Q23 - What are the advantages and disadvantages of logistic regression?

The primary advantages of logistic regression are that it is easy to understand and does not require extensive training to implement and interpret. It performs well on linearly separable datasets.

The primary disadvantage of logistic regression is that its linear decision surface prevents it from solving non-linear problems.

Q24 - What is word embedding?

Word embedding is a technique used in natural language processing (NLP) to represent words and phrases in a numerical format that machine learning models can easily process. It is based on the idea that words with similar meanings will have similar representations in a high-dimensional vector space.

Q25 - What are some well-known word embedding techniques?

There are several methods for creating word embeddings, such as word2vec, GloVe, and fastText. word2vec and fastText train shallow neural networks on large amounts of text, while GloVe fits word vectors to global word co-occurrence statistics.

Q26 - Explain your understanding of the term “Curse of Dimensionality”.

The Curse of Dimensionality refers to the problems that arise when training data has too many features. With a large number of features, the data becomes sparse relative to the feature space, so it becomes difficult to train a model, learn its parameters well, and obtain a stable model. Dimensionality reduction techniques such as PCA help in this situation.

Q27 - Name some tests to check the normality of a given data.

Shapiro-Wilk W Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test

Q28 - Why is it key to introduce non-linearities in neural networks?

Without non-linearities, a neural network collapses into a single linear transformation, no matter how many layers it has, because a composition of linear functions is itself linear. Such a network cannot model non-linear relationships, so it cannot find appropriate solutions or classify the data correctly in complex problems.
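
A tiny numeric check makes the point. The sketch below uses one-dimensional "layers" with made-up weights: stacking two linear layers gives exactly one linear layer, while inserting a ReLU between them does not.

```python
def linear(w, b):
    """Return a one-dimensional linear layer x -> w * x + b."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)

def stacked_linear(x):
    return layer2(layer1(x))

# The same function written as a single layer: w = w2 * w1, b = w2 * b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)

def with_relu(x):
    return layer2(relu(layer1(x)))

xs = [-2.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(abs(stacked_linear(x) - collapsed(x)) for x in xs)
print(max_gap)                         # 0.0: two linear layers equal one
print([with_relu(x) for x in xs])      # flat for small x, then decreasing: not a straight line
```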

Q29 - What are the left-skewed distribution and the right-skewed distribution?

In a left-skewed distribution, the left tail is longer than the right tail. It is also known as a negative-skew distribution. Typically:

Mean < median < mode

In a right-skewed distribution, the right tail is longer than the left tail. It is also known as a positive-skew distribution. Typically:

Mode < median < mean
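
These orderings can be checked on small made-up samples with the standard `statistics` module. The data below is chosen for illustration; the ordering is typical rather than guaranteed for every skewed distribution.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 10]   # long right tail
left_skewed = [-x for x in right_skewed]    # mirror image: long left tail

r = (mode(right_skewed), median(right_skewed), mean(right_skewed))
l = (mean(left_skewed), median(left_skewed), mode(left_skewed))
print(r)  # (2, 2.5, 3.375): mode < median < mean
print(l)  # (-3.375, -2.5, -2): mean < median < mode
```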

Q30 - What is Bessel’s correction?

Bessel’s correction is the use of n-1 instead of n in the denominator of the sample variance and sample standard deviation formulas. It corrects the tendency to underestimate the population variance when it is estimated from a sample, which makes general conclusions drawn from that sample more accurate.
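
Python's standard `statistics` module exposes both versions, which makes the effect easy to see on a small made-up sample: `pvariance` divides by n, while `variance` applies the n-1 correction.

```python
from statistics import pvariance, variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # n = 8, mean = 5, squared deviations sum to 32

uncorrected = pvariance(sample)  # 32 / 8
corrected = variance(sample)     # 32 / 7, Bessel's correction
print(uncorrected, corrected)
```

The corrected estimate is always the larger of the two, and the gap shrinks as n grows.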

Q31 - What are the different kernels in SVM?

There are four commonly used kernels in SVM.

1. Linear kernel

2. Polynomial kernel

3. Radial basis kernel

4. Sigmoid kernel
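
For reference, the four kernels can be written as plain functions of two vectors. This is a sketch of the usual textbook forms; the parameter names `degree`, `gamma`, and `coef0` follow a common convention, and the default values here are arbitrary choices for the example.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Depends only on the squared Euclidean distance between x and y.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y))                 # 11.0
print(polynomial_kernel(x, y, degree=2))   # 144.0
print(rbf_kernel(x, x))                    # 1.0
```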

Q32 - What is ‘Naive’ in a Naive Bayes?

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Q33 - How to convert normal distribution to standard normal distribution?

Subtract the mean from each data point and divide by the standard deviation:

Z = (x - µ) / σ

where x is a data point, and µ and σ are the mean and standard deviation of the distribution.
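
As a small worked sketch of this formula on made-up data, using the population standard deviation from the standard library (the function name `standardize` is illustrative):

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu=mu)
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
z = standardize(data)
print(z)
```

After the transformation the values have mean 0 and standard deviation 1.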

Q34 - What can you do with an outlier?

Don't give a direct answer such as "we will cap outliers at the 90th percentile". Talk from a business context, and bring in concepts such as the volume of data, the data gathering process, and the justification for your outlier treatment methodology. The purpose is to show the depth of your understanding.

Q35 - What are the different types of missing data?

Missing completely at random (MCAR)

Missing at random (MAR)

Missing not at random (MNAR)

Q36 - How to interpret the residual vs fitted value curve?

The residual vs fitted value plot is used to see whether the residuals are correlated with the predicted values. If the residuals are scattered randomly around zero with constant variance and no visible pattern, the model is working fine; otherwise, the behavior needs to be investigated.

Q37 - What is Linear Discriminant Analysis?

LDA is a supervised dimensionality reduction technique, because it uses the target variable to find the projection. It is commonly used for classification problems. LDA works toward two objectives:

Maximize the distance between the means of the two classes.
Minimize the variation within each class.

Q38 - What is KNN Imputer?

In data preparation, we generally impute null values with statistical measures of the data such as the mean, mode, or median, but KNN Imputer is a more sophisticated method. Here each missing value is imputed from the values of the nearest neighbors of the sample that contains it. This can be time-consuming, but it is often preferred over simple statistical measures because it introduces less bias into the training data.
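
The idea can be sketched in plain Python. This is a simplified illustration and not the scikit-learn `KNNImputer` class itself: missing values are marked with `None`, distances use only the features both rows have, and the code assumes every missing feature is present in at least one other row.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Compare only the features that are present in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # Candidate neighbors: other rows where feature j is known.
                donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
                donors.sort(key=lambda r: distance(row, r))
                nearest = donors[:k]
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

rows = [[1.0, 2.0], [1.1, None], [0.9, 1.8], [8.0, 9.0]]
filled = knn_impute(rows, k=2)
print(filled[1])  # second value is the average of 2.0 and 1.8
```

A plain column mean would have used the distant row `[8.0, 9.0]` as well, pulling the imputed value far from its actual neighbors.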

Q39 - Define Random Forest.

Random Forest is an ensemble learning method used for classification and regression. It constructs multiple decision trees during training and combines their predictions: a majority vote for classification and an average for regression. By combining several trees, it reduces overfitting and provides robust results.

Q40 - Does Random Forest need pruning? Why or why not?

Random Forest usually does not require pruning. Unlike a single decision tree, it is much less prone to overfitting: bootstrapped training samples and random feature selection decorrelate the trees, and averaging over many trees reduces variance.

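Q41 - Is Random Forest different from AdaBoost?

Yes, it is. Random Forest is a bagging technique, in which trees are trained independently on bootstrapped samples, whereas AdaBoost is a boosting technique, in which learners are trained sequentially and each one focuses on the examples the previous ones got wrong. Both are ensemble models, but they work differently.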
