To all the aspiring data scientists out there, the fear of making it through interviews is common within all of us, here we have mentioned a few most prominently asked interview questions with any data science interviewing company.
People who are somewhat experienced can also go through these in order for a quick revision before their big call.
Before going through these interview questions make sure you’ve revised and brush up your skills, be it theoretical as well as practical, we’d prefer you to answer and think of the solutions on your own first.
There are much more questions than you can surf and get to know from various resources, below we’ve mentioned questions related to Machine Learning, Data Analytics, Deep Learning, Mathematics, which are often asked by interviewers.
1. What is the difference between supervised and unsupervised machine learning?
The difference between the two of them is-
Supervised Machine Learning-
In this type of machine learning,it requires training labelling of the data whereas,
Unsupervised Machine Learning-
In this type of machine learning,it doesn’t required labelled data.
2. What is exploding gradients?
“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of weights can become so large as to overflow and result in NaN values.
This has the effect of your model being unstable and unable to learn from your training data. Now let’s understand what is the gradient.
Gradient is the direction and magnitude calculated during training of a neural network that is used to update the network weights in the right direction and by the right amount.
3. What is a confusion matrix?
The confusion matrix is a 2X2 table that contains 4 outputs provided by the binary classifier. Various measures, such as error-rate, accuracy, specificity, sensitivity, precision and recall are derived from it. Confusion Matrix
A data set used for performance evaluation is called test data set. It should contain the correct labels and predicted labels.
The predicted labels will exactly the same if the performance of a binary classifier is perfect.
The predicted labels usually match with part of the observed labels in real world scenarios.
A binary classifier predicts all data instances of a test dataset as either positive or negative. This produces four outcomes-
True positive(TP) — Correct positive prediction
False positive(FP) — Incorrect positive prediction
True negative(TN) — Correct negative prediction
False negative(FN) — Incorrect negative prediction
4. What is bias, variance trade off?
“Bias is error introduced in your model due to over simplification of machine learning algorithm.” It can lead to under fitting. When you train your model at that time model makes simplified assumptions to make the target function easier to understand.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM High bias machine learning algorithms — Linear Regression, Logistic Regression
“Variance is error introduced in your model due to complex machine learning algorithm, your model learns noise also from the training data set and performs bad on test data set.” It can lead high sensitivity and over fitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point.
As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
Bias, Variance trade off
The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
The k-nearest neighbours’ algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
5. What are support vectors in SVM?
In the above diagram, we see that the thinner lines mark the distance from the classifier to the closest data points called the support vectors (darkened data points). The distance between the two thin lines is called the margin.
6. Explain Decision Tree algorithm in detail?
A decision tree is a supervised machine learning algorithm mainly used for Regression and Classification.
It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.
7. Explain how a ROC curve works?
The ROC curve is a graphical representation of the contrast between true positive rates and false-positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity(true positive rate) and false-positive rate.
8. What is Ensemble Learning?
The ensemble is the art of combining a diverse set of learners(Individual models) together to improvise on the stability and predictive power of the model.
Ensemble learning has many types but two more popular ensemble learning techniques are mentioned below.
Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalised bagging, you can use different learners on a different population. As you expect this helps us to reduce the variance error.
Boosting is an iterative technique which adjust the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, they may over fit on the training data.
9. What is Random Forest? How does it work?
Random forest is a versatile machine learning method capable of performing both regression and classification tasks. It is also used for dimensionality reduction, treats missing values, outlier values.
It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on attributes, each tree gives a classification. The forest chooses the classification having the most votes(Overall the trees in the forest) and in case of regression, it takes the average of outputs by different trees.
10. What is deep learning?
Deep learning is a sub field of machine learning inspired by the structure and function of the brain called artificial neural network. We have a lot of numbers of algorithms under machine learning like Linear regression, SVM, Neural network etc and deep learning is just an extension of Neural networks.
In neural nets, we consider a small number of hidden layers but when it comes to deep learning algorithms we consider a huge number of hidden layers to better understand the input-output relationship.
11. What are Recurrent Neural Networks(RNNs)?
Recurrent nets are a type of artificial neural networks designed to recognise the pattern from the sequence of data such as Time series, stock market and government agencies etc. To understand recurrent nets, first, you have to understand the basics of feedforward nets.
Both these networks RNN and feed-forward named after the way they channel information through a series of mathematical orations performed at the nodes of the network. One feeds information through straight(never touching the same node twice), while the other cycles it through the loop, and the latter are called recurrent.
Recurrent networks on the other hand, take as their input not just the current input example they see, but also the what they have perceived previously in time.
The BTSXPE at the bottom of the drawing represents the input example in the current moment, and CONTEXT UNIT represents the output of the previous moment.
The decision a recurrent neural network reached at time t-1 affects the decision that it will reach one moment later at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life.
12. What is the difference between machine learning and deep learning?
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorised in following three categories.
Supervised machine learning
Unsupervised machine learning
Deep Learning is a sub field of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.
13. What is selection bias?
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomisation is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analysed. It is sometimes referred to as the selection effect.
The phrase “selection bias” most often refers to the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
14. Explain what regularisation is and why it is useful?
Regularisation is the process of adding tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple to an existing weight vector.
This constant is often the L1(Lasso) or L2(ridge). The model predictions should then minimize the loss function calculated on the regularized training set.
15. What are Recommender Systems?
A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.
16. What is the difference between Regression and classification of ML techniques?
Both Regression and classification machine learning techniques come under Supervised machine learning algorithms. In Supervised machine learning algorithm, we have to train the model using the labelled data set, While training we have to explicitly provide the correct labels and algorithm tries to learn the pattern from input to output.
If our labels are discrete values then it will a classification problem, e.g A, B etc. but if our labels are continuous values then it will be a regression problem, e.g 1.23, 1.333 etc.
17. If you are having 4GB RAM in your machine and you want to train your model on 10GB data set. How would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?
First of all you have to ask which ML model you want to train.
For Neural networks: Batch size with Numpy array will work.
Load the whole data in Numpy array. Numpy array has property to create mapping of complete data set, it doesn’t load complete data set in memory.
You can pass index to Numpy array to get required data.
Use this data to pass to Neural network.
Have small batch size.
For SVM: Partial fit will work
Divide one big data set in small size data sets.
Use partial fit method of SVM, it requires subset of complete data set.
Repeat step 2 for other subsets.
18. What is p-value?
When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. p-value is a number between 0 and 1. Based on the value it will denote the strength of the results. The claim which is on trial is called Null Hypothesis.
Low p-value (≤ 0.05) indicates strength against the null hypothesis which means we can reject the null Hypothesis. High p-value (≥ 0.05) indicates strength for the null hypothesis which means we can accept the null Hypothesis p-value of 0.05 indicates the Hypothesis could go either way. To put it in another way,
High P values: your data are likely with a true null. Low P values: your data are unlikely with a true null.
19. What is ‘Naive’ in a Naive Bayes?
The Naive Bayes Algorithm is based on the Bayes Theorem. Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
What is Naive?
The Algorithm is ‘naive’ because it makes assumptions that may or may not turn out to be correct.
20. Why we generally use Softmax non-linearity function as last operation in network?
It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are no constraints). Then the i’th component of Softmax(x) is-
It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
21. Please explain the goal of A/B Testing?
A/B Testing is a statistical hypothesis testing meant for a randomized experiment with two variables, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of some interest by identifying any changes to a webpage.
A highly reliable method for finding out the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.
22. How will you calculate the Sensitivity of machine learning models?
In machine learning, Sensitivity is used for validating the accuracy of a classifier, such as Logistic, Random Forest, and SVM. It is also known as REC (recall) or TPR (true positive rate).
Sensitivity can be defined as the ratio of predicted true events and total events i.e.:
Sensitivity = True Positives / Positives in Actual Dependent Variable
Here, true events are those events that were true as predicted by a machine learning model. The best sensitivity is 1.0 and the worst sensitivity is 0.0.
23. Could you draw a comparison between overfitting and underfitting?
In order to make reliable predictions on general untrained data in machine learning and statistics, it is required to fit a (machine learning) model to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.
Following are the various differences between overfitting and underfitting-
Definition – A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.
Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. Example of a complex model is one having too many parameters when compared to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.
Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.
24. Between Python and R, which one would you pick for text analytics and why?
For text analytics, Python will gain an upper hand over R due to these reasons:
The Pandas library in Python offers easy-to-use data structures as well as high-performance data analysis tools
Python has a faster performance for all types of text analytics
R is a best-fit for machine learning than mere text analysis
25. Please explain the role of data cleaning in data analysis?
Data cleaning can be a daunting task due to the fact that with the increase in the number of data sources, the time required for cleaning the data increases at an exponential rate.
This is due to the vast volume of data generated by additional sources. Also, data cleaning can solely take up to 80% of the total time required for carrying out a data analysis task.
Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:
Cleaning data from different sources helps in transforming the data into a format that is easy to work with
Data cleaning increases the accuracy of a machine learning model
26. What do you mean by cluster sampling and systematic sampling?
When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements.
Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion. This is done in such a way so that once the end of the list is reached, the same is progressed from the start, or top, again.
27. Please explain Eigenvectors and Eigenvalues?
Eigenvectors help in understanding linear transformations. They are calculated typically for a correlation or covariance matrix in data analysis.
In other words, eigenvectors are those directions along which some particular linear transformation acts by compressing, flipping, or stretching.
Eigenvalues can be understood either as the strengths of the transformation in the direction of the eigenvectors or the factors by which the compressions happens.
28. Can you compare the validation set with the test set?
A validation set is part of the training set used for parameter selection as well as for avoiding overfitting of the machine learning model being developed. On the contrary, a test set is meant for evaluating or testing the performance of a trained machine learning model.
Instead of using k-fold cross-validation, you should be aware to the fact that a time series is not randomly distributed data — It is inherently ordered by chronological order.
In case of time series data, you should use techniques like forward chaining — Where you will be model on past data then look at forward-facing data.
fold 1: training, test
fold 1: training[1 2], test
fold 1: training[1 2 3], test
fold 1: training[1 2 3 4], test
29. What do you understand by the term Normal Distribution?
Data is usually distributed in different ways with a bias to the left or to the right or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve. The random variables are distributed in the form of an asymmetrical bell-shaped curve.
30. How will you define the number of clusters in a clustering algorithm?
Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters.
For example, the following image shows three different groups.
Within Sum of squares is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of number of clusters, you will get the plot shown below. The Graph is generally known as Elbow Curve.
Red circled point in above graph i.e. Number of Cluster =6 is the point after which you don’t see any decrement in WSS. This point is known as bending point and taken as K in K — Means.This is the widely used approach but few data scientists also use Hierarchical clustering first to create dendograms and identify the distinct groups from there.
31. In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?
Probability of not seeing any shooting star in 15 minutes is
= 1 – P( Seeing one shooting star )
= 1 – 0.2 = 0.8
Probability of not seeing any shooting star in the period of one hour
= (0.8) ^ 4 = 0.4096
Probability of seeing at least one shooting star in the one hour
= 1 – P( Not seeing any star )
= 1 – 0.4096 = 0.5904
32. How can you generate a random number between 1 – 7 with only a die?
Any die has six sides from 1-6. There is no way to get seven equal outcomes from a single rolling of a die. If we roll the die twice and consider the event of two rolls, we now have 36 different outcomes.
To get our 7 equal outcomes we have to reduce this 36 to a number divisible by 7. We can thus consider only 35 outcomes and exclude the other one.
A simple scenario can be to exclude the combination (6,6), i.e., to roll the die again if 6 appears twice.
All the remaining combinations from (1,1) till (6,5) can be divided into 7 parts of 5 each. This way all the seven sets of outcomes are equally likely.
33. A certain couple tells you that they have two children, at least one of whom is a girl. What is the probability that they have two girls?
In the case of two children, there are 4 equally likely possibilities
BB, BG, GB and GG;
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus from the remaining 3 possibilities of BG, GB & BB, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1 / 3
34. A jar has 1000 coins, of which 999 are fair and 1 is double-headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?
There are two ways of choosing the coin. One is to pick a fair coin and the other is to pick the one with two heads.
Probability of selecting fair coin = 999/1000 = 0.999
Probability of selecting unfair coin = 1/1000 = 0.001
Selecting 10 heads in a row = Selecting fair coin * Getting 10 heads + Selecting an unfair coin
P (A) = 0.999 * (1/2)^5 = 0.999 * (1/1024) = 0.000976
P (B) = 0.001 * 1 = 0.001
P( A / A + B ) = 0.000976 / (0.000976 + 0.001) = 0.4939
P( B / A + B ) = 0.001 / 0.001976 = 0.5061
Probability of selecting another head = P(A/A+B) * 0.5 + P(B/A+B) * 1 = 0.4939 * 0.5 + 0.5061 = 0.7531
35. How to combat Overfitting and Underfitting?
To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and by having a validation dataset to evaluate the model.
36. What is Cluster Sampling?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
For eg., A researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.
Let’s continue our Data Science Interview Questions blog with some more statistics questions.
37. What is Systematic Sampling?
Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is equal probability method.
38. Can you explain the difference between a Validation Set and a Test Set?
A Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a Test Set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as; training set is to fit the parameters i.e. weights and test set is to assess the performance of the model i.e. evaluating the predictive power and generalization.
39. How can outlier values be treated?
Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is few then they can be assessed individually but for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values.
All extreme values are not outlier values. The most common ways to treat outlier values
To change the value and bring it within a range.
To just remove the value.
40. What are the various steps involved in an analytics project?
The following are the various steps involved in an analytics project-
- Understand the Business problem
- Explore the data and become familiar with it.
- Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
- After data preparation, start running the model, analyze the result and tweak the approach. This is an iterative step until the best possible outcome is achieved.
- Validate the model using a new data set
- Start implementing the model and track the result to analyze the performance of the model over the period of time.
41. How Regularly Must an Algorithm be Updated?
You will want to update an algorithm when-
- You want the model to evolve as data streams through infrastructure
- The underlying data source is changing
- There is a case of non-stationarity
- The algorithm underperforms/ results lack accuracy
42. Describe the structure of Artificial Neural Networks?
Artificial Neural Networks works on the same principle as a biological Neural Network. It consists of inputs which get processed with weighted sums and Bias, with the help of Activation Functions.
43. What Are Hyperparameters?
With neural networks, you’re usually working with hyperparameters once the data is formatted correctly. A hyperparameter is a parameter whose value is set before the learning process begins.
It determines how a network is trained and the structure of the network (such as the number of hidden units, the learning rate, epochs, etc.).
44. What Will Happen If the Learning Rate Is Set inaccurately (Too Low or Too High)?
When your learning rate is too low, training of the model will progress very slowly as we are making minimal updates to the weights. It will take many updates before reaching the minimum point.
If the learning rate is set too high, this causes undesirable divergent behaviour to the loss function due to drastic updates in weights. It may fail to converge (model can give a good output) or even diverge (data is too chaotic for the network to train).
45. What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?
Epoch – Represents one iteration over the entire dataset (everything put into the training model).
Batch – Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches.
Iteration – if we have 10,000 images as data and a batch size of 200. then an epoch should run 50 iterations (10,000 divided by 50).
46. What Are the Different Layers on CNN?
There are four layers in CNN:
Convolutional Layer – the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.
ReLU Layer – it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.
Pooling Layer – pooling is a down-sampling operation that reduces the dimensionality of the feature map.
Fully Connected Layer – this layer recognizes and classifies the objects in the image.
47. How Does an LSTM Network Work?
Long-Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behaviour. There are three steps in an LSTM network:
Step 1: The network decides what to forget and what to remember.
Step 2: It selectively updates cell state values.
Step 3: The network decides what part of the current state makes it to the output.
48. Explain Gradient Descent.
To Understand Gradient Descent, Let’s understand what is a Gradient first.
A gradient measures how much the output of a function changes if you change the inputs a little bit. It simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function.
Gradient Descent can be thought of climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm that minimizes a given function (Activation Function).
49. What is Back Propagation and Explain it’s Working.
Backpropagation is a training algorithm used for multilayer neural network. In this method, we move the error from an end of the network to all weights inside the network and thus allowing efficient computation of the gradient.
It has the following steps:
- Forward Propagation of Training Data
- Derivatives are computed using output and target
- Back Propagate for computing derivative of error wrt output activation
- Using previously calculated derivatives for output
- Update the Weights
50. What is an Auto-Encoder?
Auto-encoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means that we want the output to be as close to input as possible.
We add a couple of layers between the input and the output, and the sizes of these layers are smaller than the input layer. The auto-encoder receives unlabelled input which is then encoded to reconstruct the input.