Data Science Interview Questions

Top Data Science Interview Questions And Answers, Practice Now


To learn and practice data science, go through this page, where we have collected the top 50 data science interview questions to help you prepare for an upcoming interview. These are frequently asked questions, with answers, drawn from recent interviews at large multinational companies. Both freshers and experienced candidates can use them, as the questions cover all difficulty levels.

About Data Science: Data science combines multiple fields, including statistics, scientific methods, artificial intelligence, and data analysis, to extract value from data. Those who practice it are called data scientists; they apply a range of skills to analyze data collected from different sources, such as the web and smartphones, and derive actionable insights.


Data Science Interview Questions

1. Explain Normal Distribution?

2. Tell the ways to handle missing values in data?

3. What is the way to verify the items available in list A are available in series B?

4. Tell the way to find the placing of numbers from a series that are multiples of 4?

5. Are KNN and K-means clustering different?

6. Are you able to stack two series horizontally? If yes, how?

7. How can you transform date-strings to time-series in a series?

8. What is the ROC curve?

9. How is AUC dissimilar from ROC?

10. Why is Naive Bayes referred to as Naive?

11. Explain the confusion matrix?

12. Explain SVM and name some kernels used in it?

13. Tell the function to create a series from a given list in Pandas?

14. How can you calculate significance using p-value?

15. Why don’t gradient descent methods always converge to the same point?

16. What is A/B testing?

17. Explain box cox transformation?

18. Differentiate recall and precision?

19. Explain the pickle module in Python?

20. Tell the forms of joins in a table?

21. Differentiate between DELETE and TRUNCATE commands.

22. Name some clauses commonly used in SQL?

23. Write the function of getting the second highest salary of an employee from employee_table?

24. Explain a foreign key?

25. Explain Data Integrity?

26. Name some NoSQL databases?

27. How to use Hadoop in Data Science?

28. Explain different types of analysis?

29. Explain lambda expression in Python?

30. Make an identity matrix using NumPy?

31. Make a 1-D array in numpy?

32. Name different libraries of Python to use in Data Science?

33. Calculate the Euclidean distance in Python of the given plots?

34. Tell the way to add a border filled with 0s near an existing array?

35. Assume a (5,6,7) shape array, tell the index (x,y,z) of the 50th element?

36. Multiply a 4×3 matrix by a 3×2 matrix?

37. Name the type of biases?

38. Explain a z-score?

39. Tell the ideal seed for tuning hyperparameters of your machine learning model?

40. Differentiate between Eigenvectors and Eigenvalue.

41. Is Pearson capturing the monotonic behavior between two variables and Spearman captures how linearly dependent the two variables are?

42. What do you understand by regularization and its uses?

43. What does the cost parameter in SVM stand for?

44. Is gradient descent stochastic in nature? Why?

45. Tell the function of subtracting means of each row of matrix?

46. Explain law of large numbers?

47. What do you understand by L1 and L2 Regularization?

48. In the Latent Dirichlet Model for text classification what does the Alpha and Beta Hyperparameter stand for?

49. Can the LogLoss evaluation metric take negative values?

50. What do you understand by the TF/IDF Vectorization?




Data Science Interview Questions And Answers

1. Explain Normal Distribution?

The normal distribution, also called the Gaussian distribution, is a probability distribution that is symmetric about its mean. Values near the mean occur more frequently than values far from it, which gives the familiar bell-shaped curve.
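As a small illustration of the symmetry and the concentration around the mean, the following sketch uses the standard library's NormalDist. The choice of a standard normal (mean 0, standard deviation 1) is an assumption made only for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (chosen for illustration).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: the density is the same at equal distances on either side of the mean.
density_left = standard_normal.pdf(-1.0)
density_right = standard_normal.pdf(1.0)

# Concentration: probability of falling within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(density_left, density_right)
print(round(within_one_sd, 4))
```

Roughly 68% of the probability mass lies within one standard deviation of the mean.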

2. Tell the ways to handle missing values in data?

Some of the ways are:
Predicting the value with regression.
Dropping the values.
Replacing the value with the mean, median, or mode of the observations.
Finding an appropriate value with clustering.
Deleting the observation (not always recommended).

3. What is the way to verify the items available in list A are available in series B?

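Use the isin() function. First create two series, s1 and s2 (with pandas imported as pd):
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([4, 5, 6, 7, 8])
s1[s1.isin(s2)]
This returns the items of s1 that are also present in s2 (here 4 and 5). A plain Python list can be passed to isin() in the same way.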

4. Tell the way to find the placing of numbers from a series that are multiples of 4?

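To find the positions of the multiples of 4, we can use NumPy's argwhere() function, with pandas imported as pd and NumPy as np. First, make a series of 10 numbers:
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.argwhere((s1 % 4 == 0).to_numpy())
Output > [3], [7]
Converting the boolean mask with to_numpy() first avoids problems that some pandas versions have when argwhere() is applied directly to a Series.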

5. Are KNN and K-means clustering different?

KNN (k-nearest neighbors) is a supervised learning algorithm, so it needs labeled data for training. K-means is an unsupervised learning algorithm that finds patterns (clusters) inherent in the data. The K in KNN is the number of nearest neighbors considered; the K in K-means is the number of centroids (clusters).

6. Are you able to stack two series horizontally? If yes, how?

By using the concat() function and setting axis = 1, we can stack the two series horizontally. df = pd.concat([s1, s2], axis=1)

7. How can you transform date-strings to time-series in a series?

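Input:
s = pd.Series(['02 Feb 2011', '02-02-2013', '20160104', '2011/01/04', '2014-12-05', '2010-06-06T12:05'])
Use the to_datetime() function to solve this:
pd.to_datetime(s)
Because the strings use several different formats, pandas 2.0 and later may require pd.to_datetime(s, format='mixed').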

8. What is the ROC curve?

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The true positive rate is computed as TPR = TP / (TP + FN), and the false positive rate as FPR = FP / (FP + TN), where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
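To make the two rates concrete, here is a minimal sketch that computes one (FPR, TPR) point per threshold. The labels, scores, thresholds, and the helper name roc_point are made up for illustration.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn)  # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn)  # FPR = FP / (FP + TN)
    return fpr, tpr

# Hypothetical true labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Sweeping the threshold from high to low traces the curve from (0, 0) to (1, 1).
roc_points = [roc_point(labels, scores, t) for t in (0.9, 0.5, 0.3, 0.0)]
print(roc_points)
```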

9. How is AUC dissimilar from ROC?

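ROC is a curve: it plots the True Positive Rate against the False Positive Rate. AUC (Area Under the Curve) is a single number, the area under that ROC curve; it ranges from 0 to 1, and 0.5 corresponds to random guessing. Neither should be confused with the precision-recall curve, which plots precision = TP / (TP + FP) against recall = TP / (TP + FN).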
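One way to see AUC as a single number is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data; the function name auc_by_pairs is hypothetical, and real projects would normally rely on a library routine instead.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```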

10. Why is Naive Bayes referred to as Naive?

Naive Bayes assumes that the features are independent of each other given the class, so the probability of each feature is calculated separately. It is this assumption of feature independence, which rarely holds exactly in real data, that makes Naive Bayes “Naive”.

11. Explain the confusion matrix?

A confusion matrix is a table that summarizes the performance of a supervised classification algorithm by comparing predicted classes with actual classes. With a confusion matrix, you find not only how many errors were made but also which types of errors (false positives and false negatives).
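For a binary classifier, the four cells of the confusion matrix can be counted directly. The labels below are invented for the sake of the example.

```python
from collections import Counter

# Hypothetical actual and predicted classes for eight observations.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives: actual 1, predicted 0
fp = counts[(0, 1)]  # false positives: actual 0, predicted 1
tn = counts[(0, 0)]  # true negatives

print("            predicted 1  predicted 0")
print(f"actual 1    {tp:11d}  {fn:11d}")
print(f"actual 0    {fp:11d}  {tn:11d}")
```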

12. Explain SVM and name some kernels used in it?

SVM stands for support vector machine. It is used for classification and regression tasks. An SVM finds a separating plane, called a hyperplane, that distinguishes between the two classes of data points. Some kernels used in SVM are:
Polynomial kernel
Laplace RBF kernel
Gaussian (RBF) kernel
Hyperbolic tangent (sigmoid) kernel

13. Tell the function to create a series from a given list in Pandas?

Pass the list to the Series() function: ser1 = pd.Series(mylist)

14. How can you calculate significance using p-value?

After a hypothesis test is done, we use the p-value to judge the significance of the result. The p-value lies between 0 and 1. If the p-value is lower than the chosen significance level (commonly 0.05), we reject the null hypothesis. If the p-value is higher than 0.05, we fail to reject the null hypothesis.
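As a rough sketch of the decision rule, the code below turns a z test statistic into a two-sided p-value and compares it with a significance level of 0.05. The z values are made up, and the use of a z-test is an assumption for illustration; other tests compute the p-value differently.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

alpha = 0.05  # conventional significance level

p_strong = two_sided_p_value(2.5)  # strong evidence against the null hypothesis
p_weak = two_sided_p_value(1.0)    # weak evidence against the null hypothesis

for p in (p_strong, p_weak):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(round(p, 4), decision)
```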

15. Why don’t gradient descent methods always converge to the same point?

Gradient descent can converge to a local minimum (or another local optimum) rather than the global minimum, so it does not always reach the same point. Where it ends up depends on the data, the learning rate, and the starting point of the descent.
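The dependence on the starting point can be seen on a toy function. The sketch below assumes f(x) = (x^2 - 1)^2, which has two minima at x = -1 and x = 1; the function, learning rate, and step count are chosen only for illustration.

```python
def gradient(x):
    """Derivative of f(x) = (x**2 - 1)**2, a function with minima at -1 and +1."""
    return 4.0 * x * (x * x - 1.0)

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Same function, same learning rate, different starting points.
left_minimum = gradient_descent(-2.0)
right_minimum = gradient_descent(2.0)
print(left_minimum, right_minimum)
```

Starting at -2 ends near -1, while starting at 2 ends near 1.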

16. What is A/B testing?

A/B testing is hypothesis testing for a randomized experiment with two variants, A and B. It is commonly used to optimize web pages: a small modification is shown to a sample of users, while the rest of the audience sees the original page. Comparing the responses of the two groups with a statistical test shows whether the modification makes a real difference.
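One common way to compare two page variants is a two-proportion z-test on their conversion rates. The sketch below is an illustration under that assumption; the visitor and conversion counts are invented, and two_proportion_z_test is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 200 of 1000 users convert on A, 250 of 1000 on B.
z_score, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(round(z_score, 3), round(p_value, 4))
```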

17. Explain box cox transformation?

The Box-Cox transformation is used to transform the response variable so that the data satisfy the required assumptions. It can bring a non-normal dependent variable closer to a normal shape, which allows a wider range of statistical tests to be applied. It requires strictly positive data.
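The transformation itself is a simple formula with one parameter, lambda: (x^lambda - 1) / lambda for a non-zero lambda, and log(x) for lambda = 0. The sketch below applies it with hand-picked lambda values; in practice lambda is usually estimated from the data (for example with scipy.stats.boxcox), which is not shown here.

```python
from math import log

def box_cox(values, lam):
    """Apply the Box-Cox transformation with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]           # limiting case: natural log
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical right-skewed data.
skewed = [1.0, 2.0, 4.0, 8.0, 16.0]

log_scaled = box_cox(skewed, 0)      # lambda = 0: log transform
sqrt_scaled = box_cox(skewed, 0.5)   # lambda = 0.5: square-root-like transform
print(log_scaled)
print(sqrt_scaled)
```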

18. Differentiate recall and precision?

Recall is the fraction of actual positive instances that the model correctly identifies: recall = TP / (TP + FN). Precision is the fraction of instances predicted as positive that are actually positive: precision = TP / (TP + FP). Recall measures how many of the positives are found, whereas precision measures how trustworthy a positive prediction is.
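Using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), a short sketch with made-up counts shows how the two can diverge.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical classifier: 8 true positives, 2 false positives, 8 false negatives.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
print(precision, recall)
```

Here most positive predictions are correct (high precision), yet half of the actual positives are missed (lower recall).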

19. Explain the pickle module in Python?

The pickle module is used to serialize and de-serialize Python objects. Pickling transforms an object structure into a byte stream, which lets us store the object on disk and load it back later.
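A minimal round trip looks like this. The dictionary contents are hypothetical; pickle.dumps() and pickle.loads() work in memory, while pickle.dump() and pickle.load() do the same with a file object.

```python
import pickle

# A hypothetical object to serialize.
model_settings = {"name": "demo", "weights": [0.1, 0.2, 0.3], "epochs": 5}

# Serialize the object to a byte stream...
payload = pickle.dumps(model_settings)

# ...and reconstruct an equal, independent copy from those bytes.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == model_settings)
```

Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.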

20. Tell the forms of joins in a table?

Inner join, left (outer) join, right (outer) join, full (outer) join, self-join, and Cartesian (cross) join.

21. Differentiate between DELETE and TRUNCATE commands.

The DELETE command can be used with a WHERE clause to remove selected rows from a table, and the operation can be rolled back. TRUNCATE removes all the rows of a table; in many database systems it cannot be rolled back.

22. Name some clauses commonly used in SQL?

Some commonly used clauses are WHERE, GROUP BY, ORDER BY, and USING.

23. Write the function of getting the second highest salary of an employee from employee_table?

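Use the following query to get the second highest salary of an employee:
SELECT TOP 1 salary
FROM (SELECT TOP 2 salary FROM employee_table ORDER BY salary DESC) AS emp
ORDER BY salary ASC;
TOP is SQL Server syntax; other databases typically use LIMIT instead. If the highest salary is shared by several employees, use SELECT DISTINCT TOP 2 salary in the inner query.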
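A portable alternative, sketched here with Python's built-in sqlite3 module, takes the maximum salary that is strictly below the overall maximum. This is a different query from the one above, shown because it also handles ties for the top salary; the rows and the name column are made up.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE employee_table (name TEXT, salary INTEGER)")
connection.executemany(
    "INSERT INTO employee_table VALUES (?, ?)",
    [("Ann", 9000), ("Bob", 7000), ("Cara", 9000), ("Dev", 5000)],  # hypothetical rows
)

# Second highest salary = the largest salary strictly below the maximum.
query = """
    SELECT MAX(salary)
    FROM employee_table
    WHERE salary < (SELECT MAX(salary) FROM employee_table)
"""
second_highest = connection.execute(query).fetchone()[0]
connection.close()

print(second_highest)
```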

24. Explain a foreign key?

A foreign key is a column (or set of columns) in one table that refers to the primary key of another table. To link two tables, we match the foreign key of one table with the primary key of the other.
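The effect of a foreign key constraint can be sketched with the built-in sqlite3 module. The table and column names below are hypothetical; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON is issued.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

connection.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " department_id INTEGER REFERENCES department(id))"  # the foreign key
)

connection.execute("INSERT INTO department VALUES (1, 'Analytics')")
connection.execute("INSERT INTO employee VALUES (1, 'Ann', 1)")  # valid reference

try:
    # Department 99 does not exist, so this row violates the constraint.
    connection.execute("INSERT INTO employee VALUES (2, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

connection.close()
print(rejected)
```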

25. Explain Data Integrity?

Data integrity refers to the accuracy and consistency of data. This integrity must be maintained over the whole life cycle of the data.


Data Science Interview Questions For Experienced

26. Name some NoSQL databases?

Some of the widespread NoSQL databases are Redis, Cassandra, MongoDB, Neo4j, HBase, etc.

27. How to use Hadoop in Data Science?

Hadoop lets data scientists work with large-scale unstructured data. In addition, extensions of Hadoop such as Pig and Mahout provide features for analyzing data and running machine learning algorithms at scale. This makes Hadoop a broad system capable of handling all forms of data and well suited to data scientists.

28. Explain different types of analysis?

Univariate analysis: descriptive statistical analysis that involves only one variable at a time.
Bivariate analysis: describes the relationship between two variables at the same time. For example, it can analyze sales volume against spending volume using a scatterplot.
Multivariate analysis: involves more than two variables and examines the effect of the variables on the responses.

29. Explain lambda expression in Python?

A lambda expression creates an anonymous function. Unlike conventional functions defined with def, a lambda function consists of a single expression. The syntax is:
lambda arguments: expression
For example:
x = lambda a: a * 5
print(x(5))
We obtain an output of 25.

30. Make an identity matrix using NumPy?

We will use the identity() function to make the identity matrix, with NumPy imported as np:
np.identity(3)
We will obtain the output:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

31. Make a 1-D array in numpy?

x = np.array([1,2,3,4]) Where numpy is imported as np

32. Name different libraries of Python to use in Data Science?

Some of the important libraries are NumPy, SciPy, Pandas, Matplotlib, Keras, TensorFlow, and Scikit-learn.

33. Calculate the Euclidean distance in Python of the given plots?

from math import sqrt
plot1 = [1, 3]
plot2 = [2, 5]
We use the following expression to calculate the Euclidean distance:
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
For these points the result is sqrt(5), about 2.236.

34. Tell the way to add a border filled with 0s near an existing array?

To add a border filled with 0s around an existing array, we first import numpy as np and create an array Z, here initialized with 1s:
Z = np.ones((5,5))
After that, we add padding around it using the pad() function:
Z = np.pad(Z, pad_width=1, mode='constant', constant_values=0)
print(Z)

35. Assume a (5,6,7) shape array, tell the index (x,y,z) of the 50th element?

Indices are zero-based, so the 50th element has flat index 49:
print(np.unravel_index(49, (5,6,7)))
This gives the index (1, 1, 0).

36. Multiply a 4×3 matrix by a 3×2 matrix?

There are two ways to multiply the given matrices. The first uses np.dot(), which works in every Python version:
Z = np.dot(np.ones((4,3)), np.ones((3,2)))
Z is then array([[3., 3.], [3., 3.], [3., 3.], [3., 3.]])
The other uses the @ operator, available in Python 3.5 and later:
Z = np.ones((4,3)) @ np.ones((3,2))

37. Name the type of biases?

Sample bias, prejudice bias, measurement bias, and algorithm bias.

38. Explain a z-score?

The z-score, also called the standard score, is the number of standard deviations by which a data point lies above or below the population mean: z = (x - mean) / standard deviation. For normally distributed data, almost all z-scores fall between -3 and +3.
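As a worked example, the z-score of each value is (x - mean) / standard deviation. The data set below is made up, and the population standard deviation (pstdev) is used on the assumption that the list is the whole population.

```python
from statistics import mean, pstdev

# Hypothetical data set, treated as the full population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)        # population mean
sigma = pstdev(data)   # population standard deviation

# z = (x - mean) / standard deviation for every data point.
z_scores = [(x - mu) / sigma for x in data]
print(mu, sigma)
print(z_scores)
```

With a mean of 5 and a standard deviation of 2, the value 9 lies two standard deviations above the mean.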

39. Tell the ideal seed for tuning hyperparameters of your machine learning model?

There is no ideal value for the seed. The seed only initializes the random number generator; it is usually fixed to an arbitrary value so that hyperparameter tuning runs are reproducible, and it is not itself tuned.

40. Differentiate between Eigenvectors and Eigenvalue.

An eigenvector of a square matrix is a non-zero vector whose direction is unchanged when the linear transformation represented by the matrix is applied to it. The corresponding eigenvalue is the factor by which that eigenvector is scaled. In data analysis, eigenvectors and eigenvalues are usually computed for a correlation or covariance matrix.
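The defining relation is A v = lambda v: multiplying an eigenvector v by the matrix A only rescales it by the eigenvalue lambda. The sketch below checks that relation for a small symmetric matrix whose eigenpairs are known by hand; the matrix is an arbitrary example, and in practice a routine such as numpy.linalg.eig would compute the pairs.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

# Example 2x2 symmetric matrix (chosen for illustration).
A = [[2.0, 1.0],
     [1.0, 2.0]]

# Its eigenpairs, worked out by hand: (eigenvalue, eigenvector).
eigenpairs = [(3.0, [1.0, 1.0]), (1.0, [1.0, -1.0])]

# For each pair, A v should equal lambda * v.
checks = [mat_vec(A, v) == [lam * x for x in v] for lam, v in eigenpairs]
print(checks)
```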

41. Is Pearson capturing the monotonic behavior between two variables and Spearman captures how linearly dependent the two variables are?

No, actually it is the opposite. Pearson captures the linear relationship between the two variables and Spearman captures the monotonic behavior of the relation between the two variables.
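The difference shows up on data that is monotonic but not linear. The sketch below implements both coefficients by hand (Spearman as Pearson on ranks, assuming no tied values) and applies them to y = x^3; the data and helper names are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank values from 1..n (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# A monotonic but non-linear relationship: y = x cubed.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

pearson_r = pearson(xs, ys)
spearman_r = spearman(xs, ys)
print(round(pearson_r, 3), round(spearman_r, 3))
```

Spearman is 1 because the relationship is perfectly monotonic, while Pearson is below 1 because it is not linear.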

42. What do you understand by regularization and its uses?

Regularization is a technique for reducing error by adding a penalty to the function fitted on the training set, in order to avoid overfitting. While training, there is a good chance that the model learns noise, that is, data points that do not represent any property of the true data. This leads to overfitting. We therefore use regularization in machine learning models to minimize this form of error.

43. What does the cost parameter in SVM stand for?

The cost (C) parameter in SVM determines how closely the model should fit the training data. It adjusts the softness or hardness of the large-margin classification. A low cost gives a smooth decision surface that tolerates some misclassified points, whereas a higher cost tries to classify more training points correctly.

44. Is gradient descent stochastic in nature? Why?

The word stochastic means random. Plain (batch) gradient descent is deterministic, since it uses the full data set in every iteration. In stochastic gradient descent, the samples for each iteration are picked at random instead of using the full data set.

45. Tell the function of subtracting means of each row of matrix?

Use the mean() function as follows, with NumPy imported as np:
X = np.random.rand(5, 10)
Y = X - X.mean(axis=1, keepdims=True)

46. Explain law of large numbers?

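The law of large numbers states that as the number of trials of a random experiment grows, the observed average of the results gets closer to the expected value. For example, the relative frequencies of equally likely events even out over a large number of trials.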
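A quick simulation illustrates the idea with a fair coin, where the expected proportion of heads is 0.5. The seed and sample sizes below are arbitrary choices for a reproducible sketch.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible

def proportion_of_heads(flips):
    """Simulate fair coin flips and return the observed share of heads."""
    heads = sum(1 for _ in range(flips) if rng.random() < 0.5)
    return heads / flips

small_sample = proportion_of_heads(10)        # can be far from 0.5
large_sample = proportion_of_heads(100_000)   # should be very close to 0.5
print(small_sample, large_sample)
```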

47. What do you understand by L1 and L2 Regularization?

L1 and L2 regularization are both used to avoid overfitting in the model. L1 regularization (also called Lasso) can shrink some coefficients to exactly zero, which effectively eliminates features from the model. L2 regularization (also called Ridge) shrinks coefficients toward zero without eliminating them. L1 is therefore often preferred when many features are irrelevant or noisy.
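The different effect of the two penalties on individual coefficients can be sketched in a simplified setting. Assuming orthonormal features, the L1 (Lasso) solution soft-thresholds each unregularized weight, while the L2 (Ridge) solution scales it down; the weights and penalty strength below are made up, and the exact scaling of the strength depends on the convention used.

```python
def l1_shrink(weight, strength):
    """Soft-thresholding: the L1 update under the simplifying assumption above."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0  # small weights are set exactly to zero

def l2_shrink(weight, strength):
    """Proportional shrinkage: the L2 update under the same assumption."""
    return weight / (1.0 + strength)

# Hypothetical unregularized weights; two of them are small.
weights = [3.0, -0.2, 0.05, -1.5]

l1_result = [l1_shrink(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 0.5) for w in weights]
print(l1_result)
print(l2_result)
```

L1 zeroes out the two small weights, whereas L2 makes every weight smaller but keeps all of them non-zero.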

48. In the Latent Dirichlet Model for text classification what does the Alpha and Beta Hyperparameter stand for?

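In Latent Dirichlet Allocation (LDA), Alpha controls the density of topics per document, and Beta controls the density of terms per topic. A higher Alpha means each document is likely to contain a mixture of more topics; a higher Beta means each topic is likely to contain a mixture of more terms.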

49. Can the LogLoss evaluation metric take negative values?

No, LogLoss cannot be negative. It is the negated average of the logarithms of predicted probabilities, which lie between 0 and 1, so each term is zero or positive.
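A direct implementation of binary log loss shows why the value is never negative: every term is the negated logarithm of a probability. The labels and probabilities below are invented, and the small eps clipping is an assumption added to avoid log(0).

```python
from math import log

def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss: negated mean log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return -total / len(labels)

labels = [1, 0, 1]
good_loss = log_loss(labels, [0.9, 0.1, 0.8])  # confident and correct
bad_loss = log_loss(labels, [0.1, 0.9, 0.2])   # confident and wrong
print(round(good_loss, 4), round(bad_loss, 4))
```

Better predictions move the loss toward 0, and worse predictions make it larger, but it never drops below 0.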

50. What do you understand by the TF/IDF Vectorization?

TF/IDF stands for Term Frequency/Inverse Document Frequency and is used in information retrieval and text mining. It works as a weighting factor that reflects how important a word is to a document. The weight increases with the number of times the word appears in the document but is offset by the frequency of the word across the corpus.
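A minimal sketch of the weighting, using one common variant (term frequency as a share of the document length, inverse document frequency as log(N / document frequency)). Libraries often use smoothed variants, so their numbers can differ; the tiny corpus is made up.

```python
from math import log

# Hypothetical tokenized corpus of three short documents.
documents = [
    ["data", "science", "is", "fun"],
    ["data", "engineering", "is", "hard"],
    ["science", "needs", "data"],
]

def tf_idf(term, document, corpus):
    """TF-IDF with tf = count / length and idf = log(N / document frequency)."""
    tf = document.count(term) / len(document)
    document_frequency = sum(1 for doc in corpus if term in doc)
    idf = log(len(corpus) / document_frequency) if document_frequency else 0.0
    return tf * idf

common_weight = tf_idf("data", documents[0], documents)  # appears in every document
rare_weight = tf_idf("fun", documents[0], documents)     # appears in one document only
print(common_weight, round(rare_weight, 4))
```

A word that appears in every document gets a weight of 0, while a word specific to one document gets a positive weight.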

