Got a data science coding interview lined up? Chances are you are interviewing for an ML engineering or data scientist position. Companies that run data science coding interviews include Google, Meta, Stripe, and many startups. Coding questions are peppered throughout the technical screen and on-site rounds. We will cover the following areas of the data science coding interview so you are well-prepared for your upcoming interview:
- What is the Data Science Coding Interview?
- Areas Covered in Data Science Coding
- Sample Questions and Solutions
- Prep Tips
What is the Data Science Coding Interview?
Let's start with how the interview is conducted. You will likely hop onto a virtual call with a code or text editor. The interviewer will most likely be a senior/staff MLE or data scientist who will be evaluating you based on code proficiency, accuracy, and interpretability. Your communication skills - the ability to understand the problem and explain your thoughts clearly - will also be assessed.
Areas Covered in Data Science Coding
There are four major areas often assessed in data science coding interviews: data structures & algorithms, data manipulation, statistical coding, and machine learning functions. Which areas you face tends to be role-specific.
- MLE / Full-Stack Data Scientist - If the role requires you to deploy models to production, then you should expect algorithms & data structure questions. This means you should brush up on strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs.
- Product / Generalist Data Scientist - You should expect more coverage of data manipulation questions. These are "Pandas SQL" problems: leveraging Pandas to solve SQL-like table manipulation problems. Occasionally, you may also be asked statistical coding problems, which tend to appear in quant roles, Google DS interviews, and the like.
- Data Analyst - Like product and generalist data scientist roles, you should expect "Pandas SQL" problems that involve leveraging Pandas to solve SQL-like table manipulations. You don't need to worry too much about the other areas.
Now, let's do a deep-dive on each of the four areas.
Data Structures & Algorithms
These are the classic SWE questions posed in data science interviews: strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, sometimes, trees and graphs. You should have a firm grasp of runtime and space complexity and aim to write the most optimal solution.
# Sample Questions
1. [Microsoft] Function to check whether a word is a palindrome
2. [Adobe] Program to find a number from two sorted arrays such that the sum of the two numbers is closest to an integer
3. [Amazon] Find the shortest path between two coordinates.
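To give a flavor of question 1, a palindrome check can be done with a two-pointer scan (a minimal sketch; the function name and the case-insensitive comparison are assumptions, since interviewers vary on how to treat case and punctuation):

```python
def is_palindrome(word: str) -> bool:
    """Check whether a word reads the same forwards and backwards.

    Comparison is case-insensitive; assumes the input is a single word.
    """
    left, right = 0, len(word) - 1
    while left < right:
        if word[left].lower() != word[right].lower():
            return False
        left += 1
        right -= 1
    return True

print(is_palindrome("Racecar"))  # True
print(is_palindrome("hello"))    # False
```

In the interview, be ready to state the complexity: this runs in O(n) time and O(1) extra space, versus the O(n) space of the one-liner `word == word[::-1]`.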
Data Manipulation
These are SQL-like table manipulations. Familiarity with Pandas or R DataFrames is essential for tackling these questions. The common operations you should be familiar with are: selection, aggregation, lags, group by, partition by, filtering, joins, sorting, and ranking.
# Sample Questions
#| post_id | user_id | post_text | post_date | likes_count | comments_count | post_type |
#|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
#| 1 | 101 | "Enjoying a day at the beach!"| 2023-07-25 | 217 | 30 | Photo |
#| 2 | 102 | "Just finished a great book!" | 2023-07-24 | 120 | 18 | Status |
#| 3 | 103 | "Check out this cool video!" | 2023-07-23 | 345 | 47 | Video |
#| 4 | 101 | "That's awesome?" | 2023-07-22 | 52 | 70 | Status |
# 1. Using the following dataset, find users who never posted a photo
# 2. Retrieve users who posted more than three times but received less than 100 total likes
# 3. Find the user with the highest average comments per post
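For instance, question 1 can be answered with a simple exclusion filter in Pandas (a sketch; the DataFrame construction below recreates only the relevant columns of the sample table):

```python
import pandas as pd

# Recreate the relevant columns of the sample table
df = pd.DataFrame({
    'post_id': [1, 2, 3, 4],
    'user_id': [101, 102, 103, 101],
    'post_type': ['Photo', 'Status', 'Video', 'Status'],
})

# Users who posted at least one photo
photo_users = df.loc[df['post_type'] == 'Photo', 'user_id'].unique()

# Users who never posted a photo
never_photo = df.loc[~df['user_id'].isin(photo_users), 'user_id'].unique()
print(never_photo)  # [102 103]
```

Walking the interviewer through the two-step logic (find the "photo" users, then exclude them) mirrors how you would explain a `NOT IN` subquery in SQL.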
Statistical Coding
These are Google-style questions that involve statistical simulation or writing functions that return statistical values, such as the Pearson correlation coefficient. You should expect such questions across interviews generally, but particularly in quant, Google, and startup interviews. Depending on the interview, you may be allowed to load third-party libraries like Numpy and Scipy; ask the interviewer for specifics.
# Sample Questions
1. [Google] In a World Series, suppose that the probability of team A winning a single match is 0.60. What is the probability that team A wins the best-of-7 World Series? Use Numpy should you need to.
2. [Google] Demonstrate the confidence interval. Use Numpy and Scipy should you need to.
3. [Microsoft] Write a function that computes the inverse matrix. Use Numpy should you need to.
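As an example of the kind of statistical function you might be asked to write, here is the Pearson correlation coefficient implemented from scratch (a sketch in vanilla Python; the function name is my own, and some interviews would let you reach for Numpy instead):

```python
import math

def pearson_corr(x, y):
    """Compute the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two variance terms of the denominator
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_corr([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```

Being able to state the formula - covariance divided by the product of the standard deviations - while you code it is exactly the conceptual check these questions are after.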
Machine Learning Functions
ML coding is similar to the LeetCode style, but the main difference is that you apply machine learning concepts in code. Expect to write ML functions from scratch. Sometimes, you will not be allowed to import third-party libraries like SkLearn, as the questions are designed to assess your conceptual understanding and coding ability.
# Sample Questions
1. [Uber] Write an AUC function from scratch using vanilla Python
2. [Google] Write the K-Means algorithm using Numpy only
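To illustrate question 1, AUC can be computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (a minimal O(n²) pairwise sketch in vanilla Python; ties count as half, and the function name is an assumption):

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison: P(score_pos > score_neg); ties count 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

If the interviewer pushes on efficiency, mention that sorting by score and using ranks brings this down to O(n log n).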
Sample Problem Sets with Solutions
Now, let's practice with example problems. I will also discuss solutions.
Problem 1 - Data Manipulation
An interviewer at Meta asked:
[Meta] Retrieve users who posted more than three times but received less than 100 total likes
| post_id | user_id | post_text | post_date | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1 | 101 | "Enjoying a day at the beach!"| 2023-07-25 | 217 | 30 | Photo |
| 2 | 102 | "Just finished a great book!" | 2023-07-24 | 120 | 18 | Status |
| 3 | 103 | "Check out this cool video!" | 2023-07-23 | 345 | 47 | Video |
| 4 | 101 | "That's awesome?" | 2023-07-22 | 52 | 70 | Status |
Solution
# Logic
# 1. Group the original DataFrame by user_id.
# 2. Calculate the sum of the likes_count column and the count of posts for each user.
# 3. Filter the grouped data for users who posted more than three times but received less than 100 total likes.
import pandas as pd

# Construct the DataFrame from the sample table (columns relevant to the question)
df = pd.DataFrame({'post_id': [1, 2, 3, 4],
                   'user_id': [101, 102, 103, 101],
                   'likes_count': [217, 120, 345, 52]})
# Group by user_id and calculate sum of likes_count and count of posts
grouped_users = df.groupby('user_id').agg({'likes_count': 'sum', 'post_id': 'count'})
# Filter users who posted more than three times but received less than 100 total likes
filtered_users_optimal_approach = grouped_users[(grouped_users['post_id'] > 3) & (grouped_users['likes_count'] < 100)]
filtered_users_optimal_approach
Problem 2 - Statistical Coding
An interviewer at Google asked:
Demonstrate the confidence interval. Use Numpy and Scipy should you need to.
Solution
# Import libraries
import numpy as np
import scipy.stats as sci
# Set the random seed
np.random.seed(111)
# Set the simulation parameters
pop_mean = 100 # Population mean
pop_std = 10 # Population standard deviation
sample_size = 100 # Sample size
num_samples = 1000 # Number of samples in the simulation
alpha = 0.05 # Set the alpha
# Run simulation
mean_in_interval = 0 # Count the number of times the pop. mean is in the CI interval
for i in range(num_samples):
# Sample 100 observations from a normal distribution
obs = np.random.normal(loc=100, scale=10, size=sample_size)
# Get the mean and standard error
sample_mean = np.mean(obs)
standard_error = sci.sem(obs)
# Generate the 95% confidence interval of the mean
lower, upper = sci.t.interval(confidence=(1-alpha), df=sample_size-1, loc=sample_mean, scale=standard_error)
# Count the instances where the population mean falls within the interval
if pop_mean > lower and pop_mean < upper:
mean_in_interval += 1
# Generate the proportion of the times that the pop. mean is in the CI interval
proportion = mean_in_interval / num_samples
print(f'Based on a simulation of {num_samples} trials, the true population mean,\n'
f'{pop_mean}, is found in the {1-alpha} confidence interval about {proportion*100}% of the time.')
Problem 3 - Machine Learning Functions
An interviewer at Google asked:
[Google] Write the K-Means algorithm using Numpy only
Solution
import numpy as np
class KMeans:
def __init__(self, k=2, max_iterations=500):
self.k = k
self.max_iterations = max_iterations
def fit(self, X):
# Initialize centroids randomly
self.centroids = X[np.random.choice(range(len(X)), self.k, replace=False)]
for i in range(self.max_iterations):
# Assign each data point to the nearest centroid
clusters = [[] for _ in range(self.k)]
for x in X:
distances = [np.linalg.norm(x - c) for c in self.centroids]
cluster = np.argmin(distances)
clusters[cluster].append(x)
# Recalculate centroids
prev_centroids = self.centroids
self.centroids = []
for cluster in clusters:
if cluster:
self.centroids.append(np.mean(cluster, axis=0))
else:
self.centroids.append(prev_centroids[np.random.choice(range(self.k))])
# Check for convergence
if np.allclose(prev_centroids, self.centroids):
break
def predict(self, X):
distances = [np.linalg.norm(X - c, axis=1) for c in self.centroids]
return np.argmin(distances, axis=0)
Prep Tips
Tip 1 - Front-Load Python Problem Sets
Those who succeed in coding interviews are often "primed" for them. Since coding is usually assessed first, on the technical screen, you must front-load coding as part of your daily/weekly prep. Go through about 2 to 3 problems daily leading up to the interview. For more resources, visit datainterview.com
Tip 2 - Practice Explaining Verbally
Interviewing is not a written exercise; it's a verbal one. Whenever the interviewer asks you a coding question, you must explain your solution clearly and in detail. As you work through practice questions, practice explaining them out loud.
Tip 3 - Join the Ultimate Prep
Get access to ML questions, cases, and machine learning mock interview recordings when you join the interview program on datainterview.com