[2023] Data Science Coding Interview Guide (+ Questions)

Got a data science coding interview lined up? Chances are that you are interviewing for an ML engineering and/or data scientist position. Companies that hold data science coding interviews include Google, Meta, Stripe, and startups. The coding questions are peppered throughout the technical screen and on-site rounds. We will cover the following areas of the data science coding interview so you are well prepared for your upcoming interview👇

📝 What is the Data Science Coding Interview?

📚 Areas Covered in Data Science Coding

✍️ Sample Questions and Solutions

💡 Prep Tips

📝 What is the Data Science Coding Interview?

Let’s start with how the interview is conducted. You will most likely hop onto a virtual call with a code or text editor. The interviewer will most likely be a senior/staff MLE or data scientist who will be evaluating you based on code proficiency, accuracy and interpretability. Your communication skills - the ability to understand and explain your thoughts clearly - will be assessed as well.


📚 Areas Covered in Data Science Coding

There are four major areas often assessed in data science coding interviews: data structures & algorithms, data manipulation, statistical coding, and machine learning functions. The areas covered tend to be role-specific.

MLE / Full-Stack Data Scientist - If the role requires you to deploy models to production, then you should expect data structures & algorithms questions. This means that you should brush up on strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs.
Product / Generalist Data Scientist - You should expect more coverage of data manipulation questions. These are so-called “Pandas SQL” problems, which involve using Pandas to solve SQL-like table manipulation problems. In some cases, you may be asked statistical coding problems, which tend to come up in quant roles, Google DS interviews, and similar settings.
Data Analyst - As with product and generalist data scientist roles, you should expect problems that involve using Pandas to solve SQL-like table manipulation problems. You don’t need to worry too much about the other areas.

Now, let’s do a deep-dive on each of the four areas.

📕 Data Structures & Algorithms

These are the classic SWE questions posed in data science interviews: strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs. You should have a firm grasp of runtime and space complexity, and aim to write the most efficient solution. A great place to learn about data structures & algorithms are

# Sample Questions
1. [Microsoft] Write a function to check whether a word is a palindrome.
2. [Adobe] Write a program to find a pair of numbers, one from each of two sorted arrays, whose sum is closest to a given integer.
3. [Amazon] Find the shortest path between two coordinates.
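
As an illustration of the first two questions, here is a minimal sketch in plain Python. The function names `is_palindrome` and `closest_pair_sum` are hypothetical, and the second function assumes both input lists are non-empty and sorted in ascending order. It uses the two-pointer technique, which runs in O(n + m) time rather than checking every pair.

```python
def is_palindrome(word):
    """Return True if the word reads the same forwards and backwards."""
    return word == word[::-1]


def closest_pair_sum(a, b, target):
    """Return (x, y), x from a and y from b, whose sum is closest to target.

    Assumes a and b are non-empty lists sorted in ascending order.
    """
    i, j = 0, len(b) - 1              # i scans a left to right, j scans b right to left
    best = (a[0], b[-1])
    while i < len(a) and j >= 0:
        total = a[i] + b[j]
        if abs(total - target) < abs(best[0] + best[1] - target):
            best = (a[i], b[j])
        if total < target:
            i += 1                    # sum too small: move to a larger value in a
        elif total > target:
            j -= 1                    # sum too large: move to a smaller value in b
        else:
            break                     # exact match, cannot do better
    return best
```

In an interview, it is worth stating the sortedness assumption out loud, since the pointer movement is only valid because of it.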

📘 Data Manipulation

These are SQL-like table manipulation problems. Familiarity with Pandas or R DataFrames is essential in tackling these questions. The common operations you should be familiar with are selection, aggregation, lags, group by, partition by, filtering, JOINs, sorting, and ranking.

# Sample Questions

| post_id | user_id | post_text                     | post_date  | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1       | 101     | "Enjoying a day at the beach!"| 2023-07-25 | 217         | 30             | Photo     |
| 2       | 102     | "Just finished a great book!" | 2023-07-24 | 120         | 18             | Status    |
| 3       | 103     | "Check out this cool video!"  | 2023-07-23 | 345         | 47             | Video     |
| 4       | 101     | "That's awesome?"             | 2023-07-22 | 52          | 70             | Status    |

1. [Meta] Using the dataset above, find users who never posted a photo
2. [Meta] Retrieve users who posted more than three times but received less than 100 total likes
3. [Meta] Find the user with the highest average comments per post
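
To make the first and third questions concrete, here is one possible Pandas sketch. It rebuilds only the columns it needs from the four sample rows in the table, and the variable names (`never_photo`, `top_user`) are illustrative choices rather than part of the original question.

```python
import pandas as pd

# The four sample rows from the table (post_text and post_date omitted)
df = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "user_id": [101, 102, 103, 101],
    "likes_count": [217, 120, 345, 52],
    "comments_count": [30, 18, 47, 70],
    "post_type": ["Photo", "Status", "Video", "Status"],
})

# Question 1: users who never posted a photo
# (all users minus the users with at least one Photo post)
photo_users = df.loc[df["post_type"] == "Photo", "user_id"].unique()
never_photo = sorted(set(df["user_id"]) - set(photo_users))

# Question 3: user with the highest average comments per post
avg_comments = df.groupby("user_id")["comments_count"].mean()
top_user = avg_comments.idxmax()
```

On the sample rows, users 102 and 103 never posted a photo, and user 101 has the highest average comment count (50 per post).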

📗 Statistical Coding

These are Google-style questions that involve statistical simulation or writing functions that return statistical values such as the Pearson correlation coefficient. You should expect such questions across interviews generally, but particularly in quant, Google, and startup interviews. Depending on the interview, you may be allowed to load third-party libraries like NumPy and SciPy, but you will need to ask the interviewer for the specifics.
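
For a sense of what "writing a function that provides a statistical value" can look like, here is a rough sketch of the Pearson correlation computed without third-party libraries. The function name `pearson_correlation` and its error handling are assumptions made for illustration.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when an input is constant")
    return cov / math.sqrt(ss_x * ss_y)
```

The 1/n (or 1/(n-1)) factors cancel between the numerator and denominator, so the sketch works directly with sums of deviations.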

# Sample Questions
1. [Google] In a World Series, suppose that the probability of team A winning a match is 0.60. What is the probability that team A wins the World Series in each of the 7 matches? Use Numpy should you need to.
2. [Google] Demonstrate the confidence interval. Use Numpy and Scipy should you need to.
3. [Microsoft] Write a function that computes the inverse matrix. Use Numpy should you need to.
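
One way to approach the World Series question is sketched below. It rests on assumptions that the question does not state explicitly: the series is best-of-seven (first team to 4 wins), and games are independent with the same win probability. Under those assumptions, the sketch computes the probability that team A clinches in exactly 4, 5, 6, or 7 games, sums those for the overall probability, and cross-checks the result with a NumPy simulation.

```python
from math import comb

import numpy as np

p = 0.60   # probability that team A wins any single game (from the question)
q = 1 - p

# Exact: team A clinches in game n if it wins game n
# and exactly 3 of the previous n - 1 games
win_in_n_games = {n: comb(n - 1, 3) * p**4 * q**(n - 4) for n in range(4, 8)}
exact = sum(win_in_n_games.values())

# Simulation: play out all 7 games; the team with at least 4 wins takes the series
rng = np.random.default_rng(0)
games = rng.random((200_000, 7)) < p      # True where team A wins a game
simulated = np.mean(games.sum(axis=1) >= 4)
```

Playing out all seven games in the simulation does not change who wins the series, which is why the simpler "at least 4 of 7" check is enough. The exact probability comes out to roughly 0.71.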

📙 Machine Learning Functions

ML coding is similar to LeetCode-style coding, but the main difference is that you apply machine learning concepts in code. Expect to write ML functions from scratch. In some cases, you will not be allowed to import third-party libraries like scikit-learn, as the questions are designed to assess your conceptual understanding and coding ability.

# Sample Questions
1. [Uber] Write an AUC function from scratch using vanilla Python
2. [Google] Write the K-Means algorithm using Numpy only
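
As a rough sketch of the AUC question, the version below uses the pairwise interpretation of ROC AUC: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The function name `auc_score` and the 0/1 label encoding are assumptions for illustration, and the nested loop is O(P × N), so a sort-based version would be preferable for large inputs.

```python
def auc_score(y_true, y_score):
    """ROC AUC for binary labels (0 or 1) using pairwise comparisons."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

For example, `auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` ranks three of the four positive-negative pairs correctly, giving 0.75.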

✍️ Sample Questions and Solutions

Sample Question 1 - Data Manipulation

# Sample Questions

[Meta] Retrieve users who posted more than three times but received less than 100 total likes

| post_id | user_id | post_text                     | post_date  | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1       | 101     | "Enjoying a day at the beach!"| 2023-07-25 | 217         | 30             | Photo     |
| 2       | 102     | "Just finished a great book!" | 2023-07-24 | 120         | 18             | Status    |
| 3       | 103     | "Check out this cool video!"  | 2023-07-23 | 345         | 47             | Video     |
| 4       | 101     | "That's awesome?"             | 2023-07-22 | 52          | 70             | Status    |

Solution


# Logic
# 1. Group the original DataFrame by user_id.
# 2. Calculate the sum of the likes_count column and the count of posts for each user.
# 3. Filter the grouped data for users who posted more than three times but received less than 100 total likes.

# Group by user_id and calculate sum of likes_count and count of posts
grouped_users = df.groupby('user_id').agg({'likes_count': 'sum', 'post_id': 'count'})

# Filter users who posted more than three times but received less than 100 total likes
filtered_users_optimal_approach = grouped_users[(grouped_users['post_id'] > 3) & (grouped_users['likes_count'] < 100)]
filtered_users_optimal_approach
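
The four sample rows contain no user with more than three posts, so the filter returns an empty result on them. To see the logic produce a match, here is a small self-contained sketch with made-up data (the user IDs and like counts are hypothetical). It uses named aggregation so the output columns describe what they hold.

```python
import pandas as pd

# Hypothetical data: user 201 has four low-engagement posts, user 202 has one popular post
df = pd.DataFrame({
    "post_id": [1, 2, 3, 4, 5],
    "user_id": [201, 201, 201, 201, 202],
    "likes_count": [10, 20, 5, 15, 300],
})

# Total likes and number of posts per user
summary = df.groupby("user_id").agg(
    total_likes=("likes_count", "sum"),
    post_count=("post_id", "count"),
)

# More than three posts but fewer than 100 total likes
result = summary[(summary["post_count"] > 3) & (summary["total_likes"] < 100)]
```

Here only user 201 qualifies, with four posts and 50 total likes.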

Sample Question 2 - Statistical Coding

[Google] Demonstrate the confidence interval. Use Numpy and Scipy should you need to.

Solution

# Import libraries
import numpy as np
import scipy.stats as sci

# Set the random seed
np.random.seed(111)

# Set the simulation parameters
pop_mean = 100      # Population mean
pop_std  = 10       # Population standard deviation
sample_size = 100   # Sample size
num_samples = 1000  # Number of samples in the simulation
alpha = 0.05        # Set the alpha 

# Run simulation 
mean_in_interval = 0 # Count the number of times the pop. mean is in the CI interval
for i in range(num_samples):
  # Sample observations from a normal distribution with the population parameters
  obs = np.random.normal(loc=pop_mean, scale=pop_std, size=sample_size)
  # Get the mean and standard error
  sample_mean = np.mean(obs)
  standard_error = sci.sem(obs)
  # Generate the 95% confidence interval of the mean
  lower, upper = sci.t.interval(confidence=(1-alpha), df=sample_size-1, loc=sample_mean, scale=standard_error)
  # Count the instances in which the interval contains the population mean
  if lower < pop_mean < upper:
    mean_in_interval += 1

# Generate the proportion of the times that the pop. mean is in the CI interval
proportion = mean_in_interval / num_samples
print(f'Based on a simulation of {num_samples} trials, the true population mean,\n'
      f'{pop_mean}, is found in the {1-alpha:.0%} confidence interval about {proportion:.1%} of the time.')

Sample Question 3 - Machine Learning Functions

[Google] Write the K-Means algorithm using Numpy only

Solution

import numpy as np

class KMeans:
    def __init__(self, k=2, max_iterations=500):
        self.k = k
        self.max_iterations = max_iterations

    def fit(self, X):
        # Initialize centroids randomly
        self.centroids = X[np.random.choice(range(len(X)), self.k, replace=False)]
        
        for i in range(self.max_iterations):
            # Assign each data point to the nearest centroid
            clusters = [[] for _ in range(self.k)]
            for x in X:
                distances = [np.linalg.norm(x - c) for c in self.centroids]
                cluster = np.argmin(distances)
                clusters[cluster].append(x)

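            # Recalculate centroids as the mean of each cluster
            prev_centroids = self.centroids
            new_centroids = []
            for cluster in clusters:
                if cluster:
                    new_centroids.append(np.mean(cluster, axis=0))
                else:
                    # Empty cluster: fall back to a randomly chosen previous centroid
                    new_centroids.append(prev_centroids[np.random.choice(range(self.k))])
            self.centroids = np.array(new_centroids)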

            # Check for convergence
            if np.allclose(prev_centroids, self.centroids):
                break

    def predict(self, X):
        distances = [np.linalg.norm(X - c, axis=1) for c in self.centroids]
        return np.argmin(distances, axis=0)
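
If the interviewer asks for a faster assignment step, the per-point Python loop can be replaced with NumPy broadcasting. The helper below is a hypothetical standalone sketch (the name `assign_clusters` is not part of the solution above), and it assumes `X` has shape (n_samples, n_features) and `centroids` has shape (k, n_features).

```python
import numpy as np


def assign_clusters(X, centroids):
    """Return the index of the nearest centroid for each row of X."""
    # Broadcasting yields pairwise differences of shape (n_samples, k, n_features)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)   # Euclidean distances, shape (n_samples, k)
    return np.argmin(distances, axis=1)
```

The trade-off is memory: the intermediate array grows with n_samples × k × n_features, which is fine for interview-sized inputs but worth mentioning for large datasets.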

💡 Prep Tips

Tip 1 - Front-Load Python problem sets

Those who succeed in passing coding problems are often “primed” for interviews. Given that coding is usually assessed first, in the technical screen, it is vital that you front-load coding as part of your daily/weekly prep. Go through about 2 to 3 problems per day leading up to the interview. For more resources, visit datainterview.com.

Tip 2 - Practice Explaining Verbally

Interviewing is not a written exercise; it’s a verbal exercise. Whatever coding question the interviewer asks, you will be expected to explain your solution clearly and in detail. As you practice interview questions, practice explaining them out loud.

Tip 3 - Join the Ultimate Prep

Get access to ML questions, cases and machine learning mock interview recordings when you join the interview program on datainterview.com