
Data Science Coding Interview

Dan Lee · Updated Oct 26, 2024 · 6 min read

Got a data science coding interview lined up? Chances are you are interviewing for an ML engineering and/or data scientist position. Companies that run data science coding interviews include Google, Meta, Stripe, and many startups. The coding questions are peppered throughout the technical screen and on-site rounds. We will cover the following areas of the data science coding interview so you are well-prepared for your upcoming interview šŸ‘‡

šŸ“ What is the Data Science Coding Interview?

šŸ“š Areas Covered in Data Science Coding

āœļø Sample Questions and Solutions

šŸ’” Prep Tips

šŸ“ What is the Data Science Coding Interview?

Let's start with how the interview is conducted. You will likely hop onto a virtual call with a code or text editor. The interviewer will most likely be a senior/staff MLE or data scientist who will evaluate you on coding proficiency, correctness, and readability. Your communication skills - the ability to understand the problem and explain your thinking clearly - will also be assessed.


šŸ“š Areas Covered in Data Science Coding

There are four major areas often assessed in data science coding interviews: data structures & algorithms, data manipulation, statistical coding, and machine learning functions. Which areas you are tested on tends to be role-specific.

  • MLE / Full-Stack Data Scientist - If the role requires you to deploy models to production, expect data structures & algorithms questions. Brush up on strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs.
  • Product / Generalist Data Scientist - Expect more coverage of data manipulation questions. These are "Pandas SQL" problems: using Pandas to solve SQL-like table manipulation tasks. You may also be asked statistical coding problems, which tend to come up in quant roles, Google DS interviews, and the like.
  • Data Analyst - As with product and generalist data scientist roles, expect Pandas-based, SQL-style table manipulation problems. You don't need to worry too much about the other areas.

Now, letā€™s do a deep-dive on each of the four areas.

šŸ“• Data Structures & Algorithms

These are the classic SWE questions posed in data science interviews: strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, sometimes, trees and graphs. You should have a firm grasp of runtime and space complexity and aim to write the most optimal solution. A great way to prepare is to work through LeetCode-style problem sets.

# Sample Questions
1. [Microsoft] Write a function to check whether a word is a palindrome (a minimal sketch follows this list)
2. [Adobe] Write a program to find a pair of numbers, one from each of two sorted arrays, whose sum is closest to a given integer
3. [Amazon] Find the shortest path between two coordinates.
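
As a warm-up, here is a minimal sketch of question 1 (checking whether a word is a palindrome) using a two-pointer scan; it assumes the input is a single lowercase word, so case and punctuation handling are omitted.

def is_palindrome(word: str) -> bool:
    # Two-pointer scan: O(n) time, O(1) extra space
    left, right = 0, len(word) - 1
    while left < right:
        if word[left] != word[right]:
            return False
        left += 1
        right -= 1
    return True

print(is_palindrome("racecar"))  # True
print(is_palindrome("python"))   # False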

šŸ“˜ Data Manipulation

These are SQL-like table manipulations. Familiarity with Pandas or R DataFrames is essential to tackling these questions. The common operations you should be familiar with are selection, aggregation, lags, group by, partition by, filtering, joins, sorting, and ranking.
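
To make the window-style operations concrete, here is a small sketch on a hypothetical posts table (the column names mirror the sample data used later in this article); rank and shift are the Pandas counterparts of SQL's RANK and LAG over a partition.

import pandas as pd

# Hypothetical posts table (a subset of the columns used later in this article)
df = pd.DataFrame({
    'user_id':     [101, 102, 103, 101],
    'post_date':   pd.to_datetime(['2023-07-25', '2023-07-24', '2023-07-23', '2023-07-22']),
    'likes_count': [217, 120, 345, 52],
})

# SQL: RANK() OVER (PARTITION BY user_id ORDER BY likes_count DESC)
df['like_rank'] = df.groupby('user_id')['likes_count'].rank(method='dense', ascending=False)

# SQL: LAG(likes_count) OVER (PARTITION BY user_id ORDER BY post_date)
df = df.sort_values(['user_id', 'post_date'])
df['prev_likes'] = df.groupby('user_id')['likes_count'].shift(1)
print(df)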

# Sample Questions

#| post_id | user_id | post_text                     | post_date  | likes_count | comments_count | post_type |
#|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
#| 1       | 101     | "Enjoying a day at the beach!"| 2023-07-25 | 217         | 30             | Photo     |
#| 2       | 102     | "Just finished a great book!" | 2023-07-24 | 120         | 18             | Status    |
#| 3       | 103     | "Check out this cool video!"  | 2023-07-23 | 345         | 47             | Video     |
#| 4       | 101     | "That's awesome?"             | 2023-07-22 | 52          | 70             | Status    |

# 1. Using the dataset above, find users who never posted a photo (see the sketch below)
# 2. Retrieve users who posted more than three times but received less than 100 total likes
# 3. Find the user with the highest average comments per post
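
As an illustration, here is one possible Pandas approach to question 1 (users who never posted a photo), assuming the table above is loaded into a DataFrame named df; a hypothetical construction is included for completeness.

import pandas as pd

# Hypothetical construction of the posts table above (text columns omitted)
df = pd.DataFrame({
    'post_id':   [1, 2, 3, 4],
    'user_id':   [101, 102, 103, 101],
    'post_type': ['Photo', 'Status', 'Video', 'Status'],
})

# Users with at least one Photo post
photo_users = df.loc[df['post_type'] == 'Photo', 'user_id'].unique()

# Users who never posted a photo
users_without_photo = df.loc[~df['user_id'].isin(photo_users), 'user_id'].unique()
print(users_without_photo)  # [102 103]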

šŸ“— Statistical Coding

These are Google-style questions that involve statistical simulation or writing functions that return statistical values such as the Pearson correlation coefficient. You should expect such questions across interviews generally, and particularly in quant, Google, and startup interviews. Depending on the interview, you may be allowed to load third-party libraries like Numpy and Scipy, but you will need to ask the interviewer for specifics.

# Sample Questions
1. [Google] In a World Series, suppose that the probability of team A winning a match is 0.60. What is the probability that team A wins the World Series (a best-of-7 series)? Use Numpy should you need to. (A minimal simulation sketch follows this list.)
2. [Google] Demonstrate the confidence interval. Use Numpy and Scipy should you need to.
3. [Microsoft] Write a function that computes the inverse of a matrix. Use Numpy should you need to.
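
For a flavor of question 1, here is a minimal Monte Carlo sketch, assuming "wins the World Series" means winning a best-of-7 series (first team to 4 wins) with independent matches:

import numpy as np

np.random.seed(42)

p_win = 0.60          # Probability team A wins a single match
num_series = 100_000  # Number of simulated series

a_series_wins = 0
for _ in range(num_series):
    a_wins, b_wins = 0, 0
    # Play matches until one team reaches 4 wins
    while a_wins < 4 and b_wins < 4:
        if np.random.rand() < p_win:
            a_wins += 1
        else:
            b_wins += 1
    if a_wins == 4:
        a_series_wins += 1

print(f'Estimated P(team A wins the series): {a_series_wins / num_series:.3f}')
# The exact answer under these assumptions is about 0.710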

šŸ“™ Machine Learning Functions

ML coding questions resemble LeetCode-style questions, but the main difference is that you apply machine learning through code. Expect to write ML functions from scratch. Sometimes you will not be allowed to import third-party libraries like SkLearn, as the questions are designed to assess your conceptual understanding and coding ability.

# Sample Questions
1. [Uber] Write an AUC function from scratch using vanilla Python (a minimal sketch follows this list)
2. [Google] Write the K-Means algorithm using Numpy only
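
As a sketch of question 1, one way to compute AUC in vanilla Python is through its probabilistic interpretation: the fraction of positive-negative pairs in which the positive example is scored higher, with ties counted as one half. The pairwise count below is O(P·N), so treat it as a correctness-first sketch rather than the fastest solution.

def auc_score(labels, scores):
    # Split scores by class label
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    # Count pairs where the positive example outranks the negative one
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75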

āœļøĀ Sample Problem Sets with Solutions

Now, let's practice with example problems. I will also discuss solutions.

Problem 1 - Data Manipulation

An interviewer at Meta asked:


[Meta] Retrieve users who posted more than three times but received less than 100 total likes

| post_id | user_id | post_text                     | post_date  | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1       | 101     | "Enjoying a day at the beach!"| 2023-07-25 | 217         | 30             | Photo     |
| 2       | 102     | "Just finished a great book!" | 2023-07-24 | 120         | 18             | Status    |
| 3       | 103     | "Check out this cool video!"  | 2023-07-23 | 345         | 47             | Video     |
| 4       | 101     | "That's awesome?"             | 2023-07-22 | 52          | 70             | Status    |


Solution

# Logic
# 1. Group the original DataFrame by user_id.
# 2. Calculate the sum of the likes_count column and the count of posts for each user.
# 3. Filter the grouped data for users who posted more than three times but received less than 100 total likes.

import pandas as pd

# df is the posts DataFrame shown above

# Group by user_id and calculate the sum of likes_count and the count of posts
grouped_users = df.groupby('user_id').agg({'likes_count': 'sum', 'post_id': 'count'})

# Filter users who posted more than three times but received less than 100 total likes
filtered_users = grouped_users[(grouped_users['post_id'] > 3) & (grouped_users['likes_count'] < 100)]
filtered_users
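
Note that the groupby produces a DataFrame indexed by user_id, so chaining .reset_index() brings user_id back as a regular column, and renaming post_id to something like post_count makes the result easier to read when you walk the interviewer through it.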

Problem 2 - Statistical Coding

An interviewer at Google asked:

Demonstrate the confidence interval. Use Numpy and Scipy should you need to.

Solution

# Import libraries
import numpy as np
import scipy.stats as sci

# Set the random seed
np.random.seed(111)

# Set the simulation parameters
pop_mean = 100      # Population mean
pop_std  = 10       # Population standard deviation
sample_size = 100   # Sample size
num_samples = 1000  # Number of samples in the simulation
alpha = 0.05        # Set the alpha 

# Run simulation 
mean_in_interval = 0 # Count the number of times the pop. mean falls in the confidence interval
for i in range(num_samples):
  # Sample 100 observations from a normal distribution
  obs = np.random.normal(loc=100, scale=10, size=sample_size)
  # Get the mean and standard error
  sample_mean = np.mean(obs)
  standard_error = sci.sem(obs)
  # Generate the 95% confidence interval of the mean
  lower, upper = sci.t.interval(confidence=(1-alpha), df=sample_size-1, loc=sample_mean, scale=standard_error)
  # Count the instances where the population mean falls within the interval
  if pop_mean > lower and pop_mean < upper: 
    mean_in_interval += 1

# Generate the proportion of the times that the pop. mean is in the CI interval
proportion = mean_in_interval / num_samples
print(f'Based on a simulation of {num_samples} trials, the true population mean,\n'
      f'{pop_mean}, is found in the {1-alpha:.0%} confidence interval about {proportion:.1%} of the time.')
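
With alpha set to 0.05, the printed proportion should land close to 95%, which is exactly the coverage a 95% confidence interval promises when the sampling assumptions hold.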

Problem 3 - Machine Learning Functions

An interviewer at Google asked:

[Google] Write the K-Means algorithm using Numpy only

Solution

import numpy as np

class KMeans:
    def __init__(self, k=2, max_iterations=500):
        self.k = k
        self.max_iterations = max_iterations

    def fit(self, X):
        # Initialize centroids randomly
        self.centroids = X[np.random.choice(range(len(X)), self.k, replace=False)]
        
        for i in range(self.max_iterations):
            # Assign each data point to the nearest centroid
            clusters = [[] for _ in range(self.k)]
            for x in X:
                distances = [np.linalg.norm(x - c) for c in self.centroids]
                cluster = np.argmin(distances)
                clusters[cluster].append(x)

            # Recalculate centroids
            prev_centroids = self.centroids
            self.centroids = []
            for cluster in clusters:
                if cluster:
                    self.centroids.append(np.mean(cluster, axis=0))
                else:
                    self.centroids.append(prev_centroids[np.random.choice(range(self.k))])

            # Check for convergence
            if np.allclose(prev_centroids, self.centroids):
                break

    def predict(self, X):
        distances = [np.linalg.norm(X - c, axis=1) for c in self.centroids]
        return np.argmin(distances, axis=0)
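
A quick usage sketch with synthetic data (hypothetical, just to show how the class above is called):

# Two well-separated synthetic clusters
np.random.seed(0)
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

model = KMeans(k=2)
model.fit(X)
labels = model.predict(X)
print(labels[:5], labels[-5:])  # Points from the two blobs fall into different clusters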

šŸ’” Prep Tips

Tip 1 - Front-Load Python Problem Sets

Those who succeed in coding interviews are often "primed" going in. Since coding is usually assessed first, on the technical screen, you should front-load coding as part of your daily/weekly prep. Work through about 2 to 3 problems daily leading up to the interview. For more resources, visit datainterview.com.

Tip 2 - Practice Explaining Verbally

Interviewing is not a written exercise; it's a verbal one. Whenever the interviewer asks you a coding question, you must explain your solution clearly and in detail. As you practice interview questions, practice explaining your answers out loud.

Tip 3 - Join the Ultimate Prep

Get access to ML questions, cases, and machine learning mock interview recordings when you join the interview program on datainterview.com.


Dan Lee

DataInterview Founder (Ex-Google)

Dan Lee is a former Data Scientist at Google with 8+ years of experience in data science, data engineering, and ML engineering. He has helped 100+ clients land top data, ML, and AI jobs at reputable companies and startups such as Google, Meta, Instacart, and Stripe.