Join the ML Engineer Interview MasterClass πŸš€ | Now you can follow self-paced!

100 Data Science Interview Questions

Dan Lee's profile image
Dan LeeUpdated Oct 26, 2024 β€” 23 min read
A feature image for 100 Data Scientist Interview Questions

Getting ready for the data scientist interviews? Here's a compilation of 100 data science interview questions from FAANGs, startups, and consulting firms. The core areas asked in data science interviews are:

  1. Applied Statistics
  2. Machine Learning
  3. Business/Product Case
  4. SQL
  5. Coding
  6. Leadership & Behavioral

Use this question list as a starting point for your prep!

Applied Statistics

1. Can you explain the Central Limit Theorem and its importance in statistics?

The Central Limit Theorem is fundamental in the field of statistics. It provides insight into the distributional properties of sample averages and sums.

Definition:

The CLT states that, for a sufficiently large sample size from any population with a finite standard deviation, the distribution of the sample means will approximate a normal distribution, no matter the shape of the population's distribution.

Mathematically:

Let X1,X2,…,Xn X_1, X_2, \dots, X_n be a random sample of size n n drawn from a population with mean ΞΌ \mu and standard deviation Οƒ \sigma . If n n is sufficiently large (often cited as nβ‰₯30 n \geq 30 ), the sampling distribution of the sample mean Xβ€Ύ \overline{X} will be approximately normal, with:

  • Mean: ΞΌ \mu
  • Standard Error (SE): Οƒn \frac{\sigma}{\sqrt{n}}

Thus,

Xβ€ΎβˆΌN(ΞΌ,Οƒ2n) \overline{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right)

Importance:

  1. Predictability: The CLT provides a predictable shape for the distribution of sample means, such as the well-known bell curve or normal distribution.
  2. Inference: Many statistical tests and procedures are based on the assumption of normality. The CLT allows us to use these methods with large samples even if the original data isn't normally distributed.
  3. Practicality: In real-world applications, actual distributions are often unknown. The CLT provides a way to work with the familiar normal distribution, sidestepping challenges posed by unknown distributions.
  4. Simplicity: The normal distribution has well-understood properties, making it easier to work with in statistical computations.

2. What is the difference between Type I and Type II errors in hypothesis testing?

Type I and Type II Errors in Hypothesis Testing

When performing hypothesis testing in statistics, two main types of errors can occur, known as Type I and Type II errors.

Type I Error (False Positive):

This error occurs when the null hypothesis is true, but we incorrectly reject it.

  • Formula: 𝛼=𝑃 (Reject 𝐻0∣𝐻0Β isΒ true)Β whereΒ Ξ±Β is the significance level of the test, often set at levels like 0.05 or 0.01.

Type II Error (False Negative):

This error occurs when the alternative hypothesis is true, but we fail to reject the null hypothesis.

  • Formula: 𝛽=𝑃(FailΒ toΒ Reject 𝐻0βˆ£π»π‘ŽΒ isΒ true) The power of a test, denoted asΒ 1βˆ’π›½1βˆ’Ξ², represents the probability of correctly rejectingΒ H0​ whenΒ Ha​ is true.

In Summary:

  • Type I Error: Proclaiming an effect when there isn't one.
  • Type II Error: Failing to proclaim an effect when there is one.

The balance between these errors is critical in research. Researchers must decide the acceptable levels for these errors, considering the consequences of each type of mistake.


3. Can you explain what is meant by p-value?

In the context of hypothesis testing, the p-value is a fundamental concept that helps researchers determine the strength of evidence against the null hypothesis.

Definition:

The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.

Mathematically:

Given a test statistic 𝑇TΒ and observed value 𝑑t, the p-value can be defined as:

  • For a two-tailed test:

p-value=P(Tβ‰₯∣t∣∣H0Β isΒ true)+P(Tβ‰€βˆ’βˆ£t∣∣H0Β isΒ true)\text{p-value} = P(T \geq |t| \mid H_0 \text{ is true}) + P(T \leq -|t| \mid H_0 \text{ is true})

  • For a right-tailed test:

p-value=P(Tβ‰₯t∣H0Β isΒ true)\text{p-value} = P(T \geq t \mid H_0 \text{ is true})

  • For a left-tailed test:

p-value=P(T≀t∣H0Β isΒ true)\text{p-value} = P(T \leq t \mid H_0 \text{ is true})

Interpretation:

A smaller p-value suggests stronger evidence against the null hypothesis and, thus, a greater chance of the alternative hypothesis being true. Common thresholds for the p-value are 0.05, 0.01, and 0.10, though these can vary by field and specific research. If the p-value is less than or equal to a pre-defined significance level (often denoted asΒ Ξ±), one might reject the null hypothesis in favor of the alternative.


4. How do you assess the normality of a dataset?

When working with statistical analyses, it's often essential to determine whether a dataset follows a normal (or Gaussian) distribution. There are several methods to assess normality:

1.Β Histograms and Q-Q Plots:

  • Histogram: A visual representation where data is grouped into bins, and the frequency of data points in each bin is shown using bars. A bell-shaped histogram suggests a normal distribution.
  • Q-Q Plot: A quantile-quantile plot compares the quantiles of a dataset to the quantiles of a normal distribution. If the data is normally distributed, the points should roughly lie on a straight line.

2.Β Statistical Tests:

  • Shapiro-Wilk Test:Null hypothesis (𝐻0H0​): The data is normally distributed.π‘Š=(βˆ‘π‘Žπ‘–π‘₯(𝑖))2βˆ‘(π‘₯π‘–βˆ’π‘₯Λ‰)2W=βˆ‘(xiβ€‹βˆ’xΛ‰)2(βˆ‘ai​x(i)​)2​WhereΒ π‘₯(𝑖)x(i)​ are the ordered sample values andΒ π‘Žπ‘–ai​ are constants derived from the means, variances, and covariances of the order statistics of a sample from a normal distribution. A low p-value suggests deviation from normality.
  • Kolmogorov-Smirnov Test:This test compares the cumulative distribution function of the sample data to that of a normal distribution. A significant difference suggests non-normality.

3.Β Normal Probability Plots:

A plot that pairs observed values with the corresponding percentiles of a normal distribution. If the dataset is normal, this plot should be linear.

4.Β Skewness and Kurtosis:

  • Skewness: Measures the asymmetry of the probability distribution. For a normal distribution, skewness should be about 0.Skewness=𝐸[(π‘‹βˆ’πœ‡πœŽ)3]Skewness=E[(ΟƒXβˆ’ΞΌβ€‹)3]
  • Kurtosis: Measures the "tailedness" of the distribution. A normal distribution has a kurtosis of 3.Kurtosis=𝐸[(π‘‹βˆ’πœ‡πœŽ)4]Kurtosis=E[(ΟƒXβˆ’ΞΌβ€‹)4]

5. What is the difference between correlation and causation?

Correlation:

Correlation refers to a statistical measure that describes the extent to which two variables change together. If one variable tends to go up when the other goes up, there is a positive correlation between them. Conversely, if one variable tends to go down when the other goes up, there's a negative correlation.

The most common measure of correlation is the Pearson correlation coefficient, denoted asΒ π‘Ÿr. It quantifies the strength and direction of a linear relationship.

π‘Ÿ=βˆ‘(π‘₯π‘–βˆ’π‘₯Λ‰)(π‘¦π‘–βˆ’π‘¦Λ‰)βˆ‘(π‘₯π‘–βˆ’π‘₯Λ‰)2βˆ‘(π‘¦π‘–βˆ’π‘¦Λ‰)2r=βˆ‘(xiβ€‹βˆ’xΛ‰)2βˆ‘(yiβ€‹βˆ’yˉ​)2β€‹βˆ‘(xiβ€‹βˆ’xΛ‰)(yiβ€‹βˆ’yˉ​)​

Where:

  • π‘₯𝑖xi​ and 𝑦𝑖yi​ are data points.
  • π‘₯Λ‰xΛ‰Β and 𝑦ˉyˉ​ are the means of the datasetsΒ π‘₯xΒ and 𝑦yΒ respectively.
  • π‘ŸrΒ ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.

While correlation can be quantified and easily computed, causation requires deeper investigation, often through controlled experiments. The oft-repeated maxim, "correlation does not imply causation," warns us that just because two variables move together doesn't mean one causes the other.


6. How do you handle missing data in a dataset?

Dealing with missing or corrupted data is vital for maintaining data integrity and ensuring valid outcomes from analyses. Here's how you can address these issues:

Imputation:

a.Β Mean/Median/Mode Imputation:

Replace missing values with the mean (for continuous data), median (when data has outliers), or mode (for categorical data). For a dataset 𝑋XΒ with missing values:

𝑋imputed=𝑋missingβˆͺ𝑋ˉXimputed​=Xmissing​βˆͺXΛ‰

Where 𝑋ˉXΛ‰Β is the mean of the observed values.

b.Β Linear Interpolation:

Especially useful for time series data. Given two known points,Β (π‘₯1,𝑦1)(x1​,y1​)Β andΒ (π‘₯2,𝑦2)(x2​,y2​), the value at a missing pointΒ π‘₯xΒ can be found as:

𝑦=𝑦1+(𝑦2βˆ’π‘¦1)(π‘₯2βˆ’π‘₯1)(π‘₯βˆ’π‘₯1)y=y1​+(x2β€‹βˆ’x1​)(y2β€‹βˆ’y1​)​(xβˆ’x1​)

c.Β Removal:

Sometimes, it's best to simply remove the data entries that have missing or corrupted values, especially if they are not a significant portion of the dataset.

d.Β Predictive Filling:

Use machine learning models, such as decision trees or k-nearest neighbors, to estimate and fill in missing values based on other available data.


7. What are the assumptions required for linear regression?

The assumptions of a linear regression model is summarized as the following:

  1. Linearity: The relationship between the independent and dependent variables should be linear. This can often be verified with scatter plots.
  2. Independence: Observations should be independent of each other. This is particularly crucial for time series data where this assumption might be violated.
  3. Homoscedasticity: The residuals (or errors) should have constant variance across all levels of the independent variables. This can be checked using a residuals vs. fitted values plot.
  4. Normality of Residuals: The residuals should be normally distributed. This can be tested using plots or tests like the Shapiro-Wilk:
  5. No Multicollinearity: Independent variables should not be too highly correlated with each other. Variance Inflation Factor (VIF) is commonly used:

VIF=11βˆ’π‘…2VIF=1βˆ’R21​

Where 𝑅2R2Β is the determination coefficient of the variable against others.


Additional Questions

Explain the concept of power in statistical testsHow would you explain to a non-technical team member what a confidence interval is?What is multiple regression and when do we use it?Can you describe the difference between cross-sectional and longitudinal data?What is the role of data cleaning in data analysis?Can you explain the difference between ANOVA and t-test?How do you interpret the coefficients of a logistic regression model?What is multicollinearity and how do you detect and deal with it?Can you describe the difference between a parametric and a non-parametric test?What are some of the methods you would use to detect outliers?What is survival analysis and when can it be useful?Can you explain what is meant by the term 'residual' in regression analysis?Describe a situation where you used statistical analysis to make a decision or solve a problem

Machine Learning

1. What is the difference between supervised and unsupervised learning?

Supervised Learning: In supervised learning, an algorithm learns from a labeled dataset, which means each example in the dataset is paired with the correct output. The aim is to learn a mapping from inputs (𝑋X) to outputs (𝑦y). Once the algorithm is trained on this data, it can start making predictions on new, unseen data. Common algorithms in this category include linear regression:

𝑦=𝛽0+𝛽1𝑋1+β‹―+𝛽𝑛𝑋𝑛+πœ–y=Ξ²0​+Ξ²1​X1​+β‹―+Ξ²n​Xn​+Ο΅

and classification algorithms like logistic regression.

Unsupervised Learning: This involves training an algorithm on data where the output labels are not provided. The goal is often to find patterns or structures in the data. Common techniques include clustering, where data is grouped based on similarities, and dimensionality reduction, which reduces the number of random variables under consideration and can be achieved through methods like Principal Component Analysis (PCA):

𝑍=𝑋𝑉Z=XV

Where 𝑋XΒ is the data and 𝑉VΒ represents the principal components.

While supervised learning is more directed in its approach, requiring explicit supervision with correct answers, unsupervised learning explores the data on its own to find hidden patterns or groupings.


2. Can you explain the concept of overfitting and underfitting in machine learning models?

OverfittingΒ occurs when a machine learning model learns not only the underlying pattern of the data but also its noise. As a result, it performs exceptionally well on the training data but poorly on new, unseen data. In a sense, the model is too complex. If we consider polynomial regression, an overfitted model might have a very high-degree polynomial:

𝑦=𝛽0+𝛽1𝑋+𝛽2𝑋2+β‹―+𝛽𝑛𝑋𝑛y=Ξ²0​+Ξ²1​X+Ξ²2​X2+β‹―+Ξ²n​Xn

where 𝑛nΒ is large, making the model fit even tiny fluctuations in the training data.

UnderfittingΒ is the opposite. The model is too simple to capture the underlying structure of the data. It performs poorly on both the training data and the unseen data. In the polynomial regression context, an underfitted model might be a straight line (linear) when the actual relation is quadratic or cubic.

Visualizing the model's fit can help identify these issues: Overfitting typically shows a very wiggly curve fitting all data points, whereas underfitting might show a too straight or simplistic curve not capturing evident trends.

Regularization techniques, model selection, and cross-validation are common strategies to counteract overfitting and underfitting.


3. What is cross-validation? Why is it important?

Cross-validationΒ is a technique used to assess the performance of machine learning models by splitting the dataset into two segments: one used to train the model and the other used to validate the model. By repeating this process multiple times and averaging out the performance metrics, we obtain a better estimation of how the model will perform on unseen data.

One popular form of cross-validation isΒ k-fold cross-validation. Here, the data is divided intoΒ π‘˜kΒ subsets (or "folds"). The model is trained onΒ π‘˜βˆ’1kβˆ’1Β of these folds and tested on the remaining fold. This process is repeatedΒ π‘˜kΒ times, with each fold serving as the test set exactly once. The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop:

𝐢𝑉(π‘˜)=1π‘˜βˆ‘π‘–=1π‘˜π‘€π‘†πΈπ‘–CV(k)​=k1​i=1βˆ‘k​MSEi​

Where 𝑀𝑆𝐸𝑖MSEi​ is the Mean Squared Error of theΒ π‘–π‘‘β„ŽithΒ iteration.

Cross-validation is crucial as it provides a more generalized performance metric for the model. By ensuring that every data point has been part of both training and testing, it reduces the chances of overfitting and provides a more robust measure of a model's predictive power.


4. Describe how a decision tree works. When would you use it over other algorithms?

AΒ decision treeΒ is a flowchart-like structure wherein each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is called the root. The main idea is to split the data into subsets based on the value of input features, aiming to have subsets that are as pure (homogeneous) as possible concerning the target variable.

The splits are chosen based on metrics like:

  • Gini impurity:

𝐺𝑖𝑛𝑖(𝑝)=1βˆ’βˆ‘π‘–=1𝑛𝑝𝑖2Gini(p)=1βˆ’i=1βˆ‘n​pi2​

  • Entropy:

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦(𝑝)=βˆ’βˆ‘π‘–=1𝑛𝑝𝑖log⁑2(𝑝𝑖)Entropy(p)=βˆ’i=1βˆ‘n​pi​log2​(pi​)

Decision trees are favored because they're interpretable, handle both numerical and categorical data, and require little data preprocessing. They're especially useful when domain knowledge suggests the existence of clear decision rules or when interpretability is paramount. However, they can be prone to overfitting, especially when deep. In such cases, algorithms like Random Forest or Gradient Boosting Machines, which ensemble multiple trees, can be preferred for better generalization.


5. What is the bias-variance tradeoff?

TheΒ bias-variance tradeoffΒ is a fundamental concept in machine learning that describes the balance between two sources of errors:

  1. Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause the model to miss relevant relations between input and output variables (underfitting).
  2. Variance: Error due to excessive complexity in the learning algorithm. High variance can cause the model to model the random noise in the training data (overfitting).

Mathematically, the expected test mean squared error (MSE) of a model can be decomposed as:

𝑀𝑆𝐸=π΅π‘–π‘Žπ‘ 2+π‘‰π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’+πΌπ‘Ÿπ‘Ÿπ‘’π‘‘π‘’π‘π‘–π‘π‘™π‘’Β πΈπ‘Ÿπ‘Ÿπ‘œπ‘ŸMSE=Bias2+Variance+IrreducibleΒ Error

  • π΅π‘–π‘Žπ‘ 2Bias2Β represents the squared difference between the expected predictions of our model and the correct values.
  • Variance measures the variability of a model prediction for a given data point.
  • Irreducible Error is the noise inherent to any problem.

In essence, as we increase model complexity (like adding more parameters or features), bias decreases but variance increases. Conversely, reducing model complexity increases bias and decreases variance. The tradeoff is achieving the right balance to minimize the total error, which often means finding a middle ground between a model that's too simple (high bias) and one that's too complex (high variance).


Additional Questions

What is the difference between stochastic gradient descent (SGD) and batch gradient descent?What is the difference between bagging and boosting?How would you validate a model you created to generate a predictive analysis?What are some of the advantages and disadvantages of a neural network?How does the k-means algorithm work?Can you explain the difference between L1 and L2 regularization methods?What is principal component analysis (PCA) and when is it used?Can you describe what an activation function is and why it is used in an artificial neural network?How would you handle an imbalanced dataset?Can you explain the concept of "feature selection" in machine learning?

Product Case

1. Can you explain the concept of A/B testing?

A/B testing, also known as split testing, is a method of comparing two versions of a webpage, app, or other product (A and B) to determine which one performs better. The goal is to improve upon key performance indicators (KPIs) such as click-through rates, user engagement, or conversion rates.

The procedure starts by randomly splitting the users into two groups. Group 1 interacts with version A, while Group 2 interacts with version B. We then measure the effect on user behavior.

The key principle behind A/B testing is the hypothesis testing in statistics. We start with a null hypothesis, 𝐻0H0​, typically stating there is no difference in behavior between the groups. The alternative hypothesis, 𝐻1H1​, states there is a difference.

After running the experiment and collecting the data, we perform a statistical test (like the two-sample t-test) to decide if the observed differences are significant enough to reject 𝐻0H0​.

Ultimately, A/B testing provides an empirical basis to make decisions enhancing user experience and product effectiveness.

2. How would you measure the success of a new feature launch?

Measuring the success of a new feature launch involves evaluating both quantitative and qualitative metrics to ascertain its impact and reception.

  1. Quantitative Metrics:
    • Engagement Metrics: Look at the number of users interacting with the feature, and the frequency of use. IfΒ π‘π‘Žπ‘“π‘‘π‘’π‘ŸNafter​ represents the engagement after the feature launch andΒ π‘π‘π‘’π‘“π‘œπ‘Ÿπ‘’Nbefore​ represents engagement before, the relative increase is given byΒ π‘π‘Žπ‘“π‘‘π‘’π‘Ÿβˆ’π‘π‘π‘’π‘“π‘œπ‘Ÿπ‘’π‘π‘π‘’π‘“π‘œπ‘Ÿπ‘’Nbefore​Nafterβ€‹βˆ’Nbefore​​.
    • Retention Rate: Examine if the feature positively impacts user retention. If 𝑅𝑛𝑒𝑀Rnew​ is the retention rate with the new feature andΒ π‘…π‘œπ‘™π‘‘Rold​ without it, a success is when 𝑅𝑛𝑒𝑀>π‘…π‘œπ‘™π‘‘Rnew​>Rold​.
    • Conversion Rate: For features aiming to drive user actions, check if the conversion rate, represented asΒ numberΒ ofΒ conversionstotalΒ visitorstotalΒ visitorsnumberΒ ofΒ conversions​, has increased.
  2. Qualitative Metrics:
    • User Feedback: Collect feedback through surveys, feedback forms, or in-app prompts. Positive sentiments indicate a well-received feature.
    • Support Tickets: Monitor any increase in support tickets related to the new feature, indicating possible confusion or issues.

3. How would you determine the optimal sample size for an A/B test?

The optimal sample size for an A/B test is influenced by several parameters: the minimum detectable effect (the smallest difference that's meaningful to detect), the statistical power (typically set at 0.8 or 80%), and the significance level (often chosen as 0.05 or 5%).

The formula for computing the sample size for each group in a two-sample comparison (assuming equal variance) is:

𝑛=((𝑍𝛼/2+𝑍𝛽)2β‹…(𝜎𝐴2+𝜎𝐡2)𝛿2)n=(Ξ΄2(ZΞ±/2​+Zβ​)2β‹…(ΟƒA2​+ΟƒB2​)​)

Where:

  • 𝑛nΒ is the sample size per group.
  • 𝑍𝛼/2ZΞ±/2​ is the critical value for a two-tailed test (for 𝛼=0.05Ξ±=0.05, it's approximately 1.96).
  • 𝑍𝛽Zβ​ is the critical value associated with power (for 𝛽=0.2Ξ²=0.2, it's about 0.84).
  • 𝜎𝐴2ΟƒA2​ and 𝜎𝐡2ΟƒB2​ are the variances for groups A and B.
  • 𝛿δ is the minimum detectable effect.

Estimating variances can be tricky, so pilot tests or past data can be beneficial. Tools and calculators are available online to simplify this computation.

4. How would you interpret the results of an A/B test that showed no significant difference between the control and treatment groups?

When an A/B test yields no significant difference between the control (A) and treatment (B) groups, it means that, statistically, we cannot assert that the change in the treatment group had a detectable impact. In other words, the observed differences (if any) might have occurred by chance.

Here's how to interpret:

  1. Inherent Variability: There's always some variability in experiments. Given the sample size and the magnitude of the effect, the test might not have had enough power to detect a difference. Ensure your sample size was correctly calculated using formulas like 𝑛=((𝑍𝛼/2+𝑍𝛽)2β‹…(𝜎𝐴2+𝜎𝐡2)𝛿2)n=(Ξ΄2(ZΞ±/2​+Zβ​)2β‹…(ΟƒA2​+ΟƒB2​)​).
  2. Practical vs. Statistical Significance: Even if the result isn't statistically significant, consider if there's any practical significance. For instance, a small increase in conversion rate might not be statistically significant but can be materially impactful for a large business.
  3. External Factors: Ensure no external events (like a major holiday or website outage) influenced the results.
  4. Feature Evaluation: Maybe the feature or change being tested genuinely doesn't influence the behavior you're measuring. It might be worth re-evaluating the feature's objectives and designs.

5. Can you explain the concept of statistical power in the context of A/B testing?

Statistical power, often denoted asΒ 1βˆ’π›½1βˆ’Ξ², is the probability that a test correctly rejects the null hypothesis 𝐻0H0​ when the alternative hypothesis 𝐻1H1​ is true. In A/B testing terms, it's the likelihood of detecting a difference (if one genuinely exists) between the control and treatment groups.

Key points:

  1. Relationship with Type II Error: The power is intrinsically linked to Type II error (𝛽β). If 𝛽β is the risk of not detecting an effect that is there (false negative), thenΒ 1βˆ’π›½1βˆ’Ξ²Β (power) is the likelihood of detecting it.
  2. Desired Power: In many fields, a power of 0.8 (or 80%) is considered adequate, meaning there's an 80% chance of detecting a true effect.
  3. Factors Influencing Power: Several factors affect the power of a test:
    • Effect Size: Larger effect sizes are easier to detect, increasing power.
    • Sample Size: Larger sample sizes increase the power. This is why calculating the required sample size beforehand, often using formulas like 𝑛=((𝑍𝛼/2+𝑍𝛽)2β‹…(𝜎𝐴2+𝜎𝐡2)𝛿2)n=(Ξ΄2(ZΞ±/2​+Zβ​)2β‹…(ΟƒA2​+ΟƒB2​)​), is crucial.
    • Significance Level (𝛼α): A lower significance level (e.g., 0.01 vs. 0.05) reduces power, as it sets a stricter criterion for significance.

Additional Questions

What is a p-value? How would you explain it to a non-technical stakeholder?Describe a situation where A/B testing would not be appropriate.What metrics would you look at to understand user engagement on Instagram??How do you deal with seasonality in A/B testing?How would you design an AB test for the News Feed algorithm on Facebook?

SQL

Airbnb SQL Questions

| property_id | owner_id | property_type | property_quality | no_bedrooms | no_bathrooms | total_sqft  | nightly_price_range | parking_spots | location_town      | location_country | first_listed_on |
|-------------|----------|---------------|------------------|-------------|--------------|------------|---------------------|---------------|--------------------|------------------|-----------------|
| 1           | 9        | Luxury        | Acceptable       | 2           | 0.5         | 2001-2500  | 151-200             | 2             | Troutdale          | US               | 2021-12-06      |
| 2           | 18       | Mid-Budget    | Satisfactory     | 8           | 4           | 3000+      | 0-100               | 1             | Watsonville        | US               | 2020-12-28      |
| 3           | 3        | Luxury        | Acceptable       | 1           | 4           | 3000+      | 101-150             | 2             | Sun City Center    | US               | 2020-06-15      |
| 4           | 3        | Mid-Budget    | Acceptable       | 0           | 4           | 3000+      | 251-300             | 2             | Sunnyside          | US               | 2021-12-13      |
| 5           | 29       | Mid-Budget    | Satisfactory     | 2           | 2.5         | 2001-2500  | 151-200             | 1             | Granger            | US               | 2021-06-21      |

1. Retrieve all properties that are located in the town of "Troutdale".

SELECT * FROM properties
WHERE location_town = 'Troutdale';

2. Find the average number of bedrooms for all "Luxury" properties.

SELECT AVG(no_bedrooms) AS average_bedrooms
FROM properties
WHERE property_type = 'Luxury';

3. Identify owner_ids of those owners who have more than one property listed.

SELECT owner_id, COUNT(property_id) AS number_of_properties
FROM properties
GROUP BY owner_id
HAVING COUNT(property_id) > 1;

4. Find the properties in the "Mid-Budget" category that have the highest number of bathrooms.

SELECT property_id, no_bathrooms
FROM properties
WHERE property_type = 'Mid-Budget'
ORDER BY no_bathrooms DESC
LIMIT 1;

5. Determine how many properties were first listed in each month of 2021.

SELECT MONTH(first_listed_on) AS month, COUNT(property_id) AS number_of_properties
FROM properties
WHERE YEAR(first_listed_on) = 2021
GROUP BY MONTH(first_listed_on)
ORDER BY month ASC;

6. Identify Owners with a Large Variety of Property Types:

Use a CTE to find owners who have more than one type of property and then order the results by those with the most variety of property types.

WITH OwnerVariety AS (
    SELECT owner_id, COUNT(DISTINCT property_type) as type_count
    FROM properties
    GROUP BY owner_id
    HAVING COUNT(DISTINCT property_type) > 1
)

SELECT owner_id, type_count
FROM OwnerVariety
ORDER BY type_count DESC;

7. Find the Average Size of Luxury and Mid-Budget Properties and Compare Them:

Use a CTE to find the average size of both "Luxury" and "Mid-Budget" properties. Afterwards, identify whether Luxury properties are on average larger than Mid-Budget properties.

WITH AverageSizes AS (
    SELECT property_type,
           CASE 
               WHEN total_sqft = '2001-2500' THEN 2250.5
               WHEN total_sqft = '3000+' THEN 3250 -- Assuming an average value for demonstration
               ELSE 0
           END as avg_size
    FROM properties
    WHERE property_type IN ('Luxury', 'Mid-Budget')
)

SELECT property_type, AVG(avg_size) as average_size
FROM AverageSizes
GROUP BY property_type
HAVING AVG(avg_size) > (
    SELECT AVG(avg_size) 
    FROM AverageSizes
    WHERE property_type = 'Mid-Budget'
);

8. Determine the Property with the Best Value in Each Town:

Use a CTE to calculate a value score for each property based on the number of bedrooms, bathrooms, and price range. Then, for each town, identify the property with the highest value score.

WITH ValueScores AS (
    SELECT property_id, location_town,
           (no_bedrooms + no_bathrooms) / 
           CASE 
               WHEN nightly_price_range = '0-100' THEN 50
               WHEN nightly_price_range = '101-150' THEN 125.5
               WHEN nightly_price_range = '151-200' THEN 175.5
               WHEN nightly_price_range = '251-300' THEN 275.5
               ELSE 1 
           END as value_score
    FROM properties
)

SELECT location_town, MAX(value_score) as top_value_score
FROM ValueScores
GROUP BY location_town;


Facebook SQL Questions

| post_id | user_id | post_text | post_date | likes_count | comments_count | post_type | |---------|---------|-------------------------------|------------|-------------|----------------|-----------| | 1 | 101 | "Enjoying a day at the beach!"| 2023-07-25 | 217 | 30 | Photo | | 2 | 102 | "Just finished a great book!" | 2023-07-24 | 120 | 18 | Status | | 3 | 103 | "Check out this cool video!" | 2023-07-23 | 345 | 47 | Video | | 4 | 101 | "That's awesome?" | 2023-07-22 | 52 | 70 | Status |

1. Find Users Who Have Never Posted a Photo:

SELECT user_id
FROM UserPosts
GROUP BY user_id
HAVING SUM(CASE WHEN post_type = 'Photo' THEN 1 ELSE 0 END) = 0;

2. Count the Number of Posts Made by Each User in July 2023:

SELECT user_id, COUNT(post_id) as number_of_posts
FROM UserPosts
WHERE post_date BETWEEN '2023-07-01' AND '2023-07-31'
GROUP BY user_id;

3. Find the User with the Highest Average Comments per Post:

SELECT user_id, AVG(comments_count) as avg_comments
FROM UserPosts
GROUP BY user_id
ORDER BY avg_comments DESC
LIMIT 1;

4. Identify the Percentage of Posts that are Videos:

SELECT (COUNT(post_id) * 100.0 / (SELECT COUNT(*) FROM UserPosts)) as video_percentage
FROM UserPosts
WHERE post_type = 'Video';

5. Retrieve Users Who Posted More than Three Times but Received Less than 100 Total Likes:

WITH UserTotals AS (
    SELECT user_id, 
           COUNT(post_id) as total_posts, 
           SUM(likes_count) as total_likes
    FROM UserPosts
    GROUP BY user_id
)

SELECT user_id, total_posts, total_likes
FROM UserTotals
WHERE total_posts > 3 AND total_likes < 100;

6. Identify Users Who Had the Biggest Increase in Likes from Their First to Their Last Post:

WITH FirstPost AS (
    SELECT user_id, MIN(post_date) as first_post_date
    FROM UserPosts
    GROUP BY user_id
),

LastPost AS (
    SELECT user_id, MAX(post_date) as last_post_date
    FROM UserPosts
    GROUP BY user_id
),

FirstPostLikes AS (
    SELECT a.user_id, b.likes_count as first_post_likes
    FROM FirstPost a
    JOIN UserPosts b ON a.user_id = b.user_id AND a.first_post_date = b.post_date
),

LastPostLikes AS (
    SELECT a.user_id, b.likes_count as last_post_likes
    FROM LastPost a
    JOIN UserPosts b ON a.user_id = b.user_id AND a.last_post_date = b.post_date
)

SELECT f.user_id, 
       f.first_post_likes, 
       l.last_post_likes, 
       (l.last_post_likes - f.first_post_likes) as likes_difference
FROM FirstPostLikes f
JOIN LastPostLikes l ON f.user_id = l.user_id
ORDER BY likes_difference DESC
LIMIT 1;

7. Find Users Who Had the Most Consistent Engagement (least variance in likes):

WITH UserAverageLikes AS (
    SELECT user_id, AVG(likes_count) as average_likes
    FROM UserPosts
    GROUP BY user_id
),

UserLikesVariance AS (
    SELECT a.user_id, 
           SUM(POWER(b.likes_count - a.average_likes, 2)) / COUNT(b.post_id) as variance
    FROM UserAverageLikes a
    JOIN UserPosts b ON a.user_id = b.user_id
    GROUP BY a.user_id, a.average_likes
)

SELECT user_id, MIN(variance) as least_variance
FROM UserLikesVariance
GROUP BY user_id
ORDER BY least_variance ASC
LIMIT 1;


Additional Questions

Pinterest SQL Questions

| pin_id | user_id | board_id | image_url                    | pin_description                   | pin_date   | repins_count | pin_category |
|--------|---------|----------|------------------------------|----------------------------------|------------|--------------|--------------|
| 1      | 12      | 101      | "http://image101.com"          | "Beautiful sunset at the beach"  | 2023-01-05 | 230          | Travel      |
| 2      | 13      | 101      | "http://image223.com"          | "DIY candle making tutorial"     | 2023-02-10 | 15           | DIY         |
| 3      | 14      | 102      | "http://image443.com"          | "Homemade chocolate chip cookies"| 2023-03-07 | 75           | Food        |
| 4      | 12      | 103      | "http://image554.com"          | "Stunning mountain views"        | 2023-03-20 | 105          | Travel      |
| 5      | 15      | 104      | "http://image555.com"          | "Fashion trends for summer"      | 2023-04-08 | 320          | Fashion     |

How many pins have been added in each category?Identify the top 5 users with the highest number of pins.How many pins have been repinned more than 50 times?What's the average number of repins for pins in the 'Fashion' category?Find users who have pinned images but have never repinned someone else's image.For each board, how many pins does it have, and what's the total repin count?Identify the month in the past year with the highest number of pins added.Which categories have an average repin count above 100?Find the top 3 most popular pins (based on repins) in the 'Travel' category from the last 6 months.Which user has the widest variety of pin categories?

Coding

1. Given two sorted arrays, find the median of the combined arrays.

```python
def findMedianSortedArrays(nums1, nums2):
    merged = sorted(nums1 + nums2)
    n = len(merged)
    if n % 2 == 1:
        return merged[n // 2]
    else:
        return (merged[n // 2 - 1] + merged[n // 2]) / 2.0
```

2. Implement a moving average from a stream.

```python
class MovingAverage:

    def __init__(self, size: int):
        self.size = size
        self.queue = []
        self.sum = 0

    def next(self, val: int) -> float:
        if len(self.queue) == self.size:
            self.sum -= self.queue.pop(0)
        self.queue.append(val)
        self.sum += val
        return self.sum / len(self.queue)
```

3. Given a dataset with timestamps and values, find sections where the values have been steadily increasing for a given period.

```python
def steady_increase(data, period):
    increasing_sections = []
    start = None
    for i in range(1, len(data)):
        if data[i]['value'] > data[i-1]['value']:
            if start is None:
                start = data[i-1]['timestamp']
        else:
            if start and data[i-1]['timestamp'] - start >= period:
                increasing_sections.append((start, data[i-1]['timestamp']))
            start = None
    return increasing_sections
```

4. Implement an algorithm for stratified sampling

```python
import random

def stratified_sample(data, strat_col, sample_size):
    unique_strats = set(data[strat_col])
    strata_samples = []
    for strat in unique_strats:
        strat_data = [row for row in data if row[strat_col] == strat]
        strat_sample = random.sample(strat_data, min(sample_size, len(strat_data)))
        strata_samples.extend(strat_sample)
    return strata_samples
```

5. Given a time series of stock prices, implement a function to determine the best time to buy and sell to maximize profit.

```python
def max_profit(prices):
    if not prices:
        return 0

    min_price = prices[0]
    max_profit = 0
    
    for price in prices:
        if price < min_price:
            min_price = price
        else:
            profit = price - min_price
            max_profit = max(max_profit, profit)
    
    return max_profit
```

6. Implement a function to calculate the Root Mean Squared Error (RMSE) of a model's predictions.

```python
def rmse(predictions, targets):
    differences = [(a - b)**2 for a, b in zip(predictions, targets)]
    return (sum(differences) / len(predictions)) ** 0.5
```

7. You have a dataset that contains user activities. Implement a function to find users who have logged in for 5 consecutive days.

```python
def five_consecutive_logins(data):
    data.sort(key=lambda x: x['date'])
    consecutive_users = set()
    for i in range(len(data) - 4):
        if data[i]['user_id'] == data[i + 4]['user_id'] and (data[i + 4]['date'] - data[i]['date']).days == 4:
            consecutive_users.add(data[i]['user_id'])
    return consecutive_users
```

8. Given a list of strings, implement a function to cluster them based on string similarity (e.g., using Jaccard similarity or Levenshtein distance).

```python
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                new_distances.append(distances[index1])
            else:
                new_distances.append(1 + min((distances[index1], distances[index1 + 1], new_distances[-1])))
        distances = new_distances

    return distances[-1]

def cluster_strings(strings, threshold):
    clusters = []
    for string in strings:
        for cluster in clusters:
            if levenshtein(string, cluster[0]) <= threshold:
                cluster.append(string)
                break
        else:
            clusters.append([string])
    return clusters
```

9. Implement a function to generate n bootstrap samples from a given dataset.

```python
import random

def bootstrap_samples(data, n):
    samples = []
    for _ in range(n):
        sample = [random.choice(data) for _ in data]
        samples.append(sample)
    return samples
```

10. Given an array of user reviews, implement a function to determine the top k frequently mentioned words, ignoring common stop words.

from collections import Counter
import re

def top_k_words(reviews, k, stop_words):
    """
    Function to find the top k frequently mentioned words from an array of user reviews, ignoring common stop words.
    
    :param reviews: List of user reviews.
    :param k: Number of top words to retrieve.
    :param stop_words: Set of words to ignore.
    :return: List of top k words.
    """
    # Convert all reviews to lowercase and tokenize by words
    words = [word for review in reviews for word in re.findall(r'\w+', review.lower())]
    
    # Filter out stop words
    words = [word for word in words if word not in stop_words]
    
    # Count word frequencies and retrieve top k words
    count = Counter(words)
    return [item[0] for item in count.most_common(k)]

# Example
reviews = ["I love this product!", "This is the best thing ever.", "I won't buy this again.", "Ever tried this?"]
stop_words = set(["this", "is", "the", "i", "won't", "ever", "buy"])
print(top_k_words(reviews, 2, stop_words))  # ['love', 'product']



Additional Questions

Implement the K-Nearest Neighbors (KNN) algorithm to classify new data points in a multi-dimensional space.You have a dataset with timestamps of user logins. Implement a function to find peak login times, i.e., times when the maximum number of users are logged in simultaneously.Given a set of coordinates (latitude, longitude), cluster them intoΒ nΒ groups where each group has nearby coordinates (use any clustering algorithm of your choice, such as DBSCAN or hierarchical clustering).Implement the gradient descent algorithm for a simple linear regression problem.Given a set of features and target values, implement a basic decision tree algorithm to predict the target based on feature values.

Leadership & Behavioral Questions

Behavioral and leadership interview questions aim to assess a candidate's interpersonal skills, decision-making capabilities, and leadership potential.

  1. Collaboration & Teamwork
    • "Can you describe a situation where you had to work with someone whose background or perspective was very different from yours? How did you handle it?"
    • "Describe a time when you received constructive criticism on your work. How did you respond, and what changes did you make based on the feedback?"
    • "Can you give an example of a time when there was a disagreement in your team about the interpretation of data or the approach to a problem? How did you handle it?"
    • "Tell me about a time when you had to collaborate with a non-technical team. How did you ensure effective communication and mutual understanding?"
  2. Problem-Solving & Continuous Learning
    • "Describe a time when you faced an unexpected challenge in a data project. How did you handle it, and what did you learn from it?"
    • "Can you tell me about a time when you had to quickly learn a new tool or technique for a project? How did you approach it, and what was the outcome?"
    • "With the rapidly evolving field of data science, how do you keep yourself updated with the latest tools and techniques?"
    • "Describe an instance where your initial analysis didn't produce the expected results. How did you adapt and what alternative methods did you employ?"
  3. Influence & Communication
    • "Have you ever had to persuade a stakeholder to accept an analysis or approach that they initially disagreed with? How did you go about it?"
    • "How have you handled situations where your data contradicted the existing beliefs or opinions of senior leadership? Walk me through your communication strategy."
    • "How have you handled situations where your data contradicted the existing beliefs or opinions of senior leadership? Walk me through your communication strategy."
  4. Decision-Making
    • "Have you ever encountered a situation where the data suggested one course of action, but ethically, another course was the right thing to do? How did you navigate this?"
  5. Leadership & Initiative
    • "Tell me about a time when you took the lead on a project without being asked. What prompted you to take the initiative, and what was the outcome?"
    • "Describe a time when you proposed a new approach or strategy based on your data analysis that had a significant impact on the direction of a project or the business."
    • "Have you ever had to abandon a particular approach or method in the middle of a project due to unforeseen challenges? How did you pivot and ensure the project remained on track?"
Dan Lee's profile image

Dan Lee

DataInterview Founder (Ex-Google)

Dan Lee is a former Data Scientist at Google with 8+ years of experience in data science, data engineering, and ML engineering. He has helped 100+ clients land top data, ML, AI jobs at reputable companies and startups such as Google, Meta, Instacart, Stripe and such.