
Data Preparation


Label Generation

A core aspect of devising a recommender system is deciding where to get the labels for your model. Most novice ML practitioners presume that labels are already available when they are assigned the project. In most cases, however, you have to determine what you are predicting and how you will obtain the labels in the first place.

In the case of a recommender system, you have two options – Explicit Feedback and Implicit Feedback. Explicit feedback consists of labels collected from human raters in the form of surveys, reviews, or label-crowdsourcing platforms such as Amazon Mechanical Turk.

However, explicit feedback has major drawbacks:

  1. Surveys suffer from response bias – users who respond may rate differently from those who do not.
  2. Reviews are highly sparse; typically fewer than ~1% of users leave a review.
  3. Crowdsourced labels are expensive to collect.

To mitigate this, you can use implicit feedback, which is collected from user activity data – think clicks, add-to-cart events, and purchases. Such labels are far more plentiful than explicit feedback, and they are a cheaper alternative to explicitly collecting labels.

In the Amazon recommender system case, we can use purchases as the target label to predict.
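A minimal sketch of deriving implicit labels from an activity log, assuming a simplified schema (the column names and toy values below are illustrative, not from a real dataset): purchases become positive labels, and impressions with no purchase become weak negatives.

```python
import pandas as pd

# Toy user-activity log; columns mirror the User Activity Data schema
# described below (values are made up for illustration).
events = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u2", "u3"],
    "product_id": ["p1", "p2", "p1", "p3", "p2"],
    "impression": [1, 1, 1, 1, 1],
    "purchase":   [1, 0, 0, 1, 0],
})

# Implicit feedback: any purchase of a product by a user is a positive
# label (1); an impression that never led to a purchase is a negative (0).
labels = (
    events.groupby(["user_id", "product_id"], as_index=False)["purchase"]
    .max()
    .rename(columns={"purchase": "label"})
)
print(labels)
```

Note that impression-only negatives are weak: the user may simply not have seen the item, which is one reason implicit labels are noisier than explicit ratings.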

Data Sources

In the case of Amazon, you have the following data sources:

  • Product Data
    • ID (UUID)
    • Name (Text)
    • Description (Text)
    • Category (Text)
    • Merchant Name (Text)
    • Image (Image)
    • Price (Numeric)
    • Ratings (Numeric)
  • User Data
    • ID (UUID)
    • Email Address (Text)
    • Age (Numeric)
    • Gender (Binary)
    • Location (Text)
  • User Activity Data
    • Event ID (UUID)
    • Event Time (Timestamp)
    • User ID (UUID)
    • Product ID (UUID)
    • Search Query (Text)
    • Impression (Binary)
    • Purchase (Binary)

Feature Engineering & Selection

With the signals available, you can apply the following feature engineering. The main idea is to translate non-numerical data (e.g. IDs, text, and images) into dense vector forms that are model-friendly.

For instance, text fields such as the product description and the search query can be represented using word embeddings. The IDs themselves can also be represented as embeddings, which can be learned using Matrix Factorization or Two-Tower models (discussed in depth in the following section on model training).
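A minimal sketch of one common baseline for turning text into a single dense vector – averaging per-word vectors. The vocabulary and the 4-dimensional vectors below are made up for illustration; a real system would load pretrained Word2Vec or BERT representations instead.

```python
import numpy as np

# Toy word-vector table standing in for a pretrained embedding model
# (the words and 4-d vectors here are assumptions for illustration).
word_vectors = {
    "wireless": np.array([0.2, 0.1, 0.5, 0.3]),
    "mouse":    np.array([0.4, 0.3, 0.1, 0.2]),
    "keyboard": np.array([0.3, 0.4, 0.2, 0.1]),
}

def embed_text(text: str) -> np.ndarray:
    """Average the vectors of known words into one dense text embedding."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(4)  # out-of-vocabulary fallback
    return np.mean(vecs, axis=0)

query_vec = embed_text("wireless mouse")
print(query_vec)  # elementwise mean of the "wireless" and "mouse" vectors
```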

Product Data

  • ID (UUID) → Embedding
  • Name (Text)
  • Description (Text) → Word Embedding (e.g. Word2Vec, BERT)
  • Category (Text) → Word Embedding (e.g. Word2Vec, BERT)
  • Merchant Name (Text) → Word Embedding (e.g. Word2Vec, BERT)
  • Image (Image) → Image Embedding (e.g. a CNN pretrained on ImageNet)
  • Price (Numeric) → Mean Imputation
  • Ratings (Numeric) → “-99” Imputation
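The two numeric treatments above can be sketched with scikit-learn’s SimpleImputer (the toy price and rating values are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy price and rating columns with missing values (assumed data).
prices  = np.array([[10.0], [np.nan], [30.0]])
ratings = np.array([[4.5], [np.nan], [3.5]])

# Price: mean imputation -- fill missing entries with the column mean.
price_imputer = SimpleImputer(strategy="mean")
prices_filled = price_imputer.fit_transform(prices)

# Ratings: constant "-99" imputation -- a sentinel value that lets
# tree-based models learn that the rating was missing.
rating_imputer = SimpleImputer(strategy="constant", fill_value=-99)
ratings_filled = rating_imputer.fit_transform(ratings)

print(prices_filled.ravel())   # [10. 20. 30.]
print(ratings_filled.ravel())  # [  4.5 -99.    3.5]
```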

User Data

  • ID (UUID) → Embedding
  • Email Address (Text)
  • Age (Numeric) → Mean Imputation
  • Gender (Binary) → One-Hot Encoding
  • Location (Text) → Word Embedding (e.g. Word2Vec, BERT)
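One-hot encoding the gender field can be sketched with pandas (toy values below; a production pipeline might instead use scikit-learn’s OneHotEncoder so the mapping is learned on training data only):

```python
import pandas as pd

# Toy user table (assumed values).
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "gender":  ["F", "M", "F"],
})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(users, columns=["gender"])
print(encoded.columns.tolist())  # ['user_id', 'gender_F', 'gender_M']
```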

User Activity Data

  • Event ID (UUID)
  • Event Time (Timestamp) → Sine/Cosine Transformation
  • User ID (UUID)
  • Product ID (UUID)
  • Search Query (Text) → Word Embedding (e.g. Word2Vec, BERT)
  • Impression (Binary) → One-Hot Encoding
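The sine/cosine transformation maps a cyclical feature such as hour-of-day onto a circle, so that 23:00 and 00:00 end up close together rather than 23 units apart. A minimal sketch with assumed timestamps:

```python
import numpy as np
import pandas as pd

# Toy event timestamps (assumed values).
ts = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 12:00"])
hours = ts.hour.to_numpy()

# Map hour-of-day onto the unit circle; the (sin, cos) pair preserves
# the wrap-around at midnight that a raw 0-23 integer would break.
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

print(np.round(hour_sin, 3))  # [0. 1. 0.]
print(np.round(hour_cos, 3))  # [ 1.  0. -1.]
```

The same transformation applies to day-of-week or month-of-year by swapping the period (24) for 7 or 12.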