Label Generation
A core aspect of designing a recommender system is deciding where the labels for your model come from. Many novice ML practitioners assume that labels already exist when they are assigned a project. In most cases, however, you have to decide what you are predicting and how you will obtain the labels in the first place.
For a recommender system, you have two options: Explicit Feedback and Implicit Feedback. Explicit Feedback consists of labels collected directly from human raters through surveys, reviews, or crowdsourced labeling platforms such as Amazon Mechanical Turk.
However, these feedback systems have major problems:
To mitigate these problems, you can use implicit feedback, which is collected from user activity data such as clicks, add-to-cart events, and purchases. Such labels have a much richer history than explicit feedback, and they are cheaper to obtain than explicitly collected labels.
In the Amazon recommender system case, we can use purchases as the target label to predict.
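As a rough illustration of turning activity data into purchase labels, here is a minimal sketch. The event schema (`user_id`, `product_id`, `event_type`) and the event names are assumptions made for this example, not an actual Amazon log format. Each user-product pair that appears in the log gets a label of 1 if it ever led to a purchase and 0 otherwise.

```python
from collections import defaultdict

# Hypothetical event name used as the positive signal (an assumption).
POSITIVE_EVENT = "purchase"


def build_purchase_labels(events):
    """Map each (user_id, product_id) pair seen in the log to a 0/1 label.

    A pair is labeled 1 if the user purchased the product at least once,
    and 0 if the user only interacted with it (e.g. click, add to cart).
    """
    labels = defaultdict(int)
    for user_id, product_id, event_type in events:
        key = (user_id, product_id)
        # max() keeps the label positive once any purchase has been seen.
        labels[key] = max(labels[key], int(event_type == POSITIVE_EVENT))
    return dict(labels)


# Toy activity log: (user_id, product_id, event_type)
events = [
    ("u1", "p1", "click"),
    ("u1", "p1", "add_to_cart"),
    ("u1", "p1", "purchase"),
    ("u1", "p2", "click"),
    ("u2", "p1", "click"),
]
labels = build_purchase_labels(events)
```

Note that pairs the user never interacted with do not appear at all. Whether and how to sample those as additional negatives is a separate design choice.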
Data Sources
In the case of Amazon, you have the following data sources:
Feature Engineering & Selection
With these signals available, you can apply the following feature engineering. The main idea is to translate non-numerical data (e.g., IDs, text, and images) into dense vectors that are model-friendly.
For instance, text data such as product descriptions and search queries can be represented using word embeddings. Individual IDs can also be represented as embeddings, which can be created using Matrix Factorization or Two-Tower Models (discussed in depth in the following section on model training).
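To make the idea of ID embeddings concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent on squared error. This is one simple variant chosen for illustration, not necessarily the formulation used later in this guide. The function names, hyperparameters, and toy data are all assumptions. Each user ID and product ID ends up with a small dense vector, and the dot product of the two serves as the predicted affinity.

```python
import random


def factorize(interactions, n_users, n_items, dim=2,
              epochs=1000, lr=0.05, reg=0.01, seed=0):
    """Learn user and item embeddings by SGD on squared error.

    interactions: list of (user_index, item_index, label) triples.
    Returns (user_vecs, item_vecs), each a list of `dim`-length lists.
    """
    rng = random.Random(seed)
    # Small random initialization for both embedding tables.
    user_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                 for _ in range(n_users)]
    item_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                 for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, y in interactions:
            pu, qi = user_vecs[u], item_vecs[i]
            err = y - sum(a * b for a, b in zip(pu, qi))
            for k in range(dim):
                pu_k, qi_k = pu[k], qi[k]
                # Gradient step with L2 regularization on both vectors.
                pu[k] += lr * (err * qi_k - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi_k)
    return user_vecs, item_vecs


def score(user_vec, item_vec):
    """Predicted affinity: dot product of the two embeddings."""
    return sum(a * b for a, b in zip(user_vec, item_vec))


# Toy data: user 0 bought item 0 but not item 1; user 1 the reverse.
interactions = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
user_vecs, item_vecs = factorize(interactions, n_users=2, n_items=2)
```

After training, `score(user_vecs[0], item_vecs[0])` should be higher than `score(user_vecs[0], item_vecs[1])`, reflecting the purchase pattern in the toy data. The learned vectors can then be fed to a downstream model as features.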
Product Data
User Data
User Activity Data