Evaluating Recommender Systems
Now we will come to a very interesting topic: Evaluating recommender systems.
In general, we can think of three ways of doing so:
- Offline Testing. We use historical interaction logs to compute metrics without involving live users.
- Online Testing. We run A/B tests to see whether a new recommender actually performs better than the current one on real traffic.
- User Studies. We bring real users in and survey them to judge the system.
As for the numbers, there are two main ways to evaluate recommendations: under classification and under retrieval.
Evaluation under Classification
We can think of recommendation as binary classification: either a recommendation is relevant or it is not. We measure this with Precision and Recall:
Precision: Of all the items your system recommended, what fraction were actually relevant?
Recall: Of all the items that were truly relevant, what fraction did your system manage to recommend?
So imagine Ahmed likes Shawarma, Pepsi and Fries, but our system recommends him Shawarma, Burgers, Pepsi and Milkshakes.
The recommendations contain 2 relevant items (Shawarma and Pepsi), the total number of recommended items is 4, and the total number of relevant items is 3.
So we get Precision = 2/4 = 0.5, because 2 out of the 4 recommendations were actually good, and Recall = 2/3 ≈ 0.67, because, of the 3 things that Ahmed likes, 2 of them were in the recommendations.
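As a minimal sketch of this calculation (the item names are just the toy ones from the example, and `precision_recall` is a hypothetical helper, not from any library), the two metrics boil down to set overlap:

```python
# Minimal sketch: precision and recall as set overlap.
# The item names are the toy ones from the Ahmed example above.

def precision_recall(recommended: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single user's recommendation list."""
    hits = recommended & relevant  # items that are both recommended and relevant
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"Shawarma", "Pepsi", "Fries"}                     # what Ahmed actually likes
recommended = {"Shawarma", "Burgers", "Pepsi", "Milkshakes"}  # what the system suggested

p, r = precision_recall(recommended, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")  # Precision = 0.50, Recall = 0.67
```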
Evaluation under Retrieval
Another way to think of this is as a retrieval problem: how well ranked are the results? The previous approach only checked whether the desired items appear in the recommendations at all, while this approach also looks at where in the ranking those items appear.
Similarly, we use Precision and Recall, but this time evaluated at a cutoff rank k: if only the top k items are recommended, how high are precision and recall?
More concretely:
- Precision@K: of the top K suggestions, what fraction are relevant?
- Recall@K: of all relevant items, what fraction appear in the top K?
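As a rough sketch (assuming the recommendations arrive as an ordered list, best first, and relevance is known as a set), both cutoff metrics look like this:

```python
# Sketch of Precision@K and Recall@K for a single ranked recommendation list.
# `ranked` is ordered best-first; `relevant` is the set of truly relevant items.

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Reusing the toy example from above, now treated as a ranked list:
ranked = ["Shawarma", "Burgers", "Pepsi", "Milkshakes"]
relevant = {"Shawarma", "Pepsi", "Fries"}

print(precision_at_k(ranked, relevant, 2))  # 0.5     (only Shawarma is relevant in the top 2)
print(recall_at_k(ranked, relevant, 2))     # ≈ 0.333 (1 of Ahmed's 3 liked items in the top 2)
```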
There are several such ranking metrics. For brevity, I will only show one more important one here: Mean Average Precision.
mAP: Mean Average Precision
Definitions:
- Precision@k: of the top-k results, what fraction are relevant?
- Average Precision (AP) for one query: the mean of the precisions computed at each rank where a relevant item appears.
- mAP: the average of those AP scores over all queries.
Let's say we have the following queries:
Query | Relevant Docs | Retrieved Ranking |
---|---|---|
Q1 | Doc1, Doc3, Doc4 | ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"] |
Q2 | Doc7, Doc6 | ["Doc7", "Doc8", "Doc6", "Doc9", "Doc10"] |
We would then calculate the Average Precision for Q1 as follows:
The system's ranked list:
- "Doc1" (relevant)
- "Doc2" (not)
- "Doc3" (relevant)
- "Doc4" (relevant)
- "Doc5" (not)
Compute Precision each time we hit a relevant item:
- At rank 1 ("Doc1"): Precision@1 = 1 relevant in top 1 → 1/1 = 1.0
- At rank 3 ("Doc3"): now 2 relevant in top 3 → 2/3 ≈ 0.667
- At rank 4 ("Doc4"): now 3 relevant in top 4 → 3/4 = 0.75
AP₁ = (1.0 + 0.667 + 0.75) / 3 ≈ 0.806
We would then do the same for Q2:
Ranking:
- "Doc7" (relevant)
- "Doc8" (not)
- "Doc6" (relevant)
- "Doc9" (not)
- "Doc10" (not)
- At rank 1 ("Doc7"): Precision@1 = 1/1 = 1.0
- At rank 3 ("Doc6"): now 2 relevant in top 3 → 2/3 ≈ 0.667
AP₂ = (1.0 + 0.667) / 2 ≈ 0.833
Finally, we take the mean over all queries: mAP = (AP₁ + AP₂) / 2 = (0.806 + 0.833) / 2 ≈ 0.82
This means that, on average across these two queries, our system achieves about 82% precision at the ranks where relevant items appear, which is a strong result.
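The same calculation in code, as a sketch that reproduces the numbers above. It follows the AP definition used in this section (averaging precision over the ranks where relevant documents appear); some formulations divide by the total number of relevant documents instead, which gives the same result here because every relevant document is retrieved:

```python
# Sketch of Average Precision and mAP, reproducing the toy Q1/Q2 example above.

def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of Precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # Precision@k at this hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-query AP scores."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in queries) / len(queries)

q1 = (["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"], {"Doc1", "Doc3", "Doc4"})
q2 = (["Doc7", "Doc8", "Doc6", "Doc9", "Doc10"], {"Doc7", "Doc6"})

print(round(average_precision(*q1), 3))            # 0.806
print(round(average_precision(*q2), 3))            # 0.833
print(round(mean_average_precision([q1, q2]), 2))  # 0.82
```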
User-Centric Evaluation
In many ways, this is the only thing that actually matters: how your users themselves perceive your system and how useful it actually is to them.
There are many ways to do this, but the most efficient is to use user surveys with precisely formulated questions. One established framework for this is ResQue (Recommender systems' Quality of user experience).