Evaluating Recommender Systems
Now we will come to a very interesting topic: Evaluating recommender systems.
In general, we can think of three ways of doing so:
- Offline Testing. We use historical interaction logs to compute metrics without involving live users.
- Online Testing. We run A/B tests to see whether a new recommender actually performs better than the current one on real traffic.
- User Studies. We bring real users in and survey them to judge the system.
As for the numbers, there are two main ways to evaluate recommendations: under classification and under retrieval.
Evaluation under Classification
We can think of recommendation as binary classification: either a recommendation is relevant or it is not. We measure this with Precision and Recall:
Precision: Of all the items your system recommended, what fraction were actually relevant?
Recall: Of all the items that were truly relevant, what fraction did your system manage to recommend?
So imagine Ahmed likes Shawarma, Pepsi and Fries, but our system recommends him Shawarma, Burgers, Pepsi and Milkshakes.
The recommendations contain 2 relevant items (Shawarma and Pepsi), the total number of recommended items is 4, and the total number of relevant items is 3.
So we get Precision = 2/4 = 0.5, because 2 out of the 4 recommendations were actually good, and Recall = 2/3 ≈ 0.67, because, of the 3 things that Ahmed likes, 2 of them were in the recommendations.
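As a minimal sketch of this calculation (the item names are just the toy ones from the example, and `precision_recall` is a hypothetical helper, not from any library), the two metrics boil down to set overlap:

```python
# Minimal sketch: precision and recall as set overlap.
# The item names are the toy ones from the Ahmed example above.

def precision_recall(recommended: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single user's recommendation list."""
    hits = recommended & relevant  # items that are both recommended and relevant
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"Shawarma", "Pepsi", "Fries"}                     # what Ahmed actually likes
recommended = {"Shawarma", "Burgers", "Pepsi", "Milkshakes"}  # what the system suggested

p, r = precision_recall(recommended, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")  # Precision = 0.50, Recall = 0.67
```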
Evaluation under Retrieval
Another way to think of this is as a retrieval problem: how well ranked are the results? The previous approach only checked whether the desired items appear in the recommendations at all, while this approach also looks at where in the ranking those items appear.
Similarly, we use Precision and Recall, but this time evaluated at a cutoff rank k: if only the top k items are recommended, how high are precision and recall?
More concretely:
- Precision@K: of the top K suggestions, what fraction are relevant?
- Recall@K: of all relevant items, what fraction appear in the top K?
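As a rough sketch (assuming the recommendations arrive as an ordered list, best first, and relevance is known as a set), both cutoff metrics look like this:

```python
# Sketch of Precision@K and Recall@K for a single ranked recommendation list.
# `ranked` is ordered best-first; `relevant` is the set of truly relevant items.

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Reusing the toy example from above, now treated as a ranked list:
ranked = ["Shawarma", "Burgers", "Pepsi", "Milkshakes"]
relevant = {"Shawarma", "Pepsi", "Fries"}

print(precision_at_k(ranked, relevant, 2))  # 0.5     (only Shawarma is relevant in the top 2)
print(recall_at_k(ranked, relevant, 2))     # ≈ 0.333 (1 of Ahmed's 3 liked items in the top 2)
```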
There are several such ranking metrics. For brevity, I will only show one more important one here: Mean Average Precision.
mAP: Mean Average Precision
Definitions:
- Precision@k: of the top-k results, what fraction are relevant?
- Average Precision (AP) for one query: the mean of the precisions computed at each rank where a relevant item appears.
- mAP: the average of those AP scores over all queries.
Let's say we have the following queries:
Query | Relevant Docs | Retrieved Ranking |
---|---|---|
Q1 | Doc1, Doc3, Doc4 | ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"] |
Q2 | Doc7, Doc6 | ["Doc7", "Doc8", "Doc6", "Doc9", "Doc10"] |
We would then calculate the Average Precision for Q1 as follows:
The system's ranked list:
- "Doc1" (relevant)
- "Doc2" (not)
- "Doc3" (relevant)
- "Doc4" (relevant)
- "Doc5" (not)
Compute Precision each time we hit a relevant item:
- At rank 1 ("Doc1"): Precision@1 = 1 relevant in top 1 → 1/1 = 1.0
- At rank 3 ("Doc3"): now 2 relevant in top 3 → 2/3 ≈ 0.667
- At rank 4 ("Doc4"): now 3 relevant in top 4 → 3/4 = 0.75
AP₁ = (1.0 + 0.667 + 0.75) / 3 ≈ 0.806
We would then do the same for Q2:
Ranking:
- "Doc7" (relevant)
- "Doc8" (not)
- "Doc6" (relevant)
- "Doc9" (not)
- "Doc10" (not)
- At rank 1 ("Doc7"): Precision@1 = 1/1 = 1.0
- At rank 3 ("Doc6"): now 2 relevant in top 3 → 2/3 ≈ 0.667
AP₂ = (1.0 + 0.667) / 2 ≈ 0.833
Finally, we take the mean over all queries: mAP = (AP₁ + AP₂) / 2 = (0.806 + 0.833) / 2 ≈ 0.82
This means that, on average across these two queries, our system achieves about 82% precision at the ranks where relevant items appear, which is a strong result.
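The same calculation in code, as a sketch that reproduces the numbers above. It follows the AP definition used in this section (averaging precision over the ranks where relevant documents appear); some formulations divide by the total number of relevant documents instead, which gives the same result here because every relevant document is retrieved:

```python
# Sketch of Average Precision and mAP, reproducing the toy Q1/Q2 example above.

def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of Precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # Precision@k at this hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-query AP scores."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in queries) / len(queries)

q1 = (["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"], {"Doc1", "Doc3", "Doc4"})
q2 = (["Doc7", "Doc8", "Doc6", "Doc9", "Doc10"], {"Doc7", "Doc6"})

print(round(average_precision(*q1), 3))            # 0.806
print(round(average_precision(*q2), 3))            # 0.833
print(round(mean_average_precision([q1, q2]), 2))  # 0.82
```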
User-Centric Evaluation
In many ways, this is the only thing that actually matters: how your users themselves perceive your system and how useful it actually is to them.
There are many ways to do this, but the most efficient is to use user surveys with precisely formulated questions. One established framework for this is ResQue (Recommender systems' Quality of user experience).