Content-Based Filtering

Content-based filtering is dear to me, as it was the first recommender system I shipped to production. It can be very easy to set up and doesn't need an initial user base, since it's not reliant on ratings or views (user interactions).

At its core, all a content-based recommender system is doing is comparing vectors. You know, that thing you did in high school. That's all it is in essence.

It tries to represent Items and Users as vectors and then compares them to recommend the next most similar item. The differentiation in content-based filtering comes from how you create these vectors.

In essence, think of it like this:

Imagine you have a restaurant with all sorts of different things on the menu. You create a small dataset of all your foods, with some metrics you add to it. You add columns like "Spiciness", "isVegan" and so on.

What you do then is just create vectors out of each food by making these descriptions numerical. E.g. Spiciness could be something between 0 and 1, where 1 is super spicy, and isVegan could be either 0 if it's not vegan, or 1 if it is. These kinds of descriptions are what we call the metadata of an Item. They can also be called features.

You can then take all your vectors and then just plot them.
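
To make this concrete, here is a minimal sketch in Python. The menu items, the two features (isVegan and spiciness) and their values are all made up for illustration:

# Hypothetical menu, each food described by its metadata (features).
menu = {
    "tofu_curry":   {"isVegan": 1, "spiciness": 0.9},
    "mild_lentils": {"isVegan": 1, "spiciness": 0.1},
    "spicy_wings":  {"isVegan": 0, "spiciness": 0.8},
    "cheese_pizza": {"isVegan": 0, "spiciness": 0.0},
}

# Turn each food into a plain numeric vector, in the order (isVegan, spiciness).
feature_order = ["isVegan", "spiciness"]
item_vectors = {
    name: tuple(features[f] for f in feature_order)
    for name, features in menu.items()
}

print(item_vectors["spicy_wings"])  # (0, 0.8)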

Here are two approaches you could take:

  1. If you don't have any information on the user and just want to suggest similar options, then just recommend the next closest food vector.
  2. If you instead want to see how close an item might be for a specific user, you create a vector for the user themselves and then find the closest item vectors to it.

Let's say Ahmad doesn't like vegan food and wants it super spicy. So his vector might look like (0, 1): 0 for vegan and 1 for spiciness (super spicy). He would then get recommended items whose vectors are very similar to that, as sketched below.
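
Here is a rough sketch of that in Python, continuing with made-up foods and values. It measures the straight-line (Euclidean) distance between Ahmad's vector and each food vector and recommends the closest:

import math

# Item vectors in the order (isVegan, spiciness); values are illustrative.
item_vectors = {
    "tofu_curry":   (1, 0.9),
    "mild_lentils": (1, 0.1),
    "spicy_wings":  (0, 0.8),
    "cheese_pizza": (0, 0.0),
}

# Ahmad's preference vector: not vegan (0), super spicy (1).
ahmad = (0, 1)

def euclidean(a, b):
    # Smaller distance means the vectors (and therefore the items) are more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank all foods by how close their vector is to Ahmad's.
ranked = sorted(item_vectors, key=lambda name: euclidean(ahmad, item_vectors[name]))
print(ranked[0])  # "spicy_wings" -- the closest match to Ahmad's taste

You could just as well use cosine similarity instead of Euclidean distance; the idea of "closest vector wins" stays the same.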

Creating the feature vector

As we spoke about earlier, the key difference is really in how exactly the feature vector is created for the items and the user. While doing it the way we just discussed is very simple and basic, it wouldn't work for data with very little metadata, such as Twitter posts or YouTube videos. For that we would need more sophisticated ways of actually using the content itself to create these feature vectors.

We will focus on doing this for text for now. Some ways of doing this would be:

  1. Term-Weighting
  2. Latent Semantic Analysis
  3. Embedding-Based

Term Weighting

How this works is that each term in a document is counted and then basically weighted.

Let's say there are two posts on a social media platform:

  1. "I love cats"
  2. "I hate cats"

Each word is then counted per post:

post_1 = {
    "I": 1,
    "love": 1,
    "hate": 0,
    "cats": 1
}

post_2 = {
    "I": 1,
    "love": 0,
    "hate": 1,
    "cats": 1
}

Each word then gets a weight, and its count is multiplied by that weight. Let's say "I" gets 1, "love" gets 0.5, "hate" gets 0.8 and "cats" gets 0.1. So we would then get:

post_1 = {
    "I": 1 * 1,
    "love": 1 * 0.5,
    "hate": 0 * 0.8,
    "cats": 1 * 0.1
}

post_2 = {
    "I": 1 * 1,
    "love": 0 * 0.5,
    "hate": 1 * 0.8,
    "cats": 1 * 0.1
}

Which would then give us the final vectors:

post_1 = (1, 0.5, 0, 0.1)
post_2 = (1, 0, 0.8, 0.1)
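
Here is the same computation as a minimal Python sketch, using the hand-picked weights from the example above (not weights learned from any real corpus):

# Vocabulary and the made-up weights from the example.
vocabulary = ["I", "love", "hate", "cats"]
weights = {"I": 1.0, "love": 0.5, "hate": 0.8, "cats": 0.1}

posts = ["I love cats", "I hate cats"]

def term_weighted_vector(text):
    # Count how often each vocabulary term appears, then multiply by its weight.
    tokens = text.split()
    return tuple(tokens.count(term) * weights[term] for term in vocabulary)

for post in posts:
    print(term_weighted_vector(post))
# (1.0, 0.5, 0.0, 0.1)
# (1.0, 0.0, 0.8, 0.1)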