#!/usr/bin/env python
# coding: utf-8

# # Collaborative filtering

# In[1]:

from fastai.gen_doc.nbdoc import *

# This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing all we'll need.

# In[2]:

from fastai import *
from fastai.collab import *

# ## Overview

# Collaborative filtering is the task of predicting how much a user is going to like a certain item. The fastai library contains a [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) class that will help you create datasets suitable for training, and a function `get_collab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before delving into the documentation.
#
# For our example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. In it, we have to predict the rating a user gave a given movie (from 0 to 5). It comes in the form of a csv file where each line is the rating of a movie by a given person.

# In[ ]:

path = untar_data(URLs.ML_SAMPLE)
ratings = pd.read_csv(path/'ratings.csv')
ratings.head()

# We'll first turn the `userId` and `movieId` columns into category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users or names of items in it.

# In[ ]:

series2cat(ratings, 'userId', 'movieId')

# Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:

# In[ ]:

learn = get_collab_learner(ratings, n_factors=50, pct_val=0.2, min_score=0., max_score=5.)

# And then immediately begin training:

# In[ ]:

learn.fit_one_cycle(5, 5e-3, wd=0.1)

# In[3]:

show_doc(CollabFilteringDataset, doc_string=False)

# This is the basic class to build a [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) suitable for collaborative filtering.
# `user` and `item` should be categorical series that will be replaced with their codes internally, along with the corresponding `ratings`. One of the factory methods will prepare the data in this format.

# In[4]:

show_doc(CollabFilteringDataset.from_df, doc_string=False)

# Takes a `rating_df` and splits it randomly into train and test sets following `pct_val` (unless it's `None`). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaulting to the first, second and third columns).

# In[5]:

show_doc(CollabFilteringDataset.from_csv, doc_string=False)

# Opens the file in `csv_name` as a `DataFrame` and feeds it to `CollabFilteringDataset.from_df` with the `kwargs`.

# ## Model and [`Learner`](/basic_train.html#Learner)

# In[6]:

show_doc(EmbeddingDotBias, doc_string=False, title_level=3)

# Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings, adds the biases, then feeds the result to a sigmoid rescaled to go from `min_score` to `max_score`.

# In[7]:

show_doc(get_collab_learner, doc_string=False)

# Creates a [`Learner`](/basic_train.html#Learner) object built by passing the data in `ratings`, `pct_val`, `user_name`, `item_name` and `rating_name` to [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset). Optionally creates another [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) for `test`. `kwargs` are fed to [`DataBunch.create`](/basic_data.html#DataBunch.create) with these datasets. The model is given by [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors`, `min_score` and `max_score` (the numbers of users and items will be inferred from the data).

# ## Undocumented Methods - Methods moved below this line will intentionally be hidden

# In[8]:

show_doc(EmbeddingDotBias.forward)
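# As a worked illustration of the computation described above (a dot product of user and item embeddings, plus per-user and per-item biases, fed through a sigmoid rescaled to the score range), here is a minimal sketch in plain PyTorch. The class name `DotBias` and its signature are illustrative only, not part of the fastai API.

```python
import torch
import torch.nn as nn

class DotBias(nn.Module):
    "Sketch of a dot-product collaborative filtering model with user/item biases."
    def __init__(self, n_users, n_items, n_factors=50, min_score=0., max_score=5.):
        super().__init__()
        # One embedding table each for user/item latent factors and user/item biases
        self.u_weight = nn.Embedding(n_users, n_factors)
        self.i_weight = nn.Embedding(n_items, n_factors)
        self.u_bias   = nn.Embedding(n_users, 1)
        self.i_bias   = nn.Embedding(n_items, 1)
        self.min_score, self.max_score = min_score, max_score

    def forward(self, users, items):
        # Dot product of the latent factors, plus the two biases
        dot = (self.u_weight(users) * self.i_weight(items)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.i_bias(items).squeeze(1)
        # Sigmoid rescaled so predictions land in [min_score, max_score]
        return torch.sigmoid(res) * (self.max_score - self.min_score) + self.min_score

model = DotBias(n_users=10, n_items=20)
scores = model(torch.tensor([0, 1]), torch.tensor([3, 4]))
```

# Each prediction is guaranteed to fall between `min_score` and `max_score`, which is why the actual rating bounds (0 and 5 for MovieLens) are passed in when the learner is created.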