from fastai.gen_doc.nbdoc import *
from fastai.tabular import *
from fastai import *
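This package contains the basic class to define a transformation for preprocessing dataframes of tabular data, as well as the basic TabularTransform subclasses. Preprocessing includes things like replacing categorical variables with category ids, filling missing values, and normalizing continuous variables.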
In all those steps we have to be careful to apply to the validation or test set the same correspondence we settled on with the training set (which id we give to each category, which value we use for missing data, or which mean/std we use to normalize). To deal with this, we use a special class called TabularTransform.
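As a rough illustration of why the training-set correspondence matters, here is a minimal sketch in plain pandas (not fastai code). The toy dataframes and the variable names `mean` and `std` are assumptions made for the example.

```python
import pandas as pd

# Toy data, made up for illustration only.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
valid = pd.DataFrame({"age": [30.0, 50.0]})

# The statistics are computed once, on the training set only.
mean, std = train["age"].mean(), train["age"].std()

# The same statistics are then reused for every other split,
# so a given raw value always maps to the same normalized value.
train_norm = (train["age"] - mean) / std
valid_norm = (valid["age"] - mean) / std
```

Had `valid` been normalized with its own mean and standard deviation, an age of 30 would map to different values in the two sets, and a model trained on one would misread the other.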
The data used in this documentation page is a subset of the adult dataset. It provides information on individuals, which we use to train a model to predict whether their salary is greater than $50k.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
train_df, valid_df = df[:800].copy(),df[800:].copy()
train_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | >=50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | 1 |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | 1 |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | 0 |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | 1 |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | 0 |
We see it contains numerical variables (like age or education-num) as well as categorical ones (like workclass or relationship). The original dataset is clean, but we removed a few values to give examples of dealing with missing values.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
show_doc(TabularTransform, doc_string=False)
class TabularTransform[source]
TabularTransform(cat_names:StrList,cont_names:StrList)
Base class for creating transforms for dataframes with categorical variables cat_names and continuous variables cont_names. Note that any column not in one of those lists won't be touched.
show_doc(TabularTransform.__call__)
__call__[source]
__call__(df:DataFrame,test:bool=False)
Apply the correct function to df depending on test.
This simply calls apply_test if test is True, and apply_train otherwise. Those functions apply the changes in place.
show_doc(TabularTransform.apply_train, doc_string=False)
apply_train[source]
apply_train(df:DataFrame)
Must be implemented by an inherited class with the desired transformation logic.
show_doc(TabularTransform.apply_test, doc_string=False)
apply_test[source]
apply_test(df:DataFrame)
If not implemented by an inherited class, defaults to calling apply_train.
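The dispatch described above can be sketched with a small stand-in class. This is not fastai's source: `SketchTransform` and `ClipToTrainRange` are hypothetical names, and the clipping logic is only an example of state learned on the training set and reused at test time.

```python
import pandas as pd

class SketchTransform:
    """Hypothetical stand-in for the base class described above (not fastai's source)."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Dispatch on `test`, as described for __call__.
        if test:
            self.apply_test(df)
        else:
            self.apply_train(df)

    def apply_train(self, df):
        raise NotImplementedError

    def apply_test(self, df):
        # Default behavior: reuse the training logic.
        self.apply_train(df)


class ClipToTrainRange(SketchTransform):
    """Illustrative subclass: clip continuous columns to the range seen in training."""
    def apply_train(self, df):
        # Remember the (min, max) of each continuous column.
        self.bounds = {c: (df[c].min(), df[c].max()) for c in self.cont_names}

    def apply_test(self, df):
        # Reuse the stored bounds; the dataframe is modified in place.
        for c, (lo, hi) in self.bounds.items():
            df[c] = df[c].clip(lo, hi)


train = pd.DataFrame({"age": [20, 30, 40]})
valid = pd.DataFrame({"age": [10, 35, 90]})

tfm = ClipToTrainRange(cat_names=[], cont_names=["age"])
tfm(train)             # learns the bounds (20, 40)
tfm(valid, test=True)  # applies them to the validation set
```

The subclass only overrides apply_test because its test-time behavior differs from training; a transform with no learned state could rely on the default.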
The following TabularTransform subclasses are implemented in the fastai library. Note that replacing categories with their codes and normalizing continuous variables are both done automatically in a TabularDataset.
show_doc(Categorify, doc_string=False)
class Categorify[source]
Categorify(cat_names:StrList,cont_names:StrList) ::TabularTransform
Converts the categorical variables in cat_names into categories. Variables in cont_names aren't affected.
show_doc(Categorify.apply_train, doc_string=False)
apply_train[source]
apply_train(df:DataFrame)
Transforms the variables in the cat_names columns into categories. The categories are the unique values found in these columns.
show_doc(Categorify.apply_test, doc_string=False)
apply_test[source]
apply_test(df:DataFrame)
Transforms the variables in the cat_names columns into categories. The categories are the ones used for the training set; new categories are replaced by NaN.
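The train/test behavior described here can be approximated in plain pandas. The following is a sketch under the assumption that the categories are simply carried over from the training column; the toy values are made up and this is not the library's implementation.

```python
import pandas as pd

train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
valid = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

# Training: let pandas infer the categories from the values present.
train["workclass"] = train["workclass"].astype("category")
categories = train["workclass"].cat.categories

# Test: reuse the training categories; values unseen in training become NaN.
valid["workclass"] = pd.Categorical(valid["workclass"], categories=categories)

# pandas itself encodes NaN with the code -1.
codes = valid["workclass"].cat.codes
```

Because both columns share the same categories, a given value such as "State-gov" receives the same code in both dataframes.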
tfm = Categorify(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
Since we haven't replaced the categories with their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their categories.
train_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
' Self-emp-not-inc', ' State-gov', ' Without-pay'],
dtype='object')
The test set will be given the same category codes as the training set.
valid_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
' Self-emp-not-inc', ' State-gov', ' Without-pay'],
dtype='object')
show_doc(FillMissing, doc_string=False)
class FillMissing[source]
FillMissing(cat_names:StrList,cont_names:StrList,fill_strategy:FillStrategy=<FillStrategy.MEDIAN: 1>,add_col:bool=True,fill_val:float=0.0) ::TabularTransform
Transform that fills the missing values in cont_names. cat_names variables are left untouched (their missing values will be replaced by code 0 in the TabularDataset). fill_strategy determines how those nans are replaced and, if add_col is True, whenever a column c has missing values, a column named c_nan is added that flags the rows where the value was missing.
show_doc(FillMissing.apply_train, doc_string=False)
apply_train[source]
apply_train(df:DataFrame)
Fills the missing values in the cont_names columns.
show_doc(FillMissing.apply_test, doc_string=False)
apply_test[source]
apply_test(df:DataFrame)
Fills the missing values in the cont_names columns with the values picked during training.
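To make the fit-on-train, reuse-on-test pattern concrete, here is a minimal pandas sketch of median filling with a flag column. It is not the fastai implementation: the `fill` dictionary is a hypothetical helper, the toy values are made up, and the `_nan` suffix simply follows the naming used in the description above.

```python
import pandas as pd

train = pd.DataFrame({"education-num": [12.0, 14.0, None, 10.0, None]})
valid = pd.DataFrame({"education-num": [None, 9.0]})

col = "education-num"
fill = {}  # hypothetical store: column name -> value chosen on the training set

# Training: only columns that actually contain missing values are handled.
if train[col].isna().any():
    train[col + "_nan"] = train[col].isna()  # flag rows where the value was missing
    fill[col] = train[col].median()          # median ignores NaN by default
    train[col] = train[col].fillna(fill[col])

# Test: reuse the stored value instead of recomputing a median.
for name, value in fill.items():
    valid[name + "_nan"] = valid[name].isna()
    valid[name] = valid[name].fillna(value)
```

The validation set is filled with the training median even though its own median would differ, which keeps the two sets consistent.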
train_df[cont_names].head()
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | NaN | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | NaN | 0 | 0 | 50 |
tfm = FillMissing(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
train_df[cont_names].head()
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | 10.0 | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | 10.0 | 0 | 0 | 50 |
Values missing in the education-num column are replaced by 10, which is the median of the column in train_df. Categorical variables are not changed, since nan is simply used as another category.
valid_df[cont_names].head()
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 800 | 45 | 96975 | 10.0 | 0 | 0 | 40 |
| 801 | 46 | 192779 | 10.0 | 15024 | 0 | 60 |
| 802 | 36 | 376455 | 10.0 | 0 | 0 | 38 |
| 803 | 25 | 50053 | 10.0 | 0 | 0 | 45 |
| 804 | 37 | 164526 | 10.0 | 0 | 0 | 40 |
show_doc(FillStrategy, alt_doc_string='Enum flag that determines how `FillMissing` should handle missing/nan values', arg_comments={
'MEDIAN':'nans are replaced by the median value of the column',
'COMMON': 'nans are replaced by the most common value of the column',
'CONSTANT': 'nans are replaced by `fill_val`'
})
Enum= [MEDIAN, COMMON, CONSTANT]
Enum flag that determines how FillMissing should handle missing/nan values
fill_val
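The three strategies can be illustrated on a single pandas Series. This is only a sketch of what each option means, written in plain pandas rather than taken from the library; the sample values are made up.

```python
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, 6.0, None])
fill_val = 0.0  # plays the role of the fill_val argument

# MEDIAN: nans are replaced by the median of the non-missing values.
median_filled = s.fillna(s.median())

# COMMON: nans are replaced by the most common value (first mode if tied).
common_filled = s.fillna(s.mode().iloc[0])

# CONSTANT: nans are replaced by fill_val.
constant_filled = s.fillna(fill_val)
```

On skewed columns such as this one, the median and the most common value can differ noticeably, so the choice of strategy changes what the model sees for rows with missing data.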