Tabular data handling

This module defines the main class to handle tabular data in the fastai library: TabularDataBunch. As always, there is also a helper function to quickly get your data.

To allow you to easily create a Learner for your data, it provides tabular_learner.

In [1]:

from fastai.gen_doc.nbdoc import *
from fastai.tabular import * 

In [2]:

show_doc(TabularDataBunch)

`class` `TabularDataBunch`

TabularDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: DataBunch

Create a DataBunch suitable for tabular data.

The best way to quickly get your data in a DataBunch suitable for tabular data is to organize it in two (or three) dataframes: one for training, one for validation and, if you have it, one for testing. Here we are interested in a subsample of the adult dataset.

In [ ]:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
valid_idx = range(len(df)-2000, len(df))
df.head()

Out[ ]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k
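The valid_idx range defined above marks the last 2000 rows as the validation set. A minimal sketch of what such an index-based split amounts to, using a plain Python list instead of a dataframe (the helper split_by_valid_idx and the toy rows are hypothetical and not part of fastai):

```python
def split_by_valid_idx(rows, valid_idx):
    """Split rows into (train, valid) given the positions of the validation rows."""
    valid_set = set(valid_idx)
    train = [r for i, r in enumerate(rows) if i not in valid_set]
    valid = [r for i, r in enumerate(rows) if i in valid_set]
    return train, valid

# Toy stand-in for the dataframe: 10 rows, the last 3 held out for validation.
rows = [f"row{i}" for i in range(10)]
valid_idx = range(len(rows) - 3, len(rows))
train_rows, valid_rows = split_by_valid_idx(rows, valid_idx)
print(len(train_rows), len(valid_rows))
```

Holding out the tail of the file is only sensible if the row order carries no meaning; otherwise a random set of indices may be a better choice.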

In [ ]:

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = 'salary'
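Only cat_names and dep_var are listed here. When cont_names is left as None, every column that is neither categorical nor the dependent variable is treated as continuous. A small sketch of that rule applied to the column names of the sample (the helper infer_cont_names is hypothetical and only illustrates the rule; it is not fastai's code):

```python
def infer_cont_names(columns, cat_names, dep_var):
    """Return the columns that are neither categorical nor the dependent variable."""
    excluded = set(cat_names) | {dep_var}
    return [c for c in columns if c not in excluded]

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']
cat_names = ['workclass', 'education', 'marital-status', 'occupation',
             'relationship', 'race', 'sex', 'native-country']
dep_var = 'salary'

cont_names = infer_cont_names(columns, cat_names, dep_var)
print(cont_names)
```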

TabularDataBunch is initialized the same way as DataBunch, so in practice you will want to use the factory method from_df instead.

In [3]:

show_doc(TabularDataBunch.from_df)

`from_df`

from_df(path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:Optional[Collection[TabularProc]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection[T_co]=None, test_df=None, bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False) → DataBunch

Create a DataBunch from df and valid_idx with dep_var. kwargs are passed to DataBunch.create.

Optionally, use test_df for the test set. The dependent variable is dep_var, while the categorical and continuous variables are in the cat_names and cont_names columns respectively. If cont_names is None, all variables that are neither dependent nor categorical are assumed to be continuous. The TabularProc transforms in procs are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized.
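To make the last step concrete, here is a rough pure-Python sketch of the two encodings described above: categories mapped to their code plus one, with 0 reserved for missing values, and continuous values normalized. The function names, the use of None to stand for nan, and the choice of mean and population standard deviation are assumptions made for illustration; fastai's own processors may differ in detail.

```python
from statistics import mean, pstdev

def encode_categories(values):
    """Map each category to its code + 1; missing values (None) map to 0."""
    classes = sorted({v for v in values if v is not None})
    code_of = {c: i + 1 for i, c in enumerate(classes)}
    return [code_of.get(v, 0) for v in values], classes

def normalize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# 'occupation' and 'hours-per-week' values from the five rows shown earlier.
occupation = [None, 'Exec-managerial', None, 'Prof-specialty', 'Other-service']
codes, classes = encode_categories(occupation)
print(codes)    # [0, 1, 0, 3, 2]
print(classes)

hours = [40, 45, 32, 40, 50]
print(normalize(hours))
```

In a real pipeline the classes and the normalization statistics would be computed on the training set and reused unchanged for the validation and test sets.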

Note that each TabularProc should be passed as a Callable, that is, the class itself rather than an instance: the actual initialization with cat_names and cont_names is done during the preprocessing.
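The difference between passing a class and passing an instance can be sketched as follows. The Proc and FillMissingSketch classes and the build_procs helper are hypothetical stand-ins, not fastai's own classes; they only show how a list of uninstantiated classes can be initialized later, once cat_names and cont_names are known.

```python
class Proc:
    """Hypothetical base class: stores the column names it will operate on."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

class FillMissingSketch(Proc):
    """Replace missing continuous values (None) with a fixed fill value."""
    def __call__(self, row, fill=0.0):
        return {k: (fill if k in self.cont_names and v is None else v)
                for k, v in row.items()}

def build_procs(proc_classes, cat_names, cont_names):
    """Instantiate each class only now that the column names are known."""
    return [cls(cat_names, cont_names) for cls in proc_classes]

# The caller passes the class itself, not FillMissingSketch(...).
procs = build_procs([FillMissingSketch],
                    cat_names=['education'], cont_names=['education-num'])
row = {'education': 'HS-grad', 'education-num': None}
for proc in procs:
    row = proc(row)
print(row)
```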

In [ ]:

procs = [FillMissing, Categorify, Normalize]
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

You can then easily create a Learner for this data with tabular_learner.

In [4]:

show_doc(tabular_learner)

`tabular_learner`

tabular_learner(data:DataBunch, layers:Collection[int], emb_szs:Dict[str, int]=None, metrics=None, ps:Collection[float]=None, emb_drop:float=0.0, y_range:OptRange=None, use_bn:bool=True, **learn_kwargs)

Get a Learner using data, with metrics, including a TabularModel created using the remaining params.

emb_szs is a dict mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.
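The override behaviour can be pictured as a dictionary lookup with a fallback. In the sketch below, default_emb_size is a placeholder rule chosen purely for illustration (it is not claimed to be fastai's default), resolve_emb_sizes is a hypothetical helper, and the cardinalities are made up.

```python
def default_emb_size(n_categories):
    """Placeholder rule: about half the cardinality, capped at 50."""
    return min(50, (n_categories + 1) // 2)

def resolve_emb_sizes(cardinalities, emb_szs=None):
    """Use the size in emb_szs when given, otherwise fall back to the default rule."""
    emb_szs = emb_szs or {}
    return {name: emb_szs.get(name, default_emb_size(n))
            for name, n in cardinalities.items()}

# Number of distinct categories per column (made-up values).
cardinalities = {'workclass': 10, 'education': 17, 'native-country': 42}
sizes = resolve_emb_sizes(cardinalities, emb_szs={'native-country': 10})
print(sizes)
```

Only 'native-country' is overridden here; the other two columns fall back to the placeholder rule.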

In [5]:

show_doc(TabularList)

`class` `TabularList`

TabularList(items:Iterator[T_co], cat_names:OptStrList=None, cont_names:OptStrList=None, procs=None, **kwargs) → TabularList :: ItemList

Basic ItemList for tabular data.

Basic class to create a list of inputs in items for tabular data. cat_names and cont_names are the names of the categorical and the continuous variables respectively. processor will be applied to the inputs or one will be created from the transforms in procs.

In [6]:

show_doc(TabularList.from_df)

`from_df`

from_df(df:DataFrame, cat_names:OptStrList=None, cont_names:OptStrList=None, procs=None, **kwargs) → ItemList

Get the list of inputs from the rows of df, using cat_names and cont_names as the categorical and continuous columns.

In [7]:

show_doc(TabularList.get_emb_szs)

`get_emb_szs`

get_emb_szs(sz_dict=None)

Return the default embedding sizes suitable for this data, or take the ones in sz_dict.

In [8]:

show_doc(TabularList.show_xys)

`show_xys`

show_xys(xs, ys)

Show the xs (inputs) and ys (targets).

In [9]:

show_doc(TabularList.show_xyzs)

`show_xyzs`

show_xyzs(xs, ys, zs)

Show xs (inputs), ys (targets) and zs (predictions).

In [10]:

show_doc(TabularLine, doc_string=False)

`class` `TabularLine`

TabularLine(cats, conts, classes, names) :: ItemBase

An object that will contain the encoded cats, the continuous variables conts, the classes and the names of the columns. This is the basic input for a dataset dealing with tabular data.
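As a rough mental model, such an item bundles the encoded values with enough metadata to print them back in readable form. The TabularRowSketch dataclass below is hypothetical and much simpler than TabularLine; it assumes that classes maps each categorical column name to its list of labels, that code 0 stands for a missing value, and that names lists the categorical columns before the continuous ones.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TabularRowSketch:
    cats: List[int]                 # encoded categorical values (0 = missing)
    conts: List[float]              # continuous values
    classes: Dict[str, List[str]]   # labels per categorical column
    names: List[str]                # column names: categorical first, then continuous

    def decode(self):
        """Return a {column: readable value} dict."""
        n_cat = len(self.cats)
        out = {}
        for name, code in zip(self.names[:n_cat], self.cats):
            # Codes are shifted by one, so label i is stored under code i + 1.
            out[name] = '#na#' if code == 0 else self.classes[name][code - 1]
        for name, value in zip(self.names[n_cat:], self.conts):
            out[name] = value
        return out

row = TabularRowSketch(
    cats=[2, 0], conts=[0.5],
    classes={'sex': ['Female', 'Male'], 'occupation': ['Exec-managerial']},
    names=['sex', 'occupation', 'age'])
print(row.decode())
```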

In [11]:

show_doc(TabularProcessor)

`class` `TabularProcessor`

TabularProcessor(ds:ItemBase=None, procs=None) :: PreProcessor

Regroup the procs in one PreProcessor.

Create a PreProcessor from procs.

Undocumented Methods - Methods moved below this line will intentionally be hidden

In [12]:

show_doc(TabularProcessor.process_one)

`process_one`

process_one(item)

In [13]:

show_doc(TabularList.new)

`new`

new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

Create a new ItemList from items, keeping the same attributes.

In [14]:

show_doc(TabularList.get)

`get`

get(o)

Subclass if you want to customize how to create item i from self.items.

In [15]:

show_doc(TabularProcessor.process)

`process`

process(ds)

In [16]:

show_doc(TabularList.reconstruct)

`reconstruct`

reconstruct(t:Tensor)

Reconstruct one of the underlying items from its data t.
