NLP datasets¶

In [1]:

from fastai.gen_doc.nbdoc import *
from fastai.text import *
from fastai import *

This module contains the TextDataset class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform. It also contains all the functions to quickly get a TextDataBunch ready.

Quickly assemble your data¶

To make the most of the fastai library, get your data into one of the following formats and use the matching factory method of one of the TextDataBunch classes:

raw text files in folders train, valid, test in an ImageNet style,
a csv where some column(s) give the label(s) and the following one the associated text,
a dataframe structured the same way,
tokens and labels arrays,
ids, vocabulary (correspondence of id to word) and labels.

If you are assembling the data for a language model, you should define your labels as always 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically. You can save it, so that the next call is almost instantaneous.
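As a rough illustration of the csv layout described above (label column first, text after it), the sketch below writes and reads back a tiny file with Python's csv module. The column names, the sample texts and the use of an in-memory buffer are illustrative assumptions; only the "labels always 0 for a language model" convention comes from the text above.

```python
import csv
import io

# Hypothetical rows for a language model: the label is always 0,
# and the text sits in the column after the label.
rows = [(0, "first review text"), (0, "second review text")]

# Write the rows to an in-memory buffer standing in for a csv file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["label", "text"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the content back to check the layout.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```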

Below are the classes that help assemble the raw data into a DataBunch suitable for NLP.

In [2]:

show_doc(TextLMDataBunch, title_level=3, doc_string=False)

`class` `TextLMDataBunch`[source]

TextLMDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch

Create a DataBunch suitable for language modeling: all the texts in the datasets are concatenated and the labels are ignored. Instead, the target is the next word in the sentence.

In [3]:

show_doc(TextLMDataBunch.show_batch)

`show_batch`[source]

show_batch(sep=' ', ds_type:DatasetType=<DatasetType.Train: 1>, rows:int=10, max_len:int=100)

Show rows texts from a batch of ds_type, tokens are joined with sep, truncated at max_len.

In [4]:

show_doc(TextClasDataBunch, title_level=3, doc_string=False)

`class` `TextClasDataBunch`[source]

TextClasDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch

Create a DataBunch suitable for a text classifier: all the texts are grouped by length (with a bit of randomness for the training set) then padded.

In [5]:

show_doc(TextClasDataBunch.show_batch)

`show_batch`[source]

show_batch(sep=' ', ds_type:DatasetType=<DatasetType.Train: 1>, rows:int=10, max_len:int=100)

Show rows texts from a batch of ds_type, tokens are joined with sep, truncated at max_len.

In [6]:

show_doc(TextDataBunch, title_level=3, doc_string=False)

`class` `TextDataBunch`[source]

TextDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: DataBunch

Create a DataBunch with the raw texts. This is only going to work if they all have the same length.

Factory methods (TextDataBunch)¶

All those classes have the following factory methods.

In [7]:

show_doc(TextDataBunch.from_folder, doc_string=False)

`from_folder`[source]

from_folder(path:PathOrStr, train:str='train', valid:str='valid', test:Optional[str]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, kwargs)

This function will create a DataBunch from texts placed in path in train, valid and optionally test folders. Text files in the train and valid folders should be placed in subdirectories according to their classes (always the same for a language model) and the ones for the test folder should all be placed there directly. tokenizer will be used to parse those texts into tokens. The shuffle flag will optionally shuffle the texts found.

You can pass a specific vocab for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization. You can specify parameters there such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).

In [8]:

show_doc(TextDataBunch.from_csv, doc_string=False)

`from_csv`[source]

from_csv(path:PathOrStr, csv_name, valid_pct:float=0.2, test:Optional[str]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, header='infer', kwargs) → DataBunch

This function will create a DataBunch from texts stored in path in a csv file, and optionally a test csv file, opened with header. You can specify txt_cols and lbl_cols or just an integer n_labels, in which case the label(s) should be the first column(s). tokenizer will be used to parse those texts into tokens.

You can pass a specific vocab for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization. You can specify parameters there such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).

In [9]:

show_doc(TextDataBunch.from_df, doc_string=False)

`from_df`[source]

from_df(path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:OptDataFrame=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, kwargs) → DataBunch

This function will create a DataBunch in path from texts in train_df, valid_df and maybe test_df. By default, those are opened with header=infer but you can specify another value in the kwargs. You can specify txt_cols and lbl_cols or just an integer n_labels in which case the label(s) should be the first column(s). tokenizer will be used to parse those texts into tokens.

You can pass a specific vocab for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization. You can specify parameters there such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).

In [10]:

show_doc(TextDataBunch.from_tokens, doc_string=False)

`from_tokens`[source]

from_tokens(path:PathOrStr, trn_tok:Tokens, trn_lbls:Collection[Union[int, float]], val_tok:Tokens, val_lbls:Collection[Union[int, float]], vocab:Vocab=None, tst_tok:Tokens=None, classes:ArgStar=None, kwargs) → DataBunch

This function will create a DataBunch from trn_tok, trn_lbls, val_tok, val_lbls and maybe tst_tok.

You can pass a specific vocab for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization. You can specify parameters there such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).

In [11]:

show_doc(TextDataBunch.from_ids, doc_string=False)

`from_ids`[source]

from_ids(path:PathOrStr, vocab:Vocab, trn_ids:Collection[Collection[int]], val_ids:Collection[Collection[int]], tst_ids:Collection[Collection[int]]=None, trn_lbls:Collection[Union[int, float]]=None, val_lbls:Collection[Union[int, float]]=None, classes:ArgStar=None, kwargs) → DataBunch

This function will create a DataBunch in path from texts already processed into trn_ids, trn_lbls, val_ids, val_lbls and maybe tst_ids. You can specify the corresponding classes if applicable. You must specify the vocab so that the RNNLearner class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.

Load and save¶

To avoid losing time preprocessing the text data more than once, you should save/load your TextDataBunch using these methods.

In [12]:

show_doc(TextDataBunch.load)

`load`[source]

load(path:PathOrStr, cache_name:PathOrStr='tmp', kwargs)

Load a TextDataBunch from path/cache_name. kwargs are passed to the dataloader creation.

In [13]:

show_doc(TextDataBunch.save)

`save`[source]

save(cache_name:PathOrStr='tmp')

Save the DataBunch in self.path/cache_name folder.

Example¶

Untar the IMDB sample dataset if not already done:

In [11]:

path = untar_data(URLs.IMDB_SAMPLE)
path

Out[11]:

PosixPath('/home/ubuntu/.fastai/data/imdb_sample')

Since it comes in the form of csv files, we will use the corresponding text_data method. Here is an overview of what your file should look like:

In [12]:

pd.read_csv(path/'texts.csv').head()

Out[12]:

	label	text	is_valid
0	negative	Un-bleeping-believable! Meg Ryan doesn't even ...	False
1	positive	This is a extremely well-made film. The acting...	False
2	negative	Every once in a long while a movie will come a...	False
3	positive	Name just says it all. I watched this movie wi...	False
4	negative	This movie succeeds at being one of the most u...	False

And here is a simple way of creating your DataBunch for language modelling or classification.

In [13]:

data_lm = TextLMDataBunch.from_csv(Path(path), 'texts.csv')
data_clas = TextClasDataBunch.from_csv(Path(path), 'texts.csv')

The TextBase dataset classes¶

Behind the scenes, the previous functions will create a training, validation and maybe test TextDataset, which will then be transformed into a TokenizedDataset and then a NumericalizedDataset. Those are all subclasses of TextBase.

In [14]:

show_doc(TextBase, title_level=3)

`class` `TextBase`[source]

TextBase(x:ArgStar, labels:Collection[Union[int, float]]=None, classes:ArgStar=None, encode_classes:bool=True) :: LabelDataset

Base class for fastai datasets that do classification, mapped according to classes.

x is an array representing the inputs (filenames, texts, tokens or ids) with certain labels (default to all zeros if not specified). classes can be passed and if encode_classes, the labels are changed from their class to the corresponding index.
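The label encoding described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the helper name `encode_labels` is hypothetical, and inferring `classes` as the sorted unique labels when none are passed is an assumption.

```python
def encode_labels(labels, classes=None):
    """Map each label to its index in `classes`.

    If `classes` is not given, the sorted unique labels are used
    (an assumption made for this sketch).
    """
    if classes is None:
        classes = sorted(set(labels))
    class_to_idx = {c: i for i, c in enumerate(classes)}
    return [class_to_idx[label] for label in labels], classes

encoded, classes = encode_labels(["negative", "positive", "negative"])
print(classes, encoded)
```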

In [15]:

show_doc(TextDataset, doc_string=False, title_level=3)

`class` `TextDataset`[source]

TextDataset(texts:StrList, labels:ArgStar=None, classes:ArgStar=None, mark_fields:bool=True, encode_classes:bool=True, is_fnames:bool=False) :: TextBase

Create a TextBase dataset of texts with labels belonging to classes. The texts are joined in the column dimension and if mark_fields, field markers are added in-between. If encode_classes the labels are changed from their class to the corresponding index. If is_fnames, the filenames in texts are read to pull the texts.
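To make the effect of mark_fields concrete, here is a small sketch of joining several text columns with field markers. The `xxfld` token followed by the field number mirrors what appears in the example batch on this page; the helper name `join_fields` and the exact spacing are assumptions.

```python
FLD = "xxfld"  # field marker token, as seen in the example batch on this page

def join_fields(columns, mark_fields=True):
    """Join the text columns of one row, optionally prefixing each with a marker."""
    parts = []
    for i, text in enumerate(columns, start=1):
        # With markers, each field is announced by "xxfld <field number>".
        parts.append(f"{FLD} {i} {text}" if mark_fields else text)
    return " ".join(parts)

marked = join_fields(["A title", "The body of the text"])
plain = join_fields(["A title", "The body of the text"], mark_fields=False)
print(marked)
print(plain)
```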

In [16]:

show_doc(TextDataset.from_folder, doc_string=False)

`from_folder`[source]

from_folder(path:PathOrStr, classes:ArgStar=None, valid_pct:float=0.0, extensions:StrList=['.txt'], mark_fields:bool=True) → TextDataset

Create a TextDataset by scanning the subfolders in path for files with extensions. Only keep the ones with labels in classes if it's specified. If valid_pct is not 0., returns two datasets randomly split. mark_fields is passed to the initialization.

In [17]:

show_doc(TextDataset.from_one_folder, doc_string=False)

`from_one_folder`[source]

from_one_folder(path:PathOrStr, classes:ArgStar, extensions:StrList=['.txt'], mark_fields:bool=True) → TextDataset

Primarily used for the test set. Create a TextDataset by scanning the subfolders in path for files with extensions. Labels all of them with classes[0]. mark_fields is passed to the initialization.

In [18]:

show_doc(TextDataset.from_df)

`from_df`[source]

from_df(df:DataFrame, classes:ArgStar=None, n_labels:int=1, txt_cols:Collection[Union[int, str]]=None, label_cols:Collection[Union[int, str]]=None, mark_fields:bool=True) → TextDataset

Create a TextDataset from the texts in a dataframe

In [19]:

show_doc(TextDataset.tokenize)

`tokenize`[source]

tokenize(tokenizer:Tokenizer=None, chunksize:int=10000) → TokenizedDataset

Tokenize the texts with tokenizer by bits of chunksize.

In [20]:

show_doc(TokenizedDataset, doc_string=False, title_level=3)

`class` `TokenizedDataset`[source]

TokenizedDataset(tokens:Tokens, labels:Collection[Union[int, float]]=None, classes:ArgStar=None, encode_classes:bool=True) :: TextBase

Create a TextBase dataset of tokens with labels belonging to classes. If encode_classes the labels are changed from their class to the corresponding index.

In [21]:

show_doc(TokenizedDataset.save)

`save`[source]

save(path:Path, name:str)

Save the dataset in path with name.

In [22]:

show_doc(TokenizedDataset.numericalize)

`numericalize`[source]

numericalize(vocab:Vocab=None, max_vocab:int=60000, min_freq:int=2) → NumericalizedDataset

Numericalize the tokens with vocab (if not None) otherwise create one with max_vocab and min_freq from tokens.
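The following sketch shows one way a vocabulary limited by max_vocab and min_freq could be built and then used to turn tokens into ids. It is a simplified illustration under stated assumptions: the helper names are hypothetical, only an unknown token (`xxunk`, as seen in the batch example on this page) is reserved, and the positions of special tokens in the real library may differ.

```python
from collections import Counter

UNK = "xxunk"  # token used for out-of-vocabulary words

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Keep the `max_vocab` most common tokens that appear at least `min_freq` times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK] + [tok for tok, c in counts.most_common(max_vocab)
                    if c >= min_freq and tok != UNK]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokenized_texts, stoi):
    """Replace each token by its id, falling back to the id of UNK."""
    return [[stoi.get(tok, stoi[UNK]) for tok in text] for text in tokenized_texts]

tokens = [["the", "film", "was", "good"], ["the", "film", "was", "bad"]]
itos, stoi = build_vocab(tokens, max_vocab=10, min_freq=2)
ids = numericalize(tokens, stoi)
print(itos, ids)
```

Here "good" and "bad" each appear once, so with min_freq=2 they fall outside the vocabulary and map to the unknown id.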

In [23]:

show_doc(NumericalizedDataset, doc_string=False, title_level=3)

`class` `NumericalizedDataset`[source]

NumericalizedDataset(vocab:Vocab, ids:Collection[Collection[int]], labels:Collection[Union[int, float]]=None, classes:ArgStar=None, encode_classes:bool=True) :: TextBase

Create a TextBase dataset of ids with labels belonging to classes. vocab contains the correspondence between ids and tokens. If encode_classes the labels are changed from their class to the corresponding index.

In [24]:

show_doc(NumericalizedDataset.get_text_item)

`get_text_item`[source]

get_text_item(idx, sep=' ', max_len:int=None)

Return the text in idx, tokens separated by sep and cutting at max_len.
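As a rough sketch of what turning ids back into text involves, the helper below looks up each id in a vocabulary list, truncates, and joins with sep. The name `text_item` and the toy vocabulary are hypothetical, and treating max_len as a number of tokens (rather than characters) is an assumption.

```python
def text_item(ids, itos, sep=" ", max_len=None):
    """Convert ids to tokens, keep at most `max_len` tokens, and join them with `sep`."""
    tokens = [itos[i] for i in ids]
    if max_len is not None:
        tokens = tokens[:max_len]
    return sep.join(tokens)

itos = ["xxunk", "the", "film", "was", "good"]  # toy vocabulary for the example
full = text_item([1, 2, 3, 4], itos)
short = text_item([1, 2, 3, 4], itos, sep="_", max_len=2)
print(full)
print(short)
```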

In [25]:

show_doc(NumericalizedDataset.save)

`save`[source]

save(path:Path, name:str)

Save the dataset in path with name.

In [26]:

show_doc(NumericalizedDataset.load)

`load`[source]

load(path:Path, name:str)

Load a NumericalizedDataset from path in name.

Language Model data¶

A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into bs chunks of continuous texts. Note that in all NLP tasks, we use the PyTorch convention of sequence length being the first dimension (and batch size being the second one), so we transpose that array so that we can read the chunks of texts in columns. Here is an example of a batch from our IMDB sample dataset.

In [26]:

path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path, 'texts.csv')
x,y = next(iter(data.train_dl))
example = x[:20,:10].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts

Out[26]:

	0	1	2	3	4	5	6	7	8	9
0	xxfld	the	xxfld	what	this	i	his	"	)	out
1	1	first	2	makes	.	ever	work	entertainment	and	of
2	this	things	false	more	i	saw	.	"	the	their
3	is	i	xxfld	interesting	also	outside	jerry	.	next	xxup
4	a	noticed	1	hollywood	wish	of	van	10	he	dvd
5	very	was	ask	movies	they	star	xxunk	/	's	collection
6	old	,	yourself	,	'd	wars	's	10	trying	.
7	and	during	where	even	done	.	splendid	xxfld	to	this
8	cheaply	winston	she	today	some	since	score	2	beat	may
9	made	's	got	.	self	then	xxunk	false	up	give
10	film	day	the	p.s	-	i	as	xxfld	protée	you
11	--	to	gun	.	xxunk	have	the	1	!	an
12	a	day	?	i	humor	become	viewer	pixar	i	idea
13	typical	life	remember	spent	about	a	is	has	could	that
14	low	in	what	10	the	very	thrown	had	only	scarface
15	-	his	she	xxunk	changes	big	from	massive	guess	is
16	budget	work	was	of	-	ewan	one	success	as	a
17	b	,	taught	20	like	mcgregor	bizarre	over	to	"
18	-	his	about	)	on	fan	xxunk	the	what	gangster
19	western	conversations	the	and	"	but	to	years	motivated	movie

Then, as suggested in this article from Stephen Merity et al., we don't use a fixed bptt through the different batches but slightly change it from batch to batch.

In [27]:

iter_dl = iter(data.train_dl)
for _ in range(5):
    x,y = next(iter_dl)
    print(x.size())

torch.Size([68, 64])
torch.Size([64, 64])
torch.Size([57, 64])
torch.Size([76, 64])
torch.Size([70, 64])
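The varying sequence lengths printed above can be imitated with a short sampling routine. This is a sketch of the scheme as commonly described for Merity et al.'s language model training, not necessarily what the library does: the function name and the constants (probability 0.95 of using the full bptt, standard deviation 5, minimum length 5) are assumptions.

```python
import random

def sample_seq_len(bptt=70, p_full=0.95, std=5, min_len=5, rng=random):
    """Draw a sequence length close to `bptt`, occasionally close to `bptt / 2`."""
    # Most of the time use the full bptt as the mean, sometimes half of it.
    base = bptt if rng.random() < p_full else bptt / 2
    # Jitter around the mean, and never go below a small minimum length.
    return max(min_len, int(rng.normalvariate(base, std)))

rng = random.Random(0)  # seeded so the sketch is reproducible
lengths = [sample_seq_len(rng=rng) for _ in range(5)]
print(lengths)
```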

This is all done internally when we use TextLMDataBunch, by creating DataLoader using the following class:

In [27]:

show_doc(LanguageModelLoader, doc_string=False)

`class` `LanguageModelLoader`[source]

LanguageModelLoader(dataset:TextDataset, bs:int=64, bptt:int=70, backwards:bool=False, shuffle:bool=False)

Takes the texts from dataset and concatenates them all, then creates a big array with bs columns (transposed from the data source so that we read the texts in the columns). Yields batches with a size approximately equal to bptt but changing at every batch. If backwards is True, reverses the original text.

In [28]:

show_doc(LanguageModelLoader.batchify, doc_string=False)

`batchify`[source]

batchify(data:ndarray) → LongTensor

Called at the initialization to create the big array of text ids from the data array.
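The reshaping step can be sketched with plain Python lists: split the concatenated ids into bs contiguous chunks, then transpose so that each chunk is read down a column. This is an illustration only; the real method works on arrays and returns a LongTensor, and dropping the leftover tokens that do not fill a chunk is an assumption.

```python
def batchify(ids, bs):
    """Split a flat list of ids into `bs` contiguous chunks and transpose them."""
    n = len(ids) // bs  # tokens per chunk; leftover tokens are dropped
    chunks = [ids[i * n:(i + 1) * n] for i in range(bs)]
    # After the transpose, row t holds the t-th token of every chunk,
    # so each column is a continuous piece of text.
    return [list(row) for row in zip(*chunks)]

stream = list(range(10))  # stand-in for the concatenated text ids
data = batchify(stream, bs=3)
print(data)
```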

In [29]:

show_doc(LanguageModelLoader.get_batch)

`get_batch`[source]

get_batch(i:int, seq_len:int) → Tuple[LongTensor, LongTensor]

Create a batch at i of a given seq_len.
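Since the target of a language model is the next word, a batch can be sketched as a slice of the big array together with the same slice shifted by one position. The version below uses nested lists and is a simplified assumption about the behaviour; in particular, clipping seq_len near the end of the array and keeping the target unflattened are choices made for this sketch.

```python
def get_batch(data, i, seq_len):
    """Return inputs data[i:i+seq_len] and targets shifted by one time step."""
    # Leave room for the shifted targets at the end of the array.
    seq_len = min(seq_len, len(data) - 1 - i)
    x = data[i:i + seq_len]
    y = data[i + 1:i + 1 + seq_len]
    return x, y

# Rows are time steps, columns are the parallel chunks of text.
data = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
x, y = get_batch(data, 0, 2)
print(x)
print(y)
```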

Classifier data¶

When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:

padding: each text in a batch is padded with the PAD token so that they all have the same size
sorting the texts (ish): to avoid having together a very long text with a very short one (which would then have a lot of PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).

In [31]:

path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path, 'texts.csv')
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[:20,-10:]

Out[31]:

tensor([[   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [  20,   20,    1,    1,    1,    1,    1,    1,    1,    1],
        [  42,   42,   20,   20,    1,    1,    1,    1,    1,    1],
        [  70,   94,   42,   42,    1,    1,    1,    1,    1,    1],
        [  14, 1662,   53, 2822,   20,    1,    1,    1,    1,    1],
        [ 935, 2061,    9,    3,   42,    1,    1,    1,    1,    1],
        [ 101,  269,  199, 3848,   23,    1,    1,    1,    1,    1],
        [2911,  212,  907,    7,    6,   20,    1,    1,    1,    1]],
       device='cuda:0')

This is all done internally when we use TextClasDataBunch, by using the following classes:

In [30]:

show_doc(SortSampler, doc_string=False)

`class` `SortSampler`[source]

SortSampler(data_source:NPArrayList, key:KeyFunc) :: Sampler

pytorch Sampler to batchify the data_source by order of length of the texts. Used for the validation and (if applicable) the test set.
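A sampler of this kind essentially yields dataset indices sorted by a key. The sketch below is a plain-function illustration rather than a PyTorch Sampler subclass; ordering from longest to shortest is an assumption.

```python
def sort_sampler(data_source, key):
    """Yield the indices of `data_source`, ordered from largest to smallest key."""
    return iter(sorted(range(len(data_source)), key=key, reverse=True))

texts = [[5, 6], [7, 8, 9, 10], [11]]  # toy numericalized texts
order = list(sort_sampler(texts, key=lambda i: len(texts[i])))
print(order)
```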

In [31]:

show_doc(SortishSampler, doc_string=False)

`class` `SortishSampler`[source]

SortishSampler(data_source:NPArrayList, key:KeyFunc, bs:int) :: Sampler

pytorch Sampler to batchify with size bs the data_source by order of length of the texts with a bit of randomness. Used for the training set.
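One way to get "sorted with a bit of randomness" is to shuffle the indices, sort them by length inside large chunks, cut the result into batches and shuffle the batches. The sketch below follows that idea; the function name, the `chunk_mult` parameter and the exact steps are assumptions for illustration and may differ from the library's sampler.

```python
import random

def sortish_indices(lengths, bs, chunk_mult=50, seed=None):
    """Return indices grouped into batches of similar length, in a random batch order."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = bs * chunk_mult
    # Sort inside each large chunk so that neighbours have similar lengths.
    sorted_idxs = []
    for start in range(0, len(idxs), chunk):
        sorted_idxs += sorted(idxs[start:start + chunk],
                              key=lambda i: lengths[i], reverse=True)
    # Cut into batches and shuffle the batches, not their contents.
    batches = [sorted_idxs[s:s + bs] for s in range(0, len(sorted_idxs), bs)]
    rng.shuffle(batches)
    return [i for batch in batches for i in batch]

lengths = [3, 10, 7, 1, 8, 2, 9, 4]
order = sortish_indices(lengths, bs=2, chunk_mult=2, seed=0)
print(order)
```

Because every epoch starts from a fresh shuffle, the batches differ from one epoch to the next while each batch still holds texts of comparable length.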

In [32]:

show_doc(pad_collate, doc_string=False)

`pad_collate`[source]

pad_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True) → Tuple[LongTensor, LongTensor]

Function used by the pytorch DataLoader to collate the samples in batches while adding padding with pad_idx. If pad_first is True, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
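The padding logic can be sketched without tensors. In this simplified version each sample is an (ids, label) pair and the output keeps one row per text; the real function returns LongTensors, and the layout of its output (for instance sequence length as the first dimension) is not reproduced here.

```python
def pad_collate(samples, pad_idx=1, pad_first=True):
    """Pad every list of ids in `samples` to the length of the longest one."""
    max_len = max(len(ids) for ids, _ in samples)
    padded = []
    for ids, _ in samples:
        padding = [pad_idx] * (max_len - len(ids))
        # pad_first puts the padding before the text, otherwise after it.
        padded.append(padding + list(ids) if pad_first else list(ids) + padding)
    labels = [label for _, label in samples]
    return padded, labels

samples = [([20, 42, 70], 0), ([20, 42], 1)]
xb, yb = pad_collate(samples)
xb_end, _ = pad_collate(samples, pad_first=False)
print(xb, yb)
```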

Data block API¶

The data block API works for the text application too. Here are a few subclasses of the usual objects to implement the parts specific to the text application.

In [33]:

show_doc(TextFileList, doc_string=False, title_level=3)

`class` `TextFileList`[source]

TextFileList(items:Iterator, path:PathOrStr='.') :: InputList

This subclasses InputList just to change the default extensions in from_folder to text extensions.

In [34]:

show_doc(TextFileList.from_folder)

`from_folder`[source]

from_folder(path:PathOrStr='.', extensions:StrList=['.txt'], recurse=True) → ImageFileList

Get the list of files in path that have a suffix in extensions. recurse determines if we search subfolders.

In [35]:

show_doc(SplitDatasetsText, doc_string=False, title_level=3)

`class` `SplitDatasetsText`[source]

SplitDatasetsText(path:PathOrStr, train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None) :: SplitDatasets

A subclass of SplitDatasets that implements methods specific to texts.

In [36]:

show_doc(SplitDatasetsText.tokenize)

`tokenize`[source]

tokenize(tokenizer:Tokenizer=None, chunksize:int=10000)

Tokenize self.datasets with tokenizer by bits of chunksize.

In [37]:

show_doc(SplitDatasetsText.numericalize)

`numericalize`[source]

numericalize(vocab:Vocab=None, max_vocab:int=60000, min_freq:int=2)

Numericalize self.datasets with vocab or by creating one on the training set with max_vocab and min_freq.

In [38]:

show_doc(SplitDatasetsText.databunch)

`databunch`[source]

databunch(cls_func, path:PathOrStr=None, kwargs)

Create a cls_func from self. path will override self.path, and kwargs are passed to cls_func.create.

Enums¶

In [39]:

show_doc(TextMtd, alt_doc_string='`TextDataset` enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)')

`TextMtd`

Enum = [DF, TOK, IDS]

TextDataset enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)

Undocumented Methods - Methods moved below this line will intentionally be hidden¶

In [40]:

show_doc(TextLMDataBunch.create)

`create`[source]

create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', kwargs) → DataBunch

Create a TextDataBunch in path from the datasets for language modelling.

In [41]:

show_doc(TextClasDataBunch.create)

`create`[source]

create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs=64, pad_idx=1, pad_first=True, kwargs) → DataBunch

Function that transforms the datasets into a DataBunch for classification.
