from fastai.gen_doc.nbdoc import *
from fastai.text import *
from fastai import *
This module contains the TextDataset class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform. It also contains all the functions to quickly get a TextDataBunch ready.
You should get your data in one of the following formats to make the most of the fastai library and use one of the factory methods of one of the TextDataBunch classes:

- raw text files organized in train, valid and maybe test folders, with one subfolder per class (see from_folder),
- csv files named train.csv, valid.csv and maybe test.csv, with the label(s) in the first column(s) and the text after (see from_csv),
- dataframes structured the same way (see from_df),
- tokens that have already been computed (see from_tokens),
- ids that have already been numericalized (see from_ids and from_id_files).
If you are assembling the data for a language model, you should set all your labels to 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically and saved, so that the next time you call it, the creation is almost instantaneous.
Below are the classes that help assemble the raw data into a DataBunch suitable for NLP.
show_doc(TextLMDataBunch, title_level=3, doc_string=False)
class TextLMDataBunch[source]
TextLMDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch
Create a DataBunch suitable for a language model: all the texts are concatenated together and the batches are made of contiguous chunks of text (see the LM data section below).
show_doc(TextClasDataBunch, title_level=3, doc_string=False)
class TextClasDataBunch[source]
TextClasDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch
Create a DataBunch suitable for a text classifier: all the texts are grouped by length (with a bit of randomness for the training set) then padded.
show_doc(TextDataBunch, title_level=3, doc_string=False)
class TextDataBunch[source]
TextDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: DataBunch
Create a DataBunch with the raw texts. This is only going to work if they all have the same lengths.
All those classes have the following factory methods.
show_doc(TextDataBunch.from_folder, doc_string=False)
This function will create a DataBunch from texts placed in path in train, valid and maybe test folders. Text files in the train and valid folders should be placed in subdirectories according to their classes (always the same class for a language model) and the ones for the test folder should all be placed there directly. tokenizer will be used to parse those texts into tokens. The shuffle flag will optionally shuffle the texts found.
You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
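For instance, a minimal sketch of that factory method, assuming a hypothetical ImageNet-style layout with neg and pos subfolders (not the IMDB sample used below):
# Hypothetical layout:
#   path/train/neg/*.txt, path/train/pos/*.txt
#   path/valid/neg/*.txt, path/valid/pos/*.txt
data = TextClasDataBunch.from_folder(path, train='train', valid='valid', classes=['neg', 'pos'])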
show_doc(TextDataBunch.from_csv, doc_string=False)
This function will create a DataBunch from texts placed in path in a train.csv, valid.csv and maybe test.csv files. These csv files should have no header or index, and the label(s) should be the first column(s) (be sure to adjust the parameter n_labels if you have more than one). tokenizer will be used to parse those texts into tokens.
You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
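As a sketch, here is how those kwargs might be passed (the values are only illustrative):
# max_vocab and min_freq go to the TextDataset, bs and bptt to the DataBunch
data_lm = TextLMDataBunch.from_csv(path, max_vocab=30000, min_freq=2, bs=32, bptt=70)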
show_doc(TextDataBunch.from_tokens, doc_string=False)
This function will create a DataBunch from texts that have already been tokenized, stored in path in files named f{train}{tok_suff}.npy, f{train}{lbl_suff}.npy, f{valid}{tok_suff}.npy, f{valid}{lbl_suff}.npy and maybe f{test}{tok_suff}.npy. If no label file exists, labels will default to all zeros. tok_suff and lbl_suff are '_tok' and '_lbl' respectively.
You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
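For instance, a hedged sketch assuming the token and label files have been saved beforehand:
# Assumes path contains train_tok.npy, train_lbl.npy, valid_tok.npy and
# valid_lbl.npy (e.g. saved earlier with np.save)
data = TextClasDataBunch.from_tokens(path)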
show_doc(TextDataBunch.from_id_files, doc_string=False)
This function will create a DataBunch from texts that have already been numericalized, stored in path in files named f{train}{id_suff}.npy, f{train}{lbl_suff}.npy, f{valid}{id_suff}.npy, f{valid}{lbl_suff}.npy and maybe f{test}{id_suff}.npy. If no label file exists, labels will default to all zeros. id_suff and lbl_suff are '_ids' and '_lbl' respectively. The itos file should contain the correspondence from ids to words.
kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels, id_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
show_doc(TextDataBunch.from_ids, doc_string=False)
This function will create a DataBunch in path from texts already processed into trn_ids, trn_lbls, val_ids, val_lbls and maybe tst_ids. You can specify the corresponding classes if applicable. You must specify the vocab_size so that the RNNLearner class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.
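A minimal sketch with toy arrays (the values are purely illustrative):
import numpy as np
# Two tiny 'documents' already numericalized, with binary labels
trn_ids, trn_lbls = [np.array([2, 5, 8]), np.array([2, 9, 4])], [0, 1]
val_ids, val_lbls = [np.array([2, 7, 3])], [1]
data = TextClasDataBunch.from_ids(path, trn_ids=trn_ids, trn_lbls=trn_lbls, val_ids=val_ids, val_lbls=val_lbls, vocab_size=10)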
Untar the IMDB sample dataset if not already done:
path = untar_data(URLs.IMDB_SAMPLE)
path
PosixPath('/home/ubuntu/.fastai/data/imdb_sample')
Since it comes in the form of csv files, we will use the corresponding from_csv factory method. Here is an overview of what your file should look like:
pd.read_csv(path/'train.csv', header=None).head()
|   | 0 | 1 |
|---|---|---|
| 0 | 0 | Un-bleeping-believable! Meg Ryan doesn't even ... |
| 1 | 1 | This is a extremely well-made film. The acting... |
| 2 | 0 | Every once in a long while a movie will come a... |
| 3 | 1 | Name just says it all. I watched this movie wi... |
| 4 | 0 | This movie succeeds at being one of the most u... |
And here is a simple way of creating your DataBunch for language modelling or classification.
data_lm = TextLMDataBunch.from_csv(path)
data_clas = TextClasDataBunch.from_csv(path)
Behind the scenes, the previous functions will create a training, validation and maybe a test TextDataset, the class responsible for collecting and preprocessing the data.
show_doc(TextDataset, doc_string=False)
class TextDataset[source]
TextDataset(path:PathOrStr, tokenizer:Tokenizer=None, vocab:Vocab=None, max_vocab:int=60000, chunksize:int=10000, name:str='train', df=None, min_freq:int=2, n_labels:int=1, txt_cols=None, label_cols=None, create_mtd:TextMtd=<TextMtd.DF: 1>, classes:ArgStar=None, clear_cache:bool=False) :: BaseTextDataset
This class shouldn't be initialized directly as it relies on internal files being put in a 'tmp' folder of path. tokenizer and vocab will be used to tokenize and numericalize the texts (if needed). max_vocab and min_freq are passed at the creation of the vocabulary (if needed). chunksize is the size of the chunks preprocessed when loading the data from csv or folders. name is the name of the set that will be used to name the temporary files. n_labels is the number of labels if creating the data from a csv file. classes is the correspondence between labels and classes. create_mtd is an internal flag that tells the TextDataset how it was created. It can be:
- DF if it was created from texts (a csv file or a dataframe),
- TOK if it was created from tokens (which means the TextDataset will always skip the tokenization),
- IDS if it was created from ids (which means the TextDataset will always skip the tokenization and the numericalization).

Instead of using the TextDataset init method, one of the following factory methods should be used:
show_doc(TextDataset.from_folder, doc_string=False)
Creates a TextDataset named name by scanning the subfolders in folder and using tokenizer. If classes are passed, only the subfolders named accordingly are checked. If shuffle is True, the data will be shuffled. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_one_folder, doc_string=False)
Creates a TextDataset named name by scanning the text files in folder and using tokenizer. All files are labelled classes[0] so this is typically used for the test set. If shuffle is True, the data will be shuffled. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_df)
show_doc(TextDataset.from_tokens, doc_string=False)
from_tokens[source]
from_tokens(folder:PathOrStr, name:str='train', tok_suff:str='_tok', lbl_suff:str='_lbl', kwargs) → TextDataset
Creates a TextDataset named name from tokens and labels saved in f{name}{tok_suff}.npy and f{name}{lbl_suff}.npy respectively. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_ids, doc_string=False)
from_ids[source]
from_ids(folder:PathOrStr, name:str='train', id_suff:str='_ids', lbl_suff:str='_lbl', itos:str='itos.pkl', kwargs) → TextDataset
Creates a TextDataset named name from ids, labels and dictionary saved in f{name}{id_suff}.npy, f{name}{lbl_suff}.npy and itos respectively. Any additional kwargs are passed to the init method of TextDataset.
The internal preprocessing is done by the two following methods:
show_doc(TextDataset.tokenize)
show_doc(TextDataset.numericalize)
Internally, the TextDataset will create a 'tmp' folder in which it will copy or save the following files:
- name.csv (if created from folders or a csv),
- name_tok.npy and name_lbl.npy (created by TextDataset.tokenize from the previous step, or copied if created from tokens),
- name_ids.npy, name_lbl.npy and itos (created by TextDataset.numericalize from the previous step, or copied if created from ids).

Then, when you invoke the TextDataset again, it will look for those temporary files and check their consistency before using them, in order to avoid redoing the tokenization or the numericalization. If you feel those files have been corrupted in any way, the following method will clear the 'tmp' subfolder of those files:
show_doc(TextDataset.clear)
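For instance, if the cached files seem stale, a hedged sketch using the DataBunch created above:
# Force re-tokenization and re-numericalization on the next run
data_lm.train_ds.clear()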
show_doc(TextDataset.check_ids)
show_doc(TextDataset.check_toks)
show_doc(TextDataset.general_check)
general_check[source]
general_check(pre_files:Collection[PathOrStr],post_files:Collection[PathOrStr])
Check that post_files exist and were modified after all the pre_files.
show_doc(BaseTextDataset)
class BaseTextDataset[source]
BaseTextDataset(ids:Collection[Collection[int]], labels:Collection[Union[int,float]], vocab_size:int, classes:ArgStar=None)
To directly create a text dataset from ids and labels.
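A minimal sketch of that direct construction, with purely illustrative values:
import numpy as np
# Two toy documents already numericalized, with binary labels
ids, labels = [np.array([2, 5, 8, 3]), np.array([2, 9, 4])], [0, 1]
ds = BaseTextDataset(ids, labels, vocab_size=10, classes=['neg', 'pos'])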
A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into bs chunks of contiguous texts. Note that in all NLP tasks, we use the pytorch convention of sequence length being the first dimension (and batch size being the second one), so we transpose that array so that we can read the chunks of texts in columns. Here is an example of a batch from our IMDB sample dataset.
path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path)
x,y = next(iter(data.train_dl))
example = x[:20,:10].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xxfld | protagonist | xxunk | into | occasionally | start | humor | his | the | xxunk |
| 1 | 1 | is | for | this | xxunk | planning | is | revenge | box | in |
| 2 | un | xxunk | a | film | in | and | the | . | office | my |
| 3 | - | her | massive | , | other | not | biggest | still | , | xxunk |
| 4 | xxunk | early | series | although | versions | filming | problem | alive | xxunk | . |
| 5 | - | life | of | having | of | until | with | , | b. | first |
| 6 | believable | as | gags | the | the | everything | the | it | demille | , |
| 7 | ! | a | built | main | story | has | film | looks | stopped | the |
| 8 | meg | butcher | upon | character | . | come | . | like | doing | xxunk |
| 9 | ryan | . | gags | a | wells | down | sure | carradine | films | scene |
| 10 | does | weird | , | drunk | ' | on | , | tries | about | between |
| 11 | n't | stuff | but | and | description | a | making | to | non | the |
| 12 | even | . | stops | a | of | storyboard | fun | shoot | - | women |
| 13 | look | then | short | heroine | the | . | of | her | american | at |
| 14 | her | there | ( | addict | martians | you | mentally | and | history | the |
| 15 | usual | 's | for | did | | certainly | ill | misses | . | xxunk |
| 16 | xxunk | the | all | n't | a | have | people | , | his | xxunk |
| 17 | lovable | core | the | come | giant | the | is | but | films | -- |
| 18 | self | premise | xxunk | as | head | ability | pretty | it | for | undertext |
| 19 | in | of | ) | an | xxunk | and | low | does | the | : |
Then, as suggested in this article by Stephen Merity et al., we don't use a fixed bptt throughout the different batches but slightly change it from batch to batch.
iter_dl = iter(data.train_dl)
for _ in range(5):
x,y = next(iter_dl)
print(x.size())
torch.Size([81, 64])
torch.Size([66, 64])
torch.Size([27, 64])
torch.Size([69, 64])
torch.Size([67, 64])
This is all done internally when we use TextLMDataBunch, by creating each DataLoader with the following class:
show_doc(LanguageModelLoader, doc_string=False)
class LanguageModelLoader[source]
LanguageModelLoader(dataset:TextDataset, bs:int=64, bptt:int=70, backwards:bool=False)
Takes the texts from dataset and concatenates them all, then creates a big array with bs columns (transposed from the data source so that we read the texts in the columns). Spits out batches with a sequence length approximately equal to bptt but changing from batch to batch. If backwards is True, reverses the original text.
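A hedged sketch of using it directly (TextLMDataBunch normally builds this for you):
# Wrap the training TextDataset in a language model loader
lm_loader = LanguageModelLoader(data.train_ds, bs=32, bptt=70)
x, y = next(iter(lm_loader))  # x has shape (seq_len, bs); y holds the next-word targets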
show_doc(LanguageModelLoader.batchify, doc_string=False)
batchify[source]
batchify(data:ndarray) →LongTensor
Called at initialization to create the big array of text ids from the data array.
show_doc(LanguageModelLoader.get_batch)
get_batch[source]
get_batch(i:int,seq_len:int) →Tuple[LongTensor,LongTensor]
Create a batch at i of a given seq_len.
When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:
- padding: the shorter texts are padded with a PAD token to get all the ones we picked to the same size;
- sorting the texts (ish): to avoid grouping a very long text with a very short one (which would require lots of PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).
path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path)
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[:20,-10:]
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
This is all done internally when we use TextClasDataBunch, by using the following classes:
show_doc(SortSampler, doc_string=False)
pytorch Sampler that goes through data_source by order of length of the texts. Used for the validation and (if applicable) the test set.
show_doc(SortishSampler, doc_string=False)
pytorch Sampler that goes through data_source by order of length of the texts (in chunks of size bs) with a bit of randomness. Used for the training set.
show_doc(pad_collate, doc_string=False)
pad_collate[source]
pad_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True) → Tuple[LongTensor, LongTensor]
Function used by the pytorch DataLoader to collate the samples in batches while adding padding with pad_idx. If pad_first is True, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
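As a hedged sketch, this is roughly how TextClasDataBunch wires those pieces together (assuming the dataset exposes its ids):
from torch.utils.data import DataLoader

train_ds = data.train_ds
# Sort-ish by length for training; pad_collate adds the padding
sampler = SortishSampler(train_ds.ids, key=lambda i: len(train_ds.ids[i]), bs=64)
train_dl = DataLoader(train_ds, batch_size=64, sampler=sampler, collate_fn=pad_collate)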
show_doc(TextMtd, alt_doc_string='`TextDataset` enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)')
Enum = [DF, TOK, IDS]
TextDataset enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)
show_doc(read_classes)
read_classes[source]
read_classes(fname)
Read the classes from fname.
show_doc(TextLMDataBunch.create)
create[source]
create(datasets:Collection[TextDataset], path:PathOrStr, kwargs) → DataBunch
Create a TextDataBunch in path from the datasets for language modelling.
show_doc(TextClasDataBunch.create)
create[source]
create(datasets:Collection[TextDataset], path:PathOrStr, bs=64, pad_idx=1, pad_first=True, kwargs) → DataBunch
Function that transforms the datasets into a DataBunch for classification.
show_doc(TextDataBunch.from_df)
from_df[source]
from_df(path:PathOrStr, train_df:Union[DataFrame,TextFileReader], valid_df:Union[DataFrame,TextFileReader], test_df:Union[DataFrame,TextFileReader,NoneType]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, kwargs) → DataBunch
Create a TextDataBunch from DataFrames.