%matplotlib inline
from fastai2.text.all import *
from nbdev.showdoc import *
First let's download the dataset we are going to study. The dataset has been curated by Andrew Maas et al. and contains a total of 100,000 reviews on IMDB: 25,000 of them are labelled as positive or negative for training, another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled data (but we will find a use for them nonetheless).
We'll begin with a sample we've prepared for you, so that things run quickly before going over the full dataset.
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()
(#8) [Path('/home/sgugger/.fastai/data/imdb_sample/texts.csv'),Path('/home/sgugger/.fastai/data/imdb_sample/export_lm.pkl'),Path('/home/sgugger/.fastai/data/imdb_sample/models'),Path('/home/sgugger/.fastai/data/imdb_sample/data_save.pkl'),Path('/home/sgugger/.fastai/data/imdb_sample/export.pkl'),Path('/home/sgugger/.fastai/data/imdb_sample/export_clas.pkl'),Path('/home/sgugger/.fastai/data/imdb_sample/data_clas_export.pkl'),Path('/home/sgugger/.fastai/data/imdb_sample/data_lm_export.pkl')]
It contains only one csv file; let's have a look at it.
df = pd.read_csv(path/'texts.csv')
df.head()
| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
| 1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False |
| 2 | negative | Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... | False |
| 3 | positive | Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... | False |
| 4 | negative | This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... | False |
df['text'][1]
'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen. But it is never painted as a black-and-white case. There is baseness and nobility on both sides, and also the hope for change in the younger generation.<br /><br />There is redemption of a sort, in the end, when Puro has to make a hard choice between a man who has ruined her life, but also truly loved her, and her family which has disowned her, then later come looking for her. But by that point, she has no option that is without great pain for her.<br /><br />This film carries the message that both Muslims and Hindus have their grave faults, and also that both can be dignified and caring people. The reality of partition makes that realisation all the more wrenching, since there can never be real reconciliation across the India/Pakistan border. In that sense, it is similar to "Mr & Mrs Iyer".<br /><br />In the end, we were glad to have seen the film, even though the resolution was heartbreaking. If the UK and US could deal with their own histories of racism with this kind of frankness, they would certainly be better off.'
It contains one line per review, with the label ('negative' or 'positive'), the text, and a flag indicating whether the review should be part of the validation set or the training set.
First, we need to tokenize the texts in our dataframe, which means separating the sentences into individual tokens (often words).
The easiest way to do this would be to split the strings on spaces, but fastai's tokenizer is smarter than that.
The texts displayed in this notebook are truncated for readability. We can see that tokenization does more than just split on spaces and punctuation symbols, as we'll see below.
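For a quick look (this cell is a sketch added for this writeup, not part of the original notebook; the names spacy_tok, tkn and toks are ours, and it assumes the SpacyTokenizer and Tokenizer classes exported by fastai2.text.all), we can run fastai's default tokenization pipeline on a single review:
# Rough sketch: tokenize one review with fastai's default rules.
txt = df['text'][0]
spacy_tok = SpacyTokenizer()   # word-level tokenizer backed by spaCy
tkn = Tokenizer(spacy_tok)     # wraps it with fastai's default rules (xxbos, xxmaj, xxup, ...)
toks = tkn(txt)
print(coll_repr(toks, 40))     # show the first 40 tokens
Tokens such as xxbos (beginning of a text) and xxmaj (the next word was capitalized) come from those default rules, which is why you'll see them throughout the batches displayed below.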
Once we have extracted tokens from our texts, we convert them to integers by building a list of all the words used (the vocabulary). By default we only keep the words that appear at least twice, with a maximum vocabulary size of 60,000, and we replace the ones that don't make the cut with the unknown token xxunk.
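Continuing the sketch above (again an illustration we add, assuming fastai2's Numericalize transform; num and ids are names we introduce), the token-to-integer mapping could be built like this:
# Rough sketch: build a vocabulary and numericalize the tokens from above.
num = Numericalize(min_freq=2, max_vocab=60000)   # thresholds mentioned in the text above
num.setup([tkn(t) for t in df['text']])           # build the vocab from all tokenized reviews (takes a moment)
ids = num(toks)                                   # map each token to its integer id
print(ids[:20])
print(num.vocab[:20])                             # words below the frequency cut fall back to xxunk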
This is all done automatically behind the scenes if we use a factory method of TextDataLoaders.
dbunch_lm = TextDataLoaders.from_df(df, text_col='text', label_col='label', path=path, is_lm=True, valid_col='is_valid')
And if we look at what's in our datasets, we'll see the numericalized text as the representation:
dbunch_lm.train_ds[0]
(tensor([ 2, 8, 21, 29, 2190, 47, 138, 44, 14, 9, 111, 908,
126, 33, 178, 136, 11, 8, 237, 21, 269, 51, 9, 196,
20, 33, 206, 37, 114, 1488, 57, 0, 14, 21, 792, 11,
8, 17, 222, 201, 27, 13, 219, 14, 4512, 282, 71, 15,
3306, 812, 57, 39, 33, 38, 1065, 15, 1173, 60, 14, 9,
29, 12, 248, 71, 9, 377, 33, 58, 1578, 11, 8, 59,
33, 61, 37, 197, 15, 252, 0, 33, 227, 793, 155, 21,
587, 32, 12, 2042, 13, 175, 288, 14, 1655, 28, 9, 177,
691, 10, 45, 178, 46, 136, 139, 482, 10, 31, 112, 33,
1023, 45, 174, 211, 1578, 186, 13, 224, 14, 377, 15, 61,
21, 12, 33, 227, 1355, 1655, 28, 118, 58, 50, 33, 178,
378, 257, 28, 9, 32, 11, 19, 206, 37, 158, 35, 208,
56, 21, 587, 29, 10, 110, 222, 156, 20, 8, 1774, 8,
2043, 74, 42, 116, 12, 18, 20, 376, 17, 1356, 74, 42,
107, 41, 1174, 11, 26, 8, 15, 345, 33, 38, 746, 15,
793, 155, 49, 813, 493, 9, 1175, 14, 86, 33, 178, 123,
136, 120, 535, 10, 21, 173, 66, 211, 15, 43, 483, 253,
255, 11, 8, 35, 349, 14, 20, 17, 16, 3824, 10, 27,
15, 1579, 20, 33, 9, 588, 83, 39, 301, 11, 8, 112,
91, 1024, 8, 508, 11, 8, 0, 12, 9, 0, 14, 389,
449, 235, 0, 10, 17, 16, 27, 59, 40, 16, 5470, 2929,
15, 100, 449, 35, 15, 9, 29, 2930, 5471, 501, 11, 8,
624, 91, 1024, 8, 1489, 8, 0, 10, 53, 19, 3825, 423,
34, 12, 40, 95, 41, 147, 39, 15, 5472, 18, 21, 31,
1416, 40, 95, 11, 8, 112, 269, 9, 2191, 2044, 10, 8,
3826, 8, 5473, 509, 49, 1176, 260, 10, 432, 385, 213, 55,
8, 3826, 8, 5473, 16, 39, 13, 70, 1289, 710, 12, 856,
73, 0, 235, 570, 18, 126, 10, 18, 82, 589, 165, 73,
149, 909, 15, 126, 14, 9, 8, 294, 3827, 674, 11, 26,
8, 46, 18, 46, 33, 58, 223, 197, 15, 88, 21, 28,
613, 72, 17, 269, 60, 35, 348, 10, 19, 144, 20, 19,
211, 13, 2405, 60, 14, 17, 10, 19, 473, 1656, 46, 43,
1290, 142, 10, 515, 110, 5474, 15, 0, 18, 9, 4513, 14,
350, 11]),)
The correspondence between tokens and ids is stored in the vocab attribute of our DataLoaders:
dbunch_lm.vocab[:20]
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i']
dbunch_lm.show_batch()
| | text | text_ |
|---|---|---|
| 0 | xxbos … said a couple xxunk the movie theater just as i was entering to watch this . xxmaj hmm , not a good sign , but who knows ? xxmaj different xxunk for different folks , after all . xxmaj well , nope . xxmaj they were being kind . xxmaj godard has released work that is passionate ( contempt ) , entertaining ( band of xxmaj outsiders ) , sometimes | … said a couple xxunk the movie theater just as i was entering to watch this . xxmaj hmm , not a good sign , but who knows ? xxmaj different xxunk for different folks , after all . xxmaj well , nope . xxmaj they were being kind . xxmaj godard has released work that is passionate ( contempt ) , entertaining ( band of xxmaj outsiders ) , sometimes both |
| 1 | the lake to xxunk ? xxmaj other than some token fog in one or two scenes , we see no evidence of the water being hot , other than a few lines in the script . \n\n xxmaj the script is xxunk rather obviously in a few sequences , and it will do anything to get the characters near the lake so that they can be xxunk by the claymation xxunk . | lake to xxunk ? xxmaj other than some token fog in one or two scenes , we see no evidence of the water being hot , other than a few lines in the script . \n\n xxmaj the script is xxunk rather obviously in a few sequences , and it will do anything to get the characters near the lake so that they can be xxunk by the claymation xxunk . a |
| 2 | . 's intelligence . xxup t.k . is torn between his own xxunk and his xxunk background . xxmaj sajani is a woman xxunk of having a choice in her romantic life . xxmaj oh , and , of course , xxmaj moores ' family is xxunk into xxmaj sajani 's death but still slightly racist to xxmaj indians . xxmaj if the tone was n't so serious , i would be | 's intelligence . xxup t.k . is torn between his own xxunk and his xxunk background . xxmaj sajani is a woman xxunk of having a choice in her romantic life . xxmaj oh , and , of course , xxmaj moores ' family is xxunk into xxmaj sajani 's death but still slightly racist to xxmaj indians . xxmaj if the tone was n't so serious , i would be willing |
| 3 | nowadays . xxmaj it 's like all you need is a camera , a group of people to be your cast and crew , a script , and a little money and xxunk you have a movie . xxmaj problem is that talent is n't always part of this xxunk and often times these kind of low budget films turn out to be duds . xxmaj the video store xxunk are filled | . xxmaj it 's like all you need is a camera , a group of people to be your cast and crew , a script , and a little money and xxunk you have a movie . xxmaj problem is that talent is n't always part of this xxunk and often times these kind of low budget films turn out to be duds . xxmaj the video store xxunk are filled with |
| 4 | for xxmaj major xxmaj payne even though he xxunk the kids poorly , has no social skills and is simply impossible to xxunk into someone you would want to spend time with . xxmaj she must either be incredibly stupid or incredibly desperate . xxmaj i 'm not sure which ( though it would seem " stupid " since the movie makes it clear she gets out of the house often enough | xxmaj major xxmaj payne even though he xxunk the kids poorly , has no social skills and is simply impossible to xxunk into someone you would want to spend time with . xxmaj she must either be incredibly stupid or incredibly desperate . xxmaj i 'm not sure which ( though it would seem " stupid " since the movie makes it clear she gets out of the house often enough ) |
| 5 | xxmaj boeing . " xxmaj the original author of that work , xxmaj marc xxmaj xxunk , is credited nowhere . xxmaj at least xxmaj priyadarshan changed the title for this remake , rather than xxunk using the original without giving credit , as he did in his xxunk version of this same tale . ( according to imdb 's credits list . ) xxbos xxmaj what i think xxmaj i 'll | boeing . " xxmaj the original author of that work , xxmaj marc xxmaj xxunk , is credited nowhere . xxmaj at least xxmaj priyadarshan changed the title for this remake , rather than xxunk using the original without giving credit , as he did in his xxunk version of this same tale . ( according to imdb 's credits list . ) xxbos xxmaj what i think xxmaj i 'll probably |
| 6 | relationship between xxmaj frankie and her grandmother is convincing , but the relationship between xxmaj hazel and xxmaj frankie is a bit … off . xxmaj it 's interesting to see how she has to work hard to keep a balance between her best friend , her grandmother , and her two xxunk : ballet and baseball . xxmaj being a baseball player myself , it was quite painful to watch xxmaj | between xxmaj frankie and her grandmother is convincing , but the relationship between xxmaj hazel and xxmaj frankie is a bit … off . xxmaj it 's interesting to see how she has to work hard to keep a balance between her best friend , her grandmother , and her two xxunk : ballet and baseball . xxmaj being a baseball player myself , it was quite painful to watch xxmaj frankie |
| 7 | the genre with open arms . \n\n xxmaj xxunk is xxunk as the menacing , damn sexy , but vicious and mean bitch who xxunk out an entire police force and poor xxmaj tanya 's parents in one fail xxunk , in less than ten or so minutes . xxmaj she stabs one in the back with a xxunk xxunk ! xxmaj she bites the fingers off of poor xxmaj ron xxmaj | genre with open arms . \n\n xxmaj xxunk is xxunk as the menacing , damn sexy , but vicious and mean bitch who xxunk out an entire police force and poor xxmaj tanya 's parents in one fail xxunk , in less than ten or so minutes . xxmaj she stabs one in the back with a xxunk xxunk ! xxmaj she bites the fingers off of poor xxmaj ron xxmaj xxunk |
| 8 | one of them ( " xxunk : xxmaj fighting xxmaj edition " ) , which was xxup really sad . xxmaj anyway , it was the show that set a stepping stone on my interest on robot series ( especially anime xxunk series like " gundam " ) xxmaj now that xxmaj i 'm 18 , xxmaj i 'd like to think this show is pretty cheesy to me now . xxmaj | of them ( " xxunk : xxmaj fighting xxmaj edition " ) , which was xxup really sad . xxmaj anyway , it was the show that set a stepping stone on my interest on robot series ( especially anime xxunk series like " gundam " ) xxmaj now that xxmaj i 'm 18 , xxmaj i 'd like to think this show is pretty cheesy to me now . xxmaj to |
We can use the data block API with NLP to get a lot more flexibility than the default factory methods offer. In the following example, for instance, the data is randomly split between train and validation instead of reading the third column of the csv.
With the data block API, we specify the tokenization and numericalization steps ourselves. This allows more flexibility, and if you're not using the defaults from fastai, the various arguments to pass will appear in the step where they're relevant, which makes the code more readable.
imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
get_x=ColReader('text'),
splitter=RandomSplitter())
dbunch_lm = imdb_lm.dataloaders(df)
Note that language models can use a lot of GPU memory, so you may need to decrease the batch size here.
bs=128
Now let's grab the full dataset for what follows.
path = untar_data(URLs.IMDB)
path.ls()
(#7) [Path('/home/sgugger/.fastai/data/imdb/unsup'),Path('/home/sgugger/.fastai/data/imdb/imdb.vocab'),Path('/home/sgugger/.fastai/data/imdb/tmp_lm'),Path('/home/sgugger/.fastai/data/imdb/train'),Path('/home/sgugger/.fastai/data/imdb/test'),Path('/home/sgugger/.fastai/data/imdb/README'),Path('/home/sgugger/.fastai/data/imdb/tmp_clas')]
(path/'train').ls()
(#4) [Path('/home/sgugger/.fastai/data/imdb/train/pos'),Path('/home/sgugger/.fastai/data/imdb/train/unsupBow.feat'),Path('/home/sgugger/.fastai/data/imdb/train/labeledBow.feat'),Path('/home/sgugger/.fastai/data/imdb/train/neg')]
The reviews are in a training and a test set following an ImageNet-style folder structure. The only difference is that there is an unsup folder on top of train and test, which contains the unlabelled data.
We're not going to train a model that classifies the reviews from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of Wikipedia called wikitext-103). That model was trained to guess the next word, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word, so this hidden state contains information about the sentence up to that point.
We are going to use that 'knowledge' of the English language to build our classifier, but first, as in computer vision, we need to fine-tune the pretrained model on our particular dataset. Because the English of the reviews left by people on IMDB isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model a little. Plus, there might be some words that are extremely common in the reviews dataset but barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.
This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (the next line takes a few minutes the first time you run it).
imdb_lm = DataBlock(blocks=(TextBlock.from_folder(path, is_lm=True),),
get_items=partial(get_text_files, folders=['train', 'test', 'unsup']),
splitter=RandomSplitter(0.1))
dbunch_lm = imdb_lm.dataloaders(path, path=path, bs=bs, seq_len=80)
For the language model, fastai uses a special kind of DataLoaders that ignores the labels, shuffles the texts at each epoch before concatenating them all together (for the training set only; the validation set isn't shuffled), and sends batches that read that text in order, with targets that are the next word in the sentence.
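As a quick sanity check (a sketch we add here, not part of the original notebook; it relies on the one_batch method of the training DataLoader, and xb, yb are names we introduce), we can verify that the targets are the inputs shifted by one token:
# Rough sketch: grab one batch and check that y is x shifted by one word.
xb, yb = dbunch_lm.train.one_batch()
print(xb.shape, yb.shape)                   # both (bs, seq_len)
print((xb[:, 1:] == yb[:, :-1]).all())      # each target is the next input token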
Building these DataLoaders takes a while the first time you run the cell above, so it is worth saving them to disk so that you can reload them quickly later, for instance as sketched below.
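One simple option (a sketch under the assumption that the DataLoaders object pickles cleanly; the file name dbunch_lm.pkl is arbitrary) is to serialize it with torch:
# Rough sketch: save the processed DataLoaders so they can be reloaded quickly.
torch.save(dbunch_lm, path/'dbunch_lm.pkl')
# later, instead of rebuilding from the DataBlock:
# dbunch_lm = torch.load(path/'dbunch_lm.pkl')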
dbunch_lm.show_batch()
| | text | text_ |
|---|---|---|
| 0 | xxunk there xxunk xxunk enjoying this weird and weirdly detailed movie with weird animation and a weird mix of styles xxunk real faces photographed for the xxunk and the credits start rolling when xxunk xxunk xxunk a xxunk is this music xxunk xxunk xxunk long xxunk xxunk xxunk xxunk xxunk it is xxunk xxunk xxunk music isn't what makes a xxunk even a short animation like this xxunk but anyone who knows who xxunk xxunk is xxunk many should considering | there xxunk xxunk enjoying this weird and weirdly detailed movie with weird animation and a weird mix of styles xxunk real faces photographed for the xxunk and the credits start rolling when xxunk xxunk xxunk a xxunk is this music xxunk xxunk xxunk long xxunk xxunk xxunk xxunk xxunk it is xxunk xxunk xxunk music isn't what makes a xxunk even a short animation like this xxunk but anyone who knows who xxunk xxunk is xxunk many should considering the |
| 1 | not one for educational xxunk but this one grabs hold of you and doesn't let go until the xxunk xxunk be so hooked and entranced by what you are watching that you'll forget your at home watching xxunk xxunk series is available to buy on xxunk and xxunk xxunk recommend picking this one xxunk xxunk all the evil and death in this xxunk this documentary series gives us proof that life is beautiful and worth saving and xxunk xxunk main | one for educational xxunk but this one grabs hold of you and doesn't let go until the xxunk xxunk be so hooked and entranced by what you are watching that you'll forget your at home watching xxunk xxunk series is available to buy on xxunk and xxunk xxunk recommend picking this one xxunk xxunk all the evil and death in this xxunk this documentary series gives us proof that life is beautiful and worth saving and xxunk xxunk main attraction |
| 2 | the xxunk in new xxunk xxunk movie is no xxunk xxunk uses the blueprint of con movie elements with little new xxunk xxunk is a xxunk one ya seen 'em xxunk con game xxunk xxunk xxunk newer ones just add a few more unimaginative side xxunk involve modern devices such as electronic xxunk computer xxunk a little more of the sex xxunk xxunk complicating a tired plot doesn't make it xxunk xxunk back we had a list of worst movies | xxunk in new xxunk xxunk movie is no xxunk xxunk uses the blueprint of con movie elements with little new xxunk xxunk is a xxunk one ya seen 'em xxunk con game xxunk xxunk xxunk newer ones just add a few more unimaginative side xxunk involve modern devices such as electronic xxunk computer xxunk a little more of the sex xxunk xxunk complicating a tired plot doesn't make it xxunk xxunk back we had a list of worst movies ever |
| 3 | film has the same sort of premise from there xxunk xxunk xxunk i have mentioned the ending is xxunk it isn't very believable xxunk at xxunk why would bloody bill believe that gwen was his sister and the fact that they look xxunk like each other isn't very plausible xxunk and it would have been nice to find out what happened to gwen after she left death xxunk xxunk xxunk hope their isn't a sequel as zombie sequels are never | has the same sort of premise from there xxunk xxunk xxunk i have mentioned the ending is xxunk it isn't very believable xxunk at xxunk why would bloody bill believe that gwen was his sister and the fact that they look xxunk like each other isn't very plausible xxunk and it would have been nice to find out what happened to gwen after she left death xxunk xxunk xxunk hope their isn't a sequel as zombie sequels are never good |
| 4 | xxunk afraid to chew the xxunk xxunk also had a decent xxunk xxunk one had me scratching my xxunk xxunk xxunk isn't really xxunk about a xxunk why does she have a manager xxunk why is he wasting his xxunk xxunk xxunk and xxunk are xxunk why do they sign up for xxunk xxunk of xxunk xxunk xxunk xxunk the xxunk xxunk movie where xxunk xxunk wants to do xxunk only to find himself on xxunk xxunk xxunk industry xxunk | afraid to chew the xxunk xxunk also had a decent xxunk xxunk one had me scratching my xxunk xxunk xxunk isn't really xxunk about a xxunk why does she have a manager xxunk why is he wasting his xxunk xxunk xxunk and xxunk are xxunk why do they sign up for xxunk xxunk of xxunk xxunk xxunk xxunk the xxunk xxunk movie where xxunk xxunk wants to do xxunk only to find himself on xxunk xxunk xxunk industry xxunk but |
| 5 | camera and had it trained exclusively on xxunk xxunk boys do their best routines xxunk xxunk xxunk xxunk xxunk xxunk the xxunk xxunk they are a joy to xxunk xxunk who thinks that xxunk xxunk xxunk the best straight man in the business should watch the xxunk routine and check out how he consistently and skillfully reins in xxunk whenever xxunk manic energy takes him too far outside the skit while still allowing him the freedom to employ the xxunk | and had it trained exclusively on xxunk xxunk boys do their best routines xxunk xxunk xxunk xxunk xxunk xxunk the xxunk xxunk they are a joy to xxunk xxunk who thinks that xxunk xxunk xxunk the best straight man in the business should watch the xxunk routine and check out how he consistently and skillfully reins in xxunk whenever xxunk manic energy takes him too far outside the skit while still allowing him the freedom to employ the xxunk and |
| 6 | xxunk the worse xxunk always xxunk little or no xxunk and every one was drunk or xxunk xxunk the xxunk the xxunk and the xxunk made it all worth xxunk xxunk like xxunk xxunk xxunk too damn old for it xxunk and the arthritis in the hands and hips mean no more xxunk but for the length of that xxunk it all came xxunk and it was all xxunk xxunk xxunk the xxunk and the xxunk xxunk xxunk was young | the worse xxunk always xxunk little or no xxunk and every one was drunk or xxunk xxunk the xxunk the xxunk and the xxunk made it all worth xxunk xxunk like xxunk xxunk xxunk too damn old for it xxunk and the arthritis in the hands and hips mean no more xxunk but for the length of that xxunk it all came xxunk and it was all xxunk xxunk xxunk the xxunk and the xxunk xxunk xxunk was young xxunk |
| 7 | or give detailed or coherent answers or even answers at xxunk xxunk highly doubt that if some of those who were interviewed knew what xxunk was creating or saw the final product would allow themselves to be included in the xxunk xxunk film puts punks in a bad light by making them seem unintelligent and xxunk xxunk film should not be taken as a representation of the xxunk punk xxunk xxunk you want to see a good punk documentary watch | give detailed or coherent answers or even answers at xxunk xxunk highly doubt that if some of those who were interviewed knew what xxunk was creating or saw the final product would allow themselves to be included in the xxunk xxunk film puts punks in a bad light by making them seem unintelligent and xxunk xxunk film should not be taken as a representation of the xxunk punk xxunk xxunk you want to see a good punk documentary watch xxunk |
| 8 | featured in this xxunk xxunk just a choice between a hard life with your pride or getting more money with less xxunk xxunk to the xxunk xxunk xxunk film is very sad xxunk xxunk xxunk because the xxunk is sad but because sadly there is no xxunk xxunk because the xxunk or any other artistic feature is so convincing but because of the lack of anything xxunk xxunk others have pointed out here the director obviously had nothing together before | in this xxunk xxunk just a choice between a hard life with your pride or getting more money with less xxunk xxunk to the xxunk xxunk xxunk film is very sad xxunk xxunk xxunk because the xxunk is sad but because sadly there is no xxunk xxunk because the xxunk or any other artistic feature is so convincing but because of the lack of anything xxunk xxunk others have pointed out here the director obviously had nothing together before shooting |
We can then put this in a Learner object very easily, with a model loaded with the pretrained weights. They'll be downloaded the first time you execute the following line and stored in ~/.fastai/models/ (or elsewhere if you specified a different path in your config file).
len(dbunch_lm.vocab)
60008
learn = language_model_learner(dbunch_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot_lr_find(skip_end=15)
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7,0.8))
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 4.121422 | 3.914404 | 0.299510 | 50.119186 | 07:11 |
learn.save('fit_head')
learn.load('fit_head');
To complete the fine-tuning, we can then unfreeze and launch a new training.
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3, moms=(0.8,0.7,0.8))
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 3.892977 | 3.774952 | 0.316564 | 43.595413 | 07:37 |
| 1 | 3.814780 | 3.715484 | 0.323794 | 41.078480 | 07:41 |
| 2 | 3.749609 | 3.664291 | 0.329369 | 39.028442 | 07:41 |
| 3 | 3.685419 | 3.629105 | 0.333663 | 37.679073 | 07:40 |
| 4 | 3.618381 | 3.603617 | 0.336755 | 36.730846 | 07:41 |
| 5 | 3.567165 | 3.585856 | 0.338791 | 36.084221 | 07:46 |
| 6 | 3.503764 | 3.575684 | 0.340408 | 35.719028 | 07:47 |
| 7 | 3.453132 | 3.569989 | 0.341550 | 35.516193 | 07:49 |
| 8 | 3.409692 | 3.569814 | 0.341963 | 35.509995 | 07:56 |
| 9 | 3.384848 | 3.572026 | 0.341847 | 35.588623 | 07:58 |
learn.save('fine_tuned')
How good is our model? Well, let's see what it predicts after a few given words.
learn.load('fine_tuned');
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
I liked this movie because of the cool scenery and the high level of xxmaj british hunting . xxmaj the only thing this movie has going for it is the horrible acting and no script . xxmaj the movie was a big disappointment . xxmaj I liked this movie because it was one of the few movies that made me laugh so hard i did n't like it . xxmaj it was a hilarious film and it was very entertaining . xxmaj the acting was great , i 'm
We have to save not only the model, but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.
learn.save_encoder('fine_tuned_enc')
Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time.
def read_tokenized_file(f): return L(f.read().split(' '))
imdb_clas = DataBlock(blocks=(TextBlock.from_folder(path, vocab=dbunch_lm.vocab),CategoryBlock),
get_x=read_tokenized_file,
get_y = parent_label,
get_items=partial(get_text_files, folders=['train', 'test']),
splitter=GrandparentSplitter(valid_name='test'))
dbunch_clas = imdb_clas.dataloaders(path, path=path, bs=bs, seq_len=80)
dbunch_clas.show_batch()
| | text | category |
|---|---|---|
| 0 | xxunk have praised xxunk xxunk xxunk as a xxunk adventure for xxunk xxunk don't think xxunk least not for thinking xxunk xxunk xxunk script suggests a beginning as a xxunk xxunk that struck someone as the type of crap you cannot sell to adults xxunk xxunk xxunk xxunk of many older adventure movies has been done well xxunk xxunk xxunk xxunk xxunk but xxunk represents one of the worse films in that xxunk xxunk characters are xxunk xxunk the background that each member trots out seems stock and awkward at xxunk xxunk xxunk xxunk a tomboy mechanic whose father always wanted xxunk if we have not at least seen these xxunk we have seen xxunk quirks xxunk xxunk story about how one xxunk xxunk played by xxunk xxunk xxunk xxunk xxunk went from flower stores to demolitions totally xxunk xxunk xxunk the main xxunk xxunk xxunk a young xxunk academic | neg |
| 1 | xxunk thought that xxunk was clearly the best out of the three xxunk xxunk xxunk xxunk find it surprising that xxunk is considered the weakest installment in the xxunk by many who have xxunk xxunk me it seemed like xxunk was the best because it had the most profound xxunk the most xxunk xxunk most xxunk the xxunk and definitely the most episodic xxunk xxunk personally like the xxunk xxunk xxunk a lot also but xxunk think it is slightly less good than than xxunk since it was xxunk was not as xxunk and xxunk just did not feel as much suspense or emotion as xxunk did with the third xxunk xxunk xxunk also seems like to me that after reading these surprising reviews that the reasons people cited for xxunk being an inferior film to the other two are just plain ludicrous and are insignificant reasons compared to the | pos |
| 2 | xxunk the realm of xxunk xxunk two particular themes consistently elicit xxunk were initially explored in the literature of a xxunk xxunk and have since been periodically revisited by filmmakers and writers xxunk with varying degrees of xxunk xxunk first xxunk that of time xxunk has held an unwavering fascination for fans of xxunk as well as the written xxunk most recently on the screen with yet another version of the xxunk xxunk xxunk xxunk xxunk xxunk xxunk second xxunk which also manages to hold audiences in xxunk is that of xxunk which sparks the imagination with it's seemingly endless and myriad xxunk xxunk this xxunk xxunk has again become the basis for a film adapted from another xxunk xxunk xxunk xxunk xxunk xxunk the realization of xxunk xxunk is xxunk xxunk directed by xxunk xxunk and starring xxunk xxunk and xxunk xxunk xxunk xxunk xxunk xxunk and his colleagues | neg |
| 3 | xxunk xxunk makes me nostalgic for the early xxunk a time when virtually every new action movie could be described as xxunk xxunk in a xxunk xxunk xxunk is xxunk xxunk on a xxunk and pretty xxunk for what it xxunk xxunk xxunk unlike xxunk 57 and xxunk xxunk which are decent xxunk xxunk clones on their own xxunk xxunk dispenses with the enclosed feeling of many action movies and embraces breathtaking landscapes xxunk in their xxunk threaten to overwhelm and trivialize the conflicts of the people fighting and dying among the xxunk xxunk xxunk before other movies like xxunk xxunk xxunk and xxunk dramatized crime and murder on snowbound xxunk xxunk director xxunk xxunk recognized the visual impact of juxtaposing brutal violence and grim struggles to survive against cold and indifferent natural xxunk xxunk xxunk opening sequence has already received substantial xxunk all of which it xxunk its intensity | pos |
| 4 | xxunk effects of job related stress and the pressures born of a moral dilemma that pits conscience against the obligations of a family business xxunk a unique xxunk all brought to a head xxunk or perhaps the catalyst xxunk a midlife xxunk are examined in the dark and absorbing xxunk xxunk written and directed by xxunk xxunk and starring xxunk xxunk xxunk and xxunk xxunk xxunk a telling look at how indecision and denial can bring about the internal strife and misery that ultimately leads to apathy and that moment of truth when the conflict xxunk of xxunk at last be xxunk xxunk xxunk xxunk is xxunk he has a loving xxunk xxunk xxunk xxunk a precocious xxunk xxunk xxunk xxunk xxunk a mail order business he runs out of the xxunk as well as his main source of xxunk the xxunk business he shares with his xxunk xxunk xxunk | pos |
| 5 | xxunk xxunk style movie takes the middle aged divorcee victim who then finally fights back genre to new depths of xxunk xxunk xxunk xxunk the 40 something xxunk xxunk of a successful lawyer protagonist xxunk away at xxunk is starting a new life after her xxunk helped by a female college friend in opening a new dress shop as a sort of franchise expansion xxunk xxunk has even started up a friendship with her xxunk slightly younger xxunk landscape architect / gardener xxunk xxunk xxunk then horror of xxunk xxunk xxunk xxunk 20 something female she took on as a tenant to let a room xxunk starts xxunk xxunk her xxunk xxunk xxunk this new younger woman threat really does is mildly flirt with the xxunk and offer him a glass of wine that * gasp * really belonged to the xxunk xxunk runs up the utility bills by not | neg |
| 6 | xxunk xxunk sadly xxunk xxunk rented xxunk xxunk on xxunk xxunk 3 several years ago and there was a music xxunk xxunk pretty sure which was called xxunk xxunk at the end of xxunk and xxunk rented this one on xxunk hoping that the video would be there because it was one of the funniest things xxunk ever xxunk xxunk amazing how stuff from the 80s is so funny xxunk but nothing is funnier than 80s rap xxunk xxunk was this rap group singing that song xxunk xxunk on the xxunk version of this movie after the xxunk and they're all wearing like denim jackets with no shirt underneath and form fitting jean shorts that are all frayed at the bottoms like xxunk xxunk xxunk could make a rap group look more foolish xxunk can't xxunk xxunk xxunk xxunk any xxunk xxunk was disappointed in looking for that video on | neg |
| 7 | xxunk have never seen any of xxunk xxunk prior xxunk as their trailers never caught my xxunk xxunk have xxunk and admire xxunk xxunk and xxunk xxunk xxunk and have several of their xxunk xxunk xxunk xxunk entirely disappointed with this xxunk xxunk this film is any indication of xxunk xxunk ability as a xxunk my advice would be to xxunk a xxunk and stop wasting the time and talent of xxunk xxunk xxunk xxunk wonder if some of the other xxunk commentators watched the same movie that xxunk xxunk xxunk can only xxunk from their sappy lovelorn xxunk that their adoration of xxunk xxunk has blinded them to the banality of this piece of xxunk xxunk only paid xxunk in a xxunk xxunk xxunk and still felt xxunk wasted my xxunk xxunk xxunk xxunk xxunk page says it xxunk xxunk xxunk xxunk in 39 xxunk -- xxunk can you | neg |
| 8 | xxunk is not as great a film as many people believe xxunk my late xxunk who said it was her favorite xxunk xxunk due to the better sections of this film xxunk particularly that justifiably famous xxunk xxunk xxunk xxunk xxunk xxunk xxunk has gained a position of importance beyond it's actual worth as a key to the saga of xxunk xxunk failure to conquer xxunk xxunk xxunk 1946 xxunk position as a xxunk figure was xxunk xxunk xxunk was not recognized as the great movie it has since been seen as due to the way it was attacked by the xxunk press and by xxunk insiders xxunk xxunk attempt at total control xxunk and production and xxunk of his movies seemed to threaten the whole xxunk xxunk best job in this period was as xxunk xxunk in xxunk xxunk supposedly shot by xxunk xxunk but actually shot in large | pos |
We can then create a model to classify those reviews and load the encoder we saved before.
learn = text_classifier_learner(dbunch_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.load_encoder('fine_tuned_enc')
<fastai2.text.learner.RNNLearner at 0x7f033cb75f10>
learn.lr_find()
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7, 0.8))
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.347427 | 0.184480 | 0.929320 | 00:33 |
learn.save('first')
learn.load('first');
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7, 0.8))
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.247763 | 0.171683 | 0.934640 | 00:37 |
learn.save('second')
learn.load('second');
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3),moms=(0.8,0.7, 0.8))
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.193377 | 0.156696 | 0.941200 | 00:45 |
learn.save('third')
learn.load('third');
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7, 0.8))
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.172888 | 0.153770 | 0.943120 | 01:01 |
| 1 | 0.161492 | 0.155567 | 0.942640 | 00:57 |
learn.predict("I really loved that movie , it was awesome !")
('pos', tensor(1), tensor([1.4426e-04, 9.9986e-01]))