In the line of code tokenizer = Tokenizer(num_words=, oov_token='<OOV>'), what does the num_words parameter actually do, and what should be taken into consideration before deciding what value to assign to it? What is the effect of assigning a very high value, and of a very low one?



1 Answer

It is basically the size of the vocabulary you want your model to use, based on the data you have. The simple examples below explain it in detail.

Without num_words:

from tensorflow.keras.preprocessing.text import Tokenizer

# No num_words limit: every unique word in fit_text gets an index
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)

# "test" does not appear in fit_text, so it will map to <OOV>
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)

print("sequences : ",sequences,'
')

print("word_index : ",tokenizer.word_index) 

print("word counts : ",tokenizer.word_counts) 

sequences :  [[3, 4, 2, 1, 6, 7, 2, 8]] 

word_index :  {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts :  OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)]) 

Here tokenizer.fit_on_texts(fit_text) builds word_index from the words present in fit_text, starting with oov_token at index 1, followed by the remaining words in descending order of frequency taken from word_counts.
If you don't set num_words, then all the unique words of fit_text are included in word_index and are used to represent the sequences.

If num_words is set, it restricts tokenizer.texts_to_sequences() to the first num_words - 1 indices of word_index when forming sequences; any word whose index is num_words or higher is replaced by oov_token. Note that word_index itself still contains every word; the cutoff is applied only when converting texts to sequences.
Below is an example of it.

With num_words=4:

# num_words=4 keeps only indices 1..3 (<OOV>, 'the', 'example') in sequences
tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)

test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)

print("sequences : ",sequences,'
')

print("word_index : ",tokenizer.word_index)

print("word counts : ",tokenizer.word_counts) 

sequences :  [[3, 1, 2, 1, 1, 1, 2, 1]] 

word_index :  {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts :  OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)]) 
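As for what to consider before choosing the value: here is a minimal sketch of one common approach, assuming the tokenizer has already been fit on your real corpus (the 95% coverage threshold is an illustrative assumption, not a rule from this answer):

# Sketch: choose num_words so the kept words cover ~95% of all token
# occurrences in the corpus; the 0.95 threshold is an assumption.
counts = sorted(tokenizer.word_counts.values(), reverse=True)
total = sum(counts)

covered, kept = 0, 0
for c in counts:
    covered += c
    kept += 1
    if covered / total >= 0.95:
        break

# +2 because index 0 is reserved for padding and oov_token sits at index 1,
# so the top `kept` real words occupy indices 2..kept+1
suggested_num_words = kept + 2
print("suggested num_words:", suggested_num_words)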

Regarding model accuracy, it is always better for the sequences to carry the correct representation of the words in your data rather than oov_token, so a very low num_words replaces many words with oov_token and discards information. Conversely, a very high num_words keeps the long tail of rare words, which adds little signal while enlarging anything sized by the vocabulary (such as an Embedding layer), so for large datasets it is better to set num_words than to load the model with the full vocabulary.
It is good practice to apply preprocessing such as stopword removal and lemmatization/stemming to drop unnecessary words, and then fit the Tokenizer on the preprocessed data; this makes it easier to choose a good num_words, as sketched below.
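A minimal sketch of that preprocessing step, assuming a hand-rolled stopword list for illustration (a real pipeline would use something like NLTK's stopword list plus a proper lemmatizer or stemmer):

import re
from tensorflow.keras.preprocessing.text import Tokenizer

# Hand-rolled stopword list for illustration only; this is an assumption,
# not part of the original answer.
STOPWORDS = {'the', 'of', 'with', 'a', 'an', 'is', 'to'}

def preprocess(text):
    # Lowercase, keep word characters, and drop stopwords
    words = re.findall(r"[a-z']+", text.lower())
    return ' '.join(w for w in words if w not in STOPWORDS)

fit_text = ["Example with the first sentence of the tokenizer"]
cleaned = [preprocess(t) for t in fit_text]

tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
tokenizer.fit_on_texts(cleaned)
print(tokenizer.word_index)
# {'<OOV>': 1, 'example': 2, 'first': 3, 'sentence': 4, 'tokenizer': 5}

With the filler words gone, the indices below the num_words cutoff are spent on content words instead of words like 'the' and 'of'.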

