I am doing some text mining in R with the tm package. Everything works smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming): some words share the same stem, but it is important that they are not "thrown together", because they mean different things.
For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is what happens in step 4.
Is there an elegant way to handle such cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?
library(tm)

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")
# Step 1: Create corpus
# VectorSource() creates one document per element of the character vector
corpus <- Corpus(VectorSource(texts))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
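To make the goal more concrete, here is a minimal sketch of the kind of manual override I have in mind. It does not use tm's stemming step at all; it calls `wordStem()` from the SnowballC package directly (which, as far as I know, is the stemmer behind `stemDocument`). The helper `stem_except` and the `protect` vector are hypothetical names made up for this illustration, and the split on single spaces assumes the text has already been cleaned.

```r
library(SnowballC)  # provides wordStem()

# Hypothetical helper: stem every token except those listed in `protect`.
# Assumes tokens are separated by single spaces.
stem_except <- function(text, protect, language = "english") {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  keep <- tokens %in% protect
  # Only stem the tokens that are not protected
  if (any(!keep)) {
    tokens[!keep] <- wordStem(tokens[!keep], language = language)
  }
  paste(tokens, collapse = " ")
}

# Words that must not be merged with their stem neighbours (my own choice)
protect <- c("lecturer", "associate")

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

stemmed <- vapply(texts, stem_except, character(1),
                  protect = protect, USE.NAMES = FALSE)
print(stemmed)
```

With this, "lecturer" and "associate" pass through unchanged while all other words are stemmed as usual. Presumably something like this could be plugged into the corpus via `tm_map` with `content_transformer`, but I have not verified how well that interacts with `stemCompletion`.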