Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

What algorithm is used for finding ngrams?

Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm I should use?

I'm asking for code, with preference for R. The data is stored in database, so can be a plgpsql function too. Java is a language I know better, so I can "translate" it to another language.

I'm not lazy, I'm only asking for code because I don't want to reinvent the wheel trying to do an algorithm that is already done.

Edit: it's important know how many times each n-gram appears.

Edit 2: there is a R package for N-GRAMS?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
366 views
Welcome To Ask or Share your Answers For Others

1 Answer

If you want to use R to identify ngrams, you can use the tm package and the RWeka package. It will tell you how many times the ngram occurs in your documents, like so:

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])

A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13 
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1

hat-tip: http://tm.r-forge.r-project.org/faq.html


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...