java - How to get frequently occurring phrases with Lucene

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I would like to extract frequently occurring phrases with Lucene. I am indexing text from TXT files, and I lose a lot of context because phrases are not captured; for example, "information retrieval" is indexed as two separate words.

How can I get phrases like this? I cannot find anything useful on the internet, so any advice, links, hints, and especially examples are appreciated!

EDIT: I store my documents just by title and content:

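 Document doc = new Document();
 // File name: stored and indexed as a single untokenized term
 doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
 // File content: read from a Reader, tokenized but not stored, with term vectors (positions and offsets)
 doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));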

because the content of the file is what matters most for what I am doing. Titles are too often not descriptive at all (e.g., I have many academic papers in PDF whose titles are codes or numbers).

I badly need to index the most frequent phrases from the text content; only now do I see how inadequate this simple "bag of words" approach is.


1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:19+0000

Julia, it seems what you are looking for is n-grams, specifically bigrams. Bigrams that occur together unusually often are called collocations.

Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.

To do this with Lucene, I suggest using Solr with ShingleFilterFactory. Please see this discussion for details.
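To make the idea concrete, here is a minimal plain-Java sketch of what a shingle (word n-gram) filter produces, followed by a simple frequency count. It is not Lucene or Solr code: the class name PhraseCounter, its methods, and the regex tokenizer are made up for illustration. In Lucene itself, the analysis chain would do the tokenizing, and ShingleFilter (the class that Solr's ShingleFilterFactory creates) would emit the word pairs as terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: mimics word shingles in plain Java, not Lucene's API.
public class PhraseCounter {

    // Assumed, simplistic tokenizer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count every run of `size` consecutive tokens (size = 2 gives bigrams).
    // Sentence boundaries are ignored, so a phrase may span two sentences.
    static Map<String, Integer> countShingles(String text, int size) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            String phrase = String.join(" ", tokens.subList(i, i + size));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // The k most frequent phrases; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topPhrases(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        String text = "Information retrieval is fun. Information retrieval with Lucene is fun.";
        for (Map.Entry<String, Integer> e : topPhrases(countShingles(text, 2), 3)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Raw frequency tends to favor pairs of common words such as "of the", so in practice you would probably filter stop words first or rank the pairs with one of the association measures from the collocations chapter instead of plain counts.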
