The optimal data structure for retrieving the top n items from a larger collection is the min/max heap; the related abstract data type is the priority queue. Java has an unbounded PriorityQueue
which is based on the heap structure, but there is no version specialized for primitive types. It can be used as a bounded queue by adding external logic; see this comment for details.
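As a rough illustration of the "external logic" mentioned above, here is a minimal sketch that bounds a java.util.PriorityQueue by hand. The class name BoundedTopN and the method maxN are made up for this example, and it is not necessarily what the linked comment proposes; it also boxes every value, which is the cost the primitive version avoids.

```java
import java.util.PriorityQueue;

class BoundedTopN {
    /** Returns up to n of the largest values of input, in descending order. */
    static int[] maxN(int[] input, int n) {
        if (n <= 0) {
            return new int[0];
        }
        // Natural ordering gives a min-heap: peek() is the smallest retained value.
        PriorityQueue<Integer> q = new PriorityQueue<>(n);
        for (int value : input) {
            if (q.size() < n) {
                q.add(value);
            } else if (value > q.peek()) {
                // The queue is full: evict the smallest value to make room.
                q.poll();
                q.add(value);
            }
        }
        // poll() yields ascending order, so fill the array back to front.
        int[] output = new int[q.size()];
        for (int i = output.length - 1; i >= 0; i--) {
            output[i] = q.poll();
        }
        return output;
    }
}
```

The bound is enforced entirely by the caller: the queue itself would happily grow past n.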
Apache Lucene has an implementation of the bounded priority queue:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/5.2.0/org/apache/lucene/util/PriorityQueue.java#PriorityQueue
Here is a simple modification that specializes it for ints:
/*
* Original work Copyright 2014 The Apache Software Foundation
* Modified work Copyright 2015 Marko Topolnik
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/** A PriorityQueue maintains a partial ordering of its elements such that the
 * worst element can always be found in constant time. Calls to consider() and
 * pop() require log(size) time.
 */
class IntPriorityQueue {
    // Returned by head() and pop() when the queue is empty
    private static final int NO_ELEMENT = Integer.MIN_VALUE;
    private int size;
    private final int maxSize;
    // 1-based heap: heap[0] is never used
    private final int[] heap;
    IntPriorityQueue(int maxSize) {
        this.heap = new int[maxSize == 0 ? 2 : maxSize + 1];
        this.maxSize = maxSize;
    }
    private static boolean betterThan(int left, int right) {
        return left > right;
    }
    /**
     * Offers an int to the PriorityQueue in log(size) time.
     * If the queue is not yet full with maxSize elements, the
     * element is simply added. If it is full, the element
     * replaces the current worst value (the heap's root) when
     * it is better than that value; otherwise it is discarded.
     * Nothing is returned in either case.
     */
    public void consider(int element) {
        if (size < maxSize) {
            size++;
            heap[size] = element;
            upHeap();
        } else if (size > 0 && betterThan(element, heap[1])) {
            heap[1] = element;
            downHeap();
        }
    }
    public int head() {
        return size > 0 ? heap[1] : NO_ELEMENT;
    }
    /** Removes and returns the worst element of the PriorityQueue in log(size)
     * time, or NO_ELEMENT if the queue is empty. */
    public int pop() {
        if (size > 0) {
            int result = heap[1];
            heap[1] = heap[size];
            size--;
            downHeap();
            return result;
        } else {
            return NO_ELEMENT;
        }
    }
    public int size() {
        return size;
    }
    public void clear() {
        size = 0;
    }
    public boolean isEmpty() {
        return size == 0;
    }
    private void upHeap() {
        int i = size;
        // save bottom node
        int node = heap[i];
        int j = i >>> 1;
        while (j > 0 && betterThan(heap[j], node)) {
            // shift parents down
            heap[i] = heap[j];
            i = j;
            j >>>= 1;
        }
        // install saved node
        heap[i] = node;
    }
    private void downHeap() {
        int i = 1;
        // save top node
        int node = heap[i];
        // find worse child
        int j = i << 1;
        int k = j + 1;
        if (k <= size && betterThan(heap[j], heap[k])) {
            j = k;
        }
        while (j <= size && betterThan(node, heap[j])) {
            // shift up child
            heap[i] = heap[j];
            i = j;
            j = i << 1;
            k = j + 1;
            if (k <= size && betterThan(heap[j], heap[k])) {
                j = k;
            }
        }
        // install saved node
        heap[i] = node;
    }
}
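The way you implement betterThan decides whether it will behave as a min or max heap: as written (left > right), the root is the smallest retained value, so the queue keeps the largest items. This is how it's used: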
public int[] maxN(int[] input, int n) {
    final int[] output = new int[n];
    final IntPriorityQueue q = new IntPriorityQueue(output.length);
    for (int i : input) {
        q.consider(i);
    }
    // Extract items from the heap in sort order: pop() returns the worst
    // retained item first, so filling back to front puts the best item at index 0
    for (int i = output.length - 1; i >= 0; i--) {
        output[i] = q.pop();
    }
    return output;
}
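To make the remark about betterThan concrete, here is a compact, self-contained sketch in which the comparison direction is chosen by a flag instead of being hard-coded. The class TinyIntHeap, the keepLargest flag, and the select helper are all invented for this illustration and use plain swaps rather than the save-and-shift technique of the class above; only the betterThan idea is taken from it.

```java
import java.util.Arrays;

class TinyIntHeap {
    private final int[] heap;          // 1-based; heap[0] is unused
    private final boolean keepLargest; // hypothetical switch between max-N and min-N
    private int size;

    TinyIntHeap(int maxSize, boolean keepLargest) {
        this.heap = new int[maxSize + 1];
        this.keepLargest = keepLargest;
    }

    // The only place where "keep the largest" and "keep the smallest" differ.
    private boolean betterThan(int left, int right) {
        return keepLargest ? left > right : left < right;
    }

    void consider(int element) {
        if (size < heap.length - 1) {
            heap[++size] = element;
            // Sift up: swap with the parent while the parent is better.
            for (int i = size; i > 1 && betterThan(heap[i >>> 1], heap[i]); i >>>= 1) {
                swap(i, i >>> 1);
            }
        } else if (size > 0 && betterThan(element, heap[1])) {
            heap[1] = element; // replace the worst retained value
            siftDown();
        }
    }

    private void siftDown() {
        int i = 1;
        while (true) {
            int worst = i, left = i << 1, right = left + 1;
            if (left <= size && betterThan(heap[worst], heap[left])) worst = left;
            if (right <= size && betterThan(heap[worst], heap[right])) worst = right;
            if (worst == i) return;
            swap(i, worst);
            i = worst;
        }
    }

    private void swap(int a, int b) {
        int t = heap[a];
        heap[a] = heap[b];
        heap[b] = t;
    }

    /** Sorted (ascending) copy of the retained values; the heap is left untouched. */
    int[] sortedContents() {
        int[] out = Arrays.copyOfRange(heap, 1, size + 1);
        Arrays.sort(out);
        return out;
    }

    static int[] select(int[] input, int n, boolean keepLargest) {
        TinyIntHeap q = new TinyIntHeap(n, keepLargest);
        for (int v : input) q.consider(v);
        return q.sortedContents();
    }
}
```

With keepLargest set to true the root is the smallest retained value, so small candidates are rejected; flipping the flag reverses both the heap order and which candidates survive.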
Some interest was expressed in the performance of this approach vs. the simple linear scan from user rakeb.void. These are the results; size is the input size, and the search is always for the 16 top elements:
Benchmark             (size)  Mode  Cnt      Score      Error  Units
MeasureMinMax.heap        32  avgt    5    270.056 ±   37.948  ns/op
MeasureMinMax.heap        64  avgt    5    379.832 ±   44.703  ns/op
MeasureMinMax.heap       128  avgt    5    543.522 ±   52.970  ns/op
MeasureMinMax.heap      4096  avgt    5   4548.352 ±  208.768  ns/op
MeasureMinMax.linear      32  avgt    5    188.711 ±   27.085  ns/op
MeasureMinMax.linear      64  avgt    5    333.586 ±   18.955  ns/op
MeasureMinMax.linear     128  avgt    5    677.692 ±  163.470  ns/op
MeasureMinMax.linear    4096  avgt    5  18290.981 ± 5783.255  ns/op
Conclusion: the constant factors working against the heap approach are quite low. The break-even point is around 70-80 input elements, and from then on the simple approach loses steeply. Note that the constant factor stems from the final operation of extracting items in sort order. If this is not needed (i.e., just a set of the best items is enough), then we can simply retrieve the internal heap array directly and ignore the heap[0] element, which is not used by the algorithm. In that case this solution beats one like rakeb.void's even for the smallest input size (I tested with 4 top elements out of 32).
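A minimal sketch of that last variant, assuming only the set of best items is needed: the heap lives in a local 1-based array and the live slots are copied out without any pops. The class UnorderedTopN and the method maxNUnordered are names made up for this example; this is not the code that was benchmarked.

```java
import java.util.Arrays;

class UnorderedTopN {
    /** Returns up to n of the largest values of input, in unspecified (heap) order. */
    static int[] maxNUnordered(int[] input, int n) {
        int[] heap = new int[n + 1]; // 1-based; heap[0] is never used
        int size = 0;
        for (int element : input) {
            if (size < n) {
                // Append at the bottom, shifting larger parents down.
                int i = ++size;
                for (int p = i >>> 1; p > 0 && heap[p] > element; i = p, p >>>= 1) {
                    heap[i] = heap[p];
                }
                heap[i] = element;
            } else if (size > 0 && element > heap[1]) {
                // Replace the smallest retained value, shifting smaller children up.
                int i = 1;
                for (int c = 2; c <= size; i = c, c = i << 1) {
                    if (c < size && heap[c + 1] < heap[c]) c++; // pick the smaller child
                    if (element <= heap[c]) break;
                    heap[i] = heap[c];
                }
                heap[i] = element;
            }
        }
        // No pops: copy the live slots 1..size, skipping heap[0].
        return Arrays.copyOfRange(heap, 1, size + 1);
    }
}
```

The result holds the right items but in heap order, so callers must not rely on the position of any element other than index 0, which is the smallest of the retained values.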