This is a library to address the top-k items problem. It uses the Misra-Gries algorithm to provide the k items in a stream or data set of length m which appear more frequently than m/k, using only k-1 counters in memory.
Note that while this is an effective algorithm for large data sets, and does not contain false negatives, it can contain arbitrarily-false positives.
package main
import (
"fmt"
"io/ioutil"
"net/http"
"strings"
"github.com/adamdrake/hitters"
)
func main() {
hitters, err := hitters.New(10) // This will keep track of items which appear with frequency datasetSize/50
if err != nil {
panic("cannot init hitters")
}
resp, err := http.Get("http://www.gutenberg.org/cache/epub/2680/pg2680.txt")
source, err := ioutil.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
panic("cannot get text")
}
cleaned := strings.Replace(string(source), "\n", " ", -1)
tokens := strings.Split(cleaned, " ")
hitters.Add(tokens...) // This adds one or more items to the Hitters
fmt.Println(hitters.Items())
}
This will return a map with item 5 having a count of 4 and item b having a count of 1.