Words Count

Last week I participated in a small competition at work. The goal was to calculate the top 10 most common words in a file and print them with their number of occurrences. For testing we used an export of HackerNews comments from 2012 until today; that file was 4GB.

I will describe the optimizations I used to solve the problem, from the most to the least obvious.

All measurements here were made on a MacBook Pro 2018 with a 2.2 GHz Intel Core i7 CPU and 32 GB RAM.

Generic solution

The most straightforward way to solve this is to iterate over the file, save words to a map with their number of occurrences, then iterate over the map and print the top 10 most common words.

const maxLen = 64 // longest word we care about; longer words are truncated

func countWords(fileData []byte) map[string]uint {
        words := map[string]uint{}

        wordBuf := make([]byte, maxLen)
        wordPos := 0
        for _, c := range fileData {
                switch {
                case c >= 'A' && c <= 'Z':
                        c += 32 // lowercase the ASCII letter
                        fallthrough
                case c >= 'a' && c <= 'z':
                        if wordPos == maxLen {
                                continue
                        }
                        wordBuf[wordPos] = c
                        wordPos++
                default:
                        // any non-letter byte terminates the current word
                        if wordPos == 0 {
                                continue
                        }
                        words[string(wordBuf[:wordPos])]++
                        wordPos = 0
                }
        }
        // flush the trailing word if the file does not end with a separator
        if wordPos > 0 {
                words[string(wordBuf[:wordPos])]++
        }

        return words
}

It will make a map with the number of occurrences of each English word in the file.

It takes ~1min to make this map.
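For completeness, here is a minimal sketch of how this might be wired up, assuming the whole 4GB file fits in memory; the main function and the file path are placeholders of mine, not the author's code:

import (
        "log"
        "os"
)

func main() {
        // Read the whole export into memory; this assumes enough free RAM.
        fileData, err := os.ReadFile("comments.txt") // placeholder path
        if err != nil {
                log.Fatal(err)
        }

        words := countWords(fileData)
        _ = words // handed off to the sorting step below
}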

After that, the words need to be sorted by their counts:

// Use the count as an index into a huge sparse slice: freq[count] = word.
// 2<<25 is assumed to be larger than any single word's count; words with
// the same count overwrite each other, which is acceptable here.
freq := make([]*string, 2<<25)
for word, count := range words {
        wCopy := word
        freq[count] = &wCopy
}

// Walk from the highest possible count down and print the first 10 hits.
done := 0
for i := len(freq) - 1; i > 0 && done < 10; i-- {
        if freq[i] == nil {
                continue
        }
        fmt.Printf("%s: %d\n", *freq[i], i)
        done++
}

This is a terrible way of sorting anything in real life, but it works fast, and that’s exactly what we need in our case. It took ~113ms.

Reading the file concurrently

The next step is to assume that it could go faster if the file is processed in batches concurrently. That is true, but it also creates data races. To solve them, I will use this implementation of a map with atomic writes.

Another way to go from here is to save the results from each goroutine into its own map and then merge the maps; a sketch of that approach is shown below.
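Here is a minimal sketch of that alternative (not what the rest of the post uses); mergingCounter and its merge method are my own illustrative names:

import "sync"

// mergingCounter collects per-goroutine results under a mutex:
// each goroutine counts into its own local map, so the lock is
// taken once per batch instead of once per word.
type mergingCounter struct {
        mu    sync.Mutex
        total map[string]uint
}

func (m *mergingCounter) merge(local map[string]uint) {
        m.mu.Lock()
        defer m.mu.Unlock()
        for word, count := range local {
                m.total[word] += count
        }
}

// Inside each worker goroutine:
//
//         local := countWords(batch) // goroutine-local map, no contention
//         counter.merge(local)       // counter is a hypothetical shared *mergingCounter

I went with the shared concurrent map instead. Either way, the file itself is read in batches concurrently: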

func fromFile(filepath string, batchSize int64, tk *count.Stream) error {
        file, err := os.Open(filepath)
        if err != nil {
                return fmt.Errorf("failed to open `%s`: %s", filepath, err)
        }
        defer file.Close()

        info, err := file.Stat()
        if err != nil {
                return fmt.Errorf("failed to stat `%s`: %s", filepath, err)
        }

        // Round up so the last, possibly shorter, batch is not dropped.
        all := (info.Size() + batchSize - 1) / batchSize
        wg := &errgroup.Group{}
        for i := int64(0); i < all; i++ {
                i := i
                // NOTE: read concurrently and process in batches
                wg.Go(func() error {
                        buff := make([]byte, batchSize)

                        off, err := file.ReadAt(buff, batchSize*i)
                        if err != nil && err != io.EOF {
                                return err
                        }
                        // ReadAt may return io.EOF together with a partial read,
                        // so process whatever was read before returning.
                        if off > 0 {
                                processBatch(buff[:off], tk)
                        }

                        return nil
                })
        }

        return wg.Wait()
}

I was using 2<<19-1 (about 1 MB) as the batchSize. This value is based on test runs and nothing else.

tk here contains an instance of github.com/cornelk/hashmap, the concurrent map mentioned above.
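I won't guess the exact cornelk/hashmap API here, so the sketch below swaps in the standard library's sync.Map to show the same atomic-write counting idea; Stream and its Add method are stand-ins for the post's count.Stream:

import (
        "sync"
        "sync/atomic"
)

// Stream is a stand-in for count.Stream from the post. The original wraps
// github.com/cornelk/hashmap; sync.Map is used here only for illustration.
type Stream struct {
        counts sync.Map // word (string) -> *uint64
}

// Add increments a word's counter without losing concurrent updates.
func (s *Stream) Add(word string) {
        c, _ := s.counts.LoadOrStore(word, new(uint64))
        atomic.AddUint64(c.(*uint64), 1)
}

// processBatch runs the same splitting loop as countWords,
// but writes every finished word to the shared concurrent map.
func processBatch(batch []byte, tk *Stream) {
        wordBuf := make([]byte, maxLen)
        wordPos := 0
        for _, c := range batch {
                switch {
                case c >= 'A' && c <= 'Z':
                        c += 32
                        fallthrough
                case c >= 'a' && c <= 'z':
                        if wordPos == maxLen {
                                continue
                        }
                        wordBuf[wordPos] = c
                        wordPos++
                default:
                        if wordPos > 0 {
                                tk.Add(string(wordBuf[:wordPos]))
                                wordPos = 0
                        }
                }
        }
        if wordPos > 0 {
                tk.Add(string(wordBuf[:wordPos]))
        }
}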

Law of Large Numbers

The last and biggest optimization I’ve made is not entirely about programming, which I think is what makes it the best one.

There is this theorem in probability theory that says:

the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed

That means that if you take a large file with English text and count the top 10 words in it, the top will always look almost the same.

The next step is to google the most common words in English and use that list to decrease the number of words we count.

By doing that, the number of writes to the map decreases dramatically, which means less time is spent waiting for write access to the map.
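A minimal sketch of this filter, assuming a hypothetical candidates allowlist built from such a published list (the sample words below are illustrative, not the actual list used):

// candidates is a hypothetical allowlist of the most common English words;
// only these words are ever written to the shared map.
var candidates = map[string]struct{}{
        "the": {}, "of": {}, "and": {}, "to": {}, "a": {},
        "in": {}, "is": {}, "it": {}, "you": {}, "that": {},
        // ... the rest of the list
}

// Inside the splitting loop, instead of writing every word:
//
//         word := string(wordBuf[:wordPos])
//         if _, ok := candidates[word]; ok {
//                 tk.Add(word) // far fewer writes to the shared map
//         }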

After implementing this, the processing time is ~25s.

You can find the full code on GitHub.