Recently I found an article about a word counting challenge, with unexpectedly bad results for Node.js. Performance when handling a multi-gigabyte data set doesn't matter for most Node-based web applications, but it's still a fun challenge and a nice way to understand Node.js limitations.
I recently compared the performance of Node.js and C++ applications iterating through a large memory buffer and performing simple math on each byte. I got approximately a 6x time difference. That's the difference I was ready for. But in the word count challenge the difference is more than 15x.
Let's see how to get a more than 6x performance improvement.
Starting point
The source code used as a starting point can be found on GitHub.
To be able to compare results, I executed it on my hardware (i7-4770, 32 GB RAM, SATA 7200 RPM drive, Node.js v5.9.1):
```bash
cat ../data/huwiki-latest-pages-meta-current.xml | head -n 5000000 | \
```
The result for 5M lines is 316.20 seconds and 1 042 956k of memory.
Step 0 - Measure
The code contains four significant steps:
- Read data
- Count words
- Sort result
- Output data
The `time` command is used to measure overall execution time. Measuring each step separately can be tricky because of the loops: starting and stopping a timer millions of times can itself add time to the result. That could be accounted for, but I decided to keep things simple, so I did incremental measurements instead. Each one is included in the corresponding step description.
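For those per-step numbers, one timer around a whole step is enough. A minimal sketch using process.hrtime (the measure helper and its usage are mine, not the original code):

```js
'use strict';

// Hypothetical helper: time one coarse step once, not every loop iteration.
function measure(label, fn) {
  var start = process.hrtime();        // [seconds, nanoseconds]
  var result = fn();
  var diff = process.hrtime(start);    // elapsed time since `start`
  console.error(label + ': ' + (diff[0] + diff[1] / 1e9).toFixed(2) + 's');
  return result;
}

// Example: var words = measure('keys', function () { return Object.keys(counts); });
```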
Step 1 - Data reading
Original code
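My reconstruction of the original line-by-line reading (the readline module and the whitespace delimiter are assumptions, not confirmed by the source):

```js
'use strict';
var readline = require('readline');

var tokens = 0;
var rl = readline.createInterface({ input: process.stdin });

rl.on('line', function (line) {
  // Second pass over the same data: split the already-extracted line.
  var words = line.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) tokens++;  // the real code feeds words[i] to the counter
  }
});

rl.on('close', function () {
  console.error('tokens: ' + tokens);
});
```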
It just reads and tokenizes the data. Execution time 5.90s, 48 716k.
Obviously, we can't shave much off this time, and in a real project it would have a low priority. But I decided to win some time here anyway.
As you can see above, the stream is read line by line. So a core module has to scan the stream for line separators, and then we scan each line for word delimiters again. Parsing the same string twice looks like unnecessary overhead.
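So the improved version tokenizes raw chunks directly, in a single pass. A sketch (again my reconstruction): carry an unfinished word across chunk boundaries instead of splitting into lines first:

```js
'use strict';
var tail = '';   // unfinished word left over from the previous chunk
var tokens = 0;

process.stdin.setEncoding('utf8');

process.stdin.on('data', function (chunk) {
  var parts = (tail + chunk).split(/\s+/);  // one scan, no line splitting
  tail = parts.pop();                       // last piece may be cut mid-word
  for (var i = 0; i < parts.length; i++) {
    if (parts[i]) tokens++;                 // feed parts[i] to the counter
  }
});

process.stdin.on('end', function () {
  if (tail) tokens++;
  console.error('tokens: ' + tokens);
});
```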
Execution time 4.06s, 48 664k.
Not much of the total time, but it's 30% of the reading/tokenizing time.
Step 2 - Data storing
We need to count the number of occurrences of each word. The most obvious way is to use a JS object as a hash map. But to sort the result, we need to create an array of words.
Original code
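My reconstruction of the plain-object counting (countWord and wordList are hypothetical names; the original differs in detail):

```js
'use strict';
var counts = {};

// Plain JS object as a hash map: word -> number of occurrences.
function countWord(word) {
  if (counts[word] === undefined) counts[word] = 1;
  else counts[word]++;
}

// The array of words needed for sorting; this single call turned out
// to dominate the runtime (measured below).
function wordList() {
  return Object.keys(counts);
}
```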
Execution time 300.18s, 987 936k.
It took surprisingly long. So I measured the Object.keys() call to understand which part was so slow. Result - 277.9s in a single call. Looks like a place for improvement. To finish the time calculation, counting took approximately 300 - 277 - 6 = 17s.
The first thing I tried was the Map class from ES6.
```js
var counts = new Map();
```
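Fleshed out, Map-based counting and key extraction look roughly like this (a sketch; countWord and wordList are hypothetical names):

```js
'use strict';
var counts = new Map();

function countWord(word) {
  var n = counts.get(word);
  counts.set(word, n === undefined ? 1 : n + 1);
}

// Materializing the Map's key iterator was dramatically faster than
// Object.keys() on a huge plain object.
function wordList() {
  return Array.from(counts.keys());
}
```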
Execution time 20.99s, 1 348 936k. Generating the words array - 1.01s, counting - 16s.
Counting performance is comparable, but key generation is amazingly fast. The downside is higher memory usage.
I also tried a couple of npm modules for hash maps, but without any luck.
Step 3 - Data sorting
There is not much code to show.
Original code execution time - 312.65s, 1 042 024k.
I tested the included TimSort implementation, then tried Array.prototype.sort(). Unexpectedly, the native sort took longer: Array.prototype.sort() - 51.77s, 1 352 460k; TimSort - 46.04s, 1 359 664k.
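Both runs boil down to the same comparator fed to two different sorts. A sketch (assuming the npm timsort package stands in for the bundled implementation; the descending-by-count order is also a guess):

```js
'use strict';
var TimSort = require('timsort');  // npm timsort package (an assumption)

function sortWords(words, counts) {
  // Most frequent words first; `counts` is the Map from step 2.
  var byCountDesc = function (a, b) {
    return counts.get(b) - counts.get(a);
  };
  TimSort.sort(words, byCountDesc);  // in-place, same shape as Array#sort
  // The slower native alternative measured above:
  // words.sort(byCountDesc);
  return words;
}
```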
Step 4 - Data output
Original code
```js
let word = '';
```
I like output buffering, but I don't like string concatenation, so I tried removing the buffering and writing directly to stdout. The performance of that solution was much worse, though, so I rolled back to the original one.
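A sketch of the buffered variant that won (the flush threshold and the output format are my choices, not necessarily the original's):

```js
'use strict';
function printCounts(words, counts) {
  var buf = '';
  for (var i = 0; i < words.length; i++) {
    buf += counts.get(words[i]) + ' ' + words[i] + '\n';
    if (buf.length > 65536) {            // flush in ~64k chunks, not per word
      process.stdout.write(buf);
      buf = '';
    }
  }
  if (buf) process.stdout.write(buf);    // flush the remainder
}
```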
Final measurement
After putting all the pieces together, I got the following result.
Execution time 49.87s, 1 360 284k.
And the starting point was 316.20s, 1 042 956k.
The difference is more than 6x, as I promised.
As you can see, some of the guesses give you a huge benefit, and some make things worse. And you can't tell which is which without testing.