Recently I found an article about a word counting challenge, with unexpectedly bad results for Node.js. Performance when handling a multi-gigabyte data set doesn't matter for most Node-based web applications, but it's still a fun challenge and a nice way to understand Node.js limitations.
I recently compared the performance of Node.js and C++ applications iterating through a large memory buffer and performing simple math on each byte. I got approximately a 6x time difference. That's the difference I was ready for. But in the word count challenge the difference is more than 15x.
Let's see how to get a more than 6x performance improvement.
Starting point
The source code used as a starting point can be found on GitHub.
To be able to compare results, I executed it on my hardware (i7-4770, 32 GB RAM, SATA 7200 RPM drive, Node.js v5.9.1):
```bash
cat ../data/huwiki-latest-pages-meta-current.xml | head -n 5000000 | \
```
The result for 5M lines is 316.20 seconds and 1 042 956k of memory.
Step 0 - Measure
The code contains four significant steps:
- Read data
- Count words
- Sort result
- Output data
The `time` command is used to measure overall execution time. Measuring each step separately can be tricky because of the loops: starting and stopping a timer millions of times can itself add time to the result. That could be accounted for, but I decided to keep things simple, so I did incremental measurements instead. Each one is included in the corresponding step description.
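For those per-step numbers, one timer around a whole step is enough. A minimal sketch using process.hrtime (the measure helper and its usage are mine, not the original code):

```js
'use strict';

// Hypothetical helper: time one coarse step once, not every loop iteration.
function measure(label, fn) {
  var start = process.hrtime();        // [seconds, nanoseconds]
  var result = fn();
  var diff = process.hrtime(start);    // elapsed time since `start`
  console.error(label + ': ' + (diff[0] + diff[1] / 1e9).toFixed(2) + 's');
  return result;
}

// Example: var words = measure('keys', function () { return Object.keys(counts); });
```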
Step 1 - Data reading
Original code
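My reconstruction of the original line-by-line reading (the readline module and the whitespace delimiter are assumptions, not confirmed by the source):

```js
'use strict';
var readline = require('readline');

var tokens = 0;
var rl = readline.createInterface({ input: process.stdin });

rl.on('line', function (line) {
  // Second pass over the same data: split the already-extracted line.
  var words = line.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) tokens++;  // the real code feeds words[i] to the counter
  }
});

rl.on('close', function () {
  console.error('tokens: ' + tokens);
});
```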
It just reads and tokenizes the data. Execution time 5.90s, 48 716k.
Obviously, we can't shave much off this time, and in a real project it would have a low priority. But I decided to win some time here anyway.
As you can see above, the stream is read line by line. So a core module has to scan the stream for line separators, and then we scan each line for word delimiters again. Parsing the same string twice looks like unnecessary overhead.
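So the improved version tokenizes raw chunks directly, in a single pass. A sketch (again my reconstruction): carry an unfinished word across chunk boundaries instead of splitting into lines first:

```js
'use strict';
var tail = '';   // unfinished word left over from the previous chunk
var tokens = 0;

process.stdin.setEncoding('utf8');

process.stdin.on('data', function (chunk) {
  var parts = (tail + chunk).split(/\s+/);  // one scan, no line splitting
  tail = parts.pop();                       // last piece may be cut mid-word
  for (var i = 0; i < parts.length; i++) {
    if (parts[i]) tokens++;                 // feed parts[i] to the counter
  }
});

process.stdin.on('end', function () {
  if (tail) tokens++;
  console.error('tokens: ' + tokens);
});
```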
Execution time 4.06s, 48 664k.
Not much of the total time, but it's 30% of the reading/tokenizing time.
Step 2 - Data storing
We need to count the number of occurrences of each word. The most obvious way is to use a JS object as a hash map. But to sort the result, we need to create an array of words.
Original code
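My reconstruction of the plain-object counting (countWord and wordList are hypothetical names; the original differs in detail):

```js
'use strict';
var counts = {};

// Plain JS object as a hash map: word -> number of occurrences.
function countWord(word) {
  if (counts[word] === undefined) counts[word] = 1;
  else counts[word]++;
}

// The array of words needed for sorting; this single call turned out
// to dominate the runtime (measured below).
function wordList() {
  return Object.keys(counts);
}
```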
Execution time 300.18s, 987 936k.
It took surprisingly long. So I measured the Object.keys() call to understand which part was so slow. Result - 277.9s in a single call. Looks like a place for improvement. To finish the time calculation, counting took approximately 300 - 277 - 6 = 17s.
The first thing I tried was the Map class from ES6.
```js
var counts = new Map();
```
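Fleshed out, Map-based counting and key extraction look roughly like this (a sketch; countWord and wordList are hypothetical names):

```js
'use strict';
var counts = new Map();

function countWord(word) {
  var n = counts.get(word);
  counts.set(word, n === undefined ? 1 : n + 1);
}

// Materializing the Map's key iterator was dramatically faster than
// Object.keys() on a huge plain object.
function wordList() {
  return Array.from(counts.keys());
}
```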
Execution time 20.99s, 1 348 936k. Generating the words array - 1.01s, counting - 16s.
Counting performance is comparable, but key generation is amazingly fast. The downside is higher memory usage.
I also tried a couple of npm modules for hash maps, but without any luck.
Step 3 - Data sorting
There is not much code to show.
Original code execution time - 312.65s, 1 042 024k.
I tested the included TimSort implementation, then tried Array.prototype.sort(). Unexpectedly, the native sort took longer: Array.prototype.sort() - 51.77s, 1 352 460k; TimSort - 46.04s, 1 359 664k.
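Both runs boil down to the same comparator fed to two different sorts. A sketch (assuming the npm timsort package stands in for the bundled implementation; the descending-by-count order is also a guess):

```js
'use strict';
var TimSort = require('timsort');  // npm timsort package (an assumption)

function sortWords(words, counts) {
  // Most frequent words first; `counts` is the Map from step 2.
  var byCountDesc = function (a, b) {
    return counts.get(b) - counts.get(a);
  };
  TimSort.sort(words, byCountDesc);  // in-place, same shape as Array#sort
  // The slower native alternative measured above:
  // words.sort(byCountDesc);
  return words;
}
```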
Step 4 - Data output
Original code
```js
let word = '';
```
I like output buffering, but I don't like string concatenation, so I tried removing the buffering and writing directly to stdout. The performance of that solution was much worse, though, so I rolled back to the original one.
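A sketch of the buffered variant that won (the flush threshold and the output format are my choices, not necessarily the original's):

```js
'use strict';
function printCounts(words, counts) {
  var buf = '';
  for (var i = 0; i < words.length; i++) {
    buf += counts.get(words[i]) + ' ' + words[i] + '\n';
    if (buf.length > 65536) {            // flush in ~64k chunks, not per word
      process.stdout.write(buf);
      buf = '';
    }
  }
  if (buf) process.stdout.write(buf);    // flush the remainder
}
```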
Final measurement
After putting all the pieces together, I got the following result.
Execution time 49.87s, 1 360 284k.
And the starting point was 316.20s, 1 042 956k.
The difference is more than 6x, as I promised.
As you can see, some of the guesses give you a huge benefit, and some make things worse. And you can't tell which is which without testing.