Many web applications start as simple as "take data from the DB and apply a template". But as time goes by, more CPU- and time-consuming tasks appear: processing large photos for a gallery, sophisticated reporting, mass mailing, etc. That's where the single-threaded nature of Node.js and its browser roots become a flaw.
When it matters
Here is an incomplete list of cases where distributed job processing is helpful:
- Tasks are CPU bound. It's hard to get good load balancing when some requests are lightweight and some involve heavy computation. Moreover, because Node.js is single threaded, heavy requests can easily cause accidental lag for other users.
- Tasks are memory intensive. One or a couple of jobs may fit within Node.js memory limits, but the real workload sometimes causes out-of-memory errors due to the V8 heap limit (~1.5 GB by default).
- Reloading a web service is hard. If a task needs 5 minutes, you have to wait at least that long before reloading the service; otherwise, you risk breaking the task in progress.
From a "fire and forget" task to advanced usage
For the demo application, we'll use a fake file downloader that takes a URL and "downloads" it. In fact, it just uses setTimeout, to strip away as much unrelated code as possible.
Prerequisites
For the job queue we'll use the Kue npm module. Kue itself requires Redis >= 2.6.12. If you are using Ubuntu, run
$ sudo apt-get install redis-server
On other operating systems, either use a Docker container or follow an OS-specific installation manual.
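If you go the Docker route, running the official redis image with its default port published should be enough for this tutorial:

$ docker run -d -p 6379:6379 redis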
I wrote the samples as console applications to keep the code simple.
Basic job
Install the Kue module
$ npm install kue
Create a worker.js file.
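A minimal version might look like this; the 'download' job type name, the concurrency of 2, and the setTimeout-based fake downloader are illustrative choices for this tutorial, not anything Kue prescribes:

```js
var kue = require('kue');

// connects to Redis at localhost:6379 by default
var queue = kue.createQueue();

// fake downloader: pretends the download takes 2 seconds
function downloadDocument(job, url, done) {
  console.log('Got a file to download "%s"', url);
  setTimeout(function () {
    console.log('Finished downloading "%s"', url);
    done();
  }, 2000);
}

// process up to 2 'download' jobs concurrently
queue.process('download', 2, function (job, done) {
  downloadDocument(job, job.data.url, done);
});
```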
Create a client.js file.
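A matching minimal client might look like this; the URLs are placeholders, and removeOnComplete(true) tells Kue to delete each job once it finishes:

```js
var kue = require('kue');
var queue = kue.createQueue();

var urls = [
  'http://example.org/document-1.pdf',
  'http://example.org/document-2.pdf'
];

var pending = urls.length;
urls.forEach(function (url) {
  queue.create('download', { url: url })
    .removeOnComplete(true) // we don't track finished jobs yet
    .save(function (err) {
      if (err) {
        console.error('Failed to enqueue "%s"', url);
      }
      // once everything is saved, close the Redis connection
      // so the process can exit
      if (--pending === 0) {
        queue.shutdown(1000, function () {
          process.exit(0);
        });
      }
    });
});
```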
Now you can start the client
$ node client.js
It will exit shortly after launch. That's because we don't wait until the jobs are done; we just save them to the queue. Then you can start the worker
$ node worker.js
and it will write progress information to the console, something like
Got a file to download "http://example.org/document-1.pdf"
With a little effort, we can restart our client whenever needed without waiting for processing to finish, and even without waiting for a free worker to handle a job. By adjusting the concurrency, we can prevent out-of-memory errors, limit the number of HTTP connections, etc.
The complete source code for this tutorial can be found on GitHub.
Track how a job is going
In a real project, it's useful to report back to the user that a job is done (the video is converted, the import is complete, ...). Some information about progress also helps give the user a clue about when an operation will be completed. Let's see how we can get detailed information about a job.
Kue makes this easy: the worker calls a single method to report progress, and the client attaches event listeners to receive progress/completion notifications.
First, to get progress information, the job handler has to report it. Here is the modified version of the job handler:
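A sketch of how the modified handler might look; the 10 ticks of 200 ms are an arbitrary stand-in for real download work:

```js
// fake downloader that reports its progress in 10 steps
function downloadDocument(job, url, done) {
  console.log('Got a file to download "%s"', url);
  var total = 10;
  var completed = 0;
  var timer = setInterval(function () {
    completed++;
    job.progress(completed, total); // Kue stores this as a percentage
    if (completed >= total) {
      clearInterval(timer);
      done();
    }
  }, 200);
}
```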
Then we should listen for progress events on the client side of the queue. With listeners added, creating a job can look like this:
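A possible version of the whole function; note that job-level listeners like these only fire in the process that created the job:

```js
function createJob(url) {
  var job = queue.create('download', { url: url })
    .removeOnComplete(true)
    .save(function (err) {
      if (err) {
        console.error('Failed to enqueue "%s"', url);
      }
    });

  job.on('progress', function (progress) {
    console.log('Job #%d is %d%% done', job.id, progress);
  });

  job.on('complete', function () {
    console.log('Job #%d is complete', job.id);
  });
}
```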
You can find the complete code in the steps/tracking folder of the git repository. But what if an application starts a new task and then we restart the application? In that case, we have no event listeners and no way to attach them, because we have no job object. The solution is to listen for events on the queue instead of on the job object. The code will look like this:
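A sketch of the queue-level listeners; since these events carry only a job id, the 'job complete' handler fetches the job with kue.Job.get before reporting and removing it:

```js
// queue-level events are keyed by job id, so no reference to the
// job object is needed; they keep working after a client restart
queue.on('job progress', function (id, progress) {
  console.log('Job #%d is %d%% done', id, progress);
});

queue.on('job complete', function (id) {
  kue.Job.get(id, function (err, job) {
    if (err) return console.error(err);
    console.log('Downloading of "%s" is complete', job.data.url);
    job.remove(); // clean up by hand, since removeOnComplete is gone
  });
});
```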
Note that we removed removeOnComplete(true) to prevent jobs from being removed automatically. Otherwise, fetching the job in the 'job complete' handler would fail.
With this approach we can continue to track progress after a restart. But what happens to jobs that finish between restarts? There are two issues to deal with: we don't know the job is done, so we can't report it to the user, and the "lost" finished job sits in the queue eating Redis memory. To clean up completed tasks, we select jobs by status and do what's needed:
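One way to do it is kue.Job.rangeByState, which pages through jobs in a given state; the limit of 1000 used here is arbitrary:

```js
// clean up jobs which finished before the client started
kue.Job.rangeByState('complete', 0, 1000, 'asc', function (err, jobs) {
  if (err) return console.error(err);
  jobs.forEach(function (job) {
    console.log('Downloading of "%s" finished while we were away',
      job.data.url);
    job.remove();
  });
});
```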
So now we can report progress and completion status to the user, and we can catch up on anything we missed while we were down.
What to learn next
All the best games are easy to learn and difficult to master
-- Nolan Bushnell
Distributed job processing is easy to start with but has a notable number of edge cases. Here is a list of topics not covered in this tutorial that will be useful in complex systems:
- How to restart a task
- Collecting and displaying statistics
- How to take nodes out of service for maintenance
- Job priorities
Take a look at the Kue documentation for a better understanding of what it's capable of.