A distributed computing structure for embarrassingly-parallel tasks

Background

In middle school, when I first heard about parallel computing, I thought to myself, "great, all I need to do is get a couple of computers, stick them together, and I'll have an awesome PC". Of course, things are a little more complicated than that. Making my way through graduate school, though, I revisited the idea: what is the best way to execute a large set of embarrassingly-parallel tasks -- computation jobs with minimal dependency between their components -- using as many of the resources available to me as possible? Can I bind a bunch of PCs together to work, unified, on a large job?

I led a project examining the best sampling design for clustered arrays of camera-trapping grids for ecological studies; as was typical in my laboratory, the main exploratory tool we used was simulation: generating data and analyzing it under our models to examine the bias and precision of the estimates under ideal circumstances. If we arranged our cameras one way, how much better is it than some other way? The project was ambitious in that it called for ~4 million simulations and analyses. Even at an optimistic 1 minute per simulation/analysis, a single computer would take ~8 years to get through them. The code was already fairly speedy, with the analysis handled by calls to C++ through Rcpp, so further improvements to the code would probably be marginal. Parallelization was required to cut down the time to analysis.

Simulation studies like this are easily parallelized, since each simulation has no bearing on the next. Cloud computing is certainly the way to go for the largest of data sets, but the cost incurred -- whatever it may be -- is often a deterrent for mid-range projects like ours, especially when we have so many good multi-core computers lying around. Since parallel computing within a single computer is easy enough thanks to doParallel, doSNOW, and other such packages, the next question is how to get multiple computers to do synchronized work among themselves.

Cluster computing isn't anything new, and you can put a cluster together yourself over a local network, but we were going to use all of our personal computers to tackle the job, some of them located several states apart. We could have used something like SSH to handle the communications, but I wanted to get up and running quickly, with minimal configuration on the node computers: a fire-and-forget system for bundling together a rag-tag team of personal computers, each added as easily as possible.

In the past I had used local records to keep track of which jobs my computer had completed, so I scaled that idea up into the core of the method: a remote SQL server hosting the record of tasks, the owner of each task, and their completion status.

The method

There are several components to the method: a remote SQL server, participant computers, a local SQL database on each participant, and a cloud storage service with API access. In short, computers reserve tasks from the SQL server, analyze them in parallel, mark them completed on the server, and upload the output data to cloud storage.

SQL server hosting task list

The core of the distribution method relies on a remote SQL database of tasks, their owners, and their completion status. SQL is well suited to this setup, as it allows in-place editing of data with locking, so no two computers can reserve the same tasks. I hosted a SQL server with gearhost and initialized a table of the 4M tasks to be completed. Each participant PC makes a transaction to reserve a batch of tasks equal to the square of its core count, so small PCs don't hog a lot of tasks and large PCs can do their work with minimal communication overhead. Since the performance increase is proportional to the ratio of computation to communication across parallel nodes, it's best to limit that communication as much as possible. A 28-core computer thus works on 784 tasks at a time, parallelizing those internally. The R package `DBI` handled all of the transactions with the SQL server.
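
As a rough illustration, the reservation step looks something like the sketch below. The connection details, the `tasks` table, and its columns (`task_id`, `owner`, `completed`) are placeholders rather than the project's actual schema, and I'm assuming a MySQL backend here.

```r
library(DBI)

## Minimal sketch of the reservation transaction; all names are illustrative.
con <- dbConnect(RMySQL::MySQL(), host = "example.gearhost.com",
                 dbname = "tasklist", user = "user", password = "pass")

batch_size <- parallel::detectCores()^2      # reserve cores-squared tasks
owner_id   <- Sys.info()[["nodename"]]       # identify this machine

dbBegin(con)
## Lock the next unowned rows and claim them in a single transaction, so no
## two machines can reserve the same tasks.
ids <- dbGetQuery(con, sprintf(
  "SELECT task_id FROM tasks WHERE owner IS NULL LIMIT %d FOR UPDATE",
  batch_size))$task_id
if (length(ids) > 0) {
  dbExecute(con, sprintf(
    "UPDATE tasks SET owner = '%s' WHERE task_id IN (%s)",
    owner_id, paste(ids, collapse = ",")))
}
dbCommit(con)
dbDisconnect(con)
```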

Local analysis

Within each PC, the reserved tasks are parallelized using the doParallel package. The PCs must know what `task = 1` means, so they refer to a local SQLite database for the settings with which to run each simulation. A local SQL database was used because loading all 4M rows of settings into memory would mean copying them to every parallel worker, leading to an unnecessary ballooning of memory use; a quick SQL query for one row at a time avoids that entirely.
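
A minimal sketch of that local loop might look like the following, assuming a local `settings.sqlite` file with a `settings` table keyed by `task_id`, a hypothetical `simulate_and_fit()` function for the actual analysis, and the vector of reserved task IDs (`ids`) from the reservation sketch above.

```r
library(DBI)
library(RSQLite)
library(doParallel)

## Spin up one worker per core and register the parallel backend.
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl)

results <- foreach(id = ids, .packages = c("DBI", "RSQLite")) %dopar% {
  ## Each worker pulls only the single row of settings it needs, rather than
  ## every worker holding a copy of all 4M rows in memory.
  db  <- dbConnect(SQLite(), "settings.sqlite")
  row <- dbGetQuery(db, sprintf("SELECT * FROM settings WHERE task_id = %d", id))
  dbDisconnect(db)
  simulate_and_fit(row)   # hypothetical simulation/analysis function
}

stopCluster(cl)
```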

Once the PCs finished their batch, they performed another transaction with the remote SQL server to mark the tasks complete and record the time of completion.
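
That bookkeeping is just one more small transaction, along these lines (reusing the illustrative connection and schema from above; `completed_at` is a stand-in column name):

```r
library(DBI)

## Mark the finished batch complete and stamp the completion time.
con <- dbConnect(RMySQL::MySQL(), host = "example.gearhost.com",
                 dbname = "tasklist", user = "user", password = "pass")
dbExecute(con, sprintf(
  "UPDATE tasks SET completed = 1, completed_at = NOW() WHERE task_id IN (%s)",
  paste(ids, collapse = ",")))
dbDisconnect(con)
```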

Cloud storage

I used the Box API through the R package 'boxr' to handle automated uploads of the output. Every so often, the local output directory would be scanned, and new files uploaded to a central location from all participant computers.
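
A stripped-down version of that upload loop might look like this; `box_auth()` assumes Box OAuth credentials have already been configured, and the folder ID and `output` directory are placeholders.

```r
library(boxr)

## Authenticate once, then periodically push any new output files to Box.
box_auth()

already_uploaded <- character(0)
repeat {
  new_files <- setdiff(list.files("output", full.names = TRUE), already_uploaded)
  for (f in new_files) box_ul(dir_id = 123456789, file = f)  # shared Box folder
  already_uploaded <- c(already_uploaded, new_files)
  Sys.sleep(600)   # re-scan the output directory every 10 minutes
}
```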

Subroutines

The SQL transactions and the automated uploads take time, and it would not do to have the main analysis script waiting on them before moving on to other jobs. The best course of action I found was to use a system call to Rscript.exe to run those steps in separate instances of R. All the main script does is reserve a set of tasks, perform the analysis, mark the tasks complete, and repeat, minimizing communication overhead.
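
Launching one of those side tasks is essentially a one-liner, where `upload_results.R` stands in for a hypothetical script holding the boxr/SQL bookkeeping code:

```r
## Hand the slow bookkeeping off to a separate R instance so the main
## analysis loop never blocks on it (Windows paths, per the machines below).
rscript <- file.path(R.home("bin"), "Rscript.exe")
system2(rscript, args = "upload_results.R", wait = FALSE)
```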

Results

We are on track to complete all 4 million jobs by December 2018, a mere seven months after starting in May 2018 -- roughly a 13x improvement over the ~8 years a single computer would have needed.

At our disposal were the following computers, all Windows machines:

  • 1 - 40-core desktop
  • 2 - 28-core desktops
  • 1 - 28-core cloud-computing instance
  • 5 - 8-core desktops