A behind-the-scenes look at how we process and store our customers' brain scans.
At BrainKey, we need to analyze lots of scans (from users and from baseline datasets) and change our workflow quickly. To meet these goals, we developed Rollerblade, a data analysis engine that leverages cloud storage and Kubernetes container orchestration to take the challenge out of parallelizing our compute. This post describes how we adopted existing technologies and standards, and unified them to create the system that powers our backend. If you’re interested in the nuts-and-bolts details of how we process your scan, read on!
Bringing order to data storage with aBIDS
We use a number of publicly available brain imaging datasets to help build and test our systems. When we were first starting out, we had no convenient way to organize this data or select just the scans we wanted (such as those from people of a specific age, or taken using a specific imaging technique). To organize it, we adopted the Brain Imaging Data Structure (BIDS) standard (bids.neuroimaging.io). BIDS describes a consistent hierarchical structure for organizing data, extending from overarching Datasets down to individual Scans. By converting each public dataset to a BIDS structure, we made the datasets interoperable, pre-emptively solving most of our data-munging problems. Integrating this structure with our website’s internal database made the data searchable, giving us the concept of a “Query” – a subset of the database containing scans matching specified query parameters.
An aBIDS Query selects a subset of the database for download or analysis
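To make the idea of a Query concrete, here is a minimal sketch in Python. It is an illustration only, not the actual Rollerblade code: the `Scan` fields, the `run_query` function, and its parameters are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """One scan record; every field name here is an illustrative assumption."""
    dataset: str
    subject: str
    modality: str  # imaging technique, e.g. "T1w"
    age: int       # participant age in years


def run_query(scans, modality=None, min_age=None, max_age=None):
    """Return the scans that match every query parameter that was supplied."""
    matches = []
    for scan in scans:
        if modality is not None and scan.modality != modality:
            continue
        if min_age is not None and scan.age < min_age:
            continue
        if max_age is not None and scan.age > max_age:
            continue
        matches.append(scan)
    return matches


# A toy "database" spanning two datasets.
scans = [
    Scan("ds-a", "sub-01", "T1w", 34),
    Scan("ds-a", "sub-02", "T2w", 61),
    Scan("ds-b", "sub-01", "T1w", 67),
]

# Select T1w scans from participants aged 60 or older.
older_t1w = run_query(scans, modality="T1w", min_age=60)
```

Because every dataset shares the same structure, one filter like this can run across all of them without dataset-specific handling.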
We extended the BIDS specification to account for Derivatives – new files generated by running a specific computation on an object in the original BIDS hierarchy. (The official BIDS specification is considering its own derivative extension, but that wasn’t available while we were getting started.) We modestly call the resulting specification “advanced BIDS” (aBIDS). This allowed us to store the results of our analyses, avoiding duplicate effort and letting one analysis build on another.
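One way to picture how stored Derivatives avoid duplicate effort is the sketch below. The directory layout and the function names (`derivative_path`, `missing_derivatives`) are assumptions made for illustration; the real aBIDS layout may differ.

```python
from pathlib import PurePosixPath


def derivative_path(dataset, analysis, subject, filename):
    """Build a storage path for a derivative (hypothetical layout)."""
    return PurePosixPath(dataset, "derivatives", analysis, subject, filename)


def missing_derivatives(stored, dataset, analysis, subject, expected_files):
    """List the expected derivative files that are not yet in storage."""
    return [
        name
        for name in expected_files
        if derivative_path(dataset, analysis, subject, name) not in stored
    ]


# Pretend one derivative has already been computed and stored.
stored = {derivative_path("ds-a", "segment", "sub-01", "mask.nii.gz")}

# Only the file that is still missing needs to be computed.
todo = missing_derivatives(
    stored, "ds-a", "segment", "sub-01", ["mask.nii.gz", "volumes.json"]
)
```

A later analysis can look up an earlier analysis's output at a predictable path, which is what lets one analysis build on another.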
Data analysis: First steps with Docker
The aBIDS architecture helped us organize our data, but we still needed a way to apply the machine learning models we developed to analyze it – specifically, we needed to isolate the models to make deploying them on cloud-based virtual machines foolproof. To do this, we wrapped the models as Docker containers, invoking them from a script that downloaded unprocessed scans from the site and uploaded the analysis results. Our initial system looked something like this:
Our rough-and-ready prototype scan analysis backend
This system met our initial needs, but it didn’t scale well – we could only process one scan at a time! To put multiple containers to work at once, we created Redis queues on the server; our containers took turns pinging the queue and downloading any available tasks. While this helped us scale, the containers carried too much responsibility: each had to keep track of a scan’s progress through the pipeline using an ever-growing array of status flags. We ended up spending too much time fixing our analysis pipeline instead of actually improving our product.
Confused? So were we.
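The queue-polling pattern described above can be sketched as follows. To keep the example self-contained, Python's in-memory `queue.Queue` stands in for the Redis queue and threads stand in for containers; both substitutions, and all names, are assumptions for illustration.

```python
import queue
import threading


def worker(tasks, results):
    """Poll the queue until it is empty, handling one scan per task."""
    while True:
        try:
            scan_id = tasks.get_nowait()
        except queue.Empty:
            return  # no tasks left, so this worker stops
        # Placeholder for running a model on the scan.
        results.append(f"{scan_id}:done")
        tasks.task_done()


# Fill the stand-in queue with scans waiting to be processed.
tasks = queue.Queue()
for scan_id in ["scan-1", "scan-2", "scan-3", "scan-4"]:
    tasks.put(scan_id)

# Two workers take turns pulling tasks from the shared queue.
results = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that nothing here records where a scan sits in a multi-step pipeline; in this design that bookkeeping fell to the workers themselves.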
Running at scale: Kubernetes and the Analysis interface
At the same time, we wanted to test our analyses by running them on our aBIDS-formatted database, but this would take an unacceptably long time unless we could find a way of parallelizing the computation. Luckily, the modular nature of Docker containers lends itself well to parallelization with cluster management tools like Kubernetes, which carries out tasks by assigning containers among a cluster of machines based on resource availability. Further, we realized that we could “slice” our aBIDS queries, breaking them into individual items which could each be processed by a separate container.
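Slicing can be illustrated with a short sketch. Here a thread pool stands in for the cluster, and `slice_query` and `process_slice` are hypothetical names; in the real system each slice would go to its own container.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_query(scan_ids, slice_size=1):
    """Break a query result into independent slices of slice_size items."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    return [
        scan_ids[i:i + slice_size]
        for i in range(0, len(scan_ids), slice_size)
    ]


def process_slice(scan_slice):
    """Placeholder for the work one container would do on its slice."""
    return [f"{scan_id}_derivative" for scan_id in scan_slice]


query = ["scan-1", "scan-2", "scan-3"]
slices = slice_query(query)  # one scan per slice by default

# Each slice is independent of the others, so they can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    outputs = list(pool.map(process_slice, slices))
```

Since no slice depends on another, adding more workers shortens the total run time without any change to the per-slice code.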
This insight led us to create what we call an “Analysis.” At the core of each Analysis is a Docker container which obeys a simple standard for receiving and transmitting data – one slice of a Query (usually a single Scan) goes in, and one or more Derivatives come out. Additionally, the Analysis specifies the RAM and CPU resources it needs, as well as the names of all expected derivative files. Credentialed users can create an Analysis, generate a Query to select the scans they’re interested in, and then run the Analysis on those scans with a single command. When at rest, the data is stored in a cloud storage bucket, abstracting away most of the challenges involved in storing and sharing large datasets. Kubernetes takes care of provisioning the needed cloud resources and carrying out the analysis in parallel, allowing us to process thousands of scans in a matter of hours.
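As a rough sketch of what an Analysis definition might hold, the example below pairs a container image with its resource needs and turns it into a Kubernetes Job manifest for one scan. The `Analysis` class, the `job_manifest` function, the image name, and the `SCAN_ID` environment variable are hypothetical; only the manifest fields themselves (`batch/v1` Job, `resources.requests`, `restartPolicy`) come from the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Illustrative Analysis definition; field names are assumptions."""
    name: str
    image: str                   # Docker image that wraps the model
    cpu: str                     # Kubernetes CPU quantity, e.g. "2"
    memory: str                  # Kubernetes memory quantity, e.g. "8Gi"
    expected_derivatives: list   # names of the files the container produces


def job_manifest(analysis, scan_id):
    """Build a Kubernetes Job manifest that runs one Analysis on one scan."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Job names must be lowercase and DNS-compatible.
        "metadata": {"name": f"{analysis.name}-{scan_id}"},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": analysis.name,
                            "image": analysis.image,
                            # Hypothetical way of telling the container
                            # which slice of the Query to process.
                            "env": [{"name": "SCAN_ID", "value": scan_id}],
                            "resources": {
                                "requests": {
                                    "cpu": analysis.cpu,
                                    "memory": analysis.memory,
                                }
                            },
                        }
                    ],
                }
            }
        },
    }


segment = Analysis(
    name="segment",
    image="example.org/segment:1.0",  # placeholder image name
    cpu="2",
    memory="8Gi",
    expected_derivatives=["mask.nii.gz"],
)
manifest = job_manifest(segment, "scan-0001")
```

Declaring resource requests up front is what allows the scheduler to place each container on a machine with enough free CPU and memory.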
Putting it all together: ConfigMap and the analysis pipeline
Wrapping our models as Analyses gave us a way to use our massive dataset, but we still needed a way to orchestrate the complex workflow that swings into action every time a customer uploads a new scan. The ConfigMap feature provided by Kubernetes let us solve this problem; we scrapped the status flags and created an event-based pipeline. Now, whenever an analysis finishes, the pipeline consults our ConfigMap to determine whether another analysis needs to run. We also use the ConfigMap to “tag” the files that represent the final results of our automated analysis; these tags tell our user-facing frontend which files and metadata need to be displayed. In contrast to our original setup, this system is easy to update: adding or updating an analytical step is just a matter of changing a few lines in a config file, rather than refactoring multiple containers to listen for new status flags.
ConfigMaps simplify linking Analyses together
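The event-based approach can be sketched in a few lines. A plain Python dictionary stands in for the contents of the ConfigMap here, and the step names, keys, and functions are all hypothetical.

```python
# Stand-in for the pipeline configuration; in this sketch each step lists
# the analyses to trigger next and the output files tagged for display.
PIPELINE_CONFIG = {
    "preprocess": {"next": ["segment"], "display": []},
    "segment": {"next": ["measure"], "display": ["segmentation.nii.gz"]},
    "measure": {"next": [], "display": ["volumes.json"]},
}


def next_analyses(finished, config):
    """Return the analyses to start once `finished` has completed."""
    step = config.get(finished)
    if step is None:
        raise KeyError(f"unknown analysis: {finished}")
    return list(step["next"])


def display_files(config):
    """Collect every file tagged for display on the frontend."""
    files = []
    for step in config.values():
        files.extend(step["display"])
    return files


# When "segment" finishes, the config says "measure" should run next.
to_run = next_analyses("segment", PIPELINE_CONFIG)
tagged = display_files(PIPELINE_CONFIG)
```

Adding a step means adding one entry to the configuration and pointing an existing step's `next` list at it; no container code has to change.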
Using a Kubernetes cluster also relieves us of the need to manually scale the number of instances we use (and pay for) – cluster autoscaling automatically creates or destroys instances to track our workloads. This saves us money when the cluster is idle, and keeps our backlog to a minimum when a large number of tasks need to be completed.
Rollerblade and you
Rollerblade enables rapid iteration and data analysis at scale. It also includes a user account and permissions system; each Dataset and Analysis has an owner, who can grant or revoke others’ permission to access the data. These features make Rollerblade a unique collaborative tool for data storage and high-throughput analysis. If you have a storage, analysis, or collaboration need that you think could be addressed by a system like Rollerblade, please reach out to us at firstname.lastname@example.org!