Rollerblade: The BrainKey Data Analysis Engine

Rollerblade is a unique collaborative tool for data storage and high-throughput analysis. It enables rapid iteration and data analysis at scale. It also includes a user account and permissions system; each Dataset and Analysis has an owner, who can grant or revoke permission to others to access the data.

The Problem

It has become easy to build simple AI models that draw on a single data source.

But if you’re in medicine, you need to build smart AI that incorporates data from many different sources and locations.

Our Solution: Rollerblade

Rollerblade helps you build smart AI that is robust, reliable and reproducible. Rollerblade addresses challenges associated with storing and analyzing complex datasets, making organizing data and automating analysis workflows simple and scalable. It provides a simple, intuitive interface which leverages the scalability and parallelism of cloud computing.

Behind the scenes: how we process and store our customers’ brain scans

At BrainKey, we need to analyze lots of scans (from users and from baseline datasets) and change our workflow quickly. To meet these goals, we developed Rollerblade, a data analysis engine that leverages cloud storage and Kubernetes container orchestration to take the challenge out of parallelizing our compute. We adopted existing technologies and standards, and unified them to create the system that powers our backend.

An aBIDS Query selects a subset of the database for download or analysis

Bringing order to data storage with aBIDS

We use a number of publicly available brain imaging datasets to help build and test our systems. When we were first starting out, we had no convenient way to organize this data or select just the scans we wanted (such as those from people of a specific age, or taken using a specific imaging technique). To organize this data, we created the advanced Brain Imaging Data Structure (aBIDS), which builds on the BIDS standard (bids.neuroimaging.io).
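The details of aBIDS are internal to BrainKey, but the idea of a Query that selects a subset of scans can be sketched as follows. The `Query` class, the chainable `where` interface, and the metadata field names below are illustrative assumptions, not the real aBIDS API:

```python
# Illustrative sketch of an aBIDS-style query: filter scan metadata
# by subject age and imaging modality. The Query class and all field
# names here are assumptions for illustration, not BrainKey's API.

class Query:
    def __init__(self, scans):
        self.scans = scans      # list of scan-metadata dicts
        self.filters = []       # predicates applied lazily at run()

    def where(self, predicate):
        self.filters.append(predicate)
        return self             # chainable, like a query builder

    def run(self):
        results = self.scans
        for predicate in self.filters:
            results = [s for s in results if predicate(s)]
        return results

# Example metadata in a BIDS-like shape (subject, age, modality).
scans = [
    {"subject": "sub-01", "age": 34, "modality": "T1w"},
    {"subject": "sub-02", "age": 71, "modality": "T1w"},
    {"subject": "sub-03", "age": 68, "modality": "bold"},
]

# Select T1-weighted scans from subjects aged 65 or older.
elderly_t1 = (
    Query(scans)
    .where(lambda s: s["age"] >= 65)
    .where(lambda s: s["modality"] == "T1w")
    .run()
)
```

The result of such a query can then be downloaded or fed into an analysis, one slice per worker.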

Data analysis: First steps with Docker

The aBIDS architecture helped us organize our data, but we still needed a way to apply the machine learning models we developed to analyze it – specifically, we needed to isolate the models to make deploying them on cloud-based virtual machines foolproof. To do this, we wrapped the models as Docker containers, invoking them from a script that downloaded unprocessed scans from the site and uploaded the analysis results. Our initial system looked something like this:
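A minimal sketch of that prototype wrapper script is shown below. The container image name, mount paths, and the elided download/upload steps are hypothetical placeholders, not BrainKey’s actual setup:

```python
# Sketch of the prototype wrapper: for each unprocessed scan, run a
# model container against it, then upload the results. The image name
# and paths are hypothetical; download/upload steps are stubbed out.
import subprocess

MODEL_IMAGE = "brainkey/segmentation-model:latest"  # hypothetical image

def build_docker_cmd(scan_path, output_dir):
    """Construct the `docker run` invocation for one scan, mounting
    the scan (read-only) and an output directory into the container."""
    return [
        "docker", "run", "--rm",
        "-v", f"{scan_path}:/input/scan.nii.gz:ro",
        "-v", f"{output_dir}:/output",
        MODEL_IMAGE,
    ]

def process_scan(scan_path, output_dir, runner=subprocess.run):
    # 1. download step would go here (fetch the scan from the site)
    cmd = build_docker_cmd(scan_path, output_dir)
    runner(cmd, check=True)  # 2. run the containerized model
    # 3. upload step would go here (push /output back to the site)
```

Passing `runner` in as a parameter keeps the Docker invocation testable without actually launching a container.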

Our rough-and-ready prototype scan analysis backend.

Running at scale: Kubernetes and the Analysis interface

Our prototype showed that containerized models could run reliably in the cloud, but it didn’t scale on its own. Formalizing that pattern led us to create what we call an “Analysis.” At the core of each Analysis is a Docker container that obeys a simple standard for receiving and transmitting data – one slice of a Query (usually a single Scan) goes in, and one or more Derivatives come out. Each Analysis also specifies the RAM and CPU resources it needs, as well as the names of all expected derivative files. Credentialed users can create an Analysis, generate a Query to select the scans they’re interested in, and then run the Analysis on those scans with a single command. At rest, the data is stored in a cloud storage bucket, abstracting away most of the challenges involved in storing and sharing large datasets. Kubernetes takes care of provisioning the needed cloud resources and carrying out the analyses in parallel, allowing us to process thousands of scans in a matter of hours.
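One natural way such an Analysis could map onto Kubernetes is as a Job manifest: one container, one input slice, declared CPU/RAM resources, and the expected derivative filenames. The sketch below builds that manifest as a Python dict; the image name, bucket path, and environment-variable convention are illustrative assumptions, not BrainKey’s actual schema:

```python
# Sketch: render an Analysis as a Kubernetes Job manifest (a dict).
# Image names, URIs, and the env-var convention are hypothetical.

def analysis_to_job(name, image, scan_uri, cpu, memory, derivatives):
    """One Analysis, one input slice, declared resources and outputs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        # The container learns its input slice and the
                        # expected outputs from environment variables.
                        "env": [
                            {"name": "INPUT_SCAN", "value": scan_uri},
                            {"name": "EXPECTED_DERIVATIVES",
                             "value": ",".join(derivatives)},
                        ],
                        # The Analysis declares the RAM/CPU it needs.
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                            "limits": {"cpu": cpu, "memory": memory},
                        },
                    }],
                }
            }
        },
    }

job = analysis_to_job(
    name="brain-volume",
    image="brainkey/brain-volume:1.0",         # hypothetical image
    scan_uri="gs://bucket/sub-01/T1w.nii.gz",  # hypothetical bucket path
    cpu="2", memory="8Gi",
    derivatives=["volume_report.json", "segmentation.nii.gz"],
)
```

Running an Analysis over a whole Query then amounts to emitting one such Job per slice and letting Kubernetes schedule them in parallel.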

ConfigMaps simplify linking Analyses together

Putting it all together: ConfigMap and the analysis pipeline

Wrapping our models as Analyses gave us a way to utilize our massive dataset, but we still needed a way to orchestrate the complex workflow that swings into action every time a customer uploads a new scan. The ConfigMap feature provided by Kubernetes let us solve this problem: we scrapped the status flags our prototype relied on and created an event-based pipeline. Now, whenever an analysis finishes, we consult the ConfigMap to determine whether another analysis needs to run. We also use the ConfigMap to “tag” the files that represent the final results of our automated analysis; these tags tell our user-facing frontend which files and metadata need to be displayed.
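The event-driven logic can be sketched as a lookup table keyed by analysis name, mapping each completed analysis to its successors and to the output files to tag for the frontend. The pipeline structure, analysis names, and filenames below are illustrative assumptions, not the real ConfigMap contents:

```python
# Sketch of ConfigMap-driven orchestration: when an analysis finishes,
# look up which analyses run next and which outputs get tagged for the
# frontend. All names and the table structure are hypothetical.

PIPELINE_CONFIG = {
    "skull-strip":  {"next": ["segmentation"], "tag": []},
    "segmentation": {"next": ["brain-volume"], "tag": ["segmentation.nii.gz"]},
    "brain-volume": {"next": [],               "tag": ["volume_report.json"]},
}

def on_analysis_finished(analysis_name, launch, tag_file):
    """React to a completed analysis: launch successors, tag results."""
    entry = PIPELINE_CONFIG.get(analysis_name, {"next": [], "tag": []})
    for successor in entry["next"]:
        launch(successor)       # e.g. create a new Kubernetes Job
    for filename in entry["tag"]:
        tag_file(filename)      # mark the file for frontend display
```

Because the table lives in a ConfigMap rather than in code, the pipeline can be rewired without rebuilding or redeploying any container.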

Using a Kubernetes cluster also relieves us of the need to manually scale the number of instances we use (and pay for) – cluster autoscaling lets us automatically create or destroy instances to track our workloads. This lets us save money when the cluster is idle, but keeps our backlog at a minimum when a large number of tasks need to be completed.

Rollerblade and you

Rollerblade enables rapid iteration and data analysis at scale, and its account and permissions system (each Dataset and Analysis has an owner who can grant or revoke access) makes it a unique collaborative tool for data storage and high-throughput analysis. If you have a storage, analysis, or collaboration need that you think could be addressed by a system like Rollerblade, please reach out to us at research@brainkey.ai!

Ready to explore your brain?
