A Deep Dive on Rubrik
Temporary Data Loss is Unavoidable
Modern businesses depend on data for everything that they do. Employees build spreadsheets and presentations for internal use. Customer interactions generate data used for strategy and marketing efforts. Shareholders require detailed and versioned financial histories to be frequently published.
Understandably, permanent data loss, or even temporary data downtime, can cause real harm.
Data loss can result from software issues, lost hardware, network outages, corruption issues, insider or outsider hacks, or even natural disasters. For example, a series of earthquakes that hit Christchurch, New Zealand, in 2011 caused physical damage to local data centers, resulting in complete data loss from which many local businesses never fully recovered.1 Most corporate data loss results from plain old human error, but clearly the causes of such loss can be quite broad.
The most effective defense against data loss is to prevent issues before they occur, but since bumps in the road are inevitable, data backup and recovery is a must for enterprise resiliency.
Traditional Backup and Recovery
From the 1990s to the present day, the suite of technologies running in corporate data centers has been quite broad. Virtual machines from VMware power many applications and websites with which end users interact. Structured data is stored in Oracle or Microsoft SQL Server clusters. Whatever the underlying data stores, traditional backup and recovery processes have followed a somewhat similar pattern.
A set of backup proxy servers talk directly to production data sources to scrape data on a scheduled basis, perform any necessary compression and de-duplication, and then send it forward to data repositories. A data backup server combines heavyweight storage capabilities with inline software to store data backups and make them available upon request. In many deployments, a separate search server may be used to give faster insight to system admins as to what data is globally available in the data center and where specifically it can be found.
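To make the compression and de-duplication step concrete, here is a minimal Python sketch of what a backup proxy might do before forwarding data to a repository. It is an illustration under stated assumptions, not how any particular vendor implements it: the fixed `CHUNK_SIZE`, the in-memory `repository` dictionary, and the `backup` and `restore` function names are all hypothetical.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes


def backup(data: bytes, repository: dict) -> list:
    """Split data into chunks, store each unique chunk compressed, and
    return the ordered list of chunk digests needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repository:
            # De-duplication: only chunks not seen before are stored.
            repository[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return recipe


def restore(recipe: list, repository: dict) -> bytes:
    """Rebuild the original bytes by decompressing chunks in recipe order."""
    return b"".join(zlib.decompress(repository[d]) for d in recipe)
```

In this sketch, backing up two identical 4 KB chunks followed by a different one stores only two compressed chunks, while the recipe still lists three digests so the data can be rebuilt in order.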
The first layer of backup storage has traditionally been on hard disk drives, which provide a good mix of rapid access and storage capacity. As a more long term solution, data can be archived from disk to magnetic tape storage.
This tape storage is often kept offsite in "tape vaults".2 One does not need to be an engineer to understand that shipping data on a truck between data center and vault may lead to slow data recovery in the event of an outage. Interestingly, tape data storage is much more common than most people realize, even in modern data centers. IBM recently estimated that 345,000 exabytes (that's 345 billion terabytes) of data globally is stored on tape drives.3 To see this in real life, check out the website of Secure Data Recovery, which is in the business of helping enterprises recover data from damaged tapes.
Problems with Traditional Backup and Recovery
Since the technical problem at hand is "backup and recovery", it's easy to segment the issues with the architecture described above into precisely those two categories:
Backup — Operational overhead and poor scalability
Recovery — Painfully slow, while critical systems are down
The numerous backup components in the diagram above are usually managed by system admins themselves, not any third party. That means a significant amount of IT time is spent managing an ancillary aspect of the data center, rather than scaling and maintaining the infrastructure that more directly supports business functions. Some system administrators estimate that 25% of their time is spent managing data backups.4
In the event of primary data loss, data recovery speeds are influenced by both the software and the hardware of the backup system. Legacy software provides less flexibility in quickly understanding where specifically data resides in a backup repository, and older disk drives (and certainly tape storage) can't retrieve data as quickly as modern flash storage, resulting in slower data recovery speeds. Data recovery times will vary widely by data center architecture, and also the nature of the data loss, but they can often be on the order of a few days.5
Rubrik: Backup and Recovery for the Modern Enterprise
Meet Rubrik.
Rubrik is a data management company founded in 2013 and based in Palo Alto, CA. The founding team consists of Bipul Sinha (former partner at Lightspeed Ventures), Arvind Nithrakashyap (ex-Oracle), Soham Mazumdar (ex-Facebook), and Arvind Jain (ex-Google). The company has raised a total of $553M in funding from investors such as Lightspeed, Greylock, Khosla, IVP, and Bain Capital Ventures. No official IPO announcement has been made to date, though some expect the company may go public in the second half of 2021.6
Speaking on the product-centric motivation for starting the company, co-founder Arvind "Nitro" Nithrakashyap said:
When we looked at the modern-day enterprise, we saw a clear trend towards virtualization and leveraging the economics of the cloud, so we wondered...what if we were able to build a new platform from scratch that was built for this era, rather than try to force legacy solutions into this era.7
Rubrik has built a new data management solution from scratch to give IT agility back to customers. A few key advantages of Rubrik's platform are:
A software-defined intelligence layer (called the metadata layer)
Leveraging cloud-native architecture and cloud economics
An API-first approach to usability and user experience
I'm excited to dive a bit more into Rubrik's basic architecture and each of the advantages listed above. Needless to say, customers love using Rubrik's platform, and the company has experienced monster growth over the last few years, achieving a $600M revenue run rate as of the end of FY2020.8
Rubrik's Basic Architecture
The core physical component of Rubrik's system is called a brik, which is a single device shipped to the data center that contains physical storage and backup software. The secret sauce is not in the hardware itself, as Rubrik just uses commodity hardware components, but in the software that ships with the device.9 One or more briks can be racked in a company's data center and connected to the local network, after which they scan the network and find workloads that need to be protected (virtual machines, SQL databases, Windows machines, etc.).
The first tier of data backup is stored on the brik itself, and briks can be scaled horizontally to hold any amount of data. Additionally, Rubrik integrates nicely with public cloud providers (AWS and Azure shown below) to create a secondary archive of production data with relatively cheap cloud-scale storage technologies. This cloud archive is analogous to the tape storage that I discussed in part one of this article (but of course, easier to use and more powerful).
From a customer point of view, Rubrik uses a "set and forget" approach to data management, allowing users to specify how frequently they want backups to be taken, when and where data should be archived, and other simple configurations. Allowing customers to focus purely on the SLA of their production data, rather than the underlying mechanics of its storage and transport, is a huge lift for IT organizations.
For example, Home Depot deployed Rubrik as a backup and recovery solution throughout its network of 2,200 small, in-store data centers. As a result, Home Depot reduced the time it took to create local backups from 12 hours to 45 minutes. Also, Home Depot's data recovery times in the event of primary data loss went from 8 hours to 10 minutes, and it estimated in a Rubrik case study that it had saved 206 business days of productivity in the first year of using the product.10
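The "set and forget" idea described above, where the user states how often to back up and when to archive and the system handles the rest, can be sketched as a small declarative policy object. This is only an illustration: the `SlaPolicy` class, its field names, and the `snapshot_due` and `placement` helpers are hypothetical and are not taken from Rubrik's product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SlaPolicy:
    """Hypothetical declarative policy; field names are illustrative only."""
    name: str
    snapshot_every: timedelta  # how frequently a backup should be taken
    keep_local_for: timedelta  # how long a snapshot stays on local storage
    archive_target: str        # where older snapshots go (hypothetical label)


def snapshot_due(policy: SlaPolicy, last_snapshot: datetime, now: datetime) -> bool:
    """True when enough time has passed that a new snapshot is required."""
    return now - last_snapshot >= policy.snapshot_every


def placement(policy: SlaPolicy, taken_at: datetime, now: datetime) -> str:
    """Return 'local' for recent snapshots, else the policy's archive target."""
    if now - taken_at < policy.keep_local_for:
        return "local"
    return policy.archive_target
```

An administrator in this sketch would only write something like `SlaPolicy("gold", timedelta(hours=4), timedelta(days=30), "cloud-archive")`; scheduling and data movement follow from the policy rather than from hand-written jobs.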
Because Rubrik built a relatively new system from scratch, their architecture has a handful of key advantages compared with that of legacy products in the space.
Flexibility in the Metadata Layer
Rubrik took a software-first approach in building the storage orchestration system necessary to back up big data centers. I think of this similarly to how Cloudflare took software-defined networking to a new level in building its innovative CDN solution. Abstracting away the underlying hardware on which backups are stored gives Rubrik's platform an advantage in speed and flexibility.
This advantage shows up most clearly in Rubrik's data recovery times in the event of a production system outage. In the weeks and months before any outage, while Rubrik has been taking periodic data backups, the system tracks data about the data being backed up (aka metadata), creating a global index that allows for rapid data location and retrieval. In the case of a VMware virtual machine data loss, Rubrik can power up a replacement virtual machine within a few seconds or minutes by locating the virtual machine's data using its global index and live mounting the new virtual machine using the data residing in Rubrik storage. Only later does Rubrik need to copy the backed-up data into the production system asynchronously to restore the virtual machine to its desired state running on production hardware.
The very first core principle of Rubrik's architecture mentioned in its white paper explains it this way:
Software-defined: Rubrik consolidates disparate hardware and software components into a single software fabric. Enterprises can run Rubrik anywhere via plug-and-play appliances on-premises, as software on third-party hardware, or as software in the cloud.11
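The role of the global metadata index in fast recovery can be illustrated with a toy lookup structure: given a workload and a point in time, find where the newest usable snapshot lives without scanning the backup data itself. This is a sketch under my own assumptions, not Rubrik's actual design; the `SnapshotIndex` class, its methods, and the location strings are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class SnapshotIndex:
    """Toy global index: workload id -> time-ordered snapshot locations."""

    def __init__(self):
        self._times = defaultdict(list)      # workload -> sorted snapshot times
        self._locations = defaultdict(list)  # workload -> locations, same order

    def record(self, workload: str, taken_at: datetime, location: str) -> None:
        """Register a snapshot, keeping both lists sorted by time."""
        pos = bisect.bisect_right(self._times[workload], taken_at)
        self._times[workload].insert(pos, taken_at)
        self._locations[workload].insert(pos, location)

    def latest_before(self, workload: str, point_in_time: datetime):
        """Location of the newest snapshot at or before point_in_time, or None."""
        times = self._times.get(workload, [])
        pos = bisect.bisect_right(times, point_in_time)
        if pos == 0:
            return None
        return self._locations[workload][pos - 1]
```

Because the lookup is a binary search over metadata, finding the right snapshot stays cheap even with many snapshots per workload. A live mount would then point the replacement virtual machine at the returned location rather than first copying the data back to production.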
Built for the Cloud
I described above how long-term data archives have typically been kept on magnetic tape storage, often at a site remote from the data center known as a tape vault. The key difference between a data backup and a data archive is that backups need to be used relatively quickly to restore the primary system to a working state, while archives do not. Data archives are more of a last resort if all else fails.
Luckily for system administrators, archiving data in the era of cloud computing has reduced costs and recovery times. At high storage levels, archiving data in AWS S3 (the industry-leading cloud file storage solution) can cost as little as 2 cents per GB per month.12 This price point is significantly less than traditional tape storage costs, though tape storage has recently made some major strides in efficiency. In any case, when using cold storage in the cloud as a data archive, recovery times are on the order of a few hours, rather than a few days with offsite tape vaults. Rubrik's foresight to make cloud storage a key part of its platform from day one has paid off. Beyond just backup and recovery, Rubrik now positions itself as a fast track to the cloud for all kinds of IT organizations (just see Rubrik's current landing page as evidence).
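To put the 2 cents per GB per month figure in perspective, here is a small back-of-the-envelope calculation in Python. It assumes a flat rate, decimal units (1 TB = 1,000 GB), and no retrieval or request fees, all of which are simplifications of real cloud pricing; the function name is hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_archive_cost_usd(terabytes: int, cents_per_gb_month: int = 2) -> float:
    """Flat-rate storage cost in US dollars per month.

    Works in whole cents to avoid floating-point rounding surprises.
    """
    total_cents = terabytes * GB_PER_TB * cents_per_gb_month
    return total_cents / 100
```

Under these assumptions, a 100 TB archive works out to `monthly_archive_cost_usd(100)`, or 2,000 dollars per month.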
Rubrik of course doesn't limit itself to only AWS S3 for data archival, but has instead built its platform to be multi-cloud. This allows customers to work in whichever public cloud they already have data and applications.
An API-First Approach
As a newer product than legacy backup and recovery systems, Rubrik has the advantage of being built in the era of APIs focused on simplicity and usability. Legacy backup and recovery companies have often implemented user-accessible APIs as an afterthought or a bolt-on to their decades-old architecture. Because of this, data center admins have typically needed to do a lot of custom scripting in order to perform backups and archival according to their needs.
Rubrik, on the other hand, has made a clean API model a top priority of its product, giving customers easy customization and control. Additionally, this high level of control is available no matter the underlying hardware or type of data center, as shown in the diagram below:
The beauty of a unified API is that customers can work across clouds, or with hybrid clouds, and the API control layer remains consistent. One Rubrik customer, The CSI Companies, explained the difficulties of running its previous data management solution before switching to Rubrik:
With our old solution, there were far too many variables; we needed our own storage, our own switches, our own compute. We were spending far too much time scheduling jobs and figuring out optimal backup windows, not to mention dealing with backup failures.13
In early product overview tech talks, the Rubrik founding team touted its ability to turn complicated into simple, and clearly customers like The CSI Companies have benefitted from this improvement.
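The idea of one consistent control layer over different storage targets can be sketched with a small interface and interchangeable backends. This is an illustrative pattern only and does not reflect Rubrik's real API: `ArchiveBackend`, `InMemoryBackend`, `BackupController`, and the target labels are all hypothetical names, and the in-memory backend stands in for a real on-premises or cloud target.

```python
from abc import ABC, abstractmethod


class ArchiveBackend(ABC):
    """Hypothetical interface every storage target must implement."""

    @abstractmethod
    def put(self, key: str, blob: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryBackend(ArchiveBackend):
    """Stand-in for a real on-premises or cloud storage target."""

    def __init__(self, label: str):
        self.label = label
        self._blobs = {}

    def put(self, key: str, blob: bytes) -> None:
        self._blobs[key] = blob

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class BackupController:
    """Callers use the same two methods regardless of where data lands."""

    def __init__(self, backends: dict):
        self._backends = backends

    def archive(self, target: str, key: str, blob: bytes) -> None:
        self._backends[target].put(key, blob)

    def restore(self, target: str, key: str) -> bytes:
        return self._backends[target].get(key)
```

The design choice worth noting is that adding a new cloud means writing one more backend class, while every script written against `BackupController` keeps working unchanged.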
Rubrik to the Future
In my view, Rubrik's long-term resiliency is born out of its separation of the customer problem from implementation specifics. The importance of data management and loss prevention goes all the way back to the rise of corporate data centers in the 1950s and 1960s. Rubrik's platform solves the data management problem in a way that delights customers and allows them to focus more on their core businesses.
Though the implementation specifics of customer data centers are changing and will continue to change (the rise of cloud, hybrid cloud, multi-cloud), the problem of seamless data management and backup will remain as important as ever. Rubrik's product can thrive in any of these environments, giving the company an exciting future.
1. https://www.businessblogshub.com/2012/10/natural-disasters-and-data-loss/
2. https://ssgimaging.com/what-is-tape-vaulting
3. https://www.ibm.com/blogs/research/2020/12/tape-density-record/
4. https://www.solarwindsmsp.com/blog/backup-wasting-your-time-and-costing-you-money
5. https://prodatarecover.com/how-long-does-data-recovery-take/
6. https://victorkoch.medium.com/list-of-upcoming-ipos-in-2020-2021-part-8-52ab51c09b7c
7. https://www.youtube.com/watch?v=H5OmbAFWmVo
8. https://blocksandfiles.com/2020/02/22/rubrik-claims-600m-revenue-run-rate/
9. https://www.rubrik.com/content/dam/rubrik/en/resources/data-sheet/Spec-Sheet-Rubrik-Appliance-Specs-r6000.pdf
10. https://www.youtube.com/watch?v=LXEdrRpwWwY
11. https://www.rubrik.com/en
12. https://aws.amazon.com/s3/pricing/?nc=sn&loc=4
13. https://www.rubrik.com/content/dam/rubrik/en/resources/case-study/Customer-Success-Rubrik-and-CSI-Companies.pdf