Appendix C Box: Case Study

The following was originally published on Kubernetes.io by the CNCF and is used here with permission.

In the summer of 2014, Box was feeling the pain of a decade’s worth of hardware and software infrastructure that wasn’t keeping up with the company’s needs.

A platform that allows its more than 50 million users (including governments and big businesses like General Electric) to manage and share content in the cloud, Box was originally a PHP monolith of millions of lines of code built with bare metal inside of its own data centers. It had already begun to slowly chip away at the monolith, decomposing it into microservices. “And as we were expanding into regions around the globe, the public cloud wars were heating up and we started to focus on how to run our workload across many different environments and many different cloud infrastructure providers,” says Box co-founder and services architect Sam Ghods. “It’s been a huge challenge thus far because all these different providers, especially bare metal, have very different interfaces and ways in which you work with them.”

Box’s cloud native journey accelerated that June when Ghods attended DockerCon. The company had come to the realization that it could no longer run its applications only off bare metal and was researching containerizing with Docker, virtualizing with OpenStack, and supporting public cloud.

At that conference, Google announced the release of its Kubernetes container management system, and Ghods was won over. “We looked at a lot of different options, but Kubernetes really stood out, especially because of the incredibly strong team of Borg veterans and the vision of having a completely infrastructure-agnostic way of being able to run cloud software,” he says, referencing Google’s internal container orchestrator Borg. “The fact that on day one it was designed to run on bare metal just as well as Google Cloud meant that we could actually migrate to it inside of our data centers, and then use those same tools and concepts to run across public cloud providers as well.”

Another plus: Ghods liked that Kubernetes has a universal set of API objects like pods, services, replica sets, and deployments, which created a consistent surface to build tooling against. “Even PaaS layers like OpenShift or Deis that build on top of Kubernetes still treat those objects as first-class principles,” he says. “We were excited about having these abstractions shared across the entire ecosystem, which would result in a lot more momentum than we saw in other potential solutions.”

Box deployed Kubernetes in a cluster in a production data center just six months later. Kubernetes was then still pre-beta, on version 0.11. They started small: the very first thing Ghods’s team ran on Kubernetes was a Box API monitor that confirms Box is up. “It was just a test service to get the whole pipeline functioning,” he says. Next came some daemons that process jobs, which are “nice and safe because if they experienced any interruptions, we wouldn’t fail synchronous incoming requests from customers.”

The first live service, which the team could route to and ask for information, was launched a few months later. At that point, Ghods says, “We were comfortable with the stability of the Kubernetes cluster. We started to port some services over, then we would increase the cluster size and port a few more, and that’s ended up to about 100 servers in each data center that are dedicated purely to Kubernetes. And that’s going to be expanding a lot over the next 12 months, to hundreds, then thousands.”

While observing teams who began to use Kubernetes for their microservices, “we immediately saw an uptick in the number of microservices being released,” Ghods notes. “There was clearly a pent-up demand for a better way of building software through microservices, and the increase in agility helped our developers be more productive and make better architectural choices.”

Ghods reflects that as early adopters, Box had a different journey from what companies experience now. “We were definitely lock step with waiting for certain things to stabilize or features to get released,” he says. “In the early days we were doing a lot of contributions [to components such as kubectl apply] and waiting for Kubernetes to release each of them, and then we’d upgrade, contribute more, and go back and forth several times. The entire project took about eighteen months from our first real deployment on Kubernetes to having general availability. If we did that exact same thing today, it would probably be less than six.”

In any case, Box didn’t have to make too many modifications to Kubernetes for it to work for the company. “The vast majority of the work our team has done to implement Kubernetes at Box has been making it work inside of our existing (and often legacy) infrastructure,” says Ghods, “such as upgrading our base operating system from RHEL6 to RHEL7 or integrating it into Nagios, our monitoring infrastructure. But overall Kubernetes has been remarkably flexible with fitting into many of our constraints, and we’ve been running it very successfully on our bare metal infrastructure.”

Perhaps the bigger challenge for Box was a cultural one. “Kubernetes, and cloud native in general, represents a pretty big paradigm shift, and it’s not very incremental,” Ghods says. “We’re essentially making this pitch that Kubernetes is going to solve everything because it does things the right way and everything is just suddenly better. But it’s important to keep in mind that it’s not nearly as proven as many other solutions out there. You can’t say how long this or that company took to do it because there just aren’t that many yet. Our team had to really fight for resources because our project was a bit of a moonshot.”

Having learned from experience, Ghods offers these two pieces of advice for companies going through similar challenges:

  1. Deliver early and often. Service discovery was a huge problem for Box, and the team had to decide whether to build an interim solution or wait for Kubernetes to natively satisfy Box’s unique requirements. After much debate, “we just started focusing on delivering something that works, and then dealing with potentially migrating to a more native solution later,” Ghods says. “The above-all-else target for the team should always be to serve real production use cases on the infrastructure, no matter how trivial. This helps keep the momentum going both for the team itself and for the organizational perception of the project.”
  2. Keep an open mind about what your company has to abstract away from developers and what it doesn’t. Early on, the team built an abstraction on top of Dockerfiles to help ensure that all container images had the right security updates. This turned out to be superfluous work since container images are immutable and you can instead scan them post-build to ensure they do not contain vulnerabilities. Because managing infrastructure through containerization is such a discontinuous leap, it’s better to start by working directly with the native tools and learning their unique advantages and caveats. An abstraction should be built only after a practical need for it arises.

In the end, the impact has been powerful. “Before Kubernetes,” Ghods says, “our infrastructure was so antiquated it was taking us over six months to deploy a new microservice. Now a new microservice takes less than five days to deploy. And we’re working on getting it to less than a day. Granted, much of that six months was due to how broken our systems were, but bare metal is intrinsically a difficult platform to support unless you have a system like Kubernetes to help manage it.”

By Ghods’s estimate, Box is still several years away from his goal of being a 90-plus percent Kubernetes shop. “So far we’ve accomplished having a stable, mission-critical Kubernetes deployment that provides a lot of value,” he says. “Right now about 10 percent of all of our compute runs on Kubernetes, and I think in the next year we’ll likely get over half. We’re working hard on enabling all stateless service use cases, and plan to shift our focus to stateful services after that.”

In fact, that’s what he envisions across the industry: Ghods predicts that Kubernetes has the opportunity to be the new cloud platform. Kubernetes provides an API consistent across different cloud platforms including bare metal, and “I don’t think people have seen the full potential of what’s possible when you can program against one single interface,” he says. “The same way AWS changed infrastructure so that you don’t have to think about servers or cabinets or networking equipment anymore, Kubernetes enables you to focus exclusively on the software that you’re running, which is pretty exciting. That’s the vision.”

Ghods points to projects that are already in development or recently released for Kubernetes as a cloud platform: cluster federation, the Dashboard UI, and CoreOS’s etcd operator. “I honestly believe it’s the most exciting thing I’ve seen in cloud infrastructure,” he says, “because it’s a never-before-seen level of automation and intelligence surrounding infrastructure that is portable and agnostic to every infrastructure platform.”

Box, with its early decision to use bare metal, embarked on its Kubernetes journey out of necessity. But Ghods says that even if companies don’t have to be agnostic about cloud providers today, Kubernetes may soon become the industry standard, as more and more tooling and extensions are built around the API.

“The same way it doesn’t make sense to deviate from Linux because it’s such a standard,” Ghods says, “I think Kubernetes is going down the same path. It’s still early days—the documentation still needs work and the user experience for writing and publishing specs to the Kubernetes clusters is still rough. When you’re on the cutting edge you can expect to bleed a little. But the bottom line is, this is where the industry is going. Three to five years from now it’s really going to be shocking if you run your infrastructure any other way.”