We introduced the first GitLab Reference Architectures five years ago. Originally developed as a partnership between the GitLab Test Platform (formerly Quality Engineering) and Support teams, along with other contributors, these architectures aim to provide scalable and elastic starting points for deploying GitLab at scale, tailored to an organization's target load.
Since their debut, we've been thrilled to see the impact these architectures have had on our customers as they navigate their DevSecOps journey. We continue to iterate, expand, and refine the architectures, reflecting our commitment to providing you with the latest, best-in-class guidance on deploying, scaling, and maintaining your GitLab environments.
In recognition of the five-year milestone, here is a peek behind the curtain on how we designed the Reference Architectures and how that design still applies today.
The problem
Before introducing the Reference Architectures, we frequently heard from our customers about the hurdles they faced when deploying GitLab at scale to meet their performance and availability goals.
While every GitLab environment is somewhat unique because it must meet a customer's own requirements, we recognized from running GitLab.com, as well as from our larger customers, that there were common fundamentals to deploying GitLab at scale that were worth sharing. Our objective was to address customer needs while promoting deployment best practices to reduce drift and increase alignment.
Simultaneously, we wanted to significantly expand our performance testing efforts. The goals of this expansion were to provide our engineering teams with a deeper understanding of performance bottlenecks, to drive improvements in GitLab's performance, and to continuously test the application moving forward to ensure it remained performant. However, to conduct meaningful performance tests, we needed a standardized GitLab environment design capable of handling the target loads.
Enter the Reference Architectures.
The goals
With the need for a common architecture clear, we next set the goals of the initiative, which ultimately became the following:
- Performance: Ensure the architecture can handle the target load efficiently.
- Availability: Maximize uptime and reliability wherever possible.
- Scalability and elasticity: Ensure the architecture is scalable and elastic to meet individual customer needs.
- Cost-effectiveness: Optimize resource allocation to avoid unnecessary expenses.
- Maintainability: Make the architecture deployment and management as straightforward as possible with standardized configurations.
Note that these goals are in no particular order, and we stay true to them today.
The process
Once the goals were set, we faced the challenge of designing an architecture, validating it, and making sure that it was fit for purpose and met those goals.
The process itself was relatively simple in design:
- Gather metrics on existing environments and the loads they were able to handle.
- Define a prototype architecture based on these metrics.
- Build and test the environment to validate.
- Adjust the environment iteratively based on the test results and metrics until we had a validated architecture that met the goals.
The process was simple in design but, of course, not in practice, so we got to work.
First, we collected and reviewed the data. We examined metrics and logging data from GitLab.com, as well as from several participating large customers, to correlate the size of each deployed environment with the load it was handling. To do this, we needed an objective and quantifiable way to measure that load across any environment, and for that we used Requests per Second (RPS). With RPS we could see the concurrent load each environment handled and correlate it with the user count. Specifically, a user count would correlate to the full manual and automated load (such as continuous integration). From that data, we were able to draw this correlation across several environment sizes and start to pick out common patterns for the architectures.
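As a rough illustration of the kind of measurement described above, the following Python sketch derives average and peak RPS from a list of request timestamps and converts an RPS figure into an approximate user count. It is not the tooling we used: the function names are hypothetical, and the conversion assumes that load scales linearly from the 200 RPS / 10,000-user pairing mentioned later in this post.

```python
from collections import Counter
from datetime import datetime

# Assumption: linear scaling from the 200 RPS / 10,000-user pairing
# given in this post. Real environments may deviate from this ratio.
USERS_PER_RPS = 10_000 / 200


def requests_per_second(timestamps):
    """Return (average_rps, peak_rps) for an iterable of datetime objects.

    Requests are bucketed into one-second windows. The average is taken
    over the whole span from the first to the last observed second,
    including seconds in which no request arrived.
    """
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    if not per_second:
        return 0.0, 0
    first, last = min(per_second), max(per_second)
    window_seconds = int((last - first).total_seconds()) + 1
    total_requests = sum(per_second.values())
    return total_requests / window_seconds, max(per_second.values())


def estimated_users(rps):
    """Approximate the user count a given RPS corresponds to (hypothetical helper)."""
    return round(rps * USERS_PER_RPS)


if __name__ == "__main__":
    # Illustrative timestamps only, not real log data.
    sample = [
        datetime(2024, 1, 1, 0, 0, 0, 100_000),
        datetime(2024, 1, 1, 0, 0, 0, 500_000),
        datetime(2024, 1, 1, 0, 0, 1),
    ]
    average, peak = requests_per_second(sample)
    print(f"average={average:.2f} RPS, peak={peak} RPS")
    print(f"200 RPS is roughly {estimated_users(200)} users")
```

In practice the timestamps would come from parsed access logs or a metrics system, and the peak figure is usually the more useful one for sizing, since an environment has to cope with its busiest periods rather than its average.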
Next, we started with a prototype architecture that aimed to meet the goals while cross-referencing with the data we collected. In fact, we began this step alongside the first, as we had a good enough idea of where to start: taking the fundamental GitLab.com design and scaling it down for individual customer loads in cost-effective ways. This allowed us to start performance testing the prototype and corroborate the results against the data we were analyzing. After quite a few iterations, we had a starting point for our prototype architecture.
To thoroughly test and validate the architecture, we needed to turn to performance testing and define our methodology. The approach was to target our most common endpoints with a representative test data set at representative RPS loads. Then, although we had built the prototype architecture manually, we knew we needed tooling to automatically build environments and handle tasks such as updates. These efforts resulted in the GitLab Performance Tool and GitLab Environment Toolkit, which I blogged about previously and which we continue to use to this day (and you can use too!).
With all the above in place, we started the main work of validating the prototype architecture through multiple cycles of testing and iterating. In each cycle, we would performance test the environment, review the results and metrics, and adjust the environment accordingly. Through iteration we were able to identify which failures were real application performance issues and which were environmental, and eventually we had our first architecture. That architecture is now known as the 200 RPS or 10,000-user Reference Architecture.
Where Reference Architectures are today
Since publishing our first validated Reference Architecture, the work has never stopped! We like to describe the architectures as living documentation, as they're constantly being improved and expanded with additions such as:
- various Reference Architecture sizes based on common deployments
- non-highly available sizes for smaller environments
- full step-by-step documentation in collaboration with our colleagues in Technical Writing and Support
- expanded guidance and a new naming scheme to help with right-sizing, scaling, and dealing with outliers such as monorepos
- cloud native hybrid variants where select components are run in Kubernetes
- recommendations and guidance for cloud provider services
- and more! Check out the update history section in the Reference Architecture documentation!
All this is driven by our comprehensive testing program that we built alongside the Reference Architectures to continuously test that they remain fit for purpose against the latest GitLab code every single week and to catch any unexpected performance issues early.
And we're thrilled to see that these efforts have helped numerous customers to date, as well as helping our own engineering teams deliver new, exciting services. In fact, our engineering teams used the Reference Architectures to develop GitLab Dedicated. Five years on, our commitment is stronger than ever. The work continues in the same way it started, to ensure you have best-in-class guidance for your DevSecOps journey.
Learn more about GitLab Reference Architectures.