The following page may contain information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only; please do not rely on it for purchasing or planning purposes. As with all projects, the items mentioned on this page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
The Scalability group is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms. The group works in close coordination with the Platform Engineering teams. We support other Engineering teams by sharing data and techniques so they can become better at scalability as well.
As its name implies, the Scalability group enhances the availability, reliability and performance of GitLab's SaaS platforms by observing how the application operates at scale.
The Scalability group analyzes application performance on GitLab's SaaS platforms, identifies bottlenecks in service availability, proposes (and develops) short-term improvements, and develops long-term plans that help drive the decisions of other Engineering teams.
GitLab teams have a large amount of autonomy and are empowered to work on the things that are most important for their stage and user base. This is great for the development of features, categories and stages, but can create a challenging environment for operating a platform at scale. Since stage teams are empowered to make the changes they need and GitLab operates with a bias for action, stage teams may decide that a shared implementation does not fit their requirements and end up building their own. This can lead to redundancy and an inability to share and re-use code, and ultimately increases GitLab's tech debt. It is therefore important to balance the overall velocity and scalability of GitLab with individual stage teams' desire to ship value to our customers.
Discoverability is also a significant challenge in the platform space. It is vital that users of platform tools are able to quickly discover and implement shared tools and best practices. If the tools are not flexible, easy to discover and easy to implement, they may hurt feature velocity rather than increase it.
The Scalability group often becomes the owner of components, such as Redis, and is responsible for maintaining them in an operational sense. This work can shift capacity away from other enabling tools that would make GitLab easier to scale; in an ideal world, Scalability would be able to hand over tools that allow teams to maintain their own components.
Lastly, considering the image below, Scalability tends to operate on the right hand side of the graph, after changes are deployed to our SaaS platforms. This can mean that our work is reactive in nature and we often treat symptoms of bad health in the platforms instead of root causes. Shifting the scaling concerns left, earlier in the software development lifecycle, will help us to scale our SaaS platforms more efficiently.
Two examples where the Scalability group has already shifted left are Error Budgets and Capacity Planning.
The Observability team within the Scalability group is responsible for architecting, provisioning and operating the observability platforms that enable us to operate the SaaS platforms at GitLab.
Observability and automation will become more critical as the number of instances of GitLab increases. The growth in instances is being driven by the introduction of GitLab Dedicated and the move toward a cellular architecture for our multi-tenant SaaS solution.
We envision “Observability Units” rolled out alongside our GitLab instances, connecting into a Global observability stack that gives us insight into system health across all of GitLab and directs us to issues with actionable insights. These “Observability Units” will be durable and capable of running independently without connection to the global stack. The global stack will be eventually consistent with all of the units in the fleet, whilst preserving the data security and visibility requirements that we ensure today.
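As a rough illustration of this store-and-forward shape, the sketch below (in Python, with hypothetical paths, endpoints and payloads; not an actual implementation) shows how a unit could persist samples durably on local disk and forward them to a global ingest endpoint only when it is reachable, which is what would make the global stack eventually consistent with the fleet.

```python
import json
import time
import urllib.request
from pathlib import Path

BUFFER_DIR = Path("/var/lib/observability-unit/buffer")  # hypothetical local spool
GLOBAL_INGEST_URL = "https://global-observability.example.com/ingest"  # hypothetical endpoint


def record_sample(sample: dict) -> None:
    """Persist the sample locally first, so the unit keeps working while disconnected."""
    BUFFER_DIR.mkdir(parents=True, exist_ok=True)
    (BUFFER_DIR / f"{time.time_ns()}.json").write_text(json.dumps(sample))


def forward_buffered_samples() -> None:
    """Best-effort sync to the global stack; unsent files stay buffered for the next cycle."""
    for path in sorted(BUFFER_DIR.glob("*.json")):
        request = urllib.request.Request(
            GLOBAL_INGEST_URL,
            data=path.read_bytes(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(request, timeout=5):
                path.unlink()  # acknowledged by the global stack, safe to drop locally
        except OSError:
            return  # disconnected or rejected: keep the buffer and retry later


if __name__ == "__main__":
    record_sample({"service": "web", "metric": "error_ratio", "value": 0.002})
    forward_buffered_samples()  # e.g. invoked periodically by a scheduler
```

The key property of this pattern is that a unit never depends on the global stack to do its local job; the global view simply catches up whenever connectivity allows.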
As the number of GitLab instances grows, increasing the level of automation used to roll out and operate both the units and the global observability stack will be paramount. Additionally, to improve resiliency, it is likely that these units will become cloud agnostic over time. In order to drive the level of automation up and the level of toil down, we leverage managed service providers where appropriate. That allows us to focus our efforts on the higher-level functions of our observability platform instead of the mechanics of operating it.
We expect to continue to maintain some number of GitLab-operated foundational services. We will assess this on a case-by-case basis, weighing the cost against the associated benefit.
GitLab’s availability metrics represent our experience of running and operating GitLab.com at million-user scale. This has been successful so far, but as the number of capabilities and use cases on our platform grows, we want to shift these metrics to better reflect the user experience, rather than the operator experience.
As mentioned in other sections of this page, GitLab Dedicated and a cellular architecture increase the complexity of operating the SaaS solutions offered by GitLab. Over time we will shift towards availability metrics that better represent the user experience on GitLab.com and GitLab Dedicated. As our metrics become more representative of the user experience, we expect the Service Level Availability metric to change alongside.
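Purely as an illustration of what "more representative of the user experience" could mean, the sketch below (Python, with hypothetical service names and traffic figures; this is not the production definition of Service Level Availability) weights each service's success ratio by its share of overall traffic, so the services users touch most dominate the headline number.

```python
def service_level_availability(services: list[dict]) -> float:
    """Weight each service's success ratio by its share of overall traffic."""
    total_requests = sum(s["requests"] for s in services)
    if total_requests == 0:
        return 1.0
    return sum(
        (s["successes"] / s["requests"]) * (s["requests"] / total_requests)
        for s in services
        if s["requests"]
    )


if __name__ == "__main__":
    fleet = [  # hypothetical traffic figures
        {"name": "web", "requests": 9_000_000, "successes": 8_995_500},
        {"name": "api", "requests": 20_000_000, "successes": 19_996_000},
        {"name": "git", "requests": 5_000_000, "successes": 4_999_000},
    ]
    print(f"Service Level Availability: {service_level_availability(fleet):.4%}")
```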
Error budgets have allowed us to meaningfully address availability issues in the early stages of operating GitLab.com. This has been great for product teams, who can proactively address performance and availability issues in their features. Whilst they have been useful for improving the availability of GitLab.com over time, we understand that operating a GitLab instance the size of GitLab.com is not a common user experience. To make our error budgets more effective guides for teams, we will implement error budgets across the production fleet, for all instances. This will allow us to gain better insight on instances that look more like customer instances, improve the performance of GitLab at every size, and ensure that our Error Budgets can mature to reflect the user experience as well as system health.
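As a simplified sketch of what a fleet-wide error budget could look like (the SLO target, instance names and volumes below are hypothetical, not our production configuration), the budget for a window can be derived from the total operations across all instances in the production fleet rather than GitLab.com alone:

```python
from dataclasses import dataclass

SLO_TARGET = 0.9995  # hypothetical target, not our production SLO


@dataclass
class InstanceWindow:
    name: str
    operations: int  # operations served during the budget window
    failures: int    # operations that breached the SLO


def fleet_error_budget(instances: list[InstanceWindow]) -> dict:
    """Aggregate the budget across every instance, not just the largest one."""
    operations = sum(i.operations for i in instances)
    failures = sum(i.failures for i in instances)
    allowed_failures = (1 - SLO_TARGET) * operations  # failures the budget allows
    remaining = 1 - failures / allowed_failures if allowed_failures else 0.0
    return {
        "allowed_failures": allowed_failures,
        "actual_failures": failures,
        "budget_remaining": remaining,
    }


if __name__ == "__main__":
    fleet = [  # hypothetical instances and volumes
        InstanceWindow("gitlab.com", 2_000_000_000, 600_000),
        InstanceWindow("dedicated-tenant-a", 40_000_000, 30_000),
        InstanceWindow("dedicated-tenant-b", 8_000_000, 1_000),
    ]
    print(fleet_error_budget(fleet))
```

Aggregating this way means a regression that only affects smaller instances still consumes budget, which is how the metric starts to reflect the experience of instances that look like customer instances.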
While we have automation to cover as much as possible regarding the availability and performance of our platforms, our automation only covers what we know to look for. We should conduct periodic human reviews to assess whether there are missing pieces in our automation, rather than waiting for incidents and customer reports.
One of our core focuses is providing paved roads for GitLab team members. The way our infrastructure organization operates will likely evolve over the next few years, and we aim to create paved roads for all the models that we operate in. We aim to encourage cultural change over time by democratizing our capability and putting it in the hands of GitLab team members. Our paved roads will be as good for day-1 operations as they are for day-50 operations, supporting a number of different application architectures and generating more customer results through increased efficiency.
Feature owners and service owners will have a self-service onboarding experience that supports all stages of feature and service maturity. Documentation and processes should be clear and easy to find and follow.
Feature owners and service owners should be empowered to operate their features and services as far as reasonably possible. Self-serve will be a core part of this, with simple interfaces to allow SREs to collaborate on issues where deep expertise is required.
As part of the move to paved roads, we’ll create a Well Architected Services framework, which will tie in with our Service Maturity Model. This will give teams guidance on how to build services and solutions that are resilient and able to scale to serve our production load. Having practical examples and guidelines, along with getting involved in the design and architecture process earlier, will ensure that services get to production (and to customers) sooner, safer and happier than they would with ad-hoc processes and reactive work.
Over time services will follow the paved road for the infrastructure model that best suits them and the engineering portal will provide the information needed for the owners to understand the needs of their services.
Most of the information that we need to operate GitLab.com and maintain excellent levels of service for our customers is already present within GitLab. We can improve this by providing a single pane of glass that empowers service owners, support engineers and members of the engineering organization to see all of this information in one place.
This should culminate in a Thinnest Viable Platform, where GitLab team members can discover vital information about their services' health, key infrastructure performance indicators, and other information that contributes to decision-making in feature development. It will be composed of atomic tools and solutions, be customized to the various GitLab roles, reduce cognitive load, and increase discoverability and efficiency across GitLab.
After the expansion of the Scalability group, and to keep this page from becoming too long, we have broken out the one-year plan into two new pages:
The list above can change and should not be taken as a hard commitment. For the most up-to-date information about our work, please see our top level epic.