The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
Last updated: 2022-10-08
This page contains the product direction for the Database group of the Enablement stage.
Please reach out to Gabe Weaver, Acting Product Manager for the Database group (Email) if you'd like to provide feedback or ask any questions related to this product direction.
This strategy is a work in progress, and everyone can contribute. Please comment and contribute in the linked issues and epics on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.
There are many ways in which GitLab stores data. Relational databases are the primary persistent storage mechanism in GitLab, used for storing most of the user-generated data. We use Redis as a cache and as lightweight semi-persistent storage. We use object storage for data that is unsuited for a database. In some cases we even use the file system, for example for Gitaly or the existing version of the Container Registry.
The Database Group is the steward of most relational database technologies at GitLab except for ClickHouse. The Group owns the primary GitLab database from an application technology perspective. Individual feature groups own specific tables and additional feature-specific databases, such as Gitaly Cluster, Geo and the database for the new Container Registry.
The DB team is responsible for growing GitLab's database expertise, promoting and supporting its proper use, and ensuring continuity and no knowledge gaps as the GitLab team and product increase. In practice, the Database group focuses on improving the scalability, performance, and resilience of GitLab's database layer and instituting best practices across the broader development organization.
The database is at the core of all user interactions with a GitLab instance and the layer on which most automated GitLab features, such as CI pipelines, rely. Any bottleneck at the database layer, any regression, or any non-performant application code interacting with the database can break a GitLab feature or render it practically unusable. At the same time, database-related incidents can pose a significant threat to the availability of any GitLab instance.
As GitLab's architecture grows more complex and the list of features expands, the database layer becomes more complicated as well. The Database group's task is to ensure that this complexity does not introduce any short- or long-term risks to GitLab's availability, so that we can continue to iterate on existing and new features while GitLab remains performant at all scales.
We apply the Group's combined application and database expertise to tackle the complexity of the database design as it grows, no matter the scale at which that design is applied. Scaling is not the end goal in itself; it is a constant we must always take into account. We must ensure that the best database practices are applied to every problem so that GitLab remains performant at all sizes, from small self-managed instances to GitLab.com. These best practices, together with research on state-of-the-art database approaches, allow us to scale our largest instances to the next order of magnitude with each longer-term iteration.
We work towards those goals through the themes outlined below.
The Database Group's ongoing focus will always remain on the scalability of the database, increasing the responsiveness and availability of the GitLab platform while also improving the efficiency and reliability of making database changes. We plan to continue addressing this by focusing on the following themes, which align with the Enablement Section's theme of managing complexity for large software projects:
Create application-level libraries and frameworks that are easy to use and understand, and that make all database operations performant and as reliable as possible.
We are moving towards self-monitoring, auto-tuning approaches that respond to changing conditions in production environments without requiring manual intervention. This covers both the operations running in the background on a GitLab server and all the operations necessary to upgrade a GitLab instance. We are also closing the knowledge gap by making the most complex database operations simple for other GitLab groups to implement by calling a few helper functions.
Shift left our ability to preemptively find database-related regressions and performance issues by testing all database operations against a production clone of GitLab.com's database. In the long term, we also plan to extend our automated database testing capabilities and explore how to provide the tools we are building to a broader audience, internal or external, benefiting all GitLab users.
Today, most developers have no easy way to test new queries or database changes at scale. We want to figure out ways for developers to test all their database changes on production data before they reach production. Our approach will be to automate the process of testing queries against production data and integrate it into the DevOps lifecycle, at the stage where developers spend most of their time developing and reviewing code. Automated query testing will enable developers to complete database-related updates and perform code reviews faster, with less guessing and more confidence backed by quantifiable data.
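One simple way to back such reviews with quantifiable data is to look at the planner's cost estimate for each new query. The sketch below is purely illustrative (the threshold and helper names are assumptions, not part of GitLab's tooling): it parses the first line of PostgreSQL `EXPLAIN` output, as produced against a production clone, and flags queries whose estimated total cost exceeds a budget.

```ruby
# Illustrative sketch: flag queries whose estimated planner cost exceeds
# a budget. The plan text would come from running `EXPLAIN <query>`
# against a production clone; here we only parse the top plan line.

COST_BUDGET = 10_000.0 # assumed threshold for illustration only

# Extracts the total cost from a plan line such as:
#   "Seq Scan on users  (cost=0.00..4582.00 rows=100000 width=8)"
def total_cost(plan_line)
  match = plan_line.match(/cost=\d+\.\d+\.\.(\d+\.\d+)/)
  match && match[1].to_f
end

def too_expensive?(plan_line, budget: COST_BUDGET)
  cost = total_cost(plan_line)
  !cost.nil? && cost > budget
end
```

A sequential scan over a large table would typically trip such a check, while an index scan with a single-digit cost would pass, giving the reviewer an early, data-backed signal.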
We are looking for ways to contribute to the wider community beyond GitLab. We have extended the way Ruby on Rails projects interact with the database and introduced numerous new ideas and frameworks. We are evaluating how we could extract parts of our tooling into a separate, open-source library that other developers can use. At the same time, that would enable more seamless contributions to GitLab's database layer.
Background migrations are the vehicle for executing all significant data updates (data migrations) in GitLab. Any operation that updates more than a few thousand records must be performed through a background migration, with workers running asynchronously in the background and executing the update in batches to avoid degrading the performance of the GitLab instance.
Up until GitLab 15.0, such operations were performed by scheduling background jobs at regular intervals; this approach was static and did not take the database server's load into account when each job executed. To address this, we had to rethink how we perform massive data updates, which led us to build the batched background migrations framework.
We first introduced the batched background migrations framework while addressing the primary key overflow risk for tables with an integer primary key, which required updating more than 8.5 billion records. The framework adapts to the load of the database server in real time, adjusting the work performed by monitoring the performance of the migrations. It requires minimal monitoring by instance administrators and can automatically recover from performance-related errors.
In 15.0, we made batched background migrations available to all GitLab engineers and made them the default way of performing background migrations. In the long term, we plan to update many of our existing tools that perform data operations to use batched background migrations as well. We plan to continue addressing any issues discovered and to add support for missing or novel features that will make the framework even more reliable. We also plan to evaluate whether we can extend the framework, or introduce a similar one, to cover asynchronous operations that are not background migrations, such as scheduled or recurring jobs.
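The auto-tuning idea behind the framework can be sketched in a few lines. This is an illustration, not GitLab's actual implementation; the time budget and size bounds are assumed values. The scheduler grows the batch size while jobs complete within their time budget and shrinks it when they run long:

```ruby
# Illustrative sketch (not the actual GitLab code) of adaptive batch
# sizing: grow the batch while jobs finish quickly, shrink it when the
# last job ran over its time budget.

TARGET_SECONDS = 60.0   # assumed per-job time budget
MIN_BATCH = 1_000       # assumed lower bound on batch size
MAX_BATCH = 100_000     # assumed upper bound on batch size

def next_batch_size(current_size, last_duration_seconds)
  if last_duration_seconds > TARGET_SECONDS
    [(current_size * 0.8).to_i, MIN_BATCH].max  # back off under load
  else
    [(current_size * 1.1).to_i, MAX_BATCH].min  # ramp up when healthy
  end
end
```

Because each decision uses the observed duration of the previous job, the migration converges to whatever throughput the database can sustain at that moment, without administrator intervention.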
We are implementing a generic throttling mechanism for large data changes that will monitor the health of the database for various signals (leading indicators) and react to problems by throttling or even pausing the execution of the updates. We plan to do so by extending the batched background migrations auto-tuning layer to monitor for those signals and react by further adjusting the batch sizes of scheduled jobs.
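The decision logic could look like the following sketch. The signal names and thresholds here are hypothetical (chosen only to illustrate the pattern of checking leading indicators); they are not GitLab's actual signals:

```ruby
# Hypothetical sketch of health-signal throttling: evaluate leading
# indicators of database stress and decide whether to run the next
# batch, throttle, or pause. Signals and thresholds are illustrative.

HEALTH_CHECKS = {
  replication_lag_seconds: ->(v) { v < 60 },     # are replicas keeping up?
  wal_pending_segments:    ->(v) { v < 1_000 },  # is WAL archiving healthy?
  active_autovacuums:      ->(v) { v < 3 }       # vacuum pressure on tables?
}.freeze

def migration_action(signals)
  failing = HEALTH_CHECKS.count { |name, ok| !ok.call(signals.fetch(name)) }
  case failing
  when 0 then :run       # all clear: execute the next batch
  when 1 then :throttle  # one warning sign: shrink batches / add delay
  else :pause            # multiple signs of stress: stop scheduling jobs
  end
end
```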
We are shifting left our ability to pre-emptively find database-related regressions and performance issues by testing all database updates against a production clone of GitLab.com's database. With every feature we add, we move closer to GitLab being more performant and lowering the risk that deployed code could cause incidents and affect the performance and availability of GitLab.com or other self-managed instances.
GitLab 15.0 marked a significant milestone for the automated database testing internal feature: we now test all types of database migrations against a clone of GitLab.com's production database.
That means 100% of scheduled database updates are covered, ensuring we test our trickiest operations before they are merged. We expect the effect of those tests to be evident in both GitLab.com and self-managed instances throughout GitLab 15 and beyond.
One of the top reasons for performance degradation in relational databases is tables growing too large. There is no globally applicable hard rule for the size threshold that tables should not exceed: it depends on the table's schema, the number and types of indexes defined over it, the mix of read and write traffic it receives, the query workload used to access it, and more. Based on our analysis, we set the limit at 100GB, and we explain in detail our rationale and how we plan to approach this problem.
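Finding the tables that breach such a limit is straightforward to sketch. In PostgreSQL, per-table sizes come from `pg_total_relation_size()`; the helper below is an illustration (the function name and the example table names are assumptions for this sketch):

```ruby
# Sketch: given per-table sizes in bytes, report those over the 100GB
# threshold discussed above. In PostgreSQL the sizes would come from a
# query along these lines:
#
#   SELECT relname, pg_total_relation_size(oid)
#   FROM pg_class WHERE relkind = 'r';

SIZE_LIMIT_BYTES = 100 * 1024**3 # 100GB

def oversized_tables(sizes_by_table, limit: SIZE_LIMIT_BYTES)
  sizes_by_table.select { |_table, bytes| bytes > limit }.keys
end
```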
Addressing this for tables in GitLab.com is critical, but it will also allow us to provide a more scalable database design for GitLab instances with smaller databases. Based on our analysis, we are planning to work with multiple other GitLab teams in an ongoing fashion towards achieving that goal.
Following the initial success of the automated database testing effort, we are expanding our scope as scheduled database updates do not include regular queries or updates that result from user interactions (users performing an action or causing a background job to run). Our next effort will be to find ways to perform automated query analysis for Merge Requests and test newly introduced queries against our production clones as well. This automated testing is a complex problem as we must figure out the parameters for the queries, which depend on the data stored. We will start with the simplest iteration possible, identifying the queries introduced by each MR to support the database reviewers.
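One plausible shape for that first iteration (an illustration, not GitLab's actual implementation) is query fingerprinting: normalize each logged query by stripping its literals, then diff the fingerprint sets of the target branch and the MR branch to surface the queries the MR introduces:

```ruby
require "set"

# Illustrative sketch of identifying queries an MR introduces: reduce
# each logged SQL statement to a fingerprint with literals replaced by
# placeholders, then keep only MR queries whose fingerprint is unseen
# on the target branch.

def fingerprint(sql)
  sql.gsub(/'[^']*'/, "?")  # string literals -> placeholder
     .gsub(/\b\d+\b/, "?")  # numeric literals -> placeholder
     .squeeze(" ")
     .strip
     .downcase
end

def new_queries(target_branch_log, mr_branch_log)
  known = target_branch_log.map { |q| fingerprint(q) }.to_set
  mr_branch_log.reject { |q| known.include?(fingerprint(q)) }
end
```

Fingerprinting deliberately ignores the literal parameters, which is exactly the hard part the text mentions: knowing which fingerprints are new is the simple first step, while choosing realistic parameters for testing them remains the open problem.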
Ensuring that the deployed database schema matches codebase expectations is essential for addressing challenges self-managed instances may face while upgrading to newer versions. It will allow us to support preemptive checks before an instance upgrade and warn about potential issues before the process starts.
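Such a preemptive check boils down to diffing two schema descriptions: what the codebase expects (for example, as recorded in `db/structure.sql`) versus what the deployed database actually reports. The sketch below uses an assumed, simplified input format purely to illustrate the comparison:

```ruby
# Hedged sketch of a schema drift check: compare the tables and columns
# the codebase expects with what the deployed database reports, so an
# instance can be warned before an upgrade starts. The input format
# (table name => column list) is an assumption for this illustration.

def schema_drift(expected, actual)
  {
    missing_tables: expected.keys - actual.keys,
    column_drift: expected.select { |table, cols|
      actual.key?(table) && actual[table] != cols
    }.keys
  }
end
```

An empty result in both lists would mean the deployed schema matches expectations and the upgrade can proceed without this class of risk.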