Citus: Distributed PostgreSQL for Data-Intensive Applications

Umur Cubukcu, Ozgun Erdogan, Sumedh Pathak, Sudhakar Sannakkayala, and Marco Slot
Microsoft Corporation

ABSTRACT

Citus is an open source distributed database engine for PostgreSQL that is implemented as an extension. Citus gives users the ability to distribute data, queries, and transactions in PostgreSQL across a cluster of PostgreSQL servers to handle the needs of data-intensive applications. The development of Citus has largely been driven by conversations with companies looking to scale PostgreSQL beyond a single server and by their workload requirements. This paper describes the requirements of four common workload patterns and how Citus addresses those requirements. It also shares benchmark results demonstrating the performance and scalability of Citus in each of the workload patterns, and describes how Microsoft uses Citus to address one of its most challenging data problems.

CCS CONCEPTS
Information systems -> Relational parallel and distributed DBMSs

KEYWORDS
postgresql, distributed database, relational database, database extension

ACM Reference Format:
Umur Cubukcu, Ozgun Erdogan, Sumedh Pathak, Sudhakar Sannakkayala, and Marco Slot. 2021. Citus: Distributed PostgreSQL for Data-Intensive Applications. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20-25, 2021, Virtual Event, China. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3448016.3457551

1 INTRODUCTION

PostgreSQL is one of the most popular open source database management systems [19]. It is highly versatile and used across different industries and areas as diverse as particle physics [23] and geospatial databases [18]. One of the defining characteristics of PostgreSQL is its extensibility [24], which enables developers to add new database functionality without forking from the original project. Many companies have leveraged the rich functionality and ecosystem of PostgreSQL to build advanced, successful applications. This in turn has created significant demand for PostgreSQL to scale beyond a single server.

Over the past decade, our team has developed an open source PostgreSQL extension (plugin) called Citus [3], which turns PostgreSQL into a distributed database management system (DDBMS). The goal of Citus is to address the scalability needs within the PostgreSQL ecosystem. At an early stage we started offering Citus as a product, which drove us to talk to over a thousand companies that were looking to scale out PostgreSQL. From these conversations we learned that the need for scale often goes hand in hand with complex application logic that relies on many different relational database capabilities, and on performant implementations of those capabilities. In addition, applications rely on a broad ecosystem of tools and extensions.
Traditionally, new DDBMSs that aimed to offer compatibility with an existing relational database system have followed one of three approaches: (i) build the database engine from scratch and write a layer to provide over-the-wire SQL compatibility, (ii) fork an open source database system and build new features on top of it, or (iii) provide new features through a layer that sits between the application and the database, as middleware. For each of these approaches, the cost of keeping up with the ongoing developments in the core project over the decades-long lifecycle of a database management system is substantial and often insurmountable. Most projects lag by many years in terms of compatibility with new PostgreSQL features, tools, and extensions.

Citus is the first distributed database that delivers its functionality through the PostgreSQL extension APIs. The extension APIs provide sufficient control over the behavior of PostgreSQL to integrate a sharding layer, a distributed query planner and executor, and distributed transactions in a way that is transparent to the application. Being an extension allows Citus to maintain compatibility with the latest PostgreSQL features and tools at negligible engineering cost. Moreover, Citus distributes data across regular PostgreSQL servers and sends queries over the regular PostgreSQL protocol. This means that Citus can utilize all the data access and storage capabilities offered by the underlying PostgreSQL servers, including advanced capabilities such as JSONB, lateral joins, GiST indexes, array types, and other extensions.

Building a distributed PostgreSQL engine that is 100% compatible with a single server and scales in all scenarios without performance regressions is perhaps impossible, but also unnecessary. Not every application that benefits from PostgreSQL also benefits from scaling out. We found that the PostgreSQL applications that benefit from scaling out largely fall into four workload patterns, namely multi-tenant SaaS, real-time analytics, high-performance CRUD, and data warehousing. Each workload pattern requires a different combination of capabilities from the database. The requirements of these workload patterns have largely driven the development of Citus.

Table 1. Scale requirements of workload patterns for distributed relational databases.

                              MT        RA        HC        DW
  Typical query latency       10 ms     100 ms    1 ms      10 s
  Typical query throughput    10k/s     1k/s      100k/s    10/s
  Typical data size           1 TB      10 TB     1 TB      10 TB

This paper shares what we learned from building and deploying Citus over the years. As such, the paper brings three main contributions. First, we describe the four workload patterns that we observed in customer conversations and their requirements in terms of scale and distributed database capabilities. Second, we describe the PostgreSQL extension APIs and how Citus uses them to implement a comprehensive distributed database system. Finally, the Citus architecture shows how we addressed the requirements of a broad range of data-intensive applications within a single distributed database system.

The remainder of this paper is organized as follows. Section 2 distills the high-level requirements of four PostgreSQL workload patterns that benefit from scaling out. Section 3 describes how Citus implements and scales distributed query planning and execution, distributed transactions, and other database operations. Section 4 presents benchmark results demonstrating that Citus can scale PostgreSQL in each of the four workload patterns.
Section 5 shows how Citus is used in a data-intensive real-time analytics application at Microsoft. Section 6 describes related work in distributed relational database management systems. Section 7 concludes the paper and shares some of our future work.

2 WORKLOAD REQUIREMENTS

When talking to companies looking to scale out PostgreSQL about potential Citus adoption, we observed that almost all workloads we encountered followed four patterns: multi-tenant (MT), real-time analytics (RA), high-performance CRUD (HC), and data warehousing (DW).

Table 1 gives an overview of the approximate scale requirements in each workload pattern. It is worth noting that the latency, throughput, and data size requirements vary significantly, which in practice means that each workload pattern requires a different combination of distributed database capabilities to achieve its notion of high performance at scale. We describe the four workload patterns and the capabilities requested for each workload in the remainder of this section and give an overview in Table 2. In the Citus architecture section we will further describe these capabilities and how Citus implements them.

Table 2. Workload patterns and required distributed relational database capabilities.

  Feature requirements               MT     RA     HC     DW
  Distributed tables                 Yes    Yes    Yes    Yes
  Co-located distributed tables      Yes    Yes    Yes    Yes
  Reference tables                   Yes    Yes    Yes    Yes
  Local tables                       Some   -      Some   -
  Distributed transactions           Yes    Yes    Yes    Yes
  Distributed schema changes         Yes    Yes    Yes    Yes
  Query routing                      Yes    Yes    Yes    -
  Parallel distributed SELECT        -      Yes    -      Yes
  Parallel distributed DML           -      Yes    -      -
  Co-located distributed joins       Yes    Yes    -      Yes
  Non-co-located distributed joins   -      -      -      Yes
  Columnar storage                   -      Some   -      Yes
  Parallel bulk loading              -      Yes    -      Yes
  Connection scaling                 -      -      Yes    -

[Figure 1: Data model for a simple multi-tenant messaging application (organizations, users, channels, messages), with blue and orange representing different tenants. There are relatively complex relationships within the data model that are expressed through foreign keys and joins, but not across tenants. Hence, multi-tenant applications can scale along the tenant dimension.]

2.1 Multi-tenant

Multi-tenant applications serve many relatively independent tenants from a single backend deployment. A typical example is Software-as-a-Service (SaaS). Such applications often have complex OLTP workloads with many relationships, though tenants are relatively independent within the data model, as shown in the example in Figure 1. One of the benefits of the SaaS model that has helped make it successful is that the cost of adoption (i.e., adding a new tenant) is low for both user and application developer, which can lead to rapid growth. One of the challenges in scaling a multi-tenant workload is that the working set is relatively large due to the large number of independent tenants.

A traditional approach to scaling a relational database for a multi-tenant application is manual sharding. The data for each tenant is placed into its own database or schema (namespace). The application then needs to keep track of where each database or schema is placed, build infrastructure for moving data around, synchronize data and schema changes across potentially thousands of databases, and use external systems to do analytics across tenants. This approach only scales up to the level of investment that the application developer is willing to make in building a distributed database management plane. The alternative approach is to use a shared schema with tenant ID columns.
To scale out this approach for multi-tenant applications in PostgreSQL, we identified several distributed database capabilities. These capabilities are essential to meet both the functional requirements of complex SaaS applications, including complex SQL, foreign keys, constraints, and indexes, and the scale requirements (low query latency). Tables that contain tenant-specific data should all have a tenant ID column, such that they can be distributed and co-located by tenant ID. Co-location enforces that the same tenant ID is always on the same server, such that joins and foreign keys on the tenant ID can be performed locally. The database system should then be able to route arbitrarily complex SQL queries that filter by tenant ID to the appropriate server with minimal overhead, to linearly scale query throughput. Attempts by the distributed query planner to analyze and break up the query will typically result in significant regressions from single-server PostgreSQL. Reference tables that are replicated across servers are needed for local joins and foreign keys with tables that are shared across tenants. Applications may also perform analytics (e.g., parallel queries that do co-located distributed joins on the tenant ID) and transactions (including distributed schema changes) across all tenants.

In addition to distributed database capabilities, we also hear questions from customers that are specific to multi-tenant workloads. First, customers may need the flexibility to customize data for certain tenants. For these customers, we recommend adding new fields using the JSONB data type, which can be efficiently indexed through GIN or expression indexes. Second, customers may need control over tenant placement to avoid issues with noisy neighbors. For this, Citus provides features to view hotspots, to isolate a tenant onto its own server, and to provide fine-grained control over tenant placement [22].

Citus implements the combination of these capabilities required for multi-tenant workloads. As such, Citus provides a significantly simpler alternative to scaling out than the database-per-tenant approach.
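As a concrete illustration of the shared-schema approach, the sketch below distributes two hypothetical SaaS tables by tenant ID and co-locates them; the table and column names are invented for this example, and create_distributed_table and co-location are described in detail in Section 3.3.

    -- Every tenant-specific table carries the tenant ID (here: company_id).
    CREATE TABLE companies (id bigint PRIMARY KEY, name text);
    CREATE TABLE campaigns (
        company_id bigint,
        id bigint,
        name text,
        clicks bigint DEFAULT 0,
        PRIMARY KEY (company_id, id)
    );

    -- Distribute and co-locate both tables by tenant, so that joins and
    -- foreign keys on the tenant ID stay local to a single worker node.
    SELECT create_distributed_table('companies', 'id');
    SELECT create_distributed_table('campaigns', 'company_id', colocate_with := 'companies');
    ALTER TABLE campaigns ADD FOREIGN KEY (company_id) REFERENCES companies (id);

    -- A query that filters by tenant ID can be routed to one worker node.
    SELECT c.name, count(*)
    FROM companies c JOIN campaigns ca ON (c.id = ca.company_id)
    WHERE c.id = 42
    GROUP BY c.name;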
2.2 Real-time analytics

Real-time analytics applications provide interactive analytics or search on large streams of data with only minor delay. The main data stream typically consists of event data or time series data describing telemetry from devices or software. Use cases may include system monitoring, anomaly detection (e.g., fraud detection), behavioral analytics (e.g., funnel analysis, segmentation), geospatial analytics, and others.

A common real-time analytics application is a multi-user analytics dashboard that visualizes aggregations of the data through charts. The database needs to be able to sustain a high write throughput to keep up with a stream of events, while the application issues hundreds of analytical queries per second. Queries should have sub-second response times to be able to show results interactively, regardless of the data volume. Since the query set is known upfront, indexes, materialized views, and other data transformations (e.g., rollups) can be used to minimize response times. The database needs to be able to update these incrementally to quickly reflect new data in the application. An example of a typical real-time analytics pipeline is shown in Figure 2.

[Figure 2: A simple real-time analytics pipeline. Data is ingested into a raw events table and then incrementally pre-aggregated into one or more rollup tables. The application may query both the rollups and the raw event data.]

PostgreSQL has many powerful capabilities for building real-time analytics applications. Its heap storage format and COPY command allow very fast data ingestion, while MVCC allows analytical queries to run concurrently with writes. PostgreSQL also has a versatile set of data types and index types, including comprehensive support for arrays, JSON, and custom types. The only shortcoming of PostgreSQL for real-time analytics applications is that the data volume can easily exceed the capacity of a single server and most operations are single-threaded.

To scale real-time analytics workloads, the database system needs to be able to distribute tables across servers and support parallel bulk loading to keep up with the data volume. Parallel distributed DML, in particular INSERT...SELECT, is needed to be able to incrementally pre-aggregate large volumes of incoming data into rollup tables. Co-location between the source table and the rollup table enables very fast INSERT...SELECT. Queries from the dashboard use query routing or parallel distributed SELECT on the event data or the rollup tables to keep response times low. Co-located distributed joins are needed for advanced analytics (e.g., funnel queries).

Citus supports the capabilities required for building large-scale real-time analytics dashboards on PostgreSQL. Section 5 gives an example of a petabyte-scale real-time analytics use case at Microsoft.
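A minimal sketch of the pre-aggregation step in Figure 2, assuming hypothetical events and rollup tables that are co-located on site_id; because the GROUP BY includes the distribution column and the tables are co-located, the INSERT...SELECT can run shard-locally and in parallel (see Section 3.8).

    CREATE TABLE events (site_id bigint, event_time timestamptz, payload jsonb);
    CREATE TABLE events_per_minute (
        site_id bigint,
        minute timestamptz,
        event_count bigint,
        PRIMARY KEY (site_id, minute)
    );
    SELECT create_distributed_table('events', 'site_id');
    SELECT create_distributed_table('events_per_minute', 'site_id', colocate_with := 'events');

    -- Incremental rollup over a window the application has not aggregated yet;
    -- ON CONFLICT merges late-arriving rows for minutes that already exist.
    INSERT INTO events_per_minute (site_id, minute, event_count)
    SELECT site_id, date_trunc('minute', event_time), count(*)
    FROM events
    WHERE event_time >= '2021-01-01 12:00' AND event_time < '2021-01-01 12:05'
    GROUP BY 1, 2
    ON CONFLICT (site_id, minute)
    DO UPDATE SET event_count = events_per_minute.event_count + EXCLUDED.event_count;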
2.3 High-performance CRUD

High-performance CRUD workloads involve many objects (documents) that are modified in a relatively independent manner. The application primarily accesses the data through simple CRUD (create, read, update, delete) operations on a key, but may also issue more complex queries across objects. The objects typically follow an unstructured data format like JSON. PostgreSQL is a popular choice for this type of workload because of its sophisticated JSON support.

A large PostgreSQL server can handle hundreds of thousands of writes and millions of reads per second. Scalability problems can arise when the working set is large or when making incremental changes to large objects at a high rate. The PostgreSQL MVCC model requires writing a new copy and later reclaiming the space (autovacuuming), which can cause performance degradation if autovacuuming cannot keep up. Another limitation of PostgreSQL is that it can only handle a limited number of idle connections, because of the process-per-connection architecture and the relatively high memory overhead of a process.

To scale high-performance CRUD workloads, tables need to be distributed by key. CRUD operations can then be routed to a shard with very low planning overhead and use the primary key index within the shard, enabling high throughput and low latency. A significant benefit of sharding PostgreSQL tables in high-performance CRUD workloads is that, apart from the ability to use more hardware, autovacuuming is parallelized across many cores. Reference tables are not strictly required, but can make objects significantly more compact through normalization, which is beneficial at scale. Parallel distributed SELECT and DML are useful for performing scans and analytics across a large number of objects. To scale the number of connections, any server needs to be able to process distributed queries.

Citus meets all the requirements for high-performance CRUD workloads, but still has some inherent limitations around connection scalability. We are working with the community to improve connection scalability in PostgreSQL itself [20].

2.4 Data warehousing

Data warehousing applications combine data from different sources into a single database system to generate ad-hoc analytical reports. The application typically does not have low latency or high throughput requirements, but queries may need to scan very large amounts of data. The database needs to support very fast scans and be able to find efficient query plans for hand-written SQL involving arbitrary joins.

PostgreSQL generally lacks the scanning performance and parallelism to perform very large scans within a reasonable amount of time. On the other hand, its performance features, comprehensive SQL support, and ecosystem still make it an attractive option for analytics.

To scale data warehouse applications, scans need to be sped up through parallel distributed SELECT and columnar storage. Distribution columns should be chosen to maximize the number of co-located distributed joins, but the database needs to support efficient non-co-located joins as well, by reshuffling or broadcasting data over the network. The query optimizer needs to decide on a join order that minimizes network traffic.

At the time of writing, Citus meets most of the requirements for data warehouse applications, but has several limitations around non-co-located joins (e.g., correlated subqueries are unsupported), which limits its applicability in some data warehousing workloads.

2.5 Other workloads

Some workloads have not been a focus area of Citus, primarily because we rarely observed them among potential customers. There are various workload patterns that PostgreSQL could address given its extensibility, but the ecosystem is centered around other database systems and tools. Examples include streaming analytics, Extract-Transform-Load (ETL), machine learning, and text search. We believe PostgreSQL can be successful in these areas, but it will take broader movements within developer communities to make PostgreSQL an important part of those ecosystems.

We also observed that complex single-tenant OLTP workloads are less likely to run into the large working set problem that occurs in multi-tenant OLTP workloads. Moreover, scaling out a complex single-tenant OLTP application often lowers overall throughput, because network latency lowers per-connection throughput and results in locks being held for longer, which in turn lowers achievable concurrency.

Overall, the observations about the workload patterns described in this section have significantly influenced the Citus architecture.

3 CITUS ARCHITECTURE

In a Citus cluster, all servers run PostgreSQL with the Citus extension, plus any number of other extensions installed. Citus uses the PostgreSQL extension APIs to change the behavior of the database in two ways. First, Citus replicates database objects such as custom types and functions to all servers. Second, Citus adds two new table types that can be used to take advantage of additional servers. The remainder of this section describes how Citus implements and scales the most important operations for Citus tables.

3.1 PostgreSQL extension APIs

A PostgreSQL extension consists of two parts: a set of SQL objects (e.g., metadata tables, functions, types) and a shared library that is loaded into PostgreSQL. All database modules within PostgreSQL are extensible, except for the parser. The main reason is that the parser code is generated at build time, while the extension infrastructure loads the shared library at run time. Keeping the parser non-extensible also forces syntactic interoperability between extensions. Once a PostgreSQL extension is loaded, it can alter the behavior of PostgreSQL by setting certain hooks.
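Operationally, the two parts translate into one configuration setting and one SQL command. The sketch below follows the standard Citus installation steps as we understand them; exact steps may vary by platform and version.

    -- postgresql.conf (requires a restart): load the shared-library part of Citus.
    --   shared_preload_libraries = 'citus'

    -- In each database: create the SQL-object part of the extension
    -- (metadata tables, UDFs, and types).
    CREATE EXTENSION citus;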
Citus uses the following hooks:

User-defined functions (UDFs) are callable from SQL queries as part of a transaction and are primarily used to manipulate the Citus metadata and implement remote procedure calls.

Planner and executor hooks are global function pointers that allow an extension to provide an alternative query plan and execution method. After PostgreSQL parses a query, Citus checks if the query involves a Citus table. If so, Citus generates a plan tree that contains a CustomScan node, which encapsulates a distributed query plan. CustomScan is an execution node in a PostgreSQL query plan that holds custom state and returns tuples via custom function pointers. The Citus CustomScan calls the distributed query executor, which sends queries to other servers and collects the results before returning them to the PostgreSQL executor.

Transaction callbacks are called at critical points in the life cycle of a transaction (e.g., pre-commit, post-commit, abort). Citus uses these to implement distributed transactions.

The utility hook is called after parsing any command that does not go through the regular query planner. Citus uses this hook primarily to apply DDL and COPY commands that affect Citus tables.

Background workers run user-supplied code in separate processes. Citus uses this API to run a maintenance daemon. This daemon performs distributed deadlock detection, 2PC prepared transaction recovery, and cleanup.

Through these hooks, Citus can intercept any interaction between the client and the PostgreSQL engine that involves Citus tables. Citus can then replace or augment PostgreSQL's behavior.

3.2 Citus architecture diagram

[Figure 3: Example Citus deployment showing the coordinator and two worker nodes. The items and users tables are distributed by user_id, categories is a reference table replicated to all nodes, sessions is a local table on the coordinator, and foreign keys are enforced between co-located shards and to the local reference table replicas.]

Citus deployments typically have one coordinator and 0-n worker nodes, as shown in Figure 3. The coordinator stores the metadata catalogs of the distributed tables, and clients typically connect to the coordinator. A PostgreSQL server implicitly becomes the coordinator when the user adds a worker node via a Citus UDF. Worker nodes store the shards that contain the actual data. When the cluster is small, the coordinator itself can also be used as a worker node, so the smallest possible Citus cluster is a single server.

The benefit of using a single coordinator as the entry point is that PostgreSQL libraries and tools can interact with a Citus cluster as if it was a regular PostgreSQL server. Since the overhead of distributed queries is small compared to query execution, a large coordinator node can handle hundreds of thousands of transactions per second, or ingest millions of rows per second via PostgreSQL's COPY command.
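For illustration, worker nodes are registered through Citus UDFs along the following lines; the function name is our assumption rather than the paper's (citus_add_node in current releases, master_add_node in older ones), so check the documentation for the version in use.

    -- Run on the server that should act as coordinator.
    SELECT citus_add_node('worker-1.example.com', 5432);
    SELECT citus_add_node('worker-2.example.com', 5432);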
3.2.1 Scaling the coordinator node. The coordinator can become a scaling bottleneck in some cases, for example when serving demanding high-performance CRUD workloads. To resolve this bottleneck, Citus can distribute the distributed table metadata, and all changes to it, to all the nodes. In this case, each worker node assumes the role of coordinator for all distributed queries and transactions it receives. Clients should use a load balancing mechanism to divide connections over the workers. Since DDL commands modify distributed metadata, the application should continue to connect to the coordinator when issuing DDL commands. Since the volume of DDL commands is low, the coordinator no longer becomes a bottleneck.

Today we recommend this more complex mode only when customers are familiar with PostgreSQL and have an actual scaling bottleneck. When each node can coordinate and also serve queries, each connection to the Citus cluster creates one or more connections within the cluster. Citus caches connections for higher performance, and this could lead to a connection scaling bottleneck within the cluster. We typically mitigate this issue by setting up connection pooling between the instances via PgBouncer [10]. We are working to improve connection scaling behavior in future versions of Citus and PostgreSQL itself [20].

3.3 Citus table types

Citus introduces two types of tables to PostgreSQL, distributed tables and reference tables, without taking away the concept of regular (local) PostgreSQL tables. Citus tables are initially created as regular PostgreSQL tables and are then converted by calling Citus-specific functions. After conversion, Citus intercepts all commands involving Citus tables in the relevant PostgreSQL hooks.

3.3.1 Distributed tables. Distributed tables are hash-partitioned along a distribution column into multiple logical shards, with each shard containing a contiguous range of hash values. The advantage of hash-partitioning is that it enables co-location and reasonably well-balanced data without the need for frequent resharding. Range partitioning is also available for some advanced use cases. The create_distributed_table UDF converts a regular table to a distributed table by creating the shards on the workers and adding to the Citus metadata:

    CREATE TABLE my_table (...);
    SELECT create_distributed_table('my_table', 'distribution_column');

Shards are placed on worker nodes in a round-robin fashion. A single worker node can contain multiple logical shards, such that the cluster can be rebalanced by moving individual shards between worker nodes.

3.3.2 Co-location. Citus ensures that the same range of hash values is always on the same worker node among distributed tables that are co-located. Relational operations (e.g., joins, foreign keys) that involve the distribution column of two or more co-located distributed tables can be performed without any network traffic by operating on pairs of co-located shards. When creating a second distributed table, co-location can be specified using the colocate_with option:

    CREATE TABLE other_table (...);
    SELECT create_distributed_table('other_table', 'distribution_column',
                                    colocate_with := 'my_table');

If no colocate_with option is specified, Citus automatically co-locates distributed tables based on the data types of the distribution columns. We found this to be helpful for users unfamiliar with distributed database concepts.

3.3.3 Reference tables. Reference tables are replicated to all nodes in a Citus cluster, including the coordinator. Joins between distributed tables and reference tables are implemented by joining each shard of the distributed table with the local replica of the reference table. Similarly, foreign keys from distributed tables to reference tables are enforced locally by creating regular foreign keys between each of the shards of the distributed table and the local replica of the reference table. Users create reference tables by calling create_reference_table:

    CREATE TABLE dimensions (...);
    SELECT create_reference_table('dimensions');

After conversion, writes to the reference table are replicated to all nodes, and reads are answered directly by the coordinator or via load balancing across worker nodes.
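As an illustrative continuation of the snippet above (assuming dimensions has id and name columns, which is our assumption rather than the paper's), a foreign key from a distributed table to a reference table is declared with plain SQL and enforced between each shard and the local replica:

    CREATE TABLE products (product_id bigint PRIMARY KEY, dimension_id int, price numeric);
    SELECT create_distributed_table('products', 'product_id');

    -- Enforced locally: each products shard references the local copy of dimensions.
    ALTER TABLE products ADD FOREIGN KEY (dimension_id) REFERENCES dimensions (id);

    -- Joins with the reference table also execute locally, shard by shard.
    SELECT d.name, avg(p.price)
    FROM products p JOIN dimensions d ON (p.dimension_id = d.id)
    GROUP BY d.name;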
3.4 Data rebalancing

Most Citus clusters grow in data size and query volume. Certain worker nodes may also receive more load than others due to data distribution and query patterns. To enable an even distribution, Citus provides a shard rebalancer. By default, the rebalancer moves shards until it reaches an even number of shards across worker nodes. Alternatively, users can choose to rebalance based on data size, or create a custom policy by defining cost, capacity, and constraint functions in SQL [7].

Most rebalance operations start with the customer changing their cluster size. The rebalancer then picks one shard, and any shards that are co-located with it, and starts a shard move operation. To move shards, Citus creates a replica of the shards on a different node using PostgreSQL's logical replication. With logical replication, the shards in transit can continue to receive read and write queries. When the shard replicas have caught up with their source, Citus obtains write locks on the shards, waits for replication to complete, and updates the distributed table metadata. From that point on, any new queries go to the new worker node. The last few steps typically only take a few seconds, hence there is minimal write downtime.
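In practice the rebalancer is driven through UDFs; the sketch below reflects the documented interface as we understand it, and the function names may differ across Citus versions and editions.

    -- After adding capacity, move shards (together with their co-located shards)
    -- until worker nodes hold an even number of shards; logical replication keeps
    -- the shards readable and writable during the move.
    SELECT citus_add_node('worker-3.example.com', 5432);
    SELECT rebalance_table_shards();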
3.5 Distributed query planner

When a SQL query references a Citus table, the distributed query planner produces a PostgreSQL query plan that contains a CustomScan node, which contains the distributed query plan. A distributed query plan consists of a set of tasks (queries on shards) to run on the workers, and optionally a set of subplans whose results need to be broadcast or repartitioned such that they can be read by subsequent tasks.

Citus needs to handle a wide range of workloads that require different query planning strategies to scale. Simple CRUD queries benefit from minimal planning overhead; complex data warehousing queries, on the other hand, benefit from advanced query optimizations, which incur higher planning overhead. Over time, Citus evolved to have planners for different classes of queries. Figure 4 gives a basic example for each planner. We further describe those planners below.

[Figure 4: Citus planning examples. (A) Fast path planner, showing single-shard access. (B) Router planner, showing single-shard co-located and reference table joins. (C) Logical pushdown planner, showing multi-shard co-located and reference table joins. (D) Logical join order planner, showing one table being repartitioned into intermediate results to perform a non-co-located join.]

The fast path planner handles simple CRUD queries on a single table with a single distribution column value. The planner extracts the distribution column value directly from a filter in the query and determines the shard that matches the value. The planner then rewrites the table name to the shard name to construct the query to run on the worker, which can be done with minimal CPU overhead. Hence, the fast path planner supports high-throughput CRUD workloads.

The router planner handles arbitrarily complex queries that can be scoped to one set of co-located shards. The router planner checks or infers whether all distributed tables have the same distribution column filter. If so, the table names in the query are rewritten to the names of the co-located shards that match the distribution column value. The router planner implicitly supports all SQL features that PostgreSQL supports, since it will simply delegate the full query to another PostgreSQL server. Hence, the router planner enables multi-tenant SaaS applications to use all SQL features with minimal overhead.

The logical planner handles queries across shards by constructing a multi-relational algebra tree [15]. Multi-relational algebra formalizes two distributed execution primitives that are not available in PostgreSQL: to collect and to repartition data. This difference influences the separation between router and logical planner. The goal of the logical planner is to push as much of the computation to the worker nodes as possible before merging the results on the coordinator. We further distinguish between two logical planning strategies:

(1) The logical pushdown planner detects whether the join tree can be fully pushed down. This requires that all distributed tables have co-located joins between them and that subqueries do not require a global merge step (e.g., a GROUP BY must include the distribution column). If so, the planner can be largely agnostic to the SQL constructs being used within the join tree, since they are fully delegated to the worker nodes, and the distributed query plan becomes trivially parallel.

(2) The logical join order planner determines the optimal execution order for join trees involving non-co-located joins. It evaluates all possible join orders between distributed tables and subqueries using co-located joins, broadcast joins, and repartition joins, and chooses the order that minimizes the network traffic. Broadcast joins and repartition joins result in subplans, with filters and projections pushed into the subplan.

For each query, Citus iterates over the four planners from lowest to highest overhead. If a particular planner can plan the query, Citus uses it. This approach works well for two reasons. First, there is an order of magnitude difference between each workload's latency expectations, as described in Table 1. Second, this difference also applies to each planner's overhead. As an example, a complex data warehousing query could take minutes to complete; in that case, the user does not mind paying a few milliseconds of overhead for the fast path, router, and pushdown planners.
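To make the planner tiers concrete, the queries below mirror the examples in Figure 4, using its schema of two distributed tables a and b co-located on key and a reference table r; the mapping of query to planner paraphrases the figure, and other_key stands in for any non-distribution column.

    -- (A) Fast path planner: single table, single distribution column value.
    SELECT * FROM a WHERE key = 1;

    -- (B) Router planner: complex SQL scoped to one co-located shard group.
    SELECT * FROM a JOIN b ON (a.key = b.key) JOIN r ON (b.dim = r.id) WHERE a.key = 1;

    -- (C) Logical pushdown planner: multi-shard, but all joins are co-located
    --     or against a reference table, so every worker answers in parallel.
    SELECT * FROM a JOIN b ON (a.key = b.key) JOIN r ON (b.dim = r.id);

    -- (D) Logical join order planner: non-co-located join; one side is
    --     repartitioned or broadcast into intermediate results first.
    SELECT * FROM a JOIN b ON (a.other_key = b.key);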
3.6 Distributed query executor

A PostgreSQL query plan is structured as a tree of execution nodes that each have a function to return a tuple. Plans generated by the distributed query planner contain a CustomScan node that calls into the distributed query executor. In case the plan was generated via the fast path or router planner, the entire plan is a single CustomScan node, since the execution is fully delegated to a single worker node. Plans generated by the logical planner may require a merge step (e.g., aggregation across shards); in that case there will be additional execution nodes above the CustomScan, which are handled by the regular PostgreSQL executor.

When the PostgreSQL executor calls into the CustomScan, Citus first executes subplans (if any) and then hands the execution to a component called the adaptive executor. Figure 5 shows an example of the call flow for a simple analytical query.

[Figure 5: Call flow for the execution of a simple analytical query (SELECT avg(v) FROM a JOIN b ON (a.key = b.key)) that was planned by the logical pushdown planner. The PostgreSQL plan has an Aggregate node above the CustomScan, and the adaptive executor sends per-shard queries of the form SELECT sum(v), count(v) FROM a_4 JOIN b_4 ON (a_4.key = b_4.key) to the workers.]

3.6.1 Adaptive executor. The design of the adaptive executor is driven by the need to support a mixture of workloads, and by the process-per-connection architecture of PostgreSQL. Some query plans will have a single task that needs to be routed to the right worker node with minimal overhead, while other query plans have many tasks that Citus runs in parallel by opening multiple connections per worker node. We found parallelizing queries via multiple connections to be more versatile and performant than the built-in parallel query capability in PostgreSQL, which uses a fixed set of processes. The downside of opening multiple connections is the cost of connection establishment and the overhead of extra processes, in particular when many distributed queries run concurrently.

The adaptive executor manages the parallelism vs. low latency trade-off using a technique we call slow start. At the start of the query, the executor can use one connection to each worker (n = 1). Every 10 ms, the number of new connections that can be opened increases by one (n = n + 1). If there are t pending tasks for a worker node that are not assigned to a specific connection, the executor increases the connection pool of that worker node by min(n, t) new connections. The reasoning behind slow start is that a simple in-memory index lookup typically takes less than a millisecond, so all tasks on a worker are typically executed before opening any additional connections. On the other hand, analytical queries often take hundreds of milliseconds or more, and the delayed connection establishment will barely be noticeable in the overall runtime.

While slow start increases the number of connections when tasks take a long time to complete, sometimes tasks take a long time because there are already many concurrent connections issuing queries to the worker node. Therefore, the executor also keeps track of the total number of connections to each worker node in shared memory, to prevent it from exceeding a shared connection limit. When the counter reaches the limit, opening additional connections is avoided, such that the overall number of outgoing connections remains at or below the limit. The implementation converges to a state where the available connection slots on the worker nodes are fairly distributed between the processes executing distributed queries on the coordinator.

When multiple connections per worker node are used, each connection will access a different subset of shards and hold uncommitted writes and locks in case of a multi-statement transaction. For every connection, Citus therefore tracks which shards have been accessed, to ensure that the same connection will be used for any subsequent access to the same set of co-located shards in the same transaction. When starting the execution of a statement, tasks are assigned to a connection if there was a prior access to the shards accessed within the transaction, and are otherwise assigned to the general pool for the worker node. When a connection is ready, the executor first takes an assigned task from its queue, and otherwise takes a task from the general pool.

By combining slow start, the shared connection limit, and the task assignment algorithm, the adaptive executor can handle a variety of workload patterns, even when they run concurrently on a single database cluster, and support complex interactive transaction blocks without sacrificing parallelism.
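These behaviors are surfaced as ordinary PostgreSQL configuration parameters. The setting names below are taken from recent Citus releases and should be treated as an assumption to verify against the documentation for the version in use, not as part of the paper.

    -- Cap on the total number of outgoing connections to each worker node,
    -- shared across all backends on this node.
    ALTER SYSTEM SET citus.max_shared_pool_size = 100;

    -- Interval at which slow start may open additional connections per worker.
    ALTER SYSTEM SET citus.executor_slow_start_interval = '10ms';

    -- Upper bound on connections a single distributed query opens per worker.
    ALTER SYSTEM SET citus.max_adaptive_executor_pool_size = 16;

    SELECT pg_reload_conf();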
3.7 Distributed transactions

Distributed transactions in Citus comprise a transaction on the coordinator, initiated by the client, and one or more transactions on worker nodes, initiated by the executor. For transactions that only involve a single worker node, Citus delegates responsibility to the worker node. For transactions that involve multiple nodes, Citus uses two-phase commit (2PC) for atomicity and implements distributed deadlock detection.

3.7.1 Single-node transactions. The simplest type of transaction Citus supports is a single-statement transaction that goes to a single node (e.g., CRUD operations). In that case there is no overhead to using a distributed table other than the extra round trip between coordinator and worker. When handling a multi-statement transaction in which all statements are routed to the same worker node (e.g., operations on a single tenant in a multi-tenant app), the coordinator simply sends commit/abort commands to that worker node from the commit/abort transaction callbacks. The worker node by definition provides the same transactional guarantees as a single PostgreSQL server.

3.7.2 Two-phase commit protocol. For transactions that write to multiple nodes, the executor opens transaction blocks on the worker nodes and performs a 2PC across them at commit time. PostgreSQL implements commands to prepare the state of a transaction in a way that preserves locks and survives restarts and recovery, which enables later committing or aborting the prepared transaction. Citus uses those commands to implement a full 2PC protocol.

When the transaction on the coordinator is about to commit, the pre-commit callback sends a PREPARE TRANSACTION over all connections to worker nodes with open transaction blocks. If successful, the coordinator writes a commit record for each prepared transaction to the Citus metadata, and then the local transaction commits, which ensures the commit records are durably stored. In the post-commit and abort callbacks, the prepared transactions are committed or aborted on a best-effort basis.

When one or more prepared transactions fail to commit or abort, the commit record in the Citus metadata is used to determine the outcome of the transaction. A background daemon periodically compares the list of pending prepared transactions on each worker to the local commit records. If a commit record is present (visible) for a prepared transaction, the coordinator committed, hence the prepared transaction must also commit. Conversely, if no record is present for a transaction that has ended, the prepared transaction must abort. When there are multiple coordinators, each coordinator performs 2PC recovery for the transactions it initiated. Since both commit records and prepared transactions are stored in the write-ahead log, which may be replicated (see Section 3.9), this approach is robust to failure of any of the nodes involved.
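The underlying PostgreSQL primitives are the standard two-phase commit commands; the sketch below shows what the coordinator sends to each participating worker, with a made-up transaction identifier (Citus generates its own identifiers).

    -- Phase one, from the pre-commit callback, on every worker with an open
    -- transaction block:
    PREPARE TRANSACTION 'citus_example_gid_42';

    -- Phase two, after the coordinator has durably written its commit records:
    COMMIT PREPARED 'citus_example_gid_42';

    -- Or, if the coordinator's transaction aborted:
    ROLLBACK PREPARED 'citus_example_gid_42';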
3.7.3 Distributed deadlocks. A challenge in multi-node transactions is the potential for distributed deadlocks, in particular between multi-statement transactions. To overcome this challenge, deadlock prevention or deadlock detection methods can be used. Deadlock prevention techniques such as wound-wait require a percentage of transactions to restart. PostgreSQL has an interactive protocol, which means results might be returned to the client before a restart occurs, and the client is not expected to retry transactions. Hence, wound-wait is unsuitable for Citus. To maintain PostgreSQL compatibility, Citus therefore implements distributed deadlock detection, which aborts transactions when they get into an actual deadlock.

PostgreSQL already provides deadlock detection on a single node. Citus extends this logic with a background daemon running on the coordinator node. This daemon detects distributed deadlocks by polling all worker nodes for the edges in their lock graph (process a waits for process b) every 2 seconds, and then merging all processes in the graph that participate in the same distributed transaction. If the resulting graph contains a cycle, then a cancellation is sent to the process belonging to the youngest distributed transaction in the cycle, to abort the transaction.

Unless there is an actual deadlock, only a small number of transactions will be waiting for a lock in typical distributed database workloads, hence the overhead of distributed deadlock detection is small. When distributed deadlocks happen frequently, users are recommended to change the statement order in their transactions.

3.7.4 Multi-node transaction trade-offs. Multi-node transactions in Citus provide atomicity, consistency, and durability guarantees, but do not provide distributed snapshot isolation guarantees. A concurrent multi-node query could obtain a local MVCC snapshot before commit on one node and after commit on another. Addressing this would require changes to PostgreSQL to make the snapshot manager extensible.

In practice, we did not find a strong need for distributed snapshot isolation in the four workload patterns, and customers did not express a need for it yet. Most transactions in multi-tenant and CRUD applications are scoped to a single node, meaning they get isolation guarantees on that node. Analytical applications do not have strong dependencies between transactions and are hence more tolerant to relaxed guarantees.

Distributed snapshot isolation can be important in certain hybrid scenarios. However, existing distributed snapshot isolation techniques have a significant performance cost, due to the need for additional network round trips or waiting for the clock, which increases response times and lowers achievable throughput. In the context of the synchronous PostgreSQL protocol, throughput is ultimately capped by connections / response time. Since making a very large number of database connections is often impractical from the application perspective, low response time is the only way to achieve high throughput. Hence, we would likely make distributed snapshot isolation optional if we implement it in the future.

3.8 Specialized scaling logic

Apart from SELECT and DML commands, which are handled via the distributed query planner and executor, there are several other important PostgreSQL capabilities for which Citus implements specialized scaling logic.

DDL commands in PostgreSQL are transactional, online operations. Citus preserves this property by taking the same locks as PostgreSQL and propagating the DDL commands to shards via the executor in a parallel distributed transaction.

COPY commands append a CSV-formatted stream of data to a table. In PostgreSQL this happens in a single thread, which also needs to update indexes and check constraints. In the case of Citus, the coordinator opens COPY commands for each of the shards and streams rows to the shards asynchronously, which means writes are partially parallelized across cores even with a single client.

INSERT...SELECT commands between distributed tables use one of three strategies. If the SELECT requires a merge step on the coordinator, the command is internally executed as a distributed SELECT and a COPY into the destination table. If there is no merge step, but the source and destination tables are not co-located, the INSERT...SELECT performs distributed repartitioning of the SELECT result before inserting into the destination table. Otherwise, the INSERT...SELECT is performed directly on the co-located shards in parallel.

Stored procedures can be delegated to a worker node based on a distribution argument and a co-located distributed table, to avoid network round trips between coordinator and worker nodes. The worker node can then perform most operations locally without network round trips, but it can also perform a distributed transaction across worker nodes when necessary.
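Procedure delegation is configured through a Citus UDF. The sketch below assumes the hypothetical campaigns table from the multi-tenant example in Section 2.1 and reflects the create_distributed_function interface as documented for recent Citus versions; treat the exact argument names as an assumption.

    CREATE PROCEDURE record_click(p_company_id bigint, p_campaign_id bigint)
    LANGUAGE sql AS $$
        UPDATE campaigns SET clicks = clicks + 1
        WHERE company_id = p_company_id AND id = p_campaign_id;
    $$;

    -- Delegate calls to the worker that holds the co-located shards for the
    -- given company_id, so the body runs without extra round trips.
    SELECT create_distributed_function(
        'record_click(bigint, bigint)',
        distribution_arg_name := 'p_company_id',
        colocate_with := 'campaigns'
    );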
before in serting into the destination table Otherwise the INSERTSELECT is performed directly on the colocated shards in parallel Stored procedures can be delegated to a worker node based on a distribution argument and a colocated distributed table to avoid network round trips between coordinator and worker nodes The worker node can then perform most operations locally without net work round trips but it can also perform a distributed transaction across worker nodes when necessary 39 High Availability and backups High availability HA and backups for distributed database systems are complex topics that need to be looked at holistically from the Industrial Track Paper SIGMOD 21 June 2025 2021 Virtual Event China 2497 Workload Benchmark Multitenant HammerDB TPCCbased Realtime analytics Custom microbenchmarks Highperformance CRUD YCSB Data warehouse Queries from TPCH Table 3 Benchmarks used for different workload patterns perspective of the user and the platform that runs the database As this paper is primarily focused on the Citus extension we only give a brief overview of the typical HA and backup approach for Citus clusters and leave further details to future papers HA in Citus is handled primarily at the server level using existing PostgreSQL replication In an HA setup each node in the cluster has one or more hot standby nodes and replicates its writeahead log WAL using synchronous asynchronous or quorum replication When a node fails the cluster orchestrator promotes a standby and updates the Citus metadata DNS record or virtual IP The whole failover process takes 2030 seconds during which distributed trans actions involving the node roll back The orchestrator is typically part of the control plane in a managed service but onpremises users can use the pgautofailover extension 9 to serve the same function Backups are also handled primarily at the server level by cre ating periodic disk snapshots or copies of the database directory and continuously archiving the WAL to remote storage in each server Citus supports periodically creating a consistent restore point which is a record in the WAL of each node The restore point is created while blocking writes to the commit records tables on the coordinators which prevents inflight 2PC commits while creating the restore point Restoring all servers to the same restore point guarantees that all multinode transactions are either fully committed or aborted in the restored cluster or can be completed by the coordinator through 2PC recovery on startup 4 BENCHMARKS This section presents benchmark results that compare the perfor mance of Citus with one PostgreSQL server With Citus our goal has been to turn PostgreSQL into a distributed database This goal included providing compatibility with PostgreSQL features and its ecosystem of libraries and tools As such when customers ap proached us their primary performance baseline was single node PostgreSQL The benchmarks here reflect that performance base line for target workloads described in Section 2 Table 3 summarizes the relationship between workload patterns and benchmarks Each benchmark compares PostgreSQLA single PostgreSQL server Citus 01a single PostgreSQL server that uses Citus to shard data locally coordinator also acts as worker Citus 41a Citus cluster with a coordinator and 4 worker nodes and Citus 81a Citus cluster with a coordinator and 8 worker nodes All servers were Microsoft Azure virtual machines with 16 vcpus 64 GB of memory and networkattached disks with 7500 IOPS running PostgreSQL 13 and 
4 BENCHMARKS

This section presents benchmark results that compare the performance of Citus with one PostgreSQL server. With Citus, our goal has been to turn PostgreSQL into a distributed database. This goal included providing compatibility with PostgreSQL features and its ecosystem of libraries and tools. As such, when customers approached us, their primary performance baseline was single-node PostgreSQL. The benchmarks here reflect that performance baseline for the target workloads described in Section 2. Table 3 summarizes the relationship between workload patterns and benchmarks.

Table 3. Benchmarks used for different workload patterns.

  Workload                 Benchmark
  Multi-tenant             HammerDB (TPC-C-based)
  Real-time analytics      Custom microbenchmarks
  High-performance CRUD    YCSB
  Data warehouse           Queries from TPC-H

Each benchmark compares PostgreSQL (a single PostgreSQL server), Citus 0+1 (a single PostgreSQL server that uses Citus to shard data locally; the coordinator also acts as worker), Citus 4+1 (a Citus cluster with a coordinator and 4 worker nodes), and Citus 8+1 (a Citus cluster with a coordinator and 8 worker nodes). All servers were Microsoft Azure virtual machines with 16 vCPUs, 64 GB of memory, and network-attached disks with 7,500 IOPS, running PostgreSQL 13 and Citus 9.5 with default settings. Benchmarks were run from a separate driver node.

Each benchmark is structured such that a single server cannot keep all the data in memory but Citus 4+1 can, which demonstrates the often dramatic effect of scaling out memory along with CPU and I/O capacity. Citus 8+1 demonstrates the effect of scaling out only CPU and I/O capacity compared to Citus 4+1.

4.1 Multi-tenant benchmarks

To simulate a multi-tenant workload, we used the HammerDB [6] TPC-C-based workload, which is an OLTP benchmark that models an order processing system for warehouses, derived from TPC-C [1]. The benchmark effectively models a multi-tenant OLTP workload in which warehouses are the tenants. Most tables have a warehouse ID column and most transactions only affect a single warehouse ID, which allows the workload to scale. Around 7% of transactions span multiple warehouses and are likely to be multi-node transactions in Citus.

We configured HammerDB 3.3 with 500 warehouses (100 GB of data), 250 virtual users (connections), a 1 ms sleep time between transactions, and a 1 hour runtime, and ran it against each setup. In the case of Citus, we converted the items table to a reference table and the remaining tables to co-located distributed tables with the warehouse ID column as the distribution column. Additionally, we configured Citus to delegate stored procedure calls to worker nodes based on the warehouse ID argument.

The New Orders Per Minute (NOPM) results obtained from running HammerDB against each setup are shown in Figure 6.

[Figure 6: HammerDB TPC-C results with 250 virtual users and 500 warehouses (100 GB), in new order transactions per minute (NOPM), with average and p95 response times in milliseconds: PostgreSQL 15,923 NOPM; Citus 0+1 15,419 NOPM; Citus 4+1 214,043 NOPM; Citus 8+1 381,911 NOPM.]

On single-server PostgreSQL and Citus 0+1 the data set does not fit in memory, which means that the amount of I/O is relatively high and bottlenecked on a single disk. Citus does not provide immediate performance benefits for OLTP workloads on the same hardware, hence Citus 0+1 is slightly slower than single-server PostgreSQL due to the small distributed query planning overhead. The main benefit of using Citus for OLTP workloads is that it can scale beyond a single server to ensure the working set fits in memory and sufficient I/O and CPU is available.

Throughput on Citus 4+1 is around 13 times higher than throughput on a single PostgreSQL server, with only 5 times more hardware, because the cluster can keep all data in memory. Hence, Citus 4+1 performs less I/O and becomes CPU-bottlenecked. From 4 to 8 worker nodes, Citus shows slightly sublinear scalability. This is expected for the TPC-C-based workload due to the 7% of transactions that span across nodes: the response time of these transactions is dominated by network round trips for individual statements sent between nodes, which remains the same as the cluster scales out.
4.1.1 Distributed transaction performance. Multi-tenant applications mostly have single-tenant transactions, but there may be cross-tenant transactions, such as the ones simulated by HammerDB. To get a more accurate measure of the overhead of 2PC, we created a synthetic benchmark using two tables of 50 GB generated by the pgbench tool that comes with PostgreSQL. We then distributed and co-located these tables by key and defined a simple multi-statement transaction:

    UPDATE a1 SET v = v + :d WHERE key = :key1;
    UPDATE a2 SET v = v + :d WHERE key = :key2;

We ran the transactions using pgbench for 1 hour with 250 connections. In one set of runs, we used the same random value for both keys, such that these are two co-located updates. In another set of runs, we used a different random value, which results in a 2PC when the keys are on different nodes. The results are displayed in Figure 9.

[Figure 9: Distributed transactions benchmark comparing two updates using the same distribution key vs. different keys across two tables, in transactions per second (with average and p95 response times): roughly 1,100-1,300 TPS on PostgreSQL and Citus 0+1 for both variants, 9,254 vs. 6,505 TPS on Citus 4+1, and 19,194 vs. 15,324 TPS on Citus 8+1 for same-key vs. different-key transactions.]

The figure shows that 2PC incurs a 20-30% performance penalty, but scales with the number of worker nodes.

4.2 Real-time analytics benchmark

There is not a standard real-time analytics benchmark, so we ran several microbenchmarks for the individual commands involved in real-time analytics. We used publicly available data from the GitHub archive [4] in JSON format and loaded data for January 2020 into a table defined as follows:

    CREATE TABLE github_events (
        event_id text default md5(random()::text) primary key,
        data jsonb
    );
    SELECT create_distributed_table('github_events', 'event_id');  -- Citus only
    CREATE INDEX text_search_idx ON github_events USING GIN (
        (jsonb_path_query_array(data, '$.payload.commits[*].message')::text) gin_trgm_ops
    );

We used the pg_trgm extension included in PostgreSQL to index the commit messages within the JSON data. The index makes queries for a substring in a commit message much faster, at the cost of increased write overhead. We created the index both on PostgreSQL and on the Citus clusters.

[Figure 7: Real-time analytics microbenchmarks using 100 GB of GitHub Archive data. (a) Single-session COPY load time in seconds: PostgreSQL 268, Citus 0+1 95, Citus 4+1 47, Citus 8+1 46. (b) Dashboard query using the GIN index, in seconds: 2.23, 0.32, 0.17, 0.07, respectively. (c) Data transformation using INSERT...SELECT, in seconds: 3,033, 1,673, 390, 136, respectively.]

Our first microbenchmark measures ingestion performance in the presence of large indexes. We appended the first day of February 2020 (4.4 GB of JSON data) using the COPY command. The average load times over 5 runs are shown in Figure 7a. In this case, Citus 0+1 gives a speed-up over PostgreSQL due to the partial parallelism described in Section 3.8. The Citus cluster with 4 worker nodes can speed up the COPY further due to the greater number of cores and I/O capacity. After that, the single COPY command becomes bottlenecked on a single core on the coordinator, hence increasing to 8 worker nodes does not provide additional speed-up. To resolve this bottleneck, customers ingest data by running concurrent COPY commands.

Our second microbenchmark is a query that might be run by a dashboard: compute the number of commits that contain the phrase "postgres" per day.

    SELECT (data->>'created_at')::date,
           sum(jsonb_array_length(data->'payload'->'commits'))
    FROM github_events
    WHERE jsonb_path_query_array(data, '$.payload.commits[*].message')::text
          ILIKE '%postgres%'
    GROUP BY 1 ORDER BY 1 ASC;

The average runtime of the dashboard query over 5 runs (excluding the first, to mitigate the variability caused by cache misses) is shown in Figure 7b. The query only reads from memory and is largely bottlenecked on CPU, hence the greater parallelism provided by Citus enables the query to run faster even on a single server.
The average runtime over five runs is shown in Figure 7c. The parallelization significantly speeds up the INSERT..SELECT, with a 96% reduction in runtime on Citus 8+1 compared to a single PostgreSQL server, showing the ability of Citus to scale complex transactional data transformations.

4.3 High-performance CRUD benchmark

The Yahoo Cloud Serving Benchmark (YCSB) [16] is designed to test high-performance CRUD workloads on NoSQL databases. YCSB also has a JDBC driver that supports PostgreSQL. We ran workload A (50% reads, 50% updates) from YCSB on a table of 100M rows (100GB) using 256 threads with a uniform request distribution. For this benchmark, the coordinator's CPU usage becomes a scaling bottleneck. Hence, we ran the benchmark with every worker node acting as coordinator and configured YCSB to load-balance across all nodes.

The results appear in Figure 10. The workload is largely I/O-bound, hence throughput scales linearly with the higher I/O capacity when adding worker nodes. Single-server Citus performs slightly worse than PostgreSQL due to the additional overhead of distributed query planning. On bigger clusters, the speed-up is roughly proportional to the amount of I/O capacity, with a small additional speed-up due to data fitting in memory.

[Figure 10: YCSB Workload A results on 100M rows (100GB), in operations per second (PostgreSQL 3,950; Citus 0+1 3,544; Citus 4+1 34,562; Citus 8+1 83,240) and response times for updates.]

4.4 Data warehousing benchmark

A standard benchmark for data warehouses is TPC-H [2]. Queries in TPC-H do not have selective filters and therefore scan most of the data. Answering a TPC-H query quickly requires fast scanning and processing, which Citus achieves mainly through distributed parallelism and by keeping more data in memory. At the time of writing, 4 of the 22 queries in TPC-H are not yet supported.

We used HammerDB to generate a TPC-H schema with scale factor 100 (135GB), distributed and co-located the lineitem and orders tables by order key, and converted the smaller tables to reference tables to enable local joins. We then ran the 18 queries supported by Citus over a single session on each set-up. Figure 8 shows the number of queries per hour based on the completion time of the full set of queries over a single session.

[Figure 8: Data warehousing benchmark using queries from TPC-H at scale factor 100 (135GB), in queries per hour: PostgreSQL 1, Citus 0+1 4, Citus 4+1 45, Citus 8+1 133.]

Citus can achieve significant speed-ups compared to PostgreSQL by efficiently utilizing all available cores. The fact that TPC-H queries scan all the data and the tables do not fully fit in memory also means the single server is I/O-bottlenecked while the Citus cluster is only CPU-bottlenecked, which results in a two orders of magnitude speed-up on the 8 worker node cluster compared to a single PostgreSQL server.
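The schema conversion used here can be sketched as follows; the column names are the standard TPC-H ones, and the exact split between distributed and reference tables is an assumption for illustration.

  -- Distribute and co-locate the two largest tables by order key,
  -- so that the lineitem-orders join is always local to a shard pair.
  SELECT create_distributed_table('orders',   'o_orderkey');
  SELECT create_distributed_table('lineitem', 'l_orderkey', colocate_with => 'orders');

  -- Replicate the smaller tables to every node to enable local joins.
  SELECT create_reference_table('customer');
  SELECT create_reference_table('part');
  SELECT create_reference_table('partsupp');
  SELECT create_reference_table('supplier');
  SELECT create_reference_table('nation');
  SELECT create_reference_table('region');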
5 CITUS CASE STUDY: VENICEDB

Citus is used in many large-scale production systems that rely on a broad array of PostgreSQL and Citus capabilities to get the most out of their hardware. A good example of this is the VeniceDB project at Microsoft.

Microsoft uses Citus to analyze Windows measure data, which is derived from the telemetry coming in from hundreds of millions of Windows devices. Metrics are displayed on a real-time analytics dashboard called Release Quality View (RQV), which helps Windows engineering teams to assess the quality of the customer experience for each Windows release at the device grain. The RQV dashboard is a critical tool for Windows engineers, program managers, and executives, with hundreds of users per day.

The data store underlying RQV, code-named VeniceDB, is powered by two 1000-core Citus clusters running on Microsoft Azure, which store over a petabyte of data. While many different distributed databases and data processing systems were evaluated for VeniceDB, only Citus could address the specific combination of requirements associated with the petabyte-scale VeniceDB workload, including:

- Sub-second response times (p95) for 6M queries per day
- Ingest 10TB of new measure data per day
- Show new measure data in RQV within 20 minutes
- Nested subqueries with high cardinality group by
- Advanced secondary indexes (e.g., partial indexes, GiST indexes) to efficiently find reports along various dimensions
- Advanced data types (e.g., arrays, HyperLogLog) to implement sophisticated analytical algorithms in SQL
- Row count reduction through incremental aggregation
- Atomic updates across nodes to cleanse bad data

In the Citus clusters, raw data is stored in the measures table, which is distributed by device ID and partitioned by time on disk using the built-in partitioning capability in PostgreSQL. The COPY command is used to parallelize the ingestion of incoming JSON data into the distributed table. Distributed INSERT..SELECT commands are used to perform device-level pre-aggregation of incoming data into several reports tables with various indexes. The reports tables are also distributed on device ID and co-located with the measures table, such that Citus can fully parallelize the INSERT..SELECT.
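A much-simplified sketch of this layout is shown below. The table definitions, column names, and rollup interval are illustrative assumptions, not the actual VeniceDB schema.

  -- Raw measures: distributed by device ID, time-partitioned on each node.
  CREATE TABLE measures (
      device_id   bigint,
      event_time  timestamptz,
      measure     jsonb
  ) PARTITION BY RANGE (event_time);
  SELECT create_distributed_table('measures', 'device_id');
  CREATE TABLE measures_2021_01 PARTITION OF measures
      FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');

  -- Device-level report table, co-located with measures so that the
  -- pre-aggregation below is fully parallelized across shard pairs.
  CREATE TABLE reports (
      device_id   bigint,
      period      timestamptz,
      metric_avg  double precision
  );
  SELECT create_distributed_table('reports', 'device_id', colocate_with => 'measures');

  -- Incremental pre-aggregation of newly ingested data.
  INSERT INTO reports (device_id, period, metric_avg)
  SELECT device_id,
         date_trunc('hour', event_time),
         avg((measure->>'metric')::double precision)
  FROM measures
  WHERE event_time >= now() - interval '20 minutes'
  GROUP BY device_id, date_trunc('hour', event_time);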
Many of the queries from the RQV dashboard are of the form:

  SELECT avg(device_avg)
  FROM (
      SELECT device_id, avg(metric) AS device_avg
      FROM reports
      WHERE ...
      GROUP BY device_id, <time period, other dimensions>
  ) AS subq
  GROUP BY <time period, other dimensions>;

These queries filter by several dimensions (e.g., measure, time range, Windows build) to find a substantial subset of the data. The nested subquery first aggregates reports by device ID, which is needed to weigh overall averages by device rather than by the number of reports. There can be tens of millions of devices per query, which makes the GROUP BY device_id challenging to compute efficiently.

Since the subquery groups by the distribution column, the logical pushdown planner in Citus recognizes that it can push down the full subquery to all worker nodes to parallelize it. The worker nodes then use index-only scans to read the data in device ID order and minimize the disk I/O and memory footprint of the GROUP BY. Finally, Citus distributes the outer aggregation step by calculating partial aggregates on the worker nodes and merging the partial aggregates on the coordinator to produce the final result. At each step, VeniceDB uses a combination of advanced PostgreSQL and Citus capabilities to achieve maximum efficiency and scale on a single system.

6 RELATED WORK

Citus has architectural similarities with various other distributed database systems, but most systems focus only on a single workload pattern. In addition, Citus is unique in that it is a distributed RDBMS implemented as an extension of an existing RDBMS, which gives many benefits in terms of robustness, versatility, and compatibility with the open source ecosystem around PostgreSQL.

Vitess [12] is a sharding solution for MySQL. Like Citus, Vitess scales out an existing open source relational database. Unlike Citus, it is not an extension and therefore must be deployed separately from the database servers and requires additional application changes. Vitess is primarily optimized for multi-tenant and high-performance CRUD workloads and has built-in connection pooling for scaling the number of connections. It has limited support for queries and transactions across shards, which makes it less applicable in other workload patterns.

Greenplum [5] and Redshift [21] are PostgreSQL-based data warehouses and are hence optimized for handling complex SQL queries that analyze large amounts of data with low concurrency. As a result, both systems today provide better per-core performance than Citus for long analytical queries. Greenplum and Redshift also use columnar storage for fast scans and implement joins by shuffling data over the network. Citus supports those primitives as well, but the Citus implementation is not as well-optimized yet. On the other hand, Citus can handle a mixture of transactional and analytical workloads and can take advantage of the latest PostgreSQL features and extensions.

Aurora [27] can scale out the storage for a single PostgreSQL server for demanding OLTP workloads and fast recovery. Citus has a shared-nothing architecture, which means storage scale-out and fast recovery are achieved by dividing data across many database servers. The downside of a shared-nothing architecture is that the application needs to make additional data modelling decisions (choosing distributed tables), so it is not a drop-in replacement for applications built on a single-server RDBMS. The advantages of a shared-nothing architecture over shared storage are the ability to combine the compute power of all servers and to use advanced query parallelism. Also, Citus can be deployed in any environment.

Spanner [17], CockroachDB [25], and Yugabyte [13] have been developed with a focus on serializability for multi-node transactions. CockroachDB and Yugabyte support the PostgreSQL protocol as well, though significant functional limitations compared to PostgreSQL remain. A notable architectural difference between these systems and Citus is that they provide distributed snapshot isolation and use wound-wait rather than deadlock detection. In Sections 3.7.4 and 3.7.3 we discussed the downsides of these techniques in the context of PostgreSQL compatibility, and why we did not use them for Citus. One of the benefits of distributed snapshot isolation is that it avoids data modelling constraints: Citus users need to use co-location and reference tables to scope transactions to a single node in order to get full ACID guarantees. On the other hand, co-location and reference tables also enable efficient joins and foreign keys, and we therefore find them to be essential for scaling complex relational database workloads.

TimescaleDB [11] is a PostgreSQL extension that optimizes PostgreSQL for time series data. It uses similar hooks as Citus to introduce the concept of a hypertable, which is automatically partitioned by time. Partitioning tables by time is useful for limiting index sizes, to maintain high write performance for time series workloads, and for partition pruning, which speeds up queries by time range. Citus and TimescaleDB are currently incompatible due to conflicting usages of PostgreSQL hooks, but Citus does work with pg_partman [8], which is a simpler time partitioning extension. Many real-time analytics applications that use Citus also use pg_partman on top of distributed tables, in which case the individual shards are locally partitioned to get both the benefits of distributed tables and time partitioning.
7 CONCLUSIONS AND FUTURE WORK

Citus is a distributed database engine for PostgreSQL that addresses the need for scalability in the PostgreSQL ecosystem. As an extension, Citus maintains long-term compatibility with the PostgreSQL project, including new features and tools. Rather than focusing on a particular workload, we designed Citus as a multi-purpose database system that can handle a broad variety of PostgreSQL workloads that need to scale beyond a single server. That way, users get the simplicity and flexibility of using a widely adopted open source relational database system at scale.

Much of our future work is around implementing support for any remaining PostgreSQL features that are not fully supported on distributed tables. These include non-co-located correlated subqueries, recursive CTEs, and logical replication between different table types. Increasingly, we are also seeing users with hybrid data models that keep small tables on a single server and then distribute only large tables. Automated data model optimization for these scenarios is another important area of future work. Finally, Citus is increasingly being used in more specialized workload patterns, such as MobilityDB [14] and Kyrix-S [26]. There are many potential distributed query optimizations that can be implemented specifically for those workloads. We will explore making Citus itself extensible to iterate on those optimizations faster.

ACKNOWLEDGMENTS

Citus is the result of a collaborative effort that spans close to a decade. We would especially like to thank Andres Freund, Hadi Moshayedi, Jason Petersen, Metin Döşlü, Önder Kalacı, Murat Tuncer, Samay Sharma, and Utku Azman for their vast and numerous contributions over many years. We would also like to thank our reviewers for their invaluable comments in writing this paper, and Min Wei for his input on the VeniceDB section. Finally, we would like to thank the many users of Citus who provide the input and inspiration that continue to shape Citus into a practical distributed PostgreSQL solution.

REFERENCES

[1] 2010. TPC Benchmark C, Standard Specification, Revision 5.11. Technical Report. Transaction Processing Performance Council (TPC).
[2] 2018. TPC Benchmark H, Standard Specification, Revision 2.18.0. Technical Report. Transaction Processing Performance Council (TPC).
[3] 2020. Citus Open Source Repo. https://github.com/citusdata/citus
[4] 2020. GitHub Archive. https://www.gharchive.org
[5] 2020. Greenplum Database. https://greenplum.org
[6] 2020. HammerDB. https://www.hammerdb.com
[7] 2020. Hyperscale (Citus) documentation on Citus metadata. https://docs.microsoft.com/en-us/azure/postgresql/reference-hyperscale-metadata
[8] 2020. PG Partitioning Manager. https://github.com/pgpartman/pg_partman
[9] 2020. pg_auto_failover: Postgres extension and service for automated failover and high-availability. https://github.com/citusdata/pg_auto_failover
[10] 2020. PgBouncer: lightweight connection pooler for PostgreSQL. https://www.pgbouncer.org
[11] 2020. Timescale: PostgreSQL for time series. https://www.timescale.com
[12] 2020. Vitess: A database clustering system for horizontal scaling of MySQL. https://vitess.io
[13] 2020. YugabyteDB. https://www.yugabyte.com
[14] Mohamed Bakli, Mahmoud Sakr, and Esteban Zimanyi. 2019. Distributed moving object data management in MobilityDB. In Proceedings of the 8th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. 1-10.
[15] Stefano Ceri and Giuseppe Pelagatti. 1983. Correctness of query execution strategies in distributed databases. ACM Transactions on Database Systems (TODS) 8, 4 (1983), 577-607.
[16] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143-154.
[17] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1-22.
[18] P. Corti, T.J. Kraft, S.V. Mather, and B. Park. 2014. PostGIS Cookbook. Packt Publishing. https://books.google.nl/books?id=zCaxAgAAQBAJ
[19] DB-Engines. 2020. PostgreSQL System Properties. https://db-engines.com/en/system/PostgreSQL
[20] Andres Freund. 2020. Analyzing the Limits of Connection Scalability in Postgres. https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/analyzing-the-limits-of-connection-scalability-in-postgres/ba-p/1757266
[21] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1917-1923.
[22] Craig Kerstiens. 2018. Introducing Landlord: per tenant stats in Postgres with Citus. https://www.citusdata.com/blog/2018/07/31/introducing-landlord-per-tenant-stats
[23] Eamonn Maguire, Lukas Heinrich, and Graeme Watt. 2017. HEPData: a repository for high energy physics data. In J. Phys. Conf. Ser., Vol. 898. 52.
[24] Michael Stonebraker and Lawrence A. Rowe. 1986. The design of POSTGRES. ACM SIGMOD Record 15, 2 (1986), 340-355.
[25] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, et al. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1493-1509.
[26] Wenbo Tao, Xinli Hou, Adam Sah, Leilani Battle, Remco Chang, and Michael Stonebraker. 2020. Kyrix-S: Authoring Scalable Scatterplot Visualizations of Big Data. arXiv preprint arXiv:2007.15904 (2020).
[27] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041-1052.
Traditionally, new DDBMSs that aimed to offer compatibility with an existing relational database system have followed one of three approaches: (i) build the database engine from scratch and write a layer to provide over-the-wire SQL compatibility, (ii) fork an open source database system and build new features on top of it, or (iii) provide new features through a layer that sits between the application and the database, as middleware. For each of these approaches, the cost of keeping up with the ongoing developments in the core project over the decades-long lifecycle of a database management system is substantial and often insurmountable. Most projects lag by many years in terms of compatibility with new PostgreSQL features, tools, and extensions.

Citus is the first distributed database that delivers its functionality through the PostgreSQL extension APIs. The extension APIs provide sufficient control over the behavior of PostgreSQL to integrate a sharding layer, a distributed query planner and executor, and distributed transactions in a way that is transparent to the application. Being an extension allows Citus to maintain compatibility with the latest PostgreSQL features and tools at negligible engineering cost. Moreover, Citus distributes data across regular PostgreSQL servers and sends queries over the regular PostgreSQL protocol. This means that Citus can utilize all the data access and storage capabilities offered by the underlying PostgreSQL servers, including advanced capabilities such as JSONB, lateral joins, GiST indexes, array types, and other extensions.

Building a distributed PostgreSQL engine that is 100% compatible with a single server and scales in all scenarios without performance regressions is perhaps impossible, but also unnecessary. Not every application that benefits from PostgreSQL also benefits from scaling out. We found that the PostgreSQL applications that benefit from scaling out largely fall into 4 workload patterns, namely: multi-tenant (SaaS), real-time analytics, high-performance CRUD, and data warehousing. Each workload pattern requires a different combination of capabilities from the database. The requirements of these workload patterns have largely driven the development of Citus.

Table 1: Scale requirements of workload patterns for distributed relational databases.
  Scale requirements         MT      RA      HC       DW
  Typical query latency      10ms    100ms   1ms      10s
  Typical query throughput   10k/s   1k/s    100k/s   10/s
  Typical data size          1TB     10TB    1TB      10TB

This paper shares what we learned from building and deploying Citus over the years. As such, the paper brings three main contributions. First, we describe the four workload patterns that we observed in customer conversations and their requirements in terms of scale and distributed database capabilities. Second, we describe the PostgreSQL extension APIs and how Citus uses them to implement a comprehensive distributed database system. Finally, the Citus architecture shows how we addressed the requirements of a broad range of data-intensive applications within a single distributed database system.

The remainder of this paper is organized as follows:
- Section 2 distills the high-level requirements of four PostgreSQL workload patterns that benefit from scaling out.
- Section 3 describes how Citus implements and scales distributed query planning and execution, distributed transactions, and other database operations.
- Section 4 presents benchmark results demonstrating that Citus can scale PostgreSQL in each of the four workload patterns.
- Section 5 shows how Citus is used in a data-intensive real-time analytics application at Microsoft.
- Section 6 describes related work in distributed relational database management systems.
- Section 7 concludes the paper and shares some of our future work.

2 WORKLOAD REQUIREMENTS

When talking to companies looking to scale out PostgreSQL about potential Citus adoption, we observed that almost all workloads we encountered followed four patterns: multi-tenant (MT), real-time analytics (RA), high-performance CRUD (HC), and data warehousing (DW).

Table 1 gives an overview of the approximate scale requirements in each workload pattern. It is worth noting that the latency, throughput, and data size requirements vary significantly, which in practice means that each workload pattern requires a different combination of distributed database capabilities to achieve its notion of high performance at scale. We describe the four workload patterns and the capabilities requested for each workload (in italics) in the remainder of this section, and give an overview in Table 2. In the Citus architecture section we will further describe these capabilities and how Citus implements them.

[Table 2: Workload patterns and required distributed relational database capabilities. The capabilities covered are: distributed tables, co-located distributed tables, reference tables, local tables, distributed transactions, distributed schema changes, query routing, parallel distributed SELECT, parallel distributed DML, co-located distributed joins, non-co-located distributed joins, columnar storage, parallel bulk loading, and connection scaling.]

[Figure 1: Data model for a simple multi-tenant messaging application, with blue and orange representing different tenants. There are relatively complex relationships within the data model that are expressed through foreign keys and joins, but not across tenants. Hence, multi-tenant applications can scale along the tenant dimension.]

2.1 Multi-tenant

Multi-tenant applications serve many relatively independent tenants from a single backend deployment. A typical example is Software-as-a-Service (SaaS). Such applications often have complex OLTP workloads with many relationships, though tenants are relatively independent within the data model, as shown in the example in Figure 1. One of the benefits of the SaaS model that has helped make it successful is that the cost of adoption (i.e., adding a new tenant) is low for both user and application developer, which can lead to rapid growth. One of the challenges in scaling a multi-tenant workload is that the working set is relatively large due to the large number of independent tenants.

A traditional approach to scaling a relational database for a multi-tenant application is manual sharding. The data for each tenant is placed into its own database or schema (namespace). The application then needs to keep track of where each database or schema is placed, build infrastructure for moving data around, synchronize data and schema changes across potentially thousands of databases, and use external systems to do analytics across tenants. This approach only scales up to the level of investment that the application developer is willing to make in building a distributed database management plane. The alternative approach is to use a shared schema with tenant ID columns.
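With Citus, a shared schema echoing the messaging example of Figure 1 might be set up roughly as follows. This is a sketch with illustrative table and column names, not a schema taken from the paper.

  -- Every tenant-owned table carries the tenant (organization) ID, and the
  -- tables are distributed and co-located by it.
  CREATE TABLE organizations (
      org_id  bigint PRIMARY KEY,
      name    text
  );
  CREATE TABLE channels (
      org_id      bigint,
      channel_id  bigint,
      name        text,
      PRIMARY KEY (org_id, channel_id)
  );
  CREATE TABLE messages (
      org_id      bigint,
      channel_id  bigint,
      message_id  bigint,
      body        text,
      PRIMARY KEY (org_id, channel_id, message_id)
  );

  SELECT create_distributed_table('organizations', 'org_id');
  SELECT create_distributed_table('channels', 'org_id', colocate_with => 'organizations');
  SELECT create_distributed_table('messages', 'org_id', colocate_with => 'organizations');

  -- Foreign keys that include the tenant ID can be enforced locally, because
  -- the same org_id always lands on the same worker node.
  ALTER TABLE channels ADD FOREIGN KEY (org_id) REFERENCES organizations (org_id);
  ALTER TABLE messages ADD FOREIGN KEY (org_id, channel_id)
      REFERENCES channels (org_id, channel_id);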
To scale out this approach for multi-tenant applications in PostgreSQL, we identified several distributed database capabilities. These capabilities are essential to meet both the functional requirements of complex SaaS applications (including complex SQL, foreign keys, constraints, and indexes) and the scale requirements (low query latency). Tables that contain tenant-specific data should all have a tenant ID column, such that they can be distributed and co-located by tenant ID. Co-location enforces that the same tenant ID is always on the same server, such that joins and foreign keys on the tenant ID can be performed locally. The database system should then be able to route arbitrarily complex SQL queries that filter by tenant ID to the appropriate server with minimal overhead, to linearly scale query throughput. Attempts by the distributed query planner to analyze and break up the query will typically result in significant regressions from single-server PostgreSQL. Reference tables that are replicated across servers are needed for local joins and foreign keys with tables that are shared across tenants. Applications may also perform analytics (e.g., parallel queries that do co-located distributed joins on the tenant ID) and transactions (incl. distributed schema changes) across all tenants.

In addition to distributed database capabilities, we also hear questions from customers that are specific to multi-tenant workloads. First, customers may need the flexibility to customize data for certain tenants. For these customers, we recommend adding new fields using the JSONB data type, which can be efficiently indexed through GIN or expression indexes. Second, customers may need control over tenant placement to avoid issues with noisy neighbors. For this, Citus provides features to view hotspots, to isolate a tenant onto its own server, and to provide fine-grained control over tenant placement [22].

Citus implements the combination of these capabilities required for multi-tenant workloads. As such, Citus provides a significantly simpler alternative to scaling out than the database-per-tenant approach.

2.2 Real-time Analytics

Real-time analytics applications provide interactive analytics or search on large streams of data with only minor delay. The main data stream typically consists of event data or time series data describing telemetry from devices or software. Use cases may include system monitoring, anomaly detection (e.g., fraud detection), behavioral analytics (e.g., funnel analysis, segmentation), geospatial analytics, and others. A common real-time analytics application is a multi-user analytics dashboard that visualizes aggregations of the data through charts.

The database needs to be able to sustain a high write throughput to keep up with a stream of events, while the application issues hundreds of analytical queries per second. Queries should have sub-second response times to be able to show results interactively, regardless of the data volume. Since the query set is known upfront, indexes, materialized views, and other data transformations (e.g., rollups) can be used to minimize response times. The database needs to be able to update these incrementally to quickly reflect new data in the application. An example of a typical real-time analytics pipeline is shown in Figure 2.

[Figure 2: A simple real-time analytics pipeline. Data is ingested into a raw data table and then incrementally pre-aggregated into one or more rollups. The application may query both the rollups and the raw event data.]
PostgreSQL has many powerful capabilities for building real-time analytics applications. Its heap storage format and COPY command allow very fast data ingestion, while MVCC allows analytical queries to run concurrently with writes. PostgreSQL also has a versatile set of data types and index types, including comprehensive support for arrays, JSON, and custom types. The only shortcoming of PostgreSQL for real-time analytics applications is that the data volume can easily exceed the capacity of a single server, and most operations are single-threaded.

To scale real-time analytics workloads, the database system needs to be able to distribute tables across servers and support parallel bulk loading to keep up with the data volume. Parallel distributed DML, in particular INSERT..SELECT, is needed to be able to incrementally pre-aggregate large volumes of incoming data into rollup tables. Co-location between the source table and the rollup table enables very fast INSERT..SELECT. Queries from the dashboard use query routing or parallel distributed SELECT on the event data or the rollup tables to keep response times low. Co-located distributed joins are needed for advanced analytics (e.g., funnel queries).

Citus supports the capabilities required for building large-scale real-time analytics dashboards on PostgreSQL. Section 5 gives an example of a petabyte-scale real-time analytics use case at Microsoft.

2.3 High-performance CRUD

High-performance CRUD workloads involve many objects (documents) that are modified in a relatively independent manner. The application primarily accesses the data through simple CRUD (create, read, update, delete) operations on a key, but may also issue more complex queries across objects. The objects typically follow an unstructured data format like JSON. PostgreSQL is a popular choice for this type of workload because of its sophisticated JSON support.

A large PostgreSQL server can handle hundreds of thousands of writes and millions of reads per second. Scalability problems can arise when the working set is large or when making incremental changes to large objects at a high rate. The PostgreSQL MVCC model requires writing a new copy and later reclaiming the space (autovacuuming), which can cause performance degradation if autovacuuming cannot keep up. Another limitation of PostgreSQL is that it can only handle a limited number of idle connections, because of the process-per-connection architecture and the relatively high memory overhead of a process.

To scale high-performance CRUD workloads, tables need to be distributed by key. CRUD operations can then be routed to a shard with very low planning overhead and use the primary key index within the shard, enabling high throughput and low latency. A significant benefit of sharding PostgreSQL tables in high-performance CRUD workloads is that, apart from the ability to use more hardware, autovacuuming is parallelized across many cores. Reference tables are not strictly required, but can make objects significantly more compact through normalization, which is beneficial at scale. Parallel distributed SELECT and DML are useful for performing scans and analytics across a large number of objects. To scale the number of connections, any server needs to be able to process distributed queries.

Citus meets all the requirements for high-performance CRUD workloads, but still has some inherent limitations around connection scalability. We are working with the community to improve connection scalability in PostgreSQL itself [20].
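As a concrete illustration of this pattern (table and column names are assumptions), distributing a table by its key keeps each CRUD operation scoped to a single shard:

  -- Distribute the table by its key; single-key operations are routed to one
  -- shard and use the primary key index within that shard.
  CREATE TABLE objects (
      id   bigint PRIMARY KEY,
      doc  jsonb
  );
  SELECT create_distributed_table('objects', 'id');

  -- Typical CRUD operations, each scoped to a single key:
  INSERT INTO objects VALUES (42, '{"status": "new"}')
      ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc;
  SELECT doc FROM objects WHERE id = 42;
  UPDATE objects SET doc = doc || '{"status": "done"}' WHERE id = 42;
  DELETE FROM objects WHERE id = 42;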
2.4 Data warehousing

Data warehousing applications combine data from different sources into a single database system to generate ad-hoc analytical reports. The application typically does not have low latency or high throughput requirements, but queries may need to scan very large amounts of data. The database needs to support very fast scans and be able to find efficient query plans for hand-written SQL involving arbitrary joins.

PostgreSQL generally lacks the scanning performance and parallelism to perform very large scans within a reasonable amount of time. On the other hand, its performance features, comprehensive SQL support, and ecosystem still make it an attractive option for analytics.

To scale data warehouse applications, scans need to be sped up through parallel distributed SELECT and columnar storage. Distribution columns should be chosen to maximize the number of co-located distributed joins, but the database needs to support efficient non-co-located joins as well, by reshuffling or broadcasting data over the network. The query optimizer needs to decide on a join order that minimizes network traffic. At the time of writing, Citus meets most of the requirements for data warehouse applications, but has several limitations around non-co-located joins (e.g., correlated subqueries are unsupported), which limits its applicability in some data warehousing workloads.

2.5 Other workloads

Some workloads have not been a focus area of Citus, primarily because we rarely observed them among potential customers. There are various workload patterns that PostgreSQL could address given its extensibility, but the ecosystem is centered around other database systems and tools. Examples include streaming analytics, Extract-Transform-Load (ETL), machine learning, and text search. We believe PostgreSQL can be successful in these areas, but it will take broader movements within developer communities to make PostgreSQL an important part of those ecosystems.

We also observed that complex single-tenant OLTP workloads are less likely to run into the large working set problem that occurs in multi-tenant OLTP workloads. Moreover, scaling out a complex single-tenant OLTP application often lowers overall throughput, because network latency lowers per-connection throughput and results in locks being held for longer, which in turn lowers achievable concurrency. Overall, the observations about the workload patterns described in this section have significantly influenced the Citus architecture.

3 CITUS ARCHITECTURE

In a Citus cluster, all servers run PostgreSQL with the Citus extension (plus any number of other extensions) installed. Citus uses the PostgreSQL extension APIs to change the behavior of the database in two ways. First, Citus replicates database objects, such as custom types and functions, to all servers. Second, Citus adds two new table types that can be used to take advantage of additional servers. The remainder of this section describes how Citus implements and scales the most important operations for Citus tables.

3.1 PostgreSQL extension APIs

A PostgreSQL extension consists of two parts: a set of SQL objects (e.g., metadata tables, functions, types) and a shared library that is loaded into PostgreSQL. All database modules within PostgreSQL are extensible except for the parser. The main reason is that the parser code is generated at build time, while the extension infrastructure loads the shared library at run time. Keeping the parser non-extensible also forces syntactic interoperability between extensions. Once a PostgreSQL extension is loaded, it can alter the behavior of PostgreSQL by setting certain hooks. Citus uses the following hooks:
- User-defined functions (UDFs) are callable from SQL queries as part of a transaction, and are primarily used to manipulate the Citus metadata and implement remote procedure calls.
- Planner and executor hooks are global function pointers that allow an extension to provide an alternative query plan and execution method. After PostgreSQL parses a query, Citus checks if the query involves a Citus table. If so, Citus generates a plan tree that contains a CustomScan node, which encapsulates a distributed query plan. CustomScan is an execution node in a PostgreSQL query plan that holds custom state and returns tuples via custom function pointers. The Citus CustomScan calls the distributed query executor, which sends queries to other servers and collects the results before returning them to the PostgreSQL executor.
- Transaction callbacks are called at critical points in the life cycle of a transaction (e.g., pre-commit, post-commit, abort). Citus uses these to implement distributed transactions.
- The utility hook is called after parsing any command that does not go through the regular query planner. Citus uses this hook primarily to apply DDL and COPY commands that affect Citus tables.
- Background workers run user-supplied code in separate processes. Citus uses this API to run a maintenance daemon. This daemon performs distributed deadlock detection, 2PC prepared transaction recovery, and cleanup.

Through these hooks, Citus can intercept any interaction between the client and the PostgreSQL engine that involves Citus tables. Citus can then replace or augment PostgreSQL's behavior.

3.2 Citus architecture diagram

Citus deployments typically have 1 coordinator and 0 to n worker nodes, as shown in Figure 3. The coordinator stores the metadata (catalogs) of the distributed tables, and clients typically connect to the coordinator. A PostgreSQL server implicitly becomes the coordinator when the user adds a worker node via a Citus UDF. Worker nodes store the shards that contain the actual data. When the cluster is small, the coordinator itself can also be used as a worker node, so the smallest possible Citus cluster is a single server.

[Figure 3: Example Citus deployment showing the coordinator and two worker nodes; items and users are distributed by user ID, categories is a reference table, and sessions is a local table.]

The benefit of using a single coordinator as the entry point is that PostgreSQL libraries and tools can interact with a Citus cluster as if it were a regular PostgreSQL server. Since the overhead of distributed queries is small compared to query execution, a large coordinator node can handle hundreds of thousands of transactions per second, or ingest millions of rows per second via PostgreSQL's COPY command.
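Worker nodes are registered in the coordinator's metadata through such a UDF, along the following lines. The host names are placeholders; the UDF is named citus_add_node in current Citus releases (master_add_node in older ones).

  -- Register two worker nodes with the coordinator.
  SELECT citus_add_node('worker-1.example.com', 5432);
  SELECT citus_add_node('worker-2.example.com', 5432);

  -- The list of nodes can be inspected from the Citus metadata.
  SELECT nodename, nodeport FROM pg_dist_node;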
3.2.1 Scaling the coordinator node. The coordinator can become a scaling bottleneck in some cases, for example when serving demanding high-performance CRUD workloads. To resolve this bottleneck, Citus can distribute the distributed table metadata, and all changes to it, to all the nodes. In this case, each worker node assumes the role of coordinator for all distributed queries and transactions it receives. Clients should use a load balancing mechanism to divide connections over the workers. Since DDL commands modify distributed metadata, the application should continue to connect to the coordinator when issuing DDL commands. Since the volume of DDL commands is low, the coordinator no longer becomes a bottleneck.

Today we recommend this more complex mode only when customers are familiar with PostgreSQL and have an actual scaling bottleneck. When each node can coordinate and also serve queries, each connection to the Citus cluster creates one or more connections within the cluster. Citus caches connections for higher performance, and this could lead to a connection scaling bottleneck within the cluster. We typically mitigate this issue by setting up connection pooling between the instances via PgBouncer [10]. We are working to improve connection scaling behavior in future versions of Citus and PostgreSQL itself [20].

3.3 Citus table types

Citus introduces two types of tables to PostgreSQL, distributed tables and reference tables, without taking away the concept of regular (local) PostgreSQL tables. Citus tables are initially created as regular PostgreSQL tables, and then they are converted by calling Citus-specific functions. After conversion, Citus intercepts all commands involving Citus tables in the relevant PostgreSQL hooks.

3.3.1 Distributed tables. Distributed tables are hash-partitioned along a distribution column into multiple logical shards, with each shard containing a contiguous range of hash values. The advantage of hash-partitioning is that it enables co-location and reasonably well-balanced data without the need for frequent resharding. Range partitioning is also available for some advanced use cases. The create_distributed_table UDF converts a regular table to a distributed table by creating the shards on the workers and adding the table to the Citus metadata:

  CREATE TABLE my_table (...);
  SELECT create_distributed_table('my_table', 'distribution_column');

Shards are placed on worker nodes in a round-robin fashion. A single worker node can contain multiple logical shards, such that the cluster can be rebalanced by moving individual shards between worker nodes.

3.3.2 Co-location. Citus ensures that the same range of hash values is always on the same worker node among distributed tables that are co-located. Relational operations (e.g., joins, foreign keys) that involve the distribution column of two or more co-located distributed tables can be performed without any network traffic by operating on pairs of co-located shards. When creating a second distributed table, co-location can be specified using the colocate_with option:

  CREATE TABLE other_table (...);
  SELECT create_distributed_table('other_table', 'distribution_column',
                                  colocate_with => 'my_table');

If no colocate_with option is specified, Citus automatically co-locates distributed tables based on the data types of the distribution columns. We found this to be helpful for users unfamiliar with distributed database concepts.

3.3.3 Reference tables. Reference tables are replicated to all nodes in a Citus cluster, including the coordinator. Joins between distributed tables and reference tables are implemented by joining each shard of the distributed table with the local replica of the reference table. Similarly, foreign keys from distributed tables to reference tables are enforced locally, by creating regular foreign keys between each of the shards of the distributed table and the local replica of the reference table. Users create reference tables by calling create_reference_table:

  CREATE TABLE dimensions (...);
  SELECT create_reference_table('dimensions');

After conversion, writes to the reference table are replicated to all nodes, and reads are answered directly by the coordinator or via load-balancing across worker nodes.
3.4 Data rebalancing

Most Citus clusters grow in data size and query volume. Certain worker nodes may also receive more load than others due to data distribution and query patterns. To enable an even distribution, Citus provides a shard rebalancer. By default, the rebalancer moves shards until it reaches an even number of shards across worker nodes. Alternatively, users can choose to rebalance based on data size, or create a custom policy by defining cost, capacity, and constraint functions in SQL [7].

Most rebalance operations start with the customer changing their cluster size. The rebalancer then picks one shard, and any shards that are co-located with it, and starts a shard move operation. To move shards, Citus creates a replica of the shards on a different node using PostgreSQL's logical replication. With logical replication, the shards in transit can continue to receive read and write queries. When the shard replicas have caught up with their source, Citus obtains write locks on the shards, waits for replication to complete, and updates the distributed table metadata. From that point on, any new queries go to the new worker node. The last few steps typically only take a few seconds, hence there is minimal write downtime.

3.5 Distributed query planner

When a SQL query references a Citus table, the distributed query planner produces a PostgreSQL query plan that contains a CustomScan node, which contains the distributed query plan. A distributed query plan consists of a set of tasks (queries on shards) to run on the workers, and optionally a set of subplans whose results need to be broadcast or repartitioned such that their results can be read by subsequent tasks.

Citus needs to handle a wide range of workloads that require different query planning strategies to scale. Simple CRUD queries benefit from minimal planning overhead. Complex data warehousing queries, on the other hand, benefit from advanced query optimizations, which incur higher planning overhead. Over time, Citus evolved to have planners for different classes of queries. Figure 4 gives a basic example for each planner. We further describe those planners below.

[Figure 4: Citus planning examples. (A) Fast path planner, showing single shard access. (B) Router planner, showing single-shard co-located and reference table joins. (C) Logical pushdown planner, showing multi-shard co-located and reference table joins. (D) Logical join order planner, showing one table being repartitioned into intermediate results to perform a non-co-located join.]

Fast path planner handles simple CRUD queries on a single table with a single distribution column value. The planner extracts the distribution column value directly from a filter in the query and determines the shard that matches the value. The planner then rewrites the table name to the shard name to construct the query to run on the worker, which can be done with minimal CPU overhead. Hence, the fast path planner supports high-throughput CRUD workloads.
Router planner handles arbitrarily complex queries that can be scoped to one set of co-located shards. The router planner checks or infers whether all distributed tables have the same distribution column filter. If so, the table names in the query are rewritten to the names of the co-located shards that match the distribution column value. The router planner implicitly supports all SQL features that PostgreSQL supports, since it will simply delegate the full query to another PostgreSQL server. Hence, the router planner enables multi-tenant SaaS applications to use all SQL features with minimal overhead.

Logical planner handles queries across shards by constructing a multi-relational algebra tree [15]. Multi-relational algebra formalizes two distributed execution primitives that are not available in PostgreSQL: to collect and to repartition data. This difference influences the separation between the router and logical planners. The goal of the logical planner is to push as much of the computation to the worker nodes as possible before merging the results on the coordinator. We further distinguish between two logical planning strategies:

1. The logical pushdown planner detects whether the join tree can be fully pushed down. This requires that all distributed tables have co-located joins between them, and that subqueries do not require a global merge step (e.g., a GROUP BY must include the distribution column). If so, the planner can be largely agnostic to the SQL constructs being used within the join tree, since they are fully delegated to the worker nodes, and the distributed query plan becomes trivially parallel.

2. The logical join order planner determines the optimal execution order for join trees involving non-co-located joins. It evaluates all possible join orders between distributed tables and subqueries using co-located joins, broadcast joins, and repartition joins, and chooses the order that minimizes the network traffic. Broadcast joins and repartition joins result in subplans, with filters and projections pushed into the subplan.

For each query, Citus iterates over the four planners from lowest to highest overhead. If a particular planner can plan the query, Citus uses it. This approach works well for two reasons. First, there is an order of magnitude difference between each workload's latency expectations, as described in Table 1. Second, this difference also applies to each planner's overhead. As an example, a complex data warehousing query could take minutes to complete. In that case, the user does not mind paying a few milliseconds of overhead for the fast path, router, and pushdown planners.
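To illustrate the difference, consider two co-located distributed tables a and b, both distributed by a key column (an illustrative schema; the category column is also assumed). The first query below can be fully pushed down, while the second needs a merge step:

  -- GROUP BY includes the distribution column: the logical pushdown planner
  -- delegates the entire join and aggregation to each shard pair in parallel.
  SELECT a.key, count(*)
  FROM a JOIN b ON a.key = b.key
  GROUP BY a.key;

  -- GROUP BY on a non-distribution column: workers compute partial counts
  -- per group and the coordinator merges the partial aggregates.
  SELECT b.category, count(*)
  FROM a JOIN b ON a.key = b.key
  GROUP BY b.category;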
3.6 Distributed query executor

A PostgreSQL query plan is structured as a tree of execution nodes that each have a function to return a tuple. Plans generated by the distributed query planner contain a CustomScan node that calls into the distributed query executor. In case the plan was generated via the fast path or router planner, the entire plan is a single CustomScan node, since the execution is fully delegated to a single worker node. Plans generated by the logical planner may require a merge step (e.g., aggregation across shards). In that case, there will be additional execution nodes above the CustomScan, which are handled by the regular PostgreSQL executor. When the PostgreSQL executor calls into the CustomScan, Citus first executes subplans (if any) and then hands the execution to a component called the adaptive executor. Figure 5 shows an example of the call flow for a simple analytical query.

[Figure 5: Call flow for the execution of a simple analytical query (SELECT avg(v) FROM a JOIN b ON a.key = b.key) that was planned by the logical pushdown planner; the adaptive executor sends rewritten per-shard queries, computing sums and counts, to the worker nodes.]

3.6.1 Adaptive executor. The design of the adaptive executor is driven by the need to support a mixture of workloads, and by the process-per-connection architecture of PostgreSQL. Some query plans will have a single task that needs to be routed to the right worker node with minimal overhead, while other query plans have many tasks that Citus runs in parallel by opening multiple connections per worker node. We found parallelizing queries via multiple connections to be more versatile and performant than the built-in parallel query capability in PostgreSQL, which uses a fixed set of processes. The downside of opening multiple connections is the cost of connection establishment and the overhead of extra processes, in particular when many distributed queries run concurrently.

The adaptive executor manages the parallelism vs. low latency trade-off using a technique we call slow start. At the start of the query, the executor can use one connection to each worker (n = 1). Every 10ms, the number of new connections that can be opened increases by one (n = n + 1). If there are t pending tasks for a worker node that are not assigned to a specific connection, the executor increases the connection pool of that worker node by min(n, t) new connections. The reasoning behind slow start is that a simple in-memory index lookup typically takes less than a millisecond, so all tasks on a worker are typically executed before opening any additional connections. On the other hand, analytical queries often take hundreds of milliseconds or more, and the delayed connection establishment will barely be noticeable in the overall runtime.

While slow start increases the number of connections when tasks take a long time to complete, sometimes tasks take a long time because there are already many concurrent connections issuing queries to the worker node. Therefore, the executor also keeps track of the total number of connections to each worker node in shared memory, to prevent it from exceeding a shared connection limit. When the counter reaches the limit, opening additional connections is avoided, such that the overall number of outgoing connections remains at or below the limit. The implementation converges to a state where the available connection slots on the worker nodes are fairly distributed between the processes executing distributed queries on the coordinator.

When multiple connections per worker node are used, each connection will access a different subset of shards and hold uncommitted writes and locks in case of a multi-statement transaction. For every connection, Citus therefore tracks which shards have been accessed, to ensure that the same connection will be used for any subsequent access to the same set of co-located shards in the same transaction. When starting the execution of a statement, tasks are assigned to a connection if there was a prior access to the shards they access within the transaction, and otherwise they are assigned to the general pool for the worker node. When a connection is ready, the executor first takes an assigned task from its queue, and otherwise takes a task from the general pool.

By combining slow start, the shared connection limit, and the task assignment algorithm, the adaptive executor can handle a variety of workload patterns, even when they run concurrently on a single database cluster, and support complex interactive transaction blocks without sacrificing parallelism.
3.7 Distributed transactions

Distributed transactions in Citus comprise a transaction on the coordinator, initiated by the client, and one or more transactions on worker nodes, initiated by the executor. For transactions that only involve a single worker node, Citus delegates responsibility to the worker node. For transactions that involve multiple nodes, Citus uses two-phase commit (2PC) for atomicity and implements distributed deadlock detection.

3.7.1 Single-node transactions. The simplest type of transaction Citus supports is a single-statement transaction that goes to a single node (e.g., CRUD operations). In that case, there is no overhead to using a distributed table other than the extra round trip between coordinator and worker. When handling a multi-statement transaction in which all statements are routed to the same worker node (e.g., operations on a single tenant in a multi-tenant app), the coordinator simply sends commit/abort commands to that worker node from the commit/abort transaction callbacks. The worker node by definition provides the same transactional guarantees as a single PostgreSQL server.

3.7.2 Two-phase commit protocol. For transactions that write to multiple nodes, the executor opens transaction blocks on the worker nodes and performs a 2PC across them at commit time. PostgreSQL implements commands to prepare the state of a transaction in a way that preserves locks and survives restarts and recovery, which enables later committing or aborting the prepared transaction. Citus uses those commands to implement a full 2PC protocol.

When the transaction on the coordinator is about to commit, the pre-commit callback sends a PREPARE TRANSACTION over all connections to worker nodes with open transaction blocks. If successful, the coordinator writes a commit record for each prepared transaction to the Citus metadata, and then the local transaction commits, which ensures the commit records are durably stored. In the post-commit and abort callbacks, the prepared transactions are committed or aborted on a best-effort basis.

When one or more prepared transactions fail to commit or abort, the commit record in the Citus metadata is used to determine the outcome of the transaction. A background daemon periodically compares the list of pending prepared transactions on each worker to the local commit records. If a commit record is present (visible) for a prepared transaction, the coordinator committed, hence the prepared transaction must also commit. Conversely, if no record is present for a transaction that has ended, the prepared transaction must abort. When there are multiple coordinators, each coordinator performs 2PC recovery for the transactions it initiated. Since both commit records and prepared transactions are stored in the write-ahead log, which may be replicated (see Section 3.9), this approach is robust to failure of any of the nodes involved.
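To make the distinction concrete, consider the two co-located tables a1 and a2 from the benchmark in Section 4.1.1 (the key values below are arbitrary). A transaction whose statements all filter on the same distribution key is delegated to one worker node, while a transaction that touches keys stored on different nodes commits through 2PC:

  -- Routed to a single worker node: both updates hit the same pair of
  -- co-located shards, so the worker's local transaction suffices.
  BEGIN;
  UPDATE a1 SET v = v + 1 WHERE key = 17;
  UPDATE a2 SET v = v + 1 WHERE key = 17;
  COMMIT;

  -- If key 17 and key 42 hash to shards on different nodes, Citus opens
  -- transaction blocks on both workers and commits them via 2PC.
  BEGIN;
  UPDATE a1 SET v = v + 1 WHERE key = 17;
  UPDATE a2 SET v = v + 1 WHERE key = 42;
  COMMIT;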
3.7.3 Distributed deadlocks. A challenge in multi-node transactions is the potential for distributed deadlocks, in particular between multi-statement transactions. To overcome this challenge, deadlock prevention or deadlock detection methods can be used. Deadlock prevention techniques such as wound-wait require a percentage of transactions to restart. PostgreSQL has an interactive protocol, which means results might be returned to the client before a restart occurs, and the client is not expected to retry transactions. Hence, wound-wait is unsuitable for Citus. To maintain PostgreSQL compatibility, Citus therefore implements distributed deadlock detection, which aborts transactions when they get into an actual deadlock.

PostgreSQL already provides deadlock detection on a single node. Citus extends this logic with a background daemon running on the coordinator node. This daemon detects distributed deadlocks by polling all worker nodes for the edges in their lock graph (process a waits for process b) every 2 seconds, and then merging all processes in the graph that participate in the same distributed transaction. If the resulting graph contains a cycle, then a cancellation is sent to the process belonging to the youngest distributed transaction in the cycle to abort the transaction.

Unless there is an actual deadlock, only a small number of transactions will be waiting for a lock in typical distributed database workloads, hence the overhead of distributed deadlock detection is small. When distributed deadlocks happen frequently, users are recommended to change the statement order in their transactions.

3.7.4 Multi-node transaction trade-offs. Multi-node transactions in Citus provide atomicity, consistency, and durability guarantees, but do not provide distributed snapshot isolation guarantees. A concurrent multi-node query could obtain a local MVCC snapshot before commit on one node and after commit on another. Addressing this would require changes to PostgreSQL to make the snapshot manager extensible.

In practice, we did not find a strong need for distributed snapshot isolation in the four workload patterns, and customers have not expressed a need for it yet. Most transactions in multi-tenant and CRUD applications are scoped to a single node, meaning they get isolation guarantees on that node. Analytical applications do not have strong dependencies between transactions and are hence more tolerant to relaxed guarantees.

Distributed snapshot isolation can be important in certain hybrid scenarios. However, existing distributed snapshot isolation techniques have a significant performance cost due to the need for additional network round trips or waiting for the clock, which increases response times and lowers achievable throughput. In the context of the synchronous PostgreSQL protocol, throughput is ultimately capped by the number of connections divided by the response time. Since making a very large number of database connections is often impractical from the application perspective, low response time is the only way to achieve high throughput. Hence, we would likely make distributed snapshot isolation optional if we implement it in the future.

3.8 Specialized scaling logic

Apart from SELECT and DML commands, which are handled via the distributed query planner and executor, there are several other important PostgreSQL capabilities for which Citus implements specialized scaling logic.

DDL commands in PostgreSQL are transactional, online operations. Citus preserves this property by taking the same locks as PostgreSQL and propagating the DDL commands to shards via the executor in a parallel distributed transaction.

COPY commands append a CSV-formatted stream of data to a table. In PostgreSQL, this happens in a single thread, which also needs to update indexes and check constraints. In the case of Citus, the coordinator opens COPY commands for each of the shards and streams rows to the shards asynchronously, which means writes are partially parallelized across cores even with a single client.

INSERT...SELECT commands between distributed tables use one of three strategies. If the SELECT requires a merge step on the coordinator, the command is internally executed as a distributed SELECT and a COPY into the destination table. If there is no merge step but the source and destination tables are not colocated, the INSERT...SELECT performs distributed repartitioning of the SELECT result before inserting into the destination table. Otherwise, the INSERT...SELECT is performed directly on the colocated shards in parallel.

Stored procedures can be delegated to a worker node based on a distribution argument and a colocated distributed table, to avoid network round trips between coordinator and worker nodes. The worker node can then perform most operations locally without network round trips, but it can also perform a distributed transaction across worker nodes when necessary. Both mechanisms are sketched below.
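As a minimal sketch of these mechanisms, the listing below distributes a hypothetical multi-tenant schema, runs a colocated INSERT...SELECT, and delegates a procedure by its distribution argument. The table and procedure names are illustrative; create_distributed_table and create_distributed_function are the Citus UDFs referred to above, though argument names beyond those shown here may vary by version.

-- Hypothetical schema, distributed and colocated by tenant_id.
CREATE TABLE orders (
    tenant_id  bigint,
    order_id   bigserial,
    amount     numeric,
    created_at timestamptz DEFAULT now()
);
CREATE TABLE order_totals (tenant_id bigint, order_day date, total numeric);
SELECT create_distributed_table('orders', 'tenant_id');
SELECT create_distributed_table('order_totals', 'tenant_id', colocate_with => 'orders');

-- Colocated INSERT...SELECT: no merge step and matching distribution columns,
-- so it runs directly on the colocated shard pairs in parallel.
INSERT INTO order_totals (tenant_id, order_day, total)
SELECT tenant_id, created_at::date, sum(amount)
FROM orders
GROUP BY tenant_id, created_at::date;

-- Delegate a procedure to the worker node that stores the given tenant.
CREATE PROCEDURE add_order(p_tenant_id bigint, p_amount numeric)
LANGUAGE sql AS $$
  INSERT INTO orders (tenant_id, amount) VALUES (p_tenant_id, p_amount);
$$;
SELECT create_distributed_function('add_order(bigint,numeric)',
                                   distribution_arg_name => 'p_tenant_id',
                                   colocate_with => 'orders');

CALL add_order(42, 9.99);  -- handled entirely on the worker storing tenant 42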
3.9 High Availability and backups

High availability (HA) and backups for distributed database systems are complex topics that need to be looked at holistically from the perspective of the user and the platform that runs the database. As this paper is primarily focused on the Citus extension, we only give a brief overview of the typical HA and backup approach for Citus clusters and leave further details to future papers.

HA in Citus is handled primarily at the server level using existing PostgreSQL replication. In an HA setup, each node in the cluster has one or more hot standby nodes and replicates its write-ahead log (WAL) using synchronous, asynchronous, or quorum replication. When a node fails, the cluster orchestrator promotes a standby and updates the Citus metadata, DNS record, or virtual IP. The whole failover process takes 20-30 seconds, during which distributed transactions involving the node roll back. The orchestrator is typically part of the control plane in a managed service, but on-premises users can use the pg_auto_failover extension [9] to serve the same function.

Backups are also handled primarily at the server level, by creating periodic disk snapshots or copies of the database directory and continuously archiving the WAL to remote storage on each server. Citus supports periodically creating a consistent restore point, which is a record in the WAL of each node. The restore point is created while blocking writes to the commit records tables on the coordinators, which prevents in-flight 2PC commits while creating the restore point. Restoring all servers to the same restore point guarantees that all multi-node transactions are either fully committed or aborted in the restored cluster, or can be completed by the coordinator through 2PC recovery on startup.
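As a rough illustration, and assuming the citus_create_restore_point() UDF exposed by the extension (the backup tooling around it is environment-specific), a consistent restore point can be created and later used as a standard PostgreSQL recovery target. The restore point name and restore_command below are placeholders.

-- On the coordinator: blocks in-flight 2PC commit records and writes a
-- named restore point into the WAL of every node in the cluster.
SELECT citus_create_restore_point('backup_2020_11_01');

-- Each server is then restored from its base backup with the same
-- recovery target (regular PostgreSQL point-in-time recovery settings):
--   restore_command      = '<fetch WAL segment %f into %p>'   -- environment-specific
--   recovery_target_name = 'backup_2020_11_01'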
4 BENCHMARKS

This section presents benchmark results that compare the performance of Citus with one PostgreSQL server. With Citus, our goal has been to turn PostgreSQL into a distributed database. This goal included providing compatibility with PostgreSQL features and its ecosystem of libraries and tools. As such, when customers approached us, their primary performance baseline was single-node PostgreSQL. The benchmarks here reflect that performance baseline for the target workloads described in Section 2. Table 3 summarizes the relationship between workload patterns and benchmarks.

Table 3: Benchmarks used for different workload patterns.
  Multi-tenant:            HammerDB (TPC-C-based)
  Real-time analytics:     Custom microbenchmarks
  High-performance CRUD:   YCSB
  Data warehouse:          Queries from TPC-H

Each benchmark compares PostgreSQL (a single PostgreSQL server), Citus 0+1 (a single PostgreSQL server that uses Citus to shard data locally; the coordinator also acts as worker), Citus 4+1 (a Citus cluster with a coordinator and 4 worker nodes), and Citus 8+1 (a Citus cluster with a coordinator and 8 worker nodes). All servers were Microsoft Azure virtual machines with 16 vCPUs, 64 GB of memory, and network-attached disks with 7500 IOPS, running PostgreSQL 13 and Citus 9.5 with default settings. Benchmarks were run from a separate driver node. Each benchmark is structured such that a single server cannot keep all the data in memory but Citus 4+1 can, which demonstrates the often dramatic effect of scaling out memory along with CPU and I/O capacity. Citus 8+1 demonstrates the effect of scaling out only CPU and I/O capacity compared to Citus 4+1.

Figure 6: HammerDB TPC-C results with 250 vusers and 500 warehouses (100GB), in new order transactions per minute (NOPM) and response times in milliseconds.

Figure 7: Real-time analytics microbenchmarks using 100GB of GitHub Archive data: (a) single-session COPY (load time in seconds), (b) dashboard query using a GIN index (query time in seconds), (c) data transformation using INSERT...SELECT (transformation time in seconds).

Figure 8: Data warehousing benchmark using queries from TPC-H at scale factor 100 (135GB): queries per hour.

4.1 Multi-tenant benchmarks

To simulate a multi-tenant workload, we used the HammerDB [6] TPC-C-based workload, an OLTP benchmark that models an order processing system for warehouses, derived from TPC-C [1]. The benchmark effectively models a multi-tenant OLTP workload in which warehouses are the tenants. Most tables have a warehouse ID column and most transactions only affect a single warehouse ID, which allows the workload to scale. Around 7% of transactions span across multiple warehouses and are likely to be multi-node transactions in Citus.

We configured HammerDB 3.3 with 500 warehouses (100GB of data), 250 virtual users (connections), a 1 ms sleep time between transactions, and a 1 hour runtime, and ran it against each setup. In the case of Citus, we converted the items table to a reference table and the remaining tables to colocated distributed tables with the warehouse ID column as the distribution column, as sketched below. Additionally, we configured Citus to delegate stored procedure calls to worker nodes based on the warehouse ID argument.

The New Orders Per Minute (NOPM) results obtained from running HammerDB against each setup are shown in Figure 6. On single-server PostgreSQL and Citus 0+1, the data set does not fit in memory, which means that the amount of I/O is relatively high and bottlenecked on a single disk. Citus does not provide immediate performance benefits for OLTP workloads on the same hardware, hence Citus 0+1 is slightly slower than single-server PostgreSQL due to the small distributed query planning overhead. The main benefit of using Citus for OLTP workloads is that it can scale beyond a single server to ensure the working set fits in memory and sufficient I/O and CPU capacity is available.

Throughput on Citus 4+1 is around 13 times higher than throughput on a single PostgreSQL server with only 5 times more hardware, because the cluster can keep all data in memory. Hence, Citus 4+1 performs less I/O and becomes CPU-bottlenecked. From 4 to 8 worker nodes, Citus shows slightly sublinear scalability. This is expected for the TPC-C-based workload due to the 7% of transactions that span across nodes. The response time of these transactions is dominated by network round trips for individual statements sent between nodes, which remains the same as the cluster scales out.
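The conversion itself is a handful of UDF calls. A minimal sketch follows, using column names from the standard TPC-C schema; the exact HammerDB DDL and stored procedure signatures may differ, so this is illustrative rather than the benchmark's verbatim setup.

-- Small, rarely changing table: replicated to every node as a reference table.
SELECT create_reference_table('item');

-- Distribute the remaining tables by warehouse ID and colocate them,
-- so a single-warehouse transaction touches a single worker node.
SELECT create_distributed_table('warehouse', 'w_id');
SELECT create_distributed_table('district',  'd_w_id', colocate_with => 'warehouse');
SELECT create_distributed_table('customer',  'c_w_id', colocate_with => 'warehouse');
SELECT create_distributed_table('orders',    'o_w_id', colocate_with => 'warehouse');
-- (remaining tables such as stock, order_line, new_order follow the same pattern)

-- The TPC-C stored procedures are then delegated by their warehouse ID
-- argument via create_distributed_function, as described in Section 3.8.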
Figure 9: Distributed transactions benchmark comparing two updates using the same distribution key vs. different keys across two tables: transactions per second and response times.

4.1.1 Distributed transaction performance. Multi-tenant applications mostly have single-tenant transactions, but there may be cross-tenant transactions, such as the ones simulated by HammerDB. To get a more accurate measure of the overhead of 2PC, we created a synthetic benchmark using two tables of 50GB generated by the pgbench tool that comes with PostgreSQL. We then distributed and colocated these tables by key and defined a simple multi-statement transaction:

UPDATE a1 SET v = v + :d WHERE key = :key1;
UPDATE a2 SET v = v + :d WHERE key = :key2;

We ran the transactions using pgbench for 1 hour with 250 connections. In one set of runs, we used the same random value for both keys, such that these are two colocated updates. In another set of runs, we used a different random value, which results in a 2PC when the keys are on different nodes. The results are displayed in Figure 9. The figure shows that 2PC incurs a 20-30% performance penalty, but scales with the number of worker nodes.

4.2 Real-time analytics benchmark

There is no standard real-time analytics benchmark, so we ran several microbenchmarks for the individual commands involved in real-time analytics. We used publicly available data from the GitHub archive [4] in JSON format and loaded data for January 2020 into a table defined as follows:

CREATE TABLE github_events (
    event_id text default md5(random()::text) primary key,
    data     jsonb
);
SELECT create_distributed_table('github_events', 'event_id');  -- Citus only
CREATE INDEX text_search_idx ON github_events USING GIN (
    (jsonb_path_query_array(data, '$.payload.commits[*].message')::text) gin_trgm_ops
);

We used the pg_trgm extension included in PostgreSQL to index the commit messages within the JSON data. The index makes queries for a substring in a commit message much faster, at the cost of increased write overhead. We created the index both on PostgreSQL and on the Citus clusters.

Our first microbenchmark measures ingestion performance in the presence of large indexes. We appended the first day of February 2020 (4.4GB of JSON data) using the COPY command. The average load times over 5 runs are shown in Figure 7a. In this case, Citus 0+1 gives a speed-up over PostgreSQL due to the partial parallelism described in Section 3.8. The Citus cluster with 4 worker nodes can speed up the COPY further due to the greater number of cores and I/O capacity. After that, the single COPY command becomes bottlenecked on a single core on the coordinator, hence increasing to 8 worker nodes does not provide additional speed-up. To resolve this bottleneck, customers ingest data by running concurrent COPY commands.

Our second microbenchmark is a query that might be run by a dashboard: compute the number of commits that contain the phrase "postgres" per day.

SELECT (data->>'created_at')::date,
       sum(jsonb_array_length(data->'payload'->'commits'))
FROM github_events
WHERE jsonb_path_query_array(data, '$.payload.commits[*].message')::text ILIKE '%postgres%'
GROUP BY 1 ORDER BY 1 ASC;

The average runtime of the dashboard query over 5 runs, excluding the first to mitigate the variability caused by cache misses, is shown in Figure 7b. The query only reads from memory and is largely bottlenecked on CPU, hence the greater parallelism provided by Citus enables the query to run faster even on a single server.

Finally, real-time analytics often involves INSERT...SELECT queries to transform or pre-aggregate the data. For our third microbenchmark, we defined a data transformation step that extracts commits from the GitHub push events. Average runtime over five runs is shown in Figure 7c. The parallelization significantly speeds up the INSERT...SELECT, with a 96% reduction in runtime on Citus 8+1 compared to a single PostgreSQL server, showing the ability of Citus to scale complex transactional data transformations.
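As an illustration, a transformation of this kind could be written as follows, assuming a hypothetical commits destination table colocated with the github_events table above. Because the destination is colocated and there is no merge step, Citus can run the INSERT...SELECT directly on the colocated shards in parallel (Section 3.8).

-- Hypothetical destination table, colocated with github_events.
CREATE TABLE commits (
    event_id   text,
    created_at timestamptz,
    message    text
);
SELECT create_distributed_table('commits', 'event_id', colocate_with => 'github_events');

-- Extract one row per commit from the push event payloads.
INSERT INTO commits (event_id, created_at, message)
SELECT event_id,
       (data->>'created_at')::timestamptz,
       elem->>'message'
FROM github_events,
     jsonb_array_elements(data->'payload'->'commits') AS elem
WHERE data->>'type' = 'PushEvent';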
Figure 10: YCSB Workload A results on 100M rows (100GB): operations per second and response times for updates.

4.3 High-performance CRUD benchmark

The Yahoo Cloud Serving Benchmark (YCSB) [16] is designed to test high-performance CRUD workloads on NoSQL databases. YCSB also has a JDBC driver that supports PostgreSQL. We ran workload A (50% reads, 50% updates) from YCSB on a table of 100M rows (100GB), using 256 threads with a uniform request distribution. For this benchmark, the coordinator's CPU usage becomes a scaling bottleneck. Hence, we ran the benchmark with every worker node acting as a coordinator and configured YCSB to load balance across all nodes.

The results appear in Figure 10. The workload is largely I/O-bound, hence throughput scales linearly with the higher I/O capacity when adding worker nodes. Single-server Citus performs slightly worse than PostgreSQL due to the additional overhead of distributed query planning. On bigger clusters, the speed-up is roughly proportional to the amount of I/O capacity, with a small additional speed-up due to data fitting in memory.

4.4 Data warehousing benchmark

A standard benchmark for data warehouses is TPC-H [2]. Queries in TPC-H do not have selective filters and therefore scan most of the data. Answering a TPC-H query quickly requires fast scanning and processing, which Citus achieves mainly through distributed parallelism and keeping more data in memory. At the time of writing, 4 of the 22 queries in TPC-H are not yet supported.

We used HammerDB to generate a TPC-H schema with scale factor 100 (135GB), distributed and colocated the lineitem and orders tables by order key, and converted the smaller tables to reference tables to enable local joins, as sketched below. We then ran the 18 queries supported by Citus over a single session on each setup. Figure 8 shows the number of queries per hour, based on the completion time of the full set of queries over a single session.

Citus can achieve significant speed-ups compared to PostgreSQL by efficiently utilizing all available cores. The fact that TPC-H queries scan all the data and the tables do not fully fit in memory also means the single server is I/O-bottlenecked while the Citus cluster is only CPU-bottlenecked, which results in a two orders of magnitude speed-up on the 8 worker node cluster compared to a single PostgreSQL server.
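A minimal sketch of that layout follows, using the standard TPC-H table and column names. Which tables were kept as reference tables is not spelled out beyond "the smaller tables", so the list below is an illustrative choice rather than the exact benchmark configuration.

-- Large fact tables: distributed and colocated by order key,
-- so lineitem-orders joins stay local to each worker node.
SELECT create_distributed_table('orders',   'o_orderkey');
SELECT create_distributed_table('lineitem', 'l_orderkey', colocate_with => 'orders');

-- Smaller tables become reference tables, replicated to all nodes,
-- so joins against them do not require network repartitioning.
SELECT create_reference_table('customer');
SELECT create_reference_table('part');
SELECT create_reference_table('supplier');
SELECT create_reference_table('nation');
SELECT create_reference_table('region');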
5 CITUS CASE STUDY: VENICEDB

Citus is used in many large-scale production systems that rely on a broad array of PostgreSQL and Citus capabilities to get the most out of their hardware. A good example of this is the VeniceDB project at Microsoft.

Microsoft uses Citus to analyze Windows measure data, which is derived from the telemetry coming in from hundreds of millions of Windows devices. Metrics are displayed on a real-time analytics dashboard called Release Quality View (RQV), which helps Windows engineering teams assess the quality of the customer experience for each Windows release at the device grain. The RQV dashboard is a critical tool for Windows engineers, program managers, and executives, with hundreds of users per day.

The data store underlying RQV, code-named VeniceDB, is powered by two 1000-core Citus clusters running on Microsoft Azure, which store over a petabyte of data. While many different distributed databases and data processing systems were evaluated for VeniceDB, only Citus could address the specific combination of requirements associated with the petabyte-scale VeniceDB workload, including:

- Sub-second response times (p95) for 6M queries per day
- Ingest 10TB of new measure data per day
- Show new measure data in RQV within 20 minutes
- Nested subqueries with high-cardinality group by
- Advanced secondary indexes (e.g., partial indexes, GiST indexes) to efficiently find reports along various dimensions
- Advanced data types (e.g., arrays, HyperLogLog) to implement sophisticated analytical algorithms in SQL
- Row count reduction through incremental aggregation
- Atomic updates across nodes to cleanse bad data

In the Citus clusters, raw data is stored in the measures table, which is distributed by device ID and partitioned by time on disk using the built-in partitioning capability in PostgreSQL (a sketch of this layout follows at the end of this section). The COPY command is used to parallelize the ingestion of incoming JSON data into the distributed table. Distributed INSERT...SELECT commands are used to perform device-level pre-aggregation of incoming data into several reports tables with various indexes. The reports tables are also distributed on device ID and colocated with the measures table, such that Citus can fully parallelize the INSERT...SELECT.

Many of the queries from the RQV dashboard are of the form:

SELECT avg(device_avg)
FROM (
    SELECT device_id, avg(metric) AS device_avg
    FROM reports
    WHERE ...
    GROUP BY device_id, <time period, other dimensions>
) AS subq
GROUP BY <time period, other dimensions>;

These queries filter by several dimensions (e.g., measure, time range, Windows build) to find a substantial subset of the data. The nested subquery first aggregates reports by device ID, which is needed to weight overall averages by device rather than by the number of reports. There can be tens of millions of devices per query, which makes the GROUP BY device_id challenging to compute efficiently.

Since the subquery groups by the distribution column, the logical pushdown planner in Citus recognizes that it can push down the full subquery to all worker nodes to parallelize it. The worker nodes then use index-only scans to read the data in device ID order and minimize the disk I/O and memory footprint of the GROUP BY. Finally, Citus distributes the outer aggregation step by calculating partial aggregates on the worker nodes and merging the partial aggregates on the coordinator to produce the final result. At each step, VeniceDB uses a combination of advanced PostgreSQL and Citus capabilities to achieve maximum efficiency and scale on a single system.
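To make the storage layout concrete, here is a minimal sketch of a distributed, time-partitioned table in the spirit of the measures table. The column names and partition granularity are illustrative and not the actual VeniceDB schema; the key point is that Citus distribution and PostgreSQL declarative partitioning compose.

-- Illustrative schema: distributed by device ID, partitioned by time on disk.
CREATE TABLE measures (
    device_id    bigint NOT NULL,
    event_time   timestamptz NOT NULL,
    measure_name text,
    payload      jsonb
) PARTITION BY RANGE (event_time);

-- Time partitions use regular PostgreSQL declarative partitioning.
CREATE TABLE measures_2020_11 PARTITION OF measures
    FOR VALUES FROM ('2020-11-01') TO ('2020-12-01');

-- Distributing the parent table also distributes and colocates its partitions.
SELECT create_distributed_table('measures', 'device_id');

-- Colocated reports table for device-level pre-aggregation via INSERT...SELECT.
CREATE TABLE reports (
    device_id bigint NOT NULL,
    period    date,
    metric    double precision
);
SELECT create_distributed_table('reports', 'device_id', colocate_with => 'measures');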
6 RELATED WORK

Citus has architectural similarities with various other distributed database systems, but most systems focus only on a single workload pattern. In addition, Citus is unique in that it is a distributed RDBMS implemented as an extension of an existing RDBMS, which gives many benefits in terms of robustness, versatility, and compatibility with the open source ecosystem around PostgreSQL.

Vitess [12] is a sharding solution for MySQL. Like Citus, Vitess scales out an existing open source relational database. Unlike Citus, it is not an extension and therefore must be deployed separately from the database servers and requires additional application changes. Vitess is primarily optimized for multi-tenant and high-performance CRUD workloads and has built-in connection pooling for scaling the number of connections. It has limited support for queries and transactions across shards, which makes it less applicable in other workload patterns.

Greenplum [5] and Redshift [21] are PostgreSQL-based data warehouses and are hence optimized for handling complex SQL queries that analyze large amounts of data with low concurrency. As a result, both systems today provide better per-core performance than Citus for long analytical queries. Greenplum and Redshift also use columnar storage for fast scans and implement joins by shuffling data over the network. Citus supports those primitives as well, but the Citus implementation is not as well-optimized yet. On the other hand, Citus can handle a mixture of transactional and analytical workloads and can take advantage of the latest PostgreSQL features and extensions.

Aurora [27] can scale out the storage for a single PostgreSQL server for demanding OLTP workloads and fast recovery. Citus has a shared-nothing architecture, which means storage scale-out and fast recovery are achieved by dividing data across many database servers. The downside of a shared-nothing architecture is that the application needs to make additional data modelling decisions (choosing distributed tables), so it is not a drop-in replacement for applications built on a single-server RDBMS. The advantages of a shared-nothing architecture over shared storage are the ability to combine the compute power of all servers and to use advanced query parallelism. Also, Citus can be deployed in any environment.

Spanner [17], CockroachDB [25], and Yugabyte [13] have been developed with a focus on serializability for multi-node transactions. CockroachDB and Yugabyte support the PostgreSQL protocol as well, though significant functional limitations compared to PostgreSQL remain. A notable architectural difference between these systems and Citus is that they provide distributed snapshot isolation and use wound-wait rather than deadlock detection. In Sections 3.7.4 and 3.7.3, we discussed the downsides of these techniques in the context of PostgreSQL compatibility and why we did not use them for Citus. One of the benefits of distributed snapshot isolation is that it avoids data modelling constraints: Citus users need to use colocation and reference tables to scope transactions to a single node in order to get full ACID guarantees. On the other hand, these techniques also enable efficient joins and foreign keys, and we therefore find them to be essential for scaling complex relational database workloads.

TimescaleDB [11] is a PostgreSQL extension that optimizes PostgreSQL for time series data. It uses similar hooks as Citus to introduce the concept of a hypertable, which is automatically partitioned by time. Partitioning tables by time is useful for limiting index sizes, to maintain high write performance for time series workloads, and for partition pruning, which speeds up queries by time range. Citus and TimescaleDB are currently incompatible due to conflicting usages of PostgreSQL hooks, but Citus does work with pg_partman [8], which is a simpler time partitioning extension. Many real-time analytics applications that use Citus also use pg_partman on top of distributed tables, in which case the individual shards are locally partitioned to get both the benefits of distributed tables and time partitioning.
7 CONCLUSIONS AND FUTURE WORK

Citus is a distributed database engine for PostgreSQL that addresses the need for scalability in the PostgreSQL ecosystem. As an extension, Citus maintains long-term compatibility with the PostgreSQL project, including new features and tools. Rather than focusing on a particular workload, we designed Citus as a multi-purpose database system that can handle a broad variety of PostgreSQL workloads that need to scale beyond a single server. That way, users get the simplicity and flexibility of using a widely adopted open source relational database system at scale.

Much of our future work is around implementing support for any remaining PostgreSQL features that are not fully supported on distributed tables. These include non-colocated correlated subqueries, recursive CTEs, and logical replication between different table types. Increasingly, we are also seeing users with hybrid data models that keep small tables on a single server and distribute only large tables. Automated data model optimization for these scenarios is another important area of future work. Finally, Citus is increasingly being used in more specialized workload patterns, such as MobilityDB [14] and Kyrix-S [26]. There are many potential distributed query optimizations that can be implemented specifically for those workloads. We will explore making Citus itself extensible to iterate on those optimizations faster.

ACKNOWLEDGMENTS

Citus is the result of a collaborative effort that spans close to a decade. We would especially like to thank Andres Freund, Hadi Moshayedi, Jason Petersen, Metin Döşlü, Önder Kalacı, Murat Tuncer, Samay Sharma, and Utku Azman for their vast and numerous contributions over many years. We would also like to thank our reviewers for their invaluable comments in writing this paper, and Min Wei for his input on the VeniceDB section. Finally, we would like to thank the many users of Citus who provide the input and inspiration that continue to shape Citus into a practical distributed PostgreSQL solution.

REFERENCES
[1] 2010. TPC Benchmark C, Standard Specification Revision 5.11. Technical Report. Transaction Processing Performance Council (TPC).
[2] 2018. TPC Benchmark H, Standard Specification Revision 2.18.0. Technical Report. Transaction Processing Performance Council (TPC).
[3] 2020. Citus Open Source Repo. https://github.com/citusdata/citus
[4] 2020. GitHub Archive. https://www.gharchive.org
[5] 2020. Greenplum Database. https://greenplum.org
[6] 2020. HammerDB. https://www.hammerdb.com
[7] 2020. Hyperscale (Citus) documentation on Citus metadata. https://docs.microsoft.com/en-us/azure/postgresql/reference-hyperscale-metadata
[8] 2020. PG Partitioning Manager. https://github.com/pgpartman/pg_partman
[9] 2020. pg_auto_failover: Postgres extension and service for automated failover and high-availability. https://github.com/citusdata/pg_auto_failover
[10] 2020. PgBouncer: lightweight connection pooler for PostgreSQL. https://www.pgbouncer.org
[11] 2020. Timescale: PostgreSQL for time series. https://www.timescale.com
[12] 2020. Vitess: A database clustering system for horizontal scaling of MySQL. https://vitess.io
[13] 2020. YugabyteDB. https://www.yugabyte.com
[14] Mohamed Bakli, Mahmoud Sakr, and Esteban Zimanyi. 2019. Distributed moving object data management in MobilityDB. In Proceedings of the 8th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. 1-10.
[15] Stefano Ceri and Giuseppe Pelagatti. 1983. Correctness of query execution strategies in distributed databases. ACM Transactions on Database Systems (TODS) 8, 4 (1983), 577-607.
[16] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143-154.
[17] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1-22.
[18] P. Corti, T.J. Kraft, S.V. Mather, and B. Park. 2014. PostGIS Cookbook. Packt Publishing. https://books.google.nl/books?id=zCaxAgAAQBAJ
[19] DB-Engines. 2020. PostgreSQL System Properties. https://db-engines.com/en/system/PostgreSQL
[20] Andres Freund. 2020. Analyzing the Limits of Connection Scalability in Postgres. https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/analyzing-the-limits-of-connection-scalability-in-postgres/ba-p/1757266
[21] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1917-1923.
[22] Craig Kerstiens. 2018. Introducing Landlord: per tenant stats in Postgres with Citus. https://www.citusdata.com/blog/2018/07/31/introducing-landlord-per-tenant-stats
[23] Eamonn Maguire, Lukas Heinrich, and Graeme Watt. 2017. HEPData: a repository for high energy physics data. In J. Phys. Conf. Ser., Vol. 898. 52.
[24] Michael Stonebraker and Lawrence A. Rowe. 1986. The design of POSTGRES. ACM SIGMOD Record 15, 2 (1986), 340-355.
[25] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, et al. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1493-1509.
[26] Wenbo Tao, Xinli Hou, Adam Sah, Leilani Battle, Remco Chang, and Michael Stonebraker. 2020. Kyrix-S: Authoring Scalable Scatterplot Visualizations of Big Data. arXiv preprint arXiv:2007.15904 (2020).
[27] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041-1052.