Polyglot Persistence

Polyglot Persistence

A diagram for Services with Polyglot Persistence, in abstractness-subdomain-sharding coordinates.

Unbind your data. Use multiple specialized data stores.

Known as: Polyglot Persistence.

Structure: A layer of data services used by higher-level components.

Type: Extension component, derived from Shared Repository.

BenefitsDrawbacks
Performance is fine-tuned for various data types and use casesThe peculiarities of each data store need to be learned
Less load on each data storeMuсh more work for the DevOps team
The data stores may satisfy conflicting forcesMore points of failure in the system
Consistency is hard or slow to achieve

References: The original and closely related CQRS articles from Martin Fowler, chapter 7 of [MP], chapter 11 of [DDIA] and much information dispersed all over the Web.

You can choose a dedicated technology for each kind of data or pattern of data access in your system. That improves performance (as each data store engine is optimized for a few use cases), distributes load between the data stores, and may solve conflicts between forces (like when you need both low latency and large storage). However, you may need to hire several experts to get the best use of and to support the multiple data stores, especially if those are full-featured databases. Moreover, having your data spread over multiple data stores makes it the application’s responsibility to keep the data in sync (by implementing some kind of distributed transactions or making sure that the clients don’t get stale data).

Performance#

Polyglot Persistence aims at improving performance through the following means:

Beware that the read-write separation introduces a replication lag [MP] which is a headache when both data consistency and responsiveness are important for the system’s clients.

Dependencies#

In general, each service depends on all of the data stores which it uses. There may also be an additional dependency between the data stores if they share a dataset (one or more data stores are derived).

The business logic depends on every database. A derived database depends on its data source.

Applicability#

Polyglot Persistence helps:

  • High load and low latency projects. Specialized Databases shine when given fitting tasks. Caching and Read-Only Replicas take the load off the main database. External Search Indices may occasionally save the day as well.
  • Event sourcing and event collaboration. A Memory Image maintains the current state of an event-sourced component. A CQRS View aggregates domain events to provide its host service with whatever data from other subdomains it may need to use.
  • Conflicting forces. An instance of a stateless service inherits many of the qualities of the data store which it accesses for any given request it is processing. When there are several data stores, the qualities (e.g. latency) of a service instance may vary from request to request, depending on which data store is involved.

Polyglot Persistence may harm:

  • Small projects. Properly setting up and maintaining multiple databases is not that easy.
  • High availability. Each data store which your system uses will tend to fail in its own crazy way.
  • User experience. For systems with read-write database separation the replication lag between the databases will make you choose between reading changes from the leader (write database), adding synchronization code to your application to wait for the read database to be updated, and risking returning outdated results to the users.

Relations#

Polyglot Persistence for Monolith, Layers, Shards, and Services.

Polyglot Persistence:

Examples with independent storage#

Many cases of Polyglot Persistence use multiple data stores just because there is no single technology that matches all the application’s needs. The data stores used are filled with different subsets of the system’s data:

Private and Shared Databases#

A subset of data shared between shards and between services.

If several services or shards become coupled through a subset of the system’s data, that subset can be put into a separate database which is accessible to all the participants. All the other data remains private to the shards or services.

Specialized Databases#

A service which uses both SQL and NoSQL databases.

Databases vary in their optimal use cases. You can employ several different databases to achieve the best performance for each kind of data that you persist.

Data File, Content Delivery Network (CDN)#

A backend reads page templates while a frontend reads content from a content delivery network.

Some data is happy to stay in files. Web frameworks load web page templates from OS files and store images and videos in a Content Delivery Network (CDN) which replicates the data all over the world so that each user downloads the content from the nearest server (which is faster and cheaper).

Examples with derived storage#

In other cases there is a single writable data store (called system of record [DDIA]) which is the main source of truth from which the other data stores are derived. The primary reason to use several data stores is to relieve the main database of read requests and maybe support some additional qualities: special kinds of queries, aggregation for materialized and CQRS views, full text search for text indices, huge dataset size for historical data, or high performance for an in-memory cache.

The updates to the derived data stores may come from:

  • the main database as Change Data Capture (CDC) [DDIA] (a log of changes),
  • the application after it changes the main data store (see caching strategies below),
  • another service as an event stream [DDIA, MP],
  • a dedicated indexer that periodically crawls the main data store or web site.
A derived database is fed data from the main database, from an indexer which scans the main database, from application events, or from events originating with another service.

Read-Only Replicas#

An instance of a backend writes to a leader database which streams updates to database replicas. Other backend instances read from the replicas.

Multiple instances of the database are deployed and one of them is the leader [DDIA] instance which processes all writes to the system’s data. The changes are then replicated to the other instances (via Change Data Capture (CDC)) which are used for read requests. Distributing workload over multiple instances increases maximum read throughput which the system is capable of, as the database is usually the system’s bottleneck. Having several running replicas greatly improves reliability and allows for nearly instant recovery of database failures as any replica may quickly be promoted to the leader role to serve write traffic.

Database Cache, Cache-Aside#

A cache hit only reads from the cache; a cache miss reads from the cache, reads from the database, and writes to the cache; a write writes to the database and to the cache.

Database queries are resource-heavy while databases scale only to a limited extent. That means that a highly loaded system benefits from bypassing its main database in as many queries as possible, which is usually achieved by storing recent queries and their results in an in-memory data store (Cache-Aside). Each incoming query is first looked for in the fast cache, and if it is found then you are lucky to get the result immediately without having to consult the main database.

Keeping the cache consistent with the main database is the hard part. There are quite a few strategies (some of them treat the cache as a Proxy for the database): write-through, write-behind, write-around and refresh-ahead.

Memory Image, Materialized View#

At startup a service reads from an event store and writes to a memory image. At runtime it reads from the memory image and updates both the memory image and event store.

Event sourcing (of Event-Driven Architecture or Microservices) is all about changes. A service persists only changes to its data instead of its current data. As a result, the service needs to aggregate its history into a Memory Image (Materialized View [DDIA]) by loading a snapshot and replaying any further events to rebuild its current state (which other architectural styles store in databases) to start operating.

Reporting Database, CQRS View Database, Event-Sourced View, Source-Aligned (Native) Data Product Quantum (DPQ) of Data Mesh#

A service reads from a CQRS view which aggregates updates streamed by another service. An analyst queries a reporting database which aggregates a stream of events from the main database.

It is common wisdom that a database is good for either OLTP (transactions) or OLAP (queries). Here we have two databases: one optimized for commands (write traffic protected with transactions) and another one for complex analytical queries. The databases differ at least in their schemas (the OLAP schema is optimized for queries) and often vary in type (e.g. SQL vs NoSQL).

A Reporting Database (or Source-Aligned (Native) Data Product Quantum of Data Mesh [SAHP]) derives its data from a write-enabled database in the same subsystem (service) while a CQRS View [MP] or Event-Sourced View [DEDS] is fed a stream of events from another service from which it filters the data relevant to its owner. This way a CQRS View lets its host service query (its replica of) the data that originally belonged to other services.

Query Service, Front Controller, Data Warehouse, Data Lake, Aggregate Data Product Quantum (DPQ) of Data Mesh#

Several services both stream updates and query data from a shared query service.

A Query Service [MP] (or Aggregate Data Product Quantum of Data Mesh [SAHP]) subscribes to events from several full-featured services and aggregates them into its data store, making it a CQRS View of several services or even the whole system. If any other service or a data analyst needs to process data which belongs to multiple services, it retrieves it from the Query Service which has already joined the data streams and represents the join in a convenient way.

A Front Controller [SAHP but not PEAA] is a Query Service embedded in the first (user-facing) service of a Pipeline. It collects status updates from the downstream components of the Pipeline to track the state of every request being processed by the Pipeline.

Data Warehouse [SAHP] and Data Lake [SAHP] are analytical data stores that connect directly to and import all the data from the operational (main) databases of all the system’s services. A Data Warehouse transforms the imported data into its own unified schema while a Data Lake stores the imported data in its original format(s).

External Search Index#

An external index receives events from the main datastore. A service queries the external index to find ids of specific records which it later reads from the main database.

Some domains require a kind of search which is not naturally supported by ordinary database engines. Full text search, especially NLP-enabled, is one such case. Geospatial data may be another. If you are comfortable with your main data store(s), you can set up an External Search Index by deploying a product dedicated to the special kind of search that you need and feeding it updates from your main data store.

Alternatively, you may just need a way to quickly search through text documents or videos stored in a file system or in a cloud, which requires some kind of index.

Historical Data, Data Archiving#

A service works with an operational database. An archiver reads from the operational database and writes to an archive. An analyst reads from both the archive and operational database.

It is common to store the history of sales in a database. However, once a month or two has passed, it is very unlikely that the historical records will ever be edited. And though they are queried on very rare occasions, like audits, they still slow down your database. Some businesses offload any data older than a couple of months to a cheaper archive storage which does not allow for changing the data and has limited query capabilities. That helps keep the main datasets small and fast.

Evolutions#

Polyglot Persistence with derived storage can often be made subject to CQRS:

The backend layer that uses OLAP and OLTP databases is subdivided into command and query backends, resulting in full-featured Command-Query Responsibility Segregation.

Summary#

Polyglot Persistence employs several specialized data stores to improve performance, often at the cost of eventual data consistency or implementing transactions in the application.