On Durable Objects

Some notes and observations on Cloudflare’s globally distributed and strongly consistent coordination API.

Kevin Wang


Revisions:

May 12, 2024 - Expanded on Transaction idea. Fixed typos.
May 15, 2024 - Update "Real world example" callout with more context.
July 10, 2024 - Added How to identify location section

So, I’ve been working on our edge workload at Clerk for the past 3-4 months, and that has largely meant shifting some of our data and critical paths from our origin server over to Cloudflare. For this, we’re leveraging Workers, KV and Durable Objects pretty extensively. We also considered D1 and Hyperdrive, but those have not come into play quite yet. Durable Objects, or DO’s, are a unique beast of their own, which I will reflect on in this article.

It has been quite a while since I last tried to write something, but the timing for this feels right. I find myself repeatedly getting tripped up and forgetting how Durable Objects behave, despite reading about a particular detail. The docs are also a bit tricky to navigate at times, so this writing process aims to ingrain information into my own head, and to consolidate disparate sources into an artifact that I can revisit down the road.

Overview

This is a purely subjective brain dump, so take everything here with a grain of salt. However, the reflections and observations stem from addressing very real problems of data integrity, scalability and resiliency that we’re experiencing at Clerk.

Working with Workers and KV forces you to shift your mental model away from the traditional, long-running single-origin server. I’d say this is a reasonably low hurdle, as it’s a full flip from one side of an architectural spectrum to the other; you just do it.

The “How Workers works” article is a great write-up on this, which I’m embarrassingly seeing for the first time as I am writing this post.

Traditional

  • Single origin
  • Long running
  • Strong consistency is the norm

Newer

  • Distributed
  • Ephemeral
  • Expect eventual consistency

Durable Objects land somewhere in the middle, which makes the complexity jump. This is because their interface is both distributed and centralized. You interface with DO’s via a global API layer that gives you ID’s. ID’s are how you create “stubs”, which are clients that connect to a specific, location-constrained DO.

DO’s + Stubs

  • Distributed API... but stubs are geographically pinned 1
  • DO’s offer both persistent storage and ephemeral state

Again, it feels like they land right in the middle of the centralized-to-distributed programming spectrum. And that is kind of the beauty of them.

Real world example

Here is a foot-gun scenario that we experienced at Clerk. If you have an origin server, say in Ohio, USA, that creates all of your end users’ DO’s, then every DO will effectively be pinned somewhere near Ohio. This happens because the origin calls a Worker near it, and that Worker is the one that calls get(ID) on the Durable Object namespace for the very first time for each ID.

This becomes a problem when a user in Seoul, S. Korea hits your edge server and tries to access a DO by a previously created ID (via idFromName() or idFromString()). Their request makes a very long trip across the world, since it ultimately has to hit a DO that physically resides in Ohio. As a result, that user experiences high latency.

To solve for this, I shared the idea of “priming objects at edge”, since >95% of our traffic that hits DO’s would take place at the edge, with the remaining <5% coming from our origin server.

“Intents”, “warming”, and “preemptive-provisioning” were some other terms thrown around. The a-ha moment was when I remembered a pattern that my coworker, Bryce, implemented at HashiCorp called “cache warming”. It was a completely different use case, but the fundamental idea of doing something optimistically was reused.
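
Here is a minimal sketch of what priming could look like, under a few loud assumptions. The interfaces are local stand-ins shaped like the Workers bindings so the snippet runs outside the runtime, prime() is a hypothetical method on the DO, and the placement model (an object is pinned to the colo of whichever Worker reaches it first) is a toy version of the behavior described above, not Cloudflare's actual scheduler. The colo codes are only illustrative.

```typescript
// Stand-in types: a minimal subset shaped like the Workers bindings, declared
// locally so this sketch runs outside the Workers runtime.
interface IdLike {
  toString(): string;
}
interface UserStub {
  prime(): Promise<string>; // hypothetical method exposed by the DO
}
interface NamespaceLike {
  idFromName(name: string): IdLike;
  get(id: IdLike): UserStub;
}

// Called by the edge Worker the first time it sees a user, so that the first
// request to reach the object comes from the edge rather than from the origin.
async function primeAtEdge(ns: NamespaceLike, userId: string): Promise<string> {
  const stub = ns.get(ns.idFromName(userId));
  return stub.prime(); // resolves to the colo the object ended up in
}

// Toy placement model (an assumption): an object is pinned to the colo of the
// Worker that contacts it first, and it never moves afterwards.
const placements = new Map<string, string>();

function namespaceSeenFrom(workerColo: string): NamespaceLike {
  return {
    idFromName: (name: string): IdLike => ({ toString: () => `id:${name}` }),
    get: (id: IdLike): UserStub => ({
      prime: async (): Promise<string> => {
        const key = id.toString();
        if (!placements.has(key)) placements.set(key, workerColo);
        return placements.get(key)!;
      },
    }),
  };
}

async function demo(): Promise<string[]> {
  const seoulEdge = namespaceSeenFrom("ICN");
  const ohioOrigin = namespaceSeenFrom("CMH");
  const primed = await primeAtEdge(seoulEdge, "user_123"); // the edge gets there first
  const later = await primeAtEdge(ohioOrigin, "user_123"); // the origin reuses the same object
  return [primed, later];
}
```

In this toy model both calls report the Seoul colo, because the edge Worker touched the object before the origin did.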

I currently think of DO’s as single computers, distributed across the world. But unlike the typical leader and follower configuration that makes up a cluster, each DO is its own strong, independent computer.

When DO’s are created, they are pinned to a geographic location. This effectively eliminates network partition concerns (the P in CAP theorem), so that you can focus on guaranteeing consistency and availability.

Usage

So far, the writing has been high-level. I’ll zoom in to more code-level things below.

get(id: DurableObjectId)

Calling NAMESPACE.get(id) creates a Durable Object. 2 The implication here is that the DO gets created near where the original caller is located. Workers appear to be the only things capable of creating stubs and connecting to DO’s.

get() is fairly unintuitive for the following reasons:

  • get suggests READ which should be side-effect-free, and not WRITE
  • get is confusing because you get a stub which is then used to communicate with the actual DO
  • get is predicated on there first being an id available
  • id is a result from special DO namespace methods: newUniqueId(), idFromName(), or idFromString() 3

... createStub may have been a clearer name 🤷‍♂️.

When is the constructor called?

I could not find an explicit mention of this in the docs or in the Cloudflare Discord, so I ran some tests to observe the behavior.

If you have a worker that accesses the same DO multiple times, it appears that the constructor is only called once.

If you don’t call fetch on the stub, the DO class’s constructor is never called.
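
To make those two observations concrete, here is a toy model of them. It is not the Workers runtime: ToyNamespace, Counter and increment() are all hypothetical names, and the lazy construction is simply my reading of the behavior I observed, written so it can run anywhere.

```typescript
// Counts constructor runs so the demo can report them.
let constructorCalls = 0;

// Stands in for a Durable Object class; increment() is a hypothetical method
// that plays the role of a fetch() or RPC call on the stub.
class Counter {
  private count = 0;
  constructor() {
    constructorCalls++;
  }
  async increment(): Promise<number> {
    this.count += 1;
    return this.count;
  }
}

interface CounterStub {
  increment(): Promise<number>;
}

// Toy namespace (an assumption, not the real runtime): get() only hands back
// a stub. The instance is built on the first call made through any stub for
// that name, and is then reused for as long as it stays in memory.
class ToyNamespace {
  private instances = new Map<string, Counter>();

  get(name: string): CounterStub {
    return {
      increment: async (): Promise<number> => {
        let instance = this.instances.get(name);
        if (instance === undefined) {
          instance = new Counter();
          this.instances.set(name, instance);
        }
        return instance.increment();
      },
    };
  }
}

async function demo(): Promise<{ beforeAnyCall: number; afterThreeCalls: number; lastCount: number }> {
  const ns = new ToyNamespace();

  ns.get("room-1"); // a stub is created but never used
  const beforeAnyCall = constructorCalls;

  let lastCount = 0;
  for (let i = 0; i < 3; i++) {
    lastCount = await ns.get("room-1").increment(); // fresh stub each time, same object
  }
  return { beforeAnyCall, afterThreeCalls: constructorCalls, lastCount };
}
```

In this model the constructor count stays at zero until the first call goes through a stub, and stays at one after three calls.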

This feels like a pretty important implementation detail that should be better documented, considering that how you write your constructor (i.e. if you use state.blockConcurrencyWhile()) will have non-trivial performance and/or consistency implications on the worker that calls your DO, and your overall system.

Interacting with stubs

I would say that actual DO creation is a side effect of stub instantiation. For a given ID, the very first call to NAMESPACE.get(ID) would create the DO, whereas every subsequent call would reuse the existing DO, regardless of the caller's geographic location.

Stubs have an older fetch() API and a newer, much more ergonomic RPC interface. 4

Eviction from memory

It’s unclear to me when DO’s (and probably worker processes too for that matter) are created in memory and evicted from memory. I could also just be missing a fundamental piece of understanding around the worker environment as a whole.

How to identify location

There’s a neat project out there called https://where.durableobjects.live/. It was through this repository that I learned about Cloudflare’s /cdn-cgi/trace endpoint, which is a way to get metadata about a request, including the requestor’s colo and latitude and longitude coordinates.

Docs: https://developers.cloudflare.com/fundamentals/reference/cdn-cgi-endpoint/

Endpoint: https://www.cloudflare.com/cdn-cgi/trace

If you call the endpoint from within a DO, you can see various data about that DO.
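
The trace endpoint responds with plain text, one key=value pair per line, so reading the colo out of it takes very little code. A small sketch: parseTrace is a helper name I made up, and the sample body is illustrative rather than captured from a real request.

```typescript
// Parses the plain-text body of /cdn-cgi/trace into a key/value record.
function parseTrace(body: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of body.split("\n")) {
    const eq = line.indexOf("=");
    if (eq === -1) continue; // skip blank or malformed lines
    // Split on the first "=" only, since values may contain "=" themselves.
    fields[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return fields;
}

// Illustrative sample, not captured from a real request.
const sampleBody = "ip=203.0.113.7\ncolo=ICN\nloc=KR\nuag=curl/8.0 a=b\n";
const trace = parseTrace(sampleBody);

// Inside a Durable Object method, usage might look like:
//   const res = await fetch("https://www.cloudflare.com/cdn-cgi/trace");
//   const { colo } = parseTrace(await res.text());
```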

CAP Problems

I’m currently exploring this idea of using a group of 5 or more distributed DO’s as an alternative to KV.

Why?

We have consistency requirements at Clerk, and Cloudflare KV’s lack of any strong consistency guarantee is a no-go for us, so this idea comes up. There’ll be extra work, not a lot though, required to maintain a static mapping of the 5 locations and their coordinates, and logic 5 required to map an incoming request to the closest DO.

Additionally, any prior data replication will have to go to 5 DO’s, instead of one.
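
The mapping logic can be sketched with the Haversine formula. The five sites and their coordinates below are placeholders I picked for illustration, not the locations we would actually use, and the function names are mine.

```typescript
interface Site {
  name: string;
  lat: number; // degrees
  lon: number; // degrees
}

const EARTH_RADIUS_KM = 6371;
const toRad = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance between two points, via the Haversine formula.
function haversineKm(a: { lat: number; lon: number }, b: { lat: number; lon: number }): number {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Picks the site with the smallest distance to the request's coordinates.
function closestSite(from: { lat: number; lon: number }, sites: Site[]): Site {
  if (sites.length === 0) throw new Error("no sites configured");
  return sites.reduce((best, site) =>
    haversineKm(from, site) < haversineKm(from, best) ? site : best
  );
}

// Hypothetical static mapping of 5 locations (approximate coordinates).
const SITES: Site[] = [
  { name: "columbus", lat: 39.96, lon: -83.0 },
  { name: "frankfurt", lat: 50.11, lon: 8.68 },
  { name: "singapore", lat: 1.35, lon: 103.82 },
  { name: "sao-paulo", lat: -23.55, lon: -46.63 },
  { name: "sydney", lat: -33.87, lon: 151.21 },
];

// A request arriving from Seoul would be routed to the Singapore site.
const seoul = { lat: 37.57, lon: 126.98 };
const chosen = closestSite(seoul, SITES);
```

The chosen site's name could then feed something like idFromName(), so that each of the 5 DO's has a stable, well-known ID.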

Use cases

Some loose ideas...

Caching

In-memory state (this.state.thing) can serve as a potential caching mechanism. You read something from persistent storage (this.state.storage.get("thing")), once, and cache it in in-memory state so that subsequent calls can skip the direct reads to persistent storage.

The trade-off here is reduced latency in exchange for a consistency model that shifts from strong to eventual.
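
A rough sketch of the read-through idea, assuming a storage interface that only has get and put. StorageLike is a local stand-in rather than the real storage binding, and ThingObject and countingStorage are hypothetical names; the toy storage counts reads so the effect is visible.

```typescript
// Stand-in for the persistent storage binding (local type, not the real one).
interface StorageLike {
  get(key: string): Promise<string | undefined>;
  put(key: string, value: string): Promise<void>;
}

class ThingObject {
  private storage: StorageLike;
  private cachedThing: string | undefined; // in-memory only; gone after eviction

  constructor(storage: StorageLike) {
    this.storage = storage;
  }

  async getThing(): Promise<string | undefined> {
    if (this.cachedThing === undefined) {
      // Cache miss: go to persistent storage once and remember the result.
      this.cachedThing = await this.storage.get("thing");
    }
    return this.cachedThing;
  }

  async setThing(value: string): Promise<void> {
    await this.storage.put("thing", value);
    this.cachedThing = value; // keep the cache in step with storage
  }
}

// Toy storage that counts how many reads reach it.
function countingStorage(initial: string): { storage: StorageLike; reads: () => number } {
  const data = new Map<string, string>([["thing", initial]]);
  let reads = 0;
  const storage: StorageLike = {
    get: async (key: string) => {
      reads++;
      return data.get(key);
    },
    put: async (key: string, value: string) => {
      data.set(key, value);
    },
  };
  return { storage, reads: () => reads };
}

async function demo(): Promise<{ first?: string; second?: string; storageReads: number }> {
  const { storage, reads } = countingStorage("v1");
  const obj = new ThingObject(storage);
  const first = await obj.getThing(); // reads storage
  const second = await obj.getThing(); // served from memory
  return { first, second, storageReads: reads() };
}
```

Anything that changes the stored value without going through setThing() would leave the cached copy stale, which is where the weaker consistency comes from.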

Strong reads after writes

Given that DO’s implement E-order semantics 6, you can, in theory, use a single Durable Object to implement strongly consistent reads after writes in a scenario that involves two unique workers. The key pieces here are that the workers have to bind the same Durable Object namespace, and the actor has to use/pass the same Durable Object ID around, to ensure that traffic is routed to and handled by the same DO.

sequenceDiagram
  actor User
  participant WorkerA
  participant DO_NAMESPACE
  participant DO
  participant WorkerB
  User->>WorkerA: write
  activate User
  WorkerA->>DO_NAMESPACE: `newUniqueId()`
  DO_NAMESPACE->>WorkerA: <ID>
  WorkerA->>DO: write
  activate DO
  DO->>WorkerA: OK
  WorkerA->>User: OK + Use this <string(ID)> for strong reads
  deactivate User
  User->>WorkerB: Read w/ <string(ID)>
  activate User
  WorkerB->>DO_NAMESPACE: `idFromString(<string(ID)>)`
  DO_NAMESPACE->>WorkerB: previous <ID>
  WorkerB->>DO: read
  DO->>WorkerB: OK
  WorkerB->>User: OK, result that WorkerA wrote
  deactivate User
  deactivate DO
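
The same flow can be sketched in code. Everything here is a stand-in: the interfaces are local types shaped like the namespace methods in the diagram, write() and read() are hypothetical methods on the DO, and toyNamespace is an in-memory fake so the snippet can run anywhere. Real IDs are long hex strings and idFromString() validates them, which the fake does not attempt.

```typescript
interface IdLike {
  toString(): string;
}
interface ValueStub {
  write(value: string): Promise<void>; // hypothetical DO method
  read(): Promise<string | undefined>; // hypothetical DO method
}
interface NamespaceLike {
  newUniqueId(): IdLike;
  idFromString(id: string): IdLike;
  get(id: IdLike): ValueStub;
}

// WorkerA: creates the object, writes to it, and hands the stringified ID back.
async function workerA(ns: NamespaceLike, value: string): Promise<string> {
  const id = ns.newUniqueId();
  await ns.get(id).write(value);
  return id.toString();
}

// WorkerB: rebuilds the same ID from the string and reads from the same object.
async function workerB(ns: NamespaceLike, idString: string): Promise<string | undefined> {
  return ns.get(ns.idFromString(idString)).read();
}

// In-memory fake of a namespace that both workers bind (an assumption for
// illustration; each ID maps to exactly one stored value).
function toyNamespace(): NamespaceLike {
  const objects = new Map<string, string>();
  let next = 0;
  const makeId = (s: string): IdLike => ({ toString: () => s });
  return {
    newUniqueId: () => makeId(`toy-${next++}`),
    idFromString: (s: string) => makeId(s),
    get: (id: IdLike): ValueStub => ({
      write: async (value: string) => {
        objects.set(id.toString(), value);
      },
      read: async () => objects.get(id.toString()),
    }),
  };
}

async function demo(): Promise<string | undefined> {
  const ns = toyNamespace();
  const idString = await workerA(ns, "hello"); // user writes via WorkerA
  return workerB(ns, idString); // user reads via WorkerB with the same ID
}
```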

Transaction

I would like to explore a scenario where a worker transactionally writes to Durable Object persistent storage, and another system.

I can picture DO’s transaction API being used to transactionally write to a DO’s persistent storage and a centralized datastore, like Postgres. This would be a way to ensure that the DO’s state and the centralized datastore are always in sync, which opens up the door to possibilities like alleviating central datastore load by routing all read traffic to the DO. (Though this also opens up the door to new problems like split-brain.)
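
Here is one very rough shape this could take. It assumes that a throw inside the transaction closure rolls the storage writes back, which is my understanding of the storage transaction API but worth verifying against the docs. StorageLike and TxnLike are local stand-ins, writeExternal is a hypothetical function representing the Postgres write, and toyStorage is an in-memory fake that only mimics commit-or-abort.

```typescript
interface TxnLike {
  put(key: string, value: string): Promise<void>;
}
interface StorageLike {
  get(key: string): Promise<string | undefined>;
  transaction<T>(closure: (txn: TxnLike) => Promise<T>): Promise<T>;
}
type ExternalWrite = (key: string, value: string) => Promise<void>;

// Stage the DO write, then attempt the external write inside the same closure.
// If the external write throws, the closure throws and the DO write is dropped.
async function writeBoth(
  storage: StorageLike,
  writeExternal: ExternalWrite,
  key: string,
  value: string
): Promise<void> {
  await storage.transaction(async (txn) => {
    await txn.put(key, value);
    await writeExternal(key, value);
  });
}

// In-memory fake: writes are staged and only committed if the closure resolves.
function toyStorage(): StorageLike {
  const committed = new Map<string, string>();
  return {
    async get(key: string): Promise<string | undefined> {
      return committed.get(key);
    },
    async transaction<T>(closure: (txn: TxnLike) => Promise<T>): Promise<T> {
      const staged = new Map<string, string>();
      const result = await closure({
        put: async (k: string, v: string) => {
          staged.set(k, v);
        },
      });
      staged.forEach((v, k) => committed.set(k, v)); // reached only on success
      return result;
    },
  };
}

async function demo(): Promise<{ failed: boolean; plan: string | undefined }> {
  const storage = toyStorage();
  await writeBoth(storage, async () => {}, "plan", "pro"); // both writes succeed

  let failed = false;
  try {
    await writeBoth(
      storage,
      async () => {
        throw new Error("postgres unavailable");
      },
      "plan",
      "free"
    );
  } catch {
    failed = true; // external write failed, so the DO write was not committed
  }
  return { failed, plan: await storage.get("plan") };
}
```

This is not a two-phase commit. If the external write succeeds and the DO commit then fails, the two stores diverge, so this covers only one of the two failure directions.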

Benchmarking

This is a very nascent exploration into Vitest benchmarking. All I’ll share for now is a standalone file that can be run with vitest bench, after installing dependencies.

I don’t have any conclusions yet as I’m still getting a handle on exactly how to benchmark Workers and DO’s, whether the Miniflare environment is reflective of the real-world workerd, and whether the results are even accurate.

Here’s what the expected output looks like.


That is all for now folks.

Footnotes

  1. Cloudflare states: “Dynamic relocation of existing Durable Objects is planned for the future.” (Source)

  2. Source

  3. Durable object IDs themselves are objects that must be generated from select methods, each having global-level implications. Their interface looks like:

  4. RPC-ing was made available on the 2024-04-03 compatibility date. See blog post.

  5. See Haversine formula for calculating distances between two points on a sphere.

  6. Durable Objects implement E-order semantics. When you make multiple calls to the same Durable Object, it is guaranteed that the calls will be delivered to the remote Durable Object in the order in which you made them. E-order semantics makes many distributed programming problems easier. (Source)