Substrate leans heavily on three concepts in order to improve the reliability of the overall system by minimizing the blast radius of changes.
They’re usually referred to in alphabetical order - domain, environment, and quality - but are presented here in a progression more suitable to readers new to Substrate.
Environments identify a set of data and the infrastructure that stores and processes it. (After all, what distinguishes your production environment from development, staging, or another? Production data; your customers’ real, business-critical data.) An environment’s primary purpose is to protect its data against access from other environments.
Use multiple environments to protect your customers’ data from code that hasn’t been tested thoroughly in pre-production environments.
An AWS account in your organization is a member of exactly one environment and can only access the networks assigned to that environment.
Organizations typically define environments like development, staging, and production, though the names and number are entirely up to them. Add more environments to support more kinds of testing with greater parallelism.
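As a rough illustration of the membership rule described above, the following Python sketch models an account that belongs to exactly one environment and checks whether it may use a given network. The `Account` class, the network names, and `can_use_network` are hypothetical names invented for this example; the sketch only models the rule and is not how Substrate itself enforces it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """A hypothetical AWS account that is a member of exactly one environment."""
    name: str
    environment: str


# Hypothetical assignment of networks to environments (names are made up).
NETWORK_ENVIRONMENT = {
    "vpc-development": "development",
    "vpc-staging": "staging",
    "vpc-production": "production",
}


def can_use_network(account: Account, network: str) -> bool:
    """Return True only if the network is assigned to the account's environment."""
    return NETWORK_ENVIRONMENT.get(network) == account.environment
```

With this model, a production account is allowed onto `vpc-production` and refused everywhere else, including networks that are not assigned to any environment.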
Highly reliable services almost always implement changes gradually to give their operators a chance to detect and mitigate failures when the impact is small. Qualities help make gradual change possible for many AWS resources like load balancers and security groups, even within a single service.
Use multiple qualities to protect any one service from changes that affect that whole service immediately.
An AWS account in your organization is associated with exactly one quality but can access and use resources in any AWS account that shares its environment.
Suppose your organization defined the qualities alpha, beta, and gamma (which are what Substrate recommends). You could run 1% of your production environment in your alpha accounts, 9% in your beta accounts, and the remaining 90% in your gamma accounts. This isn’t as smooth as routing a slowly increasing percentage of traffic to your new software as it’s being deployed (and you should strongly consider doing that, too) but this strategy works even for AWS resources like load balancers and security groups.
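One way to picture the 1% / 9% / 90% split is to assign each unit of work to a quality by bucket. The Python sketch below assumes the split is made per key (for example a customer ID) using a stable hash; that keying scheme, and the names `WEIGHTS`, `quality_for_bucket`, and `quality_for_key`, are assumptions made for illustration and are not something Substrate prescribes.

```python
import hashlib

# Share of the production environment run in each quality, per the example above.
WEIGHTS = [("alpha", 1), ("beta", 9), ("gamma", 90)]


def quality_for_bucket(bucket: int) -> str:
    """Map a bucket in [0, 100) to a quality according to WEIGHTS."""
    if not 0 <= bucket < 100:
        raise ValueError("bucket must be in [0, 100)")
    upper = 0
    for quality, percent in WEIGHTS:
        upper += percent
        if bucket < upper:
            return quality
    raise AssertionError("WEIGHTS must sum to 100")


def quality_for_key(key: str) -> str:
    """Deterministically assign a key (hypothetically, a customer ID) to a quality."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return quality_for_bucket(bucket)
```

Because the assignment is a pure function of the key, the same key always lands in the same quality, so a bad change in alpha affects only the keys that fall in that one bucket.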
You could also decide to name your qualities blue and green and swing traffic back and forth between them. The slight disadvantage of this architecture is that the quality not receiving any traffic is not, at that moment, proving that its configuration is functional, so the first trickle of traffic it receives when you start to swing back carries slightly higher risk.
Domains are collections of one or more software services that form an isolated failure domain (pun very much intended). The software may be something you’ve written yourself, hosted in any serverless or serverful manner, or an AWS-managed service.
Use multiple domains to protect services in any one domain from changes in all other domains.
An AWS account in your organization is associated with exactly one domain but can access and use resources in any AWS account that shares its environment. There may be multiple AWS accounts within a domain, each of a different quality.
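Putting the three concepts together, each account can be thought of as a (domain, environment, quality) triple. The Python sketch below is a hypothetical model, not Substrate's implementation: `Account`, `can_access`, and `accounts_in_domain` are invented names, and it assumes that "each of a different quality" is meant per domain and environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """A hypothetical AWS account identified by domain, environment, and quality."""
    domain: str
    environment: str
    quality: str


def can_access(source: Account, target: Account) -> bool:
    """Accounts may use each other's resources only if they share an environment."""
    return source.environment == target.environment


def accounts_in_domain(accounts, domain, environment):
    """Index one domain's accounts in one environment by quality.

    Raises ValueError if two accounts claim the same quality (an assumption
    of this sketch: one account per domain, environment, and quality).
    """
    by_quality = {}
    for account in accounts:
        if account.domain != domain or account.environment != environment:
            continue
        if account.quality in by_quality:
            raise ValueError(
                f"duplicate quality {account.quality!r} in {domain}/{environment}"
            )
        by_quality[account.quality] = account
    return by_quality
```

Note that `can_access` ignores both domain and quality: in this model the environment is the only access boundary, while domains and qualities limit the blast radius of changes rather than access.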
It is intended that every AWS account that shares the same domain also shares the same Terraform codebase. That codebase progresses through environments and qualities as changes are deployed. For example, consider the domain example which exists in the following environments and qualities:
They all refer back to the same modules, parameterized according to their domain, environment, and quality plus the appropriate VPC and subnet IDs. The difference between them, then, is when changes are deployed.
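To make the idea of one codebase progressing through environments and qualities concrete, here is a minimal Python sketch. The environment and quality names come from the examples earlier in this document, but the ordering rule (environment first, then quality), the target combinations, and the function names `module_parameters` and `rollout_order` are assumptions made for illustration; VPC and subnet IDs are left out entirely.

```python
# Assumed progression order; your organization's names and order may differ.
ENVIRONMENTS = ["development", "staging", "production"]
QUALITIES = ["alpha", "beta", "gamma"]


def module_parameters(domain: str, environment: str, quality: str) -> dict:
    """Build the parameters that distinguish one deployment of the shared modules."""
    return {"domain": domain, "environment": environment, "quality": quality}


def rollout_order(targets):
    """Sort (environment, quality) pairs into the order changes would be deployed."""
    return sorted(
        targets,
        key=lambda t: (ENVIRONMENTS.index(t[0]), QUALITIES.index(t[1])),
    )
```

For instance, given the hypothetical targets `("production", "gamma")`, `("staging", "alpha")`, and `("production", "alpha")`, `rollout_order` places staging first and production gamma last, so each change reaches the largest share of production only after it has run everywhere else.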
Because domains are made out of AWS accounts, and because AWS accounts are notoriously awkward to delete, you should try to create domains that you believe will be (nearly) permanent. Thus, consider naming them after stable architectural or business units rather than, e.g., speculative skunkworks projects.