AWS Outpost vs. Oracle Region @ Customer: How do they differ?

Cloud vendors are rolling out products that place proprietary hardware in customers’ datacentres, delivering a selection of the vendor’s cloud services from within those datacentres. These new products require a rethinking of current guidelines regarding tradeoffs between data residency, latency, and failure tolerance. As a start towards this, I compare two specific products, Amazon Outpost in the 42U rack form factor and Oracle Dedicated Region Cloud @ Customer, according to these criteria. What do these comparisons tell us about the new possibilities such products create?

Disclaimer: This comparison is the most accurate that I could make, based upon careful reading of the products’ documentation as of early July 2021. I have not used either product and, as with all cloud products, both are under continuous improvement. Take these notes as general guidelines. Consult the vendors’ sales staff for definitive product descriptions.

The two sample systems

I will compare two high-end configurations, aimed at high-volume customers running their own, substantially sized datacentres. In each case, Amazon or Oracle staff would install their proprietary hardware in a designated area of the datacentre and be responsible for all maintenance of the hardware and software. The vendor racks would be locked and inaccessible to the customer. The customer would be responsible for the physical security and power of the overall datacentre, as well as for networking to the rest of the datacentre, to any internal customer network, and to the Internet.

The specific configurations I will compare are:

- AWS: an Outpost in the 42U rack form factor, comprising at least two racks, each associated with a distinct Availability Zone.
- Oracle: a Dedicated Region Cloud @ Customer, the smallest configuration of a full Oracle Region installed in the customer’s datacentre.

For details, see the linked vendor descriptions.

I will compare these systems according to a selection of the tradeoffs between location, latency, fault tolerance, and data residency. To begin, I want to contrast the different ranges of services provided by each system.

Comparison 1: Completeness of services

AWS: Amazon proposes Outpost as a solution to specific customer needs (see “Common Use Cases”). As such, they support only a small subset of AWS services on an Outpost, among them EC2, EBS, S3, EMR, and the Application Load Balancer (ALB).

Oracle: By contrast, Oracle supports all their services in a Dedicated Region, guaranteeing the same SLAs as for Oracle’s own datacentres. If you can do it on a Region in an Oracle datacentre, you can do it on a Dedicated Region in your own datacentre, with the same guarantees of availability, maintainability, and performance.

Control planes: A big distinction between these two products is the residency of their control planes. An AWS Outpost is located within an Availability Zone, in turn located in an AWS Region. Although an Outpost will perform computation locally for its EC2 instances and EMR Spark nodes, the control plane for those services resides in the Amazon datacentres for the Zones that make up the Region. We will see in the comparisons below that this has implications for availability, latency, and data residency.
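To make this concrete, here is a minimal sketch in Python with boto3 (the Region name, AMI ID, and subnet ID are placeholders) of launching an EC2 instance on an Outpost. The client is constructed against the parent Region’s endpoint: the request travels to the Region’s control plane even though the resulting instance runs on the local rack.

```python
import boto3

# All EC2 control-plane operations for an Outpost are served by the
# parent Region's endpoint, even though the instance itself runs on
# the rack in the customer's datacentre.
ec2 = boto3.client("ec2", region_name="us-east-1")  # parent Region (example)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI ID
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder: a subnet on the Outpost
)
print(response["Instances"][0]["InstanceId"])
```

If the link to the Region is down, this call fails even though running instances are unaffected, which is exactly the behaviour examined in Comparison 4 below.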

By contrast, both the control and data planes for an Oracle Dedicated Region run entirely on Oracle racks located within the customer’s datacentre. The only component external to that datacentre is the low-level monitoring of the hardware and software necessary for Oracle to ensure the Region’s performance remains within the committed Service Level Agreements. This data stream is much smaller than the control plane stream and contains virtually no customer data.

With this understanding of the very different service palettes supported by these products, let’s consider how they handle different challenges.

Comparison 2: Failure of a top-of-rack switch

The first challenge is the failure of the connection between a rack and the rest of the datacentre, partitioning the rack. This is often colloquially termed “failure of the top-of-rack (ToR) switch”, though in the case of an Outpost the failure involves more than a single switch.

AWS: Each Outpost 42U rack has two ToR switches, so this scenario only arises after both have failed. Our sample Outpost configuration has at least two racks, located in distinct Availability Zones. If the customer has organized their services to locate replicas in both Zones, those services should continue to run, albeit at reduced capacity, since a substantial fraction of their hardware is inaccessible in the partitioned rack.

If the customer has not configured their services across Zones, those services may be entirely unavailable for as long as even a single Outpost rack is partitioned.

However, spreading service replicas across Outpost racks in different Zones may be difficult to accomplish with standard AWS services. The AWS Application Load Balancer (ALB) service can only point to a single Outpost subnet (see Point 9 of Step 1 from “Configure a load balancer and listener”), suggesting that the ALB service cannot balance incoming requests across Outposts in different Zones. The effective result may be that the only practical organization for most Outpost installations is with all racks in a single Zone, with the reduced fault tolerance that implies.
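As a sketch of this constraint (Python with boto3; the name and subnet ID are placeholders), note that the load balancer is created with exactly one Outpost subnet, where a Regional ALB would normally be given subnets in two or more Availability Zones:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # parent Region (example)

# Per the documentation cited above, an ALB on an Outpost points to a
# single Outpost subnet, so it cannot spread requests across racks in
# different Availability Zones.
response = elbv2.create_load_balancer(
    Name="outpost-alb",                    # placeholder name
    Subnets=["subnet-0123456789abcdef0"],  # the single Outpost subnet
    Type="application",
    Scheme="internal",
)
print(response["LoadBalancers"][0]["LoadBalancerArn"])
```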

More generally, this small but crucial limitation suggests that AWS services on Outpost may be less complete than a first reading suggests. There is a possibility that the loss of a rack will make some of the AWS services on the Outpost unavailable.

Oracle: The documentation does not indicate the degree of redundancy of Oracle racks, so the failure of a single top-of-rack switch could potentially partition a rack. However, a Dedicated Region is guaranteed, as with all Oracle Regions, to expose three distinct Fault Domains to the customer, allowing the customer to spread service replicas across Domains and ensure service availability despite the failure of any single Domain.

All Oracle services should also continue to run if any single rack is partitioned. Oracle’s commitment to offering all the services of a Region provides customers a much higher confidence that those services will continue to run in the event of the failure of a single rack. Furthermore, any customer services built atop the Oracle services and spread across the Dedicated Region’s Fault Domains will also continue running despite a single-rack failure.
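A minimal sketch of this pattern with the OCI Python SDK (all OCIDs, the availability domain name, and the shape are placeholders): launch one replica in each of the Region’s three Fault Domains.

```python
import oci

config = oci.config.from_file()  # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

# A Dedicated Region exposes the same three Fault Domains as any Oracle
# Region; pinning one replica to each keeps the service available when
# any single domain (for example, one rack) fails.
for i, fd in enumerate(["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]):
    details = oci.core.models.LaunchInstanceDetails(
        compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
        availability_domain="AD-1",                       # placeholder AD name
        fault_domain=fd,
        shape="VM.Standard2.4",                           # placeholder shape
        display_name=f"replica-{i}",
        image_id="ocid1.image.oc1..example",              # placeholder OCID
        create_vnic_details=oci.core.models.CreateVnicDetails(
            subnet_id="ocid1.subnet.oc1..example"         # placeholder OCID
        ),
    )
    compute.launch_instance(details)
```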

Comparison 3: Latency

The second challenge is how latency might be improved by each product. Both vendors emphasize the reduced latency possible from situating their hardware directly in a customer’s datacentre. What are the limits to these gains?

AWS: Latency for all services running entirely on the Outpost will be extremely low, due to the proximity of the resources to their clients, whether other services in the customer’s own datacentre or end users for whom that datacentre is closer than any AWS AZ.

However, the latency for services that run on the Outpost but keep redundant copies in an AWS AZ will be slowed by the round-trip time to the AZ. The latency impact will be determined by such factors as the physical distance to the AZ, the bandwidth and congestion of the service link, and whether the service replicates synchronously or asynchronously, among other factors.

All AWS services that run in an AWS Availability Zone rather than on the Outpost will have the latency of accessing that Zone. If a customer application has one of these services as a dependency, that service’s latency may limit the application’s latency.
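As a rough way to see the gap, a sketch like the following (hypothetical hostnames; a TCP connect time is only a crude proxy for round-trip latency) compares a service on the local racks with a dependency in the Region:

```python
import socket
import time

def tcp_connect_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median TCP connect time in milliseconds, a crude proxy for the
    network round trip to an endpoint."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[len(times) // 2]

# Hypothetical hostnames: a service on the local racks vs. the parent
# Region's public endpoint.
print("local  :", tcp_connect_ms("service.outpost.example.internal"))
print("region :", tcp_connect_ms("ec2.us-east-1.amazonaws.com"))
```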

Oracle: The same principles apply to latency, replacing “AWS AZ” with “Oracle Region”. However, the much larger service suite supported by a Dedicated Region considerably reduces the chances that a customer application will be limited by the latency of an Oracle datacentre.

Moving cloud services into a customer’s datacentre is a selling point of both products, and both deliver that benefit in most cases, for customer applications that use only the services running on local hardware.

Comparison 4: Loss of external connectivity

A more extreme, albeit less frequent, failure is the loss of the link from the customer’s datacentre to any external networks, including the customer’s private network and the Internet.

AWS: Most services will become unavailable (see questions “Can I use Outposts when it is not connected to the AWS Region or in a disconnected environment?” and “What happens when my facility’s network connection goes down?”). EC2 instances may continue running and their metrics and logs will be retained for a few hours, though these will be lost if the partition continues for too long. The control plane will not work, making it impossible, for example, to start and stop instances. DNS requests within the Outpost will fail.
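One practical consequence is that applications driving the control plane programmatically should treat loss of the Region link as an expected, retryable condition. A minimal sketch, assuming boto3 and a hypothetical instance ID:

```python
import boto3
import botocore.exceptions

ec2 = boto3.client("ec2", region_name="us-east-1")  # parent Region (example)

def try_stop_instance(instance_id: str) -> bool:
    """Attempt a control-plane operation. During a partition this call
    cannot reach the Region, even though the instance itself keeps
    running on the Outpost, so failure here is not a workload error."""
    try:
        ec2.stop_instances(InstanceIds=[instance_id])
        return True
    except botocore.exceptions.EndpointConnectionError:
        # Facility link down: queue the operation and retry later.
        return False
```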

Oracle: All services will continue running (see Question 21 of the FAQ). Services will not be able to access the Internet but will be able to continue any processing that requires only local data, and continue to meet the Oracle SLAs. So long as the customer has local access, they can perform all control plane operations, such as starting or stopping services and performing DNS lookups. Metrics and logs will be cached locally for partitions lasting up to a few hours. Longer partitions may result in unmet SLAs.

This kind of failure is rare, but when it occurs its impact can be severe. The tolerance of Oracle’s Dedicated Regions to these failures is a point in Oracle’s favour.

Comparison 5: Data residency

A major selling point of these products is compliance with requirements for data residency (where copies of the data are stored) and the related requirements of data sovereignty (which laws govern the data) and data locality (where the data is processed). By keeping data in its own datacentre, a customer can meet these requirements whenever they are not met by a cloud vendor’s datacentres. Although residency is straightforward when referring to original data, the determination is more subtle when we consider metadata from the control plane. Must the records of a service starting and stopping reside locally? What about metrics and logs for the service? What about the low-level telemetry used by operations staff to monitor the performance of the underlying datacentre hardware? Different compliance requirements may make different determinations for these forms of metadata.

AWS: If the data is only processed by services running on the Outpost and stored on EBS or S3 on the Outpost, the data itself will be contained entirely within the customer’s datacentre. However, control plane metadata, including the start and stop of services, batch job names, DNS lookups, and service metrics, will be exported to the owning AWS Region. This metadata and the low-level telemetry will have the residency of the AWS Region:

Some limited meta-data (e.g. instance IDs, monitoring metrics, metering records, tags, bucket names, etc.) will flow back to the AWS Region.
Outpost FAQ, “Can Outposts be used to meet data residency requirements?”
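One practical consequence for compliance: because names and tags flow back to the Region, a simple guard is to check them for regulated data before creating resources. A minimal, illustrative sketch (the patterns are examples only, not a complete detector of personal data):

```python
import re

# Example patterns only; a real deployment would use the organization's
# own data-classification rules.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), # email address
]

def residency_safe(value: str) -> bool:
    """Return False if a resource name or tag value appears to carry
    personal data that must not leave the customer's datacentre."""
    return not any(p.search(value) for p in PII_PATTERNS)

assert residency_safe("analytics-batch-42")
assert not residency_safe("report-for-jane.doe@example.com")
```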

Oracle: As the Dedicated Region’s services run entirely in the customer’s datacentre, including the control plane, the original data together with all service-related metadata, such as metrics, service starts and stops, and DNS lookups, resides in the customer’s datacentre. Lowest-level telemetry data, however, is non-local:

Data that helps Oracle achieve its SLAs and provide continuous security and functionality updates will flow in and out as required, but without impacting data residency and sovereignty requirements.
Dedicated Region FAQ, Question 17

Oracle’s data residency proposition is simple to state: the only data that leaves the customer’s datacentre is low-level telemetry that is unconnected to any of the customer’s data. This can be a powerful case for compliance in heavily-regulated domains. The Amazon data residency proposition requires careful parsing. For many applications, all that matters is residency of the data plane, and the Outpost architecture will suffice. Making this case to the auditors, however, will require more work than for an Oracle Dedicated Region.

Conclusion

The distinctions between these configurations of AWS Outpost and Oracle Dedicated Region are subtle. In many ways they are addressing different sizes of customer installations and may not have much overlap. To make a fair comparison for this post, I had to choose one of the largest possible Outpost configurations and the smallest possible Dedicated Region. Amazon supports much smaller Outposts than the one I describe here, especially with the recently introduced 1U and 2U form factors, which are single servers mounted in a customer’s rack. For these smaller installations, limited services and an offsite control plane are sensible architectural choices. The customer simply needs to understand their implications.

Oracle Dedicated Regions are unabashedly aimed at much larger configurations. For customers that require this level of service, more straightforward and robust guarantees of availability and data residency are worth the higher cost. The minimum spend for acquiring this availability is substantially above that for an Outpost.

The overall conclusion from this sequence of comparisons is that customers can use AWS Outpost to achieve genuine improvements in latency and data residency, without seriously compromising availability, at low cost. The crucial caveat is that these outcomes require applications to be carefully architected around the limitations of Outpost.

On the other hand, Oracle Dedicated Regions offer a far simpler approach for improving latency and data residency, with higher availability and a straightforward compliance argument. Applications can be designed around the full suite of Oracle services. The tradeoff for this simplicity is that a Dedicated Region requires a much larger financial commitment.

More generally, the introduction of another class of products, where vendors run their services on customer hardware (Google Anthos, Azure Stack Hub, and Alibaba Apsara Stack), offers further combinations of availability, latency, and data residency. The comparisons made in this post will have to be extended to handle these use cases as well.