The 3691 Deep Dive: Architecting Encryption for Confidential Computing Workloads

If you are reaching for confidential computing to encrypt data in use, you already know the promise: hardware-enforced isolation that keeps memory invisible to the host OS, the hypervisor, and even the cloud provider. But the architecture that delivers that promise is full of sharp edges. This guide is for engineers who have read the whitepapers and now need to decide between Intel TDX, AMD SEV-SNP, or a GPU TEE for a specific workload—and who want to avoid the mistakes that teams often make in production.

We will walk through the core mechanisms, compare the dominant approaches, highlight patterns that hold up under pressure, and—just as important—flag when you should not use confidential computing at all. By the end, you will have a decision framework and a checklist for your next architecture review.

Where Confidential Computing Hits Real Work

The canonical use case is multi-tenant machine learning inference. A cloud provider runs a model on behalf of several clients; each client wants assurance that neither the provider nor other tenants can see their input data or the model output. In this setting, the CPU must guarantee that memory pages belonging to one tenant are never readable by another, even if the hypervisor is compromised. Another common scenario is joint data analysis across organizations—think two banks running a fraud detection model on combined transaction histories without either exposing raw records. Here, the encryption architecture must also protect the model weights themselves, which may be proprietary.

What makes these workloads different from traditional encryption-at-rest or TLS is that the data is being actively processed. The CPU must decrypt the data on the fly, operate on it in plaintext inside the enclave, and re-encrypt before writing to memory. Any mistake in the memory encryption engine or in the attestation protocol can leak data. In practice, teams discover that the hardest part is not the encryption itself but the key management and attestation chain.

A typical project might involve a financial services firm that wants to run a risk model on a public cloud. They choose AMD SEV-SNP because it supports larger memory footprints than Intel SGX. But they soon find that the attestation report from the AMD Secure Processor must be verified against a known good measurement, and that measurement changes every time the firmware or kernel is updated. Without a robust measurement management process, the workload fails attestation after a routine patch and never receives its keys.

Regulatory pressure is part of the picture too. GDPR and HIPAA do not explicitly require confidential computing, but auditors increasingly ask how data in use is protected. For organizations that handle protected health information or payment card data, confidential computing can reduce the scope of PCI DSS compliance by limiting the environments that process plaintext. However, the auditor will want to see the attestation verification logs and the key hierarchy documentation, not just a checkbox that says "TEE enabled."

Common Workload Profiles

Not every workload benefits equally. Encryption overhead is typically 5–15% for compute-bound tasks but can exceed 30% for memory-bound ones. We have seen teams abandon confidential computing for high-frequency trading applications because the latency jitter from memory encryption made their models miss order book windows. The sweet spot is batch or streaming workloads where throughput matters more than single-request latency.

Foundations Readers Often Confuse

The most persistent confusion is between memory encryption and enclave isolation. Transparent memory encryption, such as Intel Total Memory Encryption (TME) or AMD SME, encrypts DRAM content with a single CPU-internal key. That defeats cold-boot attacks and passive probing of the memory bus between the CPU and RAM, but it does nothing against an attacker who controls the hypervisor, because privileged software still reads plaintext through the CPU. Confidential computing, by contrast, uses an on-die memory encryption engine that encrypts each cache line with a key tied to the enclave or guest identity, and pairs it with access checks so that other software cannot read those pages. The key never leaves the CPU package. This is a fundamentally different threat model: the attacker is assumed to control the host software stack, and possibly to have physical access to the machine as well.

Another common mix-up is between attestation and authentication. Attestation proves that the enclave is running the expected code on a genuine TEE-capable processor. It does not, by itself, authenticate the user or the workload. Many teams build an attestation service that verifies the quote from the CPU, but then they forget to bind that attestation to a session key. A man-in-the-middle can replay the attestation quote and trick the client into thinking it is talking to the real enclave. The fix is to have the enclave generate a key pair, embed the public key (in practice, a hash of it plus a verifier-supplied nonce) in the user-data field of the attestation report, and use that key to establish the TLS session. This pattern is called "attestation-based TLS," often shortened to attested TLS or RA-TLS.
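The binding step can be sketched in a few lines. This is a minimal illustration, not a verifier: it assumes the quote's signature chain has already been validated elsewhere, that the report carries a 64-byte user-data field (so a SHA-512 hash of the key is embedded rather than the key itself), and the function names are hypothetical.

```python
import hashlib
import hmac
import os


def make_report_data(tls_public_key: bytes, nonce: bytes) -> bytes:
    """Value the enclave asks the hardware to embed in its report (64 bytes)."""
    # The nonce has a fixed length (32 bytes here), so the concatenation is unambiguous.
    return hashlib.sha512(nonce + tls_public_key).digest()


def quote_is_bound(report_data: bytes, peer_public_key: bytes, nonce: bytes) -> bool:
    """Client-side check: does the quote commit to the key this TLS peer presented?"""
    expected = make_report_data(peer_public_key, nonce)
    return hmac.compare_digest(report_data, expected)


# Placeholder byte strings stand in for real encoded public keys.
nonce = os.urandom(32)
enclave_key = b"enclave-tls-public-key"
attacker_key = b"attacker-tls-public-key"
report_data = make_report_data(enclave_key, nonce)

print(quote_is_bound(report_data, enclave_key, nonce))   # genuine enclave
print(quote_is_bound(report_data, attacker_key, nonce))  # replayed quote, different TLS key
```

A replayed quote fails the check because the attacker cannot present the private key that matches the hash inside the report, and the fresh nonce prevents reuse of an old quote.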

There is also confusion around memory limits. Intel SGX originally had a hard limit of 128 MB for enclave memory (EPC). Later generations extended it, but the performance penalty for paging EPC to main memory is severe. AMD SEV-SNP can encrypt the entire guest memory, but the guest must still be aware that its memory is encrypted—some device drivers and DMA operations break if they expect to access guest memory directly. Teams often assume that switching on SEV-SNP is transparent, only to find that their database engine cannot use RDMA or that their GPU driver fails because the GPU cannot decrypt guest memory. These integration issues are the real cost of adopting confidential computing.

Trust Model Hierarchy

It helps to think of three trust layers: the hardware root of trust (the CPU and its fused keys), the firmware (BIOS, microcode), and the software (OS, hypervisor, enclave code). The attestation report covers all three, but the hardware root is the only one that is truly immutable. If the firmware is compromised, the attestation report can be faked. This is why the latest TEEs include firmware versioning in the attestation and why cloud providers must keep their firmware updated to maintain the trust chain.

Patterns That Usually Work

After observing dozens of production deployments, a few patterns consistently deliver reliable security without excessive performance pain. The first is to use a dedicated key management service (KMS) that is external to the enclave. The enclave attests to the KMS, which then releases the data encryption keys only if the attestation matches the expected measurement. This pattern, sometimes called "key release with attestation," ensures that even if the enclave binary is replaced by an attacker, the keys are never released to the wrong code. One open-source implementation is the Key Broker Service (KBS) from the Confidential Containers project.
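A minimal sketch of the release decision follows. It assumes quote parsing and certificate-chain validation happen before this code runs; `VerifiedReport` and `KeyBroker` are hypothetical names for illustration and do not correspond to any real KBS API.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerifiedReport:
    """Claims extracted from a quote AFTER its signature chain was validated."""
    measurement: bytes
    debug_enabled: bool


class KeyBroker:
    """Releases a data key only to code with the expected measurement."""

    def __init__(self, expected_measurement: bytes, data_key: bytes) -> None:
        self._expected = expected_measurement
        self._data_key = data_key

    def release_key(self, report: VerifiedReport) -> Optional[bytes]:
        # A debuggable TEE can be inspected by the host, so never release to it.
        if report.debug_enabled:
            return None
        # Constant-time comparison of the reported and expected measurement.
        if not hmac.compare_digest(report.measurement, self._expected):
            return None
        return self._data_key


good = bytes.fromhex("aa" * 48)      # placeholder measurement values
tampered = bytes.fromhex("bb" * 48)
broker = KeyBroker(expected_measurement=good, data_key=b"k" * 32)

released = broker.release_key(VerifiedReport(good, debug_enabled=False))
refused = broker.release_key(VerifiedReport(tampered, debug_enabled=False))
```

The point of the design is that the decision lives outside the enclave: replacing the binary changes the measurement, and the broker simply declines.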

The second pattern is to minimize the enclave surface area. Instead of putting the entire application inside the TEE, teams isolate only the sensitive processing (for example, the decryption and scoring of a single model input) while leaving the rest of the application outside. This reduces the attack surface and the performance overhead. It also simplifies debugging, because the enclave code is small enough to audit and, in some cases, formally verify. A common mistake is to try to run an entire database engine inside an enclave, which leads to huge memory requirements and frequent paging. A better approach is to use a trusted execution environment for the query processing logic while storing the data encrypted in untrusted memory, decrypting rows only inside the enclave.

The third pattern is to cache attestation results. Attestation verification is expensive—it involves parsing the CPU quote, checking the certificate chain, and often contacting the platform's attestation service (like Intel's Attestation Service or AMD's Key Distribution Service). For a high-throughput workload, verifying every request is impractical. Instead, the enclave can attest once at startup and then use a session token that is valid for a few hours. The token is signed by the KMS and includes a timestamp and the enclave measurement. This pattern is widely used in production but requires careful handling of token expiry and revocation.
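The token flow might look like the sketch below. It is an assumption-laden illustration: an HMAC stands in for the KMS signature (a real deployment would more likely use an asymmetric signature so that verifiers never hold the signing key), and the token layout and function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional, Set


def issue_token(signing_key: bytes, measurement_hex: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Issued by the KMS once, after a full attestation check has passed."""
    issued = int(time.time() if now is None else now)
    claims = {"measurement": measurement_hex, "iat": issued, "exp": issued + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    tag = hmac.new(signing_key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(tag).decode())


def verify_token(signing_key: bytes, token: str, revoked: Set[str],
                 now: Optional[float] = None) -> Optional[dict]:
    """Cheap per-request check; returns the claims, or None if the token is rejected."""
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(signing_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if current >= claims["exp"]:             # expired
        return None
    if claims["measurement"] in revoked:     # measurement withdrawn since issuance
        return None
    return claims


key = b"kms-signing-key-placeholder"
token = issue_token(key, "ab" * 48, ttl_seconds=3600, now=1_000)
```

Checking the revocation set on every request is what keeps a multi-hour token from outliving a measurement that has since been found vulnerable.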

Composite Scenario: Fintech Model Serving

A fintech company needed to serve a credit scoring model to multiple partner banks. Each bank sent encrypted customer data; the model ran inside an Intel TDX trust domain, that is, a confidential VM on a cloud host. The team initially used SGX but hit the enclave memory limit because the model was 2 GB. They switched to TDX, which encrypts the memory of the whole VM. The key release was handled by a KMS running in a separate trusted VM. Every hour, the guest re-attested to the KMS to refresh its session key. The main challenge was that the model had a GPU dependency for inference, but the platform available to them did not offer a GPU TEE that could be attached to the TDX guest. They had to preprocess the input data inside the CPU TEE and then send it to a GPU that was not protected. To mitigate, they used homomorphic encryption for the GPU stage, accepting a 10× slowdown for that step. The trade-off was acceptable because the GPU processing was a small fraction of the total pipeline.

Anti-patterns and Why Teams Revert

The most common anti-pattern is treating confidential computing as a drop-in replacement for a regular VM. Teams enable SEV-SNP on a Kubernetes node and expect all pods to run transparently. They quickly discover that the container runtime must be TEE-aware, that the pods cannot use host networking, and that privileged containers break the isolation model. The result is a painful migration that often ends with the team reverting to regular VMs and using network encryption instead.

Another anti-pattern is ignoring side channels. While TEEs protect against direct memory access, they are still vulnerable to cache timing attacks, power analysis, and speculative execution leaks. The Spectre and Foreshadow vulnerabilities affected SGX, and newer attacks against SEV-SNP have been demonstrated in academic papers. Teams that assume the TEE is a silver bullet often skip additional hardening, such as constant-time code and noise injection. We have seen a team revert from SEV-SNP to plain VMs after a penetration test showed that an attacker on the same physical host could infer the model weights through cache timing.

A third anti-pattern is poor key lifecycle management. If the enclave measurement changes because of a code update or a library recompilation, the KMS will refuse to release the keys. Without a mechanism to migrate keys to the new measurement, the workload becomes unavailable. Some teams try to bypass this by using a fixed measurement that never changes, but that prevents security updates. The correct approach is to use a key hierarchy where the data encryption keys are wrapped by a key that is derived from the enclave measurement. When the measurement changes, the old keys can be unwrapped by the previous enclave and re-wrapped under a key derived from the new measurement, but this requires the old enclave to still be running. A simpler method is to use a KMS that supports measurement policies with multiple allowed measurements and a grace period after which the old measurement is retired.
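The multi-measurement policy with a grace period can be sketched as a small data structure. The class and method names below are hypothetical, and the policy is deliberately simple: one current measurement with no deadline, and older ones that expire a fixed time after being superseded.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MeasurementPolicy:
    # measurement (hex) -> retirement deadline in epoch seconds; None means "current"
    allowed: Dict[str, Optional[float]] = field(default_factory=dict)

    def roll_forward(self, new_measurement: str, now: float, grace_seconds: float) -> None:
        """Make new_measurement current and start the clock on whatever was current."""
        for measurement, deadline in list(self.allowed.items()):
            if deadline is None:
                self.allowed[measurement] = now + grace_seconds
        self.allowed[new_measurement] = None

    def permits(self, measurement: str, now: float) -> bool:
        """True if keys may be released to this measurement at time `now`."""
        if measurement not in self.allowed:
            return False
        deadline = self.allowed[measurement]
        return deadline is None or now < deadline


policy = MeasurementPolicy()
policy.roll_forward("v1", now=0, grace_seconds=3600)
policy.roll_forward("v2", now=100, grace_seconds=3600)  # v1 now retires at t=3700
```

During the grace window both builds can fetch keys, which allows a rolling upgrade; after it, the old build is locked out even if an attacker still has a copy of it.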

When Teams Abandon Confidential Computing

We have observed that teams often revert when the performance overhead exceeds 20% for their primary workload, or when the complexity of attestation and key management outweighs the security benefit. In regulated industries, some teams find that auditors are not familiar with TEEs and demand additional controls that negate the simplicity they were hoping for. The decision to revert is not a failure—it is a rational assessment that the current TEE technology does not fit the workload's cost or operational profile.

Maintenance, Drift, and Long-Term Costs

The most underestimated cost is firmware and microcode updates. Cloud providers regularly update the BIOS and CPU microcode to fix security vulnerabilities. Each update changes the platform's trust chain, which means the attestation reports will include a new firmware version. If your workload verifies the firmware version as part of the attestation, you must update your policies accordingly. This can cause outages if the update is rolled out without coordination. Some teams pin their workload to a specific firmware version, but that leaves them exposed to known vulnerabilities. The recommended practice is to subscribe to the cloud provider's attestation service updates and automate the policy change using a CI/CD pipeline that re-attests the enclave after a firmware update.
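One way to avoid both the outage and the pinning problem is to express the firmware policy as a floor instead of an exact match. The sketch below assumes the attestation report exposes firmware versions as named integer components; the component names and numbers are hypothetical placeholders, since real reports use vendor-specific fields.

```python
from typing import Dict

# Hypothetical component names and values, for illustration only.
MINIMUM_TCB: Dict[str, int] = {"bootloader": 3, "tee": 0, "snp": 8, "microcode": 115}


def matches_pinned(reported: Dict[str, int], pinned: Dict[str, int]) -> bool:
    """Exact-match policy: breaks on every provider firmware update."""
    return reported == pinned


def meets_minimum(reported: Dict[str, int], minimum: Dict[str, int]) -> bool:
    """Floor policy: accept any platform at or above every required component."""
    return all(reported.get(name, -1) >= floor for name, floor in minimum.items())


after_patch = {"bootloader": 3, "tee": 0, "snp": 8, "microcode": 120}
rolled_back = {"bootloader": 3, "tee": 0, "snp": 7, "microcode": 120}

print(matches_pinned(after_patch, MINIMUM_TCB))  # False: a routine patch causes an outage
print(meets_minimum(after_patch, MINIMUM_TCB))   # True: newer firmware is accepted
print(meets_minimum(rolled_back, MINIMUM_TCB))   # False: a downgraded component is rejected
```

The pipeline's job then becomes raising the floor after each security-relevant update, which is a one-way change and much easier to automate than swapping exact values.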

Another long-term cost is the need to recompile and re-sign enclaves for every security patch to the enclave code. Unlike a regular application where you can hot-patch, an enclave must be stopped, updated, and restarted with a new measurement. For high-availability workloads, this requires a blue-green deployment pattern where two enclaves run simultaneously, and traffic is switched after the new one attests successfully. This doubles the infrastructure cost during the transition.

There is also the risk of vendor lock-in. Intel TDX and AMD SEV-SNP are not interchangeable. Code that uses Intel's SDK for attestation will not work on AMD hardware without significant changes. If you want to run across multiple cloud providers, you need an abstraction layer like the Confidential Containers project or Enarx, but these are still maturing. The cost of migrating from one TEE to another can be as high as the initial implementation.

Performance Drift Over Time

As the workload grows, the memory encryption overhead can change. A model that fits in L3 cache may see only 2% overhead, but as the dataset grows and spills to DRAM, the overhead can jump to 15%. Teams should benchmark with realistic data sizes early and plan for scaling. We have seen a project that worked fine in development with 1 GB of data but failed performance targets in production with 50 GB because the encryption engine became the bottleneck.

When Not to Use This Approach

Confidential computing is not the right tool for every problem. If your threat model does not include a compromised hypervisor or physical attacker, then encryption at rest and TLS are usually sufficient. Many workloads in a private data center with strict access controls do not need TEEs. The added complexity and performance cost are not justified.

It is also a poor fit for latency-sensitive microservices. Attestation adds setup cost, and each transition into or out of an enclave adds on the order of tens of microseconds. A single crossing barely registers against a 5 millisecond response budget (50 microseconds is 1%), but a request that crosses the boundary ten times for I/O or system calls pays roughly 500 microseconds, which is a 10% increase. Similarly, workloads that use RDMA or GPU direct memory access often break with TEEs because non-CPU devices cannot read or write the encrypted memory regions directly. Until that protection extends across the PCIe bus to the device, these workloads will need to use alternative protection, such as encrypting data before sending it to the accelerator.
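A quick sanity check against your own latency budget is simple arithmetic. The numbers below are illustrative assumptions, not measurements; substitute the per-transition cost and crossing count you observe in your own profiling.

```python
def overhead_fraction(transitions_per_request: int,
                      cost_per_transition_us: float,
                      budget_ms: float) -> float:
    """Share of the latency budget consumed by enclave boundary crossings."""
    total_us = transitions_per_request * cost_per_transition_us
    return total_us / (budget_ms * 1000.0)


single = overhead_fraction(1, 50.0, 5.0)    # one crossing at 50 us against 5 ms
chatty = overhead_fraction(10, 50.0, 5.0)   # ten crossings per request

print(f"{single:.1%}")  # 1.0%
print(f"{chatty:.1%}")  # 10.0%
```

The per-crossing cost matters less than the number of crossings, which is why batching calls across the boundary is usually the first optimization to try.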

Another case is legacy batch jobs that run for hours and process terabytes of data. The memory encryption overhead applies to every memory access for the whole run, so a memory-bound job may take 20% longer. If the data is ephemeral and the environment is air-gapped, the security gain is minimal. We have seen teams revert to plain VMs for such jobs after measuring the throughput loss.

Finally, if your organization lacks the operational maturity to manage attestation policies, key rotation, and measurement changes, then confidential computing will introduce more risk than it mitigates. A misconfigured attestation policy that accepts any measurement is worse than no TEE at all, because it gives a false sense of security. Start with a simpler encryption strategy and adopt TEEs only when you have the team and tooling to manage them.

Open Questions and FAQ

How do I handle attestation for ephemeral enclaves?

Ephemeral enclaves that spin up and down frequently, such as in a serverless function, cannot rely on long-lived sessions. The recommended approach is to use a lightweight attestation that verifies the enclave measurement and then issues a short-lived token, valid for minutes. The token is cached by the client and used for subsequent requests. The enclave itself should be stateless, with all persistent state stored encrypted in untrusted storage and sealed to the enclave measurement.

Can I use confidential computing with Kubernetes?

Yes, but it requires a TEE-aware container runtime like Kata Containers or a confidential pod implementation. The Kubernetes node must be configured with the appropriate hardware and firmware. The main challenge is that the pod's resource limits must account for the memory overhead of encryption, and the pod cannot use host networking or privileged capabilities. Projects like Confidential Containers (CoCo) are working to make this seamless, but production deployments still require careful configuration.

What happens if the CPU key is compromised?

The CPU key is fused during manufacturing and is not accessible to software. If it is compromised, the TEE guarantees for that part are broken. Compromise of the fused secret itself is rare, and published attacks have mostly targeted keys derived from it, but it is a risk that vendors and cloud providers mitigate by using multiple generations of hardware and by tying the derived attestation keys to the firmware version, so that those keys change when the trusted firmware is updated. As a user, you can reduce the impact by using your own key hierarchy where the CPU key only protects a wrapping key, not the data directly.

How do I migrate keys when the enclave measurement changes?

Use a KMS that supports key derivation from the enclave measurement. The data encryption keys are wrapped with a key that is derived from the measurement. When the measurement changes, the old enclave can unwrap the keys and re-wrap them with a new key derived from the new measurement. This requires the old enclave to be available during the transition. An alternative is to use a policy that allows multiple measurements, but this weakens security unless old measurements are retired promptly, because an attacker could use an older, vulnerable measurement to request keys.
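The re-wrap flow can be illustrated with the standard library alone. Treat this strictly as a sketch of the key hierarchy: `root_secret` is a hypothetical stand-in for the secret that real hardware or a KMS keeps internally, and the `wrap`/`unwrap` pair is a toy encrypt-then-MAC construction used only because the Python standard library has no AES. A real system would use AES key wrap or AES-GCM from a vetted library.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256 and the default all-zero salt."""
    prk = hmac.new(b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def kek_for(root_secret: bytes, measurement: bytes) -> bytes:
    """Key-encryption key bound to one enclave measurement."""
    return hkdf_sha256(root_secret, b"kek:" + measurement)


def wrap(kek: bytes, dek: bytes) -> bytes:
    """Toy wrap (NOT production crypto): fresh pad per nonce, then an HMAC tag."""
    nonce = os.urandom(16)
    pad = hkdf_sha256(kek, b"pad:" + nonce, len(dek))
    body = bytes(a ^ b for a, b in zip(dek, pad))
    tag = hmac.new(hkdf_sha256(kek, b"mac"), nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap(kek: bytes, blob: bytes) -> bytes:
    nonce, body, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(hkdf_sha256(kek, b"mac"), nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob was not wrapped for this measurement")
    pad = hkdf_sha256(kek, b"pad:" + nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, pad))


def rewrap(root_secret: bytes, old_measurement: bytes,
           new_measurement: bytes, blob: bytes) -> bytes:
    """Migration step: unwrap under the old measurement, wrap under the new one."""
    dek = unwrap(kek_for(root_secret, old_measurement), blob)
    return wrap(kek_for(root_secret, new_measurement), dek)


root = os.urandom(32)
dek = os.urandom(32)
old_blob = wrap(kek_for(root, b"measurement-v1"), dek)
new_blob = rewrap(root, b"measurement-v1", b"measurement-v2", old_blob)
```

Note the simplification: here one function can derive both keys, whereas on real hardware an enclave can only obtain the key for its own measurement, which is exactly why the old enclave must still be running during the migration.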

Is confidential computing enough for compliance?

It depends on the regulation. For GDPR, it can help demonstrate that data in use is protected, but it does not replace other controls like access logging and data minimization. For PCI DSS, it can reduce the scope of the cardholder data environment, but the attestation and key management processes must be documented and audited. Always consult your compliance officer and auditor before relying solely on TEEs for compliance.

To move forward, start by identifying a single workload that has a clear threat model and a tolerable performance budget. Run a proof of concept with the chosen TEE, focusing on attestation and key management first. Do not try to migrate everything at once. Build a small, well-tested enclave and extend from there. The technology is powerful, but it demands respect for its complexity.
