Architecture Decisions

Chapter Summary

This chapter documents the key Architecture Decision Records (ADRs) made during Contoso Insurance's journey through the hybrid continuum. Each ADR captures the context, alternatives considered, decision rationale, and consequences. These decisions shaped the migration path from Azure public cloud to fully disconnected sovereign operation.

ADR Process & Template

Contoso adopted the Architecture Decision Record format to document significant architectural choices. Each ADR follows a consistent structure so that decisions remain traceable and can be revisited as circumstances change.

ADR Template Structure

# ADR-XXX: [Decision Title]

**Status**: [Proposed | Accepted | Superseded | Deprecated]
**Date**: YYYY-MM-DD
**Deciders**: [List of people involved in decision]
**Context**: [What is the issue we're facing?]

## Decision

[The decision that was made]

## Alternatives Considered

1. **Option 1**: Description, pros/cons
2. **Option 2**: Description, pros/cons
3. **Option 3**: Description, pros/cons

## Rationale

[Why we chose this option]

## Consequences

**Positive**: Benefits of this decision
**Negative**: Drawbacks and risks
**Mitigation**: How we address the negatives

## Related Decisions

[Links to related ADRs]

ADR-001: Container Orchestration Platform Selection

Status: Accepted
Date: 2023-01-15
Deciders: CTO, Platform Architecture Team, DevOps Lead

Context

Contoso Insurance needed to select a container orchestration platform for Phase 1 (Azure cloud) that would enable portability through Phase 2 (hybrid) and Phase 3 (disconnected). The platform had to support high availability and autoscaling, and integrate with Azure services.

Decision

Use Azure Kubernetes Service (AKS) in Phase 1, AKS on Azure Local in Phase 2, and RKE2 in Phase 3.

Alternatives Considered

Option 1: Azure App Service / Container Apps
Azure-native PaaS compute services with excellent Azure integration.
  • Pros: Simplest operational model, deep Azure integration, automatic scaling, no cluster management
  • Cons: Azure-only, no portability to on-premises, vendor lock-in, limited control over infrastructure
Option 2: Azure Kubernetes Service (AKS) → AKS on Azure Local → RKE2
Kubernetes orchestration with migration path to on-premises.
  • Pros: Portable workloads (Kubernetes is industry standard), consistent API across phases, on-premises capable
  • Cons: Cluster management complexity, requires Kubernetes expertise
Option 3: Azure Service Fabric
Azure-native microservices orchestrator.
  • Pros: Azure-optimized, supports .NET applications natively, stateful services
  • Cons: Limited on-premises support, smaller ecosystem than Kubernetes, declining community momentum

Rationale

Kubernetes provides the best balance of Azure integration (Phase 1) and on-premises portability (Phase 2-3). While AKS requires more operational investment than App Service, the ability to run identical workloads on Azure and on-premises is essential for Contoso's cloud exit strategy. Kubernetes is an industry standard with deep ecosystem support and multi-cloud portability.
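
To make the portability argument concrete, the minimal sketch below (chart, release, context, and values-file names are all hypothetical) shows how a single Helm chart can target each phase's cluster, with only the kubeconfig context and the values file changing per environment:

```python
import subprocess

# Hypothetical mapping of migration phase to kubeconfig context and Helm
# values overrides; the chart itself is identical across all three phases.
PHASES = {
    "phase1-aks": ("aks-prod", "values-azure.yaml"),
    "phase2-azure-local": ("aks-local-prod", "values-azure-local.yaml"),
    "phase3-rke2": ("rke2-prod", "values-disconnected.yaml"),
}

def deploy(phase: str, chart: str = "./charts/claims-api") -> None:
    """Deploy (or upgrade) the release on the cluster for the given phase."""
    context, values = PHASES[phase]
    subprocess.run(
        [
            "helm", "upgrade", "--install", "claims-api", chart,
            "--kube-context", context,  # target cluster differs per phase
            "-f", values,               # environment-specific overrides only
            "--namespace", "claims", "--create-namespace",
        ],
        check=True,
    )

deploy("phase3-rke2")
```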

Consequences

Positive
Application workloads containerized from day one, enabling seamless migration across phases. Helm charts used consistently from Phase 1 through Phase 3. Team developed Kubernetes expertise applicable across environments.
Negative
Higher Phase 1 complexity vs. PaaS alternatives (App Service, Functions). Required investment in Kubernetes training and operations. Cluster management overhead in Phase 3 (no Azure-managed control plane).
Mitigation
Engaged Azure FastTrack consultants for AKS best practices training in Phase 1. Used managed AKS in Phase 1-2 to minimize operational burden. Partnered with Rancher for RKE2 deployment assistance in Phase 3.

Related Decisions: ADR-008 (Container Registry), ADR-009 (CI/CD Pipeline)


ADR-002: Database Migration Strategy

Status: Accepted
Date: 2023-02-10
Deciders: CTO, Database Administrator, Platform Architecture Team

Context

Contoso's SQL Server database contains 180 GB of transactional data requiring ACID compliance. Migration from Azure SQL Database (Phase 1) to on-premises (Phase 2-3) must meet a <4 hour RTO and <1 hour RPO while achieving zero data loss during planned cutovers.

Decision

Use Azure SQL Database (Phase 1) → Arc-enabled SQL Managed Instance (Phase 2) → SQL Server 2022 on VMs (Phase 3) with log shipping for migration.

Alternatives Considered

Option 1: Azure SQL Database → SQL Server on VMs (skip Arc MI)
Direct migration from Azure SQL to traditional SQL Server.
  • Pros: Simpler architecture (no intermediate Arc MI step), lower Phase 2 cost
  • Cons: Larger gap between Azure SQL and SQL Server features, no Arc management benefits in Phase 2
Option 2: Azure SQL Database → Arc SQL MI → SQL Server 2022 on VMs
Phased migration with Arc-enabled SQL MI as intermediate state.
  • Pros: Gradual migration, Arc SQL MI provides Azure Portal management in Phase 2, easier rollback
  • Cons: Additional migration step, Arc SQL MI licensing costs
Option 3: Azure SQL Database → Azure SQL MI (fully managed) → SQL Server on VMs
Use fully managed Azure SQL MI instead of Arc SQL MI.
  • Pros: Fully managed in Phase 2, high availability built-in
  • Cons: Still cloud-hosted (doesn't achieve on-premises goal), higher cost than Arc SQL MI

Rationale

Arc-enabled SQL Managed Instance provides the best migration path by offering Azure SQL compatibility with on-premises hosting. The two-phase migration (Azure SQL → Arc SQL MI → SQL Server VMs) reduces risk by allowing incremental changes. Arc SQL MI in Phase 2 enables continued use of Azure Portal for database management while achieving data sovereignty. The final transition to SQL Server VMs in Phase 3 is straightforward (both run SQL Server engine).

Consequences

Positive
Zero data loss achieved during Phase 1 → Phase 2 migration using transaction log shipping. Arc SQL MI provided familiar Azure management experience during Phase 2. Database performance improved in Phase 2-3 due to local NVMe storage (2ms vs 8ms latency).
Negative
Database migration was critical path for both Phase 2 and Phase 3 (longest downtime component). Arc SQL MI licensing added €3,000/month cost in Phase 2. SQL Server VMs require manual backup configuration and Always On Availability Group management in Phase 3.
Mitigation
Practiced migration in dev/test environments 3 times before production cutover. Maintained Azure SQL Database as hot standby for 72 hours post-migration. Implemented automated backup validation checks (checksum verification).
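
A minimal sketch of the kind of automated backup validation mentioned above, assuming pyodbc with a hypothetical server and backup path; RESTORE VERIFYONLY WITH CHECKSUM validates the backup file, including page checksums, without performing a restore:

```python
import pyodbc

# Hypothetical connection string; the validation job runs against master.
CONN = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sql01.contoso.local;DATABASE=master;"
    "Trusted_Connection=yes;TrustServerCertificate=yes;"
)

def verify_backup(backup_path: str) -> bool:
    """Return True if the backup file passes checksum verification."""
    with pyodbc.connect(CONN, autocommit=True) as conn:
        try:
            # Path is trusted input here; RESTORE statements do not reliably
            # accept bind parameters, so the literal is embedded directly.
            conn.execute(
                f"RESTORE VERIFYONLY FROM DISK = N'{backup_path}' WITH CHECKSUM"
            )
            return True  # no exception: the backup verified cleanly
        except pyodbc.Error as err:
            print(f"Backup verification failed: {err}")
            return False

print(verify_backup(r"\\backup01\sql\PolicyDB_full.bak"))
```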

Related Decisions: ADR-007 (Secret Management for connection strings)


ADR-003: Message Broker Selection

Status: Accepted
Date: 2023-03-01
Deciders: Platform Architecture Team, Backend Development Team

Context

Azure Service Bus provided reliable message queuing in Phase 1 but is cloud-only. Phase 2-3 require on-premises message broker supporting at-least-once delivery, dead-letter queues, and durable message storage.

Decision

Use Azure Service Bus (Phase 1) → RabbitMQ (Phase 2-3).

Alternatives Considered

Option 1: RabbitMQ
Open-source AMQP message broker with excellent Kubernetes support.
  • Pros: Battle-tested (15+ years), excellent .NET support (MassTransit, EasyNetQ), quorum queues for HA, active community
  • Cons: Requires operational expertise, no managed offering on-premises
Option 2: Apache Kafka
Distributed event streaming platform.
  • Pros: Highest throughput, excellent for event sourcing, strong consistency guarantees
  • Cons: Overkill for Contoso's workload (not event streaming), operational complexity, higher resource requirements
Option 3: NATS
Lightweight cloud-native messaging system.
  • Pros: Extremely lightweight, low latency, Kubernetes-native
  • Cons: Less mature than RabbitMQ, limited .NET ecosystem, fewer enterprise features (no built-in DLQ)

Rationale

RabbitMQ offers the best balance of enterprise features (dead-letter queues, message persistence, at-least-once delivery) and operational simplicity. The .NET ecosystem support via MassTransit simplifies application integration. RabbitMQ's quorum queues provide high availability without complex Kafka clusters. Contoso's workload (3 queues, ~15,000 messages/day) fits RabbitMQ's strengths.
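
As an illustration, the sketch below declares a quorum queue with dead-lettering using pika against a hypothetical cluster address, approximating the Service Bus dead-letter semantics the application relied on in Phase 1:

```python
import pika

# Hypothetical broker address inside the cluster.
params = pika.ConnectionParameters(host="rabbitmq.claims.svc")
with pika.BlockingConnection(params) as conn:
    ch = conn.channel()

    # Dead-letter exchange and queue for messages that are rejected or expire.
    ch.exchange_declare(exchange="claims-dlx", exchange_type="fanout", durable=True)
    ch.queue_declare(queue="claims-dead-letter", durable=True)
    ch.queue_bind(queue="claims-dead-letter", exchange="claims-dlx")

    # Quorum queue (Raft-replicated) with dead-lettering, analogous to a
    # Service Bus queue with its dead-letter sub-queue.
    ch.queue_declare(
        queue="claims-intake",
        durable=True,
        arguments={
            "x-queue-type": "quorum",            # HA via Raft consensus
            "x-dead-letter-exchange": "claims-dlx",
            "x-delivery-limit": 5,               # dead-letter after 5 redeliveries
        },
    )
```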

Consequences

Positive
RabbitMQ Cluster Operator simplified Kubernetes deployment. MassTransit abstraction layer minimized code changes (only configuration updates). RabbitMQ Management UI provided excellent operational visibility. Message processing latency unchanged vs. Azure Service Bus.
Negative
Team required RabbitMQ training (queue configuration, memory/disk management). Monitoring integration required custom Prometheus exporters. Quorum queue configuration subtly different from Azure Service Bus (message TTL, prefetch count tuning).
Mitigation
Engaged RabbitMQ consultants for best practices training. Migrated dev/test environments first to identify configuration gaps. Implemented comprehensive monitoring (queue depth, consumer lag, memory usage).

Related Decisions: ADR-001 (Kubernetes as RabbitMQ host)


ADR-004: Object Storage Selection

Status: Accepted
Date: 2023-03-15
Deciders: Platform Architecture Team, Storage Administrator

Context

Azure Blob Storage provided scalable document storage in Phase 1. Phase 2-3 require on-premises object storage for 8 TB of insurance documents with S3 API compatibility for minimal application changes.

Decision

Use Azure Blob Storage (Phase 1) → MinIO (Phase 2-3).

Alternatives Considered

Option 1: MinIO
Open-source S3-compatible object storage.
  • Pros: Excellent S3 API compatibility, Kubernetes-native, erasure coding for durability, active development
  • Cons: Requires operational expertise, manual scaling management
Option 2: Ceph Object Gateway (RGW)
Enterprise-grade distributed storage system.
  • Pros: Battle-tested at scale, strong consistency, multi-protocol support (S3, Swift)
  • Cons: Complex deployment, heavy resource requirements, steep learning curve
Option 3: Azure Local Native Storage (CSV)
Use Cluster Shared Volumes directly for file storage.
  • Pros: Integrated with Azure Local, no additional software
  • Cons: No object storage API (requires file share protocol), significant application changes

Rationale

MinIO's S3 API compatibility enabled nearly zero application code changes — only connection string and endpoint updates. MinIO's Kubernetes-native design aligns with Contoso's orchestration strategy. Erasure coding provides durability without the overhead of full replication.
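
For illustration, the sketch below shows the shape of that S3 client code using boto3, with hypothetical endpoint, credentials, and bucket names; retargeting storage at MinIO amounts to changing endpoint_url:

```python
import boto3
from botocore.config import Config

# Hypothetical on-premises MinIO endpoint and credentials; the same client
# code previously pointed at an S3-compatible cloud endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.contoso.local:9000",
    aws_access_key_id="CONTOSO_APP_KEY",
    aws_secret_access_key="CONTOSO_APP_SECRET",
    config=Config(signature_version="s3v4"),  # MinIO expects SigV4 signing
)

# Presigned download link for a policy document, valid for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "insurance-documents", "Key": "policies/POL-1042.pdf"},
    ExpiresIn=3600,
)
print(url)
```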

Consequences

Positive
Migration required only configuration changes (no code changes). Presigned URL generation, multipart uploads, and lifecycle policies worked identically. MinIO's performance exceeded Azure Blob (local NVMe vs. network storage). MinIO Console provided excellent operational UI.
Negative
Manual capacity management (no auto-scaling like Azure Blob). Erasure coding configuration (data shards, parity shards) required planning for storage efficiency. Initial misconfiguration caused intermittent upload failures (corrected by increasing I/O buffer sizes).
Mitigation
Allocated 2x current storage capacity (16 TB) for growth headroom. Implemented Prometheus monitoring for storage utilization alerts (trigger at 70% full). Staged migration over 14 days to identify issues incrementally.

Related Decisions: ADR-001 (Kubernetes as MinIO host)


ADR-005: Identity Provider Migration

Status: Accepted
Date: 2023-04-01
Deciders: CTO, Security Team, Platform Architecture Team

Context

Azure AD B2C (customers) and Microsoft Entra ID (employees) provided cloud-based identity in Phase 1-2. Phase 3 requires on-premises identity for full disconnection while maintaining SSO, MFA, and RBAC capabilities.

Decision

Use Azure AD B2C + Entra ID (Phase 1-2) → Active Directory Domain Services + ADFS (Phase 3).

Alternatives Considered

Option 1: Active Directory Domain Services + ADFS
Traditional Windows-based identity stack.
  • Pros: Enterprise-proven, deep Windows integration, full control over identity infrastructure
  • Cons: Requires significant application changes, complex certificate management, limited modern auth features (vs. Azure AD)
Option 2: Keycloak (Open Source IdP)
Modern open-source identity provider with OIDC/SAML support.
  • Pros: Modern protocol support, Kubernetes-native, extensible, REST API for management
  • Cons: Less mature than Azure AD/ADFS, smaller ecosystem, requires custom UI development
Option 3: Okta (SaaS IdP)
Cloud-based identity service with on-premises connector.
  • Pros: Modern features, excellent UX, hybrid model supported
  • Cons: Still cloud-dependent (defeats disconnection goal), recurring SaaS cost, vendor lock-in

Rationale

AD DS + ADFS is the only option providing fully on-premises identity without cloud dependencies. While Keycloak is attractive, the team's existing Windows expertise and ADFS's maturity reduced implementation risk. The investment in custom authentication UI (matching Azure AD B2C's experience) was justified by full sovereign control.
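
A minimal sketch of authority-agnostic token validation, assuming PyJWT and hypothetical ADFS host, JWKS URL, and audience values; written against standard OIDC/JWKS, only the authority URL and audience change between Azure AD B2C and ADFS:

```python
import jwt  # PyJWT

# Hypothetical ADFS JWKS endpoint and API audience.
JWKS_URL = "https://adfs.contoso.local/adfs/discovery/keys"
AUDIENCE = "api://claims-portal"

jwks_client = jwt.PyJWKClient(JWKS_URL)

def validate_token(token: str) -> dict:
    """Validate signature, expiry, and audience; return the token claims."""
    # Resolve the signing key named in the token header via the JWKS endpoint.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
    )
```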

Consequences

Positive
Complete identity sovereignty achieved. No external dependencies for authentication. Integrated with existing corporate AD infrastructure (agents use domain accounts). Certificate management via internal PKI.
Negative
Significant application code changes required (authentication logic, token validation, logout flows). Custom UI development for customer-facing login pages (~80 hours effort). Certificate rotation operational burden (vs. Azure-managed). User account migration required password resets for all customers.
Mitigation
Phased migration — agents migrated first (internal users, easier rollback), customers migrated in batches of 2,000. Extensive testing with Playwright end-to-end tests (login, logout, MFA flows). Certificate monitoring via Prometheus x509 exporter.

Related Decisions: ADR-010 (Network architecture for ADFS high availability)


ADR-006: Monitoring & Observability Stack

Status: Accepted
Date: 2023-04-15
Deciders: SRE Team, Platform Architecture Team

Context

Azure Monitor and Application Insights provided comprehensive observability in Phase 1-2 but require cloud connectivity. Phase 3 requires fully on-premises monitoring with similar capabilities (metrics, logs, traces, dashboards, alerting).

Decision

Use Azure Monitor + App Insights (Phase 1-2) → Prometheus + Grafana + Loki + Jaeger (Phase 3).

Alternatives Considered

Option 1: Prometheus + Grafana + Loki + Jaeger (PGLJ Stack)
Open-source observability stack with Kubernetes-native components.
  • Pros: Industry standard, Kubernetes-native, excellent Grafana dashboards, active community
  • Cons: Requires operational expertise, manual integration work, no single-pane-of-glass (4 separate tools)
Option 2: Elastic Stack (Elasticsearch, Logstash, Kibana)
All-in-one log analytics platform.
  • Pros: Unified platform for logs/metrics/traces, powerful search, mature ecosystem
  • Cons: Heavy resource requirements (Elasticsearch clusters), licensing complexity (Elastic License vs. open source), less Kubernetes-native than PGLJ
Option 3: Datadog Agent (on-premises)
SaaS monitoring with on-premises agent support.
  • Pros: Excellent UX, unified platform, APM included
  • Cons: Still sends data to cloud (defeats disconnection goal), recurring SaaS cost, vendor lock-in

Rationale

The PGLJ stack (Prometheus, Grafana, Loki, Jaeger) is the de-facto standard for Kubernetes observability. Each component is Kubernetes-native and integrates seamlessly. While managing four tools is more complex than Azure Monitor, the stack provides equivalent capabilities without cloud dependencies. OpenTelemetry adoption future-proofs instrumentation (vendor-neutral).
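
The sketch below illustrates that instrumentation point using the OpenTelemetry Python SDK with a hypothetical collector endpoint and service name; spans flow to a collector over OTLP, so the backend behind the collector can change without touching application code:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Hypothetical in-cluster collector endpoint; the collector decides whether
# spans land in Jaeger (Phase 3) or a cloud backend (Phases 1-2).
provider = TracerProvider(
    resource=Resource.create({"service.name": "claims-api"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="otel-collector.monitoring.svc:4317", insecure=True
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-claim"):
    pass  # application work happens here
```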

Consequences

Positive
Successfully replicated Azure Monitor dashboards in Grafana (metrics + logs + traces in single pane). OpenTelemetry instrumentation portable across environments. Prometheus + Alertmanager replicated all Azure Monitor alerts. On-premises data retention (90 days) vs. Azure Monitor (30 days default).
Negative
Significant application instrumentation changes (Azure SDK → OpenTelemetry). Ongoing Prometheus maintenance (storage management, retention policies). Grafana dashboards had to be recreated manually (no automated export from Azure Monitor). Initial learning curve for Loki's LogQL query language.
Mitigation
Ran PGLJ stack in parallel with Azure Monitor during Phase 2-3 transition (validated metric parity). Engaged Grafana consultants for dashboard design best practices. Implemented automated Prometheus backup to MinIO (hourly snapshots).

Related Decisions: ADR-001 (Kubernetes as monitoring stack host)


ADR-007: Secrets Management

Status: Accepted
Date: 2023-05-01
Deciders: Security Team, Platform Architecture Team

Context

Azure Key Vault provided secrets management in Phase 1-2 with managed identities for authentication. Phase 3 requires on-premises secrets management with similar capabilities (encryption, access control, audit logging, rotation).

Decision

Use Azure Key Vault (Phase 1-2) → HashiCorp Vault (Phase 3).

Alternatives Considered

Option 1: HashiCorp Vault
Enterprise secrets management platform with Kubernetes integration.
  • Pros: Industry standard, excellent Kubernetes integration (Vault Agent Injector), auto-unseal, audit logging, secrets rotation
  • Cons: Operational complexity, requires HA cluster, unsealing procedures
Option 2: Sealed Secrets (Kubernetes-native)
Encrypted secrets stored in Git, decrypted by controller in cluster.
  • Pros: GitOps-friendly, simple concept, low operational overhead
  • Cons: No secrets rotation, no access audit logs, limited to Kubernetes, no external secret sources
Option 3: CyberArk Conjur
Enterprise secrets management for cloud and on-premises.
  • Pros: Enterprise-grade, strong security features, audit compliance
  • Cons: Commercial licensing, heavier than Vault, smaller open-source community

Rationale

HashiCorp Vault strikes the best balance of enterprise features (encryption, audit logging, access control, rotation) and Kubernetes-native integration. The Vault Agent Injector simplifies secret injection into pods (similar to Azure Key Vault CSI driver). Vault's active open-source community and extensive documentation reduce operational risk.
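
A minimal sketch of the pod-side pattern, assuming the hvac client library, Vault's Kubernetes auth method, and hypothetical role, mount, and secret paths:

```python
import hvac

# Hypothetical Vault address; TLS is assumed to be configured cluster-wide.
client = hvac.Client(url="https://vault.contoso.local:8200")

# Authenticate with the pod's projected ServiceAccount token, bound to a
# hypothetical "claims-api" Vault role.
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    client.auth.kubernetes.login(role="claims-api", jwt=f.read())

# Read database credentials from the KV v2 engine, replacing the equivalent
# Azure Key Vault secret lookup used in Phases 1-2.
secret = client.secrets.kv.v2.read_secret_version(
    path="claims/database", mount_point="secret"
)
conn_string = secret["data"]["data"]["connection-string"]
```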

Consequences

Positive
Successful secret migration from Azure Key Vault with zero secrets exposure. Kubernetes ServiceAccount authentication via Vault's native auth method. Secrets rotation policies implemented (database passwords rotate every 90 days). Comprehensive audit logs for compliance.
Negative
Vault unsealing adds operational burden (unseal keys split via Shamir secret sharing; the procedure was scripted, but key-share custody and backup remain manual). External Secrets Operator configuration more complex than the Azure Key Vault CSI driver. Initial Vault deployment required security review (TLS certificates, access policies).
Mitigation
Implemented automated Vault backup (Raft snapshots to MinIO hourly). Documented unseal procedures for disaster recovery (laminated cards in secure safe). Used Terraform to manage Vault policies as code (version control, peer review).

Related Decisions: ADR-002 (Database connection string storage), ADR-005 (ADFS certificate storage)


ADR-008: Container Registry

Status: Accepted
Date: 2023-05-15
Deciders: DevOps Team, Security Team

Context

Azure Container Registry provided image storage and vulnerability scanning in Phase 1-2. Phase 3 requires fully on-premises container registry with image signing, replication, and vulnerability scanning capabilities.

Decision

Use Azure Container Registry (Phase 1-2) → Harbor (Phase 3).

Alternatives Considered

Option 1: Harbor
Open-source CNCF registry with enterprise features.
  • Pros: Enterprise features (RBAC, replication, signing, scanning), excellent UI, Kubernetes-native, Notary integration
  • Cons: Requires PostgreSQL backend, operational complexity for HA
Option 2: GitLab Container Registry
Integrated container registry with GitLab.
  • Pros: Integrated with CI/CD pipeline, single platform, simple management
  • Cons: Tightly coupled to GitLab (less flexible), smaller feature set than Harbor
Option 3: Docker Distribution (Open Source Registry)
Lightweight Docker registry.
  • Pros: Minimal resource requirements, simple deployment
  • Cons: No UI, no vulnerability scanning, no RBAC, no signing

Rationale

Harbor provides the enterprise features Contoso requires (vulnerability scanning via Trivy, image signing via Notary, RBAC, replication) while remaining fully open-source. The excellent web UI simplifies operations vs. command-line-only alternatives. Harbor's CNCF graduated status indicates maturity and community support.
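
As an illustration of how scan results can gate image promotion, the sketch below queries Harbor's v2 REST API for an artifact's vulnerability scan overview (host, project, repository, and credentials are hypothetical):

```python
import requests

# Hypothetical Harbor instance and robot-account credentials.
HARBOR = "https://harbor.contoso.local/api/v2.0"
AUTH = ("ci-robot", "example-token")

def has_critical_cves(project: str, repo: str, tag: str) -> bool:
    """Return True if the artifact's Trivy scan reports critical CVEs."""
    resp = requests.get(
        f"{HARBOR}/projects/{project}/repositories/{repo}/artifacts/{tag}",
        params={"with_scan_overview": "true"},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    # scan_overview is keyed by report MIME type; check each report summary.
    for report in (resp.json().get("scan_overview") or {}).values():
        severities = report.get("summary", {}).get("summary", {})
        if severities.get("Critical", 0) > 0:
            return True
    return False

if has_critical_cves("claims", "claims-api", "1.4.2"):
    raise SystemExit("Blocked: image has critical vulnerabilities")
```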

Consequences

Positive
Image vulnerability scanning caught 3 critical CVEs during Phase 3 migration (images blocked from production). Image signing enforcement via admission controller prevented unsigned images from deploying. Harbor replication to DR site ensures image availability during disasters. Excellent UI reduced operational burden vs. command-line registry tools.
Negative
PostgreSQL backend required for Harbor (additional operational component). Initial image promotion from ACR to Harbor took 8 hours (60 GB of images). Harbor upgrade procedures require careful sequencing (database migrations, UI downtime).
Mitigation
PostgreSQL deployed with Zalando Postgres Operator (automated backups, HA). Implemented image promotion automation (Azure CLI + Harbor API scripts). Documented Harbor upgrade runbook with rollback procedures.

Related Decisions: ADR-001 (Kubernetes as Harbor host), ADR-009 (CI/CD pipeline integration)


ADR-009: CI/CD Pipeline Platform

Status: Accepted
Date: 2023-06-01
Deciders: DevOps Team, Development Team

Context

GitHub Actions provided CI/CD automation in Phase 1-2 with seamless Azure integration. Phase 3 requires fully on-premises CI/CD pipeline for builds, tests, and deployments without cloud dependencies.

Decision

Use GitHub Actions (Phase 1-2) → GitLab Community Edition (Phase 3).

Alternatives Considered

Option 1: GitLab Community Edition (Self-Hosted)
Open-source DevOps platform with integrated CI/CD.
  • Pros: All-in-one platform (Git + CI/CD), excellent Kubernetes integration, LDAP/AD authentication, Auto DevOps features
  • Cons: Heavier resource requirements than Jenkins, migration effort from GitHub
Option 2: Jenkins
Traditional CI/CD automation server.
  • Pros: Most mature CI/CD platform, enormous plugin ecosystem, highly customizable
  • Cons: Dated UI, plugin maintenance burden, less Kubernetes-native than GitLab
Option 3: Tekton (Kubernetes-Native CI/CD)
Cloud-native CI/CD built on Kubernetes primitives.
  • Pros: Kubernetes-native, pipelines as CRDs, lightweight
  • Cons: Requires separate Git repository (no integrated VCS), immature UI, steep learning curve

Rationale

GitLab provides the most complete DevOps platform, integrating source control and CI/CD in a single tool. The integrated experience simplifies operations vs. separate Git + Jenkins. GitLab's Kubernetes-native CI/CD runners align with Contoso's orchestration strategy. LDAP authentication integrates seamlessly with AD DS.
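
For illustration, a short python-gitlab sketch (hypothetical URL, token, and project path) showing the kind of automation the self-hosted API enables, such as triggering a pipeline and inspecting its jobs:

```python
import gitlab

# Hypothetical self-hosted GitLab instance and access token.
gl = gitlab.Gitlab("https://gitlab.contoso.local", private_token="example-token")
project = gl.projects.get("claims/claims-api")

# Trigger a pipeline on main, then list its jobs and their statuses.
pipeline = project.pipelines.create({"ref": "main"})
print(f"Pipeline #{pipeline.id}: {pipeline.status}")
for job in pipeline.jobs.list():
    print(f"  {job.name}: {job.status}")
```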

Consequences

Positive
Single platform for Git repositories and CI/CD simplified operations (no Jenkins maintenance). GitLab Auto DevOps accelerated pipeline creation for new projects. Kubernetes executor runners provided a consistent build environment. Git repository migration completed in a single weekend (minimal disruption).
Negative
Pipeline syntax migration from GitHub Actions to GitLab CI required effort (~40 hours for 15 pipelines). Self-hosting GitLab CE requires dedicated resources (8 vCPU, 32 GB RAM), whereas hosted GitHub Actions had no operational footprint. Initial runner configuration issues caused failed builds (misconfigured image pull secrets).
Mitigation
Created GitLab pipeline templates for common patterns (backend API, workers, frontend). Documented migration guide for development teams. Allocated dedicated VMs for GitLab runners (isolated from application workloads).

Related Decisions: ADR-008 (Harbor integration for image publishing), ADR-001 (Kubernetes deployment targets)


ADR-010: Network Architecture

Status: Accepted
Date: 2023-02-20
Deciders: Network Team, Security Team, Platform Architecture Team

Context

Network architecture must support application workloads across three phases with different connectivity requirements: Azure (Phase 1), hybrid cloud-to-on-premises (Phase 2), fully disconnected (Phase 3). Security zones must isolate customer-facing, internal, and management traffic.

Decision

Use hub-spoke topology in Azure (Phase 1), ExpressRoute for hybrid connectivity (Phase 2), and VLAN-based security zones combined with Kubernetes network policies for disconnected operation (Phase 3).

Alternatives Considered

Option 1: Flat Network (Single Subnet)
All components in single network segment.
  • Pros: Simplest configuration, no routing complexity, lowest latency
  • Cons: No security segmentation, large blast radius for security incidents, fails compliance requirements
Option 2: Three-Tier Architecture (DMZ, Application, Data)
Traditional enterprise network design with security zones.
  • Pros: Strong security isolation, clear trust boundaries, well-understood model
  • Cons: Kubernetes workloads don't map cleanly to tiers (pods communicate across tiers), complex firewall rules
Option 3: Kubernetes Network Policies (Micro-Segmentation)
Leverage Kubernetes-native network policies for pod-level isolation.
  • Pros: Kubernetes-native, fine-grained control, dynamic policy updates
  • Cons: Doesn't isolate infrastructure components (domain controllers, monitoring), complex policy debugging

Rationale

The hybrid approach (VLAN-based zones + Kubernetes network policies) provides defense-in-depth. VLAN segmentation isolates infrastructure components (domain controllers, ADFS, monitoring) from Kubernetes workloads. Within Kubernetes, network policies enforce pod-level segmentation (frontend → backend → database flows). This model works consistently across all three phases.
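
A minimal sketch of one such pod-level rule, expressed with the official Kubernetes Python client (namespace and labels are hypothetical): only backend pods may reach the database tier, and only on the SQL Server port.

```python
from kubernetes import client, config

config.load_kube_config()

# Default-deny ingress to database pods except from backend pods on TCP 1433.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(
        name="allow-backend-to-database", namespace="claims"
    ),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"tier": "database"}),
        policy_types=["Ingress"],  # anything not explicitly allowed is denied
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(
                            match_labels={"tier": "backend"}
                        )
                    )
                ],
                ports=[client.V1NetworkPolicyPort(protocol="TCP", port=1433)],
            )
        ],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy("claims", policy)
```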

Consequences

Positive
Security zones enforced via Azure NSGs (Phase 1), VLAN ACLs (Phase 2-3), and Kubernetes network policies (all phases). Achieved zero lateral movement during red team penetration test. Clear network diagrams simplified troubleshooting.
Negative
Network policy debugging challenging (logs scattered across components). ExpressRoute outage in Phase 2 caused cascading failures (identity, monitoring). Initial network policy misconfiguration blocked legitimate traffic (workers → database).
Mitigation
Implemented Cilium Hubble for network policy observability (visualize allowed/denied flows). Cached Azure AD tokens for ExpressRoute outage resilience. Network policy changes require peer review and smoke tests before production deployment.

Related Decisions: ADR-005 (ADFS network placement), ADR-006 (Monitoring traffic flows)


Cross-Cutting Decision Themes

Several themes emerge across these ADRs:

Portability Over Convenience

Contoso consistently chose portable solutions (Kubernetes, RabbitMQ, MinIO) over Azure-specific services (App Service, Service Bus, Blob Storage) even when Azure services offered better integration. This prioritization enabled the Phase 1 → Phase 2 → Phase 3 migration path.

Operational Complexity Trade-Off

Each phase increased operational complexity: Phase 1 (minimal ops, Azure-managed) → Phase 2 (hybrid ops, some on-premises) → Phase 3 (full ops, all on-premises). The trade-off was deliberate — sovereignty requires operational investment.

Open Source Preference

Phase 3 exclusively uses open-source tooling (RKE2, RabbitMQ, MinIO, Harbor, GitLab CE, Prometheus, Grafana). This decision reduces vendor lock-in and licensing costs but increases operational burden (no commercial support contracts).

Decisions That Would Change (Lessons for Greenfield)

If starting from scratch, Contoso would make different decisions:

| Decision | Original Choice | Greenfield Alternative | Reason |
| --- | --- | --- | --- |
| Kubernetes distro | AKS → AKS on Azure Local → RKE2 | Start with RKE2 on-premises | Avoid AKS migration cost; optimize for disconnected from day one |
| Database | Azure SQL → Arc SQL MI → SQL Server VMs | Start with SQL Server on VMs | Simpler migration path; no Arc licensing cost |
| Identity | Azure AD → ADFS | Start with ADFS or Keycloak | Avoid large identity migration effort |
| Monitoring | Azure Monitor → PGLJ stack | Start with PGLJ stack | Avoid instrumentation rewrite; consistent observability |

However, the phased approach was essential for risk management. Greenfield disconnected systems are rare — most organizations start in the cloud and migrate outward through the continuum.
