☁️ Cloud & DevOps

How software gets built, tested, deployed, and operated at scale. Cloud service models define the infrastructure abstraction. DevOps practices and tooling determine how fast and reliably teams can ship changes. This layer is the engine room of modern software delivery.

☁️

Cloud Service Models

Levels of abstraction and shared responsibility
IaaS — Infrastructure as a Service
EC2, Azure VMs, GCE, DigitalOcean
Rent raw compute, storage, and networking. You manage everything from the OS up — patching, runtime, application, data. Maximum control with maximum operational responsibility. The cloud equivalent of renting rack space.
🏛️ Context: IaaS makes sense for lift-and-shift migrations and workloads requiring full OS control. Use reserved instances for steady-state, spot/preemptible for fault-tolerant batch. Automate everything via IaC — no console clicking.
PaaS — Platform as a Service
Heroku, App Service, Cloud Run, Elastic Beanstalk
Deploy your code; the provider manages OS, runtime, scaling, and patching. You focus on application logic. Faster time-to-market at the cost of less infrastructure control. Ideal for standard web applications.
🏛️ Context: PaaS maximises developer productivity for standard workloads. Cloud Run (containers-as-a-service) bridges PaaS simplicity with container flexibility. Evaluate: can you accept the platform's constraints, or do you need IaaS control?
SaaS — Software as a Service
Salesforce, M365, Workday, ServiceNow
Complete applications delivered over the internet. The vendor manages everything — infrastructure, platform, application, updates. You configure and consume. The highest abstraction with the least control.
🏛️ Context: SaaS is the default for commodity capabilities (CRM, HR, ITSM). Integration is the primary architecture concern — SaaS creates data silos. Evaluate API quality, data export capabilities, and exit strategy.
Multi-Cloud & Hybrid Cloud
AWS + Azure + GCP, On-prem + Cloud, Anthos, Arc
Multi-cloud: using multiple cloud providers strategically. Hybrid: combining on-premises infrastructure with cloud. Both add architectural complexity but address vendor lock-in, regulatory, latency, or best-of-breed requirements.
🏛️ Context: Multi-cloud should be justified by concrete requirements (regulation, M&A, best-of-breed), not fear of lock-in. The cost of abstraction layers often exceeds lock-in risk. Hybrid is more commonly justified (data sovereignty, latency).
FinOps / Cloud Cost Management
Cost allocation, Right-sizing, Reserved capacity
The practice of managing cloud spend as an operational discipline. Combines real-time cost visibility, resource optimisation, team-level chargeback/showback, and commitment-based discounts (RIs, savings plans).
🏛️ Context: FinOps is an architecture concern: design for cost-awareness. Tag everything for allocation. Right-size continuously (VMs are often 40-60% over-provisioned). Establish a FinOps team or practice that includes engineering, finance, and procurement.
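The tagging and right-sizing advice can be sketched in a few lines. This is a minimal illustration, assuming hypothetical resource records with `tags`, `monthly_cost`, and `avg_cpu_pct` fields; real billing exports use different schemas, and the 40% CPU threshold is an arbitrary assumption.

```python
from collections import defaultdict

# Hypothetical resource records; real billing exports use different schemas.
RESOURCES = [
    {"id": "vm-1", "tags": {"team": "payments"}, "monthly_cost": 300.0, "avg_cpu_pct": 12.0},
    {"id": "vm-2", "tags": {"team": "payments"}, "monthly_cost": 150.0, "avg_cpu_pct": 71.0},
    {"id": "vm-3", "tags": {"team": "search"}, "monthly_cost": 200.0, "avg_cpu_pct": 35.0},
    {"id": "vm-4", "tags": {}, "monthly_cost": 90.0, "avg_cpu_pct": 5.0},
]


def showback(resources, tag_key="team"):
    """Sum monthly cost per tag value; untagged spend is reported separately."""
    totals = defaultdict(float)
    for r in resources:
        totals[r["tags"].get(tag_key, "UNTAGGED")] += r["monthly_cost"]
    return dict(totals)


def rightsizing_candidates(resources, cpu_threshold_pct=40.0):
    """Return ids of resources whose average CPU is below the threshold."""
    return [r["id"] for r in resources if r["avg_cpu_pct"] < cpu_threshold_pct]


if __name__ == "__main__":
    print(showback(RESOURCES))
    print(rightsizing_candidates(RESOURCES))
```

The `UNTAGGED` bucket is the useful part: spend that cannot be allocated to a team shows up explicitly instead of disappearing into a shared total.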
🐳

Containers & Orchestration

Packaging and running applications consistently
Container Images
Docker, OCI, Dockerfile, Buildpacks
Lightweight, portable packages containing application code, runtime, libraries, and configuration. OCI (Open Container Initiative) is the open standard. Built from Dockerfiles or Cloud Native Buildpacks. Stored in container registries.
🏛️ Context: Multi-stage builds minimise image size and attack surface. Use distroless or Alpine base images. Scan images for vulnerabilities in CI (Trivy, Snyk). Pin base image digests, not just tags, for reproducibility.
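The difference between a tag and a digest can be made concrete with a small check. This is a sketch only: the image names are hypothetical, and `content_digest` hashes arbitrary bytes as a stand-in for the manifest bytes that a real registry digest is computed over.

```python
import hashlib
import re

# A digest-pinned reference ends in @sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference pins an immutable content digest, not only a tag."""
    return DIGEST_RE.search(image_ref) is not None


def content_digest(data: bytes) -> str:
    """Digest in the 'sha256:<hex>' form used for content addressing."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    tagged = "registry.example.com/app:1.4"  # hypothetical image
    pinned = tagged + "@" + content_digest(b"stand-in for manifest bytes")
    print(tagged, is_digest_pinned(tagged))
    print(pinned, is_digest_pinned(pinned))
```

A tag can be moved to point at different content later; a digest changes whenever the content changes, which is why pinning it gives reproducible builds.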
Kubernetes
K8s, EKS, GKE, AKS, OpenShift
Container orchestration platform managing deployment, scaling, networking, and self-healing of containerised workloads. Declarative: you describe desired state, Kubernetes makes it happen. The de facto standard for container orchestration.
🏛️ Context: K8s is powerful but operationally demanding. Managed K8s (EKS, GKE, AKS) removes most of the control plane burden, though node, upgrade, and workload configuration remain your responsibility. Evaluate whether simpler alternatives (ECS, Cloud Run, App Runner) suffice before committing to Kubernetes complexity.
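The declarative model (describe desired state, let a controller converge on it) can be illustrated with a toy reconciliation pass. This is not the Kubernetes API; the pod names and the `reconcile` function are hypothetical and only show the shape of the idea.

```python
def reconcile(desired_replicas: int, running: list[str], next_id: int = 0):
    """One pass of a reconciliation loop: return the actions that move the
    observed state (running pod names) toward the desired replica count."""
    actions = []
    diff = desired_replicas - len(running)
    if diff > 0:
        # Too few replicas: start the missing ones.
        actions = [("start", f"pod-{next_id + i}") for i in range(diff)]
    elif diff < 0:
        # Too many replicas: stop the surplus, taken from the end of the list.
        actions = [("stop", name) for name in running[diff:]]
    return actions


if __name__ == "__main__":
    # A pod crashed: one of three desired replicas is left.
    print(reconcile(3, ["pod-0"], next_id=1))
```

Self-healing falls out of the same loop: a crashed pod simply makes observed state differ from desired state, and the next pass corrects it.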
Helm & Kustomize
Helm charts, Kustomize overlays, Package management for K8s
Templating and configuration tools for Kubernetes. Helm packages applications as charts with configurable values. Kustomize patches base manifests with overlays per environment. Both simplify multi-environment deployments.
🏛️ Context: Helm for distributing reusable application packages. Kustomize for environment-specific overrides without templating complexity. Many teams use both: Helm charts customised via Kustomize overlays.
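The overlay idea can be approximated as a recursive merge in which environment-specific values override a shared base. This is a simplified sketch with hypothetical keys; real Kustomize operates on YAML manifests with strategic merge and JSON patches, which handle lists and deletions in ways this does not.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict: overlay values win, nested dicts merge recursively."""
    result = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_overlay(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical base configuration and production overlay.
BASE = {"replicas": 1, "image": "app:1.0", "resources": {"cpu": "250m", "memory": "256Mi"}}
PROD_OVERLAY = {"replicas": 4, "resources": {"cpu": "1"}}

if __name__ == "__main__":
    print(apply_overlay(BASE, PROD_OVERLAY))
```

The base stays untouched, so each environment is described only by what differs from it.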
Container Registry
ECR, ACR, GCR (now Artifact Registry), Docker Hub, GHCR, Harbor
Repositories for storing and distributing container images. Private registries provide access control, vulnerability scanning, and image signing. The "package manager" for containers.
🏛️ Context: Use cloud-native registries (ECR, ACR) for simplicity and IAM integration. Enable vulnerability scanning on push. Implement image signing (cosign/Notary) for supply chain integrity. Apply lifecycle policies to expire old images and manage storage costs.
Serverless Containers
Cloud Run, Fargate, Azure Container Apps, Knative
Run containers without managing servers or clusters. The provider handles scaling (to zero on platforms that support it, such as Cloud Run and Knative), networking, and infrastructure. You provide a container image; they run it. The sweet spot between PaaS simplicity and container flexibility.
🏛️ Context: Cloud Run and Fargate remove the K8s operational burden for many workloads. Scale-to-zero, where available, saves cost for intermittent traffic. Cold starts are the trade-off. Excellent for APIs, webhooks, and event-driven services.
🔄

CI/CD & Delivery

Automating the path from code to production
Continuous Integration (CI)
GitHub Actions, GitLab CI, Jenkins, CircleCI
Automatically building, testing, and validating every code change. Developers merge frequently to a shared branch. The CI pipeline catches issues immediately: compilation errors, test failures, security vulnerabilities, code quality.
🏛️ Context: CI should complete in under 10 minutes — slow pipelines kill developer productivity. Parallelise tests, cache dependencies, use incremental builds. Include SAST, SCA, and linting as pipeline gates.
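Dependency caching usually hinges on a cache key derived from the lockfile, so the cache is reused until dependencies actually change. A minimal sketch, assuming a hypothetical `cache_key` helper and inline lockfile contents; hosted CI systems provide their own equivalent.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Derive a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


if __name__ == "__main__":
    before = cache_key("deps-linux", b"requests==2.31.0\n")
    after = cache_key("deps-linux", b"requests==2.32.0\n")
    print(before)
    print(after)
    print("cache reused:", before == after)
```

Including the platform in the prefix keeps caches from different runners from colliding.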
Continuous Deployment (CD)
ArgoCD, Flux, Spinnaker, GitHub Actions
Automatically deploying validated artifacts to environments. Continuous Delivery: artifacts are always deployable, humans approve production. Continuous Deployment: every passing change goes to production automatically.
🏛️ Context: GitOps (ArgoCD, Flux) is the modern CD standard for Kubernetes — Git is the source of truth, agents reconcile cluster state. Progressive delivery (canary, blue-green) reduces deployment risk. Target: deploy on demand, multiple times per day.
Progressive Delivery
Canary, Blue-green, Feature flags, A/B testing
Gradually rolling out changes to reduce blast radius. Canary: send a small share of traffic (for example 5%) to the new version, monitor, then expand. Blue-green: maintain two environments, switch traffic. Feature flags: toggle features without deploying.
🏛️ Context: Progressive delivery decouples deployment from release. Feature flags (LaunchDarkly, Unleash) enable deploying code that's hidden until activated. Combine with automated rollback based on error rate or latency thresholds.
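Canary routing and threshold-based rollback can be sketched as two small functions. This is an illustration under stated assumptions: the hashing scheme, the function names, and the 2% error-rate threshold are hypothetical, and production systems typically do this in a load balancer, service mesh, or flag service.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary, consistently per user."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def should_rollback(errors: int, requests: int, max_error_rate: float = 0.02) -> bool:
    """Roll back when the canary's observed error rate exceeds the threshold."""
    if requests == 0:
        return False  # no traffic yet, nothing to judge
    return errors / requests > max_error_rate


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    share = sum(route(u, 5) == "canary" for u in users) / len(users)
    print(f"canary share: {share:.1%}")
    print("rollback:", should_rollback(errors=12, requests=300))
```

Hashing the user id, rather than picking randomly per request, keeps each user on the same version for the whole rollout. The same bucketing works for percentage-based feature flags.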
Environment Management
Dev, Staging, Production, Ephemeral environments
Managing multiple environments for development, testing, and production. Ephemeral environments spin up per pull request and tear down after merge. Environment parity reduces "works in staging, breaks in prod" failures.
🏛️ Context: Ephemeral environments (via IaC + containers) dramatically improve review quality — reviewers can test changes in isolation. Invest in production-like staging; the closer to prod, the more confident you can be.
📐

Infrastructure as Code

Managing infrastructure through code, not consoles
Terraform
HashiCorp Terraform, OpenTofu, HCL
The most widely adopted IaC tool. Declarative and multi-cloud through providers (AWS, Azure, GCP, and hundreds more), although configurations themselves remain provider-specific. State file tracks what's deployed. Plan/apply workflow previews changes before executing them.
🏛️ Context: OpenTofu is the open-source fork after Terraform's licence change — evaluate based on your licensing requirements. Use remote state (S3 + DynamoDB, Terraform Cloud). Modularise for reusability. Enforce via CI: no manual changes.
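The plan step is essentially a diff between recorded state and desired configuration. The sketch below models that idea with plain dictionaries; it is not how Terraform is implemented, and the resource names and attributes are hypothetical.

```python
def plan(state: dict, desired: dict) -> list[tuple[str, str]]:
    """Compare recorded state with desired config and list the changes
    that an apply step would make, without performing them."""
    changes = []
    for name in sorted(desired.keys() | state.keys()):
        if name not in state:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif state[name] != desired[name]:
            changes.append(("update", name))
    return changes


if __name__ == "__main__":
    state = {"bucket": {"versioning": False}, "vm": {"size": "small"}}
    desired = {"bucket": {"versioning": True}, "db": {"engine": "postgres"}}
    for action, name in plan(state, desired):
        print(action, name)
```

Because the diff is computed against the state file, a stale or locally edited state produces a wrong plan, which is the practical reason for remote, locked state.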
Pulumi / CDK
Pulumi, AWS CDK, CDKTF, General-purpose language IaC
Define infrastructure using real programming languages (TypeScript, Python, Go) instead of domain-specific languages. Enables loops, conditionals, abstractions, and testing using familiar developer tools.
🏛️ Context: General-purpose languages for IaC improve developer experience and enable sophisticated abstractions. Trade-off: more expressive power also makes it easier to build infrastructure code that is hard to read and review. Enforce code review and testing standards just like application code.
Configuration Management
Ansible, Chef, Puppet, Salt
Tools that enforce desired state on servers — installing packages, configuring services, managing files. Ansible is agentless (SSH-based) and the most widely adopted. Less critical in container-native architectures where images are immutable.
🏛️ Context: Configuration management is essential for VM-based infrastructure but fading for container workloads, where changes ship as new images. Ansible remains useful for bootstrapping, network device config, and hybrid environments.
Policy as Code
OPA/Rego, Sentinel, Kyverno, Checkov
Defining and enforcing compliance, security, and governance rules as code. Policies evaluate infrastructure definitions before deployment (shift-left) and at runtime. Prevents misconfigurations from reaching production.
🏛️ Context: Policy as code is the enforcement arm of platform engineering. OPA is the general standard; Kyverno for Kubernetes-native. Checkov scans Terraform for misconfigurations in CI. Essential for guardrails without gatekeeping.
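The shift-left check can be pictured as a function that takes resource definitions and returns violations before anything is deployed. Tools such as OPA express this in Rego; the Python below is only a stand-in, with hypothetical resource fields and rules.

```python
def check_policies(resources: list[dict]) -> list[str]:
    """Return one violation message per failed rule; an empty list means pass."""
    violations = []
    for r in resources:
        if r.get("type") == "storage_bucket" and r.get("public", False):
            violations.append(f"{r['name']}: bucket must not be public")
        if "owner" not in r.get("tags", {}):
            violations.append(f"{r['name']}: missing required tag 'owner'")
    return violations


# Hypothetical resource definitions, e.g. parsed from an infrastructure plan.
SAMPLE = [
    {"name": "logs", "type": "storage_bucket", "public": True, "tags": {"owner": "platform"}},
    {"name": "web", "type": "vm", "tags": {}},
    {"name": "data", "type": "storage_bucket", "public": False, "tags": {"owner": "data"}},
]

if __name__ == "__main__":
    problems = check_policies(SAMPLE)
    for p in problems:
        print("DENY", p)
    # A CI gate would exit non-zero when any violation is present.
    raise SystemExit(1 if problems else 0)
```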
📊

Observability & Reliability

Seeing, understanding, and maintaining system health
Logging
ELK, Loki, CloudWatch Logs, Splunk, Fluentd
Capturing and centralising application and infrastructure event records. Structured logging (JSON) enables querying and correlation. Log aggregation platforms collect, index, and search across all services.
🏛️ Context: Structured JSON logging is non-negotiable for queryability. Centralise with Loki (cost-effective, Grafana-native) or ELK. Define retention tiers: hot (recent, fast search), warm (weeks), cold (archive for compliance).
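Structured logging needs little more than a formatter that emits one JSON object per line. The sketch below uses Python's standard `logging` module; the `request_id` and `service` fields are assumed names chosen for illustration, and timestamps are omitted for brevity.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("request_id", "service"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(name: str) -> logging.Logger:
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("checkout")
    log.info("payment accepted", extra={"request_id": "r-42", "service": "checkout"})
```

With a shared field such as `request_id`, a log platform can pull every line for one request across services with a single query.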
Metrics
Prometheus, Grafana, Datadog, CloudWatch, InfluxDB
Numerical measurements over time — CPU usage, request rate, error count, latency percentiles. Prometheus is the open-source standard (pull-based scraping). Grafana visualises metrics from multiple sources in unified dashboards.
🏛️ Context: The RED method (Rate, Errors, Duration) for services and USE method (Utilisation, Saturation, Errors) for resources provide baseline metrics. Define SLIs/SLOs that map to user experience, not just infrastructure health.
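The RED method reduces to three numbers computed over a window of requests. This is a minimal sketch, assuming samples arrive as hypothetical `(duration_ms, is_error)` pairs; a metrics system would normally compute these from counters and histograms instead of raw samples.

```python
def red_summary(requests: list[tuple[float, bool]], window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration from (duration_ms, is_error) samples."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(d for d, _ in requests)
    errors = sum(1 for _, is_error in requests if is_error)
    # Nearest-rank p95: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    # 20 requests in a 10-second window; every tenth request fails.
    samples = [(float(i), i % 10 == 0) for i in range(1, 21)]
    print(red_summary(samples, window_seconds=10.0))
```

Reporting a percentile instead of a mean matters because a few slow requests can hide behind a healthy-looking average.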
Distributed Tracing
Jaeger, Zipkin, Tempo, X-Ray, OpenTelemetry
Following a single request as it traverses multiple services. Each service adds a span; spans compose into traces. Identifies latency bottlenecks, failure points, and dependency issues across microservices.
🏛️ Context: OpenTelemetry is the converged standard for instrumentation (traces, metrics, logs). Instrument from day one — retrofitting distributed tracing is expensive. Sampling strategy balances cost vs. visibility.
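How spans compose into a trace can be shown with a small data model. This is not the OpenTelemetry API; the `Span` fields, service names, and timings are hypothetical and only illustrate the parent/child structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_leaf(spans: list[Span]) -> Span:
    """Return the longest span that has no children, a rough bottleneck hint."""
    parents = {s.parent_id for s in spans if s.parent_id is not None}
    leaves = [s for s in spans if s.span_id not in parents]
    return max(leaves, key=lambda s: s.duration_ms)


# One hypothetical request: gateway -> orders -> (db, cache).
TRACE = [
    Span("t1", "a", None, "gateway", 0, 120),
    Span("t1", "b", "a", "orders", 10, 110),
    Span("t1", "c", "b", "db", 20, 100),
    Span("t1", "d", "b", "cache", 12, 18),
]

if __name__ == "__main__":
    hot = slowest_leaf(TRACE)
    print(hot.service, hot.duration_ms, "ms")
```

The shared `trace_id` is what must be propagated between services; without it the spans cannot be stitched back into one request.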
Alerting & Incident Management
PagerDuty, Opsgenie, Alertmanager, incident.io
Automated notification when systems breach defined thresholds. Incident management formalises response: detection → triage → mitigation → resolution → post-mortem. On-call rotations ensure someone is always responsible.
🏛️ Context: Alert fatigue kills reliability culture. Every alert must be actionable — if you can't do something about it, it shouldn't page. Runbooks for every alert. Blameless post-mortems with concrete action items.
SRE & Reliability Engineering
SLO, SLI, Error budget, Chaos engineering, Toil reduction
Site Reliability Engineering applies software engineering to operations. SLOs define reliability targets. Error budgets balance innovation speed against stability. Chaos engineering proactively tests resilience (Chaos Monkey, Litmus).
🏛️ Context: SLOs aligned to user experience are the north star. Error budgets create a scientific framework for release velocity vs. stability. Chaos engineering in production validates that failover and recovery actually work.
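The error budget is simple arithmetic: the budget is the share of requests (or time) the SLO allows to fail. For example, a 99.9% availability SLO over 30 days allows about 43 minutes of downtime. A request-based sketch, with a hypothetical function name and made-up traffic figures:

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute allowed failures for an availability SLO and how much is left."""
    allowed = (1.0 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_consumed": failed_requests / allowed if allowed else float("inf"),
    }


if __name__ == "__main__":
    report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures:  {report['allowed_failures']:.0f}")
    print(f"remaining:         {report['remaining_failures']:.0f}")
    print(f"budget consumed:   {report['budget_consumed']:.0%}")
```

A team might agree, for instance, to pause risky releases once the consumed share approaches 100% and resume when the window rolls over.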

Key DevOps & Cloud Patterns

GitOps
Git as the single source of truth for infrastructure and application state. Agents (ArgoCD, Flux) continuously reconcile actual state with desired state in Git. Audit trail, rollback, and review built in.
Platform Engineering
Build an Internal Developer Platform (IDP) that provides self-service capabilities — environment provisioning, CI/CD, observability, secrets — with guardrails. Reduces cognitive load on product teams.
Immutable Infrastructure
Never modify running infrastructure. Instead, build new images/instances and replace the old ones. Eliminates configuration drift and "snowflake servers." Container images are immutable by design, so a change always means building and deploying a new image.
Shift-Left Security
Move security checks earlier in the pipeline: SAST in the IDE, SCA on commit, container scanning in CI, policy checks on infrastructure plans. Catch issues when they're cheapest to fix.
Cattle Not Pets
Treat servers as disposable (cattle) not irreplaceable (pets). Any instance can be destroyed and recreated from code. Auto-scaling groups, container orchestration, and IaC enable this mindset.
Well-Architected Framework
Cloud providers' best practice guides across pillars: operational excellence, security, reliability, performance efficiency, cost optimisation, and sustainability. Use as a review checklist for architecture decisions.

How Cloud & DevOps Connects

⬇️
Infrastructure (Layer 1) → Cloud: Cloud abstracts physical infrastructure into API-driven resources. IaC provisions compute, storage, and networking. Cloud regions map to physical data centres.
⬆️
Cloud → Application (Layer 7): CI/CD deploys applications. Containers package them. Observability monitors them. Cloud services (managed DB, queues, auth) become application building blocks.
🔄
Cloud ↔ Platform (Layer 2): Containers package runtime + application. IaC provisions platform components. CI/CD uses the developer toolchain. Cloud PaaS replaces self-managed platforms.
🛡️
Security (Layer 5) ↔ Cloud: Policy as code enforces security guardrails. Container scanning in CI. IAM governs cloud access. Secrets management protects credentials. Shared responsibility model defines who secures what.