🏢 Infrastructure

The physical and virtual foundation everything runs on. Compute, storage, networking, and the facilities that house them. Architecture decisions here determine resilience, performance ceilings, and long-term cost structure.

💻

Compute

Processing power — from bare metal to serverless
Bare Metal Server
Physical server, Dedicated host
A dedicated physical machine with no virtualisation layer. Provides maximum performance and hardware-level control. Used for latency-sensitive workloads, licensing constraints, or workloads that need the full hardware stack.
🏛️ Context: Reserved for workloads where hypervisor overhead is unacceptable (high-frequency trading, GPU compute, specific database engines). Procurement lead times of weeks vs. minutes for cloud VMs.
Virtual Machine (VM)
EC2, Azure VM, GCE, vSphere
A virtualised computer running on a hypervisor that shares physical hardware among many guests. Each VM has its own OS, CPU allocation, memory, and storage. The workhorse of enterprise IT for two decades.
🏛️ Context: VMs offer strong isolation but higher overhead than containers. Right-size continuously: enterprise VMs are commonly over-provisioned, with figures of 40-60% often cited. Consider reserved instances for stable workloads.
Hypervisor
VMware ESXi, KVM, Hyper-V, Xen
Software that creates and manages virtual machines by abstracting the physical hardware. Type 1 (bare-metal) hypervisors run directly on hardware. Type 2 (hosted) run on top of an existing OS.
🏛️ Context: VMware dominance is being challenged by cloud-native approaches. Evaluate migration paths — VMware licensing changes (Broadcom acquisition) may alter cost models significantly.
Serverless Compute
Lambda, Cloud Functions, Azure Functions
Event-driven compute where the cloud provider manages all infrastructure. Code runs in response to triggers (HTTP requests, queue messages, schedules). You pay only for execution time, not idle capacity.
🏛️ Context: Excellent for event-driven, spiky workloads. Cold start latency and execution time limits are real constraints. Vendor lock-in is high — abstract with frameworks like Serverless Framework or SST.
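To make the pay-per-execution model concrete, here is a rough cost sketch. The per-request charge and every price passed in are assumptions for illustration, not any provider's actual rates, and the function name is hypothetical.

```python
def monthly_cost(invocations, avg_duration_ms, memory_gb,
                 price_per_gb_second, price_per_million_requests):
    """Estimate monthly cost for a pay-per-execution function.

    All prices are caller-supplied placeholders, not real provider rates.
    """
    # Compute is billed on memory allocated multiplied by time spent running.
    gb_seconds = invocations * (avg_duration_ms / 1000) * memory_gb
    compute = gb_seconds * price_per_gb_second
    # Assumed flat charge per million invocations.
    requests = (invocations / 1_000_000) * price_per_million_requests
    return compute + requests
```

With zero invocations the result is zero, which is the contrast with an always-on VM that bills for idle capacity.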
GPU / Accelerated Compute
NVIDIA A100/H100, TPU, FPGA
Specialised hardware for parallel processing. GPUs accelerate AI/ML training, rendering, and scientific simulation. TPUs are Google's custom AI chips. FPGAs offer programmable hardware acceleration.
🏛️ Context: GPU scarcity drives architecture decisions around AI workloads. Consider spot/preemptible instances for training (with regular checkpointing, since they can be reclaimed) and reserved capacity for inference. Evaluate cloud GPU vs. on-prem based on utilisation patterns.
🌐

Networking

Connectivity, routing, and traffic management
LAN / WAN
Local/Wide Area Network, MPLS
A LAN connects devices in a single location (office, data centre). A WAN connects LANs across geographic distances. Traditional WANs use MPLS circuits; modern approaches favour SD-WAN with internet underlay.
🏛️ Context: MPLS is expensive but predictable. SD-WAN reduces cost with intelligent routing over broadband. Evaluate hybrid: MPLS for critical traffic, SD-WAN for general connectivity.
DNS
Domain Name System, Route 53, Cloudflare DNS
The internet's phone book: translates human-readable domain names (example.com) to IP addresses. Managed DNS services also support load balancing, failover, and geo-routing through routing policies and health-checked records.
🏛️ Context: DNS is a critical dependency and attack surface. Use multiple DNS providers for redundancy. Leverage DNS-based traffic management for global load balancing and disaster recovery.
VPN / Private Connectivity
IPSec, WireGuard, Direct Connect, ExpressRoute
Encrypted tunnels over public internet (VPN) or dedicated private links to cloud providers (Direct Connect / ExpressRoute). Private connectivity offers consistent latency and bandwidth guarantees.
🏛️ Context: Direct Connect/ExpressRoute is justified when data transfer volumes make VPN bandwidth insufficient or when latency consistency matters. Plan for redundant circuits.
Load Balancer
L4/L7, ALB, NLB, F5, HAProxy
Distributes traffic across multiple backends. L4 (transport) routes by IP/port. L7 (application) inspects HTTP for content-based routing, URL path routing, and header manipulation.
🏛️ Context: Multi-AZ load balancing is the minimum for production. Use health checks with appropriate thresholds. L7 enables canary deployments and A/B testing at the infrastructure level.
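A minimal sketch of how health-check thresholds interact with traffic distribution, assuming simple round-robin selection. The class and its threshold parameter are hypothetical; real load balancers also use separate healthy thresholds, timeouts, and connection draining.

```python
class RoundRobinBalancer:
    def __init__(self, backends, unhealthy_threshold=3):
        self.backends = list(backends)
        self.unhealthy_threshold = unhealthy_threshold
        self.failures = {b: 0 for b in self.backends}
        self._next = 0

    def report(self, backend, ok):
        # A passing check resets the counter; a failing one increments it.
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy(self):
        # A backend is ejected only after N consecutive failed checks.
        return [b for b in self.backends
                if self.failures[b] < self.unhealthy_threshold]

    def pick(self):
        pool = self.healthy()
        if not pool:
            raise RuntimeError("no healthy backends")
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend
```

A threshold of 1 ejects a backend on a single dropped probe; a higher value tolerates transient failures at the cost of slower detection.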
CDN / Edge Network
CloudFront, Cloudflare, Akamai, Fastly
Globally distributed cache that serves content from edge locations close to users. Reduces latency, offloads origin servers, and provides DDoS protection. Modern CDNs also run compute at the edge.
🏛️ Context: CDN is a performance and resilience multiplier. Configure cache invalidation carefully. Edge compute (Cloudflare Workers, Lambda@Edge) enables logic without round-tripping to origin.
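The caching and invalidation behaviour can be sketched as a TTL cache in front of an origin fetch. This is a toy model with hypothetical names (EdgeCache, fetch_origin); real CDNs also honour Cache-Control headers, vary on request attributes, and propagate invalidations across many edge locations.

```python
import time

class EdgeCache:
    def __init__(self, fetch_origin, ttl_seconds, clock=time.monotonic):
        self.fetch_origin = fetch_origin  # callable: path -> body
        self.ttl = ttl_seconds
        self.clock = clock                # injectable so tests can control time
        self.store = {}                   # path -> (expires_at, body)

    def get(self, path):
        entry = self.store.get(path)
        now = self.clock()
        if entry and entry[0] > now:
            return entry[1], "HIT"        # served from the edge, origin untouched
        body = self.fetch_origin(path)    # expired or absent: go back to origin
        self.store[path] = (now + self.ttl, body)
        return body, "MISS"

    def invalidate(self, path):
        # Explicit purge: the next request is forced back to origin.
        self.store.pop(path, None)
```

A long TTL offloads the origin but serves stale content until it expires or is purged, which is why invalidation needs care.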
Firewall / Network Security
NGFW, Security Groups, NACLs, NSGs
Traffic filtering at network boundaries. Traditional firewalls use rules based on IP/port. Next-gen firewalls (NGFW) add application awareness. Cloud equivalents: Security Groups and NACLs (AWS), NSGs (Azure).
🏛️ Context: Microsegmentation is the modern approach — security groups per workload, not per network zone. Default-deny with explicit allow rules. Log all dropped traffic for forensic value.
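Default-deny with explicit allow rules, plus logging of dropped traffic, might look like the sketch below. The rule list and function name are hypothetical, and real security groups also match on protocol, direction, and connection state.

```python
from ipaddress import ip_address, ip_network

ALLOW_RULES = [
    # (source CIDR, destination port): hypothetical example rules
    ("10.0.1.0/24", 5432),   # app subnet may reach the database port
    ("0.0.0.0/0", 443),      # anyone may reach HTTPS
]

def evaluate(src_ip, dst_port, rules=ALLOW_RULES, dropped_log=None):
    src = ip_address(src_ip)
    for cidr, port in rules:
        if src in ip_network(cidr) and dst_port == port:
            return "ALLOW"
    # No rule matched: deny by default and record the drop for forensics.
    if dropped_log is not None:
        dropped_log.append((src_ip, dst_port))
    return "DENY"
```

Because nothing is allowed unless a rule says so, forgetting a rule fails closed rather than open.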
💿

Storage

Where data physically resides
Block Storage
SAN, EBS, Azure Managed Disk, iSCSI
Raw storage volumes attached to compute instances. Data is stored in fixed-size blocks. Offers the lowest latency of the storage types here and is typically required for databases and applications needing direct disk access.
🏛️ Context: Understand IOPS and throughput tiers: under-provisioned block storage is a frequent cause of 'unexplained slowness'. Use provisioned IOPS for production databases.
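IOPS and throughput are linked by I/O size (throughput = IOPS × I/O size), so a volume can hit either limit first. The sketch below shows the arithmetic; the limits passed in are placeholders, not any provider's published tiers.

```python
def throughput_mib_s(iops, io_size_kib):
    # Throughput is operations per second times the size of each operation.
    return iops * io_size_kib / 1024

def bottleneck(required_iops, io_size_kib, iops_limit, throughput_limit_mib_s):
    """Return which volume limit the workload exceeds, or None if it fits."""
    if required_iops > iops_limit:
        return "iops"
    if throughput_mib_s(required_iops, io_size_kib) > throughput_limit_mib_s:
        return "throughput"
    return None
```

For example, 3000 IOPS at 16 KiB needs about 47 MiB/s, while the same 3000 IOPS at 256 KiB needs 750 MiB/s, so large sequential I/O tends to exhaust throughput long before IOPS.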
Object Storage
S3, Azure Blob, GCS, MinIO
Flat-namespace storage where each object has a key, the data itself, and metadata. Scales virtually without limit. Ideal for unstructured data: files, images, backups, logs, data lake content.
🏛️ Context: Object storage is the foundation of modern data architectures. Implement lifecycle policies (hot → warm → cold → archive) to optimise cost. Beware egress fees across regions.
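A lifecycle policy is essentially a mapping from object age to storage tier. A minimal sketch, assuming age since last modification is the only criterion; the day thresholds are illustrative, not recommendations.

```python
from datetime import date

# (minimum age in days, tier), checked oldest first.
# Thresholds are illustrative assumptions.
TIERS = [(365, "archive"), (90, "cold"), (30, "warm"), (0, "hot")]

def tier_for(last_modified, today):
    age_days = (today - last_modified).days
    for min_age, tier in TIERS:
        if age_days >= min_age:
            return tier
    return "hot"  # future-dated objects fall through to the hottest tier
```

In practice the provider evaluates rules like these for you; colder tiers usually add retrieval cost or delay, so the thresholds should follow real access patterns.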
File Storage
NAS, NFS, EFS, Azure Files, SMB
Shared file systems accessible by multiple compute instances simultaneously via network protocols (NFS, SMB). Used for shared application data, home directories, and legacy application compatibility.
🏛️ Context: Cloud-managed file storage (EFS, FSx) simplifies operations but at higher per-GB cost. Evaluate if the workload truly needs shared filesystem semantics or if object storage suffices.
Backup & Disaster Recovery
Snapshots, Cross-region replication, RPO/RTO
Strategies for data protection and business continuity. RPO (Recovery Point Objective) defines the maximum acceptable data loss, measured as time since the last recoverable copy. RTO (Recovery Time Objective) defines the maximum acceptable downtime. Both are tiered by criticality.
🏛️ Context: Define RPO/RTO per service tier, not globally. Test recovery procedures regularly — untested backups are just storage costs. Consider active-active for Tier 1 systems.
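Per-tier targets can be checked mechanically against what the backup schedule and the last recovery test actually deliver. In this sketch the tier names and minute values are hypothetical, and worst-case data loss is approximated as the interval between recovery points.

```python
from dataclasses import dataclass

@dataclass
class TierTarget:
    rpo_minutes: int
    rto_minutes: int

# Hypothetical targets per service tier.
TARGETS = {"tier1": TierTarget(5, 60), "tier2": TierTarget(240, 480)}

def meets_targets(tier, backup_interval_minutes, measured_restore_minutes):
    target = TARGETS[tier]
    return {
        # Worst case, everything since the last recovery point is lost.
        "rpo_ok": backup_interval_minutes <= target.rpo_minutes,
        # Use the time from an actual restore test, not an estimate.
        "rto_ok": measured_restore_minutes <= target.rto_minutes,
    }
```

A nightly backup (1440 minutes) fails both example tiers on RPO, which is the kind of gap that only shows up when targets are written down per tier.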
🏢

Facilities & Physical

Data centres, power, cooling, and physical security
On-Premises Data Centre
On-prem, Private DC
Company-owned facilities with full control over hardware, networking, and physical security. Maximum control and data sovereignty but highest CapEx and operational burden.
🏛️ Context: Justifiable for regulatory requirements (data sovereignty, air-gapped environments), ultra-low-latency needs, or when existing investment is not yet depreciated.
Colocation
Colo, Equinix, Digital Realty
Renting physical space, power, and cooling in a third-party data centre. You own and manage the hardware; the provider manages the facility. Offers carrier-neutral network connectivity.
🏛️ Context: Colo provides data centre-grade facilities without building your own. Key for interconnecting with cloud providers (on-ramp) and peering. Evaluate contract terms and power density.
Cloud Regions & Availability Zones
AWS Regions, Azure Regions, GCP Regions, AZs
Cloud providers organise infrastructure into Regions (geographic areas) containing multiple Availability Zones (each one or more isolated data centres with independent power, cooling, and networking). AZs provide fault isolation.
🏛️ Context: Multi-AZ is the minimum for production resilience. Multi-region adds disaster recovery but significantly increases complexity and cost. Choose regions based on data residency, latency, and service availability.
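The resilience gain from multiple zones can be estimated with basic probability. This sketch assumes zone failures are independent and that any single surviving zone can carry the full load; both are simplifications, and the per-zone availability figure is an input rather than a provider guarantee.

```python
def combined_availability(zone_availability, zones):
    # The service is down only if every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones

def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60
```

Under these assumptions, two zones at 99% each give roughly 99.99% combined. Correlated failures such as a regional control-plane outage break the independence assumption, which is where multi-region comes in.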
Edge Locations
Edge compute, IoT gateways, MEC
Compute and storage placed close to where data is generated or consumed: factory floors, retail stores, cell towers. Reduces latency and the bandwidth needed to reach the central cloud, enabling real-time processing.
🏛️ Context: Edge is critical for IoT, real-time analytics, and latency-sensitive applications. Design for intermittent connectivity, local autonomy, and eventual sync with the central platform.
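Designing for intermittent connectivity usually means some form of store-and-forward. A minimal sketch, assuming a hypothetical send callable that raises ConnectionError while the uplink is down; a real edge node would persist the buffer to disk and handle duplicates on the receiving side.

```python
from collections import deque

class StoreAndForward:
    def __init__(self, send, max_buffered=1000):
        self.send = send  # callable that raises ConnectionError when offline
        # Bounded buffer: when full, the oldest readings are dropped.
        self.buffer = deque(maxlen=max_buffered)

    def record(self, reading):
        # Always accept locally first, then try to sync.
        self.buffer.append(reading)
        self.flush()

    def flush(self):
        sent = 0
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                break                 # still offline: keep data, retry later
            self.buffer.popleft()     # remove only after a successful send
            sent += 1
        return sent
```

Readings are delivered in order once the link returns, and the node keeps working locally in the meantime.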

How Infrastructure Connects

⬆️
Compute → Platform (Layer 2): Servers and VMs host operating systems and runtimes. Container orchestrators schedule workloads across compute pools.
🔄
Network ↔ Everything: Networking is the connective tissue. Every layer depends on reliable, secure, low-latency connectivity between components.
💾
Storage → Data (Layer 3): Block storage backs databases. Object storage backs data lakes. File storage serves shared application data. Storage tiers map to data temperature.
☁️
Facilities → Cloud (Layer 4): Cloud abstracts facilities — you stop thinking about racks and power. Hybrid architectures blend on-prem facilities with cloud regions.