🗄️ Data

Where information is stored, organised, processed, and governed. Data architecture decisions shape everything from application performance to regulatory compliance. Getting this layer right determines whether your organisation can trust and act on its information.

🗄️

Operational Databases

Transactional systems powering live applications

Relational Database (RDBMS)

PostgreSQL, MySQL, Oracle, SQL Server

Data stored in tables with defined schemas and enforced relationships. ACID transactions guarantee consistency. SQL provides a powerful, standardised query language. The default choice for structured, transactional data.

🏛️ Context: PostgreSQL is the default recommendation for new projects — open-source, full-featured, excellent extension ecosystem. Oracle/SQL Server persist in enterprises due to existing investment. Evaluate managed services (RDS, Cloud SQL) vs. self-hosted.
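As a minimal sketch of the ACID behaviour described above, the following uses Python's built-in sqlite3 module as a stand-in for a server RDBMS. The accounts table, the transfer function, and the balances are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL/MySQL; table and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit above was rolled back


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

An overdraft attempt violates the CHECK constraint on the second statement, so the credit already applied by the first statement is rolled back with it. The schema enforces the rule, not the application code.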

Document Database

MongoDB, Cosmos DB, Couchbase, Firestore

Stores data as flexible JSON/BSON documents — no fixed schema required. Each document can have different fields. Excellent for content management, catalogues, user profiles, and rapidly evolving data models.

🏛️ Context: Document DBs trade schema enforcement for flexibility. This is powerful for agile development but dangerous without discipline — schema validation and data contracts are essential. Avoid for highly relational data.
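To illustrate what "schema validation and data contracts" can mean in practice, here is a minimal pure-Python sketch. The contract format, the field names, and the validate_document function are assumptions made for this example; in a real system the same idea would more likely be expressed with the database's own validation feature or a JSON Schema library.

```python
# Hypothetical data contract for a "user profile" document: field -> (type, required).
USER_PROFILE_CONTRACT = {
    "user_id": (str, True),
    "email": (str, True),
    "display_name": (str, False),
    "tags": (list, False),
}


def validate_document(doc, contract):
    """Return a list of contract violations; an empty list means the document is valid."""
    errors = []
    for field, (expected_type, required) in contract.items():
        if field not in doc:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    # Reject unknown fields so the schema evolves deliberately rather than by accident.
    for field in doc:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors
```

Running the check at write time keeps the flexibility of optional fields while stopping malformed documents before they reach the store.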

Key-Value Store

Redis, DynamoDB, Memcached, etcd

Simplest data model: a key maps to a value. Extremely fast lookups. Redis adds data structures (lists, sets, sorted sets, streams). DynamoDB provides serverless key-value storage designed for single-digit millisecond latency at scale.

🏛️ Context: DynamoDB is the go-to for serverless architectures. Design around access patterns (single-table design). Redis serves as both cache and primary store for session data, leaderboards, and real-time features.
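The idea of designing around access patterns can be sketched with a plain dictionary keyed by a partition key and a sort key. This is not the DynamoDB API; the put_item and query helpers and the CUSTOMER#/ORDER# key layout are hypothetical, and only show how one table can serve several entity types.

```python
# A plain dict stands in for a key-value table; keys are (partition_key, sort_key).
# The CUSTOMER#/ORDER# key layout is a hypothetical example, not a DynamoDB API.
table = {}


def put_item(pk, sk, attributes):
    table[(pk, sk)] = attributes


def query(pk, sk_prefix=""):
    """Return items in one partition whose sort key starts with sk_prefix, in sort-key order."""
    keys = sorted(k for k in table if k[0] == pk and k[1].startswith(sk_prefix))
    return [table[k] for k in keys]


put_item("CUSTOMER#42", "PROFILE", {"name": "Ada"})
put_item("CUSTOMER#42", "ORDER#2024-01-05", {"total": 30})
put_item("CUSTOMER#42", "ORDER#2024-02-11", {"total": 45})
put_item("CUSTOMER#7", "PROFILE", {"name": "Grace"})
```

Here query("CUSTOMER#42", "ORDER#") answers "all orders for a customer, oldest first" from a single partition, which is the access pattern the key layout was designed for.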

Wide-Column Store

Apache Cassandra, ScyllaDB, HBase, Bigtable

Distributed databases optimised for massive write throughput and horizontal scaling across many nodes. Data organised by rows and column families. No single point of failure. Used for time-series, IoT, and event data.

🏛️ Context: Cassandra excels at write-heavy, globally distributed workloads. Data modelling is query-driven (denormalise aggressively). ScyllaDB offers Cassandra API compatibility and targets better performance through its C++ implementation (Cassandra runs on the JVM).

Graph Database

Neo4j, Amazon Neptune, TigerGraph, ArangoDB

Stores data as nodes (entities) and edges (relationships). Excels at traversing complex, highly connected data such as social networks, fraud detection, recommendation engines, knowledge graphs, and network topology.

🏛️ Context: Graph DBs solve problems that are expensive in relational systems (multi-hop joins). Use when relationships are as important as the data itself. Neo4j dominates; Neptune for AWS-native. Consider graph-on-relational (Apache AGE) for lighter needs.
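A multi-hop traversal, which would need one self-join per hop in a relational system, can be sketched as a breadth-first search over an adjacency list. The EDGES data and the within_hops function are illustrative assumptions, not any graph database's query language.

```python
from collections import deque

# Hypothetical "knows" edges; a graph database would store these as first-class relationships.
EDGES = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: hop distance} for nodes reachable within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]  # the start node is not its own neighbour
    return distances
```

Calling within_hops(EDGES, "alice", 2) finds friends and friends-of-friends in one pass, and raising the limit adds hops without changing the query shape.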

Vector Database

Pinecone, Weaviate, Milvus, pgvector, Qdrant

Stores high-dimensional vector embeddings and enables similarity search. Core infrastructure for AI/ML applications — semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and image search.

🏛️ Context: Vector DBs are essential for AI-powered applications. Evaluate purpose-built (Pinecone, Weaviate) vs. extensions on existing DBs (pgvector). Consider: index type (HNSW, IVF), dimensions, and update frequency.
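Similarity search itself is simple to sketch: score every stored vector against the query and keep the best matches. The example below is an exact, brute-force search over toy three-dimensional vectors; the document ids and values are made up. Index types such as HNSW and IVF exist to avoid this full scan by accepting approximate results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Exact (brute-force) nearest neighbours by cosine similarity, most similar first."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds or thousands of dimensions.
VECTORS = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.0],
    "doc_tax": [0.0, 0.1, 0.9],
}
```

The cost of this scan grows linearly with both the number of vectors and their dimensionality, which is why dimensions and update frequency matter when choosing an index.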

📊

Analytical Data Stores

Optimised for queries, reporting, and insights

Data Warehouse

Snowflake, BigQuery, Redshift, Synapse, Databricks SQL

Centralised analytical store using columnar storage and MPP (Massively Parallel Processing) for fast complex queries over large historical datasets. Schema-on-write. The backbone of enterprise BI and reporting.

🏛️ Context: Modern DWH separates storage from compute (Snowflake, BigQuery), enabling independent scaling. Design the semantic layer carefully, because it becomes the single source of truth. Evaluate cost models: Snowflake (credit-based compute) vs. BigQuery (on-demand pricing by bytes scanned, or reserved capacity).

Data Lake

S3, ADLS, GCS + Delta Lake / Iceberg / Hudi

Schema-on-read storage for raw data in any format. Modern data lakes use open table formats (Delta Lake, Apache Iceberg) to add ACID transactions, time travel, schema evolution, and partition pruning on top of object storage.

🏛️ Context: The lakehouse pattern (Delta/Iceberg on object storage) merges lake flexibility with warehouse-grade querying. Apache Iceberg is emerging as the open standard. Enforce data cataloguing and quality checks to prevent a "data swamp".

Data Lakehouse

Databricks, Snowflake (Iceberg), Dremio

Combines data lake economics (cheap object storage, any format) with data warehouse capabilities (ACID transactions, SQL queries, governance). Eliminates the need to copy data between lake and warehouse.

🏛️ Context: The lakehouse is closing the lake/warehouse divide. Databricks (Delta Lake) and Snowflake (Iceberg support) are the two main camps. Evaluate vendor lock-in against open table formats.

OLAP / Analytical Engine

ClickHouse, Apache Druid, StarRocks, Pinot

Real-time analytical databases optimised for sub-second queries on billions of rows. Used for real-time dashboards, operational analytics, and user-facing analytics features. Column-oriented with advanced indexing.

🏛️ Context: ClickHouse is a popular open-source option in this category. Use a real-time OLAP engine when traditional DWH query latency (seconds or more) isn't sufficient, particularly for customer-facing analytics. These engines ingest streaming data and serve queries concurrently.

Search Engine

Elasticsearch, OpenSearch, Solr, Algolia, Typesense

Inverted-index databases optimised for full-text search, filtering, and aggregation. Power search bars, log analytics (ELK stack), and faceted navigation. Increasingly combined with vector search for semantic capabilities.

🏛️ Context: Elasticsearch/OpenSearch serve dual duty: application search and log analytics (observability). Consider managed services to avoid operational burden. Algolia/Typesense for search-as-a-service with simpler ops.
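The inverted index behind full-text search can be sketched in a few lines: map each term to the documents that contain it, then intersect those sets at query time. This is a deliberately simplified illustration with whitespace tokenisation and no ranking; the DOCS data and function names are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND), sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


DOCS = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red rain jacket",
}
INDEX = build_index(DOCS)
```

A query never scans document text; it only looks up and intersects posting sets, which is what keeps search fast as the corpus grows.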

💾

Caching & Performance

Accelerating data access at every level

Application Cache

Redis, Memcached, Hazelcast

In-memory data stores that serve frequently accessed data without hitting the primary database, typically with sub-millisecond latency. Common patterns: cache-aside, read-through, write-through, and write-behind.

🏛️ Context: Cache-aside is the safest default (app checks cache, falls back to DB, populates cache). Define TTLs per data type. Cache invalidation is genuinely hard — design for eventual consistency and instrument cache hit ratios.
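The cache-aside flow described above (check cache, fall back to DB, populate cache) can be sketched as follows. The CacheAside class is a hypothetical in-process illustration rather than a Redis client; the loader argument stands in for whatever function reads the primary database.

```python
import time


class CacheAside:
    """Cache-aside: check the cache, fall back to the loader on a miss, then populate."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # hypothetical function that reads the primary database
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the database so the next read reloads fresh data."""
        self._entries.pop(key, None)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Between a database write and the matching invalidate call (or TTL expiry), readers can still see the old value. That window is the eventual consistency the design has to tolerate, and hit_ratio is the number to instrument.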

CDN Cache

CloudFront, Cloudflare, Fastly, Akamai

Edge caching that serves static assets (images, CSS, JS) and API responses from locations geographically close to users. Reduces origin server load and dramatically improves perceived performance.

🏛️ Context: CDN caching decisions trade freshness against performance. Use immutable filenames with content hashing for static assets so they can be cached for a very long time (for example, Cache-Control: max-age=31536000, immutable). For API responses, evaluate stale-while-revalidate patterns.
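Content-hashed filenames can be sketched with the standard library. The hashed_filename and cache_headers helpers below are hypothetical; build tools normally generate such names automatically, and the eight-character hash length is an arbitrary choice for this example.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_filename(path, content):
    """Embed a short content hash in the filename, e.g. app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def cache_headers(is_hashed_asset):
    """Long-lived caching for hashed assets; revalidation for everything else."""
    if is_hashed_asset:
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    return {"Cache-Control": "no-cache"}
```

Because any change to the content produces a new filename, a cached copy can never be stale: the old URL simply stops being referenced. The HTML page that references the assets keeps a short or revalidating policy so new filenames are picked up promptly.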

Database Query Cache

Materialised views, Query result cache, ReadySet

Caching at the database level — materialised views pre-compute complex queries, query result caches store recent results. ReadySet acts as a transparent cache layer that auto-maintains consistency with the source DB.

🏛️ Context: Materialised views are the simplest performance win for expensive, frequently run analytical queries. Refresh strategy (on-demand, periodic, incremental) depends on freshness requirements.

📈

Analytics & BI

Turning data into decisions

BI Platforms

Tableau, Power BI, Looker, Metabase, Superset

Visual analytics tools that let business users explore data through dashboards, charts, and interactive reports. Self-service BI empowers non-technical users to answer their own questions without writing SQL.

🏛️ Context: Standardise on one primary BI platform to avoid fragmentation. Power BI for Microsoft-heavy shops; Looker for governed metrics; Tableau for advanced visualisation. Open-source (Metabase, Superset) for embedded analytics.

Semantic / Metrics Layer

dbt metrics, Looker modelling, Cube.dev

A single, governed definition of business metrics that sits between raw data and consumers. Ensures "revenue" means the same thing in every dashboard, report, and query across the organisation.

🏛️ Context: The metrics layer prevents the "multiple versions of the truth" problem. dbt's semantic layer and Cube.dev are leading approaches. Invest in this early — retrofitting metric governance is painful.

Data Science / ML Platform

Jupyter, MLflow, SageMaker, Vertex AI, Databricks ML

Platforms enabling data scientists to develop, train, deploy, and monitor machine learning models. Includes experiment tracking, feature stores, model registries, and serving infrastructure.

🏛️ Context: MLOps is the DevOps of data science. Standardise on a platform that covers the full lifecycle: experimentation → training → deployment → monitoring. Feature stores prevent duplicated feature engineering across teams.

Reverse ETL

Census, Hightouch, Polytomic

Syncing data from the warehouse back into operational tools (CRM, marketing platforms, support systems). Activates analytical data by putting it where teams actually work — closing the data feedback loop.

🏛️ Context: Reverse ETL completes the data cycle: operational → analytical → back to operational. Ensures business teams work with enriched, unified data. Sync frequency and conflict resolution are key design decisions.

🏛️

Data Governance

Trust, quality, ownership, and compliance

Data Catalogue

DataHub, Amundsen, Atlan, Alation, Collibra

A searchable inventory of all data assets — tables, columns, dashboards, pipelines — with metadata, ownership, descriptions, and lineage. The "Google for your data" that enables discoverability and self-service.

🏛️ Context: A data catalogue is foundational to data mesh and self-service analytics. DataHub (open-source, LinkedIn) is the leading open option. Enforce ownership and documentation standards — a catalogue is only useful if populated.

Data Quality

Great Expectations, Monte Carlo, Soda, dbt tests

Automated testing and monitoring of data accuracy, completeness, freshness, and consistency. Data quality checks run in pipelines (preventive) and via continuous monitoring (detective). Alerts when data drifts or breaks.

🏛️ Context: Data quality is an architecture concern, not an afterthought. Embed checks in pipelines (dbt tests, Great Expectations). Monte Carlo provides anomaly detection across the entire data estate. Define data SLAs per domain.
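The kinds of checks that tools such as dbt tests or Great Expectations run can be sketched in plain Python. The functions below are hypothetical stand-ins, not those tools' APIs, and the thresholds and sample rows are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_completeness(rows, column, max_null_ratio):
    """Pass if the share of missing values in `column` does not exceed max_null_ratio."""
    if not rows:
        return False  # an empty batch is treated as a failure in this sketch
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) <= max_null_ratio


def check_uniqueness(rows, column):
    """Pass if no non-null value in `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def check_freshness(rows, column, max_age, now):
    """Pass if the newest timestamp in `column` is no older than max_age."""
    if not rows:
        return False
    newest = max(row[column] for row in rows)
    return now - newest <= max_age


# Hypothetical batch; a fixed "now" keeps the example deterministic.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
ROWS = [
    {"id": 1, "email": "a@example.com", "loaded_at": NOW - timedelta(hours=1)},
    {"id": 2, "email": None, "loaded_at": NOW - timedelta(hours=5)},
]
```

In a pipeline, a failed check would stop the load (preventive); run on a schedule against production tables, the same checks become monitoring (detective). The thresholds are where a per-domain data SLA becomes concrete.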

Data Lineage

End-to-end lineage, Column-level lineage, Impact analysis

Tracing data from source to consumption — which systems produced it, what transformations were applied, and who consumes it. Enables impact analysis (what breaks if this changes?) and regulatory compliance.

🏛️ Context: Column-level lineage is the gold standard. dbt provides transformation lineage automatically. Tools like DataHub and Atlan aggregate lineage across ingestion, transformation, and consumption. Essential for GDPR compliance.

Data Mesh

Domain ownership, Data products, Federated governance

An organisational paradigm where domain teams own their data as a product — producing, documenting, and guaranteeing quality. A self-serve data platform provides infrastructure. Federated governance ensures interoperability.

🏛️ Context: Data mesh is primarily an organisational change, not a technology purchase. It requires a mature data platform, clear domain boundaries, and a cultural shift to data ownership. Start with one pilot domain, not a big-bang rollout.

Key Data Architecture Patterns

Polyglot Persistence

Use different database types for different workloads — relational for transactions, document for content, graph for relationships, time-series for metrics. No single database is best at everything.

CQRS (Command Query Responsibility Segregation)

Separate the write model (optimised for transactions) from the read model (optimised for queries). Each can use different storage, schemas, and scaling strategies. Powerful, but adds complexity: the read model must be kept in sync with writes and is often only eventually consistent.
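A minimal sketch of the split, assuming a hypothetical orders domain: the write side records validated order lines, while the read side keeps a denormalised total per customer that is cheap to query. The class names and the synchronous projection are simplifications for illustration.

```python
class OrderWriteModel:
    """Command side: validates and records changes, one order line per entry."""

    def __init__(self):
        self.order_lines = []  # (order_id, customer_id, amount)

    def add_order_line(self, order_id, customer_id, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        line = (order_id, customer_id, amount)
        self.order_lines.append(line)
        return line


class CustomerSpendReadModel:
    """Query side: a denormalised total per customer, shaped for one specific read."""

    def __init__(self):
        self.total_by_customer = {}

    def apply(self, line):
        _, customer_id, amount = line
        self.total_by_customer[customer_id] = self.total_by_customer.get(customer_id, 0) + amount

    def total_spend(self, customer_id):
        return self.total_by_customer.get(customer_id, 0)
```

Here the projection is applied in-process with read.apply(write.add_order_line(...)). In a distributed system the update usually travels through a queue or change stream, so the read model lags the write model slightly.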

Event Sourcing

Store every state change as an immutable event. Current state is derived by replaying events. This provides a complete audit trail, temporal queries, and the ability to rebuild views. Requires careful event schema evolution.
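Deriving state by replaying events can be sketched as a fold over an append-only log. The account example, the event kinds, and the replay function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event types)
    amount: int


def apply(balance, event):
    """Return the new state after one event; unknown kinds are rejected loudly."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, upto=None):
    """Derive state by folding over the log; `upto` limits the replay to the first N events."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


LOG = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
```

replay(LOG) gives the current balance, and replay(LOG, upto=2) answers "what was the balance after the second event?" without any extra history table. The unknown-kind error shows why schema evolution needs care: old events must remain replayable forever.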

Data Vault

Warehouse modelling method using Hubs (business keys), Links (relationships), and Satellites (descriptive data). Designed for auditability, flexibility, and parallel loading. Popular in regulated industries.

Lambda / Kappa Architecture

Lambda: parallel batch + stream processing layers merged at query time. Kappa: stream-only, treating everything as events. Kappa is simpler and increasingly preferred with modern streaming platforms.

Medallion Architecture

Bronze (raw) → Silver (cleansed, conformed) → Gold (business-level aggregates). Progressive refinement of data quality through layers. The standard pattern for lakehouse implementations.
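The progressive refinement can be sketched as two pure functions over a list of records. The BRONZE sample data, the cleansing rules, and the revenue-per-country aggregate are assumptions chosen for illustration; a lakehouse would run equivalent steps as table-to-table jobs.

```python
# Hypothetical raw order records as they might land in the bronze layer.
BRONZE = [
    {"order_id": "1", "country": " gb ", "amount": "10.50"},
    {"order_id": "2", "country": "GB", "amount": "4.50"},
    {"order_id": "2", "country": "GB", "amount": "4.50"},  # duplicate delivery
    {"order_id": "3", "country": "de", "amount": "n/a"},   # unparseable amount
    {"order_id": "4", "country": "DE", "amount": "20.00"},
]


def to_silver(bronze):
    """Cleanse and conform: cast types, normalise codes, drop bad rows and duplicates."""
    seen, silver = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine the row rather than drop it silently
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "country": row["country"].strip().upper(), "amount": amount}
        )
    return silver


def to_gold(silver):
    """Business-level aggregate: revenue per country."""
    revenue = {}
    for row in silver:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount"]
    return revenue
```

Bronze is never modified, so silver and gold can always be rebuilt from it when a cleansing rule changes.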

How Data Connects

⬆️

Data → Application (Layer 7): Applications read and write to databases. Data model design is driven by application access patterns. Caching accelerates data delivery to the presentation layer.

🔄

Data ↔ Integration (Layer 6): ETL/CDC pipelines move data between stores. APIs expose data to consumers. Event streaming connects operational and analytical systems in near real-time.

⬇️

Infrastructure (Layer 1) → Data: Databases run on compute. Storage IOPS and throughput directly determine database performance. Network latency affects replication and distributed query times.

🛡️

Security (Layer 5) ↔ Data: Encryption at rest protects stored data. Access controls govern who sees what. Compliance frameworks dictate data residency, retention, and erasure requirements.