Skip to main content

Modern Data Architecture Patterns That Actually Scale

The data architecture decisions that matter in 2026. From warehouse selection to pipeline design, patterns from building platforms that handle real enterprise workloads.

24 min read

The modern data stack has matured enough that the landscape is no longer dominated by one obviously correct answer. Every architectural decision now comes with real trade-offs: cost at scale, operational complexity, team capability, and the risk of building something your organisation cannot maintain two years from now. This covers the patterns that hold across platforms in fintech, telecoms, and gaming environments where data volume and access pattern requirements were non-negotiable. The data architecture topic hub provides context for how these patterns connect. For guidance on picking the underlying storage engine, start with database selection for enterprise systems.

The Architecture Decision That Matters Most#

Before picking tools, pick your primary access pattern. This single decision shapes almost everything else.

Warehouse-first is for organisations where the dominant workload is analytical queries on historical data: revenue analysis, cohort analysis, compliance reporting, and finance. Snowflake, BigQuery, and Redshift are designed for this. They give you ANSI SQL, automatic scaling, and governance tooling without operational overhead. The trade-off is cost at scale and limited support for anything that is not a SQL query.

Streaming-first is for platforms where decisions must happen within seconds of an event: fraud detection, real-time personalisation, live operational dashboards. Kafka with Flink or Spark Structured Streaming is the standard stack. The trade-off is significant engineering investment and operational complexity that a warehouse-first team should not take on unless the business requirement is genuine.

Hybrid is what most mature platforms end up at, but the key word is "clear boundaries." Batch pipelines feeding the warehouse for analytical queries; a separate streaming pipeline feeding an operational store for latency-sensitive use cases. The failure mode is treating hybrid as a single unified architecture and creating systems where everything touches everything.

Most failures come not from picking the wrong tool but from trying to use one tool for all three access patterns. A warehouse is not a streaming platform. A streaming platform is not a warehouse. Pick your primary pattern, build for it, and add the secondary layer only when you have a concrete business requirement that the primary layer cannot meet.

Warehouse Architecture in 2026#

Cloud warehouses are mature. Snowflake, BigQuery, and Redshift have converged on the same core capabilities: separation of storage and compute, ANSI SQL, semi-structured data support, and pay-per-query pricing. The architectural discipline that separates good warehouse deployments from expensive ones is not the warehouse itself but how you use it.

Storage and Compute Separation#

The practical benefit of storage-compute separation is workload isolation. Spinning up a dedicated compute cluster per workload type is now standard practice on any warehouse that supports it. A reporting cluster for your BI tool, a transformation cluster for your dbt runs, an ad-hoc cluster for analysts, each with its own sizing and auto-suspend policy. Without this separation, a data scientist running an exploratory query on a 500GB table at 2pm can spike costs for everyone else and slow down the dashboard refresh your CEO is watching.

Concrete configuration worth standardising: auto-suspend set to 5 minutes for development clusters, 1-2 minutes for ad-hoc clusters, and disabled for scheduled jobs that run on predictable cadences. Query queuing configured per-cluster rather than letting everything pile onto one warehouse.

Data Modelling Discipline: The Semantic Layer#

The pattern most underinvested in is the semantic layer: a single place where business metrics are defined, versioned, and resolved consistently. When finance calculates "monthly active users" differently from product, and product calculates it differently from marketing, the problem is not the warehouse. It is the absence of a semantic layer.

The main options in 2026 are dbt Semantic Layer (MetricFlow), Cube, and LookML. The dbt Semantic Layer is the most integrated choice if you are already running dbt for transformations. Cube gives you more flexibility if you need a standalone semantic layer that serves multiple downstream BI tools. LookML is locked to Looker.

A metric definition in dbt Semantic Layer looks like this:

yaml
metrics:
  - name: monthly_active_users
    label: Monthly Active Users
    description: Distinct users with at least one session in the trailing 30 days
    type: simple
    type_params:
      measure:
        name: distinct_active_users
    meta:
      owner: product
      slack_channel: "#data-product"

The rule: if a metric has a name and a business owner, it belongs in the semantic layer. If two teams are computing it independently in SQL, you already have a problem.

Cost Observability: The $60K/Month Problem#

Organisations commonly hit $60,000+ monthly Snowflake bills after two to three years, driven by compute bloat, poor clustering, and uncontrolled ad-hoc queries on large unpartitioned tables. This is a governance problem more than a technical one, but technical controls help.

The foundation is per-query cost attribution. In BigQuery, the equivalent is INFORMATION_SCHEMA.JOBS_BY_PROJECT filtered on job_type = 'QUERY' and sorted by total_bytes_billed. Run this weekly, share the top 20 most expensive queries with their owners, and you will have more leverage on cost reduction than any auto-suspend policy change.

The three levers that move the needle most: clustering and partitioning (stops full table scans, the single biggest cost driver), query result caching (avoid re-running the same expensive query multiple times per day), and auto-suspend policies on all non-production warehouses. Target less than 20% of compute spend on unattributed or ad-hoc queries. Most teams land at 50-60% when they first measure it.

The Lakehouse Question#

A lakehouse adds object storage as the primary storage layer with a table format that enables ACID transactions, schema evolution, and SQL queries directly on files. The pitch is: warehouse-quality query capabilities at object storage prices.

The lakehouse makes sense when you have significant unstructured or semi-structured data that a warehouse handles poorly, when you need both SQL and non-SQL workloads (ML training, feature engineering) accessing the same data, or when warehouse storage costs are genuinely prohibitive relative to S3 or GCS. Most teams should start with a cloud warehouse and add lakehouse capabilities only when they hit a specific limitation that has a cost justification.

Open Table Formats: The Decision That Now Defines Your Architecture#

If you are building on a data lake or lakehouse in 2026, the choice of open table format is foundational. Three formats dominate: Apache Iceberg, Delta Lake, and Apache Hudi. They solve the same core problems but with different trade-offs.

The problems they all solve: plain Parquet on S3 has no ACID guarantees, no schema evolution support, no time travel, and no partition evolution without a full table rewrite. If you have ever had a pipeline fail halfway through writing a partition and corrupted your table, you understand why these formats exist.

Apache Iceberg#

Iceberg is now the neutral enterprise default. Databricks acquired Tabular (founded by the original Iceberg engineers from Netflix) for over $1 billion in mid-2024, and the deal explicitly positioned Iceberg as the open, vendor-neutral standard. Every major query engine supports it: Snowflake, AWS Glue, Dremio, Trino, Spark, and now Databricks itself.

The feature that makes Iceberg genuinely different is hidden partitioning. With Hive-style partitioning, your queries need to know about the partition columns. With Iceberg hidden partitioning, the table tracks partitioning metadata internally. You write your query as a plain predicate and the engine figures out which files to read:

sql
-- Iceberg: just write the predicate, no partition filter required
SELECT * FROM events
WHERE event_date = '2026-04-15'
  AND country = 'MT';

Partition evolution means you can change the partitioning strategy of an existing table without rewriting it. If you initially partitioned by month and later need daily partitions, Iceberg handles that migration transparently. Iceberg V2 also introduced deletion vectors: instead of rewriting files for every update or delete, it writes a small deletion bitmap file alongside the data. This makes UPDATE and DELETE operations significantly faster.

Delta Lake#

Delta Lake is the format native to Databricks, and 60%+ of Fortune 500 companies run on Databricks. If your platform is built on Databricks, Delta is the correct choice. The tooling integration is tight, the performance is well-tuned, and you get Liquid Clustering (introduced in 2024) which replaces the older OPTIMIZE + Z-ORDER approach.

Liquid Clustering is adaptive. You define the cluster keys but the engine incrementally clusters data as it is written, without requiring a full table rewrite on each OPTIMIZE run:

sql
CREATE TABLE events (
  event_id STRING,
  user_id STRING,
  event_date DATE,
  event_type STRING
)
CLUSTER BY (user_id, event_date);

-- Incremental clustering (fast, not a full rewrite)
OPTIMIZE events;

The honest trade-off: Delta Lake is tightly coupled to Databricks. While the open-source Delta Lake specification exists and Spark supports it, the best features and performance optimisations appear on Databricks first. If you are considering a future multi-cloud or multi-engine architecture, that coupling is a real constraint.

Apache Hudi#

Hudi is the right choice for high-frequency streaming upsert workloads. It was built at Uber specifically for ingesting CDC (change data capture) streams into a data lake with near-real-time latency. If you are processing millions of row-level updates per hour from a transactional database via Debezium or AWS DMS, Hudi's write path is optimised for that pattern.

The key architectural choice in Hudi is Copy-on-Write (COW) versus Merge-on-Read (MOR). COW rewrites the entire Parquet file on every update. Reads are fast because there is no merge step. Writes are slow for high-frequency updates. MOR appends delta logs alongside the base Parquet files on write and merges them at read time. Writes are fast. Reads have a small overhead from the merge step, amortised by running compaction jobs. For streaming CDC workloads, MOR is almost always the right choice.

The Honest Recommendation#

If you are starting a new data platform and are not already on Databricks: choose Iceberg. It has the widest engine support, the best partition evolution story, and no vendor dependency.

If you are already on Databricks: choose Delta Lake. The integration is too tight and the performance too well-tuned to justify the friction of running a different format on the same platform.

If your primary workload is high-frequency streaming upserts from CDC pipelines: choose Hudi, with MOR table type and a tuned compaction schedule.

The Medallion Architecture#

The medallion architecture organises data in three layers: Bronze (raw), Silver (cleaned), and Gold (transformed). It has become the standard pattern for lakehouse and warehouse-backed data platforms because it makes quality gates explicit and gives you a recovery point at each stage.

Bronze: Raw and Immutable#

Bronze is the exact copy of source data, append-only, never modified. Whatever arrives from your source system goes in at Bronze, including malformed records, duplicates, and schema violations. The rule is: if you lose Bronze, you can rebuild everything else from it.

Naming convention worth standardising:

rounded-md bg-accent-subtle px-1.5 py-0.5 text-[0.88em] font-normal text-foreground border border-border/50
bronze/
  {source_system}/
    {entity}/
      year={yyyy}/month={mm}/day={dd}/
        {timestamp}_{batch_id}.parquet

Retention policy should be driven by your recovery SLA and regulatory requirements. A 90-day Bronze retention covers most operational failure scenarios. Regulatory data may require 7 years.

Silver: Where Quality Actually Lives#

Silver is where you enforce the contract between raw ingestion and downstream consumers. The operations that belong in Silver: deduplication (choosing the canonical record when the same entity appears multiple times), schema enforcement (rejecting or quarantining records that violate the expected schema), PII tokenisation (replacing raw PII with consistent pseudonymous identifiers before it reaches any consumer), NULL handling (defining explicit rules for what NULL means in each field), and data contract validation.

The quality gate logic in dbt:

yaml
models:
  - name: silver_transactions
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [transaction_id, source_system]
    columns:
      - name: transaction_id
        tests:
          - not_null
          - unique
      - name: amount_usd
        tests:
          - not_null
      - name: user_id_token
        tests:
          - not_null
          - relationships:
              to: ref('silver_users')
              field: user_id_token

Teams that skip Silver and pipe Bronze directly to Gold create a quality cliff. When a source system sends a bad record, it propagates directly to reporting. There is no quarantine, no audit trail, no recovery without a full Gold rebuild.

Gold: What Consumers Actually Query#

Gold contains business logic: dimensional models, one-big-table (OBT) aggregations, or materialised views that match the query patterns of your consumers. Gold tables should be named after business concepts, not technical ones. fct_revenue_daily is a Gold table name. silver_transactions_deduplicated_v3 is not.

The realistic scale of a mature mid-size enterprise data platform: 50 to 200 Gold tables, 200 to 500 Silver tables, and Bronze landing zones for 20 to 50 source systems. Platforms with 2,000+ tables and no clear ownership are a sign that quality governance was never enforced. Every Gold table should have a named business owner, a documented refresh SLA, and at least one test that would catch a full-table NULL scenario.

Data Contracts#

A data contract is a formal agreement between the producer of a dataset and its consumers. It specifies what the data looks like, what quality thresholds it meets, who owns it, what the latency SLA is, which fields contain PII, and what happens when the contract is violated.

The problem contracts solve is trust erosion. Without contracts, data engineering teams spend significant time on reactive debugging: a downstream dashboard shows wrong numbers, an analyst raises a Slack message, the data engineer investigates, discovers a source system changed its schema three weeks ago, and rebuilds the pipeline. This cycle repeats indefinitely.

A concrete contract structure:

yaml
contract:
  id: user-events-v2
  version: 2.1.0
  owner:
    team: platform-engineering
    notification_channel: "#data-contracts-alerts"
  dataset:
    name: silver_user_events
    refresh_schedule: "0 * * * *"
    latency_sla_minutes: 30
  schema:
    - name: event_id
      type: STRING
      nullable: false
    - name: user_id_token
      type: STRING
      nullable: false
      pii: tokenised
    - name: event_type
      type: STRING
      nullable: false
      allowed_values: [click, view, purchase, session_start, session_end]
    - name: event_timestamp
      type: TIMESTAMP
      nullable: false
  quality_thresholds:
    completeness_event_id: 1.0
    row_count_min_hourly: 1000
    freshness_max_minutes: 45
  consumers:
    - team: product-analytics
    - team: ml-platform

Enforcement happens at three levels. Schema enforcement at the registry level (Confluent Schema Registry for Kafka topics, AWS Glue Schema Registry for S3-backed pipelines) prevents schema-breaking changes without a version bump. Quality enforcement at the transformation level via dbt schema tests and Great Expectations. Contract violation alerting through your pipeline orchestrator notifies the owner and consumers when a threshold is breached.

Airbnb, LinkedIn (DataHub), and Zalando are the most cited enterprise implementations. The cultural challenge is real: a contract only works if the producer is accountable for it. A data engineering team that writes contracts but has no organisational mechanism to enforce them with source system owners is doing documentation, not governance.

Pipeline Architecture#

Batch vs Streaming#

Batch processing with Airflow, dbt, or Dagster is the right choice for the majority of analytical workloads. It is simpler to build, cheaper to run, and straightforward to debug when it fails.

Streaming with Kafka, Kinesis, or Flink is required when the business requirement is genuine second-level latency: fraud scoring at transaction time, live operational dashboards where a 10-minute lag is a business problem. The common mistake is adopting streaming because someone thinks you might need real-time later. Adding streaming to a working batch system is significantly easier than simplifying an over-engineered streaming system.

Incremental Processing#

Full table scans on every pipeline run do not scale. Design for incremental from the start:

python
def get_incremental_query(table: str, last_processed: datetime) -> str:
    return f"""
        SELECT *
        FROM {table}
        WHERE updated_at > '{last_processed.isoformat()}'
          AND updated_at <= CURRENT_TIMESTAMP()
        ORDER BY updated_at ASC
    """

Idempotency#

Every pipeline stage should be idempotent: running it twice with the same input produces the same output. The simplest pattern for batch pipelines is to write to a staging partition and atomically replace the target partition on success:

python
def write_idempotent_partition(
    df: DataFrame,
    target_table: str,
    partition_date: date
) -> None:
    staging_path = f"s3://data-platform/staging/{target_table}/date={partition_date}"
    final_path = f"s3://data-platform/gold/{target_table}/date={partition_date}"

    df.write.parquet(staging_path, mode="overwrite")

    if validate_partition(staging_path, expected_min_rows=100):
        move_partition(staging_path, final_path)
        update_catalog(target_table, partition_date, final_path)
    else:
        raise DataQualityError(
            f"Partition validation failed for {target_table} on {partition_date}"
        )

Orchestration#

The orchestration landscape changed meaningfully in 2025. Airflow 3.0 shipped in April 2025 with data assets as a first-class concept, closing the gap with Dagster's software-defined asset model. Dagster Components reached GA in October 2025. Prefect remains the easiest onboarding path for teams new to orchestration.

Airflow 3.0#

The most significant change in Airflow 3.0 is data assets: a formal representation of the datasets that DAGs produce and consume, similar to Dagster's software-defined assets. This enables asset-level lineage tracking, asset-triggered DAG runs, and better visibility into what data each DAG produces. DAG versioning was also added, allowing you to track which version of a DAG produced a given data asset.

When to choose Airflow: you have existing Airflow DAGs representing significant engineering investment, your team has Airflow expertise, or you need access to the large Airflow provider ecosystem (300+ integrations).

Dagster#

Dagster's core concept is the software-defined asset: instead of defining tasks that run, you define the datasets you want to exist and declare the computation that produces them. The engine figures out what needs to run to materialise the assets you request.

python
from dagster import asset, AssetIn

@asset
def bronze_transactions(context):
    df = extract_from_source_system()
    df.to_parquet(f"s3://data-platform/bronze/transactions/date={today}")
    context.log.info(f"Loaded {len(df)} raw transactions")
    return df

@asset(ins={"bronze_transactions": AssetIn()})
def silver_transactions(context, bronze_transactions):
    df = (
        bronze_transactions
        .drop_duplicates(subset=["transaction_id"])
        .pipe(enforce_schema)
        .pipe(tokenise_pii_fields, fields=["user_email", "card_number"])
    )
    return df

@asset(ins={"silver_transactions": AssetIn()})
def gold_revenue_daily(context, silver_transactions):
    return (
        silver_transactions
        .groupby(["date", "currency"])
        .agg(total_amount_usd=("amount_usd", "sum"), transaction_count=("transaction_id", "count"))
    )

When to choose Dagster: greenfield data platform, team open to learning a new paradigm, asset lineage and data observability are first-class requirements.

Prefect#

Prefect has the best developer experience of the three. Flows and tasks are plain Python functions. The onboarding time for a data engineer new to orchestration is measured in hours rather than days. When to choose Prefect: team new to orchestration, rapid prototyping needs, smaller platforms where the conceptual overhead of Dagster is not yet warranted.

The Honest Answer#

If you have existing Airflow DAGs that work, stay on Airflow 3.0. For greenfield platforms, Dagster's asset-first model is the better foundation. For teams that just need to get pipelines running without a steep learning curve, Prefect. Regardless of which orchestrator: keep DAGs shallow (three to five tasks deep at most), fail fast and explicitly, and treat metadata as a first-class output.

Real-Time OLAP#

Most analytics platforms do not need real-time OLAP. If a five to ten second query latency on your BI dashboard is acceptable, you do not need the operational complexity of a separate OLAP engine. The use cases that genuinely require it: fraud dashboards needing sub-second response on data that is minutes old, operational metrics displayed to customer-facing teams, and real-time personalisation engines.

ClickHouse#

ClickHouse uses a column-oriented MergeTree storage engine that makes aggregations over billions of rows extremely fast. The architectural principle: denormalise aggressively, avoid JOINs. On well-structured denormalised tables, ClickHouse handles 10 billion rows on a single server and delivers sub-100ms analytical queries. The 2024 release introduced lightweight updates, 1,600x faster than the previous approach that required rewriting the underlying data part.

What ClickHouse does not do well: ad-hoc JOINs across large fact tables. If your query pattern requires joining a 500M row events table to a 50M row user table on every query, you will see performance degrade. The correct response is to denormalise at ingestion time. If your workload requires frequent ad-hoc JOINs, a warehouse like BigQuery or Snowflake is the better choice.

Apache Pinot#

Pinot is streaming-native. It ingests directly from Kafka with segment generation happening at ingestion time, meaning data is queryable within seconds of being produced. LinkedIn uses Pinot to serve 400 billion events per day for internal analytics. Uber and Stripe run it for operational dashboards.

The trade-off: Pinot has higher operational complexity than ClickHouse. Running it requires managing controllers, brokers, servers, and Zookeeper. For teams without dedicated infrastructure engineering capacity, ClickHouse is significantly easier to operate.

The Positioning Question#

Real-time OLAP is a specialised layer for specific high-frequency queries, not a warehouse replacement. ClickHouse or Pinot handles the 5% of queries that need sub-second latency on recent data. Your warehouse handles the 95% of analytical queries that need historical depth and complex SQL.

The Modern Data Stack Backlash#

The modern data stack (dbt, Snowflake, Fivetran, Looker) delivered on its promise of accessible analytics infrastructure. At scale, two problems emerged that the original narrative did not address honestly.

The first is cost. A mid-size organisation with 50 to 200 engineers and two to three years of accumulated data can hit $60,000 per month in Snowflake costs. The drivers are almost always the same: compute clusters running when they should not be, full table scans on large tables because clustering was never set up, analysts running exploratory queries on production warehouses without cost guardrails, and dbt models that materialise as tables when a view or incremental model would work.

The second is complexity. A data team of five engineers maintaining 15 tools is not a productive data team. It is a tools management team.

The DuckDB Dimension#

DuckDB has become the most interesting tool in the data engineering ecosystem because it runs analytical SQL on Parquet files from S3 with zero infrastructure setup and is roughly 10x faster than Pandas for most aggregation workloads.

python
import duckdb

result = duckdb.sql("""
    SELECT
        date_trunc('day', event_timestamp) AS event_date,
        event_type,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id_token) AS unique_users
    FROM read_parquet('s3://data-platform/silver/events/year=2026/**/*.parquet')
    WHERE event_timestamp >= CURRENT_DATE - INTERVAL 30 DAYS
    GROUP BY 1, 2
    ORDER BY 1 DESC, 3 DESC
""").df()

DuckDB is not a Snowflake replacement for production analytical workloads at scale. It is an excellent replacement for Pandas in transformation logic, an outstanding tool for local development and debugging pipelines, and a viable primary analytical engine for teams with data volumes under a few hundred GB.

The balanced view: the MDS is still the right starting point for most teams. Go in with a FinOps plan from day one, be deliberate about adding tools, and keep DuckDB in the toolkit for development, medium workloads, and ad-hoc analysis.

FinOps for Data Platforms#

Compute costs are not a data engineering problem. They are a product of every decision made in the data platform: schema design, materialisation strategy, orchestration schedule, clustering configuration, and user behaviour.

The foundation is visibility. Per-query cost attribution in Snowflake:

sql
SELECT
  query_id,
  user_name,
  warehouse_name,
  execution_time / 1000 AS execution_seconds,
  bytes_scanned,
  query_text
FROM snowflake.account_usage.query_history
WHERE start_time >= dateadd('day', -7, current_timestamp())
  AND execution_status = 'SUCCESS'
ORDER BY bytes_scanned DESC
LIMIT 50;

Run this weekly. Share the top 20 most expensive queries with their owners. You will have more leverage on cost reduction than any infrastructure change.

The three biggest levers: clustering and partitioning (stops full table scans, the single biggest cost driver), query result caching (avoid re-running the same expensive query), and auto-suspend policies. Target less than 20% of compute spend on unattributed or ad-hoc queries.

Data Quality#

Data quality problems compound silently. A missing NULL check in Silver becomes a wrong dashboard number in Gold, which becomes a wrong decision in a board meeting. By the time anyone traces the issue back, weeks of data are affected.

The practical framework is layered validation: schema and volume checks at ingestion, uniqueness and business rule validation at Silver, metric consistency checks at Gold.

The cultural rule: whoever produces a dataset is responsible for its quality. Source system teams own Bronze quality. Data engineering owns Silver transformation quality. Business intelligence owns Gold model quality. See the data observability engineering guide for the full observability pattern.

The Data Platform as Product#

The internal data platform is a product with internal users. Concrete practices that matter: weekly office hours where data consumers can bring questions. A public backlog where consumers can submit requests. SLAs published for every Gold table with a visible freshness indicator in the catalogue. Response time targets for bug reports.

The data catalogue is where this surfaces technically. Atlan, DataHub, and Alation are the enterprise options. The minimum viable catalogue is a dbt docs site with ownership metadata. If a consumer cannot find who owns a dataset and how fresh it is without asking in Slack, the catalogue is not working.

Reverse ETL and Data Activation#

Reverse ETL is the pattern that closes the loop: data flowing from your warehouse back into the operational tools your business runs on. A warehouse has better customer lifetime value data than your CRM. Your marketing automation tool needs that data to send the right message at the right time.

Census and Hightouch are the primary tools. Both connect to your warehouse, let you define a SQL model describing the audience or entity you want to sync, and handle the incremental sync to the destination (Salesforce, HubSpot, Intercom, and a large catalogue of others).

When this matters: when your customer data in the warehouse is materially better than what is in your CRM, and when the business is making operational decisions based on CRM data rather than warehouse data. When it does not matter: when the tools your business uses can already query the warehouse directly, or when data latency tolerance is compatible with a batch export.

What I Would Build Today#

Starting a data platform from scratch in mid-2026 with a mid-size team (10-20 engineers):

Apache Iceberg on S3 for the lakehouse portion, with a cloud warehouse (BigQuery or Snowflake depending on the existing cloud provider) for ad-hoc analytical queries and BI tool connections. Iceberg is the neutral open standard; there is no reason to choose Delta Lake unless you are already on Databricks.

dbt Core for SQL transformations with dbt Semantic Layer for metric definitions. Medallion architecture (Bronze, Silver, Gold) with data contracts enforced at Silver using dbt schema tests and schema registry validation for streaming sources.

Dagster for a greenfield build. The software-defined asset model reflects how data pipelines should be thought about: what data needs to exist, not what tasks need to run. If there were existing Airflow DAGs to preserve, Airflow 3.0 with the new data assets feature.

For the subset of analytical queries requiring sub-second latency on recent data (operational dashboards, real-time metrics), ClickHouse as a specialised query layer sitting alongside the warehouse, fed by a Kafka consumer from the event stream. Not replacing the warehouse. Complementing it for the specific high-frequency queries that justify the operational overhead.

FinOps tooling from day one: per-query cost attribution queries running weekly, clustering configured on every table over 100GB, auto-suspend policies on all non-production warehouses.

Reverse ETL via Hightouch for the customer-facing operational sync back to the CRM, triggered on a one-hour cadence.

The one thing I would not do: add more tools to the stack until there is a concrete, measurable problem that the existing stack cannot solve. Every tool added multiplies maintenance overhead. The best data platforms are not the ones with the most tools. They are the ones where every tool earns its place.

Frequently Asked Questions#

What are the key data architecture patterns for enterprise in 2026?#

The most important pattern is choosing your primary access pattern first: warehouse-first for heavy analytical queries on historical data, streaming-first for real-time operational decisions, or hybrid with clear boundaries for mixed workloads. Most failures come from trying to do everything with one system. Beyond that, separation of storage and compute, intentional data modelling with a semantic layer, and cost observability per query are the fundamentals that matter most.

What is the medallion architecture and should you use a lakehouse?#

The medallion architecture organises data in stages from raw to cleaned to transformed to aggregated, with each stage independently testable and recoverable. A lakehouse makes sense when you have significant unstructured data, need both SQL and non-SQL workloads, or face prohibitive warehouse storage costs. Most teams should start with a cloud warehouse and add lakehouse capabilities only when they hit real limitations.

When should you use real-time streaming versus batch processing for data pipelines?#

Batch processing with tools like Airflow, dbt, or Dagster is simpler to build, debug, and maintain, and is cost-effective for most analytical workloads with latency in minutes to hours. Streaming with Kafka, Kinesis, or Flink is required only for true real-time requirements with second-level latency. The common mistake is adopting streaming because you might need real-time later. It is easier to add streaming to a working batch system than to simplify an over-engineered streaming system.

Emanuel Mallia
Emanuel Mallia

Data & AI Architect

Share
Emanuel Mallia, Data & AI Architect
Emanuel Mallia

End-to-end Data & AI Architect with 14+ years building production data platforms, enterprise data warehouses, and AI systems across fintech, telecom, gaming, and S&P 500 companies.

More about Emanuel

Have Questions?

If you'd like to discuss this topic or explore how I can help with your AI and data initiatives, let's connect.