The modern fintech ecosystem lives at the intersection of speed and scrutiny.
Startups building digital banks, wealth management apps, or payment platforms must move fast to deliver personalized, real-time financial experiences — yet they operate in one of the most tightly regulated environments in the world.
In Europe, PSD2 and Open Banking directives reshaped how financial institutions share customer data. They gave consumers control over their accounts and allowed third-party providers to access financial information — but only under strict conditions of consent, auditability, and security.
What most early-stage fintech founders realize too late is that compliance isn’t an API feature. It’s a data architecture problem. You can’t be “Open Banking compliant” unless your underlying data systems can prove integrity, trace lineage, and enforce privacy constraints at every layer.
At INSART, we approach this as a data-engineering challenge first, not a legal afterthought. Our regulatory data pipeline is a blueprint for fintechs that want to build scalable, transparent, and regulator-ready systems — without sacrificing agility or performance.
The Challenge
Imagine a digital bank processing thousands of account information requests per minute through Tink, TrueLayer, or SaltEdge APIs.
Each call brings new transactions, balance updates, and consent states. Those records feed fraud models, user dashboards, and analytics reports — while simultaneously being subject to:
- GDPR’s data minimization and erasure obligations
- PSD2’s requirement for secure audit trails
- ISO 27001’s encryption and key management standards
- FCA and EBA mandates for timely regulatory reporting
In practice, that means every single piece of data — from a user’s IBAN to a consent token or a failed API call — must be:
- Traceable: Regulators can reconstruct the flow of any record from ingestion to reporting.
- Tamper-proof: No one can alter or delete data without leaving a digital fingerprint.
- Secure: Personally identifiable information (PII) is encrypted at rest and in transit.
- Valid: Data inconsistencies are caught before they reach dashboards or reports.
- Observable: Data quality, lineage, and SLA metrics are continuously monitored.
Designing for this level of precision requires more than pipelines — it requires a system of record for compliance itself.

INSART’s Architectural Philosophy
We design regulatory pipelines around three core tenets:
- Transparency by design: Every transformation step, schema change, and data dependency must be explainable — in technical and regulatory language alike.
- Automation with human oversight: Compliance workflows are codified into DAGs, but governance checkpoints ensure human-in-the-loop verification for sensitive operations.
- Zero-trust data flow: Every data transfer assumes potential compromise; encryption, key rotation, and identity propagation are enforced everywhere.
To achieve this, INSART combines its expertise in data engineering, DevSecOps, and fintech domain modeling.
Our pipeline design typically unfolds in four major layers:
- Ingestion – Collecting and authenticating data from regulated APIs
- Storage & Security – Persisting it in a compliant, encrypted data lakehouse
- Transformation & Validation – Cleansing, modeling, and testing data for analytical use
- Audit & Reporting – Exposing verifiable, regulator-ready datasets and dashboards
Let’s break each layer down in detail.
Secure Data Ingestion
The first and most sensitive step is pulling financial and personal data from third-party Open Banking APIs. INSART engineers begin by building a secure ingestion boundary using AWS API Gateway and Lambda.
Each client or fintech partner is assigned a separate VPC endpoint, ensuring tenant isolation at the network level.
Requests to external APIs like Tink, TrueLayer, or Plaid are signed using OAuth 2.0 client credentials and processed asynchronously to prevent data leakage or replay attacks.
Incoming data — whether it’s transaction history, balance updates, or account metadata — is serialized as Avro objects. Avro schemas are versioned in Confluent Schema Registry, which ensures backward compatibility when third-party APIs evolve.
Rather than dumping data directly into a warehouse, we publish these Avro payloads to Apache Kafka topics grouped by domain (accounts, transactions, consents, errors). Kafka provides a durable, ordered, and replayable event stream — an essential audit trail for regulators.
Each record is stamped with a correlation ID that ties it to its API call, user, and consent artifact. This enables end-to-end lineage tracing later in the pipeline.
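To make the ingestion step concrete, here is a minimal sketch of the publishing side, assuming the confluent-kafka Python client with its Schema Registry Avro serializer; the transaction schema, broker addresses, and the publish_transaction helper are illustrative rather than production code.

import json
import uuid

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Illustrative Avro schema; in practice it is versioned in the Schema Registry
# and evolved with backward compatibility as provider APIs change.
TRANSACTION_SCHEMA = json.dumps({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "booked_at", "type": "string"},
        {"name": "correlation_id", "type": "string"},
        {"name": "consent_id", "type": "string"},
    ],
})

schema_registry = SchemaRegistryClient({"url": "https://schema-registry.internal"})
producer = SerializingProducer({
    "bootstrap.servers": "kafka.internal:9092",
    "value.serializer": AvroSerializer(schema_registry, TRANSACTION_SCHEMA),
})

def publish_transaction(record: dict, consent_id: str) -> None:
    """Stamp the record with a correlation ID and publish it to the domain topic."""
    record["correlation_id"] = str(uuid.uuid4())   # ties the record to its API call
    record["consent_id"] = consent_id              # links back to the consent artifact
    producer.produce(topic="transactions_v3", key=record["account_id"], value=record)
    producer.flush()                               # a real ingester batches instead of flushing per record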
Data Normalization and Privacy Controls
Once ingested, data passes through a normalization layer implemented in Python using Pydantic for schema validation and Marshmallow for field-level serialization. This step standardizes inconsistent field names across multiple providers (e.g., balance.amount vs. current_balance) and unifies date/time formats to ISO 8601.
Before data leaves the ingestion layer, privacy enforcement begins.
All PII fields — account names, IBANs, merchant names, or user identifiers — are encrypted using AWS KMS with per-field keys. Tokens are generated through HashiCorp Vault to allow reversible pseudonymization when legitimate processing requires re-identification.
We embed a privacy contract within the schema itself. Every field includes metadata such as:
field_name: account_id
sensitivity: high
encryption: KMS_AES256
retention_days: 730
These metadata tags are later consumed by the retention and deletion policies, allowing automated GDPR compliance workflows.
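To show how these tags can drive downstream behavior, here is a minimal sketch of normalization plus field-level encryption. The provider payload shapes, the PRIVACY_CONTRACT lookup mirroring the tags above, and the KMS key alias are all illustrative; the production pipeline applies per-field keys and Vault tokenization as described.

from datetime import datetime

import boto3
from pydantic import BaseModel

# Privacy contract: per-field metadata consumed by encryption, retention,
# and deletion workflows (values mirror the schema tags shown above).
PRIVACY_CONTRACT = {
    "account_id": {"sensitivity": "high", "encryption": "KMS_AES256", "retention_days": 730},
    "amount":     {"sensitivity": "low",  "encryption": None,         "retention_days": 730},
    "booked_at":  {"sensitivity": "low",  "encryption": None,         "retention_days": 730},
}

class NormalizedTransaction(BaseModel):
    account_id: str
    amount: float
    booked_at: datetime  # always normalized to ISO 8601

def normalize(provider: str, raw: dict) -> NormalizedTransaction:
    """Map provider-specific field names onto the canonical schema."""
    if provider == "tink":                          # hypothetical provider payload shapes
        return NormalizedTransaction(
            account_id=raw["accountId"],
            amount=raw["balance"]["amount"],
            booked_at=raw["bookedDate"],
        )
    return NormalizedTransaction(
        account_id=raw["account_id"],
        amount=raw["current_balance"],
        booked_at=raw["booking_time"],
    )

kms = boto3.client("kms")

def encrypt_sensitive_fields(record: dict) -> dict:
    """Encrypt every field the privacy contract marks as high sensitivity."""
    for field, policy in PRIVACY_CONTRACT.items():
        if policy["sensitivity"] == "high":
            response = kms.encrypt(
                KeyId="alias/openbanking-pii",       # hypothetical key alias
                Plaintext=str(record[field]).encode(),
            )
            record[field] = response["CiphertextBlob"]
    return record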
Lakehouse Storage Architecture
INSART favors a data lakehouse paradigm — combining the flexibility of object storage with the consistency of a warehouse.
Our standard stack includes:
- AWS S3 for raw and processed storage
- Databricks Delta Lake for ACID-compliant table management
- Snowflake or BigQuery as analytical layers
- Glue Catalog for metadata discovery

Raw data (Bronze layer) lands in partitioned S3 buckets organized by ingestion date and data domain.
Transformation outputs (Silver and Gold layers) are stored as Delta tables, ensuring atomicity and version control — every modification generates a new version, preserving the full history for audit purposes.
To prevent unauthorized access, we apply fine-grained access control (FGAC) via AWS Lake Formation. Permissions are bound to IAM roles representing data consumers — for instance, compliance analyst, ML engineer, data scientist.
PII fields remain encrypted; non-sensitive data is queryable in plaintext for analytics.
When data retention periods expire, AWS S3 Object Lock and lifecycle policies automatically transition outdated records to Glacier Deep Archive, maintaining immutability for the required legal periods (typically seven years under FCA rules).
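A simplified sketch of the Bronze-to-Silver promotion, assuming PySpark with the spark-avro and Delta Lake packages and illustrative bucket paths and column names, looks like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_to_silver_transactions").getOrCreate()

# Bronze: raw Avro payloads landed from Kafka, partitioned by ingestion date.
bronze = spark.read.format("avro").load("s3://lakehouse-bronze/transactions/")

# Silver: deduplicated, consent-filtered records with lineage columns preserved.
silver = (
    bronze
    .dropDuplicates(["correlation_id"])                 # makes re-processing idempotent
    .withColumn("ingestion_date", F.to_date("ingested_at"))
    .filter(F.col("consent_status") == "active")
)

(
    silver.write.format("delta")
    .mode("append")
    .partitionBy("ingestion_date")
    .save("s3://lakehouse-silver/transactions/")
)

# Each write creates a new Delta version; the full history stays queryable for audits:
# spark.sql("DESCRIBE HISTORY delta.`s3://lakehouse-silver/transactions/`")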
Transformation and Validation
The transformation stage turns raw API payloads into structured, regulator-friendly tables. INSART uses Apache Airflow as the orchestration backbone and dbt as the transformation framework.
Airflow DAGs represent compliance workflows — for example:
- Extract all daily transactions from the raw layer
- Validate schema and field-level constraints via Great Expectations
- Join with consent logs and KYC data
- Apply business rules and compute derived metrics (e.g., total monthly spend, average account balance)
- Write results to Gold tables
Each DAG is annotated with SLA metrics and lineage tags:
from airflow.decorators import dag

@dag(
    schedule="@hourly",
    sla_miss_callback=notify_slack,
    tags=["openbanking", "compliance", "transactions"],
)
def reconcile_transactions():
    ...

reconcile_transactions()  # instantiate so Airflow registers the DAG
Within dbt, transformations are defined as modular SQL models with explicit tests:
-- models/validated_transactions.sql
SELECT *
FROM {{ ref('transactions_normalized') }}
WHERE amount >= 0
AND consent_status = 'active'
AND account_id IS NOT NULL
Tests run automatically as part of CI/CD pipelines using GitHub Actions.
If validation fails — for instance, a transaction references a revoked consent — Airflow retries the job, marks the dataset as quarantined, and notifies the compliance Slack channel.
This proactive alerting system transforms what used to be post-hoc audit headaches into continuous assurance.
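For illustration, the notify_slack callback referenced in the DAG above and a quarantine helper can stay small; the sketch below assumes a standard Slack incoming webhook and a hypothetical COMPLIANCE_SLACK_WEBHOOK environment variable.

import os

import requests

SLACK_WEBHOOK_URL = os.environ["COMPLIANCE_SLACK_WEBHOOK"]  # hypothetical env var

def notify_slack(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Airflow sla_miss_callback: post missed SLAs to the compliance channel."""
    missed = ", ".join(f"{sla.task_id} ({sla.execution_date})" for sla in slas)
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: SLA missed in {dag.dag_id}: {missed}"},
        timeout=10,
    )

def quarantine_dataset(table: str, run_id: str) -> None:
    """Flag a failed dataset so downstream models skip it until sign-off."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":warning: Dataset {table} quarantined for run {run_id}"},
        timeout=10,
    )
    # In the real pipeline this also flips a quarantine flag on the Delta table
    # and blocks dependent dbt models until a compliance engineer signs off.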
Data Lineage and Observability
Regulators often ask, “Where did this number come from?”
Answering that question within seconds is the true mark of a mature data organization.
INSART integrates OpenLineage with Airflow and dbt to automatically capture lineage metadata.
Each dataset’s journey — from ingestion topic to transformation script to warehouse table — is logged in Marquez, a lineage catalog that visualizes dependencies as graphs.
For example, the lineage graph for gold.transactions_daily_summary might show:
Tink_API_Transactions
↓ Kafka topic: transactions_v3
↓ Airflow DAG: normalize_transactions
↓ dbt model: transactions_validated
↓ dbt model: transactions_daily_summary
This transparency enables auditable replay: if regulators dispute a metric, engineers can replay the same pipeline on historical versions to reproduce results bit-for-bit.
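Delta Lake's time travel makes such a replay straightforward. The sketch below pins the Silver table to an illustrative historical version and re-runs the kind of aggregation the Gold model applies; the path, version number, and column names are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("auditable_replay").getOrCreate()

# Read the Silver table exactly as it looked when the disputed report was produced.
# Either a version number or a timestamp ("timestampAsOf") can pin the snapshot.
historical = (
    spark.read.format("delta")
    .option("versionAsOf", 1842)                      # illustrative version number
    .load("s3://lakehouse-silver/transactions/")
)

# Re-run the same aggregation logic against the historical snapshot.
replayed_summary = (
    historical
    .filter(F.col("consent_status") == "active")
    .groupBy("account_id", F.to_date("booked_at").alias("day"))
    .agg(F.sum("amount").alias("total_spend"))
)

replayed_summary.show()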
Data observability is handled through Monte Carlo or Prometheus + Grafana dashboards.
We monitor:
- Data freshness (lag from API event to warehouse availability)
- Data volume anomalies (sudden drop in transactions)
- Schema drift (added/removed fields)
- SLA adherence per DAG
When anomalies are detected, incidents are automatically logged in Jira with metadata — dataset name, timestamp, failure reason — closing the loop between data reliability and operational accountability.
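With the Prometheus + Grafana option, these checks can be exported as plain gauges. A minimal sketch follows; the metric names and the hard-coded sample values are hypothetical, and in practice the numbers come from warehouse queries.

import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; Grafana panels and alert rules are built on top of these.
FRESHNESS_LAG = Gauge(
    "openbanking_freshness_lag_seconds",
    "Seconds between the newest API event and its availability in the warehouse",
    ["dataset"],
)
ROW_COUNT = Gauge(
    "openbanking_rows_ingested_last_hour",
    "Rows landed in the last monitoring window",
    ["dataset"],
)

def record_metrics(dataset: str, newest_event_ts: float, rows_last_hour: int) -> None:
    """Publish the latest freshness and volume numbers for one dataset."""
    FRESHNESS_LAG.labels(dataset=dataset).set(time.time() - newest_event_ts)
    ROW_COUNT.labels(dataset=dataset).set(rows_last_hour)

if __name__ == "__main__":
    start_http_server(9102)        # endpoint scraped by Prometheus
    while True:
        record_metrics("transactions", newest_event_ts=time.time() - 120, rows_last_hour=48_000)
        time.sleep(60)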
Compliance and Governance Layer
The compliance layer enforces rules around consent, retention, and access.
Consent Validation
Every Open Banking event carries a consent ID linking back to a user authorization.
INSART builds a consent registry — a dedicated microservice (Python + FastAPI + PostgreSQL) that tracks the lifecycle of each consent: created, refreshed, expired, revoked.
During data transformation, a lightweight service queries this registry. If a consent is revoked, all associated downstream records are marked inactive and excluded from analytical outputs.
This ensures no unauthorized data processing occurs post-consent expiration.
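A stripped-down version of that registry is sketched below; the in-memory store and route names stand in for the real FastAPI + PostgreSQL service, so treat the fields and endpoints as illustrative.

from datetime import datetime, timezone
from enum import Enum

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

class ConsentState(str, Enum):
    created = "created"
    refreshed = "refreshed"
    expired = "expired"
    revoked = "revoked"

class Consent(BaseModel):
    consent_id: str
    user_id: str
    state: ConsentState
    expires_at: datetime   # expected to be timezone-aware ISO 8601

app = FastAPI(title="Consent Registry (sketch)")
_registry: dict[str, Consent] = {}          # stands in for the PostgreSQL table

@app.post("/consents")
def register_consent(consent: Consent) -> Consent:
    _registry[consent.consent_id] = consent
    return consent

@app.get("/consents/{consent_id}")
def get_consent(consent_id: str) -> Consent:
    consent = _registry.get(consent_id)
    if consent is None:
        raise HTTPException(status_code=404, detail="unknown consent")
    return consent

@app.get("/consents/{consent_id}/active")
def is_active(consent_id: str) -> dict:
    """Called by the transformation layer before any record is processed."""
    consent = get_consent(consent_id)
    active = (
        consent.state in (ConsentState.created, ConsentState.refreshed)
        and consent.expires_at > datetime.now(timezone.utc)
    )
    return {"consent_id": consent_id, "active": active}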
Retention and Deletion
A scheduled Airflow DAG, purge_expired_data, scans metadata for retention_days and removes expired records from the lakehouse.
Deletions are performed via Delta Lake’s DELETE WHERE statements with tombstoning enabled — meaning the operation itself becomes part of the audit history.
To comply with GDPR’s “Right to be Forgotten,” PII tokens can be invalidated by deleting their corresponding Vault keys. Once keys are revoked, data remains in storage but is cryptographically unreadable — an elegant solution balancing auditability and erasure.
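A condensed sketch of the purge logic, assuming the delta-spark bindings and retention windows taken from the privacy-contract tags shown earlier, follows; the table and column names are illustrative.

from datetime import datetime, timedelta, timezone

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("purge_expired_data").getOrCreate()

# Retention windows normally come from the schema's privacy-contract tags.
RETENTION = {
    "silver.transactions": 730,      # days, illustrative
    "silver.consents": 365,
}

def purge_expired(table_name: str, retention_days: int) -> None:
    """Delete records older than the retention window; the delete itself is
    recorded in the Delta transaction log, so the operation stays auditable."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    table = DeltaTable.forName(spark, table_name)
    table.delete(f"ingested_at < '{cutoff.isoformat()}'")

for name, days in RETENTION.items():
    purge_expired(name, days)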
Access Control
Access policies are enforced using Snowflake Secure Views with row-level security:
CREATE SECURE VIEW transactions_compliance AS
SELECT * FROM transactions_gold
WHERE region = CURRENT_ROLE_REGION();  -- CURRENT_ROLE_REGION() is a custom UDF mapping the active role to its home region
This prevents cross-border data leakage when multi-region teams access compliance datasets.
Regulatory Reporting and Dashboards
INSART’s final output is not just a database — it’s an operational reporting system that regulators and compliance teams can trust.
Validated datasets feed into dashboards built with Apache Superset or Power BI.
These dashboards visualize key compliance metrics:
- Number of active vs. expired consents
- API uptime and data freshness SLAs
- Transaction volume anomalies
- GDPR deletion requests processed per month
Behind each metric lies a traceable dbt model and Airflow DAG, giving compliance teams confidence in every number presented.
For regulatory agencies such as the FCA or EBA, INSART automates report generation.
Python-based Lambda functions transform the Gold datasets into regulator-specified formats (CSV, XML, or JSON).
Each report is cryptographically signed using AWS Signer, uploaded to an SFTP endpoint, and logged in the audit catalog with checksum verification.
This automation eliminates manual report compilation — a process that traditionally consumes hundreds of hours per quarter.
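A trimmed-down Lambda handler for that step might look like the sketch below; the bucket, field names, and report layout are placeholders, and the signing and SFTP steps are only indicated in comments.

import csv
import hashlib
import io
from datetime import date

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Render a Gold dataset as a regulator-formatted CSV and log its checksum."""
    rows = event["rows"]                      # illustrative: pre-queried Gold records
    report_date = event.get("report_date", date.today().isoformat())

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["account_id", "total_spend", "currency"])
    writer.writeheader()
    writer.writerows(rows)

    body = buffer.getvalue().encode("utf-8")
    checksum = hashlib.sha256(body).hexdigest()

    key = f"regulatory-reports/fca/{report_date}.csv"
    s3.put_object(Bucket="regulatory-reports-signed", Key=key, Body=body)

    # The real pipeline then signs the artifact, pushes it to the regulator's SFTP
    # endpoint, and records the checksum in the audit catalog.
    return {"report_key": key, "sha256": checksum, "rows": len(rows)}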
DevSecOps and Delivery Automation
A compliance-ready data pipeline is only as strong as its deployment process.
INSART implements infrastructure as code using Terraform and AWS CDK, allowing full reproducibility of all environments (dev, staging, prod).
Security scanning runs continuously:
- Snyk checks Python dependencies.
- Trivy scans container images.
- Checkov enforces Terraform security policies.
CI/CD pipelines deploy dbt and Airflow updates only after automated data-quality tests pass.
Secrets and tokens are managed in AWS Secrets Manager, rotated every 90 days.
This disciplined DevSecOps setup turns compliance into code — enabling startups to scale without accumulating “regulatory debt.”
Example Architecture Diagram (Textual)
+-----------------------+
|   Open Banking APIs   |
|  (Tink / TrueLayer)   |
+----------+------------+
           |
  OAuth2 + HTTPS (TLS 1.3)
           |
+----------v------------+
|    AWS API Gateway    |
+----------+------------+
           |
    Lambda Functions
           |
+----------v------------+
|     Kafka Topics      |
|  (Avro + Schema Reg)  |
+----------+------------+
           |
+----------v------------+
|  Normalization Layer  |
|  (Python + Pydantic)  |
+----------+------------+
           |
+----------v------------+
|    S3 + Delta Lake    |
| (Bronze/Silver/Gold)  |
+----------+------------+
           |
   Airflow + dbt + GE
           |
+----------v------------+
|  Snowflake / BI Dash  |
+----------+------------+
           |
+----------v------------+
|  Regulatory Reports   |
|   (CSV/XML, Signed)   |
+-----------------------+
This schematic, when rendered visually, becomes the centerpiece of INSART’s RegTech Systems Architecture presentation.
Scalability and Performance
Performance isn’t secondary in compliance pipelines — it’s integral.
Slow or unscalable systems cause delayed reports, which can themselves become regulatory violations.
INSART optimizes for both throughput and reliability:
- Kafka partitions are balanced across multiple brokers for horizontal scalability.
- Airflow executors run in Kubernetes, auto-scaling based on job load.
- Delta caching in Databricks speeds up reporting queries by 70%.
- Snowflake result caching keeps BI tools responsive even with billions of rows.
For resilience, Cross-Region Replication (CRR) keeps data synchronized between two AWS regions.
All critical Airflow state and lineage metadata are backed up daily to S3 and restored in under 15 minutes during failover simulations.

Security and Compliance Validation
At project completion, INSART conducts a Security & Compliance Validation Sprint.
This includes:
- Penetration testing of data access endpoints.
- Encryption verification: ensuring all S3 and Snowflake data are encrypted using AES-256.
- SOC 2 and ISO 27001 mapping: every control requirement is mapped to technical evidence (Terraform outputs, IAM policies, encryption keys).
- Disaster recovery drills simulating data loss, validating RTO < 30 min and RPO < 10 min.
These exercises provide not only assurance but documentation, which can be shared directly with regulators or investors during due diligence.
Results and Impact
By implementing this pipeline architecture, fintechs can achieve:
- Complete end-to-end data lineage — every figure can be traced back to its raw source.
- Zero manual compliance reporting — reports are auto-generated and verifiable.
- Regulator-grade observability — compliance metrics updated in near real time.
- Data quality confidence above 99.8%, validated by continuous testing.
- GDPR & PSD2 alignment by design — encryption, consent, and retention embedded in infrastructure.
Most importantly, the fintech gains trust — from customers, auditors, and investors.
Compliance ceases to be a bottleneck; it becomes a competitive advantage.
Lessons Learned and INSART’s Value Proposition
Through numerous fintech acceleration projects, INSART has learned that the biggest risk in compliance engineering isn’t non-compliance — it’s complexity.
Startups often accumulate fragmented tools and manual processes that fail at scale. INSART’s approach unifies these concerns through automation and modular design.
Our teams combine:
- Fintech domain expertise (understanding PSD2, FCA, GDPR)
- Engineering depth (Kafka, Airflow, dbt, Databricks, Snowflake)
- Operational rigor (DevSecOps, IaC, continuous validation)
The result is a platform-level foundation that can power not only regulatory needs but also data-driven growth, analytics, and personalization.
Conclusion
Regulatory compliance is often seen as friction — an obligation imposed on innovators.
INSART reframes it as an engineering discipline that, when done right, strengthens the product’s credibility and scalability.
The Regulatory Data Pipeline for Open Banking Compliance represents the best of what we stand for: technical precision, fintech intelligence, and the belief that transparency breeds trust.
By building systems that can prove their integrity at any moment, we empower fintech founders to move faster — confidently, securely, and in full alignment with the future of open finance.







