
Cloud ETL: Getting It Right for Data Synchronization

Let’s admit it: real-time data has snowballed from a cool feature into a mainstream requirement. If you want to keep up with the trends, as our clients do, find out how we solved a data synchronization challenge with one reporting system and several legacy databases.

Where we started

So, our client wanted to level up their cloud ETL. Our team's task was to integrate their multi-tenant, web-based reporting system with the existing legacy databases (on-premises MS SQL Server). First, we were to create an MVP version for early adopters, and then we would scale it up. The integration would allow us to replicate the existing set of SAP Crystal Reports. Once the project was completed, the client gained the following:

  • Initial synchronization of the data source for a report or a set of reports when onboarding a new customer or introducing a new set of reports. Onboarding a new customer means synchronizing twelve data sources, each of which can contain over a million records; multiple reports can be built on a single data source.

  • Subsequent synchronization of the data sources at pre-defined intervals, which keeps the reports accurate in near real time.

Technical considerations for the cloud ETL project

Orchestration. The initial challenge was how to orchestrate everything properly. The options included Azure Functions, Durable Functions, and Worker Services. We eventually opted for Worker Services because the load was fairly predictable: we didn't need to spin up many function instances. We also needed to connect to on-premises databases, which Azure Functions supports only on a Premium plan. Worker Services, on the other hand, run in Kubernetes out of the box, and both the development and DevOps teams had hands-on experience with them.
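
To make this concrete, here is a minimal sketch of what such a worker could look like in .NET. The SyncWorker class, the five-minute interval, and the SynchronizeAllAsync call are illustrative placeholders, not the project's actual code:

```csharp
// Minimal Worker Service sketch: a BackgroundService that runs a
// synchronization pass at a fixed interval. SyncWorker, the five-minute
// interval, and SynchronizeAllAsync are illustrative placeholders.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

await Host.CreateDefaultBuilder(args)
    .ConfigureServices(services => services.AddHostedService<SyncWorker>())
    .Build()
    .RunAsync();

public sealed class SyncWorker : BackgroundService
{
    private static readonly TimeSpan Interval = TimeSpan.FromMinutes(5);
    private readonly ILogger<SyncWorker> _logger;

    public SyncWorker(ILogger<SyncWorker> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(Interval);
        try
        {
            while (await timer.WaitForNextTickAsync(stoppingToken))
            {
                _logger.LogInformation("Starting synchronization pass");
                await SynchronizeAllAsync(stoppingToken);
            }
        }
        catch (OperationCanceledException)
        {
            // Host is shutting down; exit gracefully.
        }
    }

    private Task SynchronizeAllAsync(CancellationToken ct)
    {
        // Placeholder for the per-data-source synchronization logic.
        return Task.CompletedTask;
    }
}
```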

Auto scaling. The application was ready to support KEDA (Kubernetes-based Event Driven Autoscaler), which could scale it automatically based on the number of queued messages, the number of tables, and so on. However, we decided to integrate HPA or KEDA later; a Helm Charts specification was enough for the first version.

Change tracking. The options on the table were Change Data Capture, database triggers, exposing a queue or API to the legacy system, and custom logic. Both the web-based reporting database and the on-premises databases were built on SQL Server, and this was not going to change in the mid-term, so Change Data Capture (CDC) was selected as the change tracking approach.
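
For illustration, here is a hedged sketch of enabling CDC on a SQL Server table and reading the accumulated changes. The dbo.Orders table and the connection string are assumptions made for the example:

```csharp
// Hedged sketch: enable SQL Server Change Data Capture for a table and read
// the changes accumulated between two LSN watermarks. The dbo.Orders table
// and the connection string are assumptions made for the example.
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

const string connectionString = "Server=...;Database=Reporting;Integrated Security=true;";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

// 1. Enable CDC once for the database and once for each tracked table.
await using (var enableCdc = new SqlCommand(@"
    IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = DB_NAME() AND is_cdc_enabled = 1)
        EXEC sys.sp_cdc_enable_db;
    IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'Orders' AND is_tracked_by_cdc = 1)
        EXEC sys.sp_cdc_enable_table
             @source_schema = N'dbo',
             @source_name   = N'Orders',
             @role_name     = NULL;", connection))
{
    await enableCdc.ExecuteNonQueryAsync();
}

// 2. Read all changes between the low and high LSN watermarks. In a real
//    sync loop the low LSN would be persisted between passes.
await using var readChanges = new SqlCommand(@"
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');", connection);

await using var reader = await readChanges.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    // __$operation: 1 = delete, 2 = insert, 3 = update (before), 4 = update (after).
    var operation = reader.GetInt32(reader.GetOrdinal("__$operation"));
    // ... map the change row into the staging/batch pipeline ...
}
```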

Queueing. For the first stage, we selected a database table as a “message queue” to minimize effort and cost; Azure Storage Queue, Azure Service Bus Queue, or API exposure would not have delivered comparable results for the effort involved.
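
A sketch of the table-as-queue idea, assuming a hypothetical dbo.SyncQueue table: the UPDLOCK and READPAST hints let several workers dequeue concurrently without picking up the same row.

```csharp
// Hedged sketch of the "database table as a message queue" approach. The
// dbo.SyncQueue table and its columns are assumptions; the UPDLOCK and
// READPAST hints let several workers dequeue concurrently without picking
// up the same row.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class TableQueue
{
    private const string DequeueSql = @"
        ;WITH next AS (
            SELECT TOP (1) *
            FROM dbo.SyncQueue WITH (ROWLOCK, UPDLOCK, READPAST)
            WHERE Status = 'Pending'
            ORDER BY Id
        )
        UPDATE next
        SET Status = 'Processing'
        OUTPUT inserted.Id, inserted.Payload;";

    public static async Task<(long Id, string Payload)?> DequeueAsync(
        string connectionString, CancellationToken ct)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync(ct);

        await using var command = new SqlCommand(DequeueSql, connection);
        await using var reader = await command.ExecuteReaderAsync(ct);

        if (!await reader.ReadAsync(ct))
            return null; // nothing pending

        return (reader.GetInt64(0), reader.GetString(1));
    }
}
```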

Finally, there was one more choice to make: whether to use data warehousing or other Azure data-driven technologies. As all the data had a predefined schema, we selected MS SQL Database as the staging storage for the ETL.

Cloud ETL architecture and flow

The final version of our architecture centered around a Kubernetes cluster and several workers, which could be scaled up to manage as many client data sources as we needed. The only limitation was the node capacity of the Kubernetes cluster.

[Cloud ETL architecture diagram]

We also had Azure Key Vault, which provided secure credential management by keeping the keys and passwords for the data stores encrypted. Application Insights allowed us to monitor the app's health, activity, and performance.
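
As a rough sketch, wiring both services into a worker host could look like this; the vault URI is a placeholder, and the example assumes the Azure.Identity, Azure.Extensions.AspNetCore.Configuration.Secrets, and Microsoft.ApplicationInsights.WorkerService packages:

```csharp
// Hedged sketch of wiring Azure Key Vault and Application Insights into the
// worker host. The vault URI is a placeholder; the example assumes the
// Azure.Identity, Azure.Extensions.AspNetCore.Configuration.Secrets, and
// Microsoft.ApplicationInsights.WorkerService NuGet packages.
using System;
using Azure.Identity;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

await Host.CreateDefaultBuilder(args)
    .ConfigureAppConfiguration((context, config) =>
    {
        // Secrets (connection strings, API keys) come from Key Vault,
        // authenticated via managed identity or developer credentials.
        config.AddAzureKeyVault(
            new Uri("https://<your-vault-name>.vault.azure.net/"),
            new DefaultAzureCredential());
    })
    .ConfigureServices(services =>
    {
        // Telemetry for non-HTTP (worker) workloads; the connection string is
        // read from configuration (APPLICATIONINSIGHTS_CONNECTION_STRING).
        services.AddApplicationInsightsTelemetryWorkerService();
        // services.AddHostedService<SyncWorker>(); // the worker from the earlier sketch
    })
    .Build()
    .RunAsync();
```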

The data flow unfolded through the following stages:

  • Data extraction from the source databases to the Snapshot table using Bulk Copy
  • Merging the Snapshot table into the Staging table, so that CDC records the resulting changes (a sketch of these first two stages follows this list)
  • Batch creation based on the CDC data
  • Converting the batches into API requests and sending them to the Data Push endpoints of EnVision
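
Here is a hedged sketch of the first two stages, with illustrative table and column names (dbo.Orders, dbo.OrdersSnapshot, dbo.OrdersStaging); the real schemas and connection handling would differ:

```csharp
// Hedged sketch of the first two stages: bulk-copying source rows into the
// Snapshot table, then merging Snapshot into Staging so that CDC on the
// Staging table records the resulting inserts, updates, and deletes. All
// table and column names here are illustrative assumptions.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class EtlStages
{
    public static async Task ExtractToSnapshotAsync(
        string sourceConnectionString, string stagingConnectionString, CancellationToken ct)
    {
        await using var source = new SqlConnection(sourceConnectionString);
        await using var target = new SqlConnection(stagingConnectionString);
        await source.OpenAsync(ct);
        await target.OpenAsync(ct);

        // Stream the source table into the Snapshot table in bulk.
        await using var select = new SqlCommand(
            "SELECT Id, CustomerId, Amount, UpdatedAt FROM dbo.Orders", source);
        await using var reader = await select.ExecuteReaderAsync(ct);

        using var bulkCopy = new SqlBulkCopy(target)
        {
            DestinationTableName = "dbo.OrdersSnapshot",
            BatchSize = 10_000
        };
        await bulkCopy.WriteToServerAsync(reader, ct);
    }

    public static async Task MergeSnapshotIntoStagingAsync(
        string stagingConnectionString, CancellationToken ct)
    {
        // The MERGE produces the inserts/updates/deletes that CDC then captures.
        const string mergeSql = @"
            MERGE dbo.OrdersStaging AS target
            USING dbo.OrdersSnapshot AS source
               ON target.Id = source.Id
            WHEN MATCHED AND target.UpdatedAt <> source.UpdatedAt THEN
                UPDATE SET CustomerId = source.CustomerId,
                           Amount     = source.Amount,
                           UpdatedAt  = source.UpdatedAt
            WHEN NOT MATCHED BY TARGET THEN
                INSERT (Id, CustomerId, Amount, UpdatedAt)
                VALUES (source.Id, source.CustomerId, source.Amount, source.UpdatedAt)
            WHEN NOT MATCHED BY SOURCE THEN
                DELETE;";

        await using var connection = new SqlConnection(stagingConnectionString);
        await connection.OpenAsync(ct);
        await using var merge = new SqlCommand(mergeSql, connection);
        await merge.ExecuteNonQueryAsync(ct);
    }
}
```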

This is just one example from our portfolio of data solutions. If you'd like to know more about how engineering can make your startup data-driven, let us help you!

