
Cloud ETL: Getting It Right for Data Synchronization

Let’s admit it: real-time data has snowballed from a cool feature into a mainstream requirement. If you want to keep up with the trend, as our clients do, read on to see how we solved a data synchronization challenge between one reporting system and several legacy databases.

Where we started

So, our client wanted to level up their cloud ETL. Our team's task was to integrate a multi-tenant, web-based reporting system with the existing legacy databases (on-premises MS SQL Server). First, we were to create an MVP for early adopters, and then scale it up. The integration would allow us to replicate the existing set of SAP Crystal Reports. After we successfully completed the project, the client achieved the following:

  • Initial synchronization of the data source for a report or a set of reports when onboarding a new customer or introducing a new set of reports. When onboarding a new customer, we need to synchronize twelve data sources, each of which can contain over a million records. Multiple reports can be built on one data source.

  • Subsequent synchronization of the data sources at pre-defined intervals, which ensures near real-time accuracy.
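With sources of over a million records each, the initial load has to be streamed in fixed-size chunks rather than held in memory. Here is a minimal, illustrative sketch of that chunking idea (the function name and chunk size are ours, not from the project):

```python
from typing import Iterable, Iterator, List

def chunked(rows: Iterable[dict], size: int = 10_000) -> Iterator[List[dict]]:
    """Yield fixed-size batches so a million-row source never sits in memory at once."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Example: 25 rows in chunks of 10 produce batches of 10, 10, and 5 rows.
batches = list(chunked(({"id": i} for i in range(25)), size=10))
```

The same pattern applies whether the consumer is a bulk-copy call or an API push: each batch is handed off as soon as it fills up.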

Technical considerations for the cloud ETL project

Orchestration. The initial challenge was how to orchestrate everything properly. The options included Azure Functions, Durable Functions, and Worker Services. We eventually opted for Worker Services because the load was quite predictable: we didn’t need to spin up lots of function instances. We also needed to connect to on-premises databases, which Azure Functions supports only on a Premium plan. Worker Services, on the other hand, run out of the box in Kubernetes, and both the development and DevOps teams had hands-on experience with them.
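The Worker Service pattern boils down to a long-running loop that processes every data source once per tick. The actual implementation was a .NET Worker Service; the sketch below only illustrates the pattern in Python, with made-up source names:

```python
import time
from typing import Callable, List

def run_worker(data_sources: List[str],
               sync_one: Callable[[str], None],
               ticks: int,
               interval_s: float = 0.0) -> None:
    """Worker-service pattern: a long-running loop that synchronizes
    every data source once per tick, then sleeps until the next interval."""
    for _ in range(ticks):
        for source in data_sources:
            sync_one(source)  # one synchronization pass per source
        time.sleep(interval_s)

# Example: record which sources were synced over two ticks.
synced: List[str] = []
run_worker(["crm", "billing"], synced.append, ticks=2)
```

Because the load is predictable, one such loop per worker replica is enough; scaling is a matter of adding replicas, not of spinning up functions on demand.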

Auto-scaling. The application was ready to support KEDA (Kubernetes-based Event Driven Autoscaler), which could scale the workers based on metrics such as the number of pending messages or table rows. However, we decided to integrate HPA or KEDA later; a plain Helm chart specification was enough for the first version.
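When KEDA is wired in, scaling on pending work amounts to a single manifest. A hypothetical ScaledObject using KEDA's `mssql` trigger might look like this (all names, the query, and the threshold are illustrative, not from the project):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: etl-worker-scaler
spec:
  scaleTargetRef:
    name: etl-worker          # the Worker Service Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: mssql
      metadata:
        query: "SELECT COUNT(*) FROM dbo.SyncQueue WHERE Status = 'Pending'"
        targetValue: "1000"   # roughly one replica per 1000 pending messages
        connectionStringFromEnv: MSSQL_CONNECTION_STRING
```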

Change tracking. The candidate approaches were Change Data Capture (CDC), database triggers, exposing a queue or an API from the legacy system, and custom logic. Both the web-based reporting database and the on-premises databases were built on SQL Server, and this was not going to change in the medium term. So, Change Data Capture was selected as the change tracking approach instead of the alternatives.

Queueing. For the first stage, we selected a database table as a “message queue” to minimize effort and cost. With Azure Storage Queue, Azure Service Bus Queue, or API exposure, we wouldn’t have achieved comparable results at that level of effort and cost.
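A table-as-queue needs little more than a status column and an atomic “claim” update. This sketch simulates the idea with SQLite purely for illustration (the schema and names are ours; the project used MS SQL Server):

```python
import sqlite3
from typing import Optional

# Illustrative "database table as a message queue": messages carry a Status
# column, and a worker claims the oldest pending message inside a transaction.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SyncQueue (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    Payload TEXT NOT NULL,
    Status TEXT NOT NULL DEFAULT 'Pending')""")

def enqueue(payload: str) -> None:
    conn.execute("INSERT INTO SyncQueue (Payload) VALUES (?)", (payload,))

def dequeue() -> Optional[str]:
    """Claim the oldest pending message by flipping its status."""
    with conn:  # transaction: the claim is atomic
        row = conn.execute(
            "SELECT Id, Payload FROM SyncQueue "
            "WHERE Status = 'Pending' ORDER BY Id LIMIT 1").fetchone()
        if row is None:
            return None
        conn.execute("UPDATE SyncQueue SET Status = 'Processing' WHERE Id = ?",
                     (row[0],))
        return row[1]

enqueue("sync:customers")
enqueue("sync:invoices")
first = dequeue()  # claims "sync:customers"
```

The trade-off is deliberate: no extra infrastructure for the MVP, with a clean migration path to a dedicated queue service once volumes demand it.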

Finally, there was one more choice to make: whether to use a data warehouse or other Azure data-driven technologies. As all the data had a predefined schema, we selected the MS SQL Database as the staging storage for ETL.

Cloud ETL architecture and flow

The final version of our architecture centered around a Kubernetes cluster and several workers, which could be scaled up to handle as many client data sources as we needed. The only limitation was the number of nodes in the Kubernetes cluster.


We also had Azure Key Vault, which provided secure credential management by encrypting keys and passwords for the data stores. Application Insights allowed us to monitor the app’s health, activity, and performance.

The data flow unfolded through the following stages:

  • Data extraction from the source databases into the Snapshot table using Bulk Copy
  • Merging the Snapshot and Staging tables so that CDC records the changes
  • Batch creation based on the CDC data
  • Converting the batches into API requests and sending them to the Data Push endpoints of EnVision
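Conceptually, the merge step compares the fresh snapshot against the staging table and yields insert/update/delete changes, which are then grouped into batches for the push step. The real pipeline does this with a SQL Server MERGE plus CDC; the following is only a dependency-free sketch of the diff-and-batch logic, with made-up sample rows:

```python
from typing import Dict, List, Tuple

Change = Tuple[str, int, dict]  # (operation, key, row)

def diff(snapshot: Dict[int, dict], staging: Dict[int, dict]) -> List[Change]:
    """Compute the changes that merging the snapshot into staging would produce."""
    changes: List[Change] = []
    for key, row in snapshot.items():
        if key not in staging:
            changes.append(("insert", key, row))      # new in the source
        elif staging[key] != row:
            changes.append(("update", key, row))      # row changed at the source
    for key in staging:
        if key not in snapshot:
            changes.append(("delete", key, staging[key]))  # removed at the source
    return changes

def to_batches(changes: List[Change], size: int) -> List[List[Change]]:
    """Group changes into fixed-size batches for the Data Push step."""
    return [changes[i:i + size] for i in range(0, len(changes), size)]

staging = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
snapshot = {1: {"name": "Anna"}, 3: {"name": "Cid"}}
changes = diff(snapshot, staging)  # one update, one insert, one delete
```

Each batch then becomes one API request to the reporting system, which keeps request sizes bounded regardless of how large a sync cycle turns out to be.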

This is just one example from our portfolio of data solutions. If you'd like to know more about how engineering can make your startup data-driven, let us help you!
