
Cloud ETL: Getting It Right for Data Synchronization

Let’s admit it: real-time data has snowballed from a cool feature into a mainstream requirement. If you want to keep up with the trends, as our clients do, find out how we solved a data synchronization challenge with one reporting system and several legacy databases.

Where we started

So, our client wanted to level up their cloud ETL. Our team's task was to integrate their multi-tenant web-based reporting system with existing legacy databases (on-premises MS SQL Server). First, we were to create an MVP version, something for early adopters, and then we would scale it up. Such an integration would allow us to replicate the existing set of SAP Crystal Reports. After we successfully completed the project, the client gained the following capabilities:

  • Initial synchronization of the data sources for a report or a set of reports when onboarding a new customer or introducing a new set of reports. Onboarding a new customer meant synchronizing twelve data sources, each of which could contain over a million records; multiple reports can be built on a single data source.

  • Subsequent synchronization of the data sources at pre-defined intervals, which keeps the reports accurate in near real time.

Technical considerations for the cloud ETL project

Orchestration. The initial challenge was how to orchestrate everything properly. The options included Azure Functions, Durable Functions, and .NET Worker Services. Eventually, we opted for Worker Services because the load was fairly predictable: we didn't need to spin up large numbers of function instances. We also needed to connect to on-premises databases, which Azure Functions supports only on the Premium plan. Worker Services, on the other hand, run in Kubernetes out of the box, and both the development and DevOps teams had hands-on experience with them.
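To make that choice concrete, here is a minimal sketch of what such a worker can look like, assuming a .NET 6+ Worker Service with implicit usings enabled; the SyncWorker class and the five-minute interval are illustrative placeholders, not the actual project code:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Host bootstrap: register the worker as a hosted service and run it.
Host.CreateDefaultBuilder(args)
    .ConfigureServices(services => services.AddHostedService<SyncWorker>())
    .Build()
    .Run();

// A BackgroundService that wakes up at a fixed interval and runs one
// synchronization pass. The interval here is a placeholder.
public class SyncWorker : BackgroundService
{
    private readonly ILogger<SyncWorker> _logger;

    public SyncWorker(ILogger<SyncWorker> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // PeriodicTimer drives the pre-defined synchronization interval;
        // cancelling the token stops the loop on shutdown.
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(5));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            _logger.LogInformation("Starting synchronization pass");
            // ... pull CDC changes, build batches, push them to the API ...
        }
    }
}
```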

Auto-scaling. The application was ready to support KEDA (Kubernetes-based Event Driven Autoscaler), which could scale the workers based on metrics such as the number of messages in the queue or the number of tables to process. However, we decided to integrate HPA or KEDA later; a Helm chart specification was enough for the first version.

Change tracking. The candidate approaches were Change Data Capture, database triggers, exposing a queue or an API from the legacy system, and custom logic. Both the web-based reporting database and the on-premises databases were built on SQL Server, and this was not going to change in the medium term. So we selected Change Data Capture (CDC), SQL Server's built-in change tracking mechanism, over the other options.
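As an illustration, here is a sketch of how a worker can read CDC changes, assuming a hypothetical dbo.Orders table and the Microsoft.Data.SqlClient package; the connection string is a placeholder, and a real worker would track the last processed LSN instead of always starting from the minimum:

```csharp
// CDC must first be enabled once per database and per table:
//   EXEC sys.sp_cdc_enable_db;
//   EXEC sys.sp_cdc_enable_table @source_schema = 'dbo',
//        @source_name = 'Orders', @role_name = NULL;
using Microsoft.Data.SqlClient;

const string sql = @"
    DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT __$operation, __$start_lsn, *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from, @to, 'all');";

await using var conn = new SqlConnection("<legacy-db-connection-string>");
await conn.OpenAsync();
await using var cmd = new SqlCommand(sql, conn);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    // __$operation: 1 = delete, 2 = insert, 3/4 = update (before/after image).
    var operation = reader.GetInt32(reader.GetOrdinal("__$operation"));
    // ... turn the change row into a queue message ...
}
```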

Queueing. For the first stage, we selected a database table as the “message queue” to minimize effort and cost. Azure Storage Queues, Azure Service Bus Queues, or exposing an API would not have given us a comparable effort-to-cost ratio at that stage.
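The classic way to make a table behave like a queue is to claim rows atomically and let concurrent workers skip rows that are already locked. A minimal sketch, assuming a hypothetical dbo.SyncQueue table with MessageId and Payload columns:

```csharp
// DELETE ... OUTPUT atomically claims a batch of messages, and the READPAST
// hint lets competing workers skip locked rows, so several workers can
// drain the same table safely.
using Microsoft.Data.SqlClient;

const string claimSql = @"
    DELETE TOP (10) FROM dbo.SyncQueue WITH (ROWLOCK, READPAST)
    OUTPUT deleted.MessageId, deleted.Payload;";

await using var conn = new SqlConnection("<staging-db-connection-string>");
await conn.OpenAsync();
await using var cmd = new SqlCommand(claimSql, conn);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    var payload = reader.GetString(reader.GetOrdinal("Payload"));
    // ... process the message; a failed batch can be re-inserted ...
}
```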

Finally, there was one more choice to make: whether to use a data warehouse or other Azure data technologies. As all the data had a predefined schema, we selected MS SQL Database as the staging storage for the ETL.

Cloud ETL architecture and flow

The final version of our architecture centered around a Kubernetes cluster and several workers, which could be scaled out to handle as many client data sources as we needed. The only limitation was the number of nodes in the Kubernetes cluster.


We also used Azure Key Vault, which provided secure credential management by encrypting the keys and passwords for the data stores. Application Insights let us monitor the app's health, activity, and performance.
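For example, a worker can fetch a data-store credential at startup roughly like this, assuming the Azure.Identity and Azure.Security.KeyVault.Secrets packages; the vault URL and secret name are placeholders:

```csharp
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// DefaultAzureCredential works with a managed identity inside AKS,
// so no credentials need to be baked into the worker image.
var client = new SecretClient(
    new Uri("https://<your-vault>.vault.azure.net/"),
    new DefaultAzureCredential());

KeyVaultSecret secret = await client.GetSecretAsync("legacy-db-connection-string");
var connectionString = secret.Value;
```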

The data flow unfolded through the following stages (a sketch of the first two follows the list):

  • Extracting data from the source databases into the Snapshot table using Bulk Copy
  • Merging the Snapshot table into the Staging table so that CDC captures the changes
  • Creating batches based on the CDC data
  • Converting the batches into API requests and sending them to EnVision's Data Push endpoints
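Here is a hedged sketch of the extract and merge stages; the connection strings, table names, and columns are placeholders, not the actual project schema:

```csharp
// Stage 1: bulk-copy rows from a source query into the Snapshot table.
// Stage 2: MERGE Snapshot into Staging so that CDC on the Staging table
// records every resulting insert, update, and delete.
using Microsoft.Data.SqlClient;

await using var source = new SqlConnection("<on-prem-connection-string>");
await using var target = new SqlConnection("<staging-db-connection-string>");
await source.OpenAsync();
await target.OpenAsync();

// Stage 1: extract with SqlBulkCopy, streaming straight from a reader.
await using (var cmd = new SqlCommand("SELECT Id, Payload FROM dbo.SourceTable", source))
await using (var reader = await cmd.ExecuteReaderAsync())
using (var bulk = new SqlBulkCopy(target) { DestinationTableName = "dbo.Snapshot" })
{
    await bulk.WriteToServerAsync(reader);
}

// Stage 2: merge; CDC picks up every change this statement makes.
const string mergeSql = @"
    MERGE dbo.Staging AS t
    USING dbo.Snapshot AS s ON t.Id = s.Id
    WHEN MATCHED AND t.Payload <> s.Payload THEN UPDATE SET t.Payload = s.Payload
    WHEN NOT MATCHED BY TARGET THEN INSERT (Id, Payload) VALUES (s.Id, s.Payload)
    WHEN NOT MATCHED BY SOURCE THEN DELETE;";
await using (var merge = new SqlCommand(mergeSql, target))
{
    await merge.ExecuteNonQueryAsync();
}
```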

This is just one example from our portfolio of data solutions. If you'd like to know more about how engineering can make your startup data-driven, let us help you!

