Where we started
So, our client wanted to level up their cloud ETL. Our team's task was to integrate the multi-tenant web-based reporting system with existing legacy databases (on-premises MS SQL Server). First, we were to build an MVP for early adopters, and then scale it up. The integration would let the client replicate their existing set of SAP Crystal Reports. After we successfully completed the project, the client achieved the following:
- Initial synchronization of the data source for a report or a set of reports when onboarding a new customer or introducing a new set of reports. Onboarding a new customer meant synchronizing twelve data sources, each of which could hold over a million records; multiple reports can be built on a single data source.
- Subsequent synchronization of the data sources at pre-defined intervals, keeping report data accurate in near real time.
Technical considerations for the cloud ETL project
Orchestration. The initial challenge was how to orchestrate everything properly. The options included Azure Functions, Durable Functions, and .NET Worker Services. We eventually opted for Worker Services because the load was quite predictable: we didn't need to spin up many function instances on demand. We also needed to connect to on-premises databases, which Azure Functions supports only on a Premium plan. Worker Services, on the other hand, run in Kubernetes out of the box, and both the development and DevOps teams had hands-on experience with them.
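The project itself was built on .NET Worker Services; the Python sketch below only illustrates the shape of such a worker, a long-running loop that Kubernetes keeps alive and restarts on failure. The data sources and interval here are hypothetical:

```python
import logging
import time

SYNC_INTERVAL_SECONDS = 300  # hypothetical pre-defined interval


def sync_data_source(source_id: str) -> None:
    """Placeholder for one synchronization pass over a data source."""
    logging.info("Synchronizing data source %s", source_id)
    # ... extract, merge, batch, push ...


def main() -> None:
    logging.basicConfig(level=logging.INFO)
    data_sources = ["orders", "invoices", "customers"]  # hypothetical list

    # A Worker Service is essentially this: a long-running process that the
    # orchestrator keeps alive, cycling through the tenant's data sources.
    while True:
        for source_id in data_sources:
            try:
                sync_data_source(source_id)
            except Exception:
                logging.exception("Sync failed for %s; retrying next cycle", source_id)
        time.sleep(SYNC_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```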
Auto scaling. The application was ready to support KEDA (Kubernetes Event-driven Autoscaler), which can scale workers based on metrics such as the number of pending messages or table rows. However, we decided to integrate HPA or KEDA later; a plain Helm chart specification was enough for the first version.
Change tracking. The candidate approaches were Change Data Capture, database triggers, exposing a queue or API from the legacy system, and custom logic. Both the web-based reporting database and the on-premises databases were built on SQL Server technology, and this was not going to change in the medium term, so Change Data Capture (CDC) was selected over the alternatives.
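On SQL Server, CDC is enabled per database and then per table, after which accumulated changes can be read through generated table-valued functions. A minimal sketch with Python and pyodbc, where the connection string and the `Staging_Orders` table are hypothetical:

```python
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=legacy-sql.example.local;DATABASE=Reporting;"  # hypothetical
    "Trusted_Connection=yes;"
)


def enable_cdc(conn: pyodbc.Connection) -> None:
    # One-time setup: enable CDC for the database, then for each tracked table.
    cur = conn.cursor()
    cur.execute("EXEC sys.sp_cdc_enable_db")
    cur.execute(
        "EXEC sys.sp_cdc_enable_table "
        "@source_schema = N'dbo', @source_name = N'Staging_Orders', "
        "@role_name = NULL"
    )
    conn.commit()


def read_changes(conn: pyodbc.Connection):
    # Read all changes recorded for the table so far. The function name is
    # generated by SQL Server from the capture instance (schema_table).
    cur = conn.cursor()
    cur.execute(
        """
        DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Staging_Orders');
        DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
        SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Staging_Orders(@from_lsn, @to_lsn, N'all');
        """
    )
    return cur.fetchall()


if __name__ == "__main__":
    with pyodbc.connect(CONN_STR) as conn:
        enable_cdc(conn)
        for row in read_changes(conn):
            print(row)
```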
Queueing. For the first stage, we used a database table as a “message queue” to minimize effort and cost. Azure Storage Queues, Azure Service Bus Queues, or exposing an API couldn't match that simplicity at comparable cost.
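A common SQL Server pattern for a table-as-queue is an atomic dequeue with the READPAST hint, so competing workers skip rows that another worker has already locked. A sketch, assuming a hypothetical `dbo.SyncQueue` table with `Id`, `Status`, and `Payload` columns:

```python
import pyodbc

DEQUEUE_SQL = """
;WITH next_msg AS (
    SELECT TOP (1) *
    FROM dbo.SyncQueue WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE Status = 'Pending'
    ORDER BY Id
)
UPDATE next_msg
SET Status = 'Processing'
OUTPUT inserted.Id, inserted.Payload;
"""


def dequeue(conn: pyodbc.Connection):
    # UPDLOCK + READPAST guarantees a message is handed to exactly one worker:
    # locked rows are skipped instead of blocking the competing reader.
    cur = conn.cursor()
    row = cur.execute(DEQUEUE_SQL).fetchone()
    conn.commit()
    return row  # None when the queue is empty
```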
Finally, there was one more choice to make: whether to use a data warehouse or other Azure data technologies. Since all the data had a predefined schema, we selected an MS SQL database as the staging storage for the ETL.
Cloud ETL architecture and flow
The final version of our architecture centered on a Kubernetes cluster running several workers, which could be scaled out to handle as many client data sources as needed. The only limit was the node capacity of the Kubernetes cluster.
We also used Azure Key Vault, which provided secure credential management by encrypting the keys and passwords for our data stores. Application Insights let us monitor the app's health, activity, and performance.
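For illustration, fetching a data-store credential from Key Vault takes only a few lines with the Azure SDK for Python; the vault URL and secret name below are made up:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up the pod's managed identity in Kubernetes
# (or the developer's credentials when running locally).
client = SecretClient(
    vault_url="https://etl-vault.vault.azure.net",  # hypothetical vault
    credential=DefaultAzureCredential(),
)
staging_conn_str = client.get_secret("staging-db-connection-string").value
```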
The data flow unfolded through the following stages (sketched in code after the list):
- Extracting data from the source databases into the Snapshot table using Bulk Copy
- Merging the Snapshot table into the Staging table so that CDC records the changes
- Creating batches from the CDC data
- Converting the batches into API requests and sending them to EnVision's Data Push endpoints
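Putting the stages together, one synchronization pass for a single data source might look like the sketch below. The actual project was built on .NET; this Python version is illustrative only, and the table names, columns, MERGE key, and EnVision endpoint URL are all assumptions:

```python
import pyodbc
import requests

BATCH_SIZE = 1000  # hypothetical batch size
PUSH_URL = "https://envision.example.com/api/data-push"  # hypothetical endpoint

MERGE_SQL = """
MERGE dbo.Staging_Orders AS target
USING dbo.Snapshot_Orders AS source ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Amount = source.Amount, target.UpdatedAt = source.UpdatedAt
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Amount, UpdatedAt) VALUES (source.Id, source.Amount, source.UpdatedAt)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
"""


def sync_once(src: pyodbc.Connection, dst: pyodbc.Connection) -> None:
    # Stage 1: bulk-copy source rows into the Snapshot table.
    # (A production version would stream rows rather than load them all.)
    rows = src.cursor().execute(
        "SELECT Id, Amount, UpdatedAt FROM dbo.Orders").fetchall()
    cur = dst.cursor()
    cur.fast_executemany = True  # pyodbc's bulk-insert mode
    cur.execute("TRUNCATE TABLE dbo.Snapshot_Orders")
    cur.executemany(
        "INSERT INTO dbo.Snapshot_Orders (Id, Amount, UpdatedAt) VALUES (?, ?, ?)",
        [tuple(r) for r in rows])

    # Stage 2: merge Snapshot into Staging; CDC on Staging records the deltas.
    cur.execute(MERGE_SQL)
    dst.commit()

    # Stages 3-4: batch the CDC changes and push them to EnVision.
    # (CDC capture is asynchronous; a real worker would wait for the capture
    # job to catch up and track the last-processed LSN between runs.)
    cur.execute("""
        DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Staging_Orders');
        DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
        SELECT __$operation, Id, Amount
        FROM cdc.fn_cdc_get_all_changes_dbo_Staging_Orders(@from_lsn, @to_lsn, N'all');
    """)
    while True:
        batch = cur.fetchmany(BATCH_SIZE)
        if not batch:
            break
        payload = [{"op": op, "id": id_, "amount": float(amount)}
                   for op, id_, amount in batch]
        requests.post(PUSH_URL, json=payload, timeout=30).raise_for_status()
```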
This is just one example from our portfolio of data solutions. If you'd like to know more about how engineering can make your startup data-driven, let us help you!
Tap our years of expertise in Fintech solution development.
→ Discover INSART