How to optimize data analysis with AI and ML: Interview with Andrei Girenkov, Tech Expert at INSART

In this talk, Andrei Girenkov, Tech Expert at INSART and Principal Technology Strategy Consultant at Mizuho, shares his perspective on leveraging AI and ML to navigate the complexities of big data analysis in the financial sector.

From current trends and challenges in FinTech to the intricacies of AI algorithms and combating bias in models, Andrei speaks from his own experience. Find out what the prospects of AI startups are and where the potential of industry-specific generative AI solutions lies, according to Andrei.

Trends and challenges in AI and ML for big data

— Andrei, what are the current trends in FinTech in terms of using AI and ML for big data analysis? What will go big and what will move out of the spotlight in the next few years? 

— ML for big data analysis has been around for a while. Typical use cases include anomaly detection for compliance purposes, such as detecting atypical (insider) trading patterns, and predictive algorithms (e.g., for stock valuation), which typically supplement more traditional quantitative statistical techniques. We also see ML applied to more general problems, such as marketing spend optimization and facial image recognition for biometrics.

What’s particularly fascinating is the first wave of real-world Gen AI use cases, such as summarizing research on various investments. These have been enabled by industry-specific LLMs.

— I bet some use cases you named come from your professional experience. Could you share some challenges you had to overcome and insights you gained along the way?

— Sure. For instance, lead scoring of prospects was one of the things the team and I wanted to implement when I worked at an insurance company. We had hundreds of thousands of potential insurance clients in our lead book, but the question was: whom should we prioritize? Who is most likely to buy the most profitable products and stay with the company the longest?

From a rules-based point of view, we could figure some things out and make broad recommendations just by looking at our prior history. But when we tried to get to a granular level and assign a dollar value to a specific prospective client, we realized we did not have the necessary data. Of course, we had the raw billing data for particular policies, but we didn’t track changes over time, could not easily tie multiple policies back to individual clients, could not tell who our oldest clients were, and so on. We had to spend about a year setting up the right data collection practices, so that a year later we could say: OK, now we have some track record of our client lifetime value. This was probably not out-of-this-world big data, but 5 million customers is still a decent size.

What I learned more than once is that companies have this idea: “Data just exists somewhere in our system; AI will figure it out.” But it doesn’t work this way. You need to plan ahead and capture, in a structured manner, the data that will let you draw conclusions later.

— What would you name the key and less obvious pros and cons of using AI and ML for big data analysis in FinTech?

— Obviously, the big pro is the ability to detect patterns in large data sets, and to analyze, condense, and report on those data sets at scale.

The other side is not so much a con as an eyes-wide-open acknowledgement that getting into AI implementation is a long-term commitment. You must either staff up or find qualified partners to build, deploy, and maintain models over time. Data sources naturally change and drift; biases can be introduced into the models over time and must be monitored. Supporting roles such as AIOps need to be created. It’s an expensive commitment.
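
As an illustration of what monitoring for drift can look like, here is a minimal Python sketch of one common check, the population stability index (PSI), which compares how a feature is distributed at training time with how it is distributed today. The feature values, the bin edges, and the function names are all hypothetical, and this is only one of several ways to watch for drift, not a method described in the interview.

```python
import math

def bin_shares(values, edges):
    """Share of values falling into each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The bin index is the number of edges the value exceeds.
        counts[sum(v > e for e in edges)] += 1
    # A small floor avoids log(0) when a bin is empty.
    return [max(c / len(values), 1e-6) for c in counts]

def population_stability_index(baseline, recent, edges):
    """Sum of (new - old) * ln(new / old) over all bins; 0 means no shift."""
    base = bin_shares(baseline, edges)
    new = bin_shares(recent, edges)
    return sum((n - b) * math.log(n / b) for b, n in zip(base, new))

# Hypothetical feature values (e.g., transaction amounts) then and now.
baseline = [10, 12, 15, 20, 22, 25, 30, 35, 40, 45]
recent = [30, 35, 38, 40, 42, 45, 50, 55, 60, 65]
edges = [20, 40]  # assumed bin boundaries

print(population_stability_index(baseline, baseline, edges))  # identical data gives 0.0
print(population_stability_index(baseline, recent, edges))    # larger value = more drift
```

Running a check like this on a schedule, per feature, is one way to turn "must be monitored" into an alert that prompts retraining or a closer look.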

Optimizing big data analysis processes with ML

— How does one choose the right algorithm? I know that much depends on the individual purpose, but what would be some general recommendations to hit the right button for one’s case?

— There are some common classes of algorithms that are traditionally applied to specific types of problems. For example, anomaly detection (e.g., for compliance) can be achieved with any number of unsupervised learning techniques; the algorithm classifies transactions into normal and abnormal based on clustering and flags anomalies for manual review.
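
A minimal Python sketch of the idea, under simplifying assumptions: instead of a full clustering algorithm, it treats the median as the centre of a single "normal" cluster and flags values that sit unusually far from it, using the median absolute deviation (MAD) as the yardstick. The trade sizes and the 3.5 threshold are illustrative assumptions, not figures from the interview.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return indices of values whose robust z-score exceeds the threshold."""
    centre = median(amounts)
    mad = median([abs(a - centre) for a in amounts])
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - centre) / mad > threshold]

# Hypothetical trade sizes; no labels are needed, which is what makes this unsupervised.
trade_sizes = [100, 105, 98, 102, 97, 101, 5000, 99, 103, 100]
suspicious = flag_anomalies(trade_sizes)
print(suspicious)  # [6]: the 5000 trade is flagged for manual review
```

In a real compliance setting the flagged items would go to a human reviewer, as described above, and the features would be richer than a single amount.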

On the other hand, predictive models depend on supervised or semi-supervised algorithms, which allow the model to learn which patterns lead to more desired outcomes. There are several different training techniques. Typically, the labeled data set is divided into a training set and a test set, which allows ML scientists to try different training algorithms on the same data and see which achieves the best result on the held-out portion.

Gen AI use cases come with their own set of challenges. Usually, one starts with an LLM, hopefully pre-trained for a specific industry, and then refines the results through techniques such as fine-tuning, retrieval-augmented generation, and prompt engineering.
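
The split-and-compare workflow can be sketched in a few lines of Python. Everything here is an assumption for illustration: the synthetic one-feature data set, the 75/25 split, and the two deliberately simple candidate models (a midpoint threshold and a one-nearest-neighbour lookup).

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle labeled rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Candidate 1: cut halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

def fit_nearest_neighbour(train):
    """Candidate 2: copy the label of the closest training example."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Synthetic labeled data: one numeric feature, label 0 or 1.
rng = random.Random(42)
data = ([(rng.gauss(0, 1), 0) for _ in range(100)]
        + [(rng.gauss(3, 1), 1) for _ in range(100)])

train, test = holdout_split(data)
candidates = {"threshold": fit_threshold, "nearest_neighbour": fit_nearest_neighbour}
# Every candidate is trained on the same rows and scored on rows it has never seen.
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
print(scores)
```

Scoring on held-out rows is the point of the split: a model that merely memorizes its training data would look perfect on those rows and tell you nothing about new ones.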

— What is the most effective way to fight bias in models?

— Engineering against bias has to be designed in from the beginning, right from the selection of the training population. We need to be conscious that sometimes bias is inherent in the way society and business are structured, and that those variables need to be explicitly accounted for: whatever we train the model on, it will learn. An obvious example is racial bias in facial recognition models. If you train a model primarily with images of a single ethnic group, it will not be accurate for the general population. A less obvious one is the prediction of success in education. If one trains a model to predict educational success without compensating for factors such as socio-economic background, the model will generate a flawed prediction of raw potential.

Even when it is not based on a discriminatory characteristic, bias can still distort results and make them less useful. For instance, in car safety prediction, if we don’t take care to select a truly random sample, we can inadvertently teach the model that, say, green cars are less safe. This nonsensical result can occur if the training algorithm runs across a statistically significant cluster of green cars in the “unsafe” group. So, if the data wasn’t sufficiently randomized, or the sample collected wasn’t representative of the population, you can be left with the wrong conclusion: a correlation between unrelated variables.
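
The green-car effect is easy to reproduce with made-up numbers. In the Python sketch below, the population is constructed so that colour and safety are unrelated (10% of cars are unsafe in both groups); the skewed sample is a hypothetical collection process that happens to over-collect unsafe green cars. All counts are invented for illustration.

```python
import random

def unsafe_rate(cars, colour):
    """Fraction of cars of the given colour that are marked unsafe."""
    group = [unsafe for c, unsafe in cars if c == colour]
    return sum(group) / len(group)

# Synthetic population: 10% unsafe among green cars and among all other cars.
population = ([("green", True)] * 20 + [("green", False)] * 180
              + [("other", True)] * 80 + [("other", False)] * 720)

# A random sample should roughly preserve the rates in both groups.
random_sample = random.Random(0).sample(population, 300)

# A skewed sample: every unsafe green car, but only a few of the safe green ones.
biased_sample = ([("green", True)] * 20 + [("green", False)] * 20
                 + [("other", True)] * 20 + [("other", False)] * 180)

for name, cars in [("population", population),
                   ("random sample", random_sample),
                   ("biased sample", biased_sample)]:
    print(name, unsafe_rate(cars, "green"), unsafe_rate(cars, "other"))
```

In the biased sample, half of the green cars are unsafe against a tenth of the others, so a model trained on it would learn a colour effect that does not exist in the population.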

AI for big data analysis: training, testing, and validation

— How do you prepare the data to ensure the quality and reliability of the insights derived from data analysis? What role can AI and ML play in this process?

— Data quality is probably the number-one obstacle to adopting AI, particularly in legacy organizations. With so many people being exposed to advanced NLP and Gen AI tools such as ChatGPT, there is an impression that AI can make sense of disorganized inputs. The reason ChatGPT can do that with human speech is that it has been trained on billions of online records such as Wikipedia articles.

Most AI problems deal with analyzing business data, not with parsing natural speech, and these types of models need to be trained on well-structured data. For example, a common application of AI is lead scoring of prospects to maximize their lifetime value to the business. To train that model, you need to capture the lifetime value (or the raw data to calculate it) of a large population of existing customers. The model then sorts your customers into high, medium, and low-value groups and answers the question: does the prospect look more like one of these groups than the others? None of that is possible if data is not consistently captured up front.
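
A toy version of that lead-scoring logic can be sketched in Python. The customer records, the two features (age and annual income in thousands), and the nearest-centroid rule are all assumptions made for illustration; a production model would use many more features, scale them, and be validated on held-out data.

```python
from statistics import mean

# Hypothetical existing customers: (age, annual_income_k, lifetime_value)
customers = [
    (25, 40, 1000), (30, 45, 1200), (28, 42, 900),
    (40, 70, 5000), (45, 75, 5500), (42, 72, 4800),
    (55, 120, 15000), (60, 130, 16000), (58, 125, 15500),
]

def value_groups(rows):
    """Sort customers by lifetime value and cut them into three equal groups."""
    ranked = sorted(rows, key=lambda r: r[-1])
    third = len(ranked) // 3
    return {"low": ranked[:third],
            "medium": ranked[third:2 * third],
            "high": ranked[2 * third:]}

def centroids(groups):
    """Average feature vector (lifetime value excluded) for each group."""
    return {name: tuple(mean(col) for col in zip(*[r[:-1] for r in members]))
            for name, members in groups.items()}

def score_prospect(prospect, centres):
    """Return the value group whose centroid is closest to the prospect."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centres, key=lambda name: squared_distance(prospect, centres[name]))

centres = centroids(value_groups(customers))
print(score_prospect((57, 118), centres))  # high
print(score_prospect((27, 41), centres))   # low
```

Note that the lifetime value column is what makes the grouping possible in the first place, which is the practical reason it has to be captured consistently before any model can be trained.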

One of the mantras of any tech organization should be to capture as much accurate data about relevant company transactions as possible to pave the way for future machine learning opportunities.

On a final note: the future value of AI startups

— What about the future of Gen AI? Y Combinator's winter startup cohort included a platform for simplifying the building of industry-specific generative AI solutions. How do you see the future of AI in FinTech where AI solutions become a mass product?

— I’ll answer a slightly different question: where I see value in AI startups. If you want to go out and train models from scratch, it’s costly. You need a lot of data, a lot of compute power, and a lot of time. Training a single model can take months.

I think the real opportunity, and the real shortage that’s out there, is pre-trained industry-specific models. ChatGPT and other openly available, public-facing tools set a high bar for expectations. But those models are trained for general purposes and on vast natural language data sets: Wikipedia articles, online media, etc. These models are good at understanding colloquial speech but often fail to get industry-specific jargon, like medical or financial terminology, because the general public does not really use it. So, for me, the gold rush that will be happening is companies building next-level scaffolding. The solutions that individual companies build will have to sit on top of large industry-specific language models. So, there will be a couple of companies that create AI-as-a-service, or as a product, for the industry.

Interview by Svietoslava Myloradovych, Content Writer at INSART

Andrei Girenkov, Tech Expert at INSART

Andrei is a technology and business executive with a track record of leading enterprise transformations. His current roles include Principal Technology Strategy Consultant at Mizuho and Advisory Board Member at two prestigious U.S. education institutions, including Carnegie Mellon University, where Andrei graduated from a program called “Executive Education — Applied Artificial Intelligence.”
