Enhancing Fraud Detection with Synthetic Data Generation (GenAI)

Insurance companies are always looking for ways to improve their operations while delivering quality coverage to consumers. Incorporating cutting-edge Artificial Intelligence (AI) into fraudulent claims investigations presents a promising opportunity. MAPFRE has developed an AI-driven fraud detection process to assist its Claims team in boosting the accuracy and efficiency of fraud detection. However, imbalanced historical datasets posed a significant challenge in training the AI fraud models. To address this, MAPFRE utilized Generative AI to generate synthetic data. In this session, Mireia Rojo, VP of Advanced Analytics, will illustrate how to apply GenAI to create synthetic data, and how synthetic data was employed in the Homeowners Fraud Detection Project to amplify the predictive capabilities of AI models, thereby bolstering the efficacy and efficiency of fraud detection efforts.


Transcription:

Mireia Rojo Arribas (00:08):
Well, hello everyone. Welcome to this interesting topic: Enhancing Fraud Detection with Synthetic Data Generation, synthetic data generated by applying generative artificial intelligence. I'm Mireia Rojo Arribas, VP of Advanced Analytics at MAPFRE Insurance. So let's just start.

(00:32):
Let's first look at the agenda. We are going to start with what you are going to take away from this session. After that, we are going to introduce the challenge and the solution. The challenge is that when we try to apply artificial intelligence to fraud detection, we have far fewer samples of fraudulent cases, and that's a problem because we need our models to learn from historical data. So what's the solution? We want to augment our data, and how? With synthetic data. We are going to delve into those details in the presentation, and we are going to wrap up with some conclusions. Let's just start with the takeaways.

(01:11):
Three takeaways for today, and I know I've been talking about enhancing fraud detection with artificial intelligence, then generative AI, then synthetic data a lot, so let's just start. First, you're going to discover how to apply artificial intelligence to detect additional fraud in a more efficient way. Second, you are going to see a solution that applies generative artificial intelligence to create synthetic data. And finally, you are going to learn about synthetic data and how to apply it to additional insurance problems. All the numbers that you're going to see in the presentation are not actuals, except in one slide, which I will mention later, but they really represent the reality of the problem we have solved. So let's just start. Let me ask: how many of you are carriers here?

(02:08):
Carriers? Okay. You're all trying to improve your operations while providing coverage to your customers, and if you're in consultancy, probably too. And if you're trying to improve your operations, what is one of the challenges that you're seeing? I'm sure you're seeing huge losses, significant losses, because of fraud. And this is not a joke: if we look at 2022 Coalition Against Insurance Fraud data, which is public data you can find, on an annual basis we waste $308.6 billion on fraud, with property and casualty being about 10% of it. How does that sound? It sounds like a real challenge, right? So what can we do? One solution is: can we increase our fraud detection and can we make our operation more efficient? By making our operation more efficient we mean: instead of having our adjusters look at 100% of our claims, can they look at 20%, and in that 20% find probably 90% of the fraudulent cases, plus find additional fraud?

(03:10):
That would be great. So that's what we are intending to do here, and that's what we have done at MAPFRE. But what has been the challenge? The challenge is that when we were doing this exercise of developing these machine learning models to help our adjusters, we found that we had very few samples of fraud cases in our historical data. It can be because of data governance, because we didn't have all that fraud, or because we didn't catch it. That means our data was very imbalanced. So what was the solution, and what are we going to see today? Well, we generated synthetic data using our own data. That's the magic here: you use your data, you learn from it, you generate synthetic data, and you pool all of it together. You train your artificial intelligence models, and let me tell you, these models are today in production and they are driving good results in our P&L. So let's just start with the AI models.

(04:08):
We have three steps when we want to develop artificial intelligence models. The first one is we go and grab our historical data and assemble our dataset. As you can see here, we have an indicator of which claim was fraudulent and which claim was not. And if we focus on home, we can have, for example, how long the policy had been with us, the tenure. Was there a gold ring involved in that claim? Was the claim reported more than 10 days after the loss happened? These are just some examples, but in order to apply synthetic data here we have selected fewer than 100 variables, which is more than enough. After that, you start trying to understand and analyze your data. And as I said before, we had a challenge: we didn't have enough data. I'm talking about having only a 0.24% fraud ratio, and this is real data.

(04:57):
Do you think that's enough? No, that's actually insufficient. It means that our data is very imbalanced, so we couldn't develop a proper artificial intelligence model with that data. Now, you might be wondering: what kind of results were you getting? Was it that bad? We will jump into it in the next slide, but first let me tell you that we took our historical data and split it into train and test: 80% of it is training data, you train your model with it, and with the remaining 20% you test how things are going. So the results I'm going to share in the next slide, which are real results, represent the test dataset. It means the model hadn't seen that data before.
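As a rough illustration of the split-and-evaluate step just described, a minimal sketch in Python might look like the following, assuming a pandas DataFrame of historical claims with a binary `is_fraud` label; the file name, column name, and classifier are illustrative assumptions, not the production pipeline:

```python
# Minimal sketch: 80/20 split of historical claims and recall on the held-out 20%.
# File name, column names and the classifier are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

claims = pd.read_csv("claims_history.csv")          # historical claims: features + fraud label
X = claims.drop(columns=["is_fraud"])
y = claims["is_fraud"]

# 80% to train, 20% held out so the reported results reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)

# Recall = share of the truly fraudulent claims that the model captures.
print(f"Recall on test set: {recall_score(y_test, model.predict(X_test)):.2%}")
```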

(05:43):
So let's take a look at the table, the imbalanced real data column, the second column please, and go to the third row, which says recall. When you develop artificial intelligence models, something you measure is recall. Recall represents which percentage of the fraudulent cases you were able to capture. And with our imbalanced data, remember the fraud rate was only 0.24%, we were only able to capture 62.5% of the cases. Of course, we didn't even go to Claims with that, because I knew the first question they were going to ask me was: where is the remaining 37.5%? That's totally insufficient. If you want the adjusters to really trust your model and only look at the 10 or 20% of claims you refer to them, you need to capture at least 90%. So what we did was create synthetic data, as mentioned, pool it together with our historical data, and train models. And what was the recall?

(06:43):
What was the percentage of fraud that we were able to capture? 93.75%, very acceptable, right? Now we are good. You may also be wondering how this translates into economic impact. Well, let's assume, and this is just one example, this part of the data is not real, that we have 100 fraudulent claims and we save, if it's home property, $10,000 per claim. We would be talking about moving from $625,000 to $937,500, roughly $310,000 more, just like that. That's enough to justify, considering that our real volume is much larger, that we really needed synthetic data in this case. So now you have seen how we built the models applying artificial intelligence to fraud. You might be wondering: well, you're talking about synthetic data, but how much synthetic data do you create? How do you create it? How do you know it's good? No worries, we are going to see it now.
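The back-of-the-envelope impact figure mentioned above can be written out directly; the claim count and per-claim saving below are the illustrative values from the slide, not actuals:

```python
# Illustrative impact arithmetic using the example figures from the slide (not actuals).
n_fraud_claims = 100            # assumed fraudulent claims in the book
saving_per_claim = 10_000       # assumed saving per detected home claim, in USD

recall_imbalanced = 0.625       # recall with the imbalanced real data only
recall_augmented = 0.9375       # recall after adding synthetic data

savings_before = n_fraud_claims * saving_per_claim * recall_imbalanced   # $625,000
savings_after = n_fraud_claims * saving_per_claim * recall_augmented     # $937,500
print(f"Additional savings: ${savings_after - savings_before:,.0f}")     # ~$312,500
```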

(07:45):
So when you are going to create synthetic data, let's start with the definition. Synthetic data is information you create based on your real data. You need good data for that, so please, everyone, ensure your data has the right quality. And this synthetic data is created by applying generative artificial intelligence. So how much data did we have to generate? In this case we put some examples, these are not real numbers: let's say we have 41,000 claims, and maintaining our fraud rate, which is the real one, 0.24%, that would mean we have about 100 fraudulent claims. And as we said before, we split our data into train and test: 80% here to train and 20% here to test. But now, where is the magic part?

(08:39):
Where is the magic of synthetic data? You have to test how much synthetic data to create. We tested several sizes: half of the original sample, double, five times more, ten times more. In the end, half of it was what worked best. But you also have to test what fraud rate to create in the synthetic data. We tested creating 1%, 2%, 5%, 10%. And you might be wondering, did 10% work best? No, not necessarily. It really depends on the quality and the type of characteristics you have in your data. In our case, what worked better was creating 2%. So there you have roughly 20,000 non-fraudulent claims and 415 fraudulent claims. And surprise: we didn't only have to add fraudulent claims, we didn't only have to augment the sample of fraudulent claims; creating both fraudulent and non-fraudulent claims produced better results. With that we put synthetic data and real data together and we were able to train a good model with a 0.91% fraud rate. We didn't have to augment it to 10 or 20%; no, that wasn't necessary. A 0.91% fraud rate is what provided better results.
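The sizing exercise just described boils down to two knobs, how many synthetic rows to generate and what fraud rate to put inside them, and the grid can be swept with a few lines. A small sketch follows; the counts are the illustrative figures from the talk and the helper function is hypothetical, introduced only for illustration:

```python
# Sketch of the sizing sweep: synthetic sample size (as a multiple of the real
# sample) and the fraud rate inside the synthetic sample. Figures are illustrative.
def synthetic_counts(n_real: int, size_multiplier: float, fraud_rate: float):
    n_synthetic = int(n_real * size_multiplier)
    n_fraud = int(n_synthetic * fraud_rate)
    return n_synthetic - n_fraud, n_fraud            # (non-fraudulent, fraudulent)

n_real = 41_000                                      # illustrative real sample size
for multiplier in (0.5, 1.0, 2.0, 5.0, 10.0):        # sample sizes tested in the talk
    for rate in (0.01, 0.02, 0.05, 0.10):            # fraud rates tested in the talk
        n_legit, n_fraud = synthetic_counts(n_real, multiplier, rate)
        # In practice: generate this mix, pool it with the real data, retrain, compare recall.
        print(f"x{multiplier:>4} at {rate:.0%} fraud -> {n_legit:>6} legit, {n_fraud:>4} fraud rows")
```

For example, half the original sample at a 2% fraud rate gives roughly 20,000 non-fraudulent and about 410 fraudulent synthetic rows, in line with the illustrative figures above.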

(10:00):
So now you have seen how much data. My tips for you, if you're going to create synthetic data and you want to know how much: test the number of samples, and test the incidence of fraud, or of whatever problem you are tackling. Let's imagine you also want to predict whether there will be litigation on a claim. Test with several numbers, 1%, 2%, 5%, and from there you will find a solution. And now you are all curious to know: how did we create this data? Well, we've said that we were using generative artificial intelligence, and what we actually used, because we had tabular data, is CTGANs, which you can see here: conditional tabular generative adversarial networks. These algorithms are pretty complex inside but very easy to use, I'd say. When you're using these algorithms, you have to think about three steps.

(11:02):
The first one: how much data do I want to generate, how many samples? You have to tell the algorithm. Second: what is my fraud rate, in this case, how much fraud do I want in the synthetic data? And third: do I have historical data so the algorithm can learn from it? These are the three key pieces. So let's see the flow of the algorithm. We tell the algorithm: hey, generate 21,000 samples, as we were creating. The algorithm starts by creating random data that follows a normal distribution. From there the algorithm uses what is called the generator, a neural network it has inside, to try to mimic your data, in this case producing 21,000 samples that will be similar to your data. Now, in the third step, in the first matrix, we have data that is synthetic. So we grab our real data, we compare, and we go to another network inside the GAN, which is the discriminator. What the discriminator does is say: okay, you have put together real data and synthetic data.

(12:13):
Can I tell what is synthetic and what is real? The process iterates as many times as needed until, in this case for example, what was synthetic the algorithm thinks is real. You can even go and tell it to iterate 1,000 times, or as many as possible, but the idea is to let the algorithm optimize. So that is how the generative adversarial network works, and that's how we generated the synthetic data. Now let's go to the final step here: how did we know if the synthetic data was reliable?
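In code, those three inputs (how many samples, what fraud rate, and the historical data to learn from) map fairly directly onto a CTGAN library call. A minimal sketch using the open-source `ctgan` package, which is one common CTGAN implementation; the talk does not name the specific library, and the file and column names are illustrative:

```python
# Sketch of synthetic claim generation with the open-source `ctgan` package.
# The talk does not name the exact library; file and column names are illustrative.
import pandas as pd
from ctgan import CTGAN

real_claims = pd.read_csv("claims_history.csv")      # input 3: historical data to learn from

# Categorical/boolean columns have to be declared so the GAN models them correctly.
discrete_columns = ["is_fraud", "reported_after_10_days", "gold_ring_involved"]

model = CTGAN(epochs=300)                             # the adversarial loop runs inside fit()
model.fit(real_claims, discrete_columns)

# Inputs 1 and 2: how many samples, and how many of them fraudulent.
# (Conditional sampling on a column value is available in recent ctgan versions.)
synthetic_fraud = model.sample(415, condition_column="is_fraud", condition_value=1)
synthetic_legit = model.sample(20_000, condition_column="is_fraud", condition_value=0)

# Pool real and synthetic rows to form the augmented training set.
augmented = pd.concat([real_claims, synthetic_fraud, synthetic_legit], ignore_index=True)
```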

(12:54):
So the first thing I did was select several variables from our sample. In this case we selected whether the claim was reported more than 10 days after the loss happened, whether there was a gold ring involved, the tenure, how long that customer had been with us, and then of course whether the claim was fraudulent or not. And the first thing we did was grab real data, and I compared the distribution for fraudulent claims versus non-fraudulent claims for the variable reported after 10 days. What we can see here is that for the fraudulent claims, 80% of the cases were not reported after 10 days, while 20% were reported after 10 days. And then we also look at the non-fraudulent claims, which have 70% and 30% respectively.

(13:50):
So now let's take only the first two columns, fraudulent claims and the variable we are looking at, and compare it with the synthetic data claim distribution. And guess what? It was the same 80/20%, exactly the same distribution. You may be wondering: well, you are looking at a univariate analysis, is that enough? Maybe you have to look at more variables, right? We did it. Let's take a look at the results. So, when you're thinking about fraud, you might be thinking: well, probably those customers that have been with you for a longer time are actually less likely to commit fraud. That's roughly true, but we need to look at other perspectives. My question for you would be: what if the claim was reported after 10 days and there was a gold ring? In that situation, do we have a longer tenure for fraudulent claims than for non-fraudulent claims?
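That univariate check amounts to comparing the value shares of one variable among fraudulent claims in the real and synthetic sets. A small sketch, assuming real and synthetic claim files with the illustrative column names used earlier:

```python
# Univariate reliability check: distribution of one flag among fraudulent claims,
# real vs. synthetic. File and column names are illustrative assumptions.
import pandas as pd

real_claims = pd.read_csv("claims_history.csv")
synthetic_claims = pd.read_csv("synthetic_claims.csv")

def fraud_share(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of each value of `column` among fraudulent claims only."""
    fraud = df[df["is_fraud"] == 1]
    return fraud[column].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "real": fraud_share(real_claims, "reported_after_10_days"),          # e.g. 80% / 20%
    "synthetic": fraud_share(synthetic_claims, "reported_after_10_days"),
})
print(comparison)   # the two columns should be close if the synthetic data is credible
```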

(14:50):
That's what we analyzed first in this table, the real data table. What we were seeing is that fraudulent claims that were reported after 10 days and did not have a gold ring involved had an average tenure of three years, while the highest value occurred when the claim was reported after 10 days and there was a gold ring, which was 11 years, the average tenure of those policies. If we compare them with the claims that were not fraudulent, the lowest tenure, six years, was for claims reported after 10 days where there was not a gold ring, so it was also the lowest value there. However, the highest value in tenure happened when the claim was not reported after 10 days and there was a gold ring involved. As you can see here with the different colors: lowest, highest, lowest, highest.

(15:47):
So now we do the same exercise we did before: we grab the fraudulent cases, the first three columns, pull them here again, and compare them with the synthetic data's fraudulent cases. And what we are seeing, look at the colors, is that the lowest value in the synthetic data happened in the same place as in the real data: the claim was reported after 10 days and there was not a gold ring. And the highest value also happened when the claim was reported after 10 days and there was a gold ring, as you can see here. You might be saying: well, but the exact value is not the same. No, because it's synthetic data; it's not a copy-paste of rows, it follows the same distribution. That is what this table is telling us. Having said this, what we can say is that the synthetic data was reliable, and that totally makes sense, because otherwise we wouldn't have been able to develop the artificial intelligence models for fraud that today are in production helping our adjusters.
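The bivariate check is the same idea with a pivot table: average tenure split by the two flags, computed separately for real and synthetic fraudulent claims. Again a sketch with illustrative file and column names; only the pattern of lowest and highest cells needs to match, not the exact averages:

```python
# Bivariate reliability check: average tenure by the two flags, fraudulent claims only.
# The lowest/highest cells should fall in the same place for real and synthetic data.
import pandas as pd

real_claims = pd.read_csv("claims_history.csv")
synthetic_claims = pd.read_csv("synthetic_claims.csv")

def tenure_by_flags(df: pd.DataFrame) -> pd.DataFrame:
    fraud = df[df["is_fraud"] == 1]
    return fraud.pivot_table(
        values="tenure_years",
        index="reported_after_10_days",
        columns="gold_ring_involved",
        aggfunc="mean",
    )

print(tenure_by_flags(real_claims))        # e.g. lowest at (after 10 days, no ring)
print(tenure_by_flags(synthetic_claims))   # same pattern expected, different exact averages
```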

(16:53):
Let's jump to the conclusions. What have we seen in this presentation? We have seen that applying artificial intelligence and generative artificial intelligence really helps insurance, and if we focus on fraud, this is today in production and it really helps. Secondly, what we can say is that for those cases that have imbalanced data, and this is not only about fraud, synthetic data really helps. It's not guaranteed that synthetic data is going to help you; however, test it, because there may be value there, and even in some cases where the data is not really imbalanced you can still try, and it will probably be successful. That's something we have seen. And finally, we wrote an article called How Synthetic Data Helps the Insurance Business; you're going to have the link available, so feel free to jump in. The numbers shown there are real and are similar to the ones you have seen in this presentation, so take a look and test. If you want to transform your company, you have to test and later build. So please test synthetic data, it really helps. And that's all we have for today.

(18:04):
Thank You very much. Does anyone have any questions? They can send you the mic. One second. Let's wait for the mic.

Audience Member 1 (18:24):
At what point in the process do you employ that fraud detection model? Is it running live as claims are coming through, or are you looking at a replicated data warehouse that you're running on a daily or monthly batch cycle, so that you're working alerts afterwards?

Mireia Rojo Arribas (18:41):
No, it's running live. It's helping our adjusters: the moment a claim happens, it goes through the model, the model tries to identify whether there can be fraud or not, and the adjusters are recommended to take a look at the claim. But they are the ones making the final decision. Correct? Exactly, it's real time. And it's a recommendation; again, the adjusters are the ones making the decision, no problem.

Audience Member 2 (19:12):
Kind of a two part question. Are there use cases when you're needing data to train your models that you would not recommend using synthetic data for? And is there any kind of potential concern over the old thing of if you photocopy a photocopy of a photocopy, small errors start to get very magnified that using synthetic data can introduce extra errors.

Mireia Rojo Arribas (19:41):
Good questions. So for the first part, is there anything I would not recommend it for? No, I actually recommend that even in cases where you don't necessarily have imbalanced data you test it, because synthetic data might help you; we have seen it in the past. Of course you need the right variables, and you need to talk with your regulator and with legal to ensure that you have the right variables in your dataset. Coming back to amplifying biases: we've been testing a lot, and that's why I was reinforcing that you ensure you have the right quality. It is a real concern; however, it's something you can test for, and we are testing this in another framework. We have a responsible artificial intelligence framework, and something we are doing is putting these models through it to ensure that we are not amplifying any bias. However, it's usually not that the synthetic data itself amplifies bias; it's that our data might not have the right quality, and in that case the bias may already be represented inside that data.

Audience Member 2 (20:39):
Just a quick follow-up on it, because sometimes the problem is you're searching for signal in the data. You have a data stream and I'm not sure which of the particular data elements are indicative; I just know here's a set of data that resulted in this kind of outcome and here's a set of data that resulted in that kind of outcome, but things like tenure and this and that and nothing else have been defined. Is that a bad use case, or do you really need to have a deeper understanding of what each of the variables in the data means? Or can you use this simply to amplify your dataset to search for meaning?

Mireia Rojo Arribas (21:18):
You first need to have analyzed your data and to understand it very well; that's super clear, and it's something I emphasize to my team every day. We need to understand the business, otherwise we cannot create proper artificial intelligence assets. And that goes through understanding every variable and the combinations of variables together.

Audience Member 3 (21:39):
Hi. So a question I have is: is this something that runs in the background? I think you mentioned earlier that it does run in the background, but supposing you're with a broker firm and you have hundreds of thousands of accounts, do you have to work with your internal IT to put this on the servers to run in the background, or how does that work? How do you integrate it?

Mireia Rojo Arribas (22:03):
So we have Guidewire and our adjusters work inside Guidewire; they have an inbox there. The way this is integrated is that we work in AWS: we have our platform, our models run there, there is a data pipeline, and we use machine learning operations. Once you have the model results, they are loaded into Guidewire. So what our IT has done is integrate those results into Guidewire. For the adjusters, this is totally transparent: they just go in and they have something in their inbox, a potential fraud they have to take a look at, in a similar way to other rules they have in the system. So for them it's totally transparent. It has been a connection between our AWS platform and Guidewire.

Audience Member 3 (22:45):
Thank you,

Mireia Rojo Arribas (22:46):
Thank you,

Audience Member 4 (22:50):
Thank you. You said almost exactly what I was going to ask in terms of relationship building and figuring out, and you mentioned Guidewire, we recently adopted Guidewire also. So that is very interesting to hear that basically you have to work with that team and have AWS sitting there. Any challenges that you experienced in establishing that integration or relationship?

Mireia Rojo Arribas (23:15):
Not in the integration at all. The integration was pretty easy, because in the end it's an API call to Guidewire, a normal integration like the type of tasks they have right now. The hardest part has been working with the business, with Claims, and getting the adjusters to trust the results; that has taken time. They have really had to go through a lot of samples and review cases. In the end we shared with them: here is your control group versus here is what the model was saying, look at the difference, it's more than five times the fraud you were finding before. Until that happened, there was resistance, to be honest. So it hasn't been with IT, it's been with the claims side.

Audience Member 4 (23:59):
That's interesting. Any other examples of AI-type solutions you have placed on top of Guidewire besides the fraud one?

Mireia Rojo Arribas (24:09):
We have a lot. Anything related to claims, litigation, everything is there, because we really want it to be transparent for the adjusters, so we load everything there. Anything related to claims is there.

Audience Member 4 (24:21):
That's great. Thank you.

Mireia Rojo Arribas (24:22):
No problem. Any other questions?

Audience Member 5 (24:35):
What is the second use case you are going to pursue with synthetic data?

Mireia Rojo Arribas (24:41):
Can you say it again, Bill?

Audience Member 5 (24:42):
What is your second use case you're going to pursue with synthetic data?

Mireia Rojo Arribas (24:46):
The second use case we have already pursued, and it's related to litigation, actually: how can we help adjusters understand if a claim will go to litigation? In that case, do we have to settle? How do we have to handle that claim? We have a claims resolution committee and they are exposed to these cases. Something else we are testing is reserving: are there some claims that are under-reserved and are going to end up with a huge amount, maybe something scary like $100,000? So overall it's on the claims expense side; however, we are also testing in underwriting, that's important too.

Audience Member 5 (25:24):
I have a follow-up question on that. In terms of what you just showed us, without revealing any confidential information, can you give the audience a sense of how much money is potentially being saved by using synthetic data to better detect fraud?

Mireia Rojo Arribas (25:43):
So I cannot talk about the numbers, I think, but we are detecting five times more fraud than we were before. That's a good number. Any other questions? Well, thank you everyone. Feel free to ask any questions after the session or whenever. Thank you.