Ethical AI in life insurance, Hareem Naveed of Munich Re


Digital Insurance spoke with Hareem Naveed, associate vice president of integrated analytics at Munich Re Life, about the ethical application of artificial intelligence in life insurance.

The responses have been lightly edited for clarity.

Can you tell me about your experience in data science?

My training is in data science for social good. I did a degree in math, but then my fellowship and my training were in data science applications for social good use cases. I was at the Center for Data Science and Public Policy at the University of Chicago. A lot of the things I worked on there were very much like, can we build models that identify police officers who are at risk of having negative interactions with members of the public? A lot of that training was very people focused. How do you build models that act at the individual level? So, I worked on problems related to criminal justice, students and truancy, building models at the person level where you could have maybe an assistive intervention. 

When I was looking for a job, I still really wanted to keep it at the level of the person. I really liked these use cases, because I felt like there was so much scope for improvement. I liked the growing nature of medical data. And honestly, it felt like the datasets were socially neutral, in a sense: you're working really hard to get somebody insured. There's already a process, and you're trying to help improve that process and digitize it, and bring in these kinds of innovations.

I'm now based in the U.S., but I used to be in the Toronto office. I really liked the cross-border component of it because Canada was behind when it came to things like cloud storage. Banks were still doing things on their own servers, whereas in the U.S., AWS and Azure had been offering cloud services for a really long time.

What has changed in the past six years in data and technology?

The biggest thing that has changed is we spent the past six years laying foundations. We would build one model, deploy it in one specific context, monitor it and learn a lot from it. And over time, we invested a lot in the machine learning operations and governance side, because we were thinking, it's such a heavily regulated market. We're really in touch with the regulators, and we have a lot of conversations with them, talking them through our analyses and processes.

We're now at the point where everybody's scrutiny has turned to these models, and everyone's talking about how to use AI. But we've already done the legwork, and our pace of iteration is a lot faster than peers or competitors in that same sense. A lot of that is because we invested in defining a governance framework, educating everybody on the team, making sure people are pursuing the right use cases with the right tools, and then we also operationalized the technical side.

For example, when I started here, if you had a model that had to go through review, you would develop it, then you would talk about the use case, and then you would get approval. Then you would go through the entire deployment process. So you build a model, in whatever time frame, and it would take six months by the time you're ready to deploy, and only then would you start the risk review. Now we tie in bias testing, performance testing and the review together, to reduce that gap between development and deployment.

It means we can deploy faster, and we can also shut things down faster. If we don't spend a year bringing something to market, then it's also easy to say, 'Hey, that didn't work. Let's shut it down.' I feel that has changed in the past six years, just that speed of iteration. And the investments we made, which at the time had people asking, why do you guys care so much about bias testing?

We also have a tool called alitheia, which offers risk assessment as a service. When we analyze the data, we find we're able to automate some cases. But if we look at the cases that are being referred, and then ultimately being issued, we're finding a lot of them have to do with anxiety and depression. It's because the underwriter needs a little bit more information. We know that when we look at an electronic health record (EHR), you can find a lot of information. For example, you may ask someone with diabetes for their latest blood glucose reading, and they may know that off the top of their head, or know where to go look it up. But somebody who suffers from anxiety doesn't know there's a GAD-7 score, yet we know that information is in the EHR. We use a large language model to pull very specific information out of the EHR. When we're being specific, we can test for accuracy a lot better, and it's not going to hallucinate because we're pulling stuff out of a table. We know what we're looking for.
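Because the target of each extraction lives in a structured EHR table, every extracted value can be checked against ground truth. A minimal sketch of that accuracy check, assuming a hypothetical `extract_field` stand-in for the LLM call (the function names, record IDs and scores here are illustrative, not Munich Re's actual system):

```python
# Sketch of accuracy testing for narrow, field-level extraction: each
# extracted value (e.g., a GAD-7 score) is compared against the structured
# EHR table it came from. All names and data below are illustrative.

def extract_field(ehr_text, field):
    # Hypothetical placeholder for an LLM call constrained to one named
    # field; a real system would prompt for only that field's value.
    raise NotImplementedError

def accuracy(extractions, ground_truth):
    """Fraction of records where the extracted value matches the EHR table."""
    correct = sum(
        1 for rid, value in extractions.items() if ground_truth.get(rid) == value
    )
    return correct / len(extractions)

# Illustrative evaluation: extracted GAD-7 scores vs. the structured table.
extracted = {"rec1": 12, "rec2": 7, "rec3": 15}
table = {"rec1": 12, "rec2": 7, "rec3": 14}
print(accuracy(extracted, table))  # 2 of 3 match
```

Constraining the task to a single known field is what makes this kind of exact-match evaluation possible; open-ended summarization would have no such ground truth to compare against.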

What governance structures are in place to ensure accountability?

We have a global AI governance directive, and it takes a lot of its direction from the EU AI Act, which is great, because it's really well laid out. It enforces things like a data scientist code of ethics, it enforces a technical place for us to register all our AI models, and it ensures that there's board-level accountability for anything even the regional offices are doing.

We have regional work instructions, and what my team manages is the technical implementation of a lot of the controls that are defined at the high level. What we also do is scan the regulatory environment and adapt and implement additional controls as necessary. So, for example, we have a local work instruction related to the quantitative testing requirement for Colorado; people in Germany or Singapore don't care about that. But for us, that's a local piece we put into place to meet our requirements. And I've really loved that structure, because you get to have an AI governance peer group and discussion to learn about best practices. For example, here's how the Singapore regulator is asking for models to be filed; here's the standard we've adopted that works. And then when you go and implement all your controls, they can be responsive to your environment. If I have 10 models and another entity has one, we might have a more defined machine learning operations practice. We may use different tools to build an image. I like the ability to manage all that locally for my team.

Risk management frameworks are kind of crystallizing, which is really helpful.

How do life insurers prevent bias in their models?

I come from public policy, that was my training, so I tried to bring a lot of that here, because I feel like it's kind of similar. The biggest thing we start with is that your scope has to be reasonable. Your scope has to make sense. When I started six years ago, people were like, 'Oh, let's use facial recognition to detect if you're a smoker.' We can't do that. That doesn't make any sense. So if someone is coming to you, make sure the data they're using is appropriate, that you have the right permissions to use the data, and all those aspects.

We have a cross-disciplinary team that includes people from risk management, legal, the business product owner and data scientists, and they just ask the right questions. No one's going to say, 'You can't do that because this is risky.' They're going to say, 'Do you have the right controls? Did you review the documentation? Did you review the agreement?' That's the first place to start.

Then when it comes to bias detection and mitigation, one of the things that we do is we think about the intervention. The reason I mentioned scope is important because nobody should be building a model just for the sake of it as an intellectual exercise, right? 

If you have a preferred model that can be used to move people up a level, and you think about that in the framework of somebody applying for life insurance, that's an assistive action. So the metric on that is different than one for a model that may knock somebody down. Once you define the intervention, you define that as performance testing: you're looking at accuracy, precision and recall, and you also use the bias metric you defined and look at it across subgroups. When we look at demographic variables, for example age, we can bucket it. We can say, this is 18 to 35, 35 to 45, 45 to 55; we bucket it, and we compute the metric. Then you set a reference group, and the reference group is one that's historically advantaged, or the biggest in your population.

So, for example, for us, maybe 45 to 55 is the largest subgroup. We set that as the denominator and calculate the metric for it. We take the metrics for the other subgroups and divide them by it, and we use the 80% rule from the Equal Employment Opportunity Commission; it's just a yardstick, and it can be whatever you want it to be. As the data scientist, you already have to tell us why you picked the metric, what your intervention is meant to be and how you're assessing for it, and if your model passes on the metric you've defined, then you're good to go. If it doesn't, you have to mitigate and figure out what's going on. Sometimes that can be the population you apply your model to. For example, we had somebody who built a model that wasn't doing very well on ages 60 and above. But that didn't matter so much, because the model was only going to be used on applicants up to age 60. So those are the kinds of mitigations. You can either update the intervention, or you can change the data, or maybe look at the labels. For me, bias testing is just as much a part of performance testing, because you don't want to do worse on a subgroup, right? That doesn't make any sense.
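The bucketing-and-ratio procedure described above can be sketched in a few lines of Python. The metric values, bucket labels and threshold here are illustrative, not Munich Re's actual numbers:

```python
# Sketch of the subgroup bias check described above: compute a performance
# metric per age bucket, divide by the reference group's value, and flag
# any subgroup whose ratio falls below the EEOC's 80% yardstick.

def subgroup_ratios(metric_by_group, reference_group, threshold=0.80):
    """Return each group's ratio to the reference and whether it passes."""
    ref = metric_by_group[reference_group]
    results = {}
    for group, value in metric_by_group.items():
        ratio = value / ref
        results[group] = {"ratio": round(ratio, 3), "passes": ratio >= threshold}
    return results

# Illustrative recall values per age bucket (not real data).
recall = {"18-35": 0.78, "35-45": 0.83, "45-55": 0.85, "55+": 0.61}

# 45-55 is the largest subgroup, so it serves as the reference (denominator).
report = subgroup_ratios(recall, reference_group="45-55")
for group, result in report.items():
    print(group, result)
# The 55+ bucket's ratio (0.61 / 0.85 ≈ 0.718) falls below 0.80, so it
# would be flagged for mitigation before the model could proceed.
```

The same harness works for any metric the data scientist has justified up front (recall for assistive interventions, precision for punitive ones); only the dictionary of per-group values changes.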

That's a framework that we use, it's really simple, it's easy for legal to understand, it's easy for a data scientist to implement, and kind of just understand as a rule of thumb.

Every data scientist, when they have built a model they're ready to deploy or use, has to share these metrics: put them on one sheet of paper, describe the data and all the features you used, what your final iteration was, give us your performance metrics, and give us your bias metrics. The bias metrics sit side by side with the performance analysis to make sure all that is there. That helps, because if the models are not going to pass, they're not going to be put up for legal review or readied for deployment. You put the onus on the developer to iterate. The biggest thing is to make sure we have a culture where this is important, a culture where it's supported, so no one is going to be penalized, and a culture that is open to talking about the mitigations.

What measures are being used to test large language models and Gen AI?

We have a tool that determines if a short piece of text, maybe 250 characters, has medically relevant information. Before we launch anything, everything gets reviewed by the developer and expert users. We were finding through that review that if, for example, the text said, 'young man with Alzheimer's disease,' it was tagging that as 'other' and not picking up Alzheimer's, but if the sentence was '70-year-old presenting with Alzheimer's symptoms,' it would tag it as Alzheimer's disease.

When we build and test these, we try to test every single scenario we can think of, and those scenarios then get added to our arsenal later on. We had to make sure we built a guardrail. And we changed something like the context window, for example, so it leaves age out and picks up Alzheimer's. Those are new skills we're developing, both by building and seeing what's going on, and also by looking to the broader development community, not just within insurance, because everyone's figuring out what to do with large language models and generative AI.
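The "arsenal" of scenarios described above amounts to a regression suite: every phrasing that once tripped the tagger becomes a fixed test case that must keep passing. A minimal sketch, where `tag_condition` is a hypothetical keyword stand-in for the guarded model (the harness, not the rule, is the point):

```python
# Sketch of scenario regression testing for a medical-text tagger: keep an
# arsenal of phrasings that must all map to the same condition tag, so a
# once-fixed case like "young man with Alzheimer's disease" cannot silently
# regress. `tag_condition` is an illustrative stand-in for the real model.

def tag_condition(text):
    # Stand-in rule: a real system would call the LLM with a context that
    # leaves age wording out and focuses on the medical term.
    return "alzheimers" if "alzheimer" in text.lower() else "other"

# Scenario suite: every phrasing, regardless of age wording, expects the
# same tag. New failure modes found in review get appended here.
scenarios = [
    ("young man with Alzheimer's disease", "alzheimers"),
    ("70-year-old presenting with Alzheimer's symptoms", "alzheimers"),
    ("patient reports seasonal allergies", "other"),
]

failures = [
    (text, expected, tag_condition(text))
    for text, expected in scenarios
    if tag_condition(text) != expected
]
print("all scenarios pass" if not failures else failures)
```

Running the suite on every model or prompt change is what turns a one-off bug report into a durable guardrail.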

We spent a lot of time developing these controls in partnership. We've had really good partners internally. 

Are there long-term ethical implications of using AI in life insurance?

With the volume of data that exists, the problem we are trying to solve is to accelerate people's application. We're also trying to deal with a deluge of information. Previously, information was very targeted, you would go and get the same seven to 10 insurance labs. And they would have a very simple rule for what to do. But that had other problems. For example, if you got a one-time cholesterol reading that was high or you got a high stress reading or something, it could reduce the scope of people that you were able to insure. But now with the digital data that we're able to access, you can give people credit for managing their diabetes well over time. You can give people credit for managing their hypertension well over time. 

So that's the opportunity. AI is the right tool for the job. When you ask about long-term ethical implications, I think about the problem we're trying to solve. We've learned our lessons, we know that AI is a tool, it's not the end all. Customers are a lot more savvy and if you decline applications because of misuse of AI, they're going to follow up, and you're gonna have to deal with it.

One of the things that I liked about coming to this industry is that a lot of the controls are in the regulatory environment. If I want to build a model and deploy it, I'm going to have to run it by several underwriters and several actuaries and explain it to them from the cost perspective, but also, the underwriters are not going to trust something if it doesn't make sense medically, right? If I'm using a variable and it just has some correlation, they're like, 'Well, you can't explain this to me, I can't explain this to the end user, therefore I'm not going to use it.' So the limited use cases, the nature of the data, the fact that AI is the right tool, and all the controls that exist, I feel, mitigate some of those long-term implications that could result.

I think I'm really excited about the ability to get one fundamental view of risk. Now we have so many data sources: prescription data, medical claims, information from EHRs, an applicant's disclosures. But what you're fundamentally trying to understand is one view of risk. What happens when two things disagree? Someone says, 'Oh, I've never had x prescription,' but the data shows they have three years of that prescription. How do you reconcile that without ruining the customer relationship?

You can't refer all of that to a human to review. So that's what I'm really excited about: using all that information to get to one fundamental view of risk and to understand things a lot better, rather than just fixing errors or analyzing differences, because the data is getting more and more convoluted. The more you get, the more you have to clean up. Any automation gains you have may disappear if things balloon out of control.