'Cannibalizing their own industry': Dangers of Big Data for insurers

A scientist gestures toward data displayed on screens in the control room of the Laser Interferometer Gravitational-Wave Observatory at the Hanford Site in Richland, Washington on Feb. 13, 2016.
David Ryder/Bloomberg

Cathy O'Neil, an independent data scientist, says she basically invented the notion of an algorithmic audit, a review of processes to inspect outcomes and inner workings. She's the founder of O'Neil Risk Consulting and Algorithmic Auditing, a company that is working with the Colorado Division of Insurance and other organizations across the U.S. on how to understand and mitigate algorithmic risk. 

O'Neil wrote a book, released in 2016, on the impact of algorithms: 'Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.' During her research she found that regulators didn't really know what discrimination meant when it came to an algorithm, which is why she decided to start a company focused on the issue. 

Digital Insurance spoke to O'Neil about AI and model bias. Answers have been lightly edited for clarity.

Can you explain model bias as it pertains to insurance?


At the insurance level, what we have is a proliferation of data. And a bunch of experts, the actuaries, who are trained to think about risk and only risk, and they do, and they're good at it. They really can, at a very granular level, figure out somebody's risk in terms of insurance claims. On the other hand, they're not supposed to use race against someone. And they aren't doing that intentionally. The question is whether they're doing it unintentionally. And it's complicated, because a lot of the data that they use is both useful for understanding risk and a proxy for race. So it's really a question of fairness, which is a normative notion that I'm not in charge of, thank god. 

The main thing I do is try to decouple the technical data questions from the regulatory fairness questions. It's way above my pay grade to decide whether something is allowable. This thing is good at finding risk, but it's also a proxy for race. Is it okay or not? I don't know. It's not my job. My job is to do the math and show my boss, the insurance commissioner: here's what this looks like, this is a measurement of how much risk it captures, and this is a measurement of how much of a proxy for race it is. It's your call. 

A lot of my work as an algorithmic auditor ends up being this very question. Yes, it makes sense that you're using that information about people to score them for risk. But you have to understand it's also a proxy for race. And it's a judgment call: are you going to use it anyway, because it's useful? Are you going to decorrelate it from race before using it? Are you going to adjust your algorithm to be less correlated with race? There are a lot of different options. I hope it's becoming the case that we at the very least have to take this seriously and work with it directly as a question.
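O'Neil doesn't spell out the math in the interview, but a minimal sketch of the kind of measurement she describes might look like the following Python snippet: it scores a feature on how much risk signal it carries versus how strongly it proxies for race, then residualizes it as one possible way to "decorrelate" it. All data, numbers and variable names here are invented for illustration, not her actual methodology.

```python
# Hypothetical sketch: measure a feature's risk signal vs. its strength as a race proxy,
# then residualize it against the protected attribute. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

race = rng.integers(0, 2, n)                  # protected attribute (0/1), illustrative only
feature = 0.8 * race + rng.normal(0, 1, n)    # e.g. a credit-based score correlated with race
risk = 0.5 * feature + rng.normal(0, 1, n)    # claims risk partly driven by the feature

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

print("risk signal (corr with claims):", round(corr(feature, risk), 3))
print("race proxy  (corr with race):  ", round(corr(feature, race), 3))

# One option the commissioner could choose: remove the component of the feature
# explained by race (a simple linear residualization) before it enters the model.
slope = np.cov(feature, race, ddof=0)[0, 1] / np.var(race)
feature_resid = feature - slope * (race - race.mean())

print("after residualizing:")
print("risk signal:", round(corr(feature_resid, risk), 3))
print("race proxy: ", round(corr(feature_resid, race), 3))
```

The two printed numbers correspond to the two measurements she hands to the commissioner; whether the trade-off is acceptable is the regulatory call, not the mathematical one.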

Are advanced technologies allowing more access to the data insurers have?

Insurers have always had data, and they're always really good with the data they have, but the amount of data they have now is crazy, number one. But it's also the case that they have not been collecting race data, and they've been claiming that since they don't collect race data, they don't have to worry about it, which is just not true. 

All of my work points to the fact that race is everywhere, and that's because we care about it so much. If we didn't care about it, it probably wouldn't be embedded in every single type of data. But we do care about it a lot. And it is embedded in every type of data, especially when it comes to things like risk and finances.

It infiltrates all data because it is so social, and it's so important socially, and gender does too. 

We're not as worried about gender, though maybe we should be, and the same goes for age. I'm not here to say this is right or wrong. I'm just saying that it's inescapable that things are proxies. 

More recently, we have way more data. So insurers are getting to a granular level of risk profiling, and it's a bonanza for them in some sense, because they love to understand risk. And they love to offer people who are less risky better rates, so they can keep them as customers. But at a higher level, insurance has to rely on pooling data and pooling risk. Because if we really just charged everyone for exactly the risk that they represent, it wouldn't work. It just wouldn't work. We wouldn't need insurance, because we'd basically be pre-charging people for their costs. And the whole point of insurance is to be protected from unaffordable costs. If you're not protected, because it's unaffordable, then insurance has failed. 
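Her pooling point can be made concrete with a toy calculation (the figures below are invented, not from the interview): under pooled pricing everyone pays the average expected cost, while fully individualized pricing effectively pre-charges the highest-risk person their own expected loss.

```python
# Toy illustration with invented numbers: pooled premiums vs. fully individualized pricing.
expected_annual_loss = {"low_risk": 200, "medium_risk": 1_000, "high_risk": 20_000}

# Pooled pricing: everyone pays the average expected cost across the pool.
pooled_premium = sum(expected_annual_loss.values()) / len(expected_annual_loss)
print(f"pooled premium for everyone: ${pooled_premium:,.0f}")

# Perfectly granular pricing: each person is pre-charged their own expected cost,
# so the high-risk person gets no protection from an unaffordable loss.
for person, cost in expected_annual_loss.items():
    print(f"{person}: individualized premium ${cost:,.0f}")
```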

What's going to happen to insurance if every insurance company slices and dices and profiles everyone so much that only the people who don't really need insurance can afford insurance? Then it has failed as an institution. I call it cannibalizing their own industry. And it's because of this competition with each other. I'm just talking about big data as a new paradigm in the context of insurance, and I don't know what's going to happen, but my feeling is that if private insurance is going to last, [providers] have to agree on pooling more, not less. Because right now, the fight for granularity is going to cannibalize the industry.

Will insurers start incorporating more third-party data, for example, social media habits?

I think the answer is, quite possibly, yes. It's publicly available information. Third-party data collectors like to collect it, and also your license plate and where you're driving. Where do you post? Your location? What kind of vacations do you take? 

You're exchanging this quote-unquote free service for being tracked. And it's real. And the laws in this country are such that it's legal to collect that information unless it's explicitly not legal, and the only explicitly illegal surveillance is medical. So it's a very porous system. You should expect that stuff to be collected and eventually infiltrate a scoring system somewhere that is then offered for sale to insurance companies. The question of whether they're going to actually buy it and use it is probably evolving right now. The fact that the Colorado law refers to these alternative data sources is part of this conversation.

Could you define explainable fairness?

There's a balancing test to decide the legitimacy of using the data. How much risk do you need to be inferring versus how much of a proxy for race is this? You want there to be a good argument for using something for risk purposes if it's also a proxy for race, especially if it makes the outcomes of your overall process more disparate. 

Insurers are not actually testing that right now. Not directly. And so this is the kind of testing that we know is coming down the pike; it's on the pike; it's taking the exit off the pike. It's here, and I'm sure that the insurance carriers are somewhat resistant to it psychologically, because it's hard for them to think outside their risk-based framework, but they have to admit it's coming, and so they're probably going to need a little advice on how to actually do that testing. 

Explainable fairness is this process where we're just going to look at the differences in outcomes, by gender or by race. If there's no difference in outcomes, on average, we're done. But if there are differences, we'd like to understand why. The people who have the burden to explain themselves always have an explanation, and it's like, oh, well, that's because we discriminate based on driving record, or whatever it is. Okay, great, tell us the driving record information, give us that column, and we'll account for it, and then we'll see if the disparities in outcomes change or if they're gone. And if they aren't gone, we'll be like, they're still there. And they're like, oh, that's because we have another legitimate thing that we discriminate based on. 

To be clear, all algorithms like this discriminate. That's the whole point — they're trying to discriminate between risky and not so risky. The question is whether the discriminations are legitimate, whether they're allowed, and again it's not my job to know the answer to that.

I call it explainable fairness because it actually does explain the notion of fairness at the end of the day. I started thinking about it as a framework where legitimacy is context dependent, but you're doing basically the same thing every time: you're measuring outcomes and trying to explain the differences with legitimate factors. 
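The loop she describes, measure the raw gap, account for a claimed legitimate factor, check what remains, can be sketched in a few lines of Python. The data, the effect sizes and the "driving record" column below are all hypothetical stand-ins, not her audit code.

```python
# Hypothetical sketch of the "explainable fairness" loop: measure the raw outcome gap
# between groups, then account for a claimed legitimate factor and see what gap remains.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

group = rng.integers(0, 2, n)                        # protected attribute (0/1), synthetic
driving_record = rng.poisson(1.0 + 0.3 * group, n)   # claimed legitimate factor, correlated with group
premium = 500 + 150 * driving_record + 40 * group + rng.normal(0, 50, n)  # hidden direct group effect

# Step 1: raw disparity in outcomes by group.
raw_gap = premium[group == 1].mean() - premium[group == 0].mean()
print(f"raw premium gap: {raw_gap:.1f}")

# Step 2: "give us that column" -- regress the outcome on the legitimate factor
# and check whether a gap remains in the residuals.
X = np.column_stack([np.ones(n), driving_record])
beta, *_ = np.linalg.lstsq(X, premium, rcond=None)
residual = premium - X @ beta
adjusted_gap = residual[group == 1].mean() - residual[group == 0].mean()
print(f"gap after accounting for driving record: {adjusted_gap:.1f}")
# If a gap remains, the insurer is asked for the next legitimate factor, and the loop repeats.
```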

What are some of your thoughts on generative AI?

I would say that for as little as we really, truly understand machine learning algorithms, they're so much better understood than ChatGPT, which is literally the Wild West. We haven't even been told what it's trained on. You can't trust it.