This post is an adapted excerpt from “Everybody Lies: Big Data, Little Data, And What The Internet Can Tell Us About Who We Really Are,” by Seth Stephens-Davidowitz (May 2017, Dey Street Books). Stephens-Davidowitz is a New York Times op-ed contributor, a visiting lecturer at The Wharton School, and a former Google data scientist. He received a BA in philosophy from Stanford, where he graduated Phi Beta Kappa, and a PhD in economics from Harvard. His research, which uses new, big data sources to uncover hidden behaviors and attitudes, has appeared in the Journal of Public Economics and other prestigious publications. He lives in New York City. Buy the book at Amazon, Apple, Barnes & Noble and Google.
Mo data, mo problems? What we shouldn’t do
Sometimes, the power of Big Data is so impressive it’s scary. It raises ethical questions.
THE DANGER OF EMPOWERED CORPORATIONS
Recently, three economists (Oded Netzer and Alain Lemaire, both of Columbia, and Michal Herzenstein of the University of Delaware) looked for ways to predict whether a borrower would pay back a loan. The scholars used data from Prosper, a peer-to-peer lending site. Potential borrowers write a brief description of why they need a loan and why they are likely to make good on it, and potential lenders decide whether to provide them the money. Overall, about 13 percent of borrowers defaulted on their loans.
It turns out the language that potential borrowers use is a strong predictor of their probability of paying back. And it is an important indicator even if you control for other relevant information lenders were able to obtain about those potential borrowers, including credit ratings and income.
Listed below are 10 phrases the researchers found that are commonly used when applying for a loan. Five of them positively correlate with paying back the loan. Five of them negatively correlate with paying back the loan. In other words, five tend to be used by people you can trust, five by people you cannot. See if you can guess which are which.
- God
- promise
- debt-free
- minimum payment
- lower interest rate
- will pay
- graduate
- thank you
- after-tax
- hospital
You might think — or at least hope — that a polite, openly religious person who gives his word would be among the most likely to pay back a loan. But in fact this is not the case. This type of person, the data shows, is less likely than average to make good on their debt.
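Here are the phrases grouped by the likelihood of paying back.
More likely to pay back:
- debt-free
- lower interest rate
- after-tax
- minimum payment
- graduate
Less likely to pay back:
- God
- promise
- will pay
- thank you
- hospital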
Before we discuss the ethical implications of this study, let’s think through, with the help of the study’s authors, what it reveals about people. What should we make of the words in the different categories?
First, let’s consider the language that suggests someone is more likely to make their loan payments. Phrases such as “lower interest rate” or “after-tax” indicate a certain level of financial sophistication on the borrower’s part, so it’s perhaps not surprising they correlate with someone more likely to pay their loan back. In addition, borrowers who mention positive achievements, such as being a college “graduate” or being “debt-free,” are also likely to pay back their loans.
Now let’s consider language that suggests someone is unlikely to pay their loans. Generally, if someone tells you he will pay you back, he will not pay you back. The more assertive the promise, the more likely he will break it. If someone writes “I promise I will pay back, so help me God,” he is among the least likely to pay you back. Appealing to your mercy — explaining that he needs the money because he has a relative in the “hospital” — also means he is unlikely to pay you back. In fact, mentioning any family member — a husband, wife, son, daughter, mother, or father — is a sign someone will not be paying back. Another word that indicates default is “explain,” meaning if people are trying to explain why they are going to be able to pay back a loan, they likely won’t.
The authors did not have a theory for why thanking people is evidence of likely default.
In sum, according to these researchers, giving a detailed plan of how he can make his payments and mentioning commitments he has kept in the past are evidence someone will pay back a loan. Making promises and appealing to your mercy are clear signs someone will go into default. Regardless of the reasons (or what it tells us about human nature that making promises is a sure sign someone will, in actuality, not do something), the scholars found the text was an extremely valuable piece of information in predicting default. Someone who mentioned God was 2.2 times more likely to default. This was among the strongest single indicators that someone would not pay back.
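To make a figure like "2.2 times more likely to default" concrete, here is a minimal sketch of how such a ratio could be computed from loan descriptions. It is not the researchers' method: their analysis controlled for credit ratings, income, and other information, while this only compares raw default rates. The function name, the record layout, and the eight toy loans are all invented for illustration, and the comparison group (loans that do not mention the phrase) is an assumption.

```python
def relative_default_rate(loans, phrase):
    """Default rate of loans mentioning `phrase` divided by the rate of those that do not.

    Hypothetical helper: uses naive case-insensitive substring matching and
    applies no controls for credit rating or income.
    Returns None when the ratio is undefined.
    """
    needle = phrase.lower()
    with_phrase = [loan for loan in loans if needle in loan["text"].lower()]
    without = [loan for loan in loans if needle not in loan["text"].lower()]
    if not with_phrase or not without:
        return None  # one of the groups is empty
    rate_with = sum(loan["defaulted"] for loan in with_phrase) / len(with_phrase)
    rate_without = sum(loan["defaulted"] for loan in without) / len(without)
    if rate_without == 0:
        return None  # cannot divide by a zero baseline
    return rate_with / rate_without


# Toy records, invented for illustration only (not Prosper data).
loans = [
    {"text": "I promise I will pay back, so help me God", "defaulted": True},
    {"text": "God bless, thank you", "defaulted": True},
    {"text": "With God's help I will repay", "defaulted": False},
    {"text": "Consolidating cards to get a lower interest rate", "defaulted": False},
    {"text": "I am a college graduate and debt-free", "defaulted": False},
    {"text": "After-tax income covers the minimum payment", "defaulted": False},
    {"text": "Need money for hospital bills", "defaulted": True},
    {"text": "Refinancing at a lower interest rate", "defaulted": False},
]

print(relative_default_rate(loans, "god"))
```

A ratio above 1 means the phrase appears more often among defaulters in the sample; with so few records, the number itself means nothing beyond showing the arithmetic.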
But the authors also believe their study raises ethical questions. While this was just an academic study, some companies do report that they utilize online data in approving loans. Is this acceptable? Do we want to live in a world in which companies use the words we write to predict whether we will pay back a loan? It is, at a minimum, creepy — and, quite possibly, scary.
A consumer looking for a loan in the near future might have to worry about not merely her financial history but also her online activity. And she may be judged on factors that seem absurd — whether she uses the phrase “Thank you” or invokes “God,” for example. Further, what about a woman who legitimately needs to help her sister in a hospital and will most certainly pay back her loan afterward? It seems awful to punish her because, on average, people claiming to need help for medical bills have often been proven to be lying. A world functioning this way starts to look awfully dystopian.
This is the ethical question: Do corporations have the right to judge our fitness for their services based on abstract but statistically predictive criteria not directly related to those services?
Leaving behind the world of finance, let’s look at the larger implications for, say, hiring practices. Employers are increasingly scouring social media when considering job candidates. That may not raise ethical questions if they’re looking for evidence of bad-mouthing previous employers or revealing previous employers’ secrets. There may even be some justification for refusing to hire someone whose Facebook or Instagram posts suggest excessive alcohol use. But what if they find a seemingly harmless indicator that correlates with something they care about?
Researchers at Cambridge University and Microsoft gave 58,000 U.S. Facebook users a variety of tests about their personality and intelligence. They found that Facebook likes are frequently correlated with IQ, extraversion, and conscientiousness. For example, people who like Mozart, thunderstorms, and curly fries on Facebook tend to have higher IQs. People who like Harley-Davidson motorcycles, the country music group Lady Antebellum, or the page “I Love Being a Mom” tend to have lower IQs. Some of these correlations may be due to the curse of dimensionality. If you test enough things, some will randomly correlate. But some interests may legitimately correlate with IQ.
Nonetheless, it would seem unfair if a smart person who happens to like Harleys couldn’t get a job commensurate with his skills because he was, without realizing it, signaling low intelligence.
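The point that testing enough things will turn up random correlations (statisticians often call this the multiple comparisons problem) can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not a reconstruction of the Cambridge and Microsoft study: it invents 100 people with random IQ scores and 1,000 coin-flip "likes" that have nothing to do with IQ, then counts how many likes still show a noticeable correlation. The sample size, the 0.2 threshold, and all function names are arbitrary choices.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def count_spurious(n_people=100, n_likes=1000, threshold=0.2, seed=0):
    """Count purely random 'likes' whose correlation with IQ exceeds `threshold`."""
    rng = random.Random(seed)
    iq = [rng.gauss(100, 15) for _ in range(n_people)]
    hits = 0
    for _ in range(n_likes):
        # Each like is an independent coin flip, unrelated to IQ by construction.
        likes = [rng.randint(0, 1) for _ in range(n_people)]
        if abs(pearson(iq, likes)) > threshold:
            hits += 1
    return hits


print(count_spurious(), "of 1000 random likes look correlated with IQ")
```

Every hit here is noise by construction. With 58,000 users instead of 100, chance correlations of this size would be far rarer, which is one reason sample size matters before anyone reads meaning into a single like.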
In fairness, this is not an entirely new problem. People have long been judged by factors not directly related to job performance — the firmness of their handshakes, the neatness of their dress. But a danger of the data revolution is that, as more of our life is quantified, these proxy judgments can get more esoteric yet more intrusive. Better prediction can lead to subtler and more nefarious discrimination.
Better data can also lead to another form of discrimination, what economists call price discrimination. Businesses are often trying to figure out what price they should charge for goods or services. Ideally they want to charge customers the maximum they are willing to pay. This way, they will extract the maximum possible profit.
Most businesses usually end up picking one price that everyone pays. But sometimes they are aware that the members of a certain group will, on average, pay more. This is why movie theaters charge more to middle-aged customers—at the height of their earning power—than to students or senior citizens and why airlines often charge more to last-minute purchasers. They price discriminate.
Big Data may allow businesses to get substantially better at learning what customers are willing to pay — and thus gouging certain groups of people. Optimal Decisions Group was a pioneer in using data science to predict how much consumers are willing to pay for insurance. How did they do it? They used a methodology that we have previously discussed in this book. They found prior customers most similar to those currently looking to buy insurance — and saw how high a premium they were willing to take on. In other words, they ran a doppelganger search. A doppelganger search is entertaining if it helps us predict whether a baseball player will return to his former greatness. A doppelganger search is great if it helps us cure someone’s disease. But if a doppelganger search helps a corporation extract every last penny from you? That’s not so cool. My spendthrift brother would have a right to complain if he got charged more online than tightwad me.
Gambling is one area in which the ability to zoom in on customers is potentially dangerous. Big casinos are using something like a doppelganger search to better understand their consumers. Their goal? To extract the maximum possible profit—to make sure more of your money goes into their coffers.
Here’s how it works. Every gambler, casinos believe, has a “pain point.” This is the amount of losses that will sufficiently frighten her so that she leaves your casino for an extended period of time. Suppose, for example, that Helen’s “pain point” is $3,000. This means if she loses $3,000, you’ve lost a customer, perhaps for weeks or months. If Helen loses $2,999, she won’t be happy. Who, after all, likes to lose money? But she won’t be so demoralized that she won’t come back tomorrow night.
Imagine for a moment that you are managing a casino. And imagine that Helen has shown up to play the slot machines. What is the optimal outcome? Clearly, you want Helen to get as close as possible to her “pain point” without crossing it. You want her to lose $2,999, enough that you make big profits but not so much that she won’t come back to play again soon.
How can you do this? Well, there are ways to get Helen to stop playing once she has lost a certain amount. You can offer her free meals, for example. Make the offer enticing enough, and she will leave the slots for the food.
But there’s one big challenge with this approach. How do you know Helen’s “pain point”? The problem is, people have different “pain points.” For Helen, it’s $3,000. For John, it might be $2,000. For Ben, it might be $26,000. If you convince Helen to stop gambling when she has lost $2,000, you have left profits on the table. If you wait too long, until after she has lost $3,000, you have lost her for a while. Further, Helen might not want to tell you her pain point. She may not even know what it is herself.
So what do you do? If you have made it this far in the book, you can probably guess the answer. You utilize data science. You learn everything you can about a number of your customers—their age, gender, zip code, and gambling behavior. And, from that gambling behavior—their winnings, losings, comings, and goings — you estimate their “pain point.”
You gather all the information you know about Helen and find gamblers who are similar to her: her doppelgangers, more or less. Then you figure out how much pain they can withstand. It’s probably about the same amount Helen can withstand. Indeed, this is what the casino Harrah’s does, with the help of Teradata, a Big Data warehouse firm.
Scott Gnau, general manager of Teradata, explains in the excellent book “Super Crunchers” what casino managers do when they see a regular customer nearing their pain point: “They come out and say, ‘I see you’re having a rough day. I know you like our steakhouse. Here, I’d like you to take your wife to dinner on us right now.’ ”
This might seem the height of generosity: a free steak dinner. But really it’s self-serving. The casino is just trying to get customers to quit before they lose so much that they’ll leave for an extended period of time. In other words, management is using sophisticated data analysis to try to extract as much money from customers, over the long term, as it can.
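The doppelganger approach described above is, in spirit, a nearest-neighbor estimate: find the past customers most similar to Helen and average what is known about them. The sketch below shows that idea under loud assumptions. Nothing here reflects how Harrah's actually does it; the feature names, the five toy gamblers, the choice of three neighbors, and the 95 percent trigger for the dinner offer are all hypothetical, and the features are left unscaled, which a real analysis would not do.

```python
from math import sqrt

# Hypothetical features used to judge similarity between gamblers.
FEATURES = ("age", "avg_bet", "visits_per_month")


def distance(a, b):
    """Euclidean distance over FEATURES (unscaled, for simplicity)."""
    return sqrt(sum((a[f] - b[f]) ** 2 for f in FEATURES))


def estimate_pain_point(target, known, k=3):
    """Average pain point of the k known gamblers most similar to `target`."""
    nearest = sorted(known, key=lambda g: distance(target, g))[:k]
    return sum(g["pain_point"] for g in nearest) / len(nearest)


def should_offer_dinner(losses_today, pain_point, margin=0.95):
    """True once today's losses reach `margin` of the estimated pain point."""
    return losses_today >= margin * pain_point


# Toy customers, invented for illustration only.
known = [
    {"age": 60, "avg_bet": 5, "visits_per_month": 4, "pain_point": 3000},
    {"age": 58, "avg_bet": 6, "visits_per_month": 5, "pain_point": 2800},
    {"age": 62, "avg_bet": 4, "visits_per_month": 3, "pain_point": 3200},
    {"age": 25, "avg_bet": 50, "visits_per_month": 1, "pain_point": 26000},
    {"age": 30, "avg_bet": 2, "visits_per_month": 8, "pain_point": 2000},
]
helen = {"age": 61, "avg_bet": 5, "visits_per_month": 4}

estimate = estimate_pain_point(helen, known)
print(estimate, should_offer_dinner(2900, estimate))
```

In practice a pain point is never observed directly, so it would itself have to be inferred from behavior such as how long customers stay away after a losing night. That is the hard part, and this sketch assumes it away.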
We have a right to fear that better and better use of online data will give casinos, insurance companies, lenders, and other corporate entities too much power over us.
On the other hand, Big Data has also been enabling consumers to score some blows against businesses that overcharge them or deliver shoddy products.
One important weapon is review sites such as Yelp, which publish customer reviews of restaurants and other services. A recent study by economist Michael Luca, of Harvard, has shown the extent to which businesses are at the mercy of Yelp reviews. Comparing those reviews to sales data in the state of Washington, he found that one fewer star on Yelp makes a restaurant’s revenues drop 5% to 9%.
Consumers are also aided in their struggles with business by comparison shopping sites — like Kayak and Booking.com. As discussed in “Freakonomics,” when an internet site began reporting the prices different companies were charging for term life insurance, these prices fell dramatically. If an insurance company was overcharging, customers would know it and use someone else. The total savings to consumers? One billion dollars per year.
Data on the internet, in other words, can tell businesses which customers to avoid and which they can exploit. It can also tell customers the businesses they should avoid and who is trying to exploit them. Big Data to date has helped both sides in the struggle between consumers and corporations. We have to make sure it remains a fair fight.