Paper Reviews

Do randomized econ studies suffer the placebo effect?

What Is A Randomized Controlled Trial?

It is nearly impossible to do proper scientific experiments in economics. Other branches of science don’t have this problem. In medicine, if you want to know whether a new drug is more effective than the current standard, you can find some sick people, randomly give half of them the new medicine, give the other half the standard medicine, wait a bit, and see which group dies less. Since the assignment of which medicine a patient gets is randomly determined, the observed differences are due to the treatment and not due to some underlying characteristic of those treated.

When They Work

This is not so easily done in the social sciences. Researchers cannot alter the results of a presidential election, economic policy, or the demographic mix of a city just to see what happens. Because of this, researchers typically rely on mathematical models, case studies, natural experiments, or statistical analysis. While sometimes convincing, these methods never quite have the same rigor as a proper experiment.

Fortunately, there are some situations where researchers can randomly assign subjects to treatment and control groups. In economics, randomized controlled trials (RCT’s) are most common in development economics. Organizations like the Poverty Action Lab set up experiments to test all manner of interventions. Since the treatments are randomly assigned, the conclusions should be robust, and not subject to various selection biases.

External Validity

Of course, there are issues. Principally, issues of external validity. By their nature, RCT’s only provide direct results for the specific group of people involved in the experiment. Sometimes this might be a large number or a small number, but it is always a limited group.

External Validity is the extent to which a result applies to other contexts. Does a result found in one set of circumstances apply to others? Pretend there is a study that shows providing free meals to school children in a village in Côte d’Ivoire increases school attendance. Does it mean that the same policy would work in a different village, or a different country? For secondary school children? Does it matter if the food provided is different? Maybe attendance increased, but did any additional learning take place?

External Validity is the headline problem of RCT’s. It is a serious issue. The argument in favor of RCT’s has always been that they have very strong internal validity. While the conclusions may not apply elsewhere, at least they are valid for the group you studied.

Probably.

Double Blind

Cowpeas. Apparently.

In medicine, RCT’s are nearly always double blind. Neither those receiving treatment nor those administering the experiment know who is getting the real medicine and who is not. The reason for this excess of mystery is the placebo effect. Receiving some treatment, even if that treatment is completely worthless, tends to improve outcomes.

RCT studies in economics are essentially never double blind. The participants typically know exactly what they are getting or not getting. This might be a problem. A 2014 study [1] tested whether it would matter if the participants did not know whether they were receiving a treatment or not.

It did.

How to increase output?

In studies of agricultural productivity the central question might be something along the lines of “how does using a different sort of seed increase crop output?”

There are at least two components of this sort of question. One is the physical impact of the seeds themselves. Some seeds simply produce higher yields than others.

The second component is behavioral: what the farmers do with the seeds. If a farmer receives a more productive variety of seed, they will do things differently. This includes decisions like the amount of care and effort they put into their crops. Better seeds could result in people doing more work (that extra work is more valuable) or less work (they don’t need to work as hard to get the same result).

Bulte et al, the folks who did this study, also surmise a third component: how behavior is changed in ways that are unrelated to the treatment. The example given is that participants might be overly optimistic about the potency of the seeds they receive, and put extra effort into raising their crops, even if this extra effort is not merited by the actual quality of the seeds. When the outcome is measured, this extra effort will make the intervention look more effective than it was.

The distinction between the second and third components is that the second is directly related to the actual mechanics of the intervention while the third is not. Sort of like a placebo effect, the third effect happens even if the actual treatment has no impact at all.

Cowpea seeds in Tanzania

Here’s how the study went down.

First, some farmers were randomly divided into these four groups:

Those who received traditional cowpea seeds and were explicitly told they had received traditional seeds
Those who received modern cowpea seeds and were explicitly told that they had received modern seeds
Those who received traditional cowpea seeds but did not know which kind of seeds they had received
Those who received modern cowpea seeds but did not know which kind of seeds they had received

The seeds appeared identical, so it was not possible to tell which one they had received by observing them.

Results

When the harvest was gathered, the results showed that the modern seeds outperformed the traditional seeds by 27%. This is where most studies like this would stop. Researchers would then go off and make policy recommendations, telling governments and NGOs to give people better cowpea seeds, or something.

The wrinkle in this case was that the 27% difference was only for those farmers who knew which type of seeds they had received. The impact of the seeds on the group that did not know which one they had received was zero[2]. The seeds themselves had done nothing. Since the seeds were ineffective, any behavioral changes brought on by the genuine impact of the seeds were also negligible. The 27% increase in harvest was due to the response of the farmers themselves to receiving the new seeds.[3]
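The arithmetic behind that comparison can be sketched in a few lines of Python. The yields below are hypothetical index numbers (traditional seeds = 100) chosen only to reproduce a 27% gap in the informed groups and no gap in the uninformed groups. They are not the paper's data, and the function name is my own.

```python
def decompose_seed_effect(yields):
    """Split the informed-group gain into a seed part and an expectations part.

    `yields` maps (seed_type, informed) -> average harvest for that group.
    """
    # Gain from modern seeds when farmers know what they received.
    open_gain = yields[("modern", True)] / yields[("traditional", True)] - 1
    # Gain from modern seeds when farmers cannot tell what they received.
    blind_gain = yields[("modern", False)] / yields[("traditional", False)] - 1
    return {
        "open_gain": open_gain,
        "blind_gain": blind_gain,
        # Whatever is left over is attributed to farmers reacting to the label.
        "expectations_gain": open_gain - blind_gain,
    }


# Hypothetical index numbers, NOT the paper's data.
example_yields = {
    ("traditional", True): 100.0,
    ("modern", True): 127.0,
    ("traditional", False): 100.0,
    ("modern", False): 100.0,
}

effects = decompose_seed_effect(example_yields)
print(effects)
```

With these made-up inputs the whole informed-group gain lands in the expectations term, which is the pattern the study reports.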

Effort, Expectations, and External Validity

Seeds do not know they are being experimented on. Proper scientific trials can be done on seeds, and it can be determined how they grow in different conditions. The whole point of giving them to actual farmers is to see what those farmers do with them. From the standpoint of evaluating an intervention for policy purposes, the behavioral effect is a necessary factor. How people respond to an intervention is not some confounding variable, it is a key part of the intervention.

Crucially, Bulte et al make a distinction between changes in behavior that are driven by the genuine impact of the treatment and those which are not. The changes in behavior that were not due to the treatment bias the estimate upward, because even if the treatment does nothing, you still observe a positive impact.

I am not convinced this distinction is a particularly important one. While it is important to think about how much of an outcome is due to participants’ effort, it seems that expectations (even wrong ones) are part of the intervention, not separate from it. Differences in expectations across time and space would be a question of external, not internal, validity.

Part of the measured outcome will be driven by the actual impact of the intervention. Part of it will be driven by expectations that may be unrelated to the actual effectiveness of the intervention. The measured impact will include both things. It probably should.

Sources, References, And Further Reading

http://www.economist.com/blogs/freeexchange/2013/12/randomised-control-trials

http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001497

http://blogs.worldbank.org/impactevaluations/health/developmenttalk/impactevaluations/impactevaluations/node/771

https://fxdiebold.blogspot.com/2017/01/torpedoing-econometric-randomized.html

https://boringdevelopment.wordpress.com/2014/04/09/a-torpedo-aimed-straight-at-h-m-s-randomista/

[1] Brought to my attention by this recent blog post

[2] An increase of about 5%. Not statistically significant at traditional levels.

[3] Unfortunately, this is a deeply imperfect paper. There are issues with attrition and with the way the experiment was carried out, regardless of the validity of the methods used.

How Much Consumer Surplus Does Uber Generate?

NOTE: One of my rules when starting this blog was that I wouldn’t write about questions I could answer with a Google search. If someone else had already answered my question, I wouldn’t bother repeating them. This post sort of breaks that rule. Freakonomics already did a nice podcast on this paper, which I used in my research and which covers many of the same things. My piece has more footnotes though.

Uber’s U logo.

Uber, the increasingly ubiquitous transportation/technology/taxi company, whose logo is either a U or that square thing, has changed how I get around. I take Uber frequently, and have had almost universally pleasant experiences. I am not the only fan of Uber. Freakonomics author and U of Chicago prof Steven Levitt recently co-authored a paper with some folks from Uber, which calculated that the consumer surplus generated by Uber in 2015 was $6.8 Billion. This post is about how that number was calculated, what it means, and what it doesn’t.

Consumer Surplus

Consumer Surplus is one method economists use to think about gains from transactions. The basic idea behind consumer surplus is that everyone has a maximum price they are willing to pay for a good or service, and whenever consumers can acquire something for less than that price, they are better off for it.

Pretend I am willing to pay $25 to get a ride from the airport to my house. Anything above $25 and I will take the bus/train/hitchhike/walk/just stay in the airport forever. If the price is less than $25 I will pay for my ride, and the lower the price is the more pleased about it I will be. This difference between the maximum amount I am willing to pay and the amount I actually pay is my consumer surplus. If I am willing to pay $25 and I get a ride for $15, my consumer surplus for this transaction is $10. Sum up the surpluses of all the other people looking for rides and you get the total consumer surplus for everyone. If there are 10 people all of whom are willing to pay $25 for a ride, but can buy a ride for $15, the total consumer surplus is 10*(25 – 15) = $100.
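The same arithmetic fits in a few lines of Python. This is only a sketch of the textbook definition. The function name is mine, and it assumes anyone whose maximum price is below the fare simply does not ride and so contributes nothing.

```python
def consumer_surplus(willingness_to_pay, price):
    """Sum of (max price - actual price) over everyone who buys at `price`."""
    return sum(wtp - price for wtp in willingness_to_pay if wtp >= price)


# Ten riders, each willing to pay $25, facing a $15 fare.
riders = [25] * 10
print(consumer_surplus(riders, 15))  # 100

# Someone only willing to pay $12 stays at the airport and adds nothing.
print(consumer_surplus(riders + [12], 15))  # 100
```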

The Demand Curve

Consumer surplus is one of those economics concepts that’s lovely in theory but nearly impossible to measure in practice. The essential problem is figuring out the maximum willingness to pay. You could ask each person what their maximum willingness to pay is, but if you ask me, I will probably lie to you.[1] Even if I am being honest, it is not certain that I will truly know what my maximum willingness to pay is anyway. I may think it is $25, but when only offered rides for $40, I might begrudgingly accept.

Researchers typically are not interested in the maximum willingness to pay of a single person, but in the overall relationship between the price of a good and how many people buy it. This relationship makes up another basic economics idea: the demand curve.

A Picture of a demand curve I found on the internet. The demand curve shows how much of a good would be purchased at different prices. In practice, only one point along this line is actually observed.

Demand curves, like maximum willingness to pay, are simple in theory but very difficult to measure in practice.

Why are demand curves so elusive?

What is observed are companies offering a certain price for a good, and a certain number of people buying it. The relationship that researchers would like to observe is how many people would have purchased the good if the price were different but everything else was the same.

Of course, prices and quantities sold change frequently, and it is tempting to think one could estimate the demand curve from these different points in time. This doesn’t really work. Each observed price/quantity duo occurs under different market conditions. When looking at multiple price/quantity observations over time, it is usually very difficult to say what else has changed between observations (consumer preferences, supply decisions, substitutes, etc.).

One way to figure out the demand curve would be to randomly offer people buying the same product different prices. With this information it would be possible to compare how many people purchased the product at each given price, and randomization would ensure that these differences were due to the price and not due to observed or unobserved factors about the consumers or the market conditions.

This is not something Uber does.

Except sometimes, they almost do.

Surge Prices

The answer comes from (perhaps not surprisingly) Uber’s surge price feature and (perhaps surprisingly) rounding. Uber passengers will be familiar with surge pricing, the practice through which prices are higher when there are more riders than drivers, such as at peak commuting times, on Friday nights, or on New Year’s Eve. Surges come in the form of a multiplier, starting at 1.2 times the base fare, then 1.3x, and 1.4x, all the way up to 4 times the regular price or more. Surge prices alone are not good enough to estimate the demand curve. Since the surges themselves are determined by the current market conditions, simply comparing between surges is no better than comparing different prices at different times.

Rounding

Uber has an algorithm that calculates what the surge should be. This algorithm calculates the surge estimator very precisely, down to some number of decimal places. Users of Uber only ever see surge prices to one decimal place[2]. When Uber’s algorithm calculates that the surge should be 1.28893, the app would round this up to 1.3. This is good news for people like Levitt[3], who are big fans of an econometric technique called regression discontinuity (RD).

Regression Discontinuity

Arbitrary cutoffs are nice ways of separating otherwise similar things into distinct groups, and therefore getting at something approaching a natural experiment. In this case, the things being separated into distinct groups are the different market conditions that Uber passengers face. A market that the Uber algorithm prices at 1.249 is very similar to one it prices at 1.251. Due to rounding, the first is put in the 1.2 surge category and the other is put in the 1.3 surge category. This causes near identical market conditions to be priced differently through an arbitrary cutoff. By observing how purchase decisions change on each side of this cutoff, it is possible to see how consumers in nearly identical market conditions behave when the price is changed.
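To make the cutoff concrete, here is a minimal Python sketch of the rounding step. Uber's actual code is not public, so the function name and the round-half-up rule for exact ties are my assumptions. Only the one-decimal-place display comes from the description above.

```python
from decimal import Decimal, ROUND_HALF_UP


def displayed_surge(raw_surge):
    """Round a precisely computed surge (given as a string) to one decimal place.

    Assumption: exact ties such as 1.25 round up. The source does not say.
    """
    return Decimal(raw_surge).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Two nearly identical markets end up facing different prices.
for raw in ["1.249", "1.251", "1.28893"]:
    print(raw, "->", displayed_surge(raw))
```

Passing the raw value as a string keeps the example free of binary floating point surprises near the cutoff.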

Fig 4. From Cohen et al, 2016. This shows how the purchase rate changes when the surge jumps from 1.2 to 1.3. Of particular interest are the points very near the point of discontinuity as these points experience similar market conditions, but different prices.

The dataset used in the study involved around 50 million different UberX user sessions from 4 cities in early 2015. Not all of these people ended up ordering a ride, so by comparing the percentage of people who hailed a ride on each side of the rounding discontinuity the researchers could see the impact of different prices under very similar market conditions[4]. In the figure above, sessions just to the right of the 1.25 cutoff (who experienced a 1.3 surge) purchased rides in about 56.5% of cases. Sessions just to the left (who experienced a 1.2 surge) purchased rides in about 58.5% of cases.
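Taking those two purchase rates at face value, the size of the jump and a rough price elasticity can be worked out directly. This is a back-of-the-envelope sketch using the midpoint formula, which is my choice of convention and not necessarily how the paper computes its elasticities.

```python
def midpoint_elasticity(price_low, price_high, rate_at_low, rate_at_high):
    """Percent change in purchase rate divided by percent change in price,
    with both percent changes measured against the midpoint."""
    pct_change_rate = (rate_at_high - rate_at_low) / ((rate_at_high + rate_at_low) / 2)
    pct_change_price = (price_high - price_low) / ((price_high + price_low) / 2)
    return pct_change_rate / pct_change_price


# Purchase rates just left (1.2x) and just right (1.3x) of the 1.25 cutoff.
jump = 0.565 - 0.585
elasticity = midpoint_elasticity(1.2, 1.3, 0.585, 0.565)

print(f"jump in purchase rate: {jump:+.3f}")
print(f"implied elasticity: {elasticity:.2f}")
```

An elasticity of roughly -0.4 means a price increase of about 8% only knocks about 3.5% off the purchase rate.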

Finding Consumer Surplus

To calculate the consumer surplus the researchers started with the group of people who bought a ride when there was no surge. Based on the estimates generated from the discontinuity, they calculated how many of these people would have purchased if the surge had been 1.2 instead of 1.0.

The total consumer surplus for this jump can be calculated as

Consumer Surplus = Percent Price Difference * avg. Fare paid * Number of trips that would have occurred had the surge level been 1.2.

The next step is to use the estimated number of people who would have purchased at a surge of 1.2 and do the same calculation to estimate the number of them that would have made a purchase at a surge of 1.3. This is kept up until the surge level reaches 3.9.

Ultimately this gives a set of points that map price and quantity purchased for people who purchased an Uber ride at a surge of 1.0. Plotting these points and drawing lines between them provides a demand curve. It looks like this:
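The step-by-step accumulation can be sketched as a loop. This is my reading of the procedure, not the paper's code. The function name is made up, and the single constant `retention_per_step` stands in for the discontinuity estimates, which in the paper differ at each surge level.

```python
def surplus_for_base_riders(n_riders, avg_fare, retention_per_step):
    """Rough consumer surplus for riders who bought at a surge of 1.0.

    retention_per_step: hypothetical share of the remaining riders who would
    still buy at the next surge level (assumed constant for illustration).
    """
    # Surge levels 1.2, 1.3, ..., 3.9, built from integers to avoid drift.
    surge_levels = [tenths / 10 for tenths in range(12, 40)]

    total = 0.0
    riders = float(n_riders)
    previous_level = 1.0
    for level in surge_levels:
        # Riders who would still have purchased at this higher surge.
        riders *= retention_per_step
        # They paid the base fare but would have paid this much more.
        total += (level - previous_level) * avg_fare * riders
        previous_level = level
    return total


# Hypothetical inputs: 1,000 riders, $10 average fare, 97% retained per step.
print(surplus_for_base_riders(1000, 10.0, 0.97))
```

Each pass through the loop adds one rectangle under the demand curve, so the total approximates the area between the curve and the price actually paid.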

Fig 6. from Cohen et al 2016. This shows the estimated relationship between the number of trips taken and the surge multiplier, for those people who actually purchased a ride at a surge level of 1.0.

The process is repeated for customers who bought rides at each other surge point. When these numbers are totaled the value for the total consumer surplus equals $2.88 Billion for the 4 cities that Levitt and Company had data on. To get the headline figure of $6.8 Billion, it is assumed that the same elasticities hold true for Uber riders in other cities as well.

Whose demand curve is it anyway?

One of the reasons the consumer surplus number is so big is that the consumers don’t seem to be too fussed[5] about higher surge prices. It’s worth keeping in mind who these people are. They are people who have already decided they want a ride, and have gone far enough to open the app and see what the price is. It makes sense that these people would not be particularly sensitive to price changes.

The high surge times are pretty predictable, and I suspect that many potential Uber riders who are unwilling to pay for a high surge, don’t even bother opening the app in a high surge time. These more price sensitive customers are not observed in the data.

Another issue may be that consumers in LA, NYC, Chicago, and San Francisco do not represent all of Uber’s users (this is why I prefer the $2.88 Billion number to the $6.8 Billion extrapolation). It’s not clear which direction this would bias the extrapolation. It is likely that consumers in these cities (with the possible exception of L.A.)[6] have more access to substitutes such as taxis and public transport than consumers in other places. This should make them less likely to take an Uber when faced with a higher price. On the other hand, residents of these cities are wealthier than people from other parts of the country, which should make them less sensitive to changes in price.

What the Surplus doesn’t tell you

With a few billion dollars of consumer surplus to dole out, it seems necessary that consumers must be better off with Uber than without it. This is very likely, but not certain. A comparison would need to be made between the consumer surplus generated by Uber and the consumer surplus generated by the pre-Uber world of taxis, public transport, and private cars. Since measuring consumer surplus is very difficult, this is a difficult comparison to make. The strongest evidence that consumers have been made better off by Uber is that they choose to take it over the preexisting options.

There are other things that aren’t included in this estimate: This estimate is only for the UberX product, it doesn’t say anything about other Uber products or ridesharing products offered by other companies such as Lyft.

This estimate tells us nothing about the other players in the Uber picture. The surplus figure is not a net sum. It doesn’t count any losses experienced by taxi drivers and taxi medallion owners, for example, nor does it provide any information about any gains Uber drivers experience from their vocation.

Where does the surplus come from?

The $2.88 Billion or, if one is feeling expansive, $6.8 Billion isn’t coming from anyone, because consumer surplus does not mean that an actual transfer is taking place. To illustrate this, consider an extreme example: water. Water is very important to us. We need it to stay alive and to make coffee. A person who has no water would be willing to pay nearly anything for that little bit of water needed to sustain life. Despite this, most of us get our water for prices that are much lower than infinity. This means that our consumer surplus for water is probably near infinity.

There is no transfer taking place in the water example. It is just our good fortune[7] that something we need so desperately is so plentifully and cheaply available. Likewise, there is no transfer taking place in the Uber situation either. There is just a market that provides rides at a cost below what many people would be willing to pay.

There is something that makes Uber rides cheaper and more plentiful than they otherwise might be.

The Investors

If there is anything Uber is good at, it is taking investors’ money and losing it. While Uber is a private company and does not release all of its financial information, it did lose $1.2 Billion in the first half of 2016, and has lost $4 Billion in its history. Uber is spending money to gain market share. What this means in practice is that Uber pays drivers more than it can afford, and/or does not charge riders enough to cover these costs. In effect, it is subsidizing the public’s use of Uber. As someone who is an Uber user and not an Uber investor, I think this is a superb idea and I encourage everyone involved to keep at it. It seems like there will be a point where they probably have to stop.

We may be near a high point of Uber’s consumer surplus per rider. If Uber is going to become profitable it seems they are going to have to either charge more for rides, or pay their drivers less[8], both of which would reduce consumer surplus.

That square thing

A Surplus of Surplus?

I don’t really care very much about the consumer surplus figure that Levitt and his Uber friends found. I just don’t have a good sense of how to interpret the raw dollar figure. What I would care about would be a figure for the change in consumer surplus. A change in consumer surplus would be a good way to give a direction to welfare changes. A comparison between two changes in consumer surplus would provide at least a relative understanding of the magnitude of a welfare change. But just a single measure is not particularly useful.

The part I do care about is the methodology and data used to estimate the demand curve. I think the identification strategy is quite clever, and I do not know of a more nicely identified demand curve[9]. I suspect this sort of estimation will be used frequently as companies have an easier time altering prices and recording individual transactions.

Sources, References, and Further Reading:

The Paper:

https://cbpp.georgetown.edu/sites/cbpp.georgetown.edu/files/ConsumersurplusatUber_PR.PDF

Freakonomics Podcast on this:

http://freakonomics.com/podcast/uber-economists-dream/

Naked Capitalism is not a fan:

http://www.nakedcapitalism.com/2017/01/can-uber-ever-deliver-part-six-bleak-pl-performance-while-stephen-levitt-makes-indefensible-claims.html

Others:

http://blogs.wsj.com/economics/2016/09/19/ubers-pricing-formula-has-allowed-economists-to-map-out-a-real-demand-curve/

http://www.citylab.com/commute/2016/09/uber-consumer-surplus/500135/

[1] Especially if you are the agent of a transportation company, or a secret agent, or any kind of agent really…

[2] Or at least they did when the study took place. I can’t get the Uber app to show me what the current surge level is. Google Maps still seems to though.

[3] And me!

[4] Very Similar is not a technical term. The ideal would be to compare identical market conditions with different prices, but since this is not an ideal world, very similar will have to do. The main issue with RD analysis arises when very similar is not similar enough. As one moves further from the point of discontinuity, the observations become less similar, and the basic assumption that they are essentially identical becomes less valid. In this paper this is dealt with in two ways. The first is to only look at points within a 0.01 window of the discontinuity. The second is to compare other observable variables for observations on either side of the discontinuity. If they do not change too much, then it is likely that the unobservable variable of interest does not either.

[5] “Don’t seem too Fussed about higher prices” is not a technical term but “inelastic” might be.

[6] Of the 4 cities, LA has the most inelastic riders, possibly because of these density and transportation issues.

[7] Of course, there are many people who do not have cheap access to clean water.

[8] This assumes that they actually have drivers in the future. Uber appears to be as excited about driverless cars as everyone else is.

[9] If you know of a more nicely identified demand curve, please send it to me: partysheephats@gmail.com or twitter @partysheephats_