GLENDOWER.

I can call spirits from the vasty deep.

HOTSPUR.

Why, so can I, or so can any man;

But will they come when you do call for them?

- Shakespeare, Henry IV, Part 1

# Calling swans

Recently a dear friend invited me to join them as they took their wedding photos at the Palace of Fine Arts. There's a pond next to the structure, and across the pond we saw one of the swans who reside there. Someone observed that it would have been nice to take a picture with the swan. So I called out, in a loud and clear voice, "Excuse me! Would you come over here?" and beckoned. Repeatedly.

I was pretty sure that it wouldn't work. Swans don't understand spoken language. Even if they did, as far as I could tell they have no plausible motive to respond.

The swan turned towards us and swam halfway across the pond. As it slowed down, my companions thought of more ways to get its attention, ways that seemed more likely to work on a swan, like tossing things into the water. But my plan did more than nothing.

It's an important skill, to be able to come up with plans like that. Sometimes you need to notice when things are impossible, and give up. But other times, it's worth at least trying the plan "yell at the swan."

What heuristic was I using? I'm not sure, but I think it has to do with noticing that my model of the world is incomplete.

# Overconfidence and model error

In general, people are notoriously overconfident when they make explicit predictions, unless they have been specifically trained to make probabilistic estimates. The review article, Calibration of Probabilities: The State of the Art, by Lichtenstein, Fischhoff, and Phillips (pdf here), gives a good summary:

Subjective probability assessments play a key role in decision making. It is often necessary to rely on an expert to assess the probability of some future event. How good are such assessments? One important aspect of their quality is called calibration. Formally, an assessor is calibrated if, over the long run, for all statements assigned a given probability (e.g., the probability is .65 that “Romania will maintain its current relation with People’s China.”), the proportion that is true is equal to the probability assigned. For example, if you are well calibrated, then across all the many occasions that you assign a probability of .8, in the long run 80% of them should turn out to be true. If, instead, only 70% are true, you are not well calibrated, you are overconfident. If 95% of them are true, you are underconfident.

[...]

Two general classes of calibration problem have been studied. The first class is calibration for events for which the outcome is discrete. These include probabilities assigned to statements like “I know the answer to that question,” “They are planning an attack,” or “Our alarm system is foolproof.” For such tasks, the following generalizations are justified by the research:

- Weather forecasters, who typically have had several years of experience in assessing probabilities, are quite well calibrated.
- Other experiments, using a wide variety of tasks and subjects, show that people are generally quite poorly calibrated. In particular, people act as though they can make much finer distinctions in their degree of uncertainty than is actually the case.
- Overconfidence is found in most tasks; that is, people tend to overestimate how much they know.
- Despite the abundant evidence that untutored assessors are badly calibrated, there is little research showing how and how well these deficiencies can be overcome through training.

The second class of tasks is calibration for probabilities assigned to uncertain continuous quantities. For example, what is the mean time between failures for this system? How much will this project cost? The assessor must report a probability density function across the possible values of such uncertain quantities. The usual method for eliciting such probability density functions is to assess a small number of fractiles of the function. The .25 fractile, for example, is that value of the uncertain quantity such that there is just a 25% chance that the true value will be smaller than the specified value. Suppose we had a person assess a large number of .25 fractiles. He would be giving numbers such that, for example, “There is a 25% chance that this repair will be done in less than x hours” or “There is a 25% chance that Warsaw Pact personnel in Czechoslovakia number less than x.” This person will be well calibrated if, over a large set of such estimates, the true value will be less than x 25% of the time. The measures of calibration used most frequently in research consider pairs of extreme fractiles. For example, experimenters assess calibration by asking whether 98% of the true values fall between an assessor’s .01 and .99 fractiles.

For calibration of continuous quantities, the following results summarize the research.

- A nearly universal bias is found: assessors’ probability density functions are too narrow. For example, 20 to 50% of the true values lie outside the .01 and .99 fractiles, instead of the prescribed 2%. This bias reflects overconfidence; the assessors think they know more about the uncertain quantities than they actually do know.
- Some data from weather forecasters suggests that they are not overconfident in this task. But it is unclear whether this is due to training, experience, special instructions, or the specific uncertain quantities they deal with (e.g., tomorrow’s high temperature).
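
The definition of calibration in the excerpt above can be made concrete with a small sketch. The function name `calibration_table` and the sample data are hypothetical, chosen only to mirror the article's example of someone who says ".8" and is right 70% of the time.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Group (stated_probability, came_true) pairs by stated probability
    and return the observed fraction of true outcomes for each group."""
    buckets = defaultdict(list)
    for stated, came_true in forecasts:
        buckets[stated].append(came_true)
    return {
        stated: sum(outcomes) / len(outcomes)
        for stated, outcomes in sorted(buckets.items())
    }


# Hypothetical data: ten statements each assigned probability 0.8,
# of which only seven turned out to be true.
forecasts = [(0.8, True)] * 7 + [(0.8, False)] * 3
table = calibration_table(forecasts)

for stated, observed in table.items():
    if observed < stated:
        verdict = "overconfident"
    elif observed > stated:
        verdict = "underconfident"
    else:
        verdict = "calibrated"
    print(f"stated {stated:.2f}, observed {observed:.2f}: {verdict}")
```

With so few statements the comparison is noisy; the article's definition is about what happens "over the long run."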

When people do epistemic calibration training, a common nonobvious initial source of error is failing to take into account the fact that your model might be wrong.

For instance, I once participated in a group calibration exercise in which someone gave an extremely narrow 90% confidence interval (or, in the language of the above article, the range between the .05 and .95 fractiles) for the distance from the sun to the earth in kilometers. They based it on having memorized the speed of light and having a general sense of how many light-minutes away the sun was. They ended up being several orders of magnitude off because they'd made a mistake about units.

Another time, someone asked for 90% confidence intervals around how many members of parliament the UK had, and almost everyone got it wrong because parliament was out of session pending an election, so there were technically zero members of parliament at the time.

Yvain summarizes the reason this sort of thing happens:

When an argument gives a probability of 999,999,999 in a billion for an event, then probably the majority of the probability of the event is no longer in "But that still leaves a one in a billion chance, right?". The majority of the probability is in "That argument is flawed". Even if you have no particular reason to believe the argument is flawed, the background chance of an argument being flawed is still greater than one in a billion.

More than one in a billion times a political scientist writes a model, ey will get completely confused and write something with no relation to reality. More than one in a billion times a programmer writes a program to crunch political statistics, there will be a bug that completely invalidates the results. More than one in a billion times a staffer at a website publishes the results of a political calculation online, ey will accidentally switch which candidate goes with which chance of winning.

So one must distinguish between levels of confidence internal and external to a specific model or argument. Here the model's internal level of confidence is 999,999,999/billion. But my external level of confidence should be lower, even if the model is my only evidence, by an amount proportional to my trust in the model.

Is That Really True?

One might be tempted to respond "But there's an equal chance that the false model is too high, versus that it is too low." Maybe there was a bug in the computer program, but it prevented it from giving the incumbent's real chances of 999,999,999,999 out of a trillion.

The prior probability of a candidate winning an election is 50%. We need information to push us away from this probability in either direction. To push significantly away from this probability, we need strong information. Any weakness in the information weakens its ability to push away from the prior. If there's a flaw in FiveThirtyEight's model, that takes us away from their probability of 999,999,999 in a billion, and back closer to the prior probability of 50%.
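
One simple way to put numbers on the argument above is to treat external confidence as a blend of the model's answer and the prior, weighted by how much I trust the model. This is my own simplification, not a formula from the quoted post: it assumes that a flawed model tells me nothing, so I fall back entirely on the prior. The name `external_confidence` and the trust levels are hypothetical.

```python
def external_confidence(internal, trust_in_model, prior=0.5):
    """Blend the model's internal probability with the prior.

    Assumption: with probability `trust_in_model` the model is sound and its
    answer stands; otherwise the model is uninformative and the prior applies.
    """
    return trust_in_model * internal + (1 - trust_in_model) * prior


internal = 999_999_999 / 1_000_000_000  # the model's own stated confidence

# Hypothetical trust levels, for illustration only.
for trust in (0.9, 0.99, 0.999):
    print(f"trust {trust}: external confidence {external_confidence(internal, trust):.6f}")
```

Under this sketch, even 99% trust in the model caps external confidence at about 99.5%, nowhere near 999,999,999 in a billion.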

It's obvious how this should make us more cautious, when it looks like our model predicts that some strange action is useful. But for the same reason, it should make us more reluctant than we might naively be, to rule out the usefulness of normal-seeming actions.

# To fight overconfidence, try things you "know" won't work

There's a stereotype that being epistemically humble, being uncertain, means behaving in a timid way, not trying new things. But this conflates social with epistemic confidence. Social confidence is feeling confident that it's OK to try things, that you won't be rejected or harmed if you stick out and take initiative. Epistemic confidence means that your expectations are constrained, your uncertainty is focused.

I'm reading Algorithms to Live By, by Brian Christian, and just got through a part where he suggests that toddlers will try all sorts of weird stuff because they are in "exploration" mode. If you don't know much about how the world works yet, it pays to invest in trying more things, because you have little idea what the payoffs are. This is especially important when you're young, because the cumulative gains from finding even a slightly better option may be large when you have a long time to "exploit" your knowledge afterwards.

Anna Salamon proposed, as a rationality practice, increasing your propensity to try new things and things you don't know will work, as a corrective for overconfidence:

You may think “overconfidence” when you hear an explicit probability (“It’s 99% likely I’ll make it to Boston on Tuesday”). But when no probability is mentioned -- or, worse, when you act on a belief without noticing that belief at all -- your training has little impact. [...]

If you want to notice errors while you’re making them, think ahead of time about what your errors might look like. List the circumstances in which to watch out and the alternative action to try then. [...]

- How does it help to know about overconfidence? What can you do differently, once you know your impressions are unreliable?

Action ideas:

- Try many things, including things you “know” won’t work. Try cheap ones.
- Don’t be so sure you can’t do things.

My friend Brent likes to pose apparently impossible problems, and recently proposed one (in the comments - the OP is another impossible problem):

There is a rover on Mars that is sending a distress call. You just received it; you need to reply so that it receives instructions no later than 15 seconds after it sent the distress. How do you accomplish this with only stone age technology?

I came up with a clever tricksy plausible answer that buys a tiny tiny chance of meeting the success condition depending on how you construe the question, but I don't want to talk about my clever tricksy plausible answer. I want to talk about my boring, obvious, desperate solution:

*Yell at Mars.*

To state the solution more fully: Sometimes, if I have a problem of the class, "send a signal a substantial distance quickly," the solution is "yell really loud". In this particular case, I have specific beliefs about the world - the distance between me and Mars, the speed of light, the speed of sound, the impossibility of transmitting sound waves through the vacuum of space - that make this solution impossible within my model. But I also have some model uncertainty.

By the same principle that my confidence intervals should take into account model error, I should consider solutions that resemble the true solution but seem like they can't work in this particular case. My prior assigns some usefulness to the action ("yell really loud") that crudely pattern-matches to the problem ("send a message far, fast"). My model of the situation justifies a strong update away from that. But I don't expect this model to be as strong as it claims to be. I expect to be able to make ten judgments this strong without being wrong once, but I don't expect to be able to do that ten thousand times in a row.
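
The "ten versus ten thousand" intuition can be checked with a little arithmetic. The sketch below assumes that judgments of this kind fail independently at some fixed rate. The rate of one in a thousand is a made-up number for illustration, and `chance_of_clean_streak` is a hypothetical helper name.

```python
def chance_of_clean_streak(error_rate, n_judgments):
    """Probability of making n independent judgments with no errors at all."""
    return (1 - error_rate) ** n_judgments


# Hypothetical: one judgment of this strength in a thousand turns out wrong.
error_rate = 0.001

for n in (10, 10_000):
    print(f"{n} judgments: {chance_of_clean_streak(error_rate, n):.5f} chance of no mistakes")
```

At that error rate, ten judgments in a row come out clean about 99% of the time, while a clean run of ten thousand is vanishingly unlikely.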

This doesn't mean that you should always try the thing. I genuinely wouldn't yell at Mars under normal circumstances, even if it would be very convenient to send a message to the Martian rover, because the cost of trying every strategy like that is way too high and I have much more promising things to try that are much more likely to succeed. But if everything I cared about depended on solving this problem, if I were making a desperate effort, then I would definitely at least try yelling at Mars.

And that's exactly the mental move that let me summon a swan from across the pond.