Summary

Quantification and metric optimization are powerful tools for reducing suffering, but they have to be used carefully. Many studies can be noisy, and results that seem counterintuitive may indeed be wrong because of sensitivity to experiment conditions, human error, measurement problems, or many other reasons. Sometimes you're looking at the wrong metric, and optimizing a metric blindly can be dangerous. Designing a robust set of metrics is actually a nontrivial undertaking that requires understanding the problem space, and sometimes it's more work than necessary. There can be a tendency to overemphasize statistics at the expense of insight and to use big samples when small ones would do. Finally, think twice about complex approaches that sound cool or impressive when you could instead use a dumb, simple solution.

Introduction

The effective-altruist (EA) and rationalist movements love quantification. We extol the power of data to guide our decisions and sometimes lead us to unexpected conclusions. We love the ways in which metrics allow us to optimize for good outcomes with more power and efficiency than just relying on intuition or random acts of charity. These notions are not unique to us; they're shared by many in the worlds of business, science/technology, etc.

I'm enthusiastic about the above, but there are a few areas in which I feel people sometimes place too much weight on formal studies at the expense of anecdote and common sense. Everything I say below is already known, so consider this just a summary of my thoughts rather than a novel contribution.

"95% confidence" is not 95% confidence

Sometimes quantifiers get excited when a study's results contradict common sense. "You see," they say, "your naive intuition can't be trusted after all. It's numbers or the highway, my friend." They present a study with p-value <0.05 to prove a surprising conclusion.

There are many well-known problems here. One is publication bias. Even without that, there are plenty of ways to massage your data or statistical tests to get a significant result. For example, in a regression, if you don't get a significant result on the raw variables, try taking the log of the values to see if it becomes significant then. Keep going until you find a test and transform that work. And then there are ways to design experiments to achieve the outcomes you want. That industry-funded drug studies have more favorable results than independent ones is a hint of what's possible.

What if you're doing the research yourself and are careful to avoid the problems suggested above? Even if so, your results may be overly sensitive to the specific conditions or data set of your study. I've seen this many times: People show a stat-sig result in one configuration of an experiment, but then on a slightly different configuration that has only trivial changes, the result goes away. It might even go away just with the passage of time. The same idea applies to research in a lab done in a very particular way at a particular time with particular conditions measuring particular variables. The result may be sensitive to so many minor factors that when you change the configuration a little bit, the result vanishes. This is especially true when you have a big sample size: It's hard not to get a stat-sig result with a big enough sample, even if all that's being picked up is a small, irreducible bias in the experiment configuration rather than a real effect.
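To make the big-sample point concrete, here is a minimal sketch of how a fixed, tiny bias turns "significant" as the sample grows. It assumes a simple two-sided z-test and, for simplicity, that the observed mean equals the bias exactly (no sampling noise). The function name and the bias of 0.01 standard deviations are my own illustrative choices, not numbers from any real experiment.

```python
import math
from statistics import NormalDist

def two_sided_p_value(bias, sd, n):
    """Two-sided z-test p-value when the observed mean equals `bias` exactly."""
    z = bias / (sd / math.sqrt(n))  # the standard error shrinks as n grows
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical setup: a systematic bias of 0.01 standard deviations.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(bias=0.01, sd=1.0, n=n))
```

The bias never changes here; only the sample size does. With a million observations the p-value is far below 0.05 even though the "effect" is too small to matter.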

People say "most published findings are false" in the sense that they don't replicate. I won't get into the debate about how much this holds true in science as a whole, but there are certainly domains where it does. If you can replicate the result over a variety of conditions and metrics, then you're in pretty good shape.

Don't believe everything you see

I often see results that seem too good (or too bad) to be true. Upon further investigation, it almost always turns out that indeed there was something wrong that invalidated the ostensible finding. Maybe there was system instability. Maybe I configured the experiment wrong. Maybe I used the incorrect baseline. Maybe the metric wasn't being computed the way I thought it was. Maybe there's a trend that always appears and isn't due to my particular changes. And so on.

Standards of rigor differ as far as verifying experimental configurations, procedures, and data analysis, but there are many times when even careful people, if they go back and inspect their work, will find errors that can make a big difference to the results. One reason experimental replication is important is that it reduces the impact of human error as well as noise from the environment and methodology.

So if you see results presented that sound strange, be skeptical. Don't silence your intuition. The output of the neural networks in your brain is not necessarily "less rigorous" than the results of a study that has potentially undetected flaws.

Power and metric choice

There are some metrics that you can't move visibly, no matter how good your intervention is. Sometimes a change can be shown to be a good thing at a local level beyond a shadow of a doubt, and still it shows no impact on a higher-level metric. This does not mean your local change was worthless. It probably means the higher-level metric isn't sensitive enough to detect what you're looking for. Proxy metrics are inevitable, and to the extent that they allow you to iterate faster and with less noise in your measurement process, use them.
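A rough power calculation shows why a noisy high-level metric may never move visibly. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample z-test. The function name, the 5% significance level, and the 80% power target are assumptions I picked for illustration.

```python
import math
from statistics import NormalDist

def required_n_per_group(effect_in_sds, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    `effect_in_sds` is the true difference in means, measured in units of
    the metric's standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_in_sds ** 2)

# A local metric that moves by half a standard deviation vs. a broad
# metric that moves by one hundredth of a standard deviation.
print(required_n_per_group(0.5))
print(required_n_per_group(0.01))
```

Under these assumptions, the same intervention needs dozens of observations per group on the sensitive metric and well over a hundred thousand on the insensitive one.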

If a study finds "no correlation" between some specific intervention X and some broad population variable Y, I'm not surprised. You're probably not measuring at an observable level.

Campbell's law

I'm being liberal with my interpretation of Campbell's law, which I'm taking to mean the general phenomenon that optimizing blindly for a metric can undermine the original purpose. When people criticize quantitative analysis as being "too narrow" and "missing the forest for the trees," this is often what they're getting at.

Someone tells you that "a high GPA is an excellent predictor of career success," so you end up studying for your courses excessively at the expense of other important factors like networking, career research, philosophical analysis of how to accomplish good in the world, and trying new experiences that will shape your perspective on life and teach you "things you never knew you never knew."

Sometimes Campbell's law looks like "gaming the system," i.e., finding loopholes in the metrics that allow you to "cheat" the underlying purpose. Your teacher penalizes students who don't have white eyes looking at the board during class, so you paint dots on ping-pong balls. A third-party company gets paid for getting visits to a veg video landing page, so it generates a bunch of spam links to your site, or encourages bot traffic, or whatever. Such scenarios are common with principal-agent problems.

That said, sometimes Campbell's law is less obvious. Sometimes you don't know what problems the metric has, especially if you never dive into examples of how it works. So it's good to keep your metrics grounded by looking at some sample cases. It's also recommended to develop multiple metrics that are as independent as possible, even if they track slightly different components of the overall goal.

Finally, sometimes metrics are just not worth the effort. It's too hard to quantify some things, and maybe our brains already do it well enough. Sometimes trying to follow a metric ends up doing more harm than good because you tinkered with something that didn't need fixing. Sometimes you waste time designing a seemingly sophisticated metric for something that was obvious all along. Sometimes the metrics that you make up just encapsulate what you were going to do anyway and merely obfuscate that fact.

Some metrics are very powerful. But this doesn't mean that making something else into a metric will unleash the same power. Some situations are more amenable to explicit metrics than others. Plus, metrics are hard. They take time and attention to get right. A naive metric for steering your car will likely cause you to drive into a ditch when you could have stayed on the road just by looking where you were going. Use sanity checks.

Take samples

Say you have a data set with demographic information on 10,000 people. You want to get an understanding of what kinds of people are in your sample. One way is to run a bunch of statistical tests and scatter plots on the whole data set. Another is to look through 10-20 of the people by eye and get a sense for who they are and why they do what they do. Sometimes looking at some specific examples can give you more insight than all the regression coefficients in the world. Sometimes there's a story going on that you wouldn't notice unless you got your hands dirty with concrete cases. Of course, it's great if you can do both. But don't assume the statistical stuff is necessarily better, and don't do only statistical stuff just because it's more cool.

Applying statistical sampling to reading can also be a good practice. Instead of reading one book end-to-end, read a sampled one-tenth of 10 books. Of course, reading is different from merely assessing quantitative traits of a population, because explore-exploit dynamics enter in based on expected value of information. But the idea of using samples to avoid perfectionist inefficiency is similar.

Sampling can also be a good idea because sometimes high-powered statistical stuff isn't worth the effort at all. You could look through 10-20 people's profiles in <45 minutes, but to do the statistical analysis, you'd need to set up your tools, format the data, do appropriate transforms, and everything else. If you have all of this ready, that's great, or if you want to put the tools in place so that you can do it on a regular basis, that's wonderful. But if this is a one-off job that you won't ever do again (e.g., some special parsing of the data that isn't common), then consider whether you need to analyze the whole data set. Samples can give you a good basic picture, and most of the time, you only care about trends that have a big effect size. Often you don't need to worry about optimizing to the ones digit of precision.

Interestingness bias

Relatedly, I often see a tendency for people to make things unnecessarily complicated because that way the problem is more interesting. This can be true for experimental designs, intervention approaches, or statistical analysis. For example, someone might write a web tool for a survey that reads data into a customized numerical-analysis program for statistical processing when in fact the number of people surveyed is 20, and the statistics could have been computed in 10 minutes using Excel or an online confidence-interval calculator. Don't use fancy statistical tests when the t-test works just fine. Don't introduce a bunch of tweaks and special cases to a system before you've tried the simple thing and identified specific weaknesses where complexity is justified. "Complexity kills," as they say -- in terms of execution time, debuggability, and maintenance cost -- so "keep it simple, stupid."

The real world is not academia where you only get published if you show something interesting. Reducing animal suffering is about doing what works, even if it's a boring, simple thing rather than a grand theory or complicated adaptive system that you've built. There's a natural tendency to be impressed by complex things, so resist the temptation: Encourage the people who achieve the same results in a simpler way.

I have a friend who once said, "Math envy is a disease. A disease." While I love math for recreation, I realize that most of the time, complicated math is much more dangerous than simple math, because (a) it's unnecessary, (b) it's opaque, (c) it gives a false air of sophistication, and (d) it's almost certainly wrong. Every additional variable you add to give 1% extra nuance to a decision analysis takes away 5% of the accuracy of the final conclusions because of the additional uncertainty and conjunctiveness. (I just made those numbers up, but they illustrate my point.)

95% confidence is not necessary

Sometimes people complain about a study because "the sample size was too small." There are two possible reasons why this objection isn't important.

The first is that people may not have a good intuitive sense for how much sample size matters. When looking at means of a sample, the standard error shrinks only in proportion to one over the square root of the sample size. So the standard error of a sample of size 16 (sqrt(16) = 4) is only half that of a sample of size 4 (sqrt(4) = 2). The standard error of a sample of size 100 is only half that of a sample of size 25. Quadrupling the sample buys you just a factor of 2 in precision.
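The square-root relationship is easy to check numerically. This is a minimal sketch that assumes independent observations with a known standard deviation (set to 1 here purely for illustration).

```python
import math

def standard_error(sd, n):
    """Standard error of the mean of n independent observations."""
    return sd / math.sqrt(n)

# Each quadrupling of the sample size only halves the standard error.
for n in (4, 16, 25, 100):
    print(n, standard_error(1.0, n))
```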

The second reason is that, as noted above, most analyses don't need high precision. For first-pass investigations, you usually only care about the order of magnitude or maybe the differences that are at least a factor of 2. You don't require 1000 data points for this. Moreover, the standard levels of statistical significance are usually higher than you need. You don't need to use 95% confidence levels when the cost of sampling is big. A single data point can update your Bayesian prior probabilities, sometimes quite a bit. In the sunrise problem, if you start with a 1/2 prior, then after the first day that the sun rises, your posterior probability is already 2/3. By day 8, it's 9/10. Any extra assurance that you get after that isn't needed most of the time.
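The sunrise numbers above match Laplace's rule of succession, which follows from a uniform prior on the unknown probability of sunrise: after n sunrises in a row, the probability of another is (n + 1) / (n + 2). A small sketch, with a function name of my own choosing:

```python
from fractions import Fraction

def prob_sunrise_tomorrow(days_seen):
    """Laplace's rule of succession under a uniform prior: (n + 1) / (n + 2)."""
    return Fraction(days_seen + 1, days_seen + 2)

# Day 0 is the 1/2 prior; most of the updating happens in the first few days.
for n in (0, 1, 8, 100):
    print(n, prob_sunrise_tomorrow(n))
```

Going from 0 to 8 observations moves the estimate from 1/2 to 9/10, while going from 8 to 100 only adds about another 0.09.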

A practical example of the above point is that when you're comparing veg booklets or persuasive essays or website layouts, don't immediately jump to a big survey or website-optimization tools. Consider asking 10 people for feedback. Or 5 people, or even just one other person. The incremental value of the additional input declines rapidly. If you ask one person for feedback on ten different things, that's much more useful than asking 10 people for feedback on just one of them.

Sometimes a smaller sample size helps because it lets you reduce bias. For example, say you're trying to assess salaries in a given career. One way is to consult standard salary tables computed by the government, which presumably use very big samples. But these can be misleading, sometimes by a factor of 2 or 3 or more, because the metrics may not capture everything that's relevant (e.g., bonuses, carried interest, stock options). Moreover, the metrics are relative to some unknown average of people, whereas you might know people who are closer to your own level of ability and ambition. I wager that asking 3 people who work in the industry how much you could expect to earn will give you a better estimate than looking at an entire salary survey that may be biased. In the bias-variance tradeoff, the "ask 3 people" approach has higher variance, but I suspect the reduction in bias outweighs that.
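To see how a small, less biased sample could beat a huge biased one, here is a sketch using the usual decomposition of mean squared error into squared bias plus variance. Every number below is made up for illustration (I have no real salary data in mind), and the calculation assumes independent observations.

```python
import math

def rmse_of_mean(bias, sd, n):
    """Root-mean-square error of a sample mean: sqrt(bias^2 + sd^2 / n)."""
    return math.sqrt(bias ** 2 + sd ** 2 / n)

# Hypothetical: a big survey that misses bonuses (large bias, tiny variance)
# vs. asking 3 people like you (small bias, large variance).
survey = rmse_of_mean(bias=40_000, sd=30_000, n=10_000)
ask_three = rmse_of_mean(bias=5_000, sd=30_000, n=3)
print(round(survey), round(ask_three))
```

With these invented inputs, the survey's error is dominated almost entirely by its bias, which no amount of extra sample size can remove.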

What I said about small sample sizes applies in the usual case when you're doing preliminary analysis or working with big effect sizes. If you've picked a specific problem and want to home in to make sure the cost-effectiveness is what it seems, at that point you begin to worry more about statistical significance. As in the problem of multiple hypothesis testing, if you evaluate a bunch of causes with an unbiased but high-variance approach, some will come out as looking really promising just by chance. At that point, you'll want to increase the sample size and pursue independent lines of investigation to make sure you were right about what you saw. The evidential standards are higher when a lot hinges on the decision. For example, a charity recommendation by Effective Animal Activism should be held to higher standards than a preliminary screening analysis by an individual, because the additional effort to verify the cost-effectiveness will affect a lot of donated dollars down the road.
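The multiple-testing worry can be quantified with one line of arithmetic. The sketch below assumes the evaluations are independent and that none of the causes has any real effect. Both are simplifications, and the function name is my own.

```python
def chance_of_false_lead(num_causes, alpha=0.05):
    """Probability that at least one of `num_causes` independent null
    evaluations looks significant at level `alpha` purely by chance."""
    return 1 - (1 - alpha) ** num_causes

for k in (1, 5, 20, 100):
    print(k, round(chance_of_false_lead(k), 3))
```

Under these assumptions, screening 20 causes gives you roughly a 64% chance that at least one looks promising for no reason at all, which is why the winners deserve a second, more careful look.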

If you're using a proper Bayesian approach to probabilities, then the expected value of an intervention will itself be affected by the amount of evidence it has, as Holden explains in a famous blog post. The post is rightly criticized for assuming normal/lognormal distributions, but the basic idea of constraining your estimates by a prior is uncontroversial. It's basically just saying, "don't naively use an unbiased MLE." The so-called "optimizer's curse" is a simple concept that has been around well before Holden's post.
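Here is a minimal sketch of what constraining an estimate by a prior looks like. It uses the normal-prior, normal-noise case because the arithmetic is simple, not because that assumption is right (as noted, it's the part of the post that draws criticism). The function name and all numbers are hypothetical.

```python
def posterior_mean(prior_mean, prior_var, estimate, estimate_var):
    """Precision-weighted average of a normal prior and a noisy normal estimate."""
    prior_weight = 1 / prior_var
    data_weight = 1 / estimate_var
    return (prior_weight * prior_mean + data_weight * estimate) / (prior_weight + data_weight)

# Same headline estimate of 10, backed by weak vs. strong evidence.
weak = posterior_mean(prior_mean=1.0, prior_var=1.0, estimate=10.0, estimate_var=100.0)
strong = posterior_mean(prior_mean=1.0, prior_var=1.0, estimate=10.0, estimate_var=1.0)
print(weak, strong)
```

The weakly supported estimate gets pulled almost all the way back to the prior, while the well-supported one lands halfway between prior and estimate. Taking the raw estimate of 10 at face value in both cases is the naive unbiased approach the paragraph above warns against.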
