Graphical inference
Our brains are expert pattern finders, so how do we avoid finding patterns that don't exist?
- How do we convict data sets?
- Using Rorschach and line-ups for graphical inference
- What is the probability of correctly convicting a guilty dataset with our new tools?
- Next on the reading list
Exploratory data analysis is often geared towards finding new patterns simply from looking around, whereas statistical inference focuses on rejecting patterns that are spurious. In a 2010 paper titled Graphical inference for infovis, Hadley Wickham, Dianne Cook, Heike Hofmann and Andreas Buja suggest a visualization technique allowing us to do visual hypothesis testing. The paper is a good read, written in clear language, and uses their suggested visualization framework to create puzzles for the reader.
How do we convict data sets?
The paper goes all out on the analogy of convicting data sets: In the statistical justice system, the data set plays the role of the accused, and the accusation is the hypothesis we wish to test, for example that a medical treatment is effective. The evidence is the test statistic, which we compare to a standard in order to convict or acquit the dataset. That standard is the population of innocents, which we call the null distribution: the innocents are generated from the null hypothesis and the test statistic. The guilt of the accused is determined by the fraction of the population of innocents that looks more guilty than the accused; this fraction is the p value.
In visual testing, the test statistic and the evaluation of similarity with the innocents differ: instead of a test statistic we use a plot of the data, and instead of a mathematical measure of difference, judgement is left to a human judge (or even a jury). In visual testing we use a null dataset, a sample from the null distribution, and a null plot, a plot of a null dataset. This shows us what an innocent data set might look like. We present our human judge or jury with a plot of the accused dataset hidden among several null plots, and see if the judge can spot the accused. If they cannot, we fail to reject the null hypothesis, but we do not thereby conclude that the accused is innocent. If they can, there is evidence that the accused is not innocent.
The article does not suggest that visual testing should replace statistical hypothesis testing, but that in exploratory data analysis, where testing is perhaps not the standard, visual testing provides a framework for skepticism. The consequence of convicting an innocent data set, i.e. a false positive, is usually low in exploratory analysis, so we can allow ourselves a few more errors. Visual testing can also meet the needs of complex exploratory analysis for which there is no corresponding numerical test.
Using Rorschach and line-ups for graphical inference
The paper suggests two different tools that are useful when performing visual testing: The Rorschach and the line-up.
The Rorschach
The Rorschach provides a plot like the one below, and encourages the reader to take a moment to study the plots and consider what they tell us. This visualization is named after the Rorschach test, as it is posed as an open question to the subject: What do you see?
The twist is that the data is all generated from the same distribution - the uniform distribution between 0 and 1, so any patterns spotted in the plots above are spurious relationships. The Rorschach should be used to calibrate to natural variation.
The paper suggests setting up a test with an administrator providing plots and asking questions, and an analyst answering. One could also use software instead of an administrator, and the paper suggests using this as a calibration exercise. If we include plots of real data in between null distribution plots, we can hopefully avoid the analyst tiring.
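To make the setup concrete, here is a minimal sketch of how such a panel of null plots could be generated with numpy and matplotlib. The grid size, the sample size and the choice of scatter plots are my own assumptions; the point is only that every panel is drawn from the same uniform distribution:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Nine null plots: every panel is drawn from the same uniform(0, 1)
# distribution, so any "pattern" the viewer finds is spurious.
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for ax in axes.flat:
    x = rng.uniform(0, 1, size=50)
    y = rng.uniform(0, 1, size=50)
    ax.scatter(x, y, s=10)
    ax.set_xticks([])
    ax.set_yticks([])

fig.suptitle("What do you see?")
plt.show()
```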
The line-up
In the line-up, we measure similarity between the accused data set and the innocents visually instead of numerically. We line up several null plots along with the accused data set, for a human judge or jury to evaluate. If the accused data set can be distinguished from null plots, there is some evidence that the data set is guilty. If we cannot identify the accused dataset, we do not conclude that the data set is innocent, rather we say that we fail to reject the null hypothesis.
A line-up might look something like the plot below, where the true data plot compares the yearly temperature of the reference period of the Meteorological Institute, 1961 - 1990, to the yearly temperature of the following 30 years. The data was fetched from Norsk Klimaservicesenter. The true data is mixed in with plots of data resampled from the period 1961 - 2019. The null hypothesis is that the temperature distribution in the reference period and the following thirty years are equal.
Resampling data seems to be the simplest strategy for creating null data when our hypothesis is that two groups are different. In this case the groups are the reference period, 1961 - 1990, and the following years, 1991 - 2019. If we transform our data set to tidy data, where each variable is a column and each observation is a row, we can resample the data by randomly permuting one of the columns. Any dependence between the groups is broken by the permutation. (The Tidy data paper is written by one of the authors of Graphical inference for infovis, Hadley Wickham, of R and tidyverse fame.)
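As a concrete sketch of this recipe, a small helper can build a line-up from a tidy pandas data frame by permuting a single column in every null dataset. The function, its name and its signature are my own construction, not something prescribed by the paper:

```python
import numpy as np
import pandas as pd


def lineup(data: pd.DataFrame, permute: str, n_plots: int = 20, seed=None):
    """Return n_plots data frames plus the position of the real data.

    All entries except one are null datasets created by permuting the
    `permute` column, which breaks any dependence between that column
    and the rest of the (tidy) data frame.
    """
    rng = np.random.default_rng(seed)
    accused_position = int(rng.integers(n_plots))
    plots = []
    for i in range(n_plots):
        if i == accused_position:
            plots.append(data)  # the accused data set, unchanged
        else:
            null_data = data.copy()
            null_data[permute] = rng.permutation(null_data[permute].to_numpy())
            plots.append(null_data)
    return plots, accused_position
```

For the temperature line-up above one would, assuming the period label lives in a column called period, call something like `lineup(temperatures, permute="period")`, plot each of the returned data frames, and only reveal the accused position after the judge has made a guess.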
Other use cases for resampling one column in the data to create null datasets (a usage sketch follows the list):
- If we want to investigate spatial correlation on a map, the null hypothesis can be that location and value are independent. We can permute the value column to generate null datasets.
- Scatter plots can be used for investigating correlation between $x$ and $y$. If our null hypothesis is that $x$ and $y$ are independent, we can permute either the $x$ or $y$ column to create null datasets.
- If we want to investigate correlation between $x$ and $y$ between a number of different groups, we might visualize the data in a scatter plot of $x$ and $y$, with different colors for different groups. The null hypothesis is that groups and position are independent, so to generate a null dataset, we can permute the group id column.
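The same helper covers all three cases; only the permuted column changes. The data frames and column names below are hypothetical, for illustration only:

```python
# Hypothetical tidy data frames and column names, for illustration only.
plots, answer = lineup(map_data, permute="value", seed=1)         # spatial correlation
plots, answer = lineup(scatter_data, permute="y", seed=2)         # correlation of x and y
plots, answer = lineup(grouped_data, permute="group_id", seed=3)  # correlation within groups
```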
Another plot of the same temperature data can explore whether or not year and temperature are independent:
Here, the temperature column has been resampled for the null plots, and there is no splitting into groups. The hypothesis that year and temperature are independent is perhaps too strict to provide valuable input. However, it is interesting to see the spurious patterns we falsely identify in the null plots: plot 4 has an upward trend at the beginning of the period, whereas plot 2 has an upward trend at the end of the period.
Often we will find that the assumption that two variables are independent is too strong: in some settings it is obvious that the two variables are related, as in the example above. If we want to test a more complex hypothesis, for example that our data follows a certain model, we can simulate data from the model for our null plots. A typical example would be testing that the residuals in a regression model are normally distributed. We could then generate null datasets from the normal distribution to mix with our real data.
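As a sketch of how that could look, the code below fits a straight line to some invented data and builds a line-up in which the histogram of the real residuals is hidden among histograms of residuals simulated from a normal distribution with the same spread. All data and layout choices are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=5)

# Invented regression data; the null hypothesis is that the residuals
# are normally distributed (the noise here is deliberately heavy-tailed).
x = rng.uniform(0, 10, size=80)
y = 2.0 + 0.5 * x + rng.standard_t(df=3, size=80)

# Fit a straight line and compute the observed residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

n_plots = 20
accused_position = rng.integers(n_plots)

fig, axes = plt.subplots(4, 5, figsize=(15, 10), sharex=True)
for i, ax in enumerate(axes.flat):
    if i == accused_position:
        values = residuals  # the real residuals
    else:
        # Null dataset: simulate residuals from a normal distribution
        # with the same spread as the observed residuals.
        values = rng.normal(scale=residuals.std(), size=residuals.size)
    ax.hist(values, bins=15)
    ax.set_title(str(i + 1))

plt.show()
```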
What is the probability of correctly convicting a guilty dataset with our new tools?
First and foremost, for inferential validity it is important that the analyst does not see the data before being asked to pick it out of the line-up. We can rely on an independent analyst, or write software to ensure that the analyst does not see the data in advance.
In a practical setting, the paper recommends using 19 null plots along with one plot of our accused data set. Then the probability of picking the accused data set, if it is innocent, is 1/20 = 0.05, a traditional boundary for statistical significance. A larger number of plots gives a smaller p value, but also leads to viewer fatigue.
Instead of increasing the number of plots, we can increase the number of people trying to find the real data set among the plots, gathering a jury of analysts instead of a single judge. Imagine we have $K$ jurors. If $k$ of them spot the real data, our p value is $P(X\geq k)$, where $X$ follows a binomial distribution $B(n, p)$ with parameters $n=K$ and $p=0.05$ if we use 20 plots. If all jurors spot the data set, our p value will be $0.05^K$.
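For a quick check of the arithmetic, the jury p value can be computed with the binomial survival function; the values of $K$ and $k$ below are just an example:

```python
from scipy.stats import binom

K = 5       # number of jurors
k = 4       # jurors who picked out the real data
p = 1 / 20  # chance of picking the accused by luck in a 20-plot line-up

# P(X >= k) for X ~ Binomial(K, p): the probability that at least k jurors
# point at the accused even if it is innocent.
p_value = binom.sf(k - 1, K, p)
print(p_value)  # 3.0e-05
print(p ** K)   # if all five jurors spot it: 0.05**5 = 3.125e-07
```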
In reality, an effective visualization will help the analyst spot the true data, and a poor visualization can make it more difficult. In any case, the techniques in this paper describe a few ways of safeguarding against spurious pattern finding. Calibration to the natural variability in data is also an interesting exercise for an audience looking at unfamiliar data. In my opinion, the paper is also well written and teaches a thing or two about engaging the audience.
Next on the reading list
Next up on my reading list on this topic is Jessica Hullman and Andrew Gelman's 2020 paper Interactive analysis needs theories of inference, which discusses Bayesian model checking. They assume people perform some model checking when examining graphs, comparing what they see against a pseudo-statistical mental model of the data, and explore some implications for visualizations, among other things. The paper describes the line-up as a special case of Bayesian model checking, in which examining graphs amounts to a hypothesis test comparing the data to a null hypothesis.