Last year, the Celonis data visualization team launched a new histogram component designed to help Studio users better understand the distributions of their data. In this article, we'll share the process for making this component an easy and rewarding solution for our customers. Read on to learn how we created human-readable buckets and axes and mitigated a one-sided distribution with sophisticated outlier handling.
The histogram is a type of graph that helps users understand the distribution of their data. It looks very similar to a bar chart, but instead of comparing different categories, a histogram shows the frequency distribution of the data. The data is divided into a series of bars: each bar spans a range of numerical values called a bucket (or bin), and the height of the bar represents the number of data points that fall into that bin. Below you can see the comparison between a histogram and a bar chart.
Visual comparison of a histogram vs. a bar chart
Histograms are an excellent way to get an overview of your data: you can easily identify distribution peaks, abnormal trends, and outlier data points. To use a histogram, you only need a numeric data attribute. For example, if you have a data set that contains people's ages, you can create a histogram from it:
A column of numerical data in a table can be grouped into bins with start and end values, and the histogram can be used to visualize those bins
One of the main objectives while developing the histogram was to make it as easy to read as possible. Bucket boundaries and tick labels play a key role in communicating where the boundaries of each bar are.
Bins with easy-to-understand, easy-to-browse intervals such as 1, 2, 5, and 10 help the user keep track of the values on the axis. We took a lot of inspiration from D3's binning algorithms when defining the custom bucketing algorithm for our histogram component. By default, the histogram displays a reasonable number of intervals for the provided data, but users can configure the component to change the number of intervals.
Hard-to-read boundaries vs. easy-to-read bucket boundaries
Small prototype for the bin boundary calculations. We tested different scenarios to ensure that for all cases the ticks in the histogram axis were easy to read. We played around with the size of the buckets, the extent of the data (absolute minimum and absolute maximum values), and the start and end of the outliers (regular minimum and maximum values).
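To make the idea of readable boundaries concrete, here is a minimal sketch of picking a step from the 1, 2, 5, 10 family. It is not the component's actual algorithm: the function names `nice_step` and `nice_boundaries` and the `target_bins` parameter are hypothetical, and the rounding thresholds are modeled on the ones D3's tick generator uses.

```python
import math

def nice_step(lo: float, hi: float, target_bins: int) -> float:
    """Pick a step of the form {1, 2, 5, 10} x 10^k near (hi - lo) / target_bins."""
    if hi <= lo or target_bins < 1:
        raise ValueError("need hi > lo and at least one bin")
    raw = (hi - lo) / target_bins
    power = 10 ** math.floor(math.log10(raw))
    fraction = raw / power  # roughly in [1, 10)
    # Thresholds are geometric means between neighbouring candidates
    # (an assumption borrowed from D3-style tick generation).
    if fraction >= math.sqrt(50):
        nice = 10
    elif fraction >= math.sqrt(10):
        nice = 5
    elif fraction >= math.sqrt(2):
        nice = 2
    else:
        nice = 1
    return nice * power

def nice_boundaries(lo: float, hi: float, target_bins: int = 10) -> list:
    """Return bucket boundaries on multiples of the nice step that cover [lo, hi]."""
    step = nice_step(lo, hi, target_bins)
    start = math.floor(lo / step) * step
    count = math.ceil((hi - start) / step)
    return [start + i * step for i in range(count + 1)]

print(nice_boundaries(18, 93, 10))
```

For ages between 18 and 93 and a target of ten buckets, this sketch yields boundaries at 10, 20, and so on up to 100, rather than awkward values such as 18, 25.5, 33.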
A second objective was to ensure that the tick labels were always legible and did not overlap each other. To do this, we omitted tick labels when axis space was limited, but we always showed the tick for the first outlier bucket so that users could see the normal range of their data.
New improved ticks vs. overlapping and redundant tick labels
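One way to thin out labels while protecting the important ones is a greedy pass over the ticks. The sketch below is an illustration under assumptions, not the component's code: `visible_labels`, its parameters, and the pixel-based inputs are all hypothetical, and it assumes the required labels do not overlap each other.

```python
def visible_labels(positions, widths, required=(), padding=4.0):
    """Return the indices of tick labels to draw so that no two overlap.

    positions: x centre of each tick in pixels, in ascending order
    widths:    rendered width of each label in pixels
    required:  indices that must always be shown, for example the tick
               between the regular range and the first outlier bucket
    padding:   minimum gap between two neighbouring labels in pixels
    """
    def span(i):
        half = widths[i] / 2 + padding / 2
        return positions[i] - half, positions[i] + half

    def overlaps(i, j):
        a0, a1 = span(i)
        b0, b1 = span(j)
        return a0 < b1 and b0 < a1

    # Required labels are placed first; the rest are added left to right
    # only if they do not collide with anything already kept.
    kept = sorted(set(required))
    for i in range(len(positions)):
        if i in kept:
            continue
        if not any(overlaps(i, k) for k in kept):
            kept.append(i)
    return sorted(kept)

# Five ticks 20 px apart with 30 px wide labels; tick 3 must stay visible.
print(visible_labels([0, 20, 40, 60, 80], [30] * 5, required=[3]))
```

With these inputs only labels 0 and 3 survive: the required tick is kept, and every label that would collide with it or with an earlier label is dropped.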
An outlier is a data point that differs significantly from the other points in the data set. If outliers lie far from most of your data, your histogram may have a very long tail, with the majority of the data you want to see clustered in only a few bars.
To avoid this, we grouped all outlier values into catch-all bins using a calculation similar to the one used for boxplots. This way, we see the original distribution where the density is highest and, at the same time, get an idea of the number of outliers. Finally, we worked with the Celonis PQL team (PQL, the Process Query Language, is a domain-specific language tailored to process data) to put this logic into a bucketing operator that returns the bin boundaries for any given numerical variable.
Histogram without outlier calculation vs. histogram with smart outliers
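The exact calculation lives in the PQL bucketing operator and is not reproduced here. As a rough illustration of a boxplot-style rule, the sketch below uses the common Tukey fences (1.5 times the interquartile range beyond the quartiles). The multiplier `k`, the quantile method, and the function names are assumptions.

```python
import statistics

def outlier_fences(values, k=1.5):
    """Return (lower, upper) fences; values outside them count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Clamp the fences to the data so the regular range never exceeds it.
    lower = max(min(values), q1 - k * iqr)
    upper = min(max(values), q3 + k * iqr)
    return lower, upper

def split_outliers(values, k=1.5):
    """Split values into low outliers, regular values, and high outliers."""
    lower, upper = outlier_fences(values, k)
    low = [v for v in values if v < lower]
    regular = [v for v in values if lower <= v <= upper]
    high = [v for v in values if v > upper]
    return low, regular, high

ages = [21, 22, 23, 24, 25, 26, 27, 28, 29, 250]
low, regular, high = split_outliers(ages)
print(low, regular, high)
```

In this example the value 250 ends up in a single catch-all bin on the right, while the nine regular values keep their own evenly sized buckets.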
To visually represent the outliers, we used a piecewise scale and a distinct color to make users aware that these buckets might have a different size than the regular buckets of their histogram. We also made sure that the tick between the regular data and the outliers was always labeled.
What does a piecewise scale look like?
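A piecewise scale can be thought of as several linear scales joined end to end. The following sketch is one possible construction under assumptions: each outlier bucket receives a fixed pixel width (`outlier_px`, a hypothetical parameter), and the regular range is stretched linearly over the remaining space.

```python
def piecewise_scale(data_min, lower, upper, data_max, width, outlier_px=40.0):
    """Build a function that maps a data value to an x position in [0, width].

    [lower, upper] is the regular range; anything outside it is compressed
    into a fixed-width outlier segment at the matching end of the axis.
    """
    if not lower < upper:
        raise ValueError("the regular range must not be empty")
    left = outlier_px if data_min < lower else 0.0
    right = width - (outlier_px if data_max > upper else 0.0)

    def lerp(v, d0, d1, r0, r1):
        # Plain linear interpolation from the domain [d0, d1] to the range [r0, r1].
        return r0 + (v - d0) / (d1 - d0) * (r1 - r0)

    def scale(v):
        if v < lower:
            return lerp(v, data_min, lower, 0.0, left)
        if v > upper:
            return lerp(v, upper, data_max, right, width)
        return lerp(v, lower, upper, left, right)

    return scale

scale = piecewise_scale(data_min=0, lower=20, upper=30, data_max=1000, width=500)
print(scale(20), scale(25), scale(30), scale(1000))
```

Here the regular range from 20 to 30 occupies 420 of the 500 pixels, while the much larger range from 30 to 1000 is squeezed into the last 40 pixels.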
We wanted users to be able to further customize the sections of their data. To this end, we implemented two convenient features: custom coloring and annotations, which, when used together, can turn the histogram into a very powerful visual tool.
Users can add annotations and customize the line style, color, and label of these position lines. They can also decide whether a position line should fall on a bucket boundary. This is an important feature when the annotation represents a threshold or a change in the data, for example whether something is early or delayed. Finally, users can also define a specific color for a section of their distribution to emphasize certain areas or categories in their histograms.
From left to right: histogram with annotations without forced boundaries, histogram with annotations as forced boundaries, and histogram with annotations as forced boundaries and color-defined sections
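How an annotation becomes a boundary is not spelled out above. One simple interpretation, shown purely as an assumption, is to split the bucket that contains the annotation so that the annotated value becomes an extra boundary. The helper `force_boundaries` is hypothetical.

```python
def force_boundaries(boundaries, annotations):
    """Add each annotation inside the axis range as an extra bucket boundary."""
    lo, hi = boundaries[0], boundaries[-1]
    # Annotations outside the range, or already on a boundary, change nothing.
    inside = [a for a in annotations if lo < a < hi]
    return sorted(set(boundaries) | set(inside))

# A threshold at 14 splits the bucket from 10 to 20 into two buckets.
print(force_boundaries([0, 10, 20, 30], [14, 20, 45]))
```

With a boundary at the threshold, no bucket mixes values from both sides of it, which is what allows a section to be colored cleanly as early or delayed.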
Knowing which data points are included in a bucket is critical information that is not always clear in a histogram. We put a lot of thought into how to communicate this through the histogram tooltips. For example, for a bucket that goes from 2 to 4, how can the tooltip communicate that it counts the values 2 and 3 but does not include 4?
In the end, we decided to include logical operator notation to show which values are in each bucket. This subtle but important change removes any uncertainty about which bucket each data point falls into.
Tooltips with a clear indication of the data each bucket contains vs. a generic tooltip
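The half-open bucket convention and its tooltip text can be sketched as follows. The exact wording of the real tooltips is not given here, so the label format, the variable name `x`, and both function names are assumptions.

```python
from bisect import bisect_right

def bucket_index(boundaries, value):
    """Index of the bucket [boundaries[i], boundaries[i + 1]) that holds value."""
    return bisect_right(boundaries, value) - 1

def bucket_label(start, end, count, closed_right=False):
    """Describe a bucket with comparison operators instead of a bare range."""
    upper_op = "≤" if closed_right else "<"
    return f"{start} ≤ x {upper_op} {end}: {count}"

boundaries = [0, 2, 4, 6]
values = [2, 3, 3, 4, 5]
counts = [0] * (len(boundaries) - 1)
for v in values:
    counts[bucket_index(boundaries, v)] += 1

print(bucket_label(boundaries[1], boundaries[2], counts[1]))
```

The value 4 is counted in the bucket that starts at 4, not in the one that ends there, and the label for the bucket from 2 to 4 states this explicitly.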