Part 1 Beginning to Understand Methods
What is Data Visualization
While there are a few different schools of thought on what Data Visualization really is, ranging anywhere from Excel spreadsheets to 3-foot-long Infographics, in general, it is the practice of turning raw, abstract data into usable, visual information via charts, graphs, and pictures. We all know the saying, “A picture is worth a thousand words.” In Data Visualization we transform abstract information into visual stories and clues that align with the powers of human perception and understanding. We play to the way our brains are wired to work most efficiently and, in effect, we turn a thousand words, or numbers, into knowledge representation.
Data Visualization is a way to use our natural ability to learn more quickly through pictures and “visuals”. Some of the more basic forms of data visualization are pie charts and bar graphs. Through the use of several newly available, open-source JavaScript libraries, data visualization has gone way beyond those two forms. Other tools make use of Adobe Flex, Perl, Python, and Java, just to name a few.
Two sites that have been my major resources, and have helped me get a firmer grasp on which types of visualizations are best suited to the various types of datasets, are https://www.visualcomplexity.com and https://www-958.ibm.com/software/data/cognos/manyeyes . Visual Complexity offers hundreds of real-world examples of data visualization types being used around the globe. Some mention the technology used to create them, and that has led me to discover several available technologies. Many Eyes is a project from IBM Research and the IBM Cognos software group. They give excellent examples and explanations of data sets and the correct type of visualization to represent each set.
The world is being inundated with data and information. According to the IDC Digital Universe Study sponsored by EMC, more than 1.8 zettabytes (1,800,000,000,000,000,000,000 bytes) of digital data was created and stored in 2011. Put bluntly, we are being virtually buried.
In the introduction to Edward Tufte’s classic book, “Visual Explanations”, he states:
“Assessments of change, dynamics, and cause and effect are at the heart of thinking and explanation. To understand is to know what cause provokes what effect, by what means, at what rate.”
A bit further down the page is this:
“What principles should inform our designs for showing data? Where do those principles come from? How can the integrity of quantitative descriptions be maintained in the face of complex and animated representation of data? What are the standards for evaluating visual evidence, especially for making decisions and reaching conclusions?”
So then, if we start with these premises and principles, and if we determine what we’re striving to envision, whether the data represent nouns, verbs, or numbers, we should be able to give the client new ways to look at data, thereby giving them new ways to make decisions.
The Three Main Categories of Data and Their Standard Methods
In my research and work I’ve begun to speak about the different ways to show data as methods. The vast majority of data can be filtered into one of three main categories: temporal, relational, and quantitative. When did these events occur? How does this relate to that? How many are there? For an analyst, these are the three most pertinent questions, and for data visualization to answer them more quickly, the correct methods need to be employed.
The most familiar temporal, or time, visualization method is the Line Chart. A Line Chart gives us a visual representation of events as they occur over time, within an X-Y plot. The Y axis is usually reserved for numbers, and the X axis for dates. For any given period of time, this type of visualization can show us, for instance, how many phone calls were made by Person A between Jan 1 and May 1 of this year. From left to right, we can see the trend as the numbers rise and fall over the months. This is temporal data visualization in its simplest form.
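As a rough sketch of the data preparation behind such a Line Chart, the Python below counts Person A's calls per month. The dates, the year, and the name calls_per_month are made up for illustration; they are not taken from any real data set.

```python
from collections import Counter
from datetime import date

# Hypothetical call log for Person A: one date per call (made-up data).
calls = [
    date(2012, 1, 5), date(2012, 1, 19), date(2012, 2, 2),
    date(2012, 2, 14), date(2012, 2, 27), date(2012, 3, 8),
    date(2012, 4, 3), date(2012, 4, 21), date(2012, 4, 30),
]

def calls_per_month(call_dates):
    """Count calls per (year, month) and return them in chronological order."""
    counts = Counter((d.year, d.month) for d in call_dates)
    return sorted(counts.items())

series = calls_per_month(calls)

# Crude text rendering: time runs down the page instead of left to right.
for (year, month), n in series:
    print(f"{year}-{month:02d} {'*' * n} ({n})")
```

The sorted (month, count) pairs are what a charting tool would plot, with months on the X axis and counts on the Y axis. Note that a month with no calls simply does not appear here; for a real chart you would want to fill those gaps with zeros.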
Quantitative data and analysis can semantically be broken down into two subcategories: comparison and percentages. I say “semantically” because in a method that gives you parts of a whole, such as our ubiquitous pie chart, a comparison can also be gleaned. If wedge “A” takes up 50% of the chart, we know by comparison that it’s larger than each of the other two wedges. And upon further consideration, while writing the paper that this article is excerpted from, I’ll make the statement that all chart method types can and do show quantitative data. Whether it’s a line chart showing measures across time, or a Radial Convergence Diagram showing links between nodes, they all in some fashion, though not always most accurately, show how many or how much.
When you need to compare sets of values, the Bar Chart is the de facto method of choice. The height of each bar represents a value, and across a range, different value sets are easily compared. For percentage data (or parts of a whole), the Pie Chart is the most widely used, though terribly inefficient, method. Within a pie, there are slices, each representing a part of the whole. Say that this week, Person A made 20 calls, Person B made 43 calls, and Person C made 15 calls. The entire pie represents 78 calls.
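The parts-of-a-whole arithmetic for the three callers can be worked through in a few lines of Python. This is only a sketch; the function name shares is my own, and the conversion to degrees assumes a full pie of 360 degrees.

```python
# Call counts from the example above.
calls = {"Person A": 20, "Person B": 43, "Person C": 15}

def shares(counts):
    """Return each entry's percentage of the total."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}

total = sum(calls.values())  # the whole pie
pct = shares(calls)

for name, p in pct.items():
    # A slice's angle is its share of the 360 degrees in a full circle.
    print(f"{name}: {calls[name]} of {total} calls, {p:.1f}%, {p * 3.6:.1f} degrees")
```

Person B's slice works out to a little over half the pie, which is the same comparison a Bar Chart would show directly through bar height.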
Relational data needs analysis to determine how pieces of the puzzle fit together, and while Scatter Plots are used widely for this, a Network Diagram can provide a picture that leads to a much more compelling story. In these diagrams, we can easily see that Person A called Person B twelve times last week, or that a sensitive document was found on the hard-drives of six different computers in three different cities. The puzzle pieces start fitting together.
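Behind a Network Diagram sits a list of weighted edges. As a minimal sketch, assuming call records arrive as (caller, callee) pairs, the Python below collapses repeated calls into edge weights. Apart from the twelve calls from Person A to Person B, the records are made up, and the function names are my own.

```python
from collections import Counter

# Hypothetical call records as (caller, callee) pairs. Only the twelve
# A-to-B calls come from the example above; the rest are made up.
call_log = [("Person A", "Person B")] * 12 + [
    ("Person B", "Person C"),
    ("Person C", "Person A"),
    ("Person B", "Person C"),
]

def weighted_edges(pairs):
    """Collapse repeated (caller, callee) pairs into weighted directed edges."""
    return Counter(pairs)

def weighted_degree(edges):
    """Total number of calls each person made or received."""
    totals = Counter()
    for (caller, callee), weight in edges.items():
        totals[caller] += weight
        totals[callee] += weight
    return totals

edges = weighted_edges(call_log)
activity = weighted_degree(edges)

for (caller, callee), weight in edges.most_common():
    print(f"{caller} -> {callee}: {weight} call(s)")
```

In a drawn diagram, the edge weight would typically set line thickness and the weighted degree would set node size, so the busiest relationships stand out first.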
There is a Method for the Big Data Madness
As data sets continue to grow, and we dump more and more data on analysts to filter through in the hope of gleaning understanding and crafting stories that lead to knowledge and then to action, we need to make use of better methods. Although we won’t get to all of the methods that I’ve researched in this article (the rest will be covered in Part 2 of this series), we can make a start.
My research has uncovered two extremely powerful and useful methods that we’ll begin with – Parallel Coordinates Charts and Radial Convergence Diagrams. Both have been around for a while, with RCDs being a bit newer to the scene, and in my opinion these two can handle what I feel is one of our biggest data challenges – figuring out relationships between bits and pieces of seemingly disparate data. As a quick example, in any type of investigation we need to find and then put together puzzle pieces to figure out what happened. We may need to know how two people are related and what their phone conversations between May 1 and June 3 of this year mean to the investigation. We may need to track down an email trail between members of an organization that we’re monitoring to find the primary source of all the information being passed around.
Our data may contain a list of countries, cities within the countries, hard-drives, documents, and email addresses. If this is a terrorism investigation we need to quickly find patterns that lead us to actions.
Parallel Coordinates Chart
Here is an example of a Parallel Coordinates Chart for the above mentioned scenario. This is all fictitious data that I put together solely for the purpose of demonstrating the capabilities of this method.
A quick glance at this chart gives many different types of information. The colors in this configuration indicate dates of input. The email address she@them.com is found on the most hard-drives. Saudi Arabia is supplying the most information. Doc9 is only found on a hard-drive in Jordan while Doc2 and Doc3 are found in both countries on multiple hard-drives. So then, what’s going on with Doc9? Are Doc9 and Doc 9 the same document? What are we missing? Those discoveries all came in a matter of seconds.
This chart was built with Orange (https://orange.biolab.si). It is partially interactive, and the few other tools I’ve found for creating PCCs have some degree of interactivity as well.
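To give a feel for what a Parallel Coordinates Chart does with mixed categorical data, here is a minimal Python sketch. It is not how Orange implements its chart. The records, the hard-drive labels, the address he@them.com, and the function names are all made up for illustration. Each column becomes a vertical axis, each distinct value gets a height on that axis, and each record becomes one line crossing all the axes.

```python
# Hypothetical records in the spirit of the fictitious scenario above.
AXES = ["country", "hard_drive", "document", "email"]
records = [
    {"country": "Jordan", "hard_drive": "HD1", "document": "Doc9", "email": "she@them.com"},
    {"country": "Jordan", "hard_drive": "HD2", "document": "Doc2", "email": "he@them.com"},
    {"country": "Saudi Arabia", "hard_drive": "HD3", "document": "Doc2", "email": "she@them.com"},
    {"country": "Saudi Arabia", "hard_drive": "HD4", "document": "Doc3", "email": "she@them.com"},
]

def axis_scales(rows, axes):
    """Map each distinct value on an axis to an evenly spaced height in [0, 1]."""
    scales = {}
    for axis in axes:
        values = sorted({row[axis] for row in rows})
        step = 1 / (len(values) - 1) if len(values) > 1 else 0
        scales[axis] = {value: i * step for i, value in enumerate(values)}
    return scales

def polylines(rows, axes):
    """Turn each record into a list of (axis_index, height) points."""
    scales = axis_scales(rows, axes)
    return [
        [(x, scales[axis][row[axis]]) for x, axis in enumerate(axes)]
        for row in rows
    ]

lines = polylines(records, AXES)
for row, line in zip(records, lines):
    print(row["document"], [(x, round(h, 2)) for x, h in line])
```

Drawing each list of points as a connected line, with the axes evenly spaced from left to right, gives the basic chart. Records that share a value pass through the same point on an axis, which is why clusters and outliers like Doc9 become visible so quickly.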
Radial Convergence Diagrams/RCDs
RCDs show relationships among nodes along the circumference of a circle. Relationships will be shown in various ways as edges between nodes. Circos – https://circos.ca/ – was my first introduction to RCDs and it offers us a good look at how data relationships are quickly identified. It is being used by many research institutes and has been featured in several scientific journals over the past couple of years.
The following screenshot is of a very detailed and sophisticated RCD. Several layers of granularity are present through the double circumference arcs and the colors further provide good visual input, both in the nodes themselves and also through the use of color in the edges/connectors.
A fairly large set of data is represented in a relatively small amount of space. If this were to be straightened out linearly, the power and functionality would be lost. Within a small amount of visual space, many different stories are unfolding before our eyes. While we see that most information flows from the blue section highlighted, we also see very clearly that four sources come into it from the outside. Another quick understanding is that even though the blue section seems to cover the most territory, the bar charts in the outermost ring show us that, relative to the other segments, the blue has the least activity, in that only one bar of green stands out with a higher value. Even though we don’t know what this data represents, the quick inspection allows us to see relationships, and that’s what we’re after.
Most RCDs that I’ve come across so far are interactive. In this example, a hover over the blue fades out all the other segments so that you can concentrate on what relationships are shared by it.
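The basic geometry of an RCD, nodes on the circumference and edges as chords between them, can be sketched in Python as follows. This is a generic illustration and not how Circos works internally; the node names, the links, and the function names are made up.

```python
import math

# Hypothetical nodes and relationships, made up for illustration.
nodes = ["A", "B", "C", "D"]
links = [("A", "C"), ("B", "D"), ("A", "B")]

def circle_layout(names, radius=1.0):
    """Place the nodes at equal angles around a circle centered on the origin."""
    n = len(names)
    layout = {}
    for i, name in enumerate(names):
        angle = 2 * math.pi * i / n
        layout[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return layout

def chords(layout, pairs):
    """Turn each relationship into a straight segment between two node positions."""
    return [(layout[a], layout[b]) for a, b in pairs]

layout = circle_layout(nodes)
segments = chords(layout, links)

for (a, b), ((x1, y1), (x2, y2)) in zip(links, segments):
    print(f"{a} -> {b}: ({x1:.2f}, {y1:.2f}) to ({x2:.2f}, {y2:.2f})")
```

Real RCD tools usually draw the edges as curves bending toward the center rather than straight chords, and they size the arc for each node according to the data, but the circular placement is the starting point.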
What Are We Really Trying to Achieve?
Clarity. Vision. Understanding. These are just a few of the needs we have from our data. And these are in there, if we can figure out the best way to coax them out. Clarity, vision and understanding all lie buried under the massive amounts of data that we need to process every day. Are you in sales? You need to track figures to find what’s working best and what needs a boost. Are you in law enforcement? You need to find out what the clues are telling you so that you can get the bad guys locked up. No matter what type of data you have, there is a method that can show you what it’s trying to tell you. It’s all in there; you just have to find the most efficient way to get it out so that it can tell you stories.
Stay tuned for Part 2 of this series.
Jeff, great post. I find Data Viz to be very interesting and am wondering if we can incorporate more of it into our day-to-day operation of Armedia as well as for our customers. Thanks.
Hey Scott. Tks for the comment. I too would like to see how we can make use of it internally. Several Armedians have expressed interest in learning more and I’m very happy to sit down with anyone to discuss and plan a way forward.
We’ve come to a time when data is simply overwhelming and presenting in the correct data viz format and method can be a major step forward.
Let me know if you’d like to schedule some time.
Jeff
Jeff – great post! Thanks for sharing your thoughts. I found the “temporal”, “relational”, “quantitative” breakdown very helpful.
Does your post suggest a 4th category? Seems like there are different visualization techniques that we use when we don’t know whether data is best illustrated using one of these three categories. This suggests a 4th “unknown” category.
As I said, I imply this from your overall post. Am I making this up?
Thanks again for sharing your insights.
-mike.
Thanks Mike. Great question. In the quantitative data area, we do have the separation of comparison and percentages, but I think you’re looking for something else.
In most cases, the data we receive will be of different types, requiring different methods, and when I find that to be the case, I pump it all into a Parallel Coordinates Chart.
Orange has a very good one, and the Protovis JavaScript library has a good one too. Once in that kind of chart, it’s pretty easy to figure out what your data really looks like.
While I don’t think I’d call unknown a category, it’s very insightful to allow for the fact that you’ll get that kind of stuff quite a bit, and your job is to figure it all out.
Thanks again,
Jeff