In the first article in this 3-part series, The New Data Visualization – This is Not Your Father’s Pie Chart, we discussed the three main categories of data and the standard methods used to best portray each of them. If you haven’t read that one yet, I’d suggest you CLICK HERE and do that first, as it will give you a good foundation for understanding this article.
When beginning to craft a data visualization solution, it’s important to understand the scope of your task. Often, a very large amount of data will be the starting point. What we want to do first, then, is use this data to give us clues about where we need to focus our time. We aren’t going to query this initial dump; we’re going to put it into a tool that will show us what we have.
A good way to illustrate this tactic is via a network diagram. Also referred to as node/link diagrams, these are charts that show us how entities – the nodes – are “linked” to other entities in our data set.
Look at this screenshot from an open source tool called Gephi.
This is a social network diagram of my Facebook network. If you were to look at this data in a spreadsheet, you’d be hard pressed to discover the very obvious clustering seen here at top right in red; or the sub-clusters of pink within the green cluster at right, or showing up inside the large blue cluster; or the many nodes that float around the outside, unconnected to anything else.
The patterns here are quite obvious. I belong to one very large and very interconnected network – the blue – and a few smaller ones that are lightly connected, or not connected at all, as is the case with the red cluster. While it is a tight network, there are no connections into the larger green or blue networks. Why is that?
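If you want to see how a tool might separate a fully disconnected group like the red cluster from the rest, here is a minimal sketch in plain Python. It finds connected components (groups with no links to any other group) and the lone nodes that float around the outside. The names and friendships are invented for illustration, and this is not how Gephi assigns its colors; it only catches groups that share no links at all, not the tighter sub-clusters inside a connected network.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Group nodes into clusters where every member can reach every other."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components

# Hypothetical friends and friendships, made up for this example.
friends = ["Ana", "Ben", "Cara", "Dev", "Eli", "Fay", "Gus"]
friendships = [
    ("Ana", "Ben"), ("Ben", "Cara"), ("Ana", "Cara"),  # one tight group
    ("Dev", "Eli"), ("Eli", "Fay"),                    # a separate group
]                                                      # Gus has no links

clusters = connected_components(friends, friendships)
isolated = [c for c in clusters if len(c) == 1]
print(len(clusters), "clusters,", len(isolated), "unconnected node(s)")
```

On real data you would load the node and edge lists from your export rather than typing them in.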
Let’s touch on the blue cluster for a minute. I mentioned that it’s a very connected network and I want to make sure that’s understood. If you look at the nodes towards the outer rim, you’ll notice a lot of links running inward from them. This is my high school network, and in this type of network, most everybody knows most everybody else, which makes these clusters very dense. Sometimes within this type of cluster you’ll find one or two “main influencers” whom everyone else seems to connect with and through. If your data includes information about the number of posts, this can be used to give more significance to a specific entity as well.
With tools like Gephi, you can set parameters for node size and color. Sizes can be set so that the most influential entities are the largest. The colors can denote different social groups, as they do here. My high school network, my group of friends from when I worked at a Borders book store in NYC, and my yoga community all occupy different clusters. Some are lightly interconnected while others aren’t connected at all. The people around the outside are in my network, but not in networks with any of my friends. So, while I have a connection to all of them, the clusters show how they interconnect independent of me.
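To make the sizing idea concrete, here is a rough sketch that treats "influence" as nothing more than the number of links a node has (its degree) and scales that onto a display size. That is my simplifying assumption for the example; Gephi offers its own ranking options, and you could just as easily feed in post counts. The function name, the size range, and the sample names are all hypothetical.

```python
from collections import Counter

def node_sizes(edges, min_size=10.0, max_size=50.0):
    """Map each node's link count (degree) onto a display size."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    if not degree:
        return {}
    low, high = min(degree.values()), max(degree.values())
    span = (high - low) or 1  # avoid dividing by zero when all degrees match
    return {
        node: min_size + (count - low) / span * (max_size - min_size)
        for node, count in degree.items()
    }

# Hypothetical friendships, made up for this example.
friendships = [("Ana", "Ben"), ("Ana", "Cara"), ("Ana", "Dev"), ("Ben", "Cara")]

sizes = node_sizes(friendships)
for name, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(name, size)
```

The most connected person gets the largest size and the least connected gets the smallest, with everyone else scaled in between.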
In essence, this picture tells a story. TRIVIA TIME. “Every picture tells a story, don’t it?” Leave me a comment if you know the reference and are willing to date yourself as I just did. But so true though, so true. There are stories in this picture and in the hundreds of others that data visualization tools will allow you to provide your analysts. And it’s these stories that give them the power to make quicker, more accurate decisions. And those decisions lead to the actions that make a difference, whatever your enterprise or endeavor.
So now you’ve looked at the forest and seen patterns within it. You’ve identified anomalies or points of interest that require further investigation. From this point you can subdivide the larger data set into more focused sets where you can ask specific questions. One of those questions might be, “Who is the central figure in the red cluster?” Another might be, “Who is the lynchpin between the blue and green clusters?” You get the point – what we’re able to see gives us the power to dig deeper.
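Once the nodes carry a cluster label, those two questions can be roughed out in a few lines. This sketch assumes you already have a label per node (for instance, exported from your visualization tool) and uses simple link counts: the "central figure" is the member with the most links inside the cluster, and a "lynchpin" candidate is whoever holds the most links crossing between two clusters. Proper centrality measures exist for this; link counting is just the simplest stand-in. All names and data here are invented.

```python
from collections import Counter

def central_figure(cluster, edges, membership):
    """Return the member of `cluster` with the most links to other members."""
    inside = Counter()
    for a, b in edges:
        if membership.get(a) == cluster and membership.get(b) == cluster:
            inside[a] += 1
            inside[b] += 1
    return inside.most_common(1)[0][0] if inside else None

def bridges_between(cluster_a, cluster_b, edges, membership):
    """Count, per node, the links that cross between the two clusters."""
    crossing = Counter()
    for a, b in edges:
        if {membership.get(a), membership.get(b)} == {cluster_a, cluster_b}:
            crossing[a] += 1
            crossing[b] += 1
    return crossing

# Hypothetical cluster labels and friendships, made up for this example.
membership = {
    "Ana": "blue", "Ben": "blue", "Cara": "blue",
    "Dev": "green", "Eli": "green",
    "Fay": "red", "Gus": "red", "Hal": "red",
}
friendships = [
    ("Ana", "Ben"), ("Ana", "Cara"), ("Ben", "Cara"),
    ("Cara", "Dev"), ("Cara", "Eli"), ("Dev", "Eli"),
    ("Fay", "Gus"), ("Fay", "Hal"),
]

red_center = central_figure("red", friendships, membership)
blue_green = bridges_between("blue", "green", friendships, membership)
print("Central figure in red:", red_center)
print("Top blue/green connector:", blue_green.most_common(1))
```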
As Michael commented on the first installment of this series, there could sometimes be a 4th category of “unknown”. For these large data sets that will be the case more often than not if your mission is one of initial discovery. Even within known data sets, this first step is crucial to the process of making quick decisions that are focused enough to get the right job done with higher accuracy.
So then, where are you? What is your data task right now? Is there a case for you to pursue this step of turning the forest into trees? If you have developers or designers on hand, have them grab Gephi, Orange, and Mondrian as a start. Be sure your data is in the correct format for each to consume and visualize, and start building models that can be used for each project to come.
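As one example of getting data into shape, here is a small sketch that turns a list of relationships into a CSV edge list. It assumes the tool's spreadsheet import reads two columns headed Source and Target, which is my understanding of how Gephi handles edge tables; check the documentation for your version, and note that Orange and Mondrian will likely want a different layout. The function name and sample data are hypothetical.

```python
import csv
import io

def edge_list_csv(edges):
    """Render (source, target) pairs as CSV text with Source/Target headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed header names, see note above
    writer.writerows(edges)
    return buffer.getvalue()

# Hypothetical friendships, made up for this example.
csv_text = edge_list_csv([("Ana", "Ben"), ("Ben", "Cara")])
print(csv_text)
```

To save it for import, write `csv_text` out to a file such as `edges.csv`.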
To borrow from the closing of the first installment, you’re trying to achieve clarity, vision, and understanding by using data visualization tools, methods, and techniques on your data. Whether it’s Big Data, with a capital BD, or smaller sets of more focused data, the right tools and techniques can help you get where you need to go much faster than your father’s pie charts.
Be sure to check back for Part 3 coming soon.
Hey Jeff, another great post; thanks.
This post makes me see how these techniques could be extremely useful in law enforcement and anti-terrorism efforts by quickly surfacing associations (links) that are not obvious in the “big data”.
P.S. Rod Stewart?
Hey Scott. And those are a couple of areas where we are starting to make some progress. Seeing those initial patterns in the data is an important way-finder I think. Clusters, and especially disassociated clusters, can tell us a lot about what’s happening inside.
And yep – you got the reference. 😉 That didn’t make me sound too old did it?