Big data is the main talking point of most companies dealing with the vast amounts of data out there, from sales figures to trend analysis. Most of the time, the task assigned to the ‘Big Data’ team is to get information that already exists somewhere into a format that can be used to derive insight – data science. But what do you do when you don’t have enough data to do an analysis or some other data-intensive task? You have to create data that meets your needs. This may sound like a trivial task, but in fact it is quite involved: you must first define what you need, then determine what you can use to create the new data, and finally determine the best way to create and store it for your purposes.
I will not cover the basics of what big data is, or the tools for aggregation and analysis, as much of this is already covered in the myriad literature out there. What I will cover is a sample of what we have had to go through to meet some of our customer’s mission goals, with other implementations to follow. The problem that our customer – and many other customers – have is that they use tools which must be trained to perform a task, or to establish a statistical model to base decisions on. These so-called ‘expert’ systems need to be ‘trained’ with known artifacts: essentially, a ground-truth training corpus. The decisions these systems then make on real-world data are based entirely on what they have learned from the training data.
How do we make sure that enough variety or similarity is provided for a tool to reach a certain level of accuracy, precision, and recall? As before, the answer varies, so we must be prepared to provide any level needed until we reach the accuracy, precision, and recall set points that meet requirements. We faced two specific tasks that required creating very large amounts of data; the first is presented here for discussion, with the second to follow:
- A requirement to train an algorithm to recognize two- and three-dimensional objects from any perspective
- A requirement to train a software product to recognize a variety of text and fonts in various languages and file types
For the first task, the product/algorithm being developed and tested needed to understand what an object was, and all its specifics, so that it could identify the object uniquely. The information used for both training and subsequent testing consists of two-dimensional images of a three-dimensional object. Essentially, we want to be able to find a specific object in a two-dimensional image regardless of its orientation or occlusion (being partially hidden behind another object). So, where is the big data aspect of this – let’s see:
- A simple image (JPEG, NEF, PNG, or similar) can take up a large amount of space depending on its detail, from about a thousand bytes for a low-resolution image to multiple megabytes for a high-resolution one
- For training, at least a thousand images of an object from various camera angles are required
- For training, lighting effects may be significant, so darker and brighter versions of certain images must be included
- For training, color, black & white, and grey-scale images may all be required
- We need the ability to track metadata about each image so we know what the baseline ground-truth training corpus is
- We need to be able to search on the metadata and continually analyze the results to make sure that all possible training criteria are defined
- This has to be done for many objects
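The lighting and color-space variants in the list above can be sketched without any imaging library at all. In this illustrative example (the function names and the representation of an image as nested lists of RGB tuples are my own, not the project's), each source photograph fans out into several labeled training variants:

```python
# A minimal, library-free sketch of the variant generation described above.
# An "image" here is just a list of rows of (r, g, b) tuples; a real
# pipeline would use an imaging library, but the transforms are the same idea.

def adjust_brightness(image, factor):
    """Scale every channel by `factor`, clamping to the 0..255 range."""
    return [[tuple(min(255, int(c * factor)) for c in px) for px in row]
            for row in image]

def to_greyscale(image):
    """Luminance conversion using the ITU-R BT.601 channel weights."""
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]

def make_variants(image):
    """Yield the (label, image) pairs used to pad out a training set."""
    yield "original", image
    yield "dark", adjust_brightness(image, 0.5)     # under-lit variant
    yield "bright", adjust_brightness(image, 1.5)   # over-lit variant
    yield "greyscale", to_greyscale(image)
```

Each variant multiplies the storage footprint of the corpus, which is exactly where the big data aspect creeps in.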
At this point, let’s see how much data we have for a simple image of a U.S. stop sign before we explore a more complex object. This is essentially a two-dimensional object which can be rotated around the x- and y-axes with minimal z-axis attributes. What we are not covering here is how we handle the rotation, which we can briefly state is fully automated through a variety of open source tools.
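To give a feel for where the "thousand images from various camera angles" comes from, here is a hypothetical sketch of a viewpoint sweep over yaw (y-axis) and pitch (x-axis); the step sizes and the `projected_width` helper are illustrative, not the values or tooling from the actual project:

```python
# Illustrative enumeration of camera angles for a flat object like a
# stop sign. Step sizes are examples only.
import math

def viewpoints(yaw_step=10, pitch_step=10, max_pitch=60):
    """Yield (yaw, pitch) camera angles in degrees covering the object."""
    for yaw in range(0, 360, yaw_step):
        for pitch in range(-max_pitch, max_pitch + 1, pitch_step):
            yield yaw, pitch

def projected_width(true_width, yaw_deg):
    """Apparent width of a flat sign rotated `yaw_deg` about the y-axis."""
    return abs(true_width * math.cos(math.radians(yaw_deg)))
```

With these example steps, the sweep alone yields 468 viewpoints; combine that with the lighting and color variants above and the per-object image count climbs quickly.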
As you can see from this table, the space requirement alone for a ‘single’ two-dimensional object is already 19 MB. Though the metadata’s size is minimal and inconsequential when doing big data analysis, it is still critical: it defines an index that lets us track what we have and determine whether scenarios are missing.
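The arithmetic behind a figure like this is simple to sketch. The function below is a back-of-the-envelope estimator; the per-image size, image count, and variant count are illustrative defaults, not the figures from the table:

```python
# Back-of-the-envelope corpus size estimator. All defaults are
# illustrative assumptions, not the project's actual numbers.

def corpus_size_mb(num_images, avg_image_kb, variants_per_image=4):
    """Total storage in MB for base images plus their generated variants."""
    return num_images * variants_per_image * avg_image_kb / 1024

# e.g. 1,000 base views at ~5 KB each, with 4 variants apiece,
# lands in the same ballpark as the 19 MB quoted above.
```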
Now let’s take a more complex, three-dimensional object, such as a vehicle or a plane, and see how the storage requirements change – and there are a lot of attributes that differentiate one plane from another.
85 GB of images for a single object – and how many objects are out there? Big Data of a different type. Even with petabyte storage available, you would soon use a lot of it up. So what exactly was our solution to this Big Data problem? We implemented a hybrid solution with the capability to generate data as needed, while maintaining the requirement of a ground-truth corpus as well as the ability to search on metadata describing what all the data will contain once created. This facilitates the main tasks users need to perform:
- Search to determine whether applicable test data already exists
- Determine whether enough applicable data exists
- Request the creation of additional test data within the specification of the corpus, with minimal long-term impact on the storage footprint
- Baseline test cases against a corpus version
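The workflow above hinges on the metadata index: search it first, then generate only what is missing. Here is a minimal sketch of that idea; the record fields (`object`, `lighting`, `color`) and function names are hypothetical stand-ins for whatever schema the real system uses:

```python
# Hypothetical sketch of a metadata index supporting search and
# gap-detection before any images are generated. Field names are
# illustrative, not the project's actual schema.

def find_matching(index, **criteria):
    """Return index records whose metadata matches every given criterion."""
    return [rec for rec in index
            if all(rec.get(k) == v for k, v in criteria.items())]

def missing_scenarios(index, required):
    """Scenarios in `required` that have no existing record in the index."""
    existing = {(r["object"], r["lighting"], r["color"]) for r in index}
    return [s for s in required if s not in existing]
```

Because the index describes data that may not yet exist on disk, the expensive image generation step only runs for the scenarios `missing_scenarios` reports, keeping the long-term storage footprint small.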
Needless to say, the project is continually evolving as more types of data requests come in to quickly and accurately evaluate products, systems, and algorithms. The main efficiencies gained to date are in the preservation of storage space and in time to completion for a specific task. As for the former, though storage is cheaper now, procuring it can still be a nightmare, especially in the Federal sector.
The next blog will cover how we handled the issue of text-based files, which was more of a traditional big data task, albeit without the Map-Reduce interface.