Microsoft Azure is a cloud computing service that has its own machine learning service, known as Cognitive Services. It is split into five categories: Vision, Speech, Language, Knowledge, and Search, with each category containing several tools, for a total of 26. Under Vision, there are six tools: Computer Vision, Content Moderator, Custom Vision Service, Emotion API, Face API, and Video Indexer. As the title suggests, the focus here is on the Face API tool.
The Face API is split into two basic categories:
- Face Detection – discovers a face in an image, with an ability to identify attributes such as gender, age, facial hair, glasses, and head pose.
- Face Recognition – takes faces and performs comparisons to determine how well they match. It has four subcategories:
  - Face Verification – takes two detected faces and attempts to verify that they match.
  - Finding Similar Face – takes candidate faces in a set and orders them by similarity to a detected face, from most similar to least similar.
  - Face Grouping – takes a set of unknown faces and divides them into subsets based on similarity. All faces within a given subset are considered to belong to the same person (based on a threshold value).
  - Face Identification – further explained below.
With Face Identification, you must first create a PersonGroup object. That PersonGroup object contains one or more person objects. Each person object contains one or more images that represent the respective person object. As the number of face images a person object contains increases, so does the identification accuracy.
For example, let’s say that you create a PersonGroup object called “co-workers.” In co-workers, you create person objects, say two of them: “Alice” and “Bob.” Face images are assigned to their respective person objects. You have now created a database against which to compare a detected face image. An attempt will be made to find out if the detected image is Alice or Bob (or neither), based on a numerical threshold.
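The containment relationship in this example can be sketched locally in Python. This is only an illustrative data model, not the Azure SDK: the class names `PersonGroup` and `Person` mirror the terms used above, while `add_face`, the field names, and the image file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:
    """One person object and the face images that represent it."""
    name: str
    face_images: List[str] = field(default_factory=list)  # hypothetical image paths


@dataclass
class PersonGroup:
    """A named collection of person objects, keyed by person name."""
    name: str
    persons: Dict[str, Person] = field(default_factory=dict)

    def add_face(self, person_name: str, image_path: str) -> Person:
        # Create the person on first use, then attach the image to it.
        person = self.persons.setdefault(person_name, Person(person_name))
        person.face_images.append(image_path)
        return person


group = PersonGroup("co-workers")
group.add_face("Alice", "alice_01.jpg")
group.add_face("Alice", "alice_02.jpg")
group.add_face("Bob", "bob_01.jpg")
```

Attaching a second image to Alice, as above, reflects the point that more images per person object should improve identification accuracy.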
This threshold is on a scale that is most permissive at 0 and most restrictive at 1. At 1, they must be perfect matches – by perfect, I mean that two identical images at different compression rates will not be recognized as a match. In contrast, at 0 a match will be returned for the person object with the highest confidence score regardless. In my experiments, a value somewhere between 0.3 and 0.35 tended to strike a good balance. To reiterate an earlier point, more images per person object increase identification accuracy, thus decreasing both false positives and false negatives.
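The effect of the threshold can be illustrated with a small local function. This is a sketch of the decision rule as described here, not of what the service does internally: `pick_match` and its arguments are hypothetical names, the confidence values are made up, and treating a score equal to the threshold as a match is an assumption.

```python
from typing import Dict, Optional


def pick_match(confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring person if it clears the threshold, else None.

    `confidences` maps a person name to a confidence score in [0, 1].
    """
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    # Assumption: a score exactly equal to the threshold counts as a match.
    return best if confidences[best] >= threshold else None


scores = {"Alice": 0.42, "Bob": 0.31}   # made-up confidence values
print(pick_match(scores, 0.35))          # Alice clears a 0.35 threshold
print(pick_match(scores, 0.50))          # nobody clears 0.50, so None
print(pick_match(scores, 0.0))           # at 0 the top scorer is always returned
```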
An Example Application to Simulate Video Analysis
An example implementation of Face Identification, in conjunction with Dlib and FFmpeg, follows. The purpose of this application was to identify faces in video. Since Face Identification works only on still images, FFmpeg was used to extract keyframes for Face Identification to examine individually.
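The exact FFmpeg invocation used by the application is not given here, so the following is only one plausible way to extract keyframes, wrapped in Python. The function names, the output pattern, and the choice of the `select` filter are assumptions. Recent FFmpeg builds may prefer `-fps_mode vfr` over `-vsync vfr`.

```python
import subprocess
from typing import List


def keyframe_command(video_path: str, out_pattern: str = "frame_%05d.jpg") -> List[str]:
    """Build an FFmpeg command that writes only keyframes (I-frames) as images."""
    return [
        "ffmpeg",
        "-i", video_path,
        # Keep only frames whose picture type is I (keyframes).
        "-vf", "select='eq(pict_type,I)'",
        # Do not duplicate frames to fill the gaps left by dropped ones.
        "-vsync", "vfr",
        out_pattern,
    ]


def extract_keyframes(video_path: str, out_pattern: str = "frame_%05d.jpg") -> None:
    """Run the command; requires an ffmpeg binary on PATH."""
    subprocess.run(keyframe_command(video_path, out_pattern), check=True)
```

Each image written this way can then be examined on its own, which is what turns a stored video into a series of still-image identification requests.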
Face Identification detects faces in images before identifying them, but in my experience, Dlib detected faces more accurately and a lot faster. In this case, Dlib determined whether an image contained a face; if it did, that image was sent to Azure for face identification. The disadvantage here was that detection was done twice – first in Dlib, then again in Face API. Still, it was faster to detect a face locally using Dlib than it was to call the remote Face API. It was especially advantageous to filter using Dlib when there was a long video with relatively little facial presence (e.g., security footage). If most of the video had a facial presence, disabling Dlib may have been preferable. Another factor to consider is that Azure charges fees based on the number of API calls, so filtering using Dlib first saved money.
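The local pre-filter described above can be sketched as follows. This is not the application's actual code: `split_by_face` and `make_dlib_predicate` are hypothetical helpers, and the Dlib usage (frontal face detector, one upsampling pass) is an assumed configuration. Dlib is imported lazily, so the filtering logic itself runs with any predicate.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_face(
    frames: Iterable[str], has_face: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split frame paths into (send to Azure, skip) using a local detector."""
    to_send: List[str] = []
    skipped: List[str] = []
    for path in frames:
        (to_send if has_face(path) else skipped).append(path)
    return to_send, skipped


def make_dlib_predicate() -> Callable[[str], bool]:
    """Build a predicate backed by Dlib's frontal face detector (assumed setup)."""
    import dlib  # third-party; only needed when this helper is used

    detector = dlib.get_frontal_face_detector()

    def has_face(path: str) -> bool:
        image = dlib.load_rgb_image(path)
        # The second argument upsamples the image once to find smaller faces.
        return len(detector(image, 1)) > 0

    return has_face
```

Every path that ends up in the skipped list is one remote call that is neither waited on nor billed, which is where the time and cost savings come from on footage with few faces.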
Figure 1 depicts the Face Identification user interface. The three list panes in the middle of the user interface (from left to right) show the PersonGroup objects, the Person objects, and the images attached to each Person. In the figure, group2 is selected; it contains two person objects, Person_A and Person_B. Person_A is selected, and the images associated with Person_A are listed in the right-most pane. Figure 2 shows the controls and settings for conducting a face match run.
Figure 1 – Highlighted in blue from left to right: PersonGroup, person, face image. The database image selected here is Abraham Lincoln, belonging to Person_A in group2. There can be more than one image per person. If an examined image contains a detected face that sufficiently resembles Person_A (or Person_B), the best match is reported.
Figure 2 – A closer look at this application that implements Azure Face API, FFmpeg, and Dlib
Some points regarding using the above method to analyze video through keyframe extraction:
- In tests, the baseline accuracy was comparable to that of the other methods tested (AWS Rekognition, Linux face_recognition); see Figure 3. The advantage of this system was that accuracy could be improved by adding multiple face images per person.
Figure 3 – Azure results compared to other facial recognition tools tested
- The PersonGroup profiles were persistent, using Microsoft Azure’s cloud storage
- Easy to add/remove/modify groups, people, and face images
- API calls were limited to 10 per second – this is a server-side imposed limitation, and there exists no local workaround (another advantage to using Dlib locally)
- Because of the API call limit set by Microsoft, it was slow relative to other methods
- Like most facial recognition systems, it was difficult to predict processing time. The two main influencing factors were:
  - File format – this played a much more important role than file size. In fact, a file that was half the size could take longer to process depending on the format.
  - Number of detectable faces in a file – processing time went up as facial presence increased, and that presence is unknown in advance (there would be no point in using this if you already knew the contents of a video file)
- Internet connectivity is obviously necessary – processing was server-side and there was no option for locally exclusive data storage
It should be noted that Face API was meant to be used to analyze images and live video streams, but not stored video files. This application attempted to simulate the analysis of stored video files, and thus was using the API in a manner for which it wasn’t intended. For example, ensuring that the 10 API calls per second limit wasn’t exceeded required testing, as Azure simply discarded any API calls that exceeded the limit with a generic error message – it did not add the images to a queue to be processed as soon as possible. Frames could be lost that way when examining a series of images. Cognitive Services does offer a Video Indexer that, among other things, has face tracking and identification, but that is only against a celebrity database. The user can’t define the database, so it is highly limited. The Video Indexer is in preview mode, so I suspect that at some point it will allow for a more flexible facial recognition system. Currently, it does not offer what this application was attempting to simulate.
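One way to stay under the limit, so that no frame is discarded by the server, is to throttle calls on the client. The sketch below is a generic sliding-window limiter, not the mechanism the application actually used. `RateLimiter` and its parameters are hypothetical names, and the injectable `clock` and `sleep` exist only to make it testable.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimiter:
    """Block so that at most `max_calls` happen within any `period` seconds."""

    def __init__(
        self,
        max_calls: int = 10,
        period: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps: Deque[float] = deque()  # times of recent calls

    def wait(self) -> None:
        """Return once another call is allowed, sleeping first if needed."""
        while True:
            now = self.clock()
            # Forget calls that have left the sliding window.
            while self.stamps and now - self.stamps[0] >= self.period:
                self.stamps.popleft()
            if len(self.stamps) < self.max_calls:
                self.stamps.append(now)
                return
            # Window is full: sleep until the oldest call expires.
            self.sleep(self.period - (now - self.stamps[0]))
```

Calling `limiter.wait()` immediately before each request would keep a run of extracted frames within 10 calls per second, at the cost of the slower overall processing noted earlier.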
Although Amazon Web Services has a far larger market presence than Microsoft Azure, Microsoft Azure’s Cognitive Services is very functional. The accuracy of Face API is comparable to AWS’ facial recognition alternative, albeit a bit slower. Azure’s array of tools is consistently growing in size as well. There is an argument to be made that the advantage to AWS is simply that it has a larger userbase, which alone can increase the functionality of a product through consumer demand and supplier response. If Microsoft has something to prove in the area of machine learning, though, that can be an advantage as well.
The other contenders in this area are the various Linux-based, open-source tools, which are often just as good in terms of accuracy. A huge advantage Linux has is control over the locality of processing, which allows for some creative control when it comes to memory and storage management, along with general application implementation. With the ability to introduce multi-threading, Linux is often the fastest when it comes to processing – you could multi-thread AWS or Azure, but there is no point because their servers do the heavy lifting and decide what you get and when you get it (think back to API call limits). The downside Linux has when compared to Azure and AWS is comprehensive support. AWS and Azure have a centralized customer support system, and Linux by nature does not. It can be a headache to even get to the point of installing the necessary software to begin coding for it, as packages often become out-of-date and don’t always play nice, plus online documentation can be challenging or absent. But that is the tradeoff when it comes to the freedom and control of Linux. Plus, it’s free.
At this point, there is no clear advantage to using one over the other. However, one thing is for sure – Microsoft Azure and AWS will continue to invest in this space through research and acquisitions to become the preferred provider of artificial intelligence tools and services.