AI (artificial intelligence) has been developed and debated ever since the first computers were invented. While the most revolutionary incarnations are not yet here, AI-based technologies are widely used today for carrying out clearly defined tasks in applications such as voice recognition, search engines, and virtual assistants. AI is also increasingly employed in healthcare, where it provides valuable resources in, for example, X-ray diagnostics and retina scan analysis.
AI-based video analytics is one of the most discussed topics in the video surveillance industry and expectations are high. There are applications on the market that use AI algorithms to successfully speed up data analysis and automate repetitive tasks. But in a wider surveillance context, AI today and in the near future should be viewed as just one element, among several others, in the process of building accurate solutions.
This white paper provides a technological background on machine learning and deep learning algorithms and how they can be developed and applied for video analytics. This includes a brief account of AI acceleration hardware and the pros and cons of running AI-based analytics on the edge compared to on a server. The paper also takes a look at how the preconditions for AI-based video analytics performance can be optimised, taking a wide scope of factors into account.
AI, machine learning and deep learning
Artificial intelligence (AI) is a wide concept associated with machines that can solve complex tasks while demonstrating seemingly intelligent traits. Deep learning and machine learning are subsets of AI.
Machine learning is a subset within AI that uses statistical learning algorithms to build systems that have the ability to automatically learn and improve during training without being explicitly programmed.
In this section, we distinguish between traditional programming and machine learning in the context of computer vision – the discipline of making computers understand what is happening in a scene by analysing images or videos.
Traditionally programmed computer vision is based on methods that calculate an image’s features, for example, computer programs looking for pronounced edges and corner points. These features need to be manually defined by an algorithm developer who knows what is important in the image data. The developer then combines these features for the algorithm to conclude what is found in the scene.
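As a minimal sketch of such a hand-defined feature, the Python snippet below marks a vertical edge wherever the brightness jump between neighbouring pixels exceeds a threshold chosen by the developer. The tiny image, the threshold value and the function names are illustrative assumptions and are not taken from any particular product.

```python
# Hypothetical 3x6 greyscale image (0 = dark, 255 = bright) containing one vertical edge.
IMAGE = [
    [10, 10, 10, 200, 200, 200],
    [12, 11, 10, 198, 201, 200],
    [10, 12, 11, 199, 200, 202],
]


def horizontal_gradient(image):
    """Absolute brightness difference between horizontally adjacent pixels."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in image]


def edge_columns(image, threshold=100):
    """Columns where every row has a gradient above the hand-picked threshold."""
    gradient = horizontal_gradient(image)
    return [x for x in range(len(gradient[0])) if all(row[x] > threshold for row in gradient)]


print(edge_columns(IMAGE))  # [2]: the edge lies between columns 2 and 3
```

Both the feature (a brightness jump) and the rule for interpreting it (the threshold) are decided by the developer here; nothing is learned from data.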
Machine learning algorithms automatically build a mathematical model using substantial amounts of sample data – training data – to gain the ability to make decisions by calculating results without specifically being programmed to do so. The features are still hand-crafted but how to combine these features is learned by the algorithm itself through exposure to large amounts of labelled, or annotated, training data. In this paper, we refer to this technique of using hand-crafted features in learned combinations as classical machine learning.
In other words, for a machine learning application we need to train the computer to get the program we want. Data is collected and then annotated by humans, sometimes assisted with pre-annotation by server computers. The result is fed into the system and this process goes on until the application has learned enough to detect what we wanted, for example, a specific type of vehicle. The trained model becomes the program. Note that when the program is finished the system does not learn anything new.
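The idea that the trained model becomes the program can be sketched with a perceptron, one of the simplest classical machine learning algorithms. In the Python example below the two features are hand-crafted, while the weights that combine them are learned from annotated samples. The feature choice, the sample values and the function names are assumptions made for illustration only.

```python
# Hypothetical hand-crafted features per object: (width/height ratio, share of image area).
# Labels are the human annotations: 1 = vehicle, 0 = person.
TRAINING_DATA = [
    ((2.0, 0.30), 1), ((1.8, 0.25), 1), ((2.5, 0.40), 1), ((1.6, 0.20), 1),
    ((0.4, 0.05), 0), ((0.5, 0.08), 0), ((0.3, 0.04), 0), ((0.45, 0.06), 0),
]


def predict(model, features):
    """Inference: combine the features using the learned weights."""
    weights, bias = model
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(data, learning_rate=0.1, max_epochs=100):
    """Perceptron training: adjust weights until all samples are correct or max_epochs is reached."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in data:
            error = label - predict((weights, bias), features)
            if error:
                mistakes += 1
                weights = [w + learning_rate * error * f for w, f in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:
            break
    return weights, bias


model = train(TRAINING_DATA)         # the trained model becomes the program
print(predict(model, (2.2, 0.35)))   # wide object, unseen during training
print(predict(model, (0.35, 0.05)))  # tall, narrow object, unseen during training
```

Note that predict only reads the model. Once training has finished, running the program does not change the weights, which mirrors the point that the deployed system does not learn anything new.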
The advantage of AI over traditional programming, when building a computer vision program, is the ability to process extensive data. A computer can go through thousands of images without losing focus, whereas a human programmer will become tired and unfocused after a while. That way, the AI can make the application substantially more accurate. However, the more complicated the application, the harder it is for the machine to produce the desired result.
Deep learning is a refined version of machine learning in which both the feature extraction and how to combine these features, in deep structures of rules to produce an output, are learned in a data-driven manner. The algorithm can automatically define what features to look for in the training data. It can also learn very deep structures of chained combinations of features.
The core of the algorithms used in deep learning is inspired by how neurons work and how the brain uses these to form higher-level knowledge by combining the neuron outputs in a deep hierarchy, or a network, of chained rules. The brain is a system in which the combinations themselves are also formed by neurons, erasing the distinction between feature extraction and the combination of features, making them the same in some sense. These structures were simulated by researchers into something called artificial neural networks, which is the most widely used type of algorithm in deep learning.
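The idea of chaining simple units into layers can be sketched with a very small artificial neural network. In the Python example below the weights are set by hand purely for illustration, whereas in deep learning they would be learned from training data. The network computes the exclusive-or of two inputs, a combination that a single weighted sum followed by a threshold cannot express. All names and values are assumptions.

```python
def relu(value):
    """Activation function: pass positive values through, clip negatives to zero."""
    return max(0.0, value)


def layer(inputs, weights, biases, activation):
    """One layer: each unit is a weighted sum of all inputs, followed by an activation."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def tiny_network(x1, x2):
    # Hidden layer: two units that each combine both inputs (hand-set weights).
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[0, -1], activation=relu)
    # Output layer: combines the hidden units into the final result.
    (output,) = layer(hidden, weights=[[1, -2]], biases=[0], activation=lambda v: v)
    return output


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, tiny_network(a, b))
```

The output is 1 only when exactly one input is 1. Real deep learning models follow the same layered principle but with many more layers and units.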
Using deep learning algorithms, it is possible to build intricate visual detectors and automatically train them to detect very complex objects, resilient to scale, rotation, and other variations.
The reason behind this flexibility is that deep learning systems can learn from a much larger amount of data, and much more varied data, than classical machine learning systems. In most cases, they will significantly outperform hand-crafted computer vision algorithms. This makes deep learning especially suited to complex problems where the combination of features cannot easily be formed by human experts, such as image classification, language processing and object detection.
Classical machine learning vs. deep learning
While they are similar types of algorithms, a deep learning algorithm typically uses a much larger set of learned feature combinations than a classical machine learning algorithm does. This means that deep learning-based analytics can be more flexible and can – if trained to – learn to perform much more complex tasks.
For specific surveillance analytics, however, a dedicated, optimised classical machine learning algorithm can be sufficient. Within a well-specified scope, it can provide results similar to those of a deep learning algorithm while requiring fewer mathematical operations, and it can therefore be more cost-efficient and consume less power. It also requires much less training data, which greatly reduces the development effort.
The stages of machine learning
The development of a machine learning algorithm follows a series of steps and iterations before a finalised analytics application can be deployed. At the heart of an analytics application are one or more algorithms, for example an object detector. In the case of deep learning applications, the core of the algorithm is the deep learning model.
Data collection and data annotation
To develop an AI-based analytics application you need to collect large amounts of data. In video surveillance, this typically consists of images and video clips of humans and vehicles or other objects of interest. To make the data recognisable for a machine or computer, a data annotation process is necessary, in which the relevant objects are categorised and labelled. Data annotation is mainly a manual and labour-intensive task. The prepared data needs to cover a large enough variety of samples that are relevant for the context where the analytics application will be used.
Training
Training, or learning, is when the model is fed annotated data and a training framework is used to iteratively modify and improve the model until the desired quality is reached. In other words, the model is optimised to solve the defined task. Training can be done according to one of three main methods.
Supervised learning is the most used method in machine learning today. It can be described as learning by examples. The training data is clearly annotated, meaning that the input data is already paired with the desired output result.
Supervised learning generally requires a very large amount of annotated data and the performance of the trained algorithm is directly dependent on the quality of the training data. The most important quality aspect is to use a dataset that represents all potential input data from a real deployment situation. For object detectors, the developer must make sure to train the algorithm with a wide variety of images, with different object instances, orientations, scales, light situations, backgrounds and distractions. Only if the training data is representative of the planned use case will the final analytics application be able to make accurate predictions when processing new data that was unseen during the training phase.
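One simple way to look for gaps in a training dataset is to count the annotated samples for every combination of object type and capture condition. The Python sketch below does this for a hypothetical metadata format. The field names, the minimum sample count and the data are assumptions, and a real dataset would need far more samples and more dimensions (scale, orientation, background and so on).

```python
from collections import Counter
from itertools import product

# Hypothetical annotation metadata: one entry per labelled training image.
ANNOTATIONS = [
    {"label": "human", "light": "day"},
    {"label": "human", "light": "day"},
    {"label": "human", "light": "night"},
    {"label": "human", "light": "night"},
    {"label": "vehicle", "light": "day"},
    {"label": "vehicle", "light": "day"},
    {"label": "vehicle", "light": "night"},
]


def coverage_gaps(annotations, labels, light_conditions, min_samples=2):
    """Return (label, light) combinations with fewer than min_samples examples."""
    counts = Counter((a["label"], a["light"]) for a in annotations)
    return [
        combo for combo in product(labels, light_conditions)
        if counts[combo] < min_samples
    ]


print(coverage_gaps(ANNOTATIONS, ["human", "vehicle"], ["day", "night"]))
# [('vehicle', 'night')]: more night-time vehicle images are needed
```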
Unsupervised learning uses algorithms to analyse and group unlabelled datasets. This is not a common training method in the surveillance industry, because the model requires a lot of calibration and testing while the quality can still be unpredictable.
The datasets must be relevant for the analytics application but do not have to be clearly labelled or marked. The manual annotation work is eliminated, but the number of images or videos needed for training increases greatly, by several orders of magnitude. During the training phase, the model, supported by the training framework, identifies common features in the datasets. During the deployment phase, this enables it to group data according to patterns and to detect anomalies that do not fit into any of the learned groups.
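Grouping unlabelled data and flagging anomalies can be sketched with a one-dimensional version of k-means clustering, a common unsupervised algorithm. In the Python example below, the observed values, the starting centroids and the distance threshold are all assumptions made for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group values around centroids; no labels are involved."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            groups[nearest].append(value)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def is_anomaly(value, centroids, max_distance=1.0):
    """A value is anomalous if it is far from every learned group."""
    return min(abs(value - c) for c in centroids) > max_distance


# Hypothetical unlabelled observations: object speeds in metres per second.
SPEEDS = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.1]
centroids = k_means_1d(SPEEDS, centroids=[min(SPEEDS), max(SPEEDS)])
print(is_anomaly(1.3, centroids))  # False: fits the slow group
print(is_anomaly(9.0, centroids))  # True: fits neither group
```

The algorithm is never told what the two groups mean. It only finds that the data falls into two clusters, which is why the results of unsupervised learning need careful calibration and testing.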
Reinforcement learning is used in, for example, robotics, industrial automation and business strategy planning, but due to the need for large amounts of feedback, the method has limited use in surveillance today.
Reinforcement learning is about taking suitable action to maximise the potential reward in a specific situation, a reward that gets larger when the model makes the right choices. The algorithm does not use data/label pairs for training, but is instead optimised by testing its decisions through interaction with the environment while measuring the reward. The goal of the algorithm is to learn a policy for actions that will help maximise the reward.
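A minimal sketch of this reward-driven loop is the epsilon-greedy strategy shown below in Python. The simulated environment, its reward probabilities, the number of steps and the exploration rate are assumptions. The point is only that the policy is learned from rewards, not from data/label pairs.

```python
import random


def train_policy(reward_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learning of which action yields the highest average reward."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probabilities)
    counts = [0] * len(reward_probabilities)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(estimates))  # explore: try a random action
        else:
            action = max(range(len(estimates)), key=lambda a: estimates[a])  # exploit
        # The simulated environment returns a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probabilities[action] else 0.0
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


# Hypothetical environment: three possible actions with different chances of reward.
estimates = train_policy([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)
```

Even this toy problem needs thousands of interactions with the environment, which illustrates why the need for large amounts of feedback limits the method in surveillance.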
Testing
Once the model is trained, it needs to be thoroughly tested. This step typically contains an automated part complemented with extensive testing in real-life deployment situations.
In the automated part, the application is benchmarked with new datasets, unseen by the model during its training. If these benchmarks are not where they are expected to be, the process starts over again: new training data is collected, annotations are made or refined and the model is retrained.
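The automated benchmark can be sketched as a comparison between the model output and the ground-truth annotations of a held-out dataset. The Python example below uses precision and recall as metrics. These metrics, the target values and the sample data are assumptions, since the paper does not state which benchmarks are used.

```python
def precision_recall(predicted, actual):
    """Compare detector output with ground-truth labels (1 = object present)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def needs_retraining(predicted, actual, min_precision=0.9, min_recall=0.9):
    """True if either metric falls short of the (hypothetical) target."""
    precision, recall = precision_recall(predicted, actual)
    return precision < min_precision or recall < min_recall


# Hypothetical benchmark set that the model never saw during training.
ACTUAL = [1, 1, 1, 0, 0, 1, 0, 0]
PREDICTED = [1, 1, 0, 0, 1, 1, 0, 0]
print(precision_recall(PREDICTED, ACTUAL))  # (0.75, 0.75)
print(needs_retraining(PREDICTED, ACTUAL))  # True
```

Precision drops with every false detection and recall drops with every missed object, so the two values together describe both kinds of error.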
Once the desired quality level is reached, a field test starts. In this test, the application is exposed to real-world scenarios. The amount and variation depend on the scope of the application. The narrower the scope, the fewer variations need to be tested. The broader the scope, the more tests are needed.
Results are again compared and evaluated. This step can again cause the process to start over. Another potential outcome is to define preconditions that describe known scenarios in which use of the application is not recommended, or only partly recommended.
Deployment
The deployment phase is also called the inference or prediction phase. Inference or prediction is the process of executing a trained machine learning model. The algorithm uses what it learned during the training phase to produce its desired output. In the surveillance analytics context, the inference phase is the application running on a surveillance system monitoring real life scenes.
To achieve real-time performance when executing a machine learning-based algorithm on audio or video input data, specific hardware acceleration is generally required.
High-performance video analytics used to be server based because they required more power and cooling than a camera could offer. But algorithm development and increasing processing power of edge devices in recent years have made it possible to run advanced AI-based video analytics on the edge.
There are obvious advantages of edge-based analytics applications: they have access to uncompressed video material with very low latency, enabling real-time applications while avoiding the additional cost and complexity of moving data into the cloud for computations. Edge-based analytics also come with lower hardware and deployment costs since fewer server resources are needed in the surveillance system.
Some applications may benefit from using a combination of edge-based and server-based processing, with pre-processing on the camera and further processing on the server. Such a hybrid system can facilitate cost-efficient scaling of analytics applications by working on several camera streams.
While you can often run a specific analytics application on several types of platforms, using dedicated hardware acceleration achieves a much higher performance when power is limited. Hardware accelerators enable power-efficient implementation of analytics applications. They can be complemented by server and cloud computing resources when suitable.
• GPU (graphics processing unit). GPUs were mainly developed for graphics processing applications but are also used for accelerating AI on server and cloud platforms. While sometimes also used in embedded systems (edge), GPUs are not optimal from a power efficiency standpoint for machine learning inference tasks.
• MLPU (machine learning processing unit). An MLPU can accelerate inference of specific classical machine learning algorithms for solving computer vision tasks with very high power efficiency. It is designed for real-time object detection of a limited number of simultaneous object types, for example, humans and vehicles.
• DLPU (deep learning processing unit). Cameras with a built-in DLPU can accelerate general deep learning algorithm inference with high power efficiency, allowing for a more granular object classification.
AI is still in its early development
It is tempting to make a comparison between the potential of an AI solution and what a human can achieve. While human video surveillance operators can only be fully alert for a short period of time, a computer can keep processing large amounts of data extremely quickly without ever getting tired.
But it would be a fundamental misunderstanding to assume that AI solutions would replace the human operator. The real strength lies in a realistic combination: taking advantage of AI solutions to improve and increase the efficiency of a human operator.
Machine learning or deep learning solutions are often described as having the capability to automatically learn or improve through experience. But AI systems available today do not automatically learn new skills after deployment and will not remember specific events that have occurred. To improve the system’s performance, it needs to be retrained with better and more accurate data during supervised learning sessions. Unsupervised learning typically requires a lot of data to generate clusters and is therefore not used in video surveillance applications. It is instead used today mainly for analysing large datasets to find anomalies, for example in financial transactions. Most approaches that are promoted as ‘self-learning’ within video surveillance are based on a statistical data analysis and not on actually retraining the deep learning models.
Human experience still beats many AI-based analytics applications for surveillance purposes, especially those that are supposed to perform very general tasks and where contextual understanding is critical. A machine learning-based application might successfully detect a ‘running person’ if specifically trained for it but, unlike a human who can put the data into context, the application has no understanding of why the person is running: to catch the bus, or to flee from a pursuing police officer nearby?
Despite promises from companies applying AI in their analytics applications for surveillance, the application cannot yet understand what it sees on video with remotely the same insight as a human can.
For the same reason, AI-based analytics applications can also trigger false alarms or miss alarms. This could typically happen in a complex environment with a lot of movement. But it could also involve, for example, a person carrying a large object that hides the person’s human characteristics from the application, making a correct classification less likely.
AI-based analytics today should be used in an assisting way, for example, to roughly determine how relevant an incident is before alerting a human operator to decide about the response. This way, AI is used to reach scalability and the human operator is there to assess potential incidents.
Considerations for optimal analytics performance
To navigate the quality expectations of an AI-based analytics application, it is recommended to carefully study and understand the known preconditions and limitations, typically listed in the application’s documentation. Every surveillance installation is unique and the application’s performance should be evaluated at each site.
If the quality is not at the expected or anticipated level, it is strongly recommended to not only focus the investigation on the application itself. All investigations should be made on a holistic level because the performance of an analytics application depends on so many factors, most of which can be optimised if we are aware of their impact. These factors include, for example, camera hardware, video quality, scene dynamics, illumination level, as well as camera configuration, position, and direction.
Image quality is often said to depend on high resolution and high light sensitivity of the camera. While the importance of these factors cannot be questioned, there are certainly others that are just as influential for the actual usability of an image or a video. For example, the best quality video stream from the most expensive surveillance camera can be useless if the scene is not sufficiently lit at night, if the camera has been redirected, or if the system connection is broken.
The placement of the camera should be carefully considered before deployment. For video analytics to perform as expected, the camera needs to be positioned to enable a clear view, without obstacles, of the intended scene.
Image usability may also depend on the use case. Video that looks good to a human eye may not have the optimal quality for the performance of a video analytics application. In fact, many image processing methods that are commonly used to enhance video appearance for human viewing are not recommended when using video analytics. This may include, for example, applied noise reduction methods, wide dynamic range methods, or auto exposure algorithms.
Video cameras today often come with integrated IR illumination, which enables them to work in complete darkness. This is positive as it may enable cameras to be placed on sites with difficult light conditions and reduce the need for installing additional illumination. However, if heavy rain or snowfall is expected on a site, it is highly recommended not to rely on light coming from the camera or from a location very close to the camera, because such light can be reflected straight back into the lens by raindrops and snowflakes and obscure the scene.
It is difficult to determine a maximum detection distance of an AI-based analytics application – an exact datasheet value in metres can never be the whole truth. Image quality, scene characteristics, weather conditions, and object properties such as colour and brightness have a significant impact on the detection distance. It is evident, for example, that a bright object against a dark background during a sunny day can be visually detected at much longer distances than a dark object on a rainy day.
The detection distance also depends on the speed of the objects to be detected. To achieve accurate results, a video analytics application needs to ‘see’ the object during a sufficiently long period of time. How long that period needs to be depends on the processing performance (framerate) of the platform: the lower the processing performance, the longer the object needs to be visible in order to be detected. If the camera’s shutter time is not well matched with the object speed, motion blur appearing in the image may also lower the detection accuracy.
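The relationship between frame rate and the time an object must remain visible can be sketched with simple arithmetic. The Python example below assumes, purely for illustration, that an application needs to process ten frames containing the object before it can report a detection. The actual number depends on the application and is not stated in this paper.

```python
def required_visible_seconds(frames_needed, frames_per_second):
    """Time an object must stay in view for the analytics to process enough frames."""
    return frames_needed / frames_per_second


def distance_travelled(speed_m_per_s, seconds):
    """Distance the object covers while the analytics is still gathering frames."""
    return speed_m_per_s * seconds


FRAMES_NEEDED = 10  # assumption for illustration only
for fps in (5, 25):
    seconds = required_visible_seconds(FRAMES_NEEDED, fps)
    print(fps, seconds, distance_travelled(8.0, seconds))
```

Under these assumptions, an object moving at 8 metres per second must stay in view for 2 seconds (16 metres of travel) at 5 frames per second, but only 0.4 seconds (about 3 metres) at 25 frames per second.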
A higher resolution camera typically does not provide a longer detection distance. The processing capabilities needed for executing a machine learning algorithm are proportional to the size of the input data. This means that the processing power required to analyse the full resolution of a 4K camera is at least four times higher than for a 1080p camera. It is very common to run AI-based applications on a lower resolution than the camera or stream can offer due to limitations in the camera’s processing capability.
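The factor of four follows directly from the pixel counts, as the short Python calculation below shows. It assumes that 4K means 3840 x 2160 pixels and that 1080p means 1920 x 1080 pixels, which are the common definitions but are not stated in the text.

```python
# Assumed frame sizes in pixels (width, height).
RESOLUTIONS = {"1080p": (1920, 1080), "4K": (3840, 2160)}


def pixel_count(name):
    """Number of pixels in one frame of the named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height


ratio = pixel_count("4K") / pixel_count("1080p")
print(pixel_count("1080p"), pixel_count("4K"), ratio)  # 2073600 8294400 4.0
```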
Alarms and recording setup
Because of the various levels of filters they apply, object analytics generate very few false alarms. But object analytics perform as they should only when their listed preconditions are all met. In other cases, they might instead miss important events.
If it is not absolutely certain that all conditions will be met at all times, it is therefore recommended to take a conservative approach and set up the system so that a specific object classification is not the only alarm trigger. This will cause more false alarms but also reduce the risk of missing something important.
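Such a conservative setup can be sketched as an alarm rule with more than one trigger. In the Python example below, the second trigger is a generic motion level between 0 and 1. The trigger types, the threshold and the function name are assumptions and do not correspond to the settings of any specific product.

```python
def should_alarm(classified_object, motion_level, motion_threshold=0.6):
    """Conservative rule: object classification is not the only alarm trigger."""
    object_trigger = classified_object in {"human", "vehicle"}
    motion_trigger = motion_level >= motion_threshold
    return object_trigger or motion_trigger


print(should_alarm("human", 0.2))  # True: classified object
print(should_alarm(None, 0.8))     # True: strong motion although nothing was classified
print(should_alarm(None, 0.1))     # False
```

The second case is the one that matters here: if the preconditions for classification are not met, the motion trigger still raises an alarm, at the price of more false alarms.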
A surveillance installation should be regularly maintained. Physical inspections, and not only viewing the video through the VMS interface, are recommended in order to discover and remove anything that might disturb or block the field of view. This is important also in standard, recording-only installations, but is even more critical when using analytics.
In the context of basic video motion detection, a typical obstacle such as a spider’s web that sways in the wind could increase the number of alarms, resulting in a higher storage consumption than necessary. With object analytics, the web would basically create an exclude zone in the detection area. Its threads would obscure objects and greatly reduce the chance of detection and classification.
Dirt on the front glass or bubble of the camera is unlikely to cause problems during daytime. But in low-light conditions, light that hits a dirty bubble from the side, for example from the headlights of a car, can cause unexpected reflections that may decrease detection accuracy.
Scene-related maintenance is as important as camera maintenance. During the lifetime of a camera, a lot can happen in the scene it is monitoring. A simple before-and-after image comparison will reveal potential problems. What did the scene look like when the camera was deployed and what does it look like today? Is there a need to adjust the detection zone? Should the camera’s field of view be adjusted, or should the camera be moved to a different location?
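A before-and-after comparison can be as simple as measuring how much the pixel values of a current snapshot differ from a reference snapshot taken at deployment. The Python sketch below uses the mean absolute difference for this. The tiny images and the threshold are assumptions, and in practice changes in lighting and weather would also need to be taken into account.

```python
def mean_absolute_difference(reference, current):
    """Average per-pixel brightness difference between two equally sized images."""
    total = sum(
        abs(r - c)
        for ref_row, cur_row in zip(reference, current)
        for r, c in zip(ref_row, cur_row)
    )
    pixels = sum(len(row) for row in reference)
    return total / pixels


def scene_changed(reference, current, threshold=20.0):
    """Flag the scene for manual review if it differs clearly from the reference."""
    return mean_absolute_difference(reference, current) > threshold


# Hypothetical 2x4 greyscale snapshots: at deployment and today.
AT_DEPLOYMENT = [[100, 100, 100, 100], [100, 100, 100, 100]]
TODAY = [[100, 100, 30, 30], [100, 100, 30, 30]]  # right half now blocked by a new object
print(mean_absolute_difference(AT_DEPLOYMENT, TODAY))  # 35.0
print(scene_changed(AT_DEPLOYMENT, TODAY))  # True
```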
This paper has been shortened; the full version can be found at www.securitysa.com/*axis8
© Technews Publishing (Pty) Ltd | All Rights Reserved