The inner world of DLP systems

Issue 3 2021 Information Security

By David Balaban.

Today, DLP (data loss prevention) systems are used not only to protect against data breaches. Extensive growth of the technology has given way to intensive development: DLP systems began to grow in depth, improving the quality of content interception and analysis. Information obtained with the help of DLP systems becomes invaluable for making management decisions. This turns information security into a service for other departments of the company, from HR to accounting.

Tasks of DLP systems

The first challenge that DLP data analysis is designed to solve is data loss prevention. Leaks can also be prevented without analysis technologies, but that requires too many administrative measures and, in effect, prohibiting everything. For a large company, this approach can harm business processes. Therefore, data should be blocked selectively. Analysis helps to determine which data and which users are subject to the restrictions.

The second task is to label the intercepted archive. Without labels and templates, intercepted information is just a large pile of data that can only be explored with full-text search, which does not always help.

For example, a 16-digit credit card number can be written in messages in many different formats, so it is almost impossible to find with a full-text search. Here, the standard form comes to the rescue. ‘Credit card’ is set as a text object. The system captures credit card numbers, extracts the standard form by removing any formatting and saves it in the database with a reference to the intercepted object.
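The standard-form idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern (16 digits, optionally grouped in fours by spaces or hyphens), the function names and the added Luhn checksum filter are choices made for this example, not the way any particular DLP product implements the text object.

```python
import re

# Assumed pattern: 16 digits, optionally split into groups of four
# by a single space or hyphen. Real products use their own rules.
CARD_PATTERN = re.compile(r"(?<!\d)\d{4}(?:[ -]?\d{4}){3}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def extract_cards(text: str) -> list[str]:
    """Return the standard form (digits only) of each card-like number."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        standard_form = re.sub(r"\D", "", match.group())
        # The checksum filter is an extra assumption to cut false positives.
        if luhn_valid(standard_form):
            found.append(standard_form)
    return found
```

Because every spelling collapses to the same digit string, a later search of the archive only needs to compare standard forms rather than guess at the formatting used in each message.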

One more task of modern DLPs is to analyse event chains. Products of the UBA (user behaviour analytics) class are built on this analysis. They analyse user behaviour by studying a set of user-generated events. Well-labelled events can signal both policy violations and malware infection. For example, the system links the events of sending a CV by mail, visiting a job search website and visiting an employer assessment website into a chain and helps to estimate how likely the employee is to leave the company.
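A rough sketch of chain detection might look as follows. Everything here is hypothetical: the event labels, the 14-day window and the scoring rule (the share of the chain observed within one window) are assumptions for illustration, and the score is a simple indicator rather than a calibrated probability.

```python
from datetime import datetime, timedelta

# Hypothetical event labels and time window; a real system defines its own.
RESIGNATION_CHAIN = {"cv_sent_by_mail", "job_site_visit", "employer_review_visit"}
WINDOW = timedelta(days=14)


def chain_score(events: list[tuple[datetime, str]]) -> float:
    """Return the largest share of the chain seen within any single window."""
    relevant = sorted(e for e in events if e[1] in RESIGNATION_CHAIN)
    best = 0
    for i, (start, _) in enumerate(relevant):
        # Distinct chain events that occur within WINDOW after this one.
        seen = {label for ts, label in relevant[i:] if ts - start <= WINDOW}
        best = max(best, len(seen))
    return best / len(RESIGNATION_CHAIN)
```

A score of 1.0 means that all three labelled events occurred close together for one user; events spread over months score lower because they never fall into the same window.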

Data primitives

There are many ways to represent data. Archives, for example, help save data storage space. Office formats store text, images, markup and other meta information in a single file.

Since you need to know the data storage format, it is difficult to access this information quickly. However, information security requires rapid response. Therefore, the DLP system has a rich set of so-called extractors. Their task is to obtain primitives from all formats used in the organisation (text, images, vector graphics, etc.).

It is understood that text is the simplest and most convenient data type to analyse. DLP systems even try to convert images into a textual representation using OCR (optical character recognition) technology. Modern computer vision methods work with images very well and can already tell a lot about an image.

Not so long ago, vector images moved from the category of binary images to a stand-alone primitive of information as we learned to analyse them as structured data. Let us hope that in the foreseeable future, technologies will evolve to such an extent that they will make it possible to obtain a full-text description of an image.

Data analysis

Data can be analysed in three ways: semantic, formal and content.

1. Semantic search for information usually uses a classifier. This approach makes it possible to extract the subject matter from the intercepted information in the event of a leak without having an exact pattern for searching.

2. In formal analysis, the system is primarily interested in how the information is framed and secondarily in what it is. Regular expressions are a prime example of this analysis.

3. Content-based types of analysis search by sample. They require one or several reference samples against which the analysed information is compared.

Data classification

Classification can be applied to data characteristics, by which we can determine certain groups or topics of data. In general, the main criterion for creating new technologies is maximum quality in minimum time. When analysing data on the fly it is important to do it quickly; otherwise, the information security specialist will find out about the violation too late. The DLP system intercepts millions of events every day, and delays in the analysis of such a huge number of intercepted objects can be critical for business.

For the classifier to work, a labelled training collection is required. That is, each document in it must be assigned to one of the presented classes. The simplest analogy is document directories on a hard drive. Further, features (key points for images and terms for texts) are selected from the presented documents, which are sent to the mathematical core with reference to categories. The system gets trained on their basis. Once the classifier is trained, you can submit documents to it.

The analysis process is similar to training: features are extracted from the intercepted document and provided to the mathematical core for classification. The classification process determines whether the analysed data belongs to one or several categories.

It is often impossible to set up a classifier in advance for any company because companies operating in the same market can use different sets of terms for the same thematic area. Therefore, when installing DLP, the classifiers are fine-tuned to improve the quality of their work. During operation, it will also be necessary to adjust the classifiers since categories or their features change over time.

In addition to images, DLP can also classify texts. Many machine learning approaches can be used to classify texts, for example, cosine similarity (content filtering database) or logistic regression.
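The train-then-classify cycle described above can be illustrated with cosine similarity over term counts. This is a minimal sketch, not a product implementation: the function names, the summed per-category term vector and the 0.2 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def features(text: str) -> Counter:
    """Extract term counts (the features) from a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(collection: dict[str, list[str]]) -> dict[str, Counter]:
    """Build one summed term vector per category from labelled documents."""
    model = {}
    for category, documents in collection.items():
        total = Counter()
        for document in documents:
            total.update(features(document))
        model[category] = total
    return model


def classify(model: dict[str, Counter], text: str, threshold: float = 0.2) -> list[str]:
    """Return every category whose similarity reaches the threshold."""
    vector = features(text)
    scores = {c: cosine(vector, centroid) for c, centroid in model.items()}
    return sorted(c for c, s in scores.items() if s >= threshold)
```

The labelled collection is passed as a mapping from category name to documents, mirroring the analogy of document directories on a hard drive. Returning a list rather than a single label reflects the point that a document may belong to one or several categories, or to none.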

For text, words are the attributes. Words in almost any language have forms, while the final meaning of the text where these forms are used does not change radically. Therefore, classifiers use morphological dictionaries for several languages, bringing all words to normal form. This helps to improve the quality of the classification. In languages for which there are no dictionaries yet, classifiers look for exact matches. To improve accuracy, a typo correction technology is used that compares words to known terms.
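A toy version of this normalisation step, using Python's standard `difflib` module for typo correction, could look like this. The tiny dictionary, the 0.8 similarity cutoff and the function name are assumptions made for the example; production systems rely on full morphological dictionaries.

```python
import difflib

# Hypothetical miniature morphological dictionary: word form -> normal form.
LEMMAS = {
    "contracts": "contract",
    "contracted": "contract",
    "salaries": "salary",
    "paid": "pay",
    "pays": "pay",
}
# Known terms used by the classifier (normal forms only).
KNOWN_TERMS = sorted(set(LEMMAS.values()))


def normalise(word: str) -> str:
    """Map a word to its normal form, correcting likely typos on the way."""
    word = word.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if word in KNOWN_TERMS:
        return word
    # Fuzzy match against known forms and terms; the cutoff is arbitrary.
    candidates = difflib.get_close_matches(
        word, list(LEMMAS) + KNOWN_TERMS, n=1, cutoff=0.8
    )
    if candidates:
        best = candidates[0]
        return LEMMAS.get(best, best)
    return word  # unknown word: fall back to exact matching
```

Words that match nothing closely are returned unchanged, which corresponds to the exact-match behaviour described for languages without a dictionary.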

Copyright analysis

Copyright analysis can be thought of as a search for fragments of reference samples in the analysed data. All of its variants work according to a similar principle: reference documents are uploaded into the system, then each intercepted piece of information is compared against the references. Each type of copyright analysis usually works with only one data type. At the same time, the number of reference samples can be very large: hundreds of thousands of documents can be uploaded as references. There are several types of copyright analysis:

1. Classical copyright analysis takes text as a reference and analyses text primitives only. As a result, the DLP system obtains the relevance, that is, the percentage of the reference sample contained in the analysed document, together with the markup of the matching pieces, so that the GUI can highlight these matches.

2. Copyright analysis for binary data works the same way but returns relevance only.

3. There is also copyright analysis for raster graphics data, but it is extremely important to strike a balance between speed and functionality.

4. Copyright analysis for vector images selects graphic primitives and evaluates their relative position in the reference sample. DLP can be configured to intercept fragments of vector images too.

5. There are also specialised types of copyright analysis designed to solve specific yet very frequent tasks. One example is the detector of reference templates. It can detect completed paper surveys by taking the blank survey as a reference, and it can also read the filled fields. It has proven to be an indispensable tool in cases where personal data is one of the main digital assets of a business.

6. The detector of reference stamps makes it possible to set round, triangular or rectangular stamps as references and then search for them in scans, photos or other documents.

7. Picture-in-picture search is often used to detect credit cards. The detector tries to find a reference image in the analysed data. For example, it can search for logos of payment systems.
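To make the notion of relevance in classical (text) copyright analysis more concrete, here is a minimal sketch based on overlapping word shingles. Shingling is a common fingerprinting technique chosen for this illustration; the source does not say which algorithm DLP vendors use, and the function names and shingle size of three words are assumptions.

```python
import re


def shingles(text: str, size: int = 3) -> list[tuple[str, ...]]:
    """Split text into overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def relevance(reference: str, analysed: str, size: int = 3) -> tuple[float, list[str]]:
    """Return the share of reference shingles found in the analysed text,
    together with the matched fragments that a GUI could highlight."""
    reference_shingles = set(shingles(reference, size))
    if not reference_shingles:
        return 0.0, []
    matched = [s for s in shingles(analysed, size) if s in reference_shingles]
    share = len(set(matched)) / len(reference_shingles)
    return share, [" ".join(s) for s in matched]
```

The first returned value plays the role of relevance, and the list of fragments stands in for the markup used to highlight matches. A binary variant would return only the first value.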

Conclusion

DLPs are complex systems with broad capabilities. The success of their operation largely depends on how thoroughly the vendor fine-tunes its solution to meet specific customer needs. Sometimes you can hear the opinion that the DLP sphere has reached a dead end. This is not true.

Customers' tasks are constantly evolving, and the data transmission channels, topics, documents and other data that need to be protected change all the time. A good example is the massive transition to remote work this year, which led to the need to ensure cybersecurity and protection against data leaks in the new environment.

DLP analysis technology has taken a big step forward. Now you can analyse employees’ interactions with partners or competitors, build graphs of connections, recognise suspicious patterns, identify informal team leaders, respond to risks promptly and competently and much more. DLP systems grew out of information security and now solve a wide range of business problems.

David Balaban is a computer security researcher with over 17 years of experience in malware analysis and antivirus software evaluation. David runs the MacSecurity.net and Privacy-PC.com projects that present expert opinions on contemporary information security matters, including social engineering, malware, penetration testing, threat intelligence, online privacy and white hat hacking. David has a strong malware troubleshooting background, with a recent focus on ransomware countermeasures.

Published by Technews