The inner world of DLP systems

Issue 3 2021 Information Security

Today, DLP (data loss prevention) systems are used for more than protecting against data breaches. Extensive development of the technology has given way to intensive development: DLP systems have begun to grow in depth, improving the quality of content interception and analysis. Information obtained with the help of DLP systems is becoming invaluable for making management decisions. This allows information security to become a service for other departments of the company, from HR to accounting.

Tasks of DLP systems

The first task that DLP data analysis is designed to solve is data loss prevention. Leaks can also be prevented without analysis technologies, but this requires so many administrative measures that, in effect, everything has to be prohibited. For a large company, this approach can harm business processes. Therefore, data should be blocked selectively. Analysis helps to determine which data and which users are subject to the restrictions.

The second task is to label the intercepted archive. Without labels and templates, intercepted information is just a large pile of data that can only be explored with full-text search, which does not always help.

For example, a credit card number has 16 digits, but it can be written in messages in many different formats, so it is almost impossible to find with a full-text search. This is where the standard form comes to the rescue. ‘Credit card’ is defined as a text object. The system detects credit card numbers, strips any formatting to obtain the standard form, and saves it in the database with a reference to the intercepted object.
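The standard-form idea can be illustrated with a minimal Python sketch. It assumes card numbers are 16 digits optionally separated by single spaces or hyphens; the pattern, the function names and the Luhn checksum filter are illustrative assumptions, not a description of how any particular DLP product works.

```python
import re

# Assumed pattern: 16 digits, each optionally preceded by one space or hyphen.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){15}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def extract_cards(text: str) -> list[str]:
    """Find card-like numbers and return them in standard form (digits only)."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        standard = re.sub(r"\D", "", match.group())  # strip all formatting
        if luhn_valid(standard):  # assumed extra filter against random digit runs
            found.append(standard)
    return found
```

Because every spelling collapses to the same digit string, a later search for the standard form finds all intercepted objects that mention the card, whatever formatting the sender used.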

One more task of modern DLP systems is to analyse event chains. This analysis is the basis for products of the UBA (user behaviour analytics) class, which analyse user behaviour by studying a set of user-generated events. Well-labelled events can signal both policy violations and malware infection. For example, the system links the events of sending a CV by mail, visiting a job search website and visiting an employer review website into a chain and helps to estimate how likely the employee is to leave the company.
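A chain of this kind could be scored roughly as follows. This is a simplified sketch: the event names, the point values and the 14-day window are hypothetical, and real UBA products use far richer models.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical event types and risk points; a real deployment would tune these.
RISK_POINTS = {
    "cv_sent_by_mail": 5,
    "job_site_visit": 2,
    "employer_review_site_visit": 2,
}


def chain_scores(events, window=timedelta(days=14)):
    """Score each user by the distinct risky event types seen close together.

    `events` is an iterable of (user, event_type, timestamp) tuples. Only events
    within `window` of the user's latest event are treated as one chain, and
    each event type is counted once.
    """
    by_user = defaultdict(list)
    for user, event_type, timestamp in events:
        by_user[user].append((timestamp, event_type))

    scores = {}
    for user, items in by_user.items():
        latest = max(timestamp for timestamp, _ in items)
        recent_types = {etype for timestamp, etype in items if latest - timestamp <= window}
        scores[user] = sum(RISK_POINTS.get(etype, 0) for etype in recent_types)
    return scores


now = datetime(2021, 3, 1)
events = [
    ("alice", "cv_sent_by_mail", now),
    ("alice", "job_site_visit", now - timedelta(days=2)),
    ("alice", "employer_review_site_visit", now - timedelta(days=5)),
    ("bob", "job_site_visit", now - timedelta(days=30)),
    ("bob", "cv_sent_by_mail", now),
]
print(chain_scores(events))
```

In this toy data, the first user produces all three events within a few days and scores higher than the second, whose job site visit falls outside the window.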

Data primitives

There are many ways to represent data. Archives, for example, help save data storage space. Office formats store text, images, markup and other meta information in a single file.

Since you need to know the data storage format, it is difficult to access this information quickly. However, information security requires rapid response. Therefore, the DLP system has a rich set of so-called extractors. Their task is to obtain primitives from all formats used in the organisation (text, images, vector graphics, etc.).
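As a rough illustration of what an extractor does, the sketch below unpacks ZIP containers recursively (many office formats are ZIP-based) and labels each leaf file with a primitive type. The extension-to-primitive mapping and the function name are assumptions made for the example; production extractors rely on format detection rather than file names.

```python
import io
import posixpath
import zipfile

# Hypothetical mapping from file extension to primitive type.
PRIMITIVE_BY_EXT = {
    ".txt": "text",
    ".xml": "text",
    ".png": "image",
    ".jpg": "image",
    ".svg": "vector",
}


def extract_primitives(name: str, data: bytes):
    """Yield (path, primitive_type, payload), unpacking ZIP containers recursively."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                yield from extract_primitives(f"{name}/{member}", archive.read(member))
        return
    extension = posixpath.splitext(name)[1].lower()
    # Anything without a known extension is treated as opaque binary data.
    yield name, PRIMITIVE_BY_EXT.get(extension, "binary"), data
```

Each yielded primitive can then be handed to the analysis technology that suits its type, for example text to a classifier and images to OCR.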

Text is, understandably, the simplest and most convenient data type to analyse. DLP systems even try to convert images into a textual representation using OCR (optical character recognition) technology. Modern computer vision methods work with images very well and can already tell a lot about an image.

Not so long ago, vector images moved from the category of binary images to a stand-alone primitive of information as we learned to analyse them as structured data. Let us hope that in the foreseeable future, technologies will evolve to such an extent that they will make it possible to obtain a full-text description of an image.

Data analysis

Data can be analysed in three ways: semantic, formal and content.

1. Semantic search for information usually uses a classifier. This approach makes it possible to extract the subject matter from the intercepted information in the event of a leak without having an exact pattern for searching.

2. In formal analysis, the system is primarily interested in how the information is framed and secondarily in what it is. Regular expressions are a prime example of this analysis.

3. Content-based types of analysis exercise search by sample. They require a reference sample or several reference samples against which the analysed information is to be compared.

Data classification

Classification can be applied to data characteristics, by which we can determine certain groups or topics of data. In general, the main criterion for creating new technologies is maximum quality in minimum time. When analysing data on the fly it is important to do it quickly; otherwise, the information security specialist will find out about the violation too late. The DLP system intercepts millions of events every day, and delays in the analysis of such a huge number of intercepted objects can be critical for business.

For the classifier to work, a labelled training collection is required. That is, each document in it must be assigned to one of the presented classes. The simplest analogy is document directories on a hard drive. Further, features (key points for images and terms for texts) are selected from the presented documents, which are sent to the mathematical core with reference to categories. The system gets trained on their basis. Once the classifier is trained, you can submit documents to it.

The analysis process is similar to training: features (attributes) are extracted from the intercepted document and passed to the mathematical core for classification. The classification process determines whether the analysed data belongs to one or several categories.
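The train-then-classify cycle can be sketched in a few lines of Python. This toy version uses word counts as features and cosine similarity against one summed vector per category; the function names and the 0.3 threshold are assumptions for illustration, and a real "mathematical core" may use a different model entirely.

```python
import math
import re
from collections import Counter


def features(text: str) -> Counter:
    """Extract term-frequency features (lowercased word counts) from a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Build one summed term vector per category from (category, text) pairs."""
    model = {}
    for category, text in labelled_docs:
        model.setdefault(category, Counter()).update(features(text))
    return model


def classify(model, text, threshold=0.3):
    """Return the categories whose similarity meets the threshold, best first."""
    doc = features(text)
    scored = [(cosine(doc, vector), category) for category, vector in model.items()]
    return [category for score, category in sorted(scored, reverse=True) if score >= threshold]
```

Returning a list rather than a single label reflects the point made above: a document may belong to one category, several, or none.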

It is often impossible to set up a classifier in advance for any company because companies operating in the same market can use different sets of terms for the same thematic area. Therefore, when installing DLP, the classifiers are fine-tuned to improve the quality of their work. During operation, it will also be necessary to adjust the classifiers since categories and their characteristic features change over time.

In addition to images, DLP can also classify texts. Many machine learning approaches can be used to classify texts, for example, cosine similarity (content filtering database) or logistic regression.

For text, words are the attributes. Words in almost any language have forms, while the final meaning of the text where these forms are used does not change radically. Therefore, classifiers use morphological dictionaries for several languages, bringing all words to normal form. This helps to improve the quality of the classification. In languages for which there are no dictionaries yet, classifiers look for exact matches. To improve accuracy, a typo correction technology is used that compares words to known terms.
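A minimal sketch of this normalisation step is shown below. The tiny dictionary is a hypothetical stand-in for a real morphological dictionary, and fuzzy matching with the standard library's `difflib` is only one possible way to implement typo correction; the 0.8 cutoff is likewise an assumption.

```python
import difflib

# Toy morphological dictionary (hypothetical): word form -> normal form.
NORMAL_FORMS = {
    "contract": "contract",
    "contracts": "contract",
    "contracted": "contract",
    "salary": "salary",
    "salaries": "salary",
}


def normalise(word: str) -> str:
    """Bring a word to normal form, correcting likely typos against known forms."""
    word = word.lower()
    if word in NORMAL_FORMS:
        return NORMAL_FORMS[word]
    # Typo correction: look for the closest known form above a similarity cutoff.
    close = difflib.get_close_matches(word, list(NORMAL_FORMS), n=1, cutoff=0.8)
    if close:
        return NORMAL_FORMS[close[0]]
    return word  # unknown word: fall back to exact matching


print([normalise(w) for w in ["Contracts", "salarry", "budget"]])
```

Words that are neither in the dictionary nor close to a known form pass through unchanged, which corresponds to the exact-match fallback described above.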

Copyright analysis

Copyright analysis can be thought of as a search for fragments of reference samples in the analysed data. All its variants work on a similar principle: reference documents are uploaded into the system, then each intercepted piece of information is compared against the references. Each type of copyright analysis usually works with only one data type. At the same time, the number of reference samples can be very large: hundreds of thousands of documents can be uploaded as references. There are several types of copyright analysis:

1. Classical copyright analysis takes text as a reference and analyses text primitives only. As a result, the DLP system obtains the relevance, that is, the percentage of the reference sample contained in the analysed document, together with the positions of the matching fragments, so the GUI can highlight these matches.

2. Copyright analysis for binary data works the same way but returns relevance only.

3. There is also copyright analysis for raster graphics data, but it is extremely important to strike a balance between speed and functionality.

4. Copyright analysis for vector images selects graphic primitives and evaluates their relative position in the reference sample. DLP can be configured to intercept fragments of vector images too.

5. There are also specialised types of copyright analysis designed to solve specific yet very frequent tasks. As an example, we can refer to the detector of reference templates. For example, you can detect completed paper surveys by taking the blank survey as a reference. You can also read the filled fields. It has proven to be an indispensable tool in cases where personal data is one of the main digital assets of a business.

6. The detector of reference stamps makes it possible to set round, triangular or rectangular stamps as references and then search for them in scans, photos or other documents.

7. Picture-in-picture search is often used to detect credit cards. The detector tries to find a reference image in the analysed data. For example, it can search for logos of payment systems.
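To make the notion of relevance in the classical (text) variant concrete, here is a small Python sketch based on word shingles, that is, overlapping runs of consecutive words. Shingling is a common fingerprinting technique chosen here for illustration; the article does not say which algorithm DLP vendors actually use, and the function names and shingle size are assumptions.

```python
import re


def shingles(text: str, size: int = 3) -> list[tuple[str, ...]]:
    """Split a text into overlapping word sequences of the given length."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def relevance(reference: str, analysed: str, size: int = 3) -> float:
    """Share of the reference's shingles found in the analysed text (0.0 to 1.0)."""
    reference_set = set(shingles(reference, size))
    if not reference_set:
        return 0.0  # reference too short to fingerprint
    found = reference_set & set(shingles(analysed, size))
    return len(found) / len(reference_set)
```

The matched shingles also indicate where the copied fragments sit in the analysed text, which is the information a GUI would need to highlight them.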

Conclusion

DLPs are complex systems with broad capabilities. The success of their operation largely depends on how thoroughly the vendor fine-tunes its solution to meet specific customer needs. Sometimes you can hear the opinion that the DLP sphere has reached a dead end. This is not true.

Customers' tasks are constantly evolving; data transmission channels, topics, documents and other data that need to be protected change all the time. A good example is the massive transition to remote work this year, which led to the need to ensure cybersecurity and protection against data leaks in the new environment.

DLP analysis technology has taken a big step forward. Now you can analyse employees’ interactions with partners or competitors, build graphs of liaisons, recognise suspicious patterns, identify groups of informal team leaders, respond to risks promptly and competently, and much more. DLP systems grew out of information security and now solve a wide range of business problems.


David Balaban is a computer security researcher with over 17 years of experience in malware analysis and antivirus software evaluation. David runs MacSecurity.net and Privacy-PC.com projects that present expert opinions on contemporary information security matters, including social engineering, malware, penetration testing, threat intelligence, online privacy and white hat hacking. David has a strong malware troubleshooting background, with a recent focus on ransomware countermeasures.



