The inner world of DLP systems

Issue 3 2021 Information Security

Today, DLP (data loss prevention) systems are used for more than protection against data breaches. The extensive development of these technologies has given way to intensive development: DLP systems have begun to grow in depth, improving the quality of content interception and analysis. Information obtained with the help of DLP systems becomes invaluable for making management decisions. This allows information security to be turned into a service for other departments of the company, from HR to accounting.

Tasks of DLP systems

The first challenge that DLP data analysis is designed to solve is data loss prevention. Leaks can be prevented without analysis technologies too, but this requires applying too many administrative measures and, in effect, prohibiting everything. For a large company, this approach can harm business processes. Therefore, data should be blocked selectively. Analysis helps to determine which data and which users should be subject to the restrictions.

The second task is to label the intercepted archive. Without labels and templates, the intercepted information is just a large pile of data that can only be worked with using full-text search, which does not always help.

For example, a 16-digit credit card number can be written in messages in many different formats, making it almost impossible to find with a full-text search. This is where the standard form comes to the rescue. 'Credit card' is set as a text object: the system captures card numbers, extracts the standard form by removing any formatting, and saves it in the database with a reference to the intercepted object.
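
The standard-form idea described above can be sketched as follows. This is a minimal illustration, not a real DLP detector: the pattern and the normalisation rule (stripping spaces and dashes) are assumptions for the example.

```python
import re

# Match 16 digits that may be separated by single spaces or dashes,
# as card numbers are often typed in messages.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def extract_card_numbers(text):
    """Find candidate card numbers and reduce them to the 'standard form'
    (digits only), ready to be stored with a reference to the object."""
    return [re.sub(r"[ -]", "", m) for m in CARD_PATTERN.findall(text)]

print(extract_card_numbers("Pay to 4111 1111-1111 1111 today"))
# ['4111111111111111']
```

Whatever formatting the sender used, the database receives one canonical string, so later searches for a specific card become trivial.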

One more task of modern DLPs is to analyse event chains. Based on this analysis, products of the UBA (user behaviour analytics) class appear. They analyse user behaviour by studying a set of user-generated events. Well-marked events can signal both policy violations and malware infection. For example, the system adds the events of sending a CV by mail, visiting a job search website or an employer assessment website into a chain and helps to determine how high the probability of an employee's dismissal is.
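
The event-chain idea above can be illustrated with a toy risk score. Event names and weights here are invented for the example; real UBA products use far richer models than a weighted sum.

```python
# Hypothetical signals that, taken together, suggest a possible resignation.
DISMISSAL_SIGNALS = {
    "sent_cv_by_email": 0.5,
    "visited_job_board": 0.3,
    "visited_employer_review_site": 0.2,
}

def dismissal_risk(event_chain):
    """Sum the weights of matching events in the chain, capped at 1.0."""
    score = sum(DISMISSAL_SIGNALS.get(event, 0.0) for event in event_chain)
    return min(score, 1.0)

print(round(dismissal_risk(["login", "sent_cv_by_email", "visited_job_board"]), 2))
# 0.8
```

Individually, each event is innocuous; it is the chain that makes the probability of dismissal visible to the analyst.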

Data primitives

There are many ways to represent data. Archives, for example, help save data storage space. Office formats store text, images, markup and other meta information in a single file.

Since accessing the data requires knowledge of the storage format, it is difficult to get at this information quickly. Information security, however, requires rapid response. Therefore, a DLP system has a rich set of so-called extractors. Their task is to obtain primitives (text, images, vector graphics, etc.) from all formats used in the organisation.
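
An extractor set can be pictured as a registry that maps a format to a function producing primitives. The format names, payload shapes and extractor functions below are invented for the sketch.

```python
# Each extractor turns a format-specific payload into analysable primitives.
def extract_zip(payload):
    return {"type": "archive", "children": payload.get("entries", [])}

def extract_docx(payload):
    return {"type": "text", "text": payload.get("body", "")}

EXTRACTORS = {"zip": extract_zip, "docx": extract_docx}

def extract_primitives(fmt, payload):
    extractor = EXTRACTORS.get(fmt)
    if extractor is None:
        # Unknown format: fall back to treating the object as raw binary.
        return {"type": "binary"}
    return extractor(payload)

print(extract_primitives("docx", {"body": "quarterly report"}))
```

Adding support for a new corporate format then means registering one more extractor rather than changing the analysis pipeline.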

Text is, understandably, the simplest and most convenient data type to analyse. DLP systems even try to convert images into a textual representation using OCR (optical character recognition) technology. Modern computer vision methods work with images very well and can already tell a lot about an image.

Not so long ago, vector images moved from the category of binary images to a stand-alone primitive of information as we learned to analyse them as structured data. Let us hope that in the foreseeable future, technologies will evolve to such an extent that they will make it possible to obtain a full-text description of an image.

Data analysis

Data can be analysed in three ways: semantic, formal and content-based.

1. Semantic search for information usually uses a classifier. This approach makes it possible to extract the subject matter from the intercepted information in the event of a leak without having an exact pattern for searching.

2. In formal analysis, the system is primarily interested in how the information is framed and secondarily in what it is. Regular expressions are a prime example of this analysis.

3. Content-based types of analysis exercise search by sample. They require a reference sample or several reference samples against which the analysed information is to be compared.
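
The formal type of analysis from the list above is easy to show with regular expressions: the system matches the shape of the data rather than its meaning. The patterns below are deliberately simplified examples, not production detectors.

```python
import re

# Formal analysis cares about how information is framed:
# an e-mail address and a US-style SSN have recognisable shapes.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def formal_matches(text):
    """Return every pattern name with the fragments it matched."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

print(formal_matches("Contact jane.doe@example.com, SSN 078-05-1120"))
```

Note that the detector fires on anything with the right shape; deciding whether the match actually matters is left to policy.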

Data classification

Classification relies on data characteristics by which certain groups or topics of data can be determined. In general, the main criterion for creating new technologies is maximum quality in minimum time. When analysing data on the fly, it is important to do it quickly; otherwise, the information security specialist will find out about a violation too late. A DLP system intercepts millions of events every day, and delays in the analysis of such a huge number of intercepted objects can be critical for business.

For the classifier to work, a labelled training collection is required. That is, each document in it must be assigned to one of the presented classes. The simplest analogy is document directories on a hard drive. Further, features (key points for images and terms for texts) are selected from the presented documents, which are sent to the mathematical core with reference to categories. The system gets trained on their basis. Once the classifier is trained, you can submit documents to it.

The analysis process is similar to training: features (attributes) are extracted from the intercepted document and passed to the mathematical core for classification. The classification process determines whether the analysed data belongs to one or several categories.

It is often impossible to set up a classifier in advance for any company because companies operating in the same market can use different sets of terms for the same thematic area. Therefore, when installing DLP, the classifiers are fine-tuned to improve the quality of their work. During operation, it will also be necessary to adjust the classifiers since categories or their signs change over time.

In addition to images, DLP can also classify texts. Many machine learning approaches can be used to classify texts, for example, cosine similarity (content filtering database) or logistic regression.
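
The cosine-similarity approach mentioned above can be sketched in a few lines: each category is represented by a term-frequency vector built from training documents, and an intercepted text is assigned to the closest category. The categories and training phrases here are toy examples; real classifiers add morphology, term weighting and much larger collections.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, categories):
    """Pick the category whose reference vector is closest to the text."""
    vec = vectorize(text)
    return max(categories, key=lambda c: cosine(vec, categories[c]))

categories = {
    "finance": vectorize("invoice payment account balance credit"),
    "hr": vectorize("vacancy cv interview salary offer"),
}
print(classify("please check the invoice and payment details", categories))
# finance
```

Swapping the mathematical core for logistic regression changes only the training and scoring steps; feature extraction stays the same.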

For text, words are the attributes. Words in almost any language have multiple forms, while the final meaning of the text in which these forms are used does not change radically. Therefore, classifiers use morphological dictionaries for several languages, reducing all words to their normal form. This helps to improve the quality of the classification. In languages for which there are no dictionaries yet, classifiers look for exact matches. To improve accuracy, a typo-correction technology is used that compares words to known terms.
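
The typo-correction step can be approximated with a fuzzy match against a dictionary of known terms. The term list and similarity cutoff below are assumptions for the sketch; this stands in for whatever proprietary matching a real DLP uses.

```python
import difflib

# Toy dictionary of known terms; real systems use per-language
# morphological dictionaries with far more entries.
KNOWN_TERMS = {"confidential", "contract", "salary", "invoice"}

def correct(word):
    """Snap a word to the closest known term, or keep it unchanged."""
    matches = difflib.get_close_matches(word.lower(), KNOWN_TERMS, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("confidental"))
# confidential
```

With this step in place, a misspelled term still lands on the right feature, so the classifier's quality does not degrade on sloppy typing.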

Copyright analysis

Copyright analysis can be thought of as a search for fragments of reference samples in the analysed data. All its variants work according to a similar principle: reference documents are uploaded into the system, then each intercepted piece of information is compared against the references. Each type of copyright analysis usually works with only one data type. At the same time, there can be a great number of reference samples – hundreds of thousands of documents. There are several types of copyright analysis:

1. Classical copyright analysis takes text as a reference and analyses text primitives only. As a result, the DLP system sees relevance – the percentage of the reference sample contained in the analysed document – along with the markup of the matching fragments, so the GUI can highlight these matches.

2. Copyright analysis for binary data works the same way but returns relevance only.

3. There is also copyright analysis for raster graphics data, but it is extremely important to strike a balance between speed and functionality.

4. Copyright analysis for vector images selects graphic primitives and evaluates their relative position in the reference sample. DLP can be configured to intercept fragments of vector images too.

5. There are also specialised types of copyright analysis designed to solve specific yet very frequent tasks. One example is the detector of reference templates: you can detect completed paper surveys by taking the blank survey as a reference, and even read the filled-in fields. This has proven to be an indispensable tool in cases where personal data is one of the main digital assets of a business.

6. The detector of reference stamps enables setting round, triangular or rectangular stamps as references and then searching for them on scans, photos or other documents.

7. Picture-in-picture search is often used to detect credit cards. The detector tries to find a reference image in the analysed data. For example, it can search for logos of payment systems.
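
The classical text variant from the list above can be sketched with word shingles: split the reference and the intercepted text into overlapping word n-grams and report what share of the reference shingles appears in the analysed document. The shingle size and whitespace tokenizer are simplifications; real engines also return the positions of the matching fragments.

```python
def shingles(text, n=3):
    """Set of overlapping n-word tuples from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def relevance(reference, document, n=3):
    """Share of the reference's shingles found in the document."""
    ref = shingles(reference, n)
    if not ref:
        return 0.0
    return len(ref & shingles(document, n)) / len(ref)

ref = "the quarterly results must remain strictly confidential until release"
doc = "note that the quarterly results must remain strictly confidential for now"
print(relevance(ref, doc))
```

Because matching is done on shingles rather than whole documents, even a paragraph lifted into an otherwise innocent e-mail raises the relevance score.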

Conclusion

DLPs are complex systems with broad capabilities. The success of their operation largely depends on how thoroughly the vendor fine-tunes its solution to meet specific customer needs. Sometimes you can hear the opinion that the DLP sphere has reached a dead end. This is not true.

Customers' tasks are constantly evolving; data transmission channels, topics, documents and other data that need to be protected change systematically. A good example is the massive transition to remote work this year, which created the need to ensure cybersecurity and protection against data leaks in a new environment.

DLP analysis technology has taken a big step forward. Now you can analyse employees’ interactions with partners or competitors, build graphs of liaisons, recognise suspicious patterns, identify groups of informal team leaders, timely and competently respond to risks and much more. DLP systems grew out of information security and now solve a wide range of business problems.


David Balaban is a computer security researcher with over 17 years of experience in malware analysis and antivirus software evaluation. David runs the MacSecurity.net and Privacy-PC.com projects, which present expert opinions on contemporary information security matters, including social engineering, malware, penetration testing, threat intelligence, online privacy and white hat hacking. David has a strong malware troubleshooting background, with a recent focus on ransomware countermeasures.



