Can you be sure a file is what it claims to be?

Content scanning and filtering products are a crucial part of a security ecosystem, validating that files being moved in or out of a network conform to expectations. But how do you determine what to expect if the file extension (for example, file.PDF) is not reliable?

Just because a file claims to be a PDF by having a .PDF extension does not mean it is one; it could be something else in disguise. For example, the blog post How to Hide Files in JPEG Pictures describes how to create a file that works perfectly as both a .JPEG file and a .ZIP file. That is, if the file is called file.JPG and you open it, you will see a picture. If you rename the exact same file to file.ZIP, it will open as a ZIP archive containing something other than the picture.
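The trick works because a JPEG is recognised from its first bytes while a ZIP archive is located from a record at its end. A minimal Python sketch of the idea follows. The function names are made up for this illustration, and the "image" is only a stand-in byte string carrying JPEG start and end markers, not a picture a viewer would decode.

```python
import io
import zipfile

JPEG_SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker plus the first byte of the next marker
JPEG_EOI = b"\xff\xd9"      # JPEG end-of-image marker


def make_polyglot(image_bytes, hidden_name, hidden_data):
    """Append a ZIP archive holding one member to the end of the image bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(hidden_name, hidden_data)
    return image_bytes + buf.getvalue()


def looks_like_jpeg(data):
    """Header-only check: does the data start like a JPEG?"""
    return data.startswith(JPEG_SOI)


def opens_as_zip(data):
    """Ask the zipfile module whether it can find a ZIP end record in the data."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Stand-in for a real picture (assumption: illustrative bytes only).
fake_jpeg = JPEG_SOI + b"\xe0" + b"\x00" * 32 + JPEG_EOI
polyglot = make_polyglot(fake_jpeg, "secret.txt", b"hidden payload")

print(looks_like_jpeg(polyglot), opens_as_zip(polyglot))
```

The same bytes pass both checks, which is exactly why the extension alone tells you nothing.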

For malware scanning and data loss prevention products this is a big issue: you cannot rely on the file extension (or the MIME designation in the case of email).


Within Nexor’s Sentinel, Merlin and Guardian products we need to be certain what a file is in order to perform the required guarding or gatewaying security functions. Within the products there is a module called “Masquerade” that examines the file and identifies what it could be masquerading as (as opposed to what it claims to be).
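To make the idea of "claimed versus actual" concrete, here is a minimal Python sketch that compares the type implied by a file name with the type implied by the leading bytes of the content. It is not how the Masquerade module is implemented; the tables and function names are assumptions made for this illustration, and only a handful of well-known signatures are included.

```python
import os

# Well-known leading signatures ("magic numbers") for a few formats.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

# File name extensions mapped to the same type names.
EXTENSIONS = {".pdf": "pdf", ".jpg": "jpeg", ".jpeg": "jpeg", ".zip": "zip", ".png": "png"}


def claimed_type(filename):
    """Type the file claims to be, judged by its extension (None if unknown)."""
    return EXTENSIONS.get(os.path.splitext(filename)[1].lower())


def detected_type(data):
    """Type suggested by the leading bytes of the content (None if unknown)."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None


def is_masquerading(filename, data):
    """True when the content does not match what the name claims."""
    return claimed_type(filename) != detected_type(data)


print(is_masquerading("file.PDF", b"\xff\xd8\xff\xe0 not really a PDF"))
print(is_masquerading("file.PDF", b"%PDF-1.7\n"))
```

A header-only comparison like this catches a renamed file, but not a file built to satisfy two formats at once, which is where the additional techniques come in.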

Recognising that this is a key component of our security technology, Nexor engaged Nottingham University to undertake a study to ensure the product is using a state-of-the-art approach. An MSc student analysed approaches to assessing the file type using “byte frequency analysis”, “byte frequency cross-correlation” and “file header/trailer” techniques. Using test files constructed to behave perfectly as both .JPEG and .ZIP, the project looked at the performance of the respective algorithms under different input conditions.
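For readers unfamiliar with byte frequency analysis, the sketch below shows the basic idea in Python: count how often each of the 256 byte values occurs, then compare that profile with a reference profile. This is a simplified illustration and not the algorithm used in the study. The Pearson correlation used as the similarity score and the function names are assumptions for this example; a real system would build its reference profiles from many known-good files of each type.

```python
import math
from collections import Counter


def byte_frequencies(data):
    """Return a 256-entry list giving the relative frequency of each byte value."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero for empty input
    return [counts.get(value, 0) / total for value in range(256)]


def correlation(a, b):
    """Pearson correlation between two equal-length frequency profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a perfectly flat profile carries no shape to compare
    return cov / math.sqrt(var_a * var_b)


# Illustrative inputs only: repeated text versus perfectly uniform bytes.
text_profile = byte_frequencies(b"the quick brown fox " * 50)
flat_profile = byte_frequencies(bytes(range(256)) * 4)

print(correlation(text_profile, text_profile))
print(correlation(text_profile, flat_profile))
```

A file whose profile correlates poorly with the reference for its claimed type is a candidate for closer inspection, even when its header looks right.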

In conclusion, the research identified that analysis of the file header/trailer gave the most accurate file identification, BUT, crucially, the other approaches provide valuable evidence that all is not what it seems. For example, with the dodgy test file, while file header/trailer analysis is good at spotting it as a .JPEG file, the other techniques flag that it is not a “normal” JPEG and that there is certainly reason to quarantine the file.
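The pattern of "identify with one technique, corroborate with others" can be sketched as follows. Note that the secondary checks here are deliberately simple stand-ins (a trailer check and a scan for a foreign ZIP signature) rather than the byte frequency methods from the study, and the function name and verdict labels are assumptions for this illustration. Both secondary checks can raise false alarms on legitimate files, which is why the sketch quarantines rather than rejects.

```python
JPEG_HEADER = b"\xff\xd8\xff"   # JPEG start-of-image plus first marker byte
JPEG_TRAILER = b"\xff\xd9"      # JPEG end-of-image marker
ZIP_END_RECORD = b"PK\x05\x06"  # ZIP end-of-central-directory signature


def assess_claimed_jpeg(data):
    """Return (verdict, findings) for content that claims to be a JPEG."""
    if not data.startswith(JPEG_HEADER):
        return "reject", ["no JPEG header"]

    findings = []
    # Secondary evidence: the header says JPEG, but does anything look odd?
    if not data.endswith(JPEG_TRAILER):
        findings.append("data does not end with the JPEG end-of-image marker")
    if ZIP_END_RECORD in data:
        findings.append("ZIP end-of-central-directory signature present")

    return ("quarantine" if findings else "accept"), findings


# Illustrative stand-in bytes, not decodable images.
clean = JPEG_HEADER + b"\xe0" + b"\x00" * 32 + JPEG_TRAILER
dodgy = clean + ZIP_END_RECORD + b"\x00" * 18

print(assess_claimed_jpeg(clean))
print(assess_claimed_jpeg(dodgy))
```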

These are certainly interesting findings, and we’ll be looking at them in detail to further enhance the Masquerade capabilities in our products.

Can you share any further insight into how best to determine the type of a file?