siegfried 1.7.9 released

Version 1.7.9 of siegfried is now available. Get it here. According to the develop benchmarks, this release is slightly more accurate than v1.7.8, with only a marginal impact on performance.

The highlights of this release are a new system for saving configurations for the sf tool, changes to the matching algorithm to improve accuracy, and simplifications to the basis field.

Save and load frequently used configurations

The new -setconf flag allows you to save frequently used configurations for the sf tool. I implemented this to make it possible to set a default -multi value (and not require users to type e.g. sf -multi 32 every time they run siegfried), but you can use -setconf to save any frequently used flags. To save your preferred flags as defaults just type your preferred siegfried command, ommitting the file/directory argument and including -setconf.

For example, the following command records preferences for logging, output, hashing and -multi:

sf -setconf -csv -hash md5 -multi 32 -log time,error

You can then just type sf DIR to run siegfried with those preferred settings. Configurations are stackable with additional flags used while running siegfried. For example, if you type sf -json DIR after setting the above configuration, you’ll get JSON instead of CSV output for that session, but all those other preferences will be applied.

The -conf NAME flag allows you to save and load named configuration files. These configurations are saved to the path specified by the flag, rather than to the default configuration file (a sf.conf file in your siegfried home directory). Named configurations might be useful if you have a few different ways of invoking siegfried. For example, you might want to save a server configuration:

sf -setconf -hash sha1 -z -serve localhost:5138 -sig deluxe.sig -conf server.conf

You can then load that configuration by just typing sf -conf server.conf.

Changes to the matching algorithm

If you review the develop benchmarks, you’ll see that there are some small differences in the results returned for v1.7.9 as compared with v1.7.8. For example, in the Govdocs corpus, a number of PDF files had been identified as fmt/134 (MPEG) previously, but are now correctly identified as various forms of PDF. This improvement in accuracy follows some changes I’ve implemented to resolve this issue.

The problem boiled down to how siegfried uses PRONOM’s file format priority information. One of siegfried’s optimisations (and a reason it gives fairly good performance without requiring users to set limits on bytes scanned) is that it applies format priorities in real-time. For example, if, during scanning, a match comes in for PDF then siegfried will keep scanning to see if the file is a PDF/A (or other more specific type of PDF) but it won’t wait to see if the file is an MPEG or anything else unrelated to that initial match. Think of all the formats as a big tree: once siegfried starts climbing in a particular direction, it will only find results higher up that branch. But what if that initial match is misleading? Like those Govdocs PDFs where a noisy MPEG signature matched first?

The changes I’ve made to the matching algorithm for v1.7.9 retain siegfried’s real-time application of format priorities but with a tweak that allows siegfried to “jump” between branches in that format tree. The way this works is that, when each of the matchers runs (matchers are different stages in the scanning engine - i.e. the file name matcher, container matcher, byte matcher, text matcher etc.), “hints” are supplied based on information gleaned from previous matchers. These “hints” are then weighed alongside format priorities when siegfried decides what to do with format hits. For example, in the case of the Govdocs PDFs, the byte matcher receives a hint from the file name matcher that the file might be in the PDF family (because of the .pdf extension) and that hint causes the matcher to keep an open mind to the possibility that the file might well be a PDF even after that positive MPEG match has been found.

There is a small cost in speed for this change to the matching algorithm (because there is now this new factor that will cause siegfried to delay in returning a positive match early) but my benchmarks show that the slowdown is only very modest.

Please note: these format prioritisation rules only apply to siegfried in its default mode. The roy tool gives you fine-grained control over how format priorities are used during matching (i.e. you can elect to scan more slowly and get more exhaustive results returned). Try the roy build -multi positive, roy build -multi comprehensive and roy build -multi exhaustive commands described here to see how you can fine tune your results.

Simpler basis field

There is a small change in the information returned in the basis field for v1.7.9.

When reporting byte matches, siegfried returns the location (offset from the beginning of the file) and length (in bytes) of matches as pairs e.g. [10 150], which means a match at offset 10 for 150 bytes. For signatures with multiple segments (e.g. a beginning of file segment and an end of file segment), previous versions of siegfried reported a basis which was a list, of lists, of offset/length pairs. For example, you might get a basis like [[[10 150]][[25000 20]]]. The reason siegfried returned lists of lists rather than just simple lists of offset/length pairs was to account for the fact that sometimes particular segments of a signature would match at multiple points in the file. E.g. [[[10 150][30 200]][[25000 20]]] would indicate that that first segment had matched twice at different offsets and with different lengths.

The problem with this approach was that, for very noisy signatures (which generate a lot of segment hits), you could sometimes get very verbose basis fields in your results. In one reported case there was 3MB of data in one of these fields! For this reason, the basis field has been simplified in v1.7.9 and now just reports the first valid set of matching segments i.e. a list of offset/length pairs like [[10 150][25000 20]]. This means fewer square brackets and no more exploding basis fields!

Other changes

There are some other small bug fixes and tweaks in this release, as well as updates for signature files. Here’s the full changelog:

Changelog v1.7.9 (2018-08-31)