siegfried 1.7.9 released

Thursday, 30 August, 2018

Version 1.7.9 of siegfried is now available. Get it here. According to the develop benchmarks, this release is slightly more accurate than v1.7.8, with only a marginal impact on performance.

The highlights of this release are a new system for saving configurations for the sf tool, changes to the matching algorithm to improve accuracy, and simplifications to the basis field.

Save and load frequently used configurations

The new -setconf flag allows you to save frequently used configurations for the sf tool. I implemented this to make it possible to set a default -multi value (and not require users to type e.g. sf -multi 32 every time they run siegfried), but you can use -setconf to save any frequently used flags. To save your preferred flags as defaults just type your preferred siegfried command, ommitting the file/directory argument and including -setconf.

For example, the following command records preferences for logging, output, hashing and -multi:

sf -setconf -csv -hash md5 -multi 32 -log time,error

You can then just type sf DIR to run siegfried with those preferred settings. Configurations are stackable with additional flags used while running siegfried. For example, if you type sf -json DIR after setting the above configuration, you’ll get JSON instead of CSV output for that session, but all those other preferences will be applied.

The -conf NAME flag allows you to save and load named configuration files. These configurations are saved to the path specified by the flag, rather than to the default configuration file (a sf.conf file in your siegfried home directory). Named configurations might be useful if you have a few different ways of invoking siegfried. For example, you might want to save a server configuration:

sf -setconf -hash sha1 -z -serve localhost:5138 -sig deluxe.sig -conf server.conf

You can then load that configuration by just typing sf -conf server.conf.

Changes to the matching algorithm

If you review the develop benchmarks, you’ll see that there are some small differences in the results returned for v1.7.9 as compared with v1.7.8. For example, in the Govdocs corpus, a number of PDF files had been identified as fmt/134 (MPEG) previously, but are now correctly identified as various forms of PDF. This improvement in accuracy follows some changes I’ve implemented to resolve this issue.

The problem boiled down to how siegfried uses PRONOM’s file format priority information. One of siegfried’s optimisations (and a reason it gives fairly good performance without requiring users to set limits on bytes scanned) is that it applies format priorities in real-time. For example, if, during scanning, a match comes in for PDF then siegfried will keep scanning to see if the file is a PDF/A (or other more specific type of PDF) but it won’t wait to see if the file is an MPEG or anything else unrelated to that initial match. Think of all the formats as a big tree: once siegfried starts climbing in a particular direction, it will only find results higher up that branch. But what if that initial match is misleading? Like those Govdocs PDFs where a noisy MPEG signature matched first?

The changes I’ve made to the matching algorithm for v1.7.9 retain siegfried’s real-time application of format priorities but with a tweak that allows siegfried to “jump” between branches in that format tree. The way this works is that, when each of the matchers runs (matchers are different stages in the scanning engine - i.e. the file name matcher, container matcher, byte matcher, text matcher etc.), “hints” are supplied based on information gleaned from previous matchers. These “hints” are then weighed alongside format priorities when siegfried decides what to do with format hits. For example, in the case of the Govdocs PDFs, the byte matcher receives a hint from the file name matcher that the file might be in the PDF family (because of the .pdf extension) and that hint causes the matcher to keep an open mind to the possibility that the file might well be a PDF even after that positive MPEG match has been found.

There is a small cost in speed for this change to the matching algorithm (because there is now this new factor that will cause siegfried to delay in returning a positive match early) but my benchmarks show that the slowdown is only very modest.

Please note: these format prioritisation rules only apply to siegfried in its default mode. The roy tool gives you fine-grained control over how format priorities are used during matching (i.e. you can elect to scan more slowly and get more exhaustive results returned). Try the roy build -multi positive, roy build -multi comprehensive and roy build -multi exhaustive commands described here to see how you can fine tune your results.

Simpler basis field

There is a small change in the information returned in the basis field for v1.7.9.

When reporting byte matches, siegfried returns the location (offset from the beginning of the file) and length (in bytes) of matches as pairs e.g. [10 150], which means a match at offset 10 for 150 bytes. For signatures with multiple segments (e.g. a beginning of file segment and an end of file segment), previous versions of siegfried reported a basis which was a list, of lists, of offset/length pairs. For example, you might get a basis like [[[10 150]][[25000 20]]]. The reason siegfried returned lists of lists rather than just simple lists of offset/length pairs was to account for the fact that sometimes particular segments of a signature would match at multiple points in the file. E.g. [[[10 150][30 200]][[25000 20]]] would indicate that that first segment had matched twice at different offsets and with different lengths.

The problem with this approach was that, for very noisy signatures (which generate a lot of segment hits), you could sometimes get very verbose basis fields in your results. In one reported case there was 3MB of data in one of these fields! For this reason, the basis field has been simplified in v1.7.9 and now just reports the first valid set of matching segments i.e. a list of offset/length pairs like [[10 150][25000 20]]. This means fewer square brackets and no more exploding basis fields!

Other changes

There are some other small bug fixes and tweaks in this release, as well as updates for signature files. Here’s the full changelog:

Changelog v1.7.9 (2018-08-31)

Added:

save defaults in a configuration file: use the -setconf flag to record any other flags used into a config file. These defaults will be loaded each time you run sf. E.g. sf -multi 16 -setconf then sf DIR (loads the new multi default)
use -conf filename to save or load from a named config file. E.g. sf -multi 16 -serve :5138 -conf srv.conf -setconf and then sf -conf srv.conf
added -yaml flag so, if you set json/csv in default config :(, you can override with YAML instead. Choose the YAML!

Changed:

the roy compare -join options that join on filepath now work better when comparing results with mixed windows and unix paths
exported decompress package to give more functionality for users of the golang API; requested by Byron Ruth
update LOC signatures to 2018-06-14
update freedesktop.org signatures to v1.10
update tika-mimetype signatures to v1.18

Fixed:

misidentifications of some files e.g. ODF presentation due to sf quitting early on strong matches. Have adjusted this algorithm to make sf wait longer if there is evidence (e.g. from filename) that the file might be something else. Reported by Jean-Séverin Lair
read and other file errors caused sf to hang; reports by Greg Lepore and Andy Foster; fix contributed by Ross Spencer
bug reading streams where EOF returned for reads exactly adjacent the end of file
bug in mscfb library (race condition for concurrent access to a global variable)
some matches result in extremely verbose basis fields; reported by Nick Krabbenhoeft. Partly fixed: basis field now reports a single basis for a match but work remains to speed up matching for these cases.