Version 1.8.0 of siegfried is now available. Get it here.
This release includes changes in the byte matcher to improve performance, especially when scanning MP3s (fmt/134).
sf -utc FILE | DIR. Requested by Dragan Espenschied
Version 1.7.13 of siegfried is now available. Get it here.
This minor release fixes a bug in the name matcher that caused filenames containing “?” to be treated as URLs. It also adds the ability to scan directories using the sf -f command.
Updates to the LOC FDD and tika-mimetypes signature files.
-f flag now scans directories, as well as files. Requested by Harry Moss
Version 1.7.12 of siegfried is now available. Get it here.
This minor release fixes a bug that caused .docx files with .doc extensions to panic and a bug with mime-info signatures.
Updates to the PRONOM (v95), LOC FDD and tika-mimetypes signature files.
Version 1.7.11 of siegfried is now available. Get it here.
This minor release fixes the debian package and allows the container matcher to identify directory names (for SIARD matching). Updates to the LOC FDD and tika-mimetypes signature files.
Version 1.7.10 of siegfried is now available. Get it here.
This minor release fixes a regression in the LOC identifier introduced in 1.7.9 and updates to PRONOM v94.
The highlights of this release are a new system for saving configurations for the sf tool, changes to the matching algorithm to improve accuracy, and simplifications to the basis field.
The -setconf flag allows you to save frequently used configurations for the sf tool. I implemented this to make it possible to set a default -multi value (and not require users to type e.g. sf -multi 32 every time they run siegfried), but you can use -setconf to save any frequently used flags. To save your preferred flags as defaults, just type your preferred siegfried command, omitting the file/directory argument and including -setconf.
For example, the following command records preferences for logging, output, hashing and -multi:
sf -setconf -csv -hash md5 -multi 32 -log time,error
You can then just type sf DIR to run siegfried with those preferred settings. Configurations are stackable with additional flags used while running siegfried. For example, if you type sf -json DIR after setting the above configuration, you'll get JSON instead of CSV output for that session, but all those other preferences will be applied.
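One way to picture this stacking is as a merge of the saved defaults with whatever flags are typed for the session. The Python sketch below is only an illustration and is not how siegfried is implemented: the function name effective_flags, the dictionary representation of flags, and the idea that the output formats form a single mutually exclusive group are all assumptions made for the example.

```python
# Hypothetical model of stacking saved defaults with session flags.
# Not siegfried's code: names and data layout are assumptions for illustration.

OUTPUT_FORMATS = {"csv", "json", "yaml"}  # assumed to be mutually exclusive


def effective_flags(saved, session):
    """Merge saved defaults with session flags; session flags win."""
    merged = dict(saved)
    # If the session picks an output format, drop any saved output format.
    if OUTPUT_FORMATS.intersection(session):
        for fmt in OUTPUT_FORMATS:
            merged.pop(fmt, None)
    merged.update(session)
    return merged


# Defaults recorded by: sf -setconf -csv -hash md5 -multi 32 -log time,error
saved = {"csv": True, "hash": "md5", "multi": 32, "log": "time,error"}

# Session started with: sf -json DIR
result = effective_flags(saved, {"json": True})
print(result)
```

In this model the session's JSON choice replaces the saved CSV choice, while the hashing, logging and -multi defaults carry through unchanged.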
The -conf NAME flag allows you to save and load named configuration files. These configurations are saved to the path specified by the flag, rather than to the default configuration file (an sf.conf file in your siegfried home directory). Named configurations might be useful if you have a few different ways of invoking siegfried. For example, you might want to save a server configuration:
sf -setconf -hash sha1 -z -serve localhost:5138 -sig deluxe.sig -conf server.conf
You can then load that configuration by just typing sf -conf server.conf.
If you review the develop benchmarks, you'll see that there are some small differences in the results returned for v1.7.9 as compared with v1.7.8. For example, in the Govdocs corpus, a number of PDF files had been identified as fmt/134 (MPEG) previously, but are now correctly identified as various forms of PDF. This improvement in accuracy follows some changes I've implemented to resolve this issue.
The problem boiled down to how siegfried uses PRONOM's file format priority information. One of siegfried's optimisations (and a reason it gives fairly good performance without requiring users to set limits on bytes scanned) is that it applies format priorities in real-time. For example, if, during scanning, a match comes in for PDF then siegfried will keep scanning to see if the file is a PDF/A (or other more specific type of PDF) but it won't wait to see if the file is an MPEG or anything else unrelated to that initial match. Think of all the formats as a big tree: once siegfried starts climbing in a particular direction, it will only find results higher up that branch. But what if that initial match is misleading? Like those Govdocs PDFs where a noisy MPEG signature matched first?
The changes I've made to the matching algorithm for v1.7.9 retain siegfried's real-time application of format priorities but with a tweak that allows siegfried to “jump” between branches in that format tree. The way this works is that, when each of the matchers runs (matchers are different stages in the scanning engine - i.e. the file name matcher, container matcher, byte matcher, text matcher etc.), “hints” are supplied based on information gleaned from previous matchers. These “hints” are then weighed alongside format priorities when siegfried decides what to do with format hits. For example, in the case of the Govdocs PDFs, the byte matcher receives a hint from the file name matcher that the file might be in the PDF family (because of the .pdf extension) and that hint causes the matcher to keep an open mind to the possibility that the file might well be a PDF even after that positive MPEG match has been found.
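To make the “jump between branches” idea concrete, here is a toy model in Python. It is a sketch under stated assumptions, not siegfried's matching algorithm: the PRIORITY_OVER table, the format labels and the still_possible function are invented for illustration, and real priorities come from PRONOM rather than a hand-written dictionary.

```python
# Toy model of real-time priorities with hints. Not siegfried's algorithm:
# the priority table, labels and function name are illustrative assumptions.

# Maps a format to the more specific formats that take priority over it.
PRIORITY_OVER = {
    "PDF": ["PDF/A"],
    "PDF/A": [],
    "MPEG": [],
}


def still_possible(match, hints=()):
    """Return the formats still worth waiting for once `match` has been found."""
    pending = list(PRIORITY_OVER[match])  # keep climbing the matched branch
    pending.extend(hints)                 # hinted branches stay open as well
    seen = {match}
    result = []
    while pending:
        fmt = pending.pop(0)
        if fmt in seen:
            continue
        seen.add(fmt)
        result.append(fmt)
        pending.extend(PRIORITY_OVER[fmt])
    return result


# Without hints, an early MPEG match leaves nothing else to wait for.
print(still_possible("MPEG"))
# With a "PDF" hint (e.g. from a .pdf extension), the PDF branch stays open.
print(still_possible("MPEG", hints=["PDF"]))
```

In this model a hint simply widens the set of formats the scanner keeps waiting for, which is also why a hint can delay an early return.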
There is a small cost in speed for this change to the matching algorithm (because there is now this new factor that will cause siegfried to delay in returning a positive match early) but my benchmarks show that the slowdown is only very modest.
Please note: these format prioritisation rules only apply to siegfried in its default mode. The roy tool gives you fine-grained control over how format priorities are used during matching (i.e. you can elect to scan more slowly and get more exhaustive results returned). Try the roy build -multi positive, roy build -multi comprehensive and roy build -multi exhaustive commands described here to see how you can fine-tune your results.
There is a small change in the information returned in the basis field for v1.7.9.
When reporting byte matches, siegfried returns the location (offset from the beginning of the file) and length (in bytes) of matches as pairs e.g. [10 150], which means a match at offset 10 for 150 bytes. For signatures with multiple segments (e.g. a beginning of file segment and an end of file segment), previous versions of siegfried reported a basis which was a list, of lists, of offset/length pairs. For example, you might get a basis like [[[10 150]][[25000 20]]]. The reason siegfried returned lists of lists rather than just simple lists of offset/length pairs was to account for the fact that sometimes particular segments of a signature would match at multiple points in the file. E.g. [[[10 150][30 200]][[25000 20]]] would indicate that that first segment had matched twice at different offsets and with different lengths.
The problem with this approach was that, for very noisy signatures (which generate a lot of segment hits), you could sometimes get very verbose basis fields in your results. In one reported case there was 3MB of data in one of these fields! For this reason, the basis field has been simplified in v1.7.9 and now just reports the first valid set of matching segments i.e. a list of offset/length pairs like [[10 150][25000 20]]. This means fewer square brackets and no more exploding basis fields!
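The difference between the old and new basis shapes can be shown with a few lines of Python. This is a sketch only: it assumes the basis has already been parsed into nested lists, the function names are hypothetical, and it stands in for “first valid set of matching segments” by taking the first hit recorded for each segment, which is a simplification of whatever validity check siegfried actually applies.

```python
# Illustrative only: models the old nested basis and the simplified form.
# Taking the first hit per segment is an assumption, not siegfried's logic.


def simplify_basis(old_basis):
    """Reduce a list of per-segment hit lists to one offset/length pair each."""
    return [hits[0] for hits in old_basis]


def render(pairs):
    """Format offset/length pairs in the bracketed style used in the basis field."""
    return "[" + "".join(f"[{offset} {length}]" for offset, length in pairs) + "]"


# Old style: the first segment matched twice, the second segment once.
old = [[[10, 150], [30, 200]], [[25000, 20]]]

new = simplify_basis(old)
print(render(new))  # [[10 150][25000 20]]
```

However many hits a noisy segment generates, the simplified form stays at one pair per segment, so the field no longer grows with the number of hits.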
There are some other small bug fixes and tweaks in this release, as well as updates for signature files. Here's the full changelog:
sf -multi 16 -setconf then sf DIR (loads the new multi default)
-conf filename to save or load from a named config file. E.g. sf -multi 16 -serve :5138 -conf srv.conf -setconf and then sf -conf srv.conf
-yaml flag so, if you set json/csv in default config :(, you can override with YAML instead. Choose the YAML!
roy compare -join options that join on filepath now work better when comparing results with mixed Windows and Unix paths
The next siegfried release will be out shortly. I have been busy making changes to address two thorny issues: verbose basis and missing results. I have some fixes in place but, when doing large scale testing against big sets of files, I noticed some performance and quality regressions. You can see these regressions on the new develop benchmarks page. There's also a new benchmarks page to measure siegfried against comparable file format identification tools (at this stage just DROID). This post is the story of how these two pages came into being.
Up until this point I had always done large scale testing manually and it was a fiddly and annoying process: I would have to locate test corpora on whichever machine I'd last run a benchmark on and copy them to my current laptop; run the tests; do the comparison; interpret the results; look up how to do golang profiling again because I had forgotten; run the profiler and check those results; etc. After going through all these steps, if I made changes to address issues, I'd have to repeat it all again in order to verify I'd fixed the issues. Obviously none of these results had much shelf life, as they depended on the vagaries of the machine I was running them on and how I had configured everything, and would be invalidated each time there was a new software or PRONOM release. I was also left with the uneasy feeling that I should be doing this kind of large scale testing much more regularly. Wouldn't it be nice to have some kind of continuous benchmarking process, just like the other automated tests and deployment workflows I run using Travis CI and appveyor? And down that rabbit hole I went for the last few months…
Today, whenever I push code changes to siegfried's develop branch, or tag a new release on the master branch, the following happens:
Benefits of this approach are:
DROID has got really fast recently! Kudos to the team at the National Archives for continuing to invest in the Aston Martin of format identification tools :). I'm particularly impressed by the DROID “no limit” (a -1 max bytes to scan setting in your DROID properties) results and wonder if it might make sense for future DROID releases to make that the default setting.
To make the most of SSDs, you really need to use a -multi setting with sf to obtain good speeds. I'm making this easier in the next release of siegfried by introducing a config file where you can store your preferred settings (and not type them in each time you invoke the tool).
If you really care about speed, you can use roy to build signature files with built-in “max bytes” limits. These test runs came out fastest in all categories. But this will impact the quality of your results.
Siegfried is a sprinter and wins on small corpora. DROID runs marathons and wins (in the no limit category) for the biggest corpus. This is possibly because of JVM start-up costs?
Speed isn't the only thing that matters. You also need to assess the quality of results and choose a tool that has the right affordances (e.g. do your staff prefer a GUI? What kind of reporting will you need? etc.). But speed is important, particularly for non-ingest workflows (e.g. consider re-scanning your repository each time there is a PRONOM update).
The tools differ in their outputs. This is because of differences in their matching engines (e.g. siegfried includes a text identifier), differences in their default settings (particularly that max byte setting), and differences in the way they choose to report results (e.g. if more than one result is returned just based on extension matches, then siegfried will return UNKNOWN with a descriptive warning indicating those possibilities; DROID and fido, on the other hand, will report multiple results).
Where's fido? I had fido in early benchmarks but removed it because the files within the test corpora I've used cause fido to come to a grinding halt for some reason. I need to inspect the error messages and follow up. I hope to get fido back onto the scoreboard shortly!
I need more corpora. The corpora I've used reflect some use cases (e.g. scanning typical office type documents) but don't represent others (e.g. scanning large AV collections). Big audio and video files have caused problems for siegfried in the past and it would be great to include them in regular testing.
Version 1.7.8 of siegfried is now available. Get it here.
This minor release updates the PRONOM signatures to v93 and the LOC signatures to 2017-09-28.
As the only changes in this release are to signature files, you can just use sf -update if you've installed siegfried manually. This minor release is just for the convenience of users who have installed sf with package managers (i.e. debian or homebrew).
Version 1.7.7 of siegfried is now available. Happy #IDPD17!
Get it here.
This minor release fixes bugs in the roy inspect command and in sf's handling of large container files.
A new sets file is included in this release, ‘pronom-extensions.json’, which creates sets for all extensions defined in PRONOM. You can use these new sets when building signatures e.g. roy build -limit @.tiff or when logging formats e.g. sf -log @.doc DIR.
The other addition in this release is the inclusion of version metadata for MIME-info signature files (e.g. freedesktop.org or tika MIME-types). You can define version metadata for MIME-info signature files by editing the MIME-info.json file in your /data directory.
See the CHANGELOG for full details on this release.
Version 1.7.6 of siegfried is now available. Get it here.
This is a minor release that incorporates the latest PRONOM update (v92), introduces a “continue on error” flag (sf -coe) to force sf to keep going when it hits fatal file errors in directory walks, and restricts file scanning to regular files (in previous versions symlinks, devices, sockets etc. were scanned which caused fatal errors for some users).
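The two behaviours described here (scanning only regular files, and optionally carrying on after errors) can be sketched in Python. This is an illustration under assumptions rather than sf's implementation: the regular_files function and its continue_on_error parameter are hypothetical names that only mimic the idea behind sf -coe.

```python
import os
import stat


def regular_files(root, continue_on_error=False):
    """Yield regular files under root, skipping symlinks, devices, sockets etc."""

    def on_error(err):
        # os.walk reports unreadable directories here; re-raise unless told
        # to continue (loosely modelled on the idea behind sf -coe).
        if not continue_on_error:
            raise err

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are not S_ISREG.
                mode = os.lstat(path).st_mode
            except OSError:
                if continue_on_error:
                    continue
                raise
            if stat.S_ISREG(mode):
                yield path
```

Calling list(regular_files("some/dir", continue_on_error=True)) would list only ordinary files and pass over anything that cannot be read, where "some/dir" is a placeholder path.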
Thanks to Henk Vanstappen for the bug report that prompted this release.