The next siegfried release will be out shortly. I have been busy making changes to address two thorny issues: verbose basis and missing results. I have some fixes in place but, when doing large scale testing against big sets of files, I noticed some performance and quality regressions. You can see these regressions on the new develop benchmarks page. There’s also a new benchmarks page to measure siegfried against comparable file format identification tools (at this stage just DROID). This post is the story of how these two pages came into being.
Up until this point I had always done large scale testing manually and it was a fiddly and annoying process: I would have to locate test corpora on whichever machine I’d last run a benchmark on and copy them to my current laptop; run the tests; do the comparison; interpret the results; look up how to do golang profiling again because I had forgotten; run the profiler and check those results; etc. After going through all these steps, if I made changes to address issues, I’d have to repeat it all again in order to verify I’d fixed the issues. Obviously none of these results had much shelf life, as they depended on the vagaries of the machine I was running them on and how I had configured everything, and would be invalidated each time there was a new software or PRONOM release. I was also left with the uneasy feeling that I should be doing this kind of large scale testing much more regularly. Wouldn’t it be nice to have some kind of continuous benchmarking process, just like the other automated tests and deployment workflows I run using Travis CI and AppVeyor? And down that rabbit hole I went for the last few months…
Today, whenever I push code changes to siegfried’s develop branch, or tag a new release on the master branch, the following happens:
Benefits of this approach are:
DROID has got really fast recently! Kudos to the team at the National Archives for continuing to invest in the Aston Martin of format identification tools :). I’m particularly impressed by the DROID “no limit” (a -1 max bytes to scan setting in your DROID properties) results and wonder if it might make sense for future DROID releases to make that the default setting.
In order to make the most of SSDs, you really need to use a -multi setting with sf to obtain good speeds. I’m making this easier in the next release of siegfried by introducing a config file where you can store your preferred settings (and not type them in each time you invoke the tool).
If you really care about speed, you can use roy to build signature files with built-in “max bytes” limits. These test runs came out fastest in all categories. But this will impact the quality of your results.
Siegfried is a sprinter and wins on small corpora. DROID runs marathons and wins (in the no limit category) for the biggest corpus. This is possibly because of JVM start-up costs?
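The start-up cost hypothesis above can be made concrete with a little arithmetic. This is a minimal sketch, assuming each tool's run time is roughly a fixed start-up cost plus a constant per-file cost; the function names and all numbers are hypothetical placeholders, not measurements from the benchmarks.

```python
from typing import Optional


def total_seconds(startup_s: float, files_per_s: float, n_files: int) -> float:
    """Fixed start-up cost plus the time spent identifying n_files."""
    return startup_s + n_files / files_per_s


def crossover_files(startup_a: float, rate_a: float,
                    startup_b: float, rate_b: float) -> Optional[float]:
    """Corpus size at which tool A and tool B would take the same time.

    Solves startup_a + n / rate_a == startup_b + n / rate_b for n.
    Returns None when both tools have the same per-file cost, because
    then the lines never cross (or are identical).
    """
    per_file_a = 1.0 / rate_a
    per_file_b = 1.0 / rate_b
    if per_file_a == per_file_b:
        return None
    return (startup_b - startup_a) / (per_file_a - per_file_b)


if __name__ == "__main__":
    # Placeholder values only: tool A starts quickly but is slower per file,
    # tool B pays a large start-up cost but is faster per file.
    n = crossover_files(startup_a=0.1, rate_a=1000.0,
                        startup_b=5.0, rate_b=2000.0)
    print(f"tools break even at roughly {n:.0f} files")
```

A negative result means the lines cross at no real corpus size, i.e. one tool is faster everywhere under this simple model. Real runs will also depend on disk speed, file sizes and settings, so treat this as a way of reasoning about the sprinter and marathon pattern, not as a prediction.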
Speed isn’t the only thing that matters. You also need to assess the quality of results and choose a tool that has the right affordances (e.g. do your staff prefer a GUI? What kind of reporting will you need? etc.). But speed is important, particularly for non-ingest workflows (e.g. consider re-scanning your repository each time there is a PRONOM update).
The tools differ in their outputs. This is because of differences in their matching engines (e.g. siegfried includes a text identifier), differences in their default settings (particularly that max byte setting), and differences in the way they choose to report results (e.g. if more than one result is returned just based on extension matches, then siegfried will return UNKNOWN with a descriptive warning indicating those possibilities; DROID and fido, on the other hand, will report multiple results).
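To see where two tools disagree, it helps to reduce each tool's output to a mapping from filename to the set of format IDs it reported and then bucket the files. The sketch below assumes you have already parsed each tool's results into that shape; the function name, bucket names and sample data are made up for illustration.

```python
from typing import Dict, List, Set


def compare_results(a: Dict[str, Set[str]],
                    b: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Bucket filenames by how two tools' identifications relate.

    An empty set means the tool reported no identification for that file.
    Files missing from one mapping are treated as unidentified by that tool.
    """
    report: Dict[str, List[str]] = {
        "agree": [], "differ": [], "only_a": [], "only_b": [], "neither": [],
    }
    for name in sorted(set(a) | set(b)):
        ids_a = a.get(name, set())
        ids_b = b.get(name, set())
        if not ids_a and not ids_b:
            key = "neither"
        elif not ids_b:
            key = "only_a"
        elif not ids_a:
            key = "only_b"
        elif ids_a == ids_b:
            key = "agree"
        else:
            key = "differ"
        report[key].append(name)
    return report


# Hypothetical sample data, not taken from the benchmark corpora.
TOOL_A = {"report.doc": {"fmt/40"}, "notes.txt": {"x-fmt/111"}, "data.bin": set()}
TOOL_B = {"report.doc": {"fmt/40"}, "notes.txt": set(), "data.bin": {"fmt/1", "fmt/2"}}

if __name__ == "__main__":
    for bucket, names in compare_results(TOOL_A, TOOL_B).items():
        print(bucket, names)
```

In the sample, data.bin mirrors the reporting difference described above: one tool declines to pick between several candidates while the other lists them all, so the file lands in the "only_b" bucket even though neither tool is necessarily wrong.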
Where’s fido? I had fido in early benchmarks but removed it because the files within the test corpora I’ve used cause fido to come to a grinding halt for some reason. I need to inspect the error messages and follow up. I hope to get fido back onto the scoreboard shortly!
I need more corpora. The corpora I’ve used reflect some use cases (e.g. scanning typical office type documents) but don’t represent others (e.g. scanning large AV collections). Big audio and video files have caused problems for siegfried in the past and it would be great to include them in regular testing.
Version 1.7.8 of siegfried is now available. Get it here.
This minor release updates the PRONOM signatures to v93 and the LOC signatures to 2017-09-28.
As the only changes in this release are to signature files, you can just use sf -update if you’ve installed siegfried manually. This minor release is just for the convenience of users who have installed sf with package managers (i.e. Debian or Homebrew).
Version 1.7.7 of siegfried is now available. Happy #IDPD17!
Get it here.
This minor release fixes bugs in the roy inspect command and in sf’s handling of large container files.
A new sets file is included in this release, ‘pronom-extensions.json’, which creates sets for all extensions defined in PRONOM. You can use these new sets when building signatures e.g. roy build -limit @.tiff or when logging formats e.g. sf -log @.doc DIR.
The other addition in this release is the inclusion of version metadata for MIME-info signature files (e.g. freedesktop.org or tika MIME-types). You can define version metadata for MIME-info signature files by editing the MIME-info.json file in your /data directory.
See the CHANGELOG for full details on this release.
Version 1.7.6 of siegfried is now available. Get it here.
This is a minor release that incorporates the latest PRONOM update (v92), introduces a “continue on error” flag (sf -coe) to force sf to keep going when it hits fatal file errors in directory walks, and restricts file scanning to regular files (in previous versions symlinks, devices, sockets etc. were scanned which caused fatal errors for some users).
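The two behaviours described here (only scanning regular files, and optionally carrying on after a file error) can be illustrated with a short directory walk. This is a sketch of the general idea in Python, not siegfried's own Go implementation; the function name and its continue_on_error parameter are hypothetical.

```python
import os
import stat
import sys


def regular_files(root, continue_on_error=True):
    """Yield paths of regular files under root.

    Symlinks, devices, sockets and other special files are skipped.
    With continue_on_error=False the first OS error aborts the walk.
    """
    def on_error(err):
        if not continue_on_error:
            raise err
        print(f"skipping {err.filename}: {err.strerror}", file=sys.stderr)

    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat inspects the entry itself rather than a symlink's target.
                mode = os.lstat(path).st_mode
            except OSError as err:
                on_error(err)
                continue
            if stat.S_ISREG(mode):
                yield path


if __name__ == "__main__":
    for path in regular_files("."):
        print(path)
```

Checking the file type before opening anything is what avoids the fatal errors: a socket or device node is never handed to the scanner in the first place.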
Thanks to Henk Vanstappen for the bug report that prompted this release.
In my recent updates to this site I’ve added a new “Chart your results” tool on the siegfried page (in the right hand panel under “Try Siegfried”). This tool produces single page reports like this: /siegfried/results/ea1zaj.
Before covering this tool in detail let’s recap some of the existing ways you can already analyse your results.
I appreciate that not everyone is a command-line junkie, but the way I inspect results is just to use sf’s -log flag. If you do sf -log chart (or -log c) you can make simple format charts:
(In these examples I add “o” to my log options to direct logging output to STDOUT… otherwise you’ll see it in STDERR).
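If you would rather script this kind of tally yourself, the same idea can be sketched in Python against a JSON results file. The field names used here ("files", "matches", "id") are assumptions about the shape of sf's JSON output and should be checked against your own results; the sample data and function names are made up.

```python
import json
from collections import Counter

# Made-up sample in the assumed shape of a JSON results file.
SAMPLE = json.loads("""
{
  "files": [
    {"filename": "a.doc", "matches": [{"ns": "pronom", "id": "fmt/40"}]},
    {"filename": "b.doc", "matches": [{"ns": "pronom", "id": "fmt/40"}]},
    {"filename": "c.bin", "matches": [{"ns": "pronom", "id": "UNKNOWN"}]}
  ]
}
""")


def format_counts(results):
    """Count how many times each format ID appears across all files."""
    counts = Counter()
    for entry in results.get("files", []):
        for match in entry.get("matches", []):
            counts[match.get("id", "UNKNOWN")] += 1
    return counts


def text_chart(counts, width=40):
    """Render counts as a plain-text bar chart, most common format first."""
    if not counts:
        return ""
    biggest = max(counts.values())
    lines = []
    for fmt, n in counts.most_common():
        bar = "#" * max(1, round(width * n / biggest))
        lines.append(f"{fmt:<12} {bar} ({n})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_chart(format_counts(SAMPLE)))
```

To run it on real results, replace SAMPLE with the parsed contents of your own file, e.g. json.load(open("results.json")).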
A chart can be a starting point for deeper analysis e.g. inspecting lists of files of a particular format:
You can also inspect lists of unknowns with
-log u and warnings with
Rather than re-run the format identification job with every step, you can pair these commands with the -replay flag to run them against a pre-generated results file instead. I cover this workflow in detail in the siegfried wiki.
These tools both do a lot more than simple chart generation. E.g. DROID-SF can create a “Rogues Gallery” of all your problematic files. Brunnhilde has a GUI, does virus scanning, and can also run bulk_extractor against your files. I’d definitely encourage you to check both of these tools out!
If your needs are a little bit simpler, and you just want a chart, then my new “Chart your results” tool might be a good fit.
To try this tool, go to the siegfried page and upload a results file in the “Chart my results” form in the right-hand panel.
Let’s run through some of its features:
Probably the distinguishing feature of this tool is that you can easily share your analysis with colleagues, or with the digital preservation community broadly, by “publishing” your results. This gives you a permanent URL (like https://www.itforarchivists.com/siegfried/results/ea1zaj) and stores your results on the site. Prior to publication you can opt to “redact” your filenames if they contain sensitive information. I’ve added a privacy section to this site to address some of the privacy questions raised by this feature in a little more detail.
That’s it, please use it, and if you like it tweet your results!
Version 1.7.5 of siegfried is now available. Get it here.
The headline feature of this release is new functionality for the sf -update command requested by Ross Spencer. You can now use the -update flag to download or update non-PRONOM signatures with a choice of LOC FDD, two flavours of MIMEInfo (Apache Tika’s MIMEInfo and freedesktop.org), and archivematica (latest PRONOM + archivematica extensions) signatures. There are two combo options as well: PRONOM/Tika/LOC and the Ross Spencer “deluxe” (PRONOM/Tika/freedesktop.org/LOC).
PRONOM remains the default, so if you just do sf -update it will work as before.
To go non-PRONOM, include one of “loc”, “tika”, “freedesktop”, “pronom-tika-loc”, “deluxe” or “archivematica” as an argument after the flags e.g. sf -update freedesktop. This command will overwrite ‘default.sig’ (the default signature file that sf loads).
You can preserve your default signature file by providing an alternative -sig target: e.g. sf -sig notdefault.sig -update loc. If you use one of the signature options as a filename (with or without a .sig extension), you can omit the signature argument i.e. sf -update -sig loc.sig is equivalent to sf -sig loc.sig -update loc.
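The rule for choosing which signature set gets downloaded can be written out as a small function. This is only an illustration of the behaviour described above, not siegfried's code; the function name, the "pronom" label for the default and the error handling are assumptions.

```python
import os

# Signature options named in this post; "pronom" is a label used here for the default.
OPTIONS = {"loc", "tika", "freedesktop", "pronom-tika-loc", "deluxe", "archivematica"}
DEFAULT_OPTION = "pronom"


def resolve_update(sig_file="default.sig", option=None):
    """Return (target signature file, signature set to download).

    An explicit option always wins. Otherwise, a target file named after one
    of the options (with or without a .sig extension) selects that option,
    and anything else falls back to the default.
    """
    if option is not None:
        if option not in OPTIONS:
            raise ValueError(f"unknown signature option: {option}")
        return sig_file, option
    stem = os.path.basename(sig_file)
    if stem.endswith(".sig"):
        stem = stem[: -len(".sig")]
    if stem in OPTIONS:
        return sig_file, stem
    return sig_file, DEFAULT_OPTION


if __name__ == "__main__":
    print(resolve_update())                        # plain update
    print(resolve_update("notdefault.sig", "loc"))  # explicit target and option
    print(resolve_update("loc.sig"))                # option inferred from filename
```

Under this reading, naming the target loc.sig and passing loc explicitly resolve to the same pair, which is the equivalence described above.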
sf -update now does SHA-256 hash verification of updates, and communication with the update server is via HTTPS.
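For readers unfamiliar with hash verification, the check amounts to hashing the downloaded bytes and comparing the digest with one obtained from a trusted source. The sketch below shows the general technique in Python; it is not siegfried's implementation, and how the update server publishes the expected digest is not covered in this post, so both function names and the payload are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, expected_hex: str) -> bool:
    """True when the payload's SHA-256 digest matches the expected hex digest.

    compare_digest is used so the comparison takes the same time
    regardless of where the two strings differ.
    """
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


if __name__ == "__main__":
    payload = b"pretend this is a downloaded signature file"
    expected = sha256_hex(payload)  # in practice this comes from a trusted source
    print(verify_download(payload, expected))
    print(verify_download(payload + b"tampered", expected))
```

The digest only protects against corruption or tampering if the expected value itself arrives over a trusted channel, which is where HTTPS comes in.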