Presented by

  • Gerd Heber

    https://www.hdfgroup.org/

    Gerd Heber is the Executive Director of The HDF Group in Champaign, IL. The HDF Group is a non-profit organization with the mission of advancing state-of-the-art open-source data management technologies, ensuring long-term access to the data, and supporting a dedicated and diverse user community. The HDF Group is the developer of HDF5, a high-performance software library, data format, and highly scalable data service adopted across multiple industries and widely used in the scientific and research communities.

Abstract

HDF5 is a foundational file format for scientific and engineering data, but much of its practical meaning lives at the boundary between prose specifications, library implementations, and real binary files on disk. This talk presents SHAPE5 (Specification for HDF5 Analysis, Parsing, and Encoding), an effort to describe HDF5's on-disk structures as executable, machine-readable GNU poke pickles.

GNU poke is a programmable editor for binary data that maps raw bytes directly onto typed structures. SHAPE5 applies this model to HDF5 metadata: superblocks, object headers, object header messages, B-trees, chunk indexes, checksums, and related primitives. The result is not another HDF5 decoder or encoder but a complementary specification artifact, one that can be loaded, inspected, tested, and extended interactively.
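To make the idea of mapping bytes onto a typed structure concrete, here is a minimal sketch in Python rather than Poke. It is not taken from SHAPE5: the names `SuperblockV2` and `map_superblock_v2` are hypothetical, and the field layout is assumed from the public HDF5 file format specification for version 2 and 3 superblocks. The checksum is read but not verified.

```python
import struct
from dataclasses import dataclass

# 8-byte format signature that opens every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


@dataclass
class SuperblockV2:
    """Hypothetical typed view of a version 2/3 superblock (illustration only)."""
    version: int
    size_of_offsets: int
    size_of_lengths: int
    flags: int
    base_address: int
    extension_address: int
    end_of_file_address: int
    root_object_header_address: int
    checksum: int


def map_superblock_v2(buf: bytes, offset: int = 0) -> SuperblockV2:
    """Interpret the bytes at `offset` as a version 2/3 superblock."""
    if len(buf) < offset + 12:
        raise ValueError("buffer too short for the fixed superblock prefix")
    # Fixed prefix: signature, version, size of offsets, size of lengths, flags.
    sig, version, size_off, size_len, flags = struct.unpack_from("<8sBBBB", buf, offset)
    if sig != HDF5_SIGNATURE:
        raise ValueError(f"no HDF5 signature at offset {offset}")
    if version not in (2, 3):
        raise ValueError(f"superblock version {version} is not handled by this sketch")
    # Four addresses follow, each `size_off` bytes wide, then a 4-byte checksum.
    if len(buf) < offset + 12 + 4 * size_off + 4:
        raise ValueError("buffer too short for the address fields and checksum")
    pos = offset + 12
    addresses = []
    for _ in range(4):
        addresses.append(int.from_bytes(buf[pos:pos + size_off], "little"))
        pos += size_off
    (checksum,) = struct.unpack_from("<I", buf, pos)  # read only, not verified
    return SuperblockV2(version, size_off, size_len, flags, *addresses, checksum)
```

The width of the address fields is itself stored in the structure, which is the kind of dependency a declarative description has to express and a fixed C struct cannot.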

The talk will demonstrate how these pickles expose HDF5 internals from the poke REPL, how offsets and versioned structures are handled, and how tools built on top of SHAPE5 (h5explain, h5check, and h5patch) support debugging, validation, data recovery, and format exploration. We will also discuss what becomes clearer when a binary format specification is executable: mistakes in the documentation, ambiguities in prose, undocumented implementation behavior, and opportunities for conformance testing.
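As a rough illustration of what handling offsets and versioned structures involves, the Python sketch below locates a superblock and selects a layout family from its version byte. It does not reflect how SHAPE5 or its tools are implemented. The helper names are hypothetical, and the candidate offsets (0, 512, 1024, 2048, and so on, doubling each time) and the version byte position are assumed from the public HDF5 file format specification.

```python
# 8-byte format signature that opens every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Hypothetical dispatch table: which layout family each superblock version uses.
LAYOUT_BY_VERSION = {
    0: "legacy layout with root group symbol table entry",
    1: "legacy layout with root group symbol table entry",
    2: "compact layout with checksum",
    3: "compact layout with checksum",
}


def candidate_offsets(file_size: int):
    """Yield the offsets at which a superblock may start: 0, 512, 1024, ..."""
    yield 0
    offset = 512
    while offset < file_size:
        yield offset
        offset *= 2


def find_superblock(buf: bytes):
    """Return (offset, version) of the first superblock found, or None."""
    for offset in candidate_offsets(len(buf)):
        end = offset + len(HDF5_SIGNATURE)
        # The version byte immediately follows the signature in all versions.
        if buf[offset:end] == HDF5_SIGNATURE and end < len(buf):
            return offset, buf[end]
    return None


def layout_for(version: int) -> str:
    """Pick the layout family for a superblock version (sketch only)."""
    try:
        return LAYOUT_BY_VERSION[version]
    except KeyError:
        raise ValueError(f"unknown superblock version {version}") from None
```

A file with a 512-byte user block in front of the HDF5 data would be reported at offset 512, and every address decoded afterwards would then need to be interpreted relative to the base address stored in the superblock.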

A machine-readable specification is essential for maintaining coherence across the half-dozen independent HDF5 implementations in active use. It also lowers the barrier to writing new implementations and enables more targeted fuzzing tools to harden existing ones.

Attendees will leave with a practical understanding of how GNU poke can model complex binary formats, why machine-readable specifications matter for long-lived scientific data, and how free software communities can collaborate around format transparency, tooling, and preservation.