24. Practical: Parsing Binary Files - The Current Object Stack - 《Practical Common Lisp》

The Current Object Stack

One last bit of functionality you’ll need in the next chapter is a way to get at the binary object being read or written while reading and writing. More generally, when reading or writing nested composite objects, it’s useful to be able to get at any of the objects currently being read or written. Thanks to dynamic variables and :around methods, you can add this enhancement with about a dozen lines of code. To start, you should define a dynamic variable that will hold a stack of objects currently being read or written.

(defvar *in-progress-objects* nil)

Then you can define :around methods on read-object and write-object that push the object being read or written onto this variable before invoking **CALL-NEXT-METHOD**.

(defmethod read-object :around (object stream)
  (declare (ignore stream))
  (let ((*in-progress-objects* (cons object *in-progress-objects*)))
    (call-next-method)))

(defmethod write-object :around (object stream)
  (declare (ignore stream))
  (let ((*in-progress-objects* (cons object *in-progress-objects*)))
    (call-next-method)))

Note how you rebind *in-progress-objects* to a list with a new item on the front rather than assigning it a new value. This way, at the end of the **LET**, after **CALL-NEXT-METHOD** returns, the old value of *in-progress-objects* will be restored, effectively popping the object off the stack.
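The difference between rebinding and assigning is easy to see in isolation. The following is a minimal sketch rather than part of the binary data library; the names *stack*, show-stack, with-pushed, nested-demo, and unwinding-demo are made up for the illustration.

```lisp
;; Hypothetical demo names; none of these belong to the chapter's library.
(defvar *stack* nil)

(defun show-stack ()
  "Return the current dynamic value of *stack*."
  *stack*)

(defun with-pushed (item)
  "Rebind *stack* with ITEM on the front and look at it from a callee."
  (let ((*stack* (cons item *stack*)))
    (show-stack)))

(defun nested-demo ()
  "Nest two rebindings; the inner one disappears when its LET exits."
  (let ((*stack* (cons :outer *stack*)))
    (list (with-pushed :inner)   ; sees (:INNER :OUTER)
          (show-stack))))        ; back to (:OUTER)

(defun unwinding-demo ()
  "Leave the LET with a non-local exit; the old binding is still restored."
  (catch 'bail
    (let ((*stack* (cons :doomed *stack*)))
      (throw 'bail nil)))
  *stack*)

(with-pushed :a)   ; => (:A)
(nested-demo)      ; => ((:INNER :OUTER) (:OUTER))
(unwinding-demo)   ; => NIL when called at the top level
```

If you instead pushed onto the variable with an assignment and popped it afterward, you would need something like **UNWIND-PROTECT** to get the same cleanup when an error or other non-local exit cuts the read short. Rebinding gives you that for free.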

With those two methods defined, you can provide two convenience functions for getting at specific objects in the in-progress stack. The function current-binary-object will return the head of the stack, the object whose read-object or write-object method was invoked most recently. The other, parent-of-type, takes an argument that should be the name of a binary object class and returns the most recently pushed object of that type, using the **TYPEP** function, which tests whether a given object is an instance of a particular type.

(defun current-binary-object () (first *in-progress-objects*))

(defun parent-of-type (type)
  (find-if #'(lambda (x) (typep x type)) *in-progress-objects*))

These two functions can be used in any code that will be called within the dynamic extent of a read-object or write-object call. You’ll see one example of how current-binary-object can be used in the next chapter.[11]
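To make "within the dynamic extent" concrete, here is a small sketch under stated assumptions. The definitions from this section are repeated, along with a bare **DEFGENERIC** for read-object, so the code runs on its own. The archive and entry classes and their accessors are hypothetical, and the methods don't actually read any bytes from the stream; they exist only to show one object being read while another is still in progress.

```lisp
;; Repeated from the text so the sketch is self-contained.
(defvar *in-progress-objects* nil)

(defgeneric read-object (object stream)
  (:documentation "Fill in the slots of OBJECT from STREAM."))

(defmethod read-object :around (object stream)
  (declare (ignore stream))
  (let ((*in-progress-objects* (cons object *in-progress-objects*)))
    (call-next-method)))

(defun current-binary-object () (first *in-progress-objects*))

(defun parent-of-type (type)
  (find-if #'(lambda (x) (typep x type)) *in-progress-objects*))

;; Hypothetical classes, for illustration only.
(defclass archive ()
  ((version :initarg :version :accessor archive-version)
   (entry :accessor archive-entry)))

(defclass entry ()
  ((seen-version :accessor entry-seen-version)
   (innermost-p :accessor entry-innermost-p)))

(defmethod read-object ((object archive) stream)
  ;; The nested read happens inside this method's dynamic extent, so the
  ;; archive is still on the stack while the entry is being read.
  (setf (archive-entry object) (read-object (make-instance 'entry) stream))
  object)

(defmethod read-object ((object entry) stream)
  (declare (ignore stream))
  ;; Find the enclosing archive without having it passed as an argument.
  (setf (entry-seen-version object) (archive-version (parent-of-type 'archive)))
  ;; The head of the stack is the entry itself.
  (setf (entry-innermost-p object) (eq object (current-binary-object)))
  object)

(let ((archive (read-object (make-instance 'archive :version 3)
                            (make-string-input-stream ""))))
  (list (entry-seen-version (archive-entry archive))   ; 3
        (entry-innermost-p (archive-entry archive))    ; T
        *in-progress-objects*))                        ; NIL once reading is done
```

The entry method never receives the archive as an argument; it finds it on the stack. Once the outermost read-object call returns, the stack is empty again.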

Now you have all the tools you need to tackle an ID3 parsing library, so you’re ready to move on to the next chapter, where you’ll do just that.

[1] In ASCII, the first 32 characters are nonprinting control characters originally used to control the behavior of a Teletype machine, causing it to do such things as sound the bell, back up one character, move to a new line, and move the carriage to the beginning of the line. Of these 32 control characters, only three, the newline, carriage return, and horizontal tab, are typically found in text files.

[2] Some binary file formats are in-memory data structures: on many operating systems it’s possible to map a file into memory, and low-level languages such as C can then treat the region of memory containing the contents of the file just like any other memory; data written to that area of memory is saved to the underlying file when it’s unmapped. However, these formats are platform-dependent since the in-memory representation of even such simple data types as integers depends on the hardware on which the program is running. Thus, any file format that’s intended to be portable must define a canonical representation for all the data types it uses that can be mapped to the actual in-memory data representation on a particular kind of machine or in a particular language.

[3] The term big-endian and its opposite, little-endian, borrowed from Jonathan Swift’s Gulliver’s Travels, refer to the way a multibyte number is represented in an ordered sequence of bytes such as in memory or in a file. For instance, the number 43981, or abcd in hex, represented as a 16-bit quantity, consists of two bytes, ab and cd. It doesn’t matter to a computer in what order these two bytes are stored as long as everybody agrees. Of course, whenever there’s an arbitrary choice to be made between two equally good options, the one thing you can be sure of is that everybody is not going to agree. For more than you ever wanted to know about it, and to see where the terms big-endian and little-endian were first applied in this fashion, read “On Holy Wars and a Plea for Peace” by Danny Cohen, available at http://khavrinen.lcs.mit.edu/wollman/ien-137.txt.

[4] **LDB** and **DPB**, a related function, were named after the DEC PDP-10 assembly instructions that did essentially the same thing. Both functions operate on integers as if they were represented in two’s-complement format, regardless of the internal representation used by a particular Common Lisp implementation.
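As a quick illustration of both functions, using the 16-bit value abcd from the discussion of byte order, you could try the following at the REPL. This is only a sketch; the particular numbers are chosen for the example.

```lisp
;; (byte size position) names a field of SIZE bits starting at bit POSITION.
(ldb (byte 8 8) #xabcd)      ; => 171, that is #xab, the high byte
(ldb (byte 8 0) #xabcd)      ; => 205, that is #xcd, the low byte

;; DPB returns a new integer with the field replaced; it doesn't modify
;; its argument.
(dpb #xab (byte 8 8) #xcd)   ; => 43981, that is #xabcd

;; Negative integers behave as if they had infinitely many leading one bits.
(ldb (byte 8 0) -1)          ; => 255
```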

[5] Common Lisp also provides functions for shifting and masking the bits of integers in a way that may be more familiar to C and Java programmers. For instance, you could write read-u2 yet a third way, using those functions, like this:

(defun read-u2 (in)
  (logior (ash (read-byte in) 8) (read-byte in)))

which would be roughly equivalent to this Java method:

public int readU2 (InputStream in) throws IOException {
  return (in.read() << 8) | (in.read());
}

The names **LOGIOR** and **ASH** are short for LOGical Inclusive OR and Arithmetic SHift. **ASH** shifts an integer a given number of bits to the left when its second argument is positive or to the right if the second argument is negative. **LOGIOR** combines integers by ORing corresponding bits. Another function, **LOGAND**, performs a bitwise AND, which can be used to mask off certain bits. However, for the kinds of bit twiddling you’ll need to do in this chapter and the next, **LDB** and **BYTE** will be both more convenient and more idiomatic Common Lisp style.

[6] Originally, UTF-8 was designed to represent a 31-bit character code and used up to six bytes per code point. However, the maximum Unicode code point is #x10ffff, so a UTF-8 encoding of Unicode requires at most four bytes per code point.

[7] If you need to parse a file format that uses other character codes, or if you need to parse files containing arbitrary Unicode strings using a non-Unicode-Common-Lisp implementation, you can always represent such strings in memory as vectors of integer code points. They won’t be Lisp strings, so you won’t be able to manipulate or compare them with the string functions, but you’ll still be able to do anything with them that you can with arbitrary vectors.

[8] Unfortunately, the language itself doesn’t always provide a good model in this respect: the macro **DEFSTRUCT**, which I don’t discuss since it has largely been superseded by **DEFCLASS**, generates functions whose names are derived from the name of the structure it’s given. **DEFSTRUCT**’s bad example leads many new macro writers astray.

[9] Technically there’s no possibility of type or object conflicting with slot names; at worst they’d be shadowed within the **WITH-SLOTS** form. But it doesn’t hurt anything to simply **GENSYM** all local variable names used within a macro template.

[10] Using **ASSOC** to extract the :reader and :writer elements of spec allows users of define-binary-type to include the elements in either order; if you required the :reader element to always be first, you could then have used (rest (first spec)) to extract the reader and (rest (second spec)) to extract the writer. However, as long as you require the :reader and :writer keywords to improve the readability of define-binary-type forms, you might as well use them to extract the correct data.

[11] The ID3 format doesn’t require the parent-of-type function since it’s a relatively flat structure. This function comes into its own when you need to parse a format made up of many deeply nested structures whose parsing depends on information stored in higher-level structures. For example, in the Java class file format, the top-level class file structure contains a constant pool that maps numeric values used in other substructures within the class file to constant values that are needed while parsing those substructures. If you were writing a class file parser, you could use parent-of-type in the code that reads and writes those substructures to get at the top-level class file object and from there to the constant pool.