10. Numbers, Characters, and Strings - String Comparisons - 《Practical Common Lisp》

String Comparisons

You can compare strings using a set of functions that follow the same naming convention as the character comparison functions except with STRING as the prefix rather than CHAR (see Table 10-3).

Table 10-3. String Comparison Functions

Numeric Analog	Case-Sensitive	Case-Insensitive
`=`	`STRING=`	`STRING-EQUAL`
`/=`	`STRING/=`	`STRING-NOT-EQUAL`
`<`	`STRING<`	`STRING-LESSP`
`>`	`STRING>`	`STRING-GREATERP`
`<=`	`STRING<=`	`STRING-NOT-GREATERP`
`>=`	`STRING>=`	`STRING-NOT-LESSP`
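
As a quick illustration of the two columns in Table 10-3, the following sketch runs a few arbitrary example strings through the case-sensitive and case-insensitive comparators. The comments say only whether each result is true or **NIL**.

```lisp
;; Case-sensitive comparators treat "Lisp" and "lisp" as different strings.
(string= "lisp" "lisp")            ; true
(string= "Lisp" "lisp")            ; NIL
(string/= "Lisp" "lisp")           ; true

;; The case-insensitive variants ignore differences in case.
(string-equal "Lisp" "LISP")       ; true
(string-not-equal "Lisp" "LISP")   ; NIL

;; Ordering comparisons work the same way in both families.
(string< "apple" "banana")         ; true
(string-lessp "Apple" "banana")    ; true
(string-not-lessp "apple" "APPLE") ; true, since the strings are equal ignoring case
```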

However, unlike the character and number comparators, the string comparators can compare only two strings. That’s because they also take keyword arguments that allow you to restrict the comparison to a substring of either or both strings. The arguments (:start1, :end1, :start2, and :end2) specify the starting (inclusive) and ending (exclusive) indices of substrings in the first and second string arguments. Thus, the following:

(string= "foobarbaz" "quuxbarfoo" :start1 3 :end1 6 :start2 4 :end2 7)

compares the substring “bar” in the two arguments and returns true. The :end1 and :end2 arguments can be **NIL** (or the keyword argument omitted altogether) to indicate that the corresponding substring extends to the end of the string.
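
To make the defaults concrete, here is a minimal sketch of how the bounding arguments behave when some of them are omitted. The helper ENDS-WITH-P is a hypothetical name invented for this illustration; it isn't part of Common Lisp.

```lisp
;; Omitting :end1 means the substring runs to the end of "foobar".
(string= "foobar" "bar" :start1 3)             ; true

;; Passing NIL explicitly is equivalent to omitting the argument.
(string= "foobar" "bar" :start1 3 :end1 nil)   ; true

;; Restricting only the ends compares the prefix "foo" of each string.
(string= "foobar" "food" :end1 3 :end2 3)      ; true

;; Hypothetical helper: true if STRING ends with SUFFIX.
(defun ends-with-p (string suffix)
  (let ((start (- (length string) (length suffix))))
    ;; A suffix longer than the string can never match.
    (and (>= start 0)
         (string= string suffix :start1 start))))

(ends-with-p "foobar" "bar")   ; true
(ends-with-p "foobar" "baz")   ; NIL
```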

The comparators that return true when their arguments differ (that is, all of them except **STRING=** and **STRING-EQUAL**) return the index in the first string where the mismatch was detected.

(string/= "lisp" "lissome") ==> 3

If the first string is a prefix of the second, the return value will be the length of the first string, that is, one greater than the largest valid index into the string.

(string< "lisp" "lisper") ==> 4

When comparing substrings, the resulting value is still an index into the string as a whole. For instance, the following compares the substrings “bar” and “baz” but returns 5 because that’s the index of the r in the first string:

(string< "foobar" "abaz" :start1 3 :start2 1) ==> 5   ; N.B. not 2
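
Because any non-**NIL** value counts as true, the returned index can be used both as a boolean and as a position. The following sketch shows a few ways you might use it; the use of **SUBSEQ** to extract the common prefix is my own illustration rather than something the comparators do for you.

```lisp
;; When the comparison fails, the result is simply NIL.
(string/= "lisp" "lisp")     ; ==> NIL
(string< "lisper" "lisp")    ; ==> NIL

;; The case-insensitive comparators return a mismatch index too.
(string-not-equal "LISP" "lissome")   ; ==> 3

;; The index marks the end of the common prefix of the two strings.
(subseq "lisp" 0 (string/= "lisp" "lissome"))   ; ==> "lis"

;; To get an offset relative to the substring, subtract :start1 yourself.
(let* ((start1 3)
       (index (string< "foobar" "abaz" :start1 start1 :start2 1)))
  (- index start1))   ; ==> 2
```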

Other string functions allow you to convert the case of strings and trim characters from one or both ends of a string. And, as I mentioned previously, since strings are really a kind of sequence, all the sequence functions I’ll discuss in the next chapter can be used with strings. For instance, you can discover the length of a string with the **LENGTH** function and can get and set individual characters of a string with the generic sequence element accessor function, **ELT**, or the generic array element accessor function, **AREF**. Or you can use the string-specific accessor, **CHAR**. But those functions, and others, are the topic of the next chapter, so let’s move on.
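
As a brief preview, here is a sketch of the kinds of functions just mentioned. The case-conversion and trimming functions shown (**STRING-UPCASE**, **STRING-DOWNCASE**, **STRING-TRIM**, and friends) are standard Common Lisp functions that the text above alludes to without naming. The example strings are arbitrary.

```lisp
;; Case conversion returns a new string.
(string-upcase "hello")             ; ==> "HELLO"
(string-downcase "HeLLo")           ; ==> "hello"

;; Trimming takes the characters to strip, then the string.
(string-trim " " "  lisp  ")        ; ==> "lisp"
(string-left-trim " " "  lisp  ")   ; ==> "lisp  "
(string-right-trim " " "  lisp  ")  ; ==> "  lisp"

;; Sequence and array functions work on strings too.
(length "lisp")   ; ==> 4
(char "lisp" 0)   ; ==> #\l
(elt "lisp" 1)    ; ==> #\i
(aref "lisp" 2)   ; ==> #\s

;; Setting a character: copy first, since modifying a literal
;; string has undefined consequences.
(let ((s (copy-seq "lisp")))
  (setf (char s 0) #\L)
  s)              ; ==> "Lisp"
```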

1. Fred Brooks, The Mythical Man-Month, 20th Anniversary Edition (Boston: Addison-Wesley, 1995), p. 103. Emphasis in original.

2. Mattel’s Teen Talk Barbie

3. Obviously, the size of a number that can be represented on a computer with finite memory is still limited in practice; furthermore, the actual representation of bignums used in a particular Common Lisp implementation may place other limits on the size of number that can be represented. But these limits are going to be well beyond “astronomically” large numbers. For instance, the number of atoms in the universe is estimated to be less than 2^269; current Common Lisp implementations can easily handle numbers up to and beyond 2^262144.

4. Folks interested in using Common Lisp for intensive numeric computation should note that a naive comparison of the performance of numeric code in Common Lisp and languages such as C or FORTRAN will probably show Common Lisp to be much slower. This is because something as simple as (+ a b) in Common Lisp is doing a lot more than the seemingly equivalent a + b in one of those languages. Because of Lisp’s dynamic typing and support for things such as arbitrary precision rationals and complex numbers, a seemingly simple addition is doing a lot more than an addition of two numbers that are known to be represented by machine words. However, you can use declarations to give Common Lisp information about the types of numbers you’re using that will enable it to generate code that does only as much work as the code that would be generated by a C or FORTRAN compiler. Tuning numeric code for this kind of performance is beyond the scope of this book, but it’s certainly possible.

5. While the standard doesn’t require it, many Common Lisp implementations support the IEEE standard for floating-point arithmetic, IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985 (Institute of Electrical and Electronics Engineers, 1985).

6. It’s also possible to change the default base the reader uses for numbers without a specific radix marker by changing the value of the global variable **\*READ-BASE\***. However, it’s not clear that’s the path to anything other than complete insanity.

7. Since the purpose of floating-point numbers is to make efficient use of floating-point hardware, each Lisp implementation is allowed to map these four subtypes onto the native floating-point types as appropriate. If the hardware supports fewer than four distinct representations, one or more of the types may be equivalent.

8. “Computerized scientific notation” is in scare quotes because, while commonly used in computer languages since the days of FORTRAN, it’s actually quite different from real scientific notation. In particular, something like 1.0e4 means 10000.0, but in true scientific notation that would be written as 1.0 x 10^4. And to further confuse matters, in true scientific notation the letter e stands for the base of the natural logarithm, so something like 1.0 x e^4, while superficially similar to 1.0e4, is a completely different value, approximately 54.6.

9. For mathematical consistency, **+** and **\*** can also be called with no arguments, in which case they return the appropriate identity: 0 for **+** and 1 for **\***.

10. Roughly speaking, **MOD** is equivalent to the % operator in Perl and Python, and **REM** is equivalent to the % in C and Java. (Technically, the exact behavior of % in C wasn’t specified until the C99 standard.)

11. Even Java, which was designed from the beginning to use Unicode characters on the theory that Unicode was going to be the character encoding of the future, has run into trouble since Java characters are defined to be a 16-bit quantity and the Unicode 3.1 standard extended the range of the Unicode character set to require a 21-bit representation. Ooops.

12. Note, however, that not all literal strings can be printed by passing them as the second argument to **FORMAT** since certain sequences of characters have a special meaning to **FORMAT**. To safely print an arbitrary string (say, the value of a variable s) with **FORMAT** you should write (format t "~a" s).