17.3.1 x86

MSVC 2010:

Listing 17.18: MSVC 2010

    _rt$ = -8       ; size = 4
    _i$ = -4        ; size = 4
    _a$ = 8         ; size = 4
    _f PROC
        push    ebp
        mov     ebp, esp
        sub     esp, 8
        mov     DWORD PTR _rt$[ebp], 0
        mov     DWORD PTR _i$[ebp], 0
        jmp     SHORT $LN4@f
    $LN3@f:
        mov     eax, DWORD PTR _i$[ebp]     ; increment i
        add     eax, 1
        mov     DWORD PTR _i$[ebp], eax
    $LN4@f:
        cmp     DWORD PTR _i$[ebp], 32      ; 00000020H
        jge     SHORT $LN2@f                ; loop finished?
        mov     edx, 1
        mov     ecx, DWORD PTR _i$[ebp]
        shl     edx, cl                     ; EDX=EDX<<CL
        and     edx, DWORD PTR _a$[ebp]
        je      SHORT $LN1@f                ; result of AND instruction was 0?
                                            ; then skip next instructions
        mov     eax, DWORD PTR _rt$[ebp]    ; no, not zero
        add     eax, 1                      ; increment rt
        mov     DWORD PTR _rt$[ebp], eax
    $LN1@f:
        jmp     SHORT $LN3@f
    $LN2@f:
        mov     eax, DWORD PTR _rt$[ebp]
        mov     esp, ebp
        pop     ebp
        ret     0
    _f ENDP

Here is the code compiled by GCC 4.4.1:

Listing 17.19: GCC 4.4.1

    public f
    f proc near
    rt = dword ptr -0Ch
    i = dword ptr -8
    arg_0 = dword ptr 8
        push    ebp
        mov     ebp, esp
        push    ebx
        sub     esp, 10h
        mov     [ebp+rt], 0
        mov     [ebp+i], 0
        jmp     short loc_80483EF
    loc_80483D0:
        mov     eax, [ebp+i]
        mov     edx, 1
        mov     ebx, edx
        mov     ecx, eax
        shl     ebx, cl
        mov     eax, ebx
        and     eax, [ebp+arg_0]
        test    eax, eax
        jz      short loc_80483EB
        add     [ebp+rt], 1
    loc_80483EB:
        add     [ebp+i], 1
    loc_80483EF:
        cmp     [ebp+i], 1Fh
        jle     short loc_80483D0
        mov     eax, [ebp+rt]
        add     esp, 10h
        pop     ebx
        pop     ebp
        retn
    f endp

Shift operations are often used when multiplying or dividing by a power of two (1, 2, 4, 8, etc.). For example:

    unsigned int f(unsigned int a)
    {
        return a/4;
    }

MSVC 2010:

Listing 17.20: MSVC 2010

    _a$ = 8         ; size = 4
    _f PROC
        mov     eax, DWORD PTR _a$[esp-4]
        shr     eax, 2
        ret     0
    _f ENDP

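The SHR (logical shift right) instruction in this example shifts the value of a right by 2 bits. The two most significant bits are set to zero and the two least significant bits are dropped. In fact, the two dropped bits are the remainder of the division. SHR works like SHL, only in the opposite direction.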

[Figure 1 of section 17.3.1, x86]

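This is easy to understand with the decimal number 23. Dividing 23 by 10 drops the last digit (3, the remainder) and leaves 2 as the quotient. Multiplication works the same way. To multiply by 4, shift the number left by 2 bits; the two least significant bits are filled with zeros. It is just like multiplying 3 by 100: you only need to append two zeros on the right.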
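The relationship between shifts and arithmetic with powers of two can be sketched in C. This is a minimal illustration for 32-bit unsigned values only; the helper names `div4`, `mod4` and `mul4` are made up for this example and do not come from the listings above.

```c
#include <stdint.h>

/* Quotient of a / 4: a logical right shift by 2 bits (what SHR EAX, 2 does). */
uint32_t div4(uint32_t a)
{
    return a >> 2;
}

/* Remainder of a / 4: exactly the two low bits that the shift throws away. */
uint32_t mod4(uint32_t a)
{
    return a & 3u;
}

/* a * 4: a left shift by 2 bits; the two low bits become zero.
   The result wraps modulo 2^32, just like unsigned multiplication. */
uint32_t mul4(uint32_t a)
{
    return a << 2;
}
```

For a = 23 the quotient is 5 and the remainder is 3, so `div4(23) * 4 + mod4(23)` gives 23 again. For signed values a compiler generally cannot use a plain logical shift, so the sketch deliberately stays with unsigned types.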

17.3.2 ARM + Optimizing Xcode (LLVM) + ARM mode

Listing 17.21: Optimizing Xcode (LLVM) + ARM mode

        MOV     R1, R0
        MOV     R0, #0
        MOV     R2, #1
        MOV     R3, R0
    loc_2E54
        TST     R1, R2,LSL R3   ; set flags according to R1 & (R2<<R3)
        ADD     R3, R3, #1      ; R3++
        ADDNE   R0, R0, #1      ; if ZF flag is cleared by TST, R0++
        CMP     R3, #32
        BNE     loc_2E54
        BX      LR

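TST is the same as the TEST instruction in x86. As I mentioned before (14.2.1), there are no separate shift instructions in ARM mode. Instead there are the modifiers LSL (Logical Shift Left), LSR (Logical Shift Right), ASR (Arithmetic Shift Right), ROR (Rotate Right) and RRX (Rotate Right with Extend), which can be attached to instructions such as MOV, TST, CMP, ADD, SUB and RSB. A modifier defines how the second operand is shifted and by how many bits. Thus the instruction "TST R1, R2,LSL R3" works as R1 ∧ (R2 ≪ R3).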
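What the ARM loop computes can be written back in C roughly as follows. This is a sketch reconstructed from the listing, not the original source of the function; the name `count_set_bits` is an assumption made for this example, and the comments map each statement to the instruction it corresponds to.

```c
#include <stdint.h>

/* Counts how many of the 32 bits of a are set, one bit per iteration. */
unsigned count_set_bits(uint32_t a)
{
    unsigned rt = 0;                      /* MOV R0, #0 */
    for (unsigned i = 0; i < 32; i++) {   /* ADD R3, R3, #1 / CMP R3, #32 / BNE */
        if (a & (UINT32_C(1) << i))       /* TST R1, R2,LSL R3 */
            rt++;                         /* ADDNE R0, R0, #1 */
    }
    return rt;
}
```

The shift and the AND are one instruction in ARM mode, which is why the loop body is so short there. The x86 versions above need a separate SHL followed by AND or TEST.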

17.3.3 ARM + Optimizing Xcode (LLVM) + thumb-2 mode

Almost the same, but here the LSL.W/TST pair is used instead of a single TST, because in Thumb mode the LSL modifier cannot be applied directly in TST.

        MOV     R1, R0
        MOVS    R0, #0
        MOV.W   R9, #1
        MOVS    R3, #0
    loc_2F7A
        LSL.W   R2, R9, R3
        TST     R2, R1
        ADD.W   R3, R3, #1
        IT NE
        ADDNE   R0, #1
        CMP     R3, #32
        BNE     loc_2F7A
        BX      LR

17.4 CRC32 calculation example

This is a very popular table-based technique for calculating the CRC32 hash.

    /* By Bob Jenkins, (c) 2006, Public Domain */
    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>
    typedef unsigned long ub4;
    typedef unsigned char ub1;
    static const ub4 crctab[256] = {
      0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
      0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
      0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
      0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
      0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
      0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
      0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
      0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
      0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
      0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
      0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
      0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
      0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
      0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
      0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
      0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
      0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
      0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
      0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
      0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
      0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
      0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
      0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
      0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
      0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
      0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
      0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
      0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
      0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
      0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
      0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
      0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
      0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
      0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
      0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
      0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
      0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
      0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
      0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
      0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
      0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
      0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
      0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d,
    };
    /* how to derive the values in crctab[] from polynomial 0xedb88320 */
    void build_table(void)
    {
      ub4 i, j;
      for (i=0; i<256; ++i) {
        j = i;
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);
        printf("0x%.8lx, ", j);
        if (i%6 == 5) printf("\n");
      }
    }
    /* the hash function */
    ub4 crc(const void *key, ub4 len, ub4 hash)
    {
      ub4 i;
      const ub1 *k = key;
      for (hash=len, i=0; i<len; ++i)
        hash = (hash >> 8) ^ crctab[(hash & 0xff) ^ k[i]];
      return hash;
    }
    /* To use, try "gcc -O crc.c -o crc; crc < crc.c" */
    int main()
    {
      char s[1000];
      /* fgets() stands in for gets(), which was removed from the C standard in C11 */
      while (fgets(s, sizeof s, stdin)) {
        s[strcspn(s, "\n")] = '\0';   /* drop the trailing newline, as gets() did */
        printf("%.8lx\n", crc(s, strlen(s), 0));
      }
      return 0;
    }

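We are interested only in the crc() function. Note the two loop initializers in the for() statement: hash=len, i=0. Standard C/C++ allows this. The emitted code will then contain two operations in the loop initialization part instead of the usual one.

Let's compile it with MSVC with optimization (/Ox). For brevity, only the crc() function is listed here, with my comments.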
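As a separate illustration of two initializers in one for() header, here is a minimal sketch that uses the comma operator in both the initialization and the increment part. The function `reverse_bytes` is a hypothetical example and is not related to the CRC code.

```c
#include <stddef.h>

/* Reverses buf in place. Both lo and hi are set up in the for() header,
   separated by the comma operator, and both are updated on each pass. */
void reverse_bytes(unsigned char *buf, size_t len)
{
    size_t lo, hi;

    if (len == 0)
        return;                 /* avoid len - 1 wrapping around */

    for (lo = 0, hi = len - 1; lo < hi; ++lo, --hi) {
        unsigned char tmp = buf[lo];
        buf[lo] = buf[hi];
        buf[hi] = tmp;
    }
}
```

In crc() only the initialization part has two expressions (hash=len, i=0); the increment part has a single one (++i).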

    _key$ = 8       ; size = 4
    _len$ = 12      ; size = 4
    _hash$ = 16     ; size = 4
    _crc PROC
        mov     edx, DWORD PTR _len$[esp-4]
        xor     ecx, ecx        ; i will be stored in ECX
        mov     eax, edx
        test    edx, edx
        jbe     SHORT $LN1@crc
        push    ebx
        push    esi
        mov     esi, DWORD PTR _key$[esp+4] ; ESI = key
        push    edi
    $LL3@crc:
    ; work with bytes using only 32-bit registers: the byte at address key+i is loaded into EDI
        movzx   edi, BYTE PTR [ecx+esi]
        mov     ebx, eax        ; EBX = hash (initially len)
        and     ebx, 255        ; EBX = hash & 0xff
    ; XOR EDI, EBX (EDI=EDI^EBX): this operation uses all 32 bits of each register,
    ; but the other bits (8-31) are always cleared, so it is OK.
    ; In EDI they were cleared by the MOVZX instruction above;
    ; the high bits of EBX were cleared by the AND EBX, 255 instruction above (255 = 0xff)
        xor     edi, ebx
    ; EAX=EAX>>8; bits 24-31, shifted in "from nowhere", will be cleared
        shr     eax, 8
    ; EAX=EAX^crctab[EDI*4]: select the EDI-th element of the crctab[] table
        xor     eax, DWORD PTR _crctab[edi*4]
        inc     ecx             ; i++
        cmp     ecx, edx        ; i<len ?
        jb      SHORT $LL3@crc  ; yes
        pop     edi
        pop     esi
        pop     ebx
    $LN1@crc:
        ret     0
    _crc ENDP
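The table lookup in the loop stands in for eight single-bit steps with the polynomial 0xedb88320, as the comment above build_table() indicates. The sketch below recomputes a table entry on the fly and performs one byte of the hash update without crctab[]. The names `crc_table_entry` and `crc_step` are hypothetical helpers for this illustration, and `uint32_t` is used in place of the `ub4` typedef so that the width is 32 bits on every platform.

```c
#include <stdint.h>

/* The polynomial named in the comment above build_table(). */
#define CRC32_POLY UINT32_C(0xedb88320)

/* Derives the value of crctab[index] with eight 1-bit steps,
   the same computation build_table() writes out by hand. */
uint32_t crc_table_entry(uint8_t index)
{
    uint32_t j = index;
    for (int bit = 0; bit < 8; bit++)
        j = (j >> 1) ^ ((j & 1u) ? CRC32_POLY : 0u);
    return j;
}

/* One iteration of the crc() loop without the lookup table:
   hash = (hash >> 8) ^ crctab[(hash & 0xff) ^ byte]. */
uint32_t crc_step(uint32_t hash, uint8_t byte)
{
    uint8_t index = (uint8_t)((hash & 0xffu) ^ byte);
    return (hash >> 8) ^ crc_table_entry(index);
}
```

For instance, `crc_table_entry(1)` yields 0x77073096 and `crc_table_entry(128)` yields 0xedb88320, matching the corresponding entries of crctab[]. The table trades 1 KB of memory for avoiding the inner eight-step loop on every input byte.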

Now let's look at the optimized code produced by GCC 4.4.1:

    public crc
    crc proc near
    key = dword ptr 8
    hash = dword ptr 0Ch        ; second argument (len in the C source)
        push    ebp
        xor     edx, edx
        mov     ebp, esp
        push    esi
        mov     esi, [ebp+key]
        push    ebx
        mov     ebx, [ebp+hash]
        test    ebx, ebx
        mov     eax, ebx
        jz      short loc_80484D3
        nop                     ; padding
        lea     esi, [esi+0]    ; padding; ESI does not change here
    loc_80484B8:
        mov     ecx, eax        ; save previous state of hash to ECX
        xor     al, [esi+edx]   ; AL=AL^*(key+i)
        add     edx, 1          ; i++
        shr     ecx, 8          ; ECX=hash>>8
        movzx   eax, al         ; EAX=AL, zero-extended: the table index
        mov     eax, dword ptr ds:crctab[eax*4] ; EAX=crctab[EAX]
        xor     eax, ecx        ; hash=EAX^ECX
        cmp     ebx, edx
        ja      short loc_80484B8
    loc_80484D3:
        pop     ebx
        pop     esi
        pop     ebp
        retn
    crc endp
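The two compilers compute the table index in different ways. MSVC masks the hash with AND and then XORs two full registers, while GCC XORs only the low byte (AL) and zero-extends the result with MOVZX. A small sketch, assuming 32-bit values, shows that both give the same index; the function names are made up for this comparison.

```c
#include <stdint.h>

/* MSVC-style: mask the hash to its low byte, then XOR with the input byte
   (and ebx, 255 / xor edi, ebx). */
uint32_t index_masked(uint32_t hash, uint8_t byte)
{
    return (hash & 0xffu) ^ byte;
}

/* GCC-style: XOR only the low byte, then zero-extend it
   (xor al, [esi+edx] / movzx eax, al). */
uint32_t index_low_byte(uint32_t hash, uint8_t byte)
{
    uint8_t al = (uint8_t)hash;   /* AL is the low 8 bits of EAX */
    al ^= byte;
    return al;                    /* implicit zero-extension to 32 bits */
}
```

Because XOR works on each bit independently, the upper 24 bits of the hash never influence the low byte, so it does not matter whether they are masked off before or after the XOR.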

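GCC aligns the start of the loop on an 8-byte boundary by inserting NOP and lea esi, [esi+0], which is also a no-op. For more information, read the npad section (64).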