说明

本文用查找一条数据库表记录的例子来分析InnoDB B+Tree的结构

先用sysbench插入一千万条记录:

  1. sysbench --db-driver=mysql --mysql-user=username --mysql-password=password --mysql-db=sbtest \
  2. --table_size=10000000 --tables=1 oltp_read_write --mysql-host=127.0.0.1 prepare

生成的sbtest1.ibd大小为2.3G,用hexdump来查看内容

随意选一条记录,比如id=3905000,通过手动找到这条记录来分析B+Tree的结构。

  1. mysql> select * from sbtest1 where id=3905000 \G
  2. *************************** 1. row ***************************
  3. id: 3905000
  4. k: 7152454
  5. c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
  6. pad: 93314475890-72810819110-74153294523-75348581725-15287112137
  7. 1 row in set (0.00 sec)

Page format

详细请查看: https://dev.mysql.com/doc/internals/en/innodb-page-overview.html

NameSize(bytes)
File Header38
Page Header56
Infimum + Supremum Records16
User Records不定
Free Space不定
Page Directory不定
Fil Trailer8

查看B+Tree root page

page 3是root page,先用hexdump查看file header

page 3的offset是3*16*1024 = 49152,file header size是38 bytes

  1. $hexdump -s 49152 -n 38 -C sbtest1.ibd
  2. 0000c000 58 0f 5c bd 00 00 00 03 ff ff ff ff ff ff ff ff |X.\.............|
  3. 0000c010 00 00 00 00 99 4a f4 5a 45 bf 00 00 00 00 00 00 |.....J.ZE.......|
  4. 0000c020 00 00 00 00 00 20 |..... |

page type是0x45bf,表示B+Tree节点

再看page header,offset是3161024 + 38 = 49190,page header size是56 bytes

  1. $hexdump -s 49190 -n 56 sbtest1.ibd
  2. 0000c026 00 1d 06 4f 80 75 00 00 00 00 06 47 00 02 00 72 |...O.u.....G...r|
  3. 0000c036 00 73 00 00 00 00 00 00 00 00 00 02 00 00 00 00 |.s..............|
  4. 0000c046 00 00 00 38 00 00 00 20 00 00 00 02 00 f2 00 00 |...8... ........|
  5. 0000c056 00 20 00 00 00 02 00 32 |. .....2|

page directory slots数目为29(0x001d)

page level为2(0x0002),这个是根节点。叶子节点level总是为0,所以这棵B+Tree深度为3。

root page directory

每个slot占2 bytes,29个slot共58 bytes。 所以root page directory offset为 3(161024) - 58 - 8(trailer) = 65470

  1. $hexdump -s 65470 -n 58 -C sbtest1.ibd
  2. 0000ffbe 00 70 05 ec 05 b8 05 84 05 50 05 1c 04 e8 04 b4 |.p.......P......|
  3. 0000ffce 04 80 04 4c 04 18 03 e4 03 b0 03 7c 03 48 03 14 |...L.......|.H..|
  4. 0000ffde 02 e0 02 ac 02 78 02 44 02 10 01 dc 01 a8 01 74 |.....x.D.......t|
  5. 0000ffee 01 40 01 0c 00 d8 00 a4 00 63 |.@.......c|

page directory按primary key逆序排列,0x0063是infimum record offset,0x0070是supremum record offset

  1. $hexdump -s 49246 -n 12 -C sbtest1.ibd
  2. 0000c05e 01 00 02 00 1a 69 6e 66 69 6d 75 6d |.....infimum|

“01 00 02 00 1a” 是record header,最后1个byte 0x1a表示next record offset

3161024+0x63+0x1a=49277,找到第1条记录

  1. $hexdump -s 49277 -n 8 sbtest1.ibd
  2. 0000c07d 80 00 00 01 00 00 00 24 |.......$|
slotoffsetrecord(hex)primary keypage no
00x006369 6e 66 69 6d 75 6d 00136
10x00a480 03 59 53 00 00 00 2721947539
20x00d880 08 b5 7f 00 00 00 2b57075143
30x010c80 0e 11 ab 00 00 00 2f92202747
40x014080 13 6d d7 00 00 00 33127330351
50x017480 18 ca 03 00 00 00 37162457955
60x01a880 1e 26 2f 00 00 00 3b197585559
70x01dc80 23 82 5b 00 00 00 3f232713163
80x021080 28 de 87 00 00 90 00267840736864
90x024480 2e 3a b3 00 00 90 04302968336868
100x027880 33 96 df 00 00 90 08338095936872
110x02ac80 38 f3 0b 00 00 90 0c373223536876
120x02e080 3e 4f 37 00 00 90 10408351136880
130x031480 43 ab 63 00 00 90 14443478736884
140x034880 49 07 8f 00 00 90 18478606336888
150x037c80 4e 63 bb 00 00 90 1c513733936892
160x03b080 53 bf e7 00 00 90 20548861536896
170x03e480 59 1c 13 00 00 90 24583989136900
180x041880 5e 78 3f 00 00 90 28619116736904
190x044c80 63 d4 6b 00 00 90 2c654244336908
200x048080 69 30 97 00 00 90 30689371936912
210x04b480 6e 8c c3 00 00 90 34724499536916
220x04e880 73 e8 ef 00 00 90 38759627136920
230x051c80 79 45 1b 00 00 90 3c794754736924
240x055080 7e a1 47 00 01 be 008298823114176
250x058480 83 fd 73 00 01 be 048650099114180
260x05b880 89 59 9f 00 01 be 089001375114184
270x05ec80 8e b5 cb 00 01 be 0c9352651114188
280x007073 75 70 72 65 6d 75 6d  

我们要找的记录primary key为3905000,在以上29个slots中做binary search,发现可能在slot 11和slot 12之间 slot 11的primary key为3732235,并不是我们要找的3905000,需要继续在slot 11的record list中查找。

查看page 0 slot 11的record list

record header共5个bytes,第5个byte记录了next record的相对于current record的offset, 即next record offset = current record offset + current record header的第5个byte 非叶子节点record body为8 bytes

offsetrecord(hex)primary keypage no
68480 38 f3 0b 00 00 90 0c373223536876
69780 3a 4a 16 00 00 90 0d382005436877
71080 3b a1 21 00 00 90 0e390787336878
72380 3c f8 2c 00 00 90 0f399569236879
73680 3e 4f 37 00 00 90 10408351136880
152980 90 0c d6 00 01 be 0d9440470114189
154280 91 63 e1 00 01 be 0e9528289114190
155580 92 ba ec 00 01 be 0f9616108114191
156880 94 11 f7 00 01 be 109703927114192
158180 95 69 02 00 01 be 119791746114193
159480 96 c0 0d 00 01 be 129879565114194
160780 98 17 18 00 01 be 139967384114195

到此发现我们要找的记录(primary key 3905000)可能在offset 697所指向的page 36877

查看page 36877

先看file header, offset = 36877 * (16*1024) = 604192768, size为38 bytes

  1. $hexdump -s 604192768 -n 38 -C sbtest1.ibd
  2. 24034000 d1 d6 61 2a 00 00 90 0d 00 00 90 0c 00 00 90 0e |..a*............|
  3. 24034010 00 00 00 00 3b ff 8b 03 45 bf 00 00 00 00 00 00 |....;...E.......|
  4. 24034020 00 00 00 00 00 20 |..... |

page type是0x45bf,确认了是B+Tree节点

再看page header, offset = 36877 * (16*1024) + 38 = 604192806, size为56 bytes

  1. $hexdump -s 604192806 -n 56 -C sbtest1.ibd
  2. 24034026 01 2d 3d 8f 84 b5 00 00 00 00 3d 87 00 02 04 b2 |.-=.......=.....|
  3. 24034036 04 b3 00 00 00 00 00 00 00 00 00 01 00 00 00 00 |................|
  4. 24034046 00 00 00 38 00 00 00 00 00 00 00 00 00 00 00 00 |...8............|
  5. 24034056 00 00 00 00 00 00 00 00 |........|

page directory slots数目为301(0x012d)

page level为1(0x0001),还不是叶子节点。

再看page diretory 每个slot占2 bytes,301个slot共602 bytes。 所以page directory offset为 36877(161024) - 602 - 8(trailer) = 604208542

  1. $hexdump -s 604208542 -n 602 -C sbtest1.ibd
  2. 24037d9e 00 70 3d 2c 3c f8 3c c4 3c 90 3c 5c 3c 28 3b f4 |.p=,<.<.<.<\<(;.|
  3. 24037dae 3b c0 3b 8c 3b 58 3b 24 3a f0 3a bc 3a 88 3a 54 |;.;.;X;$:.:.:.:T|
  4. 24037dbe 3a 20 39 ec 39 b8 39 84 39 50 39 1c 38 e8 38 b4 |: 9.9.9.9P9.8.8.|
  5. 24037dce 38 80 38 4c 38 18 37 e4 37 b0 37 7c 37 48 37 14 |8.8L8.7.7.7|7H7.|
  6. 24037dde 36 e0 36 ac 36 78 36 44 36 10 35 dc 35 a8 35 74 |6.6.6x6D6.5.5.5t|
  7. 24037dee 35 40 35 0c 34 d8 34 a4 34 70 34 3c 34 08 33 d4 |5@5.4.4.4p4<4.3.|
  8. 24037dfe 33 a0 33 6c 33 38 33 04 32 d0 32 9c 32 68 32 34 |3.3l383.2.2.2h24|
  9. 24037e0e 32 00 31 cc 31 98 31 64 31 30 30 fc 30 c8 30 94 |2.1.1.1d100.0.0.|
  10. 24037e1e 30 60 30 2c 2f f8 2f c4 2f 90 2f 5c 2f 28 2e f4 |0`0,/./././\/(..|
  11. 24037e2e 2e c0 2e 8c 2e 58 2e 24 2d f0 2d bc 2d 88 2d 54 |.....X.$-.-.-.-T|
  12. 24037e3e 2d 20 2c ec 2c b8 2c 84 2c 50 2c 1c 2b e8 2b b4 |- ,.,.,.,P,.+.+.|
  13. 24037e4e 2b 80 2b 4c 2b 18 2a e4 2a b0 2a 7c 2a 48 2a 14 |+.+L+.*.*.*|*H*.|
  14. 24037e5e 29 e0 29 ac 29 78 29 44 29 10 28 dc 28 a8 28 74 |).).)x)D).(.(.(t|
  15. 24037e6e 28 40 28 0c 27 d8 27 a4 27 70 27 3c 27 08 26 d4 |(@(.'.'.'p'<'.&.|
  16. 24037e7e 26 a0 26 6c 26 38 26 04 25 d0 25 9c 25 68 25 34 |&.&l&8&.%.%.%h%4|
  17. 24037e8e 25 00 24 cc 24 98 24 64 24 30 23 fc 23 c8 23 94 |%.$.$.$d$0#.#.#.|
  18. 24037e9e 23 60 23 2c 22 f8 22 c4 22 90 22 5c 22 28 21 f4 |#`#,"."."."\"(!.|
  19. 24037eae 21 c0 21 8c 21 58 21 24 20 f0 20 bc 20 88 20 54 |!.!.!X!$ . . . T|
  20. 24037ebe 20 20 1f ec 1f b8 1f 84 1f 50 1f 1c 1e e8 1e b4 | .......P......|
  21. 24037ece 1e 80 1e 4c 1e 18 1d e4 1d b0 1d 7c 1d 48 1d 14 |...L.......|.H..|
  22. 24037ede 1c e0 1c ac 1c 78 1c 44 1c 10 1b dc 1b a8 1b 74 |.....x.D.......t|
  23. 24037eee 1b 40 1b 0c 1a d8 1a a4 1a 70 1a 3c 1a 08 19 d4 |.@.......p.<....|
  24. 24037efe 19 a0 19 6c 19 38 19 04 18 d0 18 9c 18 68 18 34 |...l.8.......h.4|
  25. 24037f0e 18 00 17 cc 17 98 17 64 17 30 16 fc 16 c8 16 94 |.......d.0......|
  26. 24037f1e 16 60 16 2c 15 f8 15 c4 15 90 15 5c 15 28 14 f4 |.`.,.......\.(..|
  27. 24037f2e 14 c0 14 8c 14 58 14 24 13 f0 13 bc 13 88 13 54 |.....X.$.......T|
  28. 24037f3e 13 20 12 ec 12 b8 12 84 12 50 12 1c 11 e8 11 b4 |. .......P......|
  29. 24037f4e 11 80 11 4c 11 18 10 e4 10 b0 10 7c 10 48 10 14 |...L.......|.H..|
  30. 24037f5e 0f e0 0f ac 0f 78 0f 44 0f 10 0e dc 0e a8 0e 74 |.....x.D.......t|
  31. 24037f6e 0e 40 0e 0c 0d d8 0d a4 0d 70 0d 3c 0d 08 0c d4 |.@.......p.<....|
  32. 24037f7e 0c a0 0c 6c 0c 38 0c 04 0b d0 0b 9c 0b 68 0b 34 |...l.8.......h.4|
  33. 24037f8e 0b 00 0a cc 0a 98 0a 64 0a 30 09 fc 09 c8 09 94 |.......d.0......|
  34. 24037f9e 09 60 09 2c 08 f8 08 c4 08 90 08 5c 08 28 07 f4 |.`.,.......\.(..|
  35. 24037fae 07 c0 07 8c 07 58 07 24 06 f0 06 bc 06 88 06 54 |.....X.$.......T|
  36. 24037fbe 06 20 05 ec 05 b8 05 84 05 50 05 1c 04 e8 04 b4 |. .......P......|
  37. 24037fce 04 80 04 4c 04 18 03 e4 03 b0 03 7c 03 48 03 14 |...L.......|.H..|
  38. 24037fde 02 e0 02 ac 02 78 02 44 02 10 01 dc 01 a8 01 74 |.....x.D.......t|
  39. 24037fee 01 40 01 0c 00 d8 00 a4 00 63 |.@.......c|

解析page directory

slotoffsetrecord(hex)primary keypage no
10x00a480 3a 4a f1 00 00 cd 8d382027352621
20x00d880 3a 4c 15 00 00 cd 91382056552625
30x010c80 3a 4d 39 00 00 cd 95382085752629
40x014080 3a 4e 5d 00 00 cd 99382114952633
50x017480 3a 4f 81 00 00 cd 9d382144152637
60x01a880 3a 50 a5 00 00 cd a1382173352641
70x01dc80 3a 51 c9 00 00 cd a5382202552645
80x021080 3a 52 ed 00 00 cd a9382231752649
90x024480 3a 54 11 00 00 cd ad382260952653
100x027880 3a 55 35 00 00 cd b1382290152657
2880x3af080 3b 92 4d 00 00 d2 09390407753769
2890x3b2480 3b 93 71 00 00 d2 0d390436953773
2900x3b5880 3b 94 95 00 00 d2 11390466153777
2910x3b8c80 3b 95 b9 00 00 d2 15390495353781
2920x3bc080 3b 96 dd 00 00 d2 19390524553785
2930x3bf480 3b 98 01 00 00 d2 1d390553753789
2940x3c2880 3b 99 25 00 00 d2 21390582953793
2950x3c5c80 3b 9a 49 00 00 d2 25390612153797
2960x3c9080 3b 9b 6d 00 00 d2 29390641353801
2970x3cc480 3b 9c 91 00 00 d2 2d390670553805
2980x3cf880 3b 9d b5 00 00 d2 31390699753809
2990x3d2c80 3b 9e d9 00 00 d2 35390728953813

在以上slots中做binary search,发现我们要找的记录(primary key 3905000)可能在slot 291和slot 292之间 继续查看slot 291的第1个record所指向的page 53781

查看page 53781

先看file header, offset = 53781 * (16*1024) = 881147904, size为38 bytes

  1. $hexdump -s 881147904 -n 38 -C sbtest1.ibd
  2. 34854000 09 d0 e4 a7 00 00 d2 15 00 00 d2 14 00 00 d2 16 |................|
  3. 34854010 00 00 00 00 3b f4 ac 4b 45 bf 00 00 00 00 00 00 |....;..KE.......|
  4. 34854020 00 00 00 00 00 20 |..... |

page type是0x45bf,确认了是B+Tree节点

再看page header, offset = 53781 * (16*1024) + 38 = 881147942, size为56 bytes

  1. $hexdump -s 881147942 -n 56 -C sbtest1.ibd
  2. 34854026 00 13 3b c8 80 4b 00 00 00 00 3a ff 00 02 00 48 |..;..K....:....H|
  3. 34854036 00 49 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |.I..............|
  4. 34854046 00 00 00 38 00 00 00 00 00 00 00 00 00 00 00 00 |...8............|
  5. 34854056 00 00 00 00 00 00 00 00 |........|

page directory slots数目为19(0x0013)

page level为0(0x0000),这是叶子节点了,是真正存数据的page

再看page diretory 每个slot占2 bytes,19个slot共38 bytes。 所以page directory offset为 53781(161024) - 38 - 8(trailer) = 881164242

  1. $hexdump -s 881164242 -n 38 -C sbtest1.ibd
  2. 34857fd2 00 70 36 ef 33 af 30 6f 2d 2f 29 ef 26 af 23 6f |.p6.3.0o-/).&.#o|
  3. 34857fe2 20 2f 1c ef 19 af 16 6f 13 2f 0f ef 0c af 09 6f | /.....o./.....o|
  4. 34857ff2 06 2f 02 ef 00 63 |./...c|

解析page directory

  1. slot | offset | record(hex) | primary key | page no
  2. ----- | ------ | ------ | -------- | --------
  3. 1 | 0x02ef | 80 3b 95 bc 00 00 00 00 | 3904956 | 0
  4. 2 | 0x062f | 80 3b 95 c0 00 00 00 00 | 3904960 | 0
  5. 3 | 0x096f | 80 3b 95 c4 00 00 00 00 | 3904964 | 0
  6. 4 | 0x0caf | 80 3b 95 c8 00 00 00 00 | 3904968 | 0
  7. 5 | 0x0fef | 80 3b 95 cc 00 00 00 00 | 3904972 | 0
  8. 6 | 0x132f | 80 3b 95 d0 00 00 00 00 | 3904976 | 0
  9. 7 | 0x166f | 80 3b 95 d4 00 00 00 00 | 3904980 | 0
  10. 8 | 0x19af | 80 3b 95 d8 00 00 00 00 | 3904984 | 0
  11. 9 | 0x1cef | 80 3b 95 dc 00 00 00 00 | 3904988 | 0
  12. 10 | 0x202f | 80 3b 95 e0 00 00 00 00 | 3904992 | 0
  13. 11 | 0x236f | 80 3b 95 e4 00 00 00 00 | 3904996 | 0
  14. 12 | 0x26af | 80 3b 95 e8 00 00 00 00 | 3905000 | 0
  15. 13 | 0x29ef | 80 3b 95 ec 00 00 00 00 | 3905004 | 0
  16. 14 | 0x2d2f | 80 3b 95 f0 00 00 00 00 | 3905008 | 0
  17. 15 | 0x306f | 80 3b 95 f4 00 00 00 00 | 3905012 | 0
  18. 16 | 0x33af | 80 3b 95 f8 00 00 00 00 | 3905016 | 0
  19. 17 | 0x36ef | 80 3b 95 fc 00 00 00 00 | 3905020 | 0

OK,在slot 12找到了primary key 3905000

查看primary key 3905000的记录

  1. mysql> select * from sbtest1 where id=3905000 \G
  2. *************************** 1. row ***************************
  3. id: 3905000
  4. k: 7152454
  5. c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
  6. pad: 93314475890-72810819110-74153294523-75348581725-15287112137
  7. 1 row in set (0.00 sec)

打印256 bytes验证下是不是我们要找的记录 offset为 53781(161024)+0x26af=881157807 id = 3905000 = 0x3b95e8 k = 7152454 = 0x6d2346 剩下的是c和pad的字符串,内容是对的。

  1. hexdump -s 881157807 -n 256 -C /home/ming.lin/one_key_env_57/run_primary/dbs/testdb/sbtest1.ibd
  2. 348566af 80 3b 95 e8 00 00 00 00 0e d3 cb 00 00 00 0f 24 |.;.............$|
  3. 348566bf ec 80 6d 23 46 33 33 30 36 31 39 38 39 39 31 33 |..m#F33061989913|
  4. 348566cf 2d 30 31 39 37 38 39 39 36 31 35 32 2d 39 36 38 |-01978996152-968|
  5. 348566df 39 37 30 35 31 33 30 32 2d 36 36 38 30 34 30 35 |97051302-6680405|
  6. 348566ef 34 35 33 32 2d 33 36 36 35 38 32 30 30 39 30 33 |4532-36658200903|
  7. 348566ff 2d 37 35 32 36 35 39 35 32 37 37 37 2d 39 30 31 |-75265952777-901|
  8. 3485670f 36 32 36 37 30 35 34 37 2d 36 32 31 31 33 37 37 |62670547-6211377|
  9. 3485671f 35 35 35 35 2d 38 34 30 33 37 33 30 39 34 35 30 |5555-84037309450|
  10. 3485672f 2d 36 38 37 32 35 36 33 39 34 34 31 20 39 33 33 |-68725639441 933|
  11. 3485673f 31 34 34 37 35 38 39 30 2d 37 32 38 31 30 38 31 |14475890-7281081|
  12. 3485674f 39 31 31 30 2d 37 34 31 35 33 32 39 34 35 32 33 |9110-74153294523|
  13. 3485675f 2d 37 35 33 34 38 35 38 31 37 32 35 2d 31 35 32 |-75348581725-152|
  14. 3485676f 38 37 31 31 32 31 33 37 20 3c 78 00 01 90 00 d0 |87112137 <x.....|
  15. 3485677f 80 3b 95 e9 00 00 00 00 0e d3 cb 00 00 00 0f 24 |.;.............$|
  16. 3485678f f9 80 57 82 ca 37 38 33 32 36 37 32 37 31 32 39 |..W..78326727129|
  17. 3485679f 2d 32 30 30 39 36 39 37 37 37 38 30 2d 38 34 31 |-20096977780-841

至此我们手动找到了primary key为3905000的记录。

小结

page directory很关键,先在page directory中做binary search,定位到某个slot,再遍历slot上的record list