Part 8 - B-Tree Leaf Node Format

Part 7 - Introduction to the B-Tree

Part 9 - Binary Search and Duplicate Keys

We’re changing the format of our table from an unsorted array of rows to a B-Tree. This is a pretty big change that is going to take multiple articles to implement. By the end of this article, we’ll define the layout of a leaf node and support inserting key/value pairs into a single-node tree. But first, let’s recap the reasons for switching to a tree structure.

Alternative Table Formats

With the current format, each page stores only rows (no metadata) so it is pretty space efficient. Insertion is also fast because we just append to the end. However, finding a particular row can only be done by scanning the entire table. And if we want to delete a row, we have to fill in the hole by moving every row that comes after it.

If we stored the table as an array, but kept rows sorted by id, we could use binary search to find a particular id. However, insertion would be slow because we would have to move a lot of rows to make space.

Instead, we’re going with a tree structure. Each node in the tree can contain a variable number of rows, so we have to store some information in each node to keep track of how many rows it contains. Plus there is the storage overhead of all the internal nodes which don’t store any rows. In exchange for a larger database file, we get fast insertion, deletion and lookup.

Unsorted Array of rows Sorted Array of rows Tree of nodes
Pages contain only data only data metadata, primary keys, and data
Rows per page more more fewer
Insertion O(1) O(n) O(log(n))
Deletion O(n) O(n) O(log(n))
Lookup by id O(n) O(log(n)) O(log(n))

Node Header Format

Leaf nodes and internal nodes have different layouts. Let’s make an enum to keep track of node type:

  1. +enum NodeType_t { NODE_INTERNAL, NODE_LEAF };
  2. +typedef enum NodeType_t NodeType;

Each node will correspond to one page. Internal nodes will point to their children by storing the page number that stores the child. The btree asks the pager for a particular page number and gets back a pointer into the page cache. Pages are stored in the database file one after the other in order of page number.

Nodes need to store some metadata in a header at the beginning of the page. Every node will store what type of node it is, whether or not it is the root node, and a pointer to its parent (to allow finding a node’s siblings). I define constants for the size and offset of every header field:

  1. +/*
  2. + * Common Node Header Layout
  3. + */
  4. +const uint32_t NODE_TYPE_SIZE = sizeof(uint8_t);
  5. +const uint32_t NODE_TYPE_OFFSET = 0;
  6. +const uint32_t IS_ROOT_SIZE = sizeof(uint8_t);
  7. +const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE;
  8. +const uint32_t PARENT_POINTER_SIZE = sizeof(uint32_t);
  9. +const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE;
  10. +const uint8_t COMMON_NODE_HEADER_SIZE =
  11. + NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE;

Leaf Node Format

In addition to these common header fields, leaf nodes need to store how many “cells” they contain. A cell is a key/value pair.

  1. +/*
  2. + * Leaf Node Header Layout
  3. + */
  4. +const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t);
  5. +const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE;
  6. +const uint32_t LEAF_NODE_HEADER_SIZE =
  7. + COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE;

The body of a leaf node is an array of cells. Each cell is a key followed by a value (a serialized row).

  1. +/*
  2. + * Leaf Node Body Layout
  3. + */
  4. +const uint32_t LEAF_NODE_KEY_SIZE = sizeof(uint32_t);
  5. +const uint32_t LEAF_NODE_KEY_OFFSET = 0;
  6. +const uint32_t LEAF_NODE_VALUE_SIZE = ROW_SIZE;
  7. +const uint32_t LEAF_NODE_VALUE_OFFSET =
  8. + LEAF_NODE_KEY_OFFSET + LEAF_NODE_KEY_SIZE;
  9. +const uint32_t LEAF_NODE_CELL_SIZE = LEAF_NODE_KEY_SIZE + LEAF_NODE_VALUE_SIZE;
  10. +const uint32_t LEAF_NODE_SPACE_FOR_CELLS = PAGE_SIZE - LEAF_NODE_HEADER_SIZE;
  11. +const uint32_t LEAF_NODE_MAX_CELLS =
  12. + LEAF_NODE_SPACE_FOR_CELLS / LEAF_NODE_CELL_SIZE;

Based on these constants, here’s what the layout of a leaf node looks like currently:

|Our leaf node format

It’s a little space inefficient to use an entire byte per boolean value in the header, but this makes it easier to write code to access those values.

Also notice that there’s some wasted space at the end. We store as many cells as we can after the header, but the leftover space can’t hold an entire cell. We leave it empty to avoid splitting cells between nodes.

Accessing Leaf Node Fields

The code to access keys, values and metadata all involve pointer arithmetic using the constants we just defined.

  1. +uint32_t* leaf_node_num_cells(void* node) {
  2. + return (char *)node + LEAF_NODE_NUM_CELLS_OFFSET;
  3. +}
  4. +
  5. +void* leaf_node_cell(void* node, uint32_t cell_num) {
  6. + return (char *)node + LEAF_NODE_HEADER_SIZE + cell_num * LEAF_NODE_CELL_SIZE;
  7. +}
  8. +
  9. +uint32_t* leaf_node_key(void* node, uint32_t cell_num) {
  10. + return leaf_node_cell(node, cell_num);
  11. +}
  12. +
  13. +void* leaf_node_value(void* node, uint32_t cell_num) {
  14. + return leaf_node_cell(node, cell_num) + LEAF_NODE_KEY_SIZE;
  15. +}
  16. +
  17. +void initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; }
  18. +

These methods return a pointer to the value in question, so they can be used both as a getter and a setter.

Changes to Pager and Table Objects

Every node is going to take up exactly one page, even if it’s not full. That means our pager no longer needs to support reading/writing partial pages.

  1. -void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
  2. +void pager_flush(Pager* pager, uint32_t page_num) {
  3. if (pager->pages[page_num] == NULL) {
  4. printf("Tried to flush null page\n");
  5. exit(EXIT_FAILURE);
  6. @@ -242,7 +337,7 @@ void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
  7. }
  8. ssize_t bytes_written =
  9. - write(pager->file_descriptor, pager->pages[page_num], size);
  10. + write(pager->file_descriptor, pager->pages[page_num], PAGE_SIZE);
  11. if (bytes_written == -1) {
  12. printf("Error writing: %d\n", errno);
  1. void db_close(Table* table) {
  2. Pager* pager = table->pager;
  3. - uint32_t num_full_pages = table->num_rows / ROWS_PER_PAGE;
  4. - for (uint32_t i = 0; i < num_full_pages; i++) {
  5. + for (uint32_t i = 0; i < pager->num_pages; i++) {
  6. if (pager->pages[i] == NULL) {
  7. continue;
  8. }
  9. - pager_flush(pager, i, PAGE_SIZE);
  10. + pager_flush(pager, i);
  11. free(pager->pages[i]);
  12. pager->pages[i] = NULL;
  13. }
  14. - // There may be a partial page to write to the end of the file
  15. - // This should not be needed after we switch to a B-tree
  16. - uint32_t num_additional_rows = table->num_rows % ROWS_PER_PAGE;
  17. - if (num_additional_rows > 0) {
  18. - uint32_t page_num = num_full_pages;
  19. - if (pager->pages[page_num] != NULL) {
  20. - pager_flush(pager, page_num, num_additional_rows * ROW_SIZE);
  21. - free(pager->pages[page_num]);
  22. - pager->pages[page_num] = NULL;
  23. - }
  24. - }
  25. -
  26. int result = close(pager->file_descriptor);
  27. if (result == -1) {
  28. printf("Error closing db file.\n");

Now it makes more sense to store the number of pages in our database rather than the number of rows. The number of pages should be associated with the pager object, not the table, since it’s the number of pages used by the database, not a particular table. A btree is identified by its root node page number, so the table object needs to keep track of that.

  1. const uint32_t PAGE_SIZE = 4096;
  2. const uint32_t TABLE_MAX_PAGES = 100;
  3. -const uint32_t ROWS_PER_PAGE = PAGE_SIZE / ROW_SIZE;
  4. -const uint32_t TABLE_MAX_ROWS = ROWS_PER_PAGE * TABLE_MAX_PAGES;
  5. struct Pager_t {
  6. int file_descriptor;
  7. uint32_t file_length;
  8. + uint32_t num_pages;
  9. void* pages[TABLE_MAX_PAGES];
  10. };
  11. typedef struct Pager_t Pager;
  12. struct Table_t {
  13. Pager* pager;
  14. - uint32_t num_rows;
  15. + uint32_t root_page_num;
  16. };
  17. typedef struct Table_t Table;
  1. @@ -127,6 +200,10 @@ void* get_page(Pager* pager, uint32_t page_num) {
  2. }
  3. pager->pages[page_num] = page;
  4. +
  5. + if (page_num >= pager->num_pages) {
  6. + pager->num_pages = page_num + 1;
  7. + }
  8. }
  9. return pager->pages[page_num];
  1. @@ -184,6 +269,12 @@ Pager* pager_open(const char* filename) {
  2. Pager* pager = malloc(sizeof(Pager));
  3. pager->file_descriptor = fd;
  4. pager->file_length = file_length;
  5. + pager->num_pages = (file_length / PAGE_SIZE);
  6. +
  7. + if (file_length % PAGE_SIZE != 0) {
  8. + printf("Db file is not a whole number of pages. Corrupt file.\n");
  9. + exit(EXIT_FAILURE);
  10. + }
  11. for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
  12. pager->pages[i] = NULL;

Changes to the Cursor Object

A cursor represents a position in the table. When our table was a simple array of rows, we could access a row given just the row number. Now that it’s a tree, we identify a position by the page number of the node, and the cell number within that node.

  1. struct Cursor_t {
  2. Table* table;
  3. - uint32_t row_num;
  4. + uint32_t page_num;
  5. + uint32_t cell_num;
  6. bool end_of_table; // Indicates a position one past the last element
  7. };
  8. typedef struct Cursor_t Cursor;
  1. Cursor* table_start(Table* table) {
  2. Cursor* cursor = malloc(sizeof(Cursor));
  3. cursor->table = table;
  4. - cursor->row_num = 0;
  5. - cursor->end_of_table = (table->num_rows == 0);
  6. + cursor->page_num = table->root_page_num;
  7. + cursor->cell_num = 0;
  8. +
  9. + void* root_node = get_page(table->pager, table->root_page_num);
  10. + uint32_t num_cells = *leaf_node_num_cells(root_node);
  11. + cursor->end_of_table = (num_cells == 0);
  12. return cursor;
  13. }
  1. Cursor* table_end(Table* table) {
  2. Cursor* cursor = malloc(sizeof(Cursor));
  3. cursor->table = table;
  4. - cursor->row_num = table->num_rows;
  5. + cursor->page_num = table->root_page_num;
  6. +
  7. + void* root_node = get_page(table->pager, table->root_page_num);
  8. + uint32_t num_cells = *leaf_node_num_cells(root_node);
  9. + cursor->cell_num = num_cells;
  10. cursor->end_of_table = true;
  11. return cursor;
  12. }
  1. void* cursor_value(Cursor* cursor) {
  2. - uint32_t row_num = cursor->row_num;
  3. - uint32_t page_num = row_num / ROWS_PER_PAGE;
  4. + uint32_t page_num = cursor->page_num;
  5. void* page = get_page(cursor->table->pager, page_num);
  6. - uint32_t row_offset = row_num % ROWS_PER_PAGE;
  7. - uint32_t byte_offset = row_offset * ROW_SIZE;
  8. - return page + byte_offset;
  9. + return leaf_node_value(page, cursor->cell_num);
  10. }
  1. void cursor_advance(Cursor* cursor) {
  2. - cursor->row_num += 1;
  3. - if (cursor->row_num >= cursor->table->num_rows) {
  4. + uint32_t page_num = cursor->page_num;
  5. + void* node = get_page(cursor->table->pager, page_num);
  6. +
  7. + cursor->cell_num += 1;
  8. + if (cursor->cell_num >= (*leaf_node_num_cells(node))) {
  9. cursor->end_of_table = true;
  10. }
  11. }

Insertion Into a Leaf Node

In this article we’re only going to implement enough to get a single-node tree. Recall from last article that a tree starts out as an empty leaf node:

|empty btree

Key/value pairs can be added until the leaf node is full:

|one-node btree

When we open the database for the first time, the database file will be empty, so we initialize page 0 to be an empty leaf node (the root node):

  1. Table* db_open(const char* filename) {
  2. Pager* pager = pager_open(filename);
  3. - uint32_t num_rows = pager->file_length / ROW_SIZE;
  4. Table* table = malloc(sizeof(Table));
  5. table->pager = pager;
  6. - table->num_rows = num_rows;
  7. + table->root_page_num = 0;
  8. +
  9. + if (pager->num_pages == 0) {
  10. + // New database file. Initialize page 0 as leaf node.
  11. + void* root_node = get_page(pager, 0);
  12. + initialize_leaf_node(root_node);
  13. + }
  14. return table;
  15. }

Next we’ll make a function for inserting a key/value pair into a leaf node. It will take a cursor as input to represent the position where the pair should be inserted.

  1. +void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) {
  2. + void* node = get_page(cursor->table->pager, cursor->page_num);
  3. +
  4. + uint32_t num_cells = *leaf_node_num_cells(node);
  5. + if (num_cells >= LEAF_NODE_MAX_CELLS) {
  6. + // Node full
  7. + printf("Need to implement splitting a leaf node.\n");
  8. + exit(EXIT_FAILURE);
  9. + }
  10. +
  11. + if (cursor->cell_num < num_cells) {
  12. + // Make room for new cell
  13. + for (uint32_t i = num_cells; i > cursor->cell_num; i--) {
  14. + memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
  15. + LEAF_NODE_CELL_SIZE);
  16. + }
  17. + }
  18. +
  19. + *(leaf_node_num_cells(node)) += 1;
  20. + *(leaf_node_key(node, cursor->cell_num)) = key;
  21. + serialize_row(value, leaf_node_value(node, cursor->cell_num));
  22. +}
  23. +

We haven’t implemented splitting yet, so we error if the node is full. Next we shift cells one space to the right to make room for the new cell. Then we write the new key/value into the empty space.

Since we assume the tree only has one node, our execute_insert() function simply needs to call this helper method:

  1. ExecuteResult execute_insert(Statement* statement, Table* table) {
  2. - if (table->num_rows >= TABLE_MAX_ROWS) {
  3. + void* node = get_page(table->pager, table->root_page_num);
  4. + if ((*leaf_node_num_cells(node) >= LEAF_NODE_MAX_CELLS)) {
  5. return EXECUTE_TABLE_FULL;
  6. }
  7. Row* row_to_insert = &(statement->row_to_insert);
  8. Cursor* cursor = table_end(table);
  9. - serialize_row(row_to_insert, cursor_value(cursor));
  10. - table->num_rows += 1;
  11. + leaf_node_insert(cursor, row_to_insert->id, row_to_insert);
  12. free(cursor);

With those changes, our database should work as before! Except now it returns a “Table Full” error much sooner, since we can’t split the root node yet.

How many rows can the leaf node hold?

Command to Print Constants

I’m adding a new meta command to print out a few constants of interest.

  1. +void print_constants() {
  2. + printf("ROW_SIZE: %d\n", ROW_SIZE);
  3. + printf("COMMON_NODE_HEADER_SIZE: %d\n", COMMON_NODE_HEADER_SIZE);
  4. + printf("LEAF_NODE_HEADER_SIZE: %d\n", LEAF_NODE_HEADER_SIZE);
  5. + printf("LEAF_NODE_CELL_SIZE: %d\n", LEAF_NODE_CELL_SIZE);
  6. + printf("LEAF_NODE_SPACE_FOR_CELLS: %d\n", LEAF_NODE_SPACE_FOR_CELLS);
  7. + printf("LEAF_NODE_MAX_CELLS: %d\n", LEAF_NODE_MAX_CELLS);
  8. +}
  9. +
  10. @@ -294,6 +376,14 @@ MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) {
  11. if (strcmp(input_buffer->buffer, ".exit") == 0) {
  12. db_close(table);
  13. exit(EXIT_SUCCESS);
  14. + } else if (strcmp(input_buffer->buffer, ".constants") == 0) {
  15. + printf("Constants:\n");
  16. + print_constants();
  17. + return META_COMMAND_SUCCESS;
  18. } else {
  19. return META_COMMAND_UNRECOGNIZED_COMMAND;
  20. }

I’m also adding a test so we get alerted when those constants change:

  1. + it 'prints constants' do
  2. + script = [
  3. + ".constants",
  4. + ".exit",
  5. + ]
  6. + result = run_script(script)
  7. +
  8. + expect(result).to match_array([
  9. + "db > Constants:",
  10. + "ROW_SIZE: 293",
  11. + "COMMON_NODE_HEADER_SIZE: 6",
  12. + "LEAF_NODE_HEADER_SIZE: 10",
  13. + "LEAF_NODE_CELL_SIZE: 297",
  14. + "LEAF_NODE_SPACE_FOR_CELLS: 4086",
  15. + "LEAF_NODE_MAX_CELLS: 13",
  16. + "db > ",
  17. + ])
  18. + end

So our table can hold 13 rows right now!

Tree Visualization

To help with debugging and visualization, I’m also adding a meta command to print out a representation of the btree.

  1. +void print_leaf_node(void* node) {
  2. + uint32_t num_cells = *leaf_node_num_cells(node);
  3. + printf("leaf (size %d)\n", num_cells);
  4. + for (uint32_t i = 0; i < num_cells; i++) {
  5. + uint32_t key = *leaf_node_key(node, i);
  6. + printf(" - %d : %d\n", i, key);
  7. + }
  8. +}
  9. +
  1. @@ -294,6 +376,14 @@ MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) {
  2. if (strcmp(input_buffer->buffer, ".exit") == 0) {
  3. db_close(table);
  4. exit(EXIT_SUCCESS);
  5. + } else if (strcmp(input_buffer->buffer, ".btree") == 0) {
  6. + printf("Tree:\n");
  7. + print_leaf_node(get_page(table->pager, 0));
  8. + return META_COMMAND_SUCCESS;
  9. } else if (strcmp(input_buffer->buffer, ".constants") == 0) {
  10. printf("Constants:\n");
  11. print_constants();
  12. return META_COMMAND_SUCCESS;
  13. } else {
  14. return META_COMMAND_UNRECOGNIZED_COMMAND;
  15. }

And a test

  1. + it 'allows printing out the structure of a one-node btree' do
  2. + script = [3, 1, 2].map do |i|
  3. + "insert #{i} user#{i} person#{i}@example.com"
  4. + end
  5. + script << ".btree"
  6. + script << ".exit"
  7. + result = run_script(script)
  8. +
  9. + expect(result).to match_array([
  10. + "db > Executed.",
  11. + "db > Executed.",
  12. + "db > Executed.",
  13. + "db > Tree:",
  14. + "leaf (size 3)",
  15. + " - 0 : 3",
  16. + " - 1 : 1",
  17. + " - 2 : 2",
  18. + "db > "
  19. + ])
  20. + end

Uh oh, we’re still not storing rows in sorted order. You’ll notice that execute_insert() inserts into the leaf node at the position returned by table_end(). So rows are stored in the order they were inserted, just like before.

Next Time

This all might seem like a step backwards. Our database now stores fewer rows than it did before, and we’re still storing rows in unsorted order. But like I said at the beginning, this is a big change and it’s important to break it up into manageable steps.

Next time, we’ll implement finding a record by primary key, and start storing rows in sorted order.

Complete Diff

  1. const uint32_t PAGE_SIZE = 4096;
  2. const uint32_t TABLE_MAX_PAGES = 100;
  3. -const uint32_t ROWS_PER_PAGE = PAGE_SIZE / ROW_SIZE;
  4. -const uint32_t TABLE_MAX_ROWS = ROWS_PER_PAGE * TABLE_MAX_PAGES;
  5. struct Pager_t {
  6. int file_descriptor;
  7. uint32_t file_length;
  8. + uint32_t num_pages;
  9. void* pages[TABLE_MAX_PAGES];
  10. };
  11. typedef struct Pager_t Pager;
  12. struct Table_t {
  13. Pager* pager;
  14. - uint32_t num_rows;
  15. + uint32_t root_page_num;
  16. };
  17. typedef struct Table_t Table;
  18. struct Cursor_t {
  19. Table* table;
  20. - uint32_t row_num;
  21. + uint32_t page_num;
  22. + uint32_t cell_num;
  23. bool end_of_table; // Indicates a position one past the last element
  24. };
  25. typedef struct Cursor_t Cursor;
  26. @@ -88,6 +88,77 @@ void print_row(Row* row) {
  27. printf("(%d, %s, %s)\n", row->id, row->username, row->email);
  28. }
  29. +enum NodeType_t { NODE_INTERNAL, NODE_LEAF };
  30. +typedef enum NodeType_t NodeType;
  31. +
  32. +/*
  33. + * Common Node Header Layout
  34. + */
  35. +const uint32_t NODE_TYPE_SIZE = sizeof(uint8_t);
  36. +const uint32_t NODE_TYPE_OFFSET = 0;
  37. +const uint32_t IS_ROOT_SIZE = sizeof(uint8_t);
  38. +const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE;
  39. +const uint32_t PARENT_POINTER_SIZE = sizeof(uint32_t);
  40. +const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE;
  41. +const uint8_t COMMON_NODE_HEADER_SIZE =
  42. + NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE;
  43. +
  44. +/*
  45. + * Leaf Node Header Layout
  46. + */
  47. +const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t);
  48. +const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE;
  49. +const uint32_t LEAF_NODE_HEADER_SIZE =
  50. + COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE;
  51. +
  52. +/*
  53. + * Leaf Node Body Layout
  54. + */
  55. +const uint32_t LEAF_NODE_KEY_SIZE = sizeof(uint32_t);
  56. +const uint32_t LEAF_NODE_KEY_OFFSET = 0;
  57. +const uint32_t LEAF_NODE_VALUE_SIZE = ROW_SIZE;
  58. +const uint32_t LEAF_NODE_VALUE_OFFSET =
  59. + LEAF_NODE_KEY_OFFSET + LEAF_NODE_KEY_SIZE;
  60. +const uint32_t LEAF_NODE_CELL_SIZE = LEAF_NODE_KEY_SIZE + LEAF_NODE_VALUE_SIZE;
  61. +const uint32_t LEAF_NODE_SPACE_FOR_CELLS = PAGE_SIZE - LEAF_NODE_HEADER_SIZE;
  62. +const uint32_t LEAF_NODE_MAX_CELLS =
  63. + LEAF_NODE_SPACE_FOR_CELLS / LEAF_NODE_CELL_SIZE;
  64. +
  65. +uint32_t* leaf_node_num_cells(void* node) {
  66. + return node + LEAF_NODE_NUM_CELLS_OFFSET;
  67. +}
  68. +
  69. +void* leaf_node_cell(void* node, uint32_t cell_num) {
  70. + return node + LEAF_NODE_HEADER_SIZE + cell_num * LEAF_NODE_CELL_SIZE;
  71. +}
  72. +
  73. +uint32_t* leaf_node_key(void* node, uint32_t cell_num) {
  74. + return leaf_node_cell(node, cell_num);
  75. +}
  76. +
  77. +void* leaf_node_value(void* node, uint32_t cell_num) {
  78. + return leaf_node_cell(node, cell_num) + LEAF_NODE_KEY_SIZE;
  79. +}
  80. +
  81. +void print_constants() {
  82. + printf("ROW_SIZE: %d\n", ROW_SIZE);
  83. + printf("COMMON_NODE_HEADER_SIZE: %d\n", COMMON_NODE_HEADER_SIZE);
  84. + printf("LEAF_NODE_HEADER_SIZE: %d\n", LEAF_NODE_HEADER_SIZE);
  85. + printf("LEAF_NODE_CELL_SIZE: %d\n", LEAF_NODE_CELL_SIZE);
  86. + printf("LEAF_NODE_SPACE_FOR_CELLS: %d\n", LEAF_NODE_SPACE_FOR_CELLS);
  87. + printf("LEAF_NODE_MAX_CELLS: %d\n", LEAF_NODE_MAX_CELLS);
  88. +}
  89. +
  90. +void print_leaf_node(void* node) {
  91. + uint32_t num_cells = *leaf_node_num_cells(node);
  92. + printf("leaf (size %d)\n", num_cells);
  93. + for (uint32_t i = 0; i < num_cells; i++) {
  94. + uint32_t key = *leaf_node_key(node, i);
  95. + printf(" - %d : %d\n", i, key);
  96. + }
  97. +}
  98. +
  99. void serialize_row(Row* source, void* destination) {
  100. memcpy(destination + ID_OFFSET, &(source->id), ID_SIZE);
  101. memcpy(destination + USERNAME_OFFSET, &(source->username), USERNAME_SIZE);
  102. @@ -100,6 +171,8 @@ void deserialize_row(void* source, Row* destination) {
  103. memcpy(&(destination->email), source + EMAIL_OFFSET, EMAIL_SIZE);
  104. }
  105. +void initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; }
  106. +
  107. void* get_page(Pager* pager, uint32_t page_num) {
  108. if (page_num > TABLE_MAX_PAGES) {
  109. printf("Tried to fetch page number out of bounds. %d > %d\n", page_num,
  110. @@ -127,6 +200,10 @@ void* get_page(Pager* pager, uint32_t page_num) {
  111. }
  112. pager->pages[page_num] = page;
  113. +
  114. + if (page_num >= pager->num_pages) {
  115. + pager->num_pages = page_num + 1;
  116. + }
  117. }
  118. return pager->pages[page_num];
  119. @@ -135,8 +212,12 @@ void* get_page(Pager* pager, uint32_t page_num) {
  120. Cursor* table_start(Table* table) {
  121. Cursor* cursor = malloc(sizeof(Cursor));
  122. cursor->table = table;
  123. - cursor->row_num = 0;
  124. - cursor->end_of_table = (table->num_rows == 0);
  125. + cursor->page_num = table->root_page_num;
  126. + cursor->cell_num = 0;
  127. +
  128. + void* root_node = get_page(table->pager, table->root_page_num);
  129. + uint32_t num_cells = *leaf_node_num_cells(root_node);
  130. + cursor->end_of_table = (num_cells == 0);
  131. return cursor;
  132. }
  133. @@ -144,24 +225,28 @@ Cursor* table_start(Table* table) {
  134. Cursor* table_end(Table* table) {
  135. Cursor* cursor = malloc(sizeof(Cursor));
  136. cursor->table = table;
  137. - cursor->row_num = table->num_rows;
  138. + cursor->page_num = table->root_page_num;
  139. +
  140. + void* root_node = get_page(table->pager, table->root_page_num);
  141. + uint32_t num_cells = *leaf_node_num_cells(root_node);
  142. + cursor->cell_num = num_cells;
  143. cursor->end_of_table = true;
  144. return cursor;
  145. }
  146. void* cursor_value(Cursor* cursor) {
  147. - uint32_t row_num = cursor->row_num;
  148. - uint32_t page_num = row_num / ROWS_PER_PAGE;
  149. + uint32_t page_num = cursor->page_num;
  150. void* page = get_page(cursor->table->pager, page_num);
  151. - uint32_t row_offset = row_num % ROWS_PER_PAGE;
  152. - uint32_t byte_offset = row_offset * ROW_SIZE;
  153. - return page + byte_offset;
  154. + return leaf_node_value(page, cursor->cell_num);
  155. }
  156. void cursor_advance(Cursor* cursor) {
  157. - cursor->row_num += 1;
  158. - if (cursor->row_num >= cursor->table->num_rows) {
  159. + uint32_t page_num = cursor->page_num;
  160. + void* node = get_page(cursor->table->pager, page_num);
  161. +
  162. + cursor->cell_num += 1;
  163. + if (cursor->cell_num >= (*leaf_node_num_cells(node))) {
  164. cursor->end_of_table = true;
  165. }
  166. }
  167. @@ -184,6 +269,12 @@ Pager* pager_open(const char* filename) {
  168. Pager* pager = malloc(sizeof(Pager));
  169. pager->file_descriptor = fd;
  170. pager->file_length = file_length;
  171. + pager->num_pages = (file_length / PAGE_SIZE);
  172. +
  173. + if (file_length % PAGE_SIZE != 0) {
  174. + printf("Db file is not a whole number of pages. Corrupt file.\n");
  175. + exit(EXIT_FAILURE);
  176. + }
  177. for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
  178. pager->pages[i] = NULL;
  179. @@ -194,11 +285,15 @@ Pager* pager_open(const char* filename) {
  180. Table* db_open(const char* filename) {
  181. Pager* pager = pager_open(filename);
  182. - uint32_t num_rows = pager->file_length / ROW_SIZE;
  183. Table* table = malloc(sizeof(Table));
  184. table->pager = pager;
  185. - table->num_rows = num_rows;
  186. + table->root_page_num = 0;
  187. +
  188. + if (pager->num_pages == 0) {
  189. + // New database file. Initialize page 0 as leaf node.
  190. + void* root_node = get_page(pager, 0);
  191. + initialize_leaf_node(root_node);
  192. + }
  193. return table;
  194. }
  195. @@ -228,7 +323,7 @@ void read_input(InputBuffer* input_buffer) {
  196. input_buffer->buffer[bytes_read - 1] = 0;
  197. }
  198. -void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
  199. +void pager_flush(Pager* pager, uint32_t page_num) {
  200. if (pager->pages[page_num] == NULL) {
  201. printf("Tried to flush null page\n");
  202. exit(EXIT_FAILURE);
  203. @@ -242,7 +337,7 @@ void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
  204. }
  205. ssize_t bytes_written =
  206. - write(pager->file_descriptor, pager->pages[page_num], size);
  207. + write(pager->file_descriptor, pager->pages[page_num], PAGE_SIZE);
  208. if (bytes_written == -1) {
  209. printf("Error writing: %d\n", errno);
  210. @@ -252,29 +347,16 @@ void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
  211. void db_close(Table* table) {
  212. Pager* pager = table->pager;
  213. - uint32_t num_full_pages = table->num_rows / ROWS_PER_PAGE;
  214. - for (uint32_t i = 0; i < num_full_pages; i++) {
  215. + for (uint32_t i = 0; i < pager->num_pages; i++) {
  216. if (pager->pages[i] == NULL) {
  217. continue;
  218. }
  219. - pager_flush(pager, i, PAGE_SIZE);
  220. + pager_flush(pager, i);
  221. free(pager->pages[i]);
  222. pager->pages[i] = NULL;
  223. }
  224. - // There may be a partial page to write to the end of the file
  225. - // This should not be needed after we switch to a B-tree
  226. - uint32_t num_additional_rows = table->num_rows % ROWS_PER_PAGE;
  227. - if (num_additional_rows > 0) {
  228. - uint32_t page_num = num_full_pages;
  229. - if (pager->pages[page_num] != NULL) {
  230. - pager_flush(pager, page_num, num_additional_rows * ROW_SIZE);
  231. - free(pager->pages[page_num]);
  232. - pager->pages[page_num] = NULL;
  233. - }
  234. - }
  235. -
  236. int result = close(pager->file_descriptor);
  237. if (result == -1) {
  238. printf("Error closing db file.\n");
  239. @@ -294,6 +376,14 @@ MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) {
  240. if (strcmp(input_buffer->buffer, ".exit") == 0) {
  241. db_close(table);
  242. exit(EXIT_SUCCESS);
  243. + } else if (strcmp(input_buffer->buffer, ".btree") == 0) {
  244. + printf("Tree:\n");
  245. + print_leaf_node(get_page(table->pager, 0));
  246. + return META_COMMAND_SUCCESS;
  247. + } else if (strcmp(input_buffer->buffer, ".constants") == 0) {
  248. + printf("Constants:\n");
  249. + print_constants();
  250. + return META_COMMAND_SUCCESS;
  251. } else {
  252. return META_COMMAND_UNRECOGNIZED_COMMAND;
  253. }
  254. @@ -342,16 +432,39 @@ PrepareResult prepare_statement(InputBuffer* input_buffer,
  255. return PREPARE_UNRECOGNIZED_STATEMENT;
  256. }
  257. +void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) {
  258. + void* node = get_page(cursor->table->pager, cursor->page_num);
  259. +
  260. + uint32_t num_cells = *leaf_node_num_cells(node);
  261. + if (num_cells >= LEAF_NODE_MAX_CELLS) {
  262. + // Node full
  263. + printf("Need to implement splitting a leaf node.\n");
  264. + exit(EXIT_FAILURE);
  265. + }
  266. +
  267. + if (cursor->cell_num < num_cells) {
  268. + // Make room for new cell
  269. + for (uint32_t i = num_cells; i > cursor->cell_num; i--) {
  270. + memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
  271. + LEAF_NODE_CELL_SIZE);
  272. + }
  273. + }
  274. +
  275. + *(leaf_node_num_cells(node)) += 1;
  276. + *(leaf_node_key(node, cursor->cell_num)) = key;
  277. + serialize_row(value, leaf_node_value(node, cursor->cell_num));
  278. +}
  279. +
  280. ExecuteResult execute_insert(Statement* statement, Table* table) {
  281. - if (table->num_rows >= TABLE_MAX_ROWS) {
  282. + void* node = get_page(table->pager, table->root_page_num);
  283. + if ((*leaf_node_num_cells(node) >= LEAF_NODE_MAX_CELLS)) {
  284. return EXECUTE_TABLE_FULL;
  285. }
  286. Row* row_to_insert = &(statement->row_to_insert);
  287. Cursor* cursor = table_end(table);
  288. - serialize_row(row_to_insert, cursor_value(cursor));
  289. - table->num_rows += 1;
  290. + leaf_node_insert(cursor, row_to_insert->id, row_to_insert);
  291. free(cursor);

And the specs:

  1. + it 'allows printing out the structure of a one-node btree' do
  2. + script = [3, 1, 2].map do |i|
  3. + "insert #{i} user#{i} person#{i}@example.com"
  4. + end
  5. + script << ".btree"
  6. + script << ".exit"
  7. + result = run_script(script)
  8. +
  9. + expect(result).to match_array([
  10. + "db > Executed.",
  11. + "db > Executed.",
  12. + "db > Executed.",
  13. + "db > Tree:",
  14. + "leaf (size 3)",
  15. + " - 0 : 3",
  16. + " - 1 : 1",
  17. + " - 2 : 2",
  18. + "db > "
  19. + ])
  20. + end
  21. +
  22. + it 'prints constants' do
  23. + script = [
  24. + ".constants",
  25. + ".exit",
  26. + ]
  27. + result = run_script(script)
  28. +
  29. + expect(result).to match_array([
  30. + "db > Constants:",
  31. + "ROW_SIZE: 293",
  32. + "COMMON_NODE_HEADER_SIZE: 6",
  33. + "LEAF_NODE_HEADER_SIZE: 10",
  34. + "LEAF_NODE_CELL_SIZE: 297",
  35. + "LEAF_NODE_SPACE_FOR_CELLS: 4086",
  36. + "LEAF_NODE_MAX_CELLS: 13",
  37. + "db > ",
  38. + ])
  39. + end
  40. end

Part 7 - Introduction to the B-Tree

Part 9 - Binary Search and Duplicate Keys

原文: https://cstack.github.io/db_tutorial/parts/part8.html