Bulk inserts

There are a couple of ways you can load lots of data quickly. The naive approach is to simply call Model.create() in a loop:

  data_source = [
      {'field1': 'val1-1', 'field2': 'val1-2'},
      {'field1': 'val2-1', 'field2': 'val2-2'},
      # ...
  ]

  for data_dict in data_source:
      MyModel.create(**data_dict)

The above approach is slow for a couple of reasons:

  1. If you are not wrapping the loop in a transaction, then each call to create() happens in its own transaction. That is going to be really slow!
  2. There is a decent amount of Python logic getting in your way, and each InsertQuery must be generated and parsed into SQL.
  3. That’s a lot of data (in terms of raw bytes of SQL) you are sending to your database to parse.
  4. We are retrieving the last insert id, which causes an additional query to be executed in some cases.

You can get a very significant speedup by simply wrapping this in atomic().

  # This is much faster.
  with db.atomic():
      for data_dict in data_source:
          MyModel.create(**data_dict)

The above code still suffers from points 2, 3 and 4. We can get another big boost by calling insert_many(). This method accepts a list of tuples or dictionaries to insert.

  # Fastest.
  MyModel.insert_many(data_source).execute()

  # Fastest using tuples and specifying the fields being inserted.
  fields = [MyModel.field1, MyModel.field2]
  data = [('val1-1', 'val1-2'),
          ('val2-1', 'val2-2'),
          ('val3-1', 'val3-2')]
  MyModel.insert_many(data, fields=fields).execute()

  # You can, of course, wrap this in a transaction as well:
  with db.atomic():
      MyModel.insert_many(data, fields=fields).execute()

Depending on the number of rows in your data source, you may need to break it up into chunks:

  # Insert rows 100 at a time.
  with db.atomic():
      for idx in range(0, len(data_source), 100):
          MyModel.insert_many(data_source[idx:idx + 100]).execute()
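
If you prefer not to slice manually, newer peewee releases also provide a chunked() helper (importable from the peewee module) that yields fixed-size batches. Assuming your installed version includes it, the following sketch is equivalent to the loop above:

  from peewee import chunked

  # Insert rows 100 at a time, letting chunked() produce the batches.
  with db.atomic():
      for batch in chunked(data_source, 100):
          MyModel.insert_many(batch).execute()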

If Model.insert_many() won’t work for your use-case, you can also use the Database.batch_commit() helper to process chunks of rows inside transactions:

  # List of row data to insert.
  row_data = [{'username': 'u1'}, {'username': 'u2'}, ...]

  # Assume there are 789 items in row_data. The following code will result in
  # 8 total transactions (7x100 rows + 1x89 rows).
  for row in db.batch_commit(row_data, 100):
      User.create(**row)

Note

SQLite users should be aware of some caveats when using bulk inserts. Specifically, your SQLite3 version must be 3.7.11.0 or newer to take advantage of the bulk insert API. Additionally, by default SQLite limits the number of bound variables in a SQL query to 999. This value can be modified by setting the SQLITE_MAX_VARIABLE_NUMBER flag.
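
To stay under that default limit, one approach is to derive the batch size from the number of columns in each row, so that rows-per-batch multiplied by columns-per-row never reaches 999. A minimal sketch, reusing the two-field data_source dictionaries from the first example:

  # Each row binds one variable per column, so keep
  # (rows per batch) * (columns per row) below SQLite's default of 999.
  columns_per_row = 2  # field1 and field2
  batch_size = 999 // columns_per_row

  with db.atomic():
      for idx in range(0, len(data_source), batch_size):
          MyModel.insert_many(data_source[idx:idx + batch_size]).execute()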

If the data you would like to bulk load is stored in another table, you can also create INSERT queries whose source is a SELECT query. Use the Model.insert_from() method:

  res = (TweetArchive
         .insert_from(
             Tweet.select(Tweet.user, Tweet.message),
             fields=[Tweet.user, Tweet.message])
         .execute())