Bulk export

This page documents the options for exporting data out of YugabyteDB.

This page documents bulk export for YugabyteDB’s Cassandra-compatible YCQL API. To export data from a YugabyteDB (or even an Apache Cassandra) table, you can use the cassandra-unloader tool.

We will first create a source YugabyteDB table and populate it with data. Then we will export the data out using the cassandra-unloader tool. We will use a generic gaming user profile use case as a running example to illustrate the export process.

Create Source Table

Following is the schema of the source YugabyteDB table.

  CREATE KEYSPACE load;
  USE load;
  CREATE TABLE users(
    user_id varchar,
    score1 double,
    score2 double,
    points int,
    object_id varchar,
    PRIMARY KEY (user_id));

Generate Sample Data

  # Sample usage:
  # To generate a 10 GB (10240 MB) file:
  # % python gen_csv.py <outfile_name> <outfile_size_MB>
  # % python gen_csv.py file01.csv 10240
  #
  import numpy as np
  import uuid
  import os
  import sys

  outfile = sys.argv[1]  # output file name
  outsize_mb = int(sys.argv[2])  # target file size in MB
  print("Outfile = " + outfile)
  print("Outfile Size (MB) = " + str(outsize_mb))
  chunksize = 10000  # rows written per iteration
  # Open in text append mode so the formatted strings can be written directly.
  with open(outfile, 'a') as csvfile:
      # Keep appending chunks until the file reaches the requested size.
      while (os.path.getsize(outfile) // 1024**2) < outsize_mb:
          data = [[uuid.uuid4() for i in range(chunksize)],
                  np.random.random(chunksize) * 1000,
                  np.random.random(chunksize) * 50,
                  np.random.randint(1000000, size=(chunksize,)),
                  [uuid.uuid4() for i in range(chunksize)]]
          csvfile.writelines(['%s,%.6f,%.6f,%i,%s\n' % row for row in zip(*data)])
          csvfile.flush()  # make the size check above see the new rows

Sample rows generated by the script look like the following.

  $ head file00.csv
  3399bebc-d2cc-40c6-89d4-26102e08ff61,622.491927,40.262305,658257,44d73f8c-1d3c-424e-8fd2-d316c56b8454
  4f362eac-f79f-45f6-b6b1-bd5a81f931dc,141.344278,3.024717,694290,7768b010-8411-490a-b523-88cc3ec53cb5
  a24a6587-eea4-4907-ac7f-9f99dcac8f82,345.110599,3.869150,510943,5765d1d3-2855-4dbe-9f11-bb3b8631789f
  ...
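
As a quick sanity check, a generated row can be parsed back into the column types of the users table. The following is a minimal sketch, not part of the original tooling: the UserRow type and parse_row helper are hypothetical names introduced here for illustration, and the sketch assumes every row has exactly the five comma-separated fields shown above.

```python
import uuid
from typing import NamedTuple


class UserRow(NamedTuple):
    """Hypothetical record mirroring the columns of load.users."""
    user_id: str
    score1: float
    score2: float
    points: int
    object_id: str


def parse_row(line: str) -> UserRow:
    """Parse one CSV line produced by gen_csv.py into typed fields."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    user_id, score1, score2, points, object_id = fields
    # uuid.UUID raises ValueError if the text is not a well-formed UUID.
    uuid.UUID(user_id)
    uuid.UUID(object_id)
    return UserRow(user_id, float(score1), float(score2), int(points), object_id)
```

Calling parse_row on the first sample line above yields a record whose points field is the integer 658257; a line with a missing or extra field raises ValueError.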

To generate 5 CSV files of about 5 GB each, run the following commands.

  python ./gen_csv.py file00.csv 5120 &
  python ./gen_csv.py file01.csv 5120 &
  python ./gen_csv.py file02.csv 5120 &
  python ./gen_csv.py file03.csv 5120 &
  python ./gen_csv.py file04.csv 5120 &

Load Sample Data

cassandra-loader is a general-purpose bulk loader for CQL that supports various types of delimited files (particularly CSV files). For more details, review the README of the YugabyteDB cassandra-loader fork. Note that cassandra-loader requires quotes around collection types (for example, “[1,2,3]” rather than [1,2,3] for lists).

Install cassandra-loader

Download the tool and make it executable as shown below.

  $ wget https://github.com/yugabyte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-loader
  $ chmod a+x cassandra-loader

Run cassandra-loader

The files can be queued up for upload one at a time. Sample invocation:

  ./cassandra-loader \
    -schema "load.users(user_id, score1, score2, points, object_id)" \
    -boolStyle 1_0 \
    -numFutures 1000 \
    -rate 10000 \
    -queryTimeout 65 \
    -numRetries 10 \
    -progressRate 200000 \
    -host <clusterNodeIP> \
    -f file01.csv
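
To queue the files one at a time without retyping the command, the invocation can be wrapped in a small script. This is a rough sketch under stated assumptions: loader_command and load_files are hypothetical helper names, the flags are copied unchanged from the sample invocation above, and the cassandra-loader binary is assumed to sit in the current directory.

```python
import subprocess
from typing import List, Sequence


def loader_command(host: str, csv_file: str,
                   loader: str = "./cassandra-loader") -> List[str]:
    """Build the argument list for one cassandra-loader run.

    The flags mirror the sample invocation; only the host and the
    input file vary between runs.
    """
    return [
        loader,
        "-schema", "load.users(user_id, score1, score2, points, object_id)",
        "-boolStyle", "1_0",
        "-numFutures", "1000",
        "-rate", "10000",
        "-queryTimeout", "65",
        "-numRetries", "10",
        "-progressRate", "200000",
        "-host", host,
        "-f", csv_file,
    ]


def load_files(host: str, csv_files: Sequence[str]) -> None:
    """Run the loader once per file, in order, stopping at the first failure."""
    for csv_file in csv_files:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(loader_command(host, csv_file), check=True)
```

For example, load_files("<clusterNodeIP>", ["file00.csv", "file01.csv"]) would load the two files back to back. Passing the arguments as a list avoids shell quoting problems with the -schema value.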

For additional options to cassandra-loader, see the README of the YugabyteDB cassandra-loader fork.

Export Data

Install cassandra-unloader

  $ wget https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.27/cassandra-unloader
  $ chmod a+x cassandra-unloader

Run cassandra-unloader

  ./cassandra-unloader \
    -schema "load.users(user_id, score1, score2, points, object_id)" \
    -boolStyle 1_0 \
    -host <clusterNodeIP> \
    -f outfile.csv
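
After the export finishes, it can be useful to confirm that the output has the expected shape. The sketch below is an illustration only: summarize_export is a hypothetical helper, and it assumes the export is comma-delimited with one row per line, no header, and the five columns named in -schema. Because the exact output file names may depend on the tool's settings, the helper takes an explicit list of paths.

```python
import csv
from typing import Iterable, Tuple

EXPECTED_COLUMNS = 5  # user_id, score1, score2, points, object_id


def summarize_export(paths: Iterable[str]) -> Tuple[int, int]:
    """Return (total_rows, malformed_rows) across the exported files.

    A row counts as malformed when it does not have exactly
    EXPECTED_COLUMNS fields.
    """
    total = 0
    malformed = 0
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                total += 1
                if len(row) != EXPECTED_COLUMNS:
                    malformed += 1
    return total, malformed
```

Comparing the returned row count with the number of lines in the source CSV files gives a simple check that nothing was dropped.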

For additional options to cassandra-unloader, see the README of the cassandra-loader project.