5.3. Cassandra Connector

The Cassandra connector allows querying data stored in Cassandra.

Compatibility

The connector is compatible with all Cassandra versions starting from 2.1.5.

Configuration

To configure the Cassandra connector, create a catalog properties file etc/catalog/cassandra.properties with the following contents, replacing host1,host2 with a comma-separated list of the Cassandra nodes used to discover the cluster topology:

  connector.name=cassandra
  cassandra.contact-points=host1,host2

You will also need to set cassandra.native-protocol-port if your Cassandra nodes are not using the default port (9042).
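
For example, a catalog file for a cluster listening on a non-default port might look like the following sketch. The port 9142 is an illustrative assumption, not a value taken from this guide:

  connector.name=cassandra
  cassandra.contact-points=host1,host2
  cassandra.native-protocol-port=9142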

Multiple Cassandra Clusters

You can have as many catalogs as you need, so if you have additional Cassandra clusters, simply add another properties file to etc/catalog with a different name (making sure it ends in .properties). For example, if you name the properties file sales.properties, Presto will create a catalog named sales using the configured connector.
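
As a sketch of this setup, a second catalog file etc/catalog/sales.properties could contain the following, where sales-host1 and sales-host2 are hypothetical host names:

  connector.name=cassandra
  cassandra.contact-points=sales-host1,sales-host2

Once Presto has loaded the new catalog (typically after a restart), the keyspaces of that cluster should be visible as schemas of the sales catalog:

  SHOW SCHEMAS FROM sales;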

Configuration Properties

The following configuration properties are available:

Property Name | Description
cassandra.contact-points | Comma-separated list of hosts in a Cassandra cluster. The Cassandra driver will use these contact points to discover cluster topology. At least one Cassandra host is required.
cassandra.native-protocol-port | The Cassandra server port running the native client protocol (defaults to 9042).
cassandra.consistency-level | Consistency levels in Cassandra refer to the level of consistency to be used for both read and write operations. More information about consistency levels can be found in the Cassandra consistency documentation. This property defaults to a consistency level of ONE. Possible values include ALL, EACH_QUORUM, QUORUM, LOCAL_QUORUM, ONE, TWO, THREE, LOCAL_ONE, ANY, SERIAL, LOCAL_SERIAL.
cassandra.allow-drop-table | Set to true to allow dropping Cassandra tables from Presto via DROP TABLE (defaults to false).
cassandra.username | Username used for authentication to the Cassandra cluster. This is a global setting used for all connections, regardless of the user who is connected to Presto.
cassandra.password | Password used for authentication to the Cassandra cluster. This is a global setting used for all connections, regardless of the user who is connected to Presto.
cassandra.protocol-version | It is possible to override the protocol version for older Cassandra clusters. This property defaults to V3. Possible values include V2, V3 and V4.

Note

If authorization is enabled, cassandra.username must have enough permissions to perform SELECT queries on the system.size_estimates table.
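
A catalog that authenticates to Cassandra might combine these properties as in the sketch below. The user name presto_reader, the password placeholder, and the choice of LOCAL_QUORUM are assumptions made for illustration:

  connector.name=cassandra
  cassandra.contact-points=host1,host2
  cassandra.consistency-level=LOCAL_QUORUM
  cassandra.username=presto_reader
  cassandra.password=<password>

With Cassandra authorization enabled, the permission described in the note could be granted from cqlsh by a sufficiently privileged role, for example:

  cqlsh> GRANT SELECT ON TABLE system.size_estimates TO presto_reader;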

The following advanced configuration properties are available:

Property Name | Description
cassandra.fetch-size | Number of rows fetched at a time in a Cassandra query.
cassandra.partition-size-for-batch-select | Number of partitions batched together into a single select for a table with a single partition key column.
cassandra.split-size | Number of keys per split when querying Cassandra.
cassandra.splits-per-node | Number of splits per node. By default, the values from the system.size_estimates table are used. Only override when connecting to Cassandra versions < 2.1.5, which lack the system.size_estimates table.
cassandra.client.read-timeout | Maximum time the Cassandra driver will wait for an answer to a query from one Cassandra node. Note that the underlying Cassandra driver may retry a query against more than one node in the event of a read timeout. Increasing this may help with queries that use an index.
cassandra.client.connect-timeout | Maximum time the Cassandra driver will wait to establish a connection to a Cassandra node. Increasing this may help with heavily loaded Cassandra clusters.
cassandra.client.so-linger | Number of seconds to linger on close if unsent data is queued. If set to zero, the socket will be closed immediately. When this option is non-zero, a socket will linger that many seconds for an acknowledgement that all data was written to a peer. This option can be used to avoid consuming sockets on a Cassandra server by immediately closing connections when they are no longer needed.
cassandra.retry-policy | Policy used to retry failed requests to Cassandra. This property defaults to DEFAULT. Using BACKOFF may help when queries fail with “not enough replicas”. The other possible values are DOWNGRADING_CONSISTENCY and FALLTHROUGH.
cassandra.load-policy.use-dc-aware | Set to true to use DCAwareRoundRobinPolicy (defaults to false).
cassandra.load-policy.dc-aware.local-dc | The name of the local datacenter for DCAwareRoundRobinPolicy.
cassandra.load-policy.dc-aware.used-hosts-per-remote-dc | Number of hosts per remote datacenter to use as failover for the local hosts with DCAwareRoundRobinPolicy.
cassandra.load-policy.dc-aware.allow-remote-dc-for-local | Set to true to allow hosts in a remote datacenter to be used for local consistency levels.
cassandra.load-policy.use-token-aware | Set to true to use TokenAwarePolicy (defaults to false).
cassandra.load-policy.shuffle-replicas | Set to true to use TokenAwarePolicy with shuffling of replicas (defaults to false).
cassandra.load-policy.use-white-list | Set to true to use WhiteListPolicy (defaults to false).
cassandra.load-policy.white-list.addresses | Comma-separated list of hosts for WhiteListPolicy.
cassandra.no-host-available-retry-timeout | Retry timeout for NoHostAvailableException (defaults to 1m).
cassandra.speculative-execution.limit | The number of speculative executions (defaults to 1).
cassandra.speculative-execution.delay | The delay between each speculative execution (defaults to 500ms).
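
To illustrate how the load-balancing and retry properties fit together, the fragment below sketches a catalog for a multi-datacenter cluster. The datacenter name dc1 and all values are assumptions chosen for the example, and the timeout is assumed to use the same duration format as the defaults listed above (such as 500ms or 1m):

  cassandra.load-policy.use-dc-aware=true
  cassandra.load-policy.dc-aware.local-dc=dc1
  cassandra.load-policy.dc-aware.used-hosts-per-remote-dc=1
  cassandra.load-policy.use-token-aware=true
  cassandra.retry-policy=BACKOFF
  cassandra.client.read-timeout=30s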

Querying Cassandra Tables

The users table is an example Cassandra table from the Cassandra Getting Started guide. It can be created along with the mykeyspace keyspace using Cassandra’s cqlsh (CQL interactive terminal):

  cqlsh> CREATE KEYSPACE mykeyspace
     ... WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
  cqlsh> USE mykeyspace;
  cqlsh:mykeyspace> CREATE TABLE users (
                ...   user_id int PRIMARY KEY,
                ...   fname text,
                ...   lname text
                ... );
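
To have some data to query, a few rows can be inserted from the same cqlsh session. The values below are illustrative sample data, not part of the table definition:

  cqlsh:mykeyspace> INSERT INTO users (user_id, fname, lname) VALUES (1745, 'john', 'smith');
  cqlsh:mykeyspace> INSERT INTO users (user_id, fname, lname) VALUES (1746, 'jane', 'doe');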

This table can be described in Presto:

  DESCRIBE cassandra.mykeyspace.users;

   Column  |  Type   | Extra | Comment
  ---------+---------+-------+---------
   user_id | integer |       |
   fname   | varchar |       |
   lname   | varchar |       |
  (3 rows)

This table can then be queried in Presto:

  SELECT * FROM cassandra.mykeyspace.users;

Data types

The data type mappings are as follows:

Cassandra | Presto
----------+-------------
ASCII     | VARCHAR
BIGINT    | BIGINT
BLOB      | VARBINARY
BOOLEAN   | BOOLEAN
DECIMAL   | DOUBLE
DOUBLE    | DOUBLE
FLOAT     | DOUBLE
INET      | VARCHAR(45)
INT       | INTEGER
LIST<?>   | VARCHAR
MAP<?, ?> | VARCHAR
SET<?>    | VARCHAR
TEXT      | VARCHAR
TIMESTAMP | TIMESTAMP
TIMEUUID  | VARCHAR
VARCHAR   | VARCHAR
VARINT    | VARCHAR

Any collection (LIST/MAP/SET) can be designated as FROZEN, and the value is mapped to VARCHAR. Additionally, blobs have the limitation that they cannot be empty.
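
One way to check how a collection column is exposed is the typeof function. The sketch below assumes a hypothetical table mykeyspace.user_emails with a column emails declared as SET<TEXT> in Cassandra; neither name comes from this guide:

  SELECT typeof(emails) FROM cassandra.mykeyspace.user_emails LIMIT 1;

Given the mapping above, the query should report varchar, so the column can be used with Presto's string functions.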

Types not mentioned in the table above are not supported (e.g. tuple or UDT).

Partition keys can only be of the following types: ASCII, TEXT, VARCHAR, BIGINT, BOOLEAN, DOUBLE, INET, INT, FLOAT, DECIMAL, TIMESTAMP, UUID, and TIMEUUID.

Limitations

  • Queries without filters containing the partition key result in fetching all partitions. This causes a full scan of the entire data set, so such a query is much slower than a similar query that filters on the partition key.
  • IN list filters are only allowed on index (that is, partition key or clustering key) columns.
  • Range filters (<, > and BETWEEN) can be applied only to the partition keys.
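
The queries below sketch the first two rules against the users table from the earlier example, whose partition key is user_id. The literal values are illustrative:

  -- Filters on the partition key, so only the matching partition is fetched.
  SELECT * FROM cassandra.mykeyspace.users WHERE user_id = 1745;

  -- IN list on a partition key column, which is allowed.
  SELECT * FROM cassandra.mykeyspace.users WHERE user_id IN (1745, 1746);

  -- No filter on the partition key, so all partitions are fetched (full scan).
  SELECT * FROM cassandra.mykeyspace.users WHERE lname = 'smith';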