Pulsar SQL configuration and deployment

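You can configure the Presto Pulsar connector and deploy a cluster by following the instructions below.

Configure Presto Pulsar connector

You can configure the Presto Pulsar connector in the ${project.root}/conf/presto/catalog/pulsar.properties properties file. The connector configuration and its default values are as follows.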

  # name of the connector to be displayed in the catalog
  connector.name=pulsar
  # the url of Pulsar broker service
  pulsar.web-service-url=http://localhost:8080
  # URI of Zookeeper cluster
  pulsar.zookeeper-uri=localhost:2181
  # minimum number of entries to read at a single time
  pulsar.entry-read-batch-size=100
  # default number of splits to use per query
  pulsar.target-num-splits=4

Presto can connect to a Pulsar cluster through multiple hosts. To configure multiple hosts for brokers, add multiple URLs to pulsar.web-service-url. To configure multiple hosts for ZooKeeper, add multiple URIs to pulsar.zookeeper-uri. The following is an example.

  pulsar.web-service-url=http://localhost:8080,localhost:8081,localhost:8082
  pulsar.zookeeper-uri=localhost1,localhost2:2181
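
As a rough illustration of the format above, the following shell sketch builds the comma-separated host lists from a set of host names. The join_hosts helper and the host names broker1..broker3 and zk1..zk3 are hypothetical; the layout (scheme only on the first broker entry) simply mirrors the example above.

```shell
# join_hosts PORT HOST... : print "host1:PORT,host2:PORT,..."
# (hypothetical helper, not part of Pulsar)
join_hosts() {
  port=$1
  shift
  out=""
  for h in "$@"; do
    # Prepend a comma only when something has already been collected.
    out="${out:+$out,}$h:$port"
  done
  printf '%s\n' "$out"
}

# Hypothetical host names; replace them with your own brokers and ZooKeeper nodes.
printf 'pulsar.web-service-url=http://%s\n' "$(join_hosts 8080 broker1 broker2 broker3)"
printf 'pulsar.zookeeper-uri=%s\n' "$(join_hosts 2181 zk1 zk2 zk3)"
```

The two printed lines can then be pasted into pulsar.properties.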

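Note: by default, Pulsar SQL does not get the last message in a topic. This is by design and is controlled by settings. By default, the BookKeeper LAC only advances when subsequent entries are added. If no subsequent entry is added, the last written entry is not visible to readers until the ledger is closed. This is not a problem for Pulsar, which uses managed ledgers, but Pulsar SQL reads directly from BookKeeper ledgers.

If you want to get the last message in a topic, set the following configurations: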

  1. For the broker configuration, set bookkeeperExplicitLacIntervalInMills > 0 in broker.conf or standalone.conf.

  2. For the Presto configuration, set pulsar.bookkeeper-explicit-interval > 0 and pulsar.bookkeeper-use-v2-protocol=false.

However, because it uses Protobuf, the BookKeeper V3 protocol introduces additional GC overhead to BK.
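
The two settings above could be applied with a small script along these lines. This is only a sketch: set_prop and enable_last_message_reads are hypothetical helpers, the interval of 1000 is an arbitrary positive example value, and the conf/broker.conf location relative to the installation directory is an assumption about your layout.

```shell
# set_prop FILE KEY VALUE: replace the line "KEY=..." if present, otherwise append it.
# (hypothetical helper, not part of Pulsar)
set_prop() {
  sp_file=$1 sp_key=$2 sp_value=$3
  sp_tmp=$(mktemp) || return 1
  awk -F= -v key="$sp_key" -v value="$sp_value" '
    $1 == key { print key "=" value; found = 1; next }
    { print }
    END { if (!found) print key "=" value }
  ' "$sp_file" > "$sp_tmp" && cat "$sp_tmp" > "$sp_file"
  sp_status=$?
  rm -f "$sp_tmp"
  return $sp_status
}

# enable_last_message_reads PULSAR_HOME [INTERVAL]
# INTERVAL defaults to 1000, an assumed example value; any value > 0 enables the feature.
enable_last_message_reads() {
  pulsar_home=$1
  interval=${2:-1000}
  # Broker side: write the LAC explicitly at a fixed interval.
  set_prop "$pulsar_home/conf/broker.conf" \
    bookkeeperExplicitLacIntervalInMills "$interval" &&
  # Presto side: read the explicit LAC, which requires turning the V2 protocol off.
  set_prop "$pulsar_home/conf/presto/catalog/pulsar.properties" \
    pulsar.bookkeeper-explicit-interval "$interval" &&
  set_prop "$pulsar_home/conf/presto/catalog/pulsar.properties" \
    pulsar.bookkeeper-use-v2-protocol false
}
```

For example, `enable_last_message_reads /path/to/pulsar` edits both files in place; restart the broker and the SQL worker afterwards so that the new values are read.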

Query data from existing Presto clusters

If you already have a Presto cluster, you can copy the Presto Pulsar connector plugin to your existing cluster. Download the archived plugin package with the following command.

  $ wget https://archive.apache.org/dist/pulsar/pulsar-2.9.2/apache-pulsar-2.9.2-bin.tar.gz

Deploy a new cluster

Since Pulsar SQL is powered by Trino (formerly Presto SQL), the deployment configuration is the same for the Pulsar SQL worker.

Note
For how to set up a standalone single node environment, refer to Query data.

You can use the same CLI arguments as the Presto launcher:

  $ ./bin/pulsar sql-worker --help
  Usage: launcher [options] command
  Commands: run, start, stop, restart, kill, status
  Options:
    -h, --help            show this help message and exit
    -v, --verbose         Run verbosely
    --etc-dir=DIR         Defaults to INSTALL_PATH/etc
    --launcher-config=FILE
                          Defaults to INSTALL_PATH/bin/launcher.properties
    --node-config=FILE    Defaults to ETC_DIR/node.properties
    --jvm-config=FILE     Defaults to ETC_DIR/jvm.config
    --config=FILE         Defaults to ETC_DIR/config.properties
    --log-levels-file=FILE
                          Defaults to ETC_DIR/log.properties
    --data-dir=DIR        Defaults to INSTALL_PATH
    --pid-file=FILE       Defaults to DATA_DIR/var/run/launcher.pid
    --launcher-log-file=FILE
                          Defaults to DATA_DIR/var/log/launcher.log (only in
                          daemon mode)
    --server-log-file=FILE
                          Defaults to DATA_DIR/var/log/server.log (only in
                          daemon mode)
    -D NAME=VALUE         Set a Java system property

The default configuration for the cluster is located in ${project.root}/conf/presto. You can customize your deployment by modifying the default configuration.

You can set the worker to read from a different configuration directory, or set a different directory for writing data.

  $ ./bin/pulsar sql-worker run --etc-dir /tmp/incubator-pulsar/conf/presto --data-dir /tmp/presto-1

You can start the worker as a daemon process:

  $ ./bin/pulsar sql-worker start
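
Once the worker runs as a daemon, the status and stop commands listed in the launcher help can be used to manage it. The sketch below wraps them in two hypothetical helper functions. It assumes that `sql-worker status` exits with a non-zero code when no daemon is running, as the underlying Presto launcher is expected to do; verify this on your installation before relying on it.

```shell
# Path to the pulsar launcher script; override PULSAR_BIN if it lives elsewhere.
PULSAR_BIN=${PULSAR_BIN:-./bin/pulsar}

# Start the worker as a daemon only if `status` reports that it is not running.
# ASSUMPTION: `sql-worker status` exits non-zero when no daemon is running.
ensure_worker_running() {
  if "$PULSAR_BIN" sql-worker status; then
    echo "SQL worker is already running"
  else
    "$PULSAR_BIN" sql-worker start
  fi
}

# Stop the daemon again.
stop_worker() {
  "$PULSAR_BIN" sql-worker stop
}
```

Calling ensure_worker_running repeatedly is then harmless, which can be convenient in provisioning scripts.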

Deploy a cluster on multiple nodes

You can deploy a Pulsar SQL cluster or Presto cluster on multiple nodes. The following example shows how to deploy a cluster on three nodes.

  1. Copy the Pulsar binary distribution to the three nodes.

The first node runs as Presto coordinator. The minimal configuration requirement in the ${project.root}/conf/presto/config.properties file is as follows.

  coordinator=true
  node-scheduler.include-coordinator=true
  http-server.http.port=8080
  query.max-memory=50GB
  query.max-memory-per-node=1GB
  discovery-server.enabled=true
  discovery.uri=<coordinator-url>

The other two nodes serve as worker nodes. You can use the following configuration for them:

  coordinator=false
  http-server.http.port=8080
  query.max-memory=50GB
  query.max-memory-per-node=1GB
  discovery.uri=<coordinator-url>

  2. In the ${project.root}/conf/presto/catalog/pulsar.properties file, modify the pulsar.web-service-url and pulsar.zookeeper-uri configuration accordingly for the three nodes.

  3. Start the coordinator node.

     $ ./bin/pulsar sql-worker run

  4. Start the worker nodes.

     $ ./bin/pulsar sql-worker run

  5. Start the SQL CLI and check the status of your cluster.

     $ ./bin/pulsar sql --server <coordinator-url>

  6. Check the status of your nodes.

     presto> SELECT * FROM system.runtime.nodes;
      node_id |        http_uri         | node_version | coordinator | state
     ---------+-------------------------+--------------+-------------+--------
      1       | http://192.168.2.1:8081 | testversion  | true        | active
      3       | http://192.168.2.2:8081 | testversion  | false       | active
      2       | http://192.168.2.3:8081 | testversion  | false       | active
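
The coordinator and worker settings listed earlier in this section differ in only a few lines, so they can be generated by one small helper. The sketch below is a hypothetical convenience function (write_presto_config is not part of Pulsar), and the URL in the usage note is a placeholder. The property values are copied from the minimal configurations shown above.

```shell
# write_presto_config ROLE DISCOVERY_URI OUT_FILE
# ROLE is "coordinator" or "worker"; the values mirror the minimal examples above.
write_presto_config() {
  role=$1 uri=$2 out=$3
  case $role in
    coordinator)
      printf '%s\n' \
        'coordinator=true' \
        'node-scheduler.include-coordinator=true' \
        'http-server.http.port=8080' \
        'query.max-memory=50GB' \
        'query.max-memory-per-node=1GB' \
        'discovery-server.enabled=true' \
        "discovery.uri=$uri" > "$out"
      ;;
    worker)
      printf '%s\n' \
        'coordinator=false' \
        'http-server.http.port=8080' \
        'query.max-memory=50GB' \
        'query.max-memory-per-node=1GB' \
        "discovery.uri=$uri" > "$out"
      ;;
    *)
      echo "unknown role: $role (expected coordinator or worker)" >&2
      return 1
      ;;
  esac
}
```

On the first node you might run `write_presto_config coordinator http://coordinator.example.com:8080 conf/presto/config.properties`, and on the other two nodes the same command with the worker role. Note that this overwrites the target file.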

For more information about deployment in Presto, refer to Presto deployment.

Note
The broker does not advance the LAC, so when Pulsar SQL bypasses the broker to query data, it can only read entries up to the LAC that all the bookies have learned. You can enable periodic LAC writes on the broker by setting bookkeeperExplicitLacIntervalInMills in broker.conf.