vtorc

VTOrc is the automated fault detection and repair tool of Vitess.

Example Usage

Start VTOrc as follows:

  export TOPOLOGY_FLAGS="--topo_implementation etcd2 --topo_global_server_address localhost:2379 --topo_global_root /vitess/global"
  export VTDATAROOT="/tmp"

  vtorc \
    $TOPOLOGY_FLAGS \
    --log_dir $VTDATAROOT/tmp \
    --port 15000 \
    --recovery-period-block-duration "10m" \
    --instance-poll-time "1s" \
    --topo-information-refresh-duration "30s" \
    --alsologtostderr
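The example passes `$TOPOLOGY_FLAGS` unquoted, so it depends on the shell splitting that string into separate arguments. In bash, an array keeps every flag and value as its own word without relying on word splitting, and it still works if a value ever contains a space. The sketch below builds the same command line that way. It assumes bash and a `vtorc` binary on `PATH`; `VTORC_FLAGS` and `start_vtorc` are illustrative names, not part of Vitess.

```shell
#!/usr/bin/env bash
# Sketch (assumes bash): the same VTOrc invocation as the example above,
# with flags held in arrays instead of a word-split string.
# VTORC_FLAGS and start_vtorc are illustrative names only.

export VTDATAROOT="/tmp"

# Topology server connection flags, one array element per word.
TOPOLOGY_FLAGS=(
  --topo_implementation etcd2
  --topo_global_server_address localhost:2379
  --topo_global_root /vitess/global
)

# Complete VTOrc flag list, reusing the topology flags.
VTORC_FLAGS=(
  "${TOPOLOGY_FLAGS[@]}"
  --log_dir "$VTDATAROOT/tmp"
  --port 15000
  --recovery-period-block-duration "10m"
  --instance-poll-time "1s"
  --topo-information-refresh-duration "30s"
  --alsologtostderr
)

# Defined but not called here; requires the vtorc binary on PATH.
start_vtorc() {
  # Precaution: make sure the log directory exists before starting.
  mkdir -p "$VTDATAROOT/tmp"
  vtorc "${VTORC_FLAGS[@]}"
}
```

Calling `start_vtorc` runs VTOrc in the foreground with exactly the arguments of the original example; `"${VTORC_FLAGS[@]}"` expands to one argument per array element.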

Options

The following command line options apply to VTOrc:

Name (type): Definition
--alsologtostderr (boolean): log to standard error as well as files
--audit-file-location (string): File location where the audit logs are to be stored
--audit-purge-duration (duration): Duration for which audit logs are held before being purged. Should be in multiples of days (default 168h0m0s)
--audit-to-backend (boolean): Whether to store the audit log in the VTOrc database
--audit-to-syslog (boolean): Whether to store the audit log in the syslog
--catch-sigpipe (boolean): catch and ignore SIGPIPE on stdout and stderr if specified
--clusters_to_watch (strings): Comma-separated list of keyspaces or keyspace/shards that this instance will monitor and repair. Defaults to all clusters in the topology. Example: “ks1,ks2/-80”
--config (string): config file name
--consul_auth_static_file (string): JSON file to read the topos/tokens from.
--grpc_auth_static_client_creds (string): When using grpc_static_auth in the server, this file provides the credentials to use to authenticate with the server.
--grpc_compression (string): Which protocol to use for compressing gRPC. Default: nothing. Supported: snappy
--grpc_enable_tracing (boolean): Enable gRPC tracing.
--grpc_initial_conn_window_size (int): gRPC initial connection window size
--grpc_initial_window_size (int): gRPC initial window size
--grpc_keepalive_time (duration): After a duration of this time, if the client doesn’t see any activity, it pings the server to see if the transport is still alive. (default 10s)
--grpc_keepalive_timeout (duration): After pinging for a keepalive check, the client waits for this duration; if no activity is seen in that time, the connection is closed. (default 10s)
--grpc_max_message_size (int): Maximum allowed RPC message size. Larger messages will be rejected by gRPC with the error ‘exceeding the max size’. (default 16777216)
--grpc_prometheus (boolean): Enable gRPC monitoring with Prometheus.
-h, --help (boolean): display usage and exit
--instance-poll-time (duration): Timer duration on which VTOrc refreshes MySQL information (default 5s)
--keep_logs (duration): keep logs for this long (using ctime) (zero to keep forever)
--keep_logs_by_mtime (duration): keep logs for this long (using mtime) (zero to keep forever)
--lameduck-period (duration): keep running at least this long after SIGTERM before stopping (default 50ms)
--lock-timeout (duration): Maximum time for which a shard/keyspace lock can be acquired (default 45s)
--log_backtrace_at (traceLocation): when logging hits line file:N, emit a stack trace (default :0)
--log_dir (string): If non-empty, write log files in this directory
--log_err_stacks (boolean): log stack traces for errors
--log_rotate_max_size (uint): size in bytes at which logs are rotated (glog.MaxSize) (default 1887436800)
--logtostderr (boolean): log to standard error instead of files
--onclose_timeout (duration): wait no more than this for OnClose handlers before stopping (default 10s)
--onterm_timeout (duration): wait no more than this for OnTermSync handlers before stopping (default 10s)
--pid_file (string): If set, the process will write its pid to the named file, and delete it on graceful shutdown.
--port (int): port for the server
--pprof (strings): enable profiling
--prevent-cross-cell-failover (boolean): Prevent VTOrc from promoting a primary in a different cell than the current primary in case of a failover
--purge_logs_interval (duration): how often to try to remove old logs (default 1h0m0s)
--reasonable-replication-lag (duration): Maximum replication lag on replicas which is deemed to be acceptable (default 10s)
--recovery-period-block-duration (duration): Duration for which a new recovery is blocked on an instance after running a recovery (default 30s)
--recovery-poll-duration (duration): Timer duration on which VTOrc polls its database to run a recovery (default 1s)
--remote_operation_timeout (duration): time to wait for a remote operation (default 15s)
--security_policy (string): the name of a registered security policy to use for controlling access to URLs; empty means allow all for anyone (built-in policies: deny-all, read-only)
--shutdown_wait_time (duration): Maximum time to wait for VTOrc to release all the locks that it is holding before shutting down on SIGTERM (default 30s)
--snapshot-topology-interval (duration): Timer duration on which VTOrc takes a snapshot of the current MySQL information it has in the database. Should be in multiples of hours
--sqlite-data-file (string): SQLite datafile to use as VTOrc’s database (default “file::memory:?mode=memory&cache=shared”)
--stderrthreshold (severity): logs at or above this threshold go to stderr (default 1)
--tablet_manager_grpc_ca (string): the server ca to use to validate servers when connecting
--tablet_manager_grpc_cert (string): the cert to use to connect
--tablet_manager_grpc_concurrency (int): concurrency to use to talk to a vttablet server for performance-sensitive RPCs (like ExecuteFetchAs{Dba,AllPrivs,App}) (default 8)
--tablet_manager_grpc_connpool_size (int): number of tablets to keep tmclient connections open to (default 100)
--tablet_manager_grpc_crl (string): the server crl to use to validate server certificates when connecting
--tablet_manager_grpc_key (string): the key to use to connect
--tablet_manager_grpc_server_name (string): the server name to use to validate server certificate
--tablet_manager_protocol (string): Protocol to use to make tabletmanager RPCs to vttablets. (default “grpc”)
--topo-information-refresh-duration (duration): Timer duration on which VTOrc refreshes the keyspace and vttablet records from the topology server (default 15s)
--topo_consul_lock_delay (duration): LockDelay for consul session. (default 15s)
--topo_consul_lock_session_checks (string): List of checks for consul session. (default “serfHealth”)
--topo_consul_lock_session_ttl (string): TTL for consul session.
--topo_consul_watch_poll_duration (duration): time of the long poll for watch queries. (default 30s)
--topo_etcd_lease_ttl (int): Lease TTL for locks and leader election. The client will use KeepAlive to keep the lease going. (default 30)
--topo_etcd_tls_ca (string): path to the ca to use to validate the server cert when connecting to the etcd topo server
--topo_etcd_tls_cert (string): path to the client cert to use to connect to the etcd topo server, requires topo_etcd_tls_key, enables TLS
--topo_etcd_tls_key (string): path to the client key to use to connect to the etcd topo server, enables TLS
--topo_global_root (string): the path of the global topology data in the global topology server
--topo_global_server_address (string): the address of the global topology server
--topo_implementation (string): the topology implementation to use
--topo_k8s_context (string): The kubeconfig context to use, overrides the ‘current-context’ from the config
--topo_k8s_kubeconfig (string): Path to a valid kubeconfig file. When running as a k8s pod inside the same cluster you wish to use as the topo, you may omit this and the below arguments, and Vitess is capable of auto-discovering the correct values. https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod
--topo_k8s_namespace (string): The kubernetes namespace to use for all objects. Default comes from the context or in-cluster config
--topo_zk_auth_file (string): auth to use when connecting to the zk topo server, file contents should be <scheme>:<auth>, e.g., digest:user:pass
--topo_zk_base_timeout (duration): zk base timeout (see zk.Connect) (default 30s)
--topo_zk_max_concurrency (int): maximum number of pending requests to send to a Zookeeper server. (default 64)
--topo_zk_tls_ca (string): the server ca to use to validate servers when connecting to the zk topo server
--topo_zk_tls_cert (string): the cert to use to connect to the zk topo server, requires topo_zk_tls_key, enables TLS
--topo_zk_tls_key (string): the key to use to connect to the zk topo server, enables TLS
--v (value): log level for V logs
--version (boolean): print binary version
--vmodule (value): comma-separated list of pattern=N settings for file-filtered logging
--wait-replicas-timeout (duration): Duration for which to wait for replicas to respond when issuing RPCs (default 30s)
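Several of the options above are typically combined when one VTOrc instance guards only part of a deployment and should survive restarts. The sketch below scopes VTOrc with `--clusters_to_watch` (using the keyspace/shard example from that option), keeps promotions inside the current primary's cell with `--prevent-cross-cell-failover`, replaces the default in-memory SQLite database with an on-disk file via `--sqlite-data-file`, and records the process ID with `--pid_file` so it can be stopped with SIGTERM. It is a hypothetical sketch, not a recommended configuration: it assumes bash, the same etcd2 topology as the example usage, and that `--sqlite-data-file` accepts a plain file path. `VTORC_STATE_DIR`, `VTORC_PID_FILE`, `SCOPED_VTORC_FLAGS`, `start_scoped_vtorc` and `stop_vtorc` are illustrative names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (assumes bash): a scoped VTOrc with on-disk state and
# a pid file. All variable and function names here are illustrative only.

VTORC_STATE_DIR="${VTDATAROOT:-/tmp}/vtorc"
VTORC_PID_FILE="$VTORC_STATE_DIR/vtorc.pid"

SCOPED_VTORC_FLAGS=(
  --topo_implementation etcd2
  --topo_global_server_address localhost:2379
  --topo_global_root /vitess/global
  # Only monitor and repair keyspace ks1 and shard -80 of keyspace ks2.
  --clusters_to_watch "ks1,ks2/-80"
  # Do not promote a primary in a different cell than the current primary.
  --prevent-cross-cell-failover
  # On-disk SQLite database instead of the in-memory default
  # (assumption: a plain file path is accepted here).
  --sqlite-data-file "$VTORC_STATE_DIR/vtorc.db"
  # VTOrc writes its pid here and deletes the file on graceful shutdown.
  --pid_file "$VTORC_PID_FILE"
  --port 15000
)

# Defined but not called here; requires the vtorc binary on PATH.
start_scoped_vtorc() {
  mkdir -p "$VTORC_STATE_DIR"
  vtorc "${SCOPED_VTORC_FLAGS[@]}"
}

# Send SIGTERM to the process whose pid is stored in the given file.
# After SIGTERM, VTOrc keeps running for at least --lameduck-period and
# waits up to --shutdown_wait_time to release the locks it holds.
stop_vtorc() {
  local pid_file="$1"
  if [ ! -f "$pid_file" ]; then
    echo "no pid file: $pid_file" >&2
    return 1
  fi
  kill -TERM "$(cat "$pid_file")"
}
```

Run `start_scoped_vtorc` in one terminal and `stop_vtorc "$VTORC_PID_FILE"` in another. Because the SQLite file lives under `$VTORC_STATE_DIR`, VTOrc's database is no longer discarded when the process exits, which is the main practical difference from the in-memory default.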