FAQ

Frequently asked questions in Pigsty.

If you have any unlisted questions or suggestions, please create an Issue or ask the community for help.


Preparation

Node Requirement

CPU Architecture: x86_64 only. Pigsty does not support ARM yet.

CPU Number: 1 core for a common node, at least 2 cores for the admin node.

Memory: at least 1GB for the common node and 2GB for the admin node.

Using at least 3~4 x (2C / 4G / 100G) nodes for serious production deployment is recommended.

OS Requirement

Pigsty is now developed and tested on CentOS 7.9, Rocky 8.6 & 9.0. RHEL, Alma, Oracle, and any EL-compatible distribution also work.

We strongly recommend using EL 7.9, 8.6, or 9.0 to avoid wasting effort on RPM troubleshooting.

And PLEASE USE FRESH NODES to avoid any unexpected issues.

Versioning Policy

!> Please always use a version-specific release, do not use the GitHub master branch unless you know what you are doing.

Pigsty uses semantic version numbers such as <major>.<minor>.<release>. Alpha/Beta/RC releases are suffixed to the version number: -a1, -b1, -rc1.

Major updates mean fundamental changes and massive features; minor updates suggest new features, bumped package versions, and minor API changes. Release updates mean bug fixes and doc updates, and they do not change offline package versions (i.e., v1.0.1 and v1.0.0 use the same pkg.tgz).

Pigsty tries to release a Minor Release every 1-3 months and a Major Release every 1-2 years.


Download

Where to download the Pigsty source code?

!> bash -c "$(curl -fsSL http://download.pigsty.cc/get)"

The above command will automatically download the latest stable version of pigsty.tgz and extract it to the ~/pigsty dir. You can also manually download a specific version of the Pigsty source code from the GitHub Release page.

If you need to install it in an environment without the Internet, you can download it in advance and upload it to the production server via scp/sftp.
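
For example, a minimal sketch of such an upload (the hostname and version are placeholders):

    scp pigsty-v2.0.2.tgz <node>:/tmp/pigsty.tgz    # upload the downloaded tarball to the target server
    ssh <node> 'tar -xf /tmp/pigsty.tgz -C ~'       # extract it to ~/pigsty on that server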

How to accelerate RPM downloading from the upstream repo?

Consider using the upstream repo mirror of your region. Define them with repo_upstream and region.

For example, you can use region = china, and the baseurl with key = china will be used instead of the default.

If a firewall or GFW blocks some repo, consider using a proxy_env to bypass that.
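
For example, a minimal sketch of these parameters in pigsty.yml (the proxy address is a placeholder):

    region: china                          # use baseurls with key = china from repo_upstream
    proxy_env:                             # proxy environment variables for repo building
      http_proxy: http://127.0.0.1:12345
      https_proxy: http://127.0.0.1:12345
      no_proxy: "localhost,127.0.0.1,10.0.0.0/8"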

Where to download pigsty offline packages

Offline packages can be downloaded during bootstrap, or you can download them directly via:

    https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-v2.0.2.tgz                 # source code
    https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-pkg-v2.0.2.el7.x86_64.tgz  # el7 packages
    https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-pkg-v2.0.2.el8.x86_64.tgz  # el8 packages
    https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-pkg-v2.0.2.el9.x86_64.tgz  # el9 packages

Configuration

What does configure do?

!> Detect the environment, generate the configuration, enable the offline package (optional), and install the essential tool Ansible.

After downloading the Pigsty source package and unpacking it, you can execute ./configure to complete the environment configuration. This step is optional if you already know how to configure Pigsty properly.

The configure procedure will detect your node environment and generate a pigsty config file: pigsty.yml for you.

What is the Pigsty config file?

!> pigsty.yml under the pigsty home dir is the default config file.

Pigsty uses a single config file pigsty.yml, to describe the entire environment, and you can define everything there. There are many config examples in files/pigsty for your reference.

You can pass -i <path> to playbooks to use an alternative config file. For example, to install redis according to another config file, redis.yml:

    ./redis.yml -i files/pigsty/redis.yml

How to use the CMDB as config inventory

The default config file path is specified in ansible.cfg: inventory = pigsty.yml

You can switch to a dynamic CMDB inventory with bin/inventory_cmdb, and switch back to the local config file with bin/inventory_conf. Before switching, you must load the current config file inventory into the CMDB with bin/inventory_load.

If CMDB is used, you must edit the inventory config from the database rather than the config file.
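
A typical switching sequence with the scripts mentioned above:

    bin/inventory_load    # load the current pigsty.yml into the CMDB
    bin/inventory_cmdb    # switch the ansible inventory to the dynamic CMDB
    bin/inventory_conf    # switch back to the local pigsty.yml config file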

What is the IP address placeholder in the config file?

!> Pigsty uses 10.10.10.10 as a placeholder for the current node IP, which will be replaced with the primary IP of the current node during the configuration.

When configure detects multiple NICs with multiple IPs on the current node, the config wizard will prompt for the primary IP to be used, i.e., the IP used to access the node from the internal network. Do not use a public IP.

This IP will be used to replace 10.10.10.10 in the config file template.
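
If configure picks the wrong NIC, you can specify the primary IP explicitly (a sketch, assuming the -i option of ./configure):

    ./configure -i 10.10.10.10    # use this IP as the primary IP during configuration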

Which parameters need your attention?

!> Usually, in a singleton installation, there is no need to make any adjustments to the config files.

Pigsty provides 265 config parameters to customize the entire infra/node/etcd/minio/pgsql stack. However, a few parameters are worth adjusting in advance if needed, as sketched below:

  • infra_portal defines the domain names for web service components (some services can only be accessed by domain name through the Nginx proxy).
  • Pigsty assumes a /data dir exists to hold all data; adjust these paths if your data disk mount point differs.
  • Don’t forget to change the passwords in the config file for your production deployment.
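
A hedged sketch of what such adjustments may look like in pigsty.yml (parameter names follow Pigsty defaults; the values are placeholders you should change):

    infra_portal:                          # domain names for web services behind the Nginx proxy
      home    : { domain: h.pigsty }
      grafana : { domain: g.pigsty, endpoint: "${admin_ip}:3000", websocket: true }
    grafana_admin_password: pigsty         # CHANGE these default passwords!
    pg_admin_password: DBUser.DBA
    pg_monitor_password: DBUser.Monitor
    pg_replication_password: DBUser.Replicator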

Installation

What was executed during installation?

!> When running make install, the ansible-playbook install.yml will be invoked to install everything on all nodes.

Which will:

  • Install INFRA module on the current node.
  • Install NODE module on the current node.
  • Install ETCD module on the current node.
  • The MinIO module is optional, and will not be installed by default.
  • Install PGSQL module on the current node.
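
In other words, make install is a thin wrapper around the playbook, roughly equivalent to:

    make install     # Makefile shortcut for the line below
    ./install.yml    # install INFRA/NODE/ETCD/PGSQL on all nodes in the inventory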

How to resolve RPM conflict?

There is a slight chance of RPM conflicts during node/infra/pgsql package installation.

The simplest way to resolve this is to install without offline packages, which will download directly from the upstream repo.

If there are only a few problematic RPMs, you can use a trick to fix the yum repo quickly:

    rm -rf /www/pigsty/repo_complete    # delete the repo_complete flag file to mark this repo incomplete
    rm -rf SomeBrokenRPMPackages        # delete problematic RPMs
    ./infra.yml -t repo_upstream        # write upstream repos; you can also use /etc/yum.repos.d/backup/*
    ./infra.yml -t repo_pkg             # download rpms according to your current OS

How to create local VMs with vagrant

!> The first time you use Vagrant to bring up a VM with a particular OS, it will download the corresponding BOX.

The Pigsty sandbox uses the generic/rocky9 box by default, and Vagrant will download it the first time the VM is started.

Using a proxy may increase the download speed. The box only needs to be downloaded once and will be reused when recreating the sandbox.
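
A sketch of using a proxy for the first vagrant up (the proxy address is a placeholder):

    export http_proxy=http://127.0.0.1:12345     # vagrant box downloads honor proxy env vars
    export https_proxy=http://127.0.0.1:12345
    vagrant up                                   # the box is cached after the first download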

RPM errors on Aliyun CentOS 7.9

!> Aliyun CentOS 7.9 server has DNS caching service nscd installed by default. Just remove it.

Aliyun’s CentOS 7.9 image has nscd installed by default, locking the glibc version, which can cause RPM dependency errors during installation.

  1. "Error: Package: nscd-2.17-307.el7.1.x86_64 (@base)"

Run yum remove -y nscd on all nodes to resolve this issue; with Ansible, you can do it in one batch:

    ansible all -b -a 'yum remove -y nscd'

Monitoring

Performance impact of monitoring exporter

Not much: each scrape takes around 200ms, once every 10~15 seconds.

How to monitor an existing PostgreSQL instance?

Check PGSQL Monitor for details.

How to remove monitor targets from prometheus?

    ./pgsql-rm.yml -t prometheus -l <cls>    # remove prometheus targets of cluster 'cls'

Or

    bin/pgmon-rm <ins>    # shortcut for removing prometheus targets of pgsql instance 'ins'

INFRA

Which components are included in INFRA

  • Ansible for automation, deployment, and administration;
  • Nginx for exposing any WebUI service and serving the yum repo;
  • Self-Signed CA for SSL/TLS certificates;
  • Prometheus for monitoring metrics;
  • Grafana for monitoring / visualization;
  • Loki for log collection;
  • AlertManager for alert aggregation;
  • Chronyd for NTP time sync on the admin node;
  • DNSMasq for DNS registration and resolution;
  • ETCD as DCS for PGSQL HA (dedicated module);
  • PostgreSQL on meta nodes as CMDB (optional);
  • Docker for stateless applications & tools (optional).

How to restore Prometheus targets

If you accidentally deleted the Prometheus targets dir, you can register monitoring targets to Prometheus again with the following:

    ./infra.yml -t register_prometheus    # register all infra targets to prometheus on infra nodes
    ./node.yml  -t register_prometheus    # register all node targets to prometheus on infra nodes
    ./etcd.yml  -t register_prometheus    # register all etcd targets to prometheus on infra nodes
    ./minio.yml -t register_prometheus    # register all minio targets to prometheus on infra nodes
    ./pgsql.yml -t register_prometheus    # register all pgsql targets to prometheus on infra nodes

How to restore Grafana datasource

PGSQL Databases in pg_databases are registered as Grafana datasource by default.

If you accidentally deleted the registered postgres datasource in Grafana, you can register them again with

    ./pgsql.yml -t register_grafana    # register all pgsql databases (in pg_databases) as grafana datasources

How to restore the HAProxy admin page proxy

The haproxy admin page is proxied by Nginx under the default server.

If you accidentally deleted the registered haproxy proxy settings in /etc/nginx/conf.d/haproxy, you can restore them again with

    ./node.yml -t register_nginx    # register all haproxy admin page proxy settings to nginx on infra nodes

How to restore the DNS registration

PGSQL cluster/instance domain names are registered to /etc/hosts.d/<name> on infra nodes by default.

You can restore them again with the following:

    ./pgsql.yml -t pg_dns    # register pg DNS names to dnsmasq on infra nodes

How to expose new Nginx upstream service

If you wish to expose a new WebUI service via the Nginx portal, you can add the service definition to the infra_portal parameter.

And re-run ./infra.yml -t nginx_config,nginx_launch to update & apply the Nginx configuration.
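
For example, a hedged sketch of a new entry (name, domain, and endpoint are placeholders):

    infra_portal:
      # ... existing entries ...
      newsvc : { domain: svc.pigsty, endpoint: "10.10.10.10:8080" }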

If you wish to access it with HTTPS, you must remove files/pki/csr/pigsty.csr and files/pki/nginx/pigsty.{key,crt} to force regeneration of the Nginx SSL/TLS certificate so that it includes the new upstream’s domain name.


NODE

How to configure NTP service?

!> If NTP is not configured, use a public NTP service or sync time with the admin node.

If your nodes already have NTP configured, you can leave it there by setting node_ntp_enabled to false.

Otherwise, if you have Internet access, you can use public NTP services such as pool.ntp.org.

If you don’t have Internet access, at least you can sync time with the admin node with the following:

    node_ntp_servers:                  # NTP servers in /etc/chrony.conf
      - pool cn.pool.ntp.org iburst
      - pool ${admin_ip} iburst        # assume non-admin nodes do not have internet access

How to force sync time on nodes?

!> Use chronyc to sync time. You have to configure the NTP service first.

    ansible all -b -a 'chronyc -a makestep'    # sync time

You can replace all with any group or host IP address to limit execution scope.

Remote nodes are not accessible via SSH

!> Specify a different port via the host instance-level ansible connection parameters.

Consider using Ansible connection parameters if the target machine is hidden behind an SSH jump server, or if it has been customized so that it cannot be accessed directly via ssh <ip>. A non-standard SSH port can be specified with ansible_port, and an SSH alias with ansible_host:

    pg-test:
      vars: { pg_cluster: pg-test }
      hosts:
        10.10.10.11: { pg_seq: 1, pg_role: primary, ansible_host: node-1 }
        10.10.10.12: { pg_seq: 2, pg_role: replica, ansible_port: 22223, ansible_user: admin }
        10.10.10.13: { pg_seq: 3, pg_role: offline, ansible_port: 22224 }

Password required for remote node SSH and SUDO

!> Use the -k and -K parameters, enter the password at the prompt, and refer to admin provisioning.

When performing deployments and changes, the admin user used must have ssh and sudo privileges on all nodes. Passwordless access is not required: you can pass the ssh and sudo passwords via the -k|-K parameters when executing a playbook, or even run the playbook as another user via -e ansible_user=<another_user>. However, Pigsty strongly recommends configuring passwordless SSH login and passwordless sudo for the admin user.
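
For example (-k/-K are the standard Ansible password prompt flags):

    ./install.yml -k -K                           # prompt for SSH and sudo passwords
    ./install.yml -e ansible_user=admin -k -K     # run the playbook as another admin user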

Create an admin user with the existing admin user.

!> ./node.yml -k -K -e ansible_user=<another_admin> -t node_admin

This will create an admin user specified by node_admin_username with the existing one on that node.

Exposing node services with HAProxy

!> You can expose services with haproxy_services in node.yml.

And here’s an example of exposing MinIO service with it: Expose MinIO Service
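
A rough sketch of the shape of such a definition (names, ports, and options are placeholders; see the linked example for the authoritative version):

    haproxy_services:                    # expose the minio cluster on local port 9002
      - name: minio
        port: 9002
        options: [ option httpchk, http-check expect status 200 ]
        servers:
          - { name: minio-1, ip: 10.10.10.10, port: 9000 }
          - { name: minio-2, ip: 10.10.10.11, port: 9000 }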

Why are my node's /etc/yum.repos.d/* nuked?

Pigsty will try to include all dependencies in the local yum repo on infra nodes. This repo file will be added according to node_repo_local_urls, and existing repo files will be removed by default according to the default value of node_repo_remove. This prevents the nodes from using Internet repos and avoids unexpected issues from mixed repos.

If you want to keep existing repo files, just set node_repo_remove to false.


ETCD

What is the impact of ETCD failure?

[ETCD](/en/docs/etcd) availability is critical for the PGSQL cluster’s HA, which is guaranteed by using multiple nodes. With a 3-node ETCD cluster, one node can fail while the other two still function normally; with a 5-node ETCD cluster, a two-node failure can be tolerated. If more than half of the ETCD nodes are down, the ETCD cluster and its service become unavailable. Before Patroni 3.0, this could lead to a global [PGSQL](/en/docs/pgsql) outage: all primaries would be demoted and reject write requests.

Since pigsty 2.0, the patroni 3.0 DCS failsafe mode is enabled by default, which will LOCK the PGSQL cluster status if the ETCD cluster is unavailable and all PGSQL members are still known to the primary.

The PGSQL cluster can still function normally, but you must recover the ETCD cluster ASAP (you cannot make configuration changes to the PGSQL cluster through Patroni while etcd is down).
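
To check the etcd cluster state from the admin node (assuming etcdctl is already configured with the cluster endpoints and certificates):

    etcdctl endpoint health --cluster              # probe every etcd member
    etcdctl endpoint status --cluster -w table     # show leader, term, and db size per member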

How to use an existing external etcd cluster?

The hard-coded group, `etcd`, will be used as DCS servers for PGSQL. You can initialize them with `etcd.yml`, or assume it is an existing external etcd cluster.

To use an existing external etcd cluster, define it in the inventory as usual and make sure the existing etcd cluster's certificate is signed by the same self-signed CA that Pigsty uses for PGSQL.
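
A minimal sketch of such a group definition, assuming a 3-node external cluster (IPs are placeholders):

    etcd:                                # hard-coded group name
      hosts:
        10.10.10.11: { etcd_seq: 1 }
        10.10.10.12: { etcd_seq: 2 }
        10.10.10.13: { etcd_seq: 3 }
      vars: { etcd_cluster: etcd }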

How to add a new member to the existing etcd cluster?

!> Check Add a member to etcd cluster

    etcdctl member add <etcd-?> --learner=true --peer-urls=https://<new_ins_ip>:2380    # on admin node
    ./etcd.yml -l <new_ins_ip> -e etcd_init=existing                                    # init the new etcd member
    etcdctl member promote <new_ins_server_id>                                          # on admin node

How to remove a member from an existing etcd cluster?

!> Check Remove member from etcd cluster

    etcdctl member remove <etcd_server_id>    # kick the member out of the cluster (on admin node)
    ./etcd.yml -l <ins_ip> -t etcd_purge      # purge the etcd instance

MINIO

Fail to launch multi-node / multi-drive MinIO cluster

In Multi-Drive or Multi-Node mode, MinIO will refuse to start if the data dir is not a valid mount point.

Use mounted disks for the MinIO data dir rather than a regular directory. A regular directory works only in single-node, single-drive mode.

How to deploy a multi-node multi-drive MinIO cluster?

!> Check Create Multi-Node Multi-Drive MinIO Cluster

How to add a member to the existing MinIO cluster?

!> You’d better plan the MinIO cluster size before deployment, since expanding it requires a global restart.

Check this: Expand MinIO Deployment

How to use a HA MinIO deployment for PGSQL?

!> Access the HA MinIO cluster with an optional load balancer and different ports.

Here is an example: Access MinIO Service


REDIS

ABORT due to existing redis instance

!> use redis_clean = true and redis_safeguard = false to force clean redis data

This happens when you run redis.yml to init a redis instance that is already running, and redis_clean is set to false.

If redis_clean is set to true (and the redis_safeguard is set to false, too), the redis.yml playbook will remove the existing redis instance and re-init it as a new one, which makes the redis.yml playbook fully idempotent.
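
For example, to forcefully re-initialize a running redis node:

    ./redis.yml -l <ip> -e redis_clean=true -e redis_safeguard=false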

ABORT due to redis_safeguard enabled

!> This happens when removing a redis instance with redis_safeguard set to true.

You can disable redis_safeguard to remove the Redis instance. That is exactly what redis_safeguard is for.

How to add a single new redis instance on this node?

!> Use bin/redis-add <ip> <port> to deploy a new redis instance on the node.

How to remove a single redis instance from the node?

!> Use bin/redis-rm <ip> <port> to remove a single redis instance from the node.


PGSQL

ABORT due to postgres exists

!> Set pg_clean = true and pg_safeguard = false to force clean postgres data during pgsql.yml

This happens when you run pgsql.yml on a node with postgres running, and pg_clean is set to false.

If pg_clean is true (and the pg_safeguard is false, too), the pgsql.yml playbook will remove the existing pgsql data and re-init it as a new one, which makes this playbook fully idempotent.

You can still purge the existing PostgreSQL data by using a special task tag pg_purge

    ./pgsql.yml -t pg_clean    # honor pg_clean and pg_safeguard
    ./pgsql.yml -t pg_purge    # ignore pg_clean and pg_safeguard

ABORT due to pg_safeguard enabled

!> Disable pg_safeguard to remove the Postgres instance.

If pg_safeguard is enabled, you cannot remove the running pgsql instance with bin/pgsql-rm or the pgsql-rm.yml playbook.

To disable pg_safeguard, you can set pg_safeguard to false in the inventory or pass -e pg_safeguard=false as cli arg to the playbook:

    ./pgsql-rm.yml -e pg_safeguard=false -l <cls_to_remove>    # force override pg_safeguard

Fail to wait for postgres/patroni primary

This usually happens when the cluster is misconfigured or the previous primary was improperly removed (e.g., stale metadata left in DCS under the same cluster name).

You must check /pg/log/* to find the reason.
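
A quick way to inspect them, assuming the default log layout (which may vary by version):

    ls /pg/log/                           # patroni / postgres / pgbouncer log dirs
    tail -n 200 /pg/log/patroni/*.log     # recent patroni logs usually reveal the root cause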

Fail to wait for postgres/patroni replica

There are several possible reasons:

Failed Immediately: Usually, this happens because of misconfiguration, network issues, broken DCS metadata, etc. You have to inspect /pg/log to find the actual reason.

Failed After a While: This may be due to source instance data corruption. Check PGSQL FAQ: How to create replicas when data is corrupted?

Timeout: If the wait for postgres replica task fails on timeout after 30 minutes or more, this is common for a huge cluster (e.g., 1TB+, where creating a replica may take hours). In this case, the underlying replica creation is still in progress. You can check the cluster status with pg list <cls> and wait until the replica catches up with the primary, then continue the remaining tasks:

    ./pgsql.yml -t pg_hba,pg_backup,pgbouncer,pg_vip,pg_dns,pg_service,pg_exporter,pg_register

How to enable hugepages for PostgreSQL?

!> use node_hugepage_count and node_hugepage_ratio or /pg/bin/pg-tune-hugepage

If you plan to enable hugepages, consider setting node_hugepage_count and node_hugepage_ratio and apply them with ./node.yml -t node_tune.

It is best to allocate enough hugepages before PostgreSQL starts and shrink them afterward with /pg/bin/pg-tune-hugepage.

If your postgres is already running, you can use /pg/bin/pg-tune-hugepage to enable hugepage on the fly.

    sync; echo 3 > /proc/sys/vm/drop_caches    # drop system cache (be ready for a performance impact)
    sudo /pg/bin/pg-tune-hugepage              # write nr_hugepages to /etc/sysctl.d/hugepage.conf
    pg restart <cls>                           # restart postgres to use hugepages

How to guarantee zero data loss during failover?

!> Use the crit.yml template, set pg_rpo to 0, or configure the cluster in synchronous mode.

Consider using Sync Standby and Quorum Commit to guarantee zero data loss during failover.
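
A hedged sketch of the relevant cluster-level parameters (the cluster definition is a placeholder):

    pg-meta:
      vars:
        pg_cluster: pg-meta
        pg_conf: crit.yml    # use the crit template for this cluster
        pg_rpo: 0            # enforce a zero recovery point objective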

How to survive disk full?

!> rm -rf /pg/dummy will free some emergency space.

The pg_dummy_filesize is set to 64MB by default. Consider increasing it to 8GB or larger in the production environment.

It is placed at /pg/dummy, on the same disk as the PGSQL main data. You can remove that file to free some emergency space; at least you will still be able to run some shell scripts on that node.
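
For example, a sketch of raising the reserve in the config inventory (value format follows the 64MiB default):

    pg_dummy_filesize: 8GiB    # reserve more emergency space for production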

How to create replicas when data is corrupted?

!> Disable clonefrom on bad instances and reload patroni config.

Pigsty sets the clonefrom: true tag in every instance’s patroni config, which marks the instance as available for cloning replicas.

If this instance has corrupt data files, you can set clonefrom: false to avoid pulling data from the evil instance. To do so:

    $ vi /pg/bin/patroni.yml

    tags:
      nofailover: false
      clonefrom: true      # ----------> change to false
      noloadbalance: false
      nosync: false
      version: '15'
      spec: '4C.8G.50G'
      conf: 'oltp.yml'

    $ systemctl reload patroni
