
MySQL on Docker: Running a MariaDB Galera Cluster without Orchestration Tools - DB Container Management - Part 2

As we saw in the first part of this blog, a strongly consistent database cluster like Galera does not play well with container orchestration tools like Kubernetes or Swarm. We showed you how to deploy Galera and configure process management for Docker, so you retain full control of the behaviour. This blog post is the continuation of that: we are going to look into the operation and maintenance of the cluster.

To recap some of the main points from part 1 of this blog, we deployed a three-node Galera cluster, with ProxySQL and Keepalived, on three different Docker hosts, where all MariaDB instances run as Docker containers. The following diagram illustrates the final deployment:

Graceful Shutdown

To perform a graceful MySQL shutdown, the best way is to send SIGTERM (signal 15) to the container:

$ docker kill -s 15 {db_container_name}

If you would like to shut down the cluster, repeat the above command on all database containers, one node at a time. The above is similar to running "systemctl stop mysql" for a systemd-managed MariaDB service. Using the "docker stop" command is pretty risky for a database service because it only waits 10 seconds by default before Docker sends SIGKILL (unless you pass a longer grace period with the -t flag).
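
If you prefer to stick with "docker stop", give it an explicit, generous grace period; the 600 seconds below is just an example, pick a value larger than your longest expected InnoDB shutdown:

$ docker stop -t 600 mariadb1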

The last node to shut down gracefully will have a seqno not equal to -1 and the safe_to_bootstrap flag set to 1 in /{datadir volume}/grastate.dat on the Docker host, for example on host2:

$ cat /containers/mariadb2/datadir/grastate.dat
# GALERA saved state
version: 2.1
uuid:    e70b7437-645f-11e8-9f44-5b204e58220b
seqno:   7099
safe_to_bootstrap: 1

Detecting the Most Advanced Node

If the cluster didn't shut down gracefully, or the node that you are trying to bootstrap wasn't the last one to leave the cluster, you probably won't be able to bootstrap one of the Galera nodes and might encounter the following error:

2016-11-07 01:49:19 5572 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node.
It was not the last one to leave the cluster and may not contain all the updates.
To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

Galera honours the node that has safe_to_bootstrap flag set to 1 as the first reference node. This is the safest way to avoid data loss and ensure the correct node always gets bootstrapped.

If you get this error, you have to find the most advanced node first before picking it as the node to bootstrap from. Create a transient container (with the --rm flag), map it to the datadir and configuration directory of the actual database container, and add two MySQL command flags, --wsrep_recover and --wsrep_cluster_address. For example, to find out the last committed sequence number of mariadb1, we need to run:

$ docker run --rm --name mariadb-recover \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/conf.d \
        mariadb:10.2.15 \
        --wsrep_recover \
        --wsrep_cluster_address=gcomm://
2018-06-12  4:46:35 139993094592384 [Note] mysqld (mysqld 10.2.15-MariaDB-10.2.15+maria~jessie) starting as process 1 ...
2018-06-12  4:46:35 139993094592384 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
...
2018-06-12  4:46:35 139993094592384 [Note] Plugin 'FEEDBACK' is disabled.
2018-06-12  4:46:35 139993094592384 [Note] Server socket created on IP: '::'.
2018-06-12  4:46:35 139993094592384 [Note] WSREP: Recovered position: e70b7437-645f-11e8-9f44-5b204e58220b:7099

The last line is what we are looking for. MariaDB prints out the cluster UUID and the sequence number of the most recently committed transaction. The node holding the highest number is deemed the most advanced node. Since we specified --rm, the container will be removed automatically once it exits. Repeat the above step on every Docker host, replacing the --volume paths with the respective database container volumes.

Once you have compared the values reported by all database containers and decided which container holds the most up-to-date data, change the safe_to_bootstrap flag to 1 inside /{datadir volume}/grastate.dat manually. Let's say all nodes are reporting the exact same sequence number; we can just pick mariadb3 to be bootstrapped by changing its safe_to_bootstrap value to 1:

$ vim /containers/mariadb3/datadir/grastate.dat
...
safe_to_bootstrap: 1

Save the file and start bootstrapping the cluster from that node, as described in the next section.
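
If you prefer a non-interactive edit over vim, a sed one-liner achieves the same result (assuming the flag currently reads 0):

$ sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /containers/mariadb3/datadir/grastate.dat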

Bootstrapping the Cluster

Bootstrapping the cluster is similar to the first docker run command we used when starting up the cluster for the first time. If mariadb1 is the chosen bootstrap node, we can simply restart the previously created bootstrap container:

$ docker start mariadb0 # on host1

Otherwise, if the bootstrap container does not exist on the chosen node, let's say host2, run the bootstrap container command and map mariadb2's existing volumes. We use mariadb0 as the container name on host2 to indicate it is a bootstrap container:

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb2/datadir:/var/lib/mysql \
        --volume /containers/mariadb2/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

You may notice that this command is slightly shorter than the previous bootstrap command described in this guide. Since we already have the proxysql user created by our first bootstrap command, we can skip these two environment variables:

  • --env MYSQL_USER=proxysql
  • --env MYSQL_PASSWORD=proxysqlpassword

Then, start the remaining MariaDB containers, remove the bootstrap container and start the existing MariaDB container on the bootstrapped host. Basically the order of commands would be:

$ docker start mariadb1 # on host1
$ docker start mariadb3 # on host3
$ docker stop mariadb0 # on host2
$ docker start mariadb2 # on host2

At this point, the cluster is started and is running at full capacity.
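
To verify that every member has rejoined, check the wsrep status from any of the nodes; a cluster size of 3 and a "Synced" local state indicate the cluster is healthy:

$ docker exec -it mariadb1 mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_local_state_comment')"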

Resource Control

Memory is a very important resource in MySQL. This is where the buffers and caches are stored, and it's critical for MySQL to reduce the impact of hitting the disk too often. On the other hand, swapping is bad for MySQL performance. By default, there are no resource constraints on running containers; a container can use as much of a given resource as the host's kernel allows. Another important setting is the file descriptor limit. You can raise the open file descriptor limit, or "nofile", to cater for the number of files the MySQL server can open simultaneously. Setting this to a high value won't hurt.

To cap memory allocation and increase the file descriptor limit for our database container, append the --memory, --memory-swap and --ulimit parameters to the "docker run" command:

$ docker kill -s 15 mariadb1
$ docker rm -f mariadb1
$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --memory 16g \
        --memory-swap 16g \
        --ulimit nofile:16000:16000 \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

Take note that --memory-swap is the amount of combined memory and swap that can be used, while --memory is only the amount of physical memory. If --memory-swap is set to the same value as --memory, and --memory is set to a positive integer (as above), the container will not have access to swap. If --memory-swap is not set, the container's swap defaults to twice the --memory value.
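
You can confirm the limits that Docker actually applied by inspecting the container; both values are reported in bytes:

$ docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}' mariadb1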

Some container resources, like memory and CPU, can be controlled dynamically through the "docker update" command, as shown in the following example which raises the memory of container mariadb1 to 32G on the fly:

$ docker update \
    --memory 32g \
    --memory-swap 32g \
    mariadb1

Do not forget to tune the my.cnf accordingly to suit the new specs. Configuration management is explained in the next section.

Configuration Management

Most MySQL/MariaDB configuration parameters can be changed at runtime, which means you don't need to restart to apply the changes. Check out the MariaDB documentation page for details. A parameter listed with "Dynamic: Yes" takes effect immediately upon changing, without the need to restart the MariaDB server. Otherwise, set the parameters inside the custom configuration file on the Docker host. For example, on mariadb3, make the changes to the following file:

$ vim /containers/mariadb3/conf.d/my.cnf

And then restart the database container to apply the change:

$ docker restart mariadb3

Verify that the container starts up the MySQL process properly by looking at the docker logs. Perform this operation on one node at a time if you would like to make cluster-wide changes.
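
For dynamic variables, you can skip the restart entirely and apply the change at runtime. For example, to raise max_connections (a dynamic variable, used here purely as an illustration) on mariadb3:

$ docker exec -it mariadb3 mysql -uroot -p -e "SET GLOBAL max_connections = 500"

Remember to also persist the new value in the custom configuration file, otherwise it will be lost on the next container restart.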

Backup

Taking a logical backup is pretty straightforward because the MariaDB image also comes with the mysqldump binary. You simply use the "docker exec" command to run mysqldump and redirect the output to a file on the host. The following command performs a mysqldump backup of mariadb2 and saves it to /backups/mariadb2 on host2:

$ docker exec -it mariadb2 mysqldump -uroot -p --single-transaction > /backups/mariadb2/dump.sql
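
For larger datasets, it's convenient to compress the dump on the fly and timestamp the file name. The following sketch reuses the MYSQL_ROOT_PASSWORD environment variable that was passed to the container at creation time, so no interactive prompt is needed:

$ docker exec mariadb2 sh -c 'exec mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --single-transaction --all-databases' | gzip > /backups/mariadb2/dump_$(date +%F).sql.gz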

A binary backup tool like Percona Xtrabackup or MariaDB Backup requires direct access to the MariaDB data directory. You have to either install the tool inside the container, install it on the host machine, or use a dedicated image for this purpose, like the "perconalab/percona-xtrabackup" image, to create the backup and store it inside /tmp/backup on the Docker host:

$ docker run --rm -it \
    -v /containers/mariadb2/datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --backup --host=mariadb2 --user=root --password=mypassword

You can also stop the container with innodb_fast_shutdown set to 0 and copy the datadir volume to another location on the physical host:

$ docker exec -it mariadb2 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
$ docker kill -s 15 mariadb2
$ cp -Rf /containers/mariadb2/datadir /backups/mariadb2/datadir_copied
$ docker start mariadb2

Restore

Restoring is pretty straightforward for mysqldump. You can simply redirect the stdin into the container from the physical host:

$ docker exec -i mariadb2 sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' < /backups/mariadb2/dump.sql

You can also use the standard mysql client command line remotely with proper hostname and port value instead of using this "docker exec" command:

$ mysql -uroot -p -h127.0.0.1 -P3306 < /backups/mariadb2/dump.sql

For Percona Xtrabackup and MariaDB Backup, we have to prepare the backup beforehand. This will roll the backup forward to the point in time when the backup was finished. Let's say our Xtrabackup files are located under /tmp/backup on the Docker host; to prepare it, simply run:

$ docker run --rm -it \
    -v mysql-datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --prepare --target-dir /xtrabackup_backupfiles

The prepared backup under /tmp/backup on the Docker host can then be used as the MariaDB datadir for a new container or cluster. Let's say we just want to verify restoration on a standalone MariaDB container; we would run:

$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /tmp/backup:/var/lib/mysql \
    mariadb:10.2.15

If you performed a backup using the stop-and-copy approach, you can simply duplicate the datadir and map the duplicated directory as a volume to the MariaDB datadir of another container. Let's say the backup was copied over to /backups/mariadb2/datadir_copied; we can run a new container with:

$ mkdir -p /containers/mariadb-restored/datadir
$ cp -Rf /backups/mariadb2/datadir_copied /containers/mariadb-restored/datadir
$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /containers/mariadb-restored/datadir:/var/lib/mysql \
    mariadb:10.2.15

The MYSQL_ROOT_PASSWORD must match the actual root password for that particular backup.

Database Version Upgrade

There are two types of upgrade - in-place upgrade or logical upgrade.

In-place upgrade involves shutting down the MariaDB server, replacing the old binaries with the new binaries and then starting the server on the old data directory. Once started, you have to run mysql_upgrade script to check and upgrade all system tables and also to check the user tables.

The logical upgrade involves exporting SQL from the current version using a logical backup utility such as mysqldump, running the new container with the upgraded version binaries, and then applying the SQL to the new MySQL/MariaDB version. It is similar to backup and restore approach described in the previous section.

Nevertheless, it's good practice to always back up your database before performing any destructive operations. The following steps are required when upgrading from the current image, MariaDB 10.1.33, to another major version, MariaDB 10.2.15, on mariadb3, which resides on host3:

  1. Back up the database. It doesn't matter whether it is a physical or logical backup, but the latter, using mysqldump, is recommended.

  2. Download the latest image that we would like to upgrade to:

    $ docker pull mariadb:10.2.15
  3. Set innodb_fast_shutdown to 0 for our database container:

    $ docker exec -it mariadb3 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
  4. Gracefully shut down the database container:

    $ docker kill --signal=TERM mariadb3
  5. Create a new container with the new image for our database container. Keep the rest of the parameters intact, except use a new container name (otherwise it would conflict):

    $ docker run -d \
            --name mariadb3-new \
            --hostname mariadb3.weave.local \
            --net weave \
            --publish "3306:3306" \
            --publish "4444" \
            --publish "4567" \
            --publish "4568" \
            $(weave dns-args) \
            --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
            --volume /containers/mariadb3/datadir:/var/lib/mysql \
            --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
            mariadb:10.2.15 \
            --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
            --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
            --wsrep_node_address=mariadb3.weave.local
  6. Run mysql_upgrade script:

    $ docker exec -it mariadb3-new mysql_upgrade -uroot -p
  7. If no errors occurred, remove the old container, mariadb3 (the new one is mariadb3-new):

    $ docker rm -f mariadb3
  8. Otherwise, if the upgrade process fails in between, we can fall back to the previous container:

    $ docker stop mariadb3-new
    $ docker start mariadb3

A major version upgrade can be performed similarly to a minor version upgrade, except that MySQL/MariaDB only supports upgrading from the immediately preceding major version. If you are on MariaDB 10.0 and would like to upgrade to 10.2, you have to upgrade to MariaDB 10.1 first, followed by another upgrade step to MariaDB 10.2.

Take note of the configuration changes introduced and deprecated between major versions.

Failover

In Galera, all nodes are masters and hold the same role. With ProxySQL in the picture, connections that pass through this gateway will fail over automatically as long as there is a primary component for the Galera Cluster (that is, a majority of nodes are up). The application won't notice any difference if one database node goes down, because ProxySQL simply redirects the connections to the other available nodes.

If the application connects directly to MariaDB, bypassing ProxySQL, failover has to be performed on the application side by pointing to the next available node, provided the database node meets the following conditions:

  • Status wsrep_local_state_comment is Synced (The state "Desynced/Donor" is also possible, only if wsrep_sst_method is xtrabackup, xtrabackup-v2 or mariabackup).
  • Status wsrep_cluster_status is Primary.

In Galera, an available node isn't necessarily healthy until the above statuses are verified.
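
A quick way to verify both conditions before redirecting traffic to a node is a status query like the one below, wrapped in "docker exec" for our containerized setup:

$ docker exec -it mariadb1 mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_state_comment','wsrep_cluster_status')"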

Scaling Out

To scale out, we can create a new container in the same network and use the same custom configuration file as the existing container on that particular host. For example, let's say we want to add a fourth MariaDB container on host3; we can reuse the configuration file mounted for mariadb3, as illustrated in the following diagram:

Run the following command on host3 to scale out:

$ docker run -d \
        --name mariadb4 \
        --hostname mariadb4.weave.local \
        --net weave \
        --publish "3306:3307" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb4/datadir:/var/lib/mysql \
        --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local,mariadb4.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb4.weave.local

Once the container is created, it will join the cluster and perform SST. It can be accessed on port 3307 externally (outside of the Weave network), or on port 3306 from within the host or the Weave network. It's not necessary to include mariadb0.weave.local in the cluster address anymore. Once the cluster is scaled out, we need to add the new MariaDB container into the ProxySQL load balancing set via the admin console:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (10,'mariadb4.weave.local',3306);
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (20,'mariadb4.weave.local',3306);
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

Repeat the above commands on the second ProxySQL instance.
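
You can confirm that the new backend has been picked up by querying the runtime configuration from the ProxySQL admin console:

mysql> SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers WHERE hostname='mariadb4.weave.local';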

Finally, for the last step (you may skip this part if you already ran the "SAVE .. TO DISK" statement in ProxySQL), add the following lines into proxysql.cnf to make the change persistent across container restarts on host1 and host2:

$ vim /containers/proxysql1/proxysql.cnf # host1
$ vim /containers/proxysql2/proxysql.cnf # host2

And append the mariadb4-related lines under the mysql_servers directive:

mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)

Save the file and we should be good on the next container restart.

Scaling Down

To scale down, simply shut down the container gracefully. The best command would be:

$ docker kill -s 15 mariadb4
$ docker rm -f mariadb4

Remember, if a database node leaves the cluster ungracefully, it is not considered a clean scale-down and will still affect the quorum calculation.

To remove the container from ProxySQL, run the following commands on both ProxySQL containers. For example, on proxysql1:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> DELETE FROM mysql_servers WHERE hostname="mariadb4.weave.local";
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

You can then either remove the corresponding entry from proxysql.cnf or just leave it as is; it will be detected as OFFLINE from ProxySQL's point of view anyway.

Summary

With Docker, things are a bit different from the conventional way of handling MySQL or MariaDB servers. Handling stateful services like Galera Cluster is not as easy as handling stateless applications, and requires proper testing and planning.

In our next blog on this topic, we will evaluate the pros and cons of running Galera Cluster on Docker without any orchestration tools.


How to Improve Performance of Galera Cluster for MySQL or MariaDB


Galera Cluster comes with many notable features that are not available in standard MySQL replication (or Group Replication): automatic node provisioning, true multi-master with conflict resolution, and automatic failover. There are also a number of limitations that could potentially impact cluster performance. Luckily, there are workarounds if you are aware of these limitations. And if you do it right, you can minimize their impact and improve overall performance.

We have previously covered many tips and tricks related to Galera Cluster, including running Galera on AWS Cloud. This blog post dives specifically into the performance aspects, with examples on how to get the most out of Galera.

Replication Payload

A bit of introduction - Galera replicates writesets during the commit stage, transferring writesets from the originator node to the receiver nodes synchronously through the wsrep replication plugin. This plugin also certifies writesets on the receiver nodes. If the certification process passes, OK is returned to the client on the originator node and the writeset is applied on the receiver nodes at a later time, asynchronously. Otherwise, the transaction is rolled back on the originator node (returning an error to the client) and the writesets that have been transferred to the receiver nodes are discarded.

A writeset consists of the write operations inside a transaction that change the database state. In Galera Cluster, autocommit defaults to 1 (enabled). Effectively, any SQL statement executed in Galera Cluster will be enclosed as a transaction, unless you explicitly start one with BEGIN, START TRANSACTION or SET autocommit=0. The following diagram illustrates the encapsulation of a single DML statement into a writeset:

For DML (INSERT, UPDATE, DELETE..), the writeset payload consists of the binary log events for a particular transaction, while for DDL (ALTER, GRANT, CREATE..), the payload is the DDL statement itself. For DML, the writeset has to be certified against conflicts on the receiver node, while for DDL (depending on wsrep_OSU_method, which defaults to TOI), the cluster runs the DDL statement on all nodes in the same total order sequence, blocking other transactions from committing while the DDL is in progress (see also RSU). In simple words, Galera Cluster handles DDL and DML replication differently.

Round Trip Time

Generally, the following factors determine how fast Galera can replicate a writeset from an originator node to all receiver nodes:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

For example, if we have a three-node Galera Cluster and one of the nodes is located 10 milliseconds (0.01 second) away, it's very unlikely you will be able to write to the same row more than 100 times per second without conflicting. There is a popular quote from Mark Callaghan which describes this behaviour pretty well:

"[In a Galera cluster] a given row can’t be modified more than once per RTT"

To measure the RTT value, simply ping the farthest node in the cluster from the originator node:

$ ping 192.168.55.173 # the farthest node

Wait for a couple of seconds (or minutes) and terminate the command. The last line of the ping statistics section is what we are looking for:

--- 192.168.55.172 ping statistics ---
65 packets transmitted, 65 received, 0% packet loss, time 64019ms
rtt min/avg/max/mdev = 0.111/0.431/1.340/0.240 ms

The max value is 1.340 ms (0.00134s) and we should take this value when estimating the minimum transactions per second (tps) for this cluster. The average value is 0.431 ms (0.000431s), which we can use to estimate the average tps, while the min value of 0.111 ms (0.000111s) can be used to estimate the maximum tps. The mdev value indicates how far the RTT samples deviate from the average; a lower value means a more stable RTT.

Hence, transactions per second can be estimated by dividing 1 second by the RTT (in seconds), resulting in:

  • Minimum tps: 1 / 0.00134 (max RTT) = 746.26 ~ 746 tps
  • Average tps: 1 / 0.000431 (avg RTT) = 2320.19 ~ 2320 tps
  • Maximum tps: 1 / 0.000111 (min RTT) = 9009.01 ~ 9009 tps

Note that this is just an estimation to anticipate replication performance. There is not much we can do to improve this on the database side once we have everything deployed and running, unless you move or migrate the database servers closer to each other to improve the RTT between nodes, or upgrade the network peripherals or infrastructure. This would require a maintenance window and proper planning.

Chunk Up Big Transactions

Another factor is the transaction size. After the writeset is transferred, there will be a certification process. Certification is a process to determine whether or not the node can apply the writeset. Galera generates MD5 checksum pseudo keys from every full row. The cost of certification depends on the size of the writeset, which translates into a number of unique key lookups into the certification index (a hash table). If you update 500,000 rows in a single transaction, for example:

# a 500,000 rows table
mysql> UPDATE mydb.settings SET success = 1;

The above generates a single writeset with 500,000 binary log events in it. This huge writeset does not exceed wsrep_max_ws_size (which defaults to 2GB), so it will be transferred over by the Galera replication plugin to all nodes in the cluster, certifying these 500,000 rows on the receiver nodes against any conflicting transactions that are still in the slave queue. Finally, the certification status is returned to the replication plugin. The bigger the transaction size, the higher the risk it will conflict with other transactions coming from another master. Conflicting transactions waste server resources, plus cause a huge rollback on the originator node. Note that a rollback operation in MySQL is way slower and less optimized than a commit operation.

The above SQL statement can be rewritten into a more Galera-friendly form with the help of a simple loop, like the example below:

(bash)$ for i in {1..500}; do \
mysql -uuser -ppassword -e "UPDATE mydb.settings SET success = 1 WHERE success != 1 LIMIT 1000"; \
sleep 2; \
done

The above shell command updates 1000 rows per transaction, 500 times, waiting 2 seconds between executions. You could also use a stored procedure or other means to achieve a similar result. If rewriting the SQL query is not an option, simply instruct the application to execute the big transaction during a maintenance window to reduce the risk of conflicts.

For huge deletes, consider using pt-archiver from the Percona Toolkit - a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much.
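
As a rough sketch (assuming a hypothetical table mydb.logs with an indexed created_at column, and that rows older than 90 days can be purged), a chunked pt-archiver run could look like this:

$ pt-archiver \
    --source h=192.168.55.171,D=mydb,t=logs,u=user,p=password \
    --purge \
    --where "created_at < NOW() - INTERVAL 90 DAY" \
    --limit 1000 --commit-each --sleep 1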

Parallel Slave Threads

In Galera, the applier is a multithreaded process. An applier is a thread running within Galera that applies incoming writesets from other nodes. This means it is possible for all receivers to execute multiple DML operations coming from the originator (master) node simultaneously. Galera applies parallel replication only to transactions where it is safe to do so. It improves the probability of the node keeping in sync with the originator node. However, the replication speed is still limited by RTT and writeset size.

To get the best out of this, we need to know two things:

  • The number of cores the server has.
  • The value of wsrep_cert_deps_distance status.

The wsrep_cert_deps_distance status tells us the potential degree of parallelization. It is the average distance between the highest and lowest seqno values that can possibly be applied in parallel. You can use this status variable to determine the maximum number of slave threads possible. Take note that this is an average value across time. Hence, in order to get a good value, you have to hit the cluster with write operations through a test workload or benchmark until you see a stable value coming out.

To get the number of cores, you can simply use the following command:

$ grep -c processor /proc/cpuinfo
4

Ideally, 2, 3 or 4 applier threads per CPU core is a good start. Thus, the minimum value for the slave threads should be 4 x the number of CPU cores, and it must not exceed the wsrep_cert_deps_distance value:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cert_deps_distance';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| wsrep_cert_deps_distance | 48.16667 |
+--------------------------+----------+

You can control the number of slave applier threads using the wsrep_slave_threads variable. Even though this is a dynamic variable, only increasing the number has an immediate effect. If you reduce the value dynamically, it takes some time until the applier thread exits after it finishes applying. A recommended value is anywhere between 16 and 48:

mysql> SET GLOBAL wsrep_slave_threads = 48;

Take note that in order for parallel slave threads to work, the following must be set (which is usually pre-configured for Galera Cluster):

innodb_autoinc_lock_mode=2
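
To persist both settings across restarts, put them in the node's configuration file. The thread count below is only an example; derive yours from the core count and wsrep_cert_deps_distance as described above:

[mysqld]
wsrep_slave_threads = 16
innodb_autoinc_lock_mode = 2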

Galera Cache (gcache)

Galera uses a preallocated file with a specific size called gcache, where a Galera node keeps a copy of writesets in a circular buffer style. By default, its size is 128MB, which is rather small. Incremental State Transfer (IST) is a method to prepare a joiner by sending only the missing writesets available in the donor's gcache. IST is faster than State Snapshot Transfer (SST); it is non-blocking and has no significant performance impact on the donor. It should be the preferred option whenever possible.

IST can only be achieved if all changes missed by the joiner are still in the gcache file of the donor. The recommended setting is to make it as big as the whole MySQL dataset. If disk space is limited or costly, determining the right gcache size is crucial, as it can influence the data synchronization performance between Galera nodes.

The below statement will give us an idea of the amount of data replicated by Galera. Run the following statement on one of the Galera nodes during peak hours (tested on MariaDB >10.0 and PXC >5.6, galera >3.x):

mysql> SET @start := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes');
mysql> do sleep(60);
mysql> SET @end := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes');
mysql> SET @gcache := (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@@GLOBAL.wsrep_provider_options,'gcache.size = ',-1), 'M', 1));
mysql> SELECT ROUND((@end - @start),2) AS `MB/min`, ROUND((@end - @start),2) * 60 as `MB/hour`, @gcache as `gcache Size(MB)`, ROUND(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;
+--------+---------+-----------------+-----------------------+
| MB/min | MB/hour | gcache Size(MB) | Time to full(minutes) |
+--------+---------+-----------------+-----------------------+
|   7.95 |  477.00 |  128            |                 16.10 |
+--------+---------+-----------------+-----------------------+

We can estimate that the Galera node can afford approximately 16 minutes of downtime without requiring SST to rejoin (unless Galera cannot determine the joiner state). If this is too short a time, and you have enough disk space on your nodes, change wsrep_provider_options="gcache.size=<value>" to a more appropriate value. In this example workload, setting gcache.size=1G allows us to have 2 hours of node downtime with a high probability of IST when the node rejoins.

It's also recommended to use gcache.recover=yes in wsrep_provider_options (Galera >3.19), so Galera will attempt to recover the gcache file to a usable state on startup rather than delete it, thus preserving the ability to do IST and avoiding SST as much as possible. Codership and Percona have covered this in detail in their blogs. IST is always the best method to sync up after a node rejoins the cluster. It is 50% faster than xtrabackup or mariabackup and 5x faster than mysqldump.
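
Both settings go into wsrep_provider_options, which takes a semicolon-separated list. A minimal my.cnf sketch with the 1GB size from the example above (keep any other provider options you already use on the same line):

[mysqld]
wsrep_provider_options="gcache.size=1G; gcache.recover=yes"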

Asynchronous Slave

Galera nodes are tightly coupled, where replication performance is only as fast as the slowest node. Galera uses a flow control mechanism to regulate the replication flow among members and eliminate slave lag. Replication can be all fast or all slow on every node, and is adjusted automatically by Galera. If you want to learn about flow control, read this blog post by Jay Janssen from Percona.

In most cases, heavy operations like long-running analytics (read-intensive) and backups (read-intensive, locking) are often inevitable and could potentially degrade cluster performance. The best way to execute this type of query is by sending it to a loosely-coupled replica server, for instance, an asynchronous slave.

An asynchronous slave replicates from a Galera node using the standard MySQL asynchronous replication protocol. There is no limit on the number of slaves that can be connected to one Galera node, and chaining it out with an intermediate master is also possible. MySQL operations that execute on this server won't impact the cluster performance, apart from the initial syncing phase where a full backup must be taken on the Galera node to stage the slave before establishing the replication link (although ClusterControl allows you to build the async slave from an existing backup first, before connecting it to the cluster).

GTID (Global Transaction Identifier) provides better transaction mapping across nodes, and is supported in MySQL 5.6 and MariaDB 10.0. With GTID, failing a slave over to another master (another Galera node) is simplified, without the need to figure out the exact log file and position. Galera also comes with its own GTID implementation, but the two are independent of each other.

Scaling out an asynchronous slave is one-click away if you are using ClusterControl -> Add Replication Slave feature:

Take note that binary logs must be enabled on the master (the chosen Galera node) before we can proceed with this setup. We have also covered the manual way in this previous post.
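
Enabling binary logs on the chosen Galera node comes down to a few my.cnf settings followed by a restart of that node. The values below are illustrative only:

[mysqld]
server_id = 1001
log_bin = binlog
log_slave_updates = 1
binlog_format = ROW
expire_logs_days = 7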

The following screenshot from ClusterControl shows the cluster topology, it illustrates our Galera Cluster architecture with an asynchronous slave:

ClusterControl automatically discovers the topology and generates the super cool diagram like above. You can also perform administration tasks directly from this page by clicking on the top-right gear icon of each box.

SQL-aware Reverse Proxy

ProxySQL and MariaDB MaxScale are intelligent reverse proxies that understand the MySQL protocol and are capable of acting as a gateway, router, load balancer and firewall in front of your Galera nodes. With the help of a virtual IP address provider like LVS or Keepalived, combined with Galera's multi-master replication technology, we can have a highly available database service, eliminating all possible single points of failure (SPOF) from the application point of view. This will surely improve the availability and reliability of the architecture as a whole.

Another advantage of this approach is the ability to monitor, rewrite or re-route the incoming SQL queries based on a set of rules before they hit the actual database server, minimizing the changes on the application or client side and routing queries to a more suitable node for optimal performance. Risky queries for Galera, like LOCK TABLES and FLUSH TABLES WITH READ LOCK, can be blocked well before they cause havoc to the system, while impacting queries like "hotspot" queries (rows that different transactions want to access at the same time) can be rewritten or redirected to a single Galera node to reduce the risk of transaction conflicts. For heavy read-only queries like OLAP or backups, you can route them to an asynchronous slave if you have one.
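
As an illustration of query re-routing, the following ProxySQL admin statements (a sketch that assumes a reader hostgroup 20 and a hypothetical report_ table prefix) send matching SELECTs to that hostgroup:

mysql> INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply) VALUES (100, 1, '^SELECT .* FROM report_', 20, 1);
mysql> LOAD MYSQL QUERY RULES TO RUNTIME;
mysql> SAVE MYSQL QUERY RULES TO DISK;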

The reverse proxy also monitors the database state, queries and variables to understand topology changes and produce accurate routing decisions to the backend servers. Indirectly, it centralizes node monitoring and the cluster overview without the need to check each and every Galera node regularly. The following screenshot shows the ProxySQL monitoring dashboard in ClusterControl:

There are also many other benefits that a load balancer can bring to improve Galera Cluster significantly, as covered in details in this blog post, Become a ClusterControl DBA: Making your DB components HA via Load Balancers.

Final Thoughts

With a good understanding of how Galera Cluster works internally, we can work around some of its limitations and improve the database service. Happy clustering!

Galera Cluster Recovery 101 - A Deep Dive into Network Partitioning


One of the cool features in Galera is automatic node provisioning and membership control. If a node fails or loses communication, it will be automatically evicted from the cluster and remain non-operational. As long as the majority of nodes are still communicating (Galera calls this the PC - primary component), there is a very high chance the failed node will be able to automatically rejoin, resync and resume replication once the connectivity is back.

Generally, all Galera nodes are equal. They hold the same data set and the same role as masters, capable of handling reads and writes simultaneously, thanks to Galera group communication and the certification-based replication plugin. Therefore, there is actually no failover from the database point of view due to this equilibrium. Only the application side requires failover, to skip the non-operational nodes while the cluster is partitioned.

In this blog post, we are going to look into how Galera Cluster performs node and cluster recovery in case a network partition happens. Just as a side note, we covered a similar topic in this blog post some time back. Codership has explained Galera's recovery concept in great detail in the documentation page, Node Failure and Recovery.

Node Failure and Eviction

In order to understand recovery, we have to first understand how Galera detects node failure and handles the eviction process. Let's put this into a controlled test scenario so we can understand the eviction process better. Suppose we have a three-node Galera Cluster as illustrated below:

The following command can be used to retrieve our Galera provider options:

mysql> SHOW VARIABLES LIKE 'wsrep_provider_options'\G

It's a long list, but we just need to focus on some of the parameters to explain the process:

evs.inactive_check_period = PT0.5S; 
evs.inactive_timeout = PT15S; 
evs.keepalive_period = PT1S; 
evs.suspect_timeout = PT5S; 
evs.view_forget_timeout = P1D;
gmcast.peer_timeout = PT3S;

First of all, Galera follows ISO 8601 formatting to represent durations. P1D means the duration is one day, while PT15S means the duration is 15 seconds (note the time designator, T, that precedes the time value). For example, if one wanted to increase evs.view_forget_timeout to one and a half days, one would set P1DT12H, or PT36H.
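
If you ever need to relax these timeouts, for example on a less reliable network, they are set through the same wsrep_provider_options variable. A my.cnf sketch with example values only (keep any other provider options you already use on the same line):

[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S; gmcast.peer_timeout=PT5S"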

Considering none of the hosts have been configured with any firewall rules, we use the following script, called block_galera.sh, on galera2 to simulate a network failure to/from this node:

#!/bin/bash
# block_galera.sh
# galera2, 192.168.55.172

iptables -I INPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I INPUT -m tcp -p tcp --dport 3306 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 3306 -j REJECT
# print timestamp
date

By executing the script, we get the following output:

$ ./block_galera.sh
Wed Jul  4 16:46:02 UTC 2018

The reported timestamp can be considered the start of the cluster partitioning, where we lose galera2 while galera1 and galera3 are still online and accessible. At this point, our Galera Cluster architecture looks something like this:

From Partitioned Node Perspective

On galera2, you will see a number of printouts inside the MySQL error log. Let's break them into several parts. The downtime started around 16:46:02 UTC and, after gmcast.peer_timeout=PT3S, the following appears:

2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 8b2041d6 with addr tcp://192.168.55.173:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.173:4567
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 737422d6 with addr tcp://192.168.55.171:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 0

After evs.suspect_timeout = PT5S has passed, both galera1 and galera3 are suspected as dead by galera2:

2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 8b2041d6
2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive
2018-07-04 16:46:07 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 0
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspecting node: 737422d6
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Then, Galera will revise the current cluster view and the position of this node:

2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,54) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})
2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,55) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})

With the new cluster view, Galera will perform quorum calculation to decide whether this node is part of the primary component. If the new component sees "primary = no", Galera will demote the local node state from SYNCED to OPEN:

2018-07-04 16:46:09 140454288942848 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Flow-control interval: [16, 16]
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Trying to continue unpaused monitor
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Received NON-PRIMARY.
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2753699)

With the latest change on the cluster view and node state, Galera returns the post-eviction cluster view and global state as below:

2018-07-04 16:46:09 140454222194432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753699, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-07-04 16:46:09 140454222194432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can see that the following global status values of galera2 have changed during this period:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                                                                                    |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 1                                                                                                                                 |
| WSREP_CLUSTER_STATUS      | non-Primary                                                                                                                       |
| WSREP_EVS_DELAYED         | 737422d6-7db3-11e8-a2a2-bbe98913baf0:tcp://192.168.55.171:4567:1,8b2041d6-7f62-11e8-87d5-12a76678131f:tcp://192.168.55.173:4567:2 |
| WSREP_LOCAL_STATE_COMMENT | Initialized                                                                                                                       |
| WSREP_READY               | OFF                                                                                                                               |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+

At this point, the MySQL/MariaDB server on galera2 is still accessible (the database is listening on 3306 and Galera on 4567) and you can query the mysql system tables and list out the databases and tables. However, when you jump into the non-system tables and make a simple query like this:

mysql> SELECT * FROM sbtest1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

You will immediately get an error indicating that WSREP is loaded but not ready for use by this node, as reported by the wsrep_ready status. This is due to the node losing its connection to the Primary Component and entering a non-operational state (the local node status was changed from SYNCED to OPEN). Data reads from nodes in a non-operational state are considered stale, unless you set wsrep_dirty_reads=ON to permit them, although Galera still rejects any command that modifies or updates the database.
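
If stale reads are acceptable for your application during a partition, dirty reads can be enabled for the current session before querying, for example:

mysql> SET SESSION wsrep_dirty_reads = ON;
mysql> SELECT * FROM sbtest1 LIMIT 1;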

Finally, Galera will keep listening for and reconnecting to other members in the background indefinitely:

2018-07-04 16:47:12 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 30
2018-07-04 16:47:13 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 30
2018-07-04 16:48:20 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 60
2018-07-04 16:48:22 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 60

The eviction process flow by Galera group communication for the partitioned node during network issue can be summarized as below:

  1. Disconnects from the cluster after gmcast.peer_timeout.
  2. Suspects other nodes after evs.suspect_timeout.
  3. Retrieves the new cluster view.
  4. Performs quorum calculation to determine the node's state.
  5. Demotes the node from SYNCED to OPEN.
  6. Attempts to reconnect to the primary component (other Galera nodes) in the background.

From Primary Component Perspective

On galera1 and galera3 respectively, after gmcast.peer_timeout=PT3S, the following appears in the MySQL error log:

2018-07-04 16:46:05 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.172:4567
2018-07-04 16:46:06 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') reconnecting to 62116b35 (tcp://192.168.55.172:4567), attempt 0

After it passed evs.suspect_timeout = PT5S, galera2 is suspected as dead by galera3 (and galera1):

2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 62116b35
2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Galera checks whether the other nodes respond to the group communication; on galera3, it finds that galera1 is in a primary and stable state:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-04 16:46:11 139955510687488 [Note] WSREP: Node 737422d6 state prim

Galera revises the cluster view of this node (galera3):

2018-07-04 16:46:11 139955510687488 [Note] WSREP: view(view_id(PRIM,737422d6,55) memb {
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
        62116b35,0
})
2018-07-04 16:46:11 139955510687488 [Note] WSREP: save pc into disk

Galera then removes the partitioned node from the Primary Component:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: forgetting 62116b35 (tcp://192.168.55.172:4567)

The new Primary Component now consists of two nodes, galera1 and galera3:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

The Primary Component will exchange the state between each other to agree on the new cluster view and global state:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-07-04 16:46:11 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: sent state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 0 (192.168.55.171)
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 1 (192.168.55.173)

Galera calculates and verifies the quorum of the state exchange between online members:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 27,
        members    = 2/2 (joined/total),
        act_id     = 2753703,
        last_appl. = 2753606,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Flow-control interval: [23, 23]
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Trying to continue unpaused monitor

Galera updates the new cluster view and global state after galera2 eviction:

2018-07-04 16:46:11 139955214169856 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753703, view# 28: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-07-04 16:46:11 139955214169856 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-04 16:46:11 139955214169856 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-04 16:46:11 139955214169856 [Note] WSREP: Assign initial position for certification: 2753703, protocol version: 3
2018-07-04 16:46:11 139956691814144 [Note] WSREP: Service thread queue flushed.

Galera then cleans up the partitioned node (galera2) from the active list:

2018-07-04 16:46:14 139955510687488 [Note] WSREP: cleaning up 62116b35 (tcp://192.168.55.172:4567)

At this point, both galera1 and galera3 will be reporting similar global status:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                   |
+---------------------------+------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 2                                                                |
| WSREP_CLUSTER_STATUS      | Primary                                                          |
| WSREP_EVS_DELAYED         | 1491abd9-7f6d-11e8-8930-e269b03673d8:tcp://192.168.55.172:4567:1 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                           |
| WSREP_READY               | ON                                                               |
+---------------------------+------------------------------------------------------------------+

They list the problematic member in the wsrep_evs_delayed status. Since the local state is "Synced", these nodes are operational and you can redirect client connections from galera2 to any of them. If this step is inconvenient, consider using a load balancer sitting in front of the database to simplify the connection endpoint for the clients.

Node Recovery and Joining

A partitioned Galera node will keep attempting to establish a connection with the Primary Component indefinitely. Let's flush the iptables rules on galera2 to let it connect to the remaining nodes:

# on galera2
$ iptables -F

Once the node is capable of connecting to one of the nodes, Galera will start re-establishing the group communication automatically:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 8b2041d6 tcp://192.168.55.173:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 8b2041d6 at tcp://192.168.55.173:4567 stable

Node galera2 will then connect to one of the Primary Component nodes (in this case galera1, node ID 737422d6) to get the current cluster view and node states:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: Node 737422d6 state prim
2018-07-09 10:46:34 140075962705664 [Note] WSREP: view(view_id(PRIM,1491abd9,142) memb {
        1491abd9,0
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
})
2018-07-09 10:46:34 140075962705664 [Note] WSREP: save pc into disk

Galera will then perform state exchange with the rest of the members that can form the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: sent state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 0 (192.168.55.172)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 1 (192.168.55.171)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 2 (192.168.55.173)

The state exchange allows galera2 to calculate the quorum and produce the following result:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 71,
        members    = 2/3 (joined/total),
        act_id     = 2836958,
        last_appl. = 0,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc

Galera then promotes the local node state from OPEN to PRIMARY and establishes the node's connection to the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Flow-control interval: [28, 28]
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Trying to continue unpaused monitor
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2836958)

Next, Galera calculates how far the node is behind the cluster. This node requires a state transfer to catch up from writeset 2761994 to 2836958:

2018-07-09 10:46:34 140075929970432 [Note] WSREP: State transfer required:
        Group state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
        Local state: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994
2018-07-09 10:46:34 140075929970432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958, view# 72: Primary, number of nodes:
3, my index: 0, protocol version 3
2018-07-09 10:46:34 140075929970432 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Assign initial position for certification: 2836958, protocol version: 3

Galera prepares the IST listener on port 4568 on this node and asks any Synced node in the cluster to become a donor. In this case, Galera automatically picks galera3 (192.168.55.173), or it could also pick a donor from the list under wsrep_sst_donor (if defined) for the syncing operation:

2018-07-09 10:46:34 140075996276480 [Note] WSREP: Service thread queue flushed.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: IST receiver addr using tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) requested state transfer from '*any*'. Selected 2.0 (192.168.55.173)(SYNCED) as donor.

It then changes the local node state from PRIMARY to JOINER. At this stage, galera2's state transfer request is granted and it starts caching write-sets:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2836958)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Requesting state transfer: success, donor: 2
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache history reset: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994 -> 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset

Node galera2 starts receiving the missing writesets from the selected donor's gcache (galera3):

2018-07-09 10:46:34 140075954312960 [Note] WSREP: 2.0 (192.168.55.173): State transfer to 0.0 (192.168.55.172) complete.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Receiving IST: 74964 writesets, seqnos 2761994-2836958
2018-07-09 10:46:34 140075593627392 [Note] WSREP: Receiving IST...  0.0% (    0/74964 events) complete.
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 2.0 (192.168.55.173) synced with group.
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:41 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-09 10:46:44 140075593627392 [Note] WSREP: Receiving IST... 36.0% (27008/74964 events) complete.
2018-07-09 10:46:54 140075593627392 [Note] WSREP: Receiving IST... 71.6% (53696/74964 events) complete.
2018-07-09 10:47:02 140075593627392 [Note] WSREP: Receiving IST...100.0% (74964/74964 events) complete.
2018-07-09 10:47:02 140075929970432 [Note] WSREP: IST received: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:47:02 140075954312960 [Note] WSREP: 0.0 (192.168.55.172): State transfer from 2.0 (192.168.55.173) complete.

Once all the missing writesets are received and applied, Galera promotes galera2 to JOINED, up to seqno 2837012:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINER -> JOINED (TO: 2837012)
2018-07-09 10:47:02 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) synced with group.

The node applies any cached writesets in its slave queue and finishes catching up with the cluster. Its slave queue is now empty. Galera will promote galera2 to SYNCED, indicating the node is now operational and ready to serve clients:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2837012)
2018-07-09 10:47:02 140076605892352 [Note] WSREP: Synchronized with group, ready for connections

At this point, all nodes are back operational. You can verify by using the following statements on galera2:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| WSREP_CLUSTER_SIZE        | 3              |
| WSREP_CLUSTER_STATUS      | Primary        |
| WSREP_EVS_DELAYED         |                |
| WSREP_LOCAL_STATE_COMMENT | Synced         |
| WSREP_READY               | ON             |
+---------------------------+----------------+

wsrep_cluster_size is reported as 3 and the cluster status is Primary, indicating galera2 is part of the Primary Component. wsrep_evs_delayed has also been cleared and the local state is now Synced.
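
As a side note, whether a rejoining node can be served by IST (as galera2 was) or needs a full SST depends largely on whether the donor still holds the missing writesets in its gcache. If your nodes tend to be disconnected for longer periods, it may be worth enlarging the gcache. A minimal my.cnf sketch - the 1G value is just an illustrative figure, and all provider options must be combined into a single wsrep_provider_options string:

[mysqld]
# Keep more writesets cached on the donor so a returning node can use IST instead of SST.
wsrep_provider_options="gcache.size=1G"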

The recovery process flow for a partitioned node during a network issue can be summarized as below (a small monitoring sketch follows the list):

  1. Re-establishes group communication to other nodes.
  2. Retrieves the cluster view from one of the Primary Component members.
  3. Performs state exchange with the Primary Component and calculates the quorum.
  4. Changes the local node state from OPEN to PRIMARY.
  5. Calculates the gap between local node and the cluster.
  6. Changes the local node state from PRIMARY to JOINER.
  7. Prepares IST listener/receiver on port 4568.
  8. Requests state transfer via IST and picks a donor.
  9. Starts receiving and applying the missing writesets from the chosen donor's gcache.
  10. Changes the local node state from JOINER to JOINED.
  11. Catches up with the cluster by applying the cached writesets in the slave queue.
  12. Changes the local node state from JOINED to SYNCED.
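
If you prefer following this progression from the shell instead of tailing the MySQL error log, a simple polling loop over the wsrep status counters does the job. A minimal sketch, assuming root credentials and the galera2 address used in this example:

# credentials and address are placeholders - adjust to your environment
$ watch -n 2 "mysql -uroot -pyourpassword -h 192.168.55.172 -Nse \
  \"SHOW STATUS WHERE Variable_name IN ('wsrep_local_state_comment','wsrep_local_recv_queue','wsrep_cluster_size')\""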

Cluster Failure

A Galera Cluster is considered failed if no primary component (PC) is available. Consider a similar three-node Galera Cluster as depicted in the diagram below:

A cluster is considered operational if all nodes, or a majority of them, are online. Online means they can see each other through Galera's replication traffic or group communication. If no traffic is coming in or out of a node, the cluster sends a heartbeat beacon and expects the node to respond in a timely manner. Otherwise, the node is put into the delayed or suspected list, depending on how it responds.
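
The heartbeat and suspect behaviour described above is governed by the evs.* options of the Galera provider. If your network is prone to short glitches, these timeouts can be tuned; a hypothetical my.cnf excerpt (the values shown are the commonly documented defaults, so verify them against your Galera version, and combine them with any other provider options you already set):

[mysqld]
# How often keepalives are sent, how long before a silent node is suspected,
# and how long before it is declared inactive and evicted from the view.
wsrep_provider_options="evs.keepalive_period=PT1S;evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S"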

If a node goes down, let's say node C, the cluster remains operational because nodes A and B still have quorum (2 votes out of 3) and form a Primary Component. You should get the following cluster state on A and B:

mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

Now suppose the main network switch goes down, as illustrated in the following diagram:

At this point, every node loses communication with the others, and the cluster state is reported as non-Primary on all nodes (as happened to galera2 in the previous case). Each node calculates the quorum, finds itself in the minority (1 vote out of 3) and thus loses quorum, which means no Primary Component is formed and consequently all nodes refuse to serve any data. This is deemed a cluster failure.

Once the network issue is resolved, Galera automatically re-establishes communication between members, exchanges node states and determines whether the Primary Component can be reformed by comparing node states, UUIDs and seqnos. If it can, Galera re-merges the Primary Component as shown in the following lines:

2018-06-27  0:16:57 140203784476416 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: sent state msg: 5885911b-795c-11e8-8683-931c85442c7e
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 0 (192.168.55.171)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 1 (192.168.55.172)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 2 (192.168.55.173)
2018-06-27  0:16:57 140203784476416 [Warning] WSREP: Quorum: No node with complete state:

        Version      : 4
        Flags        : 0x3
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.171'
        Incoming addr: '192.168.55.171:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.172'
        Incoming addr: '192.168.55.172:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.173'
        Incoming addr: '192.168.55.173:3306'

2018-06-27  0:16:57 140203784476416 [Note] WSREP: Full re-merge of primary 5224a024-791b-11e8-a0ac-8bc6118b0f96 found: 3 of 3.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 5,
        members    = 3/3 (joined/total),
        act_id     = 112725,
        last_appl. = 112722,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Flow-control interval: [28, 28]
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Trying to continue unpaused monitor
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Restored state OPEN -> SYNCED (112725)
2018-06-27  0:16:57 140202564110080 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:112725, view# 6: Primary, number of nodes: 3, my index: 2, protocol version 3

A good indicator that the re-merge succeeded is the following line in the error log:

[Note] WSREP: Synchronized with group, ready for connections

ClusterControl Auto Recovery

ClusterControl comes with node and cluster automatic recovery features, because it oversees and understands the state of all nodes in the cluster. Automatic recovery is enabled by default if the cluster is deployed using ClusterControl. To enable or disable it, simply click the power icon in the summary bar as shown below:

A green icon means automatic recovery is turned on, while red means it is off. You can monitor the recovery progress from the Activity -> Jobs dialog. In this case, galera2 was totally inaccessible due to the firewall blocking, forcing ClusterControl to report the following:

The recovery process only commences after a grace period (30 seconds), to give the Galera node a chance to recover by itself first. If ClusterControl fails to recover a node or cluster, it pulls the MySQL error logs from all accessible nodes and raises the necessary alarms to notify the user via email, or by pushing critical events to third-party integration modules like PagerDuty, VictorOps or Slack. Manual intervention is then required. For Galera Cluster, ClusterControl keeps trying to recover from the failure until you mark the node as under maintenance, or disable the automatic recovery feature.

ClusterControl's automatic recovery is one of the most popular features, as voted by our users. It helps you take the necessary actions quickly, with a complete report on what has been attempted and recommended steps to troubleshoot the issue further. Users with support subscriptions can get an extra pair of hands by escalating the issue to our technical support team.

Conclusion

Galera's automatic node recovery and membership control are neat features that simplify cluster management, improve database reliability and reduce the risk of human error, issues that commonly haunt other open-source database replication technologies like MySQL Replication, Group Replication and PostgreSQL Streaming/Logical Replication.

Asynchronous Replication Between MySQL Galera Clusters - Failover and Failback


Galera Cluster enforces strong data consistency, where all nodes in the cluster are tightly coupled. Although network segmentation is supported, replication performance is still bound by two factors:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

While there are ways to boost the performance of Galera, it is not possible to work around these 2 limiting factors.

Luckily, Galera Cluster was built on top of MySQL, which also comes with its built-in replication feature (duh!). Both Galera replication and MySQL replication exist in the same server software independently. We can make use of these technologies to work together, where all replication within a datacenter will be on Galera while inter-datacenter replication will be on standard MySQL Replication. The slave site can act as a hot-standby site, ready to serve data once the applications are redirected to the backup site. We covered this in a previous blog on MySQL architectures for disaster recovery.

In this blog post, we’ll see how straightforward it is to set up replication between two Galera Clusters (PXC 5.7). Then we’ll look at the more challenging part, that is, handling failures at both node and cluster levels. Failover and failback operations are crucial in order to preserve data integrity across the system.

Cluster Deployment

For the sake of our example, we’ll need at least two clusters and two sites - one for the primary and another one for the secondary. It works similarly to traditional MySQL master-slave replication, but on a bigger scale with three nodes in each site. With ClusterControl, you would achieve this by deploying two separate clusters, one on each site. Then, you would configure asynchronous replication between designated nodes from each cluster.

The following diagram illustrates our default architecture:

We have 6 nodes in total, 3 on the primary site and another 3 on the disaster recovery site. To simplify the node representation, we will use the following notations:

  • Primary site: galera1-P, galera2-P, galera3-P (master)
  • Disaster recovery site: galera1-DR, galera2-DR (slave), galera3-DR

Once the Galera Cluster is deployed, simply pick one node on each site to set up the asynchronous replication link. Take note that ALL Galera nodes must be configured with binary logging and log_slave_updates enabled. Enabling GTID is highly recommended, although not compulsory. On all nodes, configure the following parameters inside my.cnf:

server_id=40 # this number must be different on every node.
binlog_format=ROW
log_bin = /var/lib/mysql-binlog/binlog
log_slave_updates = ON
gtid_mode = ON
enforce_gtid_consistency = true
expire_logs_days = 7

If you are using ClusterControl, from the web interface, pick Nodes -> the chosen Galera node -> Enable Binary Logging. You might then have to change the server-id on the DR site manually to make sure every node is holding a distinct server-id value.
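
Before creating the replication user, it is worth verifying that every node is actually producing GTID-enabled binary logs with log_slave_updates turned on; a quick check to run on each of the six nodes:

mysql> SELECT @@server_id, @@log_bin, @@binlog_format, @@log_slave_updates, @@gtid_mode, @@enforce_gtid_consistency;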

Then on galera3-P, create the replication user:

mysql> GRANT REPLICATION SLAVE ON *.* to slave@'%' IDENTIFIED BY 'slavepassword';

On galera2-DR, point the slave to the current master, galera3-P:

mysql> CHANGE MASTER TO MASTER_HOST = 'galera3-primary', MASTER_USER = 'slave', MASTER_PASSWORD = 'slavepassword' , MASTER_AUTO_POSITION=1;

Start the replication slave on galera2-DR:

mysql> START SLAVE;

From the ClusterControl dashboard, once replication is established, you should see that the DR site has a slave and the Primary Site has 3 masters (nodes that produce binary logs):

The deployment is now complete. Applications should send writes to the Primary Site only, as the replication direction goes from Primary Site to DR site. Reads can be sent to both sites, although the DR site might be lagging behind. Assuming that writes only reach the Primary Site, it should not be necessary to set the DR site to read-only (although it can be a good precaution).
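
If you do want that precaution, making the DR nodes reject application writes is a one-liner per node. Replication applier threads (both Galera and the async slave) are normally exempt from read_only, so the DR site keeps replicating; this is a sketch rather than a required step:

mysql> SET GLOBAL read_only = ON;
mysql> SET GLOBAL super_read_only = ON; -- also blocks accounts with the SUPER privilege (MySQL/PXC 5.7)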

This setup will make the primary and disaster recovery site independent of each other, loosely connected with asynchronous replication. One of the Galera nodes in the DR site will be a slave, that replicates from one of the Galera nodes (master) in the primary site. Ensure that both sites are producing binary logs with GTID, and that log_slave_updates is enabled - the updates that come from the asynchronous replication stream will be applied to the other nodes in the cluster.

We now have a system where a cluster failure on the primary site will not affect the backup site. Performance-wise, WAN latency will not impact updates on the active cluster. These are shipped asynchronously to the backup site.

As a side note, it’s also possible to have a dedicated slave instance as replication relay, instead of using one of the Galera nodes as slave.

Node Failover Procedure

In case the current master (galera3-P) fails and the remaining nodes in the Primary Site are still up, the slave on the Disaster Recovery site (galera2-DR) should be directed to any available masters, as shown in the following diagram:

With GTID-based replication, this is a piece of cake. Simply run the following on galera2-DR:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_HOST = 'galera1-P', MASTER_AUTO_POSITION=1;
mysql> START SLAVE;

Verify the slave status with:

mysql> SHOW SLAVE STATUS\G
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...            
           Retrieved_Gtid_Set: f66a6152-74d6-ee17-62b0-6117d2e643de:2043395-2047231
            Executed_Gtid_Set: d551839f-74d6-ee17-78f8-c14bd8b1e4ed:1-4,
f66a6152-74d6-ee17-62b0-6117d2e643de:1-2047231
...

Ensure the above values are reported correctly. The executed GTID set should contain two GTID sets at this point: one for all transactions executed on the old master, and the other for the new master.

Cluster Failover Procedure

If the primary cluster goes down, crashes, or simply loses connectivity from the application standpoint, the application can be directed to the DR site instantly. No database failover is necessary to continue the operation. If the application has connected to the DR site and starts to write, it is important to ensure no other writes are happening on the Primary Site once the DR site is activated.

The following diagram shows our architecture after application is failed over to the DR site:

Assuming the Primary Site is still down, at this point, there is no replication between sites until we re-configure one of the nodes in the Primary Site once it comes back up.

For cleanup purposes, the slave process on the DR site has to be stopped. On galera2-DR, stop the replication slave:

mysql> STOP SLAVE;

The failover to the DR site is now considered complete.

Cluster Failback Procedure

To failback to the Primary Site, one of the Galera nodes must become a slave to catch up on changes that happened on the DR site. The procedure would be something like the following:

  1. Pick one node to become a slave in the Primary Site (galera3-P).
  2. Stop all nodes other than the chosen one (galera1-P and galera2-P). Keep the chosen slave up (galera3-P).
  3. Create a backup from the new master (galera2-DR) in the DR site and transfer it over to the chosen slave (galera3-P).
  4. Restore the backup.
  5. Start the replication slave.
  6. Start the remaining Galera node in the Primary Site, with grastate.dat removed.

The below steps can then be performed to fail back the system to its original architecture - Primary is the master and DR is the slave.

1) Shut down all nodes other than the chosen slave:

$ systemctl stop mysql # galera1-P
$ systemctl stop mysql # galera2-P

Or from the ClusterControl interface, simply pick the node from the UI and click "Shutdown Node".

2) Pick a node in the DR site to be the new master (galera2-DR). Create a mysqldump backup with the necessary parameters (for PXC, pxc_strict_mode has to be set to something other than ENFORCING):

$ mysql -uroot -p -e 'set global pxc_strict_mode = "PERMISSIVE"'
$ mysqldump -uroot -p --all-databases --triggers --routines --events > dump.sql
$ mysql -uroot -p -e 'set global pxc_strict_mode = "ENFORCING"'

3) Transfer the backup to the chosen slave, galera3-P via your preferred remote copy tool:

$ scp dump.sql galera3-primary:~

4) In order to perform RESET MASTER on a Galera node, the Galera replication plugin must be turned off. On galera3-P, disable Galera write-set replication temporarily and then restore the dump file in the very same session:

mysql> SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
mysql> SET wsrep_on=OFF;
mysql> RESET MASTER;
mysql> SOURCE /root/dump.sql;

The wsrep_on variable here is set at the session level. Therefore, we have to perform the restore operation within the same session using the SOURCE statement. Otherwise, restoring with the standard mysql client would require disabling wsrep_on for that client session, or commenting out wsrep_provider inside my.cnf before starting MySQL.
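
As an alternative, you can restore with the standard mysql client in a separate invocation by disabling write-set replication just for that client session. A hedged sketch, assuming pxc_strict_mode has already been relaxed globally and RESET MASTER has been executed as above:

$ mysql -uroot -p --init-command="SET SESSION wsrep_on=OFF" < /root/dump.sql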

5) Start the replication thread on the chosen slave, galera3-P:

mysql> CHANGE MASTER TO MASTER_HOST = 'galera2-DR', MASTER_USER = 'slave', MASTER_PASSWORD = 'slavepassword', MASTER_AUTO_POSITION = 1;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

6) Start the remaining nodes in the cluster (one node at a time), and force an SST by removing grastate.dat beforehand:

$ rm -Rf /var/lib/mysql/grastate.dat
$ systemctl start mysql

Or from ClusterControl, simply pick the node -> Start Node -> check "Perform an Initial Start".

The above will force the other Galera nodes to re-sync with galera3-P through SST and get the most up-to-date data. At this point, the replication direction has reversed and now flows from DR to Primary. Write operations are coming to the DR site and the Primary Site has become the replicating site:

From the ClusterControl dashboard, you will notice the Primary Site has a slave configured while the DR site nodes are all masters. In ClusterControl, the MASTER indicator means that the Galera node is generating binary logs:

7) Optionally, we can clean up the slave's entries on galera2-DR since it has become a master:

mysql> RESET SLAVE ALL;

8) Once the Primary Site catches up, we may switch the application's database traffic back to the primary cluster:

At this point, all writes must go to the Primary Site only. The replication link should be stopped as described under the "Cluster Failover Procedure" section above.

The above-mentioned failback steps should then be applied when staging the DR site back from the Primary Site:

  • Stop replication between primary site and DR site.
  • Re-slave one of the Galera nodes on the DR site to replicate from the Primary Site.
  • Start replication between both sites.

Once done, the replication direction is back to its original configuration, from Primary to DR. Write operations are coming to the Primary Site and the DR Site is now the replicating site:

Finally, perform some clean ups on the newly promoted master by running "RESET SLAVE ALL".

Advantages

Cluster-to-cluster asynchronous replication comes with a number of advantages:

  • Minimal downtime during the database failover operation. Basically, you can redirect writes to the slave site almost instantly, if and only if you can protect writes from reaching the master site (as these writes would not be replicated, and would probably be overwritten when re-syncing from the DR site).
  • No performance impact on the primary site since it is independent from the backup (DR) site. Replication from master to slave is performed asynchronously. The master site generates binary logs, the slave site replicates the events and applies the events at some later time.
  • Disaster recovery site can be used for other purposes, e.g., database backup, binary logs backup and reporting or heavy analytical queries (OLAP). Both sites can be used simultaneously, with exceptions on the replication lag and read-only operations on the slave side.
  • The DR cluster could potentially run on smaller instances in a public cloud environment, as long as they can keep up with the primary cluster. The instances can be upgraded if needed. In certain scenarios, it can save you some costs.
  • You only need one extra site for Disaster Recovery, as opposed to active-active Galera multi-site replication setup, which requires at least 3 sites to operate correctly.

Disadvantages

There are also drawbacks having this setup:

  • There is a chance of missing some data during failover if the slave was behind, since replication is asynchronous. This could be improved with semi-synchronous and multi-threaded slave replication, although there will be another set of challenges waiting (network overhead, replication gaps, etc.).
  • Despite the failover operation being fairly simple, the failback operation can be tricky and prone to human error. It requires some expertise on switching master/slave role back to the primary site. It's recommended to keep the procedures documented, rehearse the failover/failback operation regularly and use accurate reporting and monitoring tools.
  • There is no built-in failure detection and notification process. You may need to automate it and send out notifications to the relevant people once an unwanted event occurs. One good way is to regularly check the slave's status from another available node's point of view on the master's site before raising an alarm for the operations team (see the sketch after this list).
  • Pretty costly, as you have to set up a similar number of nodes on the disaster recovery site. This is not black and white, though, as the cost justification usually comes from the requirements of your business. With some planning, it is possible to maximize the usage of database resources at both sites, regardless of the database roles.
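
As an illustration of the failure detection point above, a small cron-friendly script along these lines could watch the replication link and alert; the hostname, credentials, lag threshold and mail address are all placeholders:

#!/bin/bash
# Hypothetical replication health check for the DR slave.
SLAVE_HOST=galera2-DR
STATUS=$(mysql -h "$SLAVE_HOST" -uroot -p"${MYSQL_ROOT_PASSWORD}" -e 'SHOW SLAVE STATUS\G')
IO=$(echo "$STATUS"  | awk '/Slave_IO_Running:/  {print $2}')
SQL=$(echo "$STATUS" | awk '/Slave_SQL_Running:/ {print $2}')
LAG=$(echo "$STATUS" | awk '/Seconds_Behind_Master:/ {print $2}')
if [ "$IO" != "Yes" ] || [ "$SQL" != "Yes" ] || [ "${LAG:-0}" -gt 300 ]; then
    echo "Replication to $SLAVE_HOST unhealthy: IO=$IO SQL=$SQL lag=$LAG" \
      | mail -s "DR replication alert" dba@example.com
fi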

MySQL on Docker: How to Monitor MySQL Containers with Prometheus - Part 1 - Deployment on Standalone and Swarm


Monitoring is a concern for containers, as the infrastructure is dynamic. Containers can be routinely created and destroyed, and are ephemeral. So how do you keep track of your MySQL instances running on Docker?

As with any software component, there are many options out there that can be used. We’ll look at Prometheus, a solution built for distributed infrastructure that works very well with Docker.

This is a two-part blog. In this part 1 blog, we are going to cover the deployment aspect of our MySQL containers with Prometheus and its components, running as standalone Docker containers and Docker Swarm services. In part 2, we will look at the important metrics to monitor from our MySQL containers, as well as integration with the paging and notification systems.

Introduction to Prometheus

Prometheus is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. Prometheus collects metrics through a pull mechanism from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. It supports all the target metrics we want to measure when running MySQL as Docker containers: physical host metrics, Docker container metrics and MySQL server metrics.

Take a look at the following diagram which illustrates Prometheus architecture (taken from Prometheus official documentation):

We are going to deploy some MySQL containers (standalone and Docker Swarm) complete with a Prometheus server, MySQL exporter (i.e., a Prometheus agent to expose MySQL metrics, that can then be scraped by the Prometheus server) and also Alertmanager to handle alerts based on the collected metrics.

For more details check out the Prometheus documentation. In this example, we are going to use the official Docker images provided by the Prometheus team.

Standalone Docker

Deploying MySQL Containers

Let's run two standalone MySQL servers on Docker to simplify our deployment walkthrough. One container will be using the latest MySQL 8.0 and the other one is MySQL 5.7. Both containers are in the same Docker network called "db_network":

$ docker network create db_network
$ docker run -d \
--name mysql80 \
--publish 3306 \
--network db_network \
--restart unless-stopped \
--env MYSQL_ROOT_PASSWORD=mypassword \
--volume mysql80-datadir:/var/lib/mysql \
mysql:8 \
--default-authentication-plugin=mysql_native_password

MySQL 8 defaults to a new authentication plugin called caching_sha2_password. For compatibility with the Prometheus MySQL exporter container, let's use the widely-used mysql_native_password plugin whenever we create a new MySQL user on this server.

For the second MySQL container running 5.7, we execute the following:

$ docker run -d \
--name mysql57 \
--publish 3306 \
--network db_network \
--restart unless-stopped \
--env MYSQL_ROOT_PASSWORD=mypassword \
--volume mysql57-datadir:/var/lib/mysql \
mysql:5.7

Verify if our MySQL servers are running OK:

[root@docker1 mysql]# docker ps | grep mysql
cc3cd3c4022a        mysql:5.7           "docker-entrypoint.s…"   12 minutes ago      Up 12 minutes       0.0.0.0:32770->3306/tcp   mysql57
9b7857c5b6a1        mysql:8             "docker-entrypoint.s…"   14 minutes ago      Up 14 minutes       0.0.0.0:32769->3306/tcp   mysql80

At this point, our architecture is looking something like this:

Let's get started to monitor them.

Exposing Docker Metrics to Prometheus

Docker has built-in support to act as a Prometheus target, which we can use to monitor the Docker engine statistics. We can enable it by creating a text file called "daemon.json" on the Docker host:

$ vim /etc/docker/daemon.json

And add the following lines:

{
  "metrics-addr" : "12.168.55.161:9323",
  "experimental" : true
}

Where 192.168.55.161 is the Docker host's primary IP address. Then, restart the Docker daemon to load the change:

$ systemctl restart docker

Since we have defined --restart=unless-stopped in our MySQL containers' run command, the containers will be started automatically once Docker is running.
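
You can quickly confirm that the Docker engine is now exposing its metrics by fetching the endpoint directly; it should return a long list of counters in Prometheus text format:

$ curl -s http://192.168.55.161:9323/metrics | head -10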

Deploying MySQL Exporter

Before we move further, the mysqld exporter requires a MySQL user to be used for monitoring purposes. On our MySQL containers, create the monitoring user:

$ docker exec -it mysql80 mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Take note that it is recommended to set a max connection limit for the user to avoid overloading the server with monitoring scrapes under heavy load. Repeat the above statements on the second container, mysql57:

$ docker exec -it mysql57 mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Let's run the mysqld exporter container called "mysql80-exporter" to expose the metrics for our MySQL 8.0 instance, as below:

$ docker run -d \
--name mysql80-exporter \
--publish 9104 \
--network db_network \
--restart always \
--env DATA_SOURCE_NAME="exporter:exporterpassword@(mysql80:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

And also another exporter container for our MySQL 5.7 instance:

$ docker run -d \
--name mysql57-exporter \
--publish 9104 \
--network db_network \
--restart always \
-e DATA_SOURCE_NAME="exporter:exporterpassword@(mysql57:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

We enabled a bunch of collector flags for the container to expose the MySQL metrics. You can also enable --collect.slave_status and --collect.slave_hosts if you have MySQL replication running on containers.

We should be able to retrieve the MySQL metrics via curl from the Docker host directly (port 32771 is the published port assigned automatically by Docker for container mysql80-exporter):

$ curl 127.0.0.1:32771/metrics
...
mysql_info_schema_threads_seconds{state="waiting for lock"} 0
mysql_info_schema_threads_seconds{state="waiting for table flush"} 0
mysql_info_schema_threads_seconds{state="waiting for tables"} 0
mysql_info_schema_threads_seconds{state="waiting on cond"} 0
mysql_info_schema_threads_seconds{state="writing to net"} 0
...
process_virtual_memory_bytes 1.9390464e+07

At this point, our architecture is looking something like this:

We are now good to setup the Prometheus server.

Deploying Prometheus Server

First, create the Prometheus configuration file at ~/prometheus.yml and add the following lines:

$ vim ~/prometheus.yml
global:
  scrape_interval:     5s
  scrape_timeout:      3s
  evaluation_interval: 5s

# Our alerting rule files
rule_files:
  - "alert.rules"

# Scrape endpoints
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql57-exporter:9104','mysql80-exporter:9104']

  - job_name: 'docker'
    static_configs:
      - targets: ['192.168.55.161:9323']

In the Prometheus configuration file, we have defined three jobs - "prometheus", "mysql" and "docker". The first one monitors the Prometheus server itself. The next one, "mysql", monitors our MySQL containers; we define the endpoints of our MySQL exporters on port 9104, which expose the Prometheus-compatible metrics from the MySQL 8.0 and 5.7 instances respectively. The "alert.rules" entry refers to the rule file that we will include later, in the next blog post, for alerting purposes.

We can then map this configuration into the Prometheus container. We also need to create a Docker volume for Prometheus data for persistency, and expose port 9090 publicly:

$ docker run -d \
--name prometheus-server \
--publish 9090:9090 \
--network db_network \
--restart unless-stopped \
--mount type=volume,src=prometheus-data,target=/prometheus \
--mount type=bind,src="$(pwd)"/prometheus.yml,target=/etc/prometheus/prometheus.yml \
--mount type=bind,src="$(pwd)
prom/prometheus

Our Prometheus server is now running and can be accessed directly on port 9090 of the Docker host. Open a web browser and go to http://192.168.55.161:9090/ to access the Prometheus web UI. Verify the target status under Status -> Targets and make sure all targets are green:
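
The same target health information is also available from the Prometheus HTTP API, which is handy for scripting checks (this example assumes jq is installed on the host):

$ curl -s http://192.168.55.161:9090/api/v1/targets | \
  jq '.data.activeTargets[] | .labels.job + " (" + .scrapeUrl + "): " + .health'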

At this point, our container architecture is looking something like this:

Our Prometheus monitoring system for the standalone MySQL containers is now deployed.

Docker Swarm

Deploying a 3-node Galera Cluster

Suppose we want to deploy a three-node Galera Cluster in Docker Swarm. We would have to create 3 different services, each service representing one Galera node. Using this approach, we can keep a static, resolvable hostname for each Galera container, together with the MySQL exporter containers that will accompany each of them. We will be using the MariaDB 10.2 image maintained by the Docker team to run our Galera cluster.

First, create a MySQL configuration file to be used by our Swarm services:

$ vim ~/my.cnf
[mysqld]

default_storage_engine          = InnoDB
binlog_format                   = ROW

innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_lock_schedule_algorithm  = FCFS # MariaDB >10.1.19 and >10.2.3 only

wsrep_on                        = ON
wsrep_provider                  = /usr/lib/galera/libgalera_smm.so
wsrep_sst_method                = mariabackup

Create a dedicated database network in our Swarm called "db_swarm":

$ docker network create --driver overlay db_swarm

Import our MySQL configuration file into Docker config so we can load it into our Swarm services when we create them later:

$ cat ~/my.cnf | docker config create my-cnf -
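
You can verify that the configuration has been stored correctly in Swarm before using it:

$ docker config ls
$ docker config inspect --pretty my-cnf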

Create the first Galera bootstrap service, called "galera0", with "gcomm://" as the cluster address. This is a transient service for the bootstrapping process only. We will delete this service once we have the 3 other Galera services running:

$ docker service create \
--name galera0 \
--replicas 1 \
--hostname galera0 \
--network db_swarm \
--publish 3306 \
--publish 4444 \
--publish 4567 \
--publish 4568 \
--config src=my-cnf,target=/etc/mysql/mariadb.conf.d/my.cnf \
--env MYSQL_ROOT_PASSWORD=mypassword \
--mount type=volume,src=galera0-datadir,dst=/var/lib/mysql \
mariadb:10.2 \
--wsrep_cluster_address=gcomm:// \
--wsrep_sst_auth="root:mypassword" \
--wsrep_node_address=galera0

At this point, our database architecture can be illustrated as below:

Then, repeat the following command 3 times to create 3 different Galera services. Replace {name} with galera1, galera2 and galera3 respectively:

$ docker service create \
--name {name} \
--replicas 1 \
--hostname {name} \
--network db_swarm \
--publish 3306 \
--publish 4444 \
--publish 4567 \
--publish 4568 \
--config src=my-cnf,target=/etc/mysql/mariadb.conf.d/my.cnf \
--env MYSQL_ROOT_PASSWORD=mypassword \
--mount type=volume,src={name}-datadir,dst=/var/lib/mysql \
mariadb:10.2 \
--wsrep_cluster_address=gcomm://galera0,galera1,galera2,galera3 \
--wsrep_sst_auth="root:mypassword" \
--wsrep_node_address={name}

Verify our current Docker services:

$ docker service ls 
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
wpcxye3c4e9d        galera0             replicated          1/1                 mariadb:10.2        *:30022->3306/tcp, *:30023->4444/tcp, *:30024-30025->4567-4568/tcp
jsamvxw9tqpw        galera1             replicated          1/1                 mariadb:10.2        *:30026->3306/tcp, *:30027->4444/tcp, *:30028-30029->4567-4568/tcp
otbwnb3ridg0        galera2             replicated          1/1                 mariadb:10.2        *:30030->3306/tcp, *:30031->4444/tcp, *:30032-30033->4567-4568/tcp
5jp9dpv5twy3        galera3             replicated          1/1                 mariadb:10.2        *:30034->3306/tcp, *:30035->4444/tcp, *:30036-30037->4567-4568/tcp

Our architecture is now looking something like this:

We need to remove the Galera bootstrap Swarm service, galera0, to stop it from running, because if the container gets rescheduled by Docker Swarm, a new replica will be started with a fresh new volume. We would then run the risk of data loss, because --wsrep_cluster_address still contains "galera0" in the other Galera nodes (or Swarm services). So, let's remove it:

$ docker service rm galera0

At this point, we have our three-node Galera Cluster:
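
To double-check from the database side, you can query the cluster size from the Swarm node that is running one of the Galera tasks (the container name carries a task suffix, hence the docker ps lookup); it should report a value of 3:

$ docker exec -it $(docker ps -q -f name=galera1) \
  mysql -uroot -pmypassword -e "SHOW STATUS LIKE 'wsrep_cluster_size'"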

We are now ready to deploy our MySQL exporter and Prometheus Server.

MySQL Exporter Swarm Service

Log in to one of the Galera containers and create the exporter user with the proper privileges (replace {galera1} with the actual container name of the Galera task running on that Docker host):

$ docker exec -it {galera1} mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Then, create the exporter service for each of the Galera services (replace {name} with galera1, galera2 and galera3 respectively):

$ docker service create \
--name {name}-exporter \
--network db_swarm \
--replicas 1 \
-p 9104 \
-e DATA_SOURCE_NAME="exporter:exporterpassword@({name}:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

At this point, our architecture is looking something like this with exporter services in the picture:

Prometheus Server Swarm Service

Finally, let's deploy our Prometheus server. Similar to the Galera deployment, we have to prepare the Prometheus configuration file first before importing it into Swarm using Docker config command:

$ vim ~/prometheus.yml
global:
  scrape_interval:     5s
  scrape_timeout:      3s
  evaluation_interval: 5s

# Our alerting rule files
rule_files:
  - "alert.rules"

# Scrape endpoints
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'galera'
    static_configs:
      - targets: ['galera1-exporter:9104','galera2-exporter:9104', 'galera3-exporter:9104']

In this Prometheus configuration file, we have defined two jobs - "prometheus" and "galera". The first one monitors the Prometheus server itself. The next one, "galera", monitors our MySQL containers; we define the endpoints of our MySQL exporters on port 9104, which expose the Prometheus-compatible metrics from the three Galera nodes respectively. The "alert.rules" entry refers to the rule file that we will include later, in the next blog post, for alerting purposes.

Import the configuration file into Docker config to be used with Prometheus container later:

$ cat ~/prometheus.yml | docker config create prometheus-yml -

Let's run the Prometheus server container, and publish port 9090 of all Docker hosts for the Prometheus web UI service:

$ docker service create \
--name prometheus-server \
--publish 9090:9090 \
--network db_swarm \
--replicas 1 \
--config src=prometheus-yml,target=/etc/prometheus/prometheus.yml \
--mount type=volume,src=prometheus-data,dst=/prometheus \
prom/prometheus

Verify with the Docker service command that we have 3 Galera services, 3 exporter services and 1 Prometheus service:

$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                         PORTS
jsamvxw9tqpw        galera1             replicated          1/1                 mariadb:10.2                  *:30026->3306/tcp, *:30027->4444/tcp, *:30028-30029->4567-4568/tcp
hbh1dtljn535        galera1-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30038->9104/tcp
otbwnb3ridg0        galera2             replicated          1/1                 mariadb:10.2                  *:30030->3306/tcp, *:30031->4444/tcp, *:30032-30033->4567-4568/tcp
jq8i77ch5oi3        galera2-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30039->9104/tcp
5jp9dpv5twy3        galera3             replicated          1/1                 mariadb:10.2                  *:30034->3306/tcp, *:30035->4444/tcp, *:30036-30037->4567-4568/tcp
10gdkm1ypkav        galera3-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30040->9104/tcp
gv9llxrig30e        prometheus-server   replicated          1/1                 prom/prometheus:latest        *:9090->9090/tcp

Our Prometheus server is now running and can be accessed directly on port 9090 from any Docker node. Open a web browser and go to http://192.168.55.161:9090/ to access the Prometheus web UI. Verify the target status under Status -> Targets and make sure all targets are green:

At this point, our Swarm architecture is looking something like this:

To be continued..

We now have our database and monitoring stack deployed on Docker. In part 2 of the blog, we will look into the different MySQL metrics to keep an eye on. We’ll also see how to configure alerting with Prometheus.

How to Monitor your Database Servers using ClusterControl CLI


How would you like to merge the "top" process list for all your 5 database nodes and sort it by CPU usage with just a one-liner command? Yeah, you read that right! How about interactive graphs displayed in the terminal interface? We introduced the CLI client for ClusterControl, called s9s, about a year ago, and it’s been a great complement to the web interface. It’s also open source.

In this blog post, we’ll show you how you can monitor your databases using your terminal and s9s CLI.

Introduction to s9s, The ClusterControl CLI

ClusterControl CLI (or s9s, or the s9s CLI) is an open source project and optional package introduced with ClusterControl version 1.4.1. It is a command line tool to interact with, control and manage your database infrastructure using ClusterControl. The s9s command line project is open source and can be found on GitHub.

Starting from version 1.4.1, the installer script will automatically install the package (s9s-tools) on the ClusterControl node.

Some prerequisites: in order to run the s9s-tools CLI, the following must be true:

  • A running ClusterControl Controller (cmon).
  • s9s client, install as a separate package.
  • Port 9501 must be reachable by the s9s client.

Installing the s9s CLI is straightforward if you install it on the ClusterControl Controller host itself:

$ rm -Rf ~/.s9s
$ wget http://repo.severalnines.com/s9s-tools/install-s9s-tools.sh
$ chmod 755 install-s9s-tools.sh
$ ./install-s9s-tools.sh

You can also install s9s-tools outside of the ClusterControl server (your workstation laptop or a bastion host), as long as the ClusterControl Controller RPC (TLS) interface is exposed to the public network (it defaults to 127.0.0.1:9501). You can find more details on how to configure this in the documentation.

To verify that you can connect to the ClusterControl RPC interface correctly, you should get an OK response when running the following command:

$ s9s cluster --ping
PING OK 2.000 ms

As a side note, also look at the limitations when using this tool.

Example Deployment

Our example deployment consists of 8 nodes across 3 clusters:

  • PostgreSQL Streaming Replication - 1 master, 2 slaves
  • MySQL Replication - 1 master, 1 slave
  • MongoDB Replica Set - 1 primary, 2 secondary nodes

All database clusters were deployed by ClusterControl using the "Deploy Database Cluster" deployment wizard and, from the UI point of view, this is what we would see in the cluster dashboard:

Cluster Monitoring

We will start by listing out the clusters:

$ s9s cluster --list --long
ID STATE   TYPE              OWNER  GROUP  NAME                   COMMENT
23 STARTED postgresql_single system admins PostgreSQL 10          All nodes are operational.
24 STARTED replication       system admins Oracle 5.7 Replication All nodes are operational.
25 STARTED mongodb           system admins MongoDB 3.6            All nodes are operational.

We see the same clusters as in the UI. We can get more details on a particular cluster by using the --stat flag. Multiple clusters and nodes can also be monitored this way; the command line options can even use wildcards in the node and cluster names:

$ s9s cluster --stat *Replication
Oracle 5.7 Replication
    Name: Oracle 5.7 Replication              Owner: system/admins
      ID: 24                                  State: STARTED
    Type: REPLICATION                        Vendor: oracle 5.7
  Status: All nodes are operational.
  Alarms:  0 crit   1 warn
    Jobs:  0 abort  0 defnd  0 dequd  0 faild  7 finsd  0 runng
  Config: '/etc/cmon.d/cmon_24.cnf'
 LogFile: '/var/log/cmon_24.log'

                                                                                HOSTNAME    CPU   MEMORY   SWAP    DISK       NICs
                                                                                10.0.0.104 1  6% 992M 120M 0B 0B 19G 13G   10K/s 54K/s
                                                                                10.0.0.168 1  6% 992M 116M 0B 0B 19G 13G   11K/s 66K/s
                                                                                10.0.0.156 2 39% 3.6G 2.4G 0B 0B 19G 3.3G 338K/s 79K/s

The output above gives a summary of our MySQL replication cluster, together with the cluster status, state, vendor, configuration file and so on. Further down, you can see the list of nodes that fall under this cluster ID, with a summarized view of the system resources for each host, like the number of CPUs, total memory, memory usage, swap, disk and network interfaces. All information shown is retrieved from the CMON database, not directly from the actual nodes.

You can also get a summarized view of all databases on all clusters:

$ s9s  cluster --list-databases --long
SIZE        #TBL #ROWS     OWNER  GROUP  CLUSTER                DATABASE
  7,340,032    0         0 system admins PostgreSQL 10          postgres
  7,340,032    0         0 system admins PostgreSQL 10          template1
  7,340,032    0         0 system admins PostgreSQL 10          template0
765,460,480   24 2,399,611 system admins PostgreSQL 10          sbtest
          0  101         - system admins Oracle 5.7 Replication sys
Total: 5 databases, 789,577,728, 125 tables.

The last line summarizes that we have a total of 5 databases with 125 tables; 4 of the databases are on our PostgreSQL cluster.

For complete usage examples of the s9s cluster command line options, check out the s9s cluster documentation.

Node Monitoring

For node monitoring, the s9s CLI has features similar to those of the cluster option. To get a summarized view of all nodes, you can simply run:

$ s9s node --list --long
STAT VERSION    CID CLUSTER                HOST       PORT  COMMENT
coC- 1.6.2.2662  23 PostgreSQL 10          10.0.0.156  9500 Up and running
poM- 10.4        23 PostgreSQL 10          10.0.0.44   5432 Up and running
poS- 10.4        23 PostgreSQL 10          10.0.0.58   5432 Up and running
poS- 10.4        23 PostgreSQL 10          10.0.0.60   5432 Up and running
soS- 5.7.23-log  24 Oracle 5.7 Replication 10.0.0.104  3306 Up and running.
coC- 1.6.2.2662  24 Oracle 5.7 Replication 10.0.0.156  9500 Up and running
soM- 5.7.23-log  24 Oracle 5.7 Replication 10.0.0.168  3306 Up and running.
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.125 27017 Up and Running
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.131 27017 Up and Running
coC- 1.6.2.2662  25 MongoDB 3.6            10.0.0.156  9500 Up and running
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.35  27017 Up and Running
Total: 11

The left-most column specifies the type of node. For this deployment, "c" represents the ClusterControl Controller, "p" PostgreSQL, "m" MongoDB, "e" Memcached and "s" generic MySQL nodes. The next character is the host status - "o" for online, "l" for off-line, "f" for failed nodes and so on. The next one is the role of the node in the cluster. It can be M for master, S for slave, C for controller and - for everything else. The remaining columns are pretty self-explanatory.

You can get all the list by looking at the man page of this component:

$ man s9s-node

From there, we can jump into more detailed stats for all nodes with the --stat flag:

$ s9s node --stat --cluster-id=24
 10.0.0.104:3306
    Name: 10.0.0.104              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.104                 Port: 3306
   Alias: -                         Owner: system/admins
   Class: CmonMySqlHost              Type: mysql
  Status: CmonHostOnline             Role: slave
      OS: centos 7.0.1406 core     Access: read-only
   VM ID: -
 Version: 5.7.23-log
 Message: Up and running.
LastSeen: Just now                    SSH: 0 fail(s)
 Connect: y Maintenance: n Managed: n Recovery: n Skip DNS: y SuperReadOnly: n
     Pid: 16592  Uptime: 01:44:38
  Config: '/etc/my.cnf'
 LogFile: '/var/log/mysql/mysqld.log'
 PidFile: '/var/lib/mysql/mysql.pid'
 DataDir: '/var/lib/mysql/'
 10.0.0.168:3306
    Name: 10.0.0.168              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.168                 Port: 3306
   Alias: -                         Owner: system/admins
   Class: CmonMySqlHost              Type: mysql
  Status: CmonHostOnline             Role: master
      OS: centos 7.0.1406 core     Access: read-write
   VM ID: -
 Version: 5.7.23-log
 Message: Up and running.
  Slaves: 10.0.0.104:3306
LastSeen: Just now                    SSH: 0 fail(s)
 Connect: n Maintenance: n Managed: n Recovery: n Skip DNS: y SuperReadOnly: n
     Pid: 975  Uptime: 01:52:53
  Config: '/etc/my.cnf'
 LogFile: '/var/log/mysql/mysqld.log'
 PidFile: '/var/lib/mysql/mysql.pid'
 DataDir: '/var/lib/mysql/'
 10.0.0.156:9500
    Name: 10.0.0.156              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.156                 Port: 9500
   Alias: -                         Owner: system/admins
   Class: CmonHost                   Type: controller
  Status: CmonHostOnline             Role: controller
      OS: centos 7.0.1406 core     Access: read-write
   VM ID: -
 Version: 1.6.2.2662
 Message: Up and running
LastSeen: 28 seconds ago              SSH: 0 fail(s)
 Connect: n Maintenance: n Managed: n Recovery: n Skip DNS: n SuperReadOnly: n
     Pid: 12746  Uptime: 01:10:05
  Config: ''
 LogFile: '/var/log/cmon_24.log'
 PidFile: ''
 DataDir: ''

Printing graphs with the s9s client can also be very informative. This presents the data collected by the controller in various graphs. There are almost 30 graphs supported by this tool, as listed here, and s9s-node enumerates them all. The following shows the server load histogram of all nodes for cluster ID 1, as collected by CMON, right from your terminal:

It is possible to set the start and end date and time. One can view short periods (like the last hour) or longer periods (like a week or a month). The following is an example of viewing the disk utilization for the last hour:

Using the --density option, a different view can be printed for every graph. This density graph shows not the time series, but how frequently the given values were seen (X-axis represents the density value):

If the terminal does not support Unicode characters, the --only-ascii option can switch them off:

The graphs are colored, with dangerously high values, for example, shown in red. The list of nodes can be filtered with the --nodes option, where you can specify node names or use wildcards if convenient.
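Putting the options above together, an invocation could look like the following. The graph names below are assumptions - check the s9s-node man page for the exact list supported by your version:

$ s9s node --stat --cluster-id=1 --graph=load --begin="08:00" --end="14:00"
$ s9s node --stat --cluster-id=1 --graph=diskutilization --density --only-ascii --nodes="10.0.0.10*"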

Process Monitoring

Another cool thing about the s9s CLI is that it provides a processlist of the entire cluster - a "top" for all nodes, with all processes merged into one view. The following command runs "top" on all database nodes for cluster ID 24, sorted by CPU consumption and updated continuously:

$ s9s process --top --cluster-id=24
Oracle 5.7 Replication - 04:39:17                                                                                                                                                      All nodes are operational.
3 hosts, 4 cores, 10.6 us,  4.2 sy, 84.6 id,  0.1 wa,  0.3 st,
GiB Mem : 5.5 total, 1.7 free, 2.6 used, 0.1 buffers, 1.1 cached
GiB Swap: 0 total, 0 used, 0 free,

PID   USER     HOST       PR  VIRT      RES    S   %CPU   %MEM COMMAND
12746 root     10.0.0.156 20  1359348    58976 S  25.25   1.56 cmon
 1587 apache   10.0.0.156 20   462572    21632 S   1.38   0.57 httpd
  390 root     10.0.0.156 20     4356      584 S   1.32   0.02 rngd
  975 mysql    10.0.0.168 20  1144260    71936 S   1.11   7.08 mysqld
16592 mysql    10.0.0.104 20  1144808    75976 S   1.11   7.48 mysqld
22983 root     10.0.0.104 20   127368     5308 S   0.92   0.52 sshd
22548 root     10.0.0.168 20   127368     5304 S   0.83   0.52 sshd
 1632 mysql    10.0.0.156 20  3578232  1803336 S   0.50  47.65 mysqld
  470 proxysql 10.0.0.156 20   167956    35300 S   0.44   0.93 proxysql
  338 root     10.0.0.104 20     4304      600 S   0.37   0.06 rngd
  351 root     10.0.0.168 20     4304      600 R   0.28   0.06 rngd
   24 root     10.0.0.156 20        0        0 S   0.19   0.00 rcu_sched
  785 root     10.0.0.156 20   454112    11092 S   0.13   0.29 httpd
   26 root     10.0.0.156 20        0        0 S   0.13   0.00 rcuos/1
   25 root     10.0.0.156 20        0        0 S   0.13   0.00 rcuos/0
22498 root     10.0.0.168 20   127368     5200 S   0.09   0.51 sshd
14538 root     10.0.0.104 20        0        0 S   0.09   0.00 kworker/0:1
22933 root     10.0.0.104 20   127368     5200 S   0.09   0.51 sshd
28295 root     10.0.0.156 20   127452     5016 S   0.06   0.13 sshd
 2238 root     10.0.0.156 20   197520    10444 S   0.06   0.28 vc-agent-007
  419 root     10.0.0.156 20    34764     1660 S   0.06   0.04 systemd-logind
    1 root     10.0.0.156 20    47628     3560 S   0.06   0.09 systemd
27992 proxysql 10.0.0.156 20    11688      872 S   0.00   0.02 proxysql_galera
28036 proxysql 10.0.0.156 20    11688      876 S   0.00   0.02 proxysql_galera

There is also a --list flag which returns a similar result without the continuous update (similar to the "ps" command):

$ s9s process --list --cluster-id=25

Job Monitoring

Jobs are tasks performed by the controller in the background, so that the client application does not need to wait until the entire job is finished. ClusterControl executes management tasks by assigning an ID for every task and lets the internal scheduler decide whether two or more jobs can be run in parallel. For example, more than one cluster deployment can be executed simultaneously, as well as other long running operations like backup and automatic upload of backups to cloud storage.

For any management operation, it would be helpful if we could monitor the progress and status of a specific job, e.g., scaling out a new slave for our MySQL replication. The following command adds a new slave, 10.0.0.77, to scale out our MySQL replication:

$ s9s cluster --add-node --nodes="10.0.0.77" --cluster-id=24
Job with ID 66992 registered.

We can then monitor job ID 66992 using the job option:

$ s9s job --log --job-id=66992
addNode: Verifying job parameters.
10.0.0.77:3306: Adding host to cluster.
10.0.0.77:3306: Testing SSH to host.
10.0.0.77:3306: Installing node.
10.0.0.77:3306: Setup new node (installSoftware = true).
10.0.0.77:3306: Setting SELinux in permissive mode.
10.0.0.77:3306: Disabling firewall.
10.0.0.77:3306: Setting vm.swappiness = 1
10.0.0.77:3306: Installing software.
10.0.0.77:3306: Setting up repositories.
10.0.0.77:3306: Installing helper packages.
10.0.0.77: Upgrading nss.
10.0.0.77: Upgrading ca-certificates.
10.0.0.77: Installing socat.
...
10.0.0.77: Installing pigz.
10.0.0.77: Installing bzip2.
10.0.0.77: Installing iproute2.
10.0.0.77: Installing tar.
10.0.0.77: Installing openssl.
10.0.0.77: Upgrading openssl openssl-libs.
10.0.0.77: Finished with helper packages.
10.0.0.77:3306: Verifying helper packages (checking if socat is installed successfully).
10.0.0.77:3306: Uninstalling existing MySQL packages.
10.0.0.77:3306: Installing replication software, vendor oracle, version 5.7.
10.0.0.77:3306: Installing software.
...

Or we can use the --wait flag and get a spinner with progress bar:

$ s9s job --wait --job-id=66992
Add Node to Cluster
- Job 66992 RUNNING    [         █] ---% Add New Node to Cluster
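
To review recently executed jobs instead of a single one, the job listing can be used (shown here with the common s9s syntax; consult the s9s-job man page for all available flags):

$ s9s job --list --cluster-id=24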

That's it for today's monitoring supplement. We hope that you’ll give the CLI a try and get value out of it. Happy clustering!

New Webinar - Free Monitoring (on Steroids) for MySQL, MariaDB, PostgreSQL and MongoDB


Monitoring is essential for operations teams to ensure that databases are up and running. However, as databases are increasingly being deployed in distributed topologies based on replication or clustering, what does it mean to our monitoring infrastructure? Is it ok to monitor individual components of a database cluster, or do we need a more holistic systems approach? Can we rely on SELECT 1 as health check when determining whether a database is up or down? Do we need high-resolution time-series charts of status counters? Are there ways to predict problems before they actually become one?

In this webinar, we will discuss how to effectively monitor distributed database clusters or replication setups. We’ll look at different types of monitoring infrastructures, from on-prem to cloud and from agent-based to agentless. Then we’ll dive into the different monitoring features available in the free ClusterControl Community Edition - from time-series charts of metrics, dashboards, and queries to performance advisors.

If you would like to centralize the monitoring of your open source databases and achieve this at zero cost, please join us on September 25!

Date, Time & Registration

Europe/MEA/APAC

Tuesday, September 25th at 09:00 BST / 10:00 CEST (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, September 25th at 09:00 Pacific Time (US) / 12:00 Eastern Time (US)

Register Now

Agenda

  • Requirements for monitoring distributed database systems
  • Cloud-based vs On-prem monitoring solutions
  • Agent-based vs Agentless monitoring
  • Deep-dive into ClusterControl Community Edition
    • Architecture
    • Metrics Collection
    • Trending
    • Dashboards
    • Queries
    • Performance Advisors
    • Other features available to Community users

Speaker

Bartlomiej Oles is a MySQL and Oracle DBA, with over 15 years experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.

We look forward to “seeing” you there!

How to Deploy a Production-Ready MySQL or MariaDB Galera Cluster using ClusterControl


Deploying a database cluster is not rocket science - there are many how-to’s on how to do that. But how do you know what you just deployed is production-ready? Manual deployments can also be tedious and repetitive. Depending on the number of nodes in the cluster, the deployment steps may be time-consuming and error-prone. Configuration management tools like Puppet, Chef and Ansible are popular in deploying infrastructure, but for stateful database clusters, you need to perform significant scripting to handle deployment of the whole database HA stack. Moreover, the chosen template/module/cookbook/role has to be meticulously tested before you can trust it as part of your infrastructure automation. Version changes require the scripts to be updated and tested again.

The good news is that ClusterControl automates deployments of the entire stack - and for free as well! We’ve deployed thousands of production clusters, and take a number of precautions to ensure they are production-ready. Different topologies are supported, from master-slave replication to Galera, NDB and InnoDB cluster, with different database proxies on top.

A high availability stack, deployed through ClusterControl, consists of three layers:

  • Database layer (e.g., Galera Cluster)
  • Reverse proxy layer (e.g., HAProxy or ProxySQL)
  • Keepalived layer, which, with use of Virtual IP, ensures high availability of the proxy layer

In this blog, we are going to show you how to deploy a production-grade Galera Cluster complete with load balancers for high availability setup. The complete setup consists of 6 hosts:

  • 1 host - ClusterControl (deployment, monitoring, management server)
  • 3 hosts - MySQL Galera Cluster
  • 2 hosts - Reverse proxies acting as load balancers in front of the cluster.

The following diagram illustrates our end result once deployment is complete:

Prerequisites

ClusterControl must reside on an independent node which is not part of the cluster. Download ClusterControl, and the page will generate a license unique for you and show the steps to install ClusterControl:

$ wget -O install-cc https://severalnines.com/scripts/install-cc
$ chmod +x install-cc
$ ./install-cc # as root or sudo user

Follow the instructions, which will guide you through setting up the MySQL server, the MySQL root password on the ClusterControl node, the cmon password for ClusterControl usage and so on. You should get the following lines once the installation has completed:

Determining network interfaces. This may take a couple of minutes. Do NOT press any key.
Public/external IP => http://{public_IP}/clustercontrol
Installation successful. If you want to uninstall ClusterControl then run install-cc --uninstall.

Then, on the ClusterControl server, generate an SSH key which we will use to setup the passwordless SSH later on. You can use any user in the system but it must have the ability to perform super-user operations (sudoer). In this example, we picked the root user:

$ whoami
root
$ ssh-keygen -t rsa

Set up passwordless SSH to all nodes that you would like to monitor/manage via ClusterControl. In this case, we will set this up on all nodes in the stack (including ClusterControl node itself). On ClusterControl node, run the following commands and specify the root password when prompted:

$ ssh-copy-id root@192.168.55.160 # clustercontrol
$ ssh-copy-id root@192.168.55.161 # galera1
$ ssh-copy-id root@192.168.55.162 # galera2
$ ssh-copy-id root@192.168.55.163 # galera3
$ ssh-copy-id root@192.168.55.181 # proxy1
$ ssh-copy-id root@192.168.55.182 # proxy2

You can then verify if it's working by running the following command on ClusterControl node:

$ ssh root@192.168.55.161 "ls /root"

Make sure you are able to see the result of the command above without being prompted for a password.
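
If you prefer, a small shell loop (a sketch using the same host list as above) verifies all of them in one go:

$ for h in 192.168.55.160 192.168.55.161 192.168.55.162 192.168.55.163 192.168.55.181 192.168.55.182; do
    ssh -o BatchMode=yes root@$h "hostname" || echo "Passwordless SSH to $h is NOT working"
  done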

Deploying the Cluster

ClusterControl supports all vendors for Galera Cluster (Codership, Percona and MariaDB). There are some minor differences which may influence your decision for choosing the vendor. If you would like to learn about the differences between them, check out our previous blog post - Galera Cluster Comparison - Codership vs Percona vs MariaDB.

For production deployment, a three-node Galera Cluster is the minimum you should have. You can always scale it out later once the cluster is deployed, manually or via ClusterControl. We’ll open our ClusterControl UI at https://192.168.55.160/clustercontrol and create the first admin user. Then, go to the top menu and click Deploy -> MySQL Galera and you will be presented with the following dialog:

There are two steps; the first one is "General & SSH Settings". Here we need to configure the SSH user that ClusterControl should use to connect to the database nodes, together with the path to the SSH key (as generated in the Prerequisites section) as well as the SSH port of the database nodes. ClusterControl presumes all database nodes are configured with the same SSH user, key and port. Next, give the cluster a name, in this case we will use "MySQL Galera Cluster 5.7". This value can be changed later on. Then select the options to instruct ClusterControl to install the required software, disable the firewall and also disable the security enhancement module on the particular Linux distribution. We recommend toggling all of these on to maximize the chance of a successful deployment.

Click Continue and you will be presented with the following dialog:

In the next step, we need to configure the database servers - vendor, version, datadir, port, etc - which are pretty self-explanatory. "Configuration Template" is the template filename under /usr/share/cmon/templates of the ClusterControl node. "Repository" is how ClusterControl should configure the repository on the database node. By default, it will use the vendor repository and install the latest version provided by the repository. However, in some cases, the user might have a pre-existing repository mirrored from the original repository due to security policy restriction. Nevertheless, ClusterControl supports most of them, as described in the user guide, under Repository.

Lastly, add the IP address or hostname (must be a valid FQDN) of the database nodes. You will see a green tick icon on the left of the node, indicating ClusterControl was able to connect to the node via passwordless SSH. You are now good to go. Click Deploy to start the deployment. This may take 15 to 20 minutes to complete. You can monitor the deployment progress under Activity (top menu) -> Jobs -> Create Cluster:

Once the deployment completed, at this point, our architecture can be illustrated as below:

Deploying the Load Balancers

In Galera Cluster, all nodes are equal - each node holds the same role and the same dataset. Therefore, there is no failover within the cluster if a node fails; only the application side requires failover, to skip the non-operational nodes while the cluster is partitioned. It's therefore highly recommended to place load balancers on top of a Galera Cluster to:

  • Unify the multiple database endpoints to a single endpoint (load balancer host or virtual IP address as the endpoint).
  • Balance the database connections between the backend database servers.
  • Perform health checks and only forward the database connections to healthy nodes.
  • Redirect/rewrite/block offending (badly written) queries before they hit the database servers.

There are three main choices of reverse proxies for Galera Cluster - HAProxy, MariaDB MaxScale or ProxySQL - all of which can be installed and configured automatically by ClusterControl. In this deployment, we picked ProxySQL because it ticks all the boxes above, plus it understands the MySQL protocol of the backend servers.

In this architecture, we want to use two ProxySQL servers to eliminate any single-point-of-failure (SPOF) to the database tier, which will be tied together using a floating virtual IP address. We’ll explain this in the next section. One node will act as the active proxy and the other one as hot-standby. Whichever node that holds the virtual IP address at a given time is the active node.

To deploy the first ProxySQL server, simply go to the cluster action menu (right-side of the summary bar) and click on Add Load Balancer -> ProxySQL -> Deploy ProxySQL and you will see the following:

Again, most of the fields are self-explanatory. In the "Database User" section, ProxySQL acts as a gateway through which your application connects to the database. The application authenticates against ProxySQL, therefore you have to add all of the users from all the backend MySQL nodes, along with their passwords, into ProxySQL. From ClusterControl, you can either create a new user to be used by the application - you can decide on its name, password, which databases it is granted access to and what MySQL privileges it will have. Such a user will be created on both the MySQL and the ProxySQL side. The second option, more suitable for existing infrastructures, is to use the existing database users. You need to pass the username and password, and such a user will be created only on ProxySQL.

In the last section, "Implicit Transactions", ClusterControl will configure ProxySQL to send all of the traffic to the master if we start transactions with SET autocommit=0. Otherwise, if you use BEGIN or START TRANSACTION to create a transaction, ClusterControl will configure read/write split in the query rules. This is to ensure ProxySQL handles transactions correctly. If you have no idea how your application does this, you can pick the latter.
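
To illustrate the two cases with a generic SQL sketch (the accounts table here is hypothetical):

-- Implicit transaction: autocommit disabled, no explicit BEGIN.
-- With "Implicit Transactions" enabled, ProxySQL routes all of this to the master.
SET autocommit=0;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Explicit transaction: read/write splitting in the query rules still applies.
START TRANSACTION;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;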

Repeat the same configuration for the second ProxySQL node, except the "Server Address" value which is 192.168.55.182. Once done, both nodes will be listed under "Nodes" tab -> ProxySQL where you can monitor and manage them directly from the UI:

At this point, our architecture is now looking like this:

If you would like to learn more about ProxySQL, do check out this tutorial - Database Load Balancing for MySQL and MariaDB with ProxySQL - Tutorial.

Deploying the Virtual IP Address

The final part is the virtual IP address. Without it, our load balancers (reverse proxies) would be the weak link as they would be a single point of failure - unless the application has the ability to automatically redirect failed database connections to another load balancer. Nevertheless, it's good practice to unify them both using a virtual IP address and simplify the connection endpoint to the database layer.

From ClusterControl UI -> Add Load Balancer -> Keepalived -> Deploy Keepalived and select the two ProxySQL hosts that we have deployed:

Also, specify the virtual IP address and the network interface to bind the IP address. The network interface must exist on both ProxySQL nodes. Once deployed, you should see the following green checks in the summary bar of the cluster:

At this point, our architecture can be illustrated as below:

Our database cluster is now ready for production usage. You can import your existing database into it or create a fresh new database. You can use the Schemas and Users Management feature if the trial license hasn't expired.

To understand how ClusterControl configures Keepalived, check out this blog post, How ClusterControl Configures Virtual IP and What to Expect During Failover.

Connecting to the Database Cluster

From the application and client standpoint, they need to connect to 192.168.55.180 on port 6033, which is the virtual IP address floating on top of the load balancers. For example, the WordPress database configuration will be something like this:

/** The name of the database for WordPress */
define( 'DB_NAME', 'wp_myblog' );

/** MySQL database username */
define( 'DB_USER', 'wp_myblog' );

/** MySQL database password */
define( 'DB_PASSWORD', 'mysecr3t' );

/** MySQL hostname - virtual IP address with ProxySQL load-balanced port*/
define( 'DB_HOST', '192.168.55.180:6033' );

If you would like to access the database cluster directly, bypassing the load balancer, you can just connect to port 3306 of the database hosts. This is usually required by the DBA staff for administration, management, and troubleshooting. With ClusterControl, most of these operations can be performed directly from the user interface.
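
A quick way to verify both paths from a client host, assuming the wp_myblog user defined above exists and remote access for the chosen user is allowed, could look like this:

$ mysql -u wp_myblog -p'mysecr3t' -h 192.168.55.180 -P 6033 -e 'SELECT @@hostname'   # via the virtual IP and ProxySQL
$ mysql -u wp_myblog -p'mysecr3t' -h 192.168.55.161 -P 3306 -e 'SELECT @@hostname'   # directly to a database node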

Final Thoughts

As shown above, deploying a database cluster is no longer a difficult task. Once deployed, there is a full suite of free monitoring features as well as commercial features for backup management, failover/recovery and others. Fast deployment of different types of cluster/replication topologies can be useful when evaluating high availability database solutions, and how they fit to your particular environment.


Become a ClusterControl DBA: Performance and Health Monitoring


In the previous two blog posts we covered both deploying the four types of clustering/replication (MySQL/Galera, MySQL Replication, MongoDB & PostgreSQL) and managing/monitoring your existing databases and clusters. So, after reading the first two blog posts, you were able to add your 20 existing replication setups to ClusterControl, expand them, and additionally deploy two new Galera clusters while doing a ton of other things. Or maybe you deployed MongoDB and/or PostgreSQL systems. So now, how do you keep them healthy?

That’s exactly what this blog post is about: how to leverage ClusterControl performance monitoring and advisors functionality to keep your MySQL, MongoDB and/or PostgreSQL databases and clusters healthy. So how is this done in ClusterControl?

Database Cluster List

The most important information can already be found in the cluster list: as long as there are no alarms and no hosts are shown to be down, everything is functioning fine. An alarm is raised if a certain condition is met, e.g. a host is swapping, and brings the issue you should investigate to your attention. That means alarms are not only raised during an outage, but also allow you to proactively manage your databases.

Suppose you log into ClusterControl and see a cluster listing like this - you will definitely have something to investigate: one node is down in the Galera cluster, for example, and every cluster has various alarms:

Once you click on one of the alarms, you will go to a detailed page on all alarms of the cluster. The alarm details will explain the issue and in most cases also advise the action to resolve the issue.

You can set up your own alarms by creating custom expressions, but that has been deprecated in favor of our new Developer Studio, which allows you to write custom JavaScript and execute it as Advisors. We will get back to this topic later in this post.

Cluster Overview - Dashboards

When opening up the cluster overview, we can immediately see the most important performance metrics for the cluster in the tabs. This overview may differ per cluster type as, for instance, Galera has different performance metrics to watch than traditional MySQL, PostgreSQL or MongoDB.

Both the default overview and the pre-selected tabs are customizable. By clicking on Overview -> Dash Settings you are given a dialogue that allows you to define the dashboard:

By pressing the plus sign you can add and define your own metrics to graph the dashboard. In our case we will define a new dashboard featuring the Galera specific send and receive queue average:

This new dashboard should give us good insight in the average queue length of our Galera cluster.

Once you have pressed save, the new dashboard will become available for this cluster:

Similarly you can do this for PostgreSQL as well, for example we can monitor the shared blocks hit versus blocks read:

So as you can see, it is relatively easy to customize your own (default) dashboard.

Cluster Overview - Query Monitor

The Query Monitor tab is available for both MySQL and PostgreSQL based setups and consists of three dashboards: Top Queries, Running Queries and Query Outliers.

In the Running Queries dashboard, you will find all queries that are currently running. This is basically the equivalent of the SHOW FULL PROCESSLIST statement in MySQL.

Top Queries and Query Outliers both rely on the input of the slow query log or Performance Schema. Using Performance Schema is always recommended and will be used automatically if enabled. Otherwise, ClusterControl will use the MySQL slow query log to capture the running queries. To prevent ClusterControl from being too intrusive and the slow query log from growing too large, ClusterControl samples the slow query log by turning it on and off. By default, this loop captures for 1 second and long_query_time is set to 0.5 seconds. If you wish to change these settings for your cluster, you can do so via Settings -> Query Monitor.
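
For reference, the sampling ClusterControl performs is roughly equivalent to toggling the standard MySQL slow log variables yourself - a sketch only, since ClusterControl manages this for you:

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;
-- capture for a short interval, then switch it off again
SET GLOBAL slow_query_log = 0;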

Top Queries will, like the name says, show the top queries that were sampled. You can sort them on various columns: for instance the frequency, average execution time, total execution time or standard deviation time:

You can get more details about a query by selecting it; this will present the query execution plan (if available) and optimization hints/advisories. Query Outliers is similar to Top Queries, but also allows you to filter the queries per host and compare them over time.


Cluster Overview - Operations

Similar to the PostgreSQL and MySQL systems, MongoDB clusters have an Operations overview, which is comparable to MySQL's Running Queries. This overview is the equivalent of issuing the db.currentOp() command within MongoDB.

Cluster Overview - Performance

MySQL/Galera

The performance tab is probably the best place to find the overall performance and health of your clusters. For MySQL and Galera it consists of an Overview page, the Advisors, status/variables overviews, the Schema Analyzer and the Transaction log.

The Overview page will give you a graph overview of the most important metrics in your cluster. This is, obviously, different per cluster type. Eight metrics have been set by default, but you can easily set your own - up to 20 graphs if needed:

Advisors are one of the key features of ClusterControl: they are scripted checks that can be run on demand. Advisors can evaluate almost any fact known about the host and/or cluster, give their opinion on its health, and even give advice on how to resolve issues or improve your hosts!

The best part is yet to come: you can create your own checks in the Developer Studio (ClusterControl -> Manage -> Developer Studio), run them on a regular interval and use them again in the Advisors section. We blogged about this new feature earlier this year.

We will skip the status/variables overview of MySQL and Galera as this is useful for reference but not for this blog post: it is good enough that you know it is here.

Now suppose your database is growing but you want to know how fast it grew in the past week. You can actually keep track of the growth of both data and index sizes from right within ClusterControl:

And next to the total growth on disk it can also report back the top 25 largest schemas.

Another important feature is the Schema Analyzer within ClusterControl:

ClusterControl will analyze your schemas and look for redundant indexes, MyISAM tables and tables without a primary key. Of course it is entirely up to you to keep a table without a primary key because some application might have created it this way, but at least it is great to get the advice here for free. The Schema Analyzer even recommends the necessary ALTER statement to fix the problem.
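
The generated advice typically boils down to simple DDL statements; the table names below are hypothetical, shown only to illustrate the kind of suggestions you can expect:

ALTER TABLE myapp.user_sessions ADD PRIMARY KEY (id);   -- table without a primary key
ALTER TABLE myapp.audit_log ENGINE=InnoDB;              -- MyISAM table converted to InnoDB
ALTER TABLE myapp.customers DROP INDEX idx_email_dup;   -- redundant index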

PostgreSQL

For PostgreSQL the Advisors, DB Status and DB Variables can be found here:

MongoDB

For MongoDB the Mongo Stats and performance overview can be found under the Performance tab. The Mongo Stats is an overview of the output of mongostat and the Performance overview gives a good graphical overview of the MongoDB opcounters:

Final Thoughts

We showed you how to keep your eyeballs on the most important monitoring and health checking features of ClusterControl. Obviously this is only the beginning of the journey, as we will soon start another blog series about the Developer Studio capabilities and how you can make the most of your own checks. Also keep in mind that our support for MongoDB and PostgreSQL is not as extensive as our MySQL toolset, but we are continuously improving on this.

You may ask yourself why we have skipped over the performance monitoring and health checks of HAProxy, ProxySQL and MaxScale. We did that deliberately as the blog series covered only deployments of clusters up till now and not the deployment of HA components. So that’s the subject we'll cover next time.

Become a ClusterControl DBA: Operational Reports for MySQL, MariaDB, PostgreSQL & MongoDB


The majority of DBAs perform health checks on their databases every now and then. Usually, it would happen on a daily or weekly basis. We previously discussed why such checks are important and what they should include.

To make sure your systems are in a good shape, you’d need to go through quite a lot of information - host statistics, MySQL statistics, workload statistics, state of backups, database packages, logs and so forth. Such data should be available in every properly monitored environment, although sometimes it is scattered across multiple locations - you may have one tool to monitor MySQL state, another tool to collect system statistics, maybe a set of scripts, e.g., to check the state of your backups. This makes health checks much more time-consuming than they should be - the DBA has to put together the different pieces to understand the state of the system.

Integrated tools like ClusterControl have an advantage that all of the bits are located in the same place (or in the same application). It still does not mean they are located next to each other - they may be located in different sections of the UI and a DBA may have to spend some time clicking through the UI to reach all the interesting data.

The whole idea behind creating Operational Reports is to put all of the most important data into a single document, which can be quickly reviewed to get an understanding of the state of the databases.

Operational Reports are available from the menu Side Menu -> Operational Reports:

Once you go there, you’ll be presented with a list of reports created manually or automatically, based on a predefined schedule:

If you want to create a new report manually, you’ll use the 'Create' option. Pick the type of report, cluster name (for per-cluster report), email recipients (optional - if you want the report to be delivered to you), and you’re pretty much done:

The reports can also be scheduled to be created on a regular basis:

At this time, 5 types of reports are available:

  • Availability report - All clusters.
  • Backup report - All clusters.
  • Schema change report - MySQL/MariaDB-based cluster only.
  • Daily system report - Per cluster.
  • Package upgrade report - Per cluster.

Availability Report

The availability report focuses on, well, availability. It includes three sections. First, the availability summary:

You can see information about availability statistics of your databases, the cluster type, total uptime and downtime, current state of the cluster and when that state last changed.

Another section gives more details on availability for every cluster. The screenshot below only shows one of the database clusters:

We can see when a node switched state and what the transition was. It’s a nice place to check if there were any recent problems with the cluster. Similar data is shown in the third section of this report, where you can go through the history of changes in cluster state.

Backup Report

The second type of the report is one covering backups of all clusters. It contains two sections - backup summary and backup details, where the former basically gives you a short summary of when the last backup was created, if it completed successfully or failed, backup verification status, success rate and retention period:

ClusterControl also provides examples of backup policy if it finds any of the monitored database clusters running without a scheduled backup or a delayed slave configured. Next are the backup details:

You can also check the list of backups executed on the cluster with their state, type and size within the specified interval. This is as close as you can get to being certain that backups work correctly without running a full recovery test. We definitely recommend that such tests are performed every now and then. The good news is that ClusterControl supports MySQL-based restoration and verification on a standalone host under Backup -> Restore Backup.

Daily System Report

This type of report contains detailed information about a particular cluster. It starts with a summary of different alerts which are related to the cluster:

The next section is about the state of the nodes that are part of the cluster:

You have a list of the nodes in the cluster, their type, role (master or slave), status of the node, uptime and the OS.

Another section of the report is the backup summary, same as we discussed above. Next one presents a summary of top queries in the cluster:

Finally, we see a “Node status overview” in which you’ll be presented with graphs related to OS and MySQL metrics for each node.

As you can see, we have graphs covering all aspects of the load on the host - CPU, memory, network, disk, CPU load and disk free. This is enough to get an idea of whether anything weird happened recently. You can also see some details about the MySQL workload - how many queries were executed, which types of queries, and how the data was accessed (via which handler). This, in turn, should be enough to pick up most of the issues on the MySQL side. What you want to look for are any spikes and dips that you haven't seen in the past. Maybe a new query has been added to the mix and, as a result, handler_read_rnd_next skyrocketed? Maybe there was an increase in CPU load, or a high number of connections pointing to increased load on MySQL, but also to some kind of contention? An unexpected pattern is worth investigating, so you know what is going on.

Package Upgrade Report

This report gives a summary of packages available for upgrade by the repository manager on the monitored hosts. For accurate reporting, ensure you always use stable and trusted repositories on every host. On some undesirable occasions, the monitored hosts could be configured with an outdated repository after an upgrade (e.g., every MariaDB major version uses a different repository), an incomplete internal repository (e.g., partially mirrored from the upstream) or a bleeding-edge repository (commonly for unstable nightly-build packages).

The first section is the upgrade summary:

It summarizes the total number of packages available for upgrade as well as the related managed service for the cluster like load balancer, virtual IP address and arbitrator. Next, ClusterControl provides a detailed package list, grouped by package type for every host:

This report provides the available versions and can greatly help us plan our maintenance windows efficiently. Critical upgrades like security and database packages can be prioritized over non-critical upgrades, which can be consolidated with other, lower-priority maintenance windows.

Schema Change Report

This report compares changes in table structure of the selected MySQL/MariaDB databases between two generated reports. In older MySQL/MariaDB versions, DDL is a non-atomic operation (pre 8.0) and requires a full table copy (pre 5.6 for most operations) - blocking other transactions until it completes. Schema changes can become a huge pain once your tables hold a significant amount of data, and must be carefully planned, especially in a clustered setup. In multi-tiered development environments, we have seen many cases where developers silently modify the table structure, resulting in a significant impact on query performance.

In order for ClusterControl to produce an accurate report, special options must be configured inside CMON configuration file for the respective cluster:

  • schema_change_detection_address - Checks will be executed using SHOW TABLES/SHOW CREATE TABLE to determine if the schema has changed. The checks are executed on the address specified, which is of the format HOSTNAME:PORT. The schema_change_detection_databases option must also be set. A differential of a changed table is created (using diff).
  • schema_change_detection_databases - Comma separated list of databases to monitor for schema changes. If empty, no checks are made.

In this example, we would like to monitor schema changes for database "myapp" and "sbtest" on our MariaDB Cluster with cluster ID 27. Pick one of the database nodes as the value of schema_change_detection_address. For MySQL replication, this should be the master host, or any slave host that holds the databases (in case partial replication is active). Then, inside /etc/cmon.d/cmon_27.cnf, add the two following lines:

schema_change_detection_address=10.0.0.30:3306
schema_change_detection_databases=myapp,sbtest

Restart CMON service to load the change:

$ systemctl restart cmon

For the very first report, ClusterControl only returns the result of the metadata collection, similar to below:

With the first report as the baseline, the subsequent reports will return the output we are expecting:

Take note that only new or changed tables are printed in the report. The first report is only for metadata collection, used for comparison in the subsequent rounds, so the report has to be generated at least twice before you see any difference.

With this report, you can now gather the database structure footprints and understand how your database has evolved over time.

Final Thoughts

Operational reports are a comprehensive way to understand the state of your database infrastructure. They are built for both operational and managerial staff, and can be very useful in analysing your database operations. The reports can be generated in-place or delivered to you via email, which makes things convenient if you have a reporting silo.

We’d love to hear your feedback on anything else you’d like to have included in the report, what’s missing and what is not needed.

MySQL on Docker: Running a MariaDB Galera Cluster without Container Orchestration Tools - Part 1


Container orchestration tools simplify the running of a distributed system, by deploying and redeploying containers and handling any failures that occur. One might need to move applications around, e.g., to handle updates, scaling, or underlying host failures. While this sounds great, it does not always work well with a strongly consistent database cluster like Galera. You can’t just move database nodes around, they are not stateless applications. Also, the order in which you perform operations on a cluster has high significance. For instance, restarting a Galera cluster has to start from the most advanced node, or else you will lose data. Therefore, we’ll show you how to run Galera Cluster on Docker without a container orchestration tool, so you have total control.

In this blog post, we are going to look into how to run a MariaDB Galera Cluster on Docker containers using the standard Docker image on multiple Docker hosts, without the help of orchestration tools like Swarm or Kubernetes. This approach is similar to running a Galera Cluster on standard hosts, but the process management is configured through Docker.

Before we jump further into details, we assume you have installed Docker, disabled SElinux/AppArmor and cleared up the rules inside iptables, firewalld or ufw (whichever you are using). The following are three dedicated Docker hosts for our database cluster:

  • host1.local - 192.168.55.161
  • host2.local - 192.168.55.162
  • host3.local - 192.168.55.163

Multi-host Networking

First of all, the default Docker networking is bound to the local host. Docker Swarm introduces another networking layer called overlay network, which extends the container internetworking to multiple Docker hosts in a cluster called Swarm. Long before this integration came into place, there were many network plugins developed to support this - Flannel, Calico, Weave are some of them.

Here, we are going to use Weave as the Docker network plugin for multi-host networking. This is mainly due to its simplicity to install and run, and its support for a DNS resolver (containers running on this network can resolve each other's hostname). There are two ways to get Weave running - systemd or through Docker. We are going to install it as a systemd unit, so it's independent from the Docker daemon (otherwise, we would have to start Docker first before Weave gets activated).

  1. Download and install Weave:

    $ curl -L git.io/weave -o /usr/local/bin/weave
    $ chmod a+x /usr/local/bin/weave
  2. Create a systemd unit file for Weave:

    $ cat > /etc/systemd/system/weave.service << EOF
    [Unit]
    Description=Weave Network
    Documentation=http://docs.weave.works/weave/latest_release/
    Requires=docker.service
    After=docker.service
    [Service]
    EnvironmentFile=-/etc/sysconfig/weave
    ExecStartPre=/usr/local/bin/weave launch --no-restart $PEERS
    ExecStart=/usr/bin/docker attach weave
    ExecStop=/usr/local/bin/weave stop
    [Install]
    WantedBy=multi-user.target
    EOF
  3. Define IP addresses or hostname of the peers inside /etc/sysconfig/weave:

    $ echo 'PEERS="192.168.55.161 192.168.55.162 192.168.55.163"'> /etc/sysconfig/weave
  4. Start and enable Weave on boot:

    $ systemctl start weave
    $ systemctl enable weave

Repeat the above 4 steps on all Docker hosts. Verify with the following command once done:

$ weave status

The number of peers is what we are after. It should be 3:

          ...
          Peers: 3 (with 6 established connections)
          ...
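
If the count is off, the per-host breakdowns from the standard Weave CLI subcommands help spot the peer that failed to join:

$ weave status peers
$ weave status connections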

Running a Galera Cluster

Now that the network is ready, it's time to fire up our database containers and form a cluster. The basic rules are:

  • Containers must be created with --net=weave to have multi-host connectivity.
  • Container ports that need to be published are 3306, 4444, 4567, 4568.
  • The Docker image must support Galera. If you'd like to use Oracle MySQL, then get the Codership version. If you'd like Percona's, use this image instead. In this blog post, we are using MariaDB's.

The reasons we chose MariaDB as the Galera cluster vendor are:

  • Galera is embedded into MariaDB, starting from MariaDB 10.1.
  • The MariaDB image is maintained by the Docker and MariaDB teams.
  • One of the most popular Docker images out there.

Bootstrapping a Galera Cluster has to be performed in sequence. Firstly, the most up-to-date node must be started with "wsrep_cluster_address=gcomm://". Then, start the remaining nodes with a full address consisting of all nodes in the cluster, e.g, "wsrep_cluster_address=gcomm://node1,node2,node3". To accomplish this using containers, we have to take some extra steps to ensure all containers are running homogeneously. So the plan is:

  1. We would need to start with 4 containers in this order - mariadb0 (bootstrap), mariadb2, mariadb3, mariadb1.
  2. Container mariadb0 will be using the same datadir and configdir with mariadb1.
  3. Use mariadb0 on host1 for the first bootstrap, then start mariadb2 on host2, mariadb3 on host3.
  4. Remove mariadb0 on host1 to give way for mariadb1.
  5. Lastly, start mariadb1 on host1.

At the end of the day, you would have a three-node Galera Cluster (mariadb1, mariadb2, mariadb3). The first container (mariadb0) is a transient container for bootstrapping purposes only, using cluster address "gcomm://". It shares the same datadir and configdir with mariadb1 and will be removed once the cluster is formed (mariadb2 and mariadb3 are up) and nodes are synced.

By default, Galera is turned off in MariaDB and needs to be enabled with a flag called wsrep_on (set to ON) and wsrep_provider (set to the Galera library path) plus a number of Galera-related parameters. Thus, we need to define a custom configuration file for the container to configure Galera correctly.

Let's start with the first container, mariadb0. Create a file under /containers/mariadb1/conf.d/my.cnf and add the following lines:

$ mkdir -p /containers/mariadb1/conf.d
$ cat /containers/mariadb1/conf.d/my.cnf
[mysqld]

default_storage_engine          = InnoDB
binlog_format                   = ROW

innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_lock_schedule_algorithm  = FCFS # MariaDB >10.1.19 and >10.2.3 only

wsrep_on                        = ON
wsrep_provider                  = /usr/lib/galera/libgalera_smm.so
wsrep_sst_method                = xtrabackup-v2

Since the image doesn't come with MariaDB Backup (which is the preferred SST method for MariaDB 10.1 and MariaDB 10.2), we are going to stick with xtrabackup-v2 for the time being.

To perform the first bootstrap for the cluster, run the bootstrap container (mariadb0) on host1 with mariadb1's "datadir" and "conf.d":

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --env MYSQL_USER=proxysql \
        --env MYSQL_PASSWORD=proxysqlpassword \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

The parameters used in the above command are:

  • --name, creates the container named "mariadb0",
  • --hostname, assigns the container a hostname "mariadb0.weave.local",
  • --net, places the container in the weave network for multi-host networking support,
  • --publish, exposes ports 3306, 4444, 4567, 4568 on the container to the host,
  • $(weave dns-args), configures DNS resolver for this container. This command can be translated into Docker run as "--dns=172.17.0.1 --dns-search=weave.local.",
  • --env MYSQL_ROOT_PASSWORD, the MySQL root password,
  • --env MYSQL_USER, creates "proxysql" user to be used later with ProxySQL for database routing,
  • --env MYSQL_PASSWORD, the "proxysql" user password,
  • --volume /containers/mariadb1/datadir:/var/lib/mysql, creates /containers/mariadb1/datadir if it does not exist and maps it to /var/lib/mysql (the MySQL datadir) of the container (for the bootstrap node, this could be skipped),
  • --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d, mounts the files under directory /containers/mariadb1/conf.d of the Docker host, into the container at /etc/mysql/mariadb.conf.d.
  • mariadb:10.2.15, uses the MariaDB 10.2.15 image from Docker Hub,
  • --wsrep_cluster_address, Galera connection string for the cluster. "gcomm://" means bootstrap. For the rest of the containers, we are going to use a full address instead.
  • --wsrep_sst_auth, authentication string for SST user. Use the same user as root,
  • --wsrep_node_address, the node hostname, in this case we are going to use the FQDN provided by Weave.

The bootstrap container contains several key things:

  • The name, hostname and wsrep_node_address are mariadb0, but it uses the volumes of mariadb1.
  • The cluster address is "gcomm://"
  • There are two additional --env parameters - MYSQL_USER and MYSQL_PASSWORD. These parameters create an additional user for our ProxySQL monitoring purposes.

Verify with the following command:

$ docker ps
$ docker logs -f mariadb0

Once you see the following line, it indicates the bootstrap process is completed and Galera is active:

2018-05-30 23:19:30 139816524539648 [Note] WSREP: Synchronized with group, ready for connections
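
Optionally, confirm that the bootstrap node reports a Primary cluster of size 1 before starting the other nodes (the same approach as the cluster-wide check we do later):

$ docker exec -it mariadb0 mysql -uroot "-pPM7%cB43$sd@^1" -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%'"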

Create the directory to load our custom configuration file in the remaining hosts:

$ mkdir -p /containers/mariadb2/conf.d # on host2
$ mkdir -p /containers/mariadb3/conf.d # on host3

Then, copy the my.cnf that we've created for mariadb0 and mariadb1 to mariadb2 and mariadb3 respectively:

$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb2/conf.d/ # on host1
$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb3/conf.d/ # on host1

Next, create another 2 database containers (mariadb2 and mariadb3) on host2 and host3 respectively:

$ docker run -d \
        --name ${NAME} \
        --hostname ${NAME}.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/${NAME}/datadir:/var/lib/mysql \
        --volume /containers/${NAME}/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=${NAME}.weave.local

** Replace ${NAME} with mariadb2 or mariadb3 respectively.

However, there is a catch. The entrypoint script checks the mysqld service in the background after database initialization, using the MySQL root user without a password. Since Galera automatically performs synchronization through SST or IST when starting up, the MySQL root user password will change, mirroring the bootstrapped node. Thus, you will see the following error during the first startup:

2018-05-30 23:27:13 140003794790144 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
MySQL init process in progress…
MySQL init process failed.

The trick is to restart the failed containers once more, because this time, the MySQL datadir would have been created (in the first run attempt) and it would skip the database initialization part:

$ docker start mariadb2 # on host2
$ docker start mariadb3 # on host3

Once started, verify by looking at the following line:

$ docker logs -f mariadb2
…
2018-05-30 23:28:39 139808069601024 [Note] WSREP: Synchronized with group, ready for connections

At this point, there are 3 containers running: mariadb0, mariadb2 and mariadb3. Take note that mariadb0 is started using the bootstrap command (gcomm://), which means if the container is automatically restarted by Docker in the future, it could potentially become disjointed from the primary component. Thus, we need to remove this container and replace it with mariadb1, using the same Galera connection string as the rest and the same datadir and configdir as mariadb0.

First, stop mariadb0 by sending SIGTERM (to ensure the node is shut down gracefully):

$ docker kill -s 15 mariadb0

Then, start mariadb1 on host1 using similar command as mariadb2 or mariadb3:

$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

This time, you don't need to do the restart trick because the MySQL datadir already exists (created by mariadb0). Once the container is started, verify that the cluster size is 3, the cluster status is Primary and the local state is Synced:

$ docker exec -it mariadb3 mysql -uroot "-pPM7%cB43$sd@^1" -e 'select variable_name, variable_value from information_schema.global_status where variable_name in ("wsrep_cluster_size", "wsrep_local_state_comment", "wsrep_cluster_status", "wsrep_incoming_addresses")'
+---------------------------+-------------------------------------------------------------------------------+
| variable_name             | variable_value                                                                |
+---------------------------+-------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 3                                                                             |
| WSREP_CLUSTER_STATUS      | Primary                                                                       |
| WSREP_INCOMING_ADDRESSES  | mariadb1.weave.local:3306,mariadb3.weave.local:3306,mariadb2.weave.local:3306 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                                        |
+---------------------------+-------------------------------------------------------------------------------+

At this point, our architecture is looking something like this:

Although the run command is pretty long, it describes the container's characteristics well. It's probably a good idea to wrap the command in a script to simplify the execution steps, or use a compose file instead.
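
As a sketch of such a wrapper (reusing the exact options from this post; adjust paths, image tag and credentials to your environment), something like this could start mariadb2 or mariadb3 with a single argument, e.g. "./start_galera_node.sh mariadb2" on host2:

#!/bin/bash
# start_galera_node.sh <container_name> - a minimal sketch, not a complete production script
NAME=$1
CLUSTER="gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local"

docker run -d \
        --name ${NAME} \
        --hostname ${NAME}.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/${NAME}/datadir:/var/lib/mysql \
        --volume /containers/${NAME}/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=${CLUSTER} \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=${NAME}.weave.local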

Database Routing with ProxySQL

Now we have three database containers running. The only way to access the cluster now is via each Docker host's published MySQL port, 3306 (mapped to port 3306 of the container). So what happens if one of the database containers fails? You have to manually fail over the client's connections to the next available node. Depending on the application connector, you could also specify a list of nodes and let the connector do the failover and query routing for you (Connector/J, PHP mysqlnd). Otherwise, it would be a good idea to unify the database resources into a single resource that can be called a service.

This is where ProxySQL comes into the picture. ProxySQL can act as the query router, load balancing the database connections similar to what a "Service" in the Swarm or Kubernetes world does. We have built a ProxySQL Docker image for this purpose, and will maintain the image for every new version on a best-effort basis.

Before we run the ProxySQL container, we have to prepare the configuration file. The following is what we have configured for proxysql1. We create a custom configuration file under /containers/proxysql1/proxysql.cnf on host1:

$ cat /containers/proxysql1/proxysql.cnf
datadir="/var/lib/proxysql"
admin_variables=
{
        admin_credentials="admin:admin"
        mysql_ifaces="0.0.0.0:6032"
        refresh_interval=2000
}
mysql_variables=
{
        threads=4
        max_connections=2048
        default_query_delay=0
        default_query_timeout=36000000
        have_compress=true
        poll_timeout=2000
        interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
        default_schema="information_schema"
        stacksize=1048576
        server_version="5.1.30"
        connect_timeout_server=10000
        monitor_history=60000
        monitor_connect_interval=200000
        monitor_ping_interval=200000
        ping_interval_server=10000
        ping_timeout_server=200
        commands_stats=true
        sessions_sort=true
        monitor_username="proxysql"
        monitor_password="proxysqlpassword"
}
mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)
mysql_users =
(
        { username = "sbtest" , password = "password" , default_hostgroup = 10 , active = 1 }
)
mysql_query_rules =
(
        {
                rule_id=100
                active=1
                match_pattern="^SELECT .* FOR UPDATE"
                destination_hostgroup=10
                apply=1
        },
        {
                rule_id=200
                active=1
                match_pattern="^SELECT .*"
                destination_hostgroup=20
                apply=1
        },
        {
                rule_id=300
                active=1
                match_pattern=".*"
                destination_hostgroup=10
                apply=1
        }
)
scheduler =
(
        {
                id = 1
                filename = "/usr/share/proxysql/tools/proxysql_galera_checker.sh"
                active = 1
                interval_ms = 2000
                arg1 = "10"
                arg2 = "20"
                arg3 = "1"
                arg4 = "1"
                arg5 = "/var/lib/proxysql/proxysql_galera_checker.log"
        }
)

The above configuration will:

  • configure two host groups, the single-writer and multi-writer group, as defined under "mysql_servers" section,
  • send reads to all Galera nodes (hostgroup 20) while write operations will go to a single Galera server (hostgroup 10),
  • schedule the proxysql_galera_checker.sh,
  • use monitor_username and monitor_password as the monitoring credentials created when we first bootstrapped the cluster (mariadb0), as shown in the example right after this list.
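
If the monitoring user has not been created on the cluster yet, a minimal sketch to add it on one of the Galera nodes would be (the container name, root password placeholder and minimal privileges are assumptions, adjust them to your environment):

$ docker exec -it mariadb1 mysql -uroot -p'<root_password>' \
        -e "CREATE USER IF NOT EXISTS 'proxysql'@'%' IDENTIFIED BY 'proxysqlpassword'; GRANT USAGE ON *.* TO 'proxysql'@'%';"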

Copy the configuration file to host2, for ProxySQL redundancy:

$ mkdir -p /containers/proxysql2/ # on host2
$ scp /containers/proxysql1/proxysql.cnf host2:/containers/proxysql2/ # on host1

Then, run the ProxySQL containers on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --publish 6033 \
        --publish 6032 \
        --restart always \
        --net=weave \
        $(weave dns-args) \
        --hostname ${NAME}.weave.local \
        -v /containers/${NAME}/proxysql.cnf:/etc/proxysql.cnf \
        -v /containers/${NAME}/data:/var/lib/proxysql \
        severalnines/proxysql

** Replace ${NAME} with proxysql1 or proxysql2 respectively.

We specified --restart always to make the container always available regardless of the exit status, as well as to start it automatically when the Docker daemon starts. This makes sure the ProxySQL containers act like a daemon.

Verify the MySQL servers status monitored by both ProxySQL instances (OFFLINE_SOFT is expected for the single-writer host group):

$ docker exec -it proxysql1 mysql -uadmin -padmin -h127.0.0.1 -P6032 -e 'select hostgroup_id,hostname,status from mysql_servers'
+--------------+----------------------+--------------+
| hostgroup_id | hostname             | status       |
+--------------+----------------------+--------------+
| 10           | mariadb1.weave.local | ONLINE       |
| 10           | mariadb2.weave.local | OFFLINE_SOFT |
| 10           | mariadb3.weave.local | OFFLINE_SOFT |
| 20           | mariadb1.weave.local | ONLINE       |
| 20           | mariadb2.weave.local | ONLINE       |
| 20           | mariadb3.weave.local | ONLINE       |
+--------------+----------------------+--------------+

At this point, our architecture is looking something like this:

All connections coming in on port 6033 (either from host1, host2 or the container network) will be load balanced to the backend database containers using ProxySQL. If you would like to access an individual database server, use port 3306 of the physical host instead. There is no virtual IP address as a single endpoint configured for the ProxySQL service yet, but we could have that by using Keepalived, which is explained in the next section.
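
A quick way to verify the load balancing from any Docker host is to send a few test queries through port 6033 using the application user defined in the ProxySQL configuration (a sketch; the IP address is host1's address from our example setup):

$ for i in 1 2 3 4; do mysql -usbtest -ppassword -h192.168.55.161 -P6033 -Ne 'SELECT @@hostname'; done

Repeating the SELECT a few times should show the reads being distributed across the nodes in the reader hostgroup.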

Virtual IP Address with Keepalived

Since we configured ProxySQL containers to be running on host1 and host2, we are going to use Keepalived containers to tie these hosts together and provide virtual IP address via the host network. This allows a single endpoint for applications or clients to connect to the load balancing layer backed by ProxySQL.

As usual, create a custom configuration file for our Keepalived service. Here is the content of /containers/keepalived1/keepalived.conf:

vrrp_instance VI_DOCKER {
   interface ens33               # interface to monitor
   state MASTER
   virtual_router_id 52          # Assign one ID for this route
   priority 101
   unicast_src_ip 192.168.55.161
   unicast_peer {
      192.168.55.162
   }
   virtual_ipaddress {
      192.168.55.160             # the virtual IP
   }
}

Copy the configuration file to host2 for the second instance:

$ mkdir -p /containers/keepalived2/ # on host2
$ scp /containers/keepalived1/keepalived.conf host2:/containers/keepalived2/ # on host1

Change the priority from 101 to 100 inside the copied configuration file on host2:

$ sed -i 's/101/100/g' /containers/keepalived2/keepalived.conf

**The higher priority instance will hold the virtual IP address (in this case, host1) until the VRRP communication is interrupted (for example, when host1 goes down).

Then, run the following command on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --cap-add=NET_ADMIN \
        --net=host \
        --restart=always \
        --volume /containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf \
        osixia/keepalived:1.4.4

** Replace ${NAME} with keepalived1 and keepalived2.

The run command tells Docker to:

  • --name, create a container named after the ${NAME} variable,
  • --cap-add=NET_ADMIN, add Linux capabilities for network admin scope,
  • --net=host, attach the container to the host network. This allows the virtual IP address to be brought up on the host interface, ens33,
  • --restart=always, always keep the container running,
  • --volume=/containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf, map the custom configuration file into the container.

After both containers are started, verify the virtual IP address existence by looking at the physical network interface of the MASTER node:

$ ip a | grep ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.55.161/24 brd 192.168.55.255 scope global ens33
    inet 192.168.55.160/32 scope global ens33

The clients and applications may now use the virtual IP address, 192.168.55.160, to access the database service. This virtual IP address exists on host1 at the moment. If host1 goes down, keepalived2 will take over the IP address and bring it up on host2. Take note that this Keepalived configuration does not monitor the ProxySQL containers; it only monitors the VRRP advertisements of the Keepalived peers.
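
To test the failover behaviour, you can simply stop the Keepalived container on host1 and watch the virtual IP address move over to host2 (a sketch):

$ docker stop keepalived1     # on host1
$ ip a | grep ens33           # on host2, the virtual IP 192.168.55.160 should now appear here

To fail back, start keepalived1 again and the higher-priority instance on host1 should reclaim the virtual IP address.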

At this point, our architecture is looking something like this:

Summary

So, now we have a MariaDB Galera Cluster fronted by a highly available ProxySQL service, all running on Docker containers.

In part two, we are going to look into how to manage this setup. We’ll look at how to perform operations like graceful shutdown, bootstrapping, detecting the most advanced node, failover, recovery, scaling up/down, upgrades, backup and so on. We will also discuss the pros and cons of having this setup for our clustered database service.

Happy containerizing!

How to Monitor MySQL or MariaDB Galera Cluster with Prometheus Using SCUMM

ClusterControl version 1.7 introduced a new way to watch your database clusters. The new agent-based approach was designed for demanding, high-resolution monitoring. ClusterControl's agentless, secure SSH-based remote stats collection has been extended with modern high-frequency monitoring based on time series data.

SCUMM is the new monitoring and trending system in ClusterControl.

It includes built-in active scraping and storing of metrics based on time series data. The new version of ClusterControl 1.7 uses Prometheus exporters to gather more data from monitored hosts and services.

In this blog, we will see how to use SCUMM to monitor multi-master Percona XtraDB Cluster and MariaDB Galera Cluster instances. We will also see what metrics are available, and which ones are tracked with the built-in dashboards.

What is SCUMM?

To be able to understand the new functionality and recent changes in ClusterControl, let’s take a quick look first at the new monitoring architecture - SCUMM.

SCUMM (Severalnines ClusterControl Unified Monitoring & Management) is a new agent-based solution with agents installed on the database nodes. It consists of two core elements:

  • Prometheus server which is a time series database to collect the data.
  • Exporters which export metrics from services like Galera cluster WSREP API.

More details can be found in our introduction blog for SCUMM.

How to use SCUMM for Galera Cluster?

The main requirement is to upgrade your current ClusterControl installation to, or install, at least version 1.7. You can find the upgrade and installation procedure in the online documentation. You will then see a new tab called Dashboards. By default, the new Dashboards are disabled. Until they are enabled, ClusterControl monitoring remains agentless, based on secure SSH.

ClusterControl takes care of installing and maintaining Prometheus as well as exporters on the monitored hosts. The installation process is automated.
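
If you want to double-check that the exporters are responding on a monitored host, you can query their metrics endpoints directly (a sketch; 9100 and 9104 are the common default ports for node_exporter and mysqld_exporter and may differ in your installation):

$ curl -s http://<db_host>:9100/metrics | grep node_load1
$ curl -s http://<db_host>:9104/metrics | grep mysql_global_status_wsrep_cluster_size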

ClusterControl: Enable Agent-Based Monitoring

After choosing Enable Agent-Based Monitoring, you will see the following.

ClusterControl: Enable Agent-Based Monitoring

Here you can specify the host on which to install our Prometheus server. The host can be your ClusterControl server, or a dedicated one. It is also possible to re-use another Prometheus server that is managed by ClusterControl.

Other options to specify are:

  • Scrape Interval (seconds): Set how often the nodes are scraped for metrics. By default 10.
  • Data Retention (days): Set how long the metrics are kept before being removed. By default 15.

We can monitor the installation of our server and agents from the Activity section in our ClusterControl and, once it is finished, we can see our cluster with the agents enabled from the main ClusterControl screen.

Dashboards

Having our agents enabled, if we go to the Dashboards section, we would see something like this:

Default dashboard

There are seven different kinds of dashboards available: System Overview, Cross Server Graphs, MySQL Overview, MySQL InnoDB Metrics, MySQL Performance Schema, Galera Cluster Overview, Galera Graphs.

List of dashboards

Here we can also specify which node to monitor, the time range and the refresh rate.

Dashboard time ranges

In the configuration section, we can enable or disable our agents (Exporters), check the agents status and verify the version of our Prometheus server.

Prometheus Configuration

Galera Cluster Overview

Galera Cluster Overview

The Galera Cluster Overview dashboard shows the most critical Galera cluster metrics. The information here is based on the WSREP API status, and you can cross-check the graphs against the raw status counters as shown right after this list.

  • Galera Cluster Size: shows the number of nodes in the cluster.
  • Flow Control Paused Time: the time during which flow control was in effect.
  • Flow Control Messages Sent: the number of messages sent to other cluster members asking them to slow down.
  • Writeset Inbound Traffic: transaction commits that the node receives from the cluster.
  • Writeset Outbound Traffic: transaction commits that the node sends to the cluster.
  • Receive Queue: the number of write-sets waiting to be applied.
  • Send Queue: the number of write-sets waiting to be sent.
  • Transactions Received: the number of transactions received.
  • Transactions Replicated: the number of transactions replicated.
  • Average Incoming Transaction Size: the average size of incoming write-sets on the node.
  • Average Replicated Transaction Size: the average size of replicated write-sets on the node.
  • FC Trigger Low Limit: the point at which Flow Control engages.
  • FC Trigger High Limit.
  • Sequence numbers of transactions.
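
Most of these panels map directly to the wsrep status counters, so you can cross-check a graph against the raw values on any node, for example (a sketch; credentials are placeholders):

$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | \
        grep -E 'wsrep_cluster_size|wsrep_flow_control_paused|wsrep_local_recv_queue|wsrep_local_send_queue'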

MySQL Overview

MySQL Overview
  • MySQL Uptime: The amount of time since the MySQL server process was started.
  • Current QPS: The number of queries executed by the server during the last second.
  • InnoDB Buffer Pool Size: InnoDB buffer pool used for caching data and indexes in memory.
  • Buffer Pool Size % of Total RAM: The ratio between InnoDB buffer pool size and total memory.
  • MySQL Connections: The number of connection attempts (successful or not).
  • MySQL Client Thread Activity: Number of threads.
  • MySQL Questions: The number of queries sent to the server by clients, excluding those executed within stored programs.
  • MySQL Thread Cache: The thread_cache_size metric informs how many threads the server should cache to reuse.
  • MySQL Temporary Objects.
  • MySQL Select Types: counters for the different ways SELECT statements are executed (e.g., full join, range, full scan).
  • MySQL Sorts: sort activity, such as sort merge passes, range sorts and full-scan sorts.
  • MySQL Slow Queries: Slow queries are defined as queries being slower than the long_query_time setting.
  • MySQL Aborted Connections: Number of aborted connections.
  • MySQL Table Locks: Number of table locks.
  • MySQL Network Traffic: Shows how much network traffic is generated by MySQL.
  • MySQL Network Usage Hourly: Shows how much network traffic is generated by MySQL per hour.
  • MySQL Internal Memory Overview: Shows various uses of memory within MySQL.
  • Top Command Counters: The number of times each statement has been executed.
  • MySQL Handlers: Internal statistics on how MySQL is selecting, updating, inserting, and modifying rows, tables, and indexes.
  • MySQL Transaction Handlers.
  • Process States.
  • Top Process States Hourly.
  • MySQL Query Cache Memory.
  • MySQL Query Cache Activity.
  • MySQL File Openings.
  • MySQL Open Files: Number of files opened by MySQL.
  • MySQL Table Open Cache Status.
  • MySQL Open Tables: Number of open tables.
  • MySQL Table Definition Cache.

MySQL InnoDB Metrics

MySQL InnoDB Metrics
  • InnoDB Checkpoint Age.
  • InnoDB Transactions.
  • InnoDB Row Operations.
  • InnoDB Row Lock Time.
  • InnoDB I/O.
  • InnoDB Log File Usage Hourly.
  • InnoDB Logging Performance.
  • InnoDB Deadlocks.
  • Index Condition Pushdown.
  • InnoDB Buffer Pool Content.
  • InnoDB Buffer Pool Pages.
  • InnoDB Buffer Pool I/O.
  • InnoDB Buffer Pool Requests.
  • InnoDB Read-Ahead.
  • InnoDB Change Buffer.
  • InnoDB Change Buffer Activity.

MySQL Performance Schema

MySQL Performance Schema

This dashboard provides a way to inspect the internal execution of the server at runtime. It requires the Performance Schema to be enabled.

  • Performance Schema File IO (Events).
  • Performance Schema File IO (Load).
  • Performance Schema File IO (Bytes).
  • Performance Schema Waits (Events).
  • Performance Schema Waits (Load).
  • Index Access Operations (Load).
  • Table Access Operations (Load).
  • Performance Schema SQL & External Locks (Events).
  • Performance Schema SQL and External Locks (Seconds).

System Overview Metrics

System Overview Metrics

To monitor our system, we have available for each server the following metrics (all of them for the selected node):

  • System Uptime: Time since the server was started.
  • CPUs: Number of CPUs.
  • RAM: Total amount of RAM.
  • Memory Available: Percentage of RAM memory available.
  • Load Average: Min, max and average server load.
  • Memory: Available, total and used server memory.
  • CPU Usage: Min, max and average server CPU usage information.
  • Memory Distribution: Memory distribution (buffer, cache, free and used) on the selected node.
  • Saturation Metrics: Min, max, and average of IO load and CPU load on the selected node.
  • Memory Advanced Details: Memory usage details like pages, buffer and more, on the selected node.
  • Forks: Number of forked processes. A fork is an operation whereby a process creates a copy of itself; it is usually a system call implemented in the kernel.
  • Processes: Number of processes running or waiting on the operating system.
  • Context Switches: A context switch is the action of storing the state of a thread so that its execution can be resumed later from the same point.
  • Interrupts: Number of interrupts. An interrupt is an event that alters the normal execution flow of a program and can be generated by hardware devices or even by the CPU itself.
  • Network Traffic: Inbound and outbound network traffic in KBytes per second on the selected node.
  • Network Utilization Hourly: Traffic sent and received in the last day.
  • Swap: Swap usage (free and used) on the selected node.
  • Swap Activity: Reads and writes data on the swap.
  • I/O Activity: Page in and page out on IO.
  • File Descriptors: Allocated and limit file descriptors.

Cross-Server Graphs Metrics

Cross Server Graphs

If we want to see the general state of all our servers with the information combined for OS and MySQL we can use this dashboard with the following metrics:

  • Load Average: Load average for each server.
  • Memory Usage: Percentage of memory usage for each server.
  • Network Traffic: Min, max and average kBytes of network traffic per second.
  • MySQL Connections: Number of client connections to MySQL server.
  • MySQL Queries: Number of queries executed.
  • MySQL Traffic: Provides information about min, max and average data sent and received.

Galera Graphs

Galera Graphs

In this view, you can check Galera specific metrics for each cluster node. The dashboard lists all of your cluster nodes, so you can easily filter for performance metrics of a particular node.

  • Ready to Accept Queries: identifies whether the node is able to run database operations.
  • Local State: Shows the node state.
  • Desync Mode: identifies whether the node participates in Flow Control.
  • Cluster Status: Cluster component status.
  • gcache Size: Galera Cluster cache size.
  • FC (normal traffic): Flow control status.
  • Galera Replication Queues: Size of the replication queue.
  • Galera Cluster Size: Number of nodes in the cluster.
  • Galera Flow Control: the number of flow control events.
  • Galera Parallelization Efficiency: an average distance between the highest and lowest sequence numbers that are concurrently applied, committed and can possibly be applied in parallel.
  • Galera Writing Conflicts: the number of local transactions on this node that failed the certification test.
  • Available Downtime before SST Required: how long (in minutes) a node can stay down before an SST, rather than an IST, is required to rejoin.
  • Galera Writeset Count: the count of transactions replicated to the cluster (from this node) and received from the cluster (any other node).
  • Galera Writeset Size: This graph shows the average transaction size sent/received.
  • Galera Network Usage Hourly: Network usage hourly (received and replicated).

Conclusion

Monitoring is an area where operations teams commonly spend time developing custom solutions. It is common to find IT teams integrating several such tools in order to get a holistic view of their infrastructure.

ClusterControl provides a complete monitoring system with real-time data to know what is happening now, high-resolution metrics for better accuracy, configurable dashboards, and a wide range of third-party notification services for alerting. Download ClusterControl today (it’s free).

How to Improve Replication Performance in a MySQL or MariaDB Galera Cluster

In the comments section of one of our blogs a reader asked about the impact of wsrep_slave_threads on Galera Cluster’s I/O performance and scalability. At that time, we couldn’t easily answer that question and back it up with more data, but finally we managed to set up the environment and run some tests.

Our reader pointed towards benchmarks that showed that increasing wsrep_slave_threads did not have any impact on the performance of the Galera cluster.

To explain what the impact of that setting is, we set up a small cluster of three nodes (m5d.xlarge). This allowed us to utilize directly attached NVMe SSDs for the MySQL data directory. By doing this, we minimized the chance of storage becoming the bottleneck in our setup.

We set the InnoDB buffer pool to 8GB and the redo logs to two files of 1GB each. We also increased innodb_io_capacity to 2000 and innodb_io_capacity_max to 10000. This was also intended to ensure that none of those settings would impact our performance.
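
For reference, the relevant part of the MySQL configuration would look roughly like the sketch below (the file path and section layout depend on your distribution; wsrep_slave_threads is the variable we vary during the test):

$ vi /etc/my.cnf.d/benchmark.cnf
[mysqld]
innodb_buffer_pool_size   = 8G
innodb_log_file_size      = 1G
innodb_log_files_in_group = 2
innodb_io_capacity        = 2000
innodb_io_capacity_max    = 10000
wsrep_slave_threads       = 16   # compared against wsrep_slave_threads = 1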

The whole problem with such benchmarks is that there are so many bottlenecks that you have to eliminate them one by one. Only after some configuration tuning, and after making sure that the hardware will not be a problem, can one hope that the more subtle limits will show up.

We generated ~90GB of data using sysbench:

sysbench /usr/share/sysbench/oltp_write_only.lua --threads=16 --events=0 --time=600 --mysql-host=172.30.4.245 --mysql-user=sbtest --mysql-password=sbtest --mysql-port=3306 --tables=28 --report-interval=1 --skip-trx=off --table-size=10000000 --db-ps-mode=disable --mysql-db=sbtest_large prepare

Then the benchmark was executed. We tested two settings: wsrep_slave_threads=1 and wsrep_slave_threads=16. The hardware was not powerful enough to benefit from increasing this variable even further. Please also keep in mind that we did not do detailed benchmarking to determine whether wsrep_slave_threads should be set to 16, 8 or maybe 4 for the best performance. We were interested in seeing whether we could show an impact on the cluster. And yes, the impact was clearly visible. For starters, some flow control graphs.

While running with wsrep_slave_threads=1, on average, nodes were paused due to flow control ~64% of the time.

While running with wsrep_slave_threads=16, on average, nodes were paused due to flow control ~20% of the time.

You can also compare the difference on a single graph. The drop at the end of the first part is the first attempt to run with wsrep_slave_threads=16. Servers ran out of disk space for binary logs and we had to re-run that benchmark once more at a later time.

How did this translate into performance? The difference is visible, although definitely not that spectacular.

First, the queries-per-second graph. You can notice that in both cases the results are all over the place. This is mostly related to the unstable performance of the I/O storage and the flow control randomly kicking in. You can still see that the performance of the “red” result (wsrep_slave_threads=1) is quite a bit lower than that of the “green” one (wsrep_slave_threads=16).

The picture is quite similar when we look at latency. You can see more (and typically deeper) stalls for the run with wsrep_slave_threads=1.

The difference is even more visible when we calculate the average latency across all the runs: the latency with wsrep_slave_threads=1 is 27% higher than the latency with 16 slave threads, which is obviously not good, as we want latency to be lower, not higher.

The difference in throughput is also visible, around an 11% improvement when we added more wsrep_slave_threads.

As you can see, the impact is there. It is by no means 16x (even though that's how much we increased the number of slave threads in Galera), but it is definitely prominent enough that we cannot classify it as just a statistical anomaly.

Please keep in mind that in our case we used quite small nodes. The difference should be even more significant if we are talking about large instances running on EBS volumes with thousands of provisioned IOPS.

Then we would be able to run sysbench even more aggressively, with a higher number of concurrent operations. This should improve parallelization of the writesets, improving the gain from multithreading even further. Also, faster hardware means that Galera will be able to utilize those 16 threads in a more efficient way.

When running tests like this, you have to keep in mind that you need to push your setup almost to its limits. Single-threaded replication can handle quite a lot of load, and you need to run heavy traffic to actually make it the limiting factor.

We hope this blog post gives you more insight into Galera Cluster’s abilities to apply writesets in parallel and the limiting factors around it.

High Availability on a Shoestring Budget - Deploying a Minimal Two Node MySQL Galera Cluster

We regularly get questions about how to set up a Galera cluster with just 2 nodes.

The documentation clearly states you should have at least 3 Galera nodes to avoid network partitioning. But there are some valid reasons for considering a 2 node deployment, e.g., if you want to achieve database high availability but have a limited budget to spend on a third database node. Or perhaps you are running Galera in a development/sandbox environment and prefer a minimal setup.

Galera implements a quorum-based algorithm to select a primary component through which it enforces consistency. The primary component needs to have a majority of votes, so in a two-node system, the failure of one node leaves the survivor without a majority, resulting in a non-primary (split brain) state. Fortunately, it is possible to add garbd (Galera Arbitrator Daemon), a lightweight stateless daemon that can act as the odd node. Arbitrator failure does not affect cluster operations, and a new instance can be reattached to the cluster at any time. There can be several arbitrators in the cluster.

ClusterControl has support for deploying garbd on non-database hosts.

Normally a Galera cluster needs at least three hosts to be fully functional; however, at deploy time, two nodes suffice to create a primary component. Here are the steps:

  1. Deploy a Galera cluster of two nodes,
  2. After the cluster has been deployed by ClusterControl, add garbd on the ClusterControl node.

You should end up with the below setup:

Deploy the Galera Cluster

Go to the ClusterControl Deploy section to deploy the cluster.

After selecting the technology that we want to deploy, we must specify User, Key or Password and port to connect by SSH to our hosts. We also need the name for our new cluster and if we want ClusterControl to install the corresponding software and configurations for us.

After setting up the SSH access information, we must select vendor/version and we must define the database admin password, datadir and port. We can also specify which repository to use.

Even though ClusterControl warns you that a Galera cluster needs an odd number of nodes, only add two nodes to the cluster.

Deploying a Galera cluster will trigger a ClusterControl job which can be monitored at the Jobs page.

Install Garbd

Once deployment is complete, install garbd on the ClusterControl host. We have the option to deploy garbd from ClusterControl, but this option won't work if we want to deploy it on the ClusterControl server itself. This is to avoid issues related to database versions and package dependencies.

So, we must install it manually, and then import garbd to ClusterControl.

Let’s see the manual installation of Percona Garbd on CentOS 7.

Create the Percona repository file:

$ vi /etc/yum.repos.d/percona.repo
[percona-release-$basearch]
name = Percona-Release YUM repository - $basearch
baseurl = http://repo.percona.com/release/$releasever/RPMS/$basearch
enabled = 1
gpgcheck = 0
[percona-release-noarch]
name = Percona-Release YUM repository - noarch
baseurl = http://repo.percona.com/release/$releasever/RPMS/noarch
enabled = 1
gpgcheck = 0
[percona-release-source]
name = Percona-Release YUM repository - Source packages
baseurl = http://repo.percona.com/release/$releasever/SRPMS
enabled = 0
gpgcheck = 0

Then, install the Percona XtraDB Cluster garbd package:

$ yum install Percona-XtraDB-Cluster-garbd-57

Now, we need to configure garbd. For this, we need to edit the /etc/sysconfig/garb file:

$ vi /etc/sysconfig/garb
# Copyright (C) 2012 Codership Oy
# This config file is to be sourced by garb service script.
# A comma-separated list of node addresses (address[:port]) in the cluster
GALERA_NODES="192.168.100.192:4567,192.168.100.193:4567"
# Galera cluster name, should be the same as on the rest of the nodes.
GALERA_GROUP="Galera1"
# Optional Galera internal options string (e.g. SSL settings)
# see http://galeracluster.com/documentation-webpages/galeraparameters.html
# GALERA_OPTIONS=""
# Log file for garbd. Optional, by default logs to syslog
# Deprecated for CentOS7, use journalctl to query the log for garbd
# LOG_FILE=""

Change the GALERA_NODES and GALERA_GROUP parameters according to the Galera nodes' configuration. We also need to remove the line # REMOVE THIS AFTER CONFIGURATION before starting the service.

And now, we can start the garb service:

$ service garb start
Redirecting to /bin/systemctl start garb.service
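
Before importing garbd into ClusterControl, you can confirm that it has joined by checking the cluster size on one of the database nodes; it should now report 3 (a sketch; credentials are placeholders):

$ mysql -uroot -p -h192.168.100.192 -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"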

Now, we can import the new garbd into ClusterControl.

Go to ClusterControl -> Select Cluster -> Add Load Balancer.

Then, select Garbd and Import Garbd section.

Here we only need to specify the hostname or IP Address and the port of the new Garbd.

Importing garbd will trigger a ClusterControl job which can be monitored at the Jobs page. Once completed, you can verify garbd is running with a green tick icon at the top bar:

That’s it!

Our minimal two-node Galera cluster is now ready!

HA for MySQL and MariaDB - Comparing Master-Master Replication to Galera Cluster

$
0
0

Galera replication is relatively new compared to MySQL replication, which has been natively supported since MySQL v3.23. Although MySQL replication is designed for master-slave unidirectional replication, it can be configured as an active master-master setup with bidirectional replication. While it is easy to set up, and some use cases might benefit from this “hack”, there are a number of caveats. On the other hand, Galera Cluster is a different type of technology to learn and manage. Is it worth it?

In this blog post, we are going to compare master-master replication to Galera cluster.

Replication Concepts

Before we jump into the comparison, let’s explain the basic concepts behind these two replication mechanisms.

Generally, any modification to the MySQL database generates an event in binary format. This event is transported to the other nodes depending on the replication method chosen - MySQL replication (native) or Galera replication (patched with wsrep API).

MySQL Replication

The following diagram illustrates the data flow of a successful transaction from one node to another when using MySQL replication:

The binary event is written into the master's binary log. The slave(s), via the slave_IO_thread, will pull the binary events from the master's binary log and replicate them into its relay log. The slave_SQL_thread will then apply the event from the relay log asynchronously. Due to the asynchronous nature of replication, the slave server is not guaranteed to have the data when the master performs the change.

Ideally, the slave in MySQL replication should be configured as a read-only server by setting read_only=ON or super_read_only=ON. This is a precaution to protect the slave from accidental writes which can lead to data inconsistency or failure during master failover (e.g., errant transactions). However, in a master-master active-active replication setup, read-only has to be disabled on the other master to allow writes to be processed simultaneously. Each master must also be configured to replicate from the other by using the CHANGE MASTER statement to enable circular replication.

Galera Replication

The following diagram illustrates the data replication flow of a successful transaction from one node to another for Galera Cluster:

The event is encapsulated in a writeset and broadcasted from the originator node to the other nodes in the cluster by using Galera replication. The writeset undergoes certification on every Galera node and if it passes, the applier threads will apply the writeset asynchronously. This means that the slave server will eventually become consistent, after agreement of all participating nodes in global total ordering. It is logically synchronous, but the actual writing and committing to the tablespace happens independently, and thus asynchronously on each node with a guarantee for the change to propagate on all nodes.

Avoiding Primary Key Collision

In order to deploy MySQL replication in a master-master setup, one has to adjust the auto increment values to avoid primary key collisions for INSERTs between two or more replicating masters. This allows the primary key values on the masters to interleave with each other and prevents the same auto increment number being used twice on either of the nodes. This behaviour must be configured manually, depending on the number of masters in the replication setup. The value of auto_increment_increment equals the number of replicating masters and the auto_increment_offset must be unique between them. For example, the following lines should exist inside the corresponding my.cnf:

Master1:

log-slave-updates
auto_increment_increment=2
auto_increment_offset=1

Master2:

log-slave-updates
auto_increment_increment=2
auto_increment_offset=2

Likewise, Galera Cluster uses this same trick to avoid primary key collisions by controlling the auto increment value and offset automatically with the wsrep_auto_increment_control variable. If set to 1 (the default), it will automatically adjust the auto_increment_increment and auto_increment_offset variables according to the size of the cluster, and whenever the cluster size changes. This avoids replication conflicts due to auto_increment. In a master-slave environment, this variable can be set to OFF.
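
A quick way to see what Galera has applied on each node is to check the variables directly (a sketch; run it against every node and compare the offsets):

$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wsrep_auto_increment_control','auto_increment_increment','auto_increment_offset')"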

The consequence of this configuration is that the auto increment values will not be in sequential order, as shown in the following table of a three-node Galera Cluster:

Node      auto_increment_increment   auto_increment_offset   Auto increment value
Node 1    3                          1                       1, 4, 7, 10, 13, 16...
Node 2    3                          2                       2, 5, 8, 11, 14, 17...
Node 3    3                          3                       3, 6, 9, 12, 15, 18...

If an application performs insert operations on the following nodes in the following order:

  • Node1, Node3, Node2, Node3, Node3, Node1, Node3 ..

Then the primary key value that will be stored in the table will be:

  • 1, 6, 8, 9, 12, 13, 15 ..

Simply said, when using master-master replication (MySQL replication or Galera), your application must be able to tolerate non-sequential auto-increment values in its dataset.

For ClusterControl users, take note that it supports deployment of MySQL master-master replication with a limit of two masters per replication cluster, and only for an active-passive setup. Therefore, ClusterControl deliberately does not configure the masters with the auto_increment_increment and auto_increment_offset variables.

Data Consistency

Galera Cluster comes with its flow-control mechanism, where each node in the cluster must keep up when replicating, or otherwise all other nodes will have to slow down to allow the slowest node to catch up. This basically minimizes the probability of slave lag, although it might still happen, just not as significantly as in MySQL replication. By default, Galera allows nodes to lag by up to 16 transactions when applying, controlled by the gcs.fc_limit variable. If you want to do critical reads (a SELECT that must return the most up-to-date information), you probably want to use the session variable wsrep_sync_wait.
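
For example, a critical read can enforce causality checks for the duration of the session before the SELECT is executed (a sketch; the table name is just an illustration):

$ mysql -e "SET SESSION wsrep_sync_wait = 1; SELECT balance FROM shop.accounts WHERE id = 1"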

Galera Cluster on the other hand comes with a safeguard against data inconsistency, whereby a node will get evicted from the cluster if it fails to apply any writeset for whatever reason. For example, when a Galera node fails to apply a writeset due to an internal error in the underlying storage engine (MySQL/MariaDB), the node will pull itself out from the cluster with the following error:

150305 16:13:14 [ERROR] WSREP: Failed to apply trx 1 4 times
150305 16:13:14 [ERROR] WSREP: Node consistency compromized, aborting..

To fix the data consistency, the offending node has to be re-synced before it is allowed to join the cluster. This can be done manually or by wiping out the data directory to trigger snapshot state transfer (full syncing from a donor).

MySQL master-master replication does not enforce data consistency protection and a slave is allowed to diverge, e.g., replicate a subset of data or lag behind, which makes the slave inconsistent with the master. It is designed to replicate data in one flow - from the master down to the slaves. Data consistency checks have to be performed manually or via external tools like Percona Toolkit pt-table-checksum or mysql-replication-check.

Conflict Resolution

Generally, master-master (or multi-master, or bi-directional) replication allows more than one member in the cluster to process writes. With MySQL replication, in case of replication conflict, the slave's SQL thread simply stops applying the next query until the conflict is resolved, either by manually skipping the replication event, fixing the offending rows or resyncing the slave. Simply said, there is no automatic conflict resolution support for MySQL replication.

Galera Cluster provides a better alternative by retrying the offending transaction during replication. By using the wsrep_retry_autocommit variable, one can instruct Galera to automatically retry a failed transaction due to cluster-wide conflicts before returning an error to the client. If set to 0, no retries will be attempted, while a value of 1 (the default) or more specifies the number of retries attempted. This can be useful to help applications using autocommit avoid deadlocks.
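
For instance, to allow up to three automatic retries for autocommit transactions that hit a cluster-wide conflict, the variable can be adjusted at runtime (a sketch):

$ mysql -e "SET GLOBAL wsrep_retry_autocommit = 3"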

Node Consensus and Failover

Galera uses the Group Communication System (GCS) to check node consensus and availability between cluster members. If a node is unhealthy, it will be automatically evicted from the cluster after the gmcast.peer_timeout value, which defaults to 3 seconds. A healthy Galera node in "Synced" state is deemed a reliable node to serve reads and writes, while others are not. This design greatly simplifies health check procedures from the upper tiers (load balancer or application).

In MySQL replication, a master does not care about its slave(s), while a slave only has consensus with its sole master via the slave_IO_thread process when replicating the binary events from the master's binary log. If a master goes down, this will break the replication and an attempt to re-establish the link will be made every slave_net_timeout (default 60 seconds). From the application or load balancer perspective, the health check procedures for a replication slave must at least involve checking the following states (a minimal check is sketched right after this list):

  • Seconds_Behind_Master
  • Slave_IO_Running
  • Slave_SQL_Running
  • read_only variable
  • super_read_only variable (MySQL 5.7.8 and later)
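
A minimal sketch of such a health check from the shell could look like this (host and credentials are placeholders; super_read_only only exists on MySQL 5.7.8 and later):

$ mysql -h <slave_host> -u monitor -p -e "SHOW SLAVE STATUS\G" | \
        grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
$ mysql -h <slave_host> -u monitor -p -e "SELECT @@read_only, @@super_read_only"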

In terms of failover, generally, master-master replication and Galera nodes are equal. They hold the same data set (albeit you can replicate a subset of data in MySQL replication, but that's uncommon for master-master) and share the same role as masters, capable of handling reads and writes simultaneously. Therefore, there is actually no failover from the database point-of-view due to this equilibrium. Only the application side requires a failover, to skip the non-operational nodes. Keep in mind that because MySQL replication is asynchronous, it is possible that not all of the changes done on the master will have propagated to the other master.

Node Provisioning

The process of bringing a node into sync with the cluster before replication starts is known as provisioning. In MySQL replication, provisioning a new node is a manual process. One has to take a backup of the master and restore it on the new node before setting up the replication link. For an existing replication node, if the master's binary logs have been rotated (based on expire_logs_days, which defaults to 0, meaning no automatic removal), you may have to re-provision the node using this procedure. There are also external tools like Percona Toolkit pt-table-sync and ClusterControl to help you out with this. ClusterControl supports resyncing a slave with just two clicks. You have the option to resync by taking a backup from the active master or from an existing backup.

In Galera, there are two ways of doing this - incremental state transfer (IST) or state snapshot transfer (SST). The IST process is the preferred method, where only the missing transactions are transferred from a donor's cache. The SST process is similar to taking a full backup from the donor and is usually pretty resource intensive. Galera will automatically determine which syncing process to trigger based on the joiner's state. In most cases, if a node fails to join the cluster, simply wipe out the MySQL datadir of the problematic node and start the MySQL service. The Galera provisioning process is much simpler, and it comes in very handy when scaling out your cluster or re-introducing a problematic node back into the cluster.

Loosely Coupled vs Tightly Coupled

MySQL replication works very well even across slower connections, and with connections that are not continuous. It can also be used across different hardware, environment and operating systems. Most storage engines support it, including MyISAM, Aria, MEMORY and ARCHIVE. This loosely coupled setup allows MySQL master-master replication to work well in a mixed environment with less restriction.

Galera nodes are tightly-coupled, where the replication performance is as fast as the slowest node. Galera uses a flow control mechanism to control replication flow among members and eliminate any slave lag. The replication can be all fast or all slow on every node and is adjusted automatically by Galera. Thus, it's recommended to use uniform hardware specs for all Galera nodes, especially with respect to CPU, RAM, disk subsystem, network interface card and network latency between nodes in the cluster.

Conclusions

In summary, Galera Cluster is superior if compared to MySQL master-master replication due to its synchronous replication support with strong consistency, plus more advanced features like automatic membership control, automatic node provisioning and multi-threaded slaves. Ultimately, this depends on how the application interacts with the database server. Some legacy applications built for a standalone database server may not work well on a clustered setup.

To simplify our points above, the following reasons justify when to use MySQL master-master replication:

  • Things that are not supported by Galera:
    • Replication for non-InnoDB/XtraDB tables like MyISAM, Aria, MEMORY or ARCHIVE.
    • XA transactions.
    • Statement-based replication between masters (e.g., when bandwidth is very expensive).
    • Relying on explicit locking like LOCK TABLES statement.
    • The general query log and the slow query log must be directed to a table, instead of a file.
  • Loosely coupled setup where the hardware specs, software version and connection speed are significantly different on every master.
  • When you already have a MySQL replication chain and you want to add another active/backup master for redundancy to speed up failover and recovery time in case one of the masters becomes unavailable.
  • If your application can't be modified to work around Galera Cluster limitations and having a MySQL-aware load balancer like ProxySQL or MaxScale is not an option.

Reasons to pick Galera Cluster over MySQL master-master replication:

  • Ability to safely write to multiple masters.
  • Data consistency automatically managed (and guaranteed) across databases.
  • New database nodes easily introduced and synced.
  • Failures or inconsistencies automatically detected.
  • In general, more advanced and robust high availability features.

How to Run and Configure ProxySQL 2.0 for MySQL Galera Cluster on Docker

ProxySQL is an intelligent and high-performance SQL proxy which supports MySQL, MariaDB and ClickHouse. Recently, ProxySQL 2.0 has become GA and it comes with new exciting features such as GTID consistent reads, frontend SSL, Galera and MySQL Group Replication native support.

It is relatively easy to run ProxySQL as Docker container. We have previously written about how to run ProxySQL on Kubernetes as a helper container or as a Kubernetes service, which is based on ProxySQL 1.x. In this blog post, we are going to use the new version ProxySQL 2.x which uses a different approach for Galera Cluster configuration.

ProxySQL 2.x Docker Image

We have released a new ProxySQL 2.0 Docker image and it's available on Docker Hub. The README provides a number of configuration examples, particularly for Galera and MySQL Replication, pre and post v2.x. The configuration lines can be defined in a text file and mapped into the container's path at /etc/proxysql.cnf to be loaded into the ProxySQL service.

The image "latest" tag still points to 1.x until ProxySQL 2.0 officially becomes GA (we haven't seen any official release blog/article from ProxySQL team yet). Which means, whenever you install ProxySQL image using latest tag from Severalnines, you will still get version 1.x with it. Take note the new example configurations also enable ProxySQL web stats (introduced in 1.4.4 but still in beta) - a simple dashboard that summarizes the overall configuration and status of ProxySQL itself.

ProxySQL 2.x Support for Galera Cluster

Let's talk about Galera Cluster native support in greater detail. The new mysql_galera_hostgroups table consists of the following fields:

  • writer_hostgroup: ID of the hostgroup that will contain all the members that are writers (read_only=0).
  • backup_writer_hostgroup: If the cluster is running in multi-writer mode (i.e. there are multiple nodes with read_only=0) and max_writers is set to a smaller number than the total number of nodes, the additional nodes are moved to this backup writer hostgroup.
  • reader_hostgroup: ID of the hostgroup that will contain all the members that are readers (i.e. nodes that have read_only=1)
  • offline_hostgroup: When ProxySQL monitoring determines a host to be OFFLINE, the host will be moved to the offline_hostgroup.
  • active: a boolean value (0 or 1) to activate a hostgroup
  • max_writers: Controls the maximum number of allowable nodes in the writer hostgroup, as mentioned previously, additional nodes will be moved to the backup_writer_hostgroup.
  • writer_is_also_reader: When 1, a node in the writer_hostgroup will also be placed in the reader_hostgroup so that it will be used for reads. When set to 2, the nodes from backup_writer_hostgroup will be placed in the reader_hostgroup, instead of the node(s) in the writer_hostgroup.
  • max_transactions_behind: determines the maximum number of writesets a node in the cluster can have queued before the node is SHUNNED to prevent stale reads (this is determined by querying the wsrep_local_recv_queue Galera variable).
  • comment: Text field that can be used for any purposes defined by the user

Here is an example configuration for mysql_galera_hostgroups in table format:

Admin> select * from mysql_galera_hostgroups\G
*************************** 1. row ***************************
       writer_hostgroup: 10
backup_writer_hostgroup: 20
       reader_hostgroup: 30
      offline_hostgroup: 9999
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 20
                comment: 

ProxySQL performs Galera health checks by monitoring the following MySQL status/variable:

  • read_only - If ON, then ProxySQL will group the defined host into reader_hostgroup unless writer_is_also_reader is 1.
  • wsrep_desync - If ON, ProxySQL will mark the node as unavailable, moving it to offline_hostgroup.
  • wsrep_reject_queries - If this variable is ON, ProxySQL will mark the node as unavailable, moving it to the offline_hostgroup (useful in certain maintenance situations).
  • wsrep_sst_donor_rejects_queries - If this variable is ON, ProxySQL will mark the node as unavailable while the Galera node is serving as an SST donor, moving it to the offline_hostgroup.
  • wsrep_local_state - If this status returns other than 4 (4 means Synced), ProxySQL will mark the node as unavailable and move it into offline_hostgroup.
  • wsrep_local_recv_queue - If this status is higher than max_transactions_behind, the node will be shunned.
  • wsrep_cluster_status - If this status returns other than Primary, ProxySQL will mark the node as unavailable and move it into offline_hostgroup.

Having said that, by combining these new parameters in mysql_galera_hostgroups together with mysql_query_rules, ProxySQL 2.x has the flexibility to fit many more Galera use cases. For example, one can have single-writer, multi-writer and multi-reader hostgroups defined as the destination hostgroups of query rules, with the ability to limit the number of writers and finer control over the stale-read behaviour.

Contrast this to ProxySQL 1.x, where the user had to explicitly define a scheduler to call an external script to perform the backend health checks and update the database servers' state. This requires some customization of the script (the user has to update the ProxySQL admin user/password/port), plus it depends on an additional tool (the MySQL client) to connect to the ProxySQL admin interface.

Here is an example configuration of Galera health check script scheduler in table format for ProxySQL 1.x:

Admin> select * from scheduler\G
*************************** 1. row ***************************
         id: 1
     active: 1
interval_ms: 2000
   filename: /usr/share/proxysql/tools/proxysql_galera_checker.sh
       arg1: 10
       arg2: 20
       arg3: 1
       arg4: 1
       arg5: /var/lib/proxysql/proxysql_galera_checker.log
    comment:

Besides, since the ProxySQL scheduler thread executes any script independently, there are many versions of health check scripts available out there. All ProxySQL instances deployed by ClusterControl use the default script provided by the ProxySQL installer package.

In ProxySQL 2.x, max_writers and writer_is_also_reader variables can determine how ProxySQL dynamically groups the backend MySQL servers and will directly affect the connection distribution and query routing. For example, consider the following MySQL backend servers:

Admin> select hostgroup_id, hostname, status, weight from mysql_servers;
+--------------+--------------+--------+--------+
| hostgroup_id | hostname     | status | weight |
+--------------+--------------+--------+--------+
| 10           | DB1          | ONLINE | 1      |
| 10           | DB2          | ONLINE | 1      |
| 10           | DB3          | ONLINE | 1      |
+--------------+--------------+--------+--------+

Together with the following Galera hostgroups definition:

Admin> select * from mysql_galera_hostgroups\G
*************************** 1. row ***************************
       writer_hostgroup: 10
backup_writer_hostgroup: 20
       reader_hostgroup: 30
      offline_hostgroup: 9999
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 20
                comment: 

Considering all hosts are up and running, ProxySQL will most likely group the hosts as below:

Let's look at them one by one:

Configuration / Description
writer_is_also_reader=0
  • Groups the hosts into 2 hostgroups (writer and backup_writer).
  • Writer is part of the backup_writer.
  • Since the writer is not a reader, there is nothing in hostgroup 30 (reader), because none of the hosts are set with read_only=1. It is not a common practice in Galera to enable the read-only flag.
writer_is_also_reader=1
  • Groups the hosts into 3 hostgroups (writer, backup_writer and reader).
  • The variable read_only=0 in Galera has no effect, thus the writer is also in hostgroup 30 (reader).
  • Writer is not part of backup_writer.
writer_is_also_reader=2
  • Similar to writer_is_also_reader=1; however, the writer is part of backup_writer.

With this configuration, one can have various choices for hostgroup destination to cater for specific workloads. "Hotspot" writes can be configured to go to only one server to reduce multi-master conflicts, non-conflicting writes can be distributed equally on the other masters, most reads can be distributed evenly on all MySQL servers or non-writers, critical reads can be forwarded to the most up-to-date servers and analytical reads can be forwarded to a slave replica.

ProxySQL Deployment for Galera Cluster

In this example, suppose we already have a three-node Galera Cluster deployed by ClusterControl as shown in the following diagram:

Our Wordpress applications are running on Docker while the Wordpress database is hosted on our Galera Cluster running on bare-metal servers. We decided to run a ProxySQL container alongside our Wordpress containers to have better control over Wordpress database query routing and to fully utilize our database cluster infrastructure. Since the read-write ratio is around 80%-20%, we want to configure ProxySQL to:

  • Forward all writes to one Galera node (less conflict, focus on write)
  • Balance all reads to the other two Galera nodes (better distribution for the majority of the workload)

Firstly, create a ProxySQL configuration file inside the Docker host so we can map it into our container:

$ mkdir /root/proxysql-docker
$ vim /root/proxysql-docker/proxysql.cnf

Then, copy the following lines (we will explain the configuration lines further down):

datadir="/var/lib/proxysql"

admin_variables=
{
    admin_credentials="admin:admin"
    mysql_ifaces="0.0.0.0:6032"
    refresh_interval=2000
    web_enabled=true
    web_port=6080
    stats_credentials="stats:admin"
}

mysql_variables=
{
    threads=4
    max_connections=2048
    default_query_delay=0
    default_query_timeout=36000000
    have_compress=true
    poll_timeout=2000
    interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
    default_schema="information_schema"
    stacksize=1048576
    server_version="5.1.30"
    connect_timeout_server=10000
    monitor_history=60000
    monitor_connect_interval=200000
    monitor_ping_interval=200000
    ping_interval_server_msec=10000
    ping_timeout_server=200
    commands_stats=true
    sessions_sort=true
    monitor_username="proxysql"
    monitor_password="proxysqlpassword"
    monitor_galera_healthcheck_interval=2000
    monitor_galera_healthcheck_timeout=800
}

mysql_galera_hostgroups =
(
    {
        writer_hostgroup=10
        backup_writer_hostgroup=20
        reader_hostgroup=30
        offline_hostgroup=9999
        max_writers=1
        writer_is_also_reader=1
        max_transactions_behind=30
        active=1
    }
)

mysql_servers =
(
    { address="db1.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db2.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db3.cluster.local" , port=3306 , hostgroup=10, max_connections=100 }
)

mysql_query_rules =
(
    {
        rule_id=100
        active=1
        match_pattern="^SELECT .* FOR UPDATE"
        destination_hostgroup=10
        apply=1
    },
    {
        rule_id=200
        active=1
        match_pattern="^SELECT .*"
        destination_hostgroup=20
        apply=1
    },
    {
        rule_id=300
        active=1
        match_pattern=".*"
        destination_hostgroup=10
        apply=1
    }
)

mysql_users =
(
    { username = "wordpress", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 },
    { username = "sbtest", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 }
)

Now, let's take a closer look at some of the most important configuration sections. Firstly, we define the Galera hostgroups configuration as below:

mysql_galera_hostgroups =
(
    {
        writer_hostgroup=10
        backup_writer_hostgroup=20
        reader_hostgroup=30
        offline_hostgroup=9999
        max_writers=1
        writer_is_also_reader=1
        max_transactions_behind=30
        active=1
    }
)

Hostgroup 10 will be the writer_hostgroup, hostgroup 20 for backup_writer and hostgroup 30 for reader. We set max_writers to 1 so we can have a single-writer hostgroup (hostgroup 10) where all writes should be sent. Then, we set writer_is_also_reader to 1, which makes all Galera nodes readers as well, suitable for queries that can be equally distributed to all nodes. Hostgroup 9999 is reserved for offline_hostgroup if ProxySQL detects unoperational Galera nodes.

Then, we configure our MySQL servers with default to hostgroup 10:

mysql_servers =
(
    { address="db1.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db2.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db3.cluster.local" , port=3306 , hostgroup=10, max_connections=100 }
)

With the above configurations, ProxySQL will "see" our hostgroups as below:

Then, we define the query routing through query rules. Based on our requirement, all reads should be sent to all Galera nodes except the writer (hostgroup 20) and everything else is forwarded to hostgroup 10 for single writer:

mysql_query_rules =
(
    {
        rule_id=100
        active=1
        match_pattern="^SELECT .* FOR UPDATE"
        destination_hostgroup=10
        apply=1
    },
    {
        rule_id=200
        active=1
        match_pattern="^SELECT .*"
        destination_hostgroup=20
        apply=1
    },
    {
        rule_id=300
        active=1
        match_pattern=".*"
        destination_hostgroup=10
        apply=1
    }
)

Finally, we define the MySQL users that will be passed through ProxySQL:

mysql_users =
(
    { username = "wordpress", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 },
    { username = "sbtest", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 }
)

We set transaction_persistent to 0 so all connections coming from these users respect the query rules for read and write routing. Otherwise, the connections would end up hitting one hostgroup, which defeats the purpose of load balancing. Do not forget to create those users first on all MySQL servers. ClusterControl users can use the Manage -> Schemas and Users feature to create them.
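
As a rough example, assuming the application schemas are literally named "wordpress" and "sbtest" (adjust the grants to your own schema names and security policy), the following can be run on any Galera node, since DDL is replicated cluster-wide:

mysql> CREATE USER 'wordpress'@'%' IDENTIFIED BY 'passw0rd';
mysql> GRANT ALL PRIVILEGES ON wordpress.* TO 'wordpress'@'%';
mysql> CREATE USER 'sbtest'@'%' IDENTIFIED BY 'passw0rd';
mysql> GRANT ALL PRIVILEGES ON sbtest.* TO 'sbtest'@'%';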

We are now ready to start our container. We are going to map the ProxySQL configuration file as bind mount when starting up the ProxySQL container. Thus, the run command will be:

$ docker run -d \
--name proxysql2 \
--hostname proxysql2 \
--publish 6033:6033 \
--publish 6032:6032 \
--publish 6080:6080 \
--restart=unless-stopped \
-v /root/proxysql/proxysql.cnf:/etc/proxysql.cnf \
severalnines/proxysql:2.0

Finally, change the Wordpress database pointing to ProxySQL container port 6033, for instance:

$ docker run -d \
--name wordpress \
--publish 80:80 \
--restart=unless-stopped \
-e WORDPRESS_DB_HOST=proxysql2:6033 \
-e WORDPRESS_DB_USER=wordpress \
-e WORDPRESS_DB_PASSWORD=passw0rd \
wordpress

At this point, our architecture is looking something like this:

If you want ProxySQL container to be persistent, map /var/lib/proxysql/ to a Docker volume or bind mount, for example:

$ docker run -d \
--name proxysql2 \
--hostname proxysql2 \
--publish 6033:6033 \
--publish 6032:6032 \
--publish 6080:6080 \
--restart=unless-stopped \
-v /root/proxysql/proxysql.cnf:/etc/proxysql.cnf \
-v proxysql-volume:/var/lib/proxysql \
severalnines/proxysql:2.0

Keep in mind that running with persistent storage like the above makes our /root/proxysql/proxysql.cnf obsolete from the second restart onwards. This is due to ProxySQL's multi-layer configuration: if /var/lib/proxysql/proxysql.db exists, ProxySQL skips loading options from the configuration file and loads whatever is in the SQLite database instead (unless you start the proxysql service with the --initial flag). Having said that, any further ProxySQL configuration management has to be performed via the ProxySQL admin console on port 6032, instead of the configuration file.
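
As a quick sketch of that workflow (using the admin:admin credentials shown in the Monitoring section below), any change made through the admin console has to be loaded into runtime and saved into the SQLite database to survive a restart:

$ docker exec -it proxysql2 mysql -uadmin -padmin -h127.0.0.1 -P6032 --prompt='Admin> '
Admin> UPDATE mysql_servers SET max_connections=150 WHERE hostname='db1.cluster.local';
Admin> LOAD MYSQL SERVERS TO RUNTIME;
Admin> SAVE MYSQL SERVERS TO DISK;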

Monitoring

The ProxySQL process logs to syslog by default and you can view it using the standard Docker commands:

$ docker ps
$ docker logs proxysql2

To verify the current hostgroup, query the runtime_mysql_servers table:

$ docker exec -it proxysql2 mysql -uadmin -padmin -h127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id,hostname,status from runtime_mysql_servers;
+--------------+--------------+--------+
| hostgroup_id | hostname     | status |
+--------------+--------------+--------+
| 10           | 192.168.0.21 | ONLINE |
| 30           | 192.168.0.21 | ONLINE |
| 30           | 192.168.0.22 | ONLINE |
| 30           | 192.168.0.23 | ONLINE |
| 20           | 192.168.0.22 | ONLINE |
| 20           | 192.168.0.23 | ONLINE |
+--------------+--------------+--------+

If the selected writer goes down, it will be transferred to the offline_hostgroup (HID 9999):

Admin> select hostgroup_id,hostname,status from runtime_mysql_servers;
+--------------+--------------+--------+
| hostgroup_id | hostname     | status |
+--------------+--------------+--------+
| 10           | 192.168.0.22 | ONLINE |
| 9999         | 192.168.0.21 | ONLINE |
| 30           | 192.168.0.22 | ONLINE |
| 30           | 192.168.0.23 | ONLINE |
| 20           | 192.168.0.23 | ONLINE |
+--------------+--------------+--------+

The above topology changes can be illustrated in the following diagram:

We have also enabled the web stats UI with admin-web_enabled=true. To access the web UI, simply go to the Docker host on port 6080, for example http://192.168.0.200:6080, and you will be prompted with a username/password pop-up. Enter the credentials as defined under admin-stats_credentials and you should see the following page:

By monitoring MySQL connection pool table, we can get connection distribution overview for all hostgroups:

Admin> select hostgroup, srv_host, status, ConnUsed, MaxConnUsed, Queries from stats.stats_mysql_connection_pool order by srv_host;
+-----------+--------------+--------+----------+-------------+---------+
| hostgroup | srv_host     | status | ConnUsed | MaxConnUsed | Queries |
+-----------+--------------+--------+----------+-------------+---------+
| 20        | 192.168.0.23 | ONLINE | 5        | 24          | 11458   |
| 30        | 192.168.0.23 | ONLINE | 0        | 0           | 0       |
| 20        | 192.168.0.22 | ONLINE | 2        | 24          | 11485   |
| 30        | 192.168.0.22 | ONLINE | 0        | 0           | 0       |
| 10        | 192.168.0.21 | ONLINE | 32       | 32          | 9746    |
| 30        | 192.168.0.21 | ONLINE | 0        | 0           | 0       |
+-----------+--------------+--------+----------+-------------+---------+

The output above shows that hostgroup 30 does not process anything because our query rules do not have this hostgroup configured as a destination hostgroup.

The statistics related to the Galera nodes can be viewed in the mysql_server_galera_log table:

Admin>  select * from mysql_server_galera_log order by time_start_us desc limit 3\G
*************************** 1. row ***************************
                       hostname: 192.168.0.23
                           port: 3306
                  time_start_us: 1552992553332489
                success_time_us: 2045
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL
*************************** 2. row ***************************
                       hostname: 192.168.0.22
                           port: 3306
                  time_start_us: 1552992553329653
                success_time_us: 2799
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL
*************************** 3. row ***************************
                       hostname: 192.168.0.21
                           port: 3306
                  time_start_us: 1552992553329013
                success_time_us: 2715
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL

The resultset returns the related MySQL variable/status state for every Galera node for a particular timestamp. In this configuration, we configured the Galera health check to run every 2 seconds (monitor_galera_healthcheck_interval=2000). Hence, the maximum failover time would be around 2 seconds if a topology change happens to the cluster.
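
If you need to verify or tune this interval later, it can also be done from the ProxySQL admin console; the sketch below uses 1000 ms purely as an example value:

Admin> SELECT variable_name, variable_value FROM global_variables WHERE variable_name LIKE 'mysql-monitor_galera%';
Admin> UPDATE global_variables SET variable_value='1000' WHERE variable_name='mysql-monitor_galera_healthcheck_interval';
Admin> LOAD MYSQL VARIABLES TO RUNTIME;
Admin> SAVE MYSQL VARIABLES TO DISK;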


Database High Availability for Camunda BPM using MySQL or MariaDB Galera Cluster


Camunda BPM is an open-source workflow and decision automation platform. Camunda BPM ships with tools for creating workflow and decision models, operating deployed models in production, and allowing users to execute workflow tasks assigned to them.

By default, Camunda comes with an embedded database called H2, which works pretty decently within a Java environment with relatively small memory footprint. However, when it comes to scaling and high availability, there are other database backends that might be more appropriate.

In this blog post, we are going to deploy Camunda BPM 7.10 Community Edition on Linux, with a focus on achieving database high availability. Camunda supports major databases through JDBC drivers, namely Oracle, DB2, MySQL, MariaDB and PostgreSQL. This blog only focuses on MySQL and MariaDB Galera Cluster, with a different implementation for each - one with ProxySQL as the database load balancer, and the other using the JDBC driver to connect to multiple database instances. Take note that this article does not cover high availability for the Camunda application itself.

Prerequisite

Camunda BPM runs on Java. On our CentOS 7 box, we have to install a JDK; the best option is to use the one from Oracle and skip the OpenJDK packages provided in the repository. On the application server where Camunda should run, download the latest Java SE Development Kit (JDK) from Oracle by sending the acceptance cookie:

$ wget --header "Cookie: oraclelicense=accept-securebackup-cookie" https://download.oracle.com/otn-pub/java/jdk/12+33/312335d836a34c7c8bba9d963e26dc23/jdk-12_linux-x64_bin.rpm

Install it on the host:

$ yum localinstall jdk-12_linux-x64_bin.rpm

Verify with:

$ java --version
java 12 2019-03-19
Java(TM) SE Runtime Environment (build 12+33)
Java HotSpot(TM) 64-Bit Server VM (build 12+33, mixed mode, sharing)

Create a new directory and download Camunda Community for Apache Tomcat from the official download page:

$ mkdir ~/camunda
$ cd ~/camunda
$ wget --content-disposition 'https://camunda.org/release/camunda-bpm/tomcat/7.10/camunda-bpm-tomcat-7.10.0.tar.gz'

Extract it:

$ tar -xzf camunda-bpm-tomcat-7.10.0.tar.gz

There are a number of dependencies we have to configure before starting up the Camunda web application. These depend on the chosen database platform: datastore configuration, database connector and the CLASSPATH environment. The next sections explain the required steps for MySQL Galera (using Percona XtraDB Cluster) and MariaDB Galera Cluster.

Note that the configurations shown in this blog are based on Apache Tomcat environment. If you are using JBOSS or Wildfly, the datastore configuration will be a bit different. Refer to Camunda documentation for details.

MySQL Galera Cluster (with ProxySQL and Keepalived)

We will use ClusterControl to deploy a MySQL-based Galera cluster with Percona XtraDB Cluster. There are some Galera-related limitations mentioned in the Camunda docs surrounding multi-writer conflict handling and the InnoDB isolation level. In case you are affected by these, the safest way is to use the single-writer approach, which is achievable with the ProxySQL hostgroup configuration. To avoid a single point of failure, we will deploy two ProxySQL instances and tie them together with a virtual IP address managed by Keepalived.

The following diagram illustrates our final architecture:

First, deploy a three-node Percona XtraDB Cluster 5.7. Install ClusterControl, generate an SSH key and set up passwordless SSH from the ClusterControl host to all nodes (including ProxySQL). On the ClusterControl node, do:

$ whoami
root
$ ssh-keygen -t rsa
$ for i in 192.168.0.21 192.168.0.22 192.168.0.23 192.168.0.11 192.168.0.12; do ssh-copy-id $i; done

Before we deploy our cluster, we have to modify the MySQL configuration template file that ClusterControl will use when installing the MySQL servers. The template file name is my57.cnf.galera and it is located under /usr/share/cmon/templates/ on the ClusterControl host. Make sure the following lines exist under the [mysqld] section:

[mysqld]
...
transaction-isolation=READ-COMMITTED
wsrep_sync_wait=7
...

Save the file and we are good to go. The above are the requirements stated in the Camunda docs, especially regarding the supported transaction isolation level for Galera. The variable wsrep_sync_wait is set to 7 to perform cluster-wide causality checks for READ (including SELECT, SHOW, and BEGIN or START TRANSACTION), UPDATE, DELETE, INSERT, and REPLACE statements, ensuring that the statement is executed on a fully synced node. Keep in mind that a value other than 0 can result in increased latency.
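
After the deployment, a quick sanity check on any of the database nodes should confirm that both settings took effect; a minimal sketch:

mysql> SHOW VARIABLES WHERE Variable_name IN ('tx_isolation', 'wsrep_sync_wait');

The expected values are READ-COMMITTED and 7 respectively.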

Go to ClusterControl -> Deploy -> MySQL Galera and specify the following details (if not mentioned, use the default value):

  • SSH User: root
  • SSH Key Path: /root/.ssh/id_rsa
  • Cluster Name: Percona XtraDB Cluster 5.7
  • Vendor: Percona
  • Version: 5.7
  • Admin/Root Password: {specify a password}
  • Add Node: 192.168.0.21 (press Enter), 192.168.0.22 (press Enter), 192.168.0.23 (press Enter)

Make sure you got all the green ticks, indicating ClusterControl is able to connect to the node passwordlessly. Click "Deploy" to start the deployment.

Create the database, MySQL user and password on one of the database nodes:

mysql> CREATE DATABASE camunda;
mysql> CREATE USER camunda@'%' IDENTIFIED BY 'passw0rd';
mysql> GRANT ALL PRIVILEGES ON camunda.* TO camunda@'%';

Or from the ClusterControl interface, you can use Manage -> Schema and Users instead:

Once cluster is deployed, install ProxySQL by going to ClusterControl -> Manage -> Load Balancer -> ProxySQL -> Deploy ProxySQL and enter the following details:

  • Server Address: 192.168.0.11
  • Administration Password:
  • Monitor Password:
  • DB User: camunda
  • DB Password: passw0rd
  • Are you using implicit transactions?: Yes

Repeat the ProxySQL deployment step for the second ProxySQL instance, changing the Server Address value to 192.168.0.12. The virtual IP address provided by Keepalived requires at least two ProxySQL instances deployed and running. Finally, deploy the virtual IP address by going to ClusterControl -> Manage -> Load Balancer -> Keepalived, pick both ProxySQL nodes and specify the virtual IP address and the network interface the VIP will listen on:

Our database backend is now complete. Next, import the SQL files into the Galera Cluster as the created MySQL user. On the application server, go to the "sql" directory and import them into one of the Galera nodes (we pick 192.168.0.21):

$ cd ~/camunda/sql/create
$ yum install mysql #install mysql client
$ mysql -ucamunda -p -h192.168.0.21 camunda < mysql_engine_7.10.0.sql
$ mysql -ucamunda -p -h192.168.0.21 camunda < mysql_identity_7.10.0.sql

Camunda does not ship a MySQL connector for Java since its default database is H2. On the application server, download MySQL Connector/J from the MySQL download page and copy the JAR file into the Apache Tomcat bin directory:

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.15.tar.gz
$ tar -xzf mysql-connector-java-8.0.15.tar.gz
$ cd mysql-connector-java-8.0.15
$ cp mysql-connector-java-8.0.15.jar ~/camunda/server/apache-tomcat-9.0.12/bin/

Then, set the CLASSPATH environment variable to include the database connector. Open setenv.sh using text editor:

$ vim ~/camunda/server/apache-tomcat-9.0.12/bin/setenv.sh

And add the following line:

export CLASSPATH=$CLASSPATH:$CATALINA_HOME/bin/mysql-connector-java-8.0.15.jar

Open ~/camunda/server/apache-tomcat-9.0.12/conf/server.xml and change the lines related to datastore. Specify the virtual IP address as the MySQL host in the connection string, with ProxySQL port 6033:

<Resource name="jdbc/ProcessEngine"
              ...
              driverClassName="com.mysql.jdbc.Driver" 
              defaultTransactionIsolation="READ_COMMITTED"
              url="jdbc:mysql://192.168.0.10:6033/camunda"
              username="camunda"  
              password="passw0rd"
              ...
/>

Finally, we can start the Camunda service by executing start-camunda.sh script:

$ cd ~/camunda
$ ./start-camunda.sh
starting camunda BPM platform on Tomcat Application Server
Using CATALINA_BASE:   ./server/apache-tomcat-9.0.12
Using CATALINA_HOME:   ./server/apache-tomcat-9.0.12
Using CATALINA_TMPDIR: ./server/apache-tomcat-9.0.12/temp
Using JRE_HOME:        /
Using CLASSPATH:       :./server/apache-tomcat-9.0.12/bin/mysql-connector-java-8.0.15.jar:./server/apache-tomcat-9.0.12/bin/bootstrap.jar:./server/apache-tomcat-9.0.12/bin/tomcat-juli.jar
Tomcat started.

Make sure the CLASSPATH shown in the output includes the path to the MySQL Connector/J JAR file. After the initialization completes, you can then access Camunda webapps on port 8080 at http://192.168.0.8:8080/camunda/. The default username is demo with password 'demo':

You can then see the digest of the captured queries from Nodes -> ProxySQL -> Top Queries, indicating the application is interacting correctly with the Galera Cluster:

There is no read-write splitting configured in ProxySQL. Camunda sends "SET autocommit=0" on every SQL statement to initialize a transaction, and the best way for ProxySQL to handle this is by sending all the queries to the same backend server of the target hostgroup. This is the safest method and also gives better availability. However, all connections might end up reaching a single server, so there is no load balancing.
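
You can confirm this behaviour from the ProxySQL admin console (port 6032); with single-writer routing, the query counters should accumulate almost entirely on one backend of hostgroup 10. A sketch:

Admin> SELECT hostgroup, srv_host, status, Queries FROM stats.stats_mysql_connection_pool ORDER BY Queries DESC;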

MariaDB Galera

MariaDB Connector/J is able to handle a variety of connection modes - failover, sequential, replication and aurora - but Camunda only supports failover and sequential. Taken from MariaDB Connector/J documentation:

Mode: sequential (available since 1.3.0)
This mode supports connection failover in a multi-master environment, such as MariaDB Galera Cluster. It does not support load-balancing reads on slaves. The connector will try to connect to hosts in the order in which they were declared in the connection URL, so the first available host is used for all queries. For example, let's say that the connection URL is the following:
jdbc:mariadb:sequential:host1,host2,host3/testdb
When the connector tries to connect, it will always try host1 first. If that host is not available, then it will try host2, etc. When a host fails, the connector will try to reconnect to hosts in the same order.

Mode: failover (available since 1.2.0)
This mode supports connection failover in a multi-master environment, such as MariaDB Galera Cluster. It does not support load-balancing reads on slaves. The connector performs load-balancing for all queries by randomly picking a host from the connection URL for each connection, so queries will be load-balanced as a result of the connections getting randomly distributed across all hosts.

Using "failover" mode poses a higher potential risk of deadlock, since writes will be distributed to all backend servers almost equally. Single-writer approach is a safe way to run, which means using sequential mode should do the job pretty well. You also can skip the load-balancer tier in the architecture. Hence with MariaDB Java connector, we can deploy our architecture as simple as below:

Before we deploy our cluster, modify the MariaDB configuration template file that ClusterControl will use when installing the MariaDB servers. The template file name is my.cnf.galera and it is located under /usr/share/cmon/templates/ on the ClusterControl host. Make sure the following lines exist under the [mysqld] section:

[mysqld]
...
transaction-isolation=READ-COMMITTED
wsrep_sync_wait=7
performance_schema = ON
...

Save the file and we are good to go. A bit of explanation: the above are the requirements stated in the Camunda docs, especially regarding the supported transaction isolation level for Galera. The variable wsrep_sync_wait is set to 7 to perform cluster-wide causality checks for READ (including SELECT, SHOW, and BEGIN or START TRANSACTION), UPDATE, DELETE, INSERT, and REPLACE statements, ensuring that the statement is executed on a fully synced node. Keep in mind that a value other than 0 can result in increased latency. Enabling Performance Schema is optional, for ClusterControl's query monitoring feature.

Now we can start the cluster deployment process. Install ClusterControl, generate an SSH key and set up passwordless SSH from the ClusterControl host to all Galera nodes. On the ClusterControl node, do:

$ whoami
root
$ ssh-keygen -t rsa
$ for i in 192.168.0.41 192.168.0.42 192.168.0.43; do ssh-copy-id $i; done

Go to ClusterControl -> Deploy -> MySQL Galera and specify the following details (if not mentioned, use the default value):

  • SSH User: root
  • SSH Key Path: /root/.ssh/id_rsa
  • Cluster Name: MariaDB Galera 10.3
  • Vendor: MariaDB
  • Version: 10.3
  • Admin/Root Password: {specify a password}
  • Add Node: 192.168.0.41 (press Enter), 192.168.0.42 (press Enter), 192.168.0.43 (press Enter)

Make sure you got all the green ticks when adding nodes, indicating ClusterControl is able to connect to the node passwordlessly. Click "Deploy" to start the deployment.

Create the database, MariaDB user and password on one of the Galera nodes:

mysql> CREATE DATABASE camunda;
mysql> CREATE USER camunda@'%' IDENTIFIED BY 'passw0rd';
mysql> GRANT ALL PRIVILEGES ON camunda.* TO camunda@'%';

For ClusterControl user, you can use ClusterControl -> Manage -> Schema and Users instead:

Our database cluster deployment is now complete. Next, import the SQL files into the MariaDB cluster. On the application server, go to the "sql" directory and import them into one of the MariaDB nodes (we chose 192.168.0.41):

$ cd ~/camunda/sql/create
$ yum install mysql #install mariadb client
$ mysql -ucamunda -p -h192.168.0.41 camunda < mariadb_engine_7.10.0.sql
$ mysql -ucamunda -p -h192.168.0.41 camunda < mariadb_identity_7.10.0.sql

Camunda does not ship a MariaDB connector for Java since its default database is H2. On the application server, download MariaDB Connector/J from the MariaDB download page and copy the JAR file into the Apache Tomcat bin directory:

$ wget https://downloads.mariadb.com/Connectors/java/connector-java-2.4.1/mariadb-java-client-2.4.1.jar
$ cp mariadb-java-client-2.4.1.jar ~/camunda/server/apache-tomcat-9.0.12/bin/

Then, set the CLASSPATH environment variable to include the database connector. Open setenv.sh via text editor:

$ vim ~/camunda/server/apache-tomcat-9.0.12/bin/setenv.sh

And add the following line:

export CLASSPATH=$CLASSPATH:$CATALINA_HOME/bin/mariadb-java-client-2.4.1.jar

Open ~/camunda/server/apache-tomcat-9.0.12/conf/server.xml and change the lines related to datastore. Use the sequential connection protocol and list out all the Galera nodes separated by comma in the connection string:

<Resource name="jdbc/ProcessEngine"
              ...
              driverClassName="org.mariadb.jdbc.Driver" 
              defaultTransactionIsolation="READ_COMMITTED"
              url="jdbc:mariadb:sequential://192.168.0.41:3306,192.168.0.42:3306,192.168.0.43:3306/camunda"
              username="camunda"  
              password="passw0rd"
              ...
/>

Finally, we can start the Camunda service by executing start-camunda.sh script:

$ cd ~/camunda
$ ./start-camunda.sh
starting camunda BPM platform on Tomcat Application Server
Using CATALINA_BASE:   ./server/apache-tomcat-9.0.12
Using CATALINA_HOME:   ./server/apache-tomcat-9.0.12
Using CATALINA_TMPDIR: ./server/apache-tomcat-9.0.12/temp
Using JRE_HOME:        /
Using CLASSPATH:       :./server/apache-tomcat-9.0.12/bin/mariadb-java-client-2.4.1.jar:./server/apache-tomcat-9.0.12/bin/bootstrap.jar:./server/apache-tomcat-9.0.12/bin/tomcat-juli.jar
Tomcat started.

Make sure the CLASSPATH shown in the output includes the path to the MariaDB Java client JAR file. After the initialization completes, you can then access Camunda webapps on port 8080 at http://192.168.0.8:8080/camunda/. The default username is demo with password 'demo':

You can see the digest of the captured queries from ClusterControl -> Query Monitor -> Top Queries, indicating the application is interacting correctly with the MariaDB Cluster:

With MariaDB Connector/J, we do not need a load balancer tier, which simplifies our overall architecture. The sequential connection mode should do the trick to avoid multi-writer deadlocks, which can happen in Galera. This setup provides high availability, with each Camunda instance configured with JDBC to access the cluster of MySQL or MariaDB nodes. Galera takes care of synchronizing the data between the database instances in real time.

How to Migrate WHMCS Database to MariaDB Galera Cluster


WHMCS is an all-in-one client management, billing and support solution for web hosting companies. It's one of the leaders in the hosting automation world, used alongside the hosting control panel itself. WHMCS runs on a LAMP stack, with MySQL/MariaDB as the database provider. Commonly, WHMCS is installed as a standalone instance (application and database) by following the WHMCS installation guide, or through software installer tools like cPanel Site Software or Softaculous. The database can be made highly available by migrating it to a three-node Galera Cluster.

In this blog post, we will show you how to migrate the WHMCS database from a standalone MySQL server (provided by the WHM/cPanel server itself) to an external three-node MariaDB Galera Cluster to improve the database availability. The WHMCS application itself will be kept running on the same cPanel server. We’ll also give you some tuning tips to optimize performance.

Deploying the Database Cluster

  1. Install ClusterControl:
    $ whoami
    root
    $ wget https://severalnines.com/downloads/cmon/install-cc
    $ chmod 755 install-cc
    $ ./install-cc
    Follow the instructions accordingly until the installation is completed. Then, go to the http://192.168.55.50/clustercontrol (192.168.55.50 being the IP address of the ClusterControl host) and register a super admin user with password and other required details.
  2. Setup passwordless SSH from ClusterControl to all database nodes:
    $ whoami
    root
    $ ssh-keygen -t rsa # Press enter on all prompts
    $ ssh-copy-id 192.168.55.51
    $ ssh-copy-id 192.168.55.52
    $ ssh-copy-id 192.168.55.53
  3. Configure the database deployment for our 3-node MariaDB Galera Cluster. We are going to use the latest supported version MariaDB 10.3:
    Make sure you get all green checks after pressing ‘Enter’ when adding the node details. Wait until the deployment job completes and you should see the database cluster is listed in ClusterControl.
  4. Deploy a ProxySQL node (we are going to co-locate it with the ClusterControl node) by going to Manage -> Load Balancer -> ProxySQL -> Deploy ProxySQL. Specify the following required details:
    Under "Add Database User", you can ask ClusterControl to create a new ProxySQL and MySQL user as it sets up , thus we put the user as "portal_whmcs", assigned with ALL PRIVILEGES on database "portal_whmcs.*". Then, check all the boxes for "Include" and finally choose "false" for "Are you using implicit transactions?".

Once the deployment finished, you should see something like this under Topology view:

Our database deployment is now complete. Keep in mind that we do not cover load balancer tier redundancy in this blog post. You can achieve that by adding a secondary load balancer and stringing them together with Keepalived. To learn more about this, check out the ProxySQL Tutorials, chapter "4.2. High availability for ProxySQL".

WHMCS Installation

If you already have WHMCS installed and running, you may skip this step.

Take note that WHMCS requires a valid license which you have to purchase beforehand in order to use the software. They do not provide a free trial license, but they do offer a no questions asked 30-day money-back guarantee, which means you can always cancel the subscription before the offer expires without being charged.

To simplify the installation process, we are going to use cPanel Site Software (you may opt for the WHMCS manual installation) on one of our sub-domains, selfportal.mytest.io. After creating the account in WHM, go to cPanel > Software > Site Software > WHMCS and install the web application. Log in as the admin user and activate the license to start using the application.

At this point, our WHMCS instance is running as a standalone setup, connecting to the local MySQL server.


Migrating the WHMCS Database to MariaDB Galera Cluster

Running WHMCS on a standalone MySQL server exposes the application to single-point-of-failure (SPOF) from database standpoint. MariaDB Galera Cluster provides redundancy to the data layer with built-in clustering features and support for multi-master architecture. Combine this with a database load balancer, for example ProxySQL, and we can improve the WHMCS database availability with very minimal changes to the application itself.

However, there are a number of best-practices that WHMCS (or other applications) have to follow in order to work efficiently on Galera Cluster, especially:

  • All tables must be running on InnoDB/XtraDB storage engine.
  • All tables should have a primary key defined (multi-column primary key is supported, unique key does not count).

Depending on the version installed, these requirements may not be met out of the box. In our test environment (cPanel/WHM 11.78.0.23, WHMCS 7.6.0 via Site Software), both of the above points were violated. The default cPanel/WHM MySQL configuration comes with the following line inside /etc/my.cnf:

default-storage-engine=MyISAM

The above causes additional tables managed by WHMCS Addon Modules to be created in the MyISAM storage engine format if those modules are enabled. Here is the list of non-InnoDB tables after we enabled two modules (New TLDs and Staff Noticeboard):

MariaDB> SELECT tables.table_schema, tables.table_name, tables.engine FROM information_schema.tables WHERE tables.table_schema='whmcsdata_whmcs' and tables.engine <> 'InnoDB';
+-----------------+----------------------+--------+
| table_schema    | table_name           | engine |
+-----------------+----------------------+--------+
| whmcsdata_whmcs | mod_enomnewtlds      | MyISAM |
| whmcsdata_whmcs | mod_enomnewtlds_cron | MyISAM |
| whmcsdata_whmcs | mod_staffboard       | MyISAM |
+-----------------+----------------------+--------+

MyISAM support in Galera is experimental, which means you should not run it in production. In the worst cases, it can compromise data consistency and cause writeset replication failures due to its non-transactional nature.
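
If you prefer to convert the offending tables in place on the source server (instead of, or in addition to, rewriting the dump file as we do later), a plain ALTER TABLE does the job. A sketch against the three module tables listed above:

MariaDB> ALTER TABLE whmcsdata_whmcs.mod_enomnewtlds ENGINE=InnoDB;
MariaDB> ALTER TABLE whmcsdata_whmcs.mod_enomnewtlds_cron ENGINE=InnoDB;
MariaDB> ALTER TABLE whmcsdata_whmcs.mod_staffboard ENGINE=InnoDB;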

Another important point is that every table must have a primary key defined. Depending on the WHMCS installation procedure that you performed (as for us, we used cPanel Site Software to install WHMCS), some of the tables created by the installer do not come with primary key defined, as shown in the following output:

MariaDB [information_schema]> SELECT TABLES.table_schema, TABLES.table_name FROM TABLES LEFT JOIN KEY_COLUMN_USAGE AS c ON (TABLES.TABLE_NAME = c.TABLE_NAME AND c.CONSTRAINT_SCHEMA = TABLES.TABLE_SCHEMA AND c.constraint_name = 'PRIMARY' ) WHERE TABLES.table_schema <> 'information_schema' AND TABLES.table_schema <> 'performance_schema' AND TABLES.table_schema <> 'mysql' and TABLES.table_schema <> 'sys' AND c.constraint_name IS NULL;
+-----------------+------------------------------------+
| table_schema    | table_name                         |
+-----------------+------------------------------------+
| whmcsdata_whmcs | mod_invoicedata                    |
| whmcsdata_whmcs | tbladminperms                      |
| whmcsdata_whmcs | tblaffiliates                      |
| whmcsdata_whmcs | tblconfiguration                   |
| whmcsdata_whmcs | tblknowledgebaselinks              |
| whmcsdata_whmcs | tbloauthserver_access_token_scopes |
| whmcsdata_whmcs | tbloauthserver_authcode_scopes     |
| whmcsdata_whmcs | tbloauthserver_client_scopes       |
| whmcsdata_whmcs | tbloauthserver_user_authz_scopes   |
| whmcsdata_whmcs | tblpaymentgateways                 |
| whmcsdata_whmcs | tblproductconfiglinks              |
| whmcsdata_whmcs | tblservergroupsrel                 |
+-----------------+------------------------------------+

As a side note, Galera still allows tables without a primary key to exist. However, DELETE operations are not supported on those tables, and they expose you to much bigger problems like node crashes, writeset certification performance degradation, and rows appearing in a different order on different nodes.

To overcome this, our migration plan must include the additional step to fix the storage engine and schema structure, as shown in the next section.

Migration Plan

Due to restrictions explained in the previous chapter, our migration plan has to be something like this:

  1. Enable WHMCS maintenance mode
  2. Take backups of the whmcs database using logical backup
  3. Modify the dump files to meet Galera requirement (convert storage engine)
  4. Keep one of the Galera nodes up and shut down the remaining nodes
  5. Restore to the chosen Galera node
  6. Fix the schema structure to meet Galera requirement (missing primary keys)
  7. Bootstrap the cluster from the chosen Galera node
  8. Start the second node and let it sync
  9. Start the third node and let it sync
  10. Change the database pointing to the appropriate endpoint
  11. Disable WHMCS maintenance mode

The new architecture can be illustrated as below:

Our WHMCS database name on the cPanel server is "whmcsdata_whmcs" and we are going to migrate this database to an external three-node MariaDB Galera Cluster deployed by ClusterControl. On top of the database server, we have a ProxySQL (co-locate with ClusterControl) running to act as the MariaDB load balancer, providing the single endpoint to our WHMCS instance. The database name on the cluster will be changed to "portal_whmcs" instead, so we can easily distinguish it.

Firstly, enable the site-wide Maintenance Mode by going to WHMCS > Setup > General Settings > General > Maintenance Mode > Tick to enable - prevents client area access when enabled. This will ensure there will be no activity from the end user during the database backup operation.

Since we have to make slight modifications to the schema structure to fit well into Galera, it's a good idea to create two separate dump files. One with the schema only and another one for data only. On the WHM server, run the following command as root:

$ mysqldump --no-data -uroot whmcsdata_whmcs > whmcsdata_whmcs_schema.sql
$ mysqldump --no-create-info -uroot whmcsdata_whmcs > whmcsdata_whmcs_data.sql

Then, we have to replace all MyISAM occurrences in the schema dump file with 'InnoDB':

$ sed -i 's/MyISAM/InnoDB/g' whmcsdata_whmcs_schema.sql

Verify that we don't have MyISAM lines anymore in the dump file (it should return nothing):

$ grep -i 'myisam' whmcsdata_whmcs_schema.sql

Transfer the dump files from the WHM server to mariadb1 (192.168.55.51):

$ scp whmcsdata_whmcs_* 192.168.55.51:~

Create the MySQL database. From ClusterControl, go to Manage -> Schemas and Users -> Create Database and specify the database name. Here we use a different database name called "portal_whmcs". Otherwise, you can manually create the database with the following command:

$ mysql -uroot -p 
MariaDB> CREATE DATABASE portal_whmcs;

Create a MySQL user for this database with its privileges. From ClusterControl, go to Manage -> Schemas and Users -> Users -> Create New User and specify the following:

In case you choose to create the MySQL user manually, run the following statements:

$ mysql -uroot -p 
MariaDB> CREATE USER 'portal_whmcs'@'%' IDENTIFIED BY 'ghU51CnPzI9z';
MariaDB> GRANT ALL PRIVILEGES ON portal_whmcs.* TO portal_whmcs@'%';

Take note that the created database user has to be imported into ProxySQL, to allow the WHMCS application to authenticate against the load balancer. Go to Nodes -> pick the ProxySQL node -> Users -> Import Users and select "portal_whmcs"@"%", as shown in the following screenshot:

In the next window (User Settings), specify Hostgroup 10 as the default hostgroup:

Now the restoration preparation stage is complete.

In Galera, restoring a big database via mysqldump into a single-node cluster is more efficient, and this improves the restoration time significantly. Otherwise, every node in the cluster would have to certify every statement from the mysqldump input, which would take a longer time to complete.

Since we already have a three-node MariaDB Galera Cluster running, let's stop the MySQL service on mariadb2 and mariadb3, one node at a time, for a graceful scale-down. To shut down the database nodes, from ClusterControl, simply go to Nodes -> Node Actions -> Stop Node -> Proceed. Here is what you would see on the ClusterControl dashboard, where the cluster size is 1 and the status of db1 is Synced and Primary:

Then, on mariadb1 (192.168.55.51), restore the schema and data accordingly:

$ mysql -uportal_whmcs -p portal_whmcs < whmcsdata_whmcs_schema.sql
$ mysql -uportal_whmcs -p portal_whmcs < whmcsdata_whmcs_data.sql

Once imported, we have to fix the table structure by adding the necessary "id" column (except for table "tblaffiliates") and adding a primary key on all tables that were missing one:

$ mysql -uportal_whmcs -p
MariaDB> USE portal_whmcs;
MariaDB [portal_whmcs]> ALTER TABLE `tblaffiliates` ADD PRIMARY KEY (id);
MariaDB [portal_whmcs]> ALTER TABLE `mod_invoicedata` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tbladminperms` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tblconfiguration` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tblknowledgebaselinks` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tbloauthserver_access_token_scopes` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tbloauthserver_authcode_scopes` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tbloauthserver_client_scopes` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tbloauthserver_user_authz_scopes` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tblpaymentgateways` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tblproductconfiglinks` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
MariaDB [portal_whmcs]> ALTER TABLE `tblservergroupsrel` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

Or, we can translate the above repeated statements using a loop in a bash script:

#!/bin/bash
# Add a primary key to every table that is missing one in the WHMCS database.

db_user='portal_whmcs'
db_pass='ghU51CnPzI9z'
db_whmcs='portal_whmcs'

# List all tables (in any non-system schema) that have no PRIMARY KEY constraint
tables=$(mysql -u${db_user} "-p${db_pass}"  information_schema -A -Bse "SELECT TABLES.table_name FROM TABLES LEFT JOIN KEY_COLUMN_USAGE AS c ON (TABLES.TABLE_NAME = c.TABLE_NAME AND c.CONSTRAINT_SCHEMA = TABLES.TABLE_SCHEMA AND c.constraint_name = 'PRIMARY' ) WHERE TABLES.table_schema <> 'information_schema' AND TABLES.table_schema <> 'performance_schema' AND TABLES.table_schema <> 'mysql' and TABLES.table_schema <> 'sys' AND c.constraint_name IS NULL;")
mysql_exec="mysql -u${db_user} -p${db_pass} $db_whmcs -e"

for table in $tables
do
        if [ "${table}" = "tblaffiliates" ]
        then
                # tblaffiliates already has an id column, only the primary key itself is missing
                $mysql_exec "ALTER TABLE ${table} ADD PRIMARY KEY (id)";
        else
                # Add an auto-increment id column as the new primary key
                $mysql_exec "ALTER TABLE ${table} ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST";
        fi
done

At this point, it's safe to start the remaining nodes to sync up with mariadb1. Start with mariadb2 by going to Nodes -> pick db2 -> Node Actions -> Start Node. Monitor the job progress and make sure mariadb2 is in Synced and Primary state (monitor the Overview page for details) before starting up mariadb3.

Finally, change the database pointing to the ProxySQL host on port 6033 inside WHMCS configuration file, as in our case it's located at /home/whmcsdata/public_html/configuration.php:

$ vim configuration.php
<?php
$license = 'WHMCS-XXXXXXXXXXXXXXXXXXXX';
$templates_compiledir = 'templates_c';
$mysql_charset = 'utf8';
$cc_encryption_hash = 'gLg4oxuOWsp4bMleNGJ--------30IGPnsCS49jzfrKjQpwaN';
$db_host = '192.168.55.50';
$db_port = '6033';
$db_username = 'portal_whmcs';
$db_password = 'ghU51CnPzI9z';
$db_name = 'portal_whmcs';

$customadminpath = 'admin2d27';

Don't forget to disable WHMCS maintenance mode by going to WHMCS > Setup > General Settings > General > Maintenance Mode > uncheck "Tick to enable - prevents client area access when enabled". Our database migration exercise is now complete.

Testing and Tuning

You can verify this by looking at ProxySQL's query entries under Nodes -> ProxySQL -> Top Queries:

For the most repeated read-only queries (you can sort them by Count Star), you may cache them to improve the response time and reduce the number of hits on the backend servers. Simply roll over any query and click Cache Query, and the following pop-up will appear:

All you need to do is choose the destination hostgroup and click "Add Rule". You can then verify whether the cached query is getting hits under the "Rules" tab:

From the query rule itself, we can tell that reads (all SELECT except SELECT .. FOR UPDATE) are forwarded to hostgroup 20 where the connections are distributed to all nodes while writes (other than SELECT) are forwarded to hostgroup 10, where the connections are forwarded to one Galera node only. This configuration minimizes the risk for deadlocks that may be caused by a multi-master setup, which improves the replication performance as a whole.
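
To double-check that the cache is actually serving hits, you can also query ProxySQL's global counters from the admin console (a sketch); a growing Query_Cache_count_GET_OK value means cached resultsets are being returned without touching the backends:

Admin> SELECT Variable_Name, Variable_Value FROM stats_mysql_global WHERE Variable_Name LIKE 'Query_Cache%';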

That's it for now. Happy clustering!

How to Automate Migration from Standalone MySQL to Galera Cluster using Ansible


Database migrations don’t scale well. Typically you need to perform a great deal of tests before you can pull the trigger and switch from old to new. Migrations are usually done manually, as most of the process does not lend itself to automation. But that doesn’t mean there is no room for automation in the migration process. Imagine setting up a number of nodes with new software, provisioning them with data and configuring replication between old and new environments by hand. This takes days. Automation can be very useful when setting up a new environment and provisioning it with data. In this blog post, we will take a look at a very simple migration - from standalone Percona Server 5.7 to a 3-node Percona XtraDB Cluster 5.7. We will use Ansible to accomplish that.

Environment Description

First of all, one important disclaimer - what we are going to show here is only a draft of what you might like to run in production. It does work in our test environment, but it may require modifications to make it suitable for your environment. In our tests we used four Ubuntu 16.04 VMs deployed using Vagrant. One contains a standalone Percona Server 5.7; the remaining three will be used for Percona XtraDB Cluster nodes. We also use a separate node for running the Ansible playbooks, although this is not a requirement and the playbook can also be executed from one of the nodes. In addition, SSH connectivity is available between all of the nodes. You have to have connectivity from the host where you run Ansible, but having the ability to SSH between nodes is useful (especially between master and new slave - we rely on this in the playbook).

Playbook Structure

Ansible playbooks typically share common structure - you create roles, which can be assigned to different hosts. Each role will contain tasks to be executed on it, templates that will be used, files that will be uploaded, variables which are defined for this particular playbook. In our case, the playbook is very simple.

.
├── inventory
├── playbook.yml
├── roles
│   ├── first_node
│   │   ├── my.cnf.j2
│   │   ├── tasks
│   │   │   └── main.yml
│   │   └── templates
│   │       └── my.cnf.j2
│   ├── galera
│   │   ├── tasks
│   │   │   └── main.yml
│   │   └── templates
│   │       └── my.cnf.j2
│   ├── master
│   │   └── tasks
│   │       └── main.yml
│   └── slave
│       └── tasks
│           └── main.yml
└── vars
    └── default.yml

We defined a couple of roles - we have a master role, which is intended to do some sanity checks on the standalone node. There is a slave role, which will be executed on one of the Galera nodes to configure it for replication and set up the asynchronous replication. Then we have a role for all Galera nodes and a role for the first Galera node to bootstrap the cluster from it. For the Galera roles, we have a couple of templates that we will use to create my.cnf files. We will also use a local .my.cnf to define a username and password. We have a file containing a couple of variables which we may want to customize, such as passwords. Finally, we have an inventory file, which defines the hosts on which we will run the playbook, and the playbook file with information on how exactly things should be executed. Let's take a look at the individual bits.

Inventory File

This is a very simple file.

[galera]
10.0.0.142
10.0.0.143
10.0.0.144

[first_node]
10.0.0.142

[master]
10.0.0.141

We have three groups, ‘galera’, which contains all Galera nodes, ‘first_node’, which we will use for the bootstrap and finally ‘master’, which contains our standalone Percona Server node.

Playbook.yml

The file playbook.yml contains the general guidelines on how the playbook should be executed.

-   hosts: master
    gather_facts: yes
    become: true
    pre_tasks:
    -   name: Install Python2
        raw: test -e /usr/bin/python || (apt -y update && apt install -y python-minimal)
    vars_files:
        -   vars/default.yml
    roles:
    -   { role: master }

As you can see, we start with the standalone node and we apply tasks related to the role 'master' (we will discuss this in detail further down in this post).

-   hosts: first_node
    gather_facts: yes
    become: true
    pre_tasks:
    -   name: Install Python2
        raw: test -e /usr/bin/python || (apt -y update && apt install -y python-minimal)
    vars_files:
        -   vars/default.yml
    roles:
    -   { role: first_node }
    -   { role: slave }

Second, we go to the node defined in the 'first_node' group and we apply two roles: 'first_node' and 'slave'. The former is intended to deploy a single-node PXC cluster, the latter will configure it to work as a slave and set up the replication.

-   hosts: galera
    gather_facts: yes
    become: true
    pre_tasks:
    -   name: Install Python2
        raw: test -e /usr/bin/python || (apt -y update && apt install -y python-minimal)
    vars_files:
        -   vars/default.yml
    roles:
    -   { role: galera }

Finally, we go through all Galera nodes and apply ‘galera’ role on all of them.


Variables

Before we begin to look into roles, we want to mention default variables that we defined for this playbook.

sst_user: "sstuser"
sst_password: "pa55w0rd"
root_password: "pass"
repl_user: "repl_user"
repl_password: "repl1cati0n"

As we stated, this is a very simple playbook without many options for customization. You can configure users and passwords and that is basically it. One gotcha - please make sure that the standalone node's root password matches 'root_password' here, as otherwise the playbook would not be able to connect there (it could be extended to handle this, but we did not cover that).

This file does not carry much value but, as a rule of thumb, you want to encrypt any file which contains credentials. Obviously, this is for security reasons. Ansible comes with ansible-vault, which can be used to encrypt and decrypt files. We will not cover the details here; all you need to know is available in the documentation. In short, you can easily encrypt files using passwords and configure your environment so that the playbooks can be decrypted automatically using a password from a file or passed by hand.

Roles

In this section we will go over roles that are defined in the playbook, summarizing what they are intended to perform.

Master role

As we stated, this role is intended to run a sanity check on the configuration of the standalone MySQL. It will install required packages like percona-xtrabackup-24. It also creates replication user on the master node. A configuration is reviewed to ensure that the server_id and other replication and binary log-related settings are set. GTID is also enabled as we will rely on it for replication.

First_node role

Here, the first Galera node is installed. The Percona repository will be configured and my.cnf will be created from the template. PXC will be installed. We also run some cleanup to remove unneeded users and to create those which will be required (a root user with the password of our choosing, and the user required for SST). Finally, the cluster is bootstrapped using this node. We rely on the empty 'wsrep_cluster_address' as a way to initialize the cluster. This is why later we still execute the 'galera' role on the first node - to swap the initial my.cnf with the final one, containing 'wsrep_cluster_address' with all the members of the cluster. One thing worth remembering - when you create a root user with a password, you have to be careful not to get locked out of MySQL so that Ansible can execute the remaining steps of the playbook. One way to do that is to provide a .my.cnf with the correct user and password. Another would be to remember to always set the correct login_user and login_password in the 'mysql_user' module.
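
A minimal sketch of the latter approach, using the variables from vars/default.yml (the privilege list is just an illustration of what an SST user typically needs):

-   name: Create SST user
    mysql_user:
        login_user: root
        login_password: "{{ root_password }}"
        name: "{{ sst_user }}"
        password: "{{ sst_password }}"
        host: localhost
        priv: "*.*:RELOAD,LOCK TABLES,PROCESS,REPLICATION CLIENT"
        state: present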

Slave role

This role is all about configuring replication between the standalone node and the single-node PXC cluster. We use xtrabackup to get the data, and we also check for the executed GTID in xtrabackup_binlog_info to ensure the backup will be restored properly and that replication can be configured. We also perform a bit of configuration, making sure that the slave node can use GTID replication. There are a couple of gotchas here - it is not possible to run 'RESET MASTER' using the 'mysql_replication' module as of Ansible 2.7.10; it should be possible to do that in 2.8, whenever it comes out. We had to use the 'shell' module to run MySQL CLI commands. When rebuilding a Galera node from an external source, you have to remember to re-create any required users (at least the user used for SST). Otherwise the remaining nodes will not be able to join the cluster.
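
For reference, the workaround with the 'shell' module is as simple as something along these lines (a sketch; the password comes from the playbook variables):

-   name: Reset master (not yet supported by the mysql_replication module)
    shell: mysql -uroot -p{{ root_password }} -e "RESET MASTER"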

Galera role

Finally, this is the role in which we install PXC on the remaining two nodes. We run it on all nodes; the initial one will get the "production" my.cnf instead of its "bootstrap" version. The remaining two nodes will have PXC installed and they will get SST from the first node in the cluster.

Summary

As you can see, you can easily create a simple, reusable Ansible playbook which can be used for deploying Percona XtraDB Cluster and configuring it to be a slave of a standalone MySQL node. To be honest, for migrating a single server, this will probably have little point as doing the same manually will be faster. Still, if you expect to re-execute this process a couple of times, it will definitely make sense to automate it and make it more time efficient. As we stated at the beginning, this is by no means a production-ready playbook. It is more of a proof of concept, something you may extend to make it suitable for your environment. You can find the archive with the playbook here: http://severalnines.com/sites/default/files/ansible.tar.gz

We hope you found this blog post interesting and valuable, do not hesitate to share your thoughts.

What's New in MariaDB Cluster 10.4


In one of the previous blogs, we covered new features which are coming out in MariaDB 10.4. We mentioned there that included in this version will be a new Galera Cluster release. In this blog post we will go over the features of Galera Cluster 26.4.0 (or Galera 4), take a quick look at them, and explore how they will affect your setup when working with MariaDB Galera Cluster.

Streaming Replication

Galera Cluster is by no means a drop-in replacement for standalone MySQL. The way in which writeset certification works introduces several limitations and edge cases which may seriously limit the ability to migrate into Galera Cluster. The three most common limitations are...

  1. Problems with long transactions
  2. Problems with large transactions
  3. Problems with hot-spots in tables

What’s great to see is that Galera 4 introduces Streaming Replication, which may help in reducing these limitations. Let’s review the current state in a little more detail.

Long Running Transactions

In this case we are talking about transactions that run for a long time, which are definitely problematic in Galera. The main thing to understand is that Galera replicates transactions as writesets. Those writesets are certified on the members of the cluster, ensuring that all nodes can apply a given writeset. The problem is that locks are created on the local node only; they are not replicated across the cluster. Therefore, if your transaction takes several minutes to complete and you are writing to more than one Galera node, it becomes more and more likely over time that, on one of the remaining nodes, some transaction will modify some of the rows updated in your long-running transaction. This will cause certification to fail and the long-running transaction will have to be rolled back. In short, given that you send writes to more than one node in the cluster, the longer the transaction, the more likely it is to fail certification due to some conflict.

Hotspots

By that we mean rows which are frequently updated. Typically it's some sort of counter that's being updated over and over again. The culprit is the same as with long transactions - rows are locked only locally. Again, if you send writes to more than one node, it is likely that the same counter will be modified at the same time on more than one node, causing conflicts and making certification fail.

For both of those problems there is one solution - you can send your writes to just one node instead of distributing them across the whole cluster. You can use proxies for that - ClusterControl deploys HAProxy and ProxySQL, and both can be configured so that writes are sent to only one node. If you cannot send writes to one node only, you have to accept that you will see certification conflicts and rollbacks from time to time. In general, the application has to be able to handle rollbacks from the database - there is no way around that, but it is even more important when the application works with Galera Cluster.

Still, sending the traffic to one node is not enough to handle the third problem.

Large Transactions

What is important to keep in mind is that the writeset is sent for certification only when the transaction completes. Then, the writeset is sent to all nodes and the certification process takes place. This puts a limit on how big a single transaction can be, as Galera, when preparing the writeset, stores it in an in-memory buffer. Transactions that are too large will reduce cluster performance. Therefore, two variables have been introduced: wsrep_max_ws_rows, which limits the number of rows per transaction (although it can be set to 0 - unlimited), and, more importantly, wsrep_max_ws_size, which can be set up to 2GB. So, the largest transaction you can run with Galera Cluster is up to 2GB in size. You also have to keep in mind that certification and applying of a large transaction take time, creating "lag" - a read after write that hits a node other than the one where you initially committed the transaction will most likely return incorrect data, as the transaction is still being applied.
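
If you want to see the limits currently in effect on a given node, a quick check is enough (a sketch; both variables are dynamic and can be adjusted with SET GLOBAL if needed):

SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws%';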


Galera 4 comes with Streaming Replication, which can be used to mitigate all of those problems. The main difference is that the writeset can now be split into parts - it is no longer necessary to wait for the whole transaction to finish before data is replicated. This may make you wonder - what does certification look like in this case? In short, certification happens on the fly - each fragment is certified and all involved rows are locked on all of the nodes in the cluster. This is a serious change in how Galera works - until now locks were created locally; with streaming replication locks are created on all of the nodes. This helps in the cases we discussed above - locking rows as transaction fragments come in helps to reduce the probability that the transaction will have to be rolled back. Conflicting transactions executed locally will not be able to get the locks they need and will have to wait for the replicating transaction to complete and release the row locks.

In the case of hotspots, with streaming replication it is possible to get the locks on all of the nodes when updating the row. Other queries which want to update the same row will have to wait for the lock to be released before they will execute their changes.

Large transactions will benefit from streaming replication because it is no longer necessary to wait for the whole transaction to finish, nor are they limited by the transaction size - a large transaction will be split into fragments. It also helps to utilize the network better - instead of sending 2GB of data at once, the same 2GB of data can be split into fragments and sent over a longer period of time.

There are two configuration options for streaming replication: wsrep_trx_fragment_size, which defines how big a fragment should be (by default it is set to 0, which means that streaming replication is disabled), and wsrep_trx_fragment_unit, which defines what the fragment is measured in. By default it is bytes, but it can also be 'statements' or 'rows'. Those variables can (and should) be set at the session level, making it possible for the user to decide which particular query should be replicated using streaming replication. Setting the unit to 'statements' and the size to 1 allows you, for example, to use streaming replication for just a single query that updates a hotspot, as in the sketch below.
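A minimal sketch, following the description above - enable streaming replication for a single hotspot update and then switch it off again (the table and column names are made up for illustration):

-- Replicate the next statement fragment-by-fragment, one statement per fragment
SET SESSION wsrep_trx_fragment_unit = 'statements';
SET SESSION wsrep_trx_fragment_size = 1;
UPDATE page_hits SET counter = counter + 1 WHERE page_id = 42;  -- hypothetical hotspot row
-- Back to normal, non-streaming replication for the rest of the session
SET SESSION wsrep_trx_fragment_size = 0;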

Of course, there are drawbacks to running streaming replication, mainly due to the fact that locks are now taken on all nodes in the cluster. If you have ever watched a large transaction roll back for ages, with streaming replication such a transaction will have to roll back on all of the nodes. Obviously, the best practice is to reduce the size of a transaction as much as possible to avoid rollbacks taking hours to complete. Another drawback is that, for crash recovery reasons, the writesets created from each fragment are stored in the wsrep_schema.SR table on all nodes, which, in a way, implements a double-write buffer and increases the load on the cluster. Therefore you should carefully decide which transactions should be replicated using streaming replication and, as long as it is feasible, you should still stick to the best practice of keeping transactions small and short, or splitting a large transaction into smaller batches.

Backup Locks

Finally, MariaDB users will be able to benefit from backup locks for SST. The idea behind an SST executed using (for MariaDB) mariabackup is that the whole dataset has to be transferred on the fly, with redo logs being collected in the background. Then a global lock has to be acquired to ensure that no writes happen, and the final position of the redo log has to be collected and stored. Historically, for MariaDB, the locking part was performed using FLUSH TABLES WITH READ LOCK, which did its job, but under heavy load it was quite hard to acquire. It is also pretty heavy - not only do transactions have to wait for the lock to be released, but the data also has to be flushed to disk. Now, with MariaDB 10.4, it is possible to use the less intrusive BACKUP LOCK, which does not require data to be flushed; only commits are blocked for the duration of the lock. This should mean less intrusive SST operations, which is definitely great to hear. Everyone who has had to run their Galera Cluster in emergency mode, on one node, keeping their fingers crossed that SST would not impact cluster operations, should be more than happy to hear about this improvement.
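As a rough illustration of what the new locking looks like in MariaDB 10.4 (mariabackup drives these stages itself, so you normally never type them by hand), the BACKUP STAGE sequence only blocks commits in its final copy phase instead of taking a full read lock:

BACKUP STAGE START;         -- backup begins, regular traffic continues
BACKUP STAGE FLUSH;         -- flush tables not yet copied
BACKUP STAGE BLOCK_DDL;     -- block DDL while the remaining files are copied
BACKUP STAGE BLOCK_COMMIT;  -- block only commits, record the final redo log position
BACKUP STAGE END;           -- release the locks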

Causal Reads From the Application

Galera 4 introduces three new functions which are intended to help add support for causal reads in applications: WSREP_LAST_WRITTEN_GTID(), which returns the GTID of the last write made by the client; WSREP_LAST_SEEN_GTID(), which returns the GTID of the last write transaction observed by the client; and WSREP_SYNC_WAIT_UPTO_GTID(), which blocks the client until the GTID passed to the function has been committed on the node. Sure, you can enforce causal reads in Galera even now, but by utilizing these functions it becomes possible to implement safe read-after-write in exactly those parts of the application where it is needed, without having to change the Galera configuration.
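A minimal sketch of how an application could use these functions for read-after-write (the table, columns and placeholder GTID are made up for illustration):

-- On the connection that performed the write:
INSERT INTO orders (customer_id, total) VALUES (123, 99.90);
SELECT WSREP_LAST_WRITTEN_GTID();   -- remember the returned GTID in the application

-- On another connection, possibly against a different node, before reading:
SELECT WSREP_SYNC_WAIT_UPTO_GTID('<gtid returned above>');  -- blocks until that GTID is committed here
SELECT * FROM orders WHERE customer_id = 123;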

Upgrading to MariaDB Galera 10.4

If you would like to try Galera 4, it is available in the latest release candidate for MariaDB 10.4. As per the MariaDB documentation, at this moment there is no way to do a live upgrade of a 10.3 Galera cluster to 10.4. You have to stop the whole 10.3 cluster, upgrade it to 10.4 and then start it back up. This is a serious blocker and we hope this limitation will be removed in one of the next versions. It is of utmost importance to have the option of a live upgrade, and for that both MariaDB 10.3 and MariaDB 10.4 would have to be able to coexist in the same Galera Cluster. Another option, which may also be suitable, is to set up asynchronous replication between the old and the new Galera Cluster, roughly as sketched below.
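A hedged sketch of that second option - one node of the new 10.4 cluster replicating asynchronously from the old 10.3 cluster. The host name and credentials are hypothetical, and binary logging (with log_slave_updates) has to be enabled on the 10.3 node acting as master:

-- Run on one node of the new MariaDB 10.4 Galera cluster
CHANGE MASTER TO
  MASTER_HOST = 'old-galera-node1',   -- hypothetical node from the 10.3 cluster
  MASTER_USER = 'repl_user',          -- hypothetical replication account
  MASTER_PASSWORD = 'repl_password',
  MASTER_USE_GTID = slave_pos;
START SLAVE;
SHOW SLAVE STATUS\G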

We really hope you enjoyed this short review of the features of MariaDB 10.4 Galera Cluster, and we are looking forward to seeing streaming replication in real-life production environments. We also hope these changes will help to increase Galera adoption even further. After all, streaming replication solves many of the issues which can prevent people from migrating to Galera.
