
MySQL on Docker: Running Galera Cluster on Kubernetes


In the last couple of blogs, we covered how to run a Galera Cluster on Docker, whether on standalone Docker or on multi-host Docker Swarm with an overlay network. In this blog post, we’ll look into running Galera Cluster on Kubernetes, an orchestration tool to run containers at scale. Some parts are different, such as how the application should connect to the cluster, how Kubernetes handles failover, and how load balancing works.

Kubernetes vs Docker Swarm

Our ultimate target is to ensure Galera Cluster runs reliably in a container environment. We previously covered Docker Swarm, and it turned out that running Galera Cluster on it has a number of blockers, which prevent it from being production ready. Our journey now continues with Kubernetes, a production-grade container orchestration tool. Let’s see which level of “production-readiness” it can support when running a stateful service like Galera Cluster.

Before we move further, let us highlight some of the key differences between Kubernetes (1.6) and Docker Swarm (17.03) when running Galera Cluster on containers:

  • Kubernetes supports two health check probes - liveness and readiness. This is important when running a Galera Cluster on containers, because a live Galera container does not mean it is ready to serve and should be included in the load balancing set (think of a joiner/donor state). Docker Swarm only supports one health check probe, similar to Kubernetes’ liveness: a container is either healthy and keeps running, or unhealthy and gets rescheduled. A hedged probe sketch follows this list; read here for details.
  • Kubernetes has a UI dashboard accessible via “kubectl proxy”.
  • Docker Swarm only supports round-robin load balancing (ingress), while Kubernetes uses least connection.
  • Docker Swarm supports routing mesh to publish a service to the external network, while Kubernetes supports something similar called NodePort, as well as external load balancers (GCE GLB/AWS ELB) and external DNS names (as of v1.7).
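
To make the liveness/readiness distinction concrete, here is a hedged sketch of how the two probes could be defined for a Galera container (the credentials and the exact readiness check are placeholders - the probe scripts shipped with our image may differ):

# under the Galera container definition in the pod spec:
livenessProbe:
  exec:
    command: ["mysqladmin", "ping", "-uroot", "-proot_password"]
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - mysql -uroot -proot_password -NB -e "SHOW STATUS LIKE 'wsrep_local_state'" | grep -w 4
  initialDelaySeconds: 30
  periodSeconds: 10

A node reports wsrep_local_state 4 (Synced) only when it is safe to serve traffic, so a readiness check along these lines keeps joiners and donors out of the load balancing set, while the liveness check only verifies that mysqld responds.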

Installing Kubernetes using kubeadm

We are going to use kubeadm to install a 3-node Kubernetes cluster on CentOS 7. It consists of 1 master and 2 nodes (minions). Our physical architecture looks like this:

1. Install kubelet and Docker on all nodes:

$ ARCH=x86_64
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-${ARCH}
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
        https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
$ setenforce 0
$ yum install -y docker kubelet kubeadm kubectl kubernetes-cni
$ systemctl enable docker && systemctl start docker
$ systemctl enable kubelet && systemctl start kubelet

2. On the master, initialize the cluster, copy the configuration file, set up the Pod network using Weave and install the Kubernetes Dashboard:

$ kubeadm init
$ cp /etc/kubernetes/admin.conf $HOME/
$ export KUBECONFIG=$HOME/admin.conf
$ kubectl apply -f https://git.io/weave-kube-1.6
$ kubectl create -f https://git.io/kube-dashboard
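
The dashboard installed above can then be reached through the API server proxy mentioned earlier (assuming the default proxy port; the exact URL path may vary with the dashboard version):

$ kubectl proxy
# then browse to http://localhost:8001/ui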

3. Then, join the remaining nodes to the cluster:

$ kubeadm join --token 091d2a.e4862a6224454fd6 192.168.55.140:6443

4. Verify the nodes are ready:

$ kubectl get nodes
NAME          STATUS    AGE       VERSION
kube1.local   Ready     1h        v1.6.3
kube2.local   Ready     1h        v1.6.3
kube3.local   Ready     1h        v1.6.3

We now have a Kubernetes cluster for Galera Cluster deployment.

Galera Cluster on Kubernetes

In this example, we are going to deploy a MariaDB Galera Cluster 10.1 using a Docker image pulled from our Docker Hub repository. The YAML definition files used in this deployment can be found under the example-kubernetes directory in the Github repository.

Kubernetes supports a number of deployment controllers. To deploy a Galera Cluster, one can use:

  • ReplicaSet
  • StatefulSet

Each of them has its own pros and cons. We are going to look into each of them and see how they differ.

Prerequisites

The image that we built requires an etcd (standalone or cluster) for service discovery. Running an etcd cluster requires each etcd instance to be started with a different command, so we are going to use Pods directly instead of a Deployment, and create a service called “etcd-client” as the endpoint to the etcd Pods. The etcd-cluster.yaml definition file tells it all.

To deploy a 3-pod etcd cluster, simply run:

$ kubectl create -f etcd-cluster.yaml

Verify that the etcd cluster is ready:

$ kubectl get po,svc
NAME                        READY     STATUS    RESTARTS   AGE
po/etcd0                    1/1       Running   0          1d
po/etcd1                    1/1       Running   0          1d
po/etcd2                    1/1       Running   0          1d

NAME              CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
svc/etcd-client   10.104.244.200   <none>        2379/TCP            1d
svc/etcd0         10.100.24.171    <none>        2379/TCP,2380/TCP   1d
svc/etcd1         10.108.207.7     <none>        2379/TCP,2380/TCP   1d
svc/etcd2         10.101.9.115     <none>        2379/TCP,2380/TCP   1d
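
Optionally, you can double-check cluster health from inside one of the etcd pods (assuming etcdctl is bundled in the etcd image and the v2 API is in use):

$ kubectl exec etcd0 -- etcdctl cluster-health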

Our architecture is now looking something like this:

Using ReplicaSet

A ReplicaSet ensures that a specified number of pod “replicas” are running at any given time. However, a Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to pods along with a lot of other useful features. Therefore, it’s recommended to use Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don’t require updates at all. When you use Deployments, you don’t have to worry about managing the ReplicaSets that they create. Deployments own and manage their ReplicaSets.

In our case, we are going to use a Deployment as the workload controller, as shown in this YAML definition. We can create the Galera Cluster Deployment (and its underlying ReplicaSet) together with the Service by running the following command:

$ kubectl create -f mariadb-rs.yml

Verify that the cluster is ready by looking at the ReplicaSet (rs), pods (po) and services (svc):

$ kubectl get rs,po,svc
NAME                  DESIRED   CURRENT   READY     AGE
rs/galera-251551564   3         3         3         5h

NAME                        READY     STATUS    RESTARTS   AGE
po/etcd0                    1/1       Running   0          1d
po/etcd1                    1/1       Running   0          1d
po/etcd2                    1/1       Running   0          1d
po/galera-251551564-8c238   1/1       Running   0          5h
po/galera-251551564-swjjl   1/1       Running   1          5h
po/galera-251551564-z4sgx   1/1       Running   1          5h

NAME              CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
svc/etcd-client   10.104.244.200   <none>        2379/TCP            1d
svc/etcd0         10.100.24.171    <none>        2379/TCP,2380/TCP   1d
svc/etcd1         10.108.207.7     <none>        2379/TCP,2380/TCP   1d
svc/etcd2         10.101.9.115     <none>        2379/TCP,2380/TCP   1d
svc/galera-rs     10.107.89.109    <nodes>       3306:30000/TCP      5h
svc/kubernetes    10.96.0.1        <none>        443/TCP             1d

From the output above, we can illustrate our Pods and Service as below:

Running Galera Cluster on a ReplicaSet is similar to treating it as a stateless application. It orchestrates pod creation, deletion and updates, and can be targeted by a Horizontal Pod Autoscaler (HPA), i.e. a ReplicaSet can be auto-scaled if it meets certain thresholds or targets (CPU usage, packets-per-second, requests-per-second etc).

If one of the Kubernetes nodes goes down, new Pods will be scheduled on an available node to meet the desired number of replicas. Volumes associated with a Pod are deleted if the Pod is deleted or rescheduled. The Pod hostname is randomly generated, making it harder to track where a container belongs by simply looking at the hostname.

All this works pretty well in test and staging environments, where you can perform a full container lifecycle - deploy, scale, update and destroy - without any dependencies. Scaling up and down is straightforward, by updating the YAML file and posting it to the Kubernetes cluster, or by using the scale command:

$ kubectl scale replicaset galera-rs --replicas=5

Using StatefulSet

Known as PetSet in pre-1.6 versions, StatefulSet is the best way to deploy Galera Cluster in production, because:

  • Deleting and/or scaling down a StatefulSet will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
  • For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0 .. N-1}.
  • When Pods are being deleted, they are terminated in reverse order, from {N-1 .. 0}.
  • Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
  • Before a Pod is terminated, all of its successors must be completely shut down.

StatefulSet provides first-class support for stateful containers, with deployment and scaling guarantees. When a three-node Galera Cluster is created, three Pods will be deployed in the order db-0, db-1, db-2: db-1 will not be deployed before db-0 is “Running and Ready”, and db-2 will not be deployed until db-1 is “Running and Ready”. If db-0 fails after db-1 is “Running and Ready” but before db-2 is launched, db-2 will not be launched until db-0 has been successfully relaunched and becomes “Running and Ready”.
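
You can watch this ordered startup happen. Assuming the pods carry the app=galera-ss label used elsewhere in this deployment, the following command streams pod status changes as galera-ss-0, galera-ss-1 and galera-ss-2 come up one after another:

$ kubectl get po -l app=galera-ss -w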

We are going to use the Kubernetes implementation of persistent storage, PersistentVolume and PersistentVolumeClaim. This is to ensure data persistence if the pod gets rescheduled to another node. Even though Galera Cluster keeps an exact copy of the data on each replica, having the data persist in every pod is good for troubleshooting and recovery purposes.

To create persistent storage, we first have to create a PersistentVolume for every pod. PVs are volume plugins, like Volumes in Docker, but with a lifecycle independent of any individual pod that uses the PV. Since we are going to deploy a 3-node Galera Cluster, we need to create 3 PVs:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-galera-0
  labels:
    app: galera-ss
    podindex: "0"
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  hostPath:
    path: /data/pods/galera-0/datadir
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-galera-1
  labels:
    app: galera-ss
    podindex: "1"
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  hostPath:
    path: /data/pods/galera-1/datadir
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-galera-2
  labels:
    app: galera-ss
    podindex: "2"
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  hostPath:
    path: /data/pods/galera-2/datadir

The above definition shows that we are going to create 3 PVs, mapped to a physical path on the Kubernetes nodes, each with 10GB of storage space. We defined ReadWriteOnce, which means the volume can be mounted as read-write by a single node only. Save the above lines into mariadb-pv.yml and post it to Kubernetes:

$ kubectl create -f mariadb-pv.yml
persistentvolume "datadir-galera-0" created
persistentvolume "datadir-galera-1" created
persistentvolume "datadir-galera-2" created

Next, define the PersistentVolumeClaim resources:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mysql-datadir-galera-ss-0
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      app: galera-ss
      podindex: "0"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mysql-datadir-galera-ss-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      app: galera-ss
      podindex: "1"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mysql-datadir-galera-ss-2
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      app: galera-ss
      podindex: "2"

The above definition shows that we would like to claim the PV resources, using spec.selector.matchLabels to look for our PVs (metadata.labels.app: galera-ss) based on the respective pod index (metadata.labels.podindex). The metadata.name must follow the format “{volumeMounts.name}-{StatefulSet name}-{ordinal index}”, matching the volume name defined under spec.template.spec.containers, so Kubernetes knows which claim to mount into which pod.

Save the above lines into mariadb-pvc.yml and post it to Kubernetes:

$ kubectl create -f mariadb-pvc.yml
persistentvolumeclaim "mysql-datadir-galera-ss-0" created
persistentvolumeclaim "mysql-datadir-galera-ss-1" created
persistentvolumeclaim "mysql-datadir-galera-ss-2" created

Our persistent storage is now ready. We can then start the Galera Cluster deployment by creating a StatefulSet resource together with a headless Service resource, as shown in mariadb-ss.yml:

$ kubectl create -f mariadb-ss.yml
service "galera-ss" created
statefulset "galera-ss" created

Now, retrieve the summary of our StatefulSet deployment:

$ kubectl get statefulsets,po,pv,pvc -o wide
NAME                     DESIRED   CURRENT   AGE
statefulsets/galera-ss   3         3         1d        galera    severalnines/mariadb:10.1   app=galera-ss

NAME                        READY     STATUS    RESTARTS   AGE       IP          NODE
po/etcd0                    1/1       Running   0          7d        10.36.0.1   kube3.local
po/etcd1                    1/1       Running   0          7d        10.44.0.2   kube2.local
po/etcd2                    1/1       Running   0          7d        10.36.0.2   kube3.local
po/galera-ss-0              1/1       Running   0          1d        10.44.0.4   kube2.local
po/galera-ss-1              1/1       Running   1          1d        10.36.0.5   kube3.local
po/galera-ss-2              1/1       Running   0          1d        10.44.0.5   kube2.local

NAME                  CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                               STORAGECLASS   REASON    AGE
pv/datadir-galera-0   10Gi       RWO           Retain          Bound     default/mysql-datadir-galera-ss-0                            4d
pv/datadir-galera-1   10Gi       RWO           Retain          Bound     default/mysql-datadir-galera-ss-1                            4d
pv/datadir-galera-2   10Gi       RWO           Retain          Bound     default/mysql-datadir-galera-ss-2                            4d

NAME                            STATUS    VOLUME             CAPACITY   ACCESSMODES   STORAGECLASS   AGE
pvc/mysql-datadir-galera-ss-0   Bound     datadir-galera-0   10Gi       RWO                          4d
pvc/mysql-datadir-galera-ss-1   Bound     datadir-galera-1   10Gi       RWO                          4d
pvc/mysql-datadir-galera-ss-2   Bound     datadir-galera-2   10Gi       RWO                          4d

At this point, our Galera Cluster running on StatefulSet can be illustrated as in the following diagram:

Running on StatefulSet guarantees consistent identifiers like hostname, IP address, network ID, cluster domain, Pod DNS and storage. This allows the Pod to easily distinguish itself from others in a group of Pods. The volume will be retained on the host and will not get deleted if the Pod is deleted or rescheduled onto another node. This allows for data recovery and reduces the risk of total data loss.

On the negative side, deployment time will be N-1 times longer (N = replicas), because Kubernetes obeys the ordinal sequence when deploying, rescheduling or deleting the resources. It is also a bit of a hassle to prepare the PVs and claims before you can think about scaling your cluster. Take note that updating an existing StatefulSet is currently a manual process; you can only update spec.replicas at the moment.

Connecting to Galera Cluster Service and Pods

There are a couple of ways you can connect to the database cluster. You can connect directly to the port. In the “galera-rs” service example, we use NodePort, exposing the service on each Node’s IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You’ll be able to contact the NodePort service, from outside the cluster, by requesting {NodeIP}:{NodePort}.

Example to connect to the Galera Cluster externally:

(external)$ mysql -udb_user -ppassword -h192.168.55.141 -P30000
(external)$ mysql -udb_user -ppassword -h192.168.55.142 -P30000
(external)$ mysql -udb_user -ppassword -h192.168.55.143 -P30000

Within the Kubernetes network space, Pods can connect internally via the cluster IP or the service name, which can be retrieved using the following command:

$ kubectl get services -o wide
NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE       SELECTOR
etcd-client   10.104.244.200   <none>        2379/TCP            1d        app=etcd
etcd0         10.100.24.171    <none>        2379/TCP,2380/TCP   1d        etcd_node=etcd0
etcd1         10.108.207.7     <none>        2379/TCP,2380/TCP   1d        etcd_node=etcd1
etcd2         10.101.9.115     <none>        2379/TCP,2380/TCP   1d        etcd_node=etcd2
galera-rs     10.107.89.109    <nodes>       3306:30000/TCP      4h        app=galera-rs
galera-ss     None             <none>        3306/TCP            3m        app=galera-ss
kubernetes    10.96.0.1        <none>        443/TCP             1d        <none>

From the service list, we can see that the Galera Cluster ReplicaSet Cluster-IP is 10.107.89.109. Internally, another pod can access the database through this IP address or service name using the exposed port, 3306:

(etcd0 pod)$ mysql -udb_user -ppassword -hgalera-rs -P3306 -e 'select @@hostname'
+------------------------+
| @@hostname             |
+------------------------+
| galera-251551564-z4sgx |
+------------------------+

You can also connect to the external NodePort from inside any pod on port 30000:

(etcd0 pod)$ mysql -udb_user -ppassword -h192.168.55.143 -P30000 -e 'select @@hostname'
+------------------------+
| @@hostname             |
+------------------------+
| galera-251551564-z4sgx |
+------------------------+

Connections to the backend Pods are load balanced based on the least connection algorithm.

Summary

At this point, running Galera Cluster on Kubernetes in production looks much more promising than on Docker Swarm. As discussed in the last blog post, the concerns raised there are addressed by the way Kubernetes orchestrates containers in a StatefulSet (although it is still a beta feature in v1.6). We do hope that the suggested approach is going to help you run Galera Cluster in containers at scale in production.


What's New with ProxySQL in ClusterControl v1.4.2


ClusterControl v1.4.2 comes with a number of improvements around ProxySQL. The previous version, also known as “The ProxySQL Edition”, introduced a full integration of ProxySQL with MySQL Replication and Galera Cluster. With ClusterControl v1.4.2, we introduced some amazing features to help you run ProxySQL at scale in production. These features include running ProxySQL in high availability mode, keeping multiple instances in sync with each other, managing your existing ProxySQL instances and caching queries in one click.

High Availability with Keepalived

ProxySQL can be deployed as a distributed or centralized MySQL load balancer. Setting up a centralized ProxySQL usually requires two or more instances, coupled with a virtual IP address that provides a single endpoint for your database applications. This is a proven approach that we have been using with HAProxy. With a primary ProxySQL instance and a backup instance, your load balancer is not a single point of failure. The following diagram shows an example architecture of centralized ProxySQL instances with Keepalived:

ClusterControl allows more than two ProxySQL instances to share the same virtual IP address. From the ClusterControl web interface, you can manage them from the Keepalived deployment wizard:

Note that Keepalived is automatically configured with a single-master-multiple-backups approach, where only one ProxySQL instance (assuming one instance per host) holds the virtual IP address at a time. The rest act as hot-standby proxies, unless you explicitly connect to them via the instance IP address. This simplifies your application: it only needs to connect to a single virtual IP address, and the proxy takes care of load balancing connections across the available database nodes.
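
For reference, the core of what gets configured on each ProxySQL host is a standard VRRP instance. A minimal sketch follows (the interface name, router ID, priority and virtual IP are placeholders - ClusterControl generates the actual configuration for you):

vrrp_instance VI_PROXYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.55.100
    }
}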

Add Existing ProxySQL

If you have already installed ProxySQL manually and want to manage it using ClusterControl, use the ‘Import ProxySQL’ feature to add it into ClusterControl. Before importing the instance, make sure ClusterControl can perform passwordless SSH to the target host. Once that is done, simply provide the ProxySQL host, the listening port (6033 by default) and the ProxySQL administrator user credentials:

ClusterControl will then connect to the target host via SSH and perform some sanity checks before connecting to the ProxySQL admin interface via the local socket. You will then see it listed in the side menu of the Nodes tab, under the ProxySQL section. That’s it - simple and straightforward. By ticking ‘Import Configuration’, you can also choose to update the ProxySQL node with the configuration from an existing ProxySQL node.

Sync ProxySQL Instances

It is extremely common to deploy multiple instances of ProxySQL. One would deploy at least one primary and one standby for high availability purposes. Or the architecture might mandate multiple instances, for example one proxy instance per web server. Managing multiple ProxySQL instances can be a hassle though, as they need to be kept in sync - things like configurations, hostgroup definitions, query rules, backend/frontend users, global variables, and so on.

Consider the following architecture:

When you have a distributed ProxySQL deployment like the above, you need a way to synchronize configurations. Ideally, you would want to do this automatically and not rely on manually changing the configurations on each instance. ClusterControl v1.4.2 allows you to synchronize ProxySQL configurations between instances, or simply export and import the configuration from one instance to another.

When changing the configuration on a ProxySQL instance, you can use “Node Actions” -> “Synchronize Instances” feature to apply the configuration to another instance. The following configuration will be synced over:

  • Query rules
  • Hostgroups and servers
  • Users (backend and frontend)
  • Global variables
  • Scheduler
  • The content of ProxySQL configuration file (proxysql.cnf)

Take note that clicking “Synchronize” overwrites the existing configuration on the target instance; ClusterControl shows a warning for that. A new job is then initiated by ClusterControl, and all changes are flushed to disk on the target node to make them persistent across restarts.

Simplified Query Cache

ProxySQL allows you to cache specific read queries, as well as specify how long they should be cached. Resultsets are cached in the native MySQL packet format. This offloads the database servers, and applications get faster access to cached data. It also reduces the need for a separate caching layer (e.g. Redis or Memcached).

You can now cache any query digested by ProxySQL with a single click. Simply hover over the query that you would like to cache and click “Cache Query”:

ClusterControl will provide a shortcut to add a new query rule to cache the selected query:

Define the Rule ID (queries are processed based on this ordering), the destination hostgroup and the TTL value in milliseconds. Click “Add Rule” to start caching the query. You can then verify from ProxySQL’s Rules page whether incoming queries match the rule by looking at the “Hits” column:
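
Under the hood, this maps to a row in ProxySQL’s mysql_query_rules table. Doing the same thing manually through the ProxySQL admin interface (port 6032 by default) would look roughly like this, where the digest, destination hostgroup and TTL values below are placeholders:

Admin> INSERT INTO mysql_query_rules (rule_id, active, digest, cache_ttl, destination_hostgroup, apply)
       VALUES (100, 1, '0x9B03B34CA4C19E5E', 5000, 10, 1);
Admin> LOAD MYSQL QUERY RULES TO RUNTIME;
Admin> SAVE MYSQL QUERY RULES TO DISK;
Admin> SELECT rule_id, hits FROM stats_mysql_query_rules WHERE rule_id = 100;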

That’s it for now. You are welcome to install ClusterControl and try out the ProxySQL features. Happy proxying!

MySQL on Docker: Running Galera Cluster in Production with ClusterControl on Kubernetes


In our “MySQL on Docker” blog series, we continue our quest to make Galera Cluster run smoothly in different container environments. One of the most important things when running a database service, whether in containers or bare-metal, is to eliminate the risk of data loss. We will see how we can leverage a promising feature in Kubernetes called StatefulSet, which orchestrates container deployment in a more predictable and controllable fashion.

In our previous blog post, we showed how one can deploy a Galera Cluster within Docker with the help of Kubernetes as orchestration tool. However, it is only about deployment. Running a database in production requires more than just deployment - we need to think about monitoring, backups, upgrades, recovery from failures, topology changes and so on. This is where ClusterControl comes into the picture, as it completes the stack and makes it production ready. In simple words, Kubernetes takes care of database deployment and scaling while ClusterControl fills in the missing components including configuration, monitoring and ongoing management.

ClusterControl on Docker

This blog post describes how ClusterControl runs in a Docker environment. The Docker image has been updated, and now comes with the standard ClusterControl packages in the latest stable branch, with additional support for container orchestration platforms like Docker Swarm and Kubernetes; we’ll describe this further below. You can also use the image to deploy a database cluster on a standalone Docker host.

Details at the Github repository or Docker Hub page.

ClusterControl on Kubernetes

The updated Docker image now supports automatic deployment of database containers scheduled by Kubernetes. The steps are similar to the Docker Swarm implementation, where the user decides the specs of the database cluster and ClusterControl automates the actual deployment.

ClusterControl can be deployed as a ReplicaSet or a StatefulSet. Since it’s a single instance, either way works. The only significant difference is that container identification is easier with a StatefulSet, since it provides consistent identifiers like the container hostname, IP address, DNS and storage. ClusterControl also provides service discovery for new cluster deployments.

To deploy ClusterControl on Kubernetes, the following setup is recommended:

  • Use centralized persistent volumes supported by Kubernetes plugins (e.g. NFS, iSCSI) for the following paths:
    • /etc/cmon.d - ClusterControl configuration directory
    • /var/lib/mysql - ClusterControl cmon and dcps databases
  • Create 2 services for this pod:
    • One for internal communication between pods (expose port 80 and 3306)
    • One for external communication to the outside world (expose ports 80 and 443 using NodePort or LoadBalancer); a sketch of such a service follows this list
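
A hedged sketch of the external service is shown below (the service name and the app label are hypothetical - match whatever labels your ClusterControl ReplicaSet or StatefulSet uses; 30080 matches the port used later in this post, 30443 is just an example):

apiVersion: v1
kind: Service
metadata:
  name: clustercontrol-external
spec:
  type: NodePort
  selector:
    app: clustercontrol
  ports:
  - name: http
    port: 80
    nodePort: 30080
  - name: https
    port: 443
    nodePort: 30443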

In this example, we are going to use a simple NFS setup. Make sure you have an NFS server ready. For the sake of simplicity, we are going to demonstrate this deployment on a 3-host Kubernetes cluster (1 master + 2 Kubernetes nodes). For production use, please use at least 3 Kubernetes nodes to minimize the risk of losing quorum.

With that in place, we can deploy the ClusterControl as something like this:

On the NFS server (kube1.local), install NFS server and client packages and export the following paths:

  • /storage/pods/cc/cmon.d - to be mapped to /etc/cmon.d
  • /storage/pods/cc/datadir - to be mapped to /var/lib/mysql
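
For illustration, the exports on the NFS server could look like this (the client network range and export options are just examples - adjust them to your environment):

# /etc/exports on kube1.local
/storage/pods/cc/cmon.d     192.168.55.0/24(rw,sync,no_root_squash)
/storage/pods/cc/datadir    192.168.55.0/24(rw,sync,no_root_squash)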

Make sure to restart the NFS service to apply the changes. Then create the PVs and PVCs, as shown in cc-pv-pvc.yml:

$ kubectl create -f cc-pv-pvc.yml

We are now ready to start a replica of the ClusterControl pod. Send cc-rs.yml to the Kubernetes master:

$ kubectl create -f cc-rs.yml

ClusterControl is now accessible on port 30080 on any of the Kubernetes nodes, for example, http://kube1.local:30080/clustercontrol. With this approach (ReplicaSet + PV + PVC), the ClusterControl pod survives if the physical host goes down. Kubernetes will automatically schedule the pod onto another available host, and ClusterControl will be bootstrapped from the last existing dataset, which is available through NFS.

Galera Cluster on Kubernetes

If you would like to use the ClusterControl automatic deployment feature, simply send the following YAML files to the Kubernetes master:

$ kubectl create -f cc-galera-pv-pvc.yml
$ kubectl create -f cc-galera-ss.yml

Details on the definition files can be found here - cc-galera-pv-pvc.yml and cc-galera-ss.yml.

The above commands tell Kubernetes to create 3 PVs, 3 PVCs and 3 pods running as a StatefulSet, using a generic base image called “centos-ssh”. In this example, the database cluster that we are going to deploy is MariaDB 10.1. Once the containers are started, they will register themselves with the ClusterControl CMON database. ClusterControl will then pick up the containers’ hostnames and start the deployment based on the variables that have been passed.

You can check the progress directly from the ClusterControl UI. Once the deployment has finished, our architecture will look something like this:

HAProxy as Load Balancer

Kubernetes comes with an internal load balancing capability via the Service component when distributing traffic to the backend pods. This is good enough if the balancing (least connections) fits your workload. In some cases, where your application needs to send queries to a single master due to deadlocks or strict read-after-write semantics, you have to create another Kubernetes service with a proper selector to redirect incoming connections to one and only one pod. If that single pod goes down, there is a chance of service interruption while Kubernetes reschedules it to another available node. What we are trying to highlight here is that if you want better control over what’s being sent to the backend Galera Cluster, something like HAProxy (or even ProxySQL) is pretty good at that.

You can deploy HAProxy as a two-pod ReplicaSet and use ClusterControl to deploy, configure and manage it. Simply post this YAML definition to Kubernetes:

$ kubectl create -f cc-haproxy-rs.yml

The above definition instructs Kubernetes to create a service called cc-haproxy and run two replicas of the “severalnines/centos-ssh” image without automatic deployment (AUTO_DEPLOYMENT=0). These pods will then connect to the ClusterControl pod and perform automatic passwordless SSH setup. What you need to do now is to log into the ClusterControl UI and start the deployment of HAProxy.

First, retrieve the IP addresses of the HAProxy pods:

$ kubectl describe pods -l app=cc-haproxy | grep IP
IP:        10.44.0.6
IP:        10.36.0.5

Then use the address as the HAProxy Address under ClusterControl -> choose the DB cluster -> Manage -> Load Balancer -> HAProxy -> Deploy HAProxy, as shown below:

Repeat the above step for the second HAProxy instance.

Once done, our Galera Cluster is accessible through the “cc-haproxy” service on port 3307 internally (within the Kubernetes network space) or port 30016 externally (the outside world). Connections will be load balanced between these HAProxy instances. At this point, our architecture can be illustrated as follows:
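
For example, assuming an application user has been created on the cluster (the credentials below are placeholders), connections could go through either endpoint:

(inside Kubernetes)$ mysql -uapp_user -papp_password -hcc-haproxy -P3307
(outside world)$ mysql -uapp_user -papp_password -h{NodeIP} -P30016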

With this setup, you have maximum control of your load-balanced Galera Cluster running on Docker. Kubernetes brings something good to the table by supporting stateful service orchestration.

Do give it a try. We would love to hear how you get along.

Galera Cluster: All the Severalnines Resources


Galera Cluster is a true multi-master cluster solution for MySQL and MariaDB, based on synchronous replication. Galera Cluster is easy to use and provides high availability, as well as scalability for certain workloads.

ClusterControl provides advanced deployment, management, monitoring, and scaling functionality to get your Galera clusters up-and-running using proven methodologies.

Here are just some of the great resources we’ve developed for Galera Cluster over the last few years...

Tutorials

Galera Cluster for MySQL

Galera allows applications to read and write from any MySQL Server. Galera enables synchronous replication for InnoDB, creating a true multi-master cluster of MySQL servers. It also allows for synchronous replication between data centers. Our tutorial covers MySQL Galera concepts and explains how to deploy and manage a Galera cluster.

Read the Tutorial

Deploying a Galera Cluster for MySQL on Amazon VPC

This tutorial shows you how to deploy a multi-master synchronous Galera Cluster for MySQL with Amazon's Virtual Private Cloud (Amazon VPC) service.

Read the Tutorial

Training: Galera Cluster For System Administrators, DBAs And DevOps

The course is designed for system administrators and database administrators looking to gain more in-depth expertise in the automation and management of Galera Clusters.

Book Your Seat

On-Demand Webinars

MySQL Tutorial - Backup Tips for MySQL, MariaDB & Galera Cluster

In this webinar, Krzysztof Książek, Senior Support Engineer at Severalnines, discusses backup strategies and best practices for MySQL, MariaDB and Galera clusters; including a live demo on how to do this with ClusterControl.

Watch the replay

9 DevOps Tips for Going in Production with Galera Cluster for MySQL / MariaDB

In this webinar replay, we guide you through 9 key tips to consider before taking Galera Cluster for MySQL / MariaDB into production.

Watch the replay

Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraDB Cluster

Our colleague Krzysztof Książek provided a deep-dive session on what to monitor in Galera Cluster for MySQL & MariaDB. Krzysztof is a MySQL DBA with experience in managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.

Watch the replay

Become a MySQL DBA - webinar series: Schema Changes for MySQL Replication & Galera Cluster

In this webinar, we discuss how to implement schema changes in the least impacting way to your operations and ensure availability of your database. We also cover some real-life examples and discuss how to handle them.

Watch the replay

Migrating to MySQL, MariaDB Galera and/or Percona XtraDB Cluster

In this webinar, we walk you through what you need to know in order to migrate from standalone or a master-slave MySQL / MariaDB setup to Galera Cluster.

Watch the replay

Introducing Galera 3.0

In this webinar you'll learn all about the new Galera Cluster capabilities in version 3.0.

Watch the replay

Top Blogs

MySQL on Docker: Running Galera Cluster on Kubernetes

In our previous posts, we showed how one can run Galera Cluster on Docker Swarm, and discussed some of the limitations with regards to production environments. Kubernetes is widely used as an orchestration tool, and we’ll see whether we can leverage it to achieve production-grade Galera Cluster on Docker.

Read More

ClusterControl for Galera Cluster for MySQL

Galera Cluster is widely supported by ClusterControl. With over four thousand deployments and more than sixteen thousand configurations, you can be assured that ClusterControl is more than capable of helping you manage your Galera setup.

Read More

How Galera Cluster Enables High Availability for High Traffic Websites

This post gives an insight into how Galera can help to build HA websites.

Read More

How to Set Up Asynchronous Replication from Galera Cluster to Standalone MySQL server with GTID

Hybrid replication, i.e. combining Galera and asynchronous MySQL replication in the same setup, became much easier since GTID was introduced in MySQL 5.6. In this blog post, we will show you how to replicate a Galera Cluster to a MySQL server with GTID, and how to fail over the replication in case the master node fails.

Read More

Full Restore of a MySQL or MariaDB Galera Cluster from Backup

Performing regular backups of your database cluster is imperative for high availability and disaster recovery. This blog post provides a series of best practices on how to fully restore a MySQL or MariaDB Galera Cluster from backup.

Read More

How to Bootstrap MySQL or MariaDB Galera Cluster

Unlike standard MySQL server and MySQL Cluster, the way to start a MySQL or MariaDB Galera Cluster is a bit different. Galera requires you to start a node in a cluster as a reference point, before the remaining nodes are able to join and form the cluster. This process is known as cluster bootstrap. Bootstrapping is an initial step that introduces a database node as the primary component, before the others use it as a reference point to sync up data.

Read More

Schema changes in Galera cluster for MySQL and MariaDB - how to avoid RSU locks

This post shows you how to avoid locking existing queries when performing rolling schema upgrades in Galera Cluster for MySQL and MariaDB.

Read More

Deploy an asynchronous slave to Galera Cluster for MySQL - The Easy Way

Due to its synchronous nature, Galera performance can be limited by the slowest node in the cluster. So running heavy reporting queries or making frequent backups on one node, or putting a node across a slow WAN link to a remote data center, might indirectly affect cluster performance. Combining Galera and asynchronous MySQL replication in the same setup, aka Hybrid Replication, can help.

Read More

Top Videos

ClusterControl for Galera Cluster - All Inclusive Database Management System

Watch the Video

Galera Cluster - ClusterControl Product Demonstration

Watch the Video

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

ClusterControl for Galera

ClusterControl makes it easy for those new to Galera to use the technology and deploy their first clusters. It centralizes the database management into a single interface. ClusterControl automation ensures DBAs and SysAdmins make critical changes to the cluster efficiently with minimal risks.

ClusterControl delivers on an array of features to help manage and monitor your open source database environments:

  • Deploy Database Clusters
  • Add Node, Load Balancer (HAProxy, ProxySQL) or Replication Slave
  • Backup Management
  • Configuration Management
  • Full stack monitoring (DB/LB/Host)
  • Query Monitoring
  • Enable SSL Encryption for Galera Replication
  • Node Management
  • Developer Studio with Advisors

Learn more about how ClusterControl can help you drive high availability with Galera Cluster here.

We hope that these resources prove useful!

Happy Clustering!

Galera Cluster Comparison - Codership vs Percona vs MariaDB


Galera Cluster is a synchronous multi-master replication plugin for InnoDB or XtraDB storage engine. It offers a number of outstanding features that standard MySQL replication doesn’t - read-write to any cluster node, automatic membership control, automatic node joining, parallel replication on row-level, and still keeping the native look and feel of a MySQL server. This plug-in is open-source and developed by Codership as a patch for standard MySQL. Percona and MariaDB leverage the Galera library in Percona XtraDB Cluster (PXC) and MariaDB Server (MariaDB Galera Cluster for pre 10.1) respectively.

We often get the question - which version of Galera should I use? Percona? MariaDB? Codership? This is not an easy one, since they all use the same Galera plugin that is developed by Codership. Nevertheless, let’s give it a try.

In this blog post, we’ll compare the three vendors and their Galera Cluster releases. We will be using the latest stable version of each vendor available at the time of writing - Galera Cluster for MySQL 5.7.18, Percona XtraDB Cluster 5.7.18 and MariaDB 10.2.7 where all are shipped with InnoDB storage engine 5.7.18.

Database Release

A database vendor who wishes to leverage Galera Cluster technology needs to incorporate the WriteSet Replication (wsrep) API patch into its server codebase. This allows the Galera plugin to work as a wsrep provider, communicating and replicating transactions (writesets in Galera terms) via a group communication protocol.

The following diagram illustrates the difference between the standalone MySQL server, MySQL Replication and Galera Cluster:

Codership releases the wsrep-patched version of Oracle’s MySQL. Oracle’s MySQL 5.7 has been Generally Available (GA) since October 2015. However, the first beta of the wsrep-patched MySQL 5.7 was released about a year later, around October 2016, and became GA in January 2017. It took more than a year to incorporate Galera Cluster into Oracle’s MySQL 5.7 release line.

Percona releases the wsrep-patched version of its Percona Server for MySQL called Percona XtraDB Cluster (PXC). Percona Server for MySQL comes with XtraDB storage engine (a drop-in replacement of InnoDB) and follows the upstream Oracle MySQL releases very closely (including all the bug fixes in it) with some additional features like MyRocks storage engine, TokuDB as well as Percona’s own bug fixes. In a way, you can think of it as an improved version of Oracle’s MySQL, embedded with Galera technology.

MariaDB releases the wsrep-patched version of its MariaDB Server, and Galera has been embedded since MariaDB 10.1, so you don’t have to install separate packages. In previous versions (5.5 and 10.0 in particular), the Galera variant of MariaDB was called MariaDB Galera Cluster (MGC) and shipped as separate builds. MariaDB has its own release and versioning path and does not follow an upstream the way Percona does. MariaDB Server functionality has started diverging from MySQL, so it might not be a straightforward drop-in replacement for MySQL. It still comes with a bunch of great features and performance improvements though.

System Status

Monitoring Galera nodes and the cluster relies on the wsrep API reporting several statuses, which are exposed through the SHOW STATUS statement:

mysql> SHOW STATUS LIKE 'wsrep%';
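
For a quick health check, a handful of these statuses is usually enough. On a healthy three-node cluster, any of the variants would typically report something like this (illustrative values):

mysql> SHOW STATUS WHERE Variable_name IN
    -> ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment','wsrep_ready');
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 3       |
| wsrep_cluster_status      | Primary |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+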

PXC has a number of extra statuses compared to the other variants. The following list shows the wsrep-related statuses that can only be found in PXC:

  • wsrep_flow_control_interval
  • wsrep_flow_control_interval_low
  • wsrep_flow_control_interval_high
  • wsrep_flow_control_status
  • wsrep_cert_bucket_count
  • wsrep_gcache_pool_size
  • wsrep_ist_receive_status
  • wsrep_ist_receive_seqno_start
  • wsrep_ist_receive_seqno_current
  • wsrep_ist_receive_seqno_end

MariaDB, on the other hand, only has one extra wsrep status compared to the Galera version provided by Codership:

  • wsrep_thread_count

The above does not necessarily tell us that PXC is superior to the others. It means that you can get better insights with more statuses.

Configuration Options

Since Galera is part of MariaDB 10.1 and later, you have to explicitly enable the following option in the configuration file:

wsrep_on=ON

Note that if you do not enable this option, the server will act as a standard MariaDB installation. For Codership and Percona, this option is enabled by default.
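
For context, a minimal set of options to bring up Galera on MariaDB 10.1 and later might look like the following (the node addresses, cluster name and provider path are placeholders - the library path in particular varies by distribution):

[mysqld]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://db1,db2,db3
wsrep_cluster_name=my_galera_cluster
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2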

Some Galera-related variables are NOT available across all Galera variants:

Codership’s MySQL Galera Cluster 5.7.18 (wsrep 25.12):
  • wsrep_mysql_replication_bundle
  • wsrep_preordered
  • wsrep_reject_queries

Percona XtraDB Cluster 5.7.18 (wsrep 29.20):
  • wsrep_preordered
  • wsrep_reject_queries
  • pxc_encrypt_cluster_traffic
  • pxc_maint_mode
  • pxc_maint_transition_period
  • pxc_strict_mode

MariaDB 10.2.7 (wsrep 25.19):
  • wsrep_gtid_domain_id
  • wsrep_gtid_mode
  • wsrep_mysql_replication_bundle
  • wsrep_patch_version

The above list might change once the vendors release new versions. The only point we would like to highlight here is: do not expect Galera nodes to hold the same set of configuration parameters across all variants. Some configuration variables were introduced by a vendor specifically to complement and improve the database server.

Contributions and Improvements

Database performance is not easily comparable, as it can vary a lot depending on the workload. For general workloads, replication performance is fairly similar across all variants. Under some specific workloads, it could differ.

Looking at the latest claims, Percona did an amazing job improving IST performance by up to 4x, as well as the commit operation. MariaDB also contributes a number of useful features, for example the WSREP_INFO plugin. On the other hand, Codership is focusing more on core Galera issues, including bug fixes and new features. Galera 4.0 brings features like intelligent donor selection, huge transaction support and non-blocking DDL.

The introduction of Percona XtraBackup (a.k.a. xtrabackup) as an SST method has improved SST performance significantly; the syncing process becomes faster and non-blocking for the donor. MariaDB then came up with its own xtrabackup fork called MariaDB Backup (mariabackup), which is supported as a Galera SST method via wsrep_sst_method=mariabackup. It also supports installation on Microsoft Windows.
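
To use it, you would typically set something like the following on the MariaDB nodes (the credentials are placeholders; the user must exist on the donor with sufficient backup privileges):

wsrep_sst_method=mariabackup
wsrep_sst_auth=backupuser:backuppassword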

Support

All Galera Cluster variants are open-source software and available for free. This includes the syncing software supported by Galera, like mysqldump, rsync, Percona XtraBackup and MariaDB Backup. Community users can seek support, ask questions, file bug reports, request features or even make pull requests through the vendors’ respective channels:

  • Database server public issue tracker: MySQL wsrep on Github (Codership), Percona XtraDB Cluster on Launchpad (Percona), MariaDB Server on JIRA (MariaDB)
  • Galera issue tracker: Galera on Github
  • Documentation: Galera Cluster Documentation (Codership), Percona XtraDB Cluster Documentation (Percona), MariaDB Documentation (MariaDB)
  • Support forum: Codership Team Groups (Codership), Percona Forum (Percona), MariaDB Open Questions (MariaDB)

Each vendor provides commercial support services.

Summary

We hope this comparison gives you a clearer picture and helps you determine which vendor better suits your needs. They all use pretty much the same wsrep library; the differences are mainly on the server side - for instance, if you want to leverage specific features in MariaDB or Percona Server. You might want to check out this blog that compares the different servers (Oracle MySQL, MariaDB and Percona Server). ClusterControl supports all three vendors, so you can easily deploy different clusters and compare them yourself with your own workload, on your own hardware. Do give it a try.

A How-To Guide for Galera Cluster - Updated Tutorial


Since it was originally published, more than 63,000 people (to date) have leveraged the Galera Cluster for MySQL tutorial to both learn about and get started with MySQL Galera Cluster.

Galera Cluster for MySQL is a true multi-master cluster based on synchronous replication. It is an easy-to-use, high-availability solution which provides high system uptime, no data loss and scalability for future growth.

Severalnines was a very early adopter of the Galera Cluster technology, which was created by Codership and has since expanded to include versions from Percona and MariaDB.

Included in this newly updated tutorial are topics like…

  • An introduction to Galera Cluster
  • An explanation of the differences between MySQL Replication and Galera Replication
  • Deployment of Galera Cluster
  • Accessing the Galera Cluster
  • Failure Handling
  • Management and Operations
  • FAQs and Common Questions

Check out the updated Galera Cluster for MySQL tutorial here.


Multiple Data Center Setups Using Galera Cluster for MySQL or MariaDB


Building high availability, one step at a time

When it comes to database infrastructure, we all want it. We all strive to build a highly available setup. Redundancy is the key. We start to implement redundancy at the lowest level and continue up the stack. It starts with hardware - redundant power supplies, redundant cooling, hot-swap disks. Network layer - multiple NICs bonded together and connected to different switches, which use redundant routers. For storage, we use disks set up in RAID, which gives better performance as well as redundancy. Then, at the software level, we use clustering technologies: multiple database nodes working together to implement redundancy, such as MySQL Cluster or Galera Cluster.

All of this is no good if you have everything in a single datacenter: when the datacenter goes down, or some (important) services go offline, or even if you lose connectivity to the datacenter, your service will go down - no matter the amount of redundancy at the lower levels. And yes, those things happen.

  • S3 service disruption wreaked havoc in US-East-1 region in February, 2017
  • EC2 and RDS Service Disruption in US-East region in April, 2011
  • EC2, EBS and RDS were disrupted in EU-West region in August, 2011
  • Power outage brought down Rackspace Texas DC in June, 2009
  • UPS failure caused hundreds of servers to go offline in Rackspace London DC in January, 2010

This is by no means a complete list of failures; it’s just the result of a quick Google search. These serve as examples that things may and will go wrong if you put all your eggs in the same basket. One more example would be Hurricane Sandy, which caused an enormous exodus of data from US-East to US-West DCs - at that time you could hardly spin up instances in US-West, as everyone rushed to move their infrastructure to the other coast expecting the North Virginia DC to be seriously affected by the weather.

So, multi-datacenter setups are a must if you want to build a high availability environment. In this blog post, we will discuss how to build such infrastructure using Galera Cluster for MySQL/MariaDB.

Galera concepts

Before we look into particular solutions, let us spend some time explaining two concepts which are very important in highly available, multi-DC Galera setups.

Quorum

High availability requires resources - namely, you need a number of nodes in the cluster to make it highly available. A cluster can tolerate the loss of some of its members, but only to a certain extent. Beyond a certain failure rate, you might be looking at a split-brain scenario.

Let’s take an example with a 2-node setup. If one of the nodes goes down, how can the other one know whether its peer crashed or it’s just a network failure? In the latter case, the other node might as well be up and running, serving traffic. There is no good way to handle such a case… This is why fault tolerance usually starts at three nodes. Galera uses a quorum calculation to determine whether it is safe for the cluster to handle traffic, or whether it should cease operations. After a failure, all remaining nodes attempt to connect to each other and determine how many of them are up. This is then compared to the previous state of the cluster, and as long as more than 50% of the nodes are up, the cluster can continue to operate.

This results in the following:

  • 2-node cluster - no fault tolerance
  • 3-node cluster - tolerates up to 1 crash
  • 4-node cluster - tolerates up to 1 crash (if two nodes crashed, only 50% of the cluster would be available; you need more than 50% of the nodes to survive)
  • 5-node cluster - tolerates up to 2 crashes
  • 6-node cluster - tolerates up to 2 crashes

You probably see the pattern: you want your cluster to have an odd number of nodes. In terms of high availability, there’s no point in moving from 5 to 6 nodes in the cluster; if you want better fault tolerance, you should go for 7 nodes.

Segments

Typically, in a Galera cluster, all communication follows the all-to-all pattern. Each node talks to all the other nodes in the cluster.

As you may know, each writeset in Galera has to be certified by all of the nodes in the cluster - therefore every write that happened on a node has to be transferred to all of the nodes in the cluster. This works well in a low-latency environment, but if we are talking about multi-DC setups, we need to account for much higher latency than in a local network. To make this more bearable in clusters spanning Wide Area Networks, Galera introduced segments.

They work by containing the Galera traffic within a group of nodes (segment). All nodes within a single segment act as if they were in a local network - they assume one-to-all communication. Cross-segment traffic is different: in each segment, one “relay” node is chosen, and all of the cross-segment traffic goes through those nodes. When a relay node goes down, another node is elected. This does not reduce latency by much - after all, WAN latency stays the same whether you connect to one remote host or to multiple remote hosts - but given that WAN links tend to be limited in bandwidth, and there might be a charge for the amount of data transferred, such an approach allows you to limit the amount of data exchanged between segments. Another time- and cost-saving benefit is that nodes in the same segment are prioritized when a donor is needed - again, this limits the amount of data transferred over the WAN and most likely speeds up SST, as a local network will almost always be faster than a WAN link.
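
Segment membership is assigned per node through the gmcast.segment provider option. A minimal sketch (the segment numbers are arbitrary labels, typically one per datacenter):

# my.cnf on nodes in datacenter A
wsrep_provider_options="gmcast.segment=1"
# my.cnf on nodes in datacenter B
wsrep_provider_options="gmcast.segment=2"
# my.cnf on nodes in datacenter C
wsrep_provider_options="gmcast.segment=3"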


Galera in multi-DC setups

Now that we’ve got some of these concepts out of the way, let’s look at some other important aspects of multi-DC setups for Galera cluster.

Issues you are about to face

When working in environments spanning a WAN, there are a couple of issues you need to take into consideration when designing your environment.

Quorum calculation

In the previous section, we described what the quorum calculation looks like in a Galera cluster - in short, you want an odd number of nodes to maximize survivability. All of that is still true in multi-DC setups, but a few more elements are added to the mix. First of all, you need to decide whether you want Galera to handle a datacenter failure automatically. This determines how many datacenters you are going to use. Let’s imagine two DCs: if you split your nodes 50%/50%, then when one datacenter goes down, the remaining one doesn’t have 50% + 1 of the nodes to maintain its “primary” state. If you split your nodes unevenly, with the majority of them in the “main” datacenter, then when that datacenter goes down, the “backup” DC won’t have 50% + 1 nodes to form a quorum. You can assign different weights to nodes, but the result is exactly the same - there is no way to automatically fail over between two DCs without manual intervention. To implement automated failover, you need more than two DCs - again, ideally an odd number; three datacenters is a perfectly fine setup. The next question is: how many nodes do you need? You want them evenly distributed across the datacenters. The rest is just a matter of how many failed nodes your setup has to handle.

The minimal setup uses one node per datacenter, but it has serious drawbacks. Every state transfer will require moving data across the WAN, which results in either a longer time needed to complete SST, or higher costs.

A quite typical setup is to have six nodes, two per datacenter. This setup seems unexpected, as it has an even number of nodes. But when you think about it, it might not be that big of an issue: it’s quite unlikely that three nodes will go down at once, and such a setup will survive a crash of up to two nodes. A whole datacenter may go offline while the two remaining DCs continue operating. It also has a huge advantage over the minimal setup - when a node goes offline, there’s always a second node in the same datacenter which can serve as a donor. Most of the time, the WAN won’t be used for SST.

Of course, you can increase the number of nodes to three per datacenter, nine in total. This gives you even better survivability: up to four nodes may crash and the cluster will still survive. On the other hand, you have to keep in mind that, even with the use of segments, more nodes means higher operational overhead, and you can scale out a Galera cluster only to a certain extent.

It may happen that there’s no need for a third datacenter because, let’s say, your application is located in only two of them. The requirement for three datacenters still stands - you can’t get around it - but it is perfectly fine to use a Galera Arbitrator (garbd) in the third location instead of fully loaded database servers.

Garbd can be installed on smaller nodes, even virtual servers. It does not require powerful hardware, and it does not store any data nor apply any of the writesets. But it does see all the replication traffic, and it takes part in the quorum calculation. Thanks to it, you can deploy setups like four nodes, two per DC, plus garbd in the third one - five nodes in total, and such a cluster can tolerate up to two node failures. That means it can survive a full shutdown of one of the datacenters.
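
As a rough illustration (the cluster name and addresses below are placeholders), garbd in the third datacenter could be started along these lines:

$ garbd --group=my_galera_cluster \
        --address="gcomm://10.0.1.10:4567,10.0.2.10:4567" \
        --options="gmcast.segment=3" \
        --daemon

The --options string accepts the same provider options as the database nodes, so the arbitrator can be placed in its own segment just like a regular node.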

Which option is better for you? There is no best solution for all cases - it all depends on your infrastructure requirements. Luckily, there are different options to pick from: more or fewer nodes, a full 3 DC setup or 2 DC’s with garbd in a third location - it’s quite likely you’ll find something suitable for you.

Network latency

When working with multi-DC setups, you have to keep in mind that network latency will be significantly higher than what you’d expect from a local network environment. This may seriously reduce the performance of a Galera cluster when you compare it with a standalone MySQL instance or a MySQL replication setup. The requirement that all of the nodes have to certify a writeset means that all of the nodes have to receive it, no matter how far away they are. With asynchronous replication, there’s no need to wait before a commit. Of course, asynchronous replication has other issues and drawbacks, but latency is not the major one. The problem is especially visible when your database has hot spots - rows which are frequently updated (counters, queues, etc). Those rows cannot be updated more often than once per network round trip. For clusters spanning across the globe, this can easily mean that you won’t be able to update a single row more often than 2 - 3 times per second. If this becomes a limitation for you, it may mean that Galera cluster is not a good fit for your particular workload.

Proxy layer in multi-DC Galera cluster

It’s not enough to have a Galera cluster spanning multiple datacenters - you still need your application to access it. One of the popular methods to hide the complexity of the database layer from an application is to utilize a proxy. Proxies are used as an entry point to the databases: they track the state of the database nodes and should always direct traffic only to the nodes that are available. In this section, we’ll try to propose a proxy layer design which could be used for a multi-DC Galera cluster. We’ll use ProxySQL, which gives you quite a bit of flexibility in handling database nodes, but you can use another proxy, as long as it can track the state of Galera nodes.

Where to locate the proxies?

In short, there are two common patterns here: you can either deploy ProxySQL on separate nodes or you can deploy it on the application hosts. Let’s take a look at the pros and cons of each of these setups.

Proxy layer as a separate set of hosts

The first pattern is to build a proxy layer using separate, dedicated hosts. You can deploy ProxySQL on a couple of hosts, and use a Virtual IP and keepalived to maintain high availability. An application will use the VIP to connect to the database, and the VIP will ensure that requests are always routed to an available ProxySQL. The main issue with this setup is that you use at most one of the ProxySQL instances - the standby nodes are not used for routing traffic. This may force you to use more powerful hardware than you’d typically need. On the other hand, it is easier to maintain the setup - you will have to apply configuration changes on all of the ProxySQL nodes, but there will be just a handful of them. You can also utilize ClusterControl’s option to sync the nodes. Such a setup will have to be duplicated in every datacenter that you use.
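
A minimal keepalived sketch for such a pair of ProxySQL hosts might look like the following (the interface name, VRRP ID, priority and virtual IP are all placeholders; the standby node would use a lower priority):

vrrp_script chk_proxysql {
    script "killall -0 proxysql"
    interval 2
}

vrrp_instance VI_PROXYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        10.0.0.100
    }
    track_script {
        chk_proxysql
    }
}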

Proxy installed on application instances

Instead of having a separate set of hosts, ProxySQL can also be installed on the application hosts. The application then connects directly to ProxySQL on localhost; it can even use a Unix socket to minimize the overhead of a TCP connection. The main advantage of such a setup is that you have a large number of ProxySQL instances, and the load is evenly distributed across them. If one goes down, only that application host will be affected; the remaining nodes will continue to work. The most serious issue to face is configuration management: with a large number of ProxySQL nodes, it is crucial to come up with an automated method of keeping their configurations in sync. You could use ClusterControl, or a configuration management tool like Puppet.
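
From the application host’s point of view, the connection is then purely local. A quick way to test it by hand (the user, password and schema are placeholders; 6033 is ProxySQL’s default MySQL-facing port):

$ mysql -uapp_user -p -h127.0.0.1 -P6033 app_db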

Tuning of Galera in a WAN environment

Galera defaults are designed for a local network, so if you want to use it in a WAN environment, some tuning is required. Let’s discuss some of the basic tweaks you can make. Please keep in mind that precise tuning requires production data and traffic - you can’t just make some changes and assume they are good; you should do proper benchmarking.

Operating system configuration

Let’s start with the operating system configuration. Not all of the modifications proposed here are WAN-related, but it’s always good to remind ourselves what is a good starting point for any MySQL installation.

vm.swappiness = 1

Swappiness controls how aggressively the operating system uses swap. It should not be set to zero because, in more recent kernels, that prevents the OS from using swap at all and may cause serious performance issues.
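
To apply it at runtime and make it persistent across reboots:

$ sysctl -w vm.swappiness=1
$ echo 'vm.swappiness = 1' >> /etc/sysctl.conf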

/sys/block/*/queue/scheduler = deadline/noop

The I/O scheduler for the block device which MySQL uses should be set to either deadline or noop. The exact choice depends on your benchmarks, but both settings should deliver similar performance, better than the default scheduler, CFQ.
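
The current and available schedulers can be checked and changed at runtime; sda below is just an example device name, and the output will look something like this:

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
$ echo noop > /sys/block/sda/queue/scheduler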

For MySQL, you should consider using EXT4 or XFS, depending on the kernel (performance of those filesystems changes from one kernel version to another). Perform some benchmarks to find the better option for you.

In addition to this, you may want to look into sysctl network settings. We will not discuss them in detail (you can find documentation here), but the general idea is to increase buffers, backlogs and timeouts, to make it easier to accommodate stalls and an unstable WAN link.

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0
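
One way to apply them is to put the lines above into a file under /etc/sysctl.d/ (the file name below is arbitrary) and reload the settings:

$ vi /etc/sysctl.d/99-wan-tuning.conf   # paste the settings above
$ sysctl --system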

In addition to the OS tuning, you should consider tweaking Galera network-related settings.

evs.suspect_timeout
evs.inactive_timeout

You may want to consider changing the default values of these variables. Both timeouts govern how the cluster evicts failed nodes. The suspect timeout marks a node as suspected dead when it stops responding; it is evicted once all of the other nodes agree they cannot reach it. The inactive timeout defines a hard limit on how long a node can stay in the cluster if it’s not responding. Usually you’ll find that the default values work well, but in some cases, especially if you run your Galera cluster over a WAN (for example, between AWS regions), increasing those variables may result in more stable performance. We’d suggest setting both of them to PT1M, to make it less likely that WAN link instability will throw a node out of the cluster.

evs.send_window
evs.user_send_window

These variables, evs.send_window and evs.user_send_window, define how many packets can be sent via replication at the same time (evs.send_window) and how many of them may contain data (evs.user_send_window). For high latency connections, it may be worth increasing those values significantly (512 or 1024 for example).

evs.inactive_check_period

This variable may also need to be changed: evs.inactive_check_period is, by default, set to one second, which may be too frequent for a WAN setup. We’d suggest setting it to PT30S.

gcs.fc_factor
gcs.fc_limit

Here we want to minimize the chance that flow control will kick in, therefore we’d suggest setting gcs.fc_factor to 1 and increasing gcs.fc_limit to, for example, 260.

gcs.max_packet_size

As we are working with a WAN link, where latency is significantly higher, we want to increase the size of the packets. A good starting point would be 2097152.
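
Putting the above together, a starting point for a WAN-oriented wsrep_provider_options line could look like the sketch below. The values are only the examples discussed above, not recommendations - benchmark them against your own workload:

wsrep_provider_options="evs.suspect_timeout=PT1M; evs.inactive_timeout=PT1M; evs.inactive_check_period=PT30S; evs.send_window=512; evs.user_send_window=512; gcs.fc_factor=1; gcs.fc_limit=260; gcs.max_packet_size=2097152"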

As we mentioned earlier, it is virtually impossible to give a simple recipe for how to set these parameters as it depends on too many factors - you will have to do your own benchmarks, using data as close to your production data as possible, before you can say your system is tuned. Having said that, those settings should give you a starting point for more precise tuning.

That’s it for now. Galera works pretty well in WAN environments, so do give it a try and let us know how you get on.

ClusterControl in the Cloud - All Our Resources


While many of our customers utilize ClusterControl on-premise to automate and manage their open source databases, several are deploying ClusterControl alongside their applications in the cloud. Utilizing the cloud allows your business and applications to benefit from the cost-savings and flexibility that come with cloud computing. In addition you don’t have to worry about purchasing, maintaining and upgrading equipment.

Along the same lines, ClusterControl offers a suite of database automation and management functions to give you full control of your database infrastructure. With it you can deploy, manage, monitor and scale your databases, securely and with ease through our point-and-click interface.

As the load on your application increases, your cloud environment can be expanded to provide more computing power to handle that load. In much the same way, ClusterControl utilizes state-of-the-art database, caching, and load balancing technologies that enable you to scale out the load on your databases and spread it evenly across nodes.

These performance benefits are just some of the many reasons to leverage ClusterControl to manage your open source database instances in the cloud. From advanced monitoring to backups and automatic failover, ClusterControl is your true end-to-end database management solution.

Below you will find some of our top resources to help you get your databases up-and-running in the cloud…


AWS Marketplace

ClusterControl on the AWS Marketplace

Want to install ClusterControl directly onto your AWS EC2 instance? Check us out on the Amazon Marketplace.
(New version coming soon!)

Install Today

Top Blogs

Migrating MySQL database from Amazon RDS to DigitalOcean

This blog post describes the migration process from an EC2 instance to a DigitalOcean droplet.

Read More

MySQL in the Cloud - Online Migration from Amazon RDS to EC2 Instance (PART ONE)

RDS for MySQL is easy to get started. It's a convenient way to deploy and use MySQL, without having to worry about any operational overhead. The tradeoff though is reduced control.

Read More

MySQL in the Cloud - Online Migration from Amazon RDS to Your Own Server (PART TWO)

It's challenging to move data out of RDS for MySQL. We will show you how to do the actual migration of data to your own server, and redirect your applications to the new database without downtime.

Read More

MySQL in the Cloud - Pros and Cons of Amazon RDS

Moving your data into a public cloud service is a big decision. All the major cloud vendors offer cloud database services, with Amazon RDS for MySQL being probably the most popular. In this blog, we’ll have a close look at what it is, how it works, and compare its pros and cons.

Read More

About Cloud Lock-in and Open Source Databases

Severalnines CEO Vinay Joosery discusses key considerations to take when choosing cloud providers to host and manage mission critical data; and thus avoid cloud lock-in.

Read More

Infrastructure Automation - Deploying ClusterControl and MySQL-based systems on AWS using Ansible

This blog post has the latest updates to our ClusterControl Ansible Role. It now supports automatic deployment of MySQL-based systems (MySQL Replication, Galera Cluster, NDB Cluster).

Read More

Leveraging AWS tools to speed up management of Galera Cluster on Amazon Cloud

We previously covered basic tuning and configuration best practices for MySQL Galera Cluster on AWS. In this blog post, we’ll go over some AWS features/tools that you may find useful when managing Galera on Amazon Cloud. This won’t be a detailed how-to guide, as each tool described below would warrant its own blog post. But this should be a good overview of how you can use the AWS tools at your disposal.

Read More

5 Performance tips for running Galera Cluster for MySQL or MariaDB on AWS Cloud

Amazon Web Services is one of the most popular cloud environments. Galera Cluster is one of the most popular MySQL clustering solutions. This is exactly why you’ll see many Galera clusters running on EC2 instances. In this blog post, we’ll go over five performance tips that you need to take under consideration while deploying and running Galera Cluster on EC2.

Read More

How to change AWS instance sizes for your Galera Cluster and optimize performance

Running your database cluster on AWS is a great way to adapt to changing workloads by adding/removing instances, or by scaling up/down each instance. At Severalnines, we talk much more about scale-out than scale up, but there are cases where you might want to scale up an instance instead of scaling out.

Read More


How to Automate Galera Cluster Using the ClusterControl CLI


As sysadmins and developers, we spend a lot of our time in a terminal. So we brought ClusterControl to the terminal with our command line interface tool called s9s. s9s provides an easy interface to the ClusterControl RPC v2 API. You will find it very useful when working with large scale deployments, as the CLI will allow you to design more complex features and workflows.

This blog post showcases how to use s9s to automate the management of Galera Cluster for MySQL or MariaDB, as well as a simple master-slave replication setup.

Setup

You can find installation instructions for your particular OS in the documentation. What’s important to note is that if you happen to use the latest s9s-tools, from GitHub, there’s a slight change in the way you create a user. The following command will work fine:

s9s user --create --generate-key --controller="https://localhost:9501" dba

In general, there are two steps required if you want to configure CLI locally on the ClusterControl host. First, you need to create a user and then make some changes in the configuration file - all the steps are included in the documentation.

Deployment

Once the CLI has been configured correctly and has SSH access to your target database hosts, you can start the deployment process. At the time of writing, you can use the CLI to deploy MySQL, MariaDB and PostgreSQL clusters. Let’s start with an example of how to deploy Percona XtraDB Cluster 5.7. A single command is required to do that.

s9s cluster --create --cluster-type=galera --nodes="10.0.0.226;10.0.0.227;10.0.0.228"  --vendor=percona --provider-version=5.7 --db-admin-passwd="pass" --os-user=root --cluster-name="PXC_Cluster_57" --wait

The last option, “--wait”, means that the command will wait until the job completes, showing its progress. You can skip it if you want - in that case, the s9s command will return to the shell immediately after it registers a new job in cmon. This is perfectly fine, as cmon is the process which handles the job itself. You can always check the progress of a job separately, using:

root@vagrant:~# s9s job --list -l
--------------------------------------------------------------------------------------
Create Galera Cluster
Installing MySQL on 10.0.0.226                                           [██▊       ]
                                                                                                                                                                                                         26.09%
Created   : 2017-10-05 11:23:00    ID   : 1          Status : RUNNING
Started   : 2017-10-05 11:23:02    User : dba        Host   :
Ended     :                        Group: users
--------------------------------------------------------------------------------------
Total: 1

Let’s take a look at another example. This time we’ll create a new cluster - MySQL replication, a simple master-slave pair. Again, a single command is enough:

root@vagrant:~# s9s cluster --create --nodes="10.0.0.229?master;10.0.0.230?slave" --vendor=percona --cluster-type=mysqlreplication --provider-version=5.7 --os-user=root --wait
Create MySQL Replication Cluster
/ Job  6 FINISHED   [██████████] 100% Cluster created

We can now verify that both clusters are up and running:

root@vagrant:~# s9s cluster --list --long
ID STATE   TYPE        OWNER GROUP NAME           COMMENT
 1 STARTED galera      dba   users PXC_Cluster_57 All nodes are operational.
 2 STARTED replication dba   users cluster_2      All nodes are operational.
Total: 2

Of course, all of this is also visible via the GUI:

Now, let’s add a ProxySQL loadbalancer:

root@vagrant:~# s9s cluster --add-node --nodes="proxysql://10.0.0.226" --cluster-id=1
WARNING: admin/admin
WARNING: proxy-monitor/proxy-monitor
Job with ID 7 registered.

This time we didn’t use the ‘--wait’ option, so if we want to check the progress, we have to do it on our own. Please note that we are using the job ID which was returned by the previous command, so we’ll obtain information on this particular job only:

root@vagrant:~# s9s job --list --long --job-id=7
--------------------------------------------------------------------------------------
Add ProxySQL to Cluster
Waiting for ProxySQL                                                     [██████▋   ]
                                                                            65.00%
Created   : 2017-10-06 14:09:11    ID   : 7          Status : RUNNING
Started   : 2017-10-06 14:09:12    User : dba        Host   :
Ended     :                        Group: users
--------------------------------------------------------------------------------------
Total: 7

Scaling out

Nodes can be added to our Galera cluster via a single command:

s9s cluster --add-node --nodes 10.0.0.229 --cluster-id 1
Job with ID 8 registered.
root@vagrant:~# s9s job --list --job-id=8
ID CID STATE  OWNER GROUP CREATED  RDY  TITLE
 8   1 FAILED dba   users 14:15:52   0% Add Node to Cluster
Total: 8

Something went wrong. We can check what exactly happened:

root@vagrant:~# s9s job --log --job-id=8
addNode: Verifying job parameters.
10.0.0.229:3306: Adding host to cluster.
10.0.0.229:3306: Testing SSH to host.
10.0.0.229:3306: Installing node.
10.0.0.229:3306: Setup new node (installSoftware = true).
10.0.0.229:3306: Detected a running mysqld server. It must be uninstalled first, or you can also add it to ClusterControl.

Right, that IP is already used for our replication server. We should have used another, free IP. Let’s try that:

root@vagrant:~# s9s cluster --add-node --nodes 10.0.0.231 --cluster-id 1
Job with ID 9 registered.
root@vagrant:~# s9s job --list --job-id=9
ID CID STATE    OWNER GROUP CREATED  RDY  TITLE
 9   1 FINISHED dba   users 14:20:08 100% Add Node to Cluster
Total: 9

Managing

Let’s say we want to take a backup of our replication master. We can do that from the GUI, but sometimes we may need to integrate it with external scripts. The ClusterControl CLI is a perfect fit for such a case. Let’s check which clusters we have:

root@vagrant:~# s9s cluster --list --long
ID STATE   TYPE        OWNER GROUP NAME           COMMENT
 1 STARTED galera      dba   users PXC_Cluster_57 All nodes are operational.
 2 STARTED replication dba   users cluster_2      All nodes are operational.
Total: 2

Then, let’s check the hosts in our replication cluster, with cluster ID 2:

root@vagrant:~# s9s nodes --list --long --cluster-id=2
STAT VERSION       CID CLUSTER   HOST       PORT COMMENT
soM- 5.7.19-17-log   2 cluster_2 10.0.0.229 3306 Up and running
soS- 5.7.19-17-log   2 cluster_2 10.0.0.230 3306 Up and running
coC- 1.4.3.2145      2 cluster_2 10.0.2.15  9500 Up and running

As we can see, there are three hosts that ClusterControl knows about - two of them are MySQL hosts (10.0.0.229 and 10.0.0.230), the third one is the ClusterControl instance itself. Let’s print only the relevant MySQL hosts:

root@vagrant:~# s9s nodes --list --long --cluster-id=2 10.0.0.2*
STAT VERSION       CID CLUSTER   HOST       PORT COMMENT
soM- 5.7.19-17-log   2 cluster_2 10.0.0.229 3306 Up and running
soS- 5.7.19-17-log   2 cluster_2 10.0.0.230 3306 Up and running
Total: 3

In the “STAT” column you can see some characters. For more information, we’d suggest looking into the manual page for s9s-nodes (man s9s-nodes). Here we’ll just summarize the most important bits. The first character tells us about the type of the node: “s” means it’s a regular MySQL node, “c” - the ClusterControl controller. The second character describes the state of the node: “o” tells us it’s online. The third character is the role of the node: “M” describes a master, “S” - a slave, while “C” stands for controller. The final, fourth character tells us if the node is in maintenance mode: “-” means there’s no maintenance scheduled, otherwise we’d see “M” here. So, from this data we can see that our master is the host with IP 10.0.0.229. Let’s take a backup of it and store it on the controller.

root@vagrant:~# s9s backup --create --nodes=10.0.0.229 --cluster-id=2 --backup-method=xtrabackupfull --wait
Create Backup
| Job 12 FINISHED   [██████████] 100% Command ok

We can then verify that it indeed completed correctly. Please note the “--backup-format” option, which allows you to define which information should be printed:

root@vagrant:~# s9s backup --list --full --backup-format="Started: %B Completed: %E Method: %M Stored on: %S Size: %s %F\n" --cluster-id=2
Started: 15:29:11 Completed: 15:29:19 Method: xtrabackupfull Stored on: 10.0.0.229 Size: 543382 backup-full-2017-10-06_152911.xbstream.gz
Total 1

Monitoring

All databases have to be monitored. ClusterControl uses advisors to watch some of the metrics on both MySQL and the operating system. When a condition is met, a notification is sent. ClusterControl also provides an extensive set of graphs, both real-time and historical, for post-mortem analysis or capacity planning. Sometimes it would be great to have access to some of those metrics without having to go through the GUI. The ClusterControl CLI makes it possible through the s9s-node command. Information on how to do that can be found in the manual page of s9s-node. We’ll show some examples of what you can do with the CLI.

First of all, let’s take a look at the “--node-format” option to “s9s node” command. As you can see, there are plenty of options to print interesting content.

root@vagrant:~# s9s node --list --node-format "%N %T %R %c cores %u%% CPU utilization %fmG of free memory, %tMB/s of net TX+RX, %M\n" "10.0.0.2*"
10.0.0.226 galera none 1 cores 13.823200% CPU utilization 0.503227G of free memory, 0.061036MB/s of net TX+RX, Up and running
10.0.0.227 galera none 1 cores 13.033900% CPU utilization 0.543209G of free memory, 0.053596MB/s of net TX+RX, Up and running
10.0.0.228 galera none 1 cores 12.929100% CPU utilization 0.541988G of free memory, 0.052066MB/s of net TX+RX, Up and running
10.0.0.226 proxysql  1 cores 13.823200% CPU utilization 0.503227G of free memory, 0.061036MB/s of net TX+RX, Process 'proxysql' is running.
10.0.0.231 galera none 1 cores 13.104700% CPU utilization 0.544048G of free memory, 0.045713MB/s of net TX+RX, Up and running
10.0.0.229 mysql master 1 cores 11.107300% CPU utilization 0.575871G of free memory, 0.035830MB/s of net TX+RX, Up and running
10.0.0.230 mysql slave 1 cores 9.861590% CPU utilization 0.580315G of free memory, 0.035451MB/s of net TX+RX, Up and running

With what we’ve shown here, you can probably imagine some use cases for automation. For example, you can watch the CPU utilization of the nodes, and if it reaches some threshold, you can execute another s9s job to spin up a new node in the Galera cluster. You can also, for example, monitor memory utilization and send alerts if it passes some threshold.
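
A rough sketch of such automation, reusing the commands shown earlier, is below. The threshold, the node filter and the IP of the spare host are assumptions for illustration only:

#!/bin/bash
# Add a node to cluster 1 when the average CPU utilization of the matching nodes
# crosses a threshold. Values and addresses are placeholders.
THRESHOLD=80
AVG_CPU=$(s9s node --list --node-format "%u\n" "10.0.0.2*" \
          | awk '{ sum += $1 } END { if (NR > 0) printf "%d", sum / NR }')
if [ "$AVG_CPU" -gt "$THRESHOLD" ]; then
    s9s cluster --add-node --nodes 10.0.0.232 --cluster-id 1 --wait
fi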

The CLI can do more than that. First of all, it is possible to check the graphs from within the command line. Of course, those are not as feature-rich as graphs in the GUI, but sometimes it’s enough just to see a graph to find an unexpected pattern and decide if it is worth further investigation.

root@vagrant:~# s9s node --stat --cluster-id=1 --begin="00:00" --end="14:00" --graph=load 10.0.0.231
root@vagrant:~# s9s node --stat --cluster-id=1 --begin="00:00" --end="14:00" --graph=sqlqueries 10.0.0.231

During emergency situations, you may want to check resource utilization across the cluster. You can create a top-like output that combines data from all of the cluster nodes:

root@vagrant:~# s9s process --top --cluster-id=1
PXC_Cluster_57 - 14:38:01                                                                                                                                                               All nodes are operational.
4 hosts, 7 cores,  2.2 us,  3.1 sy, 94.7 id,  0.0 wa,  0.0 st,
GiB Mem : 2.9 total, 0.2 free, 0.9 used, 0.2 buffers, 1.6 cached
GiB Swap: 3 total, 0 used, 3 free,

PID   USER       HOST       PR  VIRT      RES    S   %CPU   %MEM COMMAND
 8331 root       10.0.2.15  20   743748    40948 S  10.28   5.40 cmon
26479 root       10.0.0.226 20   278532     6448 S   2.49   0.85 accounts-daemon
 5466 root       10.0.0.226 20    95372     7132 R   1.72   0.94 sshd
  651 root       10.0.0.227 20   278416     6184 S   1.37   0.82 accounts-daemon
  716 root       10.0.0.228 20   278304     6052 S   1.35   0.80 accounts-daemon
22447 n/a        10.0.0.226 20  2744444   148820 S   1.20  19.63 mysqld
  975 mysql      10.0.0.228 20  2733624   115212 S   1.18  15.20 mysqld
13691 n/a        10.0.0.227 20  2734104   130568 S   1.11  17.22 mysqld
22994 root       10.0.2.15  20    30400     9312 S   0.93   1.23 s9s
 9115 root       10.0.0.227 20    95368     7192 S   0.68   0.95 sshd
23768 root       10.0.0.228 20    95372     7160 S   0.67   0.94 sshd
15690 mysql      10.0.2.15  20  1102012   209056 S   0.67  27.58 mysqld
11471 root       10.0.0.226 20    95372     7392 S   0.17   0.98 sshd
22086 vagrant    10.0.2.15  20    95372     4960 S   0.17   0.65 sshd
 7282 root       10.0.0.226 20        0        0 S   0.09   0.00 kworker/u4:2
 9003 root       10.0.0.226 20        0        0 S   0.09   0.00 kworker/u4:1
 1195 root       10.0.0.227 20        0        0 S   0.09   0.00 kworker/u4:0
27240 root       10.0.0.227 20        0        0 S   0.09   0.00 kworker/1:1
 9933 root       10.0.0.227 20        0        0 S   0.09   0.00 kworker/u4:2
16181 root       10.0.0.228 20        0        0 S   0.08   0.00 kworker/u4:1
 1744 root       10.0.0.228 20        0        0 S   0.08   0.00 kworker/1:1
28506 root       10.0.0.228 20    95372     7348 S   0.08   0.97 sshd
  691 messagebus 10.0.0.228 20    42896     3872 S   0.08   0.51 dbus-daemon
11892 root       10.0.2.15  20        0        0 S   0.08   0.00 kworker/0:2
15609 root       10.0.2.15  20   403548    12908 S   0.08   1.70 apache2
  256 root       10.0.2.15  20        0        0 S   0.08   0.00 jbd2/dm-0-8
  840 root       10.0.2.15  20   316200     1308 S   0.08   0.17 VBoxService
14694 root       10.0.0.227 20    95368     7200 S   0.00   0.95 sshd
12724 n/a        10.0.0.227 20     4508     1780 S   0.00   0.23 mysqld_safe
10974 root       10.0.0.227 20    95368     7400 S   0.00   0.98 sshd
14712 root       10.0.0.227 20    95368     7384 S   0.00   0.97 sshd
16952 root       10.0.0.227 20    95368     7344 S   0.00   0.97 sshd
17025 root       10.0.0.227 20    95368     7100 S   0.00   0.94 sshd
27075 root       10.0.0.227 20        0        0 S   0.00   0.00 kworker/u4:1
27169 root       10.0.0.227 20        0        0 S   0.00   0.00 kworker/0:0
  881 root       10.0.0.227 20    37976      760 S   0.00   0.10 rpc.mountd
  100 root       10.0.0.227  0        0        0 S   0.00   0.00 deferwq
  102 root       10.0.0.227  0        0        0 S   0.00   0.00 bioset
11876 root       10.0.0.227 20     9588     2572 S   0.00   0.34 bash
11852 root       10.0.0.227 20    95368     7352 S   0.00   0.97 sshd
  104 root       10.0.0.227  0        0        0 S   0.00   0.00 kworker/1:1H

When you take a look at the top, you’ll see CPU and memory statistics aggregated across the whole cluster.

root@vagrant:~# s9s process --top --cluster-id=1
PXC_Cluster_57 - 14:38:01                                                                                                                                                               All nodes are operational.
4 hosts, 7 cores,  2.2 us,  3.1 sy, 94.7 id,  0.0 wa,  0.0 st,
GiB Mem : 2.9 total, 0.2 free, 0.9 used, 0.2 buffers, 1.6 cached
GiB Swap: 3 total, 0 used, 3 free,

Below you can find the list of processes from all of the nodes in the cluster.

PID   USER       HOST       PR  VIRT      RES    S   %CPU   %MEM COMMAND
 8331 root       10.0.2.15  20   743748    40948 S  10.28   5.40 cmon
26479 root       10.0.0.226 20   278532     6448 S   2.49   0.85 accounts-daemon
 5466 root       10.0.0.226 20    95372     7132 R   1.72   0.94 sshd
  651 root       10.0.0.227 20   278416     6184 S   1.37   0.82 accounts-daemon
  716 root       10.0.0.228 20   278304     6052 S   1.35   0.80 accounts-daemon
22447 n/a        10.0.0.226 20  2744444   148820 S   1.20  19.63 mysqld
  975 mysql      10.0.0.228 20  2733624   115212 S   1.18  15.20 mysqld
13691 n/a        10.0.0.227 20  2734104   130568 S   1.11  17.22 mysqld

This can be extremely useful if you need to figure out what’s causing the load and which node is the most affected one.

Hopefully, the CLI tool makes it easier for you to integrate ClusterControl with external scripts and infrastructure orchestration tools. We hope you’ll enjoy using this tool and if you have any feedback on how to improve it, feel free to let us know.

How to Stop or Throttle SST Operation on a Galera Cluster


State Snapshot Transfer (SST) is one of the two ways used by Galera to perform initial syncing when a node is joining a cluster, until the node is declared as synced and part of the “primary component”. Depending on the dataset size and workload, SST can be lightning fast, or an expensive operation which will bring your database service to its knees.

SST can be performed using 3 different methods:

  • mysqldump
  • rsync (or rsync_wan)
  • xtrabackup (or xtrabackup-v2, mariabackup)

Most of the time, xtrabackup-v2 and mariabackup are the preferred options. We rarely see people running on rsync or mysqldump in production clusters.

The Problem

When SST is initiated, there are several processes triggered on the joiner node, which are executed by the "mysql" user:

$ ps -fu mysql
UID         PID   PPID  C STIME TTY          TIME CMD
mysql    117814 129515  0 13:06 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role donor --address 192.168.55.173:4444/xtrabackup_sst//1 --socket /var/lib/mysql/mysql.sock --datadir
mysql    120036 117814 15 13:06 ?        00:00:06 innobackupex --no-version-check --tmpdir=/tmp/tmp.pMmzIlZJwa --user=backupuser --password=x xxxxxxxxxxxxxx --socket=/var/lib/mysql/mysql.sock --galera-inf
mysql    120037 117814 19 13:06 ?        00:00:07 socat -u stdio TCP:192.168.55.173:4444
mysql    129515      1  1 Oct27 ?        01:11:46 /usr/sbin/mysqld --wsrep_start_position=7ce0e31f-aa46-11e7-abda-56d6a5318485:4949331

While on the donor node:

mysql     43733      1 14 Oct16 ?        03:28:47 /usr/sbin/mysqld --wsrep-new-cluster --wsrep_start_position=7ce0e31f-aa46-11e7-abda-56d6a5318485:272891
mysql     87092  43733  0 14:53 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role donor --address 192.168.55.172:4444/xtrabackup_sst//1 --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 7ce0e31f-aa46-11e7-abda-56d6a5318485:2883115 --gtid-domain-id 0
mysql     88826  87092 30 14:53 ?        00:00:05 innobackupex --no-version-check --tmpdir=/tmp/tmp.LDdWzbHkkW --user=backupuser --password=x xxxxxxxxxxxxxx --socket=/var/lib/mysql/mysql.sock --galera-info --stream=xbstream /tmp/tmp.oXDumYf392
mysql     88827  87092 30 14:53 ?        00:00:05 socat -u stdio TCP:192.168.55.172:4444

SST against a large dataset (hundreds of GBytes) is no fun. Depending on the hardware, network and workload, it may take hours to complete. Server resources may be saturated during the operation. Although throttling is supported in SST (only for xtrabackup and mariabackup) using the --rlimit and --use-memory options, we are still exposed to a degraded cluster when running short of a majority of active nodes - for example, if you are unlucky enough to find yourself with only one out of three nodes running. Therefore, you are advised to perform SST during quiet hours. You can, however, avoid SST by taking some manual steps, as described in this blog post.

Stopping an SST

Stopping an SST needs to be done on both the donor and the joiner nodes. The joiner triggers SST after determining how big the gap is when comparing the local Galera seqno with the cluster’s seqno. It executes the wsrep_sst_{wsrep_sst_method} command. This will be picked up by the chosen donor, which will start streaming data out to the joiner. A donor node has no way of refusing to serve the snapshot transfer once it has been selected by Galera group communication, or by the value defined in the wsrep_sst_donor variable. Once the syncing has started and you want to revert the decision, there is no single command to stop the operation.

The basic principle when stopping an SST is to:

  • Make the joiner look dead from a Galera group communication point-of-view (shutdown, fence, block, reset, unplug cable, blacklist, etc)
  • Kill the SST processes on the donor

One would think that killing the innobackupex process (kill -9 {innobackupex PID}) on the donor would be enough, but that is not the case. If you kill the SST processes on the donor (or joiner) without fencing off the joiner, Galera can still see the joiner as active and will mark the SST process as incomplete, thus respawning a new set of processes to continue or start over again. You will be back to square one. This is the expected behaviour of the /usr/bin/wsrep_sst_{method} script: it safeguards the SST operation, which is vulnerable to timeouts (e.g., when it is long-running and resource intensive).

Let's look at an example. We have a crashed joiner node that we would like to rejoin the cluster. We would start by running the following command on the joiner:

$ systemctl start mysql # or service mysql start

A minute later, we find out that the operation is too heavy at that particular moment, and decide to postpone it until low-traffic hours. The most straightforward way to stop an xtrabackup-based SST is to simply shut down the joiner node and kill the SST-related processes on the donor node. Alternatively, you can block the incoming ports on the joiner by running the following iptables commands on the joiner:

$ iptables -A INPUT -p tcp --dport 4444 -j DROP
$ iptables -A INPUT -p tcp --dport 4567:4568 -j DROP

Then on the donor, retrieve the PID of SST processes (list out the processes owned by "mysql" user):

$ ps -u mysql
   PID TTY          TIME CMD
117814 ?        00:00:00 wsrep_sst_xtrab
120036 ?        00:00:06 innobackupex
120037 ?        00:00:07 socat
129515 ?        01:11:47 mysqld

Finally, kill them all except the mysqld process (you must be extremely careful to NOT kill the mysqld process on the donor!):

$ kill -9 117814 120036 120037

Then, on the donor MySQL error log, you should notice the following lines appearing after ~100 seconds:

2017-10-30 13:24:08 139722424837888 [Warning] WSREP: Could not find peer: 42b85e82-bd32-11e7-87ae-eff2b8dd2ea0
2017-10-30 13:24:08 139722424837888 [Warning] WSREP: 1.0 (192.168.55.172): State transfer to -1.-1 (left the group) failed: -32 (Broken pipe)

At this point, the donor should return to the “synced” state as reported by wsrep_local_state_comment, and the SST process is completely stopped. The donor is back to its operational state and is able to serve clients at full capacity.
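
To confirm, you can check the status variable directly on the donor (the output below is illustrative):

mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+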

For the cleanup process on the joiner, you can simply flush the iptables chain:

$ iptables -F

Or simply remove the rules with -D flag:

$ iptables -D INPUT -p tcp --dport 4444 -j DROP
$ iptables -D INPUT -p tcp --dport 4567:4568 -j DROP

A similar approach can be used with other SST methods like rsync, mariabackup and mysqldump.

Throttling an SST (xtrabackup method only)

Depending on how busy the donor is, it’s a good approach to throttle the SST process so it won’t impact the donor significantly. We’ve seen a number of cases where, during catastrophic failures, users were desperate to bring back a failed cluster as a single bootstrapped node, and let the rest of the members catch up later. This attempt reduces the downtime on the application side; however, it creates an additional burden on this “one-node cluster”, while the remaining members are still down or recovering.

Xtrabackup can be throttled with --throttle=<rate of IO/sec> to limit the number of IO operations, if you are afraid that it will saturate your disks, but this option is only applicable when running xtrabackup as a backup process, not as an SST operator. A similar option is available via rlimit (rate limit), and it can be combined with --use-memory to limit the RAM usage. By setting values under the [sst] directive inside the MySQL configuration file, we can ensure that the SST operation won’t put too much load on the donor, even though it may take longer to complete. On the donor node, set the following:

[sst]
rlimit=128k
inno-apply-opts="--use-memory=200M"

More details on the Percona Xtrabackup SST documentation page.

However, there is a catch. The process could be so slow that it will never catch up with the transaction logs that InnoDB is writing, so SST might never complete. Generally, this situation is very uncommon, unless you really have a very write-intensive workload or you allocate very limited resources to SST.

Conclusions

SST is critical but heavy, and could potentially be a long-running operation depending on the dataset size and the network throughput between the nodes. Regardless of the consequences, there are still ways to stop the operation, so we can execute a better recovery plan at a better time.

Several Ways to Intentionally Fail or Crash your MySQL Instances for Testing


You can take down a MySQL database in multiple ways. Some obvious ways are to shut down the host, pull out the power cable, or hard kill the mysqld process with SIGKILL to simulate unclean MySQL shutdown behaviour. But there are also more subtle ways to deliberately crash your MySQL server, and then see what kind of chain reaction it triggers. Why would you want to do this? Failure and recovery can have many corner cases, and understanding them can help reduce the element of surprise when things happen in production. Ideally, you would want to simulate failures in a controlled environment, and then design and test database failover procedures.

There are several areas in MySQL that we can tackle, depending on how you want it to fail or crash. You can corrupt the tablespace, overflow the MySQL buffers and caches, limit the resources to starve the server, and also mess around with permissions. In this blog post, we are going to show you some examples of how to crash a MySQL server in a Linux environment. Some of them would be suitable for e.g. Amazon RDS instances, where you would have no access to the underlying host.

Kill, Kill, Kill, Die, Die, Die

The easiest way to fail a MySQL server is to simply kill the process or host, and not give MySQL a chance to do a graceful shutdown. To simulate a mysqld crash, just send signal 4, 6, 7, 8 or 11 to the process:

$ kill -11 $(pidof mysqld)

When looking at the MySQL error log, you can see the following lines:

11:06:09 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
..
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

You can also use kill -9 (SIGKILL) to kill the process immediately. More details on Linux signal can be found here. Alternatively, you can use a meaner way on the hardware side like pulling off the power cable, pressing down the hard reset button or using a fencing device to STONITH.

Triggering OOM

Popular MySQL cloud offerings like Amazon RDS and Google Cloud SQL have no straightforward way to crash them. Firstly because you won't get any OS-level access to the database instance, and secondly because the provider uses a proprietary, patched MySQL server. One way is to overflow some buffers, and let the out-of-memory (OOM) killer kick out the MySQL process.

You can increase the sort buffer size to something bigger than what the RAM can handle, and fire a number of sort queries against the MySQL server. Let's create a 10 million row table using sysbench on our Amazon RDS instance, so we can build a huge sort:

$ sysbench \
--db-driver=mysql \
--oltp-table-size=10000000 \
--oltp-tables-count=1 \
--threads=1 \
--mysql-host=dbtest.cdw9q2wnb00s.ap-tokyo-1.rds.amazonaws.com \
--mysql-port=3306 \
--mysql-user=rdsroot \
--mysql-password=password \
/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua \
run

Change the sort_buffer_size to 5G (our test instance is db.t2.micro - 1GB, 1vCPU) by going to Amazon RDS Dashboard -> Parameter Groups -> Create Parameter Group -> specify the group name -> Edit Parameters -> choose "sort_buffer_size" and specify the value as 5368709120.

Apply the parameter group changes by going to Instances -> Instance Action -> Modify -> Database Options -> Database Parameter Group -> and choose our newly created parameter group. Then, reboot the RDS instance to apply the changes.
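
If you prefer the command line, roughly the same change can be made with the AWS CLI (the parameter group and instance names below are placeholders):

$ aws rds modify-db-parameter-group \
    --db-parameter-group-name my-crash-test-group \
    --parameters "ParameterName=sort_buffer_size,ParameterValue=5368709120,ApplyMethod=pending-reboot"
$ aws rds reboot-db-instance --db-instance-identifier dbtest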

Once up, verify the new value of sort_buffer_size:

MySQL [(none)]> select @@sort_buffer_size;
+--------------------+
| @@sort_buffer_size |
+--------------------+
|         5368709120 |
+--------------------+

Then fire 48 simple queries that require sorting from a client:

$ for i in {1..48}; do (mysql -urdsroot -ppassword -hdbtest.cdw9q2wnb00s.ap-tokyo-1.rds.amazonaws.com -e 'SELECT * FROM sbtest.sbtest1 ORDER BY c DESC' >/dev/null &); done

If you run the above on a standard host, you will notice the MySQL server will be terminated and you can see the following lines appear in the OS's syslog or dmesg:

[164199.868060] Out of memory: Kill process 47060 (mysqld) score 847 or sacrifice child
[164199.868109] Killed process 47060 (mysqld) total-vm:265264964kB, anon-rss:3257400kB, file-rss:0kB

With systemd, MySQL or MariaDB will be restarted automatically, as will Amazon RDS. You can see the uptime for our RDS instance reset back to 0 (under mysqladmin status), and the 'Latest restore time' value (under the RDS Dashboard) will be updated to the moment it went down.

Corrupting the Data

InnoDB has its own system tablespace to store the data dictionary, buffers and rollback segments inside a file named ibdata1. It also holds the shared tablespace data if you do not configure innodb_file_per_table (which is enabled by default in MySQL 5.6.6+). We can simply zero this file, send a write operation and flush tables to crash mysqld:

# empty ibdata1
$ cat /dev/null > /var/lib/mysql/ibdata1
# send a write
$ mysql -uroot -p -e 'CREATE TABLE sbtest.test (id INT)'
# flush tables
$ mysql -uroot -p -e 'FLUSH TABLES WITH READ LOCK; UNLOCK TABLES'

After you send a write, in the error log, you will notice:

2017-11-15T06:01:59.345316Z 0 [ERROR] InnoDB: Tried to read 16384 bytes at offset 98304, but was only able to read 0
2017-11-15T06:01:59.345332Z 0 [ERROR] InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
2017-11-15T06:01:59.345343Z 0 [ERROR] InnoDB: Cannot continue operation.

At this point, mysql will hang because it cannot perform any operation, and after the flushing, you will get "mysqld got signal 11" lines and mysqld will shut down. To clean up, you have to remove the corrupted ibdata1, as well as ib_logfile* because the redo log files cannot be used with a new system tablespace that will be generated by mysqld on the next restart. Data loss is expected.

For MyISAM tables, we can mess around with .MYD (MyISAM data file) and .MYI (MyISAM index) under the MySQL datadir. For instance, the following command replaces any occurrence of string "F" with "9" inside a file:

$ replace F 9 -- /var/lib/mysql/sbtest/sbtest1.MYD

Then, send some writes (e.g, using sysbench) to the target table and perform the flushing:

mysql> FLUSH TABLE sbtest.sbtest1;

The following should appear in the MySQL error log:

2017-11-15T06:56:15.021564Z 448 [ERROR] /usr/sbin/mysqld: Incorrect key file for table './sbtest/sbtest1.MYI'; try to repair it
2017-11-15T06:56:15.021572Z 448 [ERROR] Got an error from thread_id=448, /export/home/pb2/build/sb_0-24964902-1505318733.42/rpm/BUILD/mysql-5.7.20/mysql-5.7.20/storage/myisam/mi_update.c:227

The MyISAM table will be marked as crashed, and running a REPAIR TABLE statement is necessary to make it accessible again.

Limiting the Resources

We can also apply operating system resource limits to our mysqld process, for example the number of open file descriptors. The open_files_limit variable (default is 5000) allows mysqld to reserve file descriptors using the setrlimit() call. You can set this variable relatively small (just enough for mysqld to start up) and then send multiple queries to the MySQL server until it hits the limit.

If mysqld is running under systemd, we can set it in the systemd unit file located at /usr/lib/systemd/system/mysqld.service, and change the following value to something lower (the systemd default is 6000):

# Sets open_files_limit
LimitNOFILE = 30

Apply the changes to systemd and restart MySQL server:

$ systemctl daemon-reload
$ systemctl restart mysqld

Then, start sending new connections/queries that touch different databases and tables, so mysqld has to open multiple files. You will notice the following errors:

2017-11-16T04:43:26.179295Z 4 [ERROR] InnoDB: Operating system error number 24 in a file operation.
2017-11-16T04:43:26.179342Z 4 [ERROR] InnoDB: Error number 24 means 'Too many open files'
2017-11-16T04:43:26.179354Z 4 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2017-11-16T04:43:26.179363Z 4 [ERROR] InnoDB: File ./sbtest/sbtest9.ibd: 'open' returned OS error 124. Cannot continue operation
2017-11-16T04:43:26.179371Z 4 [ERROR] InnoDB: Cannot continue operation.
2017-11-16T04:43:26.372605Z 0 [Note] InnoDB: FTS optimize thread exiting.
2017-11-16T04:45:06.816056Z 4 [Warning] InnoDB: 3 threads created by InnoDB had not exited at shutdown!

At this point, when the limit is reached, MySQL will freeze and it will not be able to perform any operation. When trying to connect, you would see the following after a while:

$ mysql -uroot -p
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104

Messing up with Permissions

The mysqld process runs as the "mysql" user, which means all the files and directories that it needs to access are owned by the mysql user/group. By messing up the permissions and ownership, we can make the MySQL server useless:

$ chown root:root /var/lib/mysql
$ chmod 600 /var/lib/mysql

Generate some load on the server, then connect to the MySQL server and flush all tables to disk:

mysql> FLUSH TABLES WITH READ LOCK; UNLOCK TABLES;

At this moment, mysqld is still running but it's kind of useless. You can access it via a mysql client, but you can't perform any operation:

mysql> SHOW DATABASES;
ERROR 1018 (HY000): Can't read dir of '.' (errno: 13 - Permission denied)

To clean up the mess, set the correct permissions:

$ chown mysql:mysql /var/lib/mysql
$ chmod 750 /var/lib/mysql
$ systemctl restart mysqld

Lock it Down

FLUSH TABLES WITH READ LOCK (FTWRL) can be destructive under a number of conditions. For example, in a Galera cluster where all nodes are able to process writes, you can use this statement to lock down the cluster from within one of the nodes. The statement halts other queries from being processed by mysqld during the flushing until the lock is released, which is very handy for backup processes (MyISAM tables) and file system snapshots.

Although this action won't crash or bring down your database server during the locking, the consequence can be huge if the session that holds the lock does not release it. To try this, simply:

mysql> FLUSH TABLES WITH READ LOCK;
mysql> exit

Then send a bunch of new queries to mysqld until it reaches the max_connections value. Obviously, you cannot get back the same session as the previous one once you are out, so the lock will be held indefinitely. The only ways to release the lock are to kill the query from another session by a user with the SUPER privilege, kill the mysqld process itself, or perform a hard reboot.
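
From that second session, releasing the lock boils down to finding the offending thread and killing it (the thread ID below is just an example):

mysql> SHOW PROCESSLIST;
mysql> KILL 123;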

Disclaimer

This blog is written to give alternatives to sysadmins and DBAs to simulate failure scenarios with MySQL. Do not try these on your production server :-)

Zero Downtime Network Migration with MySQL Galera Cluster using Relay Node


Galera Cluster’s automatic node provisioning simplifies the complexity of scaling out a database cluster with guaranteed data consistency. SST and IST improve the usability of initial data synchronization without the need to manually back up the database and copy it to the new node. Combine this with Galera’s ability to tolerate different network setups (e.g., WAN replication), and we can migrate a database between different isolated networks with zero service disruption.

In this blog post, we are going to look into how to migrate our MySQL Galera Cluster without downtime. We will move the database from Amazon Web Service (AWS) EC2 to Google Cloud Platform (GCP) Compute Engine, with the help of a relay node. Note that we had a similar blog post in the past, but this one uses a different approach.

The following diagram simplifies our migration plan:

Old Site Preparation

Since both sites cannot communicate with each other due to security group or VPC isolation, we need to have a relay node to bridge the two sites together. This node can be located on either site, but it must be able to connect to one or more nodes on the other side on ports 3306 (MySQL), 4444 (SST), 4567 (gcomm) and 4568 (IST). Here is what we already have, and how we will scale the old site:

You can also use an existing Galera node (e.g, the third node) as the relay node, as long as it has connectivity to the other side. The downside is that the cluster capacity will be reduced to two, because one node will be used for SST and relaying the Galera replication stream between sites. Depending on the dataset size and connection between sites, this can introduce database reliability issues on the current cluster.

So, we are going to use a fourth node, to reduce the risk on the current production cluster when syncing to the other side. First, create a new instance in the AWS Dashboard with a public IP address (so it can talk to the outside world) and allow the required Galera communication ports (TCP 3306, 4444, 4567-4568).

Deploy the fourth node (relay node) on the old site. If you are using ClusterControl, you can simply use "Add Node" feature to scale the cluster out (don't forget to setup passwordless SSH from ClusterControl node to this fourth host beforehand):

Ensure the relay node is in sync with the current cluster and is able to communicate to the other side.
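
A quick way to verify that connectivity (using the relay node's public IP address from this setup as an example):

$ for port in 3306 4444 4567 4568; do nc -zv 13.229.247.149 $port; done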

From the new site, we are going to connect to the relay node since this is the only node that has connectivity to the outside world.

New Site Deployment

On the new site, we will deploy a similar setup with one ClusterControl node and three-node Galera Cluster. Both sites must use the same MySQL version. Here is our architecture on the new site:

With ClusterControl, the new cluster deployment is just a couple of clicks away and a free feature in the community edition. Go to ClusterControl -> Deploy Database Cluster -> MySQL Galera and follow the deployment wizard:

Click Deploy and monitor the progress under Activity -> Jobs -> Create Cluster. Once done, you should have the following on the dashboard:

At this point, you have two separate Galera Clusters - 4 nodes at the old site and 3 nodes at the new site.

Connecting Both Sites

On the new site (GCP), pick one node to communicate with the relay node on the old site. We are going to pick galera-gcp1 as the connector to the relay node (galera-aws4). The following diagram illustrates our bridging plan:

The important things to configure are the following parameters:

  • wsrep_sst_donor: The wsrep_node_name of the donor node. On galera-gcp1, set the donor to galera-aws4.
  • wsrep_sst_auth: SST user credentials in username:password format; this must match the old site (AWS).
  • wsrep_sst_receive_address: The IP address that will receive SST on the joiner node. On galera-gcp1, set this to the public IP address of this node.
  • wsrep_cluster_address: Galera connection string. On galera-gcp1, add the public IP address of galera-aws4.
  • wsrep_provider_options:
    • gmcast.segment: Default is 0. Set a different integer on all nodes in GCP.
  1. On the relay node (galera-aws4), retrieve the wsrep_node_name:

    $ mysql -uroot -p -e 'SELECT @@wsrep_node_name'
    Enter password:
    +-------------------+
    | @@wsrep_node_name |
    +-------------------+
    | 10.0.0.13         |
    +-------------------+
  2. On galera-gcp1's my.cnf, set wsrep_sst_donor value to the relay node's wsrep_node_name and wsrep_sst_receive_address to the public IP address of galera-gcp1:

    wsrep_sst_donor=10.0.0.13
    wsrep_sst_receive_address=35.197.136.232
  3. On all nodes on GCP, ensure the wsrep_sst_auth value is identical to the one on the old site (AWS) and change the Galera segment to 1 (so Galera knows both sites are in different networks):

    wsrep_sst_auth=backupuser:mysecretP4ssW0rd
    wsrep_provider_options="base_port=4567; gcache.size=512M; gmcast.segment=1"
  4. On galera-gcp1, set the wsrep_cluster_address to include the relay node's public IP address:

    wsrep_cluster_address=gcomm://10.148.0.2,10.148.0.3,10.148.0.4,13.229.247.149

    **Only modify wsrep_cluster_address on galera-gcp1. Don't modify this parameter on galera-gcp2 and galera-gcp3.

  5. Stop all nodes on GCP. If you are using ClusterControl, go to Cluster Actions dropdown -> Stop Cluster. You are also required to turn off automatic recovery at both cluster and node levels, so ClusterControl won't try to recover the failed nodes.

  6. Now the syncing part. Start galera-gcp1. You can see from the MySQL error log on the donor node that SST is initiated from the relay node (10.0.0.13) towards the public address of galera-gcp1 (35.197.136.232):

    2017-12-19T13:58:04.765238Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '35.197.136.232:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/m
    ysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '''' --gtid 'df23adb8-b567-11e7-8c50-a386c8cc7711:151181')
    2017-12-19T13:58:04.765468Z 5 [Note] WSREP: DONOR thread signaled with 0
            2017-12-19T13:58:15.158757Z WSREP_SST: [INFO] Streaming the backup to joiner at 35.197.136.232 4444
    2017-12-19T13:58:52.512143Z 0 [Note] WSREP: 1.0 (10.0.0.13): State transfer to 0.0 (10.148.0.2) complete.

    Take note that, at this point in time, galera-gcp1 will be flooded with the following lines:

    2017-12-19T13:32:47.111002Z 0 [Note] WSREP: (ed66842b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.0.118:4567 timed out, no messages seen in PT3S
    2017-12-19T13:32:48.111123Z 0 [Note] WSREP: (ed66842b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.0.90:4567 timed out, no messages seen in PT3S
    2017-12-19T13:32:50.611462Z 0 [Note] WSREP: (ed66842b, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.0.0.25:4567 timed out, no messages seen in PT3S

    You can safely ignore these warnings, since galera-gcp1 keeps trying to reach the remaining AWS nodes beyond the relay node.

  7. Once SST on galera-gcp1 completes, ClusterControl on the GCP site won't be able to connect to the database nodes, due to missing GRANTs (the existing GRANTs were overridden after syncing from AWS). So here is what we need to do once SST completes on galera-gcp1:

    mysql> GRANT ALL PRIVILEGES ON *.* TO cmon@'10.148.0.5' IDENTIFIED BY 'cmon' WITH GRANT OPTION;

    Once this is done, ClusterControl will correctly report the state of galera-gcp1 as highlighted below:

  8. The last part is to start the remaining galera-gcp2 and galera-gcp3, one node at a time. Go to ClusterControl -> Nodes -> pick the node -> Start Node. Once all nodes are synced, you should get 7 as the cluster size:

The cluster is now operating on both sites and scaling out is complete.

Decommissioning

Once the migration completes and all nodes are in sync, you can start switching your application over to the new cluster on GCP:

At this point, MySQL data is replicated to all nodes until decommissioning. Replication performance will only be as good as the farthest node in the cluster permits. The relay node is critical, as it broadcasts writesets to the other side. From the application standpoint, it's recommended to write to only one site at a time, which means you will have to start redirecting reads/writes away from AWS and serve them from the GCP cluster instead.

To decommission the old database nodes and move to the cluster on GCP, we have to perform a graceful shutdown (one node at a time) on AWS. It is important to shut down the nodes gracefully, since the AWS site holds the majority of nodes (4/7) in this cluster. Shutting them all down at once would cause the cluster on GCP to go into a non-Primary state, forcing it to refuse operation. Make sure the last node to shut down on the AWS side is the relay node.

Don't forget to update the following parameters on galera-gcp1 accordingly (a sketch of the resulting configuration follows this list):

  • wsrep_cluster_address - Remove the relay node public IP address.
  • wsrep_sst_donor - Comment this line. Let Galera auto pick the donor.
  • wsrep_sst_receive_address - Comment this line. Let Galera auto pick the receiving interface.
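
For illustration, reusing the example IP addresses from earlier in this post (adjust them to your own environment), the relevant my.cnf section on galera-gcp1 could end up looking like this after decommissioning:

wsrep_cluster_address=gcomm://10.148.0.2,10.148.0.3,10.148.0.4
#wsrep_sst_donor=10.0.0.13                   # commented out - let Galera auto pick the donor
#wsrep_sst_receive_address=35.197.136.232    # commented out - let Galera auto pick the receiving interface
wsrep_provider_options="base_port=4567; gcache.size=512M; gmcast.segment=1"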

Your Galera Cluster is now running on a completely new platform, hosts and network without a second of downtime to your database service during migration. How cool is that?

Updated: ClusterControl Tips & Tricks: Securing your MySQL Installation


Requires ClusterControl 1.2.11 or later. Applies to MySQL based clusters.

During the life cycle of a database installation, it is common for new user accounts to be created. It is good practice to verify every once in a while that security is up to standard. That is, there should at least not be any accounts with global access rights, or accounts without a password.
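
If you prefer to check manually what the audit looks for, a couple of simple queries against the mysql.user table will reveal wide-open and passwordless accounts (on MySQL 5.7 the password hash lives in the authentication_string column; on older versions use the password column instead):

mysql> SELECT user, host FROM mysql.user WHERE host = '%';
mysql> SELECT user, host FROM mysql.user WHERE authentication_string = '';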

Using ClusterControl, you can at any time perform a security audit.

In the User Interface go to Manage > Developer Studio. Expand the folders so that you see s9s/mysql/programs. Click on security_audit.js and then press Compile and Run.

If there are problems you will clearly see it in the messages section:

Enlarged Messages output:

Here we have accounts that can connect from any host, and accounts which do not have a password. Those accounts should not exist in a secure database installation. That is rule number one. To correct this problem, click on mysql_secure_installation.js in the s9s/mysql/programs folder.

Click on the dropdown arrow next to Compile and Run and press Change Settings. You will see the following dialog and enter the argument “STRICT”:

Then press Execute. The mysql_secure_installation.js script will then do the following on each MySQL database instance that is part of the cluster:

  1. Delete anonymous users.
  2. Drop the 'test' database (if it exists).
  3. If STRICT is given as an argument to mysql_secure_installation.js, it will also:
    • Remove accounts without passwords.

In the Message box you will see:

The MySQL database servers that are part of this cluster have now been secured, and you have reduced the risk of compromising your data.

You can re-run security_audit.js to verify that the actions have had effect.

Happy Clustering!

PS.: To get started with ClusterControl, click here!

How to Secure Galera Cluster - 8 Tips


As a distributed database system, Galera Cluster requires additional security measures compared to a centralized database. Data is distributed across multiple servers, or perhaps even datacenters. With significant data communication happening across nodes, there can be significant exposure if the appropriate security measures are not taken.

In this blog post, we are going to look into some tips on how to secure our Galera Cluster. Note that this blog builds upon our previous blog post - How to Secure Your Open Source Databases with ClusterControl.

Firewall & Security Group

The following ports are very important for a Galera Cluster:

  • 3306 - MySQL
  • 4567 - Galera communication and replication
  • 4568 - Galera IST
  • 4444 - Galera SST

From the external network, it is recommended to only open access to MySQL port 3306. The other three ports can be closed to the external network and only allowed for internal access between the Galera nodes. If you are running a reverse proxy in front of the Galera nodes, for example HAProxy, you can lock down the MySQL port from public access as well. Also ensure the monitoring port for the HAProxy health check script is open; the default is port 9200 on the Galera node.

The following diagram illustrates our example setup on a three-node Galera Cluster, with an HAProxy facing the public network with its related ports:

Based on the above diagram, the iptables commands for database nodes are:

$ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 3306 -j ACCEPT
$ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 4444 -j ACCEPT
$ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 4567:4568 -j ACCEPT
$ iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 9200 -j ACCEPT

While on the load balancer:

$ iptables -A INPUT -p tcp --dport 3307 -j ACCEPT

Make sure to end your firewall rules with a deny-all, so only traffic defined in the exception rules is allowed. You can be stricter and extend the commands to follow your security policy - for example, by matching on network interface, destination address, source address, connection state and so on.
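
As a minimal sketch, assuming management access over SSH and no other services on the host (adjust to your own policy before applying), the tail of the rule set could look like this:

$ iptables -A INPUT -i lo -j ACCEPT
$ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
$ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
$ iptables -A INPUT -j DROP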

MySQL Client-Server Encryption

MySQL supports encryption between the client and the server. First we have to generate the certificates. Once configured, you can enforce encryption for specific user accounts when they connect to the MySQL server.

The steps require you to:

  1. Create a key for Certificate Authority (ca-key.pem)
  2. Generate a self-signed CA certificate (ca-cert.pem)
  3. Create a key for server certificate (server-key.pem)
  4. Generate a certificate for server and sign it with ca-key.pem (server-cert.pem)
  5. Create a key for client certificate (client-key.pem)
  6. Generate a certificate for client and sign it with ca-key.pem (client-cert.pem)

Always be careful with the CA private key (ca-key.pem) - anybody with access to it can use it to generate additional client or server certificates that will be accepted as legitimate when CA verification is enabled. The bottom line is that all the keys must be kept secret.
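
The following is a hedged openssl sketch of the six steps above (key sizes, subject fields and validity periods are only examples):

# 1 & 2 - CA key and self-signed CA certificate
$ openssl genrsa 2048 > ca-key.pem
$ openssl req -new -x509 -nodes -days 3650 -key ca-key.pem -out ca-cert.pem
# 3 & 4 - server key and CA-signed server certificate
$ openssl req -newkey rsa:2048 -days 3650 -nodes -keyout server-key.pem -out server-req.pem
$ openssl rsa -in server-key.pem -out server-key.pem
$ openssl x509 -req -in server-req.pem -days 3650 -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
# 5 & 6 - client key and CA-signed client certificate
$ openssl req -newkey rsa:2048 -days 3650 -nodes -keyout client-key.pem -out client-req.pem
$ openssl rsa -in client-key.pem -out client-key.pem
$ openssl x509 -req -in client-req.pem -days 3650 -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out client-cert.pem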

You can then add the SSL-related variables under the [mysqld] section, for example:

ssl-ca=/etc/ssl/mysql/ca-cert.pem
ssl-cert=/etc/ssl/mysql/server-cert.pem
ssl-key=/etc/ssl/mysql/server-key.pem

Restart the MySQL server to load the changes. Then create a user with the REQUIRE SSL statement, for example:

mysql> GRANT ALL PRIVILEGES ON db1.* TO 'dbuser'@'192.168.1.100' IDENTIFIED BY 'mySecr3t' REQUIRE SSL;

A user created with REQUIRE SSL will be forced to connect with the correct client SSL files (client-cert.pem, client-key.pem and ca-cert.pem).
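
To verify that a session is actually encrypted, you can check the Ssl_cipher status variable from the client side (the host name and certificate paths below are placeholders; an empty Ssl_cipher value means the session is not using TLS):

$ mysql -udbuser -p -h db1.example.com \
    --ssl-ca=/etc/ssl/mysql/ca-cert.pem \
    --ssl-cert=/etc/ssl/mysql/client-cert.pem \
    --ssl-key=/etc/ssl/mysql/client-key.pem \
    -e "SHOW SESSION STATUS LIKE 'Ssl_cipher'"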

With ClusterControl, client-server SSL encryption can easily be enabled from the UI, using the "Create SSL Encryption" feature.

Galera Encryption

Enabling encryption for Galera means IST will also be encrypted because the communication happens via the same socket. SST, on the other hand, has to be configured separately as shown in the next section. All nodes in the cluster must be enabled with SSL encryption and you cannot have a mix of nodes where some have enabled SSL encryption, and others not. The best time to configure this is when setting up a new cluster. However, if you need to add this on a running production system, you will unfortunately need to rebootstrap the cluster and there will be downtime.

All Galera nodes in the cluster must use the same key, certificate and CA (optional). You could also use the same key and certificate created for MySQL client-server encryption, or generate a new set for this purpose only. To activate encryption inside Galera, one has to append the option and value under wsrep_provider_options inside the MySQL configuration file on each Galera node. For example, consider the following existing line for our Galera node:

wsrep_provider_options = "gcache.size=512M; gmcast.segment=0;"

Append the related variables inside the quote, delimited by a semi-colon:

wsrep_provider_options = "gcache.size=512M; gmcast.segment=0; socket.ssl_cert=/etc/mysql/cert.pem; socket.ssl_key=/etc/mysql/key.pem;"

For more info on Galera's SSL-related parameters, see here. Perform this modification on all nodes. Then, stop the cluster (one node at a time) and bootstrap from the last node that shut down. You can verify whether SSL is loaded correctly by looking into the MySQL error log:

2018-01-19T01:15:30.155211Z 0 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '192.168.10.61:,192.168.10.62:,192.168.10.63:'
2018-01-19T01:15:30.159654Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.10.62:53024 local endpoint ssl://192.168.10.62:4567 cipher: AES128-SHA compression:

With ClusterControl, Galera Replication encryption can be easily enabled using the "Create SSL Galera Encryption" feature.

SST Encryption

When SST happens without encryption, the data communication is exposed while the SST process is ongoing. SST is a full data synchronization process from a donor to a joiner node. If an attacker was able to "see" the full data transmission, the person would get a complete snapshot of your database.

SST with encryption is supported only for mysqldump and xtrabackup-v2 methods. For mysqldump, the user must be granted with "REQUIRE SSL" on all nodes and the configuration is similar to standard MySQL client-server SSL encryption (as described in the previous section). Once the client-server encryption is activated, create a new SST user with SSL enforced:

mysql> GRANT ALL ON *.* TO 'sst_user'@'%' IDENTIFIED BY 'mypassword' REQUIRE SSL;

For rsync, we recommend using galera-secure-rsync, a drop-in SSL-secured rsync SST script for Galera Cluster. It operates almost exactly like wsrep_sst_rsync except that it secures the actual communications with SSL using socat. Generate the required client/server key and certificate files, copy them to all nodes and specify the "secure_rsync" as the SST method inside the MySQL configuration file to activate it:

wsrep_sst_method=secure_rsync

For xtrabackup, the following configuration options must be enabled inside the MySQL configuration file under the [sst] section:

[sst]
encrypt=4
ssl-ca=/path/to/ca-cert.pem
ssl-cert=/path/to/server-cert.pem
ssl-key=/path/to/server-key.pem

Database restart is not necessary. If this node is selected by Galera as a donor, these configuration options will be picked up automatically when Galera initiates the SST.

SELinux

Security-Enhanced Linux (SELinux) is an access control mechanism implemented in the kernel. Without SELinux, only traditional access control methods such as file permissions or ACL are used to control the file access of users.

By default, with enforcing mode enabled, everything is denied and the administrator has to add a series of exception policies for the elements the system requires in order to function. Disabling SELinux entirely has become a common (and poor) practice for many RedHat-based installations nowadays.

Depending on the workloads, usage patterns and processes, the best way is to create your own SELinux policy module tailored to your environment. What you really need to do is set SELinux to permissive mode (logging only, without enforcement), and trigger the events that can happen on a Galera node so SELinux can log them. The more extensive, the better. Example events include:

  • Starting node as donor or joiner
  • Restart node to trigger IST
  • Use different SST methods
  • Backup and restore MySQL databases using mysqldump or xtrabackup
  • Enable and disable binary logs

For example, if the Galera node is monitored by ClusterControl and the query monitor feature is enabled, ClusterControl will enable/disable the slow query log variable to capture slow running queries. You would then see the following denial in audit.log:

$ grep -e denied audit/audit.log | grep -i mysql
type=AVC msg=audit(1516835039.802:37680): avc:  denied  { open } for  pid=71222 comm="mysqld" path="/var/log/mysql/mysql-slow.log" dev="dm-0" ino=35479360 scontext=system_u:system_r:mysqld_t:s0 tcontext=unconfined_u:object_r:var_log_t:s0 tclass=file

The idea is to let all possible denials get logged into the audit log, which can later be used to generate a policy module with audit2allow before loading it into SELinux. Codership has covered this in detail in the documentation page, SELinux Configuration.
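
A hedged sketch of that workflow (the module name galera_local is arbitrary; always review the generated .te file before loading anything into SELinux):

$ setenforce 0                                             # permissive while collecting denials
$ grep mysqld /var/log/audit/audit.log | audit2allow -M galera_local
$ cat galera_local.te                                      # review the generated rules
$ semodule -i galera_local.pp                              # load the policy module
$ setenforce 1                                             # back to enforcing mode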

SST Account and Privileges

SST is an initial syncing process performed by Galera. It brings a joiner node up-to-date with the rest of the members in the cluster. The process basically exports the data from the donor node and restores it on the joiner node, before the joiner is allowed to catch up on the remaining transactions from the queue (i.e., those that happened during the syncing process). Three SST methods are supported:

  • mysqldump
  • rsync
  • xtrabackup (or xtrabackup-v2)

For mysqldump SST usage, the following privileges are required:

  • SELECT, SHOW VIEW, TRIGGER, LOCK TABLES, RELOAD, FILE

We are not going to go further into mysqldump, because it is probably not often used as an SST method in production. Besides, it is a blocking procedure on the donor. Rsync is usually the preferred second choice after xtrabackup, due to faster syncing time and being less error-prone compared to mysqldump. SST authentication is ignored with rsync, therefore you may skip configuring SST account privileges if rsync is the chosen SST method.

Moving along with xtrabackup, the following privileges are advised for standard backup and restore procedures based on the Xtrabackup documentation page:

  • CREATE, CREATE TABLESPACE, EVENT, INSERT, LOCK TABLE, PROCESS, RELOAD, REPLICATION CLIENT, SELECT, SHOW VIEW, SUPER

However for xtrabackup's SST usage, only the following privileges matter:

  • PROCESS, RELOAD, REPLICATION CLIENT

Thus, the GRANT statement for SST can be minimized as:

mysql> GRANT PROCESS,RELOAD,REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost' IDENTIFIED BY 'SuP3R@@sTr0nG%%P4ssW0rD';

Then, configure wsrep_sst_auth accordingly inside MySQL configuration file:

wsrep_sst_auth = sstuser:SuP3R@@sTr0nG%%P4ssW0rD

Only grant the SST user access from localhost and use a strong password. Avoid using the root user as the SST account, because it would expose the root password inside the configuration file under this variable. Also, changing or resetting the MySQL root password would break SST in the future.

MySQL Security Hardening

Galera Cluster is a multi-master replication plugin for InnoDB storage engine, which runs on MySQL and MariaDB forks. Therefore, standard MySQL/MariaDB/InnoDB security hardening recommendations apply to Galera Cluster as well.

This topic has been covered in numerous blog posts out there, including several of our own previous posts.

Those posts summarize the necessity of encrypting data at rest and data in transit, having audit plugins, general security guidelines, network security best practices and so on.

Use a Load Balancer

There are a number of database load balancers (reverse proxy) that can be used together with Galera - HAProxy, ProxySQL and MariaDB MaxScale to name some of them. You can set up a load balancer to control access to your Galera nodes. It is a great way of distributing the database workload between the database instances, as well as restricting access, e.g., if you want to take a node offline for maintenance, or if you want to limit the number of connections opened on the Galera nodes. The load balancer should be able to queue connections, and therefore provide some overload protection to your database servers.

ProxySQL, a powerful database reverse proxy which understands MySQL and MariaDB, can be extended with many useful security features, like a query firewall to block offending queries from reaching the database servers. The query rules engine can also be used to rewrite bad queries into something better/safer, or redirect them to another server which can absorb the load without affecting any of the Galera nodes. MariaDB MaxScale is also capable of blocking queries based on regular expressions with its Database Firewall filter.

Another advantage of having a load balancer for your Galera Cluster is the ability to host a data service without exposing the database tier to the public network. The proxy server can be used as the bastion host to gain access to the database nodes in a private network. By isolating the database cluster from the outside world, you have removed one of the important attack vectors.

That's it. Always stay secure and protected.

Updated: Become a ClusterControl DBA - SSL Key Management and Encryption of MySQL Data in Transit


Databases usually work in a secure environment. It may be a datacenter with a dedicated VLAN for database traffic. It may be a VPC in EC2. If your network spreads across multiple datacenters in different regions, you’d usually use some kind of Virtual Private Network or SSH tunneling to connect these locations in a secure manner. With data privacy and security being hot topics these days, you might feel better with an additional layer of security.

MySQL supports SSL as a means to encrypt traffic both between MySQL servers (replication) and between MySQL servers and clients. If you use Galera cluster, similar features are available - both intra-cluster communication and connections with clients can be encrypted using SSL.

A common way of implementing SSL encryption is to use self-signed certificates. Most of the time, it is not necessary to purchase an SSL certificate issued by a Certificate Authority. Anybody who's been through the process of generating a self-signed certificate will probably agree that it is not the most straightforward process - most of the time, you end up searching the internet for howtos and instructions on how to do it. This is especially true if you are a DBA and only go through this process every few months or even years. This is why we added a ClusterControl feature to help you manage SSL keys across your database cluster. In this blog post, we'll be making use of ClusterControl 1.5.1.

Key Management in ClusterControl

You can enter Key Management by going to Side Menu -> Key Management section.

You will be presented with the following screen:

You can see two certificates generated, one being a CA and the other one a regular certificate. To generate more certificates, switch to the ‘Generate Key’ tab:

A certificate can be generated in two ways - you can first create a self-signed CA and then use it to sign a certificate. Or you can go directly to the ‘Client/Server Certificates and Key’ tab and create a certificate. The required CA will be created for you in the background. Last but not least, you can import an existing certificate (for example a certificate you bought from one of many companies which sell SSL certificates).

To do that, you should upload your certificate, key and CA to your ClusterControl node and store them in /var/lib/cmon/ca directory. Then you fill in the paths to those files and the certificate will be imported.

If you decided to generate a CA or generate a new certificate, there’s another form to fill - you need to pass details about your organization, common name, email, pick the key length and expiration date.

Once you have everything in place, you can start using your new certificates. ClusterControl currently supports deployment of SSL encryption between clients and MySQL databases, and SSL encryption of intra-cluster traffic in Galera Cluster. We plan to extend the variety of supported deployments in future releases of ClusterControl.

Full SSL encryption for Galera Cluster

Now let’s assume we have our SSL keys ready and we have a Galera Cluster, which needs SSL encryption, deployed through our ClusterControl instance. We can easily secure it in two steps.

First - encrypt Galera traffic using SSL. From your cluster view, one of the cluster actions is 'Enable SSL Galera Encryption'. You’ll be presented with the following options:

If you do not have a certificate, you can generate it here. But if you already generated or imported an SSL certificate, you should be able to see it in the list and use it to encrypt Galera replication traffic. Please keep in mind that this operation requires a cluster restart - all nodes will have to stop at the same time, apply config changes and then restart. Before you proceed here, make sure you are prepared for some downtime while the cluster restarts.

Once intra-cluster traffic has been secured, we want to cover client-server connections. To do that, pick ‘Enable SSL Encryption’ job and you’ll see following dialog:

It’s pretty similar - you can either pick an existing certificate or generate new one. The main difference is that to apply client-server encryption, downtime is not required - a rolling restart will suffice. Once restarted, you will find a lock icon right under the encrypted host on the Overview page:

The label 'Galera' means Galera encryption is enabled, while 'SSL' means client-server encryption is enabled for that particular host.

Of course, enabling SSL on the database is not enough - you have to copy certificates to clients which are supposed to use SSL to connect to the database. All certificates can be found in /var/lib/cmon/ca directory on the ClusterControl node. You also have to remember to change grants for users and make sure you’ve added REQUIRE SSL to them if you want to enforce only secure connections.
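
For example, assuming an existing application account app@'10.0.0.%' (a hypothetical user shown only for illustration), enforcing encrypted connections could look like this:

mysql> ALTER USER 'app'@'10.0.0.%' REQUIRE SSL;              -- MySQL 5.7 and later
mysql> GRANT USAGE ON *.* TO 'app'@'10.0.0.%' REQUIRE SSL;   -- older MySQL/MariaDB syntax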

We hope you’ll find those options easy to use and help you secure your MySQL environment. If you have any questions or suggestions regarding this feature, we’d love to hear from you.


Comparing Oracle RAC HA Solution to Galera Cluster for MySQL or MariaDB


Businesses have continuously desired to derive insights from information in order to make reliable, smarter, real-time, fact-based decisions. As firms rely more on data and databases, information and data processing is at the core of many business operations and decisions, and the faith placed in these databases is total: none of the day-to-day company services can run without the underlying database platforms. As a consequence, the scalability and performance of database software is more critical than ever. The principal benefits of a clustered database system are scalability and high availability. In this blog, we will compare Oracle RAC and Galera Cluster in the light of these two aspects. Real Application Clusters (RAC) is Oracle's premium solution for clustering Oracle databases, and provides High Availability and Scalability. Galera Cluster is the most popular clustering technology for MySQL and MariaDB.

Architecture overview

Oracle RAC uses Oracle Clusterware software to bind multiple servers. Oracle Clusterware is a cluster management solution that is integrated with Oracle Database, but it can also be used with other services, not only the database. Oracle Clusterware is additional software installed on servers running the same operating system, which lets the servers be combined so that they operate as if they were one server.

Oracle Clusterware watches the instance and automatically restarts it if a crash occurs. If your application is well designed, you may not experience any service interruption. Only a group of sessions (those connected to the failed instance) is affected by the failure. The blackout can be efficiently masked to the end user using advanced RAC features like Fast Application Notification and the Oracle client's Fast Connection Failover. Oracle Clusterware controls node membership and prevents split brain scenarios in which two or more instances attempt to control the cluster.

Galera Cluster is a synchronous, active-active database clustering technology for MySQL and MariaDB. It differs from what is known as Oracle's MySQL Cluster (NDB). MariaDB Cluster is based on the multi-master replication plugin provided by Codership, and the Galera plugin (wsrep API) has been available for MariaDB since version 5.5. Percona XtraDB Cluster (PXC) is also based on the Galera plugin. The Galera plugin architecture stands on three core layers: certification, replication, and the group communication framework. The certification layer prepares the write-sets and performs certification checks on them, guaranteeing that they can be applied. The replication layer manages the replication protocol and provides the total ordering capability. The Group Communication Framework implements a plugin architecture which allows other systems to connect via the gcomm back-end schema.

To keep the state identical across the cluster, the wsrep API uses a Global Transaction ID (GTID): a unique identifier is created and associated with each transaction committed on a database node. In Oracle RAC, the various database instances share access to resources such as data blocks in the buffer cache and enqueues. Access to the shared resources between RAC instances needs to be coordinated to avoid conflict. To organize shared access to these resources, the distributed cache maintains information such as the data block ID, which RAC instance holds the current version of the data block, and the lock mode in which each instance holds the data block.

Data storage key concepts

Oracle RAC relies on a distributed disk architecture. The database files, control files and online redo logs need to be accessible to each node in the cluster. There are a variety of ways to configure shared storage, including directly attached disks, Storage Area Networks (SAN), Network Attached Storage (NAS) and Oracle ASM. The two most popular are OCFS and ASM. Oracle Cluster File System (OCFS) is a shared file system designed specifically for Oracle RAC. OCFS eliminates the requirement that Oracle database files be tied to logical drives and enables all nodes to share a single Oracle Home. Oracle ASM is Oracle's recommended storage management solution and provides an alternative to conventional volume managers, file systems, and raw devices. Oracle ASM provides a virtualization layer between the database and storage: it treats multiple disks as a single disk group and lets you dynamically add or remove drives while keeping databases online.

There is no need to build sophisticated shared disk storage for Galera, as each node has its full copy of data. However it is a good practice to make the storage reliable with battery-backed write caches.

Oracle RAC, Cluster storage
Galera replication, disks attached to database nodes

Cluster nodes communication and cache

Oracle Real Application Clusters has a shared cache architecture; it utilizes Oracle Grid Infrastructure to enable the sharing of server and storage resources. Communication between nodes is a critical aspect of cluster integrity. Each node must have at least two network adapters or network interface cards: one for the public network interface, and one for the interconnect. Each cluster node is connected to all other nodes via a private high-speed network, also known as the cluster interconnect.

Oracle RAC, network architecture

The private network is typically built with Gigabit Ethernet, but for high-volume environments, many vendors offer low-latency, high-bandwidth solutions designed for Oracle RAC. Linux also provides a means of bonding multiple physical NICs into a single virtual NIC to provide increased bandwidth and availability.

While the default approach to connecting Galera nodes is to use a single NIC per host, you can have more than one card. ClusterControl can assist you with such setup. The main difference is the bandwidth requirement on the interconnect. Oracle RAC ships blocks of data between instances, so it places a heavier load on the interconnect as compared to Galera write-sets (which consist of a list of operations).

With Redundant Interconnect Usage in RAC, you can identify multiple interfaces to use for the private cluster network, without the need for bonding or other technologies. This functionality is available starting with Oracle Database 11gR2. If you use the Oracle Clusterware redundant interconnect feature, then you must use IPv4 addresses for the interfaces (UDP is the default).

To manage high availability, each cluster node is assigned a virtual IP address (VIP). In the event of node failure, the failed node's IP address can be reassigned to a surviving node to allow applications to continue reaching the database through the same IP address.

A sophisticated network setup is necessary for Oracle's Cache Fusion technology, which couples the physical memory in each host into a single cache. Oracle Cache Fusion allows data stored in the cache of one Oracle instance to be accessed by any other instance by transporting it across the private network. It also protects data integrity and cache coherency by transmitting locking and supplementary synchronization information across the cluster nodes.

On top of the described network setup, you can set a single database address for your application - the Single Client Access Name (SCAN). The primary purpose of SCAN is to provide ease of connection management. For instance, you can add new nodes to the cluster without changing your client connection string. This works because Oracle automatically distributes requests based on the SCAN IPs, which point to the underlying VIPs. SCAN listeners bridge clients and the underlying local listeners, which are VIP-dependent.

For Galera Cluster, the equivalent of SCAN would be adding a database proxy in front of the Galera nodes. The proxy is a single point of contact for applications; it can blacklist failed nodes and route queries to healthy nodes. The proxy itself can be made redundant with Keepalived and a Virtual IP.


Failover and data recovery

The main difference between Oracle RAC and MySQL Galera Cluster is that Galera is a shared-nothing architecture. Instead of shared disks, Galera uses certification-based replication with group communication and transaction ordering to achieve synchronous replication. A database cluster should be able to survive the loss of a node, although this is achieved in different ways. In the case of Galera, the critical aspect is the number of nodes: Galera requires a quorum to stay operational. A three-node cluster can survive the crash of one node, and with more nodes in your cluster, your availability will grow. Oracle RAC doesn't require a quorum to stay operational after a node crash, because it relies on access to distributed storage that keeps consistent information about the cluster state. However, your data storage could be a potential point of failure in your high availability plan. While it's a reasonably straightforward task to spread Galera cluster nodes across geographically distributed data centers, it wouldn't be that easy with RAC. Oracle RAC requires additional high-end disk mirroring; however, basic RAID-like redundancy can be achieved inside an ASM diskgroup.

Disk Group Type     | Supported Mirroring Levels             | Default Mirroring Level
External redundancy | Unprotected (none)                     | Unprotected
Normal redundancy   | Two-way, three-way, unprotected (none) | Two-way
High redundancy     | Three-way                              | Three-way
Flex redundancy     | Two-way, three-way, unprotected (none) | Two-way (newly-created)
Extended redundancy | Two-way, three-way, unprotected (none) | Two-way
ASM Disk Group redundancy

Locking Schemes

In a single-user database, a user can alter data without concern for other sessions modifying the same data at the same time. However, in a multi-user, multi-node environment, this becomes trickier. A multi-user database must provide the following:

  • data concurrency - the assurance that users can access data at the same time,
  • data consistency - the assurance that each user sees a consistent view of the data.

Cluster instances require three main types of concurrency locking:

  • Data concurrency reads on different instances,
  • Data concurrency reads and writes on different instances,
  • Data concurrency writes on different instances.

Oracle lets you choose the policy for locking, either pessimistic or optimistic, depending on your requirements. To provide concurrency locking, RAC has two additional services: the Global Cache Service (GCS) and the Global Enqueue Service (GES). These two services cover the Cache Fusion process, resource transfers, and resource escalations among the instances. GES handles cache locks, dictionary locks, transaction locks and table locks. GCS maintains the block modes and block transfers between the instances.

In Galera cluster, each node has its storage and buffers. When a transaction is started, database resources local to that node are involved. At commit, the operations that are part of that transaction are broadcasted as part of a write-set, to the rest of the group. Since all nodes have the same state, the write-set will either be successful on all nodes or it will fail on all nodes.

Galera Cluster uses optimistic concurrency control at the cluster level, which can manifest itself in transactions that fail at COMMIT: the first commit wins. When such aborts occur at the cluster level, Galera Cluster returns a deadlock error. This may or may not impact your application architecture. A high number of rows to replicate in a single transaction would impact node responses, although there are techniques to avoid such behavior.
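
To illustrate the first-commit-wins behavior, here is a hedged example of what an application might see when two nodes modify the same row concurrently (the accounts table is hypothetical). The transaction that certifies second is rolled back with a deadlock error and is expected to be retried by the application:

-- session on node1
mysql> BEGIN;
mysql> UPDATE accounts SET balance = 100 WHERE id = 1;

-- session on node2, before node1 commits
mysql> BEGIN;
mysql> UPDATE accounts SET balance = 200 WHERE id = 1;

-- node1 commits first; its write-set certifies cluster-wide
mysql> COMMIT;

-- node2's transaction now fails certification and is rolled back
mysql> COMMIT;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction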

Hardware & Software requirements

Configuring hardware for either cluster doesn't require particularly potent resources. A minimal Oracle RAC cluster configuration would be satisfied by two servers with two CPUs, at least 1.5 GB of RAM, an amount of swap space equal to the amount of RAM and two Gigabit Ethernet NICs. Galera's minimum configuration is three nodes (one of the nodes can be an arbitrator, garbd), each with a 1GHz single-core CPU, 512MB of RAM and a 100 Mbps network card. While these are the minimums, we can safely say that in both cases you would probably want more resources for your production system.

Each node stores its own software, so you need to prepare several gigabytes of storage. Oracle and Galera both have the ability to patch nodes individually by taking them down one at a time. This rolling patch avoids a complete application outage, as there are always database nodes available to handle traffic.

What is important to mention is that a production Galera cluster can easily run on VMs or basic bare metal, while RAC would need investment in sophisticated shared storage and fiber communication.

Monitoring and management

Oracle Enterprise Manager is the favored approach for monitoring Oracle RAC and Oracle Clusterware. Oracle Enterprise Manager is Oracle's web-based unified management system for monitoring and administering your database environment. It's part of the Oracle Enterprise license and should be installed on a separate server. Clusterware monitoring and management is done via a combination of the crsctl and srvctl commands, which are part of the cluster binaries. Below you can find a couple of example commands.

Clusterware Resource Status Check:

    crsctl status resource -t (or shorter: crsctl stat res -t)

Example:

$ crsctl stat res ora.test1.vip
NAME=ora.test1.vip
TYPE=ora.cluster_vip_net1.type
TARGET=ONLINE
STATE=ONLINE on test1

Check the status of the Oracle Clusterware stack:

    crsctl check cluster

Example:

$ crsctl check cluster -all
*****************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
*****************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Check the status of Oracle High Availability Services and the Oracle Clusterware stack on the local server:

    crsctl check crs

Example:

$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Stop Oracle High Availability Services on the local server:

    crsctl stop has

Start Oracle High Availability Services on the local server:

    crsctl start has

Displays the status of node applications:

    srvctl status nodeapps

Displays the configuration information for all SCAN VIPs:

    srvctl config scan

Example:

srvctl config scan -scannumber 1
SCAN name: testscan, Network: 1
Subnet IPv4: 192.51.100.1/203.0.113.46/eth0, static
Subnet IPv6: 
SCAN 1 IPv4 VIP: 192.51.100.195
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes:
SCAN VIP is individually disabled on nodes:

The Cluster Verification Utility (CVU) performs system checks in preparation for installation, patch updates, or other system changes:

    cluvfy comp ocr

Example:

Verifying OCR integrity
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
ASM Running check passed. ASM is running on all specified nodes
Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful
Disk group for ocr location "+DATA" available on all the nodes
NOTE:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.
OCR integrity check passed
Verification of OCR integrity was successful.

Galera exposes the state of its nodes and the cluster through the wsrep API as status variables. There are currently 34 dedicated status variables, which can be viewed with the SHOW STATUS statement:

mysql> SHOW STATUS LIKE 'wsrep_%';
wsrep_apply_oooe
wsrep_apply_oool
wsrep_cert_deps_distance
wsrep_cluster_conf_id
wsrep_cluster_size
wsrep_cluster_state_uuid
wsrep_cluster_status
wsrep_connected
wsrep_flow_control_paused
wsrep_flow_control_paused_ns
wsrep_flow_control_recv
wsrep_local_send_queue_avg
wsrep_local_state_uuid
wsrep_protocol_version
wsrep_provider_name
wsrep_provider_vendor
wsrep_provider_version
wsrep_flow_control_sent
wsrep_gcomm_uuid
wsrep_last_committed
wsrep_local_bf_aborts
wsrep_local_cert_failures
wsrep_local_commits
wsrep_local_index
wsrep_local_recv_queue
wsrep_local_recv_queue_avg
wsrep_local_replays
wsrep_local_send_queue
wsrep_ready
wsrep_received
wsrep_received_bytes
wsrep_replicated
wsrep_replicated_bytes
wsrep_thread_count

The administration of MySQL Galera Cluster is, in many aspects, very similar. There are just a few exceptions, like bootstrapping the cluster from the initial node, or recovering nodes via SST or IST operations.

Bootstrapping cluster:

$ service mysql bootstrap # sysvinit
$ service mysql start --wsrep-new-cluster # sysvinit
$ galera_new_cluster # systemd
$ mysqld_safe --wsrep-new-cluster # command line

The equivalent web-based, out-of-the-box solution to manage and monitor Galera Cluster is ClusterControl. It provides a web interface to deploy clusters, monitors key metrics, provides database advisors, and takes care of management tasks like backup and restore, automatic patching, traffic encryption and availability management.

Restrictions on workload

Oracle provides SCAN technology, which we found missing in Galera Cluster. The benefit of SCAN is that the client's connection information does not need to change if you add or remove nodes or databases in the cluster. When using SCAN, the Oracle database randomly connects to one of the available SCAN listeners (typically three) in a round-robin fashion and balances the connections between them. Two kinds of load balancing can be configured: client-side, connect-time load balancing and server-side, run-time load balancing. Although there is nothing similar within Galera Cluster itself, the same functionality can be addressed with additional software like ProxySQL, HAProxy or MaxScale, combined with Keepalived.

When it comes to application workload design for Galera Cluster, you should avoid conflicting updates on the same row, as it leads to deadlocks across the cluster. Avoid bulk inserts or updates, as these might be larger than the maximum allowed writeset. That might also cause cluster stalls.
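
If you are unsure where those limits sit on your cluster, the relevant caps are exposed as regular MySQL variables (wsrep_max_ws_rows and wsrep_max_ws_size), so a quick check before loading data in bulk could be:

mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_rows';
mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';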

When designing Oracle HA with RAC, you need to keep in mind that RAC only protects against server failure; you still need to mirror the storage and provide network redundancy. Modern web applications require access to location-independent data services, and because of RAC's storage architecture limitations, this can be tricky to achieve. You also need to spend a notable amount of time gaining the relevant knowledge to manage the environment; it is a long process. On the application workload side, there are some drawbacks. Distributing separate read or write operations on the same dataset is not optimal, because latency is added by the supplementary internode data exchange. Things like partitioning, sequence caches, and sorting operations should be reviewed before migrating to RAC.

Multi data-center redundancy

According to the Oracle documentation, the maximum distance between two boxes connected in a point-to-point fashion and running synchronously can be only 10 km. Using specialized devices, this distance can be increased to 100 km.

Galera Cluster is well known for its multi-datacenter replication capabilities. It has rich support for wide area network (WAN) settings. It can be configured for high network latency by taking Round-Trip Time (RTT) measurements between cluster nodes and adjusting the necessary parameters. The wsrep_provider_options parameter allows you to configure settings like evs.suspect_timeout, evs.inactive_timeout, evs.join_retrans_period and many more.
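
As a hedged illustration only (the values are placeholders showing the shape of such tuning, not recommendations), a WAN-oriented configuration could tag each datacenter with its own segment and relax the EVS timeouts:

wsrep_provider_options = "gmcast.segment=1; evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M"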

Using Galera and RAC in Cloud

Per Oracle note www.oracle.com/technetwork/database/options/.../rac-cloud-support-2843861.pdf no third-party cloud currently meets Oracle’s requirements regarding natively provided shared storage. “Native” in this context means that the cloud provider must support shared storage as part of their infrastructure as per Oracle’s support policy.

Thanks to its shared nothing architecture, which is not tied to a sophisticated storage solution, Galera cluster can be easily deployed in a cloud environment. Things like:

  • optimized network protocol,
  • topology-aware replication,
  • traffic encryption,
  • detection and automatic eviction of unreliable nodes,

make the cloud migration process more reliable.

Licenses and hidden costs

Oracle licensing is a complex topic and would require a separate blog article. The cluster factor makes it even more difficult. The cost goes up as we have to add some options to license a complete RAC solution. Here we just want to highlight what to expect and where to find more information.

RAC is a feature of Oracle Enterprise Edition license. Oracle Enterprise license is split into two types, per named user and per processor. If you consider Enterprise Edition with per core license, then the single core cost is RAC 23,000 USD + Oracle DB EE 47,500 USD, and you still need to add a ~ 22% support fee. We would like to refer to a great blog on pricing found on https://flashdba.com/2013/09/18/the-real-cost-of-oracle-rac/.

Flashdba calculated the price of a four-node Oracle RAC. The total amount was 902,400 USD, plus an additional 595,584 USD for three years of DB maintenance, and that does not include features like partitioning or in-memory database - all that with a 60% Oracle discount.

Galera Cluster is an open source solution that anyone can run for free. Subscriptions are available for production implementations that require vendor support. A good TCO calculation can be found at https://severalnines.com/blog/database-tco-calculating-total-cost-ownership-mysql-management.

Conclusion

While there are significant differences in architecture, both clusters share the main principles and can achieve similar goals. Oracle's enterprise product comes with everything out of the box - and with its price. With a cost in the range of over 1M USD as seen above, it is a high-end solution that many enterprises would not be able to afford. Galera Cluster can be described as a decent high availability solution for the masses. In certain cases, Galera may well be a very good alternative to Oracle RAC. One drawback is that you have to build your own stack, although that can be completely automated with ClusterControl. We'd love to hear your thoughts on this.

New Webinar on How to Migrate to Galera Cluster for MySQL & MariaDB


Join us on Tuesday May 29th for this new webinar with Severalnines Support Engineer Bart Oles, who will walk you through what you need to know in order to migrate from standalone or a master-slave MySQL/MariaDB setup to Galera Cluster.

When considering such a migration, plenty of questions typically come up, such as: how do we migrate? Does the schema or application change? What are the limitations? Can a migration be done online, without service interruption? What are the potential risks?

Galera Cluster has become a mainstream option for high availability MySQL and MariaDB. And though it is now known as a credible replacement for traditional MySQL master-slave architectures, it is not a drop-in replacement.

It has some characteristics that make it unsuitable for certain use cases, however, most applications can still be adapted to run on it.

The benefits are clear: multi-master InnoDB setup with built-in failover and read scalability.

Join us on May 29th for this walk-through on how to migrate to Galera Cluster for MySQL and MariaDB.

Sign up below!

Date, Time & Registration

Europe/MEA/APAC

Tuesday, May 29th at 09:00 BST / 10:00 CEST (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, May 29th at 09:00 PDT (US) / 12:00 EDT (US)

Register Now

Agenda

  • Application use cases for Galera
  • Schema design
  • Events and Triggers
  • Query design
  • Migrating the schema
  • Load balancer and VIP
  • Loading initial data into the cluster
  • Limitations:
    • Cluster technology
    • Application vendor support
  • Performing Online Migration to Galera
  • Operational management checklist
  • Belts and suspenders: Plan B
  • Demo

Speaker

Bartlomiej Oles is a MySQL and Oracle DBA with over 15 years of experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.

We look forward to “seeing” you there and to insightful discussions!

Deploying Cloud Databases with ClusterControl 1.6


ClusterControl 1.6 comes with tighter integration with AWS, Azure and Google Cloud, so it is now possible to launch new instances and deploy MySQL, MariaDB, MongoDB and PostgreSQL directly from the ClusterControl user interface. In this blog, we will show you how to deploy a cluster on Amazon Web Services.

Note that this new feature requires two modules called clustercontrol-cloud and clustercontrol-clud. The former is a helper daemon which extends CMON's capability for cloud communication, while the latter is a file manager client to upload and download files on cloud instances. Both packages are dependencies of the clustercontrol UI package and will be installed automatically if they do not exist. See the Components documentation page for details.

Cloud Credentials

ClusterControl allows you to store and manage your cloud credentials under Integrations (side menu) -> Cloud Providers:

The supported cloud platforms in this release are Amazon Web Services, Google Cloud Platform and Microsoft Azure. On this page, you can add new cloud credentials, manage existing ones and also connect to your cloud platform to manage resources.

The credentials that have been set up here can be used to:

  • Manage cloud resources
  • Deploy databases in the cloud
  • Upload backup to cloud storage

The following is what you would see if you clicked on "Manage AWS" button:

You can perform simple management tasks on your cloud instances. You can also check the VPC settings under "AWS VPC" tab, as shown in the following screenshot:

The above features are useful as reference, especially when preparing your cloud instances before you start the database deployments.

Database Deployment on Cloud

In previous versions of ClusterControl, database deployment on cloud would be treated similarly to deployment on standard hosts, where you had to create the cloud instances beforehand and then supply the instance details and credentials in the "Deploy Database Cluster" wizard. The deployment procedure was unaware of any extra functionality and flexibility in the cloud environment, like dynamic IP and hostname allocation, NAT-ed public IP address, storage elasticity, virtual private cloud network configuration and so on.

With version 1.6, you just need to supply the cloud credentials, which can be managed via the "Cloud Providers" interface and follow the "Deploy in the Cloud" deployment wizard. From ClusterControl UI, click Deploy and you will be presented with the following options:

At the moment, the supported cloud providers are the three big players - Amazon Web Service (AWS), Google Cloud and Microsoft Azure. We are going to integrate more providers in the future release.

In the first page, you will be presented with the Cluster Details options:

In this section, you would need to select the supported cluster type, MySQL Galera Cluster, MongoDB Replica Set or PostgreSQL Streaming Replication. The next step is to choose the supported vendor for the selected cluster type. At the moment, the following vendors and versions are supported:

  • MySQL Galera Cluster - Percona XtraDB Cluster 5.7, MariaDB 10.2
  • MongoDB Cluster - MongoDB 3.4 by MongoDB, Inc and Percona Server for MongoDB 3.4 by Percona (replica set only).
  • PostgreSQL Cluster - PostgreSQL 10.0 (streaming replication only).

In the next step, you will be presented with the following dialog:

Here you can configure the selected cluster type accordingly. Pick the number of nodes. The Cluster Name will be used as the instance tag, so you can easily recognize this deployment in your cloud provider dashboard. No space is allowed in the cluster name. My.cnf Template is the template configuration file that ClusterControl will use to deploy the cluster. It must be located under /usr/share/cmon/templates on the ClusterControl host. The rest of the fields are pretty self-explanatory.

The next dialog is to select the cloud credentials:

You can choose the existing cloud credentials or create a new one by clicking on the "Add New Credential" button. The next step is to choose the virtual machine configuration:

Most of the settings in this step are dynamically populated from the cloud provider by the chosen credentials. You can configure the operating system, instance size, VPC setting, storage type and size and also specify the SSH key location on the ClusterControl host. You can also let ClusterControl generate a new key specifically for these instances. When clicking on "Add New" button next to Virtual Private Cloud, you will be presented with a form to create a new VPC:

VPC is a logical network infrastructure you have within your cloud platform. You can configure your VPC by modifying its IP address range, create subnets, configure route tables, network gateways, and security settings. It's recommended to deploy your database infrastructure in this network for isolation, security and routing control.

When creating a new VPC, specify the VPC name and IPv4 address block with subnet. Then, choose whether IPv6 should be part of the network and the tenancy option. You can then use this virtual network for your database infrastructure.


The last step is the deployment summary:

In this stage, you need to choose which subnet under the chosen virtual network that you want the database to be running on. Take note that the chosen subnet MUST have auto-assign public IPv4 address enabled. You can also create a new subnet under this VPC by clicking on "Add New Subnet" button. Verify if everything is correct and hit the "Deploy Cluster" button to start the deployment.

You can then monitor the progress by clicking on the Activity -> Jobs -> Create Cluster -> Full Job Details:

Depending on the connections, it could take 10 to 20 minutes to complete. Once done, you will see a new database cluster listed under the ClusterControl dashboard. For PostgreSQL streaming replication cluster, you might need to know the master and slave IP addresses once the deployment completes. Simply go to Nodes tab and you would see the public and private IP addresses on the node list on the left:

Your database cluster is now deployed and running on AWS.

At the moment, scaling up works similarly to standard hosts, where you need to create a cloud instance manually beforehand and specify the host under ClusterControl -> pick the cluster -> Add Node.

Under the hood, the deployment process does the following:

  1. Create cloud instances
  2. Configure security groups and networking
  3. Verify the SSH connectivity from ClusterControl to all created instances
  4. Deploy database on every instance
  5. Configure the clustering or replication links
  6. Register the deployment into ClusterControl

Take note that this feature is still in beta. Nevertheless, you can use this feature to speed up your development and testing environment by controlling and managing the database cluster in different cloud providers from a single user interface.

Database Backup on Cloud

This feature has been around since ClusterControl 1.5.0, and now we added support for Azure Cloud Storage. This means that you can now upload and download the created backup on all three major cloud providers (AWS, GCP and Azure). The upload process happens right after the backup is successfully created (if you toggle "Upload Backup to the Cloud") or you can manually click on the cloud icon button of the backup list:

You can then download and restore backups from the cloud, in case you lost your local backup storage, or if you need to reduce local disk space usage for your backups.

Current Limitations

There are some known limitations for the cloud deployment feature, as stated below:

  • There is currently no 'accounting' in place for the cloud instances. You will need to manually remove the cloud instances if you remove a database cluster.
  • You cannot add or remove a node automatically with cloud instances.
  • You cannot deploy a load balancer automatically with a cloud instance.

We have extensively tested the feature in many environments and setups, but there are always corner cases that we might have missed. For more information, please take a look at the change log.

Happy clustering in the cloud!

How to Recover Galera Cluster or MySQL Replication from Split Brain Syndrome


You may have heard the term “split brain”. What is it? How does it affect your clusters? In this blog post we will discuss what exactly it is, what danger it may pose to your database, how we can prevent it, and if everything goes wrong, how to recover from it.

Long gone are the days of single instances; nowadays almost all databases run in replication groups or clusters. This is great for high availability and scalability, but a distributed database introduces new dangers and limitations. One case which can be deadly is a network split. Imagine a cluster of multiple nodes which, due to network issues, is split in two parts. For obvious reasons (data consistency), both parts shouldn’t handle traffic at the same time, as they are isolated from each other and data cannot be transferred between them. It is also wrong from the application point of view, even if there eventually were a way to sync the data (reconciliation of two diverged datasets is not trivial): for a while, part of the application would be unaware of the changes made by the application hosts that access the other part of the database cluster. This can lead to serious problems.

The condition in which the cluster has been divided in two or more parts that are willing to accept writes is called “split brain”.

The biggest problem with split brain is data drift, as writes happen on both parts of the cluster. No MySQL flavor provides automated means of merging datasets that have diverged. You will not find such a feature in MySQL replication, Group Replication or Galera. Once the data has diverged, the only option is to either use one of the parts of the cluster as the source of truth and discard changes executed on the other part, or to follow some manual process to merge the data.

This is why we will start with how to prevent split brain from happening. This is so much easier than having to fix any data discrepancy.

How to prevent split brain

The exact solution depends on the type of the database and the setup of the environment. We will take a look at some of the most common cases for Galera Cluster and MySQL Replication.

Galera cluster

Galera has a built-in “circuit breaker” to handle split brain: it relies on a quorum mechanism. If a majority (50% + 1) of the nodes are available in the cluster, Galera will operate normally. If there is no majority, Galera will stop serving traffic and switch to the so-called “non-Primary” state. This is pretty much all you need to deal with a split brain situation while using Galera. Sure, there are manual methods to force Galera into the “Primary” state even without a majority, but unless you do that, you should be safe.
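As a minimal sketch, this is how you could check the cluster state from any node, and how the manual override mentioned above is typically issued - only ever on the partition you have deliberately chosen to keep, since it re-enables writes on that side:

$ mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"   # Primary or non-Primary
$ mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"     # nodes visible in this component
# Force this partition into Primary state (use with extreme care):
$ mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"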

The way quorum is calculated has important repercussions - at a single datacenter level, you want to have an odd number of nodes. Three nodes give you tolerance for the failure of one node (2 remaining nodes meet the requirement of more than 50% of the nodes in the cluster being available). Five nodes give you tolerance for the failure of two nodes (5 - 2 = 3, which is more than 50% of 5 nodes). On the other hand, using four nodes will not improve your tolerance over a three-node cluster. It would still handle only the failure of one node (4 - 1 = 3, more than 50% of 4), while the failure of two nodes will render the cluster unusable (4 - 2 = 2, exactly 50%, not more).

While deploying Galera cluster in a single datacenter, please keep in mind that, ideally, you would like to distribute nodes across multiple availability zones (separate power source, network, etc.) - as long as they do exist in your datacenter, that is. A simple setup may look like below:

At the multi-datacenter level, those considerations also apply. If you want the Galera cluster to automatically handle datacenter failures, you should use an odd number of datacenters. To reduce costs, you can use a Galera arbitrator in one of them instead of a database node. The Galera arbitrator (garbd) is a process which takes part in the quorum calculation but does not contain any data. This makes it possible to run it even on very small instances, as it is not resource-intensive - although the network connectivity has to be good, as it ‘sees’ all the replication traffic. An example setup may look like the diagram below:
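To give an idea of the arbitrator side of such a setup, here is a minimal sketch of launching garbd; the node addresses and the group name are placeholders, and the group name has to match the cluster's wsrep_cluster_name:

$ garbd --address gcomm://192.168.10.11:4567,192.168.20.11:4567 \
        --group my_galera_cluster \
        --daemon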

MySQL Replication

With MySQL replication, the biggest issue is that there is no built-in quorum mechanism, as there is in Galera Cluster. Therefore, more steps are required to ensure that your setup will not be affected by a split brain.

One method is to avoid cross-datacenter automated failovers. You can configure your failover solution (be it ClusterControl, MHA or Orchestrator) to fail over only within a single datacenter. If there is a full datacenter outage, it is up to the admin to decide how to fail over and how to ensure that the servers in the failed datacenter will not be used.

There are options to make it more automated. You can use Consul to store data about the nodes in the replication setup, including which one of them is the master. Then it will be up to the admin (or some scripting) to update this entry and move writes to the second datacenter. You can also benefit from an Orchestrator/Raft setup, where Orchestrator nodes can be distributed across multiple datacenters and detect split brain. Based on this, you could take different actions like, as we mentioned previously, updating entries in Consul or etcd. The point is that this is a much more complex environment to set up and automate than a Galera cluster. Below you can find an example of a multi-datacenter setup for MySQL replication.

Please keep in mind that you still have to create scripts to make it work, i.e. monitor Orchestrator nodes for a split brain and take the necessary actions to implement STONITH and ensure that the master in datacenter A will not be used once the network converges and connectivity is restored.
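As an illustration of the ‘source of truth’ entry mentioned above, the Consul part could be as simple as a key-value record that your failover scripts write and your applications or proxies read before connecting; the key name and address below are made up for the example:

$ consul kv put mysql/cluster1/master 192.168.10.11   # written by the admin or the failover script
$ consul kv get mysql/cluster1/master                 # read by applications or proxies
192.168.10.11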

Split brain happened - what to do next?

The worst case scenario happened and we have data drift. We will try to give you some hints on what can be done here. Unfortunately, the exact steps will depend mostly on your schema design, so it will not be possible to write a precise how-to guide.

What you have to keep in mind is that the ultimate goal will be to copy data from one master to the other and recreate all relations between tables.

First of all, you have to identify which node will continue serving data as master. This is the dataset into which you will merge the data stored on the other “master” instance. Once that’s done, you have to identify which data from the old master is missing on the current master. This will be manual work. If you have timestamps in your tables, you can leverage them to pinpoint the missing data. Ultimately, the binary logs will contain all data modifications, so you can rely on them. You may also have to rely on your knowledge of the data structure and the relations between tables. If your data is normalized, one record in one table could be related to records in other tables. For example, your application may insert data into a “user” table which is related to an “address” table using user_id. You will have to find all related rows and extract them.
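If you go the binary log route, a sketch of how to inspect what happened on the old master could look like the command below; the binlog file name and the timestamp are placeholders, with --start-datetime marking roughly when the split happened:

$ mysqlbinlog --start-datetime="2018-05-30 23:00:00" \
              --base64-output=decode-rows --verbose \
              /var/lib/mysql/binlog.000123 > changes_on_old_master.txt
# decodes row-based events into readable pseudo-SQL for manual review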

The next step will be to load this data into the new master. Here comes the tricky part - if you prepared your setups beforehand, this could simply be a matter of running a couple of inserts. If not, this may be rather complex. It’s all about primary key and unique index values. If your primary key values are generated as unique on each server, using some sort of UUID generator or using the auto_increment_increment and auto_increment_offset settings in MySQL, you can be sure that the data from the old master won’t cause primary key or unique key conflicts with data on the new master. Otherwise, you may have to manually modify the data from the old master to ensure it can be inserted correctly. It sounds complex, so let’s take a look at an example.

Let’s imagine we insert rows using auto_increment on node A, which is the master. For the sake of simplicity, we will focus on a single table only. There are columns ‘id’ and ‘value’.

If we insert rows without any particular setup, we’ll see entries like below:

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’

Those rows will replicate to the slave (B). If a split brain happens and writes are executed on both the old and the new master, we will end up with the following situation:

A

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’
1004, ‘some value4’
1005, ‘some value5’
1006, ‘some value7’

B

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’
1004, ‘some value6’
1005, ‘some value8’
1006, ‘some value9’

As you can see, there’s no way to simply dump the records with ids 1004, 1005 and 1006 from node A and store them on node B, because we would end up with duplicated primary key entries. What needs to be done is to change the values of the id column in the rows to be inserted to values larger than the current maximum value of the id column in the table. That is all that’s needed for single rows. For more complex relations, where multiple tables are involved, you may have to make the changes in multiple locations.
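As a rough sketch, assuming the rows extracted from the old master have been loaded into a staging table on the new master (mydb, mytable and old_master_rows are made-up names for this example), the id shift could be done like this:

$ mysql mydb <<'EOF'
-- find the current maximum id on the new master
SET @offset = (SELECT MAX(id) FROM mytable);
-- re-insert the old master's rows with shifted, non-conflicting ids
INSERT INTO mytable (id, value)
SELECT id + @offset, value FROM old_master_rows;
EOF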

On the other hand, if we had anticipated this potential problem and configured our nodes to store odd ids on node A and even ids on node B, the problem would have been much easier to solve.

Node A was configured with auto_increment_offset = 1 and auto_increment_increment = 2

Node B was configured with auto_increment_offset = 2 and auto_increment_increment = 2
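In my.cnf terms, these protective settings would be a short sketch like the one below, applied in the [mysqld] section of each node (and requiring a restart, or the equivalent SET GLOBAL statements):

# Node A
auto_increment_increment = 2
auto_increment_offset    = 1

# Node B
auto_increment_increment = 2
auto_increment_offset    = 2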

This is how the data would look on node A before the split brain:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’

After the split brain, it will look like below.

Node A:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1009, ‘some value4’
1011, ‘some value5’
1013, ‘some value7’

Node B:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1008, ‘some value6’
1010, ‘some value8’
1012, ‘some value9’

Now we can easily copy missing data from node A:

1009, ‘some value4’
1011, ‘some value5’
1013, ‘some value7’

And load them into node B, ending up with the following data set:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1008, ‘some value6’
1009, ‘some value4’
1010, ‘some value8’
1011, ‘some value5’
1012, ‘some value9’
1013, ‘some value7’

Sure, the rows are not in the original order, but this should be OK. In the worst case scenario, you will have to order by the ‘value’ column in queries, and maybe add an index on it to make the sorting fast.

Now, imagine hundreds or thousands of rows and a highly normalized table structure - restoring one row may mean you have to restore several more in additional tables. With the need to change ids (because you didn’t have protective settings in place) across all related rows, and all of this being manual work, you can imagine that this is not the best situation to be in. It takes time to recover and it is an error-prone process. Luckily, as we discussed at the beginning, there are means to minimize the chance that split brain will impact your system, or to reduce the work that needs to be done to sync your nodes back. Make sure you use them and stay prepared.

MySQL on Docker: Running a MariaDB Galera Cluster without Container Orchestration Tools - Part 1


Container orchestration tools simplify the running of a distributed system by deploying and redeploying containers and handling any failures that occur. One might need to move applications around, e.g., to handle updates, scaling, or underlying host failures. While this sounds great, it does not always work well with a strongly consistent database cluster like Galera. You can’t just move database nodes around; they are not stateless applications. Also, the order in which you perform operations on a cluster has high significance. For instance, restarting a Galera cluster has to start from the most advanced node, or else you will lose data. Therefore, we’ll show you how to run Galera Cluster on Docker without a container orchestration tool, so you have total control.

In this blog post, we are going to look into how to run a MariaDB Galera Cluster on Docker containers using the standard Docker image on multiple Docker hosts, without the help of orchestration tools like Swarm or Kubernetes. This approach is similar to running a Galera Cluster on standard hosts, but the process management is configured through Docker.

Before we jump further into details, we assume you have installed Docker, disabled SElinux/AppArmor and cleared up the rules inside iptables, firewalld or ufw (whichever you are using). The following are three dedicated Docker hosts for our database cluster:

  • host1.local - 192.168.55.161
  • host2.local - 192.168.55.162
  • host3.local - 192.168.55.163

Multi-host Networking

First of all, the default Docker networking is bound to the local host. Docker Swarm introduces another networking layer called overlay network, which extends the container internetworking to multiple Docker hosts in a cluster called Swarm. Long before this integration came into place, there were many network plugins developed to support this - Flannel, Calico, Weave are some of them.

Here, we are going to use Weave as the Docker network plugin for multi-host networking. This is mainly due to its simplicity to install and run, and its support for a DNS resolver (containers running under this network can resolve each other's hostname). There are two ways to get Weave running - via systemd or through Docker. We are going to install it as a systemd unit, so it's independent of the Docker daemon (otherwise, we would have to start Docker first before Weave gets activated).

  1. Download and install Weave:

    $ curl -L git.io/weave -o /usr/local/bin/weave
    $ chmod a+x /usr/local/bin/weave
  2. Create a systemd unit file for Weave:

    $ cat > /etc/systemd/system/weave.service << EOF
    [Unit]
    Description=Weave Network
    Documentation=http://docs.weave.works/weave/latest_release/
    Requires=docker.service
    After=docker.service
    [Service]
    EnvironmentFile=-/etc/sysconfig/weave
    ExecStartPre=/usr/local/bin/weave launch --no-restart $PEERS
    ExecStart=/usr/bin/docker attach weave
    ExecStop=/usr/local/bin/weave stop
    [Install]
    WantedBy=multi-user.target
    EOF
  3. Define IP addresses or hostname of the peers inside /etc/sysconfig/weave:

    $ echo 'PEERS="192.168.55.161 192.168.55.162 192.168.55.163"'> /etc/sysconfig/weave
  4. Start and enable Weave on boot:

    $ systemctl start weave
    $ systemctl enable weave

Repeat the above 4 steps on all Docker hosts. Verify with the following command once done:

$ weave status

The number of peers is what we are looking for. It should be 3:

          ...
          Peers: 3 (with 6 established connections)
          ...

Running a Galera Cluster

Now that the network is ready, it's time to fire up our database containers and form a cluster. The basic rules are:

  • Containers must be created with --net=weave to have multi-host connectivity.
  • Container ports that need to be published are 3306, 4444, 4567, 4568.
  • The Docker image must support Galera. If you'd like to use Oracle MySQL, then get the Codership version. If you'd like Percona's, use this image instead. In this blog post, we are using MariaDB's.

The reasons we chose MariaDB as the Galera cluster vendor are:

  • Galera is embedded into MariaDB, starting from MariaDB 10.1.
  • The MariaDB image is maintained by the Docker and MariaDB teams.
  • It is one of the most popular Docker images out there.

Bootstrapping a Galera Cluster has to be performed in sequence. First, the most up-to-date node must be started with "wsrep_cluster_address=gcomm://". Then, start the remaining nodes with a full address consisting of all nodes in the cluster, e.g., "wsrep_cluster_address=gcomm://node1,node2,node3". To accomplish this using containers, we have to perform some extra steps to ensure all containers are configured homogeneously. So the plan is:

  1. We would need to start with 4 containers in this order - mariadb0 (bootstrap), mariadb2, mariadb3, mariadb1.
  2. Container mariadb0 will be using the same datadir and configdir as mariadb1.
  3. Use mariadb0 on host1 for the first bootstrap, then start mariadb2 on host2, mariadb3 on host3.
  4. Remove mariadb0 on host1 to give way for mariadb1.
  5. Lastly, start mariadb1 on host1.

At the end of the day, you will have a three-node Galera Cluster (mariadb1, mariadb2, mariadb3). The first container (mariadb0) is a transient container for bootstrapping purposes only, using the cluster address "gcomm://". It shares the same datadir and configdir as mariadb1 and will be removed once the cluster is formed (mariadb2 and mariadb3 are up) and the nodes are synced.

By default, Galera is turned off in MariaDB and needs to be enabled with a flag called wsrep_on (set to ON) and wsrep_provider (set to the Galera library path) plus a number of Galera-related parameters. Thus, we need to define a custom configuration file for the container to configure Galera correctly.

Let's start with the first container, mariadb0. Since mariadb0 shares mariadb1's configuration directory (which is what gets mounted into the container), create a file under /containers/mariadb1/conf.d/my.cnf and add the following lines:

$ mkdir -p /containers/mariadb1/conf.d
$ cat /containers/mariadb1/conf.d/my.cnf
[mysqld]

default_storage_engine          = InnoDB
binlog_format                   = ROW

innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_lock_schedule_algorithm  = FCFS # MariaDB >10.1.19 and >10.2.3 only

wsrep_on                        = ON
wsrep_provider                  = /usr/lib/galera/libgalera_smm.so
wsrep_sst_method                = xtrabackup-v2

Since the image doesn't come with MariaDB Backup (which is the preferred SST method for MariaDB 10.1 and MariaDB 10.2), we are going to stick with xtrabackup-v2 for the time being.

To perform the first bootstrap for the cluster, run the bootstrap container (mariadb0) on host1:

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --env MYSQL_USER=proxysql \
        --env MYSQL_PASSWORD=proxysqlpassword \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

The parameters used in the above command are:

  • --name, creates the container named "mariadb0",
  • --hostname, assigns the container a hostname "mariadb0.weave.local",
  • --net, places the container in the weave network for multi-host networking support,
  • --publish, exposes ports 3306, 4444, 4567, 4568 on the container to the host,
  • $(weave dns-args), configures DNS resolver for this container. This command can be translated into Docker run as "--dns=172.17.0.1 --dns-search=weave.local.",
  • --env MYSQL_ROOT_PASSWORD, the MySQL root password,
  • --env MYSQL_USER, creates "proxysql" user to be used later with ProxySQL for database routing,
  • --env MYSQL_PASSWORD, the "proxysql" user password,
  • --volume /containers/mariadb1/datadir:/var/lib/mysql, creates /containers/mariadb1/datadir if it does not exist and maps it to /var/lib/mysql (the MySQL datadir) of the container (for the bootstrap node, this could be skipped),
  • --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d, mounts the files under directory /containers/mariadb1/conf.d of the Docker host, into the container at /etc/mysql/mariadb.conf.d.
  • mariadb:10.2.15, uses MariaDB 10.2.15 image from here,
  • --wsrep_cluster_address, Galera connection string for the cluster. "gcomm://" means bootstrap. For the rest of the containers, we are going to use a full address instead.
  • --wsrep_sst_auth, authentication string for SST user. Use the same user as root,
  • --wsrep_node_address, the node hostname, in this case we are going to use the FQDN provided by Weave.

The bootstrap container contains several key things:

  • The name, hostname and wsrep_node_address are mariadb0, but it uses the volumes of mariadb1.
  • The cluster address is "gcomm://"
  • There are two additional --env parameters - MYSQL_USER and MYSQL_PASSWORD. These parameters create an additional user for our ProxySQL monitoring purposes.

Verify with the following command:

$ docker ps
$ docker logs -f mariadb0

Once you see the following line, it indicates the bootstrap process is completed and Galera is active:

2018-05-30 23:19:30 139816524539648 [Note] WSREP: Synchronized with group, ready for connections

Create the directories to hold our custom configuration file on the remaining hosts:

$ mkdir -p /containers/mariadb2/conf.d # on host2
$ mkdir -p /containers/mariadb3/conf.d # on host3

Then, copy the my.cnf that we've created for mariadb0 and mariadb1 to mariadb2 and mariadb3 respectively:

$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb2/conf.d/ # on host1
$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb3/conf.d/ # on host1

Next, create another 2 database containers (mariadb2 and mariadb3) on host2 and host3 respectively:

$ docker run -d \
        --name ${NAME} \
        --hostname ${NAME}.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/${NAME}/datadir:/var/lib/mysql \
        --volume /containers/${NAME}/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=${NAME}.weave.local

** Replace ${NAME} with mariadb2 or mariadb3 respectively.

However, there is a catch. The entrypoint script checks the mysqld service in the background after database initialization, using the MySQL root user without a password. Since Galera automatically performs synchronization through SST or IST when starting up, the MySQL root user password will change to mirror the bootstrapped node. Thus, you will see the following error during the first startup:

2018-05-30 23:27:13 140003794790144 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
MySQL init process in progress…
MySQL init process failed.

The trick is to restart the failed containers once more, because this time, the MySQL datadir would have been created (in the first run attempt) and it would skip the database initialization part:

$ docker start mariadb2 # on host2
$ docker start mariadb3 # on host3

Once started, verify by looking at the following line:

$ docker logs -f mariadb2
…
2018-05-30 23:28:39 139808069601024 [Note] WSREP: Synchronized with group, ready for connections

At this point, there are 3 containers running: mariadb0, mariadb2 and mariadb3. Take note that mariadb0 was started using the bootstrap command (gcomm://), which means that if the container is automatically restarted by Docker in the future, it could potentially become disjoint from the primary component. Thus, we need to remove this container and replace it with mariadb1, using the same Galera connection string as the rest, and the same datadir and configdir as mariadb0.

First, stop mariadb0 by sending SIGTERM (to ensure the node shuts down gracefully):

$ docker kill -s 15 mariadb0
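Once mariadb0 has exited cleanly, you may also want to remove the stopped container entirely, so it can never be restarted with the bootstrap address by accident - a small optional step in line with the plan above:

$ docker ps -a | grep mariadb0   # confirm it has exited
$ docker rm mariadb0             # remove the transient bootstrap container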

Then, start mariadb1 on host1 using similar command as mariadb2 or mariadb3:

$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

This time, you don't need to do the restart trick because the MySQL datadir already exists (created by mariadb0). Once the container is started, verify that the cluster size is 3, the cluster status is Primary and the local state is Synced:

$ docker exec -it mariadb3 mysql -uroot "-pPM7%cB43$sd@^1" -e 'select variable_name, variable_value from information_schema.global_status where variable_name in ("wsrep_cluster_size", "wsrep_local_state_comment", "wsrep_cluster_status", "wsrep_incoming_addresses")'
+---------------------------+-------------------------------------------------------------------------------+
| variable_name             | variable_value                                                                |
+---------------------------+-------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 3                                                                             |
| WSREP_CLUSTER_STATUS      | Primary                                                                       |
| WSREP_INCOMING_ADDRESSES  | mariadb1.weave.local:3306,mariadb3.weave.local:3306,mariadb2.weave.local:3306 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                                        |
+---------------------------+-------------------------------------------------------------------------------+

At this point, our architecture is looking something like this:

Although the run command is pretty long, it describes the container's characteristics well. It's probably a good idea to wrap the command in a script to simplify the execution steps, or to use a compose file instead, as sketched below.
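A minimal sketch of such a wrapper, mirroring the exact run command used above (the script name and the NAME argument convention are just one possible approach):

#!/bin/bash
# run_galera_node.sh - start a Galera container on this host
# Usage: ./run_galera_node.sh mariadb2
NAME=$1
CLUSTER_ADDRESS="gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local"

docker run -d \
        --name ${NAME} \
        --hostname ${NAME}.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/${NAME}/datadir:/var/lib/mysql \
        --volume /containers/${NAME}/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=${CLUSTER_ADDRESS} \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=${NAME}.weave.local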

Database Routing with ProxySQL

Now we have three database containers running. The only way to access the cluster right now is via the individual Docker host’s published MySQL port, 3306 (mapped to port 3306 of the container). So what happens if one of the database containers fails? You have to manually fail over the client's connections to the next available node. Depending on the application connector, you could also specify a list of nodes and let the connector handle the failover and query routing for you (Connector/J, PHP mysqlnd). Otherwise, it would be a good idea to unify the database resources into a single resource that can be called a service.

This is where ProxySQL comes into the picture. ProxySQL can act as the query router, load balancing the database connections similar to what "Service" in Swarm or Kubernetes world can do. We have built a ProxySQL Docker image for this purpose and will maintain the image for every new version with our best effort.

Before we run the ProxySQL container, we have to prepare the configuration file. The following is what we have configured for proxysql1. We create a custom configuration file under /containers/proxysql1/proxysql.cnf on host1:

$ cat /containers/proxysql1/proxysql.cnf
datadir="/var/lib/proxysql"
admin_variables=
{
        admin_credentials="admin:admin"
        mysql_ifaces="0.0.0.0:6032"
        refresh_interval=2000
}
mysql_variables=
{
        threads=4
        max_connections=2048
        default_query_delay=0
        default_query_timeout=36000000
        have_compress=true
        poll_timeout=2000
        interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
        default_schema="information_schema"
        stacksize=1048576
        server_version="5.1.30"
        connect_timeout_server=10000
        monitor_history=60000
        monitor_connect_interval=200000
        monitor_ping_interval=200000
        ping_interval_server=10000
        ping_timeout_server=200
        commands_stats=true
        sessions_sort=true
        monitor_username="proxysql"
        monitor_password="proxysqlpassword"
}
mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)
mysql_users =
(
        { username = "sbtest" , password = "password" , default_hostgroup = 10 , active = 1 }
)
mysql_query_rules =
(
        {
                rule_id=100
                active=1
                match_pattern="^SELECT .* FOR UPDATE"
                destination_hostgroup=10
                apply=1
        },
        {
                rule_id=200
                active=1
                match_pattern="^SELECT .*"
                destination_hostgroup=20
                apply=1
        },
        {
                rule_id=300
                active=1
                match_pattern=".*"
                destination_hostgroup=10
                apply=1
        }
)
scheduler =
(
        {
                id = 1
                filename = "/usr/share/proxysql/tools/proxysql_galera_checker.sh"
                active = 1
                interval_ms = 2000
                arg1 = "10"
                arg2 = "20"
                arg3 = "1"
                arg4 = "1"
                arg5 = "/var/lib/proxysql/proxysql_galera_checker.log"
        }
)

The above configuration will:

  • configure two host groups, the single-writer and multi-writer group, as defined under "mysql_servers" section,
  • send reads to all Galera nodes (hostgroup 20) while write operations will go to a single Galera server (hostgroup 10),
  • schedule the proxysql_galera_checker.sh,
  • use monitor_username and monitor_password as the monitoring credentials created when we first bootstrapped the cluster (mariadb0).

Copy the configuration file to host2, for ProxySQL redundancy:

$ mkdir -p /containers/proxysql2/ # on host2
$ scp /containers/proxysql1/proxysql.cnf /containers/proxysql2/ # on host1

Then, run the ProxySQL containers on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --publish 6033 \
        --publish 6032 \
        --restart always \
        --net=weave \
        $(weave dns-args) \
        --hostname ${NAME}.weave.local \
        -v /containers/${NAME}/proxysql.cnf:/etc/proxysql.cnf \
        -v /containers/${NAME}/data:/var/lib/proxysql \
        severalnines/proxysql

** Replace ${NAME} with proxysql1 or proxysql2 respectively.

We specified --restart always to keep the container available regardless of the exit status, and to start it automatically when the Docker daemon starts. This makes the ProxySQL containers act like a daemon.

Verify the MySQL servers status monitored by both ProxySQL instances (OFFLINE_SOFT is expected for the single-writer host group):

$ docker exec -it proxysql1 mysql -uadmin -padmin -h127.0.0.1 -P6032 -e 'select hostgroup_id,hostname,status from mysql_servers'
+--------------+----------------------+--------------+
| hostgroup_id | hostname             | status       |
+--------------+----------------------+--------------+
| 10           | mariadb1.weave.local | ONLINE       |
| 10           | mariadb2.weave.local | OFFLINE_SOFT |
| 10           | mariadb3.weave.local | OFFLINE_SOFT |
| 20           | mariadb1.weave.local | ONLINE       |
| 20           | mariadb2.weave.local | ONLINE       |
| 20           | mariadb3.weave.local | ONLINE       |
+--------------+----------------------+--------------+

At this point, our architecture is looking something like this:

All connections coming in on port 6033 (whether from host1, host2 or the container network) will be load balanced to the backend database containers by ProxySQL. If you would like to access an individual database server, use port 3306 of the physical host instead. There is no virtual IP address configured yet as a single endpoint for the ProxySQL service, but we can add one using Keepalived, as explained in the next section.
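In the meantime, an application container on the weave network could already reach the cluster through ProxySQL along these lines - a sketch only, using the sbtest user defined in the mysql_users section above:

$ docker run --rm -it --net weave $(weave dns-args) \
        mariadb:10.2.15 \
        mysql -usbtest -ppassword -hproxysql1.weave.local -P6033 \
        -e 'SELECT @@hostname'   # shows which Galera backend served the read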

Virtual IP Address with Keepalived

Since we configured the ProxySQL containers to run on host1 and host2, we are going to use Keepalived containers to tie these hosts together and provide a virtual IP address via the host network. This allows a single endpoint for applications or clients to connect to the load balancing layer backed by ProxySQL.

As usual, create a custom configuration file for our Keepalived service. Here is the content of /containers/keepalived1/keepalived.conf:

vrrp_instance VI_DOCKER {
   interface ens33               # interface to monitor
   state MASTER
   virtual_router_id 52          # Assign one ID for this route
   priority 101
   unicast_src_ip 192.168.55.161
   unicast_peer {
      192.168.55.162
   }
   virtual_ipaddress {
      192.168.55.160             # the virtual IP
   }
}

Copy the configuration file to host2 for the second instance:

$ mkdir -p /containers/keepalived2/ # on host2
$ scp /containers/keepalived1/keepalived.conf /containers/keepalived2/ # on host1

Change the priority from 101 to 100 inside the copied configuration file on host2:

$ sed -i 's/101/100/g' /containers/keepalived2/keepalived.conf

** The higher priority instance will hold the virtual IP address (in this case, host1) until the VRRP communication is interrupted (i.e., if host1 goes down).

Then, run the following command on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --cap-add=NET_ADMIN \
        --net=host \
        --restart=always \
        --volume /containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf \
        osixia/keepalived:1.4.4

** Replace ${NAME} with keepalived1 and keepalived2.

The run command tells Docker to:

  • --name, creates a container with the given name (keepalived1 or keepalived2),
  • --cap-add=NET_ADMIN, adds Linux capabilities for network administration,
  • --net=host, attaches the container to the host network. This allows the virtual IP address to be brought up on the host interface, ens33,
  • --restart=always, always keeps the container running,
  • --volume=/containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf, maps the custom configuration file into the container.

After both containers are started, verify the virtual IP address existence by looking at the physical network interface of the MASTER node:

$ ip a | grep ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.55.161/24 brd 192.168.55.255 scope global ens33
    inet 192.168.55.160/32 scope global ens33

Clients and applications may now use the virtual IP address, 192.168.55.160, to access the database service. This virtual IP address exists on host1 at the moment. If host1 goes down, keepalived2 will take over the IP address and bring it up on host2. Take note that this Keepalived configuration does not monitor the ProxySQL containers; it only monitors the VRRP advertisements of the Keepalived peers.
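A quick, optional way to verify the failover behaviour is to stop the MASTER's Keepalived container and watch the address move, then start it again to fail back (host1 reclaims the VIP thanks to its higher priority):

$ docker stop keepalived1    # on host1
$ ip a | grep ens33          # on host2 - the VIP 192.168.55.160 should now appear here
$ docker start keepalived1   # on host1 - the VIP moves back to host1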

At this point, our architecture is looking something like this:

Summary

So, now we have a MariaDB Galera Cluster fronted by a highly available ProxySQL service, all running on Docker containers.

In part two, we are going to look into how to manage this setup. We’ll look at how to perform operations like graceful shutdown, bootstrapping, detecting the most advanced node, failover, recovery, scaling up/down, upgrades, backup and so on. We will also discuss the pros and cons of having this setup for our clustered database service.

Happy containerizing!
