We recently deployed Elasticsearch on Google Kubernetes Engine (GKE), and this post outlines a few tips from our implementation.
GKE provides a managed Kubernetes environment that makes it easy to deploy scalable solutions. The best part is that customers do not need to manage the environment themselves, since GKE is a fully managed service supported by Google. GKE also runs Certified Kubernetes under the hood. Why is this important? It avoids vendor lock-in: customers can take their applications out of GKE and run them anywhere, or across clouds, if needed.
Although Elasticsearch offers its own clustering mechanism, deploying it on GKE adds the advantage of a system that is easy to scale and maintain in high-availability mode. Before starting the deployment, here are a few things to decide on:
- Version of Elasticsearch to be used – https://www.docker.elastic.co/ has the list of official Docker images
- Will this be a regional cluster? By default, GKE creates a cluster's master and nodes in a single compute zone that you specify at creation time. You can improve the availability and resilience of your clusters by creating regional clusters, which distribute Kubernetes resources across multiple zones within a region. The best part? Regional clusters are offered at no additional charge. The benefits should be obvious:
- Resilience from single zone failure. Regional clusters are available across a region rather than a single zone within a region. If a single zone becomes unavailable, your Kubernetes control plane and your resources are not impacted.
- Zero downtime master upgrades and reduced downtime from master failures. Regional clusters provide a high availability control plane, so you can access your control plane even during upgrades.
- What type of networking should be enabled? Where will clients access the Elasticsearch service – on-premises or within GCP? Typically a LoadBalancer Service is the best choice for exposing Elasticsearch to external clients; it can be an internal load balancer if clients are within GCP.
- Type of storage to be used – SSD persistent disk vs. standard persistent disk; SSD offers better performance. Persistent Disks (PDs) are zonal resources, so applications built on top of them can become unavailable in the event of a zonal failure. Regional PDs, by contrast, provide network-attached block storage with synchronous replication of data between two zones in a region, and they integrate natively with Kubernetes, which manages health monitoring and failover to the secondary zone in case of an outage in the primary zone.
- The key setting that makes the Elasticsearch nodes talk to each other is “discovery.zen.ping.unicast.hosts”, which is part of the StatefulSet YAML. It needs to point at all the Elasticsearch nodes in the cluster. If this is not set, the nodes will not discover one another and a single cluster will not form.
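To make the regional-cluster decision concrete, here is a gcloud sketch; the cluster name, region, and machine type are placeholders to adjust for your project:

```shell
# Creates a regional cluster: the master and nodes are spread across
# all zones of the region. --num-nodes is per zone, so this yields
# 3 nodes in a typical 3-zone region.
gcloud container clusters create elastic-cluster \
  --region us-central1 \
  --num-nodes 1 \
  --machine-type n1-standard-4
```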
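For the networking decision, a minimal Service sketch that exposes Elasticsearch through an internal load balancer; the `app: elasticsearch` selector is an assumption about your pod labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  annotations:
    # Remove this annotation to get an external load balancer instead.
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    app: elasticsearch
  ports:
  - port: 9200        # Elasticsearch HTTP port
    targetPort: 9200
```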
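For the storage decision, a sketch of two StorageClasses – a zonal SSD and a regional PD with synchronous two-zone replication – which your StatefulSet's volume claim template can then reference by name (the names here are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd               # zonal SSD persistent disk
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd   # synchronously replicated across two zones
```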
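One common way to satisfy the discovery setting is a headless Service whose DNS name resolves to every Elasticsearch pod IP; the Service name and labels below are assumptions to match to your own manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-discovery
spec:
  clusterIP: None        # headless: DNS returns the IP of every matching pod
  selector:
    app: elasticsearch
  ports:
  - port: 9300           # transport port used for node-to-node traffic
---
# Then, in the StatefulSet's container spec, point discovery at that Service:
#   env:
#   - name: discovery.zen.ping.unicast.hosts
#     value: "elasticsearch-discovery"
```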
Once Elasticsearch is set up on GKE, how do you ensure that failover happens as expected? Here are a few tests you can run.
- Delete a node from the cluster: turn on autoscaling, delete a node, and verify that a replacement is created
- Drop the iptables rules managed by kube-proxy to simulate node failure; kube-proxy programs the rules that route Service traffic to pods
- Use stress (http://people.seas.harvard.edu/~apw/stress/) to simulate high memory usage on a node
- Use stress (http://people.seas.harvard.edu/~apw/stress/) to simulate high CPU usage on a node
- Configure liveness and readiness probes using default metrics, i.e., memory and CPU
- Configure liveness and readiness probes using a custom Stackdriver metric, e.g., requests per second
- Ensure data is replicated across all nodes – insert a document, shut down two nodes, and query from the third
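Rather than installing stress on a node directly, one option is to run it in a pod scheduled onto the node under test; `polinux/stress` is an assumed public image, and the arguments are placeholders to tune:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stress-test
spec:
  # nodeName: <node-under-test>   # uncomment to pin the pod to a specific node
  containers:
  - name: stress
    image: polinux/stress          # assumed public image bundling the stress tool
    command: ["stress"]
    # --vm 1 --vm-bytes 512M exercises memory; swap the args for
    # ["--cpu", "4", "--timeout", "120s"] to exercise CPU instead.
    args: ["--vm", "1", "--vm-bytes", "512M", "--timeout", "120s"]
  restartPolicy: Never
```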
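The probe tests above can be sketched as HTTP/TCP checks against Elasticsearch's standard ports (9200 for HTTP, 9300 for transport); the paths and thresholds are illustrative, not prescriptive. (Scaling on memory/CPU or a custom Stackdriver metric is configured through the Horizontal Pod Autoscaler rather than the probes themselves.)

```yaml
# Container spec fragment for the Elasticsearch StatefulSet.
readinessProbe:
  httpGet:
    path: /_cluster/health?local=true   # succeeds once this node is responsive
    port: 9200
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 9300                          # restart the pod if transport stops listening
  initialDelaySeconds: 60
  periodSeconds: 20
```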
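The replication test can be run with a few curl calls; `$ES_HOST` and the index name are placeholders, and the replica count assumes a three-node cluster:

```shell
# 1. Create an index whose single shard is replicated to every data node.
curl -X PUT "$ES_HOST:9200/failover-test" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 2}}'

# 2. Insert a document.
curl -X POST "$ES_HOST:9200/failover-test/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"message": "replication check"}'

# 3. Shut down two of the three data pods, then confirm the document
#    is still returned by the surviving node.
curl "$ES_HOST:9200/failover-test/_search?q=message:replication"
```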