Self-Healing Systems Workshop

Viktor Farcic

Continuous Deployment

Provision
Test
Build
Test
Deploy
Test
Enable
Test

Continuous Deployment

The Second Half

Monitor
React to problems
Prevent problems
Automated?

What Are We Trying to Accomplish?

Perfection?

Applications that never fail?
Services that can handle any load?
Commits without bugs?
Hardware that never breaks?

What Are We Trying to Accomplish?

There is no such thing as perfection

Applications fail!
Services that can NOT handle any load!
Commits contain undetected bugs!
Hardware breaks!

What Are We Trying to Accomplish?

Self-Replicating System

Life

System over individuals
Small and incremental evolutionary improvements
Reproduction
Self-healing

What Are We Trying to Accomplish?

Resilient System

Cells

Restoring to the state of equilibrium
Monitoring and adjusting processes
Reproducing
Healing

What Are We Trying to Accomplish?

Body Equivalent?

Datacenter
Orchestrator
(micro)Services
Containers
Scaling
Self-Healing?

Self-Healing System

Discover problems
Restore itself to the desired state
Make decisions
Adapt to changed conditions

Self-Healing System

Levels

Application level
System level
Hardware level

Self-Healing System

Types

Reactive healing
Preventive healing

Tools

Cluster orchestrator
Realtime Registry
Historical Registry
Monitoring
General orchestrator

Tools

Cluster Orchestrator

Tools

Realtime Registry

Tools

Historical Registry

Tools

Monitoring

Tools

General Orchestrator

Provisioning

Nodes

git clone https://github.com/vfarcic/ms-lifecycle.git

cd ms-lifecycle

vagrant plugin install vagrant-cachier

vagrant up cd swarm-master swarm-node-1 swarm-node-2

Provisioning

Nodes

Provisioning

Docker Swarm

vagrant ssh cd

cat /vagrant/ansible/swarm.yml

cat /vagrant/ansible/roles/docker/tasks/debian.yml

ansible-playbook /vagrant/ansible/swarm.yml \
    -i /vagrant/ansible/hosts/prod

Provisioning

Docker Swarm

Provisioning

Jenkins

cat /vagrant/ansible/jenkins-node-swarm.yml

cat /vagrant/ansible/jenkins.yml

ansible-playbook /vagrant/ansible/jenkins-node-swarm.yml \
    -i /vagrant/ansible/hosts/prod

ansible-playbook /vagrant/ansible/jenkins.yml \
    --extra-vars "main_job_src=service-healing-config.xml service_redeploy_src=service-redeploy-df-config.xml" \
    -c local

Provisioning

Jenkins

Prerequisite #1

Ability to deploy to any server inside a cluster that meets hardware requirements

Deployment

Cluster Orchestrator Tools

Deployment

The Code

git clone https://github.com/vfarcic/go-demo.git

cd go-demo

cat docker-compose-test.yml

docker-compose -f docker-compose-test.yml run --rm unit

ll

cat Dockerfile

docker build -t vfarcic/go-demo .

# docker push vfarcic/go-demo

Deployment

Run

docker-compose up -d db app

docker-compose ps

GO_DEMO_PORT=<PORT>

curl -i -XPUT localhost:$GO_DEMO_PORT/demo/hello

docker-compose down

docker-compose ps

Deployment

Run in cluster

export DOCKER_HOST=tcp://swarm-master:2375

docker ps -a --format "table {{.Names}}\t{{.Status}}"

docker info

docker-compose up -d db app

docker-compose ps

curl swarm-master:8500/v1/catalog/service/go-demo \
    | jq '.'

Deployment

Run

Deployment

Proxy

Deployment

Proxy

curl -i localhost/demo/hello

sudo docker run -d \
    --name docker-flow-proxy \
    -e CONSUL_ADDRESS=10.100.192.200:8500 \
    -p 80:80 -p 8081:8080 \
    vfarcic/docker-flow-proxy

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -i localhost/demo/hello

Deployment

Proxy: Operations

Deployment

Proxy: User

Deployment

Scale

docker-compose ps

docker-compose scale app=3

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -i localhost/demo/hello # Repeat a few times

docker-compose logs app

docker-compose down

Deployment

Scale With Limits

cat docker-compose-limits.yml

docker-compose -f docker-compose-limits.yml up -d db app-big

docker info

docker-compose -f docker-compose-limits.yml up -d db app-small

docker-compose -f docker-compose-limits.yml scale app-small=10

docker info

docker-compose -f docker-compose-limits.yml scale app-big=2

docker-compose -f docker-compose-limits.yml down

Deployment

Networking

docker-compose up -d db app

docker-compose scale app=2

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -i -XPUT localhost/demo/person?name=Viktor

curl -i -XPUT localhost/demo/person?name=Sara

curl -i localhost/demo/person

Deployment

Networking

docker-compose exec app ping -c 1 db

docker-compose exec db ping -c 1 godemo_app_1

docker-compose exec db ping -c 1 godemo_app_2

docker-compose down

Deployment

Networking

Prerequisite #2

Ability to deploy without downtime and to automatically scale and de-scale instances

Redeployment

Deployment

docker-compose up -d db app

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -i localhost/demo/hello

Redeployment

Deployment

Redeployment

Scaling

docker-compose up -d db app

docker-compose scale app=2

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -i localhost/demo/hello

curl -XPUT -d "10.100.198.200" \
    http://swarm-master:8500/v1/kv/proxy/ip

curl -XPUT -d "2" http://swarm-master:8500/v1/kv/go-demo/scale

Redeployment

Scaling

Redeployment

Relative Scaling

curl http://swarm-master:8500/v1/kv/go-demo/scale?raw

CURRENT_SCALE=$(curl http://swarm-master:8500/v1/kv/go-demo/scale?raw)

NEW_SCALE=$(($CURRENT_SCALE + 1))

docker-compose scale app=$NEW_SCALE

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -XPUT -d $NEW_SCALE http://swarm-master:8500/v1/kv/go-demo/scale

curl -i localhost/demo/hello

Redeployment

Relative Scaling

Redeployment

Relative De-scaling

CURRENT_SCALE=$(curl http://swarm-master:8500/v1/kv/go-demo/scale?raw)

NEW_SCALE=$(($CURRENT_SCALE - 1))

docker-compose scale app=$NEW_SCALE

docker-compose ps

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

curl -XPUT -d $NEW_SCALE http://swarm-master:8500/v1/kv/go-demo/scale

curl -i localhost/demo/hello

docker-compose down

Redeployment

Relative De-scaling

Redeployment

Standard Deployment

cat docker-compose-bg.yml

VERSION=:1.0 docker-compose -f docker-compose-bg.yml up -d db app

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

VERSION=:1.1 docker-compose -f docker-compose-bg.yml up -d db app

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo&servicePath=/demo" \
    | jq '.'

docker-compose -f docker-compose-bg.yml down

curl "localhost:8081/v1/docker-flow-proxy/remove?serviceName=go-demo" \
    | jq '.'

Redeployment

Standard Deployment

Redeployment

Standard Deployment

Redeployment

Blue-Green Deployment

cat docker-compose-bg.yml

export NEXT_COLOR=blue

VERSION=:1.0 docker-compose -f docker-compose-bg.yml \
    up -d db app-${NEXT_COLOR}

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo-${NEXT_COLOR}&servicePath=/demo" \
    | jq '.'

curl -i localhost/demo/hello

docker-compose -f docker-compose-bg.yml logs

curl -XPUT -d $NEXT_COLOR \
    http://swarm-master:8500/v1/kv/go-demo/color

Redeployment

Blue-Green Deployment

Redeployment

Blue-Green Deployment

COLOR=$(curl http://swarm-master:8500/v1/kv/go-demo/color?raw)

NEXT_COLOR=$(if [[ "$COLOR" == "blue" ]]; then echo "green"
else echo "blue"; fi)

VERSION=:1.1 docker-compose -f docker-compose-bg.yml \
    up -d db app-${NEXT_COLOR}

docker-compose -f docker-compose-bg.yml ps

curl -i localhost/demo/hello

docker-compose -f docker-compose-bg.yml logs

Redeployment

Blue-Green Deployment

Redeployment

Blue-Green Deployment

curl "localhost:8081/v1/docker-flow-proxy/reconfigure?serviceName=go-demo-${NEXT_COLOR}&servicePath=/demo" \
    | jq '.'

curl "localhost:8081/v1/docker-flow-proxy/remove?serviceName=go-demo-${COLOR}" \
    | jq '.'

curl -XPUT -d $NEXT_COLOR http://swarm-master:8500/v1/kv/go-demo/color

curl -i localhost/demo/hello # Repeat

docker-compose -f docker-compose-bg.yml logs

docker-compose -f docker-compose-bg.yml stop app-${COLOR}

docker-compose -f docker-compose-bg.yml ps

Redeployment

Blue-Green Deployment

Redeployment

Blue-Green Deployment

docker-compose -f docker-compose-bg.yml down

curl "localhost:8081/v1/docker-flow-proxy/remove?serviceName=go-demo-${NEXT_COLOR}" \
    | jq '.'

Redeployment

Docker Flow

export FLOW_PROXY_HOST=10.100.198.200

export FLOW_PROXY_RECONF_PORT=8081

export FLOW_CONSUL_ADDRESS=http://10.100.192.200:8500

export FLOW_PROXY_DOCKER_HOST=tcp://10.100.198.200:2375

docker-flow --flow=deploy --flow=proxy --flow=stop-old

docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

curl -i localhost/demo/hello

Redeployment

Docker Flow

Redeployment

Docker Flow

docker-flow --flow=deploy --flow=proxy --flow=stop-old

docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

curl -i localhost/demo/hello

Redeployment

Docker Flow

Redeployment

Docker Flow

docker-flow --scale="+2" --flow=scale --flow=proxy

docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

curl -i localhost/demo/hello

docker-flow --scale="-2" --flow=scale --flow=proxy

docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Redeployment

Docker Flow

Redeployment

Automation With Jenkins

git clone http://10.100.198.200:8080/workflowLibs.git /tmp/workflowLibs

cd /tmp/workflowLibs

git checkout -b master

mkdir vars

cp ~/go-demo/jenkins/vars/dockerFlowWorkshop.groovy \
    /tmp/workflowLibs/vars/dockerFlow.groovy

git config --global user.name "vfarcic"

git add --all && git commit -a -m "Docker Flow"

git push --set-upstream origin master

Redeployment

Automation With Jenkins

cd ~/go-demo

cat jenkins/vars/dockerFlowWorkshop.groovy

cat Jenkinsfile

Redeployment

Automation With Jenkins

http://10.100.198.200:8080 > New Item

Item Name: go-demo; Type: Multibranch Pipeline > OK

Add Source > Git

Project Repository: https://github.com/vfarcic/go-demo.git > Save

http://10.100.198.200:8080/job/go-demo/branch/master/

http://10.100.198.200:8080/job/go-demo/branch/master/lastBuild/console

Prerequisite #3

Ability to monitor hardware and dynamically adjust cluster capacity (elasticity)

Hardware Health Checks and Watches

Hard Disk Script

df -h

set -- $(df -h | awk '$NF=="/"{print $2" "$3" "$5}')

total=$1

used=$2

used_percent=${3::-1}

printf "Disk Usage: %s/%s (%s%%)\n" $used $total $used_percent

Hardware Health Checks and Watches

Hard Disk Script

exit

vagrant ssh swarm-master

sudo mkdir -p /data/consul/scripts

Hardware Health Checks and Watches

Hard Disk Script

echo '#!/usr/bin/env bash

set -- $(df -h | awk '"'"'$NF=="/"{print $2" "$3" "$5}'"'"')
total=$1
used=$2
used_percent=${3::-1}
printf "Disk Usage: %s/%s (%s%%)\n" $used $total $used_percent
if [ $used_percent -gt 95 ]; then
  exit 2
elif [ $used_percent -gt 80 ]; then
  exit 1
else
  exit 0
fi
' | sudo tee /data/consul/scripts/disk.sh

Hardware Health Checks and Watches

Hard Disk Script

sudo chmod +x /data/consul/scripts/disk.sh

/data/consul/scripts/disk.sh

echo $?

Hardware Health Checks and Watches

Hard Disk Check

echo '{
  "checks": [
    {
      "id": "disk",
      "name": "Disk utilization",
      "notes": "Critical 95% util, warning 80% util",
      "script": "/data/consul/scripts/disk.sh",
      "interval": "10s"
    }
  ]
}' | sudo tee /data/consul/config/consul_check.json

sudo killall -HUP consul

Hardware Health Checks and Watches

Hard Disk Check

http://10.100.192.200:8500/ui/ > Nodes > swarm-master

Hardware Health Checks and Watches

Hard Disk Check

echo '#!/usr/bin/env bash

read -r JSON

STATUS_ARRAY=($(echo "$JSON" | jq -r ".[].Status"))
CHECK_ID_ARRAY=($(echo "$JSON" | jq -r ".[].CheckID"))
LENGTH=${#STATUS_ARRAY[*]}

for (( i=0; i<=$(( $LENGTH -1 )); i++ ))
do
  CHECK_ID=${CHECK_ID_ARRAY[$i]}
  STATUS=${STATUS_ARRAY[$i]}
  echo -e "Triggering Jenkins job http://10.100.198.200:8080/job/hardware-notification/build"
  curl -X POST http://10.100.198.200:8080/job/hardware-notification/build \
    --data-urlencode json="{\"parameter\": [{\"name\":\"checkId\", \"value\":\"$CHECK_ID\"}, {\"name\":\"status\", \"value\":\"$STATUS\"}]}"
done' | sudo tee /data/consul/scripts/manage_watches.sh

sudo chmod +x /data/consul/scripts/manage_watches.sh

Hardware Health Checks and Watches

Hard Disk Check

echo '{
  "watches": [
    {
      "type": "checks",
      "state": "warning",
      "handler": "/data/consul/scripts/manage_watches.sh >>/data/consul/logs/watches.log"
    }, {
      "type": "checks",
      "state": "critical",
      "handler": "/data/consul/scripts/manage_watches.sh >>/data/consul/logs/watches.log"
    }
  ]
}' | sudo tee /data/consul/config/watches.json

Hardware Health Checks and Watches

Hard Disk Check

sudo sed -i "s/80/2/" /data/consul/scripts/disk.sh

sudo killall -HUP consul

cat /data/consul/logs/watches.log

exit

http://10.100.198.200:8080/job/hardware-notification/

Hardware Health Checks and Watches

Cluster Hardware Checks

vagrant ssh cd

export DOCKER_HOST=tcp://swarm-master:2375

cat /vagrant/ansible/swarm-healing.yml

cat /vagrant/ansible/roles/consul-healing/files/mem.sh

ansible-playbook /vagrant/ansible/swarm-healing.yml \
    -i /vagrant/ansible/hosts/prod \
    --extra-vars "elk_ip=10.100.198.200"

http://10.100.192.200:8500/ui/ > Nodes

Prerequisite #4

Ability to monitor services in (near)real-time and execute reactive actions

Reactive Healing

Service Checks

cat /vagrant/ansible/roles/consul-healing/templates/manage_watches.sh

# ansible-playbook /vagrant/ansible/swarm-healing.yml \
#    -i /vagrant/ansible/hosts/prod \
#    --extra-vars "elk_ip=10.100.198.200"

cd ~/go-demo

cat consul_service.ctmpl

cat consul_check.ctmpl

cat Jenkinsfile

Reactive Healing

Service Checks

docker rm -f $(docker ps -a | grep godemo | awk '{ print $1}')

curl -i http://10.100.198.200/demo/hello

http://10.100.198.200:8080/job/service-redeploy

curl -i http://10.100.198.200/demo/hello

Reactive Healing

Service Checks

sudo docker rm -f docker-flow-proxy

curl -i http://10.100.198.200/demo/hello

http://10.100.198.200:8080/job/service-redeploy/lastBuild/console

curl -i http://10.100.198.200/demo/hello

Prerequisite #5

Ability to predict the future and execute proactive actions

Proactive Healing

Scheduled Scaling

http://10.100.198.200:8080/job/go-demo-scale/configure

Build With Parameters > Build

http://10.100.198.200:8080/job/go-demo-scale/lastBuild/console

Proactive Healing

Scheduled Descaling

http://10.100.198.200:8080/job/go-demo-descale/configure

Build With Parameters > Build

http://10.100.198.200:8080/job/go-demo-descale/lastBuild/console

Proactive Healing

Proxy Stats

http://10.100.198.200/admin?stats

http://10.100.198.200/admin?stats;csv

Proactive Healing

ELK Setup

cat /vagrant/ansible/roles/logstash/files/haproxy.conf

ansible-playbook /vagrant/ansible/elk-local.yml \
    --extra-vars "logstash_config=haproxy.conf" -c local

sudo docker logs logstash

http://10.100.198.200:5601/

Proactive Healing

LogStash API

curl http://10.100.198.200:9200/logstash-*/_search | jq '.'

Proactive Healing

LogStash API

curl http://10.100.198.200:9200/logstash-*/_search -d '{
  "query": {
    "bool": {
      "must": { "match": { "tags" : "haproxy_stats" } },
      "must": { "match": { "haproxy_stats.svname" : "BACKEND" } },
      "must": { "range": { "@timestamp": { "gt" : "now-1h" } } }
    }
  }
}' | jq '.'

Proactive Healing

LogStash API

curl http://10.100.198.200:9200/logstash-*/_search -d '{
  "size" : 0,
  "query": {
    "bool": {
      "must": { "match": { "tags" : "haproxy_stats" } },
      "must": { "match": { "haproxy_stats.svname" : "BACKEND" } },
      "must": { "range": { "@timestamp": { "gt" : "now-1h" } } }
    }
  },
  "aggs" : {
    "services" : {
      "terms" : { "field" : "haproxy_stats.pxname.raw" },
      "aggs": { "avg_rtime": { "avg": { "field": "haproxy_stats.rtime" } } }
    }
  }
}' | jq '.'

Proactive Healing

LogStash API

curl http://10.100.198.200:9200/logstash-*/_search -d '{
  "size" : 0,
  "query": {
    "bool": {
      "must": { "match": { "tags" : "haproxy_stats" } },
      "must": { "match": { "haproxy_stats.svname" : "BACKEND" } },
      "must": { "match": { "haproxy_stats.pxname.raw" : "go-demo-app-be" } },
      "must": { "range": { "@timestamp": { "gt" : "now-1h" } } }
    }
  },
  "aggs" : {
    "avg_rtime" : {
      "avg": { "field": "haproxy_stats.rtime" }
    }
  }
}' | jq '.'

Proactive Healing

Consul / Jenkins

cat /vagrant/ansible/roles/consul-healing/templates/rtime_up.sh

cat /vagrant/ansible/roles/consul-healing/templates/rtime_down.sh

cat /vagrant/ansible/roles/consul-healing/templates/manage_watches.sh

cat consul_check.ctmpl

http://10.100.198.200:8080/job/service-scale/configure

Proactive Healing

Consul

exit

vagrant ssh swarm-master

sudo sed -i "s/1000 2000/1 2/" /data/consul/config/go-demo_check.json

sudo killall -HUP consul

exit

vagrant ssh cd

curl localhost/demo/hello?delay=2000 # Repeat

http://10.100.198.200:8080/job/service-scale/1/console

http://10.100.192.200:8500/ui

Proactive Healing

Consul

exit

vagrant ssh swarm-master

sudo sed -i "s/1 2/1000 2000/" /data/consul/config/go-demo_check.json

sudo killall -HUP consul

http://10.100.192.200:8500/ui

Viktor Farcic

@vfarcic

TechnologyConversations.com

Viktor Farcic

Cleanup

exit

vagrant destroy -f

Self-Healing Systems Workshop

Viktor Farcic

@vfarcic

TechnologyConversations.com

CloudBees.com

Viktor Farcic

Continuous Deployment

Continuous Deployment

The Second Half

What Are We Trying to Accomplish?

Perfection?

What Are We Trying to Accomplish?

There is no such thing as perfection

What Are We Trying to Accomplish?

Self-Replicating System

Life

What Are We Trying to Accomplish?

Resilient System

Cells

What Are We Trying to Accomplish?

Body Equivalent?

Self-Healing System

Self-Healing System

Levels

Self-Healing System

Types

Tools

Tools

Cluster Orchestrator

Tools

Realtime Registry

Tools

Historical Registry

Tools

Monitoring

Tools

General Orchestrator

Provisioning

Provisioning

Nodes

Provisioning

Nodes

Provisioning

Docker Swarm

Provisioning

Docker Swarm

Provisioning

Jenkins

Provisioning

Jenkins

Prerequisite #1

Deployment

Cluster Orchestrator Tools

Deployment

The Code

Deployment

Run

Deployment

Run in cluster

Deployment

Run

Deployment

Proxy

Deployment

Proxy

Deployment

Proxy: Operations

Deployment

Proxy: User

Deployment

Scale

Deployment

Scale With Limits

Deployment

Networking

Deployment

Networking

Deployment

Networking

Prerequisite #2