
A-HA 60k+ nodes Tuning Recommendations

A good read for additional operating-system-level tuning: https://community.progress.com/s/article/Chef-Automate-Deployment-Planning-and-Performance-tuning-transcribed-from-Scaling-Chef-Automate-Beyond-100-000-nodes

These recommendations assume a combined cluster running with minimum server specs of:

  • 7 FE Nodes:
    • 8-16 CPU cores, 32GB RAM
  • 3 BE PGSQL Nodes:
    • 8-16 CPU cores, 32-64GB RAM, 1TB SSD hard drive space
  • 5 BE OpenSearch Nodes:
    • 16 CPU cores, 64GB RAM, 15TB SSD hard drive space

You will also get more mileage by creating separate clusters for infra-server and Automate, which allows separate PGSQL and OpenSearch clusters for each application.


#1 Apply to all BEs for PGSQL via `chef-automate config patch pgsql-be-patch.toml --pg`

*Note: Requires a manual service restart of the leader for the settings to take effect (see the restart sketch after the config block below).

```toml
# PGSQL connections
[postgresql.v1.sys.pg]
  max_connections = 5000
  max_wal_size = "8GB"
  wal_sender_timeout = 300000
  wal_receiver_timeout = 300000
  wal_keep_size = 32768
  wal_compression = "on"
  checkpoint_timeout = "15min"
  checkpoint_completion_target = 0.9
```
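
The note above calls for a manual restart of the PGSQL leader. A minimal sketch, assuming the hab-sup-managed services used elsewhere in this guide (restarting hab-sup restarts the supervised PGSQL service):

```bash
# On the PGSQL leader node, after the patch has been applied
systemctl restart hab-sup
# watch the service come back and confirm the node rejoins the cluster
journalctl -fu hab-sup
```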

*Note: Only needed for A-HA cluster versions below 4.13.76. Version 4.13.76 removed the need for an HAProxy connection to PGSQL; FE nodes now connect directly to PGSQL.

The PGSQL servers' haproxy service isn't configurable via `chef-automate config patch`. Below are the steps to update the haproxy service.

Get the current HAProxy config and update it with the new parameters.

Note: run this on a DB backend, normally a follower.

```bash
source /hab/sup/default/SystemdEnvironmentFile.sh
automate-backend-ctl applied --svc=automate-ha-haproxy | tail -n +2 > haproxy_config.toml
# note: haproxy_config.toml may be blank. This only captures any local customisations that might have occurred
```

Add the new parameters to haproxy_config.toml:

```toml
# HaProxy config
# Global
maxconn = 2000
# Backend Servers
[server]
maxconn = 1500
```

Apply the change as below on a single DB backend:

```bash
hab config apply automate-ha-haproxy.default $(date '+%s') haproxy_config.toml
```

Note: this will propagate to all 3 backend DBs and restart the haproxy service on each backend, causing a short outage (only a few minutes). A complete DB restart is still required as follows (the only robust way is to restart all DB backends; do not skip the steps below):

Restart follower01 and follower02, then the leader, as below. Wait for sync between each restart.

On the followers:

```bash
systemctl stop hab-sup
systemctl start hab-sup
journalctl -fu hab-sup
```

On the leader:

```bash
systemctl stop hab-sup
# wait until a leader is elected from the other 2 old followers. Only then do the start
systemctl start hab-sup
# check the synchronization
journalctl -fu hab-sup
```
After the restart, cat the following file on all 3 BE PGSQL nodes to be sure the settings have taken effect (i.e. verify the `maxconn = 1500` setting is present):

```bash
cat /hab/svc/automate-ha-haproxy/config/haproxy.conf
```
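
As a shortcut, a grep one-liner can surface both maxconn values at once (same path, run on each BE node):

```bash
grep -n 'maxconn' /hab/svc/automate-ha-haproxy/config/haproxy.conf
```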

#2 Apply to all BEs for OpenSearch via `chef-automate config patch opensearch-be-patch.toml --os`

Fix for knife search when node count is over 10k. First run this on a BE node for the embedded OpenSearch:

```bash
curl -XPUT "http://127.0.0.1:10144/chef/_settings" -d '{"index": {"max_result_window": 100000}}' -H "Content-Type: application/json"
```
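
To confirm the setting took, the same settings endpoint used in the verification step under #5 can be scoped to the chef index (expect to see max_result_window under the index settings):

```bash
curl "http://127.0.0.1:10144/chef/_settings?pretty"
```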

Then run the config patch with the TOML file below:

```toml
# Cluster Ingestion
[opensearch.v1.sys.cluster]
  max_shards_per_node = 6000
# JVM Heap
[opensearch.v1.sys.runtime]
  heapsize = "32g" # 50% of total memory up to 32GB
```
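
The heapsize comment above encodes a rule of thumb: 50% of total RAM, capped at 32GB. A minimal sketch to compute it on a Linux node, as an illustration only:

```bash
# half of total RAM in GB, capped at 32 (the usual compressed-oops threshold)
awk '/MemTotal/ { h = int($2 / 1024 / 1024 / 2); if (h > 32) h = 32; print h "g" }' /proc/meminfo
```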

#3 Apply to all FEs for Automate via `chef-automate config patch automate-fe-patch.toml --a2`

```toml
# Worker Processes
[load_balancer.v1.sys.ngx.main]
  worker_processes = 10 # Not to exceed 10 or max number of cores
[esgateway.v1.sys.ngx.main]
  worker_processes = 10 # Not to exceed 10 or max number of cores
```
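
Since worker_processes should not exceed the smaller of 10 and the core count, a quick check on each FE before patching:

```bash
nproc   # available CPU cores; set worker_processes to min(10, this value)
```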

#4 Apply to all FEs for infra-server via `chef-automate config patch infra-fe-patch.toml --cs`

```toml
# Cookbook Version Cache
[erchef.v1.sys.api]
  cbv_cache_enabled = true

# Worker Processes
[load_balancer.v1.sys.ngx.main]
  worker_processes = 10 # Not to exceed 10 or max number of cores
[cs_nginx.v1.sys.ngx.main]
  worker_processes = 10 # Not to exceed 10 or max number of cores
[esgateway.v1.sys.ngx.main]
  worker_processes = 10 # Not to exceed 10 or max number of cores

# knife search fix for nodes over 10k
[erchef.v1.sys.index] # For Automate version 4.13.76 and newer
  track_total_hits = true

# CB Depsolver
# Depsolver tuning parameters assume a chef workload of roles/envs/cookbooks
# If only using policyfiles instead of roles/envs, depsolver tuning is not required
[erchef.v1.sys.depsolver]
  timeout = 10000
  pool_init_size = 50
  pool_max_size = 50
  pool_queue_max = 512
  pool_queue_timeout = 10000

# Connection Pools
[erchef.v1.sys.data_collector]
  pool_init_size = 100
  pool_max_size = 100
[erchef.v1.sys.sql]
  timeout = 5000
  pool_init_size = 100
  pool_max_size = 100
  pool_queue_max = 512
  pool_queue_timeout = 10000
[bifrost.v1.sys.sql]
  timeout = 5000
  pool_init_size = 100
  pool_max_size = 100
  pool_queue_max = 512
  pool_queue_timeout = 10000
[erchef.v1.sys.authz]
  timeout = 10000
  pool_init_size = 100
  pool_max_size = 100
  pool_queue_max = 512
  pool_queue_timeout = 10000

[pg_gateway.v1]
  [pg_gateway.v1.sys]
    max_connections = 500
```
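
As a back-of-envelope sanity check (assuming pg_gateway's max_connections is a per-FE cap): 7 FE nodes × 500 = 3,500 connections toward PGSQL at worst, which fits within the `max_connections = 5000` set in #1 while leaving headroom for replication and administrative sessions.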

#5 Optional settings

Add the client IP to the x-forwarded-for header for tracing requests.

Patch frontend nodes via `chef-automate config patch x-forward-patch.toml --fe`

```toml
[global.v1.sys.ngx.http]
  include_x_forwarded_for = true
```

For Automate version 4.13.76 and newer (knife search fix for nodes over 10k):

On an OpenSearch node, run:

```bash
curl -XPUT "http://127.0.0.1:10144/chef/_settings" \
    -d '{
          "index": {
            "max_result_window": 50000
          }
        }' \
    -H "Content-Type: application/json"
```

To verify the setting, run:

```bash
curl "http://127.0.0.1:10144/_settings?pretty"
```

Then patch frontend nodes via `chef-automate config patch knife-patch.toml --fe`:

```toml
[erchef.v1.sys.index]
  track_total_hits = true
```
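
Once this patch lands, a quick way to exercise the fix from a workstation (assuming a configured knife pointing at this infra-server):

```bash
# should report the full node count instead of capping results at 10,000
knife search node '*:*' -i
```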