Manage your ZDM Proxy instances
After you deploy ZDM Proxy instances, you might need to perform various management operations, such as rolling restarts, configuration changes, log inspection, version upgrades, and infrastructure changes.
If you are using ZDM Proxy Automation, you can use Ansible playbooks for all of these operations.
Perform a rolling restart of the proxies
Rolling restarts of the ZDM Proxy instances are useful to apply configuration changes or to upgrade the ZDM Proxy version without impacting the availability of the deployment.
A rolling restart is a destructive action because it stops the previous containers, and then starts new containers. Collect the logs before you apply the configuration change if you want to keep them.
-
With ZDM Proxy Automation
-
Without ZDM Proxy Automation
If you use ZDM Proxy Automation to manage your ZDM Proxy deployment, you can use a dedicated playbook to perform rolling restarts of all ZDM Proxy instances in a deployment:
-
Connect to your Ansible Control Host container.
For example, ssh into the jumphost:
ssh -F ~/.ssh/zdm_ssh_config jumphost
Then, connect to the Ansible Control Host container:
docker exec -it zdm-ansible-container bash
Result:
ubuntu@52772568517c:~$
-
Run the rolling restart playbook:
ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory
The rolling restart playbook recreates each ZDM Proxy container, one by one. The ZDM Proxy deployment remains available at all times, and you can safely use it throughout this operation. If you modified mutable configuration variables, the new containers use the updated configuration files.
The playbook performs the following actions automatically:
-
ZDM Proxy Automation stops one container gracefully, and then waits for it to shut down.
-
ZDM Proxy Automation recreates the container, and then starts it.
-
ZDM Proxy Automation calls the readiness endpoint to check the container’s status:
-
If the status check fails, ZDM Proxy Automation repeats the check up to six times at 5-second intervals. If all six attempts fail, ZDM Proxy Automation interrupts the entire rolling restart process.
-
If the check succeeds, ZDM Proxy Automation waits a fixed amount of time, and then moves on to the next container. The default pause between containers is 10 seconds. You can change the pause duration in
zdm-proxy-automation/ansible/vars/zdm_playbook_internal_config.yml.
-
If you don’t use ZDM Proxy Automation, you must manually restart each instance.
To avoid downtime, wait for each instance to fully restart and begin receiving traffic before restarting the next instance.
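If you manage the containers directly with Docker, a manual rolling restart of one instance might look like the following sketch. The container name (zdm-proxy-container), health port (14001), and readiness path are assumptions based on common ZDM Proxy defaults; adjust them to match how your instances were started.
# Restart one instance at a time.
docker restart zdm-proxy-container
# Wait until the instance reports ready before moving on to the next one.
until curl -sf http://localhost:14001/health/readiness > /dev/null; do
  sleep 5
done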
Inspect ZDM Proxy logs
ZDM Proxy logs can help you verify that your ZDM Proxy instances are operating normally, investigate how processes are executed, and troubleshoot issues. For information about configuring, retrieving, and interpreting ZDM Proxy logs, see Viewing and interpreting ZDM Proxy logs.
Change mutable configuration variables
Some, but not all, configuration variables can be changed after you deploy a ZDM Proxy instance.
This section lists the mutable configuration variables that you can change on an existing ZDM Proxy deployment.
After you edit mutable variables in their corresponding configuration files (vars/zdm_proxy_core_config.yml, vars/zdm_proxy_cluster_config.yml, or vars/zdm_proxy_advanced_config.yml), you must perform a rolling restart to apply the configuration changes to your ZDM Proxy instances.
Mutable variables in vars/zdm_proxy_core_config.yml
-
primary_cluster: Determines which cluster is currently considered the primary cluster, either ORIGIN or TARGET.
At the start of the migration, the primary cluster is the origin cluster because it contains all of the data. After all the existing data has been transferred and validated/reconciled on the target cluster, you can switch the primary cluster to the target cluster.
-
read_mode: Determines how reads are handled by ZDM Proxy:
-
PRIMARY_ONLY (default): Reads are sent synchronously to the primary cluster only.
-
DUAL_ASYNC_ON_SECONDARY: Reads are sent synchronously to the primary cluster, and also asynchronously to the secondary cluster. See Phase 3: Enable asynchronous dual reads.
Typically, you only set read_mode to DUAL_ASYNC_ON_SECONDARY if the primary_cluster variable is set to ORIGIN. This is because asynchronous dual reads are primarily intended to help test production workloads against the target cluster near the end of the migration. When you are ready to switch primary_cluster to TARGET, revert read_mode to PRIMARY_ONLY because there is no need to send reads to both clusters at that point in the migration.
-
log_level: Set the ZDM Proxy log level as INFO (default) or DEBUG.
Only use DEBUG while temporarily troubleshooting an issue. Revert to INFO as soon as possible because the extra logging can impact performance slightly.
For more information, see Check ZDM Proxy logs.
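For example, a core configuration that enables asynchronous dual reads while the origin cluster is still primary might look like the following illustrative excerpt of vars/zdm_proxy_core_config.yml (your file may contain additional settings):
# Mid-migration: origin is still primary, dual reads exercise the target.
primary_cluster: ORIGIN
read_mode: DUAL_ASYNC_ON_SECONDARY
log_level: INFO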
Mutable variables in vars/zdm_proxy_cluster_config.yml
-
Origin username and password
-
Target username and password
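For example, updated credentials might look like the following hypothetical excerpt of vars/zdm_proxy_cluster_config.yml. The key names shown here are assumptions; match them to the keys that already exist in your file.
# Hypothetical key names; use the keys already present in your cluster config file.
origin_username: origin_app_user
origin_password: origin_app_password
target_username: target_app_user
target_password: target_app_password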
Mutable variables in vars/zdm_proxy_advanced_config.yml
-
zdm_proxy_max_clients_connections: The maximum number of client connections that ZDM Proxy can accept. Each client connection results in additional cluster connections and causes the allocation of several in-memory structures. A high number of client connections per proxy instance can cause performance degradation, especially at high throughput. Adjust this variable to limit the total number of connections on each instance.
Default: 1000
-
replace_cql_functions: Whether ZDM Proxy replaces standard now() CQL function calls in write requests with an explicit timeUUID value computed at proxy level.
If false (default), replacement of now() is disabled. If true, ZDM Proxy replaces instances of now() in write requests with an explicit timeUUID value before sending the write to each cluster.
Enabling replace_cql_functions has a noticeable performance impact because the proxy must do more extensive parsing and manipulation of the statements before sending the modified statement to each cluster. Only enable this variable if required, and implement proper performance testing to quantify and prepare for the performance impact.
If you use now() to populate a regular (non-primary key) column, consider whether you can pragmatically accept a slight discrepancy in the values between the origin and target clusters for these writes. This depends on your application, and whether it can tolerate a potential difference of a few milliseconds.
However, if you use now() to populate a primary key column, differences between the origin and target values result in different primary keys. This means that the same row on the origin and target clusters is technically considered a different record, which causes problems with duplicate entries that aren't caught by validation (because the primary keys are different). If now() is used in any of your primary key columns, DataStax recommends enabling replace_cql_functions, regardless of the performance impact.
For more information, see Server-side non-deterministic functions in the primary key.
-
zdm_proxy_request_timeout_ms: Global timeout in milliseconds of a request at proxy level. Determines how long ZDM Proxy waits for one cluster (for reads) or both clusters (for writes) to reply to a request. Upon reaching the timeout limit, ZDM Proxy abandons the request and no longer considers it pending, which frees up internal resources to process other requests.
When a request is abandoned due to a timeout, ZDM Proxy doesn't return any result or error. A timeout warning or error is only returned when the client application's own timeout is reached and the request is expired on the driver side.
Make sure zdm_proxy_request_timeout_ms is always greater than your client application's client-side timeout. If the client has an especially high timeout because it routinely generates long-running requests, you must increase zdm_proxy_request_timeout_ms accordingly so that ZDM Proxy doesn't abandon requests prematurely.
Default: 10000
-
origin_connection_timeout_ms and target_connection_timeout_ms: Timeout in milliseconds for establishing a connection from the proxy to the origin or target cluster, respectively.
Default: 30000
-
async_handshake_timeout_ms: Timeout in milliseconds for the initialization (handshake) of the connection that is used solely for asynchronous dual reads between the proxy and the secondary cluster.
Upon reaching the timeout limit, the asynchronous reads aren't sent because the connection failed to be established. This has no impact on the handling of synchronous requests: ZDM Proxy continues to handle all synchronous reads and writes as normal against the primary cluster.
Default: 4000
-
heartbeat_interval_ms: The interval in milliseconds at which heartbeats are sent to keep idle cluster connections alive. This includes all control and request connections to the origin and the target clusters.
Default: 30000
-
metrics_enabled: Whether to enable metrics collection.
If false, ZDM Proxy metrics collection is completely disabled. This isn't recommended.
Default: true (enabled)
-
zdm_proxy_max_stream_ids: Set the maximum pool size of available stream IDs managed by ZDM Proxy per client connection. Use the same value as your driver's maximum stream IDs configuration.
In the CQL protocol, every request has a unique stream ID. However, if there are a lot of requests in a given amount of time, errors can occur due to stream ID exhaustion.
In the client application, the stream IDs are managed internally by the driver, and, in most drivers, the maximum number is 2048, which is the same default value used by ZDM Proxy. If you have a custom driver configuration with a higher value, make sure zdm_proxy_max_stream_ids matches your driver's maximum stream IDs.
Default: 2048
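As a quick reference, the following sketch shows these variables with their documented defaults, written as they might appear in vars/zdm_proxy_advanced_config.yml. Your file may ship with some of these entries commented out or include additional settings.
# Documented defaults for the mutable advanced settings; change only what you need,
# then perform a rolling restart.
zdm_proxy_max_clients_connections: 1000
replace_cql_functions: false
zdm_proxy_request_timeout_ms: 10000
origin_connection_timeout_ms: 30000
target_connection_timeout_ms: 30000
async_handshake_timeout_ms: 4000
heartbeat_interval_ms: 30000
metrics_enabled: true
zdm_proxy_max_stream_ids: 2048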
Deprecated mutable variables
Deprecated variables will be removed in a future ZDM Proxy release. Replace them with their recommended alternatives as soon as possible.
-
forward_client_credentials_to_origin: Whether to use the credentials provided by the client application to connect to the origin cluster. If false (default), the credentials from the client application were used to connect to the target cluster. If true, the credentials from the client application were used to connect to the origin cluster.
This deprecated variable is no longer functional. Instead, the expected credentials are based on the authentication requirements of the origin and target clusters. For more information, see Client application credentials.
Change immutable configuration variables
All configuration variables not listed in Change mutable configuration variables are immutable and can only be changed by recreating the deployment with the initial deployment playbook (deploy_zdm_proxy.yml):
ansible-playbook deploy_zdm_proxy.yml -i zdm_ansible_inventory
You can re-run the deployment playbook as many times as necessary. However, this playbook decommissions and recreates all ZDM Proxy instances simultaneously. This results in a brief period of time where the entire ZDM Proxy deployment is offline because no instances are available.
For more information, see Configuration changes aren’t applied by ZDM Proxy Automation.
Upgrade the proxy version
The same playbook that you use for configuration changes can also be used to upgrade the ZDM Proxy version in a rolling fashion. All containers are recreated with the given image version.
A version change is a destructive action because the rolling restart playbook removes the previous containers and their logs, replacing them with new containers using the new image. Collect the logs before you run the playbook if you want to keep them.
To check your current ZDM Proxy version, see Check your ZDM Proxy version.
-
In vars/zdm_proxy_container.yml, set zdm_proxy_image to the desired tag. For available tags, see the ZDM Proxy Docker Hub repository.
zdm_proxy_image: datastax/zdm-proxy:TAG
For example:
zdm_proxy_image: datastax/zdm-proxy:2.3.4
-
Perform a rolling restart to update all ZDM Proxy instances to the new version.
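For example, the whole upgrade might look like the following sketch, run from the ansible directory of zdm-proxy-automation inside the Ansible Control Host container. The sed command assumes the zdm_proxy_image line already exists uncommented in the file; you can also edit the file by hand. The 2.3.4 tag is only an example.
# Point the deployment at the new image tag, then roll it out.
sed -i 's|^zdm_proxy_image:.*|zdm_proxy_image: datastax/zdm-proxy:2.3.4|' vars/zdm_proxy_container.yml
ansible-playbook rolling_update_zdm_proxy.yml -i zdm_ansible_inventory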
Scale ZDM Proxy instances
-
Scale with ZDM Proxy Automation
-
Scale without ZDM Proxy Automation
ZDM Proxy Automation doesn’t provide a way to scale a deployment up or down in a rolling fashion. If you are using ZDM Proxy Automation and you need a larger ZDM Proxy deployment, you can create a new deployment, or you can add instances to an existing deployment.
-
Create a new deployment (recommended)
-
Add instances to an existing deployment
This option is the recommended way to scale your ZDM Proxy deployment because it requires no downtime.
Create a new ZDM Proxy deployment, and then reconfigure your client application to use the new deployment:
-
Create a new ZDM Proxy deployment with the desired topology on a new set of machines.
-
Change the contact points in the application configuration so that the application instances point to the new ZDM Proxy deployment.
-
Perform a rolling restart of the application instances to apply the new contact point configuration.
The rolling restart ensures there is no interruption of service. The application instances switch seamlessly from the old deployment to the new one, and they are able to serve requests immediately.
-
After restarting all application instances, you can safely remove the old ZDM Proxy deployment.
This option requires manual configuration and a small amount of downtime.
Change the topology of your existing ZDM Proxy deployment, and then restart the entire deployment to apply the change:
-
Amend the inventory file so that it contains one line for each machine where you want to deploy a ZDM Proxy instance.
For example, if you want to add three nodes to a deployment with six nodes, then the amended inventory file must contain nine total IPs, including the six existing IPs and the three new IPs, as shown in the sketch after these steps.
-
Run the deploy_zdm_proxy.yml playbook to apply the change and start the new instances:
ansible-playbook deploy_zdm_proxy.yml -i zdm_ansible_inventory
Rerunning the playbook stops the existing instances, destroys them, and then creates and starts a new deployment with new instances based on the amended inventory. This results in a brief interruption of service for your entire ZDM Proxy deployment.
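For reference, an amended inventory for scaling from six to nine instances might look like the following hypothetical sketch. The group name and host format are assumptions; mirror the structure of your existing zdm_ansible_inventory file.
[proxies]
172.18.10.1
172.18.10.2
172.18.10.3
172.18.10.4
172.18.10.5
172.18.10.6
# New instances added below.
172.18.10.7
172.18.10.8
172.18.10.9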
If you aren’t using ZDM Proxy Automation, use these steps to add, change, or remove ZDM Proxy instances.
-
Add an instance
-
Vertically scale existing instances
-
Remove an instance
-
Prepare and configure the new ZDM Proxy instance appropriately, based on your other instances.
Make sure the new instance’s configuration references all planned ZDM Proxy instances.
-
On all ZDM Proxy instances, add the new instance’s address to the ZDM_PROXY_TOPOLOGY_ADDRESSES environment variable.
Make sure to include all new nodes, as shown in the sketch after these steps.
-
On the new ZDM Proxy instance, set the ZDM_PROXY_TOPOLOGY_INDEX to the next sequential integer after the greatest one in your existing deployment.
-
Perform a rolling restart of all ZDM Proxy instances, one at a time.
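For example, adding a fourth instance to a three-instance deployment might involve environment values like the following hypothetical sketch. The IPs are placeholders, and the sketch assumes the existing instances use indexes 0 through 2 and that the address list is comma separated; verify the format against your existing configuration.
# On every instance, including the new one, list all four instances:
ZDM_PROXY_TOPOLOGY_ADDRESSES=172.18.10.1,172.18.10.2,172.18.10.3,172.18.10.4
# Only on the new instance, use the next index after the existing 0, 1, and 2:
ZDM_PROXY_TOPOLOGY_INDEX=3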
Use these steps to increase or decrease resources for existing ZDM Proxy instances, such as CPU or memory. To avoid downtime, perform the following steps on one instance at a time:
-
Stop the first ZDM Proxy instance that you want to modify.
-
Modify the instance’s resources as required.
Make sure the instance’s IP address remains the same. If the IP address changes, you must treat it as a new instance; follow the steps on the Add an instance tab.
-
Restart the modified ZDM Proxy instance.
-
Wait until the instance starts, and then confirm that it is receiving traffic.
-
Repeat these steps to modify each additional instance, one at a time.
-
On all ZDM Proxy instances, remove the unused instance’s address from the ZDM_PROXY_TOPOLOGY_ADDRESSES environment variable.
-
Perform a rolling restart of all remaining ZDM Proxy instances.
-
Clean up resources used by the removed instance, such as the container or VM.
Proxy topology addresses enable failover and high availability
When you configure a ZDM Proxy deployment, either through ZDM Proxy Automation or with manually managed ZDM Proxy instances, you specify the addresses of your instances.
These are populated in the ZDM_PROXY_TOPOLOGY_ADDRESSES variable, either manually or automatically depending on how you manage your instances.
Cassandra drivers look up nodes on a cluster by querying the system.peers table.
ZDM Proxy uses the topology addresses to populate its response to that query, so the driver discovers every proxy instance as a node it can connect to.
If there are no topology addresses specified, ZDM Proxy defaults to a single-instance configuration.
This means that driver connections use only that one ZDM Proxy instance rather than all instances in your ZDM Proxy deployment.
If that one instance goes down, the driver won’t know that there are other instances available, and your application can experience an outage. Additionally, if you need to restart ZDM Proxy instances, and there is only one instance specified in the topology addresses, your migration will have downtime while that one instance restarts.