Troubleshoot Zero Downtime Migration
This page provides general troubleshooting advice and describes some common issues you might encounter with Zero Downtime Migration (ZDM).
For additional assistance, you can report an issue, contact your DataStax account representative, or contact DataStax Support.
Check ZDM Proxy logs
ZDM Proxy logs can help you verify that your ZDM Proxy instances are operating normally, investigate how processes are executed, and troubleshoot issues.
Set the ZDM Proxy log level
Set the ZDM Proxy log level to print the messages that you need.
The default log level is INFO, which is adequate for most logging.
If you need more detail for temporary troubleshooting, you can set the log level to DEBUG.
However, this can slightly degrade performance, and DataStax recommends that you revert to INFO logging as soon as possible.
How you set the log level depends on how you deployed ZDM Proxy:
- If you used ZDM Proxy Automation to deploy ZDM Proxy, set log_level in vars/zdm_proxy_core_config.yml, and then run the rolling_update_zdm_proxy.yml playbook. For more information, see Change a mutable configuration variable.
- If you didn't use ZDM Proxy Automation to deploy ZDM Proxy, set the ZDM_LOG_LEVEL environment variable on each proxy instance, and then restart each instance, as shown in the sketch after this list.
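For example, for a manual Docker deployment, one way to apply the change is to recreate the container with the new ZDM_LOG_LEVEL value. This is only a sketch: the container name, image tag, and any other options shown here are placeholders that must match your existing deployment.
# Recreate the container with DEBUG logging. Reuse all of the ZDM_* environment
# variables, port mappings, and network options from your existing deployment.
docker stop zdm-proxy-container && docker rm zdm-proxy-container
docker run -d --name zdm-proxy-container \
  -e ZDM_LOG_LEVEL=DEBUG \
  datastax/zdm-proxy:TAG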
Get ZDM Proxy log files
If you used ZDM Proxy Automation to deploy ZDM Proxy, then you can get logs for a single proxy instance, and you can use a playbook to retrieve logs for all instances.
View or tail logs for one instance
ZDM Proxy runs as a Docker container on each proxy host.
To view the logs for a single ZDM Proxy instance, connect to a proxy host, and then run the following command:
docker container logs zdm-proxy-container
To tail (stream) the logs as they are written, use the --follow (-f) option:
docker container logs zdm-proxy-container -f
Keep in mind that Docker logs are deleted if the container is recreated.
Collect logs for multiple instances
ZDM Proxy Automation has a dedicated playbook, collect_zdm_proxy_logs.yml, that you can use to collect logs for all ZDM Proxy instances in a deployment.
You can view the playbook’s configuration in vars/zdm_proxy_log_collection_config.yml, but no changes are required to run it.
- Connect to the Ansible Control Host Docker container. You can do this from the jumphost machine by running the following command:
docker exec -it zdm-ansible-container bash
Result:
ubuntu@52772568517c:~$
- Run the log collection playbook:
ansible-playbook collect_zdm_proxy_logs.yml -i zdm_ansible_inventory
This playbook creates a single zip file, zdm_proxy_logs_TIMESTAMP.zip, that contains the logs from all proxy instances. This archive is stored on the Ansible Control Host Docker container at /home/ubuntu/zdm_proxy_archived_logs.
- To copy the archive from the container to the jumphost, open a shell on the jumphost, and then run the following command:
docker cp zdm-ansible-container:/home/ubuntu/zdm_proxy_archived_logs/zdm_proxy_logs_TIMESTAMP.zip DESTINATION_DIRECTORY_ON_JUMPHOST
Replace the following:
- TIMESTAMP: The timestamp from the name of your log file archive.
- DESTINATION_DIRECTORY_ON_JUMPHOST: The path to the directory where you want to copy the archive.
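After copying the archive, you can extract it and scan it for errors. The following commands are a sketch that assumes standard unzip and grep tools are available on the jumphost:
unzip zdm_proxy_logs_TIMESTAMP.zip -d zdm_proxy_logs
grep -r "level=error" zdm_proxy_logs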
Get logs for deployments that don't use ZDM Proxy Automation
If you didn't use ZDM Proxy Automation to deploy ZDM Proxy, you must access the logs another way, depending on your deployment configuration and infrastructure.
For example, if you used Docker, you can use the following command to export a container’s logs to a log.txt file:
docker logs my-container > log.txt
Keep in mind that Docker logs are deleted if the container is recreated.
Message levels
Some log messages contain text that sounds like an error but isn't an actual error.
Instead, check the message's level, which indicates severity:
- level=info: Expected and normal messages that typically aren't errors.
- level=debug: Expected and normal messages that typically aren't errors. However, they can help you find the source of a problem by providing information about the environment and conditions when the error occurred. debug messages are only recorded if you set the log level to DEBUG.
- level=warn: Reports an event that wasn't fatal to the overall process but might indicate an issue with an individual request or connection.
- level=error: Indicates an issue with ZDM Proxy, the client application, or the clusters. These messages require further examination.
If the meaning of a warn or error message isn’t clear, you can report an issue.
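For example, in a Docker-based deployment like the one described above, you can filter a running instance's output down to warnings and errors. The container name is assumed to be zdm-proxy-container:
docker container logs zdm-proxy-container 2>&1 | grep -E "level=(warn|error)"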
Common log messages
Here are some of the most common messages in the ZDM Proxy logs.
ZDM Proxy startup message
If the log level doesn't filter out info entries, you can look for a Proxy started log message to verify that ZDM Proxy started correctly. For example:
{"log":"time=\"2023-01-13T11:50:48Z\" level=info msg=\"Proxy started. Waiting for SIGINT/SIGTERM to shutdown. \"\n","stream":"stderr","time":"2023-01-13T11:50:48.522097083Z"}
ZDM Proxy configuration message
If the log level doesn't filter out info entries, the first few lines of a ZDM Proxy log file contain all configuration variables and values in a long JSON string. The following example log message is truncated for readability:
{"log":"time=\"2023-01-13T11:50:48Z\" level=info msg=\"Parsed configuration: {\\\"ProxyIndex\\\":1,\\\"ProxyAddresses\\\":"...", ...TRUNCATED... ","stream":"stderr","time":"2023-01-13T11:50:48.339225051Z"}
Configuration settings can help with troubleshooting. To make this message easier to read, pass it through a JSON formatter or paste it into a text editor that can reformat JSON.
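For example, if you have the raw JSON log lines in a file, a small sketch with the common jq tool can unwrap the log field so the configuration string is easier to scan. The file name here is only a placeholder:
jq -r '.log' zdm-proxy-json.log | grep "Parsed configuration"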
Protocol log messages
There are cases where protocol errors are fatal, and they will kill an active connection that was being used to serve requests. However, it is also possible to get normal protocol log messages that contain wording that sounds like an error.
For example, the following DEBUG message contains the phrases force a downgrade and unsupported protocol version, which can sound like errors:
{"log":"time=\"2023-01-13T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR] Protocol v5 detected while decoding a frame. Returning a protocol message to the client to force a downgrade: PROTOCOL (code=Code Protocol [0x0000000A], msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2023-01-13T12:02:12.379287735Z"}
However, level=debug indicates that this is not an error. Instead, this is a normal part of protocol version negotiation (handshake) during connection initialization.
Check your ZDM Proxy version
The ZDM Proxy version is printed at startup, and it is the first message in the logs, immediately before the long Parsed configuration string:
time="2023-01-13T13:37:28+01:00" level=info msg="Starting ZDM proxy version 2.1.0"
time="2023-01-13T13:37:28+01:00" level=info msg="Parsed configuration: ..."
You can also pass the -version flag to ZDM Proxy to print the version.
For example, you can use the following Docker command, replacing TAG with the zdm_proxy_image tag set in vars/zdm_proxy_container.yml:
docker run --rm datastax/zdm-proxy:TAG -version
Result
The output shows the binary version of ZDM Proxy that is currently running:
ZDM proxy version 2.1.0
Query system.peers and system.local to check for configuration issues
Querying system.peers and system.local can help you investigate ZDM Proxy configuration issues:
- Query system.peers:
SELECT * FROM system.peers
- Query system.local:
SELECT * FROM system.local
- Repeat for each of your ZDM Proxy instances.
Because system.peers and system.local reflect the local ZDM Proxy instance's configuration, you must query all instances to get all information and identify potential misconfigurations.
- Compare the results from each instance by searching for values related to an error that you are troubleshooting, such as IP addresses or tokens.
For example, you might compare cluster_name to ensure that all instances are connected to the same cluster rather than mixing contact points from different clusters.
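One quick way to run these checks is with cqlsh pointed at each proxy instance. The address, port, and credentials below are placeholders; use each proxy's address and the same credentials that your client applications use:
cqlsh 172.18.10.36 9042 -u CLIENT_USERNAME -p CLIENT_PASSWORD -e "SELECT * FROM system.local;"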
Troubleshooting scenarios
The following sections provide troubleshooting advice for specific issues or error messages related to Zero Downtime Migration.
Configuration changes aren’t applied by ZDM Proxy Automation
If some configuration changes aren’t applied to your ZDM Proxy instances after a rolling restart, this typically means that you modified an immutable configuration variable.
Not all ZDM Proxy configuration variables can be changed after deployment, with or without a rolling restart. For a list of variables that you can change on a live deployment, see Change mutable configuration variables.
Any configuration variables excluded from the mutable variables list are considered immutable, and you must fully redeploy your instances to change them. This is by design because immutable configuration variables store values that must not change between the time that you finalize the deployment and start the migration. Allowing these values to change from a rolling restart could propagate a misconfiguration and compromise the deployment’s integrity.
If you change the value of an immutable configuration variable, you must run the deploy_zdm_proxy.yml playbook again.
You can run this playbook as many times as needed.
Each time, ZDM Proxy Automation recreates your entire ZDM Proxy deployment with the new configuration.
However, this doesn’t happen in a rolling fashion: The existing ZDM Proxy instances are torn down simultaneously, and then they are recreated.
This results in a brief period of downtime where the entire ZDM Proxy deployment is unavailable.
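For example, from the Ansible Control Host container, the redeployment invocation follows the same pattern as the other playbooks shown on this page. The inventory file name is assumed to match the one used in your deployment:
ansible-playbook deploy_zdm_proxy.yml -i zdm_ansible_inventory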
Client application throws unsupported protocol version error
If you are running version 4.0 to 4.9 of the Cassandra Java driver, the following errors can occur during or after session initialization:
[s0|/10.169.241.224:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.224:9042] Host does not support protocol version DSE_V2)
[s0|/10.169.241.24:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.24:9042] Host does not support protocol version DSE_V2)
[s0|/10.169.241.251:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [/10.169.241.251:9042] Host does not support protocol version DSE_V2)
[s0] Failed to connect with protocol DSE_V1, retrying with V4
[s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1
These errors are caused by a Java driver bug that was resolved in version 4.10.0.
To resolve this issue, do one of the following:
- If your application uses any dependency that includes a version of the Java driver, such as Spring Boot or spring-data-cassandra, upgrade these dependencies to a version that uses Java driver 4.10.0 or later.
- If you are using the Java driver directly, upgrade to version 4.10.0 or later, if these versions are compatible with both your origin and target clusters.
- Force the protocol version on the driver to the highest version that is supported by both your origin and target clusters. Typically, V4 is broadly supported. However, if you are migrating from DSE to DSE, use DSE_V1 for DSE 5.x migrations and DSE_V2 for DSE 6.x migrations. For more information, see the documentation for your version of the Java driver.
Logs report protocol errors but clients connect successfully
PROTOCOL ERROR messages in ZDM Proxy logs are a normal part of the handshake process while the protocol version is being negotiated.
These messages indicate that a protocol version downgrade happened because ZDM Proxy or one of the clusters doesn't support the version requested by the client.
V5 downgrades are enforced by ZDM Proxy.
Any other downgrade results from a request by a cluster that doesn’t support the version that the client requested.
ZDM Proxy supports V3, V4, DSE_V1 and DSE_V2.
In the following example, notice that the PROTOCOL ERROR message is introduced by level=debug, indicating that it isn’t a true error:
{"log":"time=\"2022-10-01T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR] Protocol v5 detected while decoding a frame.
Returning a protocol error to the client to force a downgrade: ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A],
msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2022-07-20T12:02:12.379287735Z"}
PROTOCOL ERROR messages recorded at a higher log level, especially level=error, might indicate a bug because this means that the error occurred outside of the handshake process.
This is a fatal unexpected error that terminates the connection.
If you observe this behavior in your logs, report an issue so that the ZDM team can investigate it.
Proxy fails to start due to invalid or unsupported protocol version
If the ZDM Proxy logs contain messages with Invalid or unsupported protocol version: 3, this means that the origin cluster doesn't support protocol version V3 or later.
Invalid or unsupported protocol version logs
time="2022-10-01T19:58:15+01:00" level=info msg="Starting proxy..."
time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Topology Config: TopologyConfig{VirtualizationEnabled=false, Addresses=[127.0.0.1], Count=1, Index=0, NumTokens=8}"
time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Origin contact points: [127.0.0.1]"
time="2022-10-01T19:58:15+01:00" level=info msg="Parsed Target contact points: [127.0.0.1]"
time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Origin"
time="2022-10-01T19:58:15+01:00" level=info msg="TLS was not configured for Target"
time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Opening connection to 127.0.0.1:9042"
time="2022-10-01T19:58:15+01:00" level=info msg="[openTCPConnection] Successfully established connection with 127.0.0.1:9042"
time="2022-10-01T19:58:15+01:00" level=debug msg="performing handshake"
time="2022-10-01T19:58:15+01:00" level=error msg="cqlConn{conn: 127.0.0.1:9042}: handshake failed: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)"
time="2022-10-01T19:58:15+01:00" level=warning msg="Error while initializing a new cql connection for the control connection of ORIGIN: failed to perform handshake: expected AUTHENTICATE or READY, got ERROR PROTOCOL ERROR (code=ErrorCode ProtocolError [0x0000000A], msg=Invalid or unsupported protocol version: 3)"
time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down request loop on cqlConn{conn: 127.0.0.1:9042}"
time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down response loop on cqlConn{conn: 127.0.0.1:9042}."
time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down event loop on cqlConn{conn: 127.0.0.1:9042}."
time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]."
time="2022-10-01T19:58:15+01:00" level=info msg="Initiating proxy shutdown..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client listener..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the client handlers..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until all client handlers are done..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Requesting shutdown of the control connections..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Waiting until control connections done..."
time="2022-10-01T19:58:15+01:00" level=debug msg="Shutting down the schedulers and metrics handler..."
time="2022-10-01T19:58:15+01:00" level=info msg="Proxy shutdown complete."
time="2022-10-01T19:58:15+01:00" level=error msg="Couldn't start proxy, retrying in 2.229151525s: failed to initialize origin control connection: could not open control connection to ORIGIN, tried endpoints: [127.0.0.1:9042]."
Specifically, this happens with Cassandra 2.0 and DSE 4.6.
ZDM cannot be used for these migrations because the ZDM Proxy control connections don't perform protocol version negotiation; they only attempt to use V3.
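To confirm the origin cluster's version before starting a migration, you can query an origin node directly, for example with cqlsh. The address and credentials below are placeholders:
cqlsh ORIGIN_NODE_IP 9042 -u ORIGIN_USERNAME -p ORIGIN_PASSWORD -e "SELECT release_version FROM system.local;"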
Authentication errors
Authentication errors indicate that credentials are incorrect or have insufficient permissions.
There are three sets of credentials used with ZDM Proxy:
- Target cluster: Credentials that you set in the ZDM Proxy configuration through the ZDM_TARGET_USERNAME and ZDM_TARGET_PASSWORD settings.
- Origin cluster: Credentials that you set in the ZDM Proxy configuration through the ZDM_ORIGIN_USERNAME and ZDM_ORIGIN_PASSWORD settings.
- Client application: Credentials that the client application sends to the proxy during the connection handshake. These are set in the application configuration, not the proxy configuration.
Authentication errors mean that at least one of these three sets of credentials is incorrect or has insufficient permissions.
If the authentication error prevents ZDM Proxy from starting, then the issue is in the origin or target cluster credentials. The log message shows whether the origin or target handshake failed.
If ZDM Proxy starts but the logs contain a message like Proxy started. Waiting for SIGINT/SIGTERM to shutdown, then the authentication error occurs when a client application tries to open a connection to the proxy.
This means that the client application itself has invalid credentials, such as an incorrect username/password, expired token, or insufficient permissions.
Proxy startup messages are reported at level=info.
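For example, in a Docker-based deployment, you can check whether the startup message was logged at all. The container name is assumed to be zdm-proxy-container:
docker container logs zdm-proxy-container 2>&1 | grep "Proxy started"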
ZDM Proxy listens on a custom port, and all applications connect to one proxy instance only
If ZDM Proxy is listening on a custom port (not 9042), you might see either of the following issues:
- The Grafana dashboard shows only one proxy instance receiving all the connections from the application.
- Only one proxy instance has log messages like level=info msg="Accepted connection from 10.4.77.210:39458".
This happens because the application specifies the custom port as part of the contact points using the format
PROXY_IP_ADDRESS:CUSTOM_PORT.
For example, if the ZDM Proxy instances were listening on port 14035, the contact points for the Cassandra Java driver might be specified as .addContactPoints("172.18.10.36:14035", "172.18.11.48:14035", "172.18.12.61:14035").
The contact point is the first point of contact to the cluster, but the driver discovers the rest of the nodes through CQL queries. However, this discovery process finds the addresses only, not the ports. The driver uses the addresses it discovers with the port that is configured at startup. As a result, the custom port is used for the initial contact point only, and the default port is used with all other nodes.
To resolve this issue, ensure that the custom port is explicitly set in your application.
The way that you do this depends on your driver language and version.
For example, for the Java driver, use .withPort(CUSTOM_PORT):
.addContactPoints("172.18.10.36", "172.18.11.48", "172.18.12.61")
.withPort(14035)
Proxy logs contain SyntaxError no viable alternative at input 'CALL'
ZDM Proxy log messages such as the following indicate that the server doesn’t recognize the word "CALL" in the query string, which typically means that it is a remote procedure call (RPC):
{"log":"time=\"2022-10-01T13:10:47Z\" level=debug msg=\"Recording TARGET-CONNECTOR other error:
ERROR SYNTAX ERROR (code=ErrorCode SyntaxError [0x00002000], msg=line 1:0 no viable alternative
at input 'CALL' ([CALL]...))\"\n","stream":"stderr","time":"2022-07-20T13:10:47.322882877Z"}
From the proxy logs alone, you cannot determine which method was called by the query, but it’s typically the RPC that Cassandra drivers use to send DSE Insights data to the server.
Most DataStax-compatible drivers have DSE Insights reporting enabled by default when they detect a server version that supports it, even if the feature is disabled on the server side.
The driver might also have it enabled for Astra DB depending on what server version Astra DB is returning for queries involving the system.local and system.peers tables.
These log messages are harmless, but if you need to remove them, you can disable DSE Insights in the driver configuration.
For example, in the Java driver, you can set advanced.monitor-reporting to false.
Default Grafana credentials don’t work
When you deploy the ZDM Proxy Automation metrics component, a Grafana instance is deployed that doesn’t use Grafana’s default admin/admin credentials.
Instead, ZDM Proxy Automation specifies a custom set of credentials.
You can find the credentials for your ZDM Proxy Automation Grafana instance in the vars/zdm_monitoring_config.yml file in the ZDM Proxy Automation directory.
You can also modify these credentials before deploying the metrics stack.
Proxy starts but client cannot connect (connection timed out or connection closed)
If the ZDM Proxy logs contain messages like Couldn’t connect to, context timed out or cancelled while opening connection, and context deadline exceeded, it can indicate that the ZDM Proxy couldn’t establish a connection with a particular node.
This can happen when ZDM Proxy has connectivity to only a subset of the nodes. The control connection, which is established during ZDM Proxy startup, cycles through the nodes until it finds one that it can connect to successfully.
For client connections, each proxy instance cycles through its assigned nodes only. Each proxy instance has a different group of assigned nodes, which are a subset of the cluster nodes. Generally, these are unique for each proxy instance to avoid interference with load balancing that is already in place at the client-side driver level. The assigned nodes aren't necessarily contact points: Even discovered nodes undergo assignment to proxy instances.
In the following example, ZDM Proxy doesn’t have connectivity to 10.0.63.20, which was chosen as the origin node for the incoming client connection, but it connected to 10.0.63.163 during startup:
INFO[0000] [openTCPConnection] Opening connection to 10.0.63.163:9042
INFO[0000] [openTCPConnection] Successfully established connection with 10.0.63.163:9042
INFO[0000] [openTLSConnection] Opening TLS connection to 10.0.63.163:9042 using underlying TCP connection
INFO[0000] [openTLSConnection] Successfully established connection with 10.0.63.163:9042
INFO[0000] Successfully opened control connection to ORIGIN using endpoint 10.0.63.163:9042.
INFO[0000] [openTCPConnection] Opening connection to 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042
INFO[0000] [openTCPConnection] Successfully established connection with 54.84.75.118:29042
INFO[0000] [openTLSConnection] Opening TLS connection to 211d66bf-de8d-48ac-a25b-bd57d504bd7c using underlying TCP connection
INFO[0000] [openTLSConnection] Successfully established connection with 211d66bf-de8d-48ac-a25b-bd57d504bd7
INFO[0000] Successfully opened control connection to TARGET using endpoint 5bc479c2-c3d0-45be-bfba-25388f2caff7-us-east-1.db.astra.datastax.com:29042-211d66bf-de8d-48ac-a25b-bd57d504bd7c.
INFO[0000] Proxy connected and ready to accept queries on 0.0.0.0:9042
INFO[0000] Proxy started. Waiting for SIGINT/SIGTERM to shutdown.
INFO[0043] Accepted connection from 10.0.62.255:33808
INFO[0043] [ORIGIN-CONNECTOR] Opening request connection to ORIGIN (10.0.63.20:9042).
ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 100ms...
ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 200ms...
ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 400ms...
ERRO[0043] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 800ms...
ERRO[0044] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 1.6s...
ERRO[0046] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 3.2s...
ERRO[0049] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 6.4s...
ERRO[0056] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s...
ERRO[0066] [openTCPConnectionWithBackoff] Couldn't connect to 10.0.63.20:9042, retrying in 10s...
ERRO[0076] Client Handler could not be created: ORIGIN-CONNECTOR context timed out or cancelled while opening connection to ORIGIN: context deadline exceeded
To avoid this issue, ensure that a stable network connection exists between the ZDM Proxy instances and all nodes of your origin and target clusters in the client application’s local datacenter.
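To check basic TCP reachability from a proxy host to a specific node, you can use a simple tool such as netcat, if it's available on the host. The address and port below are taken from the example above:
nc -vz 10.0.63.20 9042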
Client application driver takes too long to reconnect to a proxy instance
When a ZDM Proxy instance comes back online after being unavailable for some time, your client application might take a long time to reconnect to it.
If ZDM Proxy doesn’t send topology events to the client application, the driver’s reconnection policy determines the time required for the driver to reconnect to the ZDM Proxy instance.
You can restart the client application to force an immediate reconnection attempt.
If you expect ZDM Proxy instances to go down frequently, change the driver’s reconnection policy to shorten the interval between reconnection attempts.
Astra DevOps API errors when using ZDM Proxy Automation
ZDM Proxy Automation logs might report errors that contain your Astra DevOps API endpoint:
fatal: [10.255.13.6]: FAILED! => {"changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]:
Connection failure: Remote end closed connection without response", "redirected": false, "status": -1, "url":
"https://api.astra.datastax.com/v2/databases/REDACTED/secureBundleURL"}
This can indicate that the Astra DevOps API is temporarily unavailable. You can either wait and retry the operation, or you can download your database's Secure Connect Bundle (SCB) from the Astra Portal, and then provide the path to the SCB zip file in the ZDM Proxy Automation configuration.
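To check whether the Astra DevOps API is reachable from the machine running ZDM Proxy Automation, you can send an authenticated request and inspect the HTTP status code. This is only a sketch; ASTRA_TOKEN is a placeholder for an Astra application token:
curl -s -o /dev/null -w "%{http_code}\n" -H "Authorization: Bearer ASTRA_TOKEN" "https://api.astra.datastax.com/v2/databases"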
Metadata service returned not successful status code (4xx or 5xx)
If ZDM Proxy doesn’t start, the logs might contain messages with not successful status code:
Couldn't start proxy: error initializing the connection configuration or control connection for Target:
metadata service (Astra) returned not successful status code
There are two possible causes for this:
- The credentials that ZDM Proxy is using for Astra DB don't have sufficient permissions.
- The Astra DB database is hibernated or otherwise unavailable.
To resolve this issue, sign in to the Astra Portal, and then check the database status.
If the database isn't in Active status, reactivate it or wait for it to return to Active status, and then retry the connection.
If the database is in Active status, then the credentials likely have insufficient permissions. Try using an application token scoped to a database, specifically a token with the Database Administrator role for your target database.
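As an alternative to the Astra Portal, you can check the database status with the Astra DevOps API. This is a hedged sketch; DB_ID and ASTRA_TOKEN are placeholders, and jq is assumed to be installed:
curl -s -H "Authorization: Bearer ASTRA_TOKEN" "https://api.astra.datastax.com/v2/databases/DB_ID" | jq -r '.status'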
Async read timeouts / stream id map exhausted
When dual reads are enabled, you might find the following messages in the ZDM Proxy logs:
{"log":"\u001b[33mWARN\u001b[0m[430352] Async Request (OpCode EXECUTE [0x0A]) timed out after 10000 ms. \r\n","stream":"stdout","time":"2022-10-03T17:29:42.548941854Z"}
{"log":"\u001b[33mWARN\u001b[0m[430368] Could not find async request context for stream id 331 received from async connector. It either timed out or a protocol error occurred. \r\n","stream":"stdout","time":"2022-10-03T17:29:58.378080933Z"}
{"log":"\u001b[33mWARN\u001b[0m[431533] Could not send async request due to an error while storing the request state: stream id map ran out of stream ids: channel was empty. \r\n","stream":"stdout","time":"2022-10-03T17:49:23.786335428Z"}
The last message is logged when the async connection runs out of stream IDs. The async connection is a connection dedicated to the asynchronous dual reads feature. This can be caused by timeouts, as indicated in the first log message, or the connection being unable to keep up with the load.
Errors in the async request path (dual reads) don’t affect the client application. These log messages can be useful to predict what could happen when reads are switched over to the target cluster permanently, but async read errors and warnings don’t, by themselves, have any impact on the client.
If you find many of these messages in the logs, it is likely that an outage occurred. This causes all responses to arrive after requests timed out, as reported in the second log message. In this case the async connection might not recover.
Starting in version 2.1.0, you can tune the maximum number of stream IDs available per connection.
The default is 2048, and you can increase it to match your driver configuration with the zdm_proxy_max_stream_ids property.
If you are running a version prior to 2.1.0, upgrade ZDM Proxy.
If these errors are constantly written to the log files over a period of minutes or hours, then you likely need to restart the client application or ZDM Proxy to fix the issue. If you find an error like this, report an issue so the ZDM team can investigate it.
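To gauge whether these warnings are constant or were limited to a short window, you can count them in a Docker-based deployment. The container name is assumed to be zdm-proxy-container:
docker container logs zdm-proxy-container 2>&1 | grep -c "ran out of stream ids"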
Client application closed connection errors every 10 minutes when migrating to Astra DB
This issue is fixed in ZDM Proxy 2.1.0.
In ZDM Proxy versions earlier than 2.1.0, the logs can report that the Astra DB TARGET-CONNECTOR is disconnected every 10 minutes.
This happens because Astra DB terminates idle connections after 10 minutes of inactivity.
In the absence of asynchronous dual reads, the target cluster won’t get any traffic when the client application produces only read requests because ZDM forwards all reads to the origin cluster only.
To resolve this issue, DataStax recommends that you upgrade your ZDM Proxy instances to 2.1.0 or later to take advantage of the heartbeats feature, which keeps the connection alive during periods of inactivity.
You can tune the heartbeat interval with the heartbeat_interval_ms variable, or by directly setting the ZDM_HEARTBEAT_INTERVAL_MS environment variable if you aren’t using ZDM Proxy Automation.
If upgrading is impossible, you can try the following alternatives:
- Don't connect read-only client applications to ZDM Proxy. Instead, manually switch these client applications' reads to the target at any point after all the data has been migrated, validated, and reconciled on the target cluster.
- Implement a mechanism in your client application that creates a new Session periodically to avoid the Astra DB inactivity timeout.
- Implement a mechanism in your client application that issues a periodic, meaningless write request to prevent the Astra DB connection from becoming idle.
Performance degradation with ZDM Proxy
If you run separate benchmarks against Astra DB directly, the origin cluster directly, and ZDM Proxy with both Astra DB and the origin cluster, then the results of these tests might show that latency or throughput is worse with ZDM Proxy than when connecting to Astra DB or origin cluster directly.
This is expected: ZDM Proxy always adds some latency and, depending on the nature of the test, can reduce throughput. Whether the measured difference is reasonable depends on how the ZDM Proxy results compare with the results from the slower of the two clusters.
Writes through ZDM Proxy require successful acknowledgement from both clusters, whereas reads require only the result from the primary cluster, which is typically the origin cluster. This means that write performance through ZDM Proxy can never be better than that of the slower cluster: if the origin cluster performs better than the target cluster, writes through ZDM Proxy are inevitably at least as slow as writes to the target cluster alone.
Although it is typical for latency to increase with ZDM Proxy, you can minimize performance degradation with ZDM Proxy:
- Make sure your ZDM Proxy infrastructure or configuration doesn't unnecessarily increase latency. For example, make sure your ZDM Proxy instances are in the same availability zone (AZ) as your origin cluster or application instances.
- Understand the impact of simple and batch statements on latency, as compared to typical prepared statements. Avoid simple statements with ZDM Proxy because they require significant time for ZDM Proxy to parse. As an alternative, use prepared statements, which are parsed once and then reused on subsequent requests if repreparation isn't required. However, inefficient use of prepared statements can degrade performance further, although you would observe this even without ZDM Proxy.
- Increase the number of proxies only if the VMs' resources (CPU, RAM, or network I/O) are near capacity. ZDM Proxy doesn't use much RAM, but it uses a lot of CPU and network I/O. Deploying the proxy instances on VMs with faster CPUs and faster network I/O might help, but there is no standardized approach to scaling these resources for ZDM Proxy because the ideal balance of resources depends on the workload type and your environment, such as network/VPC configurations and hardware. If you adjust the infrastructure, repeat your tests to determine whether there was any benefit.
Permission errors related to InsightsRpc
If the ZDM Proxy logs contain messages such as the following, it’s likely that you have an origin DSE cluster where Metrics Collector is enabled, and the user named in the logs doesn’t have sufficient permissions to report Insights data:
time="2023-05-05T19:14:31Z" level=debug msg="Recording ORIGIN-CONNECTOR other error: ERROR UNAUTHORIZED (code=ErrorCode Unauthorized [0x00002100], msg=User my_user has no EXECUTE permission on <rpc method InsightsRpc.reportInsight> or any of its parents)"
time="2023-05-05T19:14:31Z" level=debug msg="Recording TARGET-CONNECTOR other error: ERROR SERVER ERROR (code=ErrorCode ServerError [0x00000000], msg=Unexpected persistence error: Unable to authorize statement com.datastax.bdp.cassandra.cql3.RpcCallStatement)"
These are reported as level=debug, so ZDM Proxy isn’t affected by them.
There are two ways to resolve this issue: disable DSE Metrics Collector (recommended), or grant InsightsRpc permissions.
Disable DSE Metrics Collector (recommended)
- On the origin DSE cluster, disable Metrics Collector:
dsetool insights_config --mode DISABLED
- Run the following command to verify that mode is set to DISABLED:
dsetool insights_config --show_config
Grant InsightsRpc permissions
Only use this option if you cannot disable Metrics Collector.
Using a superuser role, grant the appropriate permissions to the user named in the logs:
GRANT EXECUTE ON REMOTE OBJECT InsightsRpc TO USER;
Replace USER with the actual username given in the logs.
Report an issue
To report an issue or get additional support, submit an issue in the ZDM component GitHub repositories:
- ZDM Proxy Automation repository (includes ZDM Proxy Automation and ZDM Utility)
These repositories are public. Don't include any proprietary or private information in issues, pull requests, or comments that you make in these repositories.
In the issue description, include as much of the following information as possible, and make sure to remove all proprietary and private information before submitting the issue:
- Your ZDM Proxy version.
- ZDM Proxy logs, ideally at DEBUG level, if you can easily reproduce the issue and tolerate restarting the proxy instances to apply the log level configuration change.
- Database deployment type (DSE, HCD, Cassandra, or Astra DB) and version for the origin and target clusters. The version isn't required for Astra DB.
- Screenshots of the ZDM Proxy metrics dashboards from Grafana or your chosen visualization tool. Direct read access to your metrics dashboard is preferred, if permitted by your security policy. This is particularly helpful for performance-related issues.
- Client application and driver logs.
- The driver language and version that the client application is using.
For performance-related issues, provide the following additional information:
- Which statement types (simple, prepared, batch) do you use?
- If you use batch statements:
  - Which driver API do you use to create these batches?
  - Are you passing a BEGIN BATCH CQL query string to a simple/prepared statement, or do you use the actual batch statement objects that the drivers allow you to create?
- How many parameters does each statement have?
- Is CQL function replacement enabled? This feature is disabled by default. To determine whether it is enabled, check the following variables (see the example after this list):
  - If you use ZDM Proxy Automation, check the Ansible advanced configuration variable replace_cql_functions.
  - If you don't use ZDM Proxy Automation, check the environment variable ZDM_REPLACE_CQL_FUNCTIONS.
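For example, in a Docker-based deployment you can check the environment variable directly on a running proxy container. The container name is assumed to be zdm-proxy-container; no output means the variable isn't set and the default applies:
docker exec zdm-proxy-container env | grep ZDM_REPLACE_CQL_FUNCTIONS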