Use DSBulk Migrator with ZDM Proxy
DSBulk Migrator is an extension of DSBulk Loader that adds the following three commands:
- migrate-live: Immediately runs a live data migration using DSBulk Loader.
- generate-script: Generates a migration script that you can use to run a data migration with a standalone DSBulk Loader installation. This command doesn’t trigger the migration; it only generates the migration script.
- generate-ddl: Reads the origin cluster’s schema, and then generates CQL files that you can use to recreate the schema on your target cluster in preparation for data migration.
DSBulk Migrator is best for smaller migrations and migrations that don’t require extensive data validation aside from post-migration row counts. You might also use this tool for migrations where you can shard data from large tables into more manageable quantities.
You can use DSBulk Migrator alone or with ZDM Proxy.
Install DSBulk Migrator
- Install Java 11 and Maven 3.9.x.
- Optional: If you don’t want to use the embedded DSBulk Loader that is bundled with DSBulk Migrator, you must install DSBulk Loader before installing DSBulk Migrator.
- Clone the DSBulk Migrator repository:
  git clone git@github.com:datastax/dsbulk-migrator.git
- Change to the cloned directory:
  cd dsbulk-migrator
- Use Maven to build DSBulk Migrator:
  mvn clean package
  The build produces two distributable fat jars. You will use one of these jars when you run a DSBulk Migrator command.
  - dsbulk-migrator-VERSION-embedded-dsbulk.jar: Contains an embedded DSBulk Loader installation and an embedded Java driver. Supports all DSBulk Migrator operations, but it is larger than the other jar due to the presence of the DSBulk Loader classes. Use this jar if you don’t want to use your own DSBulk Loader installation.
  - dsbulk-migrator-VERSION-embedded-driver.jar: Contains an embedded Java driver only. Suitable for using the generate-script and migrate-live commands with your own DSBulk Loader installation. You cannot use this jar for migrate-live with the embedded DSBulk Loader because the required DSBulk Loader classes aren’t present in this jar.
- Clone and build Simulacron, which is required for some DSBulk Migrator integration tests. Note the prerequisites for Simulacron, particularly for macOS.
- Run the DSBulk Migrator integration tests:
  mvn clean verify
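Because the jar file names include the version, it can help to resolve the jar path once after the build. The following is a minimal sketch, assuming the jars are in Maven's default target directory as in the examples in this guide. The find_migrator_jar function is a hypothetical helper and is not part of DSBulk Migrator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of a built DSBulk Migrator fat jar.
# $1 = flavor: "embedded-dsbulk" or "embedded-driver"
# $2 = build output directory (assumed default: ./target)
find_migrator_jar() {
  local flavor="$1" dir="${2:-target}" jar
  for jar in "$dir"/dsbulk-migrator-*-"$flavor".jar; do
    # If nothing matches, the glob stays unexpanded, so check that the file exists.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "No $flavor jar found in $dir; run 'mvn clean package' first." >&2
  return 1
}

# Example usage after a build:
#   jar="$(find_migrator_jar embedded-dsbulk)" && java -jar "$jar" --help
```

The function returns the first match, so if several versions have been built into the same directory, clean the directory or pass the exact path instead.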
After you install, build, and test DSBulk Migrator, you can run it from the command line, specifying your desired jar, command, and options.
For a quick test, try the --help option.
For information and examples for each command, see the following sections: Run a live migration, Generate a migration script, and Generate DDL files.
Get help for DSBulk Migrator
Use --help (-h) to get information about DSBulk Migrator commands and options:
- Print the available DSBulk Migrator commands:
  java -jar /path/to/dsbulk-migrator.jar --help
  Replace /path/to/dsbulk-migrator.jar with the path to your DSBulk Migrator fat jar.
- Print help for a specific command:
  java -jar /path/to/dsbulk-migrator.jar COMMAND --help
  Replace the following:
  - /path/to/dsbulk-migrator.jar: The path to your DSBulk Migrator fat jar.
  - COMMAND: The command for which you want to get help, one of migrate-live, generate-script, or generate-ddl.
Run a live migration
The migrate-live command immediately runs a live data migration using the embedded version of DSBulk Loader or your own DSBulk Loader installation.
A live migration means the data migration starts immediately, and it is handled by the migrator tool through the specified DSBulk Loader installation.
To run the migrate-live command, provide the path to your DSBulk Migrator fat jar followed by migrate-live and any options:
java -jar /path/to/dsbulk-migrator.jar migrate-live OPTIONS
The following examples show how to use either fat jar to perform a live migration where the target cluster is an Astra DB database. The password parameters are left blank so that DSBulk Migrator prompts for them interactively during the migration. All unspecified options use their default values.
If you want to run the migration with the embedded DSBulk Loader, you must use the dsbulk-migrator-VERSION-embedded-dsbulk.jar fat jar and the --dsbulk-use-embedded option:
java -jar target/dsbulk-migrator-VERSION-embedded-dsbulk.jar migrate-live \
--data-dir=/path/to/data/dir \
--dsbulk-use-embedded \
--dsbulk-log-dir=/path/to/log/dir \
--export-host=ORIGIN_CLUSTER_HOSTNAME \
--export-username=ORIGIN_USERNAME \
--export-password \
--export-dsbulk-option "--connector.csv.maxCharsPerColumn=65536" \
--export-dsbulk-option "--executor.maxPerSecond=1000" \
--import-bundle=/path/to/scb.zip \
--import-username=token \
--import-password \
--import-dsbulk-option "--connector.csv.maxCharsPerColumn=65536" \
--import-dsbulk-option "--executor.maxPerSecond=1000"
If you want to run the migration with your own DSBulk Loader installation, use the dsbulk-migrator-VERSION-embedded-driver.jar fat jar, and use the --dsbulk-cmd option to specify the path to your DSBulk Loader installation:
java -jar target/dsbulk-migrator-VERSION-embedded-driver.jar migrate-live \
--data-dir=/path/to/data/dir \
--dsbulk-cmd=${DSBULK_ROOT}/bin/dsbulk \
--dsbulk-log-dir=/path/to/log/dir \
--export-host=ORIGIN_CLUSTER_HOSTNAME \
--export-username=ORIGIN_USERNAME \
--export-password \
--import-bundle=/path/to/scb.zip \
--import-username=token \
--import-password # Application token will be prompted
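The pairing between jar and option in the two examples above can be sketched as a small wrapper. This is an illustration only: build_migrate_live_cmd is a hypothetical function, it prints the start of the command rather than running it, and the remaining options (data directory, credentials, and so on) still need to be appended.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the migrate-live command prefix that matches
# the chosen DSBulk Loader mode. It only prints; it does not run anything.
# $1 = "embedded" or "external"
# $2 = jar version (the VERSION part of the jar file name)
# $3 = DSBulk Loader root directory (external mode only)
build_migrate_live_cmd() {
  local mode="$1" version="$2" dsbulk_root="${3:-}"
  case "$mode" in
    embedded)
      # The embedded loader requires the embedded-dsbulk jar.
      echo "java -jar target/dsbulk-migrator-${version}-embedded-dsbulk.jar migrate-live --dsbulk-use-embedded"
      ;;
    external)
      if [ -z "$dsbulk_root" ]; then
        echo "external mode needs the DSBulk Loader root directory" >&2
        return 1
      fi
      # Your own loader is paired with the embedded-driver jar and --dsbulk-cmd.
      echo "java -jar target/dsbulk-migrator-${version}-embedded-driver.jar migrate-live --dsbulk-cmd=${dsbulk_root}/bin/dsbulk"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}
```

Printing the command first makes it easy to review the jar and loader pairing before starting a migration that begins immediately.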
Options for migrate-live
Options for the migrate-live command are used to configure the migration parameters and connect to the origin and target clusters.
Most options have sensible default values and don’t need to be specified unless you want to override the default value.
| Option | Description |
|---|---|
|
The directory where data is exported to and imported from. The directory is created if it doesn’t exist. The default is a data subdirectory in the current working directory. Tables are exported and imported in subdirectories of the specified data directory: One subdirectory is created for each keyspace, and then one subdirectory is created for each table within each keyspace subdirectory. |
|
The path to your own external (non-embedded) DSBulk Loader installation, such as The default is Ignored if the embedded DSBulk Loader is used ( |
|
The path to the directory where you want to store DSBulk Loader logs, such as The default is a Each DSBulk Loader operation creates its own subdirectory within the specified log directory. This parameter applies whether you use the embedded DSBulk Loader or your own external (non-embedded) DSBulk Loader installation. |
|
Use the embedded DSBulk Loader. Accepts no arguments; it’s either included (enabled) or not (disabled). By default, this option is disabled/omitted, and |
|
The path to the directory where you want to run Only applicable when using your own external (non-embedded) DSBulk Loader installation with the |
|
If your origin cluster is an Astra DB database, provide the path to your database’s Secure Connect Bundle (SCB), such as Cannot be used with |
|
The consistency level to use when exporting data.
The default is |
|
An additional DSBulk Loader option to use when exporting data. The expected format is You must use the option’s full long form name and leading dashes; short form options will fail. You must wrap the entire expression in quotes so that it is handled correctly by DSBulk Migrator. This is in addition to any escaping required for DSBulk Loader to process the option correctly. To pass multiple additional options, pass each option separately with |
|
The origin cluster’s host name or IP address, and an optional port for a node in the origin cluster.
The default port is
This option can be passed multiple times. If your origin cluster is an Astra DB database, use |
|
The maximum number of concurrent files to write to when exporting data from the origin cluster. Can be either |
|
The maximum number of concurrent queries to execute. Can be either |
|
The maximum number of records to export for each table. The default is To export a fixed number of records, set to a positive integer, such as |
|
The password for authentication to the origin cluster. You can either provide the password directly ( If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the password is an Astra application token. |
|
The protocol version to use when connecting to the origin cluster, such as If unspecified, the driver negotiates the highest version supported by both the client and the server. Specify only if you want to force the protocol version. |
|
The maximum number of token range queries to generate. This is an advanced setting that DataStax doesn’t recommend modifying unless you have a specific need to do so. Can be either of the following:
The default is |
|
The username for authentication to the origin cluster. If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the username is the literal string |
|
If your target cluster is an Astra DB database, provide the path to your database’s Secure Connect Bundle (SCB), such as Cannot be used with |
|
The consistency level to use when importing data.
The default is |
|
The default timestamp to use when importing data.
Must be a valid instant in ISO-8601 format.
The default is |
|
An additional DSBulk Loader option to use when importing data. The expected format is You must use the option’s full long form name and leading dashes; short form options will fail. You must wrap the entire expression in quotes so that it is handled correctly by DSBulk Migrator. This is in addition to any escaping required for DSBulk Loader to process the option correctly. To pass multiple additional options, pass each option separately with |
|
The target cluster’s host name or IP address, and an optional port for a node in the target cluster.
The default port is
This option can be passed multiple times. If your target cluster is an Astra DB database, use |
|
The maximum number of concurrent files to read from when importing data to the target cluster. Can be either |
|
The maximum number of concurrent queries to execute. Can be either |
|
The maximum number of failed records to tolerate when importing data. Must be a positive integer, such as Failed records are written to a |
|
The password for authentication to the target cluster. You can either provide the password directly ( If set, then If the cluster doesn’t require authentication, omit both If your target cluster is an Astra DB database, the password is an Astra application token. |
|
The protocol version to use when connecting to the target cluster, such as If unspecified, the driver negotiates the highest version supported by both the client and the server. Specify only if you want to force the protocol version. |
|
The username for authentication to the target cluster. If set, then If the cluster doesn’t require authentication, omit both If your target cluster is an Astra DB database, the username is the literal string |
|
A regular expression to select keyspaces to migrate, such as The default expression is Case-sensitive keyspace names must be specified by their exact case. |
|
The maximum number of concurrent operations (exports and imports) to carry out. The default is 1. Increase this value to allow exports and imports to occur concurrently.
For example, if |
|
Whether to bypass truncation confirmation before actually truncating counter tables. The default is disabled/omitted, which means you must confirm truncation before counter tables are truncated. Only applicable when migrating counter tables. This option is ignored otherwise. |
|
A regular expression to select tables to migrate, such as The default expression is Case-sensitive table names must be specified by their exact case. |
|
The table types to migrate:
|
|
Truncate counter tables before exporting them, rather than truncating them afterwards. The default is disabled/omitted, which means counter tables are truncated after being exported. Only applicable when migrating counter tables. This option is ignored otherwise. |
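Because the keyspace and table selection options take regular expressions and are case-sensitive, it can be useful to preview an expression against a list of names before migrating. The following is a rough sketch using grep. It assumes the expression must match the whole name, and grep's extended syntax is only an approximation of the regular expression engine that DSBulk Migrator uses, so constructs such as lookaheads won't behave the same way. The preview_selection function and the keyspace names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list which names a regular expression would select.
# $1 = extended regular expression; remaining arguments = candidate names
preview_selection() {
  local regex="$1"
  shift
  # -E: extended regex syntax; -x: the expression must match the whole name
  printf '%s\n' "$@" | grep -Ex -- "$regex"
}

# Example with made-up keyspace names; prints sales_eu and sales_us.
# Sales_APAC is not selected because matching is case-sensitive.
preview_selection 'sales_.*' sales_eu sales_us inventory Sales_APAC
```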
Generate a migration script
The generate-script command generates a migration script that you can use to perform a data migration with your own DSBulk Loader installation.
This command doesn’t trigger the migration; it only generates the migration script that you must then run.
If you want to run a migration immediately, or you want to use the embedded DSBulk Loader, use the migrate-live command instead.
To run the generate-script command, provide the path to your DSBulk Migrator fat jar followed by generate-script and any options:
java -jar /path/to/dsbulk-migrator.jar generate-script OPTIONS
The following example generates a migration script where the target cluster is an Astra DB database.
The --dsbulk-cmd option specifies the path to the DSBulk Loader installation that you plan to use to run the generated migration script.
All unspecified options use their default values.
java -jar target/dsbulk-migrator-VERSION-embedded-driver.jar generate-script \
--data-dir=/path/to/data/dir \
--dsbulk-cmd=${DSBULK_ROOT}/bin/dsbulk \
--dsbulk-log-dir=/path/to/log/dir \
--export-host=ORIGIN_CLUSTER_HOSTNAME \
--export-username=ORIGIN_USERNAME \
--export-password=ORIGIN_PASSWORD \
--import-bundle=/path/to/scb.zip \
--import-username=token \
--import-password=ASTRA_APPLICATION_TOKEN
Options for generate-script
The options for the generate-script command become options in the generated migration script.
The only exceptions are the origin cluster connection parameters (export-username, export-password, export-host, export-bundle), which are used in the migration script and by DSBulk Migrator to gather metadata about the tables to migrate.
Most options have sensible default values and don’t need to be specified unless you want to override the default value.
| Option | Description |
|---|---|
|
The directory where you want to store the generated migration script files. The directory is created if it doesn’t exist. The default is a |
|
The path to an external (non-embedded) DSBulk Loader installation, such as The default is |
|
The path to the directory where you want to store DSBulk Loader logs, such as The default is a Each DSBulk Loader operation creates its own subdirectory within the specified log directory. |
|
The path to the directory where you want to run |
|
If your origin cluster is an Astra DB database, provide the path to your database’s Secure Connect Bundle (SCB), such as Cannot be used with |
|
The consistency level to use when exporting data.
The default is |
|
An additional DSBulk Loader option to use when exporting data. The expected format is You must use the option’s full long form name and leading dashes; short form options will fail. You must wrap the entire expression in quotes so that it is handled correctly by DSBulk Migrator. This is in addition to any escaping required for DSBulk Loader to process the option correctly. To pass multiple additional options, pass each option separately with |
|
The origin cluster’s host name or IP address, and an optional port for a node in the origin cluster.
The default port is
This option can be passed multiple times. If your origin cluster is an Astra DB database, use |
|
The maximum number of concurrent files to write to when exporting data from the origin cluster. Can be either |
|
The maximum number of concurrent queries to execute. Can be either |
|
The maximum number of records to export for each table. The default is To export a fixed number of records, set to a positive integer, such as |
|
The password for authentication to the origin cluster. You can either provide the password directly ( If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the password is an Astra application token. |
|
The protocol version to use when connecting to the origin cluster, such as If unspecified, the driver negotiates the highest version supported by both the client and the server. Specify only if you want to force the protocol version. |
|
The maximum number of token range queries to generate. This is an advanced setting that DataStax doesn’t recommend modifying unless you have a specific need to do so. Can be either of the following:
The default is |
|
The username for authentication to the origin cluster. If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the username is the literal string |
|
If your target cluster is an Astra DB database, provide the path to your database’s Secure Connect Bundle (SCB), such as Cannot be used with |
|
The consistency level to use when importing data.
The default is |
|
The default timestamp to use when importing data.
Must be a valid instant in ISO-8601 format.
The default is |
|
An additional DSBulk Loader option to use when importing data. The expected format is You must use the option’s full long form name and leading dashes; short form options will fail. You must wrap the entire expression in quotes so that it is handled correctly by DSBulk Migrator. This is in addition to any escaping required for DSBulk Loader to process the option correctly. To pass multiple additional options, pass each option separately with |
|
The target cluster’s host name or IP address, and an optional port for a node in the target cluster.
The default port is
This option can be passed multiple times. If your target cluster is an Astra DB database, use |
|
The maximum number of concurrent files to read from when importing data to the target cluster. Can be either |
|
The maximum number of concurrent queries to execute. Can be either |
|
The maximum number of failed records to tolerate when importing data. Must be a positive integer, such as Failed records are written to a |
|
The password for authentication to the target cluster. You can either provide the password directly ( If set, then If the cluster doesn’t require authentication, omit both If your target cluster is an Astra DB database, the password is an Astra application token. |
|
The protocol version to use when connecting to the target cluster, such as If unspecified, the driver negotiates the highest version supported by both the client and the server. Specify only if you want to force the protocol version. |
|
The username for authentication to the target cluster. If set, then If the cluster doesn’t require authentication, omit both If your target cluster is an Astra DB database, the username is the literal string |
|
A regular expression to select keyspaces to migrate, such as The default expression is Case-sensitive keyspace names must be specified by their exact case. |
|
A regular expression to select tables to migrate, such as The default expression is Case-sensitive table names must be specified by their exact case. |
|
The table types to migrate:
|
Unsupported live migration options for migration scripts
The following migrate-live options cannot be set in generate-script.
If you want to use these options, you must run the migration directly with migrate-live instead of generating a script.
- --dsbulk-use-embedded: Not applicable to generate-script because the resulting script is intended to be run with your own (non-embedded) DSBulk Loader installation.
- --max-concurrent-ops: Cannot be customized in generate-script. Uses the default value of 1.
- --skip-truncate-confirmation: Cannot be customized in generate-script. Uses the default behavior of requiring confirmation before truncating counter tables.
- --truncate-before-export: Cannot be customized in generate-script. Uses the default behavior of truncating counter tables after exporting them.
- --data-dir: In generate-script, this parameter sets the location to store the generated script files. There is no generate-script option to set a custom data directory for the migration’s actual import and export operations. When you run the migration script, the default data directory is used for the data export and import operations, which is a data subdirectory in the current working directory.
Generate DDL files
The generate-ddl command reads the origin cluster’s schema, and then generates CQL files that you can use to recreate the schema on your target CQL-compatible cluster.
To run the generate-ddl command, provide the path to your DSBulk Migrator fat jar followed by generate-ddl and any options:
java -jar /path/to/dsbulk-migrator.jar generate-ddl OPTIONS
The following example generates DDL files that are optimized for recreating the schema on an Astra DB database:
java -jar target/dsbulk-migrator-VERSION-embedded-driver.jar generate-ddl \
--data-dir=/path/to/data/directory \
--export-host=ORIGIN_CLUSTER_HOSTNAME \
--export-username=ORIGIN_USERNAME \
--export-password=ORIGIN_PASSWORD \
--optimize-for-astra
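After the CQL files are generated, they need to be applied to the target cluster. The following sketch prints one cqlsh command per .cql file in the data directory so you can review the commands before running them. It is an assumption-laden illustration: print_cqlsh_commands is a hypothetical helper, the generated file names aren't listed in this guide, and the alphabetical order produced by the glob might not be the order the schema requires (keyspaces must exist before their tables). Paths that contain spaces would also need quoting before the printed commands are run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one cqlsh command per generated CQL file.
# $1 = directory that was passed to --data-dir
# $2 = host name or IP address of a node in the target cluster (assumption)
print_cqlsh_commands() {
  local dir="$1" host="$2" file
  for file in "$dir"/*.cql; do
    [ -f "$file" ] || continue   # skip the unexpanded glob when no files exist
    # cqlsh -f executes the statements in the given file
    printf 'cqlsh %s -f %s\n' "$host" "$file"
  done
}

# Example usage:
#   print_cqlsh_commands /path/to/data/directory TARGET_CLUSTER_HOSTNAME
```

For an Astra DB target, cqlsh connects with a Secure Connect Bundle rather than a host name, so the printed commands would need to be adapted accordingly.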
Options for generate-ddl
The generate-ddl command ignores all import-* options and DSBulk Loader-related options because they aren’t relevant to this operation.
Origin cluster connection details (export-* options) are required so that DSBulk Migrator can access the origin cluster to gather metadata about the keyspaces and tables for the DDL statements.
Most options have sensible default values and don’t need to be specified unless you want to override the default value.
| Option | Description |
|---|---|
|
The directory where you want to store the generated CQL files. The directory is created if it doesn’t exist. The default is a |
|
If your origin cluster is an Astra DB database, provide the path to your database’s Secure Connect Bundle (SCB), such as Cannot be used with |
|
The origin cluster’s host name or IP address, and an optional port for a node in the origin cluster.
The default port is
This option can be passed multiple times. If your origin cluster is an Astra DB database, use |
|
The password for authentication to the origin cluster. You can either provide the password directly ( If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the password is an Astra application token. |
|
The protocol version to use when connecting to the origin cluster, such as If unspecified, the driver negotiates the highest version supported by both the client and the server. Specify only if you want to force the protocol version. |
|
The username for authentication to the origin cluster. If set, then If the cluster doesn’t require authentication, omit both If your origin cluster is an Astra DB database, the username is the literal string |
|
A regular expression to select keyspaces to include in the generated CQL files, such as The default expression is Case-sensitive keyspace names must be specified by their exact case. |
|
Produce CQL files optimized for Astra DB. Astra DB doesn’t support all CQL options in DDL statements. This option omits forbidden CQL options from the generated CQL files so you can use them to create the schema in your Astra DB database without producing warnings or errors. The default is disabled/omitted, which generates the CQL files as-is without any Astra DB-specific optimizations. |
|
A regular expression to select tables to include in the generated CQL files, such as The default expression is Case-sensitive table names must be specified by their exact case. |
|
The table types to include in the generated CQL files:
|