= Overview

This document describes the configuration options available for the bulk reader and bulk writer components.

== Cassandra Sidecar Configuration

The Analytics library uses Sidecar to interact with the Cassandra cluster. The bulk reader and bulk writer
components share common Sidecar configuration properties.

[cols="1,1,2"]
|===
|Property name|Default|Description

|_sidecar_contact_points_
|
|Comma-separated list of Cassandra Sidecar contact points. IP addresses and fully qualified domain names are
supported, with an optional port number (e.g. `localhost1,localhost2`, `127.0.0.1,127.0.0.2`, `127.0.0.1:9043,127.0.0.2:9043`)

|_sidecar_port_
|`9043`
|Default port on which Cassandra Sidecar listens

|_keystore_path_
|
|Path to the keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_base64_encoded_
|
|Base64-encoded keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_password_
|
|Keystore password

|_keystore_type_
|`PKCS12`
|Keystore type, `PKCS12` or `JKS`

|_truststore_path_
|
|Path to the truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_base64_encoded_
|
|Base64-encoded truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_password_
|
|Truststore password

|_truststore_type_
|`PKCS12`
|Truststore type, `PKCS12` or `JKS`

|_cassandra_role_
|
|Specific role that Sidecar uses to authorize the request. For further details, consult the Sidecar documentation
for the `cassandra-auth-role` HTTP header

|===
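
The accepted _sidecar_contact_points_ formats can be illustrated with a small parser that splits the
comma-separated list and separates the optional port from each host. This is an illustrative sketch, not part of
the library:

[source,python]
----
def parse_contact_points(value):
    """Split a comma-separated contact point list into (host, port) pairs.

    The port is optional; None is returned when it is absent, so the caller
    can fall back to the _sidecar_port_ setting (default 9043).
    """
    points = []
    for entry in value.split(","):
        entry = entry.strip()
        host, sep, port = entry.rpartition(":")
        if sep:  # explicit port, e.g. "127.0.0.1:9043"
            points.append((host, int(port)))
        else:    # bare host or IP, e.g. "localhost1"
            points.append((entry, None))
    return points
----

For example, `parse_contact_points("127.0.0.1:9043,localhost2")` yields `[("127.0.0.1", 9043), ("localhost2", None)]`.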

== Bulk Reader

This section describes configuration properties specific to the bulk reader.

=== Cassandra Sidecar Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_defaultMillisToSleep_
|`500`
|Number of milliseconds to wait between retry attempts

|_maxMillisToSleep_
|`60000`
|Maximum number of milliseconds to sleep between retries

|_maxPoolSize_
|`64`
|Size of the Vert.x worker thread pool

|_timeoutSeconds_
|`600`
|Request timeout, expressed in seconds

|===
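
_defaultMillisToSleep_ and _maxMillisToSleep_ bound the sleep between retry attempts. The default/maximum pair
suggests a capped backoff; the sketch below shows one plausible doubling policy. The doubling itself is an
assumption for illustration, not the library's documented behaviour:

[source,python]
----
def sleep_millis(attempt, default_millis=500, max_millis=60000):
    """Capped backoff: double the default sleep per attempt (assumed policy).

    default_millis mirrors _defaultMillisToSleep_ and max_millis mirrors
    _maxMillisToSleep_; the result never exceeds the configured maximum.
    """
    return min(default_millis * (2 ** attempt), max_millis)
----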

=== Spark Reader Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_keyspace_
|
|Keyspace of the table to read

|_table_
|
|Table to be read

|_dc_
|
|Data center used when a `LOCAL_*` consistency level is specified

|_consistencyLevel_
|`LOCAL_QUORUM`
|Read consistency level

|_snapshotName_
|`sbr_\{uuid\}`
|Name of the snapshot to use (for data consistency). By default, a unique name is generated

|_createSnapshot_
|`true`
|Indicates whether a new snapshot should be created prior to performing the read operation

|_clearSnapshotStrategy_
|`OnCompletionOrTTL 2d`
|Strategy for removing the snapshot once the read operation completes. This option always applies when the
_createSnapshot_ flag is set to `true`. The value of _clearSnapshotStrategy_ must follow the format
`[strategy] [snapshotTTL]`. Supported strategies: `NoOp`, `OnCompletion`, `OnCompletionOrTTL`, `TTL`. Example
configurations: `OnCompletionOrTTL 2d`, `TTL 2d`, `NoOp`, `OnCompletion`. The TTL value has to match the
pattern `\d+(d\|h\|m\|s)`

|_bigNumberConfig_
|
a|Defines the output scale and precision of `decimal` and `varint` columns. The parameter value is a JSON string
with the following structure:

[source,json]
----
{
  "columnName1" : {"bigDecimalPrecision": 10, "bigDecimalScale": 5},
  "columnName2" : {"bigIntegerPrecision": 10, "bigIntegerScale": 5}
}
----

|_lastModifiedColumnName_
|
|Name of the field appended to the Spark RDD that holds the last modification timestamp of each row

|===
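
The _clearSnapshotStrategy_ format (`[strategy] [snapshotTTL]`, with the TTL matching `\d+(d|h|m|s)`) can be
validated with a short regular expression. A minimal sketch, separate from the library:

[source,python]
----
import re

# Accepts "NoOp", "OnCompletion", "TTL 2d", "OnCompletionOrTTL 2d", ...
STRATEGY_RE = re.compile(
    r"^(NoOp|OnCompletion|OnCompletionOrTTL|TTL)(?: (\d+[dhms]))?$"
)

def parse_clear_snapshot_strategy(value):
    """Return (strategy, ttl) for a valid value; ttl is None when absent."""
    match = STRATEGY_RE.match(value)
    if match is None:
        raise ValueError(f"invalid clearSnapshotStrategy: {value!r}")
    return match.group(1), match.group(2)
----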

=== Other Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|_defaultParallelism_
|`1`
|Value of the Spark property `spark.default.parallelism`

|_numCores_
|`1`
|Total number of cores used by all Spark executors

|_maxBufferSizeBytes_
|`6291456`
a|Maximum number of bytes per sstable file that may be downloaded and buffered in memory. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 6291456
- `Index.db`: 131072
- `Summary.db`: 262144
- `Statistics.db`: 131072
- `CompressionInfo.db`: 131072
- `.log` (commit log): 65536
- `Partitions.db`: 131072
- `Rows.db`: 131072

To override the size for `Data.db`, use the property `_maxBufferSizeBytes_Data.db_`.

|_chunkBufferSizeBytes_
|`4194304`
a|Default chunk size (in bytes) requested when fetching the next portion of an sstable file. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 4194304
- `Index.db`: 32768
- `Summary.db`: 131072
- `Statistics.db`: 65536
- `CompressionInfo.db`: 65536
- `.log` (commit log): 65536
- `Partitions.db`: 4096
- `Rows.db`: 4096

To override the size for `Data.db`, use the property `_chunkBufferSizeBytes_Data.db_`.

|_sizing_
|`default`
a|Determines how the number of CPU cores is selected during the read operation. Supported options:

* `default`: static number of cores defined by the _numCores_ parameter
* `dynamic`: calculates the number of cores dynamically based on the table size. Improves cost efficiency when
processing small tables (a few GBs). Consult the JavaDoc of `org.apache.cassandra.spark.data.DynamicSizing` for
implementation details. Relevant configuration properties:
 ** _maxPartitionSize_: maximum Spark partition size (in GiB)

|_quote_identifiers_
|`false`
|When `true`, keyspace, table and column names are quoted

|_sstable_start_timestamp_micros_ and _sstable_end_timestamp_micros_
|
|Define an inclusive time-range filter for sstable selection. Both timestamps are expressed in microseconds

|===
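
The per-file-type override scheme for _maxBufferSizeBytes_ (and, analogously, _chunkBufferSizeBytes_) amounts to
a two-level lookup: a file-type-specific property wins over the per-type effective default, which in turn wins
over the global default. A sketch of that resolution, using the defaults listed above:

[source,python]
----
# Effective per-type defaults for maxBufferSizeBytes, from the table above.
MAX_BUFFER_DEFAULTS = {
    "Data.db": 6291456,
    "Index.db": 131072,
    "Summary.db": 262144,
    "Statistics.db": 131072,
    "CompressionInfo.db": 131072,
}

def effective_buffer_size(options, file_type, global_default=6291456):
    """Resolve the buffer size for one sstable file type.

    options holds user-supplied properties such as
    "maxBufferSizeBytes_Data.db"; when no override is present, fall back
    to the per-type default, then to the global default.
    """
    override = options.get(f"maxBufferSizeBytes_{file_type}")
    if override is not None:
        return int(override)
    return MAX_BUFFER_DEFAULTS.get(file_type, global_default)
----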

== Bulk Writer

This section describes configuration properties specific to the bulk writer.

=== Spark Writer Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_keyspace_
|
|Keyspace of the table to write

|_table_
|
|Table to which rows are written, or from which rows are removed, depending on _write_mode_

|_local_dc_
|
|Data center used when a `LOCAL_*` consistency level is specified

|_bulk_writer_cl_
|`EACH_QUORUM`
|Write consistency level

|_write_mode_
|`INSERT`
|Determines the write mode: `INSERT` or `DELETE_PARTITION`

|_ttl_
|
|Time-to-live value applied to created records

|_timestamp_
|`NOW`
|Mutation timestamp assigned to generated rows, expressed in microseconds

|_skip_extended_verify_
|`false`
|Every imported sstable is verified for corruption during the import process. Setting this property to `true`
skips the extended check of all values in the new sstables

|_quote_identifiers_
|`false`
|Specifies whether identifiers (i.e. keyspace, table and column names) should be quoted to support mixed-case
and reserved-keyword names for these fields

|_data_transport_
|`DIRECT`
a|Specifies the data transport mode. Supported implementations:

* `DIRECT`: upload generated sstables directly to the Cassandra cluster via Sidecar
* `S3_COMPAT`: upload generated sstables to remote S3-compatible storage

|===
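
A bulk write configuration is ultimately a map of the properties above. The helper below assembles such a map
with the documented defaults; how the map is then handed to the Spark writer is library-specific, so treat this
purely as an illustration of the property names and defaults:

[source,python]
----
def bulk_writer_options(keyspace, table, **overrides):
    """Assemble writer options using the documented defaults.

    Unset optional properties (local_dc, ttl) are omitted from the result;
    unknown override names are rejected to catch typos early.
    """
    options = {
        "keyspace": keyspace,
        "table": table,
        "local_dc": None,
        "bulk_writer_cl": "EACH_QUORUM",
        "write_mode": "INSERT",
        "ttl": None,
        "timestamp": "NOW",
        "skip_extended_verify": "false",
        "quote_identifiers": "false",
        "data_transport": "DIRECT",
    }
    unknown = set(overrides) - set(options)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    options.update(overrides)
    return {k: v for k, v in options.items() if v is not None}
----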

=== S3 Upload Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|===

=== Other Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|_number_splits_
|`-1`
|User-defined number of token range splits. By default, the library dynamically calculates the number of splits
based on the Spark properties `spark.default.parallelism`, `spark.executor.cores` and `spark.executor.instances`

|_sstable_data_size_in_mib_
|`160`
|Maximum sstable size (in MiB)

|_digest_
|`XXHash32`
|Digest algorithm used to compute checksums for validation when uploading sstables. Supported values:
`XXHash32`, `MD5`

|_job_timeout_seconds_
|`-1`
a|Specifies a timeout in seconds for bulk write jobs. Disabled by default. When configured, a job exceeding
the timeout is:

* successful when the desired consistency level is achieved
* failed otherwise

|_job_id_
|
|User-defined identifier for the bulk write job

|===
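
_number_splits_ uses `-1` as a sentinel for "calculate dynamically". The sketch below shows the sentinel
handling; the derivation from the Spark properties is one plausible heuristic offered as an assumption, not the
library's actual formula:

[source,python]
----
def effective_splits(number_splits, default_parallelism,
                     executor_cores, executor_instances):
    """Resolve the number of token range splits.

    A positive number_splits wins; otherwise derive a value from the Spark
    properties named in the table above. The max() heuristic here is an
    illustrative assumption.
    """
    if number_splits > 0:
        return number_splits
    return max(default_parallelism, executor_cores * executor_instances)
----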