Commit 3e59e64

CASSANALYTICS-6: User documentation
1 parent 2e15761 commit 3e59e64

4 files changed

Lines changed: 340 additions & 1 deletion

File tree

build.gradle

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ plugins {
     // Release Audit Tool (RAT) plugin for checking project licenses
     id("org.nosphere.apache.rat") version "0.8.1"
+
+    id 'org.asciidoctor.jvm.convert' version '3.3.2'
 }

 repositories {

docs/build.gradle

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

apply plugin: 'org.asciidoctor.jvm.convert'

asciidoctor {
    sourceDir = file("src")
    outputDir = file("build")
    attributes(
        'project-version': project.version
    )
}

docs/src/user.adoc

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
= Overview

This document describes the configuration options available for the bulk reader and bulk writer components.

== Cassandra Sidecar Configuration

The Analytics library uses Sidecar to interact with the Cassandra cluster. The bulk reader and bulk writer
components share common Sidecar configuration properties.

[cols="1,1,2"]
|===
|Property name|Default|Description

|_sidecar_contact_points_
|
|Comma-separated list of Cassandra Sidecar contact points. IP addresses and FQDN domain names are supported,
with an optional port number (e.g. `localhost1,localhost2`, `127.0.0.1,127.0.0.2`, `127.0.0.1:9043,127.0.0.2:9043`)

|_sidecar_port_
|`9043`
|Default port on which Cassandra Sidecar listens

|_keystore_path_
|
|Path to the keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_base64_encoded_
|
|Base64-encoded keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_password_
|
|Keystore password

|_keystore_type_
|`PKCS12`
|Keystore type, `PKCS12` or `JKS`

|_truststore_path_
|
|Path to the truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_base64_encoded_
|
|Base64-encoded truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_password_
|
|Truststore password

|_truststore_type_
|`PKCS12`
|Truststore type, `PKCS12` or `JKS`

|_cassandra_role_
|
|Specific role that Sidecar shall use to authorize the request. For further details, consult the Sidecar
documentation for the `cassandra-auth-role` HTTP header

|===
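Contact points are supplied as a single comma-separated string in which each entry may or may not carry a port. As a rough illustration only (this helper is hypothetical and not part of the library; IPv6 literals are out of scope), splitting such a string into host/port pairs with the default Sidecar port might look like:

```python
# Hypothetical helper: splits a sidecar_contact_points string into
# (host, port) pairs, falling back to the default Sidecar port 9043
# when an entry has no explicit port. IPv6 literals are not handled.
DEFAULT_SIDECAR_PORT = 9043

def parse_contact_points(value, default_port=DEFAULT_SIDECAR_PORT):
    points = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:  # explicit port given, e.g. "127.0.0.1:9043"
            points.append((host, int(port)))
        else:    # bare host name or IP address, use the default port
            points.append((entry, default_port))
    return points

print(parse_contact_points("127.0.0.1:9043,localhost2"))
```

Each of the three example forms from the table above (bare names, bare IPs, IP:port pairs) parses to the same shape of result.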
== Bulk Reader

This section describes configuration properties specific to the bulk reader.

=== Cassandra Sidecar Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_defaultMillisToSleep_
|`500`
|Number of milliseconds to wait between retry attempts

|_maxMillisToSleep_
|`60000`
|Maximum number of milliseconds to sleep between retries

|_maxPoolSize_
|`64`
|Size of the Vert.x worker thread pool

|_timeoutSeconds_
|`600`
|Request timeout, expressed in seconds

|===
=== Spark Reader Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_keyspace_
|
|Keyspace of the table to read

|_table_
|
|Table to be read

|_dc_
|
|Data center used when a `LOCAL_*` consistency level is specified

|_consistencyLevel_
|`LOCAL_QUORUM`
|Read consistency level

|_snapshotName_
|`sbr_\{uuid\}`
|Name of the snapshot to use (for data consistency). By default, a unique name is generated

|_createSnapshot_
|`true`
|Indicates whether a new snapshot should be created prior to performing the read operation

|_clearSnapshotStrategy_
|`OnCompletionOrTTL 2d`
|Strategy for removing the snapshot once the read operation completes. This option is always enabled when the
_createSnapshot_ flag is set to `true`. The value of _clearSnapshotStrategy_ must follow the format
`[strategy] [snapshotTTL]`. Supported strategies: `NoOp`, `OnCompletion`, `OnCompletionOrTTL`, `TTL`. Example
configurations: `OnCompletionOrTTL 2d`, `TTL 2d`, `NoOp`, `OnCompletion`. The TTL value has to match the pattern
`\d+(d\|h\|m\|s)`

|_bigNumberConfig_
|
a|Defines the output scale and precision of `decimal` and `varint` columns. The parameter value is a JSON string
with the following structure:

[source,json]
----
{
    "columnName1" : {"bigDecimalPrecision": 10, "bigDecimalScale": 5},
    "columnName2" : {"bigIntegerPrecision": 10, "bigIntegerScale": 5}
}
----

|_lastModifiedColumnName_
|
|Name of the field appended to the Spark RDD that represents the last modification timestamp of each row

|===
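The _clearSnapshotStrategy_ format above (`[strategy] [snapshotTTL]`, with the TTL matching `\d+(d|h|m|s)`) can be checked with a short validator. The sketch below is illustrative only and assumes, based on the examples given, that a TTL is required exactly for the TTL-based strategies:

```python
import re

# Illustrative validator for the documented clearSnapshotStrategy format.
# Assumption (from the examples): TTL-based strategies take a TTL argument,
# the others take none.
STRATEGIES = {"NoOp", "OnCompletion", "OnCompletionOrTTL", "TTL"}
TTL_PATTERN = re.compile(r"^\d+(d|h|m|s)$")

def is_valid_clear_snapshot_strategy(value):
    parts = value.split()
    if not parts or parts[0] not in STRATEGIES:
        return False
    if parts[0] in {"OnCompletionOrTTL", "TTL"}:
        return len(parts) == 2 and bool(TTL_PATTERN.match(parts[1]))
    return len(parts) == 1

print(is_valid_clear_snapshot_strategy("OnCompletionOrTTL 2d"))
```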
=== Other Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|_defaultParallelism_
|`1`
|Value of the Spark property `spark.default.parallelism`

|_numCores_
|`1`
|Total number of cores used by all Spark executors

|_maxBufferSizeBytes_
|`6291456`
a|Maximum number of bytes per sstable file that may be downloaded and buffered in memory. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 6291456
- `Index.db`: 131072
- `Summary.db`: 262144
- `Statistics.db`: 131072
- `CompressionInfo.db`: 131072
- `.log` (commit log): 65536
- `Partitions.db`: 131072
- `Rows.db`: 131072

To override the size for `Data.db`, use the property `_maxBufferSizeBytes_Data.db_`.

|_chunkBufferSizeBytes_
|`4194304`
a|Default chunk size (in bytes) requested when fetching the next portion of an sstable file. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 4194304
- `Index.db`: 32768
- `Summary.db`: 131072
- `Statistics.db`: 65536
- `CompressionInfo.db`: 65536
- `.log` (commit log): 65536
- `Partitions.db`: 4096
- `Rows.db`: 4096

To override the size for `Data.db`, use the property `_chunkBufferSizeBytes_Data.db_`.

|_sizing_
|`default`
a|Determines how the number of CPU cores is selected during the read operation. Supported options:

* `default`: static number of cores defined by the _numCores_ parameter
* `dynamic`: calculates the number of cores dynamically based on table size. Improves cost efficiency when
processing small tables (a few GBs). Consult the JavaDoc of `org.apache.cassandra.spark.data.DynamicSizing` for
implementation details. Relevant configuration properties:
** _maxPartitionSize_: maximum Spark partition size (in GiB)

|_quote_identifiers_
|`false`
|When `true`, keyspace, table and column names are quoted

|_sstable_start_timestamp_micros_ and _sstable_end_timestamp_micros_
|
|Define an inclusive time-range filter for sstable selection. Both timestamps are expressed in microseconds

|===
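The per-file-type override scheme described above (e.g. `maxBufferSizeBytes_Data.db` taking precedence over the global `maxBufferSizeBytes`) amounts to a two-step property lookup. The sketch below is a hypothetical illustration of that fallback order, not the library's actual resolution code:

```python
# Hypothetical illustration of the documented fallback order:
# a file-type-specific key wins over the global key, which in turn
# wins over the built-in default.
def resolve_buffer_size(options, file_type, default=6291456):
    specific = options.get("maxBufferSizeBytes_" + file_type)
    if specific is not None:
        return int(specific)
    return int(options.get("maxBufferSizeBytes", default))

opts = {"maxBufferSizeBytes": "8388608", "maxBufferSizeBytes_Index.db": "131072"}
print(resolve_buffer_size(opts, "Index.db"))  # specific override wins
print(resolve_buffer_size(opts, "Data.db"))   # falls back to the global value
```

The same fallback shape applies to `chunkBufferSizeBytes` and its per-file-type variants.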
== Bulk Writer

This section describes configuration properties specific to the bulk writer.

=== Spark Writer Configuration

[cols="1,1,2"]
|===
|Property name|Default|Description

|_keyspace_
|
|Keyspace of the table to write

|_table_
|
|Table to which rows are written, or from which rows are removed, depending on _write_mode_

|_local_dc_
|
|Data center used when a `LOCAL_*` consistency level is specified

|_bulk_writer_cl_
|`EACH_QUORUM`
|Write consistency level

|_write_mode_
|`INSERT`
|Determines the write mode: `INSERT` or `DELETE_PARTITION`

|_ttl_
|
|Time-to-live value applied to created records

|_timestamp_
|`NOW`
|Mutation timestamp assigned to generated rows, expressed in microseconds

|_skip_extended_verify_
|`false`
|Every imported sstable is verified for corruption during the import process. Setting this property to `true` skips
the extended verification of all values in the new sstables

|_quote_identifiers_
|`false`
|Specifies whether identifiers (i.e. keyspace, table and column names) should be quoted to support mixed-case and
reserved-keyword names for these fields

|_data_transport_
|`DIRECT`
a|Specifies the data transport mode. Supported implementations:

* `DIRECT`: uploads generated sstables directly to the Cassandra cluster via Sidecar
* `S3_COMPAT`: uploads generated sstables to remote S3-compatible storage

|===
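The _timestamp_ option above expects the mutation timestamp in microseconds since the Unix epoch. As a quick illustration (any equivalent conversion works), deriving such a value from a wall-clock instant:

```python
import datetime

# Convert a timezone-aware datetime to microseconds since the Unix
# epoch, the unit expected by the timestamp option.
def to_timestamp_micros(dt):
    return int(dt.timestamp() * 1_000_000)

ts = to_timestamp_micros(datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc))
print(ts)  # 1704067200000000
```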
=== S3 Upload Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|===
=== Other Properties

[cols="1,1,2"]
|===
|Property name|Default|Description

|_number_splits_
|`-1`
|User-defined number of token range splits. By default, the library dynamically calculates the number of splits
based on the Spark properties `spark.default.parallelism`, `spark.executor.cores` and `spark.executor.instances`

|_sstable_data_size_in_mib_
|`160`
|Maximum sstable size (in MiB)

|_digest_
|`XXHash32`
|Digest algorithm used to compute checksums when uploading sstables for validation. Supported values: `XXHash32`,
`MD5`

|_job_timeout_seconds_
|`-1`
a|Specifies a timeout in seconds for bulk write jobs. Disabled by default. When configured, a job exceeding the
timeout:

* succeeds when the desired consistency level is achieved
* fails otherwise

|_job_id_
|
|User-defined identifier for the bulk write job

|===

settings.gradle

Lines changed: 2 additions & 1 deletion
@@ -50,4 +50,5 @@ include 'cassandra-analytics-cdc-codec'
 include 'analytics-sidecar-vertx-client-shaded'
 include 'analytics-sidecar-vertx-client'
 include 'analytics-sidecar-client'
-include 'analytics-sidecar-client-common'
+include 'analytics-sidecar-client-common'
+include 'docs'
