Apache Spark Cluster Configuration Options

The TDspora application is supplied with a pre-configured Apache Spark cluster running in Docker containers. This cluster is referred to as the "Local cluster". It is configured to run in standalone mode and has two workers.

Please note that the Local cluster is intended for testing and demonstration purposes only. It is not recommended for processing large amounts of data or for running more than one Spark application at a time. You can change its default configuration in the Spark-submit options field in the Advanced section. Please ensure that the TDspora server has enough resources for the Local cluster to process Spark applications without significantly degrading the performance of the server itself.

Production-scale jobs require a distributed Spark cluster configured to handle workloads that match the data volumes extracted from the production databases. For example, subsetting 20GB of data from an Oracle database requires roughly 6GB of RAM on a worker node. You can also use the Spark-submit options field in the Advanced section to manage the cluster resources (CPU cores and memory) available to your pipelines.
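As a rough illustration of the resource management described above, a line like the following could be entered in the Spark-submit options field. The flags are standard spark-submit options; the values are assumptions picked to echo the 6GB example and should be adjusted to the actual cluster.

```text
--driver-memory 2g --executor-memory 6g --executor-cores 2 --total-executor-cores 4
```

Here --executor-memory and --executor-cores size each executor, while --total-executor-cores caps the number of cores the application takes across a standalone cluster. Each worker must offer at least the requested executor memory, otherwise no executor can be started on it.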

See Apache Spark configuration and Dynamic Resource Allocation for additional information about Spark configuration and available Spark properties.

Deployment Mode

The cluster deploy mode is set by default for the Apache Spark cluster supplied with the product. You can change the mode by providing the --deploy-mode option in the advanced parameters of the cluster configuration. Currently, Spark supports two modes: client and cluster.

Note that if you set the Master URL to local (or any variation of local, see Apache Spark Master URL), the deploy mode is automatically set to client, because the cluster deploy mode is not supported with these Master URLs.
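The rule above can be summarized in a small shell sketch. This is only an illustration of the documented behavior, not the application's actual code; the function name effective_deploy_mode and the sample URLs are hypothetical.

```shell
# Illustration only: returns the deploy mode that ends up being used
# for a given Master URL and requested mode.
# $1 - Master URL, $2 - requested deploy mode (defaults to cluster)
effective_deploy_mode() {
  master_url=$1
  requested_mode=${2:-cluster}
  case "$master_url" in
    # local, local[4], local[*] and other local variations run in client mode only
    local*) echo "client" ;;
    *)      echo "$requested_mode" ;;
  esac
}
```

For instance, effective_deploy_mode 'local[4]' cluster yields client, while effective_deploy_mode spark://spark-master:7077 cluster keeps the requested cluster mode.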

Spark-submit options

The options you define in this control are passed directly to the spark-submit command that executes the pipeline job on a Spark cluster, except for the following:

--class - points to the application engine implementation class com.epam.tdm.engine.TdmEngineJobApp
--files - includes files required to execute the job (pipeline)
--master - the master URL specified in a separate input control

Please refer to the Apache Spark documentation for additional options available for the driver and executors.
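Before pasting a long option string into the field, it can help to check that it does not contain one of the three options listed above. The following is a minimal sketch of such a check; the function name has_reserved_option is hypothetical and the check is not part of the product.

```shell
# Illustration only: succeeds (exit status 0) when any argument is one of
# the options listed above (--class, --files, --master),
# in either the "--opt value" or the "--opt=value" form.
has_reserved_option() {
  for arg in "$@"; do
    case "$arg" in
      --class|--class=*|--files|--files=*|--master|--master=*) return 0 ;;
    esac
  done
  return 1
}
```

For example, has_reserved_option --executor-memory 6g --master 'local[2]' succeeds because of --master, while has_reserved_option --executor-memory 6g --deploy-mode client fails, meaning the string contains none of the listed options.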

Temporary tables created by a pipeline

Normally, all temporary tables are dropped from both the source and target locations. You can change this behavior by passing an additional configuration parameter to the Spark driver with --conf 'spark.driver.extraJavaOptions=-Dtdspora.drop.temp.tables=<MODE>', where <MODE> is one of the following:

Mode	Effect
ALL	Drop all temporary tables. Default.
SOURCE	Drop all temporary tables from the source.
TARGET	Drop all temporary tables from the target.
NONE	Do not drop any tables created by the application.

For example,

--conf 'spark.driver.extraJavaOptions=-Dtdspora.drop.temp.tables=NONE'

Running a Pipeline on a Remote Cluster

When you execute a pipeline, the application connects to the corresponding Apache Spark cluster using SSH and runs the spark-submit command. You can adjust the location and parameters of the command in the Advanced cluster configuration section; see Advanced Cluster Configuration Options.

SSH Session and Environment Variables

Under the hood, the application connects to the cluster host in non-interactive mode. This limits which user-defined environment variables are available to the session: variables exported only in shell profile files, such as ~/.profile, are typically not loaded.

We recommend modifying the PATH variable in the /etc/environment file to include the Apache Spark bin and sbin folders, for example PATH=...:/opt/spark/bin:/opt/spark/sbin. Also add the SPARK_HOME variable there, for instance by appending the line SPARK_HOME=/opt/spark to the file.
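With the paths from the example above (/opt/spark is an assumed install location), the relevant lines of /etc/environment might look like the fragment below. The ... stands for the existing PATH entries, which should be kept as they are. The file typically holds plain KEY=value pairs and is not a shell script, so references such as $PATH are not expanded in it.

```text
PATH=...:/opt/spark/bin:/opt/spark/sbin
SPARK_HOME=/opt/spark
```

To check the result, a non-interactive command such as ssh user@spark-host 'echo $SPARK_HOME' (user and spark-host are placeholders) should print the configured path on systems where the SSH server applies /etc/environment to its sessions.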

If you do not have privileges to modify the /etc/environment file, you can define the location of the Apache Spark home in the Advanced section.

Spark Home

The SPARK_HOME environment variable controls the location of the Apache Spark scripts that are executed, even if you use the full path to the script or override the location in the cluster configuration. A pre-defined SPARK_HOME takes precedence over any other means of selecting the Apache Spark home.

If you have several versions of Spark on the remote machine, make sure that the SPARK_HOME variable points to the intended version of Apache Spark.
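The precedence described above follows from how the Spark launcher scripts locate their home: they derive it from the script location only when SPARK_HOME is not already set. The sketch below is a simplified illustration of that logic, not the actual Spark script; the function name effective_spark_home and the version paths are hypothetical.

```shell
# Illustration only: prints the Spark home that would be used when the
# launcher script at path $1 is started.
effective_spark_home() {
  script_path=$1
  if [ -n "${SPARK_HOME:-}" ]; then
    # A pre-defined SPARK_HOME wins, whatever script path was invoked
    printf '%s\n' "$SPARK_HOME"
  else
    # Otherwise derive the home from <home>/bin/spark-submit
    dirname "$(dirname "$script_path")"
  fi
}
```

With SPARK_HOME=/opt/spark-3.5 set in the environment, calling effective_spark_home /opt/spark-3.3/bin/spark-submit prints /opt/spark-3.5, so the 3.5 installation is used even though the 3.3 script was invoked. With SPARK_HOME unset, the same call prints /opt/spark-3.3.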

Check Cluster connection environment variables for details.

For more details, see the original Apache Spark documentation: Submitting Applications.
