Frequently Asked Questions: Autotuner for Apache Spark
When choosing Spark configuration and infrastructure options to run a job optimally, developers face a near-endless number of combinations, making it impractical to pick the right one every time and leading to failed jobs, increased cost, and reduced performance. The Sync Autotuner for Apache Spark predicts cost and runtime across thousands of possible combinations, solving this problem in seconds.
How is the Autotuner different from other tuning solutions?
- Most other solutions recommend tuning single parameters. Each Autotuner prediction is a combination of infrastructure and application configuration parameters, and it accounts for the interrelationships between parameters and how they impact cost and runtime.
- We need just one log to give you better performance; other solutions require many runs to train their models.
- We provide many recommendations with different runtime and cost values, so users can choose what works for them. Other solutions give you just one option with unclear runtime and cost impact.
Which platforms does the Autotuner support?
The Autotuner Beta supports Apache Spark running on AWS EMR, as well as Databricks running on AWS. It does not yet support EKS or Serverless.
What do I need to run the Autotuner?
To run the Autotuner, we need cluster information along with a Spark event log from the most recent successful run of the job. If you aren't sure how to access this information, please view our User Documentation.
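For example, if your cluster ships event logs to S3, you can pull one down with boto3. This is a minimal sketch: the bucket and key below are placeholders, and the actual path depends on where spark.eventLog.dir points for your cluster.

```python
# Minimal sketch: fetch a Spark event log from S3 with boto3.
# Bucket and key are placeholders; the real location depends on your
# cluster's spark.eventLog.dir setting (or EMR's configured log URI).
import boto3

s3 = boto3.client("s3")

bucket = "your-spark-log-bucket"                         # placeholder
key = "spark-event-logs/application_1234567890123_0001"  # placeholder

s3.download_file(bucket, key, "event_log")
print("Saved event log to ./event_log")
```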
How do I make sure an event log is generated?
Ensure that you have spark.eventLog.enabled set to true for any jobs you are interested in optimizing. For more information, see: https://spark.apache.org/docs/latest/configuration.html
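As a minimal sketch, event logging can be enabled when the SparkSession is created (on EMR or Databricks you can also set these properties in the cluster configuration); the S3 path below is a placeholder:

```python
# Minimal sketch: turn on Spark event logging from a PySpark application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("job-to-optimize")
    .config("spark.eventLog.enabled", "true")
    # Placeholder path: point this at a durable location you own.
    .config("spark.eventLog.dir", "s3://your-spark-log-bucket/spark-event-logs/")
    .getOrCreate()
)
```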
How long does the Autotuner take to return results?
This depends on the size of your log file, but most of the time results are returned in less than 10 minutes. We are working on speeding this up so results arrive in seconds.
How often should I run the Autotuner?
Many elements of a data pipeline can change daily, such as data size, data skew, the code itself, spot pricing, and spot availability. Because of this, we recommend running the Autotuner right before you run your real job, so that no parameter goes stale. Eventually, the Autotuner is meant to be run every time your production Spark job runs.
Where do your cost numbers come from?
We query the AWS public APIs as part of the prediction process to ensure we have the most current on-demand and spot pricing. We know many users have special vendor discounts, as well as Reserved Instances and Savings Plans, that affect on-demand costs; eventually we will take these into account when reporting costs. For Databricks, we use the list price for DBUs: https://databricks.com/product/aws-pricing
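For reference, current spot pricing is available from the public EC2 API. The sketch below shows the kind of lookup involved (not our internal code); the region and instance type are placeholders:

```python
# Minimal sketch: query current AWS spot prices with boto3's EC2 client.
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

resp = ec2.describe_spot_price_history(
    InstanceTypes=["r5.xlarge"],           # placeholder instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),  # "now" returns the latest price per AZ
)
for entry in resp["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"])
```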
How do you choose between On Demand and Spot instances?
Currently we align our recommendations' use of On Demand and Spot with your log and cluster input data. In the future we plan to add user settings for on-demand and spot preferences, as well as options for instance fleet settings.
Do you account for spot interruption rates?
We check interruption rates at runtime and won't recommend highly interruptible instance types. This is one of the reasons you should run the Autotuner as close to your job run as possible: these rates change frequently.
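Interruption-frequency estimates of this kind are published behind the AWS Spot Instance Advisor as a public JSON file. The sketch below is not our internal check, and the endpoint and schema are undocumented conventions that may change, so treat both as assumptions:

```python
# Minimal sketch: read interruption-frequency estimates from the JSON file
# behind the AWS Spot Instance Advisor. Endpoint and schema are assumptions,
# not a stable, documented API.
import json
import urllib.request

URL = "https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# "ranges" maps an index to a human-readable bucket such as "<5%" or "5-10%".
labels = {r["index"]: r["label"] for r in data["ranges"]}
info = data["spot_advisor"]["us-east-1"]["Linux"]["r5.xlarge"]  # placeholder keys
print("Estimated interruption frequency:", labels[info["r"]])
```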
Is there an API?
The user API is under development, and we anticipate having it available soon!
How do I contact support?
Please email support@synccomputing.com.
My Spark event logs contain sensitive data that I do not want to share outside my organization. What are my options?
We have an open-source log parser that removes sensitive information and includes only what the Autotuner needs. You can access the parser here: https://github.com/synccomputingcode/spark_log_parser
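For illustration only (this is not the spark_log_parser API; see the repository above for the real tool), a Spark event log is newline-delimited JSON, so sensitive properties can be stripped before the file leaves your environment. The property prefixes below are hypothetical examples:

```python
# Minimal sketch of the kind of redaction such a parser performs.
# NOT the spark_log_parser API; the prefixes below are hypothetical.
import json

SENSITIVE_PREFIXES = ("spark.hadoop.fs.s3a.", "spark.yarn.dist.")  # hypothetical

def redact_event(event: dict) -> dict:
    props = event.get("Properties")
    if isinstance(props, dict):
        event["Properties"] = {
            k: v for k, v in props.items()
            if not k.startswith(SENSITIVE_PREFIXES)
        }
    return event

with open("event_log") as src, open("event_log.redacted", "w") as dst:
    for line in src:
        dst.write(json.dumps(redact_event(json.loads(line))) + "\n")
```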
Can I implement only some of the parameters in a recommendation?
No. Each configuration in a recommendation must be implemented as given to achieve the predicted results; the parameters should not be treated as independent recommendations.
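One way to make sure nothing is cherry-picked is to apply the whole recommendation as a unit, for example by looping over it when the session is built. The parameter values below are hypothetical; substitute the full set from your Autotuner result:

```python
# Minimal sketch: apply every parameter from a recommendation together.
from pyspark.sql import SparkSession

recommended_conf = {                      # hypothetical example values
    "spark.executor.instances": "12",
    "spark.executor.cores": "4",
    "spark.executor.memory": "14g",
    "spark.sql.shuffle.partitions": "96",
}

builder = SparkSession.builder.appName("tuned-job")
for key, value in recommended_conf.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```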
How can I share the results of running a recommendation?
While we will eventually build support for a feedback loop and ROI tracking, in the meantime we would love to hear how our recommended configurations worked for you. Please email support@synccomputing.com.
What if a recommendation doesn't behave as predicted?
This is still a beta product, so you may very well have found a bug. Another large source of error we've seen is recommended settings that were not implemented exactly as given. Sometimes companies forbid certain settings from changing, or your infrastructure may have an edge case we've never encountered before. Either way, please send us a message and we can help troubleshoot. Email us at support@synccomputing.com.