The issue was caused by driver overload due to too many tasks running in parallel. Each parallel Python task spawns a Python REPL (ipykernel) on the driver, and under high load the REPLs could not start within the default 80‑second timeout. This made the driver unresponsive, resulting in the error “Could not reach driver of cluster”.
Resolution
- Reduced job parallelism to limit concurrent REPL creation.
- Increased the REPL startup timeout by setting `spark.databricks.driver.ipykernel.launchTimeoutSeconds = 300`.
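The first step can be sketched as follows. This is a minimal illustration, assuming the parallel tasks are notebook runs launched from driver-side threads; the notebook paths, worker count, and the `run_notebook` stub are all hypothetical, and in a real Databricks notebook the stub would be replaced by `dbutils.notebook.run(path, timeout_seconds=...)`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of notebooks to run in parallel; paths are illustrative.
notebooks = [f"/Jobs/etl_task_{i}" for i in range(20)]

def run_notebook(path: str) -> str:
    # Stand-in for dbutils.notebook.run(path, timeout_seconds=3600),
    # stubbed here so the sketch is self-contained.
    return f"done: {path}"

# Cap parallelism (e.g. 4 workers instead of 20) so that only a few
# ipykernel REPLs are started on the driver at any one time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, notebooks))

print(len(results))  # all 20 tasks complete, but at most 4 ran concurrently
```

The timeout setting from the second step is applied at the cluster level (cluster configuration, Spark config section), in the usual `key value` format:

```
spark.databricks.driver.ipykernel.launchTimeoutSeconds 300
```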
After these changes, the job ran consistently without intermittent failures.
Hope this helps. Please let us know if you have any questions or concerns.
If this solution helped resolve your issue, please consider clicking ‘Accept Answer’ or giving it an upvote to help others find it easily.