Databricks job cluster error: Could not reach driver of cluster cluster-id.

González Ternero, David 0 Reputation points
2026-02-25T16:44:55.28+00:00

We are experiencing an intermittent error in one of our jobs that always runs at the same time (11:00 PM). On some days, the job fails with the following error:

DriverError: Could not reach driver of cluster <cluster-id>

I have seen in other threads that this issue can sometimes be caused by binary mismatches between libraries installed on the cluster. However, in our case, we are not using any custom-installed libraries beyond the default runtime.

I have also come across similar issues related to the following error:

Failed to start repl ReplId-<id> com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: Unable to start python kernel for ReplId-<id>, kernel did not start within 80 seconds.

However, in our case, this trace (or any similar one) does not appear in our logs.

Instead, the logs repeatedly show the following errors:

ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "HikariCP" plugin to create a ConnectionPool gave an error: Failed to initialize pool: Could not connect to address=(host=REDACTED)(port=REDACTED)(type=master) : Read timed out
Caused by: com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: Could not connect to address=(host=REDACTED)(port=REDACTED)(type=master) : Read timed out

Do you have any idea what might be causing this issue?

Azure Databricks

An Apache Spark-based analytics platform optimized for Azure.


2 answers

  1. Manoj Kumar Boyini 9,410 Reputation points Microsoft External Staff Moderator
    2026-02-25T23:46:06.05+00:00

    Hi González Ternero, David,

    The issue was caused by driver overload due to too many tasks running in parallel. Each parallel Python task spawns a Python REPL (ipykernel) on the driver, and under high load the REPLs could not start within the default 80‑second timeout. This made the driver unresponsive, resulting in the error “Could not reach driver of cluster”.

    Resolution

    • Reduced job parallelism to limit concurrent REPL creation.
    • Increased the REPL startup timeout by setting: spark.databricks.driver.ipykernel.launchTimeoutSeconds = 300

    After these changes, the job ran consistently without intermittent failures.
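
    The two mitigations above can be sketched together. This is a minimal illustration, not the original job code: the task function, the table names, and the cap of 4 workers are all hypothetical and should be tuned to your driver size; the Spark config shown in the comment is the setting named above and belongs in the cluster's Spark config, not in the notebook.

    ```python
    from concurrent.futures import ThreadPoolExecutor

    # Cluster Spark config (set on the cluster, not at runtime), as noted above:
    #   spark.databricks.driver.ipykernel.launchTimeoutSeconds 300

    # Cap how many tasks run in parallel so the driver only has to launch a
    # bounded number of Python REPLs at any one time. The value 4 is illustrative.
    MAX_CONCURRENT_TASKS = 4

    def process_table(table_name):
        # Hypothetical per-table work; replace with the real task body.
        return f"processed {table_name}"

    tables = ["orders", "customers", "invoices"]
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TASKS) as pool:
        results = list(pool.map(process_table, tables))
    ```

    With a bounded pool, submitting many tables still only creates at most `MAX_CONCURRENT_TASKS` REPLs concurrently, which keeps the driver within the startup timeout.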

    Hope this helps. Please let us know if you have any questions or concerns.

    If this solution helped resolve your issue, please consider clicking ‘Accept Answer’ or giving it an upvote to help others find it easily.


  2. Q&A Assist
    2026-02-25T16:45:07.73+00:00
    AI-generated content. This answer was created with AI from Q&A Assist.

    The error DriverError: Could not reach driver of cluster <cluster-id> can occur for several reasons. Since you mentioned that you are not using any custom-installed libraries, the issue might be related to the driver being overloaded or the cluster malfunctioning.

    1. Driver Overload: If the driver is overloaded, it may not be able to respond to requests. You can check the cluster metrics to determine if the driver is heavily loaded. Look for the Server load distribution visualization in the cluster metrics. If the driver is overloaded, it will be indicated by a red block in the visualization. To address this, consider increasing the size of the driver or reducing the concurrency of jobs running on the cluster.
    2. Cluster Malfunctioning: Although rare, a malfunctioning cluster can also lead to this error. Restarting the cluster might resolve the issue. Additionally, you can check the logs for any suspicious entries, particularly in the Event log and Driver logs tabs.
    3. Datastore Issues: The error message regarding the StoreManager could indicate a problem with the underlying data storage or configuration. Ensure that the data store is accessible and properly configured.
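
    For point 3, a quick way to confirm whether the metastore endpoint is reachable from the driver at the failure time is a plain TCP probe. A minimal sketch; the host and port in the example comment are placeholders, since the real values are redacted in your log:

    ```python
    import socket

    def can_reach(host, port, timeout_s=5.0):
        """Return True if a TCP connection to host:port succeeds within timeout_s."""
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            return False

    # Example: probe a hypothetical metastore endpoint (replace with your own
    # host and port from the HikariCP error message):
    # can_reach("metastore.example.com", 3306)
    ```

    Running this from a notebook on the affected cluster around 11:00 PM would distinguish a network/firewall problem (probe fails) from a slow but reachable metastore (probe succeeds while the pool still times out).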

    Since the job runs at the same time every day, consider if there are any scheduled tasks or resource contention occurring at that time that could impact the cluster's performance.

    If the problem persists, you may want to reach out to Databricks support for further assistance, especially if you suspect it could be a deeper issue with the cluster setup or configuration.


