Exam Details

  • Exam Code: DATABRICKS-CERTIFIED-PROFESSIONAL-DATA-ENGINEER
  • Exam Name: Databricks Certified Data Engineer Professional
  • Certification: Databricks Certifications
  • Vendor: Databricks
  • Total Questions: 120 Q&As
  • Last Updated: Jul 02, 2025

Databricks Certifications: DATABRICKS-CERTIFIED-PROFESSIONAL-DATA-ENGINEER Questions & Answers

  • Question 81:

    The following code has been migrated to a Databricks notebook from a legacy workload:

    The code executes successfully and produces the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data. Which statement is a possible explanation for this behavior?

    A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

    B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

    C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

    D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

    E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
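
    The correct explanation hinges on where %sh runs. Below is a minimal, hypothetical contrast between the driver-only pattern described in the question and a distributed Spark read; the repository, script, paths, and source format are all assumptions, since the original code is not reproduced here.

    # Driver-only pattern: a %sh cell runs entirely on the driver node, so the
    # workers and Databricks-optimized Spark sit idle.
    #   %sh
    #   git clone https://example.com/legacy/etl.git    # hypothetical repo
    #   python etl/run.py                               # hypothetical script

    # Distributed alternative: let Spark read and write the data across the cluster.
    df = (spark.read
          .format("json")                       # assumed source format
          .load("/mnt/raw/legacy_extract/"))    # assumed input path

    (df.write
       .format("delta")
       .mode("overwrite")
       .save("/mnt/bronze/legacy_extract/"))    # assumed target path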

  • Question 82:

    The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

    The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

    GRANT USAGE ON DATABASE prod TO eng;

    GRANT SELECT ON DATABASE prod TO eng;

    Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

    A. Group members have full permissions on the prod database and can also assign permissions to other users or groups.

    B. Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

    C. Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

    D. Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

    E. Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.
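
    As a minimal sketch of the grants above, run from a notebook on a cluster with legacy table access controls enabled (the prod.orders table and the failing statements are hypothetical):

    # The two grants from the question: USAGE lets the group resolve objects in prod,
    # SELECT gives read-only access to its tables and views.
    spark.sql("GRANT USAGE ON DATABASE prod TO eng")
    spark.sql("GRANT SELECT ON DATABASE prod TO eng")

    # An eng group member can query...
    spark.sql("SELECT * FROM prod.orders LIMIT 10")        # hypothetical table

    # ...but statements that create or modify objects would fail with a permissions error:
    # spark.sql("CREATE TABLE prod.scratch (id INT)")
    # spark.sql("UPDATE prod.orders SET status = 'closed'")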

  • Question 83:

    A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed some latency issues during certain times of the day.

    A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays.

    Which limitation will the team face while diagnosing this problem?

    A. New fields will not be computed for historic records.

    B. Updating the table schema will invalidate the Delta transaction log metadata.

    C. Updating the table schema requires a default value provided for each file added.

    D. Spark cannot capture the topic and partition fields from the Kafka source.
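
    A minimal sketch of the updated ingestion logic described in the question, assuming a hypothetical broker address, topic name, checkpoint location, and table path:

    from pyspark.sql import functions as F

    bronze_stream = (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
            .option("subscribe", "events")                      # assumed topic
            .load()
            .select(
                "key", "value", "timestamp",                    # Kafka-generated fields
                "topic", "partition",                           # new metadata fields
                F.current_timestamp().alias("ingest_time")      # Spark-recorded timestamp
            )
    )

    # Rows already in the bronze table were written before these columns existed, so the
    # new fields will be null for historic records; only newly ingested data is enriched.
    (bronze_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/bronze_events")   # assumed path
        .option("mergeSchema", "true")
        .start("/mnt/bronze/events"))                                     # assumed path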

  • Question 84:

    A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

    The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

    Which approach would simplify the identification of these changed records?

    A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

    B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

    C. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

    D. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

    E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
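
    A hedged sketch of the approach in option E: enable the change data feed on the source table, then read only the rows that changed since the last run. The cutoff timestamp and the filter on change types are illustrative assumptions.

    # Enable the change data feed on the existing table (a one-time property change).
    spark.sql("""
        ALTER TABLE customer_churn_params
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Read only the records changed since an assumed cutoff roughly 24 hours ago.
    changed = (spark.read
        .format("delta")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", "2025-07-01 00:00:00")    # assumed cutoff
        .table("customer_churn_params")
        .filter("_change_type IN ('insert', 'update_postimage')"))

    # The churn model would then score only `changed` rather than the full table.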

  • Question 85:

    The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

    item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

    The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify whether any of 30 key words exist in this field. A junior data engineer suggests that converting this data to Delta Lake will improve query performance.

    Which response to the junior data engineer's suggestion is correct?

    A. Delta Lake statistics are not optimized for free text fields with high cardinality.

    B. Text data cannot be stored with Delta Lake.

    C. ZORDER ON review will need to be run to see performance gains.

    D. The Delta log creates a term matrix for free text fields to support selective filtering.

    E. Delta Lake statistics are only collected on the first 4 columns in a table.
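
    To illustrate why converting to Delta Lake alone does not help here, a hypothetical sketch (paths and keywords are assumptions): file-level min/max statistics cannot prune files for substring predicates on a high-cardinality free-text column, so every file is still scanned.

    # Convert the existing Parquet data in place (assumed location).
    spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/reviews`")

    keywords = ["refund", "broken", "excellent"]    # 3 of the 30 assumed key words
    reviews = spark.read.format("delta").load("/mnt/raw/reviews")

    # A LIKE predicate on the review text cannot use min/max stats for data skipping,
    # unlike a range predicate on a numeric or date column.
    matches = reviews.filter(
        " OR ".join(f"review LIKE '%{kw}%'" for kw in keywords)
    )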

  • Question 86:

    A Delta Lake table representing metadata about user-generated content has the following schema:

    Based on the above schema, which column is a good candidate for partitioning the Delta Table?

    A. Date

    B. Post_id

    C. User_id

    D. Post_time
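
    A brief sketch of why the date column is the strongest candidate: a daily date is low-cardinality, so partitions stay reasonably sized, whereas post_id or user_id would produce one tiny partition per value. The DataFrame name and path below are assumptions.

    # posts_df is a hypothetical DataFrame with the schema shown in the question.
    (posts_df.write
        .format("delta")
        .partitionBy("date")                 # low-cardinality column, one partition per day
        .mode("overwrite")
        .save("/mnt/silver/user_content"))   # assumed path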

  • Question 87:

    A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

    When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

    A. The five Minute Load Average remains consistent/flat

    B. Bytes Received never exceeds 80 million bytes per second

    C. Total Disk Space remains constant

    D. Network I/O never spikes

    E. Overall cluster CPU utilization is around 25%
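
    A hypothetical illustration of the pattern behind the correct answer: with one driver and three executors of the same VM type, work that runs only on the driver keeps roughly one of four nodes busy, which appears in Ganglia as overall cluster CPU utilization near 25%. The table and column below are assumptions.

    # Driver-bound anti-pattern: collect() pulls all rows to the driver,
    # and the Python loop runs single-threaded while the executors sit idle.
    rows = spark.table("prod.orders").collect()     # hypothetical table
    total = 0
    for r in rows:
        total += r["amount"]                        # hypothetical column

    # A distributed equivalent keeps the executors busy instead:
    # total = spark.table("prod.orders").agg({"amount": "sum"}).first()[0]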

  • Question 88:

    A table is registered with the following code (both users and orders are Delta Lake tables). Which statement describes the results of querying recent_orders?

    A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

    B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

    C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

    D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

    E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
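
    Since the registration code is not reproduced in the question, the following is only a hypothetical sketch of how recent_orders might be defined as a view; a view stores just the query logic, so the join executes at query time against the table versions valid when the query begins.

    spark.sql("""
        CREATE OR REPLACE VIEW recent_orders AS
        SELECT o.order_id, o.order_time, u.user_id, u.email    -- assumed columns
        FROM orders o
        JOIN users u ON o.user_id = u.user_id
        WHERE o.order_time >= date_sub(current_date(), 7)      -- assumed recency filter
    """)

    # The join logic runs here, at query time, not when the view was defined.
    spark.sql("SELECT count(*) FROM recent_orders").show()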

  • Question 89:

    Which Python variable contains a list of directories to be searched when trying to locate required modules?

    A. importlib.resource path

    B. sys.path

    C. os.path

    D. pypi.path

    E. pylib.source
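
    A minimal example of the variable in question (the appended directory is an assumption):

    import sys

    # sys.path is the list of directories Python searches when importing modules.
    print(sys.path)

    # Appending a directory makes modules stored there importable.
    sys.path.append("/dbfs/custom_modules")    # assumed location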

  • Question 90:

    The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

    What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

    A. Can Manage

    B. Can Edit

    C. No permissions

    D. Can Read

    E. Can Run

Tips on How to Prepare for the Exams

Nowadays, certification exams are becoming more important and are required by more and more enterprises when hiring. But how do you prepare for an exam effectively? How do you prepare in a short time with less effort? How do you get an ideal result, and how do you find the most reliable resources? Here on Vcedump.com, you will find all the answers. Vcedump.com provides not only Databricks exam questions, answers, and explanations but also complete assistance with your exam preparation and certification application. If you are unsure about your DATABRICKS-CERTIFIED-PROFESSIONAL-DATA-ENGINEER exam preparation or your Databricks certification application, do not hesitate to visit Vcedump.com to find your solutions.