Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
A. itemsDf.withColumn(["supplier", "manufacturer"])
B. itemsDf.withColumn("supplier").alias("manufacturer")
C. itemsDf.withColumnRenamed("supplier", "manufacturer")
D. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
E. itemsDf.withColumnsRenamed("supplier", "manufacturer")
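For reference, a minimal sketch of the renaming pattern from option C, assuming itemsDf already exists; withColumnRenamed returns a new DataFrame and leaves the original unchanged:

renamedDf = itemsDf.withColumnRenamed("supplier", "manufacturer")  # copy with the column renamed
# Note: withColumn expects a name plus a Column expression, and withColumnsRenamed (Spark 3.4+)
# takes a dict of old-to-new names rather than two separate strings.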
The code block shown below should write DataFrame transactionsDf to disk at path csvPath as a single CSV file, using tabs (\t characters) as separators between columns, expressing missing values as the string n/a, and omitting a header row with column names. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__.write.__2__(__3__, "\t").__4__.__5__(csvPath)
A. 1. coalesce(1) 2. option 3. "sep" 4. option("header", True) 5. path
B. 1. coalesce(1) 2. option 3. "colsep" 4. option("nullValue", "n/a") 5. path
C. 1. repartition(1) 2. option 3. "sep" 4. option("nullValue", "n/a") 5. csv
D. 1. csv 2. option 3. "sep" 4. option("emptyValue", "n/a") 5. path
E. 1. repartition(1) 2. mode 3. "sep" 4. mode("nullValue", "n/a") 5. csv
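For reference, the filled-in chain from option C, as a sketch assuming transactionsDf and csvPath exist; repartition(1) yields a single output file, "sep" sets the tab delimiter, "nullValue" renders nulls as n/a, and the CSV writer omits the header row by default:

transactionsDf.repartition(1) \
    .write.option("sep", "\t") \
    .option("nullValue", "n/a") \
    .csv(csvPath)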
The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__.__2__.__3__(__4__, 100)
A. 1. spark 2. conf 3. set 4. "spark.sql.shuffle.partitions"
B. 1. pyspark 2. config 3. set 4. spark.shuffle.partitions
C. 1. spark 2. conf 3. get 4. "spark.sql.shuffle.partitions"
D. 1. pyspark 2. config 3. set 4. "spark.sql.shuffle.partitions"
E. 1. spark 2. conf 3. set 4. "spark.sql.aggregate.partitions"
Which of the following DataFrame operators is never classified as a wide transformation?
A. DataFrame.sort()
B. DataFrame.aggregate()
C. DataFrame.repartition()
D. DataFrame.select()
E. DataFrame.join()
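As a quick contrast, a sketch assuming transactionsDf exists; select only touches rows within their current partitions, while the other operators listed can require moving rows between partitions:

transactionsDf.select("storeId")  # narrow: each output partition depends on a single input partition
transactionsDf.sort("storeId")    # wide: rows must be redistributed (shuffled) across partitions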
Which of the following describes a shuffle?
A. A shuffle is a process that is executed during a broadcast hash join.
B. A shuffle is a process that compares data across executors.
C. A shuffle is a process that compares data across partitions.
D. A shuffle is a Spark operation that results from DataFrame.coalesce().
E. A shuffle is a process that allocates partitions to executors.
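To illustrate, a sketch assuming transactionsDf exists; a shuffle redistributes rows across partitions, whereas coalesce merges partitions without a full shuffle when reducing the partition count:

transactionsDf.repartition(10, "storeId")  # triggers a shuffle: rows move across partitions
transactionsDf.coalesce(2)                 # narrow: combines partitions without a full shuffle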
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')
A. 1. withColumn 2. 'associateId' 3. 5 4. remove 5. 'productId'
B. 1. withNewColumn 2. associateId 3. lit(5) 4. drop 5. productId
C. 1. withColumn 2. 'associateId' 3. lit(5) 4. drop 5. 'productId'
D. 1. withColumnRenamed 2. 'associateId' 3. 5 4. drop 5. 'productId'
E. 1. withColumn 2. col(associateId) 3. lit(5) 4. drop 5. col(productId)
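For reference, the filled-in block from option C as a sketch; lit(5) wraps the constant in the Column expression that withColumn requires, and drop accepts multiple column names:

from pyspark.sql.functions import lit

resultDf = transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')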
The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Example of DataFrame itemsDf:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
Code block:
itemsDf.__1__(__2__(__3__)__4__)
A. 1. select 2. count 3. col("itemNameElements") 4. >3
B. 1. filter 2. count 3. itemNameElements 4. >=3
C. 1. select 2. count 3. "itemNameElements" 4. >3
D. 1. filter 2. size 3. "itemNameElements" 4. >=3
E. 1. select 2. size 3. "itemNameElements" 4. >3
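For reference, the filled-in block from option D as a sketch; size() returns the number of elements in an array column, whereas count() is an aggregate function and would not work row by row here:

from pyspark.sql.functions import size

itemsDf.filter(size("itemNameElements") >= 3)  # keeps rows whose array holds at least 3 items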
In which order should the code blocks shown below be run to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId, where productId is either 2 or 3, and the returned DataFrame is sorted in ascending order by column storeId, leaving out any nulls in that column?
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])
A. 4, 5, 2, 3, 1
B. 4, 2, 1
C. 4, 1, 5, 2, 3
D. 4, 2, 5, 1, 3
E. 4, 3, 2, 5, 1
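Assembled in the order 4, 2, 5, 1, 3 (option D), the chain reads as in the sketch below; pivot must follow groupBy, the aggregation closes the grouping, and the sort comes last:

(transactionsDf.filter(transactionsDf.storeId.isNotNull())  # 4: drop rows with null storeId
    .groupBy("storeId")                                     # 2: group per store
    .pivot("productId", [2, 3])                             # 5: one column per productId value
    .mean("predError")                                      # 1: aggregate the prediction error
    .orderBy("storeId"))                                    # 3: ascending sort by storeId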
Which of the following describes slots?
A. Slots are dynamically created and destroyed in accordance with an executor's workload.
B. To optimize I/O performance, Spark stores data on disk in multiple slots.
C. A Java Virtual Machine (JVM) working as an executor can be considered as a pool of slots for task execution.
D. A slot is always limited to a single core. Slots are the communication interface for executors and are used for receiving commands and sending results to the driver.
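As context, the number of task slots an executor offers follows from its core allocation; a sketch with an illustrative setting (the value 4 is arbitrary and only meaningful on a cluster):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.cores", "4")  # illustrative: roughly 4 task slots per executor
         .getOrCreate())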
Which of the following describes Spark's standalone deployment mode?
A. Standalone mode uses a single JVM to run Spark driver and executor processes.
B. Standalone mode means that the cluster does not contain the driver.
C. Standalone mode is how Spark runs on YARN and Mesos clusters.
D. Standalone mode uses only a single executor per worker per application.
E. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
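As context, connecting to a standalone cluster from PySpark comes down to the master URL; a sketch where spark://master-host:7077 is a hypothetical address:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")  # hypothetical standalone master URL
         .getOrCreate())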