The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (find available methods in the RDD documentation linked below). The output of print(transactionsDf.schema) is this: StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructField(f,IntegerType,true))). It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
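As a quick illustration of the difference, a minimal sketch, assuming transactionsDf exists with the schema shown above:
transactionsDf.printSchema()        # prints the nicely formatted schema tree
print(transactionsDf.schema)        # prints the raw StructType representation quoted above
# transactionsDf.rdd.printSchema()  # would fail: RDD objects have no printSchema() method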
Question 12:
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
A. 1. filter
   2. col("supplier").isin("Sports")
   3. "itemName"
   4. explode(col("attributes"))
B. 1. where
   2. col("supplier").contains("Sports")
   3. "itemName"
   4. "attributes"
C. 1. where
   2. col(supplier).contains("Sports")
   3. explode(attributes)
   4. itemName
D. 1. where
   2. "Sports".isin(col("Supplier"))
   3. "itemName"
   4. array_explode("attributes")
E. 1. filter
   2. col("supplier").contains("Sports")
   3. "itemName"
   4. explode("attributes")
Correct Answer: E
Output of correct code block:
+----------------------------------+------+
|itemName |col |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy |
|Outdoors Backpack |green |
|Outdoors Backpack |summer|
|Outdoors Backpack |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract the values of an array into separate rows, one per element. Working through the gaps from first to last: both filter and where select matching rows, so gap 1 does not eliminate any option on its own. For gap 2, col("supplier").contains("Sports") correctly matches supplier names that include "Sports", whereas isin only checks for exact membership in a list of values, and "Sports".isin(...) is invalid because isin is a Column method. Gap 3 is simply "itemName". For gap 4, explode("attributes") produces one attribute per row next to the associated itemName; passing "attributes" directly would keep the whole array in a single row, and array_explode does not exist in pyspark.sql.functions. Only answer E combines all four gaps correctly.
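A minimal sketch of the resulting code block, assuming itemsDf has columns itemName, attributes (an array column), and supplier:
from pyspark.sql.functions import col, explode
itemsDf.filter(col("supplier").contains("Sports")).select("itemName", explode("attributes"))
# explode() returns one row per array element; the exploded values end up in a column named col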
Question 13:
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
A. transactionsDf.select("storeId").dropDuplicates().count()
B. transactionsDf.select(count("storeId")).dropDuplicates()
C. transactionsDf.select(distinct("storeId")).count()
D. transactionsDf.dropDuplicates().agg(count("storeId"))
E. transactionsDf.distinct().select("storeId").count()
Correct Answer: A
transactionsDf.select("storeId").dropDuplicates().count() Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column. transactionsDf.select(count("storeId")).dropDuplicates() No. transactionsDf.select(count("storeId")) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count("storeId")) Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so taking only column storeId into consideration, but eliminates full row duplicates instead. transactionsDf.distinct().select("storeId").count() Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count not represent the number of unique values in that column. transactionsDf.select(distinct("storeId")).count() False. There is no distinct method in pyspark.sql.functions.
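For reference, a minimal sketch of the correct approach together with two equivalent formulations (the latter two are not among the answer options), assuming transactionsDf exists:
from pyspark.sql.functions import countDistinct
transactionsDf.select("storeId").dropDuplicates().count()  # correct answer: de-duplicate, then count rows
transactionsDf.select("storeId").distinct().count()        # distinct() behaves like dropDuplicates() here
transactionsDf.select(countDistinct("storeId")).show()     # countDistinct does exist in pyspark.sql.functions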
Question 14:
Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?
A. transactionsDf.count("productId").distinct()
B. transactionsDf.groupBy("productId").agg(col("value").count())
C. transactionsDf.count("productId")
D. transactionsDf.groupBy("productId").count()
E. transactionsDf.groupBy("productId").select(count("value"))
Correct Answer: D
transactionsDf.groupBy("productId").count()
Correct. This code block first groups DataFrame transactionsDf by column productId and then counts the
rows in each group.
transactionsDf.groupBy("productId").select(count("value")) Incorrect. You cannot call select on a
GroupedData object (the output of a groupBy) statement.
transactionsDf.count("productId")
No. DataFrame.count() does not take any arguments.
transactionsDf.count("productId").distinct()
Wrong. Since DataFrame.count() does not take any arguments, this option cannot be right.
transactionsDf.groupBy("productId").agg(col("value").count()) False. A Column object, as returned by col
("value"), does not have a count() method. You can see all available methods for Column object linked in
the Spark documentation below. More info: pyspark.sql.DataFrame.count -- PySpark 3.1.2 documentation,
pyspark.sql.Column -- PySpark 3.1.2 documentation
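A minimal sketch of the correct code block, assuming transactionsDf exists:
transactionsDf.groupBy("productId").count().show()
# groupBy returns a GroupedData object; count() on it yields a DataFrame with columns productId and count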
Question 15:
Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?
A. transactionsDf.sort("storeId", asc("productId"))
B. transactionsDf.sort(col(storeId)).desc(col(productId))
C. transactionsDf.order_by(col(storeId), desc(col(productId)))
D. transactionsDf.sort("storeId", desc("productId"))
E. transactionsDf.sort("storeId").sort(desc("productId"))
Correct Answer: D
In this question it is important to realize that you are asked to sort transactionsDf by two columns. This means that the sorting of the second column depends on the sorting of the first column. So, any option that sorts the entire DataFrame twice by chaining sort statements will not work; the two columns need to be passed to the same call to sort(). Also, order_by is not a valid DataFrame API method (the method is called orderBy). More info: pyspark.sql.DataFrame.sort -- PySpark 3.1.2 documentation
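A minimal sketch of the correct call, plus an equivalent orderBy formulation that is not among the answer options, assuming transactionsDf exists:
from pyspark.sql.functions import col, desc
transactionsDf.sort("storeId", desc("productId"))                      # answer D: ascending, then descending
transactionsDf.orderBy(col("storeId").asc(), col("productId").desc())  # equivalent: orderBy is an alias of sort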
Question 16:
The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__
A. 1. select
   2. "storeId"
   3. print_schema()
B. 1. limit
   2. 1
   3. columns
C. 1. select
   2. "storeId"
   3. printSchema()
D. 1. limit
   2. "storeId"
   3. printSchema()
E. 1. select
   2. storeId
   3. dtypes
Correct Answer: C
Correct code block: transactionsDf.select("storeId").printSchema()
The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers. A first pattern that you may recognize by now is that column names should be expressed in quotes; for this reason, the answer that includes a bare storeId should be eliminated. By now, you may have understood that DataFrame.limit() is useful for returning a specified number of rows; it has nothing to do with specific columns, so the answers that resolve to limit("storeId") and limit(1).columns can be eliminated (columns would only list the column names, without any data type information). Given that we are interested in information about the data type, you should pick the option that selects column storeId and prints its schema, transactionsDf.select("storeId").printSchema(); note that DataFrame has no print_schema() method.
Question 17:
Which of the following describes Spark actions?
A. Writing data to disk is the primary purpose of actions.
B. Actions are Spark's way of exchanging data between executors.
C. The driver receives data upon request by actions.
D. Stage boundaries are commonly established by actions.
E. Actions are Spark's way of modifying RDDs.
Correct Answer: C
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable; they cannot be modified. Secondly, Spark generates new RDDs via transformations, not via actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example one caused by a wide transformation.
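A minimal sketch of the transformation/action distinction, assuming transactionsDf exists (the filter value is arbitrary):
from pyspark.sql.functions import col
filtered = transactionsDf.filter(col("storeId") == 3)  # transformation: only extends the execution plan
rows = filtered.collect()                              # action: runs the job and returns rows to the driver
n = filtered.count()                                   # action: runs the job and returns a number to the driver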
Question 18:
Which of the elements in the labeled panels represent the operation performed for broadcast variables?
A. 2, 5
B. 3
C. 2, 3
D. 1, 2
E. 1, 3, 4
Correct Answer: C
2, 3
Correct! Both panels 2 and 3 represent the operation performed for broadcast variables. While a broadcast operation may look like panel 3, with the driver being the bottleneck, it most probably looks like panel 2. This is because the torrent protocol sits behind Spark's broadcast implementation: in the torrent protocol, each executor tries to fetch missing broadcast variables from the driver or from other nodes, preventing the driver from becoming the bottleneck.
1, 2
Wrong. While panel 2 may represent broadcasting, panel 1 shows bi-directional communication, which does not occur in broadcast operations.
3
No. While broadcasting may materialize as shown in panel 3, its use of the torrent protocol also enables communication as shown in panel 2 (see the first explanation).
1, 3, 4
No. While panel 3 may show broadcasting, panel 1 shows bi-directional communication, which is not a characteristic of broadcasting. Panel 4 shows uni-directional communication, but in the wrong direction; panel 4 resembles an accumulator variable more than a broadcast variable.
2, 5
Incorrect. While panel 2 shows broadcasting, panel 5 includes bi-directional communication, which is not a characteristic of broadcasting.
More info: Broadcast Join with Spark (henning.kropponline.de)
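A minimal sketch of using a broadcast variable, with a hypothetical lookup dictionary; the value is distributed to the executors once and read locally via .value:
store_names = {1: "Store North", 2: "Store South"}          # hypothetical small lookup table
bc_store_names = spark.sparkContext.broadcast(store_names)  # distributed to executors (torrent-style)
# tasks running on executors read bc_store_names.value locally instead of repeatedly pulling it from the driver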
Question 19:
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)
A. The code block is missing a reference to the DataFrameWriter.
B. save() is evaluated lazily and needs to be followed by an action.
C. The mode option should be omitted so that the command uses the default mode.
D. The code block is missing a bucketBy command that takes care of partitions.
E. Given that the DataFrame should be saved as a parquet file, path is being passed to the wrong method.
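For reference, writing a DataFrame to parquet goes through its write attribute, which returns a DataFrameWriter; a minimal sketch, assuming transactionsDf and path exist:
transactionsDf.write.format("parquet").mode("append").save(path)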
Question 20:
Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?
A. spark.udf.register("LIMIT_FCN", to_limit)
   spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
B. spark.udf.register("LIMIT_FCN", to_limit)
   spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
C. spark.udf.register("LIMIT_FCN", to_limit)
   spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
D. spark.udf.register(to_limit, "LIMIT_FCN")
   spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
E. spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
Correct Answer: A
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf") Correct! First,
you have to register to_limit as UDF to use it in a sql statement. Then, you can use it under the
LIMIT_FCN name, correctly naming the resulting column result.
spark.udf.register(to_limit, "LIMIT_FCN")
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf") No. In this
answer, the arguments to spark.udf.register are flipped.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf") Wrong, this answer
does not use the registered LIMIT_FCN in the sql statement, but tries to access the to_limit method
directly. This will fail, since Spark cannot access it. spark.sql("SELECT transactionId, udf(to_limit
(predError)) AS result FROM transactionsDf") Incorrect, there is no udf method in Spark's SQL.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result") False. In this
answer, the column that results from applying the UDF is not correctly renamed to result.
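A minimal sketch of the correct answer, with a hypothetical to_limit implementation and assuming transactionsDf is registered as a table or temporary view:
def to_limit(x):                    # hypothetical implementation; the question only names the function
    return min(x, 5) if x is not None else None

spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf").show()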