The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate. Find the error.
Code block: transactionsDf.coalesce(14, ("storeId", "transactionDate"))
A. The parentheses around the column names need to be removed and .select() needs to be appended to the code block.
B. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.
C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D. Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.
E. Operator coalesce needs to be replaced by repartition.
Correct Answer: B
Correct code block:
transactionsDf.repartition(14, "storeId", "transactionDate").count()
Since we do not know how many partitions DataFrame transactionsDf has, we cannot safely use coalesce, since it would not make any change if the current number of partitions is smaller than 14. So, we need to use repartition.
In the Spark documentation, the call structure for repartition is shown as DataFrame.repartition(numPartitions, *cols). The * operator means that any argument after numPartitions will be interpreted as a column. Therefore, the parentheses around the column names need to be removed.
Finally, the question specifies that after the execution the DataFrame should have been divided into 14 parts. So, indirectly this is asking us to append an action to the code block. Since .select() is a transformation, the only possible choice here is .count().
More info: pyspark.sql.DataFrame.repartition -- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, 40 (Databricks import instructions)
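For illustration, here is a minimal sketch of the correct pattern; the sample data is made up, only the column names storeId and transactionDate come from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf with the two partitioning columns.
transactionsDf = spark.createDataFrame(
    [(1, "2020-04-26"), (2, "2020-04-13"), (3, "2020-04-02")],
    ["storeId", "transactionDate"],
)

# repartition takes the target number of partitions followed by the columns to partition by.
repartitioned = transactionsDf.repartition(14, "storeId", "transactionDate")
print(repartitioned.rdd.getNumPartitions())  # 14

# .count() is an action, so it actually triggers the repartitioning.
print(repartitioned.count())
```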
Question 62:
Which of the following describes Spark's Adaptive Query Execution?
A. Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
B. Adaptive Query Execution is enabled in Spark by default.
C. Adaptive Query Execution reoptimizes queries at execution points.
D. Adaptive Query Execution features are dynamically switching join strategies and dynamically optimizing skew joins.
E. Adaptive Query Execution applies to all kinds of queries.
Correct Answer: D
Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of Dynamic Partition Pruning, not of Adaptive Query Execution.
Adaptive Query Execution reoptimizes queries at execution points.
No, Adaptive Query Execution reoptimizes queries at materialization points.
Adaptive Query Execution is enabled in Spark by default.
No, Adaptive Query Execution is disabled in Spark by default and needs to be enabled through the spark.sql.adaptive.enabled property.
Adaptive Query Execution applies to all kinds of queries.
No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or one subquery.
More info: How to Speed up SQL Queries with Adaptive Query Execution, Learning Spark, 2nd Edition, Chapter 12 (https://bit.ly/3tOh8M1)
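As a minimal sketch (assuming Spark 3.1, where the property still defaults to false), AQE can be switched on through the configuration mentioned above; the two extra settings shown are related AQE sub-features, not something required by the question.

```python
from pyspark.sql import SparkSession

# Build a session with Adaptive Query Execution explicitly enabled.
spark = (SparkSession.builder
         .appName("aqe-demo")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")              # skew-join optimization
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")    # coalesce shuffle partitions
         .getOrCreate())

print(spark.conf.get("spark.sql.adaptive.enabled"))  # -> true
```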
Question 63:
Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?
A. itemsDf.withColumn("itemId", convert("itemId", "string"))
B. itemsDf.withColumn("itemId", col("itemId").cast("string"))
C. itemsDf.select(cast("itemId", "string"))
D. itemsDf.withColumn("itemId", col("itemId").convert("string"))
E. spark.cast(itemsDf, "itemId", "string")
Correct Answer: B
itemsDf.withColumn("itemId", col("itemId").cast("string")) Correct. You can convert the data type of a column using the cast method of the Column class. Also note that you will have to use the withColumn method on itemsDf for replacing the existing itemId column with the new version that contains strings. itemsDf.withColumn("itemId", col("itemId").convert ("string")) Incorrect. The Column object that col("itemId") returns does not have a convert method. itemsDf.withColumn("itemId", convert("itemId", "string")) Wrong. Spark's spark.sql.functions module does not have a convert method. The is trying to mislead you by using the word "converted". Type conversion is also called "type casting". This may help you remember to look for a cast method instead of a convert method (see correct answer). itemsDf.select(astype("itemId", "string")) False. While astype is a method of Column (and an alias of Column.cast), it is not a method of pyspark.sql.functions (what the code block implies). In addition, the
Question 64:
Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?
A. transactionsDf.sample(True, 0.5)
B. transactionsDf.take(1000).distinct()
C. transactionsDf.sample(False, 0.5)
D. transactionsDf.take(1000)
E. transactionsDf.sample(True, 0.5, force=True)
Correct Answer: A
To solve this question, you need to know that DataFrame.sample() is not guaranteed to return the exact fraction of rows specified as an argument. Furthermore, since duplicates may be returned, you should understand that the operator's withReplacement argument needs to be set to True. A force= argument for the operator does not exist. While take() returns an exact number of rows, it simply takes the first specified number of rows (1000 in this question) from the DataFrame. Since the DataFrame does not include duplicate rows, none of the rows returned by take() could ever be duplicates, so the correct answer cannot involve take().
More info: pyspark.sql.DataFrame.sample -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, 41 (Databricks import instructions)
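A minimal sketch of the sampling behavior; the transactionsDf below is a made-up stand-in for the 2000-row, duplicate-free DataFrame from the question, and the column name transactionId is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in: 2000 unique rows.
transactionsDf = spark.range(2000).withColumnRenamed("id", "transactionId")

# withReplacement=True allows duplicates; fraction=0.5 returns roughly (not exactly) 1000 rows.
sampled = transactionsDf.sample(True, 0.5)
print(sampled.count())             # close to 1000, varies per run
print(sampled.distinct().count())  # usually smaller, since sampling with replacement repeats rows
```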
Question 65:
Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?
More info: pyspark.sql.functions.size -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, 29 (Databricks import instructions)
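One way to count words per row, as the referenced size() documentation suggests, is to split the string column on whitespace and take the size of the resulting array. A minimal sketch with made-up supplier values:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size, split

spark = SparkSession.builder.getOrCreate()

# Hypothetical itemsDf with a string column supplier.
itemsDf = spark.createDataFrame([("Yeti Research Inc.",), ("Sports Company",)], ["supplier"])

# Split supplier on spaces and count the resulting array elements.
wordCounts = itemsDf.select(size(split(col("supplier"), " ")).alias("wordCount"))
wordCounts.show()
```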
Question 66:
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?
A. array_remove(transactionsDf, "*")
B. transactionsDf.unpersist()
C. del transactionsDf
D. transactionsDf.clearCache()
E. transactionsDf.persist()
Correct Answer: B
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for: it removes all cached
parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so
immediately. The reason is that this command just notifies the Python garbage collector that
DataFrame transactionsDf may now be deleted from memory. However, the garbage collector does not do so
immediately and, if you wanted it to run immediately, it would need to be specifically triggered.
Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing all elements that
equal a given value from an array column. Also, its first argument would be a column, not a
DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: It caches (writes) DataFrame
transactionsDf to memory and disk. Note that even though you do not pass in a specific storage
level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist -- PySpark 3.1.2 documentation, python - How to delete an
RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Static notebook | Dynamic notebook: See test 3, 40 (Databricks import instructions)
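A minimal sketch of the cache/uncache cycle; the transactionsDf below is a made-up stand-in, and blocking=True is only used to make the removal synchronous for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical transactionsDf, cached to memory and disk first.
transactionsDf = spark.range(1000)
transactionsDf.persist()   # default storage level (MEMORY_AND_DISK in the Spark 3.1 docs referenced here)
transactionsDf.count()     # an action materializes the cache

# Remove the cached data again; blocking=True waits until all blocks are actually deleted.
transactionsDf.unpersist(blocking=True)
```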
Question 67:
The code block shown below should read all files with the file ending .png in directory path into Spark. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Correct code block: spark.read.format("binaryFile").option("recursiveFileLookup", "*.png").load(path) Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator ?the open operator shown in one of the answers does not exist.
Question 68:
Which of the following code blocks uses a schema fileSchema to read a parquet file at location filePath into a DataFrame?
A. spark.read.schema(fileSchema).format("parquet").load(filePath)
B. spark.read.schema("fileSchema").format("parquet").load(filePath)
C. spark.read().schema(fileSchema).parquet(filePath)
D. spark.read().schema(fileSchema).format(parquet).load(filePath)
E. spark.read.schema(fileSchema).open(filePath)
Correct Answer: A
Pay attention here to which arguments are quoted. fileSchema is a variable and thus should not be in quotes, while parquet is not a variable and therefore should be in quotes. SparkSession.read (here referenced as spark.read) is a property, not a method: it returns a DataFrameReader on which all subsequent calls are made, so you should not use parentheses after read. Finally, there is no open method in PySpark; the method name is load.
Static notebook | Dynamic notebook: See test 1, 44 (Databricks import instructions)
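A minimal sketch of the correct answer; the file location and the fields inside fileSchema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

filePath = "/data/items.parquet"  # hypothetical location

# fileSchema is a StructType variable, so it is passed without quotes.
fileSchema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("itemName", StringType(), True),
])

itemsDf = spark.read.schema(fileSchema).format("parquet").load(filePath)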
Question 69:
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?
The schema passed into schema should be of type StructType or a string, so all entries in which a list is
passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here.
NumberType() is not a valid data type and StringType() would fail, since the parquet file is
stored in the "most appropriate format for this kind of data", meaning that it is most likely an IntegerType,
and Spark does not convert data types if a schema is provided. Also note that StructType accepts only a
single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, all of the options listed are valid here,
since Spark assumes parquet as a default when no file format is specifically passed.
More info: pyspark.sql.DataFrameReader.schema -- PySpark 3.1.2 documentation and StructType --
PySpark 3.1.2 documentation
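A minimal sketch of the pattern described above; the file location and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

filePath = "/data/transactions.parquet"  # hypothetical location

# StructType takes a single argument: a list of StructFields.
schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("storeId", IntegerType(), True),
])

# .parquet(filePath) is equivalent to .format("parquet").load(filePath).
df = spark.read.schema(schema).parquet(filePath)

# Alternatively, schema() also accepts a DDL-formatted string:
# spark.read.schema("transactionId INT, storeId INT").parquet(filePath)
```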
Question 70:
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate
row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for rows in DataFrame itemsDf in which the column attributes
contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
A. 1. filter 2. array_contains("cozy") 3. select 4. "itemId" 5. explode 6. "attributes"
B. 1. where 2. "array_contains(attributes, 'cozy')" 3. select 4. itemId 5. explode 6. attributes
C. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. map 6. "attributes"
D. 1. filter 2. "array_contains(attributes, cozy)" 3. select 4. "itemId" 5. explode 6. "attributes"
E. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. explode 6. "attributes"
Correct Answer: E
The correct code block is:
itemsDf.filter("array_contains(attributes, 'cozy')").select("itemId", explode("attributes")) The key here is
understanding how to use array_contains(). You can either use it as an expression in a string, or you can
import it from pyspark.sql.functions. In that case, the following would also
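A runnable sketch of this pattern with made-up sample data for itemsDf:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical itemsDf with an itemId column and an array column attributes.
itemsDf = spark.createDataFrame(
    [(1, ["blue", "winter", "cozy"]), (2, ["red", "summer", "fresh"])],
    ["itemId", "attributes"],
)

# Keep only rows whose attributes array contains 'cozy', then explode the array
# so that each element becomes its own row next to the associated itemId.
result = itemsDf.filter(array_contains("attributes", "cozy")).select("itemId", explode("attributes"))
result.show()
# The resulting columns are itemId and col (explode's default output column name).
```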