A. transactionsDf.withColumnRemoved("predError", "productId")
B. transactionsDf.drop(["predError", "productId", "associateId"])
C. transactionsDf.drop("predError", "productId", "associateId")
D. transactionsDf.dropColumns("predError", "productId", "associateId")
E. transactionsDf.drop(col("predError", "productId"))
Correct Answer: C
The key here is to understand that columns passed to DataFrame.drop() are ignored if they do not exist in the DataFrame. So, passing the column name associateId to transactionsDf.drop() simply has no effect. Passing a list to transactionsDf.drop(), however, is not valid: the documentation (link below) shows the call structure as DataFrame.drop(*cols), meaning that every argument passed to DataFrame.drop() is read as a column. Since a list of columns, for example ["predError", "productId", "associateId"], is not a column, Spark will run into an error. Methods withColumnRemoved() and dropColumns() do not exist in the Spark API.
More info: pyspark.sql.DataFrame.drop -- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, 50 (Databricks import instructions)
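The behavior can be checked with a short, hypothetical example (the column names follow the question; the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 3, 4), (2, 6, 7)], ["transactionId", "predError", "productId"])

# Non-existent column names such as associateId are silently ignored:
transactionsDf.drop("predError", "productId", "associateId").show()

# Passing a list instead of individual column names raises a TypeError:
# transactionsDf.drop(["predError", "productId", "associateId"])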
Question 82:
Which of the following describes characteristics of the Spark driver?
A. The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
B. If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
C. The Spark driver processes partitions in an optimized, distributed fashion.
D. In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
E. The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
Correct Answer: E
The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
Correct. The driver converts operations into DAG computations and schedules their execution as tasks on the executors running on the worker nodes.
The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
No, the Spark driver transforms operations into DAG computations itself.
If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
No. There is always a single driver per application, but one or more executors.
The Spark driver processes partitions in an optimized, distributed fashion.
No, this is what executors do.
In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
Wrong. In a non-interactive Spark application, you need to create the SparkSession object yourself. In an interactive Spark shell, the Spark driver instantiates the object for you.
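For illustration, a minimal non-interactive PySpark application creates its own SparkSession (a sketch; the app name is made up):

from pyspark.sql import SparkSession

# In a non-interactive application, the SparkSession must be created explicitly;
# in the interactive pyspark shell, the `spark` object is instantiated for you.
spark = SparkSession.builder.appName("myApp").getOrCreate()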
Question 83:
The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which
dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.
A. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().
B. Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column transactionDate should be wrapped in a col() operator.
C. Column transactionDate should be wrapped in a col() operator.
D. The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column transactionDate with the new column transactionTimestamp.
E. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment.
Correct Answer: E
This requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.

You can clearly see that column transactionDate should be dropped only after transactionTimestamp has been written, because to generate column transactionTimestamp, Spark needs to read the values from column transactionDate.

Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. So, to convert those correctly, you would have to pass the format yyyy-MM-dd HH:mm. In other words: the string indicating the date format should be adjusted.

While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to use here. Also, there is no DataFrame.withColumnReplaced() operator; a similar operator that does exist is DataFrame.withColumnRenamed(). Whether you use col() or not is irrelevant with unix_timestamp() -- the command is fine with both.

Finally, you cannot assign a column like transactionsDf["columnName"] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported in Spark. So, you need to use Spark's DataFrame.withColumn() syntax instead.

More info: pyspark.sql.functions.unix_timestamp -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, 28 (Databricks import instructions)
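A corrected code block could look as follows (a sketch, assuming transactionsDf contains the string column transactionDate described above):

from pyspark.sql.functions import unix_timestamp

# Write transactionTimestamp first, then drop transactionDate.
transactionsDf = (
    transactionsDf
    .withColumn("transactionTimestamp", unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm"))
    .drop("transactionDate")
)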
Question 84:
The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
Question 85:
The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join. Find the error.
Code block:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)
A. Spark will only broadcast DataFrames that are much smaller than the default value.
B. The correct option to write configurations is through spark.config and not spark.conf.
C. Spark will only apply the limit to threshold joins and not to other joins.
D. The passed limit has the wrong variable type.
E. The command is evaluated lazily and needs to be followed by an action.
Correct Answer: A
This is hard. Let's assess the different answers one by one.

Spark will only broadcast DataFrames that are much smaller than the default value.
This is correct. The default value is 10 MB (10485760 bytes). Since the configuration spark.sql.autoBroadcastJoinThreshold expects a number in bytes (and not megabytes), the code block sets the limit to merely 20 bytes instead of the requested 20 * 1024 * 1024 (= 20971520) bytes.

The command is evaluated lazily and needs to be followed by an action.
No, this command is evaluated right away!

Spark will only apply the limit to threshold joins and not to other joins.
There are no "threshold joins", so this option does not make any sense.

The correct option to write configurations is through spark.config and not spark.conf.
No, it is indeed spark.conf!

The passed limit has the wrong variable type.
The configuration expects the number of bytes, a number, as an input. So, the 20 provided in the code block is fine.
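A corrected code block would pass the threshold in bytes (a sketch; spark is assumed to be an existing SparkSession):

# 20 MB expressed in bytes, as expected by spark.sql.autoBroadcastJoinThreshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024)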
Question 86:
Which of the following statements about Spark's DataFrames is incorrect?
A. Spark's DataFrames are immutable.
B. Spark's DataFrames are equal to Python's DataFrames.
C. Data in DataFrames is organized into named columns.
D. RDDs are at the core of DataFrames.
E. The data in DataFrames may be split into multiple chunks.
Correct Answer: B
Spark's DataFrames are equal to Python's DataFrames. No, they are not equal; they are only similar. A major difference between Spark and Python is that Spark's DataFrames are distributed across a cluster, whereas Python's (Pandas) DataFrames are not.
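A quick, hypothetical illustration of the difference (assuming an existing SparkSession named spark and the pandas package):

import pandas as pd

pandasDf = pd.DataFrame({"itemId": [1, 2, 3]})  # lives in the memory of a single Python process
sparkDf = spark.createDataFrame(pandasDf)       # distributed across the cluster's executors
print(sparkDf.rdd.getNumPartitions())           # RDDs are at the core; the data may be split into multiple chunks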
Question 87:
The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame
transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
This is pretty complex and, in its complexity, is probably above what you would encounter in the exam. However, reading the question carefully, you can use your logic skills to weed out the wrong answers here.

First, you should examine the join statement which is common to all answers. The first argument of the join() operator (documentation linked below) is the DataFrame to be joined with. Where join is in gap 3, the first argument in gap 4 should therefore be another DataFrame. For none of the answers where join is in the third gap is this the case, so you can immediately discard two answers. For all other answers, join is in gap 1, followed by (itemsDf, according to the code block. Given how the join() operator is called, there are now three remaining candidates.

Looking further at the join() statement, the second argument (on=) expects "a string for the join column name, a list of column names, a join expression (Column), or a list of Columns", according to the documentation. As one answer option includes a list of join expressions (transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId), which is unsupported according to the documentation, we can discard that answer, leaving us with two remaining candidates.

Both candidates have valid syntax, but only one of them fulfills the condition in the question: "only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf". So, this one remaining answer option has to be the correct one! As you can see, although sometimes overwhelming at first, even more complex questions can be figured out by rigorously applying the knowledge you can gain from the documentation during the exam.
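Although the original code block and answer options are not reproduced here, a sketch of a join that fulfills both conditions could look like this (column names taken from the question):

transactionsDf.join(
    itemsDf,
    (transactionsDf.productId == itemsDf.itemId) & (transactionsDf.storeId != itemsDf.itemId)
).select("transactionId", "supplier")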
More info: pyspark.sql.DataFrame.join -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, 47 (Databricks import instructions)
Question 88:
Which of the following describes the conversion of a computational query into an execution plan in Spark?
A. Spark uses the catalog to resolve the optimized logical plan.
B. The catalog assigns specific resources to the optimized memory plan.
C. The executed physical plan depends on a cost optimization from a previous stage.
D. Depending on whether DataFrame API or SQL API are used, the physical plan may differ.
E. The catalog assigns specific resources to the physical plan.
Correct Answer: C
The executed physical plan depends on a cost optimization from a previous stage.
Correct! Spark considers multiple physical plans on which it performs a cost analysis and selects the final physical plan in accordance with the lowest-cost outcome of that analysis. That final physical plan is then executed by Spark.

Spark uses the catalog to resolve the optimized logical plan.
No. Spark uses the catalog to resolve the unresolved logical plan, but not the optimized logical plan. Once the unresolved logical plan is resolved, it is then optimized using the Catalyst Optimizer. The optimized logical plan is the input for physical planning.

The catalog assigns specific resources to the physical plan.
No. The catalog stores metadata, such as a list of names of columns, data types, functions, and databases. Spark consults the catalog for resolving the references in a logical plan at the beginning of the conversion of the query into an execution plan. The result is then an optimized logical plan.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.
Wrong, the physical plan is independent of which API was used. And this is one of the great strengths of Spark!

The catalog assigns specific resources to the optimized memory plan.
There is no specific "memory plan" on the journey of a Spark computation.

More info: Spark's Logical and Physical plans ... When, Why, How and Beyond. | by Laurent Leturgez | datalex | Medium
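You can inspect these plans yourself with DataFrame.explain(); for example (a sketch reusing the transactionsDf DataFrame from other questions):

# explain(True) prints the parsed and analyzed logical plans, the optimized logical plan,
# and the selected physical plan for the query.
transactionsDf.groupBy("storeId").count().explain(True)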
Question 89:
Which of the following statements about storage levels is incorrect?
A. The cache operator on DataFrames is evaluated like a transformation.
B. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
C. Caching can be undone using the DataFrame.unpersist() operator.
D. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
E. DISK_ONLY will not use the worker node's memory.
Correct Answer: D
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. Spark prioritizes storage in memory, and will only store data on disk that does not fit into memory.

DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, but not in memory.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. In fact, Spark does not have a provision to cache DataFrames in the driver (which sits on the edge node in client mode). Spark caches DataFrames in the executors' memory.

Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, as achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using the DataFrame.unpersist() operator. This operator will remove all of its parts from the executors' memory and disk.

The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation: through lazy evaluation. This means that after calling DataFrame.cache() the command will not have any effect until you call a subsequent action, like DataFrame.cache().count().

More info: pyspark.sql.DataFrame.unpersist -- PySpark 3.1.2 documentation
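A short sketch of this caching lifecycle (df is a hypothetical DataFrame):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # lazily marks df for caching, like a transformation
df.count()                                # a subsequent action materializes the cache
df.unpersist()                            # removes the cached data from memory and disk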
Question 90:
Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?
A. transactionsDf.sort(asc_nulls_last("predError"))
B. transactionsDf.orderBy("predError").desc_nulls_last()
C. transactionsDf.sort("predError", ascending=False)
D. transactionsDf.desc_nulls_last("predError")
E. transactionsDf.orderBy("predError").asc_nulls_last()
Correct Answer: C
transactionsDf.sort("predError", ascending=False)
Correct! When using DataFrame.sort() and setting ascending=False, the DataFrame will be sorted by the specified column in descending order, putting all missing values last. An alternative, although not listed as an answer here, would be transactionsDf.sort(desc_nulls_last("predError")).

transactionsDf.sort(asc_nulls_last("predError"))
Incorrect. While this is valid syntax, the DataFrame will be sorted on column predError in ascending order and not in descending order, putting missing values last.

transactionsDf.desc_nulls_last("predError")
Wrong, this is invalid syntax. There is no method DataFrame.desc_nulls_last() in the Spark API. There is a Spark function desc_nulls_last() however (link see below).

transactionsDf.orderBy("predError").desc_nulls_last()
No. While transactionsDf.orderBy("predError") is correct syntax (although it sorts the DataFrame by column predError in ascending order) and returns a DataFrame, there is no method DataFrame.desc_nulls_last() in the Spark API. There is a Spark function desc_nulls_last() however (link see below).

transactionsDf.orderBy("predError").asc_nulls_last()
Incorrect. There is no method DataFrame.asc_nulls_last() in the Spark API (see above).

More info: pyspark.sql.functions.desc_nulls_last -- PySpark 3.1.2 documentation and pyspark.sql.DataFrame.sort -- PySpark 3.1.2 documentation (https://bit.ly/3g1JtbI, https://bit.ly/2R90NCS)
Static notebook | Dynamic notebook: See test 1, 32 (Databricks import instructions) (https://flrs.github.io/spark_practice_tests_code/#1/32.html, https://bit.ly/sparkpracticeexams_import_instructions)
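Both of the following sketches sort descending with missing values last (desc_nulls_last() comes from pyspark.sql.functions):

from pyspark.sql.functions import desc_nulls_last

transactionsDf.sort("predError", ascending=False)  # descending sort puts nulls last by default
transactionsDf.sort(desc_nulls_last("predError"))  # explicit about the null ordering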