Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
A. spark.createDataFrame((throughputRates), FloatType)
B. spark.createDataFrame(throughputRates, FloatType)
C. spark.DataFrame(throughputRates, FloatType)
D. spark.createDataFrame(throughputRates)
E. spark.createDataFrame(throughputRates, FloatType())
Correct Answer: E
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the right method to use here, and the type FloatType(), which is passed in as the command's schema argument, is correctly instantiated with parentheses.
Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, the spark variable holds a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While wrapping throughputRates in parentheses does not change how this command executes, failing to instantiate FloatType with parentheses (unlike in the correct answer) makes the command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Since it does not matter whether throughputRates is wrapped in parentheses, this answer is equivalent to the previous one: FloatType is not instantiated, so the command fails.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), inference only works if you pass in an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). Since you are passing a plain Python list of floats, Spark's schema inference fails.
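As a minimal sketch of the correct answer (assuming a live SparkSession bound to the spark variable, as on Databricks, and a made-up list of rates):

```python
from pyspark.sql.types import FloatType

# Hypothetical sample data standing in for throughputRates.
throughputRates = [0.98, 2.75, 3.21, 1.04]

# Passing an instantiated FloatType() as the schema yields a single-column
# DataFrame of floats (the column is named "value" by default).
df = spark.createDataFrame(throughputRates, FloatType())
df.show()
```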
More info: pyspark.sql.SparkSession.createDataFrame -- PySpark 3.1.2 documentation
Question 52:
Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?
A. transactionsDf.distinct("productId")
B. transactionsDf.dropDuplicates(subset=["productId"])
C. transactionsDf.drop_duplicates(subset="productId")
D. transactionsDf.unique("productId")
E. transactionsDf.dropDuplicates(subset="productId")
Correct Answer: B
Although the question suggests using a method called unique(), that method does not actually exist in PySpark; the PySpark equivalent is distinct(). But distinct() is not the right method to use here either: it could be used to filter out the unique values of a specific column, whereas we want to return entire rows here.
So the trick is to use dropDuplicates with the subset keyword parameter. The examples in the dropDuplicates documentation show that subset should be passed a list, and this is exactly the key to solving this question: the productId column needs to be fed into the subset argument inside a list, even though it is just a single column.
More info: pyspark.sql.DataFrame.dropDuplicates -- PySpark 3.1.1 documentation
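A short sketch of the winning call, using a made-up transactionsDf whose column names are assumptions:

```python
# Hypothetical data; only the productId column matters for this example.
transactionsDf = spark.createDataFrame(
    [(1, 1001, 25.0), (2, 1001, 17.5), (3, 1002, 8.0)],
    ["transactionId", "productId", "value"],
)

# subset takes a list of column names, even for a single column; entire rows
# are returned, with one row kept per distinct productId.
transactionsDf.dropDuplicates(subset=["productId"]).show()
```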
Question 53:
Which of the following describes the role of the cluster manager?
A. The cluster manager schedules tasks on the cluster in client mode.
B. The cluster manager schedules tasks on the cluster in local mode.
C. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
D. The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
E. The cluster manager allocates resources to the DataFrame manager.
Correct Answer: C
The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Correct. In cluster mode, the cluster manager is located on a node other than the client machine. From there it starts and ends executor processes on the cluster nodes as required by the Spark application running on the Spark driver.
The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
Wrong, there is no "remote" execution mode in Spark. The available execution modes are local, client, and cluster.
The cluster manager allocates resources to the DataFrame manager.
Wrong, there is no "DataFrame manager" in Spark.
The cluster manager schedules tasks on the cluster in client mode.
No, in client mode, the Spark driver schedules tasks on the cluster, not the cluster manager.
The cluster manager schedules tasks on the cluster in local mode.
Wrong: in local mode, there is no "cluster". The Spark application runs on a single machine, not on a cluster of machines.
Question 54:
Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?
A. union(todayTransactionsDf, yesterdayTransactionsDf)
B. todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
C. todayTransactionsDf.unionByName(yesterdayTransactionsDf)
D. todayTransactionsDf.concat(yesterdayTransactionsDf)
E. todayTransactionsDf.union(yesterdayTransactionsDf)
Correct Answer: E
todayTransactionsDf.union(yesterdayTransactionsDf)
Correct. The union command appends the rows of yesterdayTransactionsDf to the rows of todayTransactionsDf, ignoring that both DataFrames have different column names. The resulting DataFrame takes its column names from todayTransactionsDf.
todayTransactionsDf.unionByName(yesterdayTransactionsDf)
No. unionByName specifically tries to match columns in the two DataFrames by name and only appends values in columns with identical names across the two DataFrames. In the form presented above, the command is a great fit for combining DataFrames that have exactly the same columns, but in a different order. In this case though, the command fails because the two DataFrames have different columns.
todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
No. The unionByName command is described in the previous explanation. With the allowMissingColumns argument set to True, it is no longer a problem that the two DataFrames have different column names: any column without a match in the other DataFrame is filled with null where a value is missing. In the case at hand, however, the resulting DataFrame would have more than 6 columns, so this command is not the right answer.
union(todayTransactionsDf, yesterdayTransactionsDf)
No, there is no union method in pyspark.sql.functions.
todayTransactionsDf.concat(yesterdayTransactionsDf)
Wrong, the DataFrame class does not have a concat method.
More info: pyspark.sql.DataFrame.union -- PySpark 3.1.2 documentation, pyspark.sql.DataFrame.unionByName -- PySpark 3.1.2 documentation
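To make the difference concrete, here is a sketch with two made-up 2-column DataFrames standing in for the 6-column ones from the question (the column names are assumptions):

```python
todayTransactionsDf = spark.createDataFrame([(1, 10.0)], ["transactionId", "value"])
yesterdayTransactionsDf = spark.createDataFrame([(2, 20.0)], ["txId", "amount"])

# union appends rows by column position, ignoring the differing column names;
# the result keeps todayTransactionsDf's column names and column count.
todayTransactionsDf.union(yesterdayTransactionsDf).show()

# unionByName without allowMissingColumns=True would fail here, because no
# column names match; with allowMissingColumns=True it would widen the result
# beyond the original column count instead.
```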
Question 55:
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
A. transactionsDf.resample(0.15, False, 3142)
B. transactionsDf.sample(0.15, False, 3142)
C. transactionsDf.sample(0.15)
D. transactionsDf.sample(0.85, 8429)
E. transactionsDf.sample(True, 0.15, 8261)
Correct Answer: E
Answering this correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows: DataFrame.sample(withReplacement=None, fraction=None, seed=None). The first argument, withReplacement, specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark, but we have to enable it here, since the question asks for rows to be able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is easiest explained with the example of removing random items from a box. When you remove them "with replacement", it means that after you have taken an item out of the box, you put it back in. So, essentially, if you randomly take 10 items out of a box of 100 items, there is a chance you take the same item twice or more. "Without replacement" means that you do not put the item back into the box after removing it, so every time you remove an item there is one less item in the box and you can never take the same item twice.
The second argument to the sample method is fraction. This refers to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items, a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable: if you re-run the same sample() operation with the same random seed, you get the same rows back from the sample() command. The question does not specify any behavior around the random seed; the varying random seeds in the answers are only there to confuse you!
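As a sketch of the correct call, using a made-up 1000-row DataFrame (the seed 8261 is just the one from the answer option):

```python
# Hypothetical 1000-row DataFrame standing in for transactionsDf.
transactionsDf = spark.range(1000).toDF("transactionId")

# withReplacement=True lets a row be drawn more than once; fraction=0.15
# returns roughly 150 of the 1000 rows.
sampled = transactionsDf.sample(True, 0.15, 8261)
print(sampled.count())  # approximately 150, not exactly 150
```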
More info: pyspark.sql.DataFrame.sample -- PySpark 3.1.1 documentation
Question 56:
Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?
A. transactionsDf.summary()
B. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min")
C. transactionsDf.summary("count", "mean", "stddev", "25%", "50%", "75%", "max").show()
D. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min").show()
E. transactionsDf.summary().show()
Correct Answer: E
The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command calculates various statistics (see documentation linked below), including standard deviation and minimum. Note that the answer that lists many options in the summary() parentheses does not include the minimum, which is asked for in the question. Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
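A minimal sketch, assuming a made-up transactionsDf with assumed column names:

```python
transactionsDf = spark.createDataFrame(
    [(1, 25.0), (2, 17.5), (3, 8.0)], ["transactionId", "value"]
)

# Called without arguments, summary() includes count, mean, stddev, min,
# the 25%/50%/75% percentiles, and max; show() displays the result.
transactionsDf.summary().show()
```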
Question 57:
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient
executor memory is available, in a fault-tolerant way. Find the error.
A. Caching is not supported in Spark, data are always recomputed.
B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
C. The storage level is inappropriate for fault-tolerant storage.
D. The code block uses the wrong operator for caching.
E. The DataFrameWriter needs to be invoked.
Correct Answer: C
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.
The code block uses the wrong operator for caching.
Wrong. In this case, DataFrame.persist() needs to be used, since this operator supports passing a storage level. DataFrame.cache() does not support passing a storage level.
Caching is not supported in Spark, data are always recomputed. Incorrect. Caching is an important feature of Spark, since it can accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API. No. Caching is either accessed through DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked. Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, we find keywords such as "cache" and "executor memory" that point us away from using external data stores. We aim to save data to memory to accelerate the reading process, since reading from disk is comparatively slower. The DataFrameWriter does not write to memory, so we cannot use it here.
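A sketch of a fault-tolerant caching call, assuming a made-up transactionsDf (the question's original code block is not reproduced here):

```python
from pyspark import StorageLevel

# Hypothetical DataFrame standing in for transactionsDf.
transactionsDf = spark.createDataFrame([(1, 25.0), (2, 17.5)], ["transactionId", "value"])

# MEMORY_AND_DISK_2 keeps partitions in executor memory, spills to disk when
# memory is insufficient, and replicates each partition to a second node,
# which is what makes this storage level fault-tolerant.
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)
```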
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
Question 58:
Which of the following describes Spark's way of managing memory?
A. Spark uses a subset of the reserved system memory.
B. Storage memory is used for caching partitions derived from DataFrames.
C. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
D. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
E. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
Correct Answer: B
Storage memory is used for caching partitions derived from DataFrames.
Correct: Spark's storage memory is where cached data, such as persisted DataFrame partitions, is kept.
Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
No, Spark's memory is divided into execution and storage only; there is no "transaction" category.
As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
No, Spark's garbage collection runs faster on a few big objects than on many small objects.
Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
The opposite is true: serialization reduces the memory footprint, but may impact performance negatively.
Spark uses a subset of the reserved system memory.
No, the reserved system memory is separate from Spark memory and is not used for Spark's execution or storage.
Question 59:
Which of the following describes the role of tasks in the Spark execution hierarchy?
A. Tasks are the smallest element in the execution hierarchy.
B. Within one task, the slots are the unit of work done for each partition of the data.
C. Tasks are the second-smallest element in the execution hierarchy.
D. Stages with narrow dependencies can be grouped into one task.
E. Tasks with wide dependencies can be grouped into one stage.
Correct Answer: A
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong, since a wide transformation causes a shuffle, which always marks the boundary of a stage. So, you cannot bundle multiple tasks that have wide dependencies into a stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots, which enable it to process multiple tasks in parallel.
Question 60:
Which of the following describes properties of a shuffle?
A. Operations involving shuffles are never evaluated lazily.
B. Shuffles involve only single partitions.
C. Shuffles belong to a class known as "full transformations".
D. A shuffle is one of many actions in Spark.
E. In a shuffle, Spark writes data to disk.
Correct Answer: E
In a shuffle, Spark writes data to disk.
Correct! Spark's architecture dictates that intermediate results during a shuffle are written to disk.
A shuffle is one of many actions in Spark.
Incorrect. A shuffle is a transformation, but not an action.
Shuffles involve only single partitions.
No, shuffles involve multiple partitions. During a shuffle, Spark generates output partitions from multiple
input partitions.
Operations involving shuffles are never evaluated lazily.
Wrong. A shuffle is a costly operation, but Spark evaluates it as lazily as other transformations; it is not computed until a subsequent action triggers its evaluation.
Shuffles belong to a class known as "full transformations".
Not quite. Shuffles belong to a class known as "wide transformations". "Full transformation" is not a relevant term in Spark.
More info: Spark - The Definitive Guide, Chapter 2, and Spark: disk I/O on stage boundaries explanation (Stack Overflow)