Which of the following is the idea behind dynamic partition pruning in Spark?
A. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
B. Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
C. Dynamic partition pruning performs wide transformations on disk instead of in memory.
D. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
E. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
Correct Answer: A
Question 22:
Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
A. transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
B. transactionsDf.select(sqrt(predError))
C. transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
D. transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
E. transactionsDf.select(sqrt("predError"))
Correct Answer: D
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError"))) Correct. The DataFrame.withColumn() operator is used to add a new column to a DataFrame. It takes two arguments: The name of the new column (here: predErrorSqrt) and a Column expression as the new column. In PySpark, a Column expression means referring to a column using the col ("predError") command or by other means, for example by transactionsDf.predError, or even just using the column name as a string, "predError". The asks for the square root. sqrt() is a function in pyspark.sql.functions and calculates the square root. It takes a value or a Column as an input. Here it is the predError column of DataFrame transactionsDf expressed through col("predError"). transactionsDf.withColumn ("predErrorSqrt", sqrt(predError)) Incorrect. In this expression, sqrt(predError) is incorrect syntax. You cannot refer to predError in this way ?to Spark it looks as if you are trying to refer to the non-existent Python variable predError. You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead. transactionsDf.select(sqrt(predError)) Wrong. Here, the explanation just above this one about how to refer to predError applies. transactionsDf.select(sqrt("predError")) No. While this is correct syntax, it will return a single-column DataFrame only containing a column showing the square root of column predError. However, the asks for a column to be added to the original DataFrame transactionsDf. transactionsDf.withColumn("predErrorSqrt", col ("predError").sqrt()) No. The issue with this statement is that column col("predError") has no sqrt() method. sqrt() is a member of pyspark.sql.functions, but not of pyspark.sql.Column. More info: pyspark.sql.DataFrame.withColumn -- PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt --PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2, 31 (Databricks import instructions)
Question 23:
Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?
A. transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
B. transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
C. transactionsDf.select(corr(predError, value).alias("corr"))
D. transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
E. transactionsDf.select(corr("predError", "value"))
Correct Answer: D
In difficulty, this question is above what you can expect from the exam. What it wants to teach you, however, is
to pay attention to the useful details included in the
documentation.
pyspark.sql.functions.corr is not a very common method, but it deals with Spark's data structure in an interesting
way. The command takes two columns over multiple rows and returns a single row - similar to
an aggregation function. When examining the documentation (linked below), you will find this code
example:
a = range(20)
b = [2 * x for x in range(20)]
df = spark.createDataFrame(zip(a, b), ["a", "b"])
df.agg(corr("a", "b").alias('c')).collect()
[Row(c=1.0)]
See how corr just returns a single row? Once you understand this, you should be suspicious of
answers that include first(), since there is no need to select a single row explicitly. Another reason to eliminate
those answers is that DataFrame.first() returns an object of type Row, not a DataFrame, as requested in
the question.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr")) Correct! After calculating the Pearson
correlation coefficient, the resulting column is correctly renamed to corr.
transactionsDf.select(corr(predError, value).alias("corr")) No. In this answer, Python will interpret column
names predError and value as variable names.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first() Incorrect. first() returns a row,
not a DataFrame (see above and linked documentation below).
transactionsDf.select(corr("predError", "value"))
Wrong. Whie this statement returns a DataFrame in the desired shape, the column will have the name corr
(predError, value) and not corr.
transactionsDf.select(corr(["predError", "value"]).alias("corr")).first() False. In addition to first() returning a
row, this code block also uses the wrong call structure for command corr which takes two arguments (the
Static notebook | Dynamic notebook: See test 3, 53 (Databricks import instructions)
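A minimal sketch of the correct pattern, assuming transactionsDf is an existing DataFrame with numeric columns predError and value:
from pyspark.sql.functions import corr, col
# single-row DataFrame with one column named corr holding the Pearson correlation coefficient
transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).show()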
Question 24:
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?
A. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
B. itemsDf.sample(fraction=0.1, seed=87238)
C. itemsDf.sample(fraction=1000, seed=98263)
D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
E. itemsDf.sample(fraction=0.1)
Correct Answer: B
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000 rows, since DataFrame.sample() is never guaranteed to return an exact number of rows. To make sure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed as the seed does not matter, as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates.
This is because withReplacement is set to True.
Here is how to understand what replacement means: imagine you have a bucket of 10,000 numbered balls
and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). With replacement, every ball you draw is put back into the bucket before the next draw, so the same ball can be drawn more than once and duplicate rows can appear. Without replacement, each drawn ball stays out of the bucket, so every sampled row is unique.
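A minimal sketch under the same assumptions (itemsDf is an existing 10,000-row DataFrame):
# roughly 10% of the rows, without duplicates (withReplacement defaults to False);
# a fixed seed makes repeated runs return the same rows
sampledDf = itemsDf.sample(fraction=0.1, seed=87238)
sampledDf.count()  # approximately 1,000, not guaranteed to be exact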
Question 25:
The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.
A. The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.
B. Instead of union, the concat method should be used, making sure to not use its default arguments.
C. Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.
D. Instead of the Spark context, transactionsDfMonday should be called with the union method.
E. Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.
Correct Answer: E
Correct code block:
transactionsDfMonday.unionByName(transactionsDfTuesday, True)
For solving this question, you should be aware of the
difference between the DataFrame.union() and DataFrame.unionByName() methods. The first one
matches columns independent of their
names, just by their order. The second one matches columns by their name, which is what the question asks for. Passing True as the second argument (allowMissingColumns) additionally tells unionByName() to insert null values for columns that do not appear in both DataFrames.
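A minimal sketch, assuming Spark 3.1+ and that transactionsDfMonday and transactionsDfTuesday are existing DataFrames whose column names only partially overlap:
# match columns by name; allowMissingColumns=True inserts nulls for columns
# that exist in only one of the two DataFrames
combinedDf = transactionsDfMonday.unionByName(transactionsDfTuesday, allowMissingColumns=True)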
Question 26:
Which of the following statements about data skew is incorrect?
A. Spark will not automatically optimize skew joins by default.
B. Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
C. In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
D. To mitigate skew, Spark automatically disregards null values in keys when joining.
E. Salting can resolve data skew.
Correct Answer: D
To mitigate skew, Spark automatically disregards null values in keys when joining.
This statement is incorrect, and thus the correct answer to the question. Join keys that contain null values are of particular concern with regard to data skew. In real-world applications, a table may contain a great number of records that do not have a value assigned to the column used as a join key. During the join, the data is at risk of being heavily skewed, because all records with a null-value join key end up in a single large partition, standing in stark contrast to the potentially diverse key values (and therefore small partitions) of the non-null-key records. Spark specifically does not handle this automatically. However, there are several strategies to mitigate this problem, like discarding null values temporarily, only to merge them back later (see last link below).
In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
This statement is correct. In fact, having very different partition sizes is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for a long time. This blocks a Spark job and is an inefficient use of resources, since other executors that processed smaller partitions need to idle until the large partition is processed.
Salting can resolve data skew.
This statement is correct. The purpose of salting is to give Spark an opportunity to repartition data into partitions of similar size, based on a salted partitioning key. A salted partitioning key is typically a column that consists of uniformly distributed random numbers. The number of unique entries in the partitioning key column should match the desired number of partitions. After repartitioning by the salted key, all partitions should have roughly the same size.
Spark will not automatically optimize skew joins by default.
This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE). By default, AQE is disabled in Spark. To enable it, Spark's spark.sql.adaptive.enabled configuration option needs to be set to true instead of the default false. To automatically optimize skew joins, Spark's spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default. When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting the bigger partitions into smaller partitions, which leads to performance increases.
Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under some conditions. One of the DataFrames to be joined needs to be small enough to fit into each executor's memory, alongside a partition from the other DataFrame. If this is the case, a broadcast join increases join performance over a sort-merge join. The reason is that a sort-merge join with skewed data involves excessive shuffling. During shuffling, data is sent around the cluster, ultimately slowing down the Spark application. For skewed data, the amount of shuffled data, and thus the slowdown, is particularly big. Broadcast joins, however, help reduce shuffling. The smaller table is stored directly on all executors, eliminating a great amount of network traffic and ultimately increasing join performance relative to the sort-merge join.
It is worth noting that for optimizing skew join behavior it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger than the 10 MB set by default. More info:
-Performance Tuning - Spark 3.0.0 Documentation
-Data Skew and Garbage Collection to Improve Spark Performance
-Section 1.2 - Joins on Skewed Data - GitBook
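A minimal configuration sketch, assuming a Spark 3.x session object named spark; the 50 MB threshold is only an example value:
# enable Adaptive Query Execution and its automatic skew join optimization
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# optionally raise the broadcast threshold (10 MB by default) so a somewhat larger
# small-side DataFrame can still be broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))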
Question 27:
The code block displayed below contains an error. The code block should display the schema of DataFrame transactionsDf. Find the error.
Code block:
transactionsDf.rdd.printSchema
A. There is no way to print a schema directly in Spark, since the schema can be printed easily through using print(transactionsDf.columns), so that should be used instead.
B. The code block should be wrapped into a print() operation.
C. PrintSchema is only accessible through the spark session, so the code block should be rewritten as spark.printSchema(transactionsDf).
D. PrintSchema is a method and should be written as printSchema(). It is also not callable through transactionsDf.rdd, but should be called directly from transactionsDf.
E. PrintSchema is a not a method of transactionsDf.rdd. Instead, the schema should be printed via transactionsDf.print_schema().
Correct Answer: D
Correct code block:
transactionsDf.printSchema()
This is more of a knowledge question that you should just memorize or look up in the provided documentation during the exam. You can get more info about DataFrame.printSchema() in the documentation (link below). However, it is a plain, simple method without any arguments. One answer points to an alternative way of printing the schema: you could also use print(transactionsDf.schema). This gives you a readable, but not nicely formatted, description of the schema.
More info: pyspark.sql.DataFrame.printSchema - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, 54 (Databricks import instructions)
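A minimal sketch of both options, assuming transactionsDf is an existing DataFrame:
# tree-formatted schema printed to stdout
transactionsDf.printSchema()
# readable but unformatted StructType representation
print(transactionsDf.schema)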
Question 28:
Which of the following options describes the responsibility of the executors in Spark?
A. The executors accept jobs from the driver, analyze those jobs, and return results to the driver.
B. The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.
C. The executors accept tasks from the driver, execute those tasks, and return results to the driver.
D. The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.
E. The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.
Correct Answer: C
More info: Running Spark: an overview of Spark's runtime architecture - Manning (https://bit.ly/2RPmJn9)
Question 29:
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Correct code block:
transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))
You should remember that getting a random subset of rows means sampling. This, in turn, should point you to the DataFrame.sample() method. Once you know this, you can look up the correct order of arguments in the documentation (link below). Lastly, you have to decide whether to use filter(), where() or select(). where() is just an alias for filter(). filter() is not the correct method to use here, since it would only allow you to filter rows based on some condition. However, the question asks to return only the average prediction error. You can control the columns that a query returns with the select() method - so this is the correct method to use here.
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, 28 (Databricks import instructions)
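A minimal sketch under the same assumptions (transactionsDf exists and has a numeric predError column):
from pyspark.sql.functions import avg
# sample about 15% of the rows without replacement, then return only the average prediction error
transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg("predError")).show()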
Question 30:
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
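One possible way to express this in PySpark, shown here only as a hedged sketch (assuming itemsDf has a string column supplier):
from pyspark.sql.functions import col
# keep supplier values that do not contain the letter X, each value listed only once
itemsDf.filter(~col("supplier").contains("X")).select("supplier").distinct()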