Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?
A.
counter = 0

for index, row in itemsDf.iterrows():
    if 'Inc.' in row['supplier']:
        counter = counter + 1

print(counter)
B.
counter = 0

def count(x):
    if 'Inc.' in x['supplier']:
        counter = counter + 1

itemsDf.foreach(count)
print(counter)
C. print(itemsDf.foreach(lambda x: 'Inc.' in x))
D. print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())
E.
accum=sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)
Correct Answer: E
Correct code block:
accum=sc.accumulator(0)
def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)
itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)
To answer this correctly, you need to know both about the DataFrame.foreach() method and
accumulators.
When Spark runs the code, it executes it on the executors. The executors do not have any information
about variables outside of their scope. This is why simply using a Python variable counter,
like in the two examples that start with counter = 0, will not work. You need to tell the executors explicitly
that counter is a special shared variable, an Accumulator, which is managed by the driver
and can be accessed by all executors for the purpose of adding to it. If you have used Pandas in the past,
you might be familiar with the iterrows() command.
Notice that there is no such command in PySpark.
The two examples that start with print do not work, since DataFrame.foreach() does not have a return
value.
More info: pyspark.sql.DataFrame.foreach -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, 22 (Databricks import instructions)
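For reference, here is a minimal, self-contained sketch of the accumulator approach. The SparkSession variable spark and the example rows in itemsDf are assumptions for illustration only; the accumulator-plus-foreach() pattern itself mirrors the correct answer.

from pyspark.sql import Row

# Hypothetical example data; the real itemsDf comes from the exam scenario
itemsDf = spark.createDataFrame([
    Row(itemId=1, supplier='Sports Ltd.'),
    Row(itemId=2, supplier='YetiX Inc.'),
    Row(itemId=3, supplier='Whoosh Inc.'),
])

# The accumulator is managed by the driver; executors can only add to it
accum = spark.sparkContext.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)  # prints 2 for the example data above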
Question 32:
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?
A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
B. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
C. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")
D. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
E. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
Correct Answer: D
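As a hedged sketch of what answer D does, assuming transactionsDf and itemsDf already exist: broadcast() is imported from pyspark.sql.functions and marks the small DataFrame so it is shipped to every executor, while join() defaults to an inner join.

from pyspark.sql.functions import broadcast

# Send the small transactionsDf to all executors and join it with itemsDf
# on itemsDf.itemId == transactionsDf.storeId (inner join by default)
joined = itemsDf.join(broadcast(transactionsDf),
                      itemsDf.itemId == transactionsDf.storeId)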
Question 33:
The code block displayed below contains an error. The code block should read the CSV file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as the column header and casting the columns to the most appropriate types. Find the error.
First 3 rows of transactions.csv:
transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass
Code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)
A. The DataFrameReader is not accessed correctly.
B. The transaction is evaluated lazily, so no file will be read.
C. Spark is unable to understand the file type.
D. The code block is unable to capture all columns.
E. The resulting DataFrame will not have the appropriate schema.
Correct Answer: E
By default, Spark does not infer the schema of a CSV file, since this usually takes some time. So, you need to add the inferSchema=True option to the code block.
More info: pyspark.sql.DataFrameReader.csv -- PySpark 3.1.2 documentation
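A sketch of the corrected read call, assuming the file exists at the given path; inferSchema=True makes Spark scan the data and cast each column to the most appropriate type.

transactionsDf = spark.read.load("data/transactions.csv",
                                 format="csv",
                                 sep=";",
                                 header=True,
                                 inferSchema=True)
transactionsDf.printSchema()  # transactionId, storeId, productId now come back as integers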
Question 34:
The code block displayed below contains an error. The code block should return all rows of DataFrame transactionsDf, but including only columns storeId and predError. Find the error.
A. Instead of select, DataFrame transactionsDf needs to be filtered using the filter operator.
B. Columns storeId and predError need to be represented as a Python list, so they need to be wrapped in brackets ([]).
C. The take method should be used instead of the collect method.
D. Instead of collect, collectAsRows needs to be called.
E. The collect method is not a method of the SparkSession object.
Correct Answer: E
Correct code block:
transactionsDf.select("storeId", "predError").collect()
collect() is a method of the DataFrame object.
More info: pyspark.sql.DataFrame.collect -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, 24 (Databricks import instructions)
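As a brief illustration of why answer E names the error, a sketch assuming transactionsDf exists: collect() is called on the DataFrame, not on the SparkSession, and it returns a list of Row objects to the driver.

rows = transactionsDf.select("storeId", "predError").collect()
# rows is a Python list of pyspark.sql.Row objects
for row in rows:
    print(row['storeId'], row['predError'])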
Question 35:
Which of the following statements about reducing out-of-memory errors is incorrect?
A. Concatenating multiple string columns into a single column may guard against out-of-memory errors.
B. Reducing partition size can help against out-of-memory errors.
C. Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
D. Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
E. Decreasing the number of cores available to each executor can help against out-of-memory errors.
Correct Answer: A
Concatenating multiple string columns into a single column may guard against out-of-memory errors.
Exactly, this statement is incorrect and therefore the right answer! Concatenating string columns does not reduce the size of the data, it just structures it in a different way. This does little to change how Spark processes the data and certainly does not reduce out-of-memory errors.

Reducing partition size can help against out-of-memory errors.
No, this is not incorrect. Reducing partition size is a viable way to guard against out-of-memory errors, since executors need to load partitions into memory before processing them. If an executor does not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful in preventing that.

Decreasing the number of cores available to each executor can help against out-of-memory errors.
No, this is not incorrect. To process a partition, that partition needs to be loaded into the memory of an executor. If you imagine that every core in every executor processes a partition, potentially in parallel with other executors, you can see that memory on the machine hosting the executors fills up quite quickly. So, memory usage of executors is a concern, especially when multiple partitions are processed at the same time. To strike a balance between performance and memory usage, decreasing the number of cores may help against out-of-memory errors.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
No, this is not incorrect. When using commands like collect() that trigger the transmission of potentially large amounts of data from the cluster to the driver, the driver may run out of memory. One strategy to avoid this is to be careful about using commands like collect() that send large amounts of data back to the driver. Another strategy is to set the spark.driver.maxResultSize parameter: if the data to be transmitted to the driver exceeds the threshold specified by this parameter, Spark aborts the job and thereby prevents an out-of-memory error.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
Wrong, this is not incorrect. As part of Spark's internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to the executors. This broadcast happens via the driver, so all broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter.

More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error -- Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium
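The two configuration parameters mentioned above can be set when building the session. This is only an illustrative sketch; the values shown are assumptions, not recommendations.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oom-guardrails")
         # abort jobs that would send more than 2 GB of results back to the driver
         .config("spark.driver.maxResultSize", "2g")
         # only auto-broadcast tables up to roughly 50 MB in joins
         .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
         .getOrCreate())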
Question 36:
The code block shown below should return an exact copy of DataFrame transactionsDf that does not include rows in which values in column storeId have the value 25. Choose the answer that correctly fills the blanks in the code block to accomplish this.
A. transactionsDf.remove(transactionsDf.storeId==25)
B. transactionsDf.where(transactionsDf.storeId!=25)
C. transactionsDf.filter(transactionsDf.storeId==25)
D. transactionsDf.drop(transactionsDf.storeId==25)
E. transactionsDf.select(transactionsDf.storeId!=25)
Correct Answer: B
transactionsDf.where(transactionsDf.storeId!=25)
Correct. DataFrame.where() is an alias for the DataFrame.filter() method. Using this method, it is
straightforward to drop the rows in which column storeId has the value 25 while keeping all other rows.
transactionsDf.select(transactionsDf.storeId!=25)
Wrong. The select operator allows you to build DataFrames column-wise. Used as shown, it returns a
DataFrame with a single boolean column; it does not filter out any rows.
transactionsDf.filter(transactionsDf.storeId==25)
Incorrect. Although the filter expression works for filtering rows, the == in the filtering condition is
inappropriate. It should be != instead.
transactionsDf.drop(transactionsDf.storeId==25)
No. DataFrame.drop() is used to remove specific columns, but not rows, from the DataFrame.
transactionsDf.remove(transactionsDf.storeId==25)
False. There is no DataFrame.remove() operator in PySpark.
More info: pyspark.sql.DataFrame.where -- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, 48 (Databricks import instructions)
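For completeness, a small sketch (assuming transactionsDf exists) showing that where() and filter() are interchangeable and also accept SQL-style string conditions.

from pyspark.sql.functions import col

transactionsDf.where(transactionsDf.storeId != 25)   # column expression, as in the answer
transactionsDf.filter(col("storeId") != 25)          # filter() is the same operation
transactionsDf.where("storeId != 25")                # SQL-style string condition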
Question 37:
Which of the following DataFrame methods is classified as a transformation?
A. DataFrame.count()
B. DataFrame.show()
C. DataFrame.select()
D. DataFrame.foreach()
E. DataFrame.first()
Correct Answer: C
DataFrame.select()
Correct, DataFrame.select() is a transformation. It is evaluated lazily: the command immediately returns a
new DataFrame, but the underlying computation only runs once it is triggered by an action.
DataFrame.foreach()
Incorrect, DataFrame.foreach() is not a transformation, but an action. The intention of foreach() is to apply
code to each element of a DataFrame, for example to update accumulator variables or write the
elements to external storage. It does not return a new DataFrame - it is an action!
DataFrame.first()
Wrong. As an action, DataFrame.first() executes immediately and returns the first row of a DataFrame.
DataFrame.count()
Incorrect. DataFrame.count() is an action and returns the number of rows in a DataFrame.
DataFrame.show()
No, DataFrame.show() is an action and displays the DataFrame upon execution of the command.
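A short sketch (assuming transactionsDf exists) that contrasts the transformation with the actions listed above: the select() line only builds a plan, while count(), first() and show() each trigger execution.

selected = transactionsDf.select("storeId")  # transformation: returns a new DataFrame, nothing runs yet

print(selected.count())   # action: triggers a job and returns a number
print(selected.first())   # action: returns the first Row
selected.show(5)          # action: computes and prints up to 5 rows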
Question 38:
Which of the following code blocks saves DataFrame transactionsDf in location /FileStore/transactions.csv as a CSV file and throws an error if a file already exists in the location?
A. transactionsDf.write.save("/FileStore/transactions.csv")
B. transactionsDf.write.format("csv").mode("error").path("/FileStore/transactions.csv")
C. transactionsDf.write.format("csv").mode("ignore").path("/FileStore/transactions.csv")
D. transactionsDf.write("csv").mode("error").save("/FileStore/transactions.csv")
E. transactionsDf.write.format("csv").mode("error").save("/FileStore/transactions.csv")
Correct Answer: E
Static notebook | Dynamic notebook: See test 1, 28 (Databricks import instructions) (https://flrs.github.io/spark_practice_tests_code/#1/28.html, https://bit.ly/sparkpracticeexams_import_instructions)
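A sketch of the save modes involved, assuming transactionsDf exists; "error" (also accepted as "errorifexists") is the mode that raises an exception when the target path already exists.

# correct answer E: fails with an error if /FileStore/transactions.csv already exists
transactionsDf.write.format("csv").mode("error").save("/FileStore/transactions.csv")

# for contrast (not what the question asks for):
# mode("ignore")    - silently writes nothing if the path already exists
# mode("overwrite") - replaces any existing data at the path
# mode("append")    - adds the new data to whatever is already there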
Question 39:
The code block displayed below contains at least one error. The code block should return a DataFrame
with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).
Code block:
1.from pyspark.sql.functions import udf
2.from pyspark.sql import types as T
10.spark.sql('SELECT power_5_udf(value) FROM transactions')
A. The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.
B. The returned DataFrame includes multiple columns instead of just one column.
C. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf DataFrame.
D. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and Spark driver does not call the UDF function appropriately.
E. The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is not result.
Correct Answer: D
Correct code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    if x:
        return x**5
    return x

spark.udf.register('power_5_udf', pow_5, T.LongType())
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')

Here it is important to understand how the pow_5 method handles empty values. In the wrong code block above, pow_5 is unable to handle empty values and will throw an error, since Python's ** operator cannot deal with a null value that Spark passes into the method.
The order of arguments for registering the UDF with Spark via spark.udf.register matters. In the code snippet in the question, the arguments for the SQL method name and the actual Python function are switched. You can read more about the arguments of spark.udf.register and see some examples of its usage in the documentation (link below).
Finally, you should recognize that in the original code block, an expression to rename the column created through the UDF is missing. The renaming is done by SQL's AS result clause. Omitting that clause, you end up with the column name power_5_udf(value) and not result.
More info: pyspark.sql.functions.udf -- PySpark 3.1.1 documentation
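For comparison, a hedged sketch of an equivalent DataFrame-API approach using the udf function that the question imports; it assumes column value holds integers, and the column name result and DataFrame transactionsDf are taken from the question.

from pyspark.sql.functions import udf
from pyspark.sql import types as T

def pow_5(x):
    if x:
        return x**5
    return x

# wrap the Python function as a UDF returning a long; nulls fall through unchanged
power_5_udf = udf(pow_5, T.LongType())
transactionsDf.select(power_5_udf('value').alias('result'))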
Question 40:
The code block displayed below contains an error. The code block is intended to perform an outer join of
DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.