Exam Details

  • Exam Code: CCA175
  • Exam Name: CCA Spark and Hadoop Developer Exam
  • Certification: Cloudera Certified Associate CCA
  • Vendor: Cloudera
  • Total Questions: 95 Q&As
  • Last Updated: May 12, 2024

Cloudera Certified Associate (CCA) CCA175 Questions & Answers

  • Question 31:

    Problem Scenario 78 : You have been given a MySQL DB with the following details.

    user=retail_dba

    password=cloudera

    database=retail_db

    table=retail_db.orders

    table=retail_db.order_items

    jdbc URL = jdbc:mysql://quickstart:3306/retail_db

    Columns of orders table : (order_id , order_date , order_customer_id, order_status)

    Columns of order_items table : (order_item_id , order_item_order_id ,

    order_item_product_id,

    order_item_quantity,order_item_subtotal,order_item_product_price)

    Please accomplish the following activities.

    1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p92_orders and p92_order_items.

    2. Join the data on order_id using Spark and Python.

    3. Calculate total revenue per day and per customer.

    4. Find the customer with the maximum revenue.
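    A minimal sketch of one approach (the scenario asks for Python; the same RDD logic is shown in Scala here for consistency with the other snippets in this set). It assumes the quickstart defaults above, a spark-shell session where sc is predefined, and sqoop's default comma-delimited output:

    // Step 1: copy both tables to HDFS with sqoop (run from the shell):
    //   sqoop import --connect jdbc:mysql://quickstart:3306/retail_db \
    //     --username retail_dba --password cloudera --table orders --target-dir p92_orders
    //   sqoop import --connect jdbc:mysql://quickstart:3306/retail_db \
    //     --username retail_dba --password cloudera --table order_items --target-dir p92_order_items

    // Step 2: key both datasets by order id and join.
    val orders = sc.textFile("p92_orders").map(_.split(","))
      .map(r => (r(0).toInt, r))                       // (order_id, full order row)
    val items = sc.textFile("p92_order_items").map(_.split(","))
      .map(r => (r(1).toInt, r(4).toFloat))            // (order_item_order_id, order_item_subtotal)
    val joined = orders.join(items)

    // Step 3: total revenue per day and per customer.
    val revenuePerDay = joined.map { case (_, (o, amt)) => (o(1), amt) }.reduceByKey(_ + _)
    val revenuePerCustomer = joined.map { case (_, (o, amt)) => (o(2), amt) }.reduceByKey(_ + _)

    // Step 4: customer with the maximum revenue.
    val topCustomer = revenuePerCustomer.sortBy(_._2, ascending = false).first()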

  • Question 32:

    Problem Scenario 74 : You have been given a MySQL DB with the following details.

    user=retail_dba

    password=cloudera

    database=retail_db

    table=retail_db.orders

    table=retail_db.order_items

    jdbc URL = jdbc:mysql://quickstart:3306/retail_db

    Columns of orders table : (order_id , order_date , order_customer_id, order_status)

    Columns of order_items table : (order_item_id , order_item_order_id ,

    order_item_product_id,

    order_item_quantity,order_item_subtotal,order_item_product_price)

    Please accomplish the following activities.

    1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p89_orders and p89_order_items.

    2. Join the data on order_id using Spark and Python.

    3. From the joined data, fetch the selected columns: order_id, order_date, and the amount collected on the order.

    4. Calculate the total number of orders placed for each date, and produce the output sorted by date.
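    A minimal sketch, under the same assumptions as the Scenario 78 sketch above (Scala in spark-shell, sqoop import into p89_orders and p89_order_items with default delimiters):

    val orders = sc.textFile("p89_orders").map(_.split(","))
      .map(r => (r(0).toInt, r(1)))                    // (order_id, order_date)
    val items = sc.textFile("p89_order_items").map(_.split(","))
      .map(r => (r(1).toInt, r(4).toFloat))            // (order_id, order_item_subtotal)

    // Step 3: (order_id, order_date, amount collected on the order).
    val perOrder = orders.join(items)
      .map { case (id, (date, amt)) => ((id, date), amt) }
      .reduceByKey(_ + _)
      .map { case ((id, date), amt) => (id, date, amt) }

    // Step 4: total orders placed per date, sorted by date.
    val ordersPerDate = orders.map { case (_, date) => (date, 1) }
      .reduceByKey(_ + _)
      .sortByKey()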

  • Question 33:

    You have been given a MySQL DB with the following details.

    user=retail_dba

    password=cloudera

    database=retail_db

    table=retail_db.categories

    jdbc URL = jdbc:mysql://quickstart:3306/retail_db

    Please accomplish the following activities.

    1. Connect to the MySQL DB and check the content of the table.

    2. Copy the "retail_db.categories" table to HDFS without specifying a directory name.

    3. Copy the "retail_db.categories" table to HDFS, into a directory named "categories_target".

    4. Copy the "retail_db.categories" table to HDFS, into a warehouse directory named "categories_warehouse".

  • Question 34:

    Problem Scenario 44 : You have been given 4 files, with the content as given below.

    spark11/file1.txt

    Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

    spark11/file2.txt

    The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

    spark11/file3.txt

    This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

    spark11/file4.txt

    Apache Storm is focused on stream processing, or what some call complex event processing. Storm implements a fault-tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format.

    Write a Spark program which will give you the highest occurring word in each file, along with the file name.
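    A minimal sketch, assuming the four files live under spark11/ on HDFS and a spark-shell (Scala) session; whitespace tokenization is a simplification:

    // wholeTextFiles yields one (fileName, fileContent) pair per file.
    val topWordPerFile = sc.wholeTextFiles("spark11/*")
      .map { case (file, content) =>
        val counts = content.split("\\s+").filter(_.nonEmpty)
          .groupBy(identity).mapValues(_.length)       // word -> occurrences
        val (word, n) = counts.maxBy(_._2)             // highest occurring word
        (file, word, n)
      }
    topWordPerFile.collect().foreach(println)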

  • Question 35:

    Problem Scenario 90 : You have been given the two files below.

    course.txt
    id,course
    1,Hadoop
    2,Spark
    3,HBase

    fee.txt
    id,fee
    2,3900
    3,4200
    4,2900

    Accomplish the following activities.

    1. Select all the courses and their fees, whether a fee is listed or not.

    2. Select all the available fees and the respective courses. If the course does not exist, still list the fee.

    3. Select all the courses and their fees, whether a fee is listed or not, but ignore records having a null fee.
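    A minimal sketch using RDD outer joins in Scala (spark-shell assumed); requirement 1 maps to a left outer join, requirement 2 to a right outer join, and requirement 3 filters the None entries out of the left outer join:

    // Parse a two-column CSV, skipping its header line.
    def load(path: String) = {
      val raw = sc.textFile(path)
      val header = raw.first()
      raw.filter(_ != header).map(_.split(",")).map(r => (r(0), r(1)))
    }
    val courses = load("course.txt")                   // (id, course)
    val fees = load("fee.txt")                         // (id, fee)

    val all = courses.leftOuterJoin(fees)              // 1. every course, fee optional
    val allFees = courses.rightOuterJoin(fees)         // 2. every fee, course optional
    val withFee = all.filter { case (_, (_, fee)) => fee.isDefined }   // 3. drop null fees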

  • Question 36:

    Problem Scenario 34 : You have been given a file named spark6/user.csv. Data is given below:

    user.csv
    id,topic,hits
    Rahul,scala,120
    Nikita,spark,80
    Mithun,spark,1
    myself,cca175,180

    Now write Spark code in Scala which will remove the header and create an RDD of values as below for all rows, and also filter out the row if the id is "myself".

    Map(id -> Rahul, topic -> scala, hits -> 120)
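    A minimal sketch in Scala (spark-shell assumed), zipping the header fields with each row to build the maps:

    val raw = sc.textFile("spark6/user.csv")
    val header = raw.first()                           // "id,topic,hits"
    val keys = header.split(",")
    val maps = raw.filter(_ != header)                 // drop the header row
      .map(_.split(","))
      .filter(r => r(0) != "myself")                   // drop the "myself" row
      .map(r => keys.zip(r).toMap)                     // Map(id -> ..., topic -> ..., hits -> ...)
    maps.collect().foreach(println)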

  • Question 37:

    Problem Scenario 66 : You have been given the below code snippet.

    val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)

    val b = a.keyBy(_.length)

    val c = sc.parallelize(List("ant", "falcon", "squid"), 2)

    val d = c.keyBy(_.length)

    operation1

    Write a correct code snippet for operation1 which will produce the desired output, shown below.

    Array[(Int, String)] = Array((4,lion))
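    A sketch of one operation that produces this output: subtractByKey keeps the pairs of b whose key has no match in d, and the only string length present in b (3, 4, 5, 6) but absent from d (3, 5, 6) is 4:

    val operation1 = b.subtractByKey(d)                // pairs of b with keys missing from d
    operation1.collect                                 // Array((4,lion))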

  • Question 38:

    Problem Scenario 84 : In continuation of the previous question, please accomplish the following activities.

    1. Select all the products which have a null product code.

    2. Select all the products whose name starts with 'Pen', with the results ordered by price in descending order.

    3. Select all the products whose name starts with 'Pen', with the results ordered by price in descending order and quantity in ascending order.

    4. Select the top 2 products by price.
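    A minimal sketch in Spark SQL from Scala; the table name products and the column names product_code, name, price, and quantity are assumptions carried over from the omitted previous scenario:

    // Table and column names below are assumptions, not given in this excerpt.
    sqlContext.sql("SELECT * FROM products WHERE product_code IS NULL").show()
    sqlContext.sql("SELECT * FROM products WHERE name LIKE 'Pen%' ORDER BY price DESC").show()
    sqlContext.sql("SELECT * FROM products WHERE name LIKE 'Pen%' ORDER BY price DESC, quantity ASC").show()
    sqlContext.sql("SELECT * FROM products ORDER BY price DESC LIMIT 2").show()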

  • Question 39:

    Problem Scenario 32 : You have been given three files as below: spark3/sparkdir1/file1.txt, spark3/sparkdir2/file2.txt, and spark3/sparkdir3/file3.txt. Each file contains some text.

    spark3/sparkdir1/file1.txt

    Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework

    spark3/sparkdir2/file2.txt

    The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

    spark3/sparkdir3/file3.txt

    This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

    Now write Spark code in Scala which will load all three of these files from HDFS and do a word count, filtering out the following words, with the result sorted by word count in reverse order.

    Filter words: ("a","the","an","as","with","this","these","is","are","in","for","to","and","The","of")

    Also make sure you load all three files as a single RDD (all three files must be loaded using a single API call). You have also been given the following codec:

    import org.apache.hadoop.io.compress.GzipCodec

    Please use the above codec to compress the file while saving it to HDFS.
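    A minimal sketch in Scala (spark-shell assumed); a single textFile call accepts a comma-separated list of paths, which satisfies the single-API-call requirement, and the output path is an assumption:

    import org.apache.hadoop.io.compress.GzipCodec

    // One API call loading all three files into a single RDD.
    val content = sc.textFile(
      "spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt")

    val stop = Set("a","the","an","as","with","this","these","is","are","in","for","to","and","The","of")

    val counts = content.flatMap(_.split(" "))
      .filter(w => w.nonEmpty && !stop.contains(w))    // drop the filter words
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .map { case (w, n) => (n, w) }
      .sortByKey(false)                                // sort by count, reverse order

    counts.saveAsTextFile("spark3/result", classOf[GzipCodec])   // gzip-compressed output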

  • Question 40:

    Problem Scenario 77 : You have been given a MySQL DB with the following details.

    user=retail_dba

    password=cloudera

    database=retail_db

    table=retail_db.orders

    table=retail_db.order_items

    jdbc URL = jdbc:mysql://quickstart:3306/retail_db

    Columns of orders table : (order_id , order_date , order_customer_id, order_status)

    Columns of order_items table : (order_item_id , order_item_order_id , order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)

    Please accomplish the following activities.

    1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p92_orders and p92_order_items.

    2. Join the data on order_id using Spark and Python.

    3. Calculate total revenue per day and per order.

    4. Calculate the total and average revenue for each date, using combineByKey or aggregateByKey.
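    A minimal sketch of step 4 with aggregateByKey (Scala in spark-shell assumed; steps 1 and 2 mirror the Scenario 78 sketch above). aggregateByKey folds each revenue amount into a (sum, count) accumulator per date, from which both the total and the average fall out:

    val orders = sc.textFile("p92_orders").map(_.split(","))
      .map(r => (r(0).toInt, r(1)))                    // (order_id, order_date)
    val items = sc.textFile("p92_order_items").map(_.split(","))
      .map(r => (r(1).toInt, r(4).toFloat))            // (order_id, order_item_subtotal)
    val revenueByDate = orders.join(items)
      .map { case (_, (date, amt)) => (date, amt) }

    val sumCount = revenueByDate.aggregateByKey((0.0f, 0))(
      (acc, amt) => (acc._1 + amt, acc._2 + 1),        // fold one amount into (sum, count)
      (a, b) => (a._1 + b._1, a._2 + b._2)             // merge partition accumulators
    )
    val totalAndAvg = sumCount.mapValues { case (sum, n) => (sum, sum / n) }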

Tips on How to Prepare for the Exams

Nowadays, certification exams have become more and more important, and more and more enterprises require them when you apply for a job. But how do you prepare for an exam effectively? How do you prepare in a short time with less effort? How do you get an ideal result, and how do you find the most reliable resources? Here on Vcedump.com you will find all the answers. Vcedump.com provides not only Cloudera exam questions, answers, and explanations but also complete assistance with your exam preparation and certification application. If you are confused about your CCA175 exam preparation or your Cloudera certification application, do not hesitate to visit Vcedump.com to find your solutions.