Cloudera Certified Associate (CCA) CCA175 Questions & Answers
Question 21:
Problem Scenario 88 : You have been given the three files below (create them in HDFS).
product.csv
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
1.
The same product can be supplied by multiple suppliers. For each product, find its price according to each supplier.
2.
Find all the supplier names that supply 'Pencil 3B'.
3.
Find all the products that are supplied by ABC Traders.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
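Before the queries below can run, the three CSV files have to be loaded and registered as temp tables. A minimal sketch, following the same pattern used for Problem Scenario 89 later in this document (the case class names, the header filters and the Int/Double column types are assumptions, and the paths assume the files sit in the user's HDFS home directory):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// products table
case class Product(productID: Int, productCode: String, name: String, quantity: Int, price: Double, supplierid: Int)
sc.textFile("product.csv").filter(!_.startsWith("productID")).map(_.split(","))
  .map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toDouble, p(5).toInt))
  .toDF().registerTempTable("products")
// suppliers table
case class Supplier(supplierid: Int, name: String, phone: String)
sc.textFile("supplier.csv").filter(!_.startsWith("supplierid")).map(_.split(","))
  .map(s => Supplier(s(0).toInt, s(1), s(2)))
  .toDF().registerTempTable("suppliers")
// products_suppliers table
case class ProdSup(productID: Int, supplierID: Int)
sc.textFile("products_suppliers.csv").filter(!_.startsWith("productID")).map(_.split(","))
  .map(ps => ProdSup(ps(0).toInt, ps(1).toInt))
  .toDF().registerTempTable("products_suppliers")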
Step 1 : The same product can be supplied by multiple suppliers. For each product, find its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS `Product Name`, price, suppliers.name AS `Supplier Name`
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierid""")
results.show()
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
val z = x.intersection(y)
intersection : Returns the elements that appear in both RDDs.
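A minimal sketch showing intersection in the spark shell (the contents of x and y here are hypothetical):
val x = sc.parallelize(1 to 10)
val y = sc.parallelize(5 to 15)
val z = x.intersection(y)
z.collect
// res: Array[Int] = Array(5, 6, 7, 8, 9, 10) (ordering may differ)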
Question 23:
Problem Scenario 95 : You have to run your Spark application on YARN with each executor's maximum heap size set to 512MB, the number of processor cores to allocate on each executor set to 1, and your main application requires three values as input arguments: V1 V2 V3. Please replace XXX, YYY, ZZZ.
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution
XXX : --executor-memory 512m
YYY : --executor-cores 1
ZZZ : V1 V2 V3
Notes : spark-submit on YARN options (Option / Description)
--archives : Comma-separated list of archives to be extracted into the working directory of each executor. The path must be globally visible inside your cluster; see Advanced Dependency Management.
--executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use the spark.executor.cores property.
--executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the spark.executor.memory property.
--num-executors : Total number of YARN containers to allocate for this application. Alternatively, you can use the spark.executor.instances property.
--queue : YARN queue to submit to. For more information, see Assigning Applications and Queries to Resource Pools. Default: default.
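Putting the replacements into the command given in the scenario, the full submit line reads:
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/hadoopexam.jar V1 V2 V3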
Question 24:
Problem Scenario 93 : You have to run your Spark application locally with 8 threads (i.e. locally on 8 cores). Replace XXX with the correct value.
spark-submit --class com.hadoopexam.MyTask XXX --deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution
XXX: --master local[8]
Notes : The master URL passed to Spark can be in one of the following formats (Master URL / Meaning):
local : Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
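With the replacement applied, the command exactly as given in the scenario reads:
spark-submit --class com.hadoopexam.MyTask --master local[8] --deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10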
Question 25:
Problem Scenario 89 : You have been given the patient data below in csv format.
patientID,name,dateOfBirth,lastVisitDate
1001,Ah Teck,1991-12-31,2012-01-20
1002,Kumar,2011-10-29,2012-09-20
1003,Ali,2011-01-30,2012-10-21
Accomplish the following activities.
1.
Find all the patients whose lastVisitDate is between the current time and '2012-09-15'
2.
Find all the patients who were born in 2011
3.
Find each patient's age
4.
List patients whose last visit was more than 60 days ago
5.
Select patients who are 18 years old or younger
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1:
hdfs dfs -mkdir sparksql3
hdfs dfs -put patients.csv sparksql3/
Step 2 : Now in spark shell
// SQLContext entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val patients = sc.textFile("sparksql3/patients.csv")
// Return the first element in this RDD
patients.first()
//define the schema using a case class
case class Patient(patientid: Integer, name: String, dateOfBirth:String , lastVisitDate:
String)
// create an RDD of Patient objects
val patRDD = patients.map(_.split(",")).map(p => Patient(p(0).toInt,p(1),p(2),p(3)))
patRDD.first()
patRDD.count()
// change the RDD of Patient objects to a DataFrame val patDF = patRDD.toDF()
// register the DataFrame as a temp table patDF.registerTempTable("patients")
// Select data from table
val results = sqlContext.sql("""SELECT * FROM patients""")
// display dataframe in a tabular format
results.show()
//Find all the patients whose lastVisitDate between current time and '2012-09-15'
val results = sqlContext.sql("""SELECT * FROM patients WHERE
TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP))
BETWEEN '2012-09-15' AND current_timestamp() ORDER BY lastVisitDate""")
results.show()
//Find all the patients who were born in 2011
val results = sqlContext.sql("""SELECT * FROM patients WHERE
YEAR(TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS
TIMESTAMP))) = 2011""")
results.show()
//Find each patient's age
val results = sqlContext.sql("""SELECT name, dateOfBirth, datediff(current_date(),
TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)))/365
AS age
FROM patients""")
results.show()
//List patients whose last visit was more than 60 days ago
val results = sqlContext.sql("""SELECT name, lastVisitDate FROM patients WHERE
datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP))) > 60""")
results.show()
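The last activity (patients 18 years old or younger) is not covered above; a minimal sketch following the same pattern, with the /365 day-count approximation of age carried over from the age query:
//Select patients who are 18 years old or younger
val results = sqlContext.sql("""SELECT name, dateOfBirth FROM patients WHERE
datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)))/365 <= 18""")
results.show()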
Problem Scenario 14 : You have been given the following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1.
Create a csv file named updated_departments.csv with the following contents in the local file system.
updated_departments.csv
2,fitness
3,footwear
12,fathematics
13,fcience
14,engineering
1000,management
2.
Upload this csv file to the hdfs filesystem.
3.
Now export this data from hdfs to the mysql retail_db.departments table. During the export make sure existing departments are just updated and new departments are inserted.
4.
Now update the updated_departments.csv file with the content below.
2,Fitness
3,Footwear
12,Fathematics
13,Science
14,Engineering
1000,Management
2000,Quality Check
5.
Now upload this file to hdfs.
6.
Now export this data from hdfs to the mysql retail_db.departments table. During the export make sure existing departments are just updated and no new departments are inserted.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create a csv file named updated_departments.csv with the given content.
Step 2 : Now upload this file to HDFS.
Create a directory called new_data.
hdfs dfs -mkdir new_data
hdfs dfs -put updated_departments.csv new_data/
Step 3 : Check whether the file is uploaded or not. hdfs dfs -ls new_data
Step 4 : Export this file to the departments table using sqoop.
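The sqoop commands themselves are not listed in the steps above. A minimal sketch of the two exports, assuming the file sits under new_data/ and that department_id is the primary key of retail_db.departments (allowinsert updates existing rows and inserts new ones, as required by step 3; updateonly never inserts, as required by step 6):
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba --password cloudera \
--table departments \
--export-dir new_data/updated_departments.csv \
--update-key department_id --update-mode allowinsert
For the second export (after replacing the file in steps 4 and 5), only the update mode changes:
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba --password cloudera \
--table departments \
--export-dir new_data/updated_departments.csv \
--update-key department_id --update-mode updateonly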
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution : b.leftOuterJoin(d).collect
leftOuterJoin [Pair] : Performs a left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
keyBy : Constructs two-component tuples (key-value pairs) by applying a function to each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
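A minimal sketch of keyBy followed by leftOuterJoin (the RDDs a and c and their contents are hypothetical; b and d are the keyed RDDs referenced in the solution line above):
val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)   // key = word length, value = the word itself
val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "wolf", "bear", "bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect
// e.g. (3,(dog,Some(dog))), (3,(dog,Some(cat))), ..., (8,(elephant,None))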
Question 29:
Problem Scenario 53 : You have been given the below code snippet.
val a = sc.parallelize(1 to 10, 3)
operation1
b.collect
Output 1
Array[Int] = Array(2, 4, 6, 8, 10)
operation2
Output 2
Array[Int] = Array(1, 2, 3)
Write a correct code snippet for operation1 and operation2 which will produce the desired output shown above.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
val b = a.filter(_ % 2 == 0)
a.filter(_ < 4).collect
filter : Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD. When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle, but for the good (matching) data you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)
Examples for mixed data without partial functions:
val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
error: value < is not a member of Any
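As a follow-up to the partial-function tip above, a minimal sketch that handles the mixed-type RDD by matching only the wanted type with a case clause (RDD.collect also accepts a partial function):
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
// keep only the Double values below 4; non-matching items are simply skipped
val smalls = a.collect{ case d: Double if d < 4 => d }
smalls.collect
// res: Array[Double] = Array(3.5)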
Question 30:
Problem Scenario 33 : You have been given the files below.
spark5/EmployeeName.csv (id,name)
spark5/EmployeeSalary.csv (id,salary)
Data is given below:
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000
Now write Spark code in scala which will load these two files from hdfs, join them, and produce the (name, salary) pairs. Then save the data in multiple files grouped by salary (i.e. each file will contain the names of employees with the same salary). Make sure the file name includes the salary as well.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create both files in hdfs (we will do this using Hue). However, you can first create
them in the local filesystem and then upload them to hdfs.
Step 2 : Load EmployeeName.csv file from hdfs and create PairRDDs
val name = sc.textFile("spark5/EmployeeName.csv")
val namePairRDD = name.map(x=> (x.split(",")(0),x.split(",")(1)))
Step 3 : Load EmployeeSalary.csv file from hdfs and create PairRDDs
val salary = sc.textFile("spark5/EmployeeSalary.csv")
val salaryPairRDD = salary.map(x=> (x.split(",")(0),x.split(",")(1)))
Step 4 : Join all pairRDDS
val joined = namePairRDD.join(salaryPairRDD)
Step 5 : Remove the key from the RDD so that salary can become the key. val keyRemoved = joined.values
Step 6 : Now swap the (name, salary) pairs so that salary becomes the key.
val swapped = keyRemoved.map(item => item.swap)
Step 7 : Now groupBy keys (It will generate key and value array) val grpByKey =
swapped.groupByKey().collect()
Step 8 : Now create RDD for values collection
val rddByKey = grpByKey.map{case (k,v) => k->sc.makeRDD(v.toSeq)}
Step 9 : Save the output as a Text file.
rddByKey.foreach{ case (k,rdd) => rdd.saveAsTextFile("spark5/Employee"+k)}
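To check the result, the generated directories can be inspected (the 50000 in the path comes from the salary data above; part-00000 is the default saveAsTextFile output file name):
hdfs dfs -ls spark5/Employee50000
hdfs dfs -cat spark5/Employee50000/part-00000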
Nowadays, certification exams have become more and more important and are required by more and more enterprises when applying for a job. But how do you prepare for the exam effectively? How do you prepare for the exam in a short time with less effort? How do you get an ideal result, and how do you find the most reliable resources? Here on Vcedump.com, you will find all the answers. Vcedump.com provides not only Cloudera exam questions, answers and explanations but also complete assistance with your exam preparation and certification application. If you are confused about your CCA175 exam preparation and Cloudera certification application, do not hesitate to visit Vcedump.com to find your solutions here.