// Change the codec.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
employee.write.mode(SaveMode.Overwrite).parquet("employee.parquet")
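To verify the write, the Parquet output can be read back; a minimal check, assuming the same sqlContext and the employee.parquet path used above:
// read the snappy-compressed parquet files back and inspect a few rows
val employeeDF = sqlContext.read.parquet("employee.parquet")
employeeDF.show()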
Question 2:
Problem Scenario 80 : You have been given MySQL DB with following details. user=retail_dba password=cloudera database=retail_db table=retail_db.products jdbc URL = jdbc:mysql://quickstart:3306/retail_db Columns of products table : (product_id | product_category_id | product_name | product_description | product_price | product_image ) Please accomplish following activities.
1.
Copy "retaildb.products" table to hdfs in a directory p93_products
2.
Now sort the products data by product price per category; use the product_category_id column to group by category.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Step 7 : Reading back the sequence file data using spark.
seqRDD = sc.sequenceFile("problem86_1")
Step 8 : Print the content to validate the same.
for line in seqRDD.collect():
print(line)
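The two steps shown above read back a sequence file from a different exercise; for the two tasks actually asked in this scenario, a minimal sketch (in the Scala spark-shell, assuming the p93_products directory name from the question and comma-delimited sqoop output) could look like this:
Import the products table into HDFS:
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table products \
--target-dir p93_products
Sort the products by price within each category (product_category_id is the 2nd column, product_price the 5th):
val productsRDD = sc.textFile("p93_products")
// key each row by (category, price) so sortByKey orders rows by price per category
val sorted = productsRDD.map(_.split(","))
  .map(p => ((p(1).toInt, p(4).toFloat), p.mkString(",")))
  .sortByKey()
sorted.values.take(10).foreach(println)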
Question 4:
Problem Scenario 23 : You have been given a log generating service as below. Start_logs (it will generate continuous logs) Tail_logs (you can check what logs are being generated) Stop_logs (it will stop the log service). Path where logs are generated using the above service : /opt/gen_logs/logs/access.log. Now write a flume configuration file named flume3.conf; using that configuration file, dump the logs into the HDFS file system in a directory called flume3/%Y/%m/%d/%H/%M (meaning every minute a new directory should be created). Please use interceptors to provide timestamp information if the message header does not have it, and note that you have to preserve the existing timestamp if the message contains it. The Flume channel should have the following properties as well: after every 100 messages it should be committed, it should use a non-durable/faster channel, and it should be able to hold a maximum of 1000 events.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create the flume configuration file with the below configuration for source, sink and channel.
# Define source, sink, channel and agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /opt/gen_logs/logs/access.log
# Define interceptors
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp
agent1.sources.source1.interceptors.i1.preserveExisting = true
# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = flume3/%Y/%m/%d/%H/%M
agent1.sinks.sink1.hdfs.fileType = DataStream
# Now define the channel1 properties
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
agent1.channels.channel1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Step 2 : Run the below commands, which will use this configuration file and append data in hdfs.
Start the log service : start_logs
Start the flume agent :
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume3.conf --name agent1 -Dflume.root.logger=DEBUG,INFO,console
Wait for a few minutes and then stop the log service : stop_logs
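After the agent has run for a couple of minutes, the per-minute directories can be checked; a simple verification, assuming the relative flume3 path above resolves under the user's HDFS home directory:
hdfs dfs -ls -R flume3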
Question 5:
Problem Scenario 87 : You have been given the below three files.
product.csv (create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution: select the product name, its price and its supplier name where the product price is less than 0.6, using SparkSQL.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1:
hdfs dfs -mkdir sparksql2
hdfs dfs -put product.csv sparksql2/
hdfs dfs -put supplier.csv sparksql2/
hdfs dfs -put products_suppliers.csv sparksql2/
Step 2 : Now in spark shell
// this is used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val products = sc.textFile("sparksql2/product.csv")
val supplier = sc.textFile("sparksql2/supplier.csv")
val prdsup = sc.textFile("sparksql2/products_suppliers.csv")
// Return the first element in this RDD
products.first()
supplier.first()
prdsup.first()
//define the schema using a case class
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float, supplierid: Integer)
case class Suplier(supplierid: Integer, name: String, phone: String)
case class PRDSUP(productid: Integer, supplierid: Integer)
// note: if the csv files still contain their header rows, filter them out before applying the maps below
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
val supRDD = supplier.map(_.split(",")).map(p => Suplier(p(0).toInt, p(1), p(2)))
val prdsupRDD = prdsup.map(_.split(",")).map(p => PRDSUP(p(0).toInt, p(1).toInt))
prdRDD.first()
prdRDD.count()
supRDD.first()
supRDD.count()
prdsupRDD.first()
prdsupRDD.count()
// change RDD of Product objects to a DataFrame
val prdDF = prdRDD.toDF()
val supDF = supRDD.toDF()
val prdsupDF = prdsupRDD.toDF()
// register the DataFrames as temp tables
prdDF.registerTempTable("products")
supDF.registerTempTable("suppliers")
prdsupDF.registerTempTable("products_suppliers")
//Select product, its price , its supplier name where product price is less than 0.6
val results = sqlContext.sql("""SELECT products.name, price, suppliers.name AS sup_name FROM products JOIN suppliers ON products.supplierid = suppliers.supplierid WHERE price < 0.6""")
results.show()
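The same result can also be produced without SQL, via the DataFrame API; a minimal sketch, assuming the prdDF and supDF DataFrames created above:
// join products to suppliers on supplierid, filter on price, then project the three requested columns
val cheap = prdDF.join(supDF, prdDF("supplierid") === supDF("supplierid"))
  .where(prdDF("price") < 0.6)
  .select(prdDF("name"), prdDF("price"), supDF("name").as("sup_name"))
cheap.show()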
Question 6:
Problem Scenario 6 : You have been given following mysql database details as well as other info. user=retail_dba password=cloudera database=retail_db jdbc URL = jdbc:mysql://quickstart:3306/retail_db Compression Codec : org.apache.hadoop.io.compress.SnappyCodec Please accomplish following.
1.
Import the entire database such that it can be used as Hive tables; they must be created in the default schema.
2.
Also make sure each table's data is partitioned into 3 files, e.g. part-00000, part-00001, part-00002
3.
Store all the generated Java files in a directory called java_output for further evaluation.
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Drop all the tables, which we have created in previous problems. Before
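The remaining steps are not reproduced here; a hedged sketch of the import itself, assuming the connection details from the question and the standard Sqoop Hive-import options, could be:
sqoop import-all-tables \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
-m 3 \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--outdir java_output
Here -m 3 yields the three part files per table, and --outdir collects the generated Java classes in the java_output directory.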
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution: au1.union(au2)
Question 8:
Problem Scenario 7 : You have been given following mysql database details as well as other info. user=retail_dba password=cloudera database=retail_db jdbc URL = jdbc:mysql://quickstart:3306/retail_db Please accomplish following.
1.
Import the departments table using your custom boundary query, which imports departments with ids between 1 and 25.
2.
Also make sure each table's data is partitioned into 2 files, e.g. part-00000, part-00001
3.
Also make sure you have imported only two columns from the table: department_id and department_name
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solutions :
Step 1 : Clean the hdfs file system; if these directories exist, clean them out.
hadoop fs -rm -R departments
hadoop fs -rm -R categories
hadoop fs -rm -R products
hadoop fs -rm -R orders
hadoop fs -rm -R order_items
hadoop fs -rm -R customers
Step 2 : Now import the department table as per requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--boundary-query "select 1, 25 from departments" \
--columns department_id,department_name
Step 3 : Check imported data.
hdfs dfs -ls departments
hdfs dfs -cat departments/part-m-00000
hdfs dfs -cat departments/part-m-00001
Question 9:
Problem Scenario 12 : You have been given following mysql database details as well as other info. user=retail_dba password=cloudera database=retail_db jdbc URL = jdbc:mysql://quickstart:3306/retail_db Please accomplish following.
1.
Create a table in retail_db with the following definition.
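The table definition itself is not shown here; judging from the columns used in the later steps (department_id, department_name, and a created_date column that drives the incremental import), a plausible definition would be:
create table departments_new (
  department_id int(11),
  department_name varchar(45),
  created_date timestamp default now()
);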
Step 3 : Insert records from the departments table into departments_new.
insert into departments_new select a.*, null from departments a;
Step 4 : Import data from the departments_new table to hdfs.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_new \
--target-dir /user/cloudera/departments_new \
--split-by department_id
Step 5 : Check the imported data.
hdfs dfs -cat /user/cloudera/departments_new/part*
Step 6 : Insert the following 5 records into the departments_new table.
Insert into departments_new values(110, "Civil" , null);
Insert into departments_new values(111, "Mechanical" , null);
Insert into departments_new values(112, "Automobile" , null);
Insert into departments_new values(113, "Pharma" , null);
Insert into departments_new values(114, "Social Engineering" , null);
commit;
Step 7 : Import incremental data based on the created_date column.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_new \
--target-dir /user/cloudera/departments_new \
--append \
--check-column created_date \
--incremental lastmodified \
--split-by department_id \
--last-value "2016-01-30 12:07:37.0"
Step 8 : Check the imported value.
hdfs dfs -cat /user/cloudera/departments_new/part*
Question 10:
Problem Scenario 3: You have been given MySQL DB with following details. user=retail_dba password=cloudera database=retail_db table=retail_db.categories jdbc URL = jdbc:mysql://quickstart:3306/retail_db Please accomplish following activities.
1.
Import data from categories table, where category_id=22 (Data should be stored in categories_subset)
2.
Import data from categories table, where category_id>22 (Data should be stored in categories_subset_2)
3.
Import data from categories table, where category_id is between 1 and 22 (Data should be stored in categories_subset_3)
4.
While importing categories data, change the delimiter to '|' (Data should be stored in categories_subset_6)
5.
Import data from the categories table and restrict the import to the category_name and category_id columns only, with the delimiter as '|'
6.
Add null values in the table using the below SQL statements: ALTER TABLE categories modify category_department_id int(11); INSERT INTO categories values (60, NULL, 'TESTING');
7.
Import data from the categories table (into the categories_subset_17 directory) using the '|' delimiter and category_id between 1 and 61, and encode null values for both string and non-string columns.
8.
Import entire schema retail_db in a directory categories_subset_all_tables
Correct Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Import a single table (subset of data). Note: the ` below is the backquote character, found on the same key as ~.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset --where \`category_id\`=22 -m 1
Step 2 : Check the output partition
hdfs dfs -cat categories_subset/categories/part-m-00000
Step 3 : Change the selection criteria (subset of data)
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_2 --where \`category_id\`\>22 -m 1
Step 4 : Check the output partition
hdfs dfs -cat categories_subset_2/categories/part-m-00000
Step 5 : Use a between clause (subset of data)
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_3 --where "\`category_id\` between 1 and 22" -m 1
Step 6 : Check the output partition
hdfs dfs -cat categories_subset_3/categories/part-m-00000
Step 7 : Change the delimiter during import.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_6 --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' -m 1
Step 8 : Check the output partition
hdfs dfs -cat categories_subset_6/categories/part-m-00000
Step 9 : Select a subset of columns
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_col --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' --columns=category_name,category_id -m 1
Step 10 : Check the output partition
hdfs dfs -cat categories_subset_col/categories/part-m-00000
Step 11 : Insert a record with null values (using mysql)
ALTER TABLE categories modify category_department_id int(11);
INSERT INTO categories values (60, NULL, 'TESTING');
select * from categories;
Step 12 : Encode null values for string and non-string columns
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_17 --where "\`category_id\` between 1 and 61" --fields-terminated-by='|' --null-string='N' --null-non-string='N' -m 1
Step 13 : View the content
hdfs dfs -cat categories_subset_17/categories/part-m-00000
Step 14 : Import all the tables from the schema (this step will take a little time)
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --warehouse-dir=categories_subset_all_tables
Step 15 : View the contents
hdfs dfs -ls categories_subset_all_tables
Step 16 : Cleanup or back to originals.
delete from categories where category_id in (59,60);
ALTER TABLE categories modify category_department_id int(11) NOT NULL;
ALTER TABLE categories modify category_name varchar(45) NOT NULL;