Cloudera Certified Developer for Apache Hadoop (CCDH)
Exam Details
Exam Code: CCD-410
Exam Name: Cloudera Certified Developer for Apache Hadoop (CCDH)
Certification: CCDH
Vendor: Cloudera
Total Questions: 60 Q&As
Last Updated: May 14, 2024
Cloudera CCDH CCD-410 Questions & Answers
Question 31:
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?
A. The file will be marked as corrupted if data node B fails during the creation of the file.
B. Each data node locks the local file to prohibit concurrent readers and writers of the file.
C. Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
D. The file can be accessed if at least one of the data nodes storing the file is available.
Correct Answer: D
HDFS keeps three copies of a block on three different datanodes to protect against data loss when a node fails.
HDFS also tries to distribute these three replicas across more than one rack to protect against data availability
issues. Because HDFS actively monitors for failed datanodes and, upon detecting a failure, immediately
schedules re-replication of the affected blocks (if needed), three copies of the data on three different nodes are
sufficient to keep the file accessible.
Note:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as
a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An application
can specify the number of replicas of a file. The replication factor can be specified at file creation time and
can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The
NameNode makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement
policy. In the default configuration there are a total of three copies of a data block in HDFS: two copies are stored on
datanodes in the same rack and the third copy on a different rack.
Reference: 24 Interview Questions and Answers for Hadoop MapReduce developers, How the HDFS Blocks are replicated?
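As a rough illustration of the note above, here is a minimal Java sketch (the path, data, and buffer values are illustrative, not from the exam) showing how a client can set the replication factor when creating an HDFS file and change it afterwards through the FileSystem API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/foo.txt");   // illustrative path

        // Create the file with an explicit replication factor of 3.
        FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096),
                (short) 3,                             // replication factor
                fs.getDefaultBlockSize(file));
        out.writeUTF("hello hdfs");
        out.close();

        // The replication factor can also be changed after creation.
        fs.setReplication(file, (short) 2);
    }
}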
Question 32:
Identify which best defines a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Correct Answer: D
SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and
compressed. The size of the 'block' is configurable.
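A minimal Java sketch (the path and record types are illustrative) of what answer D describes: every key in the file is one Writable type and every value is another, written and read back as binary key/value pairs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/data.seq");   // illustrative path

        // Every key is an IntWritable and every value is a Text; the types
        // are fixed for the whole file, matching answer D.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }

        // Read the records back in the order they were written.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}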
Question 33:
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?
A. The amount of RAM installed on the TaskTracker node.
B. The amount of free disk space on the TaskTracker node.
C. The number and speed of CPU cores on the TaskTracker node.
D. The average system load on the TaskTracker node over the past fifteen (15) minutes.
E. The location of the InputSplit to be processed in relation to the location of the node.
Correct Answer: E
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if it cannot find one, it looks for an empty slot on a machine in the same rack.
Reference: 24 Interview Questions and Answers for Hadoop MapReduce developers, How JobTracker schedules a task?
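The locality information the JobTracker relies on ultimately comes from HDFS block locations. Here is a minimal Java sketch (the path is illustrative) of inspecting those locations through the FileSystem API; FileInputFormat uses the same information to record, for each InputSplit, the hosts that hold its data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/foo.txt");   // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // Print, for each block of the file, the datanodes that hold a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}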
Question 34:
All keys used for intermediate output from mappers must:
A. Implement a splittable compression algorithm.
B. Be a subclass of FileInputFormat.
C. Implement WritableComparable.
D. Override isSplitable.
E. Implement a comparator for speedy sorting.
Correct Answer: C
The MapReduce framework operates exclusively on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Reference: MapReduce Tutorial
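A minimal sketch of a custom intermediate key that satisfies this requirement (the class and field names are illustrative): it implements WritableComparable so the framework can serialize it between map and reduce and sort records by key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() {}                        // no-arg constructor required by the framework
    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                    // serialization used between map and reduce
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    @Override
    public int compareTo(YearKey other) {      // defines the sort order of intermediate keys
        return Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() { return year; }     // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}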
Question 35:
What data does a Reducer's reduce method process?
A. All the data in a single input file.
B. All data produced by a single mapper.
C. All data for a given key, regardless of which mapper(s) produced it.
D. All data for a given value, regardless of which mapper(s) produced it.
Correct Answer: C
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from
an input list. It then combines these values together, returning a single output value.
All values with the same key are presented to a single reduce task.
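A minimal sketch of a reduce method (the class and types are illustrative) that receives every value emitted for one key, no matter which mapper produced it, and aggregates them into a single output pair.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {    // all values for this key, from all mappers
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}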
Question 36:
For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key- value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Correct Answer: E
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
Question 37:
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously, and the output of whichever task finishes first is used. This is called:
A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution
Correct Answer: E
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks on its own:
a) The task does not report progress within the timeout (default is 10 minutes).
b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
c) Speculative execution means the results of the task are no longer needed because it has completed somewhere else.
Reference: Difference failed tasks vs killed tasks
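Speculative execution is enabled by default and can be toggled per job. A minimal MRv1-style Java sketch (the chosen values are only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class SpeculativeConfigExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMapSpeculativeExecution(true);      // allow redundant map attempts
        conf.setReduceSpeculativeExecution(false);  // disable speculation for reducers
        // Equivalent to setting these MRv1-era properties directly:
        // mapred.map.tasks.speculative.execution=true
        // mapred.reduce.tasks.speculative.execution=false
    }
}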
Question 38:
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
Correct Answer: C
The usage of the jar command is as follows:
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available to all the tasks, you can do either of the following:
1. Copy the jar file into the $HADOOP_HOME/lib directory, or
2. Use the generic option -libjars (see the command-line sketch below).
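A command-line sketch of option 2 (the jar names, driver class, and paths are illustrative); note that -libjars is handled by GenericOptionsParser, so the driver should be run through ToolRunner for the option to take effect:

hadoop jar stats-job.jar com.example.StatsDriver \
    -libjars /home/user/commons-math3-3.6.1.jar \
    /data/input /data/output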
Question 39:
Given a directory of files with the following structure: line number, tab character, string:
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across you cluster?
A. You will not be able to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple reducers.
Correct Answer: C
Multiple reducers and total ordering
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file. To do this you'll need a total ordering of the keys across the reducers, which Hadoop provides through the TotalOrderPartitioner.
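A minimal Java sketch (the keys and reducer count are illustrative) of why the default HashPartitioner breaks global ordering: keys are assigned to reducers by hash, so adjacent keys usually land in different output files.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionDemo {
    public static void main(String[] args) {
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<>();
        int numReducers = 3;
        for (String k : new String[] {"apple", "banana", "cherry", "date"}) {
            int p = partitioner.getPartition(new Text(k), new Text(""), numReducers);
            System.out.println(k + " -> reducer " + p);
        }
        // Neighbouring keys can land on different reducers, so concatenating
        // part-r-00000..part-r-00002 does not yield one globally sorted file.
    }
}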