Which tool is best suited to import a portion of a relational database every day as files into HDFS, and to generate Java classes to interact with that imported data?
A. Oozie
B. Flume
C. Pig
D. Hue
E. Hive
F. Sqoop
G. fuse-dfs
Correct Answer: F
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
Imports individual tables or entire databases to files in HDFS Generates Java classes to allow you to
interact with your imported data Provides the ability to import from SQL databases straight into your Hive
data warehouse
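Sqoop is normally driven from the command line, but the same import can also be triggered from Java. A minimal sketch, assuming Sqoop 1.x's org.apache.sqoop.Sqoop.runTool entry point is on the classpath; the JDBC URL, credentials, table, filter, and target directory below are hypothetical placeholders:

import org.apache.sqoop.Sqoop;

public class DailyImport {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/mydb",    // hypothetical source database
            "--username", "etl_user",
            "--table", "employees",
            "--where", "updated_at >= CURRENT_DATE()",  // import only today's portion
            "--target-dir", "/data/employees/incoming"  // files land here in HDFS
        };
        // Alongside the imported files, Sqoop writes a generated Java record
        // class (here, employees.java) for interacting with the imported rows.
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}

Scheduling this daily (for example via cron or an Oozie coordinator) matches the scenario in the question.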
Note:
Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.
Note:
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement between hadoop and relational databases, second paragraph)
Question 23:
Given the following Hive command:
INSERT OVERWRITE TABLE mytable SELECT * FROM myothertable;
Which one of the following statements is true?
A. The contents of myothertable are appended to mytable
B. Any existing data in mytable will be overwritten
C. A new table named mytable is created, and the contents of myothertable are copied into mytable
D. The statement is not a valid Hive command
Correct Answer: B
Explanation: INSERT OVERWRITE replaces the existing contents of mytable with the rows selected from myothertable; an INSERT INTO statement would append instead. The target table must already exist, so no new table is created.
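As an illustration, a minimal sketch that issues both statement forms through the HiveServer2 JDBC driver; the endpoint is a hypothetical placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OverwriteVsAppend {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Replaces any existing data in mytable (answer B).
            stmt.execute("INSERT OVERWRITE TABLE mytable SELECT * FROM myothertable");
            // By contrast, INSERT INTO appends to mytable's current contents.
            stmt.execute("INSERT INTO TABLE mytable SELECT * FROM myothertable");
        }
    }
}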
Question 24:
Which project gives you a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data?
A. HBase
B. Hue
C. Pig
D. Hive
E. Oozie
F. Flume
G. Sqoop
Correct Answer: A
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop
clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-
oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by
Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System,
Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server-side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible JRuby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)
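To make the random, realtime read/write point concrete, a minimal sketch against the older (0.9x-era) HBase Java client API; the table, column family, and row key are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");  // hypothetical table
        // Random write: a single cell, visible to readers immediately.
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes("ok"));
        table.put(put);
        // Random read of the same row, with no MapReduce job involved.
        Result result = table.get(new Get(Bytes.toBytes("row-42")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("status"))));
        table.close();
    }
}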
Question 26:
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
A. There is no difference in output between the two settings.
B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Correct Answer: D
Explanation: * It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
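A minimal driver sketch for the map-only variant of the job in this question, using the org.apache.hadoop.mapreduce API; the nested RegexMapper and its pattern are hypothetical stand-ins for the mapper the question describes:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepDriver {

    public static class RegexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Pattern PATTERN = Pattern.compile("ERROR\\s+\\w+"); // hypothetical pattern

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            Matcher m = PATTERN.matcher(line.toString());
            while (m.find()) {
                // key = matching text, value = filename and byte offset, as in the question
                ctx.write(new Text(m.group()), new Text(file + ":" + offset));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "regex grep");
        job.setJarByClass(GrepDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(RegexMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Zero reducers: each mapper's output goes straight to HDFS as its own
        // part-m-NNNNN file (answer D). With one reducer instead, all matches
        // would be shuffled into a single part-r-00000 file.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}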
Note:
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update
Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
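As a minimal sketch of the old-API signature quoted in the note above (the summing logic is just a hypothetical example):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            reporter.progress();                   // tell the framework we are alive
        }
        output.collect(key, new IntWritable(sum)); // written to the FileSystem, unsorted
    }
}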
Question 27:
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously and the output of whichever finishes first is used. This is called:
A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution
Correct Answer: E
Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution.
When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce
Note:
*
Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.
*
There are a few reasons Hadoop can kill tasks on its own (as distinct from failed tasks, which are tasks that error out):
a) The task does not report progress during the timeout (default is 10 minutes).
b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
c) Speculative execution means the task's results are no longer needed because a copy completed elsewhere first.
Reference: Difference failed tasks vs killed tasks
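Speculative execution can be toggled per job. A minimal sketch, assuming the Hadoop 2 property names; older releases use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable redundant task attempts for both phases; useful when task
        // side effects (e.g. writes to an external system) must not run twice.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "no-speculation-job");
        // ... configure mapper, reducer, and input/output paths as usual ...
    }
}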
Question 28:
Consider the following two relations, A and B.
What is the output of the following Pig commands?
X = GROUP A BY S1;
DUMP X;
A. Option A
B. Option B
C. Option C
D. Option D
Correct Answer: D
Question 29:
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
A. Health status checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Correct Answer: BC
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global
ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in
the classical sense of MapReduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two things that are unfortunately conflated in current Hadoop, specifically in (mainly) the JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available. Under
YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done separately for each application.
Question 30:
Assuming default settings, which best describes the order of data provided to a reducer's reduce method?
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order
Correct Answer: D
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce; a minimal sketch follows below.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> pair in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
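A minimal sketch of the grouping-comparator technique from the Secondary Sort note above, assuming a hypothetical composite Text key of the form natural<TAB>secondary:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite "natural<TAB>secondary" keys by the natural part only, so all
// values for one natural key arrive in a single reduce() call while still being
// ordered by the full composite key.
public class NaturalKeyGroupingComparator extends WritableComparator {

    public NaturalKeyGroupingComparator() {
        super(Text.class, true); // create Text instances for deserialized comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        String naturalA = a.toString().split("\t", 2)[0];
        String naturalB = b.toString().split("\t", 2)[0];
        return naturalA.compareTo(naturalB);
    }
}

// Registered in the driver with:
//   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
// Keys are still sorted by the entire composite key, but grouping ignores the
// secondary part, which is how values can be made to arrive in a chosen order
// even though Hadoop itself does not sort values.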