Which recommender system technique is domain-specific?
A. Content-based filtering
B. Item-based collaborative filtering
C. User-based collaborative filtering
D. Naïve Bayes classifier
You are about to sample a 100-dimensional unit cube. To adequately sample any single given dimension, you need to capture only 10 points. How many points do you need in order to sample the complete 100-dimensional unit cube adequately?
A. 100^10
B. 10^10
C. log2(100)
D. 100
E. 1000
F. 10^100
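A quick sanity check of the arithmetic behind this question: grid sampling at a fixed per-dimension density multiplies the point count across dimensions, which is the curse of dimensionality in action. A minimal sketch:

```python
# Grid sampling: the point budget multiplies across dimensions
# (the "curse of dimensionality").
points_per_dim = 10
dims = 100
total_points = points_per_dim ** dims  # 10^100 points for the full cube

# For comparison, adequately sampling a single dimension needs only 10 points.
print(total_points)
```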
You have acquired a new data source of millions of customer records, and you have imported this data into HDFS. Prior to analysis, you want to change all customer registration dates to the same format, make all addresses uppercase, and remove all customer names (for anonymization). Which process will accomplish all three objectives?
A. Adapt the data cleansing module in Mahout to your data, and invoke the Mahout library when you run your analysis
B. Pull this data into an RDBMS using sqoop and scrub records using stored procedures
C. Write a script that receives records on stdin, corrects them, and then writes them to stdout. Then, invoke this script in a map-only Hadoop Streaming Job
D. Write a MapReduce job with a mapper to change words to uppercase and to reduce different forms of dates to a single form
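The stdin-to-stdout scrubber described in option C can be sketched as a small script. This is a hedged sketch: the tab-separated field layout (name, address, registration date) and the candidate input date formats are assumptions, not given in the question:

```python
import sys
from datetime import datetime

# Assumed input date formats; real data would need its own list.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y")

def normalize_date(s):
    """Rewrite any recognized date format as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return s  # leave unrecognized values untouched

def scrub(record):
    # Assumed tab-separated layout: name, address, registration date.
    name, address, reg_date = record.rstrip("\n").split("\t")
    # Drop the name (anonymization), uppercase the address, normalize the date.
    return "\t".join([address.upper(), normalize_date(reg_date)])

if __name__ == "__main__":
    # Run as a map-only Hadoop Streaming job: read stdin, write stdout.
    for line in sys.stdin:
        print(scrub(line))
```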
A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.
You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the distribution of bug fixes per engineer?
A. A bar chart of engineers vs. number of bugs fixed
B. A scatter plot of engineers vs. number of bugs fixed
C. A normal distribution of the mean and standard deviation of bug fixes per engineer
D. A histogram that groups engineers together based on the number of bugs they fixed
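A quick way to see why a histogram fits this scenario: grouping engineers by their bug-fix count exposes the distribution that the mean of five hides. A sketch with made-up counts (the actual per-engineer numbers are not given in the question):

```python
from collections import Counter

# Hypothetical counts for 20 engineers: they sum to 100 (mean of 5),
# yet no engineer fixed exactly 5 bugs, matching the scenario.
bug_fixes = [0, 1, 1, 2, 2, 3, 3, 4, 4, 4, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10]

histogram = Counter(bug_fixes)  # bug count -> number of engineers
mean = sum(bug_fixes) / len(bug_fixes)

print(mean)          # the mean alone suggests a "typical" engineer at 5
print(histogram[5])  # the histogram shows nobody is actually there
```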
A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week. One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate how hard a particular bug is to fix?
A. The tech lead's estimate of how many hours would be needed to fix the bug.
B. The priority of the bug according to the project manager
C. The number of years that the engineer who was assigned the bug has worked at the company
D. The number of bugs that had been found in each sub-component of the project
In what way can Hadoop be used to improve the performance of Lloyd's algorithm for k-means clustering on large data sets?
A. Parallelizing the centroid computations to improve numerical stability
B. Distributing the updates of the cluster centroids
C. Reducing the number of iterations required for the centroids to converge
D. Mapping the input data into a non-Euclidean metric space
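To make the "distributing the centroid updates" idea concrete, one Lloyd's iteration can be phrased as a map step (assign each point to its nearest centroid, which parallelizes trivially) and a reduce step (average each cluster's points). A toy 1-D sketch; the data and initial centroids are made up:

```python
def map_assign(points, centroids):
    # "Map": emit (cluster_id, point) for the nearest centroid.
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
        yield cid, p

def reduce_update(pairs, k):
    # "Reduce": average the points assigned to each cluster.
    sums, counts = [0.0] * k, [0] * k
    for cid, p in pairs:
        sums[cid] += p
        counts[cid] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # toy data
centroids = [0.0, 10.0]                    # initial guesses
new_centroids = reduce_update(map_assign(points, centroids), k=2)
```

Each iteration maps to one MapReduce pass: Hadoop distributes the assignment and averaging work across the data, but it does not change how many iterations are needed to converge.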
You have a data file that contains two trillion records, one record per line (comma separated). Each record lists two friends and a unique message sent between them. Names will not contain commas.

Michael, John, Pabst, Blue Ribbon
Tiffany, James, BMX Racing
John, Michael, Natural Lemon Flavor

Analyze the pseudo code examples below and determine which set of mappers and reducers will solve for the mean number of messages each user sends to all of his or her friends.
For example, Michael may have three friends to whom he sends 6, 10, and 200 messages, respectively, so Michael's mean would be (6+10+200)/3. The solution may require a pipeline of two MapReduce jobs.
A.
def mapper1(line):
    key1, key2, message = line.split(',')
    emit((key1, key2), 1)

def reducer1(key, values):
    emit(key, sum(values))

def mapper2(key, value):
    key1, key2 = key  // unpack both friends' names into separate keys
    emit(key1, value)

def reducer2(key, values):
    emit(key, mean(values))
B.
def mapper1(line):
    key1, key2, message = line.split(',')
    emit((key1, key2), 1)
    emit((key1, key2), 1)

def reducer1(key, values):
    emit(key, sum(values))

def mapper2(key, value):
    key1, key2 = key  // unpack both friends' names into separate keys
    emit(key1, value)

def reducer2(key, values):
    emit(key, mean(values))
C.
def mapper1(line):
    key1, key2, message = line.split(',')
    emit((key1, key2), 1)
    emit((key1, key2), 1)

def reducer1(key, values):
    emit(key, sum(values))
D.
def mapper1(line):
    key1, key2, message = line.split(',')
    sort(key1, key2)  // a given pair will always be sorted the same
    emit((key1, key2), 1)

def reducer1(key, values):
    emit(key, sum(values))

def mapper2(key, value):
    key1, key2 = key  // unpack both friends' names into separate keys
    emit(key1, value)
    emit(key2, value)

def reducer2(key, values):
    emit(key, mean(values))
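Option D's two-job pipeline can be simulated locally in plain Python to see what it computes. The sample records are the ones from the question; `split(",", 2)` mirrors the pseudo code's split while tolerating commas inside the message:

```python
from collections import defaultdict
from statistics import mean

lines = [
    "Michael,John,Pabst, Blue Ribbon",
    "Tiffany,James,BMX Racing",
    "John,Michael,Natural Lemon Flavor",
]

# Job 1: count messages per friend pair, sorting the pair so that both
# directions of a conversation land on the same key.
pair_counts = defaultdict(int)
for line in lines:
    key1, key2, message = line.split(",", 2)
    pair = tuple(sorted((key1, key2)))
    pair_counts[pair] += 1

# Job 2: emit each pair's count under both friends, then average per user.
per_user = defaultdict(list)
for (key1, key2), count in pair_counts.items():
    per_user[key1].append(count)
    per_user[key2].append(count)

means = {user: mean(counts) for user, counts in per_user.items()}
```

On this sample, the Michael/John pair carries two messages and each of the other users has a single one-message pair.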
You have just run a MapReduce job to filter user messages to only those of a selected geographical region. The output of this job is in a directory named westUsers, located just below your home directory in HDFS. Which command gathers these records into a single file on your local file system?
A. hadoop fs -getmerge westUsers westUsers.txt
B. hadoop fs -get westUsers westUsers.txt
C. hadoop fs -cp westUsers/* westUsers.txt
D. hadoop fs -getmerge -R westUsers westUsers.txt
A function is convex if, for any two points a and b, the line segment between (a, f(a)) and (b, f(b)) lies on or above the graph of the function. Which two functions are convex?
A. x^(1/2)
B. e^x
C. 2x - 1
D. 1 - x^2
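The convexity condition can be spot-checked numerically with the midpoint inequality f((a+b)/2) <= (f(a)+f(b))/2 over sampled points. A sketch on a positive domain (so x^(1/2) is defined); the sample grid and tolerance are arbitrary choices:

```python
import math

def looks_convex(f, xs, tol=1e-9):
    # Midpoint test: convex functions satisfy f((a+b)/2) <= (f(a)+f(b))/2.
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + tol
               for a in xs for b in xs)

xs = [x / 10 for x in range(1, 31)]  # sample points in (0, 3]

results = {
    "x^(1/2)": looks_convex(lambda x: x ** 0.5, xs),   # concave
    "e^x":     looks_convex(math.exp, xs),             # convex
    "2x - 1":  looks_convex(lambda x: 2 * x - 1, xs),  # linear, hence convex
    "1 - x^2": looks_convex(lambda x: 1 - x * x, xs),  # concave
}
```

A numeric check over a finite grid can only refute convexity, not prove it, but it is enough to separate these four candidates.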
You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25 KB. Because your Hadoop cluster isn't optimized for storing and processing many small files, you decide to do the following actions:
1. Group the individual images into a set of larger files
2. Use the set of larger files as input for a MapReduce job that processes them directly with Python using Hadoop Streaming
Which data serialization system gives you the flexibility to do this?
A. CSV
B. XML
C. HTML
D. Avro
E. Sequence Files
F. JSON