Which word or phrase completes the statement? A data scientist would consider that an RDBMS is to a table as R is to a ______________.
A. Data frame
B. List
C. Matrix
D. Array
The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight. They have identified several popular product review blogs that historically have published thousands of user reviews of your company's products. You have been asked to provide the desired analysis. You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product's name and extract the relevant text from each matching review. What is the next step you should take?
A. Convert the extracted text into a suitable document representation and index into a review corpus
B. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
C. Read the extracted text for each review and manually tabulate the results
D. Group the reviews using Naïve Bayesian classification
In which lifecycle stage are initial hypotheses formed?
A. Discovery
B. Model planning
C. Model building
D. Data preparation
You are given 10,000,000 user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed?
A. Run MapReduce to transform the data, and find relevant key value pairs.
B. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively.
C. Run a Naive Bayes classification as a pre-processing step in HDFS.
D. Partition the data by XML file size, and run K-means clustering in each partition.
You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?
A. 42.0
B. 4.2
C. 0.42
D. 0.042
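One property of logistic regression that is relevant here: when the model includes an intercept and is fitted by maximum likelihood, the fitted probabilities over the whole training set sum to the number of positive cases. The sketch below checks this numerically on a small made-up data set. The data, the function names, and the plain gradient-ascent fitting loop are illustrative assumptions, not part of the question.

```python
import math

# Hypothetical toy data: one centered feature and a binary outcome.
xs = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
ys = [0, 0, 1, 0, 1, 1, 0, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient with respect to the intercept and the slope.
        g0 = sum(y - p for y, p in zip(ys, preds)) / n
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, preds)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1


b0, b1 = fit_logistic(xs, ys)
fitted = [sigmoid(b0 + b1 * x) for x in xs]

# At the optimum the intercept gradient is zero, so sum(y - p) = 0.
print("number of positives:", sum(ys))
print("sum of fitted probabilities:", round(sum(fitted), 6))
```

Applied to the scenario above, the same property ties the sum of fitted probabilities across all 1000 filers to the count of audited filers in the training data.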
You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only.
After a preliminary analysis of the data, the following findings were made:
1. Multicollinearity is not an issue among the variables
2. Only three variables, A, B, and C, have significant correlation with sales
You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R² of the model without artificially inflating it?
A. Create clusters based on the data and use them as model inputs
B. Force all 15 variables into the model as independent variables
C. Create interaction variables based only on variables A, B, and C
D. Break variables A, B, and C into their own univariate models
You have two tables of customers in your database. Customers in cust_table_1 were sent an e-mail promotion last year, and customers in cust_table_2 received a newsletter last year. Each customer can appear only once per table. You want to create a table that includes all customers, and any of the communications they received last year. Which type of join would you use for this table?
A. Full outer join
B. Inner join
C. Left outer join
D. Cross join
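To make the join types concrete, here is a minimal sketch of how a full outer join behaves on two tables shaped like the ones in the scenario. The customer IDs and the dictionary layout are hypothetical; only the table names come from the question.

```python
# Hypothetical sample data: customer id -> communication received last year.
cust_table_1 = {101: "e-mail promotion", 102: "e-mail promotion", 103: "e-mail promotion"}
cust_table_2 = {102: "newsletter", 104: "newsletter"}


def full_outer_join(left, right):
    """Return one row per customer id found in either table."""
    rows = {}
    for cust_id in sorted(set(left) | set(right)):
        # None marks a communication the customer did not receive.
        rows[cust_id] = (left.get(cust_id), right.get(cust_id))
    return rows


combined = full_outer_join(cust_table_1, cust_table_2)
for cust_id, (promotion, newsletter) in combined.items():
    print(cust_id, promotion, newsletter)
```

An inner join would keep only customers present in both tables, and a left outer join would drop customers who appear only in cust_table_2.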
For which class of problem is MapReduce most suitable?
A. Embarrassingly parallel
B. Minimal result data
C. Simple marginalization tasks
D. Non-overlapping queries
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
A. Define the process to maintain the model
B. Try different analytical techniques
C. Try different variables
D. Transform existing variables
Since R factors are categorical variables, they are most closely related to which data classification level?
A. nominal
B. ordinal
C. interval
D. ratio