Exam Details

  • Exam Code: PROFESSIONAL-DATA-ENGINEER
  • Exam Name: Professional Data Engineer on Google Cloud Platform
  • Certification: Google Certifications
  • Vendor: Google
  • Total Questions: 331 Q&As
  • Last Updated: May 08, 2024

Google Certifications PROFESSIONAL-DATA-ENGINEER Questions & Answers

  • Question 41:

    You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

    A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.

    B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.

    C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.

    D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
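
    For context on option C: with Dataflow, the autoscaling behaviour is simply left at its default in the pipeline options. The Python sketch below is illustrative only; the project, region, bucket, topic, table, and schema names are placeholder assumptions, not part of the question.

    ```python
    # Minimal sketch: a streaming Dataflow job that reads JSON from Pub/Sub and
    # writes to BigQuery, leaving the autoscaling algorithm at its default so
    # the worker count tracks input volume. All names below are placeholders.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder
        region="us-central1",                 # placeholder
        temp_location="gs://my-bucket/tmp",   # placeholder
        streaming=True,
        # autoscaling_algorithm is intentionally left at its default
        # (throughput-based), so no extra flag is needed for option C's behaviour.
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")   # placeholder topic
            | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",                        # placeholder table
                schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",  # placeholder
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
    ```

    Job system lag for such a pipeline can then be watched in Stackdriver (Cloud Monitoring) without resizing anything by hand.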

  • Question 42:

    You are building a report-only data warehouse where the data is streamed into BigQuery via the streaming API. Following Google's best practices, you have both a staging and a production table for the data. How should you design your data loading to ensure that there is only one master dataset without affecting performance on either the ingestion or reporting pieces?

    A. Have a staging table that is an append-only model, and then update the production table every three hours with the changes written to staging

    B. Have a staging table that is an append-only model, and then update the production table every ninety minutes with the changes written to staging

    C. Have a staging table that moves the staged data over to the production table and deletes the contents of the staging table every three hours

    D. Have a staging table that moves the staged data over to the production table and deletes the contents of the staging table every thirty minutes
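
    Options A and B describe an append-only staging table whose changes are periodically applied to production. As an illustration only, a scheduled update might look like the sketch below; the table names and the ingest_ts change-tracking column are placeholder assumptions.

    ```python
    # Sketch: periodically apply rows that landed in the append-only staging
    # table to the production table. Run this on a schedule (e.g., every three
    # hours). Table names and the ingest_ts column are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
    INSERT INTO `my-project.dw.transactions_prod`
    SELECT *
    FROM `my-project.dw.transactions_staging`
    WHERE ingest_ts > (SELECT IFNULL(MAX(ingest_ts), TIMESTAMP '1970-01-01')
                       FROM `my-project.dw.transactions_prod`)
    """
    client.query(sql).result()  # wait for the scheduled update to complete
    ```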

  • Question 43:

    You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether these applications have defaulted. You have been asked to train a model to predict default rates for credit applicants.

    What should you do?

    A. Increase the size of the dataset by collecting additional data.

    B. Train a linear regression to predict a credit default risk score.

    C. Remove the bias from the data and collect applications that have been declined loans.

    D. Match loan applicants with their social profiles to enable feature engineering.

  • Question 44:

    You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.)

    A. Configure your Cloud Dataflow pipeline to use local execution

    B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions

    C. Increase the number of nodes in the Cloud Bigtable cluster

    D. Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable

    E. Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable
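
    For reference, option B's setting is named maxNumWorkers in the Dataflow Java SDK; the Python SDK exposes the equivalent worker option as max_num_workers. A minimal illustrative sketch, with project, region, and bucket values as placeholders:

    ```python
    # Sketch: raising the Dataflow autoscaling ceiling so more workers can be
    # used for the write-heavy pipeline. All names below are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder
        region="us-central1",                 # placeholder
        temp_location="gs://my-bucket/tmp",   # placeholder
        streaming=True,
        max_num_workers=50,                   # allow autoscaling up to 50 workers
    )
    ```

    Option C is applied on the Bigtable side instead, e.g. `gcloud bigtable clusters update my-cluster --instance=my-instance --num-nodes=6` (cluster, instance, and node count here are placeholders).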

  • Question 45:

    You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

    A. Increase the share of the test sample in the train-test split.

    B. Try to collect more data and increase the size of your dataset.

    C. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.

    D. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
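
    Option C names standard regularization techniques. Purely as an illustration, adding dropout to a small Keras regressor might look like the sketch below; the layer sizes and feature dimension are arbitrary placeholders.

    ```python
    # Illustrative only: dropout layers added to a simple Keras regression model.
    # Layer sizes and the input feature dimension are arbitrary placeholders.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(300,)),   # placeholder feature dimension
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),   # randomly drops 30% of activations during training
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1),       # single regression output
    ])
    model.compile(
        optimizer="adam",
        loss="mse",
        metrics=[tf.keras.metrics.RootMeanSquaredError()],
    )
    ```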

  • Question 46:

    You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?

    A. Use bq load to load a batch of sensor data every 60 seconds.

    B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.

    C. Use the INSERT statement to insert a batch of data every 60 seconds.

    D. Use the MERGE statement to apply updates in batch every 60 seconds.
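
    For reference, the "available within 1 minute" requirement points toward streaming ingestion rather than periodic batches; a streaming Dataflow pipeline typically writes to BigQuery via streaming inserts. A minimal illustrative sketch of a streaming insert with the google-cloud-bigquery client follows; the table ID and row payload are placeholder assumptions.

    ```python
    # Sketch: streaming rows into BigQuery so they become queryable within
    # seconds, rather than waiting for batch load jobs. Table ID and row
    # contents are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.sensors.readings"   # placeholder

    rows = [
        {"sensor_id": "s-001", "ts": "2024-05-08T12:00:00Z", "value": 21.4},
    ]
    errors = client.insert_rows_json(table_id, rows)  # streaming insert (tabledata.insertAll)
    if errors:
        print("Insert errors:", errors)
    ```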

  • Question 47:

    You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?

    A. PigLatin using Pig

    B. HiveQL using Hive

    C. Java using MapReduce

    D. Python using MapReduce

  • Question 48:

    You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30–90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries. What should you do?

    A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.

    B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.

    C. Modify your pipeline to maintain the last 30–90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.

    D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.
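
    Option A refers to BigQuery's partitioned-table DDL. A hedged sketch using the Python client is shown below; the project, dataset, table, and column names are placeholder assumptions.

    ```python
    # Sketch: re-create the table partitioned on a DATE/TIMESTAMP column so that
    # date-filtered queries prune partitions instead of scanning the full table.
    # Dataset, table, and column names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
    CREATE TABLE `my-project.analytics.events_partitioned`
    PARTITION BY DATE(event_ts) AS
    SELECT * FROM `my-project.analytics.events`
    """
    client.query(ddl).result()  # run the DDL as a query job
    ```

    With the table partitioned this way, a WHERE filter on the date column prunes to the matching partitions, so a 30–90 day query no longer scans the full three-year history.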

  • Question 49:

    You are testing a Dataflow pipeline to ingest and transform text files. The files are compressed with gzip, errors are written to a dead-letter queue, and you are using SideInputs to join data. You noticed that the pipeline is taking longer to complete than expected. What should you do to expedite the Dataflow job?

    A. Switch to compressed Avro files

    B. Reduce the batch size

    C. Retry records that throw an error

    D. Use CoGroupByKey instead of the SideInput
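
    Option D refers to Beam's CoGroupByKey transform as an alternative to a side-input join. A minimal, self-contained illustration with made-up data:

    ```python
    # Sketch: joining two keyed PCollections with CoGroupByKey instead of
    # passing one of them as a side input. The data is purely illustrative.
    import apache_beam as beam

    with beam.Pipeline() as p:
        orders = p | "Orders" >> beam.Create([("u1", "order-1"), ("u2", "order-2")])
        users = p | "Users" >> beam.Create([("u1", "Alice"), ("u2", "Bob")])

        (
            {"orders": orders, "users": users}
            | "Join" >> beam.CoGroupByKey()   # yields (key, {"orders": [...], "users": [...]})
            | "Format" >> beam.Map(lambda kv: (kv[0], kv[1]["users"], kv[1]["orders"]))
            | "Print" >> beam.Map(print)
        )
    ```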

  • Question 50:

    You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

    A. Organize your data in a single table, then export, compress, and store the BigQuery data in Cloud Storage.

    B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.

    C. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.

    D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
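
    Options A and B describe BigQuery extract jobs with compression. A minimal illustrative sketch using the Python client follows; the project, table, and bucket names are placeholder assumptions.

    ```python
    # Sketch: export one (monthly) table to gzip-compressed files in Cloud
    # Storage as a storage-cost-optimized backup. All names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.ExtractJobConfig(
        compression=bigquery.Compression.GZIP,
        destination_format=bigquery.DestinationFormat.CSV,
    )
    extract_job = client.extract_table(
        "my-project.analytics.sales_2024_04",             # placeholder monthly table
        "gs://my-backup-bucket/sales_2024_04/*.csv.gz",   # placeholder bucket/path
        job_config=job_config,
    )
    extract_job.result()  # wait for the export to finish
    ```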

Tips on How to Prepare for the Exams

Nowadays, certification exams are becoming more and more important and are required by more and more enterprises when hiring. But how do you prepare for an exam effectively? How do you prepare in a short time with less effort? How do you get an ideal result, and how do you find the most reliable resources? Here on Vcedump.com, you will find all the answers. Vcedump.com provides not only Google exam questions, answers, and explanations but also complete assistance with your exam preparation and certification application. If you are unsure about your PROFESSIONAL-DATA-ENGINEER exam preparation or your Google certification application, do not hesitate to visit Vcedump.com to find your solutions.