Professional Data Engineer on Google Cloud Platform
Exam Details
Exam Code: PROFESSIONAL-DATA-ENGINEER
Exam Name: Professional Data Engineer on Google Cloud Platform
Certification: Google Certifications
Vendor: Google
Total Questions: 331 Q&As
Last Updated: May 19, 2025
Google Certifications PROFESSIONAL-DATA-ENGINEER Questions & Answers
Question 251:
You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers' information, such as their preferences, is hosted in a Cloud SQL for MySQL database. Your CRM database is hosted on a Cloud SQL for PostgreSQL instance. The marketing team wants to use your customers' information from the two databases and the customer behavioral data to create marketing campaigns for yearly active customers. You need to ensure that the marketing team can run the campaigns over 100 times a day on typical days and up to 300 times during sales. At the same time, you want to keep the load on the Cloud SQL databases to a minimum. What should you do?
A. Create BigQuery connections to both Cloud SQL databases. Use BigQuery federated queries on the two databases and the Google Analytics data in BigQuery to run these queries.
B. Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.
C. Create a Dataproc cluster with Trino to establish connections to both Cloud SQL databases and BigQuery, to execute the queries.
D. Create a job on Apache Spark with Dataproc Serverless to query both Cloud SQL databases and the Google Analytics data on BigQuery for these queries.
Correct Answer: B
Datastream is a serverless change data capture (CDC) and replication service that streams data changes from relational databases such as MySQL, PostgreSQL, and Oracle into Google Cloud destinations such as BigQuery and Cloud Storage. Datastream captures and delivers database changes in near real time, with minimal impact on source database performance. It also preserves the schema and data types of the source database and automatically creates and updates the corresponding tables in BigQuery. By using Datastream, you can replicate the required tables from both Cloud SQL databases to BigQuery and keep them in sync with the sources. This reduces the load on the Cloud SQL databases, because the marketing team runs its queries against the BigQuery copies instead of the Cloud SQL tables, and it lets you leverage the scalability and performance of BigQuery to join the replicated customer data with the Google Analytics behavioral data. The queries can then run as frequently as needed (100-300 times a day) without affecting the Cloud SQL instances.

Option A is not a good solution. BigQuery federated queries let you query external data sources such as Cloud SQL, but they do not reduce the load on the source databases; in fact, they can increase it, because each federated query executes against the external source and returns the results to BigQuery. Federated queries also have limitations around data type mappings, quotas, and performance.

Option C is not a good solution. Creating a Dataproc cluster with Trino requires more resources and management overhead than using Datastream: you must provision, configure, and monitor cluster nodes, install and configure the Trino connectors for Cloud SQL and BigQuery, and write the queries in Trino's SQL dialect. Trino also does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would remain high.

Option D is not a good solution. A Spark job on Dataproc Serverless requires more coding and processing than Datastream: you must write code in Python, Scala, Java, or R and use the Spark connectors for Cloud SQL and BigQuery to access the data. Like Trino, Spark does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would remain high.

References: Datastream overview | Datastream | Google Cloud; Datastream concepts | Datastream | Google Cloud; Datastream quickstart | Datastream | Google Cloud; Introduction to federated queries | BigQuery | Google Cloud; Trino overview | Dataproc Documentation | Google Cloud; Dataproc Serverless overview | Dataproc Documentation | Google Cloud; Apache Spark overview | Dataproc Documentation | Google Cloud.
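Once the tables are replicated, the campaign queries are plain BigQuery SQL over the replicated tables joined with the Google Analytics data. The following is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical placeholders, not part of the question.

from google.cloud import bigquery

client = bigquery.Client()

# Join the Datastream-replicated Cloud SQL tables with the Google Analytics
# data that already lands in BigQuery. All table names below are assumptions.
query = """
SELECT c.customer_id, c.preferences, COUNT(e.event_id) AS yearly_events
FROM `my-project.replicated_mysql.customer_preferences` AS c
JOIN `my-project.replicated_postgres.crm_customers` AS crm
  ON crm.customer_id = c.customer_id
JOIN `my-project.analytics.ga_events` AS e
  ON e.customer_id = c.customer_id
WHERE e.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
GROUP BY c.customer_id, c.preferences
"""

# The query runs entirely inside BigQuery, so repeating it 100-300 times a day
# adds no load to the Cloud SQL source databases.
for row in client.query(query).result():
    print(row.customer_id, row.yearly_events)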
Question 252:
You orchestrate ETL pipelines by using Cloud Composer. One of the tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-party service. You want to be notified when the task does not succeed. What should you do?
A. Configure a Cloud Monitoring alert on the sla_missed metric associated with the task at risk to trigger a notification.
B. Assign a function with notification logic to the sla_miss_callback parameter for the operator responsible for the task at risk.
C. Assign a function with notification logic to the on_retry_callback parameter for the operator responsible for the task at risk.
D. Assign a function with notification logic to the on_failure_callback parameter for the operator responsible for the task at risk.
Correct Answer: D
By assigning a function with notification logic to the on_failure_callback parameter, you can customize the action taken when a task fails in your DAG [1]. For example, you can send an email, a Slack message, or a PagerDuty alert to notify yourself or your team about the failure [2]. This is more direct and reliable than configuring a Cloud Monitoring alert on the sla_missed metric, which only triggers when a task misses its scheduled deadline [3]. The sla_miss_callback parameter is likewise tied to SLA misses: it runs when the task has not succeeded by its scheduled execution date plus its SLA [4]. The on_retry_callback parameter runs before a task is retried [4]. Neither is suitable for notifying when a task does not succeed, because they depend on the task's schedule and retry settings rather than on the final task status. A minimal sketch of option D follows the references below. References:
[1] Callbacks | Cloud Composer | Google Cloud
[2] How to Send an Email on Task Failure in Airflow - Astronomer
[3] Monitoring SLA misses | Cloud Composer | Google Cloud
[4] BaseOperator | Apache Airflow Documentation
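The following is a minimal sketch of option D, assuming Airflow 2.x as shipped with Cloud Composer 2; the DAG ID, the task, and the send_alert helper are illustrative placeholders rather than a prescribed implementation.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def send_alert(context):
    # Airflow invokes this with the task context when the task fails.
    # Replace the print with real notification logic (email, Slack, PagerDuty, ...).
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} failed in DAG {context['dag'].dag_id}")


def call_third_party_service():
    # Placeholder for the task that depends on the external service.
    raise RuntimeError("third-party service unavailable")


with DAG(dag_id="etl_pipeline", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    PythonOperator(
        task_id="third_party_task",
        python_callable=call_third_party_service,
        on_failure_callback=send_alert,  # fires only when the task ends in a failed state
    )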
Question 253:
You are creating a data model in BigQuery that will hold retail transaction data. Your two largest tables, sales_transaction_header and sales_transaction_line, have a tightly coupled, immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transaction_header and sales_transaction_line tables to improve the performance of data analytics queries. What should you do?
A. Create a sales_transaction table that stores the sales_transaction_header and sales_transaction_line data as a JSON data type.
B. Create a sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields.
C. Create a sales_transaction table that holds the sales_transaction_header and sales_transaction_line information as rows, duplicating the sales_transaction_header data for each line.
D. Create separate sales_transaction_header and sales_transaction_line tables and, when querying, specify the sales_transaction_line table first in the WHERE clause.
Correct Answer: B
BigQuery supports nested and repeated fields, which are complex data types that can represent hierarchical and one-to-many relationships within a single table. By using nested and repeated fields, you can denormalize your data model and reduce the number of joins required by your queries. This improves the performance and efficiency of analytics queries, because joins can be expensive and require shuffling data across nodes. Nested and repeated fields also preserve data integrity and avoid duplicating the header data on every line.

In this scenario, the sales_transaction_header and sales_transaction_line tables have a tightly coupled, immutable relationship: each header row corresponds to one or more line rows, and the data is rarely modified after load. It therefore makes sense to create a single sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. You can then query the sales transaction data without joining two tables, using dot notation and array functions (such as UNNEST) to access the nested and repeated fields.

For example, the sales_transaction table could have the following schema:

Field name            Type       Mode
id                    INTEGER    NULLABLE
order_time            TIMESTAMP  NULLABLE
customer_id           INTEGER    NULLABLE
line_items            RECORD     REPEATED
line_items.sku        STRING     NULLABLE
line_items.quantity   INTEGER    NULLABLE
line_items.price      FLOAT      NULLABLE

To compute the total amount of each order, you could run:

SELECT
  id,
  SUM(li.quantity * li.price) AS total_amount
FROM sales_transaction, UNNEST(line_items) AS li
GROUP BY id;

References: Use nested and repeated fields; BigQuery explained: Working with joins, nested and repeated data; Arrays in BigQuery -- How to improve query performance and optimise storage.
Question 254:
You have a network of 1,000 sensors. The sensors generate time series data: one metric per sensor per second, along with a timestamp. You already have 1 TB of data, and expect the data to grow by 1 GB every day. You need to access this data in two ways. The first access pattern requires retrieving the metric from one specific sensor stored at a specific timestamp, with a median single-digit millisecond latency. The second access pattern requires running complex analytic queries on the data, including joins, once a day. How should you store this data?
A. Store your data in Bigtable. Concatenate the sensor ID and timestamp, and use it as the row key. Perform an export to BigQuery every day.
B. Store your data in BigQuery. Concatenate the sensor ID and timestamp, and use it as the primary key.
C. Store your data in Bigtable. Concatenate the sensor ID and metric, and use it as the row key. Perform an export to BigQuery every day.
D. Store your data in BigQuery. Use the metric as a primary key.
Correct Answer: A
Option A meets both access patterns. It leverages the high performance and scalability of Bigtable for low-latency point reads on the sensor data, as well as the analytics capabilities of BigQuery for complex queries on large datasets. Using the sensor ID and timestamp as the row key keeps the data sorted and evenly distributed across Bigtable nodes and lets you retrieve the metric for a specific sensor at a specific time with a single point read. Exporting to BigQuery every day moves the data into a columnar format that is optimized for analytical queries, where you can take advantage of features such as partitioning, clustering, and caching.

Option B is not optimal because BigQuery is not designed for low-latency point queries. BigQuery does not enforce primary keys, so you would have to use a unique constraint or a hash to guarantee uniqueness, and because BigQuery charges by the amount of data scanned, frequent single-row lookups are slow and expensive.

Option C is not optimal because using the sensor ID and metric as the row key can cause data skew and hotspots in Bigtable: some sensors generate more metrics than others, and some metric values are more common than others. It also prevents you from looking up a reading by timestamp, which the first access pattern requires. This hurts Bigtable performance and availability, as well as the efficiency of the daily export to BigQuery.

Option D is not optimal because using the metric as a primary key leads to duplication and ambiguity: multiple sensors can report the same metric value at different times, and the same sensor can report different metrics at the same time. This undermines the accuracy and reliability of analytical queries and increases query cost and complexity.
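As an illustration of the first access pattern in option A, here is a minimal sketch using the google-cloud-bigtable Python client; the project, instance, table, column family, and qualifier names are hypothetical placeholders.

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("sensor-instance")
table = instance.table("sensor-metrics")

# Row key = sensor ID + '#' + timestamp, so reading one sensor at one
# timestamp is a single point read with single-digit-millisecond latency.
row_key = b"sensor-0042#2024-05-01T12:00:00Z"
row = table.read_row(row_key)

if row is not None:
    # Cells are keyed by column family and qualifier; 'metrics' and 'value'
    # are assumed names for this example.
    value = row.cells["metrics"][b"value"][0].value
    print(value)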
Question 255:
You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use a Google managed service to simplify and automate the task. You also want to accommodate Shared VPC networking considerations. What should you do?
A. Use Dataflow for your workflow pipelines. Use Cloud Run triggers for scheduling.
B. Use Dataflow for your workflow pipelines. Use shell scripts to schedule workflows.
C. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the host project.
D. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.
Correct Answer: D
Shared VPC requires that you designate a host project to which networks and subnetworks belong and a service project, which is attached to the host project. When Cloud Composer participates in a Shared VPC, the Cloud Composer environment is in the service project. Reference: https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc
Question 256:
You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern.
Which service do you select for storing and serving your data?
A. Cloud Spanner
B. Cloud Bigtable
C. Cloud Firestore
D. Cloud SQL
Correct Answer: D
Question 257:
You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources into Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization.
What should you do?
A. Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore
B. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore
C. Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.
D. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.
Correct Answer: B
A hopping window is a type of sliding window that advances by a fixed period of time, producing overlapping windows. This is suitable for the scenario where the system needs to aggregate data for the last 30 seconds, every 2 seconds, and provide real-time updates. A Dataflow pipeline can implement the hopping window logic using Apache Beam, and process both streaming and batch data sources. Memorystore is a low-latency, in-memory data store that can serve the aggregated data to the visualization layer. BigQuery is not a good choice for this scenario, as it is not optimized for low-latency queries and frequent updates.
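The following is a minimal sketch of the hopping (sliding) window in option B, using the Apache Beam Python SDK; the in-memory input and the print step stand in for the Pub/Sub source and the Memorystore sink.

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateEvents" >> beam.Create([("area-1", 1), ("area-2", 1), ("area-1", 1)])
        # 30-second windows that start every 2 seconds, so each element falls
        # into multiple overlapping windows (Beam calls these sliding windows).
        | "HoppingWindow" >> beam.WindowInto(SlidingWindows(size=30, period=2))
        # Aggregate supply/demand counts per area for each window.
        | "CountPerArea" >> beam.CombinePerKey(sum)
        | "EmitResults" >> beam.Map(print)
    )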
Question 258:
You have 100 GB of data stored in a BigQuery table. This data is outdated and will only be accessed one or two times a year for analytics with SQL. For backup purposes, you want to store this data to be immutable for 3 years. You want to minimize storage costs. What should you do?
A. 1. Create a BigQuery table clone.
2. Query the clone when you need to perform analytics.
B. 1. Create a BigQuery table snapshot.
2. Restore the snapshot when you need to perform analytics.
C. 1. Perform a BigQuery export to a Cloud Storage bucket with archive storage class.
2. Enable versioning on the bucket.
3. Create a BigQuery external table on the exported files.
D. 1. Perform a BigQuery export to a Cloud Storage bucket with archive storage class.
2. Set a locked retention policy on the bucket.
3. Create a BigQuery external table on the exported files.
Correct Answer: D
This option stores the data in the lowest-cost storage option, because the Archive storage class has the lowest price per GB among the Cloud Storage classes. A locked retention policy makes the data immutable for 3 years by preventing deletion or overwriting of the objects until the retention period expires. You can still query the data with SQL by creating a BigQuery external table that references the exported files in the Cloud Storage bucket.

Option A is incorrect because a BigQuery table clone keeps the data in BigQuery storage, so it does not reduce storage costs. Option B is incorrect because a BigQuery table snapshot likewise keeps the data in BigQuery storage and does not meet the cost goal. Option C is incorrect because enabling versioning does not make the data immutable: versions can still be deleted or overwritten by anyone with the appropriate permissions, and keeping every version of each file increases storage costs.

References: Exporting table data | BigQuery | Google Cloud; Storage classes | Cloud Storage | Google Cloud; Retention policies and retention periods | Cloud Storage | Google Cloud; Federated queries | BigQuery | Google Cloud.
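As an illustration of the retention step in option D, here is a minimal sketch using the google-cloud-storage Python client; the bucket name is a placeholder, and locking a retention policy is irreversible for the configured period.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-archive-bucket")  # assumed bucket created with the Archive storage class

# Retain objects for 3 years (expressed in seconds), then lock the policy so
# the exported files cannot be deleted or overwritten until it expires.
bucket.retention_period = 3 * 365 * 24 * 60 * 60
bucket.patch()
bucket.lock_retention_policy()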
Question 259:
You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
B. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
C. Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.
Correct Answer: A
Question 260:
You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future, while keeping the original data for compliance reasons. What should you do?
A. Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
B. Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
C. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
D. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.