DATA-ENGINEER-ASSOCIATE Practice Questions & Online Exam Preparation

DATA-ENGINEER-ASSOCIATE Exam Details

Exam Code
:DATA-ENGINEER-ASSOCIATE
Exam Name
:AWS Certified Data Engineer - Associate (DEA-C01)
Certification
:Amazon Certifications
Vendor
:Amazon
Total Questions
:403 Q&As
Last Updated
:Jul 16, 2026

Amazon DATA-ENGINEER-ASSOCIATE Online Questions & Answers

Question 261:

A company uses Amazon S3 and AWS Glue Data Catalog to manage a data lake that contains contact information for customers. The company uses PySpark and AWS Glue jobs with a DynamicFrame to run a workflow that procees data within the data lake.
A data engineer notices that the workflow is generating errors as a result of how customer postal codes are stored in the data lake. Some postal codes include unneceary numbers or invalid characters.
The data engineer needs a solution to addre the errors and correct the postal codes in the data lake.
A. Create a schema definition for PySpark that matches the format the proceing workflow requires for postal codes. Pa the schema to the DynamicFrame during proceing.
B. Use AWS Glue workflow properties to allow job state sharing. Configure the AWS Glue jobs to read values from the postal code column by using the properties from a previously succeful run of the jobs.
C. Configure the column.push_down_predicate setting and the catalogPartitionPredicate settings for the postal code column in the DynamicFrame.
D. Set the DynamicFrame additional_options parameter `useS3ListImplementation' to True.

A. Create a schema definition for PySpark that matches the format the proceing workflow requires for postal codes. Pa the schema to the DynamicFrame during proceing.
Explanation
By defining and enforcing a strict schema in PySpark that specifies the exact format you expect for postal codes (for example, a string of five digits), Spark will automatically flag or coerce any values that don't match. Paing that schema into your Glue DynamicFrame ensures that malformed postal codes are identified and can be corrected or filtered out as part of your ETL logic, eliminating the downstream errors.
Question 262:

A data engineer is using Amazon Athena to analyze sales data that is in Amazon S3. The data engineer writes a query to retrieve sales amounts for 2023 for several products from a table named sales_data.
However, the query does not return results for all of the products that are in the sales_data table. The data engineer needs to troubleshoot the query to resolve the issue.
The data engineer's original query is as follows:
SELECT product_name, sum(sales_amount)
FROM sales_data
WHERE year = 2023
GROUP BY product_name
How should the data engineer modify the Athena query to meet these requirements?
A. Replace sum(sales amount) with count(*J for the aggregation.
B. Change WHERE year = 2023 to WHERE extractlyear FROM sales data) = 2023.
C. Add HAVING sumfsales amount) > 0 after the GROUP BY clause.
D. Remove the GROUP BY clause

B. Change WHERE year = 2023 to WHERE extractlyear FROM sales data) = 2023.
Explanation
The original query does not return results for all of the products because the year column in the sales_data table is not an integer, but a timestamp. Therefore, the WHERE clause does not filter the data correctly, and only returns the products that have a null value for the year column. To fix this, the data engineer should use the extract function to extract the year from the timestamp and compare it with 2023. This way, the query will return the correct results for all of the products in the sales_data table. The other options are either incorrect or irrelevant, as they do not address the root cause of the issue. Replacing sum with count does not change the filtering condition, adding HAVING clause does not affect the grouping logic, and removing the GROUP BY clause does not solve the problem of missing products.
Question 263:

A company is using Amazon S3 to build a data lake. The company needs to replicate records from multiple source databases into Apache Parquet format.
Most of the source databases are hosted on Amazon RDS. However, one source database is an on-premises Microsoft SQL Server Enterprise instance. The company needs to implement a solution to replicate existing data from all source databases and all future changes to the target S3 data lake.
Which solution will meet these requirements MOST cost-effectively?
A. Use one AWS Glue job to replicate existing data. Use a second AWS Glue job to replicate future changes.
B. Use AWS Database Migration Service (AWS DMS) to replicate existing data. Use AWS Glue jobs to replicate future changes.
C. Use AWS Database Migration Service (AWS DMS) to replicate existing data and future changes.
D. Use AWS Glue jobs to replicate existing data. Use Amazon Kinesis Data Streams to replicate future changes.

C. Use AWS Database Migration Service (AWS DMS) to replicate existing data and future changes.
Explanation
Amazon DMS can perform a one-time full load of each source (including the on-prem SQL Server) and then capture ongoing changes (CDC), delivering them directly to S3 in Apache Parquet format. Using a single DMS replication task per source minimizes infrastructure, operational overhead, and cost.
Question 264:

A company needs to ensure data analysts can only access customer data from their own country.
Which solution requires the least operational effort?
A. Create a separate table for each country and assign access based on analyst's country
B. Register the S3 bucket in AWS Lake Formation and use row-level security
C. Migrate data to Regions close to customers and assign access by country
D. Load data into Amazon Redshift, create country-specific views and IAM roles

B. Register the S3 bucket in AWS Lake Formation and use row-level security
Question 265:

An AWS Glue Spark job uses job bookmarks to read new files from Amazon S3. The job still reprocesses older files after some runs.
Which actions can help the bookmark feature work correctly? (Choose two.)
A. Ensure the job script calls job.init at the beginning and job.commit at the end.
B. Keep the same source path and transformation context for the bookmarked source unless a reset or redesign is intended.
C. Disable job bookmarks so every run starts from a clean state.
D. Change the source S3 path after each successful run while keeping the same transformation context.
E. Store the bookmark state in an S3 bucket policy.

A. Ensure the job script calls job.init at the beginning and job.commit at the end.
B. Keep the same source path and transformation context for the bookmarked source unless a reset or redesign is intended.
Explanation
AWS Glue scripts need initialization and commit calls so bookmark state can be loaded and saved. Stable source configuration and transformation context help Glue associate the persisted state with the intended source. Disabling bookmarks forces full processing. Changing the source path while reusing the same context can produce skipped or unexpected files. S3 bucket policies do not store Glue bookmark state.
Question 266:

A company uses AWS Step Functions to orchestrate a data pipeline. The pipeline consists of Amazon EMR jobs that ingest data from data sources and store the data in an Amazon S3 bucket. The pipeline also includes EMR jobs that load the data to Amazon Redshift.
The company's cloud infrastructure team manually built a Step Functions state machine. The cloud infrastructure team launched an EMR cluster into a VPC to support the EMR jobs. However, the deployed Step Functions state machine is not able to run the EMR jobs.
Which combination of steps should the company take to identify the reason the Step Functions state machine is not able to run the EMR jobs? (Choose two.)
A. Use AWS CloudFormation to automate the Step Functions state machine deployment. Create a step to pause the state machine during the EMR jobs that fail. Configure the step to wait for a human user to send approval through an email message. Include details of the EMR task in the email message for further analysis.
B. Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.
C. Check for entries in Amazon CloudWatch for the newly created EMR cluster. Change the AWS Step Functions state machine code to use Amazon EMR on EKS. Change the IAM access policies and the security group configuration for the Step Functions state machine code to reflect inclusion of Amazon Elastic Kubernetes Service (Amazon EKS).
D. Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.
E. Check the retry scenarios that the company configured for the EMR jobs. Increase the number of seconds in the interval between each EMR task. Validate that each fallback state has the appropriate catch for each decision state. Configure an Amazon Simple Notification Service (Amazon SNS) topic to store the error messages.

B. Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.
D. Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.
Explanation
To identify the reason why the Step Functions state machine is not able to run the EMR jobs, the company should take the following steps: Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. The state machine code should have an IAM role that allows it to invoke the EMR APIs, such as RunJobFlow, AddJobFlowSteps, and DescribeStep. The state machine code should also have IAM permissions to access the Amazon S3 buckets that the EMR jobs use as input and output locations. The company can use Access Analyzer for S3 to check the access policies and permissions of the S3 buckets. Therefore, option B is correct.
Query the flow logs for the VPC. The flow logs can provide information about the network traffic to and from the EMR cluster that is launched in the VPC. The company can use the flow logs to determine whether the traffic that originates from the EMR cluster can successfully reach the data providers, such as Amazon RDS, Amazon Redshift, or other external sources. The company can also determine whether any security group that might be attached to the EMR cluster allows connections to the data source servers on the informed ports. The company can use Amazon VPC Flow Logs or Amazon CloudWatch Logs Insights to query the flow logs . Therefore, option D is correct.
Option A is incorrect because it suggests using AWS CloudFormation to automate the Step Functions state machine deployment. While this is a good practice to ensure consistency and repeatability of the deployment, it does not help to identify the reason why the state machine is not able to run the EMR jobs. Moreover, creating a step to pause the state machine during the EMR jobs that fail and wait for a human user to send approval through an email message is not a reliable way to troubleshoot the issue. The company should use the Step Functions console or API to monitor the execution history and status of the state machine, and use Amazon CloudWatch to view the logs and metrics of the EMR jobs.
Option C is incorrect because it suggests changing the AWS Step Functions state machine code to use Amazon EMR on EKS. Amazon EMR on EKS is a service that allows you to run EMR jobs on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. While this service has some benefits, such as lower cost and faster execution time, it does not support all the features and integrations that EMR on EC2 does, such as EMRNotebooks, EMR Studio, and EMRFS. Therefore, changing the state machine code to use EMR on EKS may not be compatible with the existing data pipeline and may introduce new issues.
Option E is incorrect because it suggests checking the retry scenarios that the company configured for the EMR jobs. While this is a good practice to handle transient failures and errors, it does not help to identify the root cause of why the state machine is not able to run the EMR jobs. Moreover, increasing the number of seconds in the interval between each EMR task may not improve the success rate of the jobs, and may increase the execution time and cost of the state machine. Configuring an Amazon SNS topic to store the error messages may help to notify the company of any failures, but it does not provide enough information to troubleshoot the issue.
References:
1: Manage an Amazon EMR Job - AWS Step Functions
2: Access Analyzer for S3 - Amazon Simple Storage Service
3: Working with Amazon EMR and VPC Flow Logs - Amazon EMR
[4]: Analyzing VPC Flow Logs with Amazon CloudWatch Logs Insights - Amazon Virtual Private Cloud [5]: Monitor AWS Step Functions - AWS Step Functions [6]: Monitor Amazon EMR clusters - Amazon EMR [7]: Amazon EMR on Amazon EKS - Amazon EMR
Question 267:

A retail company is using an Amazon Redshift cluster to support real-time inventory management. The company has deployed an ML model on a real-time endpoint in Amazon SageMaker.
The company wants to make real-time inventory recommendations. The company also wants to make predictions about future inventory needs.
Which solutions will meet these requirements? (Choose Two.)
A. Use Amazon Redshift ML to generate inventory recommendations.
B. Use SQL to invoke a remote SageMaker endpoint for prediction.
C. Use Amazon Redshift ML to schedule regular data exports for offline model training.
D. Use SageMaker Autopilot to create inventory management dashboards in Amazon Redshift.
E. Use Amazon Redshift as a file storage system to archive old inventory management reports.

A. Use Amazon Redshift ML to generate inventory recommendations.
B. Use SQL to invoke a remote SageMaker endpoint for prediction.
Explanation
The company needs to use machine learning models for real-time inventory recommendations and future inventory predictions while leveraging both Amazon Redshift and Amazon SageMaker.
Option A: Use Amazon Redshift ML to generate inventory recommendations.Amazon Redshift MLallows you to build, train, and deploy machine learning models directly from Redshift using SQL statements. It integrates with SageMaker to train models and run inference. This feature is useful for generating inventory recommendations directly from the data stored in Redshift.
Option B: Use SQL to invoke a remote SageMaker endpoint for prediction.You can use SQL in Redshift to call a SageMaker endpoint for real-time inference. By invoking a SageMaker endpoint from within Redshift, the company can get real-time predictions on inventory, allowing for integration between the data warehouse and the machine learning model hosted in SageMaker.
Option C (offline model training)andOption D (creating dashboards with SageMaker Autopilot) are not relevant to the real-time prediction and recommendation requirements.
Option E (archiving inventory reports in Redshift)is not related to making predictions or recommendations.
References:
Amazon Redshift ML Documentation Invoking SageMaker Endpoints from SQL
Question 268:

A data engineer must build an extract, transform, and load (ETL) pipeline to process and load data from 10 source systems into 10 tables that are in an Amazon Redshift database. All the source systemsgenerate
.csv, JSON, or Apache Parquet files every 15 minutes. The source systems all deliver files into one Amazon S3 bucket. The file sizes range from 10 MB to 20 GB. The ETL pipeline must function correctly despite changes to the data schema.
Which data pipeline solutions will meet these requirements? (Choose two.)
A. Use an Amazon EventBridge rule to run an AWS Glue job every 15 minutes. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
B. Use an Amazon EventBridge rule to invoke an AWS Glue workflow job every 15 minutes. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
C. Configure an AWS Lambda function to invoke an AWS Glue crawler when a file is loaded into the S3 bucket. Configure an AWS Glue job to process and load the data into the Amazon Redshift tables. Create a second Lambda function to run the AWS Glue job. Create an Amazon EventBridge rule to invoke the second Lambda function when the AWS Glue crawler finishes running successfully.
D. Configure an AWS Lambda function to invoke an AWS Glue workflow when a file is loaded into the S3 bucket. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
E. Configure an AWS Lambda function to invoke an AWS Glue job when a file is loaded into the S3 bucket. Configure the AWS Glue job to read the files from the S3 bucket into an Apache Spark DataFrame. Configure the AWS Glue job to also put smaller partitions of the DataFrame into an Amazon Kinesis Data Firehose delivery stream. Configure the delivery stream to load data into the Amazon Redshift tables.

B. Use an Amazon EventBridge rule to invoke an AWS Glue workflow job every 15 minutes. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
D. Configure an AWS Lambda function to invoke an AWS Glue workflow when a file is loaded into the S3 bucket. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
Question 269:

A telecommunications company collects network usage data throughout each day at a rate of several thousand data points each second. The company runs an application to process the usage data in real time. The company aggregates and stores the data in an Amazon Aurora DB instance.
Sudden drops in network usage usually indicate a network outage. The company must be able to identify sudden drops in network usage so the company can take immediate remedial actions.
Which solution will meet this requirement with the LEAST latency?
A. Create an AWS Lambda function to query Aurora for drops in network usage. Use Amazon EventBridge to automatically invoke the Lambda function every minute.
B. Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.
C. Replace the Aurora database with an Amazon DynamoDB table. Create an AWS Lambda function to query the DynamoDB table for drops in network usage every minute. Use DynamoDB Accelerator (DAX) between the processing application and DynamoDB table.
D. Create an AWS Lambda function within the Database Activity Streams feature of Aurora to detect drops in network usage.

B. Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.
Explanation
The telecommunications company needs a low-latency solution to detect sudden drops in network usage from real-time data collected throughout the day.
Option B: Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (Amazon Kinesis Data Analytics) application to detect drops in network usage.Using Amazon Kinesis with Managed Service for Apache Flink (formerly Kinesis Data Analytics) is ideal for real-time stream processing with minimal latency. Flink can analyze the incoming data stream in real-time and detect anomalies, such as sudden drops in usage, which makes it the best fit for this scenario. Other options (A, C, and D) either introduce unnecessary delays (e.g., querying databases) or do not provide the same real-time, low-latency processing that is critical for this use case.
References:
Amazon Kinesis Data Analytics for Apache Flink Amazon Kinesis Documentation
Question 270:

A car sales company maintains data about cars that are listed for sale in an area. The company receives data about new car listings from vendors who upload the data daily as compressed files into Amazon S3.
The compressed files are up to 5 KB in size. The company wants to see the most up-to-date listings as soon as the data is uploaded to Amazon S3.
A data engineer must automate and orchestrate the data processing workflow of the listings to feed a dashboard. The data engineer must also provide the ability to perform one-time queries and analytical reporting. The query solution must be scalable.
Which solution will meet these requirements MOST cost-effectively?
A. Use an Amazon EMR cluster to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Apache Hive for one-time queries and analytical reporting. Use Amazon OpenSearch Service to bulk ingest the data into compute optimized instances. Use OpenSearch Dashboards in OpenSearch Service for the dashboard.
B. Use a provisioned Amazon EMR cluster to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Amazon Athena for one-time queries and analytical reporting. Use Amazon QuickSight for the dashboard.
C. Use AWS Glue to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Amazon Redshift Spectrum for one-time queries and analytical reporting. Use OpenSearch Dashboards in Amazon OpenSearch Service for the dashboard.
D. Use AWS Glue to process incoming data. Use AWS Lambda and S3 Event Notifications to orchestrate workflows. Use Amazon Athena for one-time queries and analytical reporting. Use Amazon QuickSight for the dashboard.

D. Use AWS Glue to process incoming data. Use AWS Lambda and S3 Event Notifications to orchestrate workflows. Use Amazon Athena for one-time queries and analytical reporting. Use Amazon QuickSight for the dashboard.
Explanation
For processing the incoming car listings in a cost-effective, scalable, and automated way, the ideal approach involves using AWS Glue for data processing, AWS Lambda with S3 Event Notifications for orchestration, Amazon Athena for one-time queries and analytical reporting, and Amazon QuickSight for visualization on the dashboard. Let's break this down:
AWS Glue: This is a fully managed ETL (Extract, Transform, Load) service that automatically processes the incoming data files. Glue is serverless and supports diverse data sources, including
Amazon S3 and Redshift. AWS Lambda and S3 Event Notifications: Using Lambda and S3 Event Notifications allows near real-time triggering of processing workflows as soon as new data is uploaded into S3. This approach is event-driven, ensuring that the listings are processed as soon as they are uploaded, reducing the latency for data processing.
Amazon Athena: A serverless, pay-per-query service that allows interactive queries directly against data in S3 using standard SQL. It is ideal for the requirement of one-time queries and analytical reporting without the need for provisioning or managing servers.
Amazon QuickSight: A business intelligence tool that integrates with a wide range of AWS data sources, including Athena, and is used for creating interactive dashboards. It scales well and provides real-time insights for the car listings. This solution (Option D) is the most cost-effective , because both Glue and Athena are serverless and priced based on usage, reducing costs when compared to provisioning EMR clusters in the other options. Moreover, using Lambda for orchestration is more cost-effective than AWS Step Functions due to its lightweight nature.
References:
AWS Glue Documentation
Amazon Athena Documentation
Amazon QuickSight Documentation
S3 Event Notifications and Lambda

Related Exams:

Tips on How to Prepare for the Exams

Nowadays, the certification exams become more and more important and required by more and more enterprises when applying for a job. But how to prepare for the exam effectively? How to prepare for the exam in a short time with less efforts? How to get a ideal result and how to find the most reliable resources? Here on Vcedump.com, you will find all the answers. Vcedump.com provide not only Amazon exam questions, answers and explanations but also complete assistance on your exam preparation and certification application. If you are confused on your DATA-ENGINEER-ASSOCIATE exam preparations and Amazon certification application, do not hesitate to visit our Vcedump.com to find your solutions here.

DATA-ENGINEER-ASSOCIATE Exam Details

Exam Code

Exam Name

Certification

Vendor

Total Questions

Last Updated

Amazon DATA-ENGINEER-ASSOCIATE Online Questions & Answers

Question 261:

Question 262:

Question 263:

Question 264:

Question 265:

Question 266:

Question 267:

Question 268:

Question 269:

Question 270:

Related Exams:

AIF-C01

AIP-C01

ANS-C00

ANS-C01

AXS-C01

BDS-C00

CLF-C02

DAS-C01

DATA-ENGINEER-ASSOCIATE

DBS-C01

Tips on How to Prepare for the Exams

Amazon DATA-ENGINEER-ASSOCIATE Online Practice Questions and Exam Preparation

DATA-ENGINEER-ASSOCIATE Exam Details

Exam Code

Exam Name

Certification

Vendor

Total Questions

Last Updated

Amazon DATA-ENGINEER-ASSOCIATE Online Questions & Answers

Question 261:

Question 262:

Question 263:

Question 264:

Question 265:

Question 266:

Question 267:

Question 268:

Question 269:

Question 270:

Related Exams:

Tips on How to Prepare for the Exams