DATA-ENGINEER-ASSOCIATE Practice Questions & Online Exam Preparation

DATA-ENGINEER-ASSOCIATE Exam Details

Exam Code: DATA-ENGINEER-ASSOCIATE
Exam Name: AWS Certified Data Engineer - Associate (DEA-C01)
Certification: Amazon Certifications
Vendor: Amazon
Total Questions: 403 Q&As
Last Updated: Jul 16, 2026

Amazon DATA-ENGINEER-ASSOCIATE Online Questions & Answers

Question 171:

A company uses EventBridge Scheduler to start an AWS Glue workflow every hour. When the target invocation fails, the operations team needs failed invocation events to be retained for later troubleshooting without writing custom retry storage code.
Which configuration should the data engineer add?
A. Configure a dead-letter queue for the EventBridge Scheduler target.
B. Configure an S3 Lifecycle rule for the Glue script bucket.
C. Enable Redshift concurrency scaling for the workflow.
D. Enable DynamoDB TTL on the EventBridge rule.

A. Configure a dead-letter queue for the EventBridge Scheduler target.
Explanation
A dead-letter queue is the native way to retain failed EventBridge Scheduler target invocations for later analysis. S3 Lifecycle manages object retention and does not capture failed scheduler events. Redshift concurrency scaling affects query capacity, not workflow invocation failures. DynamoDB TTL is unrelated to EventBridge failure handling.
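As a rough illustration of the dead-letter queue setting, the sketch below builds the `Target` structure that a schedule could use. The workflow name, role ARN, queue ARN, and retry values are hypothetical placeholders, and the universal-target ARN for Glue `StartWorkflowRun` is an assumption about how the schedule invokes the workflow.

```python
import json


def build_scheduler_target(workflow_name, role_arn, dlq_arn):
    """Build a Target structure for an EventBridge Scheduler schedule.

    All ARNs passed in are hypothetical placeholders for illustration.
    """
    return {
        # Assumed universal target that calls the Glue StartWorkflowRun API.
        "Arn": "arn:aws:scheduler:::aws-sdk:glue:startWorkflowRun",
        "RoleArn": role_arn,
        # Request payload for StartWorkflowRun, serialized as a JSON string.
        "Input": json.dumps({"Name": workflow_name}),
        # Retry a few times before giving up (values chosen arbitrarily).
        "RetryPolicy": {
            "MaximumRetryAttempts": 3,
            "MaximumEventAgeInSeconds": 3600,
        },
        # Failed invocations are delivered to this SQS queue for later review.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


target = build_scheduler_target(
    workflow_name="hourly-etl",
    role_arn="arn:aws:iam::123456789012:role/scheduler-glue-role",
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:scheduler-dlq",
)
print(json.dumps(target, indent=2))
```

With boto3, a structure like this would typically be passed as the `Target` argument of the `scheduler` client's `create_schedule` call, alongside `ScheduleExpression="rate(1 hour)"` and a `FlexibleTimeWindow`.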
Question 172:

An online retailer uses multiple delivery partners to deliver products to customers. The delivery partners send order summaries to the retailer. The retailer stores the order summaries in Amazon S3.
Some of the order summaries contain personally identifiable information (PII) about customers. A data engineer needs to detect PII in the order summaries so the company can redact the PII.
Which solution will meet these requirements with the LEAST operational overhead?
A. Amazon Textract
B. Amazon S3 Storage Lens
C. Amazon Macie
D. Amazon SageMaker Data Wrangler

C. Amazon Macie
Explanation
Amazon Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover and protect sensitive data, such as Personally Identifiable Information (PII), stored in Amazon S3. Macie can automatically scan the order summaries for PII, enabling the company to detect and then redact PII as required.
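To make the Macie scan more concrete, here is a minimal sketch that assembles the request for a one-time sensitive data discovery job. The account ID, bucket name, and job name are hypothetical, and the field names reflect my understanding of the `macie2` `create_classification_job` request shape, so they should be checked against the current API reference before use.

```python
import uuid


def build_macie_job_request(account_id, bucket_name, job_name):
    """Assemble a one-time Macie classification job request.

    The account ID, bucket, and job name are illustrative placeholders.
    """
    return {
        # Idempotency token so a retried request does not create a second job.
        "clientToken": str(uuid.uuid4()),
        "name": job_name,
        "jobType": "ONE_TIME",
        # Limit the scan to the bucket that holds the order summaries.
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }


request = build_macie_job_request(
    account_id="123456789012",
    bucket_name="order-summaries-example",
    job_name="detect-pii-order-summaries",
)
print(request["name"], request["s3JobDefinition"]["bucketDefinitions"])
```

A request like this could be sent with `boto3.client("macie2").create_classification_job(**request)`; the findings then identify which objects contain PII so they can be redacted.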
Question 173:

A company uses an Amazon S3 Standard bucket to maintain a self-managed transactional data lake that uses Apache Iceberg tables. The data lake ingests data both in real time and in batches.
Users report slow performance for real-time tables. A data engineer reviews the real-time tables and notices that the tables are made up of many small data files.
The data engineer must improve the performance of the real-time tables.
Which solution will meet this requirement?
A. Expire historic snapshots.
B. Archive historic snapshots.
C. Delete S3 objects that are not linked from the Iceberg table.
D. Apply compaction.

D. Apply compaction.
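Compaction rewrites many small data files into fewer, larger ones so that a query opens fewer files. The toy model below shows only the grouping idea with a greedy bin-packing plan; it is not Iceberg's implementation, and the 4 MB file size and 128 MB target are assumed values. In a Spark-based Iceberg setup, this job is typically done by the `rewrite_data_files` procedure.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into output files of at most target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Start a new output file when adding this input would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [4] * 64  # 64 input files of 4 MB each (assumed sizes)
plan = plan_compaction(small_files, target_mb=128)
print(len(small_files), "files ->", len(plan), "files")
```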
Question 174:

An ecommerce company processes millions of orders each day. The company uses AWS Glue ETL to collect data from multiple sources, clean the data, and store the data in an Amazon S3 bucket in CSV format by using the S3 Standard storage class. The company uses the stored data to conduct daily analysis.
The company wants to optimize costs for data storage and retrieval.
Which solution will meet this requirement?
A. Transition the data to Amazon S3 Glacier Flexible Retrieval.
B. Transition the data from Amazon S3 to an Amazon Aurora cluster.
C. Configure AWS Glue ETL to transform the incoming data to Apache Parquet format.
D. Configure AWS Glue ETL to use Amazon EMR to process incoming data in parallel.

C. Configure AWS Glue ETL to transform the incoming data to Apache Parquet format.
Explanation
Converting the CSV files into Apache Parquet during your Glue ETL jobs dramatically reduces both storage size (because Parquet is a compressed, columnar format) and query cost (because analytics engines only scan the columns you need). This change requires no new infrastructure and pays off immediately for your daily analysis workloads.
Question 175:

A company has a data lake in Amazon S3. The company collects AWS CloudTrail logs for multiple applications. The company stores the logs in the data lake, catalogs the logs in AWS Glue, and partitions the logs based on the year. The company uses Amazon Athena to analyze the logs.
Recently, customers reported that a query on one of the Athena tables did not return any data. A data engineer must resolve the issue.
Which combination of troubleshooting steps should the data engineer take? (Choose Two.)
A. Confirm that Athena is pointing to the correct Amazon S3 location.
B. Increase the query timeout duration.
C. Use the MSCK REPAIR TABLE command.
D. Restart Athena.
E. Delete and recreate the problematic Athena table.

A. Confirm that Athena is pointing to the correct Amazon S3 location.
C. Use the MSCK REPAIR TABLE command.
Explanation
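The problem most likely arises because Athena is reading from the wrong S3 location or because partitions are missing from the table metadata. The two most relevant troubleshooting steps are to check the S3 location and to repair the table metadata.
A. Confirm that Athena is pointing to the correct Amazon S3 location: A common cause of empty Athena results is a table whose LOCATION points to an incorrect or outdated S3 path. Checking the path ensures that Athena queries the correct data.
C. Use the MSCK REPAIR TABLE command: Because the table is partitioned by year, data in partitions that are not registered in the AWS Glue Data Catalog is invisible to Athena. MSCK REPAIR TABLE scans the table location and adds any Hive-style partitions (for example, year=2024) that are missing from the catalog.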
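The two troubleshooting steps can be sketched as the Athena statements a data engineer might run. The table name and S3 path below are hypothetical. The `ALTER TABLE ... ADD PARTITION` variant is included because `MSCK REPAIR TABLE` only discovers Hive-style `year=...` prefixes.

```python
def show_location_query(table):
    """Statement whose output includes the table's LOCATION (step A)."""
    return f"SHOW CREATE TABLE {table}"


def msck_repair(table):
    """Statement that registers missing Hive-style partitions (step C)."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition(table, year, location):
    """Manual alternative for prefixes that are not in year=... form."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year = '{year}') LOCATION '{location}'"
    )


# Hypothetical table name and bucket path, for illustration only.
statements = [
    show_location_query("cloudtrail_logs"),
    msck_repair("cloudtrail_logs"),
    add_partition("cloudtrail_logs", 2024, "s3://example-data-lake/cloudtrail/2024/"),
]
for statement in statements:
    print(statement)
```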
Question 176:

A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resources.
Which service should the data engineer use in both the on-premises environment and the cloud-based environment?
A. AWS Data Exchange
B. Amazon Simple Workflow Service (Amazon SWF)
C. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
D. AWS Glue

C. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
Question 177:

A company adds new data to a large CSV file in an Amazon S3 bucket every day. The file contains company sales data from the previous 5 years. The file currently includes more than 5,000 rows. The CSV file structure is shown below with sample data:
The company needs to use Amazon Athena to run queries on the CSV file to fetch data from a specific time period.
Which solution will meet this requirement MOST cost-effectively?
A. Write an Apache Spark script to convert the CSV data to JSON format. Create an AWS Glue job to run the script every day. Catalog the JSON data in AWS Glue. Run the Athena queries on the JSON data.
B. Use prefixes to partition the data in the S3 bucket. Use the SALE_DATE column to create a partition for each day. Catalog the data in AWS Glue and ensure that the partitions are added. Update the Athena queries to use the new partitions.
C. Launch an Amazon EMR cluster. Specify AWS Glue Data Catalog as the default Apache Hive metastore. Use Trino (Presto) to run queries on the data.
D. Create an Amazon RDS database. Create a table named SALES that matches the schema of the CSV file. Create an index on the SALE_DATE column. Create an AWS Lambda function to load the CSV data into the RDS database. Use S3 Event Notifications to invoke the Lambda function.

B. Use prefixes to partition the data in the S3 bucket. Use the SALE_DATE column to create a partition for each day. Catalog the data in AWS Glue and ensure that the partitions are added. Update the Athena queries to use the new partitions.
Explanation
Partitioning the S3 data by SALE_DATE (daily prefixes) lets Athena prune non-matching partitions so it scans only the requested time range, minimizing bytes scanned and cost while working directly on the existing CSVs. Cataloging the partitions in AWS Glue enables efficient querying with minimal operational overhead.
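A minimal sketch of the daily partitioning idea follows. Only the SALE_DATE column comes from the question; the other columns, the sample rows, the date format, and the `sales/` prefix are assumptions, because the original sample data is not available here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; only SALE_DATE is named in the question.
SAMPLE_CSV = """SALE_DATE,ORDER_ID,AMOUNT
2024-01-01,1001,25.00
2024-01-01,1002,40.50
2024-01-02,1003,13.75
"""


def partition_prefix(sale_date):
    """Hive-style S3 prefix for one day of sales (assumed layout)."""
    return f"sales/sale_date={sale_date}/"


def split_by_day(csv_text):
    """Group CSV rows by the S3 prefix of their daily partition."""
    rows_by_prefix = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows_by_prefix[partition_prefix(row["SALE_DATE"])].append(row)
    return dict(rows_by_prefix)


partitions = split_by_day(SAMPLE_CSV)
for prefix, rows in sorted(partitions.items()):
    print(prefix, len(rows))
```

A query that filters on `sale_date` then only needs to read the objects under the matching prefixes, which is what reduces the bytes scanned.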
Question 178:

A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake.
The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file.
Which solution will meet these requirements MOST cost-effectively?
A. Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format.
B. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format.
C. Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format.
D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.

D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.
Explanation
Amazon Athena is a serverless interactive query service that allows you to analyze data in Amazon S3 using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and Parquet.
However, not all data formats are equally efficient for querying. Some data formats, such as CSV and JSON, are row-oriented, meaning that they store data as a sequence of records, each with the same fields.
Row-oriented formats are suitable for loading and exporting data, but they are not optimal for analytical queries that often access only a subset of columns. Text formats such as CSV and JSON can be compressed as whole files (for example, with GZIP), but they lack the column-level encoding techniques that reduce data size and let a query engine skip the columns it does not need.
On the other hand, some data formats, such as ORC and Parquet, are column-oriented, meaning that they store data as a collection of columns, each with a specific data type. Column-oriented formats are ideal for analytical queries that often filter, aggregate, or join data by columns. Column-oriented formats also support compression and encoding techniques that can reduce the data size and improve the query performance. For example, Parquet supports dictionary encoding, which replaces repeated values with numeric codes, and run-length encoding, which replaces consecutive identical values with a single value and a count. Parquet also supports various compression algorithms, such as Snappy, GZIP, and ZSTD, that can further reduce the data size and improve the query performance.
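The two encodings mentioned above can be illustrated with a few lines of plain Python. This is a toy model of the ideas only, not Parquet's actual on-disk encoding, and the sample column values are made up.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary, codes = {}, []
    for value in values:
        # First occurrence gets the next free code; repeats reuse it.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def run_length_encode(values):
    """Collapse consecutive identical values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


column = ["US", "US", "US", "DE", "DE", "US"]  # made-up column values
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary)  # ['US', 'DE']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [(0, 3), (1, 2), (0, 1)]
```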
Therefore, creating an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source and writing the data into the data lake in Apache Parquet format will meet the requirements most cost-effectively. AWS Glue is a fully managed service that provides a serverless data integration platform for data preparation, data cataloging, and data loading. AWS Glue ETL jobs allow you to transform and load data from various sources into various targets, using either a graphical interface (AWS Glue Studio) or scripts that you author in the AWS Glue console or submit through the AWS Glue API. With AWS Glue Studio, you can convert the data from CSV to Parquet format without hand-writing the conversion script. Parquet is a column-oriented format that allows Athena to scan only the relevant columns and skip the rest, reducing the amount of data read from S3. This solution will also reduce the cost of Athena queries, as Athena charges based on the amount of data scanned from S3.
The other options are not as cost-effective as creating an AWS Glue ETL job to write the data into the data lake in Parquet format. Using an AWS Glue PySpark job to ingest the source data into the data lake in .csv format will not improve the query performance or reduce the query cost, as .csv is a row-oriented format that does not support columnar access. Creating an AWS Glue ETL job to ingest the data into the data lake in JSON format will not improve the query performance or reduce the query cost, as JSON is also a row-oriented format that does not support columnar access. Using an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format will not meet the requirement either: Avro supports compression, but it is a row-oriented format, so Athena must still read entire records to return one or two columns.
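A back-of-the-envelope estimate shows why the file format matters when analysts read one or two of the 15 columns. The sketch assumes equally sized columns and ignores compression and file metadata, and the 15 GB dataset size is a made-up figure.

```python
def estimated_bytes_scanned(total_bytes, total_columns, columns_read, columnar):
    """Rough scan estimate assuming equally sized, uncompressed columns."""
    if not columnar:
        # Row-oriented files must be read in full, whatever columns are selected.
        return total_bytes
    # Columnar files let the engine read only the selected columns.
    return total_bytes * columns_read // total_columns


dataset_bytes = 15 * 10**9  # hypothetical 15 GB dataset
csv_scan = estimated_bytes_scanned(dataset_bytes, 15, 2, columnar=False)
parquet_scan = estimated_bytes_scanned(dataset_bytes, 15, 2, columnar=True)
print(csv_scan, parquet_scan)
```

Because Athena charges by the amount of data scanned, the smaller estimate translates directly into a lower query cost under these assumptions.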
Question 179:

A company wants to build a dimension table in an Amazon S3 bucket. The bucket contains historical data that includes 10 million records. The historical data is 1 TB in size.
A data engineer needs a solution to update changes for up to 10,000 records in the base table every day.
Which solution will meet this requirement with the LOWEST runtime?
A. Develop an Apache Spark job in Amazon EMR to read the historical data and the new changes into two Spark DataFrames. Use the Spark update method to update the base table.
B. Develop an AWS Glue Python job to read the historical data and new changes into two Pandas DataFrames. Use the Pandas update method to update the base table.
C. Develop an AWS Glue Apache Spark job to read the historical data and new changes into two Spark DataFrames. Use the Spark update method to update the base table.
D. Develop an Amazon EMR job to read new changes into Apache Spark DataFrames. Use the Apache Hudi framework to create the base table in Amazon S3. Use the Spark update method to update the base table.

D. Develop an Amazon EMR job to read new changes into Apache Spark DataFrames. Use the Apache Hudi framework to create the base table in Amazon S3. Use the Spark update method to update the base table.
Explanation
By using Apache Hudi on EMR, you get native upsert support against the S3 base table. Hudi's indexing and metadata management let it rewrite only the small set of files that contain changed records (up to 10,000 per day) rather than scanning and rewriting the full 1 TB. This minimizes processing and achieves the lowest runtime for daily updates.
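The benefit of an index can be shown with a toy model: given a map from data file to the record keys it holds, only files that contain a changed key need to be rewritten. This is a simplified illustration, not Hudi's API or index format, and the file names and key ranges are invented. In a real Spark job, the write would typically set the Hudi option `hoodie.datasource.write.operation` to `upsert`.

```python
def files_to_rewrite(file_index, changed_keys):
    """Return the data files that contain at least one changed record key."""
    changed = set(changed_keys)
    return sorted(name for name, keys in file_index.items() if keys & changed)


# Invented layout: 100 files, each holding 1,000 consecutive record keys.
file_index = {
    f"file-{i:03d}": set(range(i * 1000, (i + 1) * 1000)) for i in range(100)
}
changed_keys = [5, 17, 42_123, 42_999]

touched = files_to_rewrite(file_index, changed_keys)
print(touched, "of", len(file_index), "files")
```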
Question 180:

A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables.
Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.
Which solution will meet these requirements with the LEAST development effort?
A. Use Amazon EMR and Apache Ranger.
B. Use a Hive metastore on an EMR cluster.
C. Use the AWS Glue Data Catalog.
D. Use a metastore on an Amazon RDS for MySQL DB instance.

C. Use the AWS Glue Data Catalog.
Explanation
The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog that provides a central metadata repository for various data sources and formats. You can use the AWS Glue Data Catalog as the metastore for Amazon EMR and Amazon Athena queries, and import metadata from existing Hive metastores into the Data Catalog. This solution requires the least development effort because the Data Catalog is fully managed: there is no metastore database or cluster to operate, AWS Glue crawlers can automatically discover and catalog table schemas, and you can use the AWS Glue console, AWS CLI, or Amazon EMR API to configure the Data Catalog as the Hive metastore. The other options are either more complex or require additional steps, such as setting up Apache Ranger for security, managing a Hive metastore on an EMR cluster or an RDS instance, or migrating the metadata manually.
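As a sketch of how an EMR cluster can be pointed at the Data Catalog, the function below builds the configuration classifications commonly used for Hive and Spark. The function name is hypothetical, and the exact classifications a cluster needs depend on the applications installed on it.

```python
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """EMR configuration objects that use the Glue Data Catalog as the metastore."""
    properties = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    return [
        # Used by Hive on the cluster.
        {"Classification": "hive-site", "Properties": dict(properties)},
        # Used by Spark SQL on the cluster.
        {"Classification": "spark-hive-site", "Properties": dict(properties)},
    ]


configurations = glue_catalog_configurations()
for item in configurations:
    print(item["Classification"], item["Properties"])
```

A list like this is typically supplied as the `Configurations` argument when the cluster is created, for example through the EMR `run_job_flow` API.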

Tips on How to Prepare for the Exams

Certification exams are increasingly important, and more and more employers require them of job applicants. How can you prepare for the exam effectively, in a short time and with less effort? How can you get an ideal result and find the most reliable resources? Vcedump.com aims to answer these questions. Vcedump.com provides not only Amazon exam questions, answers, and explanations but also assistance with exam preparation and certification applications. If you are unsure about your DATA-ENGINEER-ASSOCIATE exam preparation or your Amazon certification application, visit Vcedump.com for help.
