Question 1:
A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is in Apache Parquet format. The company is experiencing slow query performance.
Which solutions will improve query performance? (Choose two.)
A. Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule.
B. Use AWS Glue Data Catalog to automatically compact the Iceberg tables.
C. Use AWS Glue Data Catalog to automatically optimize indexes for the Iceberg tables.
D. Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.
E. Use AWS Glue Data Catalog to generate views for the Iceberg tables.
B. Use AWS Glue Data Catalog to automatically compact the Iceberg tables.
D. Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.
Explanation
Compaction reduces many small Parquet files into larger ones, lowering file-listing and open/scan overhead for faster reads. Copy-on-write favors read performance by materializing updates into rewritten data files, resulting in more contiguous, scan-efficient files for queries.
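For reference, a minimal sketch of how both settings could be applied, assuming the Iceberg table is registered in the Glue Data Catalog as games_db.game_events and that a suitable IAM role exists (all names, ARNs, and the account ID are placeholders); the create_table_optimizer call and the Iceberg write-mode properties should be checked against current AWS and Iceberg documentation:

    import boto3
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue = boto3.client("glue")

    # Enable Glue Data Catalog automatic compaction for the Iceberg table so small
    # streaming files are periodically rewritten into larger ones.
    glue.create_table_optimizer(
        CatalogId="123456789012",                     # placeholder account ID
        DatabaseName="games_db",                      # placeholder database
        TableName="game_events",                      # placeholder table
        Type="compaction",
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::123456789012:role/GlueTableOptimizerRole",
            "enabled": True,
        },
    )

    # Inside a Glue Spark job: favor reads by rewriting whole data files on
    # update/delete/merge (copy-on-write) instead of emitting delete files.
    spark = GlueContext(SparkContext.getOrCreate()).spark_session
    spark.sql("""
        ALTER TABLE glue_catalog.games_db.game_events SET TBLPROPERTIES (
            'write.update.mode' = 'copy-on-write',
            'write.delete.mode' = 'copy-on-write',
            'write.merge.mode'  = 'copy-on-write'
        )
    """)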
Question 2:
A company has a data processing pipeline that runs multiple SQL queries in sequence against an Amazon Redshift cluster. The company merges with a second company. The original company modifies a query that aggregates sales revenue data to join sales tables from both companies. The sales table for the first company is named Table S1. The sales table for the second company is named Table S2. Table S1 contains 10 billion records. Table S2 contains 900 million records. The query becomes slow after the modification. A data engineer must improve the query performance.
Which solutions will meet these requirements? (Choose two.)
A. Use the KEY distribution style for both sales tables. Select a low cardinality column to use for the join.
B. Use the KEY distribution style for both sales tables. Select a high cardinality column to use for the join.
C. Use the EVEN distribution style for Table S1. Use the ALL distribution style for Table S2.
D. Use the Amazon Redshift query optimizer to review and select optimizations to implement.
E. Use Amazon Redshift Advisor to review and select optimizations to implement.
B. Use the KEY distribution style for both sales tables. Select a high cardinality column to use for the join.
E. Use Amazon Redshift Advisor to review and select optimizations to implement.
Explanation
Choosing KEY distribution on both tables with a high-cardinality join column colocates matching rows across nodes and avoids data skew, improving join performance. Redshift Advisor provides automated, actionable recommendations (e.g., distribution and sort keys, stats) to further optimize the slow query with minimal effort.
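A minimal sketch of the distribution-style change, assuming the tables are named table_s1 and table_s2 and that customer_id stands in for a high-cardinality column shared by both tables (the cluster, database, and user names are placeholders); the Redshift Data API is used here for convenience:

    import boto3

    rsd = boto3.client("redshift-data")

    # Redistribute both tables on the same high-cardinality join column so that
    # matching rows are colocated on the same slices.
    for table in ("table_s1", "table_s2"):
        rsd.execute_statement(
            ClusterIdentifier="sales-cluster",   # placeholder cluster
            Database="sales",                    # placeholder database
            DbUser="admin",                      # placeholder user
            Sql=f"ALTER TABLE {table} ALTER DISTSTYLE KEY DISTKEY customer_id;",
        )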
Question 3:
A company uses a data stream in Amazon Kinesis Data Streams to collect transactional data from multiple sources. The company uses an AWS Glue extract, transform, and load (ETL) pipeline to look for outliers in the data from the stream. When the workflow detects an outlier, it sends a notification to an Amazon Simple Notification Service (Amazon SNS) topic. The SNS topic initiates a second workflow to retrieve logs for the outliers and stores the logs in an Amazon S3 bucket. The company experiences delays in the notifications to the SNS topic during periods when the data stream is processing a high volume of data. When the company examines Amazon CloudWatch logs, the company notices a high value for the glue.driver.BlockManager.disk.diskSpaceUsed_MB metric when the traffic is high. The company must resolve this issue.
Which solution will meet this requirement with the LEAST operational effort?
A. Increase the number of data processing units (DPUs) in AWS Glue ETL jobs.
B. Use Amazon EMR to manage the ETL pipeline instead of AWS Glue.
C. Use AWS Step Functions to orchestrate a parallel workflow state.
D. Enable auto scaling for the AWS Glue ETL jobs.
D. Enable auto scaling for the AWS Glue ETL jobs.
Explanation
The high BlockManager disk usage indicates memory pressure and spilling due to under-provisioned Spark executors. Enabling AWS Glue auto scaling lets the job automatically add workers during traffic spikes, alleviating memory/disk pressure and reducing delays with minimal operational effort.
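As a sketch of this option, auto scaling can be turned on through the job definition, assuming a Glue 3.0+ streaming job (the job name, role, script location, and worker settings below are placeholders; NumberOfWorkers becomes the upper bound the job can scale to):

    import boto3

    glue = boto3.client("glue")

    # UpdateJob replaces the job definition, so the existing Role and Command are
    # restated; "--enable-auto-scaling" lets Glue add or remove workers with load.
    glue.update_job(
        JobName="stream-outlier-detection",
        JobUpdate={
            "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
            "Command": {
                "Name": "gluestreaming",
                "ScriptLocation": "s3://example-bucket/scripts/detect_outliers.py",
            },
            "GlueVersion": "4.0",
            "WorkerType": "G.1X",
            "NumberOfWorkers": 20,               # maximum workers when auto scaling
            "DefaultArguments": {"--enable-auto-scaling": "true"},
        },
    )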
Question 4:
A company stores information about its subscribers in an Amazon S3 bucket. The company runs an analysis every time a subscriber ends their subscription. The company uses three AWS Lambda functions to respond to events from the S3 bucket by performing analyses.
The Lambda functions clean data from the S3 bucket and initiate an AWS Glue workflow. The Lambda functions have 128 MB of memory and 512 MB of ephemeral storage. The Lambda functions have a timeout of 15 seconds. All three functions successfully finish running. However, CPU usage is often near 100%, which causes slow performance. The company wants to improve the performance of the functions and reduce the total runtime of the pipeline.
Which solution will meet these requirements?
A. Increase the memory of the Lambda functions to 512 MB.
B. Increase the number of retries by using the Maximum Retry Attempts setting.
C. Configure the Lambda functions to run in the company's VPC.
D. Increase the timeout value for the Lambda functions from 15 seconds to 30 seconds.
A. Increase the memory of the Lambda functions to 512 MB.
Explanation
In AWS Lambda, increasing memory also proportionally increases the allocated CPU. By raising the Lambda functions' memory from 128 MB to 512 MB, the functions receive more CPU power, which reduces execution time and improves performance without changing logic or timeouts.
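A minimal sketch of the change, with hypothetical function names standing in for the company's three pipeline functions:

    import boto3

    lam = boto3.client("lambda")

    # More memory also allocates proportionally more CPU to each invocation,
    # which relieves the near-100% CPU utilization and shortens runtimes.
    for name in ("clean-subscriber-data", "start-glue-workflow", "run-analysis"):
        lam.update_function_configuration(FunctionName=name, MemorySize=512)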
Question 5:
A ride-sharing company stores records for all rides in an Amazon DynamoDB table. The table includes the following columns and types of values:
The table currently contains billions of items. The table is partitioned by RideID and uses TripStartTime as the sort key. The company wants to use the data to build a personalized interface to give drivers the ability to view the rides that each driver has completed, based on RideStatus. The solution must access the necessary data without scanning the entire table.
Which solution will meet these requirements?
A. Create a local secondary index (LSI) on DriverID.
B. Create a global secondary index (GSI) that uses RiderID as the partition key and RideStatus as the sort key.
C. Create a global secondary index (GSI) that uses DriverID as the partition key and RideStatus as the sort key.
D. Create a filter expression that uses RiderID and RideStatus.
C. Create a global secondary index (GSI) that uses DriverID as the partition key and RideStatus as the sort key.
Explanation
To let drivers efficiently query only their completed rides, you need a global secondary index (GSI) with DriverID as the partition key (so queries can be targeted per driver) and RideStatus as the sort key (so you can query for "Completed" rides without scanning the full table). This avoids costly scans and supports fast, targeted lookups at scale.
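A sketch of the index and a targeted query, assuming the table is named Rides, uses on-demand capacity (so no provisioned throughput is set on the index), and the attribute names match the question; the index name and driver value are placeholders:

    import boto3

    ddb = boto3.client("dynamodb")

    # Add a GSI keyed on DriverID (partition) and RideStatus (sort).
    ddb.update_table(
        TableName="Rides",
        AttributeDefinitions=[
            {"AttributeName": "DriverID", "AttributeType": "S"},
            {"AttributeName": "RideStatus", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[{
            "Create": {
                "IndexName": "DriverID-RideStatus-index",
                "KeySchema": [
                    {"AttributeName": "DriverID", "KeyType": "HASH"},
                    {"AttributeName": "RideStatus", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }],
    )

    # Fetch one driver's completed rides without scanning the base table.
    ddb.query(
        TableName="Rides",
        IndexName="DriverID-RideStatus-index",
        KeyConditionExpression="DriverID = :d AND RideStatus = :s",
        ExpressionAttributeValues={":d": {"S": "driver-123"}, ":s": {"S": "Completed"}},
    )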
Question 6:
A data engineer is building a new data pipeline that stores metadata in an Amazon DynamoDB table. The data engineer must ensure that all items that are older than a specified age are removed from the DynamoDB table daily.
Which solution will meet this requirement with the LEAST configuration effort?
A. Enable DynamoDB TTL on the DynamoDB table. Adjust the application source code to set the TTL attribute appropriately.
B. Create an Amazon EventBridge rule that uses a daily cron expression to trigger an AWS Lambda function to delete items that are older than the specified age.
C. Add a lifecycle configuration to the DynamoDB table that deletes items that are older than the specified age.
D. Create a DynamoDB stream that has an AWS Lambda function that reacts to data modifications. Configure the Lambda function to delete items that are older than the specified age.
A. Enable DynamoDB TTL on the DynamoDB table. Adjust the application source code to set the TTL attribute appropriately.
Explanation
DynamoDB Time to Live (TTL) is a fully managed feature that automatically expires and deletes items based on a timestamp attribute you define. By enabling TTL and setting the TTL attribute in the application code, items older than the specified age are removed without the need for custom scheduling or Lambda functions, offering the least configuration effort.
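A minimal sketch, assuming the table is named pipeline-metadata and the TTL attribute is called expires_at (both names are placeholders):

    import time
    import boto3

    ddb = boto3.client("dynamodb")

    # One-time configuration: tell DynamoDB which attribute holds the expiry time.
    ddb.update_time_to_live(
        TableName="pipeline-metadata",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # In the application code, stamp each item with an epoch-seconds expiry
    # (here 30 days out); DynamoDB deletes expired items automatically.
    ddb.put_item(
        TableName="pipeline-metadata",
        Item={
            "item_id": {"S": "example-item"},
            "expires_at": {"N": str(int(time.time()) + 30 * 24 * 60 * 60)},
        },
    )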
Question 7:
A data engineer is building a solution to detect sensitive information that is stored in a data lake across multiple Amazon S3 buckets. The solution must detect personally identifiable information (PII) that is in a proprietary data format.
Which solution will meet these requirements with the LEAST operational overhead?
A. Use the AWS Glue Detect PII transform with specific patterns.
B. Use Amazon Macie with managed data identifiers.
C. Use an AWS Lambda function with custom regular expressions.
D. Use Amazon Athena with a SQL query to match the custom formats.
A. Use the AWS Glue Detect PII transform with specific patterns.
Explanation
The AWS Glue Detect PII transform is a built-in feature that can automatically identify personally identifiable information (PII) using predefined patterns or custom regular expressions. It works directly within AWS Glue jobs on proprietary data formats and integrates with the Glue Data Catalog, delivering a managed, low-overhead solution compared to building and maintaining custom Lambda or SQL-based detection.
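The transform can also be invoked from a Glue ETL script. The sketch below follows the pattern in AWS's sample code for the EntityDetector transform; the class name, detect signature, entity types, and catalog names should be treated as assumptions to confirm against the Glue documentation for your Glue version, and the proprietary-format patterns themselves would be added as custom detection patterns on the transform:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    from awsglueml.transforms import EntityDetector

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read one of the data lake tables (database and table names are placeholders).
    frame = glue_context.create_dynamic_frame.from_catalog(
        database="datalake_db", table_name="customer_records"
    )

    # Flag values that match the selected PII entity types; the output column name
    # is arbitrary. Custom patterns for the proprietary format are configured on
    # the Detect PII transform in addition to these managed types.
    detected = EntityDetector().detect(frame, ["EMAIL", "PHONE_NUMBER"], "DetectedEntities")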
Question 8:
A data engineer must implement a data cataloging solution to track schema changes in an Amazon Redshift table.
Which solution will meet these requirements?
A. Schedule an AWS Glue crawler to run every day on the table by using the Java Database Connectivity (JDBC) driver. Configure the crawler to update an AWS Glue Data Catalog.
B. Use AWS DataSync to log the table metadata to an AWS Glue Data Catalog. Use an AWS Glue crawler to update the Data Catalog every day.
C. Use the AWS Schema Conversion Tool (AWS SCT) to log the table metadata to an Apache Hive metastore. Use Amazon EventBridge Scheduler to update the metastore every day.
D. Schedule an AWS Glue crawler to run every day on the table. Configure the crawler to update an Apache Hive metastore.
A. Schedule an AWS Glue crawler to run every day on the table by using the Java Database Connectivity (JDBC) driver. Configure the crawler to update an AWS Glue Data Catalog.
Explanation
An AWS Glue crawler can connect to Amazon Redshift via JDBC, detect schema changes, and automatically update the AWS Glue Data Catalog on a schedule, providing the required ongoing schema tracking.
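A sketch of such a crawler, assuming a Glue JDBC connection to the Redshift cluster already exists (the connection name, role, catalog database, and include path are placeholders; the schedule is a daily cron expression in UTC):

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="redshift-schema-tracker",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",       # placeholder role
        DatabaseName="redshift_catalog",                             # Data Catalog database
        Targets={"JdbcTargets": [{
            "ConnectionName": "redshift-jdbc-connection",            # existing Glue connection
            "Path": "dev/public/%",                                  # database/schema/table pattern
        }]},
        Schedule="cron(0 2 * * ? *)",                                # run every day at 02:00 UTC
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",                  # record schema changes in the catalog
            "DeleteBehavior": "LOG",
        },
    )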
Question 9:
A company adds new data to a large CSV file in an Amazon S3 bucket every day. The file contains company sales data from the previous 5 years. The file currently includes more than 5,000 rows. The CSV file structure is shown below with sample data:
ID    SALE_DATE      ITEM_SOLD     SALE_PRICE    SALES_REP    STORE_NAME
      01-Jan-2024    TV                          Terry        Alaska
      02-Jan-2024    DVD player                  Diego        Boston
The company needs to use Amazon Athena to run queries on the CSV file to fetch data from a specific time period.
Which solution will meet this requirement MOST cost-effectively?
A. Write an Apache Spark script to convert the CSV data to JSON format. Create an AWS Glue job to run the script every day. Catalog the JSON data in AWS Glue. Run the Athena queries on the JSON data.
B. Use prefixes to partition the data in the S3 bucket. Use the SALE_DATE column to create a partition for each day. Catalog the data in AWS Glue and ensure that the partitions are added. Update the Athena queries to use the new partitions.
C. Launch an Amazon EMR cluster. Specify AWS Glue Data Catalog as the default Apache Hive metastore. Use Trino (Presto) to run queries on the data.
D. Create an Amazon RDS database. Create a table named SALES that matches the schema of the CSV file. Create an index on the SALE_DATE column. Create an AWS Lambda function to load the CSV data into the RDS database. Use S3 Event Notifications to invoke the Lambda function.
B. Use prefixes to partition the data in the S3 bucket. Use the SALE_DATE column to create a partition for each day. Catalog the data in AWS Glue and ensure that the partitions are added. Update the Athena queries to use the new partitions.
Explanation
Partitioning the S3 data by SALE_DATE (daily prefixes) lets Athena prune non-matching partitions so it scans only the requested time range, minimizing bytes scanned and cost while working directly on the existing CSVs. Cataloging the partitions in AWS Glue enables efficient querying with minimal operational overhead.
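A sketch of the partitioned table and a pruned query, assuming the CSV data is laid out under prefixes such as s3://sales-bucket/sales/sale_date=2024-01-02/ and that the bucket, database, and result-location names are placeholders:

    import boto3

    athena = boto3.client("athena")

    def run(sql: str) -> str:
        return athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "sales_db"},
            ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
        )["QueryExecutionId"]

    # External CSV table partitioned by sale_date; SALE_DATE moves from a data
    # column into the partition key encoded in the S3 prefix.
    run("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            id string, item_sold string, sale_price double,
            sales_rep string, store_name string
        )
        PARTITIONED BY (sale_date string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3://sales-bucket/sales/'
    """)

    # Register newly added daily partitions, then query a date range; Athena
    # scans only the matching sale_date prefixes.
    run("MSCK REPAIR TABLE sales")
    run("SELECT sales_rep, SUM(sale_price) AS revenue FROM sales "
        "WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31' GROUP BY sales_rep")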
Question 10:
A company wants to use Apache Spark jobs that run on an Amazon EMR cluster to process streaming data. The Spark jobs will transform and store the data in an Amazon S3 bucket. The company will use Amazon Athena to perform analysis. The company needs to optimize the data format for analytical queries.
Which solutions will meet these requirements with the SHORTEST query times? (Choose two.)
A. Use Avro format. Use AWS Glue Data Catalog to track schema changes.
B. Use ORC format. Use AWS Glue Data Catalog to track schema changes.
C. Use Apache Parquet format. Use an external Amazon DynamoDB table to track schema changes.
D. Use Apache Parquet format. Use AWS Glue Data Catalog to track schema changes.
E. Use ORC format. Store schema definitions in separate files in Amazon S3.
B. Use ORC format. Use AWS Glue Data Catalog to track schema changes.
D. Use Apache Parquet format. Use AWS Glue Data Catalog to track schema changes.
Explanation
ORC is a columnar format optimized for Athena and provides high compression and predicate pushdown for faster queries; Glue Data Catalog manages schema evolution.
Parquet is a columnar format optimized for Athena with efficient compression and predicate pushdown; the Glue Data Catalog likewise tracks its schema changes. Both choices keep the data columnar and cataloged, which yields the shortest Athena query times.
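A minimal sketch of the write side, assuming the EMR cluster is configured to use the AWS Glue Data Catalog as its Spark/Hive metastore and that the bucket, database, and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("stream-transform")
        .enableHiveSupport()          # Glue Data Catalog serves as the metastore on EMR
        .getOrCreate()
    )

    # Transformed streaming output (shown here as a simple batch read for brevity).
    events = spark.read.json("s3://example-raw-events/stream/")

    # Write columnar Parquet partitioned by date and register it in the catalog,
    # so Athena can prune partitions and read only the columns a query needs.
    (events.write
        .mode("append")
        .partitionBy("event_date")
        .format("parquet")
        .option("path", "s3://example-curated-bucket/events/")
        .saveAsTable("analytics_db.events"))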