A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.
Which AWS Glue feature should the data engineer use to meet this requirement?
A. Workflows B. Triggers C. Job bookmarks D. Classifiers
C. Job bookmarks
Explanation
Problem Analysis:
The pipeline processes compressed files in S3 and must support incremental data processing.
AWS Glue features must facilitate tracking progress to avoid reprocessing the same data.
Key Considerations:
Incremental data processing requires tracking which files or partitions have already been processed.
The solution must be automated and efficient for large-scale ETL jobs.
Solution Analysis:
Option A: Workflows
Workflows organize and orchestrate multiple Glue jobs but do not track progress for incremental data processing.
Option B: Triggers
Triggers initiate Glue jobs based on a schedule or events but do not track which data has been processed.
Option C: Job Bookmarks
Job bookmarks track the state of the data that has been processed, enabling incremental processing.
They automatically skip files or partitions that earlier runs of the Glue job already processed.
Option D: Classifiers
Classifiers determine the schema of incoming data but do not handle incremental processing.
Final Recommendation:
Job bookmarks are specifically designed to enable incremental data processing in AWS Glue ETL pipelines.
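As a minimal sketch of how bookmarks are switched on for a run, the job parameter --job-bookmark-option can be set to job-bookmark-enable. The helper below only builds the keyword arguments that would be handed to boto3's glue.start_job_run; the job name is a hypothetical placeholder and no AWS call is made.

```python
# Sketch: build the arguments for a Glue job run with bookmarks enabled.
# "ingest-compressed-files" is a hypothetical job name used for illustration.

def bookmark_run_kwargs(job_name: str, enable: bool = True) -> dict:
    """Return kwargs suitable for boto3's glue.start_job_run(**kwargs)."""
    option = "job-bookmark-enable" if enable else "job-bookmark-disable"
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": option},
    }


kwargs = bookmark_run_kwargs("ingest-compressed-files")
print(kwargs)
# With credentials configured, this could be passed on as:
#   boto3.client("glue").start_job_run(**kwargs)
```

The same option can also be stored as a default argument on the job definition, so that scheduled runs pick it up without passing it each time.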
Question 202:
A company creates a new non-production application that runs on an Amazon EC2 instance. The application needs to communicate with an Amazon RDS database instance using Java Database Connectivity (JDBC). The EC2 instances and the RDS database instance are in the same subnet.
Which solution will meet this requirement?
A. Modify the IAM role that is assigned to the database instance to allow connections from the EC2 instances. B. Modify the ec2_authorized_hosts parameter in the RDS parameter group to include the EC2 instances. Restart the database instance. C. Update the database security group to allow connections from the EC2 instances. D. Enable the Amazon RDS Data API and specify the Amazon Resource Name (ARN) of the database instance in the JDBC connection string.
C. Update the database security group to allow connections from the EC2 instances.
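A minimal sketch of the security group change, assuming a MySQL-style database port (3306) and two hypothetical security group IDs. The function only builds the parameters that boto3's ec2.authorize_security_group_ingress accepts; it does not call AWS.

```python
# Sketch: inbound rule on the database security group that admits traffic
# from the security group attached to the EC2 instances.
# Both group IDs and the port are assumptions for illustration.

def db_ingress_rule(db_sg_id: str, app_sg_id: str, port: int = 3306) -> dict:
    """Allow TCP traffic on the database port from the application's security group."""
    return {
        "GroupId": db_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "UserIdGroupPairs": [{"GroupId": app_sg_id}],
            }
        ],
    }


rule = db_ingress_rule("sg-0db0000000000000a", "sg-0ec2000000000000b")
print(rule)
# With credentials configured:
#   boto3.client("ec2").authorize_security_group_ingress(**rule)
```

Security groups are evaluated per network interface, so sharing a subnet does not by itself open the database port. Referencing the source security group instead of IP addresses keeps the rule valid when instances are replaced.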
Question 203:
A company must retain specific data for 1 year. A data engineer observes that one of the company's Amazon S3 buckets contains millions of objects that are older than 3 years. Versioning is enabled on the bucket.
To reduce costs, the data engineer implements an S3 Lifecycle rule to expire objects after 365 days. The new S3 Lifecycle rule causes the object count to double instead of decrease.
Which additional step must the data engineer take to permanently delete the old objects?
A. Disable versioning on the S3 bucket. B. Use an AWS Lambda function to run a Python job to identify and delete objects that are older than 365 days. C. Suspend versioning on the S3 bucket. D. Add an additional S3 Lifecycle rule to delete the current and expired versions of objects that are older than 365 days.
D. Add an additional S3 Lifecycle rule to delete the current and expired versions of objects that are older than 365 days.
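The object count doubles because, in a versioned bucket, expiring a current version adds a delete marker and keeps the old data as a noncurrent version. A lifecycle rule along the following lines would also expire the noncurrent versions. This is a sketch: the rule ID and the 1-day noncurrent retention are assumptions chosen for illustration, and the dictionary follows the shape that boto3's s3.put_bucket_lifecycle_configuration expects without calling it.

```python
# Sketch: lifecycle configuration for a versioned bucket.
# The rule ID and the noncurrent retention period are illustrative assumptions.

def lifecycle_config(current_days: int = 365, noncurrent_days: int = 1) -> dict:
    """Expire current versions, then permanently delete noncurrent versions."""
    return {
        "Rules": [
            {
                "ID": "expire-old-objects-and-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
                # Current versions older than current_days get a delete marker.
                "Expiration": {"Days": current_days},
                # Versions that have been noncurrent for noncurrent_days are deleted.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


config = lifecycle_config()
print(config)
# With credentials configured:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Delete markers that no longer have any versions behind them can be cleaned up with the ExpiredObjectDeleteMarker expiration setting, which S3 does not accept in the same Expiration element as Days.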
Question 204:
A marketing company collects clickstream data. The company sends the clickstream data to Amazon Kinesis Data Firehose and stores the clickstream data in Amazon S3. The company wants to build a series of dashboards that hundreds of users from multiple departments will use.
The company will use Amazon QuickSight to develop the dashboards. The company wants a solution that can scale and provide daily updates about clickstream activity.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A. Use Amazon Redshift to store and query the clickstream data. B. Use Amazon Athena to query the clickstream data. C. Use Amazon S3 analytics to query the clickstream data. D. Access the query data through a QuickSight direct SQL query. E. Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.
B. Use Amazon Athena to query the clickstream data. E. Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.
Question 205:
A company runs a scheduled AWS Glue ETL job that reads daily files from an Amazon S3 prefix and writes curated data to another S3 prefix. The job reprocesses previously handled files every morning, which creates duplicate records in the target dataset. A data engineer needs the job to process only new supported source files after each successful run.
Which solution will meet this requirement with the LEAST operational overhead?
A. Enable AWS Glue job bookmarks for the job and ensure the script commits the job state at the end of each successful run. B. Configure the target S3 bucket with S3 Versioning so that duplicate output files are retained as separate versions. C. Add an Amazon EventBridge schedule that starts the AWS Glue job only once each week. D. Store the list of processed object keys in an Amazon RDS table and update the table manually after each job run.
A. Enable AWS Glue job bookmarks for the job and ensure the script commits the job state at the end of each successful run.
Explanation
AWS Glue job bookmarks persist state for supported sources and let later job runs process only new data since the last checkpoint. The script must commit the job so the bookmark state is updated. S3 Versioning preserves object history but does not control ETL input selection. Running the job less often does not prevent duplicate processing. A manually maintained database adds operational work that AWS Glue can avoid.
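Why the commit matters can be illustrated with a toy model. The class below is a hypothetical stand-in, not AWS Glue's implementation: Glue tracks S3 sources by other means (its documentation describes using object last-modified times), whereas this sketch simply remembers object keys. It shows only that state advances when the run commits, which in a real Glue script corresponds to calling job.commit() at the end.

```python
# Toy model of bookmark commit semantics. ToyBookmark is hypothetical and
# is NOT how AWS Glue stores bookmark state.

class ToyBookmark:
    def __init__(self):
        self.committed = set()  # keys recorded by runs that committed
        self._pending = set()   # keys read by the current run

    def new_files(self, listing):
        """Return keys not yet committed and remember them as pending."""
        fresh = [key for key in listing if key not in self.committed]
        self._pending = set(fresh)
        return fresh

    def commit(self):
        """Persist the pending keys, as a successful run would."""
        self.committed |= self._pending
        self._pending = set()


bookmark = ToyBookmark()
day1 = ["in/2024-01-01.csv"]
day2 = day1 + ["in/2024-01-02.csv"]

first_run = bookmark.new_files(day1)   # run ends WITHOUT commit()
rerun = bookmark.new_files(day1)       # same file is picked up again
bookmark.commit()                      # this run commits its state
second_run = bookmark.new_files(day2)  # only the new file remains
bookmark.commit()

print(first_run, rerun, second_run)
```

The first two runs see the same file because nothing was committed between them, which mirrors the duplicate-record symptom in the question.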
Question 206:
A sales company uses AWS Glue ETL to collect, process, and ingest data into an Amazon S3 bucket. The AWS Glue pipeline creates a new file in the S3 bucket every hour. File sizes vary from 200 KB to 300 KB.
The company wants to build a sales prediction model by using data from the previous 5 years. The historic data includes 44,000 files.
The company builds a second AWS Glue ETL pipeline by using the smallest worker type. The second pipeline retrieves the historic files from the S3 bucket and processes the files for downstream analysis. The company notices significant performance issues with the second ETL pipeline.
The company needs to improve the performance of the second pipeline.
Which solution will meet this requirement MOST cost-effectively?
A. Use a larger worker type. B. Increase the number of workers in the AWS Glue ETL jobs. C. Use the AWS Glue DynamicFrame grouping option. D. Enable AWS Glue auto scaling.
C. Use the AWS Glue DynamicFrame grouping option.
Explanation
Using the AWS Glue DynamicFrame grouping option (for example, groupFiles set to inPartition with an appropriate groupSize in bytes) combines many small input files into larger groups at read time. This reduces the per-file overhead of task initialization and metadata operations, yielding much faster ETL runs without the added cost of more or bigger workers.
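A rough sketch of the grouping options and their effect on the number of read groups. The bucket path, the 250 KB average file size, and the 128 MB group size are assumptions for illustration; the dictionary is the kind of value passed as connection_options to glueContext.create_dynamic_frame.from_options in a Glue script, and nothing here calls AWS.

```python
import math

# Sketch: S3 connection options that ask Glue to group small files on read.
# Path, average file size, and group size are illustrative assumptions.

def grouped_s3_options(path: str, group_size_bytes: int) -> dict:
    return {
        "paths": [path],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": str(group_size_bytes),  # size in bytes, passed as a string
    }


FILE_COUNT = 44_000               # from the question
AVG_FILE_BYTES = 250 * 1024       # assumed midpoint of 200 KB to 300 KB
GROUP_SIZE_BYTES = 128 * 1024**2  # assumed 128 MB target per group

options = grouped_s3_options("s3://example-sales-bucket/history/", GROUP_SIZE_BYTES)
approx_groups = math.ceil(FILE_COUNT * AVG_FILE_BYTES / GROUP_SIZE_BYTES)

print(options)
print(approx_groups)
```

Under these assumptions the 44,000 files amount to roughly 11 GB, so the reader deals with on the order of a hundred groups instead of tens of thousands of individual files. Grouping applies only to certain input formats, so the file format should be checked against the Glue documentation.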
Question 207:
A company has implemented a lake house architecture in Amazon Redshift. The company needs to give users the ability to authenticate into Redshift query editor by using a third-party identity provider (IdP).
A data engineer must set up the authentication mechanism.
What is the first step the data engineer should take to meet this requirement?
A. Register the third-party IdP as an identity provider in the configuration settings of the Redshift cluster. B. Register the third-party IdP as an identity provider from within Amazon Redshift. C. Register the third-party IdP as an identity provider for AWS Secrets Manager. Configure Amazon Redshift to use Secrets Manager to manage user credentials. D. Register the third-party IdP as an identity provider for AWS Certificate Manager (ACM). Configure Amazon Redshift to use ACM to manage user credentials.
A. Register the third-party IdP as an identity provider in the configuration settings of the Redshift cluster.
Question 208:
A data engineer is building a serverless, multi-step extract, transform, and load (ETL) pipeline. The pipeline extracts data from an Amazon S3 data lake and transforms the data by using AWS Glue ETL jobs. The pipeline then loads the results into an Amazon Redshift database. The data engineer needs to orchestrate the serverless ETL workflow.
Which solutions will meet these requirements? (Choose two.)
A. Implement the workflow by using AWS Step Functions. Configure Step Functions to coordinate the AWS Glue ETL jobs and handle error conditions with automatic retries. B. Use AWS Glue workflows to create a graph of the ETL tasks that visually represents the dependencies between jobs and the job triggers. C. Provision an always-on Amazon EC2 instance. Create a cron job that invokes the AWS Glue ETL jobs in sequence based on a predefined schedule. D. Use Amazon EventBridge rules to invoke the AWS Glue ETL jobs based on S3 object creation events. Configure the rules to chain the AWS Glue ETL jobs in sequence and handle complex job dependencies. E. Build an orchestration solution by using AWS CodePipeline to coordinate the ETL pipeline and infrastructure changes based on the dependencies.
A. Implement the workflow by using AWS Step Functions. Configure Step Functions to coordinate the AWS Glue ETL jobs and handle error conditions with automatic retries. B. Use AWS Glue workflows to create a graph of the ETL tasks that visually represents the dependencies between jobs and the job triggers.
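For the Step Functions option, a state machine definition might look roughly like the following. This is a sketch under stated assumptions: the two job names and the retry settings are hypothetical, and the code only assembles the Amazon States Language document as JSON. The resource ARN is the Step Functions service integration that starts a Glue job run and waits for it to finish.

```python
import json

# Sketch: Amazon States Language definition that runs two Glue jobs in
# sequence with retries. Job names and retry values are assumptions.

def glue_task(job_name: str, next_state=None) -> dict:
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",  # wait for completion
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Sketch: transform in Glue, then load into Redshift",
    "StartAt": "TransformJob",
    "States": {
        "TransformJob": glue_task("transform-job", next_state="LoadToRedshiftJob"),
        "LoadToRedshiftJob": glue_task("load-to-redshift-job"),
    },
}

print(json.dumps(definition, indent=2))
```

AWS Glue workflows express the same dependency chain with triggers between jobs inside Glue itself, which is why both options satisfy the requirement.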
Question 209:
A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access.
Which solution will meet these requirements with the LEAST effort?
A. Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster. B. Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects. C. Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects. D. Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.
C. Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.
Explanation
Option C is the best solution to meet the requirements with the least effort because server-side encryption with AWS KMS keys (SSE-KMS) is a feature that allows you to encrypt data at rest in Amazon S3 using keys managed by AWS Key Management Service (AWS KMS). AWS KMS is a fully managed service that enables you to create and manage encryption keys for your AWS services and applications. AWS KMS also allows you to define granular access policies for your keys, such as who can use them to encrypt and decrypt data, and under what conditions. By using SSE-KMS, you can protect your S3 objects by using encryption keys that only specific employees can access, without having to manage the encryption and decryption process yourself.
Option A is not a good solution because it involves using AWS CloudHSM, which is a service that provides hardware security modules (HSMs) in the AWS Cloud. AWS CloudHSM allows you to generate and use your own encryption keys on dedicated hardware that is compliant with various standards and regulations.
However, AWS CloudHSM is not a fully managed service and requires more effort to set up and maintain than AWS KMS. Moreover, AWS CloudHSM does not integrate with Amazon S3, so you have to configure the process that writes to S3 to make calls to CloudHSM to encrypt and decrypt the objects, which adds complexity and latency to the data protection process.
Option B is not a good solution because it involves using server-side encryption with customer-provided keys (SSE-C), which is a feature that allows you to encrypt data at rest in Amazon S3 using keys that you provide and manage yourself. SSE-C requires you to send your encryption key along with each request to upload or retrieve an object. However, SSE-C does not provide any mechanism to restrict access to the keys that encrypt the objects, so you have to implement your own key management and access control system, which adds more effort and risk to the data protection process.
Option D is not a good solution because it involves using server-side encryption with Amazon S3 managed keys (SSE-S3), which is a feature that allows you to encrypt data at rest in Amazon S3 using keys that are managed by Amazon S3. SSE-S3 automatically encrypts and decrypts your objects as they are uploaded and downloaded from S3. However, SSE-S3 does not allow you to control who can access the encryption keys or under what conditions. Amazon S3 owns and manages the keys, so no key policy or IAM policy can grant or deny use of the keys to individual employees; anyone who is allowed to read an object receives the decrypted data. This means that you cannot restrict access to the keys that encrypt the objects to specific employees, which does not meet the requirements.
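The SSE-KMS approach can be sketched as two pieces: the upload parameters that request encryption with a specific KMS key, and an IAM policy statement that lets only selected principals use that key. The key ARN, bucket, and object key below are hypothetical placeholders; the dictionaries follow the shapes used by boto3's s3.put_object and by IAM policy documents, and no AWS call is made.

```python
# Sketch: SSE-KMS upload parameters plus a key-usage policy statement.
# The key ARN, bucket name, and object key are hypothetical placeholders.

KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def sse_kms_put_kwargs(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Kwargs for boto3's s3.put_object that request SSE-KMS with a given key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }


def key_access_statement(kms_key_arn: str) -> dict:
    """IAM statement to attach only to the employees who may use the key."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": kms_key_arn,
    }


put_kwargs = sse_kms_put_kwargs("example-call-logs", "logs/call-0001.json", b"{}", KEY_ARN)
statement = key_access_statement(KEY_ARN)
print(put_kwargs["ServerSideEncryption"], statement["Action"])
```

A principal that can read the object in S3 but lacks kms:Decrypt on the key cannot retrieve the plaintext, which is what ties object access to the restricted set of employees.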
Question 210:
A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is in Apache Parquet format. The company is experiencing slow query performance.
Which solutions will improve query performance? (Choose two.)
A. Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule. B. Use AWS Glue Data Catalog to automatically compact the Iceberg tables. C. Use AWS Glue Data Catalog to automatically optimize indexes for the Iceberg tables. D. Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables. E. Use AWS Glue Data Catalog to generate views for the Iceberg tables.
B. Use AWS Glue Data Catalog to automatically compact the Iceberg tables. D. Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.
Explanation
Compaction reduces many small Parquet files into larger ones, lowering file-listing and open/scan overhead for faster reads. Copy-on-write favors read performance by materializing updates into rewritten data files, resulting in more contiguous, scan-efficient files for queries.
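The two settings can be sketched as follows, with several assumptions stated up front. The database, table, account ID, and role ARN are hypothetical. The first helper builds parameters in the shape of boto3's glue.create_table_optimizer for a compaction optimizer; the second builds a Spark SQL statement that sets Apache Iceberg's write-mode table properties to copy-on-write, which is one common way to configure that behavior and is an assumption about how it would be applied here. Nothing below calls AWS or Spark.

```python
# Sketch only: names, IDs, and ARNs are hypothetical placeholders.

def compaction_optimizer_kwargs(account_id: str, database: str, table: str, role_arn: str) -> dict:
    """Parameters shaped for boto3's glue.create_table_optimizer (compaction)."""
    return {
        "CatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Type": "compaction",
        "TableOptimizerConfiguration": {"roleArn": role_arn, "enabled": True},
    }


def copy_on_write_sql(qualified_table: str) -> str:
    """Spark SQL that sets Iceberg's write modes to copy-on-write."""
    props = {
        "write.update.mode": "copy-on-write",
        "write.delete.mode": "copy-on-write",
        "write.merge.mode": "copy-on-write",
    }
    rendered = ", ".join(f"'{name}'='{value}'" for name, value in props.items())
    return f"ALTER TABLE {qualified_table} SET TBLPROPERTIES ({rendered})"


optimizer = compaction_optimizer_kwargs(
    "111122223333", "game_events", "sessions",
    "arn:aws:iam::111122223333:role/ExampleGlueOptimizerRole",
)
sql = copy_on_write_sql("glue_catalog.game_events.sessions")
print(optimizer)
print(sql)
```

Copy-on-write shifts work to the writer, so it trades slower updates for faster reads; whether that trade suits a streaming workload depends on the update rate.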