Prepare Professional-Data-Engineer Question Answers Free Update With 100% Exam Passing Guarantee [2023]
Dumps Real Google Professional-Data-Engineer Exam Questions [Updated 2023]
Google Professional-Data-Engineer certification exam is a rigorous exam that requires a significant amount of preparation. Candidates must have extensive experience working with big data solutions and be familiar with the latest trends in data processing and analysis. Google Certified Professional Data Engineer Exam certification is highly valued in the industry and can lead to new career opportunities and higher salaries.
Build & Operationalize Data Processing Systems
- Build & Operationalize Storage Systems: This part tests the candidate's skill in using managed services effectively, including Cloud Spanner, Cloud Bigtable, BigQuery, Cloud SQL, Cloud Memorystore, Cloud Datastore, and Cloud Storage. It also covers managing the data lifecycle, storage performance, and storage costs;
- Build & Operationalize Processing Infrastructure: The considerations for this subject area include provisioning resources, adjusting and monitoring pipelines, and testing & quality control;
- Build & Operationalize Pipeline: This module requires candidates to demonstrate competence in data cleansing, transformation, batch & streaming processing, data import & acquisition, and integration with new data sources.
NEW QUESTION # 107
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the
world. The company has patents for innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to
overcome communications challenges in space. Fundamental to their operation, they need to create a
distributed data infrastructure that drives real-time analysis and incorporates machine learning to
continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the
network allowing them to account for the impact of dynamic regional politics on location availability and
cost.
Their management and operations teams are situated all around the globe, creating a many-to-many
relationship between data consumers and providers in their system. After careful consideration, they
decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
Scale and harden their PoC to support significantly more data flows generated when they ramp to more
than 50,000 installations.
Refine their machine-learning cycles to verify and improve the dynamic models they use to control
topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production
- to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
Scale up their production environment with minimal cost, instantiating resources when and where
needed in an unpredictable, distributed telecom user community.
Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
Provide reliable and timely access to data for analysis from distributed research workers
Maintain isolated environments that support rapid iteration of their machine-learning models without
affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows
each.
Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately
100m records/day
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems
both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive
hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize
our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data
secure. We also need environments in which our data scientists can carefully study and quickly adapt our
models. Because we rely on automation to process our data, we also need our development and test
environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis.
Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to
work on our high-value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000
installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud
Dataflow pipeline configuration setting should you update?
- A. The number of workers
- B. The zone
- C. The maximum number of workers
- D. The disk size per worker
Answer: C
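As a minimal sketch of where the worker ceiling lives in practice, the snippet below assembles command-line arguments for a streaming pipeline launched with the Apache Beam Python SDK. The flag names (`--max_num_workers`, `--autoscaling_algorithm`) are Beam pipeline options; the project ID, region, and the value 100 are hypothetical placeholders, not values taken from the case study.

```python
# Sketch: build the argument list a Beam Python pipeline would receive.
# The project, region, and worker ceiling below are hypothetical.
def dataflow_args(project: str, region: str, max_workers: int) -> list:
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        "--streaming",
        # Let the service add or remove workers based on throughput...
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        # ...but never beyond this ceiling.
        f"--max_num_workers={max_workers}",
    ]


ARGS = dataflow_args("mjtelco-prod", "us-central1", 100)
```

Raising the ceiling does not start extra workers by itself; it only lets autoscaling go higher when load requires it.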
NEW QUESTION # 108
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?
- A. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
- B. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- C. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- D. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
Answer: B
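A moving average over 1 hour that is re-evaluated every 5 minutes corresponds to a sliding window. The following is a plain-Python sketch of that windowing logic, not Beam code: the function names and the integer-second timestamps are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_S = 3600  # window length: 1 hour
PERIOD_S = 300   # a new window starts every 5 minutes


def sliding_window_rates(event_times, window_s=WINDOW_S, period_s=PERIOD_S):
    """Return {window_start: average messages per second} per sliding window.

    event_times are integer timestamps in seconds. Each event is counted in
    every window [start, start + window_s) that contains it.
    """
    counts = defaultdict(int)
    for t in event_times:
        start = (t // period_s) * period_s  # latest window starting at or before t
        while start > t - window_s:
            counts[start] += 1
            start -= period_s
    return {start: n / window_s for start, n in sorted(counts.items())}


def windows_below(rates, threshold):
    """Window starts whose average rate fell below the alert threshold."""
    return [start for start, rate in rates.items() if rate < threshold]
```

Each message lands in 12 overlapping windows (3600 / 300), so a drop in throughput is visible within 5 minutes instead of only at the end of a fixed hour. Windows at the very start and end of the data are only partially filled and would need to be ignored before alerting.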
NEW QUESTION # 109
Which of the following statements is NOT true regarding Bigtable access roles?
- A. To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
- B. You can configure access control only at the project level.
- C. To give a user access to only one table in a project, you must configure access through your application.
- D. Using IAM roles, you cannot give a user access to only one table in a project, rather than all tables in a project.
Answer: A
Explanation:
For Cloud Bigtable, you can configure access control at the project level. For example, you can grant the ability to:
Read from, but not write to, any table within the project.
Read from and write to any table within the project, but not manage instances.
Read from and write to any table within the project, and manage instances.
Reference: https://cloud.google.com/bigtable/docs/access-control
NEW QUESTION # 110
Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub
streaming data, one of the important business requirements is to be able to periodically identify the inputs
and their timings during their campaign. Engineers have decided to use windowing and transformation in
Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud
Dataflow job fails for the all streaming insert. What is the most likely cause of this problem?
- A. They have not applied a global windowing function, which causes the job to fail when the pipeline is created
- B. They have not set the triggers to accommodate the data coming in late, which causes the job to fail
- C. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created
- D. They have not assigned the timestamp, which causes the job to fail
Answer: C
NEW QUESTION # 111
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
- A. Increase the regularization parameters
- B. Decrease the regularization parameters
- C. Use a larger set of features
- D. Get more training examples
- E. Reduce the number of training examples
- F. Use a smaller set of features
Answer: A,D,F
NEW QUESTION # 112
You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?
- A. Store and process the entire dataset in BigQuery.
- B. Store and process the entire dataset in Cloud Bigtable.
- C. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.
- D. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
Answer: D
NEW QUESTION # 113
Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?
- A. An hourly watermark
- B. The withAllowedLateness method
- C. An event time trigger
- D. A processing time trigger
Answer: D
Explanation:
When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.
Processing time triggers. These triggers operate on the processing time - the time when the data element is processed at any given stage in the pipeline.
Event time triggers. These triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.
Reference: https://beam.apache.org/documentation/programming-guide/#triggers
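The difference between the two time domains can be illustrated with a small plain-Python sketch. It is not Beam's trigger API; the element layout and field names are hypothetical, and it only shows how the same elements fall into different hourly buckets depending on which clock is used.

```python
from collections import Counter

HOUR_S = 3600


def hourly_counts(elements, key):
    """Count elements per hour bucket, bucketing on the chosen time field."""
    return Counter(e[key] // HOUR_S for e in elements)


# Hypothetical elements: timestamps are seconds since an arbitrary origin.
elements = [
    {"event_time": 3500, "processing_time": 3550},
    {"event_time": 3590, "processing_time": 3700},  # occurred in hour 0, arrived in hour 1
    {"event_time": 3650, "processing_time": 3710},
]

BY_PROCESSING_TIME = hourly_counts(elements, "processing_time")
BY_EVENT_TIME = hourly_counts(elements, "event_time")
```

Aggregating "based on the time when the data entered the pipeline" matches the processing-time bucketing, where the late-arriving element is counted in the hour it arrived.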
NEW QUESTION # 114
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
- A. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
- B. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
- C. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
- D. Load the data every 30 minutes into a new partitioned table in BigQuery.
Answer: B
NEW QUESTION # 115
If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?
- A. Regressor
- B. Unsupervised learning
- C. Clustering estimator
- D. Classifier
Answer: A
Explanation:
Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores.
Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades.
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset. Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis.
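To make the regression case concrete, here is a minimal sketch that fits a straight line to a short price history by ordinary least squares and extrapolates one step ahead. The price series is made up, and a real stock model would need far more than a linear trend; the point is only that the output is a continuous number, not a category.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical closing prices for four consecutive days.
prices = [100.0, 101.0, 102.0, 103.0]
days = list(range(len(prices)))

slope, intercept = fit_line(days, prices)
next_price = slope * len(prices) + intercept  # continuous prediction for day 4
```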
NEW QUESTION # 116
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning.
Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
- A. Re-create the table using data partitioning on the package delivery date.
- B. Implement clustering in BigQuery on the ingest date column.
- C. Tier older data onto Cloud Storage files, and leverage external tables.
- D. Implement clustering in BigQuery on the package-tracking ID column.
Answer: D
NEW QUESTION # 117
Your company maintains a hybrid deployment with GCP, where analytics are performed on your
anonymized customer data. The data are imported to Cloud Storage from your data center through parallel
uploads to a data transfer server running on GCP. Management informs you that the daily transfers take
too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action
should you take?
- A. Increase the CPU size on your server.
- B. Increase your network bandwidth from Compute Engine to Cloud Storage.
- C. Increase your network bandwidth from your datacenter to GCP.
- D. Increase the size of the Google Persistent Disk on your server.
Answer: C
NEW QUESTION # 118
Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
- A. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
- B. Introduce data compression for each file to increase the rate of file transfer.
- C. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
- D. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
- E. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
Answer: D,E
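A rough back-of-the-envelope calculation shows why this scenario is bound by latency and not by bandwidth. The file count, file size, latency, and link speed come from the question; the assumption of one sequential network round trip per file is a simplification introduced here (a real SFTP transfer may need more or fewer).

```python
FILES_PER_HOUR = 20_000
FILE_BYTES = 4 * 1024        # upper bound: each file is under 4 KB
LATENCY_S = 0.2              # 200 ms to Google Cloud
LINK_MBPS = 50
ROUND_TRIPS_PER_FILE = 1     # assumption: one sequential round trip per file

# Bandwidth actually needed to move one hour of files within an hour.
required_mbps = FILES_PER_HOUR * FILE_BYTES * 8 / 3600 / 1e6
utilization = required_mbps / LINK_MBPS

# Time spent just waiting on the network if files go one at a time.
sequential_s = FILES_PER_HOUR * ROUND_TRIPS_PER_FILE * LATENCY_S

# Same data bundled into archives of 1,000 files each.
archives = FILES_PER_HOUR // 1000
tar_s = archives * ROUND_TRIPS_PER_FILE * LATENCY_S
```

Under these assumptions the data needs well under 1 Mbps, while per-file waiting alone adds up to roughly an hour for each hour of files. Sending files in parallel or bundling them into archives reduces the number of sequential round trips, which is the actual bottleneck.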
NEW QUESTION # 119
Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?
- A. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.
- B. Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
- C. Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.
- D. Use the NOW() function in BigQuery to record the event's time.
Answer: A
NEW QUESTION # 120
Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub
streaming data, one of the important business requirements is to be able to periodically identify the inputs and their timings during their campaign. Engineers have decided to use windowing and transformation in Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud Dataflow job fails for the all streaming insert. What is the most likely cause of this problem?
- A. They have not applied a global windowing function, which causes the job to fail when the pipeline is created
- B. They have not set the triggers to accommodate the data coming in late, which causes the job to fail
- C. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created
- D. They have not assigned the timestamp, which causes the job to fail
Answer: C
NEW QUESTION # 121
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Use the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- B. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, use the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- C. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- D. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Use the column TS instead of the column DT from now on.
- E. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Use the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
Answer: E
Explanation:
When the data type change is permanent, it is better to write the converted data to a new table and retire the old one. A view is not suitable because the cast is re-executed every time the view is queried, which adds cost to every future query.
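The one-time conversion idea can be sketched outside BigQuery in plain Python: parse each epoch string once into a proper timestamp, then compute durations on the typed values. The sketch assumes DT holds whole seconds since the Unix epoch, which the question does not state, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def to_timestamp(dt_string):
    """Convert an epoch-seconds string (assumed format of DT) to a UTC datetime."""
    return datetime.fromtimestamp(int(dt_string), tz=timezone.utc)


def session_duration_seconds(dt_strings):
    """Duration between the first and last click of one session."""
    stamps = [to_timestamp(s) for s in dt_strings]  # converted once, reused after
    return (max(stamps) - min(stamps)).total_seconds()
```

Doing the string-to-timestamp conversion once and storing the result is the same trade-off as writing to a new table: later queries work on typed values without repeating the cast.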
NEW QUESTION # 122
You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their query and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
- A. Convert your batch BQ queries into interactive BQ queries.
- B. Create an additional project to overcome the 2K on-demand per-project quota.
- C. Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.
- D. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
Answer: D
Explanation:
Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
NEW QUESTION # 123
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.
- A. master key
- B. unique key
- C. row key
- D. primary key
Answer: C
Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview
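Because the row key is the only indexed value and rows are stored in lexicographic key order, lookups by key and scans over a key prefix are the efficient access paths. The toy class below imitates that behavior in memory; it is an illustration only, not the Bigtable client API, and the `device#date` key layout is a hypothetical design.

```python
import bisect


class RowKeyIndex:
    """Toy in-memory table indexed only by row key, kept in sorted order."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, key):
        """Point lookup by exact row key."""
        return self._rows.get(key)

    def scan_prefix(self, prefix):
        """Return all row keys starting with prefix, using the sort order."""
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # sorted order: no later key can match
            matches.append(key)
        return matches


index = RowKeyIndex({
    "device2#2023-01-01": {"temp": 20},
    "device1#2023-01-02": {"temp": 22},
    "device1#2023-01-01": {"temp": 21},
})
```

Queries on any other column would have to read every row, which is why row key design matters so much for this kind of store.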
Google Professional-Data-Engineer certification involves passing a rigorous exam that tests the candidate's knowledge and skills in several areas, including data processing systems, data analysis, data modeling, machine learning, and data visualizations. Professional-Data-Engineer exam is designed to assess the candidate's ability to design, build, and maintain data processing systems that can handle large quantities of data and provide accurate insights.