Microsoft recently updated the Azure Data Engineer Associate certification. To earn it, you need to pass a single exam: Exam DP-203: Data Engineering on Microsoft Azure.
Topics we’ll cover:
- Exam DP-203: Data Engineering on Microsoft Azure
- Skills To Demonstrate: Data Engineering on Microsoft Azure
- Examination Pattern: Data Engineering on Microsoft Azure
- Type of Questions: Data Engineering on Microsoft Azure
- Study Guide for Data Engineering on Microsoft Azure
- Module 1: Explore compute and storage options for data engineering workloads
- Module 2: Run interactive queries using Azure Synapse Analytics serverless SQL pools
- Module 3: Data exploration and transformation in Azure Databricks
- Module 4: Explore, transform, and load data into the Data Warehouse using Apache Spark
- Module 5: Ingest and load data into the data warehouse
- Module 6: Transform data with Azure Data Factory or Azure Synapse Pipelines
- Module 7: Orchestrate data movement and transformation in Azure Synapse Pipelines
- Module 8: End-to-end security with Azure Synapse Analytics
- Module 9: Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link
- Module 10: Real-time Stream Processing with Stream Analytics
- Module 11: Create a Stream Processing Solution with Event Hubs and Azure Databricks
Exam DP-203: Data Engineering on Microsoft Azure
Exam DP-203 replaces Exams DP-200 and DP-201, both of which retire on August 31, 2021.
A candidate for this certification should have a solid knowledge of data processing languages, like Scala, SQL, or Python, and they must understand data architecture patterns and parallel processing.
Azure data engineers are responsible for data-related tasks that include provisioning Azure data storage services, building and maintaining secure and compliant data processing pipelines, ingesting streaming and batch data, implementing security requirements, transforming data, implementing data retention policies, identifying performance bottlenecks, and monitoring and optimizing data platforms to fulfill the data pipeline needs.
Skills To Demonstrate: Data Engineering on Microsoft Azure
Candidates for this exam must have subject matter expertise in integrating, transforming, and consolidating data from various structured and unstructured data systems into a structure that is suitable for building analytics solutions.
Examination Pattern: Data Engineering on Microsoft Azure
The Microsoft DP-203 exam will have 40-60 questions in formats such as multiple-choice, arrange-in-the-correct-sequence, scenario-based single-answer, and drag-and-drop questions.
There will be a time limit of 120 minutes to complete the exam, and the passing score is a minimum of 700. Further, the Microsoft DP-203 exam costs $165 USD and can be taken only in the English language.
Type of Questions: Data Engineering on Microsoft Azure
The question types are:
- Multiple-choice questions.
- A case study with multiple questions.
- Scenario-based single-choice questions.
- Arrange-in-the-correct-sequence questions.
- Questions that cannot be skipped: there will be at least 3 questions in a sequence where you have to answer Yes or No. These questions cannot be skipped or re-answered afterward.
Study Guide for Data Engineering on Microsoft Azure
Here is a comprehensive list of study material covering the DP-203 scope and questions.
*All links are from Microsoft or K21Academy blogs; I am just listing them here.
**I will share links for hands-on labs as well; if you do the hands-on labs, DP-203 is easy-peasy to crack.
1. Official Microsoft labs on DP-203 for anyone to learn from:
MicrosoftLearning/DP-203-Data-Engineer (github.com)
2. Azure free account:
Create Your Azure Free Account Today | Microsoft Azure
3. Microsoft Learn:
Browse all – Learn | Microsoft Docs
Module 1 Explore compute and storage options for data engineering workloads
1. Azure Synapse Analytics
Azure Synapse is a limitless analytics service that brings together Big Data analytics and enterprise data warehousing. It gives you the ability to query data on your terms, using either serverless or provisioned resources, at scale. Synapse brings these two worlds together with a consolidated experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.
Introduction to Azure Synapse Analytics – Learn | Microsoft Docs
2. Components of Azure Synapse Analytics
Learn the individual components of Azure Synapse Analytics that facilitate you to build your analytical solutions in one place.
Survey the Components of Azure Synapse Analytics – Learn | Microsoft Docs
3. Modern Data Warehouse
Learn how Azure Synapse Analytics enables you to build Data Warehouses using modern architecture patterns.
Design a Modern Data Warehouse using Azure Synapse Analytics – Learn | Microsoft Docs
4. Azure Databricks
Explore the capabilities of Azure Databricks and the Apache Spark notebook for handling huge files. Understand the Azure Databricks platform and determine the types of tasks well-suited for Apache Spark.
Describe Azure Databricks – Learn | Microsoft Docs
5. Apache Spark
When it comes to dealing with Big Data in a unified way, whether you process it in batches or in real time as it arrives, Apache Spark provides a fast and capable engine that also supports data science processes, like machine learning and advanced analytics.
Introduction – Learn | Microsoft Docs
6. Azure Databricks security
Azure Databricks provides many tools for securing your network infrastructure and data.
Azure Databricks security guide – Azure Databricks – Workspace | Microsoft Docs
7. Integrate Azure Databricks with Azure Synapse
Azure Databricks is one of the most powerful data services in Azure. Explore how to integrate it with Azure Synapse Analytics as part of your data architecture.
Integrate Azure Databricks with Azure Synapse – Learn | Microsoft Docs
8. Processing Big Data with Azure Data Lake Store
9. Azure Data Lake Storage Gen2
A data lake is a repository of data stored in its raw format, usually as files or blobs. Azure Data Lake Storage is a scalable, comprehensive, and cost-effective data lake solution for big data analytics built into Azure.
Understand Azure Data Lake Storage Gen2 – Learn | Microsoft Docs
10. Compare Azure Data Lake Store to Azure Blob storage
Compare Azure Data Lake Store to Azure Blob storage – Learn | Microsoft Docs
11. Big Data use cases
Examine uses for Azure Data Lake Storage Gen2 – Learn | Microsoft Docs
12. Delta Lake architecture
Use Delta Lake as an optimization layer on top of Azure Blob storage to ensure low latency and reliability within unified streaming and batch data pipelines.
Describe Azure Databricks Delta Lake architecture – Learn | Microsoft Docs
13. What are data streams
Data streams are most often used to better understand change over time. For example, a company may perform sentiment analysis on real-time tweets to see if an advertising campaign results in more positive comments about the company or its products.
Understand data streams – Learn | Microsoft Docs
14. Event processing
There are many services available for real-time analytics on Azure. This article gives you the information you need to decide which technology is the best fit for your application.
Choose a real-time and stream processing solution on Azure | Microsoft Docs
15. Process events with Azure Stream Analytics
Azure Stream Analytics is a PaaS event processing engine. It enables the transformation and analysis of huge volumes of streaming data arriving from IoT Hub and Event Hubs and static data from storage. Using Stream Analytics, you can write time-based queries and aggregations over the data generated by connected devices, sensors, or applications.
Process events with Azure Stream Analytics – Learn | Microsoft Docs
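The time-based queries and aggregations mentioned above can be sketched in plain Python. This is only a conceptual illustration of a tumbling window (fixed, non-overlapping time buckets), not the Stream Analytics engine or its query language; the event values and field names are made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, device_id) events into fixed, non-overlapping
    time windows and count events per device in each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, device in events:
        # Each event falls into exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][device] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(0, "sensor-a"), (5, "sensor-a"), (12, "sensor-b"), (31, "sensor-a")]
print(tumbling_window_counts(events, 30))
# {0: {'sensor-a': 2, 'sensor-b': 1}, 30: {'sensor-a': 1}}
```

In Stream Analytics itself you would express the same idea declaratively, grouping by a window function in its SQL-like query language rather than writing the loop yourself.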
16. Work with data streams by using Azure Stream Analytics
Explore how Azure Stream Analytics integrates with IoT devices or your applications to gain insights with real-time streaming data. Learn how to consume and analyze data streams and derive actionable results.
Work with data streams by using Azure Stream Analytics – Learn | Microsoft Docs
Module 2 Run interactive queries using Azure Synapse Analytics serverless SQL pools
1. Azure Synapse serverless SQL Pools
Every Azure Synapse Analytics workspace creates serverless SQL pool endpoints that you can use to query data in the lake. It is a query service over the data in your Azure data lake.
What is Azure Synapse serverless SQL pools – Learn | Microsoft Docs
2. Comparing dedicated SQL Pools with serverless SQL pools in Azure Synapse Analytics
Synapse SQL offers both serverless and dedicated resource models, offering consumption and billing options to fit your needs. For predictable performance and cost, provision dedicated SQL pools to reserve processing power for data stored in SQL tables. For unplanned or bursty workloads, use the serverless SQL pools.
When to use Azure Synapse serverless SQL pools – Learn | Microsoft Docs
3. Azure Synapse serverless SQL pools use cases
A serverless SQL pool acts and performs like a regular SQL Server, so any client that can connect to SQL Server can connect to a serverless SQL pool as well.
Azure Synapse serverless SQL pools use cases – Learn | Microsoft Docs
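Because a serverless SQL pool behaves like SQL Server, existing SQL clients connect to its endpoint with an ordinary connection string. The sketch below just assembles such a string; the workspace name is hypothetical, and the `-ondemand` suffix is the endpoint pattern serverless pools expose:

```python
workspace = "contoso-synapse"  # hypothetical workspace name

# Serverless (on-demand) SQL endpoint; any SQL Server client or driver
# that supports Azure AD authentication can use a string like this.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server={workspace}-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)
print(conn_str)
```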
4. Common files to query
Learn how you can query the various file types that can be stored in an Azure data lake.
Query data in the lake using Azure Synapse serverless SQL pools – Learn | Microsoft Docs
5. Use Synapse Studio to analyze and visualize data via Azure Synapse serverless SQL pool
Azure Synapse Studio is the primary tool to use to interact with the many components that exist in the service.
6. Querying parquet files in a data lake
You can also execute a query using a serverless SQL pool that will read Parquet files. The OPENROWSET function enables you to read the content of a parquet file by providing the URL to your file.
Query a Parquet file using Azure Synapse serverless SQL pools – Learn | Microsoft Docs
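The OPENROWSET pattern described above looks roughly like the following. The T-SQL is held in a Python string only to keep the snippets in one language; the storage account and path are hypothetical:

```python
# Hypothetical data lake path; replace with your own account and container.
parquet_url = "https://contosolake.dfs.core.windows.net/files/sales/*.parquet"

# T-SQL you would run against a serverless SQL pool (e.g. from Synapse Studio).
query = f"""
SELECT TOP 10 *
FROM OPENROWSET(
        BULK '{parquet_url}',
        FORMAT = 'PARQUET'
    ) AS result;
"""
print(query)
```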
7. Create metadata objects in Azure Synapse serverless SQL pools
Learn how you can create objects to query data or improve your existing data transformation pipeline through Azure Synapse serverless SQL pools.
Create metadata objects in Azure Synapse serverless SQL pools – Learn | Microsoft Docs
8. Securing access to data in a data lake when using Azure Synapse Analytics
Learn how you can design security when using Azure Synapse serverless SQL pools.
Secure data and manage users in Azure Synapse serverless SQL pools – Learn | Microsoft Docs
9. Choose an authentication method
Authentication is the process of proving the user is who they claim to be. A user connects to a database using a user account. When a user attempts to connect to a database, they provide a user account and authentication information.
Choose an authentication method in Azure Synapse serverless SQL pools – Learn | Microsoft Docs
10. Manage users in Azure Synapse serverless SQL pools
Manage users in Azure Synapse serverless SQL pools – Learn | Microsoft Docs
11. Manage user access to data lake files
Access to individual folders and files can also be controlled within the storage account itself by using Access Control (IAM) settings.
Manage user permissions in Azure Synapse serverless SQL pools – Learn | Microsoft Docs
Module 3 Data exploration and transformation in Azure Databricks
1. Azure Databricks
Azure Databricks is a fully-managed, cloud-based Big Data and Machine Learning platform, which empowers developers to accelerate AI and innovation by simplifying the process of building enterprise-grade production data applications. It integrates with Azure Data services which a Data Engineer should be familiar with. These services could include Azure Data Factory, Azure Synapse Analytics, Power BI and Azure Data Lake Storage.
Describe Azure Databricks – Learn | Microsoft Docs
2. Read and write data in Azure Databricks
You can use Azure Databricks to read multiple file types, both with and without a schema, and read and write various formats such as CSV, Parquet, and JSON. In addition, Azure Databricks can combine inputs from files and data stores such as Azure SQL Database, Azure Data Lake Storage, and Azure Synapse Analytics.
Read and write data in Azure Databricks – Learn | Microsoft Docs
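The with/without-a-schema distinction can be felt with a tiny CSV and the Python standard library. This is an analogy, not the Databricks API: reading without a schema yields strings, while applying an explicit schema (like passing a `StructType` when reading in Spark, an assumption for this sketch) converts each column to a declared type:

```python
import csv
import io

raw = "id,amount\n1,9.99\n2,12.50\n"

# Without a schema, every field arrives as a string.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'id': '1', 'amount': '9.99'}

# With an explicit schema, each column is cast to its declared type.
schema = {"id": int, "amount": float}
typed = [{col: cast(row[col]) for col, cast in schema.items()} for row in rows]
print(typed[0])  # {'id': 1, 'amount': 9.99}
```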
3. Work with DataFrames in Azure Databricks
Your data processing in Azure Databricks is accomplished by defining DataFrames to read and process the Data. Learn how to perform data transformations in DataFrames and execute actions to display the transformed data.
Work with DataFrames in Azure Databricks – Learn | Microsoft Docs
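The transformation/action split that DataFrames use can be illustrated with Python generators: transformations only describe work, and nothing runs until an action asks for results. This is a conceptual analogy, not PySpark code:

```python
# Transformations build a lazy pipeline; nothing is computed yet.
numbers = range(1, 6)
doubled = (n * 2 for n in numbers)            # akin to a select/map step
evens_over_4 = (n for n in doubled if n > 4)  # akin to a filter step

# An action (collecting the results) triggers the whole pipeline at once,
# mirroring how collect()/show() drive DataFrame evaluation in Spark.
result = list(evens_over_4)
print(result)  # [6, 8, 10]
```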
Module 4 Explore, transform, and load data into the Data Warehouse using Apache Spark
1. Apache Spark
Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Apache Spark in Azure Synapse Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure Synapse makes it easy to create and configure a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Storage Gen2, so you can use Spark pools to process your data stored in Azure.
What is an Apache Spark pool in Azure Synapse Analytics – Learn | Microsoft Docs
2. How do Apache Spark pools work in Azure Synapse Analytics
Within Azure Synapse Analytics, Apache Spark applications run as independent sets of processes on a pool, coordinated by the SparkContext object in your main program (called the driver program). The SparkContext can connect to the cluster manager, which allocates resources across applications. The cluster manager is Apache Hadoop YARN. Once connected, Spark acquires executors on nodes in the pool, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
How do Apache Spark pools work in Azure Synapse Analytics – Learn | Microsoft Docs
3. When do you use Apache Spark pools in Azure Synapse Analytics
When do you use Apache Spark pools in Azure Synapse Analytics – Learn | Microsoft Docs
4. Transform data with DataFrames in Apache Spark Pools in Azure Synapse Analytics
5. Ingest data with Apache Spark notebooks in Azure Synapse Analytics
Ingest data with Apache Spark notebooks in Azure Synapse Analytics – Learn | Microsoft Docs
Module 5 Ingest and load data into the data warehouse
1. Explore modern data warehouse analytics in Azure
Learn the basic fundamentals of database concepts in a cloud environment, get basic skills in Azure cloud data services, and build your foundational knowledge of Azure cloud data services within Microsoft Azure. You will examine the processing options available for building big data analytics solutions in Azure. You will use Azure Databricks, Azure Synapse Analytics, and Azure HDInsight.
Azure Data Fundamentals: Explore modern data warehouse analytics in Azure – Learn | Microsoft Docs
2. Understand data load design goals
Loading data is essential because you need to query or analyze it to gain insights, so one of the main design goals when loading data is to minimize the impact on analytical workloads while achieving the highest throughput possible.
Understand data load design goals – Learn | Microsoft Docs
3. Inserting data into a production table
Data loading best practices for dedicated SQL pools – Azure Synapse Analytics | Microsoft Docs
4. Understand Azure Data Factory components
ADF is composed of four core components: linked services, datasets, activities, and pipelines. These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
Understand Azure Data Factory components – Learn | Microsoft Docs
5. Use data loading best practices in Azure Synapse Analytics
Learn the best practices you need to adopt to load data into a data warehouse in Synapse Analytics.
Use data loading best practices in Azure Synapse Analytics – Learn | Microsoft Docs
6. Determining which IR to use
An integration runtime needs to be associated with each source and sink data store.
Integration runtime – Azure Data Factory & Azure Synapse | Microsoft Docs
Module 6 Transform data with Azure Data Factory or Azure Synapse Pipelines
1. Data integration with Azure Data Factory or Azure Synapse Pipelines
Learn the ADF and the core components that enable you to create large-scale data ingestion solutions in the cloud.
Data integration with Azure Data Factory – Learn | Microsoft Docs
2. Linked Services
Before you create a dataset, you must create a linked service to link your data store to the data factory/Synapse pipeline. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. There are over 100 connectors that can be used to define a linked service.
Create linked services – Learn | Microsoft Docs
3. Datasets
A dataset is a named view of data that simply points or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents.
Create datasets – Learn | Microsoft Docs
4. Activities and pipelines
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task.
Create data factory activities and pipelines – Learn | Microsoft Docs
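How linked services, datasets, activities, and pipelines reference each other can be sketched as simplified definitions. These are shown as Python dicts for readability; all names are placeholders, and real ADF JSON requires more fields than this:

```python
# Hypothetical shapes only; names and connection details are placeholders.
linked_service = {                      # how to connect to a data store
    "name": "BlobStore",
    "properties": {"type": "AzureBlobStorage",
                   "typeProperties": {"connectionString": "<from Key Vault>"}},
}
dataset = {                             # a named view over data in that store
    "name": "RawSales",
    "properties": {"linkedServiceName": {"referenceName": "BlobStore"},
                   "type": "DelimitedText"},
}
pipeline = {                            # a logical grouping of activities
    "name": "CopySales",
    "properties": {"activities": [
        {"name": "CopyToWarehouse", "type": "Copy",
         "inputs": [{"referenceName": "RawSales"}],
         "outputs": [{"referenceName": "CuratedSales"}]},
    ]},
}
```

Note the chain of references: the pipeline's Copy activity reads the dataset, and the dataset reaches its data through the linked service.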
5. Code-free transformation at scale with Azure Data Factory or Azure Synapse Pipelines
Learn how to perform cleansing activities and common data transformation within Azure Data Factory without using code.
Code-free transformation at scale with Azure Data Factory – Learn | Microsoft Docs
6. Adding a source
Every data flow requires at least one source transformation, but you can add as many sources as necessary to complete your data transformations.
Source transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
7. Schema modifier transformations
Schema modifier transformations will make changes to the structure of the data. This could include renaming, dropping, or reordering columns.
8. Formatter transformations
Within the formatters category there are two transformations: Flatten and Parse. Use the Flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows; this process is known as denormalization. Use the Parse transformation to parse columns in your data that are in document form. The currently supported embedded document types are JSON, XML, and delimited text.
Flatten transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Parse data transformation in mapping data flow – Azure Data Factory | Microsoft Docs
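What the Flatten transformation does to a hierarchical record can be shown with a small Python function. This is a conceptual equivalent only, not the mapping data flow implementation, and the record fields are made up:

```python
def flatten(record, array_field):
    """Unroll an array inside a hierarchical record into one row per
    element (denormalization), keeping the parent fields on every row."""
    parent = {k: v for k, v in record.items() if k != array_field}
    return [{**parent, **item} for item in record[array_field]]

order = {"order_id": 7, "items": [{"sku": "A1", "qty": 2},
                                  {"sku": "B2", "qty": 1}]}
print(flatten(order, "items"))
# [{'order_id': 7, 'sku': 'A1', 'qty': 2}, {'order_id': 7, 'sku': 'B2', 'qty': 1}]
```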
9. Multiple inputs/outputs transformations
The multiple inputs/outputs category contains a variety of transformations that can be used to bring data together or to output data to new data paths.
Join transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Exists transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Union transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Lookup transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
10. Row modifier transformations
The row modifier transformation category includes the Filter, Sort, and Alter Row transformations.
Filter transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Sort transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Alter row transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
11. Sink transformation
After you finish transforming your data, write it into a destination store by using the sink transformation. Every data flow requires at least one sink transformation, but you can write to as many sinks as necessary to complete your transformation flow.
Sink transformation in mapping data flow – Azure Data Factory & Azure Synapse | Microsoft Docs
Module 7 Orchestrate data movement and transformation in Azure Synapse Pipelines
1. Orchestrate data movement and transformation in Azure Data Factory or Azure Synapse Pipeline
Learn how ADF can orchestrate large-scale data movement by using other Azure Data Platform and ML technologies.
Orchestrating data movement and transformation in Azure Data Factory – Learn | Microsoft Docs
2. Programmatically creating Azure Data Factory or Azure Synapse Pipelines
Quickstart: Create an Azure Data Factory using Azure CLI – Azure Data Factory | Microsoft Docs
Copy data in Blob Storage using Azure Data Factory – Azure Data Factory | Microsoft Docs
Create Azure Data Factory using .NET SDK – Azure Data Factory | Microsoft Docs
Quickstart: Create an Azure Data Factory using Python – Azure Data Factory | Microsoft Docs
Create an Azure data factory using REST API – Azure Data Factory | Microsoft Docs
3. Transform data by running a Databricks notebook
Transform data with Databricks Notebook – Azure Data Factory & Azure Synapse | Microsoft Docs
4. Controlling notebook execution in a pipeline
You can control the execution of the various components that reside within a pipeline through the branching and chaining of activities.
Module 8 End-to-end security with Azure Synapse Analytics
1. Secure a data warehouse in Azure Synapse Analytics
Security is a multi-layered approach. Data is at the core, but a range of technologies, processes, and people protect it, from the physical security of buildings to the implementation of application security. Every aspect is important.
Secure a data warehouse in Azure Synapse Analytics – Learn | Microsoft Docs
2. Configure and manage secrets in Azure Key Vault
Storing and handling secrets, encryption keys, and certificates directly is risky, and every usage introduces the possibility of unintentional data exposure. Azure Key Vault provides a secure storage area for managing all your app secrets so you can properly encrypt your data in transit or while it's being stored.
Configure and manage secrets in Azure Key Vault – Learn | Microsoft Docs
Azure Key Vault Overview – Azure Key Vault | Microsoft Docs
3. Column-level security
Column-Level security allows customers to control access to table columns based on the user's execution context or group membership.
Column-level security for dedicated SQL pool – Azure Synapse Analytics | Microsoft Docs
4. Row-Level Security
Row-Level Security enables you to use group membership or execution context to control access to rows in a database table.
Row-Level Security – SQL Server | Microsoft Docs
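The standard Row-Level Security pattern pairs an inline table-valued predicate function with a security policy. The T-SQL below is held in a Python string only to keep one snippet language; the table and column names are hypothetical, and here each sales rep would see only their own rows:

```python
# Hypothetical table/column names; the FUNCTION + SECURITY POLICY pairing
# is the documented Row-Level Security approach in SQL Server / Synapse.
rls_script = """
CREATE FUNCTION Security.fn_salesrep_filter(@SalesRep AS nvarchar(50))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_result
           WHERE @SalesRep = USER_NAME();

CREATE SECURITY POLICY SalesRepFilter
    ADD FILTER PREDICATE Security.fn_salesrep_filter(SalesRep)
    ON dbo.Sales
    WITH (STATE = ON);
"""
print(rls_script)
```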
Module 9 Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link
1. Design hybrid transactional and analytical processing using Azure Synapse Analytics
Azure Synapse Link for Azure Cosmos DB is a cloud-native hybrid transactional and analytical processing (HTAP) capability that enables you to run near real-time analytics over operational data in Azure Cosmos DB. Azure Synapse Link creates a tight, seamless integration between Azure Cosmos DB and Azure Synapse Analytics.
2. Query Azure Cosmos DB with SQL Serverless for Azure Synapse Analytics
Azure Synapse Link for Azure Cosmos DB enables users to run near real-time analytics over operational data in Azure Cosmos DB. However, there are times when some data needs to be aggregated and enriched to serve data warehouse users. Curating and exporting Synapse Link data can be done with just a few cells in a notebook.
Query Azure Cosmos DB with SQL Serverless for Azure Synapse Analytics – Learn | Microsoft Docs
3. Query Azure Cosmos DB with SQL Serverless for Azure Synapse Analytics
The query starts out by creating a new serverless SQL pool database named Profiles if it does not exist, then executes USE Profiles to run the rest of the script contents against the Profiles database. Next, it drops the UserProfileHTAP view if it exists.
Query Azure Cosmos DB with SQL Serverless for Azure Synapse Analytics – Learn | Microsoft Docs
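The opening steps described above roughly correspond to this T-SQL (a sketch held in a Python string; the full script in the Learn module contains more, including the view definition itself):

```python
# Sketch of the script's opening steps against a serverless SQL pool.
script = """
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'Profiles')
    CREATE DATABASE Profiles;
GO
USE Profiles;
GO
DROP VIEW IF EXISTS UserProfileHTAP;
GO
"""
print(script)
```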
Module 10 Real-time Stream Processing with Stream Analytics
1. Azure Event Hubs
Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.
What is Azure Event Hubs? – a Big Data ingestion service – Azure Event Hubs | Microsoft Docs
2. Create an Event Hub
The process for creating an event hub namespace and an event hub is straightforward. You should be familiar with the different pricing tiers that are available in terms of the capabilities that are offered.
Azure Quickstart – Create an event hub using the Azure portal – Azure Event Hubs | Microsoft Docs
Create an Event Hub using the Azure CLI – Learn | Microsoft Docs
Exercise – Use the Azure CLI to Create an Event Hub – Learn | Microsoft Docs
3. What are data streams
Data streams can be generated by a variety of hardware devices and software; Event Hubs is used to ingest those streams.
Understand data streams – Learn | Microsoft Docs
4. Event processing
The process of consuming data streams, analyzing them, and deriving actionable insights from them is called event processing.
Choose a real-time and stream processing solution on Azure | Microsoft Docs
5. Ingest data streams with Azure Stream Analytics
Learn how to create Azure Stream Analytics jobs to process input data, transform it with a query, and return results.
Ingest data streams with Azure Stream Analytics – Learn | Microsoft Docs
Module 11 Create a Stream Processing Solution with Event Hubs and Azure Databricks
1. Process streaming data with Azure Databricks structured streaming
Learn how Structured Streaming helps you process streaming data in real time, and how you can aggregate data over windows of time.
Process streaming data with Azure Databricks structured streaming – Learn | Microsoft Docs
I hope you enjoyed my DP-203 Self Study Guide. Did I miss any link, or do you have any recommended DP-203 Microsoft Azure Data Engineer Certification Exam Study resources? Let me know in the comments. Also, do let me know about any changes in the question pattern that you get, I will update the article for others. Thanks!!
Additional Tips And Resources
I hope this DP-203 Microsoft Azure Data Engineer Certification Exam Study Guide helps you pass the exam. I also highly recommend that you open a free Azure account if you don’t have one yet. You can create your free Azure account here. Also, check out my blog posts about Microsoft Azure Data Engineer Certification:
- Microsoft Certified Azure Data Engineer Associate | DP 203 | Step By Step Activity Guides (Hands-On Labs)
- Exam DP-203: Data Engineering on Microsoft Azure
- Azure Data Lake For Beginners: All you Need To Know
- Batch Processing Vs Stream Processing: All you Need To Know
- Introduction to Big Data and Big Data Architectures
Next Task For You
Our Azure Data Engineer training program covers all the exam objectives, 27 hands-on labs, and practice tests. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, check out our FREE CLASS.
The post [Self-Study Guide] Exam DP-203: Data Engineering on Microsoft Azure appeared first on Cloud Training Program.