This blog post covers the Hands-On Labs you need to perform to learn Data Engineering on Microsoft Azure and to clear the Data Engineering on Microsoft Azure [DP-203] certification.
This post supports your Data Engineering on Microsoft Azure journey, whether you are learning at your own pace or as part of a team. There are 17 Hands-On Labs in this course:
- Explore compute and storage options for data engineering workloads
- Designing and Implementing the Serving Layer
- Data engineering considerations
- Run interactive queries using serverless SQL pools
- Explore, transform, and load data into the Data Warehouse using Apache Spark
- Data Exploration and Transformation in Azure Databricks
- Ingest and load Data into the Data Warehouse
- Transform Data with Azure Data Factory or Azure Synapse Pipelines
- Orchestrate data movement and transformation in Azure Synapse Pipelines
- Optimize Query Performance with Dedicated SQL Pools in Azure Synapse
- Analyze and Optimize Data Warehouse Storage
- Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link
- End-to-end security with Azure Synapse Analytics
- Real-time Stream Processing with Stream Analytics
- Create a Stream Processing Solution with Event Hubs and Azure Databricks
- Build reports using Power BI integration with Azure Synapse Analytics
- Perform Integrated Machine Learning Processes in Azure Synapse Analytics
LAB 1: Explore Compute And Storage Options For Data Engineering Workloads
In this lab, you’ll interact with two compute technologies, Azure Databricks and Azure Synapse Analytics Spark pools, but deliberately not in any real depth. The objective of the lab is to show how these compute technologies interact with the primary data storage option for analytical workloads in Azure: Azure Data Lake Storage. Azure Data Lake Storage Gen2 is enabled by turning on the hierarchical namespace when you create the Azure Storage account. The lab does, however, provide an in-depth focus on two primary considerations when working with Azure Data Lake Storage Gen2 from data engineering compute technologies:
- Organizing the data lake folders to be optimized for data exploration, loading, and querying
- Using compute libraries to optimize the querying of files in a data lake
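The lab itself does this exploration with Spark, but the benefit of a well-organized folder layout is easy to see even from a serverless SQL pool. The sketch below is illustrative only: the storage URL, the year=/month= partition folders, and the column names are hypothetical, and the filepath() function is used so the query prunes folders instead of scanning the whole lake.

```sql
-- Hypothetical date-partitioned layout: .../sales/year=2021/month=06/*.parquet
-- filepath(n) returns the value matched by the n-th wildcard, which lets the engine skip folders.
select
    rows.filepath(1) as sale_year,
    rows.filepath(2) as sale_month,
    count(*)         as file_rows
from openrowset(
    bulk 'https://yourdatalake.dfs.core.windows.net/data/sales/year=*/month=*/*.parquet',
    format = 'parquet'
) as rows
where rows.filepath(1) = '2021'
group by rows.filepath(1), rows.filepath(2);
```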
LAB 2: Designing And Implementing The Serving Layer
In this lab, you’ll create a star schema in a SQL database using foreign key constraints. You’ll also create a snowflake schema in a SQL database and then explore a common way of managing a shared dimension: the time dimension. Finally, you’ll create a star schema in Azure Synapse Analytics and learn how to update a dimension table by loading data into it with Azure Synapse pipelines.
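As a point of reference, a star schema in an Azure SQL database is simply dimension tables plus a fact table whose foreign keys point at them. The sketch below is a minimal, hypothetical schema (table and column names are not the lab’s exact ones); note that Azure Synapse dedicated SQL pools do not enforce foreign key constraints, so the Synapse part of the lab relies on the load process to keep keys consistent.

```sql
-- Minimal illustrative star schema: two dimensions and one fact table.
create table dbo.DimCustomer (
    CustomerKey  int identity(1,1) primary key,
    CustomerName nvarchar(100) not null
);

create table dbo.DimDate (
    DateKey  int primary key,   -- e.g. 20210630
    FullDate date not null
);

create table dbo.FactSale (
    SaleKey     bigint identity(1,1) primary key,
    CustomerKey int not null,
    DateKey     int not null,
    SaleAmount  decimal(18,2) not null,
    constraint FK_FactSale_DimCustomer foreign key (CustomerKey) references dbo.DimCustomer (CustomerKey),
    constraint FK_FactSale_DimDate     foreign key (DateKey)     references dbo.DimDate (DateKey)
);
```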
LAB 3: Data Engineering Considerations
In this lab, you’ll establish a basis for discussion around modern data warehouse patterns, file formats and folder structures, and security. The process of building a modern data warehouse typically consists of:
- Data Ingestion and Preparation.
- Making the data ready for consumption by analytical tools.
- Providing access to the data, in a shaped format so that it can easily be consumed by data visualization tools.
LAB 4: Run Interactive Queries Using Serverless SQL Pools
In this lab, you’ll learn how to work with files stored in the data lake and external file sources, through T-SQL statements executed by a serverless SQL pool in Azure Synapse Analytics. You’ll query Parquet files stored in a data lake, as well as CSV files stored in an external data store. Next, you’ll create Azure Active Directory security groups and enforce access to files in the data lake through Role-Based Access Control (RBAC) and Access Control Lists (ACLs).
```sql
-- Read a parquet file directly from the data lake
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet'
) as rows;

-- Read a parquet file through an external data source
create external data source covid
with (
    location = 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases'
);
go

select top 10 *
from openrowset(
    bulk 'latest/ecdc_cases.parquet',
    data_source = 'covid',
    format = 'parquet'
) as rows;

-- Explicitly specify a schema
select top 10 *
from openrowset(
    bulk 'latest/ecdc_cases.parquet',
    data_source = 'covid',
    format = 'parquet'
) with (
    date_rep date,
    cases int,
    geo_id varchar(6)
) as rows;
```
LAB 5: Explore, Transform, And Load Data Into The Data Warehouse Using Apache Spark
In this lab, you’ll explore data stored in a data lake, transform the data, and load data into a relational data store. You’ll explore Parquet and JSON files and use techniques to query and transform JSON files with hierarchical structures. Then you will use Apache Spark to load data into the data warehouse and join Parquet data in the data lake with data in the dedicated SQL pool. In this lab you will:
- Perform Data Exploration in Synapse Studio
- Ingest data with Spark notebooks in Azure Synapse Analytics
- Transform data with Data Frames in Spark pools in Azure Synapse Analytics
- Integrate SQL and Spark pools in Azure Synapse Analytics
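The exploration and loading in this lab happen in Spark notebooks, but the SQL side of the integration can be sketched in T-SQL as well: an external table over the lake’s Parquet files can be joined directly to a dedicated SQL pool table. Everything below (data source, file format, paths, table and column names) is hypothetical, and the database scoped credential needed for secured storage is omitted for brevity.

```sql
-- Hypothetical external table over Parquet files in the data lake (dedicated SQL pool, PolyBase).
create external file format ParquetFormat
with (format_type = parquet);

create external data source DataLake
with (
    location = 'abfss://data@yourdatalake.dfs.core.windows.net',
    type     = hadoop     -- credential omitted; assumes storage access is already granted
);

create external table dbo.SaleExternal (
    CustomerId int,
    SaleAmount decimal(18,2)
)
with (
    location    = '/sales/2021/',
    data_source = DataLake,
    file_format = ParquetFormat
);

-- Join lake data with a table that already lives in the dedicated SQL pool.
select c.CustomerName, sum(s.SaleAmount) as TotalSales
from dbo.SaleExternal as s
join dbo.DimCustomer  as c on c.CustomerId = s.CustomerId
group by c.CustomerName;
```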
LAB 6: Data Exploration And Transformation In Azure Databricks
In this lab, you’ll take the opportunity to explore how various Apache Spark DataFrame methods can be used to explore and transform data in Azure Databricks. You’ll learn how to apply standard DataFrame methods to explore and transform data, and then move on to more advanced tasks such as removing duplicate data, manipulating date/time values, renaming columns, and aggregating data.
The following exercises will be performed during the lab:
- Use DataFrames in Azure Databricks to explore and filter data
- Cache a DataFrame for faster subsequent queries
- Remove duplicate data
- Manipulate date/time values
- Remove and rename DataFrame columns
- Aggregate data stored in a DataFrame
LAB 7: Ingest And Load Data Into The Data Warehouse
In this lab, you’ll learn how to ingest data into the data warehouse through T-SQL scripts and Synapse Analytics integration pipelines. You’ll learn how to load data into Synapse dedicated SQL pools with PolyBase and COPY using T-SQL, and how to use workload management along with a Copy activity in an Azure Synapse pipeline for petabyte-scale data ingestion.
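For the COPY path specifically, a minimal sketch looks like the statement below. The target table, storage URL, and authentication method are placeholders; the lab’s own scripts define the real ones.

```sql
-- Load Parquet files from the data lake into an existing dedicated SQL pool table.
-- Assumes dbo.StagingSale exists and the pool's managed identity can read the storage account.
copy into dbo.StagingSale
from 'https://yourdatalake.dfs.core.windows.net/data/sales/2021/*.parquet'
with (
    file_type  = 'PARQUET',
    credential = (identity = 'Managed Identity')
);
```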
LAB 8: Transform Data With Azure Data Factory Or Azure Synapse Pipelines
In this lab, you’ll learn how to build data integration pipelines to ingest data from multiple data sources, transform data using mapping data flows and notebooks, and perform data movement into one or more data sinks.
LAB 9: Orchestrate Data Movement And Transformation In Azure Synapse Pipelines
In this lab, you’ll create a notebook to query user activity, then add the notebook to a pipeline using the new Notebook activity and execute it after the Mapping Data Flow as part of the orchestration process. While configuring this, you’ll implement parameters to add dynamic content in the control flow and validate how the parameters can be used.
LAB 10: Optimize Query Performance With Dedicated SQL Pools In Azure Synapse
In this lab, you’ll use window functions to perform calculations over a set of rows. You will explore the OVER clause, aggregate functions, and analytic functions, and use the ROWS clause to see the different ways you can apply windowing functions in your data warehouse. You will also see an example of how the APPROX_COUNT_DISTINCT function works. You will then explore optimizing the data warehouse in Azure Synapse Analytics using a range of features, including table distribution, indexing, and partitioning, and look at how to further improve query performance using materialized views, result set caching, and up-to-date indexes and statistics.
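As a quick orientation, the sketch below shows an OVER clause with a ROWS frame for a running total, plus APPROX_COUNT_DISTINCT; the table and column names are illustrative, not the lab’s exact schema.

```sql
-- Running total per customer using OVER with an explicit ROWS frame (illustrative schema).
select
    CustomerKey,
    OrderDateKey,
    SaleAmount,
    sum(SaleAmount) over (
        partition by CustomerKey
        order by OrderDateKey
        rows between unbounded preceding and current row
    ) as RunningTotal
from dbo.FactSale;

-- Approximate distinct count: faster than COUNT(DISTINCT ...) on very large tables.
select approx_count_distinct(CustomerKey) as ApproxCustomers
from dbo.FactSale;
```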
LAB 11: Analyze And Optimize Data Warehouse Storage
This lab explains how to analyze and optimize the data storage of Azure Synapse dedicated SQL pools. You will learn the right approach to understanding table space usage and columnstore storage details. Next, you will compare storage requirements between identical tables that use different data types. Finally, you will monitor the impact materialized views have when executed in place of complex queries and learn how to avoid extensive logging by optimizing delete operations.
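Two of the patterns this lab relies on can be sketched briefly: checking table space usage per distribution, and replacing a heavily logged DELETE with a CTAS-and-rename. The table names and filter predicate below are placeholders.

```sql
-- Space used by a table across the distributions of a dedicated SQL pool.
dbcc pdw_showspaceused("dbo.FactSale");

-- Minimally logged alternative to a large DELETE: keep the rows you want with CTAS, then swap.
create table dbo.FactSale_New
with (distribution = hash(CustomerKey), clustered columnstore index)
as
select * from dbo.FactSale
where OrderDateKey >= 20200101;

rename object dbo.FactSale     to FactSale_Old;
rename object dbo.FactSale_New to FactSale;
```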
LAB 12: Support Hybrid Transactional Analytical Processing (HTAP) With Azure Synapse Link
In this lab, you’ll learn how Azure Synapse Link enables seamless connectivity between an Azure Cosmos DB account and an Azure Synapse workspace. You will understand how to enable and configure Azure Synapse Link, and then how to query the Azure Cosmos DB analytical store using Apache Spark and serverless SQL.
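From the serverless SQL side, querying the analytical store comes down to a single OPENROWSET call; the sketch below uses placeholder account, database, key, and container names.

```sql
-- Query the Cosmos DB analytical store from a serverless SQL pool (placeholder names and key).
select top 10 *
from openrowset(
    'CosmosDB',
    'Account=yourcosmosaccount;Database=yourdatabase;Key=youraccountkey',
    YourContainer
) as documents;
```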
LAB 13: End-To-End Security With Azure Synapse Analytics
In this lab, you’ll learn how to secure a Synapse Analytics workspace and its supporting infrastructure. You’ll examine the SQL Active Directory Admin, manage IP firewall rules, manage secrets with Azure Key Vault and access those secrets through a Key Vault linked service and pipeline activities. Then you’ll understand how to implement column-level security, row-level security, and dynamic data masking when using dedicated SQL pools.
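The three data-protection features map to short T-SQL statements; the sketch below uses placeholder principals, tables, and columns rather than the lab’s exact objects.

```sql
-- Column-level security: grant access to specific columns only.
grant select on dbo.Customer (CustomerId, CustomerName) to DataAnalyst;

-- Dynamic data masking on a sensitive column.
alter table dbo.Customer
alter column Email add masked with (function = 'email()');

-- Row-level security: filter rows to the logged-in user via a predicate function and policy.
create function dbo.fn_SecurityPredicate(@SalesRep as nvarchar(128))
returns table
with schemabinding
as
return select 1 as fn_result where @SalesRep = user_name();
go

create security policy SalesFilter
add filter predicate dbo.fn_SecurityPredicate(SalesRep) on dbo.Sale
with (state = on);
```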
LAB 14: Real-time Stream Processing With Stream Analytics
This lab covers how to process streaming data with Azure Stream Analytics. You’ll ingest data into Event Hubs, then process that data in real-time, using various windowing functions in Azure Stream Analytics. You’ll output the data to Azure Synapse Analytics. Finally, you will learn how to scale the Stream Analytics job to increase throughput.
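A typical job query, written in the Stream Analytics query language (a SQL-like dialect), looks like the sketch below; the input and output names are the aliases you define on the job, and the column names are placeholders.

```sql
-- Count events per sensor in 30-second tumbling windows (input/output names are job aliases).
select
    SensorId,
    count(*) as EventCount,
    System.Timestamp() as WindowEnd
into SynapseOutput
from EventHubInput timestamp by EventEnqueuedUtcTime
group by SensorId, TumblingWindow(second, 30)
```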
LAB 15: Create A Stream Processing Solution With Event Hubs And Azure Databricks
In this lab, you’ll learn how to ingest and process streaming data at scale with Event Hubs and Spark Structured Streaming in Azure Databricks. You will learn the key features and uses of Structured Streaming. You will implement sliding windows to aggregate over chunks of data and apply watermarking to remove stale data. Finally, you will connect to Event Hubs to read and write streams.
LAB 16: Build Reports Using Power BI Integration With Azure Synapse Analytics
In this lab, you’ll learn how to integrate Power BI with your Azure Synapse workspace to build reports in Power BI. You’ll create a new data source and Power BI report in Azure Synapse Studio. Then you’ll learn how to improve query performance with materialized views and result-set caching. Finally, you’ll explore the data lake with serverless SQL pools and create visualizations against that data in Power BI.
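The two performance features can be sketched in a few lines of T-SQL; the view definition and database name below are placeholders.

```sql
-- Materialized view that pre-aggregates a common report query (illustrative schema).
create materialized view dbo.SalesByRegion
with (distribution = hash(RegionKey))
as
select RegionKey, count_big(*) as RowCnt, sum(SaleAmount) as TotalSales
from dbo.FactSale
group by RegionKey;

-- Result set caching is switched on per database (run from the master database).
alter database YourDedicatedPool set result_set_caching on;
```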
LAB 17: Perform Integrated Machine Learning Processes In Azure Synapse Analytics
In this lab, you’ll explore the integrated, end-to-end Azure Machine Learning and Azure Cognitive Services experience in Azure Synapse Analytics. You will learn how to connect an Azure Synapse Analytics workspace to an Azure Machine Learning workspace using a linked service and then trigger an Automated ML experiment that uses data from a Spark table. You’ll also learn how to use trained models from Azure Machine Learning or Azure Cognitive Services to enrich data in a SQL pool table and then serve prediction results using Power BI.
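The "enrich data in a SQL pool table" step typically comes down to the T-SQL PREDICT function scoring a model exported from Azure Machine Learning in ONNX format; the sketch below is only an outline, with placeholder model, input, and output names.

```sql
-- Score rows in the dedicated SQL pool with a stored ONNX model (placeholder names).
select d.*, p.PredictedLabel
from predict(
    model   = (select Model from dbo.Models where ModelName = 'CustomerChurn'),
    data    = dbo.CustomerFeatures as d,
    runtime = onnx
) with (PredictedLabel varchar(10)) as p;
```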
Related/References
- Exam DP-203: Data Engineering on Microsoft Azure
- Azure Data Lake For Beginners: All you Need To Know
- Batch Processing Vs Stream Processing: All you Need To Know
- Introduction to Big Data and Big Data Architectures
- Designing And Automate An Enterprise BI solution In Azure
- Azure Data Science And Data Engineering Certifications: DP-900 vs DP-100 vs DP-200/DP-201
Next Task For You
In our Azure Data Engineer training program, we cover these 17 Hands-On Labs. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, check out our FREE CLASS.