
Azure Data Engineer [DP203] Q/A | Day 4 Live Session Review


Apache Spark is an integrated processing engine that can analyze big data using SQL, graph processing, machine learning, or real-time stream analysis.

This blog post goes through some quick tips, Q/A, and related blog posts on the topics we covered in the Azure Data Engineer Day 4 Live Session. It will help you gain a better understanding of the Azure Data Engineer Training Program, clear the [DP-203] Certification, and get a better-paid job.

The previous week, in the Day 3 session, we got an overview of Azure Synapse serverless SQL pools and learned how to explore, transform, and load data into the data warehouse using Apache Spark.

This week, in the Day 4 Live Session of the [DP-203] Microsoft Data Engineering Training Program, we covered Module 6: Data Exploration and Transformation in Azure Databricks, including topics such as Azure Databricks, reading and writing data in Azure Databricks, and DataFrames in Azure Databricks.

We also covered the hands-on labs Working with DataFrames and Working with DataFrames Advanced Methods, out of our 40 extensive labs.

So, here are some FAQs asked during the Day 4 Live Session from Module 6 of DP-203.

> Azure Databricks

Azure Databricks is a fully managed, cloud-based big data and machine learning platform that empowers developers to accelerate AI and innovation by simplifying the process of building enterprise-grade production data applications. It integrates with the Azure data services a Data Engineer should be familiar with, such as Azure Data Factory, Azure Synapse Analytics, Power BI, and Azure Data Lake Storage.

Azure Databricks uses an optimized Databricks runtime engine to provide an end-to-end managed Apache Spark platform optimized for the cloud, while offering the enterprise scale and security of the Microsoft Azure platform. Azure Databricks makes it straightforward to run large-scale data engineering Spark workloads.

Source: Microsoft

Ques 1: What is the difference between Azure Databricks and Azure Synapse?
Ans: Azure Databricks provides the ability to create and manage an end-to-end big data/data science project using one platform. Azure Synapse provides the ability to scale efficiently with Apache Spark clusters within a one-stop-shop analytical platform to meet your needs.

Ques 2: What format does Delta Lake use to store data?
Ans: Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides ACID transactions.


Ques 3: Where does Delta Lake store the data?
Ans: When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.
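As an illustration (not from the session itself), here is a minimal PySpark sketch of writing and reading back a Delta table at an explicit location, assuming a Databricks notebook where spark is predefined; the abfss:// path, container, and storage account names are placeholders.

```python
# Read a sample JSON dataset (path commonly available in Databricks workspaces).
events_df = spark.read.json("/databricks-datasets/structured-streaming/events/")

# Write it as a Delta table at an explicit cloud storage location.
# Delta Lake stores versioned Parquet files plus a _delta_log transaction log there.
(events_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://data@<storage-account>.dfs.core.windows.net/delta/events"))

# Read the Delta table back from the same location.
delta_df = (spark.read
    .format("delta")
    .load("abfss://data@<storage-account>.dfs.core.windows.net/delta/events"))
```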

> Work with DataFrames in Azure Databricks

Data processing in Azure Databricks is accomplished by using DataFrames to read and process the data.
The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate, and so on) that allow you to solve common data analysis problems efficiently.
DataFrames also allow you to mix operations seamlessly with custom Python, SQL, R, and Scala code. Within the notebook experience in Azure Databricks, files can be read in a single command.
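For illustration, a minimal sketch of these ideas, assuming a Databricks notebook (spark and display() are predefined) and a hypothetical CSV file path:

```python
from pyspark.sql.functions import col, sum as sum_

# A file can be read into a DataFrame in a single command; the path is a placeholder.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("/mnt/data/sales.csv"))

# Common DataFrame operations: select columns, filter rows, aggregate.
result = (df.select("region", "product", "amount")
            .filter(col("amount") > 0)
            .groupBy("region")
            .agg(sum_("amount").alias("total_amount")))

display(result)  # display() renders the result as a table in a Databricks notebook
```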

Ques 4: What is a DataFrame?
Ans: You can view a DataFrame much as you would view data in Microsoft Excel: a grid of cells that organizes data, which we could also call a table of data. It is a single two-dimensional dataset that can have multiple rows and columns. Each row is a sample of data, and each column is a variable or parameter that describes the data.
DataFrames in Azure Databricks help you move beyond simple data-handling functions by simplifying data exploration and data transformation. You can access numerous types of data sources and apply the same transformations, aggregates, and caching functions.
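A tiny sketch of this row/column view, assuming a Databricks notebook with spark predefined; the names and values below are made up:

```python
# A DataFrame is a two-dimensional, table-like structure: each row is a record,
# each column is a named variable describing the data.
columns = ["name", "city", "age"]
rows = [("Asha", "Pune", 29), ("Liam", "Dublin", 35), ("Mei", "Taipei", 41)]

people_df = spark.createDataFrame(rows, columns)
people_df.printSchema()  # column names and inferred types
people_df.show()         # prints the rows, much like a small Excel table
```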

Ques 5: Can we also read data from SAP HANA using Databricks Spark or Python?
Ans: The SAP HANA database can be accessed using JDBC drivers. Azure Databricks supports using external libraries to connect to external systems, so the whole process is quite straightforward. The JDBC adapter for SAP HANA is part of the database client libraries and can be downloaded from the SAP Support Launchpad.
If the Databricks cluster is deployed to the same virtual network as the SAP HANA database, there is no need to create peering between VNets. With a Python script, you can read and display data stored in the SFLIGHT table.
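A hedged sketch of such a JDBC read, assuming the SAP HANA JDBC driver (ngdbc.jar from the SAP client libraries) has been installed on the cluster; the host, port, schema, and credentials are placeholders:

```python
# Read the SFLIGHT table from SAP HANA over JDBC (sketch only).
jdbc_url = "jdbc:sap://<hana-host>:<port>"

sflight_df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.sap.db.jdbc.Driver")  # SAP HANA JDBC driver class
    .option("dbtable", "<SCHEMA>.SFLIGHT")
    .option("user", "<username>")
    .option("password", "<password>")
    .load())

display(sflight_df)  # show the flight records in the notebook
```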

Ques 6: Is an external table also a persistent object?
Ans: No, the data is not persisted, but the metadata required to access that particular table is persistent.

Ques 7: Where does a DataFrame store its data?
Ans: It doesn't; a DataFrame is a logical table, and the data is stored in the underlying storage. You can load data from Azure Blob Storage, Azure Data Lake Storage Gen2, and a SQL pool. For example, you can ingest data into a Spark DataFrame by reading a CSV file from Azure Data Lake Storage Gen2.
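As a small example, a sketch of reading a CSV from ADLS Gen2 into a DataFrame, assuming the cluster is already configured with access to the storage account; the account, container, and file names are placeholders:

```python
# The DataFrame is only a logical view: the data itself stays in the storage account.
path = "abfss://raw@<storage-account>.dfs.core.windows.net/nyc/taxi.csv"

taxi_df = spark.read.option("header", True).csv(path)
taxi_df.printSchema()   # inspect the inferred columns
print(taxi_df.count())  # an action triggers the actual read from storage
```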

Ques 8: Is cache similar to a buffer?
Ans: Yes, when you cache data, you place it on the workers of the cluster.

Ques 9: Is there a way to check the processing speed of CSV vs Parquet file data?
Ans: When managing massive datasets, using traditional CSV or JSON formats to store data is very inefficient in terms of query speed and storage cost. In our tests, the Parquet format was nearly 2X faster than CSV.
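One rough way to check this yourself is to time the same scan over both formats. This sketch assumes hypothetical paths with the same data stored as CSV and as Parquet; the actual numbers depend on your data and query:

```python
import time

# Compare a full scan of the same data in CSV vs Parquet (paths are placeholders).
csv_df = spark.read.option("header", True).csv("/mnt/data/trips_csv/")
parquet_df = spark.read.parquet("/mnt/data/trips_parquet/")

start = time.time()
csv_df.count()
print(f"CSV scan:     {time.time() - start:.1f} s")

start = time.time()
parquet_df.count()
print(f"Parquet scan: {time.time() - start:.1f} s")
```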

Ques 10: What is the difference between Apache Spark for Synapse and Apache Spark?
Ans: Apache Spark for Synapse is Apache Spark with added support for integrations with other services (AAD, Azure ML, etc.), extra libraries (mssparkutils, Hummingbird), and pre-tuned performance configurations.
Any workload that currently runs on Apache Spark will run on Apache Spark for Azure Synapse without modification.

Ques 11: How do I control costs for dedicated SQL pools, serverless SQL pools, and serverless Spark pools?
Ans: As a starting point, Azure Synapse works with the built-in cost analysis and cost alerts available at the Azure subscription level.

Dedicated SQL pools: You have direct visibility into, and control over, cost because you create and specify the sizes of dedicated SQL pools. You can control which users can create or scale dedicated SQL pools with Azure RBAC roles.
Serverless SQL pools: You have monitoring and cost-management controls that let you cap spending at a daily, weekly, and monthly level.
Serverless Spark pools: You can restrict who can create Spark pools with Synapse RBAC roles.

> Lab: Data Exploration and Transformation in Azure Databricks

In the lab, we took the opportunity to explore how to use various Apache Spark DataFrame methods to explore and transform data in Azure Databricks. We learned how to use standard DataFrame methods to explore and transform data, and also how to perform more advanced tasks such as removing duplicate data, manipulating date/time values, renaming columns, and aggregating data.

Ques 12: How can I change the type of a column?
Ans: Changing a column’s type or dropping a column requires rewriting the table.
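A minimal sketch of such a rewrite for a Delta table, assuming a hypothetical table named sales with an amount column; the overwriteSchema option allows the schema change on overwrite:

```python
from pyspark.sql.functions import col

# Change a column's type by rewriting the table (table and column names are placeholders).
df = spark.read.table("sales")

(df.withColumn("amount", col("amount").cast("double"))  # new type for the column
   .write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")                   # permit the schema change
   .saveAsTable("sales"))
```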

Ques 13: What are the benefits of cache ()?
Ans: The ability to cache data is one technique for achieving better performance with Apache Spark.
This is because every action requires Spark to read the data from its source (Azure Blob, Amazon S3, HDFS, etc.), but caching moves that data into the memory of the local executor for "instant" access. Note that cache() is just an alias for persist().
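A short sketch of the idea, assuming a hypothetical Parquet path in a Databricks notebook:

```python
# Cache a DataFrame so repeated actions read from executor memory
# instead of going back to the source storage each time.
df = spark.read.parquet("/mnt/data/trips_parquet/")  # placeholder path

df.cache()      # alias for persist() with the default storage level
df.count()      # the first action materializes the cache
df.count()      # subsequent actions are served from memory

df.unpersist()  # release the cached data when you no longer need it
```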

Ques 14: Need guidance/steps to resolve the issue "Pick an account — selected user account does not exist in tenant 'Microsoft Services' and cannot access the application … in that tenant. The account needs to be added as an external user in the tenant first."
Ans: You have multiple user accounts and are trying to log in with the wrong one; that is why you are seeing this error.

Ques 15: Display limit (100) — does it mean it will display 100 rows from the top?
Ans: Yes, and display() itself limits output to 1,000 rows.

Ques 16: Why does a temporary view need to be created on top of a DataFrame?
Ans: Apache Spark lets you create a temporary view from a DataFrame. It is much like a view in a database. Once you have a view, you can execute SQL against it.
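A minimal sketch, assuming a hypothetical Parquet path and a vendor_id column:

```python
# Create a temporary view on top of a DataFrame so it can be queried with SQL.
df = spark.read.parquet("/mnt/data/trips_parquet/")  # placeholder path
df.createOrReplaceTempView("trips")

top_vendors = spark.sql("""
    SELECT vendor_id, COUNT(*) AS trip_count
    FROM trips
    GROUP BY vendor_id
    ORDER BY trip_count DESC
""")
display(top_vendors)
```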

Ques 17: What if a file has the 1st row correct, the 2nd row broken in between, and the 3rd row onwards holding a single value — how will cleansing work in such cases?
Ans: You need to write error-handling logic to route the records that are not in the right format and filter out the erroneous records.
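One possible approach (a sketch, not the only way) is to read in PERMISSIVE mode with a _corrupt_record column and filter on it; the schema and path below are placeholders for your own file layout:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col

# Route malformed rows into a _corrupt_record column, then split clean and bad records.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("value", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),  # captures rows that fail to parse
])

raw_df = (spark.read
    .option("header", True)
    .option("mode", "PERMISSIVE")
    .schema(schema)
    .csv("/mnt/data/messy.csv"))

raw_df.cache()  # cache before filtering on the corrupt-record column

clean_df = raw_df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
bad_df = raw_df.filter(col("_corrupt_record").isNotNull())  # review or re-route these
```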

Ques 18: How can I create a Databricks cluster in the free account?
Ans: When your Azure Databricks workspace creation is complete, select the link to go to the resource.

  1. Select Launch Workspace to open your Databricks workspace in a new tab.
  2. In the left-hand menu of your Databricks workspace, choose Clusters.
  3. Select Create Cluster to add a new cluster.
  4. Enter a name for your cluster. Use your name or initials to easily differentiate your cluster from your coworkers'.
  5. Select the Cluster Mode: Single Node.
  6. Select the Databricks RuntimeVersion: Runtime: 7.3 LTS (Scala 2.12, Spark 3.0.1).
  7. Under Autopilot Options, leave the box checked, and in the text box enter 45.
  8. Select the Node Type: Standard_DS3_v2.
  9. Select Create Cluster.
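For reference, clusters can also be created programmatically. Below is a hedged sketch using the Databricks Clusters REST API; the workspace URL and personal access token are placeholders, and some fields (for example, a true single-node configuration) may need adjusting for your workspace and runtime:

```python
import requests

# Create a cluster similar to the one above via the Databricks REST API (sketch only).
workspace_url = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_name": "my-dp203-cluster",
    "spark_version": "7.3.x-scala2.12",   # Runtime 7.3 LTS
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,                      # single-node style
    "autotermination_minutes": 45,         # matches the Autopilot setting above
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())  # returns the new cluster_id on success
```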

Feedback Received…

Here is some positive feedback from our trainees who attended the session:

Quiz Time (Sample Exam Questions)!

With our Azure Data Engineer Training Program, we cover 150+ sample exam questions to help you prepare for the [DP-203] Certification.

Check out one of the questions and see if you can crack this…

Ques: You need to create a new Azure Databricks cluster. This cluster would connect to Azure Data Lake Storage Gen2 by using Azure Active Directory (Azure AD) integration.
Which of the following advanced options would you enable?

A. Blob access control
B. Table access control
C. Credential Passthrough
D. Single Sign-On

Comment with your answer & we will tell you if you are correct or not!


Next Task For You

In our Azure Data Engineer Training Program, we cover 40 Hands-On Labs. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, start by checking out our FREE CLASS.



