In this blog, we’ll build an Azure Data Factory pipeline that simply copies data from Azure Blob storage to an Azure SQL Database.
A pipeline is a logical collection of activities that work together to complete a task. A pipeline, for example, could include a set of activities that ingest and clean log data before launching a mapping data flow to analyze the log data. The pipeline enables you to manage the activities as a group rather than individually. Instead of deploying and scheduling the activities separately, you deploy and schedule the pipeline.
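As a rough mental model, a pipeline is stored as a single JSON document whose activities array groups the individual steps. The sketch below is purely illustrative (the pipeline and activity names are made up, and real activities carry additional properties), but it shows how the activities are managed together as one unit:

{
    "name": "ProcessLogsPipeline",
    "properties": {
        "activities": [
            { "name": "IngestLogData", "type": "Copy" },
            {
                "name": "AnalyzeLogData",
                "type": "ExecuteDataFlow",
                "dependsOn": [ { "activity": "IngestLogData", "dependencyConditions": [ "Succeeded" ] } ]
            }
        ]
    }
}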
In this blog, you perform the following steps:
- Create a data factory.
- Create a pipeline with a copy activity.
- Test run the pipeline.
- Trigger the pipeline manually.
Create Azure Data Factory Pipeline
1) Prerequisites
1) An Azure subscription. If you don’t already have an Azure subscription, sign up for a free Azure account before you start.
2) An Azure storage account. Blob storage is used as the source data store. If you don’t already have an Azure storage account, see Create an Azure storage account for instructions.
3) A database in Azure SQL Database. The database is used as the sink data store. If you don’t already have a database in Azure SQL Database, see Create a database in Azure SQL Database for instructions.
2) Create a blob and a SQL table
Now, prepare your SQL database and Blob storage for the blog by following the steps below.
a) Create a source blob
1) Start Notepad. Copy the following text and save it on your disk as a file named emp.txt:
FirstName,LastName
John,Doe
Jane,Doe
2) In your Blob storage, create a container called adfdemo. In this container, make a folder called input. Then, copy the emp.txt file into the input folder. To complete these tasks, use the Azure portal or tools such as Azure Storage Explorer.
b) Create a sink SQL table
1) To create the dbo.emp table in your database, run the SQL script below:
CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
2) Permit Azure services to connect to SQL Server. Make sure that Allow access to Azure services is enabled for your SQL Server so that Data Factory can write data to it. To check and enable this setting, navigate to the logical SQL server > Overview > Set server firewall, and toggle the Allow access to Azure services option to ON.
3) Create an Azure Data Factory
In this step, you create a data factory and launch the Data Factory UI to begin building a pipeline in the data factory.
1) Open Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.
2) Select Create a resource > Integration > Data Factory from the left menu.
3) Select the Azure Subscription in which you wish to create the data factory on the Create Data Factory page, under the Basics tab.
4) For Resource Group, take one of the following steps:
- From the drop-down list, choose an existing resource group.
- Select Create new and give the resource group a name.
5) Under Region, select a location for the data factory. Only supported locations are shown in the drop-down list. The data stores (such as Azure Storage and SQL Database) and computes (such as Azure HDInsight) used by the data factory can be in other regions.
6) Enter ADFdemoDataFactory in the Name field.
The name of the Azure data factory must be globally unique. If you get an error about the name value, enter a different name for the data factory (for example, yournameADFdemoDataFactory).
7) Select V2 under Version.
8) At the top, select the Git configuration tab, and then select the Configure Git later check box. Then select Review + create, and select Create after validation passes.
9) When the creation is complete, a notification appears in the Notifications center. To go to the Data Factory page, select Go to resource.
10) To open the Azure Data Factory UI in a new tab, select Open on the Open Azure Data Factory Studio tile.
4) Create an Azure Data Factory Pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob storage to the SQL database. You build the pipeline by following these steps:
- Create the linked service.
- Create input and output datasets.
- Create a pipeline.
1) Select Orchestrate from the main page.
2) In the General panel under Properties, specify CopyPipeline for Name. Then collapse the panel by selecting the Properties icon in the top-right corner.
3) In the Activities toolbox, expand the Move and Transform category, and then drag and drop the Copy Data activity from the toolbox to the pipeline designer surface. Specify CopyFromBlobToSql for Name.
A) Configure a source
1) Select the Source tab. To create a new source dataset, select + New.
2) Select Azure Blob Storage in the New Dataset dialogue box, then click Continue. Because the source data is stored in a blob, you choose Azure Blob Storage as the source dataset.
3) In the Select Format dialogue box, choose the format type of your data (DelimitedText for the comma-separated emp.txt file), and then click Continue.
4) Enter SourceBlobDataset as the Name in the Set Properties dialogue box. Select the First row as header check box. For Linked service, select + New.
5) Enter AzureStorageLinkedService as the name in the New Linked Service (Azure Blob Storage) dialogue box, and then choose your storage account from the Storage account name list. Test the connection, and then select Create to deploy the linked service.
6) After the linked service is created, you’re returned to the Set properties page. Next to File path, choose Browse.
7) Select the emp.txt file from the adfdemo/input folder, then select OK.
8) Choose OK. You’re taken back to the pipeline page. On the Source tab, confirm that SourceBlobDataset is selected. To see a preview of the data on this page, select Preview data.
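If you open the Code view later, the source side you just configured is stored as two JSON definitions: the linked service (the connection) and the dataset (the file and its format). The sketch below is a simplified illustration with placeholder values; the JSON that Data Factory Studio actually generates may include additional properties:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<your-storage-account>;AccountKey=<your-account-key>;EndpointSuffix=core.windows.net"
        }
    }
}

{
    "name": "SourceBlobDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "adfdemo",
                "folderPath": "input",
                "fileName": "emp.txt"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}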
B) Configure sink
1) To build a sink dataset, go to the Sink tab and select + New.
2) To filter the connectors in the New Dataset dialogue box, type “SQL” in the search field, pick Azure SQL Database, and then click Continue. You copy data to a SQL database in this demo.
3) Enter OutputSqlDataset as the Name in the Set Properties dialogue box. From the Linked service drop-down list, select + New. A dataset must be paired with a linked service. The linked service stores the connection string that Data Factory uses to connect to SQL Database at runtime, and the dataset specifies the table to which the data is copied.
4) Take the following steps in the New Linked Service (Azure SQL Database) dialogue box:
a. Type AzureSqlDatabaseLinkedService in the Name field.
b. Select your SQL Server instance under Server name.
c. Under Database name, select your database.
d. Under User name, type the user’s name.
e. Under Password, type the user’s password.
f. To test the connection, select Test connection.
g. To deploy the linked service, select Create.
5) You’re taken to the Set Properties dialogue box. In Table, select [dbo].[emp] from the drop-down menu. Then press OK.
6) Go to the pipeline tab and make sure OutputSqlDataset is selected in Sink Dataset.
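The sink side is stored in the same way: a linked service holding the SQL connection string and a dataset pointing at the dbo.emp table. Again, this is only a simplified sketch with placeholder values, not the exact code the Studio generates:

{
    "name": "AzureSqlDatabaseLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-database>;User ID=<your-user>;Password=<your-password>;Encrypt=true"
        }
    }
}

{
    "name": "OutputSqlDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": { "referenceName": "AzureSqlDatabaseLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "schema": "dbo",
            "table": "emp"
        }
    }
}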
5) Validate the Azure Data Factory Pipeline
1) Select Validate from the toolbar to validate the pipeline.
2) To see the JSON code associated with the pipeline, click Code in the upper-right corner.
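The pipeline itself is represented as a copy activity that references the two datasets as its input and output. Below is a trimmed-down sketch of what you can expect to see in the Code view (the generated JSON also contains policy, timeout, and format-setting properties that are omitted here):

{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "OutputSqlDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}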
6) Debug and publish the Azure Data Factory Pipeline
Before publishing artifacts (linked services, datasets, and pipelines) to Data Factory or your own Azure Repos Git repository, you can debug your pipeline.
1) Select Debug from the toolbar to debug the pipeline. The Output tab at the bottom of the window displays the status of the pipeline run.
2) Once the pipeline runs successfully, select Publish all from the top toolbar. This action publishes the entities (datasets and pipelines) you created to Data Factory.
3) Wait until you see the message “Successfully published.” To view notification messages, go to the top-right corner and select Show Notifications (bell button).
7) Trigger the Azure Data Factory pipeline manually
In this step, you manually trigger the pipeline that you published in the previous step.
1) On the toolbar, select Trigger, and then Trigger Now. Select OK on the Pipeline Run page.
2) On the left, select the Monitor tab. You see a pipeline run that was triggered manually. Use the links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
3) Select the CopyPipeline link under the PIPELINE NAME column to see activity runs linked with the pipeline run. There is only one activity in this case, so there is only one entry in the list. Select the Details link (eyeglasses icon) under the ACTIVITY NAME column for more information about the copy process. To return to the Pipeline Runs view, select All pipeline runs at the top. Select Refresh to refresh the view.
4) Check that the emp table in the database has two more rows.
Next Task For You
In our Azure Data Engineer training program, we cover all the exam objectives, 27 Hands-On Labs, and practice tests. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, check out our FREE CLASS.