BigQuery is Google's data warehouse for working with large amounts of data. With BigQuery, you can collect data from various sources, store it, analyze it, and eventually visualize the analysis in multiple ways. This blog talks about BigQuery, its various features, and its use cases.
Introduction
Ever since the dawn of computing, and subsequently the internet, the amount of data in the world has been increasing continuously. However, in recent years, the rate of increase has shot up. According to estimates, the total data in the world today is in zettabytes, with 2.5 quintillion bytes being added to it every day!
Taking that into consideration, we need something to handle this massive amount of data. Google’s offering BigQuery does just that – giving us a tool to help us work with this huge amount of data. BigQuery can be used to run very large-scale SQL queries – spanning up to billions of rows – and generate results in almost real-time.
How BigQuery Works
The working of BigQuery can be divided into three main parts – ingestion, preparation and storage, and analysis. Each of these parts is elaborated on below.
Pre-requisite: Creating a Table
To start working with BigQuery, one needs to have a table. BigQuery tables can be of three types – Native tables, External tables, and Views. Once you have a table, you can run the various BQ operations on it, including:
- Listing all tables in a dataset
- Getting information about the table(s)
- Controlling access to the table data
- Getting the table metadata
These tasks can be performed using the Cloud Console, the bq command-line tool, the client libraries, or the BigQuery REST API.
Once a table is created, we can move on to the further steps and make use of BQ's functionality.
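For illustration, here is a minimal SQL sketch of this prerequisite step – creating a native table and then listing the tables in its dataset via INFORMATION_SCHEMA. The dataset, table, and column names (my_dataset, orders, and so on) are placeholders invented for this example.

```sql
-- Create a native table in an existing dataset (all names are placeholders)
CREATE TABLE my_dataset.orders (
  order_id    STRING,
  customer_id STRING,
  order_total NUMERIC,
  created_at  TIMESTAMP
);

-- List every table in the dataset, along with its type
-- (BASE TABLE for native tables, VIEW, or EXTERNAL)
SELECT table_name, table_type
FROM my_dataset.INFORMATION_SCHEMA.TABLES;
```

The same operations can also be performed with the bq command-line tool, the client libraries, or the REST API mentioned above.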
1. Ingestion (Loading Data)
Once a table has been created, the next step is to load data into the table. There are several ways to ingest data into BigQuery:
- Batch loading a set of data records
- Streaming individual records
- Using queries to generate new data and append it to a table
- Using a third-party application or service
Ingestion can be done manually, as explained above, or automatically, by using Data Transfer Service.
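As a hedged sketch of the batch-loading and query-append paths, the statements below load CSV files from a Cloud Storage bucket and then append query results to the same table; the bucket path, table name, and staging table are invented for illustration.

```sql
-- Batch-load CSV files from Cloud Storage into a table
LOAD DATA INTO my_dataset.orders
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/exports/orders_*.csv'],
  skip_leading_rows = 1  -- skip the CSV header row
);

-- Append new rows generated by a query (the "using queries" ingestion path)
INSERT INTO my_dataset.orders
SELECT * FROM my_dataset.orders_staging;
```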
2. Preparation & Storage
In this step, the data is stored in BigQuery and prepared for analysis: BigQuery takes the raw data ingested in the previous step and refines it so that it is ready for querying.
3. Analysis (Exporting Data)
Finally, once the data is ready in the table, it can be analyzed in place or exported. Data can be exported from BigQuery in several ways, with a limit of 1 GB per exported file (larger exports are split across multiple files). The available options are to either export the data manually or use a service like Dataflow to automate the process. Services such as BigQuery ML and Google Data Studio can then be used to analyze and visualize the data.
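A hedged sketch of a manual export using the EXPORT DATA statement is shown below; because each exported file is capped at 1 GB, a wildcard URI is used so BigQuery can shard the output across multiple files. The bucket and table names are placeholders.

```sql
-- Export query results to Cloud Storage as sharded, compressed CSV files
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/orders_*.csv',  -- wildcard => multiple shards
  format = 'CSV',
  compression = 'GZIP',
  overwrite = true
) AS
SELECT order_id, order_total, created_at
FROM my_dataset.orders;
```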
Features of BigQuery
BigQuery is a fast, serverless data warehouse that is designed for organizations dealing with a high amount of data. From getting detailed insights from the data using its built-in Machine Learning, to analyzing petabytes of data using ANSI SQL, BigQuery offers a wide variety of features in a cost-effective pricing model. Here we look at some of those features.
1. Multicloud Functionality (BQ Omni)
BigQuery Omni is an analytics solution that allows for data analysis across multiple cloud platforms. Its selling point is a novel way of analyzing data that lives in more than one cloud, in contrast to other solutions, where this has always involved high egress costs to migrate the data from its source. BigQuery achieves this by separating its compute and storage components, which means it can run the computation on the data right where it is located, without moving it to a different zone for processing.
BigQuery Omni runs on Anthos clusters managed by Google Cloud, which allows queries to be executed securely even on other cloud platforms. BigQuery Omni provides the following functionalities:
- Break down silos and gain insights into the data
- Get consistent data experience across clouds
- Enable flexibility, courtesy of Anthos
2. Built-in ML Integration (BQ ML)
BigQuery ML is used for creating and executing machine learning models in BigQuery using simple SQL queries. Prior to its introduction, machine learning on large datasets required ML-specific knowledge and programming skills. BigQuery ML eliminates that need by allowing SQL practitioners to build ML models using their existing skills.
BigQuery ML can be accessed in the following four ways:
- Google Cloud Console
- bq command-line tool
- BigQuery REST API
- An external tool (for example, Jupyter notebooks)
Machine Learning in BigQuery works on models, which are representations of what the ML system has learned from the data. Some of the models used in BigQuery ML include Linear regression, Binary and Multiclass Logistic regression, Matrix Factorization, Time Series, and Deep Neural Network models.
There are a number of reasons to use BigQuery ML for using Machine Learning with a cloud-based data warehouse. Some of these are:
- No need for data export leading to a higher speed of model development
- No need to program the ML solution in Python or Java
- Ability to build and run ML models using existing BI tools & spreadsheets
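As a hedged sketch of that workflow, the example below trains a binary logistic regression model with CREATE MODEL and then scores new rows with ML.PREDICT; the dataset, tables, and columns (churn_model, customers, churned, and so on) are invented for illustration.

```sql
-- Train a binary logistic regression model directly in SQL
CREATE OR REPLACE MODEL my_dataset.churn_model
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']  -- the column holding the label to predict
) AS
SELECT churned, tenure_months, monthly_spend, support_tickets
FROM my_dataset.customers;

-- Use the trained model to score new customers
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
  MODEL my_dataset.churn_model,
  (SELECT customer_id, tenure_months, monthly_spend, support_tickets
   FROM my_dataset.new_customers));
```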
3. Foundation for BI (BQ BI Engine)
BigQuery BI Engine is an in-memory analysis service. It is used to analyze data stored in BigQuery with high concurrency and sub-second response times. As part of the BigQuery family, it comes with a SQL interface, which lets it work with BI tools like Looker, Tableau, and Power BI. It can also integrate with custom applications to help with data exploration and analysis.
BigQuery BI Engine has the following advantages:
- Speed – Very fast response and load times help with analysis over streaming data
- Simplicity – In-place analysis within BigQuery eliminates the need for ETL pipelines
- Ease of use – The smart tuning design leaves very few configuration settings for the user
BigQuery BI Engine comes with a free tier that allows up to 1 GB of free capacity for Data Studio users, with additional on-demand capacity at $0.0416 per GB afterward. Organizations with higher needs can opt for flat-rate capacity pricing, which comes with monthly charges.
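If you prefer to manage BI Engine capacity from SQL rather than the Cloud Console, recent BigQuery releases expose an ALTER BI_CAPACITY DDL statement; the sketch below assumes that statement is available in your region and uses a placeholder project ID, so treat it as illustrative rather than definitive.

```sql
-- Reserve 10 GB of BI Engine in-memory capacity for a project in the US region
-- (my-project is a placeholder; the capacity object is always named "default")
ALTER BI_CAPACITY `my-project.region-us.default`
SET OPTIONS (size_gb = 10);
```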
4. Geospatial Analysis (BQ GIS)
BigQuery GIS (Geographic Information Systems) adds location and mapping capabilities to the data warehouse, letting you analyze and visualize geospatial data with standard SQL geography functions.
BigQuery GIS works by converting latitude and longitude columns into geographical points (see the sketch after this list). The main object types when working with geospatial data are:
- geometry – represents an area on the surface of the earth
- spatial feature – represents a logical spatial object
- spatial feature collection – a set of spatial features
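Here is a minimal sketch of that conversion, assuming a table with plain latitude and longitude columns (the table and column names are invented): ST_GEOGPOINT turns the pair into a GEOGRAPHY point, and other GIS functions such as ST_DISTANCE can then operate on it.

```sql
-- Convert latitude/longitude columns into GEOGRAPHY points, then measure the
-- distance in meters from each point to a fixed reference location
SELECT
  store_id,
  ST_GEOGPOINT(longitude, latitude) AS location,  -- note the (lng, lat) order
  ST_DISTANCE(
    ST_GEOGPOINT(longitude, latitude),
    ST_GEOGPOINT(-122.4194, 37.7749)              -- reference point: San Francisco
  ) AS meters_from_reference
FROM my_dataset.stores;
```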
The final visualization of the BigQuery data can be done by using one of the following:
- BigQuery Geo Viz
- Google Earth Engine
- Jupyter notebooks (using an extension)
5. Automated Data Transfer (BQ Data Transfer Service)
BigQuery Data Transfer Service automates the movement of data into BigQuery on a scheduled basis. The schedule can be managed by the analytics team without any coding, and data backfills can be added to make up for any gaps or outages during ingestion.
The BigQuery Data Transfer Service can be accessed in three ways:
- Cloud Console
- bq command-line tool
- BigQuery Data Transfer Service API
As of now, BigQuery Data Transfer allows for data to be imported from a variety of Google-owned services like Cloud Storage, Google Play, Google Ads, and YouTube Channel reports. One can also import the data from external cloud storage services, like Amazon S3. Data warehouses like Teradata and Amazon Redshift can also be used to load data into BigQuery using the Data Transfer service.
6. Free Access (BQ Sandbox)
While all the features of BigQuery mentioned above are very useful, sometimes a user might want hands-on experience with a feature before investing in it. For that purpose, Google offers the BigQuery Sandbox – a way to try BigQuery and the Cloud Console without any commitment, meaning there is no need to create a billing account or provide credit card details. The sandbox gives access to the regular BigQuery experience, subject to the limits listed below. Once the user has tried the features and is satisfied, they can easily upgrade to the full BigQuery experience.
BigQuery Sandbox also comes with a set of limitations that apply until the upgrade:
- All BigQuery quotas and limits are applicable
- Free usage limits are the same as the BQ free tier and cannot be increased
- All datasets have the default expiration time (60 days)
- Streaming data, DML statements, and Data Transfer Service are not supported
BigQuery Pricing
Pricing in BigQuery is done for two main components – Analysis and Storage. Apart from these, charges can be levied for streaming inserts and using the BigQuery Storage API. Let’s look at these in a little more detail.
1. Analysis Pricing
Analysis pricing includes the charges levied by Google for running queries on BigQuery. This can be set up in two ways:
1.1 On-demand Analysis
This pricing model comes with a free tier of 1 TB of processed data per month, after which you are charged on the basis of usage (bytes processed per query). Queries may run against data stored in BigQuery or in an external source like Cloud Storage, Google Drive, or Bigtable. The rate after the free tier is $5.00 per TB, so a workload that scans 5 TB in a month would cost roughly (5 − 1) × $5.00 = $20.00.
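As a rough, hedged way to see where you stand against this model, the query below adds up the bytes billed for query jobs over the last 30 days and converts them to an approximate on-demand cost at the $5.00/TB rate quoted above. It assumes your jobs run in the US multi-region (`region-us`) and ignores the 1 TB free tier.

```sql
-- Approximate last-30-day on-demand analysis cost from bytes billed
SELECT
  SUM(total_bytes_billed) / POW(2, 40)        AS tb_billed,
  SUM(total_bytes_billed) / POW(2, 40) * 5.00 AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);
```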
1.2 Flat-rate Pricing
In this model, instead of paying per byte processed, the user pays a fixed sum to buy slots. These slots are units of dedicated query-processing capacity (essentially virtual CPUs), and the options vary based on the commitment period:
- Flex – Commitment of initial 60 seconds
- Monthly – Commitment of initial 30 days
- Annual – Commitment of initial 365 days
As expected, the cost per unit of time decreases as the commitment period increases. For instance, one month of 100 slots with a Monthly commitment would cost $2,000, while the same capacity with an Annual commitment would cost $1,700 per month – a saving of $300 per month, or $3,600 per year.
2. Storage Pricing
Storage pricing is the charge levied by Google for storing all the data that is loaded into BigQuery. It can be classified into two types:
2.1 Active Storage
This covers any table or partition that has been modified in the past 90 days – in other words, frequently updated data. It comes with 10 GB of free storage, with data beyond that charged at $0.020 per GB per month.
2.2 Long-term Storage
This is for data that is stored for a longer period of time. Any BigQuery table or partition that hasn't been modified for 90 consecutive days is automatically moved to long-term storage and charged accordingly. The free tier is the same as for active storage, but the rate after it is lower, at $0.010 per GB per month.
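To see how much of your data falls into each storage class, something like the hedged sketch below can be used; it assumes the INFORMATION_SCHEMA.TABLE_STORAGE view is available in your region (`region-us` here) and applies the per-GB list prices quoted above.

```sql
-- Approximate monthly storage cost, split into active and long-term storage
SELECT
  SUM(active_logical_bytes)    / POW(2, 30) * 0.020 AS approx_active_usd,
  SUM(long_term_logical_bytes) / POW(2, 30) * 0.010 AS approx_long_term_usd
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE;
```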
3. Data Ingestion & Extraction
Two other BigQuery components incur charges. These are data ingestion and extraction.
3.1 Ingestion
Data ingestion in BigQuery can be done in two ways – batch loading and streaming. While the former is free of cost as long as a shared slot pool is used, the latter is billed on usage at $0.010 per 200 MB of data (roughly $0.05 per GB).
3.2 Extraction
Similar to ingestion, data extraction can be done in two ways – batch exports and streaming reads. Batch exports are free as long as a shared slot pool is used, while streaming reads (via the BigQuery Storage API) are charged at $1.10 per TB of data read.
BigQuery vs Bigtable
One thing that often confuses people is the difference between BigQuery and Bigtable. Beyond the similar names, the two services do have real similarities: both are used for handling and performing operations on large amounts of data. The difference lies in the type of data each service is built for. While BigQuery is concerned with relational, structured data, Bigtable deals with non-relational data (it is a NoSQL database).
To summarize the contrast between the two Google services:
- BigQuery – a serverless data warehouse for analytical (OLAP) workloads, queried with SQL, best suited to scanning and aggregating large datasets
- Bigtable – a NoSQL wide-column database, accessed by key, best suited to high-throughput, low-latency reads and writes
FAQs
- Who should use BigQuery?
BigQuery is ideal for large organizations that deal with huge amounts of data and are looking for a service to handle it. BigQuery can import, analyze, and export that data as needed.
- Can BigQuery only be used for data in Google Cloud?
No, BigQuery can import data from multiple sources, including third-party storage services. For instance, the BigQuery Data Transfer Service for Amazon S3 allows data to be imported from S3 buckets into BigQuery.
- Will there be charges even if I am not using BigQuery analytics?
If you load your data into BigQuery but don't run analytics on it, you will only be billed for storing the data. The storage cost also drops once the data has been left unmodified for 90 days, as it then moves to long-term storage pricing.