A journey towards innovation and adoption
Databases are complex and difficult to manage, yet they underpin modern life and our daily actions. The growing data deluge doesn't help when you consider that "More than 59 zettabytes (ZB) of data will be created, captured, copied, and consumed in the world this year, according to a new update to the Global DataSphere from International Data Corporation (IDC)." The use of replicated data has increased, and each usage pattern creates the need for curation, including storage, processing, and possibly analytics.
Replicated Data Volumes and Accompanying Growth
According to IDC, “This growth is forecast to continue through 2024 with a five-year compound annual growth rate (CAGR) of 26%”. The amount of data created over the next three years will be more than the data created over the past 30 years, and the world will create more than three times the data over the next five years than it did in the previous five.
Hence the rate of data creation, and the ability to process data at the source, become more critical and need to be addressed with a sense of urgency. As enterprises struggle to contain and maintain these data volumes, it is vital to revisit and revise the data strategy, so you can retool and take advantage of leading market offerings without having to rip and replace existing environments and investments in database infrastructure.
One such solution is Google Cloud's flagship product, BigQuery, which stands out as an enterprise-grade, petabyte-scale data warehouse with feature-rich analytics, storage, and scale.
Gartner Ranking of Cloud Database Management Systems
Google Cloud was named a Leader in the first-ever Gartner Magic Quadrant for Cloud Database Management Systems (DBMS) in 2020. This recognition reflects many factors, but is largely due to Google Cloud's vision and strategy for data analytics and databases, and it is echoed by market adoption across sectors, segments, and industries.
Gartner positioned Google as a Magic Quadrant Leader, among the three vendors furthest along the completeness-of-vision axis, on the strength of its industry-leading hybrid-cloud delivery and a flexible pricing model (easy start-up and ramp) coupled with committed-use discounts and generous tiers and consumption thresholds.
Multi-Cloud DNA and Architecture
BigQuery Omni was announced in July 2020 in a blog post by Debanjan Saha (GM and VP of Engineering, Data Analytics). BigQuery Omni is Google Cloud’s multi-cloud analytics solution that enables BigQuery users to access and analyze data on AWS, with Azure coming soon, without having to move or copy datasets.
Data movement is often cited as one of the primary pain points for data scientists and analysts, and it frequently comes with significant compute costs that require justification, not to mention impacts to project timelines and ongoing skills deficits that limit such transfers.
Here, Saha promises a service that gives users,
“A consistent data experience using the same SQL and user interface they use in BigQuery for queries, dashboards and to run analytics for consistency and familiarity.”
This containerized architecture allows the data to stay in its AWS S3 bucket, where it is queried using Google Cloud's Dremel engine running natively on an Anthos cluster in the same region where the data resides. The results are then passed back to BigQuery, or a data store of choice, where they are combined with any other relevant data, with no associated data-movement costs. Consider a retailer that wants to seamlessly query both its Google Analytics 360 Ads data, which is stored in Google Cloud, and log data from an e-commerce platform, which is stored in AWS S3, to get a fuller picture of customer buying habits.
Now customers can also use the same BigQuery UI or API to run SQL queries and build BigQuery ML models regardless of where the data is stored. More importantly, BigQuery Omni runs on Anthos. Initially, Anthos was "simply" a hybrid and multi-cloud application platform, leveraging its strong Kubernetes backbone to run on-prem and existing AWS/Azure applications alongside GCP. With BigQuery Omni, Google is commoditizing cloud infrastructure further, using Anthos as middleware.
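As a sketch of what the retailer scenario could look like in SQL (the connection, bucket, dataset, and column names below are hypothetical, and BigQuery Omni's exact cross-cloud join capabilities have evolved over time, so treat this as illustrative):

```sql
-- Define an external table over the S3-resident logs; the AWS connection
-- must already exist in the BigQuery Omni region (here aws-us-east-1).
CREATE EXTERNAL TABLE ecommerce.clickstream_logs
  WITH CONNECTION `aws-us-east-1.s3_logs_connection`
  OPTIONS (
    format = 'PARQUET',
    uris = ['s3://retailer-logs/clickstream/*']
  );

-- Combine the S3-resident logs with Analytics 360 data already in
-- BigQuery, without copying the logs out of AWS.
SELECT a.campaign_id,
       COUNT(DISTINCT l.session_id) AS sessions,
       SUM(l.cart_value)            AS revenue
FROM ecommerce.clickstream_logs AS l
JOIN analytics.ads_sessions     AS a
  ON l.session_id = a.session_id
GROUP BY a.campaign_id;
```

The key point is that the logs are scanned where they live, in S3, and only the (much smaller) query result crosses cloud boundaries.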
Separation of Compute and Storage
“There is a new generation of data silos and we recognize that most of the companies in the world live in a multi-cloud reality, and with BigQuery Omni, we basically give them the possibility to do analytics across them — cross-cloud analytics, which I think is the ultimate in simplicity, because you don’t need to think anymore about where do I move my data and all of the repetition and redundancy in governing it. The single pane of glass and that single processing framework — it’s a feat of engineering that the team basically took BigQuery and brought many parts of it to other clouds.” -Gerrit Kazmaier.
BigQuery separates compute from storage; in this case, the processing is brought to the data. This has been a unique differentiator of Google BigQuery since its inception in 2011, while other hyperscalers' products, such as AWS Redshift, are still standardizing this value proposition. The data can stay in its home cloud (e.g., AWS) while the processing is done by BigQuery running on Anthos clusters. The savings on egress charges, along with the familiar look and feel of the home environment, make it easy and fast to run workloads with this approach. Inserting data quickly and reliably (within seconds) into an analytics database is a hard problem, and one that BigQuery has tackled. Before this option was released, the data had to be stored inside GCP and be resident in BigQuery.
Optimized Operations Model
BigQuery on-demand is a pure serverless model, where the user submits queries one at a time and pays per query. BigQuery is a truly serverless, managed offering that scales with no CAPEX or vendor support costs. Few would doubt the need for hybrid-cloud enablement, and this is a huge step in multi-cloud acceleration and adoption. The market is converging around two key principles: separation of compute and storage, and flat-rate pricing that can "spike" to handle intermittent workloads. On-demand mode, if architected properly, can reduce and optimize operating costs, depending on the nature of your workload; a "steady" workload that utilizes enterprise compute capacity 24/7 will be much cheaper in flat-rate mode. Fivetran conducted a study on cloud data warehouses and summarized it in the graphic below.
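The on-demand versus flat-rate trade-off comes down to simple arithmetic. A minimal sketch, using illustrative list prices rather than a quote (roughly $5 per TB scanned on demand and $2,000 per month for a 100-slot flat-rate reservation):

```python
# Illustrative BigQuery list prices (assumptions, not a quote):
# on-demand bills per terabyte scanned; flat-rate bills per slot reservation.
ON_DEMAND_PER_TB = 5.00        # USD per TB scanned
FLAT_RATE_100_SLOTS = 2000.00  # USD per month for a 100-slot reservation


def monthly_cost_on_demand(tb_scanned_per_month: float) -> float:
    """On-demand: pay only for the bytes your queries scan."""
    return tb_scanned_per_month * ON_DEMAND_PER_TB


def cheaper_mode(tb_scanned_per_month: float) -> str:
    """Pick the cheaper billing mode for a steady monthly scan volume."""
    if monthly_cost_on_demand(tb_scanned_per_month) < FLAT_RATE_100_SLOTS:
        return "on-demand"
    return "flat-rate"


# A light, intermittent workload favors on-demand (50 TB -> $250/month);
# a steady 24/7 workload scanning 500 TB/month ($2,500) favors flat-rate.
print(cheaper_mode(50))   # -> on-demand
print(cheaper_mode(500))  # -> flat-rate
```

At these assumed prices the break-even sits at 400 TB scanned per month; your own tipping point depends on current pricing and how bursty your query load is.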
Standardizing on BigQuery enables data scientists, DBAs, and analysts to make use of GCP's many smart analytics tools. Key features are summarized below:
- BigQuery ML, which abstracts away the complexity of traditional machine learning solutions, enabling users to build and deploy ML models using only basic SQL
- Looker, which democratizes data analytics through an intuitive, self-service platform that enables anyone in the organization to analyze, explore, create, and share visualizations
- The on-demand pricing model is based on bytes scanned, while the flat-rate model purchases fixed slots (reservations)
- Supports native streaming ingestion; partitioned data in a given table behaves like a single table, with the partitioning mostly abstracted from the user
- Supports federated queries via Cloud SQL, including Postgres and MySQL databases; other sources include Google Sheets and Google Cloud Storage
- Federated queries are designed to bridge the gap between traditional analytics (OLAP) style databases and typical row-based, transactional style databases (Postgres, MySQL)
- Provides column-level security through Data Catalog and IAM permissions, in which columns are tagged according to a hierarchy that maps to permissions
- Additional geospatial functions were added in September 2020, part of a growing feature set for querying geospatial data
- No need to format/re-format data as it supports Parquet, Avro, CSV, JSON, and ORC formats. Hence working off a single copy and system of record vs multiple copies
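The BigQuery ML bullet above can be made concrete with a short sketch; the dataset, table, and column names are hypothetical, but the pattern of training and scoring a model with nothing but SQL is the real BigQuery ML workflow:

```sql
-- Train a logistic regression churn model with plain SQL
-- (hypothetical schema: mydataset.customers).
CREATE OR REPLACE MODEL mydataset.churn_model
  OPTIONS (model_type = 'logistic_reg',
           input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM mydataset.customers;

-- Score new customers with the trained model.
SELECT customer_id, predicted_churned
FROM ML.PREDICT(
  MODEL mydataset.churn_model,
  (SELECT customer_id, tenure_months, monthly_spend, support_tickets
   FROM mydataset.new_customers));
```

No model server, feature pipeline, or separate ML framework is involved; the warehouse is the training and serving environment.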
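The federated-query and partitioning bullets can likewise be sketched; the connection ID and table names below are hypothetical:

```sql
-- Pull rows live from a Cloud SQL Postgres instance via EXTERNAL_QUERY,
-- without first copying them into BigQuery (connection ID is hypothetical).
SELECT o.id, o.total
FROM EXTERNAL_QUERY(
  'my-project.us.orders_postgres_conn',
  'SELECT id, total FROM orders;'
) AS o;

-- Create a date-partitioned table; queries that filter on event_date
-- scan only the matching partitions, yet the table is used like any other.
CREATE TABLE mydataset.events (
  event_id   STRING,
  event_date DATE,
  payload    STRING
)
PARTITION BY event_date;
```

This is the OLAP-to-OLTP bridge described above: the transactional rows stay in Postgres, while BigQuery handles the analytical scan.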
In one study at Purdue University, undertaken by the Joint Transportation Research Program on traffic safety on US interstates, researchers analyzed road safety using BigQuery: a single query examining a section of interstate highway over a 24-hour period takes just 7 minutes, compared to 90 minutes with the existing on-premises system. Batching 10 billion records could take a month with the old servers; the process takes just 5 minutes with BigQuery.
Geotab is in the business of providing data-driven insights on commercial vehicles and fleets. This includes sensor data ranging from engine speeds and ambient air temperatures to driving and weather conditions. Geotab records a wealth of data through the Geotab GO telematics device and a range of integrated sensors and apps. At the time this study was published, Geotab was collecting data from 1.4 million vehicles, streaming 3 billion raw records into BigQuery, with data moving from sensors to analytics in 5-10 seconds. SpringML worked with the Google Cloud team and Geotab to implement this; the Geotab case study is referenced here.
According to Mike Branch (VP of Data & Analytics – Geotab)
“Google Cloud Platform is helping us transform the way we create value for our customers. We can contextually benchmark data at scale; develop richer, more extensive data insight; grow our smart city business; and share our intelligence data with the general public through our data.geotab.com site. Google Cloud Platform is helping us achieve all that, and more.”
Google Cloud has been leading the move toward this golden age of databases by bringing to market feature-rich products such as BigQuery. The COVID-19 pandemic has clearly highlighted the need for flexibility, resiliency, and redundancy in operations and data warehouse deployments. IT leaders want the ability to write apps once and deploy them anywhere without feeling "locked in." Data warehouses are an entry point to the cloud; with BigQuery you can embrace multi-cloud and hybrid cloud while keeping your existing investments in cloud infrastructure operational and viable.