dataproc | SQL Squirrels

Week of October 9th

Posted on October 7, 2020 by Mark Shay

Part I of a Cloud☁️ Journey

“On a cloud☁️ of sound 🎶, I drift in the night🌙”

Hi All –

Ok Happy Hump 🐫 Day!

The last few weeks we spent some quality time ⏰ visiting with Microsoft SQL Server 2019. A few weeks back, we kicked 🦿the tires 🚗 with IQP and the improvements made to TempDB. Then week after we were fully immersed with Microsoft’s most ambitious offering in SQL Server 2019 with Big Data Clusters (BDC).

This week we make our triumphant return back to the cloud☁️. If you have been following our travels this past summer☀️ we focused on the core concepts of AWS and then we concentrated on the fundamentals of Microsoft Azure. So, the obvious natural progression of our continuous cloud☁️ journey✈️ would be to further explore the Google Cloud Platform or more affectionately known as GCP. We had spent a considerable amount time 🕰 before learning many of the exciting offerings in GCP but our long awaited return has been long overdue. Besides we felt the need to gives some love ❤️ and oxytocin 🤗 to our friends from Mountain View

“It starts with one☝️ thing…I don’t know why?”

Actually, Google has 10 things when it comes to their philosophy but more on that later. 😊

One of Google’s strong 💪 beliefs is that in “in the future every company will be a data company because making the fastest and best use of data is a critical source of a competitive advantage.”

GCP is Google’s Cloud Computing☁️ solution that provides a wide variety of services such as Compute, Storage🗄, Big data, and Machine Learning for managing and getting value from data and doing that at infinite scale⚖️. GCP offers over 90 products and Services.

“If you know your history… Then you would know where you coming from”

In 2006, AWS began offering cloud computing☁️ to the masses, several years later Microsoft Azure followed suit and shortly right after GCP joined the Flexible🧘‍♀️, Agile, Elastic, Highly Available and scalable⚖️ party 🥳. Although, Google was a late arrival to the cloud computing☁️ shindig🎉 their approach and strategy to Cloud☁️ Computing is far from a “Johnny-come-lately” 🤠

“Google Infrastructure for Everyone” 😀

Google does not view cloud computing☁️ as a “commodity” cloud☁️. Google’s methodology to cloud computing☁️ is of a “premier💎 cloud☁️”, one that provides the same innovative, high-quality, deluxe set of services, and rich development environment with the advanced hardware that Google has been running🏃‍♂️ internally for years but made available for everyone through GCP.

“No vendor lock–in 🔒. Guaranteed. 👍”

In addition, Google who is certainly no stranger to Open Source software promotes a vision🕶 of the “open cloud☁️”. A cloud☁️ environment where companies🏢🏭 large and small 🏠can seamlessly move workloads from one cloud☁️ provider to another. Google wants customers to have the ability to run 🏃‍♂️their applications anywhere not just in Google.

“Get outta my dreams😴… Get into my car 🚙”

Now that I extolled the virtues of Google’s vision 🕶 and strategy for Cloud computing☁️, it’s time to take this car 🚙 out for a spin. Fortunately, the team at Google Cloud☁️ have put together one of the best compilations since the Zeppelin Box Set 🎸in there Google Cloud Certified Associate Cloud Engineer Path on Pluralsight.

Since there is some much to unpack📦, we will need to break our learnings down into multiple parts. So to help us put our best foot 🦶forward through the first part our journey ✈️ will be Googler Brice Rice and former Googler Catherine Gamboa through their Google Cloud Platform Fundamentals – Core Infrastructure course.

In a great introduction, Brian expounds on the definition of Cloud Computing☁️ and a brief history on Google’s transformation from the virtualization model to a container‑based architecture, an automated, elastic, third‑wave cloud☁️ built from automated services.

Next, Brian reviews GCP computing architectures:

Infrastructure as a Service (IaaS) – provide raw compute, storage🗄, and network organized in ways that are familiar from data centers. You pay for what you allocate

Platform as a Service (PaaS) – binds application code you write to libraries📚 that give access to the infrastructure your application needs. You pay 💰 for what you use.

Software as a Service (SaaS) – applications in that they’re consumed directly over the internet by end users. Popular examples: Search 🔎, Gmail 📧, Docs 📄, and Drive 💽

Then we had an overview of Google’s network which according to some estimates carries as much as 40% of the world’s 🌎 internet traffic🚦. The network interconnects at more than 90 internet exchanges and more than 100 points of presence worldwide 🌎 (and growing). One of the benefits of GCP is that it leverages Google’s robust network. Allowing GCP resources to be hosted in multiple locations worldwide 🌎. At granular level these locations are organized by regions and zones. A region is a specific geographical 🌎 location where you can host your resources. Each region has one or more zones (most regions have three or more zones).

All of the zones within a region have fast⚡️network connectivity among them. A zone is like as a single failure domain within a region. A best practice in building a fault‑tolerant application, is to deploy resources across multiple zones in a given region to protect against unexpected failures.

Next, we had summary on Google’s Multi-layered approach to security🔒.

Highlights:

Server boards and the networking equipment in Google data centers are custom‑designed by Google.
Google also designs custom chips, including a hardware security🔒 chip (Titan) deployed on both servers and peripherals.
Google Server machines use cryptographic signatures✍️ to make sure they are booting the correct software.
Google designs and builds its own data centers (eco friendly), which incorporate multiple layers of physical security🔒 protections. (Access to these data centers is limited to only a few authorized Google Employees)
Google’s infrastructure provides cryptographic🔐 privacy🤫 and integrity for remote procedure‑called data on the network, which is how Google Services communicate with each other.
Google has multitier, multilayer denial‑of‑service protections that further reduces the risk of any denial‑of‑service 🛡 impact.

Rounding out the introduction was a sneak peek 👀 into the Budgets and Billing 💰. Google offers customer-friendly 😊 pricing with a per‑second billing for its IaaS compute offering, Fine‑grained billing is a big cost‑savings for workloads that are bursting. GCP provides four tools 🛠to help with billing:

Budgets and alerts 🔔
Billing export🧾
Reports 📊
Quotas

Budgets can be a fixed limit, or you can tie it to another metric, for example a percentage of the previous month’s spend.

Alerts 🔔 are generally set at 50%, 90%, and 100%, but are customizable

Billing export🧾 store detailed billing information in places where it’s easy to retrieve for more detailed analysis

Reports📊 is a visual tool in the GCP console that allows you to monitor your expenditure. GCP also implements quotas, which protect both account owners and the GCP community as a whole 😊.

Quotas are designed to prevent the overconsumption of resources, whether because of error or malicious attack 👿. There are two types of quotas, rate quotas and allocation quotas. Both get applied at the level of the GCP project.

After a great intro, next Catherine kick starts🦵 us with GCP. She begins with a discussion around resource hierarchy 👑 and trust🤝 boundaries.

Projects are the main way you organize the resources (all resources belong to a project) you use in GCP. Projects are used to group together related resources, usually because they have a common business objective. A project consists of a set of users, a set of APIs, billing, quotas, authentication, and monitoring 🎛 settings for those APIs. Projects have 3 identifying attributes:

Project ID (Globally 🌎 unique)
Project Name
Project Number (Globally 🌎 Unique)

Projects may be organized into folders 🗂. Folders🗂 can contain other folders 🗂. All the folders 🗂 and projects used by an organization can be put in organization nodes.

Please Note: If you use folders 🗂, you need to have an organization node at the top of the hierarchy👑.

Projects, folders🗂, and organization nodes are all places where the policies can be defined.

A policy is a set on a resource. Each policy contains a set of roles and members👥.

A role is a collection of permissions. Permissions determine what operations are allowed on a resource. There are three kinds of roles (primitive):

Owner
Editor
Viewer

Another role made available in IAM is to control the billing for a project without the right to change the resources in the project is billing administrator role.

Please note IAM provides finer‑grained types of roles for a project that contains sensitive data, where primitive roles are too generic.

A service account is a special type of Google account intended to represent a non-human user ⚙️ that needs to authenticate and be authorized to access data in Google APIs.

A member(s)👥 can be a Google Account(s), a service account, a Google group, or a Google Workspace or Cloud☁️ Identity domain that can access a resource.

Resources inherit policies from the parent.

Identity and Access Management (IAM) allows administrators to manage who (i.e. Google account, a Google group, a service account, or an entire Work Space) can do what (role) on specific resources There are four ways to interact with IAM and the other GCP management layers:

Web‑based console 🕸
SDK and Cloud shell (CLI)☁️
APIs
1. Cloud Client Libraries 📚
1. Google API Client Library 📚
Mobile app 📱

When it comes to entitlements “The principle of least privilege” should be followed. This principle says that each user should have only those privileges needed to do their jobs. In a least privilege environment, people are protected from an entire class of errors. GCP customers use IAM to implement least privilege, and it makes everybody happier 😊.

For example, you can designate an organization policy administrator so that only people with privilege can change policies. You can also assign a project creator role, which control who can spend money 💵.

Finally, we checked into Marketplace 🛒 which provides an easy way to launch common software packages in GCP. Many common web 🕸 frameworks, databases🛢, CMSs, and CRMs are supported. Some Marketplace 🛒 images charge usage fees, like third parties with commercially licensed software. But they all show estimates of their monthly charges before you launch them.

Please Note: GCP updates the base images for these software packages to fix critical issues 🪲and vulnerabilities, but it doesn’t update the software after it’s been deployed. However, you’ll have access to the deployed system so you can maintain them.

“Look at this stuff…🤩 Isn’t it neat? Wouldn’t you think my collection’s complete 🤷‍♂️?”

Now with basics of GCP covered, it was time 🕰 to explore 🧭 some the computing architectures made available within GCP.

Google Compute Engine

Virtual Private Cloud (VPC) – manage a networking functionality for your GCP resources. Unlike AWS (natively), GCP VPC is global 🌎 in scope. They can have subnets in any GCP region worldwide 🌎. And subnets can span the zones that make up a region.

Provides flexibility🧘‍♀️ to scale⚖️ and control how workloads connect regionally and globally🌎
Access VPCs without needing to replicate connectivity or administrative policies in each region
Bring your own IP addresses to Google’s network infrastructure across all regions

Much like physical networks, VPCs have routing tables👮‍♂️and Firewall🔥 Rules which are built in.

Routing tables👮‍♂️ forward traffic🚦from one instance to another instance
Firewall🔥 allows you to restrict access to instances, both incoming(ingress) and outgoing (egress) traffic🚦.

Cloud DNS manages low latency and high availability of the DNS service running on the same infrastructure as Google

Cloud VPN securely connects peer network to Virtual Private Cloud (VPC) network through an IPsec VPN connection.

Cloud Router lets other networks and Google VPC exchange route information over the VPN using the Border Gateway Protocol.

VPC Network Peering enables you to connect VPC networks so that workloads in different VPC networks can communicate internally. Traffic🚦stays within Google’s network and doesn’t traverse the public internet.

Private connection between you and Google for your hybrid cloud☁️
Connection through the largest partner network of service providers

Dedicated Interconnect which allows direct private connections providing highest uptimes (99.99% SLA) for their interconnection with GCP

Google Compute Engine (IaaS) delivers Linux or Windows virtual machines (VMs) running in Google’s innovative data centers and worldwide fiber network. Compute Engine offers scale ⚖️, performance, and value that lets you easily launch large compute clusters on Google’s infrastructure. There are no upfront investments, and you can run thousands of virtual CPUs on a system that offers quick, consistent performance. VMs can be created via Web 🕸 console or the gcloud command line tool🔧.

For Compute Engine VMs there are two kinds of persistent storage🗄 options:

Standard
SSD

If your application needs high‑performance disk, you can attach a local SSD. ⚠️ Beware to store data of permanent value somewhere else because local SSD’s content doesn’t last past when the VM terminates.

Compute Engine offers innovative pricing:

Per second billing
Preemptible instances
High throughput to storage🗄 at no additional cost
Only pay for hardware you need.

At the time of this post, N2D standard and high-CPU machine types have up to 224 vCPUs and 128 GB of memory which seems like enough horsepower 🐎💥 but GCP keeps upping 🃏🃏 the ante 💶 on maximum instance type, vCPU, memory and persistent disk. 😃

Sample Syntax creating a VM:

$ gcloud compute zones list | grep us-central1

$ gcloud config set compute/zone us-central1-c

$ gcloud compute instances create “my-vm-2” –machine-type “n1-standard-1” –image-project “debian-cloud” –image “ebian-9-stretch-v20170918” –subnet “default”

Compute Engine also offers auto Scaling ⚖️ which adds and removes VMs from applications based on load metrics. In addition, Compute Engine VPCs offering load balancing 🏋️‍♀️ across VMs. VPC supports several different kinds of load balancing 🏋️‍♀️:

Layer 7 load balancing 🏋️‍♀️ based on load
Layer 4 load balancing 🏋️‍♀️ based on non-http SSL load
Layer 4 load balancing 🏋️‍♀️ based on non-http SSL traffic🚦
Any Traffic🚦 (TCP, UDP)
Traffic🚦inside a VPC

Cloud CDN –accelerates💥 content delivery 🚚 in your application allowing users to experience lower network latency. The origins of your content will experience reduced load, and cost savings. Once you’ve set up HTTPS load balancing 🏋️‍♀️, simply enable Cloud CDN with a single checkbox.

Next on our plate 🍽 was to investigate the storage🗄 options that are available in GCP

Cloud Storage 🗄 is fully managed, high durability, high availability, scalable ⚖️ service. Cloud Storage 🗄 can be used for lots of use cases like serving website content, storing data for archival and disaster recovery, or distributing large data objects.

Cloud Storage🗄 offers 4 different types of storage 🗄 classes:

Regional
Multi‑regional
Nearline 😰
Coldline 🥶

Cloud Storage🗄 is comprised of buckets 🗑 which create, and configure, and use to hold storage🗄 objects.

Buckets 🗑 are:

Globally 🌎 Unique
Different storage🗄 classes
Regional or multi-regional
Versioning enabled (Objects are immutable)
Lifecycle 🔄 management Rules

Cloud Storage🗄supports several ways to bring data into Cloud Storage🗄.

Use gsutil Cloud SDK.
Drag‑and‑drop in the GCP console (with Google Chrome browser).
Integrated with many of the GCP products and services:
- Import and export tables from and to BigQuery and Cloud SQL
- Store app engine logs
- Cloud data store backups
- Objects used by app engine
- Compute Engine Images
Online storage🗄 transfer service (>TB) (HTTPS endpoint)
Offline transfer appliance (>PB) (rack-able, high capacity storage🗄 server that you lease from Google)

“Big wheels 𐃏 keep on turning”

Cloud Bigtable is afully managed, scalable⚖️ NoSQL database🛢 service for large analytical and operational workloads. The databases🛢 in Bigtable are sparsely populated tables that can scale to billions of rows and thousands of columns, allowing you to store petabytes of data. Data encryption inflight and at rest are automatic

GCP fully manages the surface, so you don’t have to configure and tune it. It’s ideal for data that has a single lookup keys🔑 and for storing large amounts of data with very low latency.

Cloud Bigtable is offered through the same open source API as HBase, which is the native database🛢 for the Apache Hadoop 🐘 project.

Cloud SQL is a fully managed relational database🛢 service for MySQL, PostgreSQL, and MS SQL Server which provides:

Automatic replication
Managed backups
Vertical scaling ⚖️ (Read and Write)
Horizontal Scaling ⚖️
Google integrated Security 🔒

Cloud Spanner is a fully managed relational database🛢 with unlimited scale⚖️ (horizontal), strong consistency & up to 99.999% high availability.

It offers transactional consistency at a global🌎 scale ⚖️, schemas, SQL, and automatic synchronous replication for high availability, and it can provide petabytes of capacity.

Cloud Datastore is a highly scalable ⚖️ (Horizontal) NoSQL database🛢 for your web 🕸 and mobile 📱 applications.

Designed for application backends
Supports transactions
Includes a free daily quota

Comparing Storage🗄 Options

Cloud Datastore is the best for semi‑structured application data that is used in App Engine applications.

Bigtable is best for analytical data with heavy read/write events like AdTech, Financial 🏦, or IoT📲 data.

Cloud Storage🗄 is best for structured and unstructured binary or object data, like images🖼, large media files🎞, and backups.

Cloud SQL is best for web 🕸 frameworks and existing applications, like storing user credentials and customer orders.

Cloud Spanner is best for large‑scale⚖️ database🛢 applications that are larger than 2 TB, for example, for financial trading and e‑commerce use cases.

“Everybody, listen to me… And return me my ship⛵️… I’m your captain👩🏾‍✈️, I’m your captain👩🏾‍✈️”

Containers, Kubernetes ☸️, and Kubernetes Engine ☸️

Containers provide independent scalable ⚖️ workloads, that you would get in a PaaS environment, and an abstraction layer of the operating system and hardware, like you get in an IaaS environment. Containers virtualize the operating system rather than the hardware. The environment scales⚖️ like PaaS but gives you nearly the same flexibility as Infrastructure as a Service

Kubernetes ☸️ is an open source orchestrator for containers. K8s ☸️ make it easy to orchestrate many containers on many hosts, scale ⚖️ them, roll out new versions of them, and even roll back to the old version if things go wrong 🙁. K8s ☸️ lets you deploy containers on a set of nodes called a cluster.

A cluster is set of master components that control the system as a whole, and a set of nodes that run containers.

K8s ☸️ deploys a container or a set of related containers, it does so inside an abstraction called a pod.

A pod is the smallest deployable unit in Kubernetes.

Kubectl starts a deployment with a container running in a pod. A deployment represents a group of replicas of the same pod. It keeps your pods running 🏃‍♂️, even if a node on which some of them run on fails.

Google Kubernetes Engine (GKE) ☸️ is Secured and managed Kubernetes service ☸️ with four-way auto scaling ⚖️ and multi-cluster support.

Leverage a high-availability control plane ✈️including multi-zonal and regional clusters
Eliminate operational overhead with auto-repair 🧰, auto-upgrade, and release channels
Secure🔐 by default, including vulnerability scanning of container images and data encryption
Integrated Cloud Monitoring 🎛 with infrastructure, application, and Kubernetes-specific ☸️ views

GKE is like an IaaS offering in that it saves you infrastructure chores and it’s like a PaaS offering in that it was built with the needs of developers 👩‍💻 in mind.

Sample Syntax building a K8 cluster:

gcloud container clusters create k1

In GKE to make the pods in your deployment publicly available, you can connect a load balancer🏋️‍♀️ to it by running the kubectl expose command. K8s ☸️ then creates a service with a fixed IP address for your pods.

A service is the fundamental way K8s ☸️ represents load balancing 🏋️‍♀️. A K8s ☸️ attaches an external load balancer🏋️‍♀️ with a public IP address to your service so that others outside the cluster can access it.

In GKE, this kind of load balancer🏋️‍♀️ is created as a network load balancer🏋️‍♀️. This is one of the managed load balancing 🏋️‍♀️ services that Compute Engine makes available to virtual machines. GKE makes it easy to use it with containers.

Service groups is a set of pods together and provides a stable end point for them

Imperative commands

kubectl get services shows you your service’s public IP address

kubectl scale – scales ⚖️ a deployment

kubectl expose – creates a service

kubectl get pods watch the pods come online

The real strength 💪of K8s ☸️ comes when you work in a declarative of way. Instead of issuing commands, you provide a configuration file (YAML) that tells K8s ☸️ what you want your desired state to look like, and Kubernetes ☸️ figures out how to do it.

When you choose a rolling update for a deployment and then give it a new version of the software it manages, Kubernetes will create pods of the new version one by one, waiting for each new version pod to become available before destroying one of the old version pods. Rolling updates are a quick way to push out a new version of your application while still sparing your users from experiencing downtime.

“Going where the wind 🌬 goes… Blooming like a red rose🌹”

Introduction to Hybrid and Multi-Cloud Computing (Anthos)

Modern hybrid or multi‑cloud☁️ architectures allows you to keep parts of your system’s infrastructure on‑premises, while moving other parts to the cloud☁️, creating an environment that is uniquely suited to many company’s needs.

Modern distributed systems allow a more agile approach to managing your compute resources

Move only some of you compute workloads to the cloud ☁️
Move at your own pace
Take advantage of cloud’s☁️ scalability⚖️ and lower costs 💰
Add specialized services to compute resources stack

Anthos is Google’s modern solution for hybrid and multi-cloud☁️ systems and services management.

The Anthos framework rests on K8s ☸️ and GKE deployed on‑prem, which provides the foundation for an architecture that is fully integrated with centralized management through a central control plane that supports policy‑based application life cycle🔄 delivery across hybrid and multi‑cloud☁️ environments.

Anthos also provides a rich set of tools🛠 for monitoring 🎛 and maintaining the consistency of your applications across all of your network, whether on‑premises, in the cloud☁️ K8s ☸️, or in multiple clouds☁️☁️.

Anthos Configuration Management provides a single source of truth for your cluster’s configuration. That source of truth is kept in the policy repository, which is actually a Git repository.

“And I discovered🕵️‍♀️ that my castles 🏰 stand…Upon pillars of salt🧂 and pillars of sand 🏖

App Engine (PaaS) builds a highly scalable ⚖️ application on a fully managed serverless platform.

App Engine makes deployment, maintenance, autoscaling ⚖️ workloads easy allowing developers 👨‍💻to focus on innovation

GCP provides an App Engine SDK in several languages so developers 👩‍💻 can test applications locally before uploaded to the real App Engine service.

App Engine’s standard environment provides runtimes for specific versions of Java☕️, Python🐍, PHP, and Go. The standard environment also enforces restrictions🚫 on your code by making it run in a so‑called sandbox. That’s a software construct that’s independent of the hardware, operating system, or physical location of the server it runs🏃‍♂️ on.

If these constraints don’t work for a given applications, that would be a reason to choose the flexible environment.

App Engine flexible environment:

Builds and deploys containerized apps with a click
No sandbox constraints
Can access App Engine resources

App Engine flexible environment apps use standard runtimes, can access App Engine services such as

Datastore
Meme cache
Task Queues

Cloud Endpoints – Develop, deploy, and manage APIs on any Google Cloud☁️ back end.

Cloud Endpoints helps you create and maintain APIs

Distributed API management through an API console
Expose your API using a RESTful interface

Apigee Edge is also a platform for developing and managing API proxies.

Apigee Edge focus on business problems like rate limiting, quotas, and analytics a

A platform for making APIs available to your customers and partners
Contains analytics, monetization, and a developer portal

Developing in the Cloud ☁️

Cloud Source Repositories – Fully featured Git repositories hosted on GCP

Cloud Functions – Scalable ⚖️ pay-as-you-go functions as a service (FaaS) to run your code with zero server management.

No servers to provision, manage, or upgrade
Automatically scale⚖️ based on the load
Integrated monitoring 🎛, logging, and debugging capability
Built-in security🔒 at role and per function level based on the principle of least privilege
Key🔑 networking capabilities for hybrid and multi-cloud☁️☁️ scenarios
Deployment: Infrastructure as code

Deployment: Infrastructure as code

Deployment Manager – creates and manages cloud☁️ resources with simple templates

Provides repeatable deployments
Create a .yaml template describing your environment and use Deployment Manager to create resources

“Follow my lead, oh, how I need… Someone to watch over me”

Monitoring 🎛: Proactive instrumentation

Stackdriver is GCP’s tool for monitoring 🎛, logging and diagnostics. Stackdriver provides access to many different kinds of signals from your infrastructure platforms, virtual machines, containers, middleware and application tier; logs, metrics and traces. It provides insight into your application’s health ⛑, performance and availability. So, if issues occur, you can fix them faster.

Here are the core components of Stackdriver;

Monitoring 🎛
Logging
Trace
Error Reporting
Debugging
Profiler

Stackdriver Monitoring 🎛 checks the end points of web 🕸 applications and other Internet‑accessible services running on your cloud☁️ environment.

Stackdriver Logging view logs from your applications and filter and search on them.

Stackdriver error reporting tracks and groups the errors in your cloud☁️ applications and it notifies you when new errors are detected.

Stackdriver Trace sample the latency of App Engine applications and report per URL statistics.

Stackdriver Debugger of connects your application’s production data to your source code so you can inspect the state of your application at any code location in production

“Whoa oh oh oh oh… Something big I feel it happening”

GCP Big Data Platform – services are fully managed and scalable ⚖️ and Serverless

Cloud Dataproc is a fast, easy, managed way to run🏃‍♂️ Hadoop 🐘 MapReduce 🗺, Spark 🔥, Pig 🐷 and Hive 🐝 Service

Create clusters in 90 seconds or less on average
Scale⚖️ cluster up and down even when jobs are running 🏃‍♂️
Easily migrate on-premises Hadoop 🐘 jobs to the cloud☁️
Uses Spark🔥 Machine Learning Libraries📚 (MLib) to run classification algorithms

Cloud Dataflow🚰 – Stream⛲️ and Batch processing; unified and simplified pipelines

Processes data using Compute Engine instances.
Clusters are sized for you
Automated scaling ⚖️, no instance provisioning required
Managed expressive data Pipelines
Write code once and get batch and streaming⛲️.
Transform-based programming model
ETL pipelines to move, filter, enrich, shape data
Data analysis: batch computation or continuous computation using streaming
Orchestration: create pipelines that coordinate services, including external services
Integrates with GCP services like Cloud Storage🗄, Cloud Pub/Sub, BigQuery and BigTable
Open source Java☕️ and Python 🐍 SDKs

BigQuery🔎 is a fully‑managed, petabyte scale⚖️, low‑cost analytics data warehouse

Analytics database🛢; stream data 100,000 rows /sec
Provides near real-time interactive analysis of massive datasets (hundreds of TBs) using SQL syntax (SQL 2011)
Compute and storage 🗄 are separated with a terabit network in between
Only pay for storage 🗄 and processing used
Automatic discount for long-term data storage 🗄

Cloud Pub/Sub – Scalable ⚖️, flexible🧘‍♀️ and reliable enterprise messaging 📨

Pub in Pub/Sub is short for publishers

Sub is short for subscribers.

Supports many-to-many asynchronous messaging📨
Application components make push/pull subscriptions to topics
Includes support for offline consumers
Simple, reliable, scalable ⚖️ foundation for stream analytics
Building block🧱 for data ingestion in Dataflow, IoT📲, Marketing Analytics
Foundation for Dataflow streaming⛲️
Push notifications for cloud-based☁️ applications
Connect applications across GCP (push/pull between Compute Engine and App Engine

Cloud Datalab🧪 is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on GCP.

Interactive tool for large-scale⚖️ data exploration, transformation, analysis, and visualization
Integrated, open source
- Built on Jupyter

“Domo arigato misuta Robotto” 🤖

Cloud Machine Learning Platform🤖

Cloud☁️ machine‑learning platform🤖 provides modern machine‑learning services🤖 with pre‑trained models and a platform to generate your own tailored models.

TensorFlow 🧮 is an open‑source software library 📚 that’s exceptionally well suited for machine‑learning applications🤖 like neural networks🧠.

TensorFlow 🧮 can also take advantage of Tensor 🧮 processing units (TPU), which are hardware devices designed to accelerate machine‑learning 🤖 workloads with TensorFlow 🧮. GCP makes them available in the cloud☁️ with Compute Engine virtual machines.

Generally, applications that use machine‑learning platform🤖 fall into two categories, depending on whether the data worked on is structured or unstructured.

For structured data, ML 🤖 can be used for various kinds of classification and regression tasks, like customer churn analysis, product diagnostics, and forecasting. In addition, Detection of anomalies like fraud detection, sensor diagnostics, or log metrics.

For unstructured data, ML 🤖 can be used for image analytics, such as identifying damaged shipment, identifying styles, and flagging🚩content. In addition, ML🤖 can be used for text analytics like a call 📞 center log analysis, language identification, topic classifications, and sentiment analysis.

Cloud Vision API 👓 derives insights from your images in the cloud☁️ or at the edge with AutoML Vision👓 or use pre-trained Vision API👓 models to detect emotion, understand text, and more.

Analyze images with a simple REST API
Logo detection, label detection
Gain insights from images
Detect inappropriate content
Analyze sentiment
Extract text

Cloud Natural Language API 🗣extracts information about people, places, events, (and more) mentioned in text documents, news articles, or blog posts

Uses machine learning🤖 models to reveal structure and meaning of text
Extract information about items mentioned in text documents, news articles, and blof posts

Cloud Speech API 💬 enables developers 👩‍💻 to convert audio to text.

Transcribe your content in real time or from stored files
Deliver a better user experience in products through voice 🎤 commands
Gain insights from customer interactions to improve your service

Cloud Translation API🈴 provides a simple programmatic interface for translating an arbitrary string into a supported language.

Translate arbitrary strings between thousands of language pairs
Programmatically detect a document’s language
Support for dozens of languages

Cloud Video Intelligence API📹 enable powerful content discovery and engaging video experiences.

Annotate the contents of videos
Detect scene changes
Flag inappropriate content
Support for a variety of video formats

“Fly away, high away, bye bye…” 🦋

We will continue next week with Part II of this series….

Thanks –

–MCS

Week of June 19th

Posted on June 19, 2020 by Mark Shay

“I had some dreams, they were clouds ☁️ in my coffee☕️ … Clouds ☁️ in my coffee ☕️ , and…”

Hi All –

Last week, we explored Google’s fully managed “No-Ops” Cloud ☁️ DW solution, BigQuery🔎. So naturally it made sense to drink🍹more of the Google Kool-Aid and further discover the data infrastructure offerings within the Google fiefdom 👑. Besides we have been wanting to find what all the hype was about with Datafusion ☢️ for some time now which we finally did and happily😊 wound up getting a whole lot more than we bargained for…

To take us through the Google’s stratosphere☁️ would be no other than some of more prominent members of the Google Cloud Team; Evan Jones, Julie Price, and Gwendolyn Stripling. Apparently, these Googlers (all of which seemed have mastered the art of using their hands👐 while speaking) collaborated with other data aficionados at Google on a 6 course compilation of awesomeness😎 for the Data Engineering on the Google Cloud☁️Path. The course that fit the bill to start this week’s learning off was Building Batch Data Pipelines on GCP

Before we were able to dive right into DataFusion☢️, we first started off with a brief review of EL (Extract and Load), ELT (Extract, Transform and Load), and ETL (Extract, Load, and Transform) .

The best way to think of these types of data extraction is the following:

EL is like a package📦 delivered right to your door🚪 where the contents can be taken right out of the box and used. (data can be imported “as is”)
ELT is like a hand truck 🛒 which allows you to move packages easily, but the packages 📦 📦 stilled need to be unloaded and items possibly stored a particular way.
ELT is like a forklift 🚜 this is when heavy lifting needs to be done to transfer packages and have them fit in the right place

In the case of EL and ELT our flavor du jour in the Data Warehouse space, Bigquery🔎 is an ideal target 🎯system but when you need the heavy artily (ELT) that’s when you got to bring an intermediate solution. The best way to achieve these goals is the following:

Data pipelines
Manage pipelines
UI to build those pipelines

Google offers several data transformation and streaming pipeline solutions (Dataproc🔧 and Dataflow🚰) and one easy to use UI (DataFusion☢️) that makes it easy to build those pipelines. Our first stop was Dataproc🔧 which is a fast, easy-to-use, fully managed cloud☁️ service meant for Apache Spark⚡️ and Apache Hadoop🐘 clusters. Hadoop🐘 solutions are generally not really our area of expertise but nevertheless we spent some time here to get a good general understanding of how this solution works and since Datafusion ☢️ sits on top of Dataproc🔧. It was worth our while to understand how it all works

Next, we ventured over too the much anticipated Datafusion☢️ which was more than worth our wait! Datafusion☢️ uses ephemeral Dataproc🔧VMs to perform all the transforms in batch data pipelines (Streaming currently not supported but coming soon through Dataflow🚰 support). Under the hood Datafusion☢️ leverages five main components

1. Kubernetes☸️ Engine (runs in a containerized environment on GKE)

2. Key🔑 Management Service (For Security)

3. Persistent Disk

4. Google Cloud☁️ Storage (GCS) (For long term storage)

5. Cloud☁️ SQL – (To manages user and Pipeline Data)

The good news is that you don’t really need to muck around with any of these components. In fact, you shouldn’t even concern yourself with them at all. I just mentioned them because I thought it was kind of cool stack 😎. The most important part of datafusion☢️ is the data fusion☢️ studio which is the graphical “no code” tool that allows Data Analysts and ETL Developers to wrangle data and build batch data pipelines. Basically, it allows you to build pretty complex pipelines by simply “drag and drop”.

“Don’t reinvent the wheel, just realign it.” – Anthony J. D’Angelo

So now with a cool 😎 and easy to build batch pipeline UI under our belt, what about a way to orchestrate all these pipelines? Well, Google pulled no punches🥊🥊 and gave us Cloud☁️ Composer which is fully managed data workflow and orchestration service that allows you to schedule and monitor pipelines. Following the motto of “not reinventing the wheel”, Cloud☁️ Composer leverages Apache Airflow 🌬.

For those who don’t know Apache Airflow 🌬is the popular Data Pipeline orchestration tool originally developed by the fine folks at Airbnb. Airflow🌬 is written in Python 🐍(our new favorite programming language), and workflows are created via Python 🐍 scripts (1 Python🐍 file per DAG). Airflow 🌬uses directed acyclic graphs (DAGs) to manage workflow orchestration. Not to be confused with a uncool person or an unpleasant sight on a sheep 🐑 A DAG* is a simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

Take a bow for the new revolution… Smile and grin at the change all around me

Next up on our adventures was onto Dataflow🚰 which is a fully managed streaming 🌊 analytics service that minimizes latency, processing time, and cost through autoscaling as well as batch processing. So why Dataflow🚰 and not DataProc🔧?

No doubt, Dataproc🔧 is a solid data pipeline solution which meets most requirements for either Batch or Streaming🌊Pipelines but it’s a bit clunky and requires existing knowledge of Hadoop🐘/Spark⚡️ infrastructure.

Dataproc🔧 is still an ideal solution for those who want to bridge 🌉 the gap by moving their on-premise Big Data Infrastructure to GCP. However, if you have a green field project than Dataflow🚰 definitely seems like the way to go.

DataFlow🚰 is “Server-less” which means the service “just works” and you don’t need to manage it! Once again, Google holds true to form with our earlier mantra (“not reinventing the wheel”) as Cloud Dataflow🚰 is built on top of the popular batch and streaming pipeline solution Apache Beam.

For those not familiar with Apache BEAM (Batch and StrEAM) it was also developed by Google to ensure the perfect marriage between batch and streaming data-parallel processing pipelines. A true work of art!

The show must go on….

So ya… Thought ya. Might like to go to the show…To feel the warm🔥 thrill of confusion. That space cadet glow

Now that we were on role with our journey through GCP’s Data Ecosystem it seemed logical to continue our path with the next course Building Resilient Streaming Analytics Systems on GCP. This exposition was taught by the brilliant Raiyann Serang who maintains a well kempt hairdo throughout his presentations and the distinguished Nitin Aggarwal as well as the aforementioned Evan Jones.

First, Raiyann’s takes us through a brief introduction on streaming 🌊 data (data processing for unbounded data sets). In addition, he provides The reasons for streaming 🌊 data and value that streaming🌊 data provides to the business by enabling real time information in a dashboard or another means to see the current state. He touches on the ideal architectural model using Google Pub/Sub and Dataflow🚰 to construct a data pipeline to minimize latency at each step during the ingestion process.

Next, he laments about the infamous 3Vs in regards to streaming 🌊 data and how might a data engineer deal with these challenges.

Volume

How to ingest this data into the system?
How to store and organize data to process this data quickly?
How will the storage layer be integrated with other processing layers?

Velocity

10,000 records/sec being transferred (Stock market Data, etc.)
How systems need to be able handle the load change?

Variety

Type and format of data and the constraints of processing

Next, he provides a preview to the rest of the course as he unveils Google’s triumvirate to the streaming data challenge. Pub/Sub to deal with variable volumes of data, Dataflow🚰 to process data without undue delays and Bigquery🔎 to address need of ad-hoc analysis and immediate insights.

Pure Gold!

After a great Introduction, Raiyann’s takes us right to Pub/Sub. Fortunately, we has been to this rodeo before and were well aware of the value of Pub/Sub Pub/Sub Is a ready to use asynchronous distribution system that fully manages data ingestion for both on cloud ☁️ and on premise environments. It’s a highly desirable solution when it comes to streaming solutions because of how well it addresses Availability, Durability, and Scalability.

The short story around Pub/Sub is a story of two data structures, the topic and the subscription. The Pub/Sub Client that creates the topic is called the Publisher and the Cloud Pub/Sub client that creates the subscription is the subscriber. Pub/Sub provides both Push (periodically calling for messages) and Pull (Clients have to acknowledge the message as separate step) deliveries.

Now, that we covered how to process data, it was time to move to the next major piece in our data architectural model and that is how to process the data without undue delays Dataflow🚰

Taking us through this part of the journey would be Nitin. We had already covered Dataflow🚰 earlier in the week in the previous course but that was only in regards to batch data (Bound data or unchanging data) pipelines.

DataFlow🚰 if you remember is built on Apache Beam, so in another words it has “the need for speed” and can support streams🌊 of data. Dataflow🚰 is highly scalable with low latency processing pipelines for incoming messages. Nitin further discusses the major challenges with handling streaming or real time data and how DataFlow🚰 tackles these obstacles.

Streaming 🌊 data generally only grows larger and more frequent
Fault Tolerance – Maintain fault tolerance despite increasing volumes of data
Model – Is it streaming or repeated batch?
Time – (Latency) what if data arrives late

Next, he discusses one of DataFlow🚰 key strengths “Windowing” and provides details in the three kinds of Windows.

Fixed – Divides Data into time Slices
Sliding – Those you use for computing (often required when doing aggregations over unbounded data)
Sessions – defined by minimum gap duration and timing is triggered by another element (communication is bursty)

Then Nitin rounds it off with one of the key concepts when it comes to Streaming🌊 data pipelines the “watermark trigger”. The summit of this module is the lab on Streaming🌊 Data Processing which requires building a full end to end solution using Pub/Sub, Dataflow🚰, and Bigquery. In addition, he gave us a nice introduction to Google Cloud☁️ Monitoring which we had not seen before.

So much larger than life… I’m going to watch it growing

We next headed over to another spoke in the data architecture wheel 🎡with Google’s Bigtable . Bigtable (built on Colossus) is Google’s NoSQL solution for high performance applications. We hadn’t done much so far with NoSQL up until this point. So, this module offered us a great primer for future travels.

Bigtable is ideal for storing very large amounts of data in a key-value store or non-structured data and it supports high read and write throughput at low latency for fast access to large datasets. However, Bigtable is not a good solution for Structured data, small data (< TB) or data that requires SQL Joins. Bigtable is good for specific use cases like real-time lookups as part of an application, where speed and efficiency are desired beyond that of a database. When Bigtable is a good match for specific workloads “it’s so consistently fast that it is magical 🧙‍♂️”.

“And down the stretch they 🐎 come!”

Next up, Evan takes us down the homestretch by surveying Advanced BigQuery 🔎 Functionality and Performance. He first begins with an overview and a demo of BigQuery 🔎 and GIS (Geographic Information Systems) functions which allows you to analyze and visualize Geo Spatial data in BigQuery🔎. This is a little beyond the scope of our usual musings but it’s good to know from an informational standpoint. Then Evan covers a critical topic for any data engineer or analyst to understand, which is how to break apart a single data set into groups or Window Functions .

Followed by a lab that demonstrated some neat tricks on how to reduce I/O, Cache results, and perform efficient joins by using the the WITH Clause, Changing the parameter of Region location, and denormalization of the data respectively. Finally, Evan leaves use with a nice parting gift by providing a handy cheatsheet and a quick lab on Partitioned Tables in Google BigQuery🔎

* DAG is a directed graph data structure that uses a topological ordering. The sequence can only go from earlier to later. DAG is often applied to problems related to data processing, scheduling, finding the best route in navigation, and data compression.

“It’s something unpredictable, but in the end it’s right”

Below are some topics I am considering for my travels next week:

NoSQL – MongoDB, Cosmos DB
More on Google Data Engineering with
Google Cloud Path <-Google Cloud Certified Professional Data Engineer
Working JSON Files
Working with Parquet files
JDBC Drivers
More on Machine Learning
ONTAP Cluster Fundamentals
Data Visualization Tools (i.e. Looker)
Additional ETL Solutions (Stitch, FiveTran)
Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well

—MCS