Week of October 9th

Part I of a Cloud☁️ Journey

“On a cloud☁️ of sound 🎶, I drift in the night🌙”

Hi All –

Ok Happy Hump 🐫 Day!

The last few weeks we spent some quality time ⏰ visiting with Microsoft SQL Server 2019. A few weeks back, we kicked 🦿 the tires 🚗 with IQP and the improvements made to TempDB. Then the week after, we were fully immersed in Microsoft’s most ambitious offering in SQL Server 2019, Big Data Clusters (BDC).

This week we make our triumphant return to the cloud☁️. If you have been following our travels this past summer☀️, we focused on the core concepts of AWS and then concentrated on the fundamentals of Microsoft Azure. So, the obvious natural progression of our continuous cloud☁️ journey✈️ would be to further explore the Google Cloud Platform, more affectionately known as GCP. We had spent a considerable amount of time 🕰 before learning many of the exciting offerings in GCP, but our return has been long overdue. Besides, we felt the need to give some love ❤️ and oxytocin 🤗 to our friends from Mountain View.

“It starts with one☝️ thing…I don’t know why?”

Actually, Google has 10 things when it comes to their philosophy but more on that later. 😊

One of Google’s strong 💪 beliefs is that “in the future every company will be a data company because making the fastest and best use of data is a critical source of competitive advantage.”

GCP is Google’s cloud computing☁️ solution that provides a wide variety of services such as compute, storage🗄, big data, and machine learning for managing and getting value from data, and doing that at infinite scale⚖️. GCP offers over 90 products and services.

“If you know your history… Then you would know where you coming from”

In 2006, AWS began offering cloud computing☁️ to the masses; several years later Microsoft Azure followed suit, and shortly after GCP joined the Flexible🧘‍♀️, Agile, Elastic, Highly Available, and Scalable⚖️ party 🥳. Although Google was a late arrival to the cloud computing☁️ shindig🎉, their approach and strategy to cloud☁️ computing is far from a “Johnny-come-lately.” 🤠

“Google Infrastructure for Everyone” 😀

Google does not view cloud computing☁️ as a “commodity” cloud☁️. Google’s methodology to cloud computing☁️ is of a “premier💎 cloud☁️”, one that provides the same innovative, high-quality, deluxe set of services, and rich development environment with the advanced hardware that Google has been running🏃‍♂️ internally for years but made available for everyone through GCP.

“No vendor lock-in 🔒. Guaranteed.” 👍

In addition, Google, who is certainly no stranger to open source software, promotes a vision🕶 of the “open cloud☁️”: a cloud☁️ environment where companies🏢🏭 large and small 🏠 can seamlessly move workloads from one cloud☁️ provider to another. Google wants customers to have the ability to run 🏃‍♂️ their applications anywhere, not just in Google.

“Get outta my dreams😴… Get into my car 🚙

Now that I have extolled the virtues of Google’s vision 🕶 and strategy for cloud computing☁️, it’s time to take this car 🚙 out for a spin. Fortunately, the team at Google Cloud☁️ has put together one of the best compilations since the Zeppelin Box Set 🎸 in their Google Cloud Certified Associate Cloud Engineer path on Pluralsight.

Since there is so much to unpack📦, we will need to break our learnings down into multiple parts. So, helping us put our best foot 🦶 forward through the first part of our journey ✈️ will be Googler Brian Rice and former Googler Catherine Gamboa through their Google Cloud Platform Fundamentals – Core Infrastructure course.

In a great introduction, Brian expounds on the definition of cloud computing☁️ and gives a brief history of Google’s transformation from the virtualization model to a container‑based architecture: an automated, elastic, third‑wave cloud☁️ built from automated services.

Next, Brian reviews GCP computing architectures:

Infrastructure as a Service (IaaS) – provides raw compute, storage🗄, and network, organized in ways that are familiar from data centers. You pay 💰 for what you allocate.

Platform as a Service (PaaS) – binds application code you write to libraries📚 that give access to the infrastructure your application needs. You pay 💰 for what you use.

Software as a Service (SaaS) – applications that are consumed directly over the internet by end users. Popular examples: Search 🔎, Gmail 📧, Docs 📄, and Drive 💽.

Then we had an overview of Google’s network, which according to some estimates carries as much as 40% of the world’s 🌎 internet traffic🚦. The network interconnects at more than 90 internet exchanges and more than 100 points of presence worldwide 🌎 (and growing). One of the benefits of GCP is that it leverages Google’s robust network, allowing GCP resources to be hosted in multiple locations worldwide 🌎. At a granular level, these locations are organized into regions and zones. A region is a specific geographical 🌎 location where you can host your resources. Each region has one or more zones (most regions have three or more zones).

All of the zones within a region have fast⚡️ network connectivity among them. A zone is like a single failure domain within a region. A best practice in building a fault‑tolerant application is to deploy resources across multiple zones in a given region to protect against unexpected failures.

Next, we had a summary of Google’s multi-layered approach to security🔒.

Highlights:

  • Server boards and the networking equipment in Google data centers are custom‑designed by Google.
  • Google also designs custom chips, including a hardware security🔒 chip (Titan) deployed on both servers and peripherals.
  • Google Server machines use cryptographic signatures✍️ to make sure they are booting the correct software.
  • Google designs and builds its own data centers (eco friendly), which incorporate multiple layers of physical security🔒 protections. (Access to these data centers is limited to only a few authorized Google Employees)
  • Google’s infrastructure provides cryptographic🔐 privacy🤫 and integrity for remote procedure call (RPC) data on the network, which is how Google services communicate with each other.
  • Google has multitier, multilayer denial‑of‑service protections that further reduce the risk of any denial‑of‑service 🛡 impact.

Rounding out the introduction was a sneak peek 👀 into Budgets and Billing 💰. Google offers customer-friendly 😊 pricing with per‑second billing for its IaaS compute offering. Fine‑grained billing is a big cost saver for workloads that are bursty. GCP provides four tools 🛠 to help with billing:

  • Budgets and alerts 🔔
  • Billing export🧾
  • Reports 📊
  • Quotas

Budgets can be a fixed limit, or you can tie it to another metric, for example a percentage of the previous month’s spend.

Alerts 🔔 are generally set at 50%, 90%, and 100%, but are customizable

Billing export🧾 stores detailed billing information in places where it’s easy to retrieve for more detailed analysis.

Reports📊 is a visual tool in the GCP console that allows you to monitor your expenditure. GCP also implements quotas, which protect both account owners and the GCP community as a whole 😊.

Quotas are designed to prevent the overconsumption of resources, whether because of error or malicious attack 👿. There are two types of quotas, rate quotas and allocation quotas. Both get applied at the level of the GCP project.
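If you want to see where a project stands against its allocation quotas, a couple of gcloud commands can help. This is just a quick sketch; it assumes the Cloud SDK is installed and that my-project is a placeholder for your own project ID:

$ gcloud compute project-info describe --project my-project   # lists project-wide quotas and current usage
$ gcloud compute regions describe us-central1                  # lists per-region quotas (CPUs, disks, addresses, etc.)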

After a great intro, Catherine kick-starts🦵 us with GCP. She begins with a discussion around resource hierarchy 👑 and trust🤝 boundaries.

Projects are the main way you organize the resources (all resources belong to a project) you use in GCP. Projects are used to group together related resources, usually because they have a common business objective. A project consists of a set of users, a set of APIs, billing, quotas, authentication, and monitoring 🎛 settings for those APIs. Projects have 3 identifying attributes:

  1. Project ID (Globally 🌎 unique)
  2. Project Name
  3. Project Number (Globally 🌎 Unique)

Projects may be organized into folders 🗂. Folders🗂 can contain other folders 🗂. All the folders 🗂 and projects used by an organization can be put in organization nodes.

Please Note: If you use folders 🗂, you need to have an organization node at the top of the hierarchy👑.

Projects, folders🗂, and organization nodes are all places where the policies can be defined.

A policy is set on a resource. Each policy contains a set of roles and members👥.

  • A role is a collection of permissions. Permissions determine what operations are allowed on a resource. There are three primitive roles:
  1. Owner
  2. Editor
  3. Viewer

Another role made available in IAM is the billing administrator role, which allows someone to control the billing for a project without the right to change the resources in the project.

Please note: IAM provides finer‑grained types of roles for a project that contains sensitive data, where primitive roles are too generic.

A service account is a special type of Google account intended to represent a non-human user ⚙️ that needs to authenticate and be authorized to access data in Google APIs.

  • A member👥 can be a Google account, a service account, a Google group, or a Google Workspace or Cloud☁️ Identity domain that can access a resource.
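To make the service-account and role ideas concrete, here is a minimal gcloud sketch. The names (my-project, reporting-sa) and the chosen role are placeholders for illustration only:

$ gcloud iam service-accounts create reporting-sa --display-name "Reporting service account"
$ gcloud projects add-iam-policy-binding my-project \
    --member "serviceAccount:reporting-sa@my-project.iam.gserviceaccount.com" \
    --role "roles/storage.objectViewer"   # grant a finer-grained role rather than a primitive one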

Resources inherit policies from the parent.

Identity and Access Management (IAM) allows administrators to manage who (i.e., a Google account, a Google group, a service account, or an entire Workspace or Cloud Identity domain) can do what (role) on specific resources. There are four ways to interact with IAM and the other GCP management layers:

  1. Web‑based console 🕸
  2. SDK and Cloud shell (CLI)
  3. APIs
    • Cloud Client Libraries 📚
    • Google API Client Library 📚
  4. Mobile app 📱

When it comes to entitlements “The principle of least privilege” should be followed. This principle says that each user should have only those privileges needed to do their jobs. In a least privilege environment, people are protected from an entire class of errors.  GCP customers use IAM to implement least privilege, and it makes everybody happier 😊.

For example, you can designate an organization policy administrator so that only people with that privilege can change policies. You can also assign a project creator role, which controls who can spend money 💵.

Finally, we checked into Marketplace 🛒, which provides an easy way to launch common software packages in GCP. Many common web 🕸 frameworks, databases🛢, CMSs, and CRMs are supported. Some Marketplace 🛒 images charge usage fees, such as third-party images with commercially licensed software. But they all show estimates of their monthly charges before you launch them.

Please Note: GCP updates the base images for these software packages to fix critical issues 🪲 and vulnerabilities, but it doesn’t update the software after it’s been deployed. However, you’ll have access to the deployed systems so you can maintain them.

“Look at this stuff🤩 Isn’t it neat? Wouldn’t you think my collection’s complete 🤷‍♂️?

Now with the basics of GCP covered, it was time 🕰 to explore 🧭 some of the computing architectures made available within GCP.

Google Compute Engine

Virtual Private Cloud (VPC) – manages networking functionality for your GCP resources. Unlike AWS (natively), a GCP VPC is global 🌎 in scope. VPCs can have subnets in any GCP region worldwide 🌎, and subnets can span the zones that make up a region.

  • Provides flexibility🧘‍♀️ to scale️ and control how workloads connect regionally and globally🌎
  • Access VPCs without needing to replicate connectivity or administrative policies in each region
  • Bring your own IP addresses to Google’s network infrastructure across all regions

Much like physical networks, VPCs have routing tables👮‍♂️and Firewall🔥 Rules which are built in.

  • Routing tables👮‍♂️ forward traffic🚦 from one instance to another instance
  • Firewall🔥 rules allow you to restrict access to instances, for both incoming (ingress) and outgoing (egress) traffic🚦.
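As a rough sketch of how these pieces come together, the following gcloud commands create a custom-mode VPC, a subnet in one region, and a simple ingress firewall rule (all names and CIDR ranges here are made up for illustration):

$ gcloud compute networks create demo-vpc --subnet-mode custom
$ gcloud compute networks subnets create demo-subnet --network demo-vpc --region us-central1 --range 10.0.0.0/24
$ gcloud compute firewall-rules create demo-allow-ssh --network demo-vpc --allow tcp:22 --direction INGRESS --source-ranges 0.0.0.0/0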

Cloud DNS is a managed, low-latency, highly available DNS service running on the same infrastructure as Google.

Cloud VPN securely connects a peer network to a Virtual Private Cloud (VPC) network through an IPsec VPN connection.

Cloud Router lets other networks and Google VPC exchange route information over the VPN using the Border Gateway Protocol.

VPC Network Peering enables you to connect VPC networks so that workloads in different VPC networks can communicate internally. Traffic🚦stays within Google’s network and doesn’t traverse the public internet.

  • Private connection between you and Google for your hybrid cloud☁️
  • Connection through the largest partner network of service providers

Dedicated Interconnect allows direct private connections, providing the highest uptime (99.99% SLA) for your interconnection with GCP.

Google Compute Engine (IaaS) delivers Linux or Windows virtual machines (VMs) running in Google’s innovative data centers and worldwide fiber network. Compute Engine offers scale ⚖️, performance, and value that lets you easily launch large compute clusters on Google’s infrastructure. There are no upfront investments, and you can run thousands of virtual CPUs on a system that offers quick, consistent performance. VMs can be created via Web 🕸 console or the gcloud command line tool🔧.

For Compute Engine VMs there are two kinds of persistent storage🗄 options:

  • Standard
  • SSD

If your application needs high‑performance scratch disk, you can attach a local SSD. ⚠️ Be sure to store data of permanent value somewhere else, because a local SSD’s contents don’t persist past when the VM terminates.
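For durable data, a persistent disk is the safer choice. A minimal sketch of creating an SSD persistent disk and attaching it to an existing VM (the disk and VM names are hypothetical):

$ gcloud compute disks create my-data-disk --zone us-central1-c --size 200GB --type pd-ssd
$ gcloud compute instances attach-disk my-vm --disk my-data-disk --zone us-central1-c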

Compute Engine offers innovative pricing:

  • Per second billing
  • Preemptible instances
  • High throughput to storage🗄 at no additional cost
  • Only pay for hardware you need.

At the time of this post, N2D standard and high-CPU machine types have up to 224 vCPUs and 128 GB of memory, which seems like plenty of horsepower 🐎💥, but GCP keeps upping 🃏🃏 the ante 💶 on maximum instance types, vCPUs, memory, and persistent disk. 😃

Sample Syntax creating a VM:

$ gcloud compute zones list | grep us-central1

$ gcloud config set compute/zone us-central1-c
$ gcloud compute instances create "my-vm-2" --machine-type "n1-standard-1" --image-project "debian-cloud" --image "debian-9-stretch-v20170918" --subnet "default"

Compute Engine also offers autoscaling ⚖️, which adds and removes VMs from applications based on load metrics. In addition, Compute Engine VPCs offer load balancing 🏋️‍♀️ across VMs. VPC supports several different kinds of load balancing 🏋️‍♀️:

  • Layer 7 load balancing 🏋️‍♀️ based on load
  • Layer 4 load balancing 🏋️‍♀️ of non-HTTP SSL traffic based on load
  • Layer 4 load balancing 🏋️‍♀️ of non-SSL TCP traffic
  • Any traffic🚦 (TCP, UDP)
  • Traffic🚦 inside a VPC

Cloud CDN – accelerates💥 content delivery 🚚 in your application, allowing users to experience lower network latency. The origins of your content will experience reduced load and cost savings. Once you’ve set up HTTPS load balancing 🏋️‍♀️, simply enable Cloud CDN with a single checkbox.
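If you prefer the command line over the checkbox, enabling Cloud CDN on an existing global backend service looks roughly like this (the backend service name is a placeholder):

$ gcloud compute backend-services update my-backend-service --global --enable-cdn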

Next on our plate 🍽 was to investigate the storage🗄 options that are available in GCP

Cloud Storage 🗄 is a fully managed, highly durable, highly available, scalable ⚖️ service. Cloud Storage 🗄 can be used for lots of use cases like serving website content, storing data for archival and disaster recovery, or distributing large data objects.

Cloud Storage🗄 offers 4 different types of storage 🗄 classes:

  • Regional
  • Multi‑regional
  • Nearline 😰
  • Coldline 🥶

Cloud Storage🗄 is comprised of buckets 🗑, which you create, configure, and use to hold storage🗄 objects.

Buckets 🗑:

  • Have globally 🌎 unique names
  • Come in different storage🗄 classes
  • Can be regional or multi-regional
  • Can have versioning enabled (objects are immutable)
  • Support lifecycle 🔄 management rules

Cloud Storage🗄 supports several ways to bring data into Cloud Storage🗄 (a quick gsutil sketch follows the list):

  • Use gsutil Cloud SDK.
  • Drag‑and‑drop in the GCP console (with Google Chrome browser).
  • Integrated with many of the GCP products and services:
    • Import and export tables from and to BigQuery and Cloud SQL
    • Store app engine logs
    • Cloud data store backups
    • Objects used by app engine
    • Compute Engine Images
  • Online storage🗄 transfer service (>TB) (HTTPS endpoint)
  • Offline transfer appliance (>PB) (rack-able, high capacity storage🗄 server that you lease from Google)
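Here is the promised gsutil sketch for the most common path: creating a bucket and copying an object into it. The bucket name and file are placeholders, and bucket names must be globally unique:

$ gsutil mb -l us-central1 -c standard gs://my-example-bucket/   # make a bucket in a region with the Standard class
$ gsutil cp ./backup.tar.gz gs://my-example-bucket/archive/      # upload an object
$ gsutil ls -l gs://my-example-bucket/archive/                   # confirm it landed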

“Big wheels 𐃏 keep on turning”

Cloud Bigtable is a fully managed, scalable⚖️ NoSQL database🛢 service for large analytical and operational workloads. The databases🛢 in Bigtable are sparsely populated tables that can scale to billions of rows and thousands of columns, allowing you to store petabytes of data. Data encryption in flight and at rest is automatic.

GCP fully manages the service, so you don’t have to configure and tune it. It’s ideal for data that has a single lookup key🔑 and for storing large amounts of data with very low latency.

Cloud Bigtable is offered through the same open source API as HBase, which is the native database🛢 for the Apache Hadoop 🐘 project.

Cloud SQL is a fully managed relational database🛢 service for MySQL, PostgreSQL, and MS SQL Server which provides:

  • Automatic replication
  • Managed backups
  • Vertical scaling ⚖️ (Read and Write)
  • Horizontal Scaling ⚖️
  • Google integrated Security 🔒

Cloud Spanner is a fully managed relational database🛢 with unlimited scale⚖️ (horizontal), strong consistency & up to 99.999% high availability.

It offers transactional consistency at a global🌎 scale ⚖️, schemas, SQL, and automatic synchronous replication for high availability, and it can provide petabytes of capacity.

Cloud Datastore is a highly scalable ⚖️ (Horizontal) NoSQL database🛢 for your web 🕸 and mobile 📱 applications.

  • Designed for application backends
  • Supports transactions
  • Includes a free daily quota

Comparing Storage🗄 Options

Cloud Datastore is the best for semi‑structured application data that is used in App Engine applications.

Bigtable is best for analytical data with heavy read/write events like AdTech, Financial 🏦, or IoT📲 data.

Cloud Storage🗄 is best for structured and unstructured binary or object data, like images🖼, large media files🎞, and backups.

Cloud SQL is best for web 🕸 frameworks and existing applications, like storing user credentials and customer orders.

Cloud Spanner is best for large‑scale⚖️ database🛢 applications that are larger than 2 TB, for example, for financial trading and e‑commerce use cases.

“Everybody, listen to me… And return me my ship⛵️… I’m your captain👩🏾️, I’m your captain👩🏾‍✈️”

Containers, Kubernetes ☸️, and Kubernetes Engine

Containers provide the independent, scalable ⚖️ workloads that you would get in a PaaS environment, and an abstraction layer of the operating system and hardware, like you get in an IaaS environment. Containers virtualize the operating system rather than the hardware. The environment scales⚖️ like PaaS but gives you nearly the same flexibility as Infrastructure as a Service.

Kubernetes ☸️ is an open source orchestrator for containers. K8s ☸️ makes it easy to orchestrate many containers on many hosts, scale ⚖️ them, roll out new versions of them, and even roll back to the old version if things go wrong 🙁. K8s ☸️ lets you deploy containers on a set of nodes called a cluster.

A cluster is a set of master components that control the system as a whole, and a set of nodes that run containers.

When K8s ☸️ deploys a container or a set of related containers, it does so inside an abstraction called a pod.

A pod is the smallest deployable unit in Kubernetes.

Kubectl starts a deployment with a container running in a pod. A deployment represents a group of replicas of the same pod. It keeps your pods running 🏃‍♂️, even if a node on which some of them run fails.

Google Kubernetes Engine (GKE) ☸️ is a secured and managed Kubernetes ☸️ service with four-way autoscaling ⚖️ and multi-cluster support.

  • Leverage a high-availability control plane ✈️including multi-zonal and regional clusters
  • Eliminate operational overhead with auto-repair 🧰, auto-upgrade, and release channels
  • Secure🔐 by default, including vulnerability scanning of container images and data encryption
  • Integrated Cloud Monitoring 🎛 with infrastructure, application, and Kubernetes-specific ☸️ views

GKE is like an IaaS offering in that it saves you infrastructure chores and it’s like a PaaS offering in that it was built with the needs of developers 👩‍💻 in mind.

Sample Syntax building a K8 cluster:

gcloud container clusters create k1

In GKE to make the pods in your deployment publicly available, you can connect a load balancer🏋️‍♀️ to it by running the kubectl expose command. K8s ☸️ then creates a service with a fixed IP address for your pods.

A service is the fundamental way K8s ☸️ represents load balancing 🏋️‍♀️. K8s ☸️ attaches an external load balancer🏋️‍♀️ with a public IP address to your service so that others outside the cluster can access it.

In GKE, this kind of load balancer🏋️‍♀️ is created as a network load balancer🏋️‍♀️. This is one of the managed load balancing 🏋️‍♀️ services that Compute Engine makes available to virtual machines. GKE makes it easy to use it with containers.

A service groups a set of pods together and provides a stable endpoint for them.

Imperative commands

kubectl get services – shows you your service’s public IP address

kubectl scale – scales ⚖️ a deployment

kubectl expose – creates a service

kubectl get pods – watches the pods come online
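Putting those imperative commands together, a quick sketch (the deployment name my-app and the ports are made up):

$ kubectl expose deployment my-app --type=LoadBalancer --port=80 --target-port=8080   # creates a service with an external load balancer
$ kubectl get services                                                                # shows the service's public IP address
$ kubectl scale deployment my-app --replicas=5                                        # scales the deployment to 5 pods
$ kubectl get pods --watch                                                            # watch the new pods come online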

The real strength 💪 of K8s ☸️ comes when you work in a declarative way. Instead of issuing commands, you provide a configuration file (YAML) that tells K8s ☸️ what you want your desired state to look like, and Kubernetes ☸️ figures out how to do it.

When you choose a rolling update for a deployment and then give it a new version of the software it manages, Kubernetes will create pods of the new version one by one, waiting for each new version pod to become available before destroying one of the old version pods. Rolling updates are a quick way to push out a new version of your application while still sparing your users from experiencing downtime.
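In the declarative workflow, you keep that YAML file in version control and hand it to the cluster; kubectl’s rollout commands then let you watch or undo the rolling update. A minimal sketch, assuming a hypothetical manifest named deployment.yaml that defines a deployment called my-app:

$ kubectl apply -f deployment.yaml            # declare the desired state; K8s works out the changes
$ kubectl rollout status deployment/my-app    # watch the rolling update progress
$ kubectl rollout undo deployment/my-app      # roll back to the previous version if things go wrong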

“Going where the wind 🌬 goes… Blooming like a red rose🌹

Introduction to Hybrid and Multi-Cloud Computing (Anthos)

Modern hybrid or multi‑cloud☁️ architectures allow you to keep parts of your system’s infrastructure on‑premises while moving other parts to the cloud☁️, creating an environment that is uniquely suited to many companies’ needs.

Modern distributed systems allow a more agile approach to managing your compute resources

  • Move only some of your compute workloads to the cloud ☁️
  • Move at your own pace
  • Take advantage of cloud’s☁️ scalability️ and lower costs 💰
  • Add specialized services to compute resources stack

Anthos is Google’s modern solution for hybrid and multi-cloud☁️ systems and services management.

The Anthos framework rests on K8s ☸️ and GKE deployed on‑prem, which provides the foundation for an architecture that is fully integrated with centralized management through a central control plane that supports policy‑based application life cycle🔄 delivery across hybrid and multi‑cloud☁️ environments.

Anthos also provides a rich set of tools🛠 for monitoring 🎛 and maintaining the consistency of your applications across all of your network, whether on‑premises, in the cloud☁️ K8s ☸️, or in multiple clouds☁️☁️.

Anthos Configuration Management provides a single source of truth for your cluster’s configuration. That source of truth is kept in the policy repository, which is actually a Git repository.

“And I discovered🕵️‍♀️ that my castles 🏰 stand…Upon pillars of salt🧂 and pillars of sand 🏖

App Engine (PaaS) lets you build a highly scalable ⚖️ application on a fully managed serverless platform.

App Engine makes deployment, maintenance, and autoscaling ⚖️ of workloads easy, allowing developers 👨‍💻 to focus on innovation.

GCP provides an App Engine SDK in several languages so developers 👩‍💻 can test applications locally before uploading them to the real App Engine service.
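Deploying is equally simple once the app and its app.yaml are ready; a hedged sketch of the usual two commands (the project ID is a placeholder):

$ gcloud app deploy app.yaml --project my-project   # upload and deploy the application
$ gcloud app browse                                 # open the deployed app in a browser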

App Engine’s standard environment provides runtimes for specific versions of Java☕️, Python🐍, PHP, and Go. The standard environment also enforces restrictions🚫 on your code by making it run in a so‑called sandbox. That’s a software construct that’s independent of the hardware, operating system, or physical location of the server it runs🏃‍♂️ on.

If these constraints don’t work for a given application, that would be a reason to choose the flexible environment.

App Engine flexible environment:

  • Builds and deploys containerized apps with a click
  • No sandbox constraints
  • Can access App Engine resources

App Engine flexible environment apps use standard runtimes and can access App Engine services such as:

  • Datastore
  • Memcache
  • Task Queues

Cloud Endpoints – Develop, deploy, and manage APIs on any Google Cloud☁️ back end.

Cloud Endpoints helps you create and maintain APIs

  • Distributed API management through an API console
  • Expose your API using a RESTful interface

Apigee Edge is also a platform for developing and managing API proxies.

Apigee Edge focuses on business problems like rate limiting, quotas, and analytics.

  • A platform for making APIs available to your customers and partners
  • Contains analytics, monetization, and a developer portal

Developing in the Cloud ☁️

Cloud Source Repositories – Fully featured Git repositories hosted on GCP

Cloud Functions – Scalable ⚖️ pay-as-you-go functions as a service (FaaS) to run your code with zero server management.

  • No servers to provision, manage, or upgrade
  • Automatically scale⚖️ based on the load
  • Integrated monitoring 🎛, logging, and debugging capability
  • Built-in security🔒 at role and per function level based on the principle of least privilege
  • Key🔑 networking capabilities for hybrid and multi-cloud☁️☁️ scenarios
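As a rough sketch of what deploying a function looks like from the CLI (the function name, runtime, and entry point below are only illustrative):

$ gcloud functions deploy hello-http --runtime python37 --trigger-http --allow-unauthenticated --entry-point hello
$ gcloud functions call hello-http   # invoke it once to test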

Deployment: Infrastructure as code

Deployment Manager – creates and manages cloud☁️ resources with simple templates

  • Provides repeatable deployments
  • Create a .yaml template describing your environment and use Deployment Manager to create resources
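Once you have that .yaml template, launching and tearing down the environment is a couple of commands; a minimal sketch with a hypothetical config file and deployment name:

$ gcloud deployment-manager deployments create my-deployment --config config.yaml
$ gcloud deployment-manager deployments describe my-deployment   # inspect the resources it created
$ gcloud deployment-manager deployments delete my-deployment     # tear everything down again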

“Follow my lead, oh, how I need… Someone to watch over me”

Monitoring 🎛: Proactive instrumentation

Stackdriver is GCP’s tool for monitoring 🎛, logging, and diagnostics. Stackdriver provides access to many different kinds of signals from your infrastructure platforms, virtual machines, containers, middleware, and application tier: logs, metrics, and traces. It provides insight into your application’s health ⛑, performance, and availability, so if issues occur, you can fix them faster.

Here are the core components of Stackdriver:

  • Monitoring 🎛
  • Logging
  • Trace
  • Error Reporting
  • Debugging
  • Profiler

Stackdriver Monitoring 🎛 checks the end points of web 🕸 applications and other Internet‑accessible services running on your cloud☁️ environment.

Stackdriver Logging lets you view logs from your applications and filter and search on them.

Stackdriver Error Reporting tracks and groups the errors in your cloud☁️ applications, and it notifies you when new errors are detected.

Stackdriver Trace samples the latency of App Engine applications and reports per-URL statistics.

Stackdriver Debugger connects your application’s production data to your source code so you can inspect the state of your application at any code location in production.

“Whoa oh oh oh oh… Something big I feel it happening”

GCP Big Data Platform – services are fully managed and scalable ⚖️ and Serverless

Cloud Dataproc is a fast, easy, managed way to run🏃‍♂️ Hadoop 🐘 MapReduce 🗺, Spark 🔥, Pig 🐷, and Hive 🐝 as a service (a quick gcloud sketch follows the list):

  • Create clusters in 90 seconds or less on average
  • Scale⚖️ cluster up and down even when jobs are running 🏃‍♂️
  • Easily migrate on-premises Hadoop 🐘 jobs to the cloud☁️
  • Uses Spark🔥 Machine Learning Libraries📚 (MLlib) to run classification algorithms
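Here is the promised gcloud sketch: spinning up a small cluster, submitting the canned SparkPi example, and deleting the cluster when done. The cluster name, region, and sizing are placeholders, and the example jar path is the one Google’s quickstarts typically reference:

$ gcloud dataproc clusters create demo-cluster --region us-central1 --num-workers 2
$ gcloud dataproc jobs submit spark --cluster demo-cluster --region us-central1 \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
$ gcloud dataproc clusters delete demo-cluster --region us-central1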

Cloud Dataflow🚰 – Stream⛲️ and Batch processing; unified and simplified pipelines

  • Processes data using Compute Engine instances.
  • Clusters are sized for you
  • Automated scaling ⚖️, no instance provisioning required
  • Managed expressive data Pipelines
  • Write code once and get batch and streaming⛲️.
  • Transform-based programming model
  • ETL pipelines to move, filter, enrich, shape data
  • Data analysis: batch computation or continuous computation using streaming
  • Orchestration: create pipelines that coordinate services, including external services
  • Integrates with GCP services like Cloud Storage🗄, Cloud Pub/Sub, BigQuery and BigTable
  • Open source Java☕️ and Python 🐍 SDKs
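You can get a feel for Dataflow without writing any Beam code by running one of the Google-provided templates from the CLI. A hedged sketch, assuming the public Word_Count template and sample text that Google’s quickstarts use, plus a bucket of your own (my-example-bucket is a placeholder) for output:

$ gcloud dataflow jobs run my-wordcount \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region us-central1 \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://my-example-bucket/wordcount/out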

BigQuery🔎 is a fully‑managed, petabyte scale⚖️, low‑cost analytics data warehouse

  • Analytics database🛢; stream data 100,000 rows /sec
  • Provides near real-time interactive analysis of massive datasets (hundreds of TBs) using SQL syntax (SQL 2011)
  • Compute and storage 🗄 are separated with a terabit network in between
  • Only pay for storage 🗄 and processing used
  • Automatic discount for long-term data storage 🗄
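A quick way to feel the “just write SQL” promise is the bq command-line tool against one of the public datasets (the query below uses the public usa_names dataset, which should be queryable from any project):

$ bq query --use_legacy_sql=false 'SELECT name, SUM(number) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name ORDER BY total DESC LIMIT 10'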

Cloud Pub/Sub – Scalable ⚖️, flexible🧘‍♀️ and reliable enterprise messaging 📨

Pub in Pub/Sub is short for publishers

Sub is short for subscribers.

  • Supports many-to-many asynchronous messaging📨
  • Application components make push/pull subscriptions to topics
  • Includes support for offline consumers
  • Simple, reliable, scalable ⚖️ foundation for stream analytics
  • Building block🧱 for data ingestion in Dataflow, IoT📲, Marketing Analytics
  • Foundation for Dataflow streaming⛲️
  • Push notifications for cloud-based☁️ applications
  • Connect applications across GCP (push/pull between Compute Engine and App Engine)
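The publish/subscribe loop is easy to try end to end from the CLI; a minimal sketch with made-up topic and subscription names:

$ gcloud pubsub topics create demo-topic
$ gcloud pubsub subscriptions create demo-sub --topic demo-topic
$ gcloud pubsub topics publish demo-topic --message "hello pubsub"
$ gcloud pubsub subscriptions pull demo-sub --auto-ack --limit 1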

Cloud Datalab🧪 is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on GCP.

  • Interactive tool for large-scale⚖️ data exploration, transformation, analysis, and visualization
  • Integrated, open source
    • Built on Jupyter

“Domo arigato misuta Robotto” 🤖

Cloud Machine Learning Platform🤖

Cloud☁️ machine‑learning platform🤖 provides modern machine‑learning services🤖 with pre‑trained models and a platform to generate your own tailored models.

TensorFlow 🧮 is an open‑source software library 📚 that’s exceptionally well suited for machine‑learning applications🤖 like neural networks🧠.

TensorFlow 🧮 can also take advantage of Tensor 🧮 processing units (TPU), which are hardware devices designed to accelerate machine‑learning 🤖 workloads with TensorFlow 🧮. GCP makes them available in the cloud☁️ with Compute Engine virtual machines.

Generally, applications that use machine‑learning platform🤖 fall into two categories, depending on whether the data worked on is structured or unstructured.

For structured data, ML 🤖 can be used for various kinds of classification and regression tasks, like customer churn analysis, product diagnostics, and forecasting. It can also be used for the detection of anomalies, like fraud detection, sensor diagnostics, or log metrics.

For unstructured data, ML 🤖 can be used for image analytics, such as identifying damaged shipment, identifying styles, and flagging🚩content. In addition, ML🤖 can be used for text analytics like a call 📞 center log analysis, language identification, topic classifications, and sentiment analysis.

Cloud Vision API 👓 derives insights from your images in the cloud☁️ or at the edge with AutoML Vision👓, or uses pre-trained Vision API👓 models to detect emotion, understand text, and more.

  • Analyze images with a simple REST API
  • Logo detection, label detection
  • Gain insights from images
  • Detect inappropriate content
  • Analyze sentiment
  • Extract text
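The pre-trained Vision models are also exposed through gcloud, which makes a quick experiment easy; a sketch with hypothetical local image files:

$ gcloud ml vision detect-labels ./photo.jpg    # returns labels with confidence scores as JSON
$ gcloud ml vision detect-text ./receipt.jpg    # OCR: extracts text found in the image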

Cloud Natural Language API 🗣extracts information about people, places, events, (and more) mentioned in text documents, news articles, or blog posts

  • Uses machine learning🤖 models to reveal structure and meaning of text
  • Extract information about items mentioned in text documents, news articles, and blog posts

Cloud Speech API 💬 enables developers 👩‍💻 to convert audio to text.

  • Transcribe your content in real time or from stored files
  • Deliver a better user experience in products through voice 🎤 commands
  • Gain insights from customer interactions to improve your service

Cloud Translation API🈴 provides a simple programmatic interface for translating an arbitrary string into a supported language.

  • Translate arbitrary strings between thousands of language pairs
  • Programmatically detect a document’s language
  • Support for dozens of languages

Cloud Video Intelligence API📹 enables powerful content discovery and engaging video experiences.

  • Annotate the contents of videos
  • Detect scene changes
  • Flag inappropriate content
  • Support for a variety of video formats

“Fly away, high away, bye bye…” 🦋

We will continue next week with Part II of this series….

Thanks –

–MCS

Week of July 24th

“And you may ask yourself 🤔, well, how did I get here?” 😲

Happy Opening⚾️ Day!

Last week, we hit a milestone of sorts with our 20th submission🎖 since we started our journey way back in March.😊 To commemorate the occasion, we made a return to AWS by gleefully 😊 exploring their data ecosystem. Of course, trying to cover all the data services that are made available in AWS in such a short duration 🕰 would be a daunting task.

So last week, we concentrated our travels on three of their main offerings in the Relational Database, NoSQL, and Data Warehouse realms: RDS🛢, DynamoDB🧨, and Redshift🎆. We felt rather content 🤗 and enlightened💡 with AWS’s Relational Database and Data Warehouse offerings, but we were still feeling a little less satisfied 🤔 with NoSQL, as we had really just scratched the surface of what AWS had to offer.

To recap, we had explored 🔦 DynamoDB🧨, AWS’s multi-model NoSQL service, which offers support for key-value🔑 and their proprietary document📜 database. But we were still curious to learn more about a document📜 database in AWS that offers MongoDB🍃 support, as well as an answer to the hottest🔥 trend in “Not Just SQL” solutions, the Graph📈 database.

Well, of course, being the Cloud☁️ provider that offers every cloud☁️-native service from A to Z, AWS delivered with many great options. So we began our voyage heading straight over to DocumentDB📜, AWS’s fully managed database service with MongoDB🍃 compatibility. As with all AWS services, DocumentDB📜 was designed from the ground up to give the most optimal performance, scalability⚖️, and availability. DocumentDB📜, like the Cosmos DB🪐 MongoDB🍃 API, makes it easy to set up, operate, and scale MongoDB-compatible databases. In other words, no code changes are required, and all the same drivers can be utilized by existing legacy MongoDB🍃 applications.

In addition, DocumentDB📜 solves the friction and complications that arise when an application tries to map JSON to a relational model. DocumentDB📜 solves this problem by making JSON documents a first-class object of the database. Data is stored in the form of documents📜. These documents📜 are stored in collections. Each document📜 can have a unique combination and nesting of fields or key-value🔑 pairs. Thus, querying the database is faster⚡️, indexing is more flexible, and repetition is easier.

Similar to other AWS data offerings, the core unit that makes up DocumentDB📜 is the cluster. A cluster consists of one or more instances and a cluster storage volume that manages the data for those instances. All writes📝 are done through the primary instance. All instances (primary and replicas) support read 📖 operations. The cluster storage volume stores six copies of your data across three different Availability Zones. AWS easily allows you to create or modify clusters. When you modify a cluster, AWS is really just spinning up a new cluster behind the curtains and then migrating the data, taking what is otherwise a complex task and making it seamless.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place DocumentDB📜 in. You can leverage an existing one or you can create a dedicated one just for DocumentDB📜. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your document📜 databases. As for credentials🔐 and entitlements in DocumentDB📜, they are managed through AWS Identity and Access Management (IAM). By default, a DocumentDB📜 cluster accepts secure connections using Transport Layer Security (TLS), so all traffic in transit is encrypted, and Amazon DocumentDB📜 uses the 256-bit Advanced Encryption Standard (AES-256) to encrypt your data or allows you to encrypt your clusters using keys🔑 you manage through AWS Key🔑 Management Service (AWS KMS), so data at rest is always encrypted.

“Such wonderful things surround you…What more is you lookin’ for?”

Lately, we have been really digging Graph📈 Databases. We had our first visit with Graph📈 Databases when we were exposed to the Graph📈 API through Cosmos DB🪐 earlier this month and then furthered our curiosity through Neo4J. Well, now armed with a plethora of knowledge in the Graph📈 Database space we wanted to see what AWS had to offer and once again they did not disappoint.😊

First, let me start by writing that it’s a little difficult to compare AWS Neptune🔱 to Neo4J, although Nous Populi from Leapgraph does an admirable job. Obviously, both are graph📈 databases, but architecturally there are some major differences in their graph storage models and query languages. Neo4J uses Cypher, and Neptune🔱 uses Apache TinkerPop’s Gremlin👹 (same as Cosmos DB🪐) as well as SPARQL. Where Neptune🔱 really shines☀️ is that it’s not just another graph database but a great service offering within the AWS portfolio. So, it leverages all the great bells🔔 and whistles like fast⚡️ performance, scalability⚖️, high availability, and durability, as well as being the kind of fully managed service we have become accustomed to, handling hardware provisioning, software patching, backup, recovery, failure detection, and repair. Neptune🔱 is optimized for storing billions of relationships and querying the graph with millisecond latency.

Neptune🔱 uses database instances. The primary database instance supports both read📖 and write📝 operations and performs all the data modifications to the cluster. Neptune🔱 also uses replicas, which connect to the same cloud-native☁️ storage service as the primary database instance but only support read-only operations. There can be up to 15 of these replicas across multiple AZs. In addition, Neptune🔱 supports encryption at rest.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place Neptune🔱 in. You can leverage an existing one or you can create a dedicated one just for Neptune🔱. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your Neptune🔱. As for credentials🔐 and entitlements, Neptune🔱 is managed through AWS Identity and Access Management (IAM). Your data at rest in Neptune🔱 is encrypted using the industry-standard AES-256-bit encryption algorithm on the server that hosts your Neptune🔱 instance. Keys🔑 can also be used, which are managed through AWS Key🔑 Management Service (AWS KMS).

“Life moves pretty fast⚡️. If you don’t stop 🛑 and look 👀 around once in a while, you could miss it.”

So now feeling pretty good 😊 about NoSQL on AWS, where do we go now? 

Well, as mentioned, we have been learning so much over the last 5 months that it could be very easy to forget some things, especially with limited storage capacity. So we decided to take a pause for the rest of the week and go back and review all that we have learned by re-reading all our previous blog posts, as well as engaging in some Google Data Engineering solution Quests🛡 to help reinforce our previous learnings.

Currently, the fine folks at qwiklabs.com are offering anyone who wants to learn Google Cloud ☁️ skills an opportunity for unlimited access for 30 days. So, with an offer too good to be true, as well as an opportunity to add some flair to our LinkedIn profile (and who doesn’t like flair?), we dove right in head first!

“Where do we go? Oh, where do we go now? Now, now, now, now, now, now, now”

Below are some topics I am considering for my travels next week:

  • OKTA SSO
  • Neo4J and Cypher 
  • More with Google Cloud Path
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Thanks

—MCS

Week of July 17th

“Any time of year… You can find it here”

Happy World🌎 Emoji 😊😜 Day! 

The last few weeks we have been enjoying our time in Microsoft’s Cloud☁️ Data Ecosystem, and it was just last month that we found ourselves hanging out with the GCP☁️ gang and their awesome data offerings. All seemed well and good😊, except that we had been missing out on an excursion to the one cloud☁️ provider where it all began, literally and figuratively.

Back when we first began our journey on a cold 🥶 and rainy☔️ day in March just as Covid-19🦠 occupied Wall Street 🏦 and the rest of the planet 🌎 we started at the one place that disrupted how infrastructure and operations would be implemented and supported going forward.

That’s right: Amazon Web Services, or as it is more endearingly known to humanity, AWS. AWS was launched back in 2006 by its parent company that offers everything from A to Z.

AWS, like its parent company, has a similar mantra in the cloud ☁️ computing world, as they offer 200+ cloud☁️ services. So how the heck have so many months passed without us going back? The question is quite perplexing. But like they say, “all clouds☁️☁️ lead to AWS.” So, here we are, back in the AWS groove 🎶 and eager 😆 to explore 🔦 the wondrous world🌎 of AWS data solutions. Guiding us through this vast journey would be Richard Seroter (who ironically recently joined the team at Google). In 2015, Richard authored an amazing Pluralsight course covering Amazon RDS🛢, Amazon DynamoDB 🧨, and Amazon Redshift 🎆. It was like getting 3 courses for the price of 1! 😊

Although the course is several years old, for the most part it has stood the test of time ⏰ by providing a strong foundational knowledge of Amazon’s relational, NoSQL, and data warehousing solutions. But unfortunately, technology years are kind of like dog🐕 years. So obviously, there have been many innovations to all three of these incredible solutions, including UI enhancements, architectural improvements, and additional features, making these great AWS offerings even more awesome!

So, for a grand finale to our marvelous week of learning, helping us fill in the gaps on some of these major enhancements as well as offering some additional insights was the team from AWS Training and Certification, which includes the talented fashionista Michelle Metzger, the always effervescent and insightful Blaine Sundrud, and, on demos, the man with a quirky naming convention for database objects, the always witty Stephen Cole.

Back in our Amazon Web Services Databases in Depth course, and in an effort to make our journey that much more captivating, Richard provided us with a nifty mobile sports 🏀 ⚾️ 🏈 app solution written in Node.js, which leverages the Amazon data offerings covered in the course as components of an end-to-end solution. As the solution was written several years back, it did require some updating of deprecated libraries📚 and some code changes in order to make it work, which made our learning that much more fulfilling. So, after a great introduction from Richard, where he compares and contrasts RDS🛢, DynamoDB🧨, and Redshift🎆, we began our journey with Amazon’s Relational Database Service (RDS🛢). RDS🛢 is a database as a service (DBaaS) that makes provisioning, operating, and scaling⚖️ either up or out seamless. In addition, RDS🛢 makes other time-consuming administrative tasks, such as patching and backups, a thing of the past. Amazon RDS🛢 provides high availability and durability through the use of Multi-AZ deployments. In other words, AWS creates multiple instances of the databases in different Availability Zones, making recovery from infrastructure failure automatic and almost transparent to the application. Of course, as with all AWS offerings, there is always a heavy emphasis on security🔐, which is certainly reassuring when you are putting your mission-critical data in their hands 🤲, but it can also be a bit challenging at first to get things up and running when you are simply trying to connect from your home computer 💻 back to the AWS infrastructure.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to put your RDS🛢 instance(s) in. You can leverage an existing one or you can create a dedicated one for RDS🛢 instance(s).

It is required that your VPC have at least two subnets in order to support the Availability Zones for high availability. If direct internet access is needed, you will need to add an internet gateway to your VPC.

Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to the RDS🛢. RDS🛢 leverages three types of security groups (database, VPC, and EC2). As for credentials🔐 and entitlements in RDS🛢, they are managed through AWS Identity and Access Management (IAM). At the time of the release of Richard’s course, Amazon Aurora was new to the game and was not covered in depth, and only MySQL, Postgres, Oracle, MS SQL Server, and the aforementioned Aurora were supported. AWS has since added support for MariaDB to their relational database service portfolio.

Fortunately, our friends from the AWS Training and Certification group gave us the insights that we would require on Amazon’s innovation behind their relational database built for the cloud☁️ better known as Aurora.

So, with six familiar database engines (licensing costs apply) to choose from, you have quite a few options. Another key🔑 decision is to determine the resources you want your database to have. RDS🛢 offers multiple options optimized for memory, performance, or I/O.
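To make the provisioning story concrete, here is a rough aws CLI sketch of creating a small MySQL instance with Multi-AZ enabled. The identifier, size, and credentials are placeholders only:

$ aws rds create-db-instance \
    --db-instance-identifier demo-mysql \
    --engine mysql \
    --db-instance-class db.t3.micro \
    --allocated-storage 20 \
    --master-username admin \
    --master-user-password 'ChangeMe123!' \
    --multi-az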

I would be remiss if we didn’t briefly touch on Amazon’s Aurora. As mentioned, it’s one of Amazon’s six database options with RDS🛢. Aurora is fully managed by RDS🛢, so it leverages the same great infrastructure and has all the same bells 🔔 and whistles. Aurora comes in two different flavors🍦: MySQL and PostgreSQL. Although I didn’t benchmark Aurora performance in my evaluation, AWS claims that Aurora is 5x faster than standard MySQL databases. However, it’s probably more like 3x faster. But the bottom line is that it is faster and more cost-effective for MySQL or PostgreSQL databases that require optimal performance, availability, and reliability. The secret sauce behind Aurora is that it automatically maintains 6 copies of your data (which can be increased up to 15 replicas) spanned across 3 AZs, making data highly available and ultimately providing laser⚡️-fast performance for your database instances.

Please note: There is an option that allows a single Aurora database to span multiple AWS Regions 🌎 for an additional cost

In addition, Aurora uses an innovative and significantly faster log structured distributed storage layer than other RDS🛢offerings.

“Welcome my son, welcome to the machine🤖

Next on our plate 🍽 was to take a deep dive into Amazon’s fast and flexible NoSQL database service, a.k.a. DynamoDB🧨. DynamoDB🧨, like Cosmos DB🪐, is a multi-model NoSQL solution.

DynamoDB🧨 combines the best of those two ACID-compliant non-relational databases in a key-value🔑 and document database. It is a proprietary engine, so you can’t just take your MongoDB🍃 database and convert it to DynamoDB🧨. But don’t worry: if you are looking to move your MongoDB🍃 workloads to Amazon, AWS offers Amazon DocumentDB (with MongoDB compatibility), but that’s for a later discussion 😉

As for DynamoDB🧨, it delivers blazing⚡️ single-digit-millisecond guaranteed performance at any scale⚖️. It’s a fully managed, multi-Region, multi-master database with built-in security🔐, backup and restore options, and in-memory caching for internet-scale applications. DynamoDB🧨 automatically scales⚖️ up and down to adjust for capacity and maintain the performance of your systems. Availability and fault tolerance are built in, eliminating the need to architect your applications for these capabilities. An important concept to grasp while working with DynamoDB🧨 is that the databases are comprised of tables, items, and attributes. Again, there have been some major architectural design changes to DynamoDB🧨 since Richard’s course was released. Not to go into too many details, as it’s kind of irrelevant, but at the time⏰ the course was released, DynamoDB🧨 offered the option to use either a Hash Primary Key🔑 or a Hash and Range Primary Key🔑 to organize or partition data, and of course, as you would imagine, choosing the right combination was rather confusing. Intuitively, AWS scrapped this architectural design, and the good folks at the AWS Training and Certification group were so kind as to offer clarity here as well 😊

Today, DynamoDB🧨 uses partition keys🔑 to find each item in the database, similar to Cosmos DB🪐. Data is distributed on physical storage nodes. DynamoDB🧨 uses the partition key🔑 to determine which of the nodes the item is located on. It’s very important to choose the right partition key 🔑 to avoid the dreaded hot 🔥 partitions. Again, “as a rule of thumb 👍, an ideal partition key🔑 should have a wide range of values, so your data is evenly spread across logical partitions.” Also, in DynamoDB🧨, items can have an optional sort key🔑 to store related attributes in a sorted order.

One major difference from Cosmos DB🪐 is that DynamoDB🧨 utilizes a primary key🔑 on each table. If there is no sort key🔑, the primary and partition keys🔑 are the same. If there is a sort key🔑, the primary key🔑 is a combination of the partition and sort key 🔑, which is called a composite primary key🔑.
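A short aws CLI sketch of a table with a composite primary key (partition key Artist plus sort key SongTitle, borrowed from the style of AWS’s own examples; the table name is arbitrary):

$ aws dynamodb create-table \
    --table-name Music \
    --attribute-definitions AttributeName=Artist,AttributeType=S AttributeName=SongTitle,AttributeType=S \
    --key-schema AttributeName=Artist,KeyType=HASH AttributeName=SongTitle,KeyType=RANGE \
    --billing-mode PAY_PER_REQUEST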

DynamoDB🧨 allows for secondary indexes for faster searches. It supports two types of indexes: local (up to 5 per table) and global (up to 20 per table). These indexes can help improve the application’s ability to access data quickly and efficiently.

Differences Between Global and Local Secondary Indexes

| GSI | LSI |
| --- | --- |
| Hash or hash and range key | Hash and range key |
| No size limit | For each key, 10GB max |
| Add during table create, or later | Add during table create |
| Query all partitions in a table | Query single partition |
| Eventually consistent queries | Eventually/strong consistent queries |
| Dedicated throughput units | User table throughput units |
| Only access projected items | Access all attributes from table |

DynamoDB🧨, like Cosmos DB🪐, offers multiple data consistency levels. DynamoDB🧨 offers both eventually and strongly consistent reads, but like I said previously, “it’s like life itself, there are always tradeoffs. So, depending on your application needs, you will need to determine what’s the most important need for your application: latency or availability.”

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to put DynamoDB🧨 in. You can leverage an existing one or you can create a dedicated one for DynamoDB🧨. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to DynamoDB🧨. As for authentication🔐 and permission to access a table, they are managed through Identity and Access Management (IAM). DynamoDB🧨 provides end-to-end enterprise-grade encryption for data that is both in transit and at rest. All DynamoDB tables have encryption at rest enabled by default. This provides enhanced security by encrypting all your data using encryption keys🔑 stored in the AWS Key🔑 Management Service, or AWS KMS.

“Quicker than a ray of light I’m flying”

Making our final destination for this week’s explorations would be Amazon’s fully managed, fast, scalable data warehouse known as Redshift🎆. A “redshift🎆” is when a wavelength of light is stretched, so the light is seen as ‘shifted’ towards the red part of the spectrum, but according to anonymous sources, “Redshift🎆 was apparently named very deliberately as a nod to Oracle’s trademark red branding, and Salesforce is calling its effort to move onto a new database ‘Sayonara.’” Be that as it may, this would be the third data warehouse cloud☁️ solution we would have the pleasure of being acquainted with. 😊

AWS claims Redshift🎆 delivers 10x faster performance than other data warehouses. We didn’t have a chance to benchmark Redshift’s 🎆 performance, but based on some TPC tests vs. some of their top competitors, there might be some discrepancies with these claims; in either case, it’s still pretty darn fast.

Redshift🎆 uses massively parallel processing (MPP) and a columnar storage architecture. The core unit that makes up Redshift🎆 is the cluster. The cluster is made up of one or more compute nodes. There is a single leader node and several compute nodes. Client access to Redshift🎆 is via a SQL endpoint on the leader node. The client sends a query to the endpoint. The leader node creates jobs based on the query logic and sends them in parallel to the compute nodes. The compute nodes contain the actual data the queries need. The compute nodes find the required data, perform operations, and return results to the leader node. The leader node then aggregates the results from all of the compute nodes and sends a report back to the client.

The compute nodes themselves are individual servers, they have their own dedicated memory, CPU, and attached disks. An individual compute node is actually split up into slices🍕, one slice🍕 for every core of that node’s processor. Each slice🍕 is given a piece of memory, disk, and so forth, where it processes whatever part of the workflow that’s been assigned to it by the leader node.
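Standing up a small cluster from the CLI is a one-liner; a rough sketch in which the node type, count, and credentials are placeholders (dc2.large is one of the common node types):

$ aws redshift create-cluster \
    --cluster-identifier demo-cluster \
    --node-type dc2.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'ChangeMe123!'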

The way columnar database storage works, data is stored by columns rather than by rows. This allows for fast retrieval of columns of data. An additional advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. Again, there have been several architectural changes to Redshift🎆 as well since Richard’s course was released.

In the past, you needed to pick a distribution style. Today, you still have the option to choose a distribution style, but if you don’t specify one, Redshift🎆 will use AUTO distribution, making it a little easier not to make the wrong choice 😊. Another recent innovation to Redshift🎆 that didn’t exist when Richard’s course was released is the ability to build a unified data platform. Amazon Redshift🎆 Spectrum allows you to run queries across your data warehouse and Amazon S3 buckets simultaneously, allowing you to save time ⏰ and money💰, as you don’t need to load all your data into the data warehouse.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place Redshift🎆 in. You can leverage an existing one or you can create a dedicated one just for Redshift🎆. In addition, you will need to create an Amazon Simple Storage Service (S3) bucket and S3 Endpoint to be used with Redshift🎆. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your data warehouse. As for credentials 🔐 and entitlements in Redshift🎆, they are managed through AWS Identity and Access Management (IAM).

One last point worth mentioning is that AWS CloudWatch ⌚️ is included with all the tremendous cloud☁️ services offered by AWS, so you get great monitoring 📉 right out of the box! 😊 We enjoyed 😊 our time⏰ this week in AWS exploring 🔦 some of their data offerings, but we merely just scratched the surface.

“So much to do, so much to see… So, what’s wrong with taking the backstreets? You’ll never know if you don’t go. You’ll never shine if you don’t glow”

Below are some topics I am considering for my travels next week:

  • More with AWS Data Solutions
  • OKTA SSO
  • Neo4J and Cypher 
  • More with Google Cloud Path
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Thanks

—MCS

Week of July 10th

“Stay with the spirit I found”

Happy Friday! 😊

“You take the blue pill 💊 — the story ends, you wake up in your bed 🛌 and believe whatever you want to believe. You take the red pill 💊 — you stay in Wonderland and I show you how deep the rabbit🐰 hole goes.”

Last week, we hopped on board the Nebuchadnezzar🚀 and traveled through the cosmos🌌 to Microsoft’s multi-model NoSQL solution. So, this week we decided to go further down the “rabbit🐰 hole” and explore the wondrous land of Microsoft’s Azure NoSQL solutions as well as graph📈 databases. We would once again revisit Cosmos DB🪐, exploring all 5 APIs. In addition, we would have a brief journey with Azure Storage (Table), Azure Data Lake Storage Gen2 (ADLS), and Azure’s managed data analytics service for real-time analysis, Azure Data Explorer (ADX). Then for an encore we would venture into the world’s🌎 most popular graph📈 database, Neo4J.

First, playing the role of our leader “Morpheus” on our first mission would be featured Pluralsight author and premier trainer Reza Salehi through his recently released Pluralsight course Implementing NoSQL Databases in Microsoft Azure. Reza doesn’t take us quite as deep in the weeds 🌾 with Cosmos DB🪐 as Lenni Lobel’s Learning Azure Cosmos DB🪐 Pluralsight course, but that is because his course covers a wide range of topics in the Azure NoSQL ecosystem. Reza provides us with a very practical real-world🌎 scenario, like migrating from MongoDB🍃 Atlas to Cosmos DB🪐 (MongoDB🍃 API), and he also covers the Cassandra API, which was omitted from Lenni’s offerings. In addition, Reza spends some time giving a comprehensive overview of Azure Storage (Table) and introduces us to ADLS and ADX, all of which were new to our learnings.

In the introduction of the course, Reza gives us a brief history of NoSQL, which apparently has existed since the 1960s! It just wasn’t called NoSQL. He then gives us his definition of NoSQL and emphasizes its main goal: to provide horizontal scalability, availability, and optimal pricing. Reza mentions an interesting factoid that Azure NoSQL solutions have been used by Microsoft for about a decade through Skype, Xbox 🎮, and Office 365 🧩, none of which scaled very well with a traditional relational database.

Next, he discusses Azure Table Storage (soon to be deprecated and replaced by the Cosmos DB🪐 Table API). Azure Table Storage can store large amounts of structured and non-relational data (datasets that don’t require complex joins, foreign keys🔑, or stored procedures) cost effectively. In addition, it is durable, highly available, secure, and massively scalable⚖️. A table is basically a collection of entities with no schema enforced. An entity is a set of properties (maximum of 252), similar to a row in a table in a relational database. A property is a name-value pair. Three main system properties that must exist with each entity are a partition key🔑, a row key🔑, and a timestamp. In the case of the Partition Key🔑 and the Row Key🔑, the application is responsible for inserting and updating these values, whereas the Timestamp is managed by Azure Table Storage and is immutable. Azure automatically manages the partitions and the underlying storage, so as the data in your table grows, it is divided into different partitions. This allows for faster query performance⚡️ of entities with the same partition key🔑 and for atomic transactions on inserts and updates.
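A minimal sketch of the PartitionKey / RowKey / Timestamp model, assuming the azure-data-tables Python 🐍 package and a made-up storage account and table:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-account-connection-string>")
table = service.create_table_if_not_exists("Customers")

# PartitionKey and RowKey are supplied by the application; the Timestamp is
# added and maintained by the service and cannot be set by us.
table.create_entity({
    "PartitionKey": "NY",          # entities sharing this key share a partition
    "RowKey": "customer-0001",     # unique within the partition
    "Name": "Jane Doe",
    "Plan": "Premium",
})

# Queries that stay within a single partition are the fastest path.
for entity in table.query_entities("PartitionKey eq 'NY'"):
    print(entity["RowKey"], entity["Name"])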

Next on the agenda was Microsoft’s globally distributed, multi-model database service, better known as Cosmos DB🪐. Again, we had been down this road just last week, but just like re-watching the first Matrix movie🎞 I was more than happy 😊 to do so.

As a nice review, Reza reiterated some of the core Cosmos DB🪐 concepts like Global Distribution, Multi-homing, Data Consistency Levels, Time-to-live (TTL), and Data Partitioning. All of these are included with all five flavors, or APIs, of Cosmos DB🪐 because at the end of the day each API is just another container on top of Cosmos DB🪐. Some of the highlights included:

Global distribution

·        Cosmos DB🪐 allows you to add or remove any of the Azure regions to/from your Cosmos account at any time with the click of a button.

·        Cosmos DB🪐 will seamlessly replicate your data to all the regions associated with your Cosmos account.

·        The multi-homing capability of Cosmos DB🪐 allows your application to be highly available.

Multi-homing APIs

·        Your application is aware of the nearest region and sends requests to that region.

·        Nearest region is identified without any configuration changes

·        When a new region is added or removed, the connection string stays the same

Time-to-live (TTL)

•               You can set the expiry time (TTL) on Cosmos DB data items

•               Cosmos DB🪐 will automatically remove the items once this period has elapsed since their last modification time ⏰

Cosmos DB🪐 Consistency Levels

•       Cosmos DB🪐 offers five consistency levels to choose from:

•       Strong, bounded staleness, session, consistent prefix, eventual

Data Partitioning

·        A logical partition consists of a set of items that have the same partition key🔑.

·        Data that’s added to the container is automatically partitioned across a set of logical partitions.

·        Logical partitions are mapped to physical partitions that are distributed among several machines.

·        Throughput provisioned for a container is divided evenly among its physical partitions.

Then Reza breaks down each of the five Cosmos DB🪐 APIs in separate modules. But at the risk of being redundant with last week’s submission, we will just focus on the MongoDB🍃 API and the Cassandra API, as we covered the other APIs in depth last week. I will make one important point that applies to whichever API you are working with: you must choose an appropriate partition key🔑. As a rule of thumb 👍, an ideal partition key🔑 should have a wide range of values, so your data is evenly spread across logical partitions.
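Here is a minimal sketch of those ideas, assuming the azure-cosmos Python 🐍 SDK and made-up account, database, and container names, with a high-cardinality partition key🔑 and a TTL:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("retail")

# '/customerId' is chosen as a wide-ranging property so items spread evenly
# across logical partitions; default_ttl expires items a day after their
# last modification.
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    default_ttl=86400,        # seconds
    offer_throughput=400,     # RU/s, divided evenly among physical partitions
)

container.upsert_item({"id": "order-1", "customerId": "c-42", "total": 19.99})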

The MongoDB🍃 API in Cosmos DB🪐 supports the popular MongoDB🍃 document database with absolutely no code changes to existing applications other than a connection string. It now supports up to MongoDB 🍃 version 3.6.
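A tiny sketch of that “only the connection string changes” idea with pymongo; both URIs are placeholders, not real accounts:

from pymongo import MongoClient

# Before: MongoDB Atlas
# client = MongoClient("mongodb+srv://user:pass@mycluster.mongodb.net/")

# After: Cosmos DB's API for MongoDB (same driver, same application code)
client = MongoClient(
    "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/"
    "?ssl=true&retrywrites=false"
)

db = client["store"]
db.products.insert_one({"sku": "sku-1", "name": "widget", "price": 9.99})
print(db.products.find_one({"sku": "sku-1"}))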

During this module, Reza provides us with a very practical real-world 🌎 scenario of migrating from MongoDB🍃 Atlas to Cosmos DB🪐 (MongoDB🍃 API). We are happy😊 to report that we were able to follow along easily and successfully migrate our own MongoDB🍃 Atlas collections to Cosmos DB🪐.

Important to note: Before starting a migration from MongoDB🍃 to Cosmos DB🪐, you should estimate the amount of throughput to provision for your Azure Cosmos databases and collections and, of course, pick an optimal partition key🔑 for your data.

Next, we focused on the Cassandra API in Cosmos DB🪐. This one, admittedly, I was really looking forward to, as it wasn’t in scope on our previous journey. The Cosmos DB🪐 Cassandra API can be used as the data store for apps written for Apache Cassandra. Just like with MongoDB🍃, existing Cassandra applications using CQLv4-compliant drivers can easily communicate with the Cosmos DB🪐 Cassandra API, making it easy to switch from Apache Cassandra to the Cosmos DB🪐 Cassandra API with only an update to the connection string (see the sketch after this list). The familiar CQL, Cassandra client drivers, and Cassandra-based tools can all be used, making for a seamless migration with, of course, the benefits of Cosmos DB🪐 like:

·        No operations management (PaaS)

·        Low latency reads and writes

·        Use existing code and tools

·        Throughput and storage elasticity

·        Global distribution and availability

·        Choice of five well-defined consistency levels

·        Interact with Cosmos DB🪐 Cassandra API
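As promised above, here is a hedged connection sketch using the DataStax cassandra-driver for Python 🐍; the account name, key, and keyspace are all placeholders:

import ssl

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Relaxed TLS settings for a quick demo only; verify certificates in real use.
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

auth = PlainTextAuthProvider(username="<account>", password="<primary-key>")
cluster = Cluster(
    ["<account>.cassandra.cosmos.azure.com"],
    port=10350,                         # Cosmos DB Cassandra API port
    auth_provider=auth,
    ssl_context=ssl_context,
)
session = cluster.connect()

# Plain CQL continues to work unchanged.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Trinity')")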

Next we ventured on to new ground with Azure Data Lake Storage (ADLS). ADLS is a hyper-scale repository for big data analytic workloads. ADLS Gen2, built on top of Azure Blob Storage, is the foundation for building enterprise data lakes on Azure. ADLS supports hundreds of gigabits of throughput and manages massive amounts of data. Some key features of ADLS (with a short sketch after the list below) include:

·        Hadoop compatible – manage data same as Hadoop HDFS

·        POSIX permissions – supports ACL and POSIX file permissions

·        Cost effective – offers low cost storage capacity
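And the short sketch promised above, assuming the azure-storage-file-datalake package with an invented account, filesystem, and path:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)

fs = service.get_file_system_client("raw")      # a filesystem, i.e. a container
fs.create_directory("sales/2020/07")            # hierarchical, HDFS-style paths

file_client = fs.get_file_client("sales/2020/07/orders.csv")
file_client.upload_data(b"id,amount\n1,19.99\n", overwrite=True)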

Last but certainly not least on this journey with Reza was an introduction to Azure Data Explorer (ADX), a fast and highly scalable⚖️ data exploration service for log and telemetry data. ADX is designed to ingest data from sources like websites, devices, logs, and more. These ingestion sources come natively from Azure Event Hubs, IoT Hub, and Blob Storage. Data is then stored in a highly scalable⚖️ database, and analytics are performed using the Kusto Query Language, or KQL (a minimal KQL sketch in Python 🐍 follows the integration lists below). ADX can be provisioned with the Azure CLI, PowerShell, C# (NuGet package), the Python 🐍 SDK, and ARM templates. One of the key features of ADX is Anomaly Detection; ADX uses machine learning under the hood to find these anomalies. ADX also supports many data visualization tools like:

·        Kusto query language visualizations

·        Azure Data Explorer dashboards (Web UI)

·        Power BI connector

·        Microsoft Excel connector

·        ODBC connector

·        Grafana (ADX plugin)

·        Kibana Connector (using k2bridge)

·        Tableau (via ODBC connector)

·        Qlik (via ODBC connector)

·        Sisense (via JDBC connector)

·        Redash

ADX can easily integrate with other services like:

·        Azure Data Factory

·        Microsoft Flow

·        Logic Apps

·        Apache Spark Connector

·        Azure Databricks

·        CI/CD using Azure DevOps
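And here is the minimal KQL sketch promised above, assuming the azure-kusto-data package; the cluster URL, database, and table are placeholders:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://<cluster>.<region>.kusto.windows.net"
)
client = KustoClient(kcsb)

# Aggregate the last day of (hypothetical) telemetry into 5-minute buckets.
query = """
Telemetry
| where Timestamp > ago(1d)
| summarize avg(Value) by bin(Timestamp, 5m)
"""

response = client.execute("<database>", query)
for row in response.primary_results[0]:
    print(row)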

I’ll show these people what you don’t want them to see. A world🌎 without rules and controls, without borders or boundaries. A world🌎 where anything is possible. -Neo

After spending much time in Cosmos DB🪐, and in particular with the Graph📈 Database API, I have become very intrigued by this type of NoSQL solution. The more I explored, the more I coveted. I had a particular yearning to learn more about the world’s 🌎 most popular graph 📈database, Neo4J. For those not aware, Neo4J is developed by a Swedish 🇸🇪 technology company sometimes referred to as Neo4J and sometimes as Neo Technology. I guess it depends on the day of the week?

According to all accounts, the name “Neo” was chosen for Swedish 🇸🇪 pop artist and favorite of the Swedish🇸🇪 developers Linus “Neo” Ingelsbo, “4” for the version, and “J” for the Swedish🇸🇪 word “Jätteträd”, which of course means “giant tree 🌳”, because a tree 🌳 signifies the huge data structures that could now be stored in this amazing database product. But to me this story seems a bit curious… With a database named “Neo”, a query language called “Cypher”, and Awesome Procedures On Cypher, better known as APOC, I somehow believe there is another story here…

Anyway, to guide us through our learning with Neo4J would be none other than the “Flying Dutchman” 🇳🇱 Roland Guijt through his Introduction to Graph📈 Databases, Cypher, and Neo4j course, which was short but sweet (sort of like a Stroopwafel🧇).

In the introduction, Roland tells us the Who, What, When, Where, Why and How of graph📈 databases. A graph 📈 consists of nodes (or vertices) which are connected by directional relationships (or edges). A node represents an entity. An entity is typically something in the real world🌎 like a customer, an order, or a person. A collection of nodes and relationships together is called a graph 📈. Graph📈 databases are very mind-friendly compared to other data storage technologies because graphs📈 act a lot like how the human brain🧠 works. It’s easier to think of the data structure and also easier to write queries. These patterns are much like the patterns the brain🧠 uses to fetch data or retrieve memories.

Graph 📈 databases are all about relationships and thus are very strong at storing and retrieving highly related data. They are also very performant during querying, even with a large number of nodes, like in the millions. They offer great flexibility since, like all NoSQL databases, they don’t require a fixed schema. In addition, they are quite agile, as you can add or delete nodes and node properties without affecting already-stored nodes, and they’re extensible, supporting multiple query languages.

After a comprehensive overview of graph📈 databases, Roland dives right into Neo4J, the leader in graph 📈databases. Unlike document databases, Neo4j is ACID compliant, which means that all data modification is done within a transaction. If something goes wrong, Neo4j will simply roll back to a state where the data was reliable.

Neo4J is Java☕️ based, which allows you to install it on multiple platforms like Windows, Linux, and OS X. Neo4j can scale⚖️ up, as it easily adjusts to hardware changes, i.e. adding more physical memory lets it automatically hold more nodes in the cache. Neo4J can also scale ⚖️ out like most NoSQL solutions, i.e. adding more servers, meaning it can distribute the load of transactions or create a highly available cluster in which a standby server takes over when the active one fails.

Since by definition Neo4J is a graph📈 database, it’s all about relationships and nodes, and both are equally important. Nodes are schema-less entities with properties (key-value pairs whose keys are always strings). A relationship connects one node to another. Just like nodes, relationships can also contain properties, which also support indexing.

Next, Roland discusses querying data with Cypher, the most powerful⚡️ of the query languages supported by Neo4J. Cypher was developed and optimized for Neo4j and for graph📈 databases. Cypher is a very fluid language, meaning it continuously changes with each release. The good 😊 news is that all major releases are backwards compatible with older versions of the language. It’s very different from SQL, so there is a bit of a learning curve. However, it’s not as steep a learning curve as you would imagine, because Cypher uses patterns to match the data in the database, very much how the brain🧠 works. That, and Neo4J Desktop has IntelliSense. 😊

As an example to demonstrate the query language and CRUD, we worked with a very cool Dr. Who graph 📈database filled with nodes for Actors, Roles, Episodes, and Villains and their given relationships. To begin, we started with the “R” (Read) part of CRUD, learning the MATCH command.

Below is some MATCH – RETURN syntax:

MATCH (:Actor{name:'Matt Smith'})-[:PLAYED]->(c:Character) RETURN c.name AS name

MATCH (actors:Actor)-[:REGENERATED_TO]->(others) RETURN actors.name, others.name

MATCH (:Character{name:'Doctor'})<-[:ENEMY_OF]-(:Character)-[:COMES_FROM]->(p:Planet) RETURN p.name AS Planet, count(p) AS Count

MATCH (:Actor{name:'Matt Smith'})-[:APPEARED_IN]->(ep:Episode)<-[:APPEARED_IN]-(:Character{name:'Amy Pond'}),(ep)<-[:APPEARED_IN]-(enemies:Character)<-[:ENEMY_OF]-(:Character{name:'Doctor'}) RETURN ep AS Episode, collect(enemies.name) AS Enemies;

Further, Roland discussed the WHERE and ORDER BY clauses, which are very similar to ANSI SQL. Then he covers other Cypher syntax like:

SKIP – which skips the number of result items you specify.

LIMIT – which limits the numbers of items returned.

UNION – which allows you to connect two queries together and generate one result set.

Then he ends the module covering scalar functions like TOINT, LENGTH, REDUCE, FILTER, ROUND, and SUBSTRING.

Then he reviews a couple of his favorite advanced query features, like COMPANION_OF and SHORTESTPATH.

Continuing on with the C, U, and D in CRUD, we played with CREATE, MATCH with SET, and MATCH with DELETE.

Below is some Syntax:

CREATE p = (:Actor{name:'Peter Capaldi'})-[:APPEARED_IN]->(:Episode{name:'The Time of The Doctor'}) RETURN p

MATCH (matt:Actor{name:'Matt Smith'})

DELETE matt

MATCH (matt:Actor{name:'Matt Smith'})

SET matt.salary = 1000

Then we looked at MERGE and FOREACH, with the syntax below as an example:

MERGE (peter:Actor{name:'Peter Capaldi'}) RETURN peter

MATCH p = (actors:Actor)-[r:PLAYED]->(others)

WHERE actors.salary > 10000

FOREACH (n IN nodes(p) | SET n.done = true)

As we continued our journey with Neo4J, we reconnoitered Indexes and Constraints. Indexes only help with data retrieval, so if your application performs lots of writes it’s probably best to avoid them. As for constraints, the unique constraint is currently the only constraint available in Neo4j, which is why it is often referred to simply as “the constraint”. Lastly, in the module we reviewed importing CSV files, which makes bringing data from other sources into a Neo4j database a breeze. CSV files can be loaded from the local file system as well as remote locations. Cypher has a LOAD CSV statement, which is used together with CREATE and/or MERGE.

Finally, Roland reviewed Neo4j’s APIs, which was a little bit outside our lexicon but interesting nonetheless. Neo4j supports two API types out of the box: the traditional REST API and their proprietary Bolt⚡️ protocol. The advantage of Bolt⚡️ is mainly performance; Bolt⚡️ doesn’t have the HTTP overhead, and it uses a binary format instead of text to return data. For both the REST and Bolt APIs, Roland provides C# code samples that can be run with NuGet packages in Visual Studio, my new favorite IDE.
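Roland’s samples are in C#, but as a rough Python 🐍 analogue over Bolt⚡️ (assuming the official neo4j driver, a local instance, and made-up credentials), a parameterized query using the WHERE / ORDER BY / LIMIT clauses covered above might look like this:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    result = session.run(
        """
        MATCH (a:Actor)-[:PLAYED]->(c:Character)
        WHERE c.name = $name
        RETURN a.name AS actor
        ORDER BY actor
        LIMIT 5
        """,
        name="Doctor",
    )
    for record in result:
        print(record["actor"])

driver.close()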

Ever have that feeling where you’re not sure if you’re awake or dreaming?

Below are some topics I am considering for my learnings next week:

·      More on Neo4J and Cypher 

·      More on MongoDB

·      More with Google Cloud Path

·      Working with Parquet files 

·      JDBC Drivers

·      More on Machine Learning

·      ONTAP Cluster Fundamentals

·      Data Visualization Tools (i.e. Looker)

·      Additional ETL Solutions (Stitch, FiveTran) 

·      Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS

Week of June 26th

 “…And I think to myself… What a wonderful world 🌎…”

Happy Coconut🥥 Day!

Recently, I had been spending so much time ⏰ in GCP land☁️ that it started to feel like my second home 🏡. However, it was time for a little data sabbatical. I needed to visit a land of mysticism✨ and intrigue. A place where developers can roam freely and where data can be flexible, semi-structured, and hierarchical in nature, and can be easily scaled horizontally… A place not bound to the rigidness of relational tables but a domicile of flexible documents. We would journey to the world of MongoDB🍃.

Ok, so we have been there before, but we needed a refresher. It had been about 6 years since we first became acquainted with this technological phenomenon. Besides, we hadn’t played around too much with some of the company’s innovations like MongoDB🍃 Compass 🧭, a sleek visual environment that allows you to analyze and understand the contents of your data in MongoDB🍃, and MongoDB🍃 Atlas☁️, the managed service used to provision, maintain, and scale⚖️ MongoDB🍃 clusters of instances, conveniently offered on AWS, Azure, and GCP.

To assist us in getting started would be our old comrade in arms, Pinal Dave of SQLAuthority fame. Pinal has put together an outstanding condensed course on Foundations of Document Databases with MongoDB.

So, this is where we would begin. The course commences with an introduction to NoSQL (Not Just SQL) databases and some of the advantages of a document database, like an intuitive data model, a dynamic schema, and a distributed, scalable⚖️ database. Then he gives us a comprehensible explanation of the CAP (Consistency, Availability and Partition Tolerance) Theorem and that only two of the three can be guaranteed at once. MongoDB🍃 fits under the CP variety, compromising on availability. Next, the following key 🔑 points are made in relation to MongoDB🍃:

  • All write operations in MongoDB🍃 are atomic on the level of a single document
  • If the collection does not currently exist, the insert operation will create it.

Next, Pinal takes us through a few quick and easy steps on how to get set up with MongoDB Atlas☁️. Once our fully managed MongoDB🍃 cluster was fired🔥 up, it was time to navigate our collections with MongoDB Compass🧭.

For much of the rest of the course, we would concentrate on CRUD (Create, Read, Update, Delete) operations in MongoDB🍃 through both Compass🧭 and the CLI (Mongo Shell).
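As a small taste of what that CRUD looks like outside of Compass🧭 and the shell, here is a minimal pymongo sketch; the connection string, database, and collection names are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@mycluster.mongodb.net/")
db = client["travel"]

# Create: inserting into a collection that doesn't exist yet creates it.
db.stops.insert_one({"city": "Mountain View", "visits": 1})

# Read
print(db.stops.find_one({"city": "Mountain View"}))

# Update: atomic at the level of this single document.
db.stops.update_one({"city": "Mountain View"}, {"$inc": {"visits": 1}})

# Delete
db.stops.delete_one({"city": "Mountain View"})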

The course would also present us with a terse walkthrough of the syntax and an elucidation of Read Concerns and Write Concerns in MongoDB🍃.

Read Concern

Allows you to control the consistency and isolation properties of the data read from replica sets and replica set shards.

  1. Local – (default) no guarantee the data has been written to a majority of the replicas – reads from the primary
  2. Available – no guarantee the data has been written to a majority of the replicas – reads from a secondary
  3. Majority – the data returned has been acknowledged by a majority of the replicas
  4. Linearizable – reflects all successful writes acknowledged by a majority of the replicas before the read started (the query might have to wait)
  5. Snapshot – used with multi-document transactions; reads data from a majority-committed snapshot

Write Concern

Level of acknowledgement requested from MongoDB🍃 for write operations

w:1 – ack from the primary

w:0 – no ack

w:n – primary + (n-1) secondaries

w:majority – ack from a majority of the replica set members

wtimeout – a time limit to prevent write operations from blocking indefinitely

As a lasting point on database writes (the U and D of CRUD), all write operations in MongoDB🍃 are atomic at the level of a single document. In other words, whether you are updating a single document or multiple documents in a collection, each individual update is atomic only at the single-document level.
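Here is a small pymongo sketch of dialing those read and write concerns per collection, plus an update that is atomic on a single document; the names are made up:

from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb+srv://user:pass@mycluster.mongodb.net/")
db = client["travel"]

stops = db.get_collection(
    "stops",
    write_concern=WriteConcern(w="majority", wtimeout=5000),  # w:majority + wtimeout
    read_concern=ReadConcern("majority"),
)

# Even a multi-field change to one document is a single atomic operation.
stops.update_one(
    {"city": "Mountain View"},
    {"$set": {"visited": True}, "$inc": {"visits": 1}},
)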

Closing time 🕰 … Time 🕰 for you to go out go out into the world 🌎 … Closing time 🕰…

Finally, the dénouement of the course is on common SQL concepts and their semantics in MongoDB🍃, including some of the major differences between a typical RDBMS and MongoDB🍃, which can be represented by the table below:

RDBMS                        MongoDB
SQL                          MQL (Mongo Query Language)
Predefined Schema            Dynamic Schema
Relational Keys              No foreign keys
Triggers                     No Triggers
ACID Properties              CAP Theorem

Sadly 😢, that was it for Foundations of Document Databases with MongoDB. This left us clamoring for more. Fortunately, Nuri Halperin happily😊 delivered and then some… Nuri, a MongoDB🍃 Guru 🧙‍♂️ and a Love❤️ 👨‍⚕️ of sorts (known for creating the wildly popular jdate.com platform), put together a series of timeless MongoDB🍃 courses that have managed to stand up to the test of time 🕰, which, I might add, is not an easy feat when it comes to a burgeoning technology like MongoDB🍃.

Part 1: The introductory course, Introduction to MongoDB, takes an in-depth look at the Mongo Shell (CLI), CRUD syntax, and indexing.

Part 2: MongoDB Administration takes a deep dive into key MongoDB🍃 administration concepts, i.e. installation, configuration, security, backup/restore, monitoring, high availability, and performance.

Nuri, like Pinal, discusses some of the challenges found in relational databases, like the impedance mismatch and the need for object-relational mapping (ORM) for developers. He demonstrates how MongoDB🍃 solves these challenges through its schema-less approach and no-relationships-required model. In addition, he touches on how MongoDB🍃 lends itself nicely to data polymorphism.

Next he takes us through the MongoDB🍃 architecture, a collection of humongous arrays that utilizes memory-mapped BSON (“Binary JSON”, where JSON is JavaScript Object Notation) files. MongoDB🍃 intuitively leverages the OS to handle the loading of data and saving to disk, which allows the engine to center on speed, optimization, and stability.

MongoDB’s🍃 main mission is to just serve up data quickly and efficiently. Next, Nuri takes us through the Mongo Shell (CLI), which basically is just a Java☕️Script interpreter that allows you to interactively get insight into the MongoDB🍃 server. Further, he discusses indexes, types of indexes, and how paramount indexes are in MongoDB🍃 for practical usability.

Lastly, Nuri takes us through MongoDB🍃 replication, which uses simple-to-configure but highly scalable⚖️ replica sets. This is how MongoDB🍃 achieves “eventual consistency”, automatic failover, and automatic recovery… And this is just the introduction to Part 2 of the course…

Like trying to watch all 3 parts of The Lord of the Rings 💍 trilogy (Director’s Edition) in a single helping, it just wasn’t possible to complete all of Nuri’s two-part series in a single week, but we did get through most of it. 😊 This also left us with a little bit more on our plate🍽 as we continue through our Mongo journey…

This Week’s Log

Out of the tree 🌳 of life I just picked me a plum… You came along and everything started in to hum 🎶… Still it’s a real good bet… The best is yet to come

Below are some topics I am considering for my voyage next week:

  • More with Nuri and MongoDB 
  • Cosmos DB
  • More with Google Cloud Path
  • Working with Parquet files 
  • JDBC Drivers
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS