dataorchestration | SQL Squirrels

Week of June 19th

Posted on June 19, 2020 by Mark Shay

“I had some dreams, they were clouds ☁️ in my coffee☕️ … Clouds ☁️ in my coffee ☕️ , and…”

Hi All –

Last week, we explored Google’s fully managed “No-Ops” Cloud ☁️ DW solution, BigQuery🔎. So naturally it made sense to drink🍹more of the Google Kool-Aid and further discover the data infrastructure offerings within the Google fiefdom 👑. Besides, we have been wanting to find out what all the hype was about with Datafusion ☢️ for some time now, which we finally did, and happily😊 wound up getting a whole lot more than we bargained for…

Taking us through Google’s stratosphere☁️ would be none other than some of the more prominent members of the Google Cloud Team: Evan Jones, Julie Price, and Gwendolyn Stripling. Apparently, these Googlers (all of whom seem to have mastered the art of using their hands👐 while speaking) collaborated with other data aficionados at Google on a 6 course compilation of awesomeness😎 for the Data Engineering on the Google Cloud☁️Path. The course that fit the bill to start this week’s learning off was Building Batch Data Pipelines on GCP.

Before we were able to dive right into DataFusion☢️, we first started off with a brief review of EL (Extract and Load), ELT (Extract, Load, and Transform), and ETL (Extract, Transform, and Load).

The best way to think of these types of data extraction is the following:

EL is like a package📦 delivered right to your door🚪 where the contents can be taken right out of the box and used. (data can be imported “as is”)
ELT is like a hand truck 🛒 which allows you to move packages easily, but the packages 📦 📦 still need to be unloaded and the items possibly stored a particular way.
ETL is like a forklift 🚜; this is when heavy lifting needs to be done to transfer packages and have them fit in the right place.
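
To make the difference in ordering concrete, here is a minimal sketch in Python 🐍. It is only an illustration: the `warehouse` dictionary, the sample rows, and every function name are made up for this example and have nothing to do with any GCP API.

```python
# Hypothetical in-memory stand-ins; none of these names come from a GCP API.
raw_rows = [" Alice,34 ", "bob,27", " CAROL,41"]


def extract():
    """Pull rows from the source system unchanged."""
    return list(raw_rows)


def transform(rows):
    """Clean each row: trim whitespace, normalize the name, cast the age."""
    cleaned = []
    for row in rows:
        name, age = row.strip().split(",")
        cleaned.append({"name": name.strip().title(), "age": int(age)})
    return cleaned


def run_el(warehouse):
    # EL: the data is usable "as is", so it is loaded without changes.
    warehouse["raw"] = extract()


def run_elt(warehouse):
    # ELT: load the raw rows first, then transform them inside the warehouse.
    warehouse["raw"] = extract()
    warehouse["clean"] = transform(warehouse["raw"])


def run_etl(warehouse):
    # ETL: transform in an intermediate step; only clean rows get loaded.
    warehouse["clean"] = transform(extract())
```

The only thing that changes between the three is where the transform step sits relative to the load, which is exactly why ETL needs an intermediate system to do the heavy lifting.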

In the case of EL and ELT, our flavor du jour in the Data Warehouse space, BigQuery🔎, is an ideal target 🎯 system, but when you need the heavy artillery (ETL), that’s when you’ve got to bring in an intermediate solution. The best way to achieve these goals is with the following:

Data pipelines
Manage pipelines
UI to build those pipelines

Google offers several data transformation and streaming pipeline solutions (Dataproc🔧 and Dataflow🚰) and one easy-to-use UI (DataFusion☢️) for building those pipelines. Our first stop was Dataproc🔧, which is a fast, easy-to-use, fully managed cloud☁️ service for running Apache Spark⚡️ and Apache Hadoop🐘 clusters. Hadoop🐘 solutions are generally not really our area of expertise, but since Datafusion ☢️ sits on top of Dataproc🔧 it was worth our while to spend some time here and get a good general understanding of how it all works.

Next, we ventured over to the much anticipated Datafusion☢️, which was more than worth the wait! Datafusion☢️ uses ephemeral Dataproc🔧 VMs to perform all the transforms in batch data pipelines (streaming is currently not supported but coming soon through Dataflow🚰 support). Under the hood, Datafusion☢️ leverages five main components:

1. Kubernetes☸️ Engine (runs in a containerized environment on GKE)

2. Key🔑 Management Service (For Security)

3. Persistent Disk

4. Google Cloud☁️ Storage (GCS) (For long term storage)

5. Cloud☁️ SQL – (To manage user and pipeline data)

The good news is that you don’t really need to muck around with any of these components. In fact, you shouldn’t even concern yourself with them at all. I just mentioned them because I thought it was kind of a cool stack 😎. The most important part of Datafusion☢️ is the Data Fusion☢️ studio, which is the graphical “no code” tool that allows Data Analysts and ETL Developers to wrangle data and build batch data pipelines. Basically, it allows you to build pretty complex pipelines by simply using “drag and drop”.

“Don’t reinvent the wheel, just realign it.” – Anthony J. D’Angelo

So now with a cool 😎 and easy to build batch pipeline UI under our belt, what about a way to orchestrate all these pipelines? Well, Google pulled no punches🥊🥊 and gave us Cloud☁️ Composer, which is a fully managed data workflow and orchestration service that allows you to schedule and monitor pipelines. Following the motto of “not reinventing the wheel”, Cloud☁️ Composer leverages Apache Airflow 🌬.

For those who don’t know, Apache Airflow 🌬 is the popular Data Pipeline orchestration tool originally developed by the fine folks at Airbnb. Airflow🌬 is written in Python 🐍 (our new favorite programming language), and workflows are created via Python 🐍 scripts (1 Python🐍 file per DAG). Airflow 🌬 uses directed acyclic graphs (DAGs) to manage workflow orchestration. Not to be confused with an uncool person or an unpleasant sight on a sheep 🐑, a DAG* is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
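
To see what “organized by dependencies” means in practice, here is a small sketch that uses only Python’s 🐍 standard library (`graphlib`, available from Python 3.9). To be clear, this is not Airflow 🌬 code: the task names are made up, and a real Airflow 🌬 DAG is declared with Airflow’s own classes. It just shows how a DAG turns a set of dependencies into a valid run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order where every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic there is always at least one valid order. If two tasks depended on each other, `TopologicalSorter` would raise a `CycleError`, which is the “acyclic” part of the name doing its job.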

Take a bow for the new revolution… Smile and grin at the change all around me

Next up on our adventures was Dataflow🚰, which is a fully managed streaming 🌊 analytics service that minimizes latency, processing time, and cost through autoscaling as well as batch processing. So why Dataflow🚰 and not DataProc🔧?

No doubt, Dataproc🔧 is a solid data pipeline solution which meets most requirements for either Batch or Streaming🌊Pipelines but it’s a bit clunky and requires existing knowledge of Hadoop🐘/Spark⚡️ infrastructure.

Dataproc🔧 is still an ideal solution for those who want to bridge 🌉 the gap by moving their on-premise Big Data Infrastructure to GCP. However, if you have a green field project then Dataflow🚰 definitely seems like the way to go.

DataFlow🚰 is “Server-less” which means the service “just works” and you don’t need to manage it! Once again, Google holds true to form with our earlier mantra (“not reinventing the wheel”) as Cloud Dataflow🚰 is built on top of the popular batch and streaming pipeline solution Apache Beam.

For those not familiar with Apache BEAM (Batch and StrEAM) it was also developed by Google to ensure the perfect marriage between batch and streaming data-parallel processing pipelines. A true work of art!

The show must go on….

So ya… Thought ya. Might like to go to the show…To feel the warm🔥 thrill of confusion. That space cadet glow

Now that we were on a roll with our journey through GCP’s Data Ecosystem, it seemed logical to continue our path with the next course, Building Resilient Streaming Analytics Systems on GCP. This exposition was taught by the brilliant Raiyann Serang, who maintains a well kempt hairdo throughout his presentations, and the distinguished Nitin Aggarwal, as well as the aforementioned Evan Jones.

First, Raiyann takes us through a brief introduction on streaming 🌊 data (data processing for unbounded data sets). In addition, he provides the reasons for streaming 🌊 data and the value that streaming🌊 data provides to the business by enabling real time information in a dashboard or another means to see the current state. He touches on the ideal architectural model using Google Pub/Sub and Dataflow🚰 to construct a data pipeline that minimizes latency at each step during the ingestion process.

Next, he discusses the infamous 3Vs with regard to streaming 🌊 data and how a data engineer might deal with these challenges.

Volume

How to ingest this data into the system?
How to store and organize data to process this data quickly?
How will the storage layer be integrated with other processing layers?

Velocity

10,000 records/sec being transferred (Stock market Data, etc.)
How do systems need to handle changes in load?

Variety

Type and format of data and the constraints of processing

Next, he provides a preview of the rest of the course as he unveils Google’s triumvirate for the streaming data challenge: Pub/Sub to deal with variable volumes of data, Dataflow🚰 to process data without undue delays, and BigQuery🔎 to address the need for ad-hoc analysis and immediate insights.

Pure Gold!

After a great introduction, Raiyann takes us right to Pub/Sub. Fortunately, we have been to this rodeo before and were well aware of the value of Pub/Sub. Pub/Sub is a ready to use asynchronous distribution system that fully manages data ingestion for both on cloud ☁️ and on premise environments. It’s a highly desirable solution when it comes to streaming solutions because of how well it addresses Availability, Durability, and Scalability.

The short story around Pub/Sub is a story of two data structures, the topic and the subscription. The Pub/Sub client that creates the topic is called the publisher, and the Cloud Pub/Sub client that creates the subscription is the subscriber. Pub/Sub provides both Pull delivery (clients periodically call for messages and have to acknowledge each message as a separate step) and Push delivery (Pub/Sub sends each message to an endpoint the subscriber registers, and a successful response counts as the acknowledgement).
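
Here is a toy, in-memory sketch of the topic and subscription idea with pull delivery, written in plain Python 🐍. It is not the Cloud Pub/Sub client library: the `Topic` and `Subscription` classes and all of their methods are hypothetical names invented for this illustration, and a real service adds things like durability, retries, and acknowledgement deadlines.

```python
from collections import deque
from itertools import count


class Subscription:
    """Holds messages for one subscriber until they are acknowledged."""

    def __init__(self, name):
        self.name = name
        self._pending = deque()   # delivered to the subscription, not yet pulled
        self._outstanding = {}    # pulled, but not yet acknowledged

    def deliver(self, ack_id, data):
        self._pending.append((ack_id, data))

    def pull(self, max_messages=10):
        """Pull delivery: the subscriber asks for messages."""
        batch = []
        while self._pending and len(batch) < max_messages:
            ack_id, data = self._pending.popleft()
            self._outstanding[ack_id] = data
            batch.append((ack_id, data))
        return batch

    def acknowledge(self, ack_id):
        """Acknowledging is a separate step from pulling."""
        self._outstanding.pop(ack_id, None)

    def unacknowledged(self):
        return len(self._pending) + len(self._outstanding)


class Topic:
    """Fans each published message out to every attached subscription."""

    def __init__(self, name):
        self.name = name
        self._subscriptions = []
        self._ids = count(1)

    def subscribe(self, name):
        subscription = Subscription(name)
        self._subscriptions.append(subscription)
        return subscription

    def publish(self, data):
        message_id = next(self._ids)
        for subscription in self._subscriptions:
            subscription.deliver(message_id, data)
        return message_id


topic = Topic("trades")
dashboard = topic.subscribe("dashboard")
archive = topic.subscribe("archive")
topic.publish("AAPL 350.10")

for ack_id, data in dashboard.pull():
    dashboard.acknowledge(ack_id)  # the dashboard has handled the message
```

Note that each subscription keeps its own copy of the message, so the `archive` subscription still holds the trade even after `dashboard` has acknowledged it. That decoupling between publishers and subscribers is the whole point.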

Now that we had covered how to ingest data, it was time to move to the next major piece in our data architectural model, and that is how to process the data without undue delays: Dataflow🚰.

Taking us through this part of the journey would be Nitin. We had already covered Dataflow🚰 earlier in the week in the previous course, but that was only with regard to batch data (bounded or unchanging data) pipelines.

DataFlow🚰, if you remember, is built on Apache Beam, so in other words it has “the need for speed” and can support streams🌊 of data. Dataflow🚰 provides highly scalable, low latency processing pipelines for incoming messages. Nitin further discusses the major challenges with handling streaming or real time data and how DataFlow🚰 tackles these obstacles.

Streaming 🌊 data generally only grows larger and more frequent
Fault Tolerance – Maintain fault tolerance despite increasing volumes of data
Model – Is it streaming or repeated batch?
Time – (Latency) what if data arrives late

Next, he discusses one of DataFlow’s🚰 key strengths, “Windowing”, and provides details on the three kinds of windows.

Fixed – Divides Data into time Slices
Sliding – Overlapping windows, the ones you use for computing (often required when doing aggregations over unbounded data, such as a moving average)
Sessions – Defined by a minimum gap duration, with the timing triggered by the arrival of another element (useful when communication is bursty)
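
The three window types are easier to tell apart with a few timestamps in hand. Below is a rough sketch in plain Python 🐍, not Apache Beam or Dataflow🚰 code: the function names, the `(timestamp, value)` event format, and the choice to ignore windows that would start before time zero are all assumptions made for this illustration.

```python
def fixed_windows(events, size):
    """Assign each (timestamp, value) to exactly one window of `size` seconds."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault((start, start + size), []).append(value)
    return windows


def sliding_windows(events, size, period):
    """Windows of `size` seconds that start every `period` seconds.

    When period < size the windows overlap, so one event can land in several.
    Windows that would start before time zero are skipped for simplicity.
    """
    windows = {}
    for ts, value in events:
        # Latest window start at or before ts, then walk back while ts still fits.
        start = (ts // period) * period
        while start > ts - size and start >= 0:
            windows.setdefault((start, start + size), []).append(value)
            start -= period
    return windows


def session_windows(events, gap):
    """Group events into sessions; `gap` seconds of silence ends a session."""
    sessions = []
    for ts, value in sorted(events, key=lambda event: event[0]):
        if sessions and ts - sessions[-1]["end"] < gap:
            sessions[-1]["end"] = ts
            sessions[-1]["values"].append(value)
        else:
            sessions.append({"start": ts, "end": ts, "values": [value]})
    return sessions


events = [(5, "a"), (20, "b"), (65, "c"), (130, "d")]
fixed = fixed_windows(events, size=60)
sliding = sliding_windows(events, size=60, period=30)
sessions = session_windows(events, gap=30)
```

With these sample events, fixed windows put every element in exactly one bucket, sliding windows let “c” and “d” each show up in two overlapping buckets, and session windows group “a” and “b” together because they arrive less than 30 seconds apart.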

Then Nitin rounds it off with one of the key concepts when it comes to Streaming🌊 data pipelines, the “watermark trigger”. The summit of this module is the lab on Streaming🌊 Data Processing, which requires building a full end to end solution using Pub/Sub, Dataflow🚰, and BigQuery🔎. In addition, he gave us a nice introduction to Google Cloud☁️ Monitoring, which we had not seen before.

So much larger than life… I’m going to watch it growing

We next headed over to another spoke in the data architecture wheel 🎡with Google’s Bigtable . Bigtable (built on Colossus) is Google’s NoSQL solution for high performance applications. We hadn’t done much so far with NoSQL up until this point. So, this module offered us a great primer for future travels.

Bigtable is ideal for storing very large amounts of data in a key-value store or non-structured data, and it supports high read and write throughput at low latency for fast access to large datasets. However, Bigtable is not a good solution for structured data, small data (< 1 TB), or data that requires SQL joins. Bigtable is good for specific use cases like real-time lookups as part of an application, where speed and efficiency are desired beyond that of a database. When Bigtable is a good match for specific workloads, “it’s so consistently fast that it is magical 🧙‍♂️”.

“And down the stretch they 🐎 come!”

Next up, Evan takes us down the homestretch by surveying Advanced BigQuery 🔎 Functionality and Performance. He begins with an overview and a demo of BigQuery 🔎 GIS (Geographic Information Systems) functions, which allow you to analyze and visualize geospatial data in BigQuery🔎. This is a little beyond the scope of our usual musings, but it’s good to know from an informational standpoint. Then Evan covers a critical topic for any data engineer or analyst to understand, which is how to break apart a single data set into groups using Window Functions.
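
For anyone who hasn’t met window functions yet, here is a small sketch. It runs against an in-memory SQLite database through Python’s 🐍 built-in `sqlite3` module rather than BigQuery🔎 (an assumption made purely so it runs anywhere; it needs SQLite 3.25 or newer), and the `sales` table and its columns are invented for the example. The `OVER (PARTITION BY ...)` idea is the same one Evan covers.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 150), ("west", 1, 80), ("west", 2, 40)],
)

# PARTITION BY splits the rows into groups without collapsing them,
# so every row keeps its detail and gains a per-group calculation.
query = """
SELECT region, day, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY region, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike a `GROUP BY`, which would return one row per region, the window version returns all four rows, each with a running total and a rank inside its own region.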

This was followed by a lab that demonstrated some neat tricks on how to reduce I/O, cache results, and perform efficient joins by using the WITH clause, changing the region location parameter, and denormalizing the data, respectively. Finally, Evan leaves us with a nice parting gift by providing a handy cheatsheet and a quick lab on Partitioned Tables in Google BigQuery🔎.

* DAG is a directed graph data structure that uses a topological ordering. The sequence can only go from earlier to later. DAG is often applied to problems related to data processing, scheduling, finding the best route in navigation, and data compression.

“It’s something unpredictable, but in the end it’s right”

Below are some topics I am considering for my travels next week:

NoSQL – MongoDB, Cosmos DB
More on Google Data Engineering with the Google Cloud Path <- Google Cloud Certified Professional Data Engineer
Working with JSON Files
Working with Parquet files
JDBC Drivers
More on Machine Learning
ONTAP Cluster Fundamentals
Data Visualization Tools (e.g. Looker)
Additional ETL Solutions (Stitch, FiveTran)
Processing and Transforming data/Exploring data through ML (e.g. Databricks)

Stay safe and Be well

—MCS