Week of June 12th

“Count to infinity, ones and zeroes”

Happy Russia 🇷🇺 Day!

Earth🌎 below us, Drifting, falling, Floating, weightless, Coming, coming home

After spending the previous two weeks in the far reaches 🚀 of space 🌌, it was time 🕰 to return to my normal sphere of activity and back to the more familiar data realm. This week we took a journey into Google’s data warehouse solution, better known as BigQuery🔎. This was our third go-around in GCP, as we previously looked at Google’s compute offering and its data messaging service, Pub/Sub.

It seems to me that the more we dive into the Google Cloud Platform ecosystem, the more impressed we become with the GCP offerings. On the data warehouse front, we had previously visited Snowflake❄️, which we found to be a nifty SaaS solution for DW. After tinkering around a bit with BigQuery🔎, we found it to be equally utilitarian. BigQuery🔎, like its strong competitor, offers a “No-Ops” approach to data warehousing while adhering to the 3Vs of Big Data. Although we didn’t benchmark either DW solution ourselves, both offerings are highly performant based on the 99-query TPC-DS industry benchmark.

In the case of Snowflake❄️, it offers data professionals the flexibility to scale compute and storage resources up and down independently based on workloads, whereas BigQuery🔎 is “serverless” and all scaling is done automatically. BigQuery🔎 doesn’t use indexes; rather, it relies on its awesome clustering technology to make its queries scream 💥.
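For a flavor of what that looks like in practice, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names are made-up placeholders, and it assumes default application credentials are already set up:

```python
# Minimal sketch: a date-partitioned, clustered table in BigQuery.
# Project/dataset/table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # relies on Application Default Credentials

ddl = """
CREATE TABLE IF NOT EXISTS `my-project.sales_dw.orders`
(
  order_id    STRING,
  customer_id STRING,
  order_ts    TIMESTAMP,
  amount      NUMERIC
)
PARTITION BY DATE(order_ts)   -- prune whole partitions by date
CLUSTER BY customer_id        -- co-locate rows by customer for faster filters
"""

client.query(ddl).result()  # no servers, indexes, or vacuuming to manage
print("Table created; BigQuery handles scaling and storage layout on its own.")
```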

Both Snowflake❄️ and BigQuery🔎 are quite similar in their low maintenance and minimal task administration, but where both products make your head spin🌪 is trying to make heads or tails out of their pricing models. Snowflake’s❄️ pricing is a little easier to interpret, whereas with BigQuery🔎 it seems like you need a Ph.D. in cost models just to read through the fine print.

Accurately determining which product offers a better TCO really depends on your individual workloads. From what I gather, if you’re running lots of queries sporadically, with high idle time, then BigQuery🔎 is the place to be from a pricing standpoint. However, if your workloads are more consistent, then it’s probably more cost effective to go with Snowflake❄️ based on their pay-as-you-go model.
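To make the trade-off concrete, here is a hedged back-of-envelope sketch in Python. The list prices are illustrative assumptions only (they vary by edition, region, and contract), not quotes:

```python
# Back-of-envelope TCO sketch with ASSUMED, illustrative list prices.
BQ_ON_DEMAND_PER_TB = 5.00   # assumed $/TB scanned for on-demand BigQuery
SF_CREDIT_PRICE = 3.00       # assumed $/credit for Snowflake (edition-dependent)
SF_CREDITS_PER_HOUR = 1      # an X-Small virtual warehouse burns ~1 credit/hour

def bigquery_monthly(tb_scanned: float) -> float:
    """Sporadic, idle-heavy workloads: you pay only for bytes scanned."""
    return tb_scanned * BQ_ON_DEMAND_PER_TB

def snowflake_monthly(hours_running: float) -> float:
    """Steady workloads: you pay for warehouse uptime, not bytes."""
    return hours_running * SF_CREDITS_PER_HOUR * SF_CREDIT_PRICE

print(bigquery_monthly(2),   "vs", snowflake_monthly(176))  # sporadic: BigQuery cheaper
print(bigquery_monthly(200), "vs", snowflake_monthly(176))  # scan-heavy: Snowflake cheaper
```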

Assisting us on our exploration of BigQuery was the founder of LoonyCorn, the bright and talented Janani Ravi, through her excellent Pluralsight course. Janani gives a great overview of the solution and keeps the flow of the course at a reasonable pace as we take a deep dive into this complex data technology.

One interesting observation about the course, as it was published about a year and a half ago (15 Oct 2018), is how many improvements Google has made to the product offering since then, including a refined UI, more options for partition keys, and enhancements to the Python module. The course touches on design, comparisons to RDBMSs and other DWs, and shows us ingestion of different file types, including the popular binary format Avro.

The meat🥩 and potatoes🥔 of the course is the section on Programmatically Accessing BigQuery from Client Programs. This is where Janani goes into some advanced programming options like the UNNEST, ARRAY_AGG, and STRUCT operators and the powerful windowing operations (a small query sketch follows).
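To give a taste of those operators, here is a small, self-contained sketch (the data is generated inline, so nothing real is assumed) that builds an ARRAY of STRUCTs, flattens it with UNNEST, re-nests it with ARRAY_AGG, and ranks the results with a window function, all driven from the Python client:

```python
# Sketch of UNNEST / ARRAY_AGG / STRUCT / window functions via the Python client.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
WITH orders AS (
  SELECT 'c1' AS customer_id, [STRUCT('apples' AS item, 3 AS qty),
                               STRUCT('pears'  AS item, 5 AS qty)] AS line_items
  UNION ALL
  SELECT 'c2', [STRUCT('apples' AS item, 7 AS qty)]
),
flattened AS (
  SELECT o.customer_id, li.item, li.qty
  FROM orders AS o, UNNEST(o.line_items) AS li         -- flatten the ARRAY of STRUCTs
)
SELECT
  customer_id,
  ARRAY_AGG(item ORDER BY qty DESC) AS items_by_qty,   -- re-nest per customer
  SUM(qty) AS total_qty,
  RANK() OVER (ORDER BY SUM(qty) DESC) AS qty_rank     -- windowing over the groups
FROM flattened
GROUP BY customer_id
"""

df = client.query(sql).to_dataframe()  # results land straight in pandas
print(df)
```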

See log for more details

To round out the course, she takes us through some additional nuggets in GCP, like Google Data Studio (https://datastudio.google.com/) for data visualization, and working with Cloud☁️ Notebooks📓 and Python by utilizing Google Datalab🧪.

Stay, ahhh Just a little bit longer Please, please, please, please, please Tell me that you’re going to….

Below are some topics I am considering for my travels next week:

  • Google Cloud Data Fusion (EL/ETL/ELT)
  • More on Google BigQuery
  • More on Data Pipelines
  • NoSQL – MongoDB, Cosmos DB
  • Working with JSON files
  • Working with Parquet files 
  • JDBC Drivers
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • ETL Solutions (Stitch, FiveTran) 
  • Processing and transforming data / exploring data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS

Week of April 10th

“…When the Promise of a brave new world unfurled beneath a clear blue Sky”

“Forests, lakes, and rivers, clouds and winds, stars and flowers, stupendous glaciers and crystal snowflakes – every form of animate or inanimate existence, leaves its impress upon the soul of man.” — Orison Swett Marden

My journey for this week turned out to be a sort of potpourri of various technologies and solutions, thanks to the wonderful folks at MSFT.  After some heavy soul searching over the previous weekend, I decided that my time would be best spent this week recreating the SQL Server 2016 Always On environment (previously created several weeks back on AWS EC2), but this time on the MS Azure cloud platform.  The goal would be to better understand Azure and how it works. In addition, I would be able to compare and contrast AWS EC2 vs. Azure VMs and list the pros and cons of each cloud provider.

But before I could get my head into the clouds, I was still lingering around in the bamboo forests. This past weekend, I was presented with an interesting scenario: stream market data to pandas from the Investors Exchange (thanks to my friend). So after consulting with Mr. Google, I was pleasantly surprised to find that IEX offers an API that allows you to connect to their service and stream messages directly to Python, using pandas for data visualization and analysis. Of course, being the cheapskate that I am, I signed up for a free account and off I went.

So I started tickling the keys and produced a newly minted IEX Py script. After some brief testing, I started receiving an obscure error. Of course, there was no documented solution on how to address such an error.

So after some fruitless nonstop pip-installing of several modules, I was still getting the same error. 🙁 After a moment of clarity, I deduced there was probably a limit on the number of messages you can stream from a free IEX account.

So I took a shot in the dark and decided to register for another account (under a different email address); this way I would receive a new token and could give that a try.

… And Oh là là!  My script started working again! 🙂 Of course, as I continued to add more functionality and test my script, I ran back into the same error, but this time I knew exactly how to resolve it.

So I registered for a third account (to yet again generate a new token). Fortunately, I completed my weekend project. See the attachments Plot2.png and Plot3.png for pretty graphs.

Now that I could see the forest for the trees, it was off to the cloud! I anticipated that it would take me a full week to explore Azure VMs, but it actually only took a few days to wrap my head around them.

So this left me a chance to pivot again, this time to a data warehouse / data lake solution built for the cloud, turning the forecast for the rest of the week to Snow.

Here is a summary of what I did this week:

Sunday:

  • Developed a pandas/Python script in conjunction with the iexfinance & matplotlib modules to build graphs showing the historical price of MSFT for 2020 and a comparison of MSFT vs. INTC for Jan 2nd – April 3rd 2020 (a rough sketch of the approach appears below)
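Here is a minimal sketch of that kind of script using the iexfinance and matplotlib modules; the token is a placeholder and the plotting details are illustrative rather than the exact script:

```python
# Sketch of the IEX Cloud / pandas weekend project (token is a placeholder).
from datetime import datetime
import matplotlib.pyplot as plt
from iexfinance.stocks import get_historical_data

TOKEN = "pk_your_free_iex_token_here"  # free-tier tokens hit message limits quickly!

start, end = datetime(2020, 1, 2), datetime(2020, 4, 3)

# Pull daily closing prices straight into pandas DataFrames
msft = get_historical_data("MSFT", start, end, close_only=True,
                           output_format="pandas", token=TOKEN)
intc = get_historical_data("INTC", start, end, close_only=True,
                           output_format="pandas", token=TOKEN)

# Compare MSFT vs. INTC closing prices for Jan 2nd - April 3rd 2020
ax = msft["close"].plot(label="MSFT", figsize=(10, 5))
intc["close"].plot(ax=ax, label="INTC")
ax.set_title("MSFT vs INTC closing price, Jan 2 - Apr 3, 2020")
ax.set_ylabel("USD")
ax.legend()
plt.savefig("Plot3.png")
```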

Monday: (Brief summary)

  • Followed previous steps to build the plumbing on Azure for my small SQL Server farm (see previous status on AWS EC2 for more details)
  1. Created a Resource Group
  2. Created an Application Security Group
  3. Created 6 small Windows VMs in the same region and availability zone
  4. Joined them to the Windows domain

Tuesday: (Brief summary)

  1. Created a Windows Failover Cluster
  2. Installed SQL Server 2016
  3. Set up and configured AlwaysOn AGs and listeners

 Observations with Azure VMs:

Cons

  • Azure VMs are very slow the first time they are brought up after a build
  • Azure VMs have a longer provisioning time than EC2 instances
  • No UI option to perform bulk tasks (like the AWS UI offers); the only option is templating through scripting
  • Cannot move a Resource Group from one geographical location to another the way you can VMs and other objects within Azure
  • When deleting a VM, its child dependencies (security groups, NICs, disks) are not dropped – perhaps this is by design?

– Objects need to be disassociated from their groups and then deleted to clean up orphaned objects

Neutral

  • Easy to migrate VMs to larger t-shirt sizes
  • Easy to provision storage volumes per VM
  • Application Security Groups can be used to manage TCP/UDP traffic for an entire resource group

Pros

  • You can migrate existing storage volumes to premium or cheaper storage seamlessly
  • Less network administration
    • Fewer TCP/UDP ports need to be opened, especially ports native to Windows domains
  • Very easy to build Windows Failover Clustering services
    • Natively works in the same subnet
    • Less configuration to get connectivity working than on AWS EC2
  • Very easy to configure SQL Server 2016 Always On
    • No need to create 5 listeners (one per subnet) for a given AG
    • 1 listener per AG
  • Free cost, performance, and operational excellence recommendations pop up after login

Wednesday:

  • Registered for an eval account for a Snowflake instance
  • Attended the Zero to Snowflake in 90 Minutes virtual lab
    • Created databases, data warehouses, user accounts, and roles
    • Created stages to be used for data import
    • Imported data sources (data in S3 buckets, CSV and JSON formats) via the Web UI and the SnowSQL command-line tool
    • Ran various ANSI-standard SQL queries to generate reports from Snowflake (a small connector sketch follows)
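For anyone who would rather script that flow than click through the Web UI, here is a minimal sketch using the snowflake-connector-python package; the account, credentials, stage, and table names are placeholders, and a target table is assumed to already exist:

```python
# Minimal sketch: stage + load + query in Snowflake from Python.
# Account, credentials, and object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",  # placeholder account locator
    user="MY_USER",
    password="MY_PASSWORD",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Create an internal stage and load a local CSV into an existing table
cur.execute("CREATE STAGE IF NOT EXISTS trips_stage")
cur.execute("PUT file:///tmp/trips.csv @trips_stage AUTO_COMPRESS=TRUE")
cur.execute("""
    COPY INTO trips
    FROM @trips_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Run a standard SQL report query
cur.execute("""
    SELECT start_station, COUNT(*) AS rides
    FROM trips
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```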

Thursday:

Friday:

**Bonus Points**

  • More Algebra – Regents questions. 
  • More with conjugating verbs in Español (AR Verbs)

Next Steps.. 
Below are some topics I am considering for my voyage next week:

  •  SQL Server Advanced Features:

           – Columnstore Indexes
           – Best practices around SQL Server AlwaysOn (Snapshot Isolation/sizing of Tempdb, etc)

  • Data Visualization Tools (i.e. Looker)
  • ETL Solutions (Stitch, FiveTran) 
  • Processing and transforming data / exploring data through ML (i.e. Databricks)
  • Getting Started with Kubernetes with an old buddy (Nigel)

Stay safe and Be well

—MCS