Week of April 10th

“…When the Promise of a brave new world unfurled beneath a clear blue Sky”

“Forests, lakes, and rivers, clouds and winds, stars and flowers, stupendous glaciers and crystal snowflakes – every form of animate or inanimate existence, leaves its impress upon the soul of man.” — Orison Swett Marden

My journey this week turned out to be a potpourri of various technologies and solutions, thanks to the wonderful folks at MSFT. After some heavy soul-searching over the previous weekend, I decided my time would be best spent recreating the SQL Server 2016 Always On environment (built several weeks back on AWS EC2), this time on the Microsoft Azure cloud platform. The goal was to better understand Azure and how it works. In addition, I would be able to compare and contrast AWS EC2 instances vs. Azure VMs and list the pros and cons of each cloud provider.

But before I could get my head into the clouds, I was still lingering in the bamboo forests. This past weekend, I was presented with an interesting scenario: stream market data to pandas from the Investors Exchange (IEX) (thanks to my friend). After consulting with Mr. Google, I was pleasantly surprised to find that IEX offers an API that lets you connect to their service and stream messages directly into Python, then use pandas for data visualization and analysis. Of course, being the cheapskate that I am, I signed up for a free account and off I went.

So I started tickling the keys and produced a newly minted IEX Python script. After some brief testing, I started receiving an obscure error. Of course, there was no documented solution on how to address such an error…

After some fruitless, nonstop pip-installing of several modules, I was still getting the same error. 🙁 Then, in a moment of clarity, I deduced there was probably a limit on the number of messages you can stream from a free IEX account…

So I took a shot in the dark and registered for another account (under a different email address); this way I would receive a new token to try.

… And oh là là! My script started working again! 🙂 Of course, as I continued to add functionality and test my script, I ran into the same error again, but this time I knew exactly how to resolve it.

So I registered for a third account (to yet again generate a new token). Fortunately, that let me complete my weekend project. See attachments Plot2.png and Plot3.png for pretty graphs.

Now that I could see the forest for the trees, it was off to the cloud! I anticipated it would take a full week to explore Azure VMs, but it actually took only a few days to wrap my head around them…

This left me a chance to pivot again, this time to a data warehouse/data lake solution built for the cloud, turning the forecast for the rest of the week to Snow.

Here is a summary of what I did this week:

Sunday:

  • Developed a Pandas/Python script, using the iexfinance and matplotlib modules, to graph MSFT's historical price for 2020 and compare MSFT vs. INTC from January 2nd through April 3rd, 2020 (a minimal sketch follows)
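For the record, here's a minimal sketch of the comparison half of that script, assuming the iexfinance module and a free IEX Cloud account; the token value is a placeholder, and each account gets its own:

```python
# Minimal sketch of the MSFT vs. INTC comparison, assuming the
# iexfinance module and a free IEX Cloud account token.
from datetime import datetime

import matplotlib.pyplot as plt
from iexfinance.stocks import get_historical_data

IEX_TOKEN = "pk_xxx"  # placeholder: each account gets its own token

start = datetime(2020, 1, 2)
end = datetime(2020, 4, 3)

# Pull daily history for each ticker as a pandas DataFrame.
msft = get_historical_data("MSFT", start, end,
                           output_format="pandas", token=IEX_TOKEN)
intc = get_historical_data("INTC", start, end,
                           output_format="pandas", token=IEX_TOKEN)

# Plot the daily closes side by side.
plt.plot(msft.index, msft["close"], label="MSFT")
plt.plot(intc.index, intc["close"], label="INTC")
plt.title("MSFT vs INTC (Jan 2 - Apr 3, 2020)")
plt.ylabel("Close ($)")
plt.legend()
plt.show()
```

Every call like this consumes message credits against the account's monthly quota, which is exactly why the free tier kept cutting me off.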

Monday: (Brief summary)

  • Followed my previous steps to build the plumbing on Azure for my small SQL Server farm (see the previous status report on AWS EC2 for more details); a scripted sketch follows this list
  1. Created a Resource Group
  2. Created an Application Security Group
  3. Created six small Windows VMs in the same region and availability zone
  4. Joined them to the Windows domain
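Since (as noted under Cons below) there's no UI for bulk tasks, scripting is the way to stamp out six identical VMs. A hedged sketch driving the Azure CLI from Python; the resource names, location, image alias, size, and credentials are my own placeholders, not the actual values I used:

```python
# Sketch: stamp out six identical Windows VMs by looping over the
# Azure CLI. Names, location, image alias, size, and credentials
# are placeholders.
import subprocess

RG, LOCATION = "sql-farm-rg", "eastus"

subprocess.run(["az", "group", "create",
                "--name", RG, "--location", LOCATION], check=True)

for i in range(1, 7):
    subprocess.run([
        "az", "vm", "create",
        "--resource-group", RG,
        "--name", f"sqlnode{i}",
        "--image", "Win2016Datacenter",
        "--size", "Standard_B2ms",
        "--zone", "1",                     # keep them in one AZ
        "--admin-username", "azureadmin",
        "--admin-password", "<password>",  # placeholder
    ], check=True)
```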

Tuesday: (Brief summary)

  1. Created a Windows Failover Cluster
  2. Installed SQL Server 2016
  3. Set up and configured Always On AGs and listeners (a quick health-check sketch follows)
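Once the AGs were up, a quick way to sanity-check replica health from Python is to query the standard Always On DMVs through the listener. A sketch, assuming the pyodbc module, the Microsoft ODBC driver, and domain (trusted) authentication; the listener name is a placeholder:

```python
# Sketch: check Always On replica health through the AG listener,
# using the standard DMVs. The listener name is a placeholder.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=aglistener1;DATABASE=master;Trusted_Connection=yes;"
)

# One row per replica per availability group.
rows = conn.execute("""
    SELECT ag.name,
           ar.replica_server_name,
           rs.role_desc,
           rs.synchronization_health_desc
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar
         ON ag.group_id = ar.group_id
    JOIN sys.dm_hadr_availability_replica_states AS rs
         ON ar.replica_id = rs.replica_id
""").fetchall()

for ag_name, replica, role, health in rows:
    print(f"{ag_name}: {replica} is {role} ({health})")
```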

Observations on Azure VMs:

Cons

  • Azure VMs are very slow the first time they're brought up after a build
  • Azure VMs have longer provisioning times than EC2 instances
  • No UI option for bulk tasks (unlike the AWS UI); the only option is templating through scripting (see the sketch under Monday)
  • Cannot move a Resource Group from one geographical location to another the way you can move VMs and other objects within Azure
  • When deleting a VM, its child dependencies are not dropped (security groups, NICs, disks) – perhaps this is by design?

– Objects need to be dissociated from their groups and then deleted to clean up orphaned objects

Neutral

  • Easy to migrate VMs to larger T-shirt sizes (a resize sketch follows the Pros list)
  • Easy to provision storage volumes per VM
  • Application Security Groups can be used to manage TCP/UDP traffic for an entire resource group

Pros

  • You can migrate existing storage volumes to premium or cheaper storage seamlessly
  • Less network administration
    • Fewer TCP/UDP ports need to be opened, especially ports native to Windows domains
  • Very easy to build Windows Failover Clustering services
    • Natively works in the same subnet
    • Less configuration to get connectivity working than on AWS EC2
  • Very easy to configure SQL Server 2016 Always On
    • No need to create five listener IP addresses (one per subnet) for a given AG
    • One listener per AG
  • Free cost, performance, and operational-excellence recommendations pop up after login
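On the T-shirt-size point, resizing is a two-command affair with the Azure CLI, sketched here from Python; the resource and size names are placeholders:

```python
# Sketch: move a VM to a larger "T-shirt size" with the Azure CLI.
# Resource and size names are placeholders.
import subprocess

RG, VM = "sql-farm-rg", "sqlnode1"

# Show the sizes this VM can move to...
subprocess.run(["az", "vm", "list-vm-resize-options",
                "--resource-group", RG, "--name", VM,
                "--output", "table"], check=True)

# ...then resize (the VM restarts as part of the operation).
subprocess.run(["az", "vm", "resize",
                "--resource-group", RG, "--name", VM,
                "--size", "Standard_D4s_v3"], check=True)
```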

Wednesday:

  • Registered for an eval account for a Snowflake instance
  • Attended the Zero to Snowflake in 90 Minutes virtual lab (a connector sketch follows this list)
    • Created databases, data warehouses, user accounts, and roles
    • Created stages to be used for data import
    • Imported data sources (data in S3 buckets, in CSV and JSON formats) via the Web UI and the SnowSQL command-line tool
    • Ran various ANSI SQL (SQL-92) queries to generate reports from Snowflake
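Beyond the Web UI and SnowSQL, the same queries can be run programmatically. A minimal sketch with the snowflake-connector-python module; the account identifier, credentials, and object names are placeholders, not the lab's actual values:

```python
# Sketch: run a query against the Snowflake eval account from Python.
# Account identifier, credentials, and object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="eval_user",
    password="<password>",
    account="<account_identifier>",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Plain ANSI SQL, same as in a Web UI worksheet.
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone()[0])
finally:
    cur.close()
    conn.close()
```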

Thursday:

Friday:

**Bonus Points**

  • More Algebra Regents questions
  • More conjugating of verbs in Español (AR verbs)

Next Steps…
Below are some topics I am considering for my voyage next week:

  •  SQL Server Advanced Features:

           – Columnstore Indexes
           – Best practices around SQL Server AlwaysOn (Snapshot Isolation/sizing of Tempdb, etc)

  • Data visualization tools (e.g., Looker)
  • ETL solutions (Stitch, Fivetran)
  • Processing and transforming data / exploring data through ML (e.g., Databricks)
  • Getting Started with Kubernetes with an old buddy (Nigel)

Stay safe and Be well

—MCS 

Week of April 3rd

“The other day, I met a bear. A great big bear, a-way out there.”

As reported last week, I began to dip my toe into the wonderful world of Python. I wasn't able to finish the Core Python: Getting Started Pluralsight course by Robert Smallshire and Austin Bingham, so I had to do some extended learning over the weekend. I finished the "Iteration and Iterables" module, which I had started last Friday, and then spent the rest of the weekend on the "Classes" module, which was nothing short of a nightmare. I spent numerous hours trying to debug my horrific code and rewatching the lessons in the module over and over again. This left me with the conclusion that I simply don't get object-oriented programming and probably never will…


Ironically, it's a conclusion I first reached almost 25 years ago in my last class at the University at Albany, which was on C++ object-oriented programming. Fortunately, I escaped that one with a solid D-, passed Go, collected $200, and moved on to the working world. So after languishing with classes in Python, on Monday I moved on to the final module, on File IO and Resource Management, which seemed more straightforward and practical.
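For what it's worth, here's the flavor of thing those two modules build toward: a small class that also works as a context manager, tying Classes to Resource Management. This is purely my own toy illustration, not code from the course:

```python
# My own toy illustration: a class that is also a context manager,
# tying the Classes module to Resource Management.
class Flight:
    """A flight with a fixed passenger capacity."""

    def __init__(self, number, capacity):
        self._number = number
        self._capacity = capacity
        self._passengers = []

    def add_passenger(self, name):
        if len(self._passengers) >= self._capacity:
            raise ValueError(f"Flight {self._number} is full")
        self._passengers.append(name)

    def __enter__(self):                     # entering a with-block
        print(f"Boarding flight {self._number}")
        return self

    def __exit__(self, exc_type, exc, tb):   # always runs on exit
        print(f"Doors closed on flight {self._number}")
        return False                         # don't swallow errors


with Flight("BA117", capacity=2) as f:
    f.add_passenger("Ada")
    f.add_passenger("Grace")
```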


On Tuesday, life got a whole lot easier when I installed Anaconda Navigator. Up until this point, I had been writing my Python scripts in the TextWrangler editor on the Mac, which was not ideal.


Through Anaconda, I discovered the Spyder IDE, which was like a breath of fresh air. No longer did I have to worry about misaligned spaces or unmatched parentheses and curly and square brackets. With a proper IDE environment, I was able to begin my journey into the pandas jungle…


Here is what I did:

  1. Completed the Pandas Fundamentals course
  2. Installed Anaconda, the pandas Python module, and SQLite
  3. Created Pandas/Python scripts that (a condensed sketch follows this list):
    1. Read in a CSV file (the Tate Museum collection) and output it to a pickle file
    2. Read in a JSON file and write the output to the screen
    3. Traverse directories containing multiple JSON files and write the output to a file
    4. Perform iteration, aggregation, and filtering (transformation)
    5. Create indexes on data from the CSV file for faster retrieval of data
    6. Read a data source (the Tate Museum collection) and output the data to Excel spreadsheets, with multiple columns, multiple sheets, and colored-column options
    7. Connect to an RDBMS using the SQLAlchemy module (using a SQLite database as a POC), create a table, and write data to it from a data source (a pickle file)
    8. Create JSON file output from a data source (a pickle file)
    9. Create graphs using the matplotlib.pyplot module. See attachment.
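Here's a condensed sketch stitching a few of those steps together; the file and column names are placeholders rather than the real Tate dataset schema, and writing .xlsx assumes the openpyxl engine is installed:

```python
# Condensed sketch of several of the scripts above. File and column
# names are placeholders, not the real Tate dataset schema.
import matplotlib.pyplot as plt
import pandas as pd
from sqlalchemy import create_engine

# Step 1: CSV in, pickle out.
df = pd.read_csv("artwork_data.csv", index_col="id")
df.to_pickle("artwork.pickle")

# Step 4: iteration, aggregation, and filtering.
df = pd.read_pickle("artwork.pickle")
per_artist = df.groupby("artist").size()
tall = df[df["height"] > 1000]

# Step 6: Excel output with multiple sheets (assumes openpyxl).
with pd.ExcelWriter("artwork.xlsx") as writer:
    per_artist.to_frame("works").to_excel(writer, sheet_name="by_artist")
    tall.to_excel(writer, sheet_name="tall_pieces")

# Step 7: write to a SQLite table via SQLAlchemy.
engine = create_engine("sqlite:///artwork.db")
df.to_sql("artwork", engine, if_exists="replace")

# Step 9: a quick bar chart of the ten most prolific artists.
per_artist.sort_values(ascending=False).head(10).plot.bar()
plt.tight_layout()
plt.show()
```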

**Bonus Points** Continued to dredge up old nightmares from freshman year of high school as I took a stroll down memory lane with distributing binomials, perfect-square binomials, difference-of-squares binomials, factoring perfect-square trinomials and differences of squares, F.O.I.L., and other algebraic muses.

In addition, I revisited conjugating verbs in Español and wrote descriptions (en Español) for nine family members.

Next Steps…
There are many places I still need to explore…

Below are some topics I am considering:

  • A Return to SQL Server Advanced Features:

            – Columnstore Indexes
            – Best practices around SQL Server AlwaysOn (Snapshot Isolation/sizing of Tempdb, etc)

  • Getting Started with Kubernetes with an old buddy (Nigel)
  • Getting Started with Apache Kafka 
  • Understanding Apache ZooKeeper and its use cases

I will give it some thought over the weekend and start fresh on Monday.
Stay safe and Be well

—MCS