SQL | SQL Squirrels

Week of September 25th

Posted on September 25, 2020 by Mark Shay

“Dynamite🧨 with a laser beam💥…Guaranteed to blow💨 your mind🧠”

Happy National Lobster 🦞 Day!

“And here I go again on my own… Goin’ down the only road I’ve ever known”

This week we continued where we last left off the previous week as we continued exploring the depths of SQL Server 2019. Last week, we just merely scratched 💅 the surface of SQL Server 2019 as we dove🤿 into IQP and the improvements made to TempDB. This week we tackled Microsoft’s most ambitious SQL Server offering to date in SQL Server 2019 Big Data Clusters (BDC). When I first thought of BDCs the first thing that came to mind 🤔 was a Twix Bar 🍫. Yes, we all know Twix is the is the “only candy with the cookie crunch” but what makes the Twix bar so delicious 😋 is the perfect melding of smooth Chocolate, Chewy Carmel and of course crisp cookie🍪! Well, that’s exactly what Big Data Cluster is like… You’re probably thinking Huh?

Big Data Clusters (BDC) is MSFT’s groundbreaking new Big Data/Data Lake architecture that unifies virtual business data and operational data stored in relational databases with IoT for true real-time BI and embedded Artificial Intelligence (AI) and Machine Learning (ML). BDC combines the power⚡️of SQL Server, Spark 🔥, and the Hadoop Distributed File System (HDFS) 🐘 into a unified data platform. But that’s not all! Since BDC runs natively on Linux🐧 it’s able to embrace modern architectures for deploying applications like Linux-based Docker🐧- 🐳 containers on Kubernetes ☸︎.

By Leveraging K8s ☸︎ for orchestration, deployments of BDCs are predictable, fast 🏃🏻, elastic🧘‍♀️ and scalable ⚖️. Seeing that Big data clusters can run any Kubernetes ☸︎ environment whether it be on-premise 🏠 (i.e. Red Hat OpenShift) or in the cloud☁️ (i.e. Amazon EKS); BDC makes a perfect fit to be hosted on Azure Kubernetes Service (AKS).

Another great feature that BDC makes use of is data virtualization, also known as “Polybase“. Polybase made its original debut with SQL Server 2016. It had seemed like Microsoft had gone to sleep😴 on it but now with SQL Server 2019 BDC, Microsoft has broken out big time! BDC takes advantage of Data Virtualization as it’s “data hub”. So, you don’t need to spend time 🕰 and expense💰 of traditional extract, transform, and load (ETL) 🚜 and Data Pipelines. In addition, it lets organizations leverage existing SQL Server expertise and extract value from third-party data sources such as NoSQL, Oracle, Teradata and HDFS🐘.

Lastly, BDC takes advantage of Azure Data Studio (ADS) for both deployments and administration of BDC. For those who are not familiar with ADS, it’s a really cool 😎 tool 🛠 that you can benefit from by acquainting yourself with. Of course, SSMS isn’t going anywhere but with ADS you get a cross-platform database lightweight tool 🛠 that uses Jupyter notebooks📒 and Python🐍 scripting making deployments and administration of BDCs a breeze. OK, I am ready to rock and roll 🎶!

“I love❤️ rock n’ roll🎸… So put another dime in the jukebox 📻, baby”

Before we can jump right into the deep end of the Pool 🏊‍♂️ with Big Data Cluster, we felt a need for a little primer on Virtualization, Kubernetes ☸︎, and Containers. Fortunately, we knew just who to deliver the perfect overview, no other then one of the prominent members of the Mount Rushmore of SQL Server, Buck Woody who through his guest appearance with Long time product team member and Technology Evangelist Sanjay Soni in the Introduction to Big Data Cluster on SQL Server 2019 | Virtualization, Kubernetes ☸︎, and Containers YouTube Video. Buck who has a very unique style and an amazing skill of taking complex technologies and making them seem simple. His supermarket 🛒 analogy to explain Big Data, Hadoop and Spark 🔥, virtualization, containers, and Kubernetes ☸︎ is pure brilliance! Now, armed 🛡🗡 with Buck’s knowledge bombs 💣 we were ready to light 🔥 this candle 🕯!

“Let’s get it started (ha), let’s get it started in here”

Taking us through BDC architecture and deployments was newly minted Microsoft Data Platform MVP Mohammad Darab who put together a series of super exciting videos as well as excellent detailed blog posts on BDC and Long Time SQL Server veteran and Microsoft Data Platform MVP Ben Weissman through his awesome Building Your First Microsoft SQL Server Big Data Cluster Pluralsight course. Ben’s speculator course covers not only architecture and deployment but how to get data in and out of the, make the most of out of them and how to monitor, maintain, and troubleshoot BDC.

“If you have built castles 🏰 in the air, your work need not be lost; that is where they should be. Now put the foundations 🧱 under them.” – Henry David Thoreau

A big data cluster consists of several major components:

Controller (Control plane)
- Master Instance – manage connectivity (endpoints and communication with the other Pools 🏊‍♂️), scale-out ⚖️ queries, metadata and user databases (target for Restore databases), and machine learning services.

Data not residing in your master instance will be exposed through the concept of External tables. Examples are CSV files on an HDFS store or Data on another RDBMS. External tables can be queried the same as local tables on the SQL Server with several cavorts:

Unable modify the table structure or content
No Indexes can be applied (only statistics are kept. SQL Server to houses that meta data.
Data source might require you to provide credentials 🔐.
Data source may also need a format definition like text qualifiers or separators for CSV files 🗃

Compute Pool 🏊‍♂️
Storage Pool 🏊‍♂️
Data Pool 🏊‍♂️
Application Pool 🏊‍♂️

The controller provides management and security for the cluster and acts as the control plane for the cluster. It takes care of all the interactions with K8s ☸︎, the SQL Server instances that are part of the cluster and other components like HDFS🐘, Spark🔥, Kibana, Grafana, and Elastic Search

The controller manages:

Cluster lifecycle, bootstrap, delete update, etc.
Master SQL Server Instance
Compute, data, and Storage Pools 🏊‍♂️
Cluster Security 🔐

The Compute Pool 🏊‍♂️ is a set a stateless multiple SQL Server 2019 instances that work together. The Compute Pool 🏊‍♂️ leverage Data Virtualization or Polybase to scale-out⚖️ queries across partitions. The Compute Pool 🏊‍♂️ is automatically provisioned as part of BDC. Management and Patching of Computer Pools 🏊‍♂️ is easy because they run on Docker 🐳 containers running on K8☸️ pods.

Please note: Queries on BDC can also function without the Compute Pool 🏊‍♂️.

Compute Pool 🏊‍♂️ is responsible for:

Joining of two or more directories 📂 in HDFS🐘 with 100+ files
Joining of two or more data sources
Joining multiple tables with different partitioning or distribution schemes
Data stored in Blob Storage

The Storage Pool 🏊‍♂️ consists of pods comprised of SQL Servers on Linux🐧 and Spark🔥 on HDFS 🐘 (deployed automatically). All the Nodes of the BDC are members of an HDFS🐘 Cluster. The Storage Pool 🏊‍♂️ stores file‑based 🗂 data like a CSV and queried directly through external tables and T‑SQL, or you can use Python🐍 and Spark🔥. If you already have an HDFS🐘 on either Azure Data Lake (ADLS) store or AWS S3 buckets🗑 you can easily mount your existing storage into your BDC without the need to shift all your data around.

The Storage Pool 🏊‍♂️ is responsible for:

Data ingestion through Spark🔥
Data Storage in HDFS🐘 (Parquet format). HDFS🐘 data is spread across all storage nodes in the BDC for persistency
Data access through HDFS🐘 and SQL Server Endpoints

The Data Pool 🏊‍♂️ is used for data persistence and caching. Under the covers the Data Pool 🏊‍♂️ is a set of SQL Servers (Defined at Deployment) that are using Columnstore Index and Sharding. In other words, SQL Server will create physical tables with same structure and evenly distribute the data across the total number of servers. The Queries will be also be distributed across server but to the user it will be transparent as all the magic✨🎩 is happening behind the scenes. The data stored in the data Pool 🏊‍♂️ does not support transactions as its sole purpose is for caching. It is used to ingest data from SQL Queries or Spark🔥 Jobs. BDC data marts are persisted in the data Pool 🏊‍♂️.

Data Pool 🏊‍♂️ is used for:

Complex Query Joins
Machine Learning
Reporting

The application Pool 🏊‍♂️ is used to run jobs like SSIS, store and execute ML models, and all kinds of other applications which are generally exposed through a web service.

“This is 10% luck🍀… 20% skill… 15% concentrated power🔌 of will… 5% pleasure😆… 50% pain🤕… And a 100% reason to remember the name”

After both great overviews of the BDC architecture by both Mo and Ben, we were eager to build this spaceship 🚀. First, we need to download the required tools 🛠🧰 so we can happily😊 have our base machine where we can deploy our first BDC. I have chosen the Mac 💻 just to spice 🥵 things up as my base machine.

Below are the Required Tools:

Azure Data Studio (ADS)
ADS Extension for Data Virtualization
Python 🐍
Kubectl ☸️
azdata
Azure-CLI (and only required if using AKS)

Optional Tools:

Text Editor 🗒
PowerShell Extension for ADS
Zip utility Software 🗜
SQL Server Utilities
SSH

Now that we got all our prerequisites downloaded, we next needed to determine where we should deploy our BDC. The most natural choice seemed to be with AKS.

To help walk🚶‍♂️us through the installation of our base machine and the deployment of BDS using ADS on AKS, we once again turned to Mo Darab who put together an excellent and easy to follow along series of videos Deploying Big Data Clusters on his YouTube channel.

Base Machine

In his video “How to set Up a Base Machine”, Mo used a Windows machine opposed to us who went with the Mac 💻 . But for all intents and purposes the steps are pretty much the same. The only difference is the package📦 manager that we need to use. Windows the recommended package manager is Chocolatey 🍫 while on the Mac 💻 its Brew 🍺.

Here were the basic steps:

Install ADS for MAC
Install Brew
1. /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.s
Install Kubectl and Azure-CLI

§  brew install kubernetes-cli

§  brew update && brew install azure-cli

Install Python
- brew install python3
Install azdata
- brew tap microsoft/azdata-cli-release
- brew update
- brew install azdata-cli
Install ADS Extension for Data Virtualization
Install Pandas (manage package option in ADS/ Add New Pandas/ Click Install)

“Countdown commencing, fire one”

Now, we had our base machine up and running 🏃‍♂️, it was time ⏱ to deploy. To guide us through the deployment process, we once again went to Mo and followed along in his How to Deploy Big Data Cluster on AKS using Azure Data Studio. Mo walked us through the wizard 🧙‍♂️ in ADS which basically creates a Notebook 📒 that builds our BDC. We created our deployment Jupyter notebook 📒 and then clicked Run 🏃‍♂️ and let it rip☄️. Everything seemed to be humming 🎶 along except 2 hours later our Jupyter notebook was still running 🏃‍♂️.

Obviously, something wasn’t right? 🤔 Unfortunately, we didn’t’ have much visibility on what’s happening with the install through the notebook but hey that’s ok we have a terminal prompt in ADS. So we can just run some K8☸️ commands to see what’s happening under the covers. Ok, so after running a few K8 commands we noticed “ImagePullBackOff” error with our SQL Server Images. After a little bit research, we determined someone forgot 🙄to update the Microsoft Repo with the latest CU Image.

“But there is no joy in Mudville—mighty Casey has struck out.”

So we filled a bug 🐞 on GitHub and we ran the BDC wizard 🧙‍♂️ changed the Docker🐳 settings pointing to the next latest package available on the Docker🐳 Registry and we were back in business until… And then Bam!🥊🥊

Operation failed with status: 'Bad Request'. Details: Provisioning of resource(s) for container service mssql-20200923030840 in resource group mssql-20200923030840 failed. Message: Operation could not be completed as it results in exceeding approved Total Regional Cores quota.

What shall we do? What does our Azure Fundamentals training tells us to do? That’s right we go to Azure Portal and submit a support ticket and beg the good people at Microsoft to increase our Cores Quota. So, we did that and almost instantaneously MSFT quick obliged. 😊

Great, we are back in business (Well sort of).. After several more attempts we ran into more Quota issues with all three types (SKU, Static, and Basic) of Public IP Addresses . So, 3 more support tickets later and there was finally joy 😊 to the world🌎.

Next, we turned back to Ben who provided some great demos in his Pluralsight course on how to set up Data Virtualization for both a SQL Server and HDFS 🐘 files as a data sources on BDC

Data Virtualization (SQL Server)

Right‑mouse click ->Virtualize Data
Data virtualization wizard 🧙‍♂️ will launch.
Next step chooses data source

Create Master Key

Create connection and specify username and password

Next, the wizard🧙‍♂️ will connect to the source and a list of the tables and views
Choose the script option

Click on a specific table(s)

The external table inherits a schema from its source, and transformation would happen in your queries.

You can choose between just having them created or to generate a script.

CREATE EXTERNAL TABLE [virt].[PersonPhone]

(

[BusinessEntityID] INT NOT NULL,

[PhoneNumber] NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,

[PhoneNumberTypeID] INT NOT NULL,

[ModifiedDate] SMALLDATETIME NOT NULL

)

WITH (LOCATION = N'[AdventureWorks2015].[Person].[PersonPhone]’, DATA_SOURCE = [SQLServer]);

Data Virtualization (HDFS🐘)

Right‑mouse‑click your HDFS 🐘 and create a new directory
Upload the flight delay dataset to it.
Expand the directory 📁 to see files 🗂

Right‑mouse‑click a file launches a wizard 🧙‍♂️ to virtualize data from CSV files

The wizard 🧙‍♂️will ask you for the database in which you want to create the external table, a name for the data source;
Next step preview of our data,
next step, the wizard 🧙‍♂️ recommends a column type

CREATE EXTERNAL DATA SOURCE [SqlStoragePool]

WITH (LOCATION = N’sqlhdfs://controller-svc/default’);

CREATE EXTERNAL FILE FORMAT [CSV]

WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = N’,’, STRING_DELIMITER = N'”‘, FIRST_ROW = 2));

CREATE EXTERNAL TABLE [csv].[airlines]

(

[IATA_CODE] nvarchar(50) NOT NULL,

[AIRLINE] nvarchar(50) NOT NULL

)

WITH (LOCATION = N’/FlightDelays/airlines.csv’, DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);

Monitoring 🎛 Big Data Clusters through Azure Data Studio

BDC comes with a pre-deployed Grafana container for SQL Server and system metrics. This map server collects all those metrics from every single node, container, and pod and provides them individual dashboards 📊.

The Kibana dashboard 📊 is part of the Elastic Stack and provides a looking glass🔎 into all of your log files in BDC.

Troubleshooting Big Data Clusters through Azure Data Studio

In ADS, on the main dashboard 📊 there is a button Troubleshoot. This provides a library📚 of notebooks 📒 that you can use to troubleshoot and analyze the cluster. Notebooks 📒 are categorized and provide all kinds of different aspects on monitoring 🎛 of a diagnosing to help you repair 🛠 an issue within your cluster.

In addition, the azdata utility can be monitoring 🎛, running queries and notebooks 📒, and retrieve a cluster’s endpoints. the namespace and username as well.

We really enjoyed spending time learning SQL Server 2019 Big Data Cluster. 😊

“And I feel like it’s all been done Somebody’s tryin’ to make me stay You know I’ve got to be movin’ on”✈️

Below are some of the destinations I am considering for my travels for next week:

Google Cloud Certified Associate Cloud Engineer Path

Thanks –

–MCS

Week of September 18th

Posted on September 18, 2020 by Mark Shay

“Catch the mist, catch the myth…Catch the mystery, catch the drift”

L’shanah tovah! 🎉

So after basking in last week’s accomplishment of passing the AZ-900: Microsoft Azure Fundamentals certification exam 📝 this week we decided that we need come down from the clouds☁️☁️. Now, grounded we knew exactly where we needed to go. For those who know me well, I have spent a large part of my career supporting and Engineering solutions for Microsoft SQL Server. Microsoft released SQL Server 2019 back in November of the same year.

For the most part, we really haven’t spent too much on our journey going deep into SQL Server. So it has been long overdue that we give SQL Server 2019 a looksy 👀 and besides we needed to put the S-Q-L back in SQL Squirrels 🐿

SQL Server 2019 had been Microsoft’s most ambitious release of the product that has been around now for just a little over 25 years. 🎂 The great folks at MSFT built on previous innovations in the product to further enhance the development of languages, data types, on-premises or cloud ☁️ environments, and operating systems. Just a few of the big headline features of SQL Server 2019 are:

Intelligent Query Processing (IQP)
Memory-optimized TempDB
Accelerated Database Recovery (ADR)
Data Virtualization with PolyBase
Big Data Clusters

So let’s un-peel the onion 🧅 and begin diving into SQL Server 2019. So, where to begin? Full disclosure, I am kind of a bit of SQL Server 2019 internals geek 🤓. So we started with a tour of some of the offerings in Intelligent Query Processing

The IQP feature in SQL Server 2019 provides a broad impact that improves the performance💥 of existing workloads with minimal implementation effort to adopt. In theory, you just simply change the database compatibility level of your existing user databases when you upgrade to 2019 and your query just works faster 🏃 with no code changes! Of course this sounds 📢 almost too good to be true?

“Here we are now, entertain us”

To take us through an unbiased lens 👓 at IQP is no other than the man, the myth, the legend, Microsoft Certified Master Brent Ozar. Brent produced two amazing presentations for the SQL Server community, What’s New in SQL Server 2019 and The New Robots in SQL Server 2017, 2019 and Azure.

Brent with his always quirky, galvanic, and sometimes cynical style gives us the skinny on the Good 😊, the Bad 😞, and the Ugly😜 on IQP.

“I never played by the rules, I never really cared…My nasty reputation takes me everywhere“

First we began, looking at Table Variables. For those who have been around the SQL Server block 🧱, table variables have been around since the good o’l days of SQL 2000 as a performance 💥 improvement alternative to using temporary tables.

However, SQL Server doesn’t maintain statistics on table variables which as you probably know the query optimizer uses to determine the best execution plan. Basically, the SQL Server query optimizer just assumes that a table variable(s) only has one row whether it does or doesn’t. So performance wasn’t so great to say the least. Thus, Table Variables don’t have the best reputation.

Well, SQL Server 2019 to rescue! (Sort of..) In SQL Server 2019, by simply switching the Database “compat mode” to SQL Server 2019 (15.x) SQL Server, SQL Server will now use the actual number of rows in the table variable to create the plan. This is just fantastic! However, as Brent diligently points out this nifty little enhancement isn’t all Rainbows 🌈 and Unicorns 🦄 . Although it does solve the pesky estimated row problem with the query optimizer has with table variables, but now introduces the enigma that is parameter sniffing.

Parameter sniffing has plagued SQL Server since the beginning of time 🕰 and is one of the more common performance💥 issues we SQL people encounter. In simplest terms, SQL Server tries to optimize the execution plan by leveraging a previously ran🏃 plan used. The problem lies when there is different results set due to the parameters provided. Overall, this feature is definite improvement but not the silver bullet 🚀.

“Memories may be beautiful and yet…”

Next, we took a look 👀 @ the Memory Grant Feedback (row mode) feature which originally made its debut in SQL Server 2017. A memory grant is used by the SQL Server database engine to allocate how much memory it will use for a given query.

Sometimes it can allocate too much or too little than the actual memory needed. Obviously, if too much memory is allocated than there will be less memory available for other processes, and if too little is allocated than memory will spill 💦 to disk and performance will suffer.

“Here I come to save the day!” 🐭

Once again, by simply switching the Database “compat mode” to SQL Server 2019 (15.x), SQL Server will now automatically adjust the amount of memory granted for a query. Awesome! … but what’s the catch? Well, similar to the new Table variable enhancement situation, memory grants are sensitive to the dreaded parameter sniffing as SQL Server will base its decision making on the previous run query. In addition, If you rebuild your indexes or create new indexes, the adaptive memory grants feature will forget everything it learned about previously run plans and start all over. 😞 This is an improvement to the past performance of memory grants but unfortunately not the panacea.

“I will choose a path that’s clear.. I will choose free will”

Further, we next explored another feature that first made an appearance in SQL Server 2017 in Adaptive Joins. Back then this sleek feature was only available for Columnstore index but now in SQL Server 2019 is available for our run of the mill 🏭 b-tree 🌲 or row-type work loads. In SQL Server, with Adaptive Joins the query optimizer will dynamically determines at runtime the threshold number of rows and then it chooses between a Nested Loop or Hash Match join operator. Again, we simply switch the Database “compat mode” to SQL Server 2019 (15.x) and abracadabra!

This is awesome! But… Well, once again parameter sniffing rears its ugly 😜 head. As Brent astutely points out during his presentation and documents in his detailed blog post when dealing with complex queries often times SQL Server will produce several plans. So when SQL Server 2019 tries to solve this problem with more options for the optimizer to choose from it will in some particular cases might backfire 🔥 depending on your workloads.

“Just when I thought our chance had passed..You go and save the best for last”

The last feature we keyed 🔑 on as part of IQP was Batch Mode execution. This is just another great innovation that came straight out Columnstore Index (SQL 2012). Now offered in SQL Server 2019, users can take advantage of this enhancement without needing to use a Columnstore Index.

Batch mode is a huge performance 💥 enhancement especially with CPU-bound queries and for queries that use aggregation, sorts, and group by operations. Batch mode performs scans and calculations using batches. To enable the batch mode processing all you have to do is… thats right.. you guessed it… Switching the Database “compat mode” to SQL Server 2019 (15.x)

“Don’t cry Don’t raise your eye It’s only teenage wasteland”

Continuing our journey with SQL Server 2019.. We ventured to take a sneak peak 👀 at the improvements SQL Server 2019 made to TempDB. Guiding us through the murky terrain of TempDB was SQL Royalty👸 and Microsoft Certified Master Pam Lahoud. Pam presented in two short but very thorough videos as part of the Data Exposed series:

In these great videos, Pam gives us the lowdown on improvements to TempDB in SQL Server 2019. TempDB is of course one of the famous or infamous system databases as part of SQL Server. As SQL Server, has been enhanced over the last several decades more and more “stuff” has been chucked into TempDB. TempDB has often been referred to as the “Wastelands”. Some of the activities that go on in TempDB are:

Temp Tables
Table Variables
Cursors
DBCC CHECKDB
Table-Valued Functions
Row Versions
Online Index Operations
Sorts
Triggers
Statistics Updates
Hash Worktables
Spools

As result of all these happenings in SQL Server, we often might experience some pain 😭 in Object allocation contention, Metadata contention, and Temp table cache contention.

“Look up in the sky! It’s a Bird 🕊 … It’s a Plane… It’s ..“

..SQL Server 2019. (Of course 😊)

In SQL 2019, there are two major improvements that impact TempDB performance. The first improvement made helps reduce some of the Temp table cache contention issues. SQL 2019 intelligently partitions cache objects and optimizing cache look up.

To address object allocation contention issues SQL 2019 now offers concurrent PFS updates. In addition, by enabling concurrent PFS updates this allows us to have multiple threads share a latch on the PFS page and therefore we can have more more concurrent threads and less attention on those object allocation pages

The Next major enhancement is memory optimized metadata tables. This is the big one that got into the brochure!

Under the hood, SQL Server 2019 moves the system objects into memory optimized tables that have latched free, lock free structures which greatly increases the concurrency that we can have against those metadata tables and that helps us alleviate that metadata contention.

By Default the Memory-Optimized TempDB Metadata feature is not turned out but its quite easy to enable:

ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA = ON;

For those who see a flurry of tempdb activity, this feature seems like a no brainer 🧠.

“Makes us want to stay, stay, stay… For awhile”

Below are some of the places I am considering for my explorations for next week:

Continuing with SQL Server 2019
Google Cloud Certified Associate Cloud Engineer Path

Thanks –

–MCS

Week of May 15th

Posted on May 15, 2020 by Mark Shay

“Slow down, you move too fast…You got to make the morning last.”

Happy International Day of Families and for those celebrating in the US Happy Chocolate Chip Day!

This week’s Journey was a bit of a laggard in comparisons to previous week’s journeys but still productive, nonetheless. This week we took a break from Machine Learning while sticking to our repertoire and with our reptilian programing friend. Our first stop was to head over to installing Python on Windows Server which we haven’t touched on so far.As we tend to make things more challenging than they need to be we targeted an Oldie but a goodie Windows Server 2012 R2 running SQL Server 2016. Our goal to configure a SQL Server Scheduled Job that runs a simple Python Script which seemed liked a pretty simple task. We found an nice example of this exact scenario on MSSQL Tips – Run Python Scripts in SQL Server Agent

First, we installed Python and followed the steps and lo and behold it didn’t work right away. To quote the great Gomer Pyle “Surprise, surprise, surprise”. No worries we had this… After a little bit of troubleshooting and trying to interpret the vague error messages in the SQL Server Agent Error log we got it working… In turns out, we had a multitude of issues ranging from the FID that was running the SQL Agent service not having the proper entitlements to the directory where the py script lived and the more important prerequisite of Python not being in the User Environment Variables for the Service account to know where to launch the executable. Once resolved, we were off to the races or at least we got the job working.

At this point we were feeling pretty ambitious, so we decided rather than using the lame MS Dos style batch file we would use a cool PowerShell Script as a wrapper for our python code for the job… Cool but not so cool on Windows Server 2012 R2. First, we started out with set-executionpolicy remotesigned command which needs to be specified in order to execute PowerShell but because we were using an old jalopy OS we had to upgrade the version of the .Net runtime as well as the version of PowerShell. Once upgraded and we had executed a few additional commands and then we were good to go…

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

Install-PackageProvider -Name NuGet -RequiredVersion 2.8.5.201 -Force

Install-Module -Name SqlServer -AllowClobber

After spending a few days here, we decided to loiter a little bit longer and crank out some SQL maintenance tasks in Python like a simple backup Job. This was pretty straight forward once we executed a few prerequisites.

python -m pip install –upgrade pip

Pip install pip

pip install pyodbc

pip install pymssql-2.1.4-cp38-cp38-win_amd64.whl

pip install –upgrade pymssql

Our final destination for the week was to head back over to a previous jaunt and play with streaming market data and Python. This time we decided to stop being cheap and pay for an IEX account

Fortunately, they offer pay by the month option with opt out any time so it shouldn’t get too expensive. To get re-acclimated we leveraged Jupyter notebooks and banged out a nifty python/pandas/matlib script that generates the top 5 US Banks and there 5-year performance. See attachment.

“I have only come here seeking knowledge… Things they would not teach me of in college”

Below are some topics I am considering for my adventures next week:

Vagrant with Docker
Data Pipelines
- Google Cloud Pub/Sub (Streaming Data Pipelines)
- Google Cloud Data Fusion ( ETL/ELT)
Back to Machine Learning
ONTAP Cluster Fundamentals
Google Big Query
Python -> Stream Data from IEX -> Postgres
Data Visualization Tools (i.e. Looker)
ETL Solutions (Stitch, FiveTran)
Process and Transforming data/Explore data through ML (i.e. Databricks) .
Getting Started with Kubernetes with an old buddy (Nigel)

Stay Safe and Be well –

–MCS