Week of May 22nd

And you know that notion just crossed my mind…

Happy Bitcoin Pizza Day!

All aboard! This week our travels would take us on the railways far and high, but before we could hop on the knowledge express we had some unfinished business to attend to.

“Oh, I get by with a little help from my friends”

If you have been following my weekly submissions for the last few weeks, you know I listed as a future action item “create/configure a solution that leverages Python to stream market data and insert it into a relational database.”

Well, last week I found just the perfect solution. A true masterpiece by Data Scientist/Physicist extraordinaire AJ Pryor, Ph.D. AJ had created a brilliant multithreaded work of art that continuously queries market data from IEX and then writes it to a PostgreSQL database. In addition, he built a data visualization front-end that leverages Pandas and Bokeh so the application can run interactively through a standard web browser. It was like a dream come true! Except that the code was written about three years ago and referenced a deprecated API from IEX.
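
For a flavor of the overall shape of such a pipeline, here is a minimal, single-threaded sketch of the idea (this is not AJ’s actual multithreaded code; the IEX token, connection string, table name, and column picks are all placeholders):

import time
import requests
import pandas as pd
from sqlalchemy import create_engine

# Placeholders -- substitute a real IEX Cloud token and your own Postgres credentials.
IEX_QUOTE_URL = "https://cloud.iexapis.com/stable/stock/{symbol}/quote"
IEX_TOKEN = "YOUR_IEX_TOKEN"
engine = create_engine("postgresql://postgres:newPassword@localhost:5432/postgres")

def fetch_quote(symbol):
    # Pull the latest quote for one ticker from the IEX Cloud REST API.
    resp = requests.get(IEX_QUOTE_URL.format(symbol=symbol), params={"token": IEX_TOKEN})
    resp.raise_for_status()
    return resp.json()

def stream_quotes(symbols, interval_seconds=60):
    # Poll IEX on a fixed interval and append each batch of quotes to Postgres.
    while True:
        rows = [fetch_quote(s) for s in symbols]
        df = pd.DataFrame(rows)[["symbol", "latestPrice", "latestUpdate", "volume"]]
        df.to_sql("stock_quotes", engine, if_exists="append", index=False)
        time.sleep(interval_seconds)

stream_quotes(["AAPL", "MSFT", "IBM"])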

Ok, no problem. We would simply modify AJ’s “Mona Lisa” to reference the new IEX API and off we would go. Well, what seemed like a dream turned into a virtual nightmare. I spent most of last week spinning my wheels trying to get the code to work, but to no avail. I even reached out to the community on Stack Overflow, but all I received was crickets.

Just as I was ready to cut my losses, I reached out to a longtime good friend who happens to be an all-star programmer and a fellow NY Yankees baseball enthusiast. Python wasn’t his specialty (he is really an amazing Java programmer), but he offered to take a look at the code when he had some time… So we set up a Zoom call this past Sunday and I let his wizardry take over… After about an hour he was in a state of flow and had a good pulse on what our maestro AJ’s work was all about. After a few modifications my good chum had the code working and humming along. I ran into a few hiccups along the way with the Bokeh code, but my confidant just advised me to run some simpler syntax and then abracadabra… this masterpiece was now working on the Mac! As the new week started, I was still basking in the radiance of this great coding victory. So, I decided to be a bit ambitious and move this gem to the cloud, which would be the crème de la crème of our learnings thus far. Cloud, Python/Pandas, streaming market data, and Postgres all wrapped up in one! Complete and utter awesomeness!

Now the question was which cloud platform to go with. We were well versed in the compute offerings of all three major providers as a result of our learnings.

So with a flip of the coin, we decided to go with Microsoft Azure. That, and we had some free credits still available.

With sugar plum fairies dancing in our heads, we spun up our Ubuntu image and followed along with the well-documented steps in AJ’s GitHub project.

Now we were cooking with gasoline! We cloned AJ’s GitHub repo, modified the code with our new changes, and executed the syntax, and just as we were ready to declare victory… Stack Overflow error! Oh, the pain.

Fortunately, I didn’t waste any time and went right back to my ace in the hole, though with some trepidation that I was being too much of an irritant.

I explained my perplexing predicament, and without hesitation my fidus Achates offered some great troubleshooting tips, and quite expeditiously we had the root cause pinpointed. For some peculiar reason, the formatting of the URL that worked like a charm on the Mac caused dyspepsia on Ubuntu on Azure. It was certainly a mystery, but one that could only be solved by rewriting the code.

So once again, my comrade in arms helped me through another quagmire. Without further ado, may I introduce to you the one and only…

http://stockstreamer.eastus.cloudapp.azure.com:5006/stockstreamer

“We’ll hit the stops along the way… We only stop for the best”

Feeling victorious after my own personal Battle of Carthage, and with our little streaming market data saga out of our periphery, it was time to hit the rails…

Our first stop was messaging services, which are all the rage nowadays. There are so many choices for data messaging services out there, so where to start? We went with Google’s Pub/Sub, which turned out to be a marvelous choice! To get enlightened with this solution, we went to Pluralsight, where we found an excellent course, Architecting Stream Processing Solutions Using Google Cloud Pub/Sub by Vitthal Srinivasan.

Vitthal was a great conductor who navigated us through an excellent overview of Google’s impressive solution, its use cases, and even a rather complex pricing structure in the first lesson. He then took us deep into the weeds, showing us how to create topics, publishers, and subscribers. He went further by showing us how to leverage some other tremendous offerings in GCP like Cloud Functions, APIs & Services, and Storage.

Before this amazing course my exposure had been limited to GCP’s Compute Engine, so it was an eye-opening experience to see the great power that GCP has to offer! To round out the course, he showed us how to use GCP Pub/Sub with the client libraries, which was an excellent tutorial on how to use Python with this awesome product. There were even two modules on how to integrate a Google Hangouts Chatbot with Pub/Sub, but that required you to be a G Suite user. (There was a free trial, but I skipped the setup and just watched the videos.) Details on the work I did on Pub/Sub can be found at
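
To give a flavor of what the client-library module covers, here is a minimal publish/subscribe sketch in Python. The project, topic, and subscription names are made up; it assumes the topic and subscription already exist and that GOOGLE_APPLICATION_CREDENTIALS points at a service account key:

from google.cloud import pubsub_v1

# Placeholders -- substitute your own GCP project, topic, and subscription names.
PROJECT_ID = "my-gcp-project"
TOPIC_ID = "stock-ticks"
SUBSCRIPTION_ID = "stock-ticks-sub"

# Publish a few messages to the topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
for i in range(3):
    future = publisher.publish(topic_path, data=f"tick {i}".encode("utf-8"))
    print("Published message id:", future.result())

# Pull the messages back down with a subscriber callback.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    print("Received:", message.data.decode("utf-8"))
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)   # listen for ~30 seconds, then stop
except Exception:
    streaming_pull.cancel()

The subscriber runs the callback on a background thread pool managed by the client, which is why the result(timeout=...) call is what actually keeps the listener alive.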

“I think of all the education that I missed… But then my homework was never quite like this”

For bonus this week, I spent an enormous amount of time brushing up on my 8th grade math and science curriculum:

  1. Linear Regression (see the quick sketch after this list)
  2. Epigenetics
  3. Protein Synthesis
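
Since linear regression also happens to be where the machine learning material picks up, here is the whole idea in a few lines of Python: fitting y = mx + b to some made-up numbers with scikit-learn.

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up data set: hours studied vs. quiz score.
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([52, 60, 71, 79, 88])

model = LinearRegression().fit(hours, scores)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted score for 6 hours:", model.predict([[6]])[0])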

Below are some topics I am considering for my Journey next week:

  • Vagrant with Docker
  • Continuing with Data Pipelines
  • Google Cloud Data Fusion (ETL/ELT)
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Google BigQuery
  • Data Visualization Tools (i.e. Looker)
  • ETL Solutions (Stitch, FiveTran) 
  • Processing and transforming data / exploring data through ML (i.e. Databricks)
  • Getting Started with Kubernetes with an old buddy (Nigel)

Stay safe and Be well –

–MCS 

Week of May 8th

“Now it’s time to leave the capsule if you dare..”

Happy Friday!

Before we could end our voyage and return our first mate Slonik to the zookeeper, we would first need to put a bow on our Postgres journey (for now) by covering a few loose ends on advanced features. Saturday, we kicked it off with a little review of isolation levels in Postgres (including a deep dive on Serializable Snapshot Isolation (SSI)). Then it was on to third-party monitoring for database health and streaming replication, and for la cerise sur le gâteau… declarative partitioning and sharding!

Third-Party Monitoring

We evaluated two solutions: OpsDash and pgDash. Both were easy to set up, and both gave valuable information about Postgres. OpsDash provides more counters and can monitor system information as well as other services running on Linux, whereas pgDash is Postgres-specific and gives you a deeper look into Postgres and streaming replication than just querying the native system views.
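
For reference, “just querying the native system views” looks something like this from Python (connection details are placeholders, and the column names assume Postgres 10 or later, run against the primary):

import psycopg2

# Placeholder connection details for the primary Postgres server.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="newPassword")

with conn, conn.cursor() as cur:
    # Replication health straight from the native system view.
    cur.execute("""
        SELECT client_addr, state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for client, state, lag in cur.fetchall():
        print(client, state, lag)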

Declarative Partitioning

It was fairly straightforward to implement declarative partitioning. We reinforced the concepts by turning to Creston’s plethora of videos on the topic as well as several blog posts. See below for the detailed log.
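
As a minimal sketch of the declarative partitioning pattern (not our exact schema; the table, columns, and connection details are made up for illustration):

import psycopg2

# Placeholder connection; a small range-partitioning example.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="newPassword")
conn.autocommit = True

ddl = """
CREATE TABLE measurements (
    city_id   int NOT NULL,
    logdate   date NOT NULL,
    reading   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_y2020q1 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

CREATE TABLE measurements_y2020q2 PARTITION OF measurements
    FOR VALUES FROM ('2020-04-01') TO ('2020-07-01');
"""

with conn.cursor() as cur:
    cur.execute(ddl)
    # Rows are routed to the right partition automatically by the partition key.
    cur.execute("INSERT INTO measurements VALUES (1, '2020-05-08', 42.0);")
    cur.execute("SELECT tableoid::regclass, * FROM measurements;")
    print(cur.fetchall())

The tableoid::regclass column in the final SELECT shows which partition each row actually landed in.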

Sharding Your Data with PostgreSQL

There are third-party solutions like Citus Data that offer a more scalable approach, but out of the box you can implement sharding by combining declarative partitioning set up on a primary server with a foreign data wrapper (postgres_fdw) configured against a remote server. Partitioning plus FDW gives you sharding. This was quite an interesting solution, although I have strong doubts about how scalable it would be in production.
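
A rough sketch of that partitioning-plus-FDW recipe, run from the primary. The server name, credentials, and remote table are placeholders, it builds on the measurements table from the partitioning sketch above, and note that routing INSERTs into a foreign partition requires Postgres 11 or later:

import psycopg2

# Placeholder connection; "shardhost" is a made-up name for the remote Postgres server.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="newPassword")
conn.autocommit = True

ddl = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote server that will hold one slice of the data.
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shardhost', port '5432', dbname 'postgres');

CREATE USER MAPPING FOR postgres SERVER shard1
    OPTIONS (user 'postgres', password 'newPassword');

-- The remote table (created ahead of time on shard1) becomes one partition
-- of the local partitioned table from the declarative partitioning sketch.
CREATE FOREIGN TABLE measurements_y2020q3
    PARTITION OF measurements
    FOR VALUES FROM ('2020-07-01') TO ('2020-10-01')
    SERVER shard1
    OPTIONS (table_name 'measurements_y2020q3');
"""

with conn.cursor() as cur:
    cur.execute(ddl)

Queries against measurements on the primary now transparently pull that quarter’s rows from the remote server.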

On Sunday, we took a much-needed respite as the weather was very agreeable in NYC to escape the quarantine… 

On Monday, with our rig now dry-docked, we would travel by different means to another dimension, a dimension not only of sight and sound but of mind; a journey into a wondrous land of imagination. Next stop, the DevOps Zone!

To begin our initiation into this realm we would start off with HashiCorp’s Vagrant.

For those not familiar with Vagrant, it is not the transient mendicant that the name would otherwise imply but a nifty open-source solution for building and maintaining lightweight, portable dev environments.

It’s kind of similar to Docker, for those more familiar, but it generally works with virtual machines (although it can be used with containers).

At the most basic level, Vagrant works with slimmed-down VMs, whereas Docker takes the more minimalist approach of isolating processes from the OS by leveraging containers.

The reason to go this route as opposed to the more popular Docker was that it is generally easier to stand up a dev environment.

That being said, we wound up spending a considerable amount of time on Monday and Tuesday working on this, as I ran into some issues with SSH and the “vagrant up” process. The crux of the issue was running Vagrant/VirtualBox inside an Ubuntu VM that was itself running on VirtualBox on a Mac. This convoluted nesting didn’t seem to play nice. Go figure?

Once we decided to install Vagrant with VirtualBox natively on the Mac, we were up and running and easily able to spin up and deploy VMs seamlessly.

Next, we played a little bit with Git, getting some practice with the workflow of editing configuration files and pushing the changes straight to the repo.

On Wednesday, we decided to begin our exploration of strange new worlds, to seek out new life and new civilizations, and of course to boldly go where maybe some have dared to go before. That would be, of course, Machine Learning, where the data is the oil and the algorithm is the engine. We would start off slow by just trying to grasp the jargon: training data, training a model, testing a model, supervised learning, and unsupervised learning.

The best way for us to absorb this unfamiliar lingo would be to head over to Pluralsight, where David Chappell offers a great introductory course, Understanding Machine Learning.

“Now that she’s back in the atmosphere… With drops of Jupiter in her hair, hey, hey”

On Thursday we would go further down the rabbit hole of Machine Learning with Jerry Kurata’s Understanding Machine Learning with Python  

There we would be indoctrinated into the powerful tool that is the Jupyter Notebook. Now armed with this great “Bat gadget”, we would reunite with some of our old heroes from the “Guardians of the Python” like “Projectile” Pandas, matplotlib “the masher”, and of course numpy, “the redheaded stepchild of Thanos”. In addition, we would be introduced to a new and special superhero: scikit-learn.

For those not familiar with this powerful library, scikit-learn works hand in hand with our friends NumPy, Pandas, and SciPy. This handy Python library ultimately unlocks the key to the machine learning universe through Python.
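
Here is a minimal sketch of the train-a-model/test-a-model loop the course walks through, using scikit-learn’s bundled iris data set rather than the course’s own notebooks:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small bundled data set and split it into training and test data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model on the training split, then test it on data it has never seen.
model = GaussianNB()
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))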

Despite all this roistering with our exemplars of Python, our voyage wasn’t all rainbows and unicorns.

We got introduced to all sorts of new space creatures like Bayesian and Gaussian algos, each conjuring up its own bêtes noires. The mere thought of Bayes’ theorem dredged up old memories buried deep in the bowels of my college days studying probability, and the mere mention of Gaussian functions jarred memories of The Black Swan (not the ballet movie with fine actresses Natalie Portman and Mila Kunis, but the well-written and often irritating NYT bestseller by Nassim Nicholas Taleb).

Unfortunately, it didn’t get any cozier when we got to that powerful and complex ensemble, the random forest algorithm. There we got bombarded by meteorites such as “overfitting”, “regularization hyperparameters”, and “cross-validation”, not to mention the dreaded “bias–variance tradeoff”. Ouch! My head hurts…
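
For the curious, those scary terms fit together in just a few lines (again a sketch on the bundled iris data, not the course’s exercise):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Regularization-style hyperparameters: fewer, shallower trees resist overfitting.
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, then rotate.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())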

Here is the detailed log of my week’s journey

“With so many light years to go…. And things to be found (to be found)”

Below are some topics I am considering for next week’s odyssey:

  • Run Python Scripts in SQL Server Agent
  • More with Machine Learning
  • ONTAP Cluster Fundamentals
  • Google BigQuery
  • Python -> Stream Data from IEX -> MSSQL
  • Data Visualization Tools (i.e. Looker)
  • ETL Solutions (Stitch, FiveTran) 
  • Processing and transforming data / exploring data through ML (i.e. Databricks)
  • Getting Started with Kubernetes with an old buddy (Nigel)

Stay safe and Be well

—MCS 

Week of May 1st

“And once again, I will be… In a march to the sea.”

Happy May Day for one and for all!

This week’s expedition continued with our first mate Slonik through the relational database version of the Galápagos Islands. Last week, we rendezvoused with our old friend the “SQL Authority”, who gave us a mere beginner’s guide to the world of PostgreSQL.

Now, no longer a neophyte to Postgres, we needed to kick it up a notch and touch on some more advanced topics like database architecture, logical replication, streaming replication, monitoring of replication scenarios, and a migration path from other relational databases (i.e. MS SQL Server) to Postgres.

So once again, we turned back to Pluralsight for some insights into these sophisticated areas, only to find zilch in this realm! So now where to turn? Who could help us navigate these uncharted territories?

Well… Google, of course. Or I should say Google’s streaming video service, a.k.a. youtube.com. There, much to our delight, we found a treasure trove of riches on the higher-level topics related to Postgres.

However, we didn’t have our true eureka moment until we encountered Creston, not to be confused with “The Amazing Kreskin”, although possibly equally as amazing or even better. This RDBMS enthusiast put together a bunch of spectacular videos covering a wide variety of nuggets on Postgres. Now armed with these knowledge bombs, it was full speed ahead!

On Saturday, we kicked it off with a quick jolt of architectural review, and then we dove right into a high-level overview of high availability in Postgres. To finish it off, we successfully implemented logical replication, and we were feeling pretty good about our first day on the deep blue sea…
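
The logical replication setup itself boils down to a publication on the primary and a subscription on the replica. Here is a rough sketch of the idea, driven from Python (host names, credentials, and the table are placeholders; it assumes wal_level = logical on the publisher and that the table already exists with the same definition on both servers):

import psycopg2

# "pubhost" and "subhost" stand in for the primary and replica VMs.
pub = psycopg2.connect(host="pubhost", dbname="postgres", user="postgres", password="newPassword")
sub = psycopg2.connect(host="subhost", dbname="postgres", user="postgres", password="newPassword")
pub.autocommit = True
sub.autocommit = True   # CREATE SUBSCRIPTION cannot run inside a transaction block

with pub.cursor() as cur:
    # Publish changes for one table (actor, from the pagila sample, as an example).
    cur.execute("CREATE PUBLICATION demo_pub FOR TABLE actor;")

with sub.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION demo_sub
        CONNECTION 'host=pubhost dbname=postgres user=postgres password=newPassword'
        PUBLICATION demo_pub;
    """)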

“Like a red morn that ever yet betokened, Wreck to the seaman, tempest to the field, Sorrow to the shepherds, woe unto the birds, Gusts and foul flaws to herdmen and to herds.” – William Shakespeare

..Or more eloquently phrased 

“Red sky at night, sailor’s delight. Red sky in morning, sailor’s warning” – Some unknown Sailor Dude

Well, that warning came quite quickly on Sunday when we took the plunge and went right to streaming replication. We immediately queued up the Creston video on the subject and began to follow along. …And follow along… and follow along… and follow along… and so on… After a bleary 15 hours of trying, we were left with what turned out to be a red herring of erroneous error messages in the Postgres DB log, but more importantly we had no replication in place…

We finished Sunday and continued into the wee hours of Monday with no progress, a disappointing end to an eighty-consecutive-day step-goal streak, and no exercise done. In fact, I did fewer than 500 steps for the whole day as my gluteus maximus stayed glued to the chair and my head stayed locked on the MacBook. Feeling dejected, I decided to call it quits and start over after some shut-eye.

On Monday, I spent the entire day trying to troubleshoot the issue with more re-watching of the videos and countless Google searches, but to no avail.

“The sun’ll come out tomorrow… Bet your bottom dollar that tomorrow there’ll be sun” – Orphan Annie

…And then Tuesday arrived, but the sun actually didn’t show up until later in the day (~3:30 PM). My plan was to tear it all down, start from scratch, rebuild a shiny new pristine environment, and follow the video methodically step by step. When I finished, I was back where I started: no replication. But this time I had written off the meaningless error messages in the log and just focused on why replication was not working. So I went back to Google and even gave a cry out for help to the community on the DBA Stack Exchange.

And then finally it hit me, “like the time I was standing on the edge of my toilet hanging a clock, the porcelain was wet, I slipped, hit my head on the sink, and when I came to… I came up with the flux capacitor”.

Actually, that’s a different story, but I realized that I had inadvertently put the recovery.conf file in the wrong directory. D’oh! Once I placed the file in its proper place (the standby’s data directory) and restarted my Postgres servers, the magic began.

Overwhelmed with utter jubilation, I decided it was time to celebrate with a victory lap or 15. To my amazement, I actually set my best PR for a mile and ran a solid 7:02 per mile pace for 3.6 miles… But that’s something I can write about somewhere else… Not here, not now.

After Tuesday’s afternoon catharsis, it was time to further the mission on Wednesday. We spent some more time on replication slots and replication monitoring, then moved on to migrating the SQL Server sample database AdventureWorksDW2012 to Postgres. The first step was to generate the full table schema from MS SQL Server and then modify the script so Postgres could interpret it. See the log below for more details.
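
That hand-editing was essentially a pile of type-name substitutions. Here is a rough sketch of the kind of mapping involved (only a handful of common types, nowhere near a complete translator, and the sample DDL is a simplified snippet in the AdventureWorks style):

import re

# A rough-and-ready sketch of the search-and-replace we did by hand.
TYPE_MAP = {
    r"\bNVARCHAR\s*\(MAX\)": "TEXT",
    r"\bNVARCHAR": "VARCHAR",
    r"\bDATETIME\b": "TIMESTAMP",
    r"\bBIT\b": "BOOLEAN",
    r"\bTINYINT\b": "SMALLINT",
    r"\bMONEY\b": "NUMERIC(19,4)",
    r"\bUNIQUEIDENTIFIER\b": "UUID",
    r"\[|\]": "",            # strip SQL Server's [bracketed] identifiers
}

def translate_ddl(mssql_ddl):
    # Apply each mapping case-insensitively to the scripted schema.
    for pattern, replacement in TYPE_MAP.items():
        mssql_ddl = re.sub(pattern, replacement, mssql_ddl, flags=re.IGNORECASE)
    return mssql_ddl

sample = "CREATE TABLE [dbo].[FactInternetSales] ([SalesOrderNumber] NVARCHAR(20) NOT NULL, [OrderDate] DATETIME NULL, [SalesAmount] MONEY NOT NULL);"
print(translate_ddl(sample))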

On Thursday, before we could pick up where we last left off, we needed to recover our Ubuntu server, which was hosting our primary and local replica, from a dreaded disk crash.

In an effort to simulate latency during our replication monitoring test, we wrote an INSERT statement using generate_series, which worked great: the replicas started to fall behind as we continuously pumped in data, until we ran out of space and our Ubuntu server sh*t the bed. Now we had to flex our VirtualBox and Linux skills to get our system back online.
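
For anyone who wants to repeat the mistake, the load generator was essentially this (connection details and table name are placeholders; mind your free disk space):

import psycopg2

# Placeholder connection to the primary; pumps batches of dummy rows so the
# replicas fall behind. This is exactly how we filled up the disk.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="newPassword")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS load_test (id bigint, payload text);")
    for batch in range(10):   # keep the batch count small unless you like disk crashes
        cur.execute("""
            INSERT INTO load_test
            SELECT g, md5(g::text)
            FROM generate_series(1, 1000000) AS g;
        """)
        print("inserted batch", batch + 1)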

First, we had to increase the size of the disk in our VDI, which of course is unsupported in the UI and needs to be done at the command line. Then, with a bigger disk, we needed to boot the VirtualBox VM straight to the Ubuntu installer ISO and run our trusty gparted to extend the volume so Ubuntu could see the newly added free space.

After a quick reboot, our system was back online. We re-enabled our streaming replication and, now ready, picked up where we left off, this time migrating the data from MS SQL Server to Postgres. Of course, just like data type conversion, it’s not so intuitive. Our conclusion is that if you are going to migrate from MS SQL Server to Postgres, it’s best to invest in a third-party tool.

However, as a POC, we were able to migrate a small table. We accomplished this by using BCP to export the table data to individual text files and then importing them into Postgres.
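
The Postgres side of that import can be done with COPY, which is far faster than row-by-row INSERTs. A minimal sketch, assuming the table already exists in Postgres and the file was exported with bcp in character mode so it is tab-delimited (file path, table name, and credentials are placeholders):

import psycopg2

# Placeholders for the connection, target table, and BCP output file.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="newPassword")

with conn, conn.cursor() as cur, open("/tmp/DimCurrency.txt") as f:
    # Text-format COPY expects tab-delimited rows; NULL '' treats BCP's empty
    # fields as NULLs (adjust to taste for your data).
    cur.copy_expert("COPY dimcurrency FROM STDIN WITH (FORMAT text, NULL '')", f)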

On Friday, we got ambushed with some 8th grade algebra and science homework, but we managed to find some time to test drive one of the third-party tools used for MS SQL Server to Postgres migrations, with some success. We used the Ispirer Migration and Modernization Toolkit to migrate all the tables from AdventureWorksDW2012 and transfer all the data from Microsoft SQL Server to PostgreSQL. Unfortunately, we weren’t so successful with the views and functions, as they require further code rewrites, but that was to be expected. Here is the detailed log of my week’s journey

Below are some topics I am considering for my exploration next week:

Stay safe and Be well

—MCS 

Week of April 24th

“We’ll search for tomorrow on every shore…”

Last week, after our leisurely cruise had docked, it wouldn’t be too long before we would set an “open course for this week’s virgin sea”. As I was preparing my coordinates, “I look to the sea, reflections in the waves spark my memory”, which led me down a familiar path. As some of you might know, I have spent a majority of my career working with relational databases (in particular Microsoft SQL Server).

Over the years, MS SQL Server has become one of the most popular RDBMSs, but with each new release and all the awesome new features added, it has in some cases become highly restrictive from a licensing standpoint, with a very high TCO, especially if the database is large or accessed by many clients. With enterprise licensing costs skyrocketing, this of course opened the door for the open-source RDBMS movement.

The leader in this category has been MySQL. However, after it was acquired by Oracle, many have been dissuaded from using this database for new projects. Not to mention, the original creator of MySQL left after the acquisition and subsequently forked the code to develop MariaDB, which has received a lukewarm response in the industry. But the real little blue elephant, or “слоник”, in the room was clearly PostgreSQL.

Both PostgreSQL and MySQL launched around the same time, but not until recent years did PostgreSQL really take off; it was always sort of lurking in the grasslands. Today, PostgreSQL has emerged as one of the leaders not only in the open-source world but among all relational databases. So after “a gathering of angels appeared above my head… They sang to me this song of hope, and this is what they said…”

Ok, Where to start? 

Well, the basics..

First, I needed a Postgres environment to work with. For this exercise, I wanted to avoid any additional charges in the cloud, so I needed to develop my own on-prem solution.


Here are the steps I took:

  • Install Oracle VirtualBox on the MacBook
  • Download Ubuntu 18.x.x
  • Mount the ISO and install Ubuntu
  • Change System Memory to a higher value
  • Change the display memory setting to a higher value
  • After the Ubuntu install -> Power down
  • In VirtualBox, click Tools -> Network -> Create NIC
  • Under the Ubuntu image -> Create a 2nd virtual NIC (Host-only Adapter)

Get SSH working:

sudo apt update

sudo apt install openssh-server

sudo systemctl status ssh

sudo ufw allow ssh

Install PostgreSQL (Server) on Ubuntu:

sudo su -

apt-get install postgresql postgresql-contrib

update-rc.d postgresql enable

service postgresql start

Verify local connection (on the server):

sudo -u postgres psql -c "SELECT version();"

sudo su - postgres

Change Postgres Password from Blank to something meaningful

psql 

ALTER USER postgres PASSWORD 'newPassword';

\q

exit

Open up the firewall port to allow Postgres traffic

ufw allow 5432/tcp

Enable remote access to PostgreSQL server

  • Edit postgresql.conf

sudo vim /etc/postgresql/10/main/postgresql.conf

Change listen_addresses = 'localhost' to listen_addresses = '*'

—Restart postgres

sudo service postgresql restart

—Verify Postgres listening on 5432

ss -nlt | grep 5432

  • Edit pg_hba.conf

sudo vim /etc/postgresql/10/main/pg_hba.conf

host    all             all              0.0.0.0/0                   md5

host    all             all              ::/0                            md5

—Restart postgres

sudo service postgresql restart

(On Mac Client)

Download pgAdmin 4 for Mac (Client)

  • Launched Browser
  • Created Postgres connection
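
As a quick sanity check that remote access was really working, a few lines of Python from the Mac also do the trick (the host IP below is a placeholder for the VM’s host-only adapter address, and the password is the one set above):

import psycopg2

# Placeholder host: replace with your Ubuntu VM's host-only adapter IP.
conn = psycopg2.connect(host="192.168.56.101", port=5432,
                        dbname="postgres", user="postgres", password="newPassword")

with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])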

Next, I downloaded and restored the sample database (http://bit.ly/pagilia-dl), and I was ready to take on some learning.

For my education on Postgres, I turned to a reliable source on Pluralsight: none other than the SQLAuthority himself, who produced a series of great courses! Below was my syllabus for the week:

Monday:
        1. PostgreSQL: Getting Started by Pinal Dave

Tuesday:
        2. PostgreSQL: Introduction to SQL Queries by Pinal Dave

Wednesday:
        3. PostgreSQL: Advanced SQL Queries by Pinal Dave

Thursday:
        4. PostgreSQL: Advanced Server Programming by Pinal Dave

Friday:
        5. PostgreSQL: Index Tuning and Performance Optimization by Pinal Dave

Below are some topics I am considering for my exploration next week:

Stay safe and Be well

—MCS