redshift | SQL Squirrels

Week of July 17th

Posted on July 17, 2020 by Mark Shay

“Any time ⏰ of year… You can find it here”

Happy World🌎 Emoji 😊😜 Day!

The Last few weeks we have been enjoying our time in Microsoft’s Cloud☁️ Data Ecosystem and It was just last month that we found ourselves hanging out with the GCP☁️ gang and their awesome Data offerings. All seemed well and good😊 except that we had been missing out on excursion to the one cloud☁️ provider where it all began literally and figuratively.

Back when we first began our journey on a cold 🥶 and rainy☔️ day in March just as Covid-19🦠 occupied Wall Street 🏦 and the rest of the planet 🌎 we started at the one place that disrupted how infrastructure and operations would be implemented and supported going forward.

That’s right Amazon Web Services or more endearingly known to humanity as AWS. AWS was released just two decades ago by the its parent company that offers everything from A to Z.

AWS like its parent company has a similar mantra in the cloud ☁️ computing world as they offer 200+ Cloud☁️ Services. So how the heck with so some many months passed that we haven’t been back since? The question is quite perplexing? But like they say “all Clouds☁️☁️ lead to AWS. So, here we are back in the AWS groove 🎶 and eager 😆 to explore 🔦the wondrous world🌎 of AWS Data solutions. Guiding us through this vast journey would be Richard Seroter (who ironically recently joined the team at Google). In 2015, Richard authored an amazing Pluralsight course covering Amazon RDS🛢, Amazon DynamoDB 🧨 and Amazon’s Redshift 🎆. It was like getting 3 courses for the price of 1! 😊

Although the course was several years old, for the most part it still out lasted the test of time ⏰ by providing a strong foundational knowledge for Amazon’s relational, NoSQL, and Data warehousing solutions. But unfortunately in technology years, it’s kind of like dog🐕 years. So obviously, there have been many innovations to all three of these incredible solutions including UI enhancements, architectural improvements and additional features to these great AWS offerings making them even more awesome!

So for a grand finale to our marvelous week of learning and to help us fill in the gaps on some of these major enhancements as well as offering some additional insights were the team from AWS Training and certification which includes the talented fashionista Michelle Metzger, the always effervescent and insightful Blaine Sundrud and on demos the man with a quirky naming convention for database objects the always witty Stephen Cole

Back in our Amazon Web Services Databases in Depth course and in effort to make our journey that more captivating, Richard provided us with a nifty mobile sports 🏀 ⚾️ 🏈 app solution written in Node.js which leverages the Amazon data offerings covered in the course as components for an end to end solution. As the solution, was written several years back it did require some updating on some deprecated libraries📚 and some code changes in order to make the solution work which made our learning that more fulfilling. So, after a great introduction from Richard where he compares and contrasts RDS🛢, DynamoDB🧨, and Redshift🎆, we began our journey with Amazon’s Relational Database Service (RDS🛢). RDS🛢 is a database as a service (DBaaS) that makes provisioning, operating and scaling⚖️ either up or out seamless. In addition, RDS🛢 makes other time-consuming administrative tasks such as patching, and backups a thing of the past. Amazon RDS🛢 provides high availability and durability through the use of Multi-AZ deployments. In other words, AWS creates multiple instances of the databases in different Availability Zones making recovery from infrastructure failure automatic and almost transparent to the application. Of course like with all AWS offerings there always a heavy emphasis on security🔐 which it’s certainly reassuring when you putting your mission critical data in their hands 🤲 but it could also be a bit challenging at first to get things up and running when you are simply just trying connect to from your home computer 💻 back to the AWS infrastructure

As prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to put to your RDS🛢instance(s) in. You can leverage an existing one or you can create a dedicated one for RDS🛢instance(s).

It is required that your VPC have at least two subnets in order to support the Availability Zones for high availability. If direct internet access is needed that you will need to add an internet gateway to your VPC.

Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what controls who has access to the RDS🛢. RDS🛢 leverages three types of security groups (database, VPC, and EC2). As for credentials🔐 and entitlements in RDS🛢, it is managed through AWS Identity and Access Management (IAM). At the time of the release of Richard’s course, Amazon Aurora was new in the game and was not covered in depth in the course. In addition, at the same time only MySQL, Postgres, Oracle, MS SQL Server and the aforementioned Aurora were only supported at this time. AWS has since added support for MariaDB to their relational database service portfolio.

Fortunately, our friends from the AWS Training and Certification group gave us the insights that we would require on Amazon’s innovation behind their relational database built for the cloud☁️ better known as Aurora.

So, with six familiar database engines (licensing costs apply) to choose from you have quite a few options. Another key🔑 decision is to determines the resources you want your database to have. RDS🛢offers multiple options for optimized for memory, performance, or I/O.

I would be remised if we didn’t briefly touch on Amazon’s Aurora. As mentioned, it’s one of Amazon’s 6 database options with RDS🛢. Aurora is fully managed by RDS🛢. So it leverages the same great infrastructure and has all the same bells 🔔 and whistles. Aurora comes in two different flavors🍦 MySQL and PostgreSQL. Although, I didn’t benchmark Aurora performance in my evaluation AWS claims that Aurora is 5x faster than the standard MySQL databases. However, it’s probably more like 3x faster. But the bottom line it is that it is faster and more cost-effectiveness for MySQL or PostgreSQL databases that require optimal performances, availability, and reliability. The secret sauce behind Aurora is that automatically maintains 6 copies of your data (which can be increased up to 15 replicas) that is spanned across 3 AZs making data highly available and ultimately providing laser⚡️ fast performance for your database instances.

Please note: There is an option that allows a single Aurora database to span multiple AWS Regions 🌎 for an additional cost

In addition, Aurora uses an innovative and significantly faster log structured distributed storage layer than other RDS🛢offerings.

“Welcome my son, welcome to the machine🤖”

Next on our plate 🍽 was to take a deep dive into Amazon’s fast and flexible NoSQL database service a.k.a DynamoDB🧨.. DynamoDB🧨 like Cosmo DB🪐 is a multi-model NoSQL solution.

DynamoDB🧨 combines the best of those two ACID compliant non-relational databases in a key-value🔑 and document database. It is a proprietary engine, so can’t just take your MongoDB🍃 database and convert it to DynamoDB🧨. But don’t worry if you looking to move your MongoDB🍃 works loads to Amazon, AWS offers Amazon DocumentDB (with MongoDB compatibility) but that’s for a later discussion 😉

As for DynamoDB🧨, it delivers a blazing⚡️ single-digit millisecond guaranteed performance at any scale⚖️. It’s a fully managed, multi-Region, multi-master database with built-in security🔐, backup and restore options, and in-memory caching for internet-scale applications. DynamoDB🧨 automatically scales⚖️ up and down to adjust for the capacity and maintain performance of your systems. Availability and fault tolerance are built in, eliminating the need to architect your applications for these capabilities. An important concept to grasp while working with DynamoDB🧨 is that the databases are comprised of tables, items, and attributes. Again, there has been some major architectural design changes to DynamoDB🧨 since Richard’s course was released. Not to go into too many details as its kind or irrelevant but at time⏰ the course was released DynamoDB🧨 used to offer the option to either use a Hash Primary Key🔑 or Hash and Range Primary Key🔑 to organize or partition data and of course as you would imagine choosing the right combination was rather confusing. Intuitively, AWS scrapped this architectural design and the good folks at the AWS Training and Certification group were so kind to offer clarity here as well 😊

Today, DynamoDB🧨 uses partition keys🔑 to find each item in the database similar to Cosmo DB🪐. Data is distributed on physical storage nodes. DynamoDB🧨 uses the partition key🔑 to determine which of the nodes the item is located on. It’s very important to choice the right partition key 🔑 to avoid the dreaded hot 🔥partitions. Again “As rule of thumb 👍, an ideal Partition key🔑 should have a wide range of values, so your data is evenly spread across logical partitions. Also in DynamoDB🧨 items can have an optional sort key🔑 to store related attributes in a sorted order.

One major difference to Cosmos DB🪐 is that DynamoDB🧨 utilizes a primary key🔑 on each table. If there is no sort key🔑, the primary and partition keys🔑 are the same. If there is a sort key🔑, the primary key🔑 is a combination of the partition and sort key 🔑 which is called a composite primary key🔑 .

DynamoDB🧨 allows for secondary indexes for faster searches. It supports two types indexes local (up to 5 per table) and global (up to 20 per table). These indexes can help improve the application’s ability to access data quickly and efficiently.

Differences Between Global and Local Secondary Indexes

GSI	LSI
Hash or hash and range key	Hash and range key
No size limit	For each key, 10GB max
Add during table create, or later	Add during table create
Query all partitions in a table	Query single partition
Eventually consistent queries	Eventually/strong consistent queries
Dedicated throughput units	User table throughput units
Only access projected items	Access all attributes from table

DynamoDB🧨 like Cosmo DB🪐 offers multiple Data Consistency Levels. DynamoDB🧨 Offers both Eventually and Strongly consistent Reads but like I said previously “it’s like life itself there is always tradeoffs. So, depending on your application needs. You will need to determine what’s the most important need for your application latency or availability.”

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to put DynamoDB🧨 in. You can leverage an existing one or you can create a dedicated one for DynamoDB🧨 Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what controls who has access to DynamoDB🧨. As for authentication🔐 and permission to access a table, it is managed through Identity and Access Management (IAM). DynamoDB🧨 provides end-to-end enterprise-grade encryption for data that is both in transit and at rest. All DynamoDB tables have encryption at rest enabled by default. This provides enhanced security by encrypting all your data using encryption keys🔑stored in the AWS Key🔑Management System, or AWS KMS.

“Quicker than a ray of light I’m flying”

Making our final destination for this week’s explorations would be to Amazon’s fully managed, fast, scalable data warehouse known as Redshift🎆 . A “Red Shift🎆” is when a wavelength of the light is stretched, so the light is seen as ‘shifted’ towards the red part of the spectrum but according to anonymous sources “RedShift🎆 was apparently named very deliberately as a nod to Oracle’ trademark red branding, and Salesforce is calling its effort to move onto a new database “Sayonara,”” Be that what it may, this would be the third Data Warehouse cloud☁️ solution we would have the pleasure of be aquatinted with. 😊

AWS claims Redshift🎆 delivers 10x faster performance than other data warehouses. We didn’t have a chance to benchmark RedShift’s 🎆 performance but based some TPC tests vs some of their top competitors there might be some discrepancies with these claims but either case it’s still pretty darn on fast.

Redshift🎆 uses Massively parallel processing (MPP) and columnar storage architecture. The core unit that makes up Redshift🎆 is the cluster. The Cluster is made up of one or more compute nodes. There is a single leader node and several compute nodes. Clients access to Redshift🎆 is via a SQL endpoint on the leader node. The client sends a query to the endpoint. The leader node creates jobs based on the query logic and sends them in parallel to the compute nodes. The compute nodes contain the actual data the queries need. The compute nodes find the required data, perform operations, and return results to the leader node. The leader node then aggregates the results from all of the compute nodes and sends a report back to the client.

The compute nodes themselves are individual servers, they have their own dedicated memory, CPU, and attached disks. An individual compute node is actually split up into slices🍕, one slice🍕 for every core of that node’s processor. Each slice🍕 is given a piece of memory, disk, and so forth, where it processes whatever part of the workflow that’s been assigned to it by the leader node.

The way the columnar database storage works data is stored by columns rather than by rows. This allows for fast retrieval of columns of data. An additional advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. Again, there have been several architectural changes to RedShift🎆 as well since Richard’s course was released.

In the past you needed to pick a distribution style. Today, you still have the option to choose a distribution style but if don’t specify a distribution style, Redshift🎆 will uses AUTO distribution making it little easier not to make the wrong choice 😊. Another recent innovation to Redshift🎆 that didn’t exist when the Richard’s course was released is the ability to build a unified data platform. Amazon Redshift🎆 Spectrum allows you to run queries across your data warehouse and Amazon S3 buckets simultaneously. Allowing you to save time ⏰ and money💰 as you don’t need to load all your data into the data warehouse.

As prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place Redshift🎆 in. You can leverage an existing one or you can create a dedicated one just for Redshift🎆. In addition, you will need to create an Amazon Simple Storage Service (S3) bucket and S3 Endpoint to be used with Redshift🎆. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what controls who has access to your data warehouse. As for credentials 🔐 and entitlements in Redshift🎆, it is managed through AWS Identity and Access Management (IAM).

One last point worth mentioning is that AWS Cloud ☁️ Watch ⌚️ is included with all the tremendous Cloud☁️ Services offered by AWS. So you get great monitoring 📉right of the box! 😊 We enjoyed 😊 our time⏰ this week in AWS exploring 🔦 some of their data offerings, but we merely just scratched the service.

“So much to do, so much to see… So, what’s wrong with taking the backstreets? You’ll never know if you don’t go. You’ll never shine if you don’t glow“

Below are some topics I am considering for my travels next week:

More with AWS Data Solutions
OKTA SSO
Neo4J and Cypher
More with Google Cloud Path
ONTAP Cluster Fundamentals
Data Visualization Tools (i.e. Looker)
Additional ETL Solutions (Stitch, FiveTran)
Process and Transforming data/Explore data through ML (i.e. Databricks)

Thanks

—MCS