Week of July 24th

“And you may ask yourself 🤔, well, how did I get here?” 😲

Happy Opening⚾️ Day!

Last week, we hit a milestone of sorts with our 20th submission🎖 since we started our journey way back in March.😊 To commemorate the occasion, we gleefully 😊 returned to AWS to explore their data ecosystem. Of course, trying to cover all the data services AWS makes available in such a short duration 🕰 would be a daunting task.

So last week, we concentrated our travels on three of their main offerings in the Relational Database, NoSQL, and Data warehouse realms. These being, of course, RDS🛢, DynamoDB🧨, and Redshift🎆. We felt rather content 🤗 and enlightened💡 with AWS's Relational Database and Data warehouse offerings, but we were still feeling a little less satisfied 🤔 with NoSQL, as we had really just scratched the surface of what AWS had to offer.

To recap, we had explored 🔦 DynamoDB🧨, AWS's multi-model NoSQL service, which offers support for key-value🔑 and their proprietary document📜 database models. But we were still curious to learn more about a document📜 database in AWS that also offers MongoDB🍃 support. In addition, we wanted an answer to the hottest🔥 trend in "Not Just SQL" solutions: the Graph📈 database.

Well, of course, being the Cloud☁️ provider that offers every Cloud☁️-native service from A to Z, AWS delivered with many great options. So we began our voyage heading straight over to DocumentDB📜, AWS's fully managed database service with MongoDB🍃 compatibility. As with all AWS services, DocumentDB📜 was designed from the ground up to give the most optimal performance, scalability⚖️, and availability. DocumentDB📜, like the Cosmos DB🪐 MongoDB🍃 API, makes it easy to set up, operate, and scale MongoDB-compatible databases. In other words, no code changes are required, and all the same drivers can be utilized by existing legacy MongoDB🍃 applications.

In addition, DocumentDB📜 removes the friction and complications that arise when an application tries to map JSON to a relational model. DocumentDB📜 solves this problem by making JSON documents a first-class object of the database. Data is stored in the form of documents📜, and these documents📜 are stored in collections. Each document📜 can have a unique combination and nesting of fields or key-value🔑 pairs, making querying the database faster⚡️, indexing more flexible, and replication easier.

Similar to other AWS data offerings, the core unit that makes up DocumentDB📜 is the cluster. A cluster consists of one or more instances and a cluster storage volume that manages the data for those instances. All writes📝 are done through the primary instance, while all instances (primary and replicas) support read 📖 operations. The cluster storage volume keeps six copies of your data across three different Availability Zones. AWS makes it easy to create or modify clusters. When you modify a cluster, AWS is really just spinning up a new cluster behind the curtains and then migrating the data, taking what is otherwise a complex task and making it seamless.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place DocumentDB📜 in. You can leverage an existing one, or you can create a dedicated one just for DocumentDB📜. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your Document📜 databases. As for credentials🔐 and entitlements in DocumentDB📜, they are managed through AWS Identity and Access Management (IAM). By default, DocumentDB📜 clusters accept only secure connections using Transport Layer Security (TLS), so all traffic in transit is encrypted. For data at rest, Amazon DocumentDB📜 uses the 256-bit Advanced Encryption Standard (AES-256) and lets you encrypt your clusters using keys🔑 you manage through AWS Key🔑 Management Service (AWS KMS), so data at rest is always encrypted.
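To make that a bit more concrete, below is a minimal sketch of connecting to a DocumentDB📜 cluster over TLS with Python's pymongo driver. The endpoint, credentials, and CA bundle path are hypothetical placeholders, not values from our actual cluster.

from pymongo import MongoClient

# Hypothetical cluster endpoint and credentials -- substitute your own values.
client = MongoClient(
    "my-docdb-cluster.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com",
    port=27017,
    username="masteruser",
    password="********",
    tls=True,
    tlsCAFile="rds-combined-ca-bundle.pem",  # CA bundle downloaded from AWS
    replicaSet="rs0",
    readPreference="secondaryPreferred",     # send reads to the replicas
    retryWrites=False,                       # DocumentDB does not support retryable writes
)

db = client["travel"]                  # a database
collection = db["destinations"]        # a collection of JSON-like documents
collection.insert_one({"city": "New York", "visited": True})
print(collection.find_one({"city": "New York"}))

Same drivers, same syntax as MongoDB🍃, which is really the whole point.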

“Such wonderful things surround you…What more is you lookin’ for?”

Lately, we have been really digging Graph📈 databases. We had our first visit with Graph📈 databases when we were exposed to the Graph📈 API through Cosmos DB🪐 earlier this month, and we then furthered our curiosity through Neo4J. Well, now armed with a plethora of knowledge in the Graph📈 database space, we wanted to see what AWS had to offer, and once again they did not disappoint.😊

First, let me start by writing that it's a little difficult to compare AWS Neptune🔱 to Neo4J, although Nous Populi from Leapgraph does an admirable job. Obviously, both are graph📈 databases, but architecturally there are some major differences in their graph storage models and query languages. Neo4J uses Cypher, while Neptune🔱 uses Apache TinkerPop's Gremlin👹 (the same as Cosmos DB🪐) as well as SPARQL. Where Neptune🔱 really shines☀️ is that it's not just another graph database but a great service offering within the AWS portfolio. So, it leverages all the great bells🔔 and whistles like fast⚡️ performance, scalability⚖️, high availability, and durability, as well as being the kind of fully managed service we have become accustomed to, handling hardware provisioning, software patching, backup, recovery, failure detection, and repair. Neptune🔱 is optimized for storing billions of relationships and querying the graph with millisecond latency.

Neptune🔱 uses database instances. The primary database instance supports both read📖 and write📝 operations and performs all the data modifications to the cluster. Neptune🔱 also uses replicas, which connect to the same cloud-native☁️ storage volume as the primary database instance but support only read operations. There can be up to 15 of these replicas across multiple Availability Zones. In addition, Neptune🔱 supports encryption at rest.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place Neptune🔱 in. You can leverage an existing one, or you can create a dedicated one just for Neptune🔱. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your Neptune🔱 cluster. As for credentials🔐 and entitlements, Neptune🔱 is managed through AWS Identity and Access Management (IAM). Your data at rest in Neptune🔱 is encrypted using the industry-standard AES-256 encryption algorithm on the server that hosts your Neptune🔱 instance. Keys🔑 can also be used, which are managed through AWS Key🔑 Management Service (AWS KMS).
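For a taste of what talking to Neptune🔱 looks like in practice, here's a minimal sketch using the open-source gremlinpython driver. The cluster endpoint is a made-up placeholder, and we're assuming IAM authentication is disabled for simplicity.

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical Neptune cluster endpoint -- replace with your own.
endpoint = "wss://my-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin"

conn = DriverRemoteConnection(endpoint, "g")
g = traversal().withRemote(conn)

# Add two vertices and an edge, then traverse the relationship.
g.addV("person").property("name", "John").next()
g.addV("company").property("name", "Acme").next()
g.V().has("name", "John").addE("worksAt").to(__.V().has("name", "Acme")).next()

print(g.V().has("name", "Acme").in_("worksAt").values("name").toList())  # ['John']

conn.close()

Same Gremlin👹 steps we saw with Cosmos DB🪐, just pointed at a different endpoint.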

“Life moves pretty fast⚡️. If you don’t stop 🛑 and look 👀 around once in a while, you could miss it.”

So now feeling pretty good 😊 about NoSQL on AWS, where do we go now? 

Well, as mentioned, we have been learning so much over the last 5 months that it could be very easy to forget some things, especially with limited storage capacity. So we decided to take a pause for the rest of the week and go back and review all that we have learned by re-reading all our previous blog posts, as well as engaging in some Google Data Engineering Quests🛡 to help reinforce our previous learnings.

Currently, the fine folks at qwiklabs.com are offering anyone who wants to learn Google Cloud ☁️ skills unlimited access for 30 days. So, with an offer too good to pass up, as well as an opportunity to add some flair to our LinkedIn profile (and who doesn't like flair?), we dove right in head first!

“Where do we go? Oh, where do we go now? Now, now, now, now, now, now, now”

Below are some topics I am considering for my travels next week:

  • OKTA SSO
  • Neo4J and Cypher 
  • More with Google Cloud Path
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Thanks

—MCS

Week of July 17th

“Any time of year… You can find it here”

Happy World🌎 Emoji 😊😜 Day! 

The last few weeks, we have been enjoying our time in Microsoft's Cloud☁️ data ecosystem, and it was just last month that we found ourselves hanging out with the GCP☁️ gang and their awesome data offerings. All seemed well and good😊, except that we had been missing out on an excursion to the one cloud☁️ provider where it all began, literally and figuratively.

Back when we first began our journey, on a cold 🥶 and rainy☔️ day in March just as Covid-19🦠 occupied Wall Street 🏦 and the rest of the planet 🌎, we started at the one place that disrupted how infrastructure and operations would be implemented and supported going forward.

That's right: Amazon Web Services, more endearingly known to humanity as AWS. AWS was launched back in 2006 by its parent company, the one that offers everything from A to Z.

AWS, like its parent company, has a similar mantra in the cloud ☁️ computing world, offering 200+ Cloud☁️ services. So how the heck have so many months passed without us going back? The question is quite perplexing. But like they say, "all Clouds☁️☁️ lead to AWS." So, here we are, back in the AWS groove 🎶 and eager 😆 to explore 🔦 the wondrous world🌎 of AWS data solutions. Guiding us through this vast journey would be Richard Seroter (who, ironically, recently joined the team at Google). In 2015, Richard authored an amazing Pluralsight course covering Amazon RDS🛢, Amazon DynamoDB 🧨, and Amazon Redshift 🎆. It was like getting 3 courses for the price of 1! 😊

Although the course was several years old, for the most part it has still stood the test of time ⏰ by providing a strong foundational knowledge of Amazon's relational, NoSQL, and data warehousing solutions. But unfortunately, technology years are kind of like dog🐕 years. So obviously, there have been many innovations to all three of these incredible solutions, including UI enhancements, architectural improvements, and additional features, making these great AWS offerings even more awesome!

So, for a grand finale to our marvelous week of learning, helping us fill in the gaps on some of these major enhancements and offering some additional insights was the team from AWS Training and Certification: the talented fashionista Michelle Metzger, the always effervescent and insightful Blaine Sundrud, and, on demos, the man with a quirky naming convention for database objects, the always witty Stephen Cole.

Back in our Amazon Web Services Databases in Depth course, and in an effort to make our journey that much more captivating, Richard provided us with a nifty mobile sports 🏀 ⚾️ 🏈 app solution written in Node.js, which leverages the Amazon data offerings covered in the course as components of an end-to-end solution. As the solution was written several years back, it did require updating some deprecated libraries📚 and making some code changes to get it working, which made our learning that much more fulfilling. So, after a great introduction from Richard, where he compares and contrasts RDS🛢, DynamoDB🧨, and Redshift🎆, we began our journey with Amazon's Relational Database Service (RDS🛢). RDS🛢 is a database as a service (DBaaS) that makes provisioning, operating, and scaling⚖️ (either up or out) seamless. In addition, RDS🛢 makes other time-consuming administrative tasks such as patching and backups a thing of the past. Amazon RDS🛢 provides high availability and durability through the use of Multi-AZ deployments. In other words, AWS creates multiple instances of the databases in different Availability Zones, making recovery from infrastructure failure automatic and almost transparent to the application. Of course, like with all AWS offerings, there is always a heavy emphasis on security🔐, which is certainly reassuring when you are putting your mission-critical data in their hands 🤲, but it can also be a bit challenging at first to get things up and running when you are simply trying to connect from your home computer 💻 back to the AWS infrastructure.

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to put your RDS🛢 instance(s) in. You can leverage an existing one, or you can create a dedicated one for your RDS🛢 instance(s).

It is required that your VPC have at least two subnets in order to support the Availability Zones for high availability. If direct internet access is needed, you will need to add an internet gateway to your VPC.

Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to the RDS🛢 instance. RDS🛢 leverages three types of security groups (database, VPC, and EC2). As for credentials🔐 and entitlements in RDS🛢, they are managed through AWS Identity and Access Management (IAM). At the time Richard's course was released, Amazon Aurora was new to the game and was not covered in depth. In addition, only MySQL, PostgreSQL, Oracle, MS SQL Server, and the aforementioned Aurora were supported at that time. AWS has since added support for MariaDB to their relational database service portfolio.

Fortunately, our friends from the AWS Training and Certification group gave us the insights we needed on Amazon's innovation behind their relational database built for the cloud☁️, better known as Aurora.

So, with six familiar database engines (licensing costs may apply) to choose from, you have quite a few options. Another key🔑 decision is determining the resources you want your database to have. RDS🛢 offers multiple instance options optimized for memory, performance, or I/O.
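As a rough sketch of what provisioning an instance looks like programmatically with boto3, here's a minimal example; the identifier, engine, instance class, credentials, and security group are hypothetical placeholders:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Hypothetical values -- pick the engine and instance class that fit your workload.
rds.create_db_instance(
    DBInstanceIdentifier="sports-app-db",
    Engine="mysql",                      # one of the six supported engines
    DBInstanceClass="db.t3.medium",      # memory/performance/I/O optimized classes are available
    AllocatedStorage=20,                 # in GiB
    MasterUsername="admin",
    MasterUserPassword="********",
    MultiAZ=True,                        # keeps a standby in a second Availability Zone
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],
)

# Wait until the instance is available, then grab its endpoint.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="sports-app-db")
instance = rds.describe_db_instances(DBInstanceIdentifier="sports-app-db")["DBInstances"][0]
print(instance["Endpoint"]["Address"])

From there, it's the same MySQL connection string the sports 🏀 app would use anywhere else.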

I would be remiss if we didn't briefly touch on Amazon Aurora. As mentioned, it's one of Amazon's six database options within RDS🛢. Aurora is fully managed by RDS🛢, so it leverages the same great infrastructure and has all the same bells 🔔 and whistles. Aurora comes in two different flavors🍦: MySQL and PostgreSQL. Although I didn't benchmark Aurora's performance in my evaluation, AWS claims that Aurora is 5x faster than standard MySQL databases (in practice, it's probably more like 3x). But the bottom line is that it is faster and more cost-effective for MySQL or PostgreSQL databases that require optimal performance, availability, and reliability. The secret sauce behind Aurora is that it automatically maintains six copies of your data spanned across three AZs (and supports up to 15 read replicas), making data highly available and ultimately providing laser⚡️-fast performance for your database instances.

Please note: There is an option that allows a single Aurora database to span multiple AWS Regions 🌎 for an additional cost.

In addition, Aurora uses an innovative log-structured, distributed storage layer that is significantly faster than other RDS🛢 offerings.

“Welcome my son, welcome to the machine🤖”

Next on our plate 🍽 was a deep dive into Amazon's fast and flexible NoSQL database service, a.k.a. DynamoDB🧨. DynamoDB🧨, like Cosmos DB🪐, is a multi-model NoSQL solution.

DynamoDB🧨 combines the best of two ACID-compliant non-relational models: the key-value🔑 store and the document database. It is a proprietary engine, so you can't just take your MongoDB🍃 database and convert it to DynamoDB🧨. But don't worry: if you're looking to move your MongoDB🍃 workloads to Amazon, AWS offers Amazon DocumentDB (with MongoDB compatibility), but that's for a later discussion 😉

As for DynamoDB🧨, it delivers blazing⚡️ single-digit-millisecond guaranteed performance at any scale⚖️. It's a fully managed, multi-Region, multi-master database with built-in security🔐, backup and restore options, and in-memory caching for internet-scale applications. DynamoDB🧨 automatically scales⚖️ up and down to adjust capacity and maintain the performance of your systems. Availability and fault tolerance are built in, eliminating the need to architect your applications for these capabilities. An important concept to grasp while working with DynamoDB🧨 is that databases are comprised of tables, items, and attributes. Again, there have been some major architectural design changes to DynamoDB🧨 since Richard's course was released. Not to go into too many details, as it's kind of irrelevant now, but at the time⏰ the course was released, DynamoDB🧨 offered the option to use either a Hash primary key🔑 or a Hash-and-Range primary key🔑 to organize or partition data, and as you would imagine, choosing the right combination was rather confusing. Intuitively, AWS reworked this design, and the good folks at the AWS Training and Certification group were kind enough to offer clarity here as well 😊

Today, DynamoDB🧨 uses partition keys🔑 to find each item in the database, similar to Cosmos DB🪐. Data is distributed across physical storage nodes, and DynamoDB🧨 uses the partition key🔑 to determine which node an item is located on. It's very important to choose the right partition key 🔑 to avoid the dreaded hot 🔥 partitions. Again, as a rule of thumb 👍, an ideal partition key🔑 should have a wide range of values, so your data is evenly spread across logical partitions. Also, in DynamoDB🧨, items can have an optional sort key🔑 to store related attributes in a sorted order.

One major difference from Cosmos DB🪐 is that DynamoDB🧨 requires a primary key🔑 on each table. If there is no sort key🔑, the primary and partition keys🔑 are the same. If there is a sort key🔑, the primary key🔑 is a combination of the partition and sort keys 🔑, which is called a composite primary key🔑.
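Here's a minimal boto3 sketch of a table with a composite primary key🔑; the table and attribute names are made up for illustration:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# Hypothetical table keyed by player (partition key) and game date (sort key).
table = dynamodb.create_table(
    TableName="GameScores",
    KeySchema=[
        {"AttributeName": "PlayerId", "KeyType": "HASH"},   # partition key
        {"AttributeName": "GameDate", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "PlayerId", "AttributeType": "S"},
        {"AttributeName": "GameDate", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# The composite primary key (PlayerId + GameDate) uniquely identifies each item.
table.put_item(Item={"PlayerId": "jordan23", "GameDate": "2020-07-17", "Points": 42})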

DynamoDB🧨 allows for secondary indexes for faster searches. It supports two types of indexes: local (up to 5 per table) and global (up to 20 per table). These indexes can help improve the application's ability to access data quickly and efficiently.

Differences Between Global and Local Secondary Indexes

GSI                                      LSI
Hash or hash and range key               Hash and range key
No size limit                            For each key, 10GB max
Add during table create, or later        Add during table create only
Query all partitions in a table          Query a single partition
Eventually consistent queries            Eventually/strongly consistent queries
Dedicated throughput units               Uses the table's throughput units
Only access projected items              Access all attributes from the table

DynamoDB🧨, like Cosmos DB🪐, offers multiple data consistency levels. DynamoDB🧨 offers both eventually and strongly consistent reads, but like I said previously, "like life itself, there are always tradeoffs." So, depending on your application's needs, you will need to determine what matters most for your application: latency or availability.
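To illustrate, the choice is just a flag on the read request. A quick sketch, reusing the hypothetical GameScores table from above:

import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("GameScores")

# Default read: eventually consistent (cheaper, may briefly lag the latest write).
eventual = table.get_item(Key={"PlayerId": "jordan23", "GameDate": "2020-07-17"})

# Strongly consistent read: reflects all prior successful writes,
# at roughly double the read-capacity cost and slightly higher latency.
strong = table.get_item(
    Key={"PlayerId": "jordan23", "GameDate": "2020-07-17"},
    ConsistentRead=True,
)
print(strong.get("Item"))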

Unlike RDS🛢, you don't place DynamoDB🧨 inside your own VPC; it's a fully managed service reached through a regional endpoint, and if you want to keep traffic off the public internet you can add a VPC gateway endpoint for DynamoDB🧨 and lock down access from there. As for authentication🔐 and permission to access a table, it is managed through AWS Identity and Access Management (IAM). DynamoDB🧨 provides end-to-end enterprise-grade encryption for data both in transit and at rest. All DynamoDB tables have encryption at rest enabled by default, which provides enhanced security by encrypting all your data using encryption keys🔑 stored in AWS Key🔑 Management Service (AWS KMS).

“Quicker than a ray of light I’m flying”

Our final destination for this week's explorations would be Amazon's fully managed, fast, scalable data warehouse known as Redshift🎆. A "red shift🎆" is when a wavelength of light is stretched, so the light is seen as 'shifted' toward the red part of the spectrum, but according to anonymous sources, "RedShift🎆 was apparently named very deliberately as a nod to Oracle's trademark red branding, and Salesforce is calling its effort to move onto a new database 'Sayonara.'" Be that as it may, this would be the third data warehouse cloud☁️ solution we would have the pleasure of being acquainted with. 😊

AWS claims Redshift🎆 delivers 10x faster performance than other data warehouses. We didn't have a chance to benchmark Redshift's 🎆 performance ourselves, but based on some TPC tests against some of their top competitors, there might be some discrepancies with these claims. In either case, it's still pretty darn fast.

Redshift🎆 uses a massively parallel processing (MPP) and columnar storage architecture. The core unit that makes up Redshift🎆 is the cluster. A cluster is made up of a single leader node and one or more compute nodes. Clients access Redshift🎆 via a SQL endpoint on the leader node. The client sends a query to the endpoint, and the leader node creates jobs based on the query logic and sends them in parallel to the compute nodes. The compute nodes contain the actual data the queries need; they find the required data, perform operations, and return results to the leader node. The leader node then aggregates the results from all of the compute nodes and sends a report back to the client.

The compute nodes themselves are individual servers with their own dedicated memory, CPU, and attached disks. An individual compute node is actually split up into slices🍕, one slice🍕 for every core of that node's processor. Each slice🍕 is given a piece of memory, disk, and so forth, where it processes whatever part of the workload has been assigned to it by the leader node.

With columnar storage, data is stored by column rather than by row, which allows for fast retrieval of the columns a query actually needs. An additional advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. Again, there have been several architectural changes to Redshift🎆 as well since Richard's course was released.

In the past, you needed to pick a distribution style. Today, you still have the option to choose one, but if you don't specify a distribution style, Redshift🎆 will use AUTO distribution, making it a little easier not to make the wrong choice 😊. Another recent innovation to Redshift🎆 that didn't exist when Richard's course was released is the ability to build a unified data platform. Amazon Redshift🎆 Spectrum allows you to run queries across your data warehouse and Amazon S3 buckets simultaneously, saving you time ⏰ and money💰 since you don't need to load all your data into the data warehouse.
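Because Redshift🎆 speaks the PostgreSQL wire protocol, querying it from Python is a short sketch away. The cluster endpoint, credentials, and table below are hypothetical, and DISTSTYLE AUTO is spelled out just to show where the distribution style lives:

import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-redshift-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,                      # Redshift's default port
    dbname="dev",
    user="awsuser",
    password="********",
)

with conn, conn.cursor() as cur:
    # DISTSTYLE AUTO lets Redshift pick (and adjust) how rows are distributed across slices.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS game_scores (
            player_id VARCHAR(32),
            game_date DATE,
            points    INTEGER
        ) DISTSTYLE AUTO SORTKEY (game_date);
    """)
    cur.execute("SELECT player_id, SUM(points) FROM game_scores GROUP BY player_id;")
    for row in cur.fetchall():
        print(row)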

As a prerequisite, you first must create and configure a virtual private cloud☁️ (VPC) to place Redshift🎆 in. You can leverage an existing one, or you can create a dedicated one just for Redshift🎆. In addition, you will need to create an Amazon Simple Storage Service (S3) bucket and an S3 endpoint to be used with Redshift🎆. Next, you need to configure security🔒 groups for your VPC. Security🔒 groups are what control who has access to your data warehouse. As for credentials 🔐 and entitlements in Redshift🎆, they are managed through AWS Identity and Access Management (IAM).

One last point worth mentioning is that Amazon CloudWatch ⌚️ is included with all the tremendous Cloud☁️ services offered by AWS, so you get great monitoring 📉 right out of the box! 😊 We enjoyed 😊 our time⏰ this week in AWS exploring 🔦 some of their data offerings, but we merely scratched the surface.

So much to do, so much to see… So, what’s wrong with taking the backstreets? You’ll never know if you don’t go. You’ll never shine if you don’t glow

Below are some topics I am considering for my travels next week:

  • More with AWS Data Solutions
  • OKTA SSO
  • Neo4J and Cypher 
  • More with Google Cloud Path
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Thanks

—MCS

Week of July 3rd

“Hanging in the cosmos 🌌 like a space ornament”

Happy Birthday🎂🎁🎉🎈America🇺🇸 !

“Now let me welcome everybody to the Wild, Wild West 🤠. A state database that’s untouchable like Eliot Ness.” So, after spending a good concentrated week in the “humongous” document database world better known as the popular MongoDB🍃, it only made sense to continue our Jack Kerouac-like adventures through the universe 🌌 of “Not only SQL” databases.  

“So many roads, so many detours. So many choices, so many mistakes.” -Carrie Bradshaw

But with so many document databases, table and key-value stores, and columnar and graph databases to choose from in the NoSQL universe, where shall we go? Well, after a brief deliberation, we turned to the one place that empowers every person and every organization on the planet to achieve more. That's right, Microsoft! Besides, we haven't been giving Mr. Softy enough love ❤️ in our travels. So, we figured we would take a stab and see what MSFT had to offer. Oh boy, did we hit eureka with Microsoft's Cosmos DB🪐!

For those not familiar with Microsoft's Cosmos DB🪐, it was released for GA in 2017. The solution morphed out of Azure DocumentDB (the "Un-cola"🥤 of document databases of its day), which was initially released in 2014. At the time of its inception, Azure DocumentDB was the only NoSQL Cloud☁️ solution (MongoDB🍃 Atlas☁️ was released two years later, in 2016), but its popularity was still limited. Fortunately, MSFT saw the "forest 🌲🌲🌲 through the trees🌳", or shall I say the planets🪐 through the stars ✨, and knew there was a lot more to NoSQL than just some JSON and a bunch of curly braces. So, they "pimped up" Azure DocumentDB and gave us the Swiss🇨🇭 Army knife of NoSQL solutions through their rebranded offering, Cosmos DB🪐.

Cosmos DB 🪐 is a multi-model NoSQL Database as a Service (NDaaS) that manages data at planetary 🌎 scale ⚖️! Huh? In other words, Cosmos DB🪐 supports 6 different NoSQL solutions through the beauty of APIs (Application Programming Interfaces). Yes, you read that correctly. Six! Cosmos DB🪐 supports the SQL API, which was originally intended to be used with the aforementioned Azure DocumentDB and uses the friendly SQL query language; the MongoDB🍃 API (for all the JSON fans); the Cassandra API (columnar database); Azure Table Storage (table); etcd (key-value store); and, last but certainly not least, the Gremlin👹 API (graph database).

Cosmos DB🪐 provides virtually unlimited scale ⚖️ through both storage and throughput, and it automatically manages the growth of the data with server-side horizontal partitioning.

So, no worrying about adding more nodes or shards. …And that's not all! Cosmos DB🪐 does all the heavy lifting 🏋🏻‍♀️ with automatic global distribution and server-side partitioning for painless management over the scale and growth of your database. Not to mention, it offers a 99.999% SLA when data is distributed across multiple regions 🌎 (only a mere four 9s when you stick to a single region).

Yes, you read that right, too. 99.999% guarantee! Not just on availability… No, No, No… but five 9s on latency, throughput, and consistency as well!

Ok, so now I sound like a MSFT fanboy. Perhaps? So now, fully percolating ☕️ with excitement, who would guide us through such amazing innovation? Well, we found just the right tour guide in native New Yorker Lenni Lobel. Through his melodious 🎶 voice and over 5 decades of experience in IT, Lenni takes us on an amazing journey through Cosmos DB🪐 with his Pluralsight course, Learning Azure Cosmos DB🪐.

In the introduction, Lenni gives us his interpretation of NoSQL, which answers the common problem of the 3Vs of data, and recounts the roots of Cosmos DB🪐, which we touched on earlier. Lenni then explains how the Cosmos DB🪐 engine is an atom-record-sequence (ARS) based system. In other words, the database engine of Cosmos DB🪐 is capable of efficiently translating and projecting multiple data models by leveraging ARS. Still confused? Don't be. In more simplistic terms, under the covers Cosmos DB🪐 leverages the ARS framework to support multiple NoSQL technologies. It does this through APIs, placing each data model in its own schema-agnostic container, which is super cool 😎! Next, he discusses one of the cooler 😎 features of Cosmos DB🪐: automatic indexing. If you recall from our MongoDB travels, one of the main takeaways was a strong emphasis on the need for indexes in MongoDB🍃. Well, in Cosmos DB🪐 you need not worry; Cosmos DB🪐 does this for you automatically. The only main concern is choosing the right partition key🔑 for your container, and you must choose wisely, otherwise performance and cost will suffer.

Lenni further explains how one quantifies data performance through latency and throughput. In the world 🌎 of data, latency is how long the data consumer waits for the data to be received from end to end, whereas throughput is the performance of the database itself. First, Mr. Lobel demonstrates how to provision throughput in Cosmos DB🪐, which provides predictable throughput to the database through a server-less approach measured in Request Units (RUs). RUs are a blended measure of computational cost: CPU, memory, disk I/O, and network I/O.

So, like most server-less approaches, you don't need to worry about provisioning hardware to scale ⚖️ your workloads. You just need to properly allocate the right amount of RUs to a given container. The good news on RUs is that this setting is flexible, so it can be easily throttled up and down through the portal or even specified at an individual query level.

Please note: data writes are generally more expensive than data reads. The beauty of the RU approach is that you are guaranteed throughput and can predict cost. You will even be notified through a friendly error message when your workloads exceed a certain threshold. There is an option to run your workloads in an "auto-pilot ✈️ mode," in which Cosmos DB🪐 will adjust the RUs to a given workload, but beware: this option can be quite costly, so proceed with caution and discuss it with MSFT before considering using it.

In the interest of being fully transparent, unlike some of their competitors, Microsoft offers a Capacity Calculator, so you can figure out exactly how much it will cost to run your workloads (reserved RU/sec runs $0.008 per hour per 100 RU/sec). The next important consideration with regard to throughput is horizontal partitioning. Some might regard horizontal partitioning as strictly a storage concern, but in fact it also massively impacts throughput.

“Yes, it’s understood that partitioning and throughput are distinct concepts, but they’re symbiotic in terms of scale-out.”

Anyway, no need to fret… We simply create a container and let Cosmos DB🪐 automatically manage these partitions for us behind the scenes (including the distribution of partitions within a given data center). However, keep in mind that we must choose a proper partition key🔑, otherwise we can have a rather unpleasant😞 and costly🤑 experience with Cosmos DB🪐. Luckily, there are several best practices around choosing the right partition key🔑. Personally, I like to stick to the rule of thumb 👍 of always choosing a key🔑 with many distinct values, like hundreds or thousands. This can hopefully help avoid the dreaded Hot🔥 Partition.

Please note: Partition keys 🔑 are immutable, but there are documented workarounds for changing them in case you find yourself in that scenario.
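As a quick sketch of what this looks like with the azure-cosmos Python SDK (the account URI, key, and names below are placeholders), provisioning RUs and the partition key🔑 happens right when you create the container:

from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account endpoint and key.
client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<primary-key>")

db = client.create_database_if_not_exists(id="TravelsDB")

# Partition key and provisioned RU/s are set at container creation time.
container = db.create_container_if_not_exists(
    id="Destinations",
    partition_key=PartitionKey(path="/city"),  # choose a key with many distinct values
    offer_throughput=400,                      # RU/s provisioned for this container
)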

Now that we have a good grasp on how Cosmos DB🪐 handles throughput and latency through RUs and horizontal partitioning, what happens if your application is global 🌎 and your primary data is located halfway around the world 🌍? Our performance could suffer tremendously… 😢

Cosmos DB🪐 handles such challenges with one of the solution's most compelling features: global distribution of data. Microsoft intuitively leverages the ubiquity of its global data centers and offers turnkey, "point-and-click" global distribution so your data can seamlessly be geo-replicated across regions.

In cases where you have multiple masters, or data writers, Cosmos DB🪐 offers three options to handle write conflicts:

  • Option 1: Last Writer Wins (default) – based on the highest _ts property (or any other numeric property you designate as the conflict resolver property). The write with the higher value wins; if the resolver property is blank, the write with the higher _ts property wins.
  • Option 2: Merge Procedure (custom) – based on the result of a stored procedure.
  • Option 3: Conflict Feed (offline resolution) – based on quorum majority.

Whew 😅… But what about data consistency? How do we ensure our data is consistent across all of our locations? Well, once again, Cosmos DB🪐 does not disappoint, supporting five different options. Of course, like life itself, there are always tradeoffs. So, depending on your application's needs, you will need to determine what's most important for your application: latency or availability? Below are the options, ordered from strongest consistency (highest latency) to weakest (highest availability), with a small code sketch after the list:

  1. Strong – No dirty reads. Higher latency on writes while waiting for the write to be acknowledged by the Cosmos DB quorum; higher RU costs.
  2. Bounded Staleness – Dirty reads possible, bounded by time and number of updates, which is kind of like "skunked🦨 beer🍺": you decide the level of freshness you can tolerate.
  3. Session – (Default) No dirty reads for the writer (you read your own writes); dirty reads are possible for other users.
  4. Consistent Prefix – Dirty reads possible, but reads never see out-of-order writes. You never experience data returned out of order.
  5. Eventual – Stale reads possible, no guaranteed order. Fastest.
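Per client, the consistency level is just a constructor option in the azure-cosmos Python SDK; here's a rough sketch with a placeholder endpoint and key:

from azure.cosmos import CosmosClient

# Request Session consistency (the default) explicitly for this client; "Eventual",
# "ConsistentPrefix", "BoundedStaleness", or "Strong" work the same way, as long as
# the level is no stronger than what the account itself is configured for.
client = CosmosClient(
    "https://my-account.documents.azure.com:443/",   # hypothetical account URI
    credential="<primary-key>",
    consistency_level="Session",
)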

So, after focusing on these core concepts within Cosmos DB🪐, we were ready to dig our heels 👠 👠 right in and get this bad boy up and running 🏃🏻. So, after waiting about 15 minutes or so… we had our Cosmos DB🪐 fired up 🔥 and running in Azure… Not bad for such a complex piece of infrastructure. 😊

Next, we created a container and then a database and started our travels with the SQL API. Through the portal, we were easily able to manually write some JSON documents and add them to our collection.

In addition, through Lenni's brilliantly written .NET Core code samples, we were able to automate writing, querying, and reading data in bulk. Further, we were able to easily adjust throughput and latency through the portal by tweaking the RUs and enabling multi-region replication, and we were able to demonstrate this by re-running Lenni's code after the changes.

Although getting Lenni's code to work did take a little bit of troubleshooting with Visual Studio 2019, along with understanding how to fix the .NET SDK errors and some of the compilation errors from NuGet packages (all of which was out of our purview), needless to say we figured out how to troubleshoot the NuGet packages and modify some of the parameters in the code, like the _id field, the Cosmos DB🪐 server, and the Cosmos DB master key 🔑.

We were able to enjoy the full experience of the SQL API, including the power⚡️ of using the familiar SQL query language and not having to worry about all the

db.collection.insertOne() this 

and 

db.collection.find(), 

db.collection.updateOne()

db.collection.deleteOne()

that..
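For instance, with the azure-cosmos Python SDK it's a plain SQL string doing the querying; here's a rough sketch (account, database, and container names are placeholders, not Lenni's actual samples):

from azure.cosmos import CosmosClient

# Hypothetical account, database, and container.
client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("TravelsDB").get_container_client("Destinations")

# Insert a JSON document, then query it back with familiar SQL.
container.create_item({"id": "1", "city": "NY", "name": "John", "likes": "pizza"})

items = container.query_items(
    query="SELECT c.name, c.likes FROM c WHERE c.city = @city",
    parameters=[{"name": "@city", "value": "NY"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item)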

We also got to play with server-side programming in Cosmos DB🪐, like the familiar concepts of stored procedures, triggers, and user-defined functions, which in Cosmos DB🪐 are basically self-contained JavaScript functions that are deployed to the database for execution. But one can always pretend we are in the relational database world. 😊

Next, we got to test drive 🚙 the Data Migration Tool 🛠, which allows you to import data from existing data sources into Cosmos DB🪐.

From our past experiences, we have found Microsoft has gotten quite good at creating these types of tools 🧰. The Cosmos DB🪐 Data Migration Tool offers great support for many data sources like SQL Server, JSON files, CSV files, MongoDB, Azure Table storage, and others.

First, we used the UI to move data from Microsoft SQL Server 2016 and the popular AdventureWorks example database to Cosmos DB🪐, and then later through the CLI (azcopy) from Azure Table storage.

Notably, Azure Table Storage is on the roadmap to be deprecated and automatically migrated to Cosmos DB🪐, but this was a good exercise for those who can't wait and want to take advantage of such an awesome platform today!

As a grand finale, we got to play with graph databases through the Gremlin 👹 API. As many of you might be aware, graph databases are becoming exceedingly popular these days, mostly because data in the real world is naturally connected through relationships, and graph databases do a better job than our traditional RDBMS of managing data when many complex relationships exist.

Again, it's worth noting that in the case of Cosmos DB🪐, it doesn't really matter what data model you're implementing, because, as we mentioned earlier, it leverages the ARS framework. So as far as Cosmos DB🪐 is concerned, it's just another container to manage, and we get all the horizontal partitioning, provisioned throughput, global distribution, and indexing goodness 😊.

We were new to the whole concept of graph databases, so we were very excited to get some exposure here, which looks to be a precursor for further explorations. The most important highlight of graph databases is understanding vertex and edge objects. These are basically just fancy-schmancy words for entities and relationships. A vertex is an entity, and an edge is a relationship between any two vertices. Both can hold arbitrary key-value pairs 🔑🔑, and together they are the building blocks of a graph database.

Cosmos DB🪐 utilizes the Apache TinkerPop standard, which uses Gremlin as a functional, step-by-step language to create vertices and edges, and stores the data as GraphSON, or "Graphical JSON."

In addition, Gremlin 👹 allows you to query the graph database by using simple traversals through a myriad of relationships, or edges. The more edges you add, the more relationships you define, and the more questions you can answer by running Gremlin👹 queries. 😊

To further our learning, Lenni once again gave us some nice demos using a fictitious company, "Acme," and its relationships between employees, airport terminals, and restaurants, and another example using comic book heroes, which made playing along fun.

Below are some examples of Gremlin 👹 syntax from our voyage.

g.addV('person').property('id','John').property('age',25).property('likes','pizza').property('city','NY')

g.addV('person').property('id','Alan').property('age',22).property('likes','seafood').property('city','NY')

g.addV('company').property('id','Acme').property('founded',2001).property('city','NY')

g.V().has('id','John').addE('worksAt').property('weekends', true).to(g.V().has('id','Acme'))

g.V().has('id','Alan').addE('worksAt').property('weekends', true).to(g.V().has('id','Acme'))

g.V().has('id','Alan').addE('manages').to(g.V().has('id','John'))

When it comes to graph databases, the possibilities are endless. Some good use cases for a graph database would be:

  • Complex Relationships – Many “many-to-many” relationships
  • Excessive JOINS
  • Analyze interconnected data relationships
  • Typical graph applications
    • Social networks 
    • Recommendation Engines

In Cosmos DB🪐, it's clear to see how a graph database is no different from any other data model. A graph database gets provisioned throughput and is fully indexed, partitioned, and globally distributed, just like a document collection in the SQL API or a table in the Table API.

Cosmos DB🪐 will one day allow you to switch freely between different APIs and data models within the same account, and even over the same data set. So, by adding this graph functionality to Cosmos DB🪐, Microsoft really hit ⚾️ this one out of the park 🏟!

Closing time… Every new beginning comes from some other beginning's end

Below are some topics I am considering for my wonderings next week:

  • Neo4J and Graph DB
  • More on Cosmos DB
  • More on MongoDB
  • More with Google Cloud Path
  • Working with Parquet files 
  • JDBC Drivers
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS

Week of June 26th

“…And I think to myself… What a wonderful world 🌎.”

Happy Coconut🥥 Day!

Recently, I had been spending so much time ⏰ in GCP land☁️ that it started to feel like my second home 🏡. However, it was time for a little data sabbatical. I needed to visit a land of mysticism✨ and intrigue, a place where developers can roam freely and where data can be flexible, semi-structured, and hierarchical in nature, and can easily be scaled horizontally… A place not bound to the rigidness of relational tables but a domicile of flexible documents. We would journey to the world of MongoDB🍃.

Ok, so we have been there before, but we needed a refresher. It had been about 6 years since we first became acquainted with this technological phenomenon. Besides, we hadn't played around too much with some of the company's innovations, like MongoDB🍃 Compass 🧭, a sleek visual environment that allows you to analyze and understand the contents of your data in MongoDB🍃, and MongoDB🍃 Atlas☁️, the managed service used to provision, maintain, and scale clusters of MongoDB🍃 instances, conveniently offered on AWS, Azure, and GCP.

To assist us in getting started would be our old comrade-in-arms, Pinal Dave, of SQLAuthority fame. Pinal put together an outstanding condensed course, Foundations of Document Databases with MongoDB.

So, this is where we would begin. The course commences with an introduction to NoSQL ("Not Just SQL") databases and some of the advantages of a document database, like an intuitive data model, a dynamic schema, and a distributed, scalable design. Then he gives us a comprehensible explanation of the CAP (Consistency, Availability, and Partition Tolerance) theorem and how a distributed system can only guarantee two of the three. MongoDB🍃 falls into the CP variety, compromising on availability. Next, the following key 🔑 points are made in relation to MongoDB🍃:

  • All write operations in MongoDB🍃 are atomic on the level of a single document
  • If the collection does not currently exist, the insert operation will create it.

Next, Pinal takes us through a few quick and easy steps on how to get set up with MongoDB Atlas☁️. Once our fully managed MongoDB🍃 cluster was fired🔥 up, it was time to navigate our collections with MongoDB Compass🧭.

For much of the rest of the course, we would concentrate on CRUD (Create, Read, Update, Delete) operations in MongoDB🍃 through both Compass🧭 and the CLI (the Mongo Shell).
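For flavor, here's a minimal sketch of those same CRUD operations from Python with pymongo against an Atlas☁️ cluster; the connection string, database, and collection names are hypothetical:

from pymongo import MongoClient

# Hypothetical Atlas connection string.
client = MongoClient("mongodb+srv://user:********@cluster0.example.mongodb.net/")
collection = client["travel"]["destinations"]

collection.insert_one({"city": "NY", "likes": "pizza"})                 # Create
doc = collection.find_one({"city": "NY"})                               # Read
collection.update_one({"city": "NY"}, {"$set": {"likes": "seafood"}})   # Update
collection.delete_one({"city": "NY"})                                   # Delete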

The course would also present us with a terse walkthrough, with syntax and elucidation, of Read Concerns and Write Concerns in MongoDB🍃.

Read Concern

Allows you to control the consistency and isolation properties of the data read from replica sets and replica set shards:

  1. Local – (default) No guarantee the data has been applied to all replicas (reads from the primary)
  2. Available – No guarantee the data has been applied to all replicas (reads from a secondary)
  3. Majority – The data has been acknowledged by a majority of the replicas
  4. Linearizable – All successful writes acknowledged by a majority of the replicas before the read (the query might have to wait)
  5. Snapshot – Used with multi-document transactions; reads data acknowledged by a majority of the replicas

Write Concern

The level of acknowledgement requested from MongoDB🍃 for write operations:

w: 1 – acknowledgement from the primary

w: 0 – no acknowledgement

w: n – the primary plus (n-1) secondaries

w: "majority" – acknowledgement from a majority of the replica set members

wtimeout – a time limit to prevent write operations from blocking indefinitely

As a lasting point on database writes (updates and deletes), all write operations in MongoDB🍃 are atomic at the level of a single document. In other words, whether you are updating a single document or updating multiple documents in a collection, each individual update is atomic only at the single-document level.
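Here's a rough sketch of how those read and write concerns are expressed with pymongo (cluster and collection names are again hypothetical):

from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb+srv://user:********@cluster0.example.mongodb.net/")

# Ask for majority-acknowledged writes (with a 5s wtimeout) and majority reads.
collection = client["travel"].get_collection(
    "destinations",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
    read_concern=ReadConcern("majority"),
)

collection.insert_one({"city": "NY", "visited": True})
print(collection.find_one({"city": "NY"}))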

Closing time 🕰 … Time 🕰 for you to go out go out into the world 🌎 … Closing time 🕰…

Finally, the dénouement of the course is on common SQL concepts and semantics in MongoDB🍃, including some of the major differences between a typical RDBMS and MongoDB🍃, which can be represented by the table below:

RDBMS                        MongoDB
SQL                          MQL (Mongo Query Language)
Predefined Schema            Dynamic Schema
Relational Keys              No foreign keys
Triggers                     No Triggers
ACID Properties              CAP theorem

Sadly 😢, that was it for Foundations of Document Databases with MongoDB, which left us clamoring for more. Fortunately, Nuri Halperin happily😊 delivered, and then some… Nuri, a MongoDB🍃 Guru 🧙‍♂️ and a Love❤️ 👨‍⚕️ of sorts (known for creating the wildly popular jdate.com platform), put together a series of timeless MongoDB🍃 courses that have managed to stand the test of time 🕰, which, I might add, is not an easy feat when it comes to a burgeoning technology like MongoDB🍃.

Part 1: The introduction course, Introduction to MongoDB, is an in-depth look at both the Mongo Shell (CLI) and CRUD syntax and indexing.

Part 2: MongoDB Administration takes a deep dive into key MongoDB🍃 administration concepts, i.e., installation, configuration, security, backup/restore, monitoring, high availability, and performance.

Nuri, like Pinal, discusses some of the challenges found in relational databases, like impedance mismatch and the need for object-relational mapping (ORM) for developers. He demonstrates how MongoDB🍃 solves these challenges through its schema-less approach and its no-relationships-required model. In addition, he touches on how MongoDB🍃 lends itself nicely to data polymorphism.

Next, he takes us through the MongoDB architecture: a collection of humongous arrays that utilizes memory-mapped BSON (Binary JSON, i.e., binary-encoded JavaScript Object Notation) files. MongoDB🍃 intuitively leverages the OS to handle loading data and saving it to disk, which allows the engine to center on speed, optimization, and stability.

MongoDB's🍃 main mission is to just serve up data quickly and efficiently. Next, Nuri takes us through the Mongo Shell (CLI), which is basically just a Java☕️Script interpreter that allows you to interactively get insight into the MongoDB🍃 server. Further, he discusses indexes, the types of indexes, and how paramount indexes are in MongoDB🍃 for practical usability.
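Since indexes keep coming up as the thing you can't skip, here's a tiny pymongo sketch of creating and inspecting one (the connection string and names are hypothetical):

import pymongo
from pymongo import MongoClient

collection = MongoClient("mongodb+srv://user:********@cluster0.example.mongodb.net/")["travel"]["destinations"]

# Index the "city" field ascending; queries filtering on city can now avoid a full collection scan.
collection.create_index([("city", pymongo.ASCENDING)], name="city_idx")

print(collection.index_information())   # inspect the indexes on the collection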

Lastly, Nuri takes us through MongoDB🍃 replication, which uses simple-to-configure but highly scalable replica sets. This is how MongoDB🍃 achieves "eventual consistency," automatic failover, and automatic recovery… And this is just the introduction to Part 2 of the course…

Like trying to watch all 3 parts of The Lord of the Rings 💍 trilogy (Director's Edition) in a single helping, it just wasn't possible to complete both parts of Nuri's series in a single week, but we did get through most of it. 😊 This also left us with a little bit more on our plate🍽 as we continue through our Mongo journey…

This Week’s Log

Out of the tree 🌳 of life I just picked me a plum… You came along and everything started in to hum 🎶… Still it's a real good bet… The best is yet to come

Below are some topics I am considering for my voyage next week:

  • More with Nuri and MongoDB 
  • Cosmos DB
  • More with Google Cloud Path
  • Working with Parquet files 
  • JDBC Drivers
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS