Week of July 10th

“Stay with the spirit I found”

Happy Friday! 😊

“You take the blue pill 💊 — the story ends, you wake up in your bed 🛌 and believe whatever you want to believe. You take the red pill 💊 — you stay in Wonderland and I show you how deep the rabbit🐰 hole goes.”

Last week, we hopped on board the Nebuchadnezzar🚀 and traveled through the cosmos🌌 to Microsoft’s multi-model NoSQL solution. So, this week we decided to go further down the “rabbit🐰 hole” and explore the wondrous land of Microsoft’s NoSQL Azure solutions as well as graph📈 databases. We would once again revisit Cosmos DB🪐, this time exploring all five APIs. In addition, we would take a brief journey through Azure Storage (Table), Azure Data Lake Storage (Gen2) (ADLS), and Azure Data Explorer (ADX), Azure’s managed data analytics service for real-time analysis. Then, for an encore, we would venture into the world’s🌎 most popular graph📈 database, Neo4j.

First, playing the role of our leader “Morpheus” in our first mission would be featured Pluralsight author and premier trainer Reza Salehi, through his recently released Pluralsight course Implementing NoSQL Databases in Microsoft Azure. Reza doesn’t take us quite as deep into the weeds 🌾 with Cosmos DB🪐 as Lenni Lobel’s Learning Azure Cosmos DB🪐 Pluralsight course, but that is because his course covers a wide range of topics across the Azure NoSQL ecosystem. Reza provides us with a very practical real-world🌎 scenario, migrating from MongoDB🍃 Atlas to Cosmos DB🪐 (MongoDB🍃 API), and he also covers the Cassandra API, which was omitted from Lenni’s offerings. In addition, Reza spends some time giving a comprehensive overview of Azure Storage (Table) and introduces us to ADLS and ADX, all of which were new to our learnings.

In the introduction of the course, Reza gives us a brief history of NoSQL, which apparently has existed since the 1960s! It just wasn’t called NoSQL. He then gives us his definition of NoSQL and emphasizes its main goals: horizontal scalability, availability, and optimal pricing. Reza mentions an interesting factoid that Azure NoSQL solutions have been used by Microsoft for about a decade through Skype, Xbox 🎮, and Office 365 🧩, none of which scaled very well with a traditional relational database.

Next, he discusses Azure Table Storage (soon to be deprecated and replaced by the Cosmos DB🪐 Table API). Azure Table Storage can store large amounts of structured, non-relational data (datasets that don’t require complex joins, foreign keys🔑, or stored procedures) cost-effectively. In addition, it is durable, highly available, secure, and massively scalable⚖️. A table is basically a collection of entities with no schema enforced. An entity is a set of properties (a maximum of 252), similar to a row in a table in a relational database. A property is a name-value pair. The three system properties that must exist on each entity are a partition key🔑, a row key🔑, and a timestamp. In the case of the partition key🔑 and the row key🔑, the application is responsible for inserting and updating these values, whereas the timestamp is managed by Azure Table Storage and is immutable. Azure automatically manages the partitions and the underlying storage, so as the data in your table grows, it is divided into different partitions. This allows for faster query performance⚡️ on entities with the same partition key🔑 and for atomic transactions on inserts and updates.
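To make that entity model concrete, below is a minimal sketch using the azure-data-tables Python SDK; the table name, connection string, and custom properties are hypothetical, not from Reza’s course.

from azure.data.tables import TableServiceClient

# Hypothetical connection string from the storage account's access keys
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("Customers")

# PartitionKey and RowKey are required; everything else is just name-value pairs
entity = {
    "PartitionKey": "NY",         # groups related entities together
    "RowKey": "cust-0001",        # must be unique within the partition
    "Email": "jane@example.com",  # arbitrary property (no schema enforced)
}
table.create_entity(entity)
# The Timestamp property is set and maintained by the service; we never write it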

Next on the agenda was Microsoft’s globally distributed, multi-model database service, better known as Cosmos DB🪐. Again, we had been down this road just last week, but just like re-watching the first Matrix movie🎞, I was more than happy 😊 to do so.

As a nice review, Reza reiterated some of the core Cosmos DB🪐 concepts like global distribution, multi-homing, data consistency levels, time-to-live (TTL), and data partitioning. All of these are included with all five flavors, or APIs, of Cosmos DB🪐, because at the end of the day each API is just another container to Cosmos DB🪐. Some of the highlights included:

Global distribution

·        Cosmos DB🪐 allows you to add or remove any of the Azure regions in your Cosmos account at any time with the click of a button.

·        Cosmos DB🪐 will seamlessly replicate your data to all the regions associated with your Cosmos account.

·        The multi-homing capability of Cosmos DB🪐 allows your application to be highly available.

Multi-homing APIs

·        Your application is aware of the nearest region and sends requests to that region.

·        Nearest region is identified without any configuration changes

·        When a new region is added or removed, the connection string stays the same

Time-to-live (TTL)

•               You can set the expiry time (TTL) on Cosmos DB data items

•               Cosmos DB🪐 will automatically remove the items once this time period has elapsed since the last modification time ⏰

Cosmos DB🪐 Consistency Levels

•       Cosmos DB🪐 offers five consistency levels to choose from:

•       Strong, bounded staleness, session, consistent prefix, eventual

Data Partitioning

·        A logical partition consists of a set of items that have the same partition key🔑.

·        Data that’s added to the container is automatically partitioned across a set of logical partitions.

·        Logical partitions are mapped to physical partitions that are distributed among several machines.

·        Throughput provisioned for a container is divided evenly among its physical partitions.

Then Reza breaks down each of the five Cosmos DB🪐 APIs in separate modules. But at the risk of being redundant with last week’s submission, we will just focus on the MongoDB🍃 API and the Cassandra API, as we covered the other APIs in depth last week. I will make one important point that applies to whichever API you are working with: you must choose an appropriate partition key🔑. As a rule of thumb 👍, an ideal partition key🔑 should have a wide range of values, so your data is spread evenly across logical partitions.
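To make that point concrete, here is a hedged sketch using the azure-cosmos Python SDK to create a container with an explicit partition key and provisioned throughput; the account endpoint, key, database, container, and key path are all hypothetical.

from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical endpoint and key from the Azure portal's "Keys" blade
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
db = client.create_database_if_not_exists("StoreDB")

# /customerId is assumed to have many distinct values, spreading data evenly
container = db.create_container_if_not_exists(
    id="Orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=400,  # RUs provisioned for this container
)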

The MongoDB🍃 API in Cosmos DB🪐 supports the popular MongoDB🍃 document database with absolutely no code changes to existing applications other than a connection string. It now supports up to MongoDB 🍃 version 3.6.
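To illustrate the “only the connection string changes” idea, here is a minimal pymongo sketch; the account, database, and collection names are hypothetical, and the connection string is the one the portal would hand you for the Cosmos DB MongoDB API.

from pymongo import MongoClient

# Swapping this string between MongoDB Atlas and Cosmos DB (MongoDB API)
# is, in principle, the only change the application needs
client = MongoClient("mongodb://myaccount:<key>@myaccount.mongo.cosmos.azure.com:10255/?ssl=true")

db = client["storedb"]
db.orders.insert_one({"customerId": "cust-0001", "total": 42.50})
print(db.orders.find_one({"customerId": "cust-0001"}))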

During this module, Reza provides us with a very practical real-world 🌎 scenario: migrating from MongoDB🍃 Atlas to Cosmos DB🪐 (MongoDB🍃 API). We are happy😊 to report that we were able to follow along easily and successfully migrate our own MongoDB🍃 Atlas collections to Cosmos DB🪐.

Important to note: before starting a migration from MongoDB🍃 to Cosmos DB🪐, you should estimate the amount of throughput to provision for your Azure Cosmos databases and collections and, of course, pick an optimal partition key🔑 for your data.

Next, we focused on the Cassandra API in Cosmos DB🪐. This one, admittedly, I was really looking forward to, as it wasn’t in scope on our previous journey. The Cosmos DB🪐 Cassandra API can be used as the data store for apps written for Apache Cassandra. Just as with MongoDB🍃, existing Cassandra applications using CQLv4-compliant drivers can easily communicate with the Cosmos DB🪐 Cassandra API, making it easy to switch from Apache Cassandra to the Cosmos DB🪐 Cassandra API with only an update to the connection string. The familiar CQL, Cassandra client drivers, and Cassandra-based tools can all be used, making for a seamless migration (a small connection sketch follows the list below), with of course the benefits of Cosmos DB🪐 like:

·        No operations management (PaaS)

·        Low latency reads and writes

·        Use existing code and tools

·        Throughput and storage elasticity

·        Global distribution and availability

·        Choice of five well-defined consistency levels

·        Interact with Cosmos DB🪐 Cassandra API
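Here is a minimal, hedged sketch of what that connection-level switch might look like with the Python cassandra-driver; the account name, key, keyspace, and table are hypothetical, and the Cosmos DB Cassandra endpoint conventionally listens on port 10350 over TLS.

import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Hypothetical Cosmos DB Cassandra API account; credentials come from the portal
auth = PlainTextAuthProvider(username="myaccount", password="<primary-key>")
ssl_context = ssl.create_default_context()
cluster = Cluster(["myaccount.cassandra.cosmos.azure.com"], port=10350,
                  auth_provider=auth, ssl_context=ssl_context)
session = cluster.connect()

# Plain CQL, exactly as an Apache Cassandra application would issue it
session.execute("CREATE KEYSPACE IF NOT EXISTS store WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS store.orders ("
                "customer_id text, order_id text, total double, "
                "PRIMARY KEY (customer_id, order_id))")
session.execute("INSERT INTO store.orders (customer_id, order_id, total) VALUES (%s, %s, %s)",
                ("cust-0001", "ord-42", 42.50))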

Next, we ventured onto new ground with Azure Data Lake Storage (ADLS). ADLS is a hyper-scale repository for big data analytic workloads. Built on Azure Storage, ADLS Gen2 is the foundation for building enterprise data lakes on Azure. ADLS supports hundreds of gigabits of throughput and manages massive amounts of data. Some key features of ADLS (a small upload sketch follows the list) include:

·        Hadoop compatible – manage data same as Hadoop HDFS

·        POSIX permissions – supports ACL and POSIX file permissions

·        Cost effective – offers low cost storage capacity
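As a small, hedged illustration of working with ADLS Gen2 from Python (the hierarchical namespace means it behaves like a file system rather than a flat blob store), here is a sketch using the azure-storage-file-datalake SDK; the account, filesystem, and paths are hypothetical.

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account with the hierarchical namespace (Gen2) enabled
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key>",
)

fs = service.create_file_system(file_system="raw")   # roughly analogous to a container
directory = fs.create_directory("sales/2020/07")     # real, HDFS-style directories
file_client = directory.create_file("orders.csv")
data = b"customer_id,total\ncust-0001,42.50\n"
file_client.upload_data(data, overwrite=True)        # small upload in a single call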

Last but certainly not least on this journey with Reza was an introduction to Azure Data Explorer (ADX), a fast and highly scalable⚖️ data exploration service for log and telemetry data. ADX is designed to ingest data from sources like websites, devices, logs, and more. These ingestion sources come natively from Azure Event Hubs, IoT Hub, and Blob Storage. Data is then stored in a highly scalable⚖️ database, and analytics are performed using the Kusto Query Language (KQL). ADX can be provisioned with the Azure CLI, PowerShell, C# (NuGet package), the Python 🐍 SDK, and ARM templates. One of the key features of ADX is anomaly detection; ADX uses machine learning under the hood to find these anomalies (a small KQL sketch follows the lists below). ADX also supports many data visualization tools like:

·        Kusto query language visualizations

·        Azure Data Explorer dashboards (Web UI)

·        Power BI connector

·        Microsoft Excel connector

·        ODBC connector

·        Grafana (ADX plugin)

·        Kibana Connector (using k2bridge)

·        Tableau (via ODBC connector)

·        Qlik (via ODBC connector)

·        Sisense (via JDBC connector)

·        Redash

ADX can easily integrate with other services like:

·        Azure Data Factory

·        Microsoft Flow

·        Logic Apps

·        Apache Spark Connector

·        Azure Databricks

·        CI/CD using Azure DevOps
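Here is a minimal, hedged sketch of running a KQL query, including a call to the built-in series_decompose_anomalies() function, from the azure-kusto-data Python SDK; the cluster URI, database, and table names are hypothetical.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical ADX cluster; authenticate as whatever the Azure CLI is logged in as
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westus2.kusto.windows.net")
client = KustoClient(kcsb)

# KQL: bucket telemetry into 5-minute bins and flag anomalous counts
query = """
Telemetry
| make-series Count = count() default = 0 on Timestamp from ago(1d) to now() step 5m
| extend Anomalies = series_decompose_anomalies(Count)
"""
response = client.execute("TelemetryDB", query)
for row in response.primary_results[0]:
    print(row)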

I’ll show these people what you don’t want them to see. A world🌎 without rules and controls, without borders or boundaries. A world🌎 where anything is possible. -Neo

After spending much time in Cosmos DB🪐, and in particular the Graph📈 database API, I have become very intrigued by this type of NoSQL solution. The more I explored, the more I coveted. I had a particular yearning to learn more about the world’s 🌎 most popular graph 📈 database, Neo4j. For those not aware of Neo4j, it’s developed by a Swedish 🇸🇪 technology company sometimes referred to as Neo4j and sometimes as Neo Technology. I guess it depends on the day of the week?

According to all accounts, the “Neo” was named for Swedish 🇸🇪 pop artist and favorite of the Swedish🇸🇪 developers Linus “Neo” Ingelsbo, the “4” is for the version, and the “J” is for the Swedish🇸🇪 word “Jätteträd,” which of course means “giant tree 🌳,” because a tree 🌳 signifies the huge data structures that could now be stored in this amazing database product. But to me this story seems a bit curious… With a database named “Neo,” a query language called “Cypher,” and Awesome Procedures On Cypher, better known as APOC, I somehow believe there is another story here…

Anyway, guiding us through our learning of Neo4j would be none other than the “Flying Dutchman” 🇳🇱 Roland Guijt, through his course Introduction to Graph📈 Databases, Cypher, and Neo4j, which was short but sweet (sort of like a Stroopwafel🧇).

In the introduction, Roland tells us the who, what, when, where, why, and how of graph📈 databases. A graph 📈 consists of nodes, or vertices, which are connected by directional relationships, or edges. A node represents an entity, and an entity is typically something in the real world🌎 like a customer, an order, or a person. A collection of nodes and relationships together is called a graph 📈. Graph📈 databases are very mind-friendly compared to other data storage technologies because graphs📈 act a lot like how the human brain🧠 works. It’s easier to think about the data structure and also easier to write queries. These patterns are much like the patterns the brain🧠 uses to fetch data or retrieve memories.

Graph 📈 databases are all about relationships and thus are very strong at storing and retrieving highly related data. They are also very performant during querying, even with a large number of nodes, like in the millions. They offer great flexibility since, like all NoSQL databases, they don’t require a fixed schema. In addition, they are quite agile, as you can add or delete nodes and node properties without affecting already stored nodes, and they are extensible, supporting multiple query languages.

After a comprehensive overview of graph📈 databases, Roland dives right into Neo4j, the leader in graph 📈 databases. Unlike document databases, Neo4j is ACID compliant, which means that all data modification is done within a transaction. If something goes wrong, Neo4j will simply roll back to a state where the data was reliable.

Neo4j is Java☕️-based, which allows you to install it on multiple platforms like Windows, Linux, and OS X. Neo4j can scale⚖️ up, as it easily adjusts to hardware changes, i.e., adding more physical memory, in which case it will automatically keep more nodes in the cache. Neo4j can also scale ⚖️ out like most NoSQL solutions, i.e., adding more servers, meaning it can distribute the load of transactions or create a highly available cluster in which a standby server takes over when the active one fails.

Since by definition Neo4j is a graph📈 database, it’s all about relationships and nodes, and both are equally important. Nodes are schema-less entities with properties (key-value pairs whose keys are always strings). A relationship connects one node to another node. Just like nodes, relationships can also contain properties, and those properties support indexing.

Next, Roland discusses querying data with Cypher, the most powerful⚡️ of the query languages supported by Neo4j. Cypher was developed and optimized for Neo4j and for graph📈 databases. Cypher is a very fluid language, meaning it continuously changes with each release. The good 😊 news is that all major releases are backwards compatible with older versions of the language. It’s very different from SQL, so there is a bit of a learning curve. However, it’s not as steep a learning curve as you would imagine, because Cypher uses patterns to match the data in the database, very much how the brain🧠 works. That, and Neo4j Desktop has IntelliSense. 😊

As an example to demonstrate the query language and CRUD, we worked with a very cool Dr. Who graph 📈 database filled with multiple nodes for Actors, Roles, Episodes, and Villains, and their given relationships. To begin, we started with the “R,” or Reads, part of CRUD, learning the MATCH command.

Below is some MATCH – RETURN syntax:

MATCH (:Actor{name:'Matt Smith'})-[:PLAYED]->(c:Character) RETURN c.name AS name

MATCH (actors:Actor)-[:REGENERATED_TO]->(others) RETURN actors.name, others.name

MATCH (:Character{name:'Doctor'})<-[:ENEMY_OF]-(:Character)-[:COMES_FROM]->(p:Planet) RETURN p.name AS Planet, count(p) AS Count

MATCH (:Actor{name:'Matt Smith'})-[:APPEARED_IN]->(ep:Episode)<-[:APPEARED_IN]-(:Character{name:'Amy Pond'}), (ep)<-[:APPEARED_IN]-(enemies:Character)<-[:ENEMY_OF]-(:Character{name:'Doctor'}) RETURN ep AS Episode, collect(enemies.name) AS Enemies;

Further, Roland discussed the WHERE and ORDER BY clauses, which are very similar to ANSI SQL. Then he covers other Cypher syntax like:

SKIP – which skips the number of result items you specify.

LIMIT – which limits the numbers of items returned.

UNION – which allows you to connect two queries together and generate one result set.

Then he ends the module by covering scalar functions like TOINT, LENGTH, REDUCE, FILTER, ROUND, and SUBSTRING.

Then he reviews two of his favorite advanced query features: COMPANION_OF and SHORTESTPATH.

Continuing on with the C, U, and D in CRUD, we played with CREATE, MATCH with SET, and MATCH with DELETE.

Below is some syntax:

CREATE p = (:Actor{name:'Peter Capaldi'})-[:APPEARED_IN]->(:Episode{name:'The Time of The Doctor'}) RETURN p

MATCH (matt:Actor{name: 'Matt Smith'})

DELETE matt

MATCH (matt:Actor{name: 'Matt Smith'})

SET matt.salary = 1000

Then we looked at MERGE and FOREACH, with the syntax below as an example:

MERGE (peter:Actor{name: 'Peter Capaldi'}) RETURN peter

MATCH p = (actors:Actor)-[r:PLAYED]->(others)

WHERE actors.salary > 10000

FOREACH (n IN nodes(p) | SET n.done = true)

As we continued our journey with Neo4j, we reconnoitered indexes and constraints. Indexes are only good for data retrieval, so if your application performs lots of writes, it’s probably best to avoid them. As for constraints, the unique constraint is currently the only constraint available in Neo4j, which is why it is often just called “the constraint.” Lastly, in the module we reviewed importing CSV files, which makes importing data from other sources a breeze. It enables you to import data into a Neo4j database from many sources, and CSV files can be loaded from the local file system as well as from remote locations. Cypher has a LOAD CSV statement, which is used together with CREATE and/or MERGE.

Finally, Roland reviewed Neo4j’s APIs, which was a little bit outside our lexicon but interesting nonetheless. Neo4j supports two API types out of the box: the traditional REST API and their proprietary Bolt⚡️ protocol. The advantage of Bolt⚡️ is mainly performance; Bolt⚡️ doesn’t have the HTTP overhead, and it uses a binary format instead of text to return data. For both the REST and Bolt APIs, Roland provides C# code samples that can be run with NuGet packages in Visual Studio, my new favorite IDE.
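For comparison, here is a minimal, hedged sketch of talking to Neo4j over Bolt from Python with the official neo4j driver; the URI, credentials, and query are hypothetical (the query borrows the Dr. Who shape used above).

from neo4j import GraphDatabase

# Hypothetical local instance; Bolt's default port is 7687
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "<password>"))

with driver.session() as session:
    result = session.run(
        "MATCH (a:Actor)-[:PLAYED]->(c:Character) "
        "RETURN a.name AS actor, c.name AS character LIMIT 5")
    for record in result:
        print(record["actor"], "played", record["character"])

driver.close()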

Ever have that feeling where you’re not sure if you’re awake or dreaming?

Below are some topics I am considering for my learnings next week:

·      More on Neo4J and Cypher 

·      More on MongoDB

·      More with Google Cloud Path

·      Working with Parquet files 

·      JDBC Drivers

·      More on Machine Learning

·      ONTAP Cluster Fundamentals

·      Data Visualization Tools (i.e. Looker)

·      Additional ETL Solutions (Stitch, FiveTran) 

·      Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS

Week of July 3rd

“Hanging in the cosmos 🌌 like a space ornament”

Happy Birthday🎂🎁🎉🎈America🇺🇸 !

“Now let me welcome everybody to the Wild, Wild West 🤠. A state database that’s untouchable like Eliot Ness.” So, after spending a good concentrated week in the “humongous” document database world better known as the popular MongoDB🍃, it only made sense to continue our Jack Kerouac-like adventures through the universe 🌌 of “Not only SQL” databases.  

“So many roads, so many detours. So many choices, so many mistakes.” -Carrie Bradshaw

But with so many document databases, table and key-value stores, and columnar and graph databases to choose from in the NoSQL universe, where shall we go? Well, after a brief deliberation, we turned to the one place that empowers every person and every organization on the planet to achieve more. That’s right, Microsoft! Besides, we haven’t been giving Mr. Softy enough love ❤️ in our travels. So, we figured we would take a stab and see what MSFT had to offer. Oh boy, did we hit eureka with Microsoft’s Cosmos DB🪐!

For those not familiar with Microsoft’s Cosmos DB🪐, it was released for GA in 2017. The solution morphed out of Azure DocumentDB (the “Un-cola”🥤 of document databases of its day), which was initially released in 2014. At the time of its inception, Azure DocumentDB was one of the very few NoSQL Cloud☁️ solutions (MongoDB🍃 Atlas☁️ was released two years later in 2016), but its popularity was still limited. Fortunately, MSFT saw the “forest 🌲🌲🌲 through the trees🌳,” or shall I say the planets🪐 through the stars ✨, and knew there was a lot more to NoSQL than just some JSON and a bunch of curly braces. So, they “pimped up” Azure DocumentDB and gave us the Swiss🇨🇭 Army knife of NoSQL solutions through their rebranded offering, Cosmos DB🪐.

Cosmos DB 🪐 is a multi-model NoSQL Database as a Service (NDaaS) that manages data at planetary 🌎 scale ⚖️! Huh? In other words, Cosmos DB🪐 supports six different NoSQL solutions through the beauty of APIs (Application Program Interfaces). Yes, you read that correctly. Six! Cosmos DB🪐 supports the SQL API (originally intended for the aforementioned Azure DocumentDB), which uses the friendly SQL query language; the MongoDB🍃 API (for all the JSON fans); Cassandra (columnar database); Azure Table Storage (table); etcd (key-value store); and, last but certainly not least, the Gremlin👹 API (graph database).

Cosmos DB🪐 provides virtually unlimited scale ⚖️ through both storage and throughput and it automatically manages the growth of the data with server-side horizontal partitioning.

So, no worrying about adding more nodes or shards. …And that’s not all! Cosmos DB🪐 does all the heavy lifting 🏋🏻‍♀️ with automatic global distribution and server-side partitioning for painless management over the scale and growth of your database. Not to mention, it offers a 99.999% SLA when data is distributed across multiple regions 🌎 (only a mere four 9s when you stick to a single region).

Yes, you read that right, too. 99.999% guarantee! Not just on availability… No, No, No… but five 9s on latency, throughput, and consistency as well!

Ok, so now I sound like a MSFT fanboy. Perhaps? So now, fully percolating ☕️ with excitement, who would guide us through such amazing innovation? Well, we found just the right tour guide in native New Yorker Lenni Lobel. Through his melodious 🎶 voice and over 5 decades of experience in IT, Lenni takes us on an amazing journey through Cosmos DB🪐 with his Pluralsight course Learning Azure Cosmos DB🪐.

In the introduction, Lenni gives us his interpretation of NoSQL, which answers the common problem of the 3 Vs of data, and covers the roots of Cosmos DB🪐, which we touched on earlier. Lenni then explains how the Cosmos DB🪐 engine is an atom-record-sequence (ARS) based system. In other words, the database engine of Cosmos DB🪐 is capable of efficiently translating and projecting multiple data models by leveraging ARS. Still confused? Don’t be. In more simplistic terms, under the covers Cosmos DB🪐 leverages the ARS framework to support multiple NoSQL technologies. It does this through APIs, placing each of the data models in separate schema-agnostic containers, which is super cool 😎! Next, he discusses one of the cooler 😎 features of Cosmos DB🪐, “automatic indexing.” If you recall from our MongoDB travels, one of the main takeaways was a strong emphasis on the need for indexes in MongoDB🍃. Well, in Cosmos DB🪐 you need not worry; Cosmos DB🪐 does this for you automatically. The only main concern is choosing the right partition key🔑 for your container, and you must choose wisely, otherwise performance and cost will suffer.

Lenni further explains how one quantifies performance for data through latency and throughput. In the world 🌎 of data, latency is how long the data consumer waits for the data to be received from end to end, whereas throughput is the performance of the database itself. First, Mr. Lobel demonstrates how to provision throughput in Cosmos DB🪐, which provides predictable throughput to the database through a server-less approach measured in Request Units (RUs). RUs are a blended measure of computational cost: CPU, memory, disk I/O, and network I/O.

So, like most server-less approaches, you don’t need to worry about provisioning hardware to scale ⚖️ your workloads. You just need to properly allocate the right amount of RUs to a given container. The good news on RUs is that this setting is flexible, so it can easily be throttled up and down through the portal or even specified at an individual query level.

Please note: data writes are generally more expensive than data reads. The beauty of the RU approach is that you are guaranteed throughput and you can predict cost. You will even be notified through a friendly error message when your workloads exceed a certain threshold. There is an option to run your workloads in an “auto-pilot ✈️ mode” in which Cosmos DB🪐 will adjust the RUs to a given workload, but beware: this option could be quite costly, so proceed with caution and discuss it with MSFT before considering using it.

In the interest of being fully transparent, unlike some of their competitors, Microsoft offers a Capacity Calculator so you can figure out exactly how much it will cost to run your workloads (reserved throughput is priced at $0.008 per 100 RU/sec per hour). The next important consideration in regard to throughput is horizontal partitioning. Some might regard horizontal partitioning as strictly a storage concern, but in fact it also massively impacts throughput.
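To put that rate in perspective (a rough back-of-the-envelope calculation using the $0.008 per 100 RU/sec per hour figure above): a single container provisioned at 400 RU/sec would cost roughly 4 × $0.008 = $0.032 per hour, or about $23 for a ~730-hour month, before storage charges.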

“Yes, it’s understood that partitioning and throughput are distinct concepts, but they’re symbiotic in terms of scale-out.”

Anyway, no need to fret… We simply create a container and let Cosmos DB🪐 automatically manage these partitions for us behind the scenes (including the distribution of partitions within a given data center). However, keep in mind that we must choose a proper partition key🔑, otherwise we can have a rather unpleasant😞 and costly🤑 experience with Cosmos DB🪐. Luckily, there are several best practices around choosing the right partition key🔑. Personally, I like to stick to the rule of thumb 👍 of always choosing a key🔑 with many distinct values, like in the 100s or 1000s. This can hopefully help avoid the dreaded hot🔥 partition.

Please note: Partition keys 🔑 are immutable but there are documented workarounds on how to deal with changing this in case you find yourself in this scenario.

Now that we have a good grasp on how Cosmos DB🪐 handles throughput and latency through RUs and horizontal partitioning, what if your application is global 🌎 and your primary data is located halfway around the world 🌍? Our performance could suffer tremendously… 😢

Cosmos DB🪐 handles such challenges with one of its most compelling features: global distribution of data. Microsoft intuitively leverages the ubiquity of its global data centers and offers turnkey, “point-and-click” global distribution so your data can seamlessly be geo-replicated across regions.

In cases where you have multiple masters, or data writers, Cosmos DB🪐 offers three options to handle write conflicts:

  • Option 1: Last Writer Wins (default) – based on the highest _ts property, or any other numeric property designated as the conflict resolver property; the write with the higher value wins, and if the property is blank, the write with the higher _ts wins
  • Option 2: Merge Procedure (custom) – based on a stored procedure result
  • Option 3: Conflict Feed (offline resolution) – based on quorum majority

Whew 😅… But what about data consistency? How do we ensure our data is consistent in all of our locations? Well, once again, Cosmos DB🪐 does not disappoint, supporting five different options. Of course, as in life itself, there are always tradeoffs. So, depending on your application’s needs, you will need to determine what’s most important for your application: latency or availability? Below are the options, ordered from strongest consistency (highest latency) to highest availability:

  1. Strong – No dirty reads. Higher latency on writes, waiting for the write to be committed to the Cosmos DB quorum. Higher RU costs
  2. Bounded Staleness – Dirty reads possible, bounded by time and number of updates, which is kind of like “skunked🦨 beer🍺”: you decide the level of freshness you can tolerate
  3. Session – (Default) No dirty reads for writers (read your own writes). Dirty reads are possible for other users
  4. Consistent Prefix – Dirty reads possible, but reads never see out-of-order writes
  5. Eventual – Stale reads possible, no guaranteed order. Fastest

So, after focusing on these core concepts within Cosmos DB🪐, we were ready to dig our heels 👠 👠 right in and get this bad boy up and running 🏃🏻. So, after waiting about 15 minutes or so… we had our Cosmos DB🪐 fired up 🔥 and running in Azure… Not bad for such a complex piece of infrastructure. 😊

Next, we created a database and then a container and started our travels with the SQL API. Through the portal, we were easily able to manually write some JSON documents and add them to our collection.

In addition, through Lenni’s brilliantly written .NET Core code samples, we were able to automate writing, querying, and reading data in bulk. Further, we were able to easily adjust throughput and latency through the portal by tweaking the RUs and enabling multi-region replication, and we were able to demonstrate this by re-running Lenni’s code after the changes.
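Lenni’s samples are in .NET, but the same flow can be sketched in Python with the azure-cosmos SDK (a hedged sketch, not Lenni’s code; the database, container, and fields are hypothetical):

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("StoreDB").get_container_client("Orders")

# Write (an upsert generally costs more RUs than a point read)
container.upsert_item({"id": "ord-42", "customerId": "cust-0001", "total": 42.50})

# Query with the familiar SQL grammar; cross-partition queries cost extra RUs
items = container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @cid",
    parameters=[{"name": "@cid", "value": "cust-0001"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"], item["total"])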

Although getting Lenni’s code to work did take a little bit of troubleshooting with Visual Studio 2019, a little bit of understanding of how to fix the .NET SDK errors, and some compilation errors from NuGet packages, all of which was out of our purview. But needless to say, we figured out how to troubleshoot the NuGet packages and modify some of the parameters in the code, like the _id field, the Cosmos DB🪐 server, and the Cosmos DB master key 🔑.

We were able to enjoy the full experience of the SQL API, including the power⚡️ of using the familiar SQL query language and not having to worry about all the

db.collection.insertOne() this 

and 

db.collection.find(), 

db.collection.updateOne()

db.collection.deleteOne()

that..

We also got to play with server-side programming in Cosmos DB🪐, like the familiar concepts of stored procedures, triggers, and user-defined functions, which in Cosmos DB🪐 are basically self-contained JavaScript functions that are deployed to the database for execution. But one can always pretend we are in the relational database world. 😊
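As a rough illustration (a hedged sketch, not Lenni’s code), registering and calling one of those self-contained JavaScript functions might look like this with the azure-cosmos Python SDK, reusing the container client from the sketch above; the procedure name and partition key value are hypothetical.

sproc_body = {
    "id": "spHello",
    # The server-side part is plain JavaScript running inside Cosmos DB
    "body": "function () { getContext().getResponse().setBody('Hello from the server side'); }",
}
container.scripts.create_stored_procedure(body=sproc_body)

result = container.scripts.execute_stored_procedure(
    sproc="spHello",
    partition_key="cust-0001",  # stored procedures execute within a single logical partition
)
print(result)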

Next, we got to test drive 🚙 the Data Migration Tool 🛠, which allows you to import data from existing data sources into Cosmos DB🪐.

From our past experiences, we have found Microsoft has gotten quite good at creating these types of tools 🧰. The Cosmos DB🪐 Data Migration Tool offers great support for many data sources like SQL Server, JSON files, CSV files, MongoDB, Azure Table Storage, and others.

First, we used the UI to move data from Microsoft SQL Server 2016 and the popular example AdventureWorks database to Cosmos DB🪐, and then later we moved data from Azure Table Storage through the CLI (azcopy).

Notably, Azure Table Storage is on the roadmap to be deprecated and automatically migrated to Cosmos DB🪐, but this was a good exercise for those who can’t wait and want to take advantage of such an awesome platform today!

As a grand finale, we got to play with graph databases through the Gremlin 👹 API. As many of you might be aware, graph databases are becoming extremely popular these days, mostly because data in the real world is naturally connected through relationships, and graph databases do a better job than our traditional RDBMSs at managing data when many complex relationships exist.

Again, it’s worth noting that in the case of Cosmos DB🪐, it doesn’t really matter what data model you’re implementing, because, as we mentioned earlier, it leverages the ARS framework. So as far as Cosmos DB🪐 is concerned, it’s just another container to manage, and we get all the horizontal partitioning, provisioned throughput, global distribution, and indexing goodness 😊.

We were new to the whole concept of graph databases, so we were very excited to get some exposure here, which looks to be a precursor for further explorations. The most important highlight of graph databases is understanding vertex and edge objects. These are basically just fancy-schmancy words for entities and relationships: a vertex is an entity, and an edge is a relationship between any two vertices. Both can hold arbitrary key-value pairs 🔑🔑, and they are the building blocks of a graph database.

Cosmos DB🪐 utilizes the Apache TinkerPop standard, which uses Gremlin as a functional, step-by-step language to create vertices and edges, and stores the data as GraphSON, or “graphical JSON.”

In addition, Gremlin 👹 allows you to query the graph database using simple traversals through a myriad of relationships, or edges. The more edges you add, the more relationships you define, and the more questions you can answer by running Gremlin👹 queries. 😊

To further our learning, Lenni once again gave us some nice demos using a fictitious company, “Acme,” and its relationships between employees, airport terminals, and restaurants, plus another example using comic book heroes, which made playing along fun.

Below are some examples of Gremlin 👹 syntax from our voyage.

g.addV('person').property('id','John').property('age',25).property('likes','pizza').property('city','NY')

g.addV('person').property('id','Alan').property('age',22).property('likes','seafood').property('city','NY')

g.addV('company').property('id','Acme').property('founded',2001).property('city','NY')

g.V().has('id','John').addE('worksAt').property('weekends', true).to(g.V().has('id','Acme'))

g.V().has('id','Alan').addE('worksAt').property('weekends', true).to(g.V().has('id','Acme'))

g.V().has('id','Alan').addE('manages').to(g.V().has('id','John'))
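Since Cosmos DB’s Gremlin API speaks the TinkerPop protocol, traversal queries over that little graph can also be submitted as strings from Python with the gremlinpython driver (a hedged sketch; the account, database, and graph names are hypothetical):

from gremlin_python.driver import client, serializer

# Hypothetical Cosmos DB Gremlin endpoint; the Gremlin API expects GraphSON v2
gremlin_client = client.Client(
    "wss://myaccount.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/graphdb/colls/people",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Traversal: who works at Acme?
results = gremlin_client.submit("g.V().has('id','Acme').in('worksAt').values('id')").all().result()
print(results)  # e.g. ['John', 'Alan'], given the vertices and edges created above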

When it comes to graph databases, the possibilities are endless. Some good use cases for a graph database would be:

  • Complex Relationships – Many “many-to-many” relationships
  • Excessive JOINS
  • Analyze interconnected data relationships
  • Typical graph applications
    • Social networks 
    • Recommendation Engines

In Cosmos DB🪐, it’s clear to see how a graph database is no different than any other data model. A graph database gets provisioned throughput and is fully indexed, partitioned, and globally distributed, just like a document collection in the SQL API or a table in the Table API.

Cosmos DB🪐 will one day allow you to switch freely between different APIs and data models within the same account, and even over the same data set. So by adding this graph functionality to Cosmos DB🪐, Microsoft really hit ⚾️ this one out of the park 🏟!

Closing time …Every new beginning.. comes from some other beginning’s end

Below are some topics I am considering for my wonderings next week:

  • Neo4J and Graph DB
  • More on Cosmos DB
  • More on MongoDB
  • More with Google Cloud Path
  • Working with Parquet files 
  • JDBC Drivers
  • More on Machine Learning
  • ONTAP Cluster Fundamentals
  • Data Visualization Tools (i.e. Looker)
  • Additional ETL Solutions (Stitch, FiveTran) 
  • Process and Transforming data/Explore data through ML (i.e. Databricks)

Stay safe and Be well –

–MCS