Week of July 10th

"Stay with the spirit I found"
Happy Friday!
"You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in Wonderland and I show you how deep the rabbit hole goes."
Last week, we hopped on board the Nebuchadnezzar and traveled through the cosmos to Microsoft's multi-model NoSQL solution. This week we decided to go further down the "rabbit hole" and explore the wondrous land of Microsoft's Azure NoSQL solutions as well as graph databases. We would once again revisit Cosmos DB, exploring all 5 APIs. In addition, we would take a brief journey through Azure Storage (Table), Azure Data Lake Storage Gen2 (ADLS), and Azure's managed data analytics service for real-time analysis, Azure Data Explorer (ADX). Then, for an encore, we would venture into the world's most popular graph database, Neo4j.
First, playing the role of our leader "Morpheus" on our first mission was featured Pluralsight author and premier trainer Reza Salehi, through his recently released Pluralsight course Implementing NoSQL Databases in Microsoft Azure. Reza doesn't take us quite as deep into the weeds with Cosmos DB as Lenni Lobel's Learning Azure Cosmos DB Pluralsight course, but that is because his course covers a wide range of topics in the Azure NoSQL ecosystem. Reza provides a very practical real-world scenario, migrating from MongoDB Atlas to Cosmos DB (MongoDB API), and he also covers the Cassandra API, which was omitted from Lenni's offerings. In addition, Reza spends some time giving a comprehensive overview of Azure Storage (Table) and introduces us to ADLS and ADX, all of which were new to our learnings.
In the introduction of the course, Reza gives us a brief history of NoSQL, which apparently has existed since the 1960s! It just wasn't called NoSQL. He then gives us his definition of NoSQL and emphasizes its main goals: horizontal scalability, availability, and optimal pricing. Reza mentions an interesting factoid: Microsoft has been using Azure NoSQL solutions for about a decade through Skype, Xbox, and Office 365, none of which scaled very well with a traditional relational database.
Next, he discusses Azure Table Storage (reportedly due to be superseded by the Cosmos DB Table API). Azure Table Storage can cost-effectively store large amounts of structured, non-relational data (datasets that don't require complex joins, foreign keys, or stored procedures). In addition, it is durable, highly available, secure, and massively scalable. A table is basically a collection of entities with no schema enforced. An entity is a set of properties (maximum of 252), similar to a row in a table in a relational database. A property is a name-value pair. Three system properties must exist on every entity: a partition key, a row key, and a timestamp. The application is responsible for inserting and updating the partition key and row key, whereas the timestamp is managed by Azure Table Storage and cannot be changed by the application. Azure automatically manages the partitions and the underlying storage, so as the data in your table grows, it is divided into different partitions. This allows for faster queries over entities with the same partition key and for atomic transactions on inserts and updates within a partition.
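To make the entity model a bit more concrete, here is a minimal Python sketch. It is a toy in-memory stand-in, not the Azure SDK: the class name `TableSketch`, its methods, and the sample data are all hypothetical and exist only to illustrate partition keys, row keys, and a store-managed timestamp.

```python
from datetime import datetime, timezone


class TableSketch:
    """Toy in-memory stand-in for a schemaless table (NOT the Azure SDK)."""

    def __init__(self):
        # partition key -> {row key -> entity}; an entity is a dict of properties
        self._partitions = {}

    def upsert(self, entity):
        # The application supplies PartitionKey and RowKey ...
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        stored = dict(entity)
        # ... while the store, not the caller, stamps the Timestamp.
        stored["Timestamp"] = datetime.now(timezone.utc)
        self._partitions.setdefault(pk, {})[rk] = stored
        return stored

    def query_partition(self, pk):
        # Entities sharing a partition key are kept together, so this is one lookup.
        return list(self._partitions.get(pk, {}).values())


table = TableSketch()
# No schema is enforced: these entities carry different properties.
table.upsert({"PartitionKey": "xbox", "RowKey": "player-1", "Gamertag": "Trinity"})
table.upsert({"PartitionKey": "xbox", "RowKey": "player-2", "Score": 42})
table.upsert({"PartitionKey": "skype", "RowKey": "user-1", "Status": "online"})
print(len(table.query_partition("xbox")))  # prints 2
```

Keeping each partition in its own dictionary is only meant to mirror the idea that entities with the same partition key are stored and queried together.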
Next on the agenda was Microsoft's globally distributed, multi-model database service, better known as Cosmos DB. Again, we had been down this road just last week, but just like re-watching the first Matrix movie, I was more than happy to do so.
As a nice review, Reza reiterated some of the core Cosmos DB concepts like global distribution, multi-homing, data consistency levels, time-to-live (TTL), and data partitioning. All of these are included with all five flavors, or APIs, of Cosmos DB because, at the end of the day, each API is just another container in Cosmos DB. Some of the highlights included:
Global distribution
· Cosmos DB allows you to add any Azure region to, or remove it from, your Cosmos account at any time with the click of a button.
· Cosmos DB will seamlessly replicate your data to all the regions associated with your Cosmos account.
· The multi-homing capability of Cosmos DB allows your application to be highly available.
Multi-homing APIs
· Your application is aware of the nearest region and sends requests to that region.
· The nearest region is identified without any configuration changes.
· When a region is added or removed, the connection string stays the same.
Time-to-live (TTL)
· You can set an expiry time (TTL) on Cosmos DB data items.
· Cosmos DB will automatically remove the items once this time period has elapsed since their last modification time.
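The TTL rule above (expiry is measured from the last modification, not from creation) can be sketched in a few lines of Python. This is only an illustration of the rule, not Cosmos DB code: the function names, the `last_modified` and `ttl` fields, and the choice of `None` to mean "never expire" are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_modified, ttl_seconds, now):
    """True once ttl_seconds have elapsed since the item's last modification."""
    if ttl_seconds is None:  # assumption for this sketch: None means "never expire"
        return False
    return now >= last_modified + timedelta(seconds=ttl_seconds)


def purge_expired(items, now):
    """Return only the items that are still alive at time `now`."""
    return [i for i in items if not is_expired(i["last_modified"], i.get("ttl"), now)]


now = datetime(2020, 7, 10, 12, 0, tzinfo=timezone.utc)
items = [
    {"id": "stale", "last_modified": now - timedelta(hours=2), "ttl": 3600},
    {"id": "fresh", "last_modified": now - timedelta(minutes=5), "ttl": 3600},
    {"id": "forever", "last_modified": now - timedelta(days=365)},
]
print([i["id"] for i in purge_expired(items, now)])  # prints ['fresh', 'forever']
```

Updating an item resets its clock, which is why the check uses the last modification time rather than a creation time.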
Cosmos DB Consistency Levels
· Cosmos DB offers five consistency levels to choose from:
· Strong, bounded staleness, session, consistent prefix, eventual
Data Partitioning
· A logical partition consists of a set of items that have the same partition key.
· Data that's added to the container is automatically partitioned across a set of logical partitions.
· Logical partitions are mapped to physical partitions that are distributed among several machines.
· Throughput provisioned for a container is divided evenly among its physical partitions.
Reza then breaks down each of the 5 Cosmos DB APIs in separate modules. At the risk of being redundant with last week's submission, we will just focus on the MongoDB API and the Cassandra API, as we covered the other APIs in depth last week. I will make one important point that applies to every API you work with: you must choose an appropriate partition key. As a rule of thumb, an ideal partition key should have a wide range of values, so your data is evenly spread across logical partitions.
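To see why a wide range of partition key values matters, here is a small Python sketch. It is a deliberate simplification: hashing the key and taking it modulo the partition count is an assumption made for illustration and is not how Cosmos DB actually assigns key ranges, and all function names are hypothetical.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key, physical_count):
    """Map a partition key value to a partition index (toy hash-modulo scheme)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % physical_count


def throughput_per_partition(provisioned_ru, physical_count):
    """Provisioned throughput is split evenly across physical partitions."""
    return provisioned_ru / physical_count


def spread(keys, physical_count):
    """Count how many items land on each physical partition."""
    return Counter(physical_partition(k, physical_count) for k in keys)


wide = [f"user-{i}" for i in range(1000)]  # many distinct key values
narrow = ["US"] * 1000                     # a single key value for every item

print(throughput_per_partition(10_000, 4))  # prints 2500.0
print(spread(wide, 4))    # items are shared out across partitions
print(len(spread(narrow, 4)))  # prints 1
```

With the narrow key, every item lands on one partition, which can then use only its even share of the provisioned throughput while the other partitions sit idle.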
The MongoDB API in Cosmos DB supports the popular MongoDB document database with absolutely no code changes to existing applications other than a connection string. It now supports up to MongoDB version 3.6.
During this module, Reza provides us with a very practical real-world scenario: migrating from MongoDB Atlas to Cosmos DB (MongoDB API). We are happy to report that we were able to follow along easily and successfully migrate our own MongoDB Atlas collections to Cosmos DB.
Important to note: before starting a migration from MongoDB to Cosmos DB, you should estimate the amount of throughput to provision for your Azure Cosmos databases and collections and, of course, pick an optimal partition key for your data.
Next, we focused on the Cassandra API in Cosmos DB. Admittedly, this was one I was really looking forward to, as it wasn't in scope on our previous journey. The Cosmos DB Cassandra API can be used as the data store for apps written for Apache Cassandra. Just like with MongoDB, existing Cassandra applications using CQLv4-compliant drivers can easily communicate with the Cosmos DB Cassandra API, making it easy to switch from Apache Cassandra with only an update to the connection string. The familiar CQL, Cassandra client drivers, and Cassandra-based tools can all be used, making for a seamless migration with, of course, the benefits of Cosmos DB, like:
· No operations management (PaaS)
· Low latency reads and writes
· Use existing code and tools
· Throughput and storage elasticity
· Global distribution and availability
· Choice of five well-defined consistency levels
· Interact with Cosmos DB Cassandra API
Next, we ventured onto new ground with Azure Data Lake Storage (ADLS). ADLS is a hyper-scale repository for big data analytics workloads. ADLS Gen2 builds on Azure Storage as the foundation for enterprise data lakes. ADLS supports hundreds of gigabits of throughput and manages massive amounts of data. Some key features of ADLS include:
· Hadoop compatible: manage data the same way as in Hadoop HDFS
· POSIX permissions: supports ACLs and POSIX file permissions
· Cost effective: offers low-cost storage capacity
Last but certainly not least on this journey with Reza was an introduction to Azure Data Explorer (ADX), a fast and highly scalable data exploration service for log and telemetry data. ADX is designed to ingest data from sources like websites, devices, logs, and more. Ingestion is natively supported from Azure Event Hubs, IoT Hub, and Blob Storage. Data is then stored in a highly scalable database, and analytics are performed using Kusto Query Language (KQL). ADX can be provisioned with the Azure CLI, PowerShell, C# (NuGet package), the Python SDK, and ARM templates. One of the key features of ADX is anomaly detection; ADX uses machine learning under the hood to find these anomalies. ADX also supports many data visualization tools, like:
· Kusto query language visualizations
· Azure Data Explorer dashboards (Web UI)
· Power BI connector
· Microsoft Excel connector
· ODBC connector
· Grafana (ADX plugin)
· Kibana connector (using k2bridge)
· Tableau (via ODBC connector)
· Qlik (via ODBC connector)
· Sisense (via JDBC connector)
· Redash
ADX can easily integrate with other services like:
· Azure Data Factory
· Microsoft Flow
· Logic Apps
· Apache Spark Connector
· Azure Databricks
· CI/CD using Azure DevOps
I'll show these people what you don't want them to see. A world without rules and controls, without borders or boundaries. A world where anything is possible. -Neo
After spending much time in Cosmos DB, and in particular the graph database API, I have become very intrigued by this type of NoSQL solution. The more I explored, the more I coveted. I had a particular yearning to learn more about the world's most popular graph database, Neo4j. For those not aware of Neo4j, it's developed by a Swedish technology company sometimes referred to as Neo4j and sometimes as Neo Technology. I guess it depends on the day of the week?
By all accounts, "Neo" was named for Swedish pop artist and favorite of the Swedish developers Linus "Neo" Ingelsbo, the "4" stands for the version, and the "J" is for the Swedish word "Jätteträd", which of course means "giant tree", because a tree signifies the huge data structures that could now be stored in this amazing database product. But to me this story seems a bit curious. With a database named "Neo", a query language called "Cypher", and Awesome Procedures On Cypher, better known as APOC, I somehow believe there is another story here.
Anyway, guiding us through our learning with Neo4j would be none other than the "Flying Dutchman" Roland Guijt, through his course Introduction to Graph Databases, Cypher, and Neo4j, which was short but sweet (sort of like a stroopwafel).
In the introduction, Roland tells us the who, what, when, where, why, and how of graph databases. A graph consists of nodes, or vertices, which are connected by directional relationships, or edges. A node represents an entity. An entity is typically something in the real world, like a customer, an order, or a person. A collection of nodes and relationships together is called a graph. Graph databases are very mind friendly compared to other data storage technologies because graphs act a lot like how the human brain works. It's easier to think about the data structure and also easier to write queries, because the patterns you query with are much like the patterns the brain uses to fetch data or retrieve memories.
Graph databases are all about relationships and thus are very strong at storing and retrieving highly related data. They are also very performant during querying, even with a large number of nodes, as in the millions. They offer great flexibility since, like all NoSQL databases, they don't require a fixed schema. In addition, they are quite agile, as you can add or delete nodes and node properties without affecting already stored nodes, and they are extensible, supporting multiple query languages.
After a comprehensive overview of graph databases, Roland dives right into Neo4j, the leader in graph databases. Unlike many document databases, Neo4j is ACID compliant, which means that all data modification is done within a transaction. If something goes wrong, Neo4j will simply roll back to a state where the data was reliable.
Neo4j is Java based, which allows you to install it on multiple platforms like Windows, Linux, and OS X. Neo4j can scale up, as it easily adjusts to hardware changes: add more physical memory and it will automatically keep more nodes in the cache. Neo4j can also scale out like most NoSQL solutions: add more servers and it can distribute the transaction load or form a highly available cluster in which another server takes over when the active one fails.
Since by definition Neo4j is a graph database, it's all about relationships and nodes, and both are equally important. Nodes are schema-less entities with properties (key-value pairs) whose keys are always strings. A relationship connects one node to another. Just like nodes, relationships can contain properties, and those properties also support indexing.
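For readers who think in code, the property graph model described above can be sketched in plain Python. This is a toy model and not how Neo4j stores anything: the `Node`, `Relationship`, and `Graph` classes, the `outgoing` helper, and the Matrix-themed sample data are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)  # string keys -> values


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)  # relationships have properties too


class Graph:
    """A graph is simply a collection of nodes plus directed relationships."""

    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, label, **properties):
        node = Node(label, properties)
        self.nodes.append(node)
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, rel_type, end, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, rel_type):
        # Follow relationships of one type in their stored direction only.
        return [r.end for r in self.relationships
                if r.start is node and r.type == rel_type]


g = Graph()
morpheus = g.add_node("Person", name="Morpheus")
neo = g.add_node("Person", name="Neo")
g.relate(morpheus, "MENTORS", neo, since=1999)

print([n.properties["name"] for n in g.outgoing(morpheus, "MENTORS")])  # prints ['Neo']
print(g.outgoing(neo, "MENTORS"))  # prints []
```

The second lookup returns nothing because the relationship is directional: it runs from Morpheus to Neo, not the other way around.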
Next, Roland discusses querying data with Cypher, which is the most powerful of the query languages supported by Neo4j. Cypher was developed and optimized for Neo4j and for graph databases. Cypher is a very fluid language, meaning it continuously changes with each release. The good news is that all major releases are backwards compatible with older versions of the language. It's very different from SQL, so there is a bit of a learning curve. However, the curve is not as steep as you would imagine, because Cypher uses patterns to match the data in the database, very much like how the brain works. That, and Neo4j Desktop has IntelliSense.
As an example to demonstrate the query language and CRUD, we worked with a very cool Doctor Who graph database filled with nodes for Actors, Roles, Episodes, and Villains and their relationships. To begin, we started with the "R", or Read, part of CRUD, learning the MATCH command.
Below is some MATCH ... RETURN syntax:
MATCH (:Actor{name:'Matt Smith'})-[:PLAYED]->(c:Character) RETURN c.name AS name
MATCH (actors:Actor)-[:REGENERATED_TO]->(others) RETURN actors.name, others.name
MATCH (:Character{name:'Doctor'})<-[:ENEMY_OF]-(:Character)-[:COMES_FROM]->(p:Planet) RETURN p.name AS Planet, count(p) AS Count
MATCH (:Actor{name:'Matt Smith'})-[:APPEARED_IN]->(ep:Episode)<-[:APPEARED_IN]-(:Character{name:'Amy Pond'}), (ep)<-[:APPEARED_IN]-(enemies:Character)<-[:ENEMY_OF]-(:Character{name:'Doctor'}) RETURN ep AS Episode, collect(enemies.name) AS Enemies;
Further, Roland discussed the WHERE and ORDER BY clauses, which are very similar to ANSI SQL. Then he covered other Cypher syntax like:
SKIP: skips the number of result items you specify.
LIMIT: limits the number of items returned.
UNION: connects two queries together and generates one result set.
Then he ends the module by covering scalar functions like TOINT, LENGTH, REDUCE, FILTER, ROUND, and SUBSTRING.
Then he reviews a couple of his favorite advanced query features, using COMPANION_OF and SHORTESTPATH as examples.
Continuing on with the C, U, and D in CRUD, we played with CREATE, MATCH with SET, and MATCH with DELETE.
Below is some syntax:
CREATE p = (:Actor{name:'Peter Capaldi'})-[:APPEARED_IN]->(:Episode{name:'The Time of The Doctor'}) RETURN p
MATCH (matt:Actor{name:'Matt Smith'})
DELETE matt
MATCH (matt:Actor{name:'Matt Smith'})
SET matt.salary = 1000
Then we looked at MERGE and FOREACH, with the syntax below as an example:
MERGE (peter:Actor{name:'Peter Capaldi'}) RETURN peter
MATCH p = (actors:Actor)-[r:PLAYED]->(others)
WHERE actors.salary > 10000
FOREACH (n IN nodes(p) | SET n.done = true)
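The idea behind MERGE is "match it if it exists, otherwise create it". A rough Python analogy for the single-node case, using plain dictionaries instead of a real database, might look like the sketch below. The `merge_node` helper and its return shape are hypothetical and only mimic the behavior conceptually.

```python
def merge_node(nodes, label, **properties):
    """Return (node, created): reuse a node matching label and properties, else create one."""
    for node in nodes:
        same_label = node["label"] == label
        same_props = all(node["properties"].get(k) == v for k, v in properties.items())
        if same_label and same_props:
            return node, False  # matched an existing node, nothing is created
    node = {"label": label, "properties": dict(properties)}
    nodes.append(node)
    return node, True  # no match, so the node is created


nodes = []
first, created_first = merge_node(nodes, "Actor", name="Peter Capaldi")
second, created_second = merge_node(nodes, "Actor", name="Peter Capaldi")

print(created_first, created_second)  # prints True False
print(len(nodes))  # prints 1
```

Running the same merge twice leaves a single node behind, which is the property that makes MERGE handy for repeatable imports.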
As we continued our journey with Neo4j, we reconnoitered indexes and constraints. Indexes only help data retrieval, and they add overhead to writes. So, if your application performs lots of writes, it's probably best to avoid them. As for constraints, as of the course the unique constraint is the only constraint available in Neo4j, which is why it is often just called "the constraint". Lastly in the module, we reviewed importing CSV, which makes importing data from other sources a breeze. CSV files can be loaded into a Neo4j database from the local file system as well as from remote locations. Cypher has a LOAD CSV statement, which is used together with CREATE and/or MERGE.
Finally, Roland reviewed Neo4j's APIs, which was a little bit outside our lexicon but interesting nonetheless. Neo4j supports two API types out of the box: the traditional REST API and its proprietary Bolt protocol. The advantage of Bolt is mainly performance. Bolt doesn't have the HTTP overhead, and it uses a binary format instead of text to return data. For both the REST and Bolt APIs, Roland provides C# code samples that can be run with NuGet packages in Visual Studio, my new favorite IDE.
Ever have that feeling where you’re not sure if you’re awake or dreaming?
Below are some topics I am considering for my learnings next week:
· More on Neo4j and Cypher
· More on MongoDB
· More with Google Cloud Path
· Working with Parquet files
· JDBC Drivers
· More on Machine Learning
· Data Visualization Tools (e.g. Looker)
· Additional ETL Solutions (Stitch, FiveTran)
· Processing and transforming data / exploring data through ML (e.g. Databricks)
Stay safe and be well
-MCS
