The Microsoft Ignite 2015 conference took place May 4 – 8 in Chicago. I unfortunately was not able to attend, but I was certainly looking forward to all the session material being made available and posted on Channel 9.

Some enhancements and new features coming in SQL Server 2016 were announced:
– Stretch Database
– Native JSON support
– PolyBase
– Enhancements to Master Data Services and SSIS
– Always Encrypted
– Operational Analytics
– In-database analytics with R integration
– Azure SQL Data Warehouse

I strongly recommend you stay up to date by reading the following Data Platform blogs and watching the Day 1 Microsoft Ignite keynote.

SQL Server 2016 public preview is coming this summer. To stay informed and not be left out, please sign up here, and don’t forget to also sign up for the Power BI Preview.

So in no particular order I have quickly put together a list of sessions that are of interest to me and that will consume most of my upcoming evenings and weekends…

NoSQL on Microsoft Azure: An introduction

Data Science with Microsoft Azure Machine Learning, Microsoft SQL Server and R

What’s Next for Business Analytics in Microsoft Excel

What’s Next for Visualizations in Microsoft Office

Microsoft BI Overview

Overview of Advanced Analytics with Machine Learning and Data Orchestration

Harnessing the Power of Hadoop: Cloud Scale with Microsoft Azure HDInsight

In-Memory Technologies Overview for Microsoft SQL Server and Microsoft Azure

Power BI Overview

Microsoft Azure SQL Data Warehouse Overview

Microsoft SQL Server BI Drill Down

Microsoft Azure Machine Learning Marketplace: Real-World Experience to Solve Your Business Problems

Microsoft Azure SQL Database: Overview and Customer Scenarios

Next Steps in PolyBase with Dr. David DeWitt

SELECT Latest FROM Microsoft Azure DocumentDB

Advanced Analytics: Navigating Your Way There

Operational Analytics in SQL Server

Big Data for the SQL Ninja

Cloud-Based Machine Learning for the Developer

Overview of Microsoft Analytics Platform System (APS)

Build Hybrid Big Data Pipelines with Azure Data Factory and Azure HDInsight

Microsoft Azure DocumentDB and Azure HDInsight: Better Together

Real-Time Analytics at Scale for Internet of Things

End-to-End Business Analytics Solution: Real-World Scenario & Demo

Power BI for Developers

Planning Your Big Data Architecture on Azure

Optimizing Hadoop Using Microsoft Azure HDInsight

Advanced Modeling and Calculations Using the Microsoft Power BI Designer

Revolution Analytics Joins Microsoft: Strange Bedfellows or a New Power Tool for Advanced Analytics

Azure SQL Data Warehouse: Deep Dive

DNA of Data-Driven Organizations: Storytelling with R, Microsoft Azure Machine Learning and Power BI

Scalable BI and Advanced Modeling with Microsoft SQL Server Analysis Services and Power Pivot

Harness Predictive Customer Churn Models with Azure Machine Learning, Data Factory, and More

What’s New in Master Data Services (MDS) and Integration Services (SSIS) in SQL Server

Microsoft SQL Server Unplugged

ColumnStore Index: Microsoft SQL Server 2014 and Beyond

Would you like to further explore and learn more about Neo4j… using the cloud? Perhaps provision a more powerful VM instance with multiple cores and additional memory, or a sandbox development environment? Well, this post, “Creating a Neo4j Linux VM in Windows Azure”, is just right for you.

We will be installing a Neo4j Community 2.0.1 distribution, which runs on Ubuntu 12.04 LTS and is available on the VM Depot.

Follow these steps, which outline the basics of the setup.

From the Windows Azure Management Portal select “VIRTUAL MACHINES” then select “IMAGES” and then “BROWSE VM DEPOT”


Find the Neo4j Community 2.0.1 community image under “UBUNTU”
Click the arrow to continue


Choose a storage account
Choose a region, create a new storage account, give it a name (for example: neo4jlinuxvm) and then click the check to continue.


The VM image will be copied to the new storage account. Once it is finished, proceed to the next step.
When the status of the VM image shows “Pending registration”, click “Register”



Register and provide a name for your new VM image
Click the check to continue.


Create a new Virtual Machine (click the +) and select “FROM GALLERY”


Now we will create a new VM from an image
Select “MY IMAGES” and the image you created earlier (in this case neo4j-linux-vm) and then click on the arrow to continue


Configuring your VM
Provide a VM name, size it appropriately, create a user name and password (these credentials will be used when you connect to Ubuntu via SSH) and then click the arrow to continue


Create a new cloud service and provide a name, and, most importantly, add a new endpoint named Neo4j with value 7474 for both the public and private port, then click the arrow to continue


Your new VM is being provisioned! This can take a couple of minutes…


When completed, go to the VM dashboard to see all relevant information.
In this case we need the DNS name


Open up your favourite browser and enter your VM’s DNS name followed by the Neo4j endpoint port, 7474 (for example: http://yourcloudservice.cloudapp.net:7474)


That’s it, you’re done! Now start graphing

To connect to and manage your VM through SSH you will need a tool like PuTTY on Windows, or you can use Terminal on Mac OS X. In future posts I will go through how to upgrade to the latest version of Neo4j, modify Neo4j configuration settings, and upload CSV files and graphdb backups.

Graphs are everywhere! In my efforts to spread the word and inform SQL Server Professionals about Neo4j and learn how to take a relational database schema and model it as a graph… AdventureWorks was a no-brainer! Let’s explore how to import a subset of the AdventureWorks database from SQL Server (RDBMS) into Neo4j (GraphDB).

But first, some prerequisites. You should at least have a basic understanding of what a property graph model is and have reviewed the following modeling guidelines. Download and install Neo4j to be able to follow along with the examples.

The AdventureWorks database is a fictitious company database that has existed since SQL Server 2005 as a means to show new functionality in each new version released. AdventureWorks has become a vital aspect of learning new features/constructs within SQL Server.


For our demonstration I have decided to choose a subset of tables and select only offline sales transactions (OnlineOrderFlag = 0).
As a result the following CSV files will be used to import data into our graph structure:

  • customers.csv
  • employees.csv
  • orders.csv
  • productcategories.csv
  • products.csv
  • productsubcategories.csv
  • vendorproduct.csv
  • vendors.csv
You can find all the necessary code and files here ->

If you still need to download a copy of the AdventureWorks sample database (I used the SQL Server 2014 version) you can find them here ->

Developing a Graph Model

When deriving and defining a graph model from a relational model, you should keep the following base guidelines in mind:

  • A row is a node
  • A table name is a label name

For further understanding, read the following article: GraphDB vs. RDBMS. It explores the differences between relational and graph databases and their data models, explains how to integrate graph databases with relational databases, and shows how to import data from a relational store.

With this dataset, this simple AdventureWorks graph data model serves as a foundation:

The key difference between a graph and a relational database is that relational databases work with sets while graph databases work with paths. Relationships are first-class entities in a graph database, and they are semantically far stronger than the implied relationships reified at runtime in a relational store.
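To make the set-versus-path distinction concrete, here is a toy Python sketch (with made-up rows, not the AdventureWorks data): the relational side recomputes the Order-to-Product link from foreign keys on every query, while the graph side follows relationships stored once as direct pointers.

```python
# Hypothetical miniature data set, two "tables" as lists of dicts.
orders = [{"orderId": 1, "productId": 10},
          {"orderId": 2, "productId": 11}]
products = [{"productId": 10, "name": "AWC Logo Cap"},
            {"productId": 11, "name": "Long-Sleeve Logo Jersey, L"}]

# Relational style: the Order->Product link is recomputed from
# foreign keys by a set-oriented join on every query.
def join_order_products(orders, products):
    return [(o["orderId"], p["name"])
            for o in orders for p in products
            if o["productId"] == p["productId"]]

# Graph style: the relationship is materialized once as a direct
# pointer, so a query just follows the path from each order node.
product_by_id = {p["productId"]: p for p in products}
edges = {o["orderId"]: product_by_id[o["productId"]] for o in orders}

def follow_paths(edges):
    return [(oid, node["name"]) for oid, node in edges.items()]
```

Both approaches return the same pairs here; the difference is where the relationship lives, and that difference grows with query depth (multi-hop joins vs. simple path traversals).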

    Importing the Data using Cypher

Now that we have extracted data from the AdventureWorks database, we will be using Cypher’s LOAD CSV command to transform and load the content of these CSV files into a graph structure.

    First we create specific indexes and constraints

    // Create indexes for faster lookup
    CREATE INDEX ON :Category(categoryName);
    CREATE INDEX ON :SubCategory(subCategoryName);
    CREATE INDEX ON :Vendor(vendorName);
    CREATE INDEX ON :Product(productName);
    // Create constraints
    CREATE CONSTRAINT ON (c:Category) ASSERT c.categoryId IS UNIQUE;
    CREATE CONSTRAINT ON (s:SubCategory) ASSERT s.subCategoryId IS UNIQUE;
    CREATE CONSTRAINT ON (e:Employee) ASSERT e.employeeId IS UNIQUE;
    CREATE CONSTRAINT ON (c:Customer) ASSERT c.customerId IS UNIQUE;

Run “schema await” in the neo4j-shell to wait for the indexes to come online, then create the nodes:

    // Create products
    LOAD CSV WITH HEADERS FROM "file:data/products.csv" as row
    CREATE (:Product {productName: row.ProductName, productNumber: row.ProductNumber, productId: row.ProductID, modelName: row.ProductModelName, standardCost: row.StandardCost, listPrice: row.ListPrice});
    // Create vendors
    LOAD CSV WITH HEADERS FROM "file:data/vendors.csv" as row
    CREATE (:Vendor {vendorName: row.VendorName, vendorNumber: row.AccountNumber, vendorId: row.VendorID, creditRating: row.CreditRating, activeFlag: row.ActiveFlag});
    // Create employees
    LOAD CSV WITH HEADERS FROM "file:data/employees.csv" as row
CREATE (:Employee {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, employeeId: row.EmployeeID, jobTitle: row.JobTitle, organizationLevel: row.OrganizationLevel, maritalStatus: row.MaritalStatus, gender: row.Gender, territory: row.Territory, country: row.Country, group: row.Group});
    // Create customers
    LOAD CSV WITH HEADERS FROM "file:data/customers.csv" as row
    CREATE (:Customer {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, customerId: row.CustomerID});
    // Create categories
    LOAD CSV WITH HEADERS FROM "file:data/productcategories.csv" as row
    CREATE (:Category {categoryName: row.CategoryName, categoryId: row.CategoryID});
    // Create sub-categories
    LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" as row
    CREATE (:SubCategory {subCategoryName: row.SubCategoryName, subCategoryId: row.SubCategoryID});
    // Prepare orders
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MERGE (order:Order {orderId: row.SalesOrderID}) ON CREATE SET order.orderDate =  row.OrderDate;

    Create relationships

    // Create relationships: Order to Product
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (product:Product {productId: row.ProductID})
    MERGE (order)-[pu:PRODUCT]->(product)
    ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.OrderQty);
    // Create relationships: Order to Employee
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MERGE (employee)-[:SOLD]->(order);
    // Create relationships: Order to Customer
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (customer:Customer {customerId: row.CustomerID})
    MERGE (customer)-[:PURCHASED]->(order);
    // Create relationships: Product to Vendor
    LOAD CSV WITH HEADERS FROM "file:data/vendorproduct.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (vendor:Vendor {vendorId: row.VendorID})
    MERGE (vendor)-[:SUPPLIES]->(product);
    // Create relationships: Product to SubCategory
    LOAD CSV WITH HEADERS FROM "file:data/products.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MERGE (product)-[:PART_OF_SUBCAT]->(subcategory);
    // Create relationships: SubCategory to Category
    LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" AS row
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MATCH (category:Category {categoryId: row.CategoryID})
    MERGE (subcategory)-[:PART_OF_CAT]->(category);
    // Create relationship for employee reporting structure
    LOAD CSV WITH HEADERS FROM "file:data/employees.csv" AS row
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MATCH (manager:Employee {employeeId: row.ManagerID})
    MERGE (employee)-[:REPORTS_TO]->(manager);

    You can also import and run the entire Cypher script using the neo4j-shell: bin/neo4j-shell -path adventure.db -file cypher/import.cypher
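Beyond the shell, statements can also be sent to the server over HTTP. Here is a minimal Python sketch, assuming the Neo4j 2.x REST transactional endpoint on port 7474 (the host name and helper function names are hypothetical, and no authentication is configured):

```python
import json
import urllib.request

def cypher_payload(statement, parameters=None):
    """Build the JSON body Neo4j's transactional endpoint expects."""
    stmt = {"statement": statement}
    if parameters:
        stmt["parameters"] = parameters
    return json.dumps({"statements": [stmt]})

def run_cypher(host, statement, parameters=None):
    """POST one Cypher statement to http://<host>:7474 and return the decoded JSON reply."""
    req = urllib.request.Request(
        "http://%s:7474/db/data/transaction/commit" % host,
        data=cypher_payload(statement, parameters).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, run_cypher("yourcloudservice.cloudapp.net", "MATCH (p:Product) RETURN count(p)") would return the node count as JSON, assuming the VM endpoint from the earlier setup is open.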

Once completed, you will be able to query and do stuff like this:


    Querying the Graph with Cypher

Let’s issue some Cypher queries against our newly created graph database:

    // List the product subcategories and categories provided by each supplier.
    MATCH (v:Vendor)-->(:Product)-->(s:SubCategory)-->(c:Category)
    RETURN v.vendorName as Vendor, collect(distinct s.subCategoryName) as SubCategories, collect(distinct c.categoryName) as Categories;


// Which employee had the highest cross-selling count of 'AWC Logo Cap', and with which other product?
MATCH (p:Product {productName:'AWC Logo Cap'})<-[:PRODUCT]-(:Order)<-[:SOLD]-(employee),
      (employee)-[:SOLD]->(o2:Order)-[:PRODUCT]->(other:Product)
WHERE other <> p
RETURN employee.fullName as Employee, other.productName as Product, count(distinct o2) as Count
ORDER BY Count DESC
LIMIT 5;


    // What is the total quantity sold of a particular product?
    MATCH (o:Order)-[r:PRODUCT]->(p:Product {productName: "Long-Sleeve Logo Jersey, L"})
    RETURN sum(r.quantity) as TotalQuantitySold;


    Explore, learn Cypher and write your own queries

The more you use R, the more great stuff you discover… I wanted to do some analysis with weather data, but did not want to start searching and scraping sites, then merging and cleansing… you know the drill. So I came across this neat R package named weatherData, which provides functions that help in fetching weather data from websites; documentation can be found here ->

So I got inspired to compare Ottawa weather and Ottawa JavaScript Meetup attendance! Even with only 33 meetups (observations) in the past 3+ years, I thought it would be interesting to see if there is any correlation between them: Ottawa weather vs. Ottawa JavaScript Meetup attendance, an R analysis.

    First we need to load some required libraries (packages) and write a function to get the weather data we need:

# Load the required packages
library(weatherData)
library(dplyr)
library(ggplot2)
library(gridExtra)

# Ottawa International Airport (YOW) weather data
getWeatherForYear = function(year) {
  getWeatherForDate('YOW',
                    start_date = paste(sep='', year, '-01-01'),
                    end_date = paste(sep='', year, '-12-31'),
                    opt_detailed = FALSE,
                    opt_all_columns = TRUE)
}

# Execute functions - get data
df_weather = rbind(getWeatherForYear(2012),
                   getWeatherForYear(2013),
                   getWeatherForYear(2014),
                   getWeatherForDate('YOW', start_date = '2015-01-01',
                                     end_date = '2015-02-11',
                                     opt_detailed = FALSE,
                                     opt_all_columns = TRUE))

Your console window will output something similar to the following:


Once the data has been fetched we need to write and save it to a CSV file. Then we can read it back from the CSV and sample 10 records with specific columns:

    write.csv(df_weather, 'ottawa_weatherdata.csv', row.names = TRUE)
    weather = read.csv('ottawa_weatherdata.csv')
    weather$Date = as.POSIXct(weather$Date)
    weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)


Next we need to load the Ottawa JavaScript Meetup data, which I have already fetched and stored here in a CSV file (note: there are only 33 events, and unfortunately the event names do not include the topics and sessions presented on those dates… meh!)

    events = read.csv('ottawajs_events.csv')
    events$date =  as.POSIXct(events$date) 
    events %>% sample_n(10) %>% select(event, rsvps, date)


    With our two datasets (data frames) loaded we can now manipulate, group and plot. Let’s get the average attendance and temperature by month:

# Group average attendance by month
by_month = events %>%
    mutate(month = factor(format(date, "%B"), levels = month.name)) %>%
    group_by(month) %>%
    summarise(events = n(),
              count = sum(rsvps)) %>%
    mutate(avg = count / events)

# Group average temperature by month
averagetemperature_bymonth = weather %>%
    mutate(month = factor(format(Date, "%B"), levels = month.name)) %>%
    group_by(month) %>%
    summarise(avg_temperature = mean(Mean_TemperatureC))

plot_temperature = ggplot(aes(x = month, y = avg_temperature, group = 1), data = averagetemperature_bymonth) +
    geom_line() + ggtitle("Average Temperature by Month")
plot_attendance = ggplot(aes(x = month, y = avg, group = 1), data = by_month) +
    geom_bar(stat = "identity", fill = "grey50") +
    ggtitle("Average Attendance by Month")
grid.arrange(plot_temperature, plot_attendance, ncol = 1)

Examining the plots, we can see a slight inverse correlation between February and June (March is the black sheep!): as the temperature increases, total attendance slightly decreases. And from July to November, as the temperature decreases, attendance slightly increases!

Let’s go more granular by comparing specific dates. We need to add (mutate) a new column to our existing events data frame, merge it with our weather data frame, and then plot the attendance against the average temperature for specific individual days:

# Group by day
by_day = events %>%
    mutate(day = as.POSIXct(events$date)) %>%
    group_by(day) %>%
    summarise(events = n(),
              count = sum(rsvps)) %>%
    mutate(avg = count / events)

weather = weather %>% mutate(day = Date)
merged = merge(weather, by_day, by = 'day')
ggplot(aes(x = count, y = Mean_TemperatureC, group = day), data = merged) + geom_point()


Interesting: there doesn’t seem to be any strong correlation between the temperature and the attendance. We can confirm our doubts by computing the correlation:

    # Any correlation between the temperature and attendance?
    cor(merged$count, merged$Mean_TemperatureC)


    We get a weak downhill (negative) linear relationship: -0.2958863
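For the curious, R’s cor() defaults to the Pearson coefficient: the covariance scaled by both standard deviations. As a rough sketch, it can be computed by hand like this (the sample vectors in the comment are illustrative, not the actual meetup/weather observations):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance (unnormalized) between the two variables.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the n factors cancel out.
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# e.g. pearson([1, 2, 3], [6, 4, 2]) is exactly -1: a perfect
# downhill relationship, far stronger than our -0.2958863.
```

A value near -1 or 1 means a strong linear relationship; a value near 0, essentially none, which is why -0.2958863 reads as only a weak negative link.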

There must be variables other than temperature influencing event attendance per month. My hypothesis was that we would see lower attendance during March break and at the start of the summer school vacation (June, July, August), but in this case July and August are up!

That, for now, pretty much puts a halt to my Meetup attendance prediction model.
Still, it was insightful and worthwhile to use R to better understand the data and look for relationships (that’s what it’s all about!).

What I would like to analyze next are data sets related to daily museum, park, and library attendance. If you have any other interesting suggestions please let me know; we could work on it together.

    GitHub Gist available here ->