The Microsoft Ignite 2015 Conference took place May 4–8 in Chicago. Unfortunately I was not able to attend, but I was certainly looking forward to all the session material being posted on Channel 9.

Several enhancements and new features coming to SQL Server 2016 were announced:
– Stretch Database
– Native JSON support
– PolyBase
– Enhancements to Master Data Services and SSIS
– Always Encrypted
– Operational Analytics
– In-database analytics with R integration
– Azure SQL Data Warehouse

I strongly recommend you stay up to date by reading the Data Platform blogs and watching the Day 1 Microsoft Ignite keynote.

The SQL Server 2016 public preview is coming this summer. To stay informed and not be left out, please sign up here, and don’t forget to also sign up for the Power BI preview.

So, in no particular order, I have put together a list of sessions that interest me and that will consume most of my upcoming evenings and weekends…


NoSQL on Microsoft Azure: An introduction
https://channel9.msdn.com/Events/Ignite/2015/BRK2555

Data Science with Microsoft Azure Machine Learning, Microsoft SQL Server and R
https://channel9.msdn.com/Events/Ignite/2015/BRK3550

What’s Next for Business Analytics in Microsoft Excel
https://channel9.msdn.com/Events/Ignite/2015/BRK2566

What’s Next for Visualizations in Microsoft Office
https://channel9.msdn.com/events/Ignite/2015/BRK3564

Microsoft BI Overview
https://channel9.msdn.com/Events/Ignite/2015/BRK2556

Overview of Advanced Analytics with Machine Learning and Data Orchestration
https://channel9.msdn.com/Events/Ignite/2015/BRK2569

Harnessing the Power of Hadoop: Cloud Scale with Microsoft Azure HDInsight
https://channel9.msdn.com/Events/Ignite/2015/BRK2557

In-Memory Technologies Overview for Microsoft SQL Server and Microsoft Azure
https://channel9.msdn.com/Events/Ignite/2015/BRK2563

Power BI Overview
https://channel9.msdn.com/Events/Ignite/2015/BRK2568

Microsoft Azure SQL Data Warehouse Overview
https://channel9.msdn.com/Events/Ignite/2015/BRK2571

Microsoft SQL Server BI Drill Down
https://channel9.msdn.com/Events/Ignite/2015/BRK2552

Microsoft Azure Machine Learning Marketplace: Real-World Experience to Solve Your Business Problems
https://channel9.msdn.com/Events/Ignite/2015/BRK2560

Microsoft Azure SQL Database: Overview and Customer Scenarios
https://channel9.msdn.com/Events/Ignite/2015/BRK2564

Next Steps in Polybase with Dr. David Dewitt
https://channel9.msdn.com/Events/Ignite/2015/BRK3565

SELECT Latest FROM Microsoft Azure DocumentDB
https://channel9.msdn.com/Events/Ignite/2015/BRK2551

Advanced Analytics: Navigating Your Way There
https://channel9.msdn.com/Events/Ignite/2015/BRK2567

Operational Analytics in SQL Server
https://channel9.msdn.com/Events/Ignite/2015/BRK4552

Big Data for the SQL Ninja
https://channel9.msdn.com/Events/Ignite/2015/BRK2550

Cloud-Based Machine Learning for the Developer
https://channel9.msdn.com/Events/Ignite/2015/BRK3560

Overview of Microsoft Analytics Platform System (APS)
https://channel9.msdn.com/Events/Ignite/2015/BRK1550

Build Hybrid Big Data Pipelines with Azure Data Factory and Azure HDInsight
https://channel9.msdn.com/Events/Ignite/2015/BRK3559

Microsoft Azure DocumentDB and Azure HDInsight: Better Together
https://channel9.msdn.com/Events/Ignite/2015/BRK3562

Real-Time Analytics at Scale for Internet of Things
https://channel9.msdn.com/Events/Ignite/2015/BRK3555

End-to-End Business Analytics Solution: Real-World Scenario & Demo
https://channel9.msdn.com/Events/Ignite/2015/BRK2559

Power BI for Developers
https://channel9.msdn.com/Events/Ignite/2015/BRK3551

Planning Your Big Data Architecture on Azure
https://channel9.msdn.com/Events/Ignite/2015/BRK2576

Optimizing Hadoop Using Microsoft Azure HDInsight
https://channel9.msdn.com/Events/Ignite/2015/BRK3556

Advanced Modeling and Calculations Using the Microsoft Power BI Designer
https://channel9.msdn.com/Events/Ignite/2015/BRK3573

Revolution Analytics Joins Microsoft: Strange Bedfellows or a New Power Tool for Advanced Analytics
https://channel9.msdn.com/Events/Ignite/2015/BRK3553

Azure SQL Data Warehouse: Deep Dive
https://channel9.msdn.com/Events/Ignite/2015/BRK3569

DNA of Data-Driven Organizations: Storytelling with R, Microsoft Azure Machine Learning and Power BI
https://channel9.msdn.com/Events/Ignite/2015/BRK2562

Scalable BI and Advanced Modeling with Microsoft SQL Server Analysis Services and Power Pivot
https://channel9.msdn.com/Events/Ignite/2015/BRK3561

Harness Predictive Customer Churn Models with Azure Machine Learning, Data Factory, and More
https://channel9.msdn.com/Events/Ignite/2015/BRK3570

What’s New in Master Data Services (MDS) and Integration Services (SSIS) in SQL Server
https://channel9.msdn.com/events/Ignite/2015/BRK2578

Microsoft SQL Server Unplugged
https://channel9.msdn.com/events/Ignite/2015/BRK1551

ColumnStore Index: Microsoft SQL Server 2014 and Beyond
https://channel9.msdn.com/events/Ignite/2015/BRK4556

Would you like to explore and learn more about Neo4j… in the cloud? Perhaps provision a more powerful VM instance with multiple cores and additional memory, or a sandbox development environment? Then this post, “Creating a Neo4j Linux VM in Windows Azure”, is just right for you.

We will be installing the Neo4j Community 2.0.1 distribution, which runs on Ubuntu 12.04 LTS and is available on the VM Depot.

Follow these steps, which outline the basics of the setup.

From the Windows Azure Management Portal, select “VIRTUAL MACHINES”, then “IMAGES”, and then “BROWSE VM DEPOT”.

azureneo4j01

 
Find the Neo4j Community 2.0.1 image under “UBUNTU”.
Click the arrow to continue.

azureneo4j04

 
Choose a storage account: select a region, create a new storage account and give it a name (for example: neo4jlinuxvm), then click the check to continue.

azureneo4j05

 
The VM image will be copied to the new storage account. Once it has finished, proceed to the next step.
When the status of the VM image shows “Pending registration”, click “Register”.

azureneo4j06

azureneo4j07

 
Register and provide a name for your new VM image
Click the check to continue.

azureneo4j08

 
Create a new Virtual Machine (click the +) and select “FROM GALLERY”

azureneo4j09

 
Now we will create a new VM from an image.
Select “MY IMAGES”, choose the image you created earlier (in this case neo4j-linux-vm), and then click the arrow to continue.

azureneo4j10

 
Configure your VM.
Provide a VM name, size it appropriately, and create a user name and password (the credentials you will use when connecting to Ubuntu via SSH), then click the arrow to continue.

azureneo4j11

 
Create a new cloud service and provide a name. Most importantly, add a new endpoint named Neo4j with 7474 as both the public and private port, then click the arrow to continue.

azureneo4j12

 
Your new VM is being provisioned! This can take a couple of minutes…

azureneo4j13

 
When completed, go to the VM dashboard to see all the relevant information.
In this case we need the DNS name: neo4jlinuxvm.cloudapp.net

azureneo4j14

 
Open up your favourite browser and navigate to http://[DNSName]:7474, in this case:
http://neo4jlinuxvm.cloudapp.net:7474

azureneo4j15

 
That’s it, you’re done! Now start graphing.
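
Once the Neo4j browser loads, you can do a quick sanity check that the server accepts Cypher. A minimal sketch (run each statement separately; the Test label and property value are just placeholders for illustration):

// 1. Create a throwaway node
CREATE (n:Test {name: 'hello from Azure'})

// 2. Read it back
MATCH (n:Test) RETURN n.name

// 3. Clean up
MATCH (n:Test) DELETE n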

 
To connect to and manage your VM through SSH, you will need a tool like PuTTY on Windows, or you can use Terminal on Mac OS X. In future posts I will go through how to upgrade to the latest version of Neo4j, modify Neo4j configuration settings, and upload CSV files and graph database backups.

Graphs are everywhere! In my efforts to spread the word, inform SQL Server professionals about Neo4j, and show how to take a relational database schema and model it as a graph, AdventureWorks was a no-brainer! Let’s explore how to import a subset of the AdventureWorks database from SQL Server (RDBMS) into Neo4j (GraphDB).

But first, some prerequisites. You should at least have a basic understanding of what a property graph model is and have reviewed the following modeling guidelines. Download and install Neo4j to be able to follow along with the examples.

The AdventureWorks database is a fictitious company database that has existed since SQL Server 2005 as a means to demonstrate new functionality in each released version. It has become a staple for learning new features and constructs within SQL Server.

AdventureWorks2008

For our demonstration I have chosen a subset of tables and selected only offline sales transactions (OnlineOrderFlag = 0).
As a result, the following CSV files will be used to import data into our graph structure:

  • customers.csv
  • employees.csv
  • orders.csv
  • productcategories.csv
  • products.csv
  • productsubcategories.csv
  • vendorproduct.csv
  • vendors.csv

    You can find all the necessary code and files here -> https://github.com/sfrechette/adventureworks-neo4j

    If you still need to download a copy of the AdventureWorks sample database (I used the SQL Server 2014 version), you can find it here -> https://msftdbprodsamples.codeplex.com/

    Developing a Graph Model
    When deriving and defining a graph model from a relational model, you should keep the following base guidelines in mind:

  • A row is a node
  • A table name is a label name

    For further understanding, read GraphDB vs. RDBMS. This article explores the differences between relational and graph databases and their data models. In addition, it explains how to integrate graph databases with relational databases and how to import data from a relational store.

    With this dataset, this simple AdventureWorks graph data model serves as a foundation:

    graphmodel_adventureworks

    The key difference between a graph and a relational database is that relational databases work with sets while graph databases work with paths. Relationships are first-class entities in a graph database, and they are semantically far stronger than the implied relationships reified at query time through joins in a relational store.
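
    To make this concrete, here is a small sketch (illustration only; it relies on the labels and relationship types we will create in the import script below, and the customer name is a placeholder). In SQL, answering “which products did a customer buy?” is reified at query time through joins, roughly across SalesOrderHeader, SalesOrderDetail and Product; in the graph, the same question simply follows named relationships:

    // Which products did a given customer buy? (sketch; placeholder customer name)
    MATCH (c:Customer {fullName: 'Some Customer'})-[:PURCHASED]->(:Order)-[:PRODUCT]->(p:Product)
    RETURN DISTINCT p.productName;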

    Importing the Data using Cypher

    Now that we have extracted the data from the AdventureWorks database, we will use Cypher’s LOAD CSV command to transform and load the contents of these CSV files into a graph structure.

    First we create the indexes and constraints:

    // Create indexes for faster lookup
    CREATE INDEX ON :Category(categoryName);
    CREATE INDEX ON :SubCategory(subCategoryName);
    CREATE INDEX ON :Vendor(vendorName);
    CREATE INDEX ON :Product(productName);
    
    // Create constraints
    CREATE CONSTRAINT ON (o:Order) ASSERT o.orderId IS UNIQUE;
    CREATE CONSTRAINT ON (p:Product) ASSERT p.productId IS UNIQUE;
    CREATE CONSTRAINT ON (c:Category) ASSERT c.categoryId IS UNIQUE;
    CREATE CONSTRAINT ON (s:SubCategory) ASSERT s.subCategoryId IS UNIQUE;
    CREATE CONSTRAINT ON (e:Employee) ASSERT e.employeeId IS UNIQUE;
    CREATE CONSTRAINT ON (v:Vendor) ASSERT v.vendorId IS UNIQUE;
    CREATE CONSTRAINT ON (c:Customer) ASSERT c.customerId IS UNIQUE;
    

    Wait for the schema to come online (the indexes and constraints created above need to be ready) before loading any data.
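
    If you run the whole script through the neo4j-shell (as shown at the end of this post), you can make that wait explicit with the shell’s schema command; a minimal sketch:

    // neo4j-shell command (not Cypher): block until all indexes are online
    schema await

    Now create the nodes: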

    // Create products
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/products.csv" as row
    CREATE (:Product {productName: row.ProductName, productNumber: row.ProductNumber, productId: row.ProductID, modelName: row.ProductModelName, standardCost: row.StandardCost, listPrice: row.ListPrice});
    
    // Create vendors
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/vendors.csv" as row
    CREATE (:Vendor {vendorName: row.VendorName, vendorNumber: row.AccountNumber, vendorId: row.VendorID, creditRating: row.CreditRating, activeFlag: row.ActiveFlag});
    
    // Create employees
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/employees.csv" as row
    CREATE (:Employee {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, employeeId: row.EmployeeID, jobTitle: row.JobTitle, organizationLevel: row.OrganizationLevel, maritalStatus: row.MaritalStatus, gender: row.Gender, territory: row.Territory, country: row.Country, group: row.Group});
    
    // Create customers
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/customers.csv" as row
    CREATE (:Customer {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, customerId: row.CustomerID});
    
    // Create categories
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/productcategories.csv" as row
    CREATE (:Category {categoryName: row.CategoryName, categoryId: row.CategoryID});
    
    // Create sub-categories
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" as row
    CREATE (:SubCategory {subCategoryName: row.SubCategoryName, subCategoryId: row.SubCategoryID});
    
    // Prepare orders
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MERGE (order:Order {orderId: row.SalesOrderID}) ON CREATE SET order.orderDate =  row.OrderDate;
    

    Create relationships

    // Create relationships: Order to Product
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (product:Product {productId: row.ProductID})
    MERGE (order)-[pu:PRODUCT]->(product)
    ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.OrderQty);
    
    // Create relationships: Order to Employee
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MERGE (employee)-[:SOLD]->(order);
    
    // Create relationships: Order to Customer
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (customer:Customer {customerId: row.CustomerID})
    MERGE (customer)-[:PURCHASED]->(order);
    
    // Create relationships: Product to Vendor
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/vendorproduct.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (vendor:Vendor {vendorId: row.VendorID})
    MERGE (vendor)-[:SUPPLIES]->(product);
    
    // Create relationships: Product to SubCategory
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/products.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MERGE (product)-[:PART_OF_SUBCAT]->(subcategory);
    
    // Create relationships: SubCategory to Category
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" AS row
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MATCH (category:Category {categoryId: row.CategoryID})
    MERGE (subcategory)-[:PART_OF_CAT]->(category);
    
    // Create relationship for employee reporting structure
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/employees.csv" AS row
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MATCH (manager:Employee {employeeId: row.ManagerID})
    MERGE (employee)-[:REPORTS_TO]->(manager);
    

    You can also import and run the entire Cypher script using the neo4j-shell: bin/neo4j-shell -path adventure.db -file cypher/import.cypher

    Once completed, you will be able to query the graph and do things like this:

    cypher_subcat
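
    Before moving on to real questions, it can also be worth sanity-checking what landed in the graph. A quick sketch that counts nodes per label and relationships per type:

    // How many nodes of each label were imported?
    MATCH (n)
    RETURN labels(n) AS Label, count(*) AS Nodes
    ORDER BY Nodes DESC;

    // How many relationships of each type?
    MATCH ()-[r]->()
    RETURN type(r) AS RelationshipType, count(*) AS Relationships;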

    Querying the Graph with Cypher

    Let’s issue some Cypher queries against our newly created graph database:

    // List the product subcategories and categories provided by each supplier.
    MATCH (v:Vendor)-->(:Product)-->(s:SubCategory)-->(c:Category)
    RETURN v.vendorName as Vendor, collect(distinct s.subCategoryName) as SubCategories, collect(distinct c.categoryName) as Categories;
    

    cypher_aw01

    // Which employee had the highest cross-selling count of 'AWC Logo Cap' and which product?
    MATCH (p:Product {productName:'AWC Logo Cap'})<-[:PRODUCT]-(:Order)<-[:SOLD]-(employee),
          (employee)-[:SOLD]->(o2)-[:PRODUCT]->(other:Product)
    RETURN employee.fullName as Employee, other.productName as Product, count(distinct o2) as Count
    ORDER BY Count DESC
    LIMIT 5;
    

    cypher_aw02

    // What is the total quantity sold of a particular product?
    MATCH (o:Order)-[r:PRODUCT]->(p:Product {productName: "Long-Sleeve Logo Jersey, L"})
    RETURN sum(r.quantity) as TotalQuantitySold;
    

    cypher_aw03

    Explore, learn Cypher, and write your own queries.
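
    As one more starting point, here is a query sketch against the same model, for example the top five customers by number of orders placed:

    // Top 5 customers by number of orders placed
    MATCH (c:Customer)-[:PURCHASED]->(o:Order)
    RETURN c.fullName AS Customer, count(o) AS Orders
    ORDER BY Orders DESC
    LIMIT 5;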
    Enjoy!

    The more you use it, the more great stuff you discover… I wanted to do some analysis with weather data, but did not want to start searching and scraping sites, then merging and cleansing… you know the drill. So I came across this neat R package named weatherData, which provides functions that help fetch weather data from websites; documentation can be found here -> https://ram-n.github.io/weatherData/.

    So I got inspired to compare Ottawa weather and Ottawa JavaScript Meetup attendance! Even with only 33 meetups (observations) in the past 3+ years, I thought it would be interesting to see whether there is any correlation between them. Ottawa weather vs. Ottawa JavaScript Meetup attendance: an R analysis.

    First we need to load some required libraries (packages) and write a function to get the weather data we need:

    library(weatherData)
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
    
    # Ottawa International Airport (YOW) weather data
    getWeatherForYear = function(year) {
        getWeatherForDate('YOW', 
                          start_date= paste(sep='', year, '-01-01'),
                          end_date = paste(sep='', year, '-12-31'),
                          opt_detailed = FALSE,
                          opt_all_columns = TRUE)
    }
    
    # Execute functions - get data
    df_weather = rbind(getWeatherForYear(2012), 
                        getWeatherForYear(2013),
                        getWeatherForYear(2014),
                        getWeatherForDate('YOW', start_date='2015-01-01',
                                 end_date = '2015-02-11',
                                 opt_detailed = FALSE,
                                 opt_all_columns = TRUE))
    

    Your console window will output something similar to the following:

    R_ottawaweather01

    Once the data has been fetched, we write and save it to a CSV file. Then we can read it back from the CSV and sample 10 records with specific columns:

    write.csv(df_weather, 'ottawa_weatherdata.csv', row.names = TRUE)
    
    weather = read.csv('ottawa_weatherdata.csv')
    weather$Date = as.POSIXct(weather$Date)
    
    weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)
    

    R_ottawaweather02

    Next we need to load the Ottawa JavaScript Meetup data, which I already fetched and stored here in a CSV file (note: only 33 events, and unfortunately the event names do not include the topics and sessions presented on those dates… meh!)

    events = read.csv('ottawajs_events.csv')
    events$date =  as.POSIXct(events$date) 
    
    events %>% sample_n(10) %>% select(event, rsvps, date)
    

    R_ottawaweather03

    With our two datasets (data frames) loaded, we can now manipulate, group, and plot. Let’s get the average attendance and temperature by month:

    # Group average attendance event by month
    by_month = events %>% 
        mutate(month = factor(format(date, "%B"), levels=month.name)) %>%
        group_by(month) %>%
        summarise(events = n(), 
                  count = sum(rsvps)) %>%
        mutate(avg = count / events) %>%
        arrange(desc(avg))
    
    # Group average temperature by month
    averagetemperature_bymonth = weather %>% 
        mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
        group_by(month) %>% 
        summarise(avg_temperature = mean(Mean_TemperatureC))
    
    plot_temperature = ggplot(aes(x = month, y = avg_temperature, group=1), data = averagetemperature_bymonth) + 
        geom_line( ) + ggtitle("Average Temperature by Month")
    
    plot_attendance = ggplot(aes(x = month, y = count, group=1), data = by_month) + 
        geom_bar(stat="identity", fill="grey50") +
        ggtitle("Average Attendance by Month")
    
    grid.arrange(plot_temperature, plot_attendance, ncol = 1)
    

    Examining the plots, we can see a slight inverse relationship between February and June (March being the black sheep!): as the temperature starts increasing, total attendance slightly decreases. And from July to November, as the temperature decreases, attendance slightly increases!
    R_ottawaweather04
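
    To put a rough number on that monthly pattern, you could join the two monthly summaries created above and correlate average attendance with average temperature (a sketch reusing the by_month and averagetemperature_bymonth data frames):

    # Join the monthly summaries and correlate average attendance with temperature
    monthly = merge(by_month, averagetemperature_bymonth, by = "month")
    cor(monthly$avg, monthly$avg_temperature)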

    Let’s go more granular by comparing specific dates. We add (mutate) a new column to our existing events data frame, merge it with our weather data frame, and plot the attendance against the mean temperature for individual days:

    # Group by day 
    by_day = events %>% 
        mutate(day = (as.POSIXct(events$date))) %>%
        group_by(day) %>%
        summarise(events = n(), 
                  count = sum(rsvps)) %>%
        mutate(avg = count / events) %>%
        arrange(desc(avg))
    weather = weather %>% mutate(day = Date)
    merged = merge(weather, by_day, by = 'day')
    
    ggplot(aes(x = count, y = Mean_TemperatureC, group = day), data = merged) + geom_point()
    

    R_ottawaweather05

    Interesting: there doesn’t seem to be any strong correlation between temperature and attendance. We can confirm our doubts by computing the correlation:

    # Any correlation between the temperature and attendance?
    cor(merged$count, merged$Mean_TemperatureC)
    

    R_ottawaweather06

    We get a weak downhill (negative) linear relationship: -0.2958863
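
    With only 33 observations, it is also worth checking whether that weak correlation is statistically meaningful at all; a quick sketch using the same merged data frame:

    # Test the correlation for statistical significance (base R stats package)
    cor.test(merged$count, merged$Mean_TemperatureC)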

    There must be variables other than temperature influencing event attendance per month. My hypothesis was that we would see lower attendance during March break and at the start of the summer school vacation (June, July, August), but in this case July and August are up!

    That, for now, pretty much puts a halt to my Meetup attendance prediction model.
    Still, it was insightful and worthwhile to use R to better understand the data and look for relationships (that’s what it’s all about).

    What I would like to analyze next are data sets related to daily museum, park, and library attendance. If you have any other interesting suggestions, please let me know; we could work on it together.

    GitHub Gist available here -> https://gist.github.com/sfrechette/585e4ca9cdab7eacc92b