Graphs are everywhere! In my efforts to spread the word about Neo4j among SQL Server professionals, and to learn how to take a relational database schema and model it as a graph, AdventureWorks was a no-brainer! Let’s explore how to import a subset of the AdventureWorks database from SQL Server (RDBMS) into Neo4j (GraphDB).

But first, some prerequisites. You should at least have a basic understanding of what a property graph model is and have reviewed the graph data modeling guidelines. Download and install Neo4j so you can follow along with the examples.

The AdventureWorks database is a fictitious company database that has existed since SQL Server 2005 as a means of showcasing new functionality in each released version. AdventureWorks has become a staple for learning new features and constructs within SQL Server.

[Figure: AdventureWorks 2008 database diagram]

For this demonstration I chose a subset of tables and selected only offline sales transactions (OnlineOrderFlag = 0).
As a result, the following CSV files will be used to import data into our graph structure:

  • customers.csv
  • employees.csv
  • orders.csv
  • productcategories.csv
  • products.csv
  • productsubcategories.csv
  • vendorproduct.csv
  • vendors.csv

You can find all the necessary code and files here -> https://github.com/sfrechette/adventureworks-neo4j

If you still need to download a copy of the AdventureWorks sample database (I used the SQL Server 2014 version), you can find it here -> https://msftdbprodsamples.codeplex.com/

    Developing a Graph Model
    When deriving and defining a graph model from a relational model, you should keep the following base guidelines in mind:

  • A row is a node
  • A table name is a label name

For further understanding, read GraphDB vs. RDBMS. That article explores the differences between relational and graph databases and their data models; it also explains how to integrate graph databases with relational databases and how to import data from a relational store.
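Here’s a minimal sketch, using hypothetical property values, of how one row and one foreign key from the relational schema might translate under these guidelines:

// A row from the Product table becomes a node whose label is the table name
CREATE (:Product {productId: '680', productName: 'HL Road Frame'});

// A foreign key (Product.SubCategoryID) becomes a relationship
MATCH (p:Product {productId: '680'})
MATCH (s:SubCategory {subCategoryId: '14'})
MERGE (p)-[:PART_OF_SUBCAT]->(s);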

    With this dataset, this simple AdventureWorks graph data model serves as a foundation:

[Figure: AdventureWorks graph data model]

The key difference between a graph and a relational database is that relational databases work with sets, while graph databases work with paths. Relationships are first-class entities in a graph database, and they are semantically far stronger than the implied relationships reified at runtime in a relational store.
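To make that concrete, here is a sketch of a question that would take a chain of five or six joins in SQL but is a single path pattern in Cypher, written against the model above:

// Which product categories has each customer purchased from?
MATCH (c:Customer)-[:PURCHASED]->(:Order)-[:PRODUCT]->(p:Product),
      (p)-[:PART_OF_SUBCAT]->(:SubCategory)-[:PART_OF_CAT]->(cat:Category)
RETURN c.fullName as Customer, collect(distinct cat.categoryName) as Categories;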

    Importing the Data using Cypher

Now that we have extracted data from the AdventureWorks database, we will use Cypher’s LOAD CSV command to transform and load the content of these CSV files into a graph structure.

First we create specific indexes and constraints:

    // Create indexes for faster lookup
    CREATE INDEX ON :Category(categoryName);
    CREATE INDEX ON :SubCategory(subCategoryName);
    CREATE INDEX ON :Vendor(vendorName);
    CREATE INDEX ON :Product(productName);
    
    // Create constraints
    CREATE CONSTRAINT ON (o:Order) ASSERT o.orderId IS UNIQUE;
    CREATE CONSTRAINT ON (p:Product) ASSERT p.productId IS UNIQUE;
    CREATE CONSTRAINT ON (c:Category) ASSERT c.categoryId IS UNIQUE;
    CREATE CONSTRAINT ON (s:SubCategory) ASSERT s.subCategoryId IS UNIQUE;
    CREATE CONSTRAINT ON (e:Employee) ASSERT e.employeeId IS UNIQUE;
    CREATE CONSTRAINT ON (v:Vendor) ASSERT v.vendorId IS UNIQUE;
    CREATE CONSTRAINT ON (c:Customer) ASSERT c.customerId IS UNIQUE;
    

Wait for the schema changes to come online, then create the nodes.
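In the neo4j-shell this is a one-liner; the schema await command blocks until all indexes and constraints are online:

schema await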

    // Create products
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/products.csv" AS row
    CREATE (:Product {productName: row.ProductName, productNumber: row.ProductNumber, productId: row.ProductID, modelName: row.ProductModelName, standardCost: row.StandardCost, listPrice: row.ListPrice});
    
    // Create vendors
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/vendors.csv" AS row
    CREATE (:Vendor {vendorName: row.VendorName, vendorNumber: row.AccountNumber, vendorId: row.VendorID, creditRating: row.CreditRating, activeFlag: row.ActiveFlag});
    
    // Create employees
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/employees.csv" AS row
CREATE (:Employee {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, employeeId: row.EmployeeID, jobTitle: row.JobTitle, organizationLevel: row.OrganizationLevel, maritalStatus: row.MaritalStatus, gender: row.Gender, territory: row.Territory, country: row.Country, group: row.Group});
    
    // Create customers
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/customers.csv" AS row
    CREATE (:Customer {firstName: row.FirstName, lastName: row.LastName, fullName: row.FullName, customerId: row.CustomerID});
    
    // Create categories
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/productcategories.csv" AS row
    CREATE (:Category {categoryName: row.CategoryName, categoryId: row.CategoryID});
    
    // Create sub-categories
    USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" AS row
    CREATE (:SubCategory {subCategoryName: row.SubCategoryName, subCategoryId: row.SubCategoryID});
    
// Create orders (use MERGE: orders.csv has one row per line item, so the same SalesOrderID appears on multiple rows)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
MERGE (order:Order {orderId: row.SalesOrderID}) ON CREATE SET order.orderDate = row.OrderDate;
    

Now create the relationships:

    // Create relationships: Order to Product
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (product:Product {productId: row.ProductID})
    MERGE (order)-[pu:PRODUCT]->(product)
    ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.OrderQty);
    
    // Create relationships: Order to Employee
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MERGE (employee)-[:SOLD]->(order);
    
    // Create relationships: Order to Customer
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/orders.csv" AS row
    MATCH (order:Order {orderId: row.SalesOrderID})
    MATCH (customer:Customer {customerId: row.CustomerID})
    MERGE (customer)-[:PURCHASED]->(order);
    
    // Create relationships: Product to Vendor
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/vendorproduct.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (vendor:Vendor {vendorId: row.VendorID})
    MERGE (vendor)-[:SUPPLIES]->(product);
    
    // Create relationships: Product to SubCategory
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/products.csv" AS row
    MATCH (product:Product {productId: row.ProductID})
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MERGE (product)-[:PART_OF_SUBCAT]->(subcategory);
    
    // Create relationships: SubCategory to Category
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/productsubcategories.csv" AS row
    MATCH (subcategory:SubCategory {subCategoryId: row.SubCategoryID})
    MATCH (category:Category {categoryId: row.CategoryID})
    MERGE (subcategory)-[:PART_OF_CAT]->(category);
    
    // Create relationship for employee reporting structure
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "file:data/employees.csv" AS row
    MATCH (employee:Employee {employeeId: row.EmployeeID})
    MATCH (manager:Employee {employeeId: row.ManagerID})
    MERGE (employee)-[:REPORTS_TO]->(manager);
    

You can also import and run the entire Cypher script using the neo4j-shell:

bin/neo4j-shell -path adventure.db -file cypher/import.cypher

Once the import completes, you will be able to query the graph and do things like this:

[Figure: sample Cypher query and results]
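A quick sanity check worth running first (a generic query, not specific to this dataset) is to count nodes by label and confirm the totals match the source CSV files:

// Count nodes by label to verify the import
MATCH (n)
RETURN labels(n) as label, count(*) as count
ORDER BY count DESC;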

    Querying the Graph with Cypher

Let’s issue some Cypher queries against our newly created graph database:

    // List the product subcategories and categories provided by each supplier.
    MATCH (v:Vendor)-->(:Product)-->(s:SubCategory)-->(c:Category)
    RETURN v.vendorName as Vendor, collect(distinct s.subCategoryName) as SubCategories, collect(distinct c.categoryName) as Categories;
    

[Figure: query results]

// Which employee had the highest cross-selling count of 'AWC Logo Cap', and with which other products?
MATCH (p:Product {productName:'AWC Logo Cap'})<-[:PRODUCT]-(:Order)<-[:SOLD]-(employee),
          (employee)-[:SOLD]->(o2)-[:PRODUCT]->(other:Product)
    RETURN employee.fullName as Employee, other.productName as Product, count(distinct o2) as Count
    ORDER BY Count DESC
    LIMIT 5;
    

[Figure: query results]

    // What is the total quantity sold of a particular product?
    MATCH (o:Order)-[r:PRODUCT]->(p:Product {productName: "Long-Sleeve Logo Jersey, L"})
    RETURN sum(r.quantity) as TotalQuantitySold;
    

[Figure: query results]
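The REPORTS_TO relationships we created are also worth exploring. For example, here is a sketch that ranks managers by their number of direct reports:

// Managers ranked by number of direct reports
MATCH (e:Employee)-[:REPORTS_TO]->(m:Employee)
RETURN m.fullName as Manager, count(e) as DirectReports
ORDER BY DirectReports DESC;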

Explore, learn Cypher, and write your own queries. Enjoy!

The more you use it, the more great stuff you discover… I wanted to do some analysis with weather data, but did not want to start searching and scraping sites, then merging and cleansing… you know the drill. So I came across this neat R package named weatherData, which provides functions that help in fetching weather data from websites. Documentation can be found here -> https://ram-n.github.io/weatherData/.

So I got inspired to compare Ottawa weather with Ottawa JavaScript Meetup attendance! Even with only 33 meetups (observations) over the past 3+ years, I thought it would be interesting to see whether there is any correlation between them. Ottawa weather vs. Ottawa JavaScript Meetup attendance: an R analysis.

    First we need to load some required libraries (packages) and write a function to get the weather data we need:

    library(weatherData)
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
    
    # Ottawa International Airport (YOW) weather data
    getWeatherForYear = function(year) {
        getWeatherForDate('YOW', 
                          start_date= paste(sep='', year, '-01-01'),
                          end_date = paste(sep='', year, '-12-31'),
                          opt_detailed = FALSE,
                          opt_all_columns = TRUE)
    }
    
    # Execute functions - get data
    df_weather = rbind(getWeatherForYear(2012), 
                        getWeatherForYear(2013),
                        getWeatherForYear(2014),
                        getWeatherForDate('YOW', start_date='2015-01-01',
                                 end_date = '2015-02-11',
                                 opt_detailed = FALSE,
                                 opt_all_columns = TRUE))
    

Your console window will output something similar to the following:

[Figure: console output]

Once the data has been fetched, we need to save it to a CSV file. Then we can read it back from CSV and sample 10 records with specific columns:

    write.csv(df_weather, 'ottawa_weatherdata.csv', row.names = TRUE)
    
    weather = read.csv('ottawa_weatherdata.csv')
    weather$Date = as.POSIXct(weather$Date)
    
    weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)
    

[Figure: sampled weather records]

Next we need to load the Ottawa JavaScript Meetup data, which I already fetched and stored here in a CSV file (note: only 33 events, and unfortunately the event names do not include the topics and sessions presented on those dates… meh!)

    events = read.csv('ottawajs_events.csv')
    events$date =  as.POSIXct(events$date) 
    
    events %>% sample_n(10) %>% select(event, rsvps, date)
    

[Figure: sampled event records]

With our two datasets (data frames) loaded, we can now manipulate, group, and plot. Let’s get the average attendance and temperature by month:

# Average event attendance by month
    by_month = events %>% 
        mutate(month = factor(format(date, "%B"), levels=month.name)) %>%
        group_by(month) %>%
        summarise(events = n(), 
                  count = sum(rsvps)) %>%
        mutate(avg = count / events) %>%
        arrange(desc(avg))
    
    # Group average temperature by month
    averagetemperature_bymonth = weather %>% 
        mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
        group_by(month) %>% 
        summarise(avg_temperature = mean(Mean_TemperatureC))
    
plot_temperature = ggplot(aes(x = month, y = avg_temperature, group=1), data = averagetemperature_bymonth) + 
    geom_line() + ggtitle("Average Temperature by Month")
    
plot_attendance = ggplot(aes(x = month, y = count, group=1), data = by_month) + 
    geom_bar(stat="identity", fill="grey50") +
    ggtitle("Total Attendance by Month")
    
    grid.arrange(plot_temperature, plot_attendance, ncol = 1)
    

Examining the plots, we can see a slight inverse relationship between February and June (March being the black sheep!): as the temperature increases, total attendance slightly decreases. And from July to November, as the temperature decreases, attendance slightly increases!
[Figure: average temperature and total attendance by month]

Let’s go more granular and compare specific dates. We need to add (mutate) a new column to our existing events data frame, merge it with our weather data frame, and plot attendance against the mean temperature for individual days:

    # Group by day 
    by_day = events %>% 
    mutate(day = date) %>%  # date is already POSIXct; just alias it
        group_by(day) %>%
        summarise(events = n(), 
                  count = sum(rsvps)) %>%
        mutate(avg = count / events) %>%
        arrange(desc(avg))
    weather = weather %>% mutate(day = Date)
    merged = merge(weather, by_day, by = 'day')
    
    ggplot(aes(x = count, y = Mean_TemperatureC, group = day), data = merged) + geom_point()
    

[Figure: attendance vs. mean temperature scatter plot]

Interesting: there doesn’t seem to be any strong correlation between temperature and attendance. We can confirm our suspicion by computing the correlation coefficient:

    # Any correlation between the temperature and attendance?
    cor(merged$count, merged$Mean_TemperatureC)
    

[Figure: correlation output]

    We get a weak downhill (negative) linear relationship: -0.2958863
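With only 33 observations, it is worth checking whether even that weak relationship is statistically distinguishable from zero. Base R’s cor.test reports the coefficient along with a p-value and confidence interval:

# Test whether the correlation differs significantly from zero
cor.test(merged$count, merged$Mean_TemperatureC)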

There must be variables other than temperature influencing event attendance by month. My hypothesis was that we would see lower attendance during March break and at the start of the summer school vacation (June, July, August), but in this case July and August are up!

That, for now, pretty much puts a halt to my Meetup attendance prediction model.
Still, it was insightful and worthwhile to use R to better understand the data and look for relationships (that’s what it’s all about).

What I would like to analyze next are datasets related to daily museum, park, and library attendance. If you have any other interesting suggestions please let me know; we could work on it together.

    GitHub Gist available here -> https://gist.github.com/sfrechette/585e4ca9cdab7eacc92b

Presented @ Ottawa SQL Server User Group (Ottawa PASS Chapter) on Thursday, March 19, 2015

Great gathering at Pythian HQ in Ottawa (thank you, Vanessa!). Lots of questions and interaction; I used material from an earlier blog post for the demo of R and SQL Server: connectivity and analysis.

[Photo: R and SQL Server session at Pythian HQ]

    Slides – Data Analytics with R and SQL Server

Data Science, Statistical Analysis and R… Learn what these mean, how they can help you find answers to your questions, and how they complement the existing toolsets and processes you currently use to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames, and how to connect R and SQL Server.
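As a small taste of the R-to-SQL Server connectivity covered in the talk, here is a minimal sketch using the RODBC package; the server and database names are hypothetical placeholders:

library(RODBC)

# Open a connection to a SQL Server instance (hypothetical server/database)
ch <- odbcDriverConnect('driver={SQL Server};server=localhost;database=AdventureWorks2014;trusted_connection=true')

# Pull a result set into a data frame, then close the connection
df <- sqlQuery(ch, 'SELECT TOP 10 SalesOrderID, OrderDate FROM Sales.SalesOrderHeader')
odbcClose(ch)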

I recently completed all 9 courses of the Johns Hopkins University Data Science Specialization on Coursera. Now what’s left is to join and complete the Capstone project… and that should happen soon, the next time it is offered.

The program (specialization) focuses heavily on R and the use of practical, real-world datasets.
Below are my Verified Course Records (grades) for each of the courses; they also outline the course description, syllabus, and grading policy.

  • The Data Scientist’s Toolbox by Johns Hopkins University on Coursera. Certificate earned on June 1, 2014
  • R Programming by Johns Hopkins University on Coursera. Certificate earned on June 30, 2014
  • Getting and Cleaning Data by Johns Hopkins University on Coursera. Certificate earned on August 3, 2014
  • Exploratory Data Analysis by Johns Hopkins University on Coursera. Certificate earned on September 28, 2014
  • Reproducible Research by Johns Hopkins University on Coursera. Certificate earned on November 2, 2014
  • Regression Models by Johns Hopkins University on Coursera. Certificate earned on December 27, 2014
  • Statistical Inference by Johns Hopkins University on Coursera. Certificate earned on February 1, 2015
  • Practical Machine Learning by Johns Hopkins University on Coursera. Certificate earned on February 1, 2015
  • Developing Data Products by Johns Hopkins University on Coursera. Certificate earned on March 1, 2015

Overall an excellent program. It won’t make you an overnight data scientist, but it will give you more than a solid foundation to build upon. I highly recommend it.