The more you use it, the more great stuff you discover… I wanted to do some analysis with weather data, but did not want to start searching and scraping sites, then merging and cleansing… you know the drill. So I came across this neat R package named weatherData. weatherData provides functions that help in fetching weather data from websites; documentation found here -> https://ram-n.github.io/weatherData/.

So I got inspired to compare Ottawa weather and Ottawa JavaScript Meetup attendance! Even with only 33 meetups (observations) in the past 3+ years, I thought it would be interesting to see if there is any correlation between them.

First we need to load some required libraries (packages) and write a function to get the weather data we need:

library(weatherData)
library(dplyr)
library(ggplot2)
library(gridExtra)

# Ottawa International Airport (YOW) weather data
getWeatherForYear = function(year) {
    getWeatherForDate('YOW', 
                      start_date= paste(sep='', year, '-01-01'),
                      end_date = paste(sep='', year, '-12-31'),
                      opt_detailed = FALSE,
                      opt_all_columns = TRUE)
}

# Execute functions - get data
df_weather = rbind(getWeatherForYear(2012), 
                    getWeatherForYear(2013),
                    getWeatherForYear(2014),
                    getWeatherForDate('YOW', start_date='2015-01-01',
                             end_date = '2015-02-11',
                             opt_detailed = FALSE,
                             opt_all_columns = TRUE))

Your console window will output something similar to the following:

[Screenshot: console output from the weather data fetch]
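
As an aside, the three full-year calls could also be collapsed with lapply; an equivalent sketch:

# Fetch 2012-2014 in one pass (equivalent to the three rbind'ed calls above)
df_weather_full_years = do.call(rbind, lapply(2012:2014, getWeatherForYear))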

Once the data has been fetched, we write and save it to a CSV file. Then we can read it back from the CSV and sample 10 random records with specific columns:

write.csv(df_weather, 'ottawa_weatherdata.csv', row.names = TRUE)

weather = read.csv('ottawa_weatherdata.csv')
weather$Date = as.POSIXct(weather$Date)

weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)

[Screenshot: 10 sampled weather records with date and temperature columns]
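
Before going further, it's worth a quick sanity check on the series; a minimal sketch:

# Quick sanity checks on the weather data
range(weather$Date)                    # date coverage of the series
sum(is.na(weather$Mean_TemperatureC))  # count of missing daily means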

Next we need to load the Ottawa JavaScript Meetup data, which I already fetched and stored here in a CSV file (note: only 33 events, and unfortunately the names of the events do not include the topics and sessions presented on those dates… meh!)

events = read.csv('ottawajs_events.csv')
events$date =  as.POSIXct(events$date) 

events %>% sample_n(10) %>% select(event, rsvps, date)

[Screenshot: 10 sampled meetup event records]

With our two datasets (data frames) loaded we can now manipulate, group and plot. Let’s summarise attendance and average temperature by month:

# Attendance by month: number of events, total RSVPs, and average
by_month = events %>% 
    mutate(month = factor(format(date, "%B"), levels=month.name)) %>%
    group_by(month) %>%
    summarise(events = n(), 
              count = sum(rsvps)) %>%
    mutate(avg = count / events) %>%
    arrange(desc(avg))

# Group average temperature by month
averagetemperature_bymonth = weather %>% 
    mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
    group_by(month) %>% 
    summarise(avg_temperature = mean(Mean_TemperatureC))

plot_temperature = ggplot(aes(x = month, y = avg_temperature, group=1), data = averagetemperature_bymonth) + 
    geom_line( ) + ggtitle("Average Temperature by Month")

plot_attendance = ggplot(aes(x = month, y = count, group=1), data = by_month) + 
    geom_bar(stat="identity", fill="grey50") +
    ggtitle("Total Attendance by Month")

grid.arrange(plot_temperature, plot_attendance, ncol = 1)

Examining the plots, we can see a slight inverse relationship between February and June (March being the outlier!): as the temperature starts increasing, the total attendance slightly decreases. And from July to November, as the temperature decreases, attendance slightly increases!
[Screenshot: average temperature and total attendance by month plots]
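
To put a number on this visual impression, we can join the two monthly summaries and correlate them; a minimal sketch using the data frames defined above (with only 12 monthly points, this is indicative at best):

# Join the monthly summaries and correlate attendance with temperature
monthly = merge(by_month, averagetemperature_bymonth, by = 'month')
cor(monthly$avg, monthly$avg_temperature)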

Let’s go more granular by comparing specific dates. We add (mutate) a new column to our existing events data frame, merge it with our weather data frame, and plot the attendance against the mean temperature for individual days:

# Group by day 
by_day = events %>% 
    mutate(day = date) %>%   # date was already converted to POSIXct above
    group_by(day) %>%
    summarise(events = n(), 
              count = sum(rsvps)) %>%
    mutate(avg = count / events) %>%
    arrange(desc(avg))
weather = weather %>% mutate(day = Date)
merged = merge(weather, by_day, by = 'day')

ggplot(aes(x = count, y = Mean_TemperatureC), data = merged) + geom_point()

[Screenshot: scatter plot of attendance vs. mean temperature]
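
To make any trend easier to see, we could overlay a fitted line; a variation on the plot above:

# Same scatter plot with a linear trend and confidence band
ggplot(aes(x = count, y = Mean_TemperatureC), data = merged) + 
    geom_point() +
    geom_smooth(method = 'lm')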

Interesting: there doesn’t seem to be any strong correlation between the temperature and the attendance. We can confirm this by computing the correlation coefficient:

# Any correlation between the temperature and attendance?
cor(merged$count, merged$Mean_TemperatureC)

[1] -0.2958863

We get a weak downhill (negative) linear relationship: -0.2958863
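
With only 33 observations, it is worth checking whether that coefficient is even distinguishable from zero; a quick test on the same merged data frame:

# Test the correlation for statistical significance
cor.test(merged$count, merged$Mean_TemperatureC)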

There must be variables other than temperature influencing event attendance per month. My hypothesis would be that we would see lower attendance during March break and at the start of the summer school vacation (June, July, August), but in this case July and August are up!
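
One confound worth checking is simply how many meetups fall in each month, since a single well-attended (or sparse) event can skew a monthly average; the by_month data frame already has what we need:

# Number of events and average attendance per calendar month
by_month %>% select(month, events, avg) %>% arrange(month)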

That, for now, pretty much puts a halt to my Meetup attendance prediction model.
Still, it was insightful and worthwhile to use R to better understand the data and look for relationships (that’s what it’s all about!).

What I would like to analyze next are datasets related to daily museum, park, and library attendance. If you have any other interesting suggestions, please let me know; we could work on it together.

GitHub Gist available here -> https://gist.github.com/sfrechette/585e4ca9cdab7eacc92b

Presented @ Ottawa SQL Server User Group (Ottawa PASS Chapter) on Thursday March 19, 2015

Great gathering at Pythian HQ in Ottawa (thank you Vanessa!). Lots of questions and interactions; I used the following material from an earlier blog post for the demo of R and SQL Server: connectivity and analysis.

[Photo: the R and SQL Server session at Pythian HQ]

Slides – Data Analytics with R and SQL Server

Data Science, Statistical Analysis and R… Learn what those mean, how they can help you find answers to your questions and complement the existing toolsets and processes you are currently using to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames and how to connect R and SQL Server.
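
For the connectivity part of the demo, the RODBC package is one common route; a minimal sketch, assuming a local SQL Server instance (the connection string values and the Person.Person table are placeholders):

library(RODBC)

# Connect to a local SQL Server instance (placeholder server/database names)
conn = odbcDriverConnect(
    'driver={SQL Server};server=localhost;database=AdventureWorks;trusted_connection=true')

# Pull a result set straight into a data frame
df = sqlQuery(conn, 'SELECT TOP 10 * FROM Person.Person')

odbcClose(conn)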

I recently completed all 9 courses from the Johns Hopkins University Data Science Specialization on Coursera. Now all that’s left is to join and complete the Capstone project… and that should happen soon, the next time it is offered.

The program (specialization) focuses heavily on R and the use of practical real-world datasets.
Below are my Verified Course Records (Grades) for each of the courses. They also outline the course description, syllabus and grading policy.

  • The Data Scientist’s Toolbox by Johns Hopkins University on Coursera. Certificate earned on June 1, 2014
  • R Programming by Johns Hopkins University on Coursera. Certificate earned on June 30, 2014
  • Getting and Cleaning Data by Johns Hopkins University on Coursera. Certificate earned on August 3, 2014
  • Exploratory Data Analysis by Johns Hopkins University on Coursera. Certificate earned on September 28, 2014
  • Reproducible Research by Johns Hopkins University on Coursera. Certificate earned on November 2, 2014
  • Regression Models by Johns Hopkins University on Coursera. Certificate earned on December 27, 2014
  • Statistical Inference by Johns Hopkins University on Coursera. Certificate earned on February 1, 2015
  • Practical Machine Learning by Johns Hopkins University on Coursera. Certificate earned on February 1, 2015
  • Developing Data Products by Johns Hopkins University on Coursera. Certificate earned on March 1, 2015

Overall an excellent program. It won’t make you an overnight data scientist, but it will give you more than a solid foundation to build upon. I highly recommend it.

Presented @ Ottawa SQL Server User Group (Ottawa PASS Chapter) on Thursday February 19, 2015

Self-Service Data Integration with Power Query

Discover, Load, Transform and Mashup. Microsoft Power Query for Excel includes a powerful query engine and a formula language that enables self-service data integration and shaping over a diverse set of data sources. Power Query makes it possible for analysts to do basic ETL by themselves without much help from the IT department. Most common tasks can be accomplished within an intuitive user interface, but a powerful language called “M” can also be used to do some pretty sophisticated data preparation work. Come learn how to tackle your data and data-shaping needs.

Updated with the Power BI Designer (currently in preview) @ http://powerbi.com