In-memory distributed processing for large datasets… How to connect to SQL Server using Apache Spark? The Spark documentation covers the basics of the API and Dataframes, there is a lack of info. and examples on actually how to get this feature to work.

spark-logosql-server

First, what is Apache Spark? Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. A fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.

Besides the Spark Core API there are libraries that provide additional capabilities: Spark Streaming, Spark SQL (SQL and Dataframes), Spark MLlib (Machine Learning), Spark GraphX (Graph)

This example will be using Spark SQL which provides the capability to expose the Spark datasets over JDBC API and allow running the SQL like queries on Spark data using BI and visualization tools. Spark SQL allows the users to access their data from different formats it’s currently in (i.e. JSON, Parquet, Database tables), transform it, and expose it for ad-hoc querying.

For the purpose of this demonstration I am connecting remotely to an instance of SQL Server 2016 and have installed the Wide World Importers sample database which you can find and download here -> https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0

Most importantly you will need to download and install the latest version of the Microsoft JDBC Drivers for SQL Server located here -> https://www.microsoft.com/en-ca/download/details.aspx?id=11774. Since I am connecting from my Mac I have installed the drivers (4.2) to the following location: /opt/sqljdbc_4.2/enu/sqljdbc42.jar

Open up a Terminal session and issue the following command to start the Spark shell with the Microsoft JDBC Driver

2-0-0-java-140x28

The following Scala code snippet demonstrates the Spark SQL commands you can run on the Spark Shell console.
Replace the xxx.xxx.xxx.xxx with your SQL Server Name or IP Address.
I recommend you copy and paste a block of commands at a time to see and understand what is happening…

Sample Terminal output…
2-0-0-java-173x43

In this post, we quickly looked at how Apache Spark SQL works to provide an SQL interface to SQL Server and interact with Spark data using the familiar SQL query syntax.

Tagged with →  
  • dharani sugumar

    very useful

  • Luis Sant Ana

    What is the performance for this? Does the query scale?

    • stephanefrechette

      Hi Luis, not sure I understand the context of your Q? But indeed once the data is loaded in the DataFrame, the operations (queries, transformations…) are being processed in memory… fast!

      • Luis Sant Ana

        My doubt is in connection. The data are in database, they need are in the spark memory. Do all works create a connection and download the data in parallel? Or just the driver download the data and after shared the tables among the works. Because if the tables are very big, the transfer will be very slow.