In-memory distributed processing for large datasets… How do you connect to SQL Server using Apache Spark? While the Spark documentation covers the basics of the API and DataFrames, there is a lack of information and examples on how to actually get this feature working.


First, what is Apache Spark? Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It is a fast, general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.

Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.

Besides the Spark Core API, there are libraries that provide additional capabilities: Spark Streaming, Spark SQL (SQL and DataFrames), Spark MLlib (machine learning), and Spark GraphX (graph processing).

This example uses Spark SQL, which provides the ability to expose Spark datasets over the JDBC API and to run SQL-like queries on Spark data using BI and visualization tools. Spark SQL allows users to access their data in whatever format it currently lives in (e.g. JSON, Parquet, database tables), transform it, and expose it for ad-hoc querying.

For the purpose of this demonstration, I am connecting remotely to an instance of SQL Server 2016 with the Wide World Importers sample database installed, which you can find and download here ->

Most importantly, you will need to download and install the latest version of the Microsoft JDBC Driver for SQL Server, located here -> Since I am connecting from my Mac, I have installed the driver (4.2) to the following location: /opt/sqljdbc_4.2/enu/sqljdbc42.jar

Open up a Terminal session and issue the following command to start the Spark shell with the Microsoft JDBC Driver on the classpath.
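A command along these lines should work (the exact flags can vary by Spark version; the jar path below matches the install location mentioned above):

```shell
spark-shell --driver-class-path /opt/sqljdbc_4.2/enu/sqljdbc42.jar --jars /opt/sqljdbc_4.2/enu/sqljdbc42.jar
```

Here `--jars` ships the driver jar to the executors, while `--driver-class-path` makes it visible to the driver's own JVM.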


The following Scala code snippet demonstrates the Spark SQL commands you can run in the Spark shell console.
Replace the placeholder with your SQL Server name or IP address.
I recommend copying and pasting one block of commands at a time to see and understand what is happening…
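As a sketch, a session might look like the following (the server address, credentials, and the choice of the Sales.InvoiceLines table are placeholders for this walkthrough; adjust them for your environment):

```scala
// Placeholder connection details -- replace with your SQL Server name/IP and credentials
val url = "jdbc:sqlserver://<your-server>:1433;databaseName=WideWorldImporters"

// Load a SQL Server table into a Spark DataFrame over JDBC
val df = sqlContext.read.format("jdbc").
  option("url", url).
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("user", "<your-user>").
  option("password", "<your-password>").
  option("dbtable", "Sales.InvoiceLines").
  load()

// Inspect the inferred schema and a few rows
df.printSchema()
df.show(5)

// Register the DataFrame as a temporary table and query it with SQL
df.registerTempTable("invoiceLines")
sqlContext.sql("SELECT COUNT(*) FROM invoiceLines").show()
```

Once the DataFrame is registered, any SQL you run against `invoiceLines` executes in Spark, not in SQL Server.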

Sample Terminal output…

In this post, we took a quick look at how Apache Spark SQL provides a SQL interface to SQL Server data and lets you interact with Spark data using familiar SQL query syntax.

  • dharani sugumar

    very useful

  • Luis Sant Ana

    What is the performance for this? Does the query scale?

    • stephanefrechette

      Hi Luis, not sure I understand the context of your Q? But indeed once the data is loaded in the DataFrame, the operations (queries, transformations…) are being processed in memory… fast!

      • Luis Sant Ana

        My doubt is about the connection. The data are in the database, but they need to be in Spark's memory. Do all the workers create a connection and download the data in parallel? Or does only the driver download the data and then share the tables among the workers? Because if the tables are very big, the transfer will be very slow.

        • Srikanth Vemulakonda

          Yes, I have a similar doubt to Luis.
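For what it's worth, Spark's JDBC data source can parallelize the read itself: given a numeric partition column and bounds, each executor opens its own connection and pulls one slice of the table, rather than the driver downloading everything through a single connection. A sketch using the standard partitioning options (the URL, table, and column names here are illustrative placeholders):

```scala
// Partitioned JDBC read: Spark issues numPartitions parallel queries,
// each bounded by a range of the partition column, so the workers
// download slices of the table concurrently.
val df = sqlContext.read.format("jdbc").
  option("url", "jdbc:sqlserver://<your-server>:1433;databaseName=WideWorldImporters").
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("dbtable", "Sales.InvoiceLines").
  option("partitionColumn", "InvoiceLineID").  // must be a numeric column
  option("lowerBound", "1").
  option("upperBound", "1000000").
  option("numPartitions", "8").
  load()
```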

  • Joy George

    Is there any way I can update the data back to SQL Server after Spark processing?
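For reference, a DataFrame can also be written back over JDBC via its DataFrameWriter; a sketch (the URL, target table name, and connection properties are placeholders):

```scala
// Write a processed DataFrame back to SQL Server over JDBC.
val props = new java.util.Properties()
props.setProperty("user", "<your-user>")
props.setProperty("password", "<your-password>")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// mode("append") adds rows; mode("overwrite") replaces the table's contents
df.write.mode("append").jdbc(
  "jdbc:sqlserver://<your-server>:1433;databaseName=WideWorldImporters",
  "dbo.ProcessedResults",
  props)
```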

  • disqus_Q5pIljsCde

    I understood that you are directly loading the data from "Sales.InvoiceLines" into a data frame, but what if I want to load custom data into the DataFrame? I.e. suppose the data is obtained from a JOIN operation on 3-4 tables…

    • stephanefrechette

      Yes, absolutely; it could be an inline SQL statement "query" (with joins) or even a view (containing joins)…

      The dbtable parameter can be any query wrapped in parentheses with an alias.

      E.g., a MySQL example…

      val query = """
      (select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
      join DialogLine as dl on dl.DialogID=d.DialogID
      join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
      join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
      join WordRoot as wr on wr.WordRootID=wi.WordRootID
      where d.InSite=1 and dl.Active=1
      limit 100) foo
      """

      val df = sqlContext.read.format("jdbc").
      option("url", "jdbc:mysql://localhost:3306/local_content").
      option("driver", "com.mysql.jdbc.Driver").
      option("useUnicode", "true").
      option("useSSL", "false").
      option("user", "root").
      option("password", "").
      option("dbtable", query).
      load()

  • Rohit Garg

    Hi Stephane… I want the same functionality with bulk copy mode… how can we do that?

    • stephanefrechette

      Hi Rohit, per the documentation referenced above, you can use the batchsize parameter to perform inserts in batches:

      batchsize -> The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.


      .option("batchsize", 5000)
      .jdbc(url, "mytablename", new java.util.Properties())