Background
This is the second part of the step-by-step guide to running Apache Spark on an HDInsight cluster. In the first part, we saw how to provision an HDInsight Spark cluster with Spark 1.6.3 on Azure. In this post, we will see how to use the IntelliJ IDEA IDE to submit a Spark job.
Pre-requisites
IntelliJ IDEA installed on your machine. I am using the Community edition, but the same works with the Ultimate edition as well.
Azure Toolkit for IntelliJ. This is an IntelliJ plugin which provides features that help to create, test, and deploy Azure applications from within IntelliJ. If you are an Eclipse user, there is a similar plugin named Azure Toolkit for Eclipse, though I have not used it personally. I prefer to work with IntelliJ, as it makes me feel more at home; I come from a C# background, having used Visual Studio for more than a decade and a half.
Steps to submit Spark job using IntelliJ
1. Login to Azure account
If the plugin is installed correctly, you should see an option to sign in to Azure, as shown below.
Select the Azure Sign In option to log in to the Azure account. This will bring up the Azure Sign In dialog.
2. Sign in to Azure in Interactive mode
Use the default Interactive option and click Sign in. I have not yet tested the Automated authentication method.
Provide the details of the Windows Live account which has access to the Azure subscription. If the login is successful, your subscription details will be pulled into the IDE, as shown below.
As I have only one subscription, I can click the Select button to access the different services associated with the subscription.
3. Explore Azure resources (Optional)
We can verify that the different resources available under this subscription can be accessed using the Azure Explorer. Navigate to the Tools menu, select Tool Windows, and then choose Azure Explorer.
Selecting this option brings up the Azure Explorer sidebar. We can access things like Container Registries, Docker hosts, HDInsight clusters, Redis Caches, Storage Accounts, Virtual Machines, and Web Apps.
In the above screenshot we can see the storage account named ngstorageaccount associated with my Azure subscription.
4. Submit Spark job
Let's move on to the most important part, which is submitting the Spark job to the cluster. From the sidebar, navigate to the Projects pane and select the file containing the main method. In fact, any file would do, as we specify the details in the dialog box that appears. Right-click and select Submit Spark Application to HDInsight from the context menu. The option is available right at the bottom of the context menu.
This brings up a dialog box where we can specify the details related to the job. Select the appropriate options from the drop-down lists. I have selected the Spark cluster, the location of the jar file, and the name of the main class as com.nileshgule.MapToDoubleExample. Note that the fully qualified class name needs to be supplied here.
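For context, here is a minimal sketch of what such a main class might look like. The actual MapToDoubleExample source is not shown in this post, so treat this as an illustrative assumption written against the Spark 1.6 Java API rather than the real implementation.

package com.nileshgule;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MapToDoubleExample {
    public static void main(String[] args) {
        // The cluster, jar location, and master are supplied by the
        // submission dialog, so the code only needs an application name.
        SparkConf conf = new SparkConf().setAppName("MapToDoubleExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // mapToDouble converts each element to a primitive double,
        // exposing numeric helpers such as sum() and mean().
        JavaDoubleRDD doubles = numbers.mapToDouble(n -> n * 1.0);

        System.out.println("Sum: " + doubles.sum());
        System.out.println("Mean: " + doubles.mean());

        sc.stop();
    }
}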
We can provide runtime parameters like driverMemory, driverCores, etc. I chose to go with the default values. The dialog also provides options to pass command-line arguments, referenced jars, and referenced files if required.
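These dialog fields correspond to standard Spark configuration properties. The snippet below only illustrates that mapping; the property names are genuine Spark settings, while the class name and values are made up for the example.

import org.apache.spark.SparkConf;

public class SubmissionSettingsExample {
    public static void main(String[] args) {
        // Illustrative only: driver settings take effect when supplied at
        // submission time (which is what the dialog does for us); setting
        // them from inside an already running driver has no effect. This
        // simply maps the dialog fields to the underlying property names.
        SparkConf conf = new SparkConf()
                .set("spark.driver.memory", "4g")    // driverMemory
                .set("spark.driver.cores", "2")      // driverCores
                .set("spark.executor.memory", "4g")  // executorMemory
                .set("spark.executor.cores", "2");   // executorCores
        System.out.println(conf.toDebugString());
    }
}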
Once we submit the job, the Spark Submission pane loads at the bottom of the screen and reports the progress of the job execution. Assuming everything goes fine, you should see output saying that the Spark application completed successfully. The output also contains additional information, with a link to the YARN UI and the detailed job log copied from the cluster to a local directory.
5. Verify job execution
In order to verify that the job was successfully executed on the cluster, we can click the link to the YARN UI. This brings up a login prompt, as shown below.
Note that this uses the SSL port 443. Provide the credentials that were specified for the admin user when the cluster was created in the provisioning step. Hopefully you remember the password that was supplied at the time of creating the cluster. Once the admin user is authenticated, you will be presented with the YARN application container details.
This page gives various details, such as the status reported by the application master, a link to the Spark history server via the tracking URL, and links to logs. You can get more details about the job by navigating to the different links provided on this page.
Conclusion
We saw that it is quite easy to trigger a Spark job directly from the IntelliJ IDE onto the HDInsight cluster. I have demonstrated only a few capabilities of the Azure plugin in this post; feel free to explore the other options. In the next part, we will see how to submit Spark jobs from the head node of the cluster using the command line interface. I hope you found this information useful. Until next time, Happy Programming.