1. Parameter definitions
spark.yarn.archive and spark.yarn.jars are parameters for Spark on YARN mode, defined in org.apache.spark.deploy.yarn.config. The source looks like this:
Definitions in org.apache.spark.deploy.yarn.config:
private[spark] val SPARK_ARCHIVE = ConfigBuilder("spark.yarn.archive")
.doc("Location of archive containing jars files with Spark classes.")
.version("2.0.0")
.stringConf
.createOptional
private[spark] val SPARK_JARS = ConfigBuilder("spark.yarn.jars")
.doc("Location of jars containing Spark classes.")
.version("2.0.0")
.stringConf
.toSequence
.createOptional
As shown above, both parameters were introduced in Spark 2.0.0.
2. What the parameters do
See the official documentation: https://spark.apache.org/docs/3.5.1/running-on-yarn.html#spark-properties
It describes them as follows:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
By setting spark.yarn.archive or spark.yarn.jars, we can place the jars Spark depends on at runtime somewhere YARN can fetch them directly, such as HDFS.
If neither parameter is set, Spark packages everything under $SPARK_HOME/jars into a zip file and uploads it to the distributed cache.
In other words, every submission then involves an upload of roughly 300 MB; if the network between the submitting client and the YARN cluster is slow, this noticeably lengthens the submission process.
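As a practical illustration, you can build the archive once, upload it to HDFS, and point spark.yarn.archive at it, so the roughly 300 MB upload no longer happens on every submission. This is only a sketch: the /tmp path, the HDFS destination, and the application class/jar names below are all hypothetical placeholders.

```shell
# One-off: zip the runtime jars (assumes SPARK_HOME is set;
# /tmp/spark-jars.zip is an example path, not a convention).
cd "$SPARK_HOME/jars"
zip -q -r /tmp/spark-jars.zip .

# Put the archive where YARN nodes can fetch it via the distributed cache
# (hdfs:///spark/archive is a made-up location; pick your own).
hdfs dfs -mkdir -p /spark/archive
hdfs dfs -put -f /tmp/spark-jars.zip /spark/archive/

# Reference it at submission time; Spark then skips the per-job jar upload.
# com.example.MyApp and my-app.jar are placeholders for your application.
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///spark/archive/spark-jars.zip \
  --class com.example.MyApp \
  my-app.jar
```

Setting spark.yarn.archive in spark-defaults.conf instead has the same effect for every job submitted from that client.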
3. How the parameters are used in the source
Both parameters take effect in org.apache.spark.deploy.yarn.YarnClusterApplication.
org.apache.spark.deploy.yarn.YarnClusterApplication#start  // entry point for Spark on YARN
  new org.apache.spark.deploy.yarn.Client()
    yarnClient = YarnClient.createYarnClient
    hadoopConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
    launcherBackend = new LauncherBackend(){}
  org.apache.spark.deploy.yarn.Client#run
    org.apache.spark.deploy.yarn.Client#submitApplication
      launcherBackend.connect()
      yarnClient.init(hadoopConf)
      yarnClient.start()
      newApp = yarnClient.createApplication()
      containerContext = createContainerLaunchContext()
        org.apache.spark.deploy.yarn.Client#setupLaunchEnv
          org.apache.spark.deploy.yarn.Client#populateClasspath
            // if spark.yarn.archive is unset and spark.yarn.jars is set, every jar
            // that spark.yarn.jars points to is added to the container process's classpath
        org.apache.spark.deploy.yarn.Client#prepareLocalResources
          // if spark.yarn.archive is set, the archive it points to is downloaded to the container
          // if spark.yarn.archive is unset and spark.yarn.jars is set, those jars are downloaded to the container
      appContext = createApplicationSubmissionContext(newApp, containerContext)
      yarnClient.submitApplication(appContext)
As you can see, spark.yarn.archive takes precedence over spark.yarn.jars.
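If you prefer distributing individual jars rather than a single archive, spark.yarn.jars accepts a comma-separated list and, per the official docs, allows globs. Because of the precedence above, spark.yarn.archive must not also be set, or the jars list is ignored for local-resource distribution. Again a sketch with hypothetical paths and placeholder application names:

```shell
# One-off: upload the individual jars to HDFS
# (hdfs:///spark/jars is an illustrative location).
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put -f "$SPARK_HOME"/jars/*.jar /spark/jars/

# Reference them with a glob; leave spark.yarn.archive unset so this applies.
# com.example.MyApp and my-app.jar are placeholders for your application.
spark-submit \
  --master yarn \
  --conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
  --class com.example.MyApp \
  my-app.jar
```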