Error when submitting a spark-sql job via SparkLauncher


Published 2022-05-10

When a spark-sql job is submitted via SparkLauncher, a SELECT against a table fails if another job is writing to that table at the same time.

1. Problem description

When a spark-sql job is submitted via SparkLauncher, the SELECT on the table fails with a "File does not exist" error.
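For context, the job is launched roughly as in the minimal sketch below. The Spark home, jar path, and main-class package are placeholders for this sketch; only the table name is taken from the error message that follows.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitSparkSqlJob {
    public static void main(String[] args) throws Exception {
        // Launch the spark-sql client as a child process; in this sketch
        // the SQL text is passed to the client as an application argument.
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                       // placeholder
                .setAppResource("/path/to/spark-sql-client.jar")  // placeholder
                .setMainClass("com.example.SparkSqlClient")       // package is assumed
                .setMaster("yarn")
                .setDeployMode("cluster")
                .addAppArgs("select count(*) from abcd.ods_wechat_merchants_do")
                .startApplication();

        // Poll until the launched application reaches a terminal state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}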

The error message is as follows:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://HDFS12345/usr/hive/warehouse/abcd.db/ods_wechat_merchants_do/part-00000-8ECD4DEA-0103-4598-92A7-36E6E5E483BA-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)

……

 ShuffleMapStage 20 (count at SparkSqlClient.java:36) failed in 0.389 s due to Job aborted due to stage failure: Task 1 in stage 16.0 failed 4 times, most recent failure: Lost task 1.3 in stage 16.0 (TID 1937, 10.253.1.14, executor 1): java.io.FileNotFoundException: File does not exist: hdfs://HDFS12345/usr/hive/warehouse/abcd.db/ods_wechat_merchants_do/part-00000-8ECD4DEA-0103-4598-92A7-36E6E5E483BA-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

2. Problem analysis

Cause: the concurrent write job rewrites the table's data files and updates its metadata, while Spark SQL's cached metadata still points at the old files; the SELECT therefore tries to read files that no longer exist.

The official Spark documentation explains:

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

In other words, Spark SQL caches table metadata for performance. When an external system writes to the same table, the cache becomes stale and inconsistent with what is actually on HDFS, so the Spark SQL job needs to call refreshTable (or run REFRESH TABLE, as the error message itself suggests) before querying.
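A minimal sketch of the fix, assuming the client reads the table through a SparkSession with Hive support. The table name is taken from the error above; the count query merely stands in for whatever SparkSqlClient.java actually runs.

import org.apache.spark.sql.SparkSession;

public class RefreshThenSelect {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("refresh-table-demo")
                .enableHiveSupport()
                .getOrCreate();

        // Invalidate Spark's cached metadata (file listing and Parquet
        // schema) for the table, so the next scan re-lists the files on HDFS.
        spark.catalog().refreshTable("abcd.ods_wechat_merchants_do");
        // Equivalent SQL form:
        // spark.sql("REFRESH TABLE abcd.ods_wechat_merchants_do");

        long rows = spark.sql("select count(*) from abcd.ods_wechat_merchants_do")
                .first().getLong(0);
        System.out.println("rows: " + rows);

        spark.stop();
    }
}

Note that the refresh only snapshots the table at read time; if the writer replaces files again mid-query, the same error can recur. Where that is a concern, setting spark.sql.hive.convertMetastoreParquet=false makes Spark read the table through the Hive SerDe instead of its built-in Parquet reader, bypassing the Parquet metadata cache at some performance cost.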