Hudi索引


发布于 2025-06-06 / 0 阅读 / 0 评论 /
Hudi索引的设计与实现

Hudi索引的作用

Apache Hudi索引在数据读和写的过程中都有应用。

(1)读的过程主要是查询引擎利用MetaDataTable使用索引进行Data Skipping以提高查找速度

(2)写的过程主要应用在upsert写上,即利用索引查找该纪录是新增(I)还是更新(U),以提高写入过程中纪录的打标(tag)速度。

Hudi索引的类型

Hudi支持以下类型的索引

HoodieIndex

HoodieIndex是所有索引类型的基类

核心方法如下:

public abstract class HoodieIndex<I, O> implements Serializable {
  /**
   * Looks up the index and tags each incoming record with a location of a file that contains
   * the row (if it is actually present).
   */
  public abstract <R> HoodieData<HoodieRecord<R>> tagLocation(
      HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
      HoodieTable hoodieTable) throws HoodieIndexException;

  /**
   * Extracts the location of written records, and updates the index.
   */
  public abstract HoodieData<WriteStatus> updateLocation(
      HoodieData<WriteStatus> writeStatuses, HoodieEngineContext context,
      HoodieTable hoodieTable) throws HoodieIndexException;

  /**
   * Rollback the effects of the commit made at instantTime.
   */
  public abstract boolean rollbackCommit(String instantTime);

  /**
   * An index is `global` if {@link HoodieKey} to fileID mapping, does not depend on the `partitionPath`. Such an
   * implementation is able to obtain the same mapping, for two hoodie keys with same `recordKey` but different
   * `partitionPath`
   *
   * @return whether the index implementation is global in nature
   */
  public abstract boolean isGlobal();

  /**
   * This is used by storage to determine, if it is safe to send inserts, straight to the log, i.e. having a
   * {@link FileSlice}, with no data file.
   *
   * @return Returns true/false depending on whether the impl has this capability
   */
  public abstract boolean canIndexLogFiles();
  
  /**
   * An index is "implicit" with respect to storage, if just writing new data to a file slice, updates the index as
   * well. This is used by storage, to save memory footprint in certain cases.
   */
  public abstract boolean isImplicitWithStorage();
}

HoodieBloomIndex

Hudi默认采用的HoodieBloomIndex索引,其依赖布隆过滤器来判断记录存在与否,当记录存在时,会读取实际文件进行二次判断,以便修正布隆过滤器带来的误差。同时还在每个文件元数据中添加了该文件保存的最大和最小的recordKey,借助该值可过滤出无需对比的文件。

HoodieBloomIndex是基于分区记录所在文件,即分区路径+recordKey唯一即可。

HoodieGlobalBloomIndex

HoodieGlobalBloomIndex继承自HoodieBloomIndex,表示全局的索引,即会在所有分区内查找指定的recordKey,而非像 HoodieBloomIndex只在指定的分区内查找。

HoodieGlobalBloomIndex在加载分区下所有最新文件时,其会首先获取所有分区,然后再获取所有分区下的最新文件,而非使用从原始记录中解析出来的分区路径。

HoodieBucketIndex

HoodieSimpleIndex

HoodieInMemoryHashIndex

SparkHoodieHBaseIndex

Hudi内置了HBase外置存储系统索引的实现,用户可直接配置HBase索引,将记录索引信息存入HBase。

RebalancedSparkHoodieHBaseIndex

继承自SparkHoodieHBaseIndex类,通过给key添加随机的前缀来解决hbase region数据倾斜的问题。