Iceberg Spark DQL
If you want to use Iceberg, it is strongly recommended to use Spark 3: starting with Spark 3, Iceberg supports a number of new features, including querying with SQL statements.
Usage is no different from traditional Spark SQL; queries can be run in the spark-sql shell or from code via spark.sql().
The syntax is essentially the same as native Spark SQL, except that Iceberg also lets you inspect a table's history, snapshots, data files, metadata files, and so on. This is done by appending a metadata table name after the table name in the FROM clause.
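For example, the four metadata tables covered in this section are all addressed with that suffix pattern (these are exactly the queries shown in the sections below):
select * from hadoop_catalog.iceberg_db.user_log_iceberg.history;    -- commit history
select * from hadoop_catalog.iceberg_db.user_log_iceberg.snapshots;  -- valid snapshots
select * from hadoop_catalog.iceberg_db.user_log_iceberg.files;      -- data files and per-file metadata
select * from hadoop_catalog.iceberg_db.user_log_iceberg.manifests;  -- manifest files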
# History
Query the table's history:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.history;
Output:
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2022-03-20 14:37:...|6744647507914918603|               null|               true|
|2022-03-20 18:31:...|8046643380197343006|6744647507914918603|               true|
|2022-03-20 18:36:...|4140724156423386841|8046643380197343006|               true|
+--------------------+-------------------+-------------------+-------------------+
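The snapshot_id values listed here can be fed back into a time-travel read. A minimal sketch, assuming Spark 3.3 or later (where Iceberg supports the VERSION AS OF / TIMESTAMP AS OF syntax); the snapshot id is the first one from the output above and the timestamp is only illustrative:
-- read the table as of a specific snapshot
select * from hadoop_catalog.iceberg_db.user_log_iceberg version as of 6744647507914918603;
-- read the table as it was at a point in time (resolves to the last snapshot committed before it)
select * from hadoop_catalog.iceberg_db.user_log_iceberg timestamp as of '2022-03-20 15:00:00';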
# Snapshots
Query the table's valid snapshots:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.snapshots;
Output:
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2022-03-20 14:37:...|6744647507914918603|               null|   append|/opt/module/spark...|{spark.app.id -> ...|
|2022-03-20 18:31:...|8046643380197343006|6744647507914918603|   append|/opt/module/spark...|{spark.app.id -> ...|
|2022-03-20 18:36:...|4140724156423386841|8046643380197343006|   delete|/opt/module/spark...|{spark.app.id -> ...|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
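history and snapshots share the snapshot_id column, so they can be joined to see which operation produced each commit and, through the summary map, which Spark application wrote it. A sketch against the same table:
select h.made_current_at,
       s.operation,
       h.snapshot_id,
       h.is_current_ancestor,
       s.summary['spark.app.id'] as app_id
from hadoop_catalog.iceberg_db.user_log_iceberg.history h
join hadoop_catalog.iceberg_db.user_log_iceberg.snapshots s
  on h.snapshot_id = s.snapshot_id
order by h.made_current_at;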
# Files
Query the table's data files and the metadata of each file:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.files;
Output:
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|content|           file_path|file_format|spec_id|   partition|record_count|file_size_in_bytes|        column_sizes|        value_counts|   null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|      0|/opt/module/spark...|    PARQUET|      0|{2021-12-31}|           1|              1032|{1 -> 48, 2 -> 48...|{1 -> 1, 2 -> 1, ...|{1 -> 0, 2 -> 0, ...|              {}|{1 -> xxxxxxxxxxx...|{1 -> xxxxxxxxxxx...|        null|          [4]|        null|            0|
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
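Since files is an ordinary queryable table, it is convenient for quick health checks such as counting files and bytes per partition. A sketch using the columns shown above (partition is back-quoted defensively because it is also a SQL keyword):
select `partition`,
       count(*)                as file_count,
       sum(record_count)       as row_count,
       sum(file_size_in_bytes) as total_bytes
from hadoop_catalog.iceberg_db.user_log_iceberg.files
group by `partition`
order by `partition`;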
# Manifests
Query the table's manifest files and their metadata:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.manifests;
Output:
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
|                path|length|partition_spec_id|  added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count| partition_summaries|
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
|/opt/module/spark...|  6127|                0|4140724156423386841|                     0|                        0|                       1|[{false, false, 2...|
|/opt/module/spark...|  6128|                0|6744647507914918603|                     1|                        0|                       0|[{false, false, 2...|
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
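Each manifest also records which snapshot added it, so the manifests table can be filtered back to a single commit, for example the delete snapshot from the history above. A sketch using the columns shown in the output:
select path,
       added_data_files_count,
       existing_data_files_count,
       deleted_data_files_count
from hadoop_catalog.iceberg_db.user_log_iceberg.manifests
where added_snapshot_id = 4140724156423386841;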