Iceberg Spark DQL
If you want to use Iceberg, it is strongly recommended to use Spark 3: starting with Spark 3, Iceberg supports a number of new features, including querying with SQL statements.
Usage is no different from traditional Spark SQL; queries can be run in the spark-sql shell or from code via spark.sql().
The syntax is essentially the same as native Spark SQL, except that Iceberg also lets you inspect a table's history, snapshots, data files, metadata files, and so on. This is done by appending a metadata table name after the table name in the FROM clause.
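For example, the four metadata tables covered in this section are all addressed with that suffix pattern (these are exactly the queries shown in the sections below):
select * from hadoop_catalog.iceberg_db.user_log_iceberg.history;    -- commit history
select * from hadoop_catalog.iceberg_db.user_log_iceberg.snapshots;  -- valid snapshots
select * from hadoop_catalog.iceberg_db.user_log_iceberg.files;      -- data files and per-file metadata
select * from hadoop_catalog.iceberg_db.user_log_iceberg.manifests;  -- manifest files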
# History
Query the table's history:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.history;
Output:
+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2022-03-20 14:37:...|6744647507914918603|               null|               true|
|2022-03-20 18:31:...|8046643380197343006|6744647507914918603|               true|
|2022-03-20 18:36:...|4140724156423386841|8046643380197343006|               true|
+--------------------+-------------------+-------------------+-------------------+
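The snapshot_id values listed here can be fed back into a time-travel read. A minimal sketch, assuming Spark 3.3 or later (where Iceberg supports the VERSION AS OF / TIMESTAMP AS OF syntax); the snapshot id is the first one from the output above and the timestamp is only illustrative:
-- read the table as of a specific snapshot
select * from hadoop_catalog.iceberg_db.user_log_iceberg version as of 6744647507914918603;
-- read the table as it was at a point in time (resolves to the last snapshot committed before it)
select * from hadoop_catalog.iceberg_db.user_log_iceberg timestamp as of '2022-03-20 15:00:00';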
# Snapshots
Query the table's valid snapshots:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.snapshots;
Output:
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2022-03-20 14:37:...|6744647507914918603|               null|   append|/opt/module/spark...|{spark.app.id -> ...|
|2022-03-20 18:31:...|8046643380197343006|6744647507914918603|   append|/opt/module/spark...|{spark.app.id -> ...|
|2022-03-20 18:36:...|4140724156423386841|8046643380197343006|   delete|/opt/module/spark...|{spark.app.id -> ...|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
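history and snapshots share the snapshot_id column, so they can be joined to see which operation produced each commit and, through the summary map, which Spark application wrote it. A sketch against the same table:
select h.made_current_at,
       s.operation,
       h.snapshot_id,
       h.is_current_ancestor,
       s.summary['spark.app.id'] as app_id
from hadoop_catalog.iceberg_db.user_log_iceberg.history h
join hadoop_catalog.iceberg_db.user_log_iceberg.snapshots s
  on h.snapshot_id = s.snapshot_id
order by h.made_current_at;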
# Files
Query the table's data files and the metadata of each file:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.files;
Output:
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|content|           file_path|file_format|spec_id|   partition|record_count|file_size_in_bytes|        column_sizes|        value_counts|   null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|      0|/opt/module/spark...|    PARQUET|      0|{2021-12-31}|           1|              1032|{1 -> 48, 2 -> 48...|{1 -> 1, 2 -> 1, ...|{1 -> 0, 2 -> 0, ...|              {}|{1 -> xxxxxxxxxxx...|{1 -> xxxxxxxxxxx...|        null|          [4]|        null|            0|
+-------+--------------------+-----------+-------+------------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
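Since files is an ordinary queryable table, it is convenient for quick health checks such as counting files and bytes per partition. A sketch using the columns shown above (partition is back-quoted defensively because it is also a SQL keyword):
select `partition`,
       count(*)                as file_count,
       sum(record_count)       as row_count,
       sum(file_size_in_bytes) as total_bytes
from hadoop_catalog.iceberg_db.user_log_iceberg.files
group by `partition`
order by `partition`;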
# Manifests
Query the table's manifest files and their metadata:
select * from hadoop_catalog.iceberg_db.user_log_iceberg.manifests;
Output:
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
|                path|length|partition_spec_id|  added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count| partition_summaries|
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
|/opt/module/spark...|  6127|                0|4140724156423386841|                     0|                        0|                       1|[{false, false, 2...|
|/opt/module/spark...|  6128|                0|6744647507914918603|                     1|                        0|                       0|[{false, false, 2...|
+--------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+--------------------+
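Each manifest also records which snapshot added it, so the manifests table can be filtered back to a single commit, for example the delete snapshot from the history above. A sketch using the columns shown in the output:
select path,
       added_data_files_count,
       existing_data_files_count,
       deleted_data_files_count
from hadoop_catalog.iceberg_db.user_log_iceberg.manifests
where added_snapshot_id = 4140724156423386841;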