DataFile carries data file path, partition tuple, metrics, ...
#include <manifest_entry.h>
| enum class | Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 } |
| | Content of a data file. |
| bool | operator== (const DataFile &other) const =default |
| bool | IsDeletionVector () const |
| | Check if this data file is a deletion vector. |
| static std::shared_ptr< StructType > | Type (std::shared_ptr< StructType > partition_type) |
| | Get the schema of the data file with the given partition type. |
◆ column_sizes
| std::map<int32_t, int64_t> iceberg::DataFile::column_sizes |
Field id: 108. Key field id: 117. Value field id: 118.
Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro).
◆ content
| Content iceberg::DataFile::content = Content::kData |
Field id: 134. Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files).
◆ content_offset
| std::optional<int64_t> iceberg::DataFile::content_offset |
Field id: 144. The offset in the file where the content starts.
The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.
◆ content_size_in_bytes
| std::optional<int64_t> iceberg::DataFile::content_size_in_bytes |
Field id: 145. The length of the referenced content stored in the file; required if content_offset is present.
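The presence rules for content_offset, content_size_in_bytes, and referenced_data_file can be sketched as a small consistency check. This is only an illustration: `DataFileSketch`, `FileFormat`, `LooksLikeDeletionVector`, and `CheckBlobReference` are hypothetical stand-ins, not part of the library, and the test used to recognize a deletion vector (position deletes stored in Puffin) is an assumption about what IsDeletionVector() checks. Matching the offset and length against the Puffin footer is not covered here.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-ins for the documented iceberg::DataFile members.
// The real struct has more fields and is declared in manifest_entry.h.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };
enum class FileFormat { kAvro, kOrc, kParquet, kPuffin };

struct DataFileSketch {
  Content content = Content::kData;
  FileFormat file_format = FileFormat::kParquet;
  std::optional<int64_t> content_offset;
  std::optional<int64_t> content_size_in_bytes;
  std::optional<std::string> referenced_data_file;
};

// Assumption: a deletion vector is a position-delete entry stored in Puffin.
// The real IsDeletionVector() may use a different test.
bool LooksLikeDeletionVector(const DataFileSketch& f) {
  return f.content == Content::kPositionDeletes &&
         f.file_format == FileFormat::kPuffin;
}

// Returns an empty string when the blob reference is consistent, otherwise a
// short description of the first problem found.
std::string CheckBlobReference(const DataFileSketch& f) {
  if (f.content_offset.has_value() && !f.content_size_in_bytes.has_value()) {
    return "content_size_in_bytes is required when content_offset is present";
  }
  if (LooksLikeDeletionVector(f)) {
    if (!f.content_offset.has_value() || !f.content_size_in_bytes.has_value()) {
      return "deletion vectors require content_offset and content_size_in_bytes";
    }
    if (!f.referenced_data_file.has_value()) {
      return "deletion vectors require referenced_data_file";
    }
  }
  return "";
}
```

Returning a message rather than a bool keeps the sketch easy to extend with further rules; a real implementation would more likely use the library's own status or result type.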
◆ equality_ids
| std::vector<int32_t> iceberg::DataFile::equality_ids |
Field id: 135. Element field id: 136.
Field ids used to determine row equality in equality delete files. Required when content=2 and should be null otherwise. Fields with ids listed in this column must be present in the delete file.
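A minimal sketch of the "required when content=2, null otherwise" rule follows. The helper `EqualityIdsConsistent` is hypothetical, and because the member is a plain std::vector rather than an optional, the sketch assumes that an empty vector stands for null.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the documented Content enum; redeclared so the sketch stands alone.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Assumption: an empty vector represents "null", since equality_ids is not
// declared as std::optional.
bool EqualityIdsConsistent(Content content,
                           const std::vector<int32_t>& equality_ids) {
  const bool is_equality_delete = content == Content::kEqualityDeletes;
  // Equality deletes need at least one field id; every other content type
  // should carry none.
  return is_equality_delete ? !equality_ids.empty() : equality_ids.empty();
}
```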
◆ file_format
| FileFormatType iceberg::DataFile::file_format = FileFormatType::kParquet |
Field id: 101. File format type: avro, orc, parquet, or puffin.
◆ file_path
| std::string iceberg::DataFile::file_path |
Field id: 100. Full URI for the file with FS scheme.
◆ file_size_in_bytes
| int64_t iceberg::DataFile::file_size_in_bytes = 0 |
Field id: 104. Total file size in bytes.
◆ first_row_id
| std::optional<int64_t> iceberg::DataFile::first_row_id |
Field id: 142. The _row_id for the first row in the data file.
Reference:
◆ kColumnSizes
Initial value:
kColumnSizesFieldId, "column_sizes",
"Map of column id to total size on disk")
◆ kContent
Initial value:
kContentFieldId,
"content",
int32(),
"Contents of the file: 0=data, 1=position deletes, 2=equality deletes")
◆ kContentOffset
Initial value:=
"The offset in the file where the content starts")
◆ kContentSize
Initial value:=
"The length of referenced content stored in the file")
◆ kEqualityIds
Initial value:
kEqualityIdsFieldId, "equality_ids",
"Equality comparison field IDs")
◆ key_metadata
| std::vector<uint8_t> iceberg::DataFile::key_metadata |
Field id: 131. Implementation-specific key metadata for encryption.
◆ kFileFormat
Initial value:=
"File format name: avro, orc, or parquet")
◆ kFilePath
Initial value:
kFilePathFieldId, "file_path", string(), "Location URI with FS scheme")
◆ kFileSize
Initial value:
kFileSizeFieldId,
"file_size_in_bytes",
int64(),
"Total file size in bytes")
◆ kFirstRowId
Initial value:=
"Starting row ID to assign to new rows")
◆ kKeyMetadata
Initial value:
kKeyMetadataFieldId,
"key_metadata",
binary(),
"Encryption key metadata blob")
◆ kLowerBounds
Initial value:
kLowerBoundsFieldId, "lower_bounds",
"Map of column id to lower bound")
◆ kNanValueCounts
Initial value:
kNanValueCountsFieldId, "nan_value_counts",
"Map of column id to number of NaN values in the column")
◆ kNullValueCounts
Initial value:
kNullValueCountsFieldId, "null_value_counts",
"Map of column id to null value count")
◆ kPartitionDoc
| const std::string iceberg::DataFile::kPartitionDoc | inline static |
Initial value:
= "Partition data tuple, schema based on the partition spec"
◆ kRecordCount
Initial value:
kRecordCountFieldId,
"record_count",
int64(),
"Number of records in the file")
◆ kReferencedDataFile
| const SchemaField iceberg::DataFile::kReferencedDataFile | inline static |
Initial value:
kReferencedDataFileFieldId, "referenced_data_file", string(),
"Fully qualified location (URI with FS scheme) of a data file that all deletes "
"reference")
◆ kSortOrderId
Initial value:
kSortOrderIdFieldId,
"sort_order_id",
int32(),
"Sort order ID")
◆ kSplitOffsets
Initial value:
kSplitOffsetsFieldId, "split_offsets",
"Splittable offsets")
◆ kUpperBounds
Initial value:
kUpperBoundsFieldId, "upper_bounds",
"Map of column id to upper bound")
◆ kValueCounts
Initial value:
kValueCountsFieldId, "value_counts",
"Map of column id to total count, including null and NaN")
◆ lower_bounds
| std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::lower_bounds |
Field id: 125. Key field id: 126. Value field id: 127.
Map from column id to lower bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.
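As an illustration of reading a bound "serialized as binary", the sketch below decodes a bound for a 64-bit integer column. It assumes the Iceberg single-value serialization, in which a long is stored as 8 bytes in little-endian order; `DecodeLongBound` and `LongLowerBound` are hypothetical helpers, and other column types need their own decoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

using BoundMap = std::map<int32_t, std::vector<uint8_t>>;

// Assumption: a long bound is exactly 8 bytes, little-endian.
// Returns std::nullopt for any other length.
std::optional<int64_t> DecodeLongBound(const std::vector<uint8_t>& bytes) {
  if (bytes.size() != 8) return std::nullopt;
  uint64_t raw = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    // Byte i carries bits [8*i, 8*i + 7] of the value.
    raw |= static_cast<uint64_t>(bytes[i]) << (8 * i);
  }
  return static_cast<int64_t>(raw);
}

// Looks up and decodes the lower bound for one column; std::nullopt means the
// file carries no usable bound for that column.
std::optional<int64_t> LongLowerBound(const BoundMap& lower_bounds,
                                      int32_t column_id) {
  auto it = lower_bounds.find(column_id);
  if (it == lower_bounds.end()) return std::nullopt;
  return DecodeLongBound(it->second);
}
```

Assembling the value byte by byte keeps the result independent of the host's endianness, which a plain memcpy would not.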
◆ nan_value_counts
| std::map<int32_t, int64_t> iceberg::DataFile::nan_value_counts |
Field id: 137. Key field id: 138. Value field id: 139.
Map from column id to number of NaN values in the column.
◆ null_value_counts
| std::map<int32_t, int64_t> iceberg::DataFile::null_value_counts |
Field id: 110. Key field id: 121. Value field id: 122.
Map from column id to number of null values in the column.
◆ partition
Field id: 102. Partition data tuple, schema based on the partition spec, output using partition field ids.
◆ partition_spec_id
| std::optional<int32_t> iceberg::DataFile::partition_spec_id |
Partition spec id for this data file.
Note: This field is for internal use only and is not persisted to the manifest entry.
◆ record_count
| int64_t iceberg::DataFile::record_count = 0 |
Field id: 103. Number of records in this file, or the cardinality of a deletion vector.
◆ referenced_data_file
| std::optional<std::string> iceberg::DataFile::referenced_data_file |
Field id: 143. Fully qualified location (URI with FS scheme) of a data file that all deletes reference.
Position delete metadata can use referenced_data_file when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors.
◆ sort_order_id
| std::optional<int32_t> iceberg::DataFile::sort_order_id |
Field id: 140. ID representing the sort order for this file.
If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. Position deletes are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
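The reader-side rule above can be sketched as a small helper. `EffectiveSortOrderId` is a hypothetical name; the sketch returns an empty optional for "treat as unsorted" and does not try to resolve ids that are present but unknown to the table.

```cpp
#include <cstdint>
#include <optional>

// Mirrors the documented Content enum; redeclared so the sketch stands alone.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Returns the sort order id a reader should honor, or std::nullopt when the
// file should be treated as unsorted with respect to any table sort order.
std::optional<int32_t> EffectiveSortOrderId(
    Content content, std::optional<int32_t> sort_order_id) {
  // Position deletes are sorted by file and position, so any stored id is
  // ignored for them.
  if (content == Content::kPositionDeletes) return std::nullopt;
  return sort_order_id;
}
```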
◆ split_offsets
| std::vector<int64_t> iceberg::DataFile::split_offsets |
Field id: 132. Element field id: 133.
Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.
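The ordering requirement is cheap to verify before writing a manifest entry. A minimal sketch, using a hypothetical helper `SplitOffsetsSorted`:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true when the offsets are in ascending order. An empty vector is
// trivially sorted. Note that std::is_sorted also accepts equal neighbours;
// rejecting duplicates would be an extra rule not stated in the field docs.
bool SplitOffsetsSorted(const std::vector<int64_t>& split_offsets) {
  return std::is_sorted(split_offsets.begin(), split_offsets.end());
}
```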
◆ upper_bounds
| std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::upper_bounds |
Field id: 128. Key field id: 129. Value field id: 130.
Map from column id to upper bound in the column serialized as binary. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
◆ value_counts
| std::map<int32_t, int64_t> iceberg::DataFile::value_counts |
Field id: 109. Key field id: 119. Value field id: 120.
Map from column id to number of values in the column (including null and NaN values).
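Because value_counts includes null and NaN values, the three count maps can be combined to estimate how many ordinary values a column holds. The sketch below uses hypothetical helpers (`Lookup`, `NonNullNonNanCount`) and makes one loud assumption: a column missing from nan_value_counts is treated as having zero NaN values, which is reasonable for non-floating-point columns but may be wrong if metrics were simply not collected.

```cpp
#include <cstdint>
#include <map>
#include <optional>

using CountMap = std::map<int32_t, int64_t>;

// Returns the count recorded for a column, or std::nullopt if none exists.
std::optional<int64_t> Lookup(const CountMap& counts, int32_t column_id) {
  auto it = counts.find(column_id);
  if (it == counts.end()) return std::nullopt;
  return it->second;
}

// Number of values that are neither null nor NaN, or std::nullopt when the
// value count or null count is unavailable for the column.
std::optional<int64_t> NonNullNonNanCount(const CountMap& value_counts,
                                          const CountMap& null_value_counts,
                                          const CountMap& nan_value_counts,
                                          int32_t column_id) {
  const auto total = Lookup(value_counts, column_id);
  const auto nulls = Lookup(null_value_counts, column_id);
  if (!total || !nulls) return std::nullopt;
  // Assumption: a missing NaN count means the column has no NaN values.
  const int64_t nans = Lookup(nan_value_counts, column_id).value_or(0);
  return *total - *nulls - nans;
}
```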