iceberg-cpp
iceberg::DataFile Struct Reference

DataFile carries data file path, partition tuple, metrics, ...

#include <manifest_entry.h>

Public Types

enum class  Content { kData = 0 , kPositionDeletes = 1 , kEqualityDeletes = 2 }
 Content of a data file.
 

Public Member Functions

bool operator== (const DataFile &other) const =default
 
bool IsDeletionVector () const
 Check if this data file is a deletion vector.
 

Static Public Member Functions

static std::shared_ptr< StructType > Type (std::shared_ptr< StructType > partition_type)
 Get the schema of the data file with the given partition type.
 

Public Attributes

Content content = Content::kData
 
std::string file_path
 
FileFormatType file_format = FileFormatType::kParquet
 
PartitionValues partition
 
int64_t record_count = 0
 
int64_t file_size_in_bytes = 0
 
std::map< int32_t, int64_t > column_sizes
 
std::map< int32_t, int64_t > value_counts
 
std::map< int32_t, int64_t > null_value_counts
 
std::map< int32_t, int64_t > nan_value_counts
 
std::map< int32_t, std::vector< uint8_t > > lower_bounds
 
std::map< int32_t, std::vector< uint8_t > > upper_bounds
 
std::vector< uint8_t > key_metadata
 
std::vector< int64_t > split_offsets
 
std::vector< int32_t > equality_ids
 
std::optional< int32_t > sort_order_id
 
std::optional< int64_t > first_row_id
 
std::optional< std::string > referenced_data_file
 
std::optional< int64_t > content_offset
 
std::optional< int64_t > content_size_in_bytes
 
std::optional< int32_t > partition_spec_id
 Partition spec id for this data file.
 

Static Public Attributes

static constexpr int32_t kContentFieldId = 134
 
static const SchemaField kContent
 
static constexpr int32_t kFilePathFieldId = 100
 
static const SchemaField kFilePath
 
static constexpr int32_t kFileFormatFieldId = 101
 
static const SchemaField kFileFormat
 
static constexpr int32_t kPartitionFieldId = 102
 
static const std::string kPartitionField = "partition"
 
static const std::string kPartitionDoc
 
static constexpr int32_t kRecordCountFieldId = 103
 
static const SchemaField kRecordCount
 
static constexpr int32_t kFileSizeFieldId = 104
 
static const SchemaField kFileSize
 
static constexpr int32_t kColumnSizesFieldId = 108
 
static const SchemaField kColumnSizes
 
static constexpr int32_t kValueCountsFieldId = 109
 
static const SchemaField kValueCounts
 
static constexpr int32_t kNullValueCountsFieldId = 110
 
static const SchemaField kNullValueCounts
 
static constexpr int32_t kNanValueCountsFieldId = 137
 
static const SchemaField kNanValueCounts
 
static constexpr int32_t kLowerBoundsFieldId = 125
 
static const SchemaField kLowerBounds
 
static constexpr int32_t kUpperBoundsFieldId = 128
 
static const SchemaField kUpperBounds
 
static constexpr int32_t kKeyMetadataFieldId = 131
 
static const SchemaField kKeyMetadata
 
static constexpr int32_t kSplitOffsetsFieldId = 132
 
static const SchemaField kSplitOffsets
 
static constexpr int32_t kEqualityIdsFieldId = 135
 
static const SchemaField kEqualityIds
 
static constexpr int32_t kSortOrderIdFieldId = 140
 
static const SchemaField kSortOrderId
 
static constexpr int32_t kFirstRowIdFieldId = 142
 
static const SchemaField kFirstRowId
 
static constexpr int32_t kReferencedDataFileFieldId = 143
 
static const SchemaField kReferencedDataFile
 
static constexpr int32_t kContentOffsetFieldId = 144
 
static const SchemaField kContentOffset
 
static constexpr int32_t kContentSizeFieldId = 145
 
static const SchemaField kContentSize
 

Detailed Description

DataFile carries data file path, partition tuple, metrics, ...

Member Data Documentation

◆ column_sizes

std::map<int32_t, int64_t> iceberg::DataFile::column_sizes

Field id: 108. Key field id: 117. Value field id: 118. Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro).

◆ content

Content iceberg::DataFile::content = Content::kData

Field id: 134. Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files).
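As an illustration of the integer encoding, the sketch below maps a raw manifest value to the enum and rejects anything outside 0..2. The enum is copied from this page; `ContentFromInt` is a hypothetical helper and not part of iceberg-cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Local copy of the enum values listed on this page.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Hypothetical helper: converts the stored integer to a Content value,
// returning nullopt for values the enum does not define.
std::optional<Content> ContentFromInt(int32_t raw) {
  switch (raw) {
    case 0: return Content::kData;
    case 1: return Content::kPositionDeletes;
    case 2: return Content::kEqualityDeletes;
    default: return std::nullopt;
  }
}
```

Returning `std::optional` rather than casting directly keeps an unexpected value from silently becoming an invalid enumerator.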

◆ content_offset

std::optional<int64_t> iceberg::DataFile::content_offset

Field id: 144. The offset in the file where the content starts.

The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.
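The requirement above could be checked along these lines. This is a minimal sketch: `BlobRange` is a hypothetical stand-in that holds only the two fields, the non-negativity check is an assumption of the sketch, and the footer values are passed in as plain integers rather than read from a real Puffin file.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in with only the two fields discussed here;
// it is not the real iceberg::DataFile.
struct BlobRange {
  std::optional<int64_t> content_offset;
  std::optional<int64_t> content_size_in_bytes;
};

// True when both values are present and non-negative (the non-negative
// check is an assumption of this sketch).
bool HasBlobRange(const BlobRange& f) {
  if (!f.content_offset.has_value() || !f.content_size_in_bytes.has_value()) {
    return false;
  }
  return *f.content_offset >= 0 && *f.content_size_in_bytes >= 0;
}

// True when the range equals the offset and length recorded for the blob
// in the Puffin footer. An unset optional never equals a value.
bool MatchesFooterEntry(const BlobRange& f, int64_t footer_offset,
                        int64_t footer_length) {
  return f.content_offset == footer_offset &&
         f.content_size_in_bytes == footer_length;
}
```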

◆ content_size_in_bytes

std::optional<int64_t> iceberg::DataFile::content_size_in_bytes

Field id: 145. The length of the referenced content stored in the file; required if content_offset is present.

◆ equality_ids

std::vector<int32_t> iceberg::DataFile::equality_ids

Field id: 135. Element field id: 136. Field ids used to determine row equality in equality delete files. Required when content=2 and should be null otherwise. Fields with ids listed in this column must be present in the delete file.
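A writer-side consistency check for this rule might look like the sketch below. `EqualityInfo` and `EqualityIdsConsistent` are hypothetical names, and because the member is a plain `std::vector`, the sketch assumes an empty vector stands in for "null".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Local copy of the enum values listed on this page.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Hypothetical stand-in with only the fields needed for the check.
struct EqualityInfo {
  Content content = Content::kData;
  std::vector<int32_t> equality_ids;
};

// Equality delete files (content=2) must list at least one field id;
// every other content type must leave the list empty.
bool EqualityIdsConsistent(const EqualityInfo& f) {
  const bool is_equality_delete = f.content == Content::kEqualityDeletes;
  return is_equality_delete ? !f.equality_ids.empty() : f.equality_ids.empty();
}
```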

◆ file_format

FileFormatType iceberg::DataFile::file_format = FileFormatType::kParquet

Field id: 101. File format type: avro, orc, parquet, or puffin.

◆ file_path

std::string iceberg::DataFile::file_path

Field id: 100. Full URI for the file with FS scheme.

◆ file_size_in_bytes

int64_t iceberg::DataFile::file_size_in_bytes = 0

Field id: 104. Total file size in bytes.

◆ first_row_id

std::optional<int64_t> iceberg::DataFile::first_row_id

Field id: 142. The _row_id for the first row in the data file.

Reference:

◆ kColumnSizes

const SchemaField iceberg::DataFile::kColumnSizes
inline static
Initial value:
= SchemaField::MakeOptional(
kColumnSizesFieldId, "column_sizes",
map(SchemaField::MakeRequired(117, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(118, std::string(MapType::kValueName), int64())),
"Map of column id to total size on disk")

◆ kContent

const SchemaField iceberg::DataFile::kContent
inline static
Initial value:
kContentFieldId, "content", int32(),
"Contents of the file: 0=data, 1=position deletes, 2=equality deletes")

◆ kContentOffset

const SchemaField iceberg::DataFile::kContentOffset
inline static
Initial value:
=
SchemaField::MakeOptional(kContentOffsetFieldId, "content_offset", int64(),
"The offset in the file where the content starts")

◆ kContentSize

const SchemaField iceberg::DataFile::kContentSize
inline static
Initial value:
=
SchemaField::MakeOptional(kContentSizeFieldId, "content_size_in_bytes", int64(),
"The length of referenced content stored in the file")

◆ kEqualityIds

const SchemaField iceberg::DataFile::kEqualityIds
inline static
Initial value:
kEqualityIdsFieldId, "equality_ids",
list(SchemaField::MakeRequired(136, std::string(ListType::kElementName), int32())),
"Equality comparison field IDs")

◆ key_metadata

std::vector<uint8_t> iceberg::DataFile::key_metadata

Field id: 131. Implementation-specific key metadata for encryption.

◆ kFileFormat

const SchemaField iceberg::DataFile::kFileFormat
inline static
Initial value:
=
SchemaField::MakeRequired(kFileFormatFieldId, "file_format", string(),
"File format name: avro, orc, or parquet")

◆ kFilePath

const SchemaField iceberg::DataFile::kFilePath
inline static
Initial value:
kFilePathFieldId, "file_path", string(), "Location URI with FS scheme")

◆ kFileSize

const SchemaField iceberg::DataFile::kFileSize
inline static
Initial value:
kFileSizeFieldId, "file_size_in_bytes", int64(), "Total file size in bytes")

◆ kFirstRowId

const SchemaField iceberg::DataFile::kFirstRowId
inline static
Initial value:
=
SchemaField::MakeOptional(kFirstRowIdFieldId, "first_row_id", int64(),
"Starting row ID to assign to new rows")

◆ kKeyMetadata

const SchemaField iceberg::DataFile::kKeyMetadata
inline static
Initial value:
kKeyMetadataFieldId, "key_metadata", binary(), "Encryption key metadata blob")

◆ kLowerBounds

const SchemaField iceberg::DataFile::kLowerBounds
inline static
Initial value:
kLowerBoundsFieldId, "lower_bounds",
map(SchemaField::MakeRequired(126, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(127, std::string(MapType::kValueName), binary())),
"Map of column id to lower bound")

◆ kNanValueCounts

const SchemaField iceberg::DataFile::kNanValueCounts
inline static
Initial value:
kNanValueCountsFieldId, "nan_value_counts",
map(SchemaField::MakeRequired(138, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(139, std::string(MapType::kValueName), int64())),
"Map of column id to number of NaN values in the column")

◆ kNullValueCounts

const SchemaField iceberg::DataFile::kNullValueCounts
inline static
Initial value:
kNullValueCountsFieldId, "null_value_counts",
map(SchemaField::MakeRequired(121, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(122, std::string(MapType::kValueName), int64())),
"Map of column id to null value count")

◆ kPartitionDoc

const std::string iceberg::DataFile::kPartitionDoc
inline static
Initial value:
=
"Partition data tuple, schema based on the partition spec"

◆ kRecordCount

const SchemaField iceberg::DataFile::kRecordCount
inline static
Initial value:
kRecordCountFieldId, "record_count", int64(), "Number of records in the file")

◆ kReferencedDataFile

const SchemaField iceberg::DataFile::kReferencedDataFile
inline static
Initial value:
kReferencedDataFileFieldId, "referenced_data_file", string(),
"Fully qualified location (URI with FS scheme) of a data file that all deletes "
"reference")

◆ kSortOrderId

const SchemaField iceberg::DataFile::kSortOrderId
inline static
Initial value:
kSortOrderIdFieldId, "sort_order_id", int32(), "Sort order ID")

◆ kSplitOffsets

const SchemaField iceberg::DataFile::kSplitOffsets
inline static
Initial value:
kSplitOffsetsFieldId, "split_offsets",
list(SchemaField::MakeRequired(133, std::string(ListType::kElementName), int64())),
"Splittable offsets")

◆ kUpperBounds

const SchemaField iceberg::DataFile::kUpperBounds
inline static
Initial value:
kUpperBoundsFieldId, "upper_bounds",
map(SchemaField::MakeRequired(129, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(130, std::string(MapType::kValueName), binary())),
"Map of column id to upper bound")

◆ kValueCounts

const SchemaField iceberg::DataFile::kValueCounts
inlinestatic
Initial value:
kValueCountsFieldId, "value_counts",
map(SchemaField::MakeRequired(119, std::string(MapType::kKeyName), int32()),
SchemaField::MakeRequired(120, std::string(MapType::kValueName), int64())),
"Map of column id to total count, including null and NaN")

◆ lower_bounds

std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::lower_bounds

Field id: 125. Key field id: 126. Value field id: 127. Map from column id to the lower bound in the column, serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.

◆ nan_value_counts

std::map<int32_t, int64_t> iceberg::DataFile::nan_value_counts

Field id: 137. Key field id: 138. Value field id: 139. Map from column id to number of NaN values in the column.

◆ null_value_counts

std::map<int32_t, int64_t> iceberg::DataFile::null_value_counts

Field id: 110. Key field id: 121. Value field id: 122. Map from column id to number of null values in the column.

◆ partition

PartitionValues iceberg::DataFile::partition

Field id: 102. Partition data tuple, schema based on the partition spec output using partition field ids.

◆ partition_spec_id

std::optional<int32_t> iceberg::DataFile::partition_spec_id

Partition spec id for this data file.

Note: This field is for internal use only and is not persisted to the manifest entry.

◆ record_count

int64_t iceberg::DataFile::record_count = 0

Field id: 103. Number of records in this file, or the cardinality of a deletion vector.

◆ referenced_data_file

std::optional<std::string> iceberg::DataFile::referenced_data_file

Field id: 143. Fully qualified location (URI with FS scheme) of a data file that all deletes reference.

Position delete metadata can use referenced_data_file when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors.
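One way a reader might use this field is to skip position delete files that cannot affect a given data file. The sketch below is an assumption about such a planner step, not iceberg-cpp behavior; `MayApplyTo` is a hypothetical name and paths are compared as exact strings.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Returns false only when the delete file names a different data file.
// When referenced_data_file is unset, the delete file may touch any data
// file, so it cannot be skipped on this basis.
bool MayApplyTo(const std::optional<std::string>& referenced_data_file,
                std::string_view data_file_path) {
  if (!referenced_data_file.has_value()) {
    return true;
  }
  return std::string_view(*referenced_data_file) == data_file_path;
}
```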

◆ sort_order_id

std::optional<int32_t> iceberg::DataFile::sort_order_id

Field id: 140. ID representing sort order for this file.

If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. Position deletes are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
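The reader-side rule in this paragraph can be sketched as a small function. `EffectiveSortOrderId` is a hypothetical helper; it only covers the "missing" and "position delete" cases, since deciding whether an id is "unknown" would require the table's list of sort orders.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Local copy of the enum values listed on this page.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Returns the sort order id a reader should act on; nullopt means
// "treat the file as unsorted". Position delete files always yield
// nullopt, whatever value was stored.
std::optional<int32_t> EffectiveSortOrderId(Content content,
                                            std::optional<int32_t> stored) {
  if (content == Content::kPositionDeletes) {
    return std::nullopt;
  }
  return stored;
}
```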

◆ split_offsets

std::vector<int64_t> iceberg::DataFile::split_offsets

Field id: 132. Element field id: 133. Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.
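The ordering requirement can be checked with `std::is_sorted`, and sorted offsets can then be turned into byte ranges. Both helpers below are hypothetical. The range construction assumes that each split ends where the next one begins and that the last split ends at `file_size_in_bytes`, which this page does not state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// True when the offsets are in ascending order. std::is_sorted accepts
// equal neighbours; a stricter check would reject duplicates as well.
bool SplitOffsetsSorted(const std::vector<int64_t>& split_offsets) {
  return std::is_sorted(split_offsets.begin(), split_offsets.end());
}

// Builds [start, end) byte ranges from sorted offsets (see the
// assumptions stated in the paragraph above).
std::vector<std::pair<int64_t, int64_t>> SplitRanges(
    const std::vector<int64_t>& split_offsets, int64_t file_size_in_bytes) {
  assert(SplitOffsetsSorted(split_offsets));
  std::vector<std::pair<int64_t, int64_t>> ranges;
  for (std::size_t i = 0; i < split_offsets.size(); ++i) {
    const bool has_next = i + 1 < split_offsets.size();
    const int64_t end = has_next ? split_offsets[i + 1] : file_size_in_bytes;
    ranges.emplace_back(split_offsets[i], end);
  }
  return ranges;
}
```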

◆ upper_bounds

std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::upper_bounds

Field id: 128. Key field id: 129. Value field id: 130. Map from column id to the upper bound in the column, serialized as binary. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
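To show how the lower_bounds and upper_bounds maps can rule a file out, the sketch below decodes bounds for a column of Iceberg `int` type and tests a candidate value against them. It assumes the bytes follow the Iceberg spec's single-value serialization for `int` (4 bytes, little-endian). `DecodeInt32Bound` and `MightContainInt32` are hypothetical helpers, not iceberg-cpp functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <optional>
#include <vector>

using BoundMap = std::map<int32_t, std::vector<uint8_t>>;

// Decodes a 4-byte little-endian two's-complement integer. Returns
// nullopt when the buffer does not hold exactly 4 bytes.
std::optional<int32_t> DecodeInt32Bound(const std::vector<uint8_t>& bytes) {
  if (bytes.size() != 4) return std::nullopt;
  const uint32_t raw = static_cast<uint32_t>(bytes[0]) |
                       (static_cast<uint32_t>(bytes[1]) << 8) |
                       (static_cast<uint32_t>(bytes[2]) << 16) |
                       (static_cast<uint32_t>(bytes[3]) << 24);
  int32_t value;
  std::memcpy(&value, &raw, sizeof(value));  // copy the bit pattern as-is
  return value;
}

// Returns false only when the bounds prove `value` cannot occur in the
// column. A missing or undecodable bound counts as "no information".
bool MightContainInt32(const BoundMap& lower_bounds,
                       const BoundMap& upper_bounds, int32_t column_id,
                       int32_t value) {
  if (auto it = lower_bounds.find(column_id); it != lower_bounds.end()) {
    const auto lower = DecodeInt32Bound(it->second);
    if (lower.has_value() && value < *lower) return false;
  }
  if (auto it = upper_bounds.find(column_id); it != upper_bounds.end()) {
    const auto upper = DecodeInt32Bound(it->second);
    if (upper.has_value() && value > *upper) return false;
  }
  return true;
}
```

Because the bounds are inclusive and need not be tight, a `true` result means only that the file cannot be skipped, not that the value is present.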

◆ value_counts

std::map<int32_t, int64_t> iceberg::DataFile::value_counts

Field id: 109. Key field id: 119. Value field id: 120. Map from column id to number of values in the column (including null and NaN values).
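Since value_counts includes nulls, comparing it with null_value_counts shows whether a column holds nothing but nulls in this file. The sketch below is one possible check; `ColumnIsAllNull` is a hypothetical helper, and it answers `false` whenever either count is missing.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using CountMap = std::map<int32_t, int64_t>;

// True only when both counts are recorded for the column and are equal,
// meaning every value in the column is null. A column with zero values
// also satisfies this trivially.
bool ColumnIsAllNull(const CountMap& value_counts,
                     const CountMap& null_value_counts, int32_t column_id) {
  const auto values = value_counts.find(column_id);
  const auto nulls = null_value_counts.find(column_id);
  if (values == value_counts.end() || nulls == null_value_counts.end()) {
    return false;  // not enough information to decide
  }
  return values->second == nulls->second;
}
```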

