DataFile carries data file path, partition tuple, metrics, ...

#include <manifest_entry.h>

Public Types
enum class	Content { kData = 0 , kPositionDeletes = 1 , kEqualityDeletes = 2 }
	Content of a data file.

Public Member Functions
bool	operator== (const DataFile &other) const =default

bool	IsDeletionVector () const
	Check if this data file is a deletion vector.

Static Public Member Functions
static std::shared_ptr< StructType >	Type (std::shared_ptr< StructType > partition_type)
	Get the schema of the data file with the given partition type.

Public Attributes
Content	content = Content::kData

std::string	file_path

FileFormatType	file_format = FileFormatType::kParquet

PartitionValues	partition

int64_t	record_count = 0

int64_t	file_size_in_bytes = 0

std::map< int32_t, int64_t >	column_sizes

std::map< int32_t, int64_t >	value_counts

std::map< int32_t, int64_t >	null_value_counts

std::map< int32_t, int64_t >	nan_value_counts

std::map< int32_t, std::vector< uint8_t > >	lower_bounds

std::map< int32_t, std::vector< uint8_t > >	upper_bounds

std::vector< uint8_t >	key_metadata

std::vector< int64_t >	split_offsets

std::vector< int32_t >	equality_ids

std::optional< int32_t >	sort_order_id

std::optional< int64_t >	first_row_id

std::optional< std::string >	referenced_data_file

std::optional< int64_t >	content_offset

std::optional< int64_t >	content_size_in_bytes

std::optional< int32_t >	partition_spec_id
	Partition spec id for this data file.

Static Public Attributes
static constexpr int32_t	kContentFieldId = 134

static const SchemaField	kContent

static constexpr int32_t	kFilePathFieldId = 100

static const SchemaField	kFilePath

static constexpr int32_t	kFileFormatFieldId = 101

static const SchemaField	kFileFormat

static constexpr int32_t	kPartitionFieldId = 102

static const std::string	kPartitionField = "partition"

static const std::string	kPartitionDoc

static constexpr int32_t	kRecordCountFieldId = 103

static const SchemaField	kRecordCount

static constexpr int32_t	kFileSizeFieldId = 104

static const SchemaField	kFileSize

static constexpr int32_t	kColumnSizesFieldId = 108

static const SchemaField	kColumnSizes

static constexpr int32_t	kValueCountsFieldId = 109

static const SchemaField	kValueCounts

static constexpr int32_t	kNullValueCountsFieldId = 110

static const SchemaField	kNullValueCounts

static constexpr int32_t	kNanValueCountsFieldId = 137

static const SchemaField	kNanValueCounts

static constexpr int32_t	kLowerBoundsFieldId = 125

static const SchemaField	kLowerBounds

static constexpr int32_t	kUpperBoundsFieldId = 128

static const SchemaField	kUpperBounds

static constexpr int32_t	kKeyMetadataFieldId = 131

static const SchemaField	kKeyMetadata

static constexpr int32_t	kSplitOffsetsFieldId = 132

static const SchemaField	kSplitOffsets

static constexpr int32_t	kEqualityIdsFieldId = 135

static const SchemaField	kEqualityIds

static constexpr int32_t	kSortOrderIdFieldId = 140

static const SchemaField	kSortOrderId

static constexpr int32_t	kFirstRowIdFieldId = 142

static const SchemaField	kFirstRowId

static constexpr int32_t	kReferencedDataFileFieldId = 143

static const SchemaField	kReferencedDataFile

static constexpr int32_t	kContentOffsetFieldId = 144

static const SchemaField	kContentOffset

static constexpr int32_t	kContentSizeFieldId = 145

static const SchemaField	kContentSize

Detailed Description

DataFile carries data file path, partition tuple, metrics, ...

Member Data Documentation

◆ column_sizes

std::map<int32_t, int64_t> iceberg::DataFile::column_sizes

Field id: 108. Key field id: 117. Value field id: 118. Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro).

◆ content

Content iceberg::DataFile::content = Content::kData

Field id: 134. Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files).

◆ content_offset

std::optional<int64_t> iceberg::DataFile::content_offset

Field id: 144. The offset in the file where the content starts.

The content_offset and content_size_in_bytes fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the offset and length stored in the Puffin footer for the deletion vector blob.
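The requirement above can be checked before an entry is written. The following is a minimal sketch, not the library's own validation: `DeletionVectorRef` is a hypothetical stand-in that mirrors only three `DataFile` members, and the non-negative range check is an added assumption. The `referenced_data_file` check reflects the note on that member that it is required for deletion vectors.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in that mirrors only the DataFile members used here;
// it is not the real iceberg::DataFile.
struct DeletionVectorRef {
  std::optional<std::string> referenced_data_file;
  std::optional<int64_t> content_offset;
  std::optional<int64_t> content_size_in_bytes;
};

// Returns true when every field the documentation marks as required for a
// deletion vector is present. The non-negative check is an assumption.
bool HasRequiredDeletionVectorFields(const DeletionVectorRef& f) {
  if (!f.referenced_data_file || !f.content_offset || !f.content_size_in_bytes) {
    return false;
  }
  return *f.content_offset >= 0 && *f.content_size_in_bytes >= 0;
}
```

This sketch does not compare the values against the Puffin footer; that comparison needs the footer itself and is out of scope here.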

◆ content_size_in_bytes

std::optional<int64_t> iceberg::DataFile::content_size_in_bytes

Field id: 145. The length of the referenced content stored in the file; required if content_offset is present.

◆ equality_ids

std::vector<int32_t> iceberg::DataFile::equality_ids

Field id: 135. Element field id: 136. Field ids used to determine row equality in equality delete files. Required when content=2 and should be null otherwise. Fields with ids listed in this column must be present in the delete file.
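The rule for `equality_ids` can be expressed as a small consistency check. This is a sketch under one stated assumption: because the member is a `std::vector` rather than an optional, an empty vector is treated as the equivalent of "null". The `Content` enum below is a local copy of the documented values, and `EqualityIdsConsistent` is a hypothetical helper, not part of the library.

```cpp
#include <cstdint>
#include <vector>

// Local copy of the documented Content values, for illustration only.
enum class Content { kData = 0, kPositionDeletes = 1, kEqualityDeletes = 2 };

// Assumption: an empty vector stands in for a null equality_ids list.
// Equality delete files must list ids; all other content types must not.
bool EqualityIdsConsistent(Content content,
                           const std::vector<int32_t>& equality_ids) {
  const bool is_equality_delete = content == Content::kEqualityDeletes;
  return is_equality_delete ? !equality_ids.empty() : equality_ids.empty();
}
```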

◆ file_format

FileFormatType iceberg::DataFile::file_format = FileFormatType::kParquet

Field id: 101. File format type: avro, orc, parquet, or puffin.

◆ file_path

std::string iceberg::DataFile::file_path

Field id: 100. Full URI for the file with FS scheme.

◆ file_size_in_bytes

int64_t iceberg::DataFile::file_size_in_bytes = 0

Field id: 104. Total file size in bytes.

◆ first_row_id

std::optional<int64_t> iceberg::DataFile::first_row_id

Field id: 142. The _row_id for the first row in the data file.

Reference: First Row ID Inheritance

◆ kColumnSizes

const SchemaField iceberg::DataFile::kColumnSizes

inline static

Initial value:

= SchemaField::MakeOptional(
      kColumnSizesFieldId, "column_sizes",
      map(SchemaField::MakeRequired(117, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(118, std::string(MapType::kValueName), int64())),
      "Map of column id to total size on disk")

◆ kContent

const SchemaField iceberg::DataFile::kContent

inline static

Initial value:

= SchemaField::MakeOptional(
      kContentFieldId, "content", int32(),
      "Contents of the file: 0=data, 1=position deletes, 2=equality deletes")

◆ kContentOffset

const SchemaField iceberg::DataFile::kContentOffset

inline static

Initial value:

=
      SchemaField::MakeOptional(kContentOffsetFieldId, "content_offset", int64(),
                                "The offset in the file where the content starts")

◆ kContentSize

const SchemaField iceberg::DataFile::kContentSize

inline static

Initial value:

=
      SchemaField::MakeOptional(kContentSizeFieldId, "content_size_in_bytes", int64(),
                                "The length of referenced content stored in the file")

◆ kEqualityIds

const SchemaField iceberg::DataFile::kEqualityIds

inline static

Initial value:

= SchemaField::MakeOptional(
      kEqualityIdsFieldId, "equality_ids",
      list(SchemaField::MakeRequired(136, std::string(ListType::kElementName), int32())),
      "Equality comparison field IDs")

◆ key_metadata

std::vector<uint8_t> iceberg::DataFile::key_metadata

Field id: 131. Implementation-specific key metadata for encryption.

◆ kFileFormat

const SchemaField iceberg::DataFile::kFileFormat

inline static

Initial value:

=
      SchemaField::MakeRequired(kFileFormatFieldId, "file_format", string(),
                                "File format name: avro, orc, or parquet")

◆ kFilePath

const SchemaField iceberg::DataFile::kFilePath

inline static

Initial value:

= SchemaField::MakeRequired(
      kFilePathFieldId, "file_path", string(), "Location URI with FS scheme")

◆ kFileSize

const SchemaField iceberg::DataFile::kFileSize

inline static

Initial value:

= SchemaField::MakeRequired(
      kFileSizeFieldId, "file_size_in_bytes", int64(), "Total file size in bytes")

◆ kFirstRowId

const SchemaField iceberg::DataFile::kFirstRowId

inline static

Initial value:

=
      SchemaField::MakeOptional(kFirstRowIdFieldId, "first_row_id", int64(),
                                "Starting row ID to assign to new rows")

◆ kKeyMetadata

const SchemaField iceberg::DataFile::kKeyMetadata

inline static

Initial value:

= SchemaField::MakeOptional(
      kKeyMetadataFieldId, "key_metadata", binary(), "Encryption key metadata blob")

◆ kLowerBounds

const SchemaField iceberg::DataFile::kLowerBounds

inline static

Initial value:

= SchemaField::MakeOptional(
      kLowerBoundsFieldId, "lower_bounds",
      map(SchemaField::MakeRequired(126, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(127, std::string(MapType::kValueName), binary())),
      "Map of column id to lower bound")

◆ kNanValueCounts

const SchemaField iceberg::DataFile::kNanValueCounts

inline static

Initial value:

= SchemaField::MakeOptional(
      kNanValueCountsFieldId, "nan_value_counts",
      map(SchemaField::MakeRequired(138, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(139, std::string(MapType::kValueName), int64())),
      "Map of column id to number of NaN values in the column")

◆ kNullValueCounts

const SchemaField iceberg::DataFile::kNullValueCounts

inline static

Initial value:

= SchemaField::MakeOptional(
      kNullValueCountsFieldId, "null_value_counts",
      map(SchemaField::MakeRequired(121, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(122, std::string(MapType::kValueName), int64())),
      "Map of column id to null value count")

◆ kPartitionDoc

const std::string iceberg::DataFile::kPartitionDoc

inline static

Initial value:

=
      "Partition data tuple, schema based on the partition spec"

◆ kRecordCount

const SchemaField iceberg::DataFile::kRecordCount

inline static

Initial value:

= SchemaField::MakeRequired(
      kRecordCountFieldId, "record_count", int64(), "Number of records in the file")

◆ kReferencedDataFile

const SchemaField iceberg::DataFile::kReferencedDataFile

inline static

Initial value:

= SchemaField::MakeOptional(
      kReferencedDataFileFieldId, "referenced_data_file", string(),
      "Fully qualified location (URI with FS scheme) of a data file that all deletes "
      "reference")

◆ kSortOrderId

const SchemaField iceberg::DataFile::kSortOrderId

inline static

Initial value:

= SchemaField::MakeOptional(
      kSortOrderIdFieldId, "sort_order_id", int32(), "Sort order ID")

◆ kSplitOffsets

const SchemaField iceberg::DataFile::kSplitOffsets

inline static

Initial value:

= SchemaField::MakeOptional(
      kSplitOffsetsFieldId, "split_offsets",
      list(SchemaField::MakeRequired(133, std::string(ListType::kElementName), int64())),
      "Splittable offsets")

◆ kUpperBounds

const SchemaField iceberg::DataFile::kUpperBounds

inline static

Initial value:

= SchemaField::MakeOptional(
      kUpperBoundsFieldId, "upper_bounds",
      map(SchemaField::MakeRequired(129, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(130, std::string(MapType::kValueName), binary())),
      "Map of column id to upper bound")

◆ kValueCounts

const SchemaField iceberg::DataFile::kValueCounts

inline static

Initial value:

= SchemaField::MakeOptional(
      kValueCountsFieldId, "value_counts",
      map(SchemaField::MakeRequired(119, std::string(MapType::kKeyName), int32()),
          SchemaField::MakeRequired(120, std::string(MapType::kValueName), int64())),
      "Map of column id to total count, including null and NaN")

◆ lower_bounds

std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::lower_bounds

Field id: 125. Key field id: 126. Value field id: 127. Map from column id to lower bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.

◆ nan_value_counts

std::map<int32_t, int64_t> iceberg::DataFile::nan_value_counts

Field id: 137. Key field id: 138. Value field id: 139. Map from column id to number of NaN values in the column.

◆ null_value_counts

std::map<int32_t, int64_t> iceberg::DataFile::null_value_counts

Field id: 110. Key field id: 121. Value field id: 122. Map from column id to number of null values in the column.

◆ partition

PartitionValues iceberg::DataFile::partition

Field id: 102. Partition data tuple; its schema is based on the partition spec output, using partition field ids.

◆ partition_spec_id

std::optional<int32_t> iceberg::DataFile::partition_spec_id

Partition spec id for this data file.

Note: This field is for internal use only and is not persisted to the manifest entry.

◆ record_count

int64_t iceberg::DataFile::record_count = 0

Field id: 103. Number of records in this file, or the cardinality of a deletion vector.

◆ referenced_data_file

std::optional<std::string> iceberg::DataFile::referenced_data_file

Field id: 143. Fully qualified location (URI with FS scheme) of a data file that all deletes reference.

Position delete metadata can use referenced_data_file when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors.

◆ sort_order_id

std::optional<int32_t> iceberg::DataFile::sort_order_id

Field id: 140. ID representing sort order for this file.

If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. Position deletes are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.

◆ split_offsets

std::vector<int64_t> iceberg::DataFile::split_offsets

Field id: 132. Element field id: 133. Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.
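The ordering requirement on `split_offsets` can be verified with the standard library. A minimal sketch follows; `SplitOffsetsSorted` is a hypothetical helper name, and it accepts equal adjacent offsets because `std::is_sorted` tests for non-decreasing order.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true when the offsets are in non-decreasing order.
// An empty list is trivially sorted.
bool SplitOffsetsSorted(const std::vector<int64_t>& split_offsets) {
  return std::is_sorted(split_offsets.begin(), split_offsets.end());
}
```

If strictly increasing offsets are wanted instead, `std::adjacent_find` with `std::greater_equal<>` can be used to detect a violation.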

◆ upper_bounds

std::map<int32_t, std::vector<uint8_t> > iceberg::DataFile::upper_bounds

Field id: 128. Key field id: 129. Value field id: 130. Map from column id to upper bound in the column serialized as binary. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
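Bound values in `lower_bounds` and `upper_bounds` are raw bytes, so a reader has to decode them by column type. The sketch below handles only an `int` column and assumes the Iceberg single-value serialization for that type, which is 4 bytes in little-endian order. `DecodeInt32Bound` is a hypothetical helper, not a function of this library.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Decodes a 4-byte little-endian value into an int32_t.
// Returns std::nullopt when the buffer is not exactly 4 bytes long.
std::optional<int32_t> DecodeInt32Bound(const std::vector<uint8_t>& bytes) {
  if (bytes.size() != 4) {
    return std::nullopt;
  }
  uint32_t raw = 0;
  // Walk from the most significant byte (index 3) down to index 0.
  for (std::size_t i = 4; i-- > 0;) {
    raw = (raw << 8) | bytes[i];
  }
  return static_cast<int32_t>(raw);
}
```

Other column types use different encodings, so a real reader would dispatch on the column's type before decoding.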

◆ value_counts

std::map<int32_t, int64_t> iceberg::DataFile::value_counts

Field id: 109. Key field id: 119. Value field id: 120. Map from column id to number of values in the column (including null and NaN values).

The documentation for this struct was generated from the following files:

iceberg/manifest/manifest_entry.h
iceberg/manifest/manifest_entry.cc
