TFX-BSL Public TFXIO¶
tfx_bsl.public.tfxio
¶
Module level imports for tfx_bsl.public.tfxio.
TFXIO defines a common in-memory data representation shared by all TFX libraries and components, as well as an I/O abstraction layer to produce such representations. See the RFC for details: https://github.com/tensorflow/community/blob/master/rfcs/20191017-tfx-standardized-inputs.md
Attributes¶
Classes¶
BeamRecordCsvTFXIO
¶
BeamRecordCsvTFXIO(
physical_format: Text,
column_names: List[Text],
delimiter: Optional[Text] = ",",
skip_blank_lines: bool = True,
multivalent_columns: Optional[Text] = None,
secondary_delimiter: Optional[Text] = None,
schema: Optional[Schema] = None,
raw_record_column_name: Optional[Text] = None,
telemetry_descriptors: Optional[List[Text]] = None,
)
Bases: _CsvTFXIOBase
TFXIO implementation for CSV records in pcoll[bytes].
This is a special TFXIO that does not actually do I/O -- it relies on the caller to prepare a PCollection of bytes.
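To make the in-memory representation concrete, here is a pure-Python sketch (not part of tfx_bsl; the helper name and the plain-dict "batch" are illustrative stand-ins for the Arrow RecordBatches this TFXIO actually produces):

```python
import csv
import io

def csv_records_to_columns(records, column_names, delimiter=","):
    """Toy column-major conversion: each input record (a csv row) contributes
    one value to every column, mimicking how BeamRecordCsvTFXIO turns
    pcoll[bytes] into columnar RecordBatches."""
    columns = {name: [] for name in column_names}
    for record in records:
        row = next(csv.reader(io.StringIO(record), delimiter=delimiter))
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

batch = csv_records_to_columns(["1,alice", "2,bob"], ["id", "name"])
# batch == {"id": ["1", "2"], "name": ["alice", "bob"]}
```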
Source code in tfx_bsl/tfxio/csv_tfxio.py
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the `RecordBatch` produced by `self.BeamSource()`.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam `PTransform` that produces `PCollection[pa.RecordBatch]`.
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the `pa.RecordBatch`es in the result `PCollection` must be of the same schema returned by `self.ArrowSchema`. If a TFMD schema was not provided, the `pa.RecordBatch`es might not be of the same schema (they may contain different numbers of columns).

PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
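Assuming `batch_size` caps the number of rows per output `RecordBatch` (the usual TFXIO behavior), the batching rule can be sketched in plain Python:

```python
def batch_rows(rows, batch_size=None):
    """Group rows into batches of at most batch_size rows; None means
    no explicit cap (illustrative sketch, not the tfx_bsl implementation)."""
    if batch_size is None:
        return [rows]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

# The final batch may be smaller than batch_size.
assert batch_rows([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]
```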
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:

- Only columns needed for the given tensor_names are guaranteed to be produced by `self.BeamSource()`.
- `self.TensorAdapterConfig()` and `self.TensorFlowDataset()` are trimmed to contain only those tensors.
- It retains a reference to the original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see `TensorAdapter.OriginalTensorSpec()`.

May raise an error if the TFMD schema was not provided at construction time.

PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | a set of tensor names. |

RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
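The column-trimming effect of `Project` can be pictured with a pure-Python stand-in (the function and the plain-dict batch are illustrative, not tfx_bsl APIs):

```python
def project(columns, needed):
    """Keep only the columns required to build the requested tensors,
    mimicking what a projected TFXIO's BeamSource guarantees."""
    return {name: values for name, values in columns.items() if name in needed}

full = {"age": [30, 40], "name": ["a", "b"], "zip": ["94043", "10001"]}
assert project(full, {"age", "zip"}) == {"age": [30, 40], "zip": ["94043", "10001"]}
```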
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice, namely when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
`datasets` would result in two datasets, `[ds1, ds2]`, where `ds1` iterates over records from 'file_1' and 'file_2', and `ds2` iterates over records from files matched by 'dir_1/*'.
Example usage:
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.
`RawRecordTensorFlowDataset` returns a dataset of datasets.
When shuffle=True, the datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
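The dataset-of-datasets shape and the `flat_map(lambda x: x)` flattening step above can be mimicked with plain Python lists (illustrative only; real code operates on `tf.data.Dataset` objects):

```python
def flatten(datasets):
    """Flatten a 'dataset of datasets' into a single record stream,
    like ds.flat_map(lambda x: x) does for nested tf.data.Datasets."""
    return [record for ds in datasets for record in ds]

# One inner list per logical file group, e.g. [ds1, ds2] from the example above.
datasets = [["r1", "r2"], ["r3"]]
assert flatten(datasets) == ["r1", "r2", "r3"]
```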
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(options: RecordBatchesOptions)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. Look at `RecordBatchesOptions`. |
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a `TensorAdapter`.

RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | a `TensorAdapterConfig` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(options: TensorFlowDatasetOptions)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`options` | an options object for the tf.data.Dataset. Look at `TensorFlowDatasetOptions`. |
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the `TensorRepresentations`.
These `TensorRepresentation`s describe the tensors or composite tensors produced by the `TensorAdapter` created from `self.TensorAdapter()` or the tf.data.Dataset created from `self.TensorFlowDataset()`.
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
CsvTFXIO
¶
CsvTFXIO(
file_pattern: Text,
column_names: List[Text],
telemetry_descriptors: Optional[List[Text]] = None,
validate: bool = True,
delimiter: Optional[Text] = ",",
skip_blank_lines: Optional[bool] = True,
multivalent_columns: Optional[Text] = None,
secondary_delimiter: Optional[Text] = None,
schema: Optional[Schema] = None,
raw_record_column_name: Optional[Text] = None,
skip_header_lines: int = 0,
)
Bases: _CsvTFXIOBase
TFXIO implementation for CSV.
Initializes a CSV TFXIO.
PARAMETER | DESCRIPTION |
---|---|
`file_pattern` | A file glob pattern to read csv files from. |
`column_names` | List of csv column names. Order must match the order in the CSV file. |
`telemetry_descriptors` | A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use. |
`validate` | Boolean flag to verify that the files exist during pipeline creation time. |
`delimiter` | A one-character string used to separate fields. |
`skip_blank_lines` | A boolean to indicate whether to skip over blank lines rather than interpreting them as missing values. |
`multivalent_columns` | Name of column that can contain multiple values. If secondary_delimiter is provided, this must also be provided. |
`secondary_delimiter` | Delimiter used for parsing multivalent columns. If multivalent_columns is provided, this must also be provided. |
`schema` | An optional TFMD Schema describing the dataset. If schema is provided, it will determine the data type of the csv columns. Otherwise, each column's data type will be inferred by the csv decoder. The schema should contain exactly the same features as column_names. |
`raw_record_column_name` | If not None, the generated Arrow RecordBatches will contain a column of the given name that contains raw csv rows. |
`skip_header_lines` | Number of header lines to skip. The same number is skipped from each file. Must be 0 or higher. A large number of skipped lines might impact performance. |
Source code in tfx_bsl/tfxio/csv_tfxio.py
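The interplay of `delimiter`, `multivalent_columns`, and `secondary_delimiter` can be sketched in plain Python (the `parse_row` helper is hypothetical, for illustration only; the real parsing is done by tfx_bsl's csv decoder):

```python
def parse_row(line, column_names, multivalent_columns=(),
              delimiter=",", secondary_delimiter="|"):
    """Split one csv row into named cells; cells in multivalent columns
    are split again on secondary_delimiter to yield a list of values."""
    cells = line.split(delimiter)
    row = {}
    for name, cell in zip(column_names, cells):
        if name in multivalent_columns:
            row[name] = cell.split(secondary_delimiter)
        else:
            row[name] = cell
    return row

row = parse_row("1,2|3", ["id", "values"], multivalent_columns={"values"})
assert row == {"id": "1", "values": ["2", "3"]}
```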
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the `RecordBatch` produced by `self.BeamSource()`.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam `PTransform` that produces `PCollection[pa.RecordBatch]`.
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the `pa.RecordBatch`es in the result `PCollection` must be of the same schema returned by `self.ArrowSchema`. If a TFMD schema was not provided, the `pa.RecordBatch`es might not be of the same schema (they may contain different numbers of columns).

PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:

- Only columns needed for the given tensor_names are guaranteed to be produced by `self.BeamSource()`.
- `self.TensorAdapterConfig()` and `self.TensorFlowDataset()` are trimmed to contain only those tensors.
- It retains a reference to the original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see `TensorAdapter.OriginalTensorSpec()`.

May raise an error if the TFMD schema was not provided at construction time.

PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | a set of tensor names. |

RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice, namely when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
`datasets` would result in two datasets, `[ds1, ds2]`, where `ds1` iterates over records from 'file_1' and 'file_2', and `ds2` iterates over records from files matched by 'dir_1/*'.
Example usage:
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.
`RawRecordTensorFlowDataset` returns a dataset of datasets.
When shuffle=True, the datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(options: RecordBatchesOptions)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. Look at `RecordBatchesOptions`. |
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a `TensorAdapter`.

RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | a `TensorAdapterConfig` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(options: TensorFlowDatasetOptions)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`options` | an options object for the tf.data.Dataset. Look at `TensorFlowDatasetOptions`. |
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the `TensorRepresentations`.
These `TensorRepresentation`s describe the tensors or composite tensors produced by the `TensorAdapter` created from `self.TensorAdapter()` or the tf.data.Dataset created from `self.TensorFlowDataset()`.
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
RecordBatchToExamplesEncoder
¶
RecordBatchToExamplesEncoder(
schema: Optional[Schema] = None,
)
Encodes `pa.RecordBatch` as a list of serialized `tf.Example`s.
Requires TFMD schema only if RecordBatches contains nested lists with depth > 2 that represent TensorFlow's RaggedFeatures.
Source code in tfx_bsl/coders/example_coder.py
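The column-major-to-row-major round trip this encoder performs can be sketched in plain Python (illustrative only; JSON stands in for the tf.Example wire format, and `encode_batch` is a hypothetical helper, not a tfx_bsl API):

```python
import json

def encode_batch(columns):
    """Turn a column-major batch into one serialized record per row,
    mimicking how RecordBatchToExamplesEncoder emits one tf.Example
    per RecordBatch row."""
    n_rows = len(next(iter(columns.values())))
    return [
        json.dumps({k: v[i] for k, v in columns.items()}, sort_keys=True).encode()
        for i in range(n_rows)
    ]

encoded = encode_batch({"id": [1, 2], "name": ["a", "b"]})
assert json.loads(encoded[0]) == {"id": 1, "name": "a"}
```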
RecordBatchesOptions
¶
Bases: NamedTuple('RecordBatchesOptions', [('batch_size', int), ('drop_final_batch', bool), ('num_epochs', Optional[int]), ('shuffle', bool), ('shuffle_buffer_size', int), ('shuffle_seed', Optional[int])])
Options for TFXIO's RecordBatches.
Note: not all of these options may be effective. It depends on the particular TFXIO's implementation.
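The NamedTuple base shown above fixes the field set and order; constructing an options instance looks like this (the field values below are arbitrary examples, not defaults):

```python
from typing import NamedTuple, Optional

# Mirrors the NamedTuple declaration in the base class above.
class RecordBatchesOptions(NamedTuple):
    batch_size: int
    drop_final_batch: bool
    num_epochs: Optional[int]
    shuffle: bool
    shuffle_buffer_size: int
    shuffle_seed: Optional[int]

opts = RecordBatchesOptions(batch_size=1000, drop_final_batch=False,
                            num_epochs=1, shuffle=False,
                            shuffle_buffer_size=10000, shuffle_seed=None)
assert opts.batch_size == 1000
```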
TFExampleBeamRecord
¶
TFExampleBeamRecord(
physical_format: str,
telemetry_descriptors: Optional[List[str]] = None,
schema: Optional[Schema] = None,
raw_record_column_name: Optional[str] = None,
)
Bases: _TFExampleRecordBase
TFXIO implementation for serialized tf.Examples in pcoll[bytes].
This is a special TFXIO that does not actually do I/O -- it relies on the caller to prepare a PCollection of bytes (serialized tf.Examples).
Initializer.
PARAMETER | DESCRIPTION |
---|---|
`physical_format` | The physical format that describes where the input pcoll[bytes] comes from. Used for telemetry purposes. Examples: "text", "tfrecord". |
`telemetry_descriptors` | A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use. |
`schema` | A TFMD Schema describing the dataset. |
`raw_record_column_name` | If not None, the generated Arrow RecordBatches will contain a column of the given name that contains serialized records. |
Source code in tfx_bsl/tfxio/tf_example_record.py
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the `RecordBatch` produced by `self.BeamSource()`.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam `PTransform` that produces `PCollection[pa.RecordBatch]`.
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the `pa.RecordBatch`es in the result `PCollection` must be of the same schema returned by `self.ArrowSchema`. If a TFMD schema was not provided, the `pa.RecordBatch`es might not be of the same schema (they may contain different numbers of columns).

PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:

- Only columns needed for the given tensor_names are guaranteed to be produced by `self.BeamSource()`.
- `self.TensorAdapterConfig()` and `self.TensorFlowDataset()` are trimmed to contain only those tensors.
- It retains a reference to the original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see `TensorAdapter.OriginalTensorSpec()`.

May raise an error if the TFMD schema was not provided at construction time.

PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | a set of tensor names. |

RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice, namely when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
`datasets` would result in two datasets, `[ds1, ds2]`, where `ds1` iterates over records from 'file_1' and 'file_2', and `ds2` iterates over records from files matched by 'dir_1/*'.
Example usage:
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.
`RawRecordTensorFlowDataset` returns a dataset of datasets.
When shuffle=True, the datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(options: RecordBatchesOptions)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. Look at `RecordBatchesOptions`. |
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a `TensorAdapter`.

RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | a `TensorAdapterConfig` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(options: TensorFlowDatasetOptions)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`options` | an options object for the tf.data.Dataset. Look at `TensorFlowDatasetOptions`. |
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the `TensorRepresentations`.
These `TensorRepresentation`s describe the tensors or composite tensors produced by the `TensorAdapter` created from `self.TensorAdapter()` or the tf.data.Dataset created from `self.TensorFlowDataset()`.
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
TFExampleRecord
¶
TFExampleRecord(
file_pattern: Union[List[str], str],
validate: bool = True,
schema: Optional[Schema] = None,
raw_record_column_name: Optional[str] = None,
telemetry_descriptors: Optional[List[str]] = None,
)
Bases: _TFExampleRecordBase
TFXIO implementation for tf.Example on TFRecord.
Initializes a TFExampleRecord TFXIO.
PARAMETER | DESCRIPTION |
---|---|
`file_pattern` | A file glob pattern to read TFRecords from. |
`validate` | Not used; do not set (unused since 0.22.1). |
`schema` | A TFMD Schema describing the dataset. |
`raw_record_column_name` | If not None, the generated Arrow RecordBatches will contain a column of the given name that contains serialized records. |
`telemetry_descriptors` | A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use. |
Source code in tfx_bsl/tfxio/tf_example_record.py
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the `RecordBatch` produced by `self.BeamSource()`.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam `PTransform` that produces `PCollection[pa.RecordBatch]`.
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the `pa.RecordBatch`es in the result `PCollection` must be of the same schema returned by `self.ArrowSchema`. If a TFMD schema was not provided, the `pa.RecordBatch`es might not be of the same schema (they may contain different numbers of columns).

PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:

- Only columns needed for the given tensor_names are guaranteed to be produced by `self.BeamSource()`.
- `self.TensorAdapterConfig()` and `self.TensorFlowDataset()` are trimmed to contain only those tensors.
- It retains a reference to the original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see `TensorAdapter.OriginalTensorSpec()`.

May raise an error if the TFMD schema was not provided at construction time.

PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | a set of tensor names. |

RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice, namely when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
`datasets` would result in two datasets, `[ds1, ds2]`, where `ds1` iterates over records from 'file_1' and 'file_2', and `ds2` iterates over records from files matched by 'dir_1/*'.
Example usage:
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.
`RawRecordTensorFlowDataset` returns a dataset of datasets.
When shuffle=True, the datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | if not None, the `pa.RecordBatch` produced will be (at most) of the given size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(
options: RecordBatchesOptions,
) -> Iterator[RecordBatch]
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. Look at `RecordBatchesOptions`. |
Source code in tfx_bsl/tfxio/tf_example_record.py
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a `TensorAdapter`.

RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | a `TensorAdapterConfig` instance. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Creates a TFRecordDataset that yields Tensors.
The serialized tf.Examples are parsed by `tf.io.parse_example` to create Tensors.
See base class (tfxio.TFXIO) for more details.
PARAMETER | DESCRIPTION |
---|---|
`options` | an options object for the tf.data.Dataset. See `TensorFlowDatasetOptions`. |

RETURNS | DESCRIPTION |
---|---|
`Dataset` | A dataset of `dict` elements (or a tuple of `dict` elements and label); each `dict` maps feature keys to `Tensor` or `SparseTensor` objects. |

RAISES | DESCRIPTION |
---|---|
`ValueError` | if there is something wrong with the tensor_representation. |
Source code in tfx_bsl/tfxio/tf_example_record.py
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the `TensorRepresentations`.
These `TensorRepresentation`s describe the tensors or composite tensors produced by the `TensorAdapter` created from `self.TensorAdapter()` or the tf.data.Dataset created from `self.TensorFlowDataset()`.
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
TFGraphRecordDecoder
¶
Base class for decoders that turns a list of bytes to (composite) tensors.
Sub-classes must implement `decode_record()` (see its docstring for requirements).
Decoder instances can be saved as a SavedModel by `save_decoder()`.
The SavedModel can be loaded back by `load_decoder()`. However, the loaded decoder will always be of the type `LoadedDecoder` and only have the public interfaces listed in this base class available.
Attributes¶
record_index_tensor_name
property
¶
The name of the tensor indicating which record a slice is from.
The decoded tensors are batch-aligned among themselves, but they don't necessarily have to be batch-aligned with the input records. If not, sub-classes should implement this method to tie the batch dimension with the input record.
The record index tensor must be a SparseTensor or a RaggedTensor of integral type, and must be 2-D and must not contain "missing" values.
A record index tensor like the following: [[0], [0], [2]] means that of 3 "rows" in the output "batch", the first two rows came from the first record, and the 3rd row came from the third record.
The name must not be an empty string.
RETURNS | DESCRIPTION |
---|---|
`Optional[str]` | The name of the record index tensor. |
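The `[[0], [0], [2]]` example above can be made concrete with a small pure-Python sketch (illustrative only; real record index tensors are SparseTensors or RaggedTensors):

```python
def rows_for_record(record_index, record_pos):
    """Given a per-row record index like [[0], [0], [2]], return which
    output batch rows came from the input record at record_pos."""
    return [row for row, idx in enumerate(record_index) if idx[0] == record_pos]

record_index = [[0], [0], [2]]
# Rows 0 and 1 came from the first record; row 2 from the third record.
assert rows_for_record(record_index, 0) == [0, 1]
assert rows_for_record(record_index, 2) == [2]
```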
Functions¶
decode_record
abstractmethod
¶
Sub-classes should implement this.
Implementations must use TF ops to derive the result (composite) tensors, as this function will be traced and become a tf.function (thus a TF Graph). Note that autograph is not enabled in such tracing, which means any python control flow / loops will not be converted to TF cond / loops automatically.
The returned tensors must be batch-aligned (i.e. they should be at least of rank 1, and their outer-most dimensions must be of the same size). They do not have to be batch-aligned with the input records, but if they are not, an additional tensor must be provided among the results to indicate which input record a "row" in the output batch comes from. See record_index_tensor_name for more details.
PARAMETER | DESCRIPTION |
---|---|
`records` | A 1-D string tensor that contains the records to be decoded. |
RETURNS | DESCRIPTION |
---|---|
`Dict[str, TensorAlike]` | A dict of (composite) tensors. |
Source code in tfx_bsl/coders/tf_graph_record_decoder.py
output_type_specs
¶
Returns the tf.TypeSpecs of the decoded tensors.
RETURNS | DESCRIPTION |
---|---|
`Dict[str, TypeSpec]` | A dict whose keys are the same as keys of the dict returned by `decode_record()` and whose values are the `TypeSpec`s of the corresponding (composite) tensor. |
Source code in tfx_bsl/coders/tf_graph_record_decoder.py
save
¶
save(path: str) -> None
Saves this TFGraphRecordDecoder to a SavedModel at path.
This functions the same as tf_graph_record_decoder.save_decoder(). It is provided purely for convenience, and does not impact the actual saved model, since only the tf.function from _make_concrete_decode_function is saved.
PARAMETER | DESCRIPTION |
---|---|
`path` | The path to where the saved_model is saved. |
Source code in tfx_bsl/coders/tf_graph_record_decoder.py
TFSequenceExampleBeamRecord
¶
TFSequenceExampleBeamRecord(
physical_format: Text,
telemetry_descriptors: List[Text],
schema: Optional[Schema] = None,
raw_record_column_name: Optional[Text] = None,
)
Bases: _TFSequenceExampleRecordBase
TFXIO implementation for serialized tf.SequenceExamples in pcoll[bytes].
This is a special TFXIO that does not actually do I/O -- it relies on the caller to prepare a PCollection of bytes (serialized tf.SequenceExamples).
Initializer.
PARAMETER | DESCRIPTION |
---|---|
`physical_format` | The physical format that describes where the input pcoll[bytes] comes from. Used for telemetry purposes. Examples: "text", "tfrecord". |
`telemetry_descriptors` | A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use. |
`schema` | A TFMD Schema describing the dataset. |
`raw_record_column_name` | If not None, the generated Arrow RecordBatches will contain a column of the given name that contains serialized records. |
Source code in tfx_bsl/tfxio/tf_sequence_example_record.py
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the RecordBatch produced by self.BeamSource().
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam PTransform that produces PCollection[pa.RecordBatch].
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema. If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | If not None, the output pa.RecordBatches will be of the specified (maximum) size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:
- Only columns needed for the given tensor_names are guaranteed to be produced by self.BeamSource().
- self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
- It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | A set of tensor names. |
RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO`. |
Source code in tfx_bsl/tfxio/tfxio.py
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:

```python
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
```

would result in the files being read twice, while the following would only read once:

```python
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
```
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice; namely, when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (the default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files; files are merged into one dataset if they have the same format. For example:

```python
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
```

Here `datasets` would be [ds1, ds2], where ds1 iterates over records from 'file_1' and 'file_2', and ds2 iterates over records from files matched by 'dir_1/*'.
Example usage:

```python
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
# RawRecordTensorFlowDataset returns a dataset of datasets, so flatten it.
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# Iterating over `records` yields records from each file in `file_patterns`.
# See `tf.data.Dataset.list_files` for more information about the order of
# files when expanding globs.
```

When shuffle=True, the nested datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | If not None, the output pa.RecordBatches will be of the specified (maximum) size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(options: RecordBatchesOptions)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. See `RecordBatchesOptions`. |
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a TensorAdapter
.
RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | A `TensorAdapterConfig` for this TFXIO. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(options: TensorFlowDatasetOptions)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for the tf.data.Dataset. See `TensorFlowDatasetOptions`. |
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the TensorRepresentations.
These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
TFSequenceExampleRecord
¶
TFSequenceExampleRecord(
file_pattern: Union[List[Text], Text],
telemetry_descriptors: List[Text],
validate: bool = True,
schema: Optional[Schema] = None,
raw_record_column_name: Optional[Text] = None,
)
Bases: _TFSequenceExampleRecordBase
TFXIO implementation for tf.SequenceExample on TFRecord.
Initializes a TFSequenceExampleRecord TFXIO.
PARAMETER | DESCRIPTION |
---|---|
`file_pattern` | One or a list of glob patterns. If a list, must not be empty. |
`telemetry_descriptors` | A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use. |
`validate` | Not used; do not set (unused since post 0.22.1). |
`schema` | A TFMD Schema describing the dataset. |
`raw_record_column_name` | If not None, the generated Arrow RecordBatches will contain a column of the given name that contains serialized records. |
Source code in tfx_bsl/tfxio/tf_sequence_example_record.py
Attributes¶
Functions¶
ArrowSchema
¶
Returns the schema of the RecordBatch produced by self.BeamSource().
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/record_based_tfxio.py
BeamSource
¶
Returns a beam PTransform that produces PCollection[pa.RecordBatch].
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema. If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | If not None, the output pa.RecordBatches will be of the specified (maximum) size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:
- Only columns needed for the given tensor_names are guaranteed to be produced by self.BeamSource().
- self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
- It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | A set of tensor names. |
RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO`. |
Source code in tfx_bsl/tfxio/tfxio.py
RawRecordBeamSource
¶
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:

```python
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
```

would result in the files being read twice, while the following would only read once:

```python
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
```
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordTensorFlowDataset
¶
RawRecordTensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice; namely, when there is some logical grouping of files on which we need to perform operations without applying the operation to each individual group (e.g. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (the default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files; files are merged into one dataset if they have the same format. For example:

```python
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
```

Here `datasets` would be [ds1, ds2], where ds1 iterates over records from 'file_1' and 'file_2', and ds2 iterates over records from files matched by 'dir_1/*'.
Example usage:

```python
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
# RawRecordTensorFlowDataset returns a dataset of datasets, so flatten it.
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# Iterating over `records` yields records from each file in `file_patterns`.
# See `tf.data.Dataset.list_files` for more information about the order of
# files when expanding globs.
```

When shuffle=True, the nested datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
PARAMETER | DESCRIPTION |
---|---|
`options` | A TensorFlowDatasetOptions object. Not all options will apply. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RawRecordToRecordBatch
¶
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | If not None, the output pa.RecordBatches will be of the specified (maximum) size. |
Source code in tfx_bsl/tfxio/record_based_tfxio.py
RecordBatches
¶
RecordBatches(options: RecordBatchesOptions)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. See `RecordBatchesOptions`. |
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a TensorAdapter
.
RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | A `TensorAdapterConfig` for this TFXIO. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
¶
TensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Creates a tf.data.Dataset that yields Tensors.
The serialized tf.SequenceExamples are parsed by tf.io.parse_sequence_example.
See base class (tfxio.TFXIO) for more details.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for the tf.data.Dataset. See `TensorFlowDatasetOptions`. |
RETURNS | DESCRIPTION |
---|---|
`Dataset` | A dataset of `dict` elements (or a tuple of `dict` elements and label). Each `dict` maps feature keys to `Tensor`, `SparseTensor`, or `RaggedTensor` objects. |
RAISES | DESCRIPTION |
---|---|
`ValueError` | If there is something wrong with the provided schema. |
Source code in tfx_bsl/tfxio/tf_sequence_example_record.py
TensorRepresentations
¶
TensorRepresentations() -> TensorRepresentations
Returns the TensorRepresentations.
These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
TFXIO
¶
Bases: object
Abstract basic class of all TFXIO API implementations.
Functions¶
ArrowSchema
abstractmethod
¶
Returns the schema of the RecordBatch produced by self.BeamSource().
May raise an error if the TFMD schema was not provided at construction time.
BeamSource
abstractmethod
¶
Returns a beam PTransform that produces PCollection[pa.RecordBatch].
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema. If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | If not None, the output pa.RecordBatches will be of the specified (maximum) size. |
Source code in tfx_bsl/tfxio/tfxio.py
Project
¶
Projects the dataset represented by this TFXIO.
A Projected TFXIO:
- Only columns needed for the given tensor_names are guaranteed to be produced by self.BeamSource().
- self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
- It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`tensor_names` | A set of tensor names. |
RETURNS | DESCRIPTION |
---|---|
`TFXIO` | A projected `TFXIO`. |
Source code in tfx_bsl/tfxio/tfxio.py
RecordBatches
abstractmethod
¶
RecordBatches(
options: RecordBatchesOptions,
) -> Iterator[RecordBatch]
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for iterating over record batches. See `RecordBatchesOptions`. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapter
¶
TensorAdapter() -> TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapterConfig
¶
TensorAdapterConfig() -> TensorAdapterConfig
Returns the config to initialize a TensorAdapter
.
RETURNS | DESCRIPTION |
---|---|
`TensorAdapterConfig` | A `TensorAdapterConfig` for this TFXIO. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorFlowDataset
abstractmethod
¶
TensorFlowDataset(
options: TensorFlowDatasetOptions,
) -> Dataset
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
PARAMETER | DESCRIPTION |
---|---|
`options` | An options object for the tf.data.Dataset. See `TensorFlowDatasetOptions`. |
Source code in tfx_bsl/tfxio/tfxio.py
TensorRepresentations
abstractmethod
¶
TensorRepresentations() -> TensorRepresentations
Returns the TensorRepresentations.
These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
Source code in tfx_bsl/tfxio/tfxio.py
TensorAdapter
¶
TensorAdapter(config: TensorAdapterConfig)
Bases: object
A TensorAdapter converts a RecordBatch to a collection of TF Tensors.
The conversion is determined by both the Arrow schema and the TensorRepresentations, which must be provided at initialization time. Each TensorRepresentation contains the information needed to translate one or more columns in a RecordBatch of the given Arrow schema into a TF Tensor or CompositeTensor. They are contained in a Dict whose keys are the names of the tensors, which will be the keys of the Dict produced by ToBatchTensors().
TypeSpecs() returns the static TypeSpecs of those tensors by their names; if they have a shape, the size of the first (batch) dimension is always unknown (None) because it depends on the size of the RecordBatch passed to ToBatchTensors().
It is guaranteed that for any tensor_name in the given TensorRepresentations: self.TypeSpecs()[tensor_name].is_compatible_with(self.ToBatchedTensors(...)[tensor_name])
Sliced RecordBatches and LargeListArray columns having null elements backed by non-empty sub-lists are not supported and will yield undefined behavior.
Source code in tfx_bsl/tfxio/tensor_adapter.py
Functions¶
OriginalTypeSpecs
¶
Returns the origin's type specs.
A TFXIO 'Y' may be the result of projecting another TFXIO 'X', in which case 'X' is the origin of 'Y', and this method returns what X.TensorAdapter().TypeSpecs() would return.
May equal self.TypeSpecs().
Returns: a mapping from tensor names to tf.TypeSpecs.
Source code in tfx_bsl/tfxio/tensor_adapter.py
ToBatchTensors
¶
ToBatchTensors(
record_batch: RecordBatch,
produce_eager_tensors: Optional[bool] = None,
) -> Dict[str, Any]
Returns a batch of tensors translated from record_batch.
PARAMETER | DESCRIPTION |
---|---|
`record_batch` | The input RecordBatch. |
`produce_eager_tensors` | Controls whether ToBatchTensors() produces eager tensors or ndarrays (or Tensor value objects). If None, this is determined by whether TF eager mode is enabled. |
RAISES | DESCRIPTION |
---|---|
`RuntimeError` | When eager tensors are requested but TF is not executing eagerly. |
`ValueError` | When any handler fails to produce a Tensor. |
Source code in tfx_bsl/tfxio/tensor_adapter.py
TensorAdapterConfig
¶
TensorAdapterConfig(
arrow_schema: Schema,
tensor_representations: TensorRepresentations,
original_type_specs: Optional[
Dict[str, TypeSpec]
] = None,
)
Bases: object
Config to a TensorAdapter.
Contains all the information needed to create a TensorAdapter.
Source code in tfx_bsl/tfxio/tensor_adapter.py
TensorFlowDatasetOptions
¶
Bases: NamedTuple('TensorFlowDatasetOptions', [('batch_size', int), ('drop_final_batch', bool), ('num_epochs', Optional[int]), ('shuffle', bool), ('shuffle_buffer_size', int), ('shuffle_seed', Optional[int]), ('prefetch_buffer_size', int), ('reader_num_threads', int), ('parser_num_threads', int), ('sloppy_ordering', bool), ('label_key', Optional[str])])
Options for TFXIO's TensorFlowDataset.
Note: not all of these options may be effective. It depends on the particular TFXIO's implementation.