
TensorFlow Transform tft Module

tensorflow_transform

Init module for TF.Transform.

Attributes

Callable module-attribute

Callable = _CallableType(Callable, 2)

Iterable module-attribute

Iterable = _alias(Iterable, 1)

List module-attribute

List = _alias(list, 1, inst=False, name='List')

Mapping module-attribute

Mapping = _alias(Mapping, 2)

Tuple module-attribute

Tuple = _TupleType(tuple, -1, inst=False, name='Tuple')

__version__ module-attribute

__version__ = '1.17.0.dev'

Classes

DatasetMetadata

DatasetMetadata(schema: Schema)

Metadata about a dataset used for the "instance dict" format.

Caution: The "instance dict" format used with DatasetMetadata is much less efficient than TFXIO. For any serious workloads you should use TFXIO with a tfxio.TensorAdapterConfig instance as the metadata. Refer to Get started with TF-Transform for more details.

This is an in-memory representation that may be serialized and deserialized to and from a variety of disk representations.

Source code in tensorflow_transform/tf_metadata/dataset_metadata.py
def __init__(self, schema: schema_pb2.Schema):
  self._schema = schema
  self._output_record_batches = True
Attributes
schema property
schema: Schema
Functions
from_feature_spec classmethod
from_feature_spec(
    feature_spec: Mapping[str, FeatureSpecType],
    domains: Optional[Mapping[str, DomainType]] = None,
) -> _DatasetMetadataType

Creates a DatasetMetadata from a TF feature spec dict.

Source code in tensorflow_transform/tf_metadata/dataset_metadata.py
@classmethod
def from_feature_spec(
    cls: Type[_DatasetMetadataType],
    feature_spec: Mapping[str, common_types.FeatureSpecType],
    domains: Optional[Mapping[str, common_types.DomainType]] = None
) -> _DatasetMetadataType:
  """Creates a DatasetMetadata from a TF feature spec dict."""
  return cls(schema_utils.schema_from_feature_spec(feature_spec, domains))
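For illustration, a minimal sketch of building metadata for the "instance dict" format from a TF feature spec (the feature names are assumptions, not part of this API):

import tensorflow as tf
import tensorflow_transform as tft

# Describe the raw data as a feature spec and wrap it in DatasetMetadata.
raw_metadata = tft.DatasetMetadata.from_feature_spec({
    'text': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
})
print(raw_metadata.schema)  # the underlying schema_pb2.Schema proto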

TFTransformOutput

TFTransformOutput(transform_output_dir: str)

A wrapper around the output of the tf.Transform.

Init method for TFTransformOutput.

PARAMETER DESCRIPTION
transform_output_dir

The directory containing tf.Transform output.

TYPE: str

Source code in tensorflow_transform/output_wrapper.py
def __init__(self, transform_output_dir: str):
  """Init method for TFTransformOutput.

  Args:
    transform_output_dir: The directory containig tf.Transform output.
  """
  self._transform_output_dir = transform_output_dir

  # Lazily constructed properties.
  self._transformed_metadata = None
  self._raw_metadata = None
  self._transform_features_layer = None
  self._exported_as_v1_value = None
  self._transformed_domains = None
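A minimal usage sketch (the output directory path is an assumption): wrap the directory produced by a tf.Transform run to access the transformed schema and the transform graph.

import tensorflow_transform as tft

tft_output = tft.TFTransformOutput('/tmp/transform_output')
# Feature spec of the transformed data, and a Keras layer that applies the
# transform graph to raw features.
transformed_spec = tft_output.transformed_feature_spec()
tft_layer = tft_output.transform_features_layer()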
Attributes
ASSET_MAP class-attribute instance-attribute
ASSET_MAP = 'asset_map'
POST_TRANSFORM_FEATURE_STATS_PATH class-attribute instance-attribute
POST_TRANSFORM_FEATURE_STATS_PATH = join(
    "post_transform_feature_stats", _FEATURE_STATS_PB
)
PRE_TRANSFORM_FEATURE_STATS_PATH class-attribute instance-attribute
PRE_TRANSFORM_FEATURE_STATS_PATH = join(
    "pre_transform_feature_stats", _FEATURE_STATS_PB
)
RAW_METADATA_DIR class-attribute instance-attribute
RAW_METADATA_DIR = 'metadata'
TRANSFORMED_METADATA_DIR class-attribute instance-attribute
TRANSFORMED_METADATA_DIR = 'transformed_metadata'
TRANSFORM_FN_DIR class-attribute instance-attribute
TRANSFORM_FN_DIR = 'transform_fn'
post_transform_statistics_path property
post_transform_statistics_path: str

Returns the path to the post-transform data statistics.

Note: post-transform statistics are not guaranteed to exist in the output of tf.Transform, so accessing this path may fail if they are not present in the TFTransformOutput.

pre_transform_statistics_path property
pre_transform_statistics_path: str

Returns the path to the pre-transform data statistics.

Note: pre-transform statistics are not guaranteed to exist in the output of tf.Transform, so accessing this path may fail if they are not present in the TFTransformOutput.

raw_metadata property
raw_metadata: DatasetMetadata

A DatasetMetadata.

Note: raw_metadata is not guaranteed to exist in the output of tf.Transform, so accessing this property may fail if it is not present in the TFTransformOutput.

RETURNS DESCRIPTION
DatasetMetadata

A DatasetMetadata

transform_savedmodel_dir property
transform_savedmodel_dir: str

A python str.

transformed_metadata property
transformed_metadata: DatasetMetadata

A DatasetMetadata.

Functions
load_transform_graph
load_transform_graph()

Load the transform graph without replacing any placeholders.

This is necessary to ensure that variables in the transform graph are included in the training checkpoint when using tf.Estimator. This should be called in the training input_fn.

Source code in tensorflow_transform/output_wrapper.py
def load_transform_graph(self):
  """Load the transform graph without replacing any placeholders.

  This is necessary to ensure that variables in the transform graph are
  included in the training checkpoint when using tf.Estimator.  This should
  be called in the training input_fn.
  """
  if self._exported_as_v1 is None:
    self._exported_as_v1 = saved_transform_io.exported_as_v1(
        self.transform_savedmodel_dir)

  if self._exported_as_v1:
    saved_transform_io.partially_apply_saved_transform_internal(
        self.transform_savedmodel_dir, {})
  else:
    # Note: This should use the same mechanism as `transform_raw_features` to
    # load the SavedModel into the current graph context.
    _ = self.transform_features_layer()({})
num_buckets_for_transformed_feature
num_buckets_for_transformed_feature(name: str) -> int

Returns the number of buckets for an integerized transformed feature.

Source code in tensorflow_transform/output_wrapper.py
def num_buckets_for_transformed_feature(self, name: str) -> int:
  """Returns the number of buckets for an integerized transformed feature."""
  # Do checks that this tensor can be wrapped in
  # sparse_column_with_integerized_feature
  try:
    domain = self.transformed_domains()[name]
  except KeyError:
    raise ValueError('Column {} did not have a domain provided.'.format(name))
  if not isinstance(domain, schema_pb2.IntDomain):
    raise ValueError('Column {} has domain {}, expected an IntDomain'.format(
        name, domain))
  if domain.min != 0:
    raise ValueError('Column {} has min value {}, should be 0'.format(
        name, domain.min))
  return domain.max + 1
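A hedged sketch of typical usage (the feature name 'education_integerized' and the tft_output wrapper from above are assumptions): the returned bucket count can size an embedding or one-hot encoding for an integerized feature.

import tensorflow as tf

vocab_size = tft_output.num_buckets_for_transformed_feature('education_integerized')
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)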
raw_domains
raw_domains() -> Dict[str, DomainType]

Returns domains for the raw features.

RETURNS DESCRIPTION
Dict[str, DomainType]

A dict from feature names to one of schema_pb2.IntDomain, schema_pb2.StringDomain or schema_pb2.FloatDomain.

Source code in tensorflow_transform/output_wrapper.py
def raw_domains(self) -> Dict[str, common_types.DomainType]:
  """Returns domains for the raw features.

  Returns:
    A dict from feature names to one of schema_pb2.IntDomain,
    schema_pb2.StringDomain or schema_pb2.FloatDomain.
  """
  return schema_utils.schema_as_feature_spec(
      self.raw_metadata.schema).domains
raw_feature_spec
raw_feature_spec() -> Dict[str, FeatureSpecType]

Returns a feature_spec for the raw features.

RETURNS DESCRIPTION
Dict[str, FeatureSpecType]

A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.

Source code in tensorflow_transform/output_wrapper.py
def raw_feature_spec(self) -> Dict[str, common_types.FeatureSpecType]:
  """Returns a feature_spec for the raw features.

  Returns:
    A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.
  """
  return schema_utils.schema_as_feature_spec(
      self.raw_metadata.schema).feature_spec
transform_features_layer
transform_features_layer() -> Model

Creates a TransformFeaturesLayer from this transform output.

If a TransformFeaturesLayer has already been created for self, the same one will be returned.

RETURNS DESCRIPTION
Model

A TransformFeaturesLayer instance.

Source code in tensorflow_transform/output_wrapper.py
def transform_features_layer(self) -> tf_keras.Model:
  """Creates a `TransformFeaturesLayer` from this transform output.

  If a `TransformFeaturesLayer` has already been created for self, the same
  one will be returned.

  Returns:
    A `TransformFeaturesLayer` instance.
  """
  if self._transform_features_layer is None:
    self._transform_features_layer = TransformFeaturesLayer(
        self, exported_as_v1=self._exported_as_v1)
  return self._transform_features_layer
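A minimal serving-time sketch, assuming a trained Keras model named trained_model, a raw 'label' feature that is not needed at serving time, and the tft_output wrapper from above; serialized raw examples are parsed and transformed before being fed to the model.

@tf.function
def serve_fn(serialized_examples):
  raw_spec = tft_output.raw_feature_spec()
  raw_spec.pop('label', None)  # assumption: the label is absent at serving time
  raw_features = tf.io.parse_example(serialized_examples, raw_spec)
  transformed_features = tft_output.transform_features_layer()(raw_features)
  return trained_model(transformed_features)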
transform_raw_features
transform_raw_features(
    raw_features: Mapping[str, TensorType],
    drop_unused_features: bool = True,
) -> Dict[str, TensorType]

Takes a dict of tensors representing raw features and transforms them.

Takes a dictionary of Tensor, SparseTensor, or RaggedTensors that represent the raw features, and applies the transformation defined by tf.Transform.

If drop_unused_features is False, all transformed features defined by tf.Transform are returned. To only return features transformed from the given raw_features, set drop_unused_features to True.

Note: If eager execution is enabled and this API is invoked inside a tf.function or an API that uses tf.function such as dataset.map, please use transform_features_layer instead. It separates out loading of the transform graph and hence resources will not be initialized on each invocation. This can have significant performance improvement if the transform graph was exported as a TF1 SavedModel and guarantees correctness if it was exported as a TF2 SavedModel.

PARAMETER DESCRIPTION
raw_features

A dict whose keys are feature names and values are Tensors, SparseTensors, or RaggedTensors.

TYPE: Mapping[str, TensorType]

drop_unused_features

If True, the result will be filtered. Only the features that are transformed from 'raw_features' will be included in the returned result. If a feature is transformed from multiple raw features (e.g, feature cross), it will only be included if all its base raw features are present in raw_features.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Dict[str, TensorType]

A dict whose keys are feature names and values are Tensors, SparseTensors, or RaggedTensors representing transformed features.

Source code in tensorflow_transform/output_wrapper.py
def transform_raw_features(
    self,
    raw_features: Mapping[str, common_types.TensorType],
    drop_unused_features: bool = True  # LEGACY_VALUE=False
) -> Dict[str, common_types.TensorType]:
  """Takes a dict of tensors representing raw features and transforms them.

  Takes a dictionary of `Tensor`, `SparseTensor`, or `RaggedTensor`s that
  represent the raw features, and applies the transformation defined by
  tf.Transform.

  If False it returns all transformed features defined by tf.Transform. To
  only return features transformed from the given 'raw_features', set
  `drop_unused_features` to True.

  Note: If eager execution is enabled and this API is invoked inside a
  tf.function or an API that uses tf.function such as dataset.map, please use
  `transform_features_layer` instead. It separates out loading of the
  transform graph and hence resources will not be initialized on each
  invocation. This can have significant performance improvement if the
  transform graph was exported as a TF1 SavedModel and guarantees correctness
  if it was exported as a TF2 SavedModel.

  Args:
    raw_features: A dict whose keys are feature names and values are
    `Tensor`s, `SparseTensor`s, or `RaggedTensor`s.
    drop_unused_features: If True, the result will be filtered. Only the
      features that are transformed from 'raw_features' will be included in
      the returned result. If a feature is transformed from multiple raw
      features (e.g, feature cross), it will only be included if all its base
      raw features are present in `raw_features`.

  Returns:
    A dict whose keys are feature names and values are `Tensor`s,
    `SparseTensor`s, or `RaggedTensor`s representing transformed features.
  """
  if self._exported_as_v1:
    transformed_features = self._transform_raw_features_compat_v1(
        raw_features, drop_unused_features)
  else:
    tft_layer = self.transform_features_layer()
    if not drop_unused_features:
      tf.compat.v1.logging.warning(
          'Unused features are always dropped in the TF 2.x '
          'implementation. Ignoring value of drop_unused_features.')

    transformed_features = tft_layer(raw_features)
  return _TransformedFeaturesDict(transformed_features)
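A minimal eager sketch (feature names and values are assumptions); for use inside tf.function or dataset.map, prefer transform_features_layer as the note above explains.

raw_batch = {
    'text': tf.constant(['hello world', 'goodbye']),
    'label': tf.constant([1, 0], dtype=tf.int64),
}
transformed_batch = tft_output.transform_raw_features(raw_batch)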
transformed_domains
transformed_domains() -> Dict[str, DomainType]

Returns domains for the transformed features.

RETURNS DESCRIPTION
Dict[str, DomainType]

A dict from feature names to one of schema_pb2.IntDomain, schema_pb2.StringDomain or schema_pb2.FloatDomain.

Source code in tensorflow_transform/output_wrapper.py
def transformed_domains(self) -> Dict[str, common_types.DomainType]:
  """Returns domains for the transformed features.

  Returns:
    A dict from feature names to one of schema_pb2.IntDomain,
    schema_pb2.StringDomain or schema_pb2.FloatDomain.
  """
  if self._transformed_domains is None:
    self._transformed_domains = schema_utils.schema_as_feature_spec(
        self.transformed_metadata.schema).domains
  return self._transformed_domains
transformed_feature_spec
transformed_feature_spec() -> Dict[str, FeatureSpecType]

Returns a feature_spec for the transformed features.

RETURNS DESCRIPTION
Dict[str, FeatureSpecType]

A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.

Source code in tensorflow_transform/output_wrapper.py
def transformed_feature_spec(self) -> Dict[str, common_types.FeatureSpecType]:
  """Returns a feature_spec for the transformed features.

  Returns:
    A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.
  """
  return schema_utils.schema_as_feature_spec(
      self.transformed_metadata.schema).feature_spec
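A minimal sketch of using the transformed feature spec to parse materialized transformed examples (the TFRecord path and batch size are assumptions):

spec = tft_output.transformed_feature_spec()
dataset = (
    tf.data.TFRecordDataset('/tmp/transformed/train.tfrecord.gz',
                            compression_type='GZIP')
    .batch(32)
    .map(lambda batch: tf.io.parse_example(batch, spec)))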
vocabulary_by_name
vocabulary_by_name(vocab_filename: str) -> List[bytes]

Like vocabulary_file_by_name but returns a list.

Source code in tensorflow_transform/output_wrapper.py
def vocabulary_by_name(self, vocab_filename: str) -> List[bytes]:
  """Like vocabulary_file_by_name but returns a list."""
  vocab_path = self.vocabulary_file_by_name(vocab_filename)
  if not vocab_path:
    raise ValueError('Could not read vocabulary: {}, does not exist'.format(
        vocab_filename))
  elif vocab_path.endswith('tfrecord.gz'):
    dataset = tf.data.TFRecordDataset(vocab_path, compression_type='GZIP')
    vocab_tensor = dataset.batch(tf.int32.max).reduce(
        tf.constant([], dtype=tf.string),
        lambda state, elem: tf.concat([state, elem], axis=-1))
    # Using as_numpy_iterator only works when executing eagerly.
    return _get_tensor_value(vocab_tensor).tolist()
  else:
    with tf.io.gfile.GFile(vocab_path, 'rb') as f:
      return [l.rstrip(os.linesep.encode('utf-8')) for l in f]
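A minimal sketch, assuming a vocabulary was created with vocab_filename='text_vocab' in the preprocessing_fn:

tokens = tft_output.vocabulary_by_name('text_vocab')          # list of bytes
vocab_size = tft_output.vocabulary_size_by_name('text_vocab')  # int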
vocabulary_file_by_name
vocabulary_file_by_name(
    vocab_filename: str,
) -> Optional[str]

Returns the vocabulary file path created in the preprocessing function.

vocab_filename must either be (i) the name used as the vocab_filename argument to tft.compute_and_apply_vocabulary / tft.vocabulary or (ii) the key used in tft.annotate_asset.

When a mapping has been specified by calls to tft.annotate_asset, it will be checked first for the provided filename. If present, this filename will be used directly to construct a path.

If the mapping does not exist or vocab_filename is not present within it, we will default to sanitizing vocab_filename and searching for files matching it within the assets directory.

In either case, if the constructed path does not point to an existing file within the assets subdirectory, we will return a None.

PARAMETER DESCRIPTION
vocab_filename

The vocabulary name to lookup.

TYPE: str

Source code in tensorflow_transform/output_wrapper.py
def vocabulary_file_by_name(self, vocab_filename: str) -> Optional[str]:
  """Returns the vocabulary file path created in the preprocessing function.

  `vocab_filename` must either be (i) the name used as the vocab_filename
  argument to tft.compute_and_apply_vocabulary / tft.vocabulary or (ii) the
  key used in tft.annotate_asset.

  When a mapping has been specified by calls to tft.annotate_asset, it will be
  checked first for the provided filename. If present, this filename will be
  used directly to construct a path.

  If the mapping does not exist or `vocab_filename` is not present within it,
  we will default to sanitizing `vocab_filename` and searching for files
  matching it within the assets directory.

  In either case, if the constructed path does not point to an existing file
  within the assets subdirectory, we will return a None.

  Args:
    vocab_filename: The vocabulary name to lookup.
  """
  mapping_path = os.path.join(self._transformed_metadata_dir, self.ASSET_MAP)

  mapping = {}
  if tf.io.gfile.exists(mapping_path):
    with tf.io.gfile.GFile(mapping_path) as f:
      mapping = json.loads(f.read())
      if vocab_filename in mapping:
        vocab_path = os.path.join(self.transform_savedmodel_dir,
                                  tf.saved_model.ASSETS_DIRECTORY,
                                  mapping[vocab_filename])
        if tf.io.gfile.exists(vocab_path):
          return vocab_path

  prefix = os.path.join(self.transform_savedmodel_dir,
                        tf.saved_model.ASSETS_DIRECTORY,
                        sanitized_vocab_filename(filename=vocab_filename))
  files = tf.io.gfile.glob(prefix) + tf.io.gfile.glob(
      '{}.tfrecord.gz'.format(prefix))
  if not files:
    return None
  if len(files) != 1:
    raise ValueError('Found too many vocabulary files: {}'.format(files))
  return files[0]
vocabulary_size_by_name
vocabulary_size_by_name(vocab_filename: str) -> int

Like vocabulary_file_by_name, but returns the size of vocabulary.

Source code in tensorflow_transform/output_wrapper.py
def vocabulary_size_by_name(self, vocab_filename: str) -> int:
  """Like vocabulary_file_by_name, but returns the size of vocabulary."""
  vocab_size_from_annotations = self._vocabulary_size_from_annotations(
      vocab_filename)
  if vocab_size_from_annotations is not None:
    return vocab_size_from_annotations

  vocab_path = self.vocabulary_file_by_name(vocab_filename)
  if not vocab_path:
    raise ValueError(
        'Could not compute vocabulary size for {}, does not exist'.format(
            vocab_filename))
  elif vocab_path.endswith('tfrecord.gz'):
    dataset = tf.data.TFRecordDataset(vocab_path, compression_type='GZIP')

    def reduce_fn(accum, elem):
      return tf.size(elem, out_type=tf.int64, name='vocabulary_size') + accum

    return _get_tensor_value(
        dataset.batch(tf.int32.max).reduce(
            tf.constant(0, tf.int64), reduce_fn))
  else:
    with tf.io.gfile.GFile(vocab_path, 'rb') as f:
      return sum(1 for _ in f)

TransformFeaturesLayer

TransformFeaturesLayer(
    tft_output: TFTransformOutput,
    exported_as_v1: Optional[bool] = None,
)

Bases: Model

A Keras layer for applying a tf.Transform output to input layers.

Source code in tensorflow_transform/output_wrapper.py
def __init__(self,
             tft_output: TFTransformOutput,
             exported_as_v1: Optional[bool] = None):
  super().__init__(trainable=False)
  self._tft_output = tft_output
  if exported_as_v1 is None:
    self._exported_as_v1 = saved_transform_io.exported_as_v1(
        tft_output.transform_savedmodel_dir)
  else:
    self._exported_as_v1 = exported_as_v1
  self._saved_model_loader_value = None
  self._loaded_saved_model_graph = None
  if tf.compat.v1.executing_eagerly_outside_functions():
    # The model must be tracked by assigning to an attribute of the Keras
    # layer. Hence, we track the attributes of _saved_model_loader here as
    # well.
    self._saved_model_loader_tracked_dict = self._saved_model_loader.__dict__

  # TODO(b/162055065): This is needed because otherwise we'd get an error in
  # some cases:
  # ValueError: Your Layer or Model is in an invalid state. This can happen
  # if you are interleaving estimator/non-estimator models or interleaving
  # models/layers made in tf.compat.v1.Graph.as_default() with models/layers
  # created outside of it. Converting a model to an estimator (via
  # model_to_estimator) invalidates all models/layers made before the
  # conversion (even if they were not the model converted to an estimator).
  # Similarly, making a layer or a model inside a a tf.compat.v1.Graph
  # invalidates all layers/models you previously made outside of the graph.
  self._originally_built_as_v1 = True
Functions
call
call(
    inputs: Mapping[str, TensorType],
) -> Dict[str, TensorType]
Source code in tensorflow_transform/output_wrapper.py
def call(  # pytype: disable=signature-mismatch  # overriding-parameter-count-checks
    self, inputs: Mapping[str, common_types.TensorType]
) -> Dict[str, common_types.TensorType]:

  if self._exported_as_v1 and not ops.executing_eagerly_outside_functions():
    tf.compat.v1.logging.warning('Falling back to transform_raw_features...')
    return self._tft_output._transform_raw_features_compat_v1(  # pylint: disable=protected-access
        inputs,
        drop_unused_features=True)
  else:
    return self._saved_model_loader.apply_transform_model(inputs)

Functions

Any

Any(self, parameters)

Special type indicating an unconstrained type.

  • Any is compatible with every type.
  • Any assumed to have all methods.
  • All values assumed to be instances of Any.

Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance or class checks.

Source code in python3.9/typing.py
@_SpecialForm
def Any(self, parameters):
    """Special type indicating an unconstrained type.

    - Any is compatible with every type.
    - Any assumed to have all methods.
    - All values assumed to be instances of Any.

    Note that all the above statements are true from the point of view of
    static type checkers. At runtime, Any should not be used with instance
    or class checks.
    """
    raise TypeError(f"{self} is not subscriptable")

Optional

Optional(self, parameters)

Optional type.

Optional[X] is equivalent to Union[X, None].

Source code in python3.9/typing.py
@_SpecialForm
def Optional(self, parameters):
    """Optional type.

    Optional[X] is equivalent to Union[X, None].
    """
    arg = _type_check(parameters, f"{self} requires a single type.")
    return Union[arg, type(None)]

Union

Union(self, parameters)

Union type; Union[X, Y] means either X or Y.

To define a union, use e.g. Union[int, str]. Details:

  • The arguments must be types and there must be at least one.
  • None as an argument is a special case and is replaced by type(None).
  • Unions of unions are flattened, e.g.::

    Union[Union[int, str], float] == Union[int, str, float]

  • Unions of a single argument vanish, e.g.::

    Union[int] == int  # The constructor actually returns int

  • Redundant arguments are skipped, e.g.::

    Union[int, str, int] == Union[int, str]

  • When comparing unions, the argument order is ignored, e.g.::

    Union[int, str] == Union[str, int]

  • You cannot subclass or instantiate a union.
  • You can use Optional[X] as a shorthand for Union[X, None].
Source code in python3.9/typing.py
@_SpecialForm
def Union(self, parameters):
    """Union type; Union[X, Y] means either X or Y.

    To define a union, use e.g. Union[int, str].  Details:
    - The arguments must be types and there must be at least one.
    - None as an argument is a special case and is replaced by
      type(None).
    - Unions of unions are flattened, e.g.::

        Union[Union[int, str], float] == Union[int, str, float]

    - Unions of a single argument vanish, e.g.::

        Union[int] == int  # The constructor actually returns int

    - Redundant arguments are skipped, e.g.::

        Union[int, str, int] == Union[int, str]

    - When comparing unions, the argument order is ignored, e.g.::

        Union[int, str] == Union[str, int]

    - You cannot subclass or instantiate a union.
    - You can use Optional[X] as a shorthand for Union[X, None].
    """
    if parameters == ():
        raise TypeError("Cannot take a Union of no types.")
    if not isinstance(parameters, tuple):
        parameters = (parameters,)
    msg = "Union[arg, ...]: each arg must be a type."
    parameters = tuple(_type_check(p, msg) for p in parameters)
    parameters = _remove_dups_flatten(parameters)
    if len(parameters) == 1:
        return parameters[0]
    return _UnionGenericAlias(self, parameters)

annotate_asset

annotate_asset(asset_key: str, asset_filename: str)

Creates mapping between user-defined keys and SavedModel assets.

This mapping is made available in BeamDatasetMetadata and is also used to resolve vocabularies in tft.TFTransformOutput.

Note: multiple mappings for the same key will overwrite the previous one.

PARAMETER DESCRIPTION
asset_key

The key to associate with the asset.

TYPE: str

asset_filename

The filename as it appears within the assets/ subdirectory. Must be sanitized and complete (e.g. include the tfrecord.gz for suffix appropriate files).

TYPE: str

Source code in tensorflow_transform/annotators.py
def annotate_asset(asset_key: str, asset_filename: str):
  """Creates mapping between user-defined keys and SavedModel assets.

  This mapping is made available in `BeamDatasetMetadata` and is also used to
  resolve vocabularies in `tft.TFTransformOutput`.

  Note: multiple mappings for the same key will overwrite the previous one.

  Args:
    asset_key: The key to associate with the asset.
    asset_filename: The filename as it appears within the assets/ subdirectory.
      Must be sanitized and complete (e.g. include the tfrecord.gz for suffix
      appropriate files).
  """
  tf.compat.v1.add_to_collection(_ASSET_KEY_COLLECTION, asset_key)
  tf.compat.v1.add_to_collection(_ASSET_FILENAME_COLLECTION, asset_filename)
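A minimal sketch of annotating a vocabulary asset inside a preprocessing_fn so it can later be resolved via TFTransformOutput.vocabulary_file_by_name('text_vocab_key'); the key, filename, and feature name are assumptions, and the filename must already be in its sanitized, complete form.

def preprocessing_fn(inputs):
  tokens = tf.strings.split(inputs['text']).to_sparse()
  token_ids = tft.compute_and_apply_vocabulary(
      tokens, vocab_filename='text_vocab')
  # Map a user-defined key to the vocabulary asset written above.
  tft.annotate_asset('text_vocab_key', 'text_vocab')
  return {'token_ids': token_ids}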

apply_buckets

apply_buckets(
    x: ConsistentTensorType,
    bucket_boundaries: BucketBoundariesType,
    name: Optional[str] = None,
) -> ConsistentTensorType

Returns a bucketized column, with a bucket index assigned to each input.

Each element e in x is mapped to a positive index i for which bucket_boundaries[i-1] <= e < bucket_boundaries[i], if it exists. If e < bucket_boundaries[0], then e is mapped to 0. If e >= bucket_boundaries[-1], then e is mapped to len(bucket_boundaries). NaNs are mapped to len(bucket_boundaries).

Example:

x = tf.constant([[4.0, float('nan'), 1.0], [float('-inf'), 7.5, 10.0]])
bucket_boundaries = tf.constant([[2.0, 5.0, 10.0]])
tft.apply_buckets(x, bucket_boundaries)
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[1, 3, 0],
       [0, 2, 3]])>

PARAMETER DESCRIPTION
x

A numeric input Tensor, SparseTensor, or RaggedTensor whose values should be mapped to buckets. For CompositeTensors, the non-missing values will be mapped to buckets and missing value left missing.

TYPE: ConsistentTensorType

bucket_boundaries

A rank 2 Tensor or list representing the bucket boundaries sorted in ascending order.

TYPE: BucketBoundariesType

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor of the same shape as x, with each element in the returned tensor representing the bucketized value. Bucketized value is in the range [0, len(bucket_boundaries)].

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def apply_buckets(
    x: common_types.ConsistentTensorType,
    bucket_boundaries: common_types.BucketBoundariesType,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Returns a bucketized column, with a bucket index assigned to each input.

  Each element `e` in `x` is mapped to a positive index `i` for which
  `bucket_boundaries[i-1] <= e < bucket_boundaries[i]`, if it exists.
  If `e < bucket_boundaries[0]`, then `e` is mapped to `0`. If
  `e >= bucket_boundaries[-1]`, then `e` is mapped to `len(bucket_boundaries)`.
  NaNs are mapped to `len(bucket_boundaries)`.

  Example:

  >>> x = tf.constant([[4.0, float('nan'), 1.0], [float('-inf'), 7.5, 10.0]])
  >>> bucket_boundaries = tf.constant([[2.0, 5.0, 10.0]])
  >>> tft.apply_buckets(x, bucket_boundaries)
  <tf.Tensor: shape=(2, 3), dtype=int64, numpy=
  array([[1, 3, 0],
         [0, 2, 3]])>

  Args:
    x: A numeric input `Tensor`, `SparseTensor`, or `RaggedTensor` whose values
      should be mapped to buckets.  For `CompositeTensor`s, the non-missing
      values will be mapped to buckets and missing value left missing.
    bucket_boundaries: A rank 2 `Tensor` or list representing the bucket
      boundaries sorted in ascending order.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` of the same shape as `x`, with
    each element in the returned tensor representing the bucketized value.
    Bucketized value is in the range [0, len(bucket_boundaries)].
  """
  with tf.compat.v1.name_scope(name, 'apply_buckets'):
    bucket_boundaries = tf.convert_to_tensor(bucket_boundaries)
    tf.compat.v1.assert_rank(bucket_boundaries, 2)

    bucketized_values = tf_utils.assign_buckets(
        tf_utils.get_values(x), bucket_boundaries, side=tf_utils.Side.RIGHT)

    # Attach the relevant metadata to result, so that the corresponding
    # output feature will have this metadata set.
    min_value = tf.constant(0, tf.int64)
    max_value = tf.shape(input=bucket_boundaries)[1]
    schema_inference.set_tensor_schema_override(
        bucketized_values, min_value, max_value)
    _annotate_buckets(bucketized_values, bucket_boundaries)
    compose_result_fn = _make_composite_tensor_wrapper_if_composite(x)
    return compose_result_fn(bucketized_values)
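A minimal sketch inside a preprocessing_fn, using fixed, externally chosen boundaries rather than boundaries learned by tft.bucketize (the feature name and boundary values are assumptions):

def preprocessing_fn(inputs):
  # Rank-2 boundaries: one row of sorted cut points.
  age_boundaries = [[18.0, 35.0, 65.0]]
  return {'age_bucket': tft.apply_buckets(inputs['age'], age_boundaries)}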

apply_buckets_with_interpolation

apply_buckets_with_interpolation(
    x: ConsistentTensorType,
    bucket_boundaries: BucketBoundariesType,
    name: Optional[str] = None,
) -> ConsistentTensorType

Interpolates within the provided buckets and then normalizes to 0 to 1.

A method for normalizing continuous numeric data to the range [0, 1]. Numeric values are first bucketized according to the provided boundaries, then linearly interpolated within their respective bucket ranges. Finally, the interpolated values are normalized to the range [0, 1]. Values that are less than or equal to the lowest boundary, or greater than or equal to the highest boundary, will be mapped to 0 and 1 respectively. NaN values will be mapped to the middle of the range (.5).

This is a non-linear approach to normalization that is less sensitive to outliers than min-max or z-score scaling. When outliers are present, standard forms of normalization can leave the majority of the data compressed into a very small segment of the output range, whereas this approach tends to spread out the more frequent values (if quantile buckets are used). Note that distance relationships in the raw data are not necessarily preserved (data points that are close to each other in the raw feature space may not be equally close in the transformed feature space). This means that unlike linear normalization methods, correlations between features may be distorted by the transformation. This scaling method may help with stability and minimize exploding gradients in neural networks.

PARAMETER DESCRIPTION
x

A numeric input Tensor, SparseTensor, or RaggedTensor (tf.float[32|64], tf.int[32|64]).

TYPE: ConsistentTensorType

bucket_boundaries

Sorted bucket boundaries as a rank-2 Tensor or list.

TYPE: BucketBoundariesType

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor of the same shape as x, normalized to the range [0, 1]. If the input x is tf.float64, the returned values will be tf.float64. Otherwise, returned values are tf.float32.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def apply_buckets_with_interpolation(
    x: common_types.ConsistentTensorType,
    bucket_boundaries: common_types.BucketBoundariesType,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Interpolates within the provided buckets and then normalizes to 0 to 1.

  A method for normalizing continuous numeric data to the range [0, 1].
  Numeric values are first bucketized according to the provided boundaries, then
  linearly interpolated within their respective bucket ranges. Finally, the
  interpolated values are normalized to the range [0, 1]. Values that are
  less than or equal to the lowest boundary, or greater than or equal to the
  highest boundary, will be mapped to 0 and 1 respectively. NaN values will be
  mapped to the middle of the range (.5).

  This is a non-linear approach to normalization that is less sensitive to
  outliers than min-max or z-score scaling. When outliers are present, standard
  forms of normalization can leave the majority of the data compressed into a
  very small segment of the output range, whereas this approach tends to spread
  out the more frequent values (if quantile buckets are used). Note that
  distance relationships in the raw data are not necessarily preserved (data
  points that close to each other in the raw feature space may not be equally
  close in the transformed feature space). This means that unlike linear
  normalization methods, correlations between features may be distorted by the
  transformation. This scaling method may help with stability and minimize
  exploding gradients in neural networks.

  Args:
    x: A numeric input `Tensor`, `SparseTensor`, or `RaggedTensor`
      (tf.float[32|64], tf.int[32|64]).
    bucket_boundaries: Sorted bucket boundaries as a rank-2 `Tensor` or list.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` of the same shape as `x`,
    normalized to the range [0, 1]. If the input x is tf.float64, the returned
    values will be tf.float64. Otherwise, returned values are tf.float32.
  """
  with tf.compat.v1.name_scope(name, 'buckets_with_interpolation'):
    bucket_boundaries = tf.convert_to_tensor(bucket_boundaries)
    tf.compat.v1.assert_rank(bucket_boundaries, 2)
    x_values = tf_utils.get_values(x)
    compose_result_fn = _make_composite_tensor_wrapper_if_composite(x)
    if not (x_values.dtype.is_floating or x_values.dtype.is_integer):
      raise ValueError(
          'Input tensor to be normalized must be numeric, got {}.'.format(
              x_values.dtype))
    # Remove any non-finite boundaries.
    if bucket_boundaries.dtype in (tf.float64, tf.float32):
      bucket_boundaries = tf.expand_dims(
          tf.gather_nd(bucket_boundaries,
                       tf.where(tf.math.is_finite(bucket_boundaries))),
          axis=0)
    return_type = tf.float64 if x.dtype == tf.float64 else tf.float32
    num_boundaries = tf.cast(
        tf.shape(bucket_boundaries)[1], dtype=tf.int64, name='num_boundaries')
    assert_some_finite_boundaries = tf.compat.v1.assert_greater(
        num_boundaries,
        tf.constant(0, tf.int64),
        name='assert_1_or_more_finite_boundaries')
    with tf.control_dependencies([assert_some_finite_boundaries]):
      bucket_indices = tf_utils.assign_buckets(
          x_values, bucket_boundaries, side=tf_utils.Side.RIGHT)
      # Get max, min, and width of the corresponding bucket for each element.
      bucket_max = tf.cast(
          tf.gather(
              tf.concat([bucket_boundaries[0], bucket_boundaries[:, -1]],
                        axis=0), bucket_indices), return_type)
      bucket_min = tf.cast(
          tf.gather(
              tf.concat([bucket_boundaries[:, 0], bucket_boundaries[0]],
                        axis=0), bucket_indices), return_type)
    bucket_width = bucket_max - bucket_min
    zeros = tf.zeros_like(x_values, dtype=return_type)
    ones = tf.ones_like(x_values, dtype=return_type)

    # Linearly interpolate each value within its respective bucket range.
    interpolation_value = (
        (tf.cast(x_values, return_type) - bucket_min) / bucket_width)
    bucket_interpolation = tf.compat.v1.verify_tensor_all_finite(
        tf.where(
            # If bucket index is first or last, which represents "less than
            # min" and "greater than max" respectively, the bucket logically
            # has an infinite width and we can't meaningfully interpolate.
            tf.logical_or(
                tf.equal(bucket_indices, 0),
                tf.equal(bucket_indices, num_boundaries)),
            zeros,
            tf.where(
                # If the bucket width is zero due to numerical imprecision,
                # there is no point in interpolating
                tf.equal(bucket_width, 0.0),
                ones / 2.0,
                # Finally, for a bucket with a valid width, we can interpolate.
                interpolation_value)),
        'bucket_interpolation')
    bucket_indices_with_interpolation = tf.cast(
        tf.maximum(bucket_indices - 1, 0), return_type) + bucket_interpolation

    # Normalize the interpolated values to the range [0, 1].
    denominator = tf.cast(tf.maximum(num_boundaries - 1, 1), return_type)
    normalized_values = bucket_indices_with_interpolation / denominator
    if x_values.dtype.is_floating:
      # Impute NaNs with .5, the middle value of the normalized output range.
      imputed_values = tf.ones_like(x_values, dtype=return_type) / 2.0
      normalized_values = tf.where(
          tf.math.is_nan(x_values), imputed_values, normalized_values)
    # If there is only one boundary, all values < the boundary are 0, all values
    # >= the boundary are 1.
    single_boundary_values = lambda: tf.where(  # pylint: disable=g-long-lambda
        tf.equal(bucket_indices, 0), zeros, ones)
    normalized_result = tf.cond(
        tf.equal(num_boundaries, 1),
        single_boundary_values, lambda: normalized_values)
    return compose_result_fn(normalized_result)
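A minimal sketch combining tft.quantiles with this mapper to get an outlier-robust normalization to [0, 1] (the feature name, bucket count, and epsilon are assumptions):

def preprocessing_fn(inputs):
  boundaries = tft.quantiles(inputs['income'], num_buckets=10, epsilon=0.01)
  return {
      'income_normalized': tft.apply_buckets_with_interpolation(
          inputs['income'], boundaries)
  }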

apply_pyfunc

apply_pyfunc(func, Tout, stateful=True, name=None, *args)

Applies a python function to some Tensors.

Applies a python function to some Tensors given by the argument list. The number of arguments should match the number of inputs to the function.

This function is for use inside a preprocessing_fn. It is a wrapper around tf.py_func. A function added this way can run in Transform, and during training when the graph is imported using the transform_raw_features method of the TFTransformOutput class. However, if the resulting training graph is serialized and deserialized, the tf.py_func op will not work and will cause an error. This means that TensorFlow Serving will not be able to serve this graph.

The underlying reason for this limited support is that tf.py_func ops were not designed to be serialized, since they contain a reference to arbitrary Python functions. This function pickles those functions and includes them in the graph, and transform_raw_features similarly unpickles the functions. But unpickling requires a Python environment, so it is not possible to provide support in non-Python languages for loading such ops. Therefore loading these ops in libraries such as TensorFlow Serving is not supported.

Note: This API can only be used when TF2 is disabled or tft_beam.Context.force_tf_compat_v1=True.

PARAMETER DESCRIPTION
func

A Python function, which accepts a list of NumPy ndarray objects having element types that match the corresponding tf.Tensor objects in *args, and returns a list of ndarray objects (or a single ndarray) having element types that match the corresponding values in Tout.

Tout

A list or tuple of tensorflow data types or a single tensorflow data type if there is only one, indicating what func returns.

stateful

(Boolean.) If True, the function should be considered stateful. If a function is stateless, when given the same input it will return the same output and have no observable side effects. Optimizations such as common subexpression elimination are only performed on stateless operations.

DEFAULT: True

name

A name for the operation (optional).

DEFAULT: None

*args

The list of Tensors to apply the arguments to.

DEFAULT: ()

Returns: A Tensor representing the application of the function.

Source code in tensorflow_transform/py_func/api.py
def apply_pyfunc(func, Tout, stateful=True, name=None, *args):  # pylint: disable=invalid-name
  """Applies a python function to some `Tensor`s.

  Applies a python function to some `Tensor`s given by the argument list. The
  number of arguments should match the number of inputs to the function.

  This function is for using inside a preprocessing_fn.  It is a wrapper around
  `tf.py_func`.  A function added this way can run in Transform, and during
  training when the graph is imported using the `transform_raw_features` method
  of the `TFTransformOutput` class.  However if the resulting training graph is
  serialized and deserialized, then the `tf.py_func` op will not work and will
  cause an error.  This means that TensorFlow Serving will not be able to serve
  this graph.

  The underlying reason for this limited support is that `tf.py_func` ops were
  not designed to be serialized since they contain a reference to arbitrary
  Python functions. This function pickles those functions and including them in
  the graph, and `transform_raw_features` similarly unpickles the functions.
  But unpickling requires a Python environment, so there it's not possible to
  provide support in non-Python languages for loading such ops.  Therefore
  loading these ops in libraries such as TensorFlow Serving is not supported.

  Note: This API can only be used when TF2 is disabled or
  `tft_beam.Context.force_tf_compat_v1=True`.

  Args:
    func: A Python function, which accepts a list of NumPy `ndarray` objects
      having element types that match the corresponding `tf.Tensor` objects
      in `*args`, and returns a list of `ndarray` objects (or a single
      `ndarray`) having element types that match the corresponding values
      in `Tout`.
    Tout: A list or tuple of tensorflow data types or a single tensorflow data
      type if there is only one, indicating what `func` returns.
    stateful: (Boolean.) If True, the function should be considered stateful.
      If a function is stateless, when given the same input it will return the
      same output and have no observable side effects. Optimizations such as
      common subexpression elimination are only performed on stateless
      operations.
    name: A name for the operation (optional).
    *args: The list of `Tensor`s to apply the arguments to.
  Returns:
    A `Tensor` representing the application of the function.
  """
  return pyfunc_helper.insert_pyfunc(func, Tout, stateful, name, *args)
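A minimal sketch (only valid when TF2 behavior is disabled or tft_beam.Context.force_tf_compat_v1=True, per the note above); the feature name and the Python function are assumptions.

def preprocessing_fn(inputs):
  def _add_one(x):
    return x + 1
  counts = tft.apply_pyfunc(_add_one, tf.int64, True, 'add_one', inputs['count'])
  # tf.py_func loses static shape information, so restore it explicitly.
  counts.set_shape(inputs['count'].get_shape())
  return {'count_plus_one': counts}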

apply_vocabulary

apply_vocabulary(
    x: ConsistentTensorType,
    deferred_vocab_filename_tensor: TemporaryAnalyzerOutputType,
    *,
    default_value: Any = -1,
    num_oov_buckets: int = 0,
    lookup_fn: Optional[
        Callable[
            [TensorType, Tensor], Tuple[Tensor, Tensor]
        ]
    ] = None,
    file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None,
) -> ConsistentTensorType

Maps x to a vocabulary specified by the deferred tensor.

This function also writes domain statistics about the vocabulary min and max values. Note that the min and max are inclusive, and depend on the vocab size, num_oov_buckets and default_value.

PARAMETER DESCRIPTION
x

A categorical Tensor, SparseTensor, or RaggedTensor of type tf.string or tf.int[8|16|32|64] to which the vocabulary transformation should be applied. The column names are those intended for the transformed tensors.

TYPE: ConsistentTensorType

deferred_vocab_filename_tensor

The deferred vocab filename tensor as returned by tft.vocabulary, as long as the frequencies were not stored.

TYPE: TemporaryAnalyzerOutputType

default_value

The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.

TYPE: Any DEFAULT: -1

num_oov_buckets

Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value.

TYPE: int DEFAULT: 0

lookup_fn

Optional lookup function. If specified, it should take a tensor and a deferred vocab filename as input and return a lookup op along with the table size; by default apply_vocabulary constructs a StaticHashTable for the table lookup.

TYPE: Optional[Callable[[TensorType, Tensor], Tuple[Tensor, Tensor]]] DEFAULT: None

file_format

(Optional) A str. The format of the given vocabulary. Accepted formats are: 'tfrecord_gzip', 'text'. The default value is 'text'.

TYPE: VocabularyFileFormatType DEFAULT: DEFAULT_VOCABULARY_FILE_FORMAT

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer; integers are consecutive starting from zero, and string values not in the vocabulary are assigned default_value.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def apply_vocabulary(
    x: common_types.ConsistentTensorType,
    deferred_vocab_filename_tensor: common_types.TemporaryAnalyzerOutputType,
    *,  # Force passing optional parameters by keys.
    default_value: Any = -1,
    num_oov_buckets: int = 0,
    lookup_fn: Optional[Callable[[common_types.TensorType, tf.Tensor],
                                 Tuple[tf.Tensor, tf.Tensor]]] = None,
    file_format: common_types.VocabularyFileFormatType = analyzers
    .DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  r"""Maps `x` to a vocabulary specified by the deferred tensor.

  This function also writes domain statistics about the vocabulary min and max
  values. Note that the min and max are inclusive, and depend on the vocab size,
  num_oov_buckets and default_value.

  Args:
    x: A categorical `Tensor`, `SparseTensor`, or `RaggedTensor` of type
      tf.string or tf.int[8|16|32|64] to which the vocabulary transformation
      should be applied. The column names are those intended for the transformed
      tensors.
    deferred_vocab_filename_tensor: The deferred vocab filename tensor as
      returned by `tft.vocabulary`, as long as the frequencies were not stored.
    default_value: The value to use for out-of-vocabulary values, unless
      'num_oov_buckets' is greater than zero.
    num_oov_buckets:  Any lookup of an out-of-vocabulary token will return a
      bucket ID based on its hash if `num_oov_buckets` is greater than zero.
      Otherwise it is assigned the `default_value`.
    lookup_fn: Optional lookup function, if specified it should take a tensor
      and a deferred vocab filename as an input and return a lookup `op` along
      with the table size, by default `apply_vocabulary` constructs a
      StaticHashTable for the table lookup.
    file_format: (Optional) A str. The format of the given vocabulary. Accepted
      formats are: 'tfrecord_gzip', 'text'. The default value is 'text'.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` where each string value is
    mapped to an integer. Each unique string value that appears in the
    vocabulary is mapped to a different integer and integers are consecutive
    starting from zero, and string value not in the vocabulary is
    assigned default_value.
  """
  return _apply_vocabulary_internal(
      x,
      deferred_vocab_filename_tensor,
      default_value,
      num_oov_buckets,
      lookup_fn,
      file_format,
      False,
      name,
  )
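A minimal sketch, assuming two string features that should share one vocabulary: compute the vocabulary once with tft.vocabulary, then map both features through it.

def preprocessing_fn(inputs):
  vocab_file = tft.vocabulary(inputs['query_tokens'], vocab_filename='tokens')
  return {
      'query_ids': tft.apply_vocabulary(inputs['query_tokens'], vocab_file),
      'doc_ids': tft.apply_vocabulary(inputs['doc_tokens'], vocab_file),
  }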

bag_of_words

bag_of_words(
    tokens: SparseTensor,
    ngram_range: Tuple[int, int],
    separator: str,
    name: Optional[str] = None,
) -> SparseTensor

Computes a bag of "words" based on the specified ngram configuration.

A light wrapper around tft.ngrams. First computes ngrams, then transforms the ngram representation (list semantics) into a Bag of Words (set semantics) per row. Each row reflects the set of unique ngrams present in an input record.

See tft.ngrams for more information.

PARAMETER DESCRIPTION
tokens

a two-dimensional SparseTensor of dtype tf.string containing tokens that will be used to construct a bag of words.

TYPE: SparseTensor

ngram_range

A pair with the range (inclusive) of ngram sizes to compute.

TYPE: Tuple[int, int]

separator

a string that will be inserted between tokens when ngrams are constructed.

TYPE: str

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
SparseTensor

A SparseTensor containing the unique set of ngrams from each row of the input. Note: the original order of the ngrams may not be preserved.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def bag_of_words(tokens: tf.SparseTensor,
                 ngram_range: Tuple[int, int],
                 separator: str,
                 name: Optional[str] = None) -> tf.SparseTensor:
  """Computes a bag of "words" based on the specified ngram configuration.

  A light wrapper around tft.ngrams. First computes ngrams, then transforms the
  ngram representation (list semantics) into a Bag of Words (set semantics) per
  row. Each row reflects the set of *unique* ngrams present in an input record.

  See tft.ngrams for more information.

  Args:
    tokens: a two-dimensional `SparseTensor` of dtype `tf.string` containing
      tokens that will be used to construct a bag of words.
    ngram_range: A pair with the range (inclusive) of ngram sizes to compute.
    separator: a string that will be inserted between tokens when ngrams are
      constructed.
    name: (Optional) A name for this operation.

  Returns:
    A `SparseTensor` containing the unique set of ngrams from each row of the
      input. Note: the original order of the ngrams may not be preserved.
  """
  if tokens.get_shape().ndims != 2:
    raise ValueError('bag_of_words requires `tokens` to be 2-dimensional')
  with tf.compat.v1.name_scope(name, 'bag_of_words'):
    # First compute the ngram representation, which will contain ordered and
    # possibly duplicated ngrams per row.
    all_ngrams = ngrams(tokens, ngram_range, separator)
    # Then deduplicate the ngrams in each row.
    return deduplicate_tensor_per_row(all_ngrams)
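A minimal sketch (feature name and tokenization are assumptions): split raw text into a 2-D SparseTensor of tokens, then build the unique unigrams and bigrams per row.

def preprocessing_fn(inputs):
  tokens = tf.compat.v1.string_split(inputs['text'])  # 2-D SparseTensor
  bow = tft.bag_of_words(tokens, ngram_range=(1, 2), separator=' ')
  return {'bow_ids': tft.compute_and_apply_vocabulary(bow)}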

bucketize

bucketize(
    x: ConsistentTensorType,
    num_buckets: int,
    epsilon: Optional[float] = None,
    weights: Optional[Tensor] = None,
    elementwise: bool = False,
    name: Optional[str] = None,
) -> ConsistentTensorType

Returns a bucketized column, with a bucket index assigned to each input.

PARAMETER DESCRIPTION
x

A numeric input Tensor, SparseTensor, or RaggedTensor whose values should be mapped to buckets. For a CompositeTensor only non-missing values will be included in the quantiles computation, and the result of bucketize will be a CompositeTensor with non-missing values mapped to buckets. If elementwise=True then x must be dense.

TYPE: ConsistentTensorType

num_buckets

Values in the input x are divided into approximately equal-sized buckets, where the number of buckets is num_buckets.

TYPE: int

epsilon

(Optional) Error tolerance, typically a small fraction close to zero. If a value is not specified by the caller, a suitable value is computed based on experimental results. For num_buckets less than 100, the value of 0.01 is chosen to handle a dataset of up to ~1 trillion input data values. If num_buckets is larger, then epsilon is set to (1/num_buckets) to enforce a stricter error tolerance, because more buckets will result in smaller range for each bucket, and so we want the boundaries to be less fuzzy. See analyzers.quantiles() for details.

TYPE: Optional[float] DEFAULT: None

weights

(Optional) Weights tensor for the quantiles. Tensor must have the same shape as x.

TYPE: Optional[Tensor] DEFAULT: None

elementwise

(Optional) If true, bucketize each element of the tensor independently.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor of the same shape as x, with each element in the returned tensor representing the bucketized value. Bucketized value is in the range [0, actual_num_buckets). Sometimes the actual number of buckets can be different than num_buckets hint, for example in case the number of distinct values is smaller than num_buckets, or in cases where the input values are not uniformly distributed. NaN values are mapped to the last bucket. Values with NaN weights are ignored in bucket boundaries calculation.

RAISES DESCRIPTION
TypeError

If num_buckets is not an int.

ValueError

If value of num_buckets is not > 1.

ValueError

If elementwise=True and x is a CompositeTensor.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def bucketize(x: common_types.ConsistentTensorType,
              num_buckets: int,
              epsilon: Optional[float] = None,
              weights: Optional[tf.Tensor] = None,
              elementwise: bool = False,
              name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Returns a bucketized column, with a bucket index assigned to each input.

  Args:
    x: A numeric input `Tensor`, `SparseTensor`, or `RaggedTensor` whose values
      should be mapped to buckets.  For a `CompositeTensor` only non-missing
      values will be included in the quantiles computation, and the result of
      `bucketize` will be a `CompositeTensor` with non-missing values mapped to
      buckets. If elementwise=True then `x` must be dense.
    num_buckets: Values in the input `x` are divided into approximately
      equal-sized buckets, where the number of buckets is `num_buckets`.
    epsilon: (Optional) Error tolerance, typically a small fraction close to
      zero. If a value is not specified by the caller, a suitable value is
      computed based on experimental results.  For `num_buckets` less than 100,
      the value of 0.01 is chosen to handle a dataset of up to ~1 trillion input
      data values.  If `num_buckets` is larger, then epsilon is set to
      (1/`num_buckets`) to enforce a stricter error tolerance, because more
      buckets will result in smaller range for each bucket, and so we want the
      boundaries to be less fuzzy. See analyzers.quantiles() for details.
    weights: (Optional) Weights tensor for the quantiles. Tensor must have the
      same shape as x.
    elementwise: (Optional) If true, bucketize each element of the tensor
      independently.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` of the same shape as `x`, with each element in the
    returned tensor representing the bucketized value. Bucketized value is
    in the range [0, actual_num_buckets). Sometimes the actual number of buckets
    can be different than num_buckets hint, for example in case the number of
    distinct values is smaller than num_buckets, or in cases where the
    input values are not uniformly distributed.
    NaN values are mapped to the last bucket. Values with NaN weights are
    ignored in bucket boundaries calculation.

  Raises:
    TypeError: If num_buckets is not an int.
    ValueError: If value of num_buckets is not > 1.
    ValueError: If elementwise=True and x is a `CompositeTensor`.
  """
  with tf.compat.v1.name_scope(name, 'bucketize'):
    if not isinstance(num_buckets, int):
      raise TypeError('num_buckets must be an int, got %s' % type(num_buckets))

    if num_buckets < 1:
      raise ValueError('Invalid num_buckets %d' % num_buckets)

    if isinstance(x, (tf.SparseTensor, tf.RaggedTensor)) and elementwise:
      raise ValueError(
          'bucketize requires `x` to be dense if `elementwise=True`')

    if epsilon is None:
      # See explanation in args documentation for epsilon.
      epsilon = min(1.0 / num_buckets, 0.01)

    x_values = tf_utils.get_values(x)
    bucket_boundaries = analyzers.quantiles(
        x_values,
        num_buckets,
        epsilon,
        weights,
        reduce_instance_dims=not elementwise)

    if not elementwise:
      return apply_buckets(x, bucket_boundaries)

    num_features = tf.math.reduce_prod(x.get_shape()[1:])
    bucket_boundaries = tf.reshape(bucket_boundaries, [num_features, -1])
    x_reshaped = tf.reshape(x, [-1, num_features])
    bucketized = []
    for idx, boundaries in enumerate(tf.unstack(bucket_boundaries, axis=0)):
      bucketized.append(apply_buckets(x_reshaped[:, idx],
                                      tf.expand_dims(boundaries, axis=0)))
    return tf.reshape(tf.stack(bucketized, axis=1),
                      [-1] + x.get_shape().as_list()[1:])
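
A minimal usage sketch (the feature name 'price' and the bucket count are illustrative, not part of the API); as with the other examples in this module, the preprocessing_fn would be run through tft_beam.AnalyzeAndTransformDataset:

def preprocessing_fn(inputs):
  # Assign each value of the numeric feature 'price' to one of 4
  # approximately equal-sized quantile buckets; the output is an integer
  # bucket index in [0, actual_num_buckets).
  return {'price_bucket': tft.bucketize(inputs['price'], num_buckets=4)}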

bucketize_per_key

bucketize_per_key(
    x: ConsistentTensorType,
    key: ConsistentTensorType,
    num_buckets: int,
    epsilon: Optional[float] = None,
    weights: Optional[ConsistentTensorType] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType

Returns a bucketized column, with a bucket index assigned to each input.

PARAMETER DESCRIPTION
x

A numeric input Tensor, SparseTensor, or RaggedTensor with rank 1, whose values should be mapped to buckets. CompositeTensors will have their non-missing values mapped and missing values left as missing.

TYPE: ConsistentTensorType

key

A Tensor, SparseTensor, or RaggedTensor with the same shape as x and dtype tf.string. If x is a CompositeTensor, key must exactly match x in everything except values, i.e. indices and dense_shape or nested row splits must be identical.

TYPE: ConsistentTensorType

num_buckets

Values in the input x are divided into approximately equal-sized buckets, where the number of buckets is num_buckets.

TYPE: int

epsilon

(Optional) see bucketize.

TYPE: Optional[float] DEFAULT: None

weights

(Optional) A Tensor, SparseTensor, or RaggedTensor with the same shape as x and dtype tf.float32. Used as weights for quantiles calculation. If x is a CompositeTensor, weights must exactly match x in everything except values.

TYPE: Optional[ConsistentTensorType] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor of the same shape as x, with each element in the returned tensor representing the bucketized value. Bucketized values are in the range [0, actual_num_buckets). If the computed key vocabulary doesn't have an entry for key, the resulting bucket is -1.

RAISES DESCRIPTION
ValueError

If value of num_buckets is not > 1.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def bucketize_per_key(
    x: common_types.ConsistentTensorType,
    key: common_types.ConsistentTensorType,
    num_buckets: int,
    epsilon: Optional[float] = None,
    weights: Optional[common_types.ConsistentTensorType] = None,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Returns a bucketized column, with a bucket index assigned to each input.

  Args:
    x: A numeric input `Tensor`, `SparseTensor`, or `RaggedTensor` with rank 1,
      whose values should be mapped to buckets.  `CompositeTensor`s will have
      their non-missing values mapped and missing values left as missing.
    key: A `Tensor`, `SparseTensor`, or `RaggedTensor` with the same shape as
      `x` and dtype tf.string.  If `x` is a `CompositeTensor`, `key` must
      exactly match `x` in everything except values, i.e. indices and
      dense_shape or nested row splits must be identical.
    num_buckets: Values in the input `x` are divided into approximately
      equal-sized buckets, where the number of buckets is num_buckets.
    epsilon: (Optional) see `bucketize`.
    weights: (Optional) A `Tensor`, `SparseTensor`, or `RaggedTensor` with the
      same shape as `x` and dtype tf.float32. Used as weights for quantiles
      calculation. If `x` is a `CompositeTensor`, `weights` must exactly match
      `x` in everything except values.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` of the same shape as `x`, with
    each element in the returned tensor representing the bucketized value.
    Bucketized value is in the range [0, actual_num_buckets). If the computed
    key vocabulary doesn't have an entry for `key` then the resulting bucket is
    -1.

  Raises:
    ValueError: If value of num_buckets is not > 1.
  """
  with tf.compat.v1.name_scope(name, 'bucketize_per_key'):
    if not isinstance(num_buckets, int):
      raise TypeError(
          'num_buckets must be an int, got {}'.format(type(num_buckets)))

    if num_buckets < 1:
      raise ValueError('Invalid num_buckets {}'.format(num_buckets))

    if epsilon is None:
      # See explanation in args documentation for epsilon.
      epsilon = min(1.0 / num_buckets, 0.01)

    (key_vocab, bucket_boundaries, scale_factor_per_key, shift_per_key,
     actual_num_buckets) = (
         analyzers._quantiles_per_key(  # pylint: disable=protected-access
             tf_utils.get_values(x),
             tf_utils.get_values(key),
             num_buckets,
             epsilon,
             weights=tf_utils.get_values(weights)))
    return _apply_buckets_with_keys(x, key, key_vocab, bucket_boundaries,
                                    scale_factor_per_key, shift_per_key,
                                    actual_num_buckets)
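
A minimal usage sketch, assuming a numeric feature 'value' and a string feature 'country' of the same shape (both names are illustrative only):

def preprocessing_fn(inputs):
  # Compute quantile boundaries separately for each key (here, per country)
  # and bucketize each 'value' against the boundaries of its own key.
  return {
      'value_bucket_per_country': tft.bucketize_per_key(
          inputs['value'], inputs['country'], num_buckets=10)
  }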

compute_and_apply_vocabulary

compute_and_apply_vocabulary(
    x: ConsistentTensorType,
    *,
    default_value: Any = -1,
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[Tensor] = None,
    labels: Optional[Tensor] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: float = 0.0,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: bool = False,
    file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[
        Union[Iterable[str], Tensor]
    ] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType

Generates a vocabulary for x and maps it to an integer with this vocab.

In case one of the tokens contains the '\n' or '\r' characters or is empty it will be discarded since we are currently writing the vocabularies as text files. This behavior will likely be fixed/improved in the future.

Note that this function will cause a vocabulary to be computed. For large datasets it is highly recommended to either set frequency_threshold or top_k to control the size of the vocabulary, and also the run time of this operation.

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor of type tf.string or tf.int[8|16|32|64].

TYPE: ConsistentTensorType

default_value

The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.

TYPE: Any DEFAULT: -1

top_k

Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated.

TYPE: Optional[int] DEFAULT: None

frequency_threshold

Limit the generated vocabulary only to elements whose absolute frequency is >= the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element. If labels are provided and the vocab is computed using mutual information, tokens are filtered if their mutual information with the label is < the supplied threshold.

TYPE: Optional[int] DEFAULT: None

num_oov_buckets

Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value.

TYPE: int DEFAULT: 0

vocab_filename

The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph will be used as the file name. If not None, should be unique within a given preprocessing function. NOTE: in order to make your pipelines resilient to implementation details, please set vocab_filename when you are consuming the vocabulary file in a downstream component.

TYPE: Optional[str] DEFAULT: None

weights

(Optional) Weights Tensor for the vocabulary. It must have the same shape as x.

TYPE: Optional[Tensor] DEFAULT: None

labels

(Optional) A Tensor of labels for the vocabulary. If provided, the vocabulary is calculated based on mutual information with the label, rather than frequency. The labels must have the same batch dimension as x. If x is sparse, labels should be a 1D tensor reflecting row-wise labels. If x is dense, labels can either be a 1D tensor of row-wise labels, or a dense tensor of the identical shape as x (i.e. element-wise labels). Labels should be a discrete integerized tensor (If the label is numeric, it should first be bucketized; If the label is a string, an integer vocabulary should first be applied). Note: CompositeTensor labels are not yet supported (b/134931826). WARNING: when labels are provided, the frequency_threshold argument functions as a mutual information threshold, which is a float. TODO(b/116308354): Fix confusing naming.

TYPE: Optional[Tensor] DEFAULT: None

use_adjusted_mutual_info

If true, use adjusted mutual information.

TYPE: bool DEFAULT: False

min_diff_from_avg

Mutual information of a feature will be adjusted to zero whenever the difference between count of the feature with any label and its expected count is lower than min_diff_from_average.

TYPE: float DEFAULT: 0.0

coverage_top_k

(Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary.

TYPE: Optional[int] DEFAULT: None

coverage_frequency_threshold

(Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key.

TYPE: Optional[int] DEFAULT: None

key_fn

(Optional), (Experimental) A fn that takes in a single entry of x and returns the corresponding key for coverage calculation. If this is None, no coverage arm is added to the vocabulary.

TYPE: Optional[Callable[[Any], Any]] DEFAULT: None

fingerprint_shuffle

(Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above will still take effect.

TYPE: bool DEFAULT: False

file_format

(Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.

TYPE: VocabularyFileFormatType DEFAULT: DEFAULT_VOCABULARY_FILE_FORMAT

store_frequency

If True, frequency of the words is stored in the vocabulary file. In the case labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and text_format is 'text' then spaces will be replaced to avoid information loss.

TYPE: Optional[bool] DEFAULT: False

reserved_tokens

(Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache.

TYPE: Optional[Union[Iterable[str], Tensor]] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer, and integers are consecutive starting from zero. A string value not in the vocabulary is assigned default_value. Alternatively, if num_oov_buckets is specified, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets) for an overall range of [0, vocab_size + num_oov_buckets).

RAISES DESCRIPTION
ValueError

If top_k or frequency_threshold is negative. If coverage_top_k or coverage_frequency_threshold is negative.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def compute_and_apply_vocabulary(
    x: common_types.ConsistentTensorType,
    *,  # Force passing optional parameters by keys.
    default_value: Any = -1,
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[tf.Tensor] = None,
    labels: Optional[tf.Tensor] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: float = 0.0,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: bool = False,
    file_format: common_types.VocabularyFileFormatType = analyzers.DEFAULT_VOCABULARY_FILE_FORMAT,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Iterable[str], tf.Tensor]] = None,
    name: Optional[str] = None,
) -> common_types.ConsistentTensorType:
  r"""Generates a vocabulary for `x` and maps it to an integer with this vocab.

  In case one of the tokens contains the '\n' or '\r' characters or is empty it
  will be discarded since we are currently writing the vocabularies as text
  files. This behavior will likely be fixed/improved in the future.

  Note that this function will cause a vocabulary to be computed.  For large
  datasets it is highly recommended to either set frequency_threshold or top_k
  to control the size of the vocabulary, and also the run time of this
  operation.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor` of type tf.string or
      tf.int[8|16|32|64].
    default_value: The value to use for out-of-vocabulary values, unless
      'num_oov_buckets' is greater than zero.
    top_k: Limit the generated vocabulary to the first `top_k` elements. If set
      to None, the full vocabulary is generated.
    frequency_threshold: Limit the generated vocabulary only to elements whose
      absolute frequency is >= to the supplied threshold. If set to None, the
      full vocabulary is generated.  Absolute frequency means the number of
      occurences of the element in the dataset, as opposed to the proportion of
      instances that contain that element. If labels are provided and the vocab
      is computed using mutual information, tokens are filtered if their mutual
      information with the label is < the supplied threshold.
    num_oov_buckets:  Any lookup of an out-of-vocabulary token will return a
      bucket ID based on its hash if `num_oov_buckets` is greater than zero.
      Otherwise it is assigned the `default_value`.
    vocab_filename: The file name for the vocabulary file. If None, a name based
      on the scope name in the context of this graph will be used as the file
      name. If not None, should be unique within a given preprocessing function.
      NOTE in order to make your pipelines resilient to implementation details
      please set `vocab_filename` when you are using the vocab_filename on a
      downstream component.
    weights: (Optional) Weights `Tensor` for the vocabulary. It must have the
      same shape as x.
    labels: (Optional) A `Tensor` of labels for the vocabulary. If provided, the
      vocabulary is calculated based on mutual information with the label,
      rather than frequency. The labels must have the same batch dimension as x.
      If x is sparse, labels should be a 1D tensor reflecting row-wise labels.
      If x is dense, labels can either be a 1D tensor of row-wise labels, or a
      dense tensor of the identical shape as x (i.e. element-wise labels).
      Labels should be a discrete integerized tensor (If the label is numeric,
      it should first be bucketized; If the label is a string, an integer
      vocabulary should first be applied). Note: `CompositeTensor` labels are
      not yet supported (b/134931826). WARNING: when labels are provided, the
      frequency_threshold argument functions as a mutual information threshold,
      which is a float. TODO(b/116308354): Fix confusing naming.
    use_adjusted_mutual_info: If true, use adjusted mutual information.
    min_diff_from_avg: Mutual information of a feature will be adjusted to zero
      whenever the difference between count of the feature with any label and
      its expected count is lower than min_diff_from_average.
    coverage_top_k: (Optional), (Experimental) The minimum number of elements
      per key to be included in the vocabulary.
    coverage_frequency_threshold: (Optional), (Experimental) Limit the coverage
      arm of the vocabulary only to elements whose absolute frequency is >= this
      threshold for a given key.
    key_fn: (Optional), (Experimental) A fn that takes in a single entry of `x`
      and returns the corresponding key for coverage calculation. If this is
      `None`, no coverage arm is added to the vocabulary.
    fingerprint_shuffle: (Optional), (Experimental) Whether to sort the
      vocabularies by fingerprint instead of counts. This is useful for load
      balancing on the training parameter servers. Shuffle only happens while
      writing the files, so all the filters above will still take effect.
    file_format: (Optional) A str. The format of the resulting vocabulary file.
      Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires
      tensorflow>=2.4. The default value is 'text'.
    store_frequency: If True, frequency of the words is stored in the vocabulary
      file. In the case labels are provided, the mutual information is stored in
      the file instead. Each line in the file will be of the form 'frequency
      word'. NOTE: if True and text_format is 'text' then spaces will be
      replaced to avoid information loss.
    reserved_tokens: (Optional) A list of tokens that should appear in the
      vocabulary regardless of their appearance in the input. These tokens would
      maintain their order, and have a reserved spot at the beginning of the
      vocabulary. Note: this field has no affect on cache.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` where each string value is
    mapped to an integer. Each unique string value that appears in the
    vocabulary is mapped to a different integer and integers are consecutive
    starting from zero. String value not in the vocabulary is assigned
    `default_value`. Alternatively, if `num_oov_buckets` is specified, out of
    vocabulary strings are hashed to values in
    [vocab_size, vocab_size + num_oov_buckets) for an overall range of
    [0, vocab_size + num_oov_buckets).

  Raises:
    ValueError: If `top_k` or `frequency_threshold` is negative.
      If `coverage_top_k` or `coverage_frequency_threshold` is negative.
  """
  with tf.compat.v1.name_scope(name, 'compute_and_apply_vocabulary'):
    if store_frequency and file_format == 'text':
      x = tf_utils.maybe_format_vocabulary_input(x)
    deferred_vocab_and_filename = analyzers.vocabulary(
        x=x,
        top_k=top_k,
        frequency_threshold=frequency_threshold,
        vocab_filename=vocab_filename,
        store_frequency=store_frequency,
        weights=weights,
        labels=labels,
        use_adjusted_mutual_info=use_adjusted_mutual_info,
        min_diff_from_avg=min_diff_from_avg,
        coverage_top_k=coverage_top_k,
        coverage_frequency_threshold=coverage_frequency_threshold,
        key_fn=key_fn,
        fingerprint_shuffle=fingerprint_shuffle,
        file_format=file_format,
        reserved_tokens=reserved_tokens,
    )
    return _apply_vocabulary_internal(
        x,
        deferred_vocab_and_filename,
        default_value,
        num_oov_buckets,
        lookup_fn=None,
        store_frequency=store_frequency,
        file_format=file_format,
        name=None,
    )
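
A minimal usage sketch, assuming a string feature 's' (the feature name and the top_k / num_oov_buckets values are illustrative):

def preprocessing_fn(inputs):
  # Build a vocabulary over 's' during analysis and map each string to its
  # integer index; the 1000 most frequent tokens get ids in [0, 1000) and
  # everything else is hashed into one of 10 OOV buckets.
  return {
      's_integerized': tft.compute_and_apply_vocabulary(
          inputs['s'], top_k=1000, num_oov_buckets=10)
  }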

count_per_key

count_per_key(
    key: TensorType,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None,
)

Computes the count of each element of a Tensor.

PARAMETER DESCRIPTION
key

A Tensor, SparseTensor, or RaggedTensor of dtype tf.string or tf.int.

TYPE: TensorType

key_vocabulary_filename

(Optional) The file name for the key-output mapping file. If None and key are provided, this combiner assumes the keys fit in memory and will not store the result in a file. If empty string, a file name will be chosen based on the current scope. If not an empty string, should be unique within a given preprocessing function.

TYPE: Optional[str] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Either

(A) Two Tensors: one the key vocab with dtype of input; the other the count for each key, dtype tf.int64. (if key_vocabulary_filename is None).

(B) The filename where the key-value mapping is stored (if key_vocabulary_filename is not None).

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def count_per_key(key: common_types.TensorType,
                  key_vocabulary_filename: Optional[str] = None,
                  name: Optional[str] = None):
  """Computes the count of each element of a `Tensor`.

  Args:
    key: A `Tensor`, `SparseTensor`, or `RaggedTensor` of dtype tf.string or
      tf.int.
    key_vocabulary_filename: (Optional) The file name for the key-output mapping
      file. If None and key are provided, this combiner assumes the keys fit in
      memory and will not store the result in a file. If empty string, a file
      name will be chosen based on the current scope. If not an empty string,
      should be unique within a given preprocessing function.
    name: (Optional) A name for this operation.

  Returns:
    Either:
    (A) Two `Tensor`s: one the key vocab with dtype of input;
        the other the count for each key, dtype tf.int64. (if
        key_vocabulary_filename is None).
    (B) The filename where the key-value mapping is stored (if
        key_vocabulary_filename is not None).

  Raises:
    TypeError: If the type of `x` is not supported.
  """

  with tf.compat.v1.name_scope(name, 'count_per_key'):
    key_dtype = key.dtype
    batch_keys, batch_counts = tf_utils.reduce_batch_count_per_key(key)

    output_dtype, sum_fn = _sum_combine_fn_and_dtype(tf.int64)
    numeric_combine_result = _numeric_combine(
        inputs=[batch_counts],
        fn=sum_fn,
        default_accumulator_value=0,
        reduce_instance_dims=True,
        output_dtypes=[output_dtype],
        key=batch_keys,
        key_vocabulary_filename=key_vocabulary_filename)

    if key_vocabulary_filename is not None:
      return numeric_combine_result
    keys, counts = numeric_combine_result
    if key_dtype is not tf.string:
      keys = tf.strings.to_number(keys, key_dtype)
    return keys, counts
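
A minimal usage sketch, assuming a string feature 'label' (illustrative name). The analyzer outputs are constants with respect to the dataset, so they are broadcast to a per-instance column here:

def preprocessing_fn(inputs):
  # keys: the vocabulary of observed label values; counts: the number of
  # occurrences of each key over the whole dataset (dtype tf.int64).
  keys, counts = tft.count_per_key(inputs['label'])
  # Broadcast the dataset-wide total so it can be emitted per instance.
  total = tf.zeros(tf.shape(inputs['label']), tf.int64) + tf.reduce_sum(counts)
  return {'label': inputs['label'], 'total_labels_seen': total}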

covariance

covariance(
    x: Tensor, dtype: DType, name: Optional[str] = None
) -> Tensor

Computes the covariance matrix over the whole dataset.

The covariance matrix M is defined as follows: Let x[:j] be a tensor of the jth element of all input vectors in x, and let u_j = mean(x[:j]). The entry M[i,j] = E[(x[:i] - u_i)(x[:j] - u_j)]. Notice that the diagonal entries correspond to variances of individual elements in the vector, i.e. M[i,i] corresponds to the variance of x[:i].

PARAMETER DESCRIPTION
x

A rank-2 Tensor, 0th dim are rows, 1st dim are indices in each input vector.

TYPE: Tensor

dtype

Tensorflow dtype of entries in the returned matrix.

TYPE: DType

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RAISES DESCRIPTION
ValueError

if input is not a rank-2 Tensor.

RETURNS DESCRIPTION
Tensor

A rank-2 (matrix) covariance Tensor

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def covariance(x: tf.Tensor,
               dtype: tf.DType,
               name: Optional[str] = None) -> tf.Tensor:
  """Computes the covariance matrix over the whole dataset.

  The covariance matrix M is defined as follows:
  Let x[:j] be a tensor of the jth element of all input vectors in x, and let
  u_j = mean(x[:j]). The entry M[i,j] = E[(x[:i] - u_i)(x[:j] - u_j)].
  Notice that the diagonal entries correspond to variances of individual
  elements in the vector, i.e. M[i,i] corresponds to the variance of x[:i].

  Args:
    x: A rank-2 `Tensor`, 0th dim are rows, 1st dim are indices in each input
      vector.
    dtype: Tensorflow dtype of entries in the returned matrix.
    name: (Optional) A name for this operation.

  Raises:
    ValueError: if input is not a rank-2 Tensor.

  Returns:
    A rank-2 (matrix) covariance `Tensor`
  """

  if not isinstance(x, tf.Tensor):
    raise TypeError('Expected a Tensor, but got %r' % x)

  with tf.compat.v1.name_scope(name, 'covariance'):
    x.shape.assert_has_rank(2)

    input_dim = x.shape.as_list()[1]
    shape = (input_dim, input_dim)

    (result,) = _apply_cacheable_combiner(
        CovarianceCombiner(shape, dtype.as_numpy_dtype), x)
    return result
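
A minimal usage sketch, assuming a rank-2 float feature 'embedding', e.g. parsed with tf.io.FixedLenFeature([8], tf.float32) (the feature name and width are illustrative):

def preprocessing_fn(inputs):
  # A constant [8, 8] covariance matrix computed over the whole dataset.
  cov = tft.covariance(inputs['embedding'], dtype=tf.float32)
  # The diagonal holds per-dimension variances; broadcast them against the
  # batched input as one possible downstream use.
  variances = tf.linalg.diag_part(cov)
  return {'embedding_times_var': inputs['embedding'] * variances}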

deduplicate_tensor_per_row

deduplicate_tensor_per_row(input_tensor, name=None)

Deduplicates each row (0-th dimension) of the provided tensor.

PARAMETER DESCRIPTION
input_tensor

A two-dimensional Tensor or SparseTensor. The first dimension is assumed to be the batch or "row" dimension, and deduplication is done on the 2nd dimension. If the Tensor is 1D it is returned as the equivalent SparseTensor, since the "row" is a scalar and can't be further deduplicated.

name

Optional name for the operation.

DEFAULT: None

RETURNS DESCRIPTION

A SparseTensor containing the unique set of values from each row of the input. Note: the original order of the input may not be preserved.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def deduplicate_tensor_per_row(input_tensor, name=None):
  """Deduplicates each row (0-th dimension) of the provided tensor.

  Args:
    input_tensor: A two-dimensional `Tensor` or `SparseTensor`. The first
      dimension is assumed to be the batch or "row" dimension, and deduplication
      is done on the 2nd dimension. If the Tensor is 1D it is returned as the
      equivalent `SparseTensor` since the "row" is a scalar can't be further
      deduplicated.
    name: Optional name for the operation.

  Returns:
    A  `SparseTensor` containing the unique set of values from each
      row of the input. Note: the original order of the input may not be
      preserved.
  """
  with tf.compat.v1.name_scope(name, 'deduplicate_per_row'):

    if isinstance(input_tensor, tf.SparseTensor):
      batch_dim = tf.cast(input_tensor.dense_shape[0], tf.int32)
      rank = input_tensor.dense_shape.shape[0]
    else:
      batch_dim = tf.cast(tf.shape(input_tensor)[0], tf.int32)
      rank = input_tensor.shape.rank

    def _univalent_dense_to_sparse(batch_dim, input_tensor):
      """Helper to convert a 1D dense `Tensor` to a `SparseTensor`."""
      indices = tf.cast(
          tf.stack([
              tf.range(batch_dim, dtype=tf.int32),
              tf.zeros(batch_dim, dtype=tf.int32)
          ],
                   axis=1),
          dtype=tf.int64)

      return tf.SparseTensor(
          indices=indices, values=input_tensor, dense_shape=(batch_dim, 1))

    if rank is not None:
      # If the rank is known at graph construction time, and it's rank 1, there
      # is no deduplication to be done so we can return early.
      if rank <= 1:
        if isinstance(input_tensor, tf.SparseTensor):
          return input_tensor
        # Even though we are just returning as is, we convert to a SparseTensor
        # to ensure consistent output type.
        return _univalent_dense_to_sparse(batch_dim, input_tensor)
      if rank > 2:
        raise ValueError(
            'Deduplication assumes a rank 2 tensor, got {}.'.format(rank))
      return _deduplicate_tensor_per_row(input_tensor, batch_dim)

    if isinstance(input_tensor, tf.SparseTensor):
      return _deduplicate_tensor_per_row(input_tensor, batch_dim)
    else:
      # Again check for rank 1 tensor (that doesn't need deduplication), this
      # time handling inputs where rank isn't known until execution time.
      dynamic_rank = tf.rank(input_tensor)
      return tf.cond(
          tf.equal(dynamic_rank, 1),
          lambda: _univalent_dense_to_sparse(batch_dim, input_tensor),
          lambda: _deduplicate_tensor_per_row(input_tensor, batch_dim),
      )
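
A minimal usage sketch, assuming a variable-length string feature 'tags' parsed as a SparseTensor (illustrative name):

def preprocessing_fn(inputs):
  # Drop repeated tags within each example; the result is a SparseTensor
  # and the original within-row order may not be preserved.
  return {'unique_tags': tft.deduplicate_tensor_per_row(inputs['tags'])}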

estimated_probability_density

estimated_probability_density(
    x: Tensor,
    boundaries: Optional[Union[Tensor, int]] = None,
    categorical: bool = False,
    name: Optional[str] = None,
) -> Tensor

Computes an approximate probability density at each x, given the bins.

Using this type of fixed-interval method has several benefits compared to bucketization, although it may not always be preferred:

1. Quantiles does not work on categorical data.
2. The quantiles algorithm does not currently operate on multiple features jointly, only independently.

Outlier detection in a multi-modal or arbitrary distribution.

Imagine a value x where a simple model is highly predictive of a target y within certain densely populated ranges. Outside these ranges, we may want to treat the data differently, but there are too few samples for the model to detect them by case-by-case treatment. One option would be to use the density estimate for this purpose:

outputs['x_density'] = tft.estimated_prob(inputs['x'], bins=100)
outputs['outlier_x'] = tf.where(outputs['x_density'] < OUTLIER_THRESHOLD,
                                tf.constant([1]), tf.constant([0]))

This exercise uses a single variable for illustration, but a direct density metric would become more useful with higher dimensions.

Note that we normalize by average bin_width to arrive at a probability density estimate. The result resembles a pdf, not the probability that a value falls in the bucket (except in the categorical case).

PARAMETER DESCRIPTION
x

A Tensor.

TYPE: Tensor

boundaries

(Optional) A Tensor or int used to approximate the density. If possible provide boundaries as a Tensor of multiple sorted values. Will default to 10 intervals over the 0-1 range, or find the min/max if an int is provided (not recommended because multi-phase analysis is inefficient). If the boundaries are known as potentially arbitrary interval boundaries, sizes are assumed to be equal. If the sizes are unequal, density may be inaccurate. Ignored if categorical is true.

TYPE: Optional[Union[Tensor, int]] DEFAULT: None

categorical

(Optional) A bool that will treat x as categorical if true.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor the same shape as x, the probability density estimate at x (or probability mass estimate if categorical is True).

RAISES DESCRIPTION
NotImplementedError

If x is CompositeTensor.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def estimated_probability_density(x: tf.Tensor,
                                  boundaries: Optional[Union[tf.Tensor,
                                                             int]] = None,
                                  categorical: bool = False,
                                  name: Optional[str] = None) -> tf.Tensor:
  """Computes an approximate probability density at each x, given the bins.

  Using this type of fixed-interval method has several benefits compared to
    bucketization, although may not always be preferred.
    1. Quantiles does not work on categorical data.
    2. The quantiles algorithm does not currently operate on multiple features
    jointly, only independently.

  Ex: Outlier detection in a multi-modal or arbitrary distribution.
    Imagine a value x where a simple model is highly predictive of a target y
    within certain densely populated ranges. Outside these ranges, we may want
    to treat the data differently, but there are too few samples for the model
    to detect them by case-by-case treatment.
    One option would be to use the density estimate for this purpose:

    outputs['x_density'] = tft.estimated_prob(inputs['x'], bins=100)
    outputs['outlier_x'] = tf.where(outputs['x_density'] < OUTLIER_THRESHOLD,
                                    tf.constant([1]), tf.constant([0]))

    This exercise uses a single variable for illustration, but a direct density
    metric would become more useful with higher dimensions.

  Note that we normalize by average bin_width to arrive at a probability density
  estimate. The result resembles a pdf, not the probability that a value falls
  in the bucket (except in the categorical case).

  Args:
    x: A `Tensor`.
    boundaries: (Optional) A `Tensor` or int used to approximate the density.
        If possible provide boundaries as a Tensor of multiple sorted values.
        Will default to 10 intervals over the 0-1 range, or find the min/max
        if an int is provided (not recommended because multi-phase analysis is
        inefficient). If the boundaries are known as potentially arbitrary
        interval boundaries, sizes are assumed to be equal. If the sizes are
        unequal, density may be inaccurate. Ignored if `categorical` is true.
    categorical: (Optional) A `bool` that will treat x as categorical if true.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` the same shape as x, the probability density estimate at x (or
    probability mass estimate if `categorical` is True).

  Raises:
    NotImplementedError: If `x` is CompositeTensor.
  """
  with tf.compat.v1.name_scope(name, 'estimated_probability_density'):
    if isinstance(x, (tf.SparseTensor, tf.RaggedTensor)):
      raise NotImplementedError(
          'estimated probability density does not support Composite Tensors')
    if x.get_shape().ndims > 1 and x.shape[-1] > 1:
      raise NotImplementedError(
          'estimated probability density does not support multiple dimensions')

    counts, boundaries = analyzers.histogram(x, boundaries=boundaries,
                                             categorical=categorical)

    xdims = x.get_shape().ndims
    counts = tf.cast(counts, tf.float32)
    probabilities = counts / tf.reduce_sum(counts)

    x = tf.reshape(x, [-1])

    if categorical:
      bucket_indices = tf_utils.lookup_key(x, boundaries)
      bucket_densities = probabilities
    else:
      # We need to compute the bin width so that density does not depend on
      # number of intervals.
      bin_width = tf.cast(boundaries[0, -1] - boundaries[0, 0], tf.float32) / (
          tf.cast(tf.size(probabilities), tf.float32))
      bucket_densities = probabilities / bin_width

      bucket_indices = tf_utils.assign_buckets(
          tf.cast(x, tf.float32),
          analyzers.remove_leftmost_boundary(boundaries))
    bucket_indices = tf_utils._align_dims(bucket_indices, xdims)  # pylint: disable=protected-access

    # In the categorical case, when keys are missing, the indices may be -1,
    # therefore we replace those with 0 in order to use tf.gather.
    adjusted_bucket_indices = tf.where(
        bucket_indices < 0, _fill_shape(0, tf.shape(bucket_indices), tf.int64),
        bucket_indices)
    bucket_densities = tf.gather(bucket_densities, adjusted_bucket_indices)
    return tf.where(bucket_indices < 0,
                    _fill_shape(0, tf.shape(bucket_indices), tf.float32),
                    bucket_densities)
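
Note that the outlier-detection snippet in the description above uses an older spelling (estimated_prob, bins=...); with the signature documented here, a comparable sketch would be (the feature name 'x' and the threshold value are illustrative):

OUTLIER_THRESHOLD = 0.01  # assumed, application-specific constant

def preprocessing_fn(inputs):
  # Approximate density of 'x' using 100 equal-width bins; passing an int
  # for `boundaries` triggers an extra min/max analysis pass.
  density = tft.estimated_probability_density(inputs['x'], boundaries=100)
  outlier = tf.cast(density < OUTLIER_THRESHOLD, tf.int64)
  return {'x_density': density, 'outlier_x': outlier}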

get_analyze_input_columns

get_analyze_input_columns(
    preprocessing_fn: Callable[
        [Mapping[str, TensorType]], Mapping[str, TensorType]
    ],
    specs: Mapping[str, Union[FeatureSpecType, TypeSpec]],
    force_tf_compat_v1: bool = False,
) -> List[str]

Return columns that are required inputs of AnalyzeDataset.

PARAMETER DESCRIPTION
preprocessing_fn

A tf.transform preprocessing_fn.

TYPE: Callable[[Mapping[str, TensorType]], Mapping[str, TensorType]]

specs

A dict of feature name to tf.TypeSpecs. If force_tf_compat_v1 is True, this can also be feature specifications.

TYPE: Mapping[str, Union[FeatureSpecType, TypeSpec]]

force_tf_compat_v1

(Optional) If True, use Tensorflow in compat.v1 mode. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
List[str]

A list of columns that are required inputs of analyzers.

Source code in tensorflow_transform/inspect_preprocessing_fn.py
def get_analyze_input_columns(
    preprocessing_fn: Callable[[Mapping[str, common_types.TensorType]],
                               Mapping[str, common_types.TensorType]],
    specs: Mapping[str, Union[common_types.FeatureSpecType, tf.TypeSpec]],
    force_tf_compat_v1: bool = False) -> List[str]:
  """Return columns that are required inputs of `AnalyzeDataset`.

  Args:
    preprocessing_fn: A tf.transform preprocessing_fn.
    specs: A dict of feature name to tf.TypeSpecs. If `force_tf_compat_v1` is
      True, this can also be feature specifications.
    force_tf_compat_v1: (Optional) If `True`, use Tensorflow in compat.v1 mode.
      Defaults to `False`.

  Returns:
    A list of columns that are required inputs of analyzers.
  """
  use_tf_compat_v1 = tf2_utils.use_tf_compat_v1(force_tf_compat_v1)
  if not use_tf_compat_v1:
    assert all([isinstance(s, tf.TypeSpec) for s in specs.values()]), specs
  graph, structured_inputs, structured_outputs = (
      impl_helper.trace_preprocessing_function(
          preprocessing_fn, specs, use_tf_compat_v1=use_tf_compat_v1))

  tensor_sinks = graph.get_collection(analyzer_nodes.TENSOR_REPLACEMENTS)
  visitor = graph_tools.SourcedTensorsVisitor()
  for tensor_sink in tensor_sinks:
    nodes.Traverser(visitor).visit_value_node(tensor_sink.future)

  if use_tf_compat_v1:
    control_dependency_ops = []
  else:
    # If traced in TF2 as a tf.function, inputs that end up in control
    # dependencies are required for the function to execute. Return such inputs
    # as required inputs of analyzers as well.
    _, control_dependency_ops = (
        tf2_utils.strip_and_get_tensors_and_control_dependencies(
            tf.nest.flatten(structured_outputs, expand_composites=True)))

  output_tensors = list(
      itertools.chain(visitor.sourced_tensors, control_dependency_ops))
  analyze_input_tensors = graph_tools.get_dependent_inputs(
      graph, structured_inputs, output_tensors)
  return list(analyze_input_tensors.keys())

get_num_buckets_for_transformed_feature

get_num_buckets_for_transformed_feature(
    transformed_feature: TensorType,
) -> Tensor

Provides the number of buckets for a transformed feature if annotated.

This for example can be used for the direct output of tft.bucketize, tft.apply_buckets, tft.compute_and_apply_vocabulary, tft.apply_vocabulary. These methods annotate the transformed feature with additional information. If the given transformed_feature isn't annotated, this method will fail.

Example:

>>> def preprocessing_fn(inputs):
...   bucketized = tft.bucketize(inputs['x'], num_buckets=3)
...   integerized = tft.compute_and_apply_vocabulary(inputs['x'])
...   zeros = tf.zeros_like(inputs['x'], tf.int64)
...   return {
...      'bucketized': bucketized,
...      'bucketized_num_buckets': (
...         zeros + tft.get_num_buckets_for_transformed_feature(bucketized)),
...      'integerized': integerized,
...      'integerized_num_buckets': (
...         zeros + tft.get_num_buckets_for_transformed_feature(integerized)),
...   }
>>> raw_data = [dict(x=3),dict(x=23)]
>>> feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'bucketized': 1, 'bucketized_num_buckets': 3,
 'integerized': 0, 'integerized_num_buckets': 2},
{'bucketized': 2, 'bucketized_num_buckets': 3,
 'integerized': 1, 'integerized_num_buckets': 2}]

PARAMETER DESCRIPTION
transformed_feature

A Tensor or SparseTensor which is the direct output of tft.bucketize, tft.apply_buckets, tft.compute_and_apply_vocabulary or tft.apply_vocabulary.

TYPE: TensorType

RAISES DESCRIPTION
ValueError

If the given tensor has not been annotated with the number of buckets.

RETURNS DESCRIPTION
Tensor

A Tensor with the number of buckets for the given transformed_feature.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def get_num_buckets_for_transformed_feature(
    transformed_feature: common_types.TensorType) -> tf.Tensor:
  # pyformat: disable
  """Provides the number of buckets for a transformed feature if annotated.

  This for example can be used for the direct output of `tft.bucketize`,
  `tft.apply_buckets`, `tft.compute_and_apply_vocabulary`,
  `tft.apply_vocabulary`.
  These methods annotate the transformed feature with additional information.
  If the given `transformed_feature` isn't annotated, this method will fail.

  Example:

  >>> def preprocessing_fn(inputs):
  ...   bucketized = tft.bucketize(inputs['x'], num_buckets=3)
  ...   integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  ...   zeros = tf.zeros_like(inputs['x'], tf.int64)
  ...   return {
  ...      'bucketized': bucketized,
  ...      'bucketized_num_buckets': (
  ...         zeros + tft.get_num_buckets_for_transformed_feature(bucketized)),
  ...      'integerized': integerized,
  ...      'integerized_num_buckets': (
  ...         zeros + tft.get_num_buckets_for_transformed_feature(integerized)),
  ...   }
  >>> raw_data = [dict(x=3),dict(x=23)]
  >>> feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
  >>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
  >>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  ...   transformed_dataset, transform_fn = (
  ...       (raw_data, raw_data_metadata)
  ...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  >>> transformed_data, transformed_metadata = transformed_dataset
  >>> transformed_data
  [{'bucketized': 1, 'bucketized_num_buckets': 3,
   'integerized': 0, 'integerized_num_buckets': 2},
  {'bucketized': 2, 'bucketized_num_buckets': 3,
   'integerized': 1, 'integerized_num_buckets': 2}]

  Args:
    transformed_feature: A `Tensor` or `SparseTensor` which is the direct output
      of `tft.bucketize`, `tft.apply_buckets`,
      `tft.compute_and_apply_vocabulary` or `tft.apply_vocabulary`.

  Raises:
    ValueError: If the given tensor has not been annotated a the number of
    buckets.

  Returns:
    A `Tensor` with the number of buckets for the given `transformed_feature`.
  """
  # pyformat: enable
  # Adding 1 to the 2nd Tensor of the returned pair in order to compute max + 1.
  return tf.cast(
      schema_inference.get_tensor_schema_override(transformed_feature)[1] + 1,
      tf.int64)

get_transform_input_columns

get_transform_input_columns(
    preprocessing_fn: Callable[
        [Mapping[str, TensorType]], Mapping[str, TensorType]
    ],
    specs: Mapping[str, Union[FeatureSpecType, TypeSpec]],
    force_tf_compat_v1: bool = False,
) -> List[str]

Return columns that are required inputs of TransformDataset.

PARAMETER DESCRIPTION
preprocessing_fn

A tf.transform preprocessing_fn.

TYPE: Callable[[Mapping[str, TensorType]], Mapping[str, TensorType]]

specs

A dict of feature name to tf.TypeSpecs. If force_tf_compat_v1 is True, this can also be feature specifications.

TYPE: Mapping[str, Union[FeatureSpecType, TypeSpec]]

force_tf_compat_v1

(Optional) If True, use Tensorflow in compat.v1 mode. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
List[str]

A list of columns that are required inputs of the transform tf.Graph defined by preprocessing_fn.

Source code in tensorflow_transform/inspect_preprocessing_fn.py
def get_transform_input_columns(
    preprocessing_fn: Callable[[Mapping[str, common_types.TensorType]],
                               Mapping[str, common_types.TensorType]],
    specs: Mapping[str, Union[common_types.FeatureSpecType, tf.TypeSpec]],
    force_tf_compat_v1: bool = False) -> List[str]:
  """Return columns that are required inputs of `TransformDataset`.

  Args:
    preprocessing_fn: A tf.transform preprocessing_fn.
    specs: A dict of feature name to tf.TypeSpecs. If `force_tf_compat_v1` is
      True, this can also be feature specifications.
    force_tf_compat_v1: (Optional) If `True`, use Tensorflow in compat.v1 mode.
      Defaults to `False`.

  Returns:
    A list of columns that are required inputs of the transform `tf.Graph`
    defined by `preprocessing_fn`.
  """
  use_tf_compat_v1 = tf2_utils.use_tf_compat_v1(force_tf_compat_v1)
  if not use_tf_compat_v1:
    assert all([isinstance(s, tf.TypeSpec) for s in specs.values()]), specs
  graph, structured_inputs, structured_outputs = (
      impl_helper.trace_preprocessing_function(
          preprocessing_fn, specs, use_tf_compat_v1=use_tf_compat_v1))

  transform_input_tensors = graph_tools.get_dependent_inputs(
      graph, structured_inputs, structured_outputs)
  return list(transform_input_tensors.keys())
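
A minimal sketch showing both inspection helpers above (the feature names and specs are illustrative); with TF2 behaviors enabled, specs must be tf.TypeSpecs:

def preprocessing_fn(inputs):
  # 'x' needs a full analysis pass (tft.mean); 'y' is only transformed.
  return {
      'x_centered': inputs['x'] - tft.mean(inputs['x']),
      'y_copy': inputs['y'],
  }

specs = {
    'x': tf.TensorSpec([None], tf.float32),
    'y': tf.TensorSpec([None], tf.float32),
}
# Expected to contain only 'x', since only 'x' feeds an analyzer.
analyze_cols = tft.get_analyze_input_columns(preprocessing_fn, specs)
# Expected to contain both 'x' and 'y', since both feed the transform graph.
transform_cols = tft.get_transform_input_columns(preprocessing_fn, specs)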

hash_strings

hash_strings(
    strings: ConsistentTensorType,
    hash_buckets: int,
    key: Optional[Iterable[int]] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType

Hash strings into buckets.

PARAMETER DESCRIPTION
strings

a Tensor, SparseTensor, or RaggedTensor of dtype tf.string.

TYPE: ConsistentTensorType

hash_buckets

the number of hash buckets.

TYPE: int

key

optional. An array of two Python uint64. If passed, output will be a deterministic function of strings and key. Note that hashing will be slower if this value is specified.

TYPE: Optional[Iterable[int]] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor of dtype tf.int64 with the same shape as the input strings.

RAISES DESCRIPTION
TypeError

if strings is not a Tensor, SparseTensor, or RaggedTensor

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def hash_strings(
    strings: common_types.ConsistentTensorType,
    hash_buckets: int,
    key: Optional[Iterable[int]] = None,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Hash strings into buckets.

  Args:
    strings: a `Tensor`, `SparseTensor`, or `RaggedTensor` of dtype `tf.string`.
    hash_buckets: the number of hash buckets.
    key: optional. An array of two Python `uint64`. If passed, output will be a
      deterministic function of `strings` and `key`. Note that hashing will be
      slower if this value is specified.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` of dtype `tf.int64` with the
    same shape as
    the input `strings`.

  Raises:
    TypeError: if `strings` is not a `Tensor`, `SparseTensor`, or `RaggedTensor`
    of dtype `tf.string`.
  """
  if (not isinstance(strings, (tf.Tensor, tf.SparseTensor, tf.RaggedTensor)) or
      strings.dtype != tf.string):
    raise TypeError(
        'Input to hash_strings must be a `Tensor`, `SparseTensor`, or '
        f'`RaggedTensor` of dtype string; got {strings.dtype}')
  if isinstance(strings, tf.Tensor):
    if name is None:
      name = 'hash_strings'
    if key is None:
      return tf.strings.to_hash_bucket_fast(strings, hash_buckets, name=name)
    return tf.strings.to_hash_bucket_strong(
        strings, hash_buckets, key, name=name)
  else:
    compose_result_fn = _make_composite_tensor_wrapper_if_composite(strings)
    values = tf_utils.get_values(strings)
    return compose_result_fn(hash_strings(values, hash_buckets, key))
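
A minimal usage sketch, assuming a string feature 'user_id' (illustrative name):

def preprocessing_fn(inputs):
  # Deterministically map each string to one of 1000 hash buckets; pass
  # `key` (two Python uint64s) for a strong keyed hash at some extra cost.
  return {'user_bucket': tft.hash_strings(inputs['user_id'], hash_buckets=1000)}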

histogram

histogram(
    x: TensorType,
    boundaries: Optional[Union[Tensor, int]] = None,
    categorical: Optional[bool] = False,
    name: Optional[str] = None,
) -> Tuple[Tensor, Tensor]

Computes a histogram over x, given the bin boundaries or bin count.

Ex (1):
counts, boundaries = histogram([0, 1, 0, 1, 0, 3, 0, 1], range(5))
counts: [4, 3, 0, 1, 0]
boundaries: [0, 1, 2, 3, 4]

Ex (2): Can be used to compute class weights.
counts, classes = histogram([0, 1, 0, 1, 0, 3, 0, 1], categorical=True)
probabilities = counts / tf.reduce_sum(counts)
class_weights = dict(map(lambda (a, b): (a.numpy(), 1.0 / b.numpy()),
                         zip(classes, probabilities)))

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor.

TYPE: TensorType

boundaries

(Optional) A Tensor or int used to build the histogram; ignored if categorical is True. If possible, provide boundaries as multiple sorted values. Default to 10 intervals over the 0-1 range, or find the min/max if an int is provided (not recommended because multi-phase analysis is inefficient).

TYPE: Optional[Union[Tensor, int]] DEFAULT: None

categorical

(Optional) A bool that treats x as discrete values if true.

TYPE: Optional[bool] DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
counts

The histogram, as counts per bin.

TYPE: Tensor

boundaries

A Tensor used to build the histogram representing boundaries.

TYPE: Tensor

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def histogram(x: common_types.TensorType,
              boundaries: Optional[Union[tf.Tensor, int]] = None,
              categorical: Optional[bool] = False,
              name: Optional[str] = None) -> Tuple[tf.Tensor, tf.Tensor]:
  """Computes a histogram over x, given the bin boundaries or bin count.

  Ex (1):
  counts, boundaries = histogram([0, 1, 0, 1, 0, 3, 0, 1], range(5))
  counts: [4, 3, 0, 1, 0]
  boundaries: [0, 1, 2, 3, 4]

  Ex (2):
  Can be used to compute class weights.
  counts, classes = histogram([0, 1, 0, 1, 0, 3, 0, 1], categorical=True)
  probabilities = counts / tf.reduce_sum(counts)
  class_weights = dict(map(lambda (a, b): (a.numpy(), 1.0 / b.numpy()),
                           zip(classes, probabilities)))

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`.
    boundaries: (Optional) A `Tensor` or `int` used to build the histogram;
      ignored if `categorical` is True. If possible, provide boundaries as
      multiple sorted values.  Default to 10 intervals over the 0-1 range, or
      find the min/max if an int is provided (not recommended because
      multi-phase analysis is inefficient).
    categorical: (Optional) A `bool` that treats `x` as discrete values if true.
    name: (Optional) A name for this operation.

  Returns:
    counts: The histogram, as counts per bin.
    boundaries: A `Tensor` used to build the histogram representing boundaries.
  """

  with tf.compat.v1.name_scope(name, 'histogram'):
    x = tf.reshape(tf_utils.get_values(x), [-1])
    if categorical:
      x_dtype = x.dtype
      x = x if x_dtype == tf.string else tf.strings.as_string(x)
      elements, counts = count_per_key(x)
      if x_dtype != elements.dtype:
        elements = tf.strings.to_number(elements, tf.int64)
      return counts, elements

    if boundaries is None:
      boundaries = tf.range(11, dtype=tf.float32) / 10.0
    elif isinstance(boundaries, int) or (isinstance(boundaries, tf.Tensor) and
                                         boundaries.get_shape().ndims == 0):
      min_value, max_value = _min_and_max(x, True)
      boundaries = tf.linspace(
          tf.cast(min_value, tf.float32), tf.cast(max_value, tf.float32),
          tf.cast(boundaries, tf.int64))

    # Shift the boundaries slightly to account for floating point errors,
    # and due to the fact that the rightmost boundary is essentially ignored.
    boundaries = tf.expand_dims(tf.cast(boundaries, tf.float32), 0) - 0.0001

    bucket_indices = tf_utils.assign_buckets(
        tf.cast(x, tf.float32), remove_leftmost_boundary(boundaries))
    bucket_vocab, counts = count_per_key(tf.strings.as_string(bucket_indices))
    counts = tf_utils.reorder_histogram(bucket_vocab, counts,
                                        tf.size(boundaries) - 1)
    return counts, boundaries
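
A minimal usage sketch, assuming an integer label feature 'label' (illustrative name), following the class-weight idea from Ex (2) above:

def preprocessing_fn(inputs):
  # Per-class counts over the whole dataset, treating 'label' as discrete.
  counts, classes = tft.histogram(inputs['label'], categorical=True)
  # Broadcast the dataset-wide total so it can be emitted per instance.
  total = tf.zeros_like(inputs['label'], tf.int64) + tf.reduce_sum(counts)
  return {'label': inputs['label'], 'num_examples': total}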

make_and_track_object

make_and_track_object(
    trackable_factory_callable: Callable[[], Trackable],
    name: Optional[str] = None,
) -> Trackable

Keeps track of the object created by invoking trackable_factory_callable.

This API is only for use when Transform APIs are run with TF2 behaviors enabled and tft_beam.Context.force_tf_compat_v1 is set to False.

Use this API to track TF Trackable objects created in the preprocessing_fn such as tf.hub modules, tf.data.Dataset etc. This ensures they are serialized correctly when exporting to SavedModel.

PARAMETER DESCRIPTION
trackable_factory_callable

A callable that creates and returns a Trackable object.

TYPE: Callable[[], Trackable]

name

(Optional) Provide a unique name to track this object with. If the Trackable object created is a Keras Layer or Model this is needed for proper tracking.

TYPE: Optional[str] DEFAULT: None

Example:

>>> def preprocessing_fn(inputs):
...   dataset = tft.make_and_track_object(
...       lambda: tf.data.Dataset.from_tensor_slices([1, 2, 3]))
...   with tf.init_scope():
...     dataset_list = list(dataset.as_numpy_iterator())
...   return {'x_0': dataset_list[0] + inputs['x']}
>>> raw_data = [dict(x=1), dict(x=2), dict(x=3)]
>>> feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp(),
...                       force_tf_compat_v1=False):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'x_0': 2}, {'x_0': 3}, {'x_0': 4}]

RETURNS DESCRIPTION
Trackable

The object returned when trackable_factory_callable is invoked. The object creation is lifted out to the eager context using tf.init_scope.

Source code in tensorflow_transform/annotators.py
def make_and_track_object(trackable_factory_callable: Callable[[],
                                                               base.Trackable],
                          name: Optional[str] = None) -> base.Trackable:
  # pyformat: disable
  """Keeps track of the object created by invoking `trackable_factory_callable`.

  This API is only for use when Transform APIs are run with TF2 behaviors
  enabled and `tft_beam.Context.force_tf_compat_v1` is set to False.

  Use this API to track TF Trackable objects created in the `preprocessing_fn`
  such as tf.hub modules, tf.data.Dataset etc. This ensures they are serialized
  correctly when exporting to SavedModel.

  Args:
    trackable_factory_callable: A callable that creates and returns a Trackable
      object.
    name: (Optional) Provide a unique name to track this object with. If the
      Trackable object created is a Keras Layer or Model this is needed for
      proper tracking.

  Example:

  >>> def preprocessing_fn(inputs):
  ...   dataset = tft.make_and_track_object(
  ...       lambda: tf.data.Dataset.from_tensor_slices([1, 2, 3]))
  ...   with tf.init_scope():
  ...     dataset_list = list(dataset.as_numpy_iterator())
  ...   return {'x_0': dataset_list[0] + inputs['x']}
  >>> raw_data = [dict(x=1), dict(x=2), dict(x=3)]
  >>> feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
  >>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
  >>> with tft_beam.Context(temp_dir=tempfile.mkdtemp(),
  ...                       force_tf_compat_v1=False):
  ...   transformed_dataset, transform_fn = (
  ...       (raw_data, raw_data_metadata)
  ...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  >>> transformed_data, transformed_metadata = transformed_dataset
  >>> transformed_data
  [{'x_0': 2}, {'x_0': 3}, {'x_0': 4}]

  Returns:
    The object returned when trackable_factory_callable is invoked. The object
    creation is lifted out to the eager context using `tf.init_scope`.
  """
  # pyformat: enable
  if not tf.inside_function():
    raise ValueError('This API should only be invoked inside the user defined '
                     '`preprocessing_fn` with TF2 behaviors enabled and '
                     '`force_tf_compat_v1=False`. ')
  result = _get_object(name) if name is not None else None
  if result is None:
    with tf.init_scope():
      result = trackable_factory_callable()
      if name is None and isinstance(result, tf_keras.layers.Layer):
        raise ValueError(
            'Please pass a unique `name` to this API to ensure Keras objects '
            'are tracked correctly.')
      track_object(result, name)
  return result

max

max(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
) -> Tensor

Computes the maximum of the values of x over the whole dataset.

In the case of a CompositeTensor, missing values will be used in the return value: for float dtypes, NaN is used, and for other dtypes the dtype's minimum is used.

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor.

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor. Has the same type as x.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def max(  # pylint: disable=redefined-builtin
    x: common_types.TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None) -> tf.Tensor:
  """Computes the maximum of the values of `x` over the whole dataset.

  In the case of a `CompositeTensor` missing values will be used in return
  value: for float, NaN is used and for other dtypes the min is used.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`.
    reduce_instance_dims: By default collapses the batch and instance dimensions
      to arrive at a single scalar output. If False, only collapses the batch
      dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`. Has the same type as `x`.
  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'max'):
    return _min_and_max(x, reduce_instance_dims, name)[1]
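
Example (an illustrative sketch, not from the library source; the feature name 'x' and the output key are assumed, as is the usual tft_beam.AnalyzeAndTransformDataset pipeline shown elsewhere on this page):

>>> def preprocessing_fn(inputs):
...   # tft.max returns a constant for the whole dataset, so it can be combined
...   # with per-instance values, e.g. the gap between each value and the max.
...   x_max = tft.max(inputs['x'])
...   return {'gap_from_max': x_max - inputs['x']}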

mean

mean(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
    output_dtype: Optional[DType] = None,
) -> Tensor

Computes the mean of the values of a Tensor over the whole dataset.

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), or integral ([u]int{8|16|32|64}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the mean. If x is floating point, the mean will have the same type as x. If x is integral, the output is cast to float32. NaNs and infinite input values are ignored.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def mean(x: common_types.TensorType,
         reduce_instance_dims: bool = True,
         name: Optional[str] = None,
         output_dtype: Optional[tf.DType] = None) -> tf.Tensor:
  """Computes the mean of the values of a `Tensor` over the whole dataset.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}), or integral ([u]int{8|16|32|64}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.
    output_dtype: (Optional) If not None, casts the output tensor to this type.

  Returns:
    A `Tensor` containing the mean. If `x` is floating point, the mean will have
    the same type as `x`. If `x` is integral, the output is cast to float32.
    NaNs and infinite input values are ignored.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'mean'):
    return _mean_and_var(x, reduce_instance_dims, output_dtype)[0]
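
Example (an illustrative sketch, not from the library source; 'x' is assumed to be a dense float feature, and the surrounding tft_beam pipeline is as in the other examples on this page):

>>> def preprocessing_fn(inputs):
...   # Center the feature by subtracting the dataset-wide mean.
...   return {'x_centered': inputs['x'] - tft.mean(inputs['x'])}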

min

min(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
) -> Tensor

Computes the minimum of the values of x over the whole dataset.

In the case of a CompositeTensor, missing values will be used in the return value: for float dtypes, NaN is used, and for other dtypes the dtype's maximum is used.

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor.

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a Tensor of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor with the same type as x.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def min(  # pylint: disable=redefined-builtin
    x: common_types.TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None) -> tf.Tensor:
  """Computes the minimum of the values of `x` over the whole dataset.

  In the case of a `CompositeTensor` missing values will be used in return
  value: for float, NaN is used and for other dtypes the max is used.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`.
    reduce_instance_dims: By default collapses the batch and instance dimensions
      to arrive at a single scalar output. If False, only collapses the batch
      dimension and outputs a `Tensor` of the same shape as the input.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` with the same type as `x`.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'min'):
    return _min_and_max(x, reduce_instance_dims, name)[0]
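
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature):

>>> def preprocessing_fn(inputs):
...   # Shift the feature so that its dataset-wide minimum becomes 0.
...   return {'x_shifted': inputs['x'] - tft.min(inputs['x'])}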

ngrams

ngrams(
    tokens: SparseTensor,
    ngram_range: Tuple[int, int],
    separator: str,
    name: Optional[str] = None,
) -> SparseTensor

Create a SparseTensor of n-grams.

Given a SparseTensor of tokens, returns a SparseTensor containing the ngrams that can be constructed from each row.

separator is inserted between each pair of tokens, so " " would be an appropriate choice if the tokens are words, while "" would be an appropriate choice if they are characters.

Example:

>>> tokens = tf.SparseTensor(
...     indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [1, 3]],
...     values=['One', 'was', 'Johnny', 'Two', 'was', 'a', 'rat'],
...     dense_shape=[2, 4])
>>> print(tft.ngrams(tokens, ngram_range=(1, 3), separator=' '))
SparseTensor(indices=tf.Tensor(
    [[0 0] [0 1] [0 2] [0 3] [0 4] [0 5]
     [1 0] [1 1] [1 2] [1 3] [1 4] [1 5] [1 6] [1 7] [1 8]],
     shape=(15, 2), dtype=int64),
  values=tf.Tensor(
    [b'One' b'One was' b'One was Johnny' b'was' b'was Johnny' b'Johnny' b'Two'
     b'Two was' b'Two was a' b'was' b'was a' b'was a rat' b'a' b'a rat'
     b'rat'], shape=(15,), dtype=string),
  dense_shape=tf.Tensor([2 9], shape=(2,), dtype=int64))

PARAMETER DESCRIPTION
tokens

a two-dimensional SparseTensor of dtype tf.string containing tokens that will be used to construct ngrams.

TYPE: SparseTensor

ngram_range

A pair with the range (inclusive) of ngram sizes to return.

TYPE: Tuple[int, int]

separator

a string that will be inserted between tokens when ngrams are constructed.

TYPE: str

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
SparseTensor

A SparseTensor containing all ngrams from each row of the input. Note: if an ngram appears multiple times in the input row, it will be present the same number of times in the output. For unique ngrams, see tft.bag_of_words.

RAISES DESCRIPTION
ValueError

if tokens is not 2D.

ValueError

if ngram_range[0] < 1 or ngram_range[1] < ngram_range[0]

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def ngrams(tokens: tf.SparseTensor,
           ngram_range: Tuple[int, int],
           separator: str,
           name: Optional[str] = None) -> tf.SparseTensor:
  """Create a `SparseTensor` of n-grams.

  Given a `SparseTensor` of tokens, returns a `SparseTensor` containing the
  ngrams that can be constructed from each row.

  `separator` is inserted between each pair of tokens, so " " would be an
  appropriate choice if the tokens are words, while "" would be an appropriate
  choice if they are characters.

  Example:

  >>> tokens = tf.SparseTensor(
  ...         indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [1, 3]],
  ...         values=['One', 'was', 'Johnny', 'Two', 'was', 'a', 'rat'],
  ...         dense_shape=[2, 4])
  >>> print(tft.ngrams(tokens, ngram_range=(1, 3), separator=' '))
  SparseTensor(indices=tf.Tensor(
      [[0 0] [0 1] [0 2] [0 3] [0 4] [0 5]
       [1 0] [1 1] [1 2] [1 3] [1 4] [1 5] [1 6] [1 7] [1 8]],
       shape=(15, 2), dtype=int64),
    values=tf.Tensor(
      [b'One' b'One was' b'One was Johnny' b'was' b'was Johnny' b'Johnny' b'Two'
       b'Two was' b'Two was a' b'was' b'was a' b'was a rat' b'a' b'a rat'
       b'rat'], shape=(15,), dtype=string),
    dense_shape=tf.Tensor([2 9], shape=(2,), dtype=int64))

  Args:
    tokens: a two-dimensional`SparseTensor` of dtype `tf.string` containing
      tokens that will be used to construct ngrams.
    ngram_range: A pair with the range (inclusive) of ngram sizes to return.
    separator: a string that will be inserted between tokens when ngrams are
      constructed.
    name: (Optional) A name for this operation.

  Returns:
    A `SparseTensor` containing all ngrams from each row of the input. Note:
    if an ngram appears multiple times in the input row, it will be present the
    same number of times in the output. For unique ngrams, see tft.bag_of_words.

  Raises:
    ValueError: if `tokens` is not 2D.
    ValueError: if ngram_range[0] < 1 or ngram_range[1] < ngram_range[0]
  """
  # This function is implemented as follows.  Assume we start with the following
  # `SparseTensor`:
  #
  # indices=[[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [2, 0], [2, 1], [2, 2]]
  # values=['a', 'b', 'c', 'd', 'q', 'x', 'y', 'z']
  # dense_shape=[3, 4]
  #
  # First we then create shifts of the values and first column of indices,
  # buffering to avoid overrunning the end of the array, so the shifted values
  # (if we are ngrams up to size 3) are
  #
  # shifted_batch_indices[0]=[0, 0, 0, 0, 1, 2, 2, 2]
  # shifted_tokens[0]=['a', 'b', 'c', 'd', 'q', 'x', 'y', 'z']
  #
  # shifted_batch_indices[1]=[0, 0, 0, 1, 2, 2, 2, -1]
  # shifted_tokens[1]=['b', 'c', 'd', 'q', 'x', 'y', 'z', '']
  #
  # shifted_batch_indices[2]=[0, 0, 1, 2, 2, 2, -1, -1]
  # shifted_tokens[2]=['c', 'd', 'q', 'x', 'y', 'z', '', '']
  #
  # These shifted ngrams are used to create the ngrams as follows.  We use
  # tf.string_join to join shifted_tokens[:k] to create k-grams. The `separator`
  # string is inserted between each pair of tokens in the k-gram.
  # The batch that the first of these belonged to is given by
  # shifted_batch_indices[0]. However some of these will cross the boundaries
  # between 'batches' and so we we create a boolean mask which is True when
  # shifted_indices[:k] are all equal.
  #
  # This results in tensors of ngrams, their batch indices and a boolean mask,
  # which we then use to construct the output SparseTensor.
  if tokens.get_shape().ndims != 2:
    raise ValueError('ngrams requires `tokens` to be 2-dimensional')
  with tf.compat.v1.name_scope(name, 'ngrams'):
    if ngram_range[0] < 1 or ngram_range[1] < ngram_range[0]:
      raise ValueError('Invalid ngram_range: %r' % (ngram_range,))

    def _sliding_windows(values, num_shifts, fill_value):
      buffered_values = tf.concat(
          [values, tf.fill([num_shifts - 1], fill_value)], 0)
      return [
          tf.slice(buffered_values, [i], tf.shape(input=values))
          for i in range(num_shifts)
      ]

    shifted_batch_indices = _sliding_windows(
        tokens.indices[:, 0], ngram_range[1] + 1,
        tf.constant(-1, dtype=tf.int64))
    shifted_tokens = _sliding_windows(tokens.values, ngram_range[1] + 1, '')

    # Construct a tensor of the form
    # [['a', 'ab, 'abc'], ['b', 'bcd', cde'], ...]
    def _string_join(tensors):
      if tensors:
        return tf.strings.join(tensors, separator=separator)
      else:
        return

    ngrams_array = [_string_join(shifted_tokens[:k])
                    for k in range(ngram_range[0], ngram_range[1] + 1)]
    ngrams_tensor = tf.stack(ngrams_array, 1)

    # Construct a boolean mask for whether each ngram in ngram_tensor is valid,
    # in that each character came from the same batch.
    valid_ngram = tf.equal(
        tf.math.cumprod(
            tf.cast(
                tf.equal(
                    tf.stack(shifted_batch_indices, 1),
                    tf.expand_dims(shifted_batch_indices[0], 1)),
                dtype=tf.int32),
            axis=1), 1)
    valid_ngram = valid_ngram[:, (ngram_range[0] - 1):ngram_range[1]]

    # Construct a tensor with the batch that each ngram in ngram_tensor belongs
    # to.
    batch_indices = tf.tile(tf.expand_dims(tokens.indices[:, 0], 1),
                            [1, ngram_range[1] + 1 - ngram_range[0]])

    # Apply the boolean mask and construct a SparseTensor with the given indices
    # and values, where another index is added to give the position within a
    # batch.
    batch_indices = tf.boolean_mask(tensor=batch_indices, mask=valid_ngram)
    ngrams_tensor = tf.boolean_mask(tensor=ngrams_tensor, mask=valid_ngram)
    instance_indices = segment_indices(batch_indices)
    dense_shape_second_dim = tf.maximum(
        tf.reduce_max(input_tensor=instance_indices), -1) + 1
    return tf.SparseTensor(
        indices=tf.stack([batch_indices, instance_indices], 1),
        values=ngrams_tensor,
        dense_shape=tf.stack(
            [tokens.dense_shape[0], dense_shape_second_dim]))

pca

pca(
    x: Tensor,
    output_dim: int,
    dtype: DType,
    name: Optional[str] = None,
) -> Tensor

Computes PCA on the dataset using biased covariance.

The PCA analyzer computes output_dim orthonormal vectors that capture directions/axes corresponding to the highest variances in the input vectors of x. The output vectors are returned as a rank-2 tensor with shape (input_dim, output_dim), where the 0th dimension are the components of each output vector, and the 1st dimension are the output vectors representing orthogonal directions in the input space, sorted in order of decreasing variances.

The output rank-2 tensor (matrix) serves a useful transform purpose. Formally, the matrix can be used downstream in the transform step by multiplying it with the input tensor x. This transform reduces the dimension of input vectors to output_dim in a way that retains the maximal variance.

NOTE: To properly use PCA, input vector components should be converted to similar units of measurement such that the vectors represent a Euclidean space. If no such conversion is available (e.g. one element represents time, another element distance), the canonical approach is to first apply a transformation to the input data to normalize numerical variances, i.e. tft.scale_to_z_score(). Normalization allows PCA to choose output axes that help decorrelate input axes.

Below are a couple intuitive examples of PCA.

Consider a simple 2-dimensional example:

Input x is a series of vectors [e, e] where e is Gaussian with mean 0, variance 1. The two components are perfectly correlated, and the resulting covariance matrix is

[[1 1],
 [1 1]].

Applying PCA with output_dim = 1 would discover the first principal component [1 / sqrt(2), 1 / sqrt(2)]. When multiplied with the original example, each vector [e, e] would be mapped to a scalar sqrt(2) * e. The second principal component would be [-1 / sqrt(2), 1 / sqrt(2)] and would map [e, e] to 0, which indicates that the second component captures no variance at all. This agrees with our intuition since we know that the two axes in the input are perfectly correlated and can be fully explained by a single scalar e.

Consider a 3-dimensional example:

Input x is a series of vectors [a, a, b], where a is a zero-mean, unit variance Gaussian and b is a zero-mean, variance 4 Gaussian and is independent of a. The first principal component of the unnormalized vector would be [0, 0, 1] since b has a much larger variance than any linear combination of the first two components. This would map [a, a, b] onto b, asserting that the axis with highest energy is the third component. While this may be the desired output if a and b correspond to the same units, it is not statistically desirable when the units are irreconcilable. In such a case, one should first normalize each component to unit variance, i.e. b := b / 2. The first principal component of a normalized vector would yield [1 / sqrt(2), 1 / sqrt(2), 0], and would map [a, a, b] to sqrt(2) * a. The second component would be [0, 0, 1] and map [a, a, b] to b. As can be seen, the benefit of normalization is that PCA would capture highly correlated components first and collapse them into a lower dimension.

PARAMETER DESCRIPTION
x

A rank-2 Tensor, 0th dim are rows, 1st dim are indices in row vectors.

TYPE: Tensor

output_dim

The PCA output dimension (number of eigenvectors to return).

TYPE: int

dtype

Tensorflow dtype of entries in the returned matrix.

TYPE: DType

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RAISES DESCRIPTION
ValueError

if input is not a rank-2 Tensor.

RETURNS DESCRIPTION
Tensor

A 2D Tensor (matrix) M of shape (input_dim, output_dim).

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def pca(x: tf.Tensor,
        output_dim: int,
        dtype: tf.DType,
        name: Optional[str] = None) -> tf.Tensor:
  """Computes PCA on the dataset using biased covariance.

  The PCA analyzer computes output_dim orthonormal vectors that capture
  directions/axes corresponding to the highest variances in the input vectors of
  `x`. The output vectors are returned as a rank-2 tensor with shape
  `(input_dim, output_dim)`, where the 0th dimension are the components of each
  output vector, and the 1st dimension are the output vectors representing
  orthogonal directions in the input space, sorted in order of decreasing
  variances.

  The output rank-2 tensor (matrix) serves a useful transform purpose. Formally,
  the matrix can be used downstream in the transform step by multiplying it to
  the input tensor `x`. This transform reduces the dimension of input vectors to
  output_dim in a way that retains the maximal variance.

  NOTE: To properly use PCA, input vector components should be converted to
  similar units of measurement such that the vectors represent a Euclidean
  space. If no such conversion is available (e.g. one element represents time,
  another element distance), the canonical approach is to first apply a
  transformation to the input data to normalize numerical variances, i.e.
  `tft.scale_to_z_score()`. Normalization allows PCA to choose output axes that
  help decorrelate input axes.

  Below are a couple intuitive examples of PCA.

  Consider a simple 2-dimensional example:

  Input x is a series of vectors `[e, e]` where `e` is Gaussian with mean 0,
  variance 1. The two components are perfectly correlated, and the resulting
  covariance matrix is

  ```
  [[1 1],
   [1 1]].
  ```

  Applying PCA with `output_dim = 1` would discover the first principal
  component `[1 / sqrt(2), 1 / sqrt(2)]`. When multipled to the original
  example, each vector `[e, e]` would be mapped to a scalar `sqrt(2) * e`. The
  second principal component would be `[-1 / sqrt(2), 1 / sqrt(2)]` and would
  map `[e, e]` to 0, which indicates that the second component captures no
  variance at all. This agrees with our intuition since we know that the two
  axes in the input are perfectly correlated and can be fully explained by a
  single scalar `e`.

  Consider a 3-dimensional example:

  Input `x` is a series of vectors `[a, a, b]`, where `a` is a zero-mean, unit
  variance Gaussian and `b` is a zero-mean, variance 4 Gaussian and is
  independent of `a`. The first principal component of the unnormalized vector
  would be `[0, 0, 1]` since `b` has a much larger variance than any linear
  combination of the first two components. This would map `[a, a, b]` onto `b`,
  asserting that the axis with highest energy is the third component. While this
  may be the desired output if `a` and `b` correspond to the same units, it is
  not statistically desireable when the units are irreconciliable. In such a
  case, one should first normalize each component to unit variance first, i.e.
  `b := b / 2`. The first principal component of a normalized vector would yield
  `[1 / sqrt(2), 1 / sqrt(2), 0]`, and would map `[a, a, b]` to `sqrt(2) * a`.
  The second component would be `[0, 0, 1]` and map `[a, a, b]` to `b`. As can
  be seen, the benefit of normalization is that PCA would capture highly
  correlated components first and collapse them into a lower dimension.

  Args:
    x: A rank-2 `Tensor`, 0th dim are rows, 1st dim are indices in row vectors.
    output_dim: The PCA output dimension (number of eigenvectors to return).
    dtype: Tensorflow dtype of entries in the returned matrix.
    name: (Optional) A name for this operation.

  Raises:
    ValueError: if input is not a rank-2 Tensor.

  Returns:
    A 2D `Tensor` (matrix) M of shape (input_dim, output_dim).
  """

  if not isinstance(x, tf.Tensor):
    raise TypeError('Expected a Tensor, but got %r' % x)

  with tf.compat.v1.name_scope(name, 'pca'):
    x.shape.assert_has_rank(2)

    input_dim = x.shape.as_list()[1]
    shape = (input_dim, output_dim)

    (result,) = _apply_cacheable_combiner(
        PCACombiner(shape, output_dim, dtype.as_numpy_dtype), x)
    return result
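
Example (an illustrative sketch, not from the library source; 'embedding' is an assumed dense rank-2 float feature):

>>> def preprocessing_fn(inputs):
...   # Learn the top 2 principal directions over the whole dataset.
...   components = tft.pca(inputs['embedding'], output_dim=2, dtype=tf.float32)
...   # Project each input vector onto those directions, as described above.
...   return {'embedding_2d': tf.matmul(inputs['embedding'], components)}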

quantiles

quantiles(
    x: Tensor,
    num_buckets: int,
    epsilon: float,
    weights: Optional[Tensor] = None,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
) -> Tensor

Computes the quantile boundaries of a Tensor over the whole dataset.

Quantile boundaries are computed using approximate quantiles, and error tolerance is specified using epsilon. The boundaries divide the input tensor into approximately equal num_buckets parts. See go/squawd for details, and how to control the error due to approximation. NaN input values and values with NaN weights are ignored.

PARAMETER DESCRIPTION
x

An input Tensor.

TYPE: Tensor

num_buckets

Values in x are divided into approximately equal-sized buckets, where the number of buckets is num_buckets. The number of returned quantiles is num_buckets - 1.

TYPE: int

epsilon

Error tolerance, typically a small fraction close to zero (e.g. 0.01). Higher values of epsilon increase the quantile approximation error and hence result in more unequal buckets, but could improve performance and resource consumption. Some measured results on memory consumption: for epsilon = 0.001, the amount of memory for each buffer to hold the summary for 1 trillion input values is ~25000 bytes. If epsilon is relaxed to 0.01, the buffer size drops to ~2000 bytes for the same input size. The buffer size also determines the amount of work in the different stages of the Beam pipeline; in general, larger epsilon results in fewer and smaller stages, and less time. For more performance trade-offs see also http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf

TYPE: float

weights

(Optional) Weights tensor for the quantiles. Tensor must have the same batch size as x.

TYPE: Optional[Tensor] DEFAULT: None

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single output vector. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

The bucket boundaries represented as a list, with num_bucket-1 elements, unless reduce_instance_dims is False, which results in a Tensor of shape x.shape + [num_bucket-1]. See code below for discussion on the type of bucket boundaries.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def quantiles(x: tf.Tensor,
              num_buckets: int,
              epsilon: float,
              weights: Optional[tf.Tensor] = None,
              reduce_instance_dims: bool = True,
              name: Optional[str] = None) -> tf.Tensor:
  """Computes the quantile boundaries of a `Tensor` over the whole dataset.

  Quantile boundaries are computed using approximate quantiles,
  and error tolerance is specified using `epsilon`. The boundaries divide the
  input tensor into approximately equal `num_buckets` parts.
  See go/squawd for details, and how to control the error due to approximation.
  NaN input values and values with NaN weights are ignored.

  Args:
    x: An input `Tensor`.
    num_buckets: Values in the `x` are divided into approximately equal-sized
      buckets, where the number of buckets is `num_buckets`. The number of
      returned quantiles is `num_buckets` - 1.
    epsilon: Error tolerance, typically a small fraction close to zero (e.g.
      0.01). Higher values of epsilon increase the quantile approximation, and
      hence result in more unequal buckets, but could improve performance,
      and resource consumption.  Some measured results on memory consumption:
        For epsilon = 0.001, the amount of memory for each buffer to hold the
        summary for 1 trillion input values is ~25000 bytes. If epsilon is
        relaxed to 0.01, the buffer size drops to ~2000 bytes for the same input
        size. The buffer size also determines the amount of work in the
        different stages of the beam pipeline, in general, larger epsilon
        results in fewer and smaller stages, and less time. For more performance
        trade-offs see also http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf
    weights: (Optional) Weights tensor for the quantiles. Tensor must have the
      same batch size as x.
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single output vector. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.

  Returns:
    The bucket boundaries represented as a list, with num_bucket-1 elements,
    unless reduce_instance_dims is False, which results in a Tensor of
    shape x.shape + [num_bucket-1].
    See code below for discussion on the type of bucket boundaries.
  """
  # Quantile ops convert input values to double under the hood. Keep bucket
  # boundaries as float for all numeric types.
  bucket_dtype = tf.float32
  with tf.compat.v1.name_scope(name, 'quantiles'):
    if weights is None:
      analyzer_inputs = [x]
      has_weights = False
    else:
      analyzer_inputs = [x, weights]
      has_weights = True
    feature_shape = [] if reduce_instance_dims else x.get_shape().as_list()[1:]
    output_shape = (feature_shape if feature_shape else [1]) + [num_buckets - 1]
    combiner = QuantilesCombiner(
        num_buckets,
        epsilon,
        bucket_dtype.as_numpy_dtype,
        has_weights=has_weights,
        output_shape=output_shape,
        feature_shape=feature_shape)
    (quantile_boundaries,) = _apply_cacheable_combiner(combiner,
                                                       *analyzer_inputs)
    return quantile_boundaries
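
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature, and tft.apply_buckets is shown only as the typical follow-up step for turning boundaries into bucket indices):

>>> def preprocessing_fn(inputs):
...   # Quartile boundaries: 3 boundaries defining 4 approximately equal buckets.
...   boundaries = tft.quantiles(inputs['x'], num_buckets=4, epsilon=0.01)
...   return {'x_bucketized': tft.apply_buckets(inputs['x'], boundaries)}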

scale_by_min_max

scale_by_min_max(
    x: ConsistentTensorType,
    output_min: float = 0.0,
    output_max: float = 1.0,
    elementwise: bool = False,
    name: Optional[str] = None,
) -> ConsistentTensorType

Scale a numerical column into the range [output_min, output_max].

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

output_min

The minimum of the range of output values.

TYPE: float DEFAULT: 0.0

output_max

The maximum of the range of output values.

TYPE: float DEFAULT: 1.0

elementwise

If true, scale each element of the tensor independently.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor containing the input column scaled to [output_min, output_max]. If the analysis dataset is empty or contains a single distinct value, then x is scaled using a sigmoid function.

RAISES DESCRIPTION
ValueError

If output_min, output_max have the wrong order.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_by_min_max(
    x: common_types.ConsistentTensorType,
    output_min: float = 0.0,
    output_max: float = 1.0,
    elementwise: bool = False,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Scale a numerical column into the range [output_min, output_max].

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    output_min: The minimum of the range of output values.
    output_max: The maximum of the range of output values.
    elementwise: If true, scale each element of the tensor independently.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` containing the input column scaled to [output_min, output_max].
    If the analysis dataset is empty or contains a singe distinct value, then
    `x` is scaled using a sigmoid function.

  Raises:
    ValueError: If output_min, output_max have the wrong order.
  """
  with tf.compat.v1.name_scope(name, 'scale_by_min_max'):
    return _scale_by_min_max_internal(
        x,
        key=None,
        output_min=output_min,
        output_max=output_max,
        elementwise=elementwise,
        key_vocabulary_filename=None)
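
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature):

>>> def preprocessing_fn(inputs):
...   # Rescale into [-1, 1] using the dataset-wide min and max.
...   return {'x_scaled': tft.scale_by_min_max(inputs['x'],
...                                            output_min=-1.0,
...                                            output_max=1.0)}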

scale_by_min_max_per_key

scale_by_min_max_per_key(
    x: ConsistentTensorType,
    key: TensorType,
    output_min: float = 0.0,
    output_max: float = 1.0,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType

Scale a numerical column into a predefined range on a per-key basis.

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

key

A Tensor, SparseTensor, or RaggedTensor of dtype tf.string. Must meet one of the following conditions:
0. key is None
1. Both x and key are dense,
2. Both x and key are composite and key must exactly match x in everything except values,
3. The axis=1 index of each x matches its index of dense key.

TYPE: TensorType

output_min

The minimum of the range of output values.

TYPE: float DEFAULT: 0.0

output_max

The maximum of the range of output values.

TYPE: float DEFAULT: 1.0

elementwise

If true, scale each element of the tensor independently.

TYPE: bool DEFAULT: False

key_vocabulary_filename

(Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function.

TYPE: Optional[str] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

Example:

>>> def preprocessing_fn(inputs):
...   return {
...       'scaled': tft.scale_by_min_max_per_key(inputs['x'], inputs['s'])
...   }
>>> raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
>>> feature_spec = dict(
...     x=tf.io.FixedLenFeature([], tf.float32),
...     s=tf.io.FixedLenFeature([], tf.string))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column scaled to [output_min, output_max] on a per-key basis if a key is provided. If the analysis dataset is empty, a certain key contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then x is scaled using a sigmoid function.

RAISES DESCRIPTION
ValueError

If output_min, output_max have the wrong order.

NotImplementedError

If elementwise is True and key is not None.

InvalidArgumentError

If indices of sparse x and key do not match.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_by_min_max_per_key(
    x: common_types.ConsistentTensorType,
    key: common_types.TensorType,
    output_min: float = 0.0,
    output_max: float = 1.0,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  # pyformat: disable
  """Scale a numerical column into a predefined range on a per-key basis.

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    key: A `Tensor`, `SparseTensor`, or `RaggedTensor` of dtype tf.string.
        Must meet one of the following conditions:
        0. key is None
        1. Both x and key are dense,
        2. Both x and key are composite and `key` must exactly match `x` in
           everything except values,
        3. The axis=1 index of each x matches its index of dense key.
    output_min: The minimum of the range of output values.
    output_max: The maximum of the range of output values.
    elementwise: If true, scale each element of the tensor independently.
    key_vocabulary_filename: (Optional) The file name for the per-key file.
      If None, this combiner will assume the keys fit in memory and will not
      store the analyzer result in a file. If '', a file name will be chosen
      based on the current TensorFlow scope. If not '', it should be unique
      within a given preprocessing function.
    name: (Optional) A name for this operation.

  Example:

  >>> def preprocessing_fn(inputs):
  ...   return {
  ...      'scaled': tft.scale_by_min_max_per_key(inputs['x'], inputs['s'])
  ...   }
  >>> raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
  >>> feature_spec = dict(
  ...     x=tf.io.FixedLenFeature([], tf.float32),
  ...     s=tf.io.FixedLenFeature([], tf.string))
  >>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
  >>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  ...   transformed_dataset, transform_fn = (
  ...       (raw_data, raw_data_metadata)
  ...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  >>> transformed_data, transformed_metadata = transformed_dataset
  >>> transformed_data
  [{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column scaled to
    [output_min, output_max] on a per-key basis if a key is provided. If the
    analysis dataset is empty, a certain key contains a single distinct value or
    the computed key vocabulary doesn't have an entry for `key`, then `x` is
    scaled using a sigmoid function.

  Raises:
    ValueError: If output_min, output_max have the wrong order.
    NotImplementedError: If elementwise is True and key is not None.
    InvalidArgumentError: If indices of sparse x and key do not match.
  """
  # pyformat: enable
  with tf.compat.v1.name_scope(name, 'scale_by_min_max_per_key'):
    if key is None:
      raise ValueError('key is None, call `tft.scale_by_min_max` instead')
    return _scale_by_min_max_internal(
        x,
        key=key,
        output_min=output_min,
        output_max=output_max,
        elementwise=elementwise,
        key_vocabulary_filename=key_vocabulary_filename)

scale_to_0_1

scale_to_0_1(
    x: ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None,
) -> ConsistentTensorType

Returns a column which is the input column scaled to have range [0,1].

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

elementwise

If true, scale each element of the tensor independently.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column scaled to [0, 1]. If the analysis dataset is empty or contains a single distinct value, then x is scaled using a sigmoid function.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_to_0_1(
    x: common_types.ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  """Returns a column which is the input column scaled to have range [0,1].

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    elementwise: If true, scale each element of the tensor independently.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column
    scaled to
    [0, 1]. If the analysis dataset is empty or contains a single distinct
    value, then `x` is scaled using a sigmoid function.
  """
  with tf.compat.v1.name_scope(name, 'scale_to_0_1'):
    return _scale_by_min_max_internal(
        x,
        key=None,
        output_min=0,
        output_max=1,
        elementwise=elementwise,
        key_vocabulary_filename=None)
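
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature):

>>> def preprocessing_fn(inputs):
...   # Shorthand for tft.scale_by_min_max with the default [0, 1] output range.
...   return {'x_scaled': tft.scale_to_0_1(inputs['x'])}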

scale_to_0_1_per_key

scale_to_0_1_per_key(
    x: ConsistentTensorType,
    key: TensorType,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType

Returns a column which is the input column scaled to have range [0,1].

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

key

A Tensor, SparseTensor, or RaggedTensor of type string.

TYPE: TensorType

elementwise

If true, scale each element of the tensor independently.

TYPE: bool DEFAULT: False

key_vocabulary_filename

(Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function.

TYPE: Optional[str] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

Example:

>>> def preprocessing_fn(inputs):
...   return {
...       'scaled': tft.scale_to_0_1_per_key(inputs['x'], inputs['s'])
...   }
>>> raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
>>> feature_spec = dict(
...     x=tf.io.FixedLenFeature([], tf.float32),
...     s=tf.io.FixedLenFeature([], tf.string))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column scaled to [0, 1], per key. If the analysis dataset is empty, contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then x is scaled using a sigmoid function.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_to_0_1_per_key(
    x: common_types.ConsistentTensorType,
    key: common_types.TensorType,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None) -> common_types.ConsistentTensorType:
  # pyformat: disable
  """Returns a column which is the input column scaled to have range [0,1].

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    key: A `Tensor`, `SparseTensor`, or `RaggedTensor` of type string.
    elementwise: If true, scale each element of the tensor independently.
    key_vocabulary_filename: (Optional) The file name for the per-key file. If
      None, this combiner will assume the keys fit in memory and will not store
      the analyzer result in a file. If '', a file name will be chosen based on
      the current TensorFlow scope. If not '', it should be unique within a
      given preprocessing function.
    name: (Optional) A name for this operation.

  Example:

  >>> def preprocessing_fn(inputs):
  ...   return {
  ...      'scaled': tft.scale_to_0_1_per_key(inputs['x'], inputs['s'])
  ...   }
  >>> raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
  >>> feature_spec = dict(
  ...     x=tf.io.FixedLenFeature([], tf.float32),
  ...     s=tf.io.FixedLenFeature([], tf.string))
  >>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
  >>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  ...   transformed_dataset, transform_fn = (
  ...       (raw_data, raw_data_metadata)
  ...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  >>> transformed_data, transformed_metadata = transformed_dataset
  >>> transformed_data
  [{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column scaled to [0, 1],
    per key. If the analysis dataset is empty, contains a single distinct value
    or the computed key vocabulary doesn't have an entry for `key`, then `x` is
    scaled using a sigmoid function.
  """
  # pyformat: enable
  with tf.compat.v1.name_scope(name, 'scale_to_0_1_per_key'):
    if key is None:
      raise ValueError('key is None, call `tft.scale_to_0_1` instead')
    return _scale_by_min_max_internal(
        x,
        key=key,
        output_min=0,
        output_max=1,
        elementwise=elementwise,
        key_vocabulary_filename=key_vocabulary_filename)

scale_to_gaussian

scale_to_gaussian(
    x: ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None,
    output_dtype: Optional[DType] = None,
) -> ConsistentTensorType

Returns an (approximately) normal column with mean 0 and variance 1.

We transform the column to values that are approximately distributed according to a standard normal distribution. The transformation is obtained by applying the moments method to estimate the parameters of a Tukey HH distribution and applying the inverse of the estimated function to the column values. The method is partially described in

Georg M. Goerg, "The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey's h Transformation as a Special Case," The Scientific World Journal, Vol. 2015, Hindawi Publishing Corporation.

We use the L-moments instead of conventional moments to be able to deal with long-tailed distributions. The expressions of the L-moments for the Tukey HH distribution are given in

Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey H and HH-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153

Note that the transformation to Gaussian is applied only if the column has long tails. If this is not the case, for instance if values are uniformly distributed, the values are only normalized using the z-score. This also applies when only one of the tails is long; the other tail is only rescaled but not nonlinearly transformed. Also, if the analysis set is empty, the transformation leaves the input values unchanged.

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

elementwise

If true, scales each element of the tensor independently; otherwise uses the parameters of the whole tensor.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column transformed to be approximately standard distributed (i.e. a Gaussian with mean 0 and variance 1). If x is floating point, the mean will have the same type as x. If x is integral, the output is cast to tf.float32.

Note that TFLearn generally permits only tf.int64 and tf.float32, so casting this scaler's output may be necessary.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_to_gaussian(
    x: common_types.ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None,
    output_dtype: Optional[tf.DType] = None
) -> common_types.ConsistentTensorType:
  """Returns an (approximately) normal column with mean to 0 and variance 1.

  We transform the column to values that are approximately distributed
  according to a standard normal distribution.
  The transformation is obtained by applying the moments method to estimate
  the parameters of a Tukey HH distribution and applying the inverse of the
  estimated function to the column values.
  The method is partially described in

  Georg M. Georgm "The Lambert Way to Gaussianize Heavy-Tailed Data with the
  Inverse of Tukey's h Transformation as a Special Case," The Scientific World
  Journal, Vol. 2015, Hindawi Publishing Corporation.

  We use the L-moments instead of conventional moments to be able to deal with
  long-tailed distributions. The expressions of the L-moments for the Tukey HH
  distribution is in

  Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey H and
  HH-Distributions through L-Moments and the L-Correlation," ISRN Applied
  Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153

  Note that the transformation to Gaussian is applied only if the column has
  long-tails. If this is not the case, for instance if values are uniformly
  distributed, the values are only normalized using the z score. This applies
  also to the cases where only one of the tails is long; the other tail is only
  rescaled but not non linearly transformed.
  Also, if the analysis set is empty, the transformation is set to to leave the
  input vaules unchanged.

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    elementwise: If true, scales each element of the tensor independently;
      otherwise uses the parameters of the whole tensor.
    name: (Optional) A name for this operation.
    output_dtype: (Optional) If not None, casts the output tensor to this type.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column
    transformed to be approximately standard distributed (i.e. a Gaussian with
    mean 0 and variance 1). If `x` is floating point, the mean will have the
    same type as `x`. If `x` is integral, the output is cast to tf.float32.

    Note that TFLearn generally permits only tf.int64 and tf.float32, so casting
    this scaler's output may be necessary.
  """
  with tf.compat.v1.name_scope(name, 'scale_to_gaussian'):
    return _scale_to_gaussian_internal(
        x=x,
        elementwise=elementwise,
        output_dtype=output_dtype)
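
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature):

>>> def preprocessing_fn(inputs):
...   # Gaussianize a long-tailed feature; as noted above, this falls back to
...   # z-scoring when the analysis data does not show long tails.
...   return {'x_gaussianized': tft.scale_to_gaussian(inputs['x'])}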

scale_to_z_score

scale_to_z_score(
    x: ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None,
    output_dtype: Optional[DType] = None,
) -> ConsistentTensorType

Returns a standardized column with mean 0 and variance 1.

Scaling to z-score subtracts out the mean and divides by standard deviation. Note that the standard deviation computed here is based on the biased variance (0 delta degrees of freedom), as computed by analyzers.var.

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

elementwise

If true, scales each element of the tensor independently; otherwise uses the mean and variance of the whole tensor.

TYPE: bool DEFAULT: False

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column scaled to mean 0 and variance 1 (standard deviation 1), given by: (x - mean(x)) / std_dev(x). If x is floating point, the mean will have the same type as x. If x is integral, the output is cast to tf.float32. If the analysis dataset is empty or contains a single distinct value, then the input is returned without scaling.

Note that TFLearn generally permits only tf.int64 and tf.float32, so casting this scaler's output may be necessary.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_to_z_score(
    x: common_types.ConsistentTensorType,
    elementwise: bool = False,
    name: Optional[str] = None,
    output_dtype: Optional[tf.DType] = None
) -> common_types.ConsistentTensorType:
  """Returns a standardized column with mean 0 and variance 1.

  Scaling to z-score subtracts out the mean and divides by standard deviation.
  Note that the standard deviation computed here is based on the biased variance
  (0 delta degrees of freedom), as computed by analyzers.var.

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    elementwise: If true, scales each element of the tensor independently;
      otherwise uses the mean and variance of the whole tensor.
    name: (Optional) A name for this operation.
    output_dtype: (Optional) If not None, casts the output tensor to this type.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column
    scaled to mean 0
    and variance 1 (standard deviation 1), given by: (x - mean(x)) / std_dev(x).
    If `x` is floating point, the mean will have the same type as `x`. If `x` is
    integral, the output is cast to tf.float32. If the analysis dataset is empty
    or contains a single distinct value, then the input is returned without
    scaling.

    Note that TFLearn generally permits only tf.int64 and tf.float32, so casting
    this scaler's output may be necessary.
  """
  with tf.compat.v1.name_scope(name, 'scale_to_z_score'):
    return _scale_to_z_score_internal(
        x=x,
        key=None,
        elementwise=elementwise,
        key_vocabulary_filename=None,
        output_dtype=output_dtype)
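
Example (an illustrative sketch, not from the library source; 'x' is an assumed numeric feature):

>>> def preprocessing_fn(inputs):
...   # Standardize using the dataset-wide mean and standard deviation.
...   return {'x_standardized': tft.scale_to_z_score(inputs['x'])}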

scale_to_z_score_per_key

scale_to_z_score_per_key(
    x: ConsistentTensorType,
    key: TensorType,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None,
    output_dtype: Optional[DType] = None,
) -> ConsistentTensorType

Returns a standardized column with mean 0 and variance 1, grouped per key.

Scaling to z-score subtracts out the mean and divides by standard deviation. Note that the standard deviation computed here is based on the biased variance (0 delta degrees of freedom), as computed by analyzers.var.

PARAMETER DESCRIPTION
x

A numeric Tensor, SparseTensor, or RaggedTensor.

TYPE: ConsistentTensorType

key

A Tensor, SparseTensor, or RaggedTensor of dtype tf.string. Must meet one of the following conditions:
0. key is None,
1. Both x and key are dense,
2. Both x and key are sparse and key must exactly match x in everything except values,
3. The axis=1 index of each x matches its index of dense key.

TYPE: TensorType

elementwise

If true, scales each element of the tensor independently; otherwise uses the mean and variance of the whole tensor. Currently, not supported for per-key operations.

TYPE: bool DEFAULT: False

key_vocabulary_filename

(Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function.

TYPE: Optional[str] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

RETURNS DESCRIPTION
ConsistentTensorType

A Tensor, SparseTensor, or RaggedTensor containing the input column scaled to mean 0 and variance 1 (standard deviation 1), grouped per key if a key is provided. That is, for all keys k: (x - mean(x)) / std_dev(x) for all x with key k. If x is floating point, the mean will have the same type as x. If x is integral, the output is cast to tf.float32. If the analysis dataset is empty, contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then the input is returned without scaling.

Note that TFLearn generally permits only tf.int64 and tf.float32, so casting this scaler's output may be necessary.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def scale_to_z_score_per_key(
    x: common_types.ConsistentTensorType,
    key: common_types.TensorType,
    elementwise: bool = False,
    key_vocabulary_filename: Optional[str] = None,
    name: Optional[str] = None,
    output_dtype: Optional[tf.DType] = None
) -> common_types.ConsistentTensorType:
  """Returns a standardized column with mean 0 and variance 1, grouped per key.

  Scaling to z-score subtracts out the mean and divides by standard deviation.
  Note that the standard deviation computed here is based on the biased variance
  (0 delta degrees of freedom), as computed by analyzers.var.

  Args:
    x: A numeric `Tensor`, `SparseTensor`, or `RaggedTensor`.
    key: A `Tensor`, `SparseTensor`, or `RaggedTensor` of dtype tf.string. Must
      meet one of the following conditions:
      0. key is None,
      1. Both x and key are dense,
      2. Both x and key are sparse and `key` must exactly match `x` in
      everything except values,
      3. The axis=1 index of each x matches its index of dense key.
    elementwise: If true, scales each element of the tensor independently;
      otherwise uses the mean and variance of the whole tensor. Currently, not
      supported for per-key operations.
    key_vocabulary_filename: (Optional) The file name for the per-key file. If
      None, this combiner will assume the keys fit in memory and will not store
      the analyzer result in a file. If '', a file name will be chosen based on
      the current TensorFlow scope. If not '', it should be unique within a
      given preprocessing function.
    name: (Optional) A name for this operation.
    output_dtype: (Optional) If not None, casts the output tensor to this type.

  Returns:
    A `Tensor`, `SparseTensor`, or `RaggedTensor` containing the input column
    scaled to mean 0
    and variance 1 (standard deviation 1), grouped per key if a key is provided.

    That is, for all keys k: (x - mean(x)) / std_dev(x) for all x with key k.
    If `x` is floating point, the mean will have the same type as `x`. If `x` is
    integral, the output is cast to tf.float32. If the analysis dataset is
    empty, contains a single distinct value or the computed key vocabulary
    doesn't have an entry for `key`, then the input is returned without scaling.

    Note that TFLearn generally permits only tf.int64 and tf.float32, so casting
    this scaler's output may be necessary.
  """
  with tf.compat.v1.name_scope(name, 'scale_to_z_score_per_key'):
    if key is None:
      raise ValueError('key is None, call `tft.scale_to_z_score` instead')
    return _scale_to_z_score_internal(
        x=x,
        key=key,
        elementwise=elementwise,
        key_vocabulary_filename=key_vocabulary_filename,
        output_dtype=output_dtype)

segment_indices

segment_indices(
    segment_ids: Tensor, name: Optional[str] = None
) -> Tensor

Returns a Tensor of indices within each segment.

segment_ids should be a sequence of non-decreasing non-negative integers that define a set of segments, e.g. [0, 0, 1, 2, 2, 2] defines 3 segments of length 2, 1 and 3. The return value is a Tensor containing the indices within each segment.

Example:

>>> result = tft.segment_indices(tf.constant([0, 0, 1, 2, 2, 2]))
>>> print(result)
tf.Tensor([0 1 0 0 1 2], shape=(6,), dtype=int32)

PARAMETER DESCRIPTION
segment_ids

A 1-d Tensor containing a non-decreasing sequence of non-negative integers with type tf.int32 or tf.int64.

TYPE: Tensor

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the indices within each segment.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def segment_indices(segment_ids: tf.Tensor,
                    name: Optional[str] = None) -> tf.Tensor:
  """Returns a `Tensor` of indices within each segment.

  segment_ids should be a sequence of non-decreasing non-negative integers that
  define a set of segments, e.g. [0, 0, 1, 2, 2, 2] defines 3 segments of length
  2, 1 and 3.  The return value is a `Tensor` containing the indices within each
  segment.

  Example:

  >>> result = tft.segment_indices(tf.constant([0, 0, 1, 2, 2, 2]))
  >>> print(result)
  tf.Tensor([0 1 0 0 1 2], shape=(6,), dtype=int32)

  Args:
    segment_ids: A 1-d `Tensor` containing an non-decreasing sequence of
      non-negative integers with type `tf.int32` or `tf.int64`.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` containing the indices within each segment.
  """
  ndims = segment_ids.get_shape().ndims
  if ndims != 1 and ndims is not None:
    raise ValueError(
        'segment_indices requires a 1-dimensional input. '
        'segment_indices has {} dimensions.'.format(ndims))
  with tf.compat.v1.name_scope(name, 'segment_indices'):
    # TODO(KesterTong): This is a fundamental operation for segments, write a C++
    # op to do this.
    # TODO(KesterTong): Add a check that segment_ids are increasing.
    segment_lengths = tf.math.segment_sum(
        tf.ones_like(segment_ids), segment_ids)
    segment_starts = tf.gather(tf.concat([[0], tf.cumsum(segment_lengths)], 0),
                               segment_ids)
    return (tf.range(tf.size(input=segment_ids, out_type=segment_ids.dtype)) -
            segment_starts)

size

size(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
) -> Tensor

Computes the total size of instances in a Tensor over the whole dataset.
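
A short sketch of combining the analyzer result with per-instance tensors inside a preprocessing_fn; the feature name 'income' is hypothetical, and this hand-rolled mean is only for illustration (tft.mean computes it directly).

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = tf.cast(inputs['income'], tf.float32)  # hypothetical dense numeric feature
  # Dataset-wide value count (a scalar constant) and dataset-wide total.
  count = tf.cast(tft.size(x), tf.float32)
  total = tft.sum(x)
  return {'income_centered': x - total / count}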

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor.

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor of type int64.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def size(x: common_types.TensorType,
         reduce_instance_dims: bool = True,
         name: Optional[str] = None) -> tf.Tensor:
  """Computes the total size of instances in a `Tensor` over the whole dataset.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`.
    reduce_instance_dims: By default collapses the batch and instance dimensions
      to arrive at a single scalar output. If False, only collapses the batch
      dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` of type int64.
  """
  with tf.compat.v1.name_scope(name, 'size'):
    # Note: Calling `sum` defined in this module, not the builtin.
    if isinstance(x, tf.SparseTensor):
      ones_like_x = tf.SparseTensor(
          indices=x.indices,
          values=tf.ones_like(x.values, tf.int64),
          dense_shape=x.dense_shape)
    else:
      ones_like_x = tf.ones_like(x, dtype=tf.int64)
    return sum(ones_like_x, reduce_instance_dims)

sparse_tensor_left_align

sparse_tensor_left_align(
    sparse_tensor: SparseTensor,
) -> SparseTensor

Re-arranges a tf.SparseTensor and returns a left-aligned version of it.

This mapper can be useful when returning a sparse tensor that may not be left-aligned from a preprocessing_fn.
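
A small sketch of what left alignment does; the mapper is ordinarily called from a preprocessing_fn, but since it is built from plain TensorFlow ops it can also be traced standalone. The expected indices in the comment follow from the source shown further down.

import tensorflow as tf
import tensorflow_transform as tft

# Row 0 has values at columns 1 and 3; row 1 has a value at column 2.
sp = tf.SparseTensor(indices=[[0, 1], [0, 3], [1, 2]],
                     values=[10, 20, 30],
                     dense_shape=[2, 4])
aligned = tft.sparse_tensor_left_align(sp)
# aligned keeps the same values but packs them against column 0:
# indices == [[0, 0], [0, 1], [1, 0]], values == [10, 20, 30].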

PARAMETER DESCRIPTION
sparse_tensor

A 2D tf.SparseTensor.

TYPE: SparseTensor

RETURNS DESCRIPTION
SparseTensor

A left-aligned version of sparse_tensor as a tf.SparseTensor.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def sparse_tensor_left_align(sparse_tensor: tf.SparseTensor) -> tf.SparseTensor:
  """Re-arranges a `tf.SparseTensor` and returns a left-aligned version of it.

  This mapper can be useful when returning a sparse tensor that may not be
  left-aligned from a preprocessing_fn.

  Args:
    sparse_tensor: A 2D `tf.SparseTensor`.

  Raises:
    ValueError if `sparse_tensor` is not 2D.

  Returns:
    A left-aligned version of sparse_tensor as a `tf.SparseTensor`.
  """
  if sparse_tensor.get_shape().ndims != 2:
    raise ValueError('sparse_tensor_left_align requires a 2D input')
  reordered_tensor = tf.sparse.reorder(sparse_tensor)
  transposed_indices = tf.transpose(reordered_tensor.indices)
  row_indices = transposed_indices[0]
  row_counts = tf.unique_with_counts(row_indices, out_idx=tf.int64).count
  column_indices = tf.ragged.range(row_counts).flat_values
  return tf.SparseTensor(
      indices=tf.transpose(tf.stack([row_indices, column_indices])),
      values=reordered_tensor.values,
      dense_shape=reordered_tensor.dense_shape)

sparse_tensor_to_dense_with_shape

sparse_tensor_to_dense_with_shape(
    x: SparseTensor,
    shape: Union[TensorShape, Iterable[int]],
    default_value: Union[Tensor, int, float, str] = 0,
) -> Tensor

Converts a SparseTensor into a dense tensor and sets its shape.
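
A brief sketch of densifying a variable-length string feature inside a preprocessing_fn; the feature name 'tags' and the fixed width of 5 are illustrative assumptions.

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  tags = inputs['tags']  # hypothetical VarLenFeature, parsed as a 2D SparseTensor
  # Densify each row to exactly 5 columns, filling unset positions with ''
  # (this assumes no row carries more than 5 tags).
  dense_tags = tft.sparse_tensor_to_dense_with_shape(
      tags, shape=[None, 5], default_value='')
  return {'tags_dense': dense_tags}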

PARAMETER DESCRIPTION
x

A SparseTensor.

TYPE: SparseTensor

shape

The desired shape of the densified Tensor.

TYPE: Union[TensorShape, Iterable[int]]

default_value

(Optional) Value to set for indices not specified. Defaults to zero.

TYPE: Union[Tensor, int, float, str] DEFAULT: 0

RETURNS DESCRIPTION
Tensor

A Tensor with the desired shape.

RAISES DESCRIPTION
ValueError

If input is not a SparseTensor.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def sparse_tensor_to_dense_with_shape(
    x: tf.SparseTensor,
    shape: Union[tf.TensorShape, Iterable[int]],
    default_value: Union[tf.Tensor, int, float, str] = 0) -> tf.Tensor:
  """Converts a `SparseTensor` into a dense tensor and sets its shape.

  Args:
    x: A `SparseTensor`.
    shape: The desired shape of the densified `Tensor`.
    default_value: (Optional) Value to set for indices not specified. Defaults
      to zero.

  Returns:
    A `Tensor` with the desired shape.

  Raises:
    ValueError: If input is not a `SparseTensor`.
  """
  if not isinstance(x, tf.SparseTensor):
    raise ValueError('input must be a SparseTensor')
  new_dense_shape = [
      x.dense_shape[i] if size is None else size
      for i, size in enumerate(shape)
  ]
  dense = tf.raw_ops.SparseToDense(
      sparse_indices=x.indices,
      output_shape=new_dense_shape,
      sparse_values=x.values,
      default_value=default_value)
  dense.set_shape(shape)
  return dense

sum

sum(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
) -> Tensor

Computes the sum of the values of a Tensor over the whole dataset.
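
As a rough sketch (the feature name 'clicks' is hypothetical), the dataset-wide total can be used to turn each instance's value into its share of the whole dataset:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = tf.cast(inputs['clicks'], tf.float32)  # hypothetical dense numeric feature
  total = tft.sum(x)  # scalar: sum of 'clicks' over the whole analysis dataset
  return {'clicks_share': x / total}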

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), integral (int{8|16|32|64}), or unsigned integral (uint{8|16}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the sum. If x is float32 or float64, the sum will have the same type as x. If x is float16, the output is cast to float32. If x is integral, the output is cast to [u]int64. If x is sparse and reduce_instance_dims is False, the result contains 0 wherever a column has no values across batches.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def sum(  # pylint: disable=redefined-builtin
    x: common_types.TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None) -> tf.Tensor:
  """Computes the sum of the values of a `Tensor` over the whole dataset.

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}),integral (int{8|16|32|64}), or unsigned
        integral (uint{8|16}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` containing the sum. If `x` is float32 or float64, the sum will
    have the same type as `x`. If `x` is float16, the output is cast to float32.
    If `x` is integral, the output is cast to [u]int64. If `x` is sparse and
    reduce_inst_dims is False will return 0 in place where column has no values
    across batches.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'sum'):
    if reduce_instance_dims:
      x = tf.reduce_sum(input_tensor=tf_utils.get_values(x))
    elif isinstance(x, tf.SparseTensor):
      if x.dtype == tf.uint8 or x.dtype == tf.uint16:
        x = tf.cast(x, tf.int64)
      elif x.dtype == tf.uint32 or x.dtype == tf.uint64:
        raise TypeError('Data type %r is not supported' % x.dtype)
      x = tf.sparse.reduce_sum(x, axis=0)
    elif isinstance(x, tf.RaggedTensor):
      raise NotImplementedError(
          'Elementwise sum does not support RaggedTensors.')
    else:
      x = tf.reduce_sum(input_tensor=x, axis=0)
    output_dtype, sum_fn = _sum_combine_fn_and_dtype(x.dtype)
    return _numeric_combine(
        inputs=[x],
        fn=sum_fn,
        default_accumulator_value=0,
        reduce_instance_dims=reduce_instance_dims,
        output_dtypes=[output_dtype])[0]

tfidf

tfidf(
    x: SparseTensor,
    vocab_size: int,
    smooth: bool = True,
    name: Optional[str] = None,
) -> Tuple[SparseTensor, SparseTensor]

Maps the terms in x to their term frequency * inverse document frequency.

The term frequency of a term in a document is calculated as (count of term in document) / (document size)

The inverse document frequency of a term is, by default, calculated as 1 + log((corpus size + 1) / (count of documents containing term + 1)).

Example usage:

>>> def preprocessing_fn(inputs):
...   integerized = tft.compute_and_apply_vocabulary(inputs['x'])
...   vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
...   vocab_index, tfidf_weight = tft.tfidf(integerized, vocab_size)
...   return {
...      'index': vocab_index,
...      'tf_idf': tfidf_weight,
...      'integerized': integerized,
...   }
>>> raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
...             dict(x=["yum", "yum", "pie"])]
>>> feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'index': array([0, 2, 3]), 'integerized': array([3, 2, 0, 0, 0]),
  'tf_idf': array([0.6, 0.28109303, 0.28109303], dtype=float32)},
 {'index': array([0, 1]), 'integerized': array([1, 1, 0]),
  'tf_idf': array([0.33333334, 0.9369768 ], dtype=float32)}]

example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                          [1, 0], [1, 1], [1, 2]],
                 values=[1, 2, 0, 0, 0, 3, 3, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                  values=[1, 2, 0, 3, 0])
     SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                  values=[(1/5)*(log(3/2)+1), (1/5)*(log(3/2)+1), (3/5),
                          (2/3)*(log(3/2)+1), (1/3)]

NOTE: the first doc's duplicate "pie" strings have been combined to one output, as have the second doc's duplicate "yum" strings.

PARAMETER DESCRIPTION
x

A 2D SparseTensor representing int64 values (most likely the result of calling compute_and_apply_vocabulary on a tokenized string).

TYPE: SparseTensor

vocab_size

An int: the size of the vocabulary used to turn the strings into int64s, including any OOV buckets.

TYPE: int

smooth

A bool indicating if the inverse document frequency should be smoothed. If True, which is the default, the idf is calculated as 1 + log((corpus size + 1) / (document frequency of term + 1)). Otherwise, the idf is 1 + log((corpus size) / (document frequency of term)), which could result in a division by zero error.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[SparseTensor, SparseTensor]

Two SparseTensors with indices [index_in_batch, index_in_bag_of_words]. The first has values vocab_index, which is taken from input x. The second has values tfidf_weight.

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def tfidf(
    x: tf.SparseTensor,
    vocab_size: int,
    smooth: bool = True,
    name: Optional[str] = None) -> Tuple[tf.SparseTensor, tf.SparseTensor]:
  # pyformat: disable
  """Maps the terms in x to their term frequency * inverse document frequency.

  The term frequency of a term in a document is calculated as
  (count of term in document) / (document size)

  The inverse document frequency of a term is, by default, calculated as
  1 + log((corpus size + 1) / (count of documents containing term + 1)).


  Example usage:

  >>> def preprocessing_fn(inputs):
  ...   integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  ...   vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
  ...   vocab_index, tfidf_weight = tft.tfidf(integerized, vocab_size)
  ...   return {
  ...      'index': vocab_index,
  ...      'tf_idf': tfidf_weight,
  ...      'integerized': integerized,
  ...   }
  >>> raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
  ...             dict(x=["yum", "yum", "pie"])]
  >>> feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
  >>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
  >>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  ...   transformed_dataset, transform_fn = (
  ...       (raw_data, raw_data_metadata)
  ...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  >>> transformed_data, transformed_metadata = transformed_dataset
  >>> transformed_data
  [{'index': array([0, 2, 3]), 'integerized': array([3, 2, 0, 0, 0]),
    'tf_idf': array([0.6, 0.28109303, 0.28109303], dtype=float32)},
   {'index': array([0, 1]), 'integerized': array([1, 1, 0]),
    'tf_idf': array([0.33333334, 0.9369768 ], dtype=float32)}]

    ```
    example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie]]
    in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                              [1, 0], [1, 1], [1, 2]],
                     values=[1, 2, 0, 0, 0, 3, 3, 0])
    out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                      values=[1, 2, 0, 3, 0])
         SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                      values=[(1/5)*(log(3/2)+1), (1/5)*(log(3/2)+1), (3/5),
                              (2/3)*(log(3/2)+1), (1/3)]
    ```

    NOTE: the first doc's duplicate "pie" strings have been combined to
    one output, as have the second doc's duplicate "yum" strings.

  Args:
    x: A 2D `SparseTensor` representing int64 values (most likely that are the
        result of calling `compute_and_apply_vocabulary` on a tokenized string).
    vocab_size: An int - the count of vocab used to turn the string into int64s
        including any OOV buckets.
    smooth: A bool indicating if the inverse document frequency should be
        smoothed. If True, which is the default, then the idf is calculated as
        1 + log((corpus size + 1) / (document frequency of term + 1)).
        Otherwise, the idf is
        1 +log((corpus size) / (document frequency of term)), which could
        result in a division by zero error.
    name: (Optional) A name for this operation.

  Returns:
    Two `SparseTensor`s with indices [index_in_batch, index_in_bag_of_words].
    The first has values vocab_index, which is taken from input `x`.
    The second has values tfidf_weight.

  Raises:
    ValueError if `x` does not have 2 dimensions.
  """
  # pyformat: enable
  if x.get_shape().ndims != 2:
    raise ValueError('tft.tfidf requires a 2D SparseTensor input. '
                     'Input had {} dimensions.'.format(x.get_shape().ndims))

  with tf.compat.v1.name_scope(name, 'tfidf'):
    cleaned_input = tf_utils.to_vocab_range(x, vocab_size)

    term_frequencies = _to_term_frequency(cleaned_input, vocab_size)

    count_docs_with_term_column = _count_docs_with_term(term_frequencies)
    # Expand dims to get around the min_tensor_rank checks
    sizes = tf.expand_dims(tf.shape(input=cleaned_input)[0], 0)
    # [batch, vocab] - tfidf
    tfidfs = _to_tfidf(term_frequencies,
                       analyzers.sum(count_docs_with_term_column,
                                     reduce_instance_dims=False),
                       analyzers.sum(sizes),
                       smooth)
    return _split_tfidfs_to_outputs(tfidfs)

tukey_h_params

tukey_h_params(
    x: TensorType,
    reduce_instance_dims: bool = True,
    output_dtype: Optional[DType] = None,
    name: Optional[str] = None,
) -> Tuple[Tensor, Tensor]

Computes the h parameters of the values of a Tensor over the dataset.

This computes the parameters (hl, hr) of the samples, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters hl (left parameter) and hr (right parameter). See the following publication for the definition of the Tukey HH distribution:

Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
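
A minimal sketch of retrieving the two tail parameters for a hypothetical dense feature 'latency' inside a preprocessing_fn and broadcasting them to per-instance outputs:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = tf.cast(inputs['latency'], tf.float32)  # hypothetical dense numeric feature
  hl, hr = tft.tukey_h_params(x)  # scalar left/right tail parameters over the dataset
  # Broadcast the scalar analyzer outputs so they can be emitted per instance.
  return {
      'latency': x,
      'tail_left': tf.zeros_like(x) + hl,
      'tail_right': tf.zeros_like(x) + hr,
  }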

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), or integral ([u]int{8|16|32|64}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[Tensor, Tensor]

The tuple (hl, hr) containing two Tensor instances with the hl and hr parameters. If x is floating point, each parameter will have the same type as x. If x is integral, the output is cast to float32.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def tukey_h_params(x: common_types.TensorType,
                   reduce_instance_dims: bool = True,
                   output_dtype: Optional[tf.DType] = None,
                   name: Optional[str] = None) -> Tuple[tf.Tensor, tf.Tensor]:
  """Computes the h parameters of the values of a `Tensor` over the dataset.

  This computes the parameters (hl, hr) of the samples, assuming a Tukey HH
  distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH
  distribution with parameters hl (left parameter) and hr (right parameter).
  See the following publication for the definition of the Tukey HH distribution:

  Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and
  hh-Distributions through L-Moments and the L-Correlation," ISRN Applied
  Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}), or integral ([u]int{8|16|32|64}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    output_dtype: (Optional) If not None, casts the output tensor to this type.
    name: (Optional) A name for this operation.

  Returns:
    The tuple (hl, hr) containing two `Tensor` instances with the hl and hr
    parameters. If `x` is floating point, each parameter will have the same type
    as `x`. If `x` is integral, the output is cast to float32.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'tukey_h_params'):
    return _tukey_parameters(x, reduce_instance_dims, output_dtype)[2:]

tukey_location

tukey_location(
    x: TensorType,
    reduce_instance_dims: Optional[bool] = True,
    output_dtype: Optional[DType] = None,
    name: Optional[str] = None,
) -> Tensor

Computes the location of the values of a Tensor over the whole dataset.

This computes the location of x, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters tukey_h_params. See the following publication for the definition of the Tukey HH distribution:

Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
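
A rough sketch (not the library's own gaussianization mapper) of robust centering and scaling using tukey_location together with tukey_scale, assuming a hypothetical dense feature 'latency':

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = tf.cast(inputs['latency'], tf.float32)  # hypothetical dense numeric feature
  loc = tft.tukey_location(x)   # robust location estimate over the whole dataset
  scale = tft.tukey_scale(x)    # robust scale estimate over the whole dataset
  return {'latency_robust': (x - loc) / scale}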

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), or integral ([u]int{8|16|32|64}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: Optional[bool] DEFAULT: True

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the location. If x is floating point, the location will have the same type as x. If x is integral, the output is cast to float32.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def tukey_location(x: common_types.TensorType,
                   reduce_instance_dims: Optional[bool] = True,
                   output_dtype: Optional[tf.DType] = None,
                   name: Optional[str] = None) -> tf.Tensor:
  """Computes the location of the values of a `Tensor` over the whole dataset.

  This computes the location of x, assuming a Tukey HH distribution, i.e.
  (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters
  tukey_h_params. See the following publication for the definition of the Tukey
  HH distribution:

  Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and
  hh-Distributions through L-Moments and the L-Correlation," ISRN Applied
  Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153

  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}), or integral ([u]int{8|16|32|64}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    output_dtype: (Optional) If not None, casts the output tensor to this type.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` containing the location. If `x` is floating point, the location
    will have the same type as `x`. If `x` is integral, the output is cast to
    float32.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'tukey_location'):
    return _tukey_parameters(x, reduce_instance_dims, output_dtype)[0]

tukey_scale

tukey_scale(
    x: TensorType,
    reduce_instance_dims: Optional[bool] = True,
    output_dtype: Optional[DType] = None,
    name: Optional[str] = None,
) -> Tensor

Computes the scale of the values of a Tensor over the whole dataset.

This computes the scale of x, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters tukey_h_params. See the following publication for the definition of the Tukey HH distribution:

Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153

PARAMETER DESCRIPTION
x

A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), or integral ([u]int{8|16|32|64}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: Optional[bool] DEFAULT: True

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the scale. If x is floating point, the scale will have the same type as x. If x is integral, the output is cast to float32.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def tukey_scale(x: common_types.TensorType,
                reduce_instance_dims: Optional[bool] = True,
                output_dtype: Optional[tf.DType] = None,
                name: Optional[str] = None) -> tf.Tensor:
  """Computes the scale of the values of a `Tensor` over the whole dataset.

  This computes the scale of x, assuming a Tukey HH distribution, i.e.
  (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters
  tukey_h_params. See the following publication for the definition of the Tukey
  HH distribution:

  Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and
  hh-Distributions through L-Moments and the L-Correlation," ISRN Applied
  Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153


  Args:
    x: A `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}), or integral ([u]int{8|16|32|64}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    output_dtype: (Optional) If not None, casts the output tensor to this type.
    name: (Optional) A name for this operation.

  Returns:
    A `Tensor` containing the scale. If `x` is floating point, the location
    will have the same type as `x`. If `x` is integral, the output is cast to
    float32.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'tukey_scale'):
    return _tukey_parameters(x, reduce_instance_dims, output_dtype)[1]

var

var(
    x: TensorType,
    reduce_instance_dims: bool = True,
    name: Optional[str] = None,
    output_dtype: Optional[DType] = None,
) -> Tensor

Computes the variance of the values of a Tensor over the whole dataset.

Uses the biased variance (0 delta degrees of freedom), as given by sum((x - mean(x))**2) / length(x).
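
A small sketch of a hand-rolled z-score using mean and var, assuming a hypothetical dense feature 'age' (tft.scale_to_z_score packages the same idea as a single call):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = tf.cast(inputs['age'], tf.float32)  # hypothetical dense numeric feature
  std = tf.sqrt(tft.var(x))               # biased standard deviation over the dataset
  return {'age_z': (x - tft.mean(x)) / std}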

PARAMETER DESCRIPTION
x

Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}), or integral ([u]int{8|16|32|64}).

TYPE: TensorType

reduce_instance_dims

By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.

TYPE: bool DEFAULT: True

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

output_dtype

(Optional) If not None, casts the output tensor to this type.

TYPE: Optional[DType] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A Tensor containing the variance. If x is floating point, the variance will have the same type as x. If x is integral, the output is cast to float32. NaNs and infinite input values are ignored.

RAISES DESCRIPTION
TypeError

If the type of x is not supported.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def var(x: common_types.TensorType,
        reduce_instance_dims: bool = True,
        name: Optional[str] = None,
        output_dtype: Optional[tf.DType] = None) -> tf.Tensor:
  """Computes the variance of the values of a `Tensor` over the whole dataset.

  Uses the biased variance (0 delta degrees of freedom), as given by
  (x - mean(x))**2 / length(x).

  Args:
    x: `Tensor`, `SparseTensor`, or `RaggedTensor`. Its type must be floating
        point (float{16|32|64}), or integral ([u]int{8|16|32|64}).
    reduce_instance_dims: By default collapses the batch and instance dimensions
        to arrive at a single scalar output. If False, only collapses the batch
        dimension and outputs a vector of the same shape as the input.
    name: (Optional) A name for this operation.
    output_dtype: (Optional) If not None, casts the output tensor to this type.

  Returns:
    A `Tensor` containing the variance. If `x` is floating point, the variance
    will have the same type as `x`. If `x` is integral, the output is cast to
    float32. NaNs and infinite input values are ignored.

  Raises:
    TypeError: If the type of `x` is not supported.
  """
  with tf.compat.v1.name_scope(name, 'var'):
    return _mean_and_var(x, reduce_instance_dims, output_dtype)[1]

vocabulary

vocabulary(
    x: TensorType,
    *,
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    vocab_filename: Optional[str] = None,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[
        Union[Sequence[str], Tensor]
    ] = None,
    weights: Optional[Tensor] = None,
    labels: Optional[Union[Tensor, SparseTensor]] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: Optional[int] = None,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: Optional[bool] = False,
    file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None,
) -> TemporaryAnalyzerOutputType

Computes the unique values of x over the whole dataset.

Computes the unique values taken by x, which can be a Tensor, SparseTensor, or RaggedTensor of any size. The unique values will be aggregated over all dimensions of x and all instances.

If file_format is 'text' and a token contains the '\n' or '\r' characters or is empty, it will be discarded.

If an integer Tensor is provided, its semantic type should be categorical, not continuous/numeric, since computing a vocabulary over a continuous feature is not appropriate.

The unique values are sorted by decreasing frequency and then reverse lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even if x is numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).

For large datasets it is highly recommended to either set frequency_threshold or top_k to control the size of the output, and also the run time of this operation.

When labels are provided, we filter the vocabulary based on the relationship between the token's presence in a record and the label for that record, using (possibly adjusted) Mutual Information. Note: If labels are provided, the x input must be a unique set of tokens per record, as the semantics of the mutual information calculation depend on a multi-hot representation of the input. Having unique input tokens per row is advisable but not required for a frequency-based vocabulary.

WARNING: The following is experimental and is still being actively worked on.

Supply key_fn if you would like to generate a vocabulary with coverage over specific keys.

A "coverage vocabulary" is the union of two vocabulary "arms". The "standard arm" of the vocabulary is equivalent to the one generated by the same function call with no coverage arguments. Adding coverage only appends additional entries to the end of the standard vocabulary.

The "coverage arm" of the vocabulary is determined by taking the coverage_top_k most frequent unique terms per key. A term's key is obtained by applying key_fn to the term. Use coverage_frequency_threshold to lower bound the frequency of entries in the coverage arm of the vocabulary.

Note this is currently implemented for the case where the key is contained within each vocabulary entry (b/117796748).
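
A brief sketch of computing a vocabulary and applying it in the same preprocessing_fn; the feature name 'terms', the top_k value, and the filename are illustrative assumptions.

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  terms = inputs['terms']  # hypothetical string feature
  # Returns the deferred path of the vocabulary file written during analysis.
  vocab_path = tft.vocabulary(terms, top_k=10000, vocab_filename='terms_vocab')
  # Map each term to its index in that vocabulary.
  return {'term_ids': tft.apply_vocabulary(terms, vocab_path)}

This relies on store_frequency staying at its default of False (see the note under store_frequency below); when only the indices are needed, tft.compute_and_apply_vocabulary combines the two steps.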

PARAMETER DESCRIPTION
x

A categorical/discrete input Tensor, SparseTensor, or RaggedTensor with dtype tf.string or tf.int[8|16|32|64]. The inputs should generally be unique per row (i.e. a bag of words/ngrams representation).

TYPE: TensorType

top_k

Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated.

TYPE: Optional[int] DEFAULT: None

frequency_threshold

Limit the generated vocabulary only to elements whose absolute frequency is >= to the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element.

TYPE: Optional[int] DEFAULT: None

vocab_filename

The file name for the vocabulary file. If None, a file name will be chosen based on the current scope. If not None, it should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, set vocab_filename explicitly whenever the vocabulary file is consumed by a downstream component.

TYPE: Optional[str] DEFAULT: None

store_frequency

If True, the frequency of each word is stored in the vocabulary file. If labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if this is True then the computed vocabulary cannot be used with tft.apply_vocabulary directly, since frequencies are added to the beginning of each row of the vocabulary, which the mapper will not ignore.

TYPE: Optional[bool] DEFAULT: False

reserved_tokens

(Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens maintain their order and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on caching.

TYPE: Optional[Union[Sequence[str], Tensor]] DEFAULT: None

weights

(Optional) Weights Tensor for the vocabulary. It must have the same shape as x.

TYPE: Optional[Tensor] DEFAULT: None

labels

(Optional) Labels dense Tensor for the vocabulary. If provided, the vocabulary is calculated based on mutual information with the label, rather than frequency. The labels must have the same batch dimension as x. If x is sparse, labels should be a 1D tensor reflecting row-wise labels. If x is dense, labels can either be a 1D tensor of row-wise labels, or a dense tensor of the identical shape as x (i.e. element-wise labels). Labels should be a discrete integerized tensor (If the label is numeric, it should first be bucketized; If the label is a string, an integer vocabulary should first be applied). Note: CompositeTensor labels are not yet supported (b/134931826). WARNING: When labels are provided, the frequency_threshold argument functions as a mutual information threshold, which is a float. TODO(b/116308354): Fix confusing naming.

TYPE: Optional[Union[Tensor, SparseTensor]] DEFAULT: None

use_adjusted_mutual_info

If true, and labels are provided, calculate vocabulary using adjusted rather than raw mutual information.

TYPE: bool DEFAULT: False

min_diff_from_avg

MI (or AMI) of a feature x label will be adjusted to zero whenever the difference between count and the expected (average) count is lower than min_diff_from_avg. This can be thought of as a regularizing parameter that pushes small MI/AMI values to zero. If None, a default parameter will be selected based on the size of the dataset (see calculate_recommended_min_diff_from_avg).

TYPE: Optional[int] DEFAULT: None

coverage_top_k

(Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary.

TYPE: Optional[int] DEFAULT: None

coverage_frequency_threshold

(Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key.

TYPE: Optional[int] DEFAULT: None

key_fn

(Optional), (Experimental) A fn that takes in a single entry of x and returns the corresponding key for coverage calculation. If this is None, no coverage arm is added to the vocabulary.

TYPE: Optional[Callable[[Any], Any]] DEFAULT: None

fingerprint_shuffle

(Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above (top_k, frequency_threshold, etc) will still take effect.

TYPE: Optional[bool] DEFAULT: False

file_format

(Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.

TYPE: VocabularyFileFormatType DEFAULT: DEFAULT_VOCABULARY_FILE_FORMAT

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
TemporaryAnalyzerOutputType

The path name for the vocabulary file containing the unique values of x.

RAISES DESCRIPTION
ValueError

If top_k or frequency_threshold is negative. If coverage_top_k or coverage_frequency_threshold is negative. If either coverage_top_k or coverage_frequency_threshold is specified and key_fn is not. If key_fn is specified and neither coverage_top_k nor coverage_frequency_threshold is.

Source code in tensorflow_transform/analyzers.py
@common.log_api_use(common.ANALYZER_COLLECTION)
def vocabulary(
    x: common_types.TensorType,
    *,  # Force passing optional parameters by keys.
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    vocab_filename: Optional[str] = None,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Sequence[str], tf.Tensor]] = None,
    weights: Optional[tf.Tensor] = None,
    labels: Optional[Union[tf.Tensor, tf.SparseTensor]] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: Optional[int] = None,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: Optional[bool] = False,
    file_format: common_types.VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None,
) -> common_types.TemporaryAnalyzerOutputType:
  r"""Computes the unique values of `x` over the whole dataset.

  Computes The unique values taken by `x`, which can be a `Tensor`,
  `SparseTensor`, or `RaggedTensor` of any size.  The unique values will be
  aggregated over all dimensions of `x` and all instances.

  In case `file_format` is 'text' and one of the tokens contains the '\n' or
  '\r' characters or is empty it will be discarded.

  If an integer `Tensor` is provided, its semantic type should be categorical
  not a continuous/numeric, since computing a vocabulary over a continuous
  feature is not appropriate.

  The unique values are sorted by decreasing frequency and then reverse
  lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even
  if `x` is numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).

  For large datasets it is highly recommended to either set frequency_threshold
  or top_k to control the size of the output, and also the run time of this
  operation.

  When labels are provided, we filter the vocabulary based on the relationship
  between the token's presence in a record and the label for that record, using
  (possibly adjusted) Mutual Information. Note: If labels are provided, the x
  input must be a unique set of per record, as the semantics of the mutual
  information calculation depend on a multi-hot representation of the input.
  Having unique input tokens per row is advisable but not required for a
  frequency-based vocabulary.

  WARNING: The following is experimental and is still being actively worked on.

  Supply `key_fn` if you would like to generate a vocabulary with coverage over
  specific keys.

  A "coverage vocabulary" is the union of two vocabulary "arms". The "standard
  arm" of the vocabulary is equivalent to the one generated by the same function
  call with no coverage arguments. Adding coverage only appends additional
  entries to the end of the standard vocabulary.

  The "coverage arm" of the vocabulary is determined by taking the
  `coverage_top_k` most frequent unique terms per key. A term's key is obtained
  by applying `key_fn` to the term. Use `coverage_frequency_threshold` to lower
  bound the frequency of entries in the coverage arm of the vocabulary.

  Note this is currently implemented for the case where the key is contained
  within each vocabulary entry (b/117796748).

  Args:
    x: A categorical/discrete input `Tensor`, `SparseTensor`, or `RaggedTensor`
      with dtype tf.string or tf.int[8|16|32|64]. The inputs should generally be
      unique per row (i.e. a bag of words/ngrams representation).
    top_k: Limit the generated vocabulary to the first `top_k` elements. If set
      to None, the full vocabulary is generated.
    frequency_threshold: Limit the generated vocabulary only to elements whose
      absolute frequency is >= to the supplied threshold. If set to None, the
      full vocabulary is generated.  Absolute frequency means the number of
      occurrences of the element in the dataset, as opposed to the proportion of
      instances that contain that element.
    vocab_filename: The file name for the vocabulary file. If None, a file name
      will be chosen based on the current scope. If not None, should be unique
      within a given preprocessing function. NOTE To make your pipelines
      resilient to implementation details please set `vocab_filename` when you
      are using the vocab_filename on a downstream component.
    store_frequency: If True, frequency of the words is stored in the vocabulary
      file. In the case labels are provided, the mutual information is stored in
      the file instead. Each line in the file will be of the form 'frequency
      word'. NOTE: if this is True then the computed vocabulary cannot be used
      with `tft.apply_vocabulary` directly, since frequencies are added to the
      beginning of each row of the vocabulary, which the mapper will not ignore.
    reserved_tokens: (Optional) A list of tokens that should appear in the
      vocabulary regardless of their appearance in the input. These tokens would
      maintain their order, and have a reserved spot at the beginning of the
      vocabulary. Note: this field has no affect on cache.
    weights: (Optional) Weights `Tensor` for the vocabulary. It must have the
      same shape as x.
    labels: (Optional) Labels dense `Tensor` for the vocabulary. If provided,
      the vocabulary is calculated based on mutual information with the label,
      rather than frequency. The labels must have the same batch dimension as x.
      If x is sparse, labels should be a 1D tensor reflecting row-wise labels.
      If x is dense, labels can either be a 1D tensor of row-wise labels, or a
      dense tensor of the identical shape as x (i.e. element-wise labels).
      Labels should be a discrete integerized tensor (If the label is numeric,
      it should first be bucketized; If the label is a string, an integer
      vocabulary should first be applied). Note: `CompositeTensor` labels are
      not yet supported (b/134931826). WARNING: When labels are provided, the
      frequency_threshold argument functions as a mutual information threshold,
      which is a float. TODO(b/116308354): Fix confusing naming.
    use_adjusted_mutual_info: If true, and labels are provided, calculate
      vocabulary using adjusted rather than raw mutual information.
    min_diff_from_avg: MI (or AMI) of a feature x label will be adjusted to zero
      whenever the difference between count and the expected (average) count is
      lower than min_diff_from_average. This can be thought of as a regularizing
      parameter that pushes small MI/AMI values to zero. If None, a default
      parameter will be selected based on the size of the dataset (see
      calculate_recommended_min_diff_from_avg).
    coverage_top_k: (Optional), (Experimental) The minimum number of elements
      per key to be included in the vocabulary.
    coverage_frequency_threshold: (Optional), (Experimental) Limit the coverage
      arm of the vocabulary only to elements whose absolute frequency is >= this
      threshold for a given key.
    key_fn: (Optional), (Experimental) A fn that takes in a single entry of `x`
      and returns the corresponding key for coverage calculation. If this is
      `None`, no coverage arm is added to the vocabulary.
    fingerprint_shuffle: (Optional), (Experimental) Whether to sort the
      vocabularies by fingerprint instead of counts. This is useful for load
      balancing on the training parameter servers. Shuffle only happens while
      writing the files, so all the filters above (top_k, frequency_threshold,
      etc) will still take effect.
    file_format: (Optional) A str. The format of the resulting vocabulary file.
      Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires
      tensorflow>=2.4. The default value is 'text'.
    name: (Optional) A name for this operation.

  Returns:
    The path name for the vocabulary file containing the unique values of `x`.

  Raises:
    ValueError: If `top_k` or `frequency_threshold` is negative.
      If `coverage_top_k` or `coverage_frequency_threshold` is negative.
      If either `coverage_top_k` or `coverage_frequency_threshold` is specified
        and `key_fn` is not.
      If `key_fn` is specified and neither `coverage_top_k`, nor
  """
  top_k, frequency_threshold = _get_top_k_and_frequency_threshold(
      top_k, frequency_threshold)

  if (coverage_top_k or coverage_frequency_threshold) and not key_fn:
    raise ValueError('You must specify `key_fn` if you specify `coverage_top_k'
                     ' or `coverage_frequency_threshold` in `vocabulary`.')

  if key_fn and not (coverage_top_k or coverage_frequency_threshold):
    raise ValueError('You must specify `coverage_top_k`  or '
                     '`coverage_frequency_threshold` if you specify `key_fn` in'
                     ' `vocabulary`.')

  if file_format not in ALLOWED_VOCABULARY_FILE_FORMATS:
    raise ValueError(
        '"{}" is not an accepted file_format. It should be one of: {}'.format(
            file_format, ALLOWED_VOCABULARY_FILE_FORMATS))

  coverage_top_k, coverage_frequency_threshold = (
      _get_top_k_and_frequency_threshold(
          coverage_top_k, coverage_frequency_threshold))

  if x.dtype != tf.string and not x.dtype.is_integer:
    raise ValueError('expected tf.string or integer but got %r' % x.dtype)

  if labels is not None and not labels.dtype.is_integer:
    raise ValueError('expected integer labels but got %r' % labels.dtype)

  if (frequency_threshold is None and labels is None and key_fn is None and
      not fingerprint_shuffle and top_k is not None and
      top_k <= LARGE_VOCAB_TOP_K):
    logging.info('If the number of unique tokens is smaller than the provided '
                 'top_k or approximation error is acceptable, consider using '
                 'tft.experimental.approximate_vocabulary for a potentially '
                 'more efficient implementation.')

  with tf.compat.v1.name_scope(name, 'vocabulary'):
    vocabulary_key = vocab_filename
    vocab_filename = _get_vocab_filename(vocab_filename, store_frequency)
    informativeness_threshold = float('-inf')
    coverage_informativeness_threshold = float('-inf')
    if labels is not None:
      if weights is not None:
        vocab_ordering_type = _VocabOrderingType.WEIGHTED_MUTUAL_INFORMATION
      else:
        vocab_ordering_type = _VocabOrderingType.MUTUAL_INFORMATION
      # Correct for the overloaded `frequency_threshold` API.
      if frequency_threshold is not None:
        informativeness_threshold = frequency_threshold
      frequency_threshold = 0.0
      if coverage_frequency_threshold is not None:
        coverage_informativeness_threshold = coverage_frequency_threshold
      coverage_frequency_threshold = 0.0
    elif weights is not None:
      vocab_ordering_type = _VocabOrderingType.WEIGHTED_FREQUENCY
    else:
      vocab_ordering_type = _VocabOrderingType.FREQUENCY
    analyzer_inputs = _get_vocabulary_analyzer_inputs(
        vocab_ordering_type=vocab_ordering_type,
        x=x,
        file_format=file_format,
        labels=labels,
        weights=weights)
    return _vocabulary_analyzer_nodes(
        analyzer_inputs=analyzer_inputs,
        input_dtype=x.dtype.name,
        vocab_ordering_type=vocab_ordering_type,
        vocab_filename=vocab_filename,
        top_k=top_k,
        frequency_threshold=frequency_threshold or 0,
        informativeness_threshold=informativeness_threshold,
        use_adjusted_mutual_info=use_adjusted_mutual_info,
        min_diff_from_avg=min_diff_from_avg,
        fingerprint_shuffle=fingerprint_shuffle,
        store_frequency=store_frequency,
        key_fn=key_fn,
        coverage_top_k=coverage_top_k,
        coverage_frequency_threshold=coverage_frequency_threshold or 0,
        coverage_informativeness_threshold=coverage_informativeness_threshold,
        file_format=file_format,
        vocabulary_key=vocabulary_key,
        reserved_tokens=reserved_tokens,
    )

word_count

word_count(
    tokens: Union[SparseTensor, RaggedTensor],
    name: Optional[str] = None,
) -> Tensor

Find the token count of each document/row.

tokens is either a RaggedTensor or SparseTensor, representing tokenized strings. This function simply returns the size of each row, so the dtype is not constrained to string.

Example:

>>> sparse = tf.SparseTensor(indices=[[0, 0], [0, 1], [2, 2]],
...                          values=['a', 'b', 'c'], dense_shape=(4, 4))
>>> tft.word_count(sparse)
<tf.Tensor: shape=(4,), dtype=int64, numpy=array([2, 0, 1, 0])>

PARAMETER DESCRIPTION
tokens

Either (1) a SparseTensor, or (2) a RaggedTensor with ragged rank of 1 and non-ragged rank of 1, of dtype tf.string, containing the tokens to be counted.

TYPE: Union[SparseTensor, RaggedTensor]

name

(Optional) A name for this operation.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tensor

A one-dimensional Tensor containing the token counts of each row.

RAISES DESCRIPTION
ValueError

if tokens is neither sparse nor ragged

Source code in tensorflow_transform/mappers.py
@common.log_api_use(common.MAPPER_COLLECTION)
def word_count(tokens: Union[tf.SparseTensor, tf.RaggedTensor],
               name: Optional[str] = None) -> tf.Tensor:
  # pyformat: disable
  """Find the token count of each document/row.

  `tokens` is either a `RaggedTensor` or `SparseTensor`, representing tokenized
  strings. This function simply returns size of each row, so the dtype is not
  constrained to string.

  Example:
  >>> sparse = tf.SparseTensor(indices=[[0, 0], [0, 1], [2, 2]],
  ...                          values=['a', 'b', 'c'], dense_shape=(4, 4))
  >>> tft.word_count(sparse)
  <tf.Tensor: shape=(4,), dtype=int64, numpy=array([2, 0, 1, 0])>

  Args:
    tokens: either
      (1) a `SparseTensor`, or
      (2) a `RaggedTensor` with ragged rank of 1, non-ragged rank of 1
      of dtype `tf.string` containing tokens to be counted
    name: (Optional) A name for this operation.

  Returns:
    A one-dimensional `Tensor` the token counts of each row.

  Raises:
    ValueError: if tokens is neither sparse nor ragged
  """
  # pyformat: enable
  with tf.compat.v1.name_scope(name, 'word_count'):
    if isinstance(tokens, tf.RaggedTensor):
      return tokens.row_lengths()
    elif isinstance(tokens, tf.SparseTensor):
      result = tf.sparse.reduce_sum(
          tf.SparseTensor(indices=tokens.indices,
                          values=tf.ones_like(tokens.values, dtype=tf.int64),
                          dense_shape=tokens.dense_shape),
          axis=list(range(1, tokens.get_shape().ndims)))
      result.set_shape([tokens.shape[0]])
      return result
    else:
      raise ValueError('Invalid token tensor')