TensorFlow Transform tft Module¶
tensorflow_transform¶
Init module for TF.Transform.
Classes¶
DatasetMetadata¶
Metadata about a dataset used for the "instance dict" format.
Caution: The "instance dict" format used with DatasetMetadata is much less efficient than TFXIO. For any serious workloads you should use TFXIO with a tfxio.TensorAdapterConfig instance as the metadata. Refer to Get started with TF-Transform for more details.
This is an in-memory representation that may be serialized and deserialized to and from a variety of disk representations.
Source code in tensorflow_transform/tf_metadata/dataset_metadata.py
Functions¶
from_feature_spec
classmethod
¶
from_feature_spec(
feature_spec: Mapping[str, FeatureSpecType],
domains: Optional[Mapping[str, DomainType]] = None,
) -> _DatasetMetadataType
Creates a DatasetMetadata from a TF feature spec dict.
Source code in tensorflow_transform/tf_metadata/dataset_metadata.py
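For illustration, a minimal sketch of building metadata for the "instance dict" format; the feature names and specs below are hypothetical:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Hypothetical raw feature spec: a scalar float feature and a scalar string feature.
feature_spec = {
    "x": tf.io.FixedLenFeature([], tf.float32),
    "s": tf.io.FixedLenFeature([], tf.string),
}

# Metadata describing raw data passed to tft_beam in the "instance dict" format.
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
```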
TFTransformOutput
¶
TFTransformOutput(transform_output_dir: str)
A wrapper around the output of the tf.Transform.
Init method for TFTransformOutput.
PARAMETER | DESCRIPTION
---|---
transform_output_dir | The directory containing tf.Transform output. TYPE: str
Source code in tensorflow_transform/output_wrapper.py
Attributes¶
POST_TRANSFORM_FEATURE_STATS_PATH
class-attribute
instance-attribute
¶
POST_TRANSFORM_FEATURE_STATS_PATH = join(
"post_transform_feature_stats", _FEATURE_STATS_PB
)
PRE_TRANSFORM_FEATURE_STATS_PATH
class-attribute
instance-attribute
¶
PRE_TRANSFORM_FEATURE_STATS_PATH = join(
"pre_transform_feature_stats", _FEATURE_STATS_PB
)
TRANSFORMED_METADATA_DIR
class-attribute
instance-attribute
¶
post_transform_statistics_path
property
¶
post_transform_statistics_path: str
Returns the path to the post-transform datum statistics.
Note: post-transform statistics are not guaranteed to exist in the output of tf.Transform, so accessing this path may fail if they are not present in the TFTransformOutput.
pre_transform_statistics_path
property
¶
pre_transform_statistics_path: str
Returns the path to the pre-transform datum statistics.
Note: pre-transform statistics are not guaranteed to exist in the output of tf.Transform, so accessing this path may fail if they are not present in the TFTransformOutput.
raw_metadata
property
¶
raw_metadata: DatasetMetadata
A DatasetMetadata.
Note: raw_metadata is not guaranteed to exist in the output of tf.Transform, so accessing it may fail if it is not present in the TFTransformOutput.
RETURNS | DESCRIPTION
---|---
DatasetMetadata | A DatasetMetadata.
Functions¶
load_transform_graph
¶
Load the transform graph without replacing any placeholders.
This is necessary to ensure that variables in the transform graph are included in the training checkpoint when using tf.Estimator. This should be called in the training input_fn.
Source code in tensorflow_transform/output_wrapper.py
num_buckets_for_transformed_feature
¶
Returns the number of buckets for an integerized transformed feature.
Source code in tensorflow_transform/output_wrapper.py
raw_domains
¶
Returns domains for the raw features.
RETURNS | DESCRIPTION
---|---
Dict[str, DomainType] | A dict from feature names to one of schema_pb2.IntDomain, schema_pb2.StringDomain or schema_pb2.FloatDomain.
Source code in tensorflow_transform/output_wrapper.py
raw_feature_spec
¶
transform_features_layer
¶
Creates a TransformFeaturesLayer from this transform output.
If a TransformFeaturesLayer has already been created for self, the same one will be returned.
RETURNS | DESCRIPTION
---|---
Model | A TransformFeaturesLayer.
Source code in tensorflow_transform/output_wrapper.py
transform_raw_features
¶
transform_raw_features(
raw_features: Mapping[str, TensorType],
drop_unused_features: bool = True,
) -> Dict[str, TensorType]
Takes a dict of tensors representing raw features and transforms them.
Takes a dictionary of Tensor, SparseTensor, or RaggedTensor values that represent the raw features, and applies the transformation defined by tf.Transform.
If drop_unused_features is False, all transformed features defined by tf.Transform are returned. To only return features transformed from the given raw_features, set drop_unused_features to True.
Note: If eager execution is enabled and this API is invoked inside a tf.function or an API that uses tf.function such as dataset.map, please use transform_features_layer instead. It separates out loading of the transform graph and hence resources will not be initialized on each invocation. This can have significant performance improvement if the transform graph was exported as a TF1 SavedModel and guarantees correctness if it was exported as a TF2 SavedModel.
PARAMETER | DESCRIPTION
---|---
raw_features | A dict whose keys are feature names and values are Tensor, SparseTensor, or RaggedTensor values representing the raw features.
drop_unused_features | If True, the result will be filtered. Only the features that are transformed from 'raw_features' will be included in the returned result. If a feature is transformed from multiple raw features (e.g., a feature cross), it will only be included if all its base raw features are present in raw_features. TYPE: bool DEFAULT: True
RETURNS | DESCRIPTION
---|---
Dict[str, TensorType] | A dict whose keys are feature names and values are Tensor, SparseTensor, or RaggedTensor values representing the transformed features.
Source code in tensorflow_transform/output_wrapper.py
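As a rough usage sketch (the output directory and feature handling here are hypothetical), a TFTransformOutput can be loaded and used to transform raw features parsed from serialized tf.Examples:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Hypothetical directory produced by a previous tf.Transform run.
tf_transform_output = tft.TFTransformOutput("/tmp/transform_output")

def parse_and_transform(serialized_examples):
    # Parse raw examples using the raw feature spec stored alongside the transform graph.
    raw_features = tf.io.parse_example(
        serialized_examples, tf_transform_output.raw_feature_spec())
    # Apply the transformation defined by the preprocessing_fn.
    return tf_transform_output.transform_raw_features(raw_features)
```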
transformed_domains
¶
Returns domains for the transformed features.
RETURNS | DESCRIPTION
---|---
Dict[str, DomainType] | A dict from feature names to one of schema_pb2.IntDomain, schema_pb2.StringDomain or schema_pb2.FloatDomain.
Source code in tensorflow_transform/output_wrapper.py
transformed_feature_spec
¶
Returns a feature_spec for the transformed features.
RETURNS | DESCRIPTION
---|---
Dict[str, FeatureSpecType] | A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.
Source code in tensorflow_transform/output_wrapper.py
vocabulary_by_name
¶
Like vocabulary_file_by_name but returns a list.
Source code in tensorflow_transform/output_wrapper.py
vocabulary_file_by_name
¶
Returns the vocabulary file path created in the preprocessing function.
vocab_filename must either be (i) the name used as the vocab_filename argument to tft.compute_and_apply_vocabulary / tft.vocabulary or (ii) the key used in tft.annotate_asset.
When a mapping has been specified by calls to tft.annotate_asset, it will be checked first for the provided filename. If present, this filename will be used directly to construct a path.
If the mapping does not exist or vocab_filename is not present within it, we will default to sanitizing vocab_filename and searching for files matching it within the assets directory.
In either case, if the constructed path does not point to an existing file within the assets subdirectory, we will return None.
PARAMETER | DESCRIPTION
---|---
vocab_filename | The vocabulary name to look up.
Source code in tensorflow_transform/output_wrapper.py
vocabulary_size_by_name
¶
Like vocabulary_file_by_name, but returns the size of vocabulary.
Source code in tensorflow_transform/output_wrapper.py
TransformFeaturesLayer
¶
TransformFeaturesLayer(
tft_output: TFTransformOutput,
exported_as_v1: Optional[bool] = None,
)
Bases: Model
A Keras layer for applying a tf.Transform output to input layers.
Source code in tensorflow_transform/output_wrapper.py
Functions¶
call
¶
Source code in tensorflow_transform/output_wrapper.py
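A minimal sketch of the layer in use; the directory path is hypothetical, and the layer is typically obtained via TFTransformOutput.transform_features_layer rather than constructed directly:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Hypothetical transform output directory.
tf_transform_output = tft.TFTransformOutput("/tmp/transform_output")
transform_layer = tf_transform_output.transform_features_layer()

@tf.function
def transform_fn(raw_features):
    # raw_features: a dict of raw feature tensors, e.g. from tf.io.parse_example.
    return transform_layer(raw_features)
```

Because the layer loads the transform graph once, it is the preferred path when transforming features inside tf.function or dataset.map (see the note under transform_raw_features).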
Functions¶
Any
¶
Special type indicating an unconstrained type.
- Any is compatible with every type.
- Any assumed to have all methods.
- All values assumed to be instances of Any.
Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance or class checks.
Source code in python3.9/typing.py
Optional
¶
Optional type.
Optional[X] is equivalent to Union[X, None].
Union
¶
Union type; Union[X, Y] means either X or Y.
To define a union, use e.g. Union[int, str]. Details:
- The arguments must be types and there must be at least one.
- None as an argument is a special case and is replaced by type(None).
- Unions of unions are flattened, e.g.:
    Union[Union[int, str], float] == Union[int, str, float]
- Unions of a single argument vanish, e.g.:
    Union[int] == int  # The constructor actually returns int
- Redundant arguments are skipped, e.g.:
    Union[int, str, int] == Union[int, str]
- When comparing unions, the argument order is ignored, e.g.:
    Union[int, str] == Union[str, int]
- You cannot subclass or instantiate a union.
- You can use Optional[X] as a shorthand for Union[X, None].
Source code in python3.9/typing.py
annotate_asset
¶
Creates a mapping between user-defined keys and SavedModel assets.
This mapping is made available in BeamDatasetMetadata and is also used to resolve vocabularies in tft.TFTransformOutput.
Note: multiple mappings for the same key will overwrite the previous one.
PARAMETER | DESCRIPTION
---|---
asset_key | The key to associate with the asset.
asset_filename | The filename as it appears within the assets/ subdirectory. Must be sanitized and complete (e.g. include the tfrecord.gz suffix where applicable).
Source code in tensorflow_transform/annotators.py
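For illustration only, a hedged sketch of pairing tft.vocabulary with tft.annotate_asset so the vocabulary can later be resolved by a user-defined key; the feature name, vocab filename, and key are hypothetical, and this assumes the written text vocabulary keeps the filename "s_vocab":

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Compute a vocabulary over the hypothetical raw string feature 's'.
    deferred_vocab_path = tft.vocabulary(inputs["s"], vocab_filename="s_vocab")
    # Associate a user-defined key with the vocabulary asset written above.
    tft.annotate_asset("my_vocab_key", "s_vocab")
    return {
        "s_integerized": tft.apply_vocabulary(inputs["s"], deferred_vocab_path),
    }

# Later the asset can be looked up via
# tft.TFTransformOutput(output_dir).vocabulary_file_by_name("my_vocab_key").
```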
apply_buckets
¶
apply_buckets(
x: ConsistentTensorType,
bucket_boundaries: BucketBoundariesType,
name: Optional[str] = None,
) -> ConsistentTensorType
Returns a bucketized column, with a bucket index assigned to each input.
Each element e in x is mapped to a positive index i for which bucket_boundaries[i-1] <= e < bucket_boundaries[i], if it exists. If e < bucket_boundaries[0], then e is mapped to 0. If e >= bucket_boundaries[-1], then e is mapped to len(bucket_boundaries). NaNs are mapped to len(bucket_boundaries).
Example:
x = tf.constant([[4.0, float('nan'), 1.0], [float('-inf'), 7.5, 10.0]])
bucket_boundaries = tf.constant([[2.0, 5.0, 10.0]])
tft.apply_buckets(x, bucket_boundaries)
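With these boundaries, the expected result is [[1, 3, 0], [0, 2, 3]]: 4.0 falls in [2.0, 5.0), 1.0 and -inf fall below the first boundary, 7.5 falls in [5.0, 10.0), and NaN and 10.0 map to len(bucket_boundaries) = 3.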
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric input
TYPE:
|
bucket_boundaries
|
A rank 2
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x, with each element in the returned tensor representing the bucketized value. Bucketized value is in the range [0, len(bucket_boundaries)].
Source code in tensorflow_transform/mappers.py
apply_buckets_with_interpolation
¶
apply_buckets_with_interpolation(
x: ConsistentTensorType,
bucket_boundaries: BucketBoundariesType,
name: Optional[str] = None,
) -> ConsistentTensorType
Interpolates within the provided buckets and then normalizes to 0 to 1.
A method for normalizing continuous numeric data to the range [0, 1]. Numeric values are first bucketized according to the provided boundaries, then linearly interpolated within their respective bucket ranges. Finally, the interpolated values are normalized to the range [0, 1]. Values that are less than or equal to the lowest boundary, or greater than or equal to the highest boundary, will be mapped to 0 and 1 respectively. NaN values will be mapped to the middle of the range (.5).
This is a non-linear approach to normalization that is less sensitive to outliers than min-max or z-score scaling. When outliers are present, standard forms of normalization can leave the majority of the data compressed into a very small segment of the output range, whereas this approach tends to spread out the more frequent values (if quantile buckets are used). Note that distance relationships in the raw data are not necessarily preserved (data points that are close to each other in the raw feature space may not be equally close in the transformed feature space). This means that unlike linear normalization methods, correlations between features may be distorted by the transformation. This scaling method may help with stability and minimize exploding gradients in neural networks.
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric input
TYPE:
|
bucket_boundaries
|
Sorted bucket boundaries as a rank-2
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x, normalized to the range [0, 1]. If the input x is tf.float64, the returned values will be tf.float64. Otherwise, returned values are tf.float32.
Source code in tensorflow_transform/mappers.py
apply_pyfunc
¶
Applies a python function to some Tensors.
Applies a python function to some Tensors given by the argument list. The number of arguments should match the number of inputs to the function.
This function is for use inside a preprocessing_fn. It is a wrapper around tf.py_func. A function added this way can run in Transform, and during training when the graph is imported using the transform_raw_features method of the TFTransformOutput class. However, if the resulting training graph is serialized and deserialized, then the tf.py_func op will not work and will cause an error. This means that TensorFlow Serving will not be able to serve this graph.
The underlying reason for this limited support is that tf.py_func ops were not designed to be serialized since they contain a reference to arbitrary Python functions. This function pickles those functions and includes them in the graph, and transform_raw_features similarly unpickles the functions. But unpickling requires a Python environment, so it is not possible to provide support in non-Python languages for loading such ops. Therefore loading these ops in libraries such as TensorFlow Serving is not supported.
Note: This API can only be used when TF2 is disabled or tft_beam.Context.force_tf_compat_v1=True.
PARAMETER | DESCRIPTION |
---|---|
func
|
A Python function, which accepts a list of NumPy
|
Tout
|
A list or tuple of tensorflow data types or a single tensorflow data
type if there is only one, indicating what
|
stateful
|
(Boolean.) If True, the function should be considered stateful. If a function is stateless, when given the same input it will return the same output and have no observable side effects. Optimizations such as common subexpression elimination are only performed on stateless operations.
DEFAULT:
|
name
|
A name for the operation (optional).
DEFAULT:
|
*args
|
The list of
DEFAULT:
|
Returns:
A Tensor
representing the application of the function.
Source code in tensorflow_transform/py_func/api.py
apply_vocabulary
¶
apply_vocabulary(
x: ConsistentTensorType,
deferred_vocab_filename_tensor: TemporaryAnalyzerOutputType,
*,
default_value: Any = -1,
num_oov_buckets: int = 0,
lookup_fn: Optional[
Callable[
[TensorType, Tensor], Tuple[Tensor, Tensor]
]
] = None,
file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
name: Optional[str] = None,
) -> ConsistentTensorType
Maps x to a vocabulary specified by the deferred tensor.
This function also writes domain statistics about the vocabulary min and max values. Note that the min and max are inclusive, and depend on the vocab size, num_oov_buckets and default_value.
PARAMETER | DESCRIPTION |
---|---|
x
|
A categorical
TYPE:
|
deferred_vocab_filename_tensor
|
The deferred vocab filename tensor as
returned by
TYPE:
|
default_value
|
The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.
TYPE:
|
num_oov_buckets
|
Any lookup of an out-of-vocabulary token will return a
bucket ID based on its hash if
TYPE:
|
lookup_fn
|
Optional lookup function, if specified it should take a tensor
and a deferred vocab filename as an input and return a lookup
TYPE:
|
file_format
|
(Optional) A str. The format of the given vocabulary. Accepted formats are: 'tfrecord_gzip', 'text'. The default value is 'text'.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer and integers are consecutive starting from zero, and string values not in the vocabulary are assigned default_value.
Source code in tensorflow_transform/mappers.py
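As a hedged sketch of the typical analyze-then-map pattern (the feature name and vocab filename are hypothetical), tft.vocabulary supplies the deferred filename tensor that apply_vocabulary consumes:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Analyze phase: compute a vocabulary over the hypothetical string feature 's'.
    deferred_vocab_path = tft.vocabulary(inputs["s"], vocab_filename="s_vocab")
    # Transform phase: map strings to vocabulary indices; out-of-vocabulary values
    # are hashed into 2 OOV buckets instead of receiving default_value.
    return {
        "s_integerized": tft.apply_vocabulary(
            inputs["s"], deferred_vocab_path, num_oov_buckets=2),
    }
```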
bag_of_words
¶
bag_of_words(
tokens: SparseTensor,
ngram_range: Tuple[int, int],
separator: str,
name: Optional[str] = None,
) -> SparseTensor
Computes a bag of "words" based on the specified ngram configuration.
A light wrapper around tft.ngrams. First computes ngrams, then transforms the ngram representation (list semantics) into a Bag of Words (set semantics) per row. Each row reflects the set of unique ngrams present in an input record.
See tft.ngrams for more information.
PARAMETER | DESCRIPTION |
---|---|
tokens
|
a two-dimensional
TYPE:
|
ngram_range
|
A pair with the range (inclusive) of ngram sizes to compute. |
separator
|
a string that will be inserted between tokens when ngrams are constructed.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
SparseTensor | A SparseTensor containing the unique set of ngrams from each row of the input.
Source code in tensorflow_transform/mappers.py
bucketize
¶
bucketize(
x: ConsistentTensorType,
num_buckets: int,
epsilon: Optional[float] = None,
weights: Optional[Tensor] = None,
elementwise: bool = False,
name: Optional[str] = None,
) -> ConsistentTensorType
Returns a bucketized column, with a bucket index assigned to each input.
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric input
TYPE:
|
num_buckets
|
Values in the input
TYPE:
|
epsilon
|
(Optional) Error tolerance, typically a small fraction close to
zero. If a value is not specified by the caller, a suitable value is
computed based on experimental results. For |
weights
|
(Optional) Weights tensor for the quantiles. Tensor must have the same shape as x.
TYPE:
|
elementwise
|
(Optional) If true, bucketize each element of the tensor independently.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x, with each element in the returned tensor representing the bucketized value. Bucketized value is in the range [0, actual_num_buckets). Sometimes the actual number of buckets can be different than the num_buckets hint, for example in case the number of distinct values is smaller than num_buckets, or in cases where the input values are not uniformly distributed. NaN values are mapped to the last bucket. Values with NaN weights are ignored in bucket boundaries calculation.
RAISES | DESCRIPTION |
---|---|
TypeError
|
If num_buckets is not an int. |
ValueError
|
If value of num_buckets is not > 1. |
ValueError
|
If elementwise=True and x is a |
Source code in tensorflow_transform/mappers.py
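For illustration, a minimal preprocessing_fn sketch (the feature name 'x' is hypothetical) that quantile-bucketizes a numeric feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Assign each value of the hypothetical numeric feature 'x' to one of
    # 4 approximately equal-sized quantile buckets computed over the dataset.
    return {
        "x_bucketized": tft.bucketize(inputs["x"], num_buckets=4),
    }
```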
bucketize_per_key
¶
bucketize_per_key(
x: ConsistentTensorType,
key: ConsistentTensorType,
num_buckets: int,
epsilon: Optional[float] = None,
weights: Optional[ConsistentTensorType] = None,
name: Optional[str] = None,
) -> ConsistentTensorType
Returns a bucketized column, with a bucket index assigned to each input.
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric input
TYPE:
|
key
|
A
TYPE:
|
num_buckets
|
Values in the input
TYPE:
|
epsilon
|
(Optional) see |
weights
|
(Optional) A
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x, with each element in the returned tensor representing the bucketized value. Bucketized value is in the range [0, actual_num_buckets). If the computed key vocabulary doesn't have an entry for key, the resulting bucket is -1.
RAISES | DESCRIPTION |
---|---|
ValueError
|
If value of num_buckets is not > 1. |
Source code in tensorflow_transform/mappers.py
compute_and_apply_vocabulary
¶
compute_and_apply_vocabulary(
x: ConsistentTensorType,
*,
default_value: Any = -1,
top_k: Optional[int] = None,
frequency_threshold: Optional[int] = None,
num_oov_buckets: int = 0,
vocab_filename: Optional[str] = None,
weights: Optional[Tensor] = None,
labels: Optional[Tensor] = None,
use_adjusted_mutual_info: bool = False,
min_diff_from_avg: float = 0.0,
coverage_top_k: Optional[int] = None,
coverage_frequency_threshold: Optional[int] = None,
key_fn: Optional[Callable[[Any], Any]] = None,
fingerprint_shuffle: bool = False,
file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
store_frequency: Optional[bool] = False,
reserved_tokens: Optional[
Union[Iterable[str], Tensor]
] = None,
name: Optional[str] = None,
) -> ConsistentTensorType
Generates a vocabulary for x and maps it to an integer with this vocab.
In case one of the tokens contains the '\n' or '\r' characters or is empty it will be discarded since we are currently writing the vocabularies as text files. This behavior will likely be fixed/improved in the future.
Note that this function will cause a vocabulary to be computed. For large datasets it is highly recommended to either set frequency_threshold or top_k to control the size of the vocabulary, and also the run time of this operation.
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
default_value
|
The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.
TYPE:
|
top_k
|
Limit the generated vocabulary to the first |
frequency_threshold
|
Limit the generated vocabulary only to elements whose absolute frequency is >= the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element. If labels are provided and the vocab is computed using mutual information, tokens are filtered if their mutual information with the label is < the supplied threshold. |
num_oov_buckets
|
Any lookup of an out-of-vocabulary token will return a
bucket ID based on its hash if
TYPE:
|
vocab_filename
|
The file name for the vocabulary file. If None, a name based
on the scope name in the context of this graph will be used as the file
name. If not None, should be unique within a given preprocessing function.
NOTE in order to make your pipelines resilient to implementation details
please set |
weights
|
(Optional) Weights
TYPE:
|
labels
|
(Optional) A
TYPE:
|
use_adjusted_mutual_info
|
If true, use adjusted mutual information.
TYPE:
|
min_diff_from_avg
|
Mutual information of a feature will be adjusted to zero whenever the difference between count of the feature with any label and its expected count is lower than min_diff_from_average.
TYPE:
|
coverage_top_k
|
(Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary. |
coverage_frequency_threshold
|
(Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key. |
key_fn
|
(Optional), (Experimental) A fn that takes in a single entry of |
fingerprint_shuffle
|
(Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above will still take effect.
TYPE:
|
file_format
|
(Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
TYPE:
|
store_frequency
|
If True, frequency of the words is stored in the vocabulary file. In the case labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and text_format is 'text' then spaces will be replaced to avoid information loss. |
reserved_tokens
|
(Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache. |
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A tensor of the same type as x where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer and integers are consecutive starting from zero. String values not in the vocabulary are assigned default_value; if num_oov_buckets is greater than zero, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets) for an overall range of [0, vocab_size + num_oov_buckets).
RAISES | DESCRIPTION |
---|---|
ValueError
|
If |
Source code in tensorflow_transform/mappers.py
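As a minimal hedged sketch (the feature name, vocab size, and filename are hypothetical), a bounded vocabulary with OOV buckets can be computed and applied in one call:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Build a vocabulary over the hypothetical string feature 's', keep the
    # 10,000 most frequent tokens, and hash everything else into 5 OOV buckets.
    return {
        "s_integerized": tft.compute_and_apply_vocabulary(
            inputs["s"],
            top_k=10000,
            num_oov_buckets=5,
            vocab_filename="s_vocab",
        ),
    }
```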
count_per_key
¶
count_per_key(
key: TensorType,
key_vocabulary_filename: Optional[str] = None,
name: Optional[str] = None,
)
Computes the count of each element of a Tensor.
PARAMETER | DESCRIPTION |
---|---|
key
|
A
TYPE:
|
key_vocabulary_filename
|
(Optional) The file name for the key-output mapping file. If None and key are provided, this combiner assumes the keys fit in memory and will not store the result in a file. If empty string, a file name will be chosen based on the current scope. If not an empty string, should be unique within a given preprocessing function. |
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
Either | (A) Two tensors: the key vocabulary and the corresponding counts (if key_vocabulary_filename is None), or (B) the filename where the key-value mapping is stored (if key_vocabulary_filename is not None).
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the type of |
Source code in tensorflow_transform/analyzers.py
covariance
¶
Computes the covariance matrix over the whole dataset.
The covariance matrix M is defined as follows: Let x[:j] be a tensor of the jth element of all input vectors in x, and let u_j = mean(x[:j]). The entry M[i,j] = E[(x[:i] - u_i)(x[:j] - u_j)]. Notice that the diagonal entries correspond to variances of individual elements in the vector, i.e. M[i,i] corresponds to the variance of x[:i].
PARAMETER | DESCRIPTION |
---|---|
x
|
A rank-2
TYPE:
|
dtype
|
Tensorflow dtype of entries in the returned matrix.
TYPE:
|
name
|
(Optional) A name for this operation. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
if input is not a rank-2 Tensor. |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A rank-2 (matrix) covariance |
Source code in tensorflow_transform/analyzers.py
deduplicate_tensor_per_row
¶
Deduplicates each row (0-th dimension) of the provided tensor.
PARAMETER | DESCRIPTION |
---|---|
input_tensor
|
A two-dimensional
|
name
|
Optional name for the operation.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
A |
Source code in tensorflow_transform/mappers.py
estimated_probability_density
¶
estimated_probability_density(
x: Tensor,
boundaries: Optional[Union[Tensor, int]] = None,
categorical: bool = False,
name: Optional[str] = None,
) -> Tensor
Computes an approximate probability density at each x, given the bins.
Using this type of fixed-interval method has several benefits compared to bucketization, although it may not always be preferred: 1. Quantiles does not work on categorical data. 2. The quantiles algorithm does not currently operate on multiple features jointly, only independently.
Outlier detection in a multi-modal or arbitrary distribution.
Imagine a value x where a simple model is highly predictive of a target y within certain densely populated ranges. Outside these ranges, we may want to treat the data differently, but there are too few samples for the model to detect them by case-by-case treatment. One option would be to use the density estimate for this purpose:
outputs['x_density'] = tft.estimated_prob(inputs['x'], bins=100)
outputs['outlier_x'] = tf.where(outputs['x_density'] < OUTLIER_THRESHOLD,
                                tf.constant([1]), tf.constant([0]))
This exercise uses a single variable for illustration, but a direct density metric would become more useful with higher dimensions.
Note that we normalize by average bin_width to arrive at a probability density estimate. The result resembles a pdf, not the probability that a value falls in the bucket (except in the categorical case).
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
boundaries
|
(Optional) A |
categorical
|
(Optional) A
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A |
Tensor
|
probability mass estimate if |
RAISES | DESCRIPTION |
---|---|
NotImplementedError
|
If |
Source code in tensorflow_transform/mappers.py
get_analyze_input_columns
¶
get_analyze_input_columns(
preprocessing_fn: Callable[
[Mapping[str, TensorType]], Mapping[str, TensorType]
],
specs: Mapping[str, Union[FeatureSpecType, TypeSpec]],
force_tf_compat_v1: bool = False,
) -> List[str]
Return columns that are required inputs of AnalyzeDataset
.
PARAMETER | DESCRIPTION |
---|---|
preprocessing_fn
|
A tf.transform preprocessing_fn.
TYPE:
|
specs
|
A dict of feature name to tf.TypeSpecs. If |
force_tf_compat_v1
|
(Optional) If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
List[str]
|
A list of columns that are required inputs of analyzers. |
Source code in tensorflow_transform/inspect_preprocessing_fn.py
get_num_buckets_for_transformed_feature
¶
Provides the number of buckets for a transformed feature if annotated.
This for example can be used for the direct output of tft.bucketize, tft.apply_buckets, tft.compute_and_apply_vocabulary, tft.apply_vocabulary. These methods annotate the transformed feature with additional information. If the given transformed_feature isn't annotated, this method will fail.
Example:
def preprocessing_fn(inputs):
  bucketized = tft.bucketize(inputs['x'], num_buckets=3)
  integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  zeros = tf.zeros_like(inputs['x'], tf.int64)
  return {
      'bucketized': bucketized,
      'bucketized_num_buckets': (
          zeros + tft.get_num_buckets_for_transformed_feature(bucketized)),
      'integerized': integerized,
      'integerized_num_buckets': (
          zeros + tft.get_num_buckets_for_transformed_feature(integerized)),
  }

raw_data = [dict(x=3), dict(x=23)]
feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'bucketized': 1, 'bucketized_num_buckets': 3,
#   'integerized': 0, 'integerized_num_buckets': 2},
#  {'bucketized': 2, 'bucketized_num_buckets': 3,
#   'integerized': 1, 'integerized_num_buckets': 2}]
PARAMETER | DESCRIPTION |
---|---|
transformed_feature
|
A
TYPE:
|
RAISES | DESCRIPTION |
---|---|
ValueError
|
If the given tensor has not been annotated a the number of |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A |
Source code in tensorflow_transform/mappers.py
get_transform_input_columns
¶
get_transform_input_columns(
preprocessing_fn: Callable[
[Mapping[str, TensorType]], Mapping[str, TensorType]
],
specs: Mapping[str, Union[FeatureSpecType, TypeSpec]],
force_tf_compat_v1: bool = False,
) -> List[str]
Return columns that are required inputs of TransformDataset
.
PARAMETER | DESCRIPTION |
---|---|
preprocessing_fn
|
A tf.transform preprocessing_fn.
TYPE:
|
specs
|
A dict of feature name to tf.TypeSpecs. If |
force_tf_compat_v1
|
(Optional) If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
List[str]
|
A list of columns that are required inputs of the transform |
List[str]
|
defined by |
Source code in tensorflow_transform/inspect_preprocessing_fn.py
hash_strings
¶
hash_strings(
strings: ConsistentTensorType,
hash_buckets: int,
key: Optional[Iterable[int]] = None,
name: Optional[str] = None,
) -> ConsistentTensorType
Hash strings into buckets.
PARAMETER | DESCRIPTION |
---|---|
strings
|
a
TYPE:
|
hash_buckets
|
the number of hash buckets.
TYPE:
|
key
|
optional. An array of two Python |
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
ConsistentTensorType
|
A |
ConsistentTensorType
|
same shape as |
ConsistentTensorType
|
the input |
RAISES | DESCRIPTION |
---|---|
TypeError
|
if |
Source code in tensorflow_transform/mappers.py
histogram
¶
histogram(
x: TensorType,
boundaries: Optional[Union[Tensor, int]] = None,
categorical: Optional[bool] = False,
name: Optional[str] = None,
) -> Tuple[Tensor, Tensor]
Computes a histogram over x, given the bin boundaries or bin count.
Example (1):
counts, boundaries = histogram([0, 1, 0, 1, 0, 3, 0, 1], range(5))
counts: [4, 3, 0, 1, 0]
boundaries: [0, 1, 2, 3, 4]
Example (2): Can be used to compute class weights.
counts, classes = histogram([0, 1, 0, 1, 0, 3, 0, 1], categorical=True)
probabilities = counts / tf.reduce_sum(counts)
class_weights = dict(
    (a.numpy(), 1.0 / b.numpy()) for a, b in zip(classes, probabilities))
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
boundaries
|
(Optional) A |
categorical
|
(Optional) A |
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
counts
|
The histogram, as counts per bin.
TYPE:
|
boundaries
|
A
TYPE:
|
Source code in tensorflow_transform/analyzers.py
make_and_track_object
¶
make_and_track_object(
trackable_factory_callable: Callable[[], Trackable],
name: Optional[str] = None,
) -> Trackable
Keeps track of the object created by invoking trackable_factory_callable.
This API is only for use when Transform APIs are run with TF2 behaviors enabled and tft_beam.Context.force_tf_compat_v1 is set to False.
Use this API to track TF Trackable objects created in the preprocessing_fn such as tf.hub modules, tf.data.Dataset etc. This ensures they are serialized correctly when exporting to SavedModel.
PARAMETER | DESCRIPTION |
---|---|
trackable_factory_callable
|
A callable that creates and returns a Trackable object.
TYPE:
|
name
|
(Optional) Provide a unique name to track this object with. If the Trackable object created is a Keras Layer or Model this is needed for proper tracking. |
Example:
def preprocessing_fn(inputs):
  dataset = tft.make_and_track_object(
      lambda: tf.data.Dataset.from_tensor_slices([1, 2, 3]))
  with tf.init_scope():
    dataset_list = list(dataset.as_numpy_iterator())
  return {'x_0': dataset_list[0] + inputs['x']}

raw_data = [dict(x=1), dict(x=2), dict(x=3)]
feature_spec = dict(x=tf.io.FixedLenFeature([], tf.int64))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp(),
                      force_tf_compat_v1=False):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'x_0': 2}, {'x_0': 3}, {'x_0': 4}]
RETURNS | DESCRIPTION |
---|---|
Trackable
|
The object returned when trackable_factory_callable is invoked. The object |
Trackable
|
creation is lifted out to the eager context using |
Source code in tensorflow_transform/annotators.py
max
¶
Computes the maximum of the values of x over the whole dataset.
In the case of a CompositeTensor missing values will be used in return value: for float, NaN is used and for other dtypes the min is used.
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
reduce_instance_dims
|
By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A |
Raises: TypeError: If the type of x is not supported.
Source code in tensorflow_transform/analyzers.py
mean
¶
mean(
x: TensorType,
reduce_instance_dims: bool = True,
name: Optional[str] = None,
output_dtype: Optional[DType] = None,
) -> Tensor
Computes the mean of the values of a Tensor over the whole dataset.
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
reduce_instance_dims
|
By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.
TYPE:
|
name
|
(Optional) A name for this operation. |
output_dtype
|
(Optional) If not None, casts the output tensor to this type.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A |
Tensor
|
the same type as |
Tensor
|
NaNs and infinite input values are ignored. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the type of |
Source code in tensorflow_transform/analyzers.py
min
¶
Computes the minimum of the values of x over the whole dataset.
In the case of a CompositeTensor missing values will be used in return value: for float, NaN is used and for other dtypes the max is used.
PARAMETER | DESCRIPTION |
---|---|
x
|
A
TYPE:
|
reduce_instance_dims
|
By default collapses the batch and instance dimensions
to arrive at a single scalar output. If False, only collapses the batch
dimension and outputs a
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the type of |
Source code in tensorflow_transform/analyzers.py
ngrams
¶
ngrams(
tokens: SparseTensor,
ngram_range: Tuple[int, int],
separator: str,
name: Optional[str] = None,
) -> SparseTensor
Create a SparseTensor of n-grams.
Given a SparseTensor of tokens, returns a SparseTensor containing the ngrams that can be constructed from each row.
separator is inserted between each pair of tokens, so " " would be an appropriate choice if the tokens are words, while "" would be an appropriate choice if they are characters.
Example:
tokens = tf.SparseTensor(
    indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [1, 3]],
    values=['One', 'was', 'Johnny', 'Two', 'was', 'a', 'rat'],
    dense_shape=[2, 4])
print(tft.ngrams(tokens, ngram_range=(1, 3), separator=' '))
SparseTensor(indices=tf.Tensor(
    [[0 0] [0 1] [0 2] [0 3] [0 4] [0 5]
     [1 0] [1 1] [1 2] [1 3] [1 4] [1 5] [1 6] [1 7] [1 8]], shape=(15, 2), dtype=int64),
  values=tf.Tensor(
    [b'One' b'One was' b'One was Johnny' b'was' b'was Johnny' b'Johnny'
     b'Two' b'Two was' b'Two was a' b'was' b'was a' b'was a rat' b'a'
     b'a rat' b'rat'], shape=(15,), dtype=string),
  dense_shape=tf.Tensor([2 9], shape=(2,), dtype=int64))
PARAMETER | DESCRIPTION |
---|---|
tokens
|
a two-dimensional
TYPE:
|
ngram_range
|
A pair with the range (inclusive) of ngram sizes to return. |
separator
|
a string that will be inserted between tokens when ngrams are constructed.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION |
---|---|
SparseTensor
|
A |
SparseTensor
|
if an ngram appears multiple times in the input row, it will be present the |
SparseTensor
|
same number of times in the output. For unique ngrams, see tft.bag_of_words. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
if |
ValueError
|
if ngram_range[0] < 1 or ngram_range[1] < ngram_range[0] |
Source code in tensorflow_transform/mappers.py
pca
¶
Computes PCA on the dataset using biased covariance.
The PCA analyzer computes output_dim orthonormal vectors that capture directions/axes corresponding to the highest variances in the input vectors of x. The output vectors are returned as a rank-2 tensor with shape (input_dim, output_dim), where the 0th dimension are the components of each output vector, and the 1st dimension are the output vectors representing orthogonal directions in the input space, sorted in order of decreasing variances.
The output rank-2 tensor (matrix) serves a useful transform purpose. Formally, the matrix can be used downstream in the transform step by multiplying it to the input tensor x. This transform reduces the dimension of input vectors to output_dim in a way that retains the maximal variance.
NOTE: To properly use PCA, input vector components should be converted to similar units of measurement such that the vectors represent a Euclidean space. If no such conversion is available (e.g. one element represents time, another element distance), the canonical approach is to first apply a transformation to the input data to normalize numerical variances, i.e. tft.scale_to_z_score(). Normalization allows PCA to choose output axes that help decorrelate input axes.
Below are a couple intuitive examples of PCA.
Consider a simple 2-dimensional example:
Input x is a series of vectors [e, e] where e is Gaussian with mean 0, variance 1. The two components are perfectly correlated, and the resulting covariance matrix is [[1, 1], [1, 1]].
Applying PCA with output_dim = 1 would discover the first principal component [1 / sqrt(2), 1 / sqrt(2)]. When multiplied to the original example, each vector [e, e] would be mapped to a scalar sqrt(2) * e. The second principal component would be [-1 / sqrt(2), 1 / sqrt(2)] and would map [e, e] to 0, which indicates that the second component captures no variance at all. This agrees with our intuition since we know that the two axes in the input are perfectly correlated and can be fully explained by a single scalar e.
Consider a 3-dimensional example:
Input x is a series of vectors [a, a, b], where a is a zero-mean, unit variance Gaussian and b is a zero-mean, variance 4 Gaussian and is independent of a. The first principal component of the unnormalized vector would be [0, 0, 1] since b has a much larger variance than any linear combination of the first two components. This would map [a, a, b] onto b, asserting that the axis with highest energy is the third component. While this may be the desired output if a and b correspond to the same units, it is not statistically desirable when the units are irreconcilable. In such a case, one should first normalize each component to unit variance, i.e. b := b / 2. The first principal component of a normalized vector would yield [1 / sqrt(2), 1 / sqrt(2), 0], and would map [a, a, b] to sqrt(2) * a. The second component would be [0, 0, 1] and map [a, a, b] to b. As can be seen, the benefit of normalization is that PCA would capture highly correlated components first and collapse them into a lower dimension.
PARAMETER | DESCRIPTION |
---|---|
x
|
A rank-2
TYPE:
|
output_dim
|
The PCA output dimension (number of eigenvectors to return).
TYPE:
|
dtype
|
Tensorflow dtype of entries in the returned matrix.
TYPE:
|
name
|
(Optional) A name for this operation. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
if input is not a rank-2 Tensor. |
RETURNS | DESCRIPTION |
---|---|
Tensor
|
A 2D |
Source code in tensorflow_transform/analyzers.py
quantiles
¶
quantiles(
x: Tensor,
num_buckets: int,
epsilon: float,
weights: Optional[Tensor] = None,
reduce_instance_dims: bool = True,
name: Optional[str] = None,
) -> Tensor
Computes the quantile boundaries of a Tensor over the whole dataset.
Quantile boundaries are computed using approximate quantiles, and error tolerance is specified using epsilon. The boundaries divide the input tensor into approximately equal num_buckets parts.
See go/squawd for details, and how to control the error due to approximation.
NaN input values and values with NaN weights are ignored.
PARAMETER | DESCRIPTION |
---|---|
x
|
An input
TYPE:
|
num_buckets
|
Values in the
TYPE:
|
epsilon
|
Error tolerance, typically a small fraction close to zero (e.g. 0.01). Higher values of epsilon increase the quantile approximation, and hence result in more unequal buckets, but could improve performance, and resource consumption. Some measured results on memory consumption: For epsilon = 0.001, the amount of memory for each buffer to hold the summary for 1 trillion input values is ~25000 bytes. If epsilon is relaxed to 0.01, the buffer size drops to ~2000 bytes for the same input size. The buffer size also determines the amount of work in the different stages of the beam pipeline, in general, larger epsilon results in fewer and smaller stages, and less time. For more performance trade-offs see also http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf
TYPE:
|
weights
|
(Optional) Weights tensor for the quantiles. Tensor must have the same batch size as x.
TYPE:
|
reduce_instance_dims
|
By default collapses the batch and instance dimensions to arrive at a single output vector. If False, only collapses the batch dimension and outputs a vector of the same shape as the input.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
Tensor | The bucket boundaries represented as a list, with num_bucket-1 elements, unless reduce_instance_dims is False, which results in a Tensor of shape x.shape + [num_bucket-1]. See code below for discussion on the type of bucket boundaries.
Source code in tensorflow_transform/analyzers.py
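As a hedged sketch (the feature name 'x' is hypothetical), quantile boundaries computed by this analyzer can be fed directly to tft.apply_buckets:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Compute 9 approximate decile boundaries of the hypothetical feature 'x'
    # over the whole dataset, then bucketize with them.
    boundaries = tft.quantiles(inputs["x"], num_buckets=10, epsilon=0.01)
    return {
        "x_deciled": tft.apply_buckets(inputs["x"], boundaries),
    }
```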
scale_by_min_max
¶
scale_by_min_max(
x: ConsistentTensorType,
output_min: float = 0.0,
output_max: float = 1.0,
elementwise: bool = False,
name: Optional[str] = None,
) -> ConsistentTensorType
Scale a numerical column into the range [output_min, output_max].
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric
TYPE:
|
output_min
|
The minimum of the range of output values.
TYPE:
|
output_max
|
The maximum of the range of output values.
TYPE:
|
elementwise
|
If true, scale each element of the tensor independently.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A version of x scaled to [output_min, output_max]. If the analysis dataset is empty or contains a single distinct value, then x is scaled using a sigmoid function.
RAISES | DESCRIPTION |
---|---|
ValueError
|
If output_min, output_max have the wrong order. |
Source code in tensorflow_transform/mappers.py
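A minimal preprocessing_fn sketch (feature name hypothetical) that rescales a numeric column using the dataset-wide minimum and maximum computed during analysis:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Rescale the hypothetical numeric feature 'x' into the range [-1, 1].
    return {
        "x_scaled": tft.scale_by_min_max(
            inputs["x"], output_min=-1.0, output_max=1.0),
    }
```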
scale_by_min_max_per_key
¶
scale_by_min_max_per_key(
x: ConsistentTensorType,
key: TensorType,
output_min: float = 0.0,
output_max: float = 1.0,
elementwise: bool = False,
key_vocabulary_filename: Optional[str] = None,
name: Optional[str] = None,
) -> ConsistentTensorType
Scale a numerical column into a predefined range on a per-key basis.
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric
TYPE:
|
key
|
A
TYPE:
|
output_min
|
The minimum of the range of output values.
TYPE:
|
output_max
|
The maximum of the range of output values.
TYPE:
|
elementwise
|
If true, scale each element of the tensor independently.
TYPE:
|
key_vocabulary_filename
|
(Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function. |
name
|
(Optional) A name for this operation. |
Example:
def preprocessing_fn(inputs):
  return {
      'scaled': tft.scale_by_min_max_per_key(inputs['x'], inputs['s'])
  }

raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
feature_spec = dict(
    x=tf.io.FixedLenFeature([], tf.float32),
    s=tf.io.FixedLenFeature([], tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A version of x scaled to [output_min, output_max] on a per-key basis if a key is provided. If the analysis dataset is empty, a certain key contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then x is scaled using a sigmoid function.
RAISES | DESCRIPTION |
---|---|
ValueError
|
If output_min, output_max have the wrong order. |
NotImplementedError
|
If elementwise is True and key is not None. |
InvalidArgumentError
|
If indices of sparse x and key do not match. |
Source code in tensorflow_transform/mappers.py
scale_to_0_1
¶
scale_to_0_1(
x: ConsistentTensorType,
elementwise: bool = False,
name: Optional[str] = None,
) -> ConsistentTensorType
Returns a column which is the input column scaled to have range [0,1].
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric
TYPE:
|
elementwise
|
If true, scale each element of the tensor independently.
TYPE:
|
name
|
(Optional) A name for this operation. |
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A version of x scaled to [0, 1]. If the analysis dataset is empty or contains a single distinct value, then x is scaled using a sigmoid function.
Source code in tensorflow_transform/mappers.py
scale_to_0_1_per_key
¶
scale_to_0_1_per_key(
x: ConsistentTensorType,
key: TensorType,
elementwise: bool = False,
key_vocabulary_filename: Optional[str] = None,
name: Optional[str] = None,
) -> ConsistentTensorType
Returns a column which is the input column scaled to have range [0,1].
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric
TYPE:
|
key
|
A
TYPE:
|
elementwise
|
If true, scale each element of the tensor independently.
TYPE:
|
key_vocabulary_filename
|
(Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function. |
name
|
(Optional) A name for this operation. |
Example:
def preprocessing_fn(inputs):
  return {
      'scaled': tft.scale_to_0_1_per_key(inputs['x'], inputs['s'])
  }

raw_data = [dict(x=1, s='a'), dict(x=0, s='b'), dict(x=3, s='a')]
feature_spec = dict(
    x=tf.io.FixedLenFeature([], tf.float32),
    s=tf.io.FixedLenFeature([], tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'scaled': 0.0}, {'scaled': 0.5}, {'scaled': 1.0}]
RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A version of x scaled to [0, 1] per key. If the analysis dataset is empty, contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then x is scaled using a sigmoid function.
Source code in tensorflow_transform/mappers.py
scale_to_gaussian
¶
scale_to_gaussian(
x: ConsistentTensorType,
elementwise: bool = False,
name: Optional[str] = None,
output_dtype: Optional[DType] = None,
) -> ConsistentTensorType
Returns an (approximately) normal column with mean 0 and variance 1.
We transform the column to values that are approximately distributed according to a standard normal distribution. The transformation is obtained by applying the moments method to estimate the parameters of a Tukey HH distribution and applying the inverse of the estimated function to the column values. The method is partially described in
Georg M. Goerg, "The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey's h Transformation as a Special Case," The Scientific World Journal, Vol. 2015, Hindawi Publishing Corporation.
We use the L-moments instead of conventional moments to be able to deal with long-tailed distributions. The expressions of the L-moments for the Tukey HH distribution are given in
Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey H and HH-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
Note that the transformation to Gaussian is applied only if the column has long tails. If this is not the case, for instance if values are uniformly distributed, the values are only normalized using the z score. This applies also to the cases where only one of the tails is long; the other tail is only rescaled but not non-linearly transformed. Also, if the analysis set is empty, the transformation is set to leave the input values unchanged.
PARAMETER | DESCRIPTION |
---|---|
x
|
A numeric
TYPE:
|
elementwise
|
If true, scales each element of the tensor independently; otherwise uses the parameters of the whole tensor.
TYPE:
|
name
|
(Optional) A name for this operation. |
output_dtype
|
(Optional) If not None, casts the output tensor to this type.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
ConsistentTensorType
|
A |
ConsistentTensorType
|
transformed to be approximately standard distributed (i.e. a Gaussian with |
ConsistentTensorType
|
mean 0 and variance 1). If |
ConsistentTensorType
|
same type as |
ConsistentTensorType
|
Note that TFLearn generally permits only tf.int64 and tf.float32, so casting |
ConsistentTensorType
|
this scaler's output may be necessary. |
Source code in tensorflow_transform/mappers.py
scale_to_z_score
¶
scale_to_z_score(
x: ConsistentTensorType,
elementwise: bool = False,
name: Optional[str] = None,
output_dtype: Optional[DType] = None,
) -> ConsistentTensorType
Returns a standardized column with mean 0 and variance 1.
Scaling to z-score subtracts out the mean and divides by standard deviation. Note that the standard deviation computed here is based on the biased variance (0 delta degrees of freedom), as computed by analyzers.var.
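For instance, a minimal sketch of z-scoring a numeric feature inside a preprocessing function (the imports and the feature name 'age' are assumptions for the example):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # The mean and standard deviation are computed once over the whole
  # analysis dataset and then applied to every instance.
  return {
      'age_z': tft.scale_to_z_score(inputs['age']),
  }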
PARAMETER | DESCRIPTION |
---|---|
x | A numeric Tensor, SparseTensor, or RaggedTensor. TYPE: ConsistentTensorType |
elementwise | If true, scales each element of the tensor independently; otherwise uses the mean and variance of the whole tensor. TYPE: bool |
name | (Optional) A name for this operation. |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |

RETURNS | DESCRIPTION |
---|---|
ConsistentTensorType | A tensor containing the input column scaled to mean 0 and variance 1 (standard deviation 1), given by: (x - mean(x)) / std_dev(x). If x is floating point, the output will have the same type as x; if x is integral, the output is cast to tf.float32. If the analysis dataset is empty or contains a single distinct value, then the input is returned without scaling. Note that TFLearn generally permits only tf.int64 and tf.float32, so casting this scaler's output may be necessary. |
Source code in tensorflow_transform/mappers.py
scale_to_z_score_per_key
¶
scale_to_z_score_per_key(
x: ConsistentTensorType,
key: TensorType,
elementwise: bool = False,
key_vocabulary_filename: Optional[str] = None,
name: Optional[str] = None,
output_dtype: Optional[DType] = None,
) -> ConsistentTensorType
Returns a standardized column with mean 0 and variance 1, grouped per key.
Scaling to z-score subtracts out the mean and divides by standard deviation. Note that the standard deviation computed here is based on the biased variance (0 delta degrees of freedom), as computed by analyzers.var.
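For instance, a minimal sketch of per-key standardization (the imports and the feature names 'fare' and 'city' are assumptions for the example):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Each fare is standardized using the mean and variance of its own city.
  return {
      'fare_z_per_city': tft.scale_to_z_score_per_key(
          inputs['fare'], key=inputs['city']),
  }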
PARAMETER | DESCRIPTION |
---|---|
x | A numeric Tensor, SparseTensor, or RaggedTensor. TYPE: ConsistentTensorType |
key | A Tensor, SparseTensor, or RaggedTensor of dtype tf.string that provides the key for each value in x. TYPE: TensorType |
elementwise | If true, scales each element of the tensor independently; otherwise uses the mean and variance of the whole tensor. Currently, not supported for per-key operations. TYPE: bool |
key_vocabulary_filename | (Optional) The file name for the per-key file. If None, this combiner will assume the keys fit in memory and will not store the analyzer result in a file. If '', a file name will be chosen based on the current TensorFlow scope. If not '', it should be unique within a given preprocessing function. |
name | (Optional) A name for this operation. |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |

RETURNS | DESCRIPTION |
---|---|
ConsistentTensorType | A tensor containing the input column scaled to mean 0 and variance 1 (standard deviation 1), grouped per key if a key is provided. That is, for all keys k: (x - mean(x)) / std_dev(x) for all x with key k. If x is floating point, the output will have the same type as x; if x is integral, the output is cast to tf.float32. If the analysis dataset is empty, contains a single distinct value, or the computed key vocabulary doesn't have an entry for a given key, then the input is returned without scaling. Note that TFLearn generally permits only tf.int64 and tf.float32, so casting this scaler's output may be necessary. |
Source code in tensorflow_transform/mappers.py
segment_indices
¶
Returns a Tensor
of indices within each segment.
segment_ids should be a sequence of non-decreasing non-negative integers that
define a set of segments, e.g. [0, 0, 1, 2, 2, 2] defines 3 segments of length
2, 1 and 3. The return value is a Tensor
containing the indices within each
segment.
Example:
>>> result = tft.segment_indices(tf.constant([0, 0, 1, 2, 2, 2]))
>>> print(result)
tf.Tensor([0 1 0 0 1 2], shape=(6,), dtype=int32)
PARAMETER | DESCRIPTION |
---|---|
segment_ids | A 1-d Tensor containing a non-decreasing sequence of non-negative integers defining the segments. TYPE: Tensor |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor containing the indices within each segment. |
Source code in tensorflow_transform/mappers.py
size
¶
Computes the total size of instances in a Tensor
over the whole dataset.
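As an illustration, a minimal sketch of how the analyzer's output can be used inside a preprocessing function (the imports and the feature name 'clicks' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # tft.size yields a constant computed over the analysis dataset; here it
  # is used to express each value as a fraction of the total value count.
  total = tf.cast(tft.size(inputs['clicks']), tf.float32)
  return {
      'clicks_fraction_of_total': inputs['clicks'] / total,
  }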
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. TYPE: bool |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor of type int64. |
Source code in tensorflow_transform/analyzers.py
sparse_tensor_left_align
¶
Re-arranges a tf.SparseTensor
and returns a left-aligned version of it.
This mapper can be useful when returning a sparse tensor that may not be left-aligned from a preprocessing_fn.
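A minimal sketch of what left-alignment does to an input's indices (the tensor values here are illustrative only):

import tensorflow as tf
import tensorflow_transform as tft

# A sparse tensor whose rows do not start at column 0.
st = tf.SparseTensor(indices=[[0, 2], [0, 3], [1, 5]],
                     values=[10, 20, 30],
                     dense_shape=[2, 6])
aligned = tft.sparse_tensor_left_align(st)
# aligned now places each row's values at columns 0, 1, ... in order,
# i.e. indices [[0, 0], [0, 1], [1, 0]], with the values unchanged.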
PARAMETER | DESCRIPTION |
---|---|
sparse_tensor | A 2D tf.SparseTensor. TYPE: SparseTensor |

RETURNS | DESCRIPTION |
---|---|
SparseTensor | A left-aligned version of sparse_tensor as a tf.SparseTensor. |
Source code in tensorflow_transform/mappers.py
sparse_tensor_to_dense_with_shape
¶
sparse_tensor_to_dense_with_shape(
x: SparseTensor,
shape: Union[TensorShape, Iterable[int]],
default_value: Union[Tensor, int, float, str] = 0,
) -> Tensor
Converts a SparseTensor
into a dense tensor and sets its shape.
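A minimal sketch of densifying a sparse input with an explicit shape and fill value (the tensor values here are illustrative only):

import tensorflow as tf
import tensorflow_transform as tft

st = tf.SparseTensor(indices=[[0, 0], [1, 2]],
                     values=[1.0, 5.0],
                     dense_shape=[2, 3])
# Unspecified positions are filled with -1.0 and the static shape is set.
dense = tft.sparse_tensor_to_dense_with_shape(st, shape=[2, 3],
                                              default_value=-1.0)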
PARAMETER | DESCRIPTION |
---|---|
x | A SparseTensor. TYPE: SparseTensor |
shape | The desired shape of the densified Tensor. |
default_value | (Optional) Value to set for indices not specified. Defaults to zero. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor with the desired shape. |

RAISES | DESCRIPTION |
---|---|
ValueError | If input is not a SparseTensor. |
Source code in tensorflow_transform/mappers.py
sum
¶
Computes the sum of the values of a Tensor
over the whole dataset.
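For instance, a minimal sketch of using the analyzer's output inside a preprocessing function (the imports and the feature name 'clicks' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # tft.sum yields a constant computed over the analysis dataset, used here
  # to express each value as a share of the dataset-wide total.
  total = tf.cast(tft.sum(inputs['clicks']), tf.float32)
  return {
      'clicks_share': inputs['clicks'] / total,
  }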
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}) or integral ([u]int{8|16|32|64}). TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. TYPE: bool |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor containing the sum. If x is float32 or float64, the sum will have the same type as x. If x is float16, the output is cast to float32. If x is integral, the output is cast to [u]int64. If x is sparse and reduce_instance_dims is False, 0 is returned where the column has no values across batches. |

RAISES | DESCRIPTION |
---|---|
TypeError | If the type of x is not supported. |
Source code in tensorflow_transform/analyzers.py
tfidf
¶
tfidf(
x: SparseTensor,
vocab_size: int,
smooth: bool = True,
name: Optional[str] = None,
) -> Tuple[SparseTensor, SparseTensor]
Maps the terms in x to their term frequency * inverse document frequency.
The term frequency of a term in a document is calculated as (count of term in document) / (document size)
The inverse document frequency of a term is, by default, calculated as 1 + log((corpus size + 1) / (count of documents containing term + 1)).
Example usage:
>>> def preprocessing_fn(inputs):
...   integerized = tft.compute_and_apply_vocabulary(inputs['x'])
...   vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
...   vocab_index, tfidf_weight = tft.tfidf(integerized, vocab_size)
...   return {
...      'index': vocab_index,
...      'tf_idf': tfidf_weight,
...      'integerized': integerized,
...   }
>>> raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
...             dict(x=["yum", "yum", "pie"])]
>>> feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
>>> raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
>>> with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
...   transformed_dataset, transform_fn = (
...       (raw_data, raw_data_metadata)
...       | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
>>> transformed_data, transformed_metadata = transformed_dataset
>>> transformed_data
[{'index': array([0, 2, 3]), 'integerized': array([3, 2, 0, 0, 0]),
  'tf_idf': array([0.6, 0.28109303, 0.28109303], dtype=float32)},
 {'index': array([0, 1]), 'integerized': array([1, 1, 0]),
  'tf_idf': array([0.33333334, 0.9369768 ], dtype=float32)}]
example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
[1, 0], [1, 1], [1, 2]],
values=[1, 2, 0, 0, 0, 3, 3, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
values=[1, 2, 0, 3, 0])
SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
values=[(1/5)*(log(3/2)+1), (1/5)*(log(3/2)+1), (3/5),
(2/3)*(log(3/2)+1), (1/3)]
NOTE: the first doc's duplicate "pie" strings have been combined to one output, as have the second doc's duplicate "yum" strings.
PARAMETER | DESCRIPTION |
---|---|
x | A 2D SparseTensor representing int64 values (most likely the result of calling compute_and_apply_vocabulary on a tokenized string). TYPE: SparseTensor |
vocab_size | An int - the count of vocab used to turn the string into int64s including any OOV buckets. TYPE: int |
smooth | A bool indicating if the inverse document frequency should be smoothed. If True, which is the default, then the idf is calculated as 1 + log((corpus size + 1) / (document frequency of term + 1)). Otherwise, the idf is 1 + log((corpus size) / (document frequency of term)), which could result in a division by zero error. TYPE: bool |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tuple[SparseTensor, SparseTensor] | Two SparseTensors with indices [index_in_batch, index_in_bag_of_words]. The first has values vocab_index, which is taken from input x. The second has values tfidf_weight. |
Source code in tensorflow_transform/mappers.py
tukey_h_params
¶
tukey_h_params(
x: TensorType,
reduce_instance_dims: bool = True,
output_dtype: Optional[DType] = None,
name: Optional[str] = None,
) -> Tuple[Tensor, Tensor]
Computes the h parameters of the values of a Tensor
over the dataset.
This computes the parameters (hl, hr) of the samples, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters hl (left parameter) and hr (right parameter). See the following publication for the definition of the Tukey HH distribution:
Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
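As an illustration, a minimal sketch of retrieving the tail parameters inside a preprocessing function (the imports and the feature name 'latency' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # 'latency' is a hypothetical heavy-tailed numeric feature; hl and hr are
  # scalar constants computed during the analysis phase.
  x = inputs['latency']
  hl, hr = tft.tukey_h_params(x)
  return {
      # Broadcast the tail parameters onto each instance for illustration.
      'latency_hl': tf.zeros_like(x) + tf.cast(hl, x.dtype),
      'latency_hr': tf.zeros_like(x) + tf.cast(hr, x.dtype),
  }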
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}) or integral ([u]int{8|16|32|64}). TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. TYPE: bool |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tuple[Tensor, Tensor] | The tuple (hl, hr) containing two Tensor instances with the hl and hr parameters. If x is floating point, each parameter will have the same type as x; otherwise the output is cast to float32. |

RAISES | DESCRIPTION |
---|---|
TypeError | If the type of x is not supported. |
Source code in tensorflow_transform/analyzers.py
tukey_location
¶
tukey_location(
x: TensorType,
reduce_instance_dims: Optional[bool] = True,
output_dtype: Optional[DType] = None,
name: Optional[str] = None,
) -> Tensor
Computes the location of the values of a Tensor
over the whole dataset.
This computes the location of x, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters tukey_h_params. See the following publication for the definition of the Tukey HH distribution:
Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
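For instance, a minimal sketch of using the Tukey location as a robust center inside a preprocessing function (the imports and the feature name 'latency' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Center 'latency' on its Tukey location instead of its mean.
  x = inputs['latency']
  loc = tf.cast(tft.tukey_location(x), x.dtype)
  return {
      'latency_centered': x - loc,
  }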
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}) or integral ([u]int{8|16|32|64}). TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor containing the location. If x is floating point, the location will have the same type as x. Otherwise, the location will have type float32. |

RAISES | DESCRIPTION |
---|---|
TypeError | If the type of x is not supported. |
Source code in tensorflow_transform/analyzers.py
tukey_scale
¶
tukey_scale(
x: TensorType,
reduce_instance_dims: Optional[bool] = True,
output_dtype: Optional[DType] = None,
name: Optional[str] = None,
) -> Tensor
Computes the scale of the values of a Tensor
over the whole dataset.
This computes the scale of x, assuming a Tukey HH distribution, i.e. (x - tukey_location) / tukey_scale is a Tukey HH distribution with parameters tukey_h_params. See the following publication for the definition of the Tukey HH distribution:
Todd C. Headrick, and Mohan D. Pant. "Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation," ISRN Applied Mathematics, vol. 2012, 2012. doi:10.5402/2012/980153
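For instance, a minimal sketch of a robust standardization built from the Tukey location and scale (the imports and the feature name 'latency' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # A robust alternative to (x - mean) / std_dev for long-tailed data.
  x = inputs['latency']
  loc = tf.cast(tft.tukey_location(x), x.dtype)
  scale = tf.cast(tft.tukey_scale(x), x.dtype)
  return {
      'latency_robust_z': (x - loc) / scale,
  }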
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}) or integral ([u]int{8|16|32|64}). TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor containing the scale. If x is floating point, the scale will have the same type as x. Otherwise, the scale will have type float32. |

RAISES | DESCRIPTION |
---|---|
TypeError | If the type of x is not supported. |
Source code in tensorflow_transform/analyzers.py
var
¶
var(
x: TensorType,
reduce_instance_dims: bool = True,
name: Optional[str] = None,
output_dtype: Optional[DType] = None,
) -> Tensor
Computes the variance of the values of a Tensor
over the whole dataset.
Uses the biased variance (0 delta degrees of freedom), as given by sum((x - mean(x))**2) / length(x).
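As an illustration, a minimal sketch that reproduces a z-score from the mean and variance analyzers directly (the imports and the feature name 'age' are assumptions for the example):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Equivalent in spirit to tft.scale_to_z_score, written out explicitly.
  x = inputs['age']
  mean = tf.cast(tft.mean(x), x.dtype)
  std = tf.sqrt(tf.cast(tft.var(x), x.dtype))
  return {
      'age_z_manual': (x - mean) / std,
  }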
PARAMETER | DESCRIPTION |
---|---|
x | A Tensor, SparseTensor, or RaggedTensor. Its type must be floating point (float{16|32|64}) or integral ([u]int{8|16|32|64}). TYPE: TensorType |
reduce_instance_dims | By default collapses the batch and instance dimensions to arrive at a single scalar output. If False, only collapses the batch dimension and outputs a vector of the same shape as the input. TYPE: bool |
name | (Optional) A name for this operation. |
output_dtype | (Optional) If not None, casts the output tensor to this type. TYPE: Optional[DType] |

RETURNS | DESCRIPTION |
---|---|
Tensor | A Tensor containing the variance. If x is floating point, the variance will have the same type as x. Otherwise, the variance will have type float32. NaNs and infinite input values are ignored. |

RAISES | DESCRIPTION |
---|---|
TypeError | If the type of x is not supported. |
Source code in tensorflow_transform/analyzers.py
vocabulary
¶
vocabulary(
x: TensorType,
*,
top_k: Optional[int] = None,
frequency_threshold: Optional[int] = None,
vocab_filename: Optional[str] = None,
store_frequency: Optional[bool] = False,
reserved_tokens: Optional[
Union[Sequence[str], Tensor]
] = None,
weights: Optional[Tensor] = None,
labels: Optional[Union[Tensor, SparseTensor]] = None,
use_adjusted_mutual_info: bool = False,
min_diff_from_avg: Optional[int] = None,
coverage_top_k: Optional[int] = None,
coverage_frequency_threshold: Optional[int] = None,
key_fn: Optional[Callable[[Any], Any]] = None,
fingerprint_shuffle: Optional[bool] = False,
file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
name: Optional[str] = None,
) -> TemporaryAnalyzerOutputType
Computes the unique values of x
over the whole dataset.
Computes The unique values taken by x
, which can be a Tensor
,
SparseTensor
, or RaggedTensor
of any size. The unique values will be
aggregated over all dimensions of x
and all instances.
In case file_format
is 'text' and one of the tokens contains the '\n' or
'\r' characters or is empty, it will be discarded.
If an integer Tensor is provided, its semantic type should be categorical, not continuous/numeric, since computing a vocabulary over a continuous feature is not appropriate.
The unique values are sorted by decreasing frequency and then reverse
lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even
if x
is numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).
For large datasets it is highly recommended to either set frequency_threshold or top_k to control the size of the output, and also the run time of this operation.
When labels are provided, we filter the vocabulary based on the relationship between the token's presence in a record and the label for that record, using (possibly adjusted) Mutual Information. Note: If labels are provided, the x input must be a unique set of tokens per record, as the semantics of the mutual information calculation depend on a multi-hot representation of the input. Having unique input tokens per row is advisable but not required for a frequency-based vocabulary.
WARNING: The following is experimental and is still being actively worked on.
Supply key_fn
if you would like to generate a vocabulary with coverage over
specific keys.
A "coverage vocabulary" is the union of two vocabulary "arms". The "standard arm" of the vocabulary is equivalent to the one generated by the same function call with no coverage arguments. Adding coverage only appends additional entries to the end of the standard vocabulary.
The "coverage arm" of the vocabulary is determined by taking the
coverage_top_k
most frequent unique terms per key. A term's key is obtained
by applying key_fn
to the term. Use coverage_frequency_threshold
to lower
bound the frequency of entries in the coverage arm of the vocabulary.
Note this is currently implemented for the case where the key is contained within each vocabulary entry (b/117796748).
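For instance, a minimal sketch of computing a vocabulary and applying it in the same preprocessing function (the imports, the feature name 'query_tokens', and the vocabulary name 'query_vocab' are assumptions for the example):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # tft.vocabulary only writes the vocabulary file and returns a deferred
  # path; tft.apply_vocabulary maps tokens to indices using that file.
  deferred_vocab_path = tft.vocabulary(
      inputs['query_tokens'],
      top_k=10000,
      vocab_filename='query_vocab')
  return {
      'query_ids': tft.apply_vocabulary(
          inputs['query_tokens'], deferred_vocab_path),
  }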
PARAMETER | DESCRIPTION |
---|---|
x | A categorical/discrete input Tensor, SparseTensor, or RaggedTensor with dtype tf.string or tf.int[8|16|32|64]. TYPE: TensorType |
top_k | Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated. |
frequency_threshold | Limit the generated vocabulary only to elements whose absolute frequency is >= to the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element. |
vocab_filename | The file name for the vocabulary file. If None, a file name will be chosen based on the current scope. If not None, should be unique within a given preprocessing function. NOTE: to make your pipelines resilient to implementation details, please set vocab_filename when you are using the vocabulary file in a downstream component. |
store_frequency | If True, frequency of the words is stored in the vocabulary file. In the case labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if this is True then the computed vocabulary cannot be used with tft.apply_vocabulary directly, since frequencies are added to the beginning of each row of the vocabulary, which the mapper will not ignore. |
reserved_tokens | (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache. |
weights | (Optional) Weights Tensor for the vocabulary. It must have the same shape as x. TYPE: Optional[Tensor] |
labels | (Optional) Labels dense Tensor for the vocabulary. If provided, the vocabulary is calculated based on mutual information with the label, rather than frequency. |
use_adjusted_mutual_info | If true, and labels are provided, calculate vocabulary using adjusted rather than raw mutual information. TYPE: bool |
min_diff_from_avg | MI (or AMI) of a feature x label will be adjusted to zero whenever the difference between count and the expected (average) count is lower than min_diff_from_average. This can be thought of as a regularizing parameter that pushes small MI/AMI values to zero. If None, a default parameter will be selected based on the size of the dataset (see calculate_recommended_min_diff_from_avg). |
coverage_top_k | (Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary. |
coverage_frequency_threshold | (Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key. |
key_fn | (Optional), (Experimental) A fn that takes in a single entry of x and returns the corresponding key for coverage calculation. If this is None, no coverage arm is added to the vocabulary. |
fingerprint_shuffle | (Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above (top_k, frequency_threshold, etc) will still take effect. |
file_format | (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'. TYPE: VocabularyFileFormatType |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
TemporaryAnalyzerOutputType | The path name for the vocabulary file containing the unique values of x. |

RAISES | DESCRIPTION |
---|---|
ValueError | If top_k or frequency_threshold is negative, or if either coverage_top_k or coverage_frequency_threshold is negative. |
Source code in tensorflow_transform/analyzers.py
word_count
¶
Find the token count of each document/row.
tokens
is either a RaggedTensor
or SparseTensor
, representing tokenized
strings. This function simply returns the size of each row, so the dtype is not
constrained to string.
Example:
>>> sparse = tf.SparseTensor(indices=[[0, 0], [0, 1], [2, 2]],
...                          values=['a', 'b', 'c'], dense_shape=(4, 4))
>>> tft.word_count(sparse)
PARAMETER | DESCRIPTION |
---|---|
tokens | Either (1) a SparseTensor or (2) a RaggedTensor containing the tokenized strings to be counted, one row per document. TYPE: Union[SparseTensor, RaggedTensor] |
name | (Optional) A name for this operation. |

RETURNS | DESCRIPTION |
---|---|
Tensor | A one-dimensional Tensor containing the token count of each row. |

RAISES | DESCRIPTION |
---|---|
ValueError | If tokens is neither sparse nor ragged. |