TensorFlow Transform tft.experimental Module

tensorflow_transform.experimental

Module-level imports for tensorflow_transform.experimental.
Attributes

SimpleJsonPTransformAnalyzerCacheCoder (module attribute)
Classes

CacheablePTransformAnalyzer
Bases: TypedNamedTuple('PTransformCachedAnalyzer', [('make_accumulators_ptransform', _BeamPTransform), ('merge_accumulators_ptransform', _BeamPTransform), ('extract_output_ptransform', _BeamPTransform), ('cache_coder', PTransformAnalyzerCacheCoder)])
A PTransformAnalyzer which enables analyzer cache.

WARNING: This should only be used if the analyzer can correctly be separated
into make_accumulators, merge_accumulators and extract_output stages.

1. make_accumulators_ptransform: a beam.PTransform which maps data to a more
   compact, mergeable representation (accumulator). Mergeable here means that
   it is possible to combine multiple representations produced from a
   partition of the dataset into a representation of the entire dataset.
2. merge_accumulators_ptransform: a beam.PTransform which operates on a
   collection of accumulators, i.e. the results of both the
   make_accumulators_ptransform and merge_accumulators_ptransform stages, and
   produces a single reduced accumulator. This operation must be associative
   and commutative in order to have reliably reproducible results.
3. extract_output_ptransform: a beam.PTransform which operates on the result
   of the merge_accumulators_ptransform stage and produces the outputs of the
   analyzer. These outputs must be consistent with the output_dtypes and
   output_shapes provided to ptransform_analyzer.

This container also holds a cache_coder (PTransformAnalyzerCacheCoder) which
can encode the outputs and decode the inputs of the
merge_accumulators_ptransform stage. In many cases,
SimpleJsonPTransformAnalyzerCacheCoder is sufficient.

To ensure the correctness of this analyzer, the following must hold:
merge(make({D1, ..., Dn})) == merge({make(D1), ..., make(Dn)})
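Below is a minimal sketch of how such an analyzer could be assembled, using a
trivial "global sum" analyzer as the example. The three PTransform classes and
the way the extract stage packs its output are illustrative assumptions, not
part of the library API; only CacheablePTransformAnalyzer and
SimpleJsonPTransformAnalyzerCacheCoder are actual tft.experimental symbols.

```python
# Illustrative sketch: a cacheable "global sum" analyzer split into the three
# stages described above. The helper PTransforms here are hypothetical.
import apache_beam as beam
import numpy as np
import tensorflow_transform as tft


class _MakeSumAccumulators(beam.PTransform):
  """Maps each input batch to a partial sum (the accumulator)."""

  def expand(self, pcoll):
    # Each element is a tuple of ndarrays, one per input tensor; a single
    # numeric input is assumed here.
    return pcoll | 'BatchSum' >> beam.Map(lambda batch: float(np.sum(batch[0])))


class _MergeSumAccumulators(beam.PTransform):
  """Reduces partial sums into one sum (associative and commutative)."""

  def expand(self, pcoll):
    return pcoll | 'GlobalSum' >> beam.CombineGlobally(sum)


class _ExtractSumOutput(beam.PTransform):
  """Turns the merged accumulator into the analyzer's output value(s)."""

  def expand(self, pcoll):
    # The exact output packing must match the output_dtypes/output_shapes
    # later passed to ptransform_analyzer.
    return pcoll | 'ToOutput' >> beam.Map(np.float32)


# Partial sums are plain floats, so the simple JSON cache coder is sufficient.
sum_analyzer = tft.experimental.CacheablePTransformAnalyzer(
    make_accumulators_ptransform=_MakeSumAccumulators(),
    merge_accumulators_ptransform=_MergeSumAccumulators(),
    extract_output_ptransform=_ExtractSumOutput(),
    cache_coder=tft.experimental.SimpleJsonPTransformAnalyzerCacheCoder())
```

Such an object would then be passed as the ptransform argument of
tft.experimental.ptransform_analyzer (documented below), with output_dtypes
and output_shapes describing what the extract stage emits.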
Functions

Any

Special type indicating an unconstrained type.

- Any is compatible with every type.
- Any is assumed to have all methods.
- All values are assumed to be instances of Any.

Note that all the above statements are true from the point of view of static
type checkers. At runtime, Any should not be used with instance or class
checks.

Source code in python3.9/typing.py

Optional

Optional type.

Optional[X] is equivalent to Union[X, None].

Union

Union type; Union[X, Y] means either X or Y.

To define a union, use e.g. Union[int, str]. Details:

- The arguments must be types and there must be at least one.
- None as an argument is a special case and is replaced by type(None).
- Unions of unions are flattened, e.g. Union[Union[int, str], float] == Union[int, str, float].
- Unions of a single argument vanish, e.g. Union[int] == int (the constructor actually returns int).
- Redundant arguments are skipped, e.g. Union[int, str, int] == Union[int, str].
- When comparing unions, the argument order is ignored, e.g. Union[int, str] == Union[str, int].
- You cannot subclass or instantiate a union.
- You can use Optional[X] as a shorthand for Union[X, None].

Source code in python3.9/typing.py
annotate_sparse_output_shape

Annotates a sparse output to have a given dense_shape.

PARAMETER | DESCRIPTION
---|---
tensor | An output sparse tensor to annotate.
shape | A dense_shape to annotate tensor with.

Source code in tensorflow_transform/experimental/annotators.py
annotate_true_sparse_output

Annotates a sparse output to be truly sparse and not varlen.
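A brief sketch of how these two annotators might be used inside a
preprocessing_fn. The feature names, vocabulary filenames, and the fixed shape
of [10] are assumptions for illustration only.

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Both features are assumed to be VarLen (sparse) string features.
  tags = tft.compute_and_apply_vocabulary(inputs['tags'], vocab_filename='tags')
  terms = tft.compute_and_apply_vocabulary(inputs['terms'], vocab_filename='terms')
  # Mark 'tags' as truly sparse rather than variable-length.
  tft.experimental.annotate_true_sparse_output(tags)
  # Record that 'terms' is known to have a dense_shape of [10] per example.
  tft.experimental.annotate_sparse_output_shape(terms, shape=[10])
  return {'tags': tags, 'terms': terms}
```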
approximate_vocabulary

```python
approximate_vocabulary(
    x: TensorType,
    top_k: int,
    *,
    vocab_filename: Optional[str] = None,
    store_frequency: bool = False,
    reserved_tokens: Optional[Union[Sequence[str], Tensor]] = None,
    weights: Optional[Tensor] = None,
    file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None,
) -> TemporaryAnalyzerOutputType
```
Computes the unique values of a Tensor over the whole dataset.

Approximately computes the unique values taken by x, which can be a Tensor,
SparseTensor, or RaggedTensor of any size. The unique values will be
aggregated over all dimensions of x and all instances.

This analyzer provides an approximate alternative to tft.vocabulary that can
be more efficient with smaller top_k and/or a smaller number of unique
elements in x. As a rule of thumb, approximate_vocabulary becomes more
efficient than tft.vocabulary if top_k or the number of unique elements in x
is smaller than 2*10^5. Moreover, this analyzer is subject to a combiner
packing optimization that does not apply to tft.vocabulary. Caching is also
more efficient with the approximate implementation, since the filtration
happens before the cache is written out. The output artifact of
approximate_vocabulary is consistent with tft.vocabulary and can be used in
the tft.apply_vocabulary mapper.

The implementation of this analyzer is based on the Misra-Gries algorithm
[1]. It stores at most top_k elements with lower-bound frequency estimates at
a time. The algorithm keeps track of the approximation error delta such that
for any item x with true frequency X:

    frequency[x] <= X <= frequency[x] + delta,
    delta <= (m - m') / (top_k + 1),

where m is the total frequency of the items in the dataset and m' is the sum
of the lower-bound estimates in frequency [2]. For datasets that are Zipfian
distributed with parameter a, the algorithm provides an expected value of
delta = m / (top_k ^ a) [3].
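As a worked instance of that bound (with illustrative numbers): for a Zipfian
dataset with a = 1, total frequency m = 1,000,000, and top_k = 1,000, the
expected error on any reported count is delta = 1,000,000 / 1,000 = 1,000.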
[1] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
[2] http://www.cohenwang.com/edith/bigdataclass2013/lectures/lecture1.pdf
[3] http://dimacs.rutgers.edu/~graham/pubs/papers/countersj.pdf
If file_format is 'text' and one of the tokens contains the '\n' or '\r'
characters or is empty, it will be discarded.

If an integer Tensor is provided, its semantic type should be categorical,
not continuous/numeric, since computing a vocabulary over a continuous
feature is not appropriate.

The unique values are sorted by decreasing frequency and then reverse
lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true
even if x has a numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).
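A minimal usage sketch follows; the feature name, top_k value, and vocabulary
filename are illustrative. It pairs the analyzer with the tft.apply_vocabulary
mapper, as described above.

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Compute an approximate vocabulary of the 10,000 most frequent tokens.
  deferred_vocab_path = tft.experimental.approximate_vocabulary(
      inputs['tokens'], top_k=10000, vocab_filename='approx_tokens_vocab')
  # The resulting vocabulary artifact is compatible with tft.apply_vocabulary.
  token_ids = tft.apply_vocabulary(inputs['tokens'], deferred_vocab_path)
  return {'token_ids': token_ids}
```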
PARAMETER | DESCRIPTION
---|---
x | A categorical/discrete input Tensor, SparseTensor, or RaggedTensor.
top_k | Limit the generated vocabulary to the first top_k elements.
vocab_filename | The file name for the vocabulary file. If None, a file name will be chosen based on the current scope. If not None, should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, please set vocab_filename if you are using the vocabulary file in a downstream component.
store_frequency | If True, the frequency of the words is stored in the vocabulary file. Each line in the file will be of the form 'frequency word'. NOTE: if this is True then the computed vocabulary cannot be used with tft.apply_vocabulary directly, since frequencies are added to the vocabulary file.
reserved_tokens | (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache.
weights | (Optional) Weights Tensor for the vocabulary. It must have the same shape as x.
file_format | (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
name | (Optional) A name for this operation.

RETURNS | DESCRIPTION
---|---
TemporaryAnalyzerOutputType | The path name for the vocabulary file containing the unique values of x.

RAISES | DESCRIPTION
---|---
ValueError | If top_k is negative, or if file_format is not in the list of allowed formats.
Source code in tensorflow_transform/experimental/analyzers.py
compute_and_apply_approximate_vocabulary

```python
compute_and_apply_approximate_vocabulary(
    x: ConsistentTensorType,
    *,
    default_value: Any = -1,
    top_k: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[Tensor] = None,
    file_format: VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Sequence[str], Tensor]] = None,
    name: Optional[str] = None,
) -> ConsistentTensorType
```

Generates an approximate vocabulary for x and maps it to an integer.
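A minimal usage sketch; the feature name and parameter values below are
illustrative only.

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Map each token to an integer id from an approximate top-10,000 vocabulary,
  # reserving one extra bucket for out-of-vocabulary tokens.
  token_ids = tft.experimental.compute_and_apply_approximate_vocabulary(
      inputs['tokens'], top_k=10000, num_oov_buckets=1)
  return {'token_ids': token_ids}
```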
PARAMETER | DESCRIPTION
---|---
x | A categorical/discrete input Tensor, SparseTensor, or RaggedTensor.
default_value | The value to use for out-of-vocabulary values, unless 'num_oov_buckets' is greater than zero.
top_k | Limit the generated vocabulary to the first top_k elements.
num_oov_buckets | Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value.
vocab_filename | The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph will be used as the file name. If not None, should be unique within a given preprocessing function. NOTE: in order to make your pipelines resilient to implementation details, please set vocab_filename if you are using the vocabulary file in a downstream component.
weights | (Optional) Weights Tensor for the vocabulary. It must have the same shape as x.
file_format | (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
store_frequency | If True, the frequency of the words is stored in the vocabulary file. If labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and file_format is 'text', then spaces will be replaced to avoid information loss.
reserved_tokens | (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache.
name | (Optional) A name for this operation.

RETURNS | DESCRIPTION
---|---
ConsistentTensorType | A Tensor, SparseTensor, or RaggedTensor where each value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer and integers are consecutive starting from zero. A string value not in the vocabulary is assigned default_value; alternatively, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets) for an overall range of [0, vocab_size + num_oov_buckets).

RAISES | DESCRIPTION
---|---
ValueError | If top_k is negative, or if file_format is not in the list of allowed formats.
Source code in tensorflow_transform/experimental/mappers.py
document_frequency

Maps the terms in x to their document frequency in the same order.

The document frequency of a term is the number of documents that contain the
term in the entire dataset. Each unique vocab term has a unique document
frequency.

Example usage:

```python
def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
  return {
      'df': tft.experimental.document_frequency(integerized, vocab_size),
      'integerized': integerized,
  }

raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
            dict(x=["yum", "yum", "pie"])]
feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'df': array([1, 1, 2, 2, 2]), 'integerized': array([3, 2, 0, 0, 0])},
#  {'df': array([1, 1, 2]), 'integerized': array([1, 1, 0])}]
```

```
example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                          [1, 0], [1, 1], [1, 2]],
                 values=[1, 2, 0, 0, 0, 3, 3, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                           [1, 0], [1, 1], [1, 2]],
                  values=[1, 1, 2, 2, 2, 1, 1, 2])
```
PARAMETER | DESCRIPTION
---|---
x | A 2D SparseTensor of int64 values (typically the result of compute_and_apply_vocabulary on a tokenized string).
vocab_size | An int - the count of vocab used to turn the string into int64s including any OOV buckets.
name | (Optional) A name for this operation.

RETURNS | DESCRIPTION
---|---
SparseTensor | A SparseTensor with the same indices as the input x and values document_frequency. Same shape as the input x.
Source code in tensorflow_transform/experimental/mappers.py
get_vocabulary_size_by_name

```python
get_vocabulary_size_by_name(vocab_filename: str) -> Tensor
```

Gets the size of a vocabulary created using tft.vocabulary.

This is the number of keys in the output vocab_filename and does not include
the number of OOV buckets.

PARAMETER | DESCRIPTION
---|---
vocab_filename | The name of the vocabulary file whose size is to be retrieved.

Example:

```python
def preprocessing_fn(inputs):
  num_oov_buckets = 1
  x_int = tft.compute_and_apply_vocabulary(
      inputs['x'], vocab_filename='my_vocab',
      num_oov_buckets=num_oov_buckets)
  depth = (
      tft.experimental.get_vocabulary_size_by_name('my_vocab') +
      num_oov_buckets)
  x_encoded = tf.one_hot(
      x_int, depth=tf.cast(depth, tf.int32), dtype=tf.int64)
  return {'x_encoded': x_encoded}

raw_data = [dict(x='foo'), dict(x='foo'), dict(x='bar')]
feature_spec = dict(x=tf.io.FixedLenFeature([], tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'x_encoded': array([1, 0, 0])}, {'x_encoded': array([1, 0, 0])},
#  {'x_encoded': array([0, 1, 0])}]
```

RETURNS | DESCRIPTION
---|---
Tensor | An integer tensor containing the size of the requested vocabulary.

RAISES | DESCRIPTION
---|---
ValueError | If no vocabulary size is found for the given vocab_filename.
Source code in tensorflow_transform/experimental/annotators.py
idf

```python
idf(
    x: SparseTensor,
    vocab_size: int,
    smooth: bool = True,
    add_baseline: bool = True,
    name: Optional[str] = None,
) -> SparseTensor
```

Maps the terms in x to their inverse document frequency in the same order.

The inverse document frequency of a term, by default, is calculated as
1 + log((corpus size + 1) / (count of documents containing term + 1)).

Example usage:

```python
def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
  idf_weights = tft.experimental.idf(integerized, vocab_size)
  return {
      'idf': idf_weights,
      'integerized': integerized,
  }

raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
            dict(x=["yum", "yum", "pie"])]
feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset

# 1 + log(3/2) = 1.4054651
transformed_data
# [{'idf': array([1.4054651, 1.4054651, 1., 1., 1.], dtype=float32),
#   'integerized': array([3, 2, 0, 0, 0])},
#  {'idf': array([1.4054651, 1.4054651, 1.], dtype=float32),
#   'integerized': array([1, 1, 0])}]
```

```
example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                          [1, 0], [1, 1], [1, 2]],
                 values=[1, 2, 0, 0, 0, 3, 3, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                           [1, 0], [1, 1], [1, 2]],
                  values=[1 + log(3/2), 1 + log(3/2), 1, 1, 1,
                          1 + log(3/2), 1 + log(3/2), 1])
```
PARAMETER | DESCRIPTION
---|---
x | A 2D SparseTensor of int64 values (typically the result of compute_and_apply_vocabulary on a tokenized string).
vocab_size | An int - the count of vocab used to turn the string into int64s including any OOV buckets.
smooth | A bool indicating if the inverse document frequency should be smoothed. If True, which is the default, then the idf is calculated as 1 + log((corpus size + 1) / (document frequency of term + 1)). Otherwise, the idf is 1 + log((corpus size) / (document frequency of term)), which could result in a division by zero error.
add_baseline | A bool indicating if the inverse document frequency should be added with a constant baseline 1.0. If True, which is the default, then the idf is calculated as 1 + log(). Otherwise, the idf is log() without the constant 1 baseline. Keeping the baseline reduces the discrepancy in idf between commonly seen terms and rare terms.
name | (Optional) A name for this operation.

RETURNS | DESCRIPTION
---|---
SparseTensor | A SparseTensor with the same indices as the input x and values inverse document frequency. Same shape as the input x.
Source code in tensorflow_transform/experimental/mappers.py
ptransform_analyzer

```python
ptransform_analyzer(
    inputs: Collection[Tensor],
    ptransform: Union[_BeamPTransform, CacheablePTransformAnalyzer],
    output_dtypes: Collection[DType],
    output_shapes: Collection[List[int]],
    output_asset_default_values: Optional[Collection[Optional[bytes]]] = None,
    name: Optional[str] = None,
)
```

Applies a user-provided PTransform over the whole dataset.

WARNING: This is experimental.

Note that in order to have asset files copied correctly, any outputs that
represent asset filenames must be added to the tf.GraphKeys.ASSET_FILEPATHS
collection by the caller if using Transform's APIs in compat v1 mode.

Example:

```python
class MeanPerKey(beam.PTransform):
  def expand(
      self,
      pcoll: beam.PCollection[Tuple[np.ndarray, np.ndarray]]
  ) -> Tuple[beam.PCollection[np.ndarray], beam.PCollection[np.ndarray]]:
    def extract_output(key_value_pairs):
      keys, values = zip(*key_value_pairs)
      return [beam.TaggedOutput('keys', keys),
              beam.TaggedOutput('values', values)]
    return tuple(
        pcoll
        | 'ZipAndFlatten' >> beam.FlatMap(lambda batches: list(zip(*batches)))
        | 'MeanPerKey' >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
        | 'ToList' >> beam.combiners.ToList()
        | 'Extract' >> beam.FlatMap(extract_output).with_outputs(
            'keys', 'values'))

def preprocessing_fn(inputs):
  outputs = tft.experimental.ptransform_analyzer(
      inputs=[inputs['s'], inputs['x']],
      ptransform=MeanPerKey(),
      output_dtypes=[tf.string, tf.float32],
      output_shapes=[[2], [2]])
  (keys, means) = outputs
  mean_a = tf.reshape(tf.gather(means, tf.where(keys == 'a')), [])
  return {'x/mean_a': inputs['x'] / mean_a}

raw_data = [dict(x=1, s='a'), dict(x=8, s='b'), dict(x=3, s='a')]
feature_spec = dict(
    x=tf.io.FixedLenFeature([], tf.float32),
    s=tf.io.FixedLenFeature([], tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# [{'x/mean_a': 0.5}, {'x/mean_a': 4.0}, {'x/mean_a': 1.5}]
```
PARAMETER | DESCRIPTION
---|---
inputs | An ordered collection of input Tensors.
ptransform | A Beam PTransform that accepts a Beam PCollection where each element is a tuple of ndarrays, one per input tensor; alternatively, a CacheablePTransformAnalyzer to enable analyzer cache.
output_dtypes | An ordered collection of TensorFlow dtypes of the output of the analyzer.
output_shapes | An ordered collection of shapes of the output of the analyzer. Must have the same length as output_dtypes.
output_asset_default_values | (Optional) An ordered collection of optional bytes aligned with output_dtypes and output_shapes.
name | (Optional) Similar to a TF op name. Used to define a unique scope for this analyzer, which can be used for debugging info.

RETURNS | DESCRIPTION
---|---
 | A list of output Tensors with dtypes and shapes as specified by output_dtypes and output_shapes.

RAISES | DESCRIPTION
---|---
ValueError | If output_dtypes and output_shapes have different lengths.
Source code in tensorflow_transform/experimental/analyzers.py