TensorFlow Data Validation API Documentation¶
tensorflow_data_validation
¶
Init module for TensorFlow Data Validation.
Attributes¶
Classes¶
CombinerStatsGenerator
¶
Bases: Generic[ACCTYPE]
, StatsGenerator
A StatsGenerator which computes statistics using a combiner function.
This class computes statistics using a combiner function: it builds partial states by processing a batch of examples at a time, merges the partial states, and finally computes the statistics from the merged partial state.
This object mirrors a beam.CombineFn except for the add_input interface, which is expected to be defined by its sub-classes. Specifically, the generator must implement the following four methods:
- create_accumulator(): Initializes an accumulator to store the partial state and returns it.
- add_input(accumulator, input_record_batch): Incorporates a batch of input examples (represented as an Arrow RecordBatch) into the current accumulator and returns the updated accumulator.
- merge_accumulators(accumulators): Merges the partial states in the accumulators and returns the accumulator containing the merged state.
- extract_output(accumulator): Computes statistics from the partial state in the accumulator and returns the result as a DatasetFeatureStatistics proto.
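A minimal sketch of a custom combiner-based generator implementing this four-method contract is shown below. It simply counts the number of examples seen; the class name and counting logic are illustrative only, not part of the TFDV API.

```python
import pyarrow as pa
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import statistics_pb2


class ExampleCountGenerator(tfdv.CombinerStatsGenerator):
    """Illustrative generator that counts the number of examples seen."""

    def __init__(self):
        super().__init__(name='ExampleCountGenerator', schema=None)

    def create_accumulator(self) -> int:
        # Fresh, empty partial state.
        return 0

    def add_input(self, accumulator: int,
                  input_record_batch: pa.RecordBatch) -> int:
        # Fold one batch of examples into the partial state.
        return accumulator + input_record_batch.num_rows

    def merge_accumulators(self, accumulators) -> int:
        # Combine partial states computed on different workers.
        return sum(accumulators)

    def extract_output(
        self, accumulator: int
    ) -> statistics_pb2.DatasetFeatureStatistics:
        # Convert the merged partial state into the output proto.
        result = statistics_pb2.DatasetFeatureStatistics()
        result.num_examples = accumulator
        return result
```

A generator like this can then be supplied to statistics generation via StatsOptions(generators=[ExampleCountGenerator()]).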
Initializes a statistics generator.
PARAMETER | DESCRIPTION |
---|---|
name
|
A unique name associated with the statistics generator.
TYPE:
|
schema
|
An optional schema for the dataset.
TYPE:
|
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
Attributes¶
Functions¶
add_input
¶
Returns result of folding a batch of inputs into accumulator.
PARAMETER | DESCRIPTION |
---|---|
accumulator
|
The current accumulator, which may be modified and returned for efficiency.
TYPE:
|
input_record_batch
|
An Arrow RecordBatch whose columns are features and
rows are examples. The columns are of type List
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
ACCTYPE
|
The accumulator after updating the statistics for the batch of inputs. |
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
compact
¶
Returns a compact representation of the accumulator.
This is optionally called before an accumulator is sent across the wire. The base-class implementation is a no-op; it may be overridden by a derived class.
PARAMETER | DESCRIPTION |
---|---|
accumulator
|
The accumulator to compact.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
ACCTYPE
|
The compacted accumulator. By default is an identity. |
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
create_accumulator
¶
Returns a fresh, empty accumulator.
RETURNS | DESCRIPTION |
---|---|
ACCTYPE
|
An empty accumulator. |
extract_output
¶
Returns result of converting accumulator into the output value.
PARAMETER | DESCRIPTION |
---|---|
accumulator
|
The final accumulator value.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatistics
|
A proto representing the result of this stats generator. |
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
merge_accumulators
¶
merge_accumulators(
accumulators: Iterable[ACCTYPE],
) -> ACCTYPE
Merges several accumulators into a single accumulator value.
Note: mutating any element in accumulators except for the first is not allowed. The first element may be modified and returned for efficiency.
PARAMETER | DESCRIPTION |
---|---|
accumulators
|
The accumulators to merge.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
ACCTYPE
|
The merged accumulator. |
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
setup
¶
Prepares an instance for combining.
Subclasses should put costly initializations here instead of in __init__(), so that 1) the cost is properly recognized by Beam as setup cost (per worker) and 2) the cost is not paid at pipeline construction time.
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
CrossFeatureView
¶
Bases: object
View of a single cross feature.
Source code in tensorflow_data_validation/utils/stats_util.py
DatasetListView
¶
Bases: object
View of statistics for multiple datasets (slices).
Source code in tensorflow_data_validation/utils/stats_util.py
Functions¶
get_default_slice
¶
get_default_slice() -> Optional[DatasetView]
get_default_slice_or_die
¶
get_default_slice_or_die() -> DatasetView
Source code in tensorflow_data_validation/utils/stats_util.py
get_slice
¶
get_slice(slice_key: str) -> Optional[DatasetView]
list_slices
¶
DatasetView
¶
Bases: object
View of statistics for a dataset (slice).
Source code in tensorflow_data_validation/utils/stats_util.py
Functions¶
get_cross_feature
¶
get_cross_feature(
x_path: Union[str, FeaturePath, Iterable[str]],
y_path: Union[str, FeaturePath, Iterable[str]],
) -> Optional[CrossFeatureView]
Retrieve a cross-feature if it exists, or None.
Source code in tensorflow_data_validation/utils/stats_util.py
get_derived_feature
¶
get_derived_feature(
deriver_name: str, source_paths: Sequence[FeaturePath]
) -> Optional[FeatureView]
Retrieve a derived feature based on a deriver name and its inputs.
PARAMETER | DESCRIPTION |
---|---|
deriver_name
|
The name of a deriver. Matches validation_derived_source deriver_name.
TYPE:
|
source_paths
|
Source paths for derived features. Matches validation_derived_source.source_path.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Optional[FeatureView]
|
FeatureView of derived feature. |
Source code in tensorflow_data_validation/utils/stats_util.py
get_feature
¶
get_feature(
feature_id: Union[str, FeaturePath, Iterable[str]],
) -> Optional[FeatureView]
Retrieve a feature if it exists.
Features specified within the underlying proto by name (instead of path) are normalized to a length 1 path, and can be referred to as such.
PARAMETER | DESCRIPTION |
---|---|
feature_id
|
A types.FeaturePath, Iterable[str] consisting of path steps, or a str, which is converted to a length one path.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Optional[FeatureView]
|
A FeatureView, or None if feature_id is not present. |
Source code in tensorflow_data_validation/utils/stats_util.py
list_cross_features
¶
list_cross_features() -> Iterable[
Tuple[FeaturePath, FeaturePath]
]
list_features
¶
list_features() -> Iterable[FeaturePath]
DetectFeatureSkew
¶
DetectFeatureSkew(
identifier_features: List[FeatureName],
features_to_ignore: Optional[List[FeatureName]] = None,
sample_size: int = 0,
float_round_ndigits: Optional[int] = None,
allow_duplicate_identifiers: bool = False,
)
Bases: PTransform
API for detecting feature skew between training and serving examples.
Example:

```python
with beam.Pipeline(runner=...) as p:
    training_examples = (
        p | 'ReadTrainingData' >> beam.io.ReadFromTFRecord(
            training_filepaths,
            coder=beam.coders.ProtoCoder(tf.train.Example)))
    serving_examples = (
        p | 'ReadServingData' >> beam.io.ReadFromTFRecord(
            serving_filepaths,
            coder=beam.coders.ProtoCoder(tf.train.Example)))
    _ = ((training_examples, serving_examples)
         | 'DetectFeatureSkew' >> DetectFeatureSkew(
             identifier_features=['id1'], sample_size=5)
         | 'WriteFeatureSkewResultsOutput' >>
             tfdv.WriteFeatureSkewResultsToTFRecord(output_path)
         | 'WriteFeatureSkewPairsOutput' >>
             tfdv.WriteFeatureSkewPairsToTFRecord(output_path))
```
See the documentation for DetectFeatureSkewImpl for more detail about feature skew detection.
Initializes the feature skew detection PTransform.
PARAMETER | DESCRIPTION |
---|---|
identifier_features
|
Names of features to use as identifiers.
TYPE:
|
features_to_ignore
|
Names of features for which no feature skew detection is done. |
sample_size
|
Size of the sample of training-serving example pairs that exhibit skew to include in the skew results.
TYPE:
|
float_round_ndigits
|
Number of digits precision after the decimal point to which to round float values before comparing them. |
allow_duplicate_identifiers
|
If set, skew detection will be done on examples for which there are duplicate identifier feature values. In this case, the counts in the FeatureSkew result are based on each training-serving example pair analyzed. Examples with given identifier feature values must all fit in memory.
TYPE:
|
Source code in tensorflow_data_validation/api/validation_api.py
Functions¶
expand
¶
Source code in tensorflow_data_validation/api/validation_api.py
FeatureView
¶
Bases: object
View of a single feature.
This class provides accessor methods, as well as access to the underlying proto. Where possible, accessors should be used in place of proto access (for example, x.numeric_statistics() instead of x.proto().num_stats) in order to support future extension of the proto.
Source code in tensorflow_data_validation/utils/stats_util.py
Functions¶
bytes_statistics
¶
bytes_statistics() -> Optional[BytesStatistics]
Retrieve byte statistics if available.
Source code in tensorflow_data_validation/utils/stats_util.py
common_statistics
¶
common_statistics() -> Optional[CommonStatistics]
Retrieve common statistics if available.
Source code in tensorflow_data_validation/utils/stats_util.py
custom_statistic
¶
Retrieve a custom_statistic by name.
Source code in tensorflow_data_validation/utils/stats_util.py
numeric_statistics
¶
numeric_statistics() -> Optional[NumericStatistics]
Retrieve numeric statistics if available.
Source code in tensorflow_data_validation/utils/stats_util.py
proto
¶
string_statistics
¶
string_statistics() -> Optional[StringStatistics]
Retrieve string statistics if available.
Source code in tensorflow_data_validation/utils/stats_util.py
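A brief sketch of navigating loaded statistics through the view classes follows. It assumes that DatasetListView can be constructed directly from a DatasetFeatureStatisticsList proto and that a numeric feature named "age" exists in the default slice; both are assumptions made for illustration.

```python
import tensorflow_data_validation as tfdv

stats = tfdv.load_statistics('/tmp/stats.tfrecord')  # placeholder path
# Assumption: DatasetListView wraps a DatasetFeatureStatisticsList proto.
dataset_list = tfdv.DatasetListView(stats)
default_slice = dataset_list.get_default_slice_or_die()
feature = default_slice.get_feature('age')  # hypothetical feature name
if feature is not None and feature.numeric_statistics() is not None:
    print(feature.numeric_statistics().mean)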
GenerateStatistics
¶
GenerateStatistics(options: StatsOptions = StatsOptions())
Bases: PTransform
API for generating data statistics.
Example:

```python
with beam.Pipeline(runner=...) as p:
    _ = (p
         | 'ReadData' >> tfx_bsl.public.tfxio.TFExampleRecord(
             data_location).BeamSource()
         | 'GenerateStatistics' >> GenerateStatistics()
         | 'WriteStatsOutput' >> tfdv.WriteStatisticsToTFRecord(output_path))
```
Initializes the transform.
PARAMETER | DESCRIPTION |
---|---|
options
|
TYPE:
|
RAISES | DESCRIPTION |
---|---|
TypeError
|
If options is not of the expected type. |
Source code in tensorflow_data_validation/api/stats_api.py
Functions¶
expand
¶
Source code in tensorflow_data_validation/api/stats_api.py
MergeDatasetFeatureStatisticsList
¶
Bases: PTransform
API for merging sharded DatasetFeatureStatisticsList.
StatsOptions
¶
StatsOptions(
generators: Optional[List[StatsGenerator]] = None,
schema: Optional[Schema] = None,
label_feature: Optional[FeatureName] = None,
weight_feature: Optional[FeatureName] = None,
slice_functions: Optional[List[SliceFunction]] = None,
sample_rate: Optional[float] = None,
num_top_values: int = 20,
frequency_threshold: int = 1,
weighted_frequency_threshold: float = 1.0,
num_rank_histogram_buckets: int = 1000,
num_values_histogram_buckets: int = 10,
num_histogram_buckets: int = 10,
num_quantiles_histogram_buckets: int = 10,
epsilon: float = 0.01,
infer_type_from_schema: bool = False,
desired_batch_size: Optional[int] = None,
enable_semantic_domain_stats: bool = False,
semantic_domain_stats_sample_rate: Optional[
float
] = None,
per_feature_weight_override: Optional[
Dict[FeaturePath, FeatureName]
] = None,
vocab_paths: Optional[
Dict[VocabName, VocabPath]
] = None,
add_default_generators: bool = True,
feature_allowlist: Optional[
Union[List[FeatureName], List[FeaturePath]]
] = None,
experimental_use_sketch_based_topk_uniques: Optional[
bool
] = None,
use_sketch_based_topk_uniques: Optional[bool] = None,
experimental_slice_functions: Optional[
List[SliceFunction]
] = None,
experimental_slice_sqls: Optional[List[Text]] = None,
experimental_result_partitions: int = 1,
experimental_num_feature_partitions: int = 1,
slicing_config: Optional[SlicingConfig] = None,
experimental_filter_read_paths: bool = False,
per_feature_stats_config: Optional[
PerFeatureStatsConfig
] = None,
)
Bases: object
Options for generating statistics.
Initializes statistics options.
PARAMETER | DESCRIPTION |
---|---|
generators
|
An optional list of statistics generators. A statistics generator must extend either CombinerStatsGenerator or TransformStatsGenerator. |
schema
|
An optional tensorflow_metadata Schema proto. Currently we use the schema to infer categorical and bytes features.
TYPE:
|
label_feature
|
An optional feature name which represents the label.
TYPE:
|
weight_feature
|
An optional feature name whose numeric value represents the weight of an example.
TYPE:
|
slice_functions
|
DEPRECATED. Use |
sample_rate
|
An optional sampling rate. If specified, statistics is computed over the sample. |
num_top_values
|
An optional number of most frequent feature values to keep for string features.
TYPE:
|
frequency_threshold
|
An optional minimum number of examples the most frequent values must be present in.
TYPE:
|
weighted_frequency_threshold
|
An optional minimum weighted number of examples the most frequent weighted values must be present in. This option is only relevant when a weight_feature is specified.
TYPE:
|
num_rank_histogram_buckets
|
An optional number of buckets in the rank histogram for string features.
TYPE:
|
num_values_histogram_buckets
|
An optional number of buckets in a quantiles histogram for the number of values per Feature, which is stored in CommonStatistics.num_values_histogram.
TYPE:
|
num_histogram_buckets
|
An optional number of buckets in a standard NumericStatistics.histogram with equal-width buckets.
TYPE:
|
num_quantiles_histogram_buckets
|
An optional number of buckets in a quantiles NumericStatistics.histogram.
TYPE:
|
epsilon
|
An optional error tolerance for the computation of quantiles, typically a small fraction close to zero (e.g. 0.01). Higher values of epsilon increase the quantile approximation error, and hence can result in more unequal buckets, but may improve performance and resource consumption.
TYPE:
|
infer_type_from_schema
|
A boolean to indicate whether the feature types
should be inferred from the schema. If set to True, an input schema must
be provided. This flag is used only when invoking TFDV through
TYPE:
|
desired_batch_size
|
An optional maximum number of examples to include in
each batch that is passed to the statistics generators. When invoking
TFDV using its end-to-end APIs (e.g.
|
enable_semantic_domain_stats
|
If True statistics for semantic domains are generated (e.g: image, text domains).
TYPE:
|
semantic_domain_stats_sample_rate
|
An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample. |
per_feature_weight_override
|
If specified, the "example weight" paired
with a feature will be first looked up in this map and if not found,
fall back to
TYPE:
|
vocab_paths
|
An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files. |
add_default_generators
|
Whether to invoke the default set of stats
generators in the run. Generators invoked consists of 1) the default
generators (controlled by this option); 2) user-provided generators (
controlled by the
TYPE:
|
feature_allowlist
|
An optional list of names of the features to calculate statistics for, or a list of paths.
TYPE:
|
experimental_use_sketch_based_topk_uniques
|
Deprecated, prefer use_sketch_based_topk_uniques. |
use_sketch_based_topk_uniques
|
if True, use the sketch based top-k and uniques stats generator. |
experimental_slice_functions
|
An optional list of functions that generate slice keys for each example. Each slice function should take pyarrow.RecordBatch as input and return an Iterable[Tuple[Text, pyarrow.RecordBatch]]. Each tuple contains the slice key and the corresponding sliced RecordBatch. Only one of experimental_slice_functions or experimental_slice_sqls must be specified. |
experimental_slice_sqls
|
List of slicing SQL queries. The query must have the following pattern: "SELECT STRUCT({feature_name} [AS {slice_key}]) [FROM example.feature_name [, example.feature_name, ... ][WHERE ... ]]" The “example.feature_name” inside the FROM statement is used to flatten the repeated fields. For non-repeated fields, you can directly write the query as follows: “SELECT STRUCT(non_repeated_feature_a, non_repeated_feature_b)” In the query, the “example” is a key word that binds to each input "row". The semantics of this variable will depend on the decoding of the input data to the Arrow representation (e.g., for tf.Example, each key is decoded to a separate column). Thus, structured data can be readily accessed by iterating/unnesting the fields of the "example" variable. Example 1: Slice on each value of a feature "SELECT STRUCT(gender) FROM example.gender" Example 2: Slice on each value of one feature and a specified value of another. "SELECT STRUCT(gender, country) FROM example.gender, example.country WHERE country = 'USA'" Only one of experimental_slice_functions or experimental_slice_sqls must be specified. |
experimental_result_partitions
|
The number of feature partitions to combine output DatasetFeatureStatisticsLists into. If set to 1 (default) output is globally combined. If set to value greater than one, up to that many shards are returned, each containing a subset of features.
TYPE:
|
experimental_num_feature_partitions
|
If > 1, partitions computations by supported generators to act on this many bundles of features. For best results this should be set to at least several times less than the number of features in a dataset, and never more than the available beam parallelism.
TYPE:
|
slicing_config
|
an optional SlicingConfig. SlicingConfig includes slicing_specs specified with feature keys, feature values or slicing SQL queries.
TYPE:
|
experimental_filter_read_paths
|
If provided, tries to push down either paths passed via feature_allowlist or via the schema (in that priority) to the underlying read operation. Support depends on the file reader.
TYPE:
|
per_feature_stats_config
|
Supports granular control of what statistics are enabled per feature. Experimental.
TYPE:
|
Source code in tensorflow_data_validation/statistics/stats_options.py
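A short usage sketch follows; the schema path, feature names, and bucket counts are placeholders chosen for illustration.

```python
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text('/tmp/schema.pbtxt')  # placeholder path
options = tfdv.StatsOptions(
    schema=schema,
    weight_feature='example_weight',      # hypothetical weight feature
    num_top_values=50,
    num_histogram_buckets=20,
    feature_allowlist=['age', 'income'],  # hypothetical features
)
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/tmp/train*.tfrecord', stats_options=options)
```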
Attributes¶
enable_semantic_domain_stats
instance-attribute
¶
experimental_slice_functions
property
writable
¶
experimental_use_sketch_based_topk_uniques
property
writable
¶
experimental_use_sketch_based_topk_uniques: bool
feature_allowlist
property
writable
¶
feature_allowlist: Optional[
Union[List[FeatureName], List[FeaturePath]]
]
num_rank_histogram_buckets
instance-attribute
¶
semantic_domain_stats_sample_rate
property
writable
¶
weighted_frequency_threshold
instance-attribute
¶
Functions¶
from_json
classmethod
¶
from_json(options_json: Text) -> StatsOptions
Construct an instance of stats options from a JSON representation.
PARAMETER | DESCRIPTION |
---|---|
options_json
|
A JSON representation of the dict attribute of a StatsOptions instance.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
StatsOptions
|
A StatsOptions instance constructed by setting the dict attribute to |
StatsOptions
|
the deserialized value of options_json. |
Source code in tensorflow_data_validation/statistics/stats_options.py
to_json
¶
to_json() -> Text
Convert from an object to JSON representation of the dict attribute.
Custom generators and slice_functions cannot be converted. As a result, a ValueError will be raised when these options are specified and TFDV is running in a setting where the stats options have been JSON-serialized first. This will happen when TFDV is run as a TFX component. The schema proto and slicing_config will be JSON-encoded.
RETURNS | DESCRIPTION |
---|---|
Text
|
A JSON representation of a filtered version of dict. |
Source code in tensorflow_data_validation/statistics/stats_options.py
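A brief sketch of round-tripping options through JSON; as noted above, options carrying custom generators or slice functions cannot be serialized this way.

```python
import tensorflow_data_validation as tfdv

options = tfdv.StatsOptions(num_top_values=50)
options_json = options.to_json()           # serialize to a JSON string
restored = tfdv.StatsOptions.from_json(options_json)
assert restored.num_top_values == 50
```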
TransformStatsGenerator
¶
Bases: StatsGenerator
A StatsGenerator which wraps an arbitrary Beam PTransform.
This class computes statistics using a user-provided Beam PTransform. The PTransform must accept a Beam PCollection where each element is a tuple containing a slice key and an Arrow RecordBatch representing a batch of examples. It must return a PCollection where each element is a tuple containing a slice key and a DatasetFeatureStatistics proto representing the statistics of a slice.
Initializes a statistics generator.
PARAMETER | DESCRIPTION |
---|---|
name
|
A unique name associated with the statistics generator.
TYPE:
|
schema
|
An optional schema for the dataset.
TYPE:
|
Source code in tensorflow_data_validation/statistics/generators/stats_generator.py
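A rough sketch of wrapping a Beam PTransform as a generator follows; the ptransform constructor argument and the toy per-batch counting logic are assumptions for illustration. The contract described above is that input elements are (slice key, Arrow RecordBatch) tuples and output elements are (slice key, DatasetFeatureStatistics) tuples.

```python
import apache_beam as beam
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import statistics_pb2


def _count_rows(keyed_batch):
    # keyed_batch is a (slice_key, pyarrow.RecordBatch) tuple.
    slice_key, record_batch = keyed_batch
    stats = statistics_pb2.DatasetFeatureStatistics()
    stats.num_examples = record_batch.num_rows
    return slice_key, stats


class _RowCountPTransform(beam.PTransform):
    def expand(self, pcoll):
        # Emits one partial DatasetFeatureStatistics per input batch; a real
        # generator would also combine these per slice key.
        return pcoll | 'CountRows' >> beam.Map(_count_rows)


row_count_generator = tfdv.TransformStatsGenerator(
    name='RowCountGenerator', ptransform=_RowCountPTransform())
```

Such a generator can likewise be passed to statistics generation through StatsOptions(generators=[...]).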
WriteStatisticsToBinaryFile
¶
WriteStatisticsToBinaryFile(output_path: Text)
Bases: PTransform
API for writing serialized data statistics to a binary file.
Initializes the transform.
PARAMETER | DESCRIPTION |
---|---|
output_path
|
Output path for writing data statistics.
TYPE:
|
Source code in tensorflow_data_validation/api/stats_api.py
WriteStatisticsToRecordsAndBinaryFile
¶
WriteStatisticsToRecordsAndBinaryFile(
binary_proto_path: str,
records_path_prefix: str,
columnar_path_prefix: Optional[str] = None,
)
Bases: PTransform
API for writing statistics to both sharded records and binary pb.
This PTransform assumes that the input represents sharded statistics, which are written directly. These statistics are also merged and written to a binary proto.
Currently Experimental.
TODO(b/202910677): After full migration to sharded stats, clean this up.
Initializes the transform.
PARAMETER | DESCRIPTION |
---|---|
binary_proto_path
|
Output path for writing statistics as a binary proto.
TYPE:
|
records_path_prefix
|
File pattern for writing statistics to sharded records.
TYPE:
|
columnar_path_prefix
|
Optional file pattern for writing statistics to columnar outputs. If provided, columnar outputs will be written when supported. |
Source code in tensorflow_data_validation/api/stats_api.py
Functions¶
expand
¶
Source code in tensorflow_data_validation/api/stats_api.py
WriteStatisticsToTFRecord
¶
WriteStatisticsToTFRecord(
output_path: Text, sharded_output=False
)
Bases: PTransform
API for writing serialized data statistics to TFRecord file.
Initializes the transform.
PARAMETER | DESCRIPTION |
---|---|
output_path
|
The output path or path prefix (if sharded_output=True).
TYPE:
|
sharded_output
|
If true, writes sharded TFRecord files in the form output_path-SSSSS-of-NNNNN.
DEFAULT:
|
Source code in tensorflow_data_validation/api/stats_api.py
Functions¶
compare_slices
¶
compare_slices(
statistics: DatasetFeatureStatisticsList,
lhs_slice_key: Text,
rhs_slice_key: Text,
)
Compare statistics of two slices using Facets.
PARAMETER | DESCRIPTION |
---|---|
statistics
|
A DatasetFeatureStatisticsList protocol buffer.
TYPE:
|
lhs_slice_key
|
Slice key of the first slice.
TYPE:
|
rhs_slice_key
|
Slice key of the second slice.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
ValueError
|
If the input statistics proto does not have the specified slice statistics. |
Source code in tensorflow_data_validation/utils/display_util.py
display_anomalies
¶
Displays the input anomalies (for use in a Jupyter notebook).
PARAMETER | DESCRIPTION |
---|---|
anomalies
|
An Anomalies protocol buffer.
TYPE:
|
Source code in tensorflow_data_validation/utils/display_util.py
display_schema
¶
Displays the input schema (for use in a Jupyter notebook).
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
Source code in tensorflow_data_validation/utils/display_util.py
experimental_get_feature_value_slicer
¶
experimental_get_feature_value_slicer(
features: Dict[FeatureName, Optional[_ValueType]],
) -> SliceFunction
Returns a function that generates sliced record batches for a given record batch.
The returned function returns sliced record batches based on the combination of all features specified in features. To slice on features separately (e.g., slice on the age feature and separately slice on the interests feature), you must use separate slice functions.
Examples:

```python
# Slice on each value of the specified features.
slice_fn = get_feature_value_slicer(
    features={'age': None, 'interests': None})

# Slice on a specified feature value.
slice_fn = get_feature_value_slicer(features={'interests': ['dogs']})

# Slice on each value of one feature and a specified value of another.
slice_fn = get_feature_value_slicer(
    features={'fruits': None, 'numbers': [1]})
```
PARAMETER | DESCRIPTION |
---|---|
features
|
A mapping of features to an optional iterable of values that the returned function will slice on. If values is None for a feature, then the slice keys will reflect each distinct value found for that feature in the input record batch. If values are specified for a feature, then the slice keys will reflect only those values for the feature, if found in the input record batch. Values must be an iterable of strings or integers. |
RETURNS | DESCRIPTION |
---|---|
SliceFunction
|
A function that takes as input a single record batch and returns a list of |
SliceFunction
|
sliced record batches (slice_key, record_batch). |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If feature values are not specified in an iterable. |
NotImplementedError
|
If a value of a type other than string or integer is
specified in the values iterable in |
Source code in tensorflow_data_validation/utils/slicing_util.py
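A short sketch of wiring the returned slice function into statistics generation; the feature names and values are placeholders.

```python
import tensorflow_data_validation as tfdv

slice_fn = tfdv.experimental_get_feature_value_slicer(
    features={'country': None, 'device': ['mobile']})  # hypothetical features
options = tfdv.StatsOptions(experimental_slice_functions=[slice_fn])
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/tmp/train*.tfrecord', stats_options=options)
```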
generate_dummy_schema_with_paths
¶
generate_dummy_schema_with_paths(
paths: List[FeaturePath],
) -> Schema
Generate a schema with the requested paths and no other information.
Source code in tensorflow_data_validation/utils/schema_util.py
generate_statistics_from_csv
¶
generate_statistics_from_csv(
data_location: Text,
column_names: Optional[List[FeatureName]] = None,
delimiter: Text = ",",
output_path: Optional[bytes] = None,
stats_options: StatsOptions = StatsOptions(),
pipeline_options: Optional[PipelineOptions] = None,
compression_type: Text = AUTO,
) -> DatasetFeatureStatisticsList
Compute data statistics from CSV files.
Runs a Beam pipeline to compute the data statistics and return the result data statistics proto.
This is a convenience method for users with data in CSV format. Users with data in unsupported file/data formats, or users who wish to create their own Beam pipelines need to use the 'GenerateStatistics' PTransform API directly instead.
PARAMETER | DESCRIPTION |
---|---|
data_location
|
The location of the input data files.
TYPE:
|
column_names
|
A list of column names to be treated as the CSV header. Order must match the order in the input CSV files. If this argument is not specified, we assume the first line in the input CSV files as the header. Note that this option is valid only for 'csv' input file format. |
delimiter
|
A one-character string used to separate fields in a CSV file.
TYPE:
|
output_path
|
The file path to output data statistics result to. If None, we use a temporary directory. It will be a TFRecord file containing a single data statistics proto, and can be read with the 'load_statistics' API. If you run this function on Google Cloud, you must specify an output_path. Specifying None may cause an error. |
stats_options
|
TYPE:
|
pipeline_options
|
Optional beam pipeline options. This allows users to specify various beam pipeline execution parameters like pipeline runner (DirectRunner or DataflowRunner), cloud dataflow service project id, etc. See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for more details.
TYPE:
|
compression_type
|
Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file_path's extension will be used to detect the compression.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
Source code in tensorflow_data_validation/utils/stats_gen_lib.py
generate_statistics_from_dataframe
¶
generate_statistics_from_dataframe(
dataframe: DataFrame,
stats_options: StatsOptions = StatsOptions(),
n_jobs: int = 1,
) -> DatasetFeatureStatisticsList
Compute data statistics for the input pandas DataFrame.
This is a utility function for users with in-memory data represented as a pandas DataFrame.
This function supports only DataFrames with columns of primitive string or numeric types. DataFrames with multivalent features or holding non-string object types are not supported.
PARAMETER | DESCRIPTION |
---|---|
dataframe
|
Input pandas DataFrame.
TYPE:
|
stats_options
|
TYPE:
|
n_jobs
|
Number of processes to run (defaults to 1). If -1 is provided, uses the same number of processes as the number of CPU cores.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
Source code in tensorflow_data_validation/utils/stats_gen_lib.py
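A minimal sketch of computing statistics over an in-memory DataFrame; the column names and values are illustrative only.

```python
import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.DataFrame({'age': [25, 32, 47], 'city': ['NYC', 'SF', 'NYC']})
stats = tfdv.generate_statistics_from_dataframe(df)
tfdv.visualize_statistics(stats)  # renders a Facets view in a notebook
```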
generate_statistics_from_tfrecord
¶
generate_statistics_from_tfrecord(
data_location: Text,
output_path: Optional[bytes] = None,
stats_options: StatsOptions = StatsOptions(),
pipeline_options: Optional[PipelineOptions] = None,
) -> DatasetFeatureStatisticsList
Compute data statistics from TFRecord files containing TFExamples.
Runs a Beam pipeline to compute the data statistics and return the result data statistics proto.
This is a convenience method for users with data in TFRecord format. Users with data in unsupported file/data formats, or users who wish to create their own Beam pipelines need to use the 'GenerateStatistics' PTransform API directly instead.
PARAMETER | DESCRIPTION |
---|---|
data_location
|
The location of the input data files.
TYPE:
|
output_path
|
The file path to output data statistics result to. If None, we use a temporary directory. It will be a TFRecord file containing a single data statistics proto, and can be read with the 'load_statistics' API. If you run this function on Google Cloud, you must specify an output_path. Specifying None may cause an error. |
stats_options
|
TYPE:
|
pipeline_options
|
Optional beam pipeline options. This allows users to specify various beam pipeline execution parameters like pipeline runner (DirectRunner or DataflowRunner), cloud dataflow service project id, etc. See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for more details.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
Source code in tensorflow_data_validation/utils/stats_gen_lib.py
get_confusion_count_dataframes
¶
Returns a pandas dataframe representation of a sequence of ConfusionCount.
PARAMETER | DESCRIPTION |
---|---|
confusion
|
An iterable over ConfusionCount protos.
TYPE:
|
Returns: A map from feature name to a pandas dataframe containing match counts along with base and test counts for all unequal value pairs in the input.
Source code in tensorflow_data_validation/utils/display_util.py
get_domain
¶
get_domain(
schema: Schema,
feature_path: Union[FeatureName, FeaturePath],
) -> Any
Get the domain associated with the input feature from the schema.
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
feature_path
|
The path of the feature whose domain needs to be found. If a FeatureName is passed, a one-step FeaturePath will be constructed and used. For example, "my_feature" -> types.FeaturePath(["my_feature"])
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Any
|
The domain protocol buffer associated with the input feature. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input schema is not of the expected type. |
ValueError
|
If the input feature is not found in the schema or there is no domain associated with the feature. |
Source code in tensorflow_data_validation/utils/schema_util.py
get_feature
¶
get_feature(
schema: Schema,
feature_path: Union[FeatureName, FeaturePath],
) -> Feature
Get a feature from the schema.
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
feature_path
|
The path of the feature to obtain from the schema. If a FeatureName is passed, a one-step FeaturePath will be constructed and used. For example, "my_feature" -> types.FeaturePath(["my_feature"])
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Feature
|
A Feature protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input schema is not of the expected type. |
ValueError
|
If the input feature is not found in the schema. |
Source code in tensorflow_data_validation/utils/schema_util.py
get_feature_stats
¶
get_feature_stats(
stats: DatasetFeatureStatistics,
feature_path: FeaturePath,
) -> FeatureNameStatistics
Get feature statistics from the dataset statistics.
PARAMETER | DESCRIPTION |
---|---|
stats
|
A DatasetFeatureStatistics protocol buffer.
TYPE:
|
feature_path
|
The path of the feature whose statistics to obtain from the dataset statistics.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
FeatureNameStatistics
|
A FeatureNameStatistics protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input statistics is not of the expected type. |
ValueError
|
If the input feature is not found in the dataset statistics. |
Source code in tensorflow_data_validation/utils/stats_util.py
get_match_stats_dataframe
¶
Formats MatchStats as a pandas dataframe.
Source code in tensorflow_data_validation/utils/display_util.py
get_skew_result_dataframe
¶
get_skew_result_dataframe(
skew_results: Iterable[FeatureSkew],
) -> DataFrame
Formats FeatureSkew results as a pandas dataframe.
Source code in tensorflow_data_validation/utils/display_util.py
get_slice_stats
¶
get_slice_stats(
stats: DatasetFeatureStatisticsList, slice_key: Text
) -> DatasetFeatureStatisticsList
Get statistics associated with a specific slice.
PARAMETER | DESCRIPTION |
---|---|
stats
|
A DatasetFeatureStatisticsList protocol buffer.
TYPE:
|
slice_key
|
Slice key of the slice.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
Statistics of the specific slice. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
If the input statistics proto does not have the specified slice statistics. |
Source code in tensorflow_data_validation/utils/stats_util.py
get_statistics_html
¶
get_statistics_html(
lhs_statistics: DatasetFeatureStatisticsList,
rhs_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
lhs_name: Text = "lhs_statistics",
rhs_name: Text = "rhs_statistics",
allowlist_features: Optional[List[FeaturePath]] = None,
denylist_features: Optional[List[FeaturePath]] = None,
) -> Text
Build the HTML for visualizing the input statistics using Facets.
PARAMETER | DESCRIPTION |
---|---|
lhs_statistics
|
A DatasetFeatureStatisticsList protocol buffer.
TYPE:
|
rhs_statistics
|
An optional DatasetFeatureStatisticsList protocol buffer to compare with lhs_statistics.
TYPE:
|
lhs_name
|
Name to use for the lhs_statistics dataset if a name is not already provided within the protocol buffer.
TYPE:
|
rhs_name
|
Name to use for the rhs_statistics dataset if a name is not already provided within the protocol buffer.
TYPE:
|
allowlist_features
|
Set of features to be visualized.
TYPE:
|
denylist_features
|
Set of features to ignore for visualization.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Text
|
HTML to be embedded for visualization. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input argument is not of the expected type. |
ValueError
|
If the input statistics protos does not have only one dataset. |
Source code in tensorflow_data_validation/utils/display_util.py
infer_schema
¶
infer_schema(
statistics: DatasetFeatureStatisticsList,
infer_feature_shape: bool = True,
max_string_domain_size: int = 100,
schema_transformations: Optional[
List[
Callable[
[Schema, DatasetFeatureStatistics], Schema
]
]
] = None,
) -> Schema
Infers schema from the input statistics.
PARAMETER | DESCRIPTION |
---|---|
statistics
|
A DatasetFeatureStatisticsList protocol buffer. Schema inference is currently supported only for lists with a single DatasetFeatureStatistics proto or lists with multiple DatasetFeatureStatistics protos corresponding to data slices that include the default slice (i.e., the slice with all examples). If a list with multiple DatasetFeatureStatistics protos is used, this function will infer the schema from the statistics corresponding to the default slice.
TYPE:
|
infer_feature_shape
|
A boolean to indicate if shape of the features need to be inferred from the statistics.
TYPE:
|
max_string_domain_size
|
Maximum size of the domain of a string feature in order to be interpreted as a categorical feature.
TYPE:
|
schema_transformations
|
List of transformation functions to apply to the auto-inferred schema. Each transformation function should take the schema and statistics as input and should return the transformed schema. The transformations are applied in the order provided in the list.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Schema
|
A Schema protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input argument is not of the expected type. |
ValueError
|
If the input statistics proto contains multiple datasets, none of which corresponds to the default slice. |
Source code in tensorflow_data_validation/api/validation_api.py
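A brief sketch of the common statistics-to-schema flow; the file paths are placeholders.

```python
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/tmp/train*.tfrecord')
schema = tfdv.infer_schema(train_stats, max_string_domain_size=100)
tfdv.write_schema_text(schema, '/tmp/schema.pbtxt')
```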
load_anomalies_text
¶
load_anomalies_text(input_path: Text) -> Anomalies
Loads the Anomalies proto stored in text format in the input path.
PARAMETER | DESCRIPTION |
---|---|
input_path
|
File path from which to load the Anomalies proto.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Anomalies
|
An Anomalies protocol buffer. |
Source code in tensorflow_data_validation/utils/anomalies_util.py
load_schema_text
¶
load_schema_text(input_path: Text) -> Schema
Loads the schema stored in text format in the input path.
PARAMETER | DESCRIPTION |
---|---|
input_path
|
File path to load the schema from.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Schema
|
A Schema protocol buffer. |
Source code in tensorflow_data_validation/utils/schema_util.py
load_sharded_statistics
¶
load_sharded_statistics(
input_path_prefix: Optional[str] = None,
input_paths: Optional[Iterable[str]] = None,
io_provider: Optional[StatisticsIOProvider] = None,
) -> DatasetListView
Read a sharded DatasetFeatureStatisticsList from disk as a DatasetListView.
PARAMETER | DESCRIPTION |
---|---|
input_path_prefix
|
If passed, loads files starting with this prefix and ending with a pattern corresponding to the output of the provided io_provider. |
input_paths
|
A list of file paths of files containing sharded DatasetFeatureStatisticsList protos. |
io_provider
|
Optional StatisticsIOProvider. If unset, a default will be constructed.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetListView
|
A DatasetListView containing the merged proto. |
Source code in tensorflow_data_validation/utils/stats_util.py
load_statistics
¶
load_statistics(
input_path: Text,
) -> DatasetFeatureStatisticsList
Loads data statistics proto from file.
PARAMETER | DESCRIPTION |
---|---|
input_path
|
Data statistics file path. The file should be a one-record TFRecord file or a plain file containing the statistics proto in Proto Text Format.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
RAISES | DESCRIPTION |
---|---|
IOError
|
If the input path does not exist. |
Source code in tensorflow_data_validation/utils/stats_util.py
load_stats_binary
¶
load_stats_binary(
input_path: Text,
) -> DatasetFeatureStatisticsList
Loads a serialized DatasetFeatureStatisticsList proto from a file.
PARAMETER | DESCRIPTION |
---|---|
input_path
|
File path from which to load the DatasetFeatureStatisticsList proto.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
Source code in tensorflow_data_validation/utils/stats_util.py
load_stats_text
¶
load_stats_text(
input_path: Text,
) -> DatasetFeatureStatisticsList
Loads the specified DatasetFeatureStatisticsList proto stored in text format.
PARAMETER | DESCRIPTION |
---|---|
input_path
|
File path from which to load the DatasetFeatureStatisticsList proto.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
DatasetFeatureStatisticsList
|
A DatasetFeatureStatisticsList proto. |
Source code in tensorflow_data_validation/utils/stats_util.py
set_domain
¶
set_domain(
schema: Schema, feature_path: FeaturePath, domain: Any
) -> None
Sets the domain for the input feature in the schema.
If the input feature already has a domain, it is overwritten with the newly provided input domain. This method cannot be used to add a new global domain.
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
feature_path
|
The name of the feature whose domain needs to be set. If a FeatureName is passed, a one-step FeaturePath will be constructed and used. For example, "my_feature" -> types.FeaturePath(["my_feature"])
TYPE:
|
domain
|
A domain protocol buffer or the name of a global string domain present in the input schema.
TYPE:
|
Example:

```python
>>> from tensorflow_metadata.proto.v0 import schema_pb2
>>> import tensorflow_data_validation as tfdv
>>> schema = schema_pb2.Schema()
>>> schema.feature.add(name='feature')
>>> # Setting an int domain.
>>> int_domain = schema_pb2.IntDomain(min=3, max=5)
>>> tfdv.set_domain(schema, "feature", int_domain)
>>> # Setting a string domain.
>>> str_domain = schema_pb2.StringDomain(value=['one', 'two', 'three'])
>>> tfdv.set_domain(schema, "feature", str_domain)
```
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input schema or the domain is not of the expected type. |
ValueError
|
If an invalid global string domain is provided as input. |
Source code in tensorflow_data_validation/utils/schema_util.py
update_schema
¶
update_schema(
schema: Schema,
statistics: DatasetFeatureStatisticsList,
infer_feature_shape: Optional[bool] = True,
max_string_domain_size: Optional[int] = 100,
) -> Schema
Updates input schema to conform to the input statistics.
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
statistics
|
A DatasetFeatureStatisticsList protocol buffer. Schema inference is currently supported only for lists with a single DatasetFeatureStatistics proto or lists with multiple DatasetFeatureStatistics protos corresponding to data slices that include the default slice (i.e., the slice with all examples). If a list with multiple DatasetFeatureStatistics protos is used, this function will update the schema to conform to the statistics corresponding to the default slice.
TYPE:
|
infer_feature_shape
|
DEPRECATED, do not use. If a feature specifies a shape, the shape will always be validated. If the feature does not specify a shape, this function will not try inferring a shape from the given statistics. |
max_string_domain_size
|
Maximum size of the domain of a string feature in order to be interpreted as a categorical feature. |
RETURNS | DESCRIPTION |
---|---|
Schema
|
A Schema protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input argument is not of the expected type. |
ValueError
|
If the input statistics proto contains multiple datasets, none of which corresponds to the default slice. |
Source code in tensorflow_data_validation/api/validation_api.py
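A short sketch of revising an existing schema against newly observed data; the paths are placeholders.

```python
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text('/tmp/schema.pbtxt')
new_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/tmp/new_data*.tfrecord')
updated_schema = tfdv.update_schema(schema, new_stats)
tfdv.write_schema_text(updated_schema, '/tmp/schema.pbtxt')
```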
validate_corresponding_slices
¶
validate_corresponding_slices(
statistics: DatasetFeatureStatisticsList,
schema: Schema,
environment: Optional[Text] = None,
previous_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
serving_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
) -> Anomalies
Validates corresponding sliced statistics.
Sliced statistics are flattened into a single unsliced stats input prior to validation. If multiple statistics are provided, validation is performed on corresponding slices. DatasetConstraints, if present, are applied to the overall slice.
Note: This API is experimental and subject to change.
PARAMETER | DESCRIPTION |
---|---|
statistics
|
See validate_statistics.
TYPE:
|
schema
|
See validate_statistics.
TYPE:
|
environment
|
See validate_statistics. |
previous_statistics
|
See validate_statistics.
TYPE:
|
serving_statistics
|
See validate_statistics.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Anomalies
|
An Anomalies protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If any of the input arguments is not of the expected type. |
Source code in tensorflow_data_validation/api/validation_api.py
validate_examples_in_csv
¶
validate_examples_in_csv(
data_location: Text,
stats_options: StatsOptions,
column_names: Optional[List[FeatureName]] = None,
delimiter: Text = ",",
output_path: Optional[Text] = None,
pipeline_options: Optional[PipelineOptions] = None,
num_sampled_examples=0,
) -> Union[
DatasetFeatureStatisticsList,
Tuple[
DatasetFeatureStatisticsList,
Mapping[str, DataFrame],
],
]
Validates examples in csv files.
Runs a Beam pipeline to detect anomalies on a per-example basis. If this function detects anomalous examples, it generates summary statistics regarding the set of examples that exhibit each anomaly.
This is a convenience function for users with data in CSV format. Users with data in unsupported file/data formats, or users who wish to create their own Beam pipelines need to use the 'IdentifyAnomalousExamples' PTransform API directly instead.
PARAMETER | DESCRIPTION |
---|---|
data_location
|
The location of the input data files.
TYPE:
|
stats_options
|
TYPE:
|
column_names
|
A list of column names to be treated as the CSV header. Order must match the order in the input CSV files. If this argument is not specified, we assume the first line in the input CSV files as the header. Note that this option is valid only for 'csv' input file format. |
delimiter
|
A one-character string used to separate fields in a CSV file.
TYPE:
|
output_path
|
The file path to output data statistics result to. If None, the function uses a temporary directory. The output will be a TFRecord file containing a single data statistics list proto, and can be read with the 'load_statistics' function. If you run this function on Google Cloud, you must specify an output_path. Specifying None may cause an error. |
pipeline_options
|
Optional beam pipeline options. This allows users to specify various beam pipeline execution parameters like pipeline runner (DirectRunner or DataflowRunner), cloud dataflow service project id, etc. See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for more details.
TYPE:
|
num_sampled_examples
|
If set, returns up to this many examples of each anomaly type as a map from anomaly reason string to pd.DataFrame.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
If num_sampled_examples is zero, returns a single |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
DatasetFeatureStatisticsList proto in which each dataset consists of the |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
set of examples that exhibit a particular anomaly. If |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
num_sampled_examples is nonzero, returns the same statistics |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
proto as well as a mapping from anomaly to a pd.DataFrame of CSV rows |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, DataFrame]]]
|
exhibiting that anomaly. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
If the specified stats_options does not include a schema. |
Source code in tensorflow_data_validation/utils/validation_lib.py
validate_examples_in_tfrecord
¶
validate_examples_in_tfrecord(
data_location: Text,
stats_options: StatsOptions,
output_path: Optional[Text] = None,
pipeline_options: Optional[PipelineOptions] = None,
num_sampled_examples=0,
) -> Union[
DatasetFeatureStatisticsList,
Tuple[
DatasetFeatureStatisticsList,
Mapping[str, List[Example]],
],
]
Validates TFExamples in TFRecord files.
Runs a Beam pipeline to detect anomalies on a per-example basis. If this function detects anomalous examples, it generates summary statistics regarding the set of examples that exhibit each anomaly.
This is a convenience function for users with data in TFRecord format. Users with data in unsupported file/data formats, or users who wish to create their own Beam pipelines need to use the 'IdentifyAnomalousExamples' PTransform API directly instead.
PARAMETER | DESCRIPTION |
---|---|
data_location
|
The location of the input data files.
TYPE:
|
stats_options
|
TYPE:
|
output_path
|
The file path to output data statistics result to. If None, the function uses a temporary directory. The output will be a TFRecord file containing a single data statistics list proto, and can be read with the 'load_statistics' function. If you run this function on Google Cloud, you must specify an output_path. Specifying None may cause an error. |
pipeline_options
|
Optional beam pipeline options. This allows users to specify various beam pipeline execution parameters like pipeline runner (DirectRunner or DataflowRunner), cloud dataflow service project id, etc. See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for more details.
TYPE:
|
num_sampled_examples
|
If set, returns up to this many examples of each anomaly type as a map from anomaly reason string to a list of tf.Examples.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
If num_sampled_examples is zero, returns a single |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
DatasetFeatureStatisticsList proto in which each dataset consists of the |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
set of examples that exhibit a particular anomaly. If |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
num_sampled_examples is nonzero, returns the same statistics |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
proto as well as a mapping from anomaly to a list of tf.Examples that |
Union[DatasetFeatureStatisticsList, Tuple[DatasetFeatureStatisticsList, Mapping[str, List[Example]]]]
|
exhibited that anomaly. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
If the specified stats_options does not include a schema. |
Source code in tensorflow_data_validation/utils/validation_lib.py
validate_statistics
¶
validate_statistics(
statistics: DatasetFeatureStatisticsList,
schema: Schema,
environment: Optional[Text] = None,
previous_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
serving_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
custom_validation_config: Optional[
CustomValidationConfig
] = None,
) -> Anomalies
Validates the input statistics against the provided input schema.
This method validates the statistics against the schema. If an optional environment is specified, the schema is filtered using the environment and the statistics are validated against the filtered schema.
The optional previous_statistics and serving_statistics are the statistics computed over the control data for drift and skew detection, respectively. If drift or skew detection is conducted, the raw skew/drift measurements for each compared feature will be recorded in the drift_skew_info field of the returned Anomalies proto.
PARAMETER | DESCRIPTION |
---|---|
statistics
|
A DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the current data. Validation is currently supported only for lists with a single DatasetFeatureStatistics proto or lists with multiple DatasetFeatureStatistics protos corresponding to data slices that include the default slice (i.e., the slice with all examples). If a list with multiple DatasetFeatureStatistics protos is used, this function will validate the statistics corresponding to the default slice.
TYPE:
|
schema
|
A Schema protocol buffer. Note that TFDV does not currently support validation of the following messages/fields in the Schema protocol buffer: - FeaturePresenceWithinGroup - Schema-level FloatDomain and IntDomain (validation is supported for Feature-level FloatDomain and IntDomain)
TYPE:
|
environment
|
An optional string denoting the validation environment. Must be one of the default environments specified in the schema. By default, validation assumes that all Examples in a pipeline adhere to a single schema. In some cases introducing slight schema variations is necessary, for instance features used as labels are required during training (and should be validated), but are missing during serving. Environments can be used to express such requirements. For example, assume a feature named 'LABEL' is required for training, but is expected to be missing from serving. This can be expressed by defining two distinct environments in schema: ["SERVING", "TRAINING"] and associating 'LABEL' only with environment "TRAINING". |
previous_statistics
|
An optional DatasetFeatureStatisticsList protocol
buffer denoting the statistics computed over an earlier data (for
example, previous day's data). If provided, the
TYPE:
|
serving_statistics
|
An optional DatasetFeatureStatisticsList protocol
buffer denoting the statistics computed over the serving data. If
provided, the
TYPE:
|
custom_validation_config
|
An optional config that can be used to specify
custom validations to perform. If doing single-feature validations,
the test feature will come from
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Anomalies
|
An Anomalies protocol buffer. |
RAISES | DESCRIPTION |
---|---|
TypeError
|
If any of the input arguments is not of the expected type. |
ValueError
|
If the input statistics proto contains multiple datasets, none of which corresponds to the default slice. |
Source code in tensorflow_data_validation/api/validation_api.py
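A minimal sketch of validation with drift detection against the previous day's statistics; the paths are placeholders.

```python
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text('/tmp/schema.pbtxt')
today_stats = tfdv.load_statistics('/tmp/stats_today.tfrecord')
yesterday_stats = tfdv.load_statistics('/tmp/stats_yesterday.tfrecord')
anomalies = tfdv.validate_statistics(
    statistics=today_stats,
    schema=schema,
    previous_statistics=yesterday_stats)
tfdv.display_anomalies(anomalies)  # show any anomalies in a notebook
```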
visualize_statistics
¶
visualize_statistics(
lhs_statistics: DatasetFeatureStatisticsList,
rhs_statistics: Optional[
DatasetFeatureStatisticsList
] = None,
lhs_name: Text = "lhs_statistics",
rhs_name: Text = "rhs_statistics",
allowlist_features: Optional[List[FeaturePath]] = None,
denylist_features: Optional[List[FeaturePath]] = None,
) -> None
Visualize the input statistics using Facets.
PARAMETER | DESCRIPTION |
---|---|
lhs_statistics
|
A DatasetFeatureStatisticsList protocol buffer.
TYPE:
|
rhs_statistics
|
An optional DatasetFeatureStatisticsList protocol buffer to compare with lhs_statistics.
TYPE:
|
lhs_name
|
Name to use for the lhs_statistics dataset if a name is not already provided within the protocol buffer.
TYPE:
|
rhs_name
|
Name to use for the rhs_statistics dataset if a name is not already provided within the protocol buffer.
TYPE:
|
allowlist_features
|
Set of features to be visualized.
TYPE:
|
denylist_features
|
Set of features to ignore for visualization.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input argument is not of the expected type. |
ValueError
|
If the input statistics protos does not have only one dataset. |
Source code in tensorflow_data_validation/utils/display_util.py
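A brief sketch of comparing two sets of statistics side by side; the statistics files are placeholders.

```python
import tensorflow_data_validation as tfdv

train_stats = tfdv.load_statistics('/tmp/train_stats.tfrecord')
serving_stats = tfdv.load_statistics('/tmp/serving_stats.tfrecord')
tfdv.visualize_statistics(
    lhs_statistics=train_stats, rhs_statistics=serving_stats,
    lhs_name='TRAIN', rhs_name='SERVING')
```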
write_anomalies_text
¶
write_anomalies_text(
anomalies: Anomalies, output_path: Text
) -> None
Writes the Anomalies proto to a file in text format.
PARAMETER | DESCRIPTION |
---|---|
anomalies
|
An Anomalies protocol buffer.
TYPE:
|
output_path
|
File path to which to write the Anomalies proto.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input Anomalies proto is not of the expected type. |
Source code in tensorflow_data_validation/utils/anomalies_util.py
write_schema_text
¶
write_schema_text(
schema: Schema, output_path: Text
) -> None
Writes input schema to a file in text format.
PARAMETER | DESCRIPTION |
---|---|
schema
|
A Schema protocol buffer.
TYPE:
|
output_path
|
File path to write the input schema.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input schema is not of the expected type. |
Source code in tensorflow_data_validation/utils/schema_util.py
write_stats_text
¶
write_stats_text(
stats: DatasetFeatureStatisticsList, output_path: Text
) -> None
Writes a DatasetFeatureStatisticsList proto to a file in text format.
PARAMETER | DESCRIPTION |
---|---|
stats
|
A DatasetFeatureStatisticsList proto.
TYPE:
|
output_path
|
File path to write the DatasetFeatureStatisticsList proto.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
TypeError
|
If the input proto is not of the expected type. |