
data-validation's Introduction


Documentation

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow was originally developed by researchers and engineers working within the Machine Intelligence team at Google Brain to conduct research in machine learning and neural networks. However, the framework is versatile enough to be used in other areas as well.

TensorFlow provides stable Python and C++ APIs, as well as a non-guaranteed backward-compatible API for other languages.

Keep up to date with release announcements and security updates by subscribing to announce@tensorflow.org. See all the mailing lists.

Install

See the TensorFlow install guide for the pip package, enabling GPU support, using a Docker container, and building from source.

To install the current release, which includes support for CUDA-enabled GPU cards (Ubuntu and Windows):

$ pip install tensorflow

Other devices (DirectX and macOS Metal) are supported via device plugins.

A smaller CPU-only package is also available:

$ pip install tensorflow-cpu

To update TensorFlow to the latest version, add the --upgrade flag to the commands above.

Nightly binaries are available for testing using the tf-nightly and tf-nightly-cpu packages on PyPI.

Try your first TensorFlow program

$ python
>>> import tensorflow as tf
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'

For more examples, see the TensorFlow tutorials.

Contribution guidelines

If you want to contribute to TensorFlow, be sure to review the contribution guidelines. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

We use GitHub issues for tracking requests and bugs; please see the TensorFlow Forum for general questions and discussion, and direct specific questions to Stack Overflow.

The TensorFlow project strives to abide by generally accepted best practices in open-source software development.

Patching guidelines

Follow these steps to patch a specific version of TensorFlow, for example, to apply fixes to bugs or security vulnerabilities:

  • Clone the TensorFlow repo and switch to the corresponding branch for your desired TensorFlow version, for example, branch r2.8 for version 2.8.
  • Apply (that is, cherry-pick) the desired changes and resolve any code conflicts.
  • Run TensorFlow tests and ensure they pass.
  • Build the TensorFlow pip package from source.

Continuous build status

You can find more community-supported platforms and configurations in the TensorFlow SIG Build community builds table.

Official Builds

Build Type                | Status                         | Artifacts
Linux CPU                 | Status                         | PyPI
Linux GPU                 | Status                         | PyPI
Linux XLA                 | Status                         | TBA
macOS                     | Status                         | PyPI
Windows CPU               | Status                         | PyPI
Windows GPU               | Status                         | PyPI
Android                   | Status                         | Download
Raspberry Pi 0 and 1      | Status                         | Py3
Raspberry Pi 2 and 3      | Status                         | Py3
Libtensorflow MacOS CPU   | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Linux CPU   | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Linux GPU   | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Windows CPU | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Windows GPU | Status Temporarily Unavailable | Nightly Binary, Official GCS

Resources

Learn more about the TensorFlow community and how to contribute.

License

Apache License 2.0

data-validation's People

Contributors

aaltay, abdel91, ajchili, andreapi, arghyaganguly, barwimbe, brianmartin, brills, caveness, cyc, daikeshi, etiennedaspe, naveedgol, npoly, paulgc, phpisciuneri, saikumarchalla, santosh-d3vpl3x, tangm, terrytangyuan, tfx-copybara, zwestrick


data-validation's Issues

Error in generate statistics due to Python Snappy

Hi,

I got a fresh install of Ubuntu 16.04 on AWS and installed the following, in this order:

  • python 2.7
  • python-dev
  • python-snappy
  • tensorflow 1.11
  • apache-beam
  • jupyter
  • tensorflow-data-validation

I manage to import tfdv without issue; however, generating statistics fails with the following trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-2fbc1f02b217> in <module>()
----> 1 stats = tfdv.generate_statistics_from_csv(DATA, delimiter=';')

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_csv(data_location, column_names, delimiter, output_path, stats_options, pipeline_options)
    153             shard_name_template='',
    154             coder=beam.coders.ProtoCoder(
--> 155                 statistics_pb2.DatasetFeatureStatisticsList)))
    156   return load_statistics(output_path)
    157 

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
    412   def __exit__(self, exc_type, exc_val, exc_tb):
    413     if not exc_type:
--> 414       self.run().wait_until_finish()
    415 
    416   def visit(self, visitor):

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    392     if test_runner_api and self._verify_runner_api_compatible():
    393       return Pipeline.from_runner_api(
--> 394           self.to_runner_api(), self.runner, self._options).run(False)
    395 
    396     if self._options.view_as(TypeOptions).runtime_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    405       finally:
    406         shutil.rmtree(tmpdir)
--> 407     return self.runner.run_pipeline(self)
    408 
    409   def __enter__(self):

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/direct/direct_runner.pyc in run_pipeline(self, pipeline)
    133       runner = BundleBasedDirectRunner()
    134 
--> 135     return runner.run_pipeline(pipeline)
    136 
    137 

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in run_pipeline(self, pipeline)
    221     from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner
    222     pipeline.visit(DataflowRunner.group_by_key_input_visitor())
--> 223     return self.run_via_runner_api(pipeline.to_runner_api())
    224 
    225   def run_via_runner_api(self, pipeline_proto):

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in run_via_runner_api(self, pipeline_proto)
    224 
    225   def run_via_runner_api(self, pipeline_proto):
--> 226     return self.run_stages(*self.create_stages(pipeline_proto))
    227 
    228   def create_stages(self, pipeline_proto):

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in run_stages(self, pipeline_components, stages, safe_coders)
    862         metrics_by_stage[stage.name] = self.run_stage(
    863             controller, pipeline_components, stage,
--> 864             pcoll_buffers, safe_coders).process_bundle.metrics
    865     finally:
    866       controller.close()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in run_stage(self, controller, pipeline_components, stage, pcoll_buffers, safe_coders)
    975     return BundleManager(
    976         controller, get_buffer, process_bundle_descriptor,
--> 977         self._progress_frequency).process_bundle(data_input, data_output)
    978 
    979   # These classes are used to interact with the worker.

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in process_bundle(self, inputs, expected_outputs)
   1179         process_bundle=beam_fn_api_pb2.ProcessBundleRequest(
   1180             process_bundle_descriptor_reference=self._bundle_descriptor.id))
-> 1181     result_future = self._controller.control_handler.push(process_bundle)
   1182 
   1183     with ProgressRequester(

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.pyc in push(self, request)
   1059         request.instruction_id = 'control_%s' % self._uid_counter
   1060       logging.debug('CONTROL REQUEST %s', request)
-> 1061       response = self.worker.do_instruction(request)
   1062       logging.debug('CONTROL RESPONSE %s', response)
   1063       return ControlFuture(request.instruction_id, response)

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.pyc in do_instruction(self, request)
    213       # E.g. if register is set, this will call self.register(request.register))
    214       return getattr(self, request_type)(getattr(request, request_type),
--> 215                                          request.instruction_id)
    216     else:
    217       raise NotImplementedError

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.pyc in process_bundle(self, request, instruction_id)
    235     try:
    236       with state_handler.process_instruction_id(instruction_id):
--> 237         processor.process_bundle(instruction_id)
    238     finally:
    239       del self.bundle_processors[instruction_id]

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.pyc in process_bundle(self, instruction_id)
    297             instruction_id, [input_op.target]):
    298           # ignores input name
--> 299           input_op.process_encoded(data.data)
    300 
    301       # Finish all operations.

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.pyc in process_encoded(self, encoded_windowed_values)
    118       decoded_value = self.windowed_coder_impl.decode_from_stream(
    119           input_stream, True)
--> 120       self.output(decoded_value)
    121 
    122 

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/operations.so in apache_beam.runners.worker.operations.Operation.output()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/operations.so in apache_beam.runners.worker.operations.Operation.output()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/operations.so in apache_beam.runners.worker.operations.ConsumerSet.receive()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/operations.so in apache_beam.runners.worker.operations.DoOperation.process()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/operations.so in apache_beam.runners.worker.operations.DoOperation.process()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.DoFnRunner.receive()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.DoFnRunner.process()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.DoFnRunner.process()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.PerWindowInvoker.invoke_process()

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.so in apache_beam.runners.common.PerWindowInvoker._invoke_per_window()

/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.pyc in process(self, element, init_result)
   1051     writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
   1052     for e in bundle[1]:  # values
-> 1053       writer.write(e)
   1054     return [window.TimestampedValue(writer.close(), window.MAX_TIMESTAMP)]
   1055 

/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.pyc in write(self, value)
    386 
    387   def write(self, value):
--> 388     self.sink.write_record(self.temp_handle, value)
    389 
    390   def close(self):

/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.pyc in write_record(self, file_handle, value)
    135     this sink's Coder.
    136     """
--> 137     self.write_encoded_record(file_handle, self.coder.encode(value))
    138 
    139   def write_encoded_record(self, file_handle, encoded_value):

/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.pyc in write_encoded_record(self, file_handle, value)
    278 
    279   def write_encoded_record(self, file_handle, value):
--> 280     _TFRecordUtil.write_record(file_handle, value)
    281 
    282 

/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.pyc in write_record(cls, file_handle, value)
     99     file_handle.write('{}{}{}{}'.format(
    100         encoded_length,
--> 101         struct.pack('<I', cls._masked_crc32c(encoded_length)),  #
    102         value,
    103         struct.pack('<I', cls._masked_crc32c(value))))

/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.pyc in _masked_crc32c(cls, value, crc32c_fn)
     79     """
     80 
---> 81     crc = crc32c_fn(value)
     82     return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff
     83 

/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.pyc in _default_crc32c_fn(value)
     45     try:
     46       import snappy  # pylint: disable=import-error
---> 47       _default_crc32c_fn.fn = snappy._snappy._crc32c  # pylint: disable=protected-access
     48     except ImportError:
     49       logging.warning('Couldn\'t find python-snappy so the implementation of '

AttributeError: 'module' object has no attribute '_snappy' [while running 'WriteStatsOutput/Write/WriteImpl/WriteBundles']

However, if I run import snappy in a new notebook, it loads properly.

EDIT: if I remove the python-snappy package, the function works, but with the following warning:

WARNING:root:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

Any way to speed up tfdv.TFExampleDecoder?

I think most users of this library will likely be feeding in data from serialized tf.train.Examples stored in TFRecords. However, the performance of tfdv.TFExampleDecoder may become a bottleneck for some users. If you profile that code, you find that it spends a minimal amount of time deserializing the data and the vast majority of its time building the output dictionary and instantiating numpy arrays. The problem is exacerbated for workflows with hundreds or thousands of scalar features, where the overhead of creating small numpy arrays dominates.

Do you have any tips for what can be done to optimize this part of the TFDV workflow? I've already tried swapping in tft.coders.ExampleProtoCoder in cases where the schema is known in advance, but it does not seem to provide much of a performance gain (about 10%).
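
For reference, a minimal profiling sketch along these lines (the feature layout and iteration count are made up, and it assumes tfdv.TFExampleDecoder exposes a decode method taking a serialized proto):

import cProfile

import tensorflow as tf
import tensorflow_data_validation as tfdv

# Build one tiny serialized tf.train.Example to decode repeatedly.
example = tf.train.Example()
example.features.feature['x'].float_list.value.append(1.0)
serialized = example.SerializeToString()

decoder = tfdv.TFExampleDecoder()
# The cumulative-time view shows how much time goes into building the output
# dict and numpy arrays versus deserializing the proto.
cProfile.run('for _ in range(100000): decoder.decode(serialized)', sort='cumtime')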

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

Taxi example gives this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-8-cd13ad158f35> in <module>()
      4              'company', 'trip_seconds', 'dropoff_community_area', 'tips']
      5 
----> 6 compute_stats('data/data.csv', 'stats/stats.tfrecord', col_names)

<ipython-input-7-9b31bb049669> in compute_stats(input_path, stats_path, column_names, pipeline_args)
     14 
     15         _ = (raw_data | 'GenerateStatistics' >> tfdv.GenerateStatistics()
---> 16                       | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(stats_path, shard_name_template='',
     17                                                                       coder=beam.coders.ProtoCoder(
     18                                                                           statistics_pb2.DatasetFeatureStatisticsList)))

/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.pyc in __or__(self, ptransform)
    109 
    110   def __or__(self, ptransform):
--> 111     return self.pipeline.apply(ptransform, self)
    112 
    113 

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    465     if isinstance(transform, ptransform._NamedPTransform):
    466       return self.apply(transform.transform, pvalueish,
--> 467                         label or transform.label)
    468 
    469     if not isinstance(transform, ptransform.PTransform):

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    475       try:
    476         old_label, transform.label = transform.label, label
--> 477         return self.apply(transform, pvalueish)
    478       finally:
    479         transform.label = old_label

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    511       transform.type_check_inputs(pvalueish)
    512 
--> 513     pvalueish_result = self.runner.apply(transform, pvalueish)
    514 
    515     if type_options is not None and type_options.pipeline_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply(self, transform, input)
    191       m = getattr(self, 'apply_%s' % cls.__name__, None)
    192       if m:
--> 193         return m(transform, input)
    194     raise NotImplementedError(
    195         'Execution of [%s] not implemented in runner %s.' % (transform, self))

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply_PTransform(self, transform, input)
    197   def apply_PTransform(self, transform, input):
    198     # The base case of apply is to call the transform's expand.
--> 199     return transform.expand(input)
    200 
    201   def run_transform(self, transform_node):

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/api/stats_api.pyc in expand(self, dataset)
    197             num_values_histogram_buckets=\
    198                 self._options.num_values_histogram_buckets,
--> 199             epsilon=self._options.epsilon),
    200 
    201         # Create numeric stats generator.

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/statistics/generators/common_stats_generator.pyc in __init__(self, name, schema, num_values_histogram_buckets, epsilon)
    229     # Initialize quantiles combiner.
    230     self._quantiles_combiner = quantiles_util.QuantilesCombiner(
--> 231         num_values_histogram_buckets, epsilon)
    232 
    233   # Create an accumulator, which maps feature name to the partial stats

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/quantiles_util.pyc in __init__(self, num_quantiles, epsilon)
     38     self._num_quantiles = num_quantiles
     39     self._epsilon = epsilon
---> 40     self._quantiles_spec = analyzers._QuantilesCombinerSpec(
     41         num_quantiles=num_quantiles, epsilon=epsilon,
     42         bucket_numpy_dtype=np.float32, always_return_num_quantiles=True)

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

Missing API for manually setting domain

Hi, thank you for a terrific library!

I ran into an edge case where infer_schema doesn't detect any domain. This leads to a missing domain_info attribute. Therefore, I would like to be able to set it manually.

Right now this means fiddling with the protocol buffer directly, which doesn't have the most user-friendly API.

Would it be possible to have a Python function, for example set_domain, to avoid the protocol-buffer API?
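
In the meantime, a workaround sketch using the raw protocol-buffer API (the feature name and bounds are made up; train_stats is assumed to be an existing DatasetFeatureStatisticsList):

import tensorflow_data_validation as tfdv

schema = tfdv.infer_schema(statistics=train_stats)
# Set the domain by hand on the schema proto; assigning to int_domain
# populates the domain_info oneof.
for feature in schema.feature:
    if feature.name == 'age':  # hypothetical feature
        feature.int_domain.min = 0
        feature.int_domain.max = 120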

get_statistics_html should support multiple datasets

Hi there! I was wondering whether get_statistics_html would support a statistics_pb2.DatasetFeatureStatisticsList() that has more than one dataset (currently it throws an exception if this is the case).

I am raising the issue because my use-case requires me to slice my data into multiple categories (using slice_functions in StatsOptions for GenerateStatistics()). After slicing per category, I would like to see a HTML per category to find how my data behaves per category.

Now, I think this is relatively straightforward - I even implemented a workaround below that just loops through the datasets and creates a new HTML file for each:

import os
from tensorflow_metadata.proto.v0 import statistics_pb2
from tensorflow_data_validation.utils.stats_gen_lib import load_statistics
from tensorflow_data_validation.utils.display_util import get_statistics_html  # assumed module path

stats = load_statistics('...')  # these are statistics with multiple datasets
for stat in stats.datasets:
    stat_listed = statistics_pb2.DatasetFeatureStatisticsList()
    stat_listed.datasets.append(stat)
    with open(os.path.join(out_path, stat.name + '.html'), 'w') as f:
        f.write(get_statistics_html(stat_listed))

Is this too much of an edge case for the library, or is it worth a PR? Thanks in advance!

tfdv.validate_statistics drops existing column and adds it as a new column

Awesome TFX tool! Thank you for the great work.

I stepped through the tfx data-validation example with my own data and noticed that a column got dropped and added as a new column when I ran anomalies = tfdv.validate_statistics(eval_stats, schema).

In my example data, my training and eval data have the same columns; I just added additional keys to different columns in the eval data.

When I then run

eval_stats = tfdv.generate_statistics_from_csv(EVAL_DATA)
anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(anomalies)

I receive the table output below like

Feature name | Anomaly short description | Anomaly long description
f_key        | New column                | New column (column in data but not in schema): f_key
...
f_key        | Column dropped            | Column is completely missing

f_key is in the schema.

feature {
  name: "f_key"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}

What causes the anomaly info (drop/new) for an existing column?

Statistics visualization doesn't work in Firefox

tfdv.visualize_statistics(train_stats) displays nothing except a line.
If I print train_stats, it does contain the stats I wanted, so I think the steps before tfdv.visualize_statistics(train_stats) all ran normally.
My browser is Firefox. Where is the problem?

Not able to run "bazel run -c opt tensorflow_data_validation:build_pip_package" on Mac OSX

Hi,

I am on macOS Sierra with Python 2.7. I am trying to follow the install instructions here:

https://www.tensorflow.org/tfx/data_validation/install

I received errors, though, when running bazel run. Here's the output:

(venv) aaron@ ~/Documents/github/data-validation (master) $ bazel run -c opt tensorflow_data_validation:build_pip_package
ERROR: /private/var/tmp/_bazel_aaron/85905f6f7354f8ed2c71cabc5a2e55c8/external/org_tensorflow/third_party/gpus/cuda_configure.bzl:32:1: file '@bazel_tools//tools/cpp:windows_cc_configure.bzl' does not contain symbol 'setup_vc_env_vars'
ERROR: error loading package '': Extension file 'third_party/gpus/cuda_configure.bzl' has errors
ERROR: error loading package '': Extension file 'third_party/gpus/cuda_configure.bzl' has errors
INFO: Elapsed time: 0.243s
FAILED: Build did NOT complete successfully (0 packages loaded)
ERROR: Build failed. Not running target

Here's my bazel information

(venv) aaron@ ~/Documents/github/data-validation (master) $ bazel version
Build label: 0.11.0-homebrew
Build target: bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Sat Apr 30 17:07:35 +50118 (1519414477655)
Build timestamp: 1519414477655
Build timestamp as int: 1519414477655

Any idea what could be the issue?

I had just done a fresh pull from master of this repo.

thanks

Documentation about anomalies and what constraints trigger them

This issue is maybe related to #62 where the answer is "We are working on a better documentation".

Generally, I would like to see a list of

  • possible anomalies
  • what fields are compared to trigger them (between statistics and schema or statistics and statistics)
  • What field in schema is responsible for the constraint (might be the same as above)

Specifically, I want to protect myself against an integer feature suddenly having 100% zero values. I am wondering if there is a field in the schema that I can set to validate this?

tfdv.generate_statistics_from_csv not working anymore with current branch

I installed the latest version of tensorflow-data-validation with

pip install tensorflow-data-validation

I'm no longer able to generate statistics from .csv files. On a file of mine which previously worked, I now get:

startt = time.time()
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)
trainstats_time = time.time() - startt
print(trainstats_time)
AttributeErrorTraceback (most recent call last)
<ipython-input-3-5f22d5888316> in <module>()
      1 startt = time.time()
----> 2 train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)
      3 trainstats_time = time.time() - startt
      4 print(trainstats_time)

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_csv(data_location, column_names, delimiter, output_path, stats_options, pipeline_options)
    149                                                 delimiter=delimiter)
    150         | 'GenerateStatistics' >> stats_api.GenerateStatistics(stats_options)
--> 151         | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(
    152             output_path,
    153             shard_name_template='',

/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.pyc in __or__(self, ptransform)
    109 
    110   def __or__(self, ptransform):
--> 111     return self.pipeline.apply(ptransform, self)
    112 
    113 

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    465     if isinstance(transform, ptransform._NamedPTransform):
    466       return self.apply(transform.transform, pvalueish,
--> 467                         label or transform.label)
    468 
    469     if not isinstance(transform, ptransform.PTransform):

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    475       try:
    476         old_label, transform.label = transform.label, label
--> 477         return self.apply(transform, pvalueish)
    478       finally:
    479         transform.label = old_label

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    511       transform.type_check_inputs(pvalueish)
    512 
--> 513     pvalueish_result = self.runner.apply(transform, pvalueish)
    514 
    515     if type_options is not None and type_options.pipeline_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply(self, transform, input)
    191       m = getattr(self, 'apply_%s' % cls.__name__, None)
    192       if m:
--> 193         return m(transform, input)
    194     raise NotImplementedError(
    195         'Execution of [%s] not implemented in runner %s.' % (transform, self))

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply_PTransform(self, transform, input)
    197   def apply_PTransform(self, transform, input):
    198     # The base case of apply is to call the transform's expand.
--> 199     return transform.expand(input)
    200 
    201   def run_transform(self, transform_node):

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/api/stats_api.pyc in expand(self, dataset)
    197             num_values_histogram_buckets=\
    198                 self._options.num_values_histogram_buckets,
--> 199             epsilon=self._options.epsilon),
    200 
    201         # Create numeric stats generator.

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/statistics/generators/common_stats_generator.pyc in __init__(self, name, schema, num_values_histogram_buckets, epsilon)
    229     # Initialize quantiles combiner.
    230     self._quantiles_combiner = quantiles_util.QuantilesCombiner(
--> 231         num_values_histogram_buckets, epsilon)
    232 
    233   # Create an accumulator, which maps feature name to the partial stats

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/quantiles_util.pyc in __init__(self, num_quantiles, epsilon)
     38     self._num_quantiles = num_quantiles
     39     self._epsilon = epsilon
---> 40     self._quantiles_spec = analyzers._QuantilesCombinerSpec(
     41         num_quantiles=num_quantiles, epsilon=epsilon,
     42         bucket_numpy_dtype=np.float32, always_return_num_quantiles=True)

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

TFDV sometimes erroneously sets min_fraction to 1.0

I have not been able to reproduce this on a small test case locally, and the case where I have experienced this error involves a proprietary dataset that I can't share. But the issue is essentially this: I have two tfrecord shards of some dataset, one where a feature is always present and another where the feature is always missing (as in, the feature name is not even populated in the tf.Example). If I run GenerateStatistics on these shards on Google Cloud Dataflow, the resulting stats file claims that there are 0 missing entries for that feature.

For instance, for this feature in question, the stats file in json format shows:

        {
          "numStats": {
            "commonStats": {
              "totNumValues": "181129", 
              "numNonMissing": "181129", 
              "maxNumValues": "1", 
              "numValuesHistogram": {},
              "avgNumValues": 1.0, 
              "minNumValues": "1"
            }, 
            "numZeros": "181071", 
            "histograms": [],
            "stdDev": 0.017891652618962306, 
            "max": 1.0, 
            "mean": 0.00032021377029630815
          }, 
          "name": "partially_missing_feature"
        }, 

Whereas a fully present feature looks like this:

        {
          "numStats": {
            "commonStats": {
              "totNumValues": "382166", 
              "numNonMissing": "382166", 
              "maxNumValues": "1", 
              "numValuesHistogram": {},
              "avgNumValues": 1.0, 
              "minNumValues": "1"
            }, 
            "numZeros": "101146", 
            "histograms": [],
            "median": 1.0, 
            "stdDev": 0.4301544503153421, 
            "max": 1.0, 
            "mean": 0.6821856555482622
          }, 
          "type": "FLOAT", 
          "name": "fully_present_feature"
        }, 

Please let me know if this is expected behavior.

Add correlations to Facets charts/tables

TensorFlow Data Validation is a great tool to look at the data. One feature that might make it even better would be computing correlations among the variables: if two variables are highly correlated, you could avoid multicollinearity by dropping one of them. Having that available in the Facets visualization would make it easier to spot issues with the data.

GenerateStatistics API Change

Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, Parquet data), the GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will only accept Arrow tables whose columns are ListArrays of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode).
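
For illustration, a rough sketch of an input in the new format (column names and values are made up):

import pyarrow as pa

# Each column is a ListArray of a primitive type; None marks a missing value.
table = pa.Table.from_arrays(
    [
        pa.array([[1], [2, 3], None], type=pa.list_(pa.int64())),
        pa.array([[b'a'], [b'b'], None], type=pa.list_(pa.binary())),
    ],
    ['int_feature', 'bytes_feature'])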

This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv and tfdv.generate_statistics_from_dataframe).

TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.

overflow encountered in long_scalars

I am getting the following warning when running tfdv.generate_statistics_from_csv() on my dataset:

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/statistics/generators/numeric_stats_generator.py:108: RuntimeWarning: overflow encountered in long_scalars numeric_stats.sum_of_squares += v * v
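
For context, this looks like ordinary numpy fixed-width integer overflow when squaring large values; a minimal reproduction sketch (values made up):

import numpy as np

v = np.int64(4000000000)
print(v * v)  # RuntimeWarning: overflow encountered in long_scalars; wraps around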

Can't find the "Schema change generator" code

In the TFX Paper, under the "Data Validation" section, it is said:

In some cases the anomalies correspond to a natural evolution of the data, and the appropriate action is to change the schema (rather than fix the data). To accommodate this option, our component generates for each anomaly a corresponding schema change that can bring the schema up-to-date

Besides the textual Reason description, is this schema change generation code open sourced?
I spent some time looking in the doc and exploring the Python API but couldn't find anything.

schema not matching

(screenshot of the schema omitted)

I have the above features. The schema generated by TensorFlow Data Validation does not match them. For example:

clicktime: time datatype
device id: string
ip: string

The features are shown as INT and BYTES instead of datetime and string. How do I redefine the schema to change them to datetime and string datatypes?

[DataflowRuntimeException] ImportError: No module named tfdv.statistics.stats_impl

Context

When running tfdv.generate_statistics_from_tfrecord on Dataflow, the job gets submitted successfully to the cluster, but I get an ImportError: No module named tensorflow_data_validation.statistics.stats_impl during the job unpickling phase in the Dataflow worker.

Error trace

---------------------------------------------------------------------------
DataflowRuntimeException                  Traceback (most recent call last)
<ipython-input-23-8f1147effd88> in <module>()
     16 # for more options about stats, run `?tfdv.generate_statistics_from_tfrecord`
     17 tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH, 
---> 18                                        pipeline_options=pipeline_options)

/Users/romain/dev/venv/lib/python2.7/site-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_tfrecord(data_location, output_path, stats_options, pipeline_options)
     86             shard_name_template='',
     87             coder=beam.coders.ProtoCoder(
---> 88                 statistics_pb2.DatasetFeatureStatisticsList)))
     89   return load_statistics(output_path)
     90 

/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
    421   def __exit__(self, exc_type, exc_val, exc_tb):
    422     if not exc_type:
--> 423       self.run().wait_until_finish()
    424 
    425   def visit(self, visitor):

/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in wait_until_finish(self, duration)
   1164         raise DataflowRuntimeException(
   1165             'Dataflow pipeline failed. State: %s, Error:\n%s' %
-> 1166             (self.state, getattr(self._runner, 'last_error_msg', None)), self)
   1167     return self.state
   1168 

DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 130, in execute
    test_shuffle_sink=self._test_shuffle_sink)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 104, in create_operation
    is_streaming=False)
  File "apache_beam/runners/worker/operations.py", line 636, in apache_beam.runners.worker.operations.create_operation
    op = create_pgbk_op(name_context, spec, counter_factory, state_sampler)
  File "apache_beam/runners/worker/operations.py", line 482, in apache_beam.runners.worker.operations.create_pgbk_op
    return PGBKCVOperation(step_name, spec, counter_factory, state_sampler)
  File "apache_beam/runners/worker/operations.py", line 538, in apache_beam.runners.worker.operations.PGBKCVOperation.__init__
    fn, args, kwargs = pickler.loads(self.spec.combine_fn)[:3]
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 246, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
    return load(file, ignore)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named tensorflow_data_validation.statistics.stats_impl

What code did I run?

!pip install -U tensorflow \
                tensorflow-data-validation \
                apache-beam[gcp]
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions

# Create and set your PipelineOptions.
pipeline_options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
    
tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH, 
                                       pipeline_options=pipeline_options)
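
A possible fix sketch, assuming the workers simply don't have the package installed: ship it explicitly via Beam's SetupOptions (the sdist path below is hypothetical):

# Continuing from the options above; the package path is hypothetical.
setup_options = pipeline_options.view_as(SetupOptions)
setup_options.extra_packages = ['/path/to/tensorflow_data_validation-0.11.0.tar.gz']
# Alternatively, point at a setup.py that lists tensorflow-data-validation:
# setup_options.setup_file = './setup.py'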

Pip trace

Requirement already up-to-date: tensorflow in /Users/romain/dev/venv/lib/python2.7/site-packages (1.12.0)
Requirement already up-to-date: tensorflow-data-validation in /Users/romain/dev/venv/lib/python2.7/site-packages (0.11.0)
Requirement already up-to-date: apache-beam[gcp] in /Users/romain/dev/venv/lib/python2.7/site-packages (2.8.0)
Requirement already satisfied, skipping upgrade: enum34>=1.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.6)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.5)
Requirement already satisfied, skipping upgrade: wheel in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.31.1)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.7.1)
Requirement already satisfied, skipping upgrade: backports.weakref>=1.0rc1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.post1)
Requirement already satisfied, skipping upgrade: mock>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (2.0.0)
Requirement already satisfied, skipping upgrade: tensorboard<1.13.0,>=1.12.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.12.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (3.6.1)
Requirement already satisfied, skipping upgrade: gast>=0.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.2.0)
Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.3.0)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.6)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.14.0)
Requirement already satisfied, skipping upgrade: IPython<6,>=5.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (5.7.0)
Requirement already satisfied, skipping upgrade: tensorflow-metadata<0.10,>=0.9 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.9.0)
Requirement already satisfied, skipping upgrade: tensorflow-transform<0.12,>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.11.0)
Requirement already satisfied, skipping upgrade: pandas<1,>=0.18 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.22.0)
Requirement already satisfied, skipping upgrade: oauth2client<5,>=2.0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (4.1.3)
Requirement already satisfied, skipping upgrade: dill<=0.2.8.2,>=0.2.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.2.8.2)
Requirement already satisfied, skipping upgrade: pydot<1.3,>=1.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.2.4)
Requirement already satisfied, skipping upgrade: pyyaml<4.0.0,>=3.12 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.12)
Requirement already satisfied, skipping upgrade: pyvcf<0.7.0,>=0.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.6.8)
Requirement already satisfied, skipping upgrade: typing<3.7.0,>=3.6.0; python_version < "3.5.0" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.6.4)
Requirement already satisfied, skipping upgrade: avro<2.0.0,>=1.8.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.8.2)
Requirement already satisfied, skipping upgrade: future<1.0.0,>=0.16.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.16.0)
Requirement already satisfied, skipping upgrade: fastavro<0.22,>=0.21.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.21.13)
Requirement already satisfied, skipping upgrade: crcmod<2.0,>=1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.7)
Requirement already satisfied, skipping upgrade: httplib2<=0.11.3,>=0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.11.3)
Requirement already satisfied, skipping upgrade: futures<4.0.0,>=3.1.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.2.0)
Requirement already satisfied, skipping upgrade: hdfs<3.0.0,>=2.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2.1.0)
Requirement already satisfied, skipping upgrade: pytz<=2018.4,>=2018.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2018.4)
Requirement already satisfied, skipping upgrade: google-apitools<=0.5.20,>=0.5.18; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.5.20)
Requirement already satisfied, skipping upgrade: proto-google-cloud-pubsub-v1==0.15.4; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: googledatastore==7.0.1; python_version < "3.0" and extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (7.0.1)
Requirement already satisfied, skipping upgrade: google-cloud-bigquery==0.25.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: google-cloud-pubsub==0.26.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.26.0)
Requirement already satisfied, skipping upgrade: proto-google-cloud-datastore-v1<=0.90.4,>=0.90.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.90.4)
Requirement already satisfied, skipping upgrade: funcsigs>=1; python_version < "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.0.2)
Requirement already satisfied, skipping upgrade: pbr>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.10 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (0.14.1)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (2.6.11)
Requirement already satisfied, skipping upgrade: setuptools in /Users/romain/dev/venv/lib/python2.7/site-packages (from protobuf>=3.6.1->tensorflow) (39.1.0)
Requirement already satisfied, skipping upgrade: h5py in /Users/romain/dev/venv/lib/python2.7/site-packages (from keras-applications>=1.0.6->tensorflow) (2.8.0)
Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.8.1)
Requirement already satisfied, skipping upgrade: pygments in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.2.0)
Requirement already satisfied, skipping upgrade: backports.shutil-get-terminal-size; python_version == "2.7" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.0)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.6.0)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.15)
Requirement already satisfied, skipping upgrade: decorator in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.0)
Requirement already satisfied, skipping upgrade: pickleshare in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.7.4)
Requirement already satisfied, skipping upgrade: appnope; sys_platform == "darwin" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.1.0)
Requirement already satisfied, skipping upgrade: traitlets>=4.2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.2)
Requirement already satisfied, skipping upgrade: pathlib2; python_version == "2.7" or python_version == "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.3.2)
Requirement already satisfied, skipping upgrade: googleapis-common-protos in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-metadata<0.10,>=0.9->tensorflow-data-validation) (1.5.3)
Requirement already satisfied, skipping upgrade: python-dateutil in /Users/romain/dev/venv/lib/python2.7/site-packages (from pandas<1,>=0.18->tensorflow-data-validation) (2.7.3)
Requirement already satisfied, skipping upgrade: rsa>=3.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (3.4.2)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.1.9)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.0.8)
Requirement already satisfied, skipping upgrade: pyparsing>=2.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pydot<1.3,>=1.2.0->apache-beam[gcp]) (2.1.10)
Requirement already satisfied, skipping upgrade: docopt in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (0.6.2)
Requirement already satisfied, skipping upgrade: requests>=2.7.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (2.11.1)
Requirement already satisfied, skipping upgrade: fasteners>=0.14 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (0.14.1)
Requirement already satisfied, skipping upgrade: google-cloud-core<0.26dev,>=0.25.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pexpect; sys_platform != "win32"->IPython<6,>=5.0->tensorflow-data-validation) (0.5.2)
Requirement already satisfied, skipping upgrade: wcwidth in /Users/romain/dev/venv/lib/python2.7/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython<6,>=5.0->tensorflow-data-validation) (0.1.7)
Requirement already satisfied, skipping upgrade: ipython-genutils in /Users/romain/dev/venv/lib/python2.7/site-packages (from traitlets>=4.2->IPython<6,>=5.0->tensorflow-data-validation) (0.2.0)
Requirement already satisfied, skipping upgrade: scandir; python_version < "3.5" in /Users/romain/dev/venv/lib/python2.7/site-packages (from pathlib2; python_version == "2.7" or python_version == "3.3"->IPython<6,>=5.0->tensorflow-data-validation) (1.7)
Requirement already satisfied, skipping upgrade: monotonic>=0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from fasteners>=0.14->google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (1.5)
Requirement already satisfied, skipping upgrade: google-auth-httplib2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.0.3)
Requirement already satisfied, skipping upgrade: google-auth<2.0.0dev,>=0.4.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (1.1.1)
Requirement already satisfied, skipping upgrade: grpc-google-iam-v1<0.12dev,>=0.11.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.11.4)
Requirement already satisfied, skipping upgrade: google-gax<0.16dev,>=0.15.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.16)
Requirement already satisfied, skipping upgrade: cachetools>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-auth<2.0.0dev,>=0.4.0->google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (2.0.1)
Requirement already satisfied, skipping upgrade: ply==3.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.7->gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (3.8)

Support compressed files in CSV/TFR reader

Since all of the TF file readers support gzip/zlib compression and beam.io also supports compression natively, could you support compressed CSV/TFRecord files in the generate_statistics_from_* functions?

Open multiple .tfrecord

Hello,

Great work on tfdv.
Is there a way to open multiple .tfrecord files, the way a tf.data.TFRecordDataset(filenames) would behave (filenames being a list of filename strings)?
Large datasets are often stored across numerous .tfrecord files.
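
A possible sketch, assuming the underlying Beam TFRecord reader accepts file patterns for data_location (the pattern below is made up):

import tensorflow_data_validation as tfdv

# A glob pattern covers many shards at once, similar in spirit to passing a
# list of filenames to tf.data.TFRecordDataset.
stats = tfdv.generate_statistics_from_tfrecord('data/part-*.tfrecord')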

Thank you.

Ignoring feature of type datetime64[ns] when generate statistics from dataframe

df_sim_does2012_raw_stats = tfdv.generate_statistics_from_dataframe(df_sim_does2012_raw)
tfdv.visualize_statistics(df_sim_does2012_raw_stats)
WARNING:root:Ignoring feature DTATESTADO of type datetime64[ns]
WARNING:root:Ignoring feature DTCADASTRO of type datetime64[ns]
WARNING:root:Ignoring feature DTCADINF of type datetime64[ns]
WARNING:root:Ignoring feature DTCADINV of type datetime64[ns]
WARNING:root:Ignoring feature DTCONCASO of type datetime64[ns]
WARNING:root:Ignoring feature DTINVESTIG of type datetime64[ns]
WARNING:root:Ignoring feature DTNASC of type datetime64[ns]
WARNING:root:Ignoring feature DTOBITO of type datetime64[ns]
WARNING:root:Ignoring feature DTRECEBIM of type datetime64[ns]
WARNING:root:Ignoring feature DTRECORIGA of type datetime64[ns]
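
A workaround sketch, assuming casting to a supported dtype is acceptable (the string format is illustrative):

import tensorflow_data_validation as tfdv

df = df_sim_does2012_raw.copy()
# Cast datetime columns to a type TFDV understands, e.g. ISO date strings.
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].dt.strftime('%Y-%m-%d')
stats = tfdv.generate_statistics_from_dataframe(df)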

Unclear anomaly_info

I have inferred a schema and statistics on a TFRecords dataset (unfortunately I cannot share the dataset).
When I validate the statistics against "its own" schema, I get anomalies for some of my features, like:

u"'country_of_origin'": 
description: "Some examples have fewer values than expected."
severity: ERROR
short_description: "Missing values"
reason {
  type: FEATURE_TYPE_LOW_NUMBER_VALUES
  short_description: "Missing values"
  description: "Some examples have fewer values than expected."
}
path {
  step: "country_of_origin"
}

This is the relevant (inferred) part of my schema:

feature {
  name: "country_of_origin"
  value_count {
    min: 1
  }
  type: BYTES
  presence {
    min_count: 1
  }
}

This is the representation of the relevant part of the statistics:
(screenshot of the relevant statistics omitted)

I have several problems with this:

  1. It looks like something weird happened with the quotes around the key of the anomaly_info.
  2. The anomaly says "Missing values", but my statistics say 0% missing values.
  3. Generally in TFDV, it's not clear which numbers are compared to raise an anomaly. For example here, is the error caused by the value_count or the presence.min_count part of my schema? And which of these fields in my statistics is used?
    • feature.string_stats.common_stats.min_num_values
    • feature.string_stats.common_stats.num_missing
    • feature.string_stats.common_stats.tot_num_values
    • feature.string_stats.common_stats.avg_num_values
      I suspect it's the first one, but there is no other way for me to be sure than digging into the C++ code of this repository, and since I don't know the codebase, I'm not 100% sure.
  4. It also says: "Some examples have fewer values than expected." Ideally I would like more information to solve the problem or to assess how severe it is: how many examples, which ones, how many values are missing, etc.
  5. Depending on the results, maybe I would like to allow a portion of my dataset to have fewer values than required. But how could I define this?

I would expect the content of the anomaly to give me more information about what exactly the problem is and which comparison is failing.

Ideally, it would also suggest a way to "silence" the anomaly if I wish (could I change the severity of this error, for example?) or where to look in my dataset to find the root of the problem.
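
On point 5, a minimal sketch of one way to relax the constraint, assuming this check compares the observed min_num_values against value_count.min in the schema:

import tensorflow_data_validation as tfdv

# Silence the FEATURE_TYPE_LOW_NUMBER_VALUES anomaly by tolerating
# examples with fewer (even zero) values for this feature.
feature = tfdv.get_feature(schema, 'country_of_origin')
feature.value_count.min = 0
anomalies = tfdv.validate_statistics(stats, schema)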

Add a tfdv.load_anomalies_text function similar to tfdv.load_schema_text

Hello,

I'm setting up a TFX pipeline, and I want to analyze the results with TFDV. The SchemaGen component outputs a "schema.pbtxt", which can be loaded then displayed nicely with

schema = tfdv.load_schema_text(os.path.join(PIPELINE_DIR, 'SchemaGen/output/3/schema.pbtxt'))
tfdv.display_schema(schema)

As far as I know, there is no function to load the anomalies text file output by the ExampleValidator component and display it nicely, even though there is a display_anomalies function. Can you add it?

Thank you!
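
In the meantime, a minimal workaround sketch that parses the pbtxt directly with the protobuf text_format API (the file path is hypothetical):

from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import anomalies_pb2
import tensorflow_data_validation as tfdv

# Parse the anomalies pbtxt by hand, then reuse the existing
# display helper.
with open('ExampleValidator/output/5/anomalies.pbtxt') as f:
    anomalies = text_format.Parse(f.read(), anomalies_pb2.Anomalies())
tfdv.display_anomalies(anomalies)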

[Improvement Suggestions] Specific na_values

Hi,

I've looked at the TFDV documentation but couldn't find any way to provide a list of na_values as pandas does. I have a dataset where missing values are marked with the symbol "?", and TFDV considers it a valid string.
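
A workaround sketch, assuming a dataframe workflow: normalize the "?" sentinel to NaN in pandas before handing the data to TFDV, so it counts as missing rather than as a valid string (the CSV path is hypothetical).

import numpy as np
import pandas as pd
import tensorflow_data_validation as tfdv

# Map the "?" sentinel to NaN so TFDV's missing-value stats see it.
df = pd.read_csv('data.csv')  # hypothetical path
stats = tfdv.generate_statistics_from_dataframe(df.replace('?', np.nan))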

ImportError: cannot import name pywrap_tensorflow_data_validation

Hi All,

I'm facing the following problem when I import tensorflow_data_validation in my source code:
ImportError: cannot import name pywrap_tensorflow_data_validation

Digging into the code, I found that pywrap_tensorflow_data_validation is missing (or removed, maybe), even though this module is required by the following code:

Could someone kindly let me know where I can find pywrap_tensorflow_data_validation.py?

Thanks.

Error in documentation

The Get Started page has an error:

The schema itself is stored as a Schema protocol buffer and can thus be updated/edited using the standard protocol-buffer API. TFDV also provides a few utility methods to make these updates easier. For instance, suppose that the schema contains the following stanza to describe a required string feature device that takes a single value:

The string feature is actually called payment_type in the code example, not device:

feature {
  name: "payment_type"
  value_count {
    min: 1
    max: 1
  }
  type: BYTES
  domain: "payment_type"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
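
For context, the utility-method edit that passage goes on to describe looks like this; the exact call is from my reading of the Get Started page, so treat it as an assumption rather than a verbatim quote:

import tensorflow_data_validation as tfdv

# Relax the feature so it only has to be present in at least half
# of the examples, instead of being strictly required.
tfdv.get_feature(schema, 'payment_type').presence.min_fraction = 0.5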

Make desired_batch_size argument public in GenerateStatistics API

Hi, currently it's not possible to set desired_batch_size through the public API of GenerateStatistics,

though the functionality is implemented in _BatchedCombineFnWrapper.

The easiest way to expose it would be to add it to StatsOptions and pass it through to _BatchedCombineFnWrapper.

The desired result would look like:

class GenerateSlicedStatisticsImpl(beam.PTransform):

  def __init__(
      self,
      options=stats_options.StatsOptions(desired_batch_size=100)):
    self._options = options
    ...
    for generator in stats_generators:
      ...
      _BatchedCombineFnWrapper(
          generator,
          desired_batch_size=self._options.desired_batch_size)).with_hot_key_fanout(fanout))

Or is there a reason why it's hidden at the moment?

Thanks.

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

While using tfdv.generate_statistics_from_csv() I get the error AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec', as noted in.
This seems to be caused by TensorFlow Transform renaming _QuantilesCombinerSpec in analyzers.py to QuantilesCombiner, which I think can be fixed by installing tensorflow-transform==0.9.0.
I think the dependency on tensorflow-transform in setup.py should be capped at <0.11 to avoid this error surfacing when installing via pip.

Replacing feature name with feature path in statistics proto

Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, parquet data), we will populate path for each feature instead of name in the output statistics proto. Path contains a repeated string field step.

Current behavior

features {
  name: 'foo'
}

New behavior

features {
  path {
    step: 'foo'
  }
}

TFDV 0.14 will have this new behavior. The validation API will be backwards compatible with the protos with name populated.
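
A small compatibility sketch for consumers reading statistics protos across the 0.14 boundary (field names per tensorflow_metadata.proto.v0.statistics_pb2):

# Handle both the old (name) and new (path) feature identifiers.
def feature_id(feature):
    """Return 'foo.bar' for path-based features, else the old name."""
    if feature.path.step:
        return '.'.join(feature.path.step)
    return feature.name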

Custom statistics with CombinerStatsGenerator and sequential data

Hi!

I would like to use TFDV with sequential data (SequenceExamples stored in TFRecord). Under closed issues (#45) I found that this is currently not supported, but that the capability will be added in the future.

Since it doesn't seem this feature will be added any time soon, I tried to use CombinerStatsGenerator for custom statistics generation, but it turns out that the input_table argument passed to CombinerStatsGenerator.add_input includes only the context features of the SequenceExamples stored in my TFRecord file.

Is there a reason for this? Or, more importantly, is there a workaround or any other way I can access the sequential data inside the TFRecord?

Also, does anyone know when we can expect TFDV to include support for sequential data?

Thanks!

tf.SequenceExample support

Does TFDV support reading tf.SequenceExamples from TFRecords, inferring a schema over them, and computing statistics from them?

AttributeError: 'module' object has no attribute 'uint64'

I run TensorFlow Data Validation in a Docker container. Currently, my Dockerfile is something like this (skipped a few proxy-related lines):

FROM tensorflow/tensorflow:latest

ADD Container-Root /

RUN apt-get -y update && apt-get install -y libsnappy-dev python-dev

RUN  pip install python-snappy

RUN pip install tensorflow-probability tensorflow_transform==0.9.0 tensorflow-data-validation

WORKDIR /

This works. However, I don't understand why I need to install tensorflow-probability. If I leave it out, i.e., the Dockerfile becomes

FROM tensorflow/tensorflow:latest

ADD Container-Root /

RUN apt-get -y update && apt-get install -y libsnappy-dev python-dev

RUN  pip install python-snappy

RUN pip install tensorflow_transform==0.9.0 tensorflow-data-validation

WORKDIR /

then, as soon as I run the first cell of my test notebook, I get an error. This is the cell:

import tensorflow_data_validation as tfdv
import os
import time
import glob

And below is the error. I skipped a few ImportWarnings at the beginning, which I don't think are important (I get them even when I include tensorflow-probability, and they don't prevent the notebook from running fine).

AttributeErrorTraceback (most recent call last)
<ipython-input-1-284b3f08a890> in <module>()
----> 1 import tensorflow_data_validation as tfdv
      2 import os
      3 import time
      4 import glob

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/__init__.py in <module>()
     16 
     17 # Import stats API.
---> 18 from tensorflow_data_validation.api.stats_api import GenerateStatistics
     19 from tensorflow_data_validation.api.stats_api import StatsOptions
     20 

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/api/stats_api.py in <module>()
     50 from tensorflow_data_validation import types
     51 from tensorflow_data_validation.statistics import stats_impl
---> 52 from tensorflow_data_validation.statistics.generators import common_stats_generator
     53 from tensorflow_data_validation.statistics.generators import numeric_stats_generator
     54 from tensorflow_data_validation.statistics.generators import stats_generator

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/statistics/generators/common_stats_generator.py in <module>()
     35 from tensorflow_data_validation import types
     36 from tensorflow_data_validation.statistics.generators import stats_generator
---> 37 from tensorflow_data_validation.utils import quantiles_util
     38 from tensorflow_data_validation.utils import stats_util
     39 from tensorflow_data_validation.types_compat import Dict, List, Optional

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/quantiles_util.py in <module>()
     24 
     25 import numpy as np
---> 26 from tensorflow_transform import analyzers
     27 from tensorflow_data_validation.types_compat import List, Union
     28 from tensorflow_metadata.proto.v0 import statistics_pb2

/usr/local/lib/python2.7/dist-packages/tensorflow_transform/__init__.py in <module>()
     16 # pylint: disable=wildcard-import
     17 from tensorflow_transform import coders
---> 18 from tensorflow_transform.analyzers import *
     19 from tensorflow_transform.api import apply_function
     20 from tensorflow_transform.mappers import *

/usr/local/lib/python2.7/dist-packages/tensorflow_transform/analyzers.py in <module>()
     55     tf.int32: tf.int64,
     56     tf.int64: tf.int64,
---> 57     tf.uint8: tf.uint64,
     58     tf.uint16: tf.uint64,
     59     tf.uint32: tf.uint64,

AttributeError: 'module' object has no attribute 'uint64'

Any idea what the reason could be? Having tensorflow-probability in the image without actually using it increases the size of my image by a good 1/2 GB.

import tensorflow_data_validation error

Hi, when I import tensorflow_data_validation I get the following error:

I am on Mac OS X El Capitan using an Anaconda Python 2.7.12 environment.

How can I resolve this?

import tensorflow_data_validation
  File "/Users/Eric/anaconda2/lib/python2.7/site-packages/tensorflow_data_validation/__init__.py", line 18, in <module>
    from tensorflow_data_validation.api.stats_api import GenerateStatistics
  File "/Users/Eric/anaconda2/lib/python2.7/site-packages/tensorflow_data_validation/api/stats_api.py", line 51, in <module>
    from tensorflow_data_validation.statistics import stats_impl
  File "/Users/Eric/anaconda2/lib/python2.7/site-packages/tensorflow_data_validation/statistics/stats_impl.py", line 24, in <module>
    from tensorflow_data_validation.statistics.generators import stats_generator
  File "/Users/Eric/anaconda2/lib/python2.7/site-packages/tensorflow_data_validation/statistics/generators/stats_generator.py", line 57, in <module>
    from tensorflow_metadata.proto.v0 import schema_pb2
  File "/Users/Eric/anaconda2/lib/python2.7/site-packages/tensorflow_metadata/proto/v0/schema_pb2.py", line 23, in <module>
    [serialized protobuf descriptor bytes omitted]
TypeError: __new__() got an unexpected keyword argument 'serialized_options'

Thanks

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

I installed tfdv in Colab using Python 2, executing the commands below:
!pip install tensorflow-data-validation==0.9.0

!pip install tensorflow-transform==0.9.0

When I then run
train_stats = tfdv.generate_statistics_from_csv(train_data)

I get the error below.

AttributeErrorTraceback (most recent call last)
in ()
----> 1 train_stats = tfdv.generate_statistics_from_csv(train_data)

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_csv(data_location, column_names, delimiter, output_path, stats_options, pipeline_options)
149 delimiter=delimiter)
150 | 'GenerateStatistics' >> stats_api.GenerateStatistics(stats_options)
--> 151 | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(
152 output_path,
153 shard_name_template='',

/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.pyc in __or__(self, ptransform)
109
110 def __or__(self, ptransform):
--> 111 return self.pipeline.apply(ptransform, self)
112
113

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
465 if isinstance(transform, ptransform._NamedPTransform):
466 return self.apply(transform.transform, pvalueish,
--> 467 label or transform.label)
468
469 if not isinstance(transform, ptransform.PTransform):

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
475 try:
476 old_label, transform.label = transform.label, label
--> 477 return self.apply(transform, pvalueish)
478 finally:
479 transform.label = old_label

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
511 transform.type_check_inputs(pvalueish)
512
--> 513 pvalueish_result = self.runner.apply(transform, pvalueish)
514
515 if type_options is not None and type_options.pipeline_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply(self, transform, input)
191 m = getattr(self, 'apply_%s' % cls.__name__, None)
192 if m:
--> 193 return m(transform, input)
194 raise NotImplementedError(
195 'Execution of [%s] not implemented in runner %s.' % (transform, self))

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply_PTransform(self, transform, input)
197 def apply_PTransform(self, transform, input):
198 # The base case of apply is to call the transform's expand.
--> 199 return transform.expand(input)
200
201 def run_transform(self, transform_node):

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/api/stats_api.pyc in expand(self, dataset)
197 num_values_histogram_buckets=
198 self._options.num_values_histogram_buckets,
--> 199 epsilon=self._options.epsilon),
200
201 # Create numeric stats generator.

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/statistics/generators/common_stats_generator.pyc in __init__(self, name, schema, num_values_histogram_buckets, epsilon)
229 # Initialize quantiles combiner.
230 self._quantiles_combiner = quantiles_util.QuantilesCombiner(
--> 231 num_values_histogram_buckets, epsilon)
232
233 # Create an accumulator, which maps feature name to the partial stats

/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/quantiles_util.pyc in __init__(self, num_quantiles, epsilon)
38 self._num_quantiles = num_quantiles
39 self._epsilon = epsilon
---> 40 self._quantiles_spec = analyzers._QuantilesCombinerSpec(
41 num_quantiles=num_quantiles, epsilon=epsilon,
42 bucket_numpy_dtype=np.float32, always_return_num_quantiles=True)

AttributeError: 'module' object has no attribute '_QuantilesCombinerSpec'

Slow performance when computing stats for moderately large data set

Hi,

I like this project a lot, and thanks for releasing it! I see the potential to save a lot of time when I first receive new datasets. However, I'm having performance issues.

OS: the notebook is running in a Docker container based on https://hub.docker.com/r/tensorflow/tensorflow/.

Hardware:

  • GPUs: 16x NVIDIA® Tesla V100
  • GPU memory: 512 GB total
  • CPU: dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores
  • System memory: 1.5 TB

It takes me >8 hours (30,900 s) to compute the statistics for a dataset of ~100 files, with file sizes from 0.5 MB to 300 MB and a median of 70 MB. It's true that the Docker container introduces some overhead, but given the specs of my hardware, I think that's too much. Any tips on how to speed up the computation without changing hardware (i.e., no cloud)? For example, if there were an option to compute statistics on a dataframe rather than into a protocol buffer, one could use Modin to speed up the pandas computations with minimal code changes.
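
One knob worth trying before changing hardware (my suggestion, not an official recipe; the flags exist in newer Apache Beam releases): run Beam's DirectRunner with multiple worker processes.

from apache_beam.options.pipeline_options import PipelineOptions
import tensorflow_data_validation as tfdv

# Spread the stats computation across processes on one machine.
opts = PipelineOptions([
    '--direct_num_workers=24',  # roughly match physical cores
    '--direct_running_mode=multi_processing',
])
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='data/*.tfrecord',  # hypothetical pattern
    pipeline_options=opts)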

PS if I use a GPU container instead, with

$ docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
should I see a speedup?

_pywrap_tensorflow_data_validation.so: undefined symbol: PyInstanceMethod_New

Dear guys,

I ran into a problem when using TFDV:

import tensorflow_data_validation

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/d00404091/myenv/local/lib/python2.7/site-packages/tensorflow_data_validation/__init__.py", line 22, in <module>
    from tensorflow_data_validation.api.validation_api import infer_schema
  File "/home/d00404091/myenv/local/lib/python2.7/site-packages/tensorflow_data_validation/api/validation_api.py", line 25, in <module>
    from tensorflow_data_validation.anomalies import pywrap_tensorflow_data_validation
  File "/home/d00404091/myenv/local/lib/python2.7/site-packages/tensorflow_data_validation/anomalies/pywrap_tensorflow_data_validation.py", line 28, in <module>
    _pywrap_tensorflow_data_validation = swig_import_helper()
  File "/home/d00404091/myenv/local/lib/python2.7/site-packages/tensorflow_data_validation/anomalies/pywrap_tensorflow_data_validation.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_data_validation', fp, pathname, description)
ImportError: /home/d00404091/myenv/local/lib/python2.7/site-packages/tensorflow_data_validation/anomalies/_pywrap_tensorflow_data_validation.so: undefined symbol: PyInstanceMethod_New

OS: ubuntu 16.04
bazel: 0.15
python: 2.7.12

I also updated SWIG to 3.0.12, but it didn't help. Any other suggestions?

GCP Dataflow file pattern *.tfrecord

Not really an issue, but something people might encounter:

Setting a glob pattern like 'dataset/train/*' as the data location for a Dataflow execution will crash the job. Even if you only generated .tfrecord files inside this "folder", there is something else in the bucket that your decoder pipeline tries to read.

This might result in something like:
ValueError: Not a valid TFRecord. Mismatch of length mask: 934e554d5059010076007b27

You have to be sure to select only TFRecords: 'dataset/train/*.tfrecord'

Hope it helps someone :)

[Improvement Suggestions]

Hi Paul,

As discussed on SO, please find below my feedback on what I think would be nice additions to this already great library:

  1. Possibility to merge generated stats: a typical use case is a daily ETL receiving the day's data as a set of multiple files with a common prefix. The stats need to be computed on the whole batch, not on each file, so the ability to merge several stats results would be useful (see the sketch after this list). Accepting a prefix as the input instead of a filename would also be very useful for that use case.

  2. Documentation: a list of validation error messages would be very nice to have in the docs. Even more important is a list of all the feature properties that can be written in the protobuf schema.

  3. RAM overhead: datasets are often big, and TFDV requires a large RAM overhead. The possibility to process datasets in mini-batches would allow running TFDV even in lower-RAM environments. That could even be combined with my first suggestion.

  4. Schema inference: TFDV generates quite generic schemas when inferring from a dataset. I understand that domain knowledge is required to define precise schema properties, but maybe adding a suggestion output could help. For example, if the standard deviation is quite high relative to the mean, or the missing-value rate is 6% (so maybe 10% would be acceptable), you could output a message saying that such a stddev may be out of bounds, and the following code could add this to your schema: code_example.
    This kind of interactivity and suggestion would make it really easy to get better schemas, imo.

  5. Performance: like my fellow colleague in this issue list, I experience really low performance on big datasets composed of many files. There must be some tuning we are missing. When I compute a mean with Apache Beam on the same dataset using a DirectRunner, I get quite good performance, for example. Admittedly I'm being subjective here... :)

  6. Facets Dive: very cool to have the Facets Overview directly accessible when visualizing stats. A super useful addition would be to also have Dive in the same package.

Anyway, thanks a lot for the super cool library; it's already very useful as is!
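
On suggestion 1, a rough sketch of a manual merge under current APIs (stats_a and stats_b are hypothetical prior results): DatasetFeatureStatisticsList holds a repeated datasets field, so two results can at least be carried in one proto, though this concatenates per-file results rather than truly aggregating them.

from tensorflow_metadata.proto.v0 import statistics_pb2

# Carry two stats results in one proto; this does NOT recompute
# aggregate statistics across the underlying files.
merged = statistics_pb2.DatasetFeatureStatisticsList()
merged.datasets.extend(stats_a.datasets)
merged.datasets.extend(stats_b.datasets)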

Newline in CSV quoted string breaks reader

Hi,
Looks like the current CSV reader does not support the case where a quoted string value spans multiple lines (i.e., contains line breaks). This means a logical CSV row may span several physical lines, which is valid CSV.

Looks like some of this is indicated here? https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py#L150

And some background here:

https://stackoverflow.com/questions/18724903/csvs-in-python-with-newline-in-quotes

A question:
There's probably a reason for it, but why not use an actual CSV reader?
Edit: I'm assuming it's because streaming, Beam, etc. want a unit = one line, which makes parallelism possible.
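
A quick illustration of why a line-based reader breaks on this, using Python's own csv module:

import csv
import io

# One logical CSV row legally spans two physical lines here, because
# the quoted field contains a newline. csv.reader handles it; a
# reader that first splits the file on '\n' would not.
data = 'id,comment\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['id', 'comment'], ['1', 'first line\nsecond line']]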

ImportError: cannot import name types

I tried both compiling tensorflow_data_validation from the git repository and installing it through pip, but both result in the error below. Am I missing something or could this be a bug?

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-ff8693e3a039> in <module>()
----> 1 import tensorflow_data_validation as tfdv
      2 import os

/path/to/python2.7/site-packages/tensorflow_data_validation/__init__.py in <module>()
     16 
     17 # Import stats API.
---> 18 from tensorflow_data_validation.api.stats_api import GenerateStatistics
     19 from tensorflow_data_validation.api.stats_api import StatsOptions
     20 

/path/to/python2.7/site-packages/tensorflow_data_validation/api/stats_api.py in <module>()
     48 import collections
     49 import apache_beam as beam
---> 50 from tensorflow_data_validation import types
     51 from tensorflow_data_validation.statistics import stats_impl
     52 from tensorflow_data_validation.statistics.generators import common_stats_generator

ImportError: cannot import name types

Inconsistency between statistics and inferred schema

Good morning,

I have a dataset that contains many features, one of which is a list of 5 strings.
I have double-checked that every example has exactly 5 distinct strings for this feature. I am also generating this dataset with a Dataflow pipeline, dumping the data both to TFRecords and to BigQuery, and I have double-checked in BigQuery as well.
Now, when I use tfdv to generate statistics and infer a schema, I get these:
For statistics.pb:

name: "tracks"
type: STRING
string_stats {
  common_stats {
    num_non_missing: 6998315
    min_num_values: 5
    max_num_values: 5
    avg_num_values: 5.0
    num_values_histogram {
...

for schema.pb:

feature {
  name: "tracks"
  value_count {
    min: 1
  }
  type: BYTES
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}

Now, when I convert my schema to a feature spec, I get a VarLenFeature, though I was expecting a FixedLenFeature since I "know" the shape should be [5].
This is the only feature I have with a shape >1; all the others are basically one string per example, and I get a FixedLenFeature for them, as expected.

Does anyone know why this schema was generated like this?
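
If a FixedLenFeature is the goal, one possible workaround (my assumption being that the schema-to-feature-spec conversion requires min == max) is to pin the value count by hand:

import tensorflow_data_validation as tfdv

# Pin the value count so the schema maps to a FixedLenFeature of
# shape [5] instead of a VarLenFeature.
feature = tfdv.get_feature(schema, 'tracks')
feature.value_count.min = 5
feature.value_count.max = 5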
