Comments (6)
When writing, tensorstore sometimes uses a mask indicating which values to overwrite. This error indicates that such a mask is present but there is no accompanying data.
This is only a fragment of the code. Can you reduce it to a single minimal function with all of the code necessary to reproduce it?
from tensorstore.
Hi,
here is an example where I tried to summarize what my code is doing.
It is split into two parts because the config generation happens on the main node before multiple nodes write the dataset in parallel. The boundaries of each node should always align with shard boundaries.
// generate multiscale neuroglancer config - done on main node
tensorstore::Open(
    {{"driver", "neuroglancer_precomputed"},
     {"kvstore",
      {
          {"driver", "file"},
          {"path", "<path_to_new_dataset>"},
      }},
     {"multiscale_metadata",
      {
          {"data_type", "uint8_t"},
          {"num_channels", numberChannels},
          {"type", "image"},
      }},
     {"scale_metadata",
      {
          {"resolution", {resolution[0], resolution[1], resolution[2]}},
          {"encoding", encoding},
          {"chunk_size", {chunkSize[0], chunkSize[1], chunkSize[2]}},
          {"size", {sizeX, sizeY, sizeZ}},
          {"sharding",
           {{"@type", "neuroglancer_uint64_sharded_v1"},
            {"preshift_bits", neuroBits.preShiftBits},
            {"minishard_bits", neuroBits.miniShardBits},
            {"shard_bits", neuroBits.shardBits},
            {"minishard_index_encoding", "raw"},
            {"hash", "identity"}}},
      }},
     {"scale_index", index}},
    context, tensorstore::OpenMode::create,
    tensorstore::RecheckCached{false},
    tensorstore::ReadWriteMode::write)
    .value();
// reopen tensorstore - following code is done in parallel on multiple nodes
tensorstore::TensorStore store =
    tensorstore::Open({{"driver", "neuroglancer_precomputed"},
                       {"kvstore",
                        {{"driver", "file"},
                         {"path", "<path_to_new_dataset>"}}},
                       {"scale_index", index}},
                      context, tensorstore::OpenMode::open,
                      tensorstore::RecheckCached{true},
                      tensorstore::ReadWriteMode::write)
        .value();

// size and interval of array should be aligned with shard boundaries
boost::multi_array<uint8_t, 4> dataArray(
    boost::extents[D4][endD3 - startD3][endD2 - startD2][endD1 - startD1]);
// load data into array - not tensorstore
// do stuff with array
// write data to tensorstore
std::vector<int64_t> shape = {endD1 - startD1, endD2 - startD2,
                              endD3 - startD3, D4};
auto intervalD1 = tensorstore::Dims(0).HalfOpenInterval(startD1, endD1);
auto intervalD2 = tensorstore::Dims(1).HalfOpenInterval(startD2, endD2);
auto intervalD3 = tensorstore::Dims(2).HalfOpenInterval(startD3, endD3);
auto arr =
    tensorstore::Array(dataArray.data(), shape, tensorstore::fortran_order);
auto writeFuture =
    tensorstore::Write(tensorstore::UnownedToShared(arr),
                       store | intervalD1 | intervalD2 | intervalD3);
writeFuture.commit_future.value();
auto result = writeFuture.result();
exceptional_assert(result.ok(), "Error while writing to disk");
Do you need more information?
Thank you for your help.
Let's try to get a self-contained repro case; I've built a GoogleTest test incorporating your spec.
Is it possible for you to edit this self-contained test so that it fails?
#include <cstdint>
#include <vector>

#include <gtest/gtest.h>

#include "absl/status/status.h"
#include "tensorstore/array.h"
#include "tensorstore/context.h"
#include "tensorstore/contiguous_layout.h"
#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/staleness_bound.h"
#include "tensorstore/tensorstore.h"
#include "tensorstore/util/status_testutil.h"

// Boost
#include "boost/multi_array.hpp"

static constexpr int D4 = 1;

absl::Status CreateTensorstore(tensorstore::Context context) {
  int chunkSize[] = {16, 16, 16};
  int size[] = {1024, 1024, 1024};
  int preShiftBits = 2;
  int miniShardBits = 4;
  int shardBits = 8;
  return tensorstore::Open(
             {
                 {"driver", "neuroglancer_precomputed"},
                 {"kvstore",
                  {
                      {"driver", "memory"},
                      {"path", "prefix/"},
                  }},
                 {"multiscale_metadata",
                  {
                      {"data_type", "uint8"},  // not uint8_t
                      {"num_channels", D4},
                      {"type", "image"},
                  }},
                 {"scale_metadata",
                  {
                      {"resolution", {1.0, 1.0, 1.0}},
                      {"encoding", "raw"},
                      {"chunk_size",
                       {chunkSize[0], chunkSize[1], chunkSize[2]}},
                      {"size", {size[0], size[1], size[2]}},
                      {"sharding",
                       {{"@type", "neuroglancer_uint64_sharded_v1"},
                        {"preshift_bits", preShiftBits},
                        {"minishard_bits", miniShardBits},
                        {"shard_bits", shardBits},
                        {"minishard_index_encoding", "raw"},
                        {"hash", "identity"}}},
                  }},
                 {"scale_index", 0},
             },
             context, tensorstore::OpenMode::create,
             tensorstore::RecheckCached{false},
             tensorstore::ReadWriteMode::write)
      .status();
}

TEST(Issue155, Repro) {
  tensorstore::Context context = tensorstore::Context::Default();
  TENSORSTORE_ASSERT_OK(CreateTensorstore(context));

  int startD1 = 0;
  int endD1 = 64;
  int startD2 = 0;
  int endD2 = 64;
  int startD3 = 0;
  int endD3 = 64;

  // reopen tensorstore - following code is done in parallel on multiple nodes
  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto store,
      tensorstore::Open(
          {
              {"driver", "neuroglancer_precomputed"},
              {"kvstore",
               {
                   {"driver", "memory"},
                   {"path", "prefix/"},
               }},
              {"scale_index", 0},
          },
          context, tensorstore::OpenMode::open,
          tensorstore::RecheckCached{true}, tensorstore::ReadWriteMode::write)
          .result());

  // size and interval of array should be aligned with shard
  boost::multi_array<uint8_t, 4> dataArray(
      boost::extents[D4][endD3 - startD3][endD2 - startD2][endD1 - startD1]);
  // load data into array - not tensorstore
  // do stuff with array
  // write data to tensorstore
  std::vector<int64_t> shape = {endD1 - startD1, endD2 - startD2,
                                endD3 - startD3, D4};
  auto intervalD1 = tensorstore::Dims(0).HalfOpenInterval(startD1, endD1);
  auto intervalD2 = tensorstore::Dims(1).HalfOpenInterval(startD2, endD2);
  auto intervalD3 = tensorstore::Dims(2).HalfOpenInterval(startD3, endD3);
  auto arr =
      tensorstore::Array(dataArray.data(), shape, tensorstore::fortran_order);
  auto writeFuture =
      tensorstore::Write(tensorstore::UnownedToShared(arr),
                         store | intervalD1 | intervalD2 | intervalD3);
  writeFuture.commit_future.Wait();
  TENSORSTORE_ASSERT_OK(writeFuture.commit_future.result());
}
You will need to put together a proper BUILD rule for this.
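For reference, a minimal Bazel rule for the test might look roughly like the sketch below. The target labels are illustrative guesses based on the include paths above and may not match the actual packages in the tensorstore source tree; adjust them to your checkout.

```python
# BUILD (Bazel) - hypothetical rule for the repro test above; labels are
# illustrative and should be checked against the tensorstore repository.
cc_test(
    name = "issue155_repro_test",
    srcs = ["issue155_repro_test.cc"],
    deps = [
        "//tensorstore:array",
        "//tensorstore:context",
        "//tensorstore:contiguous_layout",
        "//tensorstore:open",
        "//tensorstore:open_mode",
        "//tensorstore:staleness_bound",
        "//tensorstore/driver/neuroglancer_precomputed",
        "//tensorstore/index_space:dim_expression",
        "//tensorstore/kvstore/memory",
        "//tensorstore/util:status_testutil",
        "@com_google_googletest//:gtest_main",
    ],
)
```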
I don't think I can get this test case to crash.
The only time I have seen this error is when multiple workers run on multiple nodes at the same time and all open and write to the same dataset asynchronously. They should not interfere with each other, because each one should be working on its own shard.
But I have also observed that tensorstore appears to load some surrounding data, even when it is not needed for the intended operation.
Do you attempt to create on every node? Or is the create completely independent?
We can make it a simple binary with a --create flag which takes start/end[1-3] as parameters.
You can try running your original code with verbose logging enabled. See https://github.com/google/tensorstore/blob/master/tensorstore/internal/log/verbose_flag.h
TENSORSTORE_VERBOSE_LOGGING=all
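For example (the worker binary name and flags below are placeholders, not part of your program):

```shell
# Enable verbose logging for all tensorstore subsystems in this environment,
# then run the original worker program under it.
export TENSORSTORE_VERBOSE_LOGGING=all
echo "$TENSORSTORE_VERBOSE_LOGGING"   # confirm the variable is set
# ./parallel_write_worker --create --start=0 --end=64   # placeholder invocation
```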
Thanks for reporting this.
I identified the bug, and have a fix that we can hopefully push out shortly.
I believe the specific case that would trigger this is:
- Process 1: Write all zeros to just a portion of a chunk. Starts writeback of shard. Observes that the chunk is equal to the fill value (all zero) because the existing chunk is either not present, or all unmodified elements are zero. At this point, the data array is freed since it is equal to the fill value, but the mask remains as it was to indicate a partial modification.
- Process 2: Concurrently modifies the shard.
- Process 1: Writeback must be retried due to the concurrent modification. When integrating the new contents of the shard, an assertion is triggered by the unexpected combination of a partial-modification mask with no data array.
There are two important things to note, though:
- If this sequence is indeed what triggered the bug, then that means your writes are in fact not shard aligned as you thought they were. Even with the bug fixed, shard aligned writes will be much more efficient.
- The assertion only triggers in debug builds (with NDEBUG not defined). For production use, disabling assertions may make it significantly faster. Usually NDEBUG will be defined automatically in release builds, so you may want to check to confirm you are building with optimizations.
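To sanity-check shard alignment, here is a minimal sketch of how a chunk maps to a shard id. This is not a tensorstore API; it assumes hash = "identity" and a chunk grid whose dimensions all have the same power-of-two extent (as in the repro config: a 1024^3 volume with 16^3 chunks gives a 64^3 grid, i.e. 6 bits per dimension), in which case the compressed Morton code equals the plain bit interleave.

```cpp
#include <cstdint>

// Illustrative helper (not a tensorstore API): plain Morton bit interleave of
// a chunk-grid position, valid when every grid dimension spans 2^bits_per_dim.
uint64_t MortonCode(uint64_t x, uint64_t y, uint64_t z, int bits_per_dim) {
  uint64_t code = 0;
  for (int i = 0; i < bits_per_dim; ++i) {
    code |= ((x >> i) & 1) << (3 * i + 0);  // x fills bit 0 of each triple
    code |= ((y >> i) & 1) << (3 * i + 1);  // y fills bit 1
    code |= ((z >> i) & 1) << (3 * i + 2);  // z fills bit 2
  }
  return code;
}

// With hash = "identity", the shard id is obtained by dropping the preshift
// and minishard bits of the chunk's Morton code.
uint64_t ShardId(uint64_t morton, int preshift_bits, int minishard_bits,
                 int shard_bits) {
  return (morton >> (preshift_bits + minishard_bits)) &
         ((uint64_t{1} << shard_bits) - 1);
}
```

With preshift_bits=2 and minishard_bits=4 there are 2^(2+4) = 64 chunks per shard, i.e. a 4x4x4 block of 16^3-voxel chunks, which is one 64^3 voxel region: the [0,64)^3 write in the test touches chunks (0..3)^3, which all map to shard 0, while a write touching chunk (4,0,0) would spill into a different shard.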