pmemstream's Introduction

pmemstream

⚠️ Discontinuation of the project

The pmemstream project will no longer be maintained by Intel.

  • Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
  • Intel no longer accepts patches to this project.
  • If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
  • You will find more information here.

Introduction

pmemstream is a logging data structure optimized for persistent memory.

This is experimental pre-release software and should not be used in production systems. APIs and file formats may change at any time without preserving backwards compatibility. All known issues and limitations are logged as GitHub issues.

Libpmemstream implements a pmem-optimized log data structure and provides stream-like access to the data. It presents a contiguous logical address space, divided into regions, with log entries of arbitrary sizes. We intend for this library to be a foundation for various, more complex higher-level solutions.

This library is a successor to libpmemlog. These two libraries are very similar in basic concept, but libpmemlog was developed in a straightforward manner and does not allow easy extensions.

For more information, including C API documentation, see pmem.io/pmemstream.
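
The snippet below is a minimal usage sketch assembled from the calls referenced on this page (see also the initial API proposal near the end); it is illustrative only and the exact signatures may differ between pmemstream releases. Error handling is omitted and `map` is assumed to be an already-created libpmem2 mapping.

#include <libpmem2.h>
#include <libpmemstream.h>
#include <stdio.h>

int basic_example(struct pmem2_map *map)
{
	struct pmemstream *stream;
	pmemstream_from_map(&stream, 4096 /* block size */, map);

	/* Allocate a region and append a single entry into it. */
	struct pmemstream_region region;
	pmemstream_region_allocate(stream, 4096, &region);

	const char data[] = "hello pmemstream";
	struct pmemstream_entry entry;
	pmemstream_append(stream, region, NULL, data, sizeof(data), &entry);

	/* Read everything back through an entry iterator. */
	struct pmemstream_entry_iterator *eiter;
	pmemstream_entry_iterator_new(&eiter, stream, region);
	while (pmemstream_entry_iterator_next(eiter, NULL, &entry) == 0) {
		const char *e = pmemstream_entry_data(stream, entry);
		printf("entry at offset %lu: %s\n", entry.offset, e);
	}
	pmemstream_entry_iterator_delete(&eiter);

	pmemstream_delete(&stream);
	return 0;
}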

[diagram: example pmemstream]

Build and install

The installation guide provides detailed instructions on how to build and install pmemstream from sources, build rpm and deb packages, and more.
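
Assuming the sources are already cloned, a typical out-of-source CMake flow looks roughly like this (the installation guide remains the authoritative reference for options, dependencies and packaging):

mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install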

Contact us

For more information about pmemstream, please:

pmemstream's People

Contributors

igchor, karolina002, kfilipek, lgtm-migrator, lukaszstolarczuk, nedved1, szadam, tszczyp, wlemkows

pmemstream's Issues

Future feature ideas

These are some of our ideas for the next features, not described in separate issues.

  • pass timestamp to iterator
  • global order entry iterator
  • streaming iterator
  • intent-based allocator with timestamped regions
  • expose pmemstream_async_wait_running
  • expose blocking/optimistic wait functions
  • config for pmemstream
  • error codes
  • fio integration

Add option to trigger pool prefaulting

FEAT: Add option to trigger pool prefaulting

Rationale

Right now, we are always clearing the unused part of the region on restart (to make sure that we don't have any trash which could be later interpreted as data). Because of this, all pages to which we will later write (inside append) are already prefaulted.

Once we get rid of clearing the regions, appends might trigger page faults. To get valid benchmarking results we should add an option to force prefaulting at open/create time.

Description

Similar to: https://pmem.io/pmdk/manpages/linux/master/libpmemobj/pmemobj_ctl_get.3/

API Changes

None. Add an environment variable to trigger page faults.
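
A minimal sketch of what forced prefaulting could look like, assuming a hypothetical PMEMSTREAM_PREFAULT environment variable (neither the variable name nor the hook point is part of the current API):

#include <stdint.h>
#include <stdlib.h>

/* Touch one byte per page of the mapped area so that all pages are faulted in
 * at open/create time instead of during the first append. */
static void prefault_if_requested(const void *addr, size_t len, size_t page_size)
{
	const char *env = getenv("PMEMSTREAM_PREFAULT"); /* hypothetical knob */
	if (!env || env[0] == '0')
		return;

	volatile const uint8_t *p = addr;
	for (size_t off = 0; off < len; off += page_size)
		(void)p[off];
}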

Failing tests with TSAN feature

ISSUE: Failing tests with TSAN feature

Environment Information

  • pmemstream version(s): cfa5d7c
  • PMDK (libpmem2) package version(s):
  • OS(es) version(s):
  • kernel version(s):
  • compiler, libraries, packaging and other related tools version(s): clang 13.0.1-2ubuntu2

and possibly:

  • ndctl version(s):

Please provide a reproduction of the bug:

Enabled TSAN in CMake and ctest -R "_none"

How often bug is revealed:

always

Actual behavior:

         40 - concurrent_async_wait_0_none (Failed)
         43 - concurrent_iterate_0_none (Failed)
         44 - concurrent_iterate_with_append_0_none (Failed)
         50 - publish_append_async_0_none (Failed)
         51 - region_runtime_initialize_0_none (Failed)
         68 - singly_linked_list_pmreorder_negative_none (Failed)
         81 - example-05_timestamp_based_order_0_none (Failed)

Expected behavior:

[image: Pika tests]

Details

Additional information about Priority and Help Requested:

Are you willing to submit a pull request with a proposed change? (Yes, No)

Requested priority: Low

0.2.0 release feature list

  • Timestamp-based persistency
    • #202
    • Expose timestamps in API (return from append/publish)
    • Extend futures to 2 stages: committed, persisted
    • Expose pmemstream_async_wait_*(timestamp)
    • Add extra tests:
      • #229
      • #211
      • more than PMEMSTREAM_MAX_CONCURRENCY concurrent operations
      • multithreaded async_append/async_publish and wait_committed/wait_persisted
    • Forbid multiple concurrent appends to the same region (even from a single thread) for now. Later, we might add some kind of transparent transactions to allow this.
    • Implement and test recovery. One approach (on #183) stores max_valid_timestamp in each region on recovery. An alternative approach could be to have an additional run_id variable incremented on each restart. Each region would store the current run_id in its metadata on recovery. We would also have some log/map which would correlate each run_id with a maximum valid timestamp. #192
    • Handle exceeding MAX_CONCURRENCY on append
    • Document what data is visible to iterators (committed vs persistent)
  • Implement RocksDB-like API for iterators
  • pmemstream_region_usable_size()
  • manpages
  • use pmem2_async_memcpy instead of vdm_memcpy #163

FEAT: Multiple regions support

FEAT: Multiple regions support

Rationale

  • Increasing concurrency (there is no need for heavy-weight synchronization between appends to different regions)
  • Logical division of entries
  • Grouping objects of different sizes together (can improve performance for concurrent append - appending similarly sized entries will result in bigger overlap for memcpy)

Description

  • Persistent allocator
    • Store alloc/free actions on PMEM (which allows rebuilding the state) and do compaction regularly (see the sketch after this list)
  • Synchronization between iterators and region_free needed
    • Use delayed free / garbage collection?
  • Region discovery TBD (optional)
    • Named regions?
    • Root region?
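
A rough sketch of what the persisted alloc/free action log mentioned above could look like (names and layout are illustrative only, not the actual on-media format):

#include <stdint.h>

enum allocator_action_type { ACTION_ALLOCATE, ACTION_FREE };

struct allocator_action {
	uint64_t action_type;   /* enum allocator_action_type */
	uint64_t region_offset;
	uint64_t region_size;
};

/* Fixed-capacity log of actions stored on pmem; replayed on restart to rebuild
 * the allocator state and compacted regularly once it fills up. */
struct allocator_action_log {
	uint64_t length;
	struct allocator_action actions[];
};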

API Changes

No changes - region allocate and free are already supported.

Implementation details

Most of the code is ready to handle multiple regions. There are a few global variables which should be made per-region (moved to region_runtime), e.g. region_lock from region_runtimes_map.

First steps

  • trivial persistent allocator based on free list
  • no synchronization for iterators and region_free
  • no region discovery (only iterators)

Assertion fault in span_get_region_runtime

ISSUE:

assertion fault in span_get_region_runtime when passing invalid region offset

The bug reproduction:

struct pmemstream_region region = {.offset = ALIGN_DOWN(UINT64_MAX, sizeof(span_bytes))};
ret = pmemstream_reserve(stream, region, NULL, sizeof(data), &entry, &data_address);

How often bug is revealed:

always

Actual behavior:

 __GI___assert_fail (assertion=0x7ffff7f226f0 "span_get_type(span) == SPAN_REGION", 
span_get_region_runtime () at /PMDK/pmemstream/src/span.c:121
pmemstream_reserve () at /PMDK/pmemstream/src/libpmemstream.c:166
invalid_region_test () at /PMDK/pmemstream/tests/api_c/reserve_and_publish.c:85
main () at /PMDK/pmemstream/tests/api_c/reserve_and_publish.c:189

Segmentation fault in pmemstream_reserve

ISSUE:

segmentation fault in pmemstream_reserve (potentially also in pmemstream_publish) when passing NULL entry

The bug reproduction:

struct pmemstream_entry *entry = NULL;
ret = pmemstream_reserve(stream, region, NULL, sizeof(data), entry, &data_address);

How often bug is revealed:

always

Actual behavior:

Program received signal SIGSEGV, Segmentation fault.
reserved_entry->offset = offset;
pmemstream_reserve () at PMDK/pmemstream/src/libpmemstream.c:193
null_entry_test () at /PMDK/pmemstream/tests/api_c/reserve_and_publish.c:165
main () at /PMDK/pmemstream/tests/api_c/reserve_and_publish.c:192

Segmentation fault in pmemstream_entry_iterator_new

ISSUE:

segmentation fault in pmemstream_entry_iterator_new when passing NULL entry_iterator

The bug reproduction:

ret = pmemstream_entry_iterator_new(NULL, stream, region);

How often bug is revealed:

always

Actual behavior:

  Program received signal SIGSEGV, Segmentation fault.
  pmemstream_entry_iterator_new () at /pmemstream/src/iterator.c:92
  92		*iterator = iter;
  
  pmemstream_entry_iterator_new () at /pmemstream/src/iterator.c:92
  null_entry_iterator_test () at /pmemstream/tests/api_c/entry_iterator.c:112
  main () at /pmemstream/tests/api_c/entry_iterator.c:162

Recovery during recovery test

FEAT: pmreorder based test for power failure during recovery.

Rationale

The recovery process after a power failure involves writing to pmem, so we need to test whether it's possible to continue it after another power failure.

Implementation details

This may be done in a generic way:

  1. Run the test code (e.g. the multi_region_pmreorder test) under pmemcheck
  2. Run the pmreorder check with pmemcheck as the checker
  3. Run the pmreorder check for each generated storelog

Tricky part

With a little refactoring in pmreorder.py, it could easily be turned into a Python module to be used directly from Python. This would tremendously simplify tracking of storelogs.

Roughly tested POC - seems to work

./singly_linked_list_pmreorder create /dev/shm/sllp1

valgrind --tool=pmemcheck -q --log-stores=yes --print-summary=no --log-file=/dev/shm/logfile.storelog --log-stores-stacktraces=no --expect-fence-after-clflush=yes  ./singly_linked_list_pmreorder fill /dev/shm/sllp

pmreorder -l /dev/shm/logfile.storelog -o /dev/shm/x.pmreorder -r ReorderAccumulative -p "/usr/bin/valgrind \-\-tool=pmemcheck \-q \-\-log-stores=yes \-\-print-summary=no \-\-log-file=/dev/shm/logfile-$(cat /proc/sys/kernel/random/uuid).storelog \-\-log-stores-stacktraces=no \-\-expect-fence-after-clflush=yes  ./singly_linked_list_pmreorder check"

pmreorder -l /dev/shm/logfile-be119487-d4c9-43f1-aea5-e28733078988.storelog -o /dev/shm/x.pmreorder -r ReorderAccumulative -p "./singly_linked_list_pmreorder check" 

Standalone example(s) compilation

We have to add CMake support for compiling the standalone example(s) and update CI to build them, to confirm our packages are built properly.

Assertion fault in span_get_entry_runtime

ISSUE:

assertion fault in span_get_entry_runtime when passing invalid entry offset

The bug reproduction:

struct pmemstream_entry entry = {.offset = ALIGN_DOWN(UINT64_MAX, sizeof(span_bytes))};
entry_data = pmemstream_entry_data(stream, entry);

How often bug is revealed:

always

Actual behavior:

__GI___assert_fail ( assertion=0x7ffff7f226c8 "span_get_type(span) == SPAN_ENTRY", 
span_get_entry_runtime ()  at /PMDK/pmemstream/src/span.c:105
pmemstream_entry_data ()  at /PMDK/pmemstream/src/libpmemstream.c:136
invalid_entry_test () at /PMDK/pmemstream/tests/api_c/append_entry.c:129
main () at /Share/PMDK/pmemstream/tests/api_c/append_entry.c:152

Extra test for region allocate/free

I'm not sure if the current region_allocate clears all necessary metadata when taking a region from the free list.

Pseudocode:

r1 = allocate_region and append some data there
r2 = allocate_region and append some data there

reopen()

free_region(r1)
r3 = allocate_region()

if (testcase1) {
	check if data from r1 is not available
	check timestamps
} else if (testcase 2) {
	append some data to r3
	check if data from r1 is not available
	check timestamps
} else if (testcase3) {
	append some data to r2
	check if data from r1 is not available
	check timestamps
}

The above test cases are only for illustration. I think we could implement those tests as part of stateful testing.

test create_0_none fails

test create_0_none fails for RC_PARAMS="seed=4382463535112076010"

Environment Information

  • CI
  • pmemstream version(s): since 63de7d5

Please provide a reproduction of the bug:

cmake ..
make -j$(nproc)
export RC_PARAMS="seed=4382463535112076010"
ctest -R create_0_none

How often bug is revealed:

Always

Details

As 63de7d5 introduces new tests, the bug is probably older.

Output:

Signal: Aborted, backtrace:
0: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (test_sighandler+0x27) [0x5576a72a09e7]
1: /lib/x86_64-linux-gnu/libc.so.6 (killpg+0x40) [0x7f9f7444e24f]
2: /lib/x86_64-linux-gnu/libc.so.6 (gsignal+0xcb) [0x7f9f7444e18b]
3: /lib/x86_64-linux-gnu/libc.so.6 (abort+0x12b) [0x7f9f7442d859]
4: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (UT_FATAL+0xc3) [0x5576a7286593]
5: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (_ZN12return_checkD2Ev+0x3b) [0x5576a7294f2b]
6: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (_ZZ4mainENKUlvE_clEv.isra.0.cold+0x1eb) [0x5576a7283cd3]
7: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (main+0x71) [0x5576a7286091]
8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9f7442f0b3]
9: /opt/workspace/pmemstream_v6/bisect_dir/tests/create (_start+0x2e) [0x5576a728628e]


-- Stderr:
Using configuration: seed=4382463535112076010

- verify if a single region of various sizes (>0) can be created
OK, passed 100 tests

- verify if a region_iterator finds the only region created
OK, passed 100 tests

- verify if a region of size > stream_size cannot be created
OK, passed 100 tests

- verify if a stream of various sizes can be created
OK, passed 100 tests

- verify if a stream of various block_sizes can be created
Falsifiable after 100 tests and 7 shrinks

unsigned long:
503187

/opt/workspace/pmemstream_v6/tests/common/stream_helpers.hpp:25:
RC_ASSERT(pmemstream_region_allocate(stream, region_size, &new_region) == 0)

Expands to:
-1 == 0
/opt/workspace/pmemstream_v6/tests/common/unittest.hpp:92 ~return_check - assertion failure: status, errormsg: 
Some of your RapidCheck properties had failures. To reproduce these, run with:
RC_PARAMS="reproduce=BgjdlJXamlHIpZGIhByc0JXZh1GIvZGI2Fmcp9WdzBiYs92Yr91cppXZzByYh5GIiVGIjJXZhRXZkh7cJf04xT1eZBavBXbr2b6FSHzdkJisMiOebemtu_tRACIgACYAAQCZAAAAH0ACPARESIB"

CMake Error at /opt/workspace/pmemstream_v6/tests/cmake/exec_functions.cmake:205 (message):
   /opt/workspace/pmemstream_v6/bisect_dir/tests/create /opt/workspace/pmemstream_v6/bisect_dir/tests/create_0_none/testfile failed: 134
Call Stack (most recent call first):
  /opt/workspace/pmemstream_v6/tests/cmake/exec_functions.cmake:246 (execute_common)
  /opt/workspace/pmemstream_v6/tests/cmake/run_default.cmake:8 (execute)

Arguments refactor

Simplify arguments passed to functions.

  1. pmemstream_span_create_empty(stream, &stream->data->spans[0], stream->usable_size - metadata_size);
    should be changed to:
    pmemstream_span_create_empty(stream, stream->usable_size - metadata_size);
  2. Change &span[0] -> span

Segmentation fault in pmemstream_offset_to_ptr

ISSUE:

segmentation fault in pmemstream_offset_to_ptr when passing NULL stream

The bug reproduction:

ret = pmemstream_entry_iterator_new(&eiter, NULL, region);

How often bug is revealed:

always

Actual behavior:

Program received signal SIGSEGV, Segmentation fault.
pmemstream_offset_to_ptr () at /PMDK/pmemstream/src/libpmemstream_internal.h:47
return (const uint8_t *)stream->data->spans + offset;
pmemstream_offset_to_ptr () at /Share/PMDK/pmemstream/src/libpmemstream_internal.h:47
span_offset_to_span_ptr () at /PMDK/pmemstream/src/span.c:14
span_get_region_runtime () at /PMDK/pmemstream/src/span.c:118
entry_iterator_initialize () at //PMDK/pmemstream/src/iterator.c:60
pmemstream_entry_iterator_new () at /PMDK/pmemstream/src/iterator.c:83
null_stream_test () at /PMDK/pmemstream/tests/api_c/entry_iterator.c:81
main () at /PMDK/pmemstream/tests/api_c/entry_iterator.c:160

FEAT: Transaction

FEAT: Transaction

Rationale

Allow grouping multiple operations (append/region_allocate/region_free) into a single atomic action.
Requires #78.

Description

API Changes

Additional functions:
pmemstream_tx_begin()
pmemstream_tx_commit()
pmemstream_tx_append()
pmemstream_tx_reserve()

pmemstream_tx_region_allocate()
pmemstream_tx_region_free()

Implementation details

Transaction implementation will be based on timestamps. Each transaction will be assigned a unique timestamp. Committed and persisted timestamps will be used to distinguish between committed and aborted/in-progress transactions.

First steps

Support only for tx_append and tx_reserve.
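
A usage sketch of the proposed API (pseudocode; the exact argument lists are not final and merely follow the functions listed above and the initial API proposal elsewhere on this page):

struct pmemstream_tx *tx;
pmemstream_tx_begin(&tx, stream);

struct pmemstream_entry e1, e2;
pmemstream_tx_append(tx, region, data1, size1, &e1);
pmemstream_tx_reserve(tx, region, size2, &e2);
/* memcpy the payload for e2 into the reserved space here */

/* all operations performed in this tx become visible atomically */
pmemstream_tx_commit(&tx);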

Transactions and concurrent append to a single region

We can use timestamps to implement cross-region transactions and (optionally) combine them with the concurrent append solution:

  • Instead of popcount, we use a timestamp in the entry,
  • A transaction commits (and increments the timestamp) only after all previous operations on all regions used by this tx have completed.

Crash inconsistency in region allocation

ISSUE: Crash inconsistency in region allocation

Environment Information

  • pmemstream version(s): 0.2.0
  • compiler, libraries, packaging and other related tools version(s): gcc, clang

Please provide a reproduction of the bug:

https://github.com/karczex/pmemstream/blob/recovery_on_recovery_test_2TC/tests/integrity/append_to_new_region_pmreorder.cpp

How often bug is revealed:

always

Details

The pmreorder test fails due to an slist invariant violation in the following scenario:

  1. Crash the stream (via pmreorder) during data insertion into regions (allocate regions first, then start appending data).
  2. After the next start of the application, allocate a new region and append data to it
  3. Allocate the next region

Additional information about Priority and Help Requested:

Are you willing to submit a pull request with a proposed change? (Yes, No) Maybe

Requested priority: (Showstopper, High, Medium, Low) High

FEAT: Support for multithreaded region allocation

FEAT: Support for multithreading in region allocator

Rationale

Developers have to ensure thread safety themselves; it is safer to move this responsibility to the library.

Description

For now, the allocator design doesn't support multithreaded adding or removing of regions. Using iterators during allocate/free causes undefined behavior.
Affected API functions:
pmemstream_region_allocate
pmemstream_region_free
and all region iteration functions.

API Changes

There is no need to change API.

Implementation details

N/A

Assertion fault in region_runtime_clear_from_tail()

ISSUE:

assertion fault in region_runtime_clear_from_tail()

The bug reproduction:

struct pmemstream_region invalid_region = {.offset = ALIGN_DOWN(UINT64_MAX, sizeof(span_bytes))};
struct pmemstream_region_runtime *rtm = NULL;
ret = pmemstream_region_runtime_initialize(stream, invalid_region, &rtm);

How often bug is revealed:

always

Actual behavior:

 __GI___assert_fail ("region_runtime_get_state_acquire(region_runtime) == REGION_RUNTIME_STATE_DIRTY")
region_runtime_clear_from_tail () at pmemstream/src/region.c:238
region_runtime_initialize_clear_locked () at pmemstream/src/region.c:273
pmemstream_region_runtime_initialize () at pmemstream/src/libpmemstream.c:212
invalid_region_test () at pmemstream/tests/api_c/region_create.c:65
main () at pmemstream/tests/api_c/region_create.c:88

Optimization ideas

  • use pmem2_memcpy_async (#163) to enable nontemporal stores. We might need to add an extra threshold below which memcpy is performed using temporal stores.
  • process timestamps in batches of arbitrary size during commit/persist operations (to increase concurrency - currently, thread tries to acquire as many timestamps as possible which might result in other threads waiting instead of doing actual work) (#245)
  • properly use notifier in wait_committed/wait_persisted - this can be especially important for SPDK integration (?)

DSA specific:

  • use batching (once implemented in miniasync or use DML directly)
  • for small appends, use normal memcpy and a DML persist operation in the background (combine multiple persists into one in software)

Try running pmemcheck tests with "--mult-stores=yes"

This should provide better test coverage for rapidcheck tests where we reopen the pmemstream multiple times.
This should most likely be run on internal infrastructure as part of stress tests (due to the longer runtime).
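
For example, based on the pmemcheck invocation shown in the pmreorder POC elsewhere on this page (the test binary name here is a placeholder):

valgrind --tool=pmemcheck -q --mult-stores=yes --log-stores=yes --print-summary=no ./some_rapidcheck_reopen_test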

FEAT: Timestamps and async API

FEAT: Timestamps and async API

Rationale

  • Will be used to ensure data consistency instead of popcount: offloads some of the work to a background thread to speed up user threads
  • Foundation for Transactions (#77)

Description

  • Each entry stores a timestamp which describes when the entry was created (all entries created in one tx have the same timestamp)
  • On pmem, we store the latest timestamp which is known to be persistent (all entries with timestamps less than or equal to it are persisted).
  • In DRAM, we store the highest timestamp which is known to be committed.

Timestamps generation and data persistence

  • Data is stored using NONTEMPORAL stores (in pmemstream_append)
  • The timestamp is generated by incrementing a global runtime counter (either in pmemstream_append or on transaction commit) and storing it inside the entry.
  • The persisted timestamp is updated and persisted by a separate async function that scans in-progress regions. This function can be run in an arbitrary thread: https://github.com/pmem/pmemstream/blob/master/examples/04_basic_async/main.cpp

Example of pmemstream with 3 entries: [diagram]
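
A pseudocode sketch of how these timestamps could be consumed by a user thread (function names follow the API proposals elsewhere on this page; polling of the future is simplified):

struct pmemstream_entry entry;
pmemstream_async_append(vdm, region, data, size, &entry); /* returns quickly, data not yet committed */

uint64_t ts = pmemstream_entry_timestamp(entry);

/* Wait until every entry with timestamp <= ts is committed; persistence can be
 * awaited the same way with pmemstream_async_wait_persisted(). */
struct pmemstream_async_wait_fut fut = pmemstream_async_wait_committed(stream, ts);
poll_until_complete(&fut); /* placeholder for polling the miniasync future */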

Update README file

It'd be nice to extend the content of our README, proposed updated content:

  • extend the lib description,
  • differences to other, existing libraries/solutions,
  • add info about write ahead/redo log + perhaps some link,
  • mention pmemlog,
  • add an entry point to examples (extend their description? perhaps not in the top-level README),
  • add a fancy image/gif showing how pmemstream works,
  • value proposition, e.g.:
    • low latency writes,
    • ...but depends on medium capabilities,
    • low write amplification,
    • low SW overhead = little metadata per entry,
    • transparent usage of DSA/HW accelerators,
    • it's universal and easy to use,
    • does not require (extensive?) PMEM knowledge,
  • add a section about testing? rather perhaps write a new blog post about it,
  • ...

references:
https://pmem.io/blog/2022/01/introduction-to-pmemstream/
https://pmem.io/announcements/2022/pmemstream-v0-2-0-release/
manpages

Add data structure visualization tool

Data structure visualization tool as example

Rationale

It would be useful as a debug tool and could also be used as part of documentation generation.

Description

Output may be inspired by tree (the Linux tool):

.
├── stream
│   ├── region1
│   │   ├── entry1: offset; length; content
│   │   ├── entry2: offset; length; content
│   │   ├── entry3: offset; length; content
│   │   ├── entry4: offset; length; content
│   │   ├── entry5: offset; length; content
│   │   ├── entry6: offset; length; content
│   │   ├── entry7: offset; length; content
│   │   ├── entry8: offset; length; content
│   │   └── entry9: offset; length; content
│   ├── region2
│   │   ├── entry1: offset; length; content
│   │   ├── entry2: offset; length; content
│   │   └── entry3: offset; length; content
│   ├── region3
│   │   ├── entry1: offset; length; content
│   │   ├── entry2: offset; length; content
│   │   ├── entry3: offset; length; content
│   │   ├── entry4: offset; length; content
│   │   ├── entry5: offset; length; content
│   │   ├── entry6: offset; length; content
│   │   └── entry7: offset; length; content
│   └── region4
│       ├── entry1: offset; length; content
│       ├── entry12: offset; length; content
│       ├── entry2 : offset; length; content
│       ├── entry3 : offset; length; content
│       ├── entry4: offset; length; content
│       ├── entry5: offset; length; content
│       ├── entry6: offset; length; content
│       ├── entry7: offset; length; content
│       ├── entry8: offset; length; content
│       ├── entry9: offset; length; content
│       ├── entry10: offset; length; content
│       ├── entry11: offset; length; content
│       └── entry12: offset; length; content

4 regions, 32 entries

API Changes

None

Implementation details

Iterate over all regions and entries and print the hierarchy, length and content.
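
A sketch of the core loop, mirroring the iteration pattern from the initial API proposal further down this page (printing of lengths and content is elided):

void dump_stream(struct pmemstream *stream)
{
	struct pmemstream_region_iterator *riter;
	pmemstream_region_iterator_new(&riter, stream);

	struct pmemstream_region region;
	size_t region_no = 1;
	while (pmemstream_region_iterator_next(riter, &region) == 0) {
		printf("├── region%zu\n", region_no++);

		struct pmemstream_entry_iterator *eiter;
		pmemstream_entry_iterator_new(&eiter, stream, region);

		struct pmemstream_entry entry;
		size_t entry_no = 1;
		while (pmemstream_entry_iterator_next(eiter, NULL, &entry) == 0) {
			const void *data = pmemstream_entry_data(stream, entry);
			printf("│   ├── entry%zu: offset %lu\n", entry_no++, entry.offset);
			(void)data; /* length/content printing elided */
		}
		pmemstream_entry_iterator_delete(&eiter);
	}
	pmemstream_region_iterator_delete(&riter);
}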

some pmreorder tests are run with ASAN

ISSUE: pmreorder tests are run with ASAN

Environment Information

Please provide a reproduction of the bug:

CMAKE options:
CC=clang
CXX=clang++
BUILD_TESTS=ON
CMAKE_BUILD_TYPE=Debug
TESTS_PMREORDER=ON
TESTS_USE_FORCED_PMEM=ON
TESTS_USE_VALGRIND=ON
USE_ASAN=ON
USE_UBSAN=ON

How often bug is revealed:

always

Actual behavior:

-- Stderr:
==559733==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==559733==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==559733==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==559733==See https://github.com/google/sanitizers/issues/856 for possible workarounds.

and finally:

The following tests FAILED:
         70 - singly_linked_list_pmreorder_0_none (Failed)
         71 - singly_linked_list_pmreorder_negative_none (Failed)
         72 - multi_region_pmreorder_0_none (Failed)
         76 - append_break_lazy_1_none (Failed)

Expected behavior:

Not-applicable tests should be skipped like the following:

    Start 63: timestamp_0_memcheck_SKIPPED_BECAUSE_SANITIZER_USED
2/3 Test #63: timestamp_0_memcheck_SKIPPED_BECAUSE_SANITIZER_USED ....   Passed    0.02 sec
    Start 64: timestamp_0_pmemcheck_SKIPPED_BECAUSE_SANITIZER_USED
3/3 Test #64: timestamp_0_pmemcheck_SKIPPED_BECAUSE_SANITIZER_USED ...   Passed    0.02 sec

Details

It is possibly enough to add an exclusion in:

  • tests/CMakeLists.txt:158
if(TESTS_PMREORDER)
  • tests/CMakeLists.txt:167
if(GDB AND DEBUG_BUILD)

Additional information about Priority and Help Requested:

Are you willing to submit a pull request with a proposed change? Yes

Requested priority: Low

Initial API proposal

struct data_entry {
	uint64_t data;
};

struct pmem2_map *map = map_open(argv[1]);

if (map == NULL)
	return -1;

struct pmemstream *stream;
pmemstream_from_map(&stream, 4096, map);

struct pmemstream_tx *tx;
pmemstream_tx_new(&tx, stream);

struct pmemstream_region new_region;
pmemstream_tx_region_allocate(tx, stream, 4096, &new_region);

struct pmemstream_region_context *rcontext;
pmemstream_region_context_new(&rcontext, stream, new_region);

struct data_entry e;
e.data = 1;

struct pmemstream_entry new_entry;
pmemstream_tx_append(tx, rcontext, &e, sizeof(e), &new_entry);
pmemstream_tx_reserve(tx, rcontext, sizeof(e), &new_entry);
pmemstream_tx_commit(&tx);

pmemstream_append(rcontext, &e, sizeof(e), &new_entry);

struct data_entry *new_data_entry = pmemstream_entry_data(stream, new_entry);
printf("new_data_entry: %lu\n", new_data_entry->data);
pmemstream_region_context_delete(&rcontext);

pmemstream_tx_region_free(tx, new_region);
pmemstream_region_clear(stream, new_region);

size_t size = pmemstream_region_size(stream, new_region);

struct pmemstream_region_iterator *riter;
pmemstream_region_iterator_new(&riter, stream);
struct pmemstream_region region;

while (pmemstream_region_iterator_next(riter, &region) == 0) {
	struct pmemstream_entry entry;
	struct pmemstream_entry_iterator *eiter;
	pmemstream_entry_iterator_new(&eiter, stream, region);
	uint64_t last_entry_data;
	while (pmemstream_entry_iterator_next(eiter, NULL, &entry) == 0) {
		struct data_entry *d = pmemstream_entry_data(stream, entry);
		printf("data entry %lu: %lu in region %lu\n",
		       entry.offset, d->data, region.offset);
		last_entry_data = d->data;
	}
	pmemstream_entry_iterator_delete(&eiter);
}

pmemstream_delete(&stream);
pmem2_map_delete(&map);

Based on: https://github.com/pbalcer/nvml/blob/0149cf71d9377aab835516b8cd79cc9ddc601744/src/examples/libpmem2/pmemstream/pmemstream.h

FEAT: Concurrent append to a single region

FEAT: Concurrent append to a single region

Rationale

  • Provide better throughput by allowing multiple appends to be issued (from a single thread or multiple threads)
  • Support parallel workloads which cannot be divided into multiple regions

Description

To make sure pmemstream is contiguous (there are no holes in it), appends to a single region must be synchronized somehow. However, synchronization should not result in memcpy operations being serialized.

The idea is to synchronize appends just before returning (or marking the committed future, #76, as completed).
The code in the implementation details section explains it in more detail.

API Changes

None.

Implementation details

pmemstream_append(..., data, size) {
    // acquire an offset for this append (only this step is synchronized)
    uint64_t offset = acquire_offset(size);
    uint64_t popcount = popcount_memory(data, size);

    // write entry metadata (size, popcount) followed by the data itself;
    // memcpy of different appends can proceed concurrently
    *(stream->data + offset) = size;
    *(stream->data + offset + 8) = popcount;
    pmem2_memcpy(stream->data + offset + 16, data, size);

    // data at `offset` is published
    publish_offset(offset, size);

    // instead of calling this function directly, we could return a future here
    wait_for_prev_appends(offset);
} // pmemstream_append_finished

wait_for_prev_appends(offset) {
    // block until all appends at lower offsets have been published,
    // so the stream stays contiguous (no holes) up to `offset`
    uint64_t ready_offset;
    size_t size;
    do {
        size = consume_offset(&ready_offset);
    } while (ready_offset + size < offset);
}

Change iterators API

FEAT: New iterators (for both region and entry) API:

int pmemstream_region_iterator_is_valid(struct pmemstream_region_iterator *iterator);

void pmemstream_region_iterator_seek_first(struct pmemstream_region_iterator *iterator);

void pmemstream_region_iterator_next(struct pmemstream_region_iterator *iterator);

struct pmemstream_region pmemstream_region_iterator_get(struct pmemstream_region_iterator *iterator);



int pmemstream_entry_iterator_is_valid(struct pmemstream_entry_iterator *iterator);

void pmemstream_entry_iterator_seek_first(struct pmemstream_entry_iterator *iterator);

void pmemstream_entry_iterator_next(struct pmemstream_entry_iterator *iterator);

struct pmemstream_entry pmemstream_entry_iterator_get(struct pmemstream_entry_iterator *iterator);

Rationale

Such a change enables more intuitive usage of iterators:

  1. For-loop based iteration:
for (pmemstream_region_iterator_seek_first(riter); pmemstream_region_iterator_is_valid(riter) == 0;
     pmemstream_region_iterator_next(riter)) {
	struct pmemstream_region region = pmemstream_region_iterator_get(riter);
	/* Do something with region */
}
  2. Simpler access to regions/entries when the iterator is stored in a structure.
  3. Simpler signature for the next() functions.

FEAT: Multi-process support (single writer, multiple readers)

FEAT: Multi-process support

Rationale

With multi-process support (single writer, multiple readers) consumers can attach dynamically to pmemstream. Each consumer might use a different set of dependencies.

One interesting use case is using a separate consumer process for replication - the application would constantly iterate over the stream and send data to a replica (e.g. using RDMA).

Description

  • Single writer/producer process
  • Multiple reader/consumer processes (can use iterators)
  • All runtime data stored in shared memory
  • Handle runtime data lifetime

API Changes

An additional method for attaching to stream in read-only mode.

Implementation details

All runtime metadata (including region_runtime's) must be kept in shared memory.

First steps

  • A single writer process per entire stream. In the future we might consider adding support for a single writer process per region.
  • The writer process will be responsible for runtime data lifetime management. In the future we might consider refcounts.

FEAT: Async API

FEAT: Async API

Rationale

  • Allow the user to perform computation while waiting on append completion.
  • Allow the user to issue multiple concurrent appends to a single region and take advantage of DSA
  • Expose more states of the append operation to the user. Right now, if an append completes, the data can be treated as committed and persisted. With an async API we might expose the committed and persisted states separately (as futures).

Description

The async API would use the miniasync framework to expose committed and persisted futures.

API Changes

Expose futures in append and publish.

Example (pseudocode)

future = pmemstream_append();
// no guarantees here

wait(future, COMMITTED);
// data committed (can be safely read), possibly not persisted

wait(future, PERSISTED);
// data committed and persistent

Add get_usable_size() function

FEAT: Add get_usable_size() function

Rationale

Currently it's hard to guess a region size which fits into the stream.

Description

Expose a usable_size value (aligned to the block size) in the API.

API Changes

add get_max_region_size()

Add checksum to provide consistency for metadata

FEAT: Checksum for metadata consistency

Rationale

Let's imagine a case where metadata is corrupted in such a way that only the size or popcount is broken.
In such a case you can get a "correct" iterator with data exceeding the original data.

Description

  • To provide better consistency, a mechanism like a checksum for metadata memory blocks should be added.
  • Provide a mechanism that wouldn't need to check the checksum twice (iterator + data retrieval using pmemstream_entry_data).

API Changes

Just extend the metadata structure and add a function to calculate a checksum over a memory range, on both the read and write paths.

Implementation details

To be discussed later.
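
A rough illustration of the direction (the metadata layout and the checksum algorithm below are placeholders, not the actual on-media format):

#include <stddef.h>
#include <stdint.h>

/* Placeholder entry metadata: the checksum covers both the metadata fields and
 * the entry data, so a torn/corrupted size cannot yield a "valid" entry. */
struct entry_metadata {
	uint64_t size;
	uint64_t timestamp;
	uint64_t checksum;
};

/* Simple 64-bit FNV-1a as a stand-in for the real checksum. */
static uint64_t checksum64(const void *buf, size_t len, uint64_t seed)
{
	const uint8_t *p = buf;
	uint64_t h = seed ^ 0xcbf29ce484222325ULL;
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}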

[RC tests] timestamp order with multiple threads and multiple regions tests

FEAT: timestamp order with multiple threads and multiple regions tests

Description

Prerequisites:
(Prereq 1) Add a generator derived from pmemstream_test_base for this case:

  • Generates multiple regions
  • Appends entries to the regions synchronously from multiple threads

(Prereq 2) Add a generator derived from pmemstream_test_base for this case:

  • Generates multiple regions
  • Appends entries to the regions asynchronously from multiple threads

Cases for (prereq 1 and 2):
Case 1:
In every region, all timestamps increase.

Case 2:
Across the whole stream, all timestamps increase (duplicates are possible).

Case 3:
Add regions, remove a single region with entries, add an empty region and then add new entries, then check Case 2.

Implementation details

N/A

Add logging mechanism

We could add logging similar to e.g. libpmem2 (LOG macro and all its helpers):
https://github.com/pmem/pmdk/blob/master/src/libpmem2/libpmem2.c#L29

This could be used for logging, e.g., issues introduced by a user's improper behavior, like when the user:

  • calls pmemstream_reserve,
  • forgets to (or improperly) memcpy data into the reserved space,
  • calls pmemstream_publish,
  • calls pmemstream_entry_iterator_next and gets an assertion error:
    • src/iterator.c:146: pmemstream_entry_iterator_next: Assertion `validate_entry(iterator->stream, entry) == 0' failed.

This is exactly the place to report a broader message to the user, with proper information like: data corruption or improper append happened....
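
A minimal sketch of what such a macro could look like (the libpmem2 implementation linked above is more elaborate; the PMEMSTREAM_LOG_LEVEL variable, level handling and output destination here are simplified assumptions):

#include <stdio.h>
#include <stdlib.h>

/* Print only when the (hypothetical) PMEMSTREAM_LOG_LEVEL environment
 * variable is set to at least `level`. */
#define LOG(level, format, ...)                                               \
	do {                                                                  \
		const char *env_ = getenv("PMEMSTREAM_LOG_LEVEL");           \
		if (env_ && atoi(env_) >= (level))                            \
			fprintf(stderr, "%s:%d %s: " format "\n", __FILE__,  \
				__LINE__, __func__, ##__VA_ARGS__);          \
	} while (0)

/* Example use at the spot where the assert currently fires: */
/* LOG(1, "invalid entry at offset %lu - data corruption or improper append", entry.offset); */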

Simplify pmemstream API

FEAT: Simplify pmemstream API

Rationale

The fact that a particular entry is located in a particular region, which is part of a particular stream instance, should be reflected in the API.

Currently, functions in pmemstream unnecessarily take too many arguments, e.g.: int pmemstream_append(struct pmemstream *stream, struct pmemstream_region region, struct pmemstream_region_runtime *region_runtime, const void *data, size_t size, struct pmemstream_entry *new_entry), which is error-prone. It's super easy to pass a mismatched entry, region and region runtime.

Description

  • Make region and entry structures purely runtime.
  • Use pointers to regions and entries instead of offsets (allows removing a lot of offset_to_ptr calls).
  • Add serialization/deserialization functions, which would operate on offsets.

API Changes

Signature change

int pmemstream_region_allocate(struct pmemstream *stream, size_t size, struct pmemstream_region *region);

int pmemstream_region_free(struct pmemstream_region region);
size_t pmemstream_region_size( struct pmemstream_region region);


size_t pmemstream_region_usable_size(struct pmemstream_region region);


int pmemstream_reserve(struct pmemstream_region region, size_t size,
		       struct pmemstream_entry *reserved_entry);


int pmemstream_publish(struct pmemstream_entry reserved_entry);

int pmemstream_append(const struct pmemstream_region region, const void *data, size_t size,
		      struct pmemstream_entry *new_entry);

int pmemstream_async_publish(struct pmemstream_entry entry);

int pmemstream_async_append(struct vdm *vdm, const struct pmemstream_region region, const void *data, size_t size,
			    struct pmemstream_entry *new_entry);

uint64_t pmemstream_committed_timestamp(struct pmemstream *stream);

uint64_t pmemstream_persisted_timestamp(struct pmemstream *stream);

struct pmemstream_async_wait_fut pmemstream_async_wait_committed(struct pmemstream *stream, uint64_t timestamp);

struct pmemstream_async_wait_fut pmemstream_async_wait_persisted(struct pmemstream *stream, uint64_t timestamp);

const void *pmemstream_entry_data(struct pmemstream_entry entry);

size_t pmemstream_entry_size(struct pmemstream_entry entry);

uint64_t pmemstream_entry_timestamp(struct pmemstream_entry entry);

Functions to be removed

	pmemstream_region_runtime_initialize()

Functions to be added

These functions are needed to store offsets as separate metadata, not tied to an actual address:

	size_t pmemstream_region_offset(const struct pmemstream_region);
	size_t pmemstream_entry_offset(const struct pmemstream_entry);

	int pmemstream_region_from_offset(size_t offset, struct pmemstream_region *region); // checking whether the offset really points to a region requires reading metadata from pmem

	int pmemstream_entry_from_offset(size_t offset, struct pmemstream_entry *entry); // checking whether the offset really points to an entry requires reading metadata from pmem

Implementation details

Change the pmemstream_region and pmemstream_entry structures to be purely runtime and to operate on pointers instead of offsets:

/* Runtime structure */
struct pmemstream_region {
	struct pmemstream *stream;

	struct pmemstream_region_runtime *region_runtime;

	/* Pointer to region on pmem */
	void *data;
};

/* Runtime structure */
struct pmemstream_entry {

	struct pmemstream_region region;

	 /* raw pointer to entry on pmem */
	 void *data;

	/* possible optimization: store in runtime structure size and timestamp to minimize entry metadata
	 * reads from pmem */
	size_t size;
	uint64_t timestamp;
};

Change entry_iterator behavior

FEAT: Change entry_iterator behavior

Rationale

Right now, when pmemstream_entry_iterator_next is called and there are no more entries, the user-provided entry will be set to point past the last entry. This is problematic if a user wants to get the last entry from the region. They can't just do:

	struct pmemstream_entry last_entry;
	while (pmemstream_entry_iterator_next(eiter, nullptr, &last_entry) == 0) {
	}

Because after pmemstream_entry_iterator_next returns -1, last_entry will be invalid.

Change this behavior to not modify the offset when there are no more elements.

API Changes

None

Extend pmemstream customizability

FEAT: Extend pmemstream customizability

Rationale

There are multiple internal parameters that impact performance and are currently hardcoded. For example, MAX_CONCURRENCY or BATCH_SIZE for commit/persist processing. We should allow users to customize those values.

API Changes

A config structure plus a change in the pmemstream_open function to accept the config.
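
A sketch of what such a config could look like (struct members and function names below are hypothetical, not an agreed-upon API):

struct pmemstream_config {
	size_t max_concurrency; /* currently the hardcoded PMEMSTREAM_MAX_CONCURRENCY */
	size_t batch_size;      /* batch size for commit/persist processing */
};

int pmemstream_config_new(struct pmemstream_config **config);
int pmemstream_config_set_max_concurrency(struct pmemstream_config *config, size_t value);
int pmemstream_config_set_batch_size(struct pmemstream_config *config, size_t value);

/* The open/create entry point would accept the config in addition to the map. */
int pmemstream_from_map_with_config(struct pmemstream **stream, size_t block_size,
				    struct pmem2_map *map,
				    const struct pmemstream_config *config);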
