Rewriting main-content / comments structure. about pttbbs HOT 13 CLOSED

ptt commented on August 11, 2024

Rewriting main-content / comments structure.

from pttbbs.

Comments (13)

robertabcd commented on August 11, 2024

I don't fully understand your design. Do you mean to store comments into a separate record-based file? I also don't see why "comment" and "comment-reply-from-poster" need to be two separate operations. (For example, we could have a parent id for each comment, and if that's -1 or something, it links to the post.)
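
A minimal sketch of the parent-id idea above, assuming one record type is used for both comments and poster replies. The names `CommentRecord`, `POST_PARENT_ID` and `group_replies` are hypothetical and do not come from pttbbs.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical sentinel: the record attaches to the post itself.
POST_PARENT_ID = -1


@dataclass
class CommentRecord:
    comment_id: int
    parent_id: int  # POST_PARENT_ID, or the comment_id being replied to
    author: str
    text: str


def group_replies(records: List[CommentRecord]) -> Dict[int, List[CommentRecord]]:
    """Group records by parent_id in one pass, preserving append order."""
    children: Dict[int, List[CommentRecord]] = {}
    for rec in records:
        children.setdefault(rec.parent_id, []).append(rec)
    return children


records = [
    CommentRecord(0, POST_PARENT_ID, "alice", "nice post"),
    CommentRecord(1, POST_PARENT_ID, "bob", "disagree"),
    CommentRecord(2, 1, "poster", "reply to bob"),
]
children = group_replies(records)
top_level = children.get(POST_PARENT_ID, [])
```

With this shape, a poster reply is just a record whose `parent_id` points at another comment, so no second operation or file is needed to represent it.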

A big issue for this record-based design is that it takes a full file scan to construct all links, and only after that can the pager display content. I know pmore already scans the whole file to calculate the number of lines, but this is something I would like to avoid if it's going to be rebuilt.

IMO, we don't necessarily need to convert old posts into the new format. We can keep them as is, if this enables a cleaner design.


chhsiao1981 commented on August 11, 2024

I'll address comment vs. comment-reply first, and then the question of separate files, so that the context stays together.

for the separating comment and comment-reply:
1.1. Sometimes the poster would like to write a multi-line comment-reply (this is possible because of the poster's "edit" privilege), so the structure of a comment-reply may differ from that of a comment.
1.2. FB also provides comment-reply (and only 1 level of comment-reply), so there may well be a need for comment-reply.
1.3. We disallow ctrl-chars in comments, and we will disallow ctrl-chars in comment-reply as well. This simplifies the page calculation.
Because of the max-length of the commenting, and commenting is append-only-not-editable-op:
2.1. For file-based storage: I recommend storing comments in one big file, separate from main-content and comment-reply. (So there will be 3 files: main-content, comments, comment-reply.)
For the comments file:
We can store each record as length + comment, which makes it easy to fseek.
(Assuming that fseek does not imply a full-file scan.)
For the comment-reply file:
We will also keep a separate index file, indexed by comment-id.
(So there will actually be 4 files per article: main-content, comments, comment-reply, comment-reply.idx.)
With that we can easily retrieve the corresponding comment-reply.
(With the current UI, we allow only 1 comment-reply per comment, and the comment-reply is editable.)
2.2 for the DB-based storage: DB will efficiently utilize mem. Each comment and comment-reply will be stored as separated record in DB.
We don't need to construct all links, we just need to load pre-page, current-page, next-page.
Basically, the pre-page, current-page, and next-page are recalculated whenever the page needs to be refreshed.
If main-content or a comment-reply is inserted and throws off the paging, the paging recovers on the next refresh.
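
A minimal sketch of the "length + comment" record layout from 2.1, assuming a 4-byte little-endian length prefix. The prefix size and the helper names `append_comment`, `iter_offsets` and `read_comment` are assumptions made for illustration; the thread does not fix a concrete on-disk format here.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

LEN_FMT = "<I"  # assumed: 4-byte little-endian length prefix
LEN_SIZE = struct.calcsize(LEN_FMT)


def append_comment(f: BinaryIO, comment: bytes) -> int:
    """Append one record and return the offset where it starts."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(struct.pack(LEN_FMT, len(comment)))
    f.write(comment)
    return offset


def iter_offsets(f: BinaryIO) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) per record, reading only the length prefixes."""
    f.seek(0, io.SEEK_END)
    end = f.tell()
    pos = 0
    while pos + LEN_SIZE <= end:
        f.seek(pos)
        (length,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
        yield pos, length
        pos += LEN_SIZE + length  # hop over the body without reading it


def read_comment(f: BinaryIO, offset: int) -> bytes:
    """Read the record that starts at a known offset."""
    f.seek(offset)
    (length,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
    return f.read(length)


buf = io.BytesIO()
offsets = [append_comment(buf, c) for c in (b"first", b"second one", b"3rd")]
```

Note that with variable-length records, reaching the n-th record still means hopping over n length prefixes unless the offsets are kept somewhere, which is the role a side index such as comment-reply.idx would play.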


robertabcd commented on August 11, 2024

> for the separating comment and comment-reply:

Those limitations do not necessarily need to be imposed on the data structure. I have wanted to allow multi-line comments for a very long time. I would rather give both data structures this flexibility.

> Because of the max-length of the commenting, and commenting is append-only-not-editable-op:

I believe 4 files will require a lot of space in the file system. Unless you have another plan for storing these files, I doubt it can go to production. Maintaining comment-reply.idx and keeping it consistent with the comments under the current multi-process architecture is very hard to get right. I'd discourage this approach.

> We don't need to construct all links, we just need to load pre-page, current-page, next-page.

I don't understand. At the very least, pmore needs to know the total number of lines. How can this be calculated without reading everything? (Unless you want to propose caching, which would be a big hassle for maintaining consistency.)


robertabcd commented on August 11, 2024

> 2.2 for the DB-based storage: DB will efficiently utilize mem. Each comment and comment-reply will be stored as separated record in DB.

We will need to figure out the memory usage on mbbsd as well. (May not have mmap to use)


chhsiao1981 commented on August 11, 2024

The following code is the proposed code for separating main-content / comments:
(comment-reply will be extracted within the comments-block)

https://github.com/chhsiao1981/pttbbs/blob/hsiao.sep_main_content_comments/util/pyutil/sep_main_comments.py


chhsiao1981 commented on August 11, 2024

After looking at 10 sample posts, I would like to propose the following policies:

Goal: for all posts so far, the content can be split into the following components:
1. main-content: everything up to the last origin/from.
1.1. origin: the line starting with '※ 發信站' (origin site).
1.2. from: the lines after origin and before the comments.
2. comments: 推 (recommend) / 噓 (boo) / → (arrow) / 轉錄至看板 (forwarded to board).
3. comment-reply: content the poster edited in after the comments, as a reply to those comments.

Method:
1. Find the last origin and the "from" that follows it. All content before the last origin is considered main-content. All content after "from" is considered the comment-block.
2. For main-content: we try to differentiate among main-content, origin, and from.
For the comment-block: we differentiate between comments and comment-replies.

TODO:
1. Check for "推 / 噓 / →" lines inside the main-content, and check whether it is valid to keep those lines in main-content.
2. UI: if pmore reads a single file, convert it to main-content / comments / comment-reply and then run the new version of pmore on the result.
3. UI: when editing, edit by block and enforce the corresponding boundary (in main-content, edit only within main-content; in comments, edit only the corresponding comment).

refer to:

https://github.com/chhsiao1981/pttbbs/blob/hsiao.sep_main_content_comments/util/pyutil/sep_main_comments.py
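
The method above can be sketched as follows. This is an illustration written for this thread, not the linked script: it assumes the post is already decoded to text with no ANSI colour codes, and the helpers `is_comment_line`, `split_post` and `split_comment_block`, as well as the sample post, are hypothetical.

```python
from typing import List, Tuple

ORIGIN_PREFIX = "※ 發信站"
COMMENT_HEADS = ("推 ", "噓 ", "→ ")
FORWARD_MARK = "轉錄至看板"


def is_comment_line(line: str) -> bool:
    """Assumed rule: a comment starts with 推/噓/→ or is a forwarding notice."""
    return line.startswith(COMMENT_HEADS) or FORWARD_MARK in line


def split_post(text: str) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Return (main_content, origin, from_lines, comment_block)."""
    lines = text.splitlines()
    origin_idx = max(
        (i for i, ln in enumerate(lines) if ln.startswith(ORIGIN_PREFIX)),
        default=None,
    )
    if origin_idx is None:
        return lines, [], [], []
    # "from" = lines after the last origin and before the first comment.
    block_start = origin_idx + 1
    while block_start < len(lines) and not is_comment_line(lines[block_start]):
        block_start += 1
    return (
        lines[:origin_idx],
        [lines[origin_idx]],
        lines[origin_idx + 1:block_start],
        lines[block_start:],
    )


def split_comment_block(block: List[str]) -> Tuple[List[str], List[str]]:
    """Separate comments from poster replies inside the comment-block."""
    comments = [ln for ln in block if is_comment_line(ln)]
    replies = [ln for ln in block if ln.strip() and not is_comment_line(ln)]
    return comments, replies


sample = "\n".join([
    "標題: test",
    "hello world",
    "※ 發信站: 批踢踢實業坊(ptt.cc)",
    "◆ From: 127.0.0.1",
    "推 alice: good",
    "poster reply line",
    "噓 bob: bad",
])
main, origin, from_lines, block = split_post(sample)
comments, replies = split_comment_block(block)
```

Because the split keys on the last origin line, a line starting with 推 / 噓 / → that appears earlier in the post stays in main-content, which is the case the first TODO item asks to verify.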


chhsiao1981 commented on August 11, 2024

The following code illustrates the format of the separated files: main-content, comments, comment-reply, and the index of comment-reply.

I feel that the editing UI needs to be done before converting the file format, so I'll start with the editing UI.

Editing UI (2-level editing):
After pressing 'E' to enter editing mode:

1. Choose either the main-content or any one of the comments to edit / reply to.
2. Do the editing.
3. After saving, go back to step 1.
4. A hotkey exits editing mode.

https://github.com/chhsiao1981/pttbbs/blob/hsiao.sep_main_content_comments/util/pyutil/sep_main_comments.py


chhsiao1981 commented on August 11, 2024

The following is the current progress of the development:

https://github.com/chhsiao1981/pttbbs/tree/hsiao.edit_test

It looks like lines can be separated while continuously reading partial chunks from the buffer.
It looks like each line can be classified as recommend (good) / boo (bad) / comment (arrow) / forwarding.
The files can (mostly) be split successfully into main-content / comments / comment-reply,
and some unit tests pass, based on the googletest unit-test framework.

#34
Defined the structure of the current version of file-headers / record-structure.

https://github.com/chhsiao1981/pttbbs/blob/hsiao.edit_test/include/migrate_merge.h
The implementation of merge is done too but not yet tested; I will provide the unit tests soon.
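
A minimal sketch of splitting lines from partially read buffers and classifying each line, as described in the progress notes above. It is written in Python for illustration and is not the code on the branch; it assumes UTF-8 input without ANSI colour codes, and `iter_lines`, `classify` and `KINDS` are hypothetical names. The 8192-byte cap is the per-line limit mentioned in this comment.

```python
from typing import Iterable, Iterator

MAX_LINE_BYTES = 8192  # per-line cap mentioned in this comment

KINDS = {"推": "recommend", "噓": "boo", "→": "arrow"}


def iter_lines(chunks: Iterable[bytes], max_line: int = MAX_LINE_BYTES) -> Iterator[bytes]:
    """Yield complete lines (without the newline) from arbitrarily cut chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        while True:
            nl = pending.find(b"\n")
            if nl < 0:
                break
            yield pending[:nl].rstrip(b"\r")
            pending = pending[nl + 1:]
        if len(pending) > max_line:
            raise ValueError("line exceeds max_line bytes")
    if pending:
        yield pending  # final line without a trailing newline


def classify(line: bytes) -> str:
    """Classify one whole line; decoding is safe because lines are complete."""
    text = line.decode("utf-8", errors="replace")
    if "轉錄至看板" in text:
        return "forward"
    if text[:1] in KINDS and text[1:2] == " ":
        return KINDS[text[:1]]
    return "other"


data = "推 a: x\n噓 b: y\n→ c: z\nplain".encode("utf-8")
chunks = [data[i:i + 5] for i in range(0, len(data), 5)]  # cuts mid-character
lines = list(iter_lines(chunks))
kinds = [classify(ln) for ln in lines]
```

Buffering until a newline arrives means a chunk boundary can fall in the middle of a multi-byte character without corrupting the decoded line.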

TODO:

1. Deal with trailing empty lines.
2. "轉錄自 xxx 信箱" (forwarded from xxx's mailbox) may have the same effect as "發信站" (origin site). I'll look into this further.
3. Ignore the "編輯" (edit) lines.
4. The max line length is restricted to 8192 bytes. Need to check whether this fits all current posts.
5. comment-reply is currently restricted to 8192 bytes. Need to check the max size of existing comment-replies.
6. Implement the edit UI.


robertabcd commented on August 11, 2024

I still have concerns for this design. Please see my comments above.


chhsiao1981 commented on August 11, 2024

hihi @robertabcd,

Thank you so much for the comments!

What I observe is that reading and commenting make up the majority of the ops in ptt,
but single-file storage may make the disks go through main-content a lot, which may result in unnecessary disk ops.

On the other hand, a post forum like this has lots of cold data.
We may be able to keep some hot data in mem to reduce disk ops.

The ultimate goal is to use db-based storage, so that we get the db's native built-in memory index / cache
and ptt becomes horizontally scalable.
(My hunch is that it's about time for ptt to become horizontally scalable.)

Basically:

1. For separating comments / comment-reply:
=> These can basically be integrated into the same data structure, but the data structure then still needs a column named "type" to tell whether a record is a comment or a comment-reply from the poster.
2. For the 4-file thing and the comment-reply.idx thing:
=> This prepares for separating main-content from comments / comment-reply.
The goal of separating main-content / comments / comment-reply is to make the storage record-based and then migrate to clustered db-based storage.

=> Having more files may actually be good for the disks.
For example, the disks do not need to go through the main-content when appending comments to the end of the file.

=> For comment-reply.idx:
Currently comment-reply is editable only by the poster / board-admin / sys-admins.
We still need to take care of the mutex, but unlike commenting, the probability of locking is much lower.

In addition, comment-reply.idx can be treated as a cache, and we can do some primitive checksums (such as the size of comments, a magic start header, etc.) to check the consistency of comment-reply.idx.

(This is in some sense using multiple files to reduce the locking issue.)
3. For the total number of lines:
With current technology, we can simply fix the display line at 80 chars.
(To my knowledge, ptt currently assumes a max display line of 80 chars,
and with current technology we can drop support for display lines < 80 chars.)

For main-content: we can store the total-lines at the beginning of the file as a cache.
For comments: we can likewise store the total-lines at the beginning of the file as a cache (and each comment occupies only one line).
For comment-reply: we can store the total-lines of each record at the beginning of the record, and the total-lines of all comment-replies at the beginning of the file.

With this information, we can easily compute the line offset of each record (main-content, comments, comment-reply) and easily retrieve the offset of each line.
4. For the mem usage in ptt:
Basically this prepares for a horizontally scalable design.
We expect the total mem usage across ptt and db to grow, but the mem usage in ptt itself will not necessarily increase much under the new design (most of the mem usage will be in the db cache), and I think the new design will reduce unnecessary disk ops and make the system more stable.
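
A minimal sketch of how cached per-record line counts could map a display line to a record, assuming a fixed 80-column display. The helpers `display_rows`, `build_row_index` and `locate` are hypothetical, and the width calculation (2 columns for East Asian wide characters) is only an approximation of how a terminal wraps text.

```python
import unicodedata
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

DISPLAY_WIDTH = 80  # fixed display width assumed in this comment


def columns(line: str) -> int:
    """Approximate terminal columns: wide/fullwidth characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in line)


def display_rows(text: str, width: int = DISPLAY_WIDTH) -> int:
    """Rows needed for text; this is the count that would be cached per record."""
    return sum(max(1, -(-columns(ln) // width)) for ln in text.split("\n"))


def build_row_index(row_counts: List[int]) -> List[int]:
    """starts[i] is the first display row of record i; starts[-1] is the total."""
    return [0, *accumulate(row_counts)]


def locate(starts: List[int], row: int) -> Tuple[int, int]:
    """Map a global display row to (record index, row within that record)."""
    if not 0 <= row < starts[-1]:
        raise IndexError(row)
    i = bisect_right(starts, row) - 1
    return i, row - starts[i]


# Example: main-content (3 rows), two one-line comments, a 2-row comment-reply.
row_counts = [3, 1, 1, 2]
starts = build_row_index(row_counts)
```

With the counts cached, finding the record for any display row is a binary search over the prefix sums rather than a scan of the content; the cached counts still have to be refreshed whenever a record is edited, which is the consistency concern raised earlier in the thread.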


chhsiao1981 commented on August 11, 2024

hi @robertabcd,

I think one core feature of the new design is that comment-reply makes up only a small proportion compared to main-content and comments, so the probability of locking on comment-reply is small as well.

We can reduce lots of disk-ops by separating main-content / comments / comment-reply.


chhsiao1981 commented on August 11, 2024

The newest status is #40

I will revise vedit2 (as vedit3) as an example for #40


chhsiao1981 commented on August 11, 2024

The proposal requires a major revision of the code, which is not feasible for this repo.
I'll close the issue and do the implementation in a separate repo.

