Comments (13)
I don't fully understand your design. Do you mean to store comments into a separate record-based file? I also don't see why "comment" and "comment-reply-from-poster" need to be two separate operations. (For example, we could have a parent id for each comment, and if that's -1 or something, it links to the post.)
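To make the parent-id idea concrete, here is a minimal sketch, not taken from the pttbbs code. The `Comment` fields, the `POST_ID` sentinel, and the `replies_to` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

POST_ID = -1  # hypothetical sentinel: parent_id == -1 means "attached to the post"


@dataclass
class Comment:
    comment_id: int
    parent_id: int  # POST_ID, or the comment_id this record replies to
    author: str
    text: str


def replies_to(comments: List[Comment], parent_id: int) -> List[Comment]:
    """Return the records whose parent is parent_id, in stored order."""
    return [c for c in comments if c.parent_id == parent_id]


# Made-up sample thread: two comments on the post, one reply from the poster.
thread = [
    Comment(0, POST_ID, "alice", "nice post"),
    Comment(1, POST_ID, "bob", "what about X?"),
    Comment(2, 1, "poster", "X is covered in section 2"),
]
top_level = replies_to(thread, POST_ID)
```

With one record type, "comment" and "comment-reply" differ only in `parent_id`. Note that `replies_to` walks every record, so resolving links this way costs a full scan.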
A big issue for this record-based design is that it takes a full file scan to construct all links, and only after that can the pager display content. I know pmore already scans the whole file to calculate number of lines, but this is something I would like to avoid if it's going to be rebuilt.
IMO, we don't necessarily need to convert old posts into the new format. We can keep them as is, if this enables a cleaner design.
from pttbbs.
I'll reply about comment vs. comment-reply first, and then about the separate files, so that each reply stays with its context.
- for the separating comment and comment-reply:
1.1. Sometimes the poster would like to write a multi-line comment-reply
(possible because the poster has the "edit" privilege).
So the structure of a comment-reply may differ from that of a comment.
1.2. FB also provides comment-reply (and only 1-level comment-reply).
So there may be a real need for comment-reply.
1.3. We disallow ctrl-chars in comments, and we will disallow ctrl-chars in comment-reply as well.
This will simplify the page-calculation.
- Because of the max-length of the commenting, and commenting is append-only-not-editable-op:
2.1 For file-based storage: I recommend storing comments in one big file, separate from main-content and comment-reply. (So there will be 3 files: main-content, comments, comment-reply.)
For the comments file:
We can store each record as length + comment, and then it's easy to fseek.
(Assuming that fseek does not imply a full-file scan.)
For the comment-reply file:
We will also have a separate index file that indexes replies by comment-id.
(So there will actually be 4 files per article: main-content, comments, comment-reply, comment-reply.idx.)
Then we can easily retrieve the comment-reply for a given comment.
(In the current UI, we allow only 1 comment-reply per comment, and comment-reply is editable.)
2.2 for the DB-based storage: DB will efficiently utilize mem. Each comment and comment-reply will be stored as separated record in DB.
- We don't need to construct all links, we just need to load pre-page, current-page, next-page.
Basically, the pre-page, current-page, and next-page are recalculated whenever the page needs to be refreshed.
If an insertion into main-content or comment-reply throws off the paging, the paging is recovered on the next refresh.
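The length-prefixed comments file and the comment-reply index from 2.1 could look roughly like the sketch below. It is only an illustration under assumed formats: the 4-byte little-endian length prefix, the `(comment_id, offset)` index entry, and every function name are hypothetical. In-memory `io.BytesIO` buffers stand in for the on-disk files.

```python
import io
import struct

LEN_FMT = "<I"   # hypothetical 4-byte little-endian length prefix
LEN_SIZE = struct.calcsize(LEN_FMT)
IDX_FMT = "<II"  # hypothetical index entry: (comment_id, offset in reply file)
IDX_SIZE = struct.calcsize(IDX_FMT)


def append_record(f, payload: bytes) -> int:
    """Append one length-prefixed record; return the offset where it starts."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(struct.pack(LEN_FMT, len(payload)))
    f.write(payload)
    return offset


def read_record_at(f, offset: int) -> bytes:
    """Read the record that starts at the given offset."""
    f.seek(offset)
    (length,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
    return f.read(length)


def nth_record(f, n: int) -> bytes:
    """Reach record n by hopping over length headers; earlier bodies are skipped."""
    offset = 0
    for _ in range(n):
        f.seek(offset)
        (length,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
        offset += LEN_SIZE + length
    return read_record_at(f, offset)


def find_reply_offset(idx, comment_id: int):
    """Linear lookup in the index file; return the reply offset or None."""
    idx.seek(0)
    while True:
        entry = idx.read(IDX_SIZE)
        if len(entry) < IDX_SIZE:
            return None
        cid, offset = struct.unpack(IDX_FMT, entry)
        if cid == comment_id:
            return offset


comments, replies, reply_idx = io.BytesIO(), io.BytesIO(), io.BytesIO()
for text in ("first comment", "second comment", "third comment"):
    append_record(comments, text.encode("utf-8"))

# The poster replies to comment 1; the index maps comment id -> reply offset.
reply_offset = append_record(replies, b"reply line 1\nreply line 2")
reply_idx.write(struct.pack(IDX_FMT, 1, reply_offset))
```

In this sketch, reaching comment n still walks n length headers, so it is not a constant-time lookup, but it never reads the earlier comment bodies. A fixed record size or a stored offset table would make the seek direct.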
from pttbbs.
- for the separating comment and comment-reply:
Those limitations do not necessarily need to be imposed on the data structure. For a very long time, I have wanted to allow multi-line comments. I would rather let both data structures have this flexibility.
- Because of the max-length of the commenting, and commenting is append-only-not-editable-op:
I believe 4 files per article will require a lot of space in the file system. Unless you have another plan for storing these files, I doubt it can go to production. Maintaining comment-reply.idx and its consistency with comments in the current multi-process architecture is very hard to get right. I'd discourage this approach.
- We don't need to construct all links, we just need to load pre-page, current-page, next-page.
I don't understand. At least, pmore needs to know the total number of lines. How can this be calculated without reading everything? (Unless you want to propose caching, which would be a big hassle for maintaining consistency.)
from pttbbs.
2.2 for the DB-based storage: DB will efficiently utilize mem. Each comment and comment-reply will be stored as separated record in DB.
We will need to figure out the memory usage on mbbsd as well. (May not have mmap to use)
from pttbbs.
The following code is the proposed code for separating main-content / comments:
(comment-reply will be extracted within the comments-block)
from pttbbs.
After retrieving 10 sample posts, I would like to propose the following policies:
Goal: for all posts up to now, the content can be separated into the following components:
1. main-content: (everything up to the last origin/from)
1.1 origin: the line starting with '※ 發信站' ("origin site")
1.2 from: the lines after origin and before the comments.
2. comments: 推 (recommend) / 噓 (boo) / → (arrow) / 轉錄至看板 (forwarded to board)
3. comment-reply: content the poster edited in after the comments, as a reply to the comments.
Method:
1. Find the last origin and the "from" that follows it. All content before the last origin is considered main-content. All content after "from" is considered the comment-block.
2. For main-content: we try to differentiate among main-content, origin, and from.
For the comment-block: we differentiate between comments and comment-replies.
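The method above can be sketched as follows. This is a simplified illustration, not the code in the linked branch: it assumes the article is already a list of plain-text lines with color codes stripped, that a comment line starts with its marker followed by a space, and that any unrecognized line in the comment-block is a comment-reply. The function names and the sample article are made up.

```python
ORIGIN_PREFIX = "※ 發信站"   # origin marker named in the thread
COMMENT_PREFIXES = {"推": "recommend", "噓": "boo", "→": "arrow"}
FORWARD_MARK = "轉錄至看板"   # "forwarded to board"


def classify(line: str) -> str:
    """Label one comment-block line; anything unrecognized counts as a reply."""
    if FORWARD_MARK in line:
        return "forward"
    for prefix, kind in COMMENT_PREFIXES.items():
        if line.startswith(prefix + " "):
            return kind
    return "reply"


def split_article(lines):
    """Split into (main, origin, from_lines, comment_block) at the last origin."""
    origin_idx = max(
        (i for i, line in enumerate(lines) if line.startswith(ORIGIN_PREFIX)),
        default=None,
    )
    if origin_idx is None:
        return lines, None, [], []
    rest = lines[origin_idx + 1:]
    # "from" runs until the first line that looks like a comment.
    first_comment = next(
        (i for i, line in enumerate(rest) if classify(line) != "reply"),
        len(rest),
    )
    return (lines[:origin_idx], lines[origin_idx],
            rest[:first_comment], rest[first_comment:])


# Made-up sample article for illustration only.
article = [
    "Title line",
    "Body text",
    "※ 發信站: example-site",
    "※ 文章網址: example-url",
    "推 alice: good point",
    "poster's reply to alice",
    "噓 bob: disagree",
    "→ bob: second line",
]
main, origin, from_lines, block = split_article(article)
kinds = [classify(line) for line in block]
```

Here the body, origin, and from pieces are returned separately so each can be checked; together they make up main-content in the sense of the goal above.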
TODO:
1. Check the "推 / 噓 / →" lines that appear inside main-content, and check whether it is valid to keep them in main-content.
2. UI: if pmore reads a single file: convert it to main-content / comments / comment-reply
and then run pmore on the new version.
3. UI: when editing: edit by block, and confine each edit to its block (in main-content: edit only within main-content; in comments: edit only the corresponding comment).
refer to:
from pttbbs.
The following code illustrates the format for the separated files, including main-content, comments, comment-reply, index of comment-reply.
I feel that UI-in-editing needs to be done first, before converting the file format.
I'll start with the UI-in-editing part.
UI-in-editing (as 2-level editing):
After pressing 'E' and entering editing mode:
1. Choose either the main-content or any one of the comments for editing / reply.
2. Do the editing.
3. After saving, go back to step 1.
4. Some hotkey exits editing mode.
from pttbbs.
The following is the current progress of the development:
https://github.com/chhsiao1981/pttbbs/tree/hsiao.edit_test
- Looks like it can separate lines while continuously reading partial content from a buffer.
- Looks like it can determine whether a line is a recommend (good) / boo (bad) / comment (arrow) / forwarding.
- Able to (mostly) successfully split the files into main-content / comments / comment-reply, and it passes some unit tests based on the googletest unit-test framework.
- Defined the current version of the file-header / record structures.
https://github.com/chhsiao1981/pttbbs/blob/hsiao.edit_test/include/migrate_merge.h
- The implementation of merge is done too, but not tested yet. I will provide the unit tests soon.
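Splitting lines while reading a buffer piece by piece, as in the first item above, can be illustrated with the short sketch below. It is an assumed approach, not the code in the linked branch; `iter_lines` and the `chunk_size` parameter are hypothetical names.

```python
import io


def iter_lines(stream, chunk_size=4096):
    """Yield newline-terminated lines, reading chunk_size bytes at a time."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last newline is complete; the tail carries over.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line + b"\n"
    if pending:
        yield pending  # final line without a trailing newline


data = b"first\nsecond line\nthird"
lines = list(iter_lines(io.BytesIO(data), chunk_size=4))
```

A line that spans several chunks is held in `pending` until its newline arrives, so a per-line size limit would be enforced on `pending` in a fuller version.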
TODO:
- deal with trailing empty lines.
- "轉錄自 xxx 信箱" ("forwarded from xxx's mailbox") may have the same effect as "發信站". I'll check further into this.
- ignore the "編輯" ("edit") lines.
- max line length is restricted to 8192 bytes. Need to check whether that fits all current posts.
- comment-reply is currently restricted to 8192 bytes. Need to check the max size of existing comment-replies.
- implement edit-UI.
from pttbbs.
I still have concerns for this design. Please see my comments above.
from pttbbs.
hihi @robertabcd,
Thank you so much for the comments!
What I observe is that reading and commenting make up the majority of the ops in ptt,
but the single-file-based storage may make the disks go through main-content a lot, resulting in unnecessary disk-ops.
On the other hand, there is a lot of cold data in this kind of forum.
We may be able to keep some hot data in mem to reduce disk-ops.
The ultimate goal is to use db-based storage, to get the db's native built-in memory index / cache,
and to make ptt horizontally scalable.
(My hunch is that it's about time for ptt to become horizontally scalable.)
Basically:
- For separating comments / comment-reply:
=> These can be integrated into the same data structure, but then the data structure still needs a column named "type" to tell whether a record is a comment or a comment-reply from the poster.
- For the 4-file thing and the comment-reply.idx thing:
=> This is preparation for separating main-content and comments / comment-reply.
The goal of separating main-content / comments / comment-reply is to turn it into record-based storage and then migrate to clustered-db-based storage.
=> Having more files may actually be good for the disks.
For example, the disks do not need to go through the main-content when appending a comment to the end of the file.
=> For comment-reply.idx:
Currently comment-reply is editable only by the poster / board-admin / sys-admins.
We still need to take care of the mutex, but unlike commenting, the probability of lock contention is much lower.
In addition, comment-reply.idx can be considered a cache, and we can do some primitive checksum (such as size of comments, magic-start-header, etc.) to verify the consistency of comment-reply.idx. (This is, in a way, using multiple files to reduce the locking issue~)
- For the total number of lines:
With current technology, we can simply fix the display-line at 80 chars.
(To my knowledge, ptt currently uses 80 chars as the max display-line,
and with current technology we can drop support for display-lines < 80 chars.)
For main-content: we can store the total-lines at the beginning of the file as a cache.
For comments: we can likewise store the total-lines at the beginning of the file as a cache. (Each comment takes only one line.)
For comment-reply: we can store the total-lines of each record at the beginning of the record, and the total-lines of all comment-replies at the beginning of the file.
With this information, we can easily work out the line offset of each record (main-content, comments, comment-reply) and so the offset of each line.
- For the mem-usage in ptt:
Basically this is preparation for a horizontally scalable design.
It's expected that total mem-usage across ptt and the db will go up, but the mem-usage in ptt itself will not necessarily increase much under the new design (most mem-usage will be in the db-cache), and I think the new design will reduce unnecessary disk-ops and make the system more stable.
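As a rough illustration of the total-lines point above, the sketch below derives the total and locates a display line from cached counts alone, without touching any content. It assumes that each comment occupies exactly one display line, that a comment-reply is displayed right after its comment, and that the counts have already been read from the file and record headers. `total_lines`, `locate`, and the argument names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def total_lines(main_lines, reply_lines_per_comment):
    """Total display lines: main-content, one line per comment, all reply lines."""
    return main_lines + len(reply_lines_per_comment) + sum(reply_lines_per_comment)


def locate(line_no, main_lines, reply_lines_per_comment):
    """Map a 0-based display line to ("main", i), ("comment", k) or ("reply", k, i)."""
    if not 0 <= line_no < total_lines(main_lines, reply_lines_per_comment):
        raise IndexError("display line out of range")
    if line_no < main_lines:
        return ("main", line_no)
    rel = line_no - main_lines
    # starts[k] is the line, relative to the comment area, where comment k begins.
    sizes = [1 + r for r in reply_lines_per_comment]
    starts = [0] + list(accumulate(sizes))[:-1]
    k = bisect_right(starts, rel) - 1
    within = rel - starts[k]
    return ("comment", k) if within == 0 else ("reply", k, within - 1)


# 3 lines of main-content; 3 comments; only comment 1 has a reply (2 lines).
cached_main = 3
cached_replies = [0, 2, 0]
```

The prefix sums in `starts` depend only on the cached counts, so they would need recomputing whenever a comment is appended or a comment-reply is edited. That is the consistency cost of treating these totals as a cache.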
from pttbbs.
hi @robertabcd,
I think one core feature of the new design is that comment-reply makes up only a small proportion compared to main-content and comments, so the probability of lock contention on comment-reply is small as well.
We can save a lot of disk-ops by separating main-content / comments / comment-reply.
from pttbbs.
The latest status is in #40.
I will revise vedit2 (as vedit3) as an example for #40.
from pttbbs.
The proposal requires a major revision of the code, which is not feasible for this repo.
I'll close the issue and do the implementation in a separate repo.
from pttbbs.