poiscript / orgize

A Rust library for parsing org-mode files.

Home Page: https://poiscript.github.io/orgize/

License: MIT License

Rust 96.18% HTML 1.84% JavaScript 1.98%

cli lsp org-mode parser

orgize's Issues

Unable to access Timestamps for Repeating Tasks

Problem

I've had some difficulty accessing the scheduled timestamp for repeated tasks and I'm not sure if this is a bug or if there is a different way to query for these repeated tasks.

What I've tried

I couldn't find anything in the docs and the json output of the below MWE only contains scheduled (in lower case) for the non-repeating tasks.

Further Info

This may be related to #27, #28, #29.

Minimum Working Example

Rust

use orgize::Org;

use serde_json::to_string;

fn main() {
    let org = Org::parse(
        r#"
** TODO Call Father
   SCHEDULED: <2008-02-10 Sun ++1w>
   Marking this DONE shifts the date by at least one week, but also
   by as many weeks as it takes to get this date into the future.
   However, it stays on a Sunday, even if you called and marked it
   done on Saturday.

** TODO Empty kitchen trash
   SCHEDULED: <2008-02-08 Fri 20:00 ++1d>
   Marking this DONE shifts the date by at least one day, and also
   by as many days as it takes to get the timestamp into the future.
   Since there is a time in the timestamp, the next deadline in the
   future will be on today's date if you complete the task before
   20:00.

** TODO Check the batteries in the smoke detectors
   SCHEDULED: <2005-11-01 Tue .+1m>
   Marking this DONE shifts the date to one month after today.

** TODO Wash my hands
   SCHEDULED: <2019-04-05 08:00 Sun .+1h>
   Marking this DONE shifts the date to exactly one hour from now.

** TODO Task with repeater
SCHEDULED: <2022-04-17 Sun +1w>

** TODO Task without repeater
SCHEDULED: <2022-04-17 Sun>
   "#,);

    // Loop over the headlines
    for h in org.headlines() {
        let title = &h.title(&org).raw as &str;
        println!("\n\n{}\n-----------", title);

        match &h.title(&org).planning {
            Some(p) => println!("{:#?}", p),
            None => println!("No Planning Module"),
        }


        // Try and extract the timestamp
        match h.title(&org).scheduled() {
            Some(_t) => println!("Timestamp"),
            None     => println!("No Timestamp"),
        };

    }


    let org_json = to_string(&org).unwrap();
    println!("{}", org_json);
}

TOML

[package]
name = "orgize_example"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
orgize = "0.9.0"
serde_json = "*"
colored = "*"
chrono = "*"
clap = { version = "3.1.6", features = ["derive"] }

Output

Raw



Call Father
-----------
No Planning Module
No Timestamp


Empty kitchen trash
-----------
No Planning Module
No Timestamp


Check the batteries in the smoke detectors
-----------
No Planning Module
No Timestamp


Wash my hands
-----------
No Planning Module
No Timestamp


Task with repeater
-----------
No Planning Module
No Timestamp


Task without repeater
-----------
Planning {
    deadline: None,
    scheduled: Some(
        Active {
            start: Datetime {
                year: 2022,
                month: 4,
                day: 17,
                dayname: "Sun",
                hour: None,
                minute: None,
            },
            repeater: None,
            delay: None,
        },
    ),
    closed: None,
}
Timestamp
{"type":"document","pre_blank":1,"children":[{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Call Father","post_blank":0,"children":[{"type":"text","value":"Call Father"}]},{"type":"section","children":[{"type":"paragraph","post_blank":1,"children":[{"type":"text","value":"   SCHEDULED: <2008-02-10 Sun ++1w>\n   Marking this DONE shifts the date by at least one week, but also\n   by as many weeks as it takes to get this date into the future.\n   However, it stays on a Sunday, even if you called and marked it\n   done on Saturday."}]}]}]},{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Empty kitchen trash","post_blank":0,"children":[{"type":"text","value":"Empty kitchen trash"}]},{"type":"section","children":[{"type":"paragraph","post_blank":1,"children":[{"type":"text","value":"   SCHEDULED: <2008-02-08 Fri 20:00 ++1d>\n   Marking this DONE shifts the date by at least one day, and also\n   by as many days as it takes to get the timestamp into the future.\n   Since there is a time in the timestamp, the next deadline in the\n   future will be on today's date if you complete the task before\n   20:00."}]}]}]},{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Check the batteries in the smoke detectors","post_blank":0,"children":[{"type":"text","value":"Check the batteries in the smoke detectors"}]},{"type":"section","children":[{"type":"paragraph","post_blank":1,"children":[{"type":"text","value":"   SCHEDULED: <2005-11-01 Tue .+1m>\n   Marking this DONE shifts the date to one month after today."}]}]}]},{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Wash my hands","post_blank":0,"children":[{"type":"text","value":"Wash my hands"}]},{"type":"section","children":[{"type":"paragraph","post_blank":1,"children":[{"type":"text","value":"   SCHEDULED: <2019-04-05 08:00 Sun .+1h>\n   Marking this DONE shifts 
the date to exactly one hour from now."}]}]}]},{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Task with repeater","post_blank":0,"children":[{"type":"text","value":"Task with repeater"}]},{"type":"section","children":[{"type":"paragraph","post_blank":1,"children":[{"type":"text","value":"SCHEDULED: <2022-04-17 Sun +1w>"}]}]}]},{"type":"headline","level":2,"children":[{"type":"title","level":2,"keyword":"TODO","raw":"Task without repeater","planning":{"scheduled":{"timestamp_type":"active","start":{"year":2022,"month":4,"day":17,"dayname":"Sun"}}},"post_blank":1,"children":[{"type":"text","value":"Task without repeater"}]}]}]}

Json

{
  "type": "document",
  "pre_blank": 1,
  "children": [
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Call Father",
          "post_blank": 0,
          "children": [
            {
              "type": "text",
              "value": "Call Father"
            }
          ]
        },
        {
          "type": "section",
          "children": [
            {
              "type": "paragraph",
              "post_blank": 1,
              "children": [
                {
                  "type": "text",
                  "value": "   SCHEDULED: <2008-02-10 Sun ++1w>\n   Marking this DONE shifts the date by at least one week, but also\n   by as many weeks as it takes to get this date into the future.\n   However, it stays on a Sunday, even if you called and marked it\n   done on Saturday."
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Empty kitchen trash",
          "post_blank": 0,
          "children": [
            {
              "type": "text",
              "value": "Empty kitchen trash"
            }
          ]
        },
        {
          "type": "section",
          "children": [
            {
              "type": "paragraph",
              "post_blank": 1,
              "children": [
                {
                  "type": "text",
                  "value": "   SCHEDULED: <2008-02-08 Fri 20:00 ++1d>\n   Marking this DONE shifts the date by at least one day, and also\n   by as many days as it takes to get the timestamp into the future.\n   Since there is a time in the timestamp, the next deadline in the\n   future will be on today's date if you complete the task before\n   20:00."
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Check the batteries in the smoke detectors",
          "post_blank": 0,
          "children": [
            {
              "type": "text",
              "value": "Check the batteries in the smoke detectors"
            }
          ]
        },
        {
          "type": "section",
          "children": [
            {
              "type": "paragraph",
              "post_blank": 1,
              "children": [
                {
                  "type": "text",
                  "value": "   SCHEDULED: <2005-11-01 Tue .+1m>\n   Marking this DONE shifts the date to one month after today."
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Wash my hands",
          "post_blank": 0,
          "children": [
            {
              "type": "text",
              "value": "Wash my hands"
            }
          ]
        },
        {
          "type": "section",
          "children": [
            {
              "type": "paragraph",
              "post_blank": 1,
              "children": [
                {
                  "type": "text",
                  "value": "   SCHEDULED: <2019-04-05 08:00 Sun .+1h>\n   Marking this DONE shifts the date to exactly one hour from now."
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Task with repeater",
          "post_blank": 0,
          "children": [
            {
              "type": "text",
              "value": "Task with repeater"
            }
          ]
        },
        {
          "type": "section",
          "children": [
            {
              "type": "paragraph",
              "post_blank": 1,
              "children": [
                {
                  "type": "text",
                  "value": "SCHEDULED: <2022-04-17 Sun +1w>"
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "type": "headline",
      "level": 2,
      "children": [
        {
          "type": "title",
          "level": 2,
          "keyword": "TODO",
          "raw": "Task without repeater",
          "planning": {
            "scheduled": {
              "timestamp_type": "active",
              "start": {
                "year": 2022,
                "month": 4,
                "day": 17,
                "dayname": "Sun"
              }
            }
          },
          "post_blank": 1,
          "children": [
            {
              "type": "text",
              "value": "Task without repeater"
            }
          ]
        }
      ]
    }
  ]
}
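
In the output above, every headline whose SCHEDULED line carries a repeater (++1w, .+1m, +1w) ends up as plain paragraph text, so the repeater is only reachable through the raw string. As a stopgap, the cookie could be pulled out of that text by hand. The following is a minimal sketch using only the standard library; `RepeaterKind`, `Repeater`, `parse_repeater` and `repeater_in_timestamp` are hypothetical names and not part of orgize, and the accepted unit letters (h, d, w, m, y) are taken from the Org manual.

```rust
/// How the date shifts when the task is marked DONE (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepeaterKind {
    Cumulate, // "+"
    CatchUp,  // "++"
    Restart,  // ".+"
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Repeater {
    pub kind: RepeaterKind,
    pub value: u32,
    pub unit: char,
}

/// Parse one whitespace-separated token such as "++1w" or ".+1m".
pub fn parse_repeater(cookie: &str) -> Option<Repeater> {
    // Check "++" before "+" so the longer mark wins.
    let (kind, rest) = if let Some(r) = cookie.strip_prefix("++") {
        (RepeaterKind::CatchUp, r)
    } else if let Some(r) = cookie.strip_prefix(".+") {
        (RepeaterKind::Restart, r)
    } else if let Some(r) = cookie.strip_prefix('+') {
        (RepeaterKind::Cumulate, r)
    } else {
        return None;
    };
    let unit = rest.chars().last()?;
    if !matches!(unit, 'h' | 'd' | 'w' | 'm' | 'y') {
        return None;
    }
    // The unit is ASCII, so it occupies exactly one byte.
    let digits = &rest[..rest.len() - 1];
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(Repeater { kind, value: digits.parse().ok()?, unit })
}

/// Find the first repeater inside a timestamp such as "<2008-02-10 Sun ++1w>".
pub fn repeater_in_timestamp(timestamp: &str) -> Option<Repeater> {
    timestamp
        .trim()
        .trim_start_matches(|c: char| c == '<' || c == '[')
        .trim_end_matches(|c: char| c == '>' || c == ']')
        .split_whitespace()
        .find_map(parse_repeater)
}
```

Delay cookies such as -1d are ignored by this sketch, since they do not start with one of the three repeater marks.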

FR: Timestamps: Accept timestamps with missing/incorrect day of week (good beginner issue!)

I'm going to start filing feature requests related to some of the work-in-progress code around timestamps, if that's all right, to help break down what is yet to be implemented into smaller chunks.

Org mode and org-element have robust support for timestamps that do not specify the day of week. It would be nice to implement this as well. So for example, [2020-01-01] should be accepted. Likewise, they will correct the day of week if you should have a timestamp where it does not match the numeric date.

The work here would be to update the parser (orgize/src/elements/timestamp.rs) to accept a missing day of the week, and then to compute the day of week from the numeric value rather than storing it as a Cow string. Also note that, per the spec, DAYNAME can contain any non-whitespace character besides +, -, ], >, a digit or \n.

This would be a good issue for a beginner, provided they're familiar with (or willing to learn) parsing with nom and date handling with chrono.
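
To illustrate the "compute the day of week from the numeric value" step, here is a small sketch. The issue suggests chrono; to stay dependency-free this sketch swaps in Sakamoto's algorithm instead. `day_of_week` and `canonical_date` are hypothetical helpers, not orgize APIs, and the English three-letter day names are an assumption.

```rust
const DAY_NAMES: [&str; 7] = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

/// Day of week for a proleptic Gregorian date, via Sakamoto's algorithm.
/// Only coarse range checks are done; month lengths are not validated.
pub fn day_of_week(year: i32, month: u32, day: u32) -> Option<&'static str> {
    const OFFSETS: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    if year < 1 || !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    // January and February count as part of the previous year.
    let y = if month < 3 { year - 1 } else { year };
    let index =
        (y + y / 4 - y / 100 + y / 400 + OFFSETS[(month - 1) as usize] + day as i32) % 7;
    Some(DAY_NAMES[index as usize])
}

/// Accept "YYYY-MM-DD" with an optional (possibly wrong) day name after it,
/// and return the date with the computed day name.
pub fn canonical_date(body: &str) -> Option<String> {
    let date = body.split_whitespace().next()?;
    let mut fields = date.splitn(3, '-');
    let year: i32 = fields.next()?.parse().ok()?;
    let month: u32 = fields.next()?.parse().ok()?;
    let day: u32 = fields.next()?.parse().ok()?;
    let name = day_of_week(year, month, day)?;
    Some(format!("{:04}-{:02}-{:02} {}", year, month, day, name))
}
```

With this, `canonical_date("2020-01-01")` fills in the missing day name and `canonical_date("2019-04-05 Sun")` replaces the mismatched one, which is the behaviour the request describes for Org mode.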

org validation failed error

Hello,

I got this error from orgize when trying to parse a bunch of my org files. I'm using version 0.9.0 from crates.io.

Org validation failed. 1 error(s) found:
ExpectedChildren { at: NodeId { index1: 21, stamp: NodeStamp(0) } } at DynBlock(DynBlock { block_name: "clocktable", arguments: Some(":maxlevel 2 :scope file :block lastmonth :step month :stepskip0 t :formula \"$4=$2*42.0;t\""), pre_blank: 1, post_blank: 0 })
thread 'main' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/orgize-0.9.0/src/validate.rs:203:17:
Looks like there's a bug in orgize! Please report it with your org-mode content at https://github.com/PoiScript/orgize/issues.

to_owned for orgize::Org

I've struggled to find an ownership model that works well for keeping parse trees around longer term, rather than doing one pass of processing. Currently I'm working on something that keeps an in-memory parse tree of a set of Org files, and then updates them whenever they change (per inotify) by reparsing just the modified files.

So conceptually, what I'd like to do is something like:

struct Doc {
  org: orgize::Org,
  text: String,
}

struct ServerState {
  docs: HashMap<PathBuf, Doc>,
}

The problem is that the Org struct needs to refer to the text, and thus runs afoul of the borrow checker. In this use case, I'd be fine double storing the text, e.g., if orgize::Org had an into_owned as the elements do (or if Org provided a non-mutating accessor for its text).

Conceptually, there should be a way to get around this without changing Orgize, since this is essentially the opposite of splitting a reference. I found https://github.com/Kimundi/owning-ref-rs and https://github.com/jpernst/rental which I think would let me do that, though I haven't tried yet. If worst comes to worst, I could probably write something unsafe, since the fundamental access pattern should be safe, but it's frustrating to not be able to do the usual "clone to make your problems go away" trick.

Do you have any thoughts on how best to approach this? Please do let me know if I'm missing something -- I'm somewhat new to Rust.
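
One pattern that avoids a self-referential struct altogether is to keep only owned data: store the text plus byte ranges into it, and hand out borrowed views on demand. The sketch below shows that idea with the standard library only. It is not how orgize works internally; the naive "stars followed by a space" headline scan merely stands in for a real parse, and `headline_spans`, `reload` and `get` are hypothetical names.

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::path::{Path, PathBuf};

/// Owns its text and stores byte ranges instead of `&str` borrows,
/// so the struct never refers to itself.
pub struct Doc {
    text: String,
    headline_spans: Vec<Range<usize>>,
}

impl Doc {
    pub fn new(text: String) -> Self {
        let mut headline_spans = Vec::new();
        let mut offset = 0;
        for line in text.split_inclusive('\n') {
            // Stand-in for a real parser: one or more stars, then a space.
            let stars = line.bytes().take_while(|b| *b == b'*').count();
            if stars > 0 && line.as_bytes().get(stars) == Some(&b' ') {
                let len = line
                    .trim_end_matches(|c: char| c == '\n' || c == '\r')
                    .len();
                headline_spans.push(offset..offset + len);
            }
            offset += line.len();
        }
        Doc { text, headline_spans }
    }

    /// Borrowed views are created on demand and tied to `&self`.
    pub fn headlines(&self) -> impl Iterator<Item = &str> + '_ {
        self.headline_spans
            .iter()
            .map(move |span| &self.text[span.clone()])
    }
}

#[derive(Default)]
pub struct ServerState {
    docs: HashMap<PathBuf, Doc>,
}

impl ServerState {
    /// Re-index a single file, e.g. after a change notification.
    pub fn reload(&mut self, path: PathBuf, text: String) {
        self.docs.insert(path, Doc::new(text));
    }

    pub fn get(&self, path: &Path) -> Option<&Doc> {
        self.docs.get(path)
    }
}
```

The trade-off is that every access goes through an index lookup rather than a stored reference, but the borrow checker has nothing to object to and `Doc` can be moved and stored freely.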

Exporters too separate

Both the org and html exporters have identical trait signatures. We should have a single trait so that additional exporters can re-use the same underlying logic.
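
As a rough illustration of what a shared trait could look like, the sketch below defines one `Exporter` trait whose default `export` method holds the common driver loop, with two toy implementations. Everything here (`Element`, `Event`, `Exporter`, `Html`, `OrgText`) is a simplified, hypothetical stand-in and does not mirror orgize's real types or signatures.

```rust
/// Tiny stand-in for a parsed element; `Text` is a leaf.
pub enum Element<'a> {
    Paragraph,
    Bold,
    Text(&'a str),
}

pub enum Event<'a> {
    Start(Element<'a>),
    End(Element<'a>),
}

/// One trait for every output format.
pub trait Exporter {
    fn start(&mut self, out: &mut String, element: &Element<'_>);
    fn end(&mut self, out: &mut String, element: &Element<'_>);

    /// Shared driver logic lives here, so each exporter gets it for free.
    fn export(&mut self, events: &[Event<'_>]) -> String {
        let mut out = String::new();
        for event in events {
            match event {
                Event::Start(el) => self.start(&mut out, el),
                Event::End(el) => self.end(&mut out, el),
            }
        }
        out
    }
}

pub struct Html;
pub struct OrgText;

impl Exporter for Html {
    fn start(&mut self, out: &mut String, element: &Element<'_>) {
        match element {
            Element::Paragraph => out.push_str("<p>"),
            Element::Bold => out.push_str("<b>"),
            // Minimal escaping only; a real exporter needs more.
            Element::Text(t) => out.push_str(&t.replace('&', "&amp;").replace('<', "&lt;")),
        }
    }
    fn end(&mut self, out: &mut String, element: &Element<'_>) {
        match element {
            Element::Paragraph => out.push_str("</p>"),
            Element::Bold => out.push_str("</b>"),
            Element::Text(_) => {}
        }
    }
}

impl Exporter for OrgText {
    fn start(&mut self, out: &mut String, element: &Element<'_>) {
        match element {
            Element::Paragraph => {}
            Element::Bold => out.push('*'),
            Element::Text(t) => out.push_str(t),
        }
    }
    fn end(&mut self, out: &mut String, element: &Element<'_>) {
        match element {
            Element::Paragraph => out.push('\n'),
            Element::Bold => out.push('*'),
            Element::Text(_) => {}
        }
    }
}
```

A third format would then only need to implement `start` and `end`.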

Idea for collaboration

Hello, I found your project from the worg tools list. First, sorry for the semi-spam nature of this issue. I had a notion for a project that the org community might find useful and I'm looking for feedback. Feel free to close this issue if it doesn't sound useful to you.

My idea is to start a list of org-mode snippets which can serve as a test bed for people developing tools. The idea is that having a separate collection of examples makes it easier for others in the community to benefit from the examples developed through communication with users.

Users could use these samples to try to construct minimal examples of issues they're having and/or contribute examples there which others could benefit from. Exactly how it will take shape is still up in the air.

These samples could also serve as a place to discuss ideas about how to develop the grammar itself. According to worg, the spec is still in draft state.

There's not much there at the moment. Mostly because I don't want to commit too early to what seems like it might be useful. I'll add more examples as I go.

If you like the concept and/or want to contribute and/or just want to offer feedback, I'd very much appreciate it.

Again, sorry for the spam.

symlinks break builds under Nix

When building this package with a crate using a Nix Flake, the symlink introduced in 4cc1130 appears to break the build. I'm not sure why exactly, but it's likely because the symlink is outside of the crate's source directory.

This could be fixed by either removing the file or, more likely, making it a "real" file.

Question: where is the source code of https://poiscript.github.io/orgize/ ?

\r\n prevents further parsing; potential for substantial data loss

Found another weird case. Strange line endings prevent Orgize from parsing anything after them. If this occurs early in the org file, it can lead to substantial data loss.

fn main() {
    // According to the spec, this is two headlines.
    // org-element parses it as two headlines.
    // Per discussion on https://github.com/PoiScript/orgize/issues/19 I could understand parsing it as three headlines instead.
    // Orgize currently parses it as 1 headline, which is clearly wrong.
    let org = orgize::Org::parse("* \n*\r\n* \n");
    let headline_count = org.headlines().count();
    println!("Orgize parsed {} headlines.", headline_count);
    assert!(headline_count == 2 || headline_count == 3);

    // Orgize seems unable to parse any further headlines after that one.
    let org = orgize::Org::parse("* \n*\r\n* A\n* B\n* C\n* D\n* E\n* F\n* G\n* H\n* I\n* J\n* K\n");
    let headline_count = org.headlines().count();
    println!("Orgize parsed {} headlines.", headline_count);
    assert!(headline_count == 12 || headline_count == 13);

    // If written out, this causes data loss.
    let mut fd = iobuffer::IoBuffer::new();
    let mut s = String::default();
    org.write_org(&mut fd).unwrap();
    fd.read_to_string(&mut s).unwrap();

    // Output: After parse -> emit: "* \n"
    println!("After parse -> emit: {:?}", &s);
}
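
Until the parser handles this itself, one possible caller-side workaround is to normalise line endings before handing the text over. The sketch below is an assumption about what a workaround could look like, not a fix inside orgize; `normalize_newlines` is a hypothetical helper, and it only rewrites \r\n pairs, leaving any lone \r untouched.

```rust
use std::borrow::Cow;

/// Replace Windows line endings with "\n".
/// Borrows the input unchanged when there is nothing to rewrite.
pub fn normalize_newlines(input: &str) -> Cow<'_, str> {
    if input.contains("\r\n") {
        Cow::Owned(input.replace("\r\n", "\n"))
    } else {
        Cow::Borrowed(input)
    }
}
```

The normalised string would then be what gets passed to the parser. Note that writing the result back out converts the file to \n line endings, which may or may not be acceptable.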

cli and lsp removal

The cli and lsp crates were recently removed and don't seem to have been moved anywhere.
Are they no longer in scope?

Orgize does not produce html image tags for image links

Good evening! Thank you for writing and maintaining this library. I think it's great to have a Rust library that can render Org-mode to html.

Some unexpected behaviour I've run into is that DefaultHtmlHandler doesn't render images. I think it's because images aren't actually mentioned in the spec, but are rather a special case for links to files with certain extensions.

For example, [[org-mode-unicorn.png]] should render to <img src="org-mode-unicorn.png" /> rather than an <a> tag because of the png extension.

This page describes how I would expect the default behaviour to be: https://orgmode.org/worg/org-tutorials/images-and-xhtml-export.html
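
The special case described above boils down to a check on the link target's extension. Here is a minimal sketch of that decision, independent of orgize's handler API; `is_image_path` and `render_link` are hypothetical helpers, and the extension list is an assumption rather than Org's exact default rule set.

```rust
// Assumed list of inline-image extensions; adjust to taste.
const IMAGE_EXTENSIONS: [&str; 6] = ["png", "jpg", "jpeg", "gif", "svg", "webp"];

pub fn is_image_path(path: &str) -> bool {
    match path.rsplit_once('.') {
        Some((stem, ext)) => {
            !stem.is_empty()
                && IMAGE_EXTENSIONS
                    .iter()
                    .any(|known| known.eq_ignore_ascii_case(ext))
        }
        None => false,
    }
}

/// Minimal escaping for text placed in attributes or element bodies.
fn escape(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// A link without a description that points at an image becomes <img>;
/// everything else stays an <a> tag.
pub fn render_link(path: &str, description: Option<&str>) -> String {
    match description {
        None if is_image_path(path) => format!("<img src=\"{}\" />", escape(path)),
        None => format!("<a href=\"{0}\">{0}</a>", escape(path)),
        Some(desc) => format!("<a href=\"{}\">{}</a>", escape(path), escape(desc)),
    }
}
```

In this sketch a link with a description is always rendered as an <a> tag, even when the target is an image.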

Issue in Start Events

Hello,
I am trying to parse a custom DSL based on org-mode using the library, and I am confused about how events work. My document is a simple org-mode file; here is a minimal example:

* Challenge

Here is a short description that I need to capture.

It might have several paragraphs.

** There are some Hints

Which also have paragraphs

** And they can be as many as needed

With one

or more paragraphs.

When I am in a start event for a title, I can already read the text (in the raw element). In my understanding that should not be the case: when a title starts, its text has not been parsed yet; that only happens once the title is finished.

The problem is that I need to collect the text from the paragraphs into an accumulator and write it out when a new title starts. But the library has already sent a start and end event for the text in the title, and I do not see a way to recognise the point where a new title starts (and then get the collected text, i.e. the two lines in the minimal example above) so that I can correctly handle text that is not a title.

What is the idea and the way in orgize to handle such a basic use case?

Thanks and regards
Sebastian
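
One way to think about the accumulator described above is as a small state machine: a title start opens a new entry, text seen while inside the title is skipped, and each finished paragraph is attached to the most recent entry. The sketch below uses a made-up, simplified `Event` enum to show that shape; it is an assumption for illustration and does not reproduce orgize's actual event or element types.

```rust
/// Simplified, hypothetical event stream.
#[derive(Debug, Clone, Copy)]
pub enum Event<'a> {
    StartTitle(&'a str), // carries the raw title text, as observed in the issue
    EndTitle,
    StartParagraph,
    Text(&'a str),
    EndParagraph,
}

#[derive(Debug, Default, PartialEq)]
pub struct Entry {
    pub title: String,
    pub paragraphs: Vec<String>,
}

pub fn collect_entries(events: &[Event<'_>]) -> Vec<Entry> {
    let mut entries: Vec<Entry> = Vec::new();
    let mut in_title = false;
    let mut paragraph: Option<String> = None;

    for event in events {
        match *event {
            Event::StartTitle(raw) => {
                // A new title closes the previous entry implicitly.
                entries.push(Entry { title: raw.to_string(), paragraphs: Vec::new() });
                in_title = true;
            }
            Event::EndTitle => in_title = false,
            Event::StartParagraph => paragraph = Some(String::new()),
            Event::Text(text) => {
                // Text inside the title is already covered by `raw`.
                if !in_title {
                    if let Some(current) = paragraph.as_mut() {
                        current.push_str(text);
                    }
                }
            }
            Event::EndParagraph => {
                // Paragraphs before the first title are dropped in this sketch.
                if let (Some(done), Some(entry)) = (paragraph.take(), entries.last_mut()) {
                    entry.paragraphs.push(done);
                }
            }
        }
    }
    entries
}
```

The key point is the `in_title` flag: it lets the handler tell title text apart from body text even though both arrive as text events.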

Handle LaTeX fragments and environments — $x!$, $$y$$, \begin{equation}z\end{equation}

This is a ticket to track progress on LaTeX fragments and environments, to be added per:

orgize/docs/STATUS.md, line 43 in e009e1c:
- [ ] Entities and LaTeX Fragments

orgize/docs/STATUS.md, line 36 in e009e1c:
- [ ] LaTeX Environments

orgize/src/parsers.rs, line 291 in e009e1c:
// TODO: LaTeX environment

The last of these looks like the start of a handler for \ tokens.

This is of interest to downstream consumers, like Firn (teesloane/firn#56).

FR: Timestamps: Repeater/delay may be specified (and different) for both parts of a range

Org allows you to have a range with a different repeater for start and end. I've actually found this useful on occasion, when you want the interval to grow every time something is marked done, when combined with org-habit.

For example: <2020-01-01 ++1w -1d>--<2020-01-08 .+1w -2d> is valid, and will act as expected when marked done.

I know repeater and delay parsing is not yet implemented, but currently the Timestamp type does not even model this, giving ActiveRange and InactiveRange a single repeater/delay.

I started writing a PR to change just the data structures to have start_repeater, start_delay, end_repeater, end_delay, but it was fairly verbose, so I decided I should instead open an issue to consider how to model ranges. If start and end are Datetimes, then we can treat timestamps like [2020-01-10 03:55-04:00] and [2020-01-10 03:55]--[2020-01-10 04:00] identically if they have no repeater (or they have the same repeater), but need to use the latter form if there are two.

Another option would be to make [in]ActiveTimeRange in addition to [in]ActiveRange, but this adds yet more types to an already complex enum.

Because it's such a hard problem (I've struggled before with modeling Org timestamps using types), and one that is likely to have different opinions, I'll just file this issue instead of a PR because I'm concerned there's a high risk of disagreement on any particular solution.

But if nothing else, I did want to open an issue to note that at present, the types cannot model Org timestamps as defined in the spec and parsed by org-element.
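
To make the discussion concrete, here is one of several possible shapes for such a model: each endpoint of a range carries its own repeater and delay, and a helper reports whether a range could still be written in the compact same-day form. This is only a sketch for discussion. All type, field and method names are hypothetical, and it is not a proposal that has been checked against orgize's existing `Timestamp` enum.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cookie {
    pub mark: &'static str, // e.g. "++", ".+", "+", "-", "--"
    pub value: u32,
    pub unit: char,
}

/// One side of a range (or a single timestamp) with its own cookies.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Endpoint {
    pub date: (u16, u8, u8),
    pub time: Option<(u8, u8)>,
    pub repeater: Option<Cookie>,
    pub delay: Option<Cookie>,
}

impl Endpoint {
    pub fn new(date: (u16, u8, u8), time: Option<(u8, u8)>) -> Self {
        Endpoint { date, time, repeater: None, delay: None }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Timestamp {
    Single { active: bool, at: Endpoint },
    Range { active: bool, start: Endpoint, end: Endpoint },
}

impl Timestamp {
    /// True when a range could be written as [DATE HH:MM-HH:MM]:
    /// same day, both times present, and identical repeater and delay.
    pub fn fits_compact_form(&self) -> bool {
        match self {
            Timestamp::Range { start, end, .. } => {
                start.date == end.date
                    && start.time.is_some()
                    && end.time.is_some()
                    && start.repeater == end.repeater
                    && start.delay == end.delay
            }
            Timestamp::Single { .. } => false,
        }
    }
}
```

Folding active/inactive into a flag keeps the enum at two variants, at the cost of no longer encoding that distinction in the variant name.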

discussion: drop indextree

Incomplete Syntax: No definition list support

Org-mode supports definition lists by default. This doesn't appear to be accounted for when parsing a list, so the parsing in this library is incomplete.

Definition lists are written with the following syntax:

- Hello :: World
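
For illustration, splitting such an item into its term and description can be sketched in a few lines. `split_definition` is a hypothetical helper, not an orgize function, and it assumes a "-" or "+" bullet followed by a " :: " separator.

```rust
/// Split "- term :: description" into its two parts.
/// Returns None for items that are not definition-list items.
pub fn split_definition(item: &str) -> Option<(&str, &str)> {
    let body = item.trim_start();
    let body = body
        .strip_prefix("- ")
        .or_else(|| body.strip_prefix("+ "))?;
    let (term, description) = body.split_once(" :: ")?;
    Some((term.trim(), description.trim()))
}
```

A real implementation would live in the list parser and would also need to handle inline markup inside the term.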

How do you get content from a section?

Hi,

I might've missed something obvious here 😁

So there's an easy way to grab headlines, but I couldn't find a way to grab the content that appears under a headline. There's even a set_section_content, but no get_* functions.

There's also a children method, but it only gives me headlines?

The only way I can see to do this is with iter, which goes through the entire file, and then I need to filter the results somehow. That sounds a bit cumbersome.

I was wondering if there was an easier way to grab the section content for a specific headline?

Thanks in advance!

Inconsistent handling of edge case headlines lacking space

I wanted to mention a few edge cases I found involving headlines that don't have a space after the stars. Per the spec, none of these are headlines.

The first two examples below are treated as headlines by org-element, but not by org-agenda tags, Org mode font lock, or orgize.

The last two are treated as headlines by org-element and orgize, but not by Org mode font lock.

It's fine if you want to just close this -- I'm not even entirely sure what the correct behavior should be. I just wanted to mention it since I found it when testing something. I could also send a PR for adding a errata/quirks documentation if you like.

fn main() {
    // Org mode: Treats it as not a headline (both font lock and agenda tag view)
    // Orgize 0.8.3: Not a headline
    // (with-temp-buffer (insert "*a\n")(org-element-parse-buffer (point-max))) --> title "a", level 1
    let org = orgize::Org::parse("*a :foo:\n");
    assert_eq!(0, org.headlines().count());

    // Org mode: Treats it as not a headline (both font lock and agenda tag view)
    // Orgize 0.8.3: Not a headline
    // (with-temp-buffer (insert "*a")(org-element-parse-buffer (point-max))) --> title "a", level 1
    let org = orgize::Org::parse("*a :foo:");
    assert_eq!(0, org.headlines().count());

    // Org mode: Treats it as not a headline (font lock)
    // Orgize 0.8.3: Headline with level 3 and title ""
    // (with-temp-buffer (insert "***")(org-element-parse-buffer (point-max))) --> title "", level 3
    let org = orgize::Org::parse("***");
    assert_eq!(1, org.headlines().count());
    assert_eq!(3, org.headlines().next().unwrap().level());
    assert_eq!("", org.headlines().next().unwrap().title(&org).raw);

    // Org mode: Treats it as not a headline (font lock)
    // Orgize 0.8.3: Headline with level 1 and title ""
    // (with-temp-buffer (insert "*\n")(org-element-parse-buffer (point-max))) --> title "", level 1
    let org = orgize::Org::parse("*\n");
    assert_eq!(1, org.headlines().count());
    assert_eq!(1, org.headlines().next().unwrap().level());
    assert_eq!("", org.headlines().next().unwrap().title(&org).raw);
}

Tags accept non-alphanumeric characters

Another fairly minor issue with tag parsing is that Orgize will accept non-alphanumeric characters (e.g., parentheses) in tags. On the plus side, it handles Unicode alphanumeric characters correctly, the same way org-mode does:

fn main() {
    // Bad
    let org = orgize::Org::parse("* a :(:");
    assert!(org.headlines().next().unwrap().title(&org).tags.is_empty());

    // Good -- 郫县豆瓣酱 is alphanumeric.
    let org = orgize::Org::parse("* a :郫县豆瓣酱:");
    assert_eq!(vec!("郫县豆瓣酱"), org.headlines().next().unwrap().title(&org).tags);
}
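
A stricter tag check could look like the sketch below. It assumes the character set from the Org syntax document (alphanumerics plus _, @, # and %), and `is_valid_tag` and `parse_tags` are hypothetical helpers rather than orgize internals. Rust's `char::is_alphanumeric` is Unicode-aware, so the second case from the example above still passes.

```rust
/// A tag is non-empty and made of alphanumerics, '_', '@', '#' or '%'.
pub fn is_valid_tag(tag: &str) -> bool {
    !tag.is_empty()
        && tag
            .chars()
            .all(|c| c.is_alphanumeric() || matches!(c, '_' | '@' | '#' | '%'))
}

/// Extract tags from the last token of a headline, e.g. "* a :b:c:".
/// If any tag is invalid, the token is not treated as a tag list at all.
pub fn parse_tags(headline: &str) -> Vec<&str> {
    let token = match headline.split_whitespace().last() {
        Some(token) => token,
        None => return Vec::new(),
    };
    if token.len() < 3 || !token.starts_with(':') || !token.ends_with(':') {
        return Vec::new();
    }
    // ':' is a single byte, so slicing one byte off each end is safe.
    let tags: Vec<&str> = token[1..token.len() - 1].split(':').collect();
    if tags.iter().all(|tag| is_valid_tag(tag)) {
        tags
    } else {
        Vec::new()
    }
}
```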

Construct Org tree

Is it possible to somehow manually construct the Org AST and then use the write_org method to convert it back to an org file?

The use-case that I'm trying to accomplish is the following:

1. Use orgize to parse an org file into the Org struct
2. Serialize and store this information in a DB
3. Make modifications to the DB as needed
4. Deserialize this back to an Org struct
5. Convert this back to an org file using the write_org method.

Currently, I am serializing and deserializing to JSON using serde. I had to fork and make some changes to get that to work: https://github.com/samyak-jain/orgize/tree/deser. I then store this JSON directly in SQLite.
However, this setup is really cumbersome to work with. The indextree data structure, when serialized to JSON, doesn't lend itself well to structural changes. The same element is duplicated many times across the structure. While it may be a good way to store the representation, it seems really hard to modify.

I would be interested to know thoughts on:

1. Is there an easier-to-work-with representation that we can convert back and forth from? I know there is an iterator over events, which seems slightly better, but I don't see a way to get the Org structure back from that.
2. Having APIs to modify the Org structure directly more easily? This isn't as ideal, since I cannot modify the DB independently and would have to rely on deserializing to the Org struct and then making modifications every time.

Is there something missing? Would love to know if there are better ways to tackle this.

UnexpectedChildren when parsing a footnote definition

Hello!
I'm trying to parse a document with footnotes, but the error message advised me to open a new issue :) The minimal example is:

Foo bar[fn:1] baz

* Footnotes

[fn:1] http://example.com

I use only "\n" as a new line. And the error is:

Org validation failed. 1 error(s) found:
UnexpectedChildren { at: NodeId { index1: 6, stamp: NodeStamp(0) } } at FnDef(FnDef { label: "1", post_blank: 0 })
thread 'main' panicked at 'Looks like there's a bug in orgize! Please report it with your org-mode content at https://github.com/PoiScript/orgize/issues.', /home/fr/.cargo/registry/src/github.com-1ecc6299db9ec823/orgize-0.9.0/src/validate.rs:203:17

I also tried to parse the official guide https://git.savannah.gnu.org/cgit/emacs/org-mode.git/plain/doc/org-guide.org (which looks like a good test for a parser), but had no luck.

Crash when parsing unmatched * or /

Hi,

I found a crash related to unmatched * and /. I didn't check any other Org formatting markers, but they may likewise be affected. I am using orgize 0.8.1.

fn main() {
    let crashes = &[
        "* / // a",
        "\"* / // a\"",
        "* * ** a",
        "* 2020\n** December\n*** Experiment\nType A is marked with * and type B is marked with **.\n",
        "* 2020\n:DRAWER:\n* ** a\n:END:",
    ];

    let okies = &["* * ** :a:", "* * ** "];

    for crash in crashes {
        let crash = crash.to_string();
        assert!(std::thread::spawn(move || {
            let _ = orgize::Org::parse(&crash);
        })
        .join()
        .is_err())
    }

    for ok in okies {
        let ok = ok.to_string();
        assert!(std::thread::spawn(move || {
            let _ = orgize::Org::parse(&ok);
        })
        .join()
        .is_ok())
    }

    println!("\n\n\n");
    println!("***");
    println!("*** All examples did/did not panic as expected, regardless of the presence of panics printed above.");
    println!("***");
}

Crashes related to numbers, empty lists, and whitespace preceding a headline

Thank you for fixing the crashes I've reported so far, I appreciate you taking the time. I found a few more crashes in orgize 0.8.2 when trying to parse all my org files, and have produced minimized examples below. I think that these are all the crashes that are left from my personal org files, though of course it is hard to be sure.

fn main() {
    let crashes = &[
        // Number with a . and whitespace.
        "0. ",
        "* \n0. ",
        " 0. ",

        // Whitespace at start of line then *.
        " * ",
        "\t* ",

        // Seems to be an issue with lists with empty elements.
        "- ",
        "- hello\n- ",
        "- \n- hello",
        "- hello\n- \n- world",
        "* world\n- ",
    ];

    for crash in crashes {
        let crash = crash.to_string();
        assert!(std::thread::spawn(move || {
            let _ = orgize::Org::parse(&crash);
        })
        .join()
        .is_err())
    }

    println!("\n\n\n");
    println!("***");
    println!("*** All examples did/did not panic as expected, regardless of the presence of panics printed above.");
    println!("***");
}

:PROPERTIES: takes precedence over headlines

This one is interesting, and from what I remember of parser generators, may be tricky to resolve. Currently, it seems that Orgize will fail to parse headlines "inside" a :PROPERTIES: drawer. This occurs only when the headline inside the drawer is at a greater level, and only in properties drawers. Additionally, the properties drawer must immediately follow the headline and planning (this is required by the Org spec to be a valid properties drawer, but you may have a different drawer named :PROPERTIES: elsewhere in the node).

Note that this is arguably a broken org file -- all org files are valid strings, but, e.g., org-lint would catch this case. But I do think that properly parsing headlines should work even in the presence of broken drawers.

fn main() {
    // Passes -- drawer other than PROPERTIES.
    let org = orgize::Org::parse("* Hello\n:MYDRAWER:\n** World\n:END:");
    assert_eq!(org.headlines().count(), 2);

    // Passes -- both headlines at same depth.
    let org = orgize::Org::parse("* Hello\n:PROPERTIES:\n* World\n:END:");
    assert_eq!(org.headlines().count(), 2);

    // Passes -- in order to be the special "properties" drawer, the drawer must
    // immediately follow the headline and planning.
    let org = orgize::Org::parse("* Hello\nSpacer\n:PROPERTIES:\n** World\n:END:");
    assert_eq!(org.headlines().count(), 2);

    // Fails
    let org = orgize::Org::parse("* Hello\n:PROPERTIES:\n** World\n:END:");
    assert_eq!(org.headlines().count(), 2);
}

optional support for org-fc cloze markup

I use https://github.com/l3kn/org-fc, an anki-style flashcarding system that has a few "card types"; most of them are based on the shape of a heading (one side of the card is the heading, one is the text therein), a text-input, etc, but it also introduces a custom markup for what it calls "cloze" cards:

The card's text contains one or more holes. During review, one hole is hidden while the text of (some) remaining ones is shown.

These introduce a markup that interacts poorly with orgize's parser, especially if you put non-text syntax inside of the "holes". With the 0.9 parser, I just used a regex match on a Text block to transform these clozes into <span> elements on export, but if a link or similar is in there, it breaks up the text block.

Deletions can have the following forms

{{text}}

{{text}@id}

{{text}{hint}}

{{text}{hint}@id}

It would be nice to have a (perhaps feature gated) extension to the rowan syntax to parse these and have access to the underlying tokens in the text and hint for more robust export.
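To make the four forms above concrete, here is a minimal standalone sketch of a recognizer for a single cloze deletion. It does not use orgize or rowan at all; the `Cloze` type and `parse_cloze` function are hypothetical names for illustration, and it assumes braces do not nest inside the text or the hint.

```rust
/// One cloze deletion (hypothetical type, not part of orgize or org-fc).
#[derive(Debug, PartialEq)]
pub struct Cloze {
    pub text: String,
    pub hint: Option<String>,
    pub id: Option<String>,
}

/// Tries to parse one cloze deletion at the very start of `input`.
/// Returns the parsed cloze and the number of bytes consumed.
/// Assumption: `text` and `hint` contain no braces of their own.
pub fn parse_cloze(input: &str) -> Option<(Cloze, usize)> {
    let rest = input.strip_prefix("{{")?;
    let text_end = rest.find('}')?;
    let text = rest[..text_end].to_string();
    // Position just past "{{text}".
    let mut pos = 2 + text_end + 1;

    // Optional "{hint}".
    let mut hint = None;
    if input[pos..].starts_with('{') {
        let len = input[pos + 1..].find('}')?;
        hint = Some(input[pos + 1..pos + 1 + len].to_string());
        pos += 1 + len + 1;
    }

    // Optional "@id", which runs up to the closing brace.
    let mut id = None;
    if input[pos..].starts_with('@') {
        let len = input[pos + 1..].find('}')?;
        id = Some(input[pos + 1..pos + 1 + len].to_string());
        pos += 1 + len;
    }

    // The outer closing brace is mandatory.
    if !input[pos..].starts_with('}') {
        return None;
    }
    Some((Cloze { text, hint, id }, pos + 1))
}
```

A real extension would presumably recurse into the text and hint so that links and emphasis inside a hole stay available as tokens; this sketch only shows where the boundaries fall.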

FR: Make properties iteration order deterministic

Currently, the properties API uses std::collections::HashMap, which is non-deterministic in iteration order. This means that parsing and then emitting an org file is not idempotent -- each time the file is parsed then written, the properties in each headline can change.

If you're using git, this leads to spurious diffs. If you prefer to have your properties in a certain order, then that will also be lost.

An easy "fix" would be to use https://docs.rs/indexmap/1.3.2/indexmap/map/struct.IndexMap.html which would maintain the order of the properties, though this would then mean either changing the API, copying to a hash map each time, or maintaining both a hash map and an indexmap (or just an order list).

I'm currently working around this by doing my own parsing of org files into a tree of headlines, and then only using Orgize to change them, or parse them but not write them out. This means that headlines never change unless I change them, in which case their property order is lost.
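For illustration, the "order list" idea can be sketched with nothing but the standard library. `OrderedProps` is a hypothetical type, not orgize's API; it trades O(1) lookup for a linear scan, which seems acceptable for the handful of properties a headline usually has.

```rust
/// Insertion-ordered property list (hypothetical type for illustration).
#[derive(Debug, Default)]
pub struct OrderedProps {
    entries: Vec<(String, String)>,
}

impl OrderedProps {
    /// Sets `key` to `value`. An existing key keeps its original position.
    pub fn insert(&mut self, key: &str, value: &str) {
        match self.entries.iter_mut().find(|(k, _)| k.as_str() == key) {
            Some((_, v)) => *v = value.to_string(),
            None => self.entries.push((key.to_string(), value.to_string())),
        }
    }

    /// Looks a property up by key with a linear scan.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.entries
            .iter()
            .find(|(k, _)| k.as_str() == key)
            .map(|(_, v)| v.as_str())
    }

    /// Writes the drawer back out in insertion order.
    pub fn to_org(&self) -> String {
        let mut out = String::from(":PROPERTIES:\n");
        for (k, v) in &self.entries {
            out.push_str(&format!(":{}: {}\n", k, v));
        }
        out.push_str(":END:\n");
        out
    }
}
```

Because the output order depends only on the order of insertion, parsing and re-emitting the same drawer gives the same bytes every time.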

Support for org-entities (special symbols)

Hi,

I use Firn, a static site generator that depends on Orgize. I want to use org-mode's special symbols in my org files and produce the expected HTML (non-breaking spaces and narrow no-break spaces for the moment, but I may need many other special symbols too).

When I use the live demo
and input this: Hello\nbsp{}world
I get this:
{"type":"document","children":[{"type":"section","children":[{"type":"paragraph","children":[{"type":"text","value":"Hello\\nbsp{}world"}]}]}]}

Special symbols are parsed as plain text in the generated json.

Do you think that support for org-entities is doable, or should it be done directly in the client code?
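As a rough idea of what the client-side route could look like, here is a sketch that post-processes a text value and swaps entity names for HTML. The three-entry table is only an illustrative subset of org-entities, and `entity_html` and `replace_entities` are hypothetical helper names, not orgize functions.

```rust
/// Tiny illustrative subset of org-entities: name -> HTML replacement.
fn entity_html(name: &str) -> Option<&'static str> {
    match name {
        "nbsp" => Some("&nbsp;"),
        "alpha" => Some("&alpha;"),
        "rarr" => Some("&rarr;"),
        _ => None,
    }
}

/// Replaces `\name` and `\name{}` with HTML for every known entity name.
/// Unknown names are left untouched, backslash included.
pub fn replace_entities(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find('\\') {
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 1..];
        // Assumption: an entity name is the run of ASCII letters after the backslash.
        let name_len = after
            .bytes()
            .take_while(|b| b.is_ascii_alphabetic())
            .count();
        match entity_html(&after[..name_len]) {
            Some(html) => {
                out.push_str(html);
                let tail = &after[name_len..];
                // Swallow the optional "{}" terminator.
                rest = tail.strip_prefix("{}").unwrap_or(tail);
            }
            None => {
                out.push('\\');
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}
```

Applied to the demo input, `replace_entities("Hello\\nbsp{}world")` gives `Hello&nbsp;world`. Doing this in the parser instead would avoid touching text inside code or verbatim spans, which this sketch does not distinguish.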

Tags separated from headline text by \t not recognized

This is a fairly minor issue, but it does differ from org-mode's behavior. My read of the spec is completely ambiguous on this point -- it explicitly specifies (and org will only recognize) that the *** of the headline MUST end with a "space", but org-mode seems to accept \t and possibly other exotic whitespace (untested) for separating the tags from the rest of the headline.

I'm going to file this since I noticed it, but I can also see the argument that this is working as intended.

discuss: a new specification

A correct implementation requires a precise specification. However, neither the org syntax draft nor org-elements-api can really serve as a good specification: the org syntax draft doesn't specify the syntax unambiguously, while org-elements-api is quite buggy and produces apparently wrong results in some cases (e.g. #22, #19 and #14).

To resolve that, I think the ultimate solution is to maintain a specification of our own. To be clear, this is not going to create a subset of the org markup language; it just describes the results you can expect when using orgize.

Fortunately, we don't need to start from scratch. We can make a copy of the original org syntax and adapt it to our needs (like defining required and optional fields). Then, I would like to borrow some concepts from the CommonMark Spec, especially the parts on handling whitespace and list indentation. In short, our new specification will be based mostly on the org syntax, with some modifications and additions.

Feel free to leave a comment below if you have any ideas about the upcoming specification =).

As a reminder, here are some issues/commits that need to be reopened/reverted after applying the new specification.

#17: should be reopened. Org syntax clarified that the title should be matched after the other parts have been matched.

ba9c83c: should be reverted. We should only handle ASCII whitespace. I think it is totally unnecessary to take care of Unicode whitespace if we don't need to be compatible with org-elements-api.

#26 #34: should be closed and moved to a new issue. Org syntax specifies that a headline's stars must be followed by at least one space character. But I think a tab is also acceptable (just like CommonMark does, and I will include it in the new specification later). Hence, only *** , *** \n or ***\t\t should be valid, but not ***, ****\r or ***\n.

#33: should be closed.

FR: Timestamps: Accept timestamps with DAYNAME, repeater, and delay in any order

The Org spec requires the order be DAYNAME, then repeater-or-delay up to twice, but org-mode and org-element are robust, and will parse them in any order. So for example, these would all be accepted identically:

[2020-01-01 5:05 .+1w Fri --2d]
[2020-01-01 5:05 .+1w --2d Fri]
[2020-01-01 5:05 Fri .+1w --2d]

also valid is [2020-01-01 5:05 .+1w --2d]; covered under #27 .
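One way to get order independence is to classify each whitespace-separated token after the date and time by its shape rather than by its position. The sketch below does that with the standard library only. `Suffix`, `classify_suffix` and `is_value_unit` are hypothetical names, and rejecting a second repeater or a second delay is an assumption of this sketch, not something the issue specifies.

```rust
/// The optional parts that may follow the date and time (hypothetical type).
#[derive(Debug, Default, PartialEq)]
pub struct Suffix<'a> {
    pub dayname: Option<&'a str>,
    pub repeater: Option<&'a str>,
    pub delay: Option<&'a str>,
}

/// True for digits followed by exactly one unit letter, e.g. "1w" or "12h".
fn is_value_unit(s: &str) -> bool {
    match s.char_indices().last() {
        Some((i, unit)) if "hdwmy".contains(unit) => {
            i > 0 && s[..i].bytes().all(|b| b.is_ascii_digit())
        }
        _ => false,
    }
}

/// Classifies the tokens in any order. Returns None for an unrecognized
/// token or when the same kind of token appears twice.
pub fn classify_suffix(tokens: &str) -> Option<Suffix<'_>> {
    let mut out = Suffix::default();
    for tok in tokens.split_whitespace() {
        // Repeater marks: "++", ".+" or "+". Delay marks: "--" or "-".
        let repeater = tok
            .strip_prefix("++")
            .or_else(|| tok.strip_prefix(".+"))
            .or_else(|| tok.strip_prefix('+'));
        let delay = tok.strip_prefix("--").or_else(|| tok.strip_prefix('-'));

        let slot = if let Some(rest) = repeater {
            if !is_value_unit(rest) {
                return None;
            }
            &mut out.repeater
        } else if let Some(rest) = delay {
            if !is_value_unit(rest) {
                return None;
            }
            &mut out.delay
        } else if tok.chars().all(char::is_alphabetic) {
            &mut out.dayname
        } else {
            return None;
        };
        // Store the whole token; a previous value means a duplicate.
        if slot.replace(tok).is_some() {
            return None;
        }
    }
    Some(out)
}
```

With this approach the three orderings listed above all produce the same `Suffix`, and a missing DAYNAME simply leaves `dayname` as `None`.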

support for top-level properties drawer

As of Org 9.5, properties drawers are allowed before the first headline:

Org mode is moving more towards making things before the first headline behave just as if it was at outline level 0. Inheritance for properties will work also for this level. In other words: defining things in a property drawer before the first headline will make them "inheritable" for all headlines.

org-roam uses this in the org-roam-capture template and parses the level 0 heading as if it were a regular org-roam node/org-mode heading. It would be nice if there was a Document::properties() which returns an Option<PropertyDrawer>
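Until something like that exists, a caller could pull the drawer out of the raw text before handing the rest to the parser. The sketch below is a simplification under stated assumptions: the drawer must be the first non-blank thing in the file (comments or keywords before it are not handled), and `document_properties` is a hypothetical helper, not an orgize function.

```rust
/// Extracts `:KEY: value` pairs from a properties drawer that opens the
/// document. Returns None when there is no such drawer, when it is not
/// closed by `:END:`, or when a line inside it is not a property.
pub fn document_properties(text: &str) -> Option<Vec<(String, String)>> {
    // Skip leading blank lines; the next line must open the drawer.
    let mut lines = text.lines().skip_while(|l| l.trim().is_empty());
    if !lines.next()?.trim().eq_ignore_ascii_case(":PROPERTIES:") {
        return None;
    }

    let mut props = Vec::new();
    for line in lines {
        let line = line.trim();
        if line.eq_ignore_ascii_case(":END:") {
            return Some(props);
        }
        // Expect ":KEY: value" or ":KEY:" (empty value).
        let body = line.strip_prefix(':')?;
        let (key, value) = body.split_once(':')?;
        if key.is_empty() || key.contains(char::is_whitespace) {
            return None;
        }
        props.push((key.to_string(), value.trim().to_string()));
    }
    // Reached end of input without seeing :END:.
    None
}
```

A headline line inside the drawer makes the function return `None`, so a drawer attached to a first-level heading is never mistaken for a document-level one.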

Priority cookie causes first word of title to be swallowed when there is no space after it

I was surprised to find that priority cookies don't need any whitespace after them, but orgize, org element, the spec, and org-mode all agree that the title may follow it immediately.

When the first word of the title immediately follows the priority cookie, Orgize correctly parses the priority, but will swallow the first word (i.e., all characters after the ] until the next whitespace) from the title.

This also occurs when the headline is commented (i.e., the COMMENT word is swallowed, and thus the headline is not detected as commented).

An interesting relationship to #17 is that "::" is also swallowed, giving an empty title vs "::" for org-element.

The last is a minor edge case probably not worth actually fixing unless it comes out of the other two (I can add it to the errata if it remains), but I think that the swallowed word is worth opening an issue for.

fn main() {
    let s = "* [#B]this_word_is_swallowed this_one_is_not";
    let org = orgize::Org::parse(&s);

    // Orgize swallows the word; org-element does not:
    // (headline (:raw-value this_word_is_swallowed this_one_is_not :priority 66 :title this_word_is_swallowed this_one_is_not))
    assert_eq!(
        "this_word_is_swallowed this_one_is_not",
        org.headlines().next().unwrap().title(&org).raw
    );

    let s = "* [#B]COMMENT hello world";
    let org = orgize::Org::parse(&s);

    assert!(org.headlines().next().unwrap().title(&org).is_commented());
    assert_eq!(
        "COMMENT hello world",
        org.headlines().next().unwrap().title(&org).raw
    );

    let s = "****** [#B]*  :a: ";
    let org = orgize::Org::parse(&s);
    assert_eq!(org.headlines().next().unwrap().title(&org).tags, vec!("a"));

    let s = "**  DONE [#B]::";
    let org = orgize::Org::parse(&s);
    // (headline (:raw-value :: :level 2 :priority 66 :todo-keyword DONE :title :: ))
    assert_eq!(org.headlines().next().unwrap().title(&org).raw, "::");
}

accessing ExportBlock content

it would be nice if impl ExportBlock had a value() like SourceBlock, and if that was included in the html exporter:

            Event::Enter(Container::ExportBlock(the_block)) => {
                let val = the_block
                    .syntax()
                    .children()
                    .find(|e| e.kind() == orgize::SyntaxKind::BLOCK_CONTENT)
                    .into_iter()
                    .flat_map(|n| n.children_with_tokens())
                    .filter_map(orgize::ast::filter_token(orgize::SyntaxKind::TEXT))
                    .fold(String::new(), |acc, value| acc + &value);

                html_export.push_str(&format!(r#"{}"#, val));
                ctx.skip();
            }

Property drawer parsing edge cases

Hi,

I believe I have identified a bug in the parsing of property drawers. I'm not sure exactly what it is, but I have several bad strings which, when parsed and written out, come back very different. It may be related to properties which have no value, as per https://orgmode.org/worg/dev/org-syntax.html#Node_Properties

https://gist.github.com/calmofthestorm/08d0afeef571312d778958a5a5c2ad69

I have also observed that just :DRAWER: or :END: occurring in a node body will result in an :END: being added. This is still incorrect in my opinion -- a property drawer that cannot be properly parsed should be treated as body text -- but is much more understandable, especially since Org mode can identify and suggest fixing unclosed drawers.

This is a drawback of Org's permissive structure where every string is a valid Org mode file.

Get properties for zeroth section

Hi, firstly apologies for the newbie question. I'm new to rust and still learning.

I'm writing a tool that uses orgize. I'm attempting to extract only the properties from the Zeroth section of the org-mode file. My problem is that the iterator returns an indextree Arena and I don't know how to match it against "Element::Drawer" from the arena. Do I have to make some kind of type inference or typecast for this to work?

Here is my code currently:

    let doc_node = org.document().section_node().unwrap();
    for node_id in doc_node.children(org.arena()) {
        let node = org.arena().get(node_id).unwrap().into();
        println!("\nNode? {node:?}");
        match node {
            Element::Drawer(d) => {
                println!("We found the drawer: {d:?}");
            },
            _ => {
                println!("Something else: {node:?}");
            }
        }
    }

Of course the code above does not work because the node and Element::Drawer aren't the same type. How could I match against specific element types when traversing a sub-section of the tree as I have above?

Exposing text position of each element

One thing that would be useful for a project I'm working on would be the ability to get the exact start and end of a given element in the original document text. I currently only need this for headlines, but could see other uses depending on how difficult it is to implement.

At the moment, I use my own very simple parser that breaks an org document into a tree of headlines and then use orgize to parse each headline as necessary. Another advantage of this is that it guarantees that parsing and then exporting a document produces an identical file, which is a useful feature for my use case.

I was mostly curious if you had any thoughts on how hard these would be to implement in orgize and whether you had suggestions on how to do so. My current solution is basically functional but cumbersome and frustrating, and also requires multiple storage or frequent reparsing (pick your poison) of text, so I'm considering alternatives and one possibility would be to upstream the features I need.
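For reference, the kind of "very simple parser" described above can be sketched in a few lines: scan the text line by line and record the byte range each headline owns. The names are hypothetical, and the sketch assumes a headline is one or more stars followed by a space at the start of a line; each span runs up to the next headline of any level, so it covers the headline's own section rather than its whole subtree.

```rust
/// Byte range of one headline and its own section (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct HeadlineSpan {
    pub level: usize,
    pub start: usize,
    pub end: usize,
}

/// Finds every headline and the byte range it covers in `text`.
pub fn headline_spans(text: &str) -> Vec<HeadlineSpan> {
    let mut spans: Vec<HeadlineSpan> = Vec::new();
    let mut offset = 0;
    // split_inclusive keeps the '\n', so the line lengths sum to text.len().
    for line in text.split_inclusive('\n') {
        let stars = line.bytes().take_while(|&b| b == b'*').count();
        let is_headline =
            stars > 0 && matches!(line.as_bytes().get(stars), Some(b' '));
        if is_headline {
            // The previous headline's section ends where this one starts.
            if let Some(prev) = spans.last_mut() {
                prev.end = offset;
            }
            spans.push(HeadlineSpan {
                level: stars,
                start: offset,
                end: text.len(),
            });
        }
        offset += line.len();
    }
    spans
}
```

Since the spans are contiguous, concatenating `&text[span.start..span.end]` for every span reproduces everything from the first headline onward byte for byte, which is the round-trip guarantee mentioned above.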

Orgize validation fails when parsing certain unicode values

In general I expect weird unicode values to get "interesting" results, but I'm going to report this since it results in a panic when debug_assertions are enabled.

Each of these characters, alone, as input, results in a panic in debug builds. I recommend running the example below with --release as otherwise calling parse will panic.

Up to you as to whether it's worth fixing. I saw you had a fuzz test in the source tree so I assume that crashes like this might be of interest, but I can also understand not wanting to go down the unicode rabbithole and it's unclear to me how often these actually come up in real use.

The one or two I tested with org-element work correctly -- a headline containing them in the title is parsed correctly.

fn main() {
    let s = "\u{000b}\u{0085}\u{00a0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200a}\u{2028}\u{2029}\u{202f}\u{205f}\u{3000}";

    for (i, c) in s.chars().enumerate() {
        let org = orgize::Org::parse_string(c.to_string());
        println!("Validation ok for {}: {}", i, org.validate().is_empty());
    }
}

Announcing v0.10

Hello everyone. After leaving this crate almost unmaintained for over three years, I finally had some time to pick up where we left off. :)

Three years have passed, and many new features and awesome crates have been introduced in the Rust world. So I think it's time for us to rebuild orgize with these new features and crates in mind, and I'm thrilled to announce that I'll be publishing orgize v0.10 in the next couple of weeks.

This version is a total rewrite of orgize. Some notable changes include:

Switching to `rowan`

In this version, we're replacing the underlying data structure from indextree to rowan, one of the core crates of rust-analyzer.

To visualise that, the input * Heading *bold* now becomes something like:

[email protected]
  [email protected]
    [email protected] "*"
    [email protected] " "
    [email protected]
      [email protected] "Heading "
      [email protected]
        [email protected] "*"
        [email protected] "bold"
        [email protected] "*"
    [email protected] "\n"

rowan helps us resolve some long-standing problems, like high memory usage and not being able to get the position of each parsed element.

Also, thanks to the rich ecosystem around rowan, we might have an LSP server for org-mode in the future, just like rust-analyzer does for the Rust language.

Traversing org element tree

Another big change in this version is that org.iter() is now superseded by org.traverse(). When walking through the org element tree, traversal is much more flexible than simple iteration. For example, you can now skip a whole subtree but continue with its siblings, without aborting the entire traversal, by calling traversal_context.skip().

Lossless parsing

Orgize is now a lossless parser. This means you can always get the original input back from the Org struct:

let s = "* hello /world/!";
let org = Org::parse(s);
assert_eq!(s, org.to_org());

Try it out today

You can try out v0.10 by installing it from crates.io:

cargo add [email protected]

or by using the new demo website: https://poiscript.github.io/orgize

I hope you guys enjoy the new release, and feel free to send any feedback!

What's next

Before publishing v0.10, I want to fix as many of the other issues and merge as many pull requests as possible.

After that, I want to support parsing org entities and inlinetasks, and try to build an LSP server for org-mode using tower-lsp and orgize.

PLANNING is not handled

It seems some cases are not handled in Traverse::element.
The minimal example:

use orgize::{
    export::{Event, TraversalContext, Traverser},
    Org,
};

#[derive(Default)]
struct Test(String);

impl Traverser for Test {
    fn event(&mut self, _event: Event, _ctx: &mut TraversalContext) {}
}

fn main() {
    let mut test = Test::default();

    // Doesn't work: 
    // thread 'main' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/orgize-0.10.0-alpha.7/src/export/traverse.rs:210:29:
    // PLANNING is not handled
    Org::parse("* Task\nCLOSED: [2024-04-21 Sun 12:00]").traverse(&mut test);

    // Works:
    // Org::parse("CLOSED: [2024-04-21 Sun 12:00]").traverse(&mut test); 
}

Cannot parse empty headlines with tags

It looks like an empty headline's tags get treated as part of the raw title, and the tags are not picked up.

fn main() {
    let org = orgize::Org::parse("* :a:");
    assert_eq!(vec!("a"), org.headlines().next().unwrap().title(&org).tags);
}

Debug Panic on Parsing Table

Hi! I wanted to thank you for making this great crate. I'm currently using orgize to add org rendering to crates.io and the background workers are panicking when they encounter tables:

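use orgize::Org;

// Constants are conventionally upper case; `'static` is implied for consts.
const ORG_CONTENTS: &str = r#"
|-----+-----+-----|
| foo | bar | baz |
|-----+-----+-----|
|   0 |   1 |   2 |
|   4 |   5 |   6 |
|   7 |   8 |   9 |
|-----+-----+-----|
"#;

fn main() {
    let mut buf = Vec::new();
    // write_html returns a Result; unwrap it so an I/O error is not silently dropped.
    Org::parse(ORG_CONTENTS).write_html(&mut buf).unwrap();
    println!("Hello, world!");
}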

This isn't a problem, per se, but would it be possible to avoid panicking? Ideally we'll have a

pub fn try_parse_custom(text: &'a str, config: &ParseConfig) -> Result<Org<'a>, Vec<ValidationError>>

so we could recover from malformed org-mode files.
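In the meantime, a caller-side stopgap might be to wrap the parse call in `std::panic::catch_unwind`. The sketch below is generic over a closure so that it does not assume anything about orgize's API; `try_parse` here is a hypothetical wrapper, not the function proposed above, and it only helps when the binary is built with unwinding panics (not `panic = "abort"`).

```rust
use std::panic::{self, UnwindSafe};

/// Runs `parse` and turns a panic into an Err carrying the panic message.
pub fn try_parse<T>(parse: impl FnOnce() -> T + UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(parse).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

The closure would contain the actual parse and render calls, returning an owned value such as the HTML buffer. The default panic hook still prints the message to stderr, and a proper `Result`-returning API in the library would remain the cleaner fix.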

I could open a PR if you'd like :P
