magnusmanske / mediawiki_rust
Rust API interface for MediaWiki sites.
License: Apache License 2.0
What's the purpose of src/bin/main.rs? It looks like a script, but I'm not sure if it's just for testing or if we should be updating and maintaining it as part of this crate.
The specification is documented at https://en.wikipedia.org/wiki/Template:Bots. I'd expect a function signature of

fn bot_may_edit(wikitext: &str, bot: &str) -> bool { ... }

This probably warrants its own module, maybe behind a feature flag, since it'll probably add a dependency on regex.
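To make the idea concrete, here is a minimal sketch of such a function using plain string matching instead of regex. It only handles the simplest `{{nobots}}`, `{{bots|deny=...}}`, and `{{bots|allow=...}}` forms; the full template grammar in the spec (multiple parameters, `optout=`, message-type restrictions) needs more care, and `bot_may_edit` is the hypothetical name from above, not anything the crate currently exposes:

```rust
/// Simplified check of the {{bots}}/{{nobots}} exclusion templates.
/// Returns true if `bot` is allowed to edit a page with this wikitext.
fn bot_may_edit(wikitext: &str, bot: &str) -> bool {
    let lower = wikitext.to_lowercase();
    let bot = bot.to_lowercase();

    // {{nobots}} bans all bots outright.
    if lower.contains("{{nobots}}") {
        return false;
    }

    // {{bots|deny=all}} or {{bots|deny=BotA,BotB,...}}
    if let Some(start) = lower.find("{{bots|deny=") {
        let rest = &lower[start + "{{bots|deny=".len()..];
        if let Some(end) = rest.find("}}") {
            let list = &rest[..end];
            if list == "all" || list.split(',').any(|b| b.trim() == bot) {
                return false;
            }
        }
    }

    // {{bots|allow=...}}: only the listed bots may edit.
    if let Some(start) = lower.find("{{bots|allow=") {
        let rest = &lower[start + "{{bots|allow=".len()..];
        if let Some(end) = rest.find("}}") {
            let list = &rest[..end];
            return list == "all" || list.split(',').any(|b| b.trim() == bot);
        }
    }

    // No exclusion template found: editing is allowed.
    true
}

fn main() {
    assert!(!bot_may_edit("{{nobots}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|deny=all}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|deny=ExampleBot,Other}}", "ExampleBot"));
    assert!(bot_may_edit("{{bots|allow=ExampleBot}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|allow=OtherBot}}", "ExampleBot"));
    assert!(bot_may_edit("plain wikitext, no templates", "ExampleBot"));
    println!("ok");
}
```

A regex-based version would collapse the two template branches into one pattern, which is why the feature-gated regex dependency seems worth it for the real implementation.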
Hi,
I have been using mediawiki for a small scraping tool, but an error came up inside the lib recently (It was working fine before).
I have cloned the repo to see if it was just my setup but I am having the same issue:
Compiling mediawiki v0.2.7 (E:\Projects\mediawiki_rust)
error[E0432]: unresolved import `futures`
--> src\api.rs:25:5
|
25 | use futures::{Stream, StreamExt};
| ^^^^^^^ use of undeclared crate or module `futures`
error[E0433]: failed to resolve: use of undeclared crate or module `futures`
--> src\api.rs:364:9
|
364 | futures::stream::unfold(initial_query_state, |mut query_state| async move {
| ^^^^^^^ use of undeclared crate or module `futures`
error: aborting due to 2 previous errors
Some errors have detailed explanations: E0432, E0433.
For more information about an error, try `rustc --explain E0432`.
error: could not compile `mediawiki`
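The two unresolved-import errors suggest that `futures` is simply missing from the crate's `Cargo.toml` dependencies, perhaps dropped in a recent commit. Assuming that's the cause, adding it back should fix the build (the version here is a guess matching the async/await-era `Stream`/`StreamExt` usage in `src/api.rs`):

```toml
[dependencies]
futures = "0.3"
```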
What I did was:
Hi Magnus,
First, thanks for working on this crate - it is a great foundation and all of my tools so far use it.
I started writing some more advanced bots last week using this crate as the base, and felt like I was missing things I've come to expect from using Pywikibot for so many years. IMO, right now the crate takes care of the basics: login, token handling, and simple objects for titles, getting page text and other properties like links, external links, coordinates, editing, etc. But there are no high-level error types, no credential storage/handling, no automatic retries, no logging, and no other actions like page moving, deletion, or protection. And then there's stuff like {{nobots}} handling, which is solely for bots and not other MW API consumers.
What do you see the scope of this crate as being? Are more bot-like functions and high-level types welcome contributions? Or would you see those go in a higher level "wikibot" crate that builds on top of this one? Something in the middle?
I don't want to step on any toes nor duplicate any work, but this is something I'd like to work on to make adopting Rust even easier.
Thanks!
P.S. I (along with @enterprisey) am starting a Wikimedia Rust developers user group, which you are definitely invited to join, that we hope will work on issues like this.
Using `serde_json::Value` to represent API responses is pretty laborious. It requires lots of `.as_object()` or `.as_str()` calls and then checking that the result is `Some(_)`, or using `Option::map`, or matching on variants of `serde_json::Value`, etc.
I propose creating custom structs or enums to represent responses. This makes accessing fields in the JSON as simple as accessing fields in the struct or enum. For instance, this example shows a struct that could be used in the `Page::text` method (and returns a `serde_json::Value` if the JSON fails to deserialize as `RevisionsResponse`, though that may not be necessary):
use serde::Deserialize;
use serde_json::Value as JsonValue;
use std::collections::HashMap;
use url::Url;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum FallibleDeserialization<T> {
    Success(T),
    Failure(JsonValue),
}

#[derive(Debug, Deserialize)]
#[allow(unused)]
struct RevisionsResponse {
    batchcomplete: bool,
    query: PagesQuery,
}

#[derive(Debug, Deserialize)]
struct PagesQuery {
    pages: Vec<Page>,
}

#[derive(Debug, Deserialize)]
struct Page {
    #[serde(rename = "pageid")]
    id: u32,
    #[serde(rename = "ns")]
    ns: i32,
    title: String,
    revisions: Vec<Revision>,
}

#[derive(Debug, Deserialize)]
struct Revision {
    slots: HashMap<String, RevisionSlot>,
}

#[derive(Debug, Deserialize)]
struct RevisionSlot {
    #[serde(rename = "contentmodel")]
    content_model: String,
    #[serde(rename = "contentformat")]
    content_format: String,
    content: String,
}

#[tokio::main]
async fn main() {
    let mut url: Url = Url::parse("https://en.wiktionary.org/w/api.php").unwrap();
    url.set_query(Some(&serde_urlencoded::to_string(&[
        ("action", "query"),
        ("prop", "revisions"),
        ("titles", "Template:link"),
        ("rvslots", "*"),
        ("rvprop", "content"),
        ("formatversion", "2"),
        ("format", "json"),
    ]).unwrap()));
    let response: FallibleDeserialization<RevisionsResponse> =
        reqwest::get(url).await.unwrap().json().await.unwrap();
    if let FallibleDeserialization::Success(response) = response {
        for Page { revisions, .. } in response.query.pages {
            for Revision { slots } in revisions {
                let slot = slots
                    .get("main")
                    .or_else(|| slots.iter().next().map(|(_, slot)| slot));
                dbg!(slot);
            }
        }
    }
}
Dependencies in `Cargo.toml`:

reqwest = { version = "0.10", features = ["json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_urlencoded = "0.6"
tokio = { version = "0.2", features = ["rt-core", "macros"] }
url = "2.1"
To make this possible, the `Api` methods that currently decode the response as `serde_json::Value` (ultimately via `Api::query_api_json`) would need to be generic, so that they could instead deserialize into a more specific struct or enum (something like, in the example above, `FallibleDeserialization<RevisionsResponse>`).
And `Api::get_query_api_json_limit` would probably need some way to perform the function of `Api::json_merge` generically, for the various structs that it would return in place of `serde_json::Value`. For instance, it could be generic over a trait that has a `merge` method (maybe named `MergeableResponse`).
Difficulties: using the `Deserialize` derive macro will add to compile time. Also, it may require trial and error to figure out what the schema for the API responses actually is.
Ah, ecosystem churn.
Integrity protection is as simple as filling in the `md5` parameter with a hash of the text value.
For edit conflict protection, we need to pass the revision ID and timestamp of the revision we obtained from `.text()`. My suggestion is to have `Page` keep track of the text/revision info (lazy-loading it), so that if `text()` was called, `edit_text()` can pass it back for conflict detection. Lazy-loading text info would also unlock preloading from generators in the future, but that's another issue...
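As a rough illustration of the parameters involved, here is a sketch of the map an `edit_text()` call would send. The function name is hypothetical, `md5_hex` is a stand-in for a real MD5 computation (e.g. via the `md-5` crate, not written out here), and the parameter names are my reading of the action=edit API (`md5`, `baserevid`, `basetimestamp`), so double-check them against the API docs:

```rust
use std::collections::HashMap;

/// Build the action=edit parameters for an integrity- and
/// conflict-protected edit. `md5_hex` must be the MD5 of `text`.
fn edit_params<'a>(
    text: &'a str,
    base_revid: &'a str,
    base_timestamp: &'a str,
    md5_hex: &'a str,
) -> HashMap<&'static str, &'a str> {
    let mut p = HashMap::new();
    p.insert("action", "edit");
    p.insert("text", text);
    // Integrity protection: the server rejects the edit if this hash
    // doesn't match the text it received.
    p.insert("md5", md5_hex);
    // Conflict protection: revision the text was loaded from; the server
    // reports an edit conflict if the page moved on since then.
    p.insert("baserevid", base_revid);
    p.insert("basetimestamp", base_timestamp);
    p
}

fn main() {
    let p = edit_params("new text", "12345", "2020-01-01T00:00:00Z", "0123abcd");
    assert_eq!(p["action"], "edit");
    assert_eq!(p["md5"], "0123abcd");
    assert_eq!(p["baserevid"], "12345");
    println!("{} params", p.len());
}
```

If `Page` caches the revision ID and timestamp when `text()` loads them, `edit_text()` can fill in the last three parameters without the caller ever seeing them.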
We currently have functions that return a `Result` with an error type of:

- `Box<dyn Error>` (problematic because it's not thread-safe)
- `PageError`
- `String`
- `MediaWikiError`
- `&str`
The inconsistencies make it difficult to use `?`, because you usually have to map the error to something else first. My proposal is to have one single `mediawiki::Error` type that all functions use. This type would have `From` implementations for `reqwest::Error` and `serde_json::Error`, then ones for `MissingPage` and so on, and then a generic `UnknownAPIError`. The thiserror crate should make writing it straightforward.

Clients then only need to implement `From` for one type, and you only need to import one error type.
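To sketch the shape of the proposal: a hand-rolled version of such an error type, without thiserror and without the `reqwest::Error`/`serde_json::Error` wrappers, so it stays dependency-free. The variant names are assumptions following the proposal above, not existing crate types:

```rust
use std::fmt;

/// Hypothetical unified error type for the crate. With thiserror, the
/// Display and Error impls below collapse into derive attributes.
#[derive(Debug)]
enum Error {
    MissingPage(String),
    UnknownApiError(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::MissingPage(title) => write!(f, "page does not exist: {}", title),
            Error::UnknownApiError(msg) => write!(f, "unknown API error: {}", msg),
        }
    }
}

impl std::error::Error for Error {}

/// With every function returning Result<_, Error>, `?` composes
/// without any map_err noise.
fn load(title: &str) -> Result<String, Error> {
    if title.is_empty() {
        return Err(Error::MissingPage(title.to_string()));
    }
    Ok(format!("content of {}", title))
}

fn load_twice(a: &str, b: &str) -> Result<(String, String), Error> {
    Ok((load(a)?, load(b)?))
}

fn main() {
    assert!(load("").is_err());
    assert_eq!(load("Main Page").unwrap(), "content of Main Page");
    assert!(load_twice("A", "").is_err());
    println!("ok");
}
```

The real type would add `From<reqwest::Error>` and `From<serde_json::Error>` impls so that `?` also converts errors bubbling up from HTTP and JSON handling.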
If that sounds good, then I can rework my existing PR into this direction. Or we can keep discussing :)
My code is more or less the sample code in the README with the query parameters tweaked:
$ cargo build
Compiling time v0.2.8
Compiling user_agent v0.9.0
error: expected an item keyword
--> /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/time-0.2.8/src/utc_offset.rs:366:13
|
366 | let tm = timestamp_to_tm(datetime.timestamp())?;
| ^^^
error: aborting due to previous error
error: could not compile `time`.
Apparently this is because of time-rs/time#233. If there's a way to work around that, it would be nice, but I understand if it's just a matter of waiting for other packages to update. I'm also a Rust noob, so it's totally possible I'm doing something wrong; help appreciated.
Hi,
have you thought about providing an async API? Since you are using reqwest, it should be easily possible by using `reqwest::async::Client` and returning futures instead of results.