magnusmanske / mediawiki_rust
Rust API interface for MediaWiki sites.
License: Apache License 2.0
What's the purpose of src/bin/main.rs? It looks like a script, but I'm not sure if it's just for testing or if we should be updating and maintaining it as part of this crate.
The specification is documented at https://en.wikipedia.org/wiki/Template:Bots. I'd expect a function signature of

fn bot_may_edit(wikitext: &str, bot: &str) -> bool { ... }

This probably warrants its own module, maybe behind a feature flag, since it'll probably add a dependency on regex.
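To make the idea concrete, here is a minimal sketch of such a function using plain string matching instead of regex. It only handles the simplest `{{nobots}}`, `{{bots|deny=...}}`, and `{{bots|allow=...}}` forms; the full template grammar in the spec (multiple parameters, `optout=`, message-type restrictions) needs more care, and `bot_may_edit` is the hypothetical name from above, not anything the crate currently exposes:

```rust
/// Simplified check of the {{bots}}/{{nobots}} exclusion templates.
/// Returns true if `bot` is allowed to edit a page with this wikitext.
fn bot_may_edit(wikitext: &str, bot: &str) -> bool {
    let lower = wikitext.to_lowercase();
    let bot = bot.to_lowercase();

    // {{nobots}} bans all bots outright.
    if lower.contains("{{nobots}}") {
        return false;
    }

    // {{bots|deny=all}} or {{bots|deny=BotA,BotB,...}}
    if let Some(start) = lower.find("{{bots|deny=") {
        let rest = &lower[start + "{{bots|deny=".len()..];
        if let Some(end) = rest.find("}}") {
            let list = &rest[..end];
            if list == "all" || list.split(',').any(|b| b.trim() == bot) {
                return false;
            }
        }
    }

    // {{bots|allow=...}}: only the listed bots may edit.
    if let Some(start) = lower.find("{{bots|allow=") {
        let rest = &lower[start + "{{bots|allow=".len()..];
        if let Some(end) = rest.find("}}") {
            let list = &rest[..end];
            return list == "all" || list.split(',').any(|b| b.trim() == bot);
        }
    }

    // No exclusion template found: editing is allowed.
    true
}

fn main() {
    assert!(!bot_may_edit("{{nobots}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|deny=all}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|deny=ExampleBot,Other}}", "ExampleBot"));
    assert!(bot_may_edit("{{bots|allow=ExampleBot}}", "ExampleBot"));
    assert!(!bot_may_edit("{{bots|allow=OtherBot}}", "ExampleBot"));
    assert!(bot_may_edit("plain wikitext, no templates", "ExampleBot"));
    println!("ok");
}
```

A regex-based version would collapse the two template branches into one pattern, which is why the feature-gated regex dependency seems worth it for the real implementation.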
Hi,
I have been using mediawiki for a small scraping tool, but an error came up inside the lib recently (It was working fine before).
I have cloned the repo to see if it was just my setup but I am having the same issue:
Compiling mediawiki v0.2.7 (E:\Projects\mediawiki_rust)
error[E0432]: unresolved import `futures`
--> src\api.rs:25:5
|
25 | use futures::{Stream, StreamExt};
| ^^^^^^^ use of undeclared crate or module `futures`
error[E0433]: failed to resolve: use of undeclared crate or module `futures`
--> src\api.rs:364:9
|
364 | futures::stream::unfold(initial_query_state, |mut query_state| async move {
| ^^^^^^^ use of undeclared crate or module `futures`
error: aborting due to 2 previous errors
Some errors have detailed explanations: E0432, E0433.
For more information about an error, try `rustc --explain E0432`.
error: could not compile `mediawiki`
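The two unresolved-import errors suggest that `futures` is simply missing from the crate's `Cargo.toml` dependencies, perhaps dropped in a recent commit. Assuming that's the cause, adding it back should fix the build (the version here is a guess matching the async/await-era `Stream`/`StreamExt` usage in `src/api.rs`):

```toml
[dependencies]
futures = "0.3"
```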
What I did was:
Hi Magnus,
First, thanks for working on this crate - it is a great foundation and all of my tools so far use it.
I started writing some more advanced bots last week using this crate as the base, and felt like I was missing things I've come to expect from using Pywikibot for so many years. IMO, right now the crate takes care of the basics: login, token handling, and simple objects for titles, getting page text and other properties like links, external links, coordinates, editing, etc. But there are no high-level error types, no credential storage/handling, no automatic retries, no logging, and no other actions like page moving, deletion, or protection. And then there's stuff like {{nobots}} handling, which is solely for bots and not other MW API consumers.
What do you see the scope of this crate as being? Are more bot-like functions and high-level types welcome contributions? Or would you see those go in a higher level "wikibot" crate that builds on top of this one? Something in the middle?
I don't want to step on any toes nor duplicate any work, but this is something I'd like to work on to make adopting Rust even easier.
Thanks!
P.S. I (along with @enterprisey) am starting a Wikimedia Rust developers user group, which you are definitely invited to join, that we hope will work on issues like this.
Using `serde_json::Value` to represent API responses is pretty laborious. It requires lots of `.as_object()` or `.as_str()` calls and then checking that the result is `Some(_)`, or using `Option::map`, or matching on variants of `serde_json::Value`, etc.
I propose creating custom structs or enums to represent responses. This makes accessing fields in the JSON as simple as accessing fields in the struct or enum. For instance, this example shows a struct that could be used in the `Page::text` method (and returns a `serde_json::Value` if the JSON fails to deserialize as `RevisionsResponse`, though that may not be necessary):
use serde::Deserialize;
use serde_json::Value as JsonValue;
use std::collections::HashMap;
use url::Url;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum FallibleDeserialization<T> {
    Success(T),
    Failure(JsonValue),
}

#[derive(Debug, Deserialize)]
#[allow(unused)]
struct RevisionsResponse {
    batchcomplete: bool,
    query: PagesQuery,
}

#[derive(Debug, Deserialize)]
struct PagesQuery {
    pages: Vec<Page>,
}

#[derive(Debug, Deserialize)]
struct Page {
    #[serde(rename = "pageid")]
    id: u32,
    #[serde(rename = "ns")]
    ns: i32,
    title: String,
    revisions: Vec<Revision>,
}

#[derive(Debug, Deserialize)]
struct Revision {
    slots: HashMap<String, RevisionSlot>,
}

#[derive(Debug, Deserialize)]
struct RevisionSlot {
    #[serde(rename = "contentmodel")]
    content_model: String,
    #[serde(rename = "contentformat")]
    content_format: String,
    content: String,
}

#[tokio::main]
async fn main() {
    let mut url: Url = Url::parse("https://en.wiktionary.org/w/api.php").unwrap();
    url.set_query(Some(&serde_urlencoded::to_string(&[
        ("action", "query"),
        ("prop", "revisions"),
        ("titles", "Template:link"),
        ("rvslots", "*"),
        ("rvprop", "content"),
        ("formatversion", "2"),
        ("format", "json"),
    ]).unwrap()));
    let response: FallibleDeserialization<RevisionsResponse> =
        reqwest::get(url).await.unwrap().json().await.unwrap();
    if let FallibleDeserialization::Success(response) = response {
        for Page { revisions, .. } in response.query.pages {
            for Revision { slots } in revisions {
                let slot = slots
                    .get("main")
                    .or_else(|| slots.iter().next().map(|(_, slot)| slot));
                dbg!(slot);
            }
        }
    }
}
Dependencies in `Cargo.toml`:

reqwest = { version = "0.10", features = ["json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_urlencoded = "0.6"
tokio = { version = "0.2", features = ["rt-core", "macros"] }
url = "2.1"
To make this possible, the `Api` methods that currently decode the response as `serde_json::Value` (ultimately via `Api::query_api_json`) would need to be generic, so that they could instead deserialize into a more specific struct or enum (something like, in the example above, `FallibleDeserialization<RevisionsResponse>`).
And `Api::get_query_api_json_limit` would probably need some way to perform the function of `Api::json_merge` generically, for the various structs that it would return in place of `serde_json::Value`. For instance, it could be generic over a trait that has a `merge` method (maybe named `MergeableResponse`).
Difficulties: using the `Deserialize` derive macro will add to compile time. Also, it may require trial and error to figure out what the schema for the API responses actually is.
Ah, ecosystem churn.
Integrity protection is as simple as filling in the `md5` parameter with a hash of the text value.
For edit conflict protection, we need to pass the revision ID and timestamp of the revision we obtained from `.text()`. My suggestion is to have `Page` keep track of the text/revision info (lazy-loading it), so that if `text()` was called, `edit_text()` can pass it back for conflict detection. Lazy-loading text info would also unlock preloading from generators in the future, but that's another issue...
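As a rough illustration of the parameters involved, here is a sketch of the map an `edit_text()` call would send. The function name is hypothetical, `md5_hex` is a stand-in for a real MD5 computation (e.g. via the `md-5` crate, not written out here), and the parameter names are my reading of the action=edit API (`md5`, `baserevid`, `basetimestamp`), so double-check them against the API docs:

```rust
use std::collections::HashMap;

/// Build the action=edit parameters for an integrity- and
/// conflict-protected edit. `md5_hex` must be the MD5 of `text`.
fn edit_params<'a>(
    text: &'a str,
    base_revid: &'a str,
    base_timestamp: &'a str,
    md5_hex: &'a str,
) -> HashMap<&'static str, &'a str> {
    let mut p = HashMap::new();
    p.insert("action", "edit");
    p.insert("text", text);
    // Integrity protection: the server rejects the edit if this hash
    // doesn't match the text it received.
    p.insert("md5", md5_hex);
    // Conflict protection: revision the text was loaded from; the server
    // reports an edit conflict if the page moved on since then.
    p.insert("baserevid", base_revid);
    p.insert("basetimestamp", base_timestamp);
    p
}

fn main() {
    let p = edit_params("new text", "12345", "2020-01-01T00:00:00Z", "0123abcd");
    assert_eq!(p["action"], "edit");
    assert_eq!(p["md5"], "0123abcd");
    assert_eq!(p["baserevid"], "12345");
    println!("{} params", p.len());
}
```

If `Page` caches the revision ID and timestamp when `text()` loads them, `edit_text()` can fill in the last three parameters without the caller ever seeing them.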
We currently have functions that return a `Result` with an error type of:

- `Box<dyn Error>` (problematic because it's not thread-safe)
- `PageError`
- `String`
- `MediaWikiError`
- `&str`
The inconsistencies make it difficult to use `?`, because you usually have to map the error to something else first. My proposal is to have one single `mediawiki::Error` type that all functions use. This type would have `From` implementations for `reqwest::Error` and `serde_json::Error`, then ones for `MissingPage` and so on, and then a generic `UnknownAPIError`. The thiserror crate should make writing it straightforward.

Clients then only need to implement `From` for one type, and you only need to import one error type.
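To sketch the shape of the proposal: a hand-rolled version of such an error type, without thiserror and without the `reqwest::Error`/`serde_json::Error` wrappers, so it stays dependency-free. The variant names are assumptions following the proposal above, not existing crate types:

```rust
use std::fmt;

/// Hypothetical unified error type for the crate. With thiserror, the
/// Display and Error impls below collapse into derive attributes.
#[derive(Debug)]
enum Error {
    MissingPage(String),
    UnknownApiError(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::MissingPage(title) => write!(f, "page does not exist: {}", title),
            Error::UnknownApiError(msg) => write!(f, "unknown API error: {}", msg),
        }
    }
}

impl std::error::Error for Error {}

/// With every function returning Result<_, Error>, `?` composes
/// without any map_err noise.
fn load(title: &str) -> Result<String, Error> {
    if title.is_empty() {
        return Err(Error::MissingPage(title.to_string()));
    }
    Ok(format!("content of {}", title))
}

fn load_twice(a: &str, b: &str) -> Result<(String, String), Error> {
    Ok((load(a)?, load(b)?))
}

fn main() {
    assert!(load("").is_err());
    assert_eq!(load("Main Page").unwrap(), "content of Main Page");
    assert!(load_twice("A", "").is_err());
    println!("ok");
}
```

The real type would add `From<reqwest::Error>` and `From<serde_json::Error>` impls so that `?` also converts errors bubbling up from HTTP and JSON handling.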
If that sounds good, then I can rework my existing PR into this direction. Or we can keep discussing :)
My code is more or less the sample code in the README with the query parameters tweaked:
$ cargo build
Compiling time v0.2.8
Compiling user_agent v0.9.0
error: expected an item keyword
--> /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/time-0.2.8/src/utc_offset.rs:366:13
|
366 | let tm = timestamp_to_tm(datetime.timestamp())?;
| ^^^
error: aborting due to previous error
error: could not compile `time`.
Apparently this is because of time-rs/time#233. If there's a way to work around that, it would be nice, but I understand if it's just a matter of waiting for other packages to update. I'm also a Rust noob, so it's totally possible I'm doing something wrong; help appreciated.
Hi,
have you thought about providing an async API? Since you are using reqwest, it should be easily possible by using `reqwest::async::Client` and returning futures instead of results.