Parse Mediawiki dump

Parse XML dumps exported from MediaWiki.

This module parses XML dumps exported from MediaWiki and provides each page from the dump through an iterator. This is useful for parsing the dumps from Wikipedia and other Wikimedia projects.

Caution

If you need to parse any wiki text extracted from a dump, please use the crate Parse Wiki Text (crates.io, GitHub). Correctly parsing wiki text requires dealing with an astonishing number of difficult and counterintuitive cases. Parse Wiki Text deals with all these cases automatically, giving you an unambiguous tree of parsed elements that is easy to work with.

Limitations

This module parses only dumps that contain a single revision of each page. This is what you get from the page Special:Export when enabling the option “Include only the current revision, not the full history”, and also what you get from the Wikimedia dumps with file names ending with -pages-articles.xml.bz2.
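
As a rough illustration of the file naming convention mentioned above, a program could check a file name before attempting to parse it. This is only a sketch: the helper `looks_like_single_revision_dump` and the file names are hypothetical and not part of this module, and a matching name does not guarantee that the contents are supported.

```rust
use std::path::Path;

/// Hypothetical helper, not part of this module: returns true if the file
/// name ends with the suffix used by the single-revision Wikimedia dumps.
fn looks_like_single_revision_dump(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map_or(false, |name| name.ends_with("-pages-articles.xml.bz2"))
}

fn main() {
    // Illustrative file names only.
    for name in ["example-pages-articles.xml.bz2", "example.xml"] {
        let supported = looks_like_single_revision_dump(Path::new(name));
        println!("{}: {}", name, supported);
    }
}
```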

This module ignores the siteinfo element, every child element of the page element except ns, revision and title, and every element inside the revision element except format, model and text.

Until there is a real use case that justifies going beyond these limitations, they will remain in order to avoid premature design driven by imagined requirements.

Examples

Parse a bzip2 compressed file and distinguish ordinary articles from other pages. A running example with complete error handling is available in the examples folder.

// The `extern crate` lines are only required on the Rust 2015 edition.
extern crate bzip2;
extern crate parse_mediawiki_dump;

fn main() {
    // Open the compressed dump. `unwrap` panics if the file cannot be opened.
    let file = std::fs::File::open("example.xml.bz2").unwrap();
    let file = std::io::BufReader::new(file);
    // Decompress on the fly, then buffer the decompressed stream for the parser.
    let file = bzip2::bufread::BzDecoder::new(file);
    let file = std::io::BufReader::new(file);
    for result in parse_mediawiki_dump::parse(file) {
        // Stop at the first error reported by the parser.
        let page = match result {
            Ok(page) => page,
            Err(error) => {
                eprintln!("Error: {}", error);
                break;
            }
        };
        // An ordinary article is in namespace 0 and has wiki text format and model.
        let is_article = page.namespace == 0
            && page.format.as_deref() == Some("text/x-wiki")
            && page.model.as_deref() == Some("wikitext");
        if is_article {
            println!(
                "The page {title:?} is an ordinary article with byte length {length}.",
                title = page.title,
                length = page.text.len()
            );
        } else {
            println!("The page {:?} has something special to it.", page.title);
        }
    }
}
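
The article check in the example above can also be pulled out into a small function, which makes it testable without a dump file. The sketch below uses a stand-in struct, `PageInfo`, that only mirrors the fields used in the example (`namespace`, `format`, `model`). It is hypothetical and is not the page type provided by this module, and the namespace is assumed to be an unsigned integer.

```rust
/// Stand-in for the fields of a parsed page used by the example above.
/// Hypothetical: this is not the page type exported by the module.
struct PageInfo {
    namespace: u32, // assumed integer type
    format: Option<String>,
    model: Option<String>,
}

/// Returns true for pages in namespace 0 whose format is "text/x-wiki" and
/// whose model is "wikitext", the condition the example above treats as an
/// ordinary article.
fn is_ordinary_article(page: &PageInfo) -> bool {
    page.namespace == 0
        && page.format.as_deref() == Some("text/x-wiki")
        && page.model.as_deref() == Some("wikitext")
}

fn main() {
    let article = PageInfo {
        namespace: 0,
        format: Some("text/x-wiki".to_string()),
        model: Some("wikitext".to_string()),
    };
    // A missing format means the page is not counted as an ordinary article.
    let other = PageInfo {
        namespace: 0,
        format: None,
        model: Some("wikitext".to_string()),
    };
    println!("article: {}", is_ordinary_article(&article));
    println!("other: {}", is_ordinary_article(&other));
}
```

A missing `format` or `model` is treated the same way as a non-matching one, which matches the `None => false` arms in the example.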

Contributors

newca12, theadamcolton
