wikiparser's Introduction

Organic Maps

Organic Maps is a free Android & iOS offline maps app for travellers, tourists, drivers, hikers, and cyclists. It uses crowd-sourced OpenStreetMap data and is developed with love by the creators of MapsWithMe (later renamed to Maps.Me) and by our community. No ads, no tracking, no data collection, no crapware. Your donations and positive reviews motivate and inspire us, thanks ❤️!

  • Download on the App Store
  • Get it on Google Play
  • Explore it on AppGallery
  • Get it on F-Droid

Features

Organic Maps is the ultimate companion app for travellers, tourists, hikers, and cyclists:

  • Detailed offline maps with places that don't exist on other maps, thanks to OpenStreetMap
  • Cycling routes, hiking trails, and walking paths
  • Contour lines, elevation profiles, peaks, and slopes
  • Turn-by-turn walking, cycling, and car navigation with voice guidance
  • Fast offline search on the map
  • Bookmarks and tracks import and export in KML, KMZ & GPX formats
  • Dark Mode to protect your eyes
  • Countries and regions don't take a lot of space
  • Free and open-source

Why Organic?

Organic Maps is pure and organic, made with love:

  • Respects your privacy
  • Saves your battery
  • No unexpected mobile data charges

Organic Maps is free from trackers and other bad stuff:

  • No ads
  • No tracking
  • No data collection
  • No phoning home
  • No annoying registration
  • No mandatory tutorials
  • No noisy email spam
  • No push notifications
  • No crapware
  • No pesticides, purely organic!

The Android application is verified by the Exodus Privacy Project.

The iOS application is verified by TrackerControl for iOS.

Organic Maps doesn't request excessive permissions to spy on you.

At Organic Maps, we believe that privacy is a fundamental human right:

  • Organic Maps is an indie community-driven open-source project
  • We protect your privacy from Big Tech's prying eyes
  • Stay safe no matter where you are

Reject surveillance - embrace your freedom.

Give Organic Maps a try!

Who is paying for the development?

The app is free for everyone, so we rely on donations. Please donate at organicmaps.app/donate to support us!

Our beloved institutional sponsors below have provided targeted grants to cover some infrastructure costs and to fund the development of selected new features:

The NLnet Foundation: The Search & Fonts improvement project has been funded through the NGI0 Entrust Fund. NGI0 Entrust Fund is established by the NLnet Foundation with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.
Google Summer of Code: Google backed five student projects in the Google Summer of Code program in 2022 and 2023. Noteworthy projects included Android Auto and the Wikipedia Dump Extractor.
Mythic Beasts: Mythic Beasts ISP provides us with two virtual servers with 400 TB/month of free bandwidth to host and serve map downloads and updates.
FUTO: FUTO awarded a $1000 micro-grant to Organic Maps in February 2023.

The majority of expenses have been funded by the project's founders since its inception. The project is far from achieving any sort of financial sustainability: the current level of voluntary donations falls significantly short of covering the effort needed to sustain the app, and developing new features is not possible without additional financial resources.

Please consider donating if you want to see this open-source project thriving, not dying. There are other ways to support the project; no coding skills are required.

Copyrights

Licensed under the Apache License, Version 2.0. See LICENSE, NOTICE and data/copyright.html for more information.

Governance

See docs/GOVERNANCE.md.

If you want to build the project, check docs/INSTALL.md. If you want to help the project, see docs/CONTRIBUTING.md. You can help in many ways; the ability to code is not necessary.

Beta

Please join our beta program, suggest your features, and report bugs.

Feedback

The Organic Maps community abides by the CNCF code of conduct.

wikiparser's People

Contributors

biodranik, jean-baptistec, jpds, lens0021, newsch


wikiparser's Issues

Investigate using osmfilter/osmium for generating inputs

As discussed in #6, instead of waiting for the generator to reach the descriptions stage, we could process the OSM planet file directly to query the wikipedia and wikidata tags.

If we separate the two, we should be careful to:

  • use the same criteria as the generator for filtering nodes
  • process article titles in the same way as the generator

Previous Work

I got a query with osmfilter working based on the filters in ftypes_matcher.cpp.

osmfilter planet.o5m --keep="( wikipedia= or wikidata= ) and ( amenity=grave_yard or amenity=fountain or amenity=place_of_worship or amenity=theatre or amenity=townhall or amenity=university or boundary=national_park or building=train_station or highway=pedestrian or historic=archaeological_site or historic=boundary_stone or historic=castle or historic=fort or historic=memorial or historic=monument or historic=ruins or historic=ship or historic=tomb or historic=wayside_cross or historic=wayside_shrine or landuse=cemetery or leisure=garden or leisure=nature_reserve or leisure=park or leisure=water_park or man_made=lighthouse or man_made=tower or natural=beach or natural=cave_entrance or natural=geyser or natural=glacier or natural=hot_spring or natural=peak or natural=volcano or place=square or tourism=artwork or tourism=museum or tourism=gallery or tourism=zoo or tourism=theme_park or waterway=waterfall or tourism=viewpoint or tourism=attraction )" \
| osmconvert - --csv-headline --csv="@oname @id wikipedia wikidata"

I ran it on the Yukon territory map and found that it output additional articles compared to the generator.

Next Steps

  • Investigate earlier processing layers in map generator to improve the query.
  • Try to convert the osmfilter command to osmium so .pbf files can be used directly (see the sketch after this list).
  • Run query on whole planet file and compare with generator output and all wikipedia/wikidata tags in the planet file.
  • Update wikiparser to handle direct OSM inputs.
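
A rough osmium equivalent of the osmfilter command above could look like the untested sketch below. The assumption here is that osmium tags-filter cannot express the "has a wiki tag AND matches a feature type" conjunction in a single pass, so two invocations are chained; only a handful of the type filters from ftypes_matcher.cpp are shown and the full list would still need to be ported.

# Pass 1: keep only objects that carry a wikipedia or wikidata tag.
osmium tags-filter -o wiki-tagged.osm.pbf planet.osm.pbf nwr/wikipedia nwr/wikidata
# Pass 2: narrow to interesting feature types (abbreviated list, sketch only).
osmium tags-filter -o filtered.osm.pbf wiki-tagged.osm.pbf nwr/tourism nwr/historic nwr/boundary=national_park
# The CSV step could stay on osmconvert, which also reads .pbf input.
osmconvert filtered.osm.pbf --csv-headline --csv="@oname @id wikipedia wikidata"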

Make script for production

As discussed in #6, it would be useful to have a standalone script that processes the generator inputs and sets up the program correctly for running on the map build server.

It should:

  • try to select the latest map build directory at ~/maps_build/*/intermediate_data/
  • build the program in release mode if it isn't already
  • process the generator inputs as mentioned in the README.
  • enable backtraces and debug logging
  • tee logs to a temporary file
  • accept an arbitrary number of dump files
  • write to the descriptions directory expected by the generator

The generator tool might also need to be updated to write the description inputs and ingest the outputs, but not call the scraper.
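
A minimal sketch of such a wrapper, covering the checklist above, is below. The paths, the binary name, and the om_wikiparser arguments are assumptions; the real invocation would come from the README.

#!/usr/bin/env bash
set -euo pipefail

# Pick the most recent map build directory (assumes lexicographic order matches age).
BUILD_DIR=$(ls -d "$HOME"/maps_build/*/intermediate_data/ | sort | tail -n 1)
echo "Using build directory: $BUILD_DIR"

# Build in release mode; cargo is a no-op if the binary is already up to date.
cargo build --release

# Enable backtraces and debug logging, and tee everything to a temporary file.
LOG=$(mktemp)
export RUST_BACKTRACE=1 RUST_LOG=debug

# "$@" is an arbitrary number of dump files; the real flags for pointing
# om_wikiparser at $BUILD_DIR and the generator's descriptions directory are
# deliberately omitted from this sketch.
./target/release/om_wikiparser "$@" 2>&1 | tee "$LOG"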

Remove gallery sections

Now that we strip images, the Gallery sections are empty and should be removed.

The localized name of the Gallery section in each supported language needs to be found.

Language | Section
English  | Gallery
Spanish  |
German   |
French   |
Russian  |
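
One rough way to fill in the table: tally the most common headings per language dump and pick out the local Gallery equivalent by eye. This assumes the Enterprise schema's article_body.html field and that section headings appear as h2 elements.

tar xzOf dewiki-NS0-*-ENTERPRISE-HTML.json.tar.gz \
  | jq -r '.article_body.html' \
  | grep -oP '<h2[^>]*>[^<]{1,40}</h2>' \
  | sort | uniq -c | sort -rn | head -n 40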

Triage OSM tag errors

There are around 600 wikipedia/wikidata tags in the full planet dump that cannot be parsed by our current setup.
See #19 and #23 for more details.

They should all be values that don't conform to the expected format.
Some of these can be fixed on OSM by us, we can leave notes on others.

  • Add error for titles greater than 255 bytes of UTF-8
  • Add error for langs longer than some limit (check the standard)
  • Add a subcommand to dump the errors to disk in a structured way (started in #23).
  • Categorize them by solution
  • Add any new issues for parsing problems
  • Contribute fixes/notes to OSM

Image support

Now that the article sources contain image elements, we can display them in the app.

We need a plan that will:

  • preserve any necessary styling of the elements
  • convert the image urls from relative to absolute
  • keep the images hidden/unloaded by default
  • allow the user to enable image loading dynamically in the apps

Presumably we can mark the elements as hidden or similar, and load a piece of JS in the app webviews that toggles them based on a user preference.

Check relative link handling in webviews

In the dumps, links within the same language's wiki are written as relative
links (./Article_Title), while cross-wiki links (e.g. [[:fr:Pomme]]) are
already expanded to absolute links (https://fr.wikipedia.org/wiki/Pomme).
See https://en.wikipedia.org/wiki/Help:Interwiki_linking for more examples of wikitext linking.

The dump HTML includes a base element set to //lang.wikipedia.org/wiki/, but when a file is opened locally in Firefox the scheme is assumed to be file:, so the links don't work.

We should check whether the Android/iOS webviews have a similar problem; if so, updating the HTML processing to set an explicit scheme in the base element should handle it.
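
For a quick local experiment, assuming the element is written literally as <base href="//en.wikipedia.org/wiki/"/>, pinning the scheme makes the links resolve when the file is opened from disk (article.html is a placeholder file name):

sed -i 's|<base href="//|<base href="https://|' article.html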

Discussed more in #10.

Split downloads across mirrors

As discussed in #22, Wikipedia has a limit of 2 concurrent connections and seems to rate-limit each to about 4 MB/s. There are at least two mirrors of the Enterprise dumps.
For the fastest speeds, ideally we could share downloads between Wikipedia and the mirrors, or even download different parts of the same file concurrently, as aria2c does.

Unfortunately, none of the parallel downloaders I've seen allows setting per-host connection limits (e.g. 2 for dumps.wikimedia.org, 4 for the rest).

So besides writing our own downloader, to respect the wikimedia limits we could:

  • Keep the 2-thread limit and divide the files across the available hosts (sketched below)
  • Increase the 2-thread limit and only use dumps.wikimedia.org for two files
  • Increase the 2-thread limit and don't use dumps.wikimedia.org for any files
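
A sketch of the first option, with placeholder URL-list files: cap dumps.wikimedia.org at two parallel downloads and let a mirror take more.

# wikimedia-urls.txt and mirror-urls.txt are hypothetical lists of dump URLs.
xargs -n 1 -P 2 wget --continue < wikimedia-urls.txt &
xargs -n 1 -P 4 wget --continue < mirror-urls.txt &
wait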

Automate dump downloading

Find the latest dump, listed at https://dumps.wikimedia.org/other/enterprise_html/runs/.

Get the filenames for all supported languages: ${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz.

Download any that aren't already present.

wget handles redirects, retrying on temporary errors, and resuming partial downloads.

Extracting the links from the html could be done with a sed incantation, but wouldn't be very robust.

Alternatively, handle extracting the latest links in a subcommand of the rust program, and invoke that from a script.
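
A minimal sketch of the script approach, where LANGUAGES is a placeholder and the date extraction is exactly the fragile "sed incantation" kind of scraping mentioned above:

BASE=https://dumps.wikimedia.org/other/enterprise_html/runs
# Pull the newest 8-digit run date out of the directory listing (not robust).
LATEST=$(wget -qO- "$BASE/" | grep -oE '[0-9]{8}' | sort -u | tail -n 1)
for LANG in $LANGUAGES; do
    # wget --continue skips files that are already complete and resumes partial ones.
    wget --continue "$BASE/$LATEST/${LANG}wiki-NS0-${LATEST}-ENTERPRISE-HTML.json.tar.gz"
done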

Additional HTML Simplification

As covered in #3, the python scraper uses the Extracts API, which strips most of the article's markup.

The HTML in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump HTML is around 10x the size of the extracts for the subset I looked at.

To get to parity with that output, we'll need to add additional steps to the html processing.

The extension code is available here, and depends on Wikimedia's HtmlFormatter library.

As an example, here's the html from the Extracts API and the enterprise API dump for the Raleigh page:

  • raleigh-extracts-api.html.txt

  • raleigh-enterprise-dump.html.txt

  • Finish list of what to remove:

    • Extracts API selectors:
      table, div, figure, script, input, style, ul.gallery, .mw-editsection, sup.reference, ol.references, .error, .nomobile, .noprint, .noexcerpt, .sortkey,
    • Media elements: img, audio, video, figure, embed
    • Remove class and style from span (and flatten if no other attributes are left)
    • MediaWiki-specific tags and attributes: data-mw, data-mw-*, prefix, typeof, about, rel
    • IDs that start with mw
    • Flatten links
    • Info boxes (should be covered by the above)
    • Gallery sections #16 (may be handled by removing empty sections after removing elements)
    • See if header removals can use the section element that Wikipedia wraps them in
    • Comments
    • Extra whitespace
  • Add steps to the HTML processing code.

  • Set up snapshot tests of sample html.

  • Compare file sizes again.
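
For the file-size comparison, a quick check of raw and gzipped sizes of the two attached samples:

for f in raleigh-extracts-api.html.txt raleigh-enterprise-dump.html.txt; do
    printf '%s: %s bytes raw, %s bytes gzipped\n' \
        "$f" "$(wc -c < "$f")" "$(gzip -c "$f" | wc -c)"
done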

Compare gzip decompressors

With more optimization elsewhere, it's likely that decompressing the gzip archives will become a bottleneck.
There are a number of "parallel" implementations for gzip, but none of the ones I've looked at is actually useful for this case.

Steps:

  • Look for more implementations
  • Compare decompression rates
  • Compare with rust gzip crates

What I've looked at so far:

Implementation  | Notes
GNU gunzip      | Baseline
pigz            | Moves some decompression work to separate threads, mostly single-threaded
libdeflate      | Only for small files
pugz            | Hypothetically ideal, novel approach, but unstable and needs to load the entire file into memory
pgzip (Python)  | Decompression only parallelized when compressed with the same tool
pgzip (Golang)  | Decompression only does single-threaded work on a separate thread
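
A rough benchmark loop for the comparison; the dump file name is a placeholder, and other tools' command lines can be added as they are confirmed. Decompressing to /dev/null isolates decompression cost from disk writes.

F=enwiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
for CMD in "gzip -dc" "pigz -dc"; do
    echo "== $CMD"
    time $CMD "$F" > /dev/null
done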

Investigate articles without QID

The schema for the wikipedia enterprise dumps lists the QID field (main_entity) as optional.

All articles should have a QID, but apparently there are cases where they don't.

It's not just articles so minor that they don't have a wikidata item. For example, this sample of errors from the 20230801 dump includes:

[2023-08-04T17:58:48Z INFO  om_wikiparser] Page without wikidata qid: "Wiriadinata Airport" (https://en.wikipedia.org/wiki/Wiriadinata_Airport)
[2023-08-04T17:59:11Z INFO  om_wikiparser] Page without wikidata qid: "Uptown (Brisbane)" (https://en.wikipedia.org/wiki/Uptown_(Brisbane))

Both articles were edited on 2023-07-31, around when the dump was created.

Is this the main cause of these cases, or is there something else?

Is there some data we can preserve across dumps to prevent this, like keeping old qid links if there is no current one?
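
A quick way to triage this from an existing run, using the log message format shown above (wikiparser.log is a placeholder path), is to count the affected pages and sample a few for manual checking against Wikidata:

grep -c 'Page without wikidata qid' wikiparser.log
grep 'Page without wikidata qid' wikiparser.log | shuf -n 20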

Document interface with generator

As discussed in #6, the interface between the generator and this tool should be explicitly documented.

  • add to README input file formats and output directory structure
  • write up ideas for improving generator interface

Skip articles that haven't changed between dumps

The dump schema includes a date_modified timestamp and other revision metadata.

To reduce disk I/O, we could store some metadata along the articles, compare it against the new one when processing, and skip them if they haven't changed.

One way to do this would be to store the date_modified timestamp as the modified attribute of the article file.
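
A shell analogue of that idea, with $date_modified and $out_file as placeholders (the real check would live in the Rust code):

want=$(date -d "$date_modified" +%s)                  # timestamp from the dump record
have=$(stat -c %Y "$out_file" 2>/dev/null || echo 0)  # mtime of the existing article file
if [ "$want" -eq "$have" ]; then
    echo "unchanged, skipping $out_file"
else
    # (re)write the article here, then stamp it so the next run can skip it:
    touch -d "$date_modified" "$out_file"
fi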

Get all translations for articles matched by title

Currently the program checks for matches against the list of article titles and wikidata QIDs.
The QIDs are language agnostic, so all translations of them will be picked up.

For titles, however, there's no way to tell whether an article is a translation of another title in the list, so only the article in the title's language is matched.

Example

For the Eiffel Tower, if an OSM object has only wikipedia=fr:Tour Eiffel and no wikidata= tag, we don't know to extract en:Eiffel Tower or ru:Эйфелева башня until we process the page in the fr dump and get its wikidata QID.

At the same time there will be Russian-only tags that need to be mapped to other languages, but they can't be resolved until we process the ru dump.

For objects with a wikidata= tag this is not a problem, and there are also wikipedia:lang= tags, but the generator would need to be updated to handle those, and not every OSM object has all of the tags.

Solution

A complete mapping from title to QID would need to include all titles and redirects in each supported language.

We could build that by scanning through all the dumps in an initial pass, or by parsing some smaller dumps of redirects and QIDs, using or doing something similar to the wikimapper project.

Some options to resolve the problem:

  • Build a complete mapping
  • Build a partial mapping only of the required titles
  • Save QIDs of articles matched by title, then find them again in another pass over all dumps

I think writing the missed QIDs out after the first scan is a good first step; if doing two passes increases runtime too much, we can investigate the smaller-dump option.

Investigate escaping in article titles and urls

Wikipedia articles can contain slashes (/). Wikipedia accepts them in urls escaped or not, e.g.
https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport
and
https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport
return the same page, and neither redirects to the other.

The generator attempts to decode URLs from OSM tags, and then encodes '%' again when it converts them back into URLs.

My guess is that some of the tags that are not URLs still have URL encoding in them, but determining which are actually URL-encoded and which just happen to contain '%' is a little tricky, and the generator doesn't attempt it.

It looks like some of the resulting URLs are encoded twice; thankfully only a small number:

$ tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%' | sort | uniq
https://de.wikipedia.org/wiki/Georg-B%25C3%25BCchner-Platz
https://de.wikipedia.org/wiki/Kontorhaus_am_J%25C3%25B6debrunnen
https://en.wikipedia.org/wiki/Brighton_%2526_Hove_Greyhound_Stadium
https://en.wikipedia.org/wiki/de:Liste_der_Kulturdenkmäler_in_Schwachhausen#0218%252CT003
https://en.wikipedia.org/wiki/McMullen%2527s_Brewery
https://en.wikipedia.org/wiki/P%25C3%25A9cs_TV_Tower
https://en.wikipedia.org/wiki/Sedbergh_People%2527s_Hall
https://en.wikipedia.org/wiki/Sight_%2526_Sound_Theatres
https://es.wikipedia.org/wiki/100%25_Banco
https://es.wikipedia.org/wiki/Ruta_de_los_D%25C3%25B3lmenes
https://FR.wikipedia.org/wiki/Maisons_industrialis%25C3%25A9es_%25C3%25A0_Meudon
https://fr.wikipedia.org/wiki/Salm_(rivi%25C3%25A8re_de_Belgique)
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
https://sv.wikipedia.org/wiki/Kungliga_Tr%25C3%25A4dg%25C3%25A5rden_3
https://sv.wikipedia.org/wiki/Sverigev%25C3%25A4ggen
https://sv.wikipedia.org/wiki/V%25C3%25A4ttern,_Storfors_kommun
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Of those, all except the three below are malformed:

https://es.wikipedia.org/wiki/100%25_Banco
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Some seem to be ordinary character data that was encoded twice, for example:

https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
with the extra escaped %25s removed becomes:
https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
which the browser converts to:
https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
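
As a one-off sanity check, assuming a single extra encoding layer, undoing one level of escaping over the same URL list shows whether the results look like valid URLs:

tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%25' | sed 's/%25/%/g' | sort -u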
