wikiparser's Introduction

Organic Maps

Organic Maps is a free Android & iOS offline maps app for travellers, tourists, drivers, hikers, and cyclists. It uses crowd-sourced OpenStreetMap data and is developed with love by the creators of MapsWithMe (later renamed to Maps.Me) and by our community. No ads, no tracking, no data collection, no crapware. Your donations and positive reviews motivate and inspire us, thanks ❤️!

  • Download on the App Store
  • Get it on Google Play
  • Explore it on AppGallery
  • Get it on F-Droid

Features

Organic Maps is the ultimate companion app for travellers, tourists, hikers, and cyclists:

  • Detailed offline maps with places that don't exist on other maps, thanks to OpenStreetMap
  • Cycling routes, hiking trails, and walking paths
  • Contour lines, elevation profiles, peaks, and slopes
  • Turn-by-turn walking, cycling, and car navigation with voice guidance
  • Fast offline search on the map
  • Bookmarks and tracks import and export in KML, KMZ & GPX formats
  • Dark Mode to protect your eyes
  • Countries and regions don't take a lot of space
  • Free and open-source

Why Organic?

Organic Maps is pure and organic, made with love:

  • Respects your privacy
  • Saves your battery
  • No unexpected mobile data charges

Organic Maps is free from trackers and other bad stuff:

  • No ads
  • No tracking
  • No data collection
  • No phoning home
  • No annoying registration
  • No mandatory tutorials
  • No noisy email spam
  • No push notifications
  • No crapware
  • No pesticides, purely organic!

The Android application is verified by the Exodus Privacy Project.

The iOS application is verified by TrackerControl for iOS.

Organic Maps doesn't request excessive permissions to spy on you.

At Organic Maps, we believe that privacy is a fundamental human right:

  • Organic Maps is an indie community-driven open-source project
  • We protect your privacy from Big Tech's prying eyes
  • Stay safe no matter where you are

Reject surveillance - embrace your freedom.

Give Organic Maps a try!

Who is paying for the development?

The app is free for everyone, so we rely on donations. Please donate at organicmaps.app/donate to support us!

Our beloved institutional sponsors below have provided targeted grants to cover some infrastructure costs and to fund the development of selected new features:

The NLnet Foundation: The Search & Fonts improvement project has been funded through the NGI0 Entrust Fund. NGI0 Entrust Fund is established by the NLnet Foundation with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.
Google Summer of Code: Google backed five student projects in the Google Summer of Code program in 2022 and 2023. Noteworthy projects included Android Auto and the Wikipedia Dump Extractor.
Mythic Beasts: Mythic Beasts ISP provides us with two virtual servers with 400 TB/month of free bandwidth to host and serve map downloads and updates.
FUTO: FUTO awarded a $1000 micro-grant to Organic Maps in February 2023.

The majority of expenses have been funded by the project's founders since its inception. The project is far from achieving any sort of financial sustainability: the current level of voluntary donations falls significantly short of covering the effort needed to sustain the app, and developing new features is not possible without additional financial resources.

Please consider donating if you want to see this open-source project thriving, not dying. There are other ways to support the project; no coding skills are required.

Copyrights

Licensed under the Apache License, Version 2.0. See LICENSE, NOTICE and data/copyright.html for more information.

Governance

See docs/GOVERNANCE.md.

If you want to build the project, check docs/INSTALL.md. If you want to help the project, see docs/CONTRIBUTING.md. You can help in many ways; the ability to code is not necessary.

Beta

Please join our beta program, suggest your features, and report bugs.

Feedback

The Organic Maps community abides by the CNCF code of conduct.

wikiparser's People

Contributors

biodranik, jean-baptistec, jpds, lens0021, newsch


wikiparser's Issues

Investigate using osmfilter/osmium for generating inputs

As discussed in #6, instead of waiting for the generator to reach the descriptions stage, we could process the OSM planet file directly to query the wikipedia and wikidata tags.

If we separate the two, we should be careful to:

  • use the same criteria as the generator for filtering nodes
  • process article titles in the same way as the generator

Previous Work

I got a query with osmfilter working based on the filters in ftypes_matcher.cpp.

osmfilter planet.o5m --keep="( wikipedia= or wikidata= ) and ( amenity=grave_yard or amenity=fountain or amenity=place_of_worship or amenity=theatre or amenity=townhall or amenity=university or boundary=national_park or building=train_station or highway=pedestrian or historic=archaeological_site or historic=boundary_stone or historic=castle or historic=fort or historic=memorial or historic=monument or historic=ruins or historic=ship or historic=tomb or historic=wayside_cross or historic=wayside_shrine or landuse=cemetery or leisure=garden or leisure=nature_reserve or leisure=park or leisure=water_park or man_made=lighthouse or man_made=tower or natural=beach or natural=cave_entrance or natural=geyser or natural=glacier or natural=hot_spring or natural=peak or natural=volcano or place=square or tourism=artwork or tourism=museum or tourism=gallery or tourism=zoo or tourism=theme_park or waterway=waterfall or tourism=viewpoint or tourism=attraction )" \
| osmconvert - --csv-headline --csv="@oname @id wikipedia wikidata"

I ran it on the Yukon territory map and found that it output additional articles compared to the generator.

Next Steps

  • Investigate earlier processing layers in map generator to improve the query.
  • Try to convert the osmfilter command to osmium so .pbf files can be used directly (see the sketch after this list).
  • Run query on whole planet file and compare with generator output and all wikipedia/wikidata tags in the planet file.
  • Update wikiparser to handle direct OSM inputs.
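
A rough osmium equivalent of the osmfilter command above could look like the untested sketch below. The assumption here is that osmium tags-filter cannot express the "has a wiki tag AND matches a feature type" conjunction in a single pass, so two invocations are chained; only a handful of the type filters from ftypes_matcher.cpp are shown and the full list would still need to be ported.

# Pass 1: keep only objects that carry a wikipedia or wikidata tag.
osmium tags-filter -o wiki-tagged.osm.pbf planet.osm.pbf nwr/wikipedia nwr/wikidata
# Pass 2: narrow to interesting feature types (abbreviated list, sketch only).
osmium tags-filter -o filtered.osm.pbf wiki-tagged.osm.pbf nwr/tourism nwr/historic nwr/boundary=national_park
# The CSV step could stay on osmconvert, which also reads .pbf input.
osmconvert filtered.osm.pbf --csv-headline --csv="@oname @id wikipedia wikidata"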

Make script for production

As discussed in #6, it would be useful to have a standalone script that processes the generator inputs and sets up the program correctly for running on the map build server.

It should:

  • try to select the latest map build directory at ~/maps_build/*/intermediate_data/
  • build the program in release mode if it isn't already
  • process the generator inputs as mentioned in the README.
  • enable backtraces and debug logging
  • tee logs to a temporary file
  • accept an arbitrary number of dump files
  • write to the descriptions directory expected by the generator

The generator tool might also need to be updated to write the description inputs and ingest the outputs, but not call the scraper.
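
A minimal sketch of such a wrapper, covering the checklist above, is below. The paths, the binary name, and the om_wikiparser arguments are assumptions; the real invocation would come from the README.

#!/usr/bin/env bash
set -euo pipefail

# Pick the most recent map build directory (assumes lexicographic order matches age).
BUILD_DIR=$(ls -d "$HOME"/maps_build/*/intermediate_data/ | sort | tail -n 1)
echo "Using build directory: $BUILD_DIR"

# Build in release mode; cargo is a no-op if the binary is already up to date.
cargo build --release

# Enable backtraces and debug logging, and tee everything to a temporary file.
LOG=$(mktemp)
export RUST_BACKTRACE=1 RUST_LOG=debug

# "$@" is an arbitrary number of dump files; the real flags for pointing
# om_wikiparser at $BUILD_DIR and the generator's descriptions directory are
# deliberately omitted from this sketch.
./target/release/om_wikiparser "$@" 2>&1 | tee "$LOG"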

Remove gallery sections

Now that we strip images, the Gallery sections are empty and should be removed.

The localized name of the Gallery section in each supported language needs to be found.

Language | Section
English  | Gallery
Spanish  |
German   |
French   |
Russian  |
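
One rough way to fill in the table: tally the most common headings per language dump and pick out the local Gallery equivalent by eye. This assumes the Enterprise schema's article_body.html field and that section headings appear as h2 elements.

tar xzOf dewiki-NS0-*-ENTERPRISE-HTML.json.tar.gz \
  | jq -r '.article_body.html' \
  | grep -oP '<h2[^>]*>[^<]{1,40}</h2>' \
  | sort | uniq -c | sort -rn | head -n 40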

Triage OSM tag errors

There are around 600 wikipedia/wikidata tags in the full planet dump that cannot be parsed by our current setup.
See #19 and #23 for more details.

They should all be values that don't conform to the expected format.
Some of these can be fixed on OSM by us, we can leave notes on others.

  • Add error for titles greater than 255 bytes of UTF-8
  • Add error for langs longer than some limit (check the standard)
  • Add a subcommand to dump the errors to disk in a structured way (started in #23).
  • Categorize them by solution
  • Add any new issues for parsing problems
  • Contribute fixes/notes to OSM

Image support

Now that the article sources contain image elements, we can display them in the app.

We need a plan that will:

  • preserve any necessary styling of the elements
  • convert the image urls from relative to absolute
  • keep the images hidden/unloaded by default
  • allow the user to enable image loading dynamically in the apps

Presumably we can mark the elements as hidden or similar, and load a piece of JS in the app webviews that toggles them based on a user preference.

Check relative link handling in webviews

In the dumps, links within the same language's wiki are written as relative
links (./Article_Title), while cross-wiki links (e.g. [[:fr:Pomme]]) are
already expanded to absolute links (https://fr.wikipedia.org/wiki/Pomme).
See https://en.wikipedia.org/wiki/Help:Interwiki_linking for more examples of wikitext linking.

The dump HTML includes a base element set to //lang.wikipedia.org/wiki/, but when a file is opened locally in Firefox the scheme is assumed to be file:, so the links don't work.

We should check whether the Android/iOS webviews have a similar problem; if so, updating the HTML processing to set an explicit scheme in the base element should handle it.
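
For a quick local experiment, assuming the element is written literally as <base href="//en.wikipedia.org/wiki/"/>, pinning the scheme makes the links resolve when the file is opened from disk (article.html is a placeholder file name):

sed -i 's|<base href="//|<base href="https://|' article.html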

Discussed more in #10.

Split downloads across mirrors

As discussed in #22, Wikipedia has a limit of 2 concurrent connections and seems to rate-limit each to about 4 MB/s. There are at least two mirrors of the Enterprise dumps.
For the fastest speeds, ideally we could share downloads between Wikipedia and the mirrors, or even download different parts of the same file concurrently, as aria2c does.

Unfortunately, none of the parallel downloaders I've seen allows setting per-host connection limits (e.g. 2 for dumps.wikimedia.org, 4 for the rest).

So besides writing our own downloader, to respect the wikimedia limits we could:

  • Keep the 2-thread limit and divide the files across the available hosts (sketched below)
  • Increase the 2-thread limit and only use dumps.wikimedia.org for two files
  • Increase the 2-thread limit and don't use dumps.wikimedia.org for any files
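
A sketch of the first option, with placeholder URL-list files: cap dumps.wikimedia.org at two parallel downloads and let a mirror take more.

# wikimedia-urls.txt and mirror-urls.txt are hypothetical lists of dump URLs.
xargs -n 1 -P 2 wget --continue < wikimedia-urls.txt &
xargs -n 1 -P 4 wget --continue < mirror-urls.txt &
wait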

Automate dump downloading

Find the latest dump, listed at https://dumps.wikimedia.org/other/enterprise_html/runs/.

Get the filenames for all supported languages: ${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz.

Download any that aren't already present.

wget handles redirects, retrying on temporary errors, and resuming partial downloads.

Extracting the links from the html could be done with a sed incantation, but wouldn't be very robust.

Alternatively, handle extracting the latest links in a subcommand of the rust program, and invoke that from a script.
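
A minimal sketch of the script approach, where LANGUAGES is a placeholder and the date extraction is exactly the fragile "sed incantation" kind of scraping mentioned above:

BASE=https://dumps.wikimedia.org/other/enterprise_html/runs
# Pull the newest 8-digit run date out of the directory listing (not robust).
LATEST=$(wget -qO- "$BASE/" | grep -oE '[0-9]{8}' | sort -u | tail -n 1)
for LANG in $LANGUAGES; do
    # wget --continue skips files that are already complete and resumes partial ones.
    wget --continue "$BASE/$LATEST/${LANG}wiki-NS0-${LATEST}-ENTERPRISE-HTML.json.tar.gz"
done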

Additional HTML Simplification

As covered in #3, the python scraper uses the Extracts API, which strips most of the article's markup.

The HTML in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump HTML is around 10x the size of the extracts for the subset I looked at.

To get to parity with that output, we'll need to add additional steps to the html processing.

The extension code is available here, and depends on Wikimedia's HtmlFormatter library.

As an example, here's the html from the Extracts API and the enterprise API dump for the Raleigh page:

  • raleigh-extracts-api.html.txt

  • raleigh-enterprise-dump.html.txt

  • Finish list of what to remove:

    • Extracts API selectors:
      table, div, figure, script, input, style, ul.gallery, .mw-editsection, sup.reference, ol.references, .error, .nomobile, .noprint, .noexcerpt, .sortkey,
    • Media elements: img, audio, video, figure, embed
    • Remove class and style from span (and flatten if no other attributes are left)
    • MediaWiki-specific tags and attributes: data-mw, data-mw-*, prefix, typeof, about, rel
    • IDs that start with mw
    • Flatten links
    • Info boxes (should be covered by the above)
    • Gallery sections #16 (may be handled by removing empty sections after removing elements)
    • See if header removals can use the section element that Wikipedia wraps them in
    • Comments
    • Extra whitespace
  • Add steps to the HTML processing code.

  • Set up snapshot tests of sample html.

  • Compare file sizes again.
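
For the file-size comparison, a quick check of raw and gzipped sizes of the two attached samples:

for f in raleigh-extracts-api.html.txt raleigh-enterprise-dump.html.txt; do
    printf '%s: %s bytes raw, %s bytes gzipped\n' \
        "$f" "$(wc -c < "$f")" "$(gzip -c "$f" | wc -c)"
done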

Compare gzip decompressors

With more optimization elsewhere, it's likely that decompressing the gzip archives will become a bottleneck.
There are a number of "parallel" implementations for gzip, but none of the ones I've looked at is actually useful for this case.

Steps:

  • Look for more implementations
  • Compare decompression rates
  • Compare with rust gzip crates

What I've looked at so far:

Implementation  | Notes
GNU gunzip      | Baseline
pigz            | Moves some decompression work to separate threads, mostly single-threaded
libdeflate      | Only for small files
pugz            | Hypothetically ideal, novel approach, but unstable and needs to load the entire file into memory
pgzip (Python)  | Decompression only parallelized when compressed with the same tool
pgzip (Golang)  | Decompression only does single-threaded work on a separate thread
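
A rough benchmark loop for the comparison; the dump file name is a placeholder, and other tools' command lines can be added as they are confirmed. Decompressing to /dev/null isolates decompression cost from disk writes.

F=enwiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
for CMD in "gzip -dc" "pigz -dc"; do
    echo "== $CMD"
    time $CMD "$F" > /dev/null
done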

Investigate articles without QID

The schema for the wikipedia enterprise dumps lists the QID field (main_entity) as optional.

All articles should have a QID, but apparently there are cases where they don't.

It's not just articles so minor that they don't have a wikidata item. For example, this sample of errors from the 20230801 dump includes:

[2023-08-04T17:58:48Z INFO  om_wikiparser] Page without wikidata qid: "Wiriadinata Airport" (https://en.wikipedia.org/wiki/Wiriadinata_Airport)
[2023-08-04T17:59:11Z INFO  om_wikiparser] Page without wikidata qid: "Uptown (Brisbane)" (https://en.wikipedia.org/wiki/Uptown_(Brisbane))

Both articles were edited on 2023-07-31, around when the dump was created.

Is this the main cause of these cases, or is there something else?

Is there some data we can preserve across dumps to prevent this, like keeping old qid links if there is no current one?
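
A quick way to triage this from an existing run, using the log message format shown above (wikiparser.log is a placeholder path), is to count the affected pages and sample a few for manual checking against Wikidata:

grep -c 'Page without wikidata qid' wikiparser.log
grep 'Page without wikidata qid' wikiparser.log | shuf -n 20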

Document interface with generator

As discussed in #6, the interface between the generator and this tool should be explicitly documented.

  • add to README input file formats and output directory structure
  • write up ideas for improving generator interface

Skip articles that haven't changed between dumps

The dump schema includes a date_modified timestamp and other revision metadata.

To reduce disk I/O, we could store some metadata along the articles, compare it against the new one when processing, and skip them if they haven't changed.

One way to do this would be to store the date_modified timestamp as the modified attribute of the article file.
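
A shell analogue of that idea, with $date_modified and $out_file as placeholders (the real check would live in the Rust code):

want=$(date -d "$date_modified" +%s)                  # timestamp from the dump record
have=$(stat -c %Y "$out_file" 2>/dev/null || echo 0)  # mtime of the existing article file
if [ "$want" -eq "$have" ]; then
    echo "unchanged, skipping $out_file"
else
    # (re)write the article here, then stamp it so the next run can skip it:
    touch -d "$date_modified" "$out_file"
fi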

Get all translations for articles matched by title

Currently the program checks for matches against the list of article titles and wikidata QIDs.
The QIDs are language agnostic, so all translations of them will be picked up.

For titles, however, there's no way to tell whether an article is a translation of another title in the list, so only the article in the title's language is matched.

Example

For the Eiffel Tower, if an OSM object has only wikipedia=fr:Tour Eiffel and no wikidata= tag, we don't know to extract en:Eiffel Tower or ru:Эйфелева башня until we process the page in the fr dump and get its wikidata QID.

At the same time there will be Russian-only tags that need to be mapped to other languages, but they can't be resolved until we process the ru dump.

For objects with a wikidata= tag this is not a problem, and there are also wikipedia:lang= tags, but the generator would need to be updated to handle those, and not every OSM object has all of the tags.

Solution

A complete mapping from title to QID would need to include all titles and redirects in each supported language.

We could build that by scanning through all the dumps in an initial pass, or by parsing some smaller dumps of redirects and QIDs, using or doing something similar to the wikimapper project.

Some options to resolve the problem:

  • Build a complete mapping
  • Build a partial mapping only of the required titles
  • Save QIDs of articles matched by title, then find them again in another pass over all dumps

I think writing the missed QIDs out after the first scan is a good first step; if doing two passes increases runtime too much, we can investigate the smaller-dump option.

Investigate escaping in article titles and urls

Wikipedia articles can contain slashes (/). Wikipedia accepts them in urls escaped or not, e.g.
https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport
and
https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport
return the same page, and neither redirects to the other.

The generator attempts to decode URLs from OSM tags, and then encodes '%' again when it converts them back into URLs.

My guess is that some of the tags that are not URLs still have URL encoding in them, but determining which are actually URL-encoded and which just happen to contain '%' is a little tricky, and the generator doesn't attempt it.

It looks like some of the resulting URLs are encoded twice; thankfully only a small number:

$ tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%' | sort | uniq
https://de.wikipedia.org/wiki/Georg-B%25C3%25BCchner-Platz
https://de.wikipedia.org/wiki/Kontorhaus_am_J%25C3%25B6debrunnen
https://en.wikipedia.org/wiki/Brighton_%2526_Hove_Greyhound_Stadium
https://en.wikipedia.org/wiki/de:Liste_der_Kulturdenkmäler_in_Schwachhausen#0218%252CT003
https://en.wikipedia.org/wiki/McMullen%2527s_Brewery
https://en.wikipedia.org/wiki/P%25C3%25A9cs_TV_Tower
https://en.wikipedia.org/wiki/Sedbergh_People%2527s_Hall
https://en.wikipedia.org/wiki/Sight_%2526_Sound_Theatres
https://es.wikipedia.org/wiki/100%25_Banco
https://es.wikipedia.org/wiki/Ruta_de_los_D%25C3%25B3lmenes
https://FR.wikipedia.org/wiki/Maisons_industrialis%25C3%25A9es_%25C3%25A0_Meudon
https://fr.wikipedia.org/wiki/Salm_(rivi%25C3%25A8re_de_Belgique)
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
https://sv.wikipedia.org/wiki/Kungliga_Tr%25C3%25A4dg%25C3%25A5rden_3
https://sv.wikipedia.org/wiki/Sverigev%25C3%25A4ggen
https://sv.wikipedia.org/wiki/V%25C3%25A4ttern,_Storfors_kommun
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Of those, all except the three below are malformed:

https://es.wikipedia.org/wiki/100%25_Banco
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Some seem to be ordinary character data that was encoded twice, for example:

https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
with the extra escaped %25s removed becomes:
https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
which the browser converts to:
https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
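
As a one-off sanity check, assuming a single extra encoding layer, undoing one level of escaping over the same URL list shows whether the results look like valid URLs:

tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%25' | sed 's/%25/%/g' | sort -u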
