Comments (33)
perfect.
from mwoffliner.
@kelson42 @Popolechien
For review:
- EN top 100 (nopic): https://framadrop.org/r/T3rMJ5Lb24#tIg7V9ceXfFjT/96FJpKZ7gu1e70GSsuDDi1+nlKlLk=
- FR top 100 (nopic): https://framadrop.org/r/-7DezLe9kR#X/SpyR0FxJjyoT8Nzvxh0M/OkwqMwOJA3r4udgEg/Yc=
from mwoffliner.
BM Full nopic: https://framadrop.org/r/cyk0sHthFk#vjOsZMdLvq9vqrulrpSOO/WUqSAlZ7ehMf6Zv36aVy0=
No, the current logic is to check each article for categories as it's downloaded. Then we only end up with categories that contain at least one article as per @kelson42's spec:
Mirrors categories which only have at least one article in it.
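As a rough illustration of the logic described above (not the actual mwoffliner code), categories can be collected per article as each one is downloaded, so that only categories with at least one article ever get recorded. The `ArticleCategories` shape and the `CategoryIndex` class are hypothetical names introduced for this sketch.

```typescript
// Hypothetical shape: the categories attached to one downloaded article.
interface ArticleCategories {
  articleTitle: string;
  categories: string[];
}

// Accumulates category -> member articles while articles are processed.
class CategoryIndex {
  private members = new Map<string, Set<string>>();

  // Call once per downloaded article.
  addArticle(entry: ArticleCategories): void {
    for (const category of entry.categories) {
      let articles = this.members.get(category);
      if (!articles) {
        articles = new Set<string>();
        this.members.set(category, articles);
      }
      articles.add(entry.articleTitle);
    }
  }

  // Every category recorded here has at least one article by construction.
  categories(): string[] {
    return Array.from(this.members.keys()).sort();
  }

  // Articles seen so far in the given category (empty if unknown).
  articlesIn(category: string): string[] {
    const articles = this.members.get(category);
    return articles ? Array.from(articles).sort() : [];
  }
}
```

Because a category is only created when an article references it, no separate "is this category empty?" pass is needed with this approach.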
from mwoffliner.
Yes, categories are not mirrored. This is work still to be done in mwoffliner. Probably the top priority.
from mwoffliner.
@ISNIT0 This is currently the TOP priority topic on mwoffliner. It's not extremely complicated but needs a bit of work. Let me know if you are interested in having a look so I can explain it to you a bit.
from mwoffliner.
@kelson42 I'm interested :) What's the best place to start looking?
from mwoffliner.
@ISNIT0 Let's make a video conf about that. Let me know when you have time.
from mwoffliner.
Was 'probably the top priority' in Sept of 2016 yet still not implemented. Any time frame?
from mwoffliner.
No, it is still the top priority, but there is nobody to work on this so far.
from mwoffliner.
Thanks for pointing me to this. Hope to see it fixed sometime soon. Maybe a google summer of code project for someone?
from mwoffliner.
@WikiDocJames Maybe, even though this first GSoC we are managing is focused on Kiwix-Android. That said, if someone motivated and capable comes to me, I might consider mentoring it myself.
from mwoffliner.
Further Context: this issue directly affects Haiti schools who've made clear they would use Vikidia IF its link ("84 super articles") were clickable in the top right, as seen in the current Vikidia ZIM here:
http://iiab.me:3000/vikidia_fr_all_novid_2018-03/
Current Vikidia ZIM downloaded from:
http://download.kiwix.org/zim/vikidia/vikidia_fr_all_2018-03.zim
By comparison, the original (online) version at https://fr.vikidia.org works far better. The offline version (the ZIM file above) is extremely frustrating for educators and children as long as the most important link ("84 super articles") is not fixed. In future, these essential materials should appear much as they do online here:
https://fr.vikidia.org/wiki/Cat%C3%A9gorie:Super_article
PS @kelson42 has clarified that he's hopeful this will be fixed before the end of 2018.
from mwoffliner.
Things to do (the ones I can see):
- Include the "Category" namespace in the namespaces scraped by default
- Verify that category pages are scraped properly
- Ensure that links from article pages to categories work properly
- Ensure the category links at the bottom of the page are displayed properly
- Ensure the list of articles is displayed offline as it is online (sorted alphabetically)
- Ensure the category pagination works properly
- Remove articles which are not mirrored from the category list of articles.
- Mirrors categories which only have at least one article in it.
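The last two items of the list above could be sketched as a single pruning pass. This is only an assumption about how it might be done: `pruneCategories`, `categoryMembers`, and `mirrored` are hypothetical names, and `localeCompare` is a stand-in for the sort keys MediaWiki uses online.

```typescript
// Keep only members that are part of the mirrored selection, sort them,
// and drop any category left without a single mirrored article.
function pruneCategories(
  categoryMembers: Map<string, string[]>, // category title -> article titles
  mirrored: Set<string>,                  // titles actually included in the ZIM
): Map<string, string[]> {
  const result = new Map<string, string[]>();
  categoryMembers.forEach((members, category) => {
    const kept = members
      .filter((title) => mirrored.has(title))
      .sort((a, b) => a.localeCompare(b)); // approximation of the online order
    if (kept.length > 0) {
      result.set(category, kept);
    }
  });
  return result;
}
```

Doing both steps together avoids listing an article that would otherwise show up as a dead link on the category page.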
from mwoffliner.
What is the best thing to do for an articleList selection? Keep all the many parent categories? Not keep categories? Keep only one level of categories? Something else?
from mwoffliner.
@ISNIT0 Keep each category with at least one non-category child and merge all categories (to the top one) if there is only one sub-category.
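One possible reading of this rule, sketched on a simple tree: a category with no articles and exactly one sub-category absorbs that sub-category, and categories left with nothing are dropped. `CategoryNode` and `simplify` are hypothetical names, and real category structures form a graph rather than a tree, so treat this as an illustration only.

```typescript
interface CategoryNode {
  title: string;
  articles: string[];            // non-category children
  subCategories: CategoryNode[]; // category children
}

// Returns a simplified copy of the tree, or null if nothing is worth keeping.
function simplify(node: CategoryNode): CategoryNode | null {
  // Simplify children first and drop the ones that end up empty.
  const subs: CategoryNode[] = [];
  for (const child of node.subCategories) {
    const simplified = simplify(child);
    if (simplified !== null) {
      subs.push(simplified);
    }
  }

  // No articles and a single sub-category: merge it into the top one.
  if (node.articles.length === 0 && subs.length === 1) {
    const only = subs[0];
    return { title: node.title, articles: only.articles, subCategories: only.subCategories };
  }

  // Nothing left at all: this category is not mirrored.
  if (node.articles.length === 0 && subs.length === 0) {
    return null;
  }

  return { title: node.title, articles: node.articles.slice(), subCategories: subs };
}
```

A category with no articles but several sub-categories is kept here, since it is still needed for navigation between its children.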
from mwoffliner.
What about categories with media? e.g. https://commons.wikimedia.org/wiki/Category:Birds_in_art
from mwoffliner.
There doesn't seem to be a way to get the structured data of what order to show the sub-categories in. It's not just Alphabetical:
e.g. https://bm.wikipedia.org/wiki/Cat%C3%A9gorie:Lien_th%C3%A9matique_pour_cat%C3%A9gories
The single category is in the "G" namespace
and
https://en.wikipedia.org/wiki/Category:London
There is a "*", "Β" (Greek letter), "Ι", "Ξ", and "Σ"
Any suggestions here @Popolechien?
The query I'm currently using is this: https://bm.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtype=subcat&cmlimit=500&format=json&cmtitle=Cat%C3%A9gorie%3ALien_th%C3%A9matique_pour_cat%C3%A9gories
Which only gives back the article namespace, pageid, and title.
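For reference, the query above can be assembled programmatically. The sketch below only builds the URL; `categoryMembersUrl` is a hypothetical helper, and the optional `cmcontinue` argument assumes the standard MediaWiki continuation parameter is used to fetch results beyond the first 500.

```typescript
// Builds a list=categorymembers query for the sub-categories of one category.
function categoryMembersUrl(
  apiBase: string,       // e.g. "https://bm.wikipedia.org/w/api.php"
  categoryTitle: string, // full title including the namespace prefix
  cmcontinue?: string,   // continuation token from a previous response, if any
): string {
  const params = new URLSearchParams({
    action: "query",
    list: "categorymembers",
    cmtype: "subcat",
    cmlimit: "500",
    format: "json",
    cmtitle: categoryTitle,
  });
  if (cmcontinue) {
    params.set("cmcontinue", cmcontinue);
  }
  return `${apiBase}?${params.toString()}`;
}

const exampleUrl = categoryMembersUrl(
  "https://bm.wikipedia.org/w/api.php",
  "Catégorie:Lien thématique pour catégories",
);
```

`URLSearchParams` takes care of percent-encoding the title, so the non-ASCII characters do not need to be escaped by hand.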
from mwoffliner.
Progress:
I've added a --getCategories work-in-progress flag which enables the category scraping.
There are certainly issues with the current implementation, so it should not be used yet.
Each article has a Categories section added to the bottom with a list of links to category pages, and each Category page has a Sub-categories section which links to sub-category pages.
TODO:
- Display categories as on wikipedia.org
- List pages within a category
- Improve efficiency of category page scraping
from mwoffliner.
It seems that to display the categories in the same way as MediaWiki displays them, we need information that isn't available through the API. Instead I'm just grouping them alphabetically, which is pretty close.
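A minimal sketch of such alphabetical grouping, assuming titles are passed without their namespace prefix; `groupByFirstLetter` is a hypothetical name, and this is not necessarily how mwoffliner implements it.

```typescript
// Groups titles under their upper-cased first character, in sorted order.
function groupByFirstLetter(titles: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  const sorted = titles.slice().sort((a, b) => a.localeCompare(b));
  for (const title of sorted) {
    const letter = title.charAt(0).toUpperCase();
    const group = groups.get(letter);
    if (group) {
      group.push(title);
    } else {
      groups.set(letter, [title]);
    }
  }
  return groups;
}
```

Unlike the online pages, this ignores custom sort keys, which is why headings such as "*" or Greek letters would not be reproduced.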
from mwoffliner.
Progress so far:
https://framadrop.org/r/R1A5MJwaey#hxz6gNnGy7mFqv23Hf9SJNQSVL9JPS+pHOGQgVcyDvc=
Known issues:
- Categories/Sub-Categories/Pages sections don't collapse/expand (resolved by implementing #677)
- "Hidden" categories are not separated from normal categories (solved by not downloading "hidden" categories)
from mwoffliner.
Yeah, the hidden categories not being weeded out is a real blocker. These are useless and take up quite some space. @ISNIT0 what's your plan about those?
from mwoffliner.
@Popolechien I've just updated the comment above, we're now not scraping them at all. Is this okay?
from mwoffliner.
Niiiice.
Am I right to understand that all categories within the categories will also be showcased (ie not only the categories in the articles themselves)?
Either way, good job!
from mwoffliner.
I have tested with https://framadrop.org/r/D1EE0C6YxL#SwJO6719lYGfukNN1i71HHy1glAK4MaJTdKiifDHBlo=:
- In Category pages, if an article is not in the selection it should not be listed (currently in black)
- I think we should choose another namespace to put category pages in. The ZIM spec talks about "U", see https://wiki.openzim.org/wiki/ZIM_file_format#Namespaces
- Up categories should be migrated too, it is not the case in "Category:2010s in Austria".
from mwoffliner.
@kelson42 What do you mean by "Up categories should be migrated too"?
from mwoffliner.
@ISNIT0 I mean "categories' parent categories": the full ancestor tree should be downloaded (but of course in a simplified version).
from mwoffliner.
@kelson42 You previously said:
Mirrors categories which only have at least one article in it.
from mwoffliner.
@ISNIT0 @kelson42 @Popolechien thanks a lot for working on this - I think the addition of live category links will make for a huge improvement! So far I've been working with Kirundi/Kinyarwanda/French ZIMs for use in refugee camps, and they also appeared with dead links on the index.html page. Are you thinking of applying these changes (active category links) to all ZIMs currently available for download via the Kiwix website?
from mwoffliner.
@samkellerhals This is the goal, might take a few additional months to see it happening everywhere.
from mwoffliner.
@kelson42 This is now doing the tree-shaking/graph simplification:
https://framadrop.org/r/dIaIeQVRtO#zZjY9W6s5P6ukctJPxU8GDvEQpzAUPdsqSKXbQohwII=
Because this is done using the top 100 articles, there is not a lot of shared categorisation, but Mantis is a good example
from mwoffliner.
@kelson42 I'd like to move the namespacing item you mentioned into a separate ticket and add it to 2.0
I can see it causing lots of back-and-forth with routing edge-cases
from mwoffliner.
@ISNIT0 From what I can see in the last file you proposed (https://framadrop.org/r/P1S5xi6PRm#A6fiUMsysQsdZzr72yXsT6i/QaYm/Dc97iJZZtYktVg=), this looks quite good :) That said, I was not able to check whether the pagination works fine! Do you have a demo ZIM for that?
from mwoffliner.
AFAIK everything has now been implemented in 1.9, except #762 to be done in 2.0
from mwoffliner.