
graby-site-config's People

Contributors

bilelmoussaoui, burkemw3, coolius, digicommons, doc75, elibadou384, fivefilters, holgerausb, j0k3r, jangernert, janjastrow, jordidg, kdecherf, kreativmonkey, lukas0907, marmo, moneytoo, ngosang, nicosomb, shtrom, silberzwiebel, simounet, snptrs, stesie, strubbl, techexo, thiagotalma, timgws, tomtaylor, zinnober

graby-site-config's Issues

nature.com Improvement

This patch makes article body extraction for nature.com more precise by also matching the c-article-body class:

--- a/nature.com.txt    2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt    2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]
 
 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]
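The class-token idiom used in the body rule can be sketched in plain Python to show why the extra alternative is needed. This is only an illustration of the matching logic, not Graby's implementation; the helper names and the sample class value "c-article-body main-content" are made up for the example.

```python
def has_class_token(class_attr: str, token: str) -> bool:
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    # normalize-space(): trim the ends and collapse runs of whitespace
    normalized = " ".join(class_attr.split())
    # Padding both sides with spaces makes the test match whole tokens only
    return f" {token} " in f" {normalized} "


OLD_TOKENS = ["article__body", "article-body"]
NEW_TOKENS = OLD_TOKENS + ["c-article-body"]


def matches_any(class_attr: str, tokens: list) -> bool:
    """True if any of the tokens appears as a whole class name."""
    return any(has_class_token(class_attr, t) for t in tokens)


# Hypothetical class attribute, chosen only for illustration
sample = "c-article-body main-content"
old_match = matches_any(sample, OLD_TOKENS)
new_match = matches_any(sample, NEW_TOKENS)
```

Because the test matches whole tokens, "article-body" does not match inside "c-article-body", which is why the patch adds a third alternative instead of relying on the existing ones.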

cnn.com.txt and edition.cnn.txt - Meta refresh to unsupported browser page

I'm using Wallabag 2.3.2 and have found that cnn.com hasn't worked for a while. It appears that the find_string and replace_string directives, which are meant to prevent the redirect to the unsupported browser page, aren't working:

find_string: <meta http-equiv="refresh"
replace_string: <meta norefresh
[2018-06-05 20:17:30] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://edition.cnn.com/2018/06/05/politics/scott-pruitt-chick-fil-a-job-wife/index.html" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2018/06/05/politics/scott-pruitt-chick-fil-a-job-wife/index.html"} []
[2018-06-05 20:17:30] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.86.0/static/unsupp.html [] []
[2018-06-05 20:17:30] graby.DEBUG: Trying using method "get" on url "https://edition.cnn.com/2.86.0/static/unsupp.html" {"method":"get","url":"https://edition.cnn.com/2.86.0/static/unsupp.html"} []

I tried messing with those strings but couldn't find the fix.
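As a rough check of what the two directives can and cannot catch, here is a minimal sketch that assumes the pair behaves as a literal, case-sensitive substring replacement. Both sample tags are hypothetical; the report does not show the actual CNN markup, nor whether Graby looks for the meta refresh before or after applying the replacement.

```python
FIND = '<meta http-equiv="refresh"'
REPLACE = "<meta norefresh"


def apply_find_replace(html: str) -> str:
    """Literal substring replacement, as the directive pair is assumed to work."""
    return html.replace(FIND, REPLACE)


# Hypothetical tag that starts exactly like the find string
exact = '<meta http-equiv="refresh" content="1; url=/static/unsupp.html">'
# Hypothetical tag with the attributes in a different order
variant = '<meta content="1; url=/static/unsupp.html" http-equiv="refresh">'

exact_result = apply_find_replace(exact)
variant_result = apply_find_replace(variant)
```

In this sketch the first tag is neutralized while the second passes through unchanged, so comparing the find string against the raw page source is a reasonable first step. It is only one possible explanation for the behaviour in the log.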

If I'm testing this with Wallabag, do I need to do anything beyond php /wallabag/bin/console cache:clear --env=prod for it to see the updated site config?

That same link works for me on f43.me, though.

Cheers

Grabbing article title from rockpapershotgun.com broken

The config for rockpapershotgun.com does not work anymore. It tries to extract the title from the h2 element, which results in all articles in Wallabag being titled "Tagged with" instead of the actual title.

The title selector should be "//h1[@class='title']". I'll send a PR with the change.
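XPath attribute names are case-sensitive, so the selector has to use lowercase @class. A small sketch with Python's standard library shows the difference; the markup is a made-up stand-in for the real page, not a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page structure for illustration only
page = ET.fromstring(
    "<html><body>"
    "<h1 class='title'>Actual article title</h1>"
    "<h2>Tagged with</h2>"
    "</body></html>"
)

# Lowercase attribute name: matches the h1 element
lower = page.findall(".//h1[@class='title']")
# Capitalized attribute name: matches nothing, names are case-sensitive
upper = page.findall(".//h1[@Class='title']")

title = lower[0].text if lower else None
```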

[BUG] Finder->sortByName breaks custom overwrites

Found files are currently sorted by name (https://github.com/j0k3r/graby-site-config/blob/master/src/Files.php#L24), which prevents you from overriding the provided config files with custom ones.

Example:

Assume we have the following structure (and example.com would exist in this repository as well):

site_config
  default
    example.com
  custom
    example.com

$files = Files::getFiles(['site_config/default', 'site_config/custom']);

I would assume, based on the order of the folders I passed to getFiles, that site_config/custom/example.com is used, but instead (because of the sorting) vendor/j0k3r/graby-site-config/example.com would be used.

Simply removing the sorting (which IMHO isn't needed anyway) would return the files in the correct order.
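The effect can be reproduced with a small model. It assumes that the file list is sorted by full path and that, when two files share a name, the one seen last wins; both points are assumptions made for illustration, and the collect function is hypothetical, not the library's code.

```python
def collect(folders, sort_by_name):
    """Return {file name: chosen path}; later paths override earlier ones."""
    paths = [f"{folder}/{name}" for folder, names in folders for name in names]
    if sort_by_name:
        # Sorting by path discards the order in which the folders were given
        paths = sorted(paths)
    chosen = {}
    for path in paths:
        chosen[path.rsplit("/", 1)[1]] = path
    return chosen


folders = [
    ("site_config/default", ["example.com"]),
    ("site_config/custom", ["example.com"]),
]

sorted_result = collect(folders, sort_by_name=True)
unsorted_result = collect(folders, sort_by_name=False)
```

With sorting, "site_config/custom" comes before "site_config/default" alphabetically, so the default file is seen last and wins. Without sorting, the caller's folder order is kept and the custom file wins.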

Can't parse article from nymag.com despite GDPR cookie in site-config

I am trying to import http://nymag.com/selectall/2018/04/an-apology-for-the-internet-from-the-people-who-built-it.html into Wallabag, but keep getting the GDPR consent page instead.

The site-config seems to have the right cookie:

http_header(Cookie): nymuc=1111111111111

I even replaced it with the value of that cookie from my browser, and also tried the two other cookies my browser had, but I keep getting the consent page, regardless of whether I reload the entry or delete it and re-add the URL.
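For reference, a minimal sketch of how an http_header(...) line maps to a request header, and how several cookies would be combined into a single Cookie value. The parser and the helper are hypothetical and only model the directive syntax shown above; the extra cookie name is invented, and whether the site accepts these values is a separate question.

```python
import re

# Matches lines such as "http_header(Cookie): name=value"
DIRECTIVE = re.compile(r"^http_header\(([^)]+)\):\s*(.*)$")


def parse_http_headers(config_text: str) -> dict:
    """Collect http_header(Name): value lines into a header dictionary."""
    headers = {}
    for line in config_text.splitlines():
        match = DIRECTIVE.match(line.strip())
        if match:
            headers[match.group(1).strip()] = match.group(2).strip()
    return headers


def cookie_header(cookies: dict) -> str:
    """Join several cookies into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


headers = parse_http_headers("http_header(Cookie): nymuc=1111111111111\n")
# "other_cookie" is an invented name, used only to show the joined format
combined = cookie_header({"nymuc": "1111111111111", "other_cookie": "abc"})
```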

I cleared the Wallabag PHP cache, to no avail.

PS: Not sure whether this should be a Wallabag or ftr-site-config issue, so I went for the middle ground; please let me know if another repo is better indicated.

Issue with nextinpact config

Hello,

I have an issue with "nextinpact.com.txt": long articles are not grabbed completely.
The article in test_url is too short to show the problem, and I can't find any documentation on how to run the tests.

Thanks for your help.

Trouble adding larlesienne.info

Hi,

I can't get a working configuration file for paywalled articles on larlesienne.info.
Maybe it is because the login page has two forms (one for logging in, one for registering), both with username and password inputs?

Can someone help?

Here is what I tried:

body: //article[contains(concat(' ',normalize-space(@class),' '),' module ')]
test_url: https://larlesienne.info/2022/02/22/la-municipalite-de-carolis-fait-planter-le-service-informatique/

requires_login: yes
login_uri: https://larlesienne.info/mon-compte/
login_username_field: username
login_password_field: password
not_logged_in_xpath: //div[@class="pmpro_content_message"]
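To check the two-forms theory, the login page's forms and their input names can be listed with a few lines of standard-library Python. The markup below is a hypothetical stand-in for the real login page, and FormFieldLister is an illustrative helper, not part of Graby or Wallabag.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Record the input names found inside each <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of input names per form
        self._current = None   # inputs of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and "name" in attrs:
            self._current.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical markup: a login form followed by a registration form
SAMPLE = """
<form id="login"><input name="username"><input name="password" type="password"></form>
<form id="register"><input name="username"><input name="email"><input name="password" type="password"></form>
"""

lister = FormFieldLister()
lister.feed(SAMPLE)
# Forms that contain both field names used in the site config
ambiguous = [f for f in lister.forms if {"username", "password"} <= set(f)]
```

If more than one form contains both field names, as in this sample, the configured field names alone do not identify the login form, which would be consistent with the suspicion above.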

tabs.ultimate-guitar.com

Hey, could you create a config for tabs.ultimate-guitar.com? I wrote a fairly basic config file, but I have no way to test it.

For https://tabs.ultimate-guitar.com/m/macklemore/same_love_crd.htm, the idea would be to get the same content as when you click Print (https://tabs.ultimate-guitar.com/print/1186397): the tablature with the chords. Is it possible to use single_page_link? In any case, the point-and-click tool doesn't work on the buttons/links.

Otherwise, the body would have to be taken from pre.js-tab-content, together with the chords that appear when you click "Display chords" at the top left of the tab.

With Wallabag v1.9.2 the page is saved correctly, but with v2 the print URL is redirected to the main one.

Thanks!

multipage articles do not use defined cookie?

E.g. https://www.golem.de/news/app-entwicklung-cross-platform-oder-nativ-programmieren-2202-162600.html

When I add this article, I do not get the content of page 2 and the following pages. Instead of the content of page 2, I see a notice about the missing cookie consent. But the cookie is already defined in https://github.com/j0k3r/graby-site-config/blob/master/golem.de.txt#L5. Is the cookie sent as a header when fetching the subsequent pages of an article?
Fetching page 1 works fine.
