j0k3r / graby-site-config

This project forked from fivefilters/ftr-site-config

Graby site config files

License: Other

PHP 74.17% Makefile 25.83%

graby-site-config's Issues

nature.com Improvement

This patch makes article body extraction for nature.com more exact:

--- a/nature.com.txt    2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt    2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]
 
 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]
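
The selector idiom in this patch matches a whole class token rather than a substring, which is why the existing ' article-body ' test does not pick up elements whose class is c-article-body. A minimal Python sketch of that matching rule follows; the helper name has_class_token is hypothetical and this is not how Graby evaluates XPath, only an illustration of the logic:

```python
def has_class_token(class_attr: str, token: str) -> bool:
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    # normalize-space(): trim the ends and collapse runs of whitespace.
    normalized = " ".join(class_attr.split())
    # Padding both sides with a space means only whole tokens can match.
    return f" {token} " in f" {normalized} "


# A plain substring test would wrongly treat these as the same class.
print("article-body" in "c-article-body")                  # True
print(has_class_token("c-article-body", "article-body"))   # False
print(has_class_token("  main  c-article-body ", "c-article-body"))  # True
```

Because the token test is strict, each class name used by the site needs its own alternative in the body expression, which is what the added third branch does.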

Bypassing paywall using Twitter UA & Referer?

It seems we can bypass some paywalls with a trick as simple as replacing the referer and user agent with Twitter's.

https://elaineou.com/2017/01/19/how-the-twitter-app-bypasses-paywalls/

This should be tested against the wsj.com website.
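
A rough sketch of what such a test request could look like, using only the Python standard library. The header values are assumptions: "https://t.co/" stands in for a Twitter link referer and "Twitterbot/1.0" for a Twitter user agent, and neither has been verified against wsj.com. The function build_request is hypothetical and only builds the request; it does not send it.

```python
import urllib.request


def build_request(url: str,
                  referer: str = "https://t.co/",        # assumed value
                  user_agent: str = "Twitterbot/1.0"     # assumed value
                  ) -> urllib.request.Request:
    """Build (but do not send) a request carrying Twitter-like headers."""
    req = urllib.request.Request(url)
    req.add_header("Referer", referer)
    req.add_header("User-Agent", user_agent)
    return req


req = build_request("https://www.wsj.com/")
# urllib stores header names capitalized as "Referer" and "User-agent".
print(req.header_items())
```

In a site config the same idea would presumably be expressed with http_header(...) lines, like the http_header(Cookie) directive quoted in another issue here.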

cnn.com.txt and edition.cnn.txt - Meta refresh to unsupported browser page

I'm using Wallabag 2.3.2 and have found that cnn.com hasn't worked for a while. It appears that the find_string and replace_string directives meant to prevent the redirect to the unsupported-browser page aren't working:

find_string: <meta http-equiv="refresh"
replace_string: <meta norefresh

[2018-06-05 20:17:30] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://edition.cnn.com/2018/06/05/politics/scott-pruitt-chick-fil-a-job-wife/index.html" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2018/06/05/politics/scott-pruitt-chick-fil-a-job-wife/index.html"} []
[2018-06-05 20:17:30] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.86.0/static/unsupp.html [] []
[2018-06-05 20:17:30] graby.DEBUG: Trying using method "get" on url "https://edition.cnn.com/2.86.0/static/unsupp.html" {"method":"get","url":"https://edition.cnn.com/2.86.0/static/unsupp.html"} []

I tried messing with those strings but couldn't find the fix.
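
One possible cause, offered only as a hypothesis: find_string is a literal match, so it does nothing if the page writes the tag differently (attribute order, quoting, case). The Python sketch below illustrates that failure mode with made-up markup; the sample strings and the has_meta_refresh check are assumptions and do not reproduce CNN's HTML or Graby's actual detection code.

```python
import re

FIND = '<meta http-equiv="refresh"'
REPLACE = '<meta norefresh'


def apply_find_replace(html: str) -> str:
    """Literal replacement, as find_string/replace_string are described."""
    return html.replace(FIND, REPLACE)


def has_meta_refresh(html: str) -> bool:
    """Loose check for a remaining meta refresh (illustrative only)."""
    return re.search(r"""http-equiv\s*=\s*["']?refresh""", html, re.I) is not None


# Hypothetical markup: same tag, attributes in a different order.
exact = '<meta http-equiv="refresh" content="1;url=/static/unsupp.html">'
variant = '<meta content="1;url=/static/unsupp.html" http-equiv="refresh">'

print(has_meta_refresh(apply_find_replace(exact)))    # False: neutralised
print(has_meta_refresh(apply_find_replace(variant)))  # True: still redirects
```

If the live page uses a variant like the second one, the find_string would need to match that exact spelling instead.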

If I'm testing this with Wallabag, would I need to do anything beyond running php /wallabag/bin/console cache:clear --env=prod for it to pick up the updated site config?

That same link works for me on f43.me, though.

Cheers

theathletic.com updates

Saw theathletic.com fail to pull in recently. Looks likely related to changes in the login process on the site, routing to /login2/ instead of /login/ and form changes. As of now, it pulls in a truncated version with a segment of the article paywalled. Happy to provide creds via a safe medium if it helps.

Example link: https://theathletic.com/3243539/2022/04/12/arsenals-january-gambles-have-left-them-in-the-hands-of-fate/

Grabbing article title from rockpapershotgun.com broken

The config for rockpapershotgun.com does not work anymore. It tries to extract the title from the h2 element which just results in all articles in Wallabag being titled with "Tagged with" instead of the actual title.

The title selector should be "//h1[@class='title']" - I'll send a PR with the change.
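
To illustrate the difference between the two selectors, here is a small Python sketch run against made-up markup. The SAMPLE document is an assumption, not the real rockpapershotgun.com page, and ElementTree supports only a subset of XPath (hence the ".//" prefix). It also shows that attribute names are case-sensitive, so the test has to be spelled @class.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page structure for illustration only.
SAMPLE = (
    "<html><body><article>"
    '<h1 class="title">Actual article title</h1>'
    "<h2>Tagged with</h2>"
    "</article></body></html>"
)


def first_text(root: ET.Element, path: str):
    """Return the text of the first element matching path, or None."""
    node = root.find(path)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)
print(first_text(root, ".//h2"))                  # Tagged with
print(first_text(root, ".//h1[@class='title']"))  # Actual article title
print(first_text(root, ".//h1[@Class='title']"))  # None
```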

[BUG] Finder->sortByName breaks custom overwrites

Found files are currently sorted by name (https://github.com/j0k3r/graby-site-config/blob/master/src/Files.php#L24), which effectively prevents you from overriding the provided config files with custom ones.

Example:

Assume we have the following structure (and example.com would exist in this repository as well):

site_config
  default
    example.com
  custom
    example.com

$files = Files::getFiles(['site_config/default', 'site_config/custom'])

I would assume, based on the order of the folders I passed to getFiles, that site_config/custom/example.com is used, but instead (because of the sorting) vendor/j0k3r/graby-site-config/example.com would be used.

Simply getting rid of the sorting (which IMHO isn't needed anyway) would return the files in the correct order.
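
The effect can be sketched in a few lines of Python. This is not the PHP code from Files.php: pick_configs is a hypothetical helper, and it assumes that when two files share a host name, the one seen last is the one that ends up being used.

```python
import os


def pick_configs(paths, sort_by_name):
    """Map each host file name to the path that wins (last one seen)."""
    ordered = sorted(paths) if sort_by_name else list(paths)
    chosen = {}
    for path in ordered:
        # Assumption: a later file with the same name replaces an earlier one.
        chosen[os.path.basename(path)] = path
    return chosen


# Folder order as passed by the caller: defaults first, overrides second.
paths = [
    "site_config/default/example.com",
    "site_config/custom/example.com",
]

print(pick_configs(paths, sort_by_name=True))
# {'example.com': 'site_config/default/example.com'}
print(pick_configs(paths, sort_by_name=False))
# {'example.com': 'site_config/custom/example.com'}
```

With sorting, "custom" orders before "default", so the default file is seen last and wins; keeping the caller's folder order lets the custom file take precedence.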

Allow Symfony 7

Fixed by #54

Golem.de needs cookie-update

This is fixed in fivefilters@892b0d3

Possible to create a reddit config that also grabs comments?

As the title says, I want to use Wallabag to grab an entire reddit post, including the comments.

Is there any way to do this?

Thanks!

Can't parse article from nymag.com despite GDPR cookie in site-config

I am trying to import http://nymag.com/selectall/2018/04/an-apology-for-the-internet-from-the-people-who-built-it.html into Wallabag, but keep getting the GDPR consent page instead.

The site-config seems to have the right cookie:

http_header(Cookie): nymuc=1111111111111

I even replaced it with that cookie, and both of the others, from my browser, but I keep getting the consent page regardless of whether I reload the entry or delete it and re-add the URL.

I cleared the Wallabag PHP cache, to no avail.

PS: Not sure whether this should be a Wallabag or ftr-site-config issue, so I went for the middle ground; please let me know if another repo is better indicated.

Issue with nextinpact config

Hello,

I have an issue with "nextinpact.com.txt": long articles are not grabbed completely.
The article in test_url is too short to show the problem, and I can't find any documentation on how to run the tests.

Thanks for your help.

Trouble adding larlesienne.info

Hi,

I can't get a working configuration file for paywalled articles on larlesienne.info.
Maybe that is because the login page has two forms (one for logging in, one for registering), each with username and password inputs?

Can someone help?

I tried:

body: //article[contains(concat(' ',normalize-space(@class),' '),' module ')]
test_url: https://larlesienne.info/2022/02/22/la-municipalite-de-carolis-fait-planter-le-service-informatique/

requires_login: yes
login_uri: https://larlesienne.info/mon-compte/
login_username_field: username
login_password_field: password
not_logged_in_xpath: //div[@class="pmpro_content_message"]

tabs.ultimate-guitar.com

Hey, could you create a config for tabs.ultimate-guitar.com? I created a fairly basic config file, but I have no way to test it.

For https://tabs.ultimate-guitar.com/m/macklemore/same_love_crd.htm the idea would be to get the same content as when you click print (https://tabs.ultimate-guitar.com/print/1186397): the tablature with the chords. Is it possible to use single_page_link? In any case, the point-and-click tool doesn't work on the buttons/links.

Otherwise, the body would need to take pre.js-tab-content, together with the chords that appear when you click "Display chords" at the top left of the tab.

With Wallabag v1.9.2 the page is saved correctly, but with v2 the print URL is redirected to the main one.

Thanks!

multipage articles do not use defined cookie?

E.g. https://www.golem.de/news/app-entwicklung-cross-platform-oder-nativ-programmieren-2202-162600.html

When I add this article, I do not get the content of page 2 and the following pages. Instead of the content of page 2, I see a notice about the missing cookie consent, even though the cookie is already defined in https://github.com/j0k3r/graby-site-config/blob/master/golem.de.txt#L5. Is the cookie sent as a header when fetching the subsequent pages of an article?
Fetching page 1 works fine.
