Comments (8)

hadley commented on August 21, 2024

You probably need to set the Accept header to prioritise HTML over JSON.

from curl.

briatte commented on August 21, 2024

That unfortunately did not work:

GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17", accept("text/html"))

still returns the "JSON" version of the page.

jeroen commented on August 21, 2024

They are running a misconfigured cache server, so you are getting false hits. Try this:

req <- GET("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350", 
  add_headers(Accept = "text/html", "Cache-control" = "max-age=0"))
content(req, "text")

Sometimes it helps if you just add an arbitrary parameter to the URL to bypass the cache:

url <- paste0("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350&_random=", runif(1))
req <- GET(url, accept("text/html"))
content(req, "text")

briatte commented on August 21, 2024

Still no luck: I get a false hit, whatever the URL used.

On top of that, I've just discovered that curl -v always returns HTML, but the content of the page is often faulty ("page temporarily inaccessible, return later").

jeroen commented on August 21, 2024

I think this works:

library(httr)
url <- "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17"
url <- paste0(url, "&_random=", rnorm(1))
req <- GET(url, accept("text/html"))
stopifnot(req$headers[["x-cache"]] == "MISS")
stopifnot(req$headers$age == "0")
content(req, "text")

Their server is really poorly configured: not only does it give false hits, it also ignores the Cache-Control: no-cache request header. Slightly changing the URL, however, usually forces the cache server to fetch a new copy.
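
The two stopifnot() checks above could be wrapped in a small helper. This is only a sketch, assuming the cache reports its status through the X-Cache and Age response headers as this server appears to; is_fresh is a hypothetical name, and it takes a plain named list so it can be tried without making a request. Unlike the strict == "MISS" comparison above, it also accepts values that merely contain MISS.

```r
# Hypothetical helper: TRUE when the response headers suggest that the
# cache fetched a new copy (X-Cache contains MISS and Age is 0 or absent).
is_fresh <- function(headers) {
  # HTTP header names are case-insensitive, so normalise them first.
  names(headers) <- tolower(names(headers))
  x_cache <- headers[["x-cache"]]
  age <- headers[["age"]]
  # A missing X-Cache header is treated as "unknown", hence not fresh.
  miss <- !is.null(x_cache) && grepl("MISS", x_cache, ignore.case = TRUE)
  # Age counts the seconds a copy has spent in the cache.
  young <- is.null(age) || identical(as.numeric(age), 0)
  miss && young
}

is_fresh(list("X-Cache" = "MISS", "Age" = "0"))
is_fresh(list("X-Cache" = "HIT", "Age" = "120"))
```

With httr, something like is_fresh(headers(req)) should work on a real response, although that has not been checked against this particular server.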

briatte commented on August 21, 2024

It seems to work indeed!

Thank you very much to both of you for your help.

How did you come up with the &_random= part?

jeroen commented on August 21, 2024

It's just something arbitrary that you add to the URL in order to trick the cache server into thinking that you are fetching a different page, so it cannot serve you a cached copy. It's a common trick to force bypassing any cache.
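
As a rough sketch of that trick in reusable form: bust_cache and its param argument are hypothetical names, not part of httr or curl. It assumes the cache keys on the full URL including the query string, and it does not handle URLs that contain a #fragment.

```r
# Hypothetical helper: append a throwaway query parameter so that a cache
# keyed on the full URL treats each request as a different page.
bust_cache <- function(url, param = "_random") {
  # Start a query string with "?" if there is none yet, else extend with "&".
  sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  # Timestamp plus a random integer: unlikely to repeat between calls.
  token <- paste0(format(Sys.time(), "%Y%m%d%H%M%S"), sample.int(1e6, 1))
  paste0(url, sep, param, "=", token)
}

bust_cache("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350")
```

The result can be passed straight to GET(), for example GET(bust_cache(url), accept("text/html")). Using digits only keeps the token URL-safe, whereas rnorm(1) can produce a leading minus sign.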

briatte commented on August 21, 2024

Excellent. Thanks again and enjoy your days.
