Comments (4)

mvolfik commented on August 16, 2024

Also, this header quite reliably causes blocking by Zillow (PerimeterX). This code:

import { gotScraping } from 'got-scraping';

console.log(
    await gotScraping({
        headers: { 'Content-Type': 'application/json', 'sec-ch-ua': '' },
        body: '{"searchQueryState":{"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}},"mapBounds":{"north":36.18115,"east":-86.666132,"south":35.807142,"west":-86.891423},"isListVisible":true,"mapZoom":12,"regionSelection":[{"regionId":72192,"regionType":7}],"pagination":{}},"wants":{"cat1":["listResults","mapResults"],"cat2":["total"]},"requestId":1,"isDebugRequest":false}',
        url: 'https://www.zillow.com/async-create-search-page-state',
        method: 'PUT',
    }),
);

works quite reliably, but if you remove the forced sec-ch-ua="" header, you almost always get blocked.

from fingerprint-suite.

mvolfik commented on August 16, 2024

Huh, the code above doesn't seem to work anymore either. Is PerimeterX learning our fingerprints? (For context: I pushed a fix for the ZIP Search scraper that used the workaround above and started a run of ~3000 ZIP codes. It produced results for a while, but now everything returns 403 again.)

This still works, though:

curl -i -X PUT "https://www.zillow.com/async-create-search-page-state" \
  --data '{"searchQueryState":{"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}},"mapBounds":{"north":36.18115,"east":-86.666132,"south":35.807142,"west":-86.891423},"isListVisible":true,"mapZoom":12,"regionSelection":[{"regionId":72192,"regionType":7}],"pagination":{}},"wants":{"cat1":["listResults","mapResults"],"cat2":["total"]},"requestId":1,"isDebugRequest":false}' \
  -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100102 Firefox/122.0" \
  -H "Accept: */*" \
  -H "Content-Type: application/json"

B4nan commented on August 16, 2024

Looks like the value is supposed to be random:

https://stackoverflow.com/questions/64413275/why-does-chrome-use-sec-ch-ua-not-abrandv-99

Maybe we should really just strip that from our data (or maybe just the generated fingerprints).
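For illustration, stripping could look roughly like the sketch below. It assumes the generated headers are available as a plain object; the helper name stripClientHints and the sample values are hypothetical and not part of fingerprint-suite or got-scraping.

```javascript
// Hypothetical helper: drop User-Agent Client Hint headers from a header object.
// Assumes `headers` is a plain object mapping header names to values.
function stripClientHints(headers) {
    return Object.fromEntries(
        Object.entries(headers).filter(
            // Header names are case-insensitive, so compare in lower case.
            ([name]) => !name.toLowerCase().startsWith('sec-ch-ua'),
        ),
    );
}

// Sample input, made up for this example.
const generated = {
    'User-Agent': 'ExampleAgent/1.0',
    'sec-ch-ua': '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'Content-Type': 'application/json',
};

console.log(stripClientHints(generated));
```

Whether removing these headers actually reduces blocking is untested here; the sketch only shows what the suggestion would amount to.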

barjin commented on August 16, 2024

It seems to me that this issue has forked into two different things:

  • The "weird" user agent client hint: This is expected behavior (see Martin's linked post, but also e.g. this draft here). The brand list in the hint is purposefully randomized so that servers do not depend too closely on its exact value. This originally stems from the GREASE principle in TLS.

    • I'm still standing on my hill - we shouldn't manipulate this, as that would differentiate us from the actual browsers.
  • Zillow blocking got-scraping: it blocked my Chrome (and curl-impersonate impersonating Chrome). It passed with Firefox (and curl-impersonate impersonating Firefox). It blocked got-scraping with a (probably) Chrome fingerprint, and it passed @mvolfik's curl request with a Firefox UA.
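
On the first point, a small sketch of how the "weird" entry can be recognized when inspecting a sec-ch-ua value. The parser is simplified (it assumes every entry has the form "brand";v="version" with no escaped quotes), and looksLikeGreaseBrand is a heuristic based on the placeholder brands Chromium has shipped so far, not something defined by a specification. Both function names are made up for this example.

```javascript
// Parse a sec-ch-ua value into [{ brand, version }] entries.
// Simplified: a full Structured Fields parser handles more cases.
function parseSecChUa(value) {
    const entries = [];
    const pattern = /"([^"]*)";\s*v="([^"]*)"/g;
    for (const match of value.matchAll(pattern)) {
        entries.push({ brand: match[1], version: match[2] });
    }
    return entries;
}

// Heuristic (an assumption): the placeholder brand is a variation of
// "Not A Brand" with changing punctuation, so compare the letters only.
function looksLikeGreaseBrand(brand) {
    const lettersOnly = brand.replace(/[^a-z]/gi, '').toLowerCase();
    return lettersOnly === 'notabrand';
}

const sample = '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"';
for (const entry of parseSecChUa(sample)) {
    console.log(entry.brand, entry.version, looksLikeGreaseBrand(entry.brand));
}
```

The point of the sketch is that the odd-looking entry is one deliberate placeholder among otherwise ordinary brands, not corrupted data.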

Guess what happens when I run got-scraping with this config:

await gotScraping({
    headers: { 'Content-Type': 'application/json' },
    body: '{"searchQueryState":{"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}},"mapBounds":{"north":36.18115,"east":-86.666132,"south":35.807142,"west":-86.891423},"isListVisible":true,"mapZoom":12,"regionSelection":[{"regionId":72192,"regionType":7}],"pagination":{}},"wants":{"cat1":["listResults","mapResults"],"cat2":["total"]},"requestId":1,"isDebugRequest":false}',
    url: 'https://www.zillow.com/async-create-search-page-state',
    method: 'PUT',
    headerGeneratorOptions: {
        browsers: ['firefox'],
        devices: ['mobile'], // originally not needed, but I had better probability of not getting blocked with this
    }
})

If I had to guess the root of this, my bet would be on a skewed prior distribution between Chrome and Firefox fingerprints in the PerimeterX database - they just have more Chrome examples, so they can be more specific with the checks. And there, we're still losing with the non-genuine TLS stack etc., see my message from the other thread below:

[image]

TLDR: I understand that defeating the antiblocking scripts can be frustrating, but looking into an automatically generated JSON file and trying to point out "weird"-looking data is not the way forward. Creating confirmation bias without testing things properly is not the way forward either.

Let's be systematic about this, base our decisions on actual specifications and data, experiment, take notes, and see what works best.
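
One possible shape for such an experiment, as a rough sketch: run each candidate configuration the same number of times and record how often it gets a 403. The function compareConfigs, the send callback, and the stub below are all hypothetical; in practice send could be a wrapper around gotScraping that resolves to the response status code.

```javascript
// Run each named configuration `attempts` times and tally the outcomes.
// `send` is a caller-supplied async function that performs one request for
// a configuration and resolves to an HTTP status code.
async function compareConfigs(configs, send, attempts = 20) {
    const report = {};
    for (const [name, config] of Object.entries(configs)) {
        let blocked = 0;
        for (let i = 0; i < attempts; i++) {
            const status = await send(config);
            if (status === 403) blocked++;
        }
        report[name] = { attempts, blocked, blockRate: blocked / attempts };
    }
    return report;
}

// Stand-in for a real request function, used only to show the report shape.
const stubSend = async (config) => (config.simulateBlock ? 403 : 200);

compareConfigs(
    { configA: { simulateBlock: true }, configB: { simulateBlock: false } },
    stubSend,
    5,
).then((report) => console.log(report));
```

Requests here run sequentially with no delay or proxy rotation, so a real run would need pacing and enough attempts per configuration for the block rates to be comparable.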
