Web Scraper API Quick Start Guide

Oxylabs’ Web Scraper API is designed to collect real-time data from websites at scale. It is a dependable way to gather information from complex targets and simplifies the crawling process. Web Scraper API is best suited to use cases such as website change monitoring, fraud protection, and travel fare monitoring.

In this guide, we’ll explain how Web Scraper API works and walk you through the process of getting started with this tool without hassle.

For a detailed explanation, see our blog post.

Authentication

Web Scraper API uses basic HTTP authentication, which requires a username and password:

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io/location"}'
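
For readers calling the API from code rather than cURL, here is a minimal Python sketch of what the `--user` option does: it base64-encodes `USERNAME:PASSWORD` into an `Authorization: Basic` header. The helper name `basic_auth_header` is hypothetical and the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Return the header that HTTP basic authentication sends with a request."""
    # Basic auth is the base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The returned dictionary can be merged into the headers of any HTTP client request. Basic authentication only encodes the credentials, so it should be sent over HTTPS, as in the example above.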

Integration methods

Push-Pull

Example of a single query request:

curl --user "USERNAME:PASSWORD" 'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io/location", "geo_location": "United States"}'

If you observe low success rates or retrieve empty content, try adding the "render": "html" parameter to your request. More information about the render parameter can be found here.
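
The same job submission can be sketched in Python with the standard library only. This is an illustration rather than an official client: the function name `build_job_request` is hypothetical, and the function only builds the request so that nothing is sent until you choose to send it. The endpoint and the `source`, `url`, `geo_location`, and `render` parameters are the ones shown above.

```python
import base64
import json
import urllib.request

PUSH_PULL_URL = "https://data.oxylabs.io/v1/queries"  # endpoint from the cURL example


def build_job_request(username, password, url, geo_location=None, render=None):
    """Build (but do not send) the POST request that submits a scraping job."""
    payload = {"source": "universal", "url": url}
    if geo_location is not None:
        payload["geo_location"] = geo_location
    if render is not None:
        payload["render"] = render  # e.g. "html" when content comes back empty
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        PUSH_PULL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To submit the job, pass the returned object to `urllib.request.urlopen` and decode the response body with `json.load`.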

Sample of the initial response output:

{
  "callback_url": null,
  "client_id": 1,
  "created_at": "2021-09-30 12:40:32",
  "domain": "io",
  "geo_location": "United States",
  "id": "6849322054852825089",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "parser_type": null,
  "render": null,
  "url": "https://ip.oxylabs.io/location",
  "query": "",
  "source": "universal",
  "start_page": 1,
  "status": "pending",
  "storage_type": null,
  "storage_url": null,
  "subdomain": "ip",
  "content_encoding": "utf-8",
  "updated_at": "2021-09-30 12:40:32",
  "user_agent_type": "desktop",
  "session_info": null,
  "statuses": [],
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089/results",
      "method": "GET"
    }
  ]
}

In order to check whether the job is "status": "done", we can use the link from ["_links"][0]["href"] which is http://data.oxylabs.io/v1/queries/6849322054852825089.
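
Instead of relying on the position of an entry in `_links`, the link can also be looked up by its `rel` value. The following is a small sketch that assumes the `rel` names `self` and `results` from the sample response; `link_for` is a hypothetical helper name.

```python
def link_for(job: dict, rel: str) -> str:
    """Return the href of the entry in job["_links"] whose "rel" matches."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link["href"]
    raise KeyError(f"no link with rel={rel!r}")
```

With the sample response above, `link_for(job, "self")` gives the status URL and `link_for(job, "results")` gives the results URL.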

Example of how to check a job status:

curl --user "USERNAME:PASSWORD" 'http://data.oxylabs.io/v1/queries/6849322054852825089'

The response will contain the same data as the initial response. If the job is "status": "done", we can retrieve the contents using the link from ["_links"][1]["href"] which is http://data.oxylabs.io/v1/queries/6849322054852825089/results.

Example of how to retrieve data:

curl --user "USERNAME:PASSWORD" 'http://data.oxylabs.io/v1/queries/6849322054852825089/results'
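
The check-then-retrieve steps above amount to a polling loop. Here is a minimal sketch, assuming you supply `fetch_status`, a function of your own that performs the status request and returns the decoded JSON. The name `wait_until_done` and the default `interval` and `max_attempts` values are illustrative assumptions, not documented limits.

```python
import time


def wait_until_done(fetch_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until the job reports "done", then return that response.

    fetch_status is any zero-argument callable that returns the job as a dict.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") == "done":
            return job
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"job not done after {max_attempts} status checks")
```

Once this returns, the results link in the returned job can be requested to download the content.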

Sample of the response data output:

{
    "results": [
      {
        "content": "{\"ip\":\"45.24.127.174\",\"providers\":{\"dbip\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"AT\\u0026T Services, Inc.\",\"city\":\"Birmingham\",\"zip_code\":\"\",\"time_zone\":\"\",\"meta\":\"\\u003ca href='https:\/\/db-ip.com'\\u003eIP Geolocation by DB-IP\\u003c\/a\\u003e\"},\"ip2location\":{\"country\":\"US\",\"asn\":\"\",\"org_name\":\"\",\"city\":\"Birmingham\",\"zip_code\":\"35201\",\"time_zone\":\"-06:00\",\"meta\":\"This site or product includes IP2Location LITE data available from \\u003ca href=\\\"https:\/\/lite.ip2location.com\\\"\\u003ehttps:\/\/lite.ip2location.com\\u003c\/a\\u003e.\"},\"ipinfo\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"AT\\u0026T Services, Inc.\",\"city\":\"\",\"zip_code\":\"\",\"time_zone\":\"\",\"meta\":\"\\u003cp\\u003eIP address data powered by \\u003ca href=\\\"https:\/\/ipinfo.io\\\" \\u003eIPinfo\\u003c\/a\\u003e\\u003c\/p\\u003e\"},\"maxmind\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"ATT-INTERNET4\",\"city\":\"Madison\",\"zip_code\":\"\",\"time_zone\":\"-06:00\",\"meta\":\"This product includes GeoLite2 Data created by MaxMind, available from https:\/\/www.maxmind.com.\"}}}\n", # Actual content from https://ip.oxylabs.io/location
        "created_at": "2023-12-18 10:01:59",
        "updated_at": "2023-12-18 10:01:59",
        "page": 1,
        "url": "https://ip.oxylabs.io/location",
        "job_id": "7142453937470222339",
        "status_code": 200
      }
    ]
}
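
In this sample, `content` is itself a JSON-encoded string, because the target URL returns JSON. For an HTML page it would hold markup instead. The sketch below shows how the nested value could be decoded for this particular target; `parse_first_content` is a hypothetical helper name.

```python
import json


def parse_first_content(response: dict) -> dict:
    """Decode the JSON string stored in the first result's "content" field."""
    first = response["results"][0]
    if first["status_code"] != 200:
        raise ValueError(f"target returned status {first['status_code']}")
    # "content" is a string, so it needs a second round of JSON decoding.
    return json.loads(first["content"])
```

Applied to the full sample above, the returned dictionary would contain the `ip` and `providers` keys.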

Realtime

With this method, you can send your request and receive data back on the same open HTTPS connection straight away.

Sample request:

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io/location", "geo_location": "United States"}'

If you observe low success rates or retrieve empty content, try adding the "render": "html" parameter to your request. More information about the render parameter can be found here.

Example response body that will be returned on the open connection:

{
    "results": [
      {
        "content": "{\"ip\":\"45.24.127.174\",\"providers\":{\"dbip\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"AT\\u0026T Services, Inc.\",\"city\":\"Birmingham\",\"zip_code\":\"\",\"time_zone\":\"\",\"meta\":\"\\u003ca href='https:\/\/db-ip.com'\\u003eIP Geolocation by DB-IP\\u003c\/a\\u003e\"},\"ip2location\":{\"country\":\"US\",\"asn\":\"\",\"org_name\":\"\",\"city\":\"Birmingham\",\"zip_code\":\"35201\",\"time_zone\":\"-06:00\",\"meta\":\"This site or product includes IP2Location LITE data available from \\u003ca href=\\\"https:\/\/lite.ip2location.com\\\"\\u003ehttps:\/\/lite.ip2location.com\\u003c\/a\\u003e.\"},\"ipinfo\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"AT\\u0026T Services, Inc.\",\"city\":\"\",\"zip_code\":\"\",\"time_zone\":\"\",\"meta\":\"\\u003cp\\u003eIP address data powered by \\u003ca href=\\\"https:\/\/ipinfo.io\\\" \\u003eIPinfo\\u003c\/a\\u003e\\u003c\/p\\u003e\"},\"maxmind\":{\"country\":\"US\",\"asn\":\"AS7018\",\"org_name\":\"ATT-INTERNET4\",\"city\":\"Madison\",\"zip_code\":\"\",\"time_zone\":\"-06:00\",\"meta\":\"This product includes GeoLite2 Data created by MaxMind, available from https:\/\/www.maxmind.com.\"}}}\n", # Actual content from https://ip.oxylabs.io/location
        "created_at": "2023-12-18 10:01:59",
        "updated_at": "2023-12-18 10:01:59",
        "page": 1,
        "url": "https://ip.oxylabs.io/location",
        "job_id": "7142453937470222339",
        "status_code": 200
      }
    ]
}
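
The Realtime exchange above can be sketched in Python as a single blocking call: the request is posted and the results are read from the same response. This is a sketch under stated assumptions, not an official client. The name `scrape_realtime`, the injectable `send` parameter, and the `timeout=180` default are all illustrative choices.

```python
import base64
import json
import urllib.request

REALTIME_URL = "https://realtime.oxylabs.io/v1/queries"  # endpoint from the cURL example


def scrape_realtime(username, password, payload, send=urllib.request.urlopen, timeout=180):
    """POST payload and return the decoded JSON body read from the same connection.

    send defaults to urllib.request.urlopen and can be swapped for testing;
    timeout is an assumed value, not a documented limit.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        REALTIME_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    # The call blocks until the scraped data arrives or the timeout expires.
    with send(request, timeout=timeout) as response:
        return json.load(response)
```

Unlike Push-Pull, no job ID or polling is involved: the return value already has the `results` list shown above.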

Proxy Endpoint

Instead of parameters such as domain and search query, Proxy Endpoint only takes fully formed URLs.

Proxy Endpoint code sample using cURL:

curl -k -x realtime.oxylabs.io:60000 -U USERNAME:PASSWORD -H "X-Oxylabs-Geo-Location: United States" "https://ip.oxylabs.io/location"

If you observe low success rates or retrieve empty content, try adding the "x-oxylabs-render: html" header to your request.
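
A rough Python counterpart of the Proxy Endpoint call is sketched below using only the standard library. It mirrors the cURL options: the proxy address and credentials (`-x`, `-U`), disabled certificate verification (`-k`), and the two request headers. The function name `build_proxy_request` is hypothetical, and treating the endpoint as a plain HTTP proxy is an assumption based on cURL's default for `-x`. Nothing is sent until the opener is used.

```python
import ssl
import urllib.parse
import urllib.request

PROXY_HOST = "realtime.oxylabs.io:60000"  # proxy address from the cURL example


def build_proxy_request(username, password, url, geo_location=None, render=None):
    """Return (opener, request) configured like the cURL example; nothing is sent."""
    # Percent-encode the credentials so special characters survive in the proxy URL.
    user = urllib.parse.quote(username, safe="")
    secret = urllib.parse.quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{PROXY_HOST}"

    # Equivalent of curl -k: skip certificate verification.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPSHandler(context=context),
    )
    headers = {}
    if geo_location is not None:
        headers["X-Oxylabs-Geo-Location"] = geo_location
    if render is not None:
        headers["X-Oxylabs-Render"] = render
    return opener, urllib.request.Request(url, headers=headers)
```

To run the request, call `opener.open(request)` and read the response body, which is the page content itself rather than a JSON wrapper.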

To find out more about Web Scraper API, see our blog post.
