Comments (9)

mxsnq commented on May 27, 2024

Hi!
Could you attach your sitemap please?
Maybe you didn't set the multiple checkbox in Element click. For this selector, multiplicity refers to data elements, not to buttons to click on.

from web-scraper-chrome-extension.

orizzontiholding commented on May 27, 2024

Hello @mxsnq
Here is the sitemap. What I need to understand is how to handle infinite scroll or "load more" buttons. In this case the site uses infinite scroll: the data is loaded when the user reaches the bottom of the page.

{
  "rootSelector": {
    "id": "_root",
    "uuid": "0"
  },
  "_id": "sole365-sitemap",
  "startUrls": [
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/frutta-e-verdura/c/10006",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/salumi-formaggi/c/10013",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/gastronomia-e-pasta-fresca/c/10007",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/latte-burro-uova-yogurt/c/10009",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/carne/c/10003",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/pesce/c/10011",
    "https://www.cosicomodo.it/sole365/pontecagnano/colazione-merenda-dolci/c/10004",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/prodotti-alimentari/c/10012",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/pane-e-pasticceria/c/10010",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/gelati-e-surgelati/c/10008",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/bevande-vini-liquori/c/10002",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tutto-per-il-bambino/c/10015",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/cura-della-persona/c/10005",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tutto-per-la-casa/c/10016",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tempo-libero/c/10014"
  ],
  "selectors": [
    {
      "id": "sole365-linkprodotto",
      "selector": ".product-image > a",
      "type": "SelectorLink",
      "multiple": true,
      "extractAttribute": "href",
      "parentSelectors": [
        "0"
      ],
      "uuid": "1"
    },
    {
      "parentSelectors": [
        "0"
      ],
      "type": "SelectorText",
      "multiple": true,
      "uuid": "2",
      "id": "sole365-productname",
      "selector": ".product-name > h3 > a",
      "textmanipulation": {
        "removeHtml": true
      }
    }
  ],
  "sitemapSpecificationVersion": 1
}

mxsnq commented on May 27, 2024

@orizzontiholding, you can use Element Scroll selector to manage infinite scrolling. By default it scrolls to the end of the page, but on this site it scrolls past the list of products and doesn't trigger loading, so it is better to scroll directly onto the spinner element. Also set a few seconds delay to let new product data load.

{
  "rootSelector": {
    "id": "_root",
    "uuid": "0"
  },
  "_id": "sole365-sitemap",
  "startUrls": [
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/frutta-e-verdura/c/10006",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/salumi-formaggi/c/10013",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/gastronomia-e-pasta-fresca/c/10007",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/latte-burro-uova-yogurt/c/10009",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/carne/c/10003",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/pesce/c/10011",
    "https://www.cosicomodo.it/sole365/pontecagnano/colazione-merenda-dolci/c/10004",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/prodotti-alimentari/c/10012",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/pane-e-pasticceria/c/10010",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/gelati-e-surgelati/c/10008",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/bevande-vini-liquori/c/10002",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tutto-per-il-bambino/c/10015",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/cura-della-persona/c/10005",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tutto-per-la-casa/c/10016",
    "https://www.cosicomodo.it/sole365/pontecagnano/reparti/tempo-libero/c/10014"
  ],
  "selectors": [
    {
      "id": "product_element",
      "selector": "li.product-list-item",
      "scrollElementSelector": ".spinner-container",
      "type": "SelectorElementScroll",
      "multiple": true,
      "parentSelectors": [
        "0"
      ],
      "delay": "3000",
      "uuid": "3"
    },
    {
      "parentSelectors": [
        "3"
      ],
      "type": "SelectorText",
      "uuid": "2",
      "id": "sole365-productname",
      "selector": ".product-name > h3 > a",
      "textmanipulation": {
        "removeHtml": true
      }
    },
    {
      "id": "sole365-linkprodotto",
      "selector": ".product-image > a",
      "type": "SelectorLink",
      "extractAttribute": "href",
      "parentSelectors": [
        "3"
      ],
      "uuid": "1"
    }
  ],
  "sitemapSpecificationVersion": 1
}

For visible pagination buttons you can use Element Click selector with similar semantics.

Also note that in this sitemap the Link selector only navigates to product pages, but does not extract anything from them, so you'll have to add child selectors to it for extraction. If you are interested in product links extraction only, use Element Attribute selector with "href" attribute instead of Link.
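
If only the product URLs are needed, the switch from Link to Element Attribute can be scripted before importing the sitemap. This is a minimal sketch, assuming the Element Attribute selector is stored under the type name `SelectorElementAttribute` and reads its attribute from the same `extractAttribute` field that the Link selector in the sitemap above uses. Check a sitemap exported from your own build to confirm both names. `links_to_attributes` is a hypothetical helper, not part of the extension.

```python
import copy
import json


def links_to_attributes(sitemap: dict, attribute: str = "href") -> dict:
    """Return a copy of the sitemap where Link selectors become Element Attribute selectors."""
    result = copy.deepcopy(sitemap)
    for selector in result.get("selectors", []):
        if selector.get("type") == "SelectorLink":
            # Assumption: this is the type name the extension uses for Element Attribute.
            selector["type"] = "SelectorElementAttribute"
            selector["extractAttribute"] = attribute
    return result


# Trimmed-down version of the sitemap from this thread.
sitemap = {
    "_id": "sole365-sitemap",
    "selectors": [
        {
            "id": "sole365-linkprodotto",
            "selector": ".product-image > a",
            "type": "SelectorLink",
            "extractAttribute": "href",
            "parentSelectors": ["3"],
            "uuid": "1",
        }
    ],
}

converted = links_to_attributes(sitemap)
print(json.dumps(converted, indent=2))
```

The original dictionary is left untouched, so both variants can be kept and compared in Data Preview.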

orizzontiholding commented on May 27, 2024

OK, now it is clearer. I've done some tests and noticed what you explained about the infinite scroll. For the links, I need all product links to be extracted, so I will use the method you suggested. One last thing: the documentation doesn't explain how to manage the scraping settings. What values would be good for this kind of website? I'm talking about the page load delay and the request delay.

Thank you for the help.

mxsnq commented on May 27, 2024

Page load delay - amount of time to wait for a page to load before applying selectors. It is useful when some data on a page is loaded dynamically. This setting applies to all requests, but you can define this delay per selector instead.

Requests interval - minimum amount of time between two consecutive requests. Use it to throttle your crawling speed and avoid hitting servers too hard. A good practice for polite crawling is to check the site's /robots.txt for a Crawl-delay directive.

For your site, I believe you can go with defaults, they are reasonable for most sites.
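
The robots.txt check mentioned above can be done by hand, or sketched with Python's standard `urllib.robotparser` module. The example below parses robots.txt text that you have already downloaded and converts any Crawl-delay into milliseconds. The function name `crawl_delay_ms` and the 2000 ms fallback are assumptions made for illustration, not values taken from the extension.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_ms(robots_txt: str, user_agent: str = "*", default_ms: int = 2000) -> int:
    """Return the Crawl-delay for user_agent in milliseconds, or default_ms if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # seconds, or None when no directive applies
    return default_ms if delay is None else delay * 1000


# Hypothetical robots.txt content, not fetched from the site in this thread.
example = "User-agent: *\nCrawl-delay: 5\nDisallow: /cart\n"
print(crawl_delay_ms(example))
```

If the site declares a delay, using at least that value as the requests interval keeps the crawl within what the site asks for.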

orizzontiholding commented on May 27, 2024

OK, it would be useful to add this information to the documentation.

I've created a fork of the repo to try updating the manifest to v3. I will also need to combine the extracted product data so that, for each product, all the information is merged into a single object.
Where do I need to look in the extension files?
Another issue I'm facing is that the scroll does not continue until all products are visible. I've modified the scroll selector as you suggested, but it doesn't seem to work as expected.

mxsnq commented on May 27, 2024

> I will also need to combine the extracted product data so that, for each product, all the information is merged into a single object.

This is what Element-like selectors (Element, Element click, Element scroll) are for. In your sitemap, adding more selectors as children of "product_element" will give you each product's data extracted into a separate object. There is also a flag in Element-like selectors to merge those extracted objects into a single object with a list of nested objects (unless some child selector is a Link).
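
To make the parent/child wiring concrete: in the sitemap posted earlier, the child selectors point at "product_element" through its `uuid` ("3") in `parentSelectors`. The sketch below attaches one more child the same way. The helper `add_child_selector`, the `.product-price` CSS selector, the `sole365-price` id, and the "next free integer" uuid scheme are all hypothetical and only illustrate the structure.

```python
def add_child_selector(sitemap: dict, parent_id: str, child: dict) -> dict:
    """Append child to the sitemap, linked to the selector whose id is parent_id."""
    parent = next(s for s in sitemap["selectors"] if s["id"] == parent_id)
    used = {int(s["uuid"]) for s in sitemap["selectors"]}
    used.add(int(sitemap["rootSelector"]["uuid"]))
    # Assumption: any unused integer string is acceptable as a uuid.
    new_child = dict(child, parentSelectors=[parent["uuid"]], uuid=str(max(used) + 1))
    sitemap["selectors"].append(new_child)
    return new_child


# Trimmed-down version of the sitemap from this thread.
sitemap = {
    "rootSelector": {"id": "_root", "uuid": "0"},
    "selectors": [
        {"id": "product_element", "type": "SelectorElementScroll",
         "selector": "li.product-list-item", "parentSelectors": ["0"], "uuid": "3"},
        {"id": "sole365-productname", "type": "SelectorText",
         "selector": ".product-name > h3 > a", "parentSelectors": ["3"], "uuid": "2"},
    ],
}

price = add_child_selector(
    sitemap,
    "product_element",
    {"id": "sole365-price", "type": "SelectorText", "selector": ".product-price"},
)
print(price)
```

Every selector whose `parentSelectors` contains "3" is evaluated inside each matched `li.product-list-item`, which is what keeps the fields of one product together.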

You can use Data Preview button on Element selector to perform extraction on currently opened web page and see which data will be extracted. When using data preview on Element scroll selector, you'll have to wait until the page finishes scrolling, but you can limit the number of scrolls with Pagination limit parameter in selector.

> Another issue I'm facing is that the scroll does not continue until all products are visible. I've modified the scroll selector as you suggested, but it doesn't seem to work as expected.

If you mean the page is taking too long to load before the scrolling starts, then I believe there is nothing you can do about it: the plugin waits for navigation to fully complete.

If the scrolling does not start at all and another page is opened, try increasing the delay in scroll selector.

If you mean scraping does not advance before it finishes scrolling the page, then this works as expected. Scraping is performed in a single browser window, so it is necessary to fully process one page before navigating to another. You have multiple start urls in your sitemap, so you'll have to wait until the scraper scrolls through each of them; try experimenting with a single start url to see if it works correctly. You may also limit the number of scrolls as I said above.
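
For the single-URL experiment suggested above, a test copy of the sitemap can be derived rather than edited by hand. This is only a sketch: `single_url_sitemap` and the `-test` suffix on `_id` are hypothetical choices, the suffix being there so the copy does not clash with the original sitemap on import.

```python
import copy


def single_url_sitemap(sitemap: dict, index: int = 0) -> dict:
    """Return a copy of the sitemap that keeps only one of its start URLs."""
    trimmed = copy.deepcopy(sitemap)
    trimmed["startUrls"] = [sitemap["startUrls"][index]]
    trimmed["_id"] = f'{sitemap["_id"]}-test'  # hypothetical naming convention
    return trimmed


# Trimmed-down version of the sitemap from this thread.
sitemap = {
    "_id": "sole365-sitemap",
    "startUrls": [
        "https://www.cosicomodo.it/sole365/pontecagnano/reparti/frutta-e-verdura/c/10006",
        "https://www.cosicomodo.it/sole365/pontecagnano/reparti/carne/c/10003",
    ],
    "selectors": [],
}

test_sitemap = single_url_sitemap(sitemap)
print(test_sitemap["_id"], test_sitemap["startUrls"])
```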

> Where do I need to look in the extension files?

For v3 migration?

mxsnq commented on May 27, 2024

Also, this extension is a fork of a fork of the old open-source version of the webscraper.io plugin. Even though the versions have diverged significantly, you can still check out their site for docs and tutorial videos about that plugin and try it on demo sites. This might help you better understand how to work with this extension.

orizzontiholding commented on May 27, 2024

> I will also need to combine the extracted product data so that, for each product, all the information is merged into a single object.

> This is what Element-like selectors (Element, Element click, Element scroll) are for. In your sitemap, adding more selectors as children of "product_element" will give you each product's data extracted into a separate object. There is also a flag in Element-like selectors to merge those extracted objects into a single object with a list of nested objects (unless some child selector is a Link).

> You can use Data Preview button on Element selector to perform extraction on currently opened web page and see which data will be extracted. When using data preview on Element scroll selector, you'll have to wait until the page finishes scrolling, but you can limit the number of scrolls with Pagination limit parameter in selector.

You mean that I can select the main element that holds the product details and then assign child selectors to it, right? I've inspected the DOM and saw that there are some data attributes on an HTML article element. If I want to extract them, I need to use the attribute selector, right?

> Another issue I'm facing is that the scroll does not continue until all products are visible. I've modified the scroll selector as you suggested, but it doesn't seem to work as expected.

> If you mean the page is taking too long to load before the scrolling starts, then I believe there is nothing you can do about it: the plugin waits for navigation to fully complete.

I mean that after the first scroll the products are loaded, but the scroll event is not fired again and the script navigates to the next page in the sitemap. As a result, the data extraction is incomplete.

> For v3 migration?

I've started modifying the manifest file to align it with v3. I will just need to check the APIs used in the extension's various scripts and replace whatever is deprecated in MV3.
