ispras / web-scraper-chrome-extension

Web data extraction tool implemented as chrome extension

License: GNU Lesser General Public License v3.0

JavaScript 79.82% CSS 6.82% HTML 13.36%
webscraping scraping scraping-tool javascript

web-scraper-chrome-extension's Introduction

Web Scraper

Web Scraper is a Chrome browser extension built for extracting data from web pages. With this extension you can create a plan (a sitemap) that describes how a web site should be traversed and what should be extracted. Web Scraper then navigates the site according to the sitemap and extracts the data. Scraped data can later be exported as CSV or JSON Lines.
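
As a rough illustration of the JSON Lines export format mentioned above, the sketch below serializes a list of records with one JSON document per line. The record fields and the `toJsonLines` helper are hypothetical and are not taken from the extension's code.

```javascript
// Hypothetical scraped records; the field names are illustrative only.
const records = [
  { title: "First item", price: "10.00" },
  { title: "Second item", price: "12.50" },
];

// JSON Lines: each record is serialized as one JSON document on its own line.
function toJsonLines(rows) {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

const jsonl = toJsonLines(records);
console.log(jsonl);
```

Each line can be parsed independently with `JSON.parse`, which is what makes the format convenient for large exports.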

Latest Version

Read about the installation process on the installation page.

Changelog

v0.3.6

  • Updated support for tables (improved vertical table support and added complex headers and data rows)
  • Added export and import of sitemaps from a file
  • Added Russian translations and i18n support, which makes it possible to add translations for any language
  • Added REST API CRUD storage for sitemaps
  • Moved to webpack bundler
  • Added id hints from predefined model
  • Added selectors for Constants and Documents
  • Refactored preview data and added search in scraped data
  • Refactored returned items model to JSON
  • Added saving in JSON lines

v0.3

  • Enabled pasting of multiple start URLs (by @jwillmer)
  • Added scraping of dynamic table columns (by @jwillmer)
  • Added style extraction type (by @jwillmer)
  • Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by @jwillmer)
  • Added image improvements to find images in div background (by @jwillmer)
  • Added support for vertical tables (by @jwillmer)
  • Added random delay function between requests (by @Euphorbium)
  • Start URL can now also be a local URL (by @3flex)
  • Added CSV export options (by @mohamnag)
  • Added Regex group for select (by @RuneHL)
  • JSON export/import of settings (by @haisi)
  • Added date and number pattern in URL (by @codoff)
  • Added pagination selector limit (by @codoff)
  • Improved CSV export (by @haisi)
  • Added click limit option (by @panna-ahmed)

v0.2

  • Added Element click selector
  • Added Element scroll down selector
  • Added Link popup selector
  • Improved table selector to work with any HTML markup
  • Added Image download
  • Added keyboard shortcuts when selecting elements
  • Added configurable delay before using selector
  • Added configurable delay between page visiting
  • Added multiple start URL configuration
  • Added form field validation
  • Fixed a lot of bugs

v0.1.3

  • Added Table selector
  • Added HTML selector
  • Added HTML attribute selector
  • Added data preview
  • Added ranged start URLs
  • Fixed a bug that prevented the selector tree from showing on some operating systems

Bugs

When submitting a bug please attach an exported sitemap if possible.

Development

Read the Development Instructions before you start.

License

LGPLv3

web-scraper-chrome-extension's People

Contributors

3flex, dependabot[bot], eldos-dl, euphorbium, goodromka, hat331, jackburridge, jwillmer, martinsbalodis, mohamnag, mxsnq, panna-ahmed, runehl, vlazarew, willhirsch, yatskov

web-scraper-chrome-extension's Issues

Selector is not working and selector panel not showing in the web page

I am using Chrome 102.0.5005.63. When I add a new selector and go to the page in the browser, I don't get the selector panel in the bottom left. I can't select anything on the page. I am comparing the behavior with the extension from the web store. Does the extension here work differently? I tried different websites, restarted Chrome and I don't see errors in the console (both in top and the extension's console context). I removed the browser_specific_settings setting from the manifest file.

I am not sure what the issue is.

[Question] How to build and build-zip HEAD

I tried to build with npm, but the build was not successful. I think it is due to deprecated node modules.

Can anybody list the node, node-gyp, npm, python and msvc versions required to build it?

Thanks.

Infinite scroll button not clicked, missing data from scraping

I'm trying to set up a click on a button on a website where infinite scroll is active, but I've noticed that the scraped data relates to only one item on the page and not to all the elements that share the same class.

How can I scrape the data correctly? Do I need to set up the selectors in a different way?

Pagination on non-ajax sites

Hi there!

I am struggling to set up pagination for a "next" link (one that changes the URL and does NOT use JavaScript to load new content dynamically).
E.g. have a look at https://www.ader-paris.fr/en/catalog/121109?offset=0.

On the mentioned page I want to cycle through all pages (1-7) using the "Next" link and extract the data shown below.

Is something like this possible with this extension? I found no way to do it...

It would be great if you could give me a hint on how this can be achieved!

Thank you for your help!

Best wishes,
koseduhemak
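
One possible approach, sketched below, is a link selector that lists itself as one of its own parents, so that each page reached through the "Next" link is scraped and searched for a further "Next" link. The field names follow the sitemaps quoted elsewhere on this page; the `SelectorLink` type name and both CSS selectors (`a.next`, `div.lot`) are assumptions that must be checked against the extension and the target page.

```javascript
// Sketch of a sitemap for link-based pagination (not a verified configuration).
const sitemap = {
  _id: "ader-catalog",
  startUrls: ["https://www.ader-paris.fr/en/catalog/121109?offset=0"],
  selectors: [
    {
      // Assumed type name and hypothetical CSS selector for the "Next" link.
      id: "next",
      type: "SelectorLink",
      selector: "a.next",
      multiple: false,
      // Listing "next" as its own parent makes the traversal recursive.
      parentSelectors: ["_root", "next"],
    },
    {
      // Hypothetical selector for the data to extract on every page.
      id: "lot",
      type: "SelectorHTML",
      selector: "div.lot",
      multiple: true,
      parentSelectors: ["_root", "next"],
    },
  ],
};

// One-line JSON in the same shape as the other sitemaps on this page.
console.log(JSON.stringify(sitemap));
```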

Validator id field

Saving a selector with an empty id field marks the field as invalid. If a name is then entered, the validator still rejects the save.

Does web-scraper-chrome-extension-v0.3.716.zip release work on Chrome Version 97.0.4692.71?

First of all, it is an awesome extension, thanks.

I tried it and it works in some ways.

What is the purpose of the Select, Element preview and Data preview buttons of a selector on the Selectors and selector properties pages?
Nothing happens when they are clicked.

Is the purpose of the Select button to select an element on the web page with a mouse click, or something else?

web-scraper-chrome-extension-v0.3.716.zip
Chrome Version 97.0.4692.71

Multiple startUrls apparently not working / stopping the startUrls pagination on a condition

Hi,

Good job with the plugin.

Chrome: Version 87.0.4280.141 (Official Build) (64-bit)
Ubuntu 20.04.1 LTS 64-bit

But I'm trying to use this:

Supported URL patterns:
1. Numeric with optional step and zero padding – [START_END:STEP] – [001_010:10]

my sitemap:

{"_id":"google","startUrls":["http://google.com.br?id=[001_010:10]"],"selectors":[{"id":"body","selector":"body","type":"SelectorHTML","parentSelectors":["_root"]}]}

and the pagination does not work.
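
To make the pattern concrete, here is a minimal sketch of one plausible reading of `[START_END:STEP]`: an inclusive range, a default step of 1, and zero padding taken from the width of START. The `expandStartUrl` function is hypothetical; the extension's actual expansion logic may differ.

```javascript
// Expands the first [START_END] or [START_END:STEP] range found in a URL.
// Assumptions: inclusive range, STEP defaults to 1, padding = width of START.
function expandStartUrl(url) {
  const match = url.match(/\[(\d+)_(\d+)(?::(\d+))?\]/);
  if (!match) return [url];
  const [token, startText, endText, stepText] = match;
  const start = parseInt(startText, 10);
  const end = parseInt(endText, 10);
  const step = stepText ? parseInt(stepText, 10) : 1;
  if (step < 1) throw new RangeError("STEP must be at least 1");
  const urls = [];
  for (let n = start; n <= end; n += step) {
    const padded = String(n).padStart(startText.length, "0");
    urls.push(url.replace(token, padded));
  }
  return urls;
}

console.log(expandStartUrl("http://google.com.br?id=[001_010:10]"));
console.log(expandStartUrl("http://google.com.br?id=[001_010]").length);
```

Under this reading, `[001_010:10]` yields a single URL, because the next value after 001 would be 011, which is past the end of the range. If ten pages are intended, `[001_010]` or `[001_010:1]` would be the pattern to try.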


I also tried with 3.6 and it does not work there either.

I would also like the loop to stop paginating when a condition is met, such as repeated elements or the HTML containing a given string.

Thank you.

Access data dynamically for relevant sites

Currently the manifest requires permission for all URLs.
Some users might prefer that the app request only the permissions it needs.
A solution might be to declare that permission as an optional permission and to ask the user to grant access for each new site added to the extension.

Example of making access to domains an optional permission that the running extension can request later, at runtime:
manifest.json

...
"optional_permissions": [ "http://*/", "https://*/"  ]
...

Example of asking for permission for a new site at runtime:

	// protocol, domain and port are assumed to be defined by the caller.
	chrome.permissions.request({
		origins: [protocol + "://" + domain + ":" + port + "/"]
	}, function (granted) {
		// The callback argument is true if the user granted the permissions.
		if (granted) {
			alert("Amazing things happened here");
		} else {
			alert("Without permission for the site, the app can't work");
		}
	});
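
The origin string in the request above has to be derived from each start URL. A minimal sketch of such a helper follows; `originPatternFor` is a hypothetical name, and dropping the port and using a `/*` path are choices made in this sketch, not something the extension is known to do.

```javascript
// Builds an origin pattern such as "https://example.com/*" from a page URL.
// Assumption: the pattern keys on scheme and host only, so the port is dropped.
function originPatternFor(pageUrl) {
  const { protocol, hostname } = new URL(pageUrl);
  if (protocol !== "http:" && protocol !== "https:") {
    throw new TypeError("Unsupported scheme: " + protocol);
  }
  return protocol + "//" + hostname + "/*";
}

console.log(originPatternFor("https://example.com:8443/catalog?page=2"));

// Only runs inside an extension, where the chrome.permissions API is available.
if (typeof chrome !== "undefined" && chrome.permissions) {
  chrome.permissions.request(
    { origins: [originPatternFor("https://example.com/catalog")] },
    (granted) => console.log(granted ? "granted" : "denied")
  );
}
```

Note that `chrome.permissions.request` generally has to be called from a user gesture, such as a button click in the extension's UI.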

Element click selector doc

Hi,
First, thank you for the extension. I like it very much.
I'm trying to do a simple pagination with the Element click selector. I have read the documentation.
It might help to add a simple sample sitemap to the Element click selector doc.
My sitemap is below:

{"_id":"bestbuy","startUrls":["https://www.bestbuy.com/site/car-stereos/android-auto-receivers/pcmcat1495052094624.c?cp=3&id=pcmcat1495052094624"],"selectors":[{"id":"pagination","selector":"#sku-list-1","clickElementSelector":"a.sku-list-page-next svg.svg-size-s","clickElementUniquenessType":"uniqueHTML","clickType":"clickMore","type":"SelectorElementClick","parentSelectors":["_root"],"delay":"2000"},{"id":"pagination products list","selector":"body","type":"SelectorHTML","multiple":true,"parentSelectors":["pagination"],"delay":"2000"},{"id":"product list","selector":"#main-results > ol","type":"SelectorHTML","parentSelectors":["_root"],"delay":"2000"}]}

Login websites

Does it work with websites that show a login prompt, such as a bhadoo index?

Working out of the box in latest Chrome 86.0

Not sure if this project is still being maintained. I just tried to install this extension in the latest Chrome version 86.0.x and there are so many bugs right away:

I get a series of "Uncaught SyntaxError: Cannot use import statement outside a module" errors. I am not sure whether Chrome APIs or settings have changed in the last three months (the last time this repo was updated).

Can anyone else confirm my observations?

How to debug and use breakpoints in JS code built by Webpack?

I am using Firefox and I am not finding an easy way to debug and put breakpoints. Webpack puts the code I want to put a breakpoint on in a long line of an eval statement. I am using yarn watch:dev.

How does one debug and place breakpoints in such code which is hard to read? I am familiar with debugging extensions before the JS code gets built.
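
A common cause of the single long `eval` line is webpack's eval-based `devtool` setting, which is the usual default in development mode. The fragment below is a sketch of an override that emits ordinary source maps instead; `devOverrides` is a hypothetical name, and how it is merged into this project's webpack configuration is not confirmed by the source.

```javascript
// Sketch of development overrides for a webpack configuration.
// Assumption: the project's config can be extended with these two options.
const devOverrides = {
  mode: "development",
  // Emit inline source maps rather than wrapping each module in eval(),
  // so the browser debugger can show and break on the original files.
  devtool: "inline-source-map",
};

console.log(devOverrides);
```

With source maps in place, the original files should appear in the browser's debugger sources panel, where breakpoints can be set as usual.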

Manifest file missing error on install

When I install the extension in Chrome as shown in the installation guide, Chrome reports that the manifest file is missing. Is this extension compatible or not?

How to run the unminified version?

How do I run the unminified version? How does one develop, make changes, and run this extension? I know how to use the release version.
