
schasins / helena


A Chrome extension for writing custom web scraping programs and web automation programs. Just demonstrate how to collect the first row of data, then let the extension write the program for collecting all rows.

License: BSD 2-Clause "Simplified" License

HTML 4.63% JavaScript 92.27% CSS 0.42% Python 2.28% Shell 0.15% Dockerfile 0.04% Closure Templates 0.20%
chrome-extension javascript programming-by-example synthesis web-scraping

helena's People

Contributors

emjun, hesscl, jlumbroso, miafrancesca, samkaufman, schasins

helena's Issues

scroll function block

Hi schasins,

First of all, congratulations on this really nice project 🎉

I'm just struggling with one thing: virtualized lists. I think the simplest way to circumvent this is:

  1. scrape what is visible
  2. scroll by the viewport height (or a configurable height)
  3. do it again

(It is probably better for the scroll block to be coded outside the main scrape block.)

Example: react-virtualized

So, for this workflow, I would need a scroll or scrollTo function block. What do you think?
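
The three steps above can be sketched as a small loop. This is only an illustration of the idea, not something Helena provides: `collectVirtualized`, `readVisible`, and `scrollOnce` are hypothetical names, and each row is assumed to carry a unique `key` so that rows seen in two overlapping viewports are kept once.

```javascript
// Hypothetical helper, not part of Helena: collect rows from a virtualized
// list by alternating "read what is rendered" and "scroll one viewport".
//   readVisible(): returns the currently rendered rows as [{ key, ... }]
//   scrollOnce():  scrolls one step and returns false when it cannot move
function collectVirtualized(readVisible, scrollOnce, maxSteps = 1000) {
  const seen = new Set();
  const rows = [];
  for (let step = 0; step < maxSteps; step++) {
    // Step 1: scrape what is visible, skipping rows already collected.
    for (const row of readVisible()) {
      if (!seen.has(row.key)) {
        seen.add(row.key);
        rows.push(row);
      }
    }
    // Step 2: scroll; stop once the end of the list has been reached.
    if (!scrollOnce()) break;
    // Step 3: the loop repeats the read on the newly rendered rows.
  }
  return rows;
}

// Simulated virtualized list: 10 rows, 3 rendered at a time.
const items = Array.from({ length: 10 }, (_, i) => ({ key: i, text: 'row ' + i }));
let top = 0;
const collected = collectVirtualized(
  () => items.slice(top, top + 3),
  () => {
    if (top + 3 >= items.length) return false;
    top += 3;
    return true;
  }
);
console.log(collected.length);
```

In a real page, `readVisible` would query the rendered DOM nodes and `scrollOnce` would advance the container's `scrollTop` by its `clientHeight`. Many virtualized lists render new rows asynchronously, so a real version would likely need to wait between scrolling and reading.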

Data Security Issue

When a script is saved to the server, its contents are publicly available to other users. If a script contains sensitive information, this could lead to a data breach.

Load URL from files?

Hi again

I have US state-wise URLs like http://example.com/ca, http://example.com/al

I have created my scraper and saved it. Is it possible to load the URLs from a text file instead of running them manually?

Thanks
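
As a rough illustration of the file-driven approach asked about here, the sketch below turns the contents of a text file (one URL per line) into a clean list of URLs. `parseUrlList` is a hypothetical helper, and the one-URL-per-line layout with `#` comment lines is an assumption; whether Helena can consume such a list is not answered in this thread.

```javascript
// Hypothetical helper: parse text-file contents (one URL per line) into
// a list of normalized absolute URLs.
function parseUrlList(text) {
  const urls = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    // Skip blank lines and lines treated as comments (an assumed convention).
    if (trimmed === '' || trimmed.startsWith('#')) continue;
    try {
      // The URL constructor throws on anything that is not an absolute URL.
      urls.push(new URL(trimmed).href);
    } catch (err) {
      // Ignore lines that are not valid URLs.
    }
  }
  return urls;
}

const fileContents = [
  '# state pages',
  'http://example.com/ca',
  '',
  'http://example.com/al',
  'not a url',
].join('\n');

console.log(parseUrlList(fileContents));
```

Each resulting URL could then be passed to the saved scraper in turn instead of being entered by hand.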

Default Helena server timing out?

The extension runs smoothly the vast majority of the time. Occasionally it stalls, and the output of the developer console suggests that the server kaofang.cs.berkeley.edu:8080 times out arbitrarily:

[Screenshot of the developer console, 2019-04-27]

From my understanding, kaofang is currently the place where a default instance of the Helena Server is running, for purposes of simplifying the installation of the Helena extension. I am guessing that perhaps the server is overwhelmed by traffic. If this is correct, I can think of two suggestions to improve this issue:

  1. Error reporting to the user, to explain the temporary problem. (I think otherwise, such interactions risk undermining the trust of the non-CS users of this system.)

  2. I wonder if it would make sense to transition the Helena server to a serverless type of instance, so that it would scale with demand.

How to remove existing saved scripts?

There are many scripts shipped with the extension. How do I remove all of them? Because of them I can't load a new one, as clicking to load a script is not working.

How to scrape link URLs?

It is easy to scrape text (using Alt + click) - but how can I capture a link's URL (href) instead of its text?
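
For reference, the value being asked for is what the standard DOM exposes as a link's `href`. The sketch below shows one way to read it in plain JavaScript; it does not describe how Helena works internally, and `linkAddress` is a hypothetical helper. It relies only on the standard `Element.closest`, `Element.getAttribute`, and `URL` APIs.

```javascript
// Hypothetical helper: given a clicked element, return the absolute URL of
// the link it belongs to, or null if the element is not inside a link.
function linkAddress(element, baseUrl) {
  // closest() checks the element itself, then walks up through its ancestors.
  const anchor = element.closest('a[href]');
  if (!anchor) return null;
  // getAttribute returns the raw attribute value, which may be relative,
  // so it is resolved against the page URL.
  return new URL(anchor.getAttribute('href'), baseUrl).href;
}

// Stand-ins for DOM nodes so the sketch can run outside a browser.
const fakeAnchor = {
  getAttribute: (name) => (name === 'href' ? '/ca?page=2' : null),
  closest() { return this; },
};
const fakeSpanInsideLink = { closest: () => fakeAnchor };
const fakePlainText = { closest: () => null };

console.log(linkAddress(fakeSpanInsideLink, 'http://example.com/list'));
console.log(linkAddress(fakePlainText, 'http://example.com/list'));
```

In a browser, `element` would be the node under the click and `baseUrl` would be `document.baseURI`.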

Scripts not created by me are accessible to me

I just installed Helena extension as per the instructions provided in the documentation.

When I tried to load saved scripts, I could see a long list of saved programs, and they were accessible to me.

I'm not sure whether this is intentional or whether it could pose a security concern.

Any way to automate the downloaded script?

I have downloaded the script I want. Now I have hundreds of input URLs that should produce the same kind of output. Is there any way I could automate this part instead of manually entering each new URL?

Add Firefox Support

First of all, this is a quite useful browser extension - thanks for making it!

Feature Request

As I frequently use Firefox, it would be good if this project could also be ported to Firefox. Do you think this could be feasible?

According to this article, porting a Chrome extension to Firefox is generally not very hard.

Feature to add: scraping "link address"

Please correct me if I'm wrong. I could not find a way to scrape the "link address" of hyperlinks (anchor texts, images, buttons, etc.). In many cases, it is necessary to scrape this information. Please consider adding this. Thanks.

Can I use it remotely?

Is it possible to use it on remote AWS machines? I recently used ROUSILLON and loved it. However, I want to automate the process on a remote machine. How can I do it?

Cannot seem to run from source

I tried running Helena from source today. Here's what I did:

$ git clone git@github.com:schasins/helena.git
$ git clone git@github.com:schasins/helena-lang.git
$ cd helena/src/scripts/lib
$ rmdir helena-library
$ ln -s ../../../../helena-lang helena-library
  • Add helena/src as an unpacked extension to Chromium in developer mode

  • Launch Helena

I observed two overarching issues:

  • Something is wrong with plumbing events to/from pages: no preview shows up when I hold down Ctrl+Alt, and nothing happens in the extension when I Ctrl+Alt+<click>

  • The "Stop Recording" button does not work, but the "Cancel Recording" button does work; console error from pressing "Stop Recording":

Uncaught TypeError: Cannot read property 'outerWidth' of undefined
    at setWindowSize (mainpanel_script.js:129)
    at HTMLDivElement._stopRecording (mainpanel_script.js:153)
    at HTMLDivElement.dispatch (jquery.js:3)
    at HTMLDivElement.r.handle (jquery.js:3)

Are there specific revisions or branches I should use if I want to develop with Helena?

Project Files

First of all, AWESOME project! Thank you.

I'm trying to clone the repository, but the clone does not complete. I also tried to download it, but that is taking forever. Maybe something is wrong on GitHub right now, I don't know, but the repository is more than 100 MB.

Is it the docs folder? Is it possible to break it down into two projects?

I'm going to need an options page to set a URL for a different backend. I will do it locally and, if I can clone the repository (it never finishes), make a PR later.

thanks!

How can I execute pagination?

I want to go through one page after the other. Right now I do the following (an example is Craigslist):

  • ALT+CLICK listing text
  • Click on NEXT>> link
  • GO TO Step 1

What happens is that it keeps moving between page 1 and page 2 in an infinite loop.
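
The intended behavior can be sketched as a loop that follows "next" links until there are none left, with a guard against revisiting a page (the page 1 / page 2 cycle described above). This is an illustration only, not Helena's implementation: `paginate` and `scrapePage` are hypothetical names, and `scrapePage` is assumed to return the rows on a page together with the URL behind its NEXT link, or `null` on the last page.

```javascript
// Hypothetical sketch: visit pages in order, stopping when there is no next
// link, when a page repeats, or when a safety limit is reached.
function paginate(startUrl, scrapePage, maxPages = 500) {
  const visited = new Set();
  const results = [];
  let url = startUrl;
  while (url !== null && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    // Assumed shape of the result: { rows: [...], nextUrl: string | null }
    const page = scrapePage(url);
    results.push(...page.rows);
    url = page.nextUrl;
  }
  return results;
}

// Simulated site where page 2 wrongly links back to page 1.
const fakeSite = {
  'page1': { rows: ['listing A', 'listing B'], nextUrl: 'page2' },
  'page2': { rows: ['listing C'], nextUrl: 'page1' },
};

console.log(paginate('page1', (url) => fakeSite[url]));
```

The `visited` set is what breaks the cycle: each page is scraped at most once, even if a "next" link points backwards.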

How to fetch multi column data?

Referring to the URL, how can I fetch the city names contained in a multi-column row?

I tried clicking all the entries in a row, but it fetched only one row, not all of them.

[Suggestion] Add Contribution Options

I cannot currently see a Contributing.md file or any guidelines, and would very much appreciate it if these were added! Another potentially good idea might be to document early-stage issues (even if nobody ends up working on them except the repository maintainers). Personally, I am very interested in contributing to this project.
