montferret / ferret
Declarative web scraping
Home Page: https://www.montferret.dev/
License: Apache License 2.0
We need a mechanism that allows us to log all issues that are not handled properly (in APIs that do not return errors).
We should pass a logger with the context.
Moreover, we need to be able to set a custom output for the logger: standard output, a file, or anything else that implements the Writer interface.
Add a function which navigates a given page to a new URL.
NAVIGATE(doc, url)
2010..2013
should produce the following result:
[ 2010, 2011, 2012, 2013 ]
Add the ability to interrupt an execution by handling context.Done().
Lots of servers block you if you make too many requests at a time.
So the solution is either to put a wait in or to run the script from many cloud servers.
I like the second option.
It would mean that each one runs in a command-and-control fashion, I think:
a central brain controls them all, telling each one the exact next step, while they all share the same cookies etc. for authenticated scraping.
Raising this as an idea...
Chrome is awesome and all, but for scraping tasks it's too heavy.
We need to investigate how we can create a custom build that strips the features not relevant to web scraping, and publish it as a Docker image.
Some links:
Add WAIT_CLASS function that would stop an execution until the given CSS class(es) appear in an element.
Signature:
WAIT_CLASS(document, selector, class...)
Where:
document - document object
selector - CSS selector to find an element as a class owner
class - an arbitrary number of CSS classes as multiple arguments (at least 1) to wait for
WAIT_CLASS(element, class...)
Where:
element - element object
class - an arbitrary number of CSS classes as multiple arguments (at least 1) to wait for
From time to time, CDP Client returns the following error:
cdp.DOM: GetDocument: rpcc: the connection is closing
This leads to empty results, since Ferret handles the error gracefully.
It would be great to be able to read a query from STDIN.
e.g.
echo "some query" | go run ./cmd/cli/main.go
or
cat someQueryFile.fql | go run ./cmd/cli/main.go
or
go run ./cmd/cli/main.go < someQueryFile.fql
sam@debian ~ $ go version
go version go1.7.4 linux/amd64
Do I need a higher Go version? The README says Go 1.6 should be sufficient.
Hello,
It could be interesting to integrate Scrapoxy (http://scrapoxy.io).
You can have your own distributed network of proxies.
A nice combined feature would be the ability to run a scenario through a specific proxy of the mesh (you select the exit node so that the whole scenario runs from the same outside IP).
Best regards,
Fabien.
Describe the bug
It appears that line 47 might be broken in pkg/runtime/core/param_test.go.
pf, err = core.ParamsFrom(ctx) is being run and the test assumes err is not nil. The results say otherwise :)
Failures:
* /Users/esell/go/src/github.com/esell/ferret/pkg/runtime/core/param_test.go
Line 47:
Expected '<nil>' to NOT be nil (but it was)!
To Reproduce
Steps to reproduce the behavior:
pkg/runtime/core
go test
Expected behavior
to pass
Currently, ferret opens pages in regular mode, which makes things more difficult since it uses all of the user's session cookies.
We need to change this and open pages in incognito mode.
I'm new to this go stuff. I tried installing this on Ubuntu 18. First installing go, and then trying to make ferret... Would it be possible to post a completely newbie guide with all the command line steps? Would be much appreciated, thanks!
Currently, we run unit tests covering functionality that does not require a running browser.
As a result, the most complex and valuable functionality - work with dynamic pages - is not covered by unit tests.
We need to build an infrastructure that would allow us to test dynamic pages with predictable results.
Here is a draft of how it can be implemented:
One of two last parts of the language that are left to
https://www.arangodb.com/docs/stable/aql/advanced-array-operators.html
Add CLICK(el) function that would emit a click event for the passed element.
CDP only.
I need to grab an HTML page encoded in gb2312, such as http://tour.sanya.gov.cn/News.asp. How can I convert the page body from gb2312 to UTF-8 in the code below?
LET doc = DOCUMENT("http://tour.sanya.gov.cn/News.asp")
If I don't do this, I can't convert the extracted JSON data from gb2312 to UTF-8. Any suggestions for this issue?
Add DOWNLOAD(content) -> binary function which would download binary data.
Add a function which converts an open page into PDF and returns it as binary data.
The method is supposed to be handled by the dynamic web driver.
There are two signatures:
PDF(document) -> binary
and
PDF(url) -> binary
and it should use the cdp client API.
Added WAIT_ELEMENT(doc, selector, timeout = 1000) function.
The function stops execution until it finds elements by the selector, with a default timeout of 1000 ms.
When I build the project using CircleCI, I get this error from the go vet command:
make build
dep ensure
go vet ./cmd/... ./pkg/...
github.com/{{ORG_NAME}}/{{REPO_NAME}}/pkg/runtime/values_test [github.com/{{ORG_NAME}}/{{REPO_NAME}}/pkg/runtime/values.test]
pkg/runtime/values/array_test.go:266:10: invalid argument s (type *values.Array) for len
pkg/runtime/values/array_test.go:267:8: invalid operation: s[0] (type *values.Array does not support indexing)
pkg/runtime/values/array_test.go:271:10: invalid argument s2 (type *values.Array) for len
Makefile:43: recipe for target 'vet' failed
make: *** [vet] Error 2
Exited with code 2
Any idea why?
Is your feature request related to a problem? Please describe.
The current UA generation algorithm gives us UA strings of very old browsers, which leads to unpredictable results such as wrong CSS selectors or the inability to render a page at all.
Describe the solution you'd like
The solution we need is a list (ideally a smart generator) of modern browsers which has a wide variety of:
Libraries to consider:
If the CLI is started with the --cdp-launch flag, it should check whether Chrome is running and, if not, open a new instance with the --remote-debugging-port flag.
It would improve the usability of the REPL if we added autocomplete for registered functions.
This code returns the private struct values.None. Maybe it would be better to make it public?
ferret/pkg/stdlib/arrays/append.go
Lines 16 to 21 in 7f00078
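The function must check whether the string text matches the pattern search, using wildcard matching: _ matches a single arbitrary character and % matches zero, one, or many arbitrary characters. An optional third argument enables case-insensitive matching.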
LIKE("cart", "ca_t") // true
LIKE("carrot", "ca_t") // false
LIKE("carrot", "ca%t") // true
LIKE("foo bar baz", "bar") // false
LIKE("foo bar baz", "%bar%") // true
LIKE("bar", "%bar%") // true
LIKE("FoO bAr BaZ", "fOo%bAz") // false
LIKE("FoO bAr BaZ", "fOo%bAz", true) // true
Add a function which takes a screenshot of the open page and returns it as binary data.
Add a function which waits for a completion of a navigation of the given page.
WAIT_NAVIGATION(doc, timeout)
FOR u IN users
FILTER u.id == @id && u.name == @name
RETURN u
Add the ability to use a proxy.
Add a function which waits for a certain event from a given document or an element.
WAIT_EVENT(doc, selector, eventName, timeout)
and
WAIT_ELEMENT_EVENT(docOrEl, eventName, timeout)
Currently, we generate a random user agent for each document.
The problem is that the underlying library sometimes gives us the user agent string of an old browser, which might not be supported by the scraped website.
That leads to unpredictable results like this:
We need to narrow down the list of user agents to those that represent modern browsers.
Maybe we could switch to this library: https://github.com/avct/uasurfer
Update:
Probably, it makes sense to extend the functionality in the following manner:
"*" as a UA string - enables UA string generation
Looks good?
Add PDF(doc) -> binary function that would make a PDF of the passed document.
u.age > 15 || u.active == true ? u.userId : null
There is also a shortcut variant of the ternary operator with just two operands. This variant can be used when the expression for the boolean condition and the return value should be the same:
u.value ? : 'value is null, 0 or not present'
Describe the bug
Returns the given error when the element actually exists.
To Reproduce
Appears randomly. You can run the input.fql script from the examples multiple times and notice that it sometimes returns an empty array.
Expected behavior
Should always find an element.
Additional context
That might be a problem either in the cdp package or in Chrome itself.
Add the ability to fill out forms using an INPUT(el, value) / INPUT(doc, selector, value) function.
[ 1, 2, 3 ] ALL IN [ 2, 3, 4 ] // false
[ 1, 2, 3 ] ALL IN [ 1, 2, 3 ] // true
[ 1, 2, 3 ] NONE IN [ 3 ] // false
[ 1, 2, 3 ] NONE IN [ 23, 42 ] // true
[ 1, 2, 3 ] ANY IN [ 4, 5, 6 ] // false
[ 1, 2, 3 ] ANY IN [ 1, 42 ] // true
[ 1, 2, 3 ] ANY == 2 // true
[ 1, 2, 3 ] ANY == 4 // false
[ 1, 2, 3 ] ANY > 0 // true
[ 1, 2, 3 ] ANY <= 1 // true
[ 1, 2, 3 ] NONE < 99 // false
[ 1, 2, 3 ] NONE > 10 // true
[ 1, 2, 3 ] ALL > 2 // false
[ 1, 2, 3 ] ALL > 0 // true
[ 1, 2, 3 ] ALL >= 3 // false
["foo", "bar"] ALL != "moo" // true
["foo", "bar"] NONE == "bar" // false
["foo", "bar"] ANY == "foo" // true
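The semantics in the examples above can be modelled with a quantifier applied to a per-element predicate. The sketch below needs Go 1.18+ for generics. The names Quantifier, quantified, and inSet are hypothetical. Treating ALL and NONE as true for an empty left-hand array is an assumption that the examples do not cover.

```go
package main

import "fmt"

// Quantifier selects how many left-hand elements must satisfy the comparison.
type Quantifier int

const (
	All Quantifier = iota
	Any
	None
)

// quantified applies pred to every element of left and checks the match
// count against q. For an empty left array, All and None are both true.
func quantified[T any](left []T, q Quantifier, pred func(T) bool) bool {
	matched := 0
	for _, v := range left {
		if pred(v) {
			matched++
		}
	}
	switch q {
	case All:
		return matched == len(left)
	case Any:
		return matched > 0
	default: // None
		return matched == 0
	}
}

// inSet builds the predicate for the IN comparison.
func inSet[T comparable](set []T) func(T) bool {
	return func(v T) bool {
		for _, s := range set {
			if s == v {
				return true
			}
		}
		return false
	}
}

func main() {
	nums := []int{1, 2, 3}
	fmt.Println(quantified(nums, All, inSet([]int{2, 3, 4})))               // false
	fmt.Println(quantified(nums, Any, inSet([]int{1, 42})))                 // true
	fmt.Println(quantified(nums, None, func(n int) bool { return n > 10 })) // true
}
```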
The TR below is inside a table, and I can't find out how to address each one. I need to use ng-repeat, as that's the only thing that's unique to them. class="ng-scope" is useless because it's everywhere on the page.
So how can I address a unique element by an attribute like "ng-repeat"?
<tr ng-repeat="service in services | filter: specialismFilter " class="ng-scope">
<td class="ng-binding">Renewable Technology</td>
<td><input type="checkbox" ng-model="service.checked" ng-change="specialistsChecked(service)" ng-disabled="specialistsSelected == checkedLimit && !service.checked" class="ng-pristine ng-untouched ng-valid ng-empty"></td>
</tr>
22 IN [ 22, 7 ]
23 NOT IN [ 22, 7 ]
The problem is that I cannot install ANTLR4 properly in order to generate the parser.
COLLECT variableName = expression options
COLLECT variableName = expression INTO groupsVariable options
COLLECT variableName = expression INTO groupsVariable = projectionExpression options
COLLECT variableName = expression INTO groupsVariable KEEP keepVariable options
COLLECT variableName = expression WITH COUNT INTO countVariable options
COLLECT variableName = expression AGGREGATE variableName = aggregateExpression options
COLLECT AGGREGATE variableName = aggregateExpression options
COLLECT WITH COUNT INTO countVariable options
We need more unit tests in the runtime package.
I'm having an issue with gathering links and then going through those links. I.e. say trying to get a list of articles, and then going to the actual article and getting title, content, etc. I can get the list of links, and I can scrape a specific page, but adding them together keeps timing out. Any ideas on what I'm doing wrong?
%
LET doc = DOCUMENT('https://www.theverge.com/tech', true)
WAIT_ELEMENT(doc, '.c-compact-river__entry', 5000)
LET articles = ELEMENTS(doc, '.c-entry-box--compact__image-wrapper')
LET links = (
FOR article IN articles
RETURN article.attributes.href
)
FOR link IN links
NAVIGATE(doc, link)
LET doc = DOCUMENT(link, true)
WAIT_ELEMENT(doc, '.c-entry-content', 5000)
LET texter = ELEMENT(doc, '.c-entry-content')
RETURN texter.innerText
%