imax153 / scalpel-ts

A port of Haskell's scalpel web scraper to TypeScript.
License: MIT License
We currently have no tests at all, which is a significant problem. At a minimum, we need to build a comprehensive test suite that unit tests every module.
While I was (relatively) good about documenting things as I wrote the library, it would be helpful to go back, add documentation to the remaining modules, and fill in descriptions of the types so that docs-ts (when added) can generate more useful information.
Currently, the definitions of the constructors, destructors, combinators, and utilities for the `Scraper` monad stacks utilized in the library are co-located with the definitions of the instances that they back. The current monad stacks used are:

```typescript
type Scraper<A> = Reader<TagSpec, Option<A>>

type SerialScraper<A> = State<SpecZipper, Option<A>>
```
It would be cleaner to separate concerns by splitting the monad stacks into their own modules. This way, the `Scraper` and `Serial` modules could focus on their primary function instead of also defining all of the boilerplate for the monad stacks. The type definitions for these monad stacks would be as follows:

```typescript
type ReaderOption<R, A> = Reader<R, Option<A>>

type StateOption<S, A> = State<S, Option<A>>
```
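As a sketch of what such a split might look like, a standalone `ReaderOption` module could expose its own constructors and combinators. The following is a minimal, dependency-free illustration; the local `Option` encoding and the function names here are stand-ins, not the library's actual API:

```typescript
// Minimal stand-in Option encoding (illustrative only, not fp-ts's).
type Option<A> = { _tag: 'None' } | { _tag: 'Some'; value: A }
const none: Option<never> = { _tag: 'None' }
const some = <A>(value: A): Option<A> => ({ _tag: 'Some', value })

// The ReaderOption monad stack: a computation that reads from an
// environment R and may fail to produce an A.
type ReaderOption<R, A> = (r: R) => Option<A>

// Lift a pure value into the stack.
const of = <R, A>(a: A): ReaderOption<R, A> => () => some(a)

// Transform the result, propagating failure.
const map = <A, B>(f: (a: A) => B) => <R>(fa: ReaderOption<R, A>): ReaderOption<R, B> => (r) => {
  const oa = fa(r)
  return oa._tag === 'Some' ? some(f(oa.value)) : none
}

// Sequence two computations that share the same environment.
const chain = <R, A, B>(f: (a: A) => ReaderOption<R, B>) => (
  fa: ReaderOption<R, A>
): ReaderOption<R, B> => (r) => {
  const oa = fa(r)
  return oa._tag === 'Some' ? f(oa.value)(r) : none
}

// Usage: read an optional field from a config environment.
interface Config {
  readonly name?: string
}
const getName: ReaderOption<Config, string> = (c) => (c.name !== undefined ? some(c.name) : none)
const shout = map((s: string) => s.toUpperCase())(getName)

console.log(shout({ name: 'scalpel' })) // { _tag: 'Some', value: 'SCALPEL' }
console.log(shout({}))                  // { _tag: 'None' }
```

With the boilerplate living in its own module, the `Scraper` module would only need to specialize `R` to `TagSpec` and re-export the combinators it uses.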
A `Scraper` should be able to backtrack when failing on a selected node and continue searching through all nodes matched by a selector. Currently, this is not the case. For example, a scraper meant to isolate and return the only comment containing the word "cat" should simply skip over non-matching comments rather than fail outright. The current implementation of `chroots` does not allow this:
```typescript
export const chroots = (selector: Selector) => <A>(
  scraper: Scraper<A>
): Scraper<ReadonlyArray<A>> => flow(select(selector), RA.traverse(O.Applicative)(scraper))
```
However, the current behavior is to fail the entire scraper and return `None` if even a single node is filtered out of the scraped results.
This can be solved by modifying the behavior of the `chroots` function in the `Scraper` module. Currently, we execute the scraper action for every element in the list of selected nodes and accumulate the results into an `Option<ReadonlyArray<A>>`. The default behavior of `traverse` is to return `None` if the scraper returns `None` for any of the selected nodes.
Instead, the scraper should be executed on each element in the list of selected nodes to produce a `ReadonlyArray<Option<A>>`, which can then be `compact`ed into a `ReadonlyArray<A>`. This allows scrapers to backtrack and evaluate all selected nodes, retaining only the values that evaluate to `Some<A>`.
```typescript
export const chroots = (selector: Selector) => <A>(
  scraper: Scraper<A>
): Scraper<ReadonlyArray<A>> =>
  pipe(
    ask(),
    map((spec) => pipe(spec, select(selector), RA.map(scraper), RA.compact))
  )
```
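To make the difference concrete, here is a small dependency-free sketch contrasting the all-or-nothing `traverse` behavior with the map-then-`compact` behavior described above. The local `Option` encoding and helper names are illustrative stand-ins rather than fp-ts's actual exports:

```typescript
// Minimal stand-in Option encoding (illustrative only).
type Option<A> = { _tag: 'None' } | { _tag: 'Some'; value: A }
const none: Option<never> = { _tag: 'None' }
const some = <A>(value: A): Option<A> => ({ _tag: 'Some', value })

// traverse-like accumulation: the whole result fails if any element fails.
const traverseOption = <A, B>(f: (a: A) => Option<B>) => (
  as: ReadonlyArray<A>
): Option<ReadonlyArray<B>> => {
  const out: B[] = []
  for (const a of as) {
    const ob = f(a)
    if (ob._tag === 'None') return none
    out.push(ob.value)
  }
  return some(out)
}

// map + compact: failures are simply skipped, allowing "backtracking".
const mapCompact = <A, B>(f: (a: A) => Option<B>) => (as: ReadonlyArray<A>): ReadonlyArray<B> => {
  const out: B[] = []
  for (const a of as) {
    const ob = f(a)
    if (ob._tag === 'Some') out.push(ob.value)
  }
  return out
}

// A toy "scraper" that succeeds only on comments containing the word "cat".
const catComment = (comment: string): Option<string> =>
  comment.includes('cat') ? some(comment) : none

const comments = ['dogs are great', 'I love my cat', 'birds sing']

console.log(traverseOption(catComment)(comments)) // { _tag: 'None' } — one failure fails everything
console.log(mapCompact(catComment)(comments))     // [ 'I love my cat' ] — failures are skipped
```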
At the moment, the primary algorithms for generating the `TagSpec` and for selecting the DOM nodes targeted by a `Scraper` are highly vulnerable to stack overflow given a large enough HTML document. We need to convert many of the existing algorithms from their recursive implementations to either iterative or tail-recursive implementations.
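As an illustration of the kind of conversion needed, here is a dependency-free sketch of a naively recursive tree traversal next to a stack-safe iterative one. The `Node` shape is a hypothetical stand-in for the library's DOM/`TagSpec` structures:

```typescript
// Hypothetical node shape standing in for the library's DOM/TagSpec structures.
interface Node {
  readonly children: ReadonlyArray<Node>
}

// Naively recursive traversal: overflows the call stack on deeply nested documents.
const countRecursive = (node: Node): number =>
  1 + node.children.reduce((acc, child) => acc + countRecursive(child), 0)

// Iterative traversal with an explicit stack: safe for arbitrarily deep trees.
const countIterative = (root: Node): number => {
  const stack: Node[] = [root]
  let count = 0
  while (stack.length > 0) {
    const node = stack.pop()!
    count += 1
    for (const child of node.children) {
      stack.push(child)
    }
  }
  return count
}

// A pathologically deep "document": 100000 nested nodes.
let deep: Node = { children: [] }
for (let i = 0; i < 100000; i++) {
  deep = { children: [deep] }
}

// countRecursive(deep) would throw "RangeError: Maximum call stack size exceeded".
console.log(countIterative(deep)) // 100001
```

The same explicit-stack (or trampolining) approach applies to `TagSpec` generation and node selection.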
The addition of a `Filterable` instance to the `Scraper` module would allow for filtering scraped web content within the context of the `Scraper` monad. For example, if we wanted to parse all `<h1 />` tags and keep only those whose text starts with the word "hello", we could do something along the lines of:
```typescript
pipe(
  Scraper.text(Select.tag('h1')),
  Scraper.filter((text) => text.startsWith('hello'))
)
```
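As a sketch of how such a `filter` combinator could behave, here is a simplified, dependency-free version. The `Scraper` type below is a stand-in where failure is modeled as `null`, not the library's actual `Reader`/`Option` stack:

```typescript
// Simplified stand-in: a Scraper reads an input and may fail with null.
type Scraper<I, A> = (input: I) => A | null

// Keep the scraped value only when it satisfies the predicate; fail otherwise.
const filter = <A>(predicate: (a: A) => boolean) => <I>(scraper: Scraper<I, A>): Scraper<I, A> => (
  input
) => {
  const result = scraper(input)
  return result !== null && predicate(result) ? result : null
}

// Toy scraper: grab the first heading from a list of heading texts.
const firstHeading: Scraper<ReadonlyArray<string>, string> = (headings) =>
  headings.length > 0 ? headings[0] : null

const helloOnly = filter((text: string) => text.startsWith('hello'))(firstHeading)

console.log(helloOnly(['hello world', 'goodbye'])) // 'hello world'
console.log(helloOnly(['goodbye', 'hello world'])) // null — value filtered out
```

In the real library, the same shape would be expressed as mapping `O.filter(predicate)` over the `Reader`'s `Option` result.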
It would be helpful to add some helpers that query a webpage, parse it, and scrape its contents with a provided `Scraper`. The main Haskell scalpel library already provides this functionality (e.g. its `scrapeURL` helper).
It would also be useful to provide `Semigroup` and `Monoid` instances for `Scraper`, which could be derived along these lines:

```typescript
import * as O from 'fp-ts/Option'
import * as R from 'fp-ts/Reader'
import { Monoid } from 'fp-ts/Monoid'
import { Semigroup } from 'fp-ts/Semigroup'

const getSemigroup = <A>(S: Semigroup<A>): Semigroup<Scraper<A>> =>
  R.getSemigroup(O.getApplySemigroup(S))

const getMonoid = <A>(M: Monoid<A>): Monoid<Scraper<A>> => R.getMonoid(O.getMonoid(M))
```
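To illustrate what the `Semigroup` instance buys us, here is a dependency-free sketch of the combined behavior: two scrapers run against the same input and their results are concatenated, with the combination failing if either scraper fails. Failure is modeled as `null`, and the `concatScrapers` helper is illustrative only:

```typescript
// Stand-in: a Scraper reads an HTML string and may fail with null.
type Scraper<A> = (html: string) => A | null

// Pointwise combination, mirroring R.getSemigroup(O.getApplySemigroup(S))
// for the string-concatenation Semigroup: both scrapers must succeed.
const concatScrapers = (x: Scraper<string>, y: Scraper<string>): Scraper<string> => (html) => {
  const a = x(html)
  const b = y(html)
  return a !== null && b !== null ? a + b : null
}

// Toy scrapers extracting the first <h1> and first <p> contents.
const heading: Scraper<string> = (html) => html.match(/<h1>(.*?)<\/h1>/)?.[1] ?? null
const paragraph: Scraper<string> = (html) => html.match(/<p>(.*?)<\/p>/)?.[1] ?? null

const both = concatScrapers(heading, paragraph)

console.log(both('<h1>Hello, </h1><p>world</p>')) // 'Hello, world'
console.log(both('<p>world</p>'))                 // null — the heading scraper failed
```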