imax153 / scalpel-ts

A port of Haskell's scalpel web scraper to TypeScript.
License: MIT License
We currently have no tests at all, which is a significant problem. At a minimum, we need to build a comprehensive test suite that unit tests every module.
While I was (relatively) good about documenting things as I wrote the library, it would be helpful to go back, add documentation to the remaining modules, and fill in descriptions of the types so that docs-ts (when added) can generate more useful information.
Currently, the definitions of the constructors, destructors, combinators, and utilities for the `Scraper` monad stacks utilized in the library are co-located with the definitions of the instances that they back. The current monad stacks used are:

```typescript
type Scraper<A> = Reader<TagSpec, Option<A>>

type SerialScraper<A> = State<SpecZipper, Option<A>>
```
It would be cleaner to separate concerns by splitting the monad stacks into their own modules. This way, the `Scraper` and `Serial` modules could focus on their primary function instead of also defining all of the boilerplate for the monad stacks. The type definitions for these monad stacks would be as follows:

```typescript
type ReaderOption<R, A> = Reader<R, Option<A>>

type StateOption<S, A> = State<S, Option<A>>
```
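As a sketch of what such a split might look like, a standalone `ReaderOption` module could expose its own constructors and combinators. The following is a minimal, dependency-free illustration; the local `Option` encoding and the function names here are stand-ins, not the library's actual API:

```typescript
// Minimal stand-in Option encoding (illustrative only, not fp-ts's).
type Option<A> = { _tag: 'None' } | { _tag: 'Some'; value: A }
const none: Option<never> = { _tag: 'None' }
const some = <A>(value: A): Option<A> => ({ _tag: 'Some', value })

// The ReaderOption monad stack: a computation that reads from an
// environment R and may fail to produce an A.
type ReaderOption<R, A> = (r: R) => Option<A>

// Lift a pure value into the stack.
const of = <R, A>(a: A): ReaderOption<R, A> => () => some(a)

// Transform the result, propagating failure.
const map = <A, B>(f: (a: A) => B) => <R>(fa: ReaderOption<R, A>): ReaderOption<R, B> => (r) => {
  const oa = fa(r)
  return oa._tag === 'Some' ? some(f(oa.value)) : none
}

// Sequence two computations that share the same environment.
const chain = <R, A, B>(f: (a: A) => ReaderOption<R, B>) => (
  fa: ReaderOption<R, A>
): ReaderOption<R, B> => (r) => {
  const oa = fa(r)
  return oa._tag === 'Some' ? f(oa.value)(r) : none
}

// Usage: read an optional field from a config environment.
interface Config {
  readonly name?: string
}
const getName: ReaderOption<Config, string> = (c) => (c.name !== undefined ? some(c.name) : none)
const shout = map((s: string) => s.toUpperCase())(getName)

console.log(shout({ name: 'scalpel' })) // { _tag: 'Some', value: 'SCALPEL' }
console.log(shout({}))                  // { _tag: 'None' }
```

With the boilerplate living in its own module, the `Scraper` module would only need to specialize `R` to `TagSpec` and re-export the combinators it uses.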
A `Scraper` should be able to backtrack when failing on a selected node and continue searching through all nodes matched by a selector. Currently, this is not the case. For example, a scraper meant to isolate and return the only comment containing the word "cat" should simply skip over non-matching comments rather than fail outright. The current implementation of `chroots` does not allow this:
```typescript
export const chroots = (selector: Selector) => <A>(
  scraper: Scraper<A>
): Scraper<ReadonlyArray<A>> => flow(select(selector), RA.traverse(O.Applicative)(scraper))
```
However, the current behavior is to fail the entire scraper and return `None` if even a single node is filtered out of the scraped results.
This can be solved by modifying the behavior of the `chroots` function in the `Scraper` module. Currently, we execute the scraper action for every element in the list of selected nodes and accumulate the results into an `Option<ReadonlyArray<A>>`. The default behavior of `traverse` is to return `None` if the scraper returns `None` for any of the selected nodes.
Instead, the scraper should be executed on each element in the list of selected nodes to produce a `ReadonlyArray<Option<A>>`, which can then be `compact`ed into a `ReadonlyArray<A>`. This allows scrapers to backtrack and evaluate all selected nodes, retaining only the values that evaluate to `Some<A>`.
```typescript
export const chroots = (selector: Selector) => <A>(
  scraper: Scraper<A>
): Scraper<ReadonlyArray<A>> =>
  pipe(
    ask(),
    map((spec) => pipe(spec, select(selector), RA.map(scraper), RA.compact))
  )
```
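To make the difference concrete, here is a small dependency-free sketch contrasting the all-or-nothing `traverse` behavior with the map-then-`compact` behavior described above. The local `Option` encoding and helper names are illustrative stand-ins rather than fp-ts's actual exports:

```typescript
// Minimal stand-in Option encoding (illustrative only).
type Option<A> = { _tag: 'None' } | { _tag: 'Some'; value: A }
const none: Option<never> = { _tag: 'None' }
const some = <A>(value: A): Option<A> => ({ _tag: 'Some', value })

// traverse-like accumulation: the whole result fails if any element fails.
const traverseOption = <A, B>(f: (a: A) => Option<B>) => (
  as: ReadonlyArray<A>
): Option<ReadonlyArray<B>> => {
  const out: B[] = []
  for (const a of as) {
    const ob = f(a)
    if (ob._tag === 'None') return none
    out.push(ob.value)
  }
  return some(out)
}

// map + compact: failures are simply skipped, allowing "backtracking".
const mapCompact = <A, B>(f: (a: A) => Option<B>) => (as: ReadonlyArray<A>): ReadonlyArray<B> => {
  const out: B[] = []
  for (const a of as) {
    const ob = f(a)
    if (ob._tag === 'Some') out.push(ob.value)
  }
  return out
}

// A toy "scraper" that succeeds only on comments containing the word "cat".
const catComment = (comment: string): Option<string> =>
  comment.includes('cat') ? some(comment) : none

const comments = ['dogs are great', 'I love my cat', 'birds sing']

console.log(traverseOption(catComment)(comments)) // { _tag: 'None' } — one failure fails everything
console.log(mapCompact(catComment)(comments))     // [ 'I love my cat' ] — failures are skipped
```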
At the moment, the primary algorithms for generating the `TagSpec` and for selecting the DOM nodes targeted by a `Scraper` are highly vulnerable to stack overflow given a large enough HTML document. We need to convert many of the existing algorithms from their recursive implementations to either iterative or tail-recursive implementations.
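As an illustration of the kind of conversion needed, here is a dependency-free sketch of a naively recursive tree traversal next to a stack-safe iterative one. The `Node` shape is a hypothetical stand-in for the library's DOM/`TagSpec` structures:

```typescript
// Hypothetical node shape standing in for the library's DOM/TagSpec structures.
interface Node {
  readonly children: ReadonlyArray<Node>
}

// Naively recursive traversal: overflows the call stack on deeply nested documents.
const countRecursive = (node: Node): number =>
  1 + node.children.reduce((acc, child) => acc + countRecursive(child), 0)

// Iterative traversal with an explicit stack: safe for arbitrarily deep trees.
const countIterative = (root: Node): number => {
  const stack: Node[] = [root]
  let count = 0
  while (stack.length > 0) {
    const node = stack.pop()!
    count += 1
    for (const child of node.children) {
      stack.push(child)
    }
  }
  return count
}

// A pathologically deep "document": 100000 nested nodes.
let deep: Node = { children: [] }
for (let i = 0; i < 100000; i++) {
  deep = { children: [deep] }
}

// countRecursive(deep) would throw "RangeError: Maximum call stack size exceeded".
console.log(countIterative(deep)) // 100001
```

The same explicit-stack (or trampolining) approach applies to `TagSpec` generation and node selection.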
The addition of a `Filterable` instance to the `Scraper` module would allow for filtering scraped web content within the context of the `Scraper` monad. For example, if we wanted to parse all `<h1 />` tags and keep only those whose text starts with the word "hello", we could do something along the lines of:
```typescript
pipe(
  Scraper.text(Select.tag('h1')),
  Scraper.filter((text) => text.startsWith('hello'))
)
```
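As a sketch of how such a `filter` combinator could behave, here is a simplified, dependency-free version. The `Scraper` type below is a stand-in where failure is modeled as `null`, not the library's actual `Reader`/`Option` stack:

```typescript
// Simplified stand-in: a Scraper reads an input and may fail with null.
type Scraper<I, A> = (input: I) => A | null

// Keep the scraped value only when it satisfies the predicate; fail otherwise.
const filter = <A>(predicate: (a: A) => boolean) => <I>(scraper: Scraper<I, A>): Scraper<I, A> => (
  input
) => {
  const result = scraper(input)
  return result !== null && predicate(result) ? result : null
}

// Toy scraper: grab the first heading from a list of heading texts.
const firstHeading: Scraper<ReadonlyArray<string>, string> = (headings) =>
  headings.length > 0 ? headings[0] : null

const helloOnly = filter((text: string) => text.startsWith('hello'))(firstHeading)

console.log(helloOnly(['hello world', 'goodbye'])) // 'hello world'
console.log(helloOnly(['goodbye', 'hello world'])) // null — value filtered out
```

In the real library, the same shape would be expressed as mapping `O.filter(predicate)` over the `Reader`'s `Option` result.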
It would be helpful to add some helpers that query a webpage, parse it, and scrape its contents with a provided `Scraper`. The main Haskell scalpel library already provides this functionality (e.g. its `scrapeURL` helper).
It would also be useful to provide `Semigroup` and `Monoid` instances for `Scraper`, which could be derived along these lines:

```typescript
import * as O from 'fp-ts/Option'
import * as R from 'fp-ts/Reader'
import { Monoid } from 'fp-ts/Monoid'
import { Semigroup } from 'fp-ts/Semigroup'

const getSemigroup = <A>(S: Semigroup<A>): Semigroup<Scraper<A>> =>
  R.getSemigroup(O.getApplySemigroup(S))

const getMonoid = <A>(M: Monoid<A>): Monoid<Scraper<A>> => R.getMonoid(O.getMonoid(M))
```
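To illustrate what the `Semigroup` instance buys us, here is a dependency-free sketch of the combined behavior: two scrapers run against the same input and their results are concatenated, with the combination failing if either scraper fails. Failure is modeled as `null`, and the `concatScrapers` helper is illustrative only:

```typescript
// Stand-in: a Scraper reads an HTML string and may fail with null.
type Scraper<A> = (html: string) => A | null

// Pointwise combination, mirroring R.getSemigroup(O.getApplySemigroup(S))
// for the string-concatenation Semigroup: both scrapers must succeed.
const concatScrapers = (x: Scraper<string>, y: Scraper<string>): Scraper<string> => (html) => {
  const a = x(html)
  const b = y(html)
  return a !== null && b !== null ? a + b : null
}

// Toy scrapers extracting the first <h1> and first <p> contents.
const heading: Scraper<string> = (html) => html.match(/<h1>(.*?)<\/h1>/)?.[1] ?? null
const paragraph: Scraper<string> = (html) => html.match(/<p>(.*?)<\/p>/)?.[1] ?? null

const both = concatScrapers(heading, paragraph)

console.log(both('<h1>Hello, </h1><p>world</p>')) // 'Hello, world'
console.log(both('<p>world</p>'))                 // null — the heading scraper failed
```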