projectwallace / extract-css-core

Extract all CSS from a given url, both server side and client side rendered.

Home Page: https://www.projectwallace.com/get-css

License: MIT License

JavaScript 100.00%

scrape css extract wallace extract-css js-styling inline-styling

extract-css-core's Issues

browserOverride should accept a browser instance instead of a brittle config object

At the moment, browserOverride is handled by a tangle of if-statements over a combination of fields. It should instead accept a fully configured browser instance.

Example:

const puppeteer = require('puppeteer')
const extractCss = require('extract-css-core')
const browser = await puppeteer.launch(browserOptions)
const css = await extractCss(url, {browserOverride: browser})

We should build in some sanity checks that the browser has at least the correct interface in regard to the following methods:

newPage()
close()
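
The sanity check described above could look roughly like the following. This is only a minimal sketch: the helper name assertBrowserInterface and its error message are assumptions, not part of the package.

```javascript
// Hypothetical helper: verifies that a browserOverride value looks like a
// Puppeteer-style browser by checking for the two methods listed above.
function assertBrowserInterface(browser) {
  const requiredMethods = ['newPage', 'close']
  const missing = requiredMethods.filter(
    method => !browser || typeof browser[method] !== 'function'
  )
  if (missing.length > 0) {
    throw new TypeError(
      `browserOverride is missing required method(s): ${missing.join(', ')}`
    )
  }
  return browser
}
```

Checking only for the presence of functions keeps the check cheap and lets any compatible object pass, such as an instance created with puppeteer-core.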

Report where styles come from

Sometimes it's pretty interesting to know where styles may have come from. Some possible options:

<link rel="stylesheet"> in HTML
<link rel="stylesheet"> generated by JS
<style> in HTML
<style> generated by JS
<div style=""> in HTML
<div style=""> generated by JS
element.style.color = 'red' in JS
myStyle.insertRule('#blanc { color: white }', 0) in JS, via CSSStyleSheet.insertRule()
@import rules for most of the above mentioned cases
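
One way the "in HTML" versus "generated by JS" distinction might be drawn is to compare the stylesheet URLs found in the raw server response with those found in the rendered DOM. The sketch below assumes both lists have already been collected; the function name labelOrigins and the 'html' / 'js' labels are hypothetical.

```javascript
// Hypothetical helper: labels each rendered stylesheet URL as coming from the
// server HTML ('html') or as having been added later by JavaScript ('js').
function labelOrigins(serverHrefs, renderedHrefs) {
  const fromServer = new Set(serverHrefs)
  return renderedHrefs.map(href => ({
    href,
    origin: fromServer.has(href) ? 'html' : 'js'
  }))
}
```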

Prior art

get-css does this, but not for JS-generated CSS, I think

Minified CSS on the page is currently returned un-minified by this package. It should stay minified, because otherwise a CSS analyzer would treat rgb(0, 0, 0) differently from rgb(0,0,0).

Add option to include/exclude inline styles

const css = await extractCss('my-url', {
  inlineStyles: 'include' // or 'exclude'
})

The naming could be better, though.

Direct links to CSS files result in no CSS

Steps to reproduce:

await extract('https://cdn.jsdelivr.net/npm/tailwindcss/dist/tailwind.min.css')

Actual result

''

Expected result

Lots of CSS
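
A possible fix, sketched under the assumption that the extractor has access to the navigation response, is to check the Content-Type header and return the response body directly when the URL points at a stylesheet. The helper name isCssResponse is hypothetical.

```javascript
// Hypothetical helper: decides whether a response is a stylesheet, based on a
// headers object with lower-cased names (the shape returned by Puppeteer's
// response.headers()).
function isCssResponse(headers) {
  const contentType = (headers && headers['content-type']) || ''
  // Ignore parameters such as "; charset=utf-8"
  return contentType.split(';')[0].trim().toLowerCase() === 'text/css'
}

// Possible use inside the extractor (sketch, assumes a Puppeteer page):
//   const response = await page.goto(url)
//   if (isCssResponse(response.headers())) return response.text()
```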

Does not work for FC Bayern website

https://fcbayern.com/de

Reports entire CSS twice for inlined CSS in <head>

CSS is inlined in <head> with a <style> tag. extract-css-core reports the whole thing twice, but with different color formatting: the first time it's #3d515b (as authored), the second time it's rgb(61, 81, 91) (see Project Wallace commit). https://projectwallace.com/get-css also sees 1 <link> tag or @import and 1 <style> tag or CSS-in-JS.

Upgrade puppeteer to v3

Puppeteer v3+ only supports Node v10+, making this a breaking change.

Add option to ignore inline styles/css-in-js/regular css

I could imagine that in some cases it's not interesting to get the inline styles from a page.

const cssWithoutInlineStyles = await extractCss('test.url', {
  includeInlineStyles: false,
  includeJsStylesheetsApi: true,
  includeLinks: true,
  includeStyleTags: true
})
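
To illustrate how these flags might be applied, here is a minimal sketch. It assumes the extractor first collects items tagged with a type and then filters them; the item shape, the type names and the function filterByOrigin are hypothetical, and only the option names come from the proposal above.

```javascript
// Hypothetical defaults, using the option names from the proposal above
const defaultOptions = {
  includeInlineStyles: true,
  includeJsStylesheetsApi: true,
  includeLinks: true,
  includeStyleTags: true
}

// Assumed item shape: {type: 'inline' | 'js-api' | 'link' | 'style', css: string}
const optionForType = {
  inline: 'includeInlineStyles',
  'js-api': 'includeJsStylesheetsApi',
  link: 'includeLinks',
  style: 'includeStyleTags'
}

// Keeps only the items whose type is enabled and joins their CSS.
// Items with an unknown type are dropped.
function filterByOrigin(items, options = {}) {
  const settings = {...defaultOptions, ...options}
  return items
    .filter(item => settings[optionForType[item.type]])
    .map(item => item.css)
    .join('')
}
```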

Allow User-Agent string to be set

I want to override the user agent so that a website can see it's being crawled by my project's custom UA, not just by Chrome.

Example implementation from Puppeteer docs:

page.setUserAgent(userAgent)
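
Wired into the extractor, this could look something like the sketch below. The helper applyUserAgent and the userAgent option name are assumptions; page.setUserAgent() is the Puppeteer method shown above.

```javascript
// Hypothetical helper: applies an optional userAgent setting to a page and
// leaves the browser default untouched when no value is given.
async function applyUserAgent(page, {userAgent} = {}) {
  if (typeof userAgent === 'string' && userAgent.length > 0) {
    await page.setUserAgent(userAgent)
  }
  return page
}
```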

styled-components and friends are not extracted

I haven't found a way yet to extract CSS generated by styled-components (and probably others). Any help would be much appreciated.

Support web components

Add support for Web Components, both open and closed. Maybe it already works, but at least it should be covered by tests.
Example: https://css-tricks.com/encapsulating-style-and-structure-with-shadow-dom/
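
For open shadow roots, a traversal along the following lines might work when run inside the page. It is a rough sketch with a hypothetical function name; closed shadow roots cannot be reached this way, because element.shadowRoot is null for them.

```javascript
// Hypothetical helper: collects the text of <style> elements from a root and,
// recursively, from every open shadow root below it.
function collectStyleText(root) {
  const css = []
  for (const el of root.querySelectorAll('*')) {
    if (el.nodeName.toLowerCase() === 'style') css.push(el.textContent)
    // Open shadow roots expose their own querySelectorAll, so recurse into them
    if (el.shadowRoot) css.push(...collectStyleText(el.shadowRoot))
  }
  return css
}
```

In a browser this would be called as collectStyleText(document).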

Build for master branch fails

https://travis-ci.org/bartveneman/extract-css-core/builds/625868162

✔ it rejects on an invalid url (1.4s)
✔ it finds JS generated <style /> CSS (1.5s)
✖ it finds css-in-js, like Styled Components 
✔ it combines server generated <link> and <style> tags with client side created <link> and <style> tags (2.5s)
✔ it fetches css from a page with CSS in server generated <style> inside the <head> (2.6s)
✔ it finds JS generated <link /> CSS (2.6s)
✔ it rejects if the url has an HTTP error status (2.6s)
✔ it fetches css from a page with CSS in a server generated <link> inside the <head> (2.6s)

106:   t.is(actual, expected)

Difference:

  - 'html { color: rgb(255, 0, 0); }'
  + 'html { color: rgb(255, 0, 0); }.hJHBhT { color: blue; font-family: sans-serif; font-size: 3em; }'

Some potential optimisations

See https://www.zachleat.com/web/speedy-screenshots/

How to deal with inline styles?

It should be fairly easy to find inline styles, and it could certainly be very interesting to see their results, but how should they be reported? Inline styles don't have selectors, but merely declarations.

Scraping inline styles from a page:

[...document.querySelectorAll('[style]')].map(el => el.getAttribute('style')).join('')

We could generate a unique selector for each inline style attribute, but it could interfere with the resulting CSS statistics:

// nanoid v3+ exposes a named export; v2 exported the function directly
const {nanoid} = require('nanoid')
const inlineCss = [...document.querySelectorAll('[style]')].map(el => {
  // Create a custom [x-inline-style-*] selector that wraps all declarations
  return `[x-inline-style-${nanoid()}] { ${el.getAttribute('style')} }`
})

Add element breadcrumbs for style tags and inline styles

Report where in the DOM the <style> and <x style="..."> were found. This breadcrumb could be generated by starting at the target DOM node, traversing up the tree (while (node.parentNode)), and generating a selector for each node from its nodeName, className and ID.

[
  {
    href: undefined,
    breadcrumb: ['html', 'body', 'thing', 'p'],
    type: 'inline',
    css: '[x-inline-style] { color: red; }'
  },
  {
    href: undefined,
    breadcrumb: ['html', 'head', 'style'],
    type: 'style',
    css: 'p { }'
  }
]
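
The breadcrumb generation described above could be sketched as follows. The function name getBreadcrumb and the exact selector format (nodeName, then #id, then .class) are assumptions.

```javascript
// Hypothetical helper: builds a breadcrumb for a DOM node by walking up
// parentNode and describing each element by nodeName, id and className.
function getBreadcrumb(node) {
  const crumbs = []
  let current = node
  // Only element nodes (nodeType 1) get a crumb, so the walk stops at the document
  while (current && current.nodeType === 1) {
    let selector = current.nodeName.toLowerCase()
    if (current.id) selector += `#${current.id}`
    // className is not a plain string on SVG elements, so check its type first
    if (typeof current.className === 'string' && current.className.trim()) {
      selector += '.' + current.className.trim().split(/\s+/).join('.')
    }
    crumbs.unshift(selector)
    current = current.parentNode
  }
  return crumbs
}
```

Using unshift keeps the result ordered from the root down to the target node, matching the breadcrumb arrays in the example output.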

Be able to pass in a custom Chromium and Puppeteer instance

To use this module inside an AWS Lambda, I must be able to provide puppeteer-core and aws-lambda-chrome to make it run smoothly.