jacktuck / unfurl

Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:

License: MIT License

TypeScript 72.57% HTML 25.76% Procfile 0.03% JavaScript 1.64%
ogp open-graph oembed twitter-cards scraper metadata embed nodejs slack unfurl

unfurl's Introduction

Unfurl

A metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js (>=v8.0.0).

Note: Will not work in the Browser

The what

Unfurl (spread out from a furled state) will take a url and some options, fetch the url, extract the metadata we care about and format the result in a sane way. It supports all major metadata providers and expanding it to work for any others should be trivial.

The why

You know how, when you link to something on Slack, Facebook, or Twitter, they typically show a preview of the link? To do so, they crawl the linked website for metadata and enrich the link with more context about it, which usually means grabbing its title, description and image/player embed.

The how

npm install unfurl.js

unfurl(url [, opts])

url - string


opts - object of:

  • oembed?: boolean - support retrieving oembed metadata
  • timeout?: number - req/res timeout in ms, it resets on redirect. 0 to disable (OS limit applies)
  • follow?: number - maximum redirect count. 0 to not follow redirect
  • compress?: boolean - support gzip/deflate content encoding
  • size?: number - maximum response body size in bytes. 0 to disable
  • headers?: Headers | Record<string, string> | Iterable<readonly [string, string]> | Iterable<Iterable<string>> - map of request headers, overrides the defaults

Default headers:

{
  'Accept': 'text/html, application/xhtml+xml',
  'User-Agent': 'facebookexternalhit'
}

import { unfurl } from 'unfurl.js'
const result = unfurl('https://github.com/trending')

result is <Promise<Metadata>>

type Metadata = {
  title?: string
  description?: string
  keywords?: string[]
  favicon?: string
  author?: string
  theme_color?: string
  canonical_url?: string
  oEmbed?: OEmbedPhoto | OEmbedVideo | OEmbedLink | OEmbedRich
  twitter_card: {
    card: string
    site?: string
    creator?: string
    creator_id?: string
    title?: string
    description?: string
    players?: {
      url: string
      stream?: string
      height?: number
      width?: number
    }[]
    apps: {
      iphone: {
        id: string
        name: string
        url: string
      }
      ipad: {
        id: string
        name: string
        url: string
      }
      googleplay: {
        id: string
        name: string
        url: string
      }
    }
    images: {
      url: string
      alt: string
    }[]
  }
  open_graph: {
    title: string
    type: string
    images?: {
      url: string
      secure_url?: string
      type: string
      width: number
      height: number
      alt?: string
    }[]
    url?: string
    audio?: {
      url: string
      secure_url?: string
      type: string
    }[]
    description?: string
    determiner?: string
    site_name?: string
    locale: string
    locale_alt: string
    videos: {
      url: string
      stream?: string
      height?: number
      width?: number
      tags?: string[]
    }[]
    article: {
      published_time?: string
      modified_time?: string
      expiration_time?: string
      author?: string
      section?: string
      tags?: string[]
    }
  }
}

type OEmbedBase = {
  type: "photo" | "video" | "link" | "rich"
  version: string
  title?: string
  author_name?: string
  author_url?: string
  provider_name?: string
  provider_url?: string
  cache_age?: number
  thumbnails?: [
    {
      url?: string
      width?: number
      height?: number
    }
  ]
}

type OEmbedPhoto = OEmbedBase & {
  type: "photo"
  url: string
  width: number
  height: number
}

type OEmbedVideo = OEmbedBase & {
  type: "video"
  html: string
  width: number
  height: number
}

type OEmbedLink = OEmbedBase & {
  type: "link"
}

type OEmbedRich = OEmbedBase & {
  type: "rich"
  html: string
  width: number
  height: number
}
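
Since the same field can appear in several places of the result (top level, open_graph, twitter_card), callers often need a fallback order when building a link preview. The following is a minimal sketch, assuming a preference of Open Graph first, then Twitter Card, then the plain page values. The names pickPreview, PreviewSource and Preview are hypothetical helpers for illustration and are not part of unfurl.js.

```typescript
// Subset of the Metadata shape shown above; only the fields used here.
type PreviewSource = {
  title?: string
  description?: string
  open_graph?: {
    title?: string
    description?: string
    images?: { url: string }[]
  }
  twitter_card?: {
    title?: string
    description?: string
    images?: { url: string }[]
  }
}

type Preview = { title?: string; description?: string; image?: string }

// Hypothetical helper (not part of unfurl.js): prefer Open Graph, then
// Twitter Card, then the plain top-level title/description values.
function pickPreview(meta: PreviewSource): Preview {
  const og = meta.open_graph
  const tc = meta.twitter_card
  return {
    title: og?.title ?? tc?.title ?? meta.title,
    description: og?.description ?? tc?.description ?? meta.description,
    image: og?.images?.[0]?.url ?? tc?.images?.[0]?.url,
  }
}
```

The fallback order is a design choice, not something the library prescribes; swap the order if Twitter Card data is more reliable for the sites you handle.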

The who 💖

(If you use unfurl.js too feel free to add your project)

  • vapid/vapid - A template-driven content management system
  • beeman/micro-unfurl - small microservice that unfurls a URL and returns the OpenGraph meta data.
  • probot/unfurl - a GitHub App built with probot that unfurls links on Issues and Pull Request discussions

unfurl's People

Contributors

adam-ismael, alexghr, alexgleason, amyjchen, andreyvital, andyford, atjeff, dependabot[bot], fossabot, fuji44, glebedel, greenkeeper[bot], jacktuck, martip, r4ai, ravenscar, snyk-bot, stevenao, trieloff, weisisheng, werehamster

unfurl's Issues

Unfurling image URLs broken in 1.1.7+

Unfurling https://pbs.twimg.com/media/DHOfU6fVYAItcBY.jpg in 1.1.6 returns:

{ other: { _type: 'image' } }

In 1.1.7:

error unfurling URL: Error: content-type must be text/html
    at fetch.then.res (/home/bcolby/dev/unfurl-test/node_modules/unfurl.js/index.js:70:13)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)

This breaks my app of course. I wouldn't expect this from a minor version change. I've had to revert back to 1.1.6 for now.

Looks like version 3 changes this even more:

error unfurling URL: { Error
    at new UnexpectedError (/home/bcolby/dev/unfurl-test/node_modules/unfurl.js/src/unexpectedError.ts:21:18)
    at /home/bcolby/dev/unfurl-test/node_modules/unfurl.js/src/index.ts:59:11
    at Generator.next (<anonymous>)
    at fulfilled (/home/bcolby/dev/unfurl-test/node_modules/unfurl.js/dist/index.js:4:58)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
  name: 'WRONG_CONTENT_TYPE',
  info: { contentType: 'image/jpeg', contentLength: '169418' } }
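
A caller that wants to keep handling direct image links could inspect the rejected error instead of treating it as fatal. This is a minimal sketch that relies only on the name and info.contentType fields visible in the error output above; the function name classifyUnfurlError and its return labels are hypothetical, and the error shape may differ in other versions.

```typescript
type UnfurlFailure = 'image' | 'wrong-content-type' | 'unknown'

// Hypothetical helper: classify a rejection based on the error shape
// printed in the report above ({ name, info: { contentType } }).
function classifyUnfurlError(err: unknown): UnfurlFailure {
  if (typeof err !== 'object' || err === null) return 'unknown'
  const e = err as { name?: unknown; info?: { contentType?: unknown } }
  if (e.name !== 'WRONG_CONTENT_TYPE') return 'unknown'
  const contentType = e.info?.contentType
  if (typeof contentType === 'string' && contentType.indexOf('image/') === 0) {
    return 'image'
  }
  return 'wrong-content-type'
}
```

In a catch block, an 'image' result could be mapped to whatever the application previously did with { other: { _type: 'image' } }.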

Favicon

It would be nice if the returned metadata included links to the favicon. Do you think such a feature is in scope of this package?

Respect robots.txt

Most sites will also look at robots.txt before parsing meta tags, would be great if unfurl can take this into account too!

Issue scraping Amazon

Hello again!

unfurl.js returned

{
      "title": "Sorry! Something went wrong!",
      "favicon": "https://www.amazon.com/favicon.ico"
}

on one occasion when scraping https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1

Now it works better and returns correct metadata.

My question is: did all of that come from Amazon? I don't think that title is something that is coming from unfurl or any associated libraries... so if that's the case, nothing can be done except to retry, right?

The problem is that I cannot know which data to retry (automatically)...
is there a good way to detect such cases based on something else, perhaps possible HTTP code coming from the server ?

I'd like to check for that and not save any metadata like this example above to the database.

thank you

UPDATE:

Amazon returns 503 with HTML which contains above title "Sorry! ..."
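
Given that the error page arrives with a 503, one way to decide what to do with a result is to look at the HTTP status. The Metadata type in the README has no status field, so this sketch assumes the status was obtained separately by the caller (for example from its own request). The name actionForStatus and the three action labels are hypothetical.

```typescript
type Action = 'store' | 'retry' | 'discard'

// Hypothetical policy: store on 2xx, retry on throttling or server errors,
// discard everything else (for example 404).
function actionForStatus(status: number): Action {
  if (status >= 200 && status < 300) return 'store'
  if (status === 429 || status >= 500) return 'retry'
  return 'discard'
}
```

With this policy the 503 response described above would be retried instead of being saved to the database.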

OEmbed for SoundCloud isn't resolved

Trying with https://soundcloud.com/cheryl-lin-fielding/chanson-pour-jeanne?in=cheryl-lin-fielding/sets/website I'm getting following results:

{ title:
   'Chabrier: Chanson pour Jeanne - Efrain Solis, Baritone & Cheryl Lin Fielding, Piano by Cheryl Lin Fielding | Free Listening on SoundCloud',
  keywords:
   [ 'record',
     'sounds',
     'share',
     'sound',
     'audio',
     'tracks',
     'music',
     'soundcloud' ],
  open_graph:
   { site_name: 'SoundCloud',
     type: 'music.song',
     url:
      'https://soundcloud.com/cheryl-lin-fielding/chanson-pour-jeanne',
     title:
      'Chabrier: Chanson pour Jeanne - Efrain Solis, Baritone & Cheryl Lin Fielding, Piano',
     images: [ [Object] ],
     description:
      'Efrain Solis, Baritone\nCheryl Lin Fielding, Piano\nLive performance' },
  twitter_card:
   { site: 'SoundCloud',
     apps: { iphone: [Object], ipad: [Object], googleplay: [Object] },
     title:
      'Chabrier: Chanson pour Jeanne - Efrain Solis, Baritone & Cheryl Lin Fielding, Piano',
     description:
      'Efrain Solis, Baritone\nCheryl Lin Fielding, Piano\nLive performance',
     players: [ [Object] ],
     images: [ [Object] ],
     card: 'player' },
  description:
   'Stream Chabrier: Chanson pour Jeanne - Efrain Solis, Baritone & Cheryl Lin Fielding, Piano by Cheryl Lin Fielding from desktop or your mobile device',
  favicon:
   'https://a-v2.sndcdn.com/assets/images/sc-icons/favicon-2cadd14bdb.ico' }

oembed is missing, despite being provided by SoundCloud.

I have identified the root cause and will provide a PR.

og:image:alt not used

Just noticed that the og:image:alt tag isn't included (but it is included for twitter cards). I can look at adding this, but just wanted to check: is there a reason why this isn't included?

Exports no types for Typescript

When using unfurl like so, it leads to a complaint by typescript:

import unfurl = require("unfurl.js")
const unfurled = unfurl(last)
2349: Cannot invoke an expression whose type lacks a call signature. Type 'typeof import("/Users/dimroc/go/src/github.com/dimroc/storyboard/functions/node_modules/unfurl.js/dist/index")' has no compatible call signatures

This is probably because we don't export types:
https://github.com/jacktuck/unfurl/blob/master/dist/index.d.ts

export {};

Let me know if I'm missing an easy step.

Z_BUF_ERROR makes app crash

When trying to unfurl certain URLs, zlib sometimes throws a Z_BUF_ERROR, and this part of the code makes the app crash.

res.body.once('error', (err) => {
      log('error', err.message)

      process.nextTick(function () {
        throw err
      })
    })

I was using this URL https://www.reddit.com/r/holdmybeer/comments/9644ap/work_is_hard/

The result of the unfurling is fine but the app crashes. Should an error really be thrown in the case of that non fatal error?

Missing types in dist

Hey man, awesome package! It does a great job of getting me the data I need, however, it appears that whatever method of building / packaging you're using isn't including the typescript types in the dist!

image

An in-range update of coveralls is breaking the build 🚨

Version 2.13.2 of coveralls just got published.

Branch Build failing 🚨
Dependency coveralls
Current Version 2.13.1
Type devDependency

This version is covered by your current version range and after updating it in your project the build failed.

As coveralls is “only” a devDependency of this project it might not break production or downstream projects, but “only” your build or test tools – preventing new deploys or publishes.

I recommend you give this issue a high priority. I’m sure you can resolve this 💪

Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 3 commits.

  • 5ebe57f bump version
  • 428780c Expand allowed dependency versions to all API compatible versions (#172)
  • eb1b723 Update Mocha link (#169)

See the full diff

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

Incorrect metadata types

Hey 👋
Thanks for the great lib!
The Metadata type defines twitter_card and open_graph as arrays, but the received results are objects.

An in-range update of open-graph-scraper is breaking the build 🚨

Version 2.5.5 of open-graph-scraper just got published.

Branch Build failing 🚨
Dependency open-graph-scraper
Current Version 2.5.4
Type devDependency

This version is covered by your current version range and after updating it in your project the build failed.

As open-graph-scraper is “only” a devDependency of this project it might not break production or downstream projects, but “only” your build or test tools – preventing new deploys or publishes.

I recommend you give this issue a high priority. I’m sure you can resolve this 💪

Status Details
  • ✅ coverage/coveralls First build on greenkeeper/open-graph-scraper-2.5.5 at 100.0% Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 3 commits.

  • 54b1957 2.5.5
  • 7e7bd21 Merge pull request #47 from jshemas/snyk-fix-34fef66e
  • 7d57bf6 fix: package.json to reduce vulnerabilities

See the full diff

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

Error: write after end

I am getting a sporadic error:

Error: write after end
at writeAfterEnd (_stream_writable.js:236:12)
at PassThrough.Writable.write (_stream_writable.js:287:5)
at IncomingMessage.ondata (_stream_readable.js:639:20)
at emitOne (events.js:116:13)
at IncomingMessage.emit (events.js:211:7)
at addChunk (_stream_readable.js:263:12)
at readableAddChunk (_stream_readable.js:250:11)
at IncomingMessage.Readable.push (_stream_readable.js:208:10)
at HTTPParser.parserOnBody (_http_common.js:130:22)
at TLSSocket.socketOnData (_http_client.js:440:20)
at emitOne (events.js:116:13)
at TLSSocket.emit (events.js:211:7)
at addChunk (_stream_readable.js:263:12)
at readableAddChunk (_stream_readable.js:250:11)
at TLSSocket.Readable.push (_stream_readable.js:208:10)
at TLSWrap.onread (net.js:597:20)

It seems that if I introduce a delay of some kind, for example by pausing at a breakpoint, it works; without the breakpoint the error occurs.

The website I am getting this on is:

https://www.lexhelper.com

And the method in typescript is:

async function get_site_data(scanReport: ScanReport, url: string) {
  let result = await unfurl(url);
  log.info("Got unfurl data " + result);
  scanReport.unfurlData = JSON.stringify(result);
  await scanReport.save()
  return scanReport;
};

It dies on the very first line and I was using 1.18.beta2 but it still happens with beta3 as well.

Thanks!

Wrong parsing of <title>

Thank you very much for this great library, seems to work well, I'm currently testing it.

However:

unfurl.js will parse the title of https://hash.ai as "Facebook".

Reason:

This is the correct title and it appears first:

Screenshot 2021-05-25 at 21 55 53

but later on there is more stuff with <title> and what unfurl parses is the last one:

Screenshot 2021-05-25 at 21 55 41

This works correctly:

https://github.com/wzbg/read-title/blob/master/index.js

would it be possible to do it this way in your library?

Can it be done easily / soon ? If not, if you could please explain if you see this as an issue or not ... I'll think more about it and possibly fork + implement a fix myself... if there are any guidelines for me to adhere to for the fix to have a better chance of being accepted upstream, also let me know.. I'm not yet versed in typescript so the code might not be the best...

I hope it will be possible to do something in some way, so... thank you for the feedback / thoughts on this.
david
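
The "first <title> wins" behaviour requested above can be illustrated with a small sketch. It uses a regular expression on the raw HTML purely for brevity, which is an assumption for illustration; a real fix would more likely record the first title seen by the streaming parser and ignore later ones (for example those inside inline SVG). The function name firstTitle is hypothetical.

```typescript
// Hypothetical helper: return the text of the first <title> element only,
// ignoring any later <title> elements such as those inside inline <svg>.
function firstTitle(html: string): string | undefined {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)
  if (!match) return undefined
  // Collapse line breaks and repeated whitespace inside the title text.
  return match[1].replace(/\s+/g, ' ').trim()
}
```

The same "set once, never overwrite" rule is what matters in a parser-based implementation; the regex is only a compact way to show it.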

Implement Isomorphic fetch so it works for browsers

Hey, thank you for this wonderful package.

I know this is a Node JS package but can it be used from the frontend like React?

I'm currently making a Chrome extension where I need to get metadata of any site specified so I did this -

import unfurl from "unfurl";

_fetchMeta = async () => {
    try {
      let result = await unfurl("https://akshaykadam.me");
      console.log("result", result);
      alert(result);
    } catch (e) {
      console.error("e", e);
    }
};

But I get this error -

Failed to load https://akshaykadam.me/: No 'Access-Control-Allow-Origin' header is present on the 
requested resource. Origin 'http://localhost:3000' is therefore not allowed access. If an opaque 
response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS 
disabled.

e TypeError: Failed to fetch

In Chrome Extension, it can be removed by adding "permissions": ["<all_urls>"] so I did that but still doesn't work.

It should work as written here 👇
Source: Find Cross-Domain from Chrome Extensions in CORS tutorial

I also tried running it as a web app but got the same error. So my question is: is it possible to do this from the frontend?

Unhandled promise rejection

I think this is an issue with unfurl rather than with the micro-unfurl service. I need to double-check later, so I'm filing this so I don't forget.

image

Installation instructions

I was having a hard time finding the actual npm package as there isn't an installation section in the readme, and the npm badge doesn't actually link to the package. Which got me curious and digging through the readme. Why was the installation section removed? Do you need a PR to add it back?

🐛 Library doesn't work clientside

Thanks for the amazing work.

I noticed that the library doesn't work on any browser. It results in the error:

Refused to set unsafe header "user-agent"

Access to XMLHttpRequest at 'https://github.com/a-tokyo' from origin 'http://localhost:3000' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

Your schema is amazing and I'd love to use this; however, I will have to use https://github.com/itaditya/simple-unfurl for now as it works clientside as well.

The "url" argument must be of type string. Received type undefined

Getting:

The "url" argument must be of type string. Received type undefined

With: http://cityofcorona.maps.arcgis.com/apps/MapSeries/index.html?appid=6e3b7f35182a45d79d753bc4a6c543e0

Here's what is being parsed out as metadata:

[
  [ 'title', 'City of Corona - Project Updates' ],
  [
    'keywords',
    [
      'JavaScript',   'layout-tab',
      'Map',          'Mapping Site',
      'mapseries',    'Online Map',
      'Ready To Use', 'selfConfigured',
      'Story Map',    'Story Maps',
      'Web Map',      'Corona'
    ]
  ],
  [
    'description',
    'This story map is a comprehensive look at all development projects going on in the city of Corona, CA.'
  ],
  [ 'twitter:card', undefined ],
  [ 'twitter:creator', undefined ],
  [ 'twitter:url', undefined ],
  [ 'twitter:title', undefined ],
  [ 'twitter:description', undefined ],
  [ 'twitter:image', undefined ],
  [ 'og:title', 'City of Corona - Project Updates' ],
  [
    'og:description',
    'This story map is a comprehensive look at all development projects going on in the city of Corona, CA.'
  ],
  [
    'og:url',
    'https://www.arcgis.com/home/item.html?id=6e3b7f35182a45d79d753bc4a6c543e0'
  ],
  [
    'og:image',
    'https://www.arcgis.com/sharing/rest/content/items/6e3b7f35182a45d79d753bc4a6c543e0/info/thumbnail/thumbnail.png'
  ],
  [ 'favicon', 'https://cityofcorona.maps.arcgis.com/favicon.ico' ]
]

TypeError [ERR_UNESCAPED_CHARACTERS]: Request path contains unescaped characters

When I use a URL that contains double-byte characters (for example, Japanese), an error is returned as the title.

Reproduction of the problem

<!---OK--> 
http://localhost:3000/api/ogp?url=http://affiliweb.info/office/
<!---NG (URL with Double Byte Character)--> 
http://localhost:3000/api/ogp?url=http://affiliweb.info/日本語urlってどうよ？/
<!---NG (URL with encoded Double Byte Character)--> 
http://localhost:3000/api/ogp?url=http://affiliweb.info/%E6%97%A5%E6%9C%AC%E8%AA%9Eurl%E3%81%A3%E3%81%A6%E3%81%A9%E3%81%86%E3%82%88%EF%BC%9F/

Each address is currently accessible via a browser.
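
One possible way to avoid ERR_UNESCAPED_CHARACTERS is to normalise the URL before it reaches the HTTP client. This is a minimal sketch using the standard WHATWG URL class, which percent-encodes non-ASCII path characters; it assumes the raw URL string reaches the code intact, which the report above does not confirm (the already-encoded variant is also reported as failing, so the cause may lie elsewhere). The function name normalizeUrl is hypothetical.

```typescript
// Hypothetical helper: let the WHATWG URL parser percent-encode any
// non-ASCII characters in the path. Already-encoded input is left unchanged.
function normalizeUrl(raw: string): string {
  return new URL(raw).href
}
```

Applied to the second URL in the reproduction above, this yields the percent-encoded form shown in the third one.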

An in-range update of nyc is breaking the build 🚨

Version 11.3.0 of nyc was just published.

Branch Build failing 🚨
Dependency nyc
Current Version 11.2.1
Type devDependency

This version is covered by your current version range and after updating it in your project the build failed.

nyc is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 12 commits.

  • d85118c chore(release): 11.3.0
  • 7792733 chore: explicitly update istanbul dependencies (#698)
  • 222a3d0 chore: slight difference in pinning logic, based on @ljharb's advice
  • f04b7a9 feat: add option to hook vm.runInContext (#680)
  • cdfdff3 feat: add --exclude-after-remap option for users who pre-instrument their codebase (#697)
  • a413f6a chore: upgrade to yargs 10 (#694)
  • 10125aa docs: fix reporters link
  • f5089ca docs: added examples of coverage reports (#656)
  • af281e7 chore: update spawn-wrap to 1.4.0 (#692)
  • a685f7c docs: missing options prefix -- in front of check-coverage (#695)
  • f31d7a6 feat: allow instrument-only to produce sourcemaps (#674)
  • 425c0fd chore: ignore package-lock.json (#683)

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

[Bug]: SoundCloud oEmbed is not returning any html content

PR: #96

I was playing around with SoundCloud urls on unfurl and I noticed that they don't return any html content currently while they do for other attributes.

The issue seems to be that they are escaping the html content with a CDATA and that trips the logic that extracts the content.

I have raised a PR that will fix that by checking for that scenario and clearing that CDATA.

Hopefully that will solve the issue. Feel free to close this if you consider it irrelevant since we have a PR, but I just wanted to leave something written on an issue in case someone comes in the future asking why this was done :D
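
The CDATA handling described above can be sketched as follows. This is an illustration of the idea only, not the code from the PR; the function name stripCdata is hypothetical.

```typescript
// Hypothetical helper: if the whole value is wrapped in <![CDATA[ ... ]]>,
// return the inner content; otherwise return the value unchanged.
function stripCdata(value: string): string {
  const match = /^\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*$/.exec(value)
  return match ? match[1] : value
}
```

Values without a CDATA wrapper pass through untouched, so the helper is safe to apply to every extracted html field.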

Add support for the article type in the OGP specification

Hi all.

I'm looking to support the article type as written in the OGP specification.
The specification has the following description.

These are globally defined objects that just don't fit into a vertical but yet are broadly used and agreed upon.

I interpreted this as "not specified as an OGP, but widely used".
I thought that some people might not like to support it with unfurl, depending on their interpretation of this statement.

Do you guys have any doubts about being able to handle article types in unfurl?
If there is no doubt, I'd like to start working on it.

Possibility to use axios or enhance the current implementation to work on proxied company networks

We have not been able to make unfurl work on internal corporate networks behind a proxy. We regularly use axios with https://github.com/TooTallNate/node-https-proxy-agent as the http/https proxy agent.

Would you consider a potential transition to axios, or extending the package to make it compliant with both node-fetch and axios?
Or, alternatively, extending the current implementation to make it work behind a proxy?

Thanks for your thoughts!

meta theme-color and/or a way to access the scraped HTML?

Could the meta theme-color value be added? (MDN docs) This could make constructing social cards from the data a bit more "branded" looking when it's available.

If not, is there some way to get access to the scraped meta info and/or HTML so authors can do their own parsing of the fetched content?

Changelog

Great project! Just looking at upgrading from 1.x to 3 and I see the data structure has changed. Is there a changelog somewhere I'm missing? If not, the releases feature on here is really handy for logging all that. I'm having to go through, compare, and change code manually. Not a huge deal but it would be nice to have a changelog.

An in-range update of serve-static is breaking the build 🚨

Version 1.13.0 of serve-static just got published.

Branch Build failing 🚨
Dependency serve-static
Current Version 1.12.6
Type devDependency

This version is covered by your current version range and after updating it in your project the build failed.

As serve-static is “only” a devDependency of this project it might not break production or downstream projects, but “only” your build or test tools – preventing new deploys or publishes.

I recommend you give this issue a high priority. I’m sure you can resolve this 💪

Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build failed Details

Release Notes 1.13.0
  • deps: [email protected]
    • Add 70 new types for file extensions
    • Add immutable option
    • Fix missing </html> in default error & redirects
    • Set charset as "UTF-8" for .js and .json
    • Use instance methods on stream to check for listeners
    • deps: [email protected]
    • perf: improve path validation speed
Commits

The new version differs by 2 commits.

See the full diff

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

only absolute urls are supported

When I try to run unfurl against this URL: https://www.gohighlevel.com/blog/2018/04/25/the-winner-take-all-world-of-dental-reviews/index.html

I get the error:

<rejected> Error: only absolute urls are supported
   at /Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:54:10
   at new Promise (<anonymous>)
   at new Fetch (/Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:49:9)
   at Fetch (/Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:37:10)
   at /Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/unfurl.js/index.js:296:14
   at <anonymous>
   at process._tickDomainCallback (internal/process/next_tick.js:228:7) } reason: Error: only absolute urls are supported
   at /Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:54:10
   at new Promise (<anonymous>)
   at new Fetch (/Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:49:9)
   at Fetch (/Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/node-fetch/index.js:37:10)
   at /Users/shaun/Documents/PycharmProjects/spm-appengine/node_modules/unfurl.js/index.js:296:14
   at <anonymous>
   at process._tickDomainCallback (internal/process/next_tick.js:228:7)

Is this a known issue or is there a way to convert these to absolute URLs on the fly? Thanks!
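
Converting to absolute URLs on the fly can be done with the standard WHATWG URL class by resolving the candidate against the page URL. This is a minimal sketch; it assumes the failure comes from a relative link found in the page (the URL passed in above is already absolute), and the function name toAbsoluteUrl and the example paths in the comments are hypothetical.

```typescript
// Hypothetical helper: resolve a possibly relative link against the URL of
// the page it was found on. Absolute inputs are returned as-is, and
// protocol-relative inputs ("//host/path") inherit the page's scheme.
function toAbsoluteUrl(candidate: string, pageUrl: string): string {
  return new URL(candidate, pageUrl).href
}
```

Running every extracted link through such a helper before handing it to the HTTP client avoids the "only absolute urls are supported" rejection for relative values.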

Extract website icon

First of all, awesome work!
It would be great if you could extract the site icon in the response as well :)

Youtube OEmbeds fail, entities in URL not decoded

Trying to embed https://www.youtube.com/watch?v=TTCVn4EByfI fails, as the getRemoteMetadata step fails to retrieve http://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec. The correct URL is of course http://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec, so this issue can be resolved by parsing entities in the provided URL. I will open a PR shortly.
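
Decoding entities in an attribute value before using it as a URL can be sketched as below. This is not the code from the promised PR; it is a small illustration that handles only the five predefined XML entities plus numeric references, and the names decodeEntities and NAMED_ENTITIES are hypothetical. A full implementation would normally rely on a dedicated entity-decoding library.

```typescript
// Hypothetical lookup of the five predefined XML entities.
const NAMED_ENTITIES: { [name: string]: string | undefined } = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  apos: "'",
}

// Hypothetical helper: decode named and numeric character references.
// Unknown named entities are left untouched.
function decodeEntities(value: string): string {
  return value.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole: string, body: string) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X'
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10)
      // Guard against values outside the Unicode range.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole
    }
    return NAMED_ENTITIES[body.toLowerCase()] ?? whole
  })
}
```

Applied to the href quoted above, the &amp; becomes a plain & and the oEmbed endpoint receives a well-formed query string.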

Youtube: only favicon gets extracted

Youtube changed its HTML a month ago and since then our tests (adobe/helix-embed#345) have been failing when verifying the output for Youtube.

The underlying issue is a combination of making the reasonable assumption that all metadata is in the head here

unfurl/src/index.ts

Lines 270 to 273 in db57429

// We want to parse as little as possible so finish once we see </head>
if (tag === 'head') {
  parser.reset()
}

and Youtube being above convention, standards, and reason:

<!DOCTYPE html>
<html
  style="font-size: 10px; font-family: Roboto, Arial, sans-serif"
  lang="de-DE"
>
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <link
      rel="shortcut icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon.ico"
      type="image/x-icon"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_32.png"
      sizes="32x32"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_48.png"
      sizes="48x48"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_96.png"
      sizes="96x96"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_144.png"
      sizes="144x144"
    />
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=Roboto:500,300,700,400"
      name="www-roboto"
    />
    <script name="www-roboto" nonce="26OMsP9eT4h+T5PS9iXDRQ">
      if (document.fonts && document.fonts.load) {
        document.fonts.load("400 10pt Roboto", "");
        document.fonts.load("500 10pt Roboto", "");
      }
    </script>
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=YT%20Sans%3A300%2C500%2C700"
      name="www-webfont-yt-sans"
    />
    <link rel="stylesheet" href="/s/player/5dd3f3b2/www-player.css" />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-watch-page-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-player-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-onepick.css"
    />
    <meta name="theme-color" content="rgba(255, 255, 255, 0.98)" />
    <link
      rel="search"
      type="application/opensearchdescription+xml"
      href="https://www.youtube.com/opensearch?locale=de_DE"
      title="YouTube"
    />
    <link
      rel="manifest"
      href="/s/notifications/manifest/manifest.json"
      crossorigin="use-credentials"
    />
  </head> <!-- END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE  --->
  <body dir="ltr" no-y-overflow>
    <link
      rel="canonical"
      href="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="handheld"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="only screen and (max-width: 640px)"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><title>
      Google Translate Sings: &quot;The Sound of Silence&quot; (Simon &amp;
      Garfunkel) - YouTube</title
    ><meta
      name="title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="keywords"
      content="sound of silence, parody, google translate, google translate sings, disturbed, pentatonix, performance, the sound of silence, simon and garfunkel, translator fails, translation, fail, comedy, 1960s, paul simon, official video"
    /><link rel="shortlinkUrl" href="https://youtu.be/ccYpEv4APec" /><link
      rel="alternate"
      href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      type="application/json+oembed"
      href="http://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="alternate"
      type="text/xml+oembed"
      href="http://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="image_src"
      href="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:site_name" content="YouTube" /><meta
      property="og:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      property="og:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      property="og:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:image:width" content="1280" /><meta
      property="og:image:height"
      content="720"
    /><meta
      property="og:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta property="al:ios:app_store_id" content="544007664" /><meta
      property="al:ios:app_name"
      content="YouTube"
    /><meta
      property="al:ios:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:android:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:web:url"
      content="http://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta property="og:type" content="video.other" /><meta
      property="og:video:url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta
      property="og:video:secure_url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta property="og:video:type" content="text/html" /><meta
      property="og:video:width"
      content="1280"
    /><meta property="og:video:height" content="720" /><meta
      property="al:android:app_name"
      content="YouTube"
    /><meta
      property="al:android:package"
      content="com.google.android.youtube"
    /><meta property="og:video:tag" content="sound of silence" /><meta
      property="og:video:tag"
      content="parody"
    /><meta property="og:video:tag" content="google translate" /><meta
      property="og:video:tag"
      content="google translate sings"
    /><meta property="og:video:tag" content="disturbed" /><meta
      property="og:video:tag"
      content="pentatonix"
    /><meta property="og:video:tag" content="performance" /><meta
      property="og:video:tag"
      content="the sound of silence"
    /><meta property="og:video:tag" content="simon and garfunkel" /><meta
      property="og:video:tag"
      content="translator fails"
    /><meta property="og:video:tag" content="translation" /><meta
      property="og:video:tag"
      content="fail"
    /><meta property="og:video:tag" content="comedy" /><meta
      property="og:video:tag"
      content="1960s"
    /><meta property="og:video:tag" content="paul simon" /><meta
      property="og:video:tag"
      content="official video"
    /><meta property="fb:app_id" content="87741124305" /><meta
      name="twitter:card"
      content="player"
    /><meta name="twitter:site" content="@youtube" /><meta
      name="twitter:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="twitter:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="twitter:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta name="twitter:app:name:iphone" content="YouTube" /><meta
      name="twitter:app:id:iphone"
      content="544007664"
    /><meta name="twitter:app:name:ipad" content="YouTube" /><meta
      name="twitter:app:id:ipad"
      content="544007664"
    /><meta
      name="twitter:app:url:iphone"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      name="twitter:app:url:ipad"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta name="twitter:app:name:googleplay" content="YouTube" /><meta
      name="twitter:app:id:googleplay"
      content="com.google.android.youtube"
    /><meta
      name="twitter:app:url:googleplay"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:player"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta name="twitter:player:width" content="1280" /><meta
      name="twitter:player:height"
      content="720"
    />

(HTML reformatted and all script and style tags removed)

As you can see, most of the interesting metadata (even the title) sits outside the head, inside the body.

I will submit a PR to address that.
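
To illustrate the idea, here is a minimal sketch of collecting `<meta>` tags from the whole document rather than only from the head. It assumes the scraper currently stops looking at `</head>`, as the page above suggests. `collectMeta` and `decodeEntities` are hypothetical helpers, not part of unfurl.js, and the regex scan stands in for a real HTML parser only to keep the example dependency-free.

```typescript
// Hypothetical helpers for illustration; not the unfurl.js implementation.
type MetaMap = Record<string, string[]>;

// Decode the handful of entities that show up in the sample page.
// `&amp;` is handled last so that `&amp;quot;` is not decoded twice.
function decodeEntities(value: string): string {
  return value
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Scan the full HTML string for <meta> tags, ignoring where </head> falls.
// Limitation: a regex scan breaks on a literal ">" inside an attribute value.
function collectMeta(html: string): MetaMap {
  const result: MetaMap = {};
  const tagPattern = /<meta\b([^>]*)>/gi;
  let tag: RegExpExecArray | null;

  while ((tag = tagPattern.exec(html)) !== null) {
    // Attribute values may be double- or single-quoted.
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    const attrs: Record<string, string | undefined> = {};
    let attr: RegExpExecArray | null;
    while ((attr = attrPattern.exec(tag[1])) !== null) {
      attrs[attr[1].toLowerCase()] = decodeEntities(attr[2] ?? attr[3] ?? "");
    }

    // Open Graph uses `property`; Twitter Cards and plain meta use `name`.
    const key = attrs["property"] ?? attrs["name"];
    const content = attrs["content"];
    if (key === undefined || content === undefined) continue;

    // Repeated keys (e.g. og:video:tag) accumulate in document order.
    if (!Object.prototype.hasOwnProperty.call(result, key)) {
      result[key] = [];
    }
    result[key].push(content);
  }
  return result;
}
```

Run against the page above, this should pick up `og:title`, every `og:video:tag` and the `twitter:*` entries even though they come after `</head>`. In a streaming parser, the equivalent change would be to keep handling `meta`, `title` and `link` tags after the head closes.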
