laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.

Home Page: https://www.npmjs.com/package/url-metadata

License: MIT License

JavaScript 95.40% Shell 0.19% HTML 1.35% TypeScript 3.06%

url-metadata's People

Contributors

abelorian, bb-work, czarandy, itjesse, laurengarcia, martinmalinda, orbin, suchtomwow

url-metadata's Issues

Promise neither resolves nor rejects

I'm using url-metadata to retrieve metadata about URLs obtained from another search. Sometimes neither my success nor my failure callback is called, and I don't understand why.

Example "bad" url: http://sartma.com/art_13825.html

for (var i = 0; i < url_list.length; i++) {
  winston.debug(url_list[i]['MentionIdentifier']);
  urlMetadata(url_list[i]['MentionIdentifier']).then(
    (metadata) => {
      var title = metadata.title;
      var author = metadata.author;
      var description = metadata.description;
      var keywords = metadata.keywords;
      var source = metadata.source;

      var image;
      if (metadata.image) {
        if (metadata.image.substring(0, 2) == '//') {
          image = metadata.image.replace('//', '');
        } else {
          image = metadata.image.replace('https', 'http');
        }
      }

      var url = metadata.url;

      events.push([image, title, url, author, description, keywords, source]);
      winston.debug(image, title, url, author, description, keywords, source);
    },
    (error) => {
      winston.error('URL Metadata failure: ' + error);
    }
  );
}
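
One hedged workaround (a sketch, not part of the library): race each call against a timeout so the loop always settles, even when the underlying request never resolves or rejects.

// Sketch: wrap each urlMetadata() call so it always settles.
// The 10-second limit is an arbitrary value for illustration.
function withTimeout (promise, ms) {
  return Promise.race([
    promise,
    new Promise((resolve, reject) =>
      setTimeout(() => reject(new Error('urlMetadata timed out')), ms)
    )
  ]);
}

withTimeout(urlMetadata(url_list[i]['MentionIdentifier']), 10000)
  .then(
    (metadata) => { /* same success handling as above */ },
    (error) => winston.error('URL Metadata failure: ' + error)
  );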

Parsing raw html

It would be great if we could pass a raw HTML string to be parsed rather than relying on the library to make a request. Some websites are blocked on my server, so this library will throw HTTP errors. Using a proxy service works well to get around this, but there's currently no option to pass the HTML to urlMetadata().
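
A possible workaround, assuming the parseResponseObject option that later 3.x releases expose (see the "Passing in HTML?" issue below): wrap the raw HTML you already have in a Response object so the library never makes its own request.

// Sketch, assuming the 3.x parseResponseObject option and a runtime with a
// global Response constructor (Node 18+ or the browser).
const urlMetadata = require('url-metadata');

async function parseRawHtml (html) {
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}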

CORS Origin Error

access to fetch at url from origin 'http://localhost:19006' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
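
CORS is enforced by the browser, so one common workaround is to do the scraping on a small server endpoint you control and call that endpoint from the browser instead. A minimal sketch using Express (an assumption, not part of this package):

// Sketch: run url-metadata server-side behind your own endpoint so the
// browser never makes the cross-origin request itself.
const express = require('express');
const urlMetadata = require('url-metadata');

const app = express();
app.get('/metadata', async (req, res) => {
  try {
    const metadata = await urlMetadata(String(req.query.url));
    res.json(metadata);
  } catch (err) {
    res.status(500).json({ error: String(err) });
  }
});
app.listen(3000);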

Problems with cheerio dependency version

We have detected that in some cases the library gets into a loop because of css-select. This issue is already resolved in css-select. I suppose that if the dependency reference is updated to cheerio version 1.0, the problem would be resolved.

Option to select which data we want

Do you plan to add the option to select which data is needed?

This could also reduce processing when only some data (e.g. title, description, favicon) is required.

If it is in the plan, I can help implement it and send a PR.
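
Until such an option exists, a minimal consumer-side sketch is to pick the wanted keys out of the full result. Note this does not reduce the parsing work inside the library, only what your own code keeps:

// Sketch: keep only the fields you care about from the full result.
const urlMetadata = require('url-metadata');

async function getSelectedMetadata (url, fields) {
  const metadata = await urlMetadata(url);
  return Object.fromEntries(
    fields.filter((key) => key in metadata).map((key) => [key, metadata[key]])
  );
}

// e.g. getSelectedMetadata('https://example.com', ['title', 'description'])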

Uncaught TypeError: Cannot add property robots, object is not extensible

Hello,

First of all, thank you for the good library.
I am trying to run it from a Vue.js application mixin and I am getting the following error:
index.js?e609:1081 Uncaught TypeError: Cannot add property robots, object is not extensible
Any idea how to work around it?


export const urlMetadata = {
  created () {
    this.urlMetaDataCall('https://cors-anywhere.herokuapp.com/http://bit.ly/2ePIrDy');
  },
  methods: {
    urlMetaDataCall (url) {
      urlMetaData(url);
      return url; ...
thanks!

Passing in HTML?

While playing with this I've run into problem sites like www.crunchyroll.com, where the page metadata is not available until the page is rendered. For this reason I am looking to use Puppeteer in certain scenarios to render the page and then get the HTML, though from what I can see I can't pass that HTML to url-metadata.

  async function getRenderedHtml (pageUrl: string): Promise<string> {
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      await page.goto(pageUrl);
      await page.waitForSelector('meta[name=description]', { timeout: 5000 });
      return await page.content();
    } finally {
      await browser.close();
    }
  }

Is there any way I could pass the HTML to url-metadata, so that it can process the content and provide the parsed metadata?

BTW I did see the 'alternate' use case with parseResponseObject, so I will see if there is a way I could create a compatible response object using just the HTML I already have:

// Alternate use-case: parse a Response object instead
try {
  // fetch the url in your own code
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  // ... do other stuff with it...
  // pass the `response` object to be parsed for its metadata
  const metadata = await urlMetadata(null, { parseResponseObject: response });
  console.log(metadata);
} catch (err) {
  console.log(err);
}
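
If parseResponseObject accepts any fetch-style Response (an assumption here, since the documented example passes one returned by fetch()), the rendered HTML from getRenderedHtml() above could be wrapped the same way:

// Sketch: wrap the puppeteer-rendered HTML in a Response object and hand it
// to url-metadata via parseResponseObject. Assumes any fetch-style Response
// is accepted, not only one produced by fetch().
async function getMetadataFromRenderedPage (pageUrl: string) {
  const html = await getRenderedHtml(pageUrl);
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}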

First parameter can't be null in typescript

I just got the latest version of url-metadata and tried the following code

const response = new Response(html);
const metadata = await urlMetadata(null, {
  requestHeaders: {
    ...(this.requestHeaders || {}),
    'Accept-Language': locale,
  },
  parseResponseObject: response
});    

VSCode is telling me Argument of type 'null' is not assignable to parameter of type 'string'. For now I can work around this by using null as any, but it isn't ideal.

One way of addressing this is with:

declare function urlMetadata(
  url: string | null,
  options?: urlMetadata.Options,
): Promise<urlMetadata.Result>

There may be some way of indicating that null is only permitted when parseResponseObject is provided, but I'd have to explore, since my TS knowledge doesn't go that deep.

fetch is not defined

Hello.

I am getting "fetch is not defined" on the following line:

const metadata = await urlMetadata('https://adnan-tech.com')

Any help is highly appreciated.
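
url-metadata 3.x relies on the global fetch API, which Node.js only provides natively from v18 onward. A hedged sketch for older Node versions (assumes node-fetch v2 is installed; upgrading to Node 18+ avoids the polyfill entirely):

// Sketch: provide a global fetch on Node versions older than 18 before
// requiring url-metadata. Assumes node-fetch@2 (which supports require()).
if (typeof fetch === 'undefined') {
  globalThis.fetch = require('node-fetch');
}
const urlMetadata = require('url-metadata');

urlMetadata('https://adnan-tech.com')
  .then((metadata) => console.log(metadata))
  .catch((err) => console.error(err));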

Certain URLs cause Maximum call stack size exceeded

Filing a new issue as I can't reopen #6

Some URLs cause a RangeError: Maximum call stack size exceeded
An example of URL is https://lnkd.in/gVeYnv7

The error message is below (the stack trace is truncated, showing only the snippet below).

/<redacted>/node_modules/domutils/lib/querying.js:83
function findAll(test, elems){
                ^

RangeError: Maximum call stack size exceeded
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:83:17)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)

userAgent option is not considered in version 2.2.1

When I pass the userAgent option, it is not used when requesting the URL. Please look into this.

I am passing userAgent as below:

var options = {
    userAgent: 'my_custom_useragent'
};
const urlMetadata = require('url-metadata');
urlMetadata(url, options);

In this case my_custom_useragent is ignored.
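
In the newer 3.x releases, request headers appear to be passed through a requestHeaders option (it shows up in other issues on this page), so a custom user agent would likely go there. A sketch, not verified against 2.2.1:

// Sketch, assuming a 3.x release with the requestHeaders option:
const urlMetadata = require('url-metadata');

const options = {
  requestHeaders: {
    'User-Agent': 'my_custom_useragent'
  }
};
urlMetadata(url, options);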

"Fails" with multiple tags of the same name

There are cases when websites have multiple <meta> tags with the same name attribute:

(screenshot of a page with multiple meta tags sharing the same name attribute)

I'm not sure if this is valid/proper html, but I've seen several sites do it, especially in academia.

Since this library returns metadata as an object with the name attributes as the keys, if there are multiple tags with the same name, it only returns the last one. In the above screenshot, it only returns Cheeseman.

Don't know what a good way to handle this would be. Maybe if there are multiple, just return an array of them instead under the same object key?
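
One consumer-side sketch of the array idea, using cheerio directly (which the package already depends on); this is a workaround, not the package's own behavior:

// Sketch: collect every <meta> tag into arrays keyed by its name attribute,
// so duplicates are preserved instead of overwritten.
const cheerio = require('cheerio');

function collectMetaByName (html) {
  const $ = cheerio.load(html);
  const byName = {};
  $('meta[name]').each((_, el) => {
    const name = $(el).attr('name');
    const content = $(el).attr('content');
    if (!byName[name]) byName[name] = [];
    byName[name].push(content);
  });
  return byName; // each value is an array, even when there is only one tag
}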

Typescript definition is missing `parseResponseObject`

Looking at the TypeScript definition, it would appear that parseResponseObject is missing from the Options interface, which causes problems when compiling in TypeScript.

This issue can be worked around by casting the options to any.

url-metadata: 3.5.2
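
A small sketch of the cast-based workaround mentioned above, until the Options interface includes parseResponseObject:

// Sketch: silence the type error by widening the options object; the
// runtime behavior is unchanged.
const metadata = await urlMetadata(null as any, {
  parseResponseObject: response
} as any);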

Handling specific encodings dropped?

The previous version, 2.5.0, had a decode parameter for handling specific encodings. How should that be done in 3.3.0?
The problem reproduces with the Windows-1251 encoding: the text is unreadable afterwards.

Thank you!
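
A hedged sketch of one way to handle a specific encoding with 3.x, assuming the parseResponseObject option: fetch the page yourself, decode the bytes explicitly, and hand the decoded HTML back to the library.

// Sketch: decode Windows-1251 (or any encoding TextDecoder supports)
// before parsing, assuming the 3.x parseResponseObject option.
const urlMetadata = require('url-metadata');

async function getMetadataWithEncoding (url, encoding = 'windows-1251') {
  const raw = await fetch(url);
  const bytes = await raw.arrayBuffer();
  const html = new TextDecoder(encoding).decode(bytes);
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html; charset=utf-8' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}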

Unable to scrape image from the below URL

Hi Guys,

First of all thank you very much for this amazing work.
I am facing an issue scraping the og:image meta tag from the URL mentioned below:

https://finance.yahoo.com/news/passenger-dies-boston-bound-united-164546412.html

The og:image, which gets truncated:
https://s.yimg.com/uu/api/res/1.2/xZ2FcgWSxDeKhchPxVEtjA--~B/aD02ODI7dz0xMDI0O3NtPTE7YXBwaWQ9eXRhY2h5b24-/http://media.zenfs.com/en-US/homerun/fortune_175/abd8f21e89363e43b2babd33af71da3e

When I debugged the code, I found the problem is here:
https://github.com/LevelNewsOrg/url-metadata/blob/master/lib/clean.js

Can you please help fix this issue?
We can't use the ensureSecureImageRequest option since it's always true.
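
In releases where ensureSecureImageRequest is exposed as a configurable option (it appears in other issues and in the test suite shown further down this page), disabling it should leave the og:image URL untouched. A sketch, not verified against the version in this report:

// Sketch: turn off the https upgrade of image URLs, assuming the option is
// configurable in the installed version.
const metadata = await urlMetadata(url, { ensureSecureImageRequest: false });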

Maximum call stack size exceeded

Error: unable to verify the first certificate
at TLSSocket.onConnectSecure (node:_tls_wrap:1532:34)
at TLSSocket.emit (node:events:369:20)
at TLSSocket.emit (node:domain:470:12)
at TLSSocket._finishInit (node:_tls_wrap:946:8)
at TLSWrap.ssl.onhandshakedone (node:_tls_wrap:720:12) {
code: 'UNABLE_TO_VERIFY_LEAF_SIGNATURE'
} https://nift.ac.in/NIFT-HG
/home/shubham/Desktop/Work/gide/api/node_modules/url-metadata/node_modules/domelementtype/index.js:12
isTag: function(elem){
^

RangeError: Maximum call stack size exceeded

CORS issue

When I try to fetch meta data I get this issue, how do I resolve?

Access to fetch at 'https://chrome.google.com/' from origin 'http://localhost:8080' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.

Support for links

Is there any way to support link tags? The motivation is looking at http://news.bbc.co.uk, where link tags are being used for the favicons. I am not seeing any other useful metadata for the favicons.
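
Later releases appear to return favicon link tags on the result (the repo's own test output further down this page references metadata.favicons), so a sketch under that assumption:

// Sketch: inspect the favicons array, assuming a release that exposes it.
// The exact shape of each entry is not documented here; the tests below
// reference at least a `color` field on one entry.
const metadata = await urlMetadata('http://news.bbc.co.uk');
console.log(metadata.favicons);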

404 returned for existing url after upgrading to version 3.3.1

#!/usr/bin/env node

const urlMetadata = require('url-metadata');

(async function () {
  try {
    const url = 'https://www.skynews.com.au/world-news/united-states/joe-biden-backs-defense-secretary-despite-lack-of-transparency-on-hospitalisation/video/442a6796cce06e13ce9b8658a5add27a';
    const metadata = await urlMetadata(url, { mode: 'same-origin' });
    console.log('fetched metadata:', metadata)
  } catch(err) {
    console.log('fetch error:', err);
  }
})();

Take the URL in the code above for example:

https://www.skynews.com.au/world-news/united-states/joe-biden-backs-defense-secretary-despite-lack-of-transparency-on-hospitalisation/video/442a6796cce06e13ce9b8658a5add27a

After upgrading to the latest version (3.3.1), it always returns 404.

does this package load the whole html file?

Does this package load the entire HTML file and THEN extract the metadata from it? Or does it somehow load only the beginning of the HTML file and extract the metadata from that? Some HTML pages are quite big.

`npm run test` fails on checked out master

I ran into an issue while testing my code (trying to make a commit for issue #76) and noticed it seems to be a problem on master, at the current head (d6aa7ba).

Running npm run test I get the following output:

» npm run test                                                                                     ajmas@ghostwalker-echo

> [email protected] test
> jest --testPathIgnorePatterns=/test-debug/ && standard

 PASS  test/robots.test.js
 PASS  test/fail.test.js
 PASS  test/citations.test.js
 PASS  test/og.test.js
 FAIL  test/basic.test.js
  ● favicons

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      59 |     expect(metadata.favicons[4].color).toBe('#000000')
      60 |   } catch (err) {
    > 61 |     expect(err).toBe(undefined)
         |                 ^
      62 |   }
      63 | })
      64 |

      at Object.toBe (test/basic.test.js:61:17)

 FAIL  test/options.test.js
  ● option: `ensureSecureImageRequest` edge cases

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      41 |     })
      42 |   } catch (err) {
    > 43 |     expect(err).toBe(undefined)
         |                 ^
      44 |   }
      45 | })
      46 |

      at Object.toBe (test/options.test.js:43:17)

 PASS  test/json-ld.test.js
 PASS  test/decode.test.js

Test Suites: 2 failed, 6 passed, 8 total
Tests:       2 failed, 21 passed, 23 total
Snapshots:   0 total
Time:        2.98 s, estimated 5 s
Ran all test suites.

I looked into this and noticed that while the code shows a 403 response during the tests, the page works fine when tested in Chrome. I am wondering whether it is down to a header the server expects, or something else?

Environment:

  • node: v20.11.0
  • OS: macOS 14.3.1 (Intel)

Response Code 0

Hey, I have an issue when using this.

I got this error.
fetch error: Error: response code 0

Here is my code.

const handleOnScrapeMetadata = async (event: React.ChangeEvent<HTMLInputElement>) => {
    const url = event.target?.value;
    const options = {
      cache: 'no-cache',
      mode: 'no-cors',
      timeout: 10000,
      descriptionLength: 750,
      ensureSecureImageRequest: true,
      includeResponseBody: true,
    };

    console.log('fetching metadata for:', url);

    try {
      const metadata = await urlMetadata(url, options);
      console.log('fetched metadata:', metadata);
    } catch (err) {
      console.log('fetch error:', err);
    }
  };

Is there anything I am missing?
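
With mode: 'no-cors' the browser hands back an opaque response whose status is 0, which is likely what the library is reporting here. A sketch of the same call without that mode; the target site must still allow cross-origin requests, or the request has to move to a server:

// Sketch: drop mode 'no-cors' so the response is not opaque (status 0).
// The target site still needs permissive CORS headers for a browser call;
// otherwise run url-metadata server-side instead.
const options = {
  timeout: 10000,
  descriptionLength: 750,
  ensureSecureImageRequest: true,
  includeResponseBody: true
};
const metadata = await urlMetadata(url, options);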

Can't get metadata

Hello,
Thank you for the nice plugin.
When I use your example URL I get nothing:

author: ""
availability: ""
canonical: ""
description: ""
image: ""
keywords: ""
og:description: ""
og:determiner: ""
og:image: ""
og:image:height: ""
og:image:secure_url: ""
og:image:type: ""
og:image:width: ""
og:locale: ""
og:locale:alternate: ""
og:site_name: ""
og:title: ""
og:type: ""
og:url: ""
price: ""
priceCurrency: ""
source: "wearechange.org"
title: null
url: "https://wearechange.org/hero-bullhorn-reads-internet-julian-assange-sidewalk"

Any ideas?

thank you

Getting Cors Error

Access to fetch at 'https://dribbble.com/' from origin 'http://localhost:3000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Certain URLs cause Maximum call stack size exceeded

domutils/lib/traversal.js:22
exports.getName = function(elem){

RangeError: Maximum call stack size exceeded
at Object.exports.getName

This also occurs with some of the other urlmetadata modules on npm, specifically node-MetaInspector

Multiple JSON LD objects?

I'm looking at https://github.com/laurengarcia/url-metadata/blob/master/lib/extract-json-ld.js

Unless I'm reading it wrong, it seems like the extracted value is replaced with the last parsed JSON-LD block, instead of returning an array of all the JSON-LD objects?

I'm testing on this URL:
https://goout.net/cs/metronome-prague-2024/szpsfuw/

rich results tester detects two items:
https://search.google.com/test/rich-results/result?id=6sn_Xcdp6zfbqC3lP9ywpg

And I'm getting back only one:

{
"jsonLd": 
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "item": {
          "@id": "https://goout.net/cs/",
          "name": "Domů"
        }
      }
    ]
  }
}
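
A consumer-side sketch of the array idea, collecting every ld+json script with cheerio (the same parser the package uses); this is a workaround, not the package's current behavior:

// Sketch: gather all JSON-LD blocks on the page into an array instead of
// keeping only the last one parsed.
const cheerio = require('cheerio');

function extractAllJsonLd (html) {
  const $ = cheerio.load(html);
  const blocks = [];
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      blocks.push(JSON.parse($(el).html() || ''));
    } catch (err) {
      // skip malformed JSON-LD blocks
    }
  });
  return blocks;
}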

fetch error: Error: response code 0

Hello, I have an issue when I'm trying to fetch the metadata of an external URL.

I got this error.

(screenshot of the error)

Here is my code.

"use client";
import React, { useState, useEffect } from "react";

function ToolsPage() {
  const [metadataImage, setMetadataImage] = useState<string | undefined>(
    undefined
  );

  const urlMetadata = require("url-metadata");

  useEffect(() => {
    const fetchData = async () => {
      try {
        const options = {
          mode: "no-cors",
          includeResponseBody: false,
        };

        const metadata = await urlMetadata(
          "https://youtube.com/",
          options
        );

        const ogImage = metadata["og:image"] || undefined;
        setMetadataImage(ogImage);

        console.log("fetched metadata:", metadata);
        console.log("inside fetchMetadata, ogImage:", ogImage);
      } catch (err) {
        console.log("fetch error:", err);
      }
    };

    fetchData();
  }, []); 

  console.log("outside fetchMetadata, metadataImage:", metadataImage);

  return (
    <>
      <img
        src={`${metadataImage}`}
        className="object-cover h-52 w-96 object-center pb-3 rounded-t-lg"
        alt=""
      />
    </>
  );
}

export default ToolsPage;

Don't display falsey values

What do you think about having an option to not display fields that have no value? That would be useful in the code I'm writing. I could use a helper of course, but I just wanted to float the idea of building it into the module, thanks.
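
Until such an option exists, a minimal helper along these lines would do it consumer-side (a sketch, not part of the module):

// Sketch: drop every field whose value is falsy (empty string, null, 0).
function withoutEmptyFields (metadata) {
  return Object.fromEntries(
    Object.entries(metadata).filter(([, value]) => Boolean(value))
  );
}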

405 Error

I am using url-metadata in my Next.js app at pages/api/getmetadata.js.
Everything works perfectly on my localhost, but if I deploy it to Vercel it gives me a 405 error in the console:
Failed to load resource: the server responded with a status of 405 (Method Not Allowed)

Is it a CORS problem, or a request header problem? Is there a way to modify the request so it does not get the 405 error, or is the 405 not due to those things?

unable to get og:image preview from below links
