laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.

Home Page: https://www.npmjs.com/package/url-metadata

License: MIT License

JavaScript 95.40% Shell 0.19% HTML 1.35% TypeScript 3.06%

url-metadata's People

Contributors

abelorian, bb-work, czarandy, itjesse, laurengarcia, martinmalinda, orbin, suchtomwow

url-metadata's Issues

Promise neither resolves nor rejects

I'm using url-metadata to retrieve metadata about URLs obtained from another search. Sometimes neither my success nor my failure callback is called, and I don't understand why.

Example "bad" url: http://sartma.com/art_13825.html

for (var i = 0; i < url_list.length; i++) {
  winston.debug(url_list[i]['MentionIdentifier']);
  urlMetadata(url_list[i]['MentionIdentifier']).then(
    (metadata) => {
      var title = metadata.title;
      var author = metadata.author;
      var description = metadata.description;
      var keywords = metadata.keywords;
      var source = metadata.source;

      var image;
      if (metadata.image) {
        if (metadata.image.substring(0, 2) == '//') {
          image = metadata.image.replace('//', '');
        } else {
          image = metadata.image.replace('https', 'http');
        }
      }

      var url = metadata.url;

      events.push([image, title, url, author, description, keywords, source]);
      winston.debug(image, title, url, author, description, keywords, source);
    },
    (error) => {
      winston.error('URL Metadata failure: ' + error);
    }
  );
}
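
One hedged workaround (a sketch, not part of the library): race each call against a timeout so the loop always settles, even when the underlying request never resolves or rejects.

// Sketch: wrap each urlMetadata() call so it always settles.
// The 10-second limit is an arbitrary value for illustration.
function withTimeout (promise, ms) {
  return Promise.race([
    promise,
    new Promise((resolve, reject) =>
      setTimeout(() => reject(new Error('urlMetadata timed out')), ms)
    )
  ]);
}

withTimeout(urlMetadata(url_list[i]['MentionIdentifier']), 10000)
  .then(
    (metadata) => { /* same success handling as above */ },
    (error) => winston.error('URL Metadata failure: ' + error)
  );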

Parsing raw html

It would be great if we could pass a raw HTML string to be parsed rather than relying on the library to make a request. Some websites are blocked on my server, so this library will throw HTTP errors. Using a proxy service works well to get around this, but there's currently no option to pass the HTML to urlMetadata().
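
A possible workaround, assuming the parseResponseObject option that later 3.x releases expose (see the "Passing in HTML?" issue below): wrap the raw HTML you already have in a Response object so the library never makes its own request.

// Sketch, assuming the 3.x parseResponseObject option and a runtime with a
// global Response constructor (Node 18+ or the browser).
const urlMetadata = require('url-metadata');

async function parseRawHtml (html) {
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}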

CORS Origin Error

access to fetch at url from origin 'http://localhost:19006' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
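
CORS is enforced by the browser, so one common workaround is to do the scraping on a small server endpoint you control and call that endpoint from the browser instead. A minimal sketch using Express (an assumption, not part of this package):

// Sketch: run url-metadata server-side behind your own endpoint so the
// browser never makes the cross-origin request itself.
const express = require('express');
const urlMetadata = require('url-metadata');

const app = express();
app.get('/metadata', async (req, res) => {
  try {
    const metadata = await urlMetadata(String(req.query.url));
    res.json(metadata);
  } catch (err) {
    res.status(500).json({ error: String(err) });
  }
});
app.listen(3000);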

Problems with cheerio dependency version

We have detected that in some cases the library gets into a loop because of css-select. This issue is already resolved in css-select. I suppose that if the dependency reference is updated to cheerio version 1.0, the problem would be resolved.

Option to select which data we want

Do you plan to add the option to select which data is needed?

This could also reduce processing when only some data (e.g. title, description, favicon) is required.

If it is in the plan, I can help implement it and send a PR.
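
Until such an option exists, a minimal consumer-side sketch is to pick the wanted keys out of the full result. Note this does not reduce the parsing work inside the library, only what your own code keeps:

// Sketch: keep only the fields you care about from the full result.
const urlMetadata = require('url-metadata');

async function getSelectedMetadata (url, fields) {
  const metadata = await urlMetadata(url);
  return Object.fromEntries(
    fields.filter((key) => key in metadata).map((key) => [key, metadata[key]])
  );
}

// e.g. getSelectedMetadata('https://example.com', ['title', 'description'])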

Uncaught TypeError: Cannot add property robots, object is not extensible

Hello,

First of all, thank you for the good library.
I am trying to run it from a Vue.js application mixin and I am getting the following error:
index.js?e609:1081 Uncaught TypeError: Cannot add property robots, object is not extensible
Any idea how to work around it?


export const urlMetadata = {
  created () {
    this.urlMetaDataCall('https://cors-anywhere.herokuapp.com/http://bit.ly/2ePIrDy');
  },
  methods: {
    urlMetaDataCall (url) {
      urlMetaData(url);
      return url; ...
thanks!

Passing in HTML?

While playing with this I've run into problem sites like www.crunchyroll.com, where the page metadata is not available until the page is rendered. For this reason I am looking to use Puppeteer in certain scenarios to render the page and then get the HTML, though from what I can see I can't pass that HTML to url-metadata.

  async function getRenderedHtml (pageUrl: string): Promise<string> {
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      await page.goto(pageUrl);
      await page.waitForSelector('meta[name=description]', { timeout: 5000 });
      return await page.content();
    } finally {
      await browser.close();
    }
  }

Is there any way I could pass the HTML to url-metadata, so that it can process the content and provide the parsed metadata?

BTW I did see the 'alternate' use case with parseResponseObject, so I will see if there is a way I could create a compatible response object using just the HTML I already have:

// Alternate use-case: parse a Response object instead
try {
  // fetch the url in your own code
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  // ... do other stuff with it...
  // pass the `response` object to be parsed for its metadata
  const metadata = await urlMetadata(null, { parseResponseObject: response });
  console.log(metadata);
} catch (err) {
  console.log(err);
}
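
If parseResponseObject accepts any fetch-style Response (an assumption here, since the documented example passes one returned by fetch()), the rendered HTML from getRenderedHtml() above could be wrapped the same way:

// Sketch: wrap the puppeteer-rendered HTML in a Response object and hand it
// to url-metadata via parseResponseObject. Assumes any fetch-style Response
// is accepted, not only one produced by fetch().
async function getMetadataFromRenderedPage (pageUrl: string) {
  const html = await getRenderedHtml(pageUrl);
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}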

First parameter can't be null in typescript

I just got the latest version of url-metadata and tried the following code

const response = new Response(html);
const metadata = await urlMetadata(null, {
  requestHeaders: {
    ...(this.requestHeaders || {}),
    'Accept-Language': locale,
  },
  parseResponseObject: response
});    

VSCode is telling me Argument of type 'null' is not assignable to parameter of type 'string'. For now I can work around this by using null as any, but it isn't ideal.

One way of addressing this is with:

declare function urlMetadata(
  url: string | null,
  options?: urlMetadata.Options,
): Promise<urlMetadata.Result>

There may be some way of indicating that null is only permitted when parseResponseObject is provided, but I'd have to explore, since my TS knowledge doesn't go that deep.

fetch is not defined

Hello.

I am getting "fetch is not defined" on the following line:

const metadata = await urlMetadata('https://adnan-tech.com')

Any help is highly appreciated.
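
url-metadata 3.x relies on the global fetch API, which Node.js only provides natively from v18 onward. A hedged sketch for older Node versions (assumes node-fetch v2 is installed; upgrading to Node 18+ avoids the polyfill entirely):

// Sketch: provide a global fetch on Node versions older than 18 before
// requiring url-metadata. Assumes node-fetch@2 (which supports require()).
if (typeof fetch === 'undefined') {
  globalThis.fetch = require('node-fetch');
}
const urlMetadata = require('url-metadata');

urlMetadata('https://adnan-tech.com')
  .then((metadata) => console.log(metadata))
  .catch((err) => console.error(err));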

Certain URLs cause Maximum call stack size exceeded

Filing a new issue as I can't reopen #6

Some URLs cause a RangeError: Maximum call stack size exceeded
An example of URL is https://lnkd.in/gVeYnv7

The error message is below (the stack trace is truncated, showing only the snippet below).

/<redacted>/node_modules/domutils/lib/querying.js:83
function findAll(test, elems){
                ^

RangeError: Maximum call stack size exceeded
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:83:17)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
    at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)

userAgent option is not considered in version 2.2.1

When I pass the userAgent option, it is not used when requesting the URL. Please look into this.

I am passing userAgent as below:

var options = {
    userAgent: 'my_custom_useragent'
};
const urlMetadata = require('url-metadata');
urlMetadata(url, options);

In this case my_custom_useragent is ignored.
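
In the newer 3.x releases, request headers appear to be passed through a requestHeaders option (it shows up in other issues on this page), so a custom user agent would likely go there. A sketch, not verified against 2.2.1:

// Sketch, assuming a 3.x release with the requestHeaders option:
const urlMetadata = require('url-metadata');

const options = {
  requestHeaders: {
    'User-Agent': 'my_custom_useragent'
  }
};
urlMetadata(url, options);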

"Fails" with multiple tags of the same name

There are cases when websites have multiple <meta> tags with the same name attribute:

(screenshot of a page with multiple meta tags sharing the same name attribute)

I'm not sure if this is valid/proper html, but I've seen several sites do it, especially in academia.

Since this library returns metadata as an object with the name attributes as the keys, if there are multiple tags with the same name, it only returns the last one. In the above screenshot, it only returns Cheeseman.

Don't know what a good way to handle this would be. Maybe if there are multiple, just return an array of them instead under the same object key?
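
One consumer-side sketch of the array idea, using cheerio directly (which the package already depends on); this is a workaround, not the package's own behavior:

// Sketch: collect every <meta> tag into arrays keyed by its name attribute,
// so duplicates are preserved instead of overwritten.
const cheerio = require('cheerio');

function collectMetaByName (html) {
  const $ = cheerio.load(html);
  const byName = {};
  $('meta[name]').each((_, el) => {
    const name = $(el).attr('name');
    const content = $(el).attr('content');
    if (!byName[name]) byName[name] = [];
    byName[name].push(content);
  });
  return byName; // each value is an array, even when there is only one tag
}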

Typescript definition is missing `parseResponseObject`

Looking at the TypeScript definition, it would appear that parseResponseObject is missing from the Options interface, which causes problems when compiling in TypeScript.

This issue can be worked around by casting the options to any.

url-metadata: 3.5.2
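
A small sketch of the cast-based workaround mentioned above, until the Options interface includes parseResponseObject:

// Sketch: silence the type error by widening the options object; the
// runtime behavior is unchanged.
const metadata = await urlMetadata(null as any, {
  parseResponseObject: response
} as any);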

Handling specific encodings dropped?

The previous version, 2.5.0, had a decode parameter for handling specific encodings. How should that be done in 3.3.0?
The problem reproduces with the Windows-1251 encoding: the text is unreadable afterwards.

Thank you!
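
A hedged sketch of one way to handle a specific encoding with 3.x, assuming the parseResponseObject option: fetch the page yourself, decode the bytes explicitly, and hand the decoded HTML back to the library.

// Sketch: decode Windows-1251 (or any encoding TextDecoder supports)
// before parsing, assuming the 3.x parseResponseObject option.
const urlMetadata = require('url-metadata');

async function getMetadataWithEncoding (url, encoding = 'windows-1251') {
  const raw = await fetch(url);
  const bytes = await raw.arrayBuffer();
  const html = new TextDecoder(encoding).decode(bytes);
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html; charset=utf-8' }
  });
  return urlMetadata(null, { parseResponseObject: response });
}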

Unable to scrape image from the below URL

Hi Guys,

First of all thank you very much for this amazing work.
I am facing an issue scraping the og:image meta tag from the URL mentioned below:

https://finance.yahoo.com/news/passenger-dies-boston-bound-united-164546412.html

The og:image, which gets truncated:
https://s.yimg.com/uu/api/res/1.2/xZ2FcgWSxDeKhchPxVEtjA--~B/aD02ODI7dz0xMDI0O3NtPTE7YXBwaWQ9eXRhY2h5b24-/http://media.zenfs.com/en-US/homerun/fortune_175/abd8f21e89363e43b2babd33af71da3e

When I debugged the code, I found the problem is here:
https://github.com/LevelNewsOrg/url-metadata/blob/master/lib/clean.js

Can you please help fix this issue?
We can't use the ensureSecureImageRequest option since it's always true.
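
In releases where ensureSecureImageRequest is exposed as a configurable option (it appears in other issues and in the test suite shown further down this page), disabling it should leave the og:image URL untouched. A sketch, not verified against the version in this report:

// Sketch: turn off the https upgrade of image URLs, assuming the option is
// configurable in the installed version.
const metadata = await urlMetadata(url, { ensureSecureImageRequest: false });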

Maximum call stack size exceeded

Error: unable to verify the first certificate
at TLSSocket.onConnectSecure (node:_tls_wrap:1532:34)
at TLSSocket.emit (node:events:369:20)
at TLSSocket.emit (node:domain:470:12)
at TLSSocket._finishInit (node:_tls_wrap:946:8)
at TLSWrap.ssl.onhandshakedone (node:_tls_wrap:720:12) {
code: 'UNABLE_TO_VERIFY_LEAF_SIGNATURE'
} https://nift.ac.in/NIFT-HG
/home/shubham/Desktop/Work/gide/api/node_modules/url-metadata/node_modules/domelementtype/index.js:12
isTag: function(elem){
^

RangeError: Maximum call stack size exceeded

CORS issue

When I try to fetch meta data I get this issue, how do I resolve?

Access to fetch at 'https://chrome.google.com/' from origin 'http://localhost:8080' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.

Support for links

Is there any way to support link tags? The motivation is looking at http://news.bbc.co.uk, where link tags are being used for the favicons. I am not seeing any other useful metadata for the favicons.
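
Later releases appear to return favicon link tags on the result (the repo's own test output further down this page references metadata.favicons), so a sketch under that assumption:

// Sketch: inspect the favicons array, assuming a release that exposes it.
// The exact shape of each entry is not documented here; the tests below
// reference at least a `color` field on one entry.
const metadata = await urlMetadata('http://news.bbc.co.uk');
console.log(metadata.favicons);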

404 returned for existing url after upgrading to version 3.3.1

#!/usr/bin/env node

const urlMetadata = require('url-metadata');

(async function () {
  try {
    const url = 'https://www.skynews.com.au/world-news/united-states/joe-biden-backs-defense-secretary-despite-lack-of-transparency-on-hospitalisation/video/442a6796cce06e13ce9b8658a5add27a';
    const metadata = await urlMetadata(url, { mode: 'same-origin' });
    console.log('fetched metadata:', metadata)
  } catch(err) {
    console.log('fetch error:', err);
  }
})();

Take the URL in the code above for example:

https://www.skynews.com.au/world-news/united-states/joe-biden-backs-defense-secretary-despite-lack-of-transparency-on-hospitalisation/video/442a6796cce06e13ce9b8658a5add27a

After upgrading to the latest version (3.3.1), it always returns 404.

does this package load the whole html file?

Does this package load the entire HTML file and THEN extract the metadata from it? Or does it somehow load only the beginning of the HTML file and extract the metadata from that? Some HTML pages are quite big.

`npm run test` fails on checked out master

I ran into an issue while testing my code (trying to make a commit for issue #76) and noticed it seems to be a problem on master, at the current head (d6aa7ba).

Running npm run test I get the following output:

» npm run test                                                                                     ajmas@ghostwalker-echo

> [email protected] test
> jest --testPathIgnorePatterns=/test-debug/ && standard

 PASS  test/robots.test.js
 PASS  test/fail.test.js
 PASS  test/citations.test.js
 PASS  test/og.test.js
 FAIL  test/basic.test.js
  ● favicons

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      59 |     expect(metadata.favicons[4].color).toBe('#000000')
      60 |   } catch (err) {
    > 61 |     expect(err).toBe(undefined)
         |                 ^
      62 |   }
      63 | })
      64 |

      at Object.toBe (test/basic.test.js:61:17)

 FAIL  test/options.test.js
  ● option: `ensureSecureImageRequest` edge cases

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      41 |     })
      42 |   } catch (err) {
    > 43 |     expect(err).toBe(undefined)
         |                 ^
      44 |   }
      45 | })
      46 |

      at Object.toBe (test/options.test.js:43:17)

 PASS  test/json-ld.test.js
 PASS  test/decode.test.js

Test Suites: 2 failed, 6 passed, 8 total
Tests:       2 failed, 21 passed, 23 total
Snapshots:   0 total
Time:        2.98 s, estimated 5 s
Ran all test suites.

I looked into this and noticed that while the code shows a 403 response during the tests, the page works fine when tested in Chrome. I am wondering whether it is down to a header the server expects, or something else?

Environment:

  • node: v20.11.0
  • OS: macOS 14.3.1 (Intel)

Response Code 0

Hey, I have an issue when using this.

I got this error.
fetch error: Error: response code 0

Here is my code.

const handleOnScrapeMetadata = async (event: React.ChangeEvent<HTMLInputElement>) => {
    const url = event.target?.value;
    const options = {
      cache: 'no-cache',
      mode: 'no-cors',
      timeout: 10000,
      descriptionLength: 750,
      ensureSecureImageRequest: true,
      includeResponseBody: true,
    };

    console.log('fetching metadata for:', url);

    try {
      const metadata = await urlMetadata(url, options);
      console.log('fetched metadata:', metadata);
    } catch (err) {
      console.log('fetch error:', err);
    }
  };

Is there anything I am missing?
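
With mode: 'no-cors' the browser hands back an opaque response whose status is 0, which is likely what the library is reporting here. A sketch of the same call without that mode; the target site must still allow cross-origin requests, or the request has to move to a server:

// Sketch: drop mode 'no-cors' so the response is not opaque (status 0).
// The target site still needs permissive CORS headers for a browser call;
// otherwise run url-metadata server-side instead.
const options = {
  timeout: 10000,
  descriptionLength: 750,
  ensureSecureImageRequest: true,
  includeResponseBody: true
};
const metadata = await urlMetadata(url, options);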

Can't get metadata

Hello,
Thank you for the nice plugin.
When I use your example URL I get nothing:

author: ""
availability: ""
canonical: ""
description: ""
image: ""
keywords: ""
og:description: ""
og:determiner: ""
og:image: ""
og:image:height: ""
og:image:secure_url: ""
og:image:type: ""
og:image:width: ""
og:locale: ""
og:locale:alternate: ""
og:site_name: ""
og:title: ""
og:type: ""
og:url: ""
price: ""
priceCurrency: ""
source: "wearechange.org"
title: null
url: "https://wearechange.org/hero-bullhorn-reads-internet-julian-assange-sidewalk"

Any ideas?

thank you

Getting Cors Error

Access to fetch at 'https://dribbble.com/' from origin 'http://localhost:3000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Certain URLs cause Maximum call stack size exceeded

domutils/lib/traversal.js:22
exports.getName = function(elem){

RangeError: Maximum call stack size exceeded
at Object.exports.getName

This also occurs with some of the other urlmetadata modules on npm, specifically node-MetaInspector

Multiple JSON LD objects?

I'm looking at https://github.com/laurengarcia/url-metadata/blob/master/lib/extract-json-ld.js

Unless I'm reading it wrong, it seems like the extracted value is replaced with the last parsed JSON-LD block, instead of returning an array of all the JSON-LD objects?

I'm testing on this URL:
https://goout.net/cs/metronome-prague-2024/szpsfuw/

rich results tester detects two items:
https://search.google.com/test/rich-results/result?id=6sn_Xcdp6zfbqC3lP9ywpg

And I'm getting back only one:

{
"jsonLd": 
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "item": {
          "@id": "https://goout.net/cs/",
          "name": "Domů"
        }
      }
    ]
  }
}
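
A consumer-side sketch of the array idea, collecting every ld+json script with cheerio (the same parser the package uses); this is a workaround, not the package's current behavior:

// Sketch: gather all JSON-LD blocks on the page into an array instead of
// keeping only the last one parsed.
const cheerio = require('cheerio');

function extractAllJsonLd (html) {
  const $ = cheerio.load(html);
  const blocks = [];
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      blocks.push(JSON.parse($(el).html() || ''));
    } catch (err) {
      // skip malformed JSON-LD blocks
    }
  });
  return blocks;
}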

fetch error: Error: response code 0

Hello, I have an issue when I'm trying to fetch the metadata of an external URL.

I got this error.

(screenshot of the error)

Here is my code.

"use client";
import React, { useState, useEffect } from "react";

function ToolsPage() {
  const [metadataImage, setMetadataImage] = useState<string | undefined>(
    undefined
  );

  const urlMetadata = require("url-metadata");

  useEffect(() => {
    const fetchData = async () => {
      try {
        const options = {
          mode: "no-cors",
          includeResponseBody: false,
        };

        const metadata = await urlMetadata(
          "https://youtube.com/",
          options
        );

        const ogImage = metadata["og:image"] || undefined;
        setMetadataImage(ogImage);

        console.log("fetched metadata:", metadata);
        console.log("inside fetchMetadata, ogImage:", ogImage);
      } catch (err) {
        console.log("fetch error:", err);
      }
    };

    fetchData();
  }, []); 

  console.log("outside fetchMetadata, metadataImage:", metadataImage);

  return (
    <>
      <img
        src={`${metadataImage}`}
        className="object-cover h-52 w-96 object-center pb-3 rounded-t-lg"
        alt=""
      />
    </>
  );
}

export default ToolsPage;

Don't display falsey values

What do you think about having an option to not display fields that have no value? That would be useful in the code I'm writing. I could use a helper of course, but I just wanted to float the idea of building it into the module, thanks.
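
Until such an option exists, a minimal helper along these lines would do it consumer-side (a sketch, not part of the module):

// Sketch: drop every field whose value is falsy (empty string, null, 0).
function withoutEmptyFields (metadata) {
  return Object.fromEntries(
    Object.entries(metadata).filter(([, value]) => Boolean(value))
  );
}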

405 Error

I am using url-metadata in my Next.js app at pages/api/getmetadata.js.
Everything works perfectly on my localhost, but if I deploy it to Vercel it gives me a 405 error in the console:
Failed to load resource: the server responded with a status of 405 (Method Not Allowed)

Is it a CORS problem, or a request header problem? Is there a way to modify the request so it does not get the 405 error, or is the 405 not due to those things?

unable to get og:image preview from below links
