laurengarcia / url-metadata
NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
Home Page: https://www.npmjs.com/package/url-metadata
License: MIT License
I'm using url-metadata to retrieve metadata about URLs obtained from another search. Sometimes neither my success nor my failure function is called, and I don't understand why.
Example "bad" url: http://sartma.com/art_13825.html
for (var i = 0; i < url_list.length; i++) {
  winston.debug(url_list[i]['MentionIdentifier']);
  urlMetadata(url_list[i]['MentionIdentifier']).then(
    (metadata) => {
      var title = metadata.title;
      var author = metadata.author;
      var description = metadata.description;
      var keywords = metadata.keywords;
      var source = metadata.source;
      var image;
      if (metadata.image) {
        if (metadata.image.substring(0, 2) == '//') {
          image = metadata.image.replace('//', '');
        } else {
          image = metadata.image.replace('https', 'http');
        }
      }
      var url = metadata.url;
      events.push([image, title, url, author, description, keywords, source]);
      winston.debug(image, title, url, author, description, keywords, source);
    },
    (error) => {
      winston.error('URL Metadata failure: ' + error);
    });
}
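One possible explanation (a hedged guess, since the issue doesn't show the library's internals): a failure callback passed as the second argument to .then() does not catch exceptions thrown inside the success callback, so a bug there fails silently. A minimal sketch of the safer .catch() chaining:

```typescript
// Sketch: .then(onSuccess, onFailure) does not catch exceptions thrown
// inside onSuccess, so a bug in the success handler disappears silently.
// Chaining .catch() at the end catches both rejection and handler errors.
function handle(p: Promise<string>): Promise<void> {
  return p
    .then((value) => {
      if (value === 'bad') {
        // simulates a bug inside the success handler
        throw new Error('handler failed');
      }
      console.log('ok:', value);
    })
    .catch((err) => {
      // reached both for a rejected promise AND for a throw in the handler above
      console.error('URL Metadata failure:', err);
    });
}
```

If neither callback fires with .then(success, failure), moving the failure path into a trailing .catch() at least guarantees the error surfaces somewhere.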
It would be great if we could pass a raw HTML string to be parsed rather than relying on the library to make the request itself. Some websites are blocked from my server, so this library throws HTTP errors for them. Using a proxy service works well to get around this, but there's currently no option to pass the HTML to urlMetadata().
access to fetch at url from origin 'http://localhost:19006' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
We have detected that in some cases the library enters an infinite loop because of css-select. This issue has already been resolved in css-select, so I suppose that if the dependency were updated to version 1.0 of cheerio the problem would be resolved.
Short URLs that redirect are only partially supported: for some of them, no metadata is returned.
e.g. https://abnb.me/fHw1T5PE68 - this is not working and no meaningful metadata is returned. When using this URL in https://developers.facebook.com/tools/debug/ you can see the metadata that should be returned after the redirect.
e.g. https://bit.ly/322JENd - this works as expected and returns metadata from the redirected URL.
Hello!
Apologies that this isn't the right place, but I couldn't find a contact email on the site. It looks like https://levelnews.org's TLS cert expired on May 1st, FYI.
Best regards!
Do you plan to add the option to select which data is needed?
This could also reduce processing when only some data (e.g. title, description, favicon) is required.
If it is in the plan, I can help implement it and send a PR.
Hello,
First of all, thank you for the good library, guys.
I am trying to run it from a VueJS application mixin and am getting the following error:
index.js?e609:1081 Uncaught TypeError: Cannot add property robots, object is not extensible
Any idea how to work around it?
export const urlMetadata = {
  created() {
    this.urlMetaDataCall('https://cors-anywhere.herokuapp.com/http://bit.ly/2ePIrDy');
  },
  methods: {
    urlMetaDataCall(url) {
      urlMetaData(url);
      return url; // ...
    }
  }
};
thanks!
Access to fetch at 'https://github.com/salwa1234/meuzmnet-react/pull/829' from origin 'http://localhost:3000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: It does not have HTTP ok status.
While playing with this I've run into problem sites like www.crunchyroll.com, where the page metadata is not available until the page is rendered. For this reason I am looking to use puppeteer in certain scenarios, to render the page and then get the HTML, though from what I can see I can't pass this HTML to url-metadata.
async function getRenderedHtml (pageUrl: string): Promise<string> {
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(pageUrl);
await page.waitForSelector('meta[name=description]', { timeout: 5000 });
return await page.content();
} finally {
await browser.close();
}
}
Is there any way I could pass the HTML to url-metadata, so that it can process the content and provide the parsed metadata?
BTW I did see the 'alternate' use-case with parseResponseObject, so I will see if there is a way to create a compatible response object using just the HTML I already have:
// Alternate use-case: parse a Response object instead
try {
// fetch the url in your own code
const response = await fetch('https://www.npmjs.com/package/url-metadata');
// ... do other stuff with it...
// pass the `response` object to be parsed for its metadata
const metadata = await urlMetadata(null, { parseResponseObject: response });
console.log(metadata);
} catch (err) {
console.log(err);
}
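One possible workaround (a sketch assuming Node 18+, where Response is a global; not an officially documented url-metadata flow): wrap the already-rendered HTML in a Response and hand it to parseResponseObject.

```typescript
// Sketch: wrap already-rendered HTML (e.g. from puppeteer) in a WHATWG
// Response so it can be passed via the `parseResponseObject` option.
// Assumes Node 18+, where Response is a global.
function htmlToResponse(html: string): Response {
  return new Response(html, {
    headers: { 'content-type': 'text/html; charset=utf-8' }
  });
}

// hypothetical usage with the puppeteer helper above:
// const html = await getRenderedHtml('https://www.crunchyroll.com');
// const metadata = await urlMetadata(null, { parseResponseObject: htmlToResponse(html) });
```

One caveat: a constructed Response carries no url, so any relative-URL resolution the parser does may differ from a real fetch.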
I just got the latest version of url-metadata and tried the following code:
const response = new Response(html);
const metadata = await urlMetadata(null, {
requestHeaders: {
...(this.requestHeaders || {}),
'Accept-Language': locale,
},
parseResponseObject: response
});
VSCode is telling me: Argument of type 'null' is not assignable to parameter of type 'string'. For now I can work around this with null as any, but it isn't ideal.
One way of addressing this is with:
declare function urlMetadata(
url: string | null,
options?: urlMetadata.Options,
): Promise<urlMetadata.Result>
There may be some way of indicating that null is only permitted when parseResponseObject is provided, but I'd have to explore, since my TS knowledge doesn't go that deep.
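One way to express that constraint is with overloads (a type-level sketch only; the Options and Result names are assumed from the package's existing definition file):

```typescript
// Type-level sketch: overloads restricting `null` to calls that also
// supply `parseResponseObject`. Not the package's actual definitions.
declare function urlMetadata(
  url: string,
  options?: urlMetadata.Options
): Promise<urlMetadata.Result>
declare function urlMetadata(
  url: null,
  options: urlMetadata.Options & { parseResponseObject: Response }
): Promise<urlMetadata.Result>
```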
Hello.
I am getting "fetch is not defined" on the following line:
const metadata = await urlMetadata('https://adnan-tech.com')
Any help is highly appreciated.
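For context (my reading, not confirmed by the maintainers): url-metadata v3 relies on the global fetch API, which only became a Node global in v18, so this error typically means an older runtime. A minimal startup check (a sketch; the error text is mine):

```typescript
// Sketch: fail fast with a clearer message when global fetch is absent
// (Node < 18). Upgrading Node, or installing a WHATWG-fetch polyfill,
// should resolve "fetch is not defined".
function assertFetchAvailable(): void {
  if (typeof (globalThis as any).fetch !== 'function') {
    throw new Error(
      `global fetch is missing (running ${process.version}); ` +
      'upgrade to Node 18+ or install a fetch polyfill'
    );
  }
}
```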
warning url-metadata > [email protected]: request has been deprecated, see request/request#3142
warning url-metadata > standard > eslint > file-entry-cache > flat-cache > [email protected]: CircularJSON is in maintenance only, flatted is its successor.
Filing a new issue as I can't reopen #6
Some URLs cause a RangeError: Maximum call stack size exceeded
An example of URL is https://lnkd.in/gVeYnv7
The error message is below (the stack trace is truncated, showing only the snippet below):
/<redacted>/node_modules/domutils/lib/querying.js:83
function findAll(test, elems){
^
RangeError: Maximum call stack size exceeded
at findAll (/<redacted>/node_modules/domutils/lib/querying.js:83:17)
at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
at findAll (/<redacted>/node_modules/domutils/lib/querying.js:90:27)
... (this frame repeats until the call stack limit)
When I pass the userAgent option, it is not passed along when requesting the URL. Please look into this.
I am passing userAgent as below:
var options = {
userAgent: 'my_custom_useragent'
};
const urlMetadata = require('url-metadata');
urlMetadata(url, options);
In this case my_custom_useragent is ignored.
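For what it's worth, in the 3.x API custom headers appear to be passed via the requestHeaders option (it shows up in other snippets in these issues); a hedged sketch, not verified against whatever the 2.x userAgent behavior was:

```typescript
// Sketch: setting the user agent through `requestHeaders`, the option
// used elsewhere in these issues, rather than a top-level `userAgent`.
const options = {
  requestHeaders: {
    'User-Agent': 'my_custom_useragent'
  }
};

// hypothetical usage:
// const metadata = await urlMetadata(url, options);
```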
There are cases when websites have multiple <meta> tags with the same name attribute:
I'm not sure if this is valid/proper HTML, but I've seen several sites do it, especially in academia.
Since this library returns metadata as an object with the name attributes as the keys, if there are multiple tags with the same name, it only returns the last one. In the above screenshot, it only returns Cheeseman.
Don't know what a good way to handle this would be. Maybe if there are multiple, just return an array of them instead under the same object key?
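A sketch of that proposal (addMeta is a hypothetical helper, not part of url-metadata): accumulate repeated keys into an array instead of overwriting.

```typescript
// Sketch of the proposed merge behavior: when a <meta> name repeats,
// collect values in an array instead of keeping only the last one.
type MetaValue = string | string[];

function addMeta(result: Record<string, MetaValue>, name: string, content: string): void {
  const existing = result[name];
  if (existing === undefined) {
    result[name] = content;            // first occurrence: plain string
  } else if (Array.isArray(existing)) {
    existing.push(content);            // third and later: append
  } else {
    result[name] = [existing, content]; // second occurrence: promote to array
  }
}
```

The downside is that consumers would need to handle both string and array values, so this would likely be an opt-in option.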
Hi guys,
This line in index.js will evaluate to true even when options.ensureSecureImageRequest is set to false:
ensureSecureImageRequest: options.ensureSecureImageRequest || true
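A sketch of the fix (resolveEnsureSecure is an illustrative name, not the library's code): distinguish "unset" from an explicit false, e.g. with nullish coalescing.

```typescript
// `||` treats an explicit `false` as missing and silently re-enables the
// option; `??` only falls back to the default when the value is
// null/undefined.
function resolveEnsureSecure(options: { ensureSecureImageRequest?: boolean }): boolean {
  return options.ensureSecureImageRequest ?? true;
}
```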
Add an attribute to also return the path of favicon used in the result of urlMetadata()
Looking at the TypeScript definition, it appears that parseResponseObject is missing from the Options interface, causing problems when compiling in TypeScript.
This issue can be worked around by casting the options to any.
url-metadata: 3.5.2
The previous version 2.5.0 had a decode parameter that handled specific encodings. How should that be done in 3.3.0?
The problem reproduces with windows-1251 encoding. The text is unreadable afterwards.
Thank you!
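For reference, legacy encodings can also be decoded by hand before parsing, using the standard TextDecoder (a sketch; this is not the library's own decode option):

```typescript
// Sketch: manually decoding raw windows-1251 bytes with TextDecoder
// (supported in Node when built with full ICU, the default since Node 13).
const bytes = new Uint8Array([0xcf, 0xf0, 0xe8]); // "При" in windows-1251
const text = new TextDecoder('windows-1251').decode(bytes);
```

The same approach works for windows-1250 and other labels the Encoding Standard defines, after fetching the page body as an ArrayBuffer.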
Hi Guys,
First of all thank you very much for this amazing work.
I am facing an issue scraping the meta og:image from the URL mentioned below:
https://finance.yahoo.com/news/passenger-dies-boston-bound-united-164546412.html
the og:image comes back truncated:
https://s.yimg.com/uu/api/res/1.2/xZ2FcgWSxDeKhchPxVEtjA--~B/aD02ODI7dz0xMDI0O3NtPTE7YXBwaWQ9eXRhY2h5b24-/http://media.zenfs.com/en-US/homerun/fortune_175/abd8f21e89363e43b2babd33af71da3e
When I debugged the code, I found the problem is here:
https://github.com/LevelNewsOrg/url-metadata/blob/master/lib/clean.js
Can you please help to fix this issue?
We can't use the ensureSecureImageRequest option since it's always true.
It is not mentioned in the documentation: is twitter:description supported? If not, please add it.
Thank you.
Error: unable to verify the first certificate
at TLSSocket.onConnectSecure (node:_tls_wrap:1532:34)
at TLSSocket.emit (node:events:369:20)
at TLSSocket.emit (node:domain:470:12)
at TLSSocket._finishInit (node:_tls_wrap:946:8)
at TLSWrap.ssl.onhandshakedone (node:_tls_wrap:720:12) {
code: 'UNABLE_TO_VERIFY_LEAF_SIGNATURE'
} https://nift.ac.in/NIFT-HG
/home/shubham/Desktop/Work/gide/api/node_modules/url-metadata/node_modules/domelementtype/index.js:12
isTag: function(elem){
^
RangeError: Maximum call stack size exceeded
When I try to fetch metadata I get this issue; how do I resolve it?
Access to fetch at 'https://chrome.google.com/' from origin 'http://localhost:8080' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
Is there any way to support link tags? The motivation is looking at http://news.bbc.co.uk, where link tags are being used for the favicons. I am not seeing any other useful metadata for the favicons.
nevermind, delete this.
#!/usr/bin/env node
const urlMetadata = require('url-metadata');
(async function () {
try {
const url = 'https://www.skynews.com.au/world-news/united-states/joe-biden-backs-defense-secretary-despite-lack-of-transparency-on-hospitalisation/video/442a6796cce06e13ce9b8658a5add27a';
const metadata = await urlMetadata(url, { mode: 'same-origin' });
console.log('fetched metadata:', metadata)
} catch(err) {
console.log('fetch error:', err);
}
})();
Take the url in the code above for example.
After upgrading to the latest version (3.3.1), it always returns 404.
Does this package load the entire HTML file and THEN extract the metadata from it, or does it somehow load only the beginning of the HTML file and extract the meta from that? I ask because some HTML pages are very big.
I ran into an issue while testing my code (trying to make a commit for issue #76) and noticed it seems to be a problem on master, at the current head (d6aa7ba).
Running npm run test
I get the following output:
» npm run test ajmas@ghostwalker-echo
> [email protected] test
> jest --testPathIgnorePatterns=/test-debug/ && standard
PASS test/robots.test.js
PASS test/fail.test.js
PASS test/citations.test.js
PASS test/og.test.js
FAIL test/basic.test.js
● favicons
expect(received).toBe(expected) // Object.is equality
Expected: undefined
Received: [Error: response code 403]
59 | expect(metadata.favicons[4].color).toBe('#000000')
60 | } catch (err) {
> 61 | expect(err).toBe(undefined)
| ^
62 | }
63 | })
64 |
at Object.toBe (test/basic.test.js:61:17)
FAIL test/options.test.js
● option: `ensureSecureImageRequest` edge cases
expect(received).toBe(expected) // Object.is equality
Expected: undefined
Received: [Error: response code 403]
41 | })
42 | } catch (err) {
> 43 | expect(err).toBe(undefined)
| ^
44 | }
45 | })
46 |
at Object.toBe (test/options.test.js:43:17)
PASS test/json-ld.test.js
PASS test/decode.test.js
Test Suites: 2 failed, 6 passed, 8 total
Tests: 2 failed, 21 passed, 23 total
Snapshots: 0 total
Time: 2.98 s, estimated 5 s
Ran all test suites.
I looked into this and noticed that while the code shows a 403 response during the tests, the page works fine when tested in Chrome. I am wondering whether it is down to a header the server expects, or something else?
Hey, I have an issue when using this.
I got this error.
fetch error: Error: response code 0
Here is my code.
const handleOnScrapeMetadata = async (event: React.ChangeEvent<HTMLInputElement>) => {
const url = event.target?.value;
const options = {
cache: 'no-cache',
mode: 'no-cors',
timeout: 10000,
descriptionLength: 750,
ensureSecureImageRequest: true,
includeResponseBody: true,
};
console.log('fetching metadata for:', url);
try {
const metadata = await urlMetadata(url, options);
console.log('fetched metadata:', metadata);
} catch (err) {
console.log('fetch error:', err);
}
};
Is there anything I am missing?
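For context (my reading, not confirmed by the maintainers): requesting with mode: 'no-cors' yields an opaque response, and opaque and error-type responses report a status of 0 with an unreadable body, which matches "response code 0" above. Using the default 'cors' mode, or fetching server-side, avoids this.

```typescript
// Sketch: an error-type Response (like an opaque no-cors response)
// reports status 0, matching the "response code 0" error above.
const opaque = Response.error();
console.log(opaque.status); // 0
```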
Hi!
I'm seeing broken characters from this Czech website:
Apparently it's windows-1250.
I haven't changed any decode setting 🤔
Hello,
thank you for nice plugin.
When I use your example url I get nothing:
author: ""
availability: ""
canonical: ""
description: ""
image: ""
keywords: ""
og:description: ""
og:determiner: ""
og:image: ""
og:image:height: ""
og:image:secure_url: ""
og:image:type: ""
og:image:width: ""
og:locale: ""
og:locale:alternate: ""
og:site_name: ""
og:title: ""
og:type: ""
og:url: ""
price: ""
priceCurrency: ""
source: "wearechange.org"
title: null
url: "https://wearechange.org/hero-bullhorn-reads-internet-julian-assange-sidewalk"
Any ideas ?
thank you
When I send this url to the lib, it never responds: https://github.com/comprobamos/dashboard-comprobamos/wiki/Guia-de-comandos
I have used your lib for about a year and this is the first error in all that time.
Access to fetch at 'https://dribbble.com/' from origin 'http://localhost:3000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
domutils/lib/traversal.js:22
exports.getName = function(elem){
RangeError: Maximum call stack size exceeded
at Object.exports.getName
This also occurs with some of the other urlmetadata modules on npm, specifically node-MetaInspector
I'm looking at https://github.com/laurengarcia/url-metadata/blob/master/lib/extract-json-ld.js
Unless I'm reading it wrong, it seems like extracted is being replaced with the last parsed JSON-LD object, instead of returning an array of all JSON-LD items?
I'm testing on this URL:
https://goout.net/cs/metronome-prague-2024/szpsfuw/
The rich results tester detects two items:
https://search.google.com/test/rich-results/result?id=6sn_Xcdp6zfbqC3lP9ywpg
And I'm getting back only one:
{
"jsonLd":
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"item": {
"@id": "https://goout.net/cs/",
"name": "Domů"
}
}
]
}
}
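A sketch of the requested behavior (names illustrative, not the library's actual implementation; the text of each script tag is assumed to be collected upstream): gather every JSON-LD block into an array instead of keeping only the last one parsed.

```typescript
// Sketch: parse each <script type="application/ld+json"> body and
// return ALL results instead of overwriting with the last one.
function extractAllJsonLd(scriptContents: string[]): unknown[] {
  const extracted: unknown[] = [];
  for (const raw of scriptContents) {
    try {
      extracted.push(JSON.parse(raw));
    } catch {
      // skip malformed JSON-LD blocks instead of failing the whole parse
    }
  }
  return extracted;
}
```

With this shape, the goout.net page above would yield both the BreadcrumbList and the second item the rich results tester finds.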
There are issues with Twitter urls. Is this something we can fix somehow?
Hello, I have an issue when trying to fetch metadata for an external URL.
I got this error.
Here is my code.
"use client";
import React, { useState, useEffect } from "react";
function ToolsPage() {
const [metadataImage, setMetadataImage] = useState<string | undefined>(
undefined
);
const urlMetadata = require("url-metadata");
useEffect(() => {
const fetchData = async () => {
try {
const options = {
mode: "no-cors",
includeResponseBody: false,
};
const metadata = await urlMetadata(
"https://youtube.com/",
options
);
const ogImage = metadata["og:image"] || undefined;
setMetadataImage(ogImage);
console.log("fetched metadata:", metadata);
console.log("inside fetchMetadata, ogImage:", ogImage);
} catch (err) {
console.log("fetch error:", err);
}
};
fetchData();
}, []);
console.log("outside fetchMetadata, metadataImage:", metadataImage);
return (
<>
<img
src={`${metadataImage}`}
className="object-cover h-52 w-96 object-center pb-3 rounded-t-lg"
alt=""
/>
</>
);
}
export default ToolsPage;
What do you think about having an option to omit fields that have no value? That would be useful in the code I'm writing. I could use a helper of course, but just wanted to float the idea of it being built into the module, thanks.
I am using url-metadata in my Next.js app at pages/api/getmetadata.js.
Everything works perfectly on my localhost, but if I deploy it to Vercel it gives me a 405 error in the console:
Failed to load resource: the server responded with a status of 405 (Method Not Allowed)
Is it a CORS problem, or a request header problem? Is there a way to modify the request so it doesn't get the 405 error, or is the 405 not due to those things?
Hi Guys,
Thank you so much for your great work.
I'm getting data from AliExpress but am unable to get og:image.
I am also unable to get item data from amazon.com.
Can you please help me find out what the issue is? Is there something I'm missing, or is it a module issue?
When I load a URL like https://goout.net/cs/colours-of-ostrava-2024/szqghcw/, I get the favicon URL https://goout.net/apple-touch-icon-144x144.png%3Fv=1.0, which gives a 404 because the correct URL is https://goout.net/apple-touch-icon-144x144.png?v=1.0.
In the HTML response of the site it's fine, so the URL encoding is happening somewhere later. Maybe some parsing library is doing this.
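A hypothetical workaround until the encoding bug is found (fixFaviconUrl is my name, not a library function): undo the over-encoded query separator in the returned favicon URL.

```typescript
// Sketch: the `?` separating path from query string has been
// percent-encoded to `%3F`; decode just that first occurrence.
function fixFaviconUrl(url: string): string {
  return url.replace(/%3F/i, '?');
}
```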
Changed descriptionLength from 750 to 300 to 100, and it doesn't truncate the description text...