jshemas / openGraphScraper
Node.js scraper service for Open Graph Info and More!
License: MIT License
I tried this on Runkit as well as my local machine and got the following error
ReferenceError: errorFlag is not defined
at parseGoogleNewsRSSData in googlenews-rss-scraper/index.js — line 75
Nevermind, delete this.
So when I try to scrape https://github.com, I get:
"twitterImage": {
"url": "1200",
"width": "1200"
}
whereas:
<meta property="twitter:image:src" content="https://assets-cdn.github.com/images/modules/open_graph/github-logo.png">
<meta property="twitter:image:width" content="1200">
<meta property="twitter:image:height" content="1200">
I see the following problems:

- twitter:image:height did not get scraped
- twitter:image:src got scraped instead of twitter:image:height
- the value ends up in twitterImage.url rather than twitterImage.src, which would be more consistent with twitter:image:src (there is no twitter:image:url, after all)

This link doesn't throw an error but also doesn't scrape open graph data.
Anybody else having this problem?
https://link.medium.com/HozE2bvBmU
edit: the link above redirects to this one: https://rsci.app.link/HozE2bvBmU?_p=f4503a44f32dc261648a177f2460
Hi,
I've been testing your library and it is really helpful! However, I came across a vague error. When testing from my local dev environment it works fine, but when deployed to the server I get this error. The issue seems to happen only for Vimeo URLs. Any ideas?
Request:
const options = {'url': 'https://vimeo.com/232889838'};
ogs(options, function (error, results) {
console.log('error:', error);
console.log('results:', results);
});
Result:
results: { error: 'Page Not Found',
success: false,
requestUrl: 'https://vimeo.com/232889838',
errorDetails: 'Server Has Ran Into A Error' }
Thanks!
The dependency jschardet uses an LGPL license.
Will open a PR soon to replace it with an MIT-licensed project.
Getting this error when using OGS in my create-react-app: https://github.com/facebook/create-react-app/blob/master/packages/react-scripts/template/README.md#npm-run-build-fails-to-minify
Failed to minify the code from this file:
./node_modules/open-graph-scraper/lib/openGraphScraper.js:46
Is this because the code is not ES5?
Using the following URL I'm getting a page not found error but when I navigate to the URL in my browser there is no issue. Is there a reason for this? I've had it working with other URLs.
Forbes first shows the home page and after a few seconds redirects to the article page, so openGraphScraper loads the home page OG tags, but Facebook (for example) shows the proper OG tags when I use the article link.
Looks like there's an automated pull request for this issue: #73
Hi
How can I scrape author, publisher, and other meta info?
It would be better to be able to set custom meta keys for scraping.
Is there an option for this?
If so, please let me know.
Thanks.
Since engines in package.json is configured for node 6.x, yarn refuses to install the package.
Please fix this.
Hey, I'm just trying to understand why http://localhost:3000 is invalid but 127.0.0.1 is valid. Is it the require_tld check that it fails on in validator?
openGraphScraper/tests/unit/utils.spec.js
Line 79 in 3fb4fe1
Before I opened this I found a mention in validator about exactly this, you can disable require_tld to get localhost to validate.
Is there space here to allow for passing in validator options through the oGS options? i.e.
const options = {
url: 'http://localhost:3000',
validator: {
require_tld: false
}
}
ogs(options)
.then((data) => {
const { error, result } = data
if (error) console.log('error:', error)
console.log('result:', result)
})
.catch((error) => {
consola.error(error)
})
WDYT? I can give a PR a go to add this.
Loading dependency graph, done.
error: bundling: UnableToResolveError: Unable to resolve module url from ../SampleOGP/node_modules/open-graph-scraper/app.js: Module does not exist in the module map or in these directories:
../SampleOGP/node_modules/open-graph-scraper/node_modules, ../SampleOGP/node_modules
This might be related to facebook/react-native#4968
To resolve, try the following:
1. Clear watchman watches: watchman watch-del-all
2. Delete the node_modules folder: rm -rf node_modules && npm install
3. Reset packager cache: rm -fr $TMPDIR/react-* or npm start -- --reset-cache
Build environment:
node: v7.7.3
react-native-cli: 2.0.1
OS: macOS
react: 16.0.0-alpha.12
react-native: 0.45.1
"The above eample will return something like..."
*example
Using the latest version 4.3.1 and following the "usage example" returns limited results.
result: {
requestUrl: 'http://ogp.me/',
success: true
}
How do I get the og stuff?
Hello,
This is not an issue necessarily but a feature request. I am wondering if it would be easy to add support for Twitter meta tags to show up in the scrape. I am trying to detect if a URL has twitter cards enabled (see: https://dev.twitter.com/cards/types/summary). I already use this library for scraping other open graph information and it would be extremely awesome if there was an option that you could set to opt in to get those Twitter values.
Thank you.
Currently, with your recent additions to handle encoding, you check options.encoding === null to decide whether to detect the charset, but by default this is undefined, so the detection doesn't run. Either a default should be set, or the check should be if (options.encoding === null || options.encoding === undefined).
.../node_modules/domutils/lib/querying.js:83
function findAll(test, elems){
^
RangeError: Maximum call stack size exceeded
occurs when calling ogs on a url that contains large image, e.g.
I am trying to fetch the metadata of this url:
https://www.bloomberg.com/news/articles/2020-06-24/pentagon-names-20-chinese-firms-it-says-are-military-controlled?cmpid=socialflow-twitter-business&utm_medium=social&utm_content=business&utm_source=twitter&utm_campaign=socialflow-organic
I get this in response:
{ ogTitle: 'Bloomberg - Are you a robot?', ogImage: [] }
I get the correct data when I run it locally, but when I try to fetch the metadata in production, it gives the above response.
Hi, I'm running into some issues using this package. I ran yarn add open-graph-scraper, and in the code I have:
import OpenGraph from 'open-graph-scraper'
const url = "https://www.test.com"
const options = { url: url }
OpenGraph(options, (error, results, response) => {
console.log("error:", error)
console.log("results:", results)
console.log("response:", response)
})
and I get the following error:
./node_modules/http2-wrapper/source/index.js
Module not found: Can't resolve 'http2' in '/Users/taylorwong/Documents/GitHub/client-dashboard/node_modules/http2-wrapper/source'
I installed http2 with yarn add http2
but then run into the following error:
TypeError: Cannot read property 'split' of undefined
node_modules/@szmarczak/http-timer/dist/source/index.js:4
Let me know if I can provide any additional information to help with this. Thank you!
Didn't get information correctly from "http://www.uniqlo.com/jp/store/goods/182567-08".
Here is a sample. Below is the test result:
{ ogTitle: '{{meta.title}}',
  ogType: 'website',
  ogUrl: '{{meta.absUrl}}',
  ogSiteName: 'ユニクロ(UNIQLO)オンラインストア',
  ogDescription: '{{meta.description}}',
  ogImage: { url: '{{meta.ogImage}}', width: null, height: null, type: null },
  success: true,
  requestUrl: 'http://www.uniqlo.com/jp/store/feature_mb/uq/fe_list/ut/build/kids?quickviewproduct=405988001' }
Hi, Guys.
I executed script.js with code example
var ogs = require('open-graph-scraper');
var options = {'url': 'https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf'};
ogs(options, function (err, results) {
console.log('err:', err);
console.log('results:', results);
});
It returned all the metadata.
But when I executed it from Express, it returned Bad Request.
exports.metatag = function(req, res){
var url = req.query.url;
var options = {'url': url, timeout: 4000};
ogs(options, function (err, results) {
if(err) return res.status(400).json({results: results, err: err});
return res.status(200).json(results);
});
}
What happened?
Hi! Thanks for the scraper, it gets the job done so far! I've been using it for about a year now, and after you released your last version I looked for a changelog and didn't see one. Because of this, the change from err to error in the error response was unexpected for me and led to some crashes. I still have no idea if anything else changed.
Could you please provide a summary of changes between previous and current version, and perhaps start a "CHANGELOG.md" file?
It would be very useful to be able to receive full html source of remote site, e.g. as third callback's parameter, like:
ogs(options, function (err, results, source) {
console.log('source:', source); // This would be full html source of remote site
});
Sometimes it thinks a response is not text/html, even though it is.
For example, with https://www.namecheap.com/, a very popular DNS provider.
It should be able to return the ogs info, but it fails with an error.
There is a check in the code:
else if (!(response && response.headers && response.headers['content-type'] && response.headers['content-type'].indexOf('text/html') !== -1)) {
callback('Must scrape an HTML page', null, response);
That check should be relaxed somehow, or should cater for scenarios like the example above.
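One way to relax it, sketched as a hypothetical helper (not the library's actual code): accept text/html explicitly, attempt the scrape when no content-type header is present, and only reject types that clearly cannot contain meta tags:

```javascript
// Hypothetical relaxed content-type check (a sketch, not the library's code).
function looksScrapeable(response) {
  const type = response && response.headers && response.headers['content-type'];
  if (!type) return true; // no header: attempt the scrape anyway
  if (type.indexOf('text/html') !== -1) return true;
  // Reject types that clearly cannot contain meta tags.
  return !/^(image|audio|video|application\/octet-stream)/.test(type);
}
```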
The most recent version is [email protected], which lacks the useful html option. Help!
error:XMLHttpRequest cannot load https://www.naver.com/. Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'localhost:8000' is therefore not allowed access. The response had HTTP status code 404.
What should I do?
Basic example is not working with my site:
var ogs = require('open-graph-scraper');
var options = { 'url': 'https://signanthealth.com/careers/' };
ogs(options, function (error, results) {
console.log('error:', error); // Returns true or false. True if there was an error; the error itself is inside the results object.
console.log('results:', results);
});
Error:
error: true
results: {
error: 'Page Not Found',
success: false,
requestUrl: 'https://signanthealth.com/careers/',
errorDetails: Error: incorrect header check
at Zlib.zlibOnError [as onerror] (zlib.js:170:17) {
errno: -3,
code: 'Z_DATA_ERROR'
},
response: undefined
}
For example: https://www.target.com.au/path-to-product will fail, as .tar is one of the invalid image types. This causes the Must scrape an HTML page error.
Looking at the code, I assume this will happen for any URL that has one of the invalid image types in its domain or sub-domain, e.g. https://products.txtshop.com/product
If I get a spare few mins I'll submit a PR with some sort of solution.
Just thought I'd make you aware of the issue :)
Update: PR submitted: #108
Steve
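The pitfall above can be sketched like this (hypothetical helpers, assuming the check is a plain substring match): '.tar' occurs inside the hostname 'www.target.com.au', so a substring test fires on the domain, while testing only the end of the URL's pathname does not:

```javascript
// Naive check (illustrating the bug): matches '.tar' anywhere in the URL,
// including inside the hostname 'www.target.com.au'.
const naiveHasExtension = (url, ext) => url.includes(ext);

// Safer check: only look at how the pathname ends.
const pathEndsWithExtension = (url, ext) => new URL(url).pathname.endsWith(ext);
```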
Hi.
Thank you for your work on this library. I am really happy to use it.
Recently I got an issue while trying to get data from https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom.
I got the error:
{
error: true,
result: {
success: false,
requestUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
error: 'Page not found',
errorDetails: Error: Page not found
at setOptionsAndReturnOpenGraphResults (.../node_modules/open-graph-scraper/lib/openGraphScraper.js:174:13)
at processTicksAndRejections (internal/process/task_queues.js:85:5)
}
}
My version of OGS is 4.4.0 and options for the request are:
{
  headers: {
    'user-agent':
      'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)'
  },
  timeout: 10000,
  ogImageFallback: false,
  onlyGetOpenGraphInfo: false
}
Do you have an idea why this might happen?
Hi,
it would be very useful if any errors of the internal request could be passed back in the callback.
Example:
The internal request has the following error:
{ Error: self signed certificate in certificate chain
at TLSSocket.<anonymous> (_tls_wrap.js:1088:38)
at emitNone (events.js:86:13)
at TLSSocket.emit (events.js:188:7)
at TLSSocket._finishInit (_tls_wrap.js:610:8)
at TLSWrap.ssl.onhandshakedone (_tls_wrap.js:440:38) code: 'SELF_SIGNED_CERT_IN_CHAIN' }
The error passed in the callback:
true
Troubleshooting is not possible in this way, without adding logs to the module :-(
Doesn't pull title, image etc.
Ex:
https://twitter.com/MarkRober/status/1256956499286298625
{
ogSiteName: 'Twitter',
requestUrl: 'https://twitter.com/MarkRober/status/1256956499286298625',
success: true
}
Hello,
Really liking openGraphScraper and finding it useful in my project, thanks for making it.
When I get the occasional error from a graph request, an error object is passed, but it's not a standard JavaScript error. The documentation shows dealing with an error like this...
console.log("err:", err);
But this will output as err: err when there is an error.
Is there any chance you could throw a more meaningful error message or give a clearer indication of what is happening when the requested webpage has a problem?
Thanks again for making openGraphScraper,
/t
Since the request has already been made, it would help to return the content type encountered when scraping resources that aren't HTML (such as images, media, etc).
This could be made as an extra parameter in the results object, so as to not break existing deployments.
If the url points to a large file (a link to a document, for example), it downloads the whole file before parsing the content. Since we are using cheerio, could we provide a way to specify the maximum bytes fetched and still get the open graph tags from partial content?
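A sketch of the idea using Node streams (hypothetical helper, assuming the og: meta tags sit near the top of the document in the head): collect at most maxBytes from the response stream, then destroy it so the rest is never downloaded:

```javascript
// Hypothetical helper: resolve with at most `maxBytes` of the stream's
// content, destroying the stream once the limit is reached.
function readLimited(stream, maxBytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let size = 0;
    stream.on('data', (chunk) => {
      const buf = Buffer.from(chunk);
      chunks.push(buf);
      size += buf.length;
      if (size >= maxBytes) {
        stream.destroy(); // stop downloading the remainder
        resolve(Buffer.concat(chunks).slice(0, maxBytes).toString());
      }
    });
    stream.on('end', () => resolve(Buffer.concat(chunks).toString()));
    stream.on('error', reject);
  });
}
```

Cheerio could then parse the truncated prefix; the head section, where the og: tags live, is usually within the first few kilobytes.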
Npm audit is reporting a vulnerability; fixed in lodash >=4.17.12.
Auto pull request here: #78
When an OG-image is served over HTTP instead of HTTPS the browser complains about mixed content. Is there a way to avoid that?
The module fails for the following link:
This is because it does multiple redirects:
curl -v -L http://nytimes.com/2016/09/01/arts/design/gallery-hopes-to-sell-kanye-wests-famous-sculpture-for-4-million.html?_r=0
...
Connection #7 to host myaccount.nytimes.com left intact
So I can use curl for getting the html contents myself, but I would like to use the scraper to parse from string.
Let me start off with: great package! A happy user here :)
Then my problem: we are doing multiple requests at the same time and running into some ESOCKETTIMEDOUT errors. I did some tricks to improve this (for example increasing UV_THREADPOOL_SIZE), but cannot completely get rid of them. I am now building functionality to retry after some seconds, with a max-retry setting. It would be awesome if there was a property in this package we could set to do the same.
Good idea? I have seen some packages that add this kind of retry for recoverable errors on top of request as well.
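A retry option along these lines could be sketched as a wrapper around any promise-returning call (hypothetical names; a sketch, not the package's API):

```javascript
// Hypothetical retry wrapper: retry `fn` up to `maxRetries` extra times,
// waiting `delayMs` between attempts, e.g. for ESOCKETTIMEDOUT errors.
async function withRetry(fn, { maxRetries = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Usage would then look like withRetry(() => ogs(options), { maxRetries: 2, delayMs: 500 }).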
Having trouble using openGraphScraper; even with the default example I am getting an error. I have tried different versions from 4.0.0 to 4.2.0 but received the same error. I have also tried removing node_modules and package-lock.json. I am using node 10.16.3.
Using this example
const ogs = require('open-graph-scraper');
const options = { url: 'https://www.npmjs.com/package/open-graph-scraper' };
ogs(options)
.then((data) => {
const { error, result, response } = data;
console.log('error:', error); // Returns true or false. True if there was an error; the error itself is inside the result object.
console.log('result:', result); // This contains all of the Open Graph results
console.log('response:', response); // This contains the HTML of the page
})
I am getting this error
{ error: true,
result: {
"success": false,
"requestUrl": "https://www.npmjs.com/package/open-graph-scraper",
"error": "Cannot read property 'startsWith' of undefined",
"errorDetails": {
"name": "RequestError"
}
}}
Hey! Thanks a lot for your library, I am totally enjoying it and using in my telegram bot application. Stumbled on an issue and wanted to get your opinion:
https://link.medium.com/O8Wg18lGwW
The above link appears to have no data on promise resolve, but success is still true. It redirects to a completely different page, and some Open Graph checkers yield no results for it, although some return all the correct data. Is this a known issue, or am I just doing something wrong?
Thanks a lot for your help and time. Cheers!
I am trying to fetch data for a number of links, and it is giving an Unhandled Promise Rejection warning.
Following is the code:
```
for (var i = 0; i < 100; i++) {
  var options = { 'url': url };
  ogs(options, function (err, results) {
    if (err) {
      console.log('error');
    } else {
      console.log(results);
    }
  });
}
```
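One way to avoid the warning, sketched with a hypothetical helper (treating the scraper as any promise-returning function, and assuming links is your array of URLs): settle all requests together so every rejection gets a handler:

```javascript
// Hypothetical batch helper: Promise.allSettled attaches a handler to every
// promise, so no rejection goes unhandled, and failures are reported per URL.
async function scrapeAll(links, scrape) {
  const settled = await Promise.allSettled(links.map((url) => scrape({ url })));
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { url: links[i], results: outcome.value }
      : { url: links[i], error: outcome.reason }
  );
}
```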
If a page other than HTML is specified and the URL does not end with an extension, Node.js may get stuck.
For example, the following URL:
http://rcc.jp/event/daiku/images/flier.pdf?191101
Thanks for the library. I've noticed that you're missing the alias for the video tags, og:video:url. Both YouTube and Vimeo appear to use this over og:video.
https://developers.facebook.com/docs/sharing/webmasters#video
I found the following meta tags. Could you please add them?
<meta property="og:price:amount" content="64.00"/>
<meta property="og:price:currency" content="USD"/>
<meta property="og:availability" content="InStock"/>
When I scrape this Amazon url: https://amzn.to/2Is8sCR, I expect to see the same results as the Facebook debugger (here) which follows a redirect to here, but for some reason I always get this:
{
ogDescription: 'Buy The Anatomy of Story: 22 Steps to Becoming a Master Storyteller Reprint by John Truby (ISBN: 8601200418156) from Amazon\'s Book Store. Everyday low prices and free delivery on eligible orders.',
ogImage: {
url: 'https://images-eu.ssl-images-amazon.com/images/G/02/gno/sprites/nav-sprite-global_bluebeacon-V3-1x_optimized._CB516557022_.png'
},
ogTitle: 'The Anatomy of Story: 22 Steps to Becoming a Master Storyteller: Amazon.co.uk: John Truby: 8601200418156: Books'
}
Thanks, hope somebody can help!
Is it possible to return the full response when you're using the library with promises? It seems like that doesn't get returned.
ogs(options)
.then((results, response) => {
console.log('results:', results);
//console.log('response', response);
})
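A .then callback only ever receives one fulfillment value, so the second parameter above is always undefined; multiple values have to arrive packed in a single object. A sketch with a stand-in promise (fakeOgs is hypothetical, not the library):

```javascript
// Stand-in for the library's promise: everything arrives on ONE object.
const fakeOgs = () =>
  Promise.resolve({ error: false, result: { ogTitle: 'demo' }, response: '<html></html>' });

fakeOgs().then((data, second) => {
  // `second` is always undefined; destructure the single object instead.
  const { result, response } = data;
  console.log(result.ogTitle, response);
});
```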