jshemas / opengraphscraper

Node.js scraper service for Open Graph Info and More!

License: MIT License

TypeScript 100.00%

opengraphscraper's Issues

Doesn't seem to handle Amazon partner link redirects?

When I scrape this Amazon URL: https://amzn.to/2Is8sCR, I expect to see the same results as the Facebook debugger, which follows the redirect, but for some reason I always get this:

{
  ogDescription: 'Buy The Anatomy of Story: 22 Steps to Becoming a Master Storyteller Reprint by John Truby (ISBN: 8601200418156) from Amazon\'s Book Store. Everyday low prices and free delivery on eligible orders.',
  ogImage: {
    url: 'https://images-eu.ssl-images-amazon.com/images/G/02/gno/sprites/nav-sprite-global_bluebeacon-V3-1x_optimized._CB516557022_.png'
  },
  ogTitle: 'The Anatomy of Story: 22 Steps to Becoming a Master Storyteller: Amazon.co.uk: John Truby: 8601200418156: Books'

}

Here's what I see

Here's what I want

Thanks, hope somebody can help!

Server Has Ran Into A Error for Vimeo URLs

Hi,

I've been testing your library and it is really helpful! However, I have come across a vague error. In my local dev environment it works fine, but when deployed to the server I get this error. The issue seems to happen only for Vimeo URLs. Any ideas?

Request:

const options = {'url': 'https://vimeo.com/232889838'};
ogs(options, function (error, results) {
  console.log('error:', error);
  console.log('results:', results);
});

Result:

results: { error: 'Page Not Found',
  success: false,
  requestUrl: 'https://vimeo.com/232889838',
  errorDetails: 'Server Has Ran Into A Error' }

Thanks!

Didn't get tags correctly

The scraper did not return the correct information for "http://www.uniqlo.com/jp/store/goods/182567-08", which I used as a sample.

Below is the test result:
{ ogTitle: '{{meta.title}}',
ogType: 'website',
ogUrl: '{{meta.absUrl}}',
ogSiteName: 'ユニクロ(UNIQLO)オンラインストア',
ogDescription: '{{meta.description}}',
ogImage: { url: '{{meta.ogImage}}', width: null, height: null, type: null } },
success: true,
requestUrl: 'http://www.uniqlo.com/jp/store/feature_mb/uq/fe_list/ut/build/kids?quickviewproduct=405988001' }

Add properties

I found the following meta tags. Could you please add them?

<meta property="og:price:amount" content="64.00"/>
<meta property="og:price:currency" content="USD"/>
<meta property="og:availability" content="InStock"/>
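
One way to support extra tags like these without hand-writing each field is to derive the result key from the property name. The sketch below is only an illustration: `toFieldName` is a hypothetical helper, not part of the library, and it assumes the camelCase style seen in results elsewhere on this page (for example ogSiteName).

```javascript
// Hypothetical helper (not the library's API): turns a meta property name
// such as "og:price:amount" into a camelCase result key.
function toFieldName(property) {
  return property
    .split(/[:_]/) // split on the ":" and "_" separators used in property names
    .map((part, i) => (i === 0 ? part : part.charAt(0).toUpperCase() + part.slice(1)))
    .join('');
}

console.log(toFieldName('og:price:amount'));
console.log(toFieldName('og:availability'));
```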

Node.js gets stuck when non-HTML pages are specified

If a page other than HTML is specified and the URL does not end with a file extension, Node.js may get stuck.

For example, the following url:
http://rcc.jp/event/daiku/images/flier.pdf?191101

Unhandled Promise Rejection warning

I am trying to fetch data for a number of links, and it gives the Unhandled Promise Rejection warning.
The code is as follows:

```
for (var i = 0; i < 100; i++) {
  var options = {'url': url};
  ogs(options, function (err, results) {
    if (err) {
      console.log('error')
    } else {
      console.log(results);
    }
  });
}
```

Simple example not working on my site

Basic example is not working with my site:

var ogs = require('open-graph-scraper');
var options = { 'url': 'https://signanthealth.com/careers/' };

ogs(options, function (error, results) {
  console.log('error:', error); // This returns true or false: true if there was an error. The error itself is inside the results object.
  console.log('results:', results);
});

Error:

error: true
results: {
  error: 'Page Not Found',
  success: false,
  requestUrl: 'https://signanthealth.com/careers/',
  errorDetails: Error: incorrect header check
      at Zlib.zlibOnError [as onerror] (zlib.js:170:17) {
    errno: -3,
    code: 'Z_DATA_ERROR'
  },
  response: undefined
}

Page not found

Hi.
Thank you for your work on this library. I am really happy to use it.

Recently I got an issue while trying to get data from https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom.

I got the error:

{
  error: true,
  result: {
    success: false,
    requestUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
    error: 'Page not found',
    errorDetails: Error: Page not found
        at setOptionsAndReturnOpenGraphResults (.../node_modules/open-graph-scraper/lib/openGraphScraper.js:174:13)
        at processTicksAndRejections (internal/process/task_queues.js:85:5)
  }
}

My version of OGS is 4.4.0 and options for the request are:

    headers: {
      'user-agent':
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)'
    },
    timeout: 10000,
    ogImageFallback: false,
    onlyGetOpenGraphInfo: false

Do you have an idea why this might happen?

Return response in promise

Is it possible to return the full response when you're using the library with promises? It seems like that doesn't get returned.

ogs(options)
  .then((results, response) => {
    console.log('results:', results);
    //console.log('response', response);
  })
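
For context, a promise resolves with exactly one value, so a second parameter in a `.then` callback is always undefined. The sketch below shows one way both pieces could be delivered in a single object. It is an assumption-laden illustration: `scrapeWithResponse` and `fetchPage` are hypothetical stand-ins, not the library's functions.

```javascript
// Hypothetical stand-in for a scraper; `fetchPage` is an assumed
// promise-returning function supplied by the caller.
function scrapeWithResponse(options, fetchPage) {
  return fetchPage(options.url).then((response) => ({
    // Bundle everything into one object because resolve() passes on a single value.
    result: { requestUrl: options.url, success: true },
    response,
  }));
}

// Usage: destructure the single resolved value.
scrapeWithResponse({ url: 'http://ogp.me/' }, (url) => Promise.resolve({ body: '<html></html>' }))
  .then(({ result, response }) => {
    console.log('result:', result);
    console.log('response:', response);
  });
```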

Feature request: return content type for non-html requests

Since the request has already been made, it would help to return the content type encountered when scraping resources that aren't HTML (such as images, media, etc).
This could be made as an extra parameter in the results object, so as to not break existing deployments.

Limited response

Using the latest version 4.3.1 and following the "usage example" returns limited results.

result: {
    requestUrl: 'http://ogp.me/',
    success: true
}

How do I get the og stuff?

Can this be used on client side?

Dependency jschardet uses restrictive license

The dependency jschardet uses an LGPL license.
Will open a PR soon for replacing with an MIT licensed project.

Update lodash to fix vulnerabilities

Looks like there's an automated pull request for this issue: #73

Can't resolve 'http2' and Cannot read property 'split' of undefined

Hi, I'm running into some issues using this package. I'm running yarn add open-graph-package and in the code I have:

import OpenGraph from 'open-graph-scraper'

const url = "https://www.test.com"
const options = { url: url }

OpenGraph(options, (error, results, response) => {
  console.log("error:", error)
  console.log("results:", results)
  console.log("response:", response)
})

and I get the following error:

./node_modules/http2-wrapper/source/index.js
Module not found: Can't resolve 'http2' in '/Users/taylorwong/Documents/GitHub/client-dashboard/node_modules/http2-wrapper/source'

I installed http2 with yarn add http2 but then run into the following error:

TypeError: Cannot read property 'split' of undefined
node_modules/@szmarczak/http-timer/dist/source/index.js:4

Let me know if I can provide any additional information to help with this. Thank you!

og:video:url

Thanks for the library. I've noticed that you're missing the alias for video tags, og:video:url. Both YouTube and Vimeo appear to use this over og:video.

https://developers.facebook.com/docs/sharing/webmasters#video

Integration issue

Loading dependency graph, done.
error: bundling: UnableToResolveError: Unable to resolve module url from ../SampleOGP/node_modules/open-graph-scraper/app.js:
Module does not exist in the module map or in these directories:
../SampleOGP/node_modules/open-graph-scraper/node_modules
, ../SampleOGP/node_modules

This might be related to facebook/react-native#4968
To resolve try the following:

Clear watchman watches: watchman watch-del-all.
Delete the node_modules folder: rm -rf node_modules && npm install.
Reset packager cache: rm -fr $TMPDIR/react-* or npm start -- --reset-cache.
at ResolutionRequest._resolveNodeDependency (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:470:11)
at ResolutionRequest.resolveDependency (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:135:29)
at dependencyNames.map.name (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:142:59)
at Array.map (native)
at ResolutionRequest.resolveModuleDependencies (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:142:42)
at module.readFresh.then (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:182:40)
at process._tickCallback (internal/process/next_tick.js:109:7)

build environment:
node: v7.7.3,
react-native: react-native-cli: 2.0.1,
OS: MAC-OS,
react: 16.0.0-alpha.12,
react-native: 0.45.1

no data for medium.com urls from the iOS app

This link doesn't throw an error but also doesn't scrape open graph data.
Anybody else having this problem?

https://link.medium.com/HozE2bvBmU

edit: the link above redirects to this one: https://rsci.app.link/HozE2bvBmU?_p=f4503a44f32dc261648a177f2460

*example

"The above eample will return something like..."

*example

Getting are you a robot in response

I am trying to fetch the metadata of this url:
https://www.bloomberg.com/news/articles/2020-06-24/pentagon-names-20-chinese-firms-it-says-are-military-controlled?cmpid=socialflow-twitter-business&utm_medium=social&utm_content=business&utm_source=twitter&utm_campaign=socialflow-organic

I get this in response:
{ ogTitle: 'Bloomberg - Are you a robot?', ogImage: [] }

I get the correct data when I run it locally, but when I fetch the metadata on my production server, it gives the above response.

Unable to minify OGS

Getting this error when using OGS in my create-react-app: https://github.com/facebook/create-react-app/blob/master/packages/react-scripts/template/README.md#npm-run-build-fails-to-minify

Failed to minify the code from this file: 

 	./node_modules/open-graph-scraper/lib/openGraphScraper.js:46

Because code is < ES5?

Issue with handling encoding

In the recently added encoding handling, you check options.encoding === null to decide whether to detect the charset, but the option is undefined by default, so the detection never runs. It should either have a default set, or the check should be if (options.encoding === null || options.encoding === undefined)
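
The check suggested above can also be written with loose equality, since `== null` matches both null and undefined and nothing else. A minimal sketch, where `shouldDetectCharset` is a hypothetical name chosen for illustration:

```javascript
// Hypothetical helper name; it mirrors the check described in this issue.
function shouldDetectCharset(options) {
  // `== null` is true for null and undefined only, not for '' or 0.
  return options.encoding == null;
}

console.log(shouldDetectCharset({}));
console.log(shouldDetectCharset({ encoding: 'utf8' }));
```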

How to scrape author and publisher meta info

Hi
How can I scrape author, publisher, and similar meta info?
It would be better to be able to set custom meta keys for scraping.
Is there any option for this?
If so, please let me know.

Thanks.

Not installing with node 7

Because the engines field in package.json is set to Node 6.x, yarn refuses to install the package.
Please fix this.

Add Twitter Name Meta Tags Support

Hello,

This is not an issue necessarily but a feature request. I am wondering if it would be easy to add support for Twitter meta tags to show up in the scrape. I am trying to detect if a URL has twitter cards enabled (see: https://dev.twitter.com/cards/types/summary). I already use this library for scraping other open graph information and it would be extremely awesome if there was an option that you could set to opt in to get those Twitter values.

Thank you.

cors!!!

error: XMLHttpRequest cannot load https://www.naver.com/. Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'localhost:8000' is therefore not allowed access. The response had HTTP status code 404.

What should I do?

Access for Express, returned Bad Request URL valid

Hi, guys.
I ran script.js with the example code:

var ogs = require('open-graph-scraper');
var options = {'url': 'https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf'};
ogs(options, function (err, results) {
    console.log('err:', err);
    console.log('results:', results);
});

This returned all of the metadata.

But when I ran it from Express, it returned Bad Request.

exports.metatag = function (req, res) {
  var url = req.query.url;

  var options = {'url': url, timeout: 4000};
  ogs(options, function (err, results) {
    if (err) return res.status(400).json({results: results, err: err});

    return res.status(200).json(results);
  });
};

What happened?

url: https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf

Update lodash to fix vulnerability

Npm audit is reporting a vulnerability; fixed in lodash >=4.17.12.
Auto pull request here: #78

Medium short links are not supported

Hey! Thanks a lot for your library, I am totally enjoying it and using in my telegram bot application. Stumbled on an issue and wanted to get your opinion:

https://link.medium.com/O8Wg18lGwW

The above link returns no data when the promise resolves, but success is still true. It redirects to a completely different page, and some Open Graph checkers yield no results for it, although others return all the correct data. Is this a known issue, or am I just doing something wrong?

Thanks a lot for your help and time. Cheers!

Suggestion: add retry functionality

Let me start off with: great package! A happy user here :)

Now my problem: we make multiple requests at the same time and run into ESOCKETTIMEDOUT errors. I tried some tricks to reduce them (for example increasing UV_THREADPOOL_SIZE), but cannot get rid of them completely. I am now building functionality to retry after a few seconds, with a maximum retry setting. It would be awesome if this package had an option we could set to do the same.

Good idea? I have seen some request packages that add this functionality for recoverable errors to request as well.
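
Until such an option exists, the retry described above can live in a small wrapper around any promise-returning call. This is a sketch under stated assumptions: `withRetry` and its option names (`retries`, `delayMs`, `isRecoverable`) are hypothetical, not settings of this package.

```javascript
// Hypothetical helper: retries any promise-returning function.
// `retries`, `delayMs`, and `isRecoverable` are illustrative option names.
async function withRetry(task, { retries = 3, delayMs = 1000, isRecoverable = () => true } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      // Stop on the last attempt, or for errors that are not worth retrying.
      if (attempt === retries || !isRecoverable(error)) break;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A caller could restrict retries to timeouts with `isRecoverable: (error) => error.code === 'ESOCKETTIMEDOUT'`, assuming the error object exposes the code that way.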

utils.isThisANonHTMLUrl causing false positives on domains that start with invalid image type strings

For example: https://www.target.com.au/path-to-product will fail as .tar is one of the invalid image types. This causes the Must scrape an HTML page error.

Looking at the code, I assume this will happen for any URL that has one of the invalid image types in its domain or subdomain, e.g. https://products.txtshop.com/product
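
One possible direction, sketched here and not taken from the PR, is to test the extension of the URL path only, so that the host name never matches. `looksLikeNonHtmlUrl` and the short extension list are hypothetical and purely illustrative.

```javascript
// Hypothetical replacement check; the extension list is a short illustrative sample.
const NON_HTML_EXTENSIONS = new Set(['pdf', 'png', 'jpg', 'tar', 'txt', 'zip']);

function looksLikeNonHtmlUrl(rawUrl) {
  // Only the path is inspected, so the host, query string, and hash are ignored.
  const { pathname } = new URL(rawUrl);
  const lastSegment = pathname.split('/').pop();
  const dot = lastSegment.lastIndexOf('.');
  if (dot === -1) return false;
  return NON_HTML_EXTENSIONS.has(lastSegment.slice(dot + 1).toLowerCase());
}

console.log(looksLikeNonHtmlUrl('https://www.target.com.au/path-to-product'));
console.log(looksLikeNonHtmlUrl('https://products.txtshop.com/product'));
```

Because the query string is ignored, a URL such as one ending in flier.pdf?191101 would still be recognised by its .pdf path extension.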

If I get a spare few mins I'll submit a PR with some sort of solution.

Just thought I'd make you aware of the issue :)

Update: PR submitted: #108

Steve

forbes.com articles always show the home page OG data

Forbes first shows the home page and redirects to the article page after a few seconds, so openGraphScraper loads the home page OG tags. Facebook, for example, shows the proper OG tags when I use the article link.

provide a way to limit the data fetched?

If the URL points to a large file (a link to a document, for example), the scraper downloads the whole file before parsing the content. Since we are using cheerio, we can probably provide a way to specify the maximum number of bytes fetched, and still be able to get Open Graph tags from partial content?
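
The byte cap suggested above could look roughly like the sketch below. It assumes the response body is available as an async-iterable stream (as Node.js readable streams are) and that the Open Graph tags appear early enough in the document to survive the cut. `readCapped` is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: reads at most `maxBytes` from any async-iterable byte stream.
async function readCapped(stream, maxBytes) {
  const chunks = [];
  let total = 0;
  for await (const chunk of stream) {
    const buffer = Buffer.from(chunk);
    const remaining = maxBytes - total;
    // Keep only the part of this chunk that still fits under the cap.
    chunks.push(buffer.subarray(0, remaining));
    total += Math.min(buffer.length, remaining);
    // Leaving the loop early ends iteration, so no further chunks are requested.
    if (total >= maxBytes) break;
  }
  return Buffer.concat(chunks);
}
```

The truncated buffer could then be decoded and handed to the HTML parser in place of the full body.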

twitterImg.url incorrect

So when I try to scrape https://github.com, I get:

"twitterImage": {
            "url": "1200",
            "width": "1200"
        }

whereas:

      <meta property="twitter:image:src" content="https://assets-cdn.github.com/images/modules/open_graph/github-logo.png">
      <meta property="twitter:image:width" content="1200">
      <meta property="twitter:image:height" content="1200">

I see the following problems:

twitter:image:height did not get scraped
probably twitter:image:src got scraped instead of twitter:image:height
Why is there twitterImage.url rather than twitterImage.src, which would be more consistent with twitter:image:src (there is no twitter:image:url after all)?
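
To make the expected result concrete, here is a sketch of grouping the three tags into one object by property name instead of by position. The input shape (a list of `{ property, content }` pairs) and the `collectTwitterImage` helper are assumptions for illustration, not the library's internals.

```javascript
// Hypothetical grouping step; `tags` is an assumed list of { property, content } pairs.
const TWITTER_IMAGE_FIELDS = new Map([
  ['twitter:image', 'url'],
  ['twitter:image:src', 'url'],
  ['twitter:image:width', 'width'],
  ['twitter:image:height', 'height'],
]);

function collectTwitterImage(tags) {
  const image = {};
  for (const { property, content } of tags) {
    const field = TWITTER_IMAGE_FIELDS.get(property);
    // Keep the first value seen for each field so later tags cannot overwrite it.
    if (field && image[field] === undefined) image[field] = content;
  }
  return image;
}

console.log(collectTwitterImage([
  { property: 'twitter:image:src', content: 'https://assets-cdn.github.com/images/modules/open_graph/github-logo.png' },
  { property: 'twitter:image:width', content: '1200' },
  { property: 'twitter:image:height', content: '1200' },
]));
```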

RangeError: Maximum call stack size exceeded

.../node_modules/domutils/lib/querying.js:83
function findAll(test, elems){
^

RangeError: Maximum call stack size exceeded

occurs when calling ogs on a URL that points to a large image, e.g.

https://upload.wikimedia.org/wikipedia/commons/a/a2/Overlook_Hong_Kong_Island_north_coast,_Victoria_Harbour_and_Kowloon_from_middle_section_of_Lugard_Road_at_daytime_(enlarged_version_and_better_contrast,_revised).jpg

Consider starting a changelog or using tags/releases

Hi! Thanks for the scraper, it gets the job done so far! I've been using it for about a year now, and after you released your last version I looked for a changelog and didn't find one. Because of this, the change from err to error in the error response was unexpected for me and led to some crashes. I still have no idea if anything else changed.

Could you please provide a summary of changes between previous and current version, and perhaps start a "CHANGELOG.md" file?

Timeout for some urls

Nevermind, delete this.

[email protected] is not published

the most recent version is [email protected] which lacks the useful html option. help!

Page not found error

Using the following URL I'm getting a page not found error but when I navigate to the URL in my browser there is no issue. Is there a reason for this? I've had it working with other URLs.

URL: http://www.wemeanbusinesslondon.com/blog/2016/5/10/the-entrepreneur-spiration-series-going-nuts-for-pip-nut

Cannot read property 'startsWith' of undefined

I am having trouble using openGraphScraper. Even with the default example I get an error. I have tried different versions from 4.00 to 4.2.0 but received the same error. I have also tried removing node_modules and package-lock.json. I am using Node 10.16.3.
Using this example

const ogs = require('open-graph-scraper');
const options = { url: 'https://www.npmjs.com/package/open-graph-scraper' };
ogs(options)
  .then((data) => {
    const { error, result, response } = data;
    console.log('error:', error);  // This returns true or false: true if there was an error. The error itself is inside the results object.
    console.log('result:', result); // This contains all of the Open Graph results
    console.log('response:', response); // This contains the HTML of page
  })

I am getting this error

   { error: true,
    result: {
      "success": false,
      "requestUrl": "https://www.npmjs.com/package/open-graph-scraper",
      "error": "Cannot read property 'startsWith' of undefined",
      "errorDetails": {
        "name": "RequestError"
      }
    }}

Twitter scraping is not working

Doesn't pull title, image etc.

Ex:
https://twitter.com/MarkRober/status/1256956499286298625

{
  ogSiteName: 'Twitter',
  requestUrl: 'https://twitter.com/MarkRober/status/1256956499286298625',
  success: true
}

Pass through validator options in oGS options

~~Hey, I'm just trying to understand why http://localhost:3000 is invalid but 127.0.0.1 is valid? Is it the require_tld that it fails on in validator?~~

openGraphScraper/tests/unit/utils.spec.js

Line 79 in 3fb4fe1

'http://localhost:3000/',

Before I opened this I found a mention in validator about exactly this, you can disable require_tld to get localhost to validate.

validatorjs/validator.js#675

Is there space here to allow for passing in validator options through the oGS options? i.e.

    const options = { 
      url: 'http://localhost:3000',
      validator: {
        require_tld: false
      }
    }

    ogs(options)
      .then((data) => {
        const { error, result } = data
        if (error) console.log('error:', error)
        console.log('result:', result)
      })
      .catch((error) => {
        consola.error(error)
      })

WDYT? I can give a PR a go to add this.

content-type text/html misread

Sometimes the scraper decides that a response is not text/html, even though it is.
For example, with https://www.namecheap.com/, a very popular DNS provider,
it should be able to return the OG info, but it fails with an error.
There is a check in the code:
else if (!(response && response.headers && response.headers['content-type'] && response.headers['content-type'].indexOf('text/html') !== -1)) {
callback('Must scrape an HTML page', null, response);

That should be relaxed somehow, or cater for scenarios like the example above
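
As one possible relaxation, the header could be reduced to its media type and compared case-insensitively, since media type names are not case-sensitive. Whether this resolves the namecheap.com case is not confirmed here. `isHtmlContentType` is a hypothetical helper, and accepting application/xhtml+xml is an extra assumption.

```javascript
// Hypothetical relaxed check; the accepted types are an assumption, not the library's list.
function isHtmlContentType(headerValue) {
  if (typeof headerValue !== 'string') return false;
  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const mediaType = headerValue.split(';')[0].trim().toLowerCase();
  return mediaType === 'text/html' || mediaType === 'application/xhtml+xml';
}

console.log(isHtmlContentType('Text/HTML; charset=UTF-8'));
console.log(isHtmlContentType('application/pdf'));
```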

Feature request: return html markup

It would be very useful to be able to receive the full HTML source of the remote site, e.g. as the callback's third parameter, like:

ogs(options, function (err, results, source) {
    console.log('source:', source); // This would be full html source of remote site
});

How to avoid mixed content when pulling og-image

When an OG-image is served over HTTP instead of HTTPS the browser complains about mixed content. Is there a way to avoid that?
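
One common workaround is to rewrite the image URL to HTTPS before rendering it. This only helps when the image host also serves the same path over HTTPS, which is an assumption that has to be checked per host. Proxying the image through your own HTTPS server is the alternative. `toHttps` below is a hypothetical helper, not a library option.

```javascript
// Hypothetical helper: upgrades http image URLs, assuming the host also serves HTTPS.
function toHttps(imageUrl) {
  const parsed = new URL(imageUrl);
  if (parsed.protocol === 'http:') parsed.protocol = 'https:';
  return parsed.toString();
}

console.log(toHttps('http://example.com/images/cover.png'));
```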

ReferenceError: errorFlag is not defined

I tried this on Runkit as well as my local machine and got the following error

ReferenceError: errorFlag is not defined
at parseGoogleNewsRSSData in googlenews-rss-scraper/index.js — line 75

Feature request: parse from string rather than url

The module fails for the following link:

http://www.nytimes.com/2016/09/01/arts/design/gallery-hopes-to-sell-kanye-wests-famous-sculpture-for-4-million.html?_r=0

This is because it does multiple redirects:

curl -v -L http://nytimes.com/2016/09/01/arts/design/gallery-hopes-to-sell-kanye-wests-famous-sculpture-for-4-million.html?_r=0
...
Connection #7 to host myaccount.nytimes.com left intact

So I can use curl for getting the html contents myself, but I would like to use the scraper to parse from string.

Error param in callback is not an error object

Hello,

Really liking openGraphScraper and finding it useful in my project, thanks for making it.

When I get the occasional error from a graph request, an error object is passed, but it's not a standard JavaScript error. The documentation shows dealing with an error like this...

console.log("err:",err);

But this will output as err: err when there is an error.

Is there any chance you could throw a more meaningful error message or give a clearer indication of what is happening when the requested webpage has a problem?

Thanks again for making openGraphScraper,
/t

Return original request error

Hi,

it would be very useful if any errors of the internal request could be passed back in the callback.

Example:

The internal request has the following error:

{ Error: self signed certificate in certificate chain
    at TLSSocket.<anonymous> (_tls_wrap.js:1088:38)
    at emitNone (events.js:86:13)
    at TLSSocket.emit (events.js:188:7)
    at TLSSocket._finishInit (_tls_wrap.js:610:8)
    at TLSWrap.ssl.onhandshakedone (_tls_wrap.js:440:38) code: 'SELF_SIGNED_CERT_IN_CHAIN' }

The error passed in the callback:

true

Troubleshooting is not possible in this way, without adding logs to the module :-(

Twitter doesn't return the image etc.

It is because of this.
They respond with this page.