opengraphscraper's Issues

Doesn't seem to handle Amazon partner link redirects?

When I scrape this Amazon url: https://amzn.to/2Is8sCR, I expect to see the same results as the Facebook debugger (here) which follows a redirect to here, but for some reason I always get this:

{
  ogDescription: 'Buy The Anatomy of Story: 22 Steps to Becoming a Master Storyteller Reprint by John Truby (ISBN: 8601200418156) from Amazon\'s Book Store. Everyday low prices and free delivery on eligible orders.',
  ogImage: {
    url: 'https://images-eu.ssl-images-amazon.com/images/G/02/gno/sprites/nav-sprite-global_bluebeacon-V3-1x_optimized._CB516557022_.png'
  },
  ogTitle: 'The Anatomy of Story: 22 Steps to Becoming a Master Storyteller: Amazon.co.uk: John Truby: 8601200418156: Books'

}

Here's what I see

seen

Here's what I want

wanted

Thanks, hope somebody can help!

Server Has Ran Into A Error for Vimeo URLs

Hi,

I've been testing your library and it is really helpful! However, I have come across a vague error. When testing from my local dev environment it works fine, but when deployed to the server I am getting this error. The issue seems to happen only for Vimeo URLs. Any ideas?

Request:

const options = {'url': 'https://vimeo.com/232889838'};
ogs(options, function (error, results) {
                console.log('error:', error);
                console.log('results:', results);
            });

Result:

results: { error: 'Page Not Found',
  success: false,
  requestUrl: 'https://vimeo.com/232889838',
  errorDetails: 'Server Has Ran Into A Error' }

Thanks!

Didn't get tags correctly

The scraper didn't get the information correctly from "http://www.uniqlo.com/jp/store/goods/182567-08".
This is a sample URL.

Below is the test result:
{ ogTitle: '{{meta.title}}',
ogType: 'website',
ogUrl: '{{meta.absUrl}}',
ogSiteName: 'ユニクロ(UNIQLO)オンラインストア',
ogDescription: '{{meta.description}}',
ogImage: { url: '{{meta.ogImage}}', width: null, height: null, type: null } },
success: true,
requestUrl: 'http://www.uniqlo.com/jp/store/feature_mb/uq/fe_list/ut/build/kids?quickviewproduct=405988001' }

Add properties

I found the following meta tags. Could you please add them?

<meta property="og:price:amount" content="64.00"/>
<meta property="og:price:currency" content="USD"/>
<meta property="og:availability" content="InStock"/>

Unhandled Promise Rejection warning

I am trying to fetch data for a number of links, and it gives an Unhandled Promise Rejection warning.
The code is as follows:

 ```
for (var i = 0; i < 100; i++) {
  var options = { 'url': url };
  ogs(options, function (err, results) {
    if (err) {
      console.log('error');
    } else {
      console.log(results);
    }
  });
}
 ```
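A warning like this generally means a promise rejected somewhere with no handler attached. Below is a minimal sketch of one way to make sure every rejection is observed when scraping many links. It assumes a promise-returning call, and `fetchMeta` is a hypothetical stand-in written for this example, not the library's API.

```javascript
// Hypothetical stand-in for a promise-returning scraper call. It rejects for
// URLs containing "bad" so the failure path can be exercised without a network.
function fetchMeta(url) {
  return url.includes('bad')
    ? Promise.reject(new Error('scrape failed: ' + url))
    : Promise.resolve({ requestUrl: url, success: true });
}

// Run all lookups and record each outcome. Promise.allSettled never rejects,
// so every individual rejection is observed and handled.
async function scrapeAll(urls) {
  const settled = await Promise.allSettled(urls.map((u) => fetchMeta(u)));
  return settled.map((s, i) =>
    s.status === 'fulfilled'
      ? { url: urls[i], ok: true, result: s.value }
      : { url: urls[i], ok: false, message: s.reason.message }
  );
}
```

Each entry in the returned array says whether that URL succeeded, so one failing link does not hide the results of the others.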

Simple example not working on my site

Basic example is not working with my site:

var ogs = require('open-graph-scraper');
var options = { 'url': 'https://signanthealth.com/careers/' };

ogs(options, function (error, results) {
  console.log('error:', error); // This is true or false: true if there was an error. The error itself is inside the results object.
  console.log('results:', results);
});

Error:

error: true
results: {
  error: 'Page Not Found',
  success: false,
  requestUrl: 'https://signanthealth.com/careers/',
  errorDetails: Error: incorrect header check
      at Zlib.zlibOnError [as onerror] (zlib.js:170:17) {
    errno: -3,
    code: 'Z_DATA_ERROR'
  },
  response: undefined
}

Page not found

Hi.
Thank you for your work on this library. I am really happy to use it.

Recently I got an issue while trying to get data from https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom.

I got the error:

{
  error: true,
  result: {
    success: false,
    requestUrl: 'https://cookpad.com/ru/recipes/7345377-piechienie-banan-s-tvoroghom',
    error: 'Page not found',
    errorDetails: Error: Page not found
        at setOptionsAndReturnOpenGraphResults (.../node_modules/open-graph-scraper/lib/openGraphScraper.js:174:13)
        at processTicksAndRejections (internal/process/task_queues.js:85:5)
  }
}

My version of OGS is 4.4.0 and options for the request are:

    headers: {
      'user-agent':
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)'
    },
    timeout: 10000,
    ogImageFallback: false,
    onlyGetOpenGraphInfo: false

Do you have an idea why this might happen?

Return response in promise

Is it possible to return the full response when you're using the library with promises? It seems like that doesn't get returned.

ogs(options)
  .then((results, response) => {
    console.log('results:', results);
    //console.log('response', response);
  })
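A `.then` callback only ever receives one argument, so a second `response` parameter like the one above will always be `undefined`. The usual workaround is to resolve with a single object that carries both values. The sketch below shows that shape; `scrape` is a hypothetical stand-in, not the library's function.

```javascript
// Hypothetical stand-in: resolves with ONE object that carries both values.
function scrape(options) {
  return Promise.resolve({
    results: { requestUrl: options.url, success: true },
    response: { statusCode: 200 }, // placeholder for the raw HTTP response
  });
}

// Destructure the single resolved value to reach both parts.
const pending = scrape({ url: 'http://ogp.me/' }).then(({ results, response }) => {
  console.log('results:', results);
  console.log('response:', response);
  return { results, response };
});
```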

Feature request: return content type for non-html requests

Since the request has already been made, it would help to return the content type encountered when scraping resources that aren't HTML (such as images, media, etc).
This could be made as an extra parameter in the results object, so as to not break existing deployments.

Limited response

Using the latest version 4.3.1 and following the "usage example" returns limited results.

result: {
    requestUrl: 'http://ogp.me/',
    success: true
}

How do I get the og stuff?

Can't resolve 'http2' and Cannot read property 'split' of undefined

Hi, I'm running into some issues using this package. I'm running yarn add open-graph-package and in the code I have:

import OpenGraph from 'open-graph-scraper'

const url = "https://www.test.com"
const options = { url: url }

OpenGraph(options, (error, results, response) => {
  console.log("error:", error)
  console.log("results:", results)
  console.log("response:", response)
})

and I get the following error:

./node_modules/http2-wrapper/source/index.js
Module not found: Can't resolve 'http2' in '/Users/taylorwong/Documents/GitHub/client-dashboard/node_modules/http2-wrapper/source'

I installed http2 with yarn add http2 but then run into the following error:

TypeError: Cannot read property 'split' of undefined
node_modules/@szmarczak/http-timer/dist/source/index.js:4

Let me know if I can provide any additional information to help with this. Thank you!

Integration issue

Loading dependency graph, done.
error: bundling: UnableToResolveError: Unable to resolve module url from ../SampleOGP/node_modules/open-graph-scraper/app.js:
Module does not exist in the module map or in these directories:
../SampleOGP/node_modules/open-graph-scraper/node_modules
, ../SampleOGP/node_modules

This might be related to facebook/react-native#4968
To resolve try the following:

  1. Clear watchman watches: watchman watch-del-all.
  2. Delete the node_modules folder: rm -rf node_modules && npm install.
  3. Reset packager cache: rm -fr $TMPDIR/react-* or npm start -- --reset-cache.
    at ResolutionRequest._resolveNodeDependency (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:470:11)
    at ResolutionRequest.resolveDependency (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:135:29)
    at dependencyNames.map.name (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:142:59)
    at Array.map (native)
    at ResolutionRequest.resolveModuleDependencies (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:142:42)
    at module.readFresh.then (../SampleOGP/node_modules/react-native/packager/src/node-haste/DependencyGraph/ResolutionRequest.js:182:40)
    at process._tickCallback (internal/process/next_tick.js:109:7)

build environment:
node: v7.7.3,
react-native: react-native-cli: 2.0.1,
OS: MAC-OS,
react: 16.0.0-alpha.12,
react-native: 0.45.1

*example

"The above eample will return something like..."

*example

Getting are you a robot in response

I am trying to fetch the metadata of this url:
https://www.bloomberg.com/news/articles/2020-06-24/pentagon-names-20-chinese-firms-it-says-are-military-controlled?cmpid=socialflow-twitter-business&utm_medium=social&utm_content=business&utm_source=twitter&utm_campaign=socialflow-organic

I get this in response:
{ ogTitle: 'Bloomberg - Are you a robot?', ogImage: [] }

I get the correct data when I run it locally. But when I try to fetch the meta data on my production, it gives the above response

Issue with handling encoding

Currently, in your recent additions to handle encoding, you check that options.encoding === null to decide whether to try to detect the charset, but by default this option is undefined, so the detection doesn't run. This should either have a default set, or be if (options.encoding === null || options.encoding === undefined)
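A small sketch of the check described here. The helper name `shouldDetectCharset` is made up for illustration; only the `options.encoding` comparison comes from the issue.

```javascript
// Returns true when no encoding was supplied, i.e. the option is null or undefined.
function shouldDetectCharset(options) {
  // `== null` is true for both null and undefined, and for nothing else,
  // so it is equivalent to the two strict comparisons joined with ||.
  return options.encoding == null;
}
```

An explicitly supplied value, including an empty string, still skips detection.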

How to scrape author, publisher meta info

Hi
How can I scrape author, publisher, and similar meta info?
It would be better to be able to set custom meta keys for scraping.
Is there any option for this?
If so, please let me know.

Thanks.

Not installing with node 7

Since the engines field in package.json is configured for node 6.x, yarn refuses to install the package on node 7.
Please fix this.

Add Twitter Name Meta Tags Support

Hello,

This is not an issue necessarily but a feature request. I am wondering if it would be easy to add support for Twitter meta tags to show up in the scrape. I am trying to detect if a URL has twitter cards enabled (see: https://dev.twitter.com/cards/types/summary). I already use this library for scraping other open graph information and it would be extremely awesome if there was an option that you could set to opt in to get those Twitter values.

Thank you.

cors!!!

error: XMLHttpRequest cannot load https://www.naver.com/. Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'localhost:8000' is therefore not allowed access. The response had HTTP status code 404.

What should I do?

Called from Express, returns Bad Request for a valid URL

Hi, guys.
I ran script.js with the example code:

var ogs = require('open-graph-scraper');
var options = {'url': 'https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf'};
ogs(options, function (err, results) {
    console.log('err:', err);
    console.log('results:', results);
});

It returned all the metadata.

But when I ran it from Express, it returned Bad Request.

exports.metatag = function (req, res) {
  var url = req.query.url;

  var options = { 'url': url, timeout: 4000 };
  ogs(options, function (err, results) {
    if (err) return res.status(400).json({ results: results, err: err });

    return res.status(200).json(results);
  });
};

What happened?

url: https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf
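One possible cause, not confirmed in this thread, is that the target URL is placed into the query string without being encoded. In that case the `#...` part becomes the fragment of the API request itself, and the `%C3%A9` escapes are decoded before the handler sees them. The sketch below shows the difference using the standard `URL` class. The `http://localhost:3000/metatag` endpoint is a hypothetical address chosen for the example.

```javascript
const target =
  'https://medium.com/@marcobrunobr/mvp-%C3%A9-coisa-do-passado-a-moda-agora-%C3%A9-mlp-4446fc476006#.lef2qqqhf';

// Unencoded: "#.lef2qqqhf" is parsed as the fragment of the API URL,
// and the query value has its percent-escapes decoded.
const naive = new URL('http://localhost:3000/metatag?url=' + target);

// Encoded: the whole target survives the round trip unchanged.
const encoded = new URL(
  'http://localhost:3000/metatag?url=' + encodeURIComponent(target)
);

console.log(naive.searchParams.get('url') === target);   // false
console.log(encoded.searchParams.get('url') === target); // true
```

If this is what is happening, encoding the value with `encodeURIComponent` on the client side would let the handler receive the same URL that works in the standalone script.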

Medium short links are not supported

Hey! Thanks a lot for your library, I am totally enjoying it and using in my telegram bot application. Stumbled on an issue and wanted to get your opinion:

https://link.medium.com/O8Wg18lGwW

The above link appears to have no data when the promise resolves, but success is still true. It redirects to a completely different page, and some Open Graph checkers yield no results for it, although some return all the correct data. Is this a known issue, or am I just doing something wrong?

Thanks a lot for your help and time. Cheers!

Suggestion: add retry functionality

Let me start off with: great package! A happy user here :)

Then my problem: we are doing multiple requests at the same time and running into some ESOCKETTIMEDOUT errors. I tried some tricks to improve this (for example increasing the UV_THREADPOOL_SIZE), but cannot completely get rid of them. I am now building functionality to retry after some seconds, with a max retry setting. It would be awesome if there was a property in this package we could set to do the same.

Good idea? I have seen some request packages that add this functionality for recoverable errors to request as well.
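For reference, a retry wrapper of the kind described can be sketched in user code as follows. The names `withRetry`, `maxRetries`, `delayMs` and `isRecoverable` are all made up for this example and are not options of the package.

```javascript
// Retry an async operation up to maxRetries extra times, waiting delayMs
// between attempts. isRecoverable decides which errors are worth retrying.
async function withRetry(
  operation,
  { maxRetries = 3, delayMs = 1000, isRecoverable = () => true } = {}
) {
  let attempt = 0;
  for (;;) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up when out of retries or when the error is not recoverable.
      if (attempt >= maxRetries || !isRecoverable(err)) throw err;
      attempt += 1;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller could wrap a promise-returning scrape call in `withRetry` and pass an `isRecoverable` predicate that matches timeout errors. How such errors are identified (for example by an error code property) is an assumption that would need checking against the actual error objects.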

utils.isThisANonHTMLUrl causing false positives on domains that start with invalid image type strings

For example: https://www.target.com.au/path-to-product will fail as .tar is one of the invalid image types. This causes the Must scrape an HTML page error.

Looking at the code, I assume this will happen for any URL that has any of the invalid image types in its host name, whether in the domain or a sub-domain. E.g. https://products.txtshop.com/product
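One way to avoid this class of false positive is to test only the path of the URL, not the whole string. This is a sketch of that idea and is not the fix from the PR mentioned below. The function name and the extension list are assumptions made for the example.

```javascript
// Assumed sample of blocked extensions; the library's real list is not shown in this thread.
const NON_HTML_EXTENSIONS = ['.tar', '.txt', '.png', '.jpg', '.pdf', '.zip'];

function looksLikeNonHtmlUrl(rawUrl) {
  // Only look at the path, so host names such as www.target.com.au are ignored.
  // The query string and fragment are not part of pathname either.
  const pathname = new URL(rawUrl).pathname.toLowerCase();
  return NON_HTML_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}
```

With this check, both example URLs from this issue are treated as HTML candidates, while a path that really ends in `.tar` is still rejected.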

If I get a spare few mins I'll submit a PR with some sort of solution.

Just thought I'd make you aware of the issue :)

Update: PR submitted: #108

Steve

provide a way to limit the data fetched?

If url points to a large file (link to a document for example), it downloads the whole file before parsing the content. Since we are using cheerio, we probably can provide a way to specify the maximum bytes fetched, and still be able to get open graph tags from partial content?
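As a rough illustration of the idea, the helper below reads from any async iterable of chunks (Node.js readable streams are async iterable) and stops once a byte limit is reached. `readUpTo` is a hypothetical name, and wiring it into the library's request layer is not shown.

```javascript
// Collect chunks from an async iterable (e.g. a Node.js readable stream)
// until maxBytes is reached, then stop reading.
async function readUpTo(source, maxBytes) {
  const chunks = [];
  let total = 0;
  for await (const chunk of source) {
    const buf = Buffer.from(chunk);
    const remaining = maxBytes - total;
    if (buf.length >= remaining) {
      // Keep only the bytes that still fit, then stop consuming the source.
      chunks.push(buf.subarray(0, remaining));
      total += remaining;
      break;
    }
    chunks.push(buf);
    total += buf.length;
  }
  return Buffer.concat(chunks, total);
}
```

The truncated buffer may end in the middle of a tag, so this only helps when the Open Graph tags appear before the cut-off point, which is typical for tags placed in the document head.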

twitterImg.url incorrect

So when I try to scrape https://github.com, I get:

"twitterImage": {
            "url": "1200",
            "width": "1200"
        }

whereas:

      <meta property="twitter:image:src" content="https://assets-cdn.github.com/images/modules/open_graph/github-logo.png">
      <meta property="twitter:image:width" content="1200">
      <meta property="twitter:image:height" content="1200">

I see the following problems:

  1. twitter:image:height did not get scraped
  2. probably twitter:image:src got scraped instead of twitter:image:height
  3. why is there twitterImage.url rather than twitterImage.src, which would be more consistent with twitter:image:src (there is no twitter:image:url after all)?

Consider starting a changelog or using tags/releases

Hi! Thanks for the scraper, it gets the job done so far! I've been using it for about a year now, and after you released your last version I tried looking for a changelog and didn't see one. Because of this, the change from err to error in the error response was unexpected for me and led to some crashes. I still have no idea if anything else changed.

Could you please provide a summary of changes between previous and current version, and perhaps start a "CHANGELOG.md" file?

Cannot read property 'startsWith' of undefined

I am having trouble using openGraphScraper. Even with the default example I get an error. I have tried different versions from 4.00 to 4.2.0 but received the same error. I have also tried removing node_modules and package-lock.json. I am using node 10.16.3.
I am using this example:

const ogs = require('open-graph-scraper');
const options = { url: 'https://www.npmjs.com/package/open-graph-scraper' };
ogs(options)
  .then((data) => {
    const { error, result, response } = data;
    console.log('error:', error);  // This is true or false: true if there was an error. The error itself is inside the results object.
    console.log('result:', result); // This contains all of the Open Graph results
    console.log('response:', response); // This contains the HTML of page
  })

I am getting this error

   { error: true
    result: {
      "success": false,
      "requestUrl": "https://www.npmjs.com/package/open-graph-scraper",
      "error": "Cannot read property 'startsWith' of undefined",
      "errorDetails": {
        "name": "RequestError"
      }
    }}


Pass through validator options in oGS options

Hey, I'm just trying to understand why http://localhost:3000 is invalid but 127.0.0.1 is valid? Is it the require_tld that it fails on in validator?

'http://localhost:3000/',

Before I opened this I found a mention in validator about exactly this, you can disable require_tld to get localhost to validate.

validatorjs/validator.js#675

Is there space here to allow for passing in validator options through the oGS options? i.e.

    const options = { 
      url: 'http://localhost:3000',
      validator: {
        require_tld: false
      }
    }

    ogs(options)
      .then((data) => {
        const { error, result } = data
        if (error) console.log('error:', error)
        console.log('result:', result)
      })
      .catch((error) => {
        consola.error(error)
      })

WDYT? I can give a PR a go to add this.

content-type text/html misread

Sometimes the library decides a response is not text/html, even though it is.
For example, https://www.namecheap.com/, a very popular DNS provider,
should yield its Open Graph info, but the scrape fails with an error.
There is a check in the code:
else if (!(response && response.headers && response.headers['content-type'] && response.headers['content-type'].indexOf('text/html') !== -1)) {
callback('Must scrape an HTML page', null, response);

That should be relaxed somehow, or cater for scenarios like the example above

Feature request: return html markup

It would be very useful to be able to receive the full HTML source of the remote site, e.g. as the callback's third parameter, like:

ogs(options, function (err, results, source) {
    console.log('source:', source); // This would be full html source of remote site
});

ReferenceError: errorFlag is not defined

I tried this on Runkit as well as my local machine and got the following error

ReferenceError: errorFlag is not defined
at parseGoogleNewsRSSData in googlenews-rss-scraper/index.js — line 75

Feature request: parse from string rather than url

The module fails for the following link:

http://www.nytimes.com/2016/09/01/arts/design/gallery-hopes-to-sell-kanye-wests-famous-sculpture-for-4-million.html?_r=0

This is because it does multiple redirects:

curl -v -L http://nytimes.com/2016/09/01/arts/design/gallery-hopes-to-sell-kanye-wests-famous-sculpture-for-4-million.html?_r=0
...
Connection #7 to host myaccount.nytimes.com left intact

So I can use curl for getting the html contents myself, but I would like to use the scraper to parse from string.
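To make the request concrete, here is a deliberately naive sketch of pulling Open Graph tags out of an HTML string that was fetched by other means. It is not the library's parser: it uses a regular expression and assumes double-quoted attributes with `property` before `content`, so a real HTML parser would be more robust. The function name is made up for the example.

```javascript
// Naive extraction of <meta property="og:..." content="..."> pairs from a string.
// Assumes double-quoted attributes and that property comes before content.
function parseOpenGraphFromHtml(html) {
  const tags = {};
  const metaPattern = /<meta\s+property="(og:[^"]+)"\s+content="([^"]*)"\s*\/?>/gi;
  let match;
  while ((match = metaPattern.exec(html)) !== null) {
    tags[match[1]] = match[2];
  }
  return tags;
}
```

The result is a plain object keyed by the raw property name, for example `og:title`.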

Error param in callback is not an error object

Hello,

Really liking openGraphScraper and finding it useful in my project, thanks for making it.

When I get the occasional error from a graph request, an error object is passed, but it's not a standard JavaScript error. The documentation shows dealing with an error like this...

console.log("err:",err);

But this will output as err: err when there is an error.

Is there any chance you could throw a more meaningful error message or give a clearer indication of what is happening when the requested webpage has a problem?

Thanks again for making openGraphScraper,
/t

Return original request error

Hi,

it would be very useful if any errors of the internal request could be passed back in the callback.

Example:

The internal request has the following error:

{ Error: self signed certificate in certificate chain
    at TLSSocket.<anonymous> (_tls_wrap.js:1088:38)
    at emitNone (events.js:86:13)
    at TLSSocket.emit (events.js:188:7)
    at TLSSocket._finishInit (_tls_wrap.js:610:8)
    at TLSWrap.ssl.onhandshakedone (_tls_wrap.js:440:38) code: 'SELF_SIGNED_CERT_IN_CHAIN' }

The error passed in the callback:

true

Troubleshooting is not possible in this way, without adding logs to the module :-(
