🏷️ A JavaScript library for scraping/parsing metadata from a web page.

Home Page: https://mxis.ch

License: MIT License

TypeScript 97.77% JavaScript 2.23%
metadata meta-tags metadata-extraction parser html-scraper page typescript javascript-library metatags open-graph

metadata-scraper's Introduction

metadata-scraper

A JavaScript library for scraping/parsing metadata from a web page.

👋 Introduction

metadata-scraper is a JavaScript library that scrapes/parses metadata from web pages. Supply it with a URL or an HTML string and it applies a set of rules to find the most relevant metadata, such as:

  • Title
  • Description
  • Favicons/Images
  • Language
  • Keywords
  • Author
  • and more (full list below)

🚀 Get started

Install metadata-scraper via npm:

npm install metadata-scraper

📚 Usage

Import metadata-scraper and pass it a URL or options object:

const getMetaData = require('metadata-scraper')

const url = 'https://github.com/BetaHuhn/metadata-scraper'

getMetaData(url).then((data) => {
	console.log(data)
})

Or with async/await:

const getMetaData = require('metadata-scraper')

async function run() {
	const url = 'https://github.com/BetaHuhn/metadata-scraper'
	const data = await getMetaData(url)
	console.log(data)
}

run()

This will return:

{
	title: 'BetaHuhn/metadata-scraper',
	description: 'A Javascript library for scraping/parsing metadata from a web page.',
	language: 'en',
	url: 'https://github.com/BetaHuhn/metadata-scraper',
	provider: 'GitHub',
	twitter: '@github',
	image: 'https://avatars1.githubusercontent.com/u/51766171?s=400&v=4',
	icon: 'https://github.githubassets.com/favicons/favicon.svg'
}

You can see a list of all metadata which metadata-scraper tries to scrape below.

⚙️ Configuration

You can change the behaviour of metadata-scraper by passing an options object:

const getMetaData = require('metadata-scraper')

const options = {
	url: 'https://github.com/BetaHuhn/metadata-scraper', // URL of web page
	maxRedirects: 0, // Maximum number of redirects to follow (default: 5)
	ua: 'MyApp', // Specify User-Agent header
	lang: 'de-CH', // Specify Accept-Language header
	timeout: 1000, // Request timeout in milliseconds (default: 10000ms)
	forceImageHttps: false, // Force all image URLs to use https (default: true)
	customRules: {} // more info below
}

getMetaData(options).then((data) => {
	console.log(data)
})

You can specify the URL by either passing it as the first parameter, or by setting it in the options object.

📖 Examples

Here are some examples of how to use metadata-scraper:

Basic

Pass a URL as the first parameter and metadata-scraper automatically scrapes it and returns everything it finds:

const getMetaData = require('metadata-scraper')

// `await` has to run inside an async function (see the async/await example above)
const data = await getMetaData('https://github.com/BetaHuhn/metadata-scraper')

Example file located at examples/basic.js.


HTML String

If you already have an HTML string and don't want metadata-scraper to make an HTTP request, specify it in the options object:

const getMetaData = require('metadata-scraper')

const html = `
	<meta name="og:title" content="Example">
	<meta name="og:description" content="This is an example.">
`

const options = {
	html: html,
	url: 'https://example.com' // Optional URL to make relative image paths absolute
}

const data = await getMetaData(options)

Example file located at examples/html.js.


Custom Rules

Look at the rules.ts file in the src directory to see all rules which will be used.

You can expand metadata-scraper easily by specifying custom rules:

const getMetaData = require('metadata-scraper')

const options = {
	url: 'https://github.com/BetaHuhn/metadata-scraper',
	customRules: {
		name: {
			rules: [
				[ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
			],
			processor: (text) => text.toLowerCase()
		}
	}
}

const data = await getMetaData(options)

customRules needs to contain one or more objects, where the key (name above) will identify the value in the returned data.

You can then specify different rules for each item in the rules array.

The first item is the selector, which is passed to querySelector, and the second item is a function that receives the matched HTML element:

[ 'querySelector', (element) => element.innerText ]

You can also specify a processor function which will process/transform the result of one of the matched rules:

{
	processor: (text) => text.toLowerCase()
}

If you find a useful rule, let me know and I will add it (or create a PR yourself).

Example file located at examples/custom.js.
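
To make the interplay between rules and processor more concrete, here is a minimal sketch of how such a rule object could be resolved. It is a simplified first-match strategy written for illustration only, not metadata-scraper's actual implementation; resolveRule and fakeDocument are hypothetical names, and the fake document only stands in for a real DOM so the snippet runs on its own.

```javascript
// Simplified, hypothetical rule resolver (NOT the library's own code):
// try each [selector, extractor] pair in order and return the first value found.
function resolveRule(doc, { rules, processor }) {
	for (const [ selector, extract ] of rules) {
		const element = doc.querySelector(selector)
		if (!element) continue // selector matched nothing, try the next rule

		const value = extract(element)
		if (value) return processor ? processor(value) : value
	}
	return undefined // no rule produced a value
}

// Minimal stand-in for a DOM document so the sketch runs without a browser
const fakeDocument = {
	querySelector: (selector) =>
		selector === 'meta[name="customName"][content]'
			? { getAttribute: (attr) => (attr === 'content' ? 'My Page' : null) }
			: null
}

const nameRule = {
	rules: [
		[ 'meta[name="missing"][content]', (element) => element.getAttribute('content') ],
		[ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
	],
	processor: (text) => text.toLowerCase()
}

console.log(resolveRule(fakeDocument, nameRule)) // my page
```

The first rule matches nothing, so the second one supplies the value, which the processor then lowercases.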

📇 All metadata

Here's what metadata-scraper currently tries to scrape:

{
	title: 'Title of page or article',
	description: 'Description of page or article',
	language: 'Language of page or article',
	type: 'Page type',
	url: 'URL of page',
	provider: 'Page provider',
	keywords: ['array', 'of', 'keywords'],
	section: 'Section/Category of page',
	author: 'Article author',
	published: 1605221765, // Date the article was published
	modified: 1605221765, // Date the article was modified
	robots: ['array', 'for', 'robots'],
	copyright: 'Page copyright',
	email: 'Contact email',
	twitter: 'Twitter handle',
	facebook: 'Facebook account id',
	image: 'Image URL',
	icon: 'Favicon URL',
	video: 'Video URL',
	audio: 'Audio URL'
}
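
The published and modified values above look like Unix timestamps in seconds (this is an assumption based on the magnitude of the sample value, not something the list states). Under that assumption, a small hypothetical helper such as toDate could turn them into Date objects:

```javascript
// Hypothetical helper: converts a Unix timestamp in SECONDS (assumed unit)
// into a Date, or returns null when the field is missing or not a number.
function toDate(timestamp) {
	if (typeof timestamp !== 'number' || !Number.isFinite(timestamp)) return null
	return new Date(timestamp * 1000) // Date expects milliseconds
}

console.log(toDate(1605221765).toISOString()) // 2020-11-12T22:56:05.000Z
console.log(toDate(undefined)) // null
```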

If you find a useful metatag, let me know and I will add it (or create a PR yourself).

💻 Development

Issues and PRs are very welcome!

Please check out the contributing guide before you start.

This project adheres to Semantic Versioning. To see differences with previous versions refer to the CHANGELOG.

❔ About

This library was developed by me (@betahuhn) in my free time. If you want to support me:

Donate via PayPal

Credits

This library is based on Mozilla's page-metadata-parser. I converted it to TypeScript, implemented a few new features, and added more rules.

License

Copyright 2020 Maximilian Schiller

This project is licensed under the MIT License - see the LICENSE file for details.

metadata-scraper's People

Contributors

betahuhn, betahuhnbot, dependabot[bot], smolak, utrolig

metadata-scraper's Issues

The WWW url not working

When I pass the URL as "www.bluehost.in", it throws an "Invalid URL" error.

const getMetaData = require('metadata-scraper')
const data = await getMetaData('www.bluehost.in')

It throws the error shown in the screenshot below.

🖼️ Screenshots

image

⚙️ Environment

  • OS: [e.g. Ubuntu 18.04]
  • Node.js Version [e.g. v14]
  • Link to website
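
A possible workaround, assuming the error is caused by the missing scheme in "www.bluehost.in", is to normalize the input before handing it to getMetaData. The helper name ensureProtocol and the choice of https:// as the default are assumptions made for this sketch; it relies only on Node's built-in URL class.

```javascript
// Hypothetical helper: prepend a default scheme when the input has none,
// then let the built-in URL class validate and normalize the result.
function ensureProtocol(input, defaultProtocol = 'https://') {
	const trimmed = input.trim()
	const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)
	return new URL(hasScheme ? trimmed : defaultProtocol + trimmed).href
}

console.log(ensureProtocol('www.bluehost.in')) // https://www.bluehost.in/
console.log(ensureProtocol('http://web.com')) // http://web.com/
```

The normalized string can then be passed on, for example getMetaData(ensureProtocol('www.bluehost.in')).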

Request to web.com blocked by Cloudflare

🐞 Describe the bug

I am building a service that requires me to fetch metadata from any given URL.
This isn't central to what I am working on; it's an add-on feature, and after some Googling I found this package (cool stuff).

With most URLs, fetching the data works fine.
But whenever I try to retrieve metadata from http://web.com, it throws an error.

📚 To Reproduce

Try fetching metadata for http://web.com

💡 Expected behavior

I expect to get the metadata for the URL, but instead I get:

HTTPError: Response code 403 (Forbidden)
    at Request.<anonymous> (/root/node_modules/got/dist/source/as-promise/index.js:117:42)
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  code: undefined,
  timings: {
    start: 1632256779506,
    socket: 1632256779517,
    lookup: 1632256779568,
    connect: 1632256779579,
    secureConnect: 1632256779602,
    upload: 1632256779602,
    response: 1632256779759,
    end: 1632256779809,
    error: undefined,
    abort: undefined,
    phases: {
      wait: 11,
      dns: 51,
      tcp: 11,
      tls: 23,
      request: 0,
      firstByte: 157,
      download: 50,
      total: 303
    }
  }
}

🖼️ Screenshots

https://scrnli.com/CV2MNqySiJMIeS

⚙️ Environment

  • OS: Linux Mint 20
  • Node.js Version: v12.16.3
  • Link to website
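
Until the 403 itself is resolved, a caller can at least handle it without crashing. The sketch below assumes the rejected error is got's HTTPError, as the stack trace above suggests, and that it exposes response.statusCode. isForbidden and scrapeOrNull are hypothetical names, and the scraper function is passed in as a parameter so the snippet does not depend on the package being installed.

```javascript
// Hypothetical helper: true when the error carries an HTTP 403 response
// (assumes a got-style error with `response.statusCode`).
function isForbidden(error) {
	return Boolean(error && error.response && error.response.statusCode === 403)
}

// Hypothetical wrapper: resolves to null for blocked pages, rethrows anything else.
async function scrapeOrNull(getMetaData, options) {
	try {
		return await getMetaData(options)
	} catch (error) {
		if (isForbidden(error)) return null
		throw error
	}
}

console.log(isForbidden({ response: { statusCode: 403 } })) // true
console.log(isForbidden(new Error('timeout'))) // false
```

It could be called as scrapeOrNull(require('metadata-scraper'), { url: 'http://web.com' }). Whether setting the ua option from the configuration section changes the response for a given site is untested here.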
