Web scraping with HTML parsers and querying with CSS selectors in pawn (WIP)

Home Page: https://forum.sa-mp.com/showthread.php?t=660753

License: GNU General Public License v3.0

Pawn 30.23% Rust 66.91% Makefile 2.86%
sa-mp sa-mp-plugin rust pawn pawn-package scraper web-scraping html-parser css-selector pawn-scraper

pawn-scraper

A powerful scraper plugin that provides an interface for using HTML parsers and CSS selectors in Pawn.

Installing

Thanks to Southclaws, plugin installation is now much easier with sampctl:

sampctl p install Sreyas-Sreelal/pawn-scraper

OR

  • Download the binary for your operating system from releases
  • Add it to your plugins folder
  • Add PawnScraper (or PawnScraper.so on Linux) to server.cfg
  • Add pawnscraper.inc to your includes folder

Building

  • Clone the repo

    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git

  • Use makefile to compile and test

    • Setup testing environment

      make setup

    • To build release version

      make release

    • Run tests

      make run

API

  • ParseHtmlDocument(document[])

    • Params
      • document[] - string containing the HTML document
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document could not be parsed
    • Example Usage
       new Html:doc = ParseHtmlDocument("\
       	<!DOCTYPE html>\
       	<meta charset=\"utf-8\">\
       	<title>Hello, world!</title>\
       	<h1 class=\"foo\">Hello, <i>world!</i></h1>\
       	");
       ASSERT(doc != INVALID_HTML_DOC);
       DeleteHtml(doc);
  • ResponseParseHtml(Response:id)

    • Params
      • id - HTTP response id returned from HttpGet
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document could not be parsed
    • Example Usage
       new Response:response = HttpGet("https://www.sa-mp.com");
       new Html:doc = ResponseParseHtml(response);
       ASSERT(doc != INVALID_HTML_DOC);
       DeleteHtml(doc);
  • HttpGet(url[],Header:headerid=INVALID_HEADER)

    • Params
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader (optional, defaults to INVALID_HEADER)
    • Returns
      • Response id if successful
      • INVALID_HTTP_RESPONSE if the request failed
    • Example Usage
      new Response:response = HttpGet("https://www.sa-mp.com");
      ASSERT(response != INVALID_HTTP_RESPONSE);
      DeleteResponse(response);
  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)

    • Params

      • playerid - id of the player
      • callback[] - name of the callback function that handles the response
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader (optional, defaults to INVALID_HEADER)
    • Example Usage

      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
      //********
      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
      }
  • ParseSelector(string[])

    • Params
      • string[] - CSS selector
    • Returns
      • Selector instance id if successful
      • INVALID_SELECTOR if the selector could not be parsed
    • Example Usage
      new Selector:selector = ParseSelector("h1 .foo");
      ASSERT(selector != INVALID_SELECTOR);
      DeleteSelector(selector);
  • CreateHeader(...)

    • Params
      • key, value pairs of strings
    • Returns
      • Header instance id if successful
      • INVALID_HEADER if the header could not be created
    • Example Usage
      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      ASSERT(header != INVALID_HEADER);
      new Response:response = HttpGet("https://sa-mp.com/",header);
      ASSERT(response != INVALID_HTTP_RESPONSE);
      ASSERT(DeleteHeader(header) == 1);
      // release the response as well once it is no longer needed
      DeleteResponse(response);
  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element name is stored
      • size - size of the buffer
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");
      ASSERT(doc != INVALID_HTML_DOC);
      
      new Selector:selector = ParseSelector("i");
      ASSERT(selector != INVALID_SELECTOR);
      
      new i= -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name)!=0){
          ASSERT(strcmp(element_name,"i") == 0);
      }
      
      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element text is stored
      • size - size of the buffer
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");
      ASSERT(doc != INVALID_HTML_DOC);
      
      new Selector:selector = ParseSelector("h1.foo");
      ASSERT(selector != INVALID_SELECTOR);
      
      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);
      
      new check = strcmp(element_text,("Hello, world!"));
      ASSERT(check == 0);
      
      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))

    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • attribute[] - name of the attribute to read
      • string[] - buffer in which the attribute value is stored
      • size - size of the buffer
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage
       new Html:doc = ParseHtmlDocument("\
       	<!DOCTYPE html>\
       	<meta charset=\"utf-8\">\
       	<title>Hello, world!</title>\
       	<h1 class=\"foo\">Hello, <i>world!</i></h1>\
       ");
       ASSERT(doc != INVALID_HTML_DOC);
       
       new Selector:selector = ParseSelector("h1");
       ASSERT(selector != INVALID_SELECTOR);
       
       new element_attribute[20];
       ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);
      
       new check = strcmp(element_attribute,("foo"));
       ASSERT(check == 0);
      
       DeleteSelector(selector);
       DeleteHtml(doc);
  • DeleteHtml(Html:id)

    • Params
      • id - html instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
  • DeleteSelector(Selector:id)

    • Params
      • id - selector instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
  • DeleteResponse(Response:id)

    • Params
      • id - response instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
  • DeleteHeader(Header:id)

    • Params
      • id - header instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
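
The four Delete functions above have no example of their own. The following is a minimal sketch of checking their documented 1/0 return values, assuming a_samp and pawnscraper.inc are on the include path. It uses only the natives listed in this section, and the printed messages are illustrative.

```pawn
#include <a_samp>
#include <pawnscraper>

main()
{
    new Html:doc = ParseHtmlDocument("<h1 class=\"foo\">Hello</h1>");
    new Selector:selector = ParseSelector("h1.foo");

    // Only delete handles that were actually created, and report
    // a failed deletion instead of ignoring the return value.
    if (doc != INVALID_HTML_DOC && !DeleteHtml(doc)) {
        print("DeleteHtml failed");
    }
    if (selector != INVALID_SELECTOR && !DeleteSelector(selector)) {
        print("DeleteSelector failed");
    }
}
```

DeleteResponse and DeleteHeader can be checked in the same way against INVALID_HTTP_RESPONSE and INVALID_HEADER.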

Usage

A small example that fetches all links on wiki.sa-mp.com:

new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
	printf("HTTP ERROR");
	return;
}

new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
	DeleteResponse(response);
	return;
}

new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
	DeleteResponse(response);
	DeleteHtml(html);
	return;
}

new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
	printf("%s",str);
	++i;
}
//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);

The same example with a threaded HTTP call would be:

HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid)
{
	if(responseid == INVALID_HTTP_RESPONSE){
		printf("HTTP ERROR");
		return 0;
	}

	new Html:html = ResponseParseHtml(responseid);
	if(html == INVALID_HTML_DOC){
		DeleteResponse(responseid);
		return 0;
	}

	new Selector:selector = ParseSelector("a");
	if(selector == INVALID_SELECTOR){
		DeleteResponse(responseid);
		DeleteHtml(html);
		return 0;
	}

	new str[500],i;
	while(GetNthElementAttrVal(html,selector,i,"href",str)){
		printf("%s",str);
		++i;
	}

	//delete created objects after the usage..
	DeleteHtml(html);
	DeleteResponse(responseid);
	DeleteSelector(selector);
	return 1;
}

More examples can be found in examples.
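
Both snippets in this section follow the same sequence: fetch, parse, select, iterate, clean up. As a sketch only, that sequence could be wrapped in one helper so that every failure path releases exactly what was created. PrintAttrValues is a hypothetical name and is not part of the plugin. It relies only on the natives documented in the API section and assumes a_samp and pawnscraper.inc are on the include path.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of pawn-scraper): prints the given attribute
// of every element matching `css` on the page at `url`.
// Returns the number of values printed, or -1 if the request, the parse
// or the selector failed.
stock PrintAttrValues(url[], css[], attribute[])
{
    new Response:response = HttpGet(url);
    if (response == INVALID_HTTP_RESPONSE) {
        return -1;
    }

    new Html:html = ResponseParseHtml(response);
    if (html == INVALID_HTML_DOC) {
        DeleteResponse(response);
        return -1;
    }

    new Selector:selector = ParseSelector(css);
    if (selector == INVALID_SELECTOR) {
        DeleteHtml(html);
        DeleteResponse(response);
        return -1;
    }

    // GetNthElementAttrVal returns 0 once idx runs past the last match.
    new value[500], count;
    while (GetNthElementAttrVal(html, selector, count, attribute, value)) {
        printf("%s", value);
        ++count;
    }

    DeleteSelector(selector);
    DeleteHtml(html);
    DeleteResponse(response);
    return count;
}

main()
{
    new found = PrintAttrValues("https://wiki.sa-mp.com", "a", "href");
    printf("links found: %d", found);
}
```

HttpGet is called here without a header, so it blocks until the request completes. For requests made while players are connected, the threaded variant shown above is likely the better fit.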

Note

The plugin is at an early stage, and more tests and features need to be added. I'm open to any kind of contribution; just open a pull request if you have improvements or new features to add.


pawn-scraper's Issues

interface creation

Is it possible to use this plugin to create interfaces similar to textdraw?

Handle Cleaning up of objects correctly

The current method deletes the object from memory without checking whether another variable with a different lifetime still references the same id. Arrays of ids need to be handled as well.

One way to solve the first problem is to do a borrow check, or to overload = so that assignment makes a copy of the object in memory and stores the id of the new copy.
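
Until the plugin handles this itself, a script can reduce the risk on its own side. The sketch below is one possible convention, not the fix proposed in this issue: SafeDeleteHtml is a hypothetical helper that resets the caller's variable after deletion so the stale id is not reused by accident. It does not protect other variables that hold a copy of the same id.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical script-side helper (not part of pawn-scraper): deletes the
// document and resets the caller's variable to INVALID_HTML_DOC.
// Returns the result of DeleteHtml, or 0 if the variable was already invalid.
stock SafeDeleteHtml(&Html:doc)
{
    if (doc == INVALID_HTML_DOC) {
        return 0;
    }
    new ret = DeleteHtml(doc);
    doc = INVALID_HTML_DOC;
    return ret;
}

main()
{
    new Html:doc = ParseHtmlDocument("<h1>Hello</h1>");
    new Html:alias = doc; // copies the id only, not the document

    SafeDeleteHtml(doc);  // doc is now INVALID_HTML_DOC

    // alias still holds the old id; this is the case the issue describes,
    // and the helper above cannot detect it.
    if (alias != INVALID_HTML_DOC) {
        print("alias still holds a stale id");
    }
}
```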

Bulky binary

The binary is considerably large. The reqwest crate seems to be the main contributor to its size. Replacing it with something like minihttp could reduce the size and also remove the vendored OpenSSL dependency completely, which would further reduce the library size on Linux.
