pawn-scraper

A powerful scraper plugin that provides interface for utlising html_parsers and css selectors in pawn.

Installing

Thanks to Southclaws,plugin installation is now much easier with sampctl

sampctl p install Sreyas-Sreelal/pawn-scraper

OR

Download suitable binary files from releases for your operating system
Add it your plugins folder
Add PawnScraper to server.cfg or PawnScraper.so (for linux)
Add pawnscraper.inc in includes folder

Building

Clone the repo

git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git
Use makefile to compile and test
- Setup testing environment
  
  make setup
- To build release version
  
  make release
- Run tests
  
  make run

API

ParseHtmlDocument(document[])

Params
- document[] - string of html document
Returns
- Html document instance id
- if failed to parse document INVALID_HTML_DOC is returned

Example Usage

 new Html:doc = ParseHtmlDocument("\
 	<!DOCTYPE html>\
 	<meta charset=\"utf-8\">\
 	<title>Hello, world!</title>\
 	<h1 class=\"foo\">Hello, <i>world!</i></h1>\
 	");
 ASSERT(doc != INVALID_HTML_DOC);
 DeleteHtml(doc);

ResponseParseHtml(Response:id)
- Params
  - id - Http response id returned from HttpGet
- Returns
  - Html document instance id
  - if failed to parse document INVALID_HTML_DOC is returned
- Example Usage
```
 new Response:response = HttpGet("https://www.sa-mp.com");
 new Html:doc = ResponseParseHtml(response);
 ASSERT(doc != INVALID_HTML_DOC);
 DeleteHtml(doc);
```
HttpGet(url[],Header:headerid=INVALID_HEADER)
- Params
  - url[] - Url of a website
  - header - id of header object created using CreateHeader
- Returns
  - Response id if successful
  - if failed to INVALID_HTTP_RESPONSE is returned
- Example Usage
```
new Response:response = HttpGet("https://www.sa-mp.com");
ASSERT(response != INVALID_HTTP_RESPONSE);
DeleteResponse(response);
```
HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)
- Params
  - playerid - id of the player
  - callback[] - name of the callback function to handle the response.
  - url[] - Url of a website
  - header - id of header object created using CreateHeader
- Example Usage
```
HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
//********
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    ASSERT(responseid != INVALID_HTTP_RESPONSE);
    DeleteResponse(responseid);
}
```
ParseSelector(string[])
- Params
  - string[] - CSS selector
- Returns
  - Selector instance id if successful
  - if failed to INVALID_SELECTOR is returned
- Example Usage
```
new Selector:selector = ParseSelector("h1 .foo");
ASSERT(selector != INVALID_SELECTOR);
DeleteSelector(selector);
```

CreateHeader(...)

Params
- key,value pairs of String type
Returns
- Header instance id if successful
- if failed to INVALID_HEADER is returned

Example Usage

new Header:header = CreateHeader(
    "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
ASSERT(header != INVALID_HEADER);
new Response:response = HttpGet("https://sa-mp.com/",header);
ASSERT(response != INVALID_HTTP_RESPONSE);
ASSERT(DeleteHeader(header) == 1);

GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the n'th occurence of element in the document (starts from 0)
- string[] - element name is stored
- size - sizeof string
Returns
- 1 if successful
- 0 if failed

Example Usage

new Html:doc = ParseHtmlDocument("\
    <!DOCTYPE html>\
    <meta charset=\"utf-8\">\
    <title>Hello, world!</title>\
    <h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);

new Selector:selector = ParseSelector("i");
ASSERT(selector != INVALID_SELECTOR);

new i= -1,element_name[10];
while(GetNthElementName(doc,selector,++i,element_name)!=0){
    ASSERT(strcmp(element_name,"i") == 0);
}

DeleteSelector(selector);
DeleteHtml(doc);

GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the n'th occurence of element in the document (starts from 0)
- string[] - element name
- size - sizeof string
Returns
- 1 if successful
- 0 if failed

Example Usage

new Html:doc = ParseHtmlDocument("\
    <!DOCTYPE html>\
    <meta charset=\"utf-8\">\
    <title>Hello, world!</title>\
    <h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);

new Selector:selector = ParseSelector("h1.foo");
ASSERT(selector != INVALID_SELECTOR);

new element_text[20];
ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);

new check = strcmp(element_text,("Hello, world!"));
ASSERT(check == 0);

DeleteSelector(selector);
DeleteHtml(doc);

GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))

Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the n'th occurence of element in the document (starts from 0)
- attribute[] - the attribute of element
- string[] - element name
- size - sizeof string
Returns
- 1 if successful
- 0 if failed

Example Usage

 new Html:doc = ParseHtmlDocument("\
 	<!DOCTYPE html>\
 	<meta charset=\"utf-8\">\
 	<title>Hello, world!</title>\
 	<h1 class=\"foo\">Hello, <i>world!</i></h1>\
 ");
 ASSERT(doc != INVALID_HTML_DOC);
 
 new Selector:selector = ParseSelector("h1");
 ASSERT(selector != INVALID_SELECTOR);
 
 new element_attribute[20];
 ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);

 new check = strcmp(element_attribute,("foo"));
 ASSERT(check == 0);

 DeleteSelector(selector);
 DeleteHtml(doc);

DeleteHtml(Html:id)
- Params
  - id - html instance to be deleted
- Returns
  - 1 if successful
  - 0 if failed
DeleteSelector(Selector:id)
- Params
  - id - selector instance to be deleted
- Returns
  - 1 if successful
  - 0 if failed
DeleteResponse(Html:id)
- Params
  - id - response instance to be deleted
- Returns
  - 1 if successful
  - 0 if failed
DeleteHeader(Header:id)
- Params
  - id - header instance to be deleted
- Returns
  - 1 if successful
  - 0 if failed

Usage

A small example to fetch all links in wiki.sa-mp.com

new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
	printf("HTTP ERROR");
	return;
}

new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
	DeleteResponse(response);
	return;
}

new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
	DeleteResponse(response);
	DeleteHtml(html);
	return;
}

new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
	printf("%s",str);
	++i;
}
//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);

The same above with threaded http call would be

HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid)
{
	
	if(responseid == INVALID_HTTP_RESPONSE){
		printf("HTTP ERROR");
		return 0;
	}

	new Html:html = ResponseParseHtml(responseid);
	if(html == INVALID_HTML_DOC){
		DeleteResponse(response);
		return 0;
	}

	new Selector:selector = ParseSelector("a");
	if(selector == INVALID_SELECTOR){
		DeleteResponse(response);
		DeleteHtml(html);
		return 0;
	}

	new str[500],i;
	while(GetNthElementAttrVal(html,selector,i,"href",str)){
		printf("%s",str);
		++i;
	}

	DeleteHtml(html);
	Delete(response);
	DeleteSelector(selector);
	return 1;
}

More examples can be found in examples

Note

The plugin is in primary stage and more tests and features needed to be added.I'm open to any kind of contribution, just open a pull request if you have anything to improve or add new features.

Special thanks

Eva for samp-rust-sdk
Y_Less for y_tests
Discord members in SAMP discord channel

sreyas-sreelal / pawn-scraper Goto Github PK

pawn-scraper's Introduction

pawn-scraper

Installing

OR

Building

API

ParseHtmlDocument(document[])

ResponseParseHtml(Response:id)

HttpGet(url[],Header:headerid=INVALID_HEADER)

HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)

ParseSelector(string[])

CreateHeader(...)

GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))

DeleteHtml(Html:id)

DeleteSelector(Selector:id)

DeleteResponse(Html:id)

DeleteHeader(Header:id)

Usage

Note

Special thanks

pawn-scraper's People

Contributors

Stargazers

Watchers

Forkers

pawn-scraper's Issues

Recommend Projects

Recommend Topics

Recommend Org