interactiveadvertisingbureau / adstxtcrawler

A reference implementation in Python of a simple crawler for ads.txt

Python 87.16% Shell 0.36% Go 12.48%

adstxtcrawler's People

i04n adtechnology amandaohara jhpacker sohilgithub pbjorke iantri brk212 g-seed bradlucas ouimet51 nag4 antoinejac argrigoryan muralidharan1886 nirmal1212 ssauerhaft sunlchk wickram aramzs ckseet markparolisi sean-mcmann miyaichi freedom8879 web365 zed7576 abeusher semion-r thezedwards dmalanga sangaviashok justinformentin survivor4848 schustda1 sdarekar breza jacksonosvaldo hgaron ectogon bensch5000 roxmbaldwin bhanditz intelligentart ganjix91-twitch ynariw fuguangsen bedrijfsportaal seymourjoseph araujodara mojib229 mpk112 bjhulst sa-yamini jaymeebugplayss kingdove35 monkou jp100m disqus-inc jeremycanales01 allandproust hosamdib deni-eng-27 http-denijujni tigareks kevinqz shynsky ressmann andy-emre terrorizer1980 galtay fahrulmustari6 admariner nathaly-alarcon-mojix-com berjayasompo rvaneijk fmudhish isabella232 gugupopes123 wrmike1 ggaengma smooth80 foreza asmae58 basil79 sanjeevkumarpandey gaybro8777 bryanchance cesarhub pinkdiamond1 dodiiit maamon-bh samanthadegges jpvang22 sonvo9900 freetellurian devilpie tonyakilbourne33 sky19831225 bheemla123

adstxtcrawler's Issues

Typo in README

The phrase "For each line the crawler with extract the full hostname" should be "For each line the crawler will extract the full hostname." The word "with" should be "will".

In practice I'm running into a lot of UnicodeEncodeErrors when I run the crawler over a list of domains. They are caused by special characters such as "-" or non-breaking spaces. Is there a solution to that?
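One possible approach, sketched here for Python 3 and not taken from the crawler itself, is to open every output file with an explicit UTF-8 encoding so that the platform's default encoding never comes into play. The helper names, the temporary file name, and the sample row are all made up for illustration.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows to a CSV file, encoding the text as UTF-8 explicitly."""
    # Passing encoding="utf-8" avoids relying on the platform default,
    # which may not be able to represent every character.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_rows_utf8(path):
    """Read the rows back using the same explicit encoding."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


# Made-up row containing a non-breaking space (U+00A0) and an en dash (U+2013).
sample_rows = [["example.com", "Ad\u00a0System \u2013 reseller"]]
tmp_path = os.path.join(tempfile.mkdtemp(), "adstxt_sample.csv")
write_rows_utf8(tmp_path, sample_rows)
round_tripped = read_rows_utf8(tmp_path)
```

If the errors come from printing to a console rather than from writing files, the same idea applies to the output stream, but that case is not covered by this sketch.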

Crawler doesn't fail if the first data line contains non-schema data

The test case results in an HTML file being sent to the parser, which generates a Unicode parsing error.

Consider verifying that the first non-comment line contains exactly 3 or 4 fields around adstxt_crawler.py line 140.
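A minimal sketch of the suggested check, assuming the fetched body is already available as a string. The function name is hypothetical and this is not the code at line 140 of adstxt_crawler.py. Note that a file whose first data line is a variable declaration such as a contact line would also be rejected by this exact rule, so a real implementation may need to be more lenient.

```python
def first_data_line_is_valid(text):
    """Return True if the first non-blank, non-comment line has 3 or 4 fields."""
    for raw_line in text.splitlines():
        # Strip a trailing "#" comment and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line, keep looking
        fields = [field.strip() for field in line.split(",")]
        # Only the first data line is inspected, as the issue proposes.
        return len(fields) in (3, 4)
    return False  # the body contained no data lines at all


ads_txt_body = "# comment\ngoogle.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0\n"
html_body = "<!DOCTYPE html>\n<html><body>Subscribe to read</body></html>"
```

With this check, `first_data_line_is_valid(html_body)` is false, so an HTML page returned in place of ads.txt could be discarded before parsing.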

Missing value for ADSYSTEM_DOMAIN in INSERT

Hi,
In the INSERT query there is no value for ADSYSTEM_DOMAIN, which is declared NOT NULL in the adstxt table, so the insert fails.
When the table is created without that column, everything works.

adstxt_crawler.py line 59:

    insert_stmt = "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) VALUES (?, ?, ?, ?, ?, ? );"

Create Table in adstxt_crawler.sql:

    CREATE TABLE adstxt(
        SITE_DOMAIN TEXT NOT NULL,
        EXCHANGE_DOMAIN TEXT NOT NULL,
        ADSYSTEM_DOMAIN INTEGER NOT NULL,
        SELLER_ACCOUNT_ID TEXT NOT NULL,
        ACCOUNT_TYPE TEXT NOT NULL,
        TAG_ID TEXT NOT NULL,
        ENTRY_COMMENT TEXT NOT NULL,
        UPDATED DATE DEFAULT (datetime('now','localtime')),
        PRIMARY KEY (SITE_DOMAIN,EXCHANGE_DOMAIN,SELLER_ACCOUNT_ID)
    );
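One way the mismatch could be resolved is to supply a value for ADSYSTEM_DOMAIN in the INSERT. The sketch below reproduces the table from the issue in an in-memory SQLite database and inserts a row with seven placeholders. The value 0 used for ADSYSTEM_DOMAIN and the sample site domain are assumptions made for illustration; how the project actually derives that ID is not stated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE adstxt(
        SITE_DOMAIN TEXT NOT NULL,
        EXCHANGE_DOMAIN TEXT NOT NULL,
        ADSYSTEM_DOMAIN INTEGER NOT NULL,
        SELLER_ACCOUNT_ID TEXT NOT NULL,
        ACCOUNT_TYPE TEXT NOT NULL,
        TAG_ID TEXT NOT NULL,
        ENTRY_COMMENT TEXT NOT NULL,
        UPDATED DATE DEFAULT (datetime('now','localtime')),
        PRIMARY KEY (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID)
    );
    """
)

# Same statement as in the issue, with ADSYSTEM_DOMAIN and a seventh "?" added.
insert_stmt = (
    "INSERT OR IGNORE INTO adstxt "
    "(SITE_DOMAIN, EXCHANGE_DOMAIN, ADSYSTEM_DOMAIN, SELLER_ACCOUNT_ID, "
    "ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) VALUES (?, ?, ?, ?, ?, ?, ?);"
)

# Assumption: 0 is a placeholder ad system ID used only for this example.
row = ("example.com", "google.com", 0, "pub-2788282916770715",
       "DIRECT", "f08c47fec0942fa0", "")
conn.execute(insert_stmt, row)
conn.commit()

stored = conn.execute("SELECT SITE_DOMAIN, ADSYSTEM_DOMAIN FROM adstxt").fetchall()
```

Because the statement uses INSERT OR IGNORE, SQLite skips a row that violates a NOT NULL constraint instead of raising an error, which may be why the failure is easy to miss.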

Sqlite DB not populating

I love this tool, but for some reason when I run it the data doesn't get stored in the DB. I have just been outputting it to a CSV instead, but I'm wondering if I am missing something obvious.

My station's

Myprojectgc

google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0

./well-known

Please consider moving this to /.well-known/

https://tools.ietf.org/html/rfc5785

Following too many Redirects Is Problematic

The test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing while parsing paywall text.

Consider updating adstxt_crawler.py line 126 to include "allow_redirects=False" in the requests.get call. The drawback is that this fails to handle useful redirects (domain.com > www.domain.com / HTTP > HTTPS).
Consider updating adstxt_crawler.py line 129 to examine "r.history" for a maximum number of redirects (1-2). This requires additional code to interpret what r.history returns.
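The second option could look roughly like the sketch below. It relies on the fact that a `requests` response exposes `history` as a list with one entry per redirect that was followed. The function name and the limit of 2 are assumptions, and the stand-in objects replace real responses so that the example runs without network access.

```python
from types import SimpleNamespace

MAX_REDIRECTS = 2  # assumed limit, based on the "1-2" suggested in the issue


def exceeded_redirect_limit(response, max_redirects=MAX_REDIRECTS):
    """Return True if the response was reached through too many redirects."""
    # Each followed redirect contributes one entry to response.history.
    return len(response.history) > max_redirects


# Stand-ins for real responses; only the .history attribute matters here.
direct = SimpleNamespace(history=[])
http_to_https = SimpleNamespace(history=["301"])
paywall_chain = SimpleNamespace(history=["302"] * 5)
```

A crawler using this check would still follow the useful single-hop redirects and could skip the domain, with a log message, when the chain is longer than the limit.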

ads.txt

http://mayerads.com

Honour robots.txt

We noticed this bot was crawling/fetching app-ads.txt on one of our dev sites.

::ffff:127.0.0.1 - - [17/Jul/2020:19:29:54 +0000] "GET /app-ads.txt HTTP/1.0" 404 - "-" "AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"

Add instructions in the Readme on how to disallow the bot by adding an entry to your robots.txt

User-agent: AdsTxtCrawler
Disallow: /

Fetch the robots.txt and honour it if it has a disallow
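As a rough illustration of the second request, Python's standard library already ships a robots.txt parser. The sketch below feeds it the robots.txt text directly, without fetching anything, and asks whether the crawler's user agent may fetch a URL. The helper name and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "AdsTxtCrawler/1.0"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the text can come from any source.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = "User-agent: AdsTxtCrawler\nDisallow: /\n"
target = "https://example.com/app-ads.txt"

blocked = is_allowed(robots_txt, target)
other_bot = is_allowed(robots_txt, target, user_agent="OtherBot/2.0")
no_rules = is_allowed("", target)
```

In a crawler, the robots.txt body would be fetched once per host before requesting ads.txt or app-ads.txt, and the host skipped when the check returns false.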

User agent

What user agent will this crawler report as?

Please can there be a setting to specify information as to who is running the crawler, and put this in the useragent. This way, website owners can contact those responsible if they are being excessively crawled.
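A small sketch of how such a setting might be folded into the user agent string. The base string is the one that appears in the access log quoted in another issue on this page. The function name, the "operator:" label, and the contact address are made up for illustration and are not an existing option of the crawler.

```python
BASE_USER_AGENT = (
    "AdsTxtCrawler/1.0; "
    "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
)


def build_user_agent(operator_contact=None):
    """Return the user agent, with operator contact details appended if given."""
    if not operator_contact:
        return BASE_USER_AGENT
    # Hypothetical format: append the contact after the project URL.
    return "{}; operator: {}".format(BASE_USER_AGENT, operator_contact.strip())


default_agent = build_user_agent()
custom_agent = build_user_agent(" crawler-admin@example.com ")
```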

com.zhiliaoapp.musically

Add tests

This will make further changes/improvements safer. Pytest is a good candidate for unit tests.
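To give an idea of what such tests might look like, here is a pytest-style sketch. The `split_record` helper is hypothetical and stands in for whatever parsing function the crawler would expose. pytest collects functions named `test_*` and reports failing plain `assert` statements, so no extra imports are needed in the test file itself.

```python
def split_record(line):
    """Hypothetical helper: split one ads.txt data line into stripped fields."""
    without_comment = line.split("#", 1)[0]
    return [field.strip() for field in without_comment.split(",")]


def test_split_record_four_fields():
    line = "google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0"
    assert split_record(line) == [
        "google.com",
        "pub-2788282916770715",
        "DIRECT",
        "f08c47fec0942fa0",
    ]


def test_split_record_ignores_trailing_comment():
    line = "example.com, 123, RESELLER # added for testing"
    assert split_record(line) == ["example.com", "123", "RESELLER"]
```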

Accept: text/plain; charset=UTF-8

If a server declares the character set for the text file with "text/plain; charset=UTF-8" (as it should), adstxtcrawler gets an HTTP 406 (Not acceptable) response, instead of downloading the ads.txt file. This seems to be due to the fact that adstxtcrawler only accepts 'text/plain' and nothing else.

    myheaders = {
        'User-Agent':
        'AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler',
        'Accept':
        'text/plain',
    }

I think one of these will fix the issue:

'text/plain,*/*',
'text/plain,text/plain; charset=UTF-8',

My web browser, for instance, has:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

It accepts anything due to */*.

I can easily set up a server for you to test this with if you want.
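A sketch along the lines of the first suggestion: keep text/plain as the preferred type and add a lower-priority wildcard, then check the media type of the response so that HTML pages can still be rejected. The q-value, the function names, and the use of `email.message.Message` to parse the Content-Type header are choices made for this example, not the crawler's current behaviour.

```python
from email.message import Message


def build_headers():
    """Request headers that prefer plain text but tolerate other types."""
    return {
        "User-Agent": "AdsTxtCrawler/1.0; "
                      "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler",
        # Assumed value: the wildcard is a fallback for servers that would
        # otherwise answer 406 to a bare "text/plain".
        "Accept": "text/plain, */*;q=0.8",
    }


def is_plain_text(content_type_header):
    """Return True if the header's media type is text/plain, ignoring parameters."""
    message = Message()
    message["Content-Type"] = content_type_header
    return message.get_content_type() == "text/plain"


headers = build_headers()
```

With the wider Accept header the server decides what to send, so the `is_plain_text` check on the response becomes the place where non-text bodies are filtered out.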

Question on crawls and their licensing

The branch adstxt-results contains historic crawls of ads.txt files from all over the web.
How was the initial list of domains obtained? The commit message says "Google's ads.txt crawler".

Also, what is the licensing on this data?

PulsePoint and Contextweb

I have noticed that within the .sql file PulsePoint and Contextweb are listed under the same ID number (95).

Does this mean if a publisher were to add PulsePoint or add Contextweb into their file, it would be recognised as the same?

As PulsePoint is Contextweb.