interactiveadvertisingbureau / adstxtcrawler
A reference implementation in python of a simple crawler for Ads.txt
The phrase "For each line the crawler with extract the full hostname" should be "For each line the crawler will extract the full hostname." The word "with" should be "will".
In practice I'm running into a lot of UnicodeEncodeErrors when running a list of domains. They are related to special characters like "-" or non-breaking spaces. Is there a solution to that?
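One possible way to sidestep these encoding errors is to decode fetched bytes defensively and normalise non-breaking spaces before writing anything out. The following is only a sketch under that assumption; `clean_field` and `write_rows` are hypothetical helper names and are not part of the crawler.

```python
import csv
import io


def clean_field(raw):
    """Return text that is safe to write: decode bytes, normalise NBSP, trim."""
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD instead of raising an exception.
        text = raw.decode("utf-8", errors="replace")
    else:
        text = raw
    # Replace non-breaking spaces with ordinary spaces, then strip the ends.
    return text.replace("\u00a0", " ").strip()


def write_rows(rows, stream):
    """Write cleaned rows as CSV to a text stream."""
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow([clean_field(field) for field in row])


# Example: a domain followed by a UTF-8 encoded NBSP, and an ID with an en dash.
buf = io.StringIO()
write_rows([[b"example.com\xc2\xa0", "pub\u20131234", "DIRECT"]], buf)
```

When writing to a real file rather than an in-memory buffer, opening it with `open(path, "w", encoding="utf-8", newline="")` keeps the output encoding explicit instead of relying on the platform default.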
Test case results in an HTML file being sent to the parser, generating a Unicode parsing error.
Hi,
In the INSERT query there is no value for ADSYSTEM_DOMAIN, which is declared NOT NULL in the adstxt table, so the insert is failing.
When the table is created without that column, everything works.
adstxt_crawler.py line 59
insert_stmt = "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) VALUES (?, ?, ?, ?, ?, ? );"
Create Table in adstxt_crawler.sql
CREATE TABLE adstxt(
    SITE_DOMAIN TEXT NOT NULL,
    EXCHANGE_DOMAIN TEXT NOT NULL,
    ADSYSTEM_DOMAIN INTEGER NOT NULL,
    SELLER_ACCOUNT_ID TEXT NOT NULL,
    ACCOUNT_TYPE TEXT NOT NULL,
    TAG_ID TEXT NOT NULL,
    ENTRY_COMMENT TEXT NOT NULL,
    UPDATED DATE DEFAULT (datetime('now','localtime')),
    PRIMARY KEY (SITE_DOMAIN,EXCHANGE_DOMAIN,SELLER_ACCOUNT_ID)
);
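The mismatch can be reproduced with an in-memory SQLite database. The sketch below reuses the schema and INSERT statement quoted above. The sample row and the ADSYSTEM_DOMAIN value of 0 are made-up placeholders. Because the statement uses INSERT OR IGNORE, SQLite skips a row that violates a NOT NULL constraint instead of raising an error, so the failure is silent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE adstxt(
        SITE_DOMAIN TEXT NOT NULL,
        EXCHANGE_DOMAIN TEXT NOT NULL,
        ADSYSTEM_DOMAIN INTEGER NOT NULL,
        SELLER_ACCOUNT_ID TEXT NOT NULL,
        ACCOUNT_TYPE TEXT NOT NULL,
        TAG_ID TEXT NOT NULL,
        ENTRY_COMMENT TEXT NOT NULL,
        UPDATED DATE DEFAULT (datetime('now','localtime')),
        PRIMARY KEY (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID)
    );
""")

# Statement as quoted in the report: no value for ADSYSTEM_DOMAIN.
old_stmt = (
    "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, "
    "SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) "
    "VALUES (?, ?, ?, ?, ?, ?);"
)
# Made-up sample values for illustration only.
conn.execute(old_stmt, ("example.com", "google.com", "pub-0000", "DIRECT", "", ""))
rows_after_old = conn.execute("SELECT COUNT(*) FROM adstxt").fetchone()[0]

# One possible fix: supply ADSYSTEM_DOMAIN explicitly (0 is a placeholder ID).
new_stmt = (
    "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, ADSYSTEM_DOMAIN, "
    "SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) "
    "VALUES (?, ?, ?, ?, ?, ?, ?);"
)
conn.execute(new_stmt, ("example.com", "google.com", 0, "pub-0000", "DIRECT", "", ""))
rows_after_new = conn.execute("SELECT COUNT(*) FROM adstxt").fetchone()[0]

print(rows_after_old, rows_after_new)
```

Whether the right fix is to populate the column or to relax the schema depends on how the project intends ADSYSTEM_DOMAIN to be used, which the report does not settle.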
I love this tool, but for some reason when I run it the data doesn't get stored in the DB. I have just been outputting it to a CSV, but I'm wondering if I am missing something obvious.
google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0
Please consider moving this to /.well-known/
Test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing while trying to parse the paywall text.
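A redirect cap could look roughly like the sketch below. It is written against an injectable `fetch(url)` callable returning `(status, headers, body)` so that it runs without network access. `resolve`, `TooManyRedirects`, and both fake fetchers are hypothetical names for illustration, not the crawler's code. If the crawler uses the requests library, `Session.max_redirects` offers a comparable built-in limit.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


class TooManyRedirects(Exception):
    """Raised when a redirect chain exceeds the configured cap."""


def resolve(url, fetch, max_redirects=3):
    """Follow at most max_redirects redirects, then give up explicitly."""
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, body
    raise TooManyRedirects("gave up after %d redirects" % max_redirects)


def paywall_fetch(url):
    """Fake fetcher: a paywall that redirects every request forever."""
    return 302, {"Location": "/subscribe"}, ""


def direct_fetch(url):
    """Fake fetcher: serves a plain-text file with no redirect."""
    return 200, {"Content-Type": "text/plain"}, "google.com, pub-0000, DIRECT\n"
```

With this shape, a paywalled site produces a clear `TooManyRedirects` error that can be logged per domain, instead of paywall HTML reaching the ads.txt parser.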
We noticed this bot was crawling/fetching app-ads.txt on one of our dev sites.
::ffff:127.0.0.1 - - [17/Jul/2020:19:29:54 +0000] "GET /app-ads.txt HTTP/1.0" 404 - "-" "AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
User-agent: AdsTxtCrawler
Disallow: /
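For reference, the effect of the two robots.txt lines above can be checked with Python's standard `urllib.robotparser`. This sketch only shows how a compliant client would interpret the rule. Whether the crawler itself consults robots.txt is not stated here. The URL and the `SomeOtherBot` name are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: AdsTxtCrawler
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network is needed.
parser.parse(robots_txt.splitlines())

# The rule blocks the named crawler from every path on the site.
crawler_allowed = parser.can_fetch("AdsTxtCrawler", "https://example.com/app-ads.txt")

# Agents not matched by any rule remain allowed.
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/app-ads.txt")

print(crawler_allowed, other_allowed)
```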
What user agent will this crawler report as?
Could there be a setting to specify who is running the crawler, and to include this in the user agent? This way, website owners can contact those responsible if they are being excessively crawled.
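One way such a setting might look is sketched below. The base string is the user agent seen in the access log quoted earlier. The `build_user_agent` function and the `operator=` suffix format are assumptions made for illustration, not an existing option of the crawler.

```python
BASE_USER_AGENT = (
    "AdsTxtCrawler/1.0; "
    "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
)


def build_user_agent(operator_contact=None):
    """Return the user agent, optionally tagged with operator contact details."""
    if not operator_contact:
        return BASE_USER_AGENT
    # Refuse line breaks so the value cannot inject extra HTTP headers.
    if "\r" in operator_contact or "\n" in operator_contact:
        raise ValueError("operator contact must be a single line")
    return "{}; operator={}".format(BASE_USER_AGENT, operator_contact.strip())


# Example with a placeholder contact address.
headers = {"User-Agent": build_user_agent("crawler-admin@example.com")}
```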
This will make further changes/improvements safer. Pytest is a good candidate for unit tests.
If a server declares the character set for the text file with "text/plain; charset=UTF-8" (as it should), adstxtcrawler gets an HTTP 406 (Not acceptable) response, instead of downloading the ads.txt file. This seems to be due to the fact that adstxtcrawler only accepts 'text/plain' and nothing else.
myheaders = {
    'User-Agent': 'AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler',
    'Accept': 'text/plain',
}
I think one of these will fix the issue:
'text/plain,*/*',
'text/plain,text/plain; charset=UTF-8',
My web browser, for instance, has:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
It accepts anything due to */*.
I can easily set up a server for you to test this with if you want.
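A sketch of the first suggested fix, combined with a client-side check that tolerates a charset parameter, might look like this. The q-value of 0.1 and the `is_plain_text` helper are assumptions for illustration. The report itself only proposes widening the Accept value.

```python
# Prefer text/plain but accept anything, so servers that describe the file as
# "text/plain; charset=UTF-8" have no reason to answer 406.
myheaders = {
    "User-Agent": "AdsTxtCrawler/1.0; "
                  "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler",
    "Accept": "text/plain, */*;q=0.1",
}


def is_plain_text(content_type):
    """True if a Content-Type header names text/plain, ignoring parameters."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


print(is_plain_text("text/plain; charset=UTF-8"))
print(is_plain_text("text/html"))
```

Checking the response's media type on the client side, rather than relying on a strict Accept header, would also let the crawler skip HTML error pages without being refused by correctly configured servers.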
The branch adstxt-results contains historic crawls of ads.txt files from all over the web.
How was the initial list of domains obtained? The commit message says "Google's ads.txt crawler".
Also, what is the licensing on this data?
I have noticed that, within the .sql file, PulsePoint and Contextweb are listed under the same ID number (95).
Does this mean that if a publisher were to add PulsePoint or Contextweb to their file, it would be recognised as the same entity, given that PulsePoint is Contextweb?