adstxtcrawler's People

Contributors

breza, dharmik9009, galtay, iabmayank, jhpacker, nealrichter, pbjorke

adstxtcrawler's Issues

Typo in README

The phrase "For each line the crawler with extract the full hostname" should be "For each line the crawler will extract the full hostname." The word "with" should be "will".

Encoding Issues

In practice, I'm running into a lot of UnicodeEncodeErrors when trying to run a list of domains. They seem to be related to special characters like "-" or non-breaking spaces. Is there a solution to that?
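
One possible workaround, sketched here under assumptions: the sketch assumes Python 3 and that the errors come from writing non-ASCII text through an ASCII-only output stream. The helper names `clean_field` and `write_rows` are hypothetical and are not part of adstxt_crawler.py.

```python
import csv
import io
import unicodedata


def clean_field(text):
    """Normalize a field so that non-breaking spaces become plain spaces.

    NFKC normalization maps U+00A0 (non-breaking space) to U+0020.
    Other non-ASCII characters (for example an en dash) are kept as-is.
    """
    return unicodedata.normalize("NFKC", text).strip()


def write_rows(fileobj, rows):
    """Write cleaned rows as CSV to an already-open text file object.

    For a real file, open it with an explicit encoding, e.g.
    open(path, "w", encoding="utf-8", newline=""), so that non-ASCII
    characters do not depend on the platform default encoding.
    """
    writer = csv.writer(fileobj)
    for row in rows:
        writer.writerow([clean_field(field) for field in row])


# Usage with an in-memory buffer standing in for a UTF-8 file.
buf = io.StringIO()
write_rows(buf, [["example.com\u00a0", "pub\u20131"]])
```

If the crawler is run under Python 2, the same idea applies, but the file would need to be opened with `io.open(..., encoding="utf-8")` and fields handled as unicode objects.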

Missing value for ADSYSTEM_DOMAIN in INSERT

Hi,
In the INSERT query there is no value for ADSYSTEM_DOMAIN, which is declared NOT NULL in the adstxt table, so the insert fails.
When I create the table without that column, everything works.

adstxt_crawler.py line 59
insert_stmt = "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) VALUES (?, ?, ?, ?, ?, ? );"

Create Table in adstxt_crawler.sql
CREATE TABLE adstxt( SITE_DOMAIN TEXT NOT NULL, EXCHANGE_DOMAIN TEXT NOT NULL, ADSYSTEM_DOMAIN INTEGER NOT NULL, SELLER_ACCOUNT_ID TEXT NOT NULL, ACCOUNT_TYPE TEXT NOT NULL, TAG_ID TEXT NOT NULL, ENTRY_COMMENT TEXT NOT NULL, UPDATED DATE DEFAULT (datetime('now','localtime')), PRIMARY KEY (SITE_DOMAIN,EXCHANGE_DOMAIN,SELLER_ACCOUNT_ID) );
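
A minimal sketch reproducing the mismatch with an in-memory SQLite database. Because the statement uses INSERT OR IGNORE, SQLite skips a row that violates a NOT NULL constraint instead of raising an error, so the failure is silent. The value `0` used for ADSYSTEM_DOMAIN and the sample row are placeholders chosen for illustration; the real value would have to come from wherever the crawler resolves ad system IDs.

```python
import sqlite3

# Table definition as quoted from adstxt_crawler.sql.
CREATE_STMT = """
CREATE TABLE adstxt(
    SITE_DOMAIN TEXT NOT NULL,
    EXCHANGE_DOMAIN TEXT NOT NULL,
    ADSYSTEM_DOMAIN INTEGER NOT NULL,
    SELLER_ACCOUNT_ID TEXT NOT NULL,
    ACCOUNT_TYPE TEXT NOT NULL,
    TAG_ID TEXT NOT NULL,
    ENTRY_COMMENT TEXT NOT NULL,
    UPDATED DATE DEFAULT (datetime('now','localtime')),
    PRIMARY KEY (SITE_DOMAIN, EXCHANGE_DOMAIN, SELLER_ACCOUNT_ID)
);
"""

# Statement as quoted from adstxt_crawler.py: no ADSYSTEM_DOMAIN column.
ORIGINAL_INSERT = (
    "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, "
    "SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) "
    "VALUES (?, ?, ?, ?, ?, ?);"
)

# Possible fix: supply ADSYSTEM_DOMAIN explicitly (seven placeholders).
FIXED_INSERT = (
    "INSERT OR IGNORE INTO adstxt (SITE_DOMAIN, EXCHANGE_DOMAIN, "
    "ADSYSTEM_DOMAIN, SELLER_ACCOUNT_ID, ACCOUNT_TYPE, TAG_ID, ENTRY_COMMENT) "
    "VALUES (?, ?, ?, ?, ?, ?, ?);"
)


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM adstxt").fetchone()[0]


def demo():
    """Return row counts after the original and the fixed insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_STMT)
    row = ("example.com", "google.com", "pub-0000", "DIRECT", "", "")
    conn.execute(ORIGINAL_INSERT, row)  # silently ignored: NOT NULL violated
    before = count_rows(conn)
    # 0 is a placeholder ad system ID, an assumption for this sketch.
    conn.execute(FIXED_INSERT, row[:2] + (0,) + row[2:])
    after = count_rows(conn)
    conn.close()
    return before, after
```

The other option, as noted above, is to drop the column or relax its NOT NULL constraint in the schema.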

Sqlite DB not populating

I love this tool, but for some reason when I run it the data doesn't get stored in the DB. I have just been outputting it to a CSV, but I'm wondering if I am missing something obvious.

Myprojectgc

google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0

Following too many Redirects Is Problematic

The test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing while trying to parse paywall text.

  • Consider updating adstxt_crawler.py line 126 to include "allow_redirects=False" in the requests.get call. Drawback: this fails to handle useful redirects (domain.com > www.domain.com / HTTP > HTTPS).
  • Consider updating adstxt_crawler.py line 129 to examine "r.history" for a maximum number of redirects (1-2). Drawback: this requires additional code to interpret what r.history returns.
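
The two options above could be sketched roughly as follows. This assumes the `requests` library; the function names and the default limit of 2 are illustrative, not taken from adstxt_crawler.py. In `requests`, `r.history` is a list of the intermediate responses that were followed, so its length is the number of redirects.

```python
from types import SimpleNamespace


def within_redirect_limit(response, max_redirects=2):
    """Return True if the response was reached in at most max_redirects hops.

    Works on any object with a `history` list, such as a requests.Response.
    """
    return len(response.history) <= max_redirects


def fetch_with_redirect_cap(url, max_redirects=2, timeout=10):
    """Fetch url, giving up once more than max_redirects redirects occur.

    Returns the response, or None if the redirect limit was exceeded.
    The import is local so the helper above stays usable without requests.
    """
    import requests  # third-party dependency

    session = requests.Session()
    session.max_redirects = max_redirects  # requests raises beyond this
    try:
        return session.get(url, timeout=timeout)
    except requests.exceptions.TooManyRedirects:
        return None


# Stand-ins for responses, used only to illustrate the check.
paywalled = SimpleNamespace(history=[object()] * 4)  # four redirects
https_upgrade = SimpleNamespace(history=[object()])  # one redirect
```

Capping the count rather than disabling redirects keeps the useful domain.com > www.domain.com and HTTP > HTTPS hops working.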

Honour robots.txt

We noticed this bot was crawling/fetching app-ads.txt on one of our dev sites.

::ffff:127.0.0.1 - - [17/Jul/2020:19:29:54 +0000] "GET /app-ads.txt HTTP/1.0" 404 - "-" "AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
  • Add instructions in the Readme on how to disallow the bot by adding an entry to your robots.txt
User-agent: AdsTxtCrawler
Disallow: /
  • Fetch the robots.txt and honour it if it has a disallow
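
A minimal sketch of the second suggestion using Python's standard `urllib.robotparser`. The helper name `is_allowed` is hypothetical, and the robots.txt body is passed in as a string so the example runs without network access; a real crawler would first download `/robots.txt` from the target host.

```python
from urllib.robotparser import RobotFileParser

# User agent string as it appears in the access log above.
CRAWLER_UA = (
    "AdsTxtCrawler/1.0; "
    "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
)

# The robots.txt entry proposed above.
ROBOTS_TXT = "User-agent: AdsTxtCrawler\nDisallow: /\n"


def is_allowed(robots_txt, url, user_agent=CRAWLER_UA):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the entry above, `is_allowed(ROBOTS_TXT, "https://example.com/app-ads.txt")` reports that the crawler should skip the file, while an empty robots.txt allows the fetch.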

User agent

What user agent will this crawler report as?

Please can there be a setting to specify information as to who is running the crawler, and put this in the useragent. This way, website owners can contact those responsible if they are being excessively crawled.
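
One way such a setting could look, as a sketch only: the `build_user_agent` function, the `operator_contact` parameter, and the "operator:" suffix format are all hypothetical and do not exist in the crawler. The base string is the one the crawler is reported to send in the "Honour robots.txt" issue.

```python
BASE_USER_AGENT = (
    "AdsTxtCrawler/1.0; "
    "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
)


def build_user_agent(operator_contact=None):
    """Return the User-Agent string, optionally naming who runs the crawl.

    operator_contact is free text such as an email address or URL.
    """
    if not operator_contact:
        return BASE_USER_AGENT
    return "{}; operator: {}".format(BASE_USER_AGENT, operator_contact)


# The result would be placed in the 'User-Agent' request header.
headers = {"User-Agent": build_user_agent("crawl-admin@example.com")}
```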

Add tests

This will make further changes/improvements safer. Pytest is a good candidate for unit tests.
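
As a rough illustration of what such tests could look like, here is a sketch in pytest style. `parse_ads_txt_line` is a hypothetical stand-in written for this example, not a function from adstxt_crawler.py; real tests would import and exercise the crawler's own parsing code.

```python
def parse_ads_txt_line(line):
    """Hypothetical parser: split one ads.txt record into named fields.

    Returns None for blank lines, comment-only lines, and records with
    fewer than three comma-separated fields.
    """
    data = line.split("#", 1)[0].strip()  # drop trailing comments
    if not data:
        return None
    fields = [field.strip() for field in data.split(",")]
    if len(fields) < 3:
        return None
    return {
        "domain": fields[0].lower(),
        "account_id": fields[1],
        "account_type": fields[2].upper(),
        "tag_id": fields[3] if len(fields) > 3 else "",
    }


# pytest collects functions named test_*; plain asserts are enough.
def test_parses_four_field_record():
    record = parse_ads_txt_line(
        "google.com, pub-2788282916770715, DIRECT, f08c47fec0942fa0"
    )
    assert record == {
        "domain": "google.com",
        "account_id": "pub-2788282916770715",
        "account_type": "DIRECT",
        "tag_id": "f08c47fec0942fa0",
    }


def test_ignores_comments_and_blank_lines():
    assert parse_ads_txt_line("# just a comment") is None
    assert parse_ads_txt_line("   ") is None
```

Saved as, say, `test_parser.py`, this would run with the `pytest` command.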

Accept: text/plain; charset=UTF-8

If a server declares the character set for the text file with "text/plain; charset=UTF-8" (as it should), adstxtcrawler gets an HTTP 406 (Not Acceptable) response instead of downloading the ads.txt file. This seems to be because adstxtcrawler only accepts 'text/plain' and nothing else.

    myheaders = {
        'User-Agent':
        'AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler',
        'Accept':
        'text/plain',
    }

I think one of these will fix the issue:

'text/plain,*/*',
'text/plain,text/plain; charset=UTF-8',

My web browser, for instance, sends:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

It accepts anything due to */*.

I can easily set up a server for you to test this with if you want.
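
For reference, the first suggested fix could be sketched like this. The `build_headers` function is hypothetical, and the `q=0.8` weight on `*/*` is an assumption borrowed from the browser header above to express a preference for text/plain; it has not been tested against a server that returns 406.

```python
def build_headers():
    """Return request headers that prefer text/plain but accept anything."""
    return {
        "User-Agent": (
            "AdsTxtCrawler/1.0; "
            "+https://github.com/InteractiveAdvertisingBureau/adstxtcrawler"
        ),
        # text/plain is preferred; */* is a lower-priority fallback.
        "Accept": "text/plain,*/*;q=0.8",
    }


myheaders = build_headers()
# With the requests library this would be passed as:
#   requests.get(url, headers=myheaders)
```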

Question on crawls and their licensing

The branch adstxt-results contains historic crawls of ads.txt files from all over the web.
How was the initial list of domains obtained? The commit message says "Google's ads.txt crawler".

Also, what is the licensing on this data?

PulsePoint and Contextweb

I have noticed that, within the .sql file, PulsePoint and Contextweb are listed under the same ID number (95).

Does this mean that if a publisher were to add either PulsePoint or Contextweb to their file, both would be recognised as the same?

I ask because PulsePoint is Contextweb.
