ImageCrawl

Overview

Built on Scrapy, ImageCrawl is a web image crawler that records each image's original URL and downloads the images automatically.
It currently supports Flickr, Instagram, Google Image Search, and Bing Image Search.

Requirements

  • Python 2.7
  • Scrapy
  • GoAgent (if you are in mainland China and cannot connect to the target websites)

Documentation

Go to the top-level directory of this project and run:

scrapy crawl [spider name]

In this project, the spider name can be Flickr, Instagram, GoogleSearch, or BingSearch (without the brackets). Before you run the command above, you need to edit the corresponding file ImageCrawl/spiders/xxx_spider.py.
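
As a small illustration of the command above, the following Python sketch checks a spider name against the four names listed in this README and builds the `scrapy crawl` invocation. The helpers `build_crawl_command` and `run_spider` and the `SPIDER_NAMES` tuple are hypothetical and not part of the project. Actually running the command assumes Scrapy is installed and the spider file has already been edited.

```python
import subprocess

# Spider names as listed in this README (assumed to match each spider's `name`).
SPIDER_NAMES = ("Flickr", "Instagram", "GoogleSearch", "BingSearch")


def build_crawl_command(spider_name):
    """Return the argument list for `scrapy crawl <spider_name>`."""
    if spider_name not in SPIDER_NAMES:
        raise ValueError("unknown spider: %r" % (spider_name,))
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir="."):
    """Run the spider from the project's top-level directory."""
    return subprocess.call(build_crawl_command(spider_name), cwd=project_dir)
```

Calling `run_spider("Flickr")` from the top-level directory would be equivalent to typing `scrapy crawl Flickr` there. Spider names are case-sensitive in this sketch, so `flickr` is rejected.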


For Flickr, you need your own api_key (see here) and a search tag of your choice. To change other parameters, read the spider file carefully or consult the Flickr API documentation.

class FlickrSpider(scrapy.Spider):
    name = "Flickr"
    tag = 'your tag'          # tag to search for
    api_key = 'your api_key'  # your own Flickr API key
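
To show how an api_key and a tag typically end up in a request, here is a minimal sketch that builds a `flickr.photos.search` URL for the Flickr REST endpoint. The function `build_flickr_search_url` and its `page` and `per_page` defaults are assumptions made for illustration. The project's own spider may build its requests differently.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_flickr_search_url(api_key, tag, page=1, per_page=100):
    """Build a flickr.photos.search URL that asks for a plain JSON response."""
    params = [
        ("method", "flickr.photos.search"),
        ("api_key", api_key),
        ("tags", tag),
        ("page", page),
        ("per_page", per_page),
        ("format", "json"),
        ("nojsoncallback", 1),  # return raw JSON instead of a JSONP wrapper
    ]
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)
```

For example, `build_flickr_search_url('your api_key', 'your tag')` yields a URL that a spider could pass to a Scrapy request. The tag is URL-encoded, so spaces and non-ASCII characters are safe.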

For Instagram, you need your own access_token (see here) and a search tag of your choice. To change other parameters, read the spider file carefully or consult the Instagram API documentation.

class InstagramSpider(scrapy.Spider):
    name = "Instagram"
    tag = 'your tag'  # tag to search for
    params = {
        'access_token': 'your access_token',  # your own Instagram access token
    }

For Google Image Search, you need to choose your search keyword. To change other parameters, read the spider file carefully or consult the Google Image API documentation.

class GoogleSearchSpider(scrapy.Spider):
    name = "GoogleSearch"
    key_word = 'your key_word'  # keyword to search for

For Bing Image Search, you need your own account key (see here) and a search keyword of your choice. To change other parameters, read the spider file carefully or consult the Bing Search API documentation.

class BingSearchSpider(scrapy.Spider):
    name = "BingSearch"
    key_word = 'your key_word'    # keyword to search for
    acctKey = 'your account Key'  # your own Bing account key

When the program finishes, you will have a csv folder that stores the crawl result (named with the program's start time), and the images will have been downloaded to the data folder.
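
The naming scheme for the crawl result can be sketched as follows. This is only an illustration written for Python 3: it assumes the result is a single CSV file inside the csv folder whose name is the start timestamp, and the helpers `result_csv_path` and `write_results`, the timestamp format, and the `image_url` column are all hypothetical. The project's actual file layout and columns may differ.

```python
import csv
import os
import time


def result_csv_path(start_time=None, folder="csv"):
    """Build a path such as csv/20240131_235959.csv from the start time."""
    if start_time is None:
        start_time = time.localtime()
    stamp = time.strftime("%Y%m%d_%H%M%S", start_time)
    return os.path.join(folder, stamp + ".csv")


def write_results(path, image_urls):
    """Write a header row and one image URL per row, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder and not os.path.isdir(folder):
        os.makedirs(folder)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_url"])
        for url in image_urls:
            writer.writerow([url])
```

Because the timestamp is taken once at start-up, each run writes to its own file and earlier results are not overwritten.
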
To change the image download directory, edit the last line of the file ImageCrawl/settings.py:

IMAGES_STORE = 'data'
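
A relative value such as 'data' is resolved against the directory you run the crawl from. If you prefer a fixed location, one option is to set an absolute path, as in this sketch. The `output/images` directory name is a hypothetical example; `IMAGES_STORE` is the standard Scrapy setting used by its images pipeline.

```python
import os

# Hypothetical target directory; replace it with any writable location.
# abspath() resolves the path once, against the current working directory.
IMAGES_STORE = os.path.abspath(os.path.join("output", "images"))
```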

Note that the program works with GoAgent by default. Please make sure GoAgent is already running and working properly before you start. To change or disable GoAgent, see this.
