djparsing

A convenient parsing library that can be used as a Django application or independently.

It parses the first data block (the newest one by date) and saves it to the specified table.

Requirements

  • python (3.4, 3.5, 3.6, 3.7)
  • django (1.8, 1.9, 1.10, 1.11)
  • lxml (4.1.1)
  • cssselect (1.0.1)
  • Pillow

Quick start

Install:
    pip install djparsing
Usage:
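from django.db import models
# HTMLField is not part of Django itself; import it from the rich-text package your project uses.

class MyModel(models.Model):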
    title = models.CharField(max_length=256)
    text = HTMLField(blank=True)
    source = models.URLField(max_length=255, blank=True)
    create_date = models.DateTimeField(auto_now_add=True)
    img = models.ImageField(blank=True, null=True)
    flag = models.BooleanField(default=False)
    
from djparsing.core import Parser, init
from djparsing import data

@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    source = data.AttrCSSSelect(attr_data='href')  # or pass it positionally: AttrCSSSelect('href')
    title = data.TextCSSSelect()
    img = data.ImgCSSSelect('src')  # 'src' is the default, so ImgCSSSelect() also works
    
    class Meta:
        coincidence = ['Python', 'Django', 'Питон', 'ML']  # words that the data must contain to be accepted
        field_coincidence = 'title'  # field that the word list is checked against
    
pars_obj = MyParserClass(
        body='.content-list__item',
        text='.post__body_crop > .post__text',
        source='a.post__title_link',
        title='a.post__title_link',
        img='.post__body_crop > .post__text img',
        url='http://site/'
        )
pars_obj.run()
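
The coincidence / field_coincidence pair in Meta acts as a keyword filter on one field. A minimal sketch of that idea follows. It is not djparsing's own code: the helper name matches_coincidence, the dictionary records, and the case-insensitive substring matching are all assumptions made for illustration.

```python
def matches_coincidence(record, field, words):
    """Return True if any of the words occurs in record[field].

    Assumption: matching is a case-insensitive substring test.
    The library itself may compare differently.
    """
    value = str(record.get(field, "")).casefold()
    return any(word.casefold() in value for word in words)


# Hypothetical parsed records, standing in for scraped data blocks.
records = [
    {"title": "Django 1.11 released"},
    {"title": "Gardening tips"},
    {"title": "Уроки Питон для начинающих"},
]
words = ["Python", "Django", "Питон", "ML"]

# Keep only the records whose title contains one of the words.
kept = [r for r in records if matches_coincidence(r, "title", words)]
print([r["title"] for r in kept])
```

Note that with substring matching a short word such as 'ML' would also match inside longer words (for example 'HTML'), so short keywords are worth choosing carefully.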

Note: the model used for saving data can also be specified in the Meta class:

class Meta:
    model = MyModel  # the @init decorator is then not needed

Inheritance:

class MyChildParserClass(MyParserClass):
    my_field = data.TextCSSSelect()

Note: the fields of the base class and its Meta class are inherited. You can override them in the child class.

If you need to set an additional field in the database:

pars_obj.add_field['flag'] = True
pars_obj.run()  # to print the data to the log instead of saving it to the database,
                # pass the argument log -> run(log=True) and override the method log_output(self, result):
Example:
@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    
    def log_output(self, result): # if you do not override the method, the result will be output to the terminal
        pass # and work further with the result
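
The override described above is an ordinary hook pattern: run() produces a result and hands it to log_output(), which a subclass can replace. The sketch below shows only that pattern with a stand-in class. SketchParser, fetch and CollectingParser are hypothetical names, and the real djparsing.core.Parser may differ in details such as what run() returns.

```python
class SketchParser:
    """Hypothetical stand-in for a parser base class (not djparsing.core.Parser)."""

    def fetch(self):
        # Placeholder for the real scraping step.
        return [{"title": "Example"}]

    def log_output(self, result):
        # Default hook: print the result to the terminal.
        print(result)

    def run(self, log=False):
        result = self.fetch()
        if log:
            # Hand the result to the hook instead of saving it.
            self.log_output(result)
        return result


class CollectingParser(SketchParser):
    """Overrides the hook to keep the result for further processing."""

    def __init__(self):
        self.collected = []

    def log_output(self, result):
        self.collected.extend(result)


parser = CollectingParser()
parser.run(log=True)
print(parser.collected)
```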

If you do not want to write data to the database or output to the log, use:

data = pars_obj.run(create=False)

Note: create=False is also required when you are working without Django and a database.

Attributes

start_url
# CSS selector for the link (URL) that leads to the data block.
# This is needed when the list of objects is on one page and the data is on another page.
BodyCSSSelect(start_url='div.description.float-right > a')

Note: the URL must be in the element's href attribute

add_domain
# if the URL in the attribute does not include a domain,
# set add_domain=True (the default is False)

BodyCSSSelect(start_url='div.description.float-right > a', add_domain=True)
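
What "adding the domain" means can be shown with the standard library alone. This is a sketch under assumptions, not the library's implementation: the helper name absolutize and the page URL are made up for the example, and djparsing may build the full URL in a different way.

```python
from urllib.parse import urljoin


def absolutize(href, page_url, add_domain=False):
    """Resolve href against page_url when add_domain is set, else return it unchanged."""
    return urljoin(page_url, href) if add_domain else href


page_url = "http://site/list/"  # hypothetical page that contains the links

# A relative href gets the scheme and domain of the page it was found on.
print(absolutize("/items/42/", page_url, add_domain=True))

# Without add_domain the href is passed through as it appears in the markup.
print(absolutize("/items/42/", page_url))

# urljoin leaves an href that is already absolute untouched.
print(absolutize("http://other/x", page_url, add_domain=True))
```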

save_start_url

When you need to save additional data in a field, such as the start URLs of objects, add an ExtraDataField field with save_start_url=True.

body_count

How many objects to parse.

Example:
class MyParserClass(Parser):
    start = BodyCSSSelect(start_url='ul.quest-tiles > li.quest-tile-1 > div.item-box > div.item-box-desc h4 a',
                          add_domain=True,
                          body_count=4)
    source = ExtraDataField(save_start_url=True)
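
The combination of start_url and body_count amounts to "collect the links on the list page, then follow only the first few". Below is a standard-library sketch of that idea. It is a simplification and not djparsing's code: it collects every a element instead of applying a CSS selector, and the names LinkCollector and collect_start_urls are hypothetical.

```python
from html.parser import HTMLParser
from itertools import islice


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_start_urls(list_html, body_count=None):
    """Return the link targets found in list_html, at most body_count of them."""
    collector = LinkCollector()
    collector.feed(list_html)
    # islice with None as the limit yields every item.
    return list(islice(collector.hrefs, body_count))


# Hypothetical list page with three objects.
html = (
    '<ul>'
    '<li><a href="/quest/1/">One</a></li>'
    '<li><a href="/quest/2/">Two</a></li>'
    '<li><a href="/quest/3/">Three</a></li>'
    '</ul>'
)
urls = collect_start_urls(html, body_count=2)
print(urls)
```

Each collected href would then be fetched to reach the page that holds the actual data block.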

It works on this site, all this on the channel
