bill-scraper's Introduction

⚠️ IMPORTANT

DemocracyOS is currently without maintenance. More information about it will be published soon. For now you can check democraciaos.org


DemocracyOS

DemocracyOS is an online space for deliberation and voting on political proposals. It is a platform for a more open and participatory government. The software aims to stimulate better arguments and come to better rulings, as peers.

Installation

Please refer to the Installation wiki page for detailed instructions on how to install and set up your instance of DemocracyOS.

Current DemocracyOS deployments

  • DemocracyOS - PDR: The Net Party's official deployment of DemocracyOS.
  • EuVoto: Brazilian initiative by the Open Knowledge Foundation Brasil to discuss legislation in the city of Sao Paulo.
  • Loi Renseignement: First deployment by DemocracyOS France to discuss the Loi Renseignement.
  • Evoks: Hungarian project by Atlatszo.hu for discussing social issues.
  • PAMI: The largest healthcare program for elderly people opens its technical decisions.
  • Ukrainian Choice: Official deployment of DemocracyOS Ukraine.
  • Paris: Official deployment of DemocracyOS in Paris (France) by city's mayor Anne Hidalgo.

Contributing

Please see CONTRIBUTING.md for further details.

Contributors

See CONTRIBUTORS.md to get to know the DemocracyOS team and contributors.

Browser support

We support real browsers and IE10+

Acknowledgements

Icons made by Jamal Jama and Ahmad Firoz from simplelineicons.com.

License

DemocracyOS is open source software under the GPL v3.0 license. Please see full terms in the LICENSE file.

bill-scraper's People

Contributors

cristiandouce, gvilarino, ultraklon

bill-scraper's Issues

Preserve article number

Right now we're scraping articles like this:

{
  "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
  "_id": {
    "$oid": "520a8ec68be1e20000000008"
  }
}

Add a 'number' property to an article's JSON representation and preserve its value so we can display it in the app.

No, we can't rely on order to determine the article numbers.

Scrape .docs with bill projects

Here you can find two .docs: one with the bill project as originally presented (like the ones you can scrape from CEDOM) and a Despacho, which is the final version that ACTUALLY got debated by congressmen in the chamber.

So, we need to be able to turn the latter into HTML (https://crocodoc.com/ seems a fine tool to do so) and scrape them back into our platform.

How should variables be passed?

@cristiandouce please take a look at the file godoit.js
I want to have values for the variables expediente, comision and articles at line 46, but with the code as is I'm getting undefined for all of them there.
At line 88 there is data, but it gets lost when passing to line 46.
If I comment out line [90](https://github.com/DemocracyOS/bill-scraper/blob/zombie/godoit.js#L90) and uncomment line 89, I get data at line 46, but the program throws an error saying:

TypeError: listener must be a function
    at TypeError (<anonymous>)
    at NativeConnection.EventEmitter.once (events.js:171:11)
    at connectToDB (/Users/pnmoyano/node/bill-scraper/godoit.js:89:6)
    at saveAll (/Users/pnmoyano/node/bill-scraper/godoit.js:45:3)
    at processPage (/Users/pnmoyano/node/bill-scraper/godoit.js:41:3)
    at /Users/pnmoyano/node/bill-scraper/godoit.js:8:5
    at _fulfilled (/Users/pnmoyano/node/bill-scraper/node_modules/q/q.js:798:54)
    ...

I can use global variables as a very dirty fix, but why is this happening, and which of the lines (89 or 90) is better?
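The stack trace suggests that `once` received something other than a function, which is what happens when a call expression (with parentheses and arguments) is passed instead of a function reference. The following is a minimal sketch of that pattern, not the actual contents of godoit.js: a plain Node `EventEmitter` stands in for the database connection, and the bodies and signatures of `connectToDB` and `saveAll` are assumptions made for illustration.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the database connection (assumption: the real one is created elsewhere).
const connection = new EventEmitter();
const saved = [];

function connectToDB(onOpen) {
  // once() needs a function reference. Writing connection.once('open', onOpen())
  // would run onOpen right away and pass its return value (undefined),
  // which is what raises "listener must be a function".
  connection.once('open', onOpen);
}

// Hypothetical signature: the values travel inside a closure, so no globals are needed.
function saveAll(expediente, comision, articles) {
  connectToDB(() => {
    saved.push({ expediente, comision, articles });
  });
}

saveAll('894-J-2013', 'Educacion', ['Articulo 1']);
connection.emit('open'); // simulate the connection opening
console.log(saved.length);
```

Wrapping the call in an arrow function keeps `expediente`, `comision` and `articles` in scope until the `open` event fires, which avoids both the `undefined` values and the global-variable workaround.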

Matching current info with existent structure

Looking at
https://github.com/DemocracyOS/app/blob/development/lib/models/law.js
I see we have these fields (and the values we should fill them with):
state: I think here is "bill" or maybe we need another value
lawId: current "expediente"
clauses.clauseId: don't know what to put here
clauses.order: ok, the order number
clauses.text: yes, we have this
createdAt: auto
updatedAt: won't fill this

I think we need a field for comision, right? Should I add it in the other project?

Is the matching ok?

Regex discards important text

The scraper misdetects content and discards important data. For instance, scraping Cedom's bill 400 yields the following result:

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N� 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  },
  {
    "articulo": "</b> Ning&uacute;n alumno, con motivo de mora en el ",
    "_id": {
      "$oid": "520a8ec68be1e20000000007"
    }
  },
  {
    "articulo": " </b>los alumnos de los establecimientos citados ",
    "_id": {
      "$oid": "520a8ec68be1e20000000006"
    }
  },
  {
    "articulo": "</b> De verse configurados los extremos descriptos en ",
    "_id": {
      "$oid": "520a8ec68be1e20000000005"
    }
  },
  {
    "articulo": "</b> La Secretar&iacute;a de Educaci&oacute;n podr&aacute; ",
    "_id": {
      "$oid": "520a8ec68be1e20000000004"
    }
  },
  {
    "articulo": "</b> Comun&iacute;quese, etc</P>",
    "_id": {
      "$oid": "520a8ec68be1e20000000003"
    }
  }
],
"__v": 0
}

As you can see, the articles' text is incomplete.

In other cases, such as when part of the text contains double quotes (e.g.: "Some text"), all of the article's text up to that section is also discarded.

As a general rule, ALL text between two articles' titles should be included as part of the article.
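One way to follow that rule is to locate every article heading first and then slice the source between consecutive headings, rather than matching the body text with a regex. The sketch below assumes headings look like `<b>Art&iacute;culo N&ordm;.-</b>`; the real Cedom markup may differ, and `splitArticles` and the sample HTML are made up for illustration. It also keeps the article number from the heading.

```javascript
// Hypothetical helper: the heading pattern is an assumption about the source markup.
function splitArticles(html) {
  const heading = /<b>\s*Art(?:&iacute;|í)culo\s+(\d+)[^<]*<\/b>/gi;
  const matches = [...html.matchAll(heading)];

  return matches.map((match, i) => {
    // The body runs from the end of this heading to the start of the next one.
    const start = match.index + match[0].length;
    const end = i + 1 < matches.length ? matches[i + 1].index : html.length;
    const text = html
      .slice(start, end)
      .replace(/<[^>]+>/g, ' ') // drop remaining tags
      .replace(/\s+/g, ' ')     // collapse whitespace
      .trim();
    return { number: Number(match[1]), text };
  });
}

// Made-up sample; the quoted word checks that double quotes do not truncate the text.
const sample =
  '<P><b>Art&iacute;culo 1&ordm;.-</b> Proh&iacute;bese a los establecimientos ' +
  'educativos "privados" retener documentos.</P>' +
  '<P><b>Art&iacute;culo 2&ordm;.-</b> Comun&iacute;quese, etc.</P>';

const articles = splitArticles(sample);
console.log(articles);
```

Because the body is taken by position instead of by pattern, quotes or other unexpected characters inside an article cannot cut it short.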

Import "Despachos" into app

We should import our current "Despachos" into our beta app environment so we can showcase it.

Also, we should remove all existing dummy entries.

Encoding to UTF-8

We should persist scraped text as properly encoded UTF-8, instead of HTML-escaped special characters.

E.g.: "&oacute;" should read "ó"
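A minimal sketch of that conversion, assuming the scraped text only contains numeric entities and a small set of named ones. `decodeEntities` and the `NAMED` table are hypothetical and deliberately incomplete; a full entity table or a dedicated library would be needed for general HTML.

```javascript
// Partial table of named entities common in Spanish text (assumption: not exhaustive).
const NAMED = {
  aacute: 'á', eacute: 'é', iacute: 'í', oacute: 'ó', uacute: 'ú',
  ntilde: 'ñ', Ntilde: 'Ñ', uuml: 'ü', ordm: 'º', deg: '°',
  amp: '&', quot: '"', lt: '<', gt: '>',
};

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+\d*);/gi, (whole, body) => {
    if (body[0] === '#') {
      // Numeric entity, decimal (&#243;) or hexadecimal (&#xF3;).
      const isHex = body[1].toLowerCase() === 'x';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Named entity: unknown names are left untouched.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : whole;
  });
}

console.log(decodeEntities('Proh&iacute;bese a los establecimientos educativos'));
```

JavaScript strings are Unicode internally, so once the entities are decoded the text can be persisted as UTF-8 by the storage layer.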

Preserve bill number and Id

Given the following bill object for CABA Bill 400:

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N� 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  }
]
}

the values should be

  • billNumber: the "law number", generated sequentially as bills are passed. In the example it would be 400.
  • billId: the unique identifier that the congress assigns to a bill project. It is not shown on a passed bill on Cedom, but you'll see it in the 'Proyectos' section. All Dispatches WILL have this ID. The convention is x-L-yyyy. The proper translation of 'expediente' may be used for naming.
    • x is a sequential number that is reset every year and is assigned when the bill project is submitted to the legislative process.
    • L is a letter that indicates the nature of the bill project or where it comes from (D is for declarations, J is for Law Projects that come from a Congressman, P is for popular initiative, etc.)
    • yyyy is the four-digit year in which the bill was submitted into the legislative process.

Names of bill files downloaded from the 'Proyectos' tab on Cedom almost follow this standard, but they use five digits for 'x' (with leading zeros if needed), no hyphens, and only two digits for the year. So it would look like this:

Bill 894-J-2013 - Filename 00894J13.doc

Bill 1233-P-2012 - Filename 01233P12.doc
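The mapping between the two notations can be sketched as a pair of small functions. `billIdToFilename` and `filenameToBillId` are hypothetical names, and since the filename keeps only two year digits, the reverse direction has to assume a century (defaulted to "20" here).

```javascript
// Convert a bill id such as "894-J-2013" into the Cedom-style filename.
function billIdToFilename(billId) {
  const match = /^(\d{1,5})-([A-Z])-(\d{4})$/.exec(billId);
  if (!match) throw new Error('Unexpected bill id: ' + billId);
  const [, sequence, letter, year] = match;
  return sequence.padStart(5, '0') + letter + year.slice(2) + '.doc';
}

// Reverse direction. The century is an assumption: the filename does not carry it.
function filenameToBillId(filename, century = '20') {
  const match = /^(\d{5})([A-Z])(\d{2})\.doc$/.exec(filename);
  if (!match) throw new Error('Unexpected filename: ' + filename);
  const [, sequence, letter, shortYear] = match;
  return Number(sequence) + '-' + letter + '-' + century + shortYear;
}

console.log(billIdToFilename('894-J-2013'));   // 00894J13.doc
console.log(billIdToFilename('1233-P-2012'));  // 01233P12.doc
console.log(filenameToBillId('00894J13.doc')); // 894-J-2013
```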

Property names in english

We're using property names in Spanish; we need to translate all of them to English.

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N� 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  },

For proper translation refer to Akoma Ntoso naming convention, so we start supporting this convention from day 1.
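A minimal sketch of a recursive key rename that could be applied to scraped objects. The English names in `KEY_MAP` are placeholders chosen for illustration only; they have not been checked against Akoma Ntoso and should be replaced with the agreed terms.

```javascript
// Placeholder translations (assumption): swap in the agreed Akoma Ntoso names.
const KEY_MAP = new Map([
  ['sancion', 'sanction'],
  ['publicacion', 'publication'],
  ['promulgacion', 'promulgation'],
  ['articulos', 'articles'],
  ['articulo', 'text'],
]);

// Walk arrays and plain objects, renaming mapped keys and leaving the rest as-is.
function renameKeys(value) {
  if (Array.isArray(value)) return value.map(renameKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        KEY_MAP.has(key) ? KEY_MAP.get(key) : key,
        renameKeys(inner),
      ])
    );
  }
  return value;
}

const renamed = renameKeys({
  sancion: '01/06/2000',
  _id: { $oid: '520a8ec68be1e20000000002' },
  articulos: [{ articulo: 'Comuníquese, etc', _id: { $oid: '520a8ec68be1e20000000003' } }],
});
console.log(JSON.stringify(renamed, null, 2));
```

Keeping the mapping in one table means the property names can be corrected in a single place once the convention is settled.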
