democracyos / bill-scraper
Bill scraper for feeding the DemocracyOS platform
We should persist scraped text as properly encoded UTF-8, instead of HTML-escaped special characters.
For example, an HTML entity such as "&oacute;" should read "ó".
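A minimal sketch of the decoding step, assuming the scraped strings contain named or numeric HTML entities. `decodeEntities` and `NAMED_ENTITIES` are hypothetical names, not part of the scraper, and the entity table only covers characters likely to appear in Spanish text; a maintained entity-decoding library would probably be a safer choice in practice.

```javascript
// Hypothetical helper: decode HTML entities into plain UTF-8 characters.
// The table is illustrative, not exhaustive.
const NAMED_ENTITIES = {
  aacute: 'á', eacute: 'é', iacute: 'í', oacute: 'ó', uacute: 'ú',
  Aacute: 'Á', Eacute: 'É', Iacute: 'Í', Oacute: 'Ó', Uacute: 'Ú',
  ntilde: 'ñ', Ntilde: 'Ñ', uuml: 'ü', Uuml: 'Ü',
  ordm: 'º', ordf: 'ª', deg: '°', quot: '"', lt: '<', gt: '>', amp: '&'
};

function decodeEntities(text) {
  // A single pass, so "&amp;oacute;" becomes "&oacute;" and is not decoded twice.
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range numeric references untouched.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

The decoded string can then be saved as-is, since JavaScript strings are written out as UTF-8 by default when serialized to JSON or stored in MongoDB.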
Looking at
https://github.com/DemocracyOS/app/blob/development/lib/models/law.js
I see we have these fields (and the values we should fill them with):
state: I think this should be "bill", or maybe we need another value
lawId: current "expediente"
clauses.clauseId: don't know what to put here
clauses.order: ok, the order number
clauses.text: yes, we have this
createdAt: auto
updatedAt: won't fill this
I think we need a field for comision, right? Should I add it to the other project?
Is this mapping OK?
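The proposed mapping could be sketched as follows. This is only an illustration: `toLaw` is a hypothetical function, the input shape (`expediente`, `comision`, `articles`) is assumed from the variable names used in godoit.js, and the open questions from the list above are left as placeholders rather than answered.

```javascript
// Hypothetical mapper from a scraped bill to the law fields listed above.
function toLaw(scraped) {
  return {
    state: 'bill',             // open question: another value may be needed
    lawId: scraped.expediente, // the current "expediente"
    clauses: scraped.articles.map((text, index) => ({
      clauseId: null,          // undecided in the issue
      order: index + 1,        // position of the clause in the bill
      text: text
    }))
    // createdAt is set automatically; updatedAt is intentionally not filled.
    // comision has no target field yet, so it is not mapped here.
  };
}
```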
Right now we're scraping articles like this:
{
"articulo": "</b> Prohíbese a los establecimientos educativos ",
"_id": {
"$oid": "520a8ec68be1e20000000008"
}
}
Add a 'number' property to an article's JSON representation and preserve its value so we can display it in the app.
No, we can't rely on order to determine the article numbers.
We should import our current "Despachos" into our beta app environment so we can showcase it.
Also, we should remove all existing dummy entries.
Given the following bill object for CABA Bill 400:
{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
"$oid": "520a8ec68be1e20000000002"
},
"articulos": [
{
"articulo": "</b> Prohíbese a los establecimientos educativos ",
"_id": {
"$oid": "520a8ec68be1e20000000008"
}
}
]
}
the values should be
Names of bill files downloaded from the 'Proyectos' tab on Cedom almost follow this standard, but they use 5 digits for 'x' (with leading zeros if needed), no hyphens, and only two digits for the year. So they look like this:
Bill 894-J-2013 - Filename 00894J13.doc
Bill 1233-P-2012 - Filename 01233P12.doc
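The naming rule above can be sketched as a small conversion function. `billIdToFilename` is a hypothetical name, and the accepted ID format (number, single letter, four-digit year, separated by hyphens) is inferred from the two examples only.

```javascript
// Hypothetical helper: build the Cedom download filename for a bill ID
// such as "894-J-2013".
function billIdToFilename(billId) {
  const match = /^(\d{1,5})-([A-Za-z])-(\d{4})$/.exec(billId);
  if (!match) {
    throw new Error('Unrecognized bill ID: ' + billId);
  }
  const number = match[1].padStart(5, '0'); // 5 digits, leading zeros
  const letter = match[2].toUpperCase();
  const year = match[3].slice(2);           // last two digits of the year
  return number + letter + year + '.doc';
}

console.log(billIdToFilename('894-J-2013'));  // 00894J13.doc
console.log(billIdToFilename('1233-P-2012')); // 01233P12.doc
```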
@cristiandouce please take a look at the file godoit.js
I want to have values for the variables expediente, comision and articles at line 46, but I'm getting undefined for all of them there with the code as is.
At line 88 there is data, but it gets lost on the way to line 46.
If I comment out line [90](https://github.com/DemocracyOS/bill-scraper/blob/zombie/godoit.js#L90) and uncomment line 89, I get data at line 46, but the program throws an error saying:
TypeError: listener must be a function
at TypeError (<anonymous>)
at NativeConnection.EventEmitter.once (events.js:171:11)
at connectToDB (/Users/pnmoyano/node/bill-scraper/godoit.js:89:6)
at saveAll (/Users/pnmoyano/node/bill-scraper/godoit.js:45:3)
at processPage (/Users/pnmoyano/node/bill-scraper/godoit.js:41:3)
at /Users/pnmoyano/node/bill-scraper/godoit.js:8:5
at _fulfilled (/Users/pnmoyano/node/bill-scraper/node_modules/q/q.js:798:54)
...
I can use global variables as a very dirty fix, but why is this happening, and which of the lines (89 or 90) is better?
This may have implications for bug #1.
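The trace says that `once` received something that is not a function at godoit.js line 89. Since godoit.js is not reproduced here, the cause can only be guessed; one common one is calling the handler (and passing its return value) instead of passing the handler itself. The sketch below shows a way to carry the scraped data to the listener through a closure, without globals. The `connectToDB` signature and the plain `EventEmitter` standing in for the database connection are assumptions made for illustration.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the database connection; a real connection object would
// emit 'open' by itself once connected.
const connection = new EventEmitter();

// Hypothetical version of connectToDB: it takes the scraped data and a
// callback, and registers a function as the listener.
function connectToDB(data, onReady) {
  connection.once('open', function () {
    // The closure keeps `data` in scope until the connection opens.
    onReady(data);
  });
}

const received = [];
connectToDB(
  { expediente: '894-J-2013', comision: 'Educación', articles: ['Artículo 1'] },
  function (data) {
    received.push(data);
  }
);

// Simulate the connection opening.
connection.emit('open');
console.log(received[0].expediente); // 894-J-2013
```

By contrast, `connection.once('open', saveAll(data))` would run `saveAll` immediately and register its return value, which fails in the way the trace shows if that value is undefined.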
We're using property names in Spanish; we need to translate all of them to English.
{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
"$oid": "520a8ec68be1e20000000002"
},
"articulos": [
{
"articulo": "</b> Prohíbese a los establecimientos educativos ",
"_id": {
"$oid": "520a8ec68be1e20000000008"
}
},
For proper translation, refer to the Akoma Ntoso naming convention, so that we support this convention from day 1.
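A rename pass over the stored documents could look like the sketch below. `KEY_MAP` and `translateKeys` are hypothetical, and the English names in the map are placeholders that have not been checked against the Akoma Ntoso vocabulary; they would need to be replaced with the names that convention actually prescribes.

```javascript
// Illustrative Spanish-to-English key map (placeholder names, NOT verified
// against Akoma Ntoso).
const KEY_MAP = {
  sancion: 'sanction',
  publicacion: 'publication',
  promulgacion: 'promulgation',
  articulos: 'articles',
  articulo: 'text'
};

// Recursively rename keys in plain JSON data (objects, arrays, primitives).
function translateKeys(value) {
  if (Array.isArray(value)) {
    return value.map(translateKeys);
  }
  if (value !== null && typeof value === 'object') {
    const result = {};
    for (const key of Object.keys(value)) {
      const newKey = Object.prototype.hasOwnProperty.call(KEY_MAP, key)
        ? KEY_MAP[key]
        : key; // unknown keys such as _id are kept as they are
      result[newKey] = translateKeys(value[key]);
    }
    return result;
  }
  return value;
}
```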
Here you can find two .docs: one with the bill as originally presented (like the ones you can scrape from CEDOM) and a Despacho, which is the final version that was actually debated by congressmen in the chamber.
So we need to be able to turn the latter into HTML (https://crocodoc.com/ seems a fine tool for this) and scrape it back into our platform.
The scraper misdetects content and discards important data. For instance, scraping Cedom's bill 400 yields the following result:
{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
"$oid": "520a8ec68be1e20000000002"
},
"articulos": [
{
"articulo": "</b> Prohíbese a los establecimientos educativos ",
"_id": {
"$oid": "520a8ec68be1e20000000008"
}
},
{
"articulo": "</b> Ningún alumno, con motivo de mora en el ",
"_id": {
"$oid": "520a8ec68be1e20000000007"
}
},
{
"articulo": " </b>los alumnos de los establecimientos citados ",
"_id": {
"$oid": "520a8ec68be1e20000000006"
}
},
{
"articulo": "</b> De verse configurados los extremos descriptos en ",
"_id": {
"$oid": "520a8ec68be1e20000000005"
}
},
{
"articulo": "</b> La Secretaría de Educación podrá ",
"_id": {
"$oid": "520a8ec68be1e20000000004"
}
},
{
"articulo": "</b> Comuníquese, etc</P>",
"_id": {
"$oid": "520a8ec68be1e20000000003"
}
}
],
"__v": 0
}
As you can see, the articles' text is incomplete.
In other cases, such as when part of the text contains double quotes (e.g.: "Some text"), all of the article's text up to that point is also discarded.
As a general rule, ALL text between two articles' titles should be included as part of the article.
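One way to apply that rule is to locate every article title first and then slice the text between consecutive titles, so that quotes or markup inside an article cannot cut it short. The sketch below assumes the bill has already been reduced to plain text and that each title starts a line and looks like "Artículo 1º.-" or "Art. 3º.-"; both the `HEADING` pattern and `splitArticles` are hypothetical. It also keeps the number found in the title, so numbering does not depend on order.

```javascript
// Assumed title format at the start of a line: "Artículo 1º.-", "Art. 3º.-".
const HEADING = /^\s*Art(?:[ií]culo|\.)\s*(\d+)\s*[º°]?\s*\.?\s*-?\s*/gim;

function splitArticles(text) {
  const headings = Array.from(text.matchAll(HEADING));
  return headings.map((heading, i) => {
    // The article body runs from the end of this title to the next title.
    const start = heading.index + heading[0].length;
    const end = i + 1 < headings.length ? headings[i + 1].index : text.length;
    return {
      number: parseInt(heading[1], 10),
      text: text.slice(start, end).trim()
    };
  });
}

const sample = [
  'Artículo 1º.- Prohíbese "algo" a los establecimientos.',
  'Segundo párrafo.',
  'Artículo 2º.- Ver el artículo 1º.',
  'Art. 3º.- Comuníquese, etc.'
].join('\n');

console.log(splitArticles(sample));
```

A cross-reference such as "el artículo 1º" in the middle of a line is not treated as a title, but a wrapped line that happens to begin with "artículo" followed by a number would be, so the pattern would need tightening against real Cedom pages.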
Make the user options drop-down on the header bar work
Make the categories filter combo box (left-hand frame) work