Processing pipeline of HTMLs
From the raw HTML collected on Facebook you can extract meaningful metadata and append your own results to the database, so that other researchers can benefit from them, in a collaborative effort.

The goal is a distributed network of parsers: independent developers might run their own analysis tools on top of some validated metadata, in a distributed parsing effort that tries to emulate the analysis Facebook itself does. Not exactly the same, because that would be impossible, but a working pipeline that might:
- show to the user (with restricted access) more information about what they receive
- perform statistics on topics, penetration of fake news, and shape of spreading
- observe online trends from an open-source, independent third party, something like an Alexa for Facebook
- provide an API for algorithm analysis to researchers, working groups, policy makers, and journalists
To begin, we have to extract the smallest chunks of metadata and make progress through a binary tree of parsers.
We save the submitted metadata only if the information is meaningful: privacy-preserving at its best, and minimized as much as possible against decontextualisation attacks at the API level. The processed metadata empower the data analysis and the capability of this network; the dataset and the analysis might follow.
This is what is in the database after some iterations; every iteration extends the metadata in MongoDB.
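As a hypothetical illustration (only `postType` and `type` come from the parser below; the `html` content and any other fields are invented for the example), a snippet document might grow like this across iterations:

```javascript
// Hedged sketch: how a document in the 'html' collection might grow
// as parsers run. The id is taken from the debug output below; the
// html content is a placeholder.
var afterCollection = {
  id: "fdb795f8c2394d23dd2280ad4eedf9f7c897b98e",
  html: "<div>…collected snippet markup…</div>"
};

// After the postType parser runs, its result is merged in:
var afterPostType = Object.assign({}, afterCollection, {
  postType: true,   // marks that this parser has processed the snippet
  type: "feed"      // "feed" | "promoted"
});
```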
A simple kind of parser:
```javascript
var cheerio = require('cheerio');
var moment = require('moment');
var debug = require('debug')('parser:postType');
// 'parse' is the project's parser runner, required from the repo's lib/

function getPostType(snippet) {
    var $ = cheerio.load(snippet.html);
    var retVal;

    if ($('.uiStreamSponsoredLink').length > 0)
        retVal = "promoted";
    else if ($('.uiStreamAdditionalLogging').length > 0)
        retVal = "promoted";
    else
        retVal = "feed";
    // TODO: don't use an exclusion condition, but find a selector
    // for 'feed' too, and associate postType: fail so we can
    // investigate unmatched snippets later

    debug("・%s ∩ %s", snippet.id, retVal);
    return { 'postType': true,
             'type': retVal };
}

var postType = {
    'name': 'postType',
    'requirements': {},
    'implementation': getPostType,
    'since': "2016-11-13",
    'until': moment().toISOString(),
};

return parse.please(postType);
```
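Presumably the `requirements` field is what lets parsers chain: a later parser only receives snippets whose metadata already matches its requirements. A minimal sketch of that filtering idea (the `promotedInfo` parser name and the helper functions are invented, not part of the repo):

```javascript
// Hedged sketch of how 'requirements' could gate a parser chain.
// This emulates the server-side selection, not the actual implementation.
function matchRequirements(snippet, requirements) {
  return Object.keys(requirements).every(function (key) {
    return snippet[key] === requirements[key];
  });
}

function selectSnippets(snippets, parser) {
  return snippets.filter(function (snippet) {
    return matchRequirements(snippet, parser.requirements);
  });
}

// Example: a hypothetical 'promotedInfo' parser that only wants
// snippets postType already marked as promoted.
var promotedInfo = {
  name: 'promotedInfo',          // invented parser name
  requirements: { postType: true, type: 'promoted' }
};

var snippets = [
  { id: 'a', postType: true, type: 'feed' },
  { id: 'b', postType: true, type: 'promoted' },
  { id: 'c' } // not yet processed by postType
];

var selected = selectSnippets(snippets, promotedInfo);
// selected contains only snippet 'b'
```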
*The HTMLs are collected via a web extension and saved at the end of this backend handler: https://github.com/tracking-exposed/facebook/blob/master/lib/events.js#L52*
More complicated parsers exist; they are located in https://github.com/tracking-exposed/facebook/tree/master/parsers
@nolash do you have suggestions? You've been the first to contribute 👍 I'm committing to the feedBasicInfo branch, and @fievelk is doing a Python version: https://github.com/fievelk/fbt_pyparsers
This is the first script run in the sequence. postType, pasted above, just extends the 'html' table with metadata on the server; it is a binary decision tree:
```
$ DEBUG=* node parsers/postType.js
parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/status
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +0ms
parser:⊹core 46638 HTMLs, 300 per request = 155 requests +1s
parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/content
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +5ms
```
This is the output of the execution: for every HTML snippet, the parser looks for two patterns. It would be better if the condition ceased to be exclusive: if we can understand how to spot a non-promoted post too, the information becomes more robust and everything works better.
```
parser:postType ・fdb795f8c2394d23dd2280ad4eedf9f7c897b98e ∩ feed +6ms
parser:postType ・e41f623d1cf4e3737aaf8396ee0f52383622c145 ∩ feed +4ms
parser:postType ・f55e0ba360454fd295070b8ac4231cfd75a4dc21 ∩ promoted +11ms
parser:postType ・d76f8d8e8f21162f21a291cccbe5101699bb585e ∩ feed +274ms
parser:postType ・4ccd0d6090490d9afd0c9c0a4cdb24b47eaa68c6 ∩ feed +729ms
parser:postType ・916ebb01da701f417391ab30928298a6c24428eb ∩ feed +130ms
```
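The non-exclusive classification could look like this sketch: instead of defaulting to `feed`, require a positive match for feed too, and return `fail` otherwise so unmatched snippets can be investigated. Plain string checks stand in for the real cheerio selectors, and the `userContentWrapper` feed marker is an assumed placeholder, not a verified Facebook class:

```javascript
// Hedged sketch of a non-exclusive postType decision. The promoted
// class names come from the parser above; the feed marker is an
// assumption for illustration only.
function getPostTypeStrict(html) {
  var isPromoted = html.indexOf('uiStreamSponsoredLink') !== -1 ||
                   html.indexOf('uiStreamAdditionalLogging') !== -1;
  var isFeed = html.indexOf('userContentWrapper') !== -1; // assumed feed marker

  if (isPromoted)
    return { postType: true, type: 'promoted' };
  if (isFeed)
    return { postType: true, type: 'feed' };
  // Neither pattern matched: flag the snippet for later investigation
  return { postType: false, type: 'fail' };
}
```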