Processing pipeline of HTMLs
From the raw HTML collected on Facebook you can extract meaningful metadata and append your own results to the database, so that other researchers can benefit from them, in a collaborative effort.

The goal is a distributed network of parsers: independent developers might run their own analysis tools on top of some validated metadata, in a distributed parsing effort that tries to emulate the analysis Facebook itself does. Not exactly the same, because that would be impossible, but a working pipeline that might:
- show to the user (with restricted access) more information about what they receive
- perform statistics on topics, penetration of fake news, and shape of spreading
- observe online trends from an open-source, independent third party, something like an Alexa for Facebook
- provide an API for algorithm analysis to researchers, working groups, policy makers, and journalists
To begin, we have to extract the smallest chunks of metadata and make progress through a binary tree of parsers.
We save the submitted metadata only if the information is meaningful: privacy-preserving at its best, and minimized as much as possible against decontextualisation attacks at the API level. The processed metadata empower the data analysis and the capability of this network; the dataset and the analysis might follow.
This is what is in the database after some iterations; every iteration extends the metadata in MongoDB.
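As a hypothetical illustration (only `postType` and `type` come from the parser below; the `html` content and any other fields are invented for the example), a snippet document might grow like this across iterations:

```javascript
// Hedged sketch: how a document in the 'html' collection might grow
// as parsers run. The id is taken from the debug output below; the
// html content is a placeholder.
var afterCollection = {
  id: "fdb795f8c2394d23dd2280ad4eedf9f7c897b98e",
  html: "<div>…collected snippet markup…</div>"
};

// After the postType parser runs, its result is merged in:
var afterPostType = Object.assign({}, afterCollection, {
  postType: true,   // marks that this parser has processed the snippet
  type: "feed"      // "feed" | "promoted"
});
```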
A simple kind of parser:
```javascript
var cheerio = require('cheerio');
var moment = require('moment');
var debug = require('debug')('parser:postType');
// 'parse' is the project's parser runner, required from the repo's lib/

function getPostType(snippet) {
    var $ = cheerio.load(snippet.html);
    var retVal;

    if ($('.uiStreamSponsoredLink').length > 0)
        retVal = "promoted";
    else if ($('.uiStreamAdditionalLogging').length > 0)
        retVal = "promoted";
    else
        retVal = "feed";
    // TODO: don't use an exclusion condition, but find a selector
    // for 'feed' too, and associate postType: fail so we can
    // investigate unmatched snippets later

    debug("・%s ∩ %s", snippet.id, retVal);
    return { 'postType': true,
             'type': retVal };
}

var postType = {
    'name': 'postType',
    'requirements': {},
    'implementation': getPostType,
    'since': "2016-11-13",
    'until': moment().toISOString(),
};

return parse.please(postType);
```
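Presumably the `requirements` field is what lets parsers chain: a later parser only receives snippets whose metadata already matches its requirements. A minimal sketch of that filtering idea (the `promotedInfo` parser name and the helper functions are invented, not part of the repo):

```javascript
// Hedged sketch of how 'requirements' could gate a parser chain.
// This emulates the server-side selection, not the actual implementation.
function matchRequirements(snippet, requirements) {
  return Object.keys(requirements).every(function (key) {
    return snippet[key] === requirements[key];
  });
}

function selectSnippets(snippets, parser) {
  return snippets.filter(function (snippet) {
    return matchRequirements(snippet, parser.requirements);
  });
}

// Example: a hypothetical 'promotedInfo' parser that only wants
// snippets postType already marked as promoted.
var promotedInfo = {
  name: 'promotedInfo',          // invented parser name
  requirements: { postType: true, type: 'promoted' }
};

var snippets = [
  { id: 'a', postType: true, type: 'feed' },
  { id: 'b', postType: true, type: 'promoted' },
  { id: 'c' } // not yet processed by postType
];

var selected = selectSnippets(snippets, promotedInfo);
// selected contains only snippet 'b'
```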
*The HTMLs are collected via a web extension and saved at the end of this backend handler: https://github.com/tracking-exposed/facebook/blob/master/lib/events.js#L52*
More complicated parsers exist; they are located in https://github.com/tracking-exposed/facebook/tree/master/parsers
@nolash do you have suggestions? You've been the first to contribute 👍 I'm committing to the feedBasicInfo branch, and @fievelk is doing a Python version: https://github.com/fievelk/fbt_pyparsers
This is the first script run in the sequence. postType, pasted above, just extends the 'html' table with metadata on the server; it is a binary decision tree:
```
$ DEBUG=* node parsers/postType.js
parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/status
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +0ms
parser:⊹core 46638 HTMLs, 300 per request = 155 requests +1s
parser:⊹core Connecting to https://facebook.tracking.exposed/api/v1/snippet/content
{
  "since": "2016-11-13",
  "until": "2016-12-29T19:11:36.938Z",
  "parserName": "postType",
  "requirements": {}
} +5ms
```
This is the output of the execution: for every HTML snippet, the parser looks for two patterns. It would be better if the condition ceased to be exclusive: if we can understand how to spot a non-promoted post too, the information becomes more robust and everything works better.
```
parser:postType ・fdb795f8c2394d23dd2280ad4eedf9f7c897b98e ∩ feed +6ms
parser:postType ・e41f623d1cf4e3737aaf8396ee0f52383622c145 ∩ feed +4ms
parser:postType ・f55e0ba360454fd295070b8ac4231cfd75a4dc21 ∩ promoted +11ms
parser:postType ・d76f8d8e8f21162f21a291cccbe5101699bb585e ∩ feed +274ms
parser:postType ・4ccd0d6090490d9afd0c9c0a4cdb24b47eaa68c6 ∩ feed +729ms
parser:postType ・916ebb01da701f417391ab30928298a6c24428eb ∩ feed +130ms
```
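The non-exclusive classification could look like this sketch: instead of defaulting to `feed`, require a positive match for feed too, and return `fail` otherwise so unmatched snippets can be investigated. Plain string checks stand in for the real cheerio selectors, and the `userContentWrapper` feed marker is an assumed placeholder, not a verified Facebook class:

```javascript
// Hedged sketch of a non-exclusive postType decision. The promoted
// class names come from the parser above; the feed marker is an
// assumption for illustration only.
function getPostTypeStrict(html) {
  var isPromoted = html.indexOf('uiStreamSponsoredLink') !== -1 ||
                   html.indexOf('uiStreamAdditionalLogging') !== -1;
  var isFeed = html.indexOf('userContentWrapper') !== -1; // assumed feed marker

  if (isPromoted)
    return { postType: true, type: 'promoted' };
  if (isFeed)
    return { postType: true, type: 'feed' };
  // Neither pattern matched: flag the snippet for later investigation
  return { postType: false, type: 'fail' };
}
```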