justinlittman / fbarc

A commandline tool and Python library for archiving data from Facebook using the Graph API.

License: Creative Commons Zero v1.0 Universal

Python 97.05% HTML 2.95%
code4lib facebook-graph-api

fbarc's Introduction

F(b)arc

(The "b" is silent.)

A commandline tool and Python library for archiving data from Facebook using the Graph API.

Facebook data is represented as a graph. The graph is composed of:

  • nodes: Things on Facebook, such as Pages, Albums, and Photos. Each node has an id (e.g., 1322855124437680) and a type (e.g., Page).
  • fields: Attributes of things, such as name and id.
  • edges: Connections between nodes, e.g., Page's Photos.

The graph is represented as a JSON object. For example:

{
  "name": "The White House",
  "id": "1191441824276882",
  "about": "Welcome to the official White House Facebook page.\n\nComments posted on and messages received through White House pages are subject to the Presidential Records Act and may be archived. Learn more at WhiteHouse.gov/privacy.",
  "albums": {
    "data": [
      {
        "created_time": "2017-01-20T19:33:16+0000",
        "name": "Timeline Photos",
        "id": "1199645353456529"
      }
    ]
  },
  "metadata": {
    "type": "page"
  }          
}
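
As a rough illustration of the difference between fields and edges, the sketch below walks an abbreviated copy of the node graph above and separates the two. It assumes, based only on the example, that an edge shows up as an object holding a "data" list; split_fields_and_edges is a hypothetical helper and not part of f(b)arc.

```python
import json

# Abbreviated copy of the example node graph above.
NODE_JSON = """
{
  "name": "The White House",
  "id": "1191441824276882",
  "albums": {
    "data": [
      {
        "created_time": "2017-01-20T19:33:16+0000",
        "name": "Timeline Photos",
        "id": "1199645353456529"
      }
    ]
  },
  "metadata": {
    "type": "page"
  }
}
"""


def split_fields_and_edges(node):
    """Split a node graph into plain fields and edges.

    Assumption: an edge is any value shaped like {"data": [...]}.
    """
    fields, edges = {}, {}
    for key, value in node.items():
        if key == "metadata":
            continue  # metadata describes the node itself; skip it here
        if isinstance(value, dict) and isinstance(value.get("data"), list):
            # Keep only the ids of the connected nodes.
            edges[key] = [child["id"] for child in value["data"]]
        else:
            fields[key] = value
    return fields, edges


node = json.loads(NODE_JSON)
fields, edges = split_fields_and_edges(node)
print(node["metadata"]["type"], sorted(fields), edges)
```

Here name and id come back as fields, while albums comes back as an edge pointing at one Album node.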

F(b)arc supports retrieving parts of the graph for archiving. To do so, it allows you to specify which fields and edges to retrieve for a particular node type. (The set of fields and connections to retrieve is referred to as a definition and is described further below.)

Getting API keys

Before using f(b)arc, you will need to register an app. To do this:

  1. If you don't already have one, create a Facebook account.
  2. Go to https://developers.facebook.com/apps/ and log in.
  3. Click Add a New App and complete the form.
  4. From the app's dashboard, note the app id and app secret.

See below for more information on tokens.

Install

Note: pip install coming once f(b)arc is more stable.

These are instructions for Python 3. Make appropriate adjustments for Python 2.

  1. Download f(b)arc or clone it:

     git clone https://github.com/justinlittman/fbarc.git
    
  2. Change to the directory:

     cd fbarc
    
  3. Optional: Create a virtualenv:

     virtualenv -p python3 ENV
     source ENV/bin/activate
    
  4. Install requirements:

     pip install -r requirements/requirements3.txt
    
  5. Get commandline usage:

     python fbarc.py -h
    

Usage

Configure

Once you've got your API keys you can tell f(b)arc what they are with the configure command.

python fbarc.py configure

This will store your credentials in a file called .fbarc in your home directory so you don't have to keep providing them. If you would rather supply them directly you can set them in the environment (APP_ID, APP_SECRET) or using commandline options (--app_id, --app_secret).
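
A minimal sketch of how credentials could be resolved from commandline options and the environment is shown below. The option and variable names come from the paragraph above; the precedence order (options first, then environment) is an assumption, resolve_credentials is a hypothetical helper, and the .fbarc file is left out because its format is not described here.

```python
import argparse
import os


def resolve_credentials(argv, environ=None):
    """Return (app_id, app_secret), preferring options over the environment.

    Assumption: this precedence order is illustrative only.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--app_id")
    parser.add_argument("--app_secret")
    # Ignore any other arguments so the sketch can be fed a full command line.
    args, _unknown = parser.parse_known_args(argv)
    app_id = args.app_id or environ.get("APP_ID")
    app_secret = args.app_secret or environ.get("APP_SECRET")
    if not (app_id and app_secret):
        raise ValueError("app id and app secret are required")
    return app_id, app_secret


# The id comes from the option, the secret from the (fake) environment.
print(resolve_credentials(["--app_id", "123"], {"APP_SECRET": "s3cret"}))
```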

Tokens

Using the API requires an access token. F(b)arc supports app access tokens and user access tokens.

F(b)arc can retrieve an app access token using the app id and app secret. However, there are some nodes that cannot be retrieved with an app access token, thus a user access token is recommended.

A user access token allows retrieving more nodes than an app access token (but as used in f(b)arc is still limited to public data). There are two types of user access tokens: short-lived and long-lived. Short-lived access tokens are valid for around an hour; long-lived access tokens for a few months. A long-lived user access token is retrieved using a short-lived user access token together with the app id and app secret.

When given a short-lived access token (e.g., with the configure command), f(b)arc will retrieve and store a long-lived access token. You can get a short-lived access token from https://developers.facebook.com/tools/accesstoken/.
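
For orientation, the exchange of a short-lived token for a long-lived one goes through the Graph API's oauth/access_token endpoint with grant_type=fb_exchange_token. The sketch below only builds the request URL and makes no network call. The API version is copied from the debug log later in this document and may be outdated, long_lived_token_url is a hypothetical helper, and this is not necessarily how f(b)arc itself performs the exchange.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Version taken from the debug log in this document; adjust as needed.
GRAPH_URL = "https://graph.facebook.com/v2.11"


def long_lived_token_url(app_id, app_secret, short_lived_token):
    """Build the URL that exchanges a short-lived user token for a long-lived one."""
    params = {
        "grant_type": "fb_exchange_token",
        "client_id": app_id,
        "client_secret": app_secret,
        "fb_exchange_token": short_lived_token,
    }
    return GRAPH_URL + "/oauth/access_token?" + urlencode(params)


url = long_lived_token_url("APP_ID", "APP_SECRET", "SHORT_TOKEN")
print(url)
```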

F(b)arc will warn you when your long-lived user access token is going to expire.

Graph

The graph command will retrieve the graph for a node (or use the graphs command to retrieve the graphs for multiple nodes provided in files or stdin). The node is identified by a node id (e.g., 1191441824276882), name (e.g., WhiteHouse) or a Facebook url (e.g., https://www.facebook.com/WhiteHouse/).

The node graph is retrieved according to the specified definition. If the type of a node is not known, provide a definition of discover and f(b)arc will look up the node's type and try to match it to a definition.

F(b)arc finds additional nodes in the graph for a node. For example, for a Page it may find the Album nodes. The --levels parameter determines the number of levels of nodes that are retrieved, with the default being 1 (i.e., the graph for just the node that was requested). Each additional node graph is returned separately. Setting --levels to 0 will continue until all nodes reachable by edges are exhausted. Be careful, because depending on the definitions, this could be, well, infinite. Use the --exclude parameter to exclude definitions from recursive retrieval.
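
The effect of --levels can be pictured as a depth-limited breadth-first traversal. The sketch below runs over a small in-memory graph with made-up node ids; retrieve is a hypothetical function and is not f(b)arc's actual implementation.

```python
from collections import deque

# Toy graph: node id -> ids of connected nodes (hypothetical data).
EDGES = {
    "page": ["album1", "album2"],
    "album1": ["photo1"],
    "album2": ["photo2", "page"],  # an edge leading back to the page
    "photo1": [],
    "photo2": [],
}


def retrieve(root, levels=1):
    """Yield node ids breadth-first, at most `levels` levels deep.

    levels=1 yields only the root; levels=0 means no depth limit.
    """
    seen = {root}
    queue = deque([(root, 1)])
    while queue:
        node_id, level = queue.popleft()
        yield node_id
        if levels and level >= levels:
            continue  # do not queue nodes beyond the requested depth
        for child in EDGES[node_id]:
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))


print(list(retrieve("page", levels=1)))
print(list(retrieve("page", levels=2)))
print(list(retrieve("page", levels=0)))
```

Tracking already queued nodes keeps the toy traversal finite even with the cycle back to the page. On the real graph, the number of reachable nodes can still be enormous, which is why excluding definitions matters.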

Note that f(b)arc may need to make multiple requests to retrieve the entire node graph so executing the graph command may take some time.

python fbarc.py graph page 1191441824276882 --levels 2 --pretty

To write the output to a file, use --output-dir or redirect output to a file with > <filename>.jsonl.

python fbarc.py graph page 1191441824276882 --levels 2 --pretty > 1191441824276882.jsonl
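
A saved file can then be read back one node graph at a time. The sketch below assumes the output was written without --pretty, so that each line holds exactly one JSON object, as the .jsonl extension suggests. read_node_graphs is a hypothetical helper, and the in-memory sample stands in for an opened file.

```python
import io
import json


def read_node_graphs(lines):
    """Yield one node graph per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("1191441824276882.jsonl"); records abbreviated from the example above.
sample = io.StringIO(
    '{"id": "1191441824276882", "name": "The White House"}\n'
    '{"id": "1199645353456529", "name": "Timeline Photos"}\n'
)
for graph in read_node_graphs(sample):
    print(graph["id"], graph["name"])
```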

Metadata

The metadata command will retrieve all of the fields and connections for a node.

python fbarc.py metadata 1191441824276882 --pretty

Note that you may not be able to actually retrieve all of those fields or connections with the level of permissions of your API keys. The API will ignore any fields or connections that you cannot access.

The --template and --update parameters help with creating definitions. These are described below.

Url

The url command will return the url for retrieving the graph of a node according to the specified definition.

python fbarc.py url page 1191441824276882

Definitions

Definitions specify what fields and connections will be returned for a node type, as well as the size of node batches and edges.

Definitions are represented as simple python configuration files stored in the definitions or local_definitions directories. Definitions in definitions are distributed with f(b)arc. You can add additional definitions in local_definitions. A definition in local_definitions with the same filename as a definition in definitions will take precedence.
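
The precedence rule can be sketched as a lookup that tries local_definitions before definitions. The .py file extension and the find_definition helper are assumptions made for illustration; the demonstration uses a throwaway directory rather than a real f(b)arc checkout.

```python
import tempfile
from pathlib import Path


def find_definition(name, base_dir):
    """Return the path of definition `name`, preferring local_definitions."""
    for directory in ("local_definitions", "definitions"):
        candidate = Path(base_dir) / directory / (name + ".py")
        if candidate.is_file():
            return candidate
    raise KeyError("no definition named " + name)


# Demonstration in a temporary directory tree.
with tempfile.TemporaryDirectory() as base:
    for directory in ("definitions", "local_definitions"):
        (Path(base) / directory).mkdir()
    (Path(base) / "definitions" / "page.py").write_text("definition = {}\n")
    (Path(base) / "definitions" / "album.py").write_text("definition = {}\n")
    (Path(base) / "local_definitions" / "page.py").write_text("definition = {}\n")
    page_dir = find_definition("page", base).parent.name
    album_dir = find_definition("album", base).parent.name
    print(page_dir, album_dir)
```

The page definition resolves to the local copy, while album falls back to the distributed one.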

Here is an example definition for a Page:

definition = {
    'node_batch_size': 10,
    'edge_size': 10,
    'fields': {
        'albums': {'edge_type': 'album'},
        'bio': {},
        'likes': {'edge_type': 'page', 'follow_edge': False},
        'name': {'default': True},
        'workflows': {'omit': True},
        'visitor_posts': {'edge_type': 'post', 'omit_on_error': 10}
    }
}

fields is a map of names to fields or edges to be retrieved for the node.

A name with an edge_type is an edge. The value of edge_type is the name of another definition.

A field or edge in which default is True will always be retrieved. Otherwise, the field or edge will only be retrieved when the node is the primary node being retrieved. In other words, default fields or edges specify the summary for a node type; other fields or edges are part of the detail for a node type.

A field or edge in which omit is True will be ignored. This is helpful for keeping track of fields or edges that have been considered, but are not to be retrieved.

If an edge has follow_edge set to False then only the default fields or edges will be retrieved for that edge. That edge will be omitted from recursive retrieval. For example, for a Page, the likes edge is set to not follow edges because this would cause retrieval of all pages that liked this page, which is not desired.
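
The rules for default, omit and follow_edge can be summarised in a short sketch. The two functions below are hypothetical and only mirror the description above; they are not taken from f(b)arc's source.

```python
# Cut-down version of the example Page definition above.
DEFINITION = {
    'fields': {
        'albums': {'edge_type': 'album'},
        'bio': {},
        'likes': {'edge_type': 'page', 'follow_edge': False},
        'name': {'default': True},
        'workflows': {'omit': True},
    }
}


def select_fields(definition, primary):
    """Return the names to request for a node.

    Omitted names are never requested; non-default names are requested
    only when the node is the primary node.
    """
    selected = []
    for name, options in definition['fields'].items():
        if options.get('omit'):
            continue
        if primary or options.get('default'):
            selected.append(name)
    return selected


def edges_to_follow(definition):
    """Return the edge names that are eligible for recursive retrieval."""
    return [
        name
        for name, options in definition['fields'].items()
        if 'edge_type' in options
        and not options.get('omit')
        and options.get('follow_edge', True)
    ]


print(select_fields(DEFINITION, primary=True))
print(select_fields(DEFINITION, primary=False))
print(edges_to_follow(DEFINITION))
```

For the primary node everything except workflows is requested; for a connected node only name is; and only albums is followed recursively.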

Sometimes for inexplicable reasons, the Graph API will report errors for particular fields. For example, as of late 2017, requesting the visitor_posts edge on SenatorTedCruz with even a limit of 1 results in a "Please reduce the amount of data you're asking for, then retry your request" error. To handle these sorts of errors, setting omit_on_error will cause the field to be omitted when the specified error is encountered. (Errors are identified using Facebook error codes.)
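
One way to picture omit_on_error is as a retry that drops the affected fields. In the sketch below, GraphError, fetch_with_omit and fake_fetch are all hypothetical stand-ins, the API is simulated, and the error code 10 is taken from the example definition above. It is a simplification and not f(b)arc's actual error handling.

```python
class GraphError(Exception):
    """Hypothetical stand-in for an API error carrying a Facebook error code."""

    def __init__(self, code):
        super().__init__("error code %s" % code)
        self.code = code


def fetch_with_omit(field_options, fetch):
    """Call fetch(names); on a matching error, drop the configured fields and retry once."""
    names = list(field_options)
    try:
        return fetch(names)
    except GraphError as error:
        kept = [n for n in names
                if field_options[n].get('omit_on_error') != error.code]
        if len(kept) == len(names):
            raise  # no field is configured to absorb this error
        return fetch(kept)


def fake_fetch(names):
    # Simulated API: fails with code 10 whenever visitor_posts is requested.
    if 'visitor_posts' in names:
        raise GraphError(10)
    return {name: '...' for name in names}


options = {
    'name': {'default': True},
    'visitor_posts': {'edge_type': 'post', 'omit_on_error': 10},
}
result = fetch_with_omit(options, fake_fetch)
print(sorted(result))
```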

node_batch_size and edge_size are optional; if omitted, sensible defaults are used. Node batch size determines how many nodes of that type are requested at a time. A larger number reduces the number of requests to the API, speeding up retrieval. Edge size determines, when retrieving an edge, how many nodes to retrieve. A larger number reduces the number of paging requests, speeding up retrieval. In some cases, limits for node batch size and edge size can be found in the documentation; in others, they must be found by trial and error.
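
The effect of node_batch_size on the number of requests can be seen with a small chunking sketch. batches is a hypothetical helper and the node ids are made up.

```python
def batches(node_ids, node_batch_size):
    """Yield successive lists of at most node_batch_size ids."""
    for start in range(0, len(node_ids), node_batch_size):
        yield node_ids[start:start + node_batch_size]


ids = [str(n) for n in range(1, 26)]  # 25 made-up node ids
batch_sizes = [len(batch) for batch in batches(ids, 10)]
print(batch_sizes)
```

With a batch size of 10, the 25 nodes need three requests; with a batch size of 25, one would do.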

The --template and --update parameters of the metadata command can assist with creating definitions. --template will produce a definition for a node type that includes all possible fields or edges with omit set to True by default. --update will update an existing definition with any new fields or edges that are not already included. New fields or edges are marked with a comment ("Added field") and have omit set to True.
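
Conceptually, --update amounts to a merge in which unseen names are added as omitted. The sketch below works on in-memory dictionaries only; update_definition is a hypothetical helper, and the real command also writes the "Added field" comment into the definition file, which is not modelled here.

```python
import copy


def update_definition(definition, available_names):
    """Return a copy of definition with unseen names added as omitted."""
    updated = copy.deepcopy(definition)
    added = []
    for name in available_names:
        if name not in updated['fields']:
            updated['fields'][name] = {'omit': True}
            added.append(name)
    return updated, added


existing = {'fields': {'name': {'default': True}, 'bio': {}}}
updated, added = update_definition(existing, ['name', 'bio', 'website'])
print(added, updated['fields']['website'])
```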

The Graph API Explorer is helpful for understanding the fields and connections that are available for a node type. Less helpful is the Graph API Reference.

F(b)arc Viewer

F(b)arc Viewer allows you to view and explore the data retrieved from the API.

To run:

python fbarc_viewer.py <filepath(s) of file containing JSON or directories containing JSON files>

Adding --index will cause indexes to be used. Indexes reduce the amount of memory required. If indexes don't already exist, they will be created.

Once F(b)arc Viewer is running, it will be available at http://localhost:5000/.

Unit tests

To run unit tests:

    python -m unittest discover

Limitations

Users

Facebook limits retrieving Users. F(b)arc does not support retrieving Users from the graph command, but it does retrieve them when connected from other nodes. The fields that are available are extremely limited.

Incremental archiving

It would be ideal to be able to perform incremental archiving, i.e., only retrieve new or updated nodes. For example, only retrieve new Photos in an Album. Unfortunately, the Graph API doesn't support this. In particular, ordering does not appear to work as documented and if it did work, it is unclear what field is used for ordering.

Suggestions on a strategy for incremental harvesting would be appreciated.

Not yet implemented

  • Search
  • Setup.py
  • Travis configuration

Acknowledgements

F(b)arc borrows liberally from Twarc in code and spirit.

Facebook policies

Please be mindful of the Facebook Platform Policy.

fbarc's People

Contributors

edsu, justinlittman, lucahammer

fbarc's Issues

JSONDecodeError

How can I find out why there was a problem with the JSON?

Full error message:

python fbarc.py --debug graph page altenbeken.info --levels 2 > altenbeken.jsonl
Access token expires on 2018-03-04 10:21:56+00:00
Getting graph for node altenbeken.info
Traceback (most recent call last):
  File "fbarc.py", line 476, in raise_for_fb_exception
    raise FbException(error_response)
__main__.FbException: An unknown error has occurred.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "fbarc.py", line 1156, in <module>
    main()
  File "fbarc.py", line 214, in main
    args.csv_output_dir, fb)
  File "fbarc.py", line 273, in graph_command
    exclude_definition_names=exclude_definition_name), graph_outputs)
  File "fbarc.py", line 298, in print_graphs
    for graph in graph_iter:
  File "fbarc.py", line 522, in get_nodes
    for node_graph in self._get_nodes(node_counter, node_queue, queued_nodes, levels, exclude_definition_names):
  File "fbarc.py", line 536, in _get_nodes
    node_graph_dict = self.get_node_batch(node_ids, definition_name)
  File "fbarc.py", line 617, in get_node_batch
    nodes_graph_dict = self._perform_http_post(url, data=params)
  File "fbarc.py", line 908, in _perform_http_post
    raise_for_fb_exception(response, data=data)
  File "fbarc.py", line 477, in raise_for_fb_exception
    except json.decoder.JSONDecodeError:
AttributeError: 'module' object has no attribute 'JSONDecodeError'

debug.log

2018-01-05 12:05:15,428 ERROR Error for https://graph.facebook.com/v2.11 ({'fields': 'id,metadata{type},description,end_time,event_times,name,start_time,attending_count,category,cover,declined_count,interested_count,is_canceled,maybe_count,noreply_count,owner,picture,place,timezone,type,updated_time,comments.limit(2000){id,created_time,from,message,permalink_url},feed.limit(25){id,created_time,from,message,permalink_url,status_type,to,updated_time},live_videos.limit(100){id,creation_time,description,permalink_url,title},photos.limit(100){id,created_time,link,name,updated_time}', 'access_token': 'XXXXXX', 'ids': '330364757445239,1785087901790238,1778205972478904,1348214928621073,215387655660218,268732113631738,396768557350139,1633347510043387,1027045984065537,139391659967274,1523274661042677,1832586256990227,985206564948474,300970570319747,193476294470573,238752436577395,1178518992264453,580038218861570,180063365812330,237478843337911', 'method': 'GET', 'metadata': 1}): {
    "error": {
        "message": "An unknown error has occurred.",
        "code": 1,
        "fbtrace_id": "A98KjDQxfoZ",
        "type": "OAuthException"
    }
}
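
The final AttributeError in the traceback above appears to come from running under Python 2, where the json module has no JSONDecodeError and invalid JSON raises ValueError instead. A portable pattern is to catch ValueError, which also covers json.JSONDecodeError on Python 3.5 and later because it is a subclass. parse_error_body is a hypothetical helper, not a function in fbarc.py.

```python
import json


def parse_error_body(text):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # also catches json.JSONDecodeError on Python 3.5+
        return None


print(parse_error_body('{"error": {"code": 1}}'))
print(parse_error_body('<html>Bad gateway</html>'))
```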

could not build url for endpoint node

python3 fbarc.py graph page MarkRWarner | python fbarc_viewer.py
Warning: Using an app token. You may encounter authorization problems.
Getting graph for node MarkRWarner

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 488, in _get_nodes
node_graph_dict[node_ids[0]] = self.get_node(node_ids[0], definition_name)
File "fbarc.py", line 567, in get_node
raise e
File "fbarc.py", line 546, in get_node
node_graph = self._perform_http_post(url, data=params)
File "fbarc.py", line 914, in _perform_http_post
**kwargs)
File "fbarc.py", line 914, in _perform_http_post
**kwargs)
File "fbarc.py", line 914, in _perform_http_post
**kwargs)
[Previous line repeated 5 more times]
File "fbarc.py", line 910, in _perform_http_post
raise e
File "fbarc.py", line 875, in _perform_http_post
raise_for_fb_exception(response, data=data)
File "fbarc.py", line 432, in raise_for_fb_exception
raise FbException(error_response)
main.FbException: Unsupported get request. Object with ID 'MarkRWarner' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fbarc.py", line 1132, in <module>
main()
File "fbarc.py", line 213, in main
args.csv_output_dir, fb)
File "fbarc.py", line 255, in graph_command
exclude_definition_names=exclude_definition_name), graph_outputs)
File "fbarc.py", line 260, in print_graphs
for graph in graph_iter:
File "fbarc.py", line 476, in get_nodes
for node_graph in self._get_nodes(node_counter, node_queue, queued_nodes, levels, exclude_definition_names):
File "fbarc.py", line 512, in _get_nodes
log.warning('Skipping %s due to unexpected GraphMethodException: %s', node_id, e)
UnboundLocalError: local variable 'node_id' referenced before assignment

   * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    [2018-01-14 16:56:01,616] ERROR in app: Exception on / [GET]
    Traceback (most recent call last):
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
    File "fbarc_viewer.py", line 32, in home
    return redirect(url_for('node', node_id=first_node_id))
    File "/usr/lib/python2.7/dist-packages/flask/helpers.py", line 333, in url_for
    return appctx.app.handle_url_build_error(error, endpoint, values)
    File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1805, in handle_url_build_error
    reraise(exc_type, exc_value, tb)
    File "/usr/lib/python2.7/dist-packages/flask/helpers.py", line 323, in url_for
    force_external=external)
    File "/usr/lib/python2.7/dist-packages/werkzeug/routing.py", line 1768, in build
    raise BuildError(endpoint, values, method, self)
    BuildError: Could not build url for endpoint 'node'. Did you forget to specify values ['node_id']?
    127.0.0.1 - - [14/Jan/2018 16:56:01] "GET / HTTP/1.1" 500 -

Search idea

Here is an idea you could use for your search function. It returns a dict, though, which makes parsing a pain.

users = graph.request('/search?q='+q+'&type=user')

This can be expanded as needed, but it works for users.
