edx-analytics-pipeline's Introduction

Open edX Data Pipeline

DEPRECATION NOTICE

The Insights product and associated repositories are in the process of being deprecated and removed from service. Details on the deprecation status and process can be found in the relevant GitHub issue.

This repository may be archived and moved to the openedx-unsupported GitHub organization at any time.

The following sections are for historical purposes only.


A data pipeline for analyzing Open edX data. This is a batch analysis engine that is capable of running complex data processing workflows.

The data pipeline takes large amounts of raw data, analyzes it and produces higher value outputs that are used by various downstream tools.

The primary consumer of this data is Open edX Insights.

It is also used to generate a variety of packaged outputs for research, business intelligence and other reporting.

It gathers input from a variety of sources including (but not limited to):

  • Tracking log files - This is the primary data source.
  • LMS database
  • Otto database
  • LMS APIs (course blocks, course listings)

It outputs to:

  • S3 - CSV reports, packaged exports
  • MySQL - This is known as the "result store" and is consumed by Insights
  • Elasticsearch - This is also used by Insights

This tool uses spotify/luigi as the core of the workflow engine.
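
As a rough illustration of what a batch workflow engine does, the sketch below resolves task dependencies and skips work whose output already exists. It is plain Python and does not use Luigi's API; Luigi tasks expose similar requires() and run() methods, but the class names, the build() helper and the sample events here are hypothetical.

```python
class Task:
    """A unit of work with a name, dependencies, and a run step."""
    name = "task"

    def requires(self):
        return []

    def run(self, store):
        raise NotImplementedError


class ParseLogs(Task):
    name = "parse_logs"

    def run(self, store):
        # Made-up events standing in for parsed tracking log lines.
        store[self.name] = ["alice enrolled", "bob enrolled"]


class CountEnrollments(Task):
    name = "count_enrollments"

    def requires(self):
        return [ParseLogs()]

    def run(self, store):
        store[self.name] = len(store["parse_logs"])


def build(task, store, order=None):
    """Run `task` after its dependencies; skip tasks whose output is already in `store`."""
    order = [] if order is None else order
    if task.name in store:
        return order
    for dep in task.requires():
        build(dep, store, order)
    task.run(store)
    order.append(task.name)
    return order
```

Calling build(CountEnrollments(), {}) runs ParseLogs first and then CountEnrollments. A second call against the same store does nothing, which is the property that makes repeated batch runs cheap.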

Data transformation and analysis are performed with the assistance of third-party tools, Hadoop among them.

The data pipeline is designed to be invoked on a periodic basis by an external scheduler. This can be cron, jenkins or any other system that can periodically run shell commands.
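
A minimal sketch of a wrapper script that cron or Jenkins could call on a schedule. It computes a date interval for the previous day and runs a command with that interval filled in. The '{interval}' placeholder, the interval format and both function names are assumptions made for illustration; they are not the pipeline's actual command-line interface.

```python
import datetime
import subprocess
import sys


def build_interval(today, days=1):
    """Return 'START-END' in ISO dates covering the `days` days before `today`."""
    start = today - datetime.timedelta(days=days)
    return "{}-{}".format(start.isoformat(), today.isoformat())


def run_batch(argv_template, today=None):
    """Fill '{interval}' into each argument, run the command, return (exit code, stdout)."""
    today = today or datetime.date.today()
    interval = build_interval(today)
    argv = [arg.replace("{interval}", interval) for arg in argv_template]
    completed = subprocess.run(
        argv, stdout=subprocess.PIPE, universal_newlines=True
    )
    return completed.returncode, completed.stdout
```

The scheduler only needs to run the script; the script decides which dates to process, so a missed run can be repeated by passing an explicit `today`.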

Here is a simplified, high-level view of the architecture:

[Figure: Open edX Analytics Architectural Overview]

Setting up Docker-based Development Environment

As part of our movement towards the adoption of OEP-5, we have ported our development setup from Vagrant to Docker, which uses a multi-container approach driven by Docker Compose. There is a guide in place for Setting up Docker Analyticstack in the devstack repository which can help you set up a new analyticstack.

Here is a diagram showing how the components are related and connected to one another:

[Figure: the analyticstack]

Setting up a Vagrant-based Development Environment

We call this environment the Vagrant "analyticstack". It contains many of the services needed to develop new features for Insights and the data pipeline.

A few of the services included are:

  • LMS (edx-platform)
  • Studio (edx-platform)
  • Insights (edx-analytics-dashboard)
  • Analytics API (edx-analytics-data-api)

We currently have a separate development environment from the core edx-platform devstack because the data pipeline depends on several services that dramatically increase the footprint of the virtual machine. Given that only a small fraction of Open edX contributors are looking to develop features that leverage the data pipeline, we chose to build a variant of the devstack that includes them. In the future we hope to adopt OEP-5, which would allow developers to mix and match the services they are using for development at a much more granular level. In the meantime, if you are also running a traditional Open edX devstack, you will need to make sure that it and the analyticstack are not running at the same time (they compete for the same ports).

If you are running a generic Open edX devstack, navigate to the directory that contains the Vagrantfile for it and run vagrant halt.

Please follow the analyticstack installation guide.

Note: Official support for the Vagrant "analyticstack" is coming to an end after Hawthorn.

Running In Production

For small installations, you may want to use our single instance installation guide.

For larger installations, we do not have a similarly detailed guide, but you can start with our installation guide.

The default installation of Hadoop YARN exposes an administrative interface and REST API endpoint on port 8088 that can be used to run arbitrary tasks on the server. Secure this port in production.
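
One quick way to check whether the port is reachable from another machine is a plain TCP connection attempt. The sketch below uses only the Python standard library; the function name is made up for this example, and a successful connection only shows that something is listening, not that it is YARN.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "not open".
        return False
```

Run is_port_open("your-server-name", 8088) from a host outside your network, with the placeholder host name replaced by your own. If it returns True, the port is exposed and should be firewalled or otherwise restricted.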

How to Contribute

Contributions are very welcome, but for legal reasons, you must submit a signed individual contributor's agreement before we can accept your contribution. See our CONTRIBUTING file for more information -- it also contains guidelines for how to maintain high code quality, which will make your contribution more likely to be accepted.

edx-analytics-pipeline's People

Contributors

awais786, bmedx, bradenmacdonald, brianhw, brittneyexline, dawoudsheraz, dylanrhodes, feanil, golub-sergey, hammadahmadwaqas, hassanjaveed84, hunytalk, iloveagent57, jbau, johnalbaker, jrowan, macdiesel, mattdrayer, michaelroytman, muhammad-ammar, mulby, pomegranited, pwnage101, rao-abdul-mannan, sarina, thallada, tobz, zacharis278, ziafazal, zubair-arbi

edx-analytics-pipeline's Issues

Need to connect edx analytics dashboard (insights) to analytics pipeline

I have installed edx analytics pipeline and it is running successfully.

edxops/analytics_pipeline:latest
edxops/analytics_pipeline_spark_worker:latest
edxops/analytics_pipeline_hadoop_nodemanager:latest
edxops/analytics_pipeline_hadoop_resourcemanager:latest
edxops/analytics_pipeline_hadoop_datanode:latest
edxops/analytics_pipeline_spark_master:latest
edxops/analytics_pipeline_hadoop_namenode:latest

But when I open Insights and log in, it shows the following message:
unauthorized_client An unauthorized client tried to access your resources.

I have created a client ID in the LMS admin and added it as a trusted client.
Now I need to add the key to insights.yml. Please help me find insights.yml in the edX analytics Docker setup. How can I log in to the instance, and where can I find this file?

If anything else is needed to fix the issue, please let me know.

Thanks

make develop

Hello!
I get an error at the 'make develop' step.

The problem with setup.py can be partially cured by installing a fresh pbr instead of version 0.5, but test coverage still fails after this.
I have tried reinstalling both devstack and analyticstack several times and get this error every time, even though there are no failed Ansible tasks.
[Screenshot: error]

CommandError! Can anyone help me? Thanks!

When I was installing edx-analytics-dashboard and following the instructions on how to install and run the analytics backend locally, I got the error below after running the command (make validate):
[Screenshot from 2015-03-21 01:29:10]

I have no idea how to fix it! If someone can give me a tip, I will be grateful.

Thanks !!

by jeremy

MissingParameterException

2017-04-23 16:20:38,964 ERROR 24513 [luigi-interface] worker.py:173 - Luigi unexpected framework error while scheduling ModuleEngagem[157/1932]wTask(source=('hdfs://localhost:9000/data/',), expand_interval=2 days, 0:00:00, pattern=('.*tracking-.*.log.*',), date_pattern=%Y%m%d, warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/, date=2017-04-23, obfuscate=False, scale_factor=1)
Traceback (most recent call last):
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 196, in add
    for next in self._add(current):
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 249, in _add
    deps = task.deps()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/hadoop.py", line 629, in deps
    return luigi.task.flatten(self.requires_hadoop()) + luigi.task.flatten(self.requires_local())
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 586, in flatten
    for result in struct:
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/insights/module_engagement.py", line 1216, in requires
    interval_end=self.date
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 100, in __call__
    param_values = cls.get_param_values(params, args, kwargs)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 315, in get_param_values
    raise parameter.MissingParameterException("%s: requires the '%s' parameter to be set" % (exc_desc, param_name))
MissingParameterException: ExternalCourseEnrollmentTableTask[args=(), kwargs={'interval_end': datetime.date(2017, 4, 23)}]: requires the 'overwrite_n_days' parameter to be set

https://github.com/edx/edx-analytics-pipeline/blob/master/edx/analytics/tasks/insights/module_engagement.py#L1216

Even if I pass the overwrite_n_days parameter via the shell, the code above does not pass it on to ExternalCourseEnrollmentTableTask.
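
The failure mode described above can be reproduced in miniature without Luigi. In the sketch below, every class is a hypothetical stand-in (it is not the pipeline's code or Luigi's API): a parent task builds a dependency but forwards only one of the two values the dependency requires. Forwarding the missing value is one plausible remedy, assuming the parent task receives that value in the first place.

```python
class MissingParameterError(Exception):
    """Stand-in for the exception raised when a required parameter is absent."""


class Task:
    required = ()

    def __init__(self, **kwargs):
        for name in self.required:
            if name not in kwargs:
                raise MissingParameterError(
                    "%s: requires the '%s' parameter to be set"
                    % (type(self).__name__, name)
                )
        self.__dict__.update(kwargs)


class EnrollmentTable(Task):
    required = ("interval_end", "overwrite_n_days")


class BrokenWorkflow(Task):
    required = ("date", "overwrite_n_days")

    def requires(self):
        # Only interval_end is forwarded, so the dependency cannot be built.
        return EnrollmentTable(interval_end=self.date)


class FixedWorkflow(BrokenWorkflow):
    def requires(self):
        # Forward the value the parent already holds.
        return EnrollmentTable(
            interval_end=self.date, overwrite_n_days=self.overwrite_n_days
        )
```

With these stand-ins, BrokenWorkflow(date="2017-04-23", overwrite_n_days=3).requires() raises MissingParameterError even though the value was supplied to the parent, while the same call on FixedWorkflow succeeds.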

Upgrade boto to boto3

boto has caused issues in the tubular repo. To avoid any breakage, upgrade it to boto3.

This repo also uses luigi, but it is pinned by commit hash to a fairly old commit.
