The objective of this project is for you to design, document, and build a prototype of MovieManiacs (MvM), a portal that combines movie information from distributed sources.
MovieManiacs is a Web application that makes movie-related information (no TV) available to different users. To do this, MvM performs crawling[1] and scraping[2] of known Web sites that publish such information: at a minimum IMDB (Internet Movie Database)[3] and at least one other website that adds reviews (e.g. RottenTomatoes[4]) or any other type of information that enriches the data published by IMDB (e.g. see dir.yahoo.com/entertainment/movies_and_film/). The crawler must be written in Node.js, because MovieManiacs intends to integrate its product in the future with another company (EntertainmentGossip) whose crawling technology is also implemented on that platform, which would facilitate its acquisition.
Once the information from IMDB and at least one other Web site has been collected, it should be presented coherently through a portal[5] implemented in Python (bonus) or some other language (except PHP). That is, the pages crawled from IMDB and any other source must be stored in a document-oriented database such as MongoDB (mandatory), and the portal must show linked pages (e.g. pages that discuss the same film) from IMDB and the other sites together, at the same time (partially or completely). Additionally, you should calculate and show the Bacon number[6] for each actor who appears on the page displayed in the MvM portal.
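As an illustration of the Bacon number requirement, the following is a minimal sketch in Python (the bonus portal language), assuming the crawled data can be reduced to a mapping from film titles to cast lists. The function name bacon_numbers, the SAMPLE_CASTS data, and the choice of breadth-first search are assumptions made for this example, not part of the assignment.

```python
from collections import deque


def bacon_numbers(casts, source="Kevin Bacon"):
    """Return {actor: Bacon number} for every actor reachable from source.

    casts maps a film title to the list of actors credited in it.
    Actors with no chain of shared films to the source are omitted.
    """
    # Index: actor -> set of films the actor appears in.
    films_of = {}
    for film, actors in casts.items():
        for actor in actors:
            films_of.setdefault(actor, set()).add(film)

    if source not in films_of:
        return {}

    distance = {source: 0}
    queue = deque([source])
    seen_films = set()

    # Breadth-first search: each shared film is one step in the chain.
    while queue:
        actor = queue.popleft()
        for film in films_of[actor]:
            if film in seen_films:
                continue
            seen_films.add(film)
            for co_star in casts[film]:
                if co_star not in distance:
                    distance[co_star] = distance[actor] + 1
                    queue.append(co_star)
    return distance


# Hypothetical sample data, not taken from IMDB.
SAMPLE_CASTS = {
    "Film A": ["Kevin Bacon", "Actor One"],
    "Film B": ["Actor One", "Actor Two"],
    "Film C": ["Actor Three"],
}

if __name__ == "__main__":
    for name, number in sorted(bacon_numbers(SAMPLE_CASTS).items()):
        print(name, number)
```

In this sketch an actor with no connection to the source (Actor Three in the sample) simply does not appear in the result, so the portal would have to decide how to display that case.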
MvM is expected to be very popular with the public, so the application must be prepared to support massive demand (approximately 4 million users a month, with 1,000 concurrent connections), with guaranteed good response times (2 seconds at peak time for at least 90 percent of critical requests), guaranteed availability of 99 percent, 24x7, and a peak-time success rate of 90 percent for critical requests.
In addition, you are required to evaluate a sharding[7] technique against some other database caching technique as a strategy to improve the performance of the application. The crawler must run as an online task[8].
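To make the two strategies under evaluation concrete, here is a minimal sketch in Python that contrasts hash-based shard routing with an in-process cache. Everything in it is an assumption for illustration: the shard count, the names shard_for, fetch_sharded and fetch_cached, and the in-memory dictionaries that stand in for real database shards. It does not use MongoDB's own sharding mechanism.

```python
import hashlib
from functools import lru_cache

NUM_SHARDS = 4  # assumption: number of shards chosen for the example

# In-memory dictionaries standing in for separate database shards.
FAKE_SHARDS = [dict() for _ in range(NUM_SHARDS)]

# Counts how many lookups actually reach the (fake) shards.
LOOKUPS = {"count": 0}


def shard_for(movie_id, num_shards=NUM_SHARDS):
    """Map a movie identifier to a shard index with a stable hash."""
    digest = hashlib.md5(movie_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def store(movie_id, document):
    """Write the document to the shard that owns this identifier."""
    FAKE_SHARDS[shard_for(movie_id)][movie_id] = document


def fetch_sharded(movie_id):
    """Read directly from the owning shard on every call."""
    LOOKUPS["count"] += 1
    return FAKE_SHARDS[shard_for(movie_id)].get(movie_id)


@lru_cache(maxsize=1024)
def fetch_cached(movie_id):
    """Read through an LRU cache; repeated reads skip the shard lookup."""
    return fetch_sharded(movie_id)


if __name__ == "__main__":
    store("movie-1", {"title": "Example Film"})
    fetch_cached("movie-1")
    fetch_cached("movie-1")
    print("shard lookups:", LOOKUPS["count"])
```

Sharding spreads reads and writes across machines, while the cache only removes repeated reads of the same item; measuring both under the stated load is what the evaluation asks for.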
The team has 6 machines at its disposal, and the software must be efficient in its resource consumption and developed on a standard Linux platform: