ptwobrussell / mining-the-social-web-2nd-edition Goto Github PK

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)

License: Other

Ruby 3.12% Shell 0.09% Python 0.05% CSS 0.04% HTML 82.21% Jupyter Notebook 14.49%

mining-the-social-web-2nd-edition's Introduction

HTTP 301: Don't Use This Repository - 17 Jan 2019

There's good news! Mining the Social Web is now availabe in it's 3rd Edition, and there's a fully updated repository available with all of the latest changes that you will definitely not want to miss out on: the code has been fully revised and ported to Python 3, the runtime has been converted to a more convenient Docker-based setup, and there's a brand new chapter on mining Instagram data.

My co-author, Mikhail Klassen, now maintains the code, and you can get it here: https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition

Enjoy!

Matthew A. Russell

Jan 17, 2019

Mining the Social Web (2nd Edition)

Summary

Mining the Social Web, 2nd Edition is available through O'Reilly Media, Amazon, and other fine book retailers. Purchasing the ebook directly from O'Reilly offers a number of great benefits, including a variety of digital formats and continual updates to the text of book for life! Better yet, if you choose to use O'Reilly's DropBox or Google Drive synchronization, your ebooks will automatically update every time there's an update. In other words, you'll always have the latest version of the book if you purchase the ebook through O'Reilly, which is why it's the recommended option in comparison to a paper copy or other electronic version. (If you prefer a paperback or Kindle version from Amazon, that's a fine option as well.)

There's an incredible turn-key virtual machine experience for this second edition of the book that provides you with a powerful social web mining toolbox. This toolbox provides the ability to explore and run all of the source code in a hassle-free manner. All that you have to do is [follow a few simple steps](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/_Appendix A - Virtual Machine Experience.html) to get the virtual machine installed, and you'll be running the example code in as little as 20-30 minutes. (And by the way, most of that time is waiting for files to download.)

This short screencast demonstrates the steps involved in installing the virtual machine, which installs every single dependency for you automatically and save you a lot of time. Even sophisticated power users tend to prefer using it versus using their own environments.

If you experience any problems at all with installation of the virtual machine, file an issue here on GitHub. Be sure to also follow @SocialWebMining on Twitter and like http://facebook.com/MiningTheSocialWeb on Facebook.

Be sure to also visit http://MiningTheSocialWeb.com for additional content, news, and updates about the book and code in this GitHub repository.

Preview the Full-Text of Chapter 1 (Mining Twitter)

Chapter 1 of the book provides a gentle introduction to hacking on Twitter data. It's available in a variety of convenient formats

A free PDF download
An online ebook excerpt
An IPython Notebook (ipynb) file (checked into this repository)

Choose one, or choose them all. There's no better way to get started than following along with the opening chapter.

Preview the IPython Notebooks

This edition of Mining the Social Web extensively uses IPython Notebook to facilitate the learning and development process. If you're interested in what the example code for any particular chapter does, the best way to preview it is with the links below. When you're ready to develop, pull the source for this GitHub repository and follow the instructions for installing the virtual machine to get started.

A bit.ly bundle of all of these links is also available: http://bit.ly/mtsw2e-ipynb

[Chapter 0 - Preface](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 0 - Preface.html)
[Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 1 - Mining Twitter.html)
[Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 2 - Mining Facebook.html)
[Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 3 - Mining LinkedIn.html)
[Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 4 - Mining Google+.html)
[Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 5 - Mining Web Pages.html)
[Chapter 6 - Mining Mailboxes: Analyzing Who's Talking To Whom About What, How Often, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 6 - Mining Mailboxes.html)
[Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 7 - Mining GitHub.html)
[Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing Over RDF, and More](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 8 - Mining the Semantically Marked-Up Web.html)
[Chapter 9 - Twitter Cookbook](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter 9 - Twitter Cookbook.html)
[Appendix A - Virtual Machine Experience](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/_Appendix A - Virtual Machine Experience.html)
[Appendix B - OAuth Primer](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/_Appendix B - OAuth Primer.html)
[Appendix C - Python & IPython Notebook Tips](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/_Appendix C - Python & IPython Notebook Tips.html)

Blog & Screencasts

Be sure to bookmark the Mining the Social Web Vimeo Channel to stay up to date with short instructional videos that demonstrate how to use the tools in this repository. More screencasts are being added all the time, so check back often -- or better yet, subscribe to the channel.

A ~3 minute screencast on installing a powerful toolbox for social web mining.
View a collection of all available screencasts at http://bit.ly/mtsw2e-screencasts

You might also benefit from the content that is being regularly added to the companion blog at http://MiningTheSocialWeb.com

The Mining the Social Web Virtual Machine

You may enjoy this short screencast that demonstrates the step-by-step instructions involved in installing the book's virtual machine.

The code for Mining the Social Web is organized by chapter in an IPython Notebook format to maximize enjoyment of following along with examples as part of an interactive experience. Unfortunately, some of the Python dependencies for the example code can be a little bit tricky to get installed and configured, so providing a completely turn-key virtual machine to make your reading experience as simple and enjoyable as possible is in order. Even if you are a seasoned developer, you may still find some value in using this virtual machine to get started and save yourself some time. The virtual machine is powered with Vagrant, an amazing development tool that you'll probably want to know about and arguably makes working with virtualization even easier than a native Virtualbox or VMWare image.

Quick Start Guide

The recommended way of getting started with the example code is by taking advantage of the Vagrant-powered virtual machine as illusrated in this short screencast. After all, you're more interested in following along and learning from the examples than installing and managing all of the system dependencies just to get to that point, right?

[Appendix A - Virtual Machine Experience](https://rawgithub.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/_Appendix A - Virtual Machine Experience.html) provides clear step-by-step instructions for installing the virtual machine and is intended to serve as a quick start guide.

The Mining the Social Web Wiki

This project takes advantage of its GitHub repository's wiki to act as a point of collaboration for consumers of the source code. Feel free to use the wiki however you'd like to share your experiences, and create additional pages as needed to curate additional information.

One of the more important wiki pages that you may want to bookmark is the Advisories page, which is an archive of notes about particularly disruptive commits or other changes that may affect you.

Another page of interest is a listing of all 100+ numbered examples from the book that conveniently hyperlink to read-only version of the IPython Notebooks

"Premium Support"

The source code in this repository is free for your use however you'd like. If you'd like to complete a more rigorous study about social web mining much like you would experience by following along with a textbook in a classroom, however, you should consider picking up a copy of Mining the Social Web and follow along. Think of the book as offering a form of "premium support" for this open source project.

The publisher's description of the book follows for your convenience:

How can you tap into the wealth of social web data to discover who’s making connections with whom, what they’re talking about, and where they’re located? With this expanded and thoroughly revised edition, you’ll learn how to acquire, analyze, and summarize data from all corners of the social web including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs.

Employ IPython Notebook, the Natural Language Toolkit, NetworkX, and other scientific computing tools to mine popular social web sites
Apply advanced text-mining techniques, such as clustering and TF-IDF, to extract meaning from human language data
Bootstrap interest graphs from GitHub by discovering affinities among people, programming languages, and coding projects
Build interactive visualizations with D3.js, a state-of-the-art HTML5 and JavaScript toolkit
Take advantage of more than two-dozen Twitter recipes presented in O’Reilly’s popular and well-known cookbook format

The example code for this data science book is maintained in a public GitHub repository and is designed to be especially accessible through a turn-key virtual machine that facilitates interactive learning with an easy-to-use collection of IPython Notebooks.

mining-the-social-web-2nd-edition's People

Contributors

Stargazers

Watchers

Forkers

antiface jwsy gjuhasz zeristor aburan28 kp-favorite ericchou-python pepsalehi web5design marsam pamsuke aleksbreian1103 thomasernste a2q rokgomiscek artbrown moisesvega anb2 mohamadhussien ukituki berenco invinciblejha ebunt kalevivt xerohour wandeg ethanhu eurekarexwb gregwjacobs modularfunction ruwanego itaraday yingzi-jin miguelpaz dducks uday12kumar ledinhminh lsdjim aurora1625 ngpestelos shootstar samjingzhao moacybarros rodbauer garrying unplug rukku atomarugent nborwankar speedtriple955 jveitchmichaelis jayvansantos egrommet spiegela p5k6 timpairish benpaez johnphp glyons03 graphgrailai imclab c0d3rm0nk3y allen3921 pombredanne aldovw satpreetsingh odewahn caitlinbk chaowu2009 maryannet inigoku macloo ilovejs hinedavid organicit chibuisimaduka gb1 pallavl kevinschwarz evgenizer mt5225 leosuga agusman88 a1m3i6n8-babaie davechan stonepeople paulkagiri dotpot kaneconet yuanda tanmayameher watty62 afey uolter shawnhansen frac mdjmay mherradora dcandela nghiaht

mining-the-social-web-2nd-edition's Issues

Present Virtualbox instructions on Mageia, Fedora, openSuse Linux test OK..

Audit All Notebooks for Unicode Best Practices

Based on recent discussions with in #39, it's now obvious that I should do a thorough audit of the entire codebase for any possible Unicode errors. At a minimum, the following things:

Prefix all strings with u so that they are Unidode as a proper practice
Use codecs.open and codecs.close to handle file I/O for saving data
Double check libraries used for Unicode support

Also update the notebook for Appendix C when complete to provide appropriate context to the reader. (Added a stub for that section for now.)

Results of following exact instructions--Appendix A

I just purchased the early release of your book. The first few pages direct me to Appendix A, which isn't done yet. So I started with the readme files on the Github site.

Just for reference, I'm using a Dell Latitude D630 notebook (Intel 64-bit dual-core Core 2 Duo with Intel chipset, 4GB memory & 80GB HD) running Mageia Linux 3 (forked from the Mandrake/Mandriva distro).

"In order to start the virtual machine, there are just a few easy steps to follow:

Download and install the latest copy of VirtualBox for your operating system at https://www.virtualbox.org/"

I downloaded and installed RPM package: virtualbox
Version: 4.2.12-2.mga3
    Architecture: x86_64
    Size: 80099 KB

I downloaded and installed RPM package: virtualbox-kernel-3.8.13-desktop-1.mga3 (virtualbox driver for kernel-desktop-3.8.13-1.mga3)                                                                                                                         
    Version: 4.2.12-11.mga3
    Architecture: x86_64
    Size: 424 KB

"Download and install Vagrant for your operating system at http://www.vagrantup.com/"

I downloaded and installed RPM package: vagrant
    Version: 1:1.2.2-1
    Architecture: x86_64
    Size: 48859 KB

"It is highly recommended that you take a moment to read its excellent "Getting Started" guide as a matter of initial familiarization"

Well, I read the first three pages and bookmarked the site, so...

"Checkout this code repository to your machine using git or with the download links at the top of the main GitHub page."

I used the 'Download zip file button' on the right near the top; didn't really see any download link AT the top.  So is the zip file kosher at this time?
I extracted the archive into my user home directory, /home/max, so I think that made the project root directory /home/max/Mining-the-Social-Web-2nd-Edition-master/vagrant.

"Navigate to the 'vagrant' directory in the checkout"

I opened up a terminal (as a user, not a superuser) and went to 'vagrant'

"Run the following command: vagrant up"

Done (again, as a user, no root priveleges).

So I got this:

[max@localhost vagrant]$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
[default] Box 'precise64' was not found. Fetching box from specified URL for
the provider 'virtualbox'. Note that if the URL does not have
a box for this provider, you should interrupt Vagrant now and add
the box yourself. Otherwise Vagrant will attempt to download the
full box prior to discovering this error.
Downloading or copying the box...
Extracting box...te: 640k/s, Estimated time remaining: 0:00:02))
Successfully added box 'precise64' with provider 'virtualbox'!
[default] Importing base box 'precise64'...
[default] Matching MAC address for NAT networking...
[default] Setting the name of the VM...
[default] Clearing any previously set forwarded ports...
[default] Creating shared folders metadata...
[default] Clearing any previously set network interfaces...
[default] Preparing network interfaces based on configuration...
[default] Forwarding ports...
[default] -- 22 => 2222 (adapter 1)
[default] -- 8888 => 8888 (adapter 1)
[default] -- 5000 => 5000 (adapter 1)
[default] -- 27017 => 27017 (adapter 1)
[default] -- 27018 => 27018 (adapter 1)
[default] -- 27019 => 27019 (adapter 1)
[default] -- 28017 => 28017 (adapter 1)
[default] Booting VM...
[default] Waiting for VM to boot. This can take a few minutes.
The VM failed to remain in the "running" state while attempting to boot.
This is normally caused by a misconfiguration or host system incompatibilities.
Please open the VirtualBox GUI and attempt to boot the virtual machine
manually to get a more informative error message.

The first thing I notice (other than the brick wall) is I got a 'precise64' instead of 'precise32' base box.
Anyway, I do what I'm told: go to the Virtualbox GUI.  Under 'Machine' I find 'Show Log'. I'm not 
including the whole thing, but the upshot is:

00:00:00.377224 Chipset cannot do MSI: VERR_NOT_IMPLEMENTED
00:00:00.469154 ******************** End of CPUID dump **********************
00:00:00.469185 HWACCM: No VT-x or AMD-V CPU extension found. Reason VERR_VMX_MSR_LOCKED_OR_DISABLED
00:00:00.469202 HWACCM: VMX MSR_IA32_FEATURE_CONTROL=1
00:00:00.650084 VMSetError: /home/iurt/rpmbuild/BUILD/VirtualBox-4.2.12/src/VBox/VMM/VMMR3/VM.cpp(373) int VMR3Create(uint32_t, PCVMM2USERMETHODS, PFNVMATERROR, void*, PFNCFGMCONSTRUCTOR, void*, VM**); rc=VERR_VMX_MSR_LOCKED_OR_DISABLED
00:00:00.650084 VMSetError: VT-x features locked or unavailable in MSR.
00:00:00.654712 ERROR [COM]: aRC=NS_ERROR_FAILURE (0x80004005) aIID={db7ab4ca-2a3f-4183-9243-c1208da92392} aComponent={Console} aText={VT-x features locked or unavailable in MSR. (VERR_VMX_MSR_LOCKED_OR_DISABLED)}, preserve=false
00:00:00.755371 Power up failed (vrc=VERR_VMX_MSR_LOCKED_OR_DISABLED, rc=NS_ERROR_FAILURE (0X80004005))

I AM NOT ASKING FOR AID IN GETTING THIS SETUP TO WORK.  But note the error message above at 650084; it references /home/iurt, whereas my user rights start at /home/max. I'm assuming the instructions on the Github site will be the basis for Appendix A-- should you be instructing people to install/run as ROOT?  Should you be telling people to set this platform up under Windows, given all the different Linux distros out there?  Or should I be sending this to the developers of Mageia's RPM packages?  I also notice you are using an Ubuntu image.  They distribute using a LiveCD-type medium.  Should you be advising readers to try running the Ubuntu LiveCD/DVD first?  Just some thoughts...

I have platforms WinXP, Win7 and a handful of different Linux distros so I'll be trying it out on all of them eventually.  I hope some of this helped with your instructions.

[email protected]

5.5 - add unicode u'—' to stopwords?

Minor request: would it make sense to add the '—' character to the stopwords?

This example is also a candidate for #52

Four short links: 12 July 2013
    Num Sentences:           7
    Num Words:               148
    Num Unique Words:        103
    Num Hapaxes:             80
    Top 10 Most Frequent Words (sans stop words):
        — (4)
        names (3)
        andy (2)
        name (2)
        work (2)
        accurate (1)
        actually (1)
        address (1)
        age (1)
        almost (1)

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—'
    ]

Possible errors in Examples 3.9 and 3.10 if Bing does not resolve references

Reported by @jwsy:

Errors when given a c['values'][n]['location']['name'] that Bing doesn't resolve (which will return an empty list.

Suggested Fix:

geo = g.geocode(transformed_location, exactly_one=False)
if geo == []:
    continue

give an intro to List Comprehensions

ch1/ex 12 is the first time that I started to understand Python list comprehensions. They also got used in ex 8 & 9. A brief intro to how they work, or maybe just using the term 'list comprehension' so that someone can go read more about them would be helpful.

Or maybe just a comment that says "the next line uses list comprehensions to call pt.add_row for the 10 most common items in data.

Oh. Look. There is a comment in the unnamed example that reads in the canned data. I think that means that the term 'list comprehension' is non-intuitive. I think it makes sense only after you know what it is. I think my initial google searches were for "python implicit loop". If you think readers will be new to Python, this is more evidence that a few words about list comprehensions is a good idea.

Maybe add README note about Vagrant version.

Had a slight hiccough:

$ vagrant up
../vagrant/Vagrantfile:4:in <top (required)>': undefined methodconfigure' for Vagrant:Module (NoMethodError)
...

Needed to update my vagrant install.

A note in the README would probably have saved a small confusion.

iPython issues

Great idea to use Vagrant.
I built the code at work today and it worked fine on Windows.
I've built it on Mac OSX a number of times and it never seems to quite work. Previously iPython wouldn't start, that seems to be fixed.
Latest build with fresh clone builds and runs iPython, but the notebook files haven't been copied across.
Using "vagrant ssh" there are two just two ipynb files in the mtsw2e directory: Untitled0.ipynb, and Untitled1.ipynb
I take it vagrant does a git clone of mtsw2e, but why would there be two "Untitled ipynbs"?

I'm getting used to this, but hope to understand it soon to make and add my own fixes.

Possible UnicodeEncodeErrors?

Going through the book this evening, I saw this is 4.6:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 10: ordinal not in range(128)

Data:

[
 "For",
 "sure:",
 "\"War",
 "without",
 "reflection",
 "is",
 "mechanical",
 "slaughter,\u201d",
 "said",

I was able to fix this problem by encoding the text. I don't know if this might be in other places in the text, or if we should use nltk.word_tokenize instead of split like below recommended in https://code.google.com/p/nltk/issues/detail?id=370

all_content = " ".join([ a['object']['content'] for a in activity_results ])

# Approximate bytes of text
print len(all_content)

# Tokenize after encoding
unicode_text = all_content.encode("utf-8")
tokens = unicode_text.split()
# tokens = all_content.split()
text = nltk.Text(tokens)

getting oauth tokens confusing

For me the new window doesn't open I see no token to copy. If I open the page myself by typinghttp://127.0.0.1:5000//oauth_helper in the location window, I get no token.

This is running in the notebook from the VM. I should note somewhere that localhost:8889 does not work, so I did a

 ssh -L 10000:localhost:8889 -p 2222 vagrant@localhost

and then pointed my browser to localhost:10000. It didn't work running on my bare metal either.

But what I really don't understand is why the machinations with opening a browser window and copy and pasting the token? Cannot it just be stored in a variable?

Minor update to 3.9 and 3.10

Tiny fix in addition to #53, 398 should be same as 446:
geo = g.geocode(transformed_location, exactly_one=False)

I'll submit a pull

Update 1:11

Example tweet doesn't exist anymore. Retweet that worked: id=332632666681794561

----> 6 r_retweets = twitter_api.statuses.retweets(id=316944833233186816

TwitterHTTPError: Twitter sent status 404 for URL: 1.1/statuses/retweets/316944833233186816.json using parameters: (oauth_consumer_key=zJHoFn0o7drlifL7OdX4Uw&oauth_nonce=3398124918154645548&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1373805911&oauth_token=17930287-BIQadLbCvGijCl2ayMAxSGoz49AXNjejkGxqY3IAs&oauth_version=1.0&oauth_signature=F4%2BJG7PqaervE4sc6dxrXrtmTxk%3D)
details: {"errors":[{"message":"Sorry, that page does not exist","code":34}]}

ch 1 ex 18-19--consider graphing logs of the data

Looking at those counts as logs improves the graphs in Ex 18 a little, and Ex 19 a lot.

e.g.,

counts = [count for count, _, _ in retweets]
plt.hist(log(counts))
print counts
print plt.hist(log(counts))
plt.ylabel('Number of items in bin')
plt.xlabel('log(Number of times appeared)')
plt.title("Retweets")
plt.figure()

Chapter1: loading canned data, confusing, frustrating

consider a comment in the cell that loads in the sample data that suggests to comment out the line that loads in the sample data set. One needs to execute that cell to populate t, but it took me a peculiarly long time to think to just comment out the json.loads(...) line to use my own data. In fact, you might comment out the json.loads line and suggest to un-comment it for people who want to use the canned data. I'd guess that most people would rather use data that they just generated themselves and be able to fall back to the canned data if something went awry or the stuff didn't make sense with the live data.

split the loading of the canned data and populating t into separate (named) examples.

I don't understand why
if status['id'] == 316948241264549888
is in that code. Leaving it out and using my data works just fine. Like this:

t = [ status 
      for status in statuses ][0]

Trying to get Vagrant going...

Excited to get this project going and I'm a preview subscriber...

I have a Dell laptop and updated the bios settings but I get this error:

C:\Mining-the-Social-Web-2nd-Edition-master>vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
[default] Setting the name of the VM...
[default] Clearing any previously set forwarded ports...
[Berkshelf] Updating Vagrant's berkshelf: 'C:/Users/bunland/.berkshelf/default/v
agrant/berkshelf-20130727-5776-a53334-default'
←[31m[Berkshelf] Failed to download 'apt' from git: 'git://github.com/opscode-co
okbooks/apt.git' with branch: 'master' at ref: '489d2e2d60'←[0m
Berkshelf::CookbookNotFound: Cookbook 'apt' not found in any of the default loca
tions

Early days experience and question on updating

I bought the 1st edition and got a lot from it (even referenced it in an online paper http://www.researchinlearningtechnology.net/index.php/rlt/article/view/18598/html). So I've bought the 2nd edition and looking forward to using it to expand my range beyond analysing Twitter. I'm making comments here because I suspect I'm less fluent with some of these systems than people who have already posted. I may be more akin to typical buyers of the book. I thought you might be interested in my experience as I work through some sections of the book.

I'm using Win 7 and running Virtual Box for various versions of Linux.

Installing git.

I had never tried to use git so this seemed a good opportunity. I tried the tutorial at http://try.github.io but was not impressed. Some times it stopped working, but it never gave a clear idea of what the operations were accomplishing. I had to read parts of http://gitscm.com/ before I could proceed. I think using the first three chapters of this as a learning resource, rather than just a reference source as you suggest, would be a better recommendation. The difference between Git for Windows and msysgit was initially confusing.

VM vs Vagrant

I'm now using Vagrant as you suggest and finding the advantages you talked about. Which is very good. What wasn't so good was getting to this stage via Appendix A. I expected to have to download a VM, as I've done with some Coursera courses. But it seemed I had to use Vagrant to get my VM. So I went with that. BTW when I looked at the README in the Vagrant directory I found there was a possibility of downloading the VM without using Vagrant. But I can't find such a VM on the github site.

Using Vagrant and the VM

The message at the start of the Vagrant download about 'precise 64' was confusing. Should I abort the download or not? Decided to go on. A warning about this would have been helpful.

Once the VM was started I expected it to be like other Ubuntu VMs I've downloaded. So I was confused to be faced with only a command line. Then I tried http://localhost:8888 and got nothing. I finally got this to work but I'm not sure how I did it. I think it was by using bootstrap.sh which I hadn't done until that point. Or perhaps it was using vagrant resume or vagrant up. I think the general point is that Appendix A is difficult to follow, especially if you haven't used a system like Vagrant before. The written instructions could be clearer and they don't seem to completely agree with the video. For example, I can imagine some less knowledgeable people might think they had to use a browser in the VM, rather than the host.

So now I'm deep into the text and happily using Vagrant.

Updating

It's not clear to me how I update to get the latest version. Presumably git pull, but is that in the VM in the Vagrant directory?

I'm sure I'll have other comments/questions later.

Best Wishes

copyright and licensing annoying

IANAL, but

Consider making the copyright in each individual chapter a link to the license in another file, or at the end or something. Maybe a single sentence that says "you may not copy this code without reading ".

It's less obvious how you might shorten the intro paragraph to each notebook (though, maybe just delete the first paragraph with the nicely-written "click here to see github" thing.

In chapter0 you put the copyright stuff at the end. That's lovely. Do that in all the chapters unless there's a great reason not to.

ch1/ex16--how to use my data?

Having gone to some lengths to use my data rather than the canned stuff, I was surprised when Ch1/Ex16 printed something else (had I not re-run each cell after populating statuses with my data?). It'd be nice if somewhere above code captured the ID of the most-retweeted user id and either set a variable with that ID or just printed it so that someone could paste it in to this cell.

Ch1: ex 6 suggestion (show how length of statuses grows)

Consider adding a
print len(statuses)
to the end of the for loop so that one can see how the length of statuses grows.

Also, I don't understand why my query for "Xbox+One" the length is 200 the first time through the loop, 200 the second time and it doesn't run again. Similarly, increasing count to 1000 has no effect. I'd love to know why.

The VM is so cool, can you share how you make one?

The VM seems like a great practice for sharing code at work, talks or in research... a step up from the python package virtualenv. In the wiki or somewhere could you point to something informal about how to use vagrant, chef and iPython to share reproducible explorations?

Chapter 9: declare make_twitter_request() earlier

Example 16 (Resolving user profile information) uses this function but it's not declared until Example 18 (Making Robust Twitter Requests).

typo in ipython ch3 example 10

Most results contain a response that can be parsed by

picking out the first to consecutive upper case letters

as a clue for the state

should be "first two consecutive"

is this the right way to submit such nit-picky typos?
Do you wan to hear these?

Programming practice: getting the index while looping.

The following occurs frequently:

for i in range(len(something)):

Consider replacing a loop where you need the index with:

for i, v in enumerate(something):

include URL to get info from graph API explorer in a web browser

It'd be much clearer that this example was using the http interface if you included a link so that one could see the stuff show up in the browser. Something like this:

....
base_url = 'https://graph.facebook.com/me'
fields = 'id,name,friends.fields(likes)'
url = '%s?fields=%s&access_token=%s' %
(base_url, fields, ACCESS_TOKEN)

print url
r = requests.get(url)

....

I presume that works. Right now I get "error_msg": "An unknown error occurred" (both from Python and the web browser). It works when I do not include ,friends.fields(likes) in the request. I'm assuming that this is something spurious and not code related. Clicking "submit" on the same queryin the Graph API web page just waits forever.

Maybe add README note about disabling sleep

Ran vagrant up and walked away. Came back with VM stalled at boot stage — and essentially unrecoverable. Deleted in VirtualBox (and rm -r .vagrant) and started again.

A note to consider disabling sleep might have saved some time.

Avoid examples with no output (e.g., Ch1, ex 8)

Consider appending (or sticking these after each section of code that generates them)

print json.dumps(status_texts[0:5], indent=1)
print json.dumps(screen_names[0:5], indent=1) 
print json.dumps(hashtags[0:5], indent=1)

to ch1 ex 8 so that one can see that something happened. Code with no output isn't much of an example.

Typo in Chapter 2, Example 1 Code

Hi Matthew,

As you know, I'm new to github and all. But in the Chapter 2 code, Example 1 here http://nbviewer.ipython.org/urls/raw.github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/Chapter2.ipynb

The line is:
fields = 'id,name,friends.fields(likes)'

It should be:
fields = 'id,name,friends.field(likes)'

Notice that field is singular in the last instance.

Please consider updating the page as its the first example and I'm sure you don't want to crush your reader's confidence the first time around.

Now I'm not sure if I'm supposed to be running code from that page or what. The .zip file has all iPython notebook code. I don't know what that is and I can't invest the time to tackle that mountain at the moment. But since the code in the chapters seems to be straightforward, I'm just going to roll with that for now.

Also in this line:
base_url = 'https://graph.facebook.com/me'
Please mention that "me" should be replaced by the userid.

Looking forward to hearing your thoughts.

consider renaming chapter0 or making its heading be "chapter 0"

Maybe it's just me, but having the link name (chapter0) and the top heading (welcome!) being different is somehow jarring.

You said you wanted more tickets. :-)

Example 11 Chapter 1

I can't get it to work in the iPython Notebook. Can you clarify how to find the appropriate user_id to fill in the blank with (i.e. would it be possible to provide some sample code?). I appreciate any help you can give!

Working through the Advisories...

After vagrant destroy, git pull, vagrant plugin install XXX, and then vagrant up...I get the following error message:

---- Begin output of pip install git+git://github.com/ptwobrussell/jpype.git#eg
g=jpype-ptwobrussell-github ----
STDOUT: Downloading/unpacking jpype-ptwobrussell-github from git+git://github.co
m/ptwobrussell/jpype.git#egg=jpype-ptwobrussell-github
Cloning git://github.com/ptwobrussell/jpype.git to /tmp/pip-build-root/jpype-p
twobrussell-github
Complete output from command /usr/bin/git clone -q git://github.com/ptwobrusse
ll/jpype.git /tmp/pip-build-root/jpype-ptwobrussell-github:

Cleaning up...
Command /usr/bin/git clone -q git://github.com/ptwobrussell/jpype.git /tmp/pip-b
uild-root/jpype-ptwobrussell-github failed with error code 128 in None
Storing complete log in /root/.pip/pip.log
STDERR: fatal: unable to connect to github.com:
github.com[0: 204.232.175.90]: errno=Connection timed out
---- End output of pip install git+git://github.com/ptwobrussell/jpype.git#egg=
jpype-ptwobrussell-github ----
Ran pip install git+git://github.com/ptwobrussell/jpype.git#egg=jpype-ptwobruss
ell-github returned 1
Chef never successfully completed! Any errors should be visible in the
output above. Please fix your recipes so that they properly complete.

ch02

I pulled the latest version just a little while ago.

top of document says " navigating directly to https://www.linkedin.com/secure/developer." I presume you mean https://developers.facebook.com/apps/ ?

Also, I've created a couple of Facebook apps, but haven' yet gotten it to allow me to use FB as one of the apps (only my other fan pages show up).

I don''t see a place that says "Site URL" to fill in localhost:5000/oauth_redirect

I had trouble with using pip to install matplotlib. I think I'd recommend that Ubuntu users use "sudo apt-get install python-requests python-flask" rather than pip. The trade-off there is getting old versions of stuff vs getting stuff that someone has made sure it works under ubuntu. I understand the appeal of getting people to use the virtual machine rather than actually getting stuff installed on their own machine, but it was a fair amount of hassle to get that to work; more correctly, it was enough of a hassle that I didn't get the virtual machine to work. Perhaps I spoke too soon.

When I checked all of the permissions in all three tabs to generate my access token, example 1 dies with an "unknown error." You might tell people to get just the user rights on the first page, maybe.

I'm starting to see the appeal of iPython, but as an emacs user, it's a little tough. Also, I like the idea of the stand-alone apps that were in the first edition. It seems like it's a long way from a notebook into an app I can run from the command line, but perhaps that's a personal problem.

Is this the best way to communicate info like this? Is it helpful?

Remove the sudos in the bootstrap.sh script?

The shell provisioner runs as root, so there's no need to use sudo, but it may be helpful to provision a VM later on down the road.

http://docs-v1.vagrantup.com/v1/docs/provisioners.html

ch2 ex 1 needs to say what permissions are required

"appropriate permissions" is not very descriptive. It's not immediately clear that id and name are "basic permissions" or that you need to click on the 'friends data permissions' tab to find 'user_likes' or that 'user_likes' == friends.fields(likes).

Say explicitly that before an access token can be generated a window pops up to ask permission of whoever is logged in to facebook.

Say explicitly that one needs to get a token from https://developers.facebook.com/tools/explorer

9:20 needs default a for friends limit when calling get_friends_followers_ids()

friends_ids, followers_ids = get_friends_followers_ids(twitter_api, screen_name=screen_name)
Current code throws the below error.

<ipython-input-94-ba8126de5a53> in get_friends_followers_ids(twitter_api, screen_name, user_id, friends_limit, followers_limit)
     40     # Do something useful with the ids like store them to disk...
     41 
---> 42     return friends_ids[:limit], followers_ids[:limit]
     43 
     44 # Sample usage

TypeError: slice indices must be integers or None or have an __index__ method

Able to workaround by specifying friends_limit and followers_limit parameters:
friends_ids, followers_ids = get_friends_followers_ids(twitter_api, screen_name=screen_name, friends_limit=71, followers_limit=772)

Fetched 71 total friends ids for ptwobrussell
Fetched 772 total followers ids for ptwobrussell
ptwobrussell is following 71
ptwobrussell is being followed by 772
48 of 71 are not following ptwobrussell back
749 of 772 are not being followed back by ptwobrussell
ptwobrussell has 23 mutual friends

Where is vagrant/share built on the host file system?

Instructions on how to find it would be a useful addition to the Linkedin and mailbox chapters and/or appendix A. It is currently blocking me.

put note in README.md that bootstrap.sh will install everything necessary

Doh. If I'd seen bootstrap.sh from the start I might not have had to deal with fussing with Python or with installing a 2.3GB image on my hard drive. Consider adding that to the README.

On a related note, it's mildly annoying that it does all that at every start-up. Consider explicitly telling people that they can avoid that with "vagrant suspend" this lengthy startup can be avoided. OTOH, I read it the first time and realized it myself the second time I ran a 'vagrant up' command. Perhaps this is obvious.

This is just a note. You're welcome to close this ticket without doing anything.

8.5 - FuXi update

As currently installed, FuXi doesn't work

vagrant@precise64:/usr/lib$ FuXi
Traceback (most recent call last):
  File "/usr/local/bin/FuXi", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2711, in <module>
    parse_requirements(__requires__), Environment()
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 584, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: rdflib<3a

I was able to get this to work by using RDFLib's repo after commit RDFLib/FuXi@351f72a

git clone https://github.com/RDFLib/FuXi.git
cd FuXi/
python setup.py install

Finally, the naive option doesn't work with the RDFLib FuXi, so I had to remove it to get the expected output.

FuXi --rules=resources/ch08-semanticweb/chuck-norris.n3 --ruleFacts #--naive

Typo in page 11

Line 10 from the bottom:
"it may be worth nothing" -> noting

Chapter 3, Example 3-11.

Code:
import nltk
print nltk.bigrams("Chief Executive Officer".split(), pad_right=True, pad_left=True)
print nltk.bigrams("Chief Technology Officer".split(), pad_right=True, pad_left=True)
print len(set(ceo_bigrams).intersection(set(cto_bigrams)))

However, ceo_bigrams and cto_bigrams are not defined as variables.

The following code would work as an alternative:

import nltk
ceo_bigrams = nltk.bigrams("Chief Executive Officer".split(), pad_right=True, pad_left=True)
cto_bigrams = nltk.bigrams("Chief Technology Officer".split(), pad_right=True, pad_left=True)

print ceo_bigrams
print cto_bigrams
print len(set(ceo_bigrams).intersection(set(cto_bigrams)))

Install nltk packages and run IPython as vagrant user

In my fork, I tried unsuccessfully to install the nltk packages and run IPython as vagrant user.
Maybe someone else will have more success, though it's only an irritant to use sudo -i to become root when logged in as the vagrant user.

ch03 can get authorization code but get error calling get_access_token

Out of the box with the latest d29f15c
I provide an API key and Secret key, and get back a valid authorization code.

Note that in this example, I changed the write location to /home/vagrant/mtsw2e/ipynb/resources/ch03-linkedin/linkedin.authorization_code

give each cell an Example header

the one in chapter 1 that loads in the example data from the json file has no such header.

Similarly, ch 1, ex 7 is merely output from the previous code block. I'll add a note for that one, but consider having someone check the entire code base for these.

Break out IDF better in Chapter 4

In the Whiz-Bang Introduction to TF-IDF you do a great job of carefully working through term frequency. The examples in Table 4-1 and 4-2 are clear and well explained. However the description for IDF is a bit disjoint.

The Figure 4-4 seems unnecessary since it's just a graph of a log function, but the numbers are too small/poor resolution to even be able to tell if you intend log base 2 or log base 10.

The discussion on the IDF calculation in the code example is in the last paragraph. This should all happen before the code example. Also, The second table in Figure 4-8 should be before the code example and be described with the exact equation that will be used. Here it is buried in with the full blown TF-IDF calculation.

I really enjoy your coverage of TF and when I read it I feel confident. When I try and discern what's intended for IDF I feel lost. Because I feel lost of IDF, I don't usually try to understand TF-IDF. When I reviewed TF-IDF again today I recalled that I remember feeling this way 2 years ago when I worked through this section previously. I'd wished it could be different (and now it can be!).

consider an intro to each chapter

I find that I read the book at one time and place and move to the code in another time and place, rather than having book and code open simultaneously. It would therefore be helpful to have each notebook have an intro to remind me what the thing was about. It might be worthwhile to have each chapter/notebook to have a brief intro that says "this notebook deals with DATA SOURCE. It'll look at how to connect to that API and then do X, Y, Z. This chapter also includes an introduction to Library (e.g., prettytable, matplotlib)"

ch2: graphic confusing and too big

I don't understand what I'm to take away from the pull-down with "mining the social web" in it. From the context, I would think that what appears there is an application that I would like to use Facebook as, or that's what I thought at first. When I pull down the gear on a regular facebook page, what I see are my name and my fan page (I Live in My van). I don't know how that's helpful here. A screen shot of the "select permissions" dialog that comes up after you click the "get access token" button on https://developers.facebook.com/tools/explorer

If it were useful, the image still needs to be resized, as it comes about about 2.5 times bigger than it is when I pull it down in my own web browser.

ch1/ex 18 & 19 -- what is the x axis?

I could make no sense at all of the x axes in the histogram graphs until I added the below to ex 19

print counts
print plt.hist(counts)

to see that there's a tweet with 2156 retweets. It's really hard to make sense of that from the graph. I haven't checked the text to see whether it makes a point about how most things get very few retweets but a few get very, very many, but it would have helped me if there were a bit of explanation.

Merely labeling the axes might make the point and would be a useful thing to know about mathplotlib. Besides, your 4th grade teacher would be ashamed after all the time she spent telling you to label the axes of your graphs. :-)

Something like:

plt.xlabel('Number of times appeared')
plt.ylabel('Number of items in bin')

It would seem that including "plt." consistently would be a good idea. That's what happens in the matplotlib tutorial:

http://matplotlib.org/users/pyplot_tutorial.html