venthur / gscholar


Query Google Scholar with Python

License: MIT License

Python 92.47% Makefile 7.53%

google-scholar python cli pdf bibtex gscholar

gscholar

Query Google Scholar using Python.

Requirements

Python
pdftotext (command line tool)

Installing

$ pip install gscholar

Using gscholar as a command line tool

gscholar provides a command line tool. To use it, call gscholar in either of the following ways:

$ gscholar "albert einstein"

$ python3 -m gscholar "albert einstein"

Making a simple lookup:

$ gscholar "some author or title"

will return the first result from Google Scholar matching this query.

Getting more results:

$ gscholar --all "some author or title"

Same as above, but returns up to 10 bibtex items. (Use with caution: Google may assume you're a bot and ban your IP temporarily.)

Querying using a pdf:

$ gscholar /path/to/pdf

Will read the pdf to generate a Google Scholar query. It uses this query to show the first bibtex result as above.

Renaming a pdf:

$ gscholar --rename /path/to/pdf

Will do the same as above, but asks whether it should rename the file according to the bibtex result. You have to answer with "y" to confirm; the default answer is no.

Getting help:

$ gscholar --help

Using gscholar as a python library

Install the gscholar package with pip install as described above or copy the package somewhere Python can find it.

import gscholar

gscholar.query("some author or title")

will return a list of bibtex items.
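Because Google may temporarily block clients that send many requests in a short time, it can help to space out library calls. The following is a minimal sketch and not part of gscholar: the helper name `throttled_lookup`, its parameters and the 10-second default are illustrative assumptions. Only `gscholar.query` comes from the package.

```python
import time


def throttled_lookup(queries, query_fn, delay_seconds=10.0, sleep=time.sleep):
    """Run query_fn on each query, pausing between consecutive requests.

    query_fn is expected to behave like gscholar.query: it takes a search
    string and returns a list of citation strings. The default delay is an
    arbitrary illustration, not a documented safe rate.
    """
    results = {}
    for index, search in enumerate(queries):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[search] = query_fn(search)
    return results
```

With the package installed, this could be called as `throttled_lookup(["first title", "second title"], gscholar.query)`.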


gscholar's Issues

Query error when used in Jupyter

Hello,

The following commands are OK in Jupyter:

!pip install gscholar
import gscholar

However, when I try to run a query, an error is raised:

gscholar.query('A Graph Digital Signal Processing Method for Semantic Analysis')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-1ffebe0e16ff> in <module>()
----> 1 gscholar.query('A Graph Digital Signal Processing Method for Semantic Analysis')

AttributeError: module 'gscholar' has no attribute 'query'

Thanks,
Mircea

Retrieving more than 10 entries

Hi!

We are developers of a research software package called Sage-Combinat, and we are looking for a tool that will extract from Google Scholar a preliminary list of papers that mention Sage-Combinat, in bibtex format. gscholar sounds great for this!

The only feature we are missing is that we would like to get all citations, and that should be roughly 60 of them. Do you see a way to allow for this?

Thanks a lot!
Nicolas
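One conceivable approach is to request several result pages in turn. The sketch below only builds the page URLs. It assumes that Google Scholar selects a result page through a `start` offset parameter and serves 10 results per page. Neither is confirmed by gscholar, and `result_page_urls` is a hypothetical helper. Each page is an extra request, so the risk of a temporary block grows with the number of pages.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration.
GOOGLE_SCHOLAR_URL = "https://scholar.google.com"


def result_page_urls(search, wanted, page_size=10):
    """Build one search URL per result page until `wanted` results are covered.

    ASSUMPTION: result pages are addressed with a `start` offset parameter.
    """
    urls = []
    for offset in range(0, wanted, page_size):
        params = urlencode({"q": search, "start": offset})
        urls.append("%s/scholar?%s" % (GOOGLE_SCHOLAR_URL, params))
    return urls
```

For roughly 60 citations, `result_page_urls("Sage-Combinat", 60)` yields six URLs with offsets 0 through 50.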

pdf downloading?

Is there a way to also fetch the accompanying pdf files?

Errors

Sometimes gscholar works just fine for me, but it often suddenly stops working and spits out errors like the following instead. It doesn't seem to be related to Google blocking my IP, because I can still search Google Scholar normally through my browser. Any suggestions?

Traceback (most recent call last):
  File "/home/brian/bin/gscholar.py", line 206, in <module>
    biblist = query(args, outformat, options.all)
  File "/home/brian/bin/gscholar.py", line 68, in query
    response = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 442, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 629, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable
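If the 503 is temporary, retrying after a growing pause may be enough. This is a minimal sketch under that assumption, written for Python 3 (`urllib.error`) rather than the Python 2 `urllib2` shown in the traceback. `query_with_backoff`, its parameters and the delay values are hypothetical, and `query_fn` stands in for a function such as `gscholar.query`.

```python
import time
from urllib.error import HTTPError


def query_with_backoff(query_fn, search, retries=3, first_delay=60.0,
                       sleep=time.sleep):
    """Call query_fn(search), retrying on HTTP 503 with doubling delays."""
    delay = first_delay
    for attempt in range(retries + 1):
        try:
            return query_fn(search)
        except HTTPError as error:
            # Only 503 is treated as temporary. Any other status, or a 503
            # on the last allowed attempt, is passed on to the caller.
            if error.code != 503 or attempt == retries:
                raise
            sleep(delay)
            delay *= 2
```

Whether Google lifts a block within these delays is not known. The sketch only avoids hammering the server while the error persists.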

Package seems to be broken. Queries return nothing.

Description

I pip installed this package hoping to use it to search Google Scholar by title, but even queries for famous authors and papers return [], an empty list. Since this is the core functionality of the package, it's a pretty big problem.

Steps to reproduce

pip install gscholar
python3
from gscholar import gscholar
gscholar.query("Albert Einstein") or gscholar.query("A Mathematical Theory of Communication")

Expected Behaviour

A non-empty list of Einstein's papers, or having Claude E. Shannon's Information Theory paper returned.

Actual Behaviour

returns [] in both cases. Seems to return [] for all queries.

400 Error no matter what

Hello,

Great software, very simple. It worked wonderfully for a while, then a few weeks ago it started doing this all the time:

$ python gscholar.py "Einstein"

Traceback (most recent call last):
  File "gscholar.py", line 204, in <module>
    biblist = query(args, outformat, options.all)
  File "gscholar.py", line 79, in query
    response = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
$

It does this no matter what args/kwargs you feed it, and also happens if you use it as a module. This happens from my home IP and work IP, and it's a different HTTPError than the 503 you get by being banned. I'm not sure why that happens, but I thought you should know.

Query citation number

Dear Bastian Venthur,
gscholar is of great help to me!
Is it possible to get the citation number (Cited by XXX) by gscholar?
Thanks in advance.
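The gscholar code shown in this document returns only the citation text, so the count would have to be read from the search result HTML. A rough sketch, assuming the page contains the literal English text "Cited by N" for each result (an assumption about Google's markup that may not hold); `citation_counts` is a hypothetical helper:

```python
import re

# ASSUMPTION: each result carries the literal text "Cited by <number>".
CITED_BY = re.compile(r"Cited by (\d+)")


def citation_counts(html):
    """Return every "Cited by N" number found in html, in page order."""
    return [int(number) for number in CITED_BY.findall(html)]
```

Results without citations would simply be missing from the list, so the counts cannot be matched to results by position alone.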

Feature Request: Add semanticscholar and dblp data

Hi, I find this work really useful. Would it be possible to add extra data sources such as semanticscholar and dblp? The bib information on Google is sometimes wrong.

No results found, try again with a different query!

I just installed gscholar under OS X 10.12.3 using pip.

The script is installed in

/usr/local/lib/python2.7/site-packages/gscholar/

so I ran

python /usr/local/lib/python2.7/site-packages/gscholar/gscholar.py -d "Albert Einstein"

and got the following response

DEBUG:root:Assuming you want me to lookup the query: Albert Einstein
DEBUG:root:Query: Albert Einstein
No results found, try again with a different query!

does that mean something is broken?

version 1.6 is missing tar-ball on pypi

can you make the tar-ball available on pypi again? (openSUSE, for example, relies on these to build their packages)

Python 3 compatibility

great script, is it compatible with Python 3?

Python2 and Python3 support

gscholar should run on Python2 and Python3.

I've prepared the first steps by converting print to print() and fixing imports. I also added two unit tests, which cover ASCII and Unicode queries. Those should be run with nosetest and nosetest3.

Some problems when converting the HTML to Unicode remain unsolved for now.

[feature] provide URL in title

That is, the returned result would be

@inproceedings{kim2020instability,
  title={\href{https://dl.acm.org/doi/abs/10.1145/3394171.3413680}{Instability of successive deep image compression}},
  author={Kim, Jun-Hyuk and Jang, Soobeom and Choi, Jun-Ho and Lee, Jong-Seok},
  booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
  pages={247--255},
  year={2020}
}

not

@inproceedings{kim2020instability,
  title={Instability of successive deep image compression},
  author={Kim, Jun-Hyuk and Jang, Soobeom and Choi, Jun-Ho and Lee, Jong-Seok},
  booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
  pages={247--255},
  year={2020}
}

Thanks!
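As a rough sketch of the requested behaviour, the title field could be wrapped after the entry has been retrieved. `link_title` is a hypothetical helper and not part of gscholar. The URL has to be supplied by the caller, since the code shown in this document does not extract it.

```python
import re

# Matches a single-line BibTeX field of the form:  title={...},
TITLE_FIELD = re.compile(r"^([ \t]*title[ \t]*=[ \t]*\{)(.*)(\},?[ \t]*)$",
                         re.MULTILINE)


def link_title(bibtex_entry, url):
    """Wrap the title of a BibTeX entry in \\href{url}{...}."""
    def _wrap(match):
        # A function replacement keeps backslashes in the output literal.
        return "%s\\href{%s}{%s}%s" % (match.group(1), url,
                                       match.group(2), match.group(3))
    return TITLE_FIELD.sub(_wrap, bibtex_entry, count=1)
```

An entry without a matching title line is returned unchanged. The resulting `\href` command needs the LaTeX hyperref package when the bibliography is typeset.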

No more search results

Hi,

gscholar worked perfectly for me until recently, but in the last few days I get "No results found, try again with a different query!" for every request I send, regardless of the content. I am running Arch Linux, fully updated and have updated the pip package.

Best,

kjk

Key Error with unicode?

Hi,

This works great; I just have to keep my queries at a slow rate so Google Scholar doesn't block my IP.

The problem I'm having is that if my query contains non-English characters, it quits with a KeyError.

Specifically, it's line 61 in gscholar.py.

Thanks!
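One plausible cause, which the report does not confirm, is Python 2's `urllib.quote` being handed a unicode string. On Python 3, `urllib.parse.quote` encodes `str` input as UTF-8 before escaping, so the search path can be built as in this sketch (`build_search_path` is a hypothetical helper name):

```python
from urllib.parse import quote


def build_search_path(search):
    """Return the Google Scholar search path for a query string."""
    # quote() encodes str input as UTF-8 before percent-escaping it.
    return "/scholar?q=" + quote(search)


print(build_search_path("Schrödinger"))  # /scholar?q=Schr%C3%B6dinger
```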

pypi tar-ball is missing LICENSE file

would be great, if this could be included (e.g. for packaging in linux distributions)

SyntaxError

I have downloaded and installed gscholar.py through pip, and I'm getting the following syntax error when I run the query gscholar.py "albert einstein":

File "C:\Python34\Lib\site-package\gscholar\gscholar.py", line 182
l = [i for i in year, author, title if i]
^
SyntaxError: invalid syntax

Do you know if I am doing anything wrong, or how I can solve this problem? Thanks a lot in advance,

walker
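The reported line uses a bare tuple as the iterable of a list comprehension, which Python 2 accepts and Python 3 rejects. A minimal illustration of the Python 3 compatible form (the function name `non_empty_parts` is made up for this example):

```python
def non_empty_parts(year, author, title):
    """Return the given values that are neither None nor empty."""
    # Python 3 requires parentheses around the tuple being iterated over.
    return [part for part in (year, author, title) if part]


print(non_empty_parts("2020", None, "Some title"))  # ['2020', 'Some title']
```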

Use entry_point instead of script (example included)

Check out setup.py: https://github.com/fabric/fabric/blob/ecb588b788cfea5d2becc6957ad12e8c78042a38/setup.py#L51 and https://github.com/fabric/fabric/blob/master/fabric/__main__.py
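For reference, a console-script entry point in setuptools maps a command name to a `module:function` target. The sketch below shows the shape of such a declaration. The target `gscholar.__main__:main` is a hypothetical location for the command line code, not something the repository is known to provide.

```python
# Sketch of the relevant part of a setup.py.
ENTRY_POINTS = {
    "console_scripts": [
        # Format: "<command> = <module>:<function>".
        # ASSUMPTION: a main() function exists in gscholar/__main__.py.
        "gscholar = gscholar.__main__:main",
    ],
}


def run_setup():
    """Call setuptools.setup() with the entry point declared above."""
    # Imported lazily so the declaration can be inspected without building.
    from setuptools import setup
    setup(name="gscholar", packages=["gscholar"], entry_points=ENTRY_POINTS)
```

A real setup.py would call `run_setup()` at module level, or call `setup()` directly.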

pip install gscholar fails

Hi. I've tried installing gscholar from pip without success. I'm getting this:

Collecting git+https://github.com/venthur/gscholar.git
  Cloning https://github.com/venthur/gscholar.git to /private/var/folders/3d/mwdmdx1x6rq84xt47hff9dxc0000gp/T/pip-zeres_-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/3d/mwdmdx1x6rq84xt47hff9dxc0000gp/T/pip-zeres_-build/setup.py", line 6, in <module>
        import gscholar
      File "gscholar/__init__.py", line 1, in <module>
        from gscholar.gscholar import *
    ImportError: No module named gscholar

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3d/mwdmdx1x6rq84xt47hff9dxc0000gp/T/pip-zeres_-build/

pip 9.0.1
Python 2.7.10
OSX High Sierra (10.13)

Any suggestions ?

Thank you
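The traceback shows setup.py importing the package it is about to install. A common way to avoid that is to read metadata from the source file as text. The sketch below assumes the import exists to obtain a `__version__` string, which the traceback does not confirm; `read_version` and the file layout are hypothetical.

```python
import re

# Matches a line such as:  __version__ = "1.6.0"
VERSION_RE = re.compile(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                        re.MULTILINE)


def read_version(path):
    """Return the __version__ string found in the file at path."""
    with open(path, encoding="utf8") as handle:
        match = VERSION_RE.search(handle.read())
    if match is None:
        raise RuntimeError("no __version__ found in %s" % path)
    return match.group(1)
```

In setup.py this would replace the import, for example `version=read_version("gscholar/__init__.py")`, if that is where the version string lives.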

Old version doesn't work anymore, any reason why?

I used to use an old version of gscholar and made my own little Python hack around it so that it copies the bibtex straight into my .bib file, but this version no longer seems to work. Any reason why?

My last edit to this file seems to be April 10, 2016. Full code (single gscholar.py file):

#!/usr/bin/env python

# gscholar - Get bibtex entries from Google Scholar
# Copyright (C) 2011-2015  Bastian Venthur <venthur at debian org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or (at
# your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
# 02110-1301, USA.


"""
Library to query Google Scholar.

Call the method query with a string which contains the full search
string. Query will return a list of citations.

"""

try:
    # python 2
    from urllib2 import Request, urlopen, quote
except ImportError:
    # python 3
    from urllib.request import Request, urlopen, quote

try:
    # python 2
    from htmlentitydefs import name2codepoint
except ImportError:
    # python 3
    from html.entities import name2codepoint

import re
import hashlib
import random
import sys
import os
import subprocess
import optparse
import logging


# fake google id (looks like it is a 16 elements hex)
rand_str = str(random.random()).encode('utf8')
google_id = hashlib.md5(rand_str).hexdigest()[:16]

GOOGLE_SCHOLAR_URL = "http://scholar.google.com"
# the cookie looks normally like:
#        'Cookie' : 'GSP=ID=%s:CF=4' % google_id }
# where CF is the format (e.g. bibtex). since we don't know the format yet, we
# have to append it later
HEADERS = {'User-Agent': 'Mozilla/5.0',
           'Cookie': 'GSP=ID=%s' % google_id}

FORMAT_BIBTEX = 4
FORMAT_ENDNOTE = 3
FORMAT_REFMAN = 2
FORMAT_WENXIANWANG = 5


def query(searchstr, outformat=FORMAT_BIBTEX, allresults=False):
    """Query google scholar.

    This method queries google scholar and returns a list of citations.

    Parameters
    ----------
    searchstr : str
        the query
    outformat : int, optional
        the output format of the citations. Default is bibtex.
    allresults : bool, optional
        return all results or only the first (i.e. best one)

    Returns
    -------
    result : list of strings
        the list with citations

    """
    logging.debug("Query: {sstring}".format(sstring=searchstr))
    searchstr = '/scholar?q='+quote(searchstr)
    url = GOOGLE_SCHOLAR_URL + searchstr
    # copy the headers so repeated calls do not keep appending to the
    # module-level cookie
    header = dict(HEADERS)
    header['Cookie'] = header['Cookie'] + ":CF=%d" % outformat
    request = Request(url, headers=header)
    response = urlopen(request)
    html = response.read()
    html = html.decode('utf8')
    # grab the links
    tmp = get_links(html, outformat)

    # follow the bibtex links to get the bibtex entries
    result = list()
    if not allresults:
        tmp = tmp[:1]
    for link in tmp:
        url = GOOGLE_SCHOLAR_URL+link
        request = Request(url, headers=header)
        response = urlopen(request)
        bib = response.read()
        bib = bib.decode('utf8')
        result.append(bib)
    return result


def get_links(html, outformat):
    """Return a list of reference links from the html."""
    if outformat == FORMAT_BIBTEX:
        refre = re.compile(r'<a href="(/scholar\.bib\?[^"]*)')
    elif outformat == FORMAT_ENDNOTE:
        refre = re.compile(r'<a href="(/scholar\.enw\?[^"]*)"')
    elif outformat == FORMAT_REFMAN:
        refre = re.compile(r'<a href="(/scholar\.ris\?[^"]*)"')
    elif outformat == FORMAT_WENXIANWANG:
        refre = re.compile(r'<a href="(/scholar\.ral\?[^"]*)"')
    reflist = refre.findall(html)
    # escape html entities
    reflist = [re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
                      chr(name2codepoint[m.group(1)]), s) for s in reflist]
    return reflist


def convert_pdf_to_txt(pdf, startpage=None):
    """Convert a pdf file to text and return the text.

    This method requires pdftotext to be installed.
    """
    if startpage is not None:
        startpageargs = ['-f', str(startpage)]
    else:
        startpageargs = []
    stdout = subprocess.Popen(["pdftotext", "-q"] + startpageargs + [pdf, "-"],
                              stdout=subprocess.PIPE).communicate()[0]
    # communicate() returns bytes; decode so the text can be handled as str
    return stdout.decode('utf8', 'ignore')


def pdflookup(pdf, allresults, outformat, startpage=None):
    """Look a pdf up on google scholar and return bibtex items."""
    txt = convert_pdf_to_txt(pdf, startpage)
    # remove all non alphanumeric characters
    txt = re.sub(r"\W", " ", txt)
    words = txt.strip().split()[:20]
    gsquery = " ".join(words)
    bibtexlist = query(gsquery, outformat, allresults)
    return bibtexlist


def _get_bib_element(bibitem, element):
    """Return element from bibitem or None."""
    lst = [i.strip() for i in bibitem.split("\n")]
    for i in lst:
        if i.startswith(element):
            value = i.split("=", 1)[-1]
            value = value.strip()
            while value.endswith(','):
                value = value[:-1]
            while value.startswith('{') or value.startswith('"'):
                value = value[1:-1]
            return value
    return None


def rename_file(pdf, bibitem):
    """Attempt to rename pdf according to bibitem."""
    year = _get_bib_element(bibitem, "year")
    author = _get_bib_element(bibitem, "author")
    if author:
        author = author.split(",")[0]
    title = _get_bib_element(bibitem, "title")
    l = [i for i in (year, author, title) if i]
    filename = "-".join(l) + ".pdf"
    newfile = pdf.replace(os.path.basename(pdf), filename)
    print()
    print("Will rename:")
    print()
    print("  %s" % pdf)
    print()
    print("to")
    print()
    print("  %s" % newfile)
    print()
    print("Proceed? [y/N]")
    answer = input()
    if answer == 'y':
        print("Renaming %s to %s" % (pdf, newfile))
        os.rename(pdf, newfile)
    else:
        print("Aborting.")


if __name__ == "__main__":
    usage = 'Usage: %prog [options] {pdf | "search terms"}'
    parser = optparse.OptionParser(usage)
    parser.add_option("-a", "--all", action="store_true", dest="all",
                      default=False, help="show all bibtex results")
    parser.add_option("-d", "--debug", action="store_true", dest="debug",
                      default=False, help="show debugging output")
    parser.add_option("-r", "--rename", action="store_true", dest="rename",
                      default=False, help="rename file (asks before doing it)")
    parser.add_option("-f", "--outputformat", dest='output',
                      default="bibtex",
                      help="Output format. Available formats are: bibtex, endnote, refman, wenxianwang [default: %default]")
    parser.add_option("-s", "--startpage", dest='startpage',
                      help="Page number to start parsing PDF file at.")
    (options, args) = parser.parse_args()
    if options.debug is True:
        logging.basicConfig(level=logging.DEBUG)
    if options.output == 'bibtex':
        outformat = FORMAT_BIBTEX
    elif options.output == 'endnote':
        outformat = FORMAT_ENDNOTE
    elif options.output == 'refman':
        outformat = FORMAT_REFMAN
    elif options.output == 'wenxianwang':
        outformat = FORMAT_WENXIANWANG
    if len(args) != 1:
        parser.error("No argument given, nothing to do.")
        sys.exit(1)
    args = args[0]
    pdfmode = False
    if os.path.exists(args):
        logging.debug("File exist, assuming you want me to lookup the pdf: {filename}.".format(filename=args))
        pdfmode = True
        biblist = pdflookup(args, options.all, outformat, options.startpage)
    else:
        logging.debug("Assuming you want me to lookup the query: {query}".format(query=args))
        biblist = query(args, outformat, options.all)
    if len(biblist) < 1:
        print("No results found, try again with a different query!")
        sys.exit(1)
    if options.all is True:
        logging.debug("All results:")
        for i in biblist:
            print(i)
    else:
        logging.debug("First result:")
        print(biblist[0])
    if options.rename is True:
        if not pdfmode:
            print("You asked me to rename the pdf but didn't tell me which file to rename, aborting.")
            sys.exit(1)
        else:
            rename_file(args, biblist[0])