
linwoodc3 / gdeltpyr

188 stars, 52 forks, 134.46 MB

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.

Home Page: https://linwoodc3.github.io/gdeltPyR/

License: GNU General Public License v3.0

Languages: Python 12.96%, Shell 0.09%, Jupyter Notebook 86.88%, HTML 0.07%
Topics: data-frame, gdelt, geolocation, geospatial-data, global-database, news, pandas, python

gdeltpyr's People

Contributors

harman28, iltc, linwoodc3, pietermarsman, reed9999, smritigambhir


gdeltpyr's Issues

Unable to install using "pip install gdelt"

Hello.
When I tried to install, I got the following:

Collecting gdelt
Using cached gdelt-0.1.10.6.1-py2.py3-none-any.whl (773 kB)
Discarding https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl (from https://pypi.org/simple/gdelt/): Requested gdelt from https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl has inconsistent version: expected '0.1.10.6.1', but metadata has '0.1.10.6'
Using cached gdelt-0.1.10.6.1.tar.gz (982 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 39, in
read('CHANGES')),
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 15, in read
with codecs.open(os.path.join(cwd, filename), 'rb', 'utf-8') as h:
File "/usr/lib/python3.10/codecs.py", line 906, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/CHANGES'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Pull GDelt V2 GKG data for the full day

Hi!

When I pull GKG data with the following code, I only get the first 15 minutes of data. Is it possible to get the full day's worth of GKG data?

gd = gdelt.gdelt(version=2)
date = extract_date.strftime('%Y%m%d')
df = gd.Search(date, table='gkg', coverage=True)

Many thanks!

Add json output format

Add a simple ability to output the returned data in JSON format. In the end, we'll return CSV, JSON, pandas dataframe, R dataframe, or HDF.
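A minimal sketch of how that dispatch could look, assuming the downloaded data is already a pandas DataFrame (the function and output names here are illustrative, not the final API):

def format_output(results, output='df'):
    # Dispatch the downloaded pandas DataFrame to the requested format.
    if output == 'json':
        return results.to_json(orient='records')
    elif output == 'csv':
        return results.to_csv(index=False)
    return results  # default: pandas DataFrame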

DOC: Make documentation pages with sphinx

As this is my first module, I need to learn how to use Sphinx documentation. Make the page with a concept description, a section on how to contribute (asking for help from experienced folks), and information on CAMEO codes and how to use them.

Not all available data is downloaded!!!

I get a lot of warnings saying that GDELT did not return data for certain dates. However, if I check manually, the data is available and I can download it. I have also checked the results and this data is missing.

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201044500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201014500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201001500
warnings.warn(message)
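One way to verify a reported interval by hand is to request the raw file directly; a sketch, assuming the standard GDELT 2.0 export URL layout:

import requests

# GDELT 2.0 publishes one zipped CSV per 15-minute interval.
url = 'http://data.gdeltproject.org/gdeltv2/20210201044500.export.CSV.zip'
resp = requests.head(url, timeout=30)
print(resp.status_code)  # 200 means the file exists on the server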

BUG: GDELT Version 2 not collecting the latest 15 minutes file

I've been using the same code to collect events every 15 minutes from the database for a few months now, but since yesterday I keep getting the error:

UserWarning: GDELT did not return data for date time 20200331120000 warnings.warn(message)

The code that I am using is:
gd2 = gdelt.gdelt(version=2)
results = gd2.Search('2020 03 31',table='events',output='json')

It works when collecting data for a date that is not the current date (31st March), so I think maybe instead of collecting the latest 15 minutes it is collecting whole-day files only?

Is there a way to fix this?
I'm using Python 3.5 64-bit on Windows.

EDIT: the issue seems to be with the timezone: because the clocks changed in the UK on the 29th, the URL being requested from the database is one hour ahead of the data available, which is the issue.

EDIT: I've temporarily fixed it by changing line 174 in dateFuncs.py to subtract an hour instead of using datetime.now() directly. Would it be possible to add a way to set this from the Search function itself rather than changing dateFuncs manually?
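The described workaround amounts to something like this (a sketch, not the actual line from dateFuncs.py):

import datetime

# Build the "latest interval" timestamp one hour behind local time
# to compensate for the clock change.
adjusted_now = datetime.datetime.now() - datetime.timedelta(hours=1)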

Thank you.

How to store the news data into csv?

Hello, thanks for this excellent package.

Could anyone let me know how to extract news data from GDELT using this package and store it in a .csv file?

Thank you very much!
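A minimal sketch of one way to do this, assuming Search returns a pandas DataFrame (the date and filename are placeholders):

import gdelt

gd = gdelt.gdelt(version=2)
results = gd.Search('2016 Nov 01', table='events')
# pandas DataFrames write straight to CSV.
results.to_csv('gdelt_events_20161101.csv', index=False)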

Missing requirements? (pytest-cov, geopandas)

I'm running the tests for the first time, and I went through the usual process of pip3 install -r requirements.txt from a virtualenv. It seems like some necessary packages might be missing:

pytest-cov

I got some issues with py.test: error: unrecognized arguments: --cov --cov-repo=term-missing, which turned out to have a different root cause (the system pytest was being used). Nevertheless, in troubleshooting I got the impression pytest-cov probably should be installed explicitly. It might just be installed by upgrading to a current python-pytest; I'm not sure.

geopandas

Now that I can run the tests, everything seems to pass except a couple that fail with ModuleNotFoundError: No module named 'geopandas'. I thought geopandas would be installed by requirements_geo.txt, but apparently not. It's unclear to me which of the requirements files should install it, or both.

coverage=True for gkg search error

Whenever I set coverage=True for a gkg search I receive the error below. With an events search I don't experience this error.

Code
gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)

Error

AssertionError Traceback (most recent call last)
in
----> 1 gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)
2 gkg.columns

/opt/miniconda3/envs/thesis/lib/python3.8/site-packages/gdelt/base.py in Search(self, date, table, coverage, translation, output, queryTime, normcols)
632 else:
633
--> 634 pool = NoDaemonProcessPool(processes=cpu_count())
635 downloaded_dfs = list(pool.imap_unordered(_mp_worker,
636 self.download_list,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
210 self._processes = processes
211 try:
--> 212 self._repopulate_pool()
213 except Exception:
214 for p in self._pool:

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool(self)
301
302 def _repopulate_pool(self):
--> 303 return self._repopulate_pool_static(self._ctx, self.Process,
304 self._processes,
305 self._pool, self._inqueue,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
317 """
318 for i in range(processes - len(pool)):
--> 319 w = Process(ctx, target=worker,
320 args=(inqueue, outqueue,
321 initializer,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/process.py in __init__(self, group, target, name, args, kwargs, daemon)
     80 def __init__(self, group=None, target=None, name=None, args=(), kwargs={},
     81 *, daemon=None):
---> 82 assert group is None, 'group argument must be None for now'
83 count = next(_process_counter)
84 self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

GDELT did not return data for any date time?

We are seeing a bunch of failing HTTP requests, but the URL seems to be valid and the file can be downloaded using another HTTP client.

How can we narrow it down? Some console output looks like the requests are running into timeouts.
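One way to narrow it down is to fetch a single interval URL with an explicit timeout and watch where it fails; a sketch, assuming the standard GDELT 2.0 export URL layout:

import requests

url = 'http://data.gdeltproject.org/gdeltv2/20210201001500.export.CSV.zip'
try:
    resp = requests.get(url, timeout=60)
    print(resp.status_code, len(resp.content))
except requests.exceptions.RequestException as exc:
    # Distinguishes timeouts/connection errors from bad HTTP responses.
    print('request failed:', exc)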

ENH: Add ability to pull specific time interval on date

Right now, gdeltPyR can pull date ranges for historical dates and current-day data. Add the ability to specify specific time intervals on a date to pull data. The historical 2.0 query pulls only the last 15-minute interval of the day if coverage is set to False. Need to give more flexibility.
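A hypothetical call for this enhancement (the time_range parameter does not exist in the current API; it is illustrative only):

import gdelt

gd = gdelt.gdelt(version=2)
# Pull only the 12:00-13:00 window instead of the whole day.
results = gd.Search('2020 Mar 31', table='events',
                    time_range=('12:00', '13:00'))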

ENH: Add a new class that provides information on each table and column

This is part of Phase 1.

GDELT is a very complex data set and beginners will need to understand what is available. This is a multi-pronged issue as it is tied to #30.

The implementation is up to the coder who takes this on, but for consideration:

  • Create a class that is an "information" or "whatIs" class. The name of the class should be easy to understand and let the user know to use this specific class to learn more about tables and their column names.

  • Each GDELT table (events, gkg, iatv, mentions, literature) should have a method that returns a description of the table. GDELT Codebook descriptions may help give a generic overview of tables.

  • Each table will need different descriptions for the different GDELT versions (version 1 and version 2). The main difference is that new columns or improvements should be highlighted in the description. For example, the Events 1 table has fewer columns than the Events 2 table. The description will briefly explain why (maybe one sentence at the beginning of the description of Events 2).

  • Each table will have a dataframe that provides a description of the columns. Each column will have a name, data type (integer, string, etc.), and a description.

  • Write a unit test for each table; start by writing failing unit tests first (to load the table), then go back and make the tables load with the descriptions. We must have a unit test for each table.

A potential tree is:
gdelt.info -> events -> columndescription

OR

gdelt.info(version=2) -> events -> tabledescription

The version should be set in gdelt; a rough illustrative sketch follows below.
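A minimal sketch of what such a class could look like (the class name, method, and column descriptions are illustrative, not an agreed design):

import pandas as pd

class GdeltInfo(object):
    """Hypothetical "whatIs" class describing GDELT tables and columns."""

    def __init__(self, version=2):
        self.version = version

    def events(self):
        # Returns a dataframe describing the events table columns.
        return pd.DataFrame([
            {'name': 'GLOBALEVENTID', 'dtype': 'integer',
             'description': 'Unique identifier for the event record.'},
            {'name': 'SQLDATE', 'dtype': 'integer',
             'description': 'Date the event occurred, as YYYYMMDD.'},
        ])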

ENH: Add shapefile output

Add a method to convert geopandas output into a shapefile OR include an option that allows the user to write the gdeltPyR results directly to a shapefile.
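A minimal sketch of the conversion, assuming the results have already been turned into a geopandas GeoDataFrame (the GeoDataFrame here is a placeholder):

import geopandas as gpd
from shapely.geometry import Point

# Placeholder GeoDataFrame standing in for converted gdeltPyR results.
gdf = gpd.GeoDataFrame({'name': ['example']},
                       geometry=[Point(-77.0, 38.9)], crs='EPSG:4326')
# geopandas writes ESRI shapefiles directly.
gdf.to_file('gdelt_results.shp', driver='ESRI Shapefile')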

Installing package doesn't work

Whenever I try installing the package I get this error. This didn't happen earlier.

Collecting gdelt
Downloading gdelt-0.1.13.tar.gz (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 16.0 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.

Max retries

On an AWS machine, I keep getting a max-retries message. I never get this message on my personal computer, so machines with really fast processors may be sending requests to the GDELT servers too fast. Need to add a synthetic delay.
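A sketch of a synthetic delay with simple backoff between requests (the interval lengths are arbitrary):

import time
import requests

def fetch_with_delay(url, retries=3, delay=1.0):
    # Sleep between attempts so fast machines don't hammer the GDELT servers.
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=60)
        except requests.exceptions.RequestException:
            time.sleep(delay * (attempt + 1))  # linear backoff
    raise RuntimeError('max retries exceeded for %s' % url)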

Add geopandas geodataframe output

Add the ability to output the returned data as a geopandas GeoDataFrame; this will make it easy to add other output styles (shapefile, GeoJSON). It also makes it easy to do a choropleth, mapping a statistical variable (e.g., the count of a particular CAMEO code) onto a map. Should add the world geopandas data set to this as well (need to find a small one).
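A minimal sketch of the conversion, assuming an events DataFrame with the standard ActionGeo_Lat/ActionGeo_Long columns (names may differ after normcols):

import geopandas as gpd

def to_geodataframe(df):
    # Drop rows without coordinates, then build point geometries.
    df = df.dropna(subset=['ActionGeo_Lat', 'ActionGeo_Long'])
    geometry = gpd.points_from_xy(df['ActionGeo_Long'], df['ActionGeo_Lat'])
    return gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')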

BUG: Proxy issue when importing

I get a proxy error when trying to import the module. This is problematic since you can't pass parameters when importing (IIRC). Seems like this is the problem bit.

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     74     codes = pd.read_json(os.path.join(BASE_DIR, 'data', 'cameoCodes.json'),
---> 75                          dtype=dict(cameoCode='str', GoldsteinScale=np.float64))
     76     codes.set_index('cameoCode', drop=False, inplace=True)

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression)
    421 
--> 422     result = json_reader.read()
    423     if should_close:

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read(self)
    528         else:
--> 529             obj = self._get_object_parser(self.data)
    530         self.close()

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _get_object_parser(self, json)
    545         if typ == 'frame':
--> 546             obj = FrameParser(json, **kwargs).parse()
    547 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in parse(self)
    637         else:
--> 638             self._parse_no_numpy()
    639 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _parse_no_numpy(self)
    852             self.obj = DataFrame(
--> 853                 loads(json, precise_float=self.precise_float), dtype=None)
    854         elif orient == "split":

ValueError: Expected object or value

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    593             if is_new_proxy_conn:
--> 594                 self._prepare_proxy(conn)
    595 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in _prepare_proxy(self, conn)
    814 
--> 815         conn.connect()
    816 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    323             # self._tunnel_host below.
--> 324             self._tunnel()
    325             # Mark this connection as not reusable

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py in _tunnel(self)
    910             raise OSError("Tunnel connection failed: %d %s" % (code,
--> 911                                                                message.strip()))
    912         while True:

OSError: Tunnel connection failed: 407 AuthorizedOnly

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    444                     retries=self.max_retries,
--> 445                     timeout=timeout
    446                 )

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637             retries = retries.increment(method, url, error=e, _pool=self,
--> 638                                         _stacktrace=sys.exc_info()[2])
    639             retries.sleep()

~/gdelt/venv/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    397         if new_retry.is_exhausted():
--> 398             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    399 

MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

During handling of the above exception, another exception occurred:

ProxyError                                Traceback (most recent call last)
<ipython-input-1-b6a720b4b38d> in <module>()
----> 1 import gdelt

~/gdelt/venv/lib/python3.7/site-packages/gdelt/__init__.py in <module>()
      4 from __future__ import absolute_import
      5 
----> 6 from gdelt.base import gdelt
      7 
      8 __name__ = 'gdelt'

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
     84 ##############################

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
     70 
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
     73 
     74 

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
     59 
     60 

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    510         }
    511         send_kwargs.update(settings)
--> 512         resp = self.send(prep, **send_kwargs)
    513 
    514         return resp

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    620 
    621         # Send the request
--> 622         r = adapter.send(request, **kwargs)
    623 
    624         # Total elapsed time of the request (approximately)

~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    505 
    506             if isinstance(e.reason, _ProxyError):
--> 507                 raise ProxyError(e, request=request)
    508 
    509             if isinstance(e.reason, _SSLError):

ProxyError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

Error on pulling dates older than 2013, version 1

>>> import gdelt
>>> gd = gdelt.gdelt(version=1)
>>> results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-368ad372ac85>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events',version=1)
TypeError: Search() got an unexpected keyword argument 'version'
gd = gdelt.gdelt(version=1)
results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-e0d15ebbd9c9>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events')
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/base.py", line 426, in Search
    else:
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/vectorizingFuncs.py", line 100, in urlBuilder
    if parse(dateString) < parse('2013 Apr 01'):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 1168, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 581, in parse
    ret = default.replace(**repl)
ValueError: month must be in 1..12

.idea Folder is Not Needed

The .idea folder is an artifact of the PyCharm editor. It should be removed and added to .gitignore to reduce clutter.

BUG: Event search not working on windows 32 bit machine

import gdelt
import requests.packages.urllib3

requests.packages.urllib3.disable_warnings()
import platform
print(platform.architecture())
import gdelt

gd = gdelt.gdelt(version=2)

results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
print(results)


#output

D:\SUSHANT\pyt\python.exe C:/Users/sushant.s/PycharmProjects/testAGAIN/GDELT.py
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

('32bit', 'WindowsPE')
('32bit', 'WindowsPE')

ENH: add google bigquery interface

Use [pandas.io.gbq](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html#pandas.read_gbq)

First you will need to install the API client: pip install --upgrade google-api-python-client
Here is a working query:

# load keys;  requires you to be registered
keys = json.load(open('/Users/linwood/Desktop/keysforapps/apikeys.txt'))

# setup google creds; not sure if this is required yet; but you need to do it once to authorize the api from your python ecosystem
from apiclient.discovery import build
service = build('bigquery', 'v2', developerKey=keys['google']['apikey']+"2")

# load query in proper SQL syntax as string
from pandas.io import gbq
q="""
SELECT MonthYear,count(*)c,count(IF(Actor1Code LIKE 'MUS',1,null)) c_up
FROM [gdelt-bq.full.events] WHERE EventRootCode = '19'
GROUP BY MonthYear ORDER BY MonthYear;"""


# run the query
df = gbq.read_gbq(q, project_id=<projectid>)
[out]:
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 461 rows.

Total time taken 0.75 s.
Finished at 2017-05-30 10:26:21

Cannot run sample code for GDELT v2

Hi, when I run the sample code provided for v2, I receive the following error. v1 works fine. Please help me understand why this could be happening. Thank you.

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 288, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "main.py", line 11, in
results = gd2.Search(['2016 11 01'],table='events',coverage=True)
File "/opt/anaconda3/lib/python3.9/site-packages/gdelt/base.py", line 635, in Search
pool = Pool(processes=cpu_count())
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 212, in init
self._repopulate_pool()
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
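The fix the error message points to is guarding the Search call with the standard main-module idiom, for example:

import gdelt

def main():
    gd2 = gdelt.gdelt(version=2)
    results = gd2.Search(['2016 11 01'], table='events', coverage=True)
    print(len(results))

if __name__ == '__main__':
    # Required on platforms that spawn (rather than fork) worker processes.
    main()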

AssertionError: group argument must be None for now while querying gkg over a time period

When I try to search over a time period, I get an AssertionError. Interestingly, it works when I run it with only one date, e.g. gd.Search('2016 10 19',coverage=True,table='gkg')

What I queried:
%time results = gd.Search(['2016 10 19','2016 10 22'],coverage=True,table='gkg')

The error:

File :1

File [c:\Users\\anaconda3\envs\myenv\Lib\site-packages\gdelt\base.py:634](file:///C:/Users//anaconda3/envs/myenv/Lib/site-packages/gdelt/base.py:634), in gdelt.Search(self, date, table, coverage, translation, output, queryTime, normcols)
    630     downloaded_dfs = list(pool.imap_unordered(eventWork,
    631                                               self.download_list))
    632 else:
--> 634     pool = NoDaemonProcessPool(processes=cpu_count())
    635     downloaded_dfs = list(pool.imap_unordered(_mp_worker,
    636                                               self.download_list,
    637                                               ))
    638 pool.close()

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:215](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:215), in Pool.__init__(self, processes, initializer, initargs, maxtasksperchild, context)
    213 self._processes = processes
    214 try:
--> 215     self._repopulate_pool()
    216 except Exception:
    217     for p in self._pool:

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:306](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:306), in Pool._repopulate_pool(self)
    305 def _repopulate_pool(self):
--> 306     return self._repopulate_pool_static(self._ctx, self.Process,
    307                                         self._processes,
    308                                         self._pool, self._inqueue,
    309                                         self._outqueue, self._initializer,
    310                                         self._initargs,
    311                                         self._maxtasksperchild,
    312                                         self._wrap_exception)

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:322](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:322), in Pool._repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
    318 """Bring the number of pool processes up to the specified number,
    319 for use after reaping workers which have exited.
    320 """
    321 for i in range(processes - len(pool)):
--> 322     w = Process(ctx, target=worker,
    323                 args=(inqueue, outqueue,
    324                       initializer,
    325                       initargs, maxtasksperchild,
    326                       wrap_exception))
    327     w.name = w.name.replace('Process', 'PoolWorker')
    328     w.daemon = True

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\process.py:82](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/process.py:82), in BaseProcess.__init__(self, group, target, name, args, kwargs, daemon)
...
---> 82     assert group is None, 'group argument must be None for now'
     83     count = next(_process_counter)
     84     self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

Extract all locations from the gkg table

As a geospatial analyst,
I need to extract and classify all locations from the knowledge graph,
so that I can easily extract locations at the country, state, or city level.
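A sketch of parsing the GKG V1 locations field, assuming the semicolon-delimited block format from the GDELT codebook (the exact column name depends on the table version and normcols):

LOCATION_TYPES = {'1': 'country', '2': 'us_state', '3': 'us_city',
                  '4': 'world_city', '5': 'world_state'}

def parse_locations(field):
    # Each block: Type#FullName#CountryCode#ADM1Code#Lat#Long#FeatureID
    records = []
    for block in str(field).split(';'):
        parts = block.split('#')
        if len(parts) >= 6 and parts[0] in LOCATION_TYPES:
            records.append({'level': LOCATION_TYPES[parts[0]],
                            'name': parts[1],
                            'lat': parts[4],
                            'lon': parts[5]})
    return records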

FEATURE: Calculate day event was added

The events 2.0 codebook describes the fraction date. Here is code to convert the fraction date to the approximate date when the event happened, assuming a fraction date of 2020.2438.

import datetime

# The fractional part (2438 out of 9999) is the fraction of the year elapsed,
# so convert it to days and add it to January 1 of the year.
datetime.datetime(day=1, month=1, year=2020) + datetime.timedelta(days=int(2438 / 9999 * 365))

Running the example from readme.md failed

# GDELT 1.0 queries
import gdelt

# version 1 queries
gd1 = gdelt.gdelt(version=1)

# pull single day, gkg table
results = gd1.Search('2016 Nov 01', table='gkg')
print(len(results))

# pull events table, range, output to json format
results = gd1.Search(['2016 Oct 31', '2016 Nov 2'], coverage=True, table='events')
print(len(results))

++++++++++++++++++++++++++++++++++++++++++++++++++++

~/ub16_prj % python demogdelt.py

187291
187291
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users//ub16_prj/demogdelt.py", line 13, in
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
File "/usr/local/lib/python3.8/site-packages/gdelt/base.py", line 629, in Search
pool = Pool(processes=cpu_count())
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

187291

Add "get.data" function to download master list

This will reduce the load time and run time of the search function. Right now, for GDELT version 2.0, a single-day query takes 45-50 s. With this new functionality, we'll only make calls for the last-15-minute query or the historical get-data master list.
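A sketch of pulling the version 2.0 master file list directly, assuming the published masterfilelist.txt endpoint (the file is large, so the first call is slow):

import requests

# Each line of the master list is: <size> <md5> <url>
resp = requests.get('http://data.gdeltproject.org/gdeltv2/masterfilelist.txt',
                    timeout=120)
urls = [parts[2]
        for parts in (line.split() for line in resp.text.splitlines())
        if len(parts) == 3]
# Filter to a single day's events exports, e.g. 2016-11-01.
day_urls = [u for u in urls if '/20161101' in u and '.export.' in u]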

BUG: Add exception handling for no data returned

gdeltPyR returns a non-intuitive error when no data is returned for a single 15-minute data pull. Need to add exception handling to make it clear to the user that no data was returned; right now it looks like gdeltPyR is broken.

Example to recreate

import gdelt
gd = gdelt.gdelt()
a=gd.Search('2017 July 27')

[Out]:
...
  File "/Users/linwood/projects/gdeltPyR/gdelt/base.py", line 597, in Search
    if len(results.columns) == 57:
AttributeError: 'NoneType' object has no attribute 'columns'
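A sketch of the guard that would make this explicit instead of failing on results.columns:

def check_results(results):
    # Fail loudly when a 15-minute pull returns nothing.
    if results is None:
        raise ValueError('GDELT returned no data for the requested date.')
    return results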

DOC: Add markdown file on contributing to `gdeltPyR`

  • Use the pandas contributing document as a guide.
  • Define versioning logic
  • Implement release plans with a group of features to add before the version number is updated
  • Explain how to set up dev environment

Early contributing guidance:

  • I'm using http://semver.org/ and this Stack Overflow post as a model for versioning; I'm using a four-number scheme (0.0.0.0):

    • major version (changes when all planned features are added)
    • minor version (changes when new classes or modules are added that change the results or analysis of the GDELT data returned)
    • minor-minor version (changes with smaller enhancements, such as classes or modules that just return unaltered data, new parameters to existing classes/modules, etc.)
    • bug fixes - the last number is the bug-fix count for the current build; no changes to existing functionality, but it fixes a MAJOR bug that stops the entire module from working. Simple little bug fixes don't get counted. It resets to zero on a minor-minor change and only counts bugs, so with no bugs it stays at zero. This number will eventually be dropped once the unit-test suite has 80% coverage.
  • Small one- or two-line entry in CHANGES (gdeltPyR --> CHANGES). Just a date line and a description that says something like "added support for translated GDELT data"; you can add your GitHub username if you want, too.

  • Add a short bullet to README.md and README.rst (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

This looks good. Thanks for adding the unittests.

Administrative changes before any merge.

  • Small one- or two-line entry in CHANGES explaining what you did (gdeltPyR --> CHANGES). Just a date line and a description that says something like "Added support for translated GDELT data"; you can add your GitHub username if you want, too.

  • Add a short bullet to README.md and README.rst (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

  • Add an issue that defines the bug or feature you plan to work on. Reference the issue in your commits, and close it if your commit addresses the issue, whether it's an enhancement, bug fix, documentation update, etc.

  • (Optional) Add your name or GitHub username (or both) to AUTHORS.rst (gdeltPyR --> AUTHORS.rst). Keep track of all contributors to show the power of open source. Feel free to add your country as well.

Travis builds are currently broken for >=3.6

0.36s$ pytest
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov-repo=term-missing
  inifile: /home/travis/build/linwoodc3/gdeltPyR/setup.cfg
  rootdir: /home/travis/build/linwoodc3/gdeltPyR
The command "pytest" exited with 4.
before_cache
0.01s$ rm -f $HOME/.cache/pip/log/debug.log
cache.2
store build cache

NameError: global name 'p' is not defined

Traceback (most recent call last):
File "D:\XXX\coding\gdelt\gdeltPyR.py", line 12, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "C:\Users\XXX\Anaconda2\lib\site-packages\gdelt\base.py", line 290, in Search
p
NameError: global name 'p' is not defined
