
linwoodc3 / gdeltpyr

188 stars, 52 forks, 134.46 MB

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.

Home Page: https://linwoodc3.github.io/gdeltPyR/

License: GNU General Public License v3.0

Languages: Python 12.96%, Shell 0.09%, Jupyter Notebook 86.88%, HTML 0.07%
Topics: data-frame, gdelt, geolocation, geospatial-data, global-database, news, pandas, python

gdeltpyr's People

Contributors

harman28, iltc, linwoodc3, pietermarsman, reed9999, smritigambhir


gdeltpyr's Issues

Unable to install using "pip install gdelt"

Hello.
When I tried to install, I got the following:

Collecting gdelt
Using cached gdelt-0.1.10.6.1-py2.py3-none-any.whl (773 kB)
Discarding https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl (from https://pypi.org/simple/gdelt/): Requested gdelt from https://files.pythonhosted.org/packages/65/f9/a3d5111c8f17334b1752c32aedaab0d01ab4324bf26417bd41890d5b25d0/gdelt-0.1.10.6.1-py2.py3-none-any.whl has inconsistent version: expected '0.1.10.6.1', but metadata has '0.1.10.6'
Using cached gdelt-0.1.10.6.1.tar.gz (982 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 39, in
read('CHANGES')),
File "/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/setup.py", line 15, in read
with codecs.open(os.path.join(cwd, filename), 'rb', 'utf-8') as h:
File "/usr/lib/python3.10/codecs.py", line 906, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-dfkqd9oe/gdelt_fc3c3612c6dd4afbaff9146e7ebd3384/CHANGES'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Pull GDelt V2 GKG data for the full day

Hi!

When I pull GKG data with the following code, I only get the first 15 minutes of data. Is it possible to get the full day's worth of GKG data?

gd = gdelt.gdelt(version=2)
date = extract_date.strftime('%Y%m%d')
df = gd.Search(date, table='gkg', coverage=True)

Many thanks!

Add json output format

Add a simple ability to output the returned data in JSON format. In the end, we'll return CSV, JSON, pandas dataframe, R dataframe, or HDF.
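A minimal sketch of how that dispatch could look, assuming the downloaded data is already a pandas DataFrame (the function and output names here are illustrative, not the final API):

def format_output(results, output='df'):
    # Dispatch the downloaded pandas DataFrame to the requested format.
    if output == 'json':
        return results.to_json(orient='records')
    elif output == 'csv':
        return results.to_csv(index=False)
    return results  # default: pandas DataFrame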

DOC: Make documentation pages with sphinx

As this is my first module, I need to learn how to use Sphinx documentation. Make the page with a concept description, a section on how to contribute (asking for help from experienced folks), and information on CAMEO codes and how to use them.

Not all available data is downloaded!!!

I get a lot of warnings saying that GDELT did not return data for certain dates. However, if I check manually, the data is available and I can download it. I have also checked the results and this data is missing.

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201044500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201014500
warnings.warn(message)

/home/python3.10/site-packages/gdelt/parallel.py:111: UserWarning: GDELT did not return data for date time 20210201001500
warnings.warn(message)
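One way to verify a reported interval by hand is to request the raw file directly; a sketch, assuming the standard GDELT 2.0 export URL layout:

import requests

# GDELT 2.0 publishes one zipped CSV per 15-minute interval.
url = 'http://data.gdeltproject.org/gdeltv2/20210201044500.export.CSV.zip'
resp = requests.head(url, timeout=30)
print(resp.status_code)  # 200 means the file exists on the server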

BUG: GDELT Version 2 not collecting the latest 15 minutes file

I've been using the same code to collect events every 15 minutes from the database for a few months now, but since yesterday I keep getting the error:

UserWarning: GDELT did not return data for date time 20200331120000 warnings.warn(message)

The code that I am using is:
gd2 = gdelt.gdelt(version=2)
results = gd2.Search('2020 03 31',table='events',output='json')

It works when collecting data for a date that is not the current date (31st March), so I think maybe instead of collecting the latest 15 minutes it is collecting whole-day files only?

Is there a way to fix this?
I'm using Python 3.5 64-bit on Windows.

EDIT: the issue seems to be with the timezone: because the clocks changed in the UK on the 29th, the URL being requested from the database is one hour ahead of the data available, which is the issue.

EDIT: I've temporarily fixed it by changing line 174 in dateFuncs.py to subtract an hour instead of using datetime.now() directly. Would it be possible to add a way to set this from the Search function itself rather than changing dateFuncs manually?
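The described workaround amounts to something like this (a sketch, not the actual line from dateFuncs.py):

import datetime

# Build the "latest interval" timestamp one hour behind local time
# to compensate for the clock change.
adjusted_now = datetime.datetime.now() - datetime.timedelta(hours=1)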

Thank you.

How to store the news data into csv?

Hello, thanks for this excellent package.

Could anyone let me know how to extract news data from GDELT using this package and store it in a .csv file?

Thank you very much!
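A minimal sketch of one way to do this, assuming Search returns a pandas DataFrame (the date and filename are placeholders):

import gdelt

gd = gdelt.gdelt(version=2)
results = gd.Search('2016 Nov 01', table='events')
# pandas DataFrames write straight to CSV.
results.to_csv('gdelt_events_20161101.csv', index=False)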

Missing requirements? (pytest-cov, geopandas)

I'm running the tests for the first time, and I went through the usual process of pip3 install -r requirements.txt from a virtualenv. It seems like some necessary packages might be missing:

pytest-cov

I got some issues with py.test: error: unrecognized arguments: --cov --cov-repo=term-missing, which turned out to have a different root cause (the system pytest was being used). Nevertheless, in troubleshooting I got the impression pytest-cov probably should be installed explicitly. It might just be installed by upgrading to a current python-pytest; I'm not sure.

geopandas

Now that I can run the tests, everything seems to pass except a couple that fail with ModuleNotFoundError: No module named 'geopandas'. I thought geopandas would be installed by requirements_geo.txt, but apparently not. It's unclear to me which of the requirements files should install it, or both.

coverage=True for gkg search error

Whenever I set coverage=True for a gkg search I receive the error below. With an events search I don't experience this error.

Code
gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)

Error

AssertionError Traceback (most recent call last)
in
----> 1 gkg = gd.Search(['2017 May 23'],table='gkg',normcols=True,coverage=True)
2 gkg.columns

/opt/miniconda3/envs/thesis/lib/python3.8/site-packages/gdelt/base.py in Search(self, date, table, coverage, translation, output, queryTime, normcols)
632 else:
633
--> 634 pool = NoDaemonProcessPool(processes=cpu_count())
635 downloaded_dfs = list(pool.imap_unordered(_mp_worker,
636 self.download_list,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
210 self._processes = processes
211 try:
--> 212 self._repopulate_pool()
213 except Exception:
214 for p in self._pool:

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool(self)
301
302 def _repopulate_pool(self):
--> 303 return self._repopulate_pool_static(self._ctx, self.Process,
304 self._processes,
305 self._pool, self._inqueue,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py in _repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
317 """
318 for i in range(processes - len(pool)):
--> 319 w = Process(ctx, target=worker,
320 args=(inqueue, outqueue,
321 initializer,

/opt/miniconda3/envs/thesis/lib/python3.8/multiprocessing/process.py in __init__(self, group, target, name, args, kwargs, daemon)
     80 def __init__(self, group=None, target=None, name=None, args=(), kwargs={},
     81 *, daemon=None):
---> 82 assert group is None, 'group argument must be None for now'
83 count = next(_process_counter)
84 self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

GDELT did not return data for any date time?

We are seeing a bunch of failing HTTP requests, but the URL seems to be valid and the file can be downloaded using another HTTP client.

How can we narrow it down? Some console output looks like the requests are running into timeouts.
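One way to narrow it down is to fetch a single interval URL with an explicit timeout and watch where it fails; a sketch, assuming the standard GDELT 2.0 export URL layout:

import requests

url = 'http://data.gdeltproject.org/gdeltv2/20210201001500.export.CSV.zip'
try:
    resp = requests.get(url, timeout=60)
    print(resp.status_code, len(resp.content))
except requests.exceptions.RequestException as exc:
    # Distinguishes timeouts/connection errors from bad HTTP responses.
    print('request failed:', exc)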

ENH: Add ability to pull specific time interval on date

Right now, gdeltPyR can pull date ranges for historical dates and current-day data. Add the ability to specify specific time intervals on a date to pull data. The historical 2.0 query pulls only the last 15-minute interval of the day if coverage is set to False. Need to give more flexibility.
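A hypothetical call for this enhancement (the time_range parameter does not exist in the current API; it is illustrative only):

import gdelt

gd = gdelt.gdelt(version=2)
# Pull only the 12:00-13:00 window instead of the whole day.
results = gd.Search('2020 Mar 31', table='events',
                    time_range=('12:00', '13:00'))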

ENH: Add a new class that provides information on each table and column

This is part of Phase 1.

GDELT is a very complex data set and beginners will need to understand what is available. This is a multi-pronged issue as it is tied to #30.

The implementation is up to the coder who takes this on, but for consideration:

  • Create a class that is an "information" or "whatIs" class. The name of the class should be easy to understand and let the user know to use this specific class to learn more about tables and their column names.

  • Each GDELT table (events, gkg, iatv, mentions, literature) should have a method that returns a description of the table. GDELT Codebook descriptions may help give a generic overview of tables.

  • Each table will need different descriptions for the different GDELT versions (version 1 and version 2). The main difference is that new columns or improvements should be highlighted in the description. For example, the Events 1 table has fewer columns than the Events 2 table. The description will briefly explain why (maybe one sentence at the beginning of the description of Events 2).

  • Each table will have a dataframe that provides a description of the columns. Each column will have a name, data type (integer, string, etc.), and a description.

  • Write a unit test for each table; start by writing failing unit tests first (to load the table), then go back and make the tables load with the descriptions. We must have a unit test for each table.

A potential tree is:
gdelt.info -> events -> columndescription

OR

gdelt.info(version=2) -> events -> tabledescription

The version should be set in gdelt; a rough illustrative sketch follows below.
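A minimal sketch of what such a class could look like (the class name, method, and column descriptions are illustrative, not an agreed design):

import pandas as pd

class GdeltInfo(object):
    """Hypothetical "whatIs" class describing GDELT tables and columns."""

    def __init__(self, version=2):
        self.version = version

    def events(self):
        # Returns a dataframe describing the events table columns.
        return pd.DataFrame([
            {'name': 'GLOBALEVENTID', 'dtype': 'integer',
             'description': 'Unique identifier for the event record.'},
            {'name': 'SQLDATE', 'dtype': 'integer',
             'description': 'Date the event occurred, as YYYYMMDD.'},
        ])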

ENH: Add shapefile output

Add a method to convert geopandas output into a shapefile OR include an option that allows the user to write the gdeltPyR results directly to a shapefile.
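A minimal sketch of the conversion, assuming the results have already been turned into a geopandas GeoDataFrame (the GeoDataFrame here is a placeholder):

import geopandas as gpd
from shapely.geometry import Point

# Placeholder GeoDataFrame standing in for converted gdeltPyR results.
gdf = gpd.GeoDataFrame({'name': ['example']},
                       geometry=[Point(-77.0, 38.9)], crs='EPSG:4326')
# geopandas writes ESRI shapefiles directly.
gdf.to_file('gdelt_results.shp', driver='ESRI Shapefile')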

Installing package doesn't work

Whenever I try installing the package I get this error. This didn't happen earlier.

Collecting gdelt
Downloading gdelt-0.1.13.tar.gz (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 16.0 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.

Max retries

On an AWS machine, I keep getting a max-retries message. I never get this message on my personal computer, so machines with really fast processors may be sending requests to the GDELT servers too fast. Need to add a synthetic delay.
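A sketch of a synthetic delay with simple backoff between requests (the interval lengths are arbitrary):

import time
import requests

def fetch_with_delay(url, retries=3, delay=1.0):
    # Sleep between attempts so fast machines don't hammer the GDELT servers.
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=60)
        except requests.exceptions.RequestException:
            time.sleep(delay * (attempt + 1))  # linear backoff
    raise RuntimeError('max retries exceeded for %s' % url)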

Add geopandas geodataframe output

Add the ability to output the returned data as a geopandas GeoDataFrame; this will make it easy to add other output styles (shapefile, GeoJSON). It also makes it easy to do a choropleth, mapping a statistical variable (e.g., the count of a particular CAMEO code) onto a map. Should add the world geopandas data set to this as well (need to find a small one).
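A minimal sketch of the conversion, assuming an events DataFrame with the standard ActionGeo_Lat/ActionGeo_Long columns (names may differ after normcols):

import geopandas as gpd

def to_geodataframe(df):
    # Drop rows without coordinates, then build point geometries.
    df = df.dropna(subset=['ActionGeo_Lat', 'ActionGeo_Long'])
    geometry = gpd.points_from_xy(df['ActionGeo_Long'], df['ActionGeo_Lat'])
    return gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')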

BUG: Proxy issue when importing

I get a proxy error when trying to import the module. This is problematic since you can't pass parameters when importing (IIRC). Seems like this is the problem bit.

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     74     codes = pd.read_json(os.path.join(BASE_DIR, 'data', 'cameoCodes.json'),
---> 75                          dtype=dict(cameoCode='str', GoldsteinScale=np.float64))
     76     codes.set_index('cameoCode', drop=False, inplace=True)

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression)
    421 
--> 422     result = json_reader.read()
    423     if should_close:

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in read(self)
    528         else:
--> 529             obj = self._get_object_parser(self.data)
    530         self.close()

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _get_object_parser(self, json)
    545         if typ == 'frame':
--> 546             obj = FrameParser(json, **kwargs).parse()
    547 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in parse(self)
    637         else:
--> 638             self._parse_no_numpy()
    639 

~/gdelt/venv/lib/python3.7/site-packages/pandas/io/json/json.py in _parse_no_numpy(self)
    852             self.obj = DataFrame(
--> 853                 loads(json, precise_float=self.precise_float), dtype=None)
    854         elif orient == "split":

ValueError: Expected object or value

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    593             if is_new_proxy_conn:
--> 594                 self._prepare_proxy(conn)
    595 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in _prepare_proxy(self, conn)
    814 
--> 815         conn.connect()
    816 

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    323             # self._tunnel_host below.
--> 324             self._tunnel()
    325             # Mark this connection as not reusable

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py in _tunnel(self)
    910             raise OSError("Tunnel connection failed: %d %s" % (code,
--> 911                                                                message.strip()))
    912         while True:

OSError: Tunnel connection failed: 407 AuthorizedOnly

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    444                     retries=self.max_retries,
--> 445                     timeout=timeout
    446                 )

~/gdelt/venv/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637             retries = retries.increment(method, url, error=e, _pool=self,
--> 638                                         _stacktrace=sys.exc_info()[2])
    639             retries.sleep()

~/gdelt/venv/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    397         if new_retry.is_exhausted():
--> 398             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    399 

MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

During handling of the above exception, another exception occurred:

ProxyError                                Traceback (most recent call last)
<ipython-input-1-b6a720b4b38d> in <module>()
----> 1 import gdelt

~/gdelt/venv/lib/python3.7/site-packages/gdelt/__init__.py in <module>()
      4 from __future__ import absolute_import
      5 
----> 6 from gdelt.base import gdelt
      7 
      8 __name__ = 'gdelt'

~/gdelt/venv/lib/python3.7/site-packages/gdelt/base.py in <module>()
     80         '/utils/' \
     81         'schema_csvs/cameoCodes.json'
---> 82     codes = json.loads((requests.get(a).content.decode('utf-8')))
     83 
     84 ##############################

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
     70 
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
     73 
     74 

~/gdelt/venv/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
     59 
     60 

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    510         }
    511         send_kwargs.update(settings)
--> 512         resp = self.send(prep, **send_kwargs)
    513 
    514         return resp

~/gdelt/venv/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    620 
    621         # Send the request
--> 622         r = adapter.send(request, **kwargs)
    623 
    624         # Total elapsed time of the request (approximately)

~/gdelt/venv/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    505 
    506             if isinstance(e.reason, _ProxyError):
--> 507                 raise ProxyError(e, request=request)
    508 
    509             if isinstance(e.reason, _SSLError):

ProxyError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /linwoodc3/gdeltPyR/master/utils/schema_csvs/cameoCodes.json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 AuthorizedOnly')))

Error on pulling dates older than 2013, version 1

>>> import gdelt
>>> gd = gdelt.gdelt(version=1)
>>> results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-368ad372ac85>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events',version=1)
TypeError: Search() got an unexpected keyword argument 'version'
gd = gdelt.gdelt(version=1)
results = gd.Search('2013 2 20',table='events')
Traceback (most recent call last):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-e0d15ebbd9c9>", line 1, in <module>
    results = gd.Search('2013 2 20',table='events')
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/base.py", line 426, in Search
    else:
  File "/Users/linwood/PycharmProjects/gdeltPyR/gdelt/vectorizingFuncs.py", line 100, in urlBuilder
    if parse(dateString) < parse('2013 Apr 01'):
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 1168, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/Users/linwood/anaconda3/envs/pycharmDev/lib/python3.6/site-packages/dateutil/parser.py", line 581, in parse
    ret = default.replace(**repl)
ValueError: month must be in 1..12

.idea Folder is Not Needed

The .idea folder is an artifact of the PyCharm editor. It should be removed and added to .gitignore to reduce clutter.

BUG: Event search not working on windows 32 bit machine

import gdelt
import requests.packages.urllib3

requests.packages.urllib3.disable_warnings()
import platform
print(platform.architecture())
import gdelt

gd = gdelt.gdelt(version=2)

results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
print(results)


#output

D:\SUSHANT\pyt\python.exe C:/Users/sushant.s/PycharmProjects/testAGAIN/GDELT.py
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
('32bit', 'WindowsPE')
Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\SUSHANT\pyt\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\SUSHANT\pyt\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\SUSHANT\pyt\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\sushant.s\PycharmProjects\testAGAIN\GDELT.py", line 11, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "D:\SUSHANT\pyt\lib\site-packages\gdelt\base.py", line 568, in Search
pool = Pool(processes=cpu_count())
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 168, in init
self._repopulate_pool()
File "D:\SUSHANT\pyt\lib\multiprocessing\pool.py", line 233, in _repopulate_pool
w.start()
File "D:\SUSHANT\pyt\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\SUSHANT\pyt\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\SUSHANT\pyt\lib\multiprocessing\popen_spawn_win32.py", line 33, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\SUSHANT\pyt\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

('32bit', 'WindowsPE')
('32bit', 'WindowsPE')

ENH: add google bigquery interface

Use [pandas.io.gbq](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html#pandas.read_gbq)

First you will need to install the API client: pip install --upgrade google-api-python-client
Here is a working query:

# load keys;  requires you to be registered
keys = json.load(open('/Users/linwood/Desktop/keysforapps/apikeys.txt'))

# setup google creds; not sure if this is required yet; but you need to do it once to authorize the api from your python ecosystem
from apiclient.discovery import build
service = build('bigquery', 'v2', developerKey=keys['google']['apikey']+"2")

# load query in proper SQL syntax as string
from pandas.io import gbq
q="""
SELECT MonthYear,count(*)c,count(IF(Actor1Code LIKE 'MUS',1,null)) c_up
FROM [gdelt-bq.full.events] WHERE EventRootCode = '19'
GROUP BY MonthYear ORDER BY MonthYear;"""


# run the query
df = gbq.read_gbq(q, project_id=<projectid>)
[out]:
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 461 rows.

Total time taken 0.75 s.
Finished at 2017-05-30 10:26:21

Cannot run sample code for GDELT v2

Hi, when I run the sample code provided for v2, I receive the following error. v1 works fine. Please help me understand why this could be happening. Thank you.

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "", line 1, in
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 288, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "main.py", line 11, in
results = gd2.Search(['2016 11 01'],table='events',coverage=True)
File "/opt/anaconda3/lib/python3.9/site-packages/gdelt/base.py", line 635, in Search
pool = Pool(processes=cpu_count())
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 212, in init
self._repopulate_pool()
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/opt/anaconda3/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/opt/anaconda3/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/anaconda3/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/opt/anaconda3/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/opt/anaconda3/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
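The fix the error message points to is guarding the Search call with the standard main-module idiom, for example:

import gdelt

def main():
    gd2 = gdelt.gdelt(version=2)
    results = gd2.Search(['2016 11 01'], table='events', coverage=True)
    print(len(results))

if __name__ == '__main__':
    # Required on platforms that spawn (rather than fork) worker processes.
    main()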

AssertionError: group argument must be None for now while querying gkg over a time period

When I try to search over a time period, I get an AssertionError. Interestingly, it works when I run it with only one date, e.g. gd.Search('2016 10 19',coverage=True,table='gkg')

What I queried:
%time results = gd.Search(['2016 10 19','2016 10 22'],coverage=True,table='gkg')

The error:

File :1

File [c:\Users\\anaconda3\envs\myenv\Lib\site-packages\gdelt\base.py:634](file:///C:/Users//anaconda3/envs/myenv/Lib/site-packages/gdelt/base.py:634), in gdelt.Search(self, date, table, coverage, translation, output, queryTime, normcols)
    630     downloaded_dfs = list(pool.imap_unordered(eventWork,
    631                                               self.download_list))
    632 else:
--> 634     pool = NoDaemonProcessPool(processes=cpu_count())
    635     downloaded_dfs = list(pool.imap_unordered(_mp_worker,
    636                                               self.download_list,
    637                                               ))
    638 pool.close()

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:215](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:215), in Pool.__init__(self, processes, initializer, initargs, maxtasksperchild, context)
    213 self._processes = processes
    214 try:
--> 215     self._repopulate_pool()
    216 except Exception:
    217     for p in self._pool:

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:306](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:306), in Pool._repopulate_pool(self)
    305 def _repopulate_pool(self):
--> 306     return self._repopulate_pool_static(self._ctx, self.Process,
    307                                         self._processes,
    308                                         self._pool, self._inqueue,
    309                                         self._outqueue, self._initializer,
    310                                         self._initargs,
    311                                         self._maxtasksperchild,
    312                                         self._wrap_exception)

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\pool.py:322](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/pool.py:322), in Pool._repopulate_pool_static(ctx, Process, processes, pool, inqueue, outqueue, initializer, initargs, maxtasksperchild, wrap_exception)
    318 """Bring the number of pool processes up to the specified number,
    319 for use after reaping workers which have exited.
    320 """
    321 for i in range(processes - len(pool)):
--> 322     w = Process(ctx, target=worker,
    323                 args=(inqueue, outqueue,
    324                       initializer,
    325                       initargs, maxtasksperchild,
    326                       wrap_exception))
    327     w.name = w.name.replace('Process', 'PoolWorker')
    328     w.daemon = True

File [c:\Users\\anaconda3\envs\myenv\Lib\multiprocessing\process.py:82](file:///C:/Users//anaconda3/envs/myenv/Lib/multiprocessing/process.py:82), in BaseProcess.__init__(self, group, target, name, args, kwargs, daemon)
...
---> 82     assert group is None, 'group argument must be None for now'
     83     count = next(_process_counter)
     84     self._identity = _current_process._identity + (count,)

AssertionError: group argument must be None for now

Extract all locations from the gkg table

As a geospatial analyst,
I need to extract and classify all locations from the knowledge graph,
so that I can easily extract locations at the country, state, or city level.
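A sketch of parsing the GKG V1 locations field, assuming the semicolon-delimited block format from the GDELT codebook (the exact column name depends on the table version and normcols):

LOCATION_TYPES = {'1': 'country', '2': 'us_state', '3': 'us_city',
                  '4': 'world_city', '5': 'world_state'}

def parse_locations(field):
    # Each block: Type#FullName#CountryCode#ADM1Code#Lat#Long#FeatureID
    records = []
    for block in str(field).split(';'):
        parts = block.split('#')
        if len(parts) >= 6 and parts[0] in LOCATION_TYPES:
            records.append({'level': LOCATION_TYPES[parts[0]],
                            'name': parts[1],
                            'lat': parts[4],
                            'lon': parts[5]})
    return records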

FEATURE: Calculate day event was added

The events 2.0 codebook describes the fraction date. Here is code to convert the fraction date to the approximate date when the event happened, assuming a fraction date of 2020.2438.

import datetime

# The fractional part (2438 out of 9999) is the fraction of the year elapsed,
# so convert it to days and add it to January 1 of the year.
datetime.datetime(day=1, month=1, year=2020) + datetime.timedelta(days=int(2438 / 9999 * 365))

Running the example from readme.md failed

# GDELT 1.0 queries
import gdelt

# version 1 queries
gd1 = gdelt.gdelt(version=1)

# pull single day, gkg table
results = gd1.Search('2016 Nov 01', table='gkg')
print(len(results))

# pull events table, range, output to json format
results = gd1.Search(['2016 Oct 31', '2016 Nov 2'], coverage=True, table='events')
print(len(results))

++++++++++++++++++++++++++++++++++++++++++++++++++++

~/ub16_prj % python demogdelt.py

187291
187291
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users//ub16_prj/demogdelt.py", line 13, in
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
File "/usr/local/lib/python3.8/site-packages/gdelt/base.py", line 629, in Search
pool = Pool(processes=cpu_count())
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

187291

Add "get.data" function to download master list

This will reduce the load time and run time of the search function. Right now, for GDELT version 2.0, a single-day query takes 45-50 s. With this new functionality, we'll only make calls for the last-15-minute query or the historical get-data master list.
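A sketch of pulling the version 2.0 master file list directly, assuming the published masterfilelist.txt endpoint (the file is large, so the first call is slow):

import requests

# Each line of the master list is: <size> <md5> <url>
resp = requests.get('http://data.gdeltproject.org/gdeltv2/masterfilelist.txt',
                    timeout=120)
urls = [parts[2]
        for parts in (line.split() for line in resp.text.splitlines())
        if len(parts) == 3]
# Filter to a single day's events exports, e.g. 2016-11-01.
day_urls = [u for u in urls if '/20161101' in u and '.export.' in u]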

BUG: Add exception handling for no data returned

gdeltPyR returns a non-intuitive error when no data is returned for a single 15-minute data pull. Need to add exception handling to make it clear to the user that no data was returned; right now it looks like gdeltPyR is broken.

Example to recreate

import gdelt
gd = gdelt.gdelt()
a=gd.Search('2017 July 27')

[Out]:
...
  File "/Users/linwood/projects/gdeltPyR/gdelt/base.py", line 597, in Search
    if len(results.columns) == 57:
AttributeError: 'NoneType' object has no attribute 'columns'
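A sketch of the guard that would make this explicit instead of failing on results.columns:

def check_results(results):
    # Fail loudly when a 15-minute pull returns nothing.
    if results is None:
        raise ValueError('GDELT returned no data for the requested date.')
    return results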

DOC: Add markdown file on contributing to `gdeltPyR`

  • Use the pandas contributing document as a guide.
  • Define versioning logic
  • Implement release plans with a group of features to add before the version number is updated
  • Explain how to set up dev environment

Early contributing guidance:

  • I'm using http://semver.org/ and this Stack Overflow post as a model for versioning; I'm using a four-number scheme (0.0.0.0):

    • major version (changes when all planned features are added)
    • minor version (changes when new classes or modules are added that change the results or analysis of the GDELT data returned)
    • minor-minor version (changes with smaller enhancements, such as classes or modules that just return unaltered data, new parameters to existing classes/modules, etc.)
    • bug fixes - the last number is the bug-fix count for the current build; no changes to existing functionality, but it fixes a MAJOR bug that stops the entire module from working. Simple little bug fixes don't get counted. It resets to zero on a minor-minor change and only counts bugs, so with no bugs it stays at zero. This number will eventually be dropped once the unit-test suite has 80% coverage.
  • Small one- or two-line entry in CHANGES (gdeltPyR --> CHANGES). Just a date line and a description that says something like "added support for translated GDELT data"; you can add your GitHub username if you want, too.

  • Add a short bullet to README.md and README.rst (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

This looks good. Thanks for adding the unittests.

Administrative changes before any merge.

  • Small one- or two-line entry in CHANGES explaining what you did (gdeltPyR --> CHANGES). Just a date line and a description that says something like "Added support for translated GDELT data"; you can add your GitHub username if you want, too.

  • Add a short bullet to README.md and README.rst (gdeltPyR --> README.md (rst)). This just announces the new feature on the first page.

  • Add an issue that defines the bug or feature you plan to work on. Reference the issue in your commits, and close it if your commit addresses the issue, whether it's an enhancement, bug fix, documentation update, etc.

  • (Optional) Add your name or GitHub username (or both) to AUTHORS.rst (gdeltPyR --> AUTHORS.rst). Keep track of all contributors to show the power of open source. Feel free to add your country as well.

Travis builds are currently broken for >=3.6

0.36s$ pytest
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov-repo=term-missing
  inifile: /home/travis/build/linwoodc3/gdeltPyR/setup.cfg
  rootdir: /home/travis/build/linwoodc3/gdeltPyR
The command "pytest" exited with 4.
before_cache
0.01s$ rm -f $HOME/.cache/pip/log/debug.log
cache.2
store build cache

NameError: global name 'p' is not defined

Traceback (most recent call last):
File "D:\XXX\coding\gdelt\gdeltPyR.py", line 12, in
results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
File "C:\Users\XXX\Anaconda2\lib\site-packages\gdelt\base.py", line 290, in Search
p
NameError: global name 'p' is not defined
