
wikipedia-api's Introduction

Wikipedia API

Wikipedia-API is an easy-to-use Python wrapper for Wikipedia's API. It supports extracting texts, sections, links, categories, translations, and more from Wikipedia. The documentation provides code snippets for the most common use cases.


Installation

This package requires at least Python 3.8 because it uses IntEnum.

pip3 install wikipedia-api

Usage

The goal of Wikipedia-API is to provide a simple and easy-to-use API for retrieving information from Wikipedia. Below are examples of common use cases.

Importing

import wikipediaapi

How To Get Single Page

Getting a single page is straightforward. You have to initialize a Wikipedia object and ask for a page by its name. To initialize it, you have to provide:

  • user_agent to identify your project. Please follow the recommended format.
  • language to specify the language mutation. It has to be one of the supported languages.
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')

page_py = wiki_wiki.page('Python_(programming_language)')

How To Check If Wiki Page Exists

To check whether a page exists, you can use the function exists.

page_py = wiki_wiki.page('Python_(programming_language)')
print("Page - Exists: %s" % page_py.exists())
# Page - Exists: True

page_missing = wiki_wiki.page('NonExistingPageWithStrangeName')
print("Page - Exists: %s" %     page_missing.exists())
# Page - Exists: False

How To Get Page Summary

The class WikipediaPage has a property summary, which returns a description of the Wiki page.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')
page_py = wiki_wiki.page('Python_(programming_language)')

print("Page - Title: %s" % page_py.title)
# Page - Title: Python (programming language)

print("Page - Summary: %s" % page_py.summary[0:60])
# Page - Summary: Python is a widely used high-level programming language for

How To Get Page URL

WikipediaPage has two properties with the URL of the page: fullurl and canonicalurl.

print(page_py.fullurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

print(page_py.canonicalurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

How To Get Full Text

To get the full text of a Wikipedia page, use the property text, which constructs the text of the page as a concatenation of the summary and the sections with their titles and texts.

wiki_wiki = wikipediaapi.Wikipedia(
    user_agent='MyProjectName ([email protected])',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)

p_wiki = wiki_wiki.page("Test 1")
print(p_wiki.text)
# Summary
# Section 1
# Text of section 1
# Section 1.1
# Text of section 1.1
# ...


wiki_html = wikipediaapi.Wikipedia(
    user_agent='MyProjectName ([email protected])',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
p_html = wiki_html.page("Test 1")
print(p_html.text)
# <p>Summary</p>
# <h2>Section 1</h2>
# <p>Text of section 1</p>
# <h3>Section 1.1</h3>
# <p>Text of section 1.1</p>
# ...

How To Get Page Sections

To get all top-level sections of a page, use the property sections. It returns a list of WikipediaPageSection objects, so you have to use recursion to get all subsections.

def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)


print_sections(page_py.sections)
# *: History - Python was conceived in the late 1980s,
# *: Features and philosophy - Python is a multi-paradigm programming l
# *: Syntax and semantics - Python is meant to be an easily readable
# **: Indentation - Python uses whitespace indentation, rath
# **: Statements and control flow - Python's statements include (among other
# **: Expressions - Some Python expressions are similar to l

How To Get Page Section By Title

To get the last section of a page with a given title, use the function section_by_title. It returns the last WikipediaPageSection with that title.

section_history = page_py.section_by_title('History')
print("%s - %s" % (section_history.title, section_history.text[0:40]))

# History - Python was conceived in the late 1980s b

How To Get All Page Sections By Title

To get all sections of a page with a given title, use the function sections_by_title. It returns a list of all WikipediaPageSection objects with that title.

page_1920 = wiki_wiki.page('1920')
sections_january = page_1920.sections_by_title('January')
for s in sections_january:
    print("* %s - %s" % (s.title, s.text[0:40]))

# * January - January 1
# Polish–Soviet War in 1920: The
# * January - January 2
# Isaac Asimov, American author
# * January - January 1 – Zygmunt Gorazdowski, Polish

How To Get Page In Other Languages

If you want to get other translations of a given page, use the property langlinks. It is a map where the key is a language code and the value is a WikipediaPage.

def print_langlinks(page):
    langlinks = page.langlinks
    for k in sorted(langlinks.keys()):
        v = langlinks[k]
        print("%s: %s - %s: %s" % (k, v.language, v.title, v.fullurl))

print_langlinks(page_py)
# af: af - Python (programmeertaal): https://af.wikipedia.org/wiki/Python_(programmeertaal)
# als: als - Python (Programmiersprache): https://als.wikipedia.org/wiki/Python_(Programmiersprache)
# an: an - Python: https://an.wikipedia.org/wiki/Python
# ar: ar - بايثون: https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86
# as: as - পাইথন: https://as.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8

page_py_cs = page_py.langlinks['cs']
print("Page - Summary: %s" % page_py_cs.summary[0:60])
# Page - Summary: Python (anglická výslovnost [ˈpaiθtən]) je vysokoúrovňový sk

How To Get Links

If you want to get all links to other wiki pages from a given page, you need to use the property links. It is a map where the key is a page title and the value is a WikipediaPage.

def print_links(page):
    links = page.links
    for title in sorted(links.keys()):
        print("%s: %s" % (title, links[title]))

print_links(page_py)
# 3ds Max: 3ds Max (id: ??, ns: 0)
# ?:: ?: (id: ??, ns: 0)
# ABC (programming language): ABC (programming language) (id: ??, ns: 0)
# ALGOL 68: ALGOL 68 (id: ??, ns: 0)
# Abaqus: Abaqus (id: ??, ns: 0)
# ...

How To Get Page Categories

If you want to get all categories a page belongs to, use the property categories. It is a map where the key is a category title and the value is a WikipediaPage.

def print_categories(page):
    categories = page.categories
    for title in sorted(categories.keys()):
        print("%s: %s" % (title, categories[title]))


print("Categories")
print_categories(page_py)
# Category:All articles containing potentially dated statements: ...
# Category:All articles with unsourced statements: ...
# Category:Articles containing potentially dated statements from August 2016: ...
# Category:Articles containing potentially dated statements from March 2017: ...
# Category:Articles containing potentially dated statements from September 2017: ...

How To Get All Pages From Category

To get all pages from a given category, use the property categorymembers. It returns all members of the given category. You have to implement recursion and deduplication yourself; a deduplication sketch follows the example below.

def print_categorymembers(categorymembers, level=0, max_level=1):
    for c in categorymembers.values():
        print("%s: %s (ns: %d)" % ("*" * (level + 1), c.title, c.ns))
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            print_categorymembers(c.categorymembers, level=level + 1, max_level=max_level)


cat = wiki_wiki.page("Category:Physics")
print("Category members: Category:Physics")
print_categorymembers(cat.categorymembers)

# Category members: Category:Physics
# * Statistical mechanics (ns: 0)
# * Category:Physical quantities (ns: 14)
# ** Refractive index (ns: 0)
# ** Vapor quality (ns: 0)
# ** Electric susceptibility (ns: 0)
# ** Specific weight (ns: 0)
# ** Category:Viscosity (ns: 14)
# *** Brookfield Engineering (ns: 0)
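Since the same page can appear under several subcategories, a deduplication pass is needed. A minimal sketch (an addition, not part of the original README) that tracks visited titles in a set:

def collect_category_articles(categorymembers, seen=None, level=0, max_level=1):
    # Track titles we have already visited, because a page can be a member
    # of several subcategories within the same category tree.
    if seen is None:
        seen = set()
    for c in categorymembers.values():
        if c.title in seen:
            continue
        seen.add(c.title)
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            collect_category_articles(c.categorymembers, seen, level + 1, max_level)
    return seen

titles = collect_category_articles(cat.categorymembers)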

How To See Underlying API Call

If you have problems with retrieving data, you can get the URL of the underlying API call. This will help you determine whether the problem is in the library or somewhere else.

import sys

import wikipediaapi
wikipediaapi.log.setLevel(level=wikipediaapi.logging.DEBUG)

# Set handler if you use Python in interactive mode
out_hdlr = wikipediaapi.logging.StreamHandler(sys.stderr)
out_hdlr.setFormatter(wikipediaapi.logging.Formatter('%(asctime)s %(message)s'))
out_hdlr.setLevel(wikipediaapi.logging.DEBUG)
wikipediaapi.log.addHandler(out_hdlr)

wiki = wikipediaapi.Wikipedia(user_agent='MyProjectName ([email protected])', language='en')

page_ostrava = wiki.page('Ostrava')
print(page_ostrava.summary)
# logger prints out: Request URL: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Ostrava&explaintext=1&exsectionformat=wiki


Other Pages

  • API
  • CHANGES
  • DEVELOPMENT
  • wikipediaapi/api

wikipedia-api's People

Contributors

costinsin, deepsource-autofix[bot], dependabot[bot], erayerdin, fjhheras, guillaumedrillaud, martin-majlis, sawatzkylindsey


wikipedia-api's Issues

getting non empty articles for a certain category

I used the code below to get articles in the Physics category:

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "cmdir": "desc",
    "format": "json",
    "list": "categorymembers",
    "action": "query",
    "cmtitle": "Category:Physics",
    "cmlimit": "20"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

PAGES = DATA["query"]["categorymembers"]

for page in PAGES:
    print(page["title"])

And as a result I got this:

X-ray Reconstruction of Moving Morphology
Six Ideas that Shaped Physics
Physicalism
Portal:Physics
Physics
Category:Physics stubs
Category:Works about physics
Category:Physical systems
Category:Physics organizations
Category:Physical modeling
Category:Physics literature
Category:Physics-related lists
Category:History of physics
Category:Physics events
Category:Concepts in physics
Category:Physics awards
Category:Physicists
Category:Subfields of physics
Category:Physics by country

Some are good, but an entry like Category:Physics by country is empty. Is there any way to get real articles, not empty ones like those above?
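One way to skip Category: and Portal: entries is to restrict the query to the main namespace (articles have namespace 0). A sketch reusing the S and URL from the snippet above, with the cmnamespace parameter of the same list=categorymembers API:

PARAMS = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Physics",
    "cmnamespace": "0",  # main namespace only; skips Category: (14) and Portal: (100)
    "cmlimit": "20",
    "format": "json",
}

R = S.get(url=URL, params=PARAMS)
for page in R.json()["query"]["categorymembers"]:
    print(page["title"])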

Can't find feature

Can this API change the language of an already opened page, in order to search for information in different languages?
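There is no in-place language switch, but the langlinks property from the README effectively does this: it maps language codes to the corresponding WikipediaPage. A minimal sketch:

import wikipediaapi

wiki_en = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')
page_en = wiki_en.page('Python (programming language)')

# Jump to the German version of the same page, if one exists.
page_de = page_en.langlinks.get('de')
if page_de is not None:
    print(page_de.summary[0:60])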

Section text is blank

In the code below:

import wikipediaapi

wikipedia = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)

page = wikipedia.page("1900")
section = page.sections[1].text
print(section)

The program does not print anything, yet when I remove the .text, the section contents are printed. Is this an issue and, if not, how would I convert page.sections[1] to a string?
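One possible explanation is that a section's .text holds only its own body, so it is empty when all the content lives in subsections. A recursive helper (a sketch, not part of the library) can flatten a section to a string:

def section_to_string(section):
    # A section's .text excludes its subsections, so concatenate recursively.
    parts = [section.text]
    for sub in section.sections:
        parts.append(sub.title)
        parts.append(section_to_string(sub))
    return "\n".join(p for p in parts if p)

print(section_to_string(page.sections[1]))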

grabbing a fixed number of pages in a given category

I'm trying to make a request to get the pages in a given category, but only a fixed number of them.
I know that to get all of them one can do the following:

wiki_wiki = wikipediaapi.Wikipedia('en')
cat = wiki_wiki.page("Category:Featured articles")
request = wiki_wiki.categorymembers(page = cat)
print(request)

This works; however, I'd like to pass more parameters to the query, in particular to limit the number of pages returned. I tried changing line 3 to:

request = wiki_wiki.categorymembers(page = cat, cmlimit = 10)

It simply seems to ignore the cmlimit parameter. I tried passing it as a string too; that doesn't work either.
Could you please help me? I also think it would be valuable to add this as an example in the documentation.
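As a workaround, the raw MediaWiki API honors cmlimit directly. A sketch with requests (assuming the library itself does not forward extra categorymembers parameters):

import requests

params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Featured articles",
    "cmlimit": "10",  # honored by the raw API
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
for member in data["query"]["categorymembers"]:
    print(member["title"])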

Way to get plain Wikitext of page?

I'm feeling really dense... but once I've got a WikipediaPage, how do I get the wikitext for it? I can only see ways of getting the text processed into plain text or HTML.

(sorry for such a dumb question but....)
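Wikipedia-API exposes plain-text and HTML extracts only; for raw wikitext one can query the MediaWiki revisions API directly. A sketch with requests:

import requests

params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",   # request the page content itself
    "rvslots": "main",     # the main slot holds the wikitext
    "titles": "Python (programming language)",
    "format": "json",
    "formatversion": "2",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
wikitext = data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
print(wikitext[:200])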

I'm new to this and I'm programming a voice assistant, so I thought it would be good to use a Wikipedia API, since this one looks extensive and well reviewed. I decided to try it, but I don't know how to do it, since I didn't understand the documentation well.
I don't know what to do, help :(

Getting error using the API, worked fine couple of days ago.

Whenever I try to use the API I now get the error JSONDecodeError: Expecting value.

Example when using page_py.exists()

page_py
Out[4]: Python_(programming_language) (id: ??, ns: 0)

page_py.exists()
Traceback (most recent call last):

  File "C:\Users\Eren\anaconda3\lib\site-packages\requests\models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)

  File "C:\Users\Eren\anaconda3\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)

  File "C:\Users\Eren\anaconda3\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())

  File "C:\Users\Eren\anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting value


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\Eren\AppData\Local\Temp\ipykernel_26872\3834438054.py", line 1, in <module>
    page_py.exists()

  File "C:\Users\Eren\anaconda3\lib\site-packages\wikipediaapi\__init__.py", line 875, in exists
    return bool(self.pageid != -1)

  File "C:\Users\Eren\anaconda3\lib\site-packages\wikipediaapi\__init__.py", line 839, in __getattr__
    self._fetch(call)

  File "C:\Users\Eren\anaconda3\lib\site-packages\wikipediaapi\__init__.py", line 1029, in _fetch
    getattr(self.wiki, call)(self)

  File "C:\Users\Eren\anaconda3\lib\site-packages\wikipediaapi\__init__.py", line 297, in info
    raw = self._query(page, params)

  File "C:\Users\Eren\anaconda3\lib\site-packages\wikipediaapi\__init__.py", line 494, in _query
    return r.json()

  File "C:\Users\Eren\anaconda3\lib\site-packages\requests\models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Expecting value

Newline / Space missing from .summary attribute

The .summary attribute of a page does not include a newline or space after a sentence that ends in hard brackets [ ] on the Wikipedia page.

Example:

import wikipediaapi as wiki_api

wiki = wiki_api.Wikipedia(language="en")
query = "planet"
page = wiki.page(query)
text = page.summary
print(text[:400])

which queries the article: https://en.wikipedia.org/wiki/Planet
and returns:
A planet is an astronomical body orbiting a star or stellar remnant that is massive enough to be rounded by its own gravity, is not massive enough to cause thermonuclear fusion, and – according to the International Astronomical Union but not all planetary scientists – has cleared its neighbouring region of planetesimals.The term planet is ancient, with ties to history, astrology, science, mytholog

Observe the lack of space between "planetesimals." and "The" in the first paragraph, which ends with "planetesimals.[b][1][2]" on the web page.
Whilst later in the summary, at
print(text[1200:1500])
There is a space between "discovered)." and "Ptolemy" as expected:
the scientific community are no longer viewed as such under the current definition. Some of the excluded objects include Ceres, Pallas, Juno, Vesta (all of which are objects in the solar asteroid belt), and Pluto (the first trans-Neptunian object discovered). Ptolemy thought that the planets orbite

Please let me know if any additional information is needed to fix this, or if there is a workaround.

Inline footnote references cause malformatted text

If a paragraph (not a sentence or a word) ends with an inline footnote reference, the corresponding .text attribute will be missing a white-space there.

For example

wiki = wikipediaapi.Wikipedia('en')
page = wiki.page('Ross Ching')
for w in page.text.split():
    if '.' in w[:-1]:
        print(w)

prints

internet.A
outlets.Ching
bubbles.In
Jr.,
Wired.com,
Mashable.In
Wired.com

because internet, outlets, bubbles, Mashable are words at the end of a paragraph that has a footnote at the end.

Preferably, there would be a white-space after the . so that, for example, words could be extracted reliably from the text.
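For both this issue and the previous one, a crude post-processing sketch is to insert a space after a period that is immediately followed by an uppercase letter. Note this is a heuristic, not a fix: it will also split legitimate tokens such as "U.S.".

import re

# Heuristic only: "planetesimals.The" -> "planetesimals. The",
# but "U.S." also becomes "U. S.".
fixed = re.sub(r'\.(?=[A-Z])', '. ', page.text)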

Unable to install

Hey, so I tried to install your API, but Python keeps throwing an error. Google told me to do pip install --upgrade setuptools, but that didn't help either. Any other solutions? I'm on Python 3.6.

page.sections is missing some sections

Somehow the third-level headings on this page get fetched, but not the second-level headings? It has no first-level headings, so maybe that throws it off. Or maybe I'm misunderstanding how sections is supposed to work.

Code (using your example function):

import wikipediaapi
def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)

wikipedia = wikipediaapi.Wikipedia('en')
page = wikipedia.page("Wikipedia:Spoken_articles")
print_sections(page.sections)

Output, which is missing, e.g., "Art, architecture and archaeology" and "Biology", although it does show the level-three headings under those, "Art, architecture and archaeology biographies" and "Biology biographies":

*: Art, architecture and archaeology biographies - 
**: Animals - 
**: Plants - 
**: Biology biographies - 
**: Business, economics, and finance biographies - 
**: Companies - 
**: Numismatics - 
**: Culture and society biographies - 
**: Human sexuality - 
**: Internet culture - 
**: Institutions - 
**: Africa - 
**: The Americas - 
***: The Caribbean and South America - 
***: North America - 
**: Asia - 
***: The Middle East - 
**: Europe - 
**: Oceania - 
**: Films - 
**: Television shows - 
**: Episodes of television - 
**: Media fictional characters - 
**: Media biographies - 
**: Albums - 
**: Songs - 
**: Music biographies - 
**: Political biographies - 
**: Sport and recreation biographies - 
Wikipedia:WikiProject Spoken Wikipedia/

[Bug] Apostrophes

json.decoder.JSONDecodeError

I tried to extract a Wikipedia page summary using the following code:

def get_wiki_page(search_string):
    page_data = ""
    page_py = wiki_wiki.page(search_string)
    if page_py:
        title = page_py.title
        summary = page_py.summary
        page_data = title + "\t" + summary
    else:
        pass

I got the following error:

    self._fetch('extracts')
  File "/home/my_space/.local/lib/python3.6/site-packages/wikipediaapi/__init__.py", line 1148, in _fetch
    getattr(self.wiki, call)(self)
  File "/home/my_space/.local/lib/python3.6/site-packages/wikipediaapi/__init__.py", line 287, in extracts
    used_params
  File "/home/my_space/.local/lib/python3.6/site-packages/wikipediaapi/__init__.py", line 585, in _query
    return r.json()
  File "/home/my_space/anaconda3/envs/jaBase/lib/python3.6/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/my_space/anaconda3/envs/jaBase/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/my_space/anaconda3/envs/jaBase/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/my_space/anaconda3/envs/jaBase/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Please help!

Incorrect response to a simple page() query

I noticed an issue where a wikipedia.page() search returns the "Alboran Island" page when I attempt to get the "Algorand" page.

I'm fairly certain the Algorand page should be retrievable with page("Algorand").
https://en.wikipedia.org/wiki/Algorand
Are there situations where the URL doesn't match the page name with the API?


Categories are not being fetched for a category page

This code prints no categories for the page in question (but it does contain categories):

import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia('en')

distsPage = wiki_wiki.page("Category:Continuous Distributions")
print("Category members: %s" % distsPage.title)

print(distsPage.categorymembers)

The output is:

Category members: Category:Continuous Distributions
{}
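One thing worth checking in such cases (a guess, not a confirmed diagnosis): category titles are case-sensitive after the first letter, so a miscapitalized title silently yields an empty result. Verifying existence first makes this visible:

distsPage = wiki_wiki.page("Category:Continuous Distributions")
# A nonexistent (e.g. miscapitalized) category page reports exists() == False
# and has no members.
print(distsPage.exists())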

KeyError: 'langlinks'

Hi,

It looks like there is a KeyError when trying to access langlinks from a page without external links to other languages.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('it')
page_py = wiki_wiki.page('ago_crinale')
page_py.exists()
# True
langlinks = page_py.langlinks

I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wikipediaapi/__init__.py", line 949, in langlinks
    self._fetch('langlinks')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wikipediaapi/__init__.py", line 1017, in _fetch
    getattr(self.wiki, '_' + call)(self)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wikipediaapi/__init__.py", line 322, in _langlinks
    return self._build_langlinks(v, page)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wikipediaapi/__init__.py", line 561, in _build_langlinks
    for langlink in extract['langlinks']:
KeyError: 'langlinks'

An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 655; allowed: 300)

Traceback (most recent call last):
  File "C:\Users\Aluno\Desktop\PowerPoints\main.py", line 26, in <module>
    ctg = wiki.page(pages).categories
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\wikipedia.py", line 270, in page
    results, suggestion = search(title, results=1, suggestion=True)
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\util.py", line 28, in __call__
    ret = self._cache[key] = self.fn(*args, **kwargs)
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\wikipedia.py", line 109, in search
    raise WikipediaException(raw_results['error']['info'])
wikipedia.exceptions.WikipediaException: An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 655; allowed: 300)". Please report it on GitHub!

Add property 'extracts' with 'exsentences=2'

This is nicer than summary, IMO, because you get to specify how many sentences you want. Jemisin has a really long summary, so you can test with numbers greater than 2.

More readable in browser:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&explaintext&exsentences=2&titles=N._K._Jemisin

JSON version:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&explaintext&exsentences=2&format=json&titles=N._K._Jemisin
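A sketch of the proposed behavior using the raw API (exsentences is a TextExtracts parameter that wikipedia-api does not currently expose):

import requests

params = {
    "action": "query",
    "prop": "extracts",
    "exintro": "1",
    "explaintext": "1",
    "exsentences": "2",  # number of sentences to return
    "titles": "N._K._Jemisin",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()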

Page HTML does not include hyperlinks and lists

code used:

import wikipediaapi
wiki_html = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.HTML)
page_html = wiki_html.page("List_of_anime_distributed_in_the_United_States")
print(page_html.text)

example part of result:

Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>

<h2>1990s</h2>
<p>The 1990s, was the period in which anime reached mainstream popularity in the U.S. market

example part of actual html:

Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>
<div class="div-col columns column-width" style="-moz-column-width: 22em; -webkit-column-width: 22em; column-width: 22em;">
<ul><li><i><a href="/wiki/Huckleberry_no_B%C5%8Dken" title="Huckleberry no Bōken">Adventures of Huckleberry Finn</a></i></li>
<li><i><a href="/wiki/Pinocchio:_The_Series#English_versions" title="Pinocchio: The Series">The Adventures of Pinocchio</a></i></li>

As you can see, this is not the true HTML: many tags, such as hyperlinks, and the entire list section are missing.

This makes the HTML part of this API almost unusable.
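The extracts come from an endpoint that strips lists, hyperlinks, and tables, as the output above shows. For fully rendered HTML, the action=parse API can be called directly; a sketch with requests (not part of wikipedia-api):

import requests

params = {
    "action": "parse",
    "page": "List_of_anime_distributed_in_the_United_States",
    "prop": "text",  # fully rendered HTML, including lists and hyperlinks
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
html = data["parse"]["text"]["*"]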

How do I avoid getting links that are not related to the page content?

Hi,
I was wondering whether I'm able to get only the links that are in the page content.
For example, when I get the page about "Joseph Black", I get 108 links mapped in a dictionary; among those links I find links to pages like doi, which are related not to the content of the page but to the references section.
How do I avoid this situation? Using the same example, how do I get, for the "Joseph Black" page, only the links between the beginning of the content and the "See also" section (included)?
I hope there is an easy way to do it.
Thanks
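There is no built-in "content links only" switch. One heuristic sketch is to keep only links whose titles occur in the body text up to and including the "See also" section; this is an approximation, since page.links is a flat map:

import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')
page = wiki.page('Joseph Black')

# Collect the body text: summary plus sections up to "See also" (included).
body = page.summary
for s in page.sections:
    body += "\n" + s.title + "\n" + s.text
    if s.title == "See also":
        break

# Keep only links whose titles actually appear in the body text.
content_links = {t: p for t, p in page.links.items() if t in body}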

In Spanish Wikipedia: 104 is not a valid Namespace

I sometimes see the following error:

ValueError: 104 is not a valid Namespace

It seems that namespace 104 is not used in the 'en' Wikipedia, but it is used in other languages (for example, it is an annex to an article in 'es' wikipedia). The same is true for 102.

Adding entries in "wikipediaapi/__init__.py" (line 85) solves this error, but maybe there is a more general solution?

Clarifying What is Included in Backlinks

I am testing the results when looking at backlinks, and I'm confused about what warrants inclusion. For example, when I run the following code in Python:

jay = wiki.page('Jay-Z')
wiki.backlinks(jay,blnamespace=0,blfilterredir='nonredirects')

It lists LL Cool J as a backlink, yet when I go to LL Cool J's Wikipedia page, I can't find any link to Jay-Z's page. What exactly does it mean to be a backlink? Is it possible to get only direct links?

Connection doesn't support proxy

I set the environment variable HTTPS_PROXY, but it doesn't work.
When I pass proxies = {'https': 'proxy.example.com:8080'} to requests.get directly, it works.
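Assuming the library forwards extra keyword arguments to requests (worth verifying for your version), the proxies mapping can be passed straight to the constructor; a sketch:

import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent='MyProjectName ([email protected])',
    language='en',
    proxies={'https': 'http://proxy.example.com:8080'},  # assumption: forwarded to requests.get
)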

AttributeError: 'module' object has no attribute 'Wikipedia' / bad magic number

Hi @martin-majlis!
First of all, thank you for this API. It seems like a powerful and useful API, but I am having a very strange problem... When I import the main module and try to get a page, I get the following error:

AttributeError: 'module' object has no attribute 'Wikipedia'

I import the module with the name you show in the README.md (import wikipediaapi). What I wrote to test your API is simple and shouldn't result in that error:

import wikipediaapi

wiki = wikipediaapi.Wikipedia('es')

page = wiki.page('Wikipedia')

Regards,
Iván

[Feature Request] - get inline references

It would be nice to know if a line in a paragraph has a reference at its end.
For example, the [6] in:
I wanted to use a lot of the language that the real guy actually used when I heard him, because it was more real....[6]

Feature - Persistent HTTP connections

Is there a way to use persistent HTTP connections in this library? That could make checking Wikipedia faster. Thanks for the wonderful wrapper.

Hidden categories

The categories property, besides 'normal' categories, also returns hidden categories, which may not be very informative. I suppose it would be better to remove hidden categories with clshow, or to make it not a property but a method that takes a flag indicating whether to include hidden categories.
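A sketch of the clshow idea against the raw API (clshow=!hidden is a MediaWiki parameter for prop=categories; wikipedia-api itself would need to pass it through):

import requests

params = {
    "action": "query",
    "prop": "categories",
    "clshow": "!hidden",  # exclude hidden maintenance categories
    "cllimit": "max",
    "titles": "Python (programming language)",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()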

Exclude Navigation Box from Backlinks

At the bottom of many Wikipedia pages there are navigation boxes that obscure what actually backlinks to an article. For example, because Jay-Z and LL Cool J are both listed in the Grammy Award for Best Rap Solo Performance page, and that appears in both of their navigation boxes, LL Cool J is listed as a backlink to Jay-Z even though Jay-Z isn't mentioned in the body of the LL Cool J article. Is there a way to exclude backlinks from navigation boxes at the bottom of articles?
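There is no API-level switch for this. A slow heuristic sketch, assuming backlinks is exposed as a map from title to WikipediaPage: keep only backlinks whose article body (which, as a plain-text extract, omits navboxes) actually mentions the target.

import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')
jay = wiki.page('Jay-Z')

# Navbox-only backlinks typically never mention the target in the body text,
# so fetch each candidate and check (slow: one extra request per backlink).
direct = [title for title, p in jay.backlinks.items() if 'Jay-Z' in p.text]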


Does not "see" html tables

The page's text does not include HTML tables when ExtractFormat is set to HTML.
For instance, the filmography tables for actors and actresses do not appear in the page's text, even though they are HTML elements that should in principle be something we can see.

[Feature Request] Boolean denoting disambiguation page

It would be useful to be able to check whether a given page is a disambiguation page linking you to more specific pages. Right now, disambiguation pages are treated as proper results when there should probably be a boolean denoting otherwise. It would allow programs that look up information about a specific topic an easy way to know they need to spider the links and determine which pages are or are not relevant.
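A common heuristic, assuming English Wikipedia's category conventions rather than any library feature: check the page's categories for the disambiguation maintenance categories.

import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyProjectName ([email protected])', 'en')

def is_disambiguation(page):
    # English Wikipedia tags disambiguation pages with categories such as
    # "Category:All disambiguation pages" / "Category:Disambiguation pages".
    return any('disambiguation pages' in title.lower() for title in page.categories)

print(is_disambiguation(wiki.page('Mercury')))  # "Mercury" is a disambiguation page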
