petermr / amilib Goto Github PK

View Code? Open in Web Editor NEW

1.0 3.0 0.0 106.21 MB

Python library of `ami` software especially NLP, HTML, downloading and related convenience utilities

License: Apache License 2.0

Python 0.94% XSLT 0.08% HTML 98.98% Shell 0.01% CSS 0.01%

amilib's Introduction

amilib

library from pyamihtml

mainly for amiclimate to start with

tests run 2024-05-24

amilib's People

Contributors

Stargazers

Watchers

amilib's Issues

Failed tests with latest amilib

system information

Windows 10, python 3.9

checked out latest `amilib`

Run `pytest`

Output summary

=================================================================== short test summary info ====================================================================
FAILED test/test_headless.py::MiscTest::test_geolocate_GEO - geopy.exc.GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443):...
FAILED test/test_pdf.py::PDFCharacterTest::test_pdfplumber_full_page_info_LOWLEVEL_CHARS - AssertionError: assert [0, 0, 595.22, 842] == (0, 0, 595.22, 842)
============================================== 2 failed, 145 passed, 71 skipped, 3 warnings in 229.56s (0:03:49) ===============================================

PDF routines fail 6 tests possibly due to pdfplumber upgrade

6 errors in PDF reading

(base) pm286macbook-2:amilib pm286$ python -m pytest
===================================== test session starts ======================================
platform darwin -- Python 3.8.3, pytest-6.2.5, py-1.9.0, pluggy-0.13.1
rootdir: /Users/pm286/workspace/amilib
plugins: cov-3.0.0
collected 220 items                                                                            

test/test_file.py ss                                                                     [  0%]
test/test_headless.py ss..sssssss.....s                                                  [  8%]
test/test_html.py ...s.s......s..s.....ssss...s.........s..ssssss..ss.ss..ss............ [ 40%]
....................s....s                                                               [ 52%]
test/test_nlp.py .                                                                       [ 52%]
test/test_pdf.py .ssF.......s.s.s.sssssssFs..F.ss.ss.FsF..Fssssssssss..s....ss           [ 80%]
test/test_pytest.py .                                                                    [ 80%]
test/test_stat.py .                                                                      [ 81%]
test/test_svg.py ...                                                                     [ 82%]
test/test_util.py ss.....s...s...                                                        [ 89%]
test/test_wikidata.py .s...........s.......                                              [ 99%]
test/test_xml.py ..                                                                      [100%]

=========================================== FAILURES ===========================================
____________________ PDFPlumberTest.test_pdfplumber_json_single_page_debug _____________________

self = <test.test_pdf.PDFPlumberTest testMethod=test_pdfplumber_json_single_page_debug>

    def test_pdfplumber_json_single_page_debug(self):
        """creates AmiPDFPlumber and reads pdf and debugs"""
        path = Path(os.path.join(HERE, "resources/pdffill-demo.pdf"))
        assert path.exists, f"{path} should exist"
        ami_pdfplumber = AmiPDFPlumber()
        ami_plumber_json = ami_pdfplumber.create_ami_plumber_json(path)
>       pages = ami_plumber_json.get_ami_json_pages()
E       AttributeError: 'NoneType' object has no attribute 'get_ami_json_pages'

test/test_pdf.py:133: AttributeError
------------------------------------- Captured stdout call -------------------------------------
ERROR open() takes 2 positional arguments but 3 were given for /Users/pm286/workspace/amilib/test/resources/pdffill-demo.pdf
Cannot create PDF /Users/pm286/workspace/amilib/test/resources/pdffill-demo.pdf
_________________________ PDFChapterTest.test_read_ipcc_chapter__debug _________________________

self = <test.test_pdf.PDFChapterTest testMethod=test_read_ipcc_chapter__debug>

    def test_read_ipcc_chapter__debug(self):
        """read multipage document and extract properties
    
        """
        assert IPCC_GLOSSARY.exists(), f"{IPCC_GLOSSARY} should exist"
        max_page = PDFTest.MAX_PAGE
        # max_page = 999999
        options = [WORDS, ANNOTS]
        # max_page = 100  # increase this if yu want more output
    
        for (pdf_file, page_count) in [
            # (IPCC_GLOSSARY, 51),
            (Resources.TEST_IPCC_CHAP06_PDF, 219)
        ]:
            pdf_debug = PDFDebug()
            with pdfplumber.open(pdf_file) as pdf:
                print(f"file {pdf_file}")
                pages = list(pdf.pages)
                assert len(pages) == page_count
                for page in pages[:max_page]:
>                   pdf_debug.debug_page_properties(page, debug=options)

test/test_pdf.py:756: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
amilib/ami_pdf_libs.py:760: in debug_page_properties
    self.print_annots(page)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <amilib.ami_pdf_libs.PDFDebug object at 0x7fbda0561790>
page = <pdfplumber.page.Page object at 0x7fbdd0e66af0>

    def print_annots(self, page):
        """Prints annots
    
        Here's the output of one (it's a hyperlink)
        annot: dict_items(
    [
        ('page_number', 4),
        ('object_type', 'annot'),
        ('x0', 80.75),
        ('y0', 698.85),
        ('x1', 525.05),
        ('y1', 718.77),
        ('doctop', 2648.91),
        ('top', 123.14999999999998),
        ('bottom', 143.06999999999994),
        ('width', 444.29999999999995),
        ('height', 19.91999999999996),
        ('uri', None),
        ('title', None),
        ('contents', None),
        ('data',
            {'BS': {'W': 0},
             'Dest': [<PDFObjRef:7>, /'XYZ', 69, 769, 0],
             'F': 4,
             'Rect': [80.75, 698.85, 525.05, 718.77],
             'StructParent': 3,
             'Subtype': /'Link'
             }
        )
    ]
    )
        and there are 34 (in a TableOfContents) and they work
    
        """
>       n_annot = len(page.annots)
E       AttributeError: 'Page' object has no attribute 'annots'

amilib/ami_pdf_libs.py:958: AttributeError
------------------------------------- Captured stdout call -------------------------------------
file /Users/pm286/workspace/amilib/test/resources/ar6/Chapter06/fulltext.pdf


======page: 1 ===========
W: {'x0': Decimal('149.340'), 'x1': Decimal('170.906'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'WG'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('174.114'), 'x1': Decimal('185.284'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'III'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('188.443'), 'x1': Decimal('260.930'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'contribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('264.089'), 'x1': Decimal('276.531'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'to'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('279.684'), 'x1': Decimal('299.069'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'the'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 318 ['WG', 'III', 'contribution', 'to', 'the'] ...  | 
_ PDFCharacterTest.test_debug_page_properties_chap6_word_count_and_images_data_wg3_old__example _

self = <test.test_pdf.PDFCharacterTest testMethod=test_debug_page_properties_chap6_word_count_and_images_data_wg3_old__example>

    def test_debug_page_properties_chap6_word_count_and_images_data_wg3_old__example(self):
        """debug the old-style IPCC WG3 PDF objects (crude)
        outputs wordcount for page, and any image data.
        Would be better if we knew how to read PDFStream
        """
        maxpage = 9  # images on page 8, and 9
        outdir = Path(AmiAnyTest.TEMP_DIR, "pdf", "ar6", "chap6")
        pdf_debug = PDFDebug()
    
        with pdfplumber.open(Resources.TEST_IPCC_CHAP06_PDF) as pdf:
            pages = list(pdf.pages)
            for page in pages[:maxpage]:
                pdf_debug.debug_page_properties(page, debug=[WORDS, IMAGES], outdir=outdir)
        pdf_debug.write_summary(outdir=outdir)
        print(f"pdf_debug {pdf_debug.image_dict}\n outdir {outdir}")
>       assert maxpage != 9 or pdf_debug.image_dict == {
            ((1397, 779), 143448): (8, (72.0, 523.3), (412.99, 664.64)),
            ((1466, 655), 122016): (8, (72.0, 523.3), (203.73, 405.38)),
            ((1634, 854), 204349): (9, (80.9, 514.25), (543.43, 769.92))
        }
E       AssertionError: assert (9 != 9 or {((Decimal('1...('769.920')))} == {((1397, 779)....43, 769.92))}
E         Differing items:
E         {((1397, 779), 143448): (8, (Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640')))} != {((1397, 779), 143448): (8, (72.0, 523.3), (412.99, 664.64))}
E         {((1634, 854), 204349): (9, (Decimal('80.900'), Decimal('514.250')), (Decimal('543.430'), Decimal('769.920')))} != {((1634, 854), 204349): (9, (80.9, 514.25), (543.43, 769.92))}
E         {((1466, 655), 122016): (8, (Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380')))} != {((1466, 655), 122016): (8, (72.0, 523.3), (203.73, 405.38))}
E         Use -v to get the full diff)

test/test_pdf.py:1342: AssertionError
------------------------------------- Captured stdout call -------------------------------------


======page: 1 ===========
image_dict {}
W: {'x0': Decimal('149.340'), 'x1': Decimal('170.906'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'WG'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('174.114'), 'x1': Decimal('185.284'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'III'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('188.443'), 'x1': Decimal('260.930'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'contribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('264.089'), 'x1': Decimal('276.531'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'to'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('279.684'), 'x1': Decimal('299.069'), 'top': Decimal('69.655'), 'bottom': Decimal('89.059'), 'text': 'the'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 318 ['WG', 'III', 'contribution', 'to', 'the'] ...  | 

======page: 2 ===========
image_dict {}
W: {'x0': Decimal('76.500'), 'x1': Decimal('109.035'), 'top': Decimal('70.539'), 'bottom': Decimal('83.946'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('111.242'), 'x1': Decimal('116.322'), 'top': Decimal('70.539'), 'bottom': Decimal('83.946'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('133.200'), 'x1': Decimal('143.380'), 'top': Decimal('70.539'), 'bottom': Decimal('83.946'), 'text': '44'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('189.900'), 'x1': Decimal('213.280'), 'top': Decimal('70.539'), 'bottom': Decimal('83.946'), 'text': '41-42'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('259.380'), 'x1': Decimal('294.045'), 'top': Decimal('70.539'), 'bottom': Decimal('83.946'), 'text': 'Replace:'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 66 ['Chapter', '6', '44', '41-42', 'Replace:'] ...  | 

======page: 3 ===========
image_dict {}
W: {'x0': Decimal('189.290'), 'x1': Decimal('246.076'), 'top': Decimal('88.215'), 'bottom': Decimal('102.467'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('250.002'), 'x1': Decimal('263.344'), 'top': Decimal('88.215'), 'bottom': Decimal('102.467'), 'text': '6:'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('297.290'), 'x1': Decimal('347.021'), 'top': Decimal('88.215'), 'bottom': Decimal('102.467'), 'text': 'Energy'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('351.070'), 'x1': Decimal('406.084'), 'top': Decimal('88.215'), 'bottom': Decimal('102.467'), 'text': 'Systems'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('48.480'), 'x1': Decimal('54.000'), 'top': Decimal('91.369'), 'bottom': Decimal('101.405'), 'text': '1'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 278 ['Chapter', '6:', 'Energy', 'Systems', '1'] ...  | 

======page: 4 ===========
image_dict {}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 348 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | 

======page: 5 ===========
image_dict {}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 746 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | 

======page: 6 ===========
image_dict {}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 715 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | 

======page: 7 ===========
image_dict {}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 342 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | 

======page: 8 ===========
images 2 |
image: <class 'dict'>: dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop']) 
dict_values([Decimal('72'), Decimal('412.990'), Decimal('523.300'), Decimal('664.640'), Decimal('451.300'), Decimal('251.650'), 'Im0', <PDFStream(15): raw=143450, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 779, 'Interpolate': True, 'Length': 143448, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1397}>, (Decimal('1397'), Decimal('779')), None, 8, [/'DeviceRGB'], 'image', 8, Decimal('177.280'), Decimal('428.930'), Decimal('6070.720')])
stream <PDFStream(15): raw=143450, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 779, 'Interpolate': True, 'Length': 143448, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1397}>
keys dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop'])
xxyy ((Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640')), (Decimal('1397'), Decimal('779')), 'Im0', 8)
image:  ((Decimal('1397'), Decimal('779')), 143448) => (8, (Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640')))
image: <class 'dict'>: dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop']) 
dict_values([Decimal('72'), Decimal('203.730'), Decimal('523.300'), Decimal('405.380'), Decimal('451.300'), Decimal('201.650'), 'Im1', <PDFStream(16): raw=122018, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 655, 'Interpolate': True, 'Length': 122016, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1466}>, (Decimal('1466'), Decimal('655')), None, 8, [/'DeviceRGB'], 'image', 8, Decimal('436.540'), Decimal('638.190'), Decimal('6329.980')])
stream <PDFStream(16): raw=122018, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 655, 'Interpolate': True, 'Length': 122016, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1466}>
keys dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop'])
xxyy ((Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380')), (Decimal('1466'), Decimal('655')), 'Im1', 8)
image:  ((Decimal('1466'), Decimal('655')), 122016) => (8, (Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380')))
image_dict {((Decimal('1397'), Decimal('779')), 143448): (8, (Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640'))), ((Decimal('1466'), Decimal('655')), 122016): (8, (Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380')))}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 140 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | 

======page: 9 ===========
images 1 |
image: <class 'dict'>: dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop']) 
dict_values([Decimal('80.900'), Decimal('543.430'), Decimal('514.250'), Decimal('769.920'), Decimal('433.350'), Decimal('226.490'), 'Im0', <PDFStream(19): raw=204351, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'FlateDecode', 'Height': 854, 'Interpolate': False, 'Length': 204349, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1634}>, (Decimal('1634'), Decimal('854')), None, 8, [/'DeviceRGB'], 'image', 9, Decimal('72.000'), Decimal('298.490'), Decimal('6807.360')])
stream <PDFStream(19): raw=204351, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'FlateDecode', 'Height': 854, 'Interpolate': False, 'Length': 204349, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1634}>
keys dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop'])
xxyy ((Decimal('80.900'), Decimal('514.250')), (Decimal('543.430'), Decimal('769.920')), (Decimal('1634'), Decimal('854')), 'Im0', 9)
image:  ((Decimal('1634'), Decimal('854')), 204349) => (9, (Decimal('80.900'), Decimal('514.250')), (Decimal('543.430'), Decimal('769.920')))
image_dict {((Decimal('1397'), Decimal('779')), 143448): (8, (Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640'))), ((Decimal('1466'), Decimal('655')), 122016): (8, (Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380'))), ((Decimal('1634'), Decimal('854')), 204349): (9, (Decimal('80.900'), Decimal('514.250')), (Decimal('543.430'), Decimal('769.920')))}
W: {'x0': Decimal('72.024'), 'x1': Decimal('94.656'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Final'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('97.460'), 'x1': Decimal('152.461'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Government'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('155.277'), 'x1': Decimal('208.291'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Distribution'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('276.170'), 'x1': Decimal('311.034'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': 'Chapter'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
W: {'x0': Decimal('313.827'), 'x1': Decimal('319.347'), 'top': Decimal('38.069'), 'bottom': Decimal('48.105'), 'text': '6'} dict_keys(['x0', 'x1', 'top', 'bottom', 'text']) 
words 435 ['Final', 'Government', 'Distribution', 'Chapter', '6'] ...  | wrote image coords to /Users/pm286/workspace/amilib/temp/pdf/ar6/chap6/image_coords.txt
pdf_debug {((Decimal('1397'), Decimal('779')), 143448): (8, (Decimal('72'), Decimal('523.300')), (Decimal('412.990'), Decimal('664.640'))), ((Decimal('1466'), Decimal('655')), 122016): (8, (Decimal('72'), Decimal('523.300')), (Decimal('203.730'), Decimal('405.380'))), ((Decimal('1634'), Decimal('854')), 204349): (9, (Decimal('80.900'), Decimal('514.250')), (Decimal('543.430'), Decimal('769.920')))}
 outdir /Users/pm286/workspace/amilib/temp/pdf/ar6/chap6
___________________ PDFCharacterTest.test_pdfminer_font_and_character_output ___________________

self = <test.test_pdf.PDFCharacterTest testMethod=test_pdfminer_font_and_character_output>

        @unittest.skipUnless(PDFTest.DEBUG, "too much output")
        def test_pdfminer_font_and_character_output(self):
            """Examines every character and annotates it
            Typical:
    LTPage
      LTTextBoxHorizontal                               Journal of Medicine and Life Volume 7, Special Issue 3, 2014
        LTTextLineHorizontal                            Journal of Medicine and Life Volume 7, Special Issue 3, 2014
          LTChar                   KAAHHD+Calibri,Itali J
          LTChar                   KAAHHD+Calibri,Itali o
          LTChar                   KAAHHD+Calibri,Itali u
            """
            MAXITEM = 2
            from pathlib import Path
            from typing import Iterable, Any
    
>           from pdfminer.high_level import extract_pages
E           ImportError: cannot import name 'extract_pages' from 'pdfminer.high_level' (/opt/anaconda3/lib/python3.8/site-packages/pdfminer/high_level.py)

test/test_pdf.py:858: ImportError
_____________________________ PDFCharacterTest.test_pdfminer_style _____________________________

self = <test.test_pdf.PDFCharacterTest testMethod=test_pdfminer_style>

        def test_pdfminer_style(self):
            """Examines every character and annotates it
            Typical:
    LTPage
      LTTextBoxHorizontal                               Journal of Medicine and Life Volume 7, Special Issue 3, 2014
        LTTextLineHorizontal                            Journal of Medicine and Life Volume 7, Special Issue 3, 2014
          LTChar                   KAAHHD+Calibri,Itali J
          LTChar                   KAAHHD+Calibri,Itali o
          LTChar                   KAAHHD+Calibri,Itali u
            """
            from pathlib import Path
            from typing import Iterable, Any
    
>           from pdfminer.high_level import extract_pages
E           ImportError: cannot import name 'extract_pages' from 'pdfminer.high_level' (/opt/anaconda3/lib/python3.8/site-packages/pdfminer/high_level.py)

test/test_pdf.py:961: ImportError
________________ PDFCharacterTest.test_pdfplumber_full_page_info_LOWLEVEL_CHARS ________________

self = <test.test_pdf.PDFCharacterTest testMethod=test_pdfplumber_full_page_info_LOWLEVEL_CHARS>

    def test_pdfplumber_full_page_info_LOWLEVEL_CHARS(self):
        """The definitive catalog of all objects on a page"""
        assert PMC1421_PDF.exists(), f"{PMC1421_PDF} should exist"
    
        include_float = False # don't test if values are floats
        # TODO use pytest.approx or similar
    
        # also ['_text', 'matrix', 'fontname', 'ncs', 'graphicstate', 'adv', 'upright', 'x0', 'y0', 'x1', 'y1',
        # 'width', 'height', 'bbox', 'size', 'get_text',
        # 'is_compatible', 'set_bbox', 'is_empty', 'is_hoverlap',
        # 'hdistance', 'hoverlap', 'is_voverlap', 'vdistance', 'voverlap', 'analyze', ']
        with pdfplumber.open(PMC1421_PDF) as pdf:
            first_page = pdf.pages[0]
            # print(type(first_page), first_page.__dir__())
            """
            dir: ['pdf', 'root_page', 'page_obj', 'page_number', 'rotation', 'initial_doctop', 'cropbox', 'mediabox',
            'bbox', 'cached_properties', 'is_original', 'pages', 'width',
            'height', 'layout', 'annots', 'hyperlinks', 'objects', 'process_object', 'iter_layout_objects', 'parse_objects',
            'debug_tablefinder', 'find_tables', 'extract_tables', 'extract_table', 'get_text_layout', 'search', 'extract_text',
             'extract_words', 'crop', 'within_bbox', 'filter', 'dedupe_chars', 'to_image', 'to_dict',
             'flush_cache', 'rects', 'lines', 'curves', 'images', 'chars', 'textboxverticals', 'textboxhorizontals',
             'textlineverticals', 'textlinehorizontals', 'rect_edges', 'edges', 'horizontal_edges', 'vertical_edges', 'to_json',
              'to_csv', ]
            """
            assert first_page.page_number == 1
            assert first_page.rotation == 0
            assert first_page.initial_doctop == 0
            # cropbox and medibox seem to vary beteween lists and tuples on different versionns of Python
            # assert first_page.cropbox == (0, 0, 595.22, 842)
            # assert first_page.mediabox == (0, 0, 595.22, 842)
            # assert first_page.bbox == (0, 0, 595.22, 842)
>           assert first_page.cached_properties == ['_rect_edges', '_curve_edges', '_edges', '_objects', '_layout']
E           AssertionError: assert ['_rect_edges...s', '_layout'] == ['_rect_edges...s', '_layout']
E             At index 1 diff: '_edges' != '_curve_edges'
E             Right contains one more item: '_layout'
E             Use -v to get the full diff

test/test_pdf.py:1054: AssertionError
======================================= warnings summary =======================================
../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
  /opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    _np_version_forbids_neg_powint = LooseVersion(numpy.__version__) >= LooseVersion('1.12.0b1')

../../.local/lib/python3.8/site-packages/requests/__init__.py:87
  /Users/pm286/.local/lib/python3.8/site-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (4.0.0) doesn't match a supported version!
    warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "

test/test_nlp.py::NLPTest::test_compute_text_similarity_STAT
  /opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
    warnings.warn(

test/test_nlp.py::NLPTest::test_compute_text_similarity_STAT
  /opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py:408: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'formerli', 'forti', 'ha', 'henc', 'hereaft', 'herebi', 'hi', 'howev', 'hundr', 'inde', 'latterli', 'mani', 'meanwhil', 'moreov', 'mostli', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'seriou', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'thi', 'thu', 'togeth', 'twelv', 'twenti', 'veri', 'wa', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
    warnings.warn(

test/test_pdf.py::PDFCharacterTest::test_download_all_hlab_shifts_convert_to_html
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_multiple_ids
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_simple_wikidata_query
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_wikidata_extractor
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_wikidata_extractor
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_wikidata_id_lookup
test/test_wikidata.py::TestWikidataLookup_WIKI_NET::test_wikidata_id_lookup
  /opt/anaconda3/lib/python3.8/site-packages/urllib3/poolmanager.py:316: DeprecationWarning: The 'strict' parameter is no longer needed on Python 3+. This will raise an error in urllib3 v2.1.0.
    warnings.warn(

test/test_stat.py::TestStat::test_plot_scatter_noel_oboyle_STAT_PLOT
  /opt/anaconda3/lib/python3.8/site-packages/sklearn/manifold/_mds.py:298: FutureWarning: The default value of `normalized_stress` will change to `'auto'` in version 1.4. To suppress this warning, manually set the value of `normalized_stress`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=================================== short test summary info ====================================
FAILED test/test_pdf.py::PDFPlumberTest::test_pdfplumber_json_single_page_debug - AttributeEr...
FAILED test/test_pdf.py::PDFChapterTest::test_read_ipcc_chapter__debug - AttributeError: 'Pag...
FAILED test/test_pdf.py::PDFCharacterTest::test_debug_page_properties_chap6_word_count_and_images_data_wg3_old__example
FAILED test/test_pdf.py::PDFCharacterTest::test_pdfminer_font_and_character_output - ImportEr...
FAILED test/test_pdf.py::PDFCharacterTest::test_pdfminer_style - ImportError: cannot import n...
FAILED test/test_pdf.py::PDFCharacterTest::test_pdfplumber_full_page_info_LOWLEVEL_CHARS - As...
============== 6 failed, 141 passed, 73 skipped, 13 warnings in 69.46s (0:01:09) ====

Errors from running pytest

Using VSCode on -win 10 Education with a wsl shell (linux). I had to install pytest using https://docs.pytest.org/en/6.2.x/getting-started.html pip install -U pytest

ERRORS

`worthingtons@nb-t1796:/mnt/c/git/amilib$ pytest
======================================================================================== test session starts =========================================================================================platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.4.0
rootdir: /mnt/c/git/amilib
collected 0 items / 10 errors

=============================================================================================== ERRORS ===============================================================================================_________________________________________________________________________________ ERROR collecting test/test_all.py __________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_all.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_all.py:14: in
from amilib.wikimedia import WikidataSparql as WS
amilib/wikimedia.py:12: in
from amilib.ami_html import HtmlUtil
amilib/ami_html.py:25: in
from amilib.xml_lib import XmlLib, HtmlLib
amilib/xml_lib.py:16: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_________________________________________________________________________________ ERROR collecting test/test_file.py _________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_file.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_file.py:10: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_______________________________________________________________________________ ERROR collecting test/test_headless.py _______________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_headless.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_headless.py:9: in
from geopy.geocoders import Nominatim
E ModuleNotFoundError: No module named 'geopy'
_________________________________________________________________________________ ERROR collecting test/test_html.py _________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_html.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_html.py:20: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_________________________________________________________________________________ ERROR collecting test/test_nlp.py __________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_nlp.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_nlp.py:4: in
from amilib.ami_nlp import AmiNLP
amilib/ami_nlp.py:6: in
import matplotlib.pyplot as plt
E ModuleNotFoundError: No module named 'matplotlib'
_________________________________________________________________________________ ERROR collecting test/test_pdf.py __________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_pdf.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_pdf.py:20: in
import test.test_all
test/test_all.py:14: in
from amilib.wikimedia import WikidataSparql as WS
amilib/wikimedia.py:12: in
from amilib.ami_html import HtmlUtil
amilib/ami_html.py:25: in
from amilib.xml_lib import XmlLib, HtmlLib
amilib/xml_lib.py:16: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_________________________________________________________________________________ ERROR collecting test/test_svg.py __________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_svg.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_svg.py:5: in
from amilib.ami_svg import AmiSVG
amilib/ami_svg.py:4: in
from amilib.xml_lib import NS_MAP, XML_NS, SVG_NS
amilib/xml_lib.py:16: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_________________________________________________________________________________ ERROR collecting test/test_util.py _________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_util.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_util.py:11: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_______________________________________________________________________________ ERROR collecting test/test_wikidata.py _______________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_wikidata.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_wikidata.py:11: in
from amilib.wikimedia import WikidataPage, WikidataExtractor, WikidataProperty, WikidataFilter
amilib/wikimedia.py:12: in
from amilib.ami_html import HtmlUtil
amilib/ami_html.py:25: in
from amilib.xml_lib import XmlLib, HtmlLib
amilib/xml_lib.py:16: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
_________________________________________________________________________________ ERROR collecting test/test_xml.py __________________________________________________________________________________ImportError while importing test module '/mnt/c/git/amilib/test/test_xml.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/init.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/test_xml.py:5: in
from amilib.ami_html import HtmlStyle
amilib/ami_html.py:25: in
from amilib.xml_lib import XmlLib, HtmlLib
amilib/xml_lib.py:16: in
from amilib.file_lib import FileLib
E ModuleNotFoundError: No module named 'amilib.file_lib'
====================================================================================== short test summary info =======================================================================================ERROR test/test_all.py
ERROR test/test_file.py
ERROR test/test_headless.py
ERROR test/test_html.py
ERROR test/test_nlp.py
ERROR test/test_pdf.py
ERROR test/test_svg.py
ERROR test/test_util.py
ERROR test/test_wikidata.py
ERROR test/test_xml.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 10 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!========================================================================================= 10 errors in 7.92s =========================================================================================`

Running tests

xception has occurred: ModuleNotFoundError
No module named 'test.resources'
File "F:\Assignment\test_file.py", line 12, in
from test.resources import Resources
ModuleNotFoundError: No module named 'test.resources'

Getting this issues while running every test In VS Code

Extraction of terms from Wiktionary output sometimes fails

The lookup for peat retrieves the Chinese etymology, not the English.

Problem is parsing the linear Wiktionary output

amilib 0.0.8 installation

System: Windows11 Gen 64-bit operating system
Python Version: Python 3.11.9

PYAMI
***** PYAMI VERSION 0.0.8 *****
command: ['--help']
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\dhana\anaconda3\envs\amilib\Scripts\amilib.exe\__main__.py", line 7, in <module>
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\site-packages\amilib\amix.py", line 1301, in main
    amix.run_command(sys.argv[1:])
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\site-packages\amilib\amix.py", line 273, in run_command
    self.parse_and_run_args(args)
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\site-packages\amilib\amix.py", line 287, in parse_and_run_args
    parser = self.create_arg_parser()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\site-packages\amilib\amix.py", line 214, in create_arg_parser
    amilib_parser = AmiLibArgs().make_sub_parser(subparsers)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\site-packages\amilib\util.py", line 829, in make_sub_parser
    self.parser = subparsers.add_parser(self.subparser_arg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dhana\anaconda3\envs\amilib\Lib\argparse.py", line 1197, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument command: conflicting subparser: HTML```

Equality test fails in pdf chars

C:\Users\asus\Desktop\Semantic\amilib\test> pytest test_pdf.py::PDFCharacterTest::test_pdfplumber_full_page_info_LOWLEVEL_CHARS
======================================================= test session starts ========================================================
platform win32 -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: C:\Users\asus\Desktop\Semantic\amilib
collected 1 item

test_pdf.py F [100%]

============================================================= FAILURES =============================================================
__________________________________ PDFCharacterTest.test_pdfplumber_full_page_info_LOWLEVEL_CHARS __________________________________

self = <test.test_pdf.PDFCharacterTest testMethod=test_pdfplumber_full_page_info_LOWLEVEL_CHARS>

def test_pdfplumber_full_page_info_LOWLEVEL_CHARS(self):
    """The definitive catalog of all objects on a page"""
    assert PMC1421_PDF.exists(), f"{PMC1421_PDF} should exist"

    # also ['_text', 'matrix', 'fontname', 'ncs', 'graphicstate', 'adv', 'upright', 'x0', 'y0', 'x1', 'y1',
    # 'width', 'height', 'bbox', 'size', 'get_text',
    # 'is_compatible', 'set_bbox', 'is_empty', 'is_hoverlap',
    # 'hdistance', 'hoverlap', 'is_voverlap', 'vdistance', 'voverlap', 'analyze', ']
    with pdfplumber.open(PMC1421_PDF) as pdf:
        first_page = pdf.pages[0]
        # print(type(first_page), first_page.__dir__())
        """
        dir: ['pdf', 'root_page', 'page_obj', 'page_number', 'rotation', 'initial_doctop', 'cropbox', 'mediabox',
        'bbox', 'cached_properties', 'is_original', 'pages', 'width',
        'height', 'layout', 'annots', 'hyperlinks', 'objects', 'process_object', 'iter_layout_objects', 'parse_objects',
        'debug_tablefinder', 'find_tables', 'extract_tables', 'extract_table', 'get_text_layout', 'search', 'extract_text',
         'extract_words', 'crop', 'within_bbox', 'filter', 'dedupe_chars', 'to_image', 'to_dict',
         'flush_cache', 'rects', 'lines', 'curves', 'images', 'chars', 'textboxverticals', 'textboxhorizontals',
         'textlineverticals', 'textlinehorizontals', 'rect_edges', 'edges', 'horizontal_edges', 'vertical_edges', 'to_json',
          'to_csv', ]
        """
        assert first_page.page_number == 1
        assert first_page.rotation == 0
        assert first_page.initial_doctop == 0
        assert first_page.cropbox == (0, 0, 595.22, 842)
        assert first_page.mediabox == (0, 0, 595.22, 842)
        assert first_page.bbox == (0, 0, 595.22, 842)
        assert first_page.cached_properties == ['_rect_edges', '_curve_edges', '_edges', '_objects', '_layout']
        assert first_page.is_original
        assert first_page.pages is None
        assert first_page.width == 595.22
        assert first_page.height == 842
        # assert first_page.layout: < LTPage(1)
        # 0.000, 0.000, 595.220, 842.000
        # rotate = 0 >
        assert first_page.annots == []
        assert first_page.hyperlinks == []
        assert len(first_page.objects) == 2
        assert type(first_page.objects) is dict
        assert list(first_page.objects.keys()) == ['char', 'line']
        assert len(first_page.objects['char']) == 4411
        assert first_page.objects['char'][:2] == [
            {'matrix': (9, 0, 0, 9, 319.74, 797.4203),
             'mcid': None,
             'ncs': 'DeviceCMYK',
             'non_stroking_pattern': None,
             'stroking_pattern': None,
             'tag': None,
             'fontname': 'KAAHHD+Calibri,Italic',
             'adv': 0.319,
             'upright': True,
             'x0': 319.74, 'y0': 795.1703, 'x1': 322.611, 'y1': 804.1703,
             'width': 2.870999999999981, 'height': 9.0, 'size': 9.0,
             'object_type': 'char', 'page_number': 1,
             'text': 'J', 'stroking_color': None, 'non_stroking_color': (0.86667, 0.26667, 1, 0.15294),
             'top': 37.8297, 'bottom': 46.8297, 'doctop': 37.8297
             },
            {'matrix': (9, 0, 0, 9, 322.6092, 797.4203), 'fontname': 'KAAHHD+Calibri,Italic', 'adv': 0.513,
             'mcid': None,
             'ncs': 'DeviceCMYK',
             'non_stroking_pattern': None,
             'stroking_pattern': None,
             'tag': None,
             'upright': True,
             'x0': 322.6092, 'y0': 795.1703, 'x1': 327.2262, 'y1': 804.1703, 'width': 4.617000000000019,
             'height': 9.0, 'size': 9.0,
             'object_type': 'char', 'page_number': 1, 'text': 'o', 'stroking_color': None,
             'non_stroking_color': (0.86667, 0.26667, 1, 0.15294),
             'top': 37.8297, 'bottom': 46.8297, 'doctop': 37.8297},
        ], f"first_page.objects['char'][0]  {first_page.objects['char'][0]}"
        assert len(first_page.objects['line']) == 1, f" len(first_page.objects['line'])"

      assert first_page.objects['line'][0] == {

            'bottom': 48.24000000000001,
            'doctop': 48.24000000000001,
            'evenodd': False,
            'fill': False,
            'height': 0.0,
            'linewidth': 1,
            'mcid': None,
            'non_stroking_color': (0,),
            'non_stroking_pattern': None,
            'object_type': 'line',
            'page_number': 1,
            #  this may be different y-coord system
            # 'pts': [(56.7, 793.76), (542.76, 793.76)],
            'pts': [(56.7, 48.24000000000001), (542.76, 48.24000000000001)],
            'stroke': True,
            'stroking_color': (0.3098, 0.24706, 0.2549, 0),
            'stroking_pattern': None,
            'tag': None,
            'top': 48.24000000000001,
            'width': 486.06,
            'x0': 56.7,
            'x1': 542.76,
            'y0': 793.76,
            'y1': 793.76
        }, f"first_page.objects['line'][0]  {first_page.objects['line'][0]}"

E AssertionError: first_page.objects['line'][0] {'x0': 56.7, 'y0': 793.76, 'x1': 542.76, 'y1': 793.76, 'width': 486.06, 'height': 0.0, 'pts': [(56.7, 48.24000000000001), (542.76, 48.24000000000001)], 'linewidth': 1, 'stroke': True, 'fill': False, 'evenodd': False, 'stroking_color': (0.3098, 0.24706, 0.2549, 0), 'non_stroking_color': (0,), 'mcid': None, 'tag': None, 'object_type': 'line', 'page_number': 1, 'stroking_pattern': None, 'non_stroking_pattern': None, 'path': [('m', (56.7, 48.24000000000001)), ('l', (542.76, 48.24000000000001))], 'dash': ([], 0), 'top': 48.24000000000001, 'bottom': 48.24000000000001, 'doctop': 48.24000000000001}
E assert {'bottom': 48...': False, ...} == {'bottom': 48...': False, ...}
E
E Omitting 22 identical items, use -vv to show
E Left contains 2 more items:
E {'dash': ([], 0),
E 'path': [('m', (56.7, 48.24000000000001)), ('l', (542.76, 48.24000000000001))]}
E Use -v to get more diff

test_pdf.py:1093: AssertionError
===================================================== short test summary info ======================================================
FAILED test_pdf.py::PDFCharacterTest::test_pdfplumber_full_page_info_LOWLEVEL_CHARS - AssertionError: first_page.objects['line'][0] {'x0': 56.7, 'y0': 793.76, 'x1': 542.76, 'y1': 793.76, 'width': 486.06, 'height':...
======================================================== 1 failed in 7.88s =========================================================

wiktionary failure on looking up "kangaroo"

Fails on looking up "kangaroo".
(Does not fail on nonsense words).
Needs debugging

(base) pm286macbook-2:amilib pm286$ amilib DICT --words kangaroo  --dict ~/junk/junk.xml
WARNING amix.py:159:command: ['DICT', '--words', 'kangaroo', '--dict', '/Users/pm286/junk/junk.xml']
WARNING:amilib.amix:command: ['DICT', '--words', 'kangaroo', '--dict', '/Users/pm286/junk/junk.xml']
DEBUG:amilib.html_args:================== add arguments HTML ================
WARNING amix.py:244:abstract_args <amilib.dict_args.AmiDictArgs object at 0x7febf8baa250>
WARNING:amilib.amix:abstract_args <amilib.dict_args.AmiDictArgs object at 0x7febf8baa250>
DEBUG:amilib.dict_args:DICT process_args {'version': False, 'command': 'DICT', 'dict': '/Users/pm286/junk/junk.xml', 'validate': False, 'words': 'kangaroo'}
Traceback (most recent call last):
  File "/opt/anaconda3/bin/amilib", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/amix.py", line 529, in main
    amix.run_command(sys.argv[1:])
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/amix.py", line 168, in run_command
    self.parse_and_run_args(args)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/amix.py", line 191, in parse_and_run_args
    self.run_arguments()
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/amix.py", line 246, in run_arguments
    abstract_args.parse_and_process1(self.args)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/ami_args.py", line 121, in parse_and_process1
    self.process_args()
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/dict_args.py", line 145, in process_args
    self.build_or_edit_dictionary()
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/dict_args.py", line 222, in build_or_edit_dictionary
    self.ami_dict, _ = AmiDictionary.create_dictionary_from_words(terms=self.words, title="unknown", wiktionary=True)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/ami_dict.py", line 697, in create_dictionary_from_words
    dictionary.add_wiktionary_from_terms()
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/ami_dict.py", line 1163, in add_wiktionary_from_terms
    self.lookup_and_add_wiktionary_to_entry(entry)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/ami_dict.py", line 1170, in lookup_and_add_wiktionary_to_entry
    wiktionary_page = WiktionaryPage.create_wiktionary_page(term)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/wikimedia.py", line 1556, in create_wiktionary_page
    cls.process_parts_of_speech(html_div, mw_content_text)
  File "/opt/anaconda3/lib/python3.8/site-packages/amilib/wikimedia.py", line 1605, in process_parts_of_speech
    p_elem.insert(1, pos_span)
AttributeError: 'NoneType' object has no attribute 'insert'

Error while running pytest in command line

Running pytest in win10, python 3.9 in command line

Cloned the repo amilib

ERROR

C:\Users\ADMIN\Desktop\sc_pyami\amilib>pytest
===================================================================== test session starts ======================================================================
platform win32 -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.2.0
rootdir: C:\Users\ADMIN\Desktop\sc_pyami\amilib
plugins: anyio-3.6.2
collected 0 items / 10 errors

============================================================================ ERRORS ============================================================================
______________________________________________________________ ERROR collecting test/test_all.py _______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_all.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_all.py:14: in <module>
    from amilib.wikimedia import WikidataSparql as WS
amilib\wikimedia.py:12: in <module>
    from amilib.ami_html import HtmlUtil
amilib\ami_html.py:25: in <module>
    from amilib.xml_lib import XmlLib, HtmlLib
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_file.py ______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_file.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_file.py:10: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
____________________________________________________________ ERROR collecting test/test_headless.py ____________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_headless.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_headless.py:13: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_html.py ______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_html.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_html.py:20: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_nlp.py _______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_nlp.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_nlp.py:6: in <module>
    from test.test_all import AmiAnyTest
test\test_all.py:14: in <module>
    from amilib.wikimedia import WikidataSparql as WS
amilib\wikimedia.py:12: in <module>
    from amilib.ami_html import HtmlUtil
amilib\ami_html.py:25: in <module>
    from amilib.xml_lib import XmlLib, HtmlLib
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_pdf.py _______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_pdf.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_pdf.py:20: in <module>
    import test.test_all
test\test_all.py:14: in <module>
    from amilib.wikimedia import WikidataSparql as WS
amilib\wikimedia.py:12: in <module>
    from amilib.ami_html import HtmlUtil
amilib\ami_html.py:25: in <module>
    from amilib.xml_lib import XmlLib, HtmlLib
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_svg.py _______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_svg.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_svg.py:5: in <module>
    from amilib.ami_svg import AmiSVG
amilib\ami_svg.py:4: in <module>
    from amilib.xml_lib import NS_MAP, XML_NS, SVG_NS
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_util.py ______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_util.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_util.py:11: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
____________________________________________________________ ERROR collecting test/test_wikidata.py ____________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_wikidata.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_wikidata.py:11: in <module>
    from amilib.wikimedia import WikidataPage, WikidataExtractor, WikidataProperty, WikidataFilter
amilib\wikimedia.py:12: in <module>
    from amilib.ami_html import HtmlUtil
amilib\ami_html.py:25: in <module>
    from amilib.xml_lib import XmlLib, HtmlLib
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
______________________________________________________________ ERROR collecting test/test_xml.py _______________________________________________________________
ImportError while importing test module 'C:\Users\ADMIN\Desktop\sc_pyami\amilib\test\test_xml.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Program Files\Python39\lib\importlib\__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test\test_xml.py:5: in <module>
    from amilib.ami_html import HtmlStyle
amilib\ami_html.py:25: in <module>
    from amilib.xml_lib import XmlLib, HtmlLib
amilib\xml_lib.py:16: in <module>
    from amilib.file_lib import FileLib
E   ModuleNotFoundError: No module named 'amilib.file_lib'
=================================================================== short test summary info ====================================================================
ERROR test/test_all.py
ERROR test/test_file.py
ERROR test/test_headless.py
ERROR test/test_html.py
ERROR test/test_nlp.py
ERROR test/test_pdf.py
ERROR test/test_svg.py
ERROR test/test_util.py
ERROR test/test_wikidata.py
ERROR test/test_xml.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 10 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================ 10 errors in 63.02s (0:01:03) =================================================================

Installing the requirements from requirements.txt

python -- version: 3.10.11
windows 11; 64 bit
using cmd
ide used vs code
"""
error message as follows:
INFO: pip is looking at multiple versions of requests to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 14) and chardet==5.2.0 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested chardet==5.2.0
requests 2.25.1 depends on chardet<5 and >=3.0.2

To fix this you could try to:

loosen the range of package versions you've specified
remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
"""

Tried updating pip but still the same issue persists

Issue with python

I have been tried to run the amilib tests but its been showing errors since a week. I have been resolving each issue but now i m stuck at a point where it shows this:
No Python at '"C:\Users\asus\Desktop\python.exe

How to resolve this?

[NameError] Branch `pmr_dict` fail

System: Windows 11, Python 3.12.3

__________________________________________ PDFTest.test_make_raw_ami_pages_with_spans_from_charstream_ipcc_chap6 __________________________________________

self = <test.test_pdf.PDFTest testMethod=test_make_raw_ami_pages_with_spans_from_charstream_ipcc_chap6>

    def test_make_raw_ami_pages_with_spans_from_charstream_ipcc_chap6(self):
        """
        The central AMI method to make HTML from PDF characters

        creates spans with coordinates inside divs
        Uses AmiPage.create_html_pages() which uses AmiPage.chars_to_spans()
        creates Raw HTML

        """
        output_stem = "raw_plumber"
        page_nos = range(3, 13)
        # page_nos = [3 4 5 8 ]
        input_pdf = Path(Resources.TEST_IPCC_CHAP06_PDF)
        assert input_pdf.exists(), f"{input_pdf} should exist"
        bbox = BBox(xy_ranges=[[60, 999], [60, 790]])
        output_dir = Path(AmiAnyTest.TEMP_PDF_IPCC_CHAP06)
        AmiPage.create_html_pages_pdfplumber(bbox=bbox, input_pdf=input_pdf,
                                             output_dir=output_dir, output_stem=output_stem,
                                             range_list=[range(3, 8), range(129, 131)])
        assert output_dir.exists()
        html_file = f"{output_stem}_{5}.html"
>       logger.info(f"created HTML file {html_file}")
E       NameError: name 'logger' is not defined

test\test_pdf.py:355: NameError

================================================================= short test summary info =================================================================

FAILED test/test_pdf.py::PDFTest::test_make_raw_ami_pages_with_spans_from_charstream_ipcc_chap6 - NameError: name 'logger' is not defined

============================================ 3 failed, 221 passed, 83 skipped, 4 warnings in 182.20s (0:03:02) ============================================

[LookupError] Branch `pmr_dict` fail

System: Windows 11, Python 3.12.3

C:\Users\User\Desktop\Semantics\amilib>pytest

================================================= test session starts =================================================
platform win32 -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: C:\Users\User\Desktop\Semantics\amilib
collected 307 items

test\test_dict.py ...........s...........ss.............s...........................s.ssss.......ss.....                                             [ 28%]
test\test_file.py ss                                                                                                                                 [ 28%]
test\test_headless.py s..sssssss.....s                                                                                                               [ 33%]
test\test_html.py ...s.s......s..s.....ssss...s.........s..ssssss..ss.s...ss................................s... [ 64%]
.s                                                                                                                                                   [ 65%]
test\test_misc.py s.                                                                                                                                 [ 65%]
test\test_nlp.py F                                                                                                                                   [ 66%]
test\test_pdf.py Fss........sFs.s.sssssss.s....ss.ss..s....ssssssssss..s....ss                                                                       [ 85%]
test\test_pytest.py .                                                                                                                                [ 86%]
test\test_stat.py .                                                                                                                                  [ 86%]
test\test_svg.py ...                                                                                                                                 [ 87%]
test\test_util.py ss.....s...s...                                                                                                                    [ 92%]
test\test_wikidata.py .s...........s.......                                                                                                          [ 99%]
test\test_xml.py ..                                                                                                                                  [100%]

======================================================================== FAILURES =========================================================================
________________________________________________________ NLPTest.test_compute_text_similarity_STAT ________________________________________________________

self = <WordListCorpusReader in '.../corpora/stopwords' (not loaded yet)>

    def __load(self):
        # Find the corpus root directory.
        zip_name = re.sub(r"(([^/]+)(/.*)?)", r"\2.zip/\1/", self.__name)
        if TRY_ZIPFILE_FIRST:
            try:
                root = nltk.data.find(f"{self.subdir}/{zip_name}")
            except LookupError as e:
                try:
                    root = nltk.data.find(f"{self.subdir}/{self.__name}")
                except LookupError:
                    raise e
        else:
            try:
                root = nltk.data.find(f"{self.subdir}/{self.__name}")
            except LookupError as e:
                try:
>                   root = nltk.data.find(f"{self.subdir}/{zip_name}")

..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\corpus\util.py:84:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

resource_name = 'corpora/stopwords.zip/stopwords/'
paths = ['C:\\Users\\User/nltk_data', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__q..._3.12.1008.0_x64__qbz5n2kfra8p0\\lib\\nltk_data', 'C:\\Users\\User\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', ...]

    def find(resource_name, paths=None):
        """
        Find the given resource by searching through the directories and
        zip files in paths, where a None or empty string specifies an absolute path.
        Returns a corresponding path name.  If the given resource is not
        found, raise a ``LookupError``, whose message gives a pointer to
        the installation instructions for the NLTK downloader.

        Zip File Handling:

          - If ``resource_name`` contains a component with a ``.zip``
            extension, then it is assumed to be a zipfile; and the
            remaining path components are used to look inside the zipfile.

          - If any element of ``nltk.data.path`` has a ``.zip`` extension,
            then it is assumed to be a zipfile.

          - If a given resource name that does not contain any zipfile
            component is not found initially, then ``find()`` will make a
            second attempt to find that resource, by replacing each
            component *p* in the path with *p.zip/p*.  For example, this
            allows ``find()`` to map the resource name
            ``corpora/chat80/cities.pl`` to a zip file path pointer to
            ``corpora/chat80.zip/chat80/cities.pl``.

          - When using ``find()`` to locate a directory contained in a
            zipfile, the resource name must end with the forward slash
            character.  Otherwise, ``find()`` will not locate the
            directory.

        :type resource_name: str or unicode
        :param resource_name: The name of the resource to search for.
            Resource names are posix-style relative path names, such as
            ``corpora/brown``.  Directory names will be
            automatically converted to a platform-appropriate path separator.
        :rtype: str
        """
        resource_name = normalize_resource_name(resource_name, True)

        # Resolve default paths at runtime in-case the user overrides
        # nltk.data.path
        if paths is None:
            paths = path

        # Check if the resource name includes a zipfile name
        m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
        zipfile, zipentry = m.groups()

        # Check each item in our path
        for path_ in paths:
            # Is the path item a zipfile?
            if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
                try:
                    return ZipFilePathPointer(path_, resource_name)
                except OSError:
                    # resource not in zipfile
                    continue

            # Is the path item a directory or is resource_name an absolute path?
            elif not path_ or os.path.isdir(path_):
                if zipfile is None:
                    p = os.path.join(path_, url2pathname(resource_name))
                    if os.path.exists(p):
                        if p.endswith(".gz"):
                            return GzipFileSystemPathPointer(p)
                        else:
                            return FileSystemPathPointer(p)
                else:
                    p = os.path.join(path_, url2pathname(zipfile))
                    if os.path.exists(p):
                        try:
                            return ZipFilePathPointer(p, zipentry)
                        except OSError:
                            # resource not in zipfile
                            continue

        # Fallback: if the path doesn't include a zip file, then try
        # again, assuming that one of the path components is inside a
        # zipfile of the same name.
        if zipfile is None:
            pieces = resource_name.split("/")
            for i in range(len(pieces)):
                modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
                try:
                    return find(modified_name, paths)
                except LookupError:
                    pass

        # Identify the package (i.e. the .zip file) to download.
        resource_zipname = resource_name.split("/")[1]
        if resource_zipname.endswith(".zip"):
            resource_zipname = resource_zipname.rpartition(".")[0]
        # Display a friendly error message if the resource wasn't found:
        msg = str(
            "Resource \33[93m{resource}\033[0m not found.\n"
            "Please use the NLTK Downloader to obtain the resource:\n\n"
            "\33[31m"  # To display red text in terminal.
            ">>> import nltk\n"
            ">>> nltk.download('{resource}')\n"
            "\033[0m"
        ).format(resource=resource_zipname)
        msg = textwrap_indent(msg)

        msg += "\n  For more information see: https://www.nltk.org/data.html\n"

        msg += "\n  Attempted to load \33[93m{resource_name}\033[0m\n".format(
            resource_name=resource_name
        )

        msg += "\n  Searched in:" + "".join("\n    - %r" % d for d in paths)
        sep = "*" * 70
        resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
>       raise LookupError(resource_not_found)
E       LookupError:
E       **********************************************************************
E         Resource stopwords not found.
E         Please use the NLTK Downloader to obtain the resource:
E
E         >>> import nltk
E         >>> nltk.download('stopwords')
E
E         For more information see: https://www.nltk.org/data.html
E
E         Attempted to load corpora/stopwords.zip/stopwords/
E
E         Searched in:
E           - 'C:\\Users\\User/nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\share\\nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\lib\\nltk_data'
E           - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
E           - 'C:\\nltk_data'
E           - 'D:\\nltk_data'
E           - 'E:\\nltk_data'
E       **********************************************************************

..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\data.py:583: LookupError

During handling of the above exception, another exception occurred:

self = <test.test_nlp.NLPTest testMethod=test_compute_text_similarity_STAT>

>   ???

C:\Users\User\Desktop\sciCli\amilib\test\test_nlp.py:27:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
amilib\ami_nlp.py:44: in __init__
    stop_words = stopwords.words('english')
..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\corpus\util.py:121: in __getattr__
    self.__load()
..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\corpus\util.py:86: in __load
    raise e
..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\corpus\util.py:81: in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

resource_name = 'corpora/stopwords'
paths = ['C:\\Users\\User/nltk_data', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__q..._3.12.1008.0_x64__qbz5n2kfra8p0\\lib\\nltk_data', 'C:\\Users\\User\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', ...]

    def find(resource_name, paths=None):
        """
        Find the given resource by searching through the directories and
        zip files in paths, where a None or empty string specifies an absolute path.
        Returns a corresponding path name.  If the given resource is not
        found, raise a ``LookupError``, whose message gives a pointer to
        the installation instructions for the NLTK downloader.

        Zip File Handling:

          - If ``resource_name`` contains a component with a ``.zip``
            extension, then it is assumed to be a zipfile; and the
            remaining path components are used to look inside the zipfile.

          - If any element of ``nltk.data.path`` has a ``.zip`` extension,
            then it is assumed to be a zipfile.

          - If a given resource name that does not contain any zipfile
            component is not found initially, then ``find()`` will make a
            second attempt to find that resource, by replacing each
            component *p* in the path with *p.zip/p*.  For example, this
            allows ``find()`` to map the resource name
            ``corpora/chat80/cities.pl`` to a zip file path pointer to
            ``corpora/chat80.zip/chat80/cities.pl``.

          - When using ``find()`` to locate a directory contained in a
            zipfile, the resource name must end with the forward slash
            character.  Otherwise, ``find()`` will not locate the
            directory.

        :type resource_name: str or unicode
        :param resource_name: The name of the resource to search for.
            Resource names are posix-style relative path names, such as
            ``corpora/brown``.  Directory names will be
            automatically converted to a platform-appropriate path separator.
        :rtype: str
        """
        resource_name = normalize_resource_name(resource_name, True)

        # Resolve default paths at runtime in-case the user overrides
        # nltk.data.path
        if paths is None:
            paths = path

        # Check if the resource name includes a zipfile name
        m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
        zipfile, zipentry = m.groups()

        # Check each item in our path
        for path_ in paths:
            # Is the path item a zipfile?
            if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
                try:
                    return ZipFilePathPointer(path_, resource_name)
                except OSError:
                    # resource not in zipfile
                    continue

            # Is the path item a directory or is resource_name an absolute path?
            elif not path_ or os.path.isdir(path_):
                if zipfile is None:
                    p = os.path.join(path_, url2pathname(resource_name))
                    if os.path.exists(p):
                        if p.endswith(".gz"):
                            return GzipFileSystemPathPointer(p)
                        else:
                            return FileSystemPathPointer(p)
                else:
                    p = os.path.join(path_, url2pathname(zipfile))
                    if os.path.exists(p):
                        try:
                            return ZipFilePathPointer(p, zipentry)
                        except OSError:
                            # resource not in zipfile
                            continue

        # Fallback: if the path doesn't include a zip file, then try
        # again, assuming that one of the path components is inside a
        # zipfile of the same name.
        if zipfile is None:
            pieces = resource_name.split("/")
            for i in range(len(pieces)):
                modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
                try:
                    return find(modified_name, paths)
                except LookupError:
                    pass

        # Identify the package (i.e. the .zip file) to download.
        resource_zipname = resource_name.split("/")[1]
        if resource_zipname.endswith(".zip"):
            resource_zipname = resource_zipname.rpartition(".")[0]
        # Display a friendly error message if the resource wasn't found:
        msg = str(
            "Resource \33[93m{resource}\033[0m not found.\n"
            "Please use the NLTK Downloader to obtain the resource:\n\n"
            "\33[31m"  # To display red text in terminal.
            ">>> import nltk\n"
            ">>> nltk.download('{resource}')\n"
            "\033[0m"
        ).format(resource=resource_zipname)
        msg = textwrap_indent(msg)

        msg += "\n  For more information see: https://www.nltk.org/data.html\n"

        msg += "\n  Attempted to load \33[93m{resource_name}\033[0m\n".format(
            resource_name=resource_name
        )

        msg += "\n  Searched in:" + "".join("\n    - %r" % d for d in paths)
        sep = "*" * 70
        resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
>       raise LookupError(resource_not_found)
E       LookupError:
E       **********************************************************************
E         Resource stopwords not found.
E         Please use the NLTK Downloader to obtain the resource:
E
E         >>> import nltk
E         >>> nltk.download('stopwords')
E
E         For more information see: https://www.nltk.org/data.html
E
E         Attempted to load corpora/stopwords
E
E         Searched in:
E           - 'C:\\Users\\User/nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\share\\nltk_data'
E           - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\\lib\\nltk_data'
E           - 'C:\\Users\\User\\AppData\\Roaming\\nltk_data'
E           - 'C:\\nltk_data'
E           - 'D:\\nltk_data'
E           - 'E:\\nltk_data'
E       **********************************************************************

..\..\..\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\nltk\data.py:583: LookupError

================================================================= short test summary info =================================================================

FAILED test/test_nlp.py::NLPTest::test_compute_text_similarity_STAT - LookupError:

============================================ 3 failed, 221 passed, 83 skipped, 4 warnings in 182.20s (0:03:02) ============================================

Errors while cloning amilib.

C:\Users\hp\Desktop\Semantic>git clone https://github.com/petermr/amilib.git
Cloning into 'amilib'...
remote: Enumerating objects: 1349, done.
remote: Counting objects: 100% (276/276), done.
remote: Compressing objects: 100% (134/134), done.
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 499 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

[AssertionError] Branch `pmr_dict` fail

System: Windows 11, Python 3.12.3

______________________________________________________________ PDFPlumberTest.test_misc_pdf _______________________________________________________________

self = <test.test_pdf.PDFPlumberTest testMethod=test_misc_pdf>

    def test_misc_pdf(self):
        """Parses an arbitrary PDF with PDFPlumber and outputs HTML to a given directory"""
        input_pdf = Path("/Users/pm286/workspace/misc/380981eng.pdf")
>       assert Path(input_pdf).exists(), f"{input_pdf} should exist"
E       AssertionError: \Users\pm286\workspace\misc\380981eng.pdf should exist
E       assert False
E        +  where False = <bound method Path.exists of WindowsPath('/Users/pm286/workspace/misc/380981eng.pdf')>()
E        +    where <bound method Path.exists of WindowsPath('/Users/pm286/workspace/misc/380981eng.pdf')> = WindowsPath('/Users/pm286/workspace/misc/380981eng.pdf').exists
E        +      where WindowsPath('/Users/pm286/workspace/misc/380981eng.pdf') = Path(WindowsPath('/Users/pm286/workspace/misc/380981eng.pdf'))

test\test_pdf.py:202: AssertionError

================================================================= short test summary info =================================================================

FAILED test/test_pdf.py::PDFPlumberTest::test_misc_pdf - AssertionError: \Users\pm286\workspace\misc\380981eng.pdf should exist

============================================ 3 failed, 221 passed, 83 skipped, 4 warnings in 182.20s (0:03:02) ============================================

argparse.ArgumentError: argument command: conflicting subparser: HTML

Windows 11, Python 3.12
'''

C:\Users\asus\Desktop\Semantic\amilib\test> pytest test_pdf.py::PDFMainArgTest::test_cannot_iterate
======================================================= test session starts ========================================================
platform win32 -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: C:\Users\asus\Desktop\Semantic\amilib
collected 1 item

test_pdf.py F [100%]

============================================================= FAILURES =============================================================
________________________________________________ PDFMainArgTest.test_cannot_iterate ________________________________________________

self = <test.test_pdf.PDFMainArgTest testMethod=test_cannot_iterate>

def test_cannot_iterate(self):
    """
    Test that 'PDF' subcomand works without errors
    """

  AmiLib().run_command(

        ['HTML'])

test_pdf.py:1814:

..\amilib\amix.py:155: in run_command
self.parse_and_run_args(args)
..\amilib\amix.py:169: in parse_and_run_args
parser = self.create_arg_parser()
..\amilib\amix.py:95: in create_arg_parser
amilib_parser = AmiLibArgs().make_sub_parser(subparsers)
..\amilib\util.py:829: in make_sub_parser
self.parser = subparsers.add_parser(self.subparser_arg)

self = _SubParsersAction(option_strings=[], dest='command', nargs='A...', const=None, default=None, type=None, choices={'HTML...escriptionHelpFormatter'>, conflict_handler='error', add_help=True)}, required=False, help='subcommands', metavar=None)
name = 'HTML', kwargs = {'prog': 'pytest HTML'}, aliases = ()

def add_parser(self, name, **kwargs):
    # set prog from the existing prefix
    if kwargs.get('prog') is None:
        kwargs['prog'] = '%s %s' % (self._prog_prefix, name)

    aliases = kwargs.pop('aliases', ())

    if name in self._name_parser_map:

      raise ArgumentError(self, _('conflicting subparser: %s') % name)

E argparse.ArgumentError: argument command: conflicting subparser: HTML

......\Internship\Lib\argparse.py:1219: ArgumentError
------------------------------------------------------- Captured stdout call -------------------------------------------------------
command: ['HTML']
===================================================== short test summary info ======================================================
FAILED test_pdf.py::PDFMainArgTest::test_cannot_iterate - argparse.ArgumentError: argument command: conflicting subparser: HTML
======================================================== 1 failed in 9.94s =========================================================

Need to manage Wikimedia types

Rather than having --wikipedia, wiktionary, etc. suggest a --lookup parameter:

--lookup wikipedia wiktionary wikidata
would serially lookup in these resources.

petermr / amilib Goto Github PK

amilib's Introduction

amilib

amilib's People

Contributors

Stargazers

Watchers

amilib's Issues

system information

checked out latest amilib

Run pytest

Output summary

Recommend Projects

Recommend Topics

Recommend Org

checked out latest `amilib`

Run `pytest`