delftdata / valentine

A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.

License: Apache License 2.0

Python 100.00%
schema-matching experiment-suite dataset-discovery

valentine's Introduction

Valentine: (Schema-) Matching DataFrames Made Easy


A Python package for capturing potential relationships among columns of different tabular datasets, which are given in the form of pandas DataFrames. Valentine is based on the paper Valentine: Evaluating Matching Techniques for Dataset Discovery.

You can find more information about the research supporting Valentine here.

Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found here.

Installation instructions

Requirements

  • Python >=3.8,<3.13
  • Java: the Coma matcher requires a Java (JRE) installation

To install Valentine simply run:

pip install valentine

Usage

Valentine can be used to find matches among columns of a given pair of pandas DataFrames.

Matching methods

In order to do so, the user can choose one of the following 5 matching methods (a short instantiation sketch follows the list):

  1. Coma(int: max_n, bool: use_instances, str: java_xmx) is a Python wrapper around COMA 3.0 Community edition

    • Parameters:
      • max_n(int) - Accept similarity threshold (default: 0).
      • use_instances(bool) - Whether Coma will make use of the data instances or just the schema information (default: False).
      • java_xmx(str) - The amount of RAM that Coma is allowed to use (default: "1024m").
  2. Cupid(float: w_struct, float: leaf_w_struct, float: th_accept) is the Python implementation of the paper Generic Schema Matching with Cupid

    • Parameters:
      • w_struct(float) - Structural similarity threshold, default is 0.2.
      • leaf_w_struct(float) - Structural similarity threshold, leaf level, default is 0.2.
      • th_accept(float) - Accept similarity threshold, default is 0.7.
  3. DistributionBased(float: threshold1, float: threshold2) is the Python implementation of the paper Automatic Discovery of Attributes in Relational Databases

    • Parameters:
      • threshold1(float) - The threshold for phase 1 of the method, default is 0.15.
      • threshold2(float) - The threshold for phase 2 of the method, default is 0.15.
  4. JaccardDistanceMatcher(float: threshold_dist) is a baseline method that uses Jaccard Similarity between columns to assess their correspondence score, optionally enhanced by a string similarity measure of choice.

    • Parameters:
      • threshold_dist(float) - Acceptance threshold for assessing two strings as equal, default is 0.8.

      • distance_fun(StringDistanceFunction) - String similarity function used to assess whether two strings are equal. The enumeration class type StringDistanceFunction can be imported from valentine.algorithms.jaccard_distance and lists the currently supported functions.

  5. SimilarityFlooding(str: coeff_policy, str: formula) is the Python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching

    • Parameters:
      • coeff_policy(str) - Policy for deciding the weight coefficients of the propagation graph. Choice of "inverse_product" or "inverse_average" (default).
      • formula(str) - Formula on which iterative fixpoint computation is based. Choice of "basic", "formula_a", "formula_b" and "formula_c" (default).
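
Each matcher is instantiated with its parameters and then passed to the matching functions described below. A minimal instantiation sketch follows; the parameter values are illustrative only and simply restate the defaults and options listed above, assuming all of these classes are importable from valentine.algorithms (Coma and DistributionBased are shown that way elsewhere in this page):

from valentine.algorithms import Coma, Cupid, DistributionBased, SimilarityFlooding

# Illustrative instantiations; every parameter is optional and falls back to
# its documented default when omitted.
coma_matcher = Coma(use_instances=True, java_xmx="2048m")
cupid_matcher = Cupid(w_struct=0.2, leaf_w_struct=0.2, th_accept=0.7)
db_matcher = DistributionBased(threshold1=0.15, threshold2=0.15)
sf_matcher = SimilarityFlooding(coeff_policy="inverse_average", formula="formula_c")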

Matching DataFrame Pair

After selecting one of the 5 matching methods, the user can initiate the pairwise matching process in the following way:

matches = valentine_match(df1, df2, matcher, df1_name, df2_name)

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardDistanceMatcher or SimilarityFlooding. The user can also provide a name for each DataFrame (the defaults are "table_1" and "table_2"). The function valentine_match returns a MatcherResults object, which is a dictionary with additional convenience methods, such as one_to_one, take_top_percent, get_metrics and more. Its keys are column pairs from the two DataFrames and its values are the corresponding similarity scores.

Matching DataFrame Batch

After selecting one of the 5 matching methods, the user can initiate the batch matching process in the following way:

matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)

where df_iter_1 and df_iter_2 are two iterables of pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardDistanceMatcher or SimilarityFlooding. The user can also provide an iterable of names for each set of DataFrames. The function valentine_match_batch returns a MatcherResults object, which is a dictionary with additional convenience methods, such as one_to_one, take_top_percent, get_metrics and more. Its keys are column pairs from the DataFrames and its values are the corresponding similarity scores.
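
As an illustrative sketch (the DataFrames and their names are placeholders, and valentine_match_batch is assumed to be importable from the top-level package in the same way as valentine_match):

from valentine import valentine_match_batch
from valentine.algorithms import Coma

# df_a1, df_a2, df_b1, df_b2 are placeholders for DataFrames loaded elsewhere.
matches = valentine_match_batch([df_a1, df_a2], [df_b1, df_b2],
                                Coma(use_instances=True),
                                ["table_a1", "table_a2"],
                                ["table_b1", "table_b2"])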

MatcherResults instance

The MatcherResults instance provides convenience methods that the user can use to obtain a subset of the results or to transform them. The instance is a dictionary and is sorted upon instantiation, from high similarity to low similarity.

top_n_matches = matches.take_top_n(5)

top_n_percent_matches = matches.take_top_percent(25)

one_to_one_matches = matches.one_to_one()
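
Since MatcherResults behaves like a dictionary, its entries can also be inspected directly, for example:

# Iterate over the matches (sorted from high to low similarity); each key is a
# pair of (table_name, column_name) tuples and each value is the similarity score.
for (col_in_df1, col_in_df2), similarity in matches.items():
    print(col_in_df1, col_in_df2, similarity)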

Measuring effectiveness

The MatcherResults instance that is returned by valentine_match or valentine_match_batch also has a get_metrics method, which the user can call as follows:

metrics = matches.get_metrics(ground_truth)

This returns all effectiveness metrics, such as Precision, Recall, F1-score and others, as described in the original Valentine paper. To compute them, the user also needs to provide the ground truth of matches against which the metrics will be calculated. The ground truth can be given as a list of tuples representing the column matches that should hold (see the example below).

By default, all core metrics are computed with their default parameters, but the user can also customize which metrics to run and with what parameters, or implement their own custom metrics by extending the Metric base class. Some predefined sets of metrics are available as well.

from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)

Example

The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in valentine_example.py):

import os
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
df1 = pd.read_csv(d1_path)
df2 = pd.read_csv(d2_path)

# Instantiate matcher and run
matcher = Coma(use_instances=True)
matches = valentine_match(df1, df2, matcher)

print(matches)

# If ground truth is available, Valentine can calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
                ('Authors', 'Authors'),
                ('EID', 'EID')]

metrics = matches.get_metrics(ground_truth)
    
print(metrics)

The output of the above code block is:

{
     (('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.86994505, 
     (('table_1', 'Authors'), ('table_2', 'Authors')): 0.8679843, 
     (('table_1', 'EID'), ('table_2', 'EID')): 0.8571245
}
{
     'Recall': 1.0, 
     'F1Score': 1.0, 
     'RecallAtSizeofGroundTruth': 1.0, 
     'Precision': 1.0, 
     'PrecisionTop10Percent': 1.0
}

Cite Valentine

Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}
Demo Paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}

valentine's People

Contributors

andraionescu, archer6621, asteriosk, chrisk21, dependabot[bot], jorgesia, kpsarakis, thanostsiamis


valentine's Issues

get_matches for distribution based matching fails with error "'charmap' codec can't encode character..."

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Geodata/location_cities_countries/cities.csv", encoding='utf-8')
df2 = pd.read_csv("data/Geodata/location_cities_countries/countries.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:
(UnicodeEncodeError: 'charmap' codec can't encode character '\u0103' in position 4: character maps to <undefined>)

The csv files come from here: https://www.kaggle.com/datasets/liewyousheng/geolocation

I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.

JaccardLeven with process_num=10 has errors?

Hi folks: nice package. I tried below and curious if you had ideas on this?

  1. I tried it on two DFs with 200k rows and 10 columns. It didn't converge. I had to use df.sample(4000) instead to cut the processing down to 10 minutes on a Mac Mini with 32GB RAM and a 3GHz 6-core i5. How long should I expect such a run to take? The two files of 13MB and 2MB are in https://drive.google.com/drive/folders/1BIX240k6GEouT5SrjY9pWDaT7X6_QkY4?usp=sharing.

  2. I interpreted your comment in JaccardLeven as meaning that this spawns 10 processes for a speedup? It raised the error below.

matcher = valentine.algorithms.JaccardLevenMatcher(0.2, 10)

raise RuntimeError('''

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
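
For reference, the usual way to avoid this RuntimeError on platforms that start worker processes with spawn (Windows, recent macOS) is to keep the matching call behind an entry-point guard. A minimal sketch, with placeholder CSV paths and the same JaccardLevenMatcher call as above:

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import JaccardLevenMatcher

if __name__ == '__main__':
    # Placeholder file names; the multi-process matcher should only be invoked
    # when the module runs as the main program, so child processes can be
    # bootstrapped safely.
    df1 = pd.read_csv("file1.csv")
    df2 = pd.read_csv("file2.csv")
    matches = valentine_match(df1, df2, JaccardLevenMatcher(0.2, 10))
    print(matches)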

Add batching

It would be nice to be able to compare two lists of datasets, reusing intermediate data structures to speed up the processing, instead of restarting the computation for each unique pair of datasets.

Failed installation on Windows

C:\Users\akatsifodimos>pip install valentine
Collecting valentine
  Using cached valentine-0.1.1.tar.gz (38.2 MB)
Collecting numpy<2.0,>=1.21
  Using cached numpy-1.21.2.zip (10.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting valentine
  Using cached valentine-0.1.0.tar.gz (38.2 MB)
ERROR: Cannot install valentine==0.1.0 and valentine==0.1.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    valentine 0.1.1 depends on scipy<1.8 and >=1.7
    valentine 0.1.0 depends on scipy<1.8 and >=1.7

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

get_matches for distribution based matching fails due to pickled files not refound

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/book_titles.csv", encoding='utf-8')
df2 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/collaborative_books_df.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:
https://i.postimg.cc/ryr5ZYWP/Unbenannt3.png
([WinError 2] The system cannot find the file specified: 'C:\Users\xxx\AppData\Local\Temp\tmptv_b_0v6\table1title.pkl')

The pickled rank files can not be found again. I can't find the files manually in my AppData folder either.
The csv files come from here: https://www.kaggle.com/datasets/thedevastator/book-recommender-system-itembased

I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.

Run time of Cupid()

Hello,

I am trying to match the feature names in two example data sets with 100 rows and 300 columns. It has taken more than 20 minutes and I still can't get any output. Is there anything wrong?

wrong column pkl filename with DistributionBased matching

Hello, and thank you for the library! 🍾

When using DistributionBased matching, I have the following use case:

  • I create an instance of the matcher. I have two source tables (source_1 and source_2), and one target table target.
  • I call matcher.get_matches(source_1, target). Pickle files for columns of source_1 and target tables are written to e.g., /tmp/tmpkpakbdjz, and the same files are read back with clustering_utils.get_column_from_store. Matches are generated.
  • I call matcher.get_matches(source_2, target). Pickle files for columns of source_2 and target tables are written to e.g., /tmp/tmp41gf90n2. HOWEVER, clustering_utils.get_column_from_store attempts to read pkl files created for columns of source_1 from directory /tmp/tmp41gf90n2

Add embedding-based methods

Add methods that utilize column vector representations and cosine similarity among them to determine matches.

Does Valentine currently support SemProp and EmbDI?

I saw "we implement and integrate six schema matching algorithms [14]–[19] and our own baseline method, and adapt them to the needs of dataset discovery" in your paper. At present, Valentine does not seem to support these two algorithms.

API: Metrics object with functions

It would be nice to have a matches object and do something along these lines:

...
matches.get_one_to_one()

...
matches.metrics(ground_truth)

...

Run the jobs on Slurm

Given a directory of configuration files that describe a job, run these jobs in parallel using the Slurm workload manager.

get_matches for distribution based matching fails with error "len(ranks) < 2"

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Geographie_00164.csv", encoding='utf-8')
df2 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Stationsname_00164.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:
Unbenannt4.png
(RuntimeError: len(ranks) < 2)

The csv files come from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/weather_phenomena/recent/. I converted the data to csvs. I'll attach my csvs:
Metadaten_Geographie_00164.csv
Metadaten_Stationsname_00164.csv
(Overview and explanation of data: https://www.dwd.de/EN/ourservices/cdc/cdc_ueberblick-klimadaten_en.html)

I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.

Write output to file

Each run's output should be written to a JSON file with the following structure:

{
"name": "a unique identifier for the specific run",
"matches": "a dictionary that contains the output of the algorithm",
"metrics": "a dictionary with the metrics and their values"
}

Incorporating structural schema information

Dear valentine devs,
I was wondering about losing the structural information found in JSON / dictionaries when normalizing them into pandas DataFrames. To my understanding, COMA 3 would usually use this information in the matching process to improve the results. What do you think about supporting a nested (JSON) data source? I guess one would need to transform it to XML to be able to use it with Coma etc.

All the best, and thanks for the great work!

Confusing Coma API

The Coma algorithm could either use only schema information or both schema and instance information.

Currently, we ask users to specify the strategy parameter to either COMA_OPT (schema) or COMA_OPT_INST (schema + instances) which can be difficult to understand.

It would be much easier if we replaced the strategy param with a use_instances boolean flag.

Add CUPID to the framework

Add the CUPID implementation to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.

FileNotFoundError with matches command

Hello

I am trying to execute the following 2 lines of code but I get FileNotFoundError :

import valentine
matcher =valentine.algorithms.Coma(strategy="COMA_OPT")
matches = valentine_match(df1, df2, 'matcher'==matcher)

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine_init_.py", line 20, in valentine_match
matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 32, in get_matches
self.__run_coma_jar(s_f_name, t_f_name, coma_output_file, tmp_folder_path)
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 49, in __run_coma_jar
subprocess.call(['java', f'-Xmx{self.__java_XmX}',
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 858, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

How could this be resolved? Is it a windows issue or an issue with the valentine library itself?

similarity_flooding case where e[1].long_name=None in __get_attribute_tuple

Hi Valentine authors!

I am having trouble with a bug that seems to be coming from Valentine, but I am unsure:

  • in similarity_flooding.py, is it expected that long_name may sometimes be None? (this is causing my experiments to crash)

  • dumbish question: is it possible that column_name should be =e[0].long_name ?

    def __get_attribute_tuple(self, node):
        column_name = None
        if node in self.__graph1.nodes():
            for e in self.__graph1.out_edges(node):
                links = self.__graph1.get_edge_data(e[0], e[1])
                if links.get('label') == "name":
                    column_name = e[1].long_name  # <-- long_name is None here
        else:
            for e in self.__graph2.out_edges(node):
                links = self.__graph2.get_edge_data(e[0], e[1])
                if links.get('label') == "name":
                    column_name = e[1].long_name
        return column_name

Cupid tests fail on master due to nltk resource download issue

Tested on MacOS 13.6 (Ventura)

This is the log generated by unittest upon running python3 -m unittest discover:

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading omw-1.4: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
E..........
======================================================================
ERROR: test_cupid (tests.test_algorithms.TestAlgorithms.test_cupid)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 27, in normalization
    tokens = nltk.word_tokenize(element)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/wisguest/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/wisguest/Repositories/valentine/tests/test_algorithms.py", line 28, in test_cupid
    matches_cu_matcher = cu_matcher.get_matches(d1, d2)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 36, in get_matches
    self.__add_data("DB__"+source_input.name, source_input)
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 56, in __add_data
    self.__schemata[schema_name].add_node(table_name=table.name, table_guid=table.unique_identifier,
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/schema_tree.py", line 24, in add_node
    self.nodes[table_name].tokens = normalization(table_name).tokens
^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 33, in normalization
    tokens = nltk.word_tokenize(element)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html
