
openai-cookbook's Introduction

OpenAI Cookbook Logo

✨ Navigate at cookbook.openai.com

Example code and guides for accomplishing common tasks with the OpenAI API. To run these examples, you'll need an OpenAI account and associated API key (create a free account here). Set an environment variable called OPENAI_API_KEY with your API key. Alternatively, in most IDEs such as Visual Studio Code, you can create a .env file at the root of your repo containing OPENAI_API_KEY=<your API key>, which will be picked up by the notebooks.
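For example, here is a minimal sketch of picking the key up in Python (this assumes the openai and python-dotenv packages are installed; loading the .env file is optional if the variable is already exported in your shell):

    import os

    import openai
    from dotenv import load_dotenv

    load_dotenv()  # reads OPENAI_API_KEY from a local .env file, if one exists
    openai.api_key = os.environ["OPENAI_API_KEY"]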

Most code examples are written in Python, though the concepts can be applied in any language.

For other useful tools, guides and courses, check out these related resources from around the web.

Contributing

The OpenAI Cookbook is a community-driven resource. Whether you're submitting an idea, fixing a typo, adding a new guide, or improving an existing one, your contributions are greatly appreciated!

Before contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.

If there are examples or guides you'd like to see, feel free to suggest them on the issues page.

If you'd like to contribute new content, make sure to read through our contribution guidelines. We welcome high-quality submissions of new examples and guides, as long as they meet our criteria and fit within the scope of the cookbook.

The contents of this repo are automatically rendered into cookbook.openai.com based on registry.yaml.

Open in GitHub Codespaces

openai-cookbook's People

Contributors

borispower, cathykc, cmurtz-msft, colin-jarvis, colin-openai, dependabot[bot], dylanra-openai, eltociear, filipeabperes, gaborcselle, glojain, hemidactylus, ibigio, isafulf, jamescalam, jhills20, joe-at-openai, justonf, kacperlukawski, katia-openai, kristapratico, liuliuod, logankilpatrick, mikeheaton, prestontuggle, scottire, shyamal-anadkat, simonpfish, ted-at-openai, teomusatoiu


openai-cookbook's Issues

Delayed completion for embeddings

Hi,

Is it possible to have example code that adds a delayed completion when producing embeddings, please?

I've read 'How to handle rate limits'; however, due to a lack of knowledge on my part, I'm struggling to add a delayed completion function to the code below.

import openai
from openai.embeddings_utils import get_embedding

size = 'babbage'

df['embeddings'] = df['Narrative Description'].apply(lambda x: get_embedding(x, engine=f'text-search-{size}-doc-001'))

df.head()

Thank you in advance!
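Not an official answer, but here is one minimal sketch of adding a fixed delay between calls. It reuses the legacy openai.embeddings_utils.get_embedding helper and the df DataFrame from the snippet above; the one-second pause is an arbitrary example value:

    import time

    from openai.embeddings_utils import get_embedding

    def get_embedding_delayed(text, engine, delay_in_seconds=1.0):
        """Fetch one embedding, then pause so requests stay under the rate limit."""
        embedding = get_embedding(text, engine=engine)
        time.sleep(delay_in_seconds)
        return embedding

    size = 'babbage'
    df['embeddings'] = df['Narrative Description'].apply(
        lambda x: get_embedding_delayed(x, engine=f'text-search-{size}-doc-001')
    )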

Section 3.1 - TypeError: can only concatenate str (not "float") to str

All output is consistent with the notebook example until section 3.1; then:

for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)

TypeError Traceback (most recent call last)
in
1 for name, is_disc in [('discriminator', True), ('qa', False)]:
2 for train_test, dt in [('train', train_df), ('test', test_df)]:
----> 3 ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
4 ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)

in create_fine_tuning_dataset(df, discriminator, n_negative, add_related)
46 rows = []
47 for i, row in df.iterrows():
---> 48 for q, a in zip(("1." + row.questions).split('\n'), ("1." + row.answers).split('\n')):
49 if len(q) >10 and len(a) >10:
50 if discriminator:

TypeError: can only concatenate str (not "float") to str

I added in 3 str(...) calls:

    for q, a in zip(("1." + str(row.questions)).split('\n'), ("1." + str(row.answers)).split('\n')):
        if len(q) >10 and len(a) >10:
            if discriminator:
                rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" yes"})
            else:
                rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" {a[2:].strip()}"})

for i, row in df.iterrows():
    for q in ("1." + str(row.questions)).split('\n'):

Which allows the code to run, but:
openai api fine_tunes.create....

Upload progress: 100% 1.00/1.00 [00:00<00:00, 2.57kit/s]
[organization=user-dyhnotsuxa3kiftffqbsno2j] Error: Expected file to have JSONL format, where every line is a JSON dictionary. Line 1 is not a dictionary. (HTTP status code: 400)

discriminator_train.jsonl and discriminator_test.jsonl are zero length files.
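One possible explanation (an assumption, not confirmed by the notebook author): some rows came back with missing questions or answers, so row.questions / row.answers are NaN floats. That both triggers the original TypeError and, once masked with str(...), produces "1.nan" rows that get filtered out, leaving empty JSONL files. A minimal sketch of dropping those rows before building the dataset:

    # Drop rows whose generated questions or answers are missing (NaN),
    # instead of coercing them to the literal string "nan" with str(...).
    train_df = train_df.dropna(subset=['questions', 'answers'])
    test_df = test_df.dropna(subset=['questions', 'answers'])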

Type Object Error in Tutorial Example

Hello,

I've been following the example provided at: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

Unfortunately when I get to In [6] I continually obtain the following error in the code:

TypeError Traceback (most recent call last)
in
----> 1 def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
2 result = openai.Embedding.create(
3 model=model,
4 input=text
5 )

TypeError: 'type' object is not subscriptable

Any recommendations on how to fix this would be much appreciated, as I can't seem to find an answer on Stack Overflow.

Thanks so much for your help!
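This error usually means the notebook is running on a Python version older than 3.9, where built-in generics such as list[float] cannot be subscripted in annotations. A minimal sketch of a backwards-compatible version of the cell (the function body is assumed to match the notebook; only the annotation changes):

    from typing import List

    import openai

    def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> List[float]:
        result = openai.Embedding.create(model=model, input=text)
        return result["data"][0]["embedding"]

Upgrading to Python 3.9+ or adding from __future__ import annotations at the top of the cell are equivalent fixes.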

Issue in line "document_section = df.loc[section_index]" of Question_answering_using_embeddings.ipynb

This is my first issue on GitHub, so please excuse any mistakes. I am trying to run Question_answering_using_embeddings.ipynb and the line "document_section = df.loc[section_index]" is causing an error.

Error: File "C:\Python310\lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
raise KeyError(key) from err
KeyError: 'Summary'

Please suggest a solution.

Link: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
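Not a confirmed diagnosis, but a KeyError here usually means the keys stored in document_embeddings do not match the DataFrame's index. A minimal diagnostic sketch (assumes the df and document_embeddings objects from that notebook):

    # Compare what the lookup is asking for with what the DataFrame actually contains
    print(df.index[:5])                           # first few index entries
    print(list(document_embeddings.keys())[:5])   # first few embedding keys

    # The notebook expects df to be indexed by (title, heading); if the CSV was
    # loaded without restoring that index, set it explicitly before the lookup:
    # df = df.set_index(["title", "heading"])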

Website Q&A: 1-D Input vector issue

The create_context function appears to have an issue comparing the embeddings between question and text. Going through the Notebook works up until the answer_question function execution, where it fails with the below error.

Perhaps this is due to the difference in data types between the embedding result and the question?

File ~/dev/ml/openai-cookbook/solutions/web_crawl_Q&A/env/lib/python3.10/site-packages/scipy/spatial/distance.py:611, in correlation(u, v, w, centered)
    578 """
    579 Compute the correlation distance between two 1-D arrays.
    580 
   (...)
    608 
    609 """
    610 u = _validate_vector(u)
--> 611 v = _validate_vector(v)
    612 if w is not None:
    613     w = _validate_weights(w)

File ~/dev/ml/openai-cookbook/solutions/web_crawl_Q&A/env/lib/python3.10/site-packages/scipy/spatial/distance.py:302, in _validate_vector(u, dtype)
    300 if u.ndim == 1:
    301     return u
--> 302 raise ValueError("Input vector should be 1-D.")

ValueError: Input vector should be 1-D
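One common cause (an assumption, since the data isn't attached): embeddings that were round-tripped through a CSV come back as strings, so each "vector" is no longer a flat 1-D array by the time it reaches the distance function. A minimal sketch of normalising the column first (assumes the df from the web-crawl example, with an embeddings column):

    import numpy as np

    # Cells reloaded from CSV look like the string "[0.0123, -0.0456, ...]"
    if isinstance(df['embeddings'].iloc[0], str):
        df['embeddings'] = df['embeddings'].apply(eval)

    # Ensure every cell is a flat 1-D numpy array before computing distances
    df['embeddings'] = df['embeddings'].apply(np.array)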

Update needed to Obtain_dataset Python script for incorporating throttling

This is what I used:

import os
from dotenv import load_dotenv

print("Loading environment")
load_dotenv()

import pandas as pd

input_datapath = 'data/rpi-data-feed_1.csv'  # to save space, we provide a pre-filtered dataset
print("Reading csv = ", input_datapath)
df = pd.read_csv(input_datapath, index_col='ID', header=0)
print("Input rows: ", len(df))
print("Cleaning up and aggregating")
df = df.dropna()
df['combined'] = "Title: " + df.Title.str.strip() + "; Metadata: (" + df.Metadata.str.strip() + ")"
print("Input rows after cleaning: ", len(df))
print(df)

print("Sorting rows")
df = df.sort_values('ID').tail(1_100)

from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# remove reviews that are too long
print("Counting tokens")
df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))
print("Removing capped rows")
df = df[df.n_tokens < 8192]
print("Final Input rows: ", len(df))
input("Press Enter to continue...")

import openai
from openai.embeddings_utils import get_embedding

# Ensure you have your API key set in your environment per the README:
# https://github.com/openai/openai-python#usage

import time
import backoff  # for exponential backoff

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def get_embeddings_with_backoff(*args, **kwargs):
    time.sleep(1)  # 60000
    print("Processing: ", *args)
    return get_embedding(*args, **kwargs)

print("Calculating Embeddings")
df['ada_search'] = df.combined.apply(lambda x: get_embeddings_with_backoff(x, engine='text-embedding-ada-002'))
output_datapath = 'data/products_with_embeddings.csv'
print("writing output file: ", output_datapath)
df.to_csv(output_datapath)

CDN download link is outdated

The link to download fine_food_reviews_with_embeddings_1k.csv in the zero-shot classification example is either outdated or the column headings are incorrect. It's still serving embeddings from babbage instead of ada at the moment.

Command:

curl -s https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv | head -n1

Output:

ProductId,UserId,Score,Summary,Text,combined,n_tokens,babbage_similarity,babbage_search

Visualizing_embeddings_in_2D - AttributeError: 'list' object has no attribute 'shape'

When I run the examples/Visualizing_embeddings_in_2D.ipynb notebook, I get an error during tsne.fit_transform(matrix):

    791 def _check_params_vs_input(self, X):
--> 792     if self.perplexity >= X.shape[0]:
    793         raise ValueError("perplexity must be less than n_samples")

AttributeError: 'list' object has no attribute 'shape'

Is this a problem with library versions I have installed? A regression from newer versions?


Sorry if this is a noob python/pandas library mismatch question.
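It may be related to a newer scikit-learn release validating perplexity before converting the input, but either way, passing a NumPy array rather than a plain Python list avoids the error. A minimal sketch (the embedding column name is an assumption; adjust it to match your DataFrame):

    import numpy as np
    from sklearn.manifold import TSNE

    # Stack the per-row embedding lists into a 2-D array of shape (n_samples, n_dims)
    matrix = np.array(df.embedding.to_list())

    tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200)
    vis_dims = tsne.fit_transform(matrix)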

Rate limit problem

I'm trying to follow the example here, and getting an error. To get the example to work, I had to add my API key, so I just did openai.api_key = "sk-mykey", but now I am getting the following error:

RateLimitError                            Traceback (most recent call last)
File ~/opt/anaconda3/lib/python3.9/site-packages/tenacity/__init__.py:407, in Retrying.__call__(self, fn, *args, **kwargs)
    406 try:
--> 407     result = fn(*args, **kwargs)
    408 except BaseException:  # noqa: B902

File ~/opt/anaconda3/lib/python3.9/site-packages/openai/embeddings_utils.py:23, in get_embedding(text, engine)
     21 text = text.replace("\n", " ")
---> 23 return openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]

File ~/opt/anaconda3/lib/python3.9/site-packages/openai/api_resources/embedding.py:34, in Embedding.create(cls, *args, **kwargs)
     33 try:
---> 34     response = super().create(*args, **kwargs)
     36     # If a user specifies base64, we'll just return the encoded string.
     37     # This is only for the default case.

File ~/opt/anaconda3/lib/python3.9/site-packages/openai/api_resources/abstract/engine_api_resource.py:115, in EngineAPIResource.create(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)
    114 url = cls.class_url(engine, api_type, api_version)
--> 115 response, _, api_key = requestor.request(
    116     "post",
    117     url,
    118     params=params,
    119     headers=headers,
    120     stream=stream,
...
--> 361     raise retry_exc from fut.exception()
    363 if self.wait:
    364     sleep = self.wait(retry_state=retry_state)

RetryError: RetryError[<Future at 0x179356970 state=finished raised RateLimitError>]

I have credit in my account, and only ~300 short sentences in my CSV. I'm surprised to be running into a rate limit, but is there a recommended way to throttle requests for this example? I should add that I'm not a Python developer, I'm just trying to figure out how to replace the search API with the embeddings API, and since the only documentation for that replacement is a Python example, I'm just trying to get the example to run so I know what I need to do in my own app.
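Here is one way to throttle the calls, using the same tenacity library that already appears in the traceback (the wait and retry limits below are arbitrary example values, not official recommendations):

    import openai
    from tenacity import retry, stop_after_attempt, wait_random_exponential

    @retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
    def get_embedding_with_retry(text: str, engine: str = "text-embedding-ada-002"):
        """Retry with exponential backoff whenever the API call raises (e.g. a rate limit error)."""
        text = text.replace("\n", " ")
        return openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]

Alternatively, for a job as small as ~300 short sentences, a simple time.sleep(...) between rows is usually enough to stay under the limit.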

How to do near-duplicate detection

I am looking for examples of how to do near-duplicate detection, as mentioned in the readme. I'm interested in different examples, but my current use-case is that I have a list of thousands of unique sentences, but I want to detect which of them might have similar semantic meanings. Is there an example of how near-duplicate detection might work?
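A minimal sketch of one possible approach (not an official recipe): embed each sentence, then flag pairs whose cosine similarity exceeds a threshold. The sentences and the 0.9 cutoff are placeholder examples to tune on your own data:

    import numpy as np
    import openai

    def embed(texts, model="text-embedding-ada-002"):
        response = openai.Embedding.create(input=texts, model=model)
        return np.array([item["embedding"] for item in response["data"]])

    sentences = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
        "Stocks fell sharply today.",
    ]
    vectors = embed(sentences)

    # text-embedding-ada-002 vectors are unit length, so a dot product is the cosine similarity
    similarity = vectors @ vectors.T

    threshold = 0.9
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if similarity[i, j] >= threshold:
                print(f"near-duplicates ({similarity[i, j]:.3f}): {sentences[i]!r} / {sentences[j]!r}")

For thousands of sentences the full similarity matrix is still cheap to compute in NumPy; for much larger sets, an approximate nearest-neighbour index is the usual next step.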

module 'openai' has no attribute 'Embedding'

When trying to run the first cell of the embeddings example:

import openai

embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
len(embedding)

I get this error:

AttributeError: module 'openai' has no attribute 'Embedding'

I've tried updating my openai install with:

pip install --upgrade openai

I've also run:

pip install openai[embeddings]

However, I still can't get past the error. Any help would be greatly appreciated! Thanks.
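Two things worth checking (assumptions, since the environment isn't shown): that the interpreter running the notebook is the one you upgraded, and that no local file named openai.py is shadowing the installed package. A minimal diagnostic sketch:

    import sys

    import openai

    print(sys.executable)      # which Python the notebook is actually using
    print(openai.__version__)  # version that is really being imported
    print(openai.__file__)     # if this points at an openai.py inside your project, rename that file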

Speeding up Python API calls?

Is there any way to speed up a sequence of completion API calls to the GPT-3 models? I have a use case where the response of the completion API should be returned very quickly, but I need the full response, so streaming would not help here. Does the Python API reuse the HTTP connection or auth tokens across multiple API calls? Is there maybe a workaround? I am also fine with changing the Python package files locally for a while. 🙂 Thanks!
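If the completions are independent of one another, one workaround is to issue them concurrently instead of sequentially. A minimal sketch with a thread pool (the prompts, model, and worker count are placeholder examples):

    import concurrent.futures

    import openai

    prompts = ["Say hello.", "Say goodbye.", "Count to three."]

    def complete(prompt):
        return openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=64)

    # Issue the requests in parallel; each worker blocks on its own HTTP call
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        responses = list(pool.map(complete, prompts))

    for response in responses:
        print(response["choices"][0]["text"].strip())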

Is the model "ada" for finetuning based on the "text-embedding-ada-002" which is announced on 15/12/2022 ?

!openai api fine_tunes.create -t "olympics-data/discriminator_train.jsonl" -v "olympics-data/discriminator_test.jsonl" --batch_size 16 --compute_classification_metrics --classification_positive_class " yes" --model ada

I just want to know whether "ada", one of the alternative models for fine-tuning, is based on "text-embedding-ada-002", which was announced on 15/12/2022. Or is there a way I can fine-tune "text-embedding-ada-002" with my personal data?

Rate limit reached: limit 60/min, current 100/min

code: df['davinci'] = df.title.apply(lambda x: get_embedding(x, engine='text-similarity-davinci-001'))

RateLimitError: Rate limit reached for default-global-with-image-limits in organization org-1UIBeDGivAGB5s2IeBxupqGD on requests per min. Limit: 60.000000 / min. Current: 100.000000 / min. Contact [email protected] if you continue to have issues.

File upload status is failed

Hi

I'm trying to fine-tune a model, but got stuck at file upload. The file status always becomes "failed" after upload. I tried everything, including the exact same sample here.

When I try to retrieve the status of the uploaded file,
train_status = openai.File.retrieve(training_id)["status"]
it shows the status of the file is "failed". I still can't find ways to get detailed error messaging. Any support would be great.

Is there a way to ask questions that do aggregations on the document set?

@ted-at-openai Since the max token length is 2800, we are chunking and embedding the chunks. This process is efficient when we ask questions that can be answered with 2-3 chunks as context. We have a use case where we need to ask aggregation questions like "How many terms are present in this terms and conditions document".

Is there a technique to implement this within the current limitations? Any hypotheses or hacks on how we can solve this use case are also welcome. Please shed some light on this.

Instruct library

Hello. In the olympics-2-create-qa.ipynb notebook, the response engine is shown as engine="davinci-instruct-beta-v2". It was not available to me (and perhaps not to the rest of the public). I tried engine="text-davinci-001", which seems to work.

ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.

In case anyone else comes across this issue:

error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Running pip install setuptools --upgrade fixed it: Successfully installed setuptools-67.2.0

The API is unable to render the emojis

I'm using the OpenAI API for converting movie titles to emojis!

[screenshot: Dart/Flutter code making the API call]

The language is Dart and the framework is Flutter. The code in the screenshot above is the API call. The "movieController" is basically the user input and "setState" prints the response on screen. The API works and gives output, but the emojis are not rendered properly.

Suppose I give the user input "Star Wars"; then the response from the API looks like "�����". This happens with any user input. The emojis do not get displayed properly!

But on OpenAI's website, it works properly.

Any solution from anyone is appreciated!

The API is not responding with the emojis.

Hi team OpenAI, I had a similar problem previously!

The OpenAI API is not responding with the emojis properly!
Many times I get the response only if I click submit 5 or 6 times in my app.

This is the code 👇, Framework: Flutter 3.1, language: Dart

[screenshot: Dart/Flutter code for the API call]

For a better understanding of the code: _movieController._text is the user input and setState() prints the emoji on the physical device screen.

Whenever I enter any movie name on the official OpenAI Playground website, I get the response 95% of the time!
But when I run my app, a lot of the time the response is empty.
Like this 👇

{id: cmpl-62HS5kfwwFcTieTsZOlO3bVcp7rwy, object: text_completion, created: 1666001761, model: text-davinci-002, choices: [{text: , index: 0, logprobs: null, finish_reason: stop}], usage: {prompt_tokens: 13, total_tokens: 13}} {cache-control: no-cache, must-revalidate, content-length: 239, content-type: application/json}

👆 The snippet above is the API's response (with headers). After "choices", the text should be the emoji, but a lot of the time it's empty.

My app: [screenshot: the app showing an empty emoji result]

For example, if my input is a movie like "Black Widow", the text (emoji) is empty in my app.

Is this a problem in my code or in the API? I'm still a beginner programmer. Any help is appreciated.

Thank you!

df.ada_similarity.apply(eval).apply(np.array) is returning an error

I'm getting an error when running the line df["ada_similarity"] = df.ada_similarity.apply(eval).apply(np.array) from example https://github.com/openai/openai-cookbook/blob/main/examples/Clustering.ipynb. The error I'm getting is:

eval() arg 1 must be a string, bytes or code object

Full error:
tmp/ipykernel_45192/3289201929.py in
2 import numpy as np
3
----> 4 df["ada_similarity"] = df.ada_similarity.apply(eval).apply(np.array)
5 matrix = np.vstack(df.ada_similarity.values)
6 matrix.shape

/apps/python3/lib/python3.7/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
4355 dtype: float64
4356 """
-> 4357 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
4358
4359 def _reduce(

/apps/python3/lib/python3.7/site-packages/pandas/core/apply.py in apply(self)
1041 return self.apply_str()
1042
-> 1043 return self.apply_standard()
1044
1045 def agg(self):

/apps/python3/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
1099 values,
1100 f, # type: ignore[arg-type]
-> 1101 convert=self.convert_dtype,
1102 )
1103

/apps/python3/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

TypeError: eval() arg 1 must be a string, bytes or code object
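Not a confirmed diagnosis, but this usually means the ada_similarity column already holds lists (for example, because the embeddings were computed in the same session instead of being reloaded from CSV), so there is nothing left to eval. A minimal sketch that handles both cases:

    import numpy as np

    def to_array(value):
        # Cells read back from CSV are strings like "[0.0123, ...]"; fresh cells are already lists
        if isinstance(value, str):
            value = eval(value)
        return np.array(value)

    df["ada_similarity"] = df.ada_similarity.apply(to_array)
    matrix = np.vstack(df.ada_similarity.values)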

No such File object: file-c3shd8wqF3vSCKaukW4Jr1TT in notebook fine-tuned_qa

I am trying to run the notebooks in fine-tuned_qa. I am experiencing an error with the notebook, specifically in the code block with the following code:

for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)
No such File object: file-c3shd8wqF3vSCKaukW4Jr1TT
No such File object: file-c3shd8wqF3vSCKaukW4Jr1TT
No such File object: file-c3shd8wqF3vSCKaukW4Jr1TT
...

Searching online for this file id didn't return any results, and from my understanding of the Files documentation I'm supposed to upload this file, but I don't know how or where?
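A guess at the cause (not confirmed): file IDs like this one were produced under the notebook author's account, so they are not visible to yours; any file the notebook references by ID has to be uploaded again from your own account and the new ID substituted in. A minimal sketch using the same legacy openai.File API that appears elsewhere in these issues (the filename is a placeholder for whichever .jsonl the notebook expects):

    import openai

    uploaded = openai.File.create(
        file=open("your_file.jsonl", "rb"),
        purpose="fine-tune",
    )
    print(uploaded["id"])  # use this new ID wherever the notebook references the old file ID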

Q/A Embeddings Function constantly fails

I am fairly new to Python, but I'm trying to follow along with the Colab notebook exactly, and for some reason, no matter what I try, I continuously get the following error:

----> 1 def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
2 """
3 Read the document embeddings and their keys from a CSV.
4
5 fname is the path to a CSV with exactly these named columns:

TypeError: 'type' object is not subscriptable

For reference, this is from cells 7 and 8 of this notebook:

[screenshot: cells 7 and 8 of the notebook]

https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

Any help is greatly appreciated. I've solved for the embeddings at this point and just want to try to calculate the nearest K from the results, but it won't even seem to initialize the function.
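As in the earlier 'type' object is not subscriptable report, this points to a Python version older than 3.9, where dict[...], tuple[...], and list[...] cannot be subscripted in annotations. A minimal backwards-compatible signature (the body is unchanged from the notebook cell and elided here):

    from typing import Dict, List, Tuple

    def load_embeddings(fname: str) -> Dict[Tuple[str, str], List[float]]:
        ...  # keep the body from the notebook cell

    # Alternatively, upgrade to Python 3.9+ or add `from __future__ import annotations`
    # at the top of the cell and keep the original annotations.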

Error pickle saving embeddings

Hi there,
I tried to save embeddings following the approach from this notebook:
openai-cookbook/examples/Recommendation_using_embeddings.ipynb

Error:
return pickle.load(handles.handle)
EOFError: Ran out of input

Here is the code:

# set path to embedding cache
embedding_cache_path = "data/my_embeddings_cache.pkl"

# load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)
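EOFError: Ran out of input from pickle.load usually means the cache file exists but is empty (for example, a zero-byte file left over from an interrupted run), so the FileNotFoundError branch never fires. A minimal sketch that treats an empty or unreadable cache the same as a missing one:

    import pickle

    import pandas as pd

    embedding_cache_path = "data/my_embeddings_cache.pkl"

    # load the cache if it exists and is readable; otherwise start from an empty dict
    try:
        embedding_cache = pd.read_pickle(embedding_cache_path)
    except (FileNotFoundError, EOFError):
        embedding_cache = {}

    # save a copy to disk
    with open(embedding_cache_path, "wb") as embedding_cache_file:
        pickle.dump(embedding_cache, embedding_cache_file)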

issue running openai.File.create

search_file = openai.File.create(
  file=open("olympics-data/olympics_search.jsonl"),
  purpose='search'
)

openai.error.InvalidRequestError: 'search' is not one of ['fine-tune'] - 'purpose'

Retry from ID of previous API request?

ChatGPT made me believe this exists,

[screenshot: ChatGPT's suggestion of a "retry" option]

I might have reached a rate limit, but something like this could be useful in case of connection loss or other mishaps.

Is this possible?

Running Embeddings encoding locally?

Is there any plan to enable the deployment of a model locally to compute embeddings on tokenized text?

I'm currently using "text-embedding-ada-002" via the API and it's fine, but I'm trying to parse indexes with >1M items and building such an index using web requests is a pain on many levels, and I'd love to find a better-performing way to do this in the future.

Why does it return a wrong answer when I change the response it should give if it doesn't know? [Preventing hallucination]


prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "sorry, 我不知啊".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

It returns 'Marius Lindvik of Norway.' but it was expected to answer 'sorry, 我不知啊'.

Repeating Instruction Examples

Is there a way, if I want to query a model with a similar question + set of instructions, to avoid passing the instructions every time? E.g.:

prompt = ['which categories does this <thing> fall in from the list below?\n - fruit \n -vegetable \n etc.']
response = openai.Completion.create(model="text-davinci-003",
                                            prompt=prompt,
                                            temperature=0.0,
                                            max_tokens=256,
                                            top_p=1,
                                            frequency_penalty=0,
                                            presence_penalty=0
                                            )

I want to ask this question for a lot of different <things>, and avoid being billed every time for the same words & instructions.

how do I generate this embedded_babbage_similarity_50k dataset

The User and product embeddings notebook specifies that the data was obtained using Obtain_dataset.ipynb, but I do not see how it can generate that, nor did I find this dataset anywhere else. Please let me know how I can get it. Thanks.

Suggestions: improve /solutions/web_crawl_Q&A/web-qa.py

Suggested improvements to web-qa.py file.

  • support international characters (by using utf-8 encoding)
  • try-except around scraped file write

Changes

Row 138:

replace

with open('text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w") as f:

with

    try:
        filename = 'text/' + local_domain + '/' + \
            url[8:].replace("/", "_") + ".txt"

        # If the text file already exists, skip it
        if os.path.exists(filename):
            continue

        # Save text from the url to a <url>.txt file
        with open(filename, "w", encoding="utf-8") as f:

Row 151:

            f.write(text)
    except:
        print("Unable to parse page " + url)

Row 184:

    with open("text/" + domain + "/" + file, "r", encoding="utf-8") as f:

Document Library Pre-Processing

Hello all,

Would it be at all possible to provide an example of document pre-processing where the dataset is not imported from Wikipedia, but instead loaded from an individual, standard CSV file?

For no less than 72 hours over the past week, I've been trying to complete the question-and-answering tutorial (https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb) using my own dataset. The issue is that even though I downloaded the example CSV file, copied my own data into it, and re-saved it, I cannot get the dataset to run using the code. The code runs perfectly fine with the sample dataset, but when I run it with my data in its place (even ensuring it is always saved as a CSV file), it always errors past line 48. I have tried changing the data types in the columns using Python, I've tried removing any special characters; I've tried, I kid you not, about three days' worth of fixes with no luck. ChatGPT is now repeating recommendations without any success, unfortunately.

I continually receive this error:

ValueError                                Traceback (most recent call last)
Cell In [74], line 1
----> 1 prompt = construct_prompt(
      2     "What is a WOC Nurse?",
      3     document_embeddings,
      4     df
      5 )
      7 print("===\n", prompt)

Cell In [73], line 16, in construct_prompt(question, context_embeddings, df)
     13 document_section = df.loc[section_index]
     15 chosen_sections_len += document_section.tokens + separator_len
---> 16 if chosen_sections_len > MAX_SECTION_LEN:
     17     break
     19 chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:1442, in NDFrame.__nonzero__(self)
   1440 @final
   1441 def __nonzero__(self):
-> 1442     raise ValueError(
   1443         f"The truth value of a {type(self).__name__} is ambiguous. "
   1444         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You can see in the dataset found here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSs9Ok5FUrhAOu_BnpLwV63bwpLylRtUWBDE7onAX1zrZW0Sz4gBEtBN-KtsBiC1DhKyhhZjNXfNf0i/pub?output=csv

that if you only use the first chapter, there is no issue; however, with anything read past line 48 (it took a lot of trial and error to determine this) it no longer works, and I either get the error noted above or an error stating that the system cannot read the JSON content.

My assumption would be that the issue is the way in which I tokenized the data, or that there is an issue with the content of the dataset; however, you can see that line 48 is only a standard paragraph with nothing special in it. Unfortunately, I am still quite new to Python, so any recommendations or assistance with this issue would be much appreciated. I'm about to give up trying to figure out how to use my own dataset with OpenAI to do question-and-answer embedding, which is quite unfortunate.

Thank you so much for your assistance!
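One possible cause, offered as a guess rather than a confirmed diagnosis: if the (title, heading) index built from your CSV contains duplicate entries, df.loc[section_index] returns a DataFrame instead of a single row, so document_section.tokens is a Series and the > comparison on line 16 becomes ambiguous. A minimal sketch for checking and, for illustration, dropping the duplicates:

    # How many (title, heading) pairs appear more than once?
    duplicated = df.index.duplicated()
    print("duplicate index entries:", duplicated.sum())
    print(df.index[duplicated].unique()[:10])

    # Illustrative fix: keep only the first occurrence of each section
    df = df[~df.index.duplicated(keep="first")]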

Fine-tuning for entity extraction?

Hi Team,
Fine-tuning for entity extraction seems to be straightforward in a zero-shot setting; however, when fine-tuning davinci or ada for entity extraction, the model seems to hallucinate. Are there any set prompts or recommended ways to fine-tune the model?

User and product embeddings unclear

In the 'User_and_product_embeddings.ipynb' there is a requirement to load 'output/embedded_babbage_similarity_50k.csv'. A comment states that this file needs to be generated in advance, but there is no clear file to use to generate this data from. A link or explanation of where to find it would be helpful.

Question answering using embeddings error

I just wanted to point this out:

[screenshot: output from the Question answering using embeddings notebook]

It's obviously no big deal, but given the context window of >8000 tokens with davinci-003 it would be a great opportunity to explain that the semantic proximity may not be a perfect metric for prepending context and in this case we solve the issue by prepending multiple articles since the second-closest one is the correct input.

requirements

I am having problems getting started with this repo. Is there a requirements.txt? I can't find one.
