# PaperQA CLI Script Documentation
This documentation provides a comprehensive guide on how to use the provided PaperQA CLI script. The script is designed to process PDF documents, generate embeddings, and answer questions using OpenAI’s language models.
## Introduction
The PaperQA CLI script is a command-line tool for processing PDF documents, generating embeddings, and answering questions with OpenAI's language models. It supports a range of configuration options to customize how documents are processed and queried.
## Prerequisites
Before using the script, ensure you have the following:
- Python 3.7 or higher
- Required Python packages: `argparse`, `glob`, `json`, `pickle`, `sys`, `pathlib`, `typing`, `paperqa`
## Installation
- Clone the repository or download the script.
- Install the `paperqa` package using pip:

```bash
pip3 install paperqa
```
Note: The other required packages (`argparse`, `glob`, `json`, `pickle`, `sys`, `pathlib`, `typing`) are part of the Python standard library and do not need to be installed separately.
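To confirm the installation, a quick import check is usually enough (adjust the interpreter name to your environment):

```bash
python3 -c "import paperqa; print('paperqa is installed')"
```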
## Usage
### Command Line Arguments
The script accepts several command-line arguments to customize its behavior. Below is a list of available arguments:
- `--pdf_dir`: Directory containing PDF files
- `--question`: Question to ask (use quotes for multi-word questions)
- `--questions_file`: Path to a text file containing a list of questions, one per line
- `--save_embeddings`: Path to save the Docs object with embeddings (default: `paperqa_embeddings.pkl`)
- `--load_embeddings`: Path to load a pre-saved Docs object with embeddings
- `--save_answers`: Path to save the answers to a text file (default: `answers.txt`)
- `--save_data`: Path to save all data (answers, Docs object) to a pickle file (default: `paperqa_data.pkl`)
- `--save_embeddings_txt`: Path to save embeddings in a text file (for debugging/analysis)
- `--llm`: OpenAI LLM model name (default: `gpt-4o`)
- `--embedding`: OpenAI embedding model name (default: `text-embedding-3-large`)
- `--summary_llm`: OpenAI LLM model name for summarization (defaults to the same model as `--llm`)
- `--k`: Number of top-k results to retrieve for each query (default: 50)
- `--max_sources`: Maximum number of sources to use in the final answer (default: 50)
- `--chunk_chars`: Number of characters per chunk when splitting documents (default: 2400)
- `--overlap`: Number of overlapping characters between chunks (default: 1200)
- `--json_summary`: Use JSON format for summarization (requires GPT-3.5-turbo or later)
- `--detailed_citations`: Include full citations in the context
- `--disable_vector_search`: Disable vector search and use all text chunks
- `--key_filter`: Filter evidence by document keys based on question similarity
- `--custom_prompt_file`: Path to a JSON file containing custom prompts
- `--batch_size`: Batch size for processing documents (default: 10)
- `--max_concurrent`: Maximum number of concurrent requests (default: 4)
- `--strip_citations`: Strip citations from the generated answers
- `--jit_texts_index`: Enable just-in-time text indexing
- `--answer_length`: Specify the desired length of the answer (default: `about 100 words`)
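For reference, the example below combines several of these flags in a single invocation; the paths, question, and output filename are placeholders:

```bash
python script.py \
  --pdf_dir /path/to/pdfs \
  --question "How do the reviewed papers evaluate model robustness?" \
  --k 30 --max_sources 10 \
  --chunk_chars 2400 --overlap 1200 \
  --detailed_citations \
  --answer_length "about 200 words" \
  --save_answers robustness_answer.txt
```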
### Favorites
- `--question`: Question to ask (use quotes for multi-word questions)
- `--questions_file`: Path to a text file containing a list of questions, one per line
- `--load_embeddings`: Path to load a pre-saved Docs object with embeddings
- `--save_embeddings_txt`: Path to save embeddings in a text file (for debugging/analysis)
- `--detailed_citations`: Include full citations in the context
- `--custom_prompt_file`: Path to a JSON file containing custom prompts
- `--answer_length`: Specify the desired length of the answer (e.g., "about 200 words")
## Examples
- Ask a single question:

```bash
python script.py --pdf_dir /path/to/pdfs --question "What is the impact of climate change on agriculture?"
```
- Ask multiple questions from a file:

```bash
python script.py --pdf_dir /path/to/pdfs --questions_file questions.txt
```
A template for building your list of questions:

```python
questions = [
    "Q1?",
    "Q2?",
    # ... add your remaining questions here
]
```
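Since `--questions_file` expects a plain text file with one question per line, a list like the one above has to be written out first; a minimal sketch (the `questions.txt` name is just an example):

```python
# Write the questions list to a plain text file, one question per line,
# in the format expected by --questions_file.
with open("questions.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(questions) + "\n")
```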
- Load existing embeddings and ask a question:

```bash
python script.py --load_embeddings embeddings.pkl --question "What are the latest advancements in AI?"
```
## Detailed Explanation
### Loading Custom Prompts
If a custom prompt file is provided using `--custom_prompt_file`, the script loads the prompts from the specified JSON file. This allows users to customize the prompts used for querying the documents.
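The keys in that JSON file are applied with `setattr` onto paperqa's `PromptCollection`, so they must match its field names. A minimal sketch of how such a file might be produced; the key and template below are assumptions to verify against the paperqa version you have installed:

```python
import json

# Hypothetical prompt override. The key must match a PromptCollection field
# name, and the template must contain the placeholders that paperqa expects
# for that prompt -- verify both against your installed paperqa version.
custom_prompts = {
    "qa": (
        "Answer the question ({length}) using only the context below, "
        "citing sources where possible.\n"
        "Context: {context}\n"
        "Question: {question}\n"
        "Answer: "
    ),
}

with open("custom_prompts.json", "w", encoding="utf-8") as f:
    json.dump(custom_prompts, f, indent=2)
```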
### Creating the Docs Object
The script creates a `Docs` object with the specified configurations, including the LLM model, embedding model, and other parameters. This object is used to manage the documents and perform queries.
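Outside the script, the same construction can be written directly; this is a minimal sketch mirroring the keyword arguments the script passes (the model names are simply the script's defaults):

```python
from paperqa import Docs, PromptCollection

# Mirrors the Docs construction in the script: model names are passed as strings.
docs = Docs(
    llm="gpt-4o",
    summary_llm="gpt-4o",
    embedding="text-embedding-3-large",
    prompts=PromptCollection(),
    max_concurrent=4,
    jit_texts_index=False,
)
```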
### Loading Existing Embeddings
If a pre-saved `Docs` object with embeddings is provided using `--load_embeddings`, the script loads the object from the specified file. This allows users to reuse previously generated embeddings.
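In an interactive session the same two steps can be used on their own; later examples in this document refer to the loaded object as `loaded_docs` (the filename below is just an example):

```python
import pickle

# Load a previously saved Docs object and re-attach the OpenAI client,
# which is not preserved through pickling (the script does the same).
with open("paperqa_embeddings.pkl", "rb") as f:
    loaded_docs = pickle.load(f)
loaded_docs.set_client()
```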
### Adding PDF Documents
The script adds PDF documents from the specified directory to the `Docs` object. It ensures that only new documents are added to avoid duplication.
### Getting Questions
The script retrieves questions from the command line, a file, or standard input. It supports multiple sources for questions to provide flexibility in querying the documents.
### Answering Questions
The script processes each question using the `Docs` object and generates answers. It supports various configurations for querying, including the number of top-k results, maximum sources, and answer length.
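A minimal sketch of a single query against the `Docs` object, showing the `Answer` attributes that examples later in this document rely on (`formatted_answer` and `context`):

```python
# k and max_sources mirror the --k and --max_sources command-line arguments.
answer = docs.query(
    "What is the impact of climate change on agriculture?",
    k=50,
    max_sources=50,
)
print(answer.formatted_answer)  # the answer text with citations
print(answer.context)           # the context used to generate the answer
```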
### Saving Data
The script saves the generated embeddings, answers, and other data to the specified files. This allows users to persist the results and reuse them later.
### Error Handling
The script includes error handling for various scenarios, such as file not found errors and JSON decoding errors. It prints appropriate error messages and exits gracefully in case of errors.
## Script

```python
from paperqa import Answer, Docs, PromptCollection, OpenAILLMModel, OpenAIEmbeddingModel
import argparse
import sys
import pickle
import json
import glob
from pathlib import Path
from typing import List, Dict
def main():
parser = argparse.ArgumentParser(description="PaperQA CLI")
parser.add_argument(
"--pdf_dir",
type=str,
default="/home/mario/gpt-researcher/BIBTEX",
help="Directory containing PDF files",
)
parser.add_argument(
"--question", type=str, help="Question to ask (use quotes for multi-word questions)"
)
parser.add_argument(
"--questions_file",
type=str,
help="Path to a text file containing a list of questions, one per line",
)
parser.add_argument(
"--save_embeddings",
type=str,
default="paperqa_embeddings.pkl",
help="Path to save the Docs object with embeddings",
)
parser.add_argument(
"--load_embeddings",
type=str,
help="Path to load a pre-saved Docs object with embeddings",
)
parser.add_argument(
"--save_answers",
type=str,
default="answers.txt",
help="Path to save the answers to a text file",
)
parser.add_argument(
"--save_data",
type=str,
default="paperqa_data.pkl",
help="Path to save all data (answers, Docs object) to a pickle file",
)
parser.add_argument(
"--save_embeddings_txt",
type=str,
help="Path to save embeddings in a text file (for debugging/analysis)",
)
parser.add_argument(
"--llm",
type=str,
default="gpt-4o",
help="OpenAI LLM model name (e.g., 'gpt-4o', 'gpt-4o')",
)
parser.add_argument(
"--embedding",
type=str,
default="text-embedding-3-large",
help="OpenAI embedding model name (e.g., 'text-embedding-3-large')",
)
parser.add_argument(
"--summary_llm",
type=str,
help="OpenAI LLM model name for summarization (defaults to same as --llm)",
)
parser.add_argument(
"--k",
type=int,
default=50,
help="Number of top-k results to retrieve for each query",
)
parser.add_argument(
"--max_sources",
type=int,
default=50,
help="Maximum number of sources to use in the final answer",
)
parser.add_argument(
"--chunk_chars",
type=int,
default=2400,
help="Number of characters per chunk when splitting documents",
)
parser.add_argument(
"--overlap",
type=int,
default=1200,
help="Number of overlapping characters between chunks",
)
parser.add_argument(
"--json_summary",
action="store_true",
help="Use JSON format for summarization (requires GPT-3.5-turbo or later)",
)
parser.add_argument(
"--detailed_citations",
action="store_true",
help="Include full citations in the context",
)
parser.add_argument(
"--disable_vector_search",
action="store_true",
help="Disable vector search and use all text chunks",
)
parser.add_argument(
"--key_filter",
action="store_true",
help="Filter evidence by document keys based on question similarity",
)
parser.add_argument(
"--custom_prompt_file",
type=str,
help="Path to a JSON file containing custom prompts",
)
parser.add_argument(
"--batch_size",
type=int,
default=10, # Increased batch size for efficiency
help="Batch size for processing documents (adjust for performance)",
)
parser.add_argument(
"--max_concurrent",
type=int,
default=4,
help="Maximum number of concurrent requests (adjust for performance)",
)
parser.add_argument(
"--strip_citations",
action="store_true",
help="Strip citations from the generated answers",
)
parser.add_argument(
"--jit_texts_index",
action="store_true",
help="Enable just-in-time text indexing",
)
parser.add_argument(
"--answer_length",
type=str,
default="about 100 words",
help="Specify the desired length of the answer (e.g., 'about 200 words')",
)
args = parser.parse_args()
# Load custom prompts from JSON file if provided
custom_prompts: Dict[str, str] = {}
if args.custom_prompt_file:
try:
with open(args.custom_prompt_file, "r", encoding="utf-8") as f:
custom_prompts = json.load(f)
except (FileNotFoundError, json.JSONDecodeError) as e:
print(f"Error loading custom prompts: {e}", file=sys.stderr)
sys.exit(1)
# Initialize PromptCollection without arguments
prompts = PromptCollection()
# If custom prompts are needed, set them directly
if custom_prompts:
for key, value in custom_prompts.items():
setattr(prompts, key, value)
llm_model = OpenAILLMModel(config={"model": args.llm, "temperature": 0.1})
embedding_model = OpenAIEmbeddingModel(config={"model": args.embedding})
summary_llm_model = (
OpenAILLMModel(config={"model": args.summary_llm, "temperature": 0.1})
if args.summary_llm
else llm_model
)
# Create Docs object with supported parameters
docs = Docs(
llm=args.llm, # Pass the model name as a string
embedding=args.embedding, # Pass the model name as a string
summary_llm=args.summary_llm if args.summary_llm else args.llm, # Pass the model name as a string
prompts=prompts, # Use the initialized and updated PromptCollection
max_concurrent=args.max_concurrent,
jit_texts_index=args.jit_texts_index,
)
# Load existing embeddings if provided
if args.load_embeddings:
try:
with open(args.load_embeddings, "rb") as f:
docs = pickle.load(f)
docs.set_client() # Required after loading from pickle
except FileNotFoundError as e:
print(f"Error loading embeddings: {e}", file=sys.stderr)
sys.exit(1)
# Add PDF documents
pdf_dir = Path(args.pdf_dir)
pdf_files = glob.glob(str(pdf_dir / "*.pdf"))
for pdf_file in pdf_files:
if pdf_file not in [doc.dockey for doc in docs.docs.values()]:
docs.add(pdf_file)
# Get questions from command line, file, or standard input
questions: List[str] = []
if args.question:
questions.append(args.question)
if args.questions_file:
try:
with open(args.questions_file, "r", encoding="utf-8") as f:
                questions.extend([line.strip() for line in f if line.strip()])
except FileNotFoundError as e:
print(f"Error reading questions file: {e}", file=sys.stderr)
sys.exit(1)
if not questions:
        questions = [line.strip() for line in sys.stdin if line.strip()]
# Get answers for each question
answers = []
for question in questions:
answer = docs.query(question, k=args.k, max_sources=args.max_sources)
print(f"Answer object: {answer}") # Debug print to inspect the Answer object
answers.append(str(answer)) # Use str(answer) to store the entire Answer object
# Save answers to file
with open(args.save_answers, "w", encoding="utf-8") as f:
for answer in answers:
f.write(answer + "\n")
# Save embeddings to file
if args.save_embeddings:
with open(args.save_embeddings, "wb") as f:
pickle.dump(docs, f)
# Save all data to file
if args.save_data:
with open(args.save_data, "wb") as f:
pickle.dump({"docs": docs, "answers": answers}, f)
# Save embeddings to text file (for debugging/analysis)
if args.save_embeddings_txt:
with open(args.save_embeddings_txt, "w", encoding="utf-8") as f:
for doc in docs.docs.values():
f.write(f"{doc.dockey}\t{doc.embedding}\n")
if __name__ == "__main__":
    main()
```
## Additional Usage Patterns
Automating a list of prompts to be queried:
- You can create a list of prompts and iterate over them, similar to the earlier examples.
- Here's an example:

```python
prompts = [
    "Prompt 1",
    "Prompt 2",
    # ... "Prompt N"
]
for prompt in prompts:
    answer = loaded_docs.query(prompt, k=50)
    print(f"Prompt: {prompt}")
    print(f"Answer: {answer.formatted_answer}\n")
```
Piping from the command line into the Python script (via stdin):
- You can use the `sys` module to read a question from standard input.
- Here's an example:

```python
import sys

question = sys.stdin.read().strip()
answer = loaded_docs.query(question, k=50)
print(answer.formatted_answer)
```
- You can then pipe the question from the command line:

```bash
echo "What is the meaning of life?" | python3 researcher.py
```
Using the response of a query as context for subsequent prompts:
- You can use the `answer.context` attribute to access the context used for generating the answer.
- Here's an example:

```python
previous_context = ""
for prompt in prompts:
    answer = loaded_docs.query(prompt, k=50, context=previous_context)
    previous_context = answer.context
    print(f"Prompt: {prompt}")
    print(f"Answer: {answer.formatted_answer}\n")
```
## Adversarial prompting (PR #131)
This pull request adds an `adversarial_query` method. An adversarial query first generates an answer, then asks the LLM to find problems with that answer, and finally generates a revised response that addresses the critique.
Commit (davidbrodrick, Jun 2, 2023): "Add method for adversarial prompting. Also work around a problem with…"
### Changes to `paperqa/docs.py`

```diff
@@ -2,6 +2,7 @@
 import os
 import re
 import sys
+import copy
 from datetime import datetime
 from functools import reduce
 from pathlib import Path
@@ -18,12 +19,15 @@
 from langchain.embeddings.openai import OpenAIEmbeddings
 from langchain.llms.base import LLM
 from langchain.vectorstores import FAISS
+from langchain.prompts import PromptTemplate
 from .paths import CACHE_PATH
 from .qaprompts import (
     citation_prompt,
     make_chain,
     qa_prompt,
+    adversarial_prompt,
+    revision_prompt,
     search_prompt,
     select_paper_prompt,
     summary_prompt,
@@ -257,7 +261,10 @@ async def adoc_match(
         papers = [f"{d.metadata['key']}: {d.page_content}" for d in docs]
         result = await chain.arun(
             question=query, papers="\n".join(papers), callbacks=callbacks
         )
+        if result == "None.":
+            # Something in that call stack returns "None." as a string!
+            result = []
         return result

     def doc_match(
@@ -378,9 +385,10 @@ async def aget_evidence(
             docs = self._faiss_index.similarity_search(
                 answer.question, k=_k, fetch_k=5 * _k
             )
             # ok now filter
-            if key_filter is not None:
-                docs = [doc for doc in docs if doc.metadata["dockey"] in key_filter][:k]
+            if (key_filter is not None):
+                docs = [doc for doc in docs if doc.metadata["key"] in key_filter][:k]

         async def process(doc):
             if doc.metadata["dockey"] in self._deleted_keys:
@@ -468,6 +476,74 @@ def generate_search_query(self, query: str) -> List[str]:
         queries = [re.sub(r"^\d+\.\s*", "", q) for q in queries]
         return queries

+    def adversarial_query(
+        self,
+        query: str,
+        k: int = 10,
+        max_sources: int = 5,
+        length_prompt: str = "about 100 words",
+        marginal_relevance: bool = True,
+        answer: Optional[Answer] = None,
+        key_filter: Optional[bool] = None,
+        get_callbacks: Callable[[str], AsyncCallbackHandler] = lambda x: [],
+        recontextualise: bool = False
+    ) -> List[Answer]:
+        # Get an answer to the question
+        orig_answer = self.query(
+            query,
+            k=k,
+            max_sources=max_sources,
+            length_prompt=length_prompt,
+            marginal_relevance=marginal_relevance,
+            answer=answer,
+            key_filter=key_filter,
+            get_callbacks=get_callbacks,
+            prompt_template=qa_prompt)
+        if "I cannot answer this question due to insufficient information." in orig_answer.answer:
+            # We can't do an adversarial challenge if there was no answer
+            return [orig_answer]
+        # Ask the LLM to critique the original answer
+        adversarial_query = "Original Question: %s\n\nAnswer to be reviewed: %s" % (query, orig_answer.answer)
+        if recontextualise:
+            # We will search for new context strings based on the original answer
+            critique = None
+        else:
+            # Use the original context strings
+            critique = copy.copy(orig_answer)
+            critique.question = query
+        critique = self.query(
+            adversarial_query,
+            k=k,
+            max_sources=max_sources,
+            length_prompt="less than 500 words",
+            marginal_relevance=marginal_relevance,
+            answer=critique,
+            key_filter=key_filter,
+            get_callbacks=get_callbacks,
+            prompt_template=adversarial_prompt)
+        # Generate a new answer which addresses the criticism
+        revision_query = "Original Question: %s\n\nYour original answer: %s\n\nFeedback from the reviewer: %s" % (query, orig_answer.answer, critique.answer)
+        if recontextualise:
+            final_answer = None
+        else:
+            final_answer = copy.copy(critique)
+            critique.question = query
+        final_answer = self.query(
+            revision_query,
+            k=k,
+            max_sources=max_sources,
+            length_prompt=length_prompt,
+            marginal_relevance=marginal_relevance,
+            answer=final_answer,
+            key_filter=key_filter,
+            get_callbacks=get_callbacks,
+            prompt_template=revision_prompt)
+        # Return all three parts of the process
+        return [orig_answer, critique, final_answer]
+
     def query(
         self,
         query: str,
@@ -478,6 +554,7 @@ def query(
         answer: Optional[Answer] = None,
         key_filter: Optional[bool] = None,
         get_callbacks: Callable[[str], AsyncCallbackHandler] = lambda x: [],
+        prompt_template: PromptTemplate = qa_prompt
     ) -> Answer:
         # special case for jupyter notebooks
         if "get_ipython" in globals() or "google.colab" in sys.modules:
@@ -499,6 +576,7 @@ def query(
                 answer=answer,
                 key_filter=key_filter,
                 get_callbacks=get_callbacks,
+                prompt_template=prompt_template
             )
         )
@@ -512,6 +590,7 @@ async def aquery(
         answer: Optional[Answer] = None,
         key_filter: Optional[bool] = None,
         get_callbacks: Callable[[str], AsyncCallbackHandler] = lambda x: [],
+        prompt_template: PromptTemplate = qa_prompt
     ) -> Answer:
         if k < max_sources:
             raise ValueError("k should be greater than max_sources")
@@ -542,7 +621,7 @@ async def aquery(
         else:
             cb = OpenAICallbackHandler()
         callbacks = [cb] + get_callbacks("answer")
-        qa_chain = make_chain(qa_prompt, self.llm)
+        qa_chain = make_chain(prompt_template, self.llm)
         answer_text = await qa_chain.arun(
             question=query,
             context_str=context_str,
```
### Changes to `paperqa/qaprompts.py`

```diff
@@ -25,6 +25,7 @@
     "Relevant Information Summary:",
 )

 qa_prompt = prompts.PromptTemplate(
     input_variables=["question", "context_str", "length"],
     template="Write an answer ({length}) "
@@ -42,6 +43,36 @@
 )

+adversarial_prompt = prompts.PromptTemplate(
+    input_variables=["question", "context_str"],
+    template="You are an adversarial and critical scientific reviewer. "
+    "Your task is to find deficiencies and shortcomings in the following answer to the original question. "
+    "Please be specific about the shortcomings of the answer and offer suggestions which would make the answer more complete and accurate. "
+    "For each sentence in your critique, indicate which sources most support it "
+    "via valid citation markers at the end of sentences, like (Example2012).\n"
+    "{context_str}\n"
+    "{question}\n"
+    "Your Critique: ",
+)
+
+revision_prompt = prompts.PromptTemplate(
+    input_variables=["question", "context_str"],
+    template="For each sentence in your answer, indicate which sources most support it "
+    "via valid citation markers at the end of sentences, like (Example2012).\n"
+    "You are a scientist and use a scholarly tone. "
+    "You want to rewrite and improve your original answer to the original question "
+    "to include better grammatical structure and logical reasoning. "
+    "You have also received critical feedback provided by a reviewer. Please address the "
+    "feedback from the reviewer so that your new answer is more comprehensive. "
+    "You can use dot points and paragraphs in your response where appropriate. "
+    # "You may conclude with one or two opinionated sentences to summarise your answer. "
+    "{context_str}\n"
+    "{question}\n"
+    "Your Improved Answer: ",
+)
+
 search_prompt = prompts.PromptTemplate(
     input_variables=["question"],
     template="We want to answer the following question: {question} \n"
```