How to Analyze Documents With LangChain and the OpenAI API


Extracting insights from documents and data is important in making informed decisions. However, privacy concerns arise when dealing with sensitive information. LangChain, in combination with the OpenAI API, allows you to analyze your local documents without the need to upload them online.

They achieve this by keeping your files and the vector index locally, using embeddings and vectorization for analysis, and executing processes within your environment. The text of each chunk and query is still sent to the OpenAI API for processing. OpenAI does not use data submitted by customers via its API to train its models or improve its services.

Setting Up Your Environment

Create a new Python virtual environment. This will ensure there are no library version conflicts. Then run the following terminal command to install the required libraries.

pip install langchain openai tiktoken faiss-cpu pypdf

Here is a breakdown of how you will use each library:

  • LangChain: You will use it for creating and managing language chains for text processing and analysis. It will provide modules for document loading, text splitting, embeddings, and vector storage.
  • OpenAI: You will use it for running queries and obtaining results from a language model.
  • tiktoken: You will use it to count the number of tokens (units of text) in a given text. This is to keep track of the token count when interacting with the OpenAI API, which charges based on the number of tokens you use.
  • FAISS: You will use it to create and manage a vector store, allowing fast retrieval of similar vectors based on their embeddings.
  • PyPDF: This library extracts text from PDFs. It helps load PDF files and extracts their text for further processing.

After all the libraries are installed, your environment is ready.
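
If you want to confirm that the installation worked before going further, a quick check like the following may help. It is only a sketch that uses the standard library to look for each package; the helper name missing_modules is made up for this example, and note that the faiss-cpu package is imported under the name faiss.

```python
from importlib.util import find_spec

# Import names for the packages installed above. The faiss-cpu
# package is imported as "faiss".
REQUIRED_MODULES = ["langchain", "openai", "tiktoken", "faiss", "pypdf"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If the script reports a missing module, check that the virtual environment is activated and run the pip command again.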

Getting an OpenAI API Key

When you make requests to the OpenAI API, you need to include an API key as part of the request. This key allows the API provider to verify that the requests are coming from a legitimate source and that you have the necessary permissions to access its features.

To get an OpenAI API key, proceed to the OpenAI platform.

OpenAI API homepage

Then, under your account’s profile in the top-right, click on View API keys. The API keys page will appear.

OpenAI API page

Click on the Create new secret key button. Name your key and click on Create new secret key. OpenAI will generate your API key, which you should copy and keep somewhere safe. For security reasons, you won’t be able to view it again through your OpenAI account. If you lose this secret key, you’ll need to generate a new one.

Importing the Required Libraries

To be able to use the libraries installed in your virtual environment, you need to import them.

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

Notice that you import the dependency libraries from LangChain. This allows you to use specific features of the LangChain framework.

Loading the Document for Analysis

Start by creating a variable that holds your API key. You will use this variable later in the code for authentication.


openai_api_key = "Your API key"

It is not recommended to hard code your API key if you plan to share your code with third parties. For production code that you aim to distribute, use an environment variable instead.

Next, create a function that loads a document. The function should load a PDF or a text file. If the document is neither, the function should raise a ValueError.
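
As a minimal sketch of the environment variable approach, the helper below reads the key at startup and fails with a clear message if it is missing. The function name get_api_key is hypothetical, and OPENAI_API_KEY is assumed here as the variable name because it is the conventional one; you can pass a different name if your setup uses one.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the program."
        )
    return key
```

With this in place, the assignment could become openai_api_key = get_api_key(), so the key never appears in the source file.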

def load_document(filename):
   if filename.endswith(".pdf"):
       loader = PyPDFLoader(filename)
       documents = loader.load()
   elif filename.endswith(".txt"):
       loader = TextLoader(filename)
       documents = loader.load()
   else:
       raise ValueError("Invalid file type")

After loading the documents, create a CharacterTextSplitter. This splitter will divide the loaded documents into smaller chunks based on characters.

   text_splitter = CharacterTextSplitter(chunk_size=1000,
                                         chunk_overlap=30, separator="\n")

   return text_splitter.split_documents(documents=documents)

Splitting the document ensures that the chunks are of a manageable size and are still connected with some overlapping context. This is useful for tasks like text analysis and information retrieval.

Querying the Document

You need a way to query the loaded document to derive insights from it. To do so, create a function that takes a query string and a retriever as input. It then creates a RetrievalQA instance using the retriever and an instance of the OpenAI language model.
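
To make the idea of overlapping chunks concrete, here is a simplified sketch in plain Python. It is not how CharacterTextSplitter is implemented (that class splits on the separator first and then merges pieces up to the chunk size); the hypothetical chunk_text function below just slides a fixed-size window over the text so you can see how the overlap repeats the end of one chunk at the start of the next.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=30):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Small sizes make the overlap easy to see: each chunk starts with
# the last character of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```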

def query_pdf(query, retriever):
   qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=openai_api_key),
                                    chain_type="stuff", retriever=retriever)
   result = qa.run(query)
   print(result)

This function uses the created QA instance to run the query and print the result.

Creating the Main Function

The main function will control the overall program flow. It will take user input for a document filename and load that document. Then create an OpenAIEmbeddings instance for embeddings and construct a vector store based on the loaded documents and embeddings. Save this vector store to a local file.

Next, load the persisted vector store from the local file. Then enter a loop where the user can input queries. The main function passes these queries to the query_pdf function along with the persisted vector store's retriever. The loop will continue until the user enters "exit".
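
The "stuff" chain type places all the retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea in plain Python; the function name build_stuffed_prompt and the prompt wording are assumptions made for this example, not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Join the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what a retriever might return.
example_chunks = ["Chunk one of the document.", "Chunk two of the document."]
print(build_stuffed_prompt("What is the document about?", example_chunks))
```

Because every retrieved chunk ends up in one prompt, this approach works best when the chunks are small enough to fit within the model's token limit.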

def main():
   filename = input("Enter the name of the document (.pdf or .txt):\n")
   docs = load_document(filename)
   embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
   vectorstore = FAISS.from_documents(docs, embeddings)
   vectorstore.save_local("faiss_index_constitution")
   persisted_vectorstore = FAISS.load_local("faiss_index_constitution", embeddings)
   query = input("Type in your query (type 'exit' to quit):\n")

   while query != "exit":
       query_pdf(query, persisted_vectorstore.as_retriever())
       query = input("Type in your query (type 'exit' to quit):\n")

Embeddings capture semantic relationships between words. Vectors are a form in which you can represent pieces of text.

This code converts the text data in the document into vectors using the embeddings generated by OpenAIEmbeddings. It then indexes these vectors using FAISS, for efficient retrieval and comparison of similar vectors. This is what allows for the analysis of the loaded document.

Finally, use the __name__ == "__main__" construct to call the main function if a user runs the program standalone:
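
To give a feel for what retrieving similar vectors means, here is a toy sketch in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have far more dimensions), the function names are hypothetical, and cosine similarity is used as a stand-in that may differ from the distance metric your FAISS index is configured with.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks.
indexed = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.8, 0.3, 0.1]),
    ("stock prices fell", [0.0, 0.2, 0.9]),
]
query_vector = [1.0, 0.2, 0.0]

print(top_k(query_vector, indexed, k=2))
# prints ['cats purr', 'dogs bark']
```

This sketch compares the query against every stored vector one by one, whereas a library like FAISS is built to perform the same kind of search efficiently over large collections.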

if __name__ == "__main__":
   main()

This app is a command-line application. As an extension, you can use Streamlit to add a web interface to the app.

Performing Document Analysis

To perform document analysis, store the document you want to analyze in the same folder as your project, then run the program. It will ask for the name of the document you want to analyze. Enter its full name, then enter queries for the program to analyze.

The screenshot below shows the results of analyzing a PDF.

Results of analyzing a PDF file through querying on a terminal

The following output shows the results of analyzing a text file containing source code.

Output of a program showing analysis of source code on the terminal

Ensure the files you want to analyze are in either PDF or text format. If your documents are in other formats, you can convert them to PDF format using online tools.

Understanding the Technology Behind Large Language Models

LangChain simplifies the creation of applications using large language models. This also means it abstracts what is going on behind the scenes. To understand exactly how the application you are creating works, you should familiarize yourself with the technology behind large language models.
