{"id":54,"date":"2025-03-22T10:29:46","date_gmt":"2025-03-22T17:29:46","guid":{"rendered":"http:\/\/www.ashwang.net\/?p=54"},"modified":"2025-03-22T10:30:55","modified_gmt":"2025-03-22T17:30:55","slug":"building-a-private-knowledge-base-with-langchain","status":"publish","type":"post","link":"http:\/\/www.ashwang.net\/index.php\/2025\/03\/22\/building-a-private-knowledge-base-with-langchain\/","title":{"rendered":"Building a Private Knowledge Base with LangChain"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Build a Private Knowledge Base with LangChain + RAG (Full Tutorial with Code)<\/h2>\n\n\n\n<p>Retrieval-Augmented Generation (RAG) is a powerful framework that allows you to <strong>ask questions over your own documents<\/strong> using a Large Language Model (LLM). In this post, we\u2019ll walk through how to use LangChain to build a RAG system\u2014from loading PDFs to querying your data\u2014with code and a visual workflow.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"304\" src=\"http:\/\/www.ashwang.net\/wp-content\/uploads\/2025\/03\/pkb-1024x304.jpeg\" alt=\"\" class=\"wp-image-55\" srcset=\"http:\/\/www.ashwang.net\/wp-content\/uploads\/2025\/03\/pkb-1024x304.jpeg 1024w, http:\/\/www.ashwang.net\/wp-content\/uploads\/2025\/03\/pkb-300x89.jpeg 300w, http:\/\/www.ashwang.net\/wp-content\/uploads\/2025\/03\/pkb-768x228.jpeg 768w, http:\/\/www.ashwang.net\/wp-content\/uploads\/2025\/03\/pkb.jpeg 1129w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Step-by-Step Implementation in LangChain<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\">1. Document Loading<\/h2>\n\n\n\n<p>LangChain supports many loaders: PDFs, text files, URLs, even YouTube transcripts.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.document_loaders import TextLoader\n\nloader = TextLoader(\".\/data\/email.txt\")\ndocuments = loader.load()<\/code><\/pre>\n\n\n\n<p><strong>Tip:<\/strong> You can switch to <code>PyPDFLoader<\/code>, <code>UnstructuredURLLoader<\/code>, or others depending on your data source.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Chunking Documents<\/h2>\n\n\n\n<p>Long documents are broken into smaller chunks for better embedding and retrieval.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.text_splitter import RecursiveCharacterTextSplitter\n\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\nsplits = text_splitter.split_documents(documents)<\/code><\/pre>\n\n\n\n<p><strong>Why?<\/strong> Chunking preserves context and improves matching relevance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Embedding &amp; Vector Storage<\/h2>\n\n\n\n<p>Each chunk is converted to a vector and stored in a vector database like FAISS, Pinecone, or Weaviate.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.vectorstores import FAISS\nfrom langchain.embeddings.openai import OpenAIEmbeddings\n\nembedding = OpenAIEmbeddings()\nvectorstore = FAISS.from_documents(splits, embedding)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">4. Retrieval<\/h2>\n\n\n\n<p>This allows semantic search over your data using user queries.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>retriever = vectorstore.as_retriever()\nquery = \"What did the email say about the product launch?\"\nrelevant_docs = retriever.get_relevant_documents(query)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">5. 
<h2 class=\"wp-block-heading\">5. LLM Response Generation<\/h2>\n\n\n\n<p>The final step: the retrieved chunks are sent to a language model, together with the question, to generate an answer.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.chains.question_answering import load_qa_chain\nfrom langchain.llms import OpenAI\n\nllm = OpenAI()\nchain = load_qa_chain(llm, chain_type=\"stuff\")\n\nresponse = chain.run(input_documents=relevant_docs, question=query)\nprint(response)<\/code><\/pre>\n\n\n\n
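<p><strong>Note:<\/strong> steps 4 and 5 can also be collapsed into a single chain. The sketch below uses LangChain\u2019s classic <code>RetrievalQA<\/code> helper with the <code>llm<\/code> and <code>retriever<\/code> defined above; treat it as an optional convenience rather than the canonical approach of this tutorial.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.chains import RetrievalQA\n\n# Sketch: one chain that retrieves relevant chunks and answers in a single call\nqa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=retriever)\nanswer = qa_chain.run(\"What did the email say about the product launch?\")\nprint(answer)<\/code><\/pre>\n\n\n\n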
Embedding &amp; Vector Storage<\/h2>\n\n\n\n<p>Each chunk is converted to a vector and stored in a vector database like FAISS, Pinecone, or Weaviate.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.vectorstores import FAISS\nfrom langchain.embeddings.openai import OpenAIEmbeddings\n\nembedding = OpenAIEmbeddings()\nvectorstore = FAISS.from_documents(splits, embedding)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">4. Retrieval<\/h2>\n\n\n\n<p>This allows semantic search over your data using user queries.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>retriever = vectorstore.as_retriever()\nquery = \"What did the email say about the product launch?\"\nrelevant_docs = retriever.get_relevant_documents(query)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">5. LLM Response Generation<\/h2>\n\n\n\n<p>The final step: retrieved chunks are sent to a language model to generate an answer.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.chains.question_answering import load_qa_chain\nfrom langchain.llms import OpenAI\n\nllm = OpenAI()\nchain = load_qa_chain(llm, chain_type=\"stuff\")\n\nresponse = chain.run(input_documents=relevant_docs, question=query)\nprint(response)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"> Full Pipeline Recap<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain.document_loaders import TextLoader\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom langchain.vectorstores import FAISS\nfrom langchain.embeddings.openai import OpenAIEmbeddings\nfrom langchain.chains.question_answering import load_qa_chain\nfrom langchain.llms import OpenAI\n\nloader = TextLoader(\".\/data\/email.txt\")\ndocuments = loader.load()\n\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\nsplits = text_splitter.split_documents(documents)\n\nembedding = OpenAIEmbeddings()\nvectorstore = FAISS.from_documents(splits, embedding)\n\nretriever = vectorstore.as_retriever()\nquery = \"What did the email say about the product launch?\"\ndocs = retriever.get_relevant_documents(query)\n\nllm = OpenAI()\nchain = load_qa_chain(llm, chain_type=\"stuff\")\nresponse = chain.run(input_documents=docs, question=query)\n\nprint(response)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Final Thoughts<\/h2>\n\n\n\n<p>LangChain makes it surprisingly easy to build an AI system that can <strong>answer questions using your own documents<\/strong>\u2014a game-changer for business intelligence, legal research, customer support, and more.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li> Control your data privacy<\/li>\n\n\n\n<li> Build custom GPT-like tools<\/li>\n\n\n\n<li> Integrate with Notion, websites, PDFs, etc.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Build a Private Knowledge Base with LangChain + RAG (Full Tutorial with Code) Retrieval-Augmented Generation (RAG) is a powerful framework that allows you to ask questions over your own documents using a Large Language Model (LLM). 
In this post, we\u2019ll walk through how to use LangChain to build a RAG system\u2014from loading PDFs to querying [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-54","post","type-post","status-publish","format-standard","hentry","category-tutorial"],"_links":{"self":[{"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/posts\/54","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/comments?post=54"}],"version-history":[{"count":1,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/posts\/54\/revisions"}],"predecessor-version":[{"id":56,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/posts\/54\/revisions\/56"}],"wp:attachment":[{"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/media?parent=54"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/categories?post=54"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.ashwang.net\/index.php\/wp-json\/wp\/v2\/tags?post=54"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}