Reading long articles often means scrolling and searching for specific details. I built a Website QA Assistant that lets users paste any article link (like a Medium post) and ask natural language questions about its content. The assistant extracts the text, embeds it, and answers questions using a retrieval-augmented generation (RAG) pipeline.
Objective
The goal was to build a lightweight, local-friendly assistant that:
- Accepts a webpage URL
- Extracts and chunks the text
- Embeds it into a vector database
- Retrieves relevant chunks for a user query
- Uses an LLM to generate an accurate answer
Approach
- Framework: LangChain for chaining prompts, retrievers, and embeddings
- Content Loading: UnstructuredURLLoader to fetch webpage text
- Chunking: RecursiveCharacterTextSplitter for manageable text chunks
- Embeddings: MiniLM sentence-transformer
- Vector DB: ChromaDB for semantic search
- LLMs Tried:
- GPT-2 → too generic, not instruction-following
- Flan-T5 Base → lightweight, worked but vague answers
- Flan-T5 Large → better structure, slower but clearer answers
- Hosted LLMs (Mistral, Falcon, Alpaca via HF API) → blocked by API size/task limits
- Flan-Alpaca Large (local) → final choice, balanced clarity and performance
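The idea behind RecursiveCharacterTextSplitter is worth spelling out: try coarse separators (paragraphs) first, and only fall back to finer ones (sentences, words) when a piece is still too long. A simplified sketch of that logic (not the library's exact algorithm, which also merges pieces back up to the chunk size with overlap):

```python
# Simplified sketch of recursive character splitting: coarse separators
# first, finer ones only for pieces that remain too long.
def recursive_split(text, max_len=100, seps=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    out = []
    for piece in text.split(head):
        if len(piece) <= max_len:
            out.append(piece)
        else:
            # Piece still too long: retry with the next, finer separator.
            out.extend(recursive_split(piece, max_len, tuple(rest)))
    return [p for p in out if p.strip()]

chunks = recursive_split("para one.\n\n" + "sentence. " * 30, max_len=80)
print(len(chunks), max(len(c) for c in chunks))
```

Keeping splits aligned with natural boundaries like this matters downstream: chunks that cut mid-sentence embed poorly and retrieve noisy context.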
Key Challenges
- Instruction-following: Smaller models struggled to follow Q&A prompts.
- API limits: Most hosted models were too large or unsupported.
- Answer clarity: Required custom prompt design to avoid vague or hallucinated responses.
Final Solution
The assistant now runs locally, serving Flan-Alpaca Large through Hugging Face’s pipeline API.
- Generates structured answers from retrieved chunks
- Reliable performance for multi-part queries
- Custom prompts ensure clarity and completeness
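The custom prompt is the piece that curbed vague and hallucinated answers: it forces the model to stay inside the retrieved context and to admit when the context does not contain the answer. A template of this kind (the project's exact wording may differ) looks like:

```python
# Illustrative grounding prompt: constrain the model to the retrieved
# chunks and give it an explicit "I don't know" escape hatch.
PROMPT = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say \"I don't know.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(chunks, question):
    # Join the retrieved chunks with a visible divider so the model
    # sees them as separate passages.
    return PROMPT.format(context="\n---\n".join(chunks), question=question)

print(build_prompt(["RAG retrieves chunks before generation."],
                   "What does RAG retrieve?"))
```

The same string is then passed to the local Flan-Alpaca pipeline; because the instruction and the escape hatch are explicit, the model is far less likely to pad the answer with generic filler.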
Results
- Successfully answers user questions about arbitrary articles
- Handles multi-part queries better than smaller baselines
- Runs locally without reliance on hosted endpoints
What I Learned
- Not all Hugging Face models are API-compatible → always check the model card.
- Local models + prompt engineering can beat small hosted APIs in usability.
- Prompt design really matters → clarity improved dramatically after customization.
- Trade-off: local inference offers control but needs compute; APIs are easier but often limited.
Next Steps
- Add conversational memory for follow-up questions
- Explore model quantization for faster local inference
- Deploy as a lightweight web app instead of command-line only
This project shows how RAG pipelines and model iteration can turn unstructured web content into an interactive Q&A experience — a workflow directly applicable to knowledge assistants, research tools, and customer self-service products.
