Reading long articles often means scrolling and searching for specific details. I built a Website QA Assistant that lets users paste any article link (like a Medium post) and ask natural language questions about its content. The assistant extracts the text, embeds it, and answers questions using a retrieval-augmented generation (RAG) pipeline.
Objective
The goal was to build a lightweight, local-friendly assistant that:
- Accepts a webpage URL
- Extracts and chunks the text
- Embeds it into a vector database
- Retrieves relevant chunks for a user query
- Uses an LLM to generate an accurate answer
Approach
- Framework: LangChain for chaining prompts, retrievers, and embeddings
- Content Loading: UnstructuredURLLoader to fetch webpage text
- Chunking: RecursiveCharacterTextSplitter for manageable text chunks
- Embeddings: MiniLM sentence-transformer
- Vector DB: ChromaDB for semantic search
- LLMs Tried:
- GPT-2 → too generic, not instruction-following
- Flan-T5 Base → lightweight, worked but vague answers
- Flan-T5 Large → better structure, slower but clearer answers
- Hosted LLMs (Mistral, Falcon, Alpaca via HF API) → blocked by API size/task limits
- Flan-Alpaca Large (local) → final choice, balanced clarity and performance
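The idea behind RecursiveCharacterTextSplitter is worth spelling out: try coarse separators (paragraphs) first, and only fall back to finer ones (sentences, words) when a piece is still too long. A simplified sketch of that logic (not the library's exact algorithm, which also merges pieces back up to the chunk size with overlap):

```python
# Simplified sketch of recursive character splitting: coarse separators
# first, finer ones only for pieces that remain too long.
def recursive_split(text, max_len=100, seps=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    out = []
    for piece in text.split(head):
        if len(piece) <= max_len:
            out.append(piece)
        else:
            # Piece still too long: retry with the next, finer separator.
            out.extend(recursive_split(piece, max_len, tuple(rest)))
    return [p for p in out if p.strip()]

chunks = recursive_split("para one.\n\n" + "sentence. " * 30, max_len=80)
print(len(chunks), max(len(c) for c in chunks))
```

Keeping splits aligned with natural boundaries like this matters downstream: chunks that cut mid-sentence embed poorly and retrieve noisy context.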
Key Challenges
- Instruction-following: Smaller models struggled to follow Q&A prompts.
- API limits: Most hosted models were too large or unsupported.
- Answer clarity: Required custom prompt design to avoid vague or hallucinated responses.
Final Solution
The assistant now runs locally, serving Flan-Alpaca Large through Hugging Face’s pipeline API.
- Generates structured answers from retrieved chunks
- Reliable performance for multi-part queries
- Custom prompts ensure clarity and completeness
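The custom prompt is the piece that curbed vague and hallucinated answers: it forces the model to stay inside the retrieved context and to admit when the context does not contain the answer. A template of this kind (the project's exact wording may differ) looks like:

```python
# Illustrative grounding prompt: constrain the model to the retrieved
# chunks and give it an explicit "I don't know" escape hatch.
PROMPT = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say \"I don't know.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(chunks, question):
    # Join the retrieved chunks with a visible divider so the model
    # sees them as separate passages.
    return PROMPT.format(context="\n---\n".join(chunks), question=question)

print(build_prompt(["RAG retrieves chunks before generation."],
                   "What does RAG retrieve?"))
```

The same string is then passed to the local Flan-Alpaca pipeline; because the instruction and the escape hatch are explicit, the model is far less likely to pad the answer with generic filler.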
Results
- Successfully answers user questions about arbitrary articles
- Handles multi-part queries better than smaller baselines
- Runs locally without reliance on hosted endpoints
What I Learned
- Not all Hugging Face models are API-compatible → always check the model card.
- Local models + prompt engineering can beat small hosted APIs in usability.
- Prompt design really matters → clarity improved dramatically after customization.
- Trade-off: local inference offers control but needs compute; APIs are easier but often limited.
Next Steps
- Add conversational memory for follow-up questions
- Explore model quantization for faster local inference
- Deploy as a lightweight web app instead of command-line only
This project shows how RAG pipelines and model iteration can turn unstructured web content into an interactive Q&A experience — a workflow directly applicable to knowledge assistants, research tools, and customer self-service products.
