DIY AI Research Assistant Guide

What if, as a qualitative researcher, you could talk to your documents? I recently spent a week getting an automated research assistant to cite page numbers correctly without crashing on file names that contain spaces. Here is the architecture that worked.

Using n8n (an automation engine) and a bit of AI, I built a system that does exactly this. Here is how you can set it up yourself.

1. The Preparation

Before building, you need to gather your tools. In my case, the choice of storage was already decided by the researcher’s existing workflow.

Data Storage (Nextcloud): The researcher I worked with already used Nextcloud to manage their documents. It served as the central hub where they dropped PDFs for analysis.

Component	Tool Choice	Why This Specific One?
Automation	n8n	Better at handling complex logic than Zapier.
Gateway	OpenRouter	Lets me swap models (Claude vs. GPT) without rewriting the workflow.
Vector DB	Pinecone	Serverless, with cosine similarity search built in.

2. The Setup: Connecting the Data (WebDAV)

The first goal was to create a secure bridge between Nextcloud and n8n so the files could be accessed automatically.

The Problem: While Nextcloud offers several ways to connect, finding the specific "WebDAV" URL that n8n would accept was a trial-and-error process. Standard URLs provided by the interface often resulted in "404 Not Found" errors.
The Process:
1. In Nextcloud, I went to Settings > Security and created an App Password named "n8n-RAG."
2. In n8n, I created a Nextcloud Credential using the Nextcloud API (not OAuth2, as the API is faster to set up and more stable for this use case).
3. I eventually found that the most reliable URL format for the connection was: https://cloud.domain.com/remote.php/webdav/.
4. I ensured the username was the plain text username (not an email address) and the password was the App Password I just generated.
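
The connection described above can be checked outside n8n by sending a WebDAV PROPFIND request yourself. The sketch below is a minimal check, assuming Node.js 18 or later (for the built-in fetch). The function names, the username alice, and the folder name are hypothetical placeholders, not part of the original workflow.

```javascript
// Hypothetical helper: builds a WebDAV PROPFIND request for a Nextcloud folder.
// baseUrl is the WebDAV URL format given above; all other values are placeholders.
function buildPropfindRequest(baseUrl, username, appPassword, folder = "") {
  // Encode each path segment so spaces become %20 but slashes survive.
  const encodedFolder = folder
    .split("/")
    .filter(Boolean)
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  const root = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  // HTTP Basic auth: base64 of "username:password".
  const token = Buffer.from(`${username}:${appPassword}`).toString("base64");
  return {
    url: root + encodedFolder,
    options: {
      method: "PROPFIND",
      headers: {
        Authorization: `Basic ${token}`,
        Depth: "1", // list the folder and its direct children only
      },
    },
  };
}

// Sends the request and returns the HTTP status code.
// WebDAV answers a successful PROPFIND with 207 Multi-Status; a 404 usually
// points at the URL path, a 401 at the username or App Password.
async function checkWebdav(baseUrl, username, appPassword, folder) {
  const { url, options } = buildPropfindRequest(baseUrl, username, appPassword, folder);
  const response = await fetch(url, options);
  return response.status;
}
```

If the bridge is set up correctly, calling checkWebdav with the WebDAV URL, the plain username, and the App Password should resolve to 207. Anything else suggests fixing the URL or credentials before debugging the n8n side.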

3. The Setup: Text Extraction and Cleaning

Once connected, the tool needs to "read" the PDFs and prepare the text for the database.

The Problem: PDFs are visual containers, not text files. If you simply pull the text, you lose the page structure. Additionally, if a filename had a space in it, n8n’s download node would fail because it couldn't interpret the encoded characters (like %20).
The Process:
Filter: I added a Filter Node so the workflow only processes files ending in .pdf.
Download: I used the Nextcloud Download Node. To fix the 404 errors on file names with spaces, I used this expression in the File Path: {{ decodeURI($json.path).startsWith('/') ? decodeURI($json.path) : '/' + decodeURI($json.path) }}. It decodes percent-encoded characters such as %20 back into plain spaces and makes sure the path starts with a slash.
The "Source Stamp" Script: To ensure citations were accurate, I used a JavaScript Node. Instead of just splitting text, this script forces a metadata "stamp" onto every chunk. It prepends the filename and the estimated page number to the text block so the AI can always see where the information originated.
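
The "Source Stamp" idea can be sketched roughly as follows. This is not the original script: the chunk size, the characters-per-page estimate, the stamp format, and the function name stampChunks are all assumptions chosen for illustration.

```javascript
// Assumed values, not taken from the original workflow: tune for your documents.
const CHUNK_SIZE = 1000;     // characters per chunk
const CHARS_PER_PAGE = 3000; // rough average used to estimate the page number

// Splits extracted text into chunks and prepends a source stamp to each one.
function stampChunks(text, fileName, chunkSize = CHUNK_SIZE, charsPerPage = CHARS_PER_PAGE) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize) {
    // Estimate the page from the chunk's character offset (pages start at 1).
    const page = Math.floor(start / charsPerPage) + 1;
    const body = text.slice(start, start + chunkSize);
    chunks.push({
      // The stamp travels inside the text, so the model sees it with every chunk.
      text: `[Source: ${fileName} | Page: ~${page}]\n${body}`,
      metadata: { fileName, page },
    });
  }
  return chunks;
}

// Example: 7,000 characters of dummy text from a hypothetical file.
const demo = stampChunks("x".repeat(7000), "Interview 01.pdf");
console.log(demo.length);                 // 7
console.log(demo[3].text.split("\n")[0]); // [Source: Interview 01.pdf | Page: ~2]
```

In an n8n JavaScript Node, a function like this would be fed the extracted text of each incoming file and return one item per chunk. If your extraction step exposes real page boundaries, using those instead of the character-count estimate should give more reliable citations.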

4. The Setup: Vector Storage (Pinecone)

This stage involves converting that cleaned text into mathematical "embeddings" and storing them in Pinecone.

The Process:
Create Index: In Pinecone, I created a Serverless Index with 1536 dimensions (the output size of text-embedding-3-small) and the Cosine metric.
Connect Embeddings: I attached an Embeddings OpenAI node to the Pinecone node. To route it through OpenRouter, I set the Base URL to https://openrouter.ai/api/v1 and manually typed the model name: openai/text-embedding-3-small.
Set the Limit: In the Pinecone node, I set the Limit field to 10. This ensures that for every question, the AI looks at the 10 most relevant text chunks (not whole pages) to find an answer.
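
To make the Cosine metric and the Limit setting concrete, here is a conceptual sketch of what a similarity search does: score every stored chunk against the question's embedding and keep the best few. Pinecone performs this search for you on its side; the function names and the toy three-dimensional vectors below are illustrative assumptions, not Pinecone's implementation.

```javascript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the `limit` stored chunks most similar to the query vector.
function topK(queryVector, storedChunks, limit = 10) {
  return storedChunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score) // highest score first
    .slice(0, limit);
}

// Toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.
const stored = [
  { id: "chunk-a", vector: [1, 0, 0] },
  { id: "chunk-b", vector: [0, 1, 0] },
  { id: "chunk-c", vector: [1, 1, 0] },
];
console.log(topK([1, 0, 0], stored, 2).map((r) => r.id)); // [ 'chunk-a', 'chunk-c' ]
```

A higher limit gives the model more context per question at the cost of a longer prompt, which is the trade-off behind choosing 10.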

5. The Setup: The Retrieval Interface

The final step was building the chat tool the researcher would actually use.

The Problem: Initially, the AI was "forgetting" the previous parts of the conversation. If the researcher asked a follow-up question, the AI lacked the context to answer correctly.
The Process:
AI Agent: I used an AI Agent Node set to "Tools Agent" mode.
Connect Tools: I connected the Pinecone node to the Agent as a Vector Store Tool and renamed the tool to ResearchSearch (avoiding spaces to prevent errors).
Memory: I added a Simple Memory Node with a Context Window of 10. This allows the researcher to have a continuous conversation where the AI remembers the last 10 messages.
The Interface: I activated the Chat Trigger node, which provides a "Production URL." This is a simple web link the researcher can open in any browser to start chatting with their documents.
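
The memory behaviour described above can be pictured as a sliding window over the conversation. The sketch below is conceptual only: the WindowMemory class is hypothetical and is not n8n's internal implementation, and it counts individual messages as the text describes, so it is worth checking n8n's documentation for how the node itself counts its window.

```javascript
// Conceptual sliding-window memory: keeps only the most recent messages.
class WindowMemory {
  constructor(windowSize = 10) {
    this.windowSize = windowSize;
    this.messages = [];
  }

  // Stores a message and drops the oldest ones beyond the window.
  add(role, content) {
    this.messages.push({ role, content });
    if (this.messages.length > this.windowSize) {
      this.messages = this.messages.slice(-this.windowSize);
    }
  }

  // Messages that would be sent to the model along with the next question.
  context() {
    return [...this.messages];
  }
}

// Twelve alternating messages go in; only the last ten remain.
const memory = new WindowMemory(10);
for (let i = 1; i <= 12; i++) {
  memory.add(i % 2 === 1 ? "user" : "assistant", `message ${i}`);
}
console.log(memory.context().length);     // 10
console.log(memory.context()[0].content); // message 3
```

Without a window like this, each question reaches the model with no history, which is why follow-up questions failed at first.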
