Retrieval-augmented generation (RAG) is a method which improves language model responses by retrieving relevant external information before generating an answer.
It has become the go-to method for grounding language model outputs in real-world information because it directly addresses hallucination, a problem that often affects standalone large language models (LLMs). RAG is widely applied in enterprise search, document question answering, customer support, legal research, and healthcare systems, where accurate and up-to-date information is required.
However, the standard implementation of RAG has already started showing cracks. Current pipelines split documents into chunks, convert them into numerical embeddings, and store them in vector databases. In practice, this breaks tables apart from headings, separates paragraphs from their parent sections, and relies on embedding models that were never trained for specialised fields like law or medicine. Even when the system retrieves the closest chunks, closeness in the vector space does not always mean closeness in meaning. Consider a medical research paper in which a dosage recommendation buried in the results section only makes sense alongside the patient criteria defined earlier in the methodology. A standard RAG pipeline, blind to these dependencies, may retrieve one fragment and discard the other, leaving the model to reason over half the picture. A page-level approach aims to preserve that connection by retrieving both sections together.
Because page-level retrieval operates on whole pages and uses the language model’s own reasoning to drive retrieval, it is gaining popularity compared to other retrieval methods. This article examines where page-level retrieval outperforms traditional vector-based RAG, where it still falls short, and the way forward.
Traditional RAG follows four steps: process the query, retrieve relevant data, augment the prompt, and generate text. Documents are split into smaller chunks, each converted into an embedding and stored in a vector database. At query time, the system embeds the question, finds the most numerically similar chunks, and feeds them as context to the LLM.
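The four steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and every function name and the sample document are assumptions made for the example. The generation step is left out; the final prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Augment the prompt with the retrieved chunks as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Invented sample document, used only for illustration.
document = (
    "Refunds are issued within 14 days of a return request. "
    "Shipping is free for orders above 50 euros. "
    "Support is available by email on weekdays."
)

# Indexing: chunk, embed, store (a list stands in for the vector database).
store = [(c, embed(c)) for c in chunk(document)]

# Query time: embed the question, find the closest chunk, build the prompt.
question = "How many days until refunds are issued?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Note that the retrieved chunk ends mid-sentence ("within 14 days of a"), because the fixed word count cut the sentence in two. That is a small instance of the structural problem discussed next.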
Although this pipeline, popularised through frameworks such as LlamaIndex and AutoGen, is now widely used, it carries structural problems that surface-level fixes cannot resolve.
Vector-based RAG is limited wherever retrieval has to preserve structure, interpret specialised language, and justify its choices. Its main challenges are:
Chunking severs connections between related elements: tables are split from captions, and numbered clauses are separated from their definitions. Even advanced splitting strategies cannot reliably preserve these links.
Embedding models trained on broad internet data falter with specialised terminology in healthcare, legal, or engineering contexts. Swapping models means re-embedding the entire index, which is costly and time-consuming.
Cosine similarity measures numerical distance, not meaning. Searching for the latest model’s performance could surface results about a financial forecasting model or a 3D product design model. The system sees the word ‘model’, not the user’s intent.
Vectorisation treats every text fragment identically. Sections, chapters, footnotes, and tables lose the structural relationships that gave them meaning originally.
Embedding generation incurs API charges, and vector databases can be expensive to run. Migrations, backups, and scaling also require ongoing engineering effort.
A high similarity score tells a compliance officer nothing about why a result was chosen. Regulated industries need human-readable reasoning, which is something vector retrieval rarely provides.
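The first of these challenges, chunking that severs related elements, is easy to reproduce. The sketch below is a hypothetical illustration: the table contents are invented and the splitter is a deliberately naive fixed-size one, so it shows the failure mode rather than the behaviour of any particular library.

```python
def fixed_size_chunks(lines: list[str], lines_per_chunk: int) -> list[list[str]]:
    """Naive splitter: group consecutive lines without regard to structure."""
    return [lines[i:i + lines_per_chunk] for i in range(0, len(lines), lines_per_chunk)]


# Invented page content: a caption, a header row, three data rows, and a note.
page = [
    "Table 3: Late payment penalties by contract tier",
    "Tier | Penalty (percent per month)",
    "Basic | 2.0",
    "Standard | 1.5",
    "Premium | 1.0",
    "Note: penalties are waived during an agreed payment plan.",
]

chunks = fixed_size_chunks(page, lines_per_chunk=2)

# Suppose retrieval returns the chunk that mentions the Premium tier.
hit = next(c for c in chunks if any(line.startswith("Premium") for line in c))

# The retrieved chunk has lost both the caption and the column header.
has_caption = any(line.startswith("Table 3") for line in hit)
has_header = any("Penalty (percent per month)" in line for line in hit)
print(hit, has_caption, has_header)
```

The retrieved chunk contains the value 1.0 but nothing that says what the number measures, which is the kind of broken link that smarter splitting can reduce but not reliably eliminate.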
Page-level retrieval treats each page as a separate unit while keeping its original formatting. It then creates a hierarchical index that the LLM can use directly for reasoning.
When documents are added, they are processed one page at a time, keeping all the original elements such as tables, headings, images, and cross-references. For each page, the system creates a structured metadata record that includes summaries, main points, and identifiable entities. Together, these records form a complete catalogue that is created once and can be reused for all later questions. When a question is asked, the model looks at both the question and the entire page-level index. It then identifies the relevant pages by understanding the language rather than by vector proximity, fetches their full content, and generates a response with page-level citations that a person can check.
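The flow described above can be sketched as follows. Everything here is a hypothetical illustration rather than the API of any named tool: `PageRecord`, `build_index`, and `retrieve_pages` are invented names, the summariser simply takes the first sentence where a real system would ask an LLM, and `keyword_selector` is a crude stand-in for the LLM call that reads the index and chooses pages.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageRecord:
    page: int      # page number, reused later for citations
    summary: str   # short description the selector reads instead of the full page


def first_sentence(text: str) -> str:
    """Placeholder summariser; a real system would ask an LLM for a summary."""
    return text.split(". ")[0].rstrip(".") + "."


def build_index(pages: dict[int, str]) -> list[PageRecord]:
    """Built once at ingestion time and reused for every later question."""
    return [PageRecord(n, first_sentence(t)) for n, t in sorted(pages.items())]


def render_index(index: list[PageRecord]) -> str:
    """Text form of the catalogue, i.e. what would be placed in the LLM prompt."""
    return "\n".join(f"[page {r.page}] {r.summary}" for r in index)


def keyword_selector(question: str, index: list[PageRecord]) -> list[int]:
    """Stand-in for the LLM: pick pages whose summary shares a longer word with the question."""
    q = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
    return [r.page for r in index if q & set(re.findall(r"[a-z]+", r.summary.lower()))]


def retrieve_pages(
    question: str,
    pages: dict[int, str],
    index: list[PageRecord],
    select: Callable[[str, list[PageRecord]], list[int]] = keyword_selector,
) -> list[tuple[int, str]]:
    """Select pages from the index, then return their full, unsplit content."""
    return [(n, pages[n]) for n in select(question, index)]


# Invented three-page document, used only for illustration.
pages = {
    1: "Eligibility criteria for the premium plan. Customers must hold an account for 12 months.",
    2: "Pricing for the premium plan. The fee is listed in the appendix.",
    3: "Cancellation terms. Notice must be given in writing.",
}

index = build_index(pages)
hits = retrieve_pages("What are the cancellation terms?", pages, index)
for number, text in hits:
    print(f"(page {number}) {text}")
```

The `select` parameter is the point of the design: swapping the keyword stand-in for a function that sends `render_index(index)` and the question to a language model changes the quality of selection without changing the rest of the flow, and the returned page numbers double as citations.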
Several tools work with page-level or structure-aware retrieval. For instance, PageIndex creates a document map that large language models can read, so that pages are selected through reasoning over that map. LlamaIndex Routers use language model reasoning to decide which sub-indices are most relevant to a query and to navigate document summaries. GraphRAG maps entities and their relationships into a graph structure, allowing retrieval to follow connections across concepts rather than relying only on similarity between isolated chunks. The unifying principle across these approaches is that retrieval should consider the text’s structure and linguistic comprehension, rather than just numerical proximity.
Although page-level retrieval can outperform vector-based RAG, it still has limitations. The context window caps how large an index can be, so thousand-page documents may still need hierarchical summarisation. Metadata quality also affects results, since the model selects pages from their summaries rather than from the full text. Latency is another challenge: vector-based retrieval can return results in milliseconds, whereas LLM-driven reasoning across large indexes could add seconds of delay. Generating summaries for every page adds cost for large or frequently updated collections, and the ecosystem is younger than the tooling that has grown up around vector search.
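The context-window limit comes down to simple arithmetic. The figures below are illustrative assumptions, not measurements: a 128,000-token window, roughly 150 tokens per page record, and 8,000 tokens reserved for the question, instructions, and answer. Real values depend on the model and on how verbose the metadata is.

```python
def max_indexable_pages(context_window: int, tokens_per_record: int, reserved_tokens: int) -> int:
    """How many page records fit in one prompt after reserving room for everything else."""
    budget = context_window - reserved_tokens
    return max(0, budget // tokens_per_record)


# Assumed figures, chosen only to make the arithmetic concrete.
CONTEXT_WINDOW = 128_000
TOKENS_PER_RECORD = 150
RESERVED_TOKENS = 8_000

capacity = max_indexable_pages(CONTEXT_WINDOW, TOKENS_PER_RECORD, RESERVED_TOKENS)

# A document longer than `capacity` pages needs a second level of summaries
# (for example, one record per chapter) above the per-page records.
needs_hierarchy = 1_000 > capacity
print(capacity, needs_hierarchy)
```

Under these assumptions a flat index tops out at 800 pages, so a 1,000-page document would already need a chapter-level layer above the page records.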
A hybrid architecture comprising vector-based RAG and page-level retrieval could fill these gaps by utilising vector search or BM25 for quick initial filtering and then passing the job to a page-aware layer for accurate, context-rich selection.
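A two-stage version of this idea might look like the sketch below. It is an illustration under stated assumptions: stage one uses plain term overlap as a cheap stand-in for BM25 or vector search, and stage two takes a `select` function that a real system would implement as an LLM call over the shortlisted pages. The function names, the sample pages, and the placeholder selector are all invented for the example.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased word set used for the cheap first-stage score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prefilter(question: str, pages: dict[int, str], k: int) -> list[int]:
    """Stage 1: fast, coarse shortlist by term overlap (stand-in for BM25 or vector search)."""
    q = tokens(question)
    ranked = sorted(pages, key=lambda n: (-len(q & tokens(pages[n])), n))
    return ranked[:k]


def hybrid_retrieve(
    question: str,
    pages: dict[int, str],
    select: Callable[[str, dict[int, str]], list[int]],
    k: int = 2,
) -> list[int]:
    """Stage 2: hand only the shortlisted pages to the slower page-aware selector."""
    shortlist = prefilter(question, pages, k)
    candidates = {n: pages[n] for n in shortlist}
    return select(question, candidates)


def placeholder_select(question: str, candidates: dict[int, str]) -> list[int]:
    """Placeholder for the LLM: keep candidates that actually discuss exceptions."""
    return [n for n, text in candidates.items() if "exception" in text.lower()]


# Invented pages, used only for illustration.
pages = {
    1: "Late payment penalty: 2 percent per month on overdue invoices.",
    2: "Payment methods: bank transfer or card.",
    3: "Office locations and opening hours.",
    4: "Penalty exceptions: no late fee during an agreed payment plan.",
}

question = "Are there exceptions to the late payment penalty?"
shortlist = prefilter(question, pages, k=2)
result = hybrid_retrieve(question, pages, placeholder_select, k=2)
print(shortlist, result)
```

The first stage keeps the expensive reasoning step away from irrelevant pages, which addresses both the latency and the context-window limits, while the second stage makes the final, more precise choice.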
Some of the areas where page-level retrieval might help are:
Numbered sections, cross-references, footnotes, and defined terms stay connected, producing complete answers that include penalty schedules and exception clauses.
Hierarchical structure, parameter tables, and troubleshooting steps maintain their relationships, so answers reflect the document’s actual logic.
Balance sheets and income statements remain intact alongside their notes, enabling detailed responses to queries about year-over-year performance changes.
Lab values appear alongside reference ranges, clinical notes, and patient history, meeting the safety and accuracy standards the domain demands.
In future, retrieval will probably combine methods, using vector and keyword searches for broad filtering and structured page-level reasoning for exact selection. Building on this foundation, the next wave of development is already taking shape. Metadata for visual elements will make page records richer, real-time updates will keep page-level indexes current, and tighter integration with agentic workflows will enable multi-step analysis across complex documents. As open-source LLMs improve, structured retrieval running on-premises will become increasingly valuable for industries where privacy is not a preference but a requirement.
Page-level retrieval marks a meaningful shift in RAG design. By letting language models interpret structured, intact document content rather than abstract vector representations, it preserves critical relationships, improves transparency, reduces infrastructure complexity, and places intelligent comprehension at the heart of retrieval. In domains such as law, finance, technology, and healthcare, where structure matters most, this approach is already proving transformative. As context windows expand and models sharpen, its significance will only grow, though whether this ultimately leads to systems that genuinely understand documents or merely become more sophisticated at pattern-matching remains an open question. What is clear, however, is that page-level retrieval brings us meaningfully closer to that goal.