BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain

The paper introduces BSharedRAG, a framework that enhances e-commerce retrieval and generation tasks by using a shared domain-specific backbone and LoRA modules, achieving significant performance improvements.

Gaoling School of Artificial Intelligence, Renmin University of China

Abstract

Retrieval-Augmented Generation (RAG) systems are important in domains such as e-commerce, which have many long-tail entities and frequently updated information. Most existing works adopt separate modules for retrieval and generation, which may be suboptimal since the two tasks cannot benefit from each other to improve performance. We propose a novel Backbone Shared RAG framework (BSharedRAG). It first uses a domain-specific corpus to continually pre-train a base model into a domain-specific backbone, and then trains two plug-and-play Low-Rank Adaptation (LoRA) modules on the shared backbone to minimize the retrieval and generation losses, respectively. Experimental results indicate that BSharedRAG outperforms baseline models by 5% and 13% in Hit@3 on two datasets in retrieval evaluation, and by 23% in terms of BLEU-3 in generation evaluation.
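As a minimal sketch of the backbone-sharing idea, assuming Hugging Face Transformers and PEFT; the checkpoint name, LoRA hyperparameters, and target modules below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: one shared backbone, two plug-and-play LoRA adapters.
# Checkpoint and LoRA hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base",  # stand-in for the continually pre-trained domain backbone
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["W_pack"],  # Baichuan's fused QKV projection; an assumption here
    task_type="CAUSAL_LM",
)

# Attach two independent adapters to the same frozen backbone.
model = get_peft_model(backbone, lora_cfg, adapter_name="retrieval")
model.add_adapter("generation", lora_cfg)

# Training activates one adapter at a time: optimize its task loss while the
# other adapter and the shared backbone stay untouched.
model.set_adapter("retrieval")   # optimize the retrieval loss
model.set_adapter("generation")  # optimize the generation loss
```

Because the backbone is frozen and shared, only one copy of the base weights is kept in memory, while each task gets its own lightweight, plug-and-play adapter.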

Method

Figure 1. The architecture of the proposed BSharedRAG framework. The shared domain-specific backbone is obtained by continually pre-training a base model on a domain-specific corpus; two plug-and-play LoRA modules are then trained on top of the shared backbone to minimize the retrieval and generation losses, respectively.
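For concreteness, one plausible formulation of the two objectives, assuming an InfoNCE-style contrastive loss for retrieval (with hard negatives, as in Table 2) and standard next-token cross-entropy for generation; the paper's exact loss definitions may differ:

```latex
% Retrieval: contrastive loss over a positive document d+ and negatives d-,
% where sim(.,.) is a similarity (e.g., cosine) between backbone embeddings
% and tau is a temperature. Generation: token-level cross-entropy on answer y.
\mathcal{L}_{\mathrm{ret}}
  = -\log
    \frac{\exp\!\big(\mathrm{sim}(q, d^{+})/\tau\big)}
         {\exp\!\big(\mathrm{sim}(q, d^{+})/\tau\big)
          + \sum_{d^{-}} \exp\!\big(\mathrm{sim}(q, d^{-})/\tau\big)}
\qquad
\mathcal{L}_{\mathrm{gen}}
  = -\sum_{t=1}^{|y|} \log p_{\theta}\big(y_{t} \mid y_{<t},\, q,\, d^{+}\big)
```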


WorthBuying Dataset

Table 1. Comparison of Product Question Answering (PQA) datasets. The average document word count is computed over a sample of documents from each dataset. Document types are classified into Product Reviews (PR), Product Information (PI), and Product Analysis from professional users (PA).

Figure 2. Partial categories of WorthBuying dataset

We propose the WorthBuying dataset with 735K high-quality documents, 50K Question-Document-Answer (QDA) tuples, and human-annotated test data: relevant documents for 1K questions and 500 QA pairs. The knowledge base in our dataset comes from professional users, which reduces conflicts and errors, and is more informative, with 1.1K words per document rather than the few dozen words typical of existing e-commerce knowledge bases. We also annotate high-quality QA pairs with GPT-4 and manually review the test set.
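For illustration, one way a QDA tuple might be laid out; the field names and the example values here are hypothetical, not the released schema:

```python
# Hypothetical layout of one Question-Document-Answer (QDA) tuple;
# field names and values are invented for illustration.
qda_record = {
    "question": "Which robot vacuum handles pet hair best?",          # invented example
    "document": "<~1.1K-word product analysis by a professional user>",
    "answer":   "<answer grounded in the document, drafted with GPT-4 and reviewed>",
    "category": "Home Appliances",
}
```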

Experiment

To evaluate retrieval effectiveness, we use the existing CPR-Ecom dataset and our newly constructed WorthBuying dataset. We adopt two retrieval metrics: nDCG (Normalized Discounted Cumulative Gain) and Hit Rate.
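A small sketch of these two metrics under a binary-relevance setting with a single gold document per query (an assumption; the annotated test set may mark multiple relevant documents):

```python
import math

def hit_at_k(ranked_doc_ids, gold_doc_id, k=3):
    """Hit@k: 1 if the gold document appears in the top-k results, else 0."""
    return int(gold_doc_id in ranked_doc_ids[:k])

def ndcg_at_k(ranked_doc_ids, gold_doc_id, k=10):
    """Binary-relevance nDCG@k with one gold document per query (IDCG = 1)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id == gold_doc_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Example: gold document ranked 2nd -> Hit@3 = 1, nDCG@10 = 1/log2(3) ~ 0.631
assert hit_at_k(["d7", "d3", "d9"], "d3", k=3) == 1
```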

Table 2. Comparing retrievers of different RAG frameworks. CPT denotes continual pre-training and HN denotes using hard negative samples. Our BSharedRAG retriever outperforms all baselines by a large margin. CPT fails to help BGE adapt to the e-commerce domain and even hurts its performance. FullSharedRAG performs the worst, showing that sharing all parameters between retrieval and generation leads to severe performance degradation.

To evaluate generation quality, we use our WorthBuying dataset. We employ a comprehensive set of widely used automatic metrics: the n-gram-based BLEU-3 and ROUGE-L, and BERTScore, which measures the semantic similarity between ground-truth and generated answers using a BERT model. Accuracy is judged by GPT-4 over question-answer pairs.
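These metrics can be sketched as follows, assuming NLTK for BLEU, a direct LCS implementation for ROUGE-L, and the bert-score package for BERTScore; for Chinese answers, the token lists would typically be characters or segmented words. The settings are assumptions, not necessarily the paper's evaluation scripts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

def bleu3(reference_tokens, candidate_tokens):
    # BLEU-3: uniform weights over 1- to 3-grams; smoothing avoids zero
    # scores on short answers.
    return sentence_bleu([reference_tokens], candidate_tokens,
                         weights=(1/3, 1/3, 1/3),
                         smoothing_function=SmoothingFunction().method1)

def rouge_l_f1(reference_tokens, candidate_tokens):
    # ROUGE-L F1 from the longest common subsequence (LCS) of the token lists.
    m, n = len(reference_tokens), len(candidate_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference_tokens[i] == candidate_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)

def bertscore_f1(candidates, references):
    # lang="zh" selects a Chinese BERT, matching the Chinese e-commerce corpus.
    _, _, f1 = bert_score(candidates, references, lang="zh")
    return f1.mean().item()
```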

Table 3. Evaluation of generation results based on different retrievers on the WorthBuying-PQA test set. RAG-IT denotes retrieval-augmented instruction tuning. The FullSharedRAG method performs worse because the generation objective may conflict with the retrieval objective. Compared with the Baichuan2-7B series of baselines, our model achieves the best performance, demonstrating that both CPT and RAG-IT contribute to the final performance.

Example

Figure 3. A representative example comparing our BSharedRAG with a separate-module RAG. For the given question, the BSharedRAG retriever favors documents whose sentences are easy to generate from the question prompt. In contrast, the BERT-like BGE-large-zh model tends to retrieve documents whose sentences match the question well on the surface; however, such documents may be less suitable for answer generation, for example because important information is missing or hard for the generator to use.
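At inference time, the two stages can be served from the one shared backbone by swapping adapters. A sketch, again assuming Hugging Face PEFT; all paths and adapter names below are hypothetical placeholders:

```python
# Sketch: retrieval and generation from one shared backbone via LoRA swapping.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

backbone = AutoModelForCausalLM.from_pretrained("path/to/domain-backbone", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("path/to/domain-backbone", trust_remote_code=True)

model = PeftModel.from_pretrained(backbone, "path/to/retrieval-lora", adapter_name="retrieval")
model.load_adapter("path/to/generation-lora", adapter_name="generation")

# Stage 1: retrieval -- embed the question and candidate documents (e.g.,
# last-token pooling over hidden states) and rank by cosine similarity.
model.set_adapter("retrieval")
# ... compute embeddings and select the top-k documents ...

# Stage 2: generation -- the same backbone, now with the generation adapter
# active, conditions on the question plus the retrieved documents.
model.set_adapter("generation")
# ... model.generate(...) on the RAG prompt ...
```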

Ethics and Disclosure

Our work aims to adapt a general LLM to the e-commerce domain, but the models we train may have negative impacts. For example, they could be used inappropriately, although we have performed data cleansing to remove offensive content. However, this is a common issue currently faced across the LLM field, and it is not amplified by this work. In the future, we will work further on the safety of LLMs to improve their security in the e-commerce domain. To protect the intellectual property rights of the data, we will strictly limit dissemination of this dataset to academic research purposes only.