METHOD AND SYSTEM FOR TOPIC-BASED CLASSIFICATION OF SCIENTIFIC PAPERS TO RESEARCH PROPOSAL
When categorizing literature for a specific target proposal, authors face challenges because papers can be organized in diverse ways. Embodiments of the present disclosure provide a method and system for topic-based classification of scientific papers to a research proposal. A constructed dataset is augmented with positive and negative reference text spans to obtain an extended dataset. Top-k chunks from a reference paper that are relevant to the citation text are considered as the reference text spans. A topic-based retriever model is trained on a subset of the extended dataset using the positive reference text spans and the negative reference text spans. A topic classifier model is trained using the research proposal title, the proposal topic, and the reference text span from the reference paper to classify whether the reference paper is aligned to the topic. A reference paper found to be relevant to the proposal topic is classified with the corresponding topics in the research proposal.
This U.S. patent application claims priority under 35 U.S.C. § 119 to: India application No. 202421038907, filed on May 17, 2024. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
This disclosure relates generally to data classification techniques, and, more particularly, to a method and system for topic-based classification of scientific papers to a research proposal.
BACKGROUND
The conception of a research agenda is usually initiated by writing an elaborate research proposal which highlights the relevance of a proposed idea and its novelty with respect to prior work. Researchers frequently draft research proposals to present new ideas, define research agendas, and seek approval for funding grants. An integral aspect of the proposal writing process is reviewing relevant literature and relating it to different aspects of the proposal, which motivates the research problem's uniqueness, establishes baselines, synthesizes methodologies, and, last but not least, supports automatic generation of a literature review. The literature review discusses published information in a particular subject area, sometimes within a certain time period. The quality of the research proposal often hinges on whether the research proposal adequately links to existing articles in the literature that help support the proposed research agenda. Several approaches are designed for automatic retrieval of scientific articles and can be applied to identify articles relevant to a proposal, with a detailed description (e.g., abstract) of the proposal serving as the query. However, to get a nuanced view of why a scientific article is relevant to a research proposal, it is necessary to map the retrieved scientific articles to high-level thematic categories (henceforth referred to as topics) relevant to the proposal. This mapping can be used for the downstream generation of a comprehensive topic-wise literature review, as opposed to a monolithic review, thereby improving readability.
Automatic literature review generation is vital for understanding and writing scientific documents and for extracting information to synthesize a comprehensive summary. Existing approaches to automated literature review generation either summarize scientific articles independently or generate citation text for individual scientific articles relevant to a target manuscript without considering their relationship to other relevant articles. Existing approaches often generate monolithic extractive or abstractive reviews, which adversely affects readability, and lack structured subsections linked to specific topics. Citation text generation crafts sentences citing reference papers based on their abstracts, aiming to integrate them into the literature review. However, existing methods assume precise knowledge of the intents (e.g., background, methodology) behind citing a paper, which is unavailable in the early stages of research proposal writing. Moreover, these approaches rely solely on the reference papers' abstracts for sentence generation, potentially lacking adequate information for appropriate mapping to an intent or a topic.
Prior approaches to extracting text spans from research papers use citation texts or queries. Citation intent detection presumes the presence of the citation text in order to classify papers into predefined categories such as motivation and background. However, during the proposal writing stage, the citation text is unavailable. Moreover, the classification categories for potential reference papers differ for each target proposal. Existing approaches for retrieving scientific papers from a corpus rely either on abstracts and titles of a target paper, detailed textual queries, or specific aspects such as a problem or a methodology. These approaches employ strategies to generate suitable embeddings for queries and research papers, often focusing on coarser-level ‘aspects’ of the target paper. Catalogue generation aims to automatically create chapter headings for a target proposal based on a set of reference papers. When categorizing literature for a specific target proposal, authors face challenges, such as introducing bias, because papers can be organized in diverse ways.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method of classifying one or more scientific papers to research proposals based on topic is provided. The processor implemented method includes: receiving, via one or more hardware processors, a constructed dataset as an input data using one or more existing research papers; augmenting, via the one or more hardware processors, the constructed dataset with one or more reference text spans to obtain an extended dataset; training, via the one or more hardware processors, a topic-based retriever (TR) model on a subset of the extended dataset using one or more pseudo positive reference text spans and one or more pseudo negative reference text spans; training, via the one or more hardware processors, a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, one or more proposal topics, and the one or more reference text spans from one or more reference papers, wherein the one or more pseudo positive reference text spans retrieved using the citation text based retriever model are labelled as positive and the one or more pseudo negative reference text spans for the one or more reference papers are labelled as negative, to classify whether or not the one or more reference papers are aligned to the proposal topic; and classifying, via the one or more hardware processors, the one or more reference papers as relevant to one or more proposal topics in context of the one or more reference text spans retrieved using the topic-based retriever (TR) model, with the topic classifier (TC) model.
The constructed dataset pertains to (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) the one or more reference papers pertinent to a research proposal, and (v) the one or more citation texts under the one or more proposal topics from the set of proposal topics corresponding to one or more relevant reference papers. The one or more reference text spans pertain to the one or more pseudo positive reference text spans and the one or more pseudo negative reference text spans. One or more top-K chunks, identified from the one or more reference papers as relevant to the one or more citation texts based on a retriever model, are considered as the one or more reference text spans.
In an embodiment, one or more section headings are identified from one or more target papers to extract one or more in-text citations. In an embodiment, the one or more in-text citations are extracted by employing a parsing technique with one or more enhanced regular expression (regex) statements. In an embodiment, the extracted in-text citations associated with the set of proposal topics, and one or more citing reference papers, are utilized to extract the one or more citation texts. In an embodiment, the one or more citation texts pertain to one or more texts surrounding the one or more in-text citations. In an embodiment, a similarity score between the one or more citation texts and a chunk of the reference paper is calculated. In an embodiment, one or more textual chunks are ranked based on the similarity score to identify the one or more top-K chunks. In an embodiment, a step of classification of one or more scientific papers to one or more research proposals during an inference phase includes: (i) inputting the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to R, to calculate the similarity score; (ii) obtaining the one or more top-K chunks from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score; (iii) inputting the one or more top-K chunks from the reference paper along with the topic and the proposal to the TC model; and (iv) iteratively classifying the reference paper as a positive or a negative, indicating whether or not the reference paper is relevant to the one or more proposal topics in context of the one or more top-K chunks.
In another aspect, there is provided a system for classification of one or more scientific papers to research proposals based on topic. The system includes a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a constructed dataset as an input data using one or more existing research papers; augment the constructed dataset with one or more reference text spans to obtain an extended dataset; train a topic-based retriever (TR) model on a subset of the extended dataset using one or more pseudo positive reference text spans and one or more pseudo negative reference text spans; train a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, one or more proposal topics, and the one or more reference text spans from one or more reference papers, wherein the one or more pseudo positive reference text spans retrieved using the citation text based retriever model are labelled as positive and the one or more pseudo negative reference text spans for the one or more reference papers are labelled as negative, to classify whether or not the one or more reference papers are aligned to the proposal topic; and classify the one or more reference papers as relevant to one or more proposal topics in context of the one or more reference text spans retrieved using the topic-based retriever (TR) model, with the topic classifier (TC) model. The constructed dataset pertains to (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) the one or more reference papers pertinent to a research proposal, and (v) the one or more citation texts under the one or more proposal topics from the set of proposal topics corresponding to one or more relevant reference papers.
The one or more reference text spans pertain to the one or more pseudo positive reference text spans and the one or more pseudo negative reference text spans. One or more top-K chunks, identified from the one or more reference papers as relevant to the one or more citation texts based on a retriever model, are considered as the one or more reference text spans.
In an embodiment, one or more section headings are identified from one or more target papers to extract one or more in-text citations. In an embodiment, the one or more in-text citations are extracted by employing a parsing technique with one or more enhanced regular expression (regex) statements. In an embodiment, the extracted in-text citations associated with the set of proposal topics, and one or more citing reference papers, are utilized to extract the one or more citation texts. In an embodiment, the one or more citation texts pertain to one or more texts surrounding the one or more in-text citations. In an embodiment, a similarity score between the one or more citation texts and a chunk of the reference paper is calculated. In an embodiment, one or more textual chunks are ranked based on the similarity score to identify the one or more top-K chunks. In an embodiment, a step of classification of one or more scientific papers to one or more research proposals during an inference phase includes: (i) inputting the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to R, to calculate the similarity score; (ii) obtaining the one or more top-K chunks from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score; (iii) inputting the one or more top-K chunks from the reference paper along with the topic and the proposal to the TC model; and (iv) iteratively classifying the reference paper as a positive or a negative, indicating whether or not the reference paper is relevant to the one or more proposal topics in context of the one or more top-K chunks.
In yet another aspect, there is provided a non-transitory computer readable medium comprising one or more instructions which, when executed by one or more hardware processors, cause at least one of: receiving a constructed dataset as an input data using one or more existing research papers; augmenting the constructed dataset with one or more reference text spans to obtain an extended dataset; training a topic-based retriever (TR) model on a subset of the extended dataset using one or more pseudo positive reference text spans and one or more pseudo negative reference text spans; training a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, one or more proposal topics, and the one or more reference text spans from one or more reference papers, wherein the one or more pseudo positive reference text spans retrieved using the citation text based retriever model are labelled as positive and the one or more pseudo negative reference text spans for the one or more reference papers are labelled as negative, to classify whether or not the one or more reference papers are aligned to the proposal topic; and classifying the one or more reference papers as relevant to one or more proposal topics in context of the one or more reference text spans retrieved using the topic-based retriever (TR) model, with the topic classifier (TC) model. The constructed dataset pertains to (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) the one or more reference papers pertinent to a research proposal, and (v) the one or more citation texts under the one or more proposal topics from the set of proposal topics corresponding to one or more relevant reference papers. The one or more reference text spans pertain to the one or more pseudo positive reference text spans and the one or more pseudo negative reference text spans.
One or more top-K chunks, identified from the one or more reference papers as relevant to the one or more citation texts based on a retriever model, are considered as the one or more reference text spans.
In an embodiment, one or more section headings are identified from one or more target papers to extract one or more in-text citations. In an embodiment, the one or more in-text citations are extracted by employing a parsing technique with one or more enhanced regular expression (regex) statements. In an embodiment, the extracted in-text citations associated with the set of proposal topics, and one or more citing reference papers, are utilized to extract the one or more citation texts. In an embodiment, the one or more citation texts pertain to one or more texts surrounding the one or more in-text citations. In an embodiment, a similarity score between the one or more citation texts and a chunk of the reference paper is calculated. In an embodiment, one or more textual chunks are ranked based on the similarity score to identify the one or more top-K chunks. In an embodiment, a step of classification of one or more scientific papers to one or more research proposals during an inference phase includes: (i) inputting the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to R, to calculate the similarity score; (ii) obtaining the one or more top-K chunks from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score; (iii) inputting the one or more top-K chunks from the reference paper along with the topic and the proposal to the TC model; and (iv) iteratively classifying the reference paper as a positive or a negative, indicating whether or not the reference paper is relevant to the one or more proposal topics in context of the one or more top-K chunks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
There is a need for an approach to generate a comprehensive literature review featuring well-structured subsections or categories, each grouping articles related to a specific topic. Embodiments of the present disclosure provide a method and system for topic-based classification of scientific papers relevant to a research proposal. Human and large language model (LLM) baselines are established for the task. The embodiments of the present disclosure provide a two-stage approach for article-topic classification. During the first stage, one or more reference text spans relevant to each topic are retrieved from the scientific articles with a retriever. A large language model trained using synthesized topic-text span pseudo-labels is utilized for this retrieval. During the second stage, the scientific articles are aligned to one or more topics in the context of the proposal and the text spans retrieved from the scientific article, formulating the problem as a classification task. A research proposal with a corresponding title and abstract R is considered together with a set of topics forming the catalog, denoted as TR={t1, t2, . . . , tk}. A corpus of reference research papers PR={p1, p2, . . . , pn} relevant to R is assumed, retrieved using the proposal title and high-level abstract as queries. The task aims to classify the mapping of each research paper pi ∈ PR to each topic tj ∈ TR using binary labels yij ∈ {0, 1}. A dataset is constructed for the task using available target research papers as target proposals, the section headings of the related work section of the proposal as the ‘topics’, and the papers cited under each section as the reference articles relevant to the ‘topics’. The training dataset includes samples <R, pi, tj, cij>, where cij is a citation text citing relevant article pi under topic tj of R, whereas during the inference phase the availability of cij is not assumed.
The research proposal refers to a reference paper pi in the context of topic tj by taking into consideration some textual content of the reference paper pi that is relevant to the proposal and the topic. This text is referred to as the reference text span sij. The availability of sij is assumed neither for training nor for inference.
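The task formulation above can be illustrated with a minimal Python sketch. The class name `Sample`, the function `build_labels`, and the toy paper and topic identifiers are hypothetical and are introduced only for illustration; the sketch simply shows a sample tuple <R, pi, tj, cij> and the binary labels yij derived from which papers are cited under which topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    proposal: str                        # R: title and abstract of the proposal
    paper_id: str                        # p_i: a reference paper from P_R
    topic: str                           # t_j: a topic from T_R
    citation_text: Optional[str] = None  # c_ij: available only for training


def build_labels(paper_ids, topics, cited_under):
    """Return y[(p_i, t_j)] in {0, 1} for every paper-topic pair.

    cited_under maps a topic to the set of paper ids cited under it.
    """
    return {
        (p, t): int(p in cited_under.get(t, set()))
        for p in paper_ids
        for t in topics
    }


# Toy data (illustrative only).
labels = build_labels(
    paper_ids=["p1", "p2"],
    topics=["object detection", "domain adaptation"],
    cited_under={
        "object detection": {"p1"},
        "domain adaptation": {"p1", "p2"},
    },
)
train_sample = Sample("R", "p1", "object detection", "Detectors such as [4] ...")
test_sample = Sample("R", "p2", "object detection")  # no citation text at inference
```

A test-time sample leaves `citation_text` unset, mirroring the assumption that cij is unavailable during inference.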
Referring now to the drawings, and more particularly to
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, as well as interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a camera device, and a printer. Further, the I/O interface device(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases. The I/O interface device(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. In an embodiment, the I/O interface device(s) 106 can include one or more ports for connecting a number of devices to one another or to another server.
The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 110 and a repository 112 for storing data processed, received, and generated by the plurality of modules 110. The plurality of modules 110 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
Further, the database stores information pertaining to inputs fed to the system 100 and/or outputs generated by the system 100 (e.g., data/output generated at each stage of the data processing), specific to the methodology described herein. More specifically, the database stores information being processed at each step of the proposed methodology.
Additionally, the plurality of modules 110 may include programs or coded instructions that supplement applications and functions of the system 100. The repository 112, amongst other things, includes a system database 114 and other data 116. The other data 116 may include data generated as a result of the execution of one or more modules in the plurality of modules 110. Herein, the memory, for example the memory 104, and the computer program code are configured to, with the hardware processor, for example the processor 102, cause the system 100 to perform various functions described hereinunder.
The constructed dataset pertains to at least one of: (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) at least one reference paper pertinent to a research proposal, and (v) at least one citation text under at least one proposal topic from the set of proposal topics corresponding to at least one relevant reference paper. The constructed dataset includes a set of research papers from one or more selected domains from the UnArxiv as target proposals R, the one or more topics of each target proposal TR, and one or more reference papers relevant to a target proposal PR. The ground truth labels of the task, in terms of alignments, i.e., a classification of the one or more reference papers to the one or more topics, are obtained from the in-text citations explicitly specified by the authors of the target papers. The in-text citations, which are part of the one or more topics, i.e., subsections, and cite the reference papers, serve as a ‘link’ between the one or more topics and the respective reference papers. For the training set, the availability of the citation text cij is assumed for every pair tj and pi. Accordingly, one sample of the training data is depicted by a tuple <R, pi, tj, cij>. Similarly, a sample of the test set is depicted by <R, pi, tj>. For example, a total of 2,417 target proposals are considered in the dataset, citing 49,506 reference papers, covering 7,123 topics, and yielding 62,520 proposal-topic-reference paper tuples. The target proposals are split into training, validation, and test sets to prevent information leakage across splits. The resulting corpus includes target proposals pertaining to Artificial Intelligence (AI) (2.69%), Machine Learning (ML) (15.56%), Computational Linguistics (CL) (7.28%), Computer Vision (CV) (73.23%), and a combination of CL and CV (1.24%).
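As a minimal sketch of how in-text citations can link topics to reference papers, the following Python example extracts citation markers from the body of one related-work subsection and keeps the surrounding sentence as the citation text. The regular expressions, the numeric bracket citation style, the sentence-level window, and the function name `extract_citation_texts` are assumptions made for illustration; the disclosure does not specify the enhanced regex statements actually used.

```python
import re

# Assumed citation style: numeric markers such as [4] or [2, 9].
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")
# Assumed sentence boundary: whitespace following '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_citation_texts(section_heading, section_body):
    """Return (topic, reference_number, citation_text) triples for one section."""
    triples = []
    for sentence in SENTENCE_END.split(section_body.strip()):
        for match in CITATION.finditer(sentence):
            for ref in match.group(1).split(","):
                # The section heading plays the role of the topic t_j.
                triples.append((section_heading, int(ref), sentence))
    return triples


body = (
    "Two-stage detectors refine region proposals [4]. "
    "Anchor-free variants remove hand-tuned priors [2, 9]."
)
triples = extract_citation_texts("Object Detection", body)
```

Each triple yields one positive topic-reference paper alignment, with the sentence serving as the citation text cij for a training sample.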
The constructed dataset is tailored for mapping research papers to proposal topics, and can be seamlessly extended to tasks such as catalogue generation, citation text generation, and literature review generation. Dataset statistics are summarized in Table 1 below:
In an embodiment, the topic-based classification of the one or more scientific papers to the one or more research proposals is carried out in three stages: (i) augmenting the dataset with one or more positive and negative reference text spans retrieved from the one or more reference papers with a citation text as a query using a retriever model, (ii) training a topic-based reference text span retriever (TR) and a proposal topic-based reference paper classifier (TC) using the augmented data, and (iii) running an inference pipeline that retrieves one or more reference text spans relevant to a proposal topic from a reference paper using the trained TR, and then classifies, with the topic classifier (TC) model, whether the reference paper is relevant to the proposal topic in context of the reference text span retrieved using the topic-based retriever (TR) model.
To classify a scientific paper, i.e., a reference paper pi, to a proposal topic tj, a reference text span sij from the paper pi that is relevant to the proposal topic tj is needed. The retriever model (TR) is trained with positive and negative pairs of topics tj and reference text spans sij. In an embodiment, the citation text cij cites the reference paper pi as part of the proposal topic tj of the research proposal R. The citation text cij is utilized as a ‘link’ to retrieve the reference text span sij from pi that is relevant to cij, and consequently relevant to tj, which allows pseudo-positive pairs <tj, sij> to be formed. To retrieve the reference text spans sij given the citation text cij, the performance of one or more existing retrieval models is assessed on equivalent tasks in the science domain, viz., (i) retrieval of paragraphs from scientific documents given a question, and (ii) retrieval of a text span given the citation text. A retriever model (SR) with the best zero-shot performance on both tasks is identified.
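The retrieval of pseudo-positive spans can be sketched as follows, under the loud assumption that a simple bag-of-words cosine similarity stands in for the SR model. The actual SR model is a retrieval model selected for its zero-shot performance and is not reproduced here; the function names and toy chunks are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Bag-of-words term counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    va, vb = _vector(a), _vector(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks):
    """Return chunk indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)


def top_k(query, chunks, k):
    return [chunks[i] for i in rank_chunks(query, chunks)[:k]]


# Toy reference paper p_i split into chunks, and a citation text c_ij.
chunks = [
    "We train a detector with region proposals.",
    "The dataset contains street scenes.",
    "Acknowledgements and funding.",
]
citation = "Region proposals are used to train the detector"
pseudo_positive = top_k(citation, chunks, k=1)  # forms the pair <t_j, s_ij>
```

The top-ranked chunks for a citation text are paired with the topic under which the citation appears, giving the pseudo-positive pair without any manual span annotation.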
- (a) Type 1: Bottom-k chunks retrieved from pi, with the citation text cij of topic tj as the query, using the SR, which serve as easy negatives for the topic tj of the proposal R.
- (b) Type 2: Text spans (i.e., top-k chunks) retrieved from a reference paper pl, NOT cited in topic tj but cited in tk where k≠j, with the citation text clk as the query using the SR, which serve as easy negatives for the topic tj of the proposal R.
- (c) Type 3: With each cij of topic tj as the query, top-k chunks are retrieved from reference articles pl, NOT cited in topic tj but cited in tk where k≠j, using the SR. The top-k chunks demonstrating maximum similarity with one of the cij serve as hard negatives for the topic tj of the proposal R.
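The three kinds of negatives can be sketched in Python as follows. This is an illustrative reading of the list above rather than the disclosed implementation: a token-overlap count stands in for the SR model, the function and variable names are hypothetical, and every paper in `papers` is assumed to be cited under at least one topic of the same proposal.

```python
def overlap(query, chunk):
    """Stand-in similarity: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def ranked(query, chunks):
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)


def mine_negatives(topic, citations, papers, k=1):
    """citations: {(paper_id, topic): citation_text}; papers: {paper_id: [chunks]}."""
    own = {p: c for (p, t), c in citations.items() if t == topic}
    cited_here = set(own)
    type1, type2 = [], []
    # Type 1: bottom-k chunks of papers cited under this topic.
    for p, c in own.items():
        type1 += ranked(c, papers[p])[-k:]
    # Type 2: top-k chunks of papers cited only elsewhere, using their own citation text.
    for (p, t), c in citations.items():
        if t != topic and p not in cited_here:
            type2 += ranked(c, papers[p])[:k]
    # Type 3: chunks of papers cited only elsewhere that best match this topic's citation texts.
    others = [ch for p in papers if p not in cited_here for ch in papers[p]]
    queries = list(own.values())
    type3 = sorted(
        others,
        key=lambda ch: max((overlap(q, ch) for q in queries), default=0),
        reverse=True,
    )[:k]
    return type1, type2, type3


papers = {
    "p1": ["detectors use region proposals", "we thank the reviewers"],
    "p2": [
        "adaptation aligns feature distributions",
        "region proposals transfer poorly across domains",
    ],
}
citations = {
    ("p1", "detection"): "region proposals drive detectors",
    ("p2", "adaptation"): "feature distributions are aligned by adaptation",
}
easy1, easy2, hard = mine_negatives("detection", citations, papers, k=1)
```

In the toy data, the Type 3 negative shares vocabulary with the topic's citation text even though its paper is cited elsewhere, which is what makes it hard.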
The training dataset is augmented with the pseudo positive and negative reference text spans for each topic leading to the resultant dataset, where a sample is depicted by <R, pi, tj, Cij, , {
During the training phase, the topic-based reference text span retriever (TR) unit 206 is configured to train the topic-based retriever model TR, which is a Language Model (LM), using the positive and negative topic and reference text span pairs. The probability Pr(True | tj, sij) for positive pairs and the probability Pr(False | tj, sij) for negative pairs are maximized using a cross entropy loss. Because the dataset contains significantly more negative samples than positives, all positives are included while the negatives of each type are randomly sub-sampled uniformly to maintain a balanced training dataset. Sampling is performed with replacement for every epoch so that the model sees as many of the negatives as possible across epochs.
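The balancing step can be sketched as follows. The function name `epoch_batch`, the even split of the negative budget across the three types, and the reading of "with replacement" as drawing each epoch's negatives with `random.choices` are assumptions for illustration; the disclosure does not give the exact sampling ratios.

```python
import random


def epoch_batch(positives, negatives_by_type, rng):
    """Keep all positives; draw about as many negatives, split evenly across types."""
    types = [t for t, pool in negatives_by_type.items() if pool]
    per_type = max(1, len(positives) // len(types)) if types else 0
    sampled = []
    for t in types:
        # random.choices samples with replacement.
        sampled += rng.choices(negatives_by_type[t], k=per_type)
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(data)
    return data


rng = random.Random(0)
positives = [f"pos{i}" for i in range(6)]
negatives_by_type = {
    "type1": [f"t1_{i}" for i in range(50)],
    "type2": [f"t2_{i}" for i in range(40)],
    "type3": [f"t3_{i}" for i in range(30)],
}
# A fresh draw per epoch exposes the model to different negatives over time.
epochs = [epoch_batch(positives, negatives_by_type, rng) for _ in range(3)]
```

Each epoch's list contains every positive pair with label 1 and a comparable number of negatives with label 0, which can then be fed to a cross entropy objective.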
The proposal topic-based reference paper classifier (TC) unit 208 is configured to train the classifier model TC on a subset of the extended training dataset. A sample <R, tj, > is formed with label yij=1 and <R, tj,
During the inference phase, the proposal title and abstract R and the topic tj from the test set are fed to the model TR along with each chunk ci of a reference article pi in the test set tagged to be relevant to R. The model TR provides the similarity score simij=TR(tj, ci) for each ci. The similarity score is calculated by applying a softmax on the logits of the ‘true’ and ‘false’ tokens and taking the probability P(True | tj, ci). In an embodiment, the chunks ci are ranked for the given R and tj based on the similarity scores simij, and the top-k ranked ci of paper pi are used as the retrieved reference text spans from paper pi for the topic tj of R. R, tj, and the retrieved reference text spans are fed to the TC to determine the alignment of paper pi. If the probability P(y=True | R, tj, retrieved reference text spans) ≥ 0.5, then pi is considered to be aligned with tj.
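The inference steps above can be sketched in Python. The two callables `tr_logits` and `tc_logits` are hypothetical stand-ins that return a pair of (‘true’ logit, ‘false’ logit); in the disclosure these come from the trained TR and TC models, which are not reproduced here. Joining the top-k chunks with a space before classification is likewise an assumption.

```python
import math


def p_true(true_logit, false_logit):
    """Softmax over the two logits, returning the probability of 'true'."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def classify_paper(chunks, tr_logits, tc_logits, k=2, threshold=0.5):
    """Rank chunks with TR, keep the top-k, then threshold the TC probability."""
    scored = sorted(chunks, key=lambda c: p_true(*tr_logits(c)), reverse=True)
    top = scored[:k]
    span = " ".join(top)
    return p_true(*tc_logits(span)) >= threshold, top


# Toy stand-ins for the models (illustrative only).
def tr_logits(chunk):
    return float(chunk.count("detector")), 0.0


def tc_logits(span):
    return (1.0, 0.0) if "detector" in span else (0.0, 1.0)


chunks = [
    "a detector with a detector head",
    "funding statement",
    "one detector baseline",
]
aligned, top = classify_paper(chunks, tr_logits, tc_logits, k=2)
```

The same routine would be repeated for every topic tj of the proposal and every reference paper pi, giving one binary decision per paper-topic pair.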
At step 502, the constructed dataset (as depicted in
At step 506, the topic-based retriever (TR) model is trained (as depicted in
During an inference phase, a step of classification of one or more scientific papers to one or more research proposals includes: (i) at step 512, inputting the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to R, to calculate the similarity score; (ii) at step 514, obtaining the one or more top-K chunks from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score; (iii) at step 516, inputting the one or more top-K chunks from the reference paper along with the topic and the proposal to the TC model; and (iv) at step 518, iteratively classifying the reference paper as a positive or a negative, indicating whether or not the reference paper is relevant to the one or more proposal topics in context of the one or more top-K chunks.
Experimentation Results:
For example, three models, viz., the retriever model (SR), the topic-based reference text span retriever (TR) model, and the proposal topic-based reference paper classifier (TC) model, are evaluated by assessing their corresponding performance. The performance of the SR model on evidence and reference text span retrieval tasks is evaluated by computing the evidence F1 score using the ground truth evidence and reference text spans available in the datasets. The performance of the TR model is also evaluated by computing the evidence F1 score. The top-K most similar chunks retrieved from the reference paper using the trained TR model with a topic as the query form the predicted reference text span sr̂ij for that topic. The top-K most similar chunks retrieved from the reference paper using the SR model with the citation text belonging to that topic as the query form the pseudo ground truth reference text span srij for that topic. The evidence F1 score is computed by comparing the two sets of top-K chunks. For the evaluation of the TC model, the ground truth binary classification labels for each proposal topic and each reference paper relevant to that proposal are considered. The ground truth labels and the labels predicted by the TC model for each target proposal topic and reference paper pair are used to compute a confusion matrix for the binary classification task. The F1 score is chosen as the metric, given that the data is unbalanced in terms of classification labels.
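The two metrics can be sketched as follows. Treating the evidence F1 score as the F1 overlap between two sets of chunk identifiers is an assumption, since the exact definition is not spelled out above; the function names and toy inputs are hypothetical.

```python
def evidence_f1(predicted, reference):
    """F1 overlap between predicted and (pseudo) ground truth chunk identifiers."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def binary_f1(y_true, y_pred):
    """F1 of the positive class, computed from confusion matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# TR evaluation: chunks retrieved with the topic vs. with the citation text.
tr_score = evidence_f1(predicted=[0, 3, 5], reference=[3, 5, 7])
# TC evaluation: predicted vs. ground truth alignment labels.
tc_score = binary_f1(y_true=[1, 0, 1, 1], y_pred=[1, 1, 0, 1])
```

Because F1 of the positive class ignores true negatives, it remains informative when most paper-topic pairs are not aligned.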
For establishing the baselines, the ‘Reference Paper Topic Classification’ task has been evaluated with human and LLM-generated annotations. An evaluation subset was constructed by uniformly sampling four random target proposals from each of the five domains in the test set, ensuring balanced domain representation. This sampling yielded 20 target proposals with 52 topics citing 362 reference papers, forming 378 positive topic-reference paper pairs. A human annotator and the LLM were provided with the target proposal's title, the abstract, the topic tij, the reference paper title pi, and the reference text span of the reference paper sij, retrieved using (i) the citation text cij with the SR model, to evaluate the performance of the TC model independently, and (ii) the topic tij with the trained TR model, to evaluate the performance of the complete inference pipeline. The task was to assess the relevance of the reference paper to the given topic, labeling the pair as 1 if the reference paper is found suitable to be cited under the topic given the reference text span, and as 0 otherwise.
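A minimal sketch of the domain-balanced sampling described above is given below for illustration. The record layout (a dictionary with a `domain` key), the function name, and the fixed seed are assumptions made for the example only.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_evaluation_subset(
    proposals: List[Dict[str, str]], per_domain: int = 4, seed: int = 0
) -> List[Dict[str, str]]:
    """Draw the same number of proposals uniformly at random from every domain."""
    by_domain: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for proposal in proposals:
        by_domain[proposal["domain"]].append(proposal)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset: List[Dict[str, str]] = []
    for domain in sorted(by_domain):
        # random.sample raises ValueError if a domain has fewer than per_domain items.
        subset.extend(rng.sample(by_domain[domain], per_domain))
    return subset
```

With five domains and four proposals drawn per domain, the subset contains 20 target proposals, matching the count reported above.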
The embodiments of the present disclosure provide a task of mapping relevant scientific articles to research proposal topics as a precursor to the literature review generation task for a new research proposal. For example, a large-scale dataset is introduced for the task, and competitive baselines are established by an expert and the Large Language Model (LLM), underscoring task feasibility. The task assumes the real-life scenario, at the stage of proposal writing, in which citation text or detailed topic descriptions are unavailable for retrieving text spans from reference articles relevant to a topic (i.e., during inference), even though such text spans are required to establish the alignment of reference papers to the topic. The citation text (i.e., available for the training data) is utilized as a link between the topic and the text spans to create pseudo-labels for training a retriever. The pipeline of the present disclosure, using much smaller LMs trained with pseudo-labels, yields performance comparable to that of the LLM baseline, demonstrating its efficacy.
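The use of citation text as a link between a topic and the text spans may be illustrated, in a non-limiting manner, as follows. The sketch assumes a bag-of-words cosine similarity in place of the SR model, and assumes that the chunks ranked below the top-K serve as pseudo-negatives; the disclosure does not specify how pseudo-negatives are selected. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (ASSUMPTION: stand-in for the SR model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def build_pseudo_labels(
    topic: str,
    citation_text: str,
    chunks: List[str],
    k: int = 1,
    score_fn: Callable[[str, str], float] = bow_cosine,
) -> List[Dict[str, str]]:
    """Create (query, positive, negative) triples for training a topic-based retriever."""
    # The citation text, available only at training time, ranks the chunks.
    ranked = sorted(chunks, key=lambda ch: score_fn(citation_text, ch), reverse=True)
    positives = ranked[:k]   # pseudo-positive reference text span
    negatives = ranked[k:]   # ASSUMPTION: remaining chunks act as pseudo-negatives
    # The topic, not the citation text, becomes the query seen by the retriever.
    return [
        {"query": topic, "positive": pos, "negative": neg}
        for pos in positives
        for neg in negatives
    ]
```

The resulting triples pair the topic with chunks that the citation text selected, so that at inference time the retriever can be queried with the topic alone.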
The embodiments of the present disclosure herein address the unresolved problem of the assumptions made by current scientific literature retrieval methods in automatic literature review generation. The embodiments of the present disclosure herein provide a framework for the topic-based classification of scientific papers or reference articles relevant to the research proposal. Instead of generating a monolithic review, the present disclosure serves as a precursor to generating a comprehensive literature review for a research proposal that is well organized into a set of topics. The embodiment of the present disclosure thus focuses on the task of automatically mapping relevant scientific articles to one or more topics (e.g., proposal-specific topics defined by users) in the research proposal without assuming the existence of the sentences used to cite the paper (i.e., citation text). This task precedes the task of generating a literature review with well-structured subsections, each grouping relevant articles by specific topics of interest. The approach of mapping research papers to one or more topics extracts ‘text spans’ from the entire content of the reference paper, ensuring the availability of comprehensive information for mapping. The embodiment of the present disclosure retrieves text spans without relying on the availability of citation text or detailed queries, but assumes the presence of high-level topic names. The embodiment of the present disclosure assumes the availability of research articles relevant to a proposal and defines the task of mapping these articles to a set of user-defined fine-grained topics relevant to the proposal.
The embodiment of the present disclosure operates under the assumption that there is (a) a corpus of relevant scientific articles related to the proposal, obtained using the provided title and abstract of the proposal, and (b) a user-provided catalog of thematic categories for the target proposal, consisting of a list of high-level topics to which the scientific articles in the corpus need to be aligned.
The embodiment of the present disclosure considers a more realistic setting, which assumes the availability of not only the reference papers retrieved as relevant to a proposal, but also high-level topics provided by the researcher, based on which he/she wants to categorize the relevant literature. The embodiment focuses on the task of classifying the reference papers to the corresponding topics, and constructs a dataset for benchmarking solutions to the proposed task. The dataset for the task is constructed using the available target research papers as target proposals, the section headings of the related work section of the proposal as the ‘topics’, and the papers cited under those sections as the reference articles relevant to those ‘topics’. The result is a more personalized approach, aligning the cataloging process closely with the researcher's unique perspective. The framework/method of the present disclosure serves as an upstream task for a comprehensive literature review generation task. The classification of retrieved articles to one or more topics further allows generation of a cohesive summary of work related to each topic, taking into consideration diverse perspectives on each reference paper's relevance to the proposal. The framework/method of the present disclosure maps each retrieved scientific article to one or more topics in the catalog, offering a comprehensive understanding of the corresponding distinct contributions to the target proposal. The setting assumes the real-life scenario, at the stage of proposal writing, in which the citation text or detailed topic descriptions are unavailable for retrieving the reference text spans from reference articles relevant to a topic (i.e., during the inference stage), even though such text spans are required to establish the alignment of reference papers to the topic. The citation text (i.e., available for the training data) is used as a link between the topic and the text spans to create pseudo-labels for training the retriever.
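For illustration only, the dataset construction described above may be sketched as follows. The input is assumed to be a mapping from each related work section heading (the topic) to the identifiers of the papers cited under it. The sketch further assumes that a paper cited elsewhere in the same proposal, but not under a given topic, forms a negative pair for that topic; the function name and the identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def build_topic_reference_pairs(
    related_work: Dict[str, List[str]]
) -> List[Tuple[str, str, int]]:
    """Turn a target paper's related work section into labelled pairs.

    related_work maps a section heading (topic) to the papers cited under it.
    Returns (topic, reference paper, label) with label 1 if the paper is cited
    under that topic and 0 otherwise.
    """
    # Every paper cited anywhere in the related work is relevant to the proposal.
    all_references = sorted({ref for refs in related_work.values() for ref in refs})

    pairs: List[Tuple[str, str, int]] = []
    for topic, cited in related_work.items():
        cited_under_topic = set(cited)
        for ref in all_references:
            # ASSUMPTION: papers not cited under this topic are labelled negative.
            pairs.append((topic, ref, int(ref in cited_under_topic)))
    return pairs
```

A paper cited under several headings yields several positive pairs, which is consistent with mapping each article to one or more topics.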
The embodiment of the present disclosure exceeds the LLM baseline with an increase of 13.74% in the F1 score for the classification task, demonstrating its efficacy.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer such as a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims
1. A processor implemented method comprising:
- receiving, via one or more hardware processors, a constructed dataset as input data using at least one existing research paper, wherein the constructed dataset pertains to at least one of: (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) at least one reference paper pertinent to a research proposal, and (v) at least one citation text under at least one proposal topic from the set of proposal topics corresponding to at least one relevant reference paper;
- augmenting, via the one or more hardware processors, the constructed dataset with at least one reference text span to obtain an extended dataset, wherein at least one reference text span pertains to at least one pseudo positive reference text span, and at least one pseudo negative reference text span, wherein at least one top-K chunk, identified from at least one reference paper as relevant to at least one citation text based on a retriever model, is considered as at least one reference text span;
- training, via the one or more hardware processors, a topic-based retriever (TR) model on a subset of the extended dataset by at least one pseudo positive reference text span, and at least one pseudo negative reference text span;
- training, via the one or more hardware processors, a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, at least one proposal topic, and at least one reference text span from at least one reference paper retrieved using at least one pseudo positive reference text span retrieved using the citation text based retriever model, which is labelled as positive, and at least one pseudo negative reference text span for at least one reference paper which is labelled as negative, to classify if at least one reference paper is aligned to the proposal topic or not; and
- classifying, via the one or more hardware processors, at least one reference paper to be relevant to at least one proposal topic in context of at least one reference text span retrieved using the topic-based retriever (TR) model with the topic classifier (TC) model.
2. The processor implemented method of claim 1, wherein at least one section heading is identified from at least one target paper to extract at least one in-text citation, and wherein at least one in-text citation is extracted by employing a parsing technique with one or more enhanced regular expressions (Regex) statements.
3. The processor implemented method of claim 2, wherein the extracted at least one in-text citation associated with at least one proposal topic, and at least one citing reference paper is utilized to extract at least one citation text, and wherein at least one citation text pertains to one or more texts surrounding at least one in-text citation.
4. The processor implemented method of claim 1, wherein a similarity score between at least one citation text and a chunk of the reference paper is calculated, and wherein one or more textual chunks are ranked based on the similarity score to identify at least one top-k chunk.
5. The processor implemented method of claim 1, wherein a step of classification of at least one scientific paper to at least one research proposal during an inference phase, comprises:
- inputting, via the one or more hardware processors, the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to the R to calculate the similarity score;
- obtaining, via the one or more hardware processors, at least one top k chunk from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score;
- inputting, via the one or more hardware processors, at least one top k chunk from the reference paper along with the topic and the proposal to the TC model; and
- iteratively classifying, via the one or more hardware processors, the reference paper as a positive or a negative, highlighting if the reference paper is relevant to at least one proposal topic or not in context of at least one top-K chunk.
6. A system comprising:
- a memory storing a plurality of instructions;
- one or more communication interfaces; and
- one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to:
- receive a constructed dataset as input data using at least one existing research paper, wherein the constructed dataset pertains to at least one of: (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) at least one reference paper pertinent to a research proposal, and (v) at least one citation text under at least one proposal topic from the set of proposal topics corresponding to at least one relevant reference paper;
- augment the constructed dataset with at least one reference text span to obtain an extended dataset, wherein at least one reference text span pertains to at least one pseudo positive reference text span, and at least one pseudo negative reference text span, wherein at least one top-K chunk, identified from at least one reference paper as relevant to at least one citation text based on a retriever model, is considered as at least one reference text span;
- train a topic-based retriever (TR) model on a subset of the extended dataset by at least one pseudo positive reference text span, and at least one pseudo negative reference text span;
- train a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, at least one proposal topic, and at least one reference text span from at least one reference paper retrieved using at least one pseudo positive reference text span retrieved using the citation text based retriever model, which is labelled as positive, and at least one pseudo negative reference text span for at least one reference paper which is labelled as negative, to classify if at least one reference paper is aligned to the proposal topic or not; and
- classify at least one reference paper to be relevant to at least one proposal topic in context of at least one reference text span retrieved using the topic-based retriever (TR) model with the topic classifier (TC) model.
7. The system of claim 6, wherein at least one section heading is identified from at least one target paper to extract at least one in-text citation, and wherein at least one in-text citation is extracted by employing a parsing technique with one or more enhanced regular expressions (Regex) statements.
8. The system of claim 7, wherein the extracted at least one in-text citation associated with at least one proposal topic, and at least one citing reference paper is utilized to extract at least one citation text, and wherein at least one citation text pertains to one or more texts surrounding at least one in-text citation.
9. The system of claim 6, wherein a similarity score between at least one citation text and a chunk of the reference paper is calculated, and wherein one or more textual chunks are ranked based on the similarity score to identify at least one top-k chunk.
10. The system of claim 6, wherein the one or more hardware processors are configured by the instructions to classify at least one scientific paper to at least one research proposal during an inference phase, comprises:
- input the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to the R to calculate the similarity score;
- obtain at least one top k chunk from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score;
- input at least one top k chunk from the reference paper along with the topic and the proposal to the TC model; and
- iteratively classify the reference paper as a positive or a negative, highlighting if the reference paper is relevant to at least one proposal topic or not in context of at least one top-K chunk.
11. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause:
- receiving a constructed dataset as input data using at least one existing research paper, wherein the constructed dataset pertains to at least one of: (i) a research proposal title, or (ii) an abstract, or (iii) a set of proposal topics, or (iv) at least one reference paper pertinent to a research proposal, and (v) at least one citation text under at least one proposal topic from the set of proposal topics corresponding to at least one relevant reference paper;
- augmenting the constructed dataset with at least one reference text span to obtain an extended dataset, wherein at least one reference text span pertains to at least one pseudo positive reference text span, and at least one pseudo negative reference text span, wherein at least one top-K chunk, identified from at least one reference paper as relevant to at least one citation text based on a retriever model, is considered as at least one reference text span;
- training a topic-based retriever (TR) model on a subset of the extended dataset by at least one pseudo positive reference text span, and at least one pseudo negative reference text span;
- training a topic classifier (TC) model on a subset of the extended dataset using the research proposal title, at least one proposal topic, and at least one reference text span from at least one reference paper retrieved using at least one pseudo positive reference text span retrieved using the citation text based retriever model, which is labelled as positive, and at least one pseudo negative reference text span for at least one reference paper which is labelled as negative, to classify if at least one reference paper is aligned to the proposal topic or not; and
- classifying at least one reference paper to be relevant to at least one proposal topic in context of at least one reference text span retrieved using the topic-based retriever (TR) model with the topic classifier (TC) model.
12. The one or more non-transitory machine-readable information storage mediums of claim 11, wherein at least one section heading is identified from at least one target paper to extract at least one in-text citation, and wherein at least one in-text citation is extracted by employing a parsing technique with one or more enhanced regular expressions (Regex) statements.
13. The one or more non-transitory machine-readable information storage mediums of claim 12, wherein the extracted at least one in-text citation associated with at least one proposal topic, and at least one citing reference paper is utilized to extract at least one citation text, and wherein at least one citation text pertains to one or more texts surrounding at least one in-text citation.
14. The one or more non-transitory machine-readable information storage mediums of claim 11, wherein a similarity score between at least one citation text and a chunk of the reference paper is calculated, and wherein one or more textual chunks are ranked based on the similarity score to identify at least one top-k chunk.
15. The one or more non-transitory machine-readable information storage mediums of claim 11, wherein a step of classification of at least one scientific paper to at least one research proposal during an inference phase, comprises:
- inputting the research proposal title, the abstract R, and the proposal topic from a test set to the TR model along with each chunk of a reference paper in the test set tagged to be relevant to the R to calculate the similarity score;
- obtaining at least one top k chunk from the reference paper relevant to the topic of the proposal using the TR model based on the similarity score;
- inputting at least one top k chunk from the reference paper along with the topic and the proposal to the TC model; and
- iteratively classifying the reference paper as a positive or a negative, highlighting if the reference paper is relevant to at least one proposal topic or not in context of at least one top-K chunk.
Type: Application
Filed: May 12, 2025
Publication Date: Nov 20, 2025
Applicant: Tata Consultancy Services Limited (Mumbai)
Inventors: Rudra Nath PALIT (Pune), Manasi Samarth PATWARDHAN (Pune), Lovekesh VIG (New Delhi), Gautam SHROFF (Gurgaon)
Application Number: 19/205,221