https://scholars.lib.ntu.edu.tw/handle/123456789/607414
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lee J.-S. | en_US |
dc.contributor.author | JIEH HSIANG | en_US |
dc.creator | Lee J.-S.; Hsiang J. | - |
dc.date.accessioned | 2022-04-25T06:43:44Z | - |
dc.date.available | 2022-04-25T06:43:44Z | - |
dc.date.issued | 2021 | - |
dc.identifier.issn | 1613-0073 | - |
dc.identifier.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85111040648&partnerID=40&md5=64d09d70fe6934756da6828a7301aca2 | - |
dc.identifier.uri | https://scholars.lib.ntu.edu.tw/handle/123456789/607414 | - |
dc.description.abstract | Generative models, such as GPT-2, have demonstrated impressive results recently. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering the question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using the patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating the semantic similarities among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output. © 2021 for this paper by its authors. | - |
dc.relation.ispartof | CEUR Workshop Proceedings | - |
dc.subject | Deep learning | - |
dc.subject | Natural language generation | - |
dc.subject | Natural language processing | - |
dc.subject | Patent | - |
dc.subject | Semantic search | - |
dc.subject | Embeddings | - |
dc.subject | Natural language processing systems | - |
dc.subject | Patents and inventions | - |
dc.subject | Semantic Web | - |
dc.subject | Semantics | - |
dc.subject | Bag of words | - |
dc.subject | Generative model | - |
dc.subject | Model-based OPC | - |
dc.subject | Patent datum | - |
dc.subject | Prior art search | - |
dc.subject | Ranking approach | - |
dc.subject | Semantic similarity | - |
dc.subject | Training data | - |
dc.subject | Text mining | - |
dc.title | Prior art search and reranking for generated patent text | en_US |
dc.type | conference paper | en |
dc.identifier.scopus | 2-s2.0-85111040648 | - |
dc.relation.pages | 18-24 | - |
dc.relation.journalvolume | 2909 | - |
item.cerifentitytype | Publications | - |
item.openairetype | conference paper | - |
item.openairecristype | http://purl.org/coar/resource_type/c_5794 | - |
item.grantfulltext | none | - |
item.fulltext | no fulltext | - |
crisitem.author.dept | Networking and Multimedia | - |
crisitem.author.dept | Computer Science and Information Engineering | - |
crisitem.author.orcid | 0000-0002-2649-4331 | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
Appears in Collections: | Department of Computer Science and Information Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless otherwise indicated.
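The abstract above describes a three-step reranking pipeline: (1) BM25 retrieval over the GPT-2 training data, (2) embedding of the retrieved candidates, and (3) reranking by embedding similarity to the generated text. A minimal, self-contained sketch of that flow follows; the toy BM25 implementation and the term-frequency `embed()` function are stand-ins I introduce for illustration (the paper itself uses BERT models pre-trained from scratch on USPTO patent text):

```python
# Sketch of the BM25-retrieve-then-embed-and-rerank pipeline from the abstract.
# Assumptions: tiny in-memory corpus, hand-rolled BM25, and a term-frequency
# vector as a stand-in for BERT embeddings (not the authors' actual models).
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against the query (Okapi BM25)."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()                      # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def embed(text, vocab):
    """Stand-in for BERT: a simple term-frequency vector over a fixed vocab."""
    tf = Counter(text.split())
    return [tf[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(generated_text, corpus, top_k=3):
    # Step 1: bag-of-words retrieval (BM25) over the "training data".
    query = generated_text.split()
    docs = [d.split() for d in corpus]
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(corpus)), key=lambda i: -scores[i])[:top_k]
    # Steps 2-3: embed the candidates and rerank by similarity to the query.
    vocab = sorted({w for d in docs for w in d} | set(query))
    qvec = embed(generated_text, vocab)
    return sorted(candidates, key=lambda i: -cosine(qvec, embed(corpus[i], vocab)))
```

For example, `rerank("rotor blade of a wind turbine", corpus)` returns candidate indices ordered by embedding similarity; in the paper this final ordering over BERT embeddings is what improved on ranking with embeddings alone.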