https://scholars.lib.ntu.edu.tw/handle/123456789/607414
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lee J.-S. | en_US |
dc.contributor.author | JIEH HSIANG | en_US |
dc.creator | Lee J.-S.; Hsiang J. | - |
dc.date.accessioned | 2022-04-25T06:43:44Z | - |
dc.date.available | 2022-04-25T06:43:44Z | - |
dc.date.issued | 2021 | - |
dc.identifier.issn | 1613-0073 | - |
dc.identifier.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85111040648&partnerID=40&md5=64d09d70fe6934756da6828a7301aca2 | - |
dc.identifier.uri | https://scholars.lib.ntu.edu.tw/handle/123456789/607414 | - |
dc.description.abstract | Generative models, such as GPT-2, have demonstrated impressive results recently. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering the question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using the patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating the semantic similarities among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output. © 2021 for this paper by its authors. | - |
dc.relation.ispartof | CEUR Workshop Proceedings | - |
dc.subject | Deep learning | - |
dc.subject | Natural language generation | - |
dc.subject | Natural language processing | - |
dc.subject | Patent | - |
dc.subject | Semantic search | - |
dc.subject | Embeddings | - |
dc.subject | Natural language processing systems | - |
dc.subject | Patents and inventions | - |
dc.subject | Semantic Web | - |
dc.subject | Semantics | - |
dc.subject | Bag of words | - |
dc.subject | Generative model | - |
dc.subject | Model-based OPC | - |
dc.subject | Patent datum | - |
dc.subject | Prior art search | - |
dc.subject | Ranking approach | - |
dc.subject | Semantic similarity | - |
dc.subject | Training data | - |
dc.subject | Text mining | - |
dc.title | Prior art search and reranking for generated patent text | en_US |
dc.type | conference paper | en |
dc.identifier.scopus | 2-s2.0-85111040648 | - |
dc.relation.pages | 18-24 | - |
dc.relation.journalvolume | 2909 | - |
item.cerifentitytype | Publications | - |
item.openairetype | conference paper | - |
item.openairecristype | http://purl.org/coar/resource_type/c_5794 | - |
item.grantfulltext | none | - |
item.fulltext | no fulltext | - |
crisitem.author.dept | Networking and Multimedia | - |
crisitem.author.dept | Computer Science and Information Engineering | - |
crisitem.author.orcid | 0000-0002-2649-4331 | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
Appears in Collections: | Department of Computer Science and Information Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless otherwise indicated.
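The abstract above describes a three-step reranking pipeline: (1) BM25 retrieval over the GPT-2 training data, (2) embedding of the retrieved candidates, and (3) reranking by embedding similarity to the generated text. A minimal, self-contained sketch of that flow follows; the toy BM25 implementation and the term-frequency `embed()` function are stand-ins I introduce for illustration (the paper itself uses BERT models pre-trained from scratch on USPTO patent text):

```python
# Sketch of the BM25-retrieve-then-embed-and-rerank pipeline from the abstract.
# Assumptions: tiny in-memory corpus, hand-rolled BM25, and a term-frequency
# vector as a stand-in for BERT embeddings (not the authors' actual models).
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against the query (Okapi BM25)."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()                      # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def embed(text, vocab):
    """Stand-in for BERT: a simple term-frequency vector over a fixed vocab."""
    tf = Counter(text.split())
    return [tf[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(generated_text, corpus, top_k=3):
    # Step 1: bag-of-words retrieval (BM25) over the "training data".
    query = generated_text.split()
    docs = [d.split() for d in corpus]
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(corpus)), key=lambda i: -scores[i])[:top_k]
    # Steps 2-3: embed the candidates and rerank by similarity to the query.
    vocab = sorted({w for d in docs for w in d} | set(query))
    qvec = embed(generated_text, vocab)
    return sorted(candidates, key=lambda i: -cosine(qvec, embed(corpus[i], vocab)))
```

For example, `rerank("rotor blade of a wind turbine", corpus)` returns candidate indices ordered by embedding similarity; in the paper this final ordering over BERT embeddings is what improved on ranking with embeddings alone.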