Text Retrieval in Restricted Domains by Pairwise Term Co-occurrence

Authors: E. Sneiders and A. Henriksson

DOI:

https://doi.org/10.7250/csimq.2024-41.05

Keywords:

Term Co-occurrence, Text Similarity, Text Matching, Term Weights, Document Retrieval, BM25, Embeddings

Abstract

Text similarity calculation with text embeddings requires fine-tuning of the language model with a large amount of labeled data, which may not be available for small text collections in specific knowledge domains, in particular in public organizations. As an alternative to machine learning, this research proposes pairwise term co-occurrence within plain-text matching, i.e., the query and the document share co-occurrences of two terms in a text span. Across the entire document, the co-occurrences form the context that affects a term. This is analogous to a contextual word embedding, except that our context affects the importance, not the meaning, of the term. Pairwise term co-occurrence has been applied in three text similarity calculation methods: term-pair-based text similarity, BM25 with term weights enhanced by pairwise term co-occurrence, and likewise enhanced cosine similarity. The three methods were evaluated for retrieval of four text types – email messages, web articles, fill-in forms, and brochures from a public organization – with the first three serving as queries. Pairwise term co-occurrence performed on par with or better than BERT sentence embeddings without fine-tuning of the BERT language model. With some text types, pairwise term co-occurrence outperformed bag-of-words matching by as much as 29.44 (MAP) and 31.71 (P@1) percentage points. Pairwise term co-occurrence can fill a niche by improving text similarity calculation where supervised machine learning is difficult to carry out.
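The term-pair matching idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the article's exact scoring method: the window size, whitespace tokenization, and the Jaccard-style containment score are assumptions made for the example.

```python
from itertools import combinations

def term_pairs(tokens, window=5):
    """Collect unordered term pairs that co-occur within a text span.

    The span is a sliding window of `window` tokens; the window size
    here is an assumption chosen for illustration.
    """
    pairs = set()
    for i in range(len(tokens)):
        span = tokens[i:i + window]
        for a, b in combinations(span, 2):
            if a != b:
                pairs.add(frozenset((a, b)))
    return pairs

def pair_similarity(query, document, window=5):
    """Score a document by the share of the query's term pairs that
    also co-occur in the document (a containment score, assumed here)."""
    q = term_pairs(query.lower().split(), window)
    d = term_pairs(document.lower().split(), window)
    return len(q & d) / len(q) if q else 0.0
```

A query such as "tax refund form" then matches a document that mentions "tax refund application form" on all three of its term pairs, while a bag-of-words score would credit each term in isolation; the enhanced BM25 and cosine-similarity variants instead fold such pair statistics into per-term weights.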


Published

31.12.2024

How to Cite

Sneiders, E., & Henriksson, A. (2024). Text Retrieval in Restricted Domains by Pairwise Term Co-occurrence. Complex Systems Informatics and Modeling Quarterly, 41, 80-111. https://doi.org/10.7250/csimq.2024-41.05