Unsupervised Approach for Specialized Vocabulary Creation and Enrichment: A Case Study in the Multidisciplinary Building Sector

Lydia Khelifa Chibout; Manuele Kirsch Pinheiro

doi:10.7250/csimq.2025-44.04

Authors

Lydia Khelifa Chibout, CSTB Scientific and Technical Center for Building, 84 Av Jean Jaurès, 77420 Champs-sur-Marne, France, https://orcid.org/0000-0002-6448-4580
Manuele Kirsch Pinheiro, Centre de Recherche en Informatique, Université Paris 1 Panthéon-Sorbonne, 90 rue de Tolbiac, 75013 Paris, France, https://orcid.org/0000-0002-5611-4942

DOI:

https://doi.org/10.7250/csimq.2025-44.04

Keywords:

Keywords Extraction, Clustering, Vocabulary Identification, Knowledge-Based Construction, Knowledge Management

Abstract

The exponential growth of digital information has exposed organizations to unprecedented challenges in managing and structuring their knowledge repositories. In knowledge management, the ability to extract, organize, and use relevant information from large document collections has become critical for operational efficiency and informed decision-making. However, identifying the necessary knowledge sources and building appropriate knowledge bases remain significant, time-consuming barriers. In this article, we address these challenges by combining Natural Language Processing (NLP) techniques with Large Language Models (LLMs) to select more representative keywords for creating and enriching vocabularies for knowledge management purposes. We explore clustering techniques combined with NLP-driven keyword extraction to support the construction of specialized vocabularies that reflect the multidisciplinary nature of the content at CSTB, a French scientific research center focused on building science. The pipeline is applied with two approaches to keyword extraction: document-based clustering, in which whole documents are grouped, and chunk-based clustering, in which smaller segments of each document are grouped. We provide a detailed overview of the proposed pipeline, present the results of our experiments, and describe the human validation process used to evaluate these results.
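
The difference between document-based and chunk-based clustering can be illustrated with a minimal sketch. It is not the pipeline evaluated in the article: word counts stand in for text representations, a greedy similarity threshold stands in for the clustering step, and a TF-IDF-like score computed across clusters stands in for keyword extraction. All function names, parameters, and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def chunk(text, size=6):
    """Split a document into consecutive chunks of `size` tokens."""
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(units, threshold=0.2):
    """Greedy single-pass clustering (an assumption, chosen for brevity).

    Each unit joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for unit in units:
        vec = Counter(tokenize(unit))
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [unit]})
        else:
            best["centroid"].update(vec)
            best["members"].append(unit)
    return [c["members"] for c in clusters]


def top_keywords(clusters, k=3):
    """Rank the terms of each cluster by a TF-IDF-like score over clusters."""
    counts = [Counter(t for m in members for t in tokenize(m)) for members in clusters]
    df = Counter(t for c in counts for t in c)  # number of clusters containing t
    n = len(counts)
    result = []
    for c in counts:
        scores = {t: f * math.log(1 + n / df[t]) for t, f in c.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


# Hypothetical sample texts, not taken from the CSTB corpus.
docs = [
    "thermal insulation reduces heat loss through walls insulation panels improve thermal comfort",
    "fire safety rules require fire resistant doors and smoke detectors",
    "acoustic insulation limits noise while fire doors slow smoke",
]

# Document-based: each whole document is one unit to cluster.
doc_clusters = cluster(docs)
# Chunk-based: each document is first split into smaller units.
chunk_clusters = cluster([c for d in docs for c in chunk(d, size=6)])

print(top_keywords(doc_clusters))
print(top_keywords(chunk_clusters))
```

In this sketch, a document that mixes several topics is assigned to a single cluster in the document-based variant, whereas in the chunk-based variant its segments can land in different clusters.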
