LEVERAGING LLM EMBEDDINGS AND REVERSE DICTIONARIES FOR RELIABLE TOPIC MODELING AND PRIVACY-SENSITIVE SMART CITY APPLICATIONS: TOWARDS RESIDENTS’ SATISFACTION AND SAFETY

By Engr. Waqar Qayyoom Khokhar
March 22, 2025

HAMAD BIN KHALIFA UNIVERSITY

COLLEGE NAME

LEVERAGING LLM EMBEDDINGS AND REVERSE DICTIONARIES FOR RELIABLE TOPIC MODELING AND PRIVACY-SENSITIVE SMART CITY APPLICATIONS: TOWARDS RESIDENTS’ SATISFACTION AND SAFETY

STUDENT’S NAME AS IT APPEARS IN THE TRANSCRIPT

A Select Document Type Submitted to the Faculty of the

College Name

in Partial Fulfillment

of the Requirements

for the Degree of

[Master of XXX]/ [Doctor of Philosophy]

March 2025

ABSTRACT

Our work aims to reliably identify the latent topics that concern smart city tenants and residents, for example, residents’ key perceptions of and reactions to worldwide issues such as climate change. It also aims to identify the dominant news items that smart city residents discuss and share. With these reliable identifications, smart city decision-makers can prioritize and tailor relevant services to ensure tenants’ satisfaction and quality of life. In addition, in the era of Large Language Models (LLMs), smart city residents are expected to interact heavily with this technology and ask it a wide range of questions. These questions are not always general; for privacy reasons, some of them must not be linkable to the resident who asked them. This matters most in scenarios where residents do not fully trust the service provider, which is a reasonable assumption. Decoupling prompts from the tenants who issue them is therefore a pressing need. To enable reliable topic modeling and address the privacy concerns of smart city tenants, we propose leveraging embedding models in conjunction with reverse dictionaries. We are the first to introduce this combination and apply it to reliable smart city applications.

TABLE OF CONTENTS

CHAPTER 1: Introduction

1.1 Background and Motivation

1.2 Research Gaps and Challenges (e.g., reliable topic identification, privacy concerns)

1.3 Research Goals

1.3.1 Aim

1.3.2 Objectives

1.3.3 Research Questions

CHAPTER 2: Literature Review

2.1 LLM based Topic Modelling

2.1.1 Qualitative Insights Tool (QualIT)

2.1.2 LLM-assisted Iterative Topic Augmentation framework (LITA)

2.1.3 LLMs Based Topic Modelling for Sustainable Development Goals

2.1.4 LLMs for IKEA Reviews Generation

2.1.5 LLMs in Automated Labelling

2.1.6 PromptTopic based Modelling

2.1.7 LLM-in-the-loop Thematic Analysis

2.2 Comparison of Topic Modeling Approaches

2.3 Privacy-Preserving AI Techniques in Smart Cities

2.4 AI-Driven Smart City Applications and Their Impact on Residents’ Satisfaction and Safety

2.5 Limitations and Research Gaps in Existing Models

2.6 Summary of Chapter

CHAPTER 3: EMBEDDING MODELS AND REVERSE DICTIONARIES FOR RELIABLE TOPIC MODELING

3.1 Overview

3.2 Problem Statement and Research Questions

3.3 Data Collection and Input Sources

3.3.1 Climate Change Dataset

3.3.2 AG News Dataset

3.4 Topic Modeling Pipelines

3.4.1 Pipeline 1: Topic Modeling Using Embedding with Reverse Dictionary

3.4.2 Pipeline 2: Latent Dirichlet Allocation (LDA) for Topic Modeling in Smart City Applications

3.4.3 Pipeline 3: K-Means Clustering on Pipeline 1 & 2 Outputs

3.4.4 Pipeline 4: BERTopic for Smart City Topic Modeling

3.4.5 Pipeline 5: Combined Clustering of Reverse Dictionary and BERTopic Outputs

3.5 Privacy-Preserving Mechanisms

3.6 System Architecture

3.7 Evaluation Metrics

3.7.1 Coherence_npmi

3.7.2 Coherence_c_v

3.7.3 Topic_Diversity

CHAPTER 4: Results, Findings & Analysis

4.1 Overview

4.1.1 Research Questions

4.1.2 Objective of the Experiment

4.2 Experimental Setup

4.2.1 Input Data and Original Prompt

4.2.2 Pre-processing

4.2.3 Embedding Extraction

4.2.4 Reverse Dictionary-Based Vector Database

4.2.5 Definitions Extraction (Top 1–10 Definitions)

4.2.6 Expected Prompt Extraction

4.3 Similarity Score Calculation

4.3.1 Between Definition of RD and Original Prompt

4.3.2 Between OpenAI Expected Prompt and Original Prompt

4.4 Experimental Setup Results for Climate Change Data

4.4.1 Coherence (npmi)

4.4.2 Coherence (c_v)

4.4.3 Topic Diversity

4.5 Experimental Setup Results for AG News Data

4.7 Discussion

4.7.1 Experimental Results Discussion

4.7.2 Privacy-Sensitive Smart City Applications Towards Residents

CHAPTER 5: Conclusion and Future Work

List of References

Appendices

Code

Extra data

LIST OF FIGURES

Figure 1: Data heads in climate change data

Figure 2: Flowchart diagram of the experimental work

Figure 3: Pipeline 1: Reverse Dictionary

Figure 4: Pipeline 2: Latent Dirichlet Allocation (LDA)

Figure 5: Pipeline 3: K-Means Clustering on Pipeline 1 (RD) output & pipeline 2 (LDA) Outputs

Figure 6: Pipeline 4: BERTopic

Figure 7: Pipeline 5: Combined Clustering of Reverse Dictionary and BERTopic Outputs

Figure 1: Evaluation metrics for climate change dataset

Figure 3: Climate Change Dataset Results

Figure 4: Evaluation metrics for AG News dataset

Figure 5: AG News Dataset results

LIST OF TABLES

Table 1: Comparison of topic modeling techniques, their advantages, limitations, and uses in smart city applications

Table 2: Privacy-Preserving AI Techniques in Smart Cities

Table 3: AI-Driven Smart City Applications and Their Impact on Residents’ Satisfaction and Safety

Table 4: Limitations and Research Gaps in Existing Models

Table 6: Evaluation Metrics for Topic Modeling Performance for Climate Change Dataset

Table 3: Climate Change Dataset Results

Table 4: AG News Dataset results

Table 5: AG News Dataset Results

DECLARATION

This is to certify that the work described in this document type is entirely my own, unless otherwise referenced or acknowledged. This work has not previously been submitted for qualifications at any other academic institution.

Signed

Date: Insert the Date

CHAPTER 1: Introduction

1.1 Background and Motivation

Generative AI has recently become a prominent topic for researchers, industry, and the public. Within generative AI, LLMs stand out as a transformative advance in general-purpose language processing because they capture and encode a broad range of human knowledge. LLMs have proven to be proficient and versatile tools for language-understanding tasks such as reasoning, question answering, code completion, translation, and summarization [1].

1.2 Research Gaps and Challenges (e.g., reliable topic identification, privacy concerns)

Representation Engineering (RepE) is a recently introduced paradigm for controlling the behaviour of LLMs. Rather than fine-tuning the model or modifying its inputs, RepE directly manipulates the model’s internal representations. This approach offers effective, data-efficient, interpretable, and flexible control. RepE has therefore been identified as an opportunity for improving LLM performance through experimental and methodological improvements and a guide to best practices [2].

1.3 Research Goals

1.3.1 Aim

The aim of this research is to investigate, evaluate, and compare the performance of three topic modelling techniques (Reverse Dictionary, LDA, and BERTopic) on the Climate Change and AG News datasets using topic diversity and coherence scores.

1.3.2 Objectives

To evaluate the coherence and topic diversity of the Reverse Dictionary, LDA, and BERTopic models on the Climate Change dataset.
To compare the performance of the Reverse Dictionary, LDA, and BERTopic models based on coherence (c_v, npmi) and topic diversity scores.
To identify potential improvements for enhancing topic modelling performance.

1.3.3 Research Questions

Which topic modelling techniques are best suited to the climate change and news datasets?
Which topic modelling approach among RD, LDA, and BERTopic achieves the highest performance scores (coherence and topic diversity)?
Can hybrid approaches (combinations of RD & LDA and RD & BERTopic) lead to better topic representations?

CHAPTER 2: Literature Review

2.1 LLM based Topic Modelling

2.1.1 Qualitative Insights Tool (QualIT)

The Qualitative Insights Tool (QualIT) is a novel LLM-enhanced topic modelling approach for academic and real-world applications that uses LLM-powered topic enrichment to improve topic coherence and diversity. QualIT achieves a topic coherence of 70%, compared with 65% and 57% for the benchmarks, and a topic diversity of 95.5%, compared with 85% and 72% for the benchmarks. These metrics show that QualIT improves on the benchmark models in both coherence and diversity, setting a new benchmarking standard [3].

2.1.2 LLM-assisted Iterative Topic Augmentation framework (LITA)

The LLM-assisted Iterative Topic Augmentation framework (LITA) is an efficient, iterative framework that uses an LLM as a topic evaluator to refine ambiguous topic assignments. LITA combines user-provided seed words with embedding-based clustering, then iteratively refines the topics by identifying ambiguous documents and reassigning them via an LLM. Limiting LLM calls to these ambiguous cases minimizes API costs. The authors report that LITA outperformed five baseline models: LDA, BERTopic, SeededLDA, CorEx, and PromptTopic [4].

2.1.3 LLMs Based Topic Modelling for Sustainable Development Goals

LLM-based embeddings have also been used to study the research literature on the Sustainable Development Goals (SDGs). The research goal was to analyze the research literature’s attitude toward the SDGs over the 2006–2023 period using Scopus data. The proposed pipeline has three components: LLM-based topic modeling, automated data fetching, and keyword and time-series exploration. The topic modeling component uses BERTopic with scalable LLM embeddings for representation. The automated data fetching component collects research abstracts from Scopus for five SDG groups and applies hyper-parameter optimization for dataset-specific tuning. The keyword and time-series exploration component provides keyword-based search to navigate topics and computes topic-frequency time series to track their evolution over time [5].

2.1.4 LLMs for IKEA Reviews Generation

LLMs have also been applied to customer reviews. The goal of this study was to use LLMs for topic identification and key phrase generation in IKEA reviews. The LLMs tested were Mistral-7B, Llama3-8B, and Gemma-7B. The best topic results were obtained when both the reviews and the generated key phrases were given as input, combined with iterative topic refinement. Mistral-7B (5-shot) scored highest for key phrase generation, whereas Llama3-8B performed best for sentiment analysis. Overall, the authors found that LLMs beat traditional methods in topic accuracy, diversity, and sentiment insights [6].

LLMs also help researchers manage publication overload, improving review accuracy and knowledge discovery. The Topic Review (TR) framework is a semi-automatic, LLM-based system for faster, more accurate literature reviews and streamlined research synthesis. It combines LLMs, text mining, and machine learning.

The process works in four key steps: query refinement, text mining, theme extraction with LLMs, and literature synthesis. The authors report 69.56% keyword similarity and a high Fleiss’ Kappa score, indicating expert agreement [7].

2.1.5 LLMs in Automated Labelling

Another application of LLMs is automating and improving topic labels. The method uses BERTopic to extract keywords and summaries and then assigns LLM-generated labels. Label quality is assessed in terms of dominant themes and topic diversity. This approach reduces manual interpretation and improves label accuracy and representativeness.

2.1.6 PromptTopic based Modelling

Prompt-based unsupervised topic modeling with LLMs is also useful for comparing LLM APIs (e.g., ChatGPT, LLaMA) and benchmarking model effectiveness [8]. Traditional models miss sentence-level semantics and struggle with texts that have low word co-occurrence. To improve topic modeling for short-text datasets, the authors of [9] proposed PromptTopic, which uses LLMs to extract sentence-level topics and then aggregates and condenses them. The approach requires no manual parameter tuning and handles texts of varying length and complexity. PromptTopic outperformed state-of-the-art models on three diverse datasets, achieving higher topic coherence and more accurate, flexible topic discovery even for short, sparse texts [9].

2.1.7 LLM-in-the-loop Thematic Analysis

LLM-in-the-loop thematic analysis (TA) is a human-LLM collaboration that combines TA with topic modeling and supports complex data exploration. TA is labour-intensive, requiring multiple coders and iterative coding cycles. The proposed LLM-in-the-loop framework uses in-context learning (ICL) to collaborate with human coders: an LLM (e.g., GPT-3.5) assists in coding and in generating the final codebook. The result was coding quality similar to that of human coders, with a faster and less resource-intensive TA process [10].

2.2 Comparison of Topic Modeling Approaches

Table 1: Comparison of topic modeling techniques, their advantages, limitations, and uses in smart city applications

Model	Technique	Advantages	Limitations	Use in Smart Cities	Reference
Latent Dirichlet Allocation (LDA)	Probabilistic topic model	Well-established, interpretable	Struggles with short texts, requires tuning	Traffic pattern analysis, citizen feedback clustering	[11], [12]
Non-negative Matrix Factorization (NMF)	Linear algebra-based factorization	More interpretable than LDA, deterministic results	Sensitive to initialization, lacks probabilistic grounding	City service optimization	[13]–[15]
Word2Vec + K-Means	Word embeddings with clustering	Captures semantic similarity	Requires large corpus, lacks topic coherence	Urban mobility trends	[7], [16], [17]
BERT-Based Topic Modeling	Transformer-based embeddings	Context-aware, handles short texts	Computationally expensive, may need fine-tuning	Social media sentiment analysis in smart cities	[7], [16], [18]–[20]
LLM Embeddings + Reverse Dictionaries	Semantic embeddings with lexicon lookup	High interpretability, better for multilingual data	Requires high-quality dictionaries, tuning challenges	Privacy-aware topic modeling for city planning	Proposed Model

2.3 Privacy-Preserving AI Techniques in Smart Cities

Table 2: Privacy-Preserving AI Techniques in Smart Cities

*Model/Technique*	*Approach*	*Privacy Features*	*Challenges*	*Use Case in Smart Cities*	*Reference*
*Federated Learning (FL)*	Decentralized learning without data sharing	Keeps data on edge devices	Requires high bandwidth, vulnerable to inference attacks	Smart traffic prediction	[20]–[23]
*Differential Privacy (DP)*	Adding statistical noise to datasets	Protects individual data points	May reduce model accuracy	Public transport optimization with user data	[1], [24]
*Homomorphic Encryption (HE)*	Encrypting data before computation	Enables secure data processing	Computationally expensive	Secure healthcare data analysis	[24]
*LLM Embeddings + Anonymization*	Using embeddings while stripping personal data	Reduces re-identification risks	Requires effective de-identification	Privacy-preserving citizen sentiment analysis	Proposed Model

2.4 AI-Driven Smart City Applications and Their Impact on Residents’ Satisfaction and Safety

Table 3: AI-Driven Smart City Applications and Their Impact on Residents’ Satisfaction and Safety

*Application*	*AI Model Used*	*Benefit to Residents*	*Potential Risks*	*Privacy Considerations*	*Reference*
*Predictive Policing*	Deep Learning & NLP	Faster crime prevention	Bias in training data	Needs differential privacy	[25]–[27]
*Smart Traffic Management*	Reinforcement Learning	Reduced congestion	Sensor failure risks	Federated learning for edge devices	[28]
*Public Health Monitoring*	LLMs & Topic Modeling	Early disease detection	False positives	Anonymized data collection	[3], [4], [9], [14], [29]
*Privacy-Sensitive Smart Assistants*	LLM Embeddings + Reverse Dictionary	Improved emergency response	Potential misuse	Homomorphic encryption for secure queries	Proposed Model

2.5 Limitations and Research Gaps in Existing Models

Table 4: Limitations and Research Gaps in Existing Models

*Research Gap*	*Current Limitations*	*Potential Improvement*	*Proposed Model Contribution*
*Traditional Topic Models Lack Contextual Understanding*	LDA & NMF struggle with short text	LLM-based embeddings for richer context	LLM embeddings + Reverse Dictionaries
*Privacy Concerns in Smart City AI*	FL & DP reduce risk but affect accuracy	Combining DP with semantic embeddings	Anonymized LLM embeddings for safer AI
*Bias and Fairness Issues*	AI-based decision-making can be biased	Fairness-aware ML techniques	Ethical AI framework for urban safety
*Scalability in Large Smart Cities*	Heavy computational costs	Edge AI & hybrid models	Optimized LLM-based topic modeling

2.6 Summary of Chapter

This structured literature review provides a comparative analysis of different AI models relevant to LLM embeddings, topic modeling, and privacy-sensitive smart city applications. The tables highlight key insights, challenges, research gaps, and potential improvements for future work.

CHAPTER 3: EMBEDDING MODELS AND REVERSE DICTIONARIES FOR RELIABLE TOPIC MODELING

3.1 Overview

This chapter introduces our approach of leveraging embedding models in conjunction with reverse dictionaries for reliable topic modeling. We evaluate our pipeline on two datasets, Climate Change and AG News, and compare it with two other pipelines, LDA and BERTopic, using three metrics: Coherence_npmi, Coherence_c_v, and Topic_Diversity. Our approach aims to provide a reliable, accurate method for identifying residents’ key perceptions of and reactions to worldwide issues.

3.2 Problem Statement and Research Questions

In the context of smart cities, understanding residents’ concerns, reactions, and key perceptions is essential for enhancing satisfaction and safety. However, accurately identifying these latent topics, especially in privacy-sensitive scenarios, remains a challenge. This research addresses the following questions:

How can we reliably identify latent topics from smart city data using LLM embeddings and reverse dictionaries?
How can we balance accurate topic modeling with residents’ privacy concerns?
Which combination of topic modeling techniques produces the most reliable and interpretable results?

3.3 Data Collection and Input Sources

Two primary datasets were chosen to represent resident discussions and global influences. Both datasets undergo preprocessing, including tokenization, stop-word removal, and normalization, to ensure clean and usable input.

3.3.1 Climate Change Dataset

The Climate Change dataset was generated by collecting documents related to climate change from different social media and blogging platforms. The keywords used as selection criteria come from 11 categories: climate and consumption systems, global warming, energy transition, climate change, lowering energy and material intensities, strategies for managing resource supply and demand, valorizing waste, sustainable production and consumption, sustainability and productivity of resource use, resource productivity improvement, and recycling.

The dataset reflects residents’ perceptions and concerns regarding climate change. It has two main variables, a climate change label and a timestamp (DD.MM.YYYY HH.MM), as shown in the figure below. It is a secondary dataset that is publicly available on GitHub [30]; the research article describing the original work was published in Heliyon [31].

Figure 1: Data heads in climate change data

3.3.2 AG News Dataset

The AG News dataset captures news topics discussed and shared within smart city communities. It comprises approximately 120K rows drawn from a corpus of over 1 million news articles collected from more than 2,000 sources by the ComeToMyHead academic news search engine, which has been active since July 2004. The dataset is categorized into four topics: World, Sports, Business, and Science/Technology. It was created for research in fields such as data mining, information retrieval, and text classification, and it has been widely used as a benchmark for text classification, notably by Zhang et al. in their work on character-level convolutional networks [32]. The dataset is openly available for non-commercial research purposes [33], [34].

3.4 Topic Modeling Pipelines

We developed five interconnected pipelines to systematically explore and refine topic modeling. The structure and function of each pipeline are explained below.

Figure 2: Flowchart diagram of the experimental work

3.4.1 Pipeline 1: Topic Modeling Using Embedding with Reverse Dictionary

This pipeline extracts meaningful explanations and topics from textual prompts by combining prompt embeddings with a reverse dictionary. The reverse dictionary approach bridges human-like interpretations with embedding spaces, enabling intuitive cluster formation. The pipeline is designed to extract interpretable and semantically rich topics from textual data, ensuring privacy and enhancing smart city services to improve residents’ satisfaction and safety. The process unfolds in the following stages:

Figure 3: Pipeline 1: Reverse Dictionary

3.4.1.1 Input Data (2000 Samples)

The pipeline begins with 2000 text samples from the climate change dataset and, likewise, 2000 samples from the AG News dataset. The AG News dataset is categorized into four categories: World, Sports, Business, and Science/Technology. From each of the four categories, 500 samples are collected, giving a total of 2000 samples.
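The balanced draw of 500 samples per AG News category can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `sample_per_category`, the fixed seed, and the placeholder corpus are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_per_category(rows, per_category=500, seed=0):
    """Draw up to `per_category` rows from each label, reproducibly."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = []
    for label in sorted(by_label):
        texts = by_label[label]
        k = min(per_category, len(texts))
        sample.extend((text, label) for text in rng.sample(texts, k))
    return sample

# Placeholder corpus: 4 categories x 600 dummy articles (not real AG News rows).
labels = ["World", "Sports", "Business", "Science/Technology"]
rows = [(f"{label} article {i}", label) for label in labels for i in range(600)]

subset = sample_per_category(rows, per_category=500)
print(len(subset))  # 4 categories x 500 samples = 2000
```

Sampling without replacement inside each category keeps the four topics equally represented, which matches the 4 × 500 = 2000 split described above.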

3.4.1.2 Pre-Processing

Pre-processing is a crucial step before applying any algorithm to a dataset. In language processing it involves several steps that improve the performance of downstream models such as RD:

Text cleaning: lowercasing, punctuation removal, stop-word removal, and special-character handling.
Tokenization: splitting text into words/tokens.
Lemmatization/stemming: reducing words to their base/root form.
Vectorization: word embeddings (e.g., Word2Vec, GloVe, Gemma 2B), or TF-IDF or bag-of-words for simpler models.
Encoding and normalization: text-to-numeric transformation and vector normalization (e.g., min-max scaling).
Dimensionality reduction (if needed): PCA or t-SNE to obtain smaller vector spaces.
Handling out-of-vocabulary (OOV) words: unknown-word handling strategies and subword embeddings (e.g., FastText).
Stop-word augmentation (optional): domain-specific stop-word lists.
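The first two steps (text cleaning and tokenization) can be sketched with the standard library alone. The stop-word list below is a tiny illustrative assumption, and lemmatization, vectorization, and the later steps are deliberately left out because they depend on external models.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on"}

def preprocess(text):
    """Lowercase, replace punctuation/special characters, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and special characters -> space
    tokens = text.split()                     # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The Glaciers are melting, and sea levels are RISING!"))
# ['glaciers', 'melting', 'sea', 'levels', 'rising']
```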

3.4.1.3 Embeddings Extraction

After preprocessing, the Gemma 2B model was used to convert the text into dense vector representations, known as embeddings. Embeddings were extracted for both the dataset prompts and all the definitions in the reverse dictionary. Gemma 2B is a transformer-based language model designed for efficient natural language processing (NLP); here it is used to generate embeddings. Using the same model for both the dataset and the reverse dictionary ensures that the extracted embeddings lie in the same semantic space, making them directly comparable.

3.4.1.4 Reverse Dictionary Based Vector Database

The generated embeddings are then passed through the RD mechanism, in which a traditional reverse dictionary is converted into a vector database of definition embeddings. Each prompt embedding is matched with the most semantically similar definition in the dictionary. The reverse dictionary thus acts as a bridge, translating complex embeddings into human-readable definitions that encapsulate the underlying meaning of each text sample.

Each text sample is converted into a dense vector representation using a pre-trained language model (Gemma 2B in this work). In a reverse dictionary, instead of searching for the meaning of a word, the user inputs a definition or description and the system suggests words or phrases that match it; for example, entering “a fear of spiders” returns the word “arachnophobia”.

The embeddings of all the reverse dictionary definitions are stored in a FAISS index. FAISS is a vector search library; it compares the embedding of each prompt from the dataset against the indexed definition embeddings to find the nearest, most semantically similar ones.

The reverse dictionary consists of pairs of definitions and keywords generated by humans. This makes this approach reliable and ensures that the definition comparisons are aligned with human semantics and language.

3.4.1.5 Keyword Extraction (Top 1 Keyword per Sample)

After each text embedding has been matched to a dictionary definition, the most relevant keyword associated with that definition is extracted. As a result, a single representative keyword is selected for every input sample, forming a distilled semantic label for the text.
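The matching and top-1 keyword selection can be illustrated with a brute-force search, which stands in here for the FAISS index used in the pipeline. Everything in the sketch is an assumption made for illustration: the three dictionary entries, the 3-dimensional vectors, and the choice of cosine similarity (the text does not state which similarity measure the index uses).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical reverse dictionary: (keyword, definition, definition embedding).
# The vectors are invented; real ones would come from the embedding model.
REVERSE_DICTIONARY = [
    ("arachnophobia", "a fear of spiders", [0.9, 0.1, 0.0]),
    ("drought", "a long period with little or no rain", [0.1, 0.9, 0.2]),
    ("recycling", "converting waste into reusable material", [0.0, 0.2, 0.9]),
]

def top_keyword(prompt_embedding, dictionary=REVERSE_DICTIONARY):
    """Return the keyword whose definition embedding is closest to the prompt."""
    best = max(dictionary, key=lambda entry: cosine(prompt_embedding, entry[2]))
    return best[0]

prompt_embedding = [0.05, 0.3, 0.8]  # stand-in for an embedded prompt about waste
print(top_keyword(prompt_embedding))  # recycling
```

A linear scan like this is only practical for small dictionaries; an approximate or indexed search is what makes the lookup feasible over a full dictionary.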

3.4.1.6 Keyword Collection (2000 Keywords)

The process yields a total of 2000 keywords, each encapsulating the core idea of one of the 2000 text samples. Collecting keywords reduces the complexity of the text data, making it more manageable for downstream clustering and analysis.

3.4.1.7 Clustering (20 Clusters with 20 Keywords Each)

In this final step, a clustering algorithm such as K-means or HDBSCAN organizes the 2000 keywords into 20 clusters, each represented by its top 20 keywords. These clusters represent coherent topics that can be analyzed to uncover residents’ primary concerns, needs, and areas where smart city systems can be optimized.
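A compact K-means sketch on toy keyword vectors shows the assignment and update steps. The 2-dimensional vectors and keywords are invented, and the deterministic farthest-point seeding is a simplification chosen so the example is reproducible; it is not necessarily the initialization used in the experiments.

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=20):
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

keywords = ["flood", "drought", "storm", "election", "senate", "ballot"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
labels, _ = kmeans(vectors, k=2)
for j in range(2):
    print(j, [w for w, lab in zip(keywords, labels) if lab == j])
```

Note that K-means does not by itself produce equally sized clusters, so obtaining exactly 20 keywords per cluster presumably requires a separate top-keyword selection within each cluster.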

3.4.2 Pipeline 2: Latent Dirichlet Allocation (LDA) for Topic Modeling in Smart City Applications

We implemented an existing pipeline for a typical topic modeling approach, Latent Dirichlet Allocation (LDA), as a baseline for comparison with our reverse dictionary with embedding pipeline. LDA performs probabilistic topic modeling: it treats each input document as a mixture of topics, each characterized by a set of keywords. For each document, it outputs a distribution over topics, showing which topics the document is most relevant to. Below is a step-by-step breakdown of the process:

Figure 4: Pipeline 2: Latent Dirichlet Allocation (LDA)

3.4.2.1 Input Data (2000 Samples)

The process starts with the same set of 2000 text samples, representing various city-related inputs such as resident feedback, incident reports, or service requests.

3.4.2.2 Pre-Processing

To prepare the text for LDA, standard natural language processing (NLP) techniques are applied, including:

Tokenization – Splitting text into individual words.
Lowercasing – Converting all words to lowercase to avoid case mismatches.
Stopword Removal – Eliminating common words (e.g., “the”, “and”) that don’t contribute to topic meaning.
Lemmatization/Stemming – Reducing words to their root forms (e.g., “running” → “run”).
Vectorization – Converting the cleaned text into a bag-of-words or term frequency-inverse document frequency (TF-IDF) matrix.

3.4.2.3 Training the LDA Model

The pre-processed text data is used to train an LDA model, a generative probabilistic model that assumes each document is a mixture of topics, and each topic is a mixture of words. The model learns topic distributions by iteratively assigning words to topics and refining these assignments to maximize the likelihood of the observed data.
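In standard LDA notation (the symbols are introduced here for illustration and are not taken from the thesis), the mixture assumption says that the probability of word $w$ in document $d$ is a sum over the $K$ topics, where $\theta_d$ is the document's topic distribution and $\phi_k$ is the word distribution of topic $k$:

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
          \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k},
\qquad \theta_d \sim \mathrm{Dir}(\alpha), \quad \phi_k \sim \mathrm{Dir}(\beta)
```

With $K = 20$ this corresponds to the 20 topics used in this pipeline; $\alpha$ and $\beta$ are the Dirichlet priors that LDA places on the two distributions.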

3.4.2.4 Assigning Clusters

After training, each text sample is assigned to a topic cluster based on the highest probability distribution. This assignment groups texts that share similar thematic content together.

3.4.2.5 Extracting Top Keywords for Each Cluster

For each discovered topic, the LDA model provides a ranked list of keywords with the highest probability of appearing in that topic. The pipeline extracts the top 20 keywords per cluster to represent the topic’s semantic core.
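The two post-training steps, assigning each document to its most probable topic and reading off the top keywords of a topic, can be sketched independently of any particular LDA library. The probability values and the function names `assign_topic` and `top_keywords` are hypothetical stand-ins for the outputs of a trained model.

```python
def assign_topic(doc_topic_probs):
    """Index of the most probable topic for one document."""
    return max(range(len(doc_topic_probs)), key=lambda k: doc_topic_probs[k])

def top_keywords(topic_word_probs, n=20):
    """The n highest-probability words of one topic, best first."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical outputs of a trained LDA model with 2 topics (values invented).
doc_topics = [[0.8, 0.2], [0.3, 0.7]]
topics = [
    {"climate": 0.30, "carbon": 0.25, "energy": 0.20, "market": 0.05},
    {"market": 0.35, "stocks": 0.30, "trade": 0.15, "energy": 0.05},
]

print([assign_topic(d) for d in doc_topics])  # [0, 1]
print(top_keywords(topics[0], n=3))           # ['climate', 'carbon', 'energy']
```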

3.4.2.6 Clustering Results (20 Clusters with 20 Keywords Each)

In this final step, the results are organized into 20 clusters, each represented by its 20 highest-probability keywords. These 20 clusters encapsulate the primary themes within the text corpus, offering valuable insights into residents’ concerns, city dynamics, and climate change.

3.4.3 Pipeline 3: K-Means Clustering on Pipeline 1 & 2 Outputs

Pipeline 3 combines the first two pipelines, Reverse Dictionary (RD) and Latent Dirichlet Allocation (LDA), using K-means clustering. This approach leverages the strengths of both prior pipelines to refine the structure and enhance the interpretability of the results. Let’s break this process down step by step:

Figure 5: Pipeline 3: K-Means Clustering on Pipeline 1 (RD) output & pipeline 2 (LDA) Outputs

3.4.3.1 Pipeline 1: RD Output (20 clusters, 20 keywords):

20 clusters × 20 keywords → 400 keywords

3.4.3.2 Pipeline 2: LDA Output (20 clusters, 20 keywords):

20 clusters × 20 keywords → 400 keywords

3.4.3.3 Combining Outputs

The outputs from both pipelines are concatenated, forming a combined set of 800 keywords (400 from P1-RD + 400 from P2-LDA).

Total combined input:
400 + 400 = 800 keywords
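The concatenation of the two pipeline outputs amounts to flattening each set of 20 topics and joining the lists. The keyword strings below are placeholders, and the de-duplication line is an optional extra that the text does not mention.

```python
def flatten(topics):
    """Turn a list of topics (each a list of keywords) into one keyword list."""
    return [kw for topic in topics for kw in topic]

# Placeholder outputs: 20 topics x 20 keywords from each pipeline.
rd_topics = [[f"rd_{t}_{i}" for i in range(20)] for t in range(20)]
lda_topics = [[f"lda_{t}_{i}" for i in range(20)] for t in range(20)]

combined = flatten(rd_topics) + flatten(lda_topics)
unique = list(dict.fromkeys(combined))  # order-preserving de-duplication
print(len(combined), len(unique))       # 800 800
```

With real outputs the two pipelines may share keywords, in which case the number of distinct keywords would be lower than 800.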

3.4.3.4 K-Means Clustering

The combined 800 keywords are vectorized (e.g., using word embeddings like Word2Vec or sentence embeddings like SBERT) and fed into a K-means clustering algorithm. The algorithm partitions the keyword space into 20 clusters, with each cluster containing approximately 20 keywords. This clustering process refines topic groupings by aligning semantically similar keywords from both pipelines.

3.4.4 Pipeline 4: BERTopic for Smart City Topic Modeling

BERTopic is employed to generate coherent and interpretable topics. It combines state-of-the-art transformer embeddings with clustering techniques, making it well-suited for identifying nuanced and high-quality topics. Let’s break down the components of this pipeline:

Figure 6: Pipeline 4: BERTopic

3.4.4.1 Input Data

Both datasets are also used as input to BERTopic.

3.4.4.2 Embedding Model (Gemma2B)

The input text data is converted into dense vector representations using the Gemma2B embedding model. This transformer-based model captures complex semantic relationships between words and phrases, creating rich, high-dimensional embeddings.

3.4.4.3 Pre-processing

Pre-processing involves text cleaning, such as stripping HTML, removing special characters, and tokenization, as described for Pipeline 1.

3.4.4.4 BERTopic Modeling (c-TF-IDF)

The embeddings, after dimensionality reduction, are clustered into 20 distinct topic groups using the K-means algorithm. Each cluster represents a latent topic, grouping together semantically similar text samples. To label and represent each discovered topic, BERTopic then employs class-based TF-IDF (c-TF-IDF). This technique calculates the importance of words within each cluster relative to the entire corpus, generating highly relevant and distinctive keywords for each topic.
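The idea behind c-TF-IDF can be sketched in plain Python: all documents of a cluster are treated as one class, and each word is scored by its frequency in the class times log(1 + A / f), where A is the average number of words per class and f is the word's frequency across all classes. The sketch uses raw counts rather than normalized frequencies, which does not change the ranking within a cluster; the toy clusters are invented.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_cluster):
    """Score words per cluster as tf(t, c) * log(1 + A / f(t)), where A is the
    average word count per cluster and f(t) the frequency of t over all clusters."""
    cluster_counts = [Counter(w for doc in docs for w in doc) for docs in docs_per_cluster]
    total = Counter()
    for counts in cluster_counts:
        total.update(counts)
    avg_words = sum(total.values()) / len(cluster_counts)
    return [
        {w: tf * math.log(1 + avg_words / total[w]) for w, tf in counts.items()}
        for counts in cluster_counts
    ]

# Two toy clusters of tokenized documents (invented for illustration).
clusters = [
    [["city", "flood", "flood", "rain"], ["flood", "city"]],
    [["city", "vote", "vote", "mayor"], ["vote", "city"]],
]
scores = c_tf_idf(clusters)
for s in scores:
    print(sorted(s, key=s.get, reverse=True)[:2])
# ['flood', 'rain']
# ['vote', 'mayor']
```

The word "city" appears in both clusters, so it is down-weighted even though it is frequent; this is what makes the resulting keywords distinctive per topic.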

3.4.4.5 Clustering for Final Output

The result is 20 clusters, each represented by 20 high-relevance keywords. These topics offer a fine-grained view of the smart city landscape, capturing critical areas of concern like resident safety, infrastructure maintenance, and data privacy.

3.4.5 Pipeline 5: Combined Clustering of Reverse Dictionary and BERTopic Outputs

The final stage of the research methodology involves integrating the outputs from Pipeline 1 (Reverse Dictionary) and Pipeline 4 (BERTopic) to generate the most reliable and interpretable topic representations. This hybrid approach combines the definitional precision of reverse dictionaries with the semantic richness of transformer-based embeddings, ultimately refining the topic structure. Let’s break down the process step by step:

Figure 7: Pipeline 5: Combined Clustering of Reverse Dictionary and BERTopic Outputs

3.4.5.1 Input Data

Pipeline 1 Output (Reverse Dictionary):

20 clusters × 20 keywords → 400 keywords

Pipeline 4 Output (BERTopic):

20 clusters × 20 keywords → 400 keywords

Total combined input:
400 + 400 = 800 keywords

3.4.5.2 Keyword Embedding and Vectorization

The 800 keywords are converted into dense vector representations (e.g., using the same embedding model as BERTopic, like Gemma2B) to capture their semantic meanings.

3.4.5.3 Clustering with K-means

The vectorized keywords are clustered into 20 final topic groups using K-means clustering. This step consolidates semantically similar keywords from both the reverse dictionary and BERTopic outputs, refining topic boundaries and enhancing coherence.

3.4.5.4 Final Topic Representations

Each of the 20 clusters is represented by 20 top keywords, selected based on cluster centroids and intra-cluster word importance. The result is a set of balanced topics that align both with human-readable definitions (from the reverse dictionary) and latent semantic structures (from BERTopic).
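One simple way to realize centroid-based keyword selection is to rank the keywords of each cluster by their distance to the cluster centroid and keep the closest ones. The sketch below does exactly that and nothing more: it ignores the intra-cluster word importance mentioned above, and the function name `top_keywords`, the toy vectors, and the hand-picked centroids are assumptions for illustration.

```python
import math

def top_keywords(keywords, vectors, labels, centroids, top_n=20):
    """Return, per cluster, the top_n keywords closest to the cluster centroid."""
    by_cluster = {}
    for word, vec, lab in zip(keywords, vectors, labels):
        distance = math.dist(vec, centroids[lab])
        by_cluster.setdefault(lab, []).append((distance, word))
    # Sort by distance (ascending) and keep only the closest keywords.
    return {
        lab: [word for _, word in sorted(items)[:top_n]]
        for lab, items in by_cluster.items()
    }

# Toy inputs (illustrative values only).
keywords = ["flood", "storm", "rain", "vote", "senate"]
vectors = [[1.0, 0.0], [0.8, 0.2], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 0, 1, 1]
centroids = [[0.8, 0.2], [0.05, 0.95]]

representatives = top_keywords(keywords, vectors, labels, centroids, top_n=2)
print(representatives)
```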

3.4.5.5 Outcome

This hybrid approach ensures the topics are not only interpretable but also contextually accurate, making them highly suitable for real-world smart city applications focused on resident satisfaction and safety.

3.5 Privacy-Preserving Mechanisms

To address privacy concerns, we implement the following measures, which help build trust among residents and encourage wider adoption of smart city services:

3.5.1.1 Anonymized Embedding Queries

Resident queries are vectorized and processed without retaining identifiable metadata.

3.5.1.2 Federated Learning Integration

Topic models are trained across decentralized nodes, reducing the need for centralized data aggregation.

3.5.1.3 Differential Privacy

Noise injection techniques ensure individual query patterns remain indistinguishable.
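As an illustration of noise injection, the sketch below adds Laplace noise to each coordinate of an embedding, in the style of the Laplace mechanism. The source does not state which mechanism or parameters are used, so the function names, the `epsilon` and `sensitivity` values, and the toy embedding are all assumptions. A formal differential privacy guarantee would also require bounding the L1 sensitivity of the embedding (for example by clipping) and a proper privacy analysis.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(embedding, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise with scale sensitivity / epsilon to every coordinate."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in embedding]

# Toy embedding (illustrative values only).
embedding = [0.12, -0.40, 0.33]
noisy = privatize(embedding, epsilon=1.0, seed=42)
print(noisy)
```

A smaller `epsilon` yields more noise and stronger privacy at the cost of utility, which is the trade-off examined in Chapter 4.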

3.6 System Architecture

The proposed system architecture integrates all pipelines through a modular framework composed of the layers listed below. This architecture is designed for scalability and adaptability, allowing easy integration with evolving smart city infrastructures.

Data Ingestion Layer – Collects and preprocesses input data
Embedding & Vectorization Layer – Converts text to vector representations
Topic Modeling Layer – Executes Pipelines 1–5
Privacy Layer – Implements privacy-preserving mechanisms
Visualization & Insights Layer – Presents clusters and keywords to decision-makers

3.7 Evaluation Metrics

We evaluate our system using:

3.7.1 Coherence_npmi

Measures how often words in a topic appear together in the dataset, indicating how semantically coherent the topic is.
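For reference, normalized pointwise mutual information for a word pair is NPMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))) / (-log P(w1, w2)), and a topic's score is the average over its word pairs. The sketch below estimates the probabilities from document-level co-occurrence, which is a simplifying assumption; library implementations typically use sliding windows. The function names and toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_pair(w1, w2, docs):
    """NPMI of two words, with probabilities estimated per document."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # convention: both words occur in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(topic_words, docs):
    """Average NPMI over all word pairs of a topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)

# Toy corpus: each document is a set of tokens (illustrative data only).
docs = [
    {"flood", "rain", "city"},
    {"flood", "rain"},
    {"vote", "city"},
    {"vote", "senate"},
]
coherent = topic_npmi(["flood", "rain"], docs)
incoherent = topic_npmi(["flood", "vote"], docs)
print(coherent, incoherent)
```

Scores range from -1 (words never co-occur) to 1 (words always co-occur).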

3.7.2 Coherence_c_v

Evaluates topic coherence by combining word co-occurrence statistics with cosine similarity between word vectors.

3.7.3 Topic_Diversity

Indicates the proportion of unique words across topics, with higher values meaning less overlap between topics.

Together, these metrics characterize topic modeling performance. The privacy-utility trade-off is assessed separately through the similarity scores reported in Chapter 4.
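Topic diversity, as defined above, reduces to a single ratio: the number of unique words divided by the total number of words across all topics. The following minimal sketch uses a hypothetical function name and toy topics.

```python
def topic_diversity(topics):
    """Share of unique words among all top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

# Two toy topics that share the word "rain" (illustrative data only).
topics = [["flood", "rain", "storm"], ["vote", "senate", "rain"]]
diversity = topic_diversity(topics)
print(round(diversity, 4))
```

With 20 topics of 20 keywords each, a score of 0.575 would mean that 230 of the 400 keywords are unique.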

CHAPTER 4: Results, Findings & Analysis

4.1 Overview

This chapter details the experimental design, setup, results, and analysis for evaluating the effectiveness of embedding models and reverse dictionary techniques in preserving privacy while maintaining utility in smart city applications. It introduces the research questions, objectives, and the methodology used to achieve the research goals.

4.1.1 Research Questions:

How does the proposed method using embeddings and a reverse dictionary perform in reconstructing the original user prompt while maintaining high utility, as measured by semantic similarity scores (cosine similarity)?
What is the relationship between the utility of the generated prompt and the level of privacy preservation?
How do embeddings with the reverse dictionary technique affect the balance between utility (semantic accuracy of the generated prompt) and privacy (minimization of sensitive data exposure) in smart city applications?

4.1.2 Objective of the Experiment

To investigate whether embedding models combined with reverse dictionaries can effectively reconstruct user prompts while balancing semantic utility and privacy protection.

4.2 Experimental Setup

4.2.1 Input Data and Original Prompt

The experimental setup begins with collecting a dataset of original prompts. A total of 2000 samples were gathered to ensure a diverse and representative set of inputs. These prompts span various domains to test the model’s ability to generalize across different contexts.

4.2.2 Pre-processing

Before the prompts were passed to the model, they were pre-processed: special characters were removed, the text was converted to lowercase, and sentences were tokenized. This step reduces noise in the input data so that it does not negatively affect the subsequent stages.
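The pre-processing steps can be sketched with the standard library alone. The regular expressions, the function name `preprocess`, and the sample prompt are assumptions, since the exact rules used in the experiments are not specified. The sketch splits sentences before removing special characters, because sentence boundaries depend on punctuation.

```python
import re

def preprocess(prompt):
    """Split into sentences, lowercase, and strip special characters."""
    # Naive sentence tokenization: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    cleaned = []
    for sentence in sentences:
        text = sentence.lower()
        text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
        text = re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces
        if text:
            cleaned.append(text)
    return cleaned

# Hypothetical resident prompt.
result = preprocess("Is Doha's air quality improving? Check the #AQI data!")
print(result)
```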

4.2.3 Embedding Extraction

The Gemma 2B model was used to extract embeddings from the pre-processed prompts. The model converts each prompt into a high-dimensional vector representation that captures both semantic and contextual information. These embeddings serve as the foundation for the similarity assessments that follow.

4.2.4 Reverse Dictionary-Based Vector Database

A reverse dictionary approach was used to build a vector database. Definitions from a comprehensive dictionary were converted into embeddings using the same Gemma 2B model, which permits direct comparison with the prompt embeddings. The database serves as a reference for identifying definitions that are closely associated with the input prompts.

4.2.5 Definitions Extraction (Top 1–10 Definitions)

For each prompt embedding, the top one to ten most similar definitions were retrieved from the vector database. Varying the number of retrieved definitions makes it possible to analyze the variability in matches and to determine whether the most relevant definition appears consistently among the top results.
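Retrieving the top-k definitions amounts to a nearest-neighbour search over the definition embeddings. The brute-force sketch below ranks a toy database by cosine similarity. The definitions, the two-dimensional vectors, and the function names are illustrative assumptions, and a real vector database would use an index rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_definitions(prompt_vec, database, k):
    """database is a list of (definition_text, vector) pairs."""
    scored = [(cosine(prompt_vec, vec), text) for text, vec in database]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Toy reverse-dictionary entries with made-up 2-D embeddings.
database = [
    ("a long-term shift in weather patterns", [0.9, 0.1]),
    ("a system of government by elected representatives", [0.1, 0.9]),
    ("an overflow of water onto dry land", [0.7, 0.5]),
]
prompt_vec = [1.0, 0.2]

matches = top_k_definitions(prompt_vec, database, k=2)
for score, text in matches:
    print(f"{score:.4f}  {text}")
```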

4.2.6 Expected Prompt Extraction

Using the ChatGPT model, expected prompts were generated from the extracted definitions. This step helps bridge the gap between the original user prompt and how an AI model might interpret and refine it into a more precise query.

4.3 Similarity Score Calculation

4.3.1 Between Definition of RD and Original Prompt

Cosine similarity scores were calculated between the original prompt embeddings and the embeddings of definitions retrieved from the reverse dictionary. This score quantifies the closeness in meaning between the user’s intent and the existing definitions.
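For reference, the cosine similarity between a prompt embedding p and a definition embedding d (symbols introduced here for illustration) is the standard normalized dot product:

```latex
\[
\operatorname{sim}(\mathbf{p}, \mathbf{d})
  = \frac{\mathbf{p} \cdot \mathbf{d}}{\lVert \mathbf{p} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{n} p_i d_i}{\sqrt{\sum_{i=1}^{n} p_i^{2}} \, \sqrt{\sum_{i=1}^{n} d_i^{2}}}
\]
```

The score lies between -1 and 1, with values closer to 1 indicating that the two embeddings point in nearly the same direction.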

4.3.2 Between ChatGPT Expected Prompt and Original Prompt

Additionally, similarity scores were computed between the original prompt and the ChatGPT-generated expected prompt. This comparison assesses how well the expected prompt aligns with the user’s initial query, providing insights into the model’s interpretative accuracy.

4.4 Experimental Setup Results for Climate Change Data

As Table 5 and Figure 8 show, BERTopic produces the most coherent topics by npmi, which aids interpretation. The Reversed Dictionary approach leads in topic diversity and in c_v coherence, making it useful when wide coverage of distinct concepts is desired. LDA scores lowest on both coherence measures, suggesting that it captures climate change topics less effectively.

Table 5: Evaluation Metrics for Topic Modeling Performance for Climate Change Dataset

Climate Change Dataset Results
	Coherence_npmi	Coherence_c_v	Topic Diversity
P1: Reversed Dictionary	0.0799	0.8748	0.575
P2: LDA	0.0084	0.7794	0.47
P3: BERTopic	0.435	0.8501	0.43

4.4.1 Coherence (npmi)

Coherence (npmi) measures how strongly the words in a topic co-occur in the corpus (the higher the npmi score, the better the performance). In our experiment, BERTopic (0.435) has the best coherence, meaning its topics are more meaningful and interpretable. The Reversed Dictionary (0.0799) shows weak coherence, suggesting its topics are not as semantically tight. LDA (0.0084) performs the worst on this metric, likely generating more fragmented or less coherent topics.

4.4.2 Coherence (c_v)

Coherence (c_v) evaluates topic quality based on word co-occurrence (higher is better). Reversed Dictionary (0.8748) achieves the highest coherence here, meaning it captures co-occurring words well. BERTopic (0.8501) is close behind, still producing high-quality topics. LDA (0.7794) lags but is still decent, showing some level of useful topic structuring.

4.4.3 Topic Diversity

Topic diversity measures how distinct topics are from each other (higher is better). Reversed Dictionary (0.575) has the highest diversity, meaning it captures a broader range of unique topics. LDA (0.47) comes next, with moderate diversity that remains below the RD approach. BERTopic (0.43) scores the lowest, indicating more repetitive or overlapping topics.

Figure 8: Evaluation metrics for climate change dataset

4.4.4 Similarity Trends with RD Definitions & ChatGPT Expected Prompt for Climate Change Dataset

4.4.4.1 Similarity Trends with RD Definitions

The average similarity score between the original prompt and RD definitions starts relatively high, at 0.4517 with just one definition. As more definitions are added, the similarity score declines steadily, reaching 0.2637 with ten definitions.

This suggests that as more definitions are introduced, the definitions might become more diverse or less tightly aligned with the original prompt, leading to a broader semantic range that reduces individual similarity.

4.4.4.2 Similarity Trends with ChatGPT Expected Prompt:

In contrast, the average similarity score between the original prompt and the expected prompt generated by ChatGPT starts at 0.4874 and remains consistently higher than the score for the RD definitions throughout the range. After dipping to 0.4622 with two definitions, the score rises gradually as the number of definitions grows, peaking at 0.5035 with ten definitions.

This suggests that ChatGPT’s generated prompts may better capture the core intent of the original input, possibly because it synthesizes the most relevant elements from the increasing pool of definitions rather than diluting the alignment like the RD approach.

Table 6: Climate Change Dataset Results

Number of Definitions	Average Similarity Score (Original Prompt and RD Definitions)	Average Similarity Score (Original Prompt and Expected Prompt by ChatGPT)
1	0.4517	0.4874
2	0.3790	0.4622
3	0.3430	0.4744
4	0.3201	0.4782
5	0.3051	0.4938
6	0.2927	0.4904
7	0.2829	0.4948
8	0.2751	0.4978
9	0.2687	0.4989
10	0.2637	0.5035

Figure 9: Climate Change Dataset Results

4.5 Experimental Setup Results for AG News Data

For the AG News dataset, BERTopic also achieved the highest npmi coherence score (0.77), indicating its strength in semantic accuracy. Interestingly, the Reversed Dictionary approach achieved an almost perfect c_v coherence score of 0.9962, slightly outperforming LDA (0.9858), while BERTopic lagged at 0.8348. Topic diversity was highest for LDA (0.905), closely followed by BERTopic (0.9881) and Reversed Dictionary (0.83). These results indicate that while BERTopic remains the most coherent overall, LDA may be preferable for exploring a wider range of distinct topics, and the Reversed Dictionary approach offers exceptional co-occurrence coherence for well-structured datasets like news articles.

Table 7: AG News Dataset results

Figure 10: Evaluation metrics for AG News dataset

4.6 Similarity Trends with RD Definitions & ChatGPT Expected Prompt for AG News Dataset

Table 8: AG News Dataset Results

Number of Definitions	Average Similarity Score (Original Prompt and RD Definitions)	Average Similarity Score (Original Prompt and Expected Prompt by ChatGPT)
1	0.2222	0.2041
2	0.2109	0.2254
3	0.2037	0.2307
4	0.1974	0.2303
5	0.1921	0.2286
6	0.1881	0.2287
7	0.1847	0.2254
8	0.1810	0.223
9	0.1779	0.2195
10	0.1751	0.2192

Figure 11: AG News Dataset results

4.6.1 Similarity Trends with RD Definitions

The average similarity score between the original prompt and RD definitions starts at 0.2222 with one definition, and gradually declines as the number of definitions increases, reaching 0.1751 with ten definitions.

This indicates that adding more definitions dilutes the semantic alignment with the original prompt, similar to what we saw in the climate change dataset — likely due to semantic drift as definitions expand across broader interpretations of the original concept.

4.6.2 Similarity Trends with ChatGPT Expected Prompt

The average similarity score between the original prompt and the expected prompt by ChatGPT starts lower at 0.2041 with one definition but improves slightly as definitions increase, peaking at 0.2307 with three definitions. After this point, the score plateaus and slowly declines, ending at 0.2192 for ten definitions.

This suggests that ChatGPT captures the core intent of the original prompt best with a small number of definitions (around three), and that adding more definitions might introduce noise even for the model’s synthesis process.

4.7 Discussion

4.7.1 Experimental Results Discussion

Overall, BERTopic emerges as the most balanced model, particularly excelling in semantic coherence with the highest npmi scores for both datasets (0.435 for Climate Change and 0.77 for AG News). This makes BERTopic a strong choice when semantic accuracy is a priority. However, the Reversed Dictionary approach shows remarkable strength in co-occurrence-based coherence (c_v), achieving 0.8748 for Climate Change and an almost perfect 0.9962 for AG News, outperforming both BERTopic (0.8501 and 0.8348, respectively) and LDA (0.7794 and 0.9858). When it comes to topic diversity, the results are more nuanced: for Climate Change, the Reversed Dictionary model leads with 0.575, followed by LDA (0.47) and BERTopic (0.43), while for AG News, LDA scores the highest at 0.905, closely followed by BERTopic at 0.9881 and the Reversed Dictionary at 0.83.

These results suggest that the ideal model depends on the research goal: BERTopic is best for semantic depth and well-structured topics, Reversed Dictionary is valuable for capturing diverse, co-occurring terms, and LDA can still be useful for exploring a wide variety of distinct topics, especially in structured data like news articles. A hybrid approach that combines BERTopic’s semantic strength with Reversed Dictionary’s diversity and LDA’s broad topic coverage could potentially deliver the most comprehensive results for complex, multifaceted datasets.

4.7.2 Privacy-Sensitive Smart City Applications Towards Residents

Building smart city applications that use resident data requires adequate accuracy while balancing privacy requirements against practical needs. The proposed system design, with its vector database and embedding extraction method, offers a way to address this. The system transforms raw user data into vector representations to protect privacy without sacrificing effective data analytics and service personalization.

For example, the system can handle traffic management queries and public safety reports while maintaining query privacy, because it does not directly store identity-related data. This lets city departments respond to residents in an informed way while protecting individual privacy rights. The model’s ability to capture the intent of a prompt enables officials to use anonymized data effectively when making policy decisions.

The results indicate that the proposed method is valuable for answering the research questions, improving reliable information management while remaining applicable to privacy-sensitive smart cities. Future studies will examine specific adaptations that improve performance and permit broader use across different urban settings.

CHAPTER 5: Conclusion and Future Work

This research explored the use of Large Language Model (LLM) embeddings and Reverse Dictionaries (RD) for reliable topic modeling and privacy-sensitive smart city applications through experiments on two datasets: the Climate Change dataset and the AG News dataset. It addressed a critical research gap, including the challenge of privacy in AI-driven smart city applications. A comprehensive analysis of multiple topic modeling pipelines was conducted using three evaluation metrics (Coherence_npmi, Coherence_c_v, and Topic Diversity) to identify coherent and meaningful topics. Moreover, privacy-preserving AI techniques were integrated into the smart city context to support compliance with ethical standards while maximizing the utility of the data. Our findings indicate that LLM embeddings and reverse dictionaries can not only enhance topic discovery but also improve interpretability and increase adaptability for various applications, including climate change discourse analysis, news classification, and smart city data processing. Finally, this study highlights the potential of privacy-aware AI solutions for resident-centric smart city applications that balance ethical concerns and efficiency.

Although this research presents promising results, several areas require further investigation and development.

The current experiments were run on standard CPUs, whereas embedding extraction and reverse dictionary processing require significant computational resources, so GPUs are recommended. Future work should explore efficient indexing mechanisms and optimized embedding retrieval methods to improve scalability for large datasets. Extending the methodology to real-time streaming data from IoT sensors, social media, and urban infrastructure would further validate the practical effectiveness of our approach in dynamic smart city environments. While privacy-sensitive mechanisms were integrated into the study, federated learning and differential privacy techniques could be explored to enhance data security and user privacy without compromising analytical performance. Evaluating the adaptability of our approach across different domains and languages could improve its applicability in global smart city implementations, particularly for multilingual topic modeling. Incorporating human feedback mechanisms (e.g., domain experts and residents) into the topic modeling process would provide insights into interpretability, usability, and real-world applicability. Further research should address the ethical implications of AI-driven smart city applications and ensure alignment with global data protection regulations such as GDPR and AI Act policies.

This research lays a strong foundation for advancing AI-driven topic modeling and privacy-sensitive smart city applications. By bridging the gap between LLM embeddings, reverse dictionary methods, and privacy-aware AI techniques, this study contributes to the ongoing development of trustworthy, efficient, and ethical smart city solutions. Future explorations in this field will further refine methodologies, ensuring data-driven urban environments that prioritize residents’ satisfaction, safety, and privacy.

List of References

[1] J. Yu, J. Zhou, Y. Ding, L. Zhang, Y. Guo, and H. Sato, “Textual Differential Privacy for Context-Aware Reasoning with Large Language Model,” pp. 988–997, Aug. 2024, doi: 10.1109/COMPSAC61105.2024.00135.

[2] J. Wehner, S. Abdelnabi, D. Tan, D. Krueger, and M. Fritz, “Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models,” Feb. 2025, Accessed: Mar. 06, 2025. [Online]. Available: https://arxiv.org/abs/2502.19649v2

[3] S. Kapoor, A. Gil, S. Bhaduri, A. Mittal, and R. Mulkar, “Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling,” Sep. 2024, Accessed: Mar. 07, 2025. [Online]. Available: https://arxiv.org/abs/2409.15626v1

[4] C.-H. Chang, J.-T. Tsai, Y.-H. Tsai, and S.-Y. Hwang, “LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework,” Dec. 2024, Accessed: Mar. 07, 2025. [Online]. Available: https://arxiv.org/abs/2412.12459v1

[5] F. Invernici, F. Curati, J. Jakimov, A. Samavi, and A. Bernasconi, “Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach,” Nov. 2024, Accessed: Mar. 07, 2025. [Online]. Available: https://arxiv.org/abs/2411.02943v2

[6] T. Di, L. Magistrale, M. Holenderski, and ( Tu, “Leveraging LLM-generated keyphrases and clustering techniques for topic identification in product reviews,” 2024, Accessed: Mar. 07, 2025. [Online]. Available: https://www.politesi.polimi.it/handle/10589/227476

[7] B. Gana, A. Leiva-Araos, H. Allende-Cid, and J. García, “Leveraging LLMs for Efficient Topic Reviews,” Appl. Sci. 2024, Vol. 14, Page 7675, vol. 14, no. 17, p. 7675, Aug. 2024, doi: 10.3390/APP14177675.

[8] P. U. Dasu, “Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models.” Jan. 10, 2025. Accessed: Mar. 07, 2025. [Online]. Available: https://hdl.handle.net/10919/124154

[9] H. Wang, N. Prakash, N. K. Hoang, M. S. Hee, U. Naseem, and R. K. W. Lee, “Prompting Large Language Models for Topic Modeling,” Proc. – 2023 IEEE Int. Conf. Big Data, BigData 2023, pp. 1236–1241, 2023, doi: 10.1109/BIGDATA59044.2023.10386113.

[10] S. C. Dai, A. Xiong, and L. W. Ku, “LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis,” Find. Assoc. Comput. Linguist. EMNLP 2023, pp. 9993–10001, Oct. 2023, doi: 10.18653/v1/2023.findings-emnlp.669.

[11] D. Puschmann, “Extracting Information From Heterogeneous Internet of Things Data Streams – ProQuest,” University of Surrey (United Kingdom), 2018. Accessed: Mar. 21, 2025. [Online]. Available: https://www.proquest.com/openview/c604e0751473ef24c8d9785aa003b732/1?cbl=2026366&pq-origsite=gscholar

[12] J. Yu, F. Boehmke, B. Dietrich, H. Huang, T. Meng, and E. Pizzi, “Dual Purposes of Government Communication in Chinese Local Governments,” 2024.

[13] N. Lopes and B. Ribeiro, “Non-Negative Matrix Factorization (NMF),” Stud. Big Data, vol. 7, pp. 127–154, 2015, doi: 10.1007/978-3-319-06938-8_7.

[14] H. S. Jung, H. Lee, Y. S. Woo, S. Y. Baek, and J. H. Kim, “Expansive data, extensive model: Investigating discussion topics around LLM through unsupervised machine learning in academic papers and news,” PLoS One, vol. 19, no. 5, p. e0304680, May 2024, doi: 10.1371/JOURNAL.PONE.0304680.

[15] M. Nijs, T. Smets, E. Waelkens, and B. De Moor, “A mathematical comparison of non‐negative matrix factorization related methods with practical implications for the analysis of mass spectrometry imaging data,” Rapid Commun. Mass Spectrom., vol. 35, no. 21, p. e9181, Nov. 2021, doi: 10.1002/RCM.9181.

[16] A. Kousis and C. Tjortjis, “Investigating the Key Aspects of a Smart City through Topic Modeling and Thematic Analysis,” Futur. Internet, vol. 16, no. 1, p. 3, Jan. 2024, doi: 10.3390/FI16010003/S1.

[17] S. Chowdhury and A. Alzarrad, “Applications of Text Mining in the Transportation Infrastructure Sector: A Review,” Inf. 2023, Vol. 14, Page 201, vol. 14, no. 4, p. 201, Mar. 2023, doi: 10.3390/INFO14040201.

[18] A. Noorian, A. Harounabadi, and M. Hazratifard, “A sequential neural recommendation system exploiting BERT and LSTM on social media posts,” Complex Intell. Syst., vol. 10, no. 1, pp. 721–744, Feb. 2024, doi: 10.1007/S40747-023-01191-4/FIGURES/18.

[19] A. Alotaibi and F. Nadeem, “Leveraging Social Media and Deep Learning for Sentiment Analysis for Smart Governance: A Case Study of Public Reactions to Educational Reforms in Saudi Arabia,” Comput. 2024, Vol. 13, Page 280, vol. 13, no. 11, p. 280, Oct. 2024, doi: 10.3390/COMPUTERS13110280.

[20] A. Ullah, G. Qi, S. Hussain, I. Ullah, and Z. Ali, “The Role of LLMs in Sustainable Smart Cities: Applications, Challenges, and Future Directions,” Feb. 2024, Accessed: Mar. 21, 2025. [Online]. Available: https://arxiv.org/abs/2402.14596v1

[21] E. T. M. Beltrán et al., “Fedstellar: A Platform for Decentralized Federated Learning,” Jun. 2023, Accessed: Nov. 26, 2023. [Online]. Available: https://arxiv.org/abs/2306.09750v2

[22] R. Lazzarini, H. Tianfield, and V. Charissis, “Federated Learning for IoT Intrusion Detection,” AI 2023, Vol. 4, Pages 509-530, vol. 4, no. 3, pp. 509–530, Jul. 2023, doi: 10.3390/AI4030028.

[23] B. Yurdem, M. Kuzlu, M. K. Gullu, F. O. Catak, and M. Tabassum, “Federated learning: Overview, strategies, applications, tools and future directions,” Heliyon, vol. 10, no. 19, p. e38137, Oct. 2024, doi: 10.1016/J.HELIYON.2024.E38137.

[24] X. Gu, F. Sabrina, Z. Fan, and S. Sohail, “A Review of Privacy Enhancement Methods for Federated Learning in Healthcare Systems,” Int. J. Environ. Res. Public Health, vol. 20, no. 15, Aug. 2023, doi: 10.3390/IJERPH20156539.

[25] P. Sarzaeim, Q. H. Mahmoud, A. Azim, G. Bauer, and I. Bowles, “A Systematic Review of Using Machine Learning and Natural Language Processing in Smart Policing,” Comput. 2023, Vol. 12, Page 255, vol. 12, no. 12, p. 255, Dec. 2023, doi: 10.3390/COMPUTERS12120255.

[26] S. Venkatesh and E. Al, “Crime Prediction Using Machine Learning and Deep Learning,” Journal of Science Technology and Research. JSTAR, 2024. Accessed: Mar. 21, 2025. [Online]. Available: https://philpapers.org/rec/VENCPU

[27] N. Shah, N. Bhagat, and M. Shah, “Crime forecasting: a machine learning and computer vision approach to crime prediction and prevention,” Vis. Comput. Ind. Biomed. Art, vol. 4, no. 1, pp. 1–14, Dec. 2021, doi: 10.1186/S42492-021-00075-Z/FIGURES/3.

[28] N. Ma, D. Aviv, H. Guo, and W. W. Braham, “Measuring the right factors: A review of variables and models for thermal comfort and indoor air quality,” Renew. Sustain. Energy Rev., vol. 135, Jan. 2021, doi: 10.1016/j.rser.2020.110436.

[29] T. Khandelwal, “Using LLM-Based Approaches to Enhance and Automate Topic Labeling,” Feb. 2025, Accessed: Mar. 07, 2025. [Online]. Available: https://arxiv.org/abs/2502.18469v1

[30] Gokcimen, “Climate_Data,” 2023. https://github.com/Gokcimen/Climate_Data/blob/main/climate_data.csv

[31] B. Das Tunahan Gokcimen, “Exploring climate change discourse on social media and blogs using a topic modeling analysis,” Heliyon, vol. 10, no. 11, 2024, doi: https://doi.org/10.1016/j.heliyon.2024.e32464.

[32] Y. Zhang, X., Zhao, J. & LeCun, “Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28,” NIPS, 2015, [Online]. Available: http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

[33] Fancyzhx, “AG News Dataset,” 2023, [Online]. Available: https://huggingface.co/datasets/fancyzhx/ag_news

[34] Fancyzhx, “AG News Dataset,” 2024, [Online]. Available: https://huggingface.co/datasets/fancyzhx/ag_news/viewer/default
