Project Overview
This project improves SCADA signal naming consistency across large and diverse datasets by combining semantic similarity detection with structured parsing. Inconsistent naming in SCADA systems hinders data analysis, automation, and integration across sites. Our solution addresses this by automatically identifying naming inconsistencies and restructuring raw signal names into standardized formats.
Semantic Similarity Detection
We use Sentence-BERT (SBERT) to generate an embedding for each raw signal name, capturing its semantic meaning rather than only its surface text. By computing pairwise similarity scores between these embeddings, we can detect signals that are likely duplicates or inconsistently named. Pairs with a similarity score above 0.90 are automatically flagged for review, so reviewers inspect only the flagged candidates instead of comparing every pair by hand.
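The flagging step described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it assumes the similarity score is cosine similarity, and the function names (`cosine_similarity`, `flag_similar_pairs`, `embed_with_sbert`) and the model name `all-MiniLM-L6-v2` are assumptions chosen for the example. Only the 0.90 threshold comes from the description above.

```python
import math
from itertools import combinations

SIMILARITY_THRESHOLD = 0.90  # flagging threshold stated in the overview


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def flag_similar_pairs(names, embeddings, threshold=SIMILARITY_THRESHOLD):
    """Return (name_a, name_b, score) for every pair scoring above threshold.

    Pairs are sorted by descending score so the likeliest duplicates
    appear first in the review queue.
    """
    flagged = []
    for i, j in combinations(range(len(names)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score > threshold:
            flagged.append((names[i], names[j], score))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


def embed_with_sbert(names, model_name="all-MiniLM-L6-v2"):
    """Hypothetical embedding step; needs the sentence-transformers package.

    The model name is an assumption, not the model used by the project.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(names).tolist()
```

In use, `flag_similar_pairs(names, embed_with_sbert(names))` would return the candidate pairs for review. The exhaustive pairwise loop is quadratic in the number of signals, so it only suits small batches; for larger sets an approximate nearest-neighbour index is the usual substitute.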
Structured Parsing
Benefits
Technology Stack
The project is implemented in Python, with Sentence-BERT from HuggingFace for embeddings, FAISS for fast similarity search, and custom parsing logic built on regular expressions. The entire workflow is hosted and executed in Azure Databricks so that it can scale to thousands of signals across multiple operational sites.
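As an illustration of the regular-expression parsing mentioned above, the sketch below normalizes raw names into a single format. The naming convention here (`SITE_EQUIPMENTNN_MEASUREMENT`), the pattern `RAW_SIGNAL`, and the function `standardize` are all hypothetical, invented for the example; the project's real target format and parsing rules are not specified in this overview.

```python
import re

# Hypothetical raw layout: site, equipment type, unit number, measurement,
# separated by any mix of spaces, dots, dashes, or underscores.
RAW_SIGNAL = re.compile(
    r"^\s*(?P<site>[A-Za-z]+\d*)[\s._-]+"
    r"(?P<equipment>[A-Za-z]+)[\s._-]*(?P<unit>\d+)[\s._-]+"
    r"(?P<measurement>[A-Za-z]+)\s*$"
)


def standardize(raw_name):
    """Rewrite a raw name as SITE_EQUIPMENTNN_MEASUREMENT (assumed format).

    Returns None when the name does not fit the assumed layout, so that
    unparseable signals can be routed to manual review instead of guessed.
    """
    match = RAW_SIGNAL.match(raw_name)
    if match is None:
        return None
    parts = match.groupdict()
    return "{site}_{equipment}{unit:02d}_{measurement}".format(
        site=parts["site"].upper(),
        equipment=parts["equipment"].upper(),
        unit=int(parts["unit"]),  # zero-padded to two digits on output
        measurement=parts["measurement"].upper(),
    )
```

Returning `None` rather than a best-effort guess keeps the parser conservative: a name that fails to match is left untouched and can be reviewed alongside the pairs flagged by the similarity step.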
Future Work
Next steps include extending the framework to handle multilingual datasets, introducing active learning for more intelligent synonym detection, and generating automated signal renaming suggestions to support full-cycle standardization.