This project improves the consistency of SCADA signal names by combining semantic similarity detection with structured parsing. First, we use Sentence-BERT to embed each raw signal name and compute the pairwise similarity between embeddings. Pairs that score above a threshold (e.g., 0.90) are flagged as likely duplicates or inconsistently named signals, which reduces the need for a full manual review.
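The flagging step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes cosine similarity as the similarity measure, and it uses a toy character-trigram embedding (`trigram_embed_batch`, a hypothetical helper) as a stand-in for Sentence-BERT so that the example runs without downloading a model. The function names and sample signal names are likewise made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def trigram_embed_batch(names):
    """Toy stand-in for Sentence-BERT: dense character-trigram count vectors.

    Assumption: in a real run this would be replaced by a function that
    returns one Sentence-BERT embedding per name.
    """
    grams = [
        Counter(n.lower()[i:i + 3] for i in range(len(n) - 2)) for n in names
    ]
    vocab = sorted(set().union(*grams))  # shared trigram vocabulary
    return [[float(g[t]) for t in vocab] for g in grams]


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_similar_pairs(names, embed_batch=trigram_embed_batch, threshold=0.90):
    """Return (name_a, name_b, score) for every pair scoring above threshold."""
    vectors = embed_batch(names)
    flagged = []
    for i, j in combinations(range(len(names)), 2):
        score = cosine(vectors[i], vectors[j])
        if score > threshold:
            flagged.append((names[i], names[j], score))
    # Highest-scoring pairs first, so reviewers see the likeliest duplicates.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical signal names, for illustration only.
sample_names = ["NE_WTP_PUMP01_FLOW", "ne_wtp_pump01_flow", "SW_SUB_XFMR2_TEMP"]
flagged_pairs = flag_similar_pairs(sample_names)
```

The loop compares every pair, so the cost grows quadratically with the number of signals; for large datasets a batched matrix product or an approximate nearest-neighbour index would be a natural replacement for the explicit loop.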
In parallel, we developed a custom parser that breaks each raw name into meaningful components (such as Region, Site Type, Asset Levels, and Signal Type) using known naming patterns and domain-specific keywords. This structured breakdown helps analysts understand each signal’s context and supports further standardization across large datasets.
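A keyword-driven parser of this kind might look like the sketch below. The delimiter set, the keyword tables (`REGIONS`, `SITE_TYPES`, `SIGNAL_TYPES`), and the sample name are all hypothetical; the real naming patterns and vocabularies would come from the project's own domain knowledge.

```python
import re

# Hypothetical keyword tables; real ones would be built from domain knowledge.
REGIONS = {"NE": "Northeast", "SW": "Southwest"}
SITE_TYPES = {"WTP": "Water Treatment Plant", "SUB": "Substation"}
SIGNAL_TYPES = {"FLOW": "Flow", "TEMP": "Temperature", "PRESS": "Pressure"}


def parse_signal_name(raw):
    """Split a raw signal name into tokens and classify each by keyword."""
    # Assumption: components are separated by '_', '-', '.', or whitespace.
    tokens = [t for t in re.split(r"[_\-.\s]+", raw.strip().upper()) if t]
    parsed = {
        "raw": raw,
        "region": None,
        "site_type": None,
        "asset_levels": [],
        "signal_type": None,
    }
    for token in tokens:
        if parsed["region"] is None and token in REGIONS:
            parsed["region"] = REGIONS[token]
        elif parsed["site_type"] is None and token in SITE_TYPES:
            parsed["site_type"] = SITE_TYPES[token]
        elif token in SIGNAL_TYPES:
            parsed["signal_type"] = SIGNAL_TYPES[token]
        else:
            # Unrecognized tokens are kept, in order, as the asset hierarchy.
            parsed["asset_levels"].append(token)
    return parsed


# Hypothetical raw name with mixed delimiters and casing.
example = parse_signal_name("ne-wtp_PUMP01.motor_flow")
```

Keeping unrecognized tokens as asset levels, rather than discarding them, means no part of the raw name is lost, and names with a missing region or signal type remain visible as `None` fields for analysts to follow up on.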