Signal Name Standardization

Project Overview

This project improves the consistency of SCADA signal names across large and diverse datasets by combining semantic similarity detection with structured parsing. Inconsistent naming in SCADA systems hinders data analysis, automation, and integration across sites. The pipeline addresses this by automatically identifying naming inconsistencies and restructuring raw signal names into a standardized format.

Semantic Similarity Detection

We use Sentence-BERT (SBERT) to generate an embedding for each raw signal name, capturing its semantic meaning beyond surface-level text. By computing pairwise similarity scores between embeddings, we can detect signals that are likely duplicates or inconsistently named. Pairs with a similarity score above 0.90 are automatically flagged for review, so analysts inspect only the flagged candidates rather than every possible pair.
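
The flagging step can be illustrated with a minimal sketch. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, neither of which is confirmed as the project's actual choice, and it assumes cosine similarity as the score. The function names embed_names and flag_similar_pairs are hypothetical.

```python
import math
from itertools import combinations


def embed_names(names, model_name="all-MiniLM-L6-v2"):
    """Encode raw signal names with an SBERT model.

    The model name is an assumption; any sentence-transformers model works.
    The import is kept local so the rest of the sketch runs without the library.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(names).tolist()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_similar_pairs(names, embeddings, threshold=0.90):
    """Return (name_a, name_b, score) for every pair scoring above the threshold,
    highest score first."""
    flagged = []
    for i, j in combinations(range(len(names)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if score > threshold:
            flagged.append((names[i], names[j], score))
    return sorted(flagged, key=lambda item: -item[2])
```

A call such as flag_similar_pairs(names, embed_names(names)) would return the candidate pairs for review. This brute-force loop compares every pair, so its cost grows quadratically with the number of signals; it is meant to show the thresholding logic, not to scale.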

Structured Parsing

Alongside similarity detection, we developed a custom rule-based parser that decomposes each raw name into meaningful components such as Region, County, Site Type, Asset Levels, and Signal Type. The parser relies on regex patterns and domain-specific keywords. The resulting breakdown allows signals to be categorized and filtered consistently, regardless of variations in naming syntax.
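
A rule-based parser of this kind might look like the sketch below. The source does not specify the naming layout, so everything concrete here is an assumption made for illustration: the component order (region, county, site type, asset levels, signal type), the delimiter set, the keyword tables SITE_TYPES and SIGNAL_PATTERNS, and the function name parse_signal_name.

```python
import re

# Hypothetical keyword tables; a real deployment would use the site's own vocabulary.
SITE_TYPES = {"PS": "Pumping Station", "WTW": "Water Treatment Works"}
SIGNAL_PATTERNS = [
    (re.compile(r"flow|flw", re.IGNORECASE), "Flow"),
    (re.compile(r"pressure|press|prs", re.IGNORECASE), "Pressure"),
    (re.compile(r"level|lvl", re.IGNORECASE), "Level"),
]

# Assumed delimiters: whitespace, slash, underscore, dot, colon, pipe, hyphen.
DELIMITERS = re.compile(r"[\s/_.:|-]+")


def parse_signal_name(raw):
    """Split a raw signal name into components under the assumed layout."""
    tokens = [t for t in DELIMITERS.split(raw.strip()) if t]
    parsed = {
        "region": tokens[0].upper() if len(tokens) >= 1 else None,
        "county": tokens[1].upper() if len(tokens) >= 2 else None,
        "site_type": None,
        "asset_levels": [],
        "signal_type": None,
    }
    rest = tokens[2:]

    # Site type: the third token, if it is a known keyword.
    if rest and rest[0].upper() in SITE_TYPES:
        parsed["site_type"] = SITE_TYPES[rest[0].upper()]
        rest = rest[1:]

    # Signal type: the last token, if it matches one of the patterns.
    if rest:
        for pattern, label in SIGNAL_PATTERNS:
            if pattern.fullmatch(rest[-1]):
                parsed["signal_type"] = label
                rest = rest[:-1]
                break

    # Whatever remains is treated as the asset hierarchy, outermost first.
    parsed["asset_levels"] = [t.upper() for t in rest]
    return parsed
```

Under these assumptions, "north/kent/PS/pump1/motor/FLW" and "North_Kent_PS_Pump1_Motor_flow" parse to the same components, which is the property that makes filtering independent of naming syntax.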

Benefits

By combining these two methods, the pipeline improves naming consistency, enhances data quality, and reduces manual workload for analysts. It enables scalable signal standardization and supports downstream use cases such as performance monitoring, alarm management, and cross-site analytics.

Technology Stack

The project is implemented in Python, with Sentence-BERT models from Hugging Face for embeddings, FAISS for fast similarity search, and custom parsing logic built on regular expressions. The entire workflow is hosted and executed in Azure Databricks, so it scales to thousands of signals across multiple operational sites.
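
How FAISS could replace an exhaustive pairwise comparison is sketched below. This is an assumption about usage, not the project's actual code: it builds an exact inner-product index over L2-normalized embeddings (so scores equal cosine similarity), retrieves the k nearest neighbours of each signal, and keeps unique pairs above the 0.90 threshold. The function names and the value of k are hypothetical, and nearest_neighbours requires the faiss and numpy packages.

```python
def nearest_neighbours(embeddings, k=5):
    """Return (scores, indices) for the k nearest neighbours of every vector.

    Imports are local so the post-processing helper below runs without FAISS.
    """
    import numpy as np
    import faiss

    # Copy to a C-contiguous float32 array so normalization does not touch the input.
    vectors = np.array(embeddings, dtype="float32", order="C")
    faiss.normalize_L2(vectors)  # in place; inner product now equals cosine
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    scores, indices = index.search(vectors, min(k, len(vectors)))
    return scores.tolist(), indices.tolist()


def pairs_above_threshold(scores, indices, threshold=0.90):
    """Turn per-row neighbour lists into unique (i, j) pairs with score > threshold."""
    best = {}
    for i, (row_scores, row_ids) in enumerate(zip(scores, indices)):
        for score, j in zip(row_scores, row_ids):
            # Skip padding (-1), self-matches, and pairs at or below the threshold.
            if j < 0 or j == i or score <= threshold:
                continue
            key = (min(i, j), max(i, j))
            best[key] = max(score, best.get(key, score))
    return sorted(best.items())
```

Limiting the search to k neighbours per signal trades completeness for speed: a signal with more than k near-duplicates would have some pairs missed, so k would need tuning against the data.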

Future Work

Next steps include extending the framework to handle multilingual datasets, introducing active learning for more intelligent synonym detection, and generating automated signal renaming suggestions to support full-cycle standardization.

Tags: Databricks NLP Python SQL