面向选煤厂领域知识图谱的数据分类方法

赵欣; 张树森

doi:10.16447/j.cnki.cpt.2024.02.011

Coal preparation plant domain knowledge graph oriented data classification method

Abstract

The open sharing of industrial data resources is an important path for the development of the industrial big data industry, and automatic classification of coal preparation plant data supports efficient data management. However, coal preparation plant data are numerous and complex: some data overlap, while other data are isolated and unrelated. As a result, the data lack standardization and normalization, which restricts the development of intelligent applications for coal preparation plants. To address the scarcity of labeled data and the overlap among structured database table data in coal preparation plants, an automatic classification algorithm for such data based on a knowledge graph is proposed. A coal preparation plant domain knowledge graph is constructed from a list of domain subject words. On the basis of this knowledge graph, a KG-BERT classification model is used for the extended classification of non-subject data, and a knowledge-graph-based single-subject classification model and a TF-IDF-based multi-subject weight decision model use the knowledge system of the knowledge graph to enhance the controllability and interpretability of text classification. Combining the domain knowledge graph, the KG-BERT classification model, and the TF-IDF-based subject weight decision model, a classification model based on multi-model fusion is proposed to classify structured database table data automatically. All experimental data come from the full catalog of structured database table data of a coal preparation plant and are used to verify the effectiveness of the algorithm. Comparative experiments show that: the KG-BERT classification model, which adopts the BERT architecture, has a certain generalization ability and handles text classification without subject words better than the CNN, RNN, and LSTM models; in terms of training datasets, the models perform better on the KE dataset; and the multi-model fusion classification model is more effective and more widely applicable than any single model for classifying structured database table data in the coal preparation plant domain. The multi-model fusion classification model achieves good automatic classification results, helps improve the efficiency of coal preparation plant data management, and helps further exploit the potential value of coal preparation plant data resources.
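
The abstract gives no implementation details, so the following Python is only an illustrative sketch of how the three components might be routed: subject-word matching, a TF-IDF weight decision when several subjects match, and a learned classifier when none match. The subject-word lists, the sample table descriptions, the function names, and the stub standing in for KG-BERT are all assumptions made for illustration; they are not the authors' vocabulary, data, or code.

```python
import math
from collections import Counter

# Hypothetical subject-word lists; the paper's real vocabulary is not given.
SUBJECT_WORDS = {
    "flotation": ["flotation", "froth", "reagent"],
    "dense_medium": ["dense", "medium", "cyclone", "magnetite"],
    "dewatering": ["filter", "press", "centrifuge", "moisture"],
}

# Hypothetical table descriptions standing in for the database table catalog.
DOCS = [
    "flotation reagent dosage record",
    "dense medium cyclone pressure log",
    "filter press moisture report",
    "flotation froth and cyclone feed rate",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for English text)."""
    return text.lower().split()


def idf_table(corpus):
    """Smoothed inverse document frequency over a list of token lists."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def subject_scores(tokens, idf):
    """Sum the TF-IDF weights of each subject's words that occur in the text."""
    tf = Counter(tokens)
    total = len(tokens) or 1
    scores = {}
    for subject, words in SUBJECT_WORDS.items():
        score = sum(tf[w] / total * idf.get(w, 1.0) for w in words if w in tf)
        if score > 0:
            scores[subject] = score
    return scores


def fallback_classifier(tokens):
    """Stub for a learned model such as KG-BERT; not implemented here."""
    return "unclassified"


def classify(text, idf):
    """Route a text: no subject -> fallback, otherwise highest-weight subject."""
    tokens = tokenize(text)
    scores = subject_scores(tokens, idf)
    if not scores:
        return fallback_classifier(tokens)
    # With one matching subject this returns it directly; with several,
    # the TF-IDF weights decide.
    return max(scores, key=scores.get)


IDF = idf_table([tokenize(d) for d in DOCS])

if __name__ == "__main__":
    for doc in DOCS + ["employee attendance sheet"]:
        print(doc, "->", classify(doc, IDF))
```

In this sketch the rule-based branches stay inspectable, because every decision can be traced back to the matched subject words and their weights, while texts with no subject words are deferred to the learned model.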
