
A Parallel Spatial Outlier Mining Method Based on C-SOM and Spark, and Its Application
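Cite this article: PAN Miaoxin, LIN Jiaxiang, CHEN Chongcheng, YE Xiaoyan. A parallel spatial outlier mining method based on C-SOM and Spark, and its application[J]. Geo-information Science, 2019, 21(1): 128-136.
Authors: PAN Miaoxin  LIN Jiaxiang  CHEN Chongcheng  YE Xiaoyan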
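Affiliations: 1. Key Laboratory of Spatial Data Mining and Information Sharing of the Ministry of Education, Fujian Provincial Spatial Information Engineering Research Center, Fuzhou University, Fuzhou 350108; 2. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117; 3. Fujian Provincial Engineering Technology Research Center for Public Service Big Data Mining and Application, Fuzhou 350117; 4. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002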
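Funding: Fujian Provincial Key Science and Technology Program Project (2015H0015); Fujian Provincial Department of Education Fund (JAT160125); Fujian Provincial Social Science Youth Project (FJ2017C084)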
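Abstract: Spatial outlier mining discovers spatial objects in a spatial dataset whose non-spatial attribute values differ markedly from those of the other spatial objects in their neighborhood. As the volume of spatial data grows rapidly, the traditional centralized processing mode faces single-machine performance bottlenecks and poor scalability, and is gradually becoming unable to meet application needs. This paper therefore builds on the Spark parallel computing framework, exploiting Spark's fast in-memory computing and scalability, and proposes a parallel spatial outlier mining algorithm and prototype system based on the constrained spatial outlier mining algorithm (C-SOM) and Spark. The parallel algorithm takes C-SOM as its core and runs C-SOM in parallel on multiple compute nodes over the global dataset and each local dataset, obtaining global outliers and local outliers. The lightweight prototype system implements the parallel algorithm on Spark, adopts a Browser/Server architecture, and gives users a visual interface that is concise and practical. Finally, outlier analysis of soil chemical element survey data from the southeastern coast of Fujian Province and of synthetic data verifies the rationality, effectiveness, and efficiency of the parallel algorithm and the prototype system.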

Keywords: C-SOM  Spark  parallel computing  spatial outlier  data mining
Received: 2018-05-03

Parallel Spatial Outliers Mining based on C-SOM and Spark
Miaoxin PAN,Jiaxiang LIN,Chongcheng CHEN,Xiaoyan YE.Parallel Spatial Outliers Mining based on C-SOM and Spark[J].Geo-information Science,2019,21(1):128-136.
Authors:Miaoxin PAN  Jiaxiang LIN  Chongcheng CHEN  Xiaoyan YE
Abstract: Spatial outlier mining finds spatial objects whose non-spatial attribute values differ significantly from those of the other objects in their neighborhood. As the volume of spatial data grows rapidly, the traditional centralized processing mode runs into single-machine performance bottlenecks and poor scalability, and increasingly fails to meet application needs. In this paper, we propose a parallel spatial outlier mining algorithm and a prototype system based on Constrained Spatial Outlier Mining (C-SOM) and on Spark, a parallel computing framework whose fast in-memory computing and scalability we exploit. The parallel algorithm uses C-SOM as its core: it executes C-SOM concurrently on a multi-node Spark cluster for one global dataset and many local datasets, yielding the global outliers and the local outliers. The data are divided into regional datasets according to administrative divisions; each regional dataset is treated as a local dataset, and the global dataset comprises all the local datasets selected for mining. The lightweight prototype system implements the parallel algorithm on Spark and adopts a Browser/Server architecture to give users a concise, practical visual interface. Users select the regional datasets and set the C-SOM parameters in the interface; the system then runs the parallel algorithm on a Spark cluster and lists the global and local outliers with the largest outlier factor values for further analysis. Finally, we carry out experiments on soil geochemical survey data from the eastern coastal zone of Fujian, China, and on a series of artificial datasets. The results on the soil geochemical datasets validate the rationality and effectiveness of the parallel algorithm and its prototype system. The results on the artificial datasets show that, compared with a single-machine implementation, the parallel system can analyze many more datasets and is much more efficient when the number of datasets is sufficiently large. This study confirms the local instability of spatial outliers and demonstrates the rationality and effectiveness of the parallel algorithm and its prototype system in detecting global and local spatial outliers simultaneously.
Keywords:C-SOM  Spark  parallel computing  spatial outlier  data mining  
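
The global/local structure described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the scoring function is a generic neighborhood-difference score standing in for C-SOM (whose constraints and outlier factor are not specified here), a standard-library thread pool stands in for the Spark cluster, and the record layout, the function names, and the parameters `k` and `top_n` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot

# Hypothetical record layout: (object_id, region, x, y, value)


def outlier_factors(points, k=3):
    """Stand-in score (NOT C-SOM): absolute difference between an object's
    value and the mean value of its k nearest spatial neighbours."""
    scores = {}
    for pid, _, x, y, v in points:
        # Distance and value of every other object in the same dataset
        others = [(hypot(x - qx, y - qy), qv)
                  for qid, _, qx, qy, qv in points if qid != pid]
        others.sort(key=lambda t: t[0])
        nearest = others[:k]
        if not nearest:
            scores[pid] = 0.0
            continue
        mean_v = sum(qv for _, qv in nearest) / len(nearest)
        scores[pid] = abs(v - mean_v)
    return scores


def top_outliers(points, k=3, top_n=1):
    """Return the top_n (object_id, score) pairs, largest score first."""
    scores = outlier_factors(points, k)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


def mine_global_and_local(points, k=3, top_n=1):
    """Score the global dataset and each regional (local) dataset concurrently."""
    tasks = {"global": list(points)}
    for rec in points:
        tasks.setdefault("local:" + rec[1], []).append(rec)
    # The thread pool only mimics the task structure; the paper uses Spark.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(top_outliers, subset, k, top_n)
                   for name, subset in tasks.items()}
        return {name: f.result() for name, f in futures.items()}


if __name__ == "__main__":
    sample = [
        ("a1", "A", 0, 0, 10), ("a2", "A", 1, 0, 11),
        ("a3", "A", 0, 1, 10), ("a4", "A", 1, 1, 50),
        ("b1", "B", 10, 10, 100), ("b2", "B", 11, 10, 101),
        ("b3", "B", 10, 11, 100), ("b4", "B", 11, 11, 60),
    ]
    for name, ranked in mine_global_and_local(sample).items():
        print(name, ranked)
```

Each task is independent of the others, which is what makes the global run and the per-region runs straightforward to distribute; in a Spark setting the same grouping would be handed to the cluster rather than to local threads.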
This article is indexed in CNKI and other databases.