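Spark parallel optimization of spectral clustering for large-scale datasets
Cite this article: Lü Honglin, Yin Qingshan. Spark parallel optimization of spectral clustering for large-scale datasets[J]. Bulletin of Surveying and Mapping (测绘通报), 2019, 0(12): 96-100.
Authors: Lü Honglin  Yin Qingshan
Affiliations: Liaoning University of International Business and Economics, Dalian, Liaoning 116052; Jilin University, Changchun, Jilin 130000
Funding: Doctoral Research Start-up Fund of Liaoning University of International Business and Economics (2019XJLXBSJJ002); Scientific Research Project of the Education Department of Liaoning Province (ldxy2017008)
Abstract: Existing parallel spectral clustering algorithms for large-scale datasets suffer from long computation times and heavy resource consumption. To address this, an improved, parallel-optimized spectral clustering algorithm for large-scale datasets is proposed, built on Spark, a parallel technology that supports both batch processing and graph computation. The algorithm uses parallel one-way iteration to avoid repeated computation of data when building the similarity matrix; it reduces resource consumption through parallel position transformation, scalar multiplication replacement, and distance scaling; and it further reduces the amount of computation by substituting approximate eigenvectors. Experimental results verify the effectiveness of the approximate eigenvectors, as well as good clustering performance and scalability on large-scale datasets.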

Keywords: large-scale spectral clustering  approximate eigenvector  Spark parallel framework  K-means distance calculation  optimization
Received: 2019-06-24
Revised: 2019-10-30

Spark parallel optimization large-scale spectral clustering
Lü Honglin, YIN Qingshan. Spark parallel optimization large-scale spectral clustering[J]. Bulletin of Surveying and Mapping, 2019, 0(12): 96-100.
Authors:Lü Honglin  YIN Qingshan
Institution:1. Liaoning University of International Business and Economics, Dalian 116052, China;2. College of Mining Engineering, Jilin University, Changchun 130000, China
Abstract: To address the long computation times and heavy resource usage of existing parallel spectral clustering algorithms on large-scale datasets, an improved parallel spectral clustering algorithm based on Spark is proposed. Parallel one-way iteration avoids repeated computation when building the similarity matrix; parallel position transformation, scalar multiplication replacement, and distance scaling reduce resource usage; and the use of approximate eigenvectors further reduces the amount of computation. The experimental results verify the effectiveness of the approximate eigenvectors and show good clustering performance and scalability on large-scale datasets.
Keywords:large-scale spectral clustering  approximate eigenvector  Spark parallel computing  K-means distance calculation  optimization  
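
The abstract states that one-way iteration avoids repeated computation in the similarity matrix. The paper's actual Spark implementation is not given here, so the following is only a minimal single-machine sketch of the general idea, under several assumptions: a Gaussian kernel is used as the similarity measure, each unordered pair (i, j) with j > i is evaluated once and mirrored across the diagonal, and the names `gaussian_similarity`, `sigma`, and `evaluations` are hypothetical and chosen for illustration.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Build a symmetric similarity matrix, evaluating each pair only once.

    Assumption: Gaussian kernel exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    This illustrates the idea only; it is not the paper's Spark code.
    """
    n = len(points)
    # Precompute the constant factor once instead of dividing per pair.
    scale = -1.0 / (2.0 * sigma * sigma)
    # Diagonal entries are left at 0.0 (no self-similarity edges).
    sim = [[0.0] * n for _ in range(n)]
    evaluations = 0  # counts kernel evaluations, for illustration
    for i in range(n):
        # One-way iteration: only visit pairs with j > i.
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            s = math.exp(d2 * scale)
            sim[i][j] = s
            sim[j][i] = s  # mirror instead of recomputing
            evaluations += 1
    return sim, evaluations


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    matrix, count = gaussian_similarity(pts)
    print(count)  # n * (n - 1) / 2 kernel evaluations for n points
```

For n points this performs n(n-1)/2 kernel evaluations rather than n², which is the saving that symmetry makes available; how the paper distributes this work across Spark partitions is not described in the abstract.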
This article is indexed in CNKI, Wanfang Data, and other databases.