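Research and Application of Spark Performance Optimization Analysis in a Big Data Environment
Cite this article: HUANG Zhi, SU Chuancheng, SU Xiaohong. Research and Application of Spark Performance Optimization Analysis in Big Data Environment[J]. Meteorological Science and Technology, 2022, 50(1): 51-58.
Authors: HUANG Zhi  SU Chuancheng  SU Xiaohong
Affiliation: Meteorological Information Center of Guangxi Zhuang Autonomous Region, Nanning 530022
Funding: Supported by a 2021 Guangxi Meteorological Research Program mandated project (桂气科2021ZL02)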
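Abstract: The existing CIMISS (China Integrated Meteorological Information Sharing System) falls well short of supporting large-volume queries that span long time series, many stations, and many meteorological elements. This study uses the monthly surface meteorological record reports of Guangxi weather stations, from each station's establishment to the present, together with the physical resources of an existing Hadoop cluster. It redesigns the data ETL workflow, builds Parquet-format datasets, and converts them for storage on HDFS. It also embeds Spark Broadcast variables and tunes the Spark cluster's execution parameters, which raises the cluster's processing parallelism and the efficiency of Spark SQL join queries. The results show that the Parquet datasets reach a maximum compression ratio above 95%, that single large-volume queries run 1 to 5 times more efficiently than before, and that the system supports highly concurrent access, providing effective technical support for related forecasting and prediction operations.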

Keywords: Hadoop  Spark  ETL  Parquet  columnar storage  Broadcast
Received: 2021/4/24
Revised: 2021/9/6
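
The abstract does not say how the Broadcast variables are wired into the queries, so the following is only a minimal sketch of the general idea behind a broadcast (map-side) hash join, written in plain Python rather than Spark. The record shapes, station IDs, and function names are hypothetical. In Spark itself, the small table would typically be shipped to executors with `SparkContext.broadcast`; here a dictionary passed to every partition stands in for it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shapes, chosen only for illustration.
Station = Tuple[str, str]             # (station_id, station_name)
Observation = Tuple[str, str, float]  # (station_id, month, mean_temp_c)
Joined = Tuple[str, str, str, float]  # (station_id, station_name, month, mean_temp_c)


def build_broadcast_table(stations: Iterable[Station]) -> Dict[str, str]:
    """Build the small lookup table that would be sent once to every worker."""
    return {station_id: name for station_id, name in stations}


def join_partition(partition: Iterable[Observation],
                   lookup: Dict[str, str]) -> Iterator[Joined]:
    """Join one partition of the large table against the in-memory lookup.

    Each row probes the local copy of the small table, so the large table
    never has to be shuffled by key. Rows with an unknown station are
    dropped, which gives inner-join semantics.
    """
    for station_id, month, temp in partition:
        name = lookup.get(station_id)
        if name is not None:
            yield station_id, name, month, temp


def broadcast_join(partitions: List[List[Observation]],
                   stations: Iterable[Station]) -> List[List[Joined]]:
    """Apply the same lookup to every partition independently."""
    lookup = build_broadcast_table(stations)
    return [list(join_partition(part, lookup)) for part in partitions]


if __name__ == "__main__":
    stations = [("S001", "Alpha"), ("S002", "Beta")]
    partitions = [
        [("S001", "2020-01", 12.5), ("S999", "2020-01", 9.0)],
        [("S002", "2020-02", 15.0)],
    ]
    for part in broadcast_join(partitions, stations):
        print(part)
```

The approach pays off only when the small side (here, station metadata) fits comfortably in each worker's memory; otherwise a shuffle-based join is the safer choice.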

Research and Application of Spark Performance Optimization Analysis in Big Data Environment
HUANG Zhi, SU Chuancheng, SU Xiaohong. Research and Application of Spark Performance Optimization Analysis in Big Data Environment[J]. Meteorological Science and Technology, 2022, 50(1): 51-58.
Authors:HUANG Zhi  SU Chuancheng  SU Xiaohong
Institution:Guangxi Meteorological Information Center, Nanning 530022
Abstract: The existing CIMISS (China Integrated Meteorological Information Sharing System) lacks the capacity to support large-volume data queries over long time series, multiple stations, and multiple meteorological elements. In this study, monthly report data of historical surface meteorological records since the establishment of the meteorological stations in Guangxi, together with the physical resources of an existing Hadoop cluster, are used to redesign the ETL process, construct a Parquet-format dataset, and complete conversion and storage on HDFS. In addition, Spark Broadcast variables are embedded and the execution parameters of the Spark cluster are optimized, which improves the processing parallelism of the cluster and the join query efficiency of Spark SQL. The results show that the maximum compression ratio of the Parquet-format dataset exceeds 95% and that the efficiency of one-off large-volume queries is 1 to 5 times higher than before, with support for highly concurrent access, providing effective technical support for various related forecasting and prediction services.
Keywords: Hadoop  Spark  ETL (Extract, Transform, Load)  Parquet  columnar storage  Broadcast
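
To illustrate why a columnar layout such as Parquet can compress repetitive meteorological records well, the sketch below transposes row records into columns and run-length encodes one column. It is a toy model in plain Python, not the Parquet encoder, and all names and sample values are hypothetical. The abstract does not define "compression ratio"; the sketch assumes it means the fraction of space saved, 1 - compressed / original, and it counts stored entries rather than bytes.

```python
from typing import List, Sequence, Tuple


def to_columns(rows: Sequence[Tuple]) -> List[List]:
    """Transpose row-oriented records into one list per column."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(values: Sequence) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    encoded: List[Tuple[object, int]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


def space_saving(original_size: int, compressed_size: int) -> float:
    """Fraction of space saved (assumed definition): 1 - compressed / original."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 1.0 - compressed_size / original_size


if __name__ == "__main__":
    # Hypothetical rows: (station_id, month, mean_temp_c), sorted by station.
    rows = [("S001", f"2020-{m:02d}", 10.0 + m) for m in range(1, 13)]
    station_col, month_col, temp_col = to_columns(rows)

    runs = run_length_encode(station_col)
    print(runs)
    # Crude measure: number of stored entries before and after encoding.
    print(space_saving(len(station_col), len(runs)))
```

Stored row by row, the repeated station ID is interleaved with other fields and cannot form long runs; stored by column, it collapses into a single pair. Real columnar formats combine this kind of encoding with general-purpose compression, so actual ratios depend on the data.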