1.

黄志 詹利群 任晓炜 李涛 《气象科技》2019,47(5):768-772

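Under the Hadoop distributed computing and storage architecture, custom ETL data-cleaning rules are used to merge massive numbers of hourly single-station files from automatic weather stations into large files by year and station number, which are then transferred to and stored in HDFS. The Spark SQL parallel computing framework is then used to compute daily statistics for commonly used meteorological elements. The results show that data processing and retrieval are markedly faster than with a relational database. With Spark SQL, statistical queries over multiple meteorological elements, multiple stations, and long time series all return within seconds, and the advantage grows as the number of stations and the time span increase. The approach supports this class of meteorological data services more efficiently and offers a new way to move massive meteorological data processing from relational databases to a distributed big data architecture.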
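The merge-then-aggregate flow described in this abstract can be sketched in a few lines. This is a plain-Python stand-in, not Spark SQL and not the authors' code: the station IDs, the record layout, and the "drop missing values" cleaning rule are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly records: (station_id, "YYYY-MM-DD HH", temperature in deg C).
# None marks a missing observation that the assumed cleaning rule drops.
HOURLY = [
    ("54511", "2018-07-01 00", 24.0),
    ("54511", "2018-07-01 06", 22.0),
    ("54511", "2018-07-01 12", 30.0),
    ("54511", "2018-07-01 18", None),
    ("54511", "2018-07-02 00", 25.0),
    ("58362", "2018-07-01 00", 27.0),
    ("58362", "2019-01-01 00", 5.0),
]


def merge_by_year_and_station(records):
    """Group cleaned hourly records under a (year, station) key, one group per merged file."""
    merged = defaultdict(list)
    for station, timestamp, value in records:
        if value is None:  # assumed cleaning rule: skip missing observations
            continue
        merged[(timestamp[:4], station)].append((timestamp, value))
    return dict(merged)


def daily_statistics(merged):
    """Compute the daily mean, max, and min per station from the merged groups."""
    by_day = defaultdict(list)
    for (_, station), rows in merged.items():
        for timestamp, value in rows:
            by_day[(station, timestamp[:10])].append(value)
    return {
        key: {"mean": mean(values), "max": max(values), "min": min(values)}
        for key, values in by_day.items()
    }


merged = merge_by_year_and_station(HOURLY)
stats = daily_statistics(merged)
```

Each (year, station) group corresponds to one merged large file; in a Spark SQL setting the second function would roughly correspond to a GROUP BY over station and date.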

2.

A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data

Fei Hu Yongyao Jiang Yun Li Weiwei Song Daniel Q. Duffy 《International Journal of Digital Earth》2020,13(3):410-428

ABSTRACT

Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.
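To make the idea of a k-d tree over chunks concrete, the sketch below builds a small two-dimensional k-d tree over chunk centers and answers a bounding-box query. It is a simplified illustration only: the chunk grid, the `Chunk` and `Node` types, and the use of chunk centers rather than full extents are assumptions, and the workload-balancing and data-locality aspects of the paper's global index are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    lat: float  # latitude of the chunk center
    lon: float  # longitude of the chunk center


@dataclass
class Node:
    chunk: Chunk
    axis: int  # 0 splits on latitude, 1 on longitude
    left: Optional["Node"]
    right: Optional["Node"]


def coordinate(chunk: Chunk, axis: int) -> float:
    return chunk.lat if axis == 0 else chunk.lon


def build(chunks: List[Chunk], depth: int = 0) -> Optional[Node]:
    """Build a k-d tree by splitting at the median, alternating the axis per level."""
    if not chunks:
        return None
    axis = depth % 2
    ordered = sorted(chunks, key=lambda c: coordinate(c, axis))
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def range_query(
    node: Optional[Node],
    lat_range: Tuple[float, float],
    lon_range: Tuple[float, float],
) -> List[str]:
    """Return IDs of chunks whose centers fall inside the query box."""
    if node is None:
        return []
    found = []
    c = node.chunk
    if lat_range[0] <= c.lat <= lat_range[1] and lon_range[0] <= c.lon <= lon_range[1]:
        found.append(c.chunk_id)
    low, high = lat_range if node.axis == 0 else lon_range
    split = coordinate(c, node.axis)
    if low <= split:  # the left subtree may still hold values >= low
        found += range_query(node.left, lat_range, lon_range)
    if high >= split:  # the right subtree may still hold values <= high
        found += range_query(node.right, lat_range, lon_range)
    return found


# Hypothetical 4 x 4 grid of chunk centers covering the globe.
LATS = [-67.5, -22.5, 22.5, 67.5]
LONS = [-135.0, -45.0, 45.0, 135.0]
CHUNKS = [
    Chunk(f"c{i}_{j}", lat, lon)
    for i, lat in enumerate(LATS)
    for j, lon in enumerate(LONS)
]
TREE = build(CHUNKS)
hits = sorted(range_query(TREE, (0.0, 90.0), (0.0, 180.0)))
```

Splitting at the median keeps the tree balanced, which is one reason a k-d tree is a natural fit when the goal is to spread chunks evenly across workers.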

3.

A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce

Zhenlong Li Fei Hu John L. Schnase Daniel Q. Duffy Tsengdar Lee Michael K. Bowen 《International journal of geographical information science》2017,31(1):17-35

Climate observations and model simulations are producing vast amounts of array-based spatiotemporal data. Efficient processing of these data is essential for assessing global challenges such as climate change, natural disasters, and diseases. This is challenging not only because of the large data volume, but also because of the intrinsic high-dimensional nature of geoscience data. To tackle this challenge, we propose a spatiotemporal indexing approach to efficiently manage and process big climate data with MapReduce in a highly scalable environment. Using this approach, big climate data are directly stored in a Hadoop Distributed File System in its original, native file format. A spatiotemporal index is built to bridge the logical array-based data model and the physical data layout, which enables fast data retrieval when performing spatiotemporal queries. Based on the index, a data-partitioning algorithm is applied to enable MapReduce to achieve high data locality, as well as balancing the workload. The proposed indexing approach is evaluated using the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. The experimental results show that the index can significantly accelerate querying and processing (~10× speedup compared to the baseline test using the same computing cluster), while keeping the index-to-data ratio small (0.0328%). The applicability of the indexing approach is demonstrated by a climate anomaly detection deployed on a NASA Hadoop cluster. This approach is also able to support efficient processing of general array-based spatiotemporal data in various geoscience domains without special configuration on a Hadoop cluster.
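One way to picture an index that bridges the logical array model and the physical layout is a lookup table from (variable, time step) to a byte range in a native file, plus a grouping step that sends each byte range to the node that stores it. The sketch below is a toy model under those assumptions; the file paths, offsets, host names, and the `Location` type are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Location(NamedTuple):
    path: str    # native file holding the array slice
    offset: int  # byte offset of the slice inside the file
    length: int  # slice length in bytes
    host: str    # node that stores the block containing the slice


# Hypothetical index: (variable, time step) -> physical location.
INDEX: Dict[Tuple[str, int], Location] = {
    ("T", 0): Location("/merra/1979-01.hdf", 1024, 4096, "node-a"),
    ("T", 1): Location("/merra/1979-01.hdf", 5120, 4096, "node-a"),
    ("T", 2): Location("/merra/1979-02.hdf", 1024, 4096, "node-b"),
    ("U", 0): Location("/merra/1979-01.hdf", 9216, 4096, "node-b"),
}


def lookup(variable: str, t_start: int, t_end: int) -> List[Location]:
    """Resolve a spatiotemporal query to byte ranges without scanning any file."""
    return [
        location
        for (var, step), location in sorted(INDEX.items())
        if var == variable and t_start <= step <= t_end
    ]


def partition_by_host(locations: List[Location]) -> Dict[str, List[Location]]:
    """Group byte ranges by storing node so each task reads local data."""
    tasks = defaultdict(list)
    for location in locations:
        tasks[location.host].append(location)
    return dict(tasks)


selected = lookup("T", 0, 2)
tasks = partition_by_host(selected)
```

Because only the index is consulted to resolve a query, the data files can stay in their original format, which matches the abstract's point about storing data natively in HDFS.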

4.

Design of Cloud Storage Technology for Massive National Geographic Conditions Data

朱力维 王志强 王军 《东北测绘》2014,(6):131-133,139

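This paper examines the characteristics and content of massive national geographic conditions data. To address problems in current data storage, it reviews the concept of cloud storage and its key technologies, and proposes an HDFS-based cloud storage model for organizing and sharing massive national geographic conditions data in a cloud environment. Cloud storage services are accessed through cloud application service interfaces. The model can effectively address mass storage and service delivery for national geographic conditions data resources, and offers a reference for changing how such data are managed.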

5.

Application of Hadoop to Data-Intensive Meteorological Data Processing

肖卫青 杨润芝 胡开喜 林润生 刘立明 谷军霞 《气象科技》2015,43(5):823-828

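Statistical analysis of meteorological data is a data-intensive computing task. Current processing is mostly done on a single machine, which is slow for large volumes, struggles to keep up with growing data, and constrains research on meteorological data. For data-intensive meteorological data processing, the authors apply Hadoop's MapReduce model to improve computing efficiency. Because Hadoop is inefficient on meteorological data made up of many small files, they propose preprocessing the original files by consolidating multiple small files into large files that can be used directly for computation. Experiments show that this resolves Hadoop's inefficiency with large numbers of small files. A comparison with loading into and retrieving from Oracle shows that applying Hadoop to data-intensive meteorological data is of practical value.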

6.

Building a Database for a Meteorological Forecasting Operations Platform Using the HDF5 Data Format

李振锋 李五生 禄永旭 《气象与环境科学》2014,37(3):114-119

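The meteorological forecasting operations platform is used daily by forecasters in meteorological service departments at all levels. Its operating efficiency and the usability of its design directly affect the efficiency and quality of forecasting work. Databases for such platforms currently rely mostly on the Windows file system for data management, which leads to fragmented files, large file counts, and slow browsing. This paper analyzes a method for building the platform database on the HDF5 data format and compares its runtime performance with that of a database built on the open-source relational database Firebird. The results show that an HDF5-format database can store multiple data formats in a single file, is self-describing and extensible, and is convenient to use and manage. In terms of runtime efficiency, in standalone mode and over a reasonably stable network, the HDF5-format database takes about 50% less time than the Firebird database to store data; for reading data, it takes 1/10 of the time the Firebird database takes; and for local storage, it takes only 1/50 of the time the Firebird database takes.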

7.

Storage Strategy and Text Conversion Import Technology for Geological Data in a Big Data Environment

刘文毅 邓吉秋 韩肖肖 《江苏地质》2019,43(3):367-371

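Based on an analysis of the content and form of geological data documents, this paper proposes an integrated coupled data model and storage strategy for geological data in a Hadoop big data environment. It determines the target text format for geological data under HDFS and designs the storage methods and schemas for the original formats, the converted text formats, and the geological information. It also studies how to convert common geological data formats to text and builds a text conversion workflow. The work provides a technical path for importing geological data as text in a big data environment and a unified data foundation for intelligent processing, such as information extraction and fusion, of text-form geological data. It is of practical value for big data analysis of geological data.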

8.

Cloud Storage Technology for NetCDF Physical Oceanographic Data

夏伟 艾波 杨应召 尚恒帅 《海洋技术学报》2019,38(4):71-78

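Physical oceanographic data are multidimensional, spatiotemporal, and massive, and are stored mainly in the NetCDF structured file format. In a distributed environment, however, structured files make data blocks hard to address and block boundaries hard to determine, which limits storage and use in big data scenarios. This paper designs a cloud storage scheme for NetCDF physical oceanographic data based on HDFS and Spark. It first uses HDFS distributed storage to store and manage the data, then designs a data sharding scheme based on the Spark parallel computing framework and overrides the read interface to obtain the addresses of NetCDF file data blocks in the distributed environment, achieving efficient storage, query, and analysis of physical oceanographic data. A wave height-period scatter diagram statistics experiment was run on 100 years of physical oceanographic data for Chinese sea areas. The results show that, with hundreds of millions of records, the method cuts query and analysis time from 2,300 s under centralized file storage to under 50 s, an efficiency gain of more than 95% over centralized file storage, which verifies the correctness and effectiveness of the method.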

9.

Research on Data Pre-deployment in Digital Ocean Cloud Computing Service Flows

SHI Suixiang XU Lingyu DONG Han WANG Lei WU Shaochun QIAO Baiyou WANG Guoren 《海洋学报(英文版)》2014,33(9):82-92

Data pre-deployment in HDFS (Hadoop Distributed File System) is more complicated than in traditional file systems. Many key issues need to be addressed, such as determining the target location for data prefetching, the amount of data to prefetch, and the balance between data prefetching services and normal data accesses. To solve these problems, we exploit the characteristics of digital ocean information service flows and propose a deployment scheme that combines input data prefetching with output-data-oriented storage strategies. The method runs data preparation and data processing in parallel, thereby greatly reducing the I/O time cost of digital ocean cloud computing platforms when processing multi-source information synergistic tasks. The experimental results show that the scheme has a higher degree of parallelism than traditional Hadoop mechanisms, shortens the waiting time of a running service node, and significantly reduces data access conflicts.
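The core idea of overlapping data preparation with data processing can be illustrated with a single background worker that loads the next input while the current one is being processed. This is only a sketch of that overlap, not the paper's deployment scheme: `load` and `process` are hypothetical stand-ins, and choosing prefetch targets, prefetch volume, and output placement is out of scope here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def load(name: str) -> List[int]:
    """Stand-in for reading one input dataset (real code would read from HDFS)."""
    return list(range(len(name)))


def process(data: List[int]) -> int:
    """Stand-in for one service step that consumes a loaded dataset."""
    return sum(data)


def run_with_prefetch(names: List[str]) -> List[int]:
    """Process inputs in order while the next input loads in the background."""
    names = list(names)
    results: List[int] = []
    if not names:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, names[0])
        for next_name in names[1:] + [None]:
            data = pending.result()  # wait for the prefetched input
            if next_name is not None:
                pending = pool.submit(load, next_name)  # start the next prefetch
            results.append(process(data))  # overlaps with the prefetch above
    return results


results = run_with_prefetch(["sst", "wave", "wind"])
```

With a single prefetch slot, at most one input is loaded ahead of the one being processed, which bounds the extra memory the prefetch needs.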