Challenges of parallel computation in the Cloud
The rise of the data economy, driven by ongoing technological advances in cloud computing, the Internet of Things, mobile computing and related fields, is raising questions about how efficiently large datasets can be processed in the cloud, particularly with respect to performance, latency and throughput. Emerging distributed programming and parallelization approaches employ all manner of fine-grained operations and scheduling. They go beyond the familiar large-scale synchronous processing and messaging paradigm of Hadoop MapReduce, and beyond other innovative graph-based approaches that rely heavily on graph representations of large datasets (Eric P. Xing, 2015). By definition, a cloud computing environment is a distributed system, so both shared-memory and message-passing distributed programming can be applied in the cloud. For large-scale data processing in the cloud, however, the existing capabilities of neither model are sufficient (Sakr, 2014).
A distributed shared-memory programming paradigm enables a set of different processes to access a single memory space without relying on explicit inter-process communication mechanisms. The principal advantage of this approach is ease of programmability: the developer writes no inter-process communication code, because any data movement between processes is handled transparently and stays hidden from end users (Lionel Brunie, 2019).
The message-passing programming model, on the other hand, has distributed tasks communicate by sending and receiving messages. It provides an abstraction layer built on processes, whereas shared-memory models use threads. The challenge with both approaches is that tasks must coordinate with one another to achieve a common purpose or provide a service (S. Sakr, & Gaber, M., 2014).
The sharing inherent in the first approach requires a synchronization mechanism to control the order in which read and write operations are performed. In essence, distributed tasks are prevented from writing to the shared memory space simultaneously, to avoid data corruption or inconsistent service behavior. This is usually achieved at runtime with semaphores, locks or barriers. Modern examples of the shared-memory abstraction are the MapReduce programming model and GraphLab. Hadoop MapReduce satisfies the two main criteria of the shared-memory programming abstraction: developers need not explicitly write programs that send and receive messages, and the underlying Hadoop Distributed File System (HDFS) transparently provides a shared view of the data to all tasks (M. H. a. M. F. Sakr, 2013).
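The role of a lock in the shared-memory model can be illustrated with a minimal Python sketch. It is only an illustration on a single machine using the standard threading module, not how Hadoop or GraphLab implement synchronization; the function name run_counter and the shared dictionary are hypothetical names chosen for this example.

```python
import threading


def run_counter(num_threads: int, increments: int) -> int:
    """Let several threads update one shared counter and return its final value."""
    state = {"count": 0}        # the single memory space visible to every thread
    lock = threading.Lock()     # serializes each read-modify-write on the counter

    def worker() -> None:
        for _ in range(increments):
            with lock:          # only one thread at a time may enter this block
                state["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait for every worker before reading the result
    return state["count"]


if __name__ == "__main__":
    print(run_counter(num_threads=4, increments=10_000))
```

Note that the workers never send each other anything: they coordinate only through the shared counter and the lock that guards it, which is the property the paragraph above describes.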
In the message-passing distributed programming model there is no illusion of a single shared memory address space; instead, the exchange of messages is itself the critical synchronization mechanism. An industry example of the model is the Message Passing Interface (MPI), a standardized library interface for developing message-passing programs. In contrast to the shared-memory model, message-passing applications require a complete shift in the program development paradigm: the developer must think a priori about how to partition data across tasks, manage that data, and communicate and aggregate results using explicit messages. Unlike the shared-memory model, scaling up the system does not entail memory tuning. These challenges begin to affect performance significantly in large-scale distributed systems with non-linear, non-uniform access latency, such as the cloud. A summarized comparison of the two distributed programming models follows (M. H. a. M. F. Sakr, 2013):
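The explicit partition, send, receive and aggregate steps described above can be sketched in a few lines of Python. This is a simplified stand-in, not MPI: threads play the role of separate processes and queue.Queue objects play the role of message channels, an assumption made so the sketch runs anywhere with the standard library. The names partition, worker and distributed_sum are hypothetical.

```python
import queue
import threading


def partition(data, num_tasks):
    """Split data into num_tasks chunks, dealing items out round-robin."""
    return [data[i::num_tasks] for i in range(num_tasks)]


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A task with no shared data: it only receives a chunk and sends a result."""
    chunk = inbox.get()            # receive: block until a message arrives
    outbox.put(sum(chunk))         # send: reply with the partial sum


def distributed_sum(data, num_tasks: int = 4) -> int:
    """Sum data by explicitly sending one partition to each task."""
    results = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_tasks)]
    tasks = [threading.Thread(target=worker, args=(inbox, results))
             for inbox in inboxes]
    for t in tasks:
        t.start()
    for inbox, chunk in zip(inboxes, partition(data, num_tasks)):
        inbox.put(chunk)           # explicit send of each partition
    # Explicit receive and aggregation of one partial result per task.
    total = sum(results.get() for _ in range(num_tasks))
    for t in tasks:
        t.join()
    return total


if __name__ == "__main__":
    print(distributed_sum(list(range(101))))
```

Unlike the lock-based style, the developer here decides up front how the data is split and which messages flow where, which is the extra design burden the message-passing model imposes.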
Table 1: Comparison of the shared memory and the message-passing models (M. H. a. M. F. Sakr, 2013).
Cloud computing has been around for approximately two decades. Despite reports pointing to the business efficiencies, cost benefits and competitive advantages it offers over older ways of doing business, a large portion of the business community continues to operate as before, because specific challenges remain in adopting it for certain workloads. According to a study by the International Data Group, over 69 percent of businesses already use cloud technologies in one form or another, and 18 percent plan to implement cloud computing solutions at some point in the future. Dell Technologies, meanwhile, reports that companies that invest in big data analytics, cloud computing, mobility and security enjoy up to 53 percent faster revenue growth than their competitors (Salesforce, 2018). However, designing and implementing a large-scale distributed data program for the cloud involves more than sending and receiving messages or deciding on computational and architectural models. As already mentioned, performance and reliability issues around heterogeneity, scalability, communication, synchronization, fault tolerance and scheduling persist, especially in the context of real-time streaming data.
To cope with the growing scope of large-scale datasets, graphs have emerged as a useful and vital data representation technique. They show relevant relationships between data elements, such as interactions or dependencies, and their analysis can reveal valuable insights for many use cases: machine learning, social influence analysis, anomaly detection, clustering, recommendations, bioinformatics and more. Cloud computing platforms built on open source distributed graph processing systems, especially those based on the Hadoop MapReduce framework, are also becoming popular, with numerous implementations of graph libraries and connectors to graph databases. However, distributed graph processing applications inherit the limitations of whichever paradigm they adopt, shared-memory or message-passing. Developing efficient graph computations is extremely difficult because of issues of computation parallelization, data partitioning and communication management, which vary with the distributed abstraction adopted (Vasiliki Kalavri, 2016).
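To make the link between data partitioning and communication concrete, the following Python sketch assigns the vertices of a small graph to partitions and lists the edges that cross partition boundaries, since each such edge implies communication between machines. It assumes integer vertex IDs and a simple modulo placement; real graph systems use more elaborate partitioners, and the names partition_vertices and cut_edges are hypothetical.

```python
def partition_vertices(vertices, num_parts):
    """Assign each integer vertex ID to a partition by taking it modulo num_parts."""
    return {v: v % num_parts for v in vertices}


def cut_edges(edges, assignment):
    """Return the edges whose two endpoints live in different partitions."""
    return [(u, v) for u, v in edges if assignment[u] != assignment[v]]


if __name__ == "__main__":
    vertices = [0, 1, 2, 3]
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    assignment = partition_vertices(vertices, num_parts=2)
    crossing = cut_edges(edges, assignment)
    # Each crossing edge would need a message (or a remote memory access)
    # whenever its endpoints exchange data during a computation.
    print(f"{len(crossing)} of {len(edges)} edges cross partitions: {crossing}")
```

In this toy graph the modulo placement separates the endpoints of four of the five edges, which hints at why the choice of partitioning strategy weighs so heavily on communication cost.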
Huang et al., in their work “A General-purpose Distributed Programming System using Data-parallel Streams”, advocate a paradigm shift in the level of abstraction in order to make parallel and distributed programming of large streaming datasets easier to handle. Their system, DtCraft, is designed and implemented as a general-purpose distributed programming system based on data-parallel streams with a higher-level abstraction. The application can seamlessly and intuitively adopt either the shared-memory or an optimized message-passing model at runtime, and so avoids the constraints of being bogged down by one. DtCraft has demonstrated its power in the stream-oriented programming model and can power large-scale industry applications in machine learning, social media and cloud computing, hence the support it has recently acquired from the Defense Advanced Research Projects Agency (DARPA) to serve as the foundation of the next generation of distributed computing platforms (Huang, Lin, Guo, & Wong, 2018).
References
Eric P. Xing, Q. H., Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. (2015). Petuum: A New Platform for Distributed Machine Learning on Big Data.
Huang, T.-W., Lin, C.-X., Guo, G., & Wong, M. D. F. (2018). A General-purpose Distributed Programming System using Data-parallel Streams. Paper presented at the 2018 ACM Multimedia Conference (MM '18).
Lionel Brunie, L. L., Olivier Reymann, Nathalie Restivo (Producer). (2019, April 29). Distributed Shared Memory Systems. DOSMOS. Retrieved from https://perso.ens-lyon.fr/laurent.lefevre/DOSMOS/DSM.html
Sakr, M. H. a. M. F. (2013). Distributed Programming for the Cloud: Models, Challenges and Analytics Engines.
Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
Salesforce (Producer). (2018, October 30). 12 Benefits of Cloud Computing. Salesforce. Retrieved from https://www.salesforce.com/hub/technology/benefits-of-cloud/
Vasiliki Kalavri, V. V., and Seif Haridi. (2016). High-Level Programming Abstractions for Distributed Graph Processing. KTH Royal Institute of Technology, Stockholm.
Originally published at https://www.jacobsedo.com on April 30, 2019.