
Title: PAC3 - solution
Author: Antonio Albiol Perez
Course: Distributed Systems (Sistemas distribuidos)
Institution: Universitat Oberta de Catalunya


TABLE OF CONTENTS

Question 1: Read the BitTorrent Protocol Specification v2 and answer the following questions (http://bittorrent.org/beps/bep_0052.html)
  a) Describe the BitTorrent peer protocol and give an example of how a peer A connects to peer B and starts sharing the requested file
  b) Describe what unchoked peers are and how they affect the performance of the peers
Question 2: Memcached (https://memcached.org/) is a widely distributed memory caching system which is currently used by many well-known web sites such as Facebook or YouTube
  a) Find out more about this technology and describe how it achieves big scalability and helps to improve the performance of web sites
  b) Last February GitHub suffered the biggest DDoS attack in history (https://www.wired.com/story/github-ddos-memcached/?mbid=social_fb), carried out through memcached servers. Describe what design considerations of memcached helped the attackers to achieve such a huge DDoS attack
Question 3: Find out and explain what RabbitMQ is, explain the different patterns it uses to deliver messages and describe the protocols that it supports
Question 4: Discuss the differences between Chord and Pastry routing algorithms by giving an example
Question 5: Describe the Google File System (GFS) and the Hadoop Distributed File System (HDFS), how they work, with what purpose they were designed and what differences there are. Explain what would happen if GFS were used for the cases where HDFS is used and vice versa


Question 1: Read the BitTorrent Protocol Specification v2 and answer the following questions (http://bittorrent.org/beps/bep_0052.html)

a) Describe the BitTorrent peer protocol and give an example of how a peer A connects to peer B and starts sharing the requested file

BitTorrent is a P2P file-sharing application intended for the download of large, immutable files, each identified by a hash string – the GUID or Globally Unique Identifier – generated from the contents of its file tree laid out in a particular metadata format. The particularity of its design is that it relies on the capabilities of the individual machines that hold replicas of the same file to serve the downloads, instead of relying on a centralized site or server to do so. More precisely, with BitTorrent, clients can download different pieces or chunks of the same file from different peers at once – that is, in parallel.

Concretely, the BitTorrent protocol works in the following way: when a peer A – known as a seeder – first makes a complete file, with all of its chunks, available in the P2P network, a metadata file with a .torrent extension containing information about the new file is created. The metadata file includes information such as the name and length in bytes of the file, the location of a download coordinator or tracker hosted on a centralized server, and a checksum for each chunk of the shared file.

Once the file becomes available, a peer B wanting to download it – known as a leecher – can start getting different chunks of the file in no particular order by querying the tracker, which tells it how many chunks are available and which peers can serve them. A symmetrical TCP connection is then made between peers A and B, where the peer acting as server listens on a port between 6881 and 6889.

Before the file transfer starts, a handshake procedure takes place, consisting of the exchange of several messages between the peers. During handshaking, both peers exchange messages including the 20-byte truncated infohash – the file identifier. If the two peers do not send the same value, the connection is severed. In the final exchange of the handshake, the receiving peer sends its peer ID – its GUID – to the peer that initiated the connection, and the ID is also sent to the tracker to register the new download. If the peer ID does not match the one the initiating side expects, the connection is severed. That completes the handshaking procedure.

Once data starts being transferred, downloaders should keep several chunk requests queued in order to get good TCP performance – this is known as "pipelining". While the connection is established, keepalive messages are sent every two minutes to maintain the connection between peers, but peers may be disconnected if data is expected and a timeout occurs. Once peer B has all the file chunks, the connection is severed and the tracker at the HTTP server records the completed download in its statistics. Finally, peer B (the leecher) becomes a seeder, making it possible for the file to spread through the network. The tracker, the seeders and the leechers together are referred to as the swarm for a particular file.
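The handshake exchange described above can be illustrated with a short sketch. The following Python snippet builds and parses a BitTorrent-style handshake message (protocol string, reserved bytes, truncated infohash and peer ID); it follows the classic peer wire format, and the infohash and peer ID values used here are made up for illustration, not taken from a real torrent.

```python
import hashlib
import os
import struct

PROTOCOL = b"BitTorrent protocol"  # 19-byte protocol string of the peer wire protocol

def build_handshake(infohash: bytes, peer_id: bytes) -> bytes:
    """Build a 68-byte handshake: <pstrlen><pstr><8 reserved bytes><infohash><peer_id>."""
    assert len(infohash) == 20 and len(peer_id) == 20
    return struct.pack("!B19s8x20s20s", len(PROTOCOL), PROTOCOL, infohash, peer_id)

def parse_handshake(data: bytes) -> tuple[bytes, bytes]:
    """Return (infohash, peer_id) from a received handshake, or raise if malformed."""
    pstrlen, pstr, infohash, peer_id = struct.unpack("!B19s8x20s20s", data)
    if pstrlen != len(PROTOCOL) or pstr != PROTOCOL:
        raise ValueError("not a BitTorrent handshake")
    return infohash, peer_id

# Illustrative values: in BitTorrent v2 the infohash exchanged on the wire is the
# SHA-256 hash of the torrent's info dictionary truncated to 20 bytes.
infohash = hashlib.sha256(b"<info dictionary bytes>").digest()[:20]
peer_id = b"-XX0001-" + os.urandom(12)          # 20-byte peer ID (client prefix + random tail)

msg = build_handshake(infohash, peer_id)
remote_infohash, remote_peer_id = parse_handshake(msg)

# As described above, a peer severs the connection if the received infohash does not
# match the torrent it is serving.
assert remote_infohash == infohash
```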


One important feature of the protocol is that download scheduling is done in a decentralized way among the different clients participating in the network. The protocol also relies on the non-selfish behavior of the participants, in the sense that they should contribute to the network while taking advantage of it – this is known in BitTorrent terminology as the tit-for-tat mechanism. Because of that, the protocol rewards the clients that follow the tit-for-tat rule by giving them priority to download file chunks before those that do not follow it. One positive side effect of this feature is that it encourages the optimal use of bandwidth by peers.

b) Describe what unchoked peers are and how they affect the performance of the peers

In the BitTorrent peer protocol, connections keep two bits of state at each end: on the uploading (client) side there is a choking bit, while the downloading side keeps an interested bit. The choking bit establishes that the client should not send any data to the connected downloader peer while the bit stays in the choked state, so data only starts being uploaded to the downloading peer once unchoking happens. Connections start out in the choked state, so even if a connection has been established – that is, the handshake procedure was successful – no data is transferred at all until the unchoked state is reached. Choking is done for several reasons; for instance, if the client is already uploading at full capacity, or if the receiving peer becomes blacklisted, the choked state is activated in order to stop the transfer of data.

Regarding performance, choking is done to work around TCP congestion control, which behaves very poorly when sending over many connections at once. The client therefore sets the choking bit on some of its connections so that only a limited number of uploads remain active – that is, the choking algorithm that controls the choking state caps the number of simultaneous uploads to improve TCP performance. Other criteria a good choking algorithm should meet are that it should avoid rapid choke/unchoke changes of state – behavior known as fibrillation – and that it should unchoke the peers that let it download, that is, peers with the interested bit active. Finally, it should try out unused connections every now and then to find out whether they would be better than the ones currently in use – this is known as optimistic unchoking.
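A minimal sketch of the idea (a toy rechoke round, not the reference client's exact algorithm): every period, keep the upload slots for the few interested peers that have offered the best download rates, plus one randomly chosen peer as an optimistic unchoke. All names and rates below are made up for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Peer:
    peer_id: str
    interested: bool = False      # the remote peer wants data from us
    download_rate: float = 0.0    # bytes/s we receive from this peer (tit-for-tat input)
    choked: bool = True           # we are currently refusing to upload to this peer

def rechoke(peers: list[Peer], upload_slots: int = 4) -> None:
    """Toy rechoke round: reward the best uploaders, plus one optimistic unchoke."""
    interested = [p for p in peers if p.interested]
    # Tit-for-tat: reserve upload slots for the peers giving us the best download rates.
    best = sorted(interested, key=lambda p: p.download_rate, reverse=True)[:upload_slots]
    unchoked = {id(p) for p in best}
    # Optimistic unchoke: occasionally give a currently choked peer a chance,
    # so new peers can bootstrap and better partners can be discovered.
    candidates = [p for p in interested if id(p) not in unchoked]
    if candidates:
        unchoked.add(id(random.choice(candidates)))
    for p in peers:
        p.choked = id(p) not in unchoked

swarm = [Peer(f"peer{i}", interested=True, download_rate=random.random() * 1e6) for i in range(10)]
rechoke(swarm)
print([p.peer_id for p in swarm if not p.choked])   # at most upload_slots + 1 unchoked peers
```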


Question 2: Memcached (https://memcached.org/) is a widely distributed memory caching system which is currently used by many well-known web sites such as Facebook or YouTube.

a) Find out more about this technology and describe how it achieves big scalability and helps to improve the performance of web sites.

Memcached is a distributed memory object caching system that lets a cluster of web servers share a virtual pool of volatile cache memory. This means that, instead of each web server in the cluster keeping its own cache of frequently accessed data (for example, dynamically generated data such as MySQL query results that take a long time to compute), the servers share a virtual memory pool that can be accessed from all of them: if one web server stores an object in the shared cache, the other servers can also read it, which improves the speed and performance when serving that data. Moreover, the amount of memory in the virtual pool is the sum of all the physical memory dedicated to memcached across the web servers in the cluster, so servers that would have scarce memory resources under a non-shared approach benefit from other servers that have free memory for their cache. The main point of memcached is that it pools sections of memory from multiple hosts and makes the application hosted by the web server cluster see them as one large section of memory.

Regarding scalability, if the demand for served data grows to the point of having to add more servers to the cluster, the shared cache memory also grows. Moreover, if the web servers' physical memory is not enough to store all the objects created by every web server replica in the virtual pool, or if the organization simply does not need to upgrade the cluster with more web server machines (which could end up as idle servers with no use at all), it is possible to use dedicated machines exclusively as memcached servers. In this way, using dedicated hardware for memcached means that other applications running on a server will not interfere with it. For example, one could use a machine with a large amount of RAM (e.g. 64 GB) as a cache server and machines with lower specs to act merely as web servers. This property of the memcached system is what makes it extremely scalable, since it allows the cache to be expanded to very large amounts of memory.
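As an illustration of how an application typically uses such a pool, the following sketch applies the common cache-aside pattern with the pymemcache client library (any memcached client with get/set would do). The server addresses, key names and the expensive_query function are assumptions made up for the example.

```python
import json
from pymemcache.client.hash import HashClient

# The client hashes each key to one of the memcached nodes, so together the
# nodes behave like a single large pool of cache memory (addresses are assumed).
cache = HashClient([("cache1.example.com", 11211), ("cache2.example.com", 11211)])

def expensive_query(user_id: int) -> dict:
    # Placeholder for the slow database work the cache is meant to avoid.
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                          # cache hit: skip the database entirely
        return json.loads(cached)
    user = expensive_query(user_id)                 # cache miss: do the expensive work once...
    cache.set(key, json.dumps(user), expire=300)    # ...and keep the result for 5 minutes
    return user
```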


b) Last February GitHub suffered the biggest DDoS attack in history (https://www.wired.com/story/github-ddos-memcached/?mbid=social_fb), carried out through memcached servers. Describe what design considerations of memcached helped the attackers to achieve such a huge DDoS attack.

During the massive DDoS attack suffered by GitHub in late February 2018 (the largest recorded up to that date), publicly reachable memcached servers – servers that were not protected behind a firewall – were targeted directly by the attackers; that is, the attack was not carried out automatically by a botnet of malware-infected machines. The attack consisted of sending a stream of small queries to those memcached servers with a spoofed source IP address, so that each query generated a far larger response directed at the spoofed address – GitHub's. The result was that the traffic served by the memcached servers towards GitHub grew up to 1.3 Tbps (terabits per second), making the service unavailable to legitimate users who were trying to reach the GitHub site while the attack lasted, since the infrastructure was too busy absorbing the attack traffic to keep serving the rest of the site's users. This kind of attack is known as an "amplification attack", since the queries are much lighter than their replies.

The design consideration of memcached that allowed the attackers to perform the massive DDoS attack is that memcached servers do not require any kind of authentication in order to respond to a query, so if a server is exposed to the public internet, anyone with a certain level of knowledge of memcached can send a crafted command that makes it respond with a much larger reply. We could say, then, that the memcached system itself is not to blame as a vulnerable system; rather, the unsafe way it was deployed – not protecting it behind a firewall – is.

Figure 2 - Real-time traffic from the DDoS attack, which lasted around 20 minutes and reached a peak of 1.3 Tbps of inbound traffic. Image obtained from https://www.wired.com/story/github-ddos-memcached/?mbid=social_fb
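A rough back-of-the-envelope sketch of why amplification matters: small spoofed queries, multiplied by a large amplification factor and a high query rate, quickly add up to terabit-scale traffic. Every number below is purely illustrative, not measured data from the GitHub incident.

```python
# Purely illustrative figures (assumptions, not measurements from the attack):
request_bytes = 60            # a small spoofed UDP query to an exposed memcached server
amplification = 10_000        # each reply assumed ~10,000x larger than the query
reply_bytes = request_bytes * amplification

queries_per_second = 300_000  # total query rate spread across many exposed servers
attack_bps = reply_bytes * queries_per_second * 8   # bytes -> bits

print(f"reply size:    {reply_bytes / 1e6:.1f} MB per query")
print(f"attack volume: {attack_bps / 1e12:.2f} Tbps towards the spoofed victim address")
```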


Question 3: Find out and explain what RabbitMQ is, explain the different patterns it uses to deliver messages and describe the protocols that it supports.

RabbitMQ Software Overview

Plainly speaking, RabbitMQ can be understood as interface software used to exchange data between processes, applications and servers in an asynchronous manner. More precisely, it is message-queueing software, also known as a message broker or queue manager, in which message queues can be defined so that applications can connect to the queue manager and fetch the messages contained in them. Each queue is implemented as a FIFO data structure. The messages in a queue can contain any kind of information, such as data about a process that should be started on a remote application, or simple plain-text messages that need to be distributed by an email list provider. The queue-manager software keeps messages stored in the queue until a receiving application establishes a connection and takes a message off the queue in order to process it.

A message broker or queue manager can act as an outsourced resource for different services, such as a web application. Its goal is to reduce load and delivery times on the services that use it, since tasks that would normally take a certain amount of processing time are delegated to a third party or middleman that performs them on the service's behalf. The architecture of a message broker is in fact simple: client applications (known as "producers") create the messages and deliver them to the FIFO queue – that is, to the message broker. Then, other applications (known as "consumers") connect to the queue and get the messages. Messages stay in the queue until a consumer fetches them.

Figure 3 – Basic RabbitMQ architecture. Both the image and the preceding overview of RabbitMQ were obtained from https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
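The producer/consumer flow just described can be sketched with the pika Python client. The broker address, queue name and message body below are assumptions chosen for the example.

```python
import pika

# Producer: connect to the broker, declare a queue and publish a message on it.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)      # queue survives broker restarts
channel.basic_publish(
    exchange="",                                             # default exchange routes by queue name
    routing_key="task_queue",
    body="resize image 42",
    properties=pika.BasicProperties(delivery_mode=2),        # mark the message as persistent
)
connection.close()

# Consumer: messages stay in the queue until a consumer like this one fetches them.
def handle(ch, method, properties, body):
    print(f"processing: {body.decode()}")
    ch.basic_ack(delivery_tag=method.delivery_tag)           # acknowledge once the work is done

consumer_conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
consumer_ch = consumer_conn.channel()
consumer_ch.queue_declare(queue="task_queue", durable=True)
consumer_ch.basic_consume(queue="task_queue", on_message_callback=handle)
consumer_ch.start_consuming()                                # blocks, delivering messages as they arrive
```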

Message Patterns

The different message patterns currently implemented in RabbitMQ are described below:

- Channel Patterns: describe how messages are transported across a message channel.
- Message Construction Patterns: describe the intent, form and content of the messages that travel across the messaging system.
- Routing Patterns: discuss how messages are routed from a sender to the correct receiver. Message routing patterns consume a message from one channel and republish it, usually without modification, to another channel based on a set of conditions.
- Transformation Patterns: change the content of a message, for example to accommodate different data formats used by the sending and receiving systems. Data may have to be added or removed, or existing data may have to be rearranged.
- Endpoint Patterns: describe how messaging system clients produce or consume messages.
- System Management Patterns: describe the tools to keep a complex message-based system running, including dealing with error conditions, performance bottlenecks and changes in the participating systems.

Supported Protocols

RabbitMQ supports different messaging protocols:

- STOMP: a text-based protocol that emphasizes simplicity. It is poorly defined in terms of message semantics; however, it is easy to implement, and it is the only protocol that can be used by hand over a telnet / ssh connection.
- MQTT: a binary protocol that emphasizes lightweight publish/subscribe messaging, for which it has well-defined message semantics. It is mostly used by clients running on constrained devices.
- AMQP (versions 0.8, 0.9 and 0.9.1): RabbitMQ was originally developed to support AMQP, which is the "core" protocol supported by the message broker. It is a binary protocol and defines quite strong messaging semantics. For clients it is an easy protocol to implement.


- AMQP (version 1.0): AMQP 1.0 is totally different from its prior versions. It imposes fewer semantic requirements, so it is easier to add support for AMQP 1.0 to existing brokers. The protocol is substantially more complex than AMQP 0.8, 0.9 and 0.9.1.
- HTTP: even though HTTP is not a messaging protocol, RabbitMQ can transmit messages over HTTP in different ways:
  - using a management plugin that exposes a simple HTTP API (intended for diagnostic purposes);
  - supporting STOMP messaging to the browser, using WebSockets through a web-STOMP plugin;
  - using the JSON-RPC channel plugin that supports AMQP 0.9.1 messaging over JSON-RPC.


Question 4: Discuss the differences between Chord and Pastry routing algorithms by giving an example.

In any peer-to-peer system, peers organize themselves into overlay networks. There are two different ways of organizing the peers into the overlay. In an unstructured overlay the system does not impose a specific topology on the nodes, so they organize themselves, but at the cost of offering neither absolute guarantees nor good performance when locating objects, and of producing a relatively high message overhead. In a structured overlay the localization of objects is guaranteed, with relatively low message overhead and high performance, but at the cost of having to maintain complex overlay structures. It is important to note that structured peer-to-peer systems offer a high degree of fault tolerance: when a node unexpectedly leaves the overlay due to a network failure or power outage, the other nodes are able to rearrange themselves (for instance, they use what are called heartbeat messages to detect when a neighbor node becomes unavailable). Another interesting feature of structured overlays is that they provide load balancing to the system.

In the case of structured overlays organized as Distributed Hash Tables, DHT routing algorithms are needed to locate nodes and objects (identified by GUIDs generated with the SHA-1 cryptographic hash function, so nodes and objects share the same namespace) within the overlay in order to retrieve data from them or place data on them. In that sense, the main goal of both the Chord and Pastry routing algorithms, which are based on Distributed Hash Tables, is to create a completely decentralized and structured peer-to-peer overlay where objects (identified by their hash string) can be efficiently located, and lookup queries efficiently routed. Structured peer-to-peer overlays use the concept of consistent hashing and are able to locate objects with a cost on the order of O(log N) – that is, with logarithmic complexity – where N is the number of nodes that have joined the overlay.

Chord Overlay Structure: Chord uses consistent hashing with SHA-1 as the hash function to assign an m-bit ID to nodes and to objects (known as keys), where m is a system parameter. These IDs are arranged on a circle going from 0 up to 2^m - 1. Data items (keys) are then mapped to the node whose ID is equal to or follows the key's ID on the circle (the key's successor).
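A minimal sketch of the consistent-hashing idea underlying Chord (a toy successor lookup, not a full Chord implementation with finger tables): node and key IDs are derived with SHA-1, and each key is stored on the first node whose ID is equal to or follows the key's ID on the ring. Node addresses and key names below are made up for the example.

```python
import hashlib
from bisect import bisect_left

M = 160                          # SHA-1 yields 160-bit identifiers, so the ring spans [0, 2^160 - 1]

def ring_id(name: str) -> int:
    """Map a node address or object key onto the identifier circle."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

nodes = ["node-a:4001", "node-b:4001", "node-c:4001", "node-d:4001"]   # assumed node addresses
ring = sorted((ring_id(n), n) for n in nodes)

def successor(key: str) -> str:
    """Return the node responsible for a key: the first node ID >= key ID, wrapping around."""
    k = ring_id(key)
    i = bisect_left(ring, (k, ""))
    return ring[i % len(ring)][1]            # wrap to the first node if the key ID exceeds all node IDs

print(successor("movie.mkv"))                # the node that stores (or routes towards) this object
# When a node joins or leaves, only the keys between it and its predecessor move,
# which is the load-balancing property consistent hashing gives to these overlays.
```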

