ICS 32 - 4:24 - URL and HTTP PDF

Title	ICS 32 - 4:24 - URL and HTTP
Course	Intro to Programming
Institution	University of California Irvine

ICS 32 Spring 2018 Notes and Examples: URLs and HTTP

Background

Thus far in this course and the preceding one, you've written Python programs that read data from text files and that exchange data over a network via sockets, which are two big steps that push outward the boundaries around what we can accomplish in Python. However, there is an elephant in the room, so to speak. If we think about where most of the interesting data on the Internet resides, it's on the web. Web sites display content and allow human users to interact with web-based data, while web services provide a similar ability to other programs.

Both web sites and web services are organized around the same fundamentals we've already seen: A connection is initiated by a client connecting to a server (quite often on the server's port 80) and a protocol is followed that governs what the conversation looks like. So if we want to interact with web data (the simplest example of which is to download the content of a web page), we need to know enough about that protocol to be able to implement the conversation.

That HTTP is a common, standard protocol is good news for us: There's a pretty good chance we're going to be able to use it without having to implement all of the low-level, fiddly code we had to write when we implemented our own custom protocol before. But, nonetheless, we still need to understand what HTTP is, its basic structure, the terminology surrounding it, and so on. The fine-grained details, however, will be something we can gloss over, yet still be able to get real work done.

URLs

When we use a browser to visit a web page, all we need to do is tell the browser where we want to go and it handles the rest. The notion of "where you want to go" is encapsulated by a URL (Uniform Resource Locator), which specifies a few things:

- What protocol should be used to download the web page?
- From what host (i.e., an IP address or the name of a machine, like www.ics.uci.edu) should the web page be downloaded? Occasionally, we also specify the port, when we want it to be something other than the default.
- What page on that machine should be downloaded?

One of the earlier code examples included a link to a short Python module called oops.py. The complete URL for that link is: http://www.ics.uci.edu/~thornton/ics32a/Notes/Exceptions/oops.py. Here's what that URL means:







- The first few characters (preceding the colon) indicate what protocol should be used for the network conversation. For most web pages, that protocol will be listed as http, which means that we'd like to use the protocol called HTTP (HyperText Transfer Protocol). Another common alternative is https, which uses HTTP over a secure connection, which provides the dual benefits of making eavesdropping very difficult and of validating that you're really connecting to the server that you think you are.
- After the colon and the two slashes is the host. In this case, that host is listed as www.ics.uci.edu, which is the machine on which the ICS web site is hosted. It is possible also to specify a port, by following the host with a colon and a port number (e.g., www.ics.uci.edu:8080). The default port number for HTTP traffic is 80; for HTTPS, it's 443. Since most web sites use these default ports, port numbers are not usually specified in a browser except in the rare instances that they're something other than the default. Web services (consumed by programs, as opposed to human users) often use alternative ports, though.
- The rest of the URL specifies what web page we'd like to download from the given host using the given protocol. In this case, that page is /~thornton/ics32a/Notes/Exceptions/oops.py, which is a page in the web directory that's under my control.
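To make the three parts of a URL concrete, here is a minimal sketch that pulls them apart using urlsplit() from the standard library's urllib.parse module. The function name describe_url and the DEFAULT_PORTS table are made up for illustration; they are not part of the course's code.

```python
from urllib.parse import urlsplit

# Default ports named in these notes; this table is just for illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Return (protocol, host, port, page) for the given URL.

    If the URL does not specify a port, fall back to the default
    port for its protocol.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path


protocol, host, port, page = describe_url(
    "http://www.ics.uci.edu/~thornton/ics32a/Notes/Exceptions/oops.py")
print(protocol)  # the part before the colon
print(host)      # the part after the two slashes
print(port)      # no port was given, so this is the default for http
print(page)      # everything after the host
```

Filling in the default port is a choice made in this sketch; urlsplit() itself reports a missing port as None.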

Given that information, a browser will know just what it needs to do:   

- Initiate a socket connection to port 80 on www.ics.uci.edu.
- Use HTTP to request the page /~thornton/ics32a/Notes/Exceptions/oops.py.
- Parse the HTTP response and draw the page in the browser window.

Browsers aren't the only programs that can have conversations using HTTP, though; our Python programs can do it, too. We just need to know a little bit about HTTP in order to do so effectively.

Some background on HTTP

HTTP (HyperText Transfer Protocol) is the protocol with which most web traffic on the Internet is transacted. As of this writing (Spring 2018), its latest version is HTTP/2, though it's still in the early stages of worldwide adoption; we'll stick with the more broadly-used (and easily-understood) HTTP/1.1 for now. HTTP is a request-response protocol, which means that its conversations go something like this:

- Client initiates connection to server
- Server accepts connection
- Client makes a request
- Server sends a response

After that single request and response, both sides close the connection. (I should note that there are performance optimizations available that let a client specify that the connection should be kept open if, for example, the client knows that it needs not just a web page's text but also several images from the same server. For our purposes, we'll stick with a single request and response per connection.)

Python programs can make these requests and parse these responses, but that requires us to know a little bit about the format of each. HTTP requests come in a few flavors, but the most common of them is called a GET, which means that the client would like to "get" a resource (a web page, an image, etc.) from the server. (We may see other alternatives later if we find a need for them.) A GET request in HTTP/1.1 looks like this.

    GET /~thornton/ics32a/Notes/Exceptions/oops.py HTTP/1.1
    Host: www.ics.uci.edu

The first line of a GET request begins with the word GET, is followed by the web resource you want to download (the part of the URL that follows the protocol and host), and finally is followed by HTTP/1.1, as a way to indicate what protocol we expect to be using for the conversation. Notice that there are spaces separating the word GET and the resource, and also between the resource and the HTTP/1.1. Because these spaces are part of the protocol, and because the presence of spaces elsewhere could make this more difficult for a server to handle, note that URLs are not permitted to contain spaces.

The second and subsequent lines contain what are called headers, which allow us to specify a variety of supplementary information that the server can use to figure out how to send us a response. In our case, we've included just one, a header called Host:, which specifies the name or IP address of the host we think we're connecting to; this is useful in the case that the same machine has multiple names (e.g., more than one web site being served up by the same machine), and is required in every HTTP/1.1 request.
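As a rough illustration of the request format just described, the sketch below assembles the bytes of a minimal GET request. The function name build_get_request is hypothetical. The sketch also relies on two details these notes don't spell out: HTTP ends each line with a carriage return and line feed ("\r\n"), and a request is finished off with a blank line.

```python
def build_get_request(host, resource):
    """Build the bytes of a minimal HTTP/1.1 GET request.

    host is the name of the server (e.g., 'www.ics.uci.edu') and
    resource is the part of the URL that follows the host.
    """
    # Spaces separate the three parts of the first line, so the
    # resource itself must not contain any.
    if " " in resource:
        raise ValueError("resource must not contain spaces")

    lines = [
        f"GET {resource} HTTP/1.1",   # request line
        f"Host: {host}",              # the one header we need
        "",                           # blank line: no more headers
        "",                           # so that the request ends with "\r\n"
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request(
    "www.ics.uci.edu", "/~thornton/ics32a/Notes/Exceptions/oops.py")
print(request)
```

These are the bytes a program would send over a socket connected to the server's port 80. Only the Host header is included here, to match the example request in the text.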
Additional headers can specify what browser (and what version) is being used, so that, for example, a server can send back different output for a small-sized screen like an iPhone than for a larger-sized screen like a laptop or desktop; or a variety of performance optimizations that are available; or security-related information (such as a password or an access token that grants access to a page that might otherwise be hidden). A blank line following the last header informs the server that there are no more headers. At that point, the request is complete.

Using PuTTY (Windows) or Telnet (Mac), connect yourself to www.ics.uci.edu on port 80 and try sending the request above (plus a blank line following it, so the server will know there are no more headers) and you should get back a response very much like this one (some details left out here for brevity).

    HTTP/1.1 200 OK
    Date: Wed, 31 Jan 2018 07:56:07 GMT
    Server: Apache/2.2.15 (CentOS)
    ...
    ...
    Content-Length: 435
    Content-Type: text/plain; charset=UTF-8

    # oops.py
    #
    # ICS 32 Winter 2018
    # Code Example
    ...
    ...
    if __name__ == '__main__':
        f()

The first line of the response indicates that the server agrees to have an HTTP/1.1 conversation (that's the HTTP/1.1 part), followed by what's called a status code (in this case, 200) and a reason phrase (in this case, OK). There are forty or so status codes that are defined as part of the HTTP/1.1 standard; the two most common ones are:

- 200 (OK), which means that everything went as planned, the server's way of saying "Okay, cool, here's the web page you asked for!"
- 404 (Not Found), which means that the server doesn't have the page that you asked to download. (If you've ever seen "404" show up in a browser during your travels around the web, this is why; it's an HTTP status code, "geekspeak" for a web page that doesn't exist.)
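The first line of a response is simple enough to pick apart by hand. Here is a minimal sketch, assuming the line has already been received and decoded into a string; parse_status_line is a name made up for this example.

```python
def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 200 OK' into its
    protocol version, numeric status code, and reason phrase."""
    version, _, rest = line.strip().partition(" ")
    code, _, reason = rest.partition(" ")
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")

if code == 200:
    print("The server sent back the page we asked for")
elif code == 404:
    print("The server doesn't have that page")
else:
    print(f"Some other status: {code} {reason}")
```

Splitting only on the first two spaces matters because a reason phrase like "Not Found" can contain spaces of its own.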

The first line of the response is followed by headers, just as the first line of the request is. The server determines what headers to send, and the details there are too numerous to list, but I've included a few of the more interesting ones in the example above:    

- Date is the date/time at which the response was generated.
- Server specifies what type of server is being run and what version. As of this writing, the ICS web server is running version 2.2.15 of a server called Apache (which is quite common on the web).
- Content-Length specifies the length, in bytes, of the content that will be sent back. This allows the client to know when the content has ended.
- Content-Type specifies what kind of content is being sent back (e.g., a web page, a text file, audio, video, etc.). Browsers respond to the content type by deciding what to do with the content: web pages are shown in the browser, video is often displayed in a video plugin or an external media player, etc. If a browser isn't sure what to do with content, it generally just asks you if you want to save the file somewhere on your hard drive.

After the last header is a blank line, followed by the desired content — in this case, the contents of the file oops.py that is linked from one of my code examples. For those of you who are interested in the full details of HTTP, the specification for it can be found here. Don't feel obligated to read through it unless you're interested; it's not a part of the course. But if you want to get an idea of the complexity level of HTTP, and why we should be so quick to want to find a library that implements all of that complexity for us, take a quick look through it (and note that one of the main authors of the specification, Roy Fielding, was completing his Ph.D. here at UCI at the time it was written).
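To make the layout of a response concrete (status line, headers, a blank line, then content), here is a small sketch that takes apart a hand-written response. The sample bytes are invented for this example and are far shorter than what a real server sends; parse_response is a hypothetical name. The sketch assumes lines end in "\r\n" and that the whole response has already been received, which a real client cannot take for granted.

```python
# A made-up, abbreviated response in the same shape as the one shown above.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 11\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"\r\n"
    b"# oops.py\r\n"
)


def parse_response(raw):
    """Split raw response bytes into (status_line, headers, content).

    headers is a dictionary whose keys are lowercased header names,
    since header names are not case-sensitive.
    """
    # The first blank line separates the headers from the content.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-Length tells us how many bytes of content to expect.
    length = int(headers["content-length"])
    return lines[0], headers, rest[:length]


status_line, headers, content = parse_response(SAMPLE_RESPONSE)
print(status_line)
print(headers["content-type"])
print(content)
```

Even this toy version leaves out most of what the specification covers, which is the argument for letting a library do the work.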

The urllib.request module in the Python standard library

Unlike the protocols we've implemented in this course, which had a fairly straightforward sequence of what needed to be sent from client to server and vice versa, HTTP is anything but simple. It is used for everything from fetching a simple web page, to implementing the "guts" of the conversations happening behind the scenes while you use full-featured web sites like Gmail, to allowing non-browsers to interact with web data (e.g., programs that can send tweets via Twitter). While we could certainly implement an HTTP conversation using the techniques we've seen so far (opening a socket connection to a server's port 80, constructing and sending a GET request, parsing the response), this is a very complex task. In order to do the job right, we would need to implement the entire specification, which weighs in (when printed) at well over 100 pages.

Happily, HTTP support is so fundamental to the needs of so many programmers that many programming language libraries include it; Python is no exception. Python's library includes a number of modules that implement different parts of the HTTP specification, with the main trick being to understand which module you need in a given circumstance.

Suppose our goal is simple: We just want to download the contents of a single web page in Python, given its URL. (Note that your task in Project #3 is similar: Given the URL to information on the web that your program will need, you just want to download and use that information.) More complex interactions require more complex tools, but the interactions we've needed thus far are the simplest ones, so the simplest part of the library will suffice. That module is called urllib.request. The urllib.request module has one function that we're interested in: urllib.request.urlopen().
Looking through its documentation reveals many more details than we need to know if we only want to download a web page using a GET request; downloading one page can be done in the Python shell by doing just this:

    >>> import urllib.request
    >>> response = urllib.request.urlopen('http://www.ics.uci.edu/~thornton/ics32a/Notes/Exceptions/oops.py')

The urlopen() function returns an object called an HTTPResponse, which provides a few useful attributes and methods, the most important of which is the read() method, which retrieves all of the content from the response (i.e., the contents of the web page you asked for).

    >>> data = response.read()
    >>> response.close()
    >>> data
    b"# oops.py\r\n#\r\n# ICS 32 Winter 2018\r\n# Code Example\r\n#\r\n......."

There are a couple of things worth noticing here. One is that we closed the response object once we were done reading the data from it. Just like you want to close files and sockets when you're finished with them, you're going to want to close the response objects you get back from urlopen(), too. (In fact, internally, there's probably a socket that is being closed behind the scenes when you close the response.) Also, if you look carefully at what's shown in the Python shell when we look at the value of data, you'll notice that it doesn't look quite like the other strings you've seen before. This string has a b displayed in front of the quote that begins it; as usual, when you see a little distinction like that, it probably means something. Let's take a look at data's type.

    >>> type(data)
    <class 'bytes'>
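One way to make sure the response always gets closed, even if reading it raises an exception, is to use a with statement; the object returned by urlopen() can be used as a context manager. A minimal sketch follows, with download_bytes being a name chosen for this example rather than anything from the course's code.

```python
import urllib.request


def download_bytes(url):
    """Download the content at the given URL and return it as a
    bytes object, closing the response automatically."""
    with urllib.request.urlopen(url) as response:
        return response.read()
```

Calling download_bytes('http://www.ics.uci.edu/~thornton/ics32a/Notes/Exceptions/oops.py') should return the same bytes object that read() returned in the shell session above, assuming the page is still available at that address.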

Interlude: What is a bytes object?

A bytes object in Python represents what it sounds like: a collection of bytes. A byte is a simple concept: it's eight binary digits, each being either 0 or 1. Ultimately, everything in your computer's memory (and everything sent from one machine to another via a computer network) is represented this way; the question is how the bytes are interpreted. If you see the byte 10001101, what does it mean? The answer depends very much on what kind of data you expect to get. In other words, the bytes don't mean anything without you knowing what the encoding is; the encoding is a mapping between bytes and their meanings.

Conceptually, we think of strings as sequences of characters. But what is each character? In truth, each character is really stored numerically, using the same binary digits as everything else. But how do we know which binary digits mean 'A', which mean '8', and so on? That's where an encoding comes into play: An encoding for strings maps characters to their binary representations (and back again). We can encode a string into its bytes, and we can decode the bytes back into a string again. The only trick is telling Python which encoding to use.

The most common encoding on the Internet is one called UTF-8, which is one way that a character set called Unicode can be encoded. The details are well beyond the scope of our work here; for us, it's enough to know that UTF-8 exists and that it's a particular kind of encoding of strings as bytes. If we have a bytes object, we can turn it into the corresponding string by calling the method decode() on it.

    >>> text = data.decode(encoding = 'utf-8')
    >>> type(text)
    <class 'str'>

Similarly, we can take a string object and call the encode() method on it to turn it back into a bytes object.

    >>> encoded = text.encode(encoding = 'utf-8')
    >>> type(encoded)
    <class 'bytes'>

Notice, in both cases, that we passed an encoding argument to the method. This is because it's not enough to say that we want to do the conversion; because there are lots of different conversions possible, we have to say how we want to do the conversion. It will be a safe assumption, in our work, that strings are encoded as UTF-8, though this is hardly a 100% safe assumption on every project you'll ever do. (Note that this is why the HTTP response we saw earlier included a Content-Type header, which described not only that the content was text, but that its encoding was UTF-8; unless the server tells us what we got, we have no way to know how to use it.)
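A short example may help show why the encoding matters. The string below was chosen for illustration because it contains one character outside of plain ASCII; the variable names are arbitrary.

```python
text = "caf\u00e9"   # the four-character word "café"

# Encoding turns the string into bytes. In UTF-8, the accented
# character takes two bytes, so four characters become five bytes.
encoded = text.encode(encoding="utf-8")
print(len(text), len(encoded))
print(encoded)

# Decoding with the same encoding gives back the original string.
round_trip = encoded.decode(encoding="utf-8")
print(round_trip == text)

# Decoding the same bytes with a different encoding (here, Latin-1)
# doesn't fail, but it quietly produces the wrong characters.
garbled = encoded.decode(encoding="latin-1")
print(garbled == text)
```

The bytes are identical in both decode() calls; only the interpretation changes, which is why the server has to tell us which encoding it used.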

Continuing with our previous example

Now that we understand what a bytes object is, we can decide what we want to do with it. Sometimes, we'll want to decode it, because we know it's a string (and we know what the appropriate encoding is). Other times, we'll want to write it to a file, send it to another host via a socket, or any number of other things. What you do next depends on what you want. If our goal was to print out the text from the web page, though, we would need to decode it, because it's easier to work with strings when we print text than with bytes.

    >>> text = data.decode(encoding = 'utf-8')
    >>> lines = text.splitlines()
    >>> lines
    ['# oops.py', '#', '# ICS 32 Winter 2018', '# Code Example', '#', .......]

And, at that point, we could loop over the list of lines and print each one out. Once we have a list of strings, all of the techniques we already know about will come into play.
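Put together, the decode-split-print steps look something like the sketch below. The sample bytes are just the first few lines of the content shown earlier, and bytes_to_lines is a name made up for this example.

```python
def bytes_to_lines(data, encoding="utf-8"):
    """Decode downloaded bytes into a string and split it into a
    list of lines, with the line endings removed."""
    return data.decode(encoding=encoding).splitlines()


# The beginning of the downloaded content from the example above.
sample = b"# oops.py\r\n#\r\n# ICS 32 Winter 2018\r\n# Code Example\r\n#\r\n"

for line in bytes_to_lines(sample):
    print(line)
```

Note that splitlines() treats the "\r\n" pairs as single line endings, so none of the resulting strings ends with a stray "\r".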

The code

Below is a link to a short program that asks the user to type a URL, as well as a path on their local hard drive, then downloads the contents of that URL and saves it into a file at the specified path, using the techniques demonstrated above.

download_file.py...
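For readers without access to the linked file, here is a rough sketch of what a program like the one described might look like. It is not the course's actual download_file.py; the function names and prompts are made up, and error handling (a bad URL, an unwritable path) is left out entirely.

```python
import urllib.request


def download_to_file(url, path):
    """Download the content at url and save it, as bytes, into a
    file at path. Returns the number of bytes that were saved."""
    with urllib.request.urlopen(url) as response:
        data = response.read()

    # The content is written in binary mode, exactly as received,
    # so no decoding is needed.
    with open(path, "wb") as output_file:
        output_file.write(data)

    return len(data)


def run_user_interface():
    """Ask the user for a URL and a path, then do the download."""
    url = input("URL: ").strip()
    path = input("Save to path: ").strip()
    count = download_to_file(url, path)
    print(f"Saved {count} bytes to {path}")
```

Calling run_user_interface() starts the interaction. Because the file is written in binary mode, the same function works for content that isn't text, such as images.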