10. Introduction
The popular notion of the internet is of a “cloud”: a vast, complex, liminal space with little grounding in the physical world. As developers of web applications (and as informed digital citizens), we need to move beyond this model to a deeper understanding. A better way to think of the internet is as the sum total of many pairwise conversations, each between a single computer and another computer. These conversations are governed by a simple set of protocols, which ensure that each pair of computers can communicate in a way that they both understand. In this chapter, we’ll learn a bit about the history of the internet and some of the protocols that make communication over the internet possible.
11. History
In order to understand how the internet works today, it’s helpful to consider how people communicated before the internet. We’ll find that almost all of the modern internet is based on principles that are far older.
Some of the earliest communication came through human messengers. One famous story from this time is of the first marathon - according to legend, after the Battle of Marathon a messenger ran from Marathon (a town in Greece) to Athens to deliver news of the Greek victory over the Persians. He arrived, delivered his news, and then collapsed and died. These runners were often captured, so code systems were invented to make sure that opponents who captured messages couldn’t understand them. One famous example is the Caesar cipher. Modern messages are still sometimes physically transported through sneakernets. Another famous example of physical transportation is the Pony Express <https://en.wikipedia.org/wiki/Pony_Express>.
Another common method in ancient times was to distribute messages by attaching them to pigeons. The results of the first Olympics are even said to have been announced using pigeons. An April Fools’ Day proposal published through the Internet Engineering Task Force (RFC 1149) described how to carry internet traffic by pigeon, and volunteers later demonstrated that it works. A single pigeon is limited in how far it can fly, so sophisticated networks of “pigeon lofts” were established - a pigeon would fly as far as it could, and then an operator would move the message to a new pigeon and relaunch it, similarly to how routers work in the modern internet.
Telegrams <https://en.wikipedia.org/wiki/Telegraphy#Telegram_services> developed as a sophisticated communications network in the 1800s. Telegrams were messages that were encoded into Morse code and transmitted long distances over electrical wires. There were many technical advances from the telegraph era that still exist in the modern internet:
A centrally-maintained list of telegraph addresses improved delivery rates, similar to how IP addresses are used on the internet today.
The first transatlantic cable <https://en.wikipedia.org/wiki/Transatlantic_telegraph_cable> was completed for a telegraph line in 1858, and undersea cables still carry most of the world’s internet traffic.
Telegrams were notoriously unreliable, and there was no built-in way for a sender to ensure that their message was received correctly. To combat this issue, the recipient would often “repeat” the message back to the sender. If the sender received their own message back, they could be confident that it was received correctly. The modern Transmission Control Protocol uses similar strategies to handle reliability issues on the internet.
12. Internet Protocols, in four parts
For this text, we’ll use the TCP/IP model of the internet, which encompasses enough detail for our purposes. Networking texts also often refer to the OSI model, which splits the same work into seven layers; it is slightly more complex but uses the same underlying principles.
The TCP/IP model is divided into four “layers”, each of which is responsible for a different step in the process of transferring messages. Each layer defines a set of communication protocols, and each message must pass through each of these layers.
12.1. Application Layer
The internet is capable of transmitting any kind of message you might want to send. There are a wide variety of “applications” that transmit messages, including email, instant messaging, and the World Wide Web.
Each of these applications has its own protocol that encodes the user’s message into binary so that it can be sent over the internet.
There are many protocols in the application layer. For this book, we’ll focus on the Hypertext Transfer Protocol (HTTP), the protocol that is used by the World Wide Web.
12.1.1. HTTP
HTTP is the protocol that allows us to access websites. You might recognize HTTP from the front of URLs, for example https://en.wikipedia.org/ (we’ll talk about what the s in https means soon). Every time that you open a web page in a browser, you’re using HTTP.
An HTTP message requires two participants:
The client is the computer that wants to access a web page. The client sends a special message called a request to ask for the page.
The server is the computer that has the web page. After the server receives the request, it sends back a response, which is usually just the requested page.
We’ll often refer to HTTP as following a client-server model, and to the process of accessing a page over HTTP as the request-response cycle.
12.1.1.1. HTTP Requests
When you type a web address into your browser, it generates an HTTP request, a special kind of message that is transferred over the internet. The request includes a lot of information, specifically:
- The “hostname” of the server that it is sending the request to
- A verb (sometimes called a “request method”) that describes what the client is asking the server to do. There are several HTTP verbs, and you should know two of them (note: HTTP verbs are usually written in all capital letters):
  - GET - asks a server to send a resource (for example, when you first type a web address into your browser, it creates a GET request asking for the web page)
  - POST - sends information to a server (for example, when you fill out a form on a website, your browser sends the information from the form in a POST request)
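To make the pieces of a request concrete, here is a rough sketch that assembles the text of a request by hand. The helper name `build_request` is hypothetical, and a real browser adds many more header lines; the sketch only assumes the usual HTTP/1.1 layout, where the first line carries the verb and the `Host` line carries the hostname.

```python
def build_request(method: str, host: str, path: str = "/", body: str = "") -> str:
    """Return the text of a minimal HTTP/1.1 request (illustrative only)."""
    lines = [
        f"{method.upper()} {path} HTTP/1.1",  # verb, path, protocol version
        f"Host: {host}",                      # hostname of the server
    ]
    if body:
        # Requests that carry data (such as POST) declare the body length in bytes.
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the header lines from the optional body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


get_request = build_request("GET", "www.python.org")
post_request = build_request("POST", "example.com", "/signup", "name=Ada")
print(get_request)
print(post_request)
```

The hostnames, the `/signup` path, and the form data are placeholders; the point is only that a request is structured text with a verb and a hostname near the top.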
12.1.1.2. HTTP Responses
When a server receives a request, it processes the request and sends back a response. Responses are divided up into two parts - the “header” and the “content”. The header includes information about the response, and the content is the actual information that was requested. For example, if you type https://python.org into your web browser, the response that comes back will include a header, and the content will be the HTML for the website that you requested.
The most important piece of information in the response header is the HTTP Status Code, a 3-digit number that summarizes the outcome of the request. You should know several of these status codes:
- 200 means that the request was handled successfully. A response with code 200 should include the requested content
- 404 means that the requested content could not be found
- 401 or 403 means that you are not allowed to see the requested content. You will often see one of these when you are not logged into a website
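As a small sketch of how the header and content fit together, the code below splits a hand-written response into its parts and reads the status code. The function name `parse_response` and the sample response are made up for illustration; a real server sends many more header lines, and Python's standard library already handles this parsing for you.

```python
from http import HTTPStatus

# A hand-written response for illustration only.
raw_response = (
    "HTTP/1.1 404 Not Found\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<h1>Page not found</h1>"
)


def parse_response(raw: str) -> tuple[int, dict[str, str], str]:
    """Split a raw HTTP response into (status code, header fields, content)."""
    # The first blank line separates the header from the content.
    header_text, _, content = raw.partition("\r\n\r\n")
    status_line, *field_lines = header_text.split("\r\n")
    # The status line looks like "HTTP/1.1 404 Not Found".
    status_code = int(status_line.split(" ")[1])
    fields = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return status_code, fields, content


code, fields, content = parse_response(raw_response)
print(code, HTTPStatus(code).phrase)
```

`HTTPStatus` comes from Python's standard library and maps each numeric code to its standard name.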
12.1.1.3. Encryption
By default, all messages on the internet are sent in plain text, which means that anybody who intercepts a message can read it. We often want to keep our web traffic private (especially when we are transferring usernames, passwords, and payment information). When encryption is desired, the Application layer is responsible for adding it.
For requests and responses on the World Wide Web, we use HTTPS, a variant of HTTP that includes encryption. HTTPS works just like HTTP, except that it includes an extra step where each message is encrypted before it is transferred and then decrypted by the recipient. The exact process of how encryption works is outside the scope of this text, but we should use HTTPS instead of HTTP whenever possible.
You can tell that you are using HTTPS by inspecting the URL, which will begin with https:// instead of http://. Most modern browsers also encourage the use of HTTPS by showing a lock icon in the URL bar, and by showing a warning when users connect to a website using HTTP instead of HTTPS.
12.2. Transport Layer
As we’ll see in the next section, the internet is not a perfect messaging system. The transport layer sits on top of the internet layer and helps ensure that messages move successfully through the internet.
The transport layer receives a message from a protocol in the application layer and tags it with an identifier (called a port) that records which application the message belongs to. It then breaks the message up into several small parts, which are sent through the internet individually. The specific details of how this is achieved depend on the transport layer protocol. There are two protocols you should know: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP).
12.2.1. TCP
TCP is the most common transport protocol on the internet, and its goal is to ensure that the entire message is transmitted correctly. TCP takes several steps to make sure that the message is transmitted successfully:
First, it calls each part of the message a “segment” and assigns a sequential ID to each segment. When the message is received, TCP collects all of the segment IDs so that it can tell whether any segments arrived out of order or did not arrive at all. Segments that never arrive are sent again.
TCP also has built-in error checking - it inspects each segment on receipt to make sure that it didn’t get changed during the transmission
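The two ideas above (sequential IDs and error checking) can be sketched in a few lines. This is a toy model, not real TCP: the function names are hypothetical, `zlib.crc32` stands in for TCP's own checksum, and real TCP numbers individual bytes and re-sends missing data rather than just reporting it.

```python
import random
import zlib


def make_segments(message: bytes, size: int) -> list[tuple[int, int, bytes]]:
    """Split a message into (sequence ID, checksum, data) segments."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [(seq, zlib.crc32(chunk), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(received, expected_count: int):
    """Return (message, missing IDs); message is None if anything is missing."""
    good = {}
    for seq, checksum, data in received:
        # Discard any segment whose data no longer matches its checksum.
        if zlib.crc32(data) == checksum:
            good[seq] = data
    missing = [seq for seq in range(expected_count) if seq not in good]
    if missing:
        return None, missing
    # Joining in sequence-ID order restores the original message.
    return b"".join(good[seq] for seq in range(expected_count)), missing


segments = make_segments(b"GET / HTTP/1.1", size=4)
shuffled = random.sample(segments, k=len(segments))  # simulate out-of-order arrival
message, missing = reassemble(shuffled, len(segments))
print(message, missing)

# Simulate one lost segment (ID 1) and one corrupted segment (ID 0).
damaged = [s for s in segments if s[0] != 1]
damaged[0] = (damaged[0][0], damaged[0][1], b"XXXX")
print(reassemble(damaged, len(segments)))
```

In the second call the receiver can name exactly which segments it still needs, which is the information a sender would use to decide what to send again.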
12.2.2. UDP
In some situations, we don’t mind if the application message is only partially transmitted. For example, when we are streaming live video, we care that the video is transferred quickly. If any part of the live video fails to transfer, we would rather move on and receive the most recent part of the video instead of going back and re-sending the part that was lost. For this purpose, we use UDP, a much simpler protocol.
Like TCP, UDP divides the message into short pieces, which it calls “datagrams.” UDP is often described as a “fire and forget” protocol - meaning that the sender sends each datagram through the network, with no additional checks to handle lost or corrupted datagrams.
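Here is a minimal sketch of “fire and forget” using Python's standard `socket` module. It assumes both ends run on the same machine (the loopback address 127.0.0.1), and the message text is a placeholder. Notice that the sender never learns whether the datagram arrived.

```python
import socket

# Receiver: bind to the loopback address; port 0 asks the OS for any free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # give up instead of waiting forever for a lost datagram
address = receiver.getsockname()

# Sender: no connection setup and no acknowledgement, just send and move on.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"frame 1 of the video", address)

try:
    data, source = receiver.recvfrom(1024)
except socket.timeout:
    data = None  # UDP itself would never tell the sender about this
finally:
    sender.close()
    receiver.close()

print(data)
```

On a single machine the datagram will almost always arrive; across a real network, the timeout branch is where an application would decide to skip ahead rather than ask for the data again.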
12.3. Internet Layer
The internet layer is responsible for getting the segments (or datagrams) from one computer to another, using a protocol that is cleverly named the Internet Protocol. In the Internet Protocol, each of these segments is wrapped in a unit called a packet.
In the Internet Protocol, each computer is assigned an address called an Internet Protocol Address (or IP Address).
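For a quick look at what an IP address is underneath, Python's standard `ipaddress` module can parse one. The addresses below are placeholders taken from ranges reserved for documentation, not real servers.

```python
import ipaddress

# 192.0.2.10 is from a range reserved for documentation examples.
address = ipaddress.ip_address("192.0.2.10")
print(address.version)  # the IP version this address belongs to
print(int(address))     # the same address written as a single number
print(address.packed)   # the raw bytes that represent the address

# The newer IPv6 format uses longer addresses written in hexadecimal.
v6 = ipaddress.ip_address("2001:db8::1")
print(v6.version, len(v6.packed))
```

The dotted form is only a human-friendly way of writing a fixed-size number: 4 bytes for IPv4 and 16 bytes for IPv6.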
TODO - add DNS
The internet is made up of a vast number of computers, each of which has direct connections to a small number of neighbors. Together, these computers form a distributed network that the message must traverse. Redundancy is an important quality of this network, so a message that goes from a sender to a receiver could take many different paths through the network, which we call routes. Each computer along the way acts as a “router”, which receives the message and chooses which neighbor to send the message to next.
This set of locally optimal decisions usually works well, but causes a few specific issues:
No delivery guarantees: Some packets might take a very long time to move through the network, and packets don’t always make it to their destination. When a packet does not make it, we say that the packet was “dropped”. Dropped packets occur for many reasons, including hardware errors and poor routing decisions (one interesting issue is a routing loop, where two or more computers keep routing the same packets to each other in a circle).
No order guarantee: Each packet moves through the network on its own. Some packets will move through the network via a more efficient route than others, so the order that the packets enter the network is not necessarily the order that they will arrive.
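The sketch below simulates hop-by-hop forwarding with made-up routers named A through D. The tables and the `forward` function are hypothetical and far simpler than real routing, but they show how a series of local decisions can either deliver a packet or trap it in a loop until it is dropped.

```python
# Hypothetical next-hop tables: each router only knows which neighbor to try next.
NEXT_HOP = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "D"},
}
# A misconfigured network in which B and C send packets for D to each other.
LOOPING = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "B"},
}


def forward(tables, source, destination, max_hops=8):
    """Forward a packet hop by hop; return (delivered?, path taken)."""
    path = [source]
    current = source
    for _ in range(max_hops):
        if current == destination:
            return True, path
        current = tables[current][destination]  # a purely local decision
        path.append(current)
    # Out of hops: the packet is dropped instead of circulating forever.
    return current == destination, path


print(forward(NEXT_HOP, "A", "D"))
print(forward(LOOPING, "A", "D"))
```

The hop limit plays the same role as the “time to live” value carried by real packets: it guarantees that a packet caught in a loop eventually disappears.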
12.4. Network Layer (sometimes called the Link Layer)
The internet layer is responsible for the entire route from the message’s sender to the destination. Over the course of that process, the message must make many hops from one router to another. The network layer is responsible for the physical process that moves the message across some physical medium from one computer to another. The protocols in the network layer vary based on the physical medium.
You’re likely familiar with many of these protocols, including:
- Ethernet, which sends messages over a physical wire
- Wi-Fi, which sends messages over a wireless connection using radio waves
- Bluetooth, which sends messages using radio waves similarly to Wi-Fi, but uses lower power and is designed for short-range connections
12.5. The lifecycle of a request
Each message passes through every layer of the TCP/IP model, so here’s an explanation of how this works when we use our web browser to access a webpage:
1. First, you open your browser and type https://www.python.org/ in the address bar. Your browser generates a message containing an HTTP GET request.
2. The message gets passed to TCP in the transport layer, which divides the message into segments and adds the extra TCP header information (including segment ID) to each segment.
3. Each segment gets passed to the Internet Protocol, which adds its own header information. The TCP segment with an IP header is now called a packet.
4. With each hop through the network, the IP packet gets passed through a physical medium, which requires adding a medium-specific header. The IP packet with a networking header is called a frame. After each hop, the frame header is removed to reveal the IP packet. The destination IP address is checked to figure out whether the packet is at its destination. If not, it’s passed on through the network.
5. Once a packet reaches its destination, the IP header is removed to reveal the TCP segment.
6. TCP combines all of the segments that it receives and re-assembles them into the original message.
7. The message gets passed to the HTTP server, which looks up the requested information and produces a new HTTP response message with code 200 that contains the requested HTML page.
8. The HTTP response message goes back through this entire process to reach the client.
9. After the message is received and re-assembled, your browser interprets the HTTP response and displays the HTML page to you. When you click on a link, this whole process starts over.
Notice that the order in which the message traverses the layers on the server is the reverse of the order on the client. You might write the process like this:
Application -> Transport -> Internet -> Network -> Internet -> Transport -> Application
When a message is being sent (steps 1-4 above), each new layer of the TCP/IP model adds its own header information onto the information it received. We call this process “encapsulation.” On the receiving computer, each layer removes its header in a process called “decapsulation.”
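Encapsulation can be pictured as wrapping a message in one envelope per layer. The sketch below uses simplified text headers; the header contents, the placeholder address, and the function names are all invented for illustration, since real headers are binary and carry far more information.

```python
def encapsulate(message: str) -> str:
    """Wrap a message in one (simplified, text-only) header per layer."""
    segment = f"[TCP seq=0]{message}"         # transport layer
    packet = f"[IP dst=192.0.2.10]{segment}"  # internet layer
    frame = f"[ETH]{packet}"                  # network (link) layer
    return frame


def decapsulate(frame: str) -> str:
    """Strip the headers in the opposite order on the receiving computer."""
    data = frame
    for layer in ("ETH", "IP", "TCP"):
        header, _, data = data.partition("]")
        assert header.startswith(f"[{layer}"), f"expected a {layer} header"
    return data


frame = encapsulate("GET / HTTP/1.1")
print(frame)
print(decapsulate(frame))
```

The outermost header is the one added last, so it is also the first one removed, which is why the layers appear in reverse order on the receiving side.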
13. Command Line Network Tools
13.1. Ping
ping is a simple command-line tool that sends one packet at a time to a destination and reports how long each packet takes to reach the destination and return.
Try running this command at your terminal:
> ping python.org
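13.2. Traceroute
Traceroute helps you understand the route that a packet takes from its source to its destination. It achieves this by sending a series of packets, each with a header field called “time to live” that limits how many hops the packet may travel. When a packet reaches its limit, the router holding it discards the packet and sends an error message back to the sender, which reveals that router’s address. By increasing the limit one hop at a time, traceroute discovers each router along the route.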
Try running this command at your terminal:
> traceroute python.org
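The idea behind traceroute can be sketched without touching the network. The route below is made up, and `send_probe` is a hypothetical stand-in for sending a real packet and waiting for a reply.

```python
# A made-up route; real router names and hop counts will differ.
ROUTE = ["home-router", "isp-gateway", "backbone-1", "python.org"]


def send_probe(ttl: int) -> str:
    """Return the name of the machine that answers a probe with this TTL."""
    for hop in ROUTE:
        ttl -= 1  # every machine on the route decrements the TTL
        if ttl == 0 or hop == ROUTE[-1]:
            return hop  # TTL expired here, or the probe reached the destination


def trace() -> list[str]:
    """Probe with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    ttl = 1
    while not hops or hops[-1] != ROUTE[-1]:
        hops.append(send_probe(ttl))
        ttl += 1
    return hops


print(trace())
```

Each probe travels one hop farther than the last, so the list of machines that answer is the route itself.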
13.3. Curl
curl is a command-line tool that sends an HTTP request and prints the response, a bit like a simple web browser that runs in the terminal.
Try running this command at your terminal:
> curl https://www.python.org/
14. Python for network communication
TODO: reproduce code from https://wiki.python.org/moin/UdpCommunication, which is under the GPL license