“The Internet is a tidal wave. It changes the rules. It is an incredible
opportunity as well as incredible challenge”

Bill Gates, in an internal Microsoft memo, 1995

“[I]t behooves wired people to know a few things about wires - how they
work, where they lie, who owns them, and what sorts of business deals
and political machinations bring them into being.”

Neal Stephenson, 1996

10. Networking¶

10.1. Introduction¶

The popular notion of the internet is of a “cloud”, a vast, complex, liminal space with little grounding in the physical world. As developers of web applications (and as informed digital citizens), we need to move beyond this model to a deeper understanding. A better way to think of the internet is as the sum total of many pairwise conversations, each one a single computer talking to another computer. These conversations are governed by a simple set of protocols that ensure that each pair of computers can communicate in a way that they both understand. In this chapter, we’ll learn a bit about the history of the internet, and some of the protocols that make communication over the internet possible.

10.2. History¶

In order to understand how the internet works today, it’s helpful to consider how people communicated before the internet. We’ll find that almost all of the modern internet is based on principles that are far older.

Some of the earliest long-distance communication came through human messengers. One famous story from this time is the origin of the marathon: according to legend, after the Battle of Marathon a messenger ran from Marathon (a town in Greece) to Athens to deliver news of the Greek victory over the Persians. He arrived, delivered his news, and then collapsed and died. These runners were often captured, so code systems were invented to make sure that opponents who captured messages couldn’t understand them. One famous example is the Caesar cipher. Modern messages are still sometimes physically transported through sneakernets. Another famous example of physical transportation is the Pony Express.

Another common method in ancient times was to distribute messages by attaching them to pigeons. The results of the first Olympics are even said to have been announced using pigeons. As an April Fools’ joke, an RFC (the document series used by the Internet Engineering Task Force) once proposed carrying internet traffic by pigeon, and the idea was later successfully demonstrated. A single pigeon is limited in how far it can fly, so sophisticated networks of “pigeon lofts” were established: a pigeon would fly as far as it could, and then an operator would move the message to a new pigeon and relaunch it, similar to how routers work in the modern internet.

The telegraph developed into a sophisticated communications network in the 1800s. Telegrams were messages that were encoded into Morse code and transmitted long distances over electrical wires. There were many technical advances from the telegraph era that still exist in the modern internet:

A centrally-maintained list of telegraph addresses improved delivery rates, similar to how IP addresses are used on the internet today.
The first undersea telegraph cable was laid across the English Channel in 1850, and undersea cables still carry most of the world’s internet traffic.
Telegrams were notoriously unreliable, and there was no way for a sender to ensure that their message was received correctly. To combat this issue, the receiving operator would often “repeat” the message back to the sender. If the sender received their own message back, they could be confident that it was received correctly. The modern Transmission Control Protocol uses similar strategies to handle reliability issues on the internet.

10.3. Internet Protocols, in four parts¶

For this text, we’ll use the TCP/IP model of the internet, which encompasses enough detail for our purposes. Another widely taught reference model, the OSI model, divides the same work into seven layers; it is slightly more complex but uses the same underlying principles.

The TCP/IP model is divided into four “layers”, each of which is responsible for a different step in the process of transferring messages. Each layer defines a set of communication protocols, and each message must pass through each of these layers.

10.3.1. Application Layer¶

The internet is capable of transmitting any kind of message you might want to send. There is a wide variety of “applications” that transmit messages, including email, SMS (text messages), and the World Wide Web.

Each of these applications has its own protocol that encodes the user’s message into binary so that it can be sent over the internet.
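
As a small illustration of what “encoding into binary” means, the sketch below turns a short piece of text into bytes and then into binary digits. It assumes the UTF-8 text encoding, which is common on the web; real application protocols define much more than a character encoding.

```python
message = "Hi"

# Text -> bytes, using the UTF-8 encoding (an assumption for this sketch)
encoded = message.encode("utf-8")

# Bytes -> binary digits, eight bits per byte
bits = " ".join(f"{byte:08b}" for byte in encoded)

print(list(encoded))  # [72, 105]
print(bits)           # 01001000 01101001
```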

There are a great many protocols in the application layer. For this book, we’ll focus on Hypertext Transfer Protocol (HTTP), the protocol that is used by the World Wide Web.

10.3.1.1. HTTP¶

HTTP is the protocol that allows us to access websites. You might recognize HTTP from the front of URLs, for example https://en.wikipedia.org/ (we’ll talk about what the s in https means soon). Every time that you open a web page in a browser, you’re using HTTP.

An HTTP message requires two participants:

The client is the computer that wants to access a web page. The client sends a special message called a request to ask for the page.
The server is the computer that has the web page. After the server receives the request, it sends back a response, which is usually just the requested page.

We’ll often refer to HTTP as a client-server model, and to the process of accessing a page over HTTP as the request-response cycle.

10.3.1.2. HTTP Requests¶

When you type a web address into your browser, it generates an HTTP request, a special kind of message that is transferred over the internet. The request includes a lot of information, specifically:

The “hostname” of the server that it is sending the request to
A verb (sometimes called a “request method”) that describes what the client is asking the server to do. There are a lot of HTTP verbs, and you should know two of them (note: HTTP verbs are usually written in all capital letters):

GET - asks a server to send a file (for example, when you first type a web address into your browser, it creates a GET request asking for the web page)
POST - sends information to a server (for example, when you fill out a form on a website, your browser sends the information from the form in a POST request)
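
To make the pieces of a request concrete, here is a minimal sketch of the text that a client might send for a GET request. The function name build_get_request is hypothetical, and real browsers send many more headers than the single Host header shown here.

```python
def build_get_request(hostname, path="/"):
    """Return the raw text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # the verb, the path, and the protocol version
        f"Host: {hostname}",     # the hostname of the server
        "",                      # a blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return followed by a newline
    return "\r\n".join(lines)


request_text = build_get_request("www.python.org")
print(request_text)
```

A POST request has the same shape, except that the verb is POST and the information being sent (for example, form fields) follows the blank line.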

10.3.1.3. HTTP Responses¶

When a server receives a request, it processes the request and sends back a response. Responses are divided up into two parts - the “header” and the “content”. The header includes information about the response (such as whether the request succeeded), and the content is the actual information that was requested. For example, if you type https://python.org into your web browser, the response that comes back will include a header, and the content will be the HTML for the website that you requested.

The most important piece of information in the response header is the HTTP Status Code, a 3-digit number that summarizes the outcome of the request. You should know several of these status codes:

200 means that the request was handled successfully. A response with code 200 should include the requested content.
404 means that the requested content could not be found.
401 or 403 means that you are not allowed to see the requested content. 401 means that you need to log in (authenticate) first, and 403 means that the server refuses to give you access. You will often see these codes when you are not logged into a website.
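
The sketch below splits a raw response into its header and content and looks up the meaning of the status code using HTTPStatus from Python’s standard library. The response text is a shortened, hypothetical example, and parse_response is a name invented for this illustration; real responses carry many more header fields.

```python
from http import HTTPStatus

# A hypothetical raw response, shortened for illustration
raw_response = (
    "HTTP/1.1 404 Not Found\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Page not found</body></html>"
)


def parse_response(raw):
    """Split a raw HTTP response into status code, header fields, and content."""
    # A blank line separates the header from the content
    header_text, _, content = raw.partition("\r\n\r\n")
    status_line, *header_lines = header_text.split("\r\n")
    # The status line looks like: HTTP/1.1 404 Not Found
    status_code = int(status_line.split(" ")[1])
    headers = dict(line.split(": ", 1) for line in header_lines)
    return status_code, headers, content


status_code, headers, content = parse_response(raw_response)
print(status_code, HTTPStatus(status_code).phrase)  # 404 Not Found
print(headers["Content-Type"])                      # text/html
```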

10.3.1.4. Encryption¶

By default, all messages on the internet are sent in plain text, which means that anybody who intercepts the message can read it. We often want to keep our web traffic private (especially when we are transferring usernames, passwords, and payment information). When encryption is desired, the Application layer is responsible for adding it.

For requests and responses on the World Wide Web, we use HTTPS, a variant of HTTP that includes encryption. HTTPS works just like HTTP, except that it includes an extra step where each request and response is encrypted before it is transferred, and then decrypted by the computer that receives it. The exact process of how encryption works is outside the scope of this text, but we should use HTTPS instead of HTTP whenever possible.

You can tell that you are using HTTPS by inspecting the URL, which will begin with https:// instead of http://. Most modern browsers also encourage users to use HTTPS by showing a lock icon in the URL bar, and most browsers will show a warning when users connect to a website using HTTP instead of HTTPS.
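
A program can make the same check by looking at the first part of the URL, which is called the scheme. This is a minimal sketch using urlsplit from Python’s standard library; the helper name uses_https is hypothetical.

```python
from urllib.parse import urlsplit


def uses_https(url):
    """Return True if the URL's scheme is https."""
    return urlsplit(url).scheme == "https"


print(uses_https("https://en.wikipedia.org/"))  # True
print(uses_https("http://example.com/"))        # False
```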

10.3.2. Transport Layer¶

As we’ll see in the next section, the internet is not a perfect messaging system. The transport layer sits on top of the internet layer and ensures that messages move successfully through the internet.

The transport layer receives a message from a protocol in the application layer, tags it with an identifier (called a port) that helps it remember which application the message came from, and breaks the message up into several small parts, which are sent through the internet individually. The specific details of how this is achieved depend on the transport layer protocol. There are two protocols you should know: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP).

10.3.2.1. TCP¶

TCP is the most common transport protocol on the internet, and its goal is to ensure that the entire message is transmitted without errors. TCP takes several steps to make sure that the message is transmitted successfully:

First, it calls each part of the message a “segment”. It assigns a sequential ID to each segment. When the message is received, the receiver collects all of the segment IDs so that it can tell whether any segments arrived out of order or did not arrive at all. Segments that did not arrive are sent again.
TCP also has built-in error checking - it inspects each segment on receipt to make sure that it didn’t get changed during the transmission.
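
The two ideas above can be sketched in a few lines of Python. This is a simplified simulation, not real TCP: real TCP numbers individual bytes instead of whole segments and uses a different checksum. The names make_segments and reassemble are invented for this sketch, and zlib.crc32 stands in for TCP’s error check.

```python
import random
import zlib

SEGMENT_SIZE = 8  # bytes per segment; chosen small for illustration


def make_segments(message):
    """Split a message into (segment_id, checksum, payload) tuples."""
    segments = []
    for segment_id, start in enumerate(range(0, len(message), SEGMENT_SIZE)):
        payload = message[start:start + SEGMENT_SIZE]
        segments.append((segment_id, zlib.crc32(payload), payload))
    return segments


def reassemble(segments, expected_count):
    """Rebuild the message, or report which segment IDs must be sent again."""
    good = {}
    for segment_id, checksum, payload in segments:
        if zlib.crc32(payload) == checksum:  # ignore corrupted segments
            good[segment_id] = payload
    missing = [i for i in range(expected_count) if i not in good]
    if missing:
        return None, missing
    return b"".join(good[i] for i in range(expected_count)), []


message = b"Hello from the transport layer!"
segments = make_segments(message)

# Simulate the network delivering the segments out of order
shuffled = segments[:]
random.shuffle(shuffled)
rebuilt, missing = reassemble(shuffled, len(segments))
print(rebuilt == message, missing)  # True []

# Simulate one segment being lost along the way
rebuilt, missing = reassemble(segments[:1] + segments[2:], len(segments))
print(rebuilt, missing)  # None [1]
```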

10.3.2.2. UDP¶

In some situations, we don’t mind if the application message is only partially transmitted. For example, when we are streaming live video, we care that the video is transferred quickly. If any part of the live video fails to transfer, we would rather move on and receive the most recent part of the video instead of going back and re-sending the part that was lost. For this purpose, we use UDP, a much simpler protocol.

Like TCP, UDP divides the message into short pieces, which it calls “datagrams.” UDP is often described as a “fire and forget” protocol - meaning that the sender sends each datagram through the network, with no additional checks to handle lost or corrupted datagrams.
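
To see what “fire and forget” means for the receiver, here is a small simulation of an unreliable network. Nothing here uses real UDP: the function send_datagrams and the 30% loss rate are assumptions made up for this sketch.

```python
import random


def send_datagrams(datagrams, loss_rate, rng):
    """Simulate an unreliable network: each datagram is either delivered or lost."""
    return [d for d in datagrams if rng.random() >= loss_rate]


frames = [f"video frame {i}" for i in range(10)]
rng = random.Random(42)  # fixed seed so that the simulation is repeatable
received = send_datagrams(frames, loss_rate=0.3, rng=rng)

# The receiver simply plays whatever arrived; nothing is sent again
print(f"received {len(received)} of {len(frames)} frames")
```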

10.3.3. Internet Layer¶

The internet layer is responsible for getting the segments (or datagrams) from one computer to another, using a protocol that is cleverly named the Internet Protocol. In the Internet protocol, each of these segments is named a packet.

10.3.3.1. IP addressing¶

In the internet protocol, each computer is assigned an address called an Internet Protocol Address (or IP Address).
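
Python’s standard library includes an ipaddress module for working with these addresses. The address below is only an example; the sketch shows that an IPv4 address is really a single 32-bit number written as four smaller numbers.

```python
import ipaddress

address = ipaddress.ip_address("151.101.192.223")

print(address.version)      # 4, meaning this is an IPv4 address
print(int(address))         # the same address written as one integer
print(address.is_loopback)  # False: this address is not your own computer
```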

IP addresses are mapped to human-readable domain names by the Domain Name System (DNS). When you type a URL into your web browser, it first makes a request to the DNS to find the IP address that corresponds to that domain name.

We can do DNS lookups in Python using a function called getaddrinfo() in the socket module (which we’ll explore later in this chapter). This function returns a lot of information, so here is some sample code that parses the output to find the IP address for python.org. Notice that this domain actually has several IP addresses. This is likely for backup and performance reasons.

>>> import socket
>>> name = "python.org"
>>> addrs = { str(i[4][0]) for i in socket.getaddrinfo(name, 80) }
>>> print(addrs)
{'151.101.192.223', '151.101.64.223', '151.101.128.223', '151.101.0.223'}

10.3.3.2. Internet Routing¶

The internet is made up of a vast number of computers, each of which has direct connections to a small number of neighbors. Together, these computers form a distributed network that the message must traverse. Redundancy is an important quality of this network, so a single message that goes from a sender to a receiver might take many different paths through the network, which we call routes. Each of these computers acts as a ‘router’, which receives the message and chooses which neighbor to send the message to next.

This set of locally optimal decisions usually works well, but causes a few specific issues:

No delivery guarantees: Some packets might take a very long time to move through the network, and packets don’t always make it to their destination. When a packet does not make it, we say that the packet was “dropped”. Dropped packets occur for many reasons, including hardware errors and poor routing decisions (one interesting issue is called a “routing loop”, where two or more computers route packets to each other in a loop).
No order guarantee: Each packet moves through the network on its own. Some packets will move through the network via a more efficient route than others, so the order that the packets enter the network is not necessarily the order that they will arrive.
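
The value of redundancy can be shown with a toy network. This sketch is not how real routers work: a real router only chooses the next hop and never sees the whole network, while the hypothetical find_route function below looks at the entire (made-up) network at once.

```python
from collections import deque

# A hypothetical network: each computer lists its direct neighbors
NETWORK = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def find_route(network, source, destination):
    """Return one route with the fewest hops, using breadth-first search."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        route = queue.popleft()
        current = route[-1]
        if current == destination:
            return route
        for neighbor in network[current]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(route + [neighbor])
    return None  # no route exists


print(find_route(NETWORK, "A", "E"))  # ['A', 'B', 'D', 'E']

# If router B fails, the redundant path through C still works
without_b = {
    name: [n for n in neighbors if n != "B"]
    for name, neighbors in NETWORK.items()
    if name != "B"
}
print(find_route(without_b, "A", "E"))  # ['A', 'C', 'D', 'E']
```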

10.3.4. Network Layer (sometimes called the “Link Layer”)¶

The internet layer is responsible for the entire route from the message’s sender to the destination. Over the course of that process, the message must make many hops from one router to another. The network layer is responsible for the physical process that moves the message across some physical medium from one computer to another. The protocols in the network layer vary based on the physical medium.

You’re likely familiar with many of these protocols, including:

Ethernet: which sends messages over a physical wire
Wi-Fi: which sends messages over a wireless connection using radio waves
Bluetooth: which sends messages using radio waves similarly to Wi-Fi, but uses lower power and is designed for short-range connections between nearby devices

10.3.5. The lifecycle of a request¶

Each message passes through every layer of the TCP/IP model, so here’s an explanation of how this works when we use our web browser to access a webpage:

1. First, you open your browser and type https://www.python.org/ in the address bar. Your browser generates a message containing an HTTP GET request.
2. The message gets passed to TCP in the transport layer, which divides the message into segments and adds the extra TCP header information (including the segment ID) to each segment.
3. Each segment gets passed to the Internet Protocol, which adds its own header information. The TCP segment with an IP header is now called a packet.
4. With each hop through the network, the IP packet gets passed through a physical medium, which requires adding a medium-specific header. The IP packet with a networking header is called a frame. After each hop, the frame header is removed to reveal the IP packet. The destination IP address is checked to figure out whether the packet is at its destination. If not, it’s passed on through the network.
5. Once a packet reaches its destination, the IP header is removed to reveal the TCP segment.
6. TCP combines all of the segments that it receives and re-assembles them into the original message.
7. The message gets passed to the HTTP server. It looks up the requested information, and produces a new HTTP response message with code 200, containing the requested HTML page.
8. The HTTP response message goes back through this entire process to reach the client.
9. After the message is received and re-assembled, your browser interprets the HTTP response and displays the HTML page to you. When you click on a link, this whole process starts over.

Notice that the order in which the message traverses the layers on the receiving computer is the reverse of the order on the sending computer. You might write the process like this:

Application -> Transport -> Internet -> Network -> Internet -> Transport -> Application

When a message is being sent (steps 1-4 above), each new layer of the TCP/IP model adds its own new header information onto the information it received. We call this process “encapsulation.” On the receiving computer, each layer removes its header in a process called “deencapsulation.”
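
Here is a minimal sketch of that wrapping and unwrapping. The header strings ("TCP|", "IP|", "ETH|") are placeholders invented for this example; real headers are binary and carry fields such as ports, addresses, and checksums.

```python
def encapsulate(message):
    """Wrap a message in one (simplified) header per layer."""
    segment = "TCP|" + message  # the transport layer adds its header
    packet = "IP|" + segment    # the internet layer adds its header
    frame = "ETH|" + packet     # the network layer adds its header
    return frame


def deencapsulate(frame):
    """Remove the headers in the opposite order on the receiving computer."""
    data = frame
    for header in ("ETH|", "IP|", "TCP|"):
        if not data.startswith(header):
            raise ValueError(f"expected a {header} header")
        data = data[len(header):]
    return data


frame = encapsulate("GET / HTTP/1.1")
print(frame)                 # ETH|IP|TCP|GET / HTTP/1.1
print(deencapsulate(frame))  # GET / HTTP/1.1
```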

10.4. Command Line Network Tools¶

10.4.1. Ping¶

ping is a simple command-line tool that sends one packet at a time to a destination and reports how long it takes for a reply to come back (the “round-trip time”).

Try running this command at your terminal:

> ping python.org

10.4.2. Traceroute¶

Traceroute helps us understand the route that a packet takes from its source to its destination. It achieves this by sending a series of packets, each with a header field called “time to live” (TTL), which determines how many hops the packet may travel before a router discards it and sends an error message back to the sender.

Try running this command at your terminal:

> traceroute python.org
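
The “time to live” trick can be simulated without touching the network. In this sketch the route is a made-up list of machine names, and the functions send_with_ttl and traceroute are invented for illustration; the real tool sends actual packets and listens for the error messages that routers send back.

```python
# A hypothetical route: the machines between us and the destination
ROUTE = ["home-router", "isp-gateway", "backbone-1", "python.org"]


def send_with_ttl(route, ttl):
    """Return the name of the machine where a packet with this TTL stops."""
    for hop in route:
        ttl -= 1  # every machine that handles the packet lowers the TTL by one
        if ttl == 0 or hop == route[-1]:
            return hop


def traceroute(route):
    """Discover the route one hop at a time by sending TTL 1, 2, 3, ..."""
    hops = []
    ttl = 1
    while not hops or hops[-1] != route[-1]:
        hops.append(send_with_ttl(route, ttl))
        ttl += 1
    return hops


for number, hop in enumerate(traceroute(ROUTE), start=1):
    print(number, hop)
```

Each packet travels one hop farther than the previous one, so the list of machines that reported back is the route itself.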

10.4.3. Curl¶

curl is a command-line tool that sends HTTP requests and prints the responses, like a very simple web browser that runs in the terminal.

Try running this command at your terminal:

> curl https://www.python.org/

10.5. Python for network communication¶

Sockets are a programming interface that sits between the application layer and the transport layer, giving us low-level control over how data is transmitted through the network. Sockets are very useful in programming web applications, and have a nice Python interface.

Here is an example (adapted from the python.org wiki) of how to use sockets in Python to send and receive messages over UDP:

# socket_server.py
import socket

UDP_IP = "127.0.0.1"
UDP_PORT = 5005

# In this example, socket.SOCK_DGRAM is a constant that
# designates that the socket should send the message over UDP
# recall that "Datagram" is the name of a message in UDP
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((UDP_IP, UDP_PORT))

while True:
    data, addr = sock.recvfrom(1024)  # buffer size is 1024 bytes
    print("received message: %s" % data)

# socket_client.py
import socket

UDP_IP = "127.0.0.1"
UDP_PORT = 5005
MESSAGE = b"Hello, World!"

print("UDP target IP: %s" % UDP_IP)
print("UDP target port: %s" % UDP_PORT)
print("message: %s" % MESSAGE)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(MESSAGE, (UDP_IP, UDP_PORT))

To execute this example, copy the above two code blocks into their own files: socket_server.py and socket_client.py - the client will send a message, and the server will receive the message.

The server will need to run first. It has a while True loop, so it will continue running and listening for messages until you end the program.

When you execute the client code, it will generate a new message and then use UDP to send it to the server. You should notice that the server program quickly prints out the message that it received.

In this example, the client and the server are both running on your own computer. To achieve this, the message is sent to the special IP address 127.0.0.1 (often called the “loopback” address), which designates messages that go to programs on your own computer.
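
If you would like to try the same idea without opening two terminals, the sketch below creates both sockets in a single program. It is a variation on the example above, not part of the original wiki code: it binds to port 0, which asks the operating system to choose any free port, and sets a timeout so the program cannot wait forever.

```python
import socket

# Port 0 asks the operating system to pick any free port
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # give up after 2 seconds instead of waiting forever
host, port = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"Hello, World!", (host, port))

data, addr = receiver.recvfrom(1024)  # buffer size is 1024 bytes
print("received message: %s" % data.decode("utf-8"))

sender.close()
receiver.close()
```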

10.6. Navigating the Internet with Python¶

There is a very useful Python library called requests that we can use to generate HTTP requests and interpret their responses.

requests is not packaged by default with Python, so we’ll need to run this command to install it:

python3 -m pip install requests

After installing, here is a simple example of using this package in the Python REPL. (note: the entire HTML page is printed out to the REPL. The output is truncated here for brevity).

>>> import requests
>>> r = requests.get("https://example.com")
>>> print(r.text)
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

We will explore requests more in this chapter’s exercises.

10.6.1. Responses - an introduction to structured data¶

In the above example, you probably noticed that the information returned was a very long string full of HTML. HTML is meant to be consumed by browsers and displayed to humans, so there are better options we can use when our goal is to request information that we’ll use in our Python programs.

TODO: Write about APIs, CSV, JSON

10.7. Glossary¶

Datagram¶: The unit of information processed by UDP
Deencapsulation¶: The process of removing headers as messages are passed through layers in the TCP/IP model
Encapsulation¶: The process of adding additional headers as messages are passed through layers in the TCP/IP model
Frame¶: The unit of information processed by a protocol in the network layer
HTTP¶: The protocol that is used to request and transport HTML over the internet
Packet¶: The unit of information processed by the Internet Protocol
Port¶: An identifier added by protocols in the Transport Layer that associates a message with the application that generated it.
Protocol¶: An agreed-upon standard for the format and content of a message sent between two computers
Request¶: An HTTP message sent by a client to a server
Response¶: An HTTP message sent by a server back to a client after it processes a request
Segment¶: The unit of information processed by TCP
Socket¶: A programming interface that applications use to send and receive messages using TCP or UDP.

10.8. Exercises¶

Chapter 10 Exercise Set 0: Chapter Review