HTTP protocol and servers for beginner web application developers



I wrote this post to help beginners understand the mechanisms underlying HTTP-based applications and sites.

Client-server architecture

The first thing to understand is the basics of client-server architecture. Strictly speaking you don't need it to develop an HTTP application, because this layer is already handled by the operating system and the routers, but it is still a good idea to know the basics.

When a client wants to access a resource located on a server, the application typically creates a socket, specifying the address of the server and the port that identifies the service on that remote computer.

A socket is much like a file: it is a number allocated to the application by the operating system, associated with the stack of protocols that the operating system and the routers use to carry packets from the client to the server and back.
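To make the "socket is a number" idea concrete, here is a minimal Python sketch. The function name open_tcp_socket is made up for this post; it only creates the socket and does not connect anywhere.

```python
import socket

def open_tcp_socket():
    """Create an unconnected TCP/IPv4 socket and return it."""
    # AF_INET selects IPv4 addressing, SOCK_STREAM selects TCP.
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

if __name__ == "__main__":
    with open_tcp_socket() as sock:
        # fileno() is the integer handle allocated by the operating system.
        print("socket descriptor:", sock.fileno())
```

The exact number printed depends on the operating system and on what else the process has open.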

A typical packet is composed of 3 layers:

  • the hardware layer, containing the MAC addresses of the network cards on the local link

  • the IP layer, containing the address of the remote server to send the request to, and the address of the client so that the server can send back the packet containing the data for the request

  • the TCP layer, containing the port associated with the service on the server, the port used by the client, and the additional information TCP needs to keep track of the streamed connection: packet order and the various states of the connection.

A good blog post on Steem explaining the details of the basic client-server architecture based on TCP/IP can be found here:

https://steemit.com/computing/@eneismijmich/computer-networks

When a service runs on a server, the server application typically asks the operating system to 'bind' a socket to a port number. The operating system then routes packets arriving on this port number to the server application that owns the socket.

The server application can then read the request from the data of the packets sent to this port.
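The bind / accept / read sequence described above can be sketched in Python on the local machine. This is only an illustration under assumptions: the function names serve_once and loopback_exchange are made up, the server handles a single connection, and it listens on 127.0.0.1 with a port picked by the operating system.

```python
import socket
import threading

def serve_once(listener):
    """Accept one connection, read the request bytes and send a reply."""
    conn, _client_addr = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"server received: " + request)

def loopback_exchange(message):
    """Send `message` to a throwaway local server and return its reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Port 0 asks the operating system to pick any free port.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]

        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()

        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            # Signal that the client has nothing more to send.
            client.shutdown(socket.SHUT_WR)
            reply = b""
            while True:
                chunk = client.recv(1024)
                if not chunk:
                    break
                reply += chunk
        worker.join()
        return reply
```

Calling loopback_exchange(b"hello") plays both roles in one process: the listener is the server side bound to a port, and create_connection is the client side opening a socket to that address and port.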

HTTP Protocol


https://www.ntu.edu.sg/home/ehchua/programming/webprogramming/HTTP_Basics.html

In addition to the lower layers used to carry it, the packet contains the data to be sent to the service on the remote computer designated by those lower layers.

This data contains the request to be executed by the server.

HTTP requests were originally designed to retrieve the content of a file located on the server. The target is typically written as a URL, or Uniform Resource Locator, which contains the address of the server, the port to which the service is bound, and the path of the file on the server.

A typical HTTP URL looks like this (a complete URL starts with the http:// scheme prefix, which is left out of the examples in this post):

serveraddress.com:server_port/path/to/file.html

When an HTTP URL is entered into an HTTP client, the client first resolves the domain name to the IP address of the server, then sends the HTTP request containing the path of the file to retrieve from the server.
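As a rough sketch of the first step, here is how a client could split a URL into the pieces it needs, using Python's standard urllib.parse module. The function name split_http_url and the port 8080 in the usage example are made up for illustration; the name resolution itself (for example with socket.getaddrinfo) is left out because it needs a real domain.

```python
from urllib.parse import urlsplit

def split_http_url(url):
    """Return (host, port, path) for an http:// URL."""
    parts = urlsplit(url)
    # Port 80 is the default for plain HTTP when the URL names none.
    port = parts.port if parts.port is not None else 80
    # An empty path means the root of the site.
    path = parts.path or "/"
    return parts.hostname, port, path
```

For example, split_http_url("http://serveraddress.com:8080/path/to/file.html") gives the host to resolve, the port to connect to, and the path to put in the request.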

Three types of HTTP requests are commonly used: HEAD, GET and POST.

A typical HTTP request consists of one of those request types, the path of the file to retrieve, the HTTP version used by the client, and a series of lines containing values called HTTP request headers.

  • A HEAD request tells the server to send back only the response headers associated with the specified file.

  • A GET request tells the server to send the data stored in the specified file in addition to the response headers.

  • A POST request is the same as a GET request, except that the client includes additional data in the request, such as an uploaded file or the data from an HTML form.

HTTP headers are additional information that the client adds to the request automatically, to specify options that the server can use in addition to the path of the file itself.

HTTP Request headers

Host

The first commonly used header is 'Host'. It is used when an HTTP server hosts several domains on the same port, to select the root directory from which to fetch the file.

HTTP servers call this virtual hosting, and most of them can be configured to host several domains on the same port, each associated with its own document root.

The file path contained in the URL is relative to the document root of the virtual host whose configured hostname matches the value of this field.
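A toy version of this lookup can be written in a few lines of Python. Everything here is hypothetical (the VIRTUAL_HOSTS table, the /var/www paths and the function name resolve_file); a real server reads this mapping from its configuration.

```python
import posixpath

# Hypothetical virtual host table: Host header value -> document root.
VIRTUAL_HOSTS = {
    "serveraddress.com": "/var/www/serveraddress",
    "otherdomain.com": "/var/www/otherdomain",
}

def resolve_file(host_header, url_path, vhosts=VIRTUAL_HOSTS):
    """Map a Host header and URL path to a file path under a document root."""
    # Drop an optional ":port" suffix and ignore letter case.
    hostname = host_header.split(":")[0].strip().lower()
    document_root = vhosts.get(hostname)
    if document_root is None:
        return None
    # Normalise the path so "../" segments cannot climb above the root.
    safe_path = posixpath.normpath("/" + url_path.lstrip("/"))
    return document_root + safe_path
```

The same URL path therefore ends up in a different folder depending on the Host header, which is the whole point of virtual hosting.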

User-Agent

This field is used by the HTTP client to send a string that the server can use to identify the type of browser making the request.

Cookie

This field is used by the HTTP client to send persistent data to the server. It is often used to keep a session token, or any data that the client needs to keep when it sends multiple requests to the same service.

Referer

This field is used by the HTTP client to specify the page that linked the browser to the current URL.

Range

This field is used by the HTTP client to request only a portion of the specified file. It can be used to resume a download, so that the server only sends the data starting at a specified position.

Accept-charset

This field is used by the HTTP client to specify which character sets it accepts (UTF-8, ISO-8859-1, etc.).

Accept-language

This field is used by the http client to specify which language it can accept.

Accept-encoding

This field is used by the http client to specify which type of encoding or compression it can accept.

Content-length

Used with POST requests to indicate the size of the additional data sent by the client.

Content-type

Used with POST requests to indicate the type of the data sent by the client.

Content-encoding

Used with POST requests to indicate the encoding or compression applied to the data sent by the client.

A request for the following URL

serveraddress.com:server_port/path/to/file.html

typically translates to the client resolving the IP address of the server, opening a socket to this IP address and port, and sending the following data, terminated by an empty line:

GET /path/to/file.html HTTP/1.1
Host: serveraddress.com
User-Agent: Firefox
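Here is a small Python sketch that assembles such a request as the bytes written to the socket. The helper name build_request is made up; what matters is that the request type, path and HTTP version share the first line, that every line ends with CRLF, and that an empty line closes the headers.

```python
def build_request(method, path, host, extra_headers=None):
    """Return the bytes of an HTTP/1.1 request without a body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end with CRLF and an empty line closes the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

For instance, build_request("GET", "/path/to/file.html", "serveraddress.com", {"User-Agent": "Firefox"}) produces the request for the URL used in this post.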

HTTP Response headers

When the server has processed the request, it sends back to the client a status code indicating whether the request could be fulfilled, followed by the data stored in the file specified by the URL, if the path was found in the folder associated with the hostname in the request headers.

Common status codes are

  • 200 to mean the request succeeded
  • 404 to mean the specified file was not found on the server
  • 500 to indicate an internal error in the server (a permission problem is normally reported as 403)

Additional response headers commonly generated by the server are

Content-Length

Indicates the length of the data stored in the requested file that will be sent back to the client.

Content-type

Indicates the MIME type of the data that is being sent to the client.

Set-Cookie

Used by the server to tell the client to store a number of variables, and to send them back when another request is sent to the same domain.

A cookie typically contains an expiration time, a domain and path for which the cookie is to be sent along with requests, and a value to be kept by the client and sent back whenever a request is made to the domain and path specified in the cookie header.
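To see what the client has to extract from such a header, here is a sketch using Python's standard http.cookies module. The function name parse_set_cookie and the sample header value are made up for illustration, and Max-Age stands in for the expiration time.

```python
from http.cookies import SimpleCookie

def parse_set_cookie(header_value):
    """Return (name, value, attributes) for one Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(header_value)
    name, morsel = next(iter(jar.items()))
    # Keep only the attributes the server actually supplied.
    attributes = {key: val for key, val in morsel.items() if val}
    return name, morsel.value, attributes
```

Given "my_cookie=cookie_value; Max-Age=3600; Path=/path/to", the client learns the name and value to store, how long to keep them, and for which path to send them back.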

Location

Indicates a redirection, which means the client should send a new request to this location to retrieve the resource.

Date

Specifies the date and time at which the server generated the response.

Cache-control

Used by the server to tell the client how long it may keep the resource in its local cache. Until that period expires the client does not need to download the resource again, which saves server resources and bandwidth and makes pages load faster when they contain a lot of static data that will not change during this period.

Content-Encoding

Specifies the encoding or compression applied by the server to the data.

Last-Modified

Indicates to the client the last modification time of the file. This is useful with a HEAD request: the client can check whether the file has been modified since its last download before deciding to make a full GET request.
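The comparison a client would make after such a HEAD request can be sketched in Python with the standard email.utils module, which understands the date format used in HTTP headers. The function name needs_download and the dates in the usage example are made up.

```python
from email.utils import parsedate_to_datetime

def needs_download(last_modified_header, cached_copy_header):
    """True when the server's Last-Modified is newer than our cached copy's."""
    server_time = parsedate_to_datetime(last_modified_header)
    cached_time = parsedate_to_datetime(cached_copy_header)
    return server_time > cached_time
```

For example, needs_download("Wed, 21 Oct 2015 07:28:00 GMT", "Tue, 20 Oct 2015 07:28:00 GMT") says whether the copy downloaded on the 20th is out of date.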

The full list of header fields can be found here:

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
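To tie the status code and the response headers together, here is a rough Python sketch that splits the raw bytes of a response into its three parts. The function name parse_response is made up, and the sketch ignores details a real client must handle, such as repeated headers or compressed bodies.

```python
def parse_response(raw):
    """Split raw HTTP response bytes into (status_code, headers, body)."""
    # An empty line separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like "HTTP/1.1 200 OK".
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lower-cased.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body
```

A client would then look at the status code first, and use headers such as content-length and content-type to decide how to read and display the body.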

CGI Request


Originally, the HTTP protocol was used to retrieve static content from a file stored on the server, but most HTTP servers can also execute the file as a script and send the output of the script instead of the raw content of the file.

In Apache this is configured using specific handlers associated with a file extension, which tell the server to run a script interpreter via a CGI (Common Gateway Interface) module instead of sending the content of the file.

The CGI interface establishes a common protocol for setting up the execution environment of the script engine and for initializing a number of built-in variables in the target script, which give it information about the server configuration and the request.

Typical scripts that can be executed over CGI include PHP, Python, Perl, or compiled C programs.
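As an illustration of that common protocol, here is a minimal sketch of a CGI script in Python. With plain CGI the server passes request details as environment variables such as REMOTE_ADDR and QUERY_STRING, and the script writes its headers, an empty line and then its output. The function name cgi_response is made up, and the server configuration needed to actually run it is not shown.

```python
import os

def cgi_response(environ):
    """Build the text a CGI script writes to standard output."""
    client_ip = environ.get("REMOTE_ADDR", "unknown")
    query = environ.get("QUERY_STRING", "")
    body = f"client: {client_ip}\nquery: {query}\n"
    # The script prints its own headers, then an empty line, then the body.
    return "Content-Type: text/plain\r\n\r\n" + body

if __name__ == "__main__":
    # The web server passes request details through environment variables.
    print(cgi_response(os.environ), end="")
```

Script engines such as PHP read the same kind of information and expose it to the script as ready-made variables.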

Query string

When the HTTP request targets a script to be executed via the CGI interface, the URL can optionally contain a so-called 'query string', which carries additional data that the script can use.

The query string starts with a '?' after the path of the script, and typically contains a number of variables separated by '&'.

In the following URL, the query string is everything after the '?'.

serveraddress.com:server_port/path/to/script.php?variable1=value1&variable2=value2
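The splitting that the script engine performs on the query string can be sketched with Python's standard urllib.parse module. The function name query_variables is made up, and the port 8080 in the usage example is only a stand-in for server_port.

```python
from urllib.parse import urlsplit, parse_qs

def query_variables(url):
    """Return the query string variables of a URL as a dict of lists."""
    query = urlsplit(url).query
    # parse_qs keeps a list per name because a name may appear several times.
    return parse_qs(query)
```

Calling it on "http://serveraddress.com:8080/path/to/script.php?variable1=value1&variable2=value2" gives one entry per variable name, each holding the values found for that name.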

PHP CGI


For the server to be able to execute PHP scripts, a specific module that implements the CGI interface needs to be installed on the server, and a handler needs to be configured to associate requests on files with a given extension with it.

This module sets up the script variables from the server's internal configuration, the fields of the HTTP request headers, and the POST content. It then executes the script and sends the output of the script to the client instead of the file itself.

A typical PHP script is enclosed in the tags <?php and ?>

script.php
<?php

// PHP script code

?>

The built-in variables set up by the CGI module are made available as associative arrays, which are variables that can contain several named keys, each associated with a value.

Such variables are

  • $_SERVER contains general information about the server itself and the request
  • $_GET contains the variables set in the query string
  • $_POST contains the variables sent as HTML form data in a POST request
  • $_COOKIE contains the values parsed from the Cookie header of the HTTP request

To access those variables from the script and print output to the client, you can use a script such as

script.php

<?php

echo $_SERVER['REMOTE_ADDR'];

?>

This prints the IP address of the client connecting to the server.


serveraddress.com:server_port/path/to/script.php?variable1=value1&variable2=value2

script.php

<?php

echo $_GET['variable1'];

?>

This prints the value of variable1 contained in the query string. With the URL above, that would print 'value1'.


script.php

<?php

// test whether the request headers already contain a cookie with this name
if (isset($_COOKIE['my_cookie']))
{
    // print the cookie value
    echo $_COOKIE['my_cookie'];
}
else
{
    // tell the client to store the value 'cookie value' in the cookie named 'my_cookie'
    // and to resend it in the request headers of all subsequent requests
    setcookie('my_cookie', 'cookie value');

    // tell the client to request the page again, this time with the cookie
    header('Location: /path/to/script.php');
    exit;
}

?>

This script sets the cookie on the first load of the page and tells the browser to request the page again. On that request, and on subsequent requests to the same domain, the browser sends the cookie back and the script prints its value.

I guess that's it for the basics. Maybe I'll make a more advanced tutorial to show how to handle HTML forms, basic SQL data, more advanced PHP code, and dynamic data / AJAX in the client browser to make dynamic requests.

URL rewriting

When using query string variables with a CGI script, it can be useful to use URL rewriting to shorten the URL and hide the variable names.

To do this, you need to add a file named .htaccess in the folder containing the PHP script file. It uses regular expressions to transform the input URL into a new one, which the server uses transparently before executing the CGI and filling the script variables.

For example, to transform a URL such as

serveraddress.com:server_port/path/to/script.php?variable1=value1&variable2=value2

to

serveraddress.com:server_port/path/to/script.php/value1/value2

the following .htaccess file can be put in the folder documentRoot/path/to/ (the short URL is the one the client requests; the server rewrites it internally into the query string form):

RewriteEngine On
RewriteRule ^script\.php/([^/]+)/([^/]+)$ /path/to/script.php?variable1=$1&variable2=$2 [L]

The first string after RewriteRule is a regular expression that is matched against the input URL. In a .htaccess file, the match is made against the part of the path that follows the folder containing the file, without a leading slash. The expressions between parentheses are the parts of the input URL that are captured as variables, and they are inserted in the target URL as $1 and $2.

If more than two values need to be captured from the input URL, more expressions between parentheses can be added, and the captured values are inserted in the destination URL as $3, $4, etc.
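The capture-and-substitute mechanism is the same as in any regular expression engine, so it can be tried outside Apache. Here is a small Python sketch that works on the full URL path; the names PATTERN and rewrite are made up, and this is not how mod_rewrite is implemented, only an illustration of the captures.

```python
import re

# Two captured groups, each matching one path segment without a slash.
PATTERN = re.compile(r"^/path/to/script\.php/([^/]+)/([^/]+)$")

def rewrite(path):
    """Turn the short form of the URL path back into the query string form."""
    # \1 and \2 play the role of $1 and $2 in the RewriteRule.
    return PATTERN.sub(r"/path/to/script.php?variable1=\1&variable2=\2", path)
```

A path that does not match the pattern is returned unchanged, which mirrors the way a rewrite rule leaves other URLs alone.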


I've worked in web development for nearly 20 years in some form or another and lots of my long standing colleagues could do with reading this never mind beginners :-)

Yes, I tend to notice that people often jump into CGI programming without taking a good look at the HTTP protocol beneath it, which can lead to weird architecture bugs because the way parameters are passed through the HTTP request and CGI gets confused with PHP's internal structures.