Assignment 6: HTTP Proxy
Your penultimate CS110 assignment has you implement a multithreaded HTTP proxy and
cache. An HTTP proxy is an intermediary that intercepts each and every HTTP request and
(generally) forwards it on to the intended recipient. The servers direct their HTTP responses
back to the proxy, which in turn passes them on to the client. In its purest form, the HTTP
proxy is little more than a nosy network middleman that reads and propagates all incoming
and outgoing HTTP activity.
Here’s the neat part, though. When HTTP requests and responses travel through a proxy, the
proxy can control what gets passed along. The proxy might, for instance:
• block access to social media sites—sites like Google Plus, Twitter, and LinkedIn.
• block access to large documents, like videos and high-definition images, so that slow
networks don’t become congested and interfere with lightweight applications like
email and instant messaging.
• block access to all web sites hosted in Canada. You know, as payback for Justin Bieber.
• strip an HTTP request of all cookie and IP address information before forwarding it to
the server as part of some larger effort to anonymize the client.
• intercept all requests for GIF, JPG, and PNG files and instead serve a proxy-stored
image of Mehran Sahami.
• cache the HTTP responses to frequently requested, static resources that don’t change
very often so it can respond to future HTTP requests for the same exact resources
without involving the origin servers.
• redirect the user to an intermediate paywall to collect payment for wider access to the
Internet, as some airport and coffee shop WiFi systems are known to do.
Building a proxy is no small task, but with your smarts and my love to guide you, I am
absolutely confident you can do it.
Compile often, test incrementally and almost as often as you compile, hg commit so
you don’t lose your work if someone accidentally crushes your laptop, and run
/usr/class/cs110/bin/submit when you’re done.
You should descend into your own assign6 directory and create a symbolic link to
the sample solution, which resides in /usr/class/cs110/samples/assign6:
myth22> ln -s /usr/class/cs110/samples/assign6/http-proxy_soln http-proxy_soln
My recommendation is that you create a symbolic link to the sample executable instead of
making a deep copy, just in case I need to publish bug fixes to the sample over the course of
the next week.
You can run the sample http-proxy_soln (and your own http-proxy) as you’d expect:
myth22> ./http-proxy_soln
Listening for all incoming traffic on port <port number>.
The port number issued depends on your SUNet ID, and with high probability, you’ll be the
only one ever assigned it. If for some reason http-proxy says the port number is in use, you
can select any other port number between 1024 and 65535 (I’ll choose 12345 here) that isn’t
in use by typing something along these lines (assuming a --port override flag; check the
sample’s usage message for the exact spelling):
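myth22> ./http-proxy_soln --port 12345
Listening for all incoming traffic on port 12345.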
In isolation, http-proxy_soln doesn’t do very much. In order to see it work its magic, you
should download and launch a web browser that allows you to appoint a proxy for just HTTP
traffic. I’m recommending you download Firefox, because I’ve used it for months specifically
to exercise the http-proxy code base, and it’s worked very well for me. Once you download
and launch Firefox, you can configure it to connect to the Internet through http-proxy by
launching the Preferences Panel, selecting Advanced, selecting Network within Advanced,
selecting Connection Settings within Network, and then activating a manual proxy as I have
in the screenshot on the right. (On Windows, proxy settings can be configured by selecting
Tools → Options.)
Of course, you should enter the myth machine you’re working on (and you should get
in the habit of ssh’ing into the same exact myth machine for the next week so you
don’t have to continually change these settings), and you should enter the port number
that your http-proxy is listening to.
Of course, you can use whatever web browser you want to, but I’m recommending
Firefox for a few reasons. Here are two of them:
• Most of you don’t use Firefox by default, so you won’t need to manually toggle
proxy settings between on and off to surf the Internet using whatever browser it is
you normally use. Firefox can be your CS110 browser for this assignment cycle,
and Chrome, Safari, Internet Explorer or whatever it is you use normally can be
your default.
• Some other browsers don’t allow you to configure browser-only proxy settings,
but instead prompt you to configure computer-wide proxy settings for all HTTP
traffic—for all browsers, Dropbox and/or iCloud synchronization, iTunes
downloads, and so forth. You don’t want that level of interference.
If you’d like to start small and avoid the browser, you can use telnet from your own
machine to converse with your proxy, like this (everything I type in is in bold, and everything
sent back by the proxy running on myth22:12345 is italicized; the resource path in the GET
line below is just an example, and any graph.facebook.com path will do):
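myth22> telnet myth22.stanford.edu 12345
Trying...
Connected to myth22.stanford.edu.
Escape character is '^]'.
GET http://graph.facebook.com/cs110 HTTP/1.1
Host: graph.facebook.com
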
HTTP/1.1 200 OK
access-control-allow-origin: *
cache-control: private, no-cache, no-store, must-revalidate
connection: close
content-length: 1591
content-type: text/javascript; charset=UTF-8
date: Tue, 20 May 2014 14:35:58 GMT
etag: "28e963166db2400f24e21eef904639ac701b4e02"
expires: Sat, 01 Jan 2000 00:00:00 GMT
pragma: no-cache
x-fb-debug: rJy2+iAL3ikHaOJkyBKYYsEoJo3a5SVUZSPhdR7yJBE=
x-fb-rev: 1255522
(Note that after you enter Host: graph.facebook.com, you need to hit enter twice.)
For the v1 milestone, you shouldn’t worry about threads or caching. You should
transform the initial code base into a sequential but otherwise legitimate proxy. The
code you’re starting with responds to all HTTP requests with a placeholder status line
consisting of an "HTTP/1.0" version string, a status code of 200, and a curt "OK"
reason message. The response includes an equally curt payload announcing the client’s
IP address. Once you’ve configured your browser so that all HTTP traffic is directed
toward the relevant port of the myth machine you’re working on (again, it’s probably a
good idea to ssh into the same exact myth machine—e.g. myth22.stanford.edu—every
single time so that you needn’t repeatedly update your browser’s proxy settings), go
ahead and launch
http-proxy and start visiting any and all web sites. Your proxy should intercept every
HTTP request and respond with something like this:
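HTTP/1.0 200 OK

<a curt payload announcing your client’s IP address>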
Implementing v1: Sequential http-proxy
For the v1 milestone, you should upgrade the starter application to be a true proxy—an
intermediary that ingests HTTP requests from the client, establishes connections to the
origin servers (those’re the machines for which the requests are actually intended),
passes the HTTP requests on to the origin servers, waits for HTTP responses from these
origin servers, and then passes those responses back to the clients. Once the v1
checkpoint has been implemented, your http-proxy application should basically be a
busybody application that nosily intercepts HTTP requests and responses and passes
them on to the intended recipients.
Each intercepted HTTP request is passed along to the origin server pretty much as is,
save for three small changes.
• You should modify the intercepted request URL within the first line — the request
line as it’s called — as needed so that when you forward it as part of the request,
it includes only the path and not the protocol or the host. The request line of the
intercepted HTTP request should look something like this (the exact URL is illustrative):
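GET http://news.yahoo.com/world/ HTTP/1.1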
Of course, GET might be any one of the legitimate HTTP method names, the
protocol might be HTTP/1.0 instead of HTTP/1.1, and the URL will be any one
of a jillion URLs. But provided your browser is configured to direct all HTTP
traffic through your proxy, the URLs are guaranteed to include the protocol (e.g.
the leading "http://") and the host name (e.g. news.yahoo.com). The
protocol and the host name are included whenever the request is directed to a
proxy, because the proxy would otherwise have no clue where the forwarded
HTTP request should go. When you do forward the HTTP request, you need to
strip the leading "http://" and the host name from the URL. For the specific
example above, the proxy would need to forward the HTTP request on to
news.yahoo.com, and the first line of that forwarded HTTP request would need
to look like this (the same illustrative URL, stripped of protocol and host):
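GET /world/ HTTP/1.1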
I’ve implemented the HTTPRequest class to manage this detail for you
automatically (inspect the implementation of operator<< in request.cc and
you’ll see), but you need to ensure that you don’t break this as you start
modifying the code base.
• You should add a request header entity named "x-forwarded-proto" and set
its value to be "http". If "x-forwarded-proto" is already included in the
request header, then simply add it again.
• You should add a new request header entity called "x-forwarded-for" and set
its value to be the IP address of the requesting client. If "x-forwarded-for" is
already present, then you should extend its value into a comma-separated chain of
IP addresses the request has passed through before arriving at your proxy. (The IP
address of the machine you’re directly hearing from would be appended to the
end.) A sketch of both header fix-ups appears just below. Your reasons for adding these
two new fields will become apparent later on.
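To make the last two bullets concrete, here is a sketch of the header fix-ups. The
HTTPRequest method names below are hypothetical stand-ins for whatever request.h/cc
actually exposes:

// Sketch only: addHeader, getHeaderValue, and setHeader are hypothetical names;
// consult request.h for HTTPRequest's actual interface.
void augmentHeaders(HTTPRequest& request, const string& clientIPAddress) {
  // x-forwarded-proto: add another copy, even if one is already present.
  request.addHeader("x-forwarded-proto", "http");
  // x-forwarded-for: extend any existing chain, appending the IP address of the
  // machine we're directly hearing from to the end.
  string chain = request.getHeaderValue("x-forwarded-for"); // "" if absent
  if (!chain.empty()) chain += ",";
  chain += clientIPAddress;
  request.setHeader("x-forwarded-for", chain); // replaces any previous value
}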
Most of the code you write for your v1 milestone will be confined to
request-handler.h and request-handler.cc files (although you’ll want to make
a few changes to request.h/cc as well). The HTTPRequestHandler class you’re
starting with has just one public method, a placeholder implementation for that method,
and that’s it. You will need to familiarize yourself with all of the various classes at your
disposal to determine which ones should contribute to the v1 implementation. Of course,
you’ll want to leverage the client socket code presented in lecture to open up a
connection to the origin server. Your implementation of the one public method will
evolve into a substantial amount of code—substantial enough that you’ll want to
decompose and add a good number of private helper methods.
Once you’ve reached your v1 milestone, you’ll be the proud owner of a sequential (but
otherwise fully functional) http-proxy. You should visit every popular web site
imaginable to ensure the round-trip transactions pass through your proxy without
impacting the functionality of the site. Of course, you can expect the sites to load very
slowly, since your proxy has this much parallelism: zero. For the moment, however,
concern yourself with the networking and the proxy’s core functionality, and worry
about improving application throughput in later milestones.
Implementing v2: Sequential http-proxy with blacklisting and caching
Why block access to certain web sites? There are several reasons, and here are a few:
• Law firms, for example, don’t want their attorneys surfing Yahoo, AOL, or Facebook
when they should be working and billing clients.
• Parents don’t want their kids to accidentally trip across a certain type of web site.
• Professors configure their browsers to proxy through a university intermediary that itself
is authorized to access a wide selection of journals, online textbooks, and other
materials—all free of charge—that shouldn’t be accessible to the general public. (This is
the opposite of blocking, I guess, but the idea is the same).
• Some governments forbid their citizens to visit Facebook, Twitter, The New York Times,
and other media sites.
• Microsoft IT might "encourage" its employees to use Bing by blocking access to other
search engines during lockdown periods when a new Bing feature is being tested
internally.
Why should the proxy maintain copies of static resources (like images and JavaScript
files)? Here are two reasons:
• The operative adjective here is static. A large fraction of HTTP responses are
dynamically generated—after all, the majority of your Facebook, LinkedIn,
Google Plus, Flickr, and Instagram feeds are constantly updated—sometimes
every few minutes. HTTP responses providing dynamically generated content
should never be cached, and the HTTP response headers are very clear about
that. But some responses—those serving images, JavaScript files, and CSS files,
for instance—are the same for all clients, and stay the same for several hours,
days, weeks, months—even years! An HTTP response serving static content
usually includes information in its header stating the entire thing is cacheable.
Your browser uses this information to keep copies of cacheable documents, and
your proxy can too.
• Along the same lines, if a static resource—the omnipresent Google logo, for
instance—rarely changes, why should a proxy fetch the same image over and over
again? It’s true that browsers won’t even bother issuing a request for something
already in their own caches, but users clear their browser caches from time to time
(in fact, you should clear yours very, very often while testing), and some HTTP
clients aren’t savvy enough to cache anything at all. By maintaining its own cache,
your proxy can drastically reduce network traffic by serving up cached copies when
it knows those copies would be exact replicas of what it’d otherwise get from the
origin servers. In practice, web proxies sit on the same local area network as their
clients, so if requests for static content don’t need to leave the LAN, then it’s a win
for all parties.
In spite of the long-winded defense of why caching and blacklisting are reasonable
features, incorporating support for each is relatively straightforward, provided you
confine your changes to the request-handler.h and .cc files. In particular, you
should just add two private instance variables—one of type HTTPBlacklist, and a
second of type HTTPCache—to HTTPRequestHandler. Once you do that, you should
do the following.
Your to-do item for blacklisting? Before forwarding an intercepted HTTP request,
check whether the origin server matches one of the blocked domains. If it does,
respond with a status code of 403 instead of forwarding the request.
Your to-do item for caching? Before passing the HTTP request on to the origin
server, you should check to see if a valid cache entry exists. If it does, just return a
copy of it—verbatim!—without bothering to forward the HTTP request. If it does
not, then you should forward the request as you would have otherwise. If the
HTTP response identifies itself as cacheable, then you should cache a copy
before propagating it along to the client.
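In sketch form, the flow looks like this (the HTTPCache method names here are guesses
on my part, so consult the cache’s header file for the real interface):

// Sketch: containsCacheEntry, shouldCache, and cacheEntry are assumed names.
HTTPResponse response;
if (cache.containsCacheEntry(request, response)) {
  // valid cache entry: respond with the cached copy, verbatim, and never
  // bother the origin server
} else {
  // forward the request and ingest the response from the origin server; then...
  if (cache.shouldCache(request, response)) {
    cache.cacheEntry(request, response); // squirrel away a copy for next time
  }
}
// either way, pass the response back to the client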
What’s cacheable? The code I’ve given you makes some decisions—technically
off specification, but good enough for our purposes—and implements pretty
much everything for you. In a nutshell, an HTTP response is cacheable if the HTTP
request method was "GET", the response status code was 200, and the response
header was clear that the response is cacheable and can be cached for a
reasonably long period of time. You can inspect some of the HTTPCache
method implementations to see the decisions I’ve made for you, or you can just
ignore the implementations for the time being and use the HTTPCache as an
off-the-shelf component.
Once you’ve hit v2, you should once again pelt your proxy with oodles of requests to
ensure it still works as before, save for some obvious differences. Web sites matching
domain regexes listed in blocked-domains.txt should be shot down with a 403,
and you should confirm your http-proxy’s cache grows to store a good number of
documents, sparing the larger Internetz from a good amount of superfluous network
activity. (Again, to test the caching part, make sure you clear your browser’s cache a
whole lot—you might even set the browser cache size to 0 so the browser itself never
caches anything and all requests are forwarded to your proxy.)
Implementing v3: Concurrent http-proxy with blacklisting and caching
The initial version of scheduler.h/.cc provides the lamest scheduler ever: it just passes the
buck on to the HTTPRequestHandler, which proxies, blocks, and caches on the main
thread. Calling it a scheduler is an insult to all other schedulers, because it doesn’t really
schedule anything at all. It just passes each socket/IP-address pair on to its
HTTPRequestHandler underling and blocks until the underling’s serviceRequest
method sees the full HTTP transaction through to the last byte transfer.
One extreme solution might just spawn a separate thread within every single call to
scheduleRequest. Assuming serviceRequest accepts the socket/IP-address pair
described above (the exact signatures may differ), its implementation would go from this:
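void HTTPProxyScheduler::scheduleRequest(int connectionfd, const string& clientIPAddress) {
  handler.serviceRequest(make_pair(connectionfd, clientIPAddress));
}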
to this:
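void HTTPProxyScheduler::scheduleRequest(int connectionfd, const string& clientIPAddress) {
  thread t([this](const pair<int, string>& connection) {
    handler.serviceRequest(connection); // this is captured so handler is reachable
  }, make_pair(connectionfd, clientIPAddress));
  t.detach();
}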
The time-server examples presented in the lecture slides take this approach, although we
go with an anonymous function here that wraps a call to serviceRequest. (Note that this
needs to be captured, else the anonymous thread routine won’t have access to the handler
member variable).
The above solution doesn’t limit the number of threads that can be running at any one time,
though. If your proxy were to receive hundreds of requests in the course of a few seconds—in
practice, a very real possibility—the above approach would create hundreds of threads in the
course of those few seconds, and that would be bad. Should the proxy endure an extended
burst of incoming traffic—scores of requests per second, sustained over several minutes or
even hours—the above approach would create so many threads that the thread count would
soon exceed a thread-manager-defined maximum. Of course, the above approach succeeds in
getting the request off of the main thread (which is huge), but we can’t employ an unbounded
number of threads to do that. You’ll paralyze the thread manager if you do.
Fortunately, you built a ThreadPool class for Assignment 5, which is exactly what you want
to employ here. I’ve included the thread-pool.h file in the assign6 repositories, and I’ve
updated the Makefile to link against my working solution of the ThreadPool class. You
should leverage a single ThreadPool with 16 worker threads, and use that to elevate your
sequential proxy to a multithreaded one. Given a properly working ThreadPool, going from
sequential to concurrent is actually not very much work at all.
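Concretely, the concurrent scheduleRequest might look something like this, assuming the
ThreadPool constructor and schedule method you implemented for Assignment 5:

class HTTPProxyScheduler {
 public:
  void scheduleRequest(int connectionfd, const string& clientIPAddress) {
    pool.schedule([this, connectionfd, clientIPAddress] {
      handler.serviceRequest(make_pair(connectionfd, clientIPAddress));
    });
  }
 private:
  HTTPRequestHandler handler;
  ThreadPool pool{16}; // a single pool of 16 workers services all requests
};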
• You must, of course, ensure there are no race conditions—specifically, that no two
threads are trying to search for, access, create, or otherwise manipulate the same cache
entry at any one moment.
• You can have at most one open connection for any given request. If two threads are
trying to fetch the same document (e.g. the HTTP requests are precisely the same), then
one thread must go through the entire examine-cache/fetch-if-not-present/add-cache-
entry transaction before the second thread can even look at the cache to see if it’s there.
You should not lock down the entire cache with a single mutex for all requests, as that
introduces a huge bottleneck into the mix, allows at most one open network connection at
a time, and renders your multithreaded application essentially sequential. You could
take the map<string, unique_ptr<mutex>> approach that the implementation of
oslock and osunlock takes (you took that approach in Assignment 4 to manage per-
server connection limits as well), but that solution doesn’t scale for real proxies, which run
uninterrupted for months at a time and cache millions of documents.
Instead, your HTTPCache implementation should maintain an array of 1001 mutexes, and
before you do anything on behalf of a particular request, you should hash it and acquire
the mutex at the index equal to the hash code modulo 1001. You should be able to
inspect the initial implementation of the HTTPCache and figure out how to surface a hash
code and use that to figure out which mutex guards any particular request. A specific
HTTPRequest will always map to the same mutex, which guarantees safety; different
HTTPRequests may very, very occasionally map to the same mutex, but we’re willing to
live with that, since it happens so infrequently.
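In code, the idea looks roughly like this (the names are placeholders for whatever you
end up surfacing from the HTTPCache implementation):

// Sketch: kNumBuckets, hashRequest, and getMutexFor are placeholder names.
static const size_t kNumBuckets = 1001;

class HTTPCache {
 public:
  size_t hashRequest(const HTTPRequest& request) const; // surfaces the hash code
  mutex& getMutexFor(const HTTPRequest& request) {
    return mutexes[hashRequest(request) % kNumBuckets]; // same request, same mutex
  }
 private:
  mutex mutexes[kNumBuckets];
};

// in HTTPRequestHandler, before touching the cache on behalf of a request:
//   lock_guard<mutex> lg(cache.getMutexFor(request));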
I’ve ensured that the starting code base relies on thread-safe versions of functions
(gethostbyname_r instead of gethostbyname, readdir_r instead of readdir), so you
don’t have to worry about any of that. In past quarters, I’ve made students make these
changes themselves, but they’ve resented me for it, so I’ve backed down and ensured
that the reentrant version of a function is used whenever there is one. (Note your
assign6 repo includes client-socket.[h/cc], updated to use gethostbyname_r.)
Implementing v4: Concurrent http-proxy with blacklisting, caching, and proxy chaining
Some proxies elect to forward their requests not to the origin servers, but instead to secondary
proxies. Chaining proxies makes it possible to more fully conceal your web surfing activity,
particularly if you pass through proxies that pledge to anonymize your IP address, cookie jar,
etc. A proxied proxy might even have more noble intentions—to simply rely on the services of
an existing proxy while providing a few more services—better caching, custom blacklisting,
and so forth—to the client.
The http-proxy_soln we’ve supplied you allows for a secondary proxy to be specified
via the --proxy-server flag, as with this (assuming your issued port happens to be 43383,
to match the example that follows):
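myth22> ./http-proxy_soln --proxy-server myth12.stanford.edu
Listening for all incoming traffic on port 43383.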
Provided a second proxy is running on myth12 and listening to port 43383, the proxy running
on myth22 would forward all HTTP requests—unmodified, save for the updates to the
"x-forwarded-proto" and "x-forwarded-for" header fields—on to the proxy running
on myth12:43383, which for all we know forwards to another proxy!
We actually don’t require that the secondary proxy be listening on the same port number, so
something like this might be a legal chain:
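myth22> ./http-proxy_soln --proxy-server myth12.stanford.edu --proxy-port 12345
Listening for all incoming traffic on port 43383.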
In that case, the myth22:43383 proxy would forward all requests to the proxy that’s
presumably listening to port 12345 on myth12.stanford.edu. (If the --proxy--port
option isn’t specified, then the proxy assumes the same port number it’s using is used by the
secondary.)
The HTTPProxy class we’ve given you already knows how to parse these
additional --proxy-server and --proxy-port flags, but it doesn’t do anything with
them. You’re to update the hierarchy of classes to allow for the possibility that a secondary
proxy is being used, and if so, to forward all requests (as is, except for the modifications to the
"x-forwarded-proto" and "x-forwarded-for" headers) on to the secondary proxy.
This’ll require you to extend the signatures of many methods and/or add methods to the
hierarchy of classes to allow for the possibility that requests will be forwarded to another proxy
instead of the origin servers. (If you notice that the chained set of proxy IP addresses leads
to a cycle, you should respond with a status code of 504.)
For fun, we’re supplying a python script called run-proxy-farm.py, which can be used to
manage a farm of proxies that forward to each other. Once you have proxy chaining
implemented, open the python script up, update the HOSTS variable to be a list of one or more
myth machine numbers (e.g. HOSTS = [14, 15, 18, 2]) to get a daisy chain of
http-proxy processes running on the different hosts.
Additional Tidbits
• You should absolutely add logging code and publish it to standard out. We won’t be
auto-grading the logging portion of this assignment, but you should still add tons of
logging so that you can confirm your proxy application is actually moving and getting
stuff done. (For obvious reasons, your logging code should be thread-safe).
• You can assume your browser and all web sites are solid and respect HTTP request and
response protocols. While testing, you should hit as many sites as possible, sticking to
major web products like www.wikipedia.org, www.apple.com,
www.nytimes.com, www.sfgate.com, www.stanford.edu, and so forth. You
should avoid sites that require a login or some other form of authentication, since
they’ll likely mingle HTTP and HTTPS requests.
• Your proxy will intercept all HTTP traffic, but it won’t even see any HTTPS traffic. As
you surf the net, note whether the site switches over to HTTPS (Facebook and Gmail do
this), lest you think your proxy isn’t actually doing anything, when in fact it’s not
supposed to be. You’ll probably want to avoid web sites like www.google.com and
www.yahoo.com while testing your proxy, since they’re all HTTPS as far as I can tell.
• Your http-proxy application maintains its cache in a subdirectory of your home
directory called .http-proxy-cache-myth<num>. The accumulation of all cache
entries might very well amount to megabytes of data over the course of the next two
weeks.
• Your http-proxy application should, in theory, run until you explicitly quit by
pressing either ctrl-Z or ctrl-C. A real proxy would be polite enough to wait until all
outstanding proxy requests have been handled, and it would also engage in a
bidirectional rendezvous with the scheduler, allowing it the opportunity to bring down
the ThreadPool a little more gracefully. You don’t need to worry about this at all—
just kill the proxy application without regard for any cleanup.
I hope you enjoy the assignment as much as I’ve enjoyed developing it. It’s genuinely
thrilling to know that all of you can implement something as sophisticated as an industrial-
strength proxy, particularly in light of the fact that many of you took CS106B and CS106X
less than a year ago.