Can’t wake up early enough? Write a program!

Last semester, sometime around October or November (whenever it was that registration opened), I had armed myself with an incredible schedule for Spring 2009: two out of my three core computational biology courses (710 and 711), and a course in web application design.  It was awesome.  Following the conclusion of Spring ’09, I’d technically be halfway through my M.S. degree, but I’d have satisfied 2/3 of the credit requirements, freeing up a significant amount of space for just about anything else I could want to take.

Then disaster struck: a mere 70 minutes after registration officially opened, when I logged on to register, I discovered to my dismay that I was already 60th on the waitlist for my web application course.  That also meant another 30 people had snapped up the actual spots in the course.  Yes, in the 70 minutes registration was open, 90 people had jumped into a pool built for 30.  And none of those 90 were me.

Nothing like this had ever happened while I was at Georgia Tech.  Of course, we’d also had assigned registration periods there; at CMU, all graduate students were allowed to register starting at 6am on the appointed day.  I had made a bad assumption: that, as at GT, the classes I wanted would still be available whenever I got around to registering.

Naive, yes.  And I learned from it.  This semester, with registration just over a week away, I have been working on a script to register me automatically, blasting through the authentication and registration forms faster than any human could possibly click their way through.

PHP’s cURL (Client URL) library offers a very elegant way of replicating HTTP requests. The challenge, then, is determining what the server is expecting and making sure each request meets those expectations.  This is done by setting various cURL options through curl_setopt().

Let’s step through this, shall we?

Initialization

The first step is setting up our cURL session. This is done with curl_init(), and works very simply as follows:

$handle = curl_init();

No big deal there. However, we also need to look at the first web page we’ll be accessing. In this case, CMU has a series of what I call “administration pages” which allow students to access email, registration, payment information, course standings, and so on. What is interesting about these pages is that they all have an automatic redirect to CMU’s authentication portal, which they call “WebISO”. This page provides a twofold check: it looks at the user’s cookies to see if they are properly authenticated and, if not, presents them with a login form. It also checks that the user has a valid SSL server certificate.

First, we have to obtain the server certificate.  CMU posts its root certificates specifically for students to download, and once we have done so, we have to tell cURL where to find them.

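// These options take integers/booleans rather than strings.
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, 0);     // do not check the certificate's hostname
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false); // do not check the certificate chain
curl_setopt($handle, CURLOPT_CAINFO, '/path/to/server.crt');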

The well-seasoned among you may recognize that turning off both host and peer verification isn’t the right way to go about things.  Unfortunately, and for reasons you all are welcome to try and explain (I haven’t found an explanation yet), I receive a cURL error message if I turn either one of them back on.  My best guess is that it has something to do with the fact that CMU’s root certificate is self-signed.  (One thing worth noting: for CURLOPT_SSL_VERIFYHOST, the strict setting is 2 rather than 1.)
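For comparison, here is a minimal sketch of what the verified configuration would look like. The helper name enableStrictTls() and its $caPath parameter are made up for illustration, and I can’t promise this gets past the error described above; it only shows which values the three options would take with verification switched on.

```php
<?php
// Hypothetical helper (not part of the original script): applies strict
// TLS verification to an existing cURL handle.
// $caPath is an assumed location for the downloaded root certificate.
function enableStrictTls($handle, $caPath) {
    // Verify the server's certificate chain against the CA file.
    $ok = curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, true);
    // Verify that the certificate matches the hostname; 2 is the strict value.
    $ok = curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, 2) && $ok;
    // Tell cURL which certificate file to trust.
    $ok = curl_setopt($handle, CURLOPT_CAINFO, $caPath) && $ok;
    return $ok; // true only if every option was accepted
}
```

If a request then fails, curl_error($handle) after curl_exec() should say which part of the verification was the problem.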

Now, we also need to establish the cookie session, so CMU’s website can set the session cookies and allow us to navigate multiple pages without having to re-authenticate.  This is done by setting the following options:

curl_setopt($handle, CURLOPT_COOKIESESSION, true);
curl_setopt($handle, CURLOPT_COOKIEFILE, '/path/to/cookiefile');
curl_setopt($handle, CURLOPT_COOKIEJAR, '/path/to/cookiefile');

Once those have been set up, all that’s left is to point cURL to the URL we want to hit, which in this case is the registration page (since we know it’ll automatically redirect to the WebISO portal page):

curl_setopt($handle, CURLOPT_URL, 'https://www.theregistrationpage.com/');

Oh, and one more step: EXECUTE!

curl_exec($handle);

Propagation

Now that the initial request is sent and the connection established, we’ll need to manually double-check that the output is what was expected – mainly, a redirection to the WebISO portal page.

(hint: look into the option CURLOPT_RETURNTRANSFER for controlling cURL output more closely)
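To make that hint concrete, here is a small sketch of capturing the output and checking for errors. The helper name fetchBody() is hypothetical; the cURL calls themselves (CURLOPT_RETURNTRANSFER, curl_errno(), curl_error()) are standard.

```php
<?php
// Hypothetical helper: runs the request configured on $handle and returns
// the response body as a string. Throws if the transfer itself failed.
function fetchBody($handle) {
    // Return the response from curl_exec() instead of printing it.
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($handle);
    if ($body === false) {
        // curl_exec() returns false on failure; cURL keeps the reason.
        throw new RuntimeException(
            'cURL error ' . curl_errno($handle) . ': ' . curl_error($handle)
        );
    }
    return $body;
}
```

With the body in a variable, it can be inspected (or parsed) before deciding which URL to request next.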

Provided no errors occurred, all that needs to happen is a simple change in the URL option, and another execute to get us there.

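curl_setopt($handle, CURLOPT_URL, 'https://www.webisopage.com/');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
$content = curl_exec($handle); // the login page HTML; we will need to parse it shortly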

This should propagate our request to the WebISO portal page.

Now, important to note: the more seasoned among us would normally have expected this redirection to happen automatically, a la the “Location: ” header field. For whatever reason, CMU’s administration pages don’t use that header. Instead, once a form has been successfully accessed or submitted, they use <meta http-equiv="refresh"> directives to redirect through plain HTML. So for those who would point to the CURLOPT_FOLLOWLOCATION option: it won’t do any good here, since it only follows “Location: ” headers. We have to emulate this redirection ourselves through another HTTP request.
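Rather than hard-coding the next URL, the redirect target could in principle be read out of the page itself. This is only a sketch: extractMetaRefresh() is a hypothetical helper, and the pattern assumes the http-equiv attribute comes before content and that the content value is quoted, which may not match what CMU’s pages actually send.

```php
<?php
// Hypothetical helper: returns the target URL of a
// <meta http-equiv="refresh" content="N;url=..."> tag, or null if the
// page does not contain one in the assumed attribute order.
function extractMetaRefresh($html) {
    $pattern = '/<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
             . 'content=["\']\s*\d+\s*;\s*url=([^"\'>]+)["\']/i';
    if (preg_match($pattern, $html, $matches) === 1) {
        // Attribute values are HTML-escaped (e.g. &amp;), so decode them.
        return html_entity_decode(trim($matches[1]));
    }
    return null;
}
```

The returned URL can then be handed to CURLOPT_URL for the next curl_exec(). A relative URL would still need to be resolved against the current page first.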

Once at the authentication page, we have to actually log in. This is probably the trickiest part of the whole endeavor.

Authentication

What makes this trickier is not only that we will be using the HTTP POST method instead of the GET we have implicitly used in the first few requests, but also that we have to take note of all the form fields being submitted to the server and make sure we duplicate them. If we don’t duplicate them properly, the server-side verification will fail and we’ll be denied access.

The best way I’ve found to do this, tedious though it is (again, experts: feel free to suggest another method that works better for you) is to examine the HTML source of the WebISO authentication page and take note of whatever key-value pairs are posted there. Everything inside the <form></form> tags is crucial for us.

Here’s a small excerpt of what I found on the WebISO page after performing the above steps:

<input type="hidden" name="two" value="/">
<input type="hidden" name="creds_from_greq" value="1">
<input type="hidden" name="five" value="GET">
<input type="hidden" name="pre_sess_tok" value="607260569">
<input type="hidden" name="create_ts" value="1239473521">

Any field labeled as an “input” type needs to be duplicated and submitted through cURL. The problem, though, is generating the correct values. Through repeated testing in the browser, I discovered that fields such as “two”, “creds_from_greq”, and “five” remained exactly the same, so those I could hard-code into the script easily. However, the other two, “pre_sess_tok” and “create_ts”, would change every time I accessed the page. This presented a problem that could be solved in one of two ways: either I could figure out how the value was being generated and let the PHP script generate it (easy option), or parse cURL’s HTML response for the actual values and plug them into the script (harder option).

As is probably obvious to the experts, the “create_ts” field is actually just the Unix timestamp, so that is easily reproducible within the PHP script without any need for HTML output parsing. The other value, “pre_sess_tok”, seems completely random, but testing revealed that it is actually quite integral to successful authentication; I could provide the correct username and password and every other POST field, but if that one field wasn’t what the server was expecting, I would receive an authentication error. So this field’s value would have to be extracted.

Best way to do this? RegEx. Ohhhhhh yes.

Here’s a small utility function I whipped up just for the occasion:

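// Pulls the server-generated pre_sess_tok value out of the login page HTML.
function extractSession($content) {
    $pattern = '/<input type="hidden" name="pre_sess_tok" value="(-?[0-9]*)">/';
    $retval = 943604320; // arbitrary fallback; the server will reject it
    $matches = array();
    if (preg_match($pattern, $content, $matches) === 1) {
        $retval = $matches[1]; // first capture group: the token itself
    }
    return $retval;
}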

This function pulls out the “pre_sess_tok” number generated by the server via regular expressions (it could be positive or negative, hence the “-?” expression) and returns it.
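An alternative to a one-off regex, sketched here under the assumption that PHP’s DOM extension is available, is to let an HTML parser collect every hidden input at once. The helper name extractHiddenFields() is hypothetical; it is not what the script above uses.

```php
<?php
// Hypothetical helper: returns an associative array of name => value for
// every <input type="hidden"> found in the given HTML.
function extractHiddenFields($html) {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $isHidden = strtolower($input->getAttribute('type')) === 'hidden';
        $name = $input->getAttribute('name');
        if ($isHidden && $name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}
```

This picks up “pre_sess_tok” along with the fixed fields, so nothing needs to be hard-coded. The trade-off is that it depends on the parser coping with whatever markup the page actually sends.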

We’re all set. Now all we have to do is let cURL know what the form variables are going to be, and tell it to use the HTTP POST method instead of the GET we’ve been using so far. Oh, and the POST fields need to be URL-encoded:

$fields = array('two' => '/', 'creds_from_greq' => '1', 'five' => 'GET', 'pre_sess_tok' => extractSession($content), 'create_ts' => time());
curl_setopt($handle, CURLOPT_POST, true);
curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query($fields));

Don’t forget: the username and password fields need to be set as well! These are the actual fields where you’d normally type in your username and password. 😛

(a note on the encoding: running urlencode() over the whole query string at once would be incorrect, since it would also encode the ampersands and equals signs that separate the fields. http_build_query() takes an associative key-value array, URL-encodes each key and value, and joins them with raw ampersands and equals signs)
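For anyone who would rather see that encoding spelled out by hand, here is a minimal sketch of the key-value loop. The function name buildPostBody() is made up for this example.

```php
<?php
// Hypothetical helper: turns an associative array into an
// application/x-www-form-urlencoded request body.
function buildPostBody(array $fields) {
    $pairs = array();
    foreach ($fields as $key => $value) {
        // Encode each key and value on its own; the separators stay raw.
        $pairs[] = urlencode((string) $key) . '=' . urlencode((string) $value);
    }
    return implode('&', $pairs);
}
```

The resulting string can be passed straight to CURLOPT_POSTFIELDS. Because only the keys and values are encoded, a password containing “&” or “=” can’t be mistaken for a field separator.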

And voila! All that’s left now is to submit the form.

curl_setopt($handle, CURLOPT_URL, 'https://www.webisopage.com/login');
curl_exec($handle);

Provided we did everything correctly, the “Authentication Successful” page should show up. At this point, all the cookies are correctly set, everything has been verified, so we’ll no longer need to worry about being redirected back to the WebISO page, and we can browse the other administration pages at our leisure.

Including the registration pages 😉

(feel free to track this project’s progress at its Wiki page)


About Shannon Quinn
