Live link

Suppose I would like to drop into my web calendar the up-to-date price of an AMTRAK train ticket from New York to Boston on February 17 at 5pm.

That sure sounds easy enough, but it’s surprising how hard it is to implement. You can’t just point to some page on AMTRAK’s website. The only way to find the price of a ticket on-line is to enter your start/destination and time/date into a web form. AMTRAK then crunches that data for you and generates a page that shows you the current ticket price.

There is nothing in the address field of that results page to indicate which route or date you’d asked for. That’s all done on AMTRAK’s server — the page you end up looking at is custom made for you by that server, on the fly.

I was talking today with Murphy Stein — a colleague and Ph.D. student here at NYU — and we came to the conclusion that it might be possible to solve this problem with a macro capability. In other words, you go through your usual process of entering the data into the web form, but you tell your browser to track what you’re doing, and it stores (somewhere) all the steps you just went through. It also lets you highlight the price on the results page, and it remembers where you highlighted.

Later, your browser periodically goes on a little web crawl onto AMTRAK’s site, entering the same values that you did into AMTRAK’s on-line web form, and then “clicking” on the same “GO!” button that you’d clicked on. Your browser then looks at the place on the results page that corresponds to the number you had highlighted. Unless things go horribly wrong, there will indeed be a number there (although it won’t be the same number if prices have gone up).

Every time you look at your web calendar, you will see an up-to-date price (how up-to-date depends on how often your browser goes crawling for updates).
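
To make this concrete, here is a rough sketch in Python of what the replay step might look like. Everything specific in it (the endpoint, the form field names, and the selector marking where the highlighted price sits) is invented for illustration; a real recording would store whatever the browser actually submitted.

    # Rough sketch of the replay step: re-send a recorded form submission
    # and re-read the value at the spot the user had highlighted.
    # The endpoint, field names, and selector below are all invented.
    import time
    import requests
    from bs4 import BeautifulSoup

    RECORDED_REQUEST = {
        "url": "https://tickets.example.com/fare-quote",   # hypothetical endpoint
        "fields": {                                         # hypothetical form fields
            "origin": "NYP",
            "destination": "BOS",
            "date": "2010-02-17",
            "time": "17:00",
        },
    }
    PRICE_SELECTOR = "span.fare"   # assumed location of the user's highlight

    def fetch_current_price():
        """Replay the recorded submission and re-extract the price."""
        reply = requests.post(RECORDED_REQUEST["url"], data=RECORDED_REQUEST["fields"])
        reply.raise_for_status()
        page = BeautifulSoup(reply.text, "html.parser")
        node = page.select_one(PRICE_SELECTOR)
        return node.get_text(strip=True) if node else None

    while True:                    # the browser's periodic "little web crawl"
        print("current fare:", fetch_current_price())
        time.sleep(60 * 60)        # refresh hourly

The point of the sketch is that the link stores a way to re-ask the question rather than the answer itself; the answer is fetched fresh each time.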

None of this is sure-fire. For one thing, we are relying on AMTRAK to not change their on-line query form. For another thing, we are relying on your browser being able to find the price on the updated results page, working just from the location of the price you’d originally highlighted. That can turn out to be tricky if AMTRAK hasn’t designed their page sensibly.

Also, of course, AMTRAK might become unhappy if its customers’ browsers keep revisiting its site on their own every hour or so. At some point all of those repeat cyber-visits will start to overload company web servers.

So maybe none of this is practical. But it’s a nice thing to think about — being able to create a live and up-to-date link from anywhere to anywhere else on the web.

8 thoughts on “Live link”

  1. This sounds a lot like a technique called “Screen Scraping”: http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping.

    To solve the website’s problem, assuming there’s nothing in the URL’s query string (which means that the HTTP POST method was used), we could examine the web traffic, figure out the elements that the page submits, and generate a custom HTTP request that submits to that same page. It’s a long shot, but it can work.
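
    For example, something along these lines (Python, with a made-up URL and body standing in for whatever the traffic log actually shows) would re-send a captured POST:

        # Re-send a POST body copied from the browser's traffic log.
        # The URL and body here are placeholders, not AMTRAK's real ones.
        from urllib.parse import parse_qsl
        import requests

        captured_url = "https://tickets.example.com/fare-quote"
        captured_body = "origin=NYP&destination=BOS&date=2010-02-17&time=17%3A00"

        fields = dict(parse_qsl(captured_body))   # recover the submitted name/value pairs
        reply = requests.post(captured_url, data=fields)
        print(reply.status_code, len(reply.text))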

  2. Yes, I see what you mean. And that brings up a related point.

    Most users will not understand that by the time they are looking at the result (e.g., the ticket price), the information needed to ask the same question in the future is already gone.

    Which means we need to proactively keep track of web traffic, because we never know when the user is going to want to link from a page that was created in this way.

    In the hypothetical browser feature I described, the user highlights some value on a results page and indicates “I want to link from here”. In general, the browser will then need to store, for later reuse, the proactively stored HTTP request, rather than the page itself (a small sketch of that bookkeeping is at the end of this comment).

    Interestingly, the text retrieved by “view source” is completely useless here. It doesn’t tell us what the HTTP request was asking for, and it also doesn’t reliably give us the contents of the page the user is actually looking at, since the downloaded document might have been partially rewritten by JavaScript.
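
    Something like this small request history (Python, with invented shapes for the hook and the records) is the kind of bookkeeping I have in mind:

        # Keep the last few outgoing form submissions so that, when the
        # user later says "link from here", the request that produced the
        # page can still be saved. The hook and record shapes are invented.
        from collections import deque
        from dataclasses import dataclass

        @dataclass
        class RecordedRequest:
            url: str
            method: str
            fields: dict

        class RequestHistory:
            def __init__(self, size=20):
                self._recent = deque(maxlen=size)   # proactively stored requests

            def record(self, url, method, fields):
                self._recent.append(RecordedRequest(url, method, dict(fields)))

            def latest(self):
                return self._recent[-1] if self._recent else None

        # A hypothetical browser hook would call record() on every form
        # submission; "I want to link from here" then persists latest()
        # together with the highlighted location.
        history = RequestHistory()
        history.record("https://tickets.example.com/fare-quote", "POST",
                       {"origin": "NYP", "destination": "BOS"})
        print(history.latest())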

  3. Oh, ok. “View Source” does give complete info, though. For example, this “form” element in the Amtrak ticket booking page does the work of submitting the request:

    …action="http://tickets.amtrak.com/itd/amtrak" name="form" method="post".

    Oh, I didn’t mean that we’d need to monitor traffic all the time. We would only need to do that once so that we know what kind of HTTP requests we need to construct ourselves.

    So, if we mimic this form element in our very own HTTP request, populate the form with our desired values, and send out the HTTP request, we could get the results we want. We could even encapsulate this in a small library, so it wouldn’t matter how we accept user input, and run a small shell program that uses the library, possibly in a separate thread that periodically sends out the request. We’d obviously need to parse the results to make sense of the data. This way we could send arbitrary requests and get results. But, as you mentioned in your article, the biggest problem is that if the website changes, it would break our code.
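
    In Python, the “small library plus background thread” idea might look roughly like this; the field names are guesses, since the real ones would come from the form’s input elements:

        # Post our own values to the form's action URL quoted above and
        # re-send the request on a timer. The field names are guesses.
        import threading
        import requests

        ACTION_URL = "http://tickets.amtrak.com/itd/amtrak"   # from the form's action attribute

        def query_fare(values):
            """Mimic the form submission and return the raw result page."""
            reply = requests.post(ACTION_URL, data=values)
            reply.raise_for_status()
            return reply.text   # still needs parsing to pull out the price

        def poll(values, interval_seconds=3600):
            """Periodically re-send the request from a background thread."""
            html = query_fare(values)
            print("fetched", len(html), "bytes")
            threading.Timer(interval_seconds, poll, args=(values, interval_seconds)).start()

        poll({"origin": "NYP", "destination": "BOS", "date": "02/17/2010"})   # hypothetical fields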

  4. …of course this is basically the same technology that is used by spambots to automatically create accounts on comment systems (such as this one) or forums.
    And thus the reason for Captchas (which one even finds at web based whois queries nowadays).

    Cheers,
    Mike

  5. The only problem with “we mimic this form element in our very own HTTP request” is that the user sitting in front of the browser generally doesn’t realize they want to link to the AMTRAK ticket price until after they are already looking at the results page. While they are looking around for prices, they aren’t thinking about HTTP requests — so they end up looking at a number on a page, without ever really worrying about how they got there. But that’s ok — behind the scenes we can just keep a short history of the HTTP requests that went out from the client. As you point out, this works as long as AMTRAK doesn’t change its protocol.

    The real danger of SpamBots, I think, comes when they manage to fool a server into thinking that they are a trusted human — hence the utility of Captchas.

    I’m thinking of something altogether more positive — more powerful information being made available from servers for humans who have already passed the “I’m really me and I can be trusted” test — however that test is conducted. In the case of this blog, because of its small scale, I can invite people individually to be “trusted humans”, but the principle of “the server now knows that I am really who I say I am” would be the same on a larger scale.

    Given that level of necessary security, my original thought was that it would be cool to design new kinds of user interaction with the web in a way that gives people (and not just people with technical chops) more opportunities to create their own active links — links with information that stays up-to-date and can therefore be used in various interesting ways — from one place on the Web to another.

  6. And I was just thinking about this today: this thought process sounds similar to the one behind your new language, which aims to make programming easier for people. In the case of a live link, a complex task is performed with just a few clicks. Of course, we’d need a layer of abstraction in between to achieve that.

  7. Yeah, I did this once last year for work. They had bought or subscribed to a horrible web app for timesheet entry. I wrote a script that would grab the data off the website and then make some cool graphs. “Screen scraping”, as stated above, is the term, I believe. I used the Ruby library Mechanize, which worked fairly well.
