For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
|Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.|
|Fig 2. The user selects the Make option, which initiates an Ajax request...|
|Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.|
|Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.|
|Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.|
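To make the mechanism concrete, the sketch below shows a simplified, hypothetical version of this pattern: an onchange handler on the "Make" menu issues an XMLHttpRequest and populates the "Model" menu from the response. The element IDs and the /models endpoint are invented for illustration; this is not KBB.com's actual code.

```javascript
// Hypothetical sketch of the drop-down pattern described above (not KBB.com's actual code).
// Selecting a "Make" fires an onchange event, which issues an Ajax request for the
// matching models and populates the "Model" menu from the response.
document.getElementById('make').onchange = function () {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/models?make=' + encodeURIComponent(this.value)); // invented endpoint
  xhr.onload = function () {
    var models = JSON.parse(xhr.responseText); // e.g., ["Accord", "Civic", ...]
    var menu = document.getElementById('model');
    menu.innerHTML = '';
    models.forEach(function (model) {
      var option = document.createElement('option');
      option.value = option.textContent = model;
      menu.appendChild(option);
    });
  };
  xhr.send();
};
```

Because the "Model" and "Year" data never appears in the initial HTML, a crawler that captures only the initial page load never triggers the onchange handler or the requests it generates, which is why the memento in Figure 5 is incomplete.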
We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as the most useful for our approach.
There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. Our current research requires a tool that can automatically identify the interactive elements of a page, map those elements to client-side states, and recognize and execute user interactions on the page to move between those states. Browsertrix uses Selenium to record HTTP traffic and create higher-fidelity archives a page at a time; this is a good example of a Selenium implementation, but it does not match our goal of running automatically, without a human driving the interactions. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), which matches our goal of replaying interactions; it is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automated Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.
In this blog post, I will discuss some of our initial findings on detecting and interacting with DOM elements, and the trade-offs we have observed between the tools we have investigated.
For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop-down menus.
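As a rough illustration, the PhantomJS sketch below loads a page and walks the DOM looking for elements with onclick handlers. The URL is a placeholder, and handlers registered with addEventListener (or by a JavaScript library) are not visible this way, which is part of why events such as the KBB.com onchange handlers are easy to miss.

```javascript
// Minimal PhantomJS sketch: enumerate elements with onclick handlers.
// The URL is a placeholder. Handlers attached with addEventListener (or by a
// library such as jQuery) are not discoverable this way, which is why events
// like the KBB.com onchange handlers are easy to miss.
var page = require('webpage').create();

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }

  var clickable = page.evaluate(function () {
    var found = [];
    var elements = document.getElementsByTagName('*');
    for (var i = 0; i < elements.length; i++) {
      if (elements[i].onclick || elements[i].getAttribute('onclick')) {
        found.push(elements[i].outerHTML.substring(0, 80));
      }
    }
    return found;
  });

  console.log('Elements with onclick handlers: ' + clickable.length);
  clickable.forEach(function (html) { console.log(html); });
  phantom.exit();
});
```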
Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. Selenium offers bindings in several languages (Java, Python, Perl, etc.) that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automated actions such as mouse movements.
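For instance, a canned interaction with a KBB-style page might look like the following sketch, written against the selenium-webdriver Node bindings; the browser choice, URL, and element selectors are placeholders we invented for illustration.

```javascript
// Sketch of a canned (pre-scripted) Selenium interaction using the Node
// selenium-webdriver bindings. The browser, URL, and selectors are placeholders.
const { Builder, By, until } = require('selenium-webdriver');

(async function cannedInteraction() {
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('http://www.example.com/');

    // Pre-scripted step: choose a "Make", then wait for the "Model" menu to be
    // populated by the resulting Ajax response.
    await driver.findElement(By.css('#make option[value="Honda"]')).click();
    await driver.wait(until.elementLocated(By.css('#model option')), 10000);

    // Mouse movement of the kind Selenium automates well (hover over the menu).
    const modelMenu = await driver.findElement(By.id('model'));
    await driver.actions().move({ origin: modelMenu }).perform();
  } finally {
    await driver.quit();
  }
})();
```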
Based on our experimentation, Selenium is the better tool for canned interactions, such as a pre-scripted set of clicks, drags, and similar actions. A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be used as a relative comparison rather than a definitive measurement.
| | PhantomJS | Selenium | VisualEvent |
|---|---|---|---|
| Speed (seconds) | 2.5-8 | 4-10 | < 1 (on user click) |
| DOM Integration | Close integration | 3rd party | Close integration/embedded |
| DOM Event Extraction | Semi-reliable | Semi-reliable | 100% reliable |
| DOM Interaction | Scripted, native, on-demand | Scripted | None |
Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded and pre-scripted events be performed using Selenium and adaptive or extracted events be performed in PhantomJS.
To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i (the user interaction test) but pass all of the others. This indicates that both Selenium and PhantomJS have difficulty identifying all of the events attached to all of the DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop-down menus).
|Fig 6. The Acid Test is identical for PhantomJS and Selenium, failing the post-load interaction test.|
|Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.|
We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements that VisualEvent identifies, and then use PhantomJS to interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all of the interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.
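The sketch below outlines this pipeline under a few assumptions: a local copy of the VisualEvent script is injected into the loaded page with page.injectJs, and the extraction step is a placeholder, since VisualEvent does not expose an official API for reading its results programmatically. The URL, the visualevent.js file name, and the window.extractInteractiveElements helper are ours, not VisualEvent's.

```javascript
// Sketch of the PhantomJS + VisualEvent approach described above. The URL, the
// visualevent.js file name, and window.extractInteractiveElements are our own
// placeholders; VisualEvent exposes no official extraction API.
var page = require('webpage').create();

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }

  // Inject a local copy of the VisualEvent script so that it can identify the
  // DOM elements with event handlers attached (the DIV overlay in Figure 7).
  if (!page.injectJs('visualevent.js')) {
    console.log('Could not inject VisualEvent');
    phantom.exit(1);
    return;
  }

  var interactive = page.evaluate(function () {
    // Placeholder extraction step: read back the elements and event types that
    // VisualEvent flagged. We assume a helper we wrote for this purpose; it is
    // not part of VisualEvent itself.
    if (typeof window.extractInteractiveElements === 'function') {
      return window.extractInteractiveElements();
    }
    return [];
  });

  console.log(JSON.stringify(interactive, null, 2));

  // Each discovered element/event pair (e.g., an onchange on a drop-down menu)
  // can now be triggered from PhantomJS -- for example, by dispatching the event
  // inside page.evaluate -- to move the client into a new state for capture.
  phantom.exit();
});
```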
So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation based on our personal experience with our use case of automatically identifying a set of DOM interactions; other experimental conditions and goals may be better served by Selenium or other client-side tools.
Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.
--Justin F. Brunelle