QueryPath Blogs
QueryPath: "What's with the &#13 at the end of every line?"
Old problems never die, it seems. A few people have mentioned an interesting QueryPath problem they have experienced. Roughly summarized, the question is "What's with the &#13 at the end of every line?"
Note: The trailing ; of the entity name has been removed to avoid stripping by an overzealous content filter.
I hadn't experienced this problem for myself until recently, as I worked on an importer that was parsing thousands of ancient HTML files. These files came from all over the place, and dozens of them had the XML entity &#13 appended to the end of every line. While this may look odd at first glance, a second glance reveals that this is a problem we've likely all seen in the past.
What is &#13?
One of the things QueryPath does automatically (unless you tell it not to) is re-code entities. This is important for XML, since it does not (out of the box) support the array of named entities that are part of HTML. Instead of HTML's named entities, XML uses numeric representations of the character.
What is &#13? It is the decimal notation for Carriage Return (often written as \r). Yup, the old CR-LF problem rears its head again. When a document with Windows CR-LFs is serialized by QueryPath (which, in turn, just uses the PHP DOM library), any CRs are converted to entities.
How do we solve the problem?
As far as I can tell, this behavior accords with the XML standard. For that reason, I'm not inclined to change it.
However, you can avoid the problem altogether by removing CR characters from a document. This is as easy as doing something like this:
<?php
$doc = str_replace(chr(13), '', file_get_contents($file));
qp($doc);
?>
The above will remove all of the carriage returns using str_replace, and then pass the file contents on to QueryPath.
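One caveat is worth a quick sketch. Deleting every CR works for Windows (CR-LF) files, but a file that uses bare CRs as line breaks would collapse into one long line. A minimal alternative, assuming you would rather normalize than delete, converts both forms to LF. The function name normalize_newlines() is made up for this example; it is not part of QueryPath.

```php
<?php
// Hypothetical helper, not part of QueryPath.
// Converts CR-LF and bare CR line endings to LF so that no carriage
// returns are left for the serializer to encode as entities.
function normalize_newlines($text) {
  // Order matters: replace CR-LF pairs first, then any CR left over.
  return str_replace(array("\r\n", "\r"), "\n", $text);
}

$dirty = "line one\r\nline two\rline three\n";
$clean = normalize_newlines($dirty);

// Check that no carriage returns remain in the cleaned string.
var_dump(strpos($clean, "\r") === false);
```

The cleaned string can then be handed to qp() in the same way as the str_replace version.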
TweetyPants in the News
TweetyPants (http://tweetypants.com) ranks Twitter users on Style, Smarts, and Shizzle (slang). It was built using QueryPath and OpenAmplify. TweetyPants has been mentioned several times in the news lately. Here's the roundup:
- Twi5, June 23, 2009: TweetyPants is featured as the Twitter App of the Day.
- Killer Startups, June 22, 2009: TweetyPants is discussed as an enticing example of the border between fun and insight.
- TechCrunchIT, June 17, 2009: TweetyPants (and the Drupal Amplify module) are both mentioned as implementations of OpenAmplify.
Parsing Old Remote HTML Docs with QueryPath
An issue in the QueryPath issue queue made me realize that parsing crufty old HTML documents is not exactly intuitive. Here's a quick tutorial for parsing HTML documents.
The most difficult part of handling documents is that it is not always clear what kind of content a document is. While some extensions (like .html and .xml) make this easy, others (like .qti, an XML format) are not so readily discernible. And files fetched from a URL may not have an extension at all!
So sometimes when QueryPath parses files, it basically guesses at the content type.
Here are the guessing rules that QueryPath follows:
- If passed a string of markup (e.g., <foo><bar/></foo>), check whether it begins with an XML declaration. By definition, every XML file must begin with something like this:
<?xml version="1.0"?>
If the declaration does not appear on the first line, the document is not considered XML. In that case, QueryPath throws it into the HTML parser.
- If passed a file name and a context, inspect the contents. QueryPath uses the PHP stream system to handle files. One of the benefits of this is that you can pass in a context which tells QueryPath how to retrieve the document. Typically, this is used to modify HTTP parameters.
Whenever a QueryPath object is created with a context, QueryPath automatically inspects the contents of a file. Example:
<?php
require '../Code/QueryPath/bin/QueryPath.compact.php';
$url = 'http://example.com/old_crufty_html.foo';
$cxt = array('context' => stream_context_create());
// This will parse it as HTML:
print qp($url, 'title', $cxt)->text();
?>
Assuming that the URL above points to an old and crufty HTML document, this code will parse the document as HTML. And if the URL above pointed to an XML document (or an XHTML document), the XML parser would be used instead. This happens because when a context is passed into QueryPath, it checks the content to see what data type it is dealing with.
- If passed a file name and no context, inspect the file extension. This is sort of a last-ditch attempt, and what it will do is assume that if the file ends with .html the code is HTML. Otherwise, it will assume that the file is XML.
Obviously this will work for simple cases. When retrieving URLs, though, it may have unexpected results. So why do things this way? One word: Performance. Large files work much better when we use this method. The underlying system can optimize reading of the file.
QueryPath 2.0 may change this behavior. The next version of QueryPath may use the method outlined above for all files. For QueryPath 1, though, this is how files are interpreted.
When parsing moderately sized old HTML files, you will do best to pass a context into qp(). This will give you the greatest chances of successfully parsing the document.
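The three guessing rules above can be summarized in a few lines of PHP. This is only a sketch of the decision logic as described in this post, not QueryPath's actual code. The functions has_xml_declaration() and guess_parser(), and their parameters, are invented for illustration.

```php
<?php
// Hypothetical sketch of the guessing rules; NOT QueryPath's actual code.

// True if the first line of the markup contains an XML declaration.
function has_xml_declaration($markup) {
  $lines = explode("\n", $markup, 2);
  return strpos($lines[0], '<?xml') !== false;
}

// Returns 'xml' or 'html'.
// $source:     a string of markup, a file name, or a URL
// $hasContext: whether a stream context was passed along
// $contents:   the retrieved document (only inspected when there is a context)
function guess_parser($source, $hasContext = false, $contents = '') {
  // Rule 1: a string of markup is checked for an XML declaration.
  if (strpos(ltrim($source), '<') === 0) {
    return has_xml_declaration($source) ? 'xml' : 'html';
  }
  // Rule 2: with a context, the contents are inspected.
  if ($hasContext) {
    return has_xml_declaration($contents) ? 'xml' : 'html';
  }
  // Rule 3: no context, so fall back to the file extension.
  $ext = strtolower(pathinfo($source, PATHINFO_EXTENSION));
  return $ext === 'html' ? 'html' : 'xml';
}

print guess_parser('<foo><bar/></foo>') . PHP_EOL;   // markup without a declaration
print guess_parser('old_crufty_html.foo') . PHP_EOL; // no context, unknown extension
```

The second call shows why the context matters: without one, a crufty HTML file with an unusual extension falls through to the XML parser.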
QueryPath 2.0 Alpha 2
QueryPath 2.0 Alpha 2 is now available.
The following major changes were made in Alpha 2:
- The QueryPathImpl class has been re-named QueryPath. The QueryPath interface has been removed.
- The file QueryPathImpl.php has been merged with the file QueryPath.php, and the interface has been removed from QueryPath.php
- This version adds support for selectors as an argument to branch().
- Bug Fix: When a selector that contained only an '#id' was executed and no such id was found, old matches were incorrectly returned. (Reported by Ryan Mahoney).
Only a few more changes are likely to come along before we switch from Alpha to Beta releases.
Please report bugs here: http://github.com/technosophos/querypath/issues
How to Access OpenAmplify from QueryPath
I've written a handful of tools that make use of the OpenAmplify web service. In all cases, I've used QueryPath to retrieve the XML from the remote server and then work with it locally. In this short article, I will explain how QueryPath can be used to retrieve content from OpenAmplify's web service. I will also provide brief examples of how QueryPath can work with the results.
To see these techniques in action, you can visit TweetyPants. To see some functional code, take a look at the Amplify and QP Services modules for Drupal.
Retrieving Data from OpenAmplify
Let's begin with some example code. The following code performs a simple POST-based query against the OpenAmplify web service.
<?php
require 'QueryPath/QueryPath.php';

$url = 'http://portaltnx.openamplify.com/AmplifyWeb/AmplifyThis?';
$key = 'OPENAMPLIFY_API_KEY_GOES_HERE';
$text = 'This is the text we are going to amplify.';

$params = array(
  'apiKey' => $key,
);
$url .= http_build_query($params);

$options = array(
  'http' => array(
    'method' => 'POST',
    'user_agent' => 'QueryPath/2.0',
    'header' => 'Content-type: application/x-www-form-urlencoded',
    'content' => http_build_query(array('inputText' => $text)),
  ),
);

$context = stream_context_create($options);
try {
  $qp = qp($url, NULL, array('context' => $context));
}
catch (Exception $e) {
  print "Could not retrieve data from OpenAmplify. " . $e->getMessage();
  exit;
}
?>
To begin, we define $url, $key, and $text. $url will just contain the URL of the OpenAmplify server. $key is the API key that OpenAmplify issues to you. $text is the text that we are going to amplify. Typically, this will come from some other source, like a document or user input.
The $params array is going to contain GET parameters. Posting documents to OpenAmplify will still require us to pass some information as part of the GET string. Namely, the API key must be passed in a URL like this:
http://portaltnx.openamplify.com/AmplifyWeb/AmplifyThis?apiKey=MY_KEY
The $options array is a little more complex. When working with QueryPath, we rely upon PHP stream contexts to configure a particular HTTP request. The $options array will be used to define the context.
For our purposes, we only need to configure the HTTP stream wrapper. Inside that array, we set the HTTP method to POST, define a user agent (which is optional), and add an HTTP header telling the remote server what type of encoding we are using. Finally, we add content, which holds the encoded name/value pairs that we need to pass to the server.
For our query, we only need to pass the inputText parameter, whose value will be the text we want to analyze.
Once we have the $options list created, we use this to build the stream context for PHP. And from there, we simply need to retrieve the data from OpenAmplify:
<?php
$context = stream_context_create($options);
try {
  $qp = qp($url, NULL, array('context' => $context));
}
catch (Exception $e) {
  print "Could not retrieve data from OpenAmplify. " . $e->getMessage();
  exit;
}
?>
There are a few things to notice here. First, the second argument to qp() is NULL simply because we do not need to search the returned document immediately. The third parameter holds the options for QueryPath. In our case, we need to set the stream context.
Finally, the try/catch block wraps this block so that we can detect an error right away.
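If you want to check what will be sent before making a real request, the URL and context can be built and inspected without touching the network. The sketch below uses only standard PHP functions. The helper name amplify_request() and the sample values are made up for this example.

```php
<?php
// Hypothetical helper, not part of QueryPath or OpenAmplify.
// Builds the request URL (API key in the GET string) and a stream
// context that POSTs the text as a form-encoded body.
function amplify_request($baseUrl, $apiKey, $text) {
  $url = $baseUrl . http_build_query(array('apiKey' => $apiKey));
  $options = array(
    'http' => array(
      'method' => 'POST',
      'header' => 'Content-type: application/x-www-form-urlencoded',
      'content' => http_build_query(array('inputText' => $text)),
    ),
  );
  return array($url, stream_context_create($options));
}

list($url, $context) = amplify_request(
  'http://portaltnx.openamplify.com/AmplifyWeb/AmplifyThis?',
  'MY_KEY',
  'Hello world'
);

// Read the options back out of the context to see what would be sent.
$opts = stream_context_get_options($context);
print $url . PHP_EOL;
print $opts['http']['method'] . ' body: ' . $opts['http']['content'] . PHP_EOL;
```

The returned context can be passed to qp() in the options array exactly as in the code above.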
This brief description should get you started when connecting to OpenAmplify. But what do you do with the resulting QueryPath object?
Working with the Results
From the point above, you can access the contents of the OpenAmplify XML document via the $qp variable. For example, this code prints out the top 20 proper nouns that OpenAmplify has identified:
<?php
$qp->find('ProperNouns>TopicResult>Topic>Name')->slice(0, 20);

// Set up the output:
$out = qp(QueryPath::HTML_STUB, 'body')->append('<ul/>')->find('ul');

// Add a list item for each noun:
foreach ($qp as $name) {
  $out->append('<li>' . $name->text() . '</li>');
}

// Write the contents to STDOUT
$out->writeHTML();
?>
The first line of the code snippet above searches for the name of every proper noun and then slices (keeps) the first 20. Note that we use the direct child combinator (>) because it is faster than using the any descendant combinator (represented by an empty space).
The $out QueryPath object will be the output document. The foreach loop simply goes through each proper noun and adds it to an unordered list in the output HTML document.
Finally, the results are printed out as an HTML document.
This should give you some ideas on how to work with OpenAmplify content from within QueryPath. To see some of the other techniques, take a look at the QP Services Drupal module, which makes frequent use of OpenAmplify data.
Reading ODT Files with QueryPath
One of the most popular word processing document formats is the ODT (Open Document Text) format, supported natively by OpenOffice.org, and supported as an export format for other major word processors, including Microsoft Office.
An ODT document is actually a ZIP archive composed of several files, including metadata, the document text, and embedded items. Most of these files are XML documents. And that means we can easily access their contents using QueryPath.
In this short article, we'll see how to access the contents of an ODT file.
The code
The text content of an ODT file is stored in the content.xml file inside of the ODT ZIP archive. To skim through it, you can do something like this:
$ unzip openoffice.odt
$ cat content.xml
The commands above will display the text contents (in XML format) of the document named openoffice.odt. This is the file we are going to parse.
For our simple example, what we are going to do is print out a plain-text outline of the document. We will build the outline by reading the section headers from the ODT file.
Here's the code:
<?php
require_once 'QueryPath/QueryPath.php';

$file = 'zip://openoffice.odt#content.xml';
$doc = qp($file);
foreach ($doc->find('text|h') as $header) {
  $style = $header->attr('text:style-name');
  $attr_parts = explode('_', $style);
  $level = array_pop($attr_parts);
  $out = str_repeat(' ', $level) . '- ' . $header->text();
  print $out . PHP_EOL;
}
?>
After requiring the QueryPath library, we get to work parsing the file. The file is a ZIP archive. Rather than unzip it ourselves, though, we want to use PHP's ZIP stream handler to uncompress it (as needed) internally. To cause PHP to invoke the ZIP stream handler, we use a special URL to refer to the file:
$file = 'zip://openoffice.odt#content.xml';
The above tells PHP to unzip the openoffice.odt file and access the content.xml file in that archive.
We can pass that URL straight into QueryPath, which will then unzip and access the desired data.
The foreach loop contains the brunt of our application code. It iterates over all of the headers in the document, and then formats some output based on the header.
Notice the CSS selector passed into find(). It is a little out of the ordinary: text|h. The pipe operator is rarely used in CSS 3 when you are working with HTML documents. It provides XML namespace support for CSS. So the above will look for elements like this:
<text:h>Header text</text:h>
Effectively, the pipe (|) in the selector replaces the colon in the tag name. In ODT, headers are stored as h tags inside of the namespace urn:oasis:names:tc:opendocument:xmlns:text:1.0, which in turn is usually aliased to text. Thus, text|h searches for all headers in the document.
So the iterator is now looping through all of the headers in the file. With each header, five things are done:
$style = $header->attr('text:style-name');
$attr_parts = explode('_', $style);
$level = array_pop($attr_parts);
$out = str_repeat(' ', $level) . '- ' . $header->text();
print $out . PHP_EOL;
First, we extract the style of the header. This will enable us to determine what level of heading this is (e.g. level 1, 2, 3, and so on).
Style is stored in the namespaced attribute text:style-name. Since we are retrieving the attribute, we use the entire XML name: text:style-name. (We do not replace the ':' with a '|' because we are not executing a CSS 3 Selector.)
A style name will look like this: text:style-name="Heading_20_2". The last digit, 2, indicates the level of the heading, and that is the number in which we are interested.
We retrieve the heading number by exploding the attribute value and then retrieving the last item from the resulting array:
$attr_parts = explode('_', $style);
$level = array_pop($attr_parts);
Now $level contains a digit indicating the header number. From there, we simply want to display the header formatted to indicate its depth in the outline:
$out = str_repeat(' ', $level) . '- ' . $header->text();
print $out . PHP_EOL;
The str_repeat function pads the beginning of the string with one space for every heading level. A first level heading will be indented one space. A third level heading will be indented three spaces. After the spaces, we add a dash (for formatting) and then the title of the section.
Finally, each line is printed to standard output.
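The level-parsing and formatting steps can also be tried on their own, without an ODT file on hand. The sketch below pulls them into two small functions. The names heading_level() and outline_line() are invented for this example, and the fallback to level 1 for style names that do not end in a number is an assumption of mine, not something the code above does.

```php
<?php
// Hypothetical helpers, not part of QueryPath.

// Extracts the heading level from a style name such as "Heading_20_2".
// Assumption: a style name that does not end in a number is treated
// as level 1.
function heading_level($style) {
  $parts = explode('_', $style);
  $last = array_pop($parts);
  return preg_match('/^\d+$/', $last) === 1 ? (int) $last : 1;
}

// Formats one outline line: one space per level, a dash, then the title.
function outline_line($style, $title) {
  return str_repeat(' ', heading_level($style)) . '- ' . $title;
}

print outline_line('Heading_20_1', 'Section One') . PHP_EOL;
print outline_line('Heading_20_2', 'Subsection A') . PHP_EOL;
```

Inside the foreach loop, outline_line($header->attr('text:style-name'), $header->text()) would produce the same line as the original code whenever the style name ends in a digit.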
So what does the output of this command look like? Let's take a quick look at our sample document as rendered by OpenOffice.org:
Notice the multiple levels of headings. Let's extract those now with the tool we just built:
 - Section One
  - Subsection A
  - Subsection B
 - Section 2
  - Subsection 2A
   - Item AA
   - Item BB
  - Subsection 2B
 - Conclusion
Our simple tool parsed OpenOffice.org's XML and displayed an outline based on the headings from the document.
The code presented here is based on documents output from OpenOffice.org 3.x. Because of the flexibility of XML, it is possible for other ODT documents to be generated which will not conform to the same namespace convention we have used. What does this mean? It means you may have to tweak this little example to make it work on certain files (YMMV). But the principles will remain the same.
Word DocX documents also contain an XML payload. In the future, perhaps we will examine parsing and reading such files.
Presentations from Drupal Camp Wisconsin
Last weekend, I joined a couple hundred other Drupal users at Drupal Camp Wisconsin at the University of Wisconsin, Madison. This well-organized two-day event was fantastic. I met many new people (and can now connect a face with an IRC handle for many more). And the crack team of conference organizers has already put together videos of many conference sessions.
For me, the conference highlights included a handful of sessions on Drupal in education, a pair of sessions on GIS and mapping, and a BOF that I attended on web services, portlets, and the future of distributed web applications. A perennial strong point for Drupal Camps is the coverage of Drupal basics. DrupalCampWI had around half a dozen sessions for beginners. If you are just learning Drupal, a camp like this can really help you find your footing.
The camp's commons area was fantastic, providing ample space for both small and large BOFs as well as impromptu brainstorming sessions. Many conferees stayed at the same hotel, making after-hours ad hoc get togethers easy. And Wisconsin food? I ate my first (and probably last) "bacon bratwurst pretzel burger with cheese."
I gave two sessions. The first was on JavaScript and jQuery in Drupal. The second was on Web Services, mashups and QueryPath in Drupal (a preview version of what I hope to show in Paris this September). Most of the conference sessions are now available in video form.
Update: Added link to QueryPath video
QueryPath 2.0 Alpha 1
QueryPath 2.0 Alpha 1 has been released. You can grab a copy for testing from the download page.
This new version adds some new methods, adds a few of the straggling CSS 3 Selectors, provides a new object for global configuration, and employs new (faster) internal data structures. You should notice speed improvements with this version -- even if you are using an opcode cache.
QueryPath 2.0 Alpha 1 is a testing release. Further API changes will be made before the final release (though few will have impact on your development). It is not recommended for large production sites.
Existing QueryPath adapters, such as the Drupal QueryPath module, should be able to use this new version seamlessly.
QueryPath 1.3 module released, now has an XML cache
The QueryPath module, version 1.3, is now available. This release adds a new submodule called QP Cache.
QP Cache is a cache system optimized for XML storage. It supports keys of arbitrary type and length (objects, strings, arrays) as well as fuzzy expiration dates ("2 weeks"). Cache lookups are very fast. Cache maintenance is left to the implementor. (In other words, Drupal cache clears have no impact on this cache, by design). The main use case, it is anticipated, is to store local copies of documents retrieved from remote web services. While QP Cache can be used without QueryPath, it provides integrated functions that make it trivially easy to work with QueryPath objects.
The Amplify module uses QP Cache to store OpenAmplify data. The code there is a good place to start when developing for QP Cache.
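To give a feel for the two ideas mentioned above (keys of arbitrary type and fuzzy expiration dates), here is a minimal sketch using plain PHP. It shows one way such features could be built and makes no claim about how QP Cache works internally. The functions cache_key() and cache_expiry() are invented for this example and are not part of the QP Cache API.

```php
<?php
// Hypothetical sketch; these functions are NOT QP Cache's API.
date_default_timezone_set('UTC');

// Turns a key of any serializable type (string, array, object) into a
// fixed-length string that is suitable for a lookup column.
function cache_key($key) {
  return md5(serialize($key));
}

// Resolves a fuzzy expiration such as "2 weeks" to a Unix timestamp,
// relative to the timestamp given in $now.
function cache_expiry($fuzzy, $now) {
  return strtotime($fuzzy, $now);
}

$key = cache_key(array('service' => 'openamplify', 'text' => 'Some text'));
$expires = cache_expiry('2 weeks', time());

print $key . PHP_EOL;
print date('Y-m-d', $expires) . PHP_EOL;
```

Hashing the serialized key keeps lookups on a short, fixed-length string no matter how large the original key is.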
TweetyPants: Mashup of the Day at Programmable Web
TweetyPants was selected as the "Mashup of the Day" on ProgrammableWeb.
TweetyPants demonstrates how QueryPath can be used to combine multiple XML and HTML sources. It takes a user's recent Twitter activity, cleans it up a little, and then submits it for analysis to OpenAmplify. OpenAmplify returns an XML document with a semantic analysis of the content. QueryPath then takes that information and generates some HTML to display the results in a silly way.

