Angle brackets are a way of life
Update: Minor code cleanup.
Prepending a list xs with a list ys to obtain the concatenation (of ys ++ xs ) of the two is usually a cheap operation which, when done non-destructively, requires only to copy the list ys to a new list y1s and link the last item of y1s to the head of xs.
Thus if we have to prepend K such lists:
y1s, y2s, ..., yks
starting with the prepend of y1s to xs, and prepending each of the following lists above to the result of the last prepend operation, the result will be a list consisting of the items of:
yks, ..., y2s, y1s, xs
in that order. Every list will be copied only once and the total operation of K prepends will require copying the items of all the lists. So, prepending a number of lists is a linear operation -- proportional to the total number of items in those lists.
On the other side, continuous appending of lists may be O(N^2) when done in a naive way, because the growing list will have to be copied again and again in each append operation.
In the XSLT 2.0 processors available today, the cost of repetitive prepending and appending are different than the above. I have studied two XSLT 2.0 processors, P1 and P2.
P1 optimizes repetitive appends, by achieving linear complexity. Its time complexity for repetitive prepending is O(N^2) -- square.
P2 doesn't optimize either way.
The simple transformation below:
|
<xsl:stylesheet
version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:my="my:namespace"
<xsl:variable
name="vRepetitions"
as="xs:integer"
<xsl:variable
name="vAppends"
as="element()*">
<xsl:template
match="/"> |
when run with P1 with different number of repetitions, takes the following times:
|
Appends |
Time, ms |
Remarks |
|
100 |
42 |
Hundred appends. |
|
1000 |
47 |
Thousand appends. |
|
10000 |
85 |
Ten thousand appends |
|
100000 |
458 |
One hundred thousands appends. |
|
1000000 |
7141 |
One million appends. |
We see that the claims of P1's linear or better performance are true.
P2 behaves much worse and has O(N^2) results:
|
Appends |
Time, ms |
Remarks |
|
100 |
21 |
Hundred appends. |
|
1000 |
7420 |
Thousand appends. |
|
5000 |
38100 |
Failed after 38 seconds. Five thousand appends |
|
10000 |
------ |
Failed. Ten thousands appends. |
With repetitive prepends both P1 and P2 have square time
complexity. This was tested with the following
transformation:
<xsl:variable name="vPrepends"
as="element()*">
<my:node/><my:node/><my:node/><my:node/><my:node/>
<my:node/><my:node/><my:node/><my:node/><my:node/>
</xsl:variable>
<xsl:template match="/">
<xsl:variable name="vNodes"
as="element()*">
<xsl:sequence
select="my:iter($vRepetitions, ())"/>
</xsl:variable>
<xsl:value-of
select="name($vNodes[last()])"/>
</xsl:template>
<xsl:function name="my:iter"
as="element()*">
<xsl:param name="pNumTimes"
as="xs:integer"/>
<xsl:param name="pstartSeq"
as="element()*"/>
<xsl:sequence select=
"if($pNumTimes gt
0)
then
my:iter($pNumTimes -1, insert-before($pstartSeq, 1,
$vPrepends))
else
$pstartSeq
"/>
</xsl:function>
</xsl:stylesheet>
P1's prepend results were the following:
|
Prepends |
Time, ms |
Remarks |
|
100 |
62 |
Hundred prepends. |
|
1000 |
1729 |
Thousand prepends. |
|
10000 |
204682 |
Ten thousand prepends |
P2's prepend results were again O(N^2) and much worse than P1's.
The problem I am trying to solve here is the following:
Can we implement an algorithm for repetitive prepends that will be run by P1 in linear time, in addition to its excellent linear performance of repetitive appends?
P1's O(N^2) repetitive prepends performance has stayed the same for years so maybe its developer thought it could not be improved or an improvement would be not too important.
At first, it may seem that repetitive prepends are not so important (as repetitive appends -- the process via which we create every output). However, some of the most important data structures, such as the stack, often undergo a long series of prepends (the "push" operation). Another important use-case is any attempt to port an existing application that uses repetitive list prepends.
The answer to this question is contained in my next post.
Posted at 00:34
Milestone 6 of the PsychoPath XPath 2.0 processor is now
available. I missed making the announcement last time when
Milestone 5 was available, so here is a combined list of bugs that
have been fixed with Milestone 5 and Milestone 6:
| 301539 | nor | P3 | All | mukul.gandhi@in.ibm.com | RESO | FIXE | [xpath2] fn:name function doesn't evaluate properly for zero arity |
| 298267 | nor | P3 | All | d_a_carver@yahoo.com | RESO | FIXE | [xpath2] problems with xpath2 "instance of" evaluation |
| 298535 | nor | P3 | All | d_a_carver@yahoo.com | RESO | FIXE | [xpath2] problems with xpath2 "instance of" evaluation, on attribute nodes |
| 298519 | nor | P3 | All | mukul.gandhi@in.ibm.com | RESO | FIXE | [xpath2] improvements to fn:number implementation, catering to node arguments |
| 297707 | maj | P3 | All | jesper@selskabet.org | RESO | FIXE | [xpath2] The type "empty-sequence()" is still "empty()" |
| 297958 | nor | P3 | All | jesper@selskabet.org | RESO | FIXE | [xpath2] fn:nilled never returns true |
Most of these are stablization and conformance fixes. The next
Milestone will focus purely on bug fixes and documentation updates.
Also, as of Milestone 5 of Web Tools Platform 3.2, PsychoPath is
the XPath 2.0 engine for the XPath View. If setting your XPath
preferences in the view to XPath 2.0, the PsychoPath processor will
be used.
As always, the user
manual is available on the wiki, and we encourage adopters and
the community to contribute to it's up keep. We plan to freeze the
documentation near the end of Milestone 7 and incorporate it into
the release candidate builds.
Posted at 20:54
The XML Security Working Group published four Working Drafts today:
Learn more about the Security Activity.
Posted at 18:15
What happened was, a few days ago I wanted to try out some fancy language technology on Android. A cat got in the way but the WiFi saved the day.
The way I understand it so far, the Android runtime is the Dalvik VM and the “Harmony+Android” libraries (unless you’re a hardcore gamer who wants to bend metal in C++ with the NDK). So I don’t think I need to be tied to curly braces and semicolons forever.
One obvious option is JRuby; no, there’s no rule saying you can’t have another runtime system on top of Android. JRuby runs on Android but isn’t really fast enough yet.
Duby, the Ruby-flavored statically-typed dynamic language that Charles Nutter has been cooking up, might be a good intermediate step, and Phil Hagelberg had built ohai-android to explore Duby on Android.
So I thought I’d get that running as a first step. Unfortunately, first of all you have to get a recent JRuby and Bitescript and then Duby itself, and figuring it all out took me a little while. During which time the cat snuck onto my lap and went to sleep.
Eventually, Ant built my Android app out of 14 lines of Duby, and I needed the USB cable so I could ship it to the phone. The phone was handy but the USB cable was across the room, and my elderly female cat has earned a few evenings of undisturbed lap time.
Then I remembered that my laptop has a Web server and my phone
was on the same home LAN, so I copied the .apk over to
/Library/WebServer/Documents/whatever.apk and did an
ifconfig -a | grep 192 to find my
address and then pointed the phone’s browser at
http://192.168.1.57/whatever.apk. The phone installed
the app, I proved to myself that it worked, and did some further
enjoyable tinkering, all while routing round the cat.
Posted at 06:34
Update: Minor correction in the last two rows of the table -- thanks to a comment by Michael Ludwig. I will talk about the efficiency of this and other related XPath expressions in my next post.
In my first post I provided a compact one-liner XPath expression that obtains all duplicate items in a given sequence.
Since then a number of people have contacted me, asking for an explanation what this expression means and how it is really evaluated by a compliant XPath 2.0 engine.
C. M. Sperberg-McQueen, a MIT professor and member of W3C, blogged about it:
"Dimitre’s solution is beautiful and concise: 21 characters long (longer if you use variables with names longer than singler characters), and well worth the five or ten minutes it took me to work out why it works. I won’t explain it to you, dear reader; you deserve the brief puzzlement and Aha-moment of understanding it."
"Dimitre gets my vote for this month’s best programming-language application of Strunk and White’s rule: “Omit needless words!” "
I have always admired any of Michael's works and his high recognition is the best reward I would not have dared dreaming of.
Now that enough time has passed for anyone to try and understand how the solution works, let me explain it myself.
From the very start working on this problem I decided that a good solution must use the available standard functions on sequences.
While distinct-values() was mentioned by the OP, it is not directly useful here, as we are looking for any item that occurs more than once in the sequence.
We can eventually think of using exists() or empty(), however the one function that is most relevant to the task is index-of(). As explained in the F & O Spec.,
15.1.3 fn:index-of
|
fn:index-of( |
$seqParam |
as xs:anyAtomicType*, |
|
$srchParam |
as xs:anyAtomicType) as xs:integer* |
|
|
fn:index-of( |
$seqParam |
as xs:anyAtomicType*, |
|
$srchParam |
as xs:anyAtomicType, |
|
|
$collation |
as xs:string) as xs:integer* |
Summary: Returns a sequence of positive integers giving the positions within the sequence $seqParam of items that are equal to $srchParam.
The first step in the solution is to ask the question:
How are duplicate items different from unique items (in terms of index-of() ) ?
The answer comes naturally:
For any unique item vUi, the sequence that
index-of($vSeq, $vUi)
produces, is just of length of one.
For any duplicate item vDup, the sequence that
index-of($vSeq, $vDup)
produces, is of length of two or more.
In other words,
index-of($vSeq, $vDup)
always has a second item, while
index-of($vSeq, $vUi)
never has a second item.
To express this in XPath 2.0, the expression:
index-of($vSeq, $vItem)[2]
produces the empty sequence for any unique item $vItem, and it produces an xs:integer index for any $vItem that occurs two or more times in the sequence.
Then one way to find all duplicate items in the sequence is to evaluate:
$vSeq[exists(index-of($vSeq, .)[2])]
In XPath 2.0 the expression in the [] brackets that follow a sequence is called a FilteringExpression. It is evaluated for every item of the sequence and only items that satisfy this predicate are produced.
Let $vSeq is defined as:
1, 2, 2, 3, 4, 5, 5, 6, 7
Then
$vSeq[exists(index-of($vSeq, .)[2])]
produces:
2, 2, 5, 5
This is quite close to the result we need, we just need to dedup it:
distinct-values
($vSeq[exists(index-of($vSeq,
.)[2])])
produces the wanted result:
2, 5
This is fine, but the last expression seems already too long. Can we do even better?
What about this one:
$vSeq[index-of($vSeq,.)[2]]
It produces :
2, 5
exactly the wanted result and it is really tight.
The only problem is to explain how this expression "works" :)
Let me admit that on the morning after discovering this solution I found I had completely forgotten the rational behind it. After failing again and again to explain it, I finally asked the Gods for help.
Here is the explanation provided by Dr. Michael Kay:
"This expression
$vSeq[index-of($vSeq,.)[2]]
means
$vSeq [position() = index-of($vSeq,.)[2]]
Let's evaluate the predicate for each item in $vseq:
|
pos |
value |
index-of($vSeq,.) |
index-of($vSeq,.)[2] |
pos=index-of($vSeq,.)[2] |
|
1 |
1 |
1 |
() |
false |
|
2 |
2 |
2, 3 |
3 |
false |
|
3 |
2 |
2, 3 |
3 |
true |
|
4 |
3 |
4 |
() |
false |
|
5 |
4 |
5 |
() |
false |
|
6 |
5 |
6, 7 |
7 |
false |
|
7 |
5 |
6, 7 |
7 |
true |
|
8 |
6 |
8 |
() |
false |
|
9 |
7 |
9 |
() |
false |
So the only items that match the predicate are those in positions 3 and 7, with values (2, 5)."
To summarize, the critical step in understanding how this works is to remember that if the type of the FilteringExpression is xs:integer, then
$someSequence[someFilteringExpression]
is simply an abbreviation for:
$someSequence[position() = someFilteringExpression]
Therefore,
$vSeq[index-of($vSeq,.)[2]]
means :
All items from $vSeq such that their position is equal to the value of the index of their second occurence in the sequence.
As there is at most one item at any given position, this is how we get just one "2" and one "5" in
2, 5
as contrasted to two of each in
2, 2, 5, 5
Concluding, here is one final note:
If you have ever used the Muenchian method of grouping, you may have spotted its resemblance to the current solution.
A typical expression used in Muenchian grouping is something like this:
<xsl:key match="person" name="kPersByName" use="@name"/>
<xsl:for-each select=
"*/person
[generate-id()
=
generate-id(key('kPersByName',
@name)[1])
]">
<!-- Do something -->
</xsl:for-each>
In the above, the role of the filtering expression is played by:
generate-id()
=
generate-id(key('kPersByName',
@name)[1])
and this is the same as the role played by:
index-of($vSeq,.)[2]
in finding the duplicate items.
The Muenchian method uses an index of "1" in:
key('kPersByName', @name)[1]
and in our solution:
index-of($vSeq,.)[2]
a similar role is played by the index "2".
"1" in the Muenchian method means: Take one node from each group of nodes with unique keys. Because any group can consist just of only one node, it is only safe here to take the first node from any group.
In the case of finding all duplicate items in a sequence, we want to pick one index from the list of indexes of any duplicate item. Having a first index does not qualify an item as duplicate -- it needs to have at least a second index in the sequence. On the other side there may be duplicate items that occur only twice (not thrice or more times) and they will not have a third index.
Therefore, the only safe way to take an existing index for any duplicate item is to take its second index.
If you have successfully read up to here, you would have little difficulty solving the following problems, provided as exercises:
Let $vSeq be (1, 2,2, 3,3,3, 4,4,4,4, 5,5, 6, 7)
P1. Produce all items in $vSeq that occur only once in it.
P2. Produce all items in $vSeq that occur exactly two times in it.
P3. Produce all items in $vSeq that occur exactly three times in it (triples).
P4. Produce all items in $vSeq that occur exactly four times in it (quadruples).
P5. Produce all items in $vSeq that occur more than once but less than four times.
Hint: You may consider using the last() function in your solution.
Posted at 05:31
I was recently on a panel at the South by South West interactive conference (SXSW) where we discussed multiple applications of the real-time Web and the things that might prevent us from seeing its true potential. I’ve found it interesting that the key take away from the panel is that privacy issues will be one of the biggest problems we will face as we move forward. You can see this perspective in CNN’s coverage of the panel in the story Privacy concerns hinder 'real-time Web' creation, developers say and GigaOm’s write-up SXSW: Is Privacy on the Social Web a Technical Problem?
This overlap of privacy and real-time web features is brought into sharp relief when you look at services such as Foursquare and Gowalla which provide a mechanism for people to broadcast their physical location to a group of friends in real-time. I started using Foursquare last week and I’ve noticed I’m even more careful about who I accept friend requests from than on Facebook or Windows Live Messenger. The fact is that I may share status updates and photos with people but it doesn’t mean I want them to be aware of where I am on an up to the minute basis especially if I’m out spending time with my family and friends. This difference in how we view location data from other sorts of real-time data we share is captured by the co-founder of Foursquare in the article Facebook Isn't For Real Life Friends Anymore, Says Foursquare's Dennis Crowley where it states
Facebook plans to clone Foursquare's central service -- the ability for site members to use their phones to "check-in" from restaurants and bars -- and make it a mere Facebook feature.
But Foursquare cofounder Dennis Crowley says there's something Facebook can't clone: the real-life friendships between Foursquare users.
"Facebook used to be who your friends are, now it's everyone," Dennis told us in an interview.
"[Foursquare] is more tightly curated to who you want to have as your check-in friends. Facebook is good place for status updates and sharing photos, not to keep tabs on where people are going."
I think Crowley is on to something when he says Facebook can’t clone the Foursquare relationship model. I suspect that like Twitter, Foursquare has created a social network whose value proposition is differentiated enough from Facebook’s that it can grow into a relatively popular albeit smaller service that will not be “killed” by Facebook*. Secondly, there is a lot of synergy between Foursquare and Facebook as evidenced by the fact that Facebook is the largest referrer of traffic to Foursquare thanks to their implementation of Facebook Connect. So I think the claims that one will kill the other is just the usual tech press creating conflict to generate page views.
One thing I have noticed is that location can’t just be a field you bolt on to a status update. It has to be a key part of the information you are sharing with others otherwise it adds little value to the user experience and in fact may detract from it by adding clutter. For example, compare what a location-based update from Foursquare looks like on Facebook versus what the exact same update looks like on Twitter

VS

The difference between both updates is almost night and day even though the actual status text I shared is the same. The way Twitter has approached location is to treat it as a bunch of “poorly translated” GPS coordinates that are bolted on to the end of my status update. The Facebook update not only gives you that but also a human readable location for where I am down to the room number and includes some social context such as the fact that I was attending the talk with two coworkers from Windows Live.
As real-time location data starts to permeate social experiences, there’s a lot to learn from the above screenshots. In the example above, people who are interested in the topic based on my status knew which room to find danah’s talk from the Facebook update whereas they were told “downtown austin” in the Twitter update. As designers of social software applications, we should be mindful that location data enhances the experience and the information being shared. Adding location simply for buzzword compliance or to add metadata to the status update without enhancing the experience actually ends up crufting it up.
* Twitter’s value proposition is that it is the place to interact with celebrities and microcelebrities that you care about. It is useful to note that the much maligned Suggested Users List was key in establishing this value proposition in the minds of users. This is different from Facebook’s position as the social network for your real world friends, family, coworkers and acquaintances.
Now
Playing:
B.O.B. -
Nothin' On You (featuring Bruno Mars) 
Posted at 16:05
China's national standards body CNIS has a draft document out Guide for the Implementation of the Inclusion of Patents in National Standards. (For an English translation see the first column of this.) Wang Yiyi's China's Approach to Standards-related Intellectual...
Posted at 14:27
With his twelve posts under the common title of "The Wide Finder Project" Tim Bray formulated a problem, which obviously has since then stirred some people, anxious to prove that their programming language of choice fares well in implementing solutions for this class of problems.
While I am not aware of any XSLT implementation that provides explicit or implicit support for parallel processing (with the obvious goal to take advantage of the multi-core processors that have almost reached a "prevalent" status today), I was curious to determine at least two things:
Before going further let me remind that there is a popular (and as we'll see unfounded!) belief that any XSLT solution must be hugely inefficient and that any XSLT code can only be ugly. In fact, the following comment to one of Tim's posts reflects exactly this mindset:
|
"From: Rornan (Sep 25 2007, at 08:31) Tim, If you had anything at all to do with creating XSLT, then you have no right at all to comment on any other language deficency ever ever again." |
My initial XSLT solution to Tim's problem is below. No optimization attempts have been attempted, not only because I don't have an XSLT processor that utilizes multi-core processor, but also because it seems there's only a limited possibility for optimization (adding more CPU's is not likely to speed up the reading of a huge file from a single drive).
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xsl:output method="text"/>
<xsl:variable name="vLines" as="xs:string*" select= "tokenize(unparsed-text('file:///C:/Log1000000.txt'),'\n')"/>
<xsl:variable name="vRegEx" as="xs:string" select= "'^.*?GET /ongoing/When/[0-9]{3}x/([0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+) .*$|^.+$'"/>
<xsl:template match="/"> <xsl:for-each-group group-by="." select= "for $line in $vLines return replace($line,$vRegEx,'$1')[.]"> <xsl:sort select="count(current-group())" order="descending" />
<xsl:if test="not(position() > 10)"> <xsl:value-of select=
"concat(count(current-group()),':',current-grouping-key(),'
')"/> </xsl:if> </xsl:for-each-group> </xsl:template> </xsl:stylesheet>
This 22-line-transformation is performed by Saxon 9.0 on a 3GHz DELL desktop. The file Log1000000.txt was constructed using the 10000 lines file provided by Tim and copying it into another file 100 times. The size of this file is about 200MB.
The results were produced in 36.175 seconds using about 929MB of RAM. Saxon should be timed using the -repeat:3 command-line option, which performs the transformation three times. Using only the results from the first/single transformation will be misleading, as they include the time for the Java VM startup.
1. How well a good XSLT 2.0 processor and a straightforward solution fare against other languages/solutions?
The non-optimized, uniprocessor version of this solution has a time of 36 seconds for processing about 200MB log file. It is likely the timing for the full , five times bigger log file used by Tim Bray, will be about 5 times bigger: 180 sec, or about 3 minutes. 3 minutes for processing 1GB of data is not a bad time for an XSLT transformation, considering that sizes even of a few hundreds of megs are still considered monstrous.
XSLT could be in fact one of the winners in the WF competition, had its creators stepped just a little bit further exploiting the natural, non-sequential XSLT processing model. Here I am talking about specifying a single f:parMap() function, which can be defined exactly as the current FXSL f:map() function:
<xsl:function name="f:map" as="item()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pList1" as="item()*"/> <xsl:sequence select= "for $this in $pList1 return f:apply($pFun, $this)" /> </xsl:function>
The only difference is that f:parMap() adds to the semantics of f:map() the hint to use as many available CPU processors as appropriate when evaluating the "for" expression in the code of the function.
Yes, this can be done for any "/" operator or for any "for" expression such as the one used in lines 13 - 14 in our XSLT solution:
"for $line in $vLines
return replace($line,$vRegEx,'$1')[.]"
and for any <xsl:apply-templates .../> instruction, and for any <xsl:for-each .../> instruction, and for any <xsl:for-each-group ...> instruction and ... and ... and ...
Judging from the published results, this straightforward, non-optimized uniprocessor XSLT solution comes not too-far behind some of the optimized for multi-processing solutions. I hope that in the not so distant future there will be XSLT processors exploiting the inherent non-sequential nature of XSLT processing to implement highly-optimized multi-processing solutions.
2. Where is the XSLT code on the scale of "beautiful code"?
By its compactness (22 lines) it is in 3rd place and rivals the 2nd place (17 lines), as we could easily remove 4 lines (3 of them blank) used to add readability. One of Tim's criteria for "beautiful code" is avoiding awkwardness.He speaks about Ruby not needing ending semicolons or variable type declarations. However, Tim's solution still had to use two lines for defining a hash and setting its default to 0. By contrast, the XSLT solution above does not require the programmer to introduce and initialize a special data structure -- the support for grouping is right there in the core of the language. There is simply no way the programmer can do something wrong in declaring or using a hash table.
Both the timing and especially the amount of RAM used indicate how the XSLT processor did its work:
Imagine that the XSLT Processor is really lazy:
Using these rules a truly lazy XSLT processor will only need space large enough for the longest line, and for a hash table to keep the keys and counters for all different articles being grouped. In this case there are just 562 unique values extracted from the strings of $vLines.
In this way, the processing -- reading a line, finding the article it contains and feeding this to the hash table -- could be accomplished in streaming mode while reading the text file. Upon reaching the end of the file, there would be very little left to do, and thus almost nothing to be further optimized.
I truly believe that the described improvements can be implemented by at least some of the best XSLT 2.0 processors around here. For many years I have been using Saxon and am very grateful to its developer Dr. Michael Kay. The performance efficiency has been constantly growing -- a very good example to follow by all other XSLT vendors and if they do there will be competition, and competition is good for us all.
Posted at 09:30
As of this morning I work for Google. The title is “Developer Advocate”. The focus is Android. Fun is expected.
Google and I have been a plausible match for a long time. Web-centric, check. Search, check. Open-source, check. The list goes on. We’ve talked repeatedly over the years, but the conversations all ended at the point when I said “...and I don’t want to move to the Bay Area”.
Then that changed. The process started with Dan Morrill who led me to Mike Winton who led me through the notorious Google Interview Process. I think I talked to eleven people in the course of my day there, failing one logic puzzle but acing the what-does-a-browser-actually-do test. Then they made an offer and I accepted and here I am. By “here” I mean Vancouver; I’ll be working remotely.
Leaving the Googleplex after a day of interviewing; taken with my previous Android phone, the Dev Phone 2.
I’d had an offer to stay with Oracle which I decided to decline; I’ll maybe tell the story when I can think about it without getting that weird spiking-blood-pressure sensation in my eyeballs. So I reached out to a couple of appealing potential next employers, both were interested, and Google seemed like the best bet.
It’s now too big to be purely good or in fact purely anything. I’m sure that tendrils of stupidity and evil are even now finding interstitial breeding grounds whence they will emerge to cause grief. And there are some Google initiatives that I feel no urge to go near.
But there are those Ten Things and you know, I’m down with ’em. Unreservedly.
The reason I’m here is mostly Android. Which seems to me about as unambiguously a good thing as the tangled wrinkly human texture of the Net can sustain just now. Here’s why:
It’s not good to be on the Net at all times, but it’s very good to have the Net available at all times.
Google needs, and is committed to, Android; it’s not just a hobby.
The Android user experience is very good and, more important, getting better fast.
It’s developer-friendly; the barriers to entry are very low for the several million people on the planet who are comfy with the java programming language.
The APIs are pretty good in my experience, and even more important, complete. Near as I can tell, there’s nothing interesting the phones can do that’s not exposed through some API or other.
Anyone can build any hardware they want around the Android software; no approval required.
Anyone can sell any program they write via the Android Market; no approval required.
It’s open-source.
The smartphone arena where Android plays is extra interesting right now, with space for radical experimentation both on the technology and business fronts.
The mobile space has had a huge impact in the emerging economies of the less-developed world and I think that’s just getting started. I want to be part of that story and Android seems like the right software platform for it.
I’ll enjoy competing with Apple.
As of now, they’re selling around 90K iPhones per day compared to around 60K Android handsets. It’s a horse race!
The iPhone vision of the mobile Internet’s future omits controversy, sex, and freedom, but includes strict limits on who can know what and who can say what. It’s a sterile Disney-fied walled garden surrounded by sharp-toothed lawyers. The people who create the apps serve at the landlord’s pleasure and fear his anger.
I hate it.
I hate it even though the iPhone hardware and software are great, because freedom’s not just another word for anything, nor is it an optional ingredient.
The big thing about the Web isn’t the technology, it’s that it’s the first-ever platform without a vendor (credit for first pointing this out goes to Dave Winer). From that follows almost everything that matters, and it matters a lot now, to a huge number of people. It’s the only kind of platform I want to help build.
Apple apparently thinks you can have the benefits of the Internet while at the same time controlling what programs can be run and what parts of the stack can be accessed and what developers can say to each other.
I think they’re wrong and see this job as a chance to help prove it.
The tragedy is that Apple builds some great open platforms; I’ve been a happy buyer of their computing systems for some years now and, despite my current irritation, will probably go on using them.
Not sure yet. Obviously I’ll go on blogging here.
Are you an Android developer? Or might you become one? Or have you given up on Android? If you’re any of these, you’re a person I need to learn from. Help teach me, I’m easy to find: twbray at google.com.
A few other things are obvious to me: I’m going to have to buckle down and write a useful Android app so that I have a better feel for the issues. I’m going to have to get savvier about HTML5-based applications, because a lot of smart people think the future’s there, that the “native app” notion will soon seem quaint. I’m going to have to dig in and really understand the Android Market. I’m going to have to spend a lot of time at the Googleplex to get to know the people.
There’s nothing that says I’m just doing Android, but it seems that there’s enough Android work to keep a dozen of me busy. A couple of other things have come up in the conversation where I might be useful; we’ll see.
I’m not going to change the tone here; I admire the creamy gloss of the language on the official Google Web properties, but that ain’t me. Just like the disclaimer says, what it says here is what I think, don’t count on Google or anyone else agreeing with it or even having seen it before I publish it. Disclosure: Google asked to see an advance draft of the piece you’re now reading “for coordinating messaging”, but didn’t suggest any changes.
I’m probably not going to get much involved in the social-networking arena. I see myself as behind the pack on that stuff; still can’t explain why it is I like Twitter so much more than Facebook, and loathe FriendFeed.
I’m not going to stop liking Ruby. To start with, there are things like Ruboto and ohai-android, which I have running on my Nexus One. Plus, I never bought into the notion that serious coding requires curly braces and semicolons.
I’m not going to stop worrying about concurrent programming, because our failure to equip developers to do it right is going to bite our asses just as hard in the mobile space as anywhere else. Maybe harder, since mobiles are power-starved by definition and current data seem to show that slower many-core CPUs give you more computing per milliwatt.
We’re close to Vancouver’s excellent Mount Pleasant Community Centre and take our kids to the library there. It has good free WiFi and lots of public-access computers. When we visit I always make a surveillance pass, glancing over shoulders at screens. Some days, I see Google on more than half of them.
That, and Android; that’s why.
Posted at 07:08
Posted at 04:49
Update: (Think that) finally managed to get from Windows Live Spaces the formatting I wanted...
Update: Enhanced the JSON Lexer to properly deal with escaped characters within strings. How to handle \b and \f ???
====================================
Ever wanted to access and manipulate JSON as ordinary XML? To transform it with XSLT?
No problem, use the f:json-document() and f:json-file-document() as provided by FXSL.
Here is a quick example:
Let's have (yes, this is the Yahoo Traffic Service):
|
Then, f:json-document($vStr) evaluates to:
|
and you can transform nicely this XML document with XSLT now.
The functions f:json-document() and f:json-file-document() are available immediately from the CVS of the FXSL project.
All this pure XSLT magic (and sure, expect more to come) is possible with using the LR Parsing Framework implemented in FXSL.
More to come soon.
Posted at 04:48
In her recent post "RELAX NG for matching" Jeni Tennison said:
The “grouper” is the most interesting and difficult of these. It needs to act like a tokeniser, except use regular expressions over markup rather than over text. For example, say I had:
<number>06</number><punc>/</punc><number>03</number><punc>/</punc><number>08</number>I want to be able to create a rule that says “any sequence that looks like a number element that contains a
two-digit number between 1 and 31, followed by a punc element that contains a slash, followed by another two-digit number between 1 and 12, followed by a punc element that contains a slash, followed by another two-digit number should be wrapped in a date element”.Now this is something that XPath is really bad at. Try writing an expression that selects, from a sequence of elements that may contain other
<number>and<punc>elements as well as other elements, only those sequences of elements that match the pattern I just described. It’s something like:number[. >= 1 and . <= 31 and string-length(.) = 2] [following-sibling::*[1]/self::punc = '/'] [following-sibling::*[2]/self::number[. >= 1 and . <= 12 and string-length(.) = 2]] [following-sibling::*[3]/self::punc = '/'] [following-sibling::*[4]/self::number[string-length(.) = 2]] /(self::number, following-sibling::*[position() <= 4])which is fiddly and messy and only works in this particular example because I know precisely how many elements there are
supposed to be in the group.
Then she proceeds by making the suggestion that
"As a language, RELAX NG is really good at this"
Jeni ends with the following statement:
"So I think a “grouper” component should use RELAX NG to identify sequences to be marked up. But I have no idea if there are RELAX NG libraries out there that can be used in this way: to identify and extract matching sequences rather than to validate entire documents"
It is obvious that the solution of this problem
is not strictly bound to RELAX NG
itself
(which just happens to be able to parse a schema defined using the
RNC grammar). The tool Jeni reasons about would be any
compiler writing system that allows additional grammar
rules, that are not required to reach the start symbol of the
language, but may be useful for other purposes.
Very fortunately, I know at least one such system:
the Labelled BNF Grammar Formalism (LBNF).
Jeni's "grouper" tool can be implemented by
adding additional rules to the language being specified
using LBNF's "internal pragmas".
Here's how the authors of LBNF Markus Forsberg,
Aarne
Ranta from Chalmers University of Technology
and the University of
Gothenburg describe LBNF's internal
pragmas:
6 LBNF Pragmas
6.1 Internal pragmas
Sometimes we want to include in the abstract syntax structures that are not part of the concrete syntax, and hence not parsable. They can be, for instance, syntax trees that are produced by a type-annotating type checker. Even though they are not parsable, we may want to pretty-print them, for instance, in the type checker’s error messages. To define such an internal constructor, we use a pragma
"internal" Rule ";"
where Rule is a normal LBNF rule. For instance,
internal EVarT. Exp ::= "(" Ident ":" Type ")";
introduces a type-annotated variant of a variable expression.
Of course, LBNF is a very nice and cool tool to carry out a number of really important language development tasks, besides the "grouper".
Posted at 04:48
Actually, it is not a knife and is known under the name of
Finger Tree.
Background:
Created by Ralf
Hinze and Ross Paterson in 2004, and
based to a large extent on the work of Chris
Okasaki on Implicit Recursive Slowdown and Catenable
Double-Ended Queus, this data structure, to quote the
abstract of the paper introducing Finger Trees,
is:
"a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece. Representations achieving these bounds have appeared previously, but 2-3 finger trees are much simpler, as are the operations on them. Further, by defining the split operation in a general form, we obtain a general purpose data structure that can serve as a sequence, priority queue, search tree, priority search queue and more."
Why the finger tree deserves to be called the Swiss knife of data structures can best be explained by again quoting the introduction of the paper:
"The operations one might expect from a sequence abstraction
include adding and removing elements at both ends (the deque
operations), concatenation, insertion and deletion at arbitrary
points, finding an element satisfying some criterion, and splitting
the sequence into subsequences based on some property. Many
efficient functional implementations of subsets of these operations
are known, but supporting more operations efficiently is difficult.
The best known general implementations are very complex, and little
used.
This paper introduces functional 2-3 finger trees, a general
implementation that performs well, but is much simpler than
previous implementations with similar bounds. The data structure
and its many variations are simple enough that we are able to give
a concise yet complete executable description using the functional
programming language Haskell (Peyton Jones, 2003). The paper should
be accessible to anyone with a basic knowledge of Haskell and its
widely used extension to multiple-parameter type classes (Peyton
Jones et al., 1997). Although the structure makes essential use of
laziness, it is also suitable for strict languages that provide a
lazy evaluation primitive."
Efficiency and universality are the two most attractive features of finger trees. Not less important is simplicity, as it allows easy understanding, straightforward implementation and uneventful maintenance.
Stacks support efficient access to the first item of a sequence only, queues and deques support efficient access to both ends, but not to an randomly-accessed item. Arrays allow extremely efficient O(1) access to any of their items, but are poor at inserting, removal, splitting and concatenation. Lists are poor (O(N)) at locating a randomly indexed item.
Remarkably, the finger tree is efficient with all these operations. One can use this single data structure for all these types of operations as opposed to having to use several types of data structures, each most efficient with only some operations.
Note also the words functional and persistent, which mean that the finger tree is an immutable data structure.
In .NET the IList<T> interface specifies a number of void methods, which change the list in-place (so the instance object is mutable). To implement an immutable operation one needs first to make a copy of the original structure (List<T>, LinkedList<T>, ..., etc). An achievement of .NET 3.5 and LINQ is that the set of new extension methods (of the Enumerable class) implement immutable operations.
In the year 2008, Finger Tree implementations have been known only in a few programming languages: in Haskell, in OCaml, and in Scala. At least this is what the popular search engines say.
What about a C# implementation? In February Eric Lippert had a post in his blog about finger trees. The C# code he provided does not implement all operations of a Finger Tree and probably this is the reason why this post is referred to by the Wikipedia only as "Example of 2-3 trees in C#", but not as an implementation of the Finger Tree data structure. Actually, he did have a complete implementation at that time (see the Update at the start of this post), but desided not to publish it.
My modest contribution is what I believe to be the first published complete C# implementation of the Finger Tree data structure as originally defined in the paper by Hinze and Paterson (only a few exercises have not been implemented).
Programming a Finger Tree in C# was as much fun as challenge. The finger tree structure is defined in an extremely generic way. At first I even was concerned that C# might not be sufficiently expressive to implement such rich genericity. It turned out that C# lived up to the challenge perfectly. Here is a small example of how the code uses multiple types and nested type constraints:
// Types:
// U -- the type of Containers that can be split
// T -- the type of elements in a container of type U
// V -- the type of the Measure-value when an element is measured
public class
Split<U, T, V>
where U : ISplittable<T, V>
where T :
IMeasured<V>
{
// ..........................................................
}
Another challenge was to implement lazy evaluation (the .NET term for this is "deferred execution") for some of the methods. Again, C# was up to the challenge with its IEnumerable interface and the ease and finesse of using the "yield return" statement.
The net result: it was possible to write code like this:
public override IEnumerable<T> ToSequence()
{
ViewL<T,
M> lView = LeftView();
yield
return lView.head;
foreach (T t
in lView.ftTail.ToSequence())
yield return t;
}
Another challenge, of course, was that one definitely needs to understand Hinze's and Ross' article before even trying to start the design of an implementation. While the text should be straightforward to anyone with some Haskell and functional programming experience, it requires a bit of concentration and some very basic understanding of fundamental algebraic concepts. In the text of the article one will find a precise and simple definition of a Monoid. My first thought was that such academic knowledge would not really be necessary for a real-world programming task. Little did I know... It turned out that the Monoid plays a central role in the generic specification of objects that have a Measure.
I was thrilled to code my own version of a monoid in C#:
public class Monoid<T>
{
T theZero;
public delegate T monOp(T t1, T
t2);
public monOp theOp;
public Monoid(T tZero, monOp aMonOp)
{
theZero = tZero;
theOp = aMonOp;
}
public T zero
{
get
{
return theZero;
}
}
}
Without going into too-much details, here is how the correct Monoids are defined in suitable auxiliary classes to be used in defining a Random-Access Sequence, Priority Queue and Ordered Sequence:
public static class Size
{
public static Monoid<uint> theMonoid =
new Monoid<uint>(0, new Monoid<uint>.monOp(anAddOp));
public static uint anAddOp(uint s1, uint s2)
{
return s1 + s2;
}
}
public static class Prio
{
public static Monoid<double> theMonoid =
new Monoid<double>
(double.NegativeInfinity,
new Monoid<double>.monOp(aMaxOp)
);
public static double aMaxOp(double d1, double d2)
{
return (d1 > d2) ? d1 : d2;
}
}
public class Key<T, V> where V : IComparable
{
public delegate V getKey(T t);
// maybe we shouldn't care for NoKey, as this is too theoretic
public V NoKey;
public getKey KeyAssign;
public Key(V noKey, getKey KeyAssign)
{
this.KeyAssign = KeyAssign;
}
}
public class KeyMonoid<T, V> where V : IComparable
{
public Key<T, V> KeyObj;
public Monoid<V> theMonoid;
public V aNextKeyOp(V v1, V v2)
{
return (v2.CompareTo(KeyObj.NoKey) == 0) ? v1 : v2;
}
//constructor
public KeyMonoid(Key<T, V> KeyObj)
{
this.KeyObj = KeyObj;
this.theMonoid =
new Monoid<V>(KeyObj.NoKey,
new Monoid<V>.monOp(aNextKeyOp)
);
}
}
Yet another challenge was to be able to create methods dynamically, as currying was essentially used in the specification of finger trees with measures. Once again it was great to make use of the existing .NET 3.5 infrastructure. Below is my simple FP static class, which essentially uses the .NET 3.5 Func object and a lambda expression in order to implement currying:
public static class FP
{
public static
Func<Y, Z> Curry<X, Y,
Z>
(this Func<X, Y, Z> func, X x)
{
return (y) => func(x, y);
}
}
And here is a typical usage of the currying implemented above:
public T ElemAt(uint ind)
{
return treeRep.Split
(new MPredicate<uint>
(
FP.Curry<uint, uint, bool>(theLessThanIMethod2, ind)
),
0
).splitItem.Element;
}
Now, for everyone who have reached this point of my post, here is the link to the complete implementation.
Be reminded once again that .NET 3.5 is needed for a successful build.
In my next posts I will analyze the performance of this Finger
Tree implementation and how it fares compared to existing
implementations of sequential data structures as provided by
different programming languages and environments.
Posted at 04:47
Michael Crichton died unexpectedly at an age of 66.
As with Asimov, will miss the continuation of his writing, which I had taken for granted.
Is this really the end of an epoch?
Posted at 04:47
A few days ago Amazon Upgrade was announced, which lets you
access (some of) the books you bought from them -- for a reasonable
fee.
Here's how the announcement looks like:
|
Get online access to your physical books with Amazon Upgrade and you can:
And you can do all this from any Internet-connected computer,
meaning your book is always with you.
Online access to your books with Amazon
Upgrade Select All Amazon Upgrade price: $6.99
Amazon Upgrade price: $7.99 0 title(s) selected. Total: $0.00* (Not all books are eligible. Why?) * Sold by Amazon Digital Services, Inc. - Additional taxes may
apply |
Now, if only this book could also be made accessible:

Posted at 04:45
Someone recently asked the following question:
"I have an XPath expression which provides me a sequence of values like the one below:
1 2 2 3 4 5 5 6 7
It is easy to convert this to a set of
unique values "1 2 3 4 5 6 7" using the distinct-values function.
However, what I want to extract is the list of duplicate values =
"2 5". I can't think of an easy way to do this.
Can
anyone help?"
One of the best XPath/XSLT experts, whom I greatly respect, provided the following answer:
"What about:
distinct-values(
for $item in $seq
return if (count($seq[. eq $item]) > 1)
then $item
else ())
This iterates through the items in the
sequence, and returns the item if the number of items in the
sequence that are equal to that item is greater than one. You then
have to use distinct-values() to remove the duplicates
from that list."
Can you do better?
My own solution is a 21 chars one-liner, and would only be slightly longer, if efficiency would need to be addressed.
Posted at 04:44
In Part 1 of these series I described an implementation of the Finger Tree data structure in C#.
Part 2 reported the results of performance tests with the standard .NET Enumerable class and the finger-tree-based sequence
The new data provided here is the results of similar performance tests, this time between the finger-tree-based sequence and XPath 2.0 fuctions on sequences as implemented by a group of three XSLT 2.0 processors.
I apologize for not specifying the names of the XSLT 2.0 processors used in this experiment. This was done intentionally, since the purpose of the experiment was only to compare the current efficiency of the XPath sequences and strings as implemented by these XSLT processors to the efficiency of a Finger Tree - based implementation. This experiment did not have the goal of comparing the XSLT processors with each other. The results were consistent with and did not alter my prior perception of each XSLT processor's efficiency.
Also, the usual warning:
Disclaimer: This comparison was carried out on my 4-year old PC, having CPU speed of 3GHz and RAM of 2GB and running Windows XP. Results will largely vary on other configurations, depending on CPU and memory speed, available memory, cache size, ..., etc. Another factor to consider is the time necessary for jitting the .NET IL code on initial method execution or the JVM startup time. However the observed ratios (speedup) should remain closely the same.
The following XPath 2.0 functions on sequences were tested:
op:item-access() (the position() function with integer argument)
The following XPath 2.0 functions on strings were tested:
fn:concat() (with 2 arguments)
The finger-tree-based sequence was implemented in such a way that an instance of it was instantiated as an extension object of the Saxon 9.1 (for .NET) XSLT processor and then the methods were referenced as extension functions.
Any of these extension functions was executed 100 000 times and the average time was recorded. Unfortunately, two of the three tested XSLT processors were very slow, which necessitated that during the tests of these XSLT processors many functions had to be executed 10 000 times or even only 1000 or, in a very few, exceptional circumstances, only 100 times.
Writing such code can be tricky, especially due to the very aggressive optimization performed by one of the tested XSLT processors. Thus, in order to obscure from such a clever XSLT processor the fact that a variable had been defined outside of an <xsl:for-each> instruction, something like the following code had to be used:
|
<xsl:for-each select="1 to 100000"> <xsl:variable name="vFSeq3" <xsl:variable name="vRem"
as="xs:integer" <xsl:variable name="vFSeqY" <xsl:variable name="vFSeq5"
select= <xsl:variable name="vLen" select= <xsl:value-of
select="(1,2,3)[$vLen]"/> |
Note how the last instruction within the <xsl:for-each> above causes a lazy-evaluating XSLT processor to finish its work and produce the complete sequence.
The sequences (and character strings) on which the above functions were performed had length of 10 000.
The code for these tests is here.
As in the .NET comparison test (reported in Part 2), there were two approaches:
As can be seen from these tables (my weblog host allows only limited length for an entry and could not accommodate them inline), which summarize the results of the "uniform" or "average" testing of the sequence implementations, using a Finger Tree as a general sequence data structure in XPath 2.0 may speed up sequence operations (roughly) 5 to 40 times even when only the best of the three XSLT processors' results are compared, 5 to 4500 times if all results are taken into account.
The XPath string implementation of the three XSLT 2.0 processors was tested against a finger-tree based string object and the results from the "average" tests are summarized in these tables.
As can be seen, using a Finger Tree to represent a string in XPath 2.0 may speed up string operations (roughly) 3 to 9 times even when only the best of the three XSLT processors' results are compared, 3 to 40 times if all results are taken into account.
I will conclude the series on finger trees with the following, quite
Obvious Conclusion:
Existing sequence and string implementations in .NET and in XSLT
2.0 processors can be speeded up significantly by using the Finger
Tree data structure. Additionally, .NET will benefit from having a
truly immutable sequence implementation.
Certainly, the question is not "if" such implementations will be produced, but only "when".
Posted at 04:44
In a recent post to the xml-dev mailing list, "XSLT question on transitive closures", Rick Jelliffe wrote:
"Am I right in thinking that
1) XPath2 functions don't have have a function for transitive closure (along provided xpaths)
2) SAXON 8 does not have the saxon:closure() extension function that older versions of SAXON had
3) The one to use is probably still Christian Nentwich's code from circa 2001 as adopted into EXSLT ?"
"I think it would be nice to do it properly based on FXSL higher-order functions, which are much more cleanly specified. Perhaps there is already a suitable function in FXSL.
The other thing that's needed is the ability to check for cycles. Simply blowing the stack or looping isn't good enough."
David Carlisle replied by providing a solution that would dynamically generate a new XSLT stylesheet and a closure function in it that uses only a specific function and implements its closure. He also said:
"> I think it would be nice to do it properly based on FXSL higher-order
True (but I'll leave that for Dimitre :-)"
Thanks to these nice people I felt just like... something had been left for me :o)
So, now we have in FXSL the function
which, given a function pFun and a starting set of items pstartSet, produces the transitive closure of pFun.
While the complete 47 lines of code can be viewed using the above link, the essence of the implementation is in the following 20 lines:
<xsl:function name="f:closure" as="node()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pstartSet" as="node()*"/> <xsl:sequence select="f:closure2($pFun, $pstartSet,$pstartSet)"/> </xsl:function> <xsl:function name="f:closure2" as="node()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pCurClosure" as="node()*"/> <xsl:param name="pstartSet" as="node()*"/> <xsl:if test="exists($pstartSet)"> <xsl:variable name="vNew" select= "f:map($pFun,$pstartSet) except $pCurClosure"/> <xsl:sequence select= "$pstartSet | $vNew | f:closure2($pFun,$pCurClosure | $vNew, $vNew)"/> </xsl:if> </xsl:function>
And here is my reply to David's post (code hyperlinked, full code omitted):
"
>> I think it would be nice to do it properly based on FXSL
higher-order
> > True (but I'll leave that for Dimitre:-)
I am sorry I only read this a few days ago. Below is the code of the FXSL function. While the code is straightforward, the following must be noted:
1. The "set" at present is only a set of nodes. I will probably produce a more general f:closure() function, which operates on any set of items. Then this function should be also passed as parameters a "union" and a "difference" functions.
2. It seems that David's solution would go into an infinite loop for more involved examples (see the second test with reachability of nodes in cyclic graphs below). Therefore, the algorithm was slightly changed and works correctly. The following files can be downloaded from the CVS of the FXSL project:
The last transformation should be applied on the following xml
file:
testFunc-closure2.xml
The result from running the second test transformation above (two cases of finding all nodes of a given graph, reachable from a specific node. In the second case there is a cycle involving the nodes V2, V6, V7) is:
=== Reachable from V1 =======
<node id="V1"/>
<node id="V3"/>
<node id="V4"/>
<node id="V5"/>
=======================
=== Reachable from V2 =======
<node id="V2"/>
<node id="V4"/>
<node id="V5"/>
<node id="V6"/>
<node id="V7"/>
=======================
Due to its generality, the f:closure() function is a useful
addition to the FXSL library.
Cheers, Dimitre Novatchev
"
Posted at 04:44
IEnumerable<uint> vs. FSeq
|
Method |
.NET Av. Time |
FSeq Av. time |
FSeq SpeedUp |
|
Concat() |
5268 µs |
56 µs |
94.07 |
|
ElementAt(), [ ] |
1 µs |
46 µs |
--- |
|
Skip() |
2022 µs |
49 µs |
41.27 |
|
Take() |
1513 µs |
50 µs |
30.26 |
|
Reverse() |
4859 µs |
93 µs |
52.25 |
string vs. FString
|
Method |
.NET Av. Time |
FString Av. time |
FSeq SpeedUp |
|
Concat(), + |
655 µs |
129µs |
5.08 |
|
Insert() |
653 µs |
220µs |
2.97 |
|
Remove()
|
176 µs |
55 µs |
3.20 |
|
Substring() |
167 µs |
54 µs |
3.09 |
Posted at 04:44
The book "Real World Haskell" has been nominated a JOLT Finalist.
I have been reading this book
in online form for the past month and got the hardcopy a few days ago. Two thirds through, I can confidently say that this is the best Haskell book I have read.
From the Jolt Awards Site:
"How are the winners selected?
Our judges
not only examine the standard criteria of audience suitability,
productivity, innovation, quality, ROI, risk and flexibility, but
also seek to honor products that are ahead of the curve.
Jolt-winning products are universally useful; are simple, yet rich
in functionality; redefine their product space; and/or solve a
nagging problem that has consistently eluded other products and
books."
Posted at 04:35
Update: Minor code cleanup (using better names now).
In my previous post I defined the problem of improving the quadratical performance of an XSLT processor P1 when performing repetitive prepends to a sequence, while having an excellent linear repetitive appends performance.
The question asked was:
"Can we implement an algorithm for repetitive prepends that will be run by P1 in linear time, in addition to its excellent linear performance of repetitive appends?"
This transformation produces the result of repetitive prepends:
|
<xsl:stylesheet
version="2.0"
<xsl:variable
name="vPrepends"
as="element()*">
<xsl:template
match="/">
<xsl:sequence
select= <xsl:value-of select="name($vNodes[last()])"/> <xsl:text>, </xsl:text>
<xsl:value-of
select="count($vNodes)"/>
<xsl:function
name="my:prepend-iter">
<xsl:sequence
select= |
When run with the P1 XSLT processor, the results are:
|
Prepends |
Time, ms |
Speedup |
Remarks |
|
100 |
42 |
1.47 |
Hundred prepends. |
|
1000 |
67 |
25 |
Thousand prepends. |
|
10000 |
145 |
1411 |
Ten thousand prepends |
|
100000 |
932 |
~ 16000 |
Estimated speedup for one hundred thousand prepends. |
|
1000000 |
11666 |
~ 160000 |
Estimated speedup for one million prepends. |
So, we see that the repetitive prepends have been carried out with linear complexity! We achive a speedup that can reach thousands and even hundred of thousands times!
When carried out with P2, the time complexity remains quadratical, however the actual results are two times faster than with the non-optimized algorithm.
I will explain the algorithm implemented in the above transformation, in Part III of this post.
Posted at 04:34
In Part II of this post a
solution was given to the problem presented in Part I:
"Can we implement an algorithm for repetitive prepends that will be run by P1 in linear time, in addition to its excellent linear performance of repetitive appends?"
The solution is implemented within the function my:prepend-iter() that is defined in the following way:
|
<xsl:function name="my:prepend-iter"> <xsl:param name="pNumTimes" as="xs:integer"/> <xsl:param name="pstartSeq" as="element()*"/> <xsl:param name="pInserts" as="element()*"/>
<xsl:sequence select= "if($pNumTimes gt 0) then my:prepend-iter ($pNumTimes -1, $pstartSeq, ($pInserts, reverse($vPrepends) ) ) else (reverse($pInserts), $pstartSeq) "/> </xsl:function> |
The first argument passed to this function, pNumTimes, is the number of times to perform the prepend operation on the initial sequence pstartSeq, which is passed as the second argument.
The third argument has an auxiliary purpose. As its name, pInserts suggests, it contains the accumulation of all sequences that are to be prepended to pstartSeq.
The idea is to produce the final result only when pNumTimes is zero, using at this time the accumulated prepends in pInserts and the initial sequence pstartSeq.
The idea of the algorithm is to replace the repetitive prepend operations with repetitive append operations, which are carried out by the xslt processor with linear complexity.
We notice that:
y2s ++ y1s ++ xs =
reverse( reverse(y1s) ++ reverse(y2s) ) ++ xs (1)
Or in general, if we have N sequences:
y1s, y2s, ..., yNs
to be prepended to an initial sequence xs in that order,
yNs ++ yN-1s ++ ... y2s ++ y1s ++ xs =
reverse(reverse(y1s)++reverse(y2s) ++ ...
... ++ reverse(yNs) ) ++xs (2)
Remarkably, the argument of the outermost reverse() is a repetitive appends operation and it has only linear complexity with P1!
We have added N+1 reverse() operations, but each of them is of linear complexity, so the total complexity of implementing (2) is O(N).
Certainly, the function my:prepend-iter() we have demonstrated is a simplified case of the many different real world scenarios we may encounter that will require repetitive prepends. Nevertheless its use is sufficient to prove and demonstrate how such an O(N) algorithm can be constructed.
We could use essentially the same algorithm, with a modification
that accepts as argument a function
next-prepend(), which
produces the next sequence to be prepended. To signal the end of
the repetitive prepend operation, we can use a convention that
next-prepend() will indicate
this by producing some special sequence, such as the empty
sequence.
In case you may be wondering how can a function be passed as an argument to another function in XSLT, do read about FXSL.
Posted at 04:34
In his recent blog-article: “XPath needs virtual axes. Making XPath more XPathy?” Rick Jelliffe sais:
“I really like XPath2. I would never recommend anyone start with XPath1, unless you were doing very basic transformations with no text processing or data formatting.
But the niggle I have with XPath2 is that it is less XPath-y than XPath1. It does not significantly improve the central syntactical feature of XPaths: the location steps. (The only improvement that springs to mind is that XPath2 did improve the use of parentheses in location steps.) Instead, XPath2 provided much more conventional features like a for iterator. I think these significantly decrease the comprehensibility of an XPath, are anonymous and therefore require may comments to explain them, and fracture the line. To an extent, once you start to use nested syntaxes and iterators, why both using XPath at all?”
Rick gives this example:
find-rep( find-client(//manager/clients/client-ref)/rep-ref)/name
and proposes to improve the expression above by introducing “virtual axes” so that it could be written as:
//manager/clients/client-ref/find-client::rep-ref/find-rep::rep/name
So, curious reader, pause for a while, don’t read below, and think: has Rick uncovered a hole in XPath?
What you are asking for can be expressed almost exactly in the same form (actually I prefer the current XPath 2.0 form of expression).
You are asking for:
//manager/clients/client-ref/find-client::rep-ref/find-rep::rep/name
One can write this in XPath 2.0 as:
//manager/clients/client-ref/find-client(.)/rep-ref/find-rep(.)/name
The current XPath way of expressing this is cleaner -- no need for virtual axes.
This is a feature of XPath 2.0 that is not widely used and known: any function can be used as the current location step. The syntax rules that resolve its use are (at http://www.w3.org/TR/2007/REC-xpath20-20070123/#id-grammar):
[27] StepExpr ==> [38] FilterExpr ==> [41] PrimaryExpr
and the fact that a FunctionCall is a PrimaryExpr:
[41]
PrimaryExpr::=Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall
Also, consider that the axes in XPath have been useful so far mainly because there are just a few of them. Imagine having to deal with zillions of axes and struggling to remember what they mean. And if everyone can introduce their own axes, then why bother with them at all?
Dear reader, it's up to you to decide... as I have already done.
Posted at 04:34
As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as in XML and a text-based form, that are supported by existing tool chains.
In contrast, if there is existing RDF in that domain area, it’s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user rather than the professionals in the area.
To give an example, the international statistics community uses SDMX for representing and exchanging statistics (and a lot more besides; it’s a huge standard). SDMX includes a well-thought through model for statistical datasets and the observations within them, as well as standard concepts for things like gender, age, unit multipliers and so on. By comparison, SCOVO, the main RDF model for representing statistics, barely scratches the surface in comparison.
This isn’t the only example: the INSPIRE Directive defines how geographic information must be made available. GEMINI defines the kind of geospatial metadata that that community cares about. The Open Provenance Model is the result of many contributors from multiple fields, and again has a number of serialisations.
You could view this as a challenge: experts in their domains already have models and serialisations for the data that they care about; how can we persuade them to adopt an RDF model and serialisations instead?
But that’s totally the wrong question. Linked data doesn’t, can’t and won’t replace existing ways of handling data. But it has got some interesting features that can bring great benefit to people who want to publish their data, namely:
The question is really about how to enable people to reap these benefits; the answer, because HTTP-based addressing and typed linkage is usually hard to introduce into existing formats, is usually to publish data using an RDF-based model alongside existing formats. This might be done by generating an RDF-based format (such as RDF/XML or Turtle) as an alternative to the standard XML or HTML, accessible via content negotiation, or by providing a GRDDL transformation that maps an XML format into RDF/XML.
Either way, the underlying model needs to be mapped into RDF. We’re furthest down this road with statistical data. I wanted to explore here what it might look like for the Open Provenance Model, building on lessons learned from the statistical domain.
The Open Provenance Model talks about three main nodes:
and five kinds of edges that can be defined between them:
Then things start getting more complicated. OPM indicates that each artifact and agent plays a different role when it is used by, generated by or controls a process. What’s more, each artifact and agent might be involved in the process at different times (though timing information is optional within OPM). And a given provenance graph may contain several accounts of how artifacts, processes and agents fit together.
The OWL ontology for OPM for OPM is a very literal mapping of OPM into RDF. Each of the types of nodes is a separate class, and each of the types of edges is a separate class. Thus, it introduces a lot of n-ary relationships. Take a really simple example of an XML file being transformed into HTML using XSLT. With the OPM ontology, the RDF would look something like:
_:transformation a opm:Process .
<doc.html> a opm:Artifact .
<doc.xml> a opm:Artifact .
<doc.xsl> a opm:Artifact .
_:processor a opm:Agent .
_:Jeni a opm:Agent .
_:stylesheetLink a opm:Used ;
opm:effect _:transformation ;
opm:cause <doc.xml> ;
opm:role eg:xsltSource .
_:sourceLink a opm:Used ;
opm:effect _:transformation ;
opm:cause <doc.xsl> ;
opm:role eg:xsltStylesheet .
_:resultLink a opm:WasGeneratedBy ;
opm:effect <doc.html> ;
opm:cause _:transformation ;
opm:role eg:xsltResult .
_:processorLink a opm:WasControlledBy ;
opm:effect _:transformation ;
opm:cause _:processor ;
opm:role xslt:processor .
_:userLink a opm:WasControlledBy ;
opm:effect _:transformation ;
opm:cause _:Jeni ;
opm:role xslt:user .
_:derivation a opm:WasDerivedFrom ;
opm:effect <doc.html> ;
opm:cause <doc.xml> .
xslt:source a opm:Role ;
opm:value "source" .
xslt:stylesheet a opm:Role ;
opm:value "stylesheet" .
xslt:result a opm:Role ;
opm:value "result" .
xslt:processor a opm:Role ;
opm:value "processor" .
xslt:user a opm:Role ;
opm:value "user" .
To give you an idea of what this mapping means, if I wanted to
work out who created doc.html, I would have to do a
query like:
SELECT ?who
WHERE {
?generatedBy
opm:cause <doc.html> ;
opm:role xslt:result ;
opm:effect ?transformation .
?controlledBy
opm:effect ?transformation ;
opm:role xslt:user ;
opm:cause ?who .
}
There are two things that I want to pull out about the RDF mapping described above.
It reminds me of the mapping of object-oriented or relational data models into each other or into XML, which often result in a god awful mess and people swearing that technology X is goddamned ugly.
The fact is that elegant uses of each modelling paradigm — ones that are easy to understand and efficient to query — always take advantage of the unique features of that paradigm. For example, good XML vocabularies take advantage of the distinctions between attributes and elements, of nesting and hierarchies, and of the ability to hold mixed content.
It’s the same with RDF. There are four features of RDF that I think good vocabularies will take suitable advantage of:
Reusing existing vocabularies takes advantage of the ease of bringing together diverse domains within RDF, and it makes data more reusable. For example, an OPM mapping that encourages the reuse of FOAF for people and organisations saves time and effort for the developers of the OPM RDF vocabulary, that they would otherwise have spent modelling the details of agents; and it means that any agents that are described within the description of a piece of provenance are automatically available as agents in the wider FOAF cloud. The same goes for using DOAP to describe software.
By reusing vocabularies, the data isn’t isolated any more, locked within a single context designed for a single use. This is a huge benefit of the linked data approach and it makes sense to leverage it.
Using inheritance means creating general
purpose classes and properties and encouraging other people to use
rdfs:subClassOf or rdfs:subPropertyOf to
specialise them according to their own requirements. Within OPM,
the different roles that artifacts and agents might play in a
process is a natural fit with either sub-properties or sub-classes,
depending on how the edges in the model are represented. For
example, rather than
_:stylesheetLink a opm:Used ;
opm:effect _:transformation ;
opm:cause <doc.xsl> ;
opm:role eg:xsltStylesheet .
xslt:stylesheet a opm:Role ;
opm:value "stylesheet" .
you could generate data that looked like:
_:stylesheetLink a xslt:Stylesheet ;
opm:effect _:transformation ;
opm:cause <doc.xsl> .
where xslt:Stylesheet is defined as a subclass of
opm:Used.
Inheritance is a basic form of reasoning. In
the case of the subclass relationship outlined above, the reasoning
is that anything that is a xslt:Stylesheet is also a
opm:Used, and thus:
_:stylesheetLink a xslt:Stylesheet .
implies
_:stylesheetLink a xslt:Used .
Taking the scenario where you’re doing native linked data publishing — storing data in a triplestore and then publishing it out from there — you have two choices:
The latter is obviously the more user-friendly approach. (And a triplestore could make it easy by understanding and applying schemas, ontologies and rules as data is loaded in.)
To take a more complex example, provenance could be modelled in a much more direct way, such as:
<doc.html> a opm:Artifact ;
opm:derivedFrom <doc.xml> ;
opm:generatedBy [
xslt:source <doc.xml> ;
xslt:stylesheet <doc.xsl> ;
xslt:processor _:processor ;
xslt:user _:Jeni ;
] .
where xslt:source and xslt:stylesheet
are sub-properties of a property called opm:used, and
xslt:processor and xslt:user are
sub-properties of opm:controlledBy. This removes the
n-ary properties, which (given the use of inheritance to represent
roles) are only actually needed if the model needs to capture the
timing of the involvement of particular artifacts or agents within
a process, and makes the provenance information much easier to
query than before:
SELECT ?who
WHERE {
<doc.html> opm:generatedBy ?transformation .
?transformation xslt:user ?who .
}
But what if we also want to support the more complex,
n-ary-relation-based models? We would need to assert, somehow, a
rule that said that the presence of a opm:controlledBy
relationship from a process to an agent was equivalent to having a
opm:WasControlledBy instance with a
opm:cause pointing to the agent and an
opm:effect pointing to the process. Combine this with
xslt:user being sub-property of
opm:controlledBy and you have the statement:
_:transformation xslt:user _:Jeni .
implying:
_:transformation opm:controlledBy _:Jeni .
which in turn implies:
[] a opm:WasControlledBy ;
opm:effect _:transformation ;
opm:cause _:Jeni .
The same reasoning could be applied in the opposite direction,
of course. Part of the definition of the use of OPM in RDF could be
that the presence of a opm:WasControlledBy with a
opm:cause pointing to an agent and
opm:effect pointing to a process implies a
opm:controlledBy link between the
opm:effect and the opm:cause. Whichever
was used in the initial modelling of the data, the same query could
be used to query the data (accepting some loss of precision along
the way, but if you’re not interesting in timing information then
why should you suffer the cost of querying through n-ary
relations?).
The final thing that I mentioned above that mappings from existing models to RDF should take advantage of is named graphs. In OPM, the obvious way that named graphs could play a role is in providing support for the different accounts of provenance. Separate named graphs could be used to represent separate accounts, referencing the same artifacts, agents and processes where appropriate. Individually, the graphs can remain simple; together, you have the full power of OPM.
Modelling is a complex design activity, and you’re best off avoiding doing it if you can. That means reusing conceptual models that have been built up for a domain as much as possible and reusing existing vocabularies wherever you can. But you can’t and shouldn’t try to avoid doing design when mapping from a conceptual model to a particular modelling paradigm such as a relational, object-oriented, XML or RDF model.
If you’re mapping to RDF, remember to take advantage of what it’s good at such as web-scale addressing and extensibility, and always bear in mind how easy or difficult your data will be to query. There is no point publishing linked data if it is unusable.
Posted at 20:35
I'm shocked to find out today that Schematron is in its second decade. Amazing how time flies. Congrats to Rick.
Posted at 14:07
Long time Microsoft MVP Oleg Tkachenko has joined Microsoft's Visual Studio team.
Oleg is well-known for his work on a number of popular XML/XSLT - related open source projects, such as EXSLT.NET, MVP.XML, nXslt.exe, the Iron XSLT project type of Visual Studio 2008, ... , etc.
Oleg is also a member of the FXSL open source project.
I am glad for Oleg and want to congratulate him on his new role.
Posted at 03:36
Assignment for Dailyshoot 116 on 2010/03/11: “"Rules" can be stifling if taken to extremes. Break the rules today with focus, composition, etc, and see what happens.”
This is Railspur Alley on Granville Island; the picture exhibits no fidelity whatever to the actually fairly interesting albeit soft shades of this much-patched wall.
Posted at 09:05
Suppose you want to use Flash on your website. I think this is a bad idea, but I accept that some are going to ignore my wise advice and do it anyhow. If you do this and you’re not careful, you can make it absolutely impossible for some people to see your show.
Consider for example thesixtyone, a nice sort-of-game-structured music site that I use for background when I’m working on something that’s not particularly taxing. It’s lively and well-designed and dynamic and an example of intelligent and graceful Ajax. Only it wouldn’t work for me when I first visited it.
The reason is that it uses Flash to play its music. On both Camino and Safari, like many other people I use a Flash blocker; this disables Flash by default and shows me some sort of graphic indicating that there’s Flash there if I want to click on it. The number of irritating squirmy lame-ass ads that I never have to look at is remarkable; and when I hit YouTube or something like that, I click and watch.
On thesixtyone, they’d hidden their Flash player, made it invisible somehow, so there was nowhere to click to make the music go. Don’t do that!
Posted at 00:01
W3C is pleased to announce the creation of the Decisions and Decision-Making Incubator Group, whose mission is to determine the requirements, use cases, and a representation of decisions and decision-making in a collaborative and networked environment suitable for leading to a potential standard for decision exchange, shared situational awareness, and measurement of the speed, effectiveness, and human factors of decision-making.. The following W3C Members have sponsored the charter for this group: DISA, MITRE, and CNR. Read more about the Incubator Activity, an initiative to foster development of emerging Web-related technologies. Incubator Activity work is not on the W3C standards track but in many cases serves as a starting point for a future Working Group.
Posted at 19:40
The User Agent Accessibility Guidelines Working Group has published an updated Working Draft of the User Agent Accessibility Guidelines (UAAG) 2.0. UAAG defines how browsers, media players, and other "user agents" should support accessibility for people with disabilities and work with assistive technologies. This draft adds requirements in seven new areas, including support for speech input, video playback controls and a new section on conformance. It introduces a new supporting document, Implementing UAAG 2.0 as a First Public Working Draft. Read the invitation to review the UAAG 2.0 Working Draft and about the Web Accessibility Initiative (WAI).
Posted at 18:49