Planet XMLhack

Angle brackets are a way of life


September 02

Tim Bray: Galaxy Tab

So, there’s a new kind of Android device in the world. The world still isn’t sure just where it is that tablets are the right tool for the job. That granted, this is a nifty product. And I’m developing my own theory of what tablets are for.

My impressions are based on a couple hours playing with one, which at this point is a couple hours more than almost anyone else. The model I played was not quite production — among other things, the product name stenciled on the back wasn’t “Galaxy Tab” — but close.

I won’t have one on next week’s trip to Mainz for MobileTech, but I’m pretty sure I’ll be able to take one along to GDD Tokyo and JAOO in Aarhus, Denmark.

Other coverage: At the Financial Times’ ft.com/techbog, also Android Central (with a useful iPad comparo), also Engadget.

Impressions

All the apps I tried ran just fine, including a couple of immersive games that really benefited from the extra inches. I’ve heard of a few apps that misbehave, but their problems were obvious & easy to fix; watch for details over on the Android Dev Blog, starting later today.

Samsung has sprinkled some sugar on the out-of-the-box Google UI elements, and while the community’s opinions on hardware companies’ efforts to improve Android software have been, um, mixed (my own is extremely mixed), I have to say that the Samsungers have shown restraint, putting the extra real estate to good use in good places, for example the notifications pull-down. There may be some of that integrated-social-everything that frankly gets up my nose, but my nose remained clear around the Tab, so if it’s there it’s at least easy to ignore.

It’s snappy, especially on games where that matters; maybe there are places where servicing the extra bits in the 1024x600 screen will hurt, but I didn’t run across them.

It’s got a phone but (at least on the pre-release model I used) you can’t hold it up to your head, which is a good thing as that would look supremely dorky.

Did I mention that the screen is beautiful? Also it feels really good in the hand and looks pretty nice, and is obviously in the first microsecond’s glance not an iPad.

What Are Tablets For?

The trade-off is obvious. You win because you can show a bigger picture, which is important, and you lose because it just won’t fit in many pockets, which is important. It’ll go in most purses, though.

I know what I’ll use the Galaxy Tab for: to show off Android. The big screen just makes everything easier to see and point at, and graphics look outstanding, and it passes from hand to hand easily. Showing off Android is part of my job and this will help me do my job better.

Which leads to a general theory, reinforced by informal observation of hipsters with iPads in coffee shops: a tablet is, crucially, a more shareable computer. A laptop, with its fragile hinge-ware and space-gobbling keyboard, is just not comfy to share. A tablet is easier to bring to the café, easier to hand across the table or along the sofa, easier to seize in the heat of the moment, easier to hold up in triumph, easier to set aside when you need to meet someone’s eyes.

How big a market is that? Anyone who says they know is lying.

Posted at 10:01

September 01

W3C News: Five XML Security Drafts Published

The XML Security Working Group has published five working drafts today. XML Signature 2.0, Canonical XML 2.0 and the XML Signature Streamable Profile of XPath 1.0 are part of an ongoing effort to rework XML Signature and Canonical XML in order to address issues around performance, streaming, robustness, and attack surface. The Working Group has also published updated Working Drafts for its XML Signature Best Practices and XML Security Relax NG Schemas Working Group Notes. Learn more about XML Security.

Posted at 20:52

Sean McGrath: what does law.gov mean to you?

Herein is my response to the question what does law.gov mean to you?

I am an IT architect and a builder of legislative systems more so than a direct legal publisher. Having said that, I have worked with most of the world's legal publishing entities at some time or other over the last twenty years. My current focus is creating legislative systems for legislatures - mostly in the U.S.A. - the content our systems produce is then published by legislatures themselves and also by third party publishers.

I am a technologist first and foremost. I recently started blogging about the KLISS eDemocracy system here in Kansas in the hope that the technical details I am blogging will help other technologists to understand the legislative domain better and thus help create a more informed tech community around one of the most important aspects of any democracy.

I agree with pretty much everything Ed Walters said about the AOL Moment that is currently happening in the legal publishing industry. I also agree with pretty much everything Carl Malamud says about the desirability of free, unfettered access to authenticated, machine readable primary legal materials in the context of the law.gov initiative.

For me however, the most interesting vista that law.gov opens up is the potential for the most significant event in the evolution of democracy since the funeral oration of Pericles 2400 years ago. For the first time in human history, we now have all the technological pieces we need to bring participation in the democratic process to levels not seen since ancient Greece when everyone could literally congregate in the same place. To quote Don Heiman, CITO for the Kansas State Legislature:


"Anything, including law making, you do in the presence of government you can do electronically without regards to wall or clocks provided it is easy to use and free to citizens."


There are no longer any technical reasons why we cannot publish the public activities of a legislature in real-time, or have statute databases codified on the fly, or provide direct visibility of what the impact of a proposed modification to the law would look like before it gets voted on. No technical reason why we cannot allow citizens to not only observe, but also participate in the making of law *as it is being made* - not just see the results ex post facto.
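To make the "impact of a proposed modification" idea concrete, here is a minimal sketch using Python's standard `difflib`. The statute text, the section numbering, and the function name `preview_amendment` are all invented for illustration; a real legislative system would work on structured documents rather than plain lines.

```python
import difflib
from typing import List


def preview_amendment(current: List[str], proposed: List[str]) -> List[str]:
    """Return a unified diff of a statute section before and after a proposed change."""
    return list(
        difflib.unified_diff(
            current,
            proposed,
            fromfile="statute (in force)",
            tofile="statute (as amended)",
            lineterm="",
        )
    )


# Hypothetical statute text, made up for this example.
in_force = [
    "Sec. 1. A dog licence costs 10 crowns.",
    "Sec. 2. Licences expire after one year.",
]
as_amended = [
    "Sec. 1. A dog licence costs 12 crowns.",
    "Sec. 2. Licences expire after one year.",
]

diff = preview_amendment(in_force, as_amended)
for line in diff:
    print(line)
```

Software developers read this kind of before/after view every day, which is the sense in which the two fields overlap.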

It is a lot of work for sure but it is only work at this point. No new technology breakthroughs are required. What needs to happen next (and there are signs it is happening) is for the world of law and the world of software development to both come to the realization that they are both in the same business from content management and publishing perspectives. I really believe that law is source code in the sense that the disciplines and techniques that have been perfected in the software development world have a tremendous amount to offer those who manage corpora of legal texts.

I look forward to the day when we speak of, for example, "release 7.8a (Rev 456422) of the consolidated statutes of Tumbolia (MD5 checksum: d03730288a7f0278e36afc82f220ddab)."

I look forward to the day when we can jump into a time machine and look at Rev 674245 of the 2011 Legislative Biennium Corpus for Tumbolia in order to better understand the legislative intent of an amendatory bill.

I look forward to the day when we can look at the laws of Tumbolia, as they were at noon Wed, 20 Jan 2010 in order to present attorneys and the courts with a complete view of what the law said at the time some contested action took place.

I look forward to the day when we can detail edit-by-edit how the consolidated statutes of Tumbolia came to be what they are by starting with the Constitution of Tumbolia from 1899 and rolling forward changes to its statute from its session laws, step-by-step with all the rigor of an accounting audit trail of transaction ledgers.
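The wishes above map fairly directly onto ordinary version-control ideas. The following is only a sketch under stated assumptions: the class names `Revision` and `StatuteHistory`, the dates, and the statute text are hypothetical, and MD5 appears only because the example above mentions it (it is not collision-resistant, so a real system would likely choose a stronger hash).

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Revision:
    number: int
    enacted: datetime
    text: str

    @property
    def checksum(self) -> str:
        # Hex digest of the revision text, for citing an exact version.
        return hashlib.md5(self.text.encode("utf-8")).hexdigest()


class StatuteHistory:
    """Append-only list of revisions, ordered by enactment time."""

    def __init__(self) -> None:
        self._revisions: List[Revision] = []

    def commit(self, enacted: datetime, text: str) -> Revision:
        if self._revisions and enacted <= self._revisions[-1].enacted:
            raise ValueError("revisions must be committed in chronological order")
        rev = Revision(len(self._revisions) + 1, enacted, text)
        self._revisions.append(rev)
        return rev

    def as_of(self, moment: datetime) -> Revision:
        """Return the revision that was in force at `moment`."""
        times = [r.enacted for r in self._revisions]
        index = bisect_right(times, moment)
        if index == 0:
            raise LookupError("no revision was in force at that time")
        return self._revisions[index - 1]


history = StatuteHistory()
history.commit(datetime(2009, 7, 1, tzinfo=timezone.utc),
               "Sec. 1. A dog licence costs 10 crowns.")
history.commit(datetime(2010, 3, 1, tzinfo=timezone.utc),
               "Sec. 1. A dog licence costs 12 crowns.")

# The law as it stood at noon on Wed, 20 Jan 2010.
noon = datetime(2010, 1, 20, 12, 0, tzinfo=timezone.utc)
law_then = history.as_of(noon)
print(f"Rev {law_then.number} (MD5 checksum: {law_then.checksum})")
```

Because the history is append-only, replaying the commits in order gives the edit-by-edit audit trail, and `as_of` gives the point-in-time view.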

I hope that the law.gov initiative heads in that direction. The http://legislation.gov.uk website clearly points the way for what is possible. Speaking as a technologist, we techies stand ready willing and able to make this happen. Is the political will there to make it happen? Is the disruption of the status quo too much too soon for such a staid and contemplative field as law and law-making? I can answer neither of these questions but I sincerely hope the answers are "yes" and "no" respectively.

The biggest threat to any democracy is a disinterested electorate. In years to come, I hope law.gov will be seen as the catalyst that re-invigorated an entire generation to engage with the democratic process. A process that too many currently feel is beyond their realm of influence. We can change that now. For our sakes and the sakes of future generations, I hope we do.

Posted at 15:18

Eliot Kimber: Norm Reconsiders DITA Specialization

Norm Walsh has published a very interesting post to his blog, Reconsidering specialization, part the first.

This is very significant and I eagerly await Norm's thoughts.

As Norm relates in his post, he and I had what I thought was a very productive discussion about specialization and what it could mean in a DocBook context. I think Norm characterized my position accurately, namely that the essential difference between DocBook and DITA is specialization and that makes DITA better.

Here by "better" I mean "better value for the type of applications to which DITA and DocBook are applied". It's a better value because:

1. Specialization enables blind interchange, which I think is very important, if not of utmost importance, even if that interchange is only with your future self.

2. Specialization lowers the cost of implementing new markup vocabularies (that is, custom markup for a specific use community) by roughly an order of magnitude.

There's more to it than that, of course, but those are the key bits.
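For readers who have not seen how specialization supports blind interchange, here is a rough sketch. It assumes the DITA convention that every element carries a class attribute listing its ancestry from most general to most specific (for example `- topic/body concept/conbody `); the function name `nearest_known_type` and the set of "known" types are made up for illustration and are not part of any DITA toolkit.

```python
from typing import Optional, Set


def nearest_known_type(class_attr: str, known_types: Set[str]) -> Optional[str]:
    """Walk a DITA-style class attribute from most to least specialized
    and return the first module/element token the processor knows."""
    tokens = class_attr.split()[1:]  # skip the leading "-" or "+" marker
    for token in reversed(tokens):
        if token in known_types:
            return token
    return None


# A hypothetical processor that only understands the base topic vocabulary.
generic_processor = {"topic/topic", "topic/body", "topic/p"}

# A specialized <conbody> element still resolves to something it can handle.
resolved = nearest_known_type("- topic/body concept/conbody ", generic_processor)
print(resolved)  # topic/body
```

A processor that has never seen the specialized vocabulary can still do something sensible with the content, which is what makes the interchange "blind".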

All the other aspects of DITA that people see as distinguishing: modularity, maps, conref, etc., could all be replicated in DocBook.

If we assume that DITA's more sophisticated features like maps and keyref and so forth are no more complicated than they need to be to meet requirements, then the best that DocBook could do is implement the exact equivalent of those features, which is fine. So to that degree, DocBook and DITA are (or could be) functionally equivalent in terms of specific markup features. (But note that any statement to the effect that "DITA's features are too complicated" reflects a lack of understanding of the requirements that DITA satisfies--I can assure you that there is no aspect of DITA that is not used and depended on by at least one significant user community. That is, any attempt, for example, to add a map-like facility to DocBook that does not reflect all the functional aspects of DITA maps will simply fail to satisfy the requirements of a significant set of potential users.)

But note that currently DocBook and DITA are *not* functionally equivalent: DocBook lacks a number of important features needed to support modularity and reuse. But I don't consider that important. What really matters is specialization.

Note also that I'm not necessarily suggesting that DocBook adopt the DITA specialization mechanism exactly as it's formulated in DITA. I'm suggesting that DocBook needs the functional equivalent of DITA's specialization facility.

Note also that DocBook as currently formulated at a content model level probably cannot be made to satisfy the constraints specialization requires in terms of consistency of structural patterns along a specialization hierarchy and probably lacks a number of content model options that you'd want to have in order to support reasonable specializations from a given base.

But those are design problems that could be fixed in a DocBook V6 or something if it was important or useful to do so.

Finally, note that in DITA 2.0 there is the expectation that the specialization facility will be reengineered from scratch. That would be the ideal opportunity to work jointly to develop a specialization mechanism that satisfied requirements beyond those specifically brought by DITA. In particular, any new mechanism needs to play well with namespaces, which the current DITA mechanism does not (but note that it was designed before namespaces were standardized).

Posted at 14:46

August 31

Tim Bray: A Story of O

In recent days I’ve been thinking of JavaOne, as we kicked it around and decided we just couldn’t send speakers; and of Oracle OpenWorld, to which JavaOne will now serve as an appendage. It reminded me of a conversation I had last year about Oracle.

The conversation involved myself and a person with a convincing title who, as they’d say in the paper, was “familiar with the situation”.

My question was: “OpenWorld is this totally all-about-business conference. The Oracle Develop meeting is just a second-rate sidebar. Where does Oracle go about building developer mindshare?”

I’ll try to reproduce the answer in full as best as I can remember it:

“You don’t get it. The central relationship between Oracle and its customers is a business relationship, between an Oracle business expert and a customer business leader. The issues that come up in their conversations are business issues.

“The concerns of developers are just not material at the level of that conversation; in fact, they’re apt to be dangerous distractions. ‘Developer mindshare’... what’s that, and why would Oracle care?”

Posted at 22:33

Norm Walsh: Mexico

Cancun and a day trip to the Riviera Maya brings me to country number 16.

Posted at 22:32

W3C News: Voice Extensible Markup Language (VoiceXML) 3.0 Draft Published

The Voice Browser Working Group has published a Working Draft of Voice Extensible Markup Language (VoiceXML) 3.0. VoiceXML is used to create interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed initiative conversations, and recording and presentation of a variety of media formats including digitized audio and digitized video. Learn more about the Voice Browser Activity.

Posted at 19:33

August 30

Norm Walsh: Reconsidering specialization, part the first

It's been a few years since I first considered DITA specialization. I wonder if I missed the point? I think that might depend on the assumptions that I brought to the table.

Posted at 17:19

W3C News: W3C Launches HTML Speech Incubator Group

W3C is pleased to announce the creation of the HTML Speech Incubator Group, whose mission is to determine the feasibility of integrating speech technology in HTML5 in a way that leverages the capabilities of both speech and HTML (e.g., DOM) to provide a high-quality, browser-independent speech/multimodal experience while avoiding unnecessary standards fragmentation or overlap. The following W3C Members have sponsored the charter for this group: Voxeo, Microsoft, Openstream, Google, AT&T, Mozilla. Read more about the Incubator Activity, an initiative to foster development of emerging Web-related technologies. Incubator Activity work is not on the W3C standards track but in many cases serves as a starting point for a future Working Group.

Posted at 15:40

Rick Jelliffe: Vale Java? Scala Vala palava - and Go too

Dave Megginson (who drove the development of the SAX API that will be familiar to many XML developers who use Java) recently wrote Java is dead. Java stood out as a programming language (though not as a platform) in that...

Posted at 15:31

Sean McGrath: It's all about the back end

David Eaves: Creating effective open government portals. Amen to that.

Here is the thing...most http://data.[whatever] websites are only as good as their ability to serve up fresh content. That oftentimes means that re-thinking back-end processes is required. Otherwise a one-off data dump happens to get things rolling but then...

Nothing kills a web-o-data project so ruthlessly as information latency.

Machine readable content - even more so than human readable content - must be current.

Posted at 14:59

Dimitre Novatchev: Not for XSLT? More Fun with Project Euler

 

Here is another Project Euler problem that seems exactly what XSLT was not intended for:

 

In the 5 by 5 matrix below, the minimal path sum from the top left to the bottom right, by only moving to the right and down, is indicated in bold red and is equal to 2427.

131  673  234  103   18
201   96  342  965  150
630  803  746  422  111
537  699  497  121  956
805  732  524   37  331

(Minimal path: 131 → 201 → 96 → 342 → 746 → 422 → 121 → 37 → 331.)

 

Find the minimal path sum, in matrix.txt (right click and 'Save Link/Target As...'), a 31K text file containing an 80 by 80 matrix, from the top left to the bottom right by only moving right and down.

My solution:

 

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:saxon="http://saxon.sf.net/"
 xmlns:mx="my:my" exclude-result-prefixes="xs saxon mx"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Parse matrix.txt into one <line> element per row, one <v> per cell -->
 <xsl:variable name="vMatrix" as="element()*">
  <xsl:variable name="vLines" as="xs:string*"
   select="tokenize(unparsed-text('matrix.txt'),'\s+')[.]"/>
   <xsl:for-each select="$vLines">
     <line>
      <xsl:for-each select="tokenize(.,',')">
        <v><xsl:value-of select="."/></v>
      </xsl:for-each>
     </line>
   </xsl:for-each>
 </xsl:variable>

 <!-- The matrix is square, so the row count is also the column count -->
 <xsl:variable name="vDimension" as="xs:integer"
  select="count($vMatrix)"/>

 <xsl:template match="/">
  <xsl:sequence select="mx:path-minSum(1,1,0)"/>
 </xsl:template>

 <!-- Minimal path sum from cell (pX, pY) to the bottom-right corner.
      pcurSum is passed through unchanged from the initial call (0),
      so memoization is effectively keyed on (pX, pY).
      999999999999 stands in for "no move possible in this direction". -->
 <xsl:function name="mx:path-minSum" as="xs:integer"
  saxon:memo-function="yes">
  <xsl:param name="pX" as="xs:integer"/>
  <xsl:param name="pY" as="xs:integer"/>
  <xsl:param name="pcurSum" as="xs:integer"/>

  <xsl:variable name="curVal" as="xs:integer"
   select="mx:mtx($pX, $pY)"/>

  <xsl:sequence select=
   "if($pX eq $vDimension and $pY eq $vDimension)
      then $curVal
      else
        for $nextX in min(($vDimension, $pX+1)),
            $nextY in min(($vDimension, $pY+1)),
            $s1 in if($nextY gt $pY)
                    then
                     $pcurSum + $curVal
                     + mx:path-minSum($pX, $nextY, $pcurSum)
                    else 999999999999,
            $s2 in if($nextX gt $pX)
                    then
                     $pcurSum + $curVal
                     + mx:path-minSum($nextX, $pY, $pcurSum)
                    else 999999999999
          return
            min(($s1,$s2))
   "/>
 </xsl:function>

 <!-- Value of the cell in column pX of row pY (both 1-based) -->
 <xsl:function name="mx:mtx" as="xs:integer"
  saxon:memo-function="yes">
  <xsl:param name="pX" as="xs:integer"/>
  <xsl:param name="pY" as="xs:integer"/>

  <xsl:sequence select="$vMatrix[$pY]/*[$pX]"/>
 </xsl:function>
</xsl:stylesheet>
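For comparison, here is a sketch of the same recurrence in Python, filled in row by row instead of via memoized recursion. This is not the stylesheet's method, just an iterative dynamic-programming version that is convenient for checking the 5 by 5 example; the function name `min_path_sum` is made up, and it assumes a non-empty rectangular matrix.

```python
from typing import List


def min_path_sum(matrix: List[List[int]]) -> int:
    """Minimal top-left to bottom-right path sum, moving only right or down."""
    # best[x] is the cheapest cost of reaching column x in the row processed so far.
    best: List[int] = []
    for y, row in enumerate(matrix):
        for x, value in enumerate(row):
            if y == 0 and x == 0:
                best.append(value)
            elif y == 0:
                best.append(best[x - 1] + value)  # first row: reachable only from the left
            elif x == 0:
                best[0] += value                  # first column: reachable only from above
            else:
                best[x] = min(best[x - 1], best[x]) + value
    return best[-1]


example = [
    [131, 673, 234, 103, 18],
    [201, 96, 342, 965, 150],
    [630, 803, 746, 422, 111],
    [537, 699, 497, 121, 956],
    [805, 732, 524, 37, 331],
]
print(min_path_sum(example))  # 2427
```

For the 80 by 80 file, each line of matrix.txt would be split on commas and converted to integers before calling the function.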

 

 


Posted at 12:23

August 29

Tim Bray: Late But Essential Review

I read Michael Lewis’ The Big Short: Inside the Doomsday Machine months ago, and have been feeling guilty about not recommending it, because this material is sort of essential for anyone who would like to understand how our economy ended up in the toilet. Read on, not just for a (spoiler: positive) review, but for potentially time- and money-saving advice.

Sidebar: Michael Lewis

I should disclose that I’m a hopeless Michael Lewis fan; in my review of Moneyball I wrote “I suspect there may not be a greater living writer of reportorial non-fiction” and yes, I still suspect that. So you could either use my admitted bias to discount this review, or alternately join the club and next time you see a Lewis book in the airport bookstore, just grab it.

Sidebar: Maybe Don’t Read the Book

The book has its roots in a November 2008 piece in Portfolio magazine, The End. I’m not actually sure that the book is a better piece of work than the (much shorter) essay; and I am pretty sure that all the really important lessons of the book are there in The End.

Now if you enjoy Lewis’ narrative of this frankly-incredible-even-though-we-all-watched-it-happen story, you probably want to get the book, just because there’s room for more narrative and the rest is mostly just as good.

But considered as a work of the non-fiction writer’s art, I’d have to favor the shorter version for its ruthless focus and pace.

The Story

It’s simple enough; Lewis found individuals and partnerships (three in the book, one in the essay) who decided the mortgage bubble was bullshit far before most others did, and made gobs of money betting it would pop, and he walks us through their experience. Most of their time doing this was pretty stressful, because a lot of apparently-smart people were betting against them every step of the way and taking home millions doing so. Even when it all fell apart and they got paid it was stressful, because they understood the scale of the disaster before anyone else.

Anyhow, they’re interesting people, the financial tools used both for inflating the bubble and betting against it are interesting too, and Lewis can tell a story with the best of them. Unless you’re oblivious to recent economic history, you’ll like it.

Lesson 1

It’s one we should have learned already, long ago: The business of Finance is 100% about making big money for people in the business of Finance. Everything else is irrelevant. The party line is that it’s about routing capital from those who have it to those who need it in a maximally-efficient market-driven way. Hah hah hah.

There’s another legend that it’s about making money for investors. Double hah hah hah. If you’re an investor, it’s about them making bets with your money and if they lose you lose, if they win you get a few scraps and they get a bigger vacation home.

This really should not be surprising. There is apparently no social or ethical force that will cause people to bypass a chance to shovel money into their own pockets, without regard for catastrophic costs to their fellow-humans no matter how predictable. This needs to be an axiom in the thinking of all future regulatory planners.

Lesson 2

Of all the failures that led to the big meltdown, the most aggravating is the failure of the bond-rating agencies. These people took good money for pasting AAA credit ratings on piles of the most implausible shit imaginable, and what’s irritating isn’t that they did it, it’s that apparently they didn’t break any laws and thus there’s little prospect of the long prison terms that anything smelling of natural justice would require.

If you shared my blank, astonished “how could that happen?” reaction, you’ll probably enjoy Roger Lowenstein’s Triple-A Failure, published in April 2008 in the New York Times.

Once again, not surprising: having debt ratings paid for by the people issuing debt creates a huge conflict of interest, and per Lesson 1, any such conflicts will be taken advantage of by Finance insiders to fleece the sheep otherwise known as you and me.

Lesson 3

Finance’s relationship to the economy should best be considered by policymakers as that of a dangerous parasite to its host. Any benefits offered at the margin by its market-making functions are dwarfed by the existential threats it is empirically observed to pose, on a regular basis, to the proper functioning of the real economy.

An earlier draft had the word “real” in the previous sentence enclosed in quotes. I took them away, because it really is real, as opposed to Finance which, on the evidence, is the mostly-toxic product of pure imagination, imagination fevered by a lethal illness that the rest of us are in danger of catching.

I’d say Finance should be regulated into utility status all over the civilized world, and if the community of hard-core financial engineers really wants to go on being a collective pathogen, they’ll be forced to acknowledge that what they do just isn’t civilized behavior, and do it somewhere else. We’ll be way better off without them among us.

Posted at 22:31

Dare Obasanjo: Lessons from Google Wave and REST vs. SOAP: Fighting Complexity of our own Choosing

Software companies love hiring people that like solving hard technical problems. On the surface this seems like a good idea; unfortunately, it can lead to situations where people building a product focus more on the interesting technical challenges they can solve than on whether their product is actually solving problems for their customers.

I was reminded of this after reading an answer to a question on Quora about the difference between working at Google versus Facebook, where David Braginsky wrote

Culture:
Google is like grad-school. People value working on hard problems, and doing them right. Things are pretty polished, the code is usually solid, and the systems are designed for scale from the very beginning. There are many experts around and review processes set up for systems designs.

Facebook is more like undergrad. Something needs to be done, and people do it. Most of the time they don't read the literature on the subject, or consult experts about the "right way" to do it, they just sit down, write the code, and make things work. Sometimes the way they do it is naive, and a lot of time it may cause bugs or break as it goes into production. And when that happens, they fix their problems, replace bottlenecks with scalable components, and (in most cases) move on to the next thing.

Google tends to value technology. Things are often done because they are technically hard or impressive. On most projects, the engineers make the calls.

Facebook values products and user experience, and designers tend to have a much larger impact. Zuck spends a lot of time looking at product mocks, and is involved pretty deeply with the site's look and feel.

It should be noted that Google deserves credit for succeeding where other large software companies have mostly failed: throwing a bunch of Ph.D.s at a problem and actually having them create products that impact hundreds of millions of people, as opposed to research papers that impress hundreds of their colleagues. That said, it is easy to see the impact of complexophiles (props to Addy Santo) in recent products like Google Wave.

If you go back and read the Google Wave announcement blog post it is interesting to note the focus on combining features from disparate use cases and the diversity of all of the technical challenges involved at once including

The product announcement read more like a technology showcase than an announcement for a product that is actually meant to help people communicate, collaborate or make their lives better in any way. This is an example of a product where smart people spent a lot of time working on hard problems, but at the end of the day they didn't see the adoption they would have liked because they spent more time focusing on technical challenges than ensuring they were building the right product.

It is interesting to think about all the internal discussions and time spent implementing features like character-by-character typing without anyone bothering to ask whether that feature actually makes sense for a product that is billed as a replacement for email. I often write emails where I write a snarky comment, then edit it out when I reconsider the wisdom of sending that out to a broad audience. Nobody wants other people to actually see that authoring process.


Some of you may remember that there was a time when I was literally the face of XML at Microsoft (i.e. going to http://www.microsoft.com/xml took you to a page with my face on it). In those days I spent a lot of time using phrases like the XML<->objects impedance mismatch to describe the fact that the dominant type system for the dominant protocol for web services at the time (aka SOAP) actually had lots of constructs that don’t map well to a traditional object oriented programming language like C# or Java. This was caused by the fact that XML had grown to serve conflicting masters. There were people who used it as a basis for document formats such as DocBook and XHTML. Then there were the people who saw it as a replacement for the binary protocols used in interoperable remote procedure call technologies such as CORBA and Java RMI. The W3C decided to solve this problem by getting a bunch of really smart people in a room and asking them to create some amalgam type system that would solve both sets of completely different requirements. The output of this activity was XML Schema, which became the type system for SOAP, WSDL and the WS-* family of technologies. This meant that people who simply wanted a way to define how to serialize a C# object in a way that it could be consumed by a Java method call ended up with a type system that was also meant to be able to describe the structural rules of the HTML in this blog post.

Thousands of man-years of effort were spent across companies like Sun Microsystems, Oracle, Microsoft, IBM and BEA to develop toolkits on top of a protocol stack that had this fundamental technical challenge baked into it. Of course, everyone had a different way of trying to address this “XML<->object impedance mismatch”, which led to interoperability issues in what was meant to be a protocol stack that guaranteed interoperability. Eventually customers started telling their horror stories about actually using these technologies to interoperate, such as Nelson Minar’s ETech 2005 Talk - Building a New Web Service at Google, and a movement around building web services using Representational State Transfer (REST) was born. In tandem, web developers realized that if your problem is moving programming language objects around, then perhaps a data format that was designed for that is the preferred choice. Today, it is hard to find any recent, broadly deployed web service that doesn’t use JavaScript Object Notation (JSON) as opposed to SOAP.
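A small sketch of why JSON feels natural for this job: the data format's objects, arrays, strings and numbers line up with what most programming languages already have, so the round trip is almost mechanical. The `Order` class and its fields are hypothetical, made up purely for this example, and the snippet uses only Python's standard `json` and `dataclasses` modules.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Order:
    # Hypothetical object that a service might want to move over the wire.
    order_id: int
    customer: str
    items: List[str]


def to_wire(order: Order) -> str:
    """Serialize the object as a JSON document."""
    return json.dumps(asdict(order))


def from_wire(payload: str) -> Order:
    """Rebuild the object from its JSON form."""
    return Order(**json.loads(payload))


original = Order(order_id=42, customer="Example Corp", items=["widget", "gadget"])
payload = to_wire(original)
print(payload)

round_tripped = from_wire(payload)
print(round_tripped == original)  # True
```

There is no schema language standing between the object and its serialized form here, which is both the appeal and the trade-off compared with an XML Schema based stack.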


The moral of both of these stories is that a lot of the time in software it is easy to get lost in the weeds solving hard technical problems that stem from complexity we’ve imposed on ourselves through some well-meaning design decision, instead of actually solving customer problems. The trick is being able to detect when you’re in that situation and seeing whether altering some of your base assumptions leads to a simplification of your problem space that frees you up to actually spend time solving real customer problems and delighting your users. More people need to ask themselves questions like: do I really need to use the same type system and data format for business documents AND serialized objects from programming languages?

Now Playing: Travie McCoy - Billionaire (featuring Bruno Mars)

Posted at 02:47

August 28

Tim Bray: Tethering

I travel quite a bit, and I have found that the “tethering & portable hotspot” facility in Android 2.2 is just absolutely wonderful. It has saved me considerable money and got me reasonably-good connectivity in places I wouldn’t otherwise have had it; I’m looking at you, big-name US hotel chains.

When I heard that telephone companies were charging extra for this, I couldn’t figure out how they were doing it; without considerable deep-packet inspection, how can you tell that there are other computers gatewaying through my Nexus One, which in fact seems to hotspot just fine on certain networks that are said to charge extra? The answer is obvious but only once you see it: the network operators modify Android on the locked phones they sell cheap along with a contract (perfectly legal, it’s open-source) to remove the built-in tethering/hotspot option, and replace it with one of their own, which they charge for.

I’m not going to weigh in on the pros and cons of the business model, because I have no insight into telco cost structures or indeed what would happen if tethering became free for everyone. There’s no doubt that for some of us it’s a major value-add and it doesn’t seem unreasonable to pay a little extra for it. I paid a few bucks a month for Boingo until I got this going, and that seemed fair.

However, I will point out that for people who travel a lot, an unlocked phone (in the range of $500 for most decent Android devices) might end up looking cheap.

Further practical advice: plug that puppy in if you’re going to be doing this for more than a few minutes, because that WiFi radio seems to eat watts in hotspot mode. And don’t stick it in your pocket; the Nexus One, at least, runs way hot when plugged-in and tethering.

Posted at 22:23

XML.com: Vale Java? Scala Vala palava

Dave Megginson (who drove the development of the SAX API that will be familiar to many XML developers who use Java) recently wrote Java is dead. Java stood out as a programming language (though not as a platform) in that...

Posted at 05:21

David Megginson: davidmegginson

I installed the Google Chat Voice plugin today, and found that I was able to make free Google Voice calls from Canada to both a US and a Canadian POTS number. I’m still unable to register for Google Voice at google.com/voice, and I cannot use the Google Voice app on my Android phone, but at least I can initiate a call from inside GMail on my laptop now.

Does this mean that Google is about to roll out full Google Voice support for Canada, or just that they forgot to plug a hole in the code that’s supposed to prevent non-US accounts from using the service?

Google Voice on my Android phone would be fantastic, because I could make unlimited North American phone calls on (say) a 6 GB, $30/month data plan instead of paying the world’s highest cell phone bills for (limited) voice and long distance. I’m sure Rogers and other mobile carriers won’t be happy about that, but I hope their lobbyists can’t stop it.


Tagged: business, mobile, news, voip

Posted at 01:04

August 27

Lauren Wood: Busy, Busy

As with many people I know, the dichotomy between doing and blogging is often resolved by more doing, and not so much blogging, especially with Twitter, Identi.ca, et al around for the quick asides. Time to craft a careful post is in short supply, especially sufficient time to craft a post that looks effortless.

But today one of my projects has finished one major phase so I’m taking some time. I’ve started working in healthcare, or more precisely, doing project management on a project basis for Alschuler Associates, involving lots of XML, lots of client discussions, and working with a distributed team across 3 timezones. It’s interesting, and complicated, and I still feel like I’m just getting started although I’ve been working on it for almost six months.

And it’s just as well those projects are in a slower spell, since in a little over a week the XML Summer School starts, for which I’m Course Director. Most of the prep work has been done, and soon the fun and learning start. I enjoy going each year, catching up on new technologies, learning more about the ones I’ve heard about before but haven’t had a chance to try out, catching up on what’s new in the world of XML. I didn’t make it to Balisage this year due to project commitments (see above); the XML Summer School makes up for that to some extent. And this year we’re in Oxford at the right time for the St Giles Fair, which makes for a change to the usual pub crawl.

Other projects are taking a back seat, unfortunately. There’s only so much time in the day, and so many interesting things to fill it with.

Posted at 15:56

August 26

Dave Beckett: Leaving Yahoo – Joining Digg

I’m heading to a new adventure at Digg in San Francisco to be a lead software engineer working on APIs and syndication.

I’ve been at Yahoo! nearly 5 years so it is both a happy and sad time for me, and I wish all the excellent people I worked with the best of luck in future.

Here is a summary of the main changes:

Exciting!

Posted at 20:44

August 24

Tim Bray: Late Summer Tech Tab Sweep

Some of these puppies have been keeping a browser tab open since April. No theme; ranging on the geekiness scale from extreme to mostly-sociology.

People

First, the good news. There’s real demand for senior people in our trade. Simon Phipps, who got me the job at Sun and whose opinions I pay careful attention to even when I disagree, has a new gig at ForgeRock, where they’re trying to build a sensible profitable business around open-source principles and some damn good technology that Oracle was too stupid to get behind.

Also, my long-time compatriot Dave Orchard just started looking for a gig; we had coffee the other day and he’s fielding some super-interesting offers. He hasn’t accepted any; if you want that sort of talent, better move fast.

On the other hand, half the people out there are women, and while I have to say that their progress through the educational and business worlds gives lots of reason for cheer, we still are mostly failing at attracting them to technology careers. A few pieces on this front crossed my radar recently: Nicole Sullivan a.k.a. “Stubbornella” on Woman in technology, Alice Adams’ What Women Want and How Not to Give It to Them, and Anil Dash’s Mechanisms of Exclusion. These are neither short nor uncontroversial, but I’ll leave my side of the controversies out and just assert that they’re really worth reading. Well, except to say, in response to Anil, that I’d advise most entrepreneurs, women and men both, to stay well away from VCs at this moment in history.

How Does a Cellphone Work?

I’d previously come across Harald Welte as one of the leaders of the fascinating but fruitless OpenMoko project; his Anatomy of contemporary GSM cellphone hardware (PDF) is deep and well-written. Used to be I didn’t understand how all that “radio” layer stuff worked; I still don’t, but now I sort of know what I don’t know.

Android Miscellanea

Christian Neukirchen: Programming for Android with Scala.

Probably because I spend way too much time in airplanes, I’ve always enjoyed Flight Level 390, by an anonymous commercial airline captain, on the pains and pleasures of flying Airbuses all over the New World.

In Independence Day Over Pensacola, he talks about a tricky landing in Orlando and as he’s winding up, writes “The crew van is rolling as iPhones and Droids come out of pockets and purses to call loved ones.” And that’s about how it is, you know; the mainstream is Them and Us, for now.

Concurrency

I used to worry about it all the time in my previous job, and I still watch that world. I see that there was an Intel Threading Challenge 2010, and I’m unsurprised that it was won by Dmitriy V’jukov, also the winner of my Wide Finder 2 challenge, with some of the gnarliest C code imaginable. Which served to demonstrate my point that this stuff is still way, way too hard.

Oh, and that Intel challenge has a Phase 2.

Future, With a Zinger

I’m talking about Michael Nygard’s The Future of Software Development, which contains probably the harshest prognosis for Java’s future that I’ve read from someone who’s actually speaking in a reasoned tone of voice.

Ruby Love

Here are two short essays on the same subject: Why many people like using Ruby; I’m one of them: Michael Bleigh writes The Future’s Pretty Cool, or Why I Love Ruby and Len Smith’s 8 Reasons I love Ruby.

Enterprise Awfulness

From The Economist, Computer says no; I’m delighted that someone is telling civilians the truth about how badly our discipline is practiced, most places, most times.

Data Freedom

From Kellan, and oh my goodness does he put it well. Minimal Competence: Data Access, Data Ownership, and Sharecropping. Sample quote: “It’s your data, and you’ve granted us a limited license to use it... The ability to get out the data you put in is the bare minimum. All of it, at high fidelity, in a reasonable amount of time.”

Mmmmm, tasty.

Posted at 02:01

August 23

Sean McGrath: Normal people, normal spreadsheets and RDF

In a post about Gridworks Jeni says:

"Like a lot of spreadsheets created by normal people, who want to create something readable by human beings rather than computers, it has some extra lines at the top to explain what the spreadsheet contains..."

There is a terribly, terribly common pattern here and it has always surprised me that spreadsheet developers have never made row 1 and col 1 "special" for exactly this reason. I've lost count of the number of spreadsheets I've seen that have labels in row 1, labels in col 1 and data in the intersection cells.

Subject, predicate, object anyone :-) Where do all the triples go?

Posted at 19:27

Micah Dubinko: Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there:

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.

-m

Posted at 06:07

August 22

Jeni Tennison: Using Freebase Gridworks to Create Linked Data

When we encourage people to put their data on the web as linked data, the biggest question is “How?”. There are so many “How?” questions to answer:

and, of course:

Our goal within the linked data part of data.gov.uk (and I know we haven’t achieved it yet) is to both answer these questions and to make the answers as simple as possible. The answers to the questions cannot either require up-front knowledge of all possible types of data that might be published or depend on the availability of linked data for all the things we want to talk about. It cannot require registration at centralised services. It cannot require everyone to do everything in the same way or at the same pace.

We must adopt an approach that encourages people to make their data available in forms that are easier for other people to pick up and use because they see the benefits for them and their stakeholders and because the effort of doing so is not too high to bear. We must grow, adapt and evolve incrementally. If linked data eventually wins, it will be due to its benefits, not to faith.

Anyway, enough rant. The point of this blog post is to talk about one of the answers to the ‘How do we create it?’ question: using Freebase Gridworks. For those who haven’t encountered it, Gridworks is an incredibly useful application that enables you to easily analyse, clean and manipulate tabular data. In a few steps, it can be used to generate linked datasets which can then be published on the web just like any other file, ready for other people to reuse without jumping through hoops. I’m going to assume that you can download it and install it following the instructions provided on the Gridworks site.

In this post, I’m going to talk about how to use Gridworks to generate linked data, using an example of local government spending data from Windsor and Maidenhead council. Like a good train journey, there’s quite a lot to see along the way.

Note: Many thanks to Dave Reynolds for his work on this data and comments on an earlier version of this post.

Importing Data

The first step is to import the data into Gridworks. If you just take the Windsor & Maidenhead data and import it directly, you’ll get a single not-very-useful column as shown in the following screenshot:

If you look at the spreadsheet in a normal spreadsheet programme then you’ll see why. Like a lot of spreadsheets created by normal people, who want to create something readable by human beings rather than computers, it has some extra lines at the top to explain what the spreadsheet contains, as shown in the following screenshot:

Fortunately, Gridworks lets us easily skip over these first few lines. When you import the data, put the number 1 in the box for “Ignore X initial non-blank lines”, as shown here:

(You need the number 1 because although there are three lines before the table really starts, the second two of those are blank.)
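For readers who want to see the same idea outside Gridworks, here is a minimal sketch in Python. It assumes the spreadsheet has been saved as CSV text; the function name `read_table` and the sample rows are made up for illustration and are not part of Gridworks. Note that it drops every blank line, including any inside the table, which the real import option may not do.

```python
import csv
import io


def read_table(text, ignore_non_blank=1):
    """Return (header, rows), skipping blank lines and the first
    `ignore_non_blank` non-blank lines before the header."""
    reader = csv.reader(io.StringIO(text))
    # A line counts as blank when every cell is empty.
    non_blank = [row for row in reader if any(cell.strip() for cell in row)]
    table = non_blank[ignore_non_blank:]
    return table[0], table[1:]


# Hypothetical sample: one explanatory line, two blank lines, then the table.
sample = (
    "Supplier payments over 500 pounds\n"
    "\n"
    "\n"
    "Directorate,Supplier Name,Amount\n"
    "Adult & Community Services,Abba Cars,512.00\n"
)
header, rows = read_table(sample)
```

Only the one explanatory line has to be counted, because the two blank lines are discarded before counting begins.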

That done, the data should look a lot more useful, as shown in the following screenshot:

Cleaning Data

The next thing to do is to explore the data a bit to get a handle on what’s there and work out whether any cleaning or rationalisation is necessary to improve its quality.

With columns that hold names, such as ‘Directorate’, ‘Service’ or ‘Supplier Name’, you’re looking for slight misspellings caused by bad data entry. Gridworks helps you find these by creating a list of the distinct values for a particular column and telling you how many instances there are of each. Use the arrow at the side of the column name to pull down the menu, then choose Facet > Text Facet to create this list, as shown here:

Once you’ve chosen Text Facet, the list pops up on the left-hand side of the window. You can click on a value to filter the table to just those rows that have that value for that column, and you can also scan through the list to spot any places where there looks to be a typo or two entries that should really be the same. For example, the Services list holds both ‘Libraries & Information Services’ and ‘Library & Information Services’, as shown here:

It’s unlikely that there are really two distinct services with such similar names, so we’d like to clean up this data by standardising on one name or another. You can quickly change all occurrences of one value to another using the edit option that appears just to the right of the value when you hover over it. This brings up a dialog that enables you to change all of those values to something else, as shown here:

You can do something similar with numeric columns, such as the ‘Amount excl vat £’ column. This time choose Numeric Facet rather than Text Facet and you’ll get a histogram up as shown here:

This is useful for identifying outliers. If you grab the handle on the left of the histogram and move it to the centre, the rows will get filtered to only those that have an amount within that range. For example, moving it to only show rows between £500,000 and £1,500,000 shows that there are three payments of this size, all made by Children’s Services to Wilmott Dixon Construction Limited, as shown in this screenshot:

Although these values are much higher than most of the others in the spreadsheet, they don’t seem to be errors — I guess a new school was being built or something — so there’s nothing to correct here, but it shows how numeric facets can be used to explore the data.
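The two kinds of facet described above boil down to counting distinct values and filtering on a numeric range. The sketch below shows that in Python, assuming each row is a dictionary keyed by column name. The function names, the shortened `Amount` column name and the sample rows are hypothetical, and this is not how Gridworks itself implements facets.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in a column."""
    return Counter(row[column] for row in rows)


def numeric_range(rows, column, low, high):
    """Keep rows whose numeric value lies within [low, high]."""
    return [row for row in rows if low <= float(row[column]) <= high]


# Hypothetical rows, not taken from the council's spreadsheet.
rows = [
    {"Service": "Libraries & Information Services", "Amount": "750.00"},
    {"Service": "Library & Information Services", "Amount": "1200.50"},
    {"Service": "Libraries & Information Services", "Amount": "600000.00"},
]
facet = text_facet(rows, "Service")
large = numeric_range(rows, "Amount", 500000, 1500000)
```

A value that turns up only once next to a near-identical one that turns up many times is usually the typo.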

Another approach to exploring and cleaning the data is to use the clustering algorithms that are built into Gridworks to identify duplicates. To do this, pull down the column menu and this time choose Edit Cells... > Cluster and Edit, as shown in the following screenshot, this time for the ‘Supplier Name’ column:

This brings up a dialog that groups together values that look similar. In this case, ‘Siemens plc’ and ‘Siemens PLC’, as shown in the following screenshot:

You can use this dialog to change all the similar values to a standard one. Check the Merge checkbox for the clusters of values that should be merged, edit the New Cell Value field to whatever standard value you want to adopt, and choose Apply & Re-cluster or simply Apply & Close to make the change.

You will often find that the default clustering algorithm (key collision/fingerprint) doesn’t come up with any clusters as it’s fairly conservative. It’s worth playing around a bit with different algorithms to look for other duplicates by selecting other possibilities from the drop-down menus. For example, choosing the ‘nearest neighbour’ method with the Levenshtein distance function and a radius of 2 (edits) results in four possible duplicates within the Suppliers list, as shown here:

If you’re not sure about whether the cluster is due to a typo or not, hover over the row and click on the Browse this cluster link that appears. That will bring up a separate window that will show you just the rows in the cluster, from which you should be able to make a judgement. For example, it’s not clear whether ‘Academia Ltd’ is a typo for ‘Academics Ltd’ but browsing the cluster shows that the Cost Centre codes and the Types of the transactions are completely different for the two Suppliers, so they are probably different.
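To give a feel for why the two methods behave so differently, here is a rough Python sketch of both ideas. The `fingerprint` function is only an approximation of a key-collision key (lower-case, strip punctuation, sort the unique words); the exact normalisation Gridworks applies is an assumption here. The Levenshtein function is the standard dynamic-programming edit distance.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalise a name so trivially different spellings collide."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def key_collision_clusters(values):
    """Group distinct values sharing a fingerprint; keep groups of 2+."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


names = ["Siemens plc", "Siemens PLC", "Academia Ltd", "Academics Ltd"]
clusters = key_collision_clusters(names)
distance = levenshtein("Academia Ltd", "Academics Ltd")
```

With these names, key collision only pairs up the two Siemens spellings, while the two ‘Academi...’ names sit two edits apart and so only show up with a nearest-neighbour radius of 2. That is exactly the kind of match that needs a human to check.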

Deriving Data

The next step is to derive some data from what we have within the spreadsheet. Since our goal is to produce linked data, the kind of derived data that we’re interested in is URIs.

At this point we need to start making decisions about what URIs to use. If you look at the list of spending data from Windsor and Maidenhead, you’ll see that there are a whole bunch of these spreadsheets. It would be really useful if we could tie these spreadsheets together by using the same URIs for the same things across the datasets. For that reason, the only URI that’s going to be local to the dataset is the URI for each line (or data point if you like) itself. On the other hand, most of the things that are named here are going to be local to Windsor & Maidenhead: ‘Abba Cars’ may be sufficient to identify a single company within Windsor & Maidenhead, but certainly wouldn’t be nationwide. So the URIs I’m going to create here are mostly going to be within the www.rbwm.gov.uk domain.

Here’s the table of the columns and the associated URIs that I’m going to use. I should stress that this is just for example purposes, but I’ve used the following principles:

This is what we’re doing within data.gov.uk, but it’s an important principle of the web that different councils might well choose their own URI schemes, depending on the kind of technology support that they have, without any bad side-effects on the interpretation of the data.

Column                  URI pattern
(Dataset)               http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2
(Row/ExpenditureLine)   http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#{row-number}
(Council)               http://statistics.data.gov.uk/id/local-authority/00ME
Directorate             http://www.rbwm.gov.uk/id/directorate/{directorate-slug}
Updated                 http://reference.data.gov.uk/id/day/{date}
TransNo/Payment         http://www.rbwm.gov.uk/id/transaction/{transaction-number}
Service                 http://www.rbwm.gov.uk/id/service/{service-slug}
Cost Centre             http://www.rbwm.gov.uk/def/cost-centre/{cost-centre-code}
Supplier Name           http://www.rbwm.gov.uk/id/supplier/{supplier-slug}

As you can see, those of the columns that contain text fields have, as part of their URI, a ‘slug’. This is a shortened, normalised value suitable for putting in a URI: basically ensuring that the string doesn’t contain any punctuation or spaces. For example, ‘Adult & Community Services’ would turn into ‘adult-community-services’.

Our first task will be to create these slugs. To do this, we’ll create a new column based on the existing ones by choosing Edit Column > Add Column Based on This Column ... from the drop-down menu on the appropriate column:

Selecting this will bring up a dialog which will ask you to name the new column and then enter a formula to calculate the new value, as shown here:

The default language for this formula is Gridworks’ own, though there are other options available. To create the slug, we need to:

  1. turn the value to lower case
  2. replace all spaces with hyphens
  3. remove anything that isn’t a letter, number, or hyphen
  4. replace all sequences of two hyphens with a single hyphen

This is done in two steps. The first three steps can be done using the formula:

replace(replace(toLowercase(value), ' ', '-'), /[^-a-z0-9]/, '')

Gridworks helps by listing the original and resulting values for the first several rows of the spreadsheet, so that you can see whether it’s working as expected. When you’re happy, hitting OK creates the new column.

The last step (replacing all sequences of two hyphens with a single hyphen) can be done by editing the cells in the new column. Bring up the Edit Cells... > Transform... dialog using the menu:

and use the formula:

replace(value, '--', '-')

then check the Re-transform until no change checkbox so that any pairs of hyphens are repeatedly replaced with single hyphens, as shown here:

The other tabs in the new column and edit cells dialogs are really helpful. The History tab lets you choose formulae that you’ve used before to use again. This is useful here because we want to create the slugs for the Service and Supplier Name in the same way. The Help tab lists all the functions that you can use within the formula.
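If you want to check the slug rules away from Gridworks, the four steps listed earlier can be sketched in Python as below. The function name `slugify` is made up for illustration. The loop plays the role of the Re-transform until no change checkbox.

```python
import re


def slugify(value):
    """Mirror the four slug steps: lower-case, spaces to hyphens,
    drop other punctuation, collapse repeated hyphens."""
    slug = value.lower().replace(" ", "-")
    slug = re.sub(r"[^-a-z0-9]", "", slug)
    # Repeat until stable, like "Re-transform until no change".
    while "--" in slug:
        slug = slug.replace("--", "-")
    return slug


directorate_uri = "http://www.rbwm.gov.uk/id/directorate/" + slugify(
    "Adult & Community Services"
)
```

The repeat matters for names such as ‘1st Choice - D B Driveways Limited’, where a hyphen surrounded by spaces becomes three hyphens in a row and a single replacement pass would still leave two.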

Creating the URIs for the columns proceeds in the same way, except this time the formulae are more like:

'http://www.rbwm.gov.uk/id/directorate/' + value

There are two that are slightly different. First, there’s the URI for the date, which needs to be constructed from the date/time value held by Gridworks as follows. We can do this in two stages. First, to construct a new column called ‘Date’ to hold the formatted date:

datePart(value, 'year') + '-' + 
if (datePart(value, 'month') < 9, '0', '') + replace(datePart(value, 'month') + 1, '.0', '') + '-' + 
if (datePart(value, 'day') < 10, '0', '') + datePart(value, 'day')

(note that the datePart() function returns a 0-based count for the month) and then to create the Date URI column based on this:

'http://reference.data.gov.uk/id/day/' + value
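For comparison, here is the same date URI built in Python, as a sketch only. The function name `day_uri` is hypothetical. Python’s `date` uses 1-based months and zero-pads for you, so none of the month arithmetic from the Gridworks formula is needed.

```python
from datetime import date


def day_uri(day):
    """Build the reference.data.gov.uk day URI for a date."""
    # isoformat() zero-pads month and day, giving YYYY-MM-DD.
    return "http://reference.data.gov.uk/id/day/" + day.isoformat()


uri = day_uri(date(2010, 4, 9))
```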

Second, there’s the URI for the row (an expenditure line) itself, which needs to be constructed using the row number. It’s useful to construct it as a local URI (i.e. just the fragment) as this means the same code can be used to construct the column across different datasets, so it’s just:

'#' + rowIndex

Exporting Data

Once the extra columns have been made, it’s time to export data from Gridworks. While Gridworks makes it easy to export to CSV or into Freebase, it’s also possible to export in any format you want using templates. Use the Project menu and choose Export Filtered Rows > Templating ..., as shown in the following screenshot:

Note that this will only export the rows that you currently have selected, so if you want to export everything, make sure that you deselect any facets that you’ve currently got selected.

Choosing the Templating ... option will open up a dialog that you can use to create whatever format you want. The default, as shown in the following screenshot, is JSON.

On the left are four fields:

One thing to be extremely careful of here is that any changes you make to the fields on the left here will not be saved when the dialog is closed. For that reason, it’s a good idea to create your templates in a separate text file and copy and paste them in. Also note that the sample data on the right is only for the first set of rows, not for the whole spreadsheet.

We’re going to generate Turtle using the template, so the next stage is to work out precisely what Turtle to generate. We’ve been working on a small vocabulary for payment data based on the Data Cube vocabulary and that’s what I’ll use here, although it isn’t yet as complete or as available as it will be. We’ll start at the bottom, with the individual rows, and then add extra surrounding information as we go.

Row Template

Within this data, each row corresponds to a payment:ExpenditureLine within the dataset. The expenditure lines can be organised into groups based on the payment:Payment that they’re associated with, which is indicated through the ‘TransNo’ column in the database. Within the payment vocabulary we’re using, we can assign individual expenditure lines to the payment using the payment:expenditureLine property.

The payment:payer of each payment:Payment is Windsor & Maidenhead council. The payment:payee is the ‘Supplier’ listed in the spreadsheet. The payment:date is the ‘Updated’ date.

Each individual line in the spreadsheet is a payment:ExpenditureLine which is associated with one of these payments. The payment:expenditureCode is the ‘Cost Centre’ and the actual payment:amountExcludingVAT is the ‘Amount excl vat £’ value. Some example Turtle for the first line is thus:

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2>
  qb:slice <http://www.rbwm.gov.uk/id/transaction/2650750> .

<http://www.rbwm.gov.uk/id/transaction/2650750>
  a payment:Payment , qb:Slice ;
  rdfs:label "Transaction 2650750"@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference "2650750" ;
  payment:payer <http://statistics.data.gov.uk/id/local-authority/00ME> ;
  payment:payee <http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited> ;
  payment:date <http://reference.data.gov.uk/id/day/2010-04-09> ;
  payment:expenditureLine <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0> .

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0>
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label "Expenditure Line 0"@en ;
  qb:dataSet <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;
  payment:expenditureCode <http://www.rbwm.gov.uk/def/cost-centre/LM05> ;
  payment:amountExcludingVAT 1875.00 .

That’s the basic data for each line, but there’s also some other information which should be brought out for each line:

In each of these cases, pulling the information out from each line is going to lead to a lot of repetition, because the same payee, date and so on will be described in multiple lines, but we don’t have any choice and we can tidy it up by removing duplicates afterwards. The Turtle for the first line will look like:

<http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited>
  a org:Organization ;
  rdfs:label "1st Choice - D B Driveways Limited"@en .

<http://reference.data.gov.uk/id/day/2010-04-09>
  a interval:CalendarDay ;
  rdfs:label "2010-04-09" ;
  time:hasBeginning <http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00> ;
  interval:ordinalYear 2010 ;
  interval:ordinalMonthOfYear 4 ;
  interval:ordinalDayOfMonth 9 .

<http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00>
  a time:Instant ;
  time:inXSDDateTime "2010-04-09T00:00:00"^^xsd:dateTime .

<http://www.rbwm.gov.uk/def/cost-centre/LM05>
  a rbwm:CostCentre , skos:Concept ;
  rdfs:label "Cost Centre LM05"@en ;
  rbwm:costCentreCode "LM05"^^rbwm:CostCentreCode ;
  rbwm:service <http://www.rbwm.gov.uk/id/service/magnet-leisure-centre> .

<http://www.rbwm.gov.uk/id/service/magnet-leisure-centre>
  a rbwm:Service ;
  rdfs:label "Magnet Leisure Centre"@en ;
  rbwm:providedBy <http://www.rbwm.gov.uk/id/directorate/adult-community-services> .

<http://www.rbwm.gov.uk/id/directorate/adult-community-services>
  a rbwm:Directorate ;
  rdfs:label "Adult & Community Services"@en ;
  org:unitOf <http://statistics.data.gov.uk/id/local-authority/00ME> ;
  rbwm:provides <http://www.rbwm.gov.uk/id/service/magnet-leisure-centre> .

<http://statistics.data.gov.uk/id/local-authority/00ME>
  org:hasUnit <http://www.rbwm.gov.uk/id/directorate/adult-community-services> .
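The tidying up of duplicates mentioned above can be done crudely at the text level. This Python sketch assumes the exported Turtle separates each subject’s description with a blank line, as in the listings here, and only removes blocks that are character-for-character identical. The function name is hypothetical. Loading the file into an RDF store would also merge repeated triples, so treat this purely as a way to shrink the file.

```python
def drop_duplicate_blocks(turtle):
    """Keep the first copy of each blank-line-separated block."""
    seen = set()
    kept = []
    for block in turtle.strip().split("\n\n"):
        key = block.strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return "\n\n".join(kept) + "\n"


# Hypothetical export in which two rows share the same supplier.
exported = (
    "<http://www.rbwm.gov.uk/id/supplier/abba-cars>\n"
    "  a org:Organization .\n"
    "\n"
    "<http://www.rbwm.gov.uk/id/service/magnet-leisure-centre>\n"
    "  a rbwm:Service .\n"
    "\n"
    "<http://www.rbwm.gov.uk/id/supplier/abba-cars>\n"
    "  a org:Organization .\n"
)
tidied = drop_duplicate_blocks(exported)
```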

You’ll see that in the last part of this I’ve introduced some properties and classes with an rbwm: prefix. These are for classes and properties that are here in this data, but aren’t part of the payment vocabulary. The basic schema is:

rbwm:CostCentre a rdfs:Class ;
  rdfs:label "Cost Centre"@en ;
  rdfs:comment "A cost centre."@en .

rbwm:Service a rdfs:Class ;
  rdfs:label "Service"@en ;
  rdfs:comment "A service provided by the council."@en .

rbwm:Directorate a rdfs:Class ;
  rdfs:label "Directorate"@en ;
  rdfs:comment "A directorate within the council"@en .

rbwm:service a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Service"@en ;
  rdfs:comment "The service associated with a particular cost centre."@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:Service .

rbwm:providedBy a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Provided By"@en ;
  rdfs:comment "The directorate that provides this service."@en ;
  rdfs:domain rbwm:Service ;
  rdfs:range rbwm:Directorate .

rbwm:provides a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Provides"@en ;
  rdfs:comment "A service provided by this directorate."@en ;
  rdfs:domain rbwm:Directorate ;
  rdfs:range rbwm:Service .

rbwm:costCentreCode a rdf:Property , owl:DatatypeProperty ;
  rdfs:label "Cost Centre Code"@en ;
  rdfs:comment "The code of this cost centre."@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:CostCentreCode .

rbwm:CostCentreCode a rdfs:Datatype ;
  rdfs:label "Cost Centre Code"@en ;
  rdfs:comment "A cost centre code consisting of two capital letters followed by two digits."@en .
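The datatype’s description (two capital letters followed by two digits) translates directly into a pattern that could be used to check codes before generating URIs. This is a sketch in Python with made-up names. It assumes ‘capital letters’ means A to Z only, which the schema doesn’t actually spell out.

```python
import re

# Two capital letters followed by two digits, e.g. "LM05".
COST_CENTRE_CODE = re.compile(r"[A-Z]{2}[0-9]{2}")


def is_cost_centre_code(code):
    """True when the whole string matches the cost centre pattern."""
    return COST_CENTRE_CODE.fullmatch(code) is not None


ok = is_cost_centre_code("LM05")
```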

This illustrates how individual councils might extend the information that they make available in RDF without having to seek any kind of prior agreement from anyone else. If, later on, a third party starts to make available ontologies for cost centres, services and directorates, Windsor & Maidenhead could start to link up their RDF with those more widely standardised classes and properties, with appropriate use of rdfs:subClassOf or rdfs:subPropertyOf.

Now that we have an idea about what data we can extract for a single row, we can turn this into a Gridworks template. The templates are fairly straightforward. Wherever you want to insert a value from a particular column, you use the syntax ${Column Name}. If you want to do any further processing, you can use the syntax {{Formula}} to insert the result of a calculation.

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2>
  qb:slice <${Transaction URI}> .

<${Transaction URI}>
  a payment:Payment , qb:Slice ;
  rdfs:label "Transaction ${TransNo}"@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference "${TransNo}" ;
  payment:payer <http://statistics.data.gov.uk/id/local-authority/00ME> ;
  payment:payee <${Supplier URI}> ;
  payment:date <${Date URI}> ;
  payment:expenditureLine <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}> .

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}>
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label "Expenditure Line {{rowIndex}}"@en ;
  qb:dataSet <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;
  payment:expenditureCode <${Cost Centre URI}> ;
  payment:amountExcludingVAT {{cells['Amount excl vat £'].value + 0}} .

Note that the last line here uses the expression cells['Amount excl vat £'].value + 0 in order to ensure that every figure has a decimal place, which makes them into xsd:decimal values within the resulting RDF.
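To make the substitution concrete, here is a small Python sketch that fills in ${Column Name} placeholders from a dictionary and formats an amount with two decimal places. It is an imitation for illustration, not Gridworks’ templating engine: the names `render` and `decimal_literal` are made up, the {{Formula}} syntax is not handled, and the fixed two decimal places are an assumption based on the example Turtle.

```python
import re
from decimal import Decimal


def render(template, row):
    """Replace each ${Column Name} with that column's value."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(row[m.group(1)]), template)


def decimal_literal(amount):
    """Format an amount with two decimal places, e.g. 1875 -> 1875.00."""
    return f"{Decimal(str(amount)):.2f}"


template = (
    "<${Transaction URI}>\n"
    '  payment:transactionReference "${TransNo}" ;\n'
    "  payment:payee <${Supplier URI}> .\n"
)
row = {
    "Transaction URI": "http://www.rbwm.gov.uk/id/transaction/2650750",
    "TransNo": "2650750",
    "Supplier URI": "http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited",
}
turtle = render(template, row)
amount = decimal_literal(1875)
```

A missing column raises a `KeyError` here, which is arguably what you want: a silently empty URI in the output is much harder to spot later.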

I won’t do the rest of the row template here, though it’s available in full in a separate file.

The other parts of the template are easier to complete. The prefix needs to contain any namespace prefixes that are used within the RDF. It’s also useful to put a base URI here and describe the dataset itself. The RDF for the dataset should contain a number of properties about the dataset as a whole. There are a number of levels at which the dataset can be described:

The Turtle for this description is shown here:

<http://www.rbwm.gov.uk/public/finance_supplier_payments>
  a void:Dataset ;
  void:subset <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> .

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2>
  a payment:PaymentDataset , void:Dataset ;
  # basic metadata
  rdfs:label "Windsor & Maidenhead Supplier Payments where charge to specific cost centre is >= £500 for period April 2010 - June 2010"@en ;
  dct:license <http://data.gov.uk/id/licence> ;
  dct:temporal [
    # this time is retrieved from the Last-Modified date on the original spreadsheet
    time:hasBeginning <http://reference.data.gov.uk/id/gregorian-instant/2010-08-02T08:37:02>
  ] ;

  # statistical metadata
  qb:structure payment:payments-with-expenditure-structure ;
  qb:sliceKey payment:payment-slice ;
  payment:currency <http://dbpedia.org/resource/Pound_sterling> ;

  # linked data metadata
  void:exampleResource
    <http://www.rbwm.gov.uk/id/transaction/2650750> ,
    <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0> ;
  void:vocabulary payment: , qb: , rbwm: ;
  void:subset [
    a void:Linkset ;
    void:linkPredicate qb:slice ;
    void:subjectsTarget <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/transaction> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payer ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/transaction> ;
    void:objectsTarget <http://statistics.data.gov.uk/id/local-authority> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payee ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/transaction> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/supplier> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:date ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/transaction> ;
    void:objectsTarget <http://reference.data.gov.uk/id/day> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureLine ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/transaction> ;
    void:objectsTarget <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureCode ;
    void:subjectsTarget <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;
    void:objectsTarget <http://www.rbwm.gov.uk/def/cost-centre> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:service ;
    void:subjectsTarget <http://www.rbwm.gov.uk/def/cost-centre> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/service> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:providedBy ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/service> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/directorate> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:provides ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/directorate> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/service> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:hasUnit ;
    void:subjectsTarget <http://statistics.data.gov.uk/id/local-authority> ;
    void:objectsTarget <http://www.rbwm.gov.uk/id/directorate> ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:unitOf ;
    void:subjectsTarget <http://www.rbwm.gov.uk/id/directorate> ;
    void:objectsTarget <http://statistics.data.gov.uk/id/local-authority> ;
  ] .
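
The linkset descriptions above are highly repetitive, so they could be generated rather than typed by hand. The following Python sketch shows one way to do that, assuming each linkset is fully described by a predicate, a subjects target and an objects target. The names LINKSET and linksets_turtle are made up for this illustration.

```python
# Hypothetical template for one void:Linkset blank node, laid out like the Turtle above.
LINKSET = """[
    a void:Linkset ;
    void:linkPredicate {predicate} ;
    void:subjectsTarget <{subjects}> ;
    void:objectsTarget <{objects}> ;
  ]"""


def linksets_turtle(links):
    """Build the 'void:subset [...] , [...] .' fragment from (predicate, subjects, objects) triples."""
    blocks = [
        LINKSET.format(predicate=predicate, subjects=subjects, objects=objects)
        for predicate, subjects, objects in links
    ]
    return "  void:subset " + " , ".join(blocks) + " ."


RBWM = "http://www.rbwm.gov.uk"
links = [
    ("qb:slice", RBWM + "/public/finance_supplier_payments_2010_q2", RBWM + "/id/transaction"),
    ("payment:payee", RBWM + "/id/transaction", RBWM + "/id/supplier"),
]
turtle = linksets_turtle(links)
print(turtle)
```

Only two of the linksets are listed here; the rest would be added to the links list in the same way.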

Provenance

I’ve described here, verbally, exactly what I’ve done in terms of the cleaning of the data, deriving new columns, and the template that I’ve used to create a Turtle rendition of the data in this spreadsheet. One of the things that we’ve worked hard on within data.gov.uk is finding ways of expressing this provenance information in RDF. There are two reasons for this:

  1. Providing provenance increases transparency and enables you to check the processing that the data has been through, increasing your trust in the data.
  2. Describing the process in sufficient detail for you to replicate it enables you to modify and repeat that process, which lets you both add value and apply the same processing to your own situation, thus spreading best practice.

The basic provenance vocabulary that we’re using within data.gov.uk is the Open Provenance Model Vocabulary. This vocabulary talks about Artifacts, Processes that create and use them, and Agents that control those processes. We’ve created an extension of this vocabulary specifically to help describe this kind of scenario, where a spreadsheet is processed using Gridworks and then exported using a template. I’ll put this provenance information in a separate file simply because embedding provenance information, which includes a template, in the template itself gets us into nasty recursion issues.

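As well as the template, there are two supplementary artifacts that we need in order to record the provenance of this data: the Gridworks project itself and the JSON description of the operations carried out on it.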

The first can be exported using the Project menu. The second is accessed through the Undo/Redo tab as shown in the following screenshot:

This tab shows the actions that have been carried out on the data, and enables you to undo them in sequence. The extract link at the bottom opens up the dialog shown in the following screenshot:

You have to manually copy and paste the JSON description from the right of this dialog into a separate file in order to save it.
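
Because the copy and paste is manual, it is worth checking that the pasted text is complete before relying on it. The sketch below is one possible check in Python, on the assumption that the extracted description is a JSON array with one object per operation. The function name save_operations and the sample content are invented for illustration.

```python
import json


def save_operations(pasted_text, path):
    """Check that the pasted text is a JSON array, save it, and return the operation count."""
    operations = json.loads(pasted_text)   # raises ValueError if the paste was cut short
    if not isinstance(operations, list):
        raise ValueError("expected a JSON array of operations")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(operations, handle, indent=2)
    return len(operations)


# Made-up stand-in for the text copied out of the dialog.
SAMPLE_PASTE = '[{"op": "example/operation", "description": "Example step"}]'
```

Calling save_operations(SAMPLE_PASTE, "finance_supplier_payments_2010_q2_operations.json") would write the file and return the number of operations it contains.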

We can then start describing the provenance of the RDF; this needs to go in the Turtle file itself. We start by saying that the RDF that we’ve created was created from the Gridworks project and through an extraction operation. A simple link to the spreadsheet that was used as the source of the data also provides a quick link back to the original data:

<http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2>
  a opmv:Artifact ;
  dct:source <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls> ;
  gridworks:wasExportedBy <finance_supplier_payments_2010_q2_provenance#gridworks-export> ;
  gridworks:wasExportedFrom <finance_supplier_payments_2010_q2_project.tar.gz> .

The provenance information then needs to describe the export process:

<#gridworks-export>
  a gridworks:ExportUsingTemplate , opmv:Process ;
  rdfs:label "Process for Exporting Windsor & Maidenhead data as Turtle" ;
  gridworks:project <finance_supplier_payments_2010_q2_project.tar.gz> ;
  gridworks:template <#gridworks-template> .

The project itself was created from the original Excel spreadsheet. It was generated through an import that ignored a single non-blank header row, followed by the set of operations described by the JSON.

<finance_supplier_payments_2010_q2_project.tar.gz>
  a gridworks:Project , opmv:Artifact ;
  rdfs:label "Windsor & Maidenhead Supplier Payments April 2010 - June 2010 Gridworks Project"@en ;
  gridworks:wasCreatedFrom <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls> ;
  opmv:wasGeneratedBy <#gridworks-processing> .

<#gridworks-processing>
  a gridworks:Process , opmv:Process ;
  rdfs:label "Processing on the Gridworks Project"@en ;
  common:usedData <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls> ;
  gridworks:ignore 1 ;
  gridworks:operationDescription <finance_supplier_payments_2010_q2_operations.json> .

<finance_supplier_payments_2010_q2_operations.json>
  a gridworks:OperationDescription , opmv:Artifact ;
  rdfs:label "Dump of the Processing carried out by Gridworks on Windsor & Maidenhead Supplier Payments April 2010 - June 2010 data"@en ;
  gridworks:wasExportedFrom <finance_supplier_payments_2010_q2_project.tar.gz> ;
  gridworks:wasExportedBy <#gridworks-operation-description-extraction> .

<#gridworks-operation-description-extraction>
  a gridworks:ExtractOperationDescription , opmv:Process ;
  rdfs:label "Extraction of the operation description from the Windsor & Maidenhead Supplier Payments April 2010 - June 2010 Project from Gridworks"@en ;
  gridworks:project <finance_supplier_payments_2010_q2_project.tar.gz> .

The template is described in terms of the separate parts; in fact it’s useful to use this provenance file as the record of the template that you use, given that Gridworks won’t save the template in the project itself.

<#gridworks-template>
  a gridworks:Template , opmv:Artifact ;
  gridworks:prefix """
...
"""^^xsd:string ;
  gridworks:rowTemplate """
...
"""^^xsd:string .

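Embedding a template inside a Turtle string needs some care, because the row template itself contains double quotes. The following Python sketch shows one way to escape the text before wrapping it in a long string literal. The function names are hypothetical, and escaping every double quote is a simplification that Turtle permits rather than something Gridworks does for you.

```python
def turtle_long_string(text):
    """Wrap text as a Turtle long string literal typed as xsd:string."""
    # Escape backslashes first, then every double quote, so the text can never
    # close the triple-quoted literal early.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"""' + escaped + '"""^^xsd:string'


def template_provenance(prefix, row_template):
    """Build the <#gridworks-template> description from the two template parts."""
    return (
        "<#gridworks-template>\n"
        "  a gridworks:Template , opmv:Artifact ;\n"
        "  gridworks:prefix " + turtle_long_string(prefix) + " ;\n"
        "  gridworks:rowTemplate " + turtle_long_string(row_template) + " .\n"
    )


print(template_provenance("@prefix qb: <...> .", 'rdfs:label "Transaction ${TransNo}"@en ;'))
```

The arguments in the final line are placeholders only; the real prefix and row template text would be passed in their place.
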
Rinse and Repeat

Gridworks makes it easy to repeat a given set of operations on another spreadsheet that follows the same structure. If you download the Windsor and Maidenhead spending data from 2009 Q4 and import it into Gridworks, you’ll see that it uses the same set of columns as the 2010 Q2 data that we’ve been looking at. (Strangely enough, the 2010 Q1 data doesn’t quite follow the same structure as it doesn’t include the ‘TransNo’ column.)

There are a couple of differences:

You might want to do some more cleaning, for example to check for duplicates, but once that is done, you can use the apply link at the bottom of the Undo/Redo tab to apply the JSON operation description that you extracted from the previous project to this one. The templates require only a little tweaking to give different filenames and labels, but otherwise can be used as-is.
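
The template tweaking mentioned above is mostly a matter of swapping identifiers. As a small Python sketch, assuming the only differences are strings such as the dataset identifier, something like the following would do. The function name retarget is invented, and the 2009 Q4 identifier is an assumption that simply follows the 2010 Q2 naming pattern.

```python
def retarget(template, replacements):
    """Return a copy of the template with each old string swapped for its new one."""
    for old, new in replacements.items():
        template = template.replace(old, new)
    return template


row_fragment = "qb:dataSet <http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2> ;"
# "2009_q4" is an assumed identifier, not one confirmed by the council's site.
print(retarget(row_fragment, {"2010_q2": "2009_q4"}))
```

Labels such as the period covered would need their own entries in the replacements dictionary.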

So while the process of cleaning data, deriving values and creating a template for exporting as Turtle is a bit of effort, the likelihood is that you will be able to repeat the same operations on similar data with a minimal amount of work.

Conclusions

Gridworks is a simply amazing tool for data cleansing, analysis and, as we’ve seen, transformation. It’s set to become more so for our purposes in the near future, as it comes to support the mapping of names for things to URIs using configurable reconciliation services (which might allow it to automatically map Government Department names to URIs, for example), and the creation of RDF using a more intuitive and user-friendly approach than the templates that I’ve illustrated here.

Of course there are issues, particularly for UK civil servants who typically have to operate on locked-down machines running IE7 (if they’re lucky). Gridworks also only deals with the fairly simple cases of data that fits in a spreadsheet-like structure, without the complexities of annotations on rows, columns or individual cells that we often see in government data.

Nevertheless, there’s huge potential here to provide a fairly easy route to the publication of linked data for people who are familiar with spreadsheets, in particular one that can be tweaked and extended to allow for the variety and complexity of real-world data.

Posted at 22:23

Dave Beckett: Rasqal RDF Query Library 0.9.20

I just released a new version of my Rasqal RDF Query Library for two main new features:

  1. Support more of the new W3C SPARQL working drafts of 1 June 2010 for SPARQL 1.1 Query and SPARQL 1.1 Update.
  2. Support building with the Raptor V2 API as well as the Raptor V1 API.

The main change is to start adding to Rasqal’s APIs and query engine the changes for the new SPARQL 1.1 working drafts. This release adds support for the syntax of all the changes for Query and Update. The new draft syntax is available via the ‘laqrs’ query language name, until the SPARQL 1.1 syntax is finalized. The ‘sparql’ query language provides SPARQL 1.0 support.

On Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN(), which are now usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT() but remains syntax-only. Finally, the new GROUP BY with HAVING conditions was added to the syntax, with consequent API updates, but there is no query engine execution of them.

For Update 1.1, the full set of update operation syntax was added, and the operations create API structures. Note, however, that there seem to be some ambiguities in the draft syntax, especially around multiple optional tokens in a row near WITH, which are particularly hard to implement in flex and bison (aka “lex and yacc”).

The main non-SPARQL 1.1 related change is to allow building Rasqal with Raptor V2 APIs rather than V1. Raptor V2 is in beta, so this is not a final API and is thus not the default build; it has to be enabled by passing --enable-raptor2 to configure. When Raptor V2 is stable (2.0.0), Rasqal will require it.

The changes to Rasqal in this release, in summary, are:

See the Rasqal 0.9.20 Release Notes for the full details of the changes.

Get it at http://download.librdf.org/source/rasqal-0.9.20.tar.gz.

PS: The source code control has also moved to Git and is hosted at GitHub.

Posted at 21:33

Uche & Chime Ogbuji: Hypnotic Brass Ensemble / " Spottie" etc. on Shuffler

Brass ensemble remake of Spottieottie..etc.. on one of the earlier classic Outkast albums. 'Nuff said.

Posted at 20:09

Uche & Chime Ogbuji: Quotīdiē ❧ Ndugu's proper chocolate jam,

Shed a tear of delight; don't you worry about a fall tonight
Birds flying free; What about you and me
Ooh!

Take some time to let your feelings flow free
You can't hide away from what you'll be
Search the sky for new horizons to unfold
Set yourself on the oceans of dreams to behold

—from "Take Some Time" by Ndugu & The Chocolate Jam Company

I remember hearing this slow jam a couple of times at dances in Nigeria in the early 80s.  When Erykah Badu flipped it for "Ummm Hmmm" off her latest masterpiece New Amerykah Part Two (Return of the Ankh), she put a weeks-long itch in my skull, and I bet a lot of others who had grown up on a soul diet.  I finally twigged it last week, and went to hunt down the Ndugu & The Chocolate Jam Company original, but it seems to have faded into the mists of the past a bit, which is a true shame.  I did find the following audio version on YouTube, though.

Here is Badu's "Ummm Hmmm," accompanied by some lovely stills of Fat Belly Bella herself.

Of course Badu wasn't the first to discover the great sample possibilities of the Leon Chancler (AKA "Ndugu") jam.  DJ Premier used it back in '07 for the NYG'z project song "Welcome To G-Dom."

Of course, I love me some Primo, but Erykah pwned this bitch.  It's over.  I hope no other DJs think they should dare follow her.

Then again I'm thinking of using the Primo loop to back a poem recital one day.  And maybe I have just the poem.  Having learned about the terzanelle form from Heather Fowler a few weeks ago, I fell in love with the form, and I've been writing a sequence of terzanelles, one for each song on New Amerykah Part Two.  I'm on "Ummm Hmmm" and the first few stanzas of my poem are as follows:

Take some time to let your feelings run free
Heart's desire—thump! thump! I've been here before—
You can't hide away from what you'll be.

You can't hide; don't cheat I've been keeping score.
Place your bet, love; scared money don't make none.
Heart's desire—thump! thump! I've been here before.

Truth and Icarus dare, the money sun,
Angel bird, let's jump off into your world.
Place your bet, love; scared money don't make none.


Naturally it includes elements from Ndugu's song, as well as Badu's.  I can't find the lyrics to "Take Some Time" anywhere on the Net so I, ah, took some time to transcribe them myself.  As you can see from the square brackets and the ellipses, there are some parts I can't figure out right now, but I think I got most of it.

"Take Some Time" by Ndugu & The Chocolate Jam Company
Do you always conceal what you feel inside
Man does not ever drift with the flow of the tide
Makes it hard to see when [it attracts you and me]
And there comes a time when your feelings should run free

And understand that you're over me when you're ...
It'll lend you a helping hand when your [crimes] cross the tide
Takes you high in the sky of your heart's desire
Float through the valley of love; you'll start to fly
Like a bird in the sky who's just learned to fly
Makes you feel so proud you might want to cry

Shed a tear of delight; don't you worry about a fall tonight
Birds flying free; What about you and me
Ooh!

Take some time to let your feelings flow free
You can't hide away from what you'll be
Search the sky for new horizons to unfold
Set yourself on the oceans of dreams to behold

Well you're pride by your side when we're looking on
Keep your head to the sky through the weather of the storm

Take the compliment as if it came heaven-sent
From someone up above [with music] with love

What's the nature of your mind when the trouble starts to [grind]
Do you leave yourself behind, not to be caught up on the line
Signs of life is a lot to see that you hold in your [belief]

Free your time; what about your mind
Wow!

Take some time to let your feelings flow free
You can't hide away from what you'll be
Search the sky for new horizons to unfold
Set yourself on the oceans of dreams to behold

You will find further on down the line
Is what you've got to do, to see you through

Take some time to let your feelings flow free
You can't hide away from what you'll be
Search the sky for new horizons to unfold
Set yourself on the oceans of dreams to behold

Take some time to let your feelings flow free
You can't hide away from what you'll be
Search the sky for new horizons to unfold
Set yourself on the oceans of dreams to behold

 

Posted at 20:09

Uche & Chime Ogbuji: Sliver of Blue - DeviantArt

I'd love to find that spot (or have that painting)

Posted at 20:09

Uche & Chime Ogbuji: Numerical type with units - via Python Cookbook

I implemented dimensions.py perhaps eight years ago as an exercise and have used it occasionally ever since.

It allows doing math with dimensioned values in order to automate unit conversions (you can add m/s to mile/hour) and dimensional checking (you can't add m/s to mile/lightyear). It specifically does not convert 212F to 100C but rather will convert 9F to 5C (valid when converting temperature differences).
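
To make the idea concrete, here is a minimal Python sketch of unit conversion with dimensional checking. It is not the dimensions.py code: the Quantity class, its method names and the small UNITS table are all invented for illustration, and temperatures are treated purely as differences, as described above.

```python
# Hypothetical unit table: name -> (factor to SI base units, exponents of (m, s, K)).
UNITS = {
    "m/s": (1.0, (1, -1, 0)),
    "mile/hour": (1609.344 / 3600.0, (1, -1, 0)),
    "mile/lightyear": (1609.344 / 9.4607304725808e15, (0, 0, 0)),
    "C": (1.0, (0, 0, 1)),         # temperature difference in degrees Celsius
    "F": (5.0 / 9.0, (0, 0, 1)),   # temperature difference in degrees Fahrenheit
}


class Quantity:
    def __init__(self, si_value, dims):
        self.si_value = si_value   # magnitude expressed in SI base units
        self.dims = dims           # tuple of exponents for (m, s, K)

    @classmethod
    def of(cls, value, unit):
        factor, dims = UNITS[unit]
        return cls(value * factor, dims)

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError("cannot add quantities with different dimensions")
        return Quantity(self.si_value + other.si_value, self.dims)

    def to(self, unit):
        factor, dims = UNITS[unit]
        if dims != self.dims:
            raise TypeError("cannot convert between different dimensions")
        return self.si_value / factor


speed = Quantity.of(1, "m/s") + Quantity.of(1, "mile/hour")
print(speed.to("m/s"))
```

Adding Quantity.of(1, "m/s") to Quantity.of(1, "mile/lightyear") raises a TypeError, because the second quantity is dimensionless.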

It is similar to unums (http://home.scarlet.be/be052320/Unum.html) but with a significant difference:

I used a different syntax Q(25,'m/s') as opposed to 100*m/s (I recall not wanting to have all the base SI units directly in the namespace). I'm not entirely sure which approach is really better.

I also had a specific need to have fractional exponents on units, allowing the following:

>>> km=Q(10,'N*m/W^(1/2)')
>>> km
Q(10.0, 'kg**0.5*m/s**0.5')
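
The reduced unit in that output can be checked by hand by tracking exponents of the base units. The sketch below does that arithmetic with Python's Fraction type. The helper names are hypothetical and this is not how dimensions.py is written, just a way to see where the half powers come from.

```python
from fractions import Fraction


def dims(**exponents):
    """Map base unit names to exact exponents, dropping zero powers."""
    return {unit: Fraction(power) for unit, power in exponents.items() if power != 0}


def multiply(a, b):
    """Multiplying units adds the exponents of each base unit."""
    merged = {unit: a.get(unit, 0) + b.get(unit, 0) for unit in set(a) | set(b)}
    return {unit: power for unit, power in merged.items() if power != 0}


def raise_to(a, power):
    """Raising a unit to a power multiplies every exponent by that power."""
    return {unit: exponent * Fraction(power) for unit, exponent in a.items()}


NEWTON = dims(kg=1, m=1, s=-2)
METRE = dims(m=1)
WATT = dims(kg=1, m=2, s=-3)

# N*m/W^(1/2): dividing by W^(1/2) is the same as multiplying by W^(-1/2).
result = multiply(multiply(NEWTON, METRE), raise_to(WATT, Fraction(-1, 2)))
print(result)
```

The exponents come out as kg to the 1/2, m to the 1 and s to the -1/2, which agrees with the 'kg**0.5*m/s**0.5' shown above.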

Looking back I see a few design decisions I might do differently today, but I'll share it anyway.

Some examples are in the source below the line with if __name__ == "__main__":

Note that I've put two files into the code block below, dimensions.py and dimensions.data, so please cut them apart if you want to try it.

Very impressive library. I recently incorporated the use of the Measurement Unit Ontology into the Computer-based Patient Record (CPR) ontology and (on the surface) it seems like a library like this can provide the unit conversion machinery for RDF instances that use such a framework.

Posted at 20:09

Uche & Chime Ogbuji: Talib Kweli on the Politics of Oil: Ballad of the Black Gold

Spotted this new Talib Kweli song, called the Ballad of the Black Gold in a hypem link (you can watch the video there). Very timely given the recent BP mess. Much respect to Talib for going into some of the history of Oil politics in Nigeria; an excerpt from Verse 2 is below:


Nigeria is celebrating 50 years of independence
They still feel the colonial effects of Great Britain's presence
Dictators quick to imitate the West
Got in bed with oil companies and now the place is a mess
Take a guess, which ones came and violated
They oiled up the soil, the Ogoni people was almost annihilated
But still they never stayed silent
They was activists and poets using non-violent tactics
That was catalyst for soldiers to break into they crib
Take it from the kids and try to break'em like a twig
And make examples of the leaders; executed Saro-Wiwa,
Threw Fela's mom out the window right after they beat her
In an effort to defeat hope. Now the people's feet soaked in oil [?]
So the youth is doing drive-bys through speed boats [?]
They kidnap the workers, they blowing up the pipelines
You see the fires glowing in the nighttime

Posted at 20:09

Uche & Chime Ogbuji: We *were* all witness

Well, I pretty much knew it was going to happen as soon as they were bounced out of the playoffs. This poster downtown is going to look real stupid tomorrow. At least he didn't do it for the money.  Cleveland folks should show some respect to how much he elevated our game. We hadn't been contenders since the days of Larry Nance; remember those Cavs?

Posted at 20:09

Copyright © XMLhack. A Useful Production. Contact us.