Friday, March 03, 2006

Self-citing Internet Sources

Nearly every genealogical tool or service provider has at one time or other thought of how much better the world would be if there were a way to automatically cite the source of genealogical data on the Internet. Then they quickly get discouraged at the idea as they think of the level of industry cooperation that would be required to make it happen.

There are some isolated cases of self-citing sources. For example, PAF Insight Manager (OhanaSoftware) is a very popular tool among LDS genealogists. The tool has the ability to automatically cite sources of information obtained from the FamilySearch website. I'm sure there are other similar implementations of which I'm not currently aware. At the present time there does not seem to be a de facto standard for how to do this.

I don't think this is rocket science. We don't need some Automatic Citation Encapsulation (ACE) protocol. We simply need to agree upon a way to embed citation details into a web page. The citation detail doesn't necessarily need to be viewable to the user. Any time an Internet-based source is cited, the tool or service can simply look at the citation details on the web page and appropriately cite the source for the user. In fact, the embedded citation block could be as simple as the following:

<Embedded Citation Block>
Super Cool Online Internet Source Which Links Me to Adam
Published on Some Day
Found on Some Page
Random Text
</Embedded Citation Block>


There are a lot of extremely smart people that will look at the above example and immediately see that additional tags could be added to impose some type of taxonomy on the citation. Even if the embedded citation block was never more sophisticated than what is proposed above, it would be substantially better than the ad hoc citations of the masses that don't have degrees in library science.
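As a minimal sketch of how a tool might read such a block, assume the page carries it verbatim between the `<Embedded Citation Block>` markers from the example above. The marker syntax is only the illustration used in this post, not an agreed standard, and the function name `extract_citation_block` is made up for the example.

```python
import re

# Matches the opening and closing markers from the example above.
# This marker syntax is an assumption, not an agreed standard.
BLOCK_RE = re.compile(
    r"<Embedded Citation Block>(.*?)</Embedded Citation Block>",
    re.DOTALL | re.IGNORECASE,
)


def extract_citation_block(page_text):
    """Return the citation lines found in page_text, or an empty list."""
    match = BLOCK_RE.search(page_text)
    if match is None:
        return []
    # Keep non-empty lines, trimmed of surrounding whitespace.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


page = """<html><body>
<p>Some page content</p>
<Embedded Citation Block>
Super Cool Online Internet Source Which Links Me to Adam
Published on Some Day
Found on Some Page
Random Text
</Embedded Citation Block>
</body></html>"""

citation_lines = extract_citation_block(page)
```

A genealogy program could store `citation_lines` as free text, which is all the simple version of the block promises. Adding tags for title, date, and page would let the same tool fill in structured fields instead.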

The efficiency of an embedded citation block may be questioned. After all, if an application is just trying to get the data in the citation block, why impose the overhead of serving up the rest of the web page? Wouldn't it be more efficient to have a service or simply a URL which delivers only the citation details without this overhead? There is some validity to this concern, but the two approaches (embedding a citation block in the web page, and offering a specific citation service or URL for obtaining just the citation details) aren't mutually exclusive. I prefer the embedded citation block approach. It seems easier to implement. Anyone who can create a web page can create an embedded citation block.

So here's the rally cry. Let's get together and define an embedded citation block standard. Let's keep it as simple as possible to start and see if we can't get a few of the more popular online content providers and tool vendors to implement it. Send me your comments. FamilySearch can easily implement this approach into the system we are building to deliver digitized microfilm. How would you like to see this implemented?

10 comments:

Anonymous said...

I think a wiki-like definition would allow people who are not technically savvy to use the format, yet it would also allow parsing by an automated tool for those people who need that capability.

Manfred Riem

Dan Hanks said...

Yes, please! What an excellent idea!

I think microformats might fit the bill for what you're looking for here:

http://www.microformats.org/

"Designed for humans first and machines second". Microformats are essentially snippets of structured (X)HTML that are machine-readable, embedded in pages that are human-readable.

It appears there is already work in progress to develop a citation microformat (http://www.microformats.org/wiki/citation).

I can think of a number of ways this kind of thing would be useful. As one example, imagine browsing through all the digitized images and being able to click a browser bookmarklet (http://en.wikipedia.org/wiki/Bookmarklet) that does something like "Associate the source document displayed on this page with an individual/marriage/event/etc. in my account in the church's new FamilyTree system." Clicking the bookmarklet would scan the current page for the microformat, and then lead you to a page in the FamilyTree system that would let you select the individual/event/marriage with which to associate the image as a source.

Of course, the church could also just put a link on each image display page that does just that, but a microformat would allow systems from different vendors to interact with these sources. A genealogy program could allow me to paste in a URL from which it could automatically extract the source information. Or going one step further, genealogy tool providers (RootsMagic, The Master Genealogist, Legacy, et al.) could provide browser plugins or browser toolbars that automatically detect these self-citing pages, and which could offer actions similar to the example above.

Now if the ContentDMs (digital library software used by BYU and many other digital libraries) of the world could do similar things with their image display pages, we'd really be moving somewhere.

I think if the church, with its weight and influence, were to adopt such a standard, we'd potentially see a lot of other digital content providers begin to follow suit.

As to the concern of having to "load the whole document" just to get the source citation, HTTP allows one to fetch only a page's text content without also fetching the page's images, so I don't see this as (too big of) a problem, unless the text on the page itself is also very large. In a digitized image delivery system, I think the pages would be fairly lightweight as far as the text goes.

And why stop with just citations? One possibility is to create a "GEDCOM microformat" in which linkage and other genealogical data can be embedded in machine-readable forms in human-readable pages (http://www.microformats.org/wiki/genealogy-formats). If each software program and each online family tree system that displays pedigrees and other genealogical information were to include these microformats, there's a huge number of possibilities for how such a format could be used:

- Tool vendors could again provide bookmarklets, toolbars, or browser plugins to do things like, "Import the individual on this page (and all their ancestors) into my genealogy database." Clicking on such a button/link would pop up the user's genealogy tool of choice, which would pull down the page in question, parse the microformatted data, and chase the resulting tree.

- Plugins could be provided to display lists of individuals/marriages/events/sources that are on the currently displayed page in a sidebar, each with options to import and/or process with the user's tool of preference.

- Search engines and aggregators could do automated match/merge of individuals they parsed out from spidered pages, offering suggestions to their users as to pages related to their research.

Again, if the church were to get behind this kind of effort, starting with the (publicly accessible) portion of the new FamilyTree system, a lot of other vendors would soon follow.

Let's do it!

Dan Hanks said...

Some more thoughts:

Some advantages of microformats:
- They're invisible to the average user
- Yet they provide so many possibilities to tool providers and geeks like me (and through them/us to the masses via the tools they/we build)
- I don't see them as being that difficult to add/implement in most of the tools that are out there that generate HTML, once we have a good standard established (that's the hard part :-).

One more application idea along the lines of a "GEDCOM microformat":

Imagine if the church's new Family Tree system published RSS/Atom feeds of activity happening on people's trees. Each time a person was added, for example, an item could be added to a feed for that tree or account (if such info was safe to publish, e.g., the new person was not living). And if those feed items embedded these microformats, I could subscribe to these feeds with my aggregator (such as bloglines.com). As new individuals were added to my trees of interest, they would show up in my aggregator, and the lovely browser plugins (this assumes people are using web-based aggregators) would pick up on these microformats in the feeds I'm looking at, offering all the options to import, etc.

Dan Lawyer said...

Microformats look like a possible option for what we need to do. I'll spend some time learning about them.

I've thought about the value of having RSS or Atom feeds from the Family Tree. Seems like a powerful concept. There is a question of granularity on such a feed. Is it scoped to a person, family, family line, n number of generations, etc.?

The microformat concept for genealogy has some potential also. It would definitely need to be coupled with a citation capability; otherwise there's a risk that we make it easier to propagate unsubstantiated pedigrees.

Dan Hanks said...

Offer flexible levels of granularity, a la del.icio.us or flickr.com.

For example, on del.icio.us (a social bookmarking site), I can subscribe to feeds of:
- all recent urls being submitted
- all popular urls being submitted (urls that are getting submitted most frequently)
- urls being tagged by a specific tag (e.g., 'linux')
- urls being submitted by a particular user
- urls being tagged by a specific tag by a specific user

and so forth.

For an example of a use of that last feed, one of the sidebars on my website brainshed.com is generated by slurping down and parsing the feed of all urls I have submitted and tagged with 'perl_module'.

Flickr offers similar functionality for various aspects of their photo service.

As another example for feed possibilities, Yahoo provides RSS feeds for search results. So I can subscribe to an RSS feed based upon the search results of say 'hanks genealogy', and theoretically, (although it doesn't quite work like I'd like) be notified any time a new search result pops up for those search terms.

So, for FamilyTree, it would be fun to have:
- an RSS feed for all changes being made by a particular user (given the user's permission to do so, etc)
- a feed for all changes to a particular individual or set of individuals
- a feed for all changes to an individual or any of his ancestors for N generations
- a feed for all changes to an individual or any of his descendants for N generations
- a feed for any sources that are added to an individual, a set of individuals, an individual and his descendants, an individual and his ancestors, etc, etc.
- A feed for all new digitized images coming online (i.e., one entry in the feed for when images from FHL #123456789 become generally available)
- A personalized feed for any disputes that are submitted for info in any of my lines.
- And so forth :-).

Now, I don't envy the developers who have to build the backend for such a system, but I don't see it being too hard. Somehow you log changes being made in the system, and for each change you determine which feed interests (see the list above) that change would apply to (you'd also have to determine if the change is a private change that shouldn't be made publicly available). Then you'd have an application/CGI/etc. to take incoming HTTP requests for feeds and dynamically determine which of the change events need to go in each feed requested (with plenty of caching involved, of course).
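The log-then-filter idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `Change` record, the `PARENTS` map standing in for the real tree, and the function names are all hypothetical, and nothing here reflects how FamilyTree is actually built. Caching and the HTTP layer are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    user: str         # who made the change
    person_id: str    # individual the change applies to
    kind: str         # e.g. "edit", "source_added", "dispute"
    private: bool = False


# Hypothetical child -> parents map standing in for the real tree.
PARENTS = {
    "child": ["father", "mother"],
    "father": ["grandfather"],
}


def ancestors_within(person_id, generations):
    """Return person_id plus all ancestors up to `generations` levels up."""
    found = {person_id}
    frontier = [person_id]
    for _ in range(generations):
        frontier = [p for person in frontier for p in PARENTS.get(person, [])]
        found.update(frontier)
    return found


def ancestor_feed(log, root_id, generations):
    """Public changes to root_id or any ancestor within N generations."""
    wanted = ancestors_within(root_id, generations)
    return [c for c in log if not c.private and c.person_id in wanted]


def user_feed(log, user):
    """Public changes made by one user."""
    return [c for c in log if not c.private and c.user == user]


log = [
    Change("dan", "grandfather", "source_added"),
    Change("dan", "mother", "edit", private=True),
    Change("ann", "father", "dispute"),
    Change("ann", "stranger", "edit"),
]

one_generation = ancestor_feed(log, "child", 1)
two_generations = ancestor_feed(log, "child", 2)
```

Each feed is just a filter over the same change log, so adding a new kind of feed means adding one more filter function rather than a new storage scheme. Private changes are dropped in every filter.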

Granted, that's probably a simplistic view of what would be needed to implement such a system, but I hope you get the idea. Make the set of feeds available infinitely (or nearly so...) customizable, and we'll all probably be surprised at the variety of uses that arise from the availability of these feeds.

Dallan said...

I agree that self-citing internet sources would be an excellent idea. I think in order for them to become widely-used you need two things: (1) they need to be supported by a desktop application, and (2) they need to be incorporated into GEDCOM (or whatever follows it).

Dan Lawyer said...

In response to Dan Hanks's thoughts on RSS feeds for the Family Tree...

I think we share the same overall vision of the potential of using feeds in the Family Tree. I'll pull together a future blog post more specific to the feed idea and add some more details at that time. I'll also push the thread we've had so far to those inside the Church that can start wrapping their brains around it.

Dan Lawyer said...

I think Dallan is on target with what it would take to get self-citing Internet sources off the ground. I know that several of the tool vendors are following this blog (not sure why they haven't chimed in...). They have always been very good to support these types of initiatives from the Church. I'm sure they would be willing to add this functionality to their applications. The Church will definitely add some type of embedded citation capability to our systems moving forward. I don't know yet whether we'll embrace the microformats approach or something similar.

Steven M. Law said...

I also agree that this is a great idea. I've wished for years that there was a utility that I could run as a sidebar program such that when I'm on a page I wish to cite, or viewing a digital image at Ancestry or HeritageQuest, I simply press a button and the full citation data is extracted for me and passed into my genealogy program. To be of maximum utility, the citation would need at least 2 levels of hierarchy. Generally I would want to cite the original document (a certain set of elements) as represented by the digital image at such and such database, URL, etc. Another instance of this hierarchy is a facsimile of a will in Archive X as published in article Y of periodical Z, digitized and published on the web in some database, with URL, etc. Generally when I've seen these citations in academic literature this method is used: cite the original as represented in a later publication. Do you think this complexity can be handled?
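One way to picture the layered citation described above is a record that optionally points at the later publication that reproduces it. This is a rough sketch only: the `Citation` class, its `found_in` field, and the `render` function are hypothetical names, and the joining phrase is an arbitrary choice rather than any citation style guide's wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    description: str                       # what is cited at this level
    found_in: Optional["Citation"] = None  # later publication reproducing it


def render(citation):
    """Join every level of the chain, original document first."""
    parts = []
    current = citation
    while current is not None:
        parts.append(current.description)
        current = current.found_in
    return "; as published in ".join(parts)


will = Citation(
    "Facsimile of a will, Archive X",
    Citation("Article Y, Periodical Z", Citation("Web database, URL")),
)
text = render(will)
```

Because each level simply nests inside the previous one, the same structure covers two levels (original plus digital image) or three or more.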

Gaylon Findlay said...

Dan:

Let me weigh in on this issue. For 20+ years, the computerized genealogy industry has relied on GEDCOM as a standard to communicate data back and forth between programs.

There is a lot of code written to deal with creating and interpreting GEDCOM data. Why not keep this standard in place, and allow all of this code to be reused, rather than requiring yet-another-method that will require yet-more-code to be written? At first glance, it seems that putting GEDCOM snippets into HTML comments would serve the purpose both for Source Citations and for any other types of data we wish to communicate. Take this example:

<!--GEDCOM
0 @I2@ INDI
1 NAME Jonathan /Swift/
1 SEX M
1 BIRT
2 DATE 22 OCT 1763
2 PLAC London, England
2 SOUR @S1@
0 @S1@ SOUR
1 REPO @REPO1@
1 _TYPE Birth Record
1 TITL Birth Record for Jonathan Swift
1 AUTH Anglican Church
1 PUBL 1763
0 @REPO1@ REPO
1 NAME Westminster Abbey
-->

It may become important to delineate which version of GEDCOM was used to generate the embedded GEDCOM, so we could expand the first line of the above snippet to be

<!--GEDCOM:4.0

or

<!--GEDCOM:4.0-5.5

thus allowing the version or versions of GEDCOM which can deal with this syntax to be mentioned. But we could default to simply

<!--GEDCOM

if the authoring software did not feel that the version of GEDCOM was important to note.
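To show how little code a reader of such pages would need, here is a minimal sketch that finds the comment, picks up the optional version suffix, and splits each GEDCOM line into level, cross-reference ID, tag, and value. The `<!--GEDCOM` marker and its `:version` suffix are the proposal made in this comment, not an existing standard, and the function name `parse_embedded_gedcom` is made up for illustration.

```python
import re

# "<!--GEDCOM", an optional ":4.0" or ":4.0-5.5" suffix, then the body.
COMMENT_RE = re.compile(
    r"<!--GEDCOM(?::([\d.]+(?:-[\d.]+)?))?\s*\n(.*?)-->", re.DOTALL
)
# A GEDCOM line: level, optional @XREF@ identifier, tag, optional value.
LINE_RE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s+(.*))?$")


def parse_embedded_gedcom(html):
    """Return (version, records); records are (level, xref, tag, value)."""
    match = COMMENT_RE.search(html)
    if match is None:
        return None, []
    version, body = match.group(1), match.group(2)
    records = []
    for raw in body.splitlines():
        line_match = LINE_RE.match(raw.strip())
        if line_match is None:
            continue  # skip blank or malformed lines
        level, xref, tag, value = line_match.groups()
        records.append((int(level), xref, tag, value))
    return version, records


page = """<html><body><p>Birth record</p>
<!--GEDCOM:4.0-5.5
0 @I2@ INDI
1 NAME Jonathan /Swift/
1 BIRT
2 DATE 22 OCT 1763
2 SOUR @S1@
0 @S1@ SOUR
1 TITL Birth Record for Jonathan Swift
-->
</body></html>"""

version, records = parse_embedded_gedcom(page)
```

The resulting records could be handed to whatever GEDCOM import code a program already has. One caveat: a value containing `-->` would end the HTML comment early, so real data would need some form of escaping.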

Over the last few years, GEDCOM has not kept up very well with advances made by many software manufacturers in additional types of data that need to be thus transferred, but this could be addressed. For example, note the above line:

1 _TYPE Birth Record

This "_TYPE" tag is a "custom" or "proprietary" tag, invented to supplement the GEDCOM standard, in an attempt to communicate information about the source that didn't seem to be addressed by the standard.

We could reactivate discussions about adding to the GEDCOM standard to provide standards for the new types of data being recorded by various genealogy software programs.

But in any case, it seems that the simplest approach to what you are proposing is to incorporate GEDCOM. This expands what you are proposing to not only sources, but to a way to embed any other type of genealogy data in machine consumable format within a web page.

And maybe we need a new tag, like:
0 COPYRIGHT YES

as a signal to those companies who scan the web for information they can take and incorporate into "for fee" databases, that this data is for non-commercial use only.

Gaylon Findlay
Incline Software