Saturday, March 11, 2006

Genealogical Embedded Citation Standard 0.1 (Strawman)

Thanks to Michael Nelson and Derek Maude for taking a first whack at what the structure for a genealogical embedded citation standard might be. The following structure is intended to be compatible with GEDCOM and the upcoming FamilySearch Family Tree. It could easily be implemented in XML or a microformat. Michael and Derek pose some outstanding questions, which follow this quick strawman proposal. Please review and share your thoughts through the comments link below or by e-mailing me (lawyerdc@ldschurch.org).

Looking for more information on Genealogical Embedded Citations? See the March 3rd article, Self-citing Internet Sources

Nested list of genealogical embedded citation elements

citation
     url
     film-number
     sheet-number
     page-number
     frame-number
     call-number
     book-number
     image-number
     record-number
     batch-number
     serial-number
     date-recorded
     certainty
     comment
     source
         url
         title
         author
         abbreviation
         publication-info
         description
         time-period
         locality
         language
         film-number
         call-number
         batch-number
         comment
         repository
             name
             address
             phone
             email
             url
             comment

In case your browser doesn't render the indentation above correctly, I've included a text description of the hierarchy at the bottom of the page.

Some questions to consider
1. Other formats have the ability to include the actual text. Is this necessary given the application?

2. Should there be a "provider" field for sources?

3. Should there be a "source-type" field for, say, stating the source is a census record?

4. Should this citation embedded in a page represent a citation for that page or for the original record?

5. We included a description field and a comment field in the source. Is that necessary?

6. Should there be an "agency" field to include what organization originally created the record?

Text Description of Hierarchy
'Citation' is level 1 in the hierarchy. It contains the following level 2 elements: url, film-number, sheet-number, page-number, frame-number, call-number, book-number, image-number, record-number, batch-number, serial-number, date-recorded, certainty, comment, source.

The level 2 element 'source' contains the following level 3 elements: url, title, author, abbreviation, publication-info, description, time-period, locality, language, film-number, call-number, batch-number, comment, repository.

The level 3 element 'repository' contains the following level 4 elements: name, address, phone, email, url, comment.
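
To make the hierarchy above concrete, here is a minimal Python sketch that records the allowed nesting and reports any element that falls outside it. It assumes the citation has been expressed as XML with element names identical to the names in the list; the function name `unexpected_elements` is hypothetical and not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Allowed child elements for each parent, copied from the strawman hierarchy.
ALLOWED_CHILDREN = {
    "citation": {
        "url", "film-number", "sheet-number", "page-number", "frame-number",
        "call-number", "book-number", "image-number", "record-number",
        "batch-number", "serial-number", "date-recorded", "certainty",
        "comment", "source",
    },
    "source": {
        "url", "title", "author", "abbreviation", "publication-info",
        "description", "time-period", "locality", "language", "film-number",
        "call-number", "batch-number", "comment", "repository",
    },
    "repository": {"name", "address", "phone", "email", "url", "comment"},
}


def unexpected_elements(root):
    """Return 'parent/child' paths for elements the strawman does not list."""
    problems = []
    if root.tag != "citation":
        problems.append(root.tag)

    def walk(node):
        # Leaf elements such as 'url' have no allowed children.
        allowed = ALLOWED_CHILDREN.get(node.tag, set())
        for child in node:
            if child.tag not in allowed:
                problems.append(f"{node.tag}/{child.tag}")
            walk(child)

    walk(root)
    return problems
```

An empty list means every element sits where the strawman expects it. New fields discussed in the questions above (for example a provider or source-type field) would only need a new entry in `ALLOWED_CHILDREN`.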

15 comments:

Anonymous said...

I don't know anything, but it seems if you can conceive of a need, you should provide for it. Actual text, Agency, etc. cost little to define into the standard compared to what they cost to define out of it, no?

Judy said...

The structure seems very good and comprehensive. The reason I want to see where information came from is to judge its credibility, especially if it conflicts with other information. I judge credibility by knowing its source (government record, primary, secondary-type stuff), the provider (some aren’t as reliable), and then having the ability to go to the source and see if I read it the same way. Thus, I would like to see the source and the provider prominently displayed in the citation, as well as the other information that allows me to go check it out.

Michael said...

Actual Text - it would seem weird to me to include actual-text because the idea is to embed this citation in a page that contains the actual text (if there is any). I think anyone that desires to see the actual text should use the URL to go look at it and if a program wants to store the actual text it should load the page and store that. That's just my opinion though. If you disagree then please speak up and say so.

Agency - My problem is more with the name than the field itself. I think it will mean different things to different people. Say, for example, I have a marriage record created by "G E Boer" of the Dutch Reformed Church that is being preserved at Calvin College and an image of which is offered on my personal web site. What should be in the Agency field? According to the GEDCOM standard entry for RESPONSIBLE_AGENCY the answer would be the Dutch Reformed Church. I think it is valuable information but "Agency" isn't very intuitive. Maybe "authors-organization"?

Provider - Judy's point is a very good one. Maybe Provider should be added as a group of fields like homepage, name, address, phone, email, ... What do you think?

Dan Lawyer said...

Michael,

Actual Text. I had not initially contemplated putting a text transcription (full or partial) into the citation block. I can however see that this would facilitate some automated or semi-automated data entry. To facilitate this however, you would need more than just the "actual text" field that is commonly implemented today. You would need to completely describe any genealogical information in the artifact in such a way that it could automagically be populated into the consuming application. After thinking through that, it seems more reasonable not to include the Actual Text as part of the citation block and consider other mechanisms for passing the transcribed data to a consuming application.

Agency. I like the term Authors-organization. Seems more intuitive.

Provider. I like this concept. Very similar to the contributor concept in the FamilySearch Family Tree. Does the contributor and submitter terminology work for this purpose?

Dan Lawyer said...

I received an e-mail from a genealogy application developer with some great feedback on the thread.

Hopefully I won't take too much liberty in summarizing the remarks from the e-mail.

Here is my summary of the e-mail.

GEDCOM has been around for 20+ years. There are many tools and applications that support it. Why not use the GEDCOM standard to describe the citation? That way all existing tools that support GEDCOM could easily import the source citation.

GEDCOM doesn't support everything that needs to be included in the citation but does allow for custom tags that could be used for the additional data.

Here is an example of how GEDCOM might be embedded in a comment in a web page:

<!--GEDCOM
0 @I2@ INDI
1 NAME Jonathan /Swift/
1 SEX M
1 BIRT
2 DATE 22 OCT 1763
2 PLAC London, England
2 SOUR @S1@
0 @S1@ SOUR
1 REPO @REPO1@
1 _TYPE Birth Record
1 TITL Birth Record for Jonathan Swift
1 AUTH Anglican Church
1 PUBL 1763
0 @REPO1@ REPO
1 NAME Westminster Abbey
-->
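
As a rough illustration of how a consuming application might read such a block, the following Python sketch pulls the GEDCOM lines out of the HTML comment and splits each one into level, optional cross-reference id, tag, and value. The `<!--GEDCOM` marker is taken from the example above; the function names are hypothetical, and the parsing covers only the simple line shapes shown here, not the full GEDCOM grammar.

```python
import re

# Matches the comment form used in the example: <!--GEDCOM ... -->
GEDCOM_COMMENT = re.compile(r"<!--GEDCOM\s(.*?)-->", re.DOTALL)


def parse_gedcom_line(raw):
    """Split one GEDCOM line into (level, xref_id, tag, value)."""
    level_text, rest = raw.split(" ", 1)
    xref_id = None
    if rest.startswith("@"):
        # Record lines such as '0 @S1@ SOUR' carry an id before the tag.
        xref_id, rest = rest.split(" ", 1)
    tag, _, value = rest.partition(" ")
    return int(level_text), xref_id, tag, value


def extract_gedcom_lines(html):
    """Return parsed GEDCOM lines from the first GEDCOM comment, or []."""
    match = GEDCOM_COMMENT.search(html)
    if match is None:
        return []
    lines = []
    for raw in match.group(1).splitlines():
        raw = raw.strip()
        if raw:
            lines.append(parse_gedcom_line(raw))
    return lines
```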


I can see how this approach would make it easier for existing applications to deal with the citations. The above example is conceptually sound but isn't quite right for this purpose as a source citation would not include the first portion of the GEDCOM

0 @I2@ INDI
1 NAME Jonathan /Swift/
1 SEX M
1 BIRT
2 DATE 22 OCT 1763
2 PLAC London, England

but only the source information

2 SOUR @S1@
0 @S1@ SOUR
1 REPO @REPO1@
1 _TYPE Birth Record
1 TITL Birth Record for Jonathan Swift
1 AUTH Anglican Church
1 PUBL 1763
0 @REPO1@ REPO
1 NAME Westminster Abbey

Would we be overly constrained by this approach? Would there be value in describing the source citation as both XML and GEDCOM? One approach would be to describe the citation in XML and then deliver a library which easily converts it to GEDCOM syntax.
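
A conversion library of that kind could start out very small. The Python sketch below turns an XML citation into GEDCOM source and repository records. The element-to-tag mapping, the function name `citation_to_gedcom`, and the default ids `@S1@` and `@REPO1@` are assumptions made for illustration; no mapping has been agreed on, and only a handful of fields are handled.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from strawman <source> children to GEDCOM source tags.
SOURCE_TAGS = [
    ("title", "TITL"),
    ("author", "AUTH"),
    ("abbreviation", "ABBR"),
    ("publication-info", "PUBL"),
]


def citation_to_gedcom(xml_text, source_id="@S1@", repo_id="@REPO1@"):
    """Convert a strawman XML citation into a list of GEDCOM lines."""
    citation = ET.fromstring(xml_text)
    source = citation.find("source")
    if source is None:
        return []

    lines = [f"0 {source_id} SOUR"]
    repository = source.find("repository")
    if repository is not None:
        # Point the source record at its repository record.
        lines.append(f"1 REPO {repo_id}")

    for element_name, tag in SOURCE_TAGS:
        text = (source.findtext(element_name) or "").strip()
        if text:
            lines.append(f"1 {tag} {text}")

    if repository is not None:
        lines.append(f"0 {repo_id} REPO")
        name = (repository.findtext("name") or "").strip()
        if name:
            lines.append(f"1 NAME {name}")
    return lines
```

Fields with no obvious GEDCOM counterpart (sheet-number, frame-number, and so on) are left out of this sketch; they would need either custom tags or a convention for folding them into existing ones.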

Thoughts?

Dan Lawyer said...

Received another e-mail from a product manager in the field.

They point out the need to standardize the description of genealogical data in web pages so that it can be easily searched by search engines, and consumed automagically by genealogy applications.

I can see another thread starting around this topic - a standard for describing genealogical data on the web.

Michael said...

I think there should be a suggested mapping to GEDCOM and I really like the idea of a library that performs the conversion. I don't think, however, that GEDCOM is sufficiently intuitive. Not many people in the world know what the SOUR, REPO, PUBL, and CALN fields should hold.

One of the things I really don't like about GEDCOM sources is the free form nature of PAGE and CALN (call number). The specification basically says "put whatever you want in these fields so that someone could find the source". I think a better approach would be to define the specific fields and then if someone wants to jam it all into these two fields in a GEDCOM file they can.

Here is the same data from the GEDCOM style example in an XML form of the strawman:

<citation>
  <url>?</url>
  <source>
    <title>Birth Record for Jonathan Swift</title>
    <author>Anglican Church</author>
    <time-period>1763</time-period>
    <repository>
      <name>Westminster Abbey</name>
    </repository>
  </source>
</citation>

Michael said...

No one has said anything about the "certainty" field. GEDCOM has the QUAY tag which contains a CERTAINTY_ASSESSMENT between 0 and 3:
0 = Unreliable evidence or estimated data
1 = Questionable reliability of evidence (interviews, census, oral genealogies, or potential for bias for example, an autobiography)
2 = Secondary evidence, data officially recorded sometime after event
3 = Direct and primary evidence used, or by dominance of the evidence

I would suggest that the "certainty" field here contain one of the following (each corresponds to a number in GEDCOM's QUAY):
Unreliable
Questionable
Secondary
Primary

Also, I don't think this should be a required field - many people don't really understand the difference between these four designations.
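
The correspondence between the suggested certainty values and GEDCOM's QUAY numbers can be sketched in a few lines of Python. The helper names are hypothetical; since the field would be optional, a missing or unrecognized value simply maps to `None`.

```python
# QUAY number -> suggested certainty value, as listed above.
CERTAINTY_BY_QUAY = {
    0: "Unreliable",
    1: "Questionable",
    2: "Secondary",
    3: "Primary",
}

# Reverse lookup, keyed by lower-cased name so matching ignores case.
QUAY_BY_CERTAINTY = {name.lower(): quay for quay, name in CERTAINTY_BY_QUAY.items()}


def certainty_from_quay(quay):
    """Return the certainty name for a QUAY number, or None if out of range."""
    return CERTAINTY_BY_QUAY.get(quay)


def quay_from_certainty(certainty):
    """Return the QUAY number for a certainty name, or None if absent/unknown."""
    if certainty is None:
        return None
    return QUAY_BY_CERTAINTY.get(certainty.strip().lower())
```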

Dan Hanks said...

I want to suggest an alternative to handling self-citation in a way that involves no new standards but (I think) accomplishes the same goals that we may want to consider.

(apologies for the extra spaces in the tags, couldn't figure out how to get Blogger to allow them otherwise)

In many blogs these days, there is a < link > tag inside the < head > tags that specifies that the page can be accessed in an alternative format. This is mostly used to indicate the address for the blog's accompanying RSS/Atom feed. Any number of these link tags can be provided.

As an example:

< link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://brainshed.com/blog/rss/1.0/brainshed.xml"/ >

is what I use for my blog. Both Firefox and Opera pick up on this and indicate by a particular icon that the page being viewed has an RSS feed associated with it. When I click on the icon the browser asks if I want to subscribe to the feed.

Could this perhaps be used to cite our genealogical sources as well? We could define an XML format in which the citation would appear (or use GEDCOM, or both!), and these would be available at an alternate url. For example:

< link rel="alternate" type="application/gedcom" href="http://somesite.com/citation.ged" title="GEDCOM format of this citation" >

to indicate where a GEDCOM-formatted citation for the currently viewed document could be found. Also:

< link rel="alternate" type="text/xml" href="http://somesite.com/citation.xml" title="XML representation of this citation" >

Browsers/plugins could pick up on this and (in the background) pull down the content in the linked URLs and offer appropriate options to process the data found there.
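
Here is a minimal Python sketch of the discovery step a plugin might perform, using only the standard library's `html.parser`. It collects the type and href of every link tag marked rel="alternate". The class and function names are hypothetical, and "application/gedcom" is simply the type used in the example above, not a registered media type as far as I know.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "alternate" in rel_values and attributes.get("href"):
            self.links.append((attributes.get("type"), attributes["href"]))


def find_alternate_links(html):
    """Return all alternate links found in the given HTML text."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links
```

A consuming application would then filter the returned pairs by type, fetch the matching URL, and hand the result to whatever GEDCOM or XML reader it already has.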

The advantage here is that citations in several standard formats could be provided such as GEDCOM or XML (the church could provide a DTD or XML-Schema to specify the allowed fields in the XML document).

Another advantage to this is that we're not defining any new 'standards' per se, but utilizing those that already exist (XML and GEDCOM in these examples).

The other advantage is that this method is already in use in various ways.

More info can be found here:
- http://www.w3.org/TR/REC-html40/struct/links.html#h-12.3.3

This approach could also address the need to 'standardize the description of genealogical data in web pages so that it can be easily searched by search engines, and consumed automagically by genealogy applications.'

I think we already have standards in place that can be used (again XML and GEDCOM; an XML form of GEDCOM hasn't really been nailed down, but it has been explored). By having pages with genealogical information add < link > tags which point to pages with the same data in other formats (XML and GEDCOM for example) we could go a long way. I think the effort required to go this route would actually be less than trying to embed the same information in some new, as-yet-unspecified format in the same page.

Thoughts?

Dan Hanks said...

One more comment:

Whatever format you go with, please include a top-level version field, so parsers can know how best to parse what they have.

Dan Lawyer said...

I really like the suggestion from Dan Hanks around using the <link> in the <head> of the web page. This would be an easy way to expose the citation information in both XML and GEDCOM format. As far as standardization, we would still need to standardize on a description of the citation in XML and perhaps agree upon some nuances of how to describe some things in GEDCOM that we'd like to include in a citation that aren't currently well defined.

I think the next step here is for us to post a proposed XML and GEDCOM schema for describing a source citation.

John Vilburn said...

Standards galore....
For a list of more citation standards than you ever wanted to know about, go to the references at the bottom of http://dublincore.org/documents/dc-citation-guidelines/

Whatever is done, if it is used in connection with the Church's efforts to put the granite vault microfilms online, it will have tremendous weight. We need to have a good balance between expressibility and simplicity.

There is a unique need to handle the multiple levels of sources inherent in those microfilms. First, we have the URL where the scanned microfilm can be found. At that page is a representation of a certain film number and page number within that film. That film number and page number provide a representation of a certain original document, whether it be a census page, a marriage certificate, or a page in a birth register. I believe that all of these levels are relevant, because they all help you to find the source in some form.

The URL does not need to be represented in the online citation, since the URL is what got us to that source. Whether this is used by an individual through a browser, or by a program accessing the page, the URL is already a known quantity. The URL is also the most transient attribute of any source.

Dan Lawyer said...

John,
Thanks for your insights. Regarding the transient nature of URLs for sources. This has to stop. One of the attributes of institutional collections of sources in the future must be immutable URLs. This is our intention for the Church's collection.

I agree with your statements about the need to capture the multiple layers of source citation. If URLs become immutable then it makes more sense to capture them as another layer of a citation.

Gary Hoffman said...

Dan, it sounds like we need to go to the root of the Internet and define a URI scheme for genealogy source citation. See http://www.w3.org/TR/uri-clarification/ for the discussion of what URIs mean today. We've wrestled with the problem of a uniform citation scheme ever since the early genweb discussions 12 years ago.

Dan Lawyer said...

Gary,

I spent some time reading through the information about URIs and Uniform Resource Citations. I wish they had nailed it back in the early days of the Internet. As it is, the microformat approach seems more appealing than trying to get the Internet to support another URI type.