Wednesday, June 27, 2007

More on Source-Centricity: Conclusions, Artifacts, and Evidence

Conclusions, Artifacts, and Evidence
We've been prototyping a flow for the Life Browser which allows users to create a person, add conclusions about that person, add artifacts about that person, and identify the artifacts that are supporting evidence for these conclusions. In this approach there are three basic types of data: 1) Conclusions, 2) Artifacts, and 3) Evidence.

Conclusions: These are basically what people believe to be the vital information about a person (birth, marriage, death, relationships, etc.).

Artifacts: A digital representation of something relating to a person's life. It consists of metadata which describes the artifact and provides source citation information; and may contain images, video, audio, or text. Artifacts are also of a particular type: picture, record, story, video, audio, personal knowledge.

Evidence: The linkage of an artifact to one or more conclusions. It records which conclusion(s) the artifact supports, plus an optional note explaining the linkage.
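
As a rough illustration of how these three types might fit together, here is a minimal Python sketch. The class names, field names, and the evidence_for helper are assumptions made for this example, not the prototype's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ArtifactType(Enum):
    # The artifact types listed in the post.
    PICTURE = "picture"
    RECORD = "record"
    STORY = "story"
    VIDEO = "video"
    AUDIO = "audio"
    PERSONAL_KNOWLEDGE = "personal knowledge"


@dataclass
class Conclusion:
    person_id: str   # hypothetical identifier for the person
    kind: str        # e.g. "birth", "marriage", "death"
    value: str       # what is believed, e.g. a date


@dataclass
class Artifact:
    artifact_type: ArtifactType
    title: str                       # descriptive metadata
    citation: str                    # source citation information
    media_uri: Optional[str] = None  # optional image, video, audio, or text


@dataclass
class Evidence:
    artifact: Artifact
    conclusions: List[Conclusion]    # one or more supported conclusions
    note: Optional[str] = None       # optional explanation of the linkage


def evidence_for(conclusion: Conclusion, all_evidence: List[Evidence]) -> List[Evidence]:
    """Return every evidence link that points at the given conclusion object."""
    return [e for e in all_evidence if any(c is conclusion for c in e.conclusions)]


# Illustrative data only.
birth = Conclusion("bob", "birth", "1 Jan 1900")
death = Conclusion("bob", "death", "date not yet known")
cert = Artifact(ArtifactType.RECORD, "Birth certificate", "hypothetical citation text")
links = [Evidence(cert, [birth], note="Certificate states the birth date")]
```

In this sketch a conclusion with no evidence simply has no Evidence objects pointing at it; whether that is an adequate way to represent a hypothesis is left open.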

One of the current shortcomings is that we haven't determined how to represent a conclusion that is a hypothesis, which may or may not have evidence to support it.

Evidence of Relationships
On a related note, for some time we've wanted to be more explicit in identifying evidence that supports relationships. I haven't seen any tools that do this. Have I missed something? For example, it is one thing to have evidence that 1 Jan 1900 is the birth date and quite another to have evidence that Jack and Jill are the parents of Bob. Most tools, however, only allow you to cite the source of the birth event, not the source of the relationship information (which may be the same source or a separate one). We're starting to play with a prototype that lets you explicitly identify evidence of relationships.
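
One way to picture the distinction is to give relationships their own source list, separate from the sources attached to events. The following Python sketch assumes hypothetical class names and placeholder citations; it is not the prototype's design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Source:
    citation: str


@dataclass
class EventConclusion:
    person: str
    kind: str
    value: str
    sources: List[Source] = field(default_factory=list)


@dataclass
class RelationshipConclusion:
    parent: str
    child: str
    # Kept separately from event sources: it may be the same source
    # that supports the birth date, or a different one.
    sources: List[Source] = field(default_factory=list)


# Placeholder sources for illustration only.
birth_record = Source("hypothetical birth record")
family_letter = Source("hypothetical family letter")

birth = EventConclusion("Bob", "birth", "1 Jan 1900", [birth_record])
jack_bob = RelationshipConclusion("Jack", "Bob", [birth_record, family_letter])
jill_bob = RelationshipConclusion("Jill", "Bob")  # asserted, but nothing cited yet

# Relationships that currently have no supporting source.
unsupported = [r for r in (jack_bob, jill_bob) if not r.sources]
```

Here the birth date and the Jack-Bob relationship happen to share a source, while the Jill-Bob relationship shows up as unsupported even though Bob's birth event is sourced.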

Anyone have any thoughts or ideas to share that might help us out with our prototyping? Please share.


RussellHltn said...

Hmmmm. I had a very different data model in mind. I'll avoid titles to avoid confusion with this one.

The lowest level is the digital image or source citation.

The next level up is an extraction of the source into a modified group sheet fragment. This is the lowest level the computer can comprehend and assist us.

The next level up is the connection of these fragments together - for example, establishing that the James Smith on the birth certificate is the same James Smith as the one on a death certificate and the one on a census. (Much like the nFS does.)

I think it's this level that you are talking about. I had envisioned another "source" level where the researcher makes the connections. The computer documents the person's name, date, and any notes - such as the researcher's reasons for believing this to be the same person. (Frankly this is the level of genealogy research that gets me - how do you KNOW they are the same? What level of similarity is enough to say they are the same?) One could also include a computer score on similarities. (Again, much like nFS gives us a score in looking at matches.) It makes it easy to review and reverse any bad decisions made at this level.

One question I didn't resolve is whether the fragments should connect to another fragment or to the composite whole. Perhaps both options would be available.

At the highest level of the data model is the "report" - it's the computer's tabulated results of all group sheet fragments stitched together. This is the level where we get the traditional group sheet and pedigree charts.

One reason I went with this model is that an item, such as a family bible or an unsourced GEDCOM, becomes a single source. (And easy to back out at a later date.) An imported GEDCOM could become the only source - but it would be clear that it's unsubstantiated. (Part of my data model includes "grades" for the sources to aid in resolving conflicts.) Since an unsourced GEDCOM has a "blank" grade, the results show a poor level of confidence.
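
The layers described in this comment can be sketched in Python roughly as follows. All names, the numeric grade scale (higher is better, None for "blank"), and the sample facts are assumptions made for illustration; the comment does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Source:
    title: str
    grade: Optional[int]  # hypothetical scale; None stands for a "blank" grade


@dataclass(frozen=True)
class Fragment:
    # An extraction of one person from one source.
    source: Source
    name: str
    facts: Tuple[Tuple[str, str], ...]  # e.g. (("birth", "1 Jan 1900"),)


@dataclass
class Link:
    # A researcher's decision that a fragment belongs to a composite person.
    fragment: Fragment
    person_id: str
    researcher: str
    note: str = ""


def report(links: List[Link], person_id: str) -> Dict[str, dict]:
    """Tabulate one value per fact, preferring the best-graded source."""
    best: Dict[str, dict] = {}
    for link in links:
        if link.person_id != person_id:
            continue
        grade = link.fragment.source.grade
        for fact, value in link.fragment.facts:
            current = best.get(fact)
            # A blank grade is treated as 0 when comparing.
            if current is None or (grade or 0) > (current["grade"] or 0):
                best[fact] = {"value": value, "grade": grade}
    return best


def back_out(links: List[Link], source: Source) -> List[Link]:
    """Remove every link that came from the given source."""
    return [l for l in links if l.fragment.source != source]


# Illustrative data only.
gedcom = Source("unsourced GEDCOM", None)
cert = Source("birth certificate", 3)
links = [
    Link(Fragment(gedcom, "James Smith", (("birth", "about 1900"),)), "p1", "researcher"),
    Link(Fragment(cert, "James Smith", (("birth", "1 Jan 1900"),)), "p1", "researcher",
         note="same parents named on both"),
]
full = report(links, "p1")
gedcom_only = report(back_out(links, cert), "p1")
```

With both sources linked, the report prefers the graded certificate. If the certificate is backed out, the GEDCOM value remains but carries a blank grade, which is how low confidence would surface.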

But getting back to your question - I think I need to know the pros and cons of the data model you're talking about against the one I came up with. I don't quite understand yours.

Blake Christensen said...

russellhltn, I think your data model and Lawyer's model overlap, with different assumptions. Your image/citation corresponds with Lawyer's Artifact. Your Extraction would need to be included in Lawyer's model in either the Artifact or the Evidence. Your Connection corresponds to Lawyer's Evidence, and your alternate idea of linking to a "composite whole" corresponds to Lawyer's Conclusion. (I think you should be linking to a "composite whole," by the way.) Lawyer didn't mention a report, but that would also be part of Lawyer's Conclusion.
In other words, I think you are thinking in a similar manner to what Lawyer

"On a related note, for some time we've wanted to be more explicit in identifying evidence that supports relationships. I haven't seen any tools that do this."

"The Master Genealogist" does this. The parent-child links can have citation information attached, along with the type of relationship (natural, adopted).

leifbk said...

I was delighted to find your blog earlier today, as I've been on the same track for a long time. You may want to look at my "Forays" article series, in particular the "Exodus" article.

I just added your blog to my blogroll.

regards, Leif.

Ranbo said...

While there are similarities between russellhltn and Lawyer's models, russell hit upon a key point when he said the extraction is the "lowest level the computer can comprehend". Without that, what you have is conclusions with pointers to images or citations that are only human-understandable. If the computer doesn't know what the document on the image says, then the user has to manually link that image to the person, to each conclusion, etc. If there is an extraction, though, then the computer can automatically know that the source supports various conclusions and relationships.

For example, if a birth certificate has a father, mother and child; and all 3 people are created in our "tree" system; and these three people point at the three "personas" in the extraction; then the computer can see that these relationships exist both in our tree and in the extracted source, and know that it is a source for that relationship conclusion.
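
The birth certificate example can be sketched as a small Python function. The dictionary layout, the identifiers, and the function name are assumptions for this illustration, not a description of any existing system.

```python
from typing import Dict, List, Set, Tuple


def supporting_sources(
    tree_relationships: List[Tuple[str, str]],   # (parent, child) pairs in the tree
    persona_links: Dict[str, Set[str]],          # tree person -> linked persona ids
    extractions: List[dict],                     # one dict per extracted source
) -> Dict[Tuple[str, str], List[str]]:
    """Map each tree relationship to the sources whose extraction shows it."""
    support: Dict[Tuple[str, str], List[str]] = {rel: [] for rel in tree_relationships}
    for ext in extractions:
        child_persona = ext["personas"].get("child")
        for role in ("father", "mother"):
            parent_persona = ext["personas"].get(role)
            if parent_persona is None or child_persona is None:
                continue
            for parent, child in tree_relationships:
                # The relationship is supported when both tree people point
                # at the matching personas in the same extraction.
                if (parent_persona in persona_links.get(parent, set())
                        and child_persona in persona_links.get(child, set())):
                    support[(parent, child)].append(ext["source_id"])
    return support


# Illustrative data only: one extracted birth certificate with three personas.
extraction = {
    "source_id": "birth-cert-1",
    "personas": {"father": "cert:father", "mother": "cert:mother", "child": "cert:child"},
}
persona_links = {
    "jack": {"cert:father"},
    "jill": {"cert:mother"},
    "bob": {"cert:child"},
}
tree = [("jack", "bob"), ("jill", "bob"), ("jack", "ann")]
result = supporting_sources(tree, persona_links, [extraction])
```

The user only links people to personas; the two parent-child relationships are then recognized as sourced without anyone citing the certificate on each relationship, and the jack-ann relationship is left with no supporting source.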

To me, the difference between a "sourced" and "source-centric" system is that a "sourced" system has conclusions with links to human-readable evidence; while a "source-centric" system has links to computer-understandable data that reflects what the source says.

We have been doing "sourced" genealogy for over a century, using textual source citations to tell people how to find the original "images" in books. Being able to link to images is the same thing, just much faster. That's awesome, but it's not the whole enchilada.

What's still missing is the ability to tell what we've already done, and what still needs to be done, and what conclusions are supported (or contradicted) by which sources. By having extracted data, you can tell which sources support which conclusions, without the user having to specify this for each field or relationship. If there is also a "source authority" that assigns unique IDs to sources, we can keep track of which sources have been extracted, which personas in each source have been linked into our "tree", and what else still needs to be done. That allows us to stop duplicating so much effort and put our effort where it is needed.
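
As a rough sketch of that bookkeeping, assume the source authority's catalog is a set of source IDs, extractions are recorded per source ID, and linked personas are kept in a set. These structures and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def remaining_work(
    source_catalog: Set[str],            # every source ID the authority knows about
    extracted: Dict[str, List[str]],     # source ID -> persona IDs found in it
    linked_personas: Set[str],           # persona IDs already linked into the tree
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return sources not yet extracted, and personas not yet linked, per source."""
    not_extracted = sorted(s for s in source_catalog if s not in extracted)
    unlinked = {}
    for source_id, personas in extracted.items():
        missing = sorted(set(personas) - linked_personas)
        if missing:
            unlinked[source_id] = missing
    return not_extracted, unlinked


# Illustrative data only.
catalog = {"S1", "S2", "S3"}
extracted = {"S1": ["S1:a", "S1:b"], "S2": ["S2:a"]}
linked = {"S1:a", "S2:a"}
todo_sources, todo_personas = remaining_work(catalog, extracted, linked)
```

Because every persona has an ID scoped to its source, two tree people linked to the same persona ID would also be visible as likely duplicates.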

A "sourced" system still doesn't allow that. If 10 people link to an image or have the same source citation, we still don't know if we got everybody, nor do we know which of those 10 people are really referring to the same name on the same page (i.e., duplicates).

Russell's model sounds like it comes closer to a full source-centric system, so while there are similarities with the other model, I wouldn't be too quick to say that they're equivalent. I think he has hit upon the main difference.