Sunday 13 December 2009

Building an integrated historical geographical information system, part 1

Historical source documents are increasingly becoming available on the Internet. It is a huge resource waiting to be put to use. The sources usually come in one of two formats: either as graphical scans of printed book pages, with or without large chunks of unedited OCR text, or as fully formatted electronic text that can be copied, edited and searched. Often little or no metadata about the source documents is available. I will mention a couple of examples. Digital libraries such as Google Books and Gallica hold huge numbers of scanned book pages with OCR text. My very rough estimate is that the accompanying OCR text is about 60 percent correct, so searching for a term in the text will sometimes give a correct result and sometimes not. Maybe we can expect OCR techniques to perform better in the future, I don't know. Otherwise we have to depend on humans taking responsibility for manually processing and correcting the texts. An example of the latter is the project Chartae Burgundiae Medii Aevi at the University of Burgundy in Dijon, France, which aims to make all medieval charter editions of Burgundy available on the Internet as fully corrected electronic texts. This is done partly through the project's own digitization efforts, including online publication of previously unedited texts from manuscripts, and partly through correction of digitized text found at Gallica and Google Books. Currently there are 25 editions available, all downloadable in text or Word format. Other examples of projects with high ambitions to publish fully corrected editions of historical source documents are the digital Monumenta Germaniae Historica (dMGH), Regesta Imperii (RI), and Codice diplomatico della Lombardia medievale (secoli VIII - XII).

For scholarly work it is important to be able to cite or quote individual pages or source documents in editions. Electronic texts must fulfill this requirement before they can become part of the scientific research process. Currently the requirement is beginning to be met in a number of ways. The dMGH allows URLs to be built to individual pages throughout the entire body of editions in the Monumenta Germaniae Historica, using the widely known abbreviations for the individual editions of source documents. The following example is a URL to an individual page in Gesta Dagoberti I., in: SS rer. Merov. 2, page 396 ("S." means "Seite", German for page).
http://www.mgh.de/dmgh/resolving/SS_rer._Merov._2_S._396

The next example demonstrates a URL built from the widely used sequential numbering of royal diplomas in Die Urkunden der Karolinger 1 (royal diplomas issued by the kings Pepin, Carloman and Charlemagne). The following two examples refer to diploma no. 165, a donation by Charlemagne to the monastery of Prüm (Rheinland-Pfalz), issued on 9 June 790 in Mainz (Mayence). Note the slightly different composition of the two URLs, which retrieve the same document by charter number and by page number respectively.
http://www.mgh.de/dmgh/diplomata/resolving/D_Kar._1_165
http://www.mgh.de/dmgh/resolving/DD_Kar._1_S._222
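The three dMGH links above follow a visible pattern, so they can be assembled from a citation. The following is only a sketch: the rules (spaces become underscores, "_S._" precedes a page number, diplomas use a separate path) are inferred from these three examples, not from dMGH documentation, and the function names are my own.

```php
<?php
// Sketch only: URL patterns inferred from the three example links,
// not from any official dMGH documentation. Function names are hypothetical.

/** Turn an abbreviation such as "SS rer. Merov. 2" into "SS_rer._Merov._2". */
function dmghSlug(string $abbreviation): string
{
    return str_replace(' ', '_', trim($abbreviation));
}

/** Link to a printed page, e.g. ("SS rer. Merov. 2", 396). */
function dmghPageUrl(string $edition, int $page): string
{
    return 'http://www.mgh.de/dmgh/resolving/' . dmghSlug($edition) . '_S._' . $page;
}

/** Link to a diploma by its sequential number, e.g. ("D Kar. 1", 165). */
function dmghDiplomaUrl(string $edition, int $number): string
{
    return 'http://www.mgh.de/dmgh/diplomata/resolving/' . dmghSlug($edition) . '_' . $number;
}

echo dmghPageUrl('SS rer. Merov. 2', 396), "\n";
echo dmghDiplomaUrl('D Kar. 1', 165), "\n";
echo dmghPageUrl('DD Kar. 1', 222), "\n";
```

Whether the pattern holds for every edition abbreviation would have to be checked against the resolver itself.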

Regesta Imperii is a series of source summaries (regesta) that orders all written testimonies of Frankish and German kings chronologically throughout the Middle Ages, with references to the evidence in source editions. The latest contribution is the first volume of source summaries for Charles the Bald, king of West Francia (840-877), Die Regesten Karls des Kahlen 840 (823) - 877, edited by Irmgard Fees and published in 2007. The summaries cover not only evidence in diplomas but also evidence of the rulers' activities in narrative sources. Regesta Imperii is also important for its listing of diplomas and other evidence of emperor Louis the Pious, because the edition of his diplomas in the MGH is still missing (the draft of this edition was destroyed during WW2). The following source summary refers to Regesta Imperii no. 1005, a royal donation by Louis the Pious issued on 8 May 840 in Salz, concerning royal estates in modern Belgium.
http://regesten.regesta-imperii.de/index.php?uri=0000-00-00_1_0_1_1_0_0_1005

Even in non-scholarly resources like Google Books it is possible to link to individual pages of a source edition. In Google Books this is done by the page number of the printed original. The example is a charter issued by Haroin in Wissembourg in 742, Liber donationum, no. 1 on page 7, in: Traditiones possessionesque Wizenburgenses. Codices duo cum supplementis. Zeuss (ed.), Speyer 1842. In the URL, id is a distinct and persistent identifier of this source edition.
http://books.google.com/books?id=yLoGAAAAQAAJ&pg=PA7
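Because the Google Books link consists only of the volume identifier and the printed page number, it can be composed without visiting the page. A minimal sketch, assuming the "PA" prefix seen in the example applies to ordinary arabic-numbered pages (the function name is hypothetical):

```php
<?php
// Sketch: compose a Google Books page link from the volume id and the
// printed page number. The "PA" prefix is taken from the example URL;
// other page kinds (e.g. roman-numbered front matter) are not covered.
function googleBooksPageUrl(string $volumeId, int $page): string
{
    return 'http://books.google.com/books?id=' . rawurlencode($volumeId) . '&pg=PA' . $page;
}

echo googleBooksPageUrl('yLoGAAAAQAAJ', 7), "\n";
```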

Unfortunately Gallica only permits links to the sequential number of the scanned pages, which does not correspond to the page numbers of the printed edition. If you wish to link to a certain page, you have to visit the actual web page and copy the link. In other words, you cannot construct the link from the book identifier and the printed page number alone, as you can with Google Books.

My project Regnum Francorum Online aims at referencing historical events from the Merovingian and Carolingian period in time, in space, and by agency. It builds a collection of metadata about the events, including links to individual source documents where they are available online, taking advantage of the linking possibilities described in the examples above. Referencing in time means that events are given a numerical estimate of time. PHP implements the concept of the Julian day count, and it is utilized here. Referencing in space means geo-referencing the places mentioned in the event to the modern geographical concepts of longitude and latitude, as well as to administrative affiliation such as country, province and other territorial divisions, so that each place name is distinctly identified. Referencing by agency means identifying the individuals mentioned in the event as well-known historical persons or, where that is not possible, as individuals with a recognized name. Uncertainty in referencing must be taken into account.
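To illustrate the Julian day count: PHP's calendar extension offers juliantojd() for dates in the Julian calendar, which is the relevant calendar for the early Middle Ages. The sketch below spells out the standard integer arithmetic instead, so that it runs without the extension. The function name is my own, and I do not claim this is how the application stores its dates.

```php
<?php
// Sketch: Julian day number for a date in the Julian calendar.
// Years use astronomical numbering (year 0 = 1 BC); valid for the
// historical range of interest here (any year after -4800).
function julianCalendarToJdn(int $year, int $month, int $day): int
{
    $a = intdiv(14 - $month, 12);   // 1 for January and February, else 0
    $y = $year + 4800 - $a;
    $m = $month + 12 * $a - 3;      // March = 0 ... February = 11
    return $day + intdiv(153 * $m + 2, 5) + 365 * $y + intdiv($y, 4) - 32083;
}

// Charlemagne's diploma for Prüm, 9 June 790
$prum = julianCalendarToJdn(790, 6, 9);
// Donation by Louis the Pious, 8 May 840
$salz = julianCalendarToJdn(840, 5, 8);

echo "Days between the two diplomas: ", $salz - $prum, "\n";
```

With dates reduced to integers, an uncertain date can be expressed as a pair of day numbers (earliest and latest possible day) and compared with ordinary numeric operators.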

Monday 16 November 2009

Historical GIS and Semantic web

The historical GIS application Regnum Francorum Online references historical events in time, in space and by agency, and links the events to source documents and literature available online. In doing so, the application becomes a GIS interface to a growing number of both primary and secondary sources online. This also includes the huge collection of articles in Wikipedia. To me, it has become evident that Wikipedia will become a major source for all kinds of knowledge in the future. It is therefore of great importance to examine how Regnum Francorum Online can be closely integrated with Wikipedia.

Each article in Wikipedia has a unique tag, together with an ID, which according to Wikipedia's own instructions is necessary to build a permanent link to the article. However, I have never seen a reference to Wikipedia that includes this ID, not even in the semantic web project DBpedia, which has extracted and coded articles and their content into XML/RDF, including the geographic information of such features. The DBpedia project uses the same unique tags as Wikipedia, but has also collected the geographic features of the GeoNames project, which are identified by a unique numerical ID. GeoNames also geo-references articles in Wikipedia. Lately, the Wikipedia project has collected alternative identifiers of populated places, the local administrative units, which in the European Union are the basic units of official statistics. These units are municipalities (e.g. commune, Gemeinde). The tags of geographic features in Wikipedia and DBpedia are often just the official name. Alternative tags are also allowed, using redirects to the main article. Taking all this into consideration, a named geographic place within the EU can be uniquely identified by a commonly known combination of country code, administrative code and place name. For example, Mommenheim in Rheinland-Pfalz, Germany (DE/07339037/Mommenheim) can be distinguished from Mommenheim in Alsace (FR/67301/Mommenheim). The tags in Wikipedia are Mommenheim (Germany) and Mommenheim,_Bas-Rhin (France) respectively. It is in this context of inter-linked resources between Wikipedia, GeoNames, DBpedia and national statistical agencies, in both HTML and XML/RDF format, that I would like to make the geographic features of Regnum Francorum Online inter-linked as well, maintaining the administrative codes and the Wikipedia tags of geographical features. In almost all cases, an article about the history of a city can be found in the city article itself. There are a few exceptions, but in these cases the city article refers to the separate history article, e.g. the article about Lendorf in Austria refers to a separate article about the Roman municipium Teurnia.
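The combined identifier of country code, administrative code and place name is easy to compose and to take apart again. A small sketch; the function names and the array keys are illustrative assumptions, not the application's actual code:

```php
<?php
// Sketch: compose and parse identifiers such as "DE/07339037/Mommenheim".
// The administrative code is kept as a string so leading zeros survive.
function makePlaceKey(string $countryCode, string $adminCode, string $name): string
{
    return strtoupper($countryCode) . '/' . $adminCode . '/' . $name;
}

function parsePlaceKey(string $key): array
{
    // Limit to 3 parts so a place name containing "/" stays intact.
    $parts = explode('/', $key, 3);
    if (count($parts) !== 3) {
        throw new InvalidArgumentException("Not a place key: $key");
    }
    return ['country' => $parts[0], 'code' => $parts[1], 'name' => $parts[2]];
}

$german = makePlaceKey('de', '07339037', 'Mommenheim');
$french = makePlaceKey('fr', '67301', 'Mommenheim');
echo $german, "\n", $french, "\n";
print_r(parsePlaceKey($french));
```

The two keys share the place name but differ in country and code, which is exactly what separates the two Mommenheims.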

In both Regnum Francorum Online and Wikipedia/DBpedia there are other geographic features as well, namely institutions of the state (kingdom, Latin regnum; county, pagus/comitatus; march, marchae; duchy, ducatum) and of the church (bishopric, Latin episcopatum; monastery, monasterium). The tags identifying these institutions in Wikipedia are not as consistent and predictable as those of populated places. The articles about bishoprics reflect the current division of the Catholic Church and distinguish between ancient and current dioceses, e.g. there is one article about Roman_Catholic_Diocese_of_Passau and another about Prince-Bishopric_of_Augsburg, both containing historical information about the respective bishopric. In Regnum Francorum Online these are implemented as bishopric/Augsburg and bishopric/Passau respectively. Furthermore, articles about the history of monasteries are named by the place name with the suffix _Abbey, e.g. Lorsch_Abbey, corresponding to monastery/Lorsch in Regnum Francorum.

When it comes to territorial subdivisions (institutions) of the kingdom, Wikipedia becomes more confusing. From the perspective of early medieval European history it would have been desirable to have tags describing the traditional divisions of provinces and kingdoms, and maybe that will come in the future. In the English Wikipedia there is a short listing of Carolingian counties containing 7 entries. This category more or less corresponds to the listing of Gau (pagus) in the German Wikipedia, containing 147 entries of Gau/Gaugrafschaft situated mainly in modern Germany, Austria and Switzerland. For modern France there is a corresponding category, Liste historique des comtés français (in English, list of the historical French counties), referring to different kinds of articles such as lists of counts or articles about a historical region. Obviously these categories are still under development.

In Wikipedia there are also categories of historical events (battles, treaties) that are exciting to take a closer look at. Articles about such events are often well written and substantial, e.g. Battle of Poitiers 732. The category Battles involving the Franks currently has 30 entries. Unfortunately, the implementation of events in Regnum Francorum is somewhat ambiguous at the moment, because records were originally implemented as historical documents rather than as events. Later it became evident to me that a historical source document can contain information about several events that are distinct from each other in time, in space and/or by agency. Consequently, source information about the battle of Poitiers can be retrieved from the database in terms of place and time, but not by searching for a record of the battle of Poitiers directly.

Well, to summarize, in order to inter-link Regnum Francorum with other significant websites like Wikipedia and GeoNames, common identifiers of place and institution must be maintained. This is also the first step to a full integration into the future semantic web.

Saturday 31 October 2009

Working with spatial data in binary representation in MySQL and PHP

In an earlier post I showed how to store and retrieve spatial POINT geometries in MySQL and PHP. A POINT is a geometry that represents a single location in coordinate space. In this post we will be dealing with more complex spatial data such as LINESTRING and POLYGON, which in the real world represent for example rivers, roads or territories. A number of Shapefiles (SHP) describing the territories and divisions of kingdoms in early medieval Europe have been imported into MySQL. The geometries are drawn directly from the SQL database by a PHP script, building the basemaps of the historical GIS application Regnum Francorum Online. This article discusses how this was accomplished. A Linestring is a one-dimensional geometry represented by a sequence of points, whereas a Polygon is a planar surface representing a multisided geometry, with a single exterior boundary and zero or more interior boundaries, where each interior boundary defines a hole in the polygon. The first and last point of the exterior boundary are the same, which is why we say the boundary is closed. Here we will only deal with polygons without interior boundaries.

In order to get these geometries onto a map on the computer screen we must somehow retrieve the single points that make up the geometries, transform them from coordinates to pixels, and then use graphics functions to draw lines and polygons. The following steps were used to achieve this.

1. Define a minimal bounding rectangle (MBR) representing the boundaries of the map we will display on the screen, and compare this geometry with the lines and polygons in the database to see whether they should be selected. If this box is expressed in coordinate space, we can take advantage of the MySQL functions that compare the spatial relationship between two geometries. The MBR of the following Linestring, which runs from one corner of the map to the opposite corner, represents the map area given its width, height and scale.
SET @g = GeomFromText('LineString($longitude1 $latitude1,$longitude2 $latitude2)')

2. The database tables (layers) holding the coordinates for rivers or territories have a spatial column of type LINESTRING or POLYGON, which has been given the name SHAPE2 in all tables of this kind. We select the values from this column that intersect the MBR defining the boundaries of the map (@g), and retrieve the data as a binary string (BLOB). The Well-Known Binary (WKB) representation for geometric values is defined by the OpenGIS specification. WKB uses one-byte unsigned integers, four-byte unsigned integers, and eight-byte double-precision numbers. For fast access to the data, we use an unbuffered query in MySQL.
$query="SELECT AsBinary(SHAPE2), color FROM $layer WHERE MBRIntersects(SHAPE2,@g)";
...
$result=mysql_unbuffered_query($query);


3. In the code snippet above, we are retrieving two columns from each row of the database table. Unpack is a PHP function that unpacks data from a binary string into an associative array according to a given format, in this case the one defined by WKB. The geometry begins with the byte order (1 byte), followed by the geometry type (4 bytes) and the number of elements (4 bytes). This sums to 9 bytes, and at this position (offset) in the binary string the elements of the given geometry type begin: the points of a Linestring or the rings of a Polygon. The class function drawLine deals with Linestrings and the function drawRings deals with Polygons. We will not deal any further with Multipolygon and Point geometries.

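while ($row = mysql_fetch_row($result)) {
    $color = $row[1];
    // WKB header: byte order (C), then geometry type and element count.
    // V = 32-bit unsigned little-endian, matching byte order 1 in the data.
    $g = unpack("Corder/Vtype/Vnum", $row[0]);
    $type = $g['type'];
    $num = $g['num'];
    $offset = 9; // the header is 1 + 4 + 4 bytes long
    switch ($type) {
        case 1: // POINT
            break;
        case 2: // LINESTRING: $num is the number of points
            $this->drawLine($offset, $num, $color, $row[0]);
            break;
        case 3: // POLYGON: $num is the number of rings
            $this->drawRings($offset, $num, $color, $row[0]);
            break;
        case 6: // MULTIPOLYGON
            break;
    }
}
mysql_free_result($result);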
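To see the 9-byte header without a database at hand, a WKB Linestring can be built with pack() and read back with unpack(). This is a self-contained sketch with made-up coordinates. It uses the explicit little-endian format codes V and e (the latter needs PHP 7.0.15 or later), which is my choice for portability rather than what the code above uses.

```php
<?php
// Sketch: hand-built WKB for LINESTRING(3 49, 4 48.5), then decoded again.
// Byte order 1 means little-endian; geometry type 2 is LINESTRING.
$wkb = pack('CVV', 1, 2, 2)                // byte order, type, number of points
     . pack('e4', 3.0, 49.0, 4.0, 48.5);   // lon/lat pairs as 8-byte doubles

$header = unpack('Corder/Vtype/Vnum', $wkb);
$offset = 9;                               // 1 + 4 + 4 header bytes

// "@9" jumps to the offset, "e*" reads doubles until the end of the string.
$coords = array_values(unpack("@$offset/e*", $wkb));

printf("order=%d type=%d points=%d\n", $header['order'], $header['type'], $header['num']);
print_r($coords);
```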

4. Now we are ready to transform the coordinates to pixels on the map. We will begin with the Linestring geometry. $offset is the position in the binary string ($row) where we start to read, and $numpts is the number of points in the Linestring. From the offset position in the buffer we unpack eight-byte double-precision numbers (d) into the points array ($pts) until we reach the end of the buffer (*). Next we create the array $geom, which will hold the resulting pixel values of the points defining the Linestring. When reading the points array, we know that the first number is a longitude, the second a latitude, and so forth. At every second number we therefore have a longitude/latitude pair, which is transformed using the gRef function. The resulting pixel values are appended to the $geom array. Then we use the PHP graphics function imageline to draw a line segment between x1,y1 and x2,y2 in the given color.

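function drawLine($offset, $numpts, $color, $row) {
    // Read all doubles from the offset to the end of the buffer
    // (d uses machine byte order, so a little-endian host is assumed)
    $pts = unpack("@$offset/d*", $row);
    $geom = array();
    $lon = 0; $lat = 0;
    $odd = true;
    foreach ($pts as $value) {
        if ($odd) {
            $lon = $value;
        } else {
            // Second value of the pair: transform to pixel coordinates
            $lat = $value;
            $this->gRef($lon, $lat);
            $geom[] = $this->Xp();
            $geom[] = $this->Yp();
        }
        $odd = !$odd;
    } // end foreach
    // One segment between each pair of consecutive points
    $max = ($numpts - 1) * 2;
    for ($i = 0; $i < $max; $i += 2) {
        imageline($this->img, $geom[$i], $geom[$i+1],
            $geom[$i+2], $geom[$i+3], $color);
    }
} // end function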
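The pairing of the flat list of doubles into longitude/latitude pairs can also be written with array_chunk(). In the sketch below a simple linear mapping stands in for gRef(), whose implementation is not shown in this post. The function toPixel() and the bounding-box array are hypothetical, and the mapping is deliberately not the UTM projection used by the application.

```php
<?php
// Hypothetical stand-in for gRef(): a plain linear mapping from a
// lon/lat box to image pixels. NOT the UTM projection of the application.
function toPixel(float $lon, float $lat, array $box, int $width, int $height): array
{
    $x = ($lon - $box['west']) / ($box['east'] - $box['west']) * ($width - 1);
    // Pixel rows grow downwards, so latitude is measured from the north edge.
    $y = ($box['north'] - $lat) / ($box['north'] - $box['south']) * ($height - 1);
    return [(int) round($x), (int) round($y)];
}

// Flat list as delivered by unpack(): lon, lat, lon, lat, ...
$flat = [2.0, 50.0, 3.0, 49.0, 4.0, 48.0];
$box  = ['west' => 2.0, 'east' => 4.0, 'south' => 48.0, 'north' => 50.0];

$pixels = [];
foreach (array_chunk($flat, 2) as [$lon, $lat]) {
    $pixels[] = toPixel($lon, $lat, $box, 201, 101);
}
print_r($pixels);
```

Each consecutive pair of entries in $pixels is then one line segment to hand to imageline().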

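5. Now we deal with the Polygons. At its core this is the same function as the one for Linestrings, because each ring of a polygon is a Linestring with the same start and end point. The difference is that every ring is preceded by its own four-byte point count, so the byte pointer has to be advanced ring by ring.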

function drawRings($offset, $numrings, $color, $row) {
    $off = $offset;
    $x = $numrings;
    while ($x > 0) {
        // Each ring starts with its number of points (4 bytes)
        $h = unpack("@$off/Vnumpts", $row);
        $numpts = $h['numpts'];
        $off += 4;
        // Two doubles (longitude, latitude) per point
        $nump = $numpts * 2;
        $pts = unpack("@$off/d$nump", $row);
        $geom = array();
        $lon = 0; $lat = 0;
        $odd = true;
        foreach ($pts as $value) {
            if ($odd) {
                $lon = $value;
            } else {
                $lat = $value;
                $this->gRef($lon, $lat);
                $geom[] = $this->Xp();
                $geom[] = $this->Yp();
            }
            $odd = !$odd;
        } // end foreach
        imagefilledpolygon($this->img, $geom, $numpts, $color);
        // If we wish to draw a border
        imagepolygon($this->img, $geom, $numpts, $this->bordercolor);
        // Increase byte pointer past the coordinates of this ring
        $off += ($nump * 8);
        $x--;
    } // end while x
    return($off);
} // end function
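The ring-by-ring walk through the buffer can be tried in isolation on a hand-built polygon. The sketch below returns the rings as arrays of [lon, lat] pairs instead of drawing them. The function name readPolygonRings() and the triangle are made up for the illustration, and the little-endian codes V and e (PHP 7.0.15 or later) are my choice.

```php
<?php
// Sketch: walk the rings of a little-endian WKB POLYGON.
function readPolygonRings(string $wkb): array
{
    $h = unpack('Corder/Vtype/Vnumrings', $wkb);
    if ($h['order'] !== 1 || $h['type'] !== 3) {
        throw new InvalidArgumentException('Expected a little-endian WKB POLYGON');
    }
    $off = 9;                                  // skip the 9-byte header
    $rings = [];
    for ($r = 0; $r < $h['numrings']; $r++) {
        $numpts = unpack("@$off/Vn", $wkb)['n'];   // point count of this ring
        $off += 4;
        $count = $numpts * 2;                       // two doubles per point
        $flat = array_values(unpack("@$off/e$count", $wkb));
        $rings[] = array_chunk($flat, 2);           // [[lon, lat], ...]
        $off += $count * 8;                         // advance past the doubles
    }
    return $rings;
}

// A closed triangle: the first and last point are identical.
$ring = [0.0, 0.0, 4.0, 0.0, 0.0, 3.0, 0.0, 0.0];
$wkb = pack('CVV', 1, 3, 1) . pack('V', 4) . pack('e8', ...$ring);

$rings = readPolygonRings($wkb);
print_r($rings);
```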

Wednesday 28 October 2009

Using MySQL spatial extensions in historical GIS

This post describes some aspects of using MySQL spatial extensions in the historical GIS application Regnum Francorum Online. Because I was already using MySQL to store evidence of historical events, the choice to try out the spatial extensions introduced in MySQL version 4.1 was close at hand. The geographical information system Regnum Francorum Online (RFO) is a MySQL database with a number of interrelated tables. The Events table holds information about the historical event (time, type of event, source document etc.) and is linked to evidence of the places and actors of the event. In turn, the evidence of place is linked to entries in the Places table, which will be the example of this post. The Places table consists of the following columns: id, name, type, country, official geographic code and coordinates. The coordinates column (pt) is defined with the POINT geometry type, holding a single longitude-latitude pair. Indexes are built on the columns id and name, and a spatial index is built on the pt column.
CREATE TABLE places (
  id INT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
  name BLOB(255) NOT NULL,
  type INT(3) UNSIGNED,
  cc CHAR(2),
  cog CHAR(8),
  pt POINT NOT NULL,
  PRIMARY KEY (id),
  KEY (name(255)),
  SPATIAL KEY (pt)
)

Now the Places table is ready to be populated with data. In real life, collecting data for the more than 11,000 places currently in the database has of course been a long process. From the beginning, name, country and coordinates were collected from the GeoNames geographical database. GeoNames is still the only service I know of that allows you to geocode addresses worldwide for free. Due to license restrictions in the geocoding service of the Google Maps API, retrieving coordinates for purposes other than showing a Google map is not allowed. Collecting official geographic codes is a work in progress, with data provided by state agencies in the different countries. So far, geographic codes for France from INSEE have been added. This data is crucial for the distinct identification of places, and it holds information about the administrative affiliation of a place, e.g. Quierzy (a Carolingian palace) has the code 02631, which can be translated into département Aisne, arrondissement Laon, canton Coucy-le-Château-Auffrique, commune Quierzy.
INSERT INTO places VALUES (51,'Quierzy',3,'FR','02631',GeomFromText('POINT(3.1440379 49.5708778)') )
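As an aside on the INSEE code: only the département can be read directly from the five characters. The arrondissement and canton named above come from INSEE's reference tables, not from the code itself. A small sketch of the split follows. The function name is mine, and the handling of overseas départements (codes beginning with 97, where the département part is three digits long) is included as a hedge even though no such place occurs in this database, as far as I know.

```php
<?php
// Sketch: split a 5-character INSEE commune code into département and
// commune number. Codes are strings, so the leading zero of "02631" survives.
function splitInseeCode(string $code): array
{
    if (strlen($code) !== 5) {
        throw new InvalidArgumentException("INSEE commune codes have 5 characters: $code");
    }
    // Overseas départements start with 97 and use three characters.
    $deptLen = (strncmp($code, '97', 2) === 0) ? 3 : 2;
    return [
        'departement' => substr($code, 0, $deptLen),
        'commune'     => substr($code, $deptLen),
    ];
}

print_r(splitInseeCode('02631'));   // Quierzy: département 02 (Aisne)
```

This is also the reason for storing the code in a CHAR column rather than as an integer.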

GeomFromText is the MySQL function that builds the binary representation of the geometry from its text form, here a POINT written as 'POINT(longitude latitude)'. The more than 11,000 places in the RFO database are now ready for very fast retrieval, based on SQL queries that take the limits of the output map into account, together with other features of the map selected by the user.


Example: Coins of Pepin, Charlemagne and Carloman
Let's say we want to retrieve all evidence of mints in the Carolingian kingdom of the Franks. The evidence of mints is, for example, the different coins published in various catalogs. One such catalog is Les monnaies royales de France sous la race Carolingienne, deuxième partie by Ernest Gariel, Strasbourg 1884, available at Google Books and the Internet Archive, containing all coins known at that time that were issued by king Pepin (752-768), Charlemagne (768-814) and Carloman (768-771). The evidence of each coin, with information about the mint on one side and the name of the king on the other, is suitable to put in the Events table: no. 73, silver coin of type denier with inscription RP· | +TRI/CAS, that is, evidence of actor (RP = king Pepin 752-768) and evidence of place (TRI/CAS = Troyes, dép. Aube). The evidence-of-place table connects the event with the place, holding IDs of column type UNSIGNED INTEGER(8) for the Events and Places tables respectively.


Follow this link to see this example in the real database application.


Retrieving data from a geometry column of type POINT is very simple: use the MySQL function AsText() to SELECT the values as a string representation of the stored binary values. In the example above, Troyes has the coordinates "4.0748 48.2975".
SELECT name, AsText(pt) FROM places
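On the PHP side, the string returned by AsText() has the form POINT(longitude latitude) and has to be split before the numbers can be used. A minimal sketch of such a parser; the function name is my own and only plain decimal notation is handled:

```php
<?php
// Sketch: parse the text form of a POINT, e.g. "POINT(4.0748 48.2975)",
// into longitude and latitude. Exponent notation is not handled.
function parsePointWkt(string $wkt): array
{
    $number = '(-?\d+(?:\.\d+)?)';
    if (!preg_match('/^\s*POINT\s*\(\s*' . $number . '\s+' . $number . '\s*\)\s*$/i', $wkt, $m)) {
        throw new InvalidArgumentException("Not a POINT: $wkt");
    }
    return ['lon' => (float) $m[1], 'lat' => (float) $m[2]];
}

$troyes = parsePointWkt('POINT(4.0748 48.2975)');
printf("lon=%.4f lat=%.4f\n", $troyes['lon'], $troyes['lat']);
```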
Individual values of longitude and latitude can be retrieved with the functions X() and Y() respectively. However, we will take this a step further and show some other features of the spatial extensions as well. Values in a spatially enabled SQL table can be selected using minimal bounding rectangles (MBRs). In this case, all places with evidence of a mint inside the minimal bounding rectangle of the output map, defined by @g, will be selected. The first step is to set the MBR of the output map using its upper-left and lower-right corners, expressed as two pairs of longitude and latitude. The $variables are defined in the PHP environment.
mysql_query("SET @g = GeomFromText('LineString($longitude1 $latitude1, $longitude2 $latitude2)')");
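Since the corner coordinates are interpolated directly into the SQL string, it is worth making sure that only numbers can end up there. One way, sketched here with a hypothetical helper name, is to pass the values through float parameters and sprintf(). The %F conversion always writes a decimal point regardless of locale.

```php
<?php
// Sketch: build the text form of the diagonal LineString whose MBR is the map.
// The float type declarations reject non-numeric input before it reaches SQL.
function mbrLineStringWkt(float $lon1, float $lat1, float $lon2, float $lat2): string
{
    return sprintf('LineString(%F %F,%F %F)', $lon1, $lat1, $lon2, $lat2);
}

// Upper-left and lower-right corner of an example map
$wkt = mbrLineStringWkt(2, 50, 4, 48);
echo "SET @g = GeomFromText('$wkt')\n";
```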

Next we compose the SQL query for all evidence of mints within the selected time period ($minyear - $maxyear). Placeindex is the name of the SQL table holding the evidence of places in events, and the event type minting is represented by the numerical value 163. The query uses GROUP BY to collapse the rows into one row per unique place of minting, and puts the number of pieces of evidence for each place in the cnt (count) column.
$query="SELECT
 placeindex.pid,
 places.name,
 X(places.pt),
 Y(places.pt),
 events.type IN(163) as mint,
 COUNT(*) as cnt
FROM placeindex, places, events
WHERE MBRWithin(places.pt,@g)
 AND placeindex.pid=places.id
 AND placeindex.eid=events.id
 AND events.type IN (163)
 AND events.maxyear >= '$minyear'
 AND events.maxyear <= '$maxyear'
GROUP BY placeindex.pid";

Now we perform our SQL query and draw the map, which is saved as a PNG image in a temporary directory on the server. The member function gRef() is where the fetched coordinates (decimal longitude and latitude) of the cities are transformed into pixel coordinates in the UTM projection of the current map. This is not explained in this post.
[PHP code (simplified) running inside our map-drawing class]:
$this->img = imagecreatetruecolor($this->width, $this->height);
$pic_coin = imagecreatefromgif ($_SERVER["DOCUMENT_ROOT"]."/pics/coin.gif");
$result = mysql_query($query);
while($row = mysql_fetch_row($result)) {
 $pid = $row[0];
 $oname=htmlspecialchars($row[1], ENT_QUOTES);
 $mint = $row[4];
 $this->gRef($row[2], $row[3]);
 $px= $this->Xp();
 $py= $this->Yp();
 imagecopymerge ($this->img, $pic_coin, $px-8, $py-8, 0, 0, 16, 16, 100);
} // end while
imagepng($this->img,$this->filename);
imagedestroy($this->img);

Conclusion
The conclusions are coming soon.

Regnum Francorum Online historical GIS

Regnum Francorum Online - interactive maps and sources of early medieval Europe 614-840 is a historical geographic information system (GIS), aiming at referencing historical events of Merovingian and Carolingian Europe (the Frankish kingdom) in time and space. The information system covers the period from approx. 614 to 840. Historical events are recognized through source documents of different kinds, mainly contemporary charters or copies of such documents, but also archaeological evidence like coins. Metadata about the events has been collected, including the time and geographical location of the event, the type of event (donation, privilege, assembly, battle, siege etc.), the actors involved in the event (historical persons like Charlemagne, or persons or groups identified by name) and links to source documents available online. The information system is implemented as an online database application running on an Apache server, using MySQL with spatial extensions, PHP server scripts and AJAX, producing interactive maps and inter-related output of historical and geographical information. Over the last few years, a growing number of editions of primary sources have become available online in digital libraries such as Google Books, Gallica, Monumenta Germaniae Historica (dMGH) and Regesta Imperii, just to mention the largest collections. The main purpose of this information system is to provide an interactive geographical interface for the visualisation of the events and their historical context, connecting them to freely available online resources like full-text source documents and literature, but also to other sources like medieval manuscripts, coins and maps.

I intend to use this blog to elaborate on and discuss conceptual, methodological and technical issues regarding the development of Regnum Francorum Online. I have written my own PHP class to draw the maps from coordinates stored in MySQL database tables. The maps are currently drawn using the Universal Transverse Mercator projection. There are a number of map layers collected and compiled from various sources, including my own digitization of printed maps. In parallel with my own maps, basemaps from the Google Maps service are also used to visualize layers with historical information. The blog will also be used to report on the digitization of historical (medieval) sources, the semantic web, historical GIS and online mapping in general (Google Maps API, Google Earth, OpenLayers). Questions and comments are most welcome.