This section explains how the database storage of Kapow Katalyst works.
The tables you create for a type in a database will have a column for each of the attributes in your type, plus an additional 7 house-hold fields, named: ObjectKey, RobotName, ExecutionId, FirstExtracted, LastExtracted, ExtractedInLastRun, and LastUpdated. The most important of these is the ObjectKey, as it is the primary key for the table.
Note: The reason for the name 'ObjectKey' is to be found in the terminology previously used in Kapow Katalyst. Previously, types and variables were called 'objects'. Refer to this upgrade guide for more information. To adhere to the new terminology, 'ObjectKey' should be called 'ValueKey'. Renaming it would cause quite a lot of backward compatibility problems, though, and therefore it has been allowed to keep its old name.
The ObjectKey for a type is what uniquely identifies values extracted from variables of that type when stored in a database. You therefore have to figure out what uniquely identifies values of the type. If you are building a car repository, the VIN number may be enough to provide unique identification of each car. If you are collecting baseball results, you may need the year, the team names, the ballpark, the date, to uniquely identify each match.
As you build the type you can select how the ObjectKey is going to be calculated. This is done by checking the "part of database key" option when creating a new attribute. For our car example the VIN number would be the only attribute marked part of the database key, for the baseball match example, the attributes year, team names, ballpark, date, would all be marked as part of database key.
The robot developer may also specify the Key directly on the Store in Database action, if he wishes to override the default algorithm defined on the type.
The attributes which are not Part of Database Key are sometimes referred to as non-key fields. The car could for example have a price attribute, but even if the price changed we would still consider it the same car.
Kapow Katalyst provides 3 actions for managing values in a database: Store in Database, Find in Database, Delete from Database. The Find and Delete actions are rather trivial, but Store in Database does more than just store the value.
Store in Database may either insert a new value (of some type) into the table, or update an existing value which was previously stored. Here is a list of exactly what happens.
When storing the value of some variable, the ObjectKey is calculated based on the variable's values of the attributes which in the variable's type are marked Part of Database Key, or, alternatively, if the robot developer specifies a key on the action, this key is used instead.
Using the calculated key, a check is made to see if the value already exists in the database.
If the value does not exist, a new row is inserted into the database (under this ObjectKey).
If the value already exists, it is updated, that is, all the non-key attributes are written to the table (under this ObjectKey).
Whenever a value is inserted all the 7 house-hold fields are updated. On update, only some of the fields change. The table below provides an overview.
|ObjectKey||The primary key for this value||Insert|
|RobotName||The name of the robot which stored this value||Insert and Update|
|ExecutionId||The execution id for the robot execution which stored this value||Insert and Update|
|FirstExtracted||The first time the value was stored.||Insert|
|LastExtracted||The last time the value was stored.||Insert and Update|
|LastUpdated||The date when the value was last updated.||Update*|
|ExtractedInLastRun||If the value was extracted in the latest run (uses 'y' and 'n').||Insert and Update*|
The 7 house-hold fields, and when they change
After each robot execution (in which the robot used Store in Database), all values that have previously been collected by this robot, but have not been stored during this run will have ExtractedInLastRun set to 'n' and LastUpdated set to now, indicating that the value was not found on the website during the latest run.
Note: If a value was found in the previous run, but no non-key fields have changed, then LastUpdated will not be updated. However, if the value was not found in the previous run, but in a run prior to that, LastUpdated will be updated even if the non-key fields have not changed, as this means that the value was deleted from the site and then reappeared again later.
The tables created by Kapow Katalyst are often referred to as harvest tables, as the robots are harvesting data into these tables.
To find out what information was available on a website the last time the robot was run you can use SQL like:
SELECT * FROM table WHERE ExtractedInLastRun = 'y'
However if you are running queries against a table at the same time as a robot is storing data into the table, the result will be comprised of data from the previous run, mixed with whatever data the executing robot has stored so far. Kapow therefore recommends that you copy the data out of the harvest tables, and into a different set of production tables, so you can run your queries against a stable data set.
There are many solution where robots are used to store data in a database, but most of them fall under one of the 3 scenarios listed below.
|Repository matching website (small data sets)||
The idea is to have a repository that matches the items on a website 1-to-1.The easiest way to accomplish this is have a production table, that is truncated (deleting all rows) every time the robot is done executing, and then copy every record from the harvest table where ExtractedInLastRun='y' into this table. This works well for small data sets.
|Repository matching website (large data sets)||
Same as above, but the data set is too large to copy all data after every robot execution, instead we want to update the production table after each robot execution based on the changes which have occurred.
This is where the LastUpdated field comes in handy. All values which have been updated, will have a LastUpdated field value larger than the start time of the robot. You can get the start time from the database logging tables, or you can have the robot store it somewhere.
You can detect deleted values like
You can detect new values like this
You can detect updated values
The default setup allows you to see when a value was first extracted and when it was last updated, but you cannot see if it was found in the 5th, 7th and 10th run of the robot, but not in the 6th and 8th run.
In this case, you should copy all the data from your harvest table into another table after the robot run, but in your new table the ObjectKey should not be a primary key. Instead create an extra column called RUN_ID and use it together with the ObjectKey to create a compound primary key. If you don't need a RUN_ID you could simply create an auto incremented column and use that as the primary key of your secondary table. Truncate the harvest table before each run.
Copy harvest data to production tables
You don't have to copy all the house-hold fields to your production table, only the ObjectKey is required for you to update your production tables.
If you have multiple robots storing values of the same type to the same database there are a few things you should be aware of.
Every time a value is stored, the RobotName column is updated. If you have two robots storing the 'same' value (as identified by ObjectKey), only the last one will show after the robots are done executing.
If two robots store the same value at 'exactly' the same time, you will get an error. They will both find that the value is not in the table, and try to insert it, but only one of them will succeed. In most cases the error can simply be ignored as it is in fact precisely the same value.
If you run the same robot twice at the same time and the robot stores data in a database, you break the way the ExtractedInLastRun column is used!
When the first robot is done executing, it will update the ExtractedInLastRun to 'n' for all values it has not stored. This includes all values stored by the second robot so far. Later when the second robot finishes, it will set ExtractedInLastRun to 'n' for all values stored by the first robot, completely negating the first run.
The storage system does not provide an automated way of managing relations between values. If you have a value of type Person and one of type Address, and you want to link them, you will have to maintain this link.
The easiest way to create a link is to have the ObjectKey of then Person value be a foreign key in the Address value which should be linked to this person.
If the ObjectKey is calculated automatically from the type, you can use the Calculate ObjectKey action, to generate the key and assign it to each of the address values before you store them.
You should be careful when building robots with relations between stored values. If an error occurs when you store the Person value, you should make sure that no Address values are stored.
If you are using MySQL, Oracle or Sybase, you should be aware of the following with regards to ObjectKeys.
On Oracle empty string is stored as null.
On Sybase empty string is store as " " (a string with a single space).
MySQL doesn't have millisecond precision on timestamps.
These three cases all result in a potential loss of data when the data is stored in the database. The ObjectKey is calculate inside the robot based on the data in the given variable, if you later load the value from the database and try to recalculate the ObjectKey, the ObjectKey will be different if data loss occurred in any of the attributes marked as part of database key.