Before we start though we need some tables to work with, so to make things relevant we will use a file list and a list of hashes of contraband material. Our files table lists a file id (say the MFT no.) a file name and the hash of the file (we will just use a 4 digit integer representation of a hash to make things readable):
CREATE TABLE files (id int primary key, name text, hash int)
CREATE TABLE hashes (hash int, level int)
For anyone reading this who doesn’t know the significance of hashes in investigations, here is a brief summary.
Each file in the file list has a digital signature (the hash) that uniquely identifies it. Any two files with the same hash are statistically exactly the same. The hashes table contains a list of digital signatures that correspond to known contraband files. These could be illegal images, instructions for making drugs or other contraband etc. etc. Note that for this example the hash column has not been designated a primary key, this is intentional.
So on to the different joins:
Suppose we want a list of all files and their associated level whether or not the file has a matching hash in the hashes table. This is known as a LEFT JOIN (or LEFT OUTER JOIN) and we want our results table to show all of the rows in the files table with the matching level from the hashes table if applicable. Visually this can be represented by a Venn diagram with the file table represented by the left hand circle and hashes table the right hand circle.
The resulting table shows us the four fields we have selected with those files that match our list of contraband files from our hash table also listing their “level”, or if they don’t have a matching hash the level column is blank.
The results we expect can be visualised in the following table:
This shows that every row in the files table will be shown in our query results joined to those rows in the hashes table that have a matching hash.
In the forensic Browser we can also visually see the relationship between the tables in the query along with the results:
Using The Forensic Browser we can create the join the old fashioned way by just typing in the text of the query, once we tab out (or click outside) of the text box the visual elements will be created automatically (your query will be nicely formatted). Or, we can create the query visually as follows:
Drag each of the two tables from the “Tables” panel at the top right into the central visual query designer (or just double-click on each of them), the first table you add is your “left table”. Click on the checkbox next to each column that you want in the final report. Select the column in the left table with the mouse and drag the cursor to the column in the right table that you wish to “join” with. Execute your query.
The LEFT JOIN is the default join in The Forensic Browser when two tables/rows are joined in this way.
In English this can be explained as follows:
SELECT files.id, files.name, files.hash, hashes.level FROM files LEFT OUTER JOIN hashes ON files.hash = hashes.hash
Line 5 says we want to gather our data from the files table
And line 6 says we want a LEFT JOIN, i.e. all the rows from the left table (files, the first named table) irrespective of whether they have a matching row in the hashes table, with those rows in the hashes table that match based on the join field specified in this row. The joining field in this example is the hash field from both tables.
So our final result should show all of the rows and all three columns from the files table and just a single column and those rows from the hashes table that have a matching hash with the files table.
Note that if there were two or more rows in the hashes table that matched a hash (i.e. duplicate hash values) then there would be a row in the query results for each matching file AND each matching hash.
If the hashes table is modified such that there are duplicate hashes stored within, it looks as follows:
Then the result table is quite different. In the table below each of the two rows in the files table with hash 8753, has been joined with both of the rows in the hashes table that have this hash.
It is important to understand the schema of the tables that you perform a join upon and in particular the definition of the columns. A LEFT JOIN on two columns each of which is guaranteed to have unique values, i.e. a primary key, will result in a one to one relationship, i.e. one row in the results table for each row in the left table. A LEFT JOIN where one or both of the joined columns has duplicate entries can result in many more output rows than input rows.
The RIGHT JOIN is the reverse of the LEFT JOIN, i.e. ALL of the rows from the right table with just those rows from the left table that match the joining condition. RIGHT JOINS are not supported by SQLite, however, it is simple enough to replicate utilising a LEFT JOIN using the tables in the opposite order from that above.
Our examination so far is going well but if we have many thousands of files we might not initially want to see those files that aren’t contraband, so we want a query that lists all the files in our file list that have a matching entry in the hashes list, this brings us on to the INNER JOIN, represented by the Venn diagram below:
Left excluding join
Our investigation has now moved on, we have examined all of the files with matching hashes but we need to look at those unknown files to see if there is any new evidence. So we want to isolate those files that don’t have a matching hash in the hashes table. For this we need what is sometimes known as a LEFT EXCLUDING JOIN, which is essentially a LEFT JOIN with a qualifying clause.
Right Excluding Joins
The RIGHT EXCLUDING JOIN is the reverse of this. RIGHT JOINS are not supported by SQLite, however it is simple enough to create a LEFT EXCLUDING JOIN using the tables in the reverse order, as above.
Full outer join
While undertaking our investigation we come across a clone of the hard disk we are examining and we would like to compare this with the original and see what the differences are. So we want to design a query that returns just those rows that don’t appear in both databases. The query we want is a FULL OUTER JOIN and the Venn diagram for it is:
It’s quite easy to see the differences in the examples above, but with many thousands of files reviewing the data manually would be daunting. For our examination we are interested in the files on the original disk that are not on the clone and vice versa. To do this we will again use the hash value.
So how do we simulate a FULL OUTER JOIN?
I hope that you can all see that a FULL OUTER JOIN is essentially the results of a LEFT EXCLUDING JOIN merged with the result of a RIGHT EXCLUDING JOIN, i.e.
Our first query is a LEFT EXCLUDING JOIN and as expected it returns just the one file in the files table that is not in the clone table:
And then as previously discussed we simulate the RIGHT EXCLUDING JOIN by swapping the table names and doing a LEFT EXCLUDING JOIN.
We could leave things there as the results are easy enough to understand at this level and indeed in some ways more useful from an investigators standpoint than just one list of files that differ. But we started this section with the intention of simulating a FULL OUTER JOIN so we will continue and complete our task.
To get a results table with the combined contents of the two tables above we simply need a UNION of the two queries. A UNION can be applied to any two (or more) queries provided that the queries produce the same output columns, in the same order. Our queries clearly do this.
Creating a UNION in The Forensic Browser for SQLite is straight forward.
First create each query in turn (as we did above), then with the first query displayed in The Forensic Browser click on the “add union” button (the plus symbol below)
In the subsequent new visual query designer add the second part of the query, The Forensic Browser will join the two queries with a UNION and display the SQL.
*Note that the union above is not a technically accurate simulation of a full outer join, the reasoning behind this is complex and beyond the scope of this article, and there is no distinction between the source of the files. i.e. we can’t see from the query which device a file was on.
To partially get around this restriction we can add a label to each part of the UNION (remember each part must have the same columns in the same order). Our final SQL query and the results are shown below with an additional column showing the device from which each file is missing.