
------------------------------------------------------------
Copyright 1999, The University of California at Irvine
------------------------------------------------------------

Please read the README file first which describes the hybrid tree code.
This file (called README_relfeed) describes the changes/additions you need
to the hybrid code as described in README to run relevance feedback
on top of the hybrid tree (also see the papers [1], [2]).

DESCRIPTION OF SOURCE CODE FILES
================================

This distribution has 12 files.

(1) It contains the 8 new source/header files which you would need 
in addition to the hybrid tree files described in README:

Cluster.h
Cluster.C 
qpm.h
qex.h
qpm.C
qex.C
naive.C
refinement.C

(2) It also contains 3 files which should replace the original
hybrid tree files with the same names:

HTree.h
HTreeSearch.C
makefile

(3) It also contains this README_relfeed file.

COMPILING THE CODE
==================

Run "make" (using the new makefile).


RUNNING THE CODE
================

The makefile will generate 3 executable programs:
(1) HTreeTest
(2) Refine_naive
(3) Refine_FR

HTreeTest is the same as the HTreeTest described in the original code.
Use HTreeTest to construct the hybrid tree from a data file and dump
it into a binary file. For example, to construct a  hybrid tree from the
included data file (CH.16d.asc) and dump it into "CH.dump", 
run "HTreeTest 0 CH.16d.asc HTdump.color". See README for detailed 
instructions. You can also use HTreeTest to run ``no feedback'' queries
(see README for detailed instructions).

To run feedback queries, we have 2 test programs:
(1) naive.C
(2) refinement.C

The executables corresponding to the above test files are
Refine_naive and Refine_FR. For the same query and the same
refinement model, the answers generated by the 
test programs are identical. (The difference lies
in the speed of the techniques. The naive takes longer time
while the FR (full reconstruction) approach takes much less time.
The reason is that the naive approach executes each
feedback query just like a initial query i.e. do the whole
thing from scratch. On the other hand, FR reuses the work done 
during previous feedback iterations to answer the query 
in subsequent feedback iterations. See [2] for details.)
Both test programs support multiple refinement models:
(1) query expansion (2) query point movement (3) query reweighting
and (4) both query point movement and query reweighting.
See [1] and [2] for details on these refinement models.
To run the naive approach, type in:

Refine_naive CH.16d.Query  model CH.dump
where model = 1 for query expansion 
            = 2 for query point movement 
            = 3 for query reweighting 
            = 4 for both query point movement and query reweighting

You can use any query file instead of CH.16d.Query (as long as it is 
in the same format as CH.16d.Query, see README for details on the 
format) and whatever hybrid tree dump file in place of CH.dump.

IMPORTANT: BEFORE YOU RUN THE ABOVE COMMAND, CREATE A SUBDIRECTORY CALLED
result UNDER YOUR CURRENT DIRECTORY. THE refine_naive PROGRAM WILL
GENERATE SOME RESULT FILES INSIDE THE result DIRECTORY. THE PROGRAM
WILL CRASH IF THERE IS NO result SUBDIRECTORY UNDER THE CURRENT DIRECTORY.

The above command will generate the following output:
Query Expansion
Number of empty nodes = 0 
Loaded 1879 nodes from disk 
Loaded 68041 objects from disk 
Starting query
Query 0
Query 1
Query 2
Query 3
Query 4
Query 5
Query 6
Query 7
Query 8
Query 9
Query 10
Query 11
Query 12
Query 13
.
.
.
Query 86
Query 87
Query 88
Query 89
Query 90
Query 91
Query 92
Query 93
Query 94
Query 95
Query 96
Query 97
Query 98
Query 99

Query file: CH.16d.Query
Data file: CH.dump
Query Expansion
HistNaive
Computing Precision-Recall Graph
 Iteration 0 recall = 0.420667  DiskAcc 317.05  CPU time 0.212335
 Iteration 1 recall = 0.660191  DiskAcc 322.91  CPU time 0.296015
 Iteration 2 recall = 0.714     DiskAcc 328.61  CPU time 0.362086
 Iteration 3 recall = 0.734667  DiskAcc 332.66  CPU time 0.441674
 Iteration 4 recall = 0.748952  DiskAcc 335.86  CPU time 0.515103
 Iteration 5 recall = 0.752952  DiskAcc 337.47  CPU time 0.515009
done

It will also generate a bunch of result files under the result directory.
Before we explain the above result and the result files, I will
briefly explain what the test program is doing. You can to go 
through naive.C for further details.


First, we choose GROUND_TRUTH (set to 3) number of query points
at a time (from the query file) to generate multipoint queries 
(see [2] for details on multipoint queries).
For each multipoint query, we construct a graded set of 
its RELEVANT (set to 50) neighbors (based on L1 distance)
i.e. the top 10 answers have the highest grades,
the next 10 have slightly lower grades etc.
We refer to this set as the relevant set or ground truth of the
multipoint query. We construct the starting query by NOT choosing
the multipoint query we used to generate the ground truth
but a minor variation of it. We construct the starting 
query by using only the first STARTING_POINTS (set to 1) number
of points among the GROUND_TRUTH (set to 3) number of points.
The idea is to start with an approximation of the ideal
query (the one used to generate the ground truth) and 
keep refining the query till we are close to the ground truth 
(the ideal set of answers). We request for RETRIEVED 
(set to 100) number of top answers to the starting query (iteration 0).
We refer to the set of answers returned as the retrieved set.
We obtain the refined query by taking the graded intersection 
of the retrieved and relevant sets i.e. 
if an object $O$ in the retrieved set is also present
in relevant set, it is added to the new multipoint query
with its grade as in relevant set. See [1] and naive.C
for further details on the construction of the refined query.
The goal here is to refine the query such that the retrieved set 
gets as close as possible to relevant set over a small number 
of refinement iterations. 

In the output above, the recall printed quantifies the 
``the graded intersection of the retrieved and relevant sets''
i.e. higher the recall, the closer the retrieved set is
to the relevance set (see the code for details).
So as we can see, the retrieved set is getting closer
and closer to the relevant set in successive iterations
(which means the refinement is working :)).
The program also prints the average number of (random) disk 
accesses performed the hybrid tree (as well as the CPU time) 
to return the top RETRIEVED (set to 100) for each iteration.

The program also outputs precision recall graphs, one for each iteration,
in the result directory. 


To run the full reconstruction approach, type in:

Refine_FR CH.16d.Query  model CH.dump
where model = 1 for query expansion 
            = 2 for query point movement 
            = 3 for query reweighting 
            = 4 for both query point movement and query reweighting

IMPORTANT: BEFORE YOU RUN THE ABOVE COMMAND, CREATE A SUBDIRECTORY CALLED
result UNDER YOUR CURRENT DIRECTORY. 

The above command will generate the following output:
Query Expansion
Number of empty nodes = 0 
Loaded 1879 nodes from disk 
Loaded 68041 objects from disk 
Starting query
Query 0
Query 1
Query 2
Query 3
Query 4
Query 5
Query 6
Query 7
Query 8
Query 9
Query 10
Query 11
Query 12
Query 13
.
.
.
Query 86
Query 87
Query 88
Query 89
Query 90
Query 91
Query 92
Query 93
Query 94
Query 95
Query 96
Query 97
Query 98
Query 99

Query file: CH.16d.Query
Data file: CH.dump
Query Expansion
Hist#
Computing Precision-Recall Graph
 Iteration 0 recall = 0.420667  DiskAcc 317.05  CPU time 0.215482
 Iteration 1 recall = 0.660191  DiskAcc 45.78   CPU time 0.524043
 Iteration 2 recall = 0.714     DiskAcc 17.77   CPU time 0.612719
 Iteration 3 recall = 0.734667  DiskAcc 8.71    CPU time 0.735577
 Iteration 4 recall = 0.748952  DiskAcc 5.57    CPU time 0.793892
 Iteration 5 recall = 0.752952  DiskAcc 3.22    CPU time 0.819333
done

Note that the only part of the output that is different from the 
one generated by Refine_naive is the disk accesses and the CPU time.
The recall remains the same because the both approaches 
generate exactly the same retrieved set in each iteration
(since we are using the same queries and same refinement model).
The disk accesses of the FR approach falls drastically with each 
iteration i.e. you can perform the later iterations of feedback at almost
zero cost. That is exactly the benefit of using the FR approach
instead of the naive approach. See [2] for more details.

The program, like the naive approach, will also generate 
a precision recall graph for each iteration in the result directory. 
These graphs are identical to the graph generated by
the naive approach (because the retrieved sets are the same).


References
==========

[1] @article{
author="K. Porkaew, K. Chakrabarti and S. Mehrotra",
title="Query Refinement for Multimedia Similarity Retrieval in MARS",
journal="1999 ACM Multimedia Conference",
month="November",
year="1999"
}


[2] @article{
author="K. Chakrabarti, K.Porkaew, M. Ortega and S. Mehrotra",
title="Evaluating Refined Queries in Top-$k$ Retrieval Systems",
journal="Submitted for publication. Available online at http://www-db.ics.uci.edu/pages/publications/index.shtml#TR-MARS-00-05",
month="July",
year="2000"
}


===================================================================
Contact kaushik@ics.uci.edu in case of any problems.
