------------------------------------------------------------
Copyright 1999, The University of California at Irvine
------------------------------------------------------------

Please read the README file first; it describes the hybrid tree code.
This file (README_relfeed) describes the changes/additions you need to
make to the hybrid tree code described in README in order to run
relevance feedback on top of the hybrid tree (also see papers [1] and
[2]).

DESCRIPTION OF SOURCE CODE FILES
================================

This distribution has 12 files.

(1) It contains the 8 new source/header files which you need in
    addition to the hybrid tree files described in README:

        Cluster.h
        Cluster.C
        qpm.h
        qex.h
        qpm.C
        qex.C
        naive.C
        refinement.C

(2) It also contains 3 files which should replace the original hybrid
    tree files with the same names:

        HTree.h
        HTreeSearch.C
        makefile

(3) It also contains this README_relfeed file.

COMPILING THE CODE
==================

Run "make" (using the new makefile).

RUNNING THE CODE
================

The makefile generates 3 executable programs:

(1) HTreeTest
(2) Refine_naive
(3) Refine_FR

HTreeTest is the same as the HTreeTest described in the original code.
Use it to construct the hybrid tree from a data file and dump it into
a binary file. For example, to construct a hybrid tree from the
included data file (CH.16d.asc), run "HTreeTest 0 CH.16d.asc
HTdump.color" (the examples below assume the dump file is named
"CH.dump"; substitute whatever name you used). See README for detailed
instructions. You can also use HTreeTest to run ``no feedback''
queries (again, see README for detailed instructions).

To run feedback queries, we provide 2 test programs:

(1) naive.C
(2) refinement.C

The executables corresponding to these test files are Refine_naive and
Refine_FR. For the same query and the same refinement model, the
answers generated by the two test programs are identical; the
difference lies in their speed. The naive approach takes longer, while
the FR (full reconstruction) approach takes much less time. The reason
is that the naive approach executes each feedback query just like an
initial query, i.e., it does everything from scratch, whereas FR
reuses the work done during previous feedback iterations to answer the
query in subsequent iterations. See [2] for details.

Both test programs support multiple refinement models:

(1) query expansion
(2) query point movement
(3) query reweighting
(4) both query point movement and query reweighting

See [1] and [2] for details on these refinement models; an
illustrative sketch of models (2) and (3) follows.
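For concreteness, here is a minimal, self-contained sketch of what
query point movement and query reweighting typically compute in this
kind of system: a Rocchio-style shift of the query point toward the
relevant (positive feedback) points, and per-dimension weights
inversely proportional to the spread of those points. This is only an
illustration of the general idea; the function names and exact
formulas below are assumptions, and the actual implementations live in
qpm.C, qex.C and refinement.C (see [1]).

    /* Illustrative sketch only -- NOT the code in qpm.C/qex.C. */
    #include <vector>
    #include <cmath>

    typedef std::vector<double> Point;   // a point in the 16-d feature space

    // Query point movement: move the current query point toward the
    // centroid of the relevant points.  alpha = 0 keeps the old query,
    // alpha = 1 jumps all the way to the centroid.
    Point moveQueryPoint(const Point &query,
                         const std::vector<Point> &relevant, double alpha)
    {
        int dim = query.size();
        Point centroid(dim, 0.0), moved(dim);
        for (size_t i = 0; i < relevant.size(); i++)
            for (int d = 0; d < dim; d++)
                centroid[d] += relevant[i][d] / relevant.size();
        for (int d = 0; d < dim; d++)
            moved[d] = (1.0 - alpha) * query[d] + alpha * centroid[d];
        return moved;
    }

    // Query reweighting: weight each dimension by the inverse of the
    // standard deviation of the relevant points along it (dimensions on
    // which the relevant points agree count more), normalized to sum to 1.
    // Assumes at least one relevant point.
    Point reweight(const std::vector<Point> &relevant)
    {
        int dim = relevant[0].size();
        Point mean(dim, 0.0), weight(dim, 0.0);
        for (size_t i = 0; i < relevant.size(); i++)
            for (int d = 0; d < dim; d++)
                mean[d] += relevant[i][d] / relevant.size();
        double sum = 0.0;
        for (int d = 0; d < dim; d++) {
            double var = 0.0;
            for (size_t i = 0; i < relevant.size(); i++)
                var += (relevant[i][d] - mean[d]) * (relevant[i][d] - mean[d]);
            weight[d] = 1.0 / (std::sqrt(var / relevant.size()) + 1e-6);
            sum += weight[d];
        }
        for (int d = 0; d < dim; d++)
            weight[d] /= sum;
        return weight;
    }

    // Weighted L1 distance used to rank database objects against the query.
    double weightedL1(const Point &q, const Point &o, const Point &w)
    {
        double dist = 0.0;
        for (size_t d = 0; d < q.size(); d++)
            dist += w[d] * std::fabs(q[d] - o[d]);
        return dist;
    }

Model (4) simply applies both steps before the next iteration. Query
expansion (model 1), in contrast, keeps several representative points
as a multipoint query rather than collapsing them into a single moved
point (see [2] and qex.C).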
To run the naive approach, type:

    Refine_naive CH.16d.Query model CH.dump

where model = 1 for query expansion
            = 2 for query point movement
            = 3 for query reweighting
            = 4 for both query point movement and query reweighting

You can use any query file instead of CH.16d.Query (as long as it is
in the same format as CH.16d.Query; see README for details on the
format) and any hybrid tree dump file in place of CH.dump.

IMPORTANT: BEFORE YOU RUN THE ABOVE COMMAND, CREATE A SUBDIRECTORY
CALLED result UNDER YOUR CURRENT DIRECTORY. THE Refine_naive PROGRAM
WILL GENERATE SOME RESULT FILES INSIDE THE result DIRECTORY. THE
PROGRAM WILL CRASH IF THERE IS NO result SUBDIRECTORY UNDER THE
CURRENT DIRECTORY.

The above command will generate the following output:

    Query Expansion
    Number of empty nodes = 0
    Loaded 1879 nodes from disk
    Loaded 68041 objects from disk
    Starting query
    Query 0
    Query 1
    Query 2
    Query 3
    Query 4
    Query 5
    Query 6
    Query 7
    Query 8
    Query 9
    Query 10
    Query 11
    Query 12
    Query 13
    . . .
    Query 86
    Query 87
    Query 88
    Query 89
    Query 90
    Query 91
    Query 92
    Query 93
    Query 94
    Query 95
    Query 96
    Query 97
    Query 98
    Query 99
    Query file: CH.16d.Query
    Data file: CH.dump
    Query Expansion HistNaive
    Computing Precision-Recall Graph
    Iteration 0 recall = 0.420667 DiskAcc 317.05 CPU time 0.212335
    Iteration 1 recall = 0.660191 DiskAcc 322.91 CPU time 0.296015
    Iteration 2 recall = 0.714 DiskAcc 328.61 CPU time 0.362086
    Iteration 3 recall = 0.734667 DiskAcc 332.66 CPU time 0.441674
    Iteration 4 recall = 0.748952 DiskAcc 335.86 CPU time 0.515103
    Iteration 5 recall = 0.752952 DiskAcc 337.47 CPU time 0.515009
    done

It will also generate a number of result files under the result
directory. Before explaining the above output and the result files, we
briefly describe what the test program does; go through naive.C for
further details.

First, we choose GROUND_TRUTH (set to 3) query points at a time (from
the query file) to generate multipoint queries (see [2] for details on
multipoint queries). For each multipoint query, we construct a graded
set of its RELEVANT (set to 50) nearest neighbors (based on L1
distance), i.e., the top 10 answers have the highest grades, the next
10 have slightly lower grades, and so on. We refer to this set as the
relevant set or ground truth of the multipoint query.

The starting query is NOT the multipoint query used to generate the
ground truth but a minor variation of it: we construct it using only
the first STARTING_POINTS (set to 1) of the GROUND_TRUTH (set to 3)
points. The idea is to start with an approximation of the ideal query
(the one used to generate the ground truth) and keep refining the
query until we are close to the ground truth (the ideal set of
answers).

We request the top RETRIEVED (set to 100) answers to the starting
query (iteration 0). We refer to the set of answers returned as the
retrieved set. We obtain the refined query by taking the graded
intersection of the retrieved and relevant sets, i.e., if an object O
in the retrieved set is also present in the relevant set, it is added
to the new multipoint query with its grade from the relevant set. See
[1] and naive.C for further details on the construction of the refined
query (a sketch of this step is given after this discussion). The goal
is to refine the query so that the retrieved set gets as close as
possible to the relevant set over a small number of refinement
iterations.

In the output above, the recall quantifies the ``graded intersection
of the retrieved and relevant sets'', i.e., the higher the recall, the
closer the retrieved set is to the relevant set (see the code for
details). As the output shows, the retrieved set gets closer and
closer to the relevant set in successive iterations (which means the
refinement is working :)). The program also prints the average number
of (random) disk accesses performed by the hybrid tree (as well as the
CPU time) to return the top RETRIEVED (set to 100) answers in each
iteration. The program also outputs precision-recall graphs, one for
each iteration, in the result directory.
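To make the graded-intersection step concrete, here is a minimal
sketch of how the refined multipoint query and a graded recall value
could be computed in one iteration. The identifiers and the exact
grade values (1.0 for the top 10, 0.5 for the next 10, and so on) are
assumptions made for illustration; see naive.C for the actual grading
and bookkeeping.

    /* Illustrative sketch only -- the real logic is in naive.C. */
    #include <map>
    #include <vector>

    // Grade assigned to each object in the relevant set (ground truth):
    // objects ranked 1-10 get the highest grade, 11-20 a lower one, etc.
    double gradeOf(int rank)              // 0-based rank in the relevant set
    {
        return 1.0 / (1 + rank / 10);     // 1.0, 0.5, 0.333, ... per block of 10
    }

    // Build the graded relevant set from the RELEVANT (= 50) nearest
    // neighbors (by L1 distance) of the ideal multipoint query, in rank order.
    std::map<int,double> buildRelevantSet(const std::vector<int> &rankedIds)
    {
        std::map<int,double> rel;
        for (size_t r = 0; r < rankedIds.size(); r++)
            rel[rankedIds[r]] = gradeOf((int) r);
        return rel;
    }

    // One feedback iteration: intersect the top RETRIEVED (= 100) answers
    // with the relevant set.  Every object in both sets is fed back into the
    // refined multipoint query with its grade from the relevant set, and the
    // graded recall is the fraction of the total grade mass retrieved.
    double gradedIntersection(const std::map<int,double> &relevant,
                              const std::vector<int>     &retrieved,
                              std::map<int,double>       &refinedQuery /* out */)
    {
        double hit = 0.0, total = 0.0;
        std::map<int,double>::const_iterator it;
        for (it = relevant.begin(); it != relevant.end(); ++it)
            total += it->second;                      // best achievable score
        for (size_t i = 0; i < retrieved.size(); i++) {
            it = relevant.find(retrieved[i]);
            if (it != relevant.end()) {               // retrieved AND relevant
                refinedQuery[it->first] = it->second; // keep ground-truth grade
                hit += it->second;
            }
        }
        return hit / total;                           // graded recall in [0,1]
    }

The refined multipoint query produced this way is what the next
iteration evaluates; as it picks up more of the ground truth, the
recall values reported above increase across iterations.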
To run the full reconstruction approach, type:

    Refine_FR CH.16d.Query model CH.dump

where model = 1 for query expansion
            = 2 for query point movement
            = 3 for query reweighting
            = 4 for both query point movement and query reweighting

IMPORTANT: BEFORE YOU RUN THE ABOVE COMMAND, CREATE A SUBDIRECTORY
CALLED result UNDER YOUR CURRENT DIRECTORY.

The above command will generate the following output:

    Query Expansion
    Number of empty nodes = 0
    Loaded 1879 nodes from disk
    Loaded 68041 objects from disk
    Starting query
    Query 0
    Query 1
    Query 2
    Query 3
    Query 4
    Query 5
    Query 6
    Query 7
    Query 8
    Query 9
    Query 10
    Query 11
    Query 12
    Query 13
    . . .
    Query 86
    Query 87
    Query 88
    Query 89
    Query 90
    Query 91
    Query 92
    Query 93
    Query 94
    Query 95
    Query 96
    Query 97
    Query 98
    Query 99
    Query file: CH.16d.Query
    Data file: CH.dump
    Query Expansion Hist#
    Computing Precision-Recall Graph
    Iteration 0 recall = 0.420667 DiskAcc 317.05 CPU time 0.215482
    Iteration 1 recall = 0.660191 DiskAcc 45.78 CPU time 0.524043
    Iteration 2 recall = 0.714 DiskAcc 17.77 CPU time 0.612719
    Iteration 3 recall = 0.734667 DiskAcc 8.71 CPU time 0.735577
    Iteration 4 recall = 0.748952 DiskAcc 5.57 CPU time 0.793892
    Iteration 5 recall = 0.752952 DiskAcc 3.22 CPU time 0.819333
    done

Note that the only parts of the output that differ from the output
generated by Refine_naive are the disk accesses and the CPU time. The
recall remains the same because both approaches generate exactly the
same retrieved set in each iteration (since we are using the same
queries and the same refinement model). The disk accesses of the FR
approach fall drastically with each iteration, i.e., you can perform
the later iterations of feedback at almost zero cost. That is exactly
the benefit of using the FR approach instead of the naive approach.
See [2] for more details.

Like the naive approach, the program will also generate a
precision-recall graph for each iteration in the result directory.
These graphs are identical to the graphs generated by the naive
approach (because the retrieved sets are the same).

References
==========

[1] K. Porkaew, K. Chakrabarti and S. Mehrotra, "Query Refinement for
    Multimedia Similarity Retrieval in MARS", Proceedings of the 1999
    ACM Multimedia Conference, November 1999.

[2] K. Chakrabarti, K. Porkaew, M. Ortega and S. Mehrotra, "Evaluating
    Refined Queries in Top-k Retrieval Systems", submitted for
    publication, July 2000. Available online at
    http://www-db.ics.uci.edu/pages/publications/index.shtml#TR-MARS-00-05

===================================================================
Contact kaushik@ics.uci.edu in case of any problems.