In the Vector Space Model (VSM), a set of documents and a query are represented as vectors in an n-dimensional space, where each dimension corresponds to a term of the dictionary. The similarity between the query and each document is computed with the cosine similarity formula. The similarities are then sorted in descending order.
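The ranking step described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function names, the three-term dictionary, and the sample vectors are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as having similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical dictionary of three terms; each position is one term's weight.
query = [1, 1, 0]
docs = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
ranking = rank_documents(query, docs)
```

Here the second document shares both query terms and is ranked first, while the third shares none and is ranked last.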
How does this program work?
1. Insert terms
- Manually
- From file
  - txt
  - xml
- No
- Yes
You have to fill in one query and between 2 and 10 documents. To add more documents, change the value of the field "Number of docs".
Only the .txt and .xml file formats are accepted. For convenience, you can download an example file of the type you want, to see the template.
A .txt file must contain all the documents you want to insert, each beginning with "D: ", and the query, beginning with "Q: ". There must be exactly one query and at least two documents. Each document and the query must be on a separate line!
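As a rough sketch of how a file in this layout could be read, the snippet below applies the rules just stated (one "Q: " line, at least two "D: " lines, one entry per line). The function name `parse_collection` and the sample text are hypothetical. How the program itself treats blank or unrecognized lines is not documented here, so ignoring them is an assumption.

```python
def parse_collection(text):
    """Split plain-text input into (documents, query) using the D:/Q: prefixes."""
    documents = []
    query = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("D: "):
            documents.append(line[3:].strip())
        elif line.startswith("Q: "):
            if query is not None:
                raise ValueError("the file must contain only one query")
            query = line[3:].strip()
        # Assumption: any other line (including blank lines) is ignored.
    if query is None:
        raise ValueError("the file must contain a line starting with 'Q: '")
    if len(documents) < 2:
        raise ValueError("the file must contain at least two documents")
    return documents, query

# Hypothetical file content
sample = "D: new york times\nD: new york post\nQ: new times\n"
documents, query = parse_collection(sample)
```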
An .xml file must contain not only all the documents and the query, but also: the variation of term frequency (one for the documents and one for the query), the variation of inverse document frequency (one for the documents and one for the query), the variation of normalization (one for the documents and one for the query), and the distance method that the algorithm will use.
After you have filled in everything, click on the tab '2. Insert parameters'.
2. Insert parameters
Some fields in this step are skipped if you uploaded an xml file!
Here you set all the parameters of the algorithm, except for the documents and the query, which you provided in the previous step.
Field 'tf for documents' is one of the variations of term frequency for documents and it can be natural (n), logarithm (l) or augmented (a).
Field 'tf for query' is one of the variations of term frequency for query and it can be natural (n), logarithm (l) or augmented (a).
Field 'idf for documents' is one of the variations of inverse document frequency for documents and it can be no (n) or idf (t).
Field 'idf for query' is one of the variations of inverse document frequency for query and it can be no (n) or idf (t).
Field 'normalization for documents' is one of the variations of normalization for documents and it can be none (n) or cosine (c).
Field 'normalization for query' is one of the variations of normalization for query and it can be none (n) or cosine (c).
Field 'Distance/Similarity method' selects the method that will be used to calculate the distance/similarity between the query and the documents.
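To make the letter codes above concrete, here is a small sketch of how the tf, idf and normalization variations are commonly defined in the textbook tf-idf notation. The exact formulas this program uses (the logarithm base and the handling of zero counts in particular) are not stated in this document, so base-10 logarithms and a weight of 0 for absent terms are assumptions. All function names and the three-letter `scheme` string are hypothetical.

```python
import math

def tf_weight(tf, max_tf, variant):
    """Term frequency: natural (n), logarithm (l) or augmented (a)."""
    if tf == 0:
        return 0.0  # assumption: absent terms always get weight 0
    if variant == "n":
        return float(tf)
    if variant == "l":
        return 1.0 + math.log10(tf)
    if variant == "a":
        return 0.5 + 0.5 * tf / max_tf
    raise ValueError(f"unknown tf variant: {variant!r}")

def idf_weight(n_docs, df, variant):
    """Inverse document frequency: no (n) or idf (t)."""
    if variant == "n":
        return 1.0
    if variant == "t":
        return math.log10(n_docs / df) if df > 0 else 0.0
    raise ValueError(f"unknown idf variant: {variant!r}")

def normalize(weights, variant):
    """Normalization: none (n) or cosine (c), i.e. division by the vector length."""
    if variant == "n":
        return list(weights)
    if variant == "c":
        length = math.sqrt(sum(w * w for w in weights))
        return [w / length for w in weights] if length > 0 else list(weights)
    raise ValueError(f"unknown normalization variant: {variant!r}")

def weight_vector(term_counts, doc_freqs, n_docs, scheme):
    """Build a weighted vector; scheme is tf + idf + normalization, e.g. 'ltc'."""
    tf_v, idf_v, norm_v = scheme
    max_tf = max(term_counts, default=0)
    raw = [
        tf_weight(tf, max_tf, tf_v) * idf_weight(n_docs, df, idf_v)
        for tf, df in zip(term_counts, doc_freqs)
    ]
    return normalize(raw, norm_v)

# Hypothetical example: three dictionary terms in a collection of 10 documents.
counts = [2, 0, 1]        # occurrences of each term in one document
doc_freqs = [1, 5, 10]    # number of documents containing each term
plain = weight_vector(counts, doc_freqs, 10, "nnn")
weighted = weight_vector(counts, doc_freqs, 10, "ltc")
```

With "nnn" the vector is just the raw counts. With "ltc" the third term, which appears in every document, gets an idf of 0 and drops out of the vector.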
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The following also applies to XML files.
You are asked whether you want to use some other collection instead of the one you inserted.
If you click 'No' you have nothing else to do.
If you click 'Yes', you have to fill in the field 'N' with the number of documents in your collection and, for each term, the number of documents in which it appears.
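The two inputs requested here, 'N' and the per-term document counts, are exactly what an idf calculation needs. The sketch below shows one common definition, idf = log10(N / df). The logarithm base, the function name `idf_from_collection`, and the sample numbers are assumptions rather than details taken from the program.

```python
import math

def idf_from_collection(n_docs, doc_freqs):
    """Compute idf per term from collection size N and per-term document counts."""
    idf = {}
    for term, df in doc_freqs.items():
        if not 0 < df <= n_docs:
            raise ValueError(f"document count for {term!r} must be between 1 and N")
        idf[term] = math.log10(n_docs / df)
    return idf

# Hypothetical external collection of 1000 documents
idf = idf_from_collection(1000, {"car": 10, "the": 1000})
```

A term that appears in few documents ("car") gets a high idf, while a term that appears in every document ("the") gets an idf of 0.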
After you have filled in everything, click on the tab '3. Results'.
3. Results
Here you see the results: the similarity between the query and each document, sorted in descending order.