CFMDS

CFMDS
(CUDA-based Fast Multidimensional Scaling)


Figure 1 : An example plot of CFMDS result (drawn using MATLAB)	Figure 2 : Running time comparision of CFMDS and general MDS. (The vertical axis is in log-scale and the time unit is seconds.)

1. Introduction

Multidimensional scaling (MDS) is a widely used method for dimensionality reduction. MDS has mainly been applied to data visualization and feature selection. Among various MDS methods, the classical MDS is not quite appropriate for quick analysis of data having large numbers of objects, on normal desktop computers due to its computational complexity. In precise, it usually takes several hours to project datasets consisting of tens of thousands of data points by the classical MDS. We propose an efficient approximation algorithm for the classical MDS based on divide-and-conquer and CUDA.
CUDA (Compute Unified Device Architecture, http://www.nvidia.com/object/cuda_home_new.html) is NVIDIA's parallel computing architecture exploiting the power of the GPU (graphics processing unit). The principle of divide-and-conquer is applied for circumventing the small memory problem of usual graphics cards.
Our application software, i.e., CFMDS is an expeditious solution for analysis and visualization of data containing several thousands of objects, which is hundreds times faster than general MDS softwares.

2. CFMDS (CUDA-based Fast Multidimensional Scaling)

CFMDS (CUDA-based Fast Multidimensional Scaling) is an implementaton of the classical MDS based on Euclidean distances between objects, using CUDA and divide-and-conquer. CFMDS has two running modes : the classical MDS based on CUDA (Mode 1) and the approximate classical MDS based on divide-and-conquer for large data (Mode 2).
If input data size is bigger than the maximum size of GPU's free memory, CFMDS uses a divide-and-conquer approach. The number and size of data partitions are automatically decided by the size of available memory. CFMDS with divide-and-conquer samples data points from each partition. Two ways of sampling are Random and Maxmin. Random denotes usual random sampling without replacement. In the Maxmin approach, data points are chosen one at a time, and each new point maximizes, over all unused data points, the minimum distance to any of the previously-sampled points. The Random approach works quite well in practice. Maxmin has the advantage of being controllable, which may be useful in some contexts where data manifold is skewed.
If input data size is smaller than the maximum size of GPU's free memory, CFMDS apply the classical MDS algorithm. This mode is not based on divide-and-conquer and random sampling, so it produces the same results as the original classical MDS.

3. System Requirements

Operating Systems	: Windows XP 32 & 64-bit Windows Vista 32 & 64-bit Windows 7 32-bit Ubuntu Linux 9.04 Red Hat Enterprise Linux 5.3 / 4.7 Fedora 11
Hardware	: NVIDIA's GPU which supports CUDA. (You can check if your GPU supports CUDA or not. Click here.)
CUDA version	: CUDA 2.3 toolkit. The latest version of CUDA driver. (You can download the toolkit & driver here.)
	(!) We do not support CUDA 3.0 toolkit yet. But our software will be updated once the required information is available.
Others	: The latest version of CULA basic libraries. (You can download them here.)

4. How to Use

All parameters has to be set when running the application. The input data file format should be n x n dissimilarity matrix (Tab delimited. Each row is separated by '\n'. See the sample data far below). If input data includes header information, then the header information must be located on the first row and column. Our applicaiton generates four output files, that is, result of MDS, indexes of sampled data points (for Mode 2), logs of CFMDS process, and eigenvalues (optional). The format of each output files is based on column-major order.
CFMDS has seven parameters which are described in Table 1.

Table 1 : Parameters description

Parameter		Description	Example

-s	Size	The number of data points.	3000, 4000, etc.
-d	Dimension	The target dimension which is lower than the original data dimension.	2, 3, etc.
-h	Header	Input data has title column and row (Yes or No).	Y, y, N, n (default : No)
-e	Eigenvalue calculation	Calculate eigenvalues only (Yes or No).	Y, y, N, n (default : No)
-m	Method	Sampling strategy for divide-and-conquer mode. (Mode 2)	R, r, M, m (defualt : Random)
-o	Output	Prefix of output files.	home/usr/result/abc_test, irisMDS, etc.
-i	Input	Input file name.	home/usr/3000.out, c:\data\iris.txt, etc.

5. License

CFMDS is distributed under the GNU General Public License version 2.
A copy of that license should be distributed with every copy of CFMDS or derivatives of CFMDS.

For complete information about the GNU GPL please visit the Free Software Foundation.

6. Downloads

Sample data
Sample input file 1 (n x n dissimilarity matrix)	Click
Sample input file 2 (Before Euclidean distance calculation)	Click
Result of SVD (eigenvalues) for input file 1	Click
Result of CFMDS for input file 1	Click

Applications

	Windows version	Linux version

7. Contact Information

Specific questions regarding the use of CFMDS may be sent to Kyu-Baek Hwang and Sungin Park.

Bug reporting on CFMDS may be sent to Sungin Park.
Questions about CUDA and CULA setting or compile for CFMDS may be sent to Sungin Park.

General inquiries may be addressed to Kyu-Baek Hwang.