# clj-ml

A machine learning library for Clojure built on top of Weka and friends.

## Installation

To install the library, you must first install Leiningen.

### Installing from source

Clone the project with git, then run:

    $ lein deps
    $ lein javac
    $ lein uberjar

### Installing from Clojars

    [cc.artifice/clj-ml "0.3.5"]

### Installing from Maven (add Clojars repository)

    <dependency>
      <groupId>cc.artifice</groupId>
      <artifactId>clj-ml</artifactId>
      <version>0.3.4</version>
    </dependency>

## Supported algorithms

* Filters
    * supervised discretize
    * unsupervised discretize
    * supervised nominal to binary
    * unsupervised nominal to binary
    * string to word vector
    * reorder attributes
    * resample (supervised, unsupervised)
* Classifiers
    * C4.5 (J4.8)
    * naive Bayes
    * multilayer perceptron
    * support vector machines
* Clusterers
    * k-means

## Usage

API documentation can be found [here](http://antoniogarrote.github.com/clj-ml/index.html).

### I/O of data

    REPL>(use 'clj-ml.io)

    REPL>; Loading data from an ARFF file; XRFF and CSV are also supported
    REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

    REPL>; Saving data in a different format
    REPL>(save-instances :csv "file:///Users/antonio.garrote/Desktop/iris.csv" ds)

### Working with datasets

    REPL>(use 'clj-ml.data)

    REPL>; Defining a dataset
    REPL>(def ds (make-dataset "name" [:length :width {:kind [:good :bad]}] [ [12 34 :good] [24 53 :bad] ]))
    REPL>ds
    #

    REPL>; Using datasets like sequences
    REPL>(dataset-seq ds)
    (# #)

    REPL>; Transforming instances into maps or vectors
    REPL>(instance-to-map (first (dataset-seq ds)))
    {:kind :good, :width 34.0, :length 12.0}
    REPL>(instance-to-vector (dataset-at ds 0))
    [12.0 34.0 :good]

### Filtering datasets

    REPL>(use '(clj-ml filters io))
    REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

    REPL>; Discretizing numeric attributes using an unsupervised filter
    REPL>(def discretize (make-filter :unsupervised-discretize {:dataset-format ds :attributes [:sepallength :petallength]}))
    REPL>(def filtered-ds (filter-apply discretize ds))
    REPL>; You can also use the filter's fn directly, which will create and apply the filter:
    REPL>(def filtered-ds (unsupervised-discretize ds {:attributes [:sepallength :petallength]}))

    REPL>; The above way lends itself to the -> macro and is useful when using multiple filters.

    REPL>; The equivalent operation can be done with the ->> macro and the make-apply-filter fn:
    REPL>(def filtered-ds (->> "file:///home/kiran/Downloads/weka/weka-3-6-9/data/iris.arff"
                               (load-instances :arff)
                               (make-apply-filter :unsupervised-discretize {:attributes [0 2]})))

### Using classifiers

    REPL>(use 'clj-ml.classifiers)

    REPL>; Building a classifier using a C4.5 decision tree
    REPL>(def classifier (make-classifier :decision-tree :c45))

    REPL>; We set the class attribute for the loaded dataset
    REPL>(dataset-set-class ds 4)

    REPL>; Training the classifier
    REPL>(classifier-train classifier ds)
    # 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)

    Number of Leaves  : 5

    Size of the tree : 9

    REPL>; We evaluate the classifier using a test dataset
    REPL>; The last parameter should be a different test dataset; here we are using the same one
    REPL>(def evaluation (classifier-evaluate classifier :dataset ds ds))

    === Confusion Matrix ===

      a  b  c   <-- classified as
     50  0  0 |  a = Iris-setosa
      0 49  1 |  b = Iris-versicolor
      0  2 48 |  c = Iris-virginica

    === Summary ===

    Correctly Classified Instances         147               98      %
    Incorrectly Classified Instances         3                2      %
    Kappa statistic                          0.97
    Mean absolute error                      0.0233
    Root mean squared error                  0.108
    Relative absolute error                  5.2482 %
    Root relative squared error             22.9089 %
    Total Number of Instances              150

    REPL>(:kappa evaluation)
    0.97
    REPL>(:root-mean-squared-error evaluation)
    0.10799370769526968
    REPL>(:precision evaluation)
    {:Iris-setosa 1.0, :Iris-versicolor 0.9607843137254902, :Iris-virginica 0.9795918367346939}

    REPL>; The classifier can also be evaluated using cross-validation
    REPL>(classifier-evaluate classifier :cross-validation ds 10)

    === Confusion Matrix ===

      a  b  c   <-- classified as
     49  1  0 |  a = Iris-setosa
      0 47  3 |  b = Iris-versicolor
      0  4 46 |  c = Iris-virginica

    === Summary ===

    Correctly Classified Instances         142               94.6667 %
    Incorrectly Classified Instances         8                5.3333 %
    Kappa statistic                          0.92
    Mean absolute error                      0.0452
    Root mean squared error                  0.1892
    Relative absolute error                 10.1707 %
    Root relative squared error             40.1278 %
    Total Number of Instances              150

    REPL>; A trained classifier can be used to classify new instances
    REPL>(def to-classify (make-instance ds {:class :Iris-versicolor, :petalwidth 0.2, :petallength 1.4, :sepalwidth 3.5, :sepallength 5.1}))
    REPL>(classifier-classify classifier to-classify)
    0.0
    REPL>(classifier-label classifier to-classify)
    #

    REPL>; Classifiers can be saved and restored later
    REPL>(use 'clj-ml.utils)
    REPL>(serialize-to-file classifier "/Users/antonio.garrote/Desktop/classifier.bin")

### Using clusterers

    REPL>(use 'clj-ml.clusterers)

    REPL>; We build a clusterer using k-means and three clusters
    REPL>(def kmeans (make-clusterer :k-means {:number-clusters 3}))

    REPL>; We need to remove the class from the dataset to
    REPL>; use this clustering algorithm
    REPL>(dataset-remove-class ds)

    REPL>; We build the clusters
    REPL>(clusterer-build kmeans ds)
    REPL>kmeans
    #

YourKit Java Profiler and YourKit .NET Profiler.

## License

MIT License