# clj-ml A machine learning library for Clojure built on top of Weka and friends. ## Installation ### Installing from Clojars [cc.artifice/clj-ml "0.5.0-SNAPSHOT"] ### Installing from Maven (add Clojars repository) cc.artifice clj-ml 0.4.0 ## Supported algorithms * Filters * Discretization (supervised, unsupervised, PKI) * Nominal to binary (supervised, unsupervised) * Numeric to nominal * String to word vector * Attribute manipulation (reorder, add, remove range, remove percentage, etc.) * Resample (supervised, unsupervised) * Replace missing values with mean (numeric attributes) or mode (nominal attributes) * Classifiers * k-Nearest neighbor * Decision trees: C4.5/J4.8, Boosted stump, Random forest, Rotation forest, M5P * Naive Bayes * Multilayer perceptrons * Support vector machines (grid-based training), SMO, Spegasos * Regression * Linear * Logistic * Pace * Additive gradient boosting * Clusterers * k-Means * Cobweb * Expectation-maximization ## Usage API documenation can be found [here](http://clj-ml.artifice.cc/doc/index.html). ### I/O of data ```clojure user> (use 'clj-ml.io) nil user> (def ds (load-instances :arff "file:///home/josh/git/clj-ml/iris.arff")) #'user/ds user> ds # (def ds (load-instances :arff "http://repository.seasr.org/Datasets/UCI/arff/iris.arff")) #'user/ds user> (save-instances :csv "iris.csv" ds) nil user> (println (slurp "iris.csv")) sepallength,sepalwidth,petallength,petalwidth,class 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa 5,3.4,1.5,0.2,Iris-setosa 4.4,2.9,1.4,0.2,Iris-setosa 4.9,3.1,1.5,0.1,Iris-setosa 5.4,3.7,1.5,0.2,Iris-setosa ... user> (def ds (load-instances :csv "file:///home/josh/git/clj-ml/iris.csv")) #'user/ds user> ds # (use 'clj-ml.data) nil user> (def ds (make-dataset "my-name" [:length :width {:style nil} {:kind [:good :bad]}] [[12 24 "longish" :good] [8 5 "shortish" :bad]])) #'user/ds user> ds # user> (dataset-seq ds) (# #) user> (map instance-to-map (dataset-seq ds)) ({:kind :good, :style "longish", :width 24.0, :length 12.0} {:kind :bad, :style "shortish", :width 5.0, :length 8.0}) user> (map instance-to-vector (dataset-seq ds)) ([12.0 24.0 "longish" :good] [8.0 5.0 "shortish" :bad]) ``` ### Filtering datasets ```clojure user> (use 'clj-ml.filters 'clj-ml.io) nil user> (def ds (load-instances :csv "file:///home/josh/git/clj-ml/iris.csv")) #'user/ds user> (def discretize (make-filter :unsupervised-discretize {:dataset-format ds :attributes [:sepallength :petallength]})) #'user/discretize user> (def filtered-ds (filter-apply discretize ds)) #'user/filtered-ds user> (map instance-to-map (dataset-seq filtered-ds)) ({:class :Iris-setosa, :petalwidth 0.2, :petallength :'(-inf-1.59]', :sepalwidth 3.5, :sepallength :'(5.02-5.38]'} {:class :Iris-setosa, :petalwidth 0.2, :petallength :'(-inf-1.59]', :sepalwidth 3.0, :sepallength :'(4.66-5.02]'} {:class :Iris-setosa, :petalwidth 0.2, :petallength :'(-inf-1.59]', :sepalwidth 3.2, :sepallength :'(4.66-5.02]'} {:class :Iris-setosa, :petalwidth 0.2, :petallength :'(-inf-1.59]', :sepalwidth 3.1, :sepallength :'(-inf-4.66]'} {:class :Iris-setosa, :petalwidth 0.2, :petallength :'(-inf-1.59]', :sepalwidth 3.6, :sepallength :'(4.66-5.02]'} ...) ;; the petallength and sepallength attributes are now nominal ``` Equivalently, ```clojure user> (def filtered-ds (->> "file:///home/josh/git/clj-ml/iris.csv" (load-instances :csv) (make-apply-filter :unsupervised-discretize {:attributes [:sepallength :petallength]}))) ``` ### Using classifiers ```clojure user> (use 'clj-ml.classifiers 'clj-ml.data 'clj-ml.utils) nil user> (def ds (-> (load-instances :arff "file:///home/josh/git/clj-ml/iris.arff") (dataset-set-class :class))) #'user/ds user> (def classifier (-> (make-classifier :decision-tree :c45) (classifier-train ds))) #'user/classifier user> (def instance (-> (first (dataset-seq ds)) (instance-set-class-missing))) user> (classifier-classify classifier instance) :Iris-setosa ``` Evaluation: ```clojure user> (def evaluation (classifier-evaluate classifier :cross-validation ds 10)) #'user/evaluation user> (clojure.pprint/pprint (dissoc evaluation :summary :confusion-matrix)) {:incorrect 7.0, :root-relative-squared-error 36.693518966642074, :sf-entropy-gain -4076.3670930399717, :recall {:Iris-setosa 0.9795918367346939, :Iris-versicolor 0.94, :Iris-virginica 0.94}, :kb-information 217.7935138195151, :kb-relative-information 13741.240800360849, :false-positive-rate {:Iris-setosa 0.0, :Iris-versicolor 0.04040404040404041, :Iris-virginica 0.030303030303030304}, :percentage-correct 95.30201342281879, :roc-area {:Iris-setosa 0.984845423317842, :Iris-versicolor 0.9456, :Iris-virginica 0.9496}, :kb-mean-information 1.4617014350303028, :percentage-unclassified 0.0, :percentage-incorrect 4.697986577181208, :root-mean-squared-error 0.17297908222448935, :unclassified 0.0, :correlation-coefficient {:nan "Can't compute correlation coefficient: class is nominal!"}, :correct 142.0, :sf-mean-entropy-gain -27.358168409664238, :mean-absolute-error 0.04083212821368881, :relative-absolute-error 9.187228848079984, :error-rate 0.04697986577181208, :kappa 0.9295222650179066, :f-measure {:Iris-setosa 0.9896907216494846, :Iris-versicolor 0.9306930693069307, :Iris-virginica 0.94}, :false-negative-rate {:Iris-setosa 0.02040816326530612, :Iris-versicolor 0.06, :Iris-virginica 0.06}, :evaluation-object #, :average-cost 0.0, :precision {:Iris-setosa 1.0, :Iris-versicolor 0.9215686274509803, :Iris-virginica 0.94}} user> (println (:summary evaluation)) Correctly Classified Instances 142 95.302 % Incorrectly Classified Instances 7 4.698 % Kappa statistic 0.9295 Mean absolute error 0.0408 Root mean squared error 0.173 Relative absolute error 9.1872 % Root relative squared error 36.6935 % Total Number of Instances 149 Ignored Class Unknown Instances 1 nil user> (println (:confusion-matrix evaluation)) === Confusion Matrix === a b c <-- classified as 48 1 0 | a = Iris-setosa 0 47 3 | b = Iris-versicolor 0 3 47 | c = Iris-virginica nil ``` Saving and restoring (trained) classifiers: ```clojure user> (serialize-to-file classifier "my-classifier.bin") "my-classifier.bin" user> (def classifier2 (deserialize-from-file "my-classifier.bin")) #'user/classifier2 user> (classifier-classify classifier2 instance) :Iris-setosa ``` Text document handling: ```clojure user> (def docs [{:id 10 :title "Document title 1" :fulltext "This is the fulltext..." :has-class? false} {:id 11 :title "Another document title" :fulltext "Some more \"fulltext\"; rabbit artificial machine bananas" :has-class? true}]) #'user/docs user> (docs-to-dataset docs "bananas-model" "my-models" :stemmer true :lowercase false) # user> ``` Words appearing in the dataset will only be those appearing in the documents (or a subset; by default, the most common 1000 words). This presents a problem when new documents are loaded and used in a classifier trained on other documents. The classifier will not know how to handle word attributes that were not present in the training set. The `docs-to-dataset` function provides the ability to save the training documents dataset and "filter" the testing documents through this dataset to ensure the same word attributes are extracted for both sets. The following example shows that the words "foo, bar, baz, quux" are ignored in the new (testing) documents, and all the original attributes in the training dataset are retained. ```clojure user> (docs-to-dataset docs "Topic" "Sports" 1 "/tmp" :stemmer true :lowercase false :training true) # user> (def docs2 [{:title "Document title 1 foo bar" :fulltext "baz rabbit quux" :terms {"Topic" ["Sports"]}}]) #'user/docs2 user> (docs-to-dataset docs2 "Topic" "Sports" 1 "/tmp" :stemmer true :lowercase false :testing true) # user> ``` ### Using clusterers ```clojure user> (use 'clj-ml.clusterers) nil user> (def ds (-> (load-instances :arff "file:///home/josh/git/clj-ml/iris.arff") (dataset-remove-attribute-at 4))) #'user/ds user> ds # (def clusterer (make-clusterer :k-means {:number-clusters 3})) #'user/clusterer user> (clusterer-build clusterer ds) nil user> clusterer # user> (def clustered-ds (clusterer-cluster clusterer ds)) #'user/clustered-ds user> clustered-ds # (def homes (make-dataset "homes" [:house-size :lot-size :bedrooms :granite :bathroom :sellingPrice] [[3529, 9191, 6, 0, 0, 205000] [3247, 10061, 5, 1, 1, 224900] [4032, 10150, 5, 0, 1, 197900] [2397, 14156, 4, 1, 0,189900] [2200, 9600, 4, 0, 1, 195000] [3536, 19994, 6, 1, 1,325000] [2983, 9365, 5, 0, 1, 230000]])) #'user/homes user> (def homes (dataset-set-class homes :sellingPrice)) #'user/homes user> homes # user> (def reg (classifier-train (make-classifier :regression :linear) homes)) #'user/reg user> reg # user> user> (classifier-predict-numeric reg (make-instance homes [3198, 9669, 5, 1, 1, nil])) 219328.35717359098 ``` ## Example: Predicting survival on the Titanic https://www.kaggle.com/c/titanic-gettingStarted First globally replace all double quoted strings `""foo""` with backslash quoted strings: `\"foo\"`. Weka does not handle the former. ```clojure user> (def titanicds (load-instances :csv "file:///home/josh/git/clj-ml/titanic-train.csv")) user> titanicds # #'user/titanicds user> (def titanicds (dataset-set-class titanicds :Survived)) #'user/titanicds user> (dataset-class-index titanicds) 1 user> (def titanicds (make-apply-filter :numeric-to-nominal {:attributes [:Survived]} titanicds)) #'user/titanicds user> titanicds # user> (def titanicds (make-apply-filter :replace-missing-values {} titanicds)) user> (def titanicds (make-apply-filter :remove-attributes {:attributes [:PassengerId :Name :Ticket :Cabin]} titanicds)) #'user/titanicds user> titanicds # user> (dataset-class-index titanicds) 0 user> (dataset-class-name titanicds) :Survived user> (def evaluation (classifier-evaluate (make-classifier :decision-tree :random-forest) :cross-validation titanicds 10)) #'user/evaluation user> (println (:summary evaluation)) Correctly Classified Instances 727 81.5937 % Incorrectly Classified Instances 164 18.4063 % Kappa statistic 0.6039 Mean absolute error 0.2409 Root mean squared error 0.3819 Relative absolute error 50.9302 % Root relative squared error 78.532 % Total Number of Instances 891 nil user> (println (:confusion-matrix evaluation)) === Confusion Matrix === a b <-- classified as 483 66 | a = 0 98 244 | b = 1 nil ``` Ok, looks good, let's try training on the full training data and testing on the testing data. ```clojure user> (def titanic-testds (load-instances :csv "file:///home/josh/git/clj-ml/titanic-test.csv")) user> (def titanic-test-passids (map (comp int :PassengerId) (dataset-as-maps titanic-testds))) #'user/titanic-test-passids user> titanic-test-passids (892 893 894 895 896 897 898 899 900 ...) user> (def titanic-testds (->> titanic-testds (make-apply-filter :remove-attributes {:attributes [:PassengerId :Name :Ticket :Cabin]}) (make-apply-filter :replace-missing-values {}) (make-apply-filter :add-attribute {:type :nominal :name :Survived :column 0 :labels ["0" "1"]}))) #'user/titanic-testds user> (def titanic-testds (dataset-set-class titanic-testds :Survived)) #'user/titanic-testds user> titanic-testds # user> (def classifier (classifier-train (make-classifier :decision-tree :random-forest) titanicds)) #'user/classifier user> (def preds (for [instance (dataset-seq titanic-testds)] (name (classifier-classify classifier instance)))) #'user/preds user> preds ("0" "1" "0" "0" "0" "0" "1" "0" "0" "0" ...) #'user/preds user> (spit "titanic-predictions.csv" (clojure.string/join "\n" (cons "Survived,PassengerId" (map (fn [c1 c2] (format "%s,%d" c1 c2)) preds titanic-test-passids)))) nil user> (println (slurp "titanic-predictions.csv")) Survived,PassengerId 0,892 1,893 0,894 0,895 0,896 0,897 1,898 0,899 0,900 0,901 0,902 ... ``` ## Thanks YourKit! YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler. ## License MIT License ## Authors * 2010: [Antonio Garrote](https://github.com/antoniogarrote) * 2010-2013: [Ben Mabey](https://github.com/bmabey) * 2013: [Joshua Eckroth](https://github.com/joshuaeckroth)