clj-ml
A machine learning library for Clojure built on top of Weka and friends.
Installation
In order to install the library you must first install Leiningen.
To install from source
git clone the project, then run:
$ lein deps
$ lein javac
$ lein uberjar
Installing from Clojars
[cc.artifice/clj-ml "0.3.4"]
Installing from Maven
(add Clojars repository)
<dependency>
<groupId>cc.artifice</groupId>
<artifactId>clj-ml</artifactId>
<version>0.3.4</version>
</dependency>
Supported algorithms
- Filters
  - supervised discretize
  - unsupervised discretize
  - supervised nominal to binary
  - unsupervised nominal to binary
  - string to word vector
  - reorder attributes
  - resample (supervised, unsupervised)
- Classifiers
  - C4.5 (J4.8)
  - naive Bayes
  - multilayer perceptron
- Clusterers
  - k-means
Usage
API documentation can be found here.
I/O of data
REPL>(use 'clj-ml.io)
REPL>; Loading data from an ARFF file; XRFF and CSV are also supported
REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))
REPL>; Saving data in a different format
REPL>(save-instances :csv "file:///Users/antonio.garrote/Desktop/iris.csv" ds)
Working with datasets
REPL>(use 'clj-ml.data)
REPL>; Defining a dataset
REPL>(def ds (make-dataset "name" [:length :width {:kind [:good :bad]}] [ [12 34 :good] [24 53 :bad] ]))
REPL>ds
#<ClojureInstances @relation name
@attribute length numeric
@attribute width numeric
@attribute kind {good,bad}
@data
12,34,good
24,53,bad>
REPL>; Using datasets like sequences
REPL>(dataset-seq ds)
(#<Instance 12,34,good> #<Instance 24,53,bad>)
REPL>; Transforming instances into maps or vectors
REPL>(instance-to-map (first (dataset-seq ds)))
{:kind :good, :width 34.0, :length 12.0}
REPL>(instance-to-vector (dataset-at ds 0))
[12.0 34.0 :good]
Filtering datasets
REPL>(use '(clj-ml filters io))
REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))
REPL>; Discretizing a numeric attribute using an unsupervised filter
REPL>(def discretize (make-filter :unsupervised-discretize {:dataset-format ds :attributes [:sepallength :petallength]}))
REPL>(def filtered-ds (filter-apply discretize ds))
REPL>; You can also use the filter's fn directly which will create and apply the filter:
REPL>(def filtered-ds (unsupervised-discretize ds {:attributes [:sepallength :petallength]}))
REPL>; The above way lends itself to the -> macro and is useful when using multiple filters.
REPL>; The equivalent operation can be done with the ->> macro and the make-apply-filter fn:
REPL>(def filtered-ds (->> "file:///Applications/weka-3-6-2/data/iris.arff"
                           (load-instances :arff)
                           (make-apply-filter :unsupervised-discretize {:attributes [0 2]})))
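The remark about the `->` macro can be illustrated with a short sketch. It assumes clj-ml is on the classpath and reuses only the `load-instances` and `unsupervised-discretize` functions already shown in this section. Discretizing the two attributes in separate steps is purely illustrative, chosen to show how several filter calls chain together.

```clojure
(require '[clj-ml.io :refer [load-instances]]
         '[clj-ml.filters :refer [unsupervised-discretize]])

;; Each filter fn takes the dataset as its first argument, so ->
;; threads the result of one step into the next.
;; Assumption: the same filter fn is applied twice only for illustration;
;; a single call with both attributes (as shown earlier) does the same job.
(def filtered-ds
  (-> (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff")
      (unsupervised-discretize {:attributes [:sepallength]})
      (unsupervised-discretize {:attributes [:petallength]})))
```

Because the dataset comes first in the filter fns and last in `make-apply-filter`, the former pair naturally with `->` and the latter with `->>`.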
Using classifiers
REPL>(use 'clj-ml.classifiers)
REPL>; Building a classifier using a C4.5 decision tree
REPL>(def classifier (make-classifier :decission-tree :c45))
REPL>; We set the class attribute for the loaded dataset
REPL>(dataset-set-class ds 4)
REPL>; Training the classifier
REPL>(classifier-train classifier ds)
#<J48 J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
REPL>; We evaluate the classifier using a test dataset
REPL>; The last parameter should be a separate test dataset; here we reuse the training data
REPL>(def evaluation (classifier-evaluate classifier :dataset ds ds))
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
=== Summary ===
Correctly Classified Instances 147 98 %
Incorrectly Classified Instances 3 2 %
Kappa statistic 0.97
Mean absolute error 0.0233
Root mean squared error 0.108
Relative absolute error 5.2482 %
Root relative squared error 22.9089 %
Total Number of Instances 150
REPL>(:kappa evaluation)
0.97
REPL>(:root-mean-squared-error evaluation)
0.10799370769526968
REPL>(:precision evaluation)
{:Iris-setosa 1.0, :Iris-versicolor 0.9607843137254902, :Iris-virginica 0.9795918367346939}
REPL>; The classifier can also be evaluated using cross-validation
REPL>(classifier-evaluate classifier :cross-validation ds 10)
=== Confusion Matrix ===
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 4 46 | c = Iris-virginica
=== Summary ===
Correctly Classified Instances 142 94.6667 %
Incorrectly Classified Instances 8 5.3333 %
Kappa statistic 0.92
Mean absolute error 0.0452
Root mean squared error 0.1892
Relative absolute error 10.1707 %
Root relative squared error 40.1278 %
Total Number of Instances 150
REPL>; A trained classifier can be used to classify new instances
REPL>(def to-classify (make-instance ds
{:class :Iris-versicolor,
:petalwidth 0.2,
:petallength 1.4,
:sepalwidth 3.5,
:sepallength 5.1}))
REPL>(classifier-classify classifier to-classify)
0.0
REPL>(classifier-label classifier to-classify)
#<Instance 5.1,3.5,1.4,0.2,Iris-setosa>
REPL>; The classifiers can be saved and restored later
REPL>(use 'clj-ml.utils)
REPL>(serialize-to-file classifier "/Users/antonio.garrote/Desktop/classifier.bin")
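To make the evaluation figures in this section easier to interpret, the following sketch recomputes accuracy and per-class precision from the first confusion matrix (the `:dataset` evaluation) using plain Clojure. It does not call clj-ml. The names `labels`, `confusion`, `accuracy`, and `precision` are illustrative and not part of the library.

```clojure
;; Confusion matrix copied from the :dataset evaluation output.
;; Rows are actual classes, columns are predicted classes.
(def labels [:Iris-setosa :Iris-versicolor :Iris-virginica])

(def confusion
  [[50  0  0]
   [ 0 49  1]
   [ 0  2 48]])

(defn accuracy
  "Fraction of instances on the diagonal (correctly classified)."
  [m]
  (let [correct (reduce + (map-indexed (fn [i row] (nth row i)) m))
        total   (reduce + (flatten m))]
    (double (/ correct total))))

(defn precision
  "Per-class precision: correct predictions for a class divided by
  all predictions of that class (the column sum)."
  [m labels]
  (into {}
        (map-indexed
         (fn [j label]
           (let [predicted (reduce + (map #(nth % j) m))]
             [label (double (/ (get-in m [j j]) predicted))]))
         labels)))

(accuracy confusion)          ; 147 correct out of 150 instances
(precision confusion labels)  ; map from class label to precision
```

The results agree with the 98 % of correctly classified instances and with the `:precision` map reported by the evaluation: for example, Iris-versicolor has 49 correct predictions out of 51 instances predicted as that class.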
Using clusterers
REPL>(use 'clj-ml.clusterers)
REPL>; We build a clusterer using k-means and three clusters
REPL>(def kmeans (make-clusterer :k-means {:number-clusters 3}))
REPL>; We need to remove the class from the dataset to
REPL>; use this clustering algorithm
REPL>(dataset-remove-class ds)
REPL>; We build the clusters
REPL>(clusterer-build kmeans ds)
REPL>kmeans
#<SimpleKMeans
kMeans
======
Number of iterations: 3
Within cluster sum of squared errors: 7.817456892309574
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2
(150) (50) (50) (50)
==================================================================================
sepallength 5.8433 5.936 5.006 6.588
sepalwidth 3.054 2.77 3.418 2.974
petallength 3.7587 4.26 1.464 5.552
petalwidth 1.1987 1.326 0.244 2.026
class Iris-setosa Iris-versicolor Iris-setosa Iris-virginica
Thanks YourKit!
YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.
License
MIT License