# clj-ml

A machine learning library for Clojure built on top of Weka and friends.

## Installation

To install the library, first install Leiningen. You should also download the
Weka 3.6.2 jar from the official Weka homepage. If Maven complains that it
cannot find Weka, follow its instructions to install the jar manually.

### To install from source

* git clone the project
* $ lein deps
* $ lein compile
* $ lein compile-java
* $ lein uberjar

### Installing from Clojars

    [clj-ml "0.0.3-SNAPSHOT"]

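With Leiningen, the dependency vector above goes into the `:dependencies` list of `project.clj`; Leiningen searches Clojars by default. A minimal sketch, assuming a hypothetical project named `my-app` and a Clojure version picked only for illustration:

```clojure
;; project.clj
;; `my-app`, its version string, and the Clojure version are placeholders
;; (assumptions for this example); only the clj-ml vector comes from above.
(defproject my-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [clj-ml "0.0.3-SNAPSHOT"]])
```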
### Installing from Maven

First add the Clojars repository to your POM, then declare the dependency:

    <dependency>
      <groupId>clj-ml</groupId>
      <artifactId>clj-ml</artifactId>
      <version>0.0.3-SNAPSHOT</version>
    </dependency>

## Supported algorithms

* Filters
  - supervised discretize
  - unsupervised discretize
  - supervised nominal to binary
  - unsupervised nominal to binary

* Classifiers
  - C4.5 (J4.8)
  - naive Bayes
  - multilayer perceptron

* Clusterers
  - k-means

## Usage

* I/O of data

        REPL>(use 'clj-ml.io)

        REPL>; Loading data from an ARFF file; XRFF and CSV are also supported
        REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

        REPL>; Saving data in a different format
        REPL>(save-instances :csv "file:///Users/antonio.garrote/Desktop/iris.csv" ds)

* Working with datasets

        REPL>(use 'clj-ml.data)

        REPL>; Defining a dataset
        REPL>(def ds (make-dataset "name" [:length :width {:kind [:good :bad]}] [ [12 34 :good] [24 53 :bad] ]))
        REPL>ds

        #<ClojureInstances @relation name

        @attribute length numeric
        @attribute width numeric
        @attribute kind {good,bad}

        @data
        12,34,good
        24,53,bad>

        REPL>; Using datasets like sequences
        REPL>(dataset-seq ds)

        (#<Instance 12,34,good> #<Instance 24,53,bad>)

        REPL>; Transforming instances into maps or vectors
        REPL>(instance-to-map (first (dataset-seq ds)))

        {:kind :good, :width 34.0, :length 12.0}

        REPL>(instance-to-vector (dataset-at ds 0))

        [12.0 34.0 :good]

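The printed dataset above is in ARFF notation: a relation name, one `@attribute` line per column, and the rows under `@data`. As a rough illustration of that mapping, the same `make-dataset` arguments can be rendered with plain Clojure. This is not clj-ml's implementation; `attribute-line`, `data-line`, and `arff-text` are hypothetical helpers written for this example.

```clojure
(require '[clojure.string :as str])

(defn attribute-line
  "Renders one attribute spec: a bare keyword is numeric, a single-entry
  map {attr [labels]} is nominal."
  [spec]
  (if (map? spec)
    (let [[attr labels] (first spec)]
      (str "@attribute " (name attr) " {" (str/join "," (map name labels)) "}"))
    (str "@attribute " (name spec) " numeric")))

(defn data-line
  "Renders one row, turning keywords into their bare names."
  [row]
  (str/join "," (map #(if (keyword? %) (name %) (str %)) row)))

(defn arff-text
  "Assembles the header and data sections into one ARFF string."
  [relation attributes rows]
  (str/join "\n"
            (concat [(str "@relation " relation) ""]
                    (map attribute-line attributes)
                    ["" "@data"]
                    (map data-line rows))))

(println (arff-text "name"
                    [:length :width {:kind [:good :bad]}]
                    [[12 34 :good] [24 53 :bad]]))
```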
* Filtering datasets

        REPL>(use 'clj-ml.filters)

        REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

        REPL>; Discretizing numeric attributes using an unsupervised filter
        REPL>(def discretize (make-filter :unsupervised-discretize {:dataset ds :attributes [0 2]}))

        REPL>(def filtered-ds (filter-process discretize ds))

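Discretizing replaces a numeric value with the interval (bin) it falls into. The sketch below shows equal-width binning in plain Clojure to illustrate the idea. It is not the Weka filter itself; `equal-width-bin`, `discretize-column`, and the choice of 3 bins are assumptions made for the example.

```clojure
(defn equal-width-bin
  "Returns the 0-based index of the bin that x falls into when the range
  [lo, hi] is cut into n-bins intervals of equal width."
  [lo hi n-bins x]
  (let [width (/ (- hi lo) (double n-bins))
        idx   (int (Math/floor (/ (- x lo) width)))]
    ;; Clamp so that x = hi lands in the last bin instead of one past it.
    (-> idx (max 0) (min (dec n-bins)))))

(defn discretize-column
  "Maps every value of a numeric column to its bin index, using the
  column's own minimum and maximum as the range."
  [n-bins values]
  (let [lo (apply min values)
        hi (apply max values)]
    (mapv #(equal-width-bin lo hi n-bins %) values)))

;; Range 0.0..6.0 in 3 bins of width 2.0.
(discretize-column 3 [0.0 1.0 2.0 3.0 4.0 6.0])
;; => [0 0 1 1 2 2]
```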
* Using classifiers

        REPL>(use 'clj-ml.classifiers)

        REPL>; Building a classifier using a C4.5 decision tree
        REPL>(def classifier (make-classifier :decission-tree :c45))

        REPL>; We set the class attribute for the loaded dataset
        REPL>(dataset-set-class ds 4)

        REPL>; Training the classifier
        REPL>(classifier-train classifier ds)

        #<J48 J48 pruned tree
        ------------------

        petalwidth <= 0.6: Iris-setosa (50.0)
        petalwidth > 0.6
        |   petalwidth <= 1.7
        |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
        |   |   petallength > 4.9
        |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
        |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
        |   petalwidth > 1.7: Iris-virginica (46.0/1.0)

        Number of Leaves : 5

        Size of the tree : 9

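The pruned tree printed above reads as a chain of nested threshold tests, with each leaf naming a class. To make that reading concrete, here is a hand transcription of the tree into a plain Clojure function. `j48-iris` is a hypothetical name; it only mirrors the printed output and does not call Weka or clj-ml.

```clojure
(defn j48-iris
  "Hand transcription of the pruned tree printed above. Takes a map with
  :petalwidth and :petallength and returns the predicted class keyword."
  [{:keys [petalwidth petallength]}]
  (cond
    (<= petalwidth 0.6)  :Iris-setosa
    (> petalwidth 1.7)   :Iris-virginica
    ;; From here on, 0.6 < petalwidth <= 1.7.
    (<= petallength 4.9) :Iris-versicolor
    (<= petalwidth 1.5)  :Iris-virginica
    :else                :Iris-versicolor))

(j48-iris {:petalwidth 0.2 :petallength 1.4})
;; => :Iris-setosa
```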
        REPL>; We evaluate the classifier using a test dataset
        REPL>; The last parameter should be a different test dataset; here we reuse the training data
        REPL>(def evaluation (classifier-evaluate classifier :dataset ds ds))

        === Confusion Matrix ===

          a  b  c   <-- classified as
         50  0  0 |  a = Iris-setosa
          0 49  1 |  b = Iris-versicolor
          0  2 48 |  c = Iris-virginica

        === Summary ===

        Correctly Classified Instances         147               98      %
        Incorrectly Classified Instances         3                2      %
        Kappa statistic                          0.97
        Mean absolute error                      0.0233
        Root mean squared error                  0.108
        Relative absolute error                  5.2482 %
        Root relative squared error             22.9089 %
        Total Number of Instances              150

        REPL>(:kappa evaluation)

        0.97

        REPL>(:root-mean-squared-error evaluation)

        0.10799370769526968

        REPL>(:precision evaluation)

        {:Iris-setosa 1.0, :Iris-versicolor 0.9607843137254902, :Iris-virginica
        0.9795918367346939}

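The accuracy and per-class precision reported above follow directly from the confusion matrix: accuracy is the diagonal total over all instances, and the precision of a class is its diagonal cell over its column total (for example 49/51 for `Iris-versicolor`). The sketch below recomputes both in plain Clojure from the matrix copied from the output above; `accuracy` and `precision` are hypothetical helpers, not clj-ml functions.

```clojure
(def labels [:Iris-setosa :Iris-versicolor :Iris-virginica])

;; Rows are actual classes, columns are predicted classes,
;; copied from the confusion matrix above.
(def confusion [[50 0 0]
                [0 49 1]
                [0 2 48]])

(defn accuracy
  "Fraction of instances that lie on the diagonal."
  [m]
  (let [correct (reduce + (map-indexed (fn [i row] (nth row i)) m))
        total   (reduce + (apply concat m))]
    (double (/ correct total))))

(defn precision
  "Per-class precision: the diagonal cell divided by its column total."
  [labels m]
  (into {}
        (map-indexed
          (fn [j label]
            (let [column (map #(nth % j) m)]
              [label (/ (double (nth column j)) (reduce + column))]))
          labels)))

(accuracy confusion)          ; 147 of 150 instances are correct
(precision labels confusion)  ; a map from class keyword to precision
```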
        REPL>; The classifier can also be evaluated using cross-validation
        REPL>(classifier-evaluate classifier :cross-validation ds 10)

        === Confusion Matrix ===

          a  b  c   <-- classified as
         49  1  0 |  a = Iris-setosa
          0 47  3 |  b = Iris-versicolor
          0  4 46 |  c = Iris-virginica

        === Summary ===

        Correctly Classified Instances         142               94.6667 %
        Incorrectly Classified Instances         8                5.3333 %
        Kappa statistic                          0.92
        Mean absolute error                      0.0452
        Root mean squared error                  0.1892
        Relative absolute error                 10.1707 %
        Root relative squared error             40.1278 %
        Total Number of Instances              150

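The kappa statistic in the summaries above compares the observed agreement (the accuracy) with the agreement expected by chance from the row and column totals: kappa = (observed - expected) / (1 - expected). A sketch of that calculation in plain Clojure, applied to the cross-validation matrix above, follows; `kappa` is a hypothetical helper written for this example.

```clojure
(defn kappa
  "Cohen's kappa for a square confusion matrix given as a vector of row
  vectors (rows are actual classes, columns are predicted classes)."
  [m]
  (let [n        (double (reduce + (apply concat m)))
        k        (count m)
        observed (/ (reduce + (map #(get-in m [% %]) (range k))) n)
        row-sums (map #(reduce + %) m)
        col-sums (apply map + m)
        expected (/ (reduce + (map * row-sums col-sums)) (* n n))]
    (/ (- observed expected) (- 1.0 expected))))

;; Cross-validation matrix from the output above; the result is
;; approximately 0.92, matching the reported kappa statistic.
(kappa [[49 1 0]
        [0 47 3]
        [0 4 46]])
```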
        REPL>; A trained classifier can be used to classify new instances
        REPL>(def to-classify (make-instance ds
                                {:class :Iris-versicolor,
                                 :petalwidth 0.2,
                                 :petallength 1.4,
                                 :sepalwidth 3.5,
                                 :sepallength 5.1}))
        REPL>(classifier-classify classifier to-classify)

        0.0

        REPL>(classifier-label classifier to-classify)

        #<Instance 5.1,3.5,1.4,0.2,Iris-setosa>

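The `0.0` returned by `classifier-classify` is a double because Weka reports a nominal prediction as the index of the predicted value within the class attribute's declared values; index 0 is the first value, `Iris-setosa`, which is the label shown on the instance above. A small sketch of that lookup, assuming the value order shown in the confusion matrices (`class-values` and `index->label` are hypothetical names):

```clojure
;; Class values in declaration order (a, b, c in the confusion matrices).
(def class-values [:Iris-setosa :Iris-versicolor :Iris-virginica])

(defn index->label
  "Translates a predicted class index (a double) into its label."
  [class-values prediction]
  (nth class-values (int prediction)))

(index->label class-values 0.0)
;; => :Iris-setosa
```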
        REPL>; Classifiers can be saved and restored later
        REPL>(use 'clj-ml.utils)

        REPL>(serialize-to-file classifier
        REPL>                   "/Users/antonio.garrote/Desktop/classifier.bin")

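Weka classifiers implement `java.io.Serializable`, so saving one can be done with standard Java object serialization. The round trip below is a sketch that uses only JDK classes, with a plain Clojure map standing in for a trained classifier. `write-object` and `read-object` are hypothetical helpers, not part of clj-ml, and clj-ml's own implementation may differ.

```clojure
(import '(java.io File FileInputStream FileOutputStream
                  ObjectInputStream ObjectOutputStream))

(defn write-object
  "Serializes obj to the file at path using Java object serialization."
  [obj ^String path]
  (with-open [out (ObjectOutputStream. (FileOutputStream. path))]
    (.writeObject out obj)))

(defn read-object
  "Reads back an object previously written by write-object."
  [^String path]
  (with-open [in (ObjectInputStream. (FileInputStream. path))]
    (.readObject in)))

;; A map stands in for the classifier (an assumption for this example);
;; the expression is true when the restored value equals the original.
(let [path  (.getPath (File/createTempFile "classifier" ".bin"))
      model {:kind :placeholder :leaves 5}]
  (write-object model path)
  (= model (read-object path)))
```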
* Using clusterers

        REPL>(use 'clj-ml.clusterers)

        REPL>; We build a clusterer using k-means and three clusters
        REPL>(def kmeans (make-clusterer :k-means {:number-clusters 3}))

        REPL>; We need to remove the class from the dataset to
        REPL>; use this clustering algorithm
        REPL>(dataset-remove-class ds)

        REPL>; We build the clusters
        REPL>(clusterer-build kmeans ds)
        REPL>kmeans

        #<SimpleKMeans
        kMeans
        ======

        Number of iterations: 3
        Within cluster sum of squared errors: 7.817456892309574
        Missing values globally replaced with mean/mode

        Cluster centroids:
                                               Cluster#
        Attribute             Full Data                0                1                2
                                  (150)             (50)             (50)             (50)
        ==================================================================================
        sepallength              5.8433            5.936            5.006            6.588
        sepalwidth                3.054             2.77            3.418            2.974
        petallength              3.7587             4.26            1.464            5.552
        petalwidth               1.1987            1.326            0.244            2.026
        class               Iris-setosa  Iris-versicolor      Iris-setosa   Iris-virginica

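Once the centroids are known, k-means assigns an instance to the cluster whose centroid is nearest. The sketch below does this in plain Clojure with the centroids copied from the output above. It uses unnormalized squared Euclidean distance as a simplifying assumption, so it may not match Weka's internal distance computation exactly; `squared-distance` and `nearest-cluster` are hypothetical helpers.

```clojure
;; Centroids copied from the SimpleKMeans output above, in the order
;; sepallength, sepalwidth, petallength, petalwidth.
(def centroids [[5.936 2.77  4.26  1.326]
                [5.006 3.418 1.464 0.244]
                [6.588 2.974 5.552 2.026]])

(defn squared-distance
  "Sum of squared differences between two numeric vectors."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn nearest-cluster
  "Index of the centroid closest to the instance."
  [centroids instance]
  (first (apply min-key
                #(squared-distance (second %) instance)
                (map-indexed vector centroids))))

(nearest-cluster centroids [5.1 3.5 1.4 0.2])
;; => 1
```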
## License

MIT License