# clj-ml
A machine learning library for Clojure built on top of Weka and friends
## Installation
To install the library, first install Leiningen.
You should also download the Weka 3.6.2 jar from the official Weka homepage.
If Maven complains that it cannot find Weka, follow its instructions to install
the jar manually.
### To install from source
* git clone the project
* $ lein deps
* $ lein compile
* $ lein compile-java
* $ lein uberjar
### Installing from Clojars
[clj-ml "0.0.3-SNAPSHOT"]
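In a Leiningen project, the dependency vector above goes into the `:dependencies` list of `project.clj`. A minimal sketch, in which the project name `my-ml-app`, its version, and the Clojure version are placeholders and not values taken from this README:

```clojure
;; project.clj (hypothetical project; only the clj-ml entry comes from this README)
(defproject my-ml-app "0.1.0-SNAPSHOT"
  :description "Example project depending on clj-ml"
  :dependencies [[org.clojure/clojure "1.1.0"]   ; assumed Clojure version
                 [clj-ml "0.0.3-SNAPSHOT"]])     ; coordinates from this README
```

Running `lein deps` afterwards should fetch the library, subject to the Weka jar caveat described under Installation.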
### Installing from Maven
(add Clojars repository)
<dependency>
<groupId>clj-ml</groupId>
<artifactId>clj-ml</artifactId>
<version>0.0.3-SNAPSHOT</version>
</dependency>
## Supported algorithms
* Filters
- supervised discretize
- unsupervised discretize
- supervised nominal to binary
- unsupervised nominal to binary
* Classifiers
- C4.5 (J4.8)
- naive Bayes
- multilayer perceptron
* Clusterers
- k-means
## Usage
* I/O of data
REPL>(use 'clj-ml.io)
REPL>; Loading data from an ARFF file (XRFF and CSV are also supported)
REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))
REPL>; Saving data in a different format
REPL>(save-instances :csv "file:///Users/antonio.garrote/Desktop/iris.csv" ds)
* Working with datasets
REPL>(use 'clj-ml.data)
REPL>; Defining a dataset
REPL>(def ds (make-dataset "name" [:length :width {:kind [:good :bad]}] [ [12 34 :good] [24 53 :bad] ]))
REPL>ds
#<ClojureInstances @relation name
@attribute length numeric
@attribute width numeric
@attribute kind {good,bad}
@data
12,34,good
24,53,bad>
REPL>; Using datasets like sequences
REPL>(dataset-seq ds)
(#<Instance 12,34,good> #<Instance 24,53,bad>)
REPL>; Transforming instances into maps or vectors
REPL>(instance-to-map (first (dataset-seq ds)))
{:kind :good, :width 34.0, :length 12.0}
REPL>(instance-to-vector (dataset-at ds 0))
[12.0 34.0 :good]
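Since `dataset-seq` is shown returning a sequence of instances, the usual Clojure sequence functions should apply to it. A small sketch that reuses only the functions shown above on the same two-row dataset; the name `rows` is illustrative, and this is an illustration of the idea, not documented behaviour:

```clojure
(use 'clj-ml.data)

;; Same toy dataset as in the transcript above
(def ds (make-dataset "name"
                      [:length :width {:kind [:good :bad]}]
                      [[12 34 :good] [24 53 :bad]]))

;; Convert every instance into a map (rows is a hypothetical name)
(def rows (map instance-to-map (dataset-seq ds)))

;; Keep only the rows whose :kind attribute is :good
(filter #(= :good (:kind %)) rows)
```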
* Filtering datasets
REPL>(use 'clj-ml.filters)
REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))
REPL>; Discretizing a numeric attribute using an unsupervised filter
REPL>(def discretize (make-filter :unsupervised-discretize {:dataset ds :attributes [0 2]}))
REPL>(def filtered-ds (filter-process discretize ds))
* Using classifiers
REPL>(use 'clj-ml.classifiers)
REPL>; Building a classifier using a C4.5 decision tree
REPL>(def classifier (make-classifier :decission-tree :c45))
REPL>; We set the class attribute for the loaded dataset
REPL>(dataset-set-class ds 4)
REPL>; Training the classifier
REPL>(classifier-train classifier ds)
#<J48 J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
REPL>; We evaluate the classifier using a test dataset
REPL>; The last parameter should be a separate test dataset; here we reuse the training data
REPL>(def evaluation (classifier-evaluate classifier :dataset ds ds))
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
=== Summary ===
Correctly Classified Instances 147 98 %
Incorrectly Classified Instances 3 2 %
Kappa statistic 0.97
Mean absolute error 0.0233
Root mean squared error 0.108
Relative absolute error 5.2482 %
Root relative squared error 22.9089 %
Total Number of Instances 150
REPL>(:kappa evaluation)
0.97
REPL>(:root-mean-squared-error evaluation)
0.10799370769526968
REPL>(:precision evaluation)
{:Iris-setosa 1.0, :Iris-versicolor 0.9607843137254902, :Iris-virginica 0.9795918367346939}
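The per-class precision values follow from the first confusion matrix above: each one is the diagonal entry divided by its column total, that is, correct predictions of a class over all predictions of that class. A self-contained sketch in plain Clojure that redoes the arithmetic; the names `labels`, `confusion`, `precision`, `precisions`, and `accuracy` are illustrative and not part of clj-ml:

```clojure
;; Rows are actual classes, columns are predicted classes,
;; copied from the first confusion matrix above.
(def labels [:Iris-setosa :Iris-versicolor :Iris-virginica])
(def confusion [[50  0  0]
                [ 0 49  1]
                [ 0  2 48]])

(defn precision
  "Precision for class index i: correct predictions of i over all predictions of i."
  [matrix i]
  (let [column (map #(nth % i) matrix)]
    (double (/ (nth column i) (reduce + column)))))

;; Map from class label to precision: 50/50, 49/51 and 48/49
(def precisions
  (zipmap labels (map #(precision confusion %) (range (count labels)))))

;; Overall accuracy: diagonal sum over total instances (147 / 150)
(def accuracy
  (double (/ (reduce + (map nth confusion (range (count confusion))))
             (reduce + (flatten confusion)))))
```

The ratios 49/51 and 48/49 come out at roughly 0.9608 and 0.9796, in line with the `:precision` map, and 147/150 matches the 98 % reported in the summary.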
REPL>; The classifier can also be evaluated using cross-validation
REPL>(classifier-evaluate classifier :cross-validation ds 10)
=== Confusion Matrix ===
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 4 46 | c = Iris-virginica
=== Summary ===
Correctly Classified Instances 142 94.6667 %
Incorrectly Classified Instances 8 5.3333 %
Kappa statistic 0.92
Mean absolute error 0.0452
Root mean squared error 0.1892
Relative absolute error 10.1707 %
Root relative squared error 40.1278 %
Total Number of Instances 150
REPL>; A trained classifier can be used to classify new instances
REPL>(def to-classify (make-instance ds
{:class :Iris-versicolor,
:petalwidth 0.2,
:petallength 1.4,
:sepalwidth 3.5,
:sepallength 5.1}))
REPL>(classifier-classify classifier to-classify)
0.0
REPL>(classifier-label classifier to-classify)
#<Instance 5.1,3.5,1.4,0.2,Iris-setosa>
REPL>; The classifiers can be saved and restored later
REPL>(use 'clj-ml.utils)
REPL>(serialize-to-file classifier "/Users/antonio.garrote/Desktop/classifier.bin")
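The transcript above only shows the saving half. A hedged sketch of the round trip, continuing the same REPL session (so `classifier` and `to-classify` are already defined) and assuming that `clj-ml.utils` also exposes a `deserialize-from-file` counterpart; that function name is an assumption, since this README does not show it, and the path is a placeholder:

```clojure
(use 'clj-ml.utils)

;; Placeholder path; any writable location should do
(def path "/tmp/classifier.bin")

;; Save the trained classifier from the session above
(serialize-to-file classifier path)

;; ASSUMPTION: deserialize-from-file is not shown in this README;
;; check clj-ml.utils for the actual name before relying on it
(def restored (deserialize-from-file path))

;; The restored object would then be used like the original
(classifier-classify restored to-classify)
```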
* Using clusterers
REPL>(use 'clj-ml.clusterers)
REPL>; We build a clusterer using k-means and three clusters
REPL>(def kmeans (make-clusterer :k-means {:number-clusters 3}))
REPL>; We need to remove the class from the dataset to
REPL>; use this clustering algorithm
REPL>(dataset-remove-class ds)
REPL>; We build the clusters
REPL>(clusterer-build kmeans ds)
REPL>kmeans
#<SimpleKMeans
kMeans
======
Number of iterations: 3
Within cluster sum of squared errors: 7.817456892309574
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2
(150) (50) (50) (50)
==================================================================================
sepallength 5.8433 5.936 5.006 6.588
sepalwidth 3.054 2.77 3.418 2.974
petallength 3.7587 4.26 1.464 5.552
petalwidth 1.1987 1.326 0.244 2.026
class Iris-setosa Iris-versicolor Iris-setosa Iris-virginica
## License
MIT License