# clj-ml
A machine learning library for Clojure built on top of Weka and friends.

## Installation

In order to install the library you must first install Leiningen.

### To install from source

Clone the project with git, then run:

    $ lein deps
    $ lein javac
    $ lein uberjar

### Installing from Clojars

    [cc.artifice/clj-ml "0.3.5"]
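
In a Leiningen project, the dependency vector goes into the `:dependencies` list of `project.clj`. A minimal sketch: the project name `my-ml-app` and the Clojure version are placeholders chosen for illustration, not requirements stated by clj-ml.

```clojure
;; project.clj for a hypothetical project.
;; Only the clj-ml coordinate comes from this README.
(defproject my-ml-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]   ; assumed Clojure version
                 [cc.artifice/clj-ml "0.3.5"]])
```

Running `lein deps` afterwards fetches the library from Clojars.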

### Installing from Maven

Add the Clojars repository to your `pom.xml`, then declare the dependency:

    <dependency>
      <groupId>cc.artifice</groupId>
      <artifactId>clj-ml</artifactId>
      <version>0.3.4</version>
    </dependency>

## Supported algorithms

* Filters
  * supervised discretize
  * unsupervised discretize
  * supervised nominal to binary
  * unsupervised nominal to binary
  * string to word vector
  * reorder attributes
  * resample (supervised, unsupervised)
* Classifiers
  * C4.5 (J4.8)
  * naive Bayes
  * multilayer perceptron
  * support vector machines
* Clusterers
  * k-means

## Usage

API documentation can be found [here](http://antoniogarrote.github.com/clj-ml/index.html).

### I/O of data

    REPL>(use 'clj-ml.io)

    REPL>; Loading data from an ARFF file; XRFF and CSV are also supported
    REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

    REPL>; Saving data in a different format
    REPL>(save-instances :csv "file:///Users/antonio.garrote/Desktop/iris.csv" ds)

### Working with datasets

    REPL>(use 'clj-ml.data)

    REPL>; Defining a dataset
    REPL>(def ds (make-dataset "name" [:length :width {:kind [:good :bad]}] [ [12 34 :good] [24 53 :bad] ]))
    REPL>ds

    #<ClojureInstances @relation name

    @attribute length numeric
    @attribute width numeric
    @attribute kind {good,bad}

    @data
    12,34,good
    24,53,bad>

    REPL>; Using datasets like sequences
    REPL>(dataset-seq ds)

    (#<Instance 12,34,good> #<Instance 24,53,bad>)

    REPL>; Transforming instances into maps or vectors
    REPL>(instance-to-map (first (dataset-seq ds)))

    {:kind :good, :width 34.0, :length 12.0}

    REPL>(instance-to-vector (dataset-at ds 0))
    [12.0 34.0 :good]
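
Because `dataset-seq` returns an ordinary sequence, it composes with Clojure's standard sequence functions. The sketch below uses only the functions shown in this section; the names `rows` and `good-rows` are illustrative.

```clojure
(use 'clj-ml.data)

;; Same toy dataset as above: two numeric attributes and one nominal attribute.
(def ds (make-dataset "name"
                      [:length :width {:kind [:good :bad]}]
                      [[12 34 :good] [24 53 :bad]]))

;; dataset-seq yields the instances, so map and filter apply directly.
(def rows (map instance-to-map (dataset-seq ds)))

;; Keep only the instances labelled :good.
(def good-rows (filter #(= :good (:kind %)) rows))
```

Converting to maps first keeps the filtering logic independent of attribute order.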
### Filtering datasets

    REPL>(use '(clj-ml filters io))

    REPL>(def ds (load-instances :arff "file:///Applications/weka-3-6-2/data/iris.arff"))

    REPL>; Discretizing numeric attributes using an unsupervised filter
    REPL>(def discretize (make-filter :unsupervised-discretize {:dataset-format ds :attributes [:sepallength :petallength]}))

    REPL>(def filtered-ds (filter-apply discretize ds))

    REPL>; You can also use the filter's fn directly, which creates and applies the filter:
    REPL>(def filtered-ds (unsupervised-discretize ds {:attributes [:sepallength :petallength]}))
    REPL>; This form lends itself to the -> macro and is useful when chaining multiple filters.

    REPL>; The equivalent operation can be done with the ->> macro and the make-apply-filter fn:
    REPL>(def filtered-ds (->> "file:///home/kiran/Downloads/weka/weka-3-6-9/data/iris.arff"
                               (load-instances :arff)
                               (make-apply-filter :unsupervised-discretize {:attributes [0 2]})))
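
The filter fns take the dataset as their first argument, which is what makes them fit the `->` macro. A hedged sketch of chaining two filter calls: it uses a small in-memory dataset (an assumption made so the example does not depend on a local ARFF file) and applies `unsupervised-discretize` twice, since that is the only filter fn shown in this README.

```clojure
(use '(clj-ml data filters))

;; Small in-memory dataset standing in for a loaded ARFF file.
(def ds (make-dataset "name"
                      [:length :width {:kind [:good :bad]}]
                      [[12 34 :good] [24 53 :bad]]))

;; Each call receives the dataset produced by the previous step.
(def filtered-ds
  (-> ds
      (unsupervised-discretize {:attributes [:length]})
      (unsupervised-discretize {:attributes [:width]})))
```

In practice the two steps would usually be different filters; a single call with both attributes gives the same result here.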
### Using classifiers

    REPL>(use 'clj-ml.classifiers)

    REPL>; Building a classifier using a C4.5 decision tree
    REPL>(def classifier (make-classifier :decision-tree :c45))

    REPL>; We set the class attribute for the loaded dataset
    REPL>(dataset-set-class ds 4)

    REPL>; Training the classifier
    REPL>(classifier-train classifier ds)

    #<J48 J48 pruned tree
    ------------------

    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)

    Number of Leaves  : 5

    Size of the tree : 9

    REPL>; We evaluate the classifier using a test dataset
    REPL>; The last parameter should be a separate test dataset; here we reuse the training data
    REPL>(def evaluation (classifier-evaluate classifier :dataset ds ds))

    === Confusion Matrix ===

      a  b  c   <-- classified as
     50  0  0 |  a = Iris-setosa
      0 49  1 |  b = Iris-versicolor
      0  2 48 |  c = Iris-virginica

    === Summary ===

    Correctly Classified Instances         147               98      %
    Incorrectly Classified Instances         3                2      %
    Kappa statistic                          0.97
    Mean absolute error                      0.0233
    Root mean squared error                  0.108
    Relative absolute error                  5.2482 %
    Root relative squared error             22.9089 %
    Total Number of Instances              150

    REPL>(:kappa evaluation)

    0.97

    REPL>(:root-mean-squared-error evaluation)

    0.10799370769526968

    REPL>(:precision evaluation)

    {:Iris-setosa 1.0, :Iris-versicolor 0.9607843137254902, :Iris-virginica
    0.9795918367346939}
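
The precision and kappa figures follow directly from the confusion matrix. The sketch below recomputes them in plain Clojure from the matrix printed above. It illustrates the standard definitions and is not clj-ml's or Weka's code; the helper names `column`, `precision` and `kappa` are made up for this example.

```clojure
;; Confusion matrix from the evaluation above: rows are actual classes,
;; columns are predicted classes (setosa, versicolor, virginica).
(def confusion [[50 0 0]
                [0 49 1]
                [0 2 48]])

(defn column
  "Values of column j, i.e. everything predicted as class j."
  [m j]
  (map #(nth % j) m))

(defn precision
  "Correct predictions of class j divided by all predictions of class j."
  [m j]
  (/ (double (get-in m [j j]))
     (reduce + (column m j))))

(defn kappa
  "Cohen's kappa: agreement beyond what the row and column totals
   would produce by chance."
  [m]
  (let [n        (double (reduce + (flatten m)))
        classes  (range (count m))
        observed (/ (reduce + (map #(get-in m [% %]) classes)) n)
        expected (/ (reduce + (map (fn [j]
                                     (* (reduce + (nth m j))
                                        (reduce + (column m j))))
                                   classes))
                    (* n n))]
    (/ (- observed expected) (- 1.0 expected))))

(precision confusion 1) ; versicolor: 49 correct out of 51 predicted
(kappa confusion)       ; approximately 0.97
```

For versicolor, 49 of the 51 instances predicted as versicolor are correct, which is where the 0.9608 value in the `:precision` map comes from.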

    REPL>; The classifier can also be evaluated using cross-validation
    REPL>(classifier-evaluate classifier :cross-validation ds 10)

    === Confusion Matrix ===

      a  b  c   <-- classified as
     49  1  0 |  a = Iris-setosa
      0 47  3 |  b = Iris-versicolor
      0  4 46 |  c = Iris-virginica

    === Summary ===

    Correctly Classified Instances         142               94.6667 %
    Incorrectly Classified Instances         8                5.3333 %
    Kappa statistic                          0.92
    Mean absolute error                      0.0452
    Root mean squared error                  0.1892
    Relative absolute error                 10.1707 %
    Root relative squared error             40.1278 %
    Total Number of Instances              150
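
The final argument `10` is the number of folds. As a rough illustration of what k-fold cross-validation does with the 150 instances, the sketch below splits indices into 10 train/test partitions in plain Clojure. It is a simplified model of the idea: Weka's own procedure also randomizes and stratifies the data, which this sketch does not, and `k-folds` is a hypothetical helper name.

```clojure
(defn k-folds
  "Splits the indices 0..n-1 into k folds and returns one
   {:train [...] :test [...]} map per fold. Index i goes to fold (mod i k)."
  [n k]
  (let [folds (map (fn [i] (filter #(= i (mod % k)) (range n)))
                   (range k))]
    (for [i (range k)]
      {:test  (vec (nth folds i))
       :train (vec (sort (apply concat
                                (keep-indexed #(when (not= %1 i) %2) folds))))})))

;; 150 instances and 10 folds: each round tests on 15 and trains on 135.
(def splits (k-folds 150 10))
```

Every instance is used for testing exactly once, which is why the cross-validated summary still reports 150 instances in total.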

    REPL>; A trained classifier can be used to classify new instances
    REPL>(def to-classify (make-instance ds
                                         {:class :Iris-versicolor,
                                          :petalwidth 0.2,
                                          :petallength 1.4,
                                          :sepalwidth 3.5,
                                          :sepallength 5.1}))
    REPL>(classifier-classify classifier to-classify)

    0.0

    REPL>(classifier-label classifier to-classify)

    #<Instance 5.1,3.5,1.4,0.2,Iris-setosa>

    REPL>; The classifiers can be saved and restored later
    REPL>(use 'clj-ml.utils)

    REPL>(serialize-to-file classifier "/Users/antonio.garrote/Desktop/classifier.bin")
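
The example above shows only the saving step. Weka classifiers implement `java.io.Serializable`, so one way to read an object back is plain Java serialization through interop. The sketch below is a generic round trip under that assumption: `save-object` and `load-object` are hypothetical helpers, not part of clj-ml, and whether `serialize-to-file` writes the same format is not confirmed here. A Clojure map stands in for the trained classifier.

```clojure
(import '(java.io File FileInputStream FileOutputStream
                  ObjectInputStream ObjectOutputStream))

(defn save-object
  "Writes any java.io.Serializable object to the file at path."
  [obj path]
  (with-open [out (ObjectOutputStream. (FileOutputStream. path))]
    (.writeObject out obj)))

(defn load-object
  "Reads back an object previously written by save-object."
  [path]
  (with-open [in (ObjectInputStream. (FileInputStream. path))]
    (.readObject in)))

;; Round trip with a stand-in value instead of a trained classifier.
(def tmp-path (.getPath (File/createTempFile "classifier" ".bin")))
(save-object {:model "stand-in"} tmp-path)
(def restored (load-object tmp-path))
```

A restored classifier can then be passed to `classifier-classify` in the same way as the original.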
### Using clusterers

    REPL>(use 'clj-ml.clusterers)

    REPL>; We build a clusterer using k-means and three clusters
    REPL>(def kmeans (make-clusterer :k-means {:number-clusters 3}))

    REPL>; We need to remove the class from the dataset to
    REPL>; use this clustering algorithm
    REPL>(dataset-remove-class ds)

    REPL>; We build the clusters
    REPL>(clusterer-build kmeans ds)
    REPL>kmeans

    #<SimpleKMeans
    kMeans
    ======

    Number of iterations: 3
    Within cluster sum of squared errors: 7.817456892309574
    Missing values globally replaced with mean/mode

    Cluster centroids:
                                  Cluster#
    Attribute      Full Data               0               1               2
                       (150)            (50)            (50)            (50)
    ==================================================================================
    sepallength       5.8433           5.936           5.006           6.588
    sepalwidth         3.054            2.77           3.418           2.974
    petallength       3.7587            4.26           1.464           5.552
    petalwidth        1.1987           1.326           0.244           2.026
    class        Iris-setosa Iris-versicolor     Iris-setosa  Iris-virginica

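
To see how the centroids are used, the sketch below assigns a point to its nearest centroid in plain Clojure. It copies the numeric centroid values from the output above and uses raw squared Euclidean distance. This is an illustration of the k-means assignment step, not Weka's implementation: `SimpleKMeans` may normalize attributes and, in this run, also considered the nominal `class` attribute, both of which are ignored here. `squared-distance` and `nearest-cluster` are hypothetical helper names.

```clojure
;; Centroids copied from the output above, numeric attributes only:
;; sepallength, sepalwidth, petallength, petalwidth.
(def centroids [[5.936 2.77  4.26  1.326]
                [5.006 3.418 1.464 0.244]
                [6.588 2.974 5.552 2.026]])

(defn squared-distance
  "Sum of squared per-attribute differences between two points."
  [a b]
  (reduce + (map #(let [d (- %1 %2)] (* d d)) a b)))

(defn nearest-cluster
  "Index of the centroid closest to point."
  [centroids point]
  (first (apply min-key second
                (map-indexed (fn [i c] [i (squared-distance c point)])
                             centroids))))

;; The instance classified earlier (5.1, 3.5, 1.4, 0.2) falls in cluster 1.
(nearest-cluster centroids [5.1 3.5 1.4 0.2])
```

Cluster 1 is the one whose `class` centroid is Iris-setosa, consistent with the label the decision tree assigned to the same measurements.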
## Thanks YourKit!
YourKit is kindly supporting open source projects with its full-featured Java Profiler.
YourKit, LLC is the creator of innovative and intelligent tools for profiling
Java and .NET applications. Take a look at YourKit's leading software products:
<a href="http://www.yourkit.com/java/profiler/index.jsp">YourKit Java Profiler</a> and
<a href="http://www.yourkit.com/.net/profiler/index.jsp">YourKit .NET Profiler</a>.
## License
MIT License