Version 1, last updated by doomie at February 10, 2010 16:20 UTC

Thomas Breuel's data is now available on the DIRO network: /data/lisa/data/ocr_breuel/filetensor. It consists of 2,136,211 images of scanned characters, each 32×32. Here is the README file:

The original images each came in a bounding box in which the character was well-centered. On average, this bounding box was about 28x28. If the bounding box fit completely into a 32x32 image, we just pasted it on a blank background. If not, we resized it so that it fit.
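
The padding rule above can be sketched as follows. This is only an illustration: the centered placement, the zero-valued background, the nearest-neighbor resampling and the name `fit_to_canvas` are all assumptions, since this README does not say how the pasting or resizing was actually done.

```python
def fit_to_canvas(box, size=32, background=0):
    """Place a 2-D list of pixels `box` into a size x size image.

    Assumptions (not stated in the README): centered placement,
    background value 0, nearest-neighbor resampling, and stretching
    to the full canvas when the box is too large.
    """
    h, w = len(box), len(box[0])
    if h <= size and w <= size:
        # The box fits: paste it onto a blank canvas.
        canvas = [[background] * size for _ in range(size)]
        top, left = (size - h) // 2, (size - w) // 2
        for r in range(h):
            canvas[top + r][left:left + w] = box[r]
        return canvas
    # The box is too large: resample it to the canvas size.
    return [[box[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

Under these assumptions, a 28x28 box ends up with a 2-pixel blank margin on every side, while a larger box is resampled to exactly 32x32.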

Two important notes:

(1) The classes are *very* unbalanced, probably reflecting the distribution of these characters in normal, printed text:

digits ('0'-'9'): label followed by the number of examples with that label

0 16655, 1 17031, 2 10313, 3 8090, 4 6540, 5 6898, 6 5639, 7 5359, 8 5656, 9 7225

upper-case letters ('A'-'Z'):

10 11109, 11 4469, 12 8295, 13 5193, 14 9082, 15 4599, 16 3421, 17 4025, 18 9137,
19 1606, 20 1487, 21 4800, 22 6037, 23 6998, 24 5690, 25 5745, 26 307, 27 7383,
28 11226, 29 11697, 30 2911, 31 1824, 32 3831, 33 241, 34 1736, 35 274

lower-case letters ('a'-'z'):

36 161667, 37 26171, 38 66383, 39 73531, 40 242404, 41 43134, 42 35652, 43 79944, 44 141529,
45 1634, 46 11481, 47 82287, 48 46036, 49 137907, 50 148262, 51 41723, 52 2610, 53 128413,
54 128559, 55 170914, 56 56027, 57 20558, 58 26607, 59 4969, 60 32650, 61 2630

For instance, "40 242404" means that the letter 'e' occurs 242404 times (more than 10% of the entire dataset!).
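
The label numbering above (0-9 for digits, 10-35 for upper-case, 36-61 for lower-case) can be turned back into characters with a small lookup. The sketch below assumes the labels follow that order exactly; `CHARS` and `label_to_char` are illustrative names, and `TOTAL` is the sum of all the counts listed above.

```python
import string

# Labels 0-9 are digits, 10-35 upper-case, 36-61 lower-case, in that order.
CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def label_to_char(label):
    """Map an integer class label (0-61) to the character it stands for."""
    return CHARS[label]

TOTAL = 2136211      # sum of all the per-label counts listed above
count_e = 242404     # count listed for label 40

print(label_to_char(40))            # e
print(f"{count_e / TOTAL:.1%}")     # 11.3%
```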

(2) The dataset is NOT shuffled. The data is, basically, a list of characters *in order* from a collection of documents. Thus, there is high correlation between successive characters. If/when constructing a training+testing set, you might want to shuffle the data (or you might not, depending on your objective).
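
If you do shuffle, the images and their labels have to be reordered with the same permutation, otherwise the pairing is lost. A minimal sketch, using toy lists in place of the real data; the function name and the fixed seed are illustrative choices, not part of the dataset tools:

```python
import random

def shuffle_together(images, labels, seed=0):
    """Return images and labels reordered by one shared random permutation."""
    assert len(images) == len(labels)
    order = list(range(len(images)))
    random.Random(seed).shuffle(order)   # seeded, so the split is reproducible
    return [images[i] for i in order], [labels[i] for i in order]

# Toy stand-ins for the real image and label arrays.
images = ["img_a", "img_b", "img_c", "img_d"]
labels = [36, 37, 38, 39]
shuffled_images, shuffled_labels = shuffle_together(images, labels)
```

Fixing the seed keeps the resulting order, and therefore any train/test split built from it, the same from one run to the next.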

For the course, I shuffled the data randomly into the files unlv-corrected-2010-02-01-shuffled.ft and unlv-corrected-2010-02-01-labels-shuffled.ft

Please note that this is a 2.1GB file, and that loading it into memory requires a machine with at least 3GB of RAM. maggie46 is a good choice.
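
As a rough sanity check on that size, the raw pixel data can be estimated from the image count and dimensions. This assumes one byte per pixel, which is not stated anywhere on this page, and ignores any file header:

```python
N_IMAGES = 2136211        # number of images in the dataset
HEIGHT = WIDTH = 32       # image dimensions
BYTES_PER_PIXEL = 1       # assumption: 8-bit pixels

n_bytes = N_IMAGES * HEIGHT * WIDTH * BYTES_PER_PIXEL
print(n_bytes)                          # 2187480064
print(f"{n_bytes / 1e9:.2f} GB")        # 2.19 GB
print(f"{n_bytes / 2**30:.2f} GiB")     # 2.04 GiB
```

Under that assumption the estimate is in the same range as the file size quoted above. Converting the images to 32-bit floats after loading would multiply the footprint by four.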

Please do not distribute the data outside the course.