Processing huge files

January 14th, 2013 No comments

I recently had to process a set of files containing historical tick-by-tick FX market data and quickly realized that none of them could be read into memory using a traditional InputStream because every file was over 4 gigabytes in size. Emacs couldn’t even open them.

In this particular case I could write a simple bash script that divides the files into smaller pieces and then read them as usual. But I don’t want that, since binary formats would invalidate this approach.

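So the way to handle this problem properly is to process regions of data incrementally using memory mapped files. What’s nice about memory mapped files is that they do not consume Java heap or paging space, since a read-only mapping is backed by the file data on disk. A mapping does occupy virtual address space, and a single MappedByteBuffer cannot be larger than 2 gigabytes, which are two more reasons to map one region at a time instead of the whole file.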
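To make the region idea concrete, here is a minimal sketch that uses only the JDK's java.nio API. The class name MappedRegion and the method readRegion are made up for illustration and are not part of the code further down; the point is simply that only the requested window of the file is mapped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps one window of a file and copies it out.
class MappedRegion {

    // Returns up to `length` bytes starting at `offset`, using a read-only mapping.
    static byte[] readRegion(Path file, long offset, int length) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long remaining = channel.size() - offset;
            if (remaining <= 0) {
                return new byte[0]; // offset is at or past the end of the file
            }
            int size = (int) Math.min(length, remaining);
            // Only this window is mapped; the rest of the file is not touched.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            byte[] bytes = new byte[size];
            buffer.get(bytes);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Copying the bytes out is only done here to keep the example easy to inspect; real processing would normally decode straight from the mapped buffer.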

Okay, let’s have a look at these files and extract some data. They seem to contain ASCII text rows with comma-delimited fields.

Format: [currency-pair],[timestamp],[bid-price],[ask-price]

Example: EUR/USD,20120102 00:01:30.420,1.29451,1.2949
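For reference, this is roughly what turning one such row into typed values could look like. It is only a sketch: the Tick class is hypothetical, it uses java.time (Java 8 or later), and it assumes the timestamps are UTC, which the files themselves do not state.

```java
import java.math.BigDecimal;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical value class for one row; not part of the original code.
final class Tick {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd HH:mm:ss.SSS");

    final String pair;
    final long epochMillis;
    final BigDecimal bid;
    final BigDecimal ask;

    Tick(String pair, long epochMillis, BigDecimal bid, BigDecimal ask) {
        this.pair = pair;
        this.epochMillis = epochMillis;
        this.bid = bid;
        this.ask = ask;
    }

    // Parses "[currency-pair],[timestamp],[bid-price],[ask-price]".
    static Tick parse(String row) {
        String[] fields = row.trim().split(",");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + row);
        }
        // Assumption: timestamps are in UTC.
        long millis = LocalDateTime.parse(fields[1], FORMAT)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        return new Tick(fields[0], millis,
                new BigDecimal(fields[2]), new BigDecimal(fields[3]));
    }
}
```

BigDecimal is used for the prices so that the decimal digits survive exactly as written in the file.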

Fair enough, I could write a program for that format. But reading and parsing files are orthogonal concepts, so let’s take a step back and think about a generic design that can be reused if I am confronted with a similar problem in the future.

The problem boils down to incrementally decoding a set of entries encoded in an arbitrarily long byte array without exhausting memory. The fact that the example format is encoded as comma/line delimited text is irrelevant for the general solution, so it is clear that a decoder interface is needed in order to handle different formats.

Again, we cannot parse every entry and keep it in memory until the whole file is processed, so we need a way to incrementally hand off chunks of entries that can be written elsewhere, to disk or network, before they are garbage collected. An iterator is a good abstraction for this requirement because iterators act like cursors, which is exactly the point. Every iteration advances the file pointer and lets us do something with the data.

So first the Decoder interface. The idea is to incrementally decode objects from a MappedByteBuffer, or return null if no objects remain in the buffer.

public interface Decoder<T> {
public T decode(ByteBuffer buffer);
}
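To illustrate the contract with something other than text, here is a small sketch of a decoder for a made-up binary format of fixed-width 4-byte integers. IntDecoder is hypothetical, and the Decoder interface is repeated (without the redundant public modifier) only so that the example compiles on its own. The important part is that an incomplete entry returns null and leaves the buffer position untouched, so the caller can retry from the same spot with more data.

```java
import java.nio.ByteBuffer;

// Same shape as the Decoder interface above, repeated so the sketch stands alone.
interface Decoder<T> {
    T decode(ByteBuffer buffer);
}

// Hypothetical decoder for fixed-width, big-endian 4-byte integers.
class IntDecoder implements Decoder<Integer> {
    @Override
    public Integer decode(ByteBuffer buffer) {
        if (buffer.remaining() < Integer.BYTES) {
            // Not enough bytes for a whole entry: signal "nothing more here"
            // without consuming the partial entry.
            return null;
        }
        return buffer.getInt();
    }
}
```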

Then comes the FileReader, which implements Iterable. Each iteration processes the next 4096 bytes of data and decodes them into a list of objects using the Decoder. Notice that FileReader accepts a list of files, which is nice since it enables traversal through the data without worrying about aggregation across files. By the way, 4096 byte chunks are probably a bit small for bigger files.

public class FileReader<T> implements Iterable<List<T>> {
private static final long CHUNK_SIZE = 4096;
private final Decoder<T> decoder;
private Iterator<File> files;
private FileReader(Decoder<T> decoder, File... files) {
this(decoder, Arrays.asList(files));
}
private FileReader(Decoder<T> decoder, List<File> files) {
this.files = files.iterator();
this.decoder = decoder;
}
public static <T> FileReader<T> create(Decoder<T> decoder, List<File> files) {
return new FileReader<T>(decoder, files);
}
 
public static <T> FileReader<T> create(Decoder<T> decoder, File... files) {
return new FileReader<T>(decoder, files);
}
@Override
public Iterator<List<T>> iterator() {
return new Iterator<List<T>>() {
private List<T> entries;
private long chunkPos = 0;
private MappedByteBuffer buffer;
private FileChannel channel;
@Override
public boolean hasNext() {
if (buffer == null || !buffer.hasRemaining()) {
buffer = nextBuffer(chunkPos);
if (buffer == null) {
return false;
}
}
T result = null;
while ((result = decoder.decode(buffer)) != null) {
if (entries == null) {
entries = new ArrayList<T>();
}
entries.add(result);
}
// set next MappedByteBuffer chunk
chunkPos += buffer.position();
buffer = null;
if (entries != null) {
return true;
} else {
Closeables.closeQuietly(channel);
return false;
}
}
private MappedByteBuffer nextBuffer(long position) {
try {
if (channel == null || channel.size() == position) {
if (channel != null) {
Closeables.closeQuietly(channel);
channel = null;
}
if (files.hasNext()) {
File file = files.next();
channel = new RandomAccessFile(file, "r").getChannel();
chunkPos = 0;
position = 0;
} else {
return null;
}
}
long chunkSize = CHUNK_SIZE;
if (channel.size() - position < chunkSize) {
chunkSize = channel.size() - position;
}
return channel.map(FileChannel.MapMode.READ_ONLY, chunkPos, chunkSize);
} catch (IOException e) {
Closeables.closeQuietly(channel);
throw new RuntimeException(e);
}
}
@Override
public List<T> next() {
List<T> res = entries;
entries = null;
return res;
}
@Override
public void remove() {
throw new UnsupportedOperationException();
}
};
}
}

The next task is to write a Decoder. I decided to implement a generic TextRowDecoder for any delimited text file format, accepting the number of fields per row and a field delimiter, and returning an array of byte arrays. TextRowDecoder can then be reused by format-specific decoders that may handle different character sets.

public class TextRowDecoder implements Decoder<byte[][]> {
private static final byte LF = 10;
private final int numFields;
private final byte delimiter;
public TextRowDecoder(int numFields, byte delimiter) {
this.numFields = numFields;
this.delimiter = delimiter;
}
@Override
public byte[][] decode(ByteBuffer buffer) {
int lineStartPos = buffer.position();
int limit = buffer.limit();
while (buffer.hasRemaining()) {
byte b = buffer.get();
if (b == LF) { // reached line feed so parse line
int lineEndPos = buffer.position();
// set positions for one row duplication
if (buffer.limit() < lineEndPos + 1) {
buffer.position(lineStartPos).limit(lineEndPos);
} else {
buffer.position(lineStartPos).limit(lineEndPos + 1);
}
byte[][] entry = parseRow(buffer.duplicate());
if (entry != null) {
// reset main buffer
buffer.position(lineEndPos);
buffer.limit(limit);
// set start after LF
lineStartPos = lineEndPos;
}
return entry;
}
}
buffer.position(lineStartPos);
return null;
}
public byte[][] parseRow(ByteBuffer buffer) {
int fieldStartPos = buffer.position();
int fieldEndPos = 0;
int fieldNumber = 0;
byte[][] fields = new byte[numFields][];
while (buffer.hasRemaining()) {
byte b = buffer.get();
if (b == delimiter || b == LF) {
fieldEndPos = buffer.position();
// save limit
int limit = buffer.limit();
// set positions for one row duplication
buffer.position(fieldStartPos).limit(fieldEndPos);
fields[fieldNumber] = parseField(buffer.duplicate(), fieldNumber, fieldEndPos - fieldStartPos - 1);
fieldNumber++;
// reset main buffer
buffer.position(fieldEndPos);
buffer.limit(limit);
// set start after LF
fieldStartPos = fieldEndPos;
}
if (fieldNumber == numFields) {
return fields;
}
}
return null;
}
private byte[] parseField(ByteBuffer buffer, int pos, int length) {
byte[] field = new byte[length];
for (int i = 0; i < field.length; i++) {
field[i] = buffer.get();
}
return field;
}
}

And this is how files are processed. Each list contains the elements decoded from a single buffer, and each element is an array of byte arrays as specified by the TextRowDecoder.

TextRowDecoder decoder = new TextRowDecoder(4, (byte) ',');
FileReader<byte[][]> reader = FileReader.create(decoder, file.listFiles());
for (List<byte[][]> chunk : reader) {
// do something with each chunk
}

We could stop here, but there was one more requirement. Every row contains a timestamp and each batch must be grouped according to periods of time, day-by-day or hour-by-hour, instead of buffers. I still want to iterate through each batch, so the immediate reaction was to create an Iterable wrapper for FileReader that would implement this behaviour. One additional detail is that each element must provide its timestamp to PeriodEntries by implementing the Timestamped interface (not shown here).

public class PeriodEntries<T extends Timestamped> implements Iterable<List<T>> {
private final Iterator<List<T>> entriesIt;
private final long interval;
private PeriodEntries(Iterable<List<T>> entriesIt, long interval) {
this.entriesIt = entriesIt.iterator();
this.interval = interval;
}
 
public static <T extends Timestamped> PeriodEntries<T> create(Iterable<List<T>> entriesIt, long interval) {
return new PeriodEntries<T>(entriesIt, interval);
}
@Override
public Iterator<List<T>> iterator() {
return new Iterator<List<T>>() {
private Queue<List<T>> queue = new LinkedList<List<T>>();
private Iterator<T> entryIt;
// first entry of the next period, read while finishing the current one
private T pending;
@Override
public boolean hasNext() {
// a finished group is already waiting for next()
if (queue.peek() != null) {
return true;
}
T entry = pending;
pending = null;
if (entry == null) {
if (!advanceEntries()) {
return false;
}
entry = entryIt.next();
}
long period = normalizeInterval(entry);
List<T> group = new ArrayList<T>();
group.add(entry);
while (advanceEntries()) {
entry = entryIt.next();
if (normalizeInterval(entry) != period) {
// belongs to the next period, keep it instead of dropping it
pending = entry;
break;
}
group.add(entry);
}
queue.add(group);
return true;
}
private boolean advanceEntries() {
// if there are no rows left
if (entryIt == null || !entryIt.hasNext()) {
// try get more rows if possible
if (entriesIt.hasNext()) {
entryIt = entriesIt.next().iterator();
return true;
} else {
// no more rows
return false;
}
}
return true;
}
private long normalizeInterval(Timestamped entry) {
long time = entry.getTime();
int utcOffset = TimeZone.getDefault().getOffset(time);
long utcTime = time + utcOffset;
long elapsed = utcTime % interval;
return time - elapsed;
}
@Override
public List<T> next() {
return queue.poll();
}
@Override
public void remove() {
throw new UnsupportedOperationException();
}
};
}
}
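The grouping hinges on mapping each timestamp to the start of its period. Below is a minimal sketch of that arithmetic on its own. The Timestamped interface is my assumption of its shape, inferred from the entry.getTime() call above, and the Periods class is hypothetical. Unlike normalizeInterval, this sketch aligns periods to the epoch in UTC and ignores the default time zone offset.

```java
// Assumed shape of the interface the post leaves out, inferred from entry.getTime().
interface Timestamped {
    long getTime(); // milliseconds since the epoch
}

// Hypothetical helper isolating the period arithmetic.
class Periods {

    // Start of the period that contains the entry, for periods aligned to the epoch (UTC).
    static long periodStart(Timestamped entry, long intervalMillis) {
        long time = entry.getTime();
        // floorMod keeps the result correct for timestamps before 1970 as well.
        return time - Math.floorMod(time, intervalMillis);
    }
}
```

Two entries belong to the same batch exactly when periodStart returns the same value for both, which is the comparison the iterator relies on.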

The final processing code did not change much by introducing this functionality: it is still one clean and tight for-loop that does not have to care about grouping elements across files, buffers and periods. PeriodEntries is also flexible enough to manage any interval length.

TrueFxDecoder decoder = new TrueFxDecoder();
FileReader<TrueFxData> reader = FileReader.create(decoder, file.listFiles());
long periodLength = TimeUnit.DAYS.toMillis(1);
PeriodEntries<TrueFxData> periods = PeriodEntries.create(reader, periodLength);
for (List<TrueFxData> entries : periods) {
// data for each day
for (TrueFxData entry : entries) {
// process each entry
}
}

As you may realize, it would not have been possible to solve this problem with collections; choosing iterators was a crucial design decision to be able to parse terabytes of data without consuming too much heap space.

Categories: Java, Uncategorized Tags:

WSDL sucks

December 7th, 2012 No comments

WSDL sucks. The whole WS-* protocol stack sucks. There, I said it.

It hurts, a pain in the butt. Difficult to write and hard to debug. I cannot think of another technology that has wasted more of my time, and I have yet to find one person who can give me a clear and concise explanation of exactly how to use it. Allegedly one of the most overengineered technologies in the history of computers.

Doing wrong is easy, doing right is hard.

It’s bloated. Tooling support is poor and hides necessary complexity. Interoperability is hard. Caching is not an option. Noncompliant with traditional web technology. The wire format is insanely verbose. Backward and forward compatibility is a nightmare. You have to read endless piles of specifications to design anything sophisticated, and sometimes even bend over backwards to do the simplest of things. All you are left with is a pile of crap to maintain in the end.

WTF? Where is KISS and productivity to be found in this mess?

System integration should not be this hard.

Categories: architecture, coding, Java Tags:

The Windows preinstall paralysis

November 22nd, 2012 2 comments

No lie.

I have been a Windows desktop user for many years, even as a programmer. But I also worked with Linux, mainly in shell mode on remote servers. I contemplated trying a Linux desktop many times but was always too comfortable with Windows plug-n-play, afraid of hardware incompatibility, of learning new stuff and what-not. Looking back on this paralysis, I wish that someone had slapped me (real hard) earlier to wake me up from this infectious Microsoft abuse.

As a programmer there is no good excuse to use Windows unless you work for Microsoft. Linux is simply the best operating system out there for programmers (go ahead, flame me). Linux will, without doubt, make you a lot more productive once you’ve learned to take advantage of it; especially the shell and having the power of Open Source at your fingertips.

Unless you are a gamer you have everything you need and more. Ah, but you have that occasional PowerPoint or Word document obligation to please your manager? Nah, not a good excuse for you or your manager. Use GRUB, a VM or similar and be done with it.

If you are a novice and want to use Linux professionally, my advice is to start by installing a Linux desktop at home. This forces you to get comfortable with it. You don’t want to mess up at work. I began with an Ubuntu distro but later switched to Debian for the sake of stability, memory footprint and liking the social contract. I want my window manager light and snappy but still user friendly, so I am using XFCE at the moment.

Try to set up a home network, learn the filesystem, bash, the software/package system and format, how user/group permissions work, configure nifty little keyboard shortcuts etc. Do as much as possible in the shell and in configuration files. If you get stuck, google; all your problems have been solved before. The community will be there for you. No license or expensive support contract needed. All we ask from you is to later do the same for others. You will soon be flying casual.

Corporate governance rules that hinder programmers from using Linux should be alarming. Such companies are built on a foundation of control and regulation, not trusting their employees. Personally I avoid companies that take this freedom away from me simply because Linux has become my 10x productivity multiplier through the years. Really.

This is my five-finger-wakeup-slap to involuntary Windows users, the one I wish I had got earlier. Linux is a lot easier than you might think.

Switch now. You will not regret it.

Categories: Linux, open source, Uncategorized Tags:

tools4j-config part 1, introduction

Tools4j-config 0.0.1 was released to the Maven Central Repository about a week ago. It is a framework that aims to support creating configurable applications in a productive and consistent way.

This is a quick introduction to the possibilities for defining configuration and constraints, and to how configuration is administered for applications.

Prerequisites
Tools4j-config is fully functional in any Java SE 6+ compatible environment and is distributed as a set of Maven 3+ projects.

Quickstart
First we’ll define configuration used by our application. Create a maven project with the following dependencies.

<dependency>
<groupId>org.deephacks.tools4j</groupId>
<artifactId>config-api-runtime</artifactId>
<version>0.0.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.deephacks.tools4j</groupId>
<artifactId>config-core</artifactId>
<version>0.0.1</version>
<scope>runtime</scope>
</dependency>

Let’s assume we need access to a database, so we create a class that will represent the configuration needed to connect to it.

@Config(desc = "User database.")
public class Database {
@Id(desc = "Identification of database.")
private String id;
@Config(desc = "Address for connecting to database.")
private URL url;
// username with default value 'test'
@Config(desc = "Username for connecting to database.")
private String username = "test";
// password with default value 'test'
@Config(desc = "Password for connecting to database.")
private String password = "test";
@Config(desc = "Database connections in pool.")
private Integer poolSize;
}

Now register this class with the framework and read its configuration.

RuntimeContext runtime = Lookup.get().lookup(RuntimeContext.class);
runtime.register(Database.class);
List<Database> tests = runtime.all(Database.class);

Done! Our application is now configurable. Pretty quick and simple.

But what did we read? Nothing actually. The list is empty since there is no configuration available yet. We need the administrator to provision configuration first. So let’s take the administrator perspective for a second.

Create a new maven project with the following dependencies.

<dependency>
<groupId>org.deephacks.tools4j</groupId>
<artifactId>config-api-admin</artifactId>
<version>0.0.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.deephacks.tools4j</groupId>
<artifactId>config-core</artifactId>
<version>0.0.1</version>
<scope>runtime</scope>
</dependency>

Configuration is managed programmatically and this is how our imaginary administrator creates a database named user and sets values for it.

AdminContext admin = Lookup.get().lookup(AdminContext.class);
Bean bean = Bean.create(BeanId.create("user", Database.class.getName()));
bean.addProperty("url", "/dev/null");
bean.addProperty("username", "admin");
bean.addProperty("password", "admin123");
bean.addProperty("poolSize", "nonsense");
admin.create(bean);

All good? Not quite. The code above fails twice.

Bean Database@user have a property java.net.URL@url with value /dev/null not matching its type.
Bean Database@user have a property java.lang.Integer@poolSize with value nonsense not matching its type.

These failures should be obvious. The administrator provisioned values that did not conform to the types of the configurable class, which brings up an important point: type safety. Administrators should not be able to break the constraints of applications, whether accidentally or intentionally.
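How tools4j-config performs this check internally is not shown in this post. Purely as an illustration of the idea, the sketch below validates a provisioned string against a declared field type using plain JDK conversions. The TypeCheck class and its check method are hypothetical and only cover three types.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical illustration; not how tools4j-config is implemented.
class TypeCheck {

    // Returns null when the value fits the type, otherwise an error message.
    static String check(Class<?> type, String value) {
        try {
            if (type == Integer.class) {
                Integer.valueOf(value);
            } else if (type == URL.class) {
                // toURL() rejects relative URIs such as "/dev/null"
                new URI(value).toURL();
            } else if (type != String.class) {
                return "unsupported type " + type.getName();
            }
            return null;
        } catch (IllegalArgumentException | URISyntaxException | MalformedURLException e) {
            // NumberFormatException is an IllegalArgumentException, so it lands here too
            return "value " + value + " not matching type " + type.getName();
        }
    }
}
```

With this sketch, both "/dev/null" for a URL and "nonsense" for an Integer produce an error message, mirroring the two failures above.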

Assuming values are corrected, we can switch back to the application’s perspective and read the user instance.

Database user = runtime.get("user", Database.class);

The user instance is now initialized with values provisioned by the administrator and the application can read it without having to redeploy itself or restart the JVM.

At this very basic level it is important to notice a couple of things. Neither developer nor administrator made any assumptions on the runtime environment, nor did they know/care from where or how configuration was read/written.

Our quickstart use-case is complete but let’s dive a little deeper in order to understand the modelling capabilities of tools4j-config.

Built-in types
Configurable fields can have any of the following types.

  • java.lang.String
  • java.lang.Number and derived types
  • java.lang.Boolean
  • java.lang.Enum and derived types, including user-defined ones
  • java.util.Date
  • java.util.Currency
  • java.util.Locale
  • java.io.File
  • java.net.InetAddress
  • java.net.URL
  • java.net.URI
  • javax.xml.datatype.Duration

Fields can also be declared as a java.util.Collection implementation, generified with any of the above types. Fields initialized at declaration are considered default values (provisioned values take precedence). It is possible to declare user-defined types, but it requires some extra effort and will be covered in a future post.

All of these declarations are valid (annotations omitted for brevity).

// 1, 2, 3 is easy
List<Integer> counting = Arrays.asList(1, 2, 3);
// mathematical variable
Double x;
// measuring performance
TimeUnit precision = TimeUnit.NANOSECONDS;
// 5:01 developers
Set<Day> working = new HashSet<Day>(Arrays.asList(MON, TUE, WED, THU, FRI));
// content tags
List<String> labels;
// forever young
Date young = new Date(Long.MAX_VALUE);

References
Configurable classes can have references to other configurable classes, including themselves (recursive relationships). Circular references are allowed as well (person ‘a’ is the best friend of person ‘b’ and usually vice versa).

Declaring references is identical to regular Java types.

@Config(desc = "An individual (or application program) identity")
public class User {
@Id(desc = "username")
private String username;
 
@Config(desc = "password")
private String password;
 
@Config(desc = "Roles assigned to this user")
private Set<Role> roles = new HashSet<Role>();
}
 
@Config(desc = "Role for the permission to access a set of resources")
public class Role {
@Config(desc = "Permissions to access a set of resources")
private EnumSet<Privilege> privileges = EnumSet.noneOf(Privilege.class);
 
@Config(desc = "Roles assigned to this role")
private Set<Role> roles = new HashSet<Role>();
}

Referential integrity is enforced in order to keep relationships consistent. It is not possible to create references to instances that do not exist, or to remove instances that other instances still have references to.
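The two rules can be sketched with a small in-memory store. This is a simplified, hypothetical illustration and not the framework's implementation; ReferenceStore and its methods are invented names, and unlike tools4j-config it cannot express circular references between two different instances because targets must exist at creation time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory store that enforces referential integrity.
class ReferenceStore {
    // instance id -> ids of the instances it refers to
    private final Map<String, Set<String>> references = new HashMap<>();

    // Rule 1: every referenced instance must already exist (self-references are fine).
    void create(String id, String... refs) {
        for (String ref : refs) {
            if (!ref.equals(id) && !references.containsKey(ref)) {
                throw new IllegalStateException(id + " refers to missing instance " + ref);
            }
        }
        references.put(id, new HashSet<>(Arrays.asList(refs)));
    }

    // Rule 2: an instance cannot be removed while others still refer to it.
    void delete(String id) {
        for (Map.Entry<String, Set<String>> e : references.entrySet()) {
            if (!e.getKey().equals(id) && e.getValue().contains(id)) {
                throw new IllegalStateException(id + " is still referenced by " + e.getKey());
            }
        }
        references.remove(id);
    }

    boolean exists(String id) {
        return references.containsKey(id);
    }
}
```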

Administrating references is almost identical to provisioning regular Java type values.

BeanId deployerRoleId = BeanId.create("deployer", Role.class.getName());
Bean deployer = Bean.create(deployerRoleId);
 
BeanId assemblerRoleId = BeanId.create("assembler", Role.class.getName());
Bean assembler = Bean.create(assemblerRoleId);
 
BeanId adminRoleId = BeanId.create("administrator", Role.class.getName());
Bean adminRole = Bean.create(adminRoleId);
adminRole.addReference("roles", deployerRoleId);
adminRole.addReference("roles", assemblerRoleId);
 
BeanId adminId = BeanId.create("admin", User.class.getName());
Bean administrator = Bean.create(adminId);
administrator.addReference("roles", adminRoleId);
administrator.setProperty("password", "xxxxx");

Inheritance
Configurable classes support inheritance and configurable fields will be inherited from their parent class, enabling reuse of configurable fields and methods.
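As a rough picture of what inheriting fields means in practice, the sketch below walks a class hierarchy with reflection and collects the fields declared at every level. FieldCollector, BaseService and HttpService are hypothetical names; a real implementation would presumably also filter on the @Config annotation, which is left out here to keep the example free of dependencies.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical configurable classes, annotations omitted.
class BaseService {
    String host;
}

class HttpService extends BaseService {
    Integer port;
}

// Hypothetical helper that gathers fields from a class and all its parents.
class FieldCollector {

    static List<String> fieldNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        // getDeclaredFields() only sees one level, so climb the hierarchy manually.
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field field : c.getDeclaredFields()) {
                if (!field.isSynthetic()) {
                    names.add(field.getName());
                }
            }
        }
        return names;
    }
}
```

Calling FieldCollector.fieldNames(HttpService.class) yields both port and the inherited host.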

Validation
Tools4j-config integrates with JSR-303 Bean Validation to help developers to further constrain the premises under which configuration may be provisioned.

To enable Bean Validation we need to add the following dependencies to both maven projects mentioned earlier.

<dependency>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
<version>1.0.0.GA</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-validator</artifactId>
<version>4.1.0.Final</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.deephacks.tools4j</groupId>
<artifactId>config-provider-jsr303</artifactId>
<version>0.0.1</version>
<scope>runtime</scope>
</dependency>

Next follows a hypothetical example where Bean Validation constraints are used to make sure that three properties of binary search trees are satisfied.

  • The left subtree of a node contains only nodes with keys less than the node’s key.
  • The right subtree of a node contains only nodes with keys greater than the node’s key.
  • Both the left and right subtrees must also be binary search trees.
@Config(desc = "A binary tree")
@BinaryTreeConstraint
public class BinaryTree {
@Id(desc = "id of current node")
private String id;
 
@Config(desc = "value of current node")
@NotNull @Min(1)
private Integer value;
 
@Config(desc = "left child")
private BinaryTree left;
 
@Config(desc = "right child")
private BinaryTree right;
 
public Integer getValue() { return value; }
public BinaryTree getLeft() { return left; }
public BinaryTree getRight() { return right; }
public String toString() { return id + "=" + value; }
}
 
public class BinaryTreeValidator implements
ConstraintValidator<BinaryTreeConstraint, BinaryTree> {
 
public boolean isValid(BinaryTree n, ConstraintValidatorContext c) {
if (n.getLeft() != null && n.getValue() < n.getLeft().getValue()) {
String msg = n.getLeft() + " must be to right of " + n;
c.buildConstraintViolationWithTemplate(msg).addConstraintViolation();
return false;
}
if (n.getRight() != null && n.getValue() > n.getRight().getValue()) {
String msg = n.getRight() + " must be to left of " + n;
c.buildConstraintViolationWithTemplate(msg).addConstraintViolation();
return false;
}
return true;
}
 
public void initialize(BinaryTreeConstraint constraintAnnotation) { }
}
 
@Target({ TYPE })
@Retention(RetentionPolicy.RUNTIME)
@Constraint(validatedBy = BinaryTreeValidator.class)
public @interface BinaryTreeConstraint {
String message() default "";
Class<?>[] groups() default {};
Class<? extends Payload>[] payload() default {};
}
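Note that the validator above only compares a node with its direct children, so it relies on every node being validated in turn. For comparison, here is a dependency-free sketch of checking all three properties for a whole tree in one pass by carrying the allowed value range down the recursion. Node and BstCheck are hypothetical and independent of tools4j-config and Bean Validation.

```java
// Hypothetical plain tree node, unrelated to the BinaryTree bean above.
final class Node {
    final int value;
    final Node left;
    final Node right;

    Node(int value, Node left, Node right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class BstCheck {

    static boolean isBst(Node root) {
        return isBst(root, Long.MIN_VALUE, Long.MAX_VALUE);
    }

    // Every value in the subtree must lie strictly between low and high.
    private static boolean isBst(Node node, long low, long high) {
        if (node == null) {
            return true; // an empty subtree is trivially a search tree
        }
        if (node.value <= low || node.value >= high) {
            return false;
        }
        // Left values must stay below this node, right values above it.
        return isBst(node.left, low, node.value)
                && isBst(node.right, node.value, high);
    }
}
```

The range is what catches a grandchild that sits correctly relative to its parent but on the wrong side of an ancestor.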

Tools4j-config is fully compatible with JSR-303 and supports any combination of the constraints that Bean Validation provides.

import javax.validation.constraints.Max;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Past;
import javax.validation.constraints.Pattern;
import javax.validation.constraints.Size;
 
@Config @NotNull @Size(min=1, max=5)
List<Person> persons;
 
@Config @NotNull @Size(min=8, max=25)
String password;
 
@Config @Max(value=200, message="Area too large")
Integer area() { return length * width; }
 
@Config @Past
Date time;
 
@Config @Pattern(regexp="^[\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[a-zA-Z]{2,4}$" , message="Bad email")
String email;

We will not explore validation possibilities further, but please do so on your own. JSR-303 provides a great way of reusing constraint-related efforts.

What’s next?
Tools4j-config encourages developers to be very precise about defining configurable parts, strengthening guarantees that applications will operate on correct and meaningful data. Tools4j-config also enables productivity, leaving room to focus on domain concerns and not forcing a lot of boilerplate code around administration, persistence and many other unrelated concerns.

Tools4j-config has a lot more functionality than this post could hold – so this is the end. But the following topics will be explored in the future.

  • Persistence and data distribution

    How to choose between different ways of storing configuration such as XML files, SQL, NoSQL and more.

  • Compile-time checks

    How to catch schema faults at compile-time.

  • Administration interfaces

    How to provision configuration using interfaces such as JAX-WS, JAX-RS, auto-generated GUI, CLI and more.

  • Administration users and roles

    How to define users and enforce roles for administration.

  • Sessions and atomic commits

    How to make several changes in the scope of a session and commit changes atomically.

  • Schema discovery

    How to explore the configuration schema at runtime.

  • Change notifications

    How applications can subscribe for configuration changes.

  • Automatic documentation generation

    How to generate operational documentation from code.

  • Performance, scalability and consistency

    How to support low-latency high-performance centralized management in environments of scale with hundreds of thousands of configuration instances.

  • Java EE and OSGi integration and portability

    How to integrate tools4j-config in Java EE, OSGi, Spring, CDI and other programming models and frameworks.

  • Extendability

    How to provide tailor-made implementations using the SPI mechanism and verify them against the TCK.

Please visit the website if you think tools4j-config seems interesting.

Categories: Java, Java EE, open source, tools4j-config Tags:

Open Source Culture and Ideals

April 14th, 2012 No comments

Watching the commercial software industry frenetically trying to make process work occasionally feels like watching a tragicomedy. This constant struggle of trying to force their feet into these tailor-made development processes left and right: agile, lean, scrum, kanban, even waterfall, whatever, just because some manager or “tech-lead” read a blog post. The Emperor’s New Groove, eh?

It can be a really shocking experience to observe this rather bizarre circus sometimes.

But what is even more astounding is that these corporations often seem ignorant of open source culture and ideals. Do they know they exist? Are they ignorant? Or are decentralization and meritocracy so scary (even threatening) to the leaders of the organization? Maybe. Who knows. Go figure.

But in the context of corporations, I cannot help thinking about people like Jim Whitehurst, President and CEO of Red Hat. This man seems to be a humble and smart guy, wanting the best for his people and company. What has Jim seen that so many others have not?

Well, if you work at a company that struggles with process without getting any real work done, you first have to realize that “Culture is King”. You will never get process right without culture, ideals and beliefs. One important difference between culture and process is that culture is never forced. Culture is more like a garden or family. Treat it carefully, give it freedom and nourishment to grow and eventually (with good seeds of course) it will flourish and process will emerge in a strong and natural way.

I’m not saying that open source is the answer per se; not everything must be open source, but open source culture accumulates so much timeless experience, healthy principles and working ethics. In contrast, corporations often lose invaluable information when people leave them, and they constantly struggle to rebuild expertise and educate their workforce.

Anyway, here is a short reading list (in no particular order) for those of you who want to know more about open source culture, community collaboration and (in my opinion) the most appealing and practical way of doing software development.

(Comments on each reading are not mine, but from its author(s) or other source(s))

Open Advice
Misc
Open Advice is a knowledge collection from a wide variety of Free Software projects. It answers the question what 42 prominent contributors would have liked to know when they started so you can get a head-start no matter how and where you contribute.

The Open Source Way
Red Hat Community Architecture team
Guide for helping people to understand how to and how not to engage with community over projects such as software, content, marketing, art, infrastructure, standards, and so forth. It contains knowledge distilled from years of Red Hat experience.

Open Source Community Values
Jeff Cohen
Welcome to Our Community. Here Are the Ground Rules.

The Art of Community: Building the New Age of Participation
Jono Bacon
Will help you develop the broad range of talents you need to recruit members to your community, motivate and manage them, and help them become active participants.

Producing Open Source Software
Karl Fogel
A book about the human side of open source development. It describes how successful projects operate, the expectations of users and developers, and the culture of free software.

Open Sources: Voices from the Open Source Revolution
Misc
Leaders of Open Source come together for the first time to discuss the new vision of the software industry they have created. The essays in this volume offer insight into how the Open Source movement works, why it succeeds, and where it is going.

Debian Constitution
The Debian Project
This document describes the organisational structure for formal decision-making in the Debian Project.

The Cathedral and the Bazaar
Eric Steven Raymond
Surprising theories about software engineering suggested by the history of Linux.

The Art of Unix Programming
Eric Steven Raymond
This book has a lot of knowledge in it, but it is mainly about expertise. It is going to try to teach you the things about Unix development that Unix experts know, but aren’t aware that they know.

How To Ask Questions The Smart Way
Eric Steven Raymond
In the world of hackers, the kind of answers you get to your technical questions depends as much on the way you ask the questions as on the difficulty of developing the answer. This guide will teach you how to ask questions in a way more likely to get you a satisfactory answer.

How the ASF works
Apache Software Foundation
Will give you everything you always wanted to know about ASF but were afraid to ask.

Apache Subversion Community Guide
Subversion Community
Subversion community participation guidelines.

Python Community Diversity Statement
Python Community
The Python Software Foundation and the global Python community welcome and encourage participation by everyone. Our community is based on mutual respect, tolerance, and encouragement, and we are working to help each other live up to these principles.

Eclipse Development Process
The Eclipse Foundation
This document describes the Development Process for the Eclipse Foundation.

Ubuntu Code of Conduct
Ubuntu Community
This Code of Conduct covers our behaviour as members of the Ubuntu Community, in any forum, mailing list, wiki, web site, IRC channel, install-fest, public meeting or private correspondence.

Mozilla Code of Conduct (draft)
Mozilla Foundation
This Code of Conduct covers our behaviour as members of the Mozilla Community, in any forum, mailing list, wiki, web site, IRC channel, bug, event, public meeting or private correspondence.

Communities of practice
Etienne Wenger
This brief and general introduction examines what communities of practice are and why researchers and practitioners in so many different contexts find them useful as an approach to knowing and learning.

Teaching Open Source
Misc
This is a neutral collaboration point for professors, institutions, communities, and companies to come together and make the teaching of Open Source a global success.

Categories: business, ethics, open source, principles Tags:

Copy paste in urxvt

March 10th, 2012 No comments

I recently switched to a urxvt terminal on my Debian desktop but was a bit annoyed with how copy paste is handled and thought of adjusting it to my liking. But first a bit of background.

The X server has three selection buffers: PRIMARY, SECONDARY and CLIPBOARD. PRIMARY is (conventionally) where the current selection is copied to, and it is pasted from using the middle mouse button. CLIPBOARD, on the other hand, is primarily used by applications to copy a selection when users explicitly request it, such as choosing “copy” from a menu or pressing the C-c, C-x and C-v keyboard shortcuts.

Urxvt uses PRIMARY for copy/paste, which is a pain for me since my middle button is a bit sketchy when I click it, sometimes slipping and scrolling away accidentally. So I wanted to be able to paste the urxvt selection using the CLIPBOARD instead, i.e. paste using C-v.

Turns out this is pretty easy. First you need ‘xsel’, which is a command-line program for getting and setting the contents of the X selection.

$ sudo apt-get install xsel

Then you need to create the following Perl program, /usr/lib/urxvt/perl/clipboard, which takes the current urxvt selection and copies it into CLIPBOARD using xsel with the -b flag.

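#! /usr/bin/perl

# urxvt extension hook: called whenever the user has made a selection.
sub on_sel_grab {
    # Escape shell metacharacters in the selected text.
    my $query = quotemeta $_[0]->selection;
    # Turn newlines and carriage returns into escape sequences for echo.
    $query =~ s/\n/\\n/g;
    $query =~ s/\r/\\r/g;
    # Pipe the text into xsel (-i reads stdin; -b and -p name CLIPBOARD and PRIMARY).
    system( "echo " . $query . " | xsel -i -b -p" );
}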

The final piece of the puzzle is to activate this script by adding the following line to your ~/.Xdefaults.

urxvt*perl-ext-common: default,matcher,clipboard
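
To check that the setup works, a small shell sketch like the one below may help. It is not part of the original recipe: the function name show_clipboard is hypothetical, and it assumes that xsel is installed and that the command is run inside an X session. It prints whatever is currently stored in CLIPBOARD, so after selecting some text in a freshly started urxvt window that text should show up.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print the current
# contents of the X CLIPBOARD selection, or explain why that is not possible.
show_clipboard() {
    # xsel must be installed for this to work.
    if ! command -v xsel >/dev/null 2>&1; then
        echo "xsel is not installed" >&2
        return 1
    fi
    # xsel needs an X display to talk to.
    if [ -z "${DISPLAY:-}" ]; then
        echo "no X display available" >&2
        return 1
    fi
    # -o writes the selection to stdout, -b picks the CLIPBOARD selection.
    xsel -o -b
}
```

Select some text in urxvt, then run show_clipboard in any terminal. If nothing is printed, the extension is probably not loaded yet.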

This post was inspired by the ArchWiki, more specifically by Skottish.

Categories: Linux Tags:

Logo for tools4j-config

February 23rd, 2012 No comments

Tools4j-config is making steady progress, but there are still lots and lots of ideas and functionality to be implemented. At the moment I am working mostly on integration with OSGi and Eclipse RCP/SWT, which has been very time-consuming since I haven’t had any prior hands-on experience.

I spend roughly 20h a week on the project and it can be quite exhausting to be honest, especially since I do this on my own. There is a lot more involved than just happy hacking. I need to do field research, take notes, learn new technologies, write documentation and examples, test, manage releases and of course think long and hard about purpose, goals, design, conceptual integrity etc.

It is very easy to get carried away in one direction and then accidentally neglect the rest. But one good thing about having a tight time budget is that I have become (even more) obsessed with my productivity. Good tools truly are a key ingredient in the software development soup.

But the hardest part is the lack of feedback. I have only myself to trust that the project is going in the right direction, which can be a bit demoralizing sometimes. I have to tell myself to stay focused, have patience and work hard. On the upside, having no users means that the project can change easily, but I cross my fingers that the open source community will gradually begin to support me with feedback in the future.

It is not all about wanting success. The project is of course also a very stimulating and fun hobby. I love programming and this is something that I have wanted to do for a long time. I did not anticipate the great feeling of freedom that allows me to take my time to do stuff right. Very refreshing! I have also stopped watching all that junk on TV and doing mindless website surfing. There is still time left for friends, training and work. Blogging maybe not so much, eh?

Anyway, there is now also a logo for tools4j-config to give the project some identity and style.

I knew I wanted something fresh that would not bring old-fashioned hierarchical thinking to mind. With the logo I have tried to capture what I think configuration management is all about: decentralized, autonomous and organic patterns that tell how small pieces connect together to make a whole.

The picture was bought from www.shutterstock.com for a few dollars, then resized, given some text and converted to PNG using GIMP. Voilà!

What do you think?
[Image: tools4j-config logo]

Categories: misc, open source Tags:

tools4j-config

February 1st, 2012 No comments

I am proud to announce my open-source project tools4j-config; a project that will try to address configuration concerns in Java once and for all.

I have seen and heard about far too many projects that handle configuration carelessly, causing endless headaches when put into production. Some notable nuisances are non-uniform interfaces, unmanageable structural and data changes in disparate sources, configuration whose intent and correlation to system concepts are diffuse, and a lack of documentation.

Tools4j-config is my reaction, born from scratching an itch: trying to help developers, operators and administrators (devops) manage configuration cooperatively. This is the starting point and an honest attempt to implement a framework that handles these concerns in a simple, productive, uniform, extensible and portable way.

The mission statement and motivation for tools4j-config is taken from the announcement on freecode.com and goes something like this.


Tools4j-config supports long-running enterprise Java applications with a framework for handling configuration changes without restarting themselves.

It also aids in developing applications which are decoupled from knowing how and where to store, retrieve, and validate configurations.

The aim is to liberate applications to use configurations seamlessly on the terms of their particular environment, without constraining them to Java SE, EE, OSGi, Spring, CDI, or any other programming model or framework.

Tools4j-config is a true open source project. Contributions of ideas or criticism on any collaborative level are highly appreciated and will never be neglected or considered too small. Committers are welcomed with open arms.

The information on tools4j-config is presentable, if a bit scarce at the moment, and will build up gradually towards a 1.0 release.

Sure been a long time coming but expect a lot more on the topic of Configuration Management from me this year :-)

Categories: Java, Java EE, open source Tags:

Personal gains from contributing to Open Source

January 2nd, 2012 1 comment

Many may find it difficult to understand why certain people spend a lot of their spare time producing stuff without being paid and then give it away for free. Is this altruism on the edge of stupidity or are there personal benefits gained from participating in such activities?

Charity and the joy of programming play a part, but they may not be the ultimate goal. The motives for participation are subjective, but it seems that many do it to boost their professional work in one way or another.

This is my attempt to seduce you into practicing the arcane arts of Open Source.

Contribution
Schools benefit greatly from reduced costs, and many students would not have had the opportunity to get a computer science degree without the wealth of information and experience found in open source. Many corporations certainly benefit from open source as well.

Yes, some people actually develop feelings of wanting to give something back. Maybe not trying to make a difference but simply showing a token of gratitude to a community providing such a strong foundation for learning and education to anyone in society.

Appreciation
Programmers want others to use their stuff. We are social beings and it feels good to hear someone express their appreciation for your work. Appreciation motivates the will to understand different points of view, reduces insecurity and allows you to put others before yourself. Collaboration and social interaction create a feeling of belonging, and coding for a community can make this activity even more energizing and enjoyable.

Companies sometimes have a tendency to give managers most of the props, which can be disappointing and demoralizing indeed. Reading emails of gratitude and receiving help from others can feel refreshing, especially for those who have been working under less gratifying conditions.

Self-education
This is your chance to work on the projects and problems that excite and inspire you the most. That is a strong motivator for doing your best and reaching creative heights.

It may seem scary to know that your work will be reviewed and criticized publicly. But this is a tool for improving your skills and strengthening your attitude and habits towards quality. You will not code sloppily knowing that your work will be accessible to anyone.

The larger projects that have survived for years and continue to evolve often have great leadership, organization and development guidelines. Technical skill is just one of the many things to observe and absorb. There is also a chance that you will join a team and learn from people that are many levels better than yourself.

Reputation
Open sourcing will build a public resume that is accessible to anyone. It looks good to have worked on open source projects, especially famous ones. Meritocracy has a tendency to arise, so offering bug fixes, improvements and ideas will earn you recognition from peers and users and enhance your reputation. But keep in mind that quality is key. People do not want to spend time on contributions that ignore the guiding principles just because the contributor was too lazy to read them.

Such a relationship can be quite stimulating compared with the typical interaction of trying to impress your manager, whose interest usually lies in delivering on time.

Transparency also feeds honest and humble communication, since nobody can hide bad or selfish decisions. Strong disagreements that might otherwise end in rudeness and cruelty behind closed doors are likely to be discussed more calmly when everyone knows that others are observing.

Control
Most people wish for freedom to control their lives. It can be incredibly frustrating to work on a project with budget constraints where software is rushed into an unmanageable mess. Reorganization and outsourcing can also seed feelings of disappointment and helplessness.

With open source you are no longer a victim of such circumstances. You are free to implement and improve the features you think matter, while users help with finding relevance and setting priorities.

Reuse
Most programmers develop an urge to not repeat themselves throughout their careers. Producing open source software is the freedom to truly reuse efforts when changing jobs (or starting your own company) and share them with anyone.

These intentions stimulate thinking using broader perspectives and designs that are cooperative, flexible and adaptable to different environments in order to maximize opportunities for reuse. Keeping users loyal often means maintaining version compatibility and upgradability. Having to deal with all this complexity will make you a better programmer.

And this is the right thing to do. Newton would have been proud to see this tradition of code-sharing and reuse. Reinventing wheels is a terrible waste of time and human skill.

Many view patents as the direct opposite: a threat that prevents reuse and slows programmers down. Patents also encourage a culture where people build barriers instead of helping each other. It is understandable that patents make the open source community frown.

Conclusion
Open source is very much about a community of freedom and sharing, and it is not hard to see why open source developers are often highly respected. Participation will introduce you to a community of incredibly talented, like-minded and caring people who may help improve your skills beyond imagination.

Unexpected and exciting job opportunities may indeed arise, maybe at a company that will give you the fortune to produce open source software and get paid at the same time.

Last but not least: you will help support an open world where liberty and justice are praised.

And with that I leave you with a thoughtful quote from the King Penguin :-)

Software is like sex: it’s better when it’s free.

- Linus Torvalds

Categories: business, ethics, open source Tags:

Zenburn

October 18th, 2011 No comments

I like coding at night.

Turned off phone, twitter, mail and facebook – quiet with no distractions or interruptions. Dimmed-down lights, but not dark. 3-4 hidden low-light sources do the trick. It is really easy for me to get into the zone under these conditions.

Typing speed increases, keyboard shortcuts come easy, not a log statement gets missed, bugs get slapped silly, clarity – there is no spoon, creativity flows, time disappears and nothing but the activity itself exists. A highly productive state of absolute concentration and focus.

Long coding sessions like this hurt my eyes since my blinking rate goes down dramatically under these conditions. A lot of bright colors on the screen is not an option for me. I need colors that are easy on the eyes and blend into the environment.

Zenburn is a low-contrast color scheme that was originally designed for vim, but there are themes for Eclipse, Emacs, bash and other tools as well. You can of course also use the palette as inspiration to adjust your environment to your liking.

Here is an example of what my desktop can look like during a coding session with zenburn’ish settings. I haven’t yet figured out how to get Ubuntu to tone down those bright grey colors around the edges of Eclipse though.

It may seem like a silly detail, but zenburn really helps me not to lose focus from tired and sore eyes. I can code for hours and hours with it and it is by far the best color scheme I have found.

What tricks/techniques do you use to get yourself into the zone and stay there for longer periods of time?

Categories: coding, misc Tags: