Should be possible to add new Serializers dynamically

(No description)


on 2011-02-23 19:48 *

By debasishg

Assigned to set to debasishg

Status changed from New to Accepted

on 2011-02-23 19:49 *

By debasishg

Viktor - will u pls expand a bit on your thoughts here ..

on 2011-02-23 21:27 *

By viktorklang

I could imagine something like this:

akka {
  serializers {
    SJSON = "fqn.to.SJSON.Serializer"
    Protobuf = "fqn.to.Protobuf.Serializer"

    myOwn = "fqn.to.my.own.Serializer"
  }

  remote {
    serializer = "myOwn"
  }
}

Then perhaps Serializer could be a sort of BiMap[AnyRef,Array[Byte]] so one can loop over the installed serializers:

for(s <- installedSerializers if s.canSerialize(msg)) s.serialize(msg)

for(s <- installedSerializers if s.canDeserialize(bytes)) s.deserialize(bytes)

Just brainstorming,

the general idea is to be able to create your own serializers without having to fork akka and add it that way. I can imagine people having their own formats, or MsgPack, Avro or something else.
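
A minimal sketch of the probing loop described above, assuming a hypothetical `Serializer` trait with `canSerialize`/`serialize` (these names follow the comment, not an existing Akka API). The first installed serializer that accepts the message wins, so the order of installation matters:

```scala
// Hypothetical sketch: none of these names are Akka API.
trait Serializer {
  def canSerialize(msg: AnyRef): Boolean
  def serialize(msg: AnyRef): Array[Byte]
}

// Illustrative serializer that only accepts Strings.
object StringSerializer extends Serializer {
  def canSerialize(msg: AnyRef): Boolean = msg.isInstanceOf[String]
  def serialize(msg: AnyRef): Array[Byte] = msg.asInstanceOf[String].getBytes("UTF-8")
}

// Fallback that uses standard Java serialization.
object JavaSerializer extends Serializer {
  def canSerialize(msg: AnyRef): Boolean = msg.isInstanceOf[java.io.Serializable]
  def serialize(msg: AnyRef): Array[Byte] = {
    val bos = new java.io.ByteArrayOutputStream
    val oos = new java.io.ObjectOutputStream(bos)
    try oos.writeObject(msg) finally oos.close()
    bos.toByteArray
  }
}

object Probe {
  // String is also java.io.Serializable, so the specific serializer must come first.
  val installedSerializers: Seq[Serializer] = Seq(StringSerializer, JavaSerializer)

  // Returns None when no installed serializer accepts the message.
  def serialize(msg: AnyRef): Option[Array[Byte]] =
    installedSerializers.find(_.canSerialize(msg)).map(_.serialize(msg))
}
```

Returning `Option` here is a design choice of the sketch; a real implementation might prefer to fail loudly when nothing matches.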

on 2011-03-03 17:39 *

By Jonas Bonér

Milestone changed from 1.1 to 1.2

Updating tickets (#595, #605, #429, #679)

on 2011-04-06 10:15 *

By Jonas Bonér

How is this one coming along? For 1.2 or later?

on 2011-04-06 10:49 *

By debasishg

When is 1.2 scheduled ? Will try to get it by 1.2 .. terribly terribly immersed in some crap right now.

on 2011-05-09 11:35 *

By debasishg

The idea is to have a set of serializers installed dynamically through configuration. Then the user can pick a serializer of his own choice while developing akka based applications. During startup, the list of serializers is stored in a Map and the user can do a get on the Map to pick the one of his choice. Is this the current thought ?

Viktor - what's your idea of the for comprehension and the s.canSerialize invocation ?

on 2011-05-09 17:39 *

By Anonymous

Yeah, that's my current thought, but don't take it as "the way to do it"™ :-)

When you receive a message in the remoting, unless we start to encode the encoding used for each message, we'll have to guess; that's what the for-comprehension and canSerialize do.
Feel free to solve it any way you find appropriate :-)

By the way, as part of this ticket, make it so that MessageSerializer isn't an object. It's all kinds of bad: mutating global state all over the place is messy. One should have to instantiate a MessageSerializer and provide it with a ClassLoader.

If you have any questions just ask.

Cheers,
√

on 2011-05-10 07:54 *

By debasishg

AFAIU the use case for pluggable serializers is

to allow users to plug in their own custom serializers and
to identify the exact serializer/de-serializer when remoting gets a message

Is this true ?

In that case will it be an efficient strategy to loop through the list of installed serializers every time we process a message in remoting ?

Alternatively can we think of a SerializerRegistry where users can register custom serializers along with the associated message type that can be serialized/de-serialized using them ? Of course it will be the users' responsibility to do this registration and subsequent de-registration of serializers. May be we can use Zookeeper for this registry implementation.

Just brainstorming ..

Cheers.
- Debasish

on 2011-05-10 09:16 *

By Anonymous

I'd go for the latter as long as it will be possible to register by parent types. So for example, if my messages implement my custom Streamable interface, it should be possible to register a serializer by the Streamable.class.
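
Registration by parent type could be sketched roughly as follows. `SerializerRegistry`, `Tick` and the serializer names are hypothetical, chosen only for illustration; the lookup uses `Class.isAssignableFrom` so that a binding for `Streamable` also covers every class implementing it:

```scala
// Hypothetical names throughout; a sketch, not an Akka registry.
trait Streamable                          // the custom marker interface from the comment
case class Tick(n: Int) extends Streamable

class SerializerRegistry {
  // Registration order is preserved so that earlier (more specific) bindings win.
  private var bindings = Vector.empty[(Class[_], String)]

  def register(clazz: Class[_], serializerName: String): Unit =
    bindings = bindings :+ (clazz -> serializerName)

  // A binding matches if the message's class is the registered type or a subtype of it.
  def serializerFor(msg: AnyRef): Option[String] =
    bindings.collectFirst {
      case (clazz, name) if clazz.isAssignableFrom(msg.getClass) => name
    }
}

object Demo {
  val registry = new SerializerRegistry
  registry.register(classOf[Streamable], "streamable")             // specific first
  registry.register(classOf[java.io.Serializable], "java-default") // generic fallback
}
```

The sketch is not thread-safe; it only shows the matching rule.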

on 2011-05-10 11:07 *

By debasishg

Thinking more about the problem .. How about this idea ..

Currently (from the akka docs) ..
"All messages that are sent to remote actors needs to be serialized to binary format to be able to travel over the wire to the remote node. This is done by letting your messages extend one of the traits in the ‘akka.serialization.Serializable’ object. If the messages don’t implement any specific serialization trait then the runtime will try to use standard Java serialization."

The problem is that currently all messages need to extend one of the traits which we have. We would like to make this open instead of having it constrained to one of our given traits. Why not make Serializer a typeclass as we have done for actor serialization ? Then anyone can define an instance of the typeclass and plug in his own serializer. Need to think more on how we can attach a serializer with the message implicitly .. But do you find any obvious bottleneck with this approach ?

on 2011-05-10 11:16 *

By viktorklang

If we store the serializers in an array, the JVM JIT will most likely do loop unrolling on it, so it won't be expensive, since there'll likely not be > 10 serializers installed.

Adding the serializer to each message is rather wasteful and that also means that we'll need to be able to send them over the wire and down to disk for durable mailboxes, which is essentially what I said above:

"unless we start to encode the encoding used for each message,"

Having to manually map (in config) is a no go, it's just too much overhead.

Let's say I want a message to be sent using MsgPack:

trait MsgPackMsg

case class SomeMessage(foo: String, bar: Bar) extends MsgPackMsg

class MsgPackSerializer extends Serializer {
  def canSerialize(a: Any) = a.isInstanceOf[MsgPackMsg]
}

If we implement it as typeclasses we're a bit hosed since the Java solution would most likely suck, or?

Then we also have the issue: what if I want to use MsgPack to serialize the message, but I want individual objects in the message to be serialized using other schemes? Like "Bar" above, what if that wants to use Protobuf?

More food for thought :-)

on 2011-05-10 13:20 *

By debasishg

continuing my trend of thoughts on typeclasses ..

Instead of passing the serializer with each message, we can have it fetched from an implicit which needs to be set to the particular instance of the typeclass. Your observation regarding the Java API can be addressed by having the custom implementation fetched explicitly from a static instance.

For the example ..

case class SomeMessage(foo: String, bar: Bar)

if the user wants to use serializer_1 for the whole message, but a different scheme for an embedded object, it is difficult to address since at any point in time we can have only one serializer in scope. But is this flexibility really required for real world usage ? The typeclass solution looks very clean if it solves the issue ..

Like to have your thoughts :)
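
For reference, a type class along these lines might look like the sketch below. `Format`, `Greeting` and `Wire` are hypothetical names, not existing Akka API. Scala callers pick the instance up implicitly, while a Java caller would have to pass the static instance explicitly. Implicit resolution needs the static message type at the call site:

```scala
// Hypothetical type class; names are illustrative only.
trait Format[T] {
  def toBinary(t: T): Array[Byte]
  def fromBinary(bytes: Array[Byte]): T
}

case class Greeting(text: String)

object Greeting {
  // Placing the instance in the companion object puts it in implicit scope.
  implicit val format: Format[Greeting] = new Format[Greeting] {
    def toBinary(g: Greeting): Array[Byte] = g.text.getBytes("UTF-8")
    def fromBinary(bytes: Array[Byte]): Greeting = Greeting(new String(bytes, "UTF-8"))
  }
}

object Wire {
  // The compiler selects the Format instance from the static type T.
  def send[T](msg: T)(implicit f: Format[T]): Array[Byte] = f.toBinary(msg)
  def receive[T](bytes: Array[Byte])(implicit f: Format[T]): T = f.fromBinary(bytes)
}
```

Explicit use, as a Java-style caller would need, is `Wire.send(Greeting("hi"))(Greeting.format)`.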

on 2011-05-10 23:57 *

By hossam.karim

Would it be possible to simply give the API client control over the serialization process? Something like Camel's DataFormat concept.


on 2011-05-11 01:09 *

By Anonymous

Why not just keep it simple:

public interface Serializer {
    void serialize(Object object, OutputStream out);
    Object deserialize(InputStream in);
}

You don't need to send the serializer on the wire along with each message. Just like you don't need to send the classes of the serialized messages. One can assume the Serializer is configured on all nodes in the cluster.
If you need different serializers for the different types of messages, one can easily create a CompoundSerializer which chooses the right serializer based on the object type.

on 2011-05-11 13:14 *

By viktorklang

We should definitely not implement it as side-effecting by default.

Also, we need to be able to pass a ClassLoader to the deserializer.
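
One common way to hand a ClassLoader to Java deserialization is to override `ObjectInputStream.resolveClass`. The sketch below assumes plain Java serialization; `ClassLoaderObjectInputStream` and `JavaBinary` are names chosen for this example:

```scala
import java.io._

// ObjectInputStream that resolves classes through a caller-supplied ClassLoader.
class ClassLoaderObjectInputStream(loader: ClassLoader, in: InputStream)
    extends ObjectInputStream(in) {
  override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
    try Class.forName(desc.getName, false, loader)
    catch {
      // Fall back to the default lookup, which also handles primitive types.
      case _: ClassNotFoundException => super.resolveClass(desc)
    }
}

object JavaBinary {
  def toBinary(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(obj) finally oos.close()
    bos.toByteArray
  }

  // Pure function from bytes to object; the ClassLoader is an explicit argument.
  def fromBinary(bytes: Array[Byte], loader: ClassLoader): AnyRef = {
    val in = new ClassLoaderObjectInputStream(loader, new ByteArrayInputStream(bytes))
    try in.readObject() finally in.close()
  }
}
```

Both methods return values instead of writing to a stream passed in by the caller, which keeps the API free of side effects.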

on 2011-05-13 22:29 *

By Jonas Bonér

The type class serialization requires the user to import them and use the remoting API explicitly.
How would that be done now that all remoting API will be removed, handled only internally, and configured only from the outside in a configuration file?

Also, I want the same solution to work with serialization of Actor as well, not just messages.

To reiterate the question:

if remoting and serialization should be only handled internally and never exposed to the user (more than through a config file)
therefore, the user can never import any serializer classes into his scope
how can then a type class approach work?

We have the same problem with the current serialization of Actor/ActorRef. In the new Akka 2.0 world these type classes are never used as intended either.

Thoughts?

on 2011-05-13 22:44 *

By Anonymous

Didn't know about these remoting changes .. The typeclass approach will not work. It has to be all reflection based.

If remoting and serialization are all handled internally, then what will the user specify in the config file ? A sample snippet will help.

Is there any place where the intended usage of pluggable serializers in 2.0 is documented ? I think I need to know the usecase in more detail before going into the implementation ..

on 2011-05-14 06:14 *

By debasishg

trying to clarify some of the doubts ..

What will the user specify in the config file ? Just the serializer class name ? Or a map [serializer, message] indicating which message type will be serialized using which serializer ? Otherwise how does the mapping take place ?

another alternative will be to ask the user to pack the serializer name along with the message

another alternative will be to use the strategy Viktor suggested of iterating thru the list of serializers installed and somehow find out which one can be used

Can u please elaborate a bit on your thoughts with a snippet of the config file that u r foreseeing ?
Also is there any doc on your plans of the changes in remoting that u r planning for 2.0 ?

Thanks.

on 2011-05-14 21:50 *

By Jonas Bonér

"Anonymous" - who is that? Could you please sign in so I know who I'm talking to?

on 2011-05-14 21:58 *

By Jonas Bonér

Here is a snippet of what I have right now. See the 'format' option, FQN to the Format object.

The idea is to add a virtual "address" when creating the Actor like this: actorOf[MyActor]("service-pi").
And then from the outside configure it how it should be deployed using a deployment config as below.
The idea is to write the code once, without thinking about remoting or clustering, and then configure it to run in a specific way as a deployment step.

akka {
  actor {
    deployment {

      # -------------------------------
      # -- all configuration options --
      # -------------------------------

      service-pi {                   # stateless actor with replication factor 3 and round-robin load-balancer
        router = "round-robin"       # routing (load-balance) scheme to use
                                     #     available: "direct", "round-robin", "random", "least-cpu", "least-ram", "least-messages"
                                     #     or:        fully qualified class name of the router class
                                     #     default is "direct";
        format = "akka.serializer.Format$Default$"
        clustered {                  # makes the actor available in the cluster registry
                                     #     default (if omitted) is local non-clustered actor
          home = "node:test-1"       # defines the hostname, IP-address or node name of the "home" node for clustered actor
                                     #     available: "host:<hostname>", "ip:<ip address>" and "node:<node name>"
                                     #     default is "host:localhost"
          replicas = 3               # number of actor replicas in the cluster
                                     #     available: integer above 0 (1-N) or the string "auto" for auto-scaling
                                     #     if "auto" is used then 'home' has no meaning
                                     #     default is '1';
          stateless = on             # is the actor stateless or stateful
                                     #    if turned 'on':               actor is defined as stateless and can be load-balanced accordingly
                                     #    if turned 'off' (or omitted): actor is defined as stateful which means replicatable through transaction log
                                     #    default is 'off'
        }
      }
    }
  }
}

But Viktor's idea of specifying them at a more top level and then referring to them by name might be better.

The example is for serializing Actors, but we need to find a solution that works equally well for serializing messages.

on 2011-05-14 22:15 *

By debasishg

The last anonymous is /me .. forgot to sign in - sorry!

This idea of separating the deployment from the design of the actor is indeed very good. With a virtual address mentioned while creating the actor, we can refer to it in the config file (as u have done). So, this, coupled with the Format object specification will work for actor serialization. In this scheme, you get the format object specification from this file and use reflection to load it through a classloader - is this the idea ?

For messages, there's an added complexity - we will have many message types for an actor. So we need to specify tuples (MessageType, Format) for every message type that the actor can handle. Or we can make the user bundle the Format object name along with the message itself - somewhat increase in the payload, but may be simpler to implement.

trait MessageBase {
  def serializerName: String
}

trait MyMessage extends MessageBase {
  val serializerName = "akka.serializer.Format$Default$"
  //..
}

In this case we have a definitive map of which format object to use while de-serializing a particular type of message. Thoughts ?
Viktor's idea of having a list of serializers registered through the config file also works. The problem is for every message we need to iterate the list and somehow find out which one can be used (not sure how though) ..

Thoughts ?
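
To make the payload cost of bundling the name concrete, here is a sketch of a wire envelope that prefixes each payload with the serializer name. `Envelope` is a hypothetical helper, not part of Akka. With `DataOutputStream.writeUTF`, the overhead is a 2-byte length plus the encoded name, and the sketch adds a 4-byte payload length:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical envelope: [name (writeUTF)][payload length (int)][payload bytes]
object Envelope {
  // Prefix the payload with the serializer name so the receiver does not have to guess.
  def wrap(serializerName: String, payload: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream
    val out = new DataOutputStream(bos)
    out.writeUTF(serializerName)
    out.writeInt(payload.length)
    out.write(payload)
    out.flush()
    bos.toByteArray
  }

  // Read the name back, then exactly as many payload bytes as were announced.
  def unwrap(bytes: Array[Byte]): (String, Array[Byte]) = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val name = in.readUTF()
    val payload = new Array[Byte](in.readInt())
    in.readFully(payload)
    (name, payload)
  }
}
```

A short numeric identifier per serializer would shrink the header further, at the price of keeping identifiers consistent across nodes.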

on 2011-05-14 22:21 *

By viktorklang

I still think that Iterating over a List of perhaps 2-5 entries will be far less overhead than the other solutions.
In the future we could even optimize this approach by caching type->appropriate serializer after first lookup.
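
The caching idea could look roughly like this sketch, which assumes that `canSerialize` depends only on the runtime class of the message. `CachingLookup` and the `scans` counter are hypothetical and exist only to show that the list is scanned once per class:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical trait, reduced to the one method the lookup needs.
trait Serializer { def canSerialize(msg: AnyRef): Boolean }

class CachingLookup(installed: Seq[Serializer]) {
  private val cache = new ConcurrentHashMap[Class[_], Option[Serializer]]()
  val scans = new AtomicInteger(0) // counts full scans, for demonstration only

  def serializerFor(msg: AnyRef): Option[Serializer] = {
    val clazz = msg.getClass
    val cached = cache.get(clazz)
    if (cached ne null) cached
    else {
      // First message of this class: scan the list once and remember the result.
      scans.incrementAndGet()
      val found = installed.find(_.canSerialize(msg))
      cache.putIfAbsent(clazz, found)
      found
    }
  }
}

object StringOnly extends Serializer {
  def canSerialize(msg: AnyRef): Boolean = msg.isInstanceOf[String]
}

object Demo {
  val lookup = new CachingLookup(Seq(StringOnly))
}
```

Negative results are cached as `None` too, so classes that no serializer accepts do not trigger a rescan.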

on 2011-05-15 12:10 *

By Patrik Nordwall

I would like to add versioning to the discussion. Even though maybe not implemented in the first version the design should take it into account. Msg can be of different versions (sent with one version and received with another) and might need additional, user defined, transformation when deserialized.

on 2011-05-15 22:35 *

By Jonas Bonér

I think iterating over a list of serializers would be quick enough. The question is how to define the 'canSerialize' method. It can be memoized if expensive.
Ideas:
1. Detect a trait/interface on the message like we do now (protobuf, sjson)
2. Specify a mapping in config:
<FQN message1> = protobuf
<FQN message2> = avro
etc.
Could get quite long though.

on 2011-05-15 22:38 *

By Anonymous

I can optimize the hell out of it, I think we should go with autodetection.

on 2011-05-15 22:39 *

By Anonymous

This is viktor btw, writing from cellphone

on 2011-05-15 23:32 *

By Jonas Bonér

You mean alt 1?

on 2011-05-15 23:39 *

By Anonymous

Yeah. In a future iteration we could allow for overrides in the config.
But that's for later. Wdyt?

/V

on 2011-05-16 01:07 *

By uboness

Adding the serializer info to the actual message classes is too intrusive to my taste. I'd much rather have these mappings configured:

f.q.n.Message1 = protobuf
f.q.n.Streamable = f.q.n.StreamableSerializer
f.q.n.Serializable = java-default

list order determines the precedence of the serializers.

on 2011-05-16 08:32 *

By Jonas Bonér

We can start with getting 1) to work and then add 2) later if needed.

on 2011-05-16 08:40 *

By debasishg

Ok .. so going by the current strategy of auto-detection, we do something like ..

trait Serializable {
  def toBytes //..
  def fromBytes //..
}
trait JsonSerializable extends Serializable {
  def toJson //..
  def fromJson //..
}
trait ScalaJson[T] extends JsonSerializable {
  //..
}
trait JavaBinary[T] extends Serializable {
  //..
}
case class FooMessage(..) extends ScalaJson[FooMessage] {
  // defines all serialization methods for the message
  // this needs to be done here since the user may not want to
  // serialize all fields of the message
}

Similarly for any other forms of serialization we adopt the same strategy as above. This can be one of the serialization formats that we provide with Akka or any other custom serializers.

As Viktor has suggested we can always provide overrides in the config file (may be later).

Here are some other issues ..

1. When we serialize the message we can invoke the toBytes() (or whatever) method directly - is this assumption correct ? No reflection ..
2. What about de-serialization ? When we de-serialize we need to know the serializer to use, which we currently store as SerializationScheme. We now need to get this from the user. Maybe we can store it as part of Serializable ?

trait Serializable {
  // other methods
  def serializer: String
}

Then we can load this using reflection and invoke fromBytes(bytes, classloader) directly ? This classloader is for loading the class of the de-serialized object.

on 2011-05-16 09:02 *

By Jonas Bonér

Putting the serializer field in Serializable sounds good to me.

on 2011-05-18 11:29 *

By Jonas Bonér

When do you think you could be done with this Debasish?

on 2011-05-18 11:50 *

By debasishg

I will be in my base this weekend, but will get only a few hours coding time. And I will need one more weekend. I expect to deliver this by the end of this month .. Is that ok ?

on 2011-05-19 09:51 *

By Jonas Bonér

Of course it is ok. Thank you. Very grateful.

on 2011-05-22 18:26 *

By debasishg

Ok .. my slow internet connection is timing out on maven central - hence thought of documenting what I am doing so far ..

The serializerName strategy in Serializable will not work, since I can't access it from within the de-serialize method. So here it goes ..

trait Serializable // marker ..

Every message will implement Serializable ..

trait Serializer extends scala.Serializable {
  def toBinary(o: Serializable): Array[Byte]
  def fromBinary(bytes: Array[Byte], classLoader: Option[ClassLoader] = None): Serializable
}

and we will have custom Serializer implementations, e.g.

trait JavaSerializer extends Serializer { //.. concrete impl
}

The config file (akka.conf) will have a mapping from serializer fqns to message fqns, starting from specific to generic, so that we can get the most specific serializer for a particular Message. Possibly there will be a mapping from "akka.serialization.Serializable" to a default serializer, which will kick in in case the user does not specify the mapping in the config file.

Now the facade object ..

object Serialization {
  def serialize(o: Serializable): Either[Exception, Array[Byte]] = {
   //.. will get the serializer from the mapping in the config file and reflectively invoke 
   // toBinary on the message
  }

  def deserialize(bytes: Array[Byte], classLoader: Option[ClassLoader]): Either[Exception, Serializable] = {
    //.. will get the serializer from the mapping in the config file and reflectively invoke 
    // fromBinary on the message
  }
}

Feedback ?

Question : Do we need to follow the same strategy for Actor serialization ? Or can actors be serialized internally using Akka implemented serializers, like we have now ? Just wondering if we need to expose actor serialization just like we do for messages .. Thoughts ??

on 2011-05-22 21:21 *

By viktorklang

The only problem I have with the code above is that I generally don't like mixing concerns like that. If I have an object that is implemented to use ProtobufSerialization, then there's no way I can replace that with Avro serialization without yanking out the old code and replacing it with new.
But that might be something that will not be a problem for anyone.

Thanks for taking on this task D, I really, really appreciate it.

on 2011-06-25 18:06 *

By Jonas Bonér

Component changed from None to cluster

Status changed from Accepted to Fixed

on 2011-08-09 08:28 *

By Jonas Bonér

Milestone changed from 1.2 to 2.0

on 2011-12-15 03:54 *

By viktorklang

Milestone changed from 2.0 to 2.0-M1