I was reading an email thread on the Maven Developers mailing list about standardizing Maven repositories and refactoring Maven Central. The discussion got me thinking, and I believe the Maven Central Repository should not be a repository of artifacts at all, but more a look-up service / directory with some basic information (think UDDI). Let me try to add some logic behind this:
Anyone who has used Apache Maven or worked on a project built using Maven knows about Maven Repositories. Repositories are a wonderful (and when they were introduced, innovative) concept for dependency management. Although I still work with Apache Ant scripts on some projects, I don’t think I’ve used anything but Maven on most new projects in the last 4 years. Ant, for that matter, also supports dependency management with Maven 2 POMs and repositories through Ivy.
Structurally, a Maven Repository is very simple: as far as Maven is concerned, it is simply a directory tree organized by group-id, artifact-id and version, with specific metadata and checksum files at the correct locations. It is a wonderful concept, and possibly one of the key reasons for the success of Maven as a SCM tool / methodology. This works great when you are working on a private project (even major enterprise projects within an organization). However, when you expand this to a global scale, I think the original structure / layout rests on a major, possibly flawed, assumption: that all artifacts are freely available to everyone, both IP and license wise, and that everything a project / library might ever need to depend upon is always available in Central.
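To make the directory-tree idea concrete, here is a minimal sketch of how coordinates map to a path in the default Maven 2 repository layout. The function names are mine, purely for illustration, and the sketch ignores special cases such as timestamped SNAPSHOT file names.

```python
from pathlib import PurePosixPath


def artifact_path(group_id, artifact_id, version, extension="jar", classifier=None):
    """Return the repository-relative path of an artifact (default Maven 2 layout)."""
    # Dots in the group-id become directory separators.
    group_dir = group_id.replace(".", "/")
    # An optional classifier (e.g. "sources") is appended after the version.
    suffix = f"-{classifier}" if classifier else ""
    file_name = f"{artifact_id}-{version}{suffix}.{extension}"
    return str(PurePosixPath(group_dir, artifact_id, version, file_name))


def checksum_path(path, algorithm="sha1"):
    """Checksum files sit next to the artifact, with the algorithm as an extra extension."""
    return f"{path}.{algorithm}"


jar = artifact_path("org.apache.commons", "commons-lang3", "3.12.0")
print(jar)
print(checksum_path(jar))
```

Because the path is fully determined by the coordinates, any plain web server or file share that serves this tree can act as a repository.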
As much as I like Open Source Software, and would love everything ever written in any programming language to be open sourced, the reality is that there are Intellectual Property, legal and financial issues to consider. Numerous folks, smarter than me, have pointed out that Maven Central is not very (for lack of a better term) Enterprise Friendly. Enterprises would not deploy their IP-sensitive or commercially licensed artifacts on Maven Central. Nor will most enterprises allow free rein over dependencies pulled from Maven Central; they want control over, and audits of, what their applications depend on. Then there is the issue of security: what if someone slips a trojan into code deployed from a free, open repository?
The enterprise problem can, for the most part and in my opinion, be solved with Repository Managers. There are a few good ones out there providing many features, including access control, proxying of remote repositories, support for custom deployment/release work-flows, etc. I guess the most notable one would be Nexus, with both open source (free) and commercial editions; and then there is Apache Archiva.
Maven Central is growing: at last count it was more than 175GB in size. With the ever-increasing popularity of Maven, it has to handle traffic from around the world, which puts pressure on storage, computing power and bandwidth. Maven devs have tried to handle the situation by blocking IPs that execute excessive requests against Central, a solution I do not believe is scalable, especially when you can have hundreds of developers NATed behind a corporate firewall proxy. There are also mirrors of Central (4 at the last check of the meta-data), but Maven does not select a mirror automatically (though there is a proposal for this); you have to set the mirror in your local settings.xml. The Maven developers suggest using a repository manager to handle this as well, but in this case I think a repository manager may defeat the purpose of having a central repository.
When you download a piece of software from a website, there is a license agreement smack in front of you, and in many cases you even have to accept the agreement before you are allowed to download the software. Even Eclipse makes you do that when new plug-ins and features are added to the installation. With Maven, however, you don’t have that check. That leaves software developers and corporations at real risk of pulling incompatible licenses into their software, and what about things like export control? I don’t believe there is a viable solution for this in Maven today, and it is one of the reasons why a number of applications do not publish their artifacts on Maven Central.
So, what am I suggesting? Maven Central should not just be a repository, but a directory that points to other repositories and may optionally host some artifacts itself (think DNS). The way I picture it, each group, artifact, version and classifier-level downloadable artifact within that hierarchy can be described by a meta-data file. The meta-data can extend the existing meta-data structure in Maven, or it can be as simple as plain text files, maybe even property files. Central itself should be replicated across the globe, either using unicast addresses or using application-dependent mechanisms over load-balanced instances. It may be the authoritative responder for some groups; for others it may hold non-authoritative information or simply redirect to an authoritative repository. Each of those repositories could be a single instance somewhere or a distributed group of instances behaving like Central. The meta-data for each artifact to be downloaded should include the location of the artifact, either in the same repository or at a remote location (which may or may not be a Maven repository at all, like a SourceForge.net download URL), along with license information, check-sums, etc.
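The DNS-like look-up described above could be sketched roughly as follows. Everything here is hypothetical: the `ArtifactRecord` and `Directory` types, the `resolve` function, the record fields and the example.com / example.org URLs are illustrative assumptions, not an existing Maven API or meta-data format.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactRecord:
    """Hypothetical per-artifact meta-data: where it lives, its license, its checksum."""
    coordinates: str   # "group-id:artifact-id:version"
    location: str      # download URL; need not point at a Maven repository
    license: str
    sha1: str


@dataclass
class Directory:
    """A directory node: authoritative for some artifacts, referring others elsewhere."""
    name: str
    records: dict = field(default_factory=dict)    # coordinates -> ArtifactRecord
    referrals: dict = field(default_factory=dict)  # group-id prefix -> Directory


def resolve(directory, coordinates, max_hops=5):
    """Follow referrals, DNS style, until an authoritative record is found."""
    group_id = coordinates.split(":", 1)[0]
    current = directory
    for _ in range(max_hops):  # hop limit guards against referral loops
        record = current.records.get(coordinates)
        if record is not None:
            return record
        # Pick the most specific group-id prefix that covers this group.
        matches = [p for p in current.referrals
                   if group_id == p or group_id.startswith(p + ".")]
        if not matches:
            return None
        current = current.referrals[max(matches, key=len)]
    return None


# Illustrative data only: a vendor-run directory that Central merely points to.
vendor = Directory("vendor", records={
    "com.example.tools:widget:1.0": ArtifactRecord(
        "com.example.tools:widget:1.0",
        "https://downloads.example.com/widget-1.0.jar",
        "Commercial", "0" * 40),
})
central = Directory("central",
                    records={"org.sample:lib:2.1": ArtifactRecord(
                        "org.sample:lib:2.1",
                        "https://central.example.org/org/sample/lib/2.1/lib-2.1.jar",
                        "Apache-2.0", "1" * 40)},
                    referrals={"com.example": vendor})

print(resolve(central, "com.example.tools:widget:1.0").location)
print(resolve(central, "org.sample:lib:2.1").location)
```

In this sketch Central answers directly for the artifact it hosts and hands the `com.example` group over to the vendor's directory, so the client needs no extra repository definitions to find either one.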
Maven clients can get a little smarter with the repository. For example, if an artifact requires a license code or password to be sent before it can be downloaded, the client can negotiate that or use trust mechanisms. The repository may also send down a license for the user to accept (like the Eclipse install software mechanism), which the client may auto-accept based on policy or present to the user to decide (the first time for an artifact); for instance, accept all ASL 2.0 artifacts but not GPL. Maven clients can also read and respect central ‘policy’ files, especially in the enterprise.
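A client-side license policy of that kind might look something like the sketch below. The `LicensePolicy` class and `Decision` enum are hypothetical, and using SPDX-style identifiers such as `Apache-2.0` for the license names is my assumption; nothing like this exists in Maven today.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ASK_USER = "ask-user"


class LicensePolicy:
    """Hypothetical policy: auto-accept some licenses, refuse others, ask about the rest."""

    def __init__(self, accepted, rejected):
        self.accepted = set(accepted)
        self.rejected = set(rejected)
        self.remembered = {}  # coordinates -> bool, answers the user gave earlier

    def decide(self, coordinates, license_id):
        # Explicit rejections win, so a central policy can forbid a license outright.
        if license_id in self.rejected:
            return Decision.REJECT
        if license_id in self.accepted:
            return Decision.ACCEPT
        # Only prompt the first time an artifact with an unlisted license is seen.
        if coordinates in self.remembered:
            return Decision.ACCEPT if self.remembered[coordinates] else Decision.REJECT
        return Decision.ASK_USER

    def remember(self, coordinates, accepted):
        """Record the user's answer so the same artifact is not prompted for again."""
        self.remembered[coordinates] = accepted


policy = LicensePolicy(accepted={"Apache-2.0"}, rejected={"GPL-3.0-only"})
print(policy.decide("org.sample:lib:2.1", "Apache-2.0"))
print(policy.decide("org.other:tool:1.0", "GPL-3.0-only"))
print(policy.decide("com.example.tools:widget:1.0", "Commercial"))
```

An enterprise could ship the accepted and rejected sets as a central policy file, leaving individual developers to answer only for licenses the policy does not cover.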
Now, what does this give us:
- A truly distributed location mechanism of software titles that Maven and other systems can use for their dependencies
- No single repository is too big: Central would probably end up being a collection of redirects, which makes it easier to host more instances of it
- Software publishers can host their own repositories and not have to upload software to Central, but still be reachable without everyone having to define additional repositories in their POMs
- Enterprise group ids can be part of Maven Central with redirects to repositories controlled by the Enterprise
- The system is scalable!
I think there are other benefits to this idea and, of course, being a rough first draft, it does need polishing.