Proposal

# Removing restrictions on valid characters in paths

### Summary

The data structure org.eclipse.core.runtime.IPath and its canonical implementation, org.eclipse.core.runtime.Path, impose restrictions on segment names that can be more restrictive than those for file names in the underlying file system. When Eclipse paths are used to represent file system paths, these restrictions prevent valid files from being added to the Eclipse workspace. This document describes the current set of restrictions in Eclipse 3.0, and proposes changes to lift these restrictions.

### Background

IPath is an abstract data structure supplied by the org.eclipse.core.runtime plug-in, and consists of the following parts:

• An optional device (a String)
• Zero or more ordered segments (represented as String)
• Optional leading, trailing, and UNC separators

To facilitate conversion between IPath and String instances, IPath reserves the colon (':') character as the device delimiter, and the forward and back slash ('/' and '\') characters as segment delimiters. The API javadoc for IPath.isValidSegment outlines the complete set of restrictions:

• the empty string is not valid
• any string containing the colon character (":") is not valid
• any string containing the slash character ("/") is not valid
• any string containing the backslash character ("\") is not valid
• any string starting or ending with a whitespace character is not valid
• all other strings are valid

Contrast this with the restrictions on file names in Unix (as defined in the "Base Definitions" volume of IEEE Std. 1003.1-2001, section 3.169 Filename):

• the empty string is not valid
• any string containing the slash character ('/') is not valid
• the strings "." and ".." have special meaning
• any string containing the null byte is not valid
• all other strings are valid

Thus, when Eclipse IPath objects are used to represent Unix file names, they are unable to represent file names containing '\' or ':', or file names with leading or trailing white space.
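
The two rule sets listed above can be compared side by side in code. The following is a minimal sketch, not the Eclipse implementation: the class and method names are hypothetical, and the whitespace test assumes that Character.isWhitespace matches what the javadoc means by "a whitespace character".

```java
final class SegmentRules {
    /** Eclipse 3.0 rules, as listed above for IPath.isValidSegment (illustrative only). */
    static boolean isValidEclipse30Segment(String s) {
        if (s.isEmpty()) {
            return false;
        }
        // ':' is reserved as the device delimiter, '/' and '\' as segment delimiters.
        if (s.indexOf(':') >= 0 || s.indexOf('/') >= 0 || s.indexOf('\\') >= 0) {
            return false;
        }
        // Assumption: "whitespace" is whatever Character.isWhitespace accepts.
        return !Character.isWhitespace(s.charAt(0))
                && !Character.isWhitespace(s.charAt(s.length() - 1));
    }

    /** Unix file name rules, as listed above; "." and ".." are special but not rejected here. */
    static boolean isValidUnixFileName(String s) {
        return !s.isEmpty() && s.indexOf('/') < 0 && s.indexOf('\0') < 0;
    }

    public static void main(String[] args) {
        String[] names = { "file.txt", "4:25PM", "a\\b", " padded " };
        for (String name : names) {
            System.out.println("\"" + name + "\" eclipse30=" + isValidEclipse30Segment(name)
                    + " unix=" + isValidUnixFileName(name));
        }
    }
}
```

Every name in the sample array except "file.txt" is accepted by the Unix rules but rejected by the Eclipse 3.0 rules, which is the gap this proposal aims to close.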

### Proposed Solution

Lifting the restriction on paths with leading or trailing whitespace and paths containing the '\' character is easily achieved by specifying a new constructor for creation of paths. Lifting the restriction on the ':' character is more difficult, since it is needed as the device delimiter on operating systems that support a device.

The solution must accommodate the two interesting categories of IPath users:

• Code that needs a platform-specific representation. In other words, someone that actually wants to open a stream on a file, delete a file, etc. They essentially need to translate between an IPath and a java.io.File.
• Code that is reading, storing, and manipulating file system paths, but does not need the platform-specific manifestation. In particular, there is a strong need for a serialized representation of paths in a platform-neutral way so that they can be later read and interpreted on a different platform (typically in the form of a path that is relative to some platform specific prefix represented by a variable).

IPath already acknowledges these two uses in its toString methods. The standard toString method creates a platform-neutral encoding of the path as a String. The toOSString method creates a platform-specific encoding suitable for passing to java.io.File or other API that deals directly with the file system.

The proposed solution is to introduce two factory methods for creating IPath instances that perform the inverse of the two string conversion methods:

• Path.fromOSString: A factory method that decodes a platform-specific string. For example, this will parse the output of a previous call to IPath.toOSString, or the value returned by java.io.File.getAbsolutePath.
• Path.fromPortableString: A factory method that decodes a platform-neutral string, such as the output of a previous call to IPath.toPortableString.

Since changing the behaviour of the existing toString method would cause too much breakage, a new method, toPortableString, will be introduced for creating a platform-neutral string representation of paths. The existing toString method will remain unchanged.

Most clients will use the platform-specific form of paths. The path can be converted to/from a platform-neutral representation when a path needs to be serialized in a portable fashion.

The platform-neutral encoding of paths (IPath.toPortableString) will allow all characters except slash ('/') in segment names, and include an optional device separated from the segments by a single colon character. Literal colon characters in path segments are escaped through doubling (one colon becomes two colons). The following are some examples of Windows file system paths and the corresponding platform-neutral encoding:

• "C:\folder\file.txt" becomes "C:/folder/file.txt"
• "C:folder\file.txt" becomes "C:folder/file.txt"
• "C:\folder\" becomes "C:/folder/"

Canonical Unix paths look identical to their platform-neutral encoding, except in the presence of segments containing the colon character. The following are some Unix paths and the corresponding encoding by IPath.toPortableString:

• "/etc/" is encoded as "/etc/"
• "/etc/passwd" is encoded as "/etc/passwd"
• "/etc/timeNowIs4:25:12PM" is encoded as "/etc/timeNowIs4::25::12PM"
• "c:/folder/file.txt" is encoded as "c::/folder/file.txt"

UNC paths, which typically have no device but have a double leading separator, will generally be encoded unchanged:

• "//Server/Volume" becomes "//Server/Volume"
• "//Server/TimeIs4:25:12PM" becomes "//Server/TimeIs4::25::12PM"

If for some reason a UNC path has a device, the device will precede the slashes:

• "C://Server/Volume" becomes "C://Server/Volume"
• "C://Server/TimeIs4:25:12PM" becomes "C://Server/TimeIs4::25::12PM"

This platform-neutral encoding unambiguously encodes all possible paths on all supported platforms. Most importantly, this toPortableString implementation is fully backward compatible with the Eclipse 3.0 implementation of IPath.toString for all paths that can be created in Eclipse 3.0. This means that clients who previously used toString for serializing paths can move to the new toPortableString/fromPortableString methods without migrating file formats.
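
As an illustration of the encoding rules above, the sketch below builds the platform-neutral string from the parts of a path. It is not the proposed Path implementation: the class name, the encode method, and its parameters are assumptions made for this example. In particular, the leadingSeparators count (0 for relative, 1 for absolute, 2 for UNC) is a hypothetical way of modelling the optional leading and UNC separators.

```java
import java.util.Arrays;
import java.util.List;

final class PortableEncoder {
    /**
     * Encodes path parts as a platform-neutral string (illustrative sketch only).
     *
     * @param device            device including its trailing colon, e.g. "C:", or null
     * @param leadingSeparators 0 = relative, 1 = absolute, 2 = UNC (hypothetical model)
     * @param segments          raw segment names; may contain ':' but not '/'
     * @param trailingSeparator whether the path ends with a separator
     */
    static String encode(String device, int leadingSeparators,
            List<String> segments, boolean trailingSeparator) {
        StringBuilder out = new StringBuilder();
        if (device != null) {
            out.append(device); // the single colon here marks the device
        }
        for (int i = 0; i < leadingSeparators; i++) {
            out.append('/');
        }
        for (int i = 0; i < segments.size(); i++) {
            if (i > 0) {
                out.append('/');
            }
            // Escape literal colons in segment names by doubling them.
            out.append(segments.get(i).replace(":", "::"));
        }
        if (trailingSeparator && !segments.isEmpty()) {
            out.append('/');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints C:/folder/file.txt
        System.out.println(encode("C:", 1, Arrays.asList("folder", "file.txt"), false));
        // Prints /etc/timeNowIs4::25::12PM
        System.out.println(encode(null, 1, Arrays.asList("etc", "timeNowIs4:25:12PM"), false));
        // Prints c::/folder/file.txt
        System.out.println(encode(null, 0, Arrays.asList("c:", "folder", "file.txt"), false));
    }
}
```

Because colons are only doubled inside segments, any path whose segments contain no colon encodes to the same string it would have produced without escaping, which is what keeps the format readable by code that expects Eclipse 3.0 output.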

The platform-specific Path factory method will impose the minimum platform-specific requirements needed to unambiguously parse all possible paths on that platform. The Windows implementation, for example, will interpret everything up to the first ':' as the device, and treat both '/' and '\' as path segment separators. No other rules will be imposed. Thus the existing restriction on paths that prevents path segments from having leading or trailing whitespace will no longer be enforced on any platform.
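
The Windows parsing rule just described can be sketched as follows. This is an illustrative stand-alone class, not the real Path.fromOSString: all names are hypothetical, UNC prefixes are not modelled, and empty segments produced by repeated separators are simply dropped.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowsPathSketch {
    final String device;          // e.g. "C:", or null when no colon is present
    final boolean absolute;       // true when the part after the device starts with a separator
    final List<String> segments;  // segment names, taken verbatim (whitespace is kept)

    private WindowsPathSketch(String device, boolean absolute, List<String> segments) {
        this.device = device;
        this.absolute = absolute;
        this.segments = segments;
    }

    /** Parses a Windows-style string using only the minimal rules described above. */
    static WindowsPathSketch parse(String osString) {
        String device = null;
        String rest = osString;
        int colon = osString.indexOf(':');
        if (colon >= 0) {
            // Everything up to and including the first ':' is the device.
            device = osString.substring(0, colon + 1);
            rest = osString.substring(colon + 1);
        }
        // Both '/' and '\' act as segment separators on Windows.
        rest = rest.replace('\\', '/');
        boolean absolute = rest.startsWith("/");
        List<String> segments = new ArrayList<>();
        for (String part : rest.split("/")) {
            if (!part.isEmpty()) {
                segments.add(part);
            }
        }
        return new WindowsPathSketch(device, absolute, segments);
    }

    @Override
    public String toString() {
        return "device=" + device + " absolute=" + absolute + " segments=" + segments;
    }

    public static void main(String[] args) {
        System.out.println(parse("C:\\foo"));     // device=C: absolute=true segments=[foo]
        System.out.println(parse(" a /b \\c"));   // leading and trailing spaces are preserved
    }
}
```

Note that no character other than ':', '/' and '\' receives special treatment, so a segment such as " a " passes through unchanged.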

As before, detailed validation of all legal characters and names on that platform will not be enforced. Some clients use technology such as Cygwin or Samba to mount foreign file systems on a platform. In these situations, path name rules for the local file system do not apply. While it is difficult to fully support these users, any additional platform-specific verification performed on paths causes further problems for these users. Imposing the absolute minimum requirements for unambiguously parsing paths allows the majority of users to function without further impacting the corner cases.

### API Details

The following existing methods on IPath and Path are affected:

• isValidPath/isValidSegment. The implementation of these methods will change to match the fromOSString factory method. In other words, path validity becomes a platform-specific issue. The specification will change to a more ambiguous wording stating only that certain characters are reserved on some operating systems. In implementation, it will just check for the device separator on operating systems that require it. The restriction on leading and trailing spaces in segment names will be removed on all operating systems that allow such paths.
• Path(String, String). This constructor previously had the unspecified behaviour of extracting the device from the second argument when no device argument was supplied. This behaviour will change to no longer parse the device from the second argument. This constructor will now handle literal colon characters in path segments on all platforms.
• IPath.append(String). This method implicitly constructs or interprets a path literal from the provided parameter. This method will use the os-specific rules when interpreting the provided path ('\' and ':' treated as segment and device delimiters on Windows only).

New methods for Eclipse 3.1:

• Path.fromPortableString. A factory method for producing an IPath given a platform-neutral encoding of a string. This is the inverse of the IPath.toPortableString method. In particular, double colons will be interpreted as single colon characters in segment names, and the first single colon (if any) will be treated as the device separator.
• IPath.toPortableString. A method for producing a platform-neutral encoding of a path as a string, suitable for storing in files that need to be platform-neutral. This encoding will escape literal colons in path segments using a double colon.
• Path.fromOSString. A factory method for producing an IPath given a platform-specific encoding of a string. This is the inverse of the IPath.toOSString method. On Windows the colon character is treated as the device separator, and both varieties of slash are treated as segment separators.
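
The decoding rule given for Path.fromPortableString can be sketched as follows. This is a simplified stand-alone illustration, not the actual implementation: the class and field names are hypothetical, UNC prefixes and trailing separators are not modelled, and the device test only inspects the first colon in the string, which is sufficient for strings produced by the encoding described in this document.

```java
import java.util.ArrayList;
import java.util.List;

final class PortableDecoder {
    final String device;          // e.g. "C:", or null
    final boolean absolute;
    final List<String> segments;  // segment names with "::" contracted back to ":"

    private PortableDecoder(String device, boolean absolute, List<String> segments) {
        this.device = device;
        this.absolute = absolute;
        this.segments = segments;
    }

    /** Decodes a platform-neutral string (illustrative sketch only). */
    static PortableDecoder decode(String portable) {
        String device = null;
        String rest = portable;
        int colon = portable.indexOf(':');
        // A first colon that is not doubled is the device separator.
        boolean singleColon = colon >= 0
                && (colon + 1 == portable.length() || portable.charAt(colon + 1) != ':');
        if (singleColon) {
            device = portable.substring(0, colon + 1);
            rest = portable.substring(colon + 1);
        }
        // Only '/' separates segments here; '\' is an ordinary character.
        boolean absolute = rest.startsWith("/");
        List<String> segments = new ArrayList<>();
        for (String part : rest.split("/")) {
            if (!part.isEmpty()) {
                segments.add(part.replace("::", ":")); // contract escaped colons
            }
        }
        return new PortableDecoder(device, absolute, segments);
    }

    @Override
    public String toString() {
        return "device=" + device + " absolute=" + absolute + " segments=" + segments;
    }

    public static void main(String[] args) {
        System.out.println(decode("C:/foo"));   // device=C: absolute=true segments=[foo]
        System.out.println(decode("C::/foo"));  // device=null absolute=false segments=[C:, foo]
    }
}
```

Unlike the Windows-specific parsing, this decoding behaves the same on every platform, which is what makes the encoding suitable for files shared between platforms.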

### What do we do with the Path(String) constructor?

This proposal introduces two factory methods that clearly distinguish platform-neutral and platform-specific encodings of paths. The difficult question is what to do with the old single-argument Path constructor. The two options are:

1. Leave the implementation of this constructor unchanged, but deprecate it. The advantage of this solution is that it does not break the API contract spelled out in the current Path constructor, which explicitly states how it handles ':' and '\' characters. The disadvantage is that this will require all callers of the existing Path constructors to migrate to one of the two path factory methods, depending on the origin of the path string being used. Clients that do not migrate to the new factory methods risk errors introduced when trying to construct IPath instances corresponding to file system paths that were previously treated as invalid. For example, the resources plug-in would allow introduction of resources with the ':' and '\' characters. Other plug-ins trying to create a path corresponding to those resources using the old constructors will fail. Experiments with this solution showed that plug-ins that failed to migrate to the new factory method were broken due to the unexpected introduction of previously invalid paths. This presents a bleak picture for backward compatibility, regardless of the fact that no API contracts are broken.
2. The second option is to change the existing single argument path constructor to be platform-specific. In other words, the Windows implementation of these methods would remain unchanged, but implementations on other platforms would stop treating ':' as the device separator, and no longer treat '\' as a path segment separator. This clearly violates the existing API specification of the Path constructor. On the positive side, this introduces very little breakage in practice. The net effect is of removing old restrictions on some operating systems. The only breakage will be caused to clients who use a device for some reason on all operating systems, and clients that need to construct IPath objects representing file system paths from platforms other than the one that the current Eclipse instance is running in. For example, a plug-in running on Linux would not be able to use the old constructors to create IPath objects representing files from a remote Windows system.

After investigating the implementation of both of the above approaches, the second option introduces the smallest breakage by far. For example, the first option requires changing almost all of the 600 references to the Path constructors found in the current edition of the Eclipse platform. The second option requires only a small set of localized changes in code that deals with serializing and deserializing paths in a platform-neutral manner. Based on testing the implementation of these two options, this proposal recommends option two.

### Examples

The following examples illustrate the behaviour of the various Path constructors and to*String methods.

Given the absolute path with device "C:" and single segment "foo", the following IPath methods will produce:

• toString -> "C:/foo"
• toPortableString -> "C:/foo"
• toOSString (Windows) -> "C:\foo"
• toOSString (Linux) -> "C:/foo"
• getDevice -> "C:"

Given the relative path with null device, and two segments "C:" and "foo":

• toString -> "C:/foo"
• toPortableString -> "C::/foo"
• toOSString (Windows) -> "C:\foo"
• toOSString (Linux) -> "C:/foo"
• getDevice -> (null)

Given the string "C:\\foo" (single backslash escaped in Java literal format), the following constructors will produce:

• fromOSString and Path(String) (Windows) -> Absolute path with device "C:" and single segment "foo"
• fromOSString and Path(String) (Linux) -> Relative path with null device and single segment "C:\foo"
• fromPortableString -> Relative path with device "C:" and single segment "\foo"

Given the string "C:/foo":

• fromOSString and Path(String) (Windows) -> Absolute path with device "C:" and single segment "foo"
• fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C:" and "foo"
• fromPortableString -> Absolute path with device "C:" and single segment "foo"

Given the string "C::/foo":

• fromOSString and Path(String) (Windows) -> Relative path with device "C:" and two segments ":" and "foo" (an invalid path)
• fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C::" and "foo"
• fromPortableString -> Relative path with null device and two segments "C:" and "foo"

### Other migration issues

All clients who store absolute IPath objects as platform-neutral strings in a serialized form (as produced by IPath.toString in Eclipse 3.0) should switch to the new fromPortableString/toPortableString methods rather than the Path constructor and the toString method. Backward compatibility with files written by Eclipse 3.0 is automatic: no changes to file formats or file format version numbers are required. Examples of files that contain string representations of paths and will need to migrate include the workspace .project and .classpath files.

### Other Observations

Under this proposal IPath.toPortableString and Path.fromPortableString are perfect inverses of each other. In other words, the expression

path.equals(Path.fromPortableString(path.toPortableString()))

will be true for all paths, and

string.equals(Path.fromPortableString(string).toPortableString())

will be true for all strings that represent canonical paths (strings with duplicate slashes or "." and ".." references will turn out differently). Furthermore, the Eclipse 3.1 implementation of Path.fromPortableString will be the perfect inverse of the Eclipse 3.0 implementation of IPath.toString.

On Unix, the toOSString and fromOSString methods will be inverses of each other. On Windows, the same can only be said for paths that do not contain colon or backslash characters within segment names (such paths are invalid on Windows anyway). Consider the following example:

    String input = "foo::bar";
    IPath pathOne = Path.fromPortableString(input);
    IPath pathTwo = Path.fromOSString(pathOne.toOSString());
    boolean same = pathOne.equals(pathTwo); // false on Windows

The input string represents a path with no device, and a single segment whose name is "foo:bar" (invalid on Windows). When this is output using toOSString, it is encoded as "foo:bar". The fromOSString method then interprets this as a path with device "foo:" and first segment "bar". Similar mangling occurs if you create a path with a segment containing the backslash character:

    String input = "foo\\bar";
    IPath pathOne = Path.fromPortableString(input);
    IPath pathTwo = Path.fromOSString(pathOne.toOSString());
    boolean same = pathOne.equals(pathTwo); // false on Windows

In this case, the input is a path with one segment whose name is "foo\bar". This is interpreted by fromOSString as a path with two segments "foo" and "bar". In other words, under this proposal you cannot reliably manipulate paths containing backslash or colon using to/fromOSString on Windows. This seems to be an acceptable limitation.