Non-uniform file encodings in the Eclipse Platform

Last modified: June 12, 2003

Plan item description: Eclipse 2.1 uses a single global file encoding setting for reading and writing files in the workspace. This is problematic when, for example, Java source files in the workspace use the OS default file encoding while XML files in the workspace use UTF-8. The Platform should support non-uniform file encodings. [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: User experience] (bug 37933, 5399)

The current situation is as follows:

ResourcesPlugin.getEncoding returns the default encoding for the workspace (the org.eclipse.core.resources.encoding preference value if available, otherwise the value of the file.encoding Java system property).
IFile.getContents/setContents work with byte streams - no encoding can be applied.
IFile.getEncoding tries to guess the file encoding (looking for the Byte Order Mark), which is not enough. Also, this API has no known client so far. This API method would be deprecated.
the Java compiler supports non-uniform encodings for Java source files, but in Eclipse it relies on ResourcesPlugin.getEncoding (same value for all sources).
the text editor framework supports setting the encoding for files being edited (setting a persistent property on the file resource), but there is no support for setting the encoding of multiple files simultaneously, and other components are not aware of the encoding settings.

Requirements

users should be able to set the encoding for a file or a default encoding for a container (folders, projects, the workspace root) and its children.
users should be able to share encoding settings in a team repository (metadata should reside in the project content area).
a file-specific encoding set by users takes precedence over a file contents-based encoding.
a file contents-based encoding takes precedence over the inherited encoding setting.

Proposed solution

The encoding for a resource (as returned by IResource.getCharset - see API changes) will be:

1. the encoding explicitly set by a client/user (with IResource.setCharset - see API changes), if any, or
2. for a file resource, the encoding discovered by an encoding interpreter associated with the file extension, if one exists and can determine the encoding, or
3. for a file resource, the file encoding determined by its Byte Order Mark, if it exists, or
4. the resource parent's encoding (except for the workspace root, whose encoding is equivalent to ResourcesPlugin.getEncoding()).
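
The lookup order above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: Node and its fields are hypothetical stand-ins for a resource and for the results of the discovery steps, not part of the proposed API.

```java
final class CharsetLookupSketch {
    /** Hypothetical stand-in for a resource; not the Eclipse IResource API. */
    static final class Node {
        final Node parent;          // null for the workspace root
        String explicitCharset;     // set by a client/user, or null
        String interpretedCharset;  // result of an encoding interpreter, or null
        String bomCharset;          // charset implied by a Byte Order Mark, or null

        Node(Node parent) {
            this.parent = parent;
        }
    }

    /** Walks the fallback chain: explicit, interpreter, BOM, then parent. */
    static String resolve(Node node, String workspaceDefault) {
        if (node.explicitCharset != null) return node.explicitCharset;
        if (node.interpretedCharset != null) return node.interpretedCharset;
        if (node.bomCharset != null) return node.bomCharset;
        if (node.parent == null) return workspaceDefault; // workspace root
        return resolve(node.parent, workspaceDefault);
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        Node project = new Node(root);
        project.explicitCharset = "UTF-8";
        Node file = new Node(project);
        // The file has no setting of its own, so it inherits from the project.
        System.out.println(resolve(file, "ISO-8859-1"));
    }
}
```

For containers, the interpreter and BOM fields simply stay null, so only the explicit setting and the parent are consulted.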

Regarding #2, an extension point would allow file format-aware encoding interpreters to register with the encoding discovery mechanism for specific file types (extensions), or to associate existing encoding interpreters with their own file extensions. Users would be able to associate more file extensions with the known interpreters (preference).

All clients, when creating character-based streams when reading/writing the contents of a file resource, should pass along the charset string obtained from IFile.getCharset instead of the one provided by ResourcesPlugin.getEncoding. Examples are: text editors, compiler, search, compare.
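
A minimal sketch of what this means for a client, using only java.io. The contents and charsetName parameters stand in for the values a client would obtain from IFile.getContents and the proposed IFile.getCharset; the class and method names are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CharsetAwareReadSketch {
    /** Decodes the whole stream using the charset reported for the file. */
    static String readAll(InputStream contents, String charsetName) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(contents, charsetName)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Four characters that occupy five bytes in UTF-8.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the file's own charset recovers four characters;
        // decoding with a mismatched global default yields five.
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8").length());
        System.out.println(readAll(new ByteArrayInputStream(bytes), "ISO-8859-1").length());
    }
}
```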

Also, setting the encoding for a resource would generate a resource change event, but only for the directly affected resource (if clients are interested in the effects that a change to a directory had on the files inside it, they will have to work that out themselves).

API changes

Added:

public void IResource.setCharset(String charsetName) throws CoreException

Sets the charset name for this resource. May be null, which sets it to default. For the workspace root, it sets the workspace's default encoding preference to the charset's canonical name (or to the default encoding, if null was provided).

public String IResource.getCharset() throws CoreException

Returns the name of the charset for this resource. For files, if none has been defined (with setCharset), returns the default charset. To determine the default charset, it tries to guess it by a) inspecting the file contents (BOM), b) calling the corresponding encoding interpreter (if any). Otherwise, the parent's charset is returned. For the workspace root, a charset corresponding to the workspace's default encoding preference is returned.
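
As an illustration of the BOM inspection step, the following sketch maps the leading bytes of a file to a charset name. It is an assumption about how such a check could look, not the platform's implementation, and it deliberately ignores UTF-32 marks.

```java
final class BomSniffSketch {
    /** Returns the charset name implied by a leading BOM, or null if none is recognized. */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(charsetFromBom(utf8WithBom));
    }
}
```

A null result means the BOM step could not decide, and the next source in the lookup applies.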

public boolean IResource.isDefaultCharset() throws CoreException

Returns true if the currently configured charset was not explicitly set by the user (that is, it has a default value, either guessed from the file contents or inherited from the parent).

public static final int IResourceDelta.ENCODING = 0x100000;

public String IResourceDelta.getNewCharset();

public String IResourceDelta.getOldCharset();

For notifying changes in file encodings. Both methods are only valid when getKind()==CHANGED and (getFlags()&ENCODING)!=0.
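
The validity condition can be sketched as a guard that a listener would apply before reading the old and new charsets. Delta here is a hypothetical stand-in for a resource delta, and the CHANGED value is assumed for illustration; only ENCODING uses the value proposed above.

```java
final class EncodingDeltaSketch {
    static final int CHANGED = 0x4;        // assumed value, for illustration only
    static final int ENCODING = 0x100000;  // flag value proposed in this document

    /** Minimal stand-in for a resource delta; not the Eclipse interface. */
    static final class Delta {
        final int kind;
        final int flags;
        final String oldCharset;
        final String newCharset;

        Delta(int kind, int flags, String oldCharset, String newCharset) {
            this.kind = kind;
            this.flags = flags;
            this.oldCharset = oldCharset;
            this.newCharset = newCharset;
        }
    }

    /** True only when the charset accessors would be valid for this delta. */
    static boolean isEncodingChange(Delta delta) {
        return delta.kind == CHANGED && (delta.flags & ENCODING) != 0;
    }

    static String describe(Delta delta) {
        if (!isEncodingChange(delta)) return "no encoding change";
        return delta.oldCharset + " -> " + delta.newCharset;
    }

    public static void main(String[] args) {
        Delta delta = new Delta(CHANGED, ENCODING, "US-ASCII", "MS932");
        System.out.println(describe(delta));
    }
}
```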

public interface IEncodingInterpreter {
	/** returns null if the charset cannot be determined. */
	public String interpretCharset(java.io.InputStream input);
}

Encoding interpreters will be associated with file types through a new core resources extension point. Users can associate additional file extensions via preferences.

The platform itself would provide implementations for XML and other popular (?) file formats.
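
As a rough idea of what an XML interpreter might do, the sketch below looks for an encoding pseudo-attribute in the XML declaration. The interface is redeclared locally so the example is self-contained; XmlDeclarationInterpreter is a hypothetical name, and the sketch assumes an ASCII-compatible encoding (it would not see a declaration written in UTF-16).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlInterpreterSketch {
    /** Local copy of the proposed interface, for a self-contained example. */
    interface IEncodingInterpreter {
        /** returns null if the charset cannot be determined. */
        String interpretCharset(InputStream input);
    }

    static final class XmlDeclarationInterpreter implements IEncodingInterpreter {
        private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

        public String interpretCharset(InputStream input) {
            try {
                // Read only a small prefix; the declaration must come first.
                byte[] head = new byte[256];
                int n = 0;
                while (n < head.length) {
                    int r = input.read(head, n, head.length - n);
                    if (r < 0) break;
                    n += r;
                }
                // ISO-8859-1 maps each byte to one char, so ASCII text survives.
                String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
                Matcher m = ENCODING.matcher(prefix);
                return m.find() ? m.group(1) : null;
            } catch (IOException e) {
                return null; // cannot determine
            }
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(
            new XmlDeclarationInterpreter().interpretCharset(new ByteArrayInputStream(xml)));
    }
}
```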

Deprecated:

public int IFile.getEncoding()
public int IFile.ENCODING_* constants

Encoding settings metadata

The encoding settings metadata will be stored inside the project's content area so it can be easily shared.

Scenarios

The user opens in an editor a text file whose contents were created using encoding "MS932" in a workspace whose default encoding is "US-ASCII". It was not possible to guess the file encoding automatically, so what the user sees is gibberish. The user figures out the cause of the problem and explicitly sets the encoding for that specific file to be "MS932". The editor will get notified and might offer the user the option of reloading the file contents.
The user gets the source code for a set of classes he/she needs to use, but the classes do not compile because the author used internal Java identifiers not supported in the user workspace's current encoding. The user then selects the offending source files and applies the correct encoding to them. The encoding change in the affected files will be reported in subsequent resource change events (to listeners and builders). Builders may recompute build state affected by changed encodings. Views depending on the contents of the affected files may decide to reload the contents using the new encoding.
If the user changes the encoding for a bunch of directories, only the directly affected resources will appear in the delta. Clients may want to re-read/re-build the contents of files whose parent changed encoding. Otherwise, the user will have to trigger a full build in the affected project. Refresh will not help.
The user moves a text file with default encoding to a directory which has a different encoding than the previous parent (which means the encoding for the file has changed). No encoding changes will be reported. Clients may want to re-read the file when it is moved if it reports a different encoding than the one originally used to load its contents.
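
Since a container's encoding change is reported only for the container itself, a client that cares has to work out which files are indirectly affected. A minimal sketch of that walk, assuming a hypothetical Node tree rather than the real resource API, and ignoring files whose encoding is determined from their contents:

```java
import java.util.ArrayList;
import java.util.List;

final class InheritedChangeSketch {
    /** Hypothetical stand-in for a file or container. */
    static final class Node {
        final String name;
        final boolean isFile;
        final List<Node> children = new ArrayList<>();
        String explicitCharset; // null means "inherit from parent"

        Node(String name, boolean isFile) {
            this.name = name;
            this.isFile = isFile;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Names of files below the container that inherit its charset. */
    static List<String> filesInheritingFrom(Node container) {
        List<String> result = new ArrayList<>();
        collect(container, result);
        return result;
    }

    private static void collect(Node node, List<String> out) {
        for (Node child : node.children) {
            // An explicit setting shields the child and everything below it.
            if (child.explicitCharset != null) continue;
            if (child.isFile) {
                out.add(child.name);
            } else {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        Node src = new Node("src", false);
        Node pinned = new Node("B.java", true);
        pinned.explicitCharset = "MS932";
        src.add(new Node("A.java", true)).add(pinned);
        System.out.println(filesInheritingFrom(src));
    }
}
```

A client could re-read or rebuild only the files returned by such a walk instead of requiring a full build.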