Hi!
I want to give a little update on the indexing prototype for Clangd I've been working on. I wanted to share the actual code on Github before I went on vacation but I ran out of time! Sorry about that!
Here's a short summary of the several components there now and what's in progress:
--ClangdIndexStorage--
malloc-like interface that allocates/frees data blocks of variable sizes on disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets), etc. The data is cached in 4K pieces so that local
and repeated accesses are all done quickly, in memory.
Clangd mallocs and writes its index model objects using this.
--BTree--
A pretty classic BTree implementation. Used to speed up lookup (symbol names, file names). It allocates its nodes using ClangdIndexStorage therefore it is stored on disk. Keys are actually records in ClangdIndexStorage
so you can really think of the BTree as a collection of sorted pointers (sorted according to a provided comparator).
The index model is not very rich yet but the idea is that lower level building blocks (ClangdIndexStorage, BTree) will be there so that we can start iterating.
--ClangdIndexFile--
Path + information about inclusion in order to be able to represent an include graph.
The include graph is used to know which files need reindexing for example when a header file changes and also which compilation database entry to use when opening a header in the editor. The first source file
including the header file is used to look up the entry in the compilation database. This will also be used for the "Include Browser" feature in the future.
--ClangdIndexSymbol--
USR + location (pointer to a ClangdIndexFile + start/end offsets)
This only represents definitions in source files for now. This is part of the indexing-based feature to "Open Definition".
This is likely to have more changes to represent declaration vs definition, references (Find references feature), call graph, etc.
--ClangdIndex--
Owns a BTree of ClangdIndexSymbol sorted by USR for fast lookup (used by Open Definition for now). getSymbol(USR) returns a ClangdIndexSymbol.
Also owns a BTree of ClangdIndexFile sorted by Path for fast lookup. As explained above, used to find proper compilation database entry and to react to header changes. getFile(Path) returns a ClangdIndexFile.
# Building the index
When Clangd gets the initialize request from the client, it is given a rootURI. It starts indexing each source files under this root, creating ClangdIndexFiles and ClangdIndexSymbols. This is done with the help
of index::IndexDataConsumer.
At the moment, the main covered scenarios are building the index from scratch and adding files. Support for modifying files and removing them is not fully implemented yet (but this is a must of course!).
In case you are wondering, there is no fancy caching of headers or preamble used while indexing, so the process is quite slow. I have been focusing on the model and persistence of the index versus the input
(parsing). This will have to be addressed too.
# Examples of using the index
When the user does the "Open Declaration" command, it retrieves the ClangdIndexSymbol from the ClangdIndex using the USR at the requested offset (sped up by the BTree). The location of the ClangdIndexSymbol
(if found) is then returned to the editor.
When the user opens a header file, it retrieves the ClangdIndexFile from the ClangdIndex using the path of the header (sped up by the BTree). Then it recursively finds which file includes it until there is no
more, at this point chances are that this is a source file. Use this source file path to find a potential entry in the compilation database (so that we gets proper compiler flags, etc).
This is just to give you a taste of what I have in mind and what kind of progress is being made. I'd like to have the "lower-level" parts ready for review soon after I come back from vacation (Aug 24th). I was
thinking that ClangdIndexStorage and BTree can go in first as they are quite isolated and unit-testable. The rest of the code will also be available on Github to show more concrete usage of them if necessary.
Regards,
Marc-André