Scribo KDE roadmap

From Mandriva Community Wiki


Scribo stands for Semi-automatic Collaborative Retrieval of Information Based on Ontologies. The project carries the label of the System@tic competitiveness cluster (http://www.systematic-paris-region.org). The project's Web site is [1]. Scribo is delivering: 1) a set of natural language processing engines capable of adding semantic annotations to documents (identification of named entities, coreferences, and relations between entities, term disambiguation, etc.); 2) tools for managing annotations on the KDE desktop; 3) applications dedicated to specific activities: activity management, Linux documentation annotation, and press article annotation.


This page is about the KDE roadmap of the Scribo developments in 2009 and 2010.


Note on Scribo analysis engines input/output

The NLP engines designed within Scribo take as input an XML representation of a document containing the headings and the body of the document. They return a set of annotations as an XML file that contains, for each annotation:

  • the text fragment the annotation relates to,
  • the context of the text fragment, i.e. n lines before the fragment, n lines after the fragment, and the position of the fragment in the input XML document section,
  • the content of the annotation as a set of triples serialized in XML.
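
As a rough illustration of the three parts listed above, the sketch below reads an annotation file into plain Python objects. The XML layout (the element names annotations, annotation, fragment, before, after, line, triple and the attributes position, s, p, o) is purely hypothetical, since this page does not specify the actual Scribo schema; only the overall shape (fragment, context, triples) comes from the description.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Annotation:
    fragment: str            # the text fragment the annotation relates to
    before: List[str]        # n context lines before the fragment
    after: List[str]         # n context lines after the fragment
    position: int            # position of the fragment in the input section
    triples: List[Triple] = field(default_factory=list)


# Hypothetical sample; the real Scribo output format may look different.
SAMPLE = """<annotations>
  <annotation position="30">
    <fragment>Paris</fragment>
    <before><line>The conference takes place in</line></before>
    <after><line>next spring.</line></after>
    <triple s="ex:fragment1" p="rdf:type" o="pimo:City"/>
  </annotation>
</annotations>"""


def parse_annotations(xml_text: str) -> List[Annotation]:
    """Parse the assumed XML layout into Annotation objects."""
    root = ET.fromstring(xml_text)
    result = []
    for node in root.findall("annotation"):
        result.append(Annotation(
            fragment=node.findtext("fragment", default=""),
            before=[line.text or "" for line in node.findall("before/line")],
            after=[line.text or "" for line in node.findall("after/line")],
            position=int(node.get("position", "0")),
            triples=[(t.get("s"), t.get("p"), t.get("o"))
                     for t in node.findall("triple")],
        ))
    return result


annotations = parse_annotations(SAMPLE)
```

Keeping the triples as plain tuples leaves the choice of RDF library open.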

Mandriva roadmap ideas

(All time estimates assume full days. A real day is seldom "full" of development time.)


Feature | Description | Todos | Workload estimation | Status
Capability to highlight fragments in HTML files: Just as in plain text files, we would also like to highlight the annotations in HTML files. This should be done both in the annotation window and in the Konqueror integration described on this page.
  • Input: text fragments with their annotations (stored in the Nepomuk db)
  • Result: highlighted text fragments in Konqueror, with one color per RDF type
  • Find a way to locate the annotations in the HTML, maybe by skipping the tags while counting characters and then doing a proximity search for the matched words.
  • Allow the annotation window to display both HTML and plain text, using QWebKit or just Qt's rich text support in QTextEdit.
Workload: 1 day for integrating HTML into the annotation window. Status: ongoing
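
The idea of skipping tags while counting characters and then doing a proximity search can be sketched as follows. This is only an illustration under simplifying assumptions: tags are recognised with a naive regular expression, character entities such as &amp; are not decoded, and the function names are invented for this example.

```python
import re
from typing import List, Optional, Tuple

TAG = re.compile(r"<[^>]*>")  # naive tag matcher, good enough for a sketch


def strip_tags_with_offsets(html: str) -> Tuple[str, List[int]]:
    """Return the visible text and, for each text character, its index in html."""
    chars: List[str] = []
    offsets: List[int] = []
    pos = 0
    for match in TAG.finditer(html):
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()  # skip the tag itself
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def locate(html: str, fragment: str, expected: int) -> Optional[Tuple[int, int]]:
    """Find the occurrence of fragment closest to the expected plain-text
    offset and return its (start, end) indices in the HTML source."""
    if not fragment:
        return None
    text, offsets = strip_tags_with_offsets(html)
    candidates = [m.start() for m in re.finditer(re.escape(fragment), text)]
    if not candidates:
        return None
    start = min(candidates, key=lambda c: abs(c - expected))
    end = start + len(fragment) - 1
    return offsets[start], offsets[end] + 1


html = "<p>Paris is nice. <b>Paris</b> in spring.</p>"
span = locate(html, "Paris", expected=14)
```

If a fragment spans several tags, the returned HTML range also contains those tags, so a real highlighter would have to split the range at tag boundaries.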
Creation of an annotation sidebar for Konqueror: Create a sidebar for Konqueror similar to the Firefox OpenCalais plugin. The plugin lets the user configure the analysis engine to be used (OpenCalais, Scribo, Alchemy, etc.). When the user gets a page analysed, the plugin highlights the identified text fragments in different colors.
  • creation of an action available from Konqueror for submitting a text to an external text analysis engine available as a Web service (see how the Dolphin action is implemented)
  • parse the results
  • display the results in a Konqueror sidebar presenting the annotations by type in a tree, and linking the tree nodes with the corresponding text fragments in the analysed HTML page.
Workload: 4-5 days
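
Presenting the annotations by type in a tree amounts to grouping the parsed results before handing them to the sidebar view. A minimal sketch, assuming each parsed result is reduced to a (type, fragment) pair; the function name and the sample types are illustrative only:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_type(annotations: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (type, fragment) pairs into a two-level tree: type -> fragments.

    Duplicate fragments under the same type are listed only once, and the
    types are sorted so that the tree is stable between runs.
    """
    tree: Dict[str, List[str]] = defaultdict(list)
    for rdf_type, fragment in annotations:
        if fragment not in tree[rdf_type]:
            tree[rdf_type].append(fragment)
    return {rdf_type: tree[rdf_type] for rdf_type in sorted(tree)}


tree = group_by_type([
    ("City", "Paris"),
    ("Person", "Alice"),
    ("City", "Lyon"),
    ("City", "Paris"),
])
```

Each leaf could additionally carry the position of its fragment, so that selecting a tree node scrolls to the corresponding text in the page.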
Capability to manually add semantic annotations to a URL or to text fragments in an HTML page: Manual annotations
  • add a manual annotation action (for a URL in the address bar or for a text fragment)
  • open the standard Nepomuk annotation window
  • store the annotation in the Nepomuk database
  • highlight the annotated fragment
  • update the annotation sidebar accordingly
Integrate automated text analysis results into Okular: The simplest way to implement this would be to add annotation support through Nepomuk/Scribo to Okular. Existing parts:
  • Okular supports manual annotations which are stored in an xml file.
  • Add an action in the menu to get the text analyzed by an external Scribo analysis engine,
  • Based on a Scribo analysis result:
    • Highlight the identified entities (map text + context to positions in the PDF), using configurable colors (one color for people, one for organizations, one for events, etc.)
    • Automatically add the annotations inferred by Scribo to the identified entities
    • Let the user accept / reject / modify the annotations
  • Make Okular store the annotations in the Nepomuk store
  • Put the system into practice using OpenCalais and other analysis engines.
Workload: Hard to say. This could take longer since it may involve changing some parts of Okular.
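
Two of the steps above, mapping text + context to a position and choosing one configurable color per entity type, can be sketched independently of Okular. The following works on plain extracted text only; mapping the resulting offset to page coordinates in the PDF is not covered. The function names, type names and color values are assumptions made for this example.

```python
from typing import Dict, Optional

# Illustrative defaults; in practice these would come from user configuration.
DEFAULT_COLORS: Dict[str, str] = {
    "Person": "#ffd54f",
    "Organization": "#81d4fa",
    "Event": "#a5d6a7",
}
FALLBACK_COLOR = "#e0e0e0"


def color_for(entity_type: str, colors: Dict[str, str] = DEFAULT_COLORS) -> str:
    """Return the highlight color configured for an entity type."""
    return colors.get(entity_type, FALLBACK_COLOR)


def find_with_context(text: str, fragment: str,
                      before: str = "", after: str = "") -> Optional[int]:
    """Return the offset of the occurrence of fragment whose surrounding
    text matches the given context, or None if there is no such occurrence."""
    if not fragment:
        return None
    start = text.find(fragment)
    while start != -1:
        end = start + len(fragment)
        left_ok = text[:start].rstrip().endswith(before.strip())
        right_ok = text[end:].lstrip().startswith(after.strip())
        if left_ok and right_ok:
            return start
        start = text.find(fragment, start + 1)
    return None


text = "Paris Hilton met the mayor of Paris yesterday."
offset = find_with_context(text, "Paris", before="mayor of")
```

Using the context rather than a raw character offset makes the lookup robust against differences between the text sent to the engine and the text extracted from the PDF.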
Manual annotations in Okular: Okular already provides annotations of some kind as mentioned above. This task will add manual annotations to the semi-automatic ones. This means that the user can select a passage in the PDF and then link this passage with some resource, tag it, or make it a resource (for example, selecting "Paris" and then stating that this is the city Paris). Integrate the manual annotation system into Okular. For a first prototype it should be sufficient to use a context menu and a separate dialog similar to the annotation window discussed above.
Workload: Once the annotation window and the semi-automatic annotations in Okular are done, this should be fairly easy and take 2 days.
CEA annotation engine integration: CEA provides another web service for text analysis. This service should be integrated into the Scribo framework as a plugin.
  • Write a plugin similar to the OpenCalais plugin
  • Tune the Scribo API to support CEA's features while staying generic enough to also handle OpenCalais and the simpler DERI keyword extraction (at the moment the API is tailored towards OpenCalais)
Workload: 4-5 days. Status: ongoing
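
To illustrate what "generic enough" could mean for the plugin API, here is a small sketch of an engine-independent plugin interface with a registry. It is not the Scribo API, whose actual classes are not described on this page; every name below is hypothetical, and the keyword plugin is a toy stand-in for a real engine.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Suggestion:
    fragment: str   # text the suggestion relates to
    rdf_type: str   # type proposed by the engine
    source: str     # name of the plugin that produced it


class AnalysisPlugin(ABC):
    """Common interface that every engine plugin would implement."""
    name: str

    @abstractmethod
    def analyze(self, text: str) -> List[Suggestion]:
        ...


class KeywordPlugin(AnalysisPlugin):
    """Toy engine: proposes every capitalised word as a keyword."""
    name = "keywords"

    def analyze(self, text: str) -> List[Suggestion]:
        words = [w.strip(".,;:") for w in text.split()]
        return [Suggestion(w, "ex:Keyword", self.name)
                for w in words if w[:1].isupper()]


class PluginRegistry:
    """Holds the available plugins and runs the ones the user enabled."""

    def __init__(self) -> None:
        self._plugins: Dict[str, AnalysisPlugin] = {}

    def register(self, plugin: AnalysisPlugin) -> None:
        self._plugins[plugin.name] = plugin

    def analyze(self, text: str, enabled: List[str]) -> List[Suggestion]:
        results: List[Suggestion] = []
        for name in enabled:
            results.extend(self._plugins[name].analyze(text))
        return results


registry = PluginRegistry()
registry.register(KeywordPlugin())
suggestions = registry.analyze("Scribo runs on KDE.", enabled=["keywords"])
```

A richer engine would simply return more detailed Suggestion objects through the same analyze() call, which is the property the second todo asks for.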
Integrate manual/automated annotation capabilities into KMail
  • Basically integrate all features mentioned above in KMail:
    • The text analysis
    • The annotation recommendation
    • The entity actions
    • The information extraction from Nepomuk
  • Since KMail does not use Akonadi yet, we need some simple mapping.
Status: ongoing
Define actions for extracted entities: For extracted entities such as cities or persons, actions can be defined. The simplest one could be to open Google Maps for the extracted city. Idea: create something like the mimetype actions? Maybe also using desktop files?
  • Create a framework for defining actions based on annotation recommendations and information in the Nepomuk store
  • Define a set of standard actions and tie them to typical RDF classes (such as pimo:City maybe or OpenCalais' city class)
  • Integrate the system into the test shell
  • Create test data
Status: ongoing


Done

Feature | Description | Todo | Workload (WL) | Status
Capability to highlight fragments in text files (with various colors): The automatic annotations created by the Scribo system mentioned above optionally relate to positions in the text. These positions should be highlighted. Existing parts:
  • A simple test app that can highlight text positions based on information from a model exists. It uses QTextEdit::setExtraSelections to highlight extracted entities.
  • DONE Add the highlighting of the text to the annotation window described above.
  • DONE Since the annotations window uses the annotation system and not the Scribo text analysis framework (to be open for all sorts of annotations) we might need a way for annotations to also contain the text position. A generic system would be preferable.
Workload: 1-2 days once the annotation window above is done. Status: done
DONE Context action for launching a text analysis on a file from Dolphin: The user selects a file in Dolphin and clicks the "annotate" action. A window opens which provides the means to create manual annotations (tags, comments, relations to pimo things, relations to arbitrary things). In the background the Scribo system creates possible annotations using the annotation plugin system. The generated annotations are proposed to the user. The user can accept them with a simple click, ignore them (by doing nothing), or reject them as useless or wrong. The window also provides a means to configure the plugin system (choosing which plugins should be used for annotation creation: OpenCalais, DERI engine, Proxem, or others). The window also shows all annotations currently set for the file. Whenever the user accepts an annotation or creates a new manual one, the view is updated.

Existing parts:

  • We already have an annotation context menu action which currently opens a simple annotation window showing existing annotations as an HTML block and allowing the user to annotate via the annotation plugin system. The latter means that the user enters a short text which will be used as a filter for the annotation plugins. The system will provide tags, pimo types, and even geonames entities for annotation. (The code can be found in KDE's playground)
  • We have code for extracting plain text from arbitrary files using Strigi. (code not committed yet, only locally on SebastianTrueg's system)
  • We have an annotation model and a delegate which has + and - buttons to accept or reject the annotation (also in playground)
  • DONE Add the text extraction to the annotation window (once it opens, start the annotation search in the background)
  • Optionally filter annotation recommendations using the filter entered by the user
  • DONE Make the + button in the delegate actually create the annotation
  • DONE If Strigi could extract text from the file, allow the user to assign manual annotations to text positions.
Workload: As most of the code already exists and only needs to be combined, I would predict 2-3 full days. Status: done
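
The accept / ignore / reject workflow and the optional filter described for the annotation window can be modelled with a small state holder. This is a toolkit-independent sketch with invented names; the actual annotation model and delegate live in KDE's playground and are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    PROPOSED = "proposed"   # shown to the user, no decision yet (ignored)
    ACCEPTED = "accepted"   # user pressed "+"
    REJECTED = "rejected"   # user pressed "-"


@dataclass
class Proposal:
    label: str
    state: State = State.PROPOSED


class ProposalList:
    """Keeps the proposed annotations and the user's decisions."""

    def __init__(self, labels: List[str]) -> None:
        self.items = [Proposal(label) for label in labels]

    def accept(self, index: int) -> None:
        self.items[index].state = State.ACCEPTED

    def reject(self, index: int) -> None:
        self.items[index].state = State.REJECTED

    def visible(self, text_filter: str = "") -> List[Proposal]:
        """Proposals still shown: not rejected and matching the filter."""
        needle = text_filter.lower()
        return [p for p in self.items
                if p.state is not State.REJECTED and needle in p.label.lower()]

    def accepted(self) -> List[str]:
        return [p.label for p in self.items if p.state is State.ACCEPTED]


proposals = ProposalList(["Paris (city)", "Paris Hilton (person)", "KDE (project)"])
proposals.accept(0)
proposals.reject(1)
```

Rejected proposals are kept in the list rather than deleted, so the rejection can later be reported back to the engine that made the suggestion.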
DONE UI for sending feedback to the annotation engine: Certain suggested semi-automatic annotations may be completely wrong. In this case it would be good to allow the user to give feedback by "telling" the system about the error. Most likely a reject button would be enough (see above).

Existing parts:

  • As mentioned above, we already have a model delegate which provides + and - buttons.
  • DONE connect the minus button to the reject method in the annotation
Status: done


Ideas for actions

Entity type | Action
City, Country | Open Google Maps
Persons | Write email to
Dates | Create a date in the calendar (if possible use context: propose extracted entities as events)
FIXME add more
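
The table above can be read as a lookup from entity type to an action. A minimal sketch, assuming each action can be expressed as a URL to open; the pimo class names are used here for illustration only, and the person action assumes that an email address is already known for the person.

```python
from typing import Callable, Dict, Optional
from urllib.parse import quote, urlencode


def maps_url(place: str) -> str:
    """Build a Google Maps search URL for a city or country name."""
    return "https://www.google.com/maps/search/?" + urlencode(
        {"api": "1", "query": place})


def mailto_url(address: str) -> str:
    """Build a mailto: URL for a known email address."""
    return "mailto:" + quote(address, safe="@")


# Hypothetical mapping from RDF class to action.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "pimo:City": maps_url,
    "pimo:Country": maps_url,
    "pimo:Person": mailto_url,
}


def action_for(rdf_type: str, value: str) -> Optional[str]:
    """Return the URL to open for an entity, or None if no action is defined."""
    handler = ACTIONS.get(rdf_type)
    return handler(value) if handler else None


city_action = action_for("pimo:City", "Paris")
```

New entity types are supported by adding an entry to the mapping, which is close in spirit to the mimetype-action idea mentioned on this page.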

Use information from the Nepomuk store

There could already be information about extracted entities (or also the resource itself) in the Nepomuk store. This should be presented to the user in a tooltip when hovering over the entity in the text. This tooltip could also be combined with the actions idea above: show a map in the tooltip and maybe even mark known points of interest from the Nepomuk store on the map.

Existing parts

  • In KDE's playground we have a system for presenting arbitrary information from Nepomuk based on templates. This could be used and improved here.

ToDo

  • Map the extracted entities to entities in the Nepomuk store
  • Extract all information about the mapped entities from Nepomuk and present it


Misc


To Remember

  • Annotations should mostly be based on PIMO. This means that, for example, the OpenCalais plugin is supposed to create a pimo thing which has the extracted OpenCalais resource as an occurrence. If possible, the pimo thing's type should match the OpenCalais class (unsure: should we also create new pimo classes in certain situations?)


Page history

Created by Sebastian and Stéphane, May 2009
