08 February 2007

Item Duplicate Detection

Release 1.2.6 introduces item duplicate detection. My goal was a simple, unobtrusive implementation that keeps you from encountering the same article twice as unread.

Here is what it does:
  1. When rendering an item that is duplicated in other feeds, an extra "Also in [feed]" header line is shown for each feed containing a duplicate.
  2. When the read state of an item with duplicates changes, the change is propagated to all of its duplicates.
An open question is what should happen when a new duplicate arrives after you have already read a copy of it. Should it be added as unread, or changed to match the state of the other duplicates? Is one interested in the fact that another blog posted the same item?

For now, the current implementation adds the new duplicate as unread (leaving the older duplicates as they are). This has the advantage that you notice the item was posted somewhere else. But depending on how long an item takes to spread (maybe due to slow planet updates), this might be impractical. So please give it a try and tell me about your experiences!
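
The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Liferea's actual code: the names `Item`, `ItemStore`, `add_item`, `also_in` and `set_read` are hypothetical, and so is the assumption that duplicates are recognised by an identical item id.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    guid: str          # assumed duplicate key: the item's globally unique id
    feed: str          # title of the feed this copy belongs to
    read: bool = False


class ItemStore:
    """Hypothetical store that groups copies of an item by their id."""

    def __init__(self):
        self._by_guid = defaultdict(list)

    def add_item(self, guid, feed):
        # Current policy from the post: a newly arriving duplicate is added
        # as unread, and the older copies keep whatever state they have.
        item = Item(guid, feed)
        self._by_guid[guid].append(item)
        return item

    def also_in(self, item):
        # Feeds that hold another copy; one "Also in [feed]" line each.
        return [other.feed for other in self._by_guid[item.guid]
                if other is not item]

    def set_read(self, item, read=True):
        # A read state change is propagated to every copy of the item.
        for copy in self._by_guid[item.guid]:
            copy.read = read


store = ItemStore()
a = store.add_item("urn:example:1", "Planet GNOME")
b = store.add_item("urn:example:1", "Planet Ubuntu")
store.set_read(a)                                         # marks both copies read
c = store.add_item("urn:example:1", "Planet ubuntu-uk")   # arrives later, unread
```

Switching to the alternative policy (a late duplicate inherits the state of the existing copies) would only mean changing `add_item` in this sketch, which is why the choice is mostly a question of what users expect.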

13 comments:

Anonymous said...

Nice, a very, very nice application.
But the only thing that annoys me is that Liferea is becoming a memory hog, a real one.
I'm the kind of user who downloads a Liferea release while its download count is still 0, so I'm a damn Liferea fan.
But memory really seems to be an issue: on my machine the only things that use more memory than Liferea are Java (and sometimes Firefox).
Liferea's memory usage is more than 3 times that of my MUA (Sylpheed).
It would be nice if you could do something about this.

I'm not an ungrateful user, but just some suggestions as an end user to see my favorite application getting better and better.

Lars said...

Please create a bug report in the SF bug tracker and give exact values for the memory used by Liferea, the number of feeds you are subscribed to, and the cache size setting you used. It also matters whether you use the "optimize for memory usage" or the "optimize for speed" preference.

Ralf said...

Concerning the duplicates issue you asked our opinions about:

- Do not display duplicates within the same feed.

- Show duplicates in other feeds as 'read' when the original is already read.

- Use a different color for messages that get 'updated' although they still have the same id and name.

Most importantly, unread should really mean: UNREAD. Not the same message from somewhere else, not the same message with a different rss-id, etc. That would be great. I care about the contents of the message, not who, why and where it is from. So information as 'also posted on .. ' is nice, but of no use to me.

------------------------------------------
I use Liferea as my main RSS tool for three reasons:

1. Being able to read some of the blogs as long pages, instead of the boring outlook-like view.

2. Being able to add RSS feeds from Firefox

3. Being a separate program (not integrated with the browser) which I can choose to launch.

Favorite new features would be:

1. Third viewing option: show webpage rss-item is linking to instead of the message. I have some RSS feeds which do not contain the full message, but just a link to it.

2. When clicking on an RSS feed in firefox, I would want liferea to open. Now it only works when liferea is already running. (Perhaps liferea-add-feed can be tweaked to do this?)

3. Better mixing of messages in groups. It currently requires more CPU/memory to show/read all the feeds at once. I would love an option to choose whether or not to include an RSS feed in the 'parent' feed.

My favorite all time RSS reader is flock. I just don't like the fact that it doesn't integrate in my desktop. The way they create a sort of 'front-page' of several feeds is brilliant.

Lars said...

@ralf: The duplicate issue: duplicate detection within the same feed does not make much sense. The syndication standards do ensure that the globally unique ids are unique and XML consumers HAVE to rely on it.
I agree about the synchronization of the read state of all duplicates, but that doesn't really answer my original question about newly arriving duplicates. Also more extensive state representation by coloring the title is not really practical because distinguishing the meanings of many colors is not intuitive and it is a mess to make it work on all color themes.

As for your feature requests: Without development help there is no chance to realize more features.

1.) Already implemented in 1.2.x: there is an option in the feed properties to enable auto-loading of the item link (3 pane mode only).

2.) Yes, this would be worth improving. It's a really simple task, but there are just no volunteers.

3.) Too complex. Quote from the homepage: "Liferea is a simple feed reader"

Tsukasa said...

Lars is right, it's really easy to implement into liferea-add-feed. I hope it's not yet been done in SVN (I don't keep up... duh) but oh well, who cares ;)

For your convenience: a quick-and-dirty way to do things.

Lars said...

@tsukasa: You are the first one to realize it. I'll do some tests and will merge it with the source. Thanks for the solution!!!

Ralf said...
This comment has been removed by the author.
Ralf said...

Lars, thanks for your response!
I really do appreciate Liferea.

1. The "auto-load-link-item-in-internal-browser."

It opens in an external browser for me! Is this a bug?

2. Tsukasa's liferea-add-feed script:

It doesn't work for me:

/usr/bin/liferea-add-feed: line 32: 10804 Segmentation fault (core dumped) `which liferea-bin`

3. I understand you do not want to add too many options. But is there a way to make a directory with lots of sub-feeds faster in combined view?

4. Is it possible for the 'item read' mode to be turned on after, say, 2 seconds instead of immediately? And, more complex I suspect: to mark an item in combined view as read only when it has actually been in a visible area of the screen?

5. Which language is Liferea written in? Perhaps I can help out! Although my favorite languages are a bit extreme (like Haskell) I don't mind touching Python. I stay away from C though ;-) Don't fancy pointers and wouldn't really know how to program without some sort of abstraction mechanism, be it first-class-functions or objects-in-a-dynamic-language.

6. Do bug-reports on launchpad reach you? I've posted some bugs there about 1.2.4 which is in Feisty's repository about dealing with malformed XML-feeds in combined-view of a directory (it makes liferea hang)

Thanks again!

Lars said...

@Ralf:

1.) You found a bug. This needs to be fixed.

2.) I still need to verify this...

3.) No. There is nothing you can do besides reorganizing in subfolders.

4.) The first suggestion is probably possible, but not planned (patches welcome). The automatic read-marking in 2 pane mode is not possible when considering that the GtkHTML2 browser module doesn't support Javascript and the user can disable Javascript in the preferences.

5.) I must disappoint you, but Liferea is a huge bunch of pointers :-)

6.) Yes, the Ubuntu maintainer(s) do forward important bugs, so I'd say every Ubuntu user should post bugs at Launchpad.
The issue you mention happens when rendering invalid XHTML in Gecko, which makes Gecko hang immediately. I'm not sure who is responsible for this: the application supplying broken XML with the XHTML MIME type, or the rendering engine freezing on it. Personally I think Gecko should behave more gracefully. Other than always checking the generated XHTML for well-formedness, I see no solution.

Ralf said...

Perhaps I'm totally mistaken, but the bugs I reported only happen under these conditions:

1. it's in combined view
2. it's a directory containing feeds

Looking at the feed directly I get a nice XML-Malformed error (which is identical to the one Firefox would give me). Which is perfectly fine. It doesn't work, it's not the program's fault and I know why it's not working.

But when I look at the directory containing the feed in combined view, _that's when it hangs_

The way you describe it, you are using GtkHtml with the directories, but Gecko for the actual feeds.

This also explains the other weird behavior I discovered. A broken link (such as "//www.ubuntu.com") works when you are looking at the combined view of a feed, but not when you are looking at the combined view of a directory containing feeds.

Looking directly at the feed, it fixes the link and assumes http: prefix, whereas when I look at the directory it assumes file: and does not do anything.

Perhaps the best solution would be to use the Gecko engine for the directory view as well? Might even speed it up a bit.
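
The link behaviour described above matches how a scheme-less link such as "//www.ubuntu.com" is normally resolved: it inherits the scheme of whatever base URL the page was loaded with. Here is a small Python sketch using the standard library's `urljoin`. The two base URLs are made-up stand-ins for a feed's source URL and a locally generated page; nothing here is taken from Liferea itself.

```python
from urllib.parse import urljoin

link = "//www.ubuntu.com"                       # the "broken" link from the feed

feed_base = "http://example.org/feed.xml"       # hypothetical feed source URL
local_base = "file:///tmp/folder-view.html"     # hypothetical local page

# The link takes over the scheme of the base it is resolved against.
from_feed = urljoin(feed_base, link)
from_local = urljoin(local_base, link)

print(from_feed)
print(from_local)
```

If this guess is right, the difference between the two views would come from the base URL each view is rendered with, not necessarily from the rendering engine.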

astopy said...

The duplicate detection is really nice. My only suggestion would be to change it so that each feed isn't displayed on a separate line.

e.g. currently it says something like:

"Also posted in Planet Ubuntu
Also posted in Planet GNOME
Also posted in Planet ubuntu-uk"

It would take up less screen space to say:

"Also posted in Planet Ubuntu, Planet GNOME and Planet ubuntu-uk"

All on one line.
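
A minimal sketch of that formatting in Python, purely for illustration (the function name `also_posted_line` is made up, and this is not how Liferea builds its header):

```python
def also_posted_line(feeds):
    """Join feed titles into a single 'Also posted in ...' line."""
    if not feeds:
        return ""
    if len(feeds) == 1:
        return f"Also posted in {feeds[0]}"
    # Comma-separate all titles but the last, then append it with "and".
    return f"Also posted in {', '.join(feeds[:-1])} and {feeds[-1]}"


line = also_posted_line(["Planet Ubuntu", "Planet GNOME", "Planet ubuntu-uk"])
print(line)  # Also posted in Planet Ubuntu, Planet GNOME and Planet ubuntu-uk
```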

Mich said...

I'm trying to find the setting to make liferea automatically open the item in the html pane when I click on an item. I have to click three times now, annoying.

Lars said...

@Mich: The feed-specific property to auto-load item links was added in 1.2.0 and can be found in the feed properties (right-click on a feed in the feed list, select "Properties", then go to the last tab, where you will find a check box to enable it).