Beagle handling of compressed files and man pages



Hi all,
	This e-mail contains a brief summary of an e-mail conversation that
I've had with Jon over the past couple of days. It also contains the
code mentioned in the summary. Let me apologize right off the bat for
the length....

I was looking for some way to contribute, and as you all know, the
great work that's gone into beagle has made it a project that becomes
both usable and understandable relatively quickly.

	The beagle TODO file contained an item about indexing man pages.
Figuring it was a good place to start, I whipped up a very simple man
filter (FilterMan.cs, which is attached). Initially it handled gzipped
files 'internally' (i.e. it created a GZipInputStream itself). I sent
this off to Jon, who responded within the hour! He suggested that a
more generalized handling of gzipped files would be better. Here is an
extract of his e-mail:

<Jon>
Instead of building gz-support in filter by filter, it would probably
make sense to add some sort of generic support for gzipped files so that
the HTML filter will work on gzipped HTML files, etc. That would be a
great project to work on, if you had the interest and inclination.

The open question is how to figure out the mime type of the gzipped
file.  There should be some way to sniff the mime type w/o having to
unzip the whole thing.  I'm not quite sure what gnome-vfs offers in this
regard, but I'm sure that there is some reasonable way to do it.
</Jon>

I went on to follow this sound advice (and more which followed later)
and submitted some code which lets the crawler dispatch gzipped and
bzipped files to a new function, removes the gzip-specific code from
FilterMan, and adds a new type of stream that lets you peek at the
beginning of a non-seekable stream: you can read as much of it as you
want (or as memory allows), re-read that data as often as you like, and
then tell the stream that you won't be peeking anymore; subsequent
Read()s then behave like normal Read()s on a non-seekable stream,
starting at its beginning. I called it PeekableStream.cs and used it to
unzip files and sniff the data to determine the mime-type. I also added
a function, GuessMimeType, in Util/gnome.cs which uses
gnome_vfs_get_mime_type_for_data() to do the sniffing. GuessMimeType
also accepts a path which it uses to try to be smarter about the
mime-type of the data it gets as input.
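
To make that concrete, here is a minimal sketch of how the pieces fit
together (it is basically what the new HandleZipLikeStream in the
Crawler patch below does; the file name is made up and the usual usings
are assumed):

// Sketch only; "foo.1.gz" is just a placeholder path.
Stream raw = new FileStream ("foo.1.gz", FileMode.Open, FileAccess.Read);
PeekableStream pStream = new PeekableStream (new GZipInputStream (raw));

// Sniff the mime-type from the decompressed data; passing the stripped
// name ("foo.1") lets GuessMimeType fall back on the extension when
// sniffing is inconclusive.
Flavor flavor = Flavor.FromStream (pStream, "foo.1");

// Stop caching and rewind, so the chosen filter sees the decompressed
// stream from its very start.
pStream.EndPeek ();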

I also created an IndexableCompressedFile.cs (actually I'd started with
the idea of an IndexableStream, but Jon suggested a less generic
name...once again he was quite right!) which is meant to be the
Indexable for compressed files. There remain some questions about the
appropriate Type, Uri and other meta-data to associate. Jon suggested
the following:

<Jon>
Also there is the question of the mime-type... Probably a zipped file
should be flagged w/ mimetype application/x-gzip, but there could be a
metadata field like "ActualMimeType" containing the sniffed mimetype. 
Instead of a generic IndexableStream, it might make more sense to have
an IndexableCompressedFile that parallels IndexableFile.  My worry is
that IndexableStream is too generic --- different types of streams will
need different types of handling, and I think it would be hard to make
an IndexableStream type that does the right thing in enough cases to be
useful.  But I could be wrong...
</Jon>
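
If we do end up going that way, I imagine the change to
IndexableCompressedFile's constructor would be small. Something along
these lines (just a sketch, not in the attached code; the container
mime-type would have to be passed in alongside the sniffed flavor):

// Sketch only: "containerMimeType" would be a new constructor argument.
MimeType = containerMimeType;              // e.g. "application/x-gzip"
this ["ActualMimeType"] = flavor.MimeType; // the sniffed content type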

I haven't done anything in this respect yet. I guess that's it, so here
is the executive summary:

PeekableStream.cs           Allows peeking at a stream        : New, attached
IndexableCompressedFile.cs  Indexable for compressed files    : New, attached
FilterMan.cs                Filter for man pages              : New, attached
Filters/Flavor.cs           Added FromStream()                : Modified, in Patch
Util/gnome.cs               Added GuessMimeType()             : Modified, in Patch
tools/Crawler.cs            Added handling of gzip and bzip   : Modified, in Patch
Filters/Makefile.am         Added FilterMan.cs                : Modified, in Patch
Util/Makefile.am            Added PeekableStream.cs           : Modified, in Patch
indexer/Makefile.am         Added IndexableCompressedFile.cs  : Modified, in Patch
tools/Makefile.am           Added SharpZipLib.dll ref.        : Modified, in Patch

The patch was made with :
cvs -q -z3 diff Filters/Flavor.cs \
                Filters/Makefile.am \
                Util/Makefile.am \
                Util/gnome.cs \
                indexer/Makefile.am \
                tools/Crawler.cs \
                tools/Makefile.am > mlevy_June6_2004.patch

Note: 	PeekableStream.cs lives in Util/
	IndexableCompressedFile.cs lives in indexer/
	FilterMan.cs lives in Filters/
	

Comments? Criticism (constructive of course ;)? I'm all ears!
Keep up the great work (beagle and dashboard both absolutely rock!)
All the best and thanks for staying with me,

Mike Levy


Index: Filters/Flavor.cs
===================================================================
RCS file: /cvs/gnome/beagle/Filters/Flavor.cs,v
retrieving revision 1.4
diff -r1.4 Flavor.cs
67a68,88
> 		static public Flavor FromStream (Stream stream, String path, int maxSize)
> 		{
> 			byte [] buffer = new byte [maxSize];
> 			
> 			// Read up to maxSize bytes of stream to try to guess mime-type
> 			int read = stream.Read(buffer, 0, maxSize);
> 			String mimeType = Beagle.Util.VFS.Mime.GuessMimeType (buffer, read, path);
> 			return new Flavor (mimeType, "");
> 		}
> 		
> 		// Default to one K
> 		static public Flavor FromStream (Stream stream, String path)
> 		{
> 			return FromStream (stream, path, 1024);
> 		}
> 		
> 		static public Flavor FromStream (Stream stream)
> 		{
> 			return FromStream (stream, null);
> 		}
> 
Index: Filters/Makefile.am
===================================================================
RCS file: /cvs/gnome/beagle/Filters/Makefile.am,v
retrieving revision 1.9
diff -r1.9 Makefile.am
26c26,27
< 	$(srcdir)/FilterText.cs
---
> 	$(srcdir)/FilterText.cs		\
> 	$(srcdir)/FilterMan.cs
Index: Util/Makefile.am
===================================================================
RCS file: /cvs/gnome/beagle/Util/Makefile.am,v
retrieving revision 1.9
diff -r1.9 Makefile.am
19c19,20
< 	$(srcdir)/Logger.cs
---
> 	$(srcdir)/Logger.cs	\
> 	$(srcdir)/PeekableStream.cs
Index: Util/gnome.cs
===================================================================
RCS file: /cvs/gnome/beagle/Util/gnome.cs,v
retrieving revision 1.3
diff -r1.3 gnome.cs
22a23
> 			[DllImport ("libgnomevfs-2")] extern static string gnome_vfs_get_mime_type_for_data (string data, int length);
31a33,49
> 			}
> 			
> 			public static string GuessMimeType (byte [] buffer, int buffSize, String uri)
> 			{
> 				System.Text.StringBuilder sb = new System.Text.StringBuilder (buffSize);
> 				// FIXME: This just doesn't seem like the right way!
> 				for (int i = 0; i < buffSize; i++)
> 					sb.Append ((char)buffer[i]);
> 				String text = sb.ToString ();
> 				String guessedType = gnome_vfs_get_mime_type_for_data (text, text.Length);
> 				if (guessedType == "text/plain" ||
> 				    guessedType == "application/octet-stream") {
> 					String guessed2  = GetMimeType (uri);
> 					if (guessed2 != null && guessed2 != "")
> 						guessedType = guessed2;
> 				}
> 				return guessedType;
Index: indexer/Makefile.am
===================================================================
RCS file: /cvs/gnome/beagle/indexer/Makefile.am,v
retrieving revision 1.17
diff -r1.17 Makefile.am
21c21,22
< 	$(srcdir)/GoogleDriver.cs
---
> 	$(srcdir)/GoogleDriver.cs	\
> 	$(srcdir)/IndexableCompressedFile.cs
Index: tools/Crawler.cs
===================================================================
RCS file: /cvs/gnome/beagle/tools/Crawler.cs,v
retrieving revision 1.16
diff -r1.16 Crawler.cs
29a30
> using System.Collections.Specialized;
33a35
> using Beagle.Util;
35a38,43
> using ICSharpCode.SharpZipLib.Zip;
> using ICSharpCode.SharpZipLib.GZip;
> using ICSharpCode.SharpZipLib.BZip2;
> using ICSharpCode.SharpZipLib.Tar;
> 
> 
124a133,145
> 		// mime-types to consider as archives
> 		StringCollection archiveMimeTypes = new StringCollection();
> 		
> 		public Crawler ()
> 		{
> 			archiveMimeTypes.Add ("application/x-gzip");
> 			archiveMimeTypes.Add ("application/x-bzip2");
> 			archiveMimeTypes.Add ("application/x-bzip");
> 			// To be added 
> 			// archiveMimeTypes.Add ("application/x-tar");
> 			// archiveMimeTypes.Add ("application/x-gtar");
> 			// archiveMimeTypes.Add ("application/zip");
> 		}
174c195,245
< 		void CrawlFile (FileInfo info, Hit hit)
---
> 		void HandleZipLikeStream (Stream s, FileInfo info, Hit hit)
> 		{
> 			if (s == null)
> 				throw new Exception (String.Format ("Got a null stream for {0}",
> 							 	     info.FullName));
> 			
> 			PeekableStream pStream = new PeekableStream (s);
> 			// Try to sniff the mime-type from the PeekableStream
> 			String strippedName = Path.GetFileNameWithoutExtension (info.FullName);
> 			Flavor sFlavor = Flavor.FromStream (pStream, strippedName);
> 			pStream.EndPeek (); // turn the PeekableStream back to a normal stream
> 			//Console.WriteLine(" --> {0}  ", sFlavor);
> 			
> 			if (! HandleFlavor (sFlavor, info))
> 				return;
> 
> 			Indexable indexable = new IndexableCompressedFile (pStream, sFlavor, info.FullName);
> 
> 			// If our indexable isn't newer, don't even bother...
> 			if (! indexable.IsNewerThan (hit))
> 				return;
> 				
> 			// FIXME: Does info have the right info!
> 			ScheduleAdd (info, indexable);
> 			// FIXME: What does this do?
> 			ScheduleDelete (hit);
> 		}
> 			
> 		void CrawlArchive (Flavor flavor, FileInfo info, Hit hit)
> 		{
> 			String path = Path.GetFullPath (info.FullName);
> 			if (! File.Exists (path))
> 				throw new Exception ("No such file: " + path);
> 			Stream s = new FileStream (path, FileMode.Open, FileAccess.Read);
> 
> 			switch (flavor.MimeType) {
> 			case "application/x-gzip" :
> 				HandleZipLikeStream (new GZipInputStream (s), info, hit);
> 				break;
> 			case "application/x-bzip2" :
> 			case "application/x-bzip" :
> 				HandleZipLikeStream (new BZip2InputStream (s), info, hit);
> 				break;
> 			default :
> 				Console.WriteLine("CrawlArchive : Unhandled mime-type: {0} file: {1}",
> 						   flavor.MimeType, info.FullName);
> 				break;
> 			}
> 		}
> 
> 		bool HandleFlavor (Flavor flavor, FileInfo info)
176d246
< 			Flavor flavor = Flavor.FromPath (info.FullName);
182c252
< 
---
> 			
185c255
< 				return;
---
> 				return false;
192c262
< 				return;
---
> 				return false;
193a264,265
> 			return true;
> 		}
194a267,278
> 		void CrawlFile (FileInfo info, Hit hit)
> 		{
> 			Flavor flavor = Flavor.FromPath (info.FullName);
> 
> 			if (archiveMimeTypes.Contains (flavor.MimeType)) {
> 				CrawlArchive (flavor, info, hit);
> 				return;
> 			}
> 			
> 			if(! HandleFlavor (flavor, info))
> 				return;
> 				
Index: tools/Makefile.am
===================================================================
RCS file: /cvs/gnome/beagle/tools/Makefile.am,v
retrieving revision 1.18
diff -r1.18 Makefile.am
10c10,11
< 	-r:../indexer/Indexer.dll
---
> 	-r:../indexer/Indexer.dll	\
> 	-r:ICSharpCode.SharpZipLib.dll
//
// Beagle
//
// PeekableStream.cs : The following class is meant to be a stream which
//	allows you to look ahead for a number of bytes then come back to
//	the start to re-read.
//	This stream is meant to decorate another stream. Note if the other stream
//	is seekable then we have little work to do, otherwise we store
//	the stuff that's read in a memory stream.
//	This is a readonly stream!
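//
//	Typical usage (just a sketch; 'someStream' and 'buf' are placeholders
//	for any readable, possibly non-seekable Stream and a byte buffer):
//
//		PeekableStream ps = new PeekableStream (someStream);
//		int n = ps.Read (buf, 0, buf.Length); // peek at the first bytes
//		ps.Restart ();                        // optionally re-read them
//		ps.EndPeek ();                        // rewind; Reads now start over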
//
// Author :
//      Michael Levy <mlevy wardium homeip net>
//
// Copyright (C) 2004 Michael levy
//
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
//

using System;
using System.IO;

namespace Beagle.Util {

	public class PeekableStream : Stream {
	
		MemoryStream buff = null;
		Stream stream;
		long length;
		long position;
		bool peekMode;
		byte [] singleByte; // Used for single byte reads
	
		public PeekableStream (Stream _stream)
		{
			stream = _stream;
		
			if(! stream.CanRead)
				throw new NotSupportedException("Parent stream must be able to read");
		
			if (! stream.CanSeek)
				buff = new MemoryStream ();
			
			try {
				length = stream.Length;
			} catch (Exception) {
				// the stream may not support .Length
				length = 0;
			}
			position = 0;
			// Start peeking right away
			peekMode = true;
			singleByte = new byte [1];
			// DEBUG
			//Console.WriteLine ("buff in {0}", (buff== null ? "null" : "not null"));
		}
	
		public override bool CanRead {
			get { return true; }
		}
	
		// Should we allow seeking if stream allows it?
		public override bool CanSeek {
			get { return false; }
		}
	
		public override bool CanWrite {
			get { return false; }
		}
	
		public override long Length {
			get { return length; }
		}
	
		// FIXME : Not sure about what the set should do!
		public override long Position {
			get { return position; }
			set {
				throw new NotSupportedException ("Setting Position is not supported"); 
			}
		}

		public override IAsyncResult BeginRead (byte[] buffer, int offset, int count, AsyncCallback callback, object state)
		{
			throw new NotSupportedException ("Asynch read not currently supported");
		}

		public override IAsyncResult BeginWrite (byte[] buffer, int offset, int count, AsyncCallback callback, object state)
		{
			throw new NotSupportedException ("Asynch write not currently supported");
		} 
	
		public override void Close ()
		{
			if (buff != null) 
				buff.Close ();
			stream.Close ();
		}
	
		// Doesn't make sense in a readonly stream...right?
		public override void Flush ()
		{
			// Do nothing
		}

		// This is sort of a limited seek. The idea is that it should behave
		// like Seek(0, SeekOrigin.Begin)
		public void Restart ()
		{
			if (!peekMode)
				throw new NotSupportedException ("Restart only allowed before EndPeek");
			if (buff == null) {
				if (position != 0)
						stream.Seek (-position, SeekOrigin.Current);
			} else {
				buff.Seek(0, SeekOrigin.Begin);
				buff.Position = 0;
			}
			position = 0;
		}

		// Peeking starts upon creation.
		// EndPeek stops the caching of data and returns to the start of the
		// stream; subsequent Reads behave like normal, non-seekable Reads.
		public void EndPeek ()
		{
			if (peekMode) {
				Restart ();
				peekMode = false;
			}
		}
	
		public override int Read (byte [] buffer, int offset, int count)	
		{
			int read = 0;

			if (peekMode) {
				// buff is null only when the underlying stream can seek,
				// so there is no cache to maintain.
				if (buff == null) {
					read = stream.Read (buffer, offset, count);
				} else {
					int mCount = 0;
					int sCount = 0;

					// Read as much as possible from buff
					mCount = buff.Read (buffer, offset, count);
					if (mCount < count) {
						// We've depleted buff. Get more from stream
						sCount = stream.Read (buffer, offset + mCount, count - mCount);
						buff.Write (buffer, offset + mCount, sCount);
					}
					read = mCount + sCount;
				}
			} else {
				int mCount = 0;
				int sCount = 0;
			
				if (buff != null) {
					mCount = buff.Read (buffer, offset, count);
					if (mCount < count) {
						// We've depleted buff. Kill it
						buff = null;
					}
				}
			
				if(mCount < count) {
					// We haven't read enough so read from stream
					sCount = stream.Read (buffer, offset + mCount, count - mCount);
				} 
				read = mCount + sCount;
			}
			position += read;
			return read;
		}
	
		public override int ReadByte ()
		{
			int count = Read(singleByte, 0, 1); //read one byte
			if (count > 0)
				return singleByte [0];

			return -1; // end of stream
		}
	
		public override long Seek (long offset, SeekOrigin origin)
		{
			throw new NotSupportedException ("Seek not implemented");
		}
	
		public override void SetLength (long value)
		{
			throw new NotSupportedException ("SetLength not implemented");
		}
	
		public override void Write (byte [] buffer, int offset, int count)
		{
			throw new NotSupportedException ("Write not supported");
		}
	
		public override void WriteByte (byte value)
		{
			// in case we decide to support writes one day...
			singleByte[0] = value;
			Write (singleByte, 0, 1);
		}
	}
}
//
// Beagle
// 
// IndexableCompressedFile.cs : an indexer for gzip / bzip files
//
// Author :
//	Michael Levy <mlevy wardium homeip net>
//
// Copyright (C) 2004 Michael levy
//
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
//

using System;
using System.Collections;
using System.IO;

using Beagle.Filters;

namespace Beagle {

	public class IndexableCompressedFile : Indexable {

		Filter filter;
		String path;
		Stream stream;
		
		public IndexableCompressedFile (Stream _stream, Flavor flavor, String _path)
		{
			stream = _stream;
			filter = Filter.FilterFromFlavor (flavor);
			path = _path;
			
			Type = "FIXME:DONTKNOWYET";
			Uri = "FIXME://DontKnowYET/" + _path;
			MimeType = flavor.MimeType;
			if (_path != null && File.Exists (_path))
				Timestamp = File.GetLastWriteTime (_path);
		}
		
		override protected void DoBuild ()
		{
			filter.Open (stream);
			foreach (String key in filter.Keys)
				this [key] = filter [key];
			Content = filter.Content;
			HotContent = filter.HotContent;
			filter.Close ();
			stream.Close ();
			
			if (path != null) {
				FileInfo info = new FileInfo (path);
				this ["_Directory"] = info.DirectoryName;
			} else {
				this ["_Directory"] = "Unknown";
			} 
		}
	}
}
//
// Beagle
//
// FilterMan.cs : Trivial implementation of a man-page filter.
//
// Author :
//      Michael Levy <mlevy wardium homeip net>
//
// Copyright (C) 2004 Michael levy
//
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
//

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
 
using ICSharpCode.SharpZipLib.GZip;

namespace Beagle.Filters {

	public class FilterMan : Filter {

		public FilterMan ()
		{
			AddSupportedMimeType ("application/x-troff-man");
			AddSupportedMimeType ("text/x-troff-man");
		}
 		/*
 			FIXME: 
 			Right now we don't handle pages with just one line like:
 				.so man3/strcpy.3
			Which is in strncpy.3.gz and points to strcpy.3.gz
		*/
		protected void ParseManFile (StreamReader reader)
		{
			String str;
			/*
			   The regular expression for the header line is built to allow a suite of non-spaces,
			   or words separated by spaces which are encompassed in quotes
			   The regexp should be :
			   
			Regex headerRE = new Regex (@"^\.TH\s+" +
						    @"(?<title>(\S+|(""(\S+\s*)+"")))\s+" +
						    @"(?<section>\d+)\s+" + 
						    @"(?<date>(\S+|(""(\S+\s*)+"")))\s+" +
						    @"(?<source>(\S+|(""(\S+\s*)+"")))\s+" +
						    @"(?<manual>(\S+|(""(\S+\s*)+"")))\s*" +
						    "$");
						    
			 But there seem to be a number of broken man pages. Since we're only 
			 keeping the <title> field we'll be less stringent
			*/
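			// For example, a (made-up) header line like
			//	.TH GREP 1 "2002-01-03" "GNU Project" "User Commands"
			// should match; only the "GREP" title group is kept below.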
			Regex headerRE = new Regex (@"^\.TH\s+" +
						    @"(?<title>(\S+|(""(\S+\s*)+"")))\s+" +
						    @"(?<section>\d+)\s*");
						    
			while ((str = reader.ReadLine ()) != null) {
				if (str.StartsWith (@".\""")) {
					/* Comment in man page */
					continue;
				} else if (str.StartsWith (".TH ")) {
					MatchCollection matches = headerRE.Matches (str);
					if (matches.Count != 1) {
						Console.Error.WriteLine ("In .TH line: expected 1 match but found {0} in '{1}'",
									  matches.Count, str);
						continue;
					}
					foreach (Match theMatch in matches) {
						this ["Title"] = theMatch.Groups ["title"].ToString ();
						/* debug If using the complete regexp!!
						Console.Error.WriteLine ("title : '{0}'\nsection : '{1}'\n" +
									 "date : '{2}'\nsource : '{3}'\n" +
									 "manual : '{4}'\n",
									 theMatch.Groups ["title"],
									 theMatch.Groups ["section"],
									 theMatch.Groups ["date"],
									 theMatch.Groups ["source"],
									 theMatch.Groups ["manual"]);
						*/
					}
				} else {
					/* A "regular" string */
					/* FIXME: We can probably do better by stripping other macros (.SH for example) */
					AppendContent (str);
				}
			}
		}

		override protected void Read (Stream stream)
		{
			StreamReader reader = new StreamReader (stream);
			ParseManFile (reader);
		}
	}
}

