Hello, I want to write an indexing filter (a plugin for Nutch) that takes the Arabic words being indexed, removes the diacritics (the harakat, or "movements") from these words, and then returns them to the indexer. What should I use instead of parse.getData(), and what should I pass to doc.add(name, value)? I don't know what the error in my code is. This is the code:
package com.mycompany.nutch.indexing;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
//import org.apache.nutch.parsedData.parsedData;

public class InvalidUrlIndexFilter implements IndexingFilter {

    private static final Logger LOGGER = Logger.getLogger(InvalidUrlIndexFilter.class);

    private Configuration conf;

    public void addIndexBackendOptions(Configuration conf) {
        // NOOP
        return;
    }

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        if (url == null) {
            return null;
        }
        string parsedData =parse;
        char[] parsedData = input.trim().toCharArray();
        for(int p=0;p<parsedData.length;p++)
            if(!(parsedData[p]=='?'||parsedData[p]=='?'||parsedData[p]=='?'||parsedData[p]=='?'
                    ||parsedData[p]=='?'||parsedData[p]=='?'||parsedData[p]=='?'||parsedData[p]=='?'
                    ||parsedData[p]=='"' ))
                new String.append(parsedData[p]);
        return doc.add("value",parsedData);
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
I think the error is in how I use parsedData, but I don't know what I should use instead.
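
A few things in the filter body would stop it compiling, as far as I can tell: `parse` is a `Parse` object rather than a `String` (its extracted text comes from `parse.getText()`, and `org.apache.nutch.parse.Parse` has to be imported), `string` should be `String`, `parsedData` is declared twice, `input` is never defined, `new String.append(...)` is not valid Java (a `StringBuilder` is the usual tool), and `doc.add(...)` does not return a `NutchDocument`, so the method should call it and then `return doc;`. The stripping step itself can be sketched as a plain helper with no Nutch dependency. The class name `ArabicDiacritics` and the method `strip` are hypothetical, and the code assumes the eight characters shown as `?` in the post were the Arabic harakat U+064B to U+0652; adjust the range if you meant other characters.

```java
// Hypothetical helper, not part of Nutch: removes Arabic short-vowel marks from text.
final class ArabicDiacritics {

    // ASSUMPTION: the eight characters lost to '?' in the post are the harakat
    // fathatan, dammatan, kasratan, fatha, damma, kasra, shadda and sukun,
    // which occupy the contiguous range U+064B..U+0652.
    private static final char FIRST_HARAKA = '\u064B';
    private static final char LAST_HARAKA = '\u0652';

    private ArabicDiacritics() {
        // static utility class, no instances
    }

    /** Returns the trimmed input without harakat and without double quotes. */
    static String strip(String input) {
        if (input == null) {
            return "";
        }
        String trimmed = input.trim();
        StringBuilder out = new StringBuilder(trimmed.length());
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            boolean isHaraka = c >= FIRST_HARAKA && c <= LAST_HARAKA;
            // The original loop also dropped the double-quote character, so keep that.
            if (!isHaraka && c != '"') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The name "Muhammad" written with harakat, as Unicode escapes.
        String vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F";
        String bare = "\u0645\u062D\u0645\u062F";
        System.out.println(strip(vocalized).equals(bare));
    }
}
```

Inside `filter(...)` this would be used roughly as `String cleaned = ArabicDiacritics.strip(parse.getText()); doc.add("content_stripped", cleaned); return doc;`. The field name `content_stripped` is only a placeholder; whichever name you choose also has to be declared in your index schema.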