java.lang.Object
  org.apache.avro.mapred.AvroMultipleOutputs
public class AvroMultipleOutputs
The AvroMultipleOutputs class simplifies writing Avro output data to multiple outputs.
Case one: writing to additional outputs other than the job default output.
Each additional output, or named output, may be configured with its own
Schema and OutputFormat.
A named output can be a single file or a multi file. The latter is referred to as
a multi named output, which is an unbounded set of files all sharing the same
Schema.
Case two: writing data to different files specified by the user.
AvroMultipleOutputs supports counters; they are disabled by default. The
counters group is the AvroMultipleOutputs class name. The names of the
counters are the same as the output names, and each counter counts the number
of records written to that output. For multi
named outputs, the name of the counter is the concatenation of the named
output, an underscore '_', and the multiname.
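The counter naming rule above can be sketched in a few lines of standalone Java. This is only an illustration of the rule as described, not Avro's implementation: the class `CounterNameSketch` and its methods are hypothetical, and a plain map stands in for Hadoop counters. In a real job, counters would first be switched on with `AvroMultipleOutputs.setCountersEnabled(conf, true)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Avro: models the counter naming rule. */
final class CounterNameSketch {

    /** Counter name for a single named output: the output name itself. */
    static String counterName(String namedOutput) {
        return namedOutput;
    }

    /** Counter name for a multi named output: namedOutput + '_' + multiName. */
    static String counterName(String namedOutput, String multiName) {
        return namedOutput + "_" + multiName;
    }

    public static void main(String[] args) {
        // A plain map stands in for the Hadoop counters group (assumption).
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.merge(counterName("avro1"), 1L, Long::sum);      // one record to 'avro1'
        counts.merge(counterName("avro2", "A"), 1L, Long::sum); // two records to 'avro2', part 'A'
        counts.merge(counterName("avro2", "A"), 1L, Long::sum);
        System.out.println(counts);
    }
}
```

Each record written through a collector would increment the counter whose name is derived this way, so a multi named output produces one counter per multiname.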
Usage pattern for job submission:
JobConf job = new JobConf();
FileInputFormat.setInputPaths(job, inDir);
FileOutputFormat.setOutputPath(job, outDir);
job.setMapperClass(MyAvroMapper.class);
job.setReducerClass(HadoopReducer.class);
job.set("avro.reducer", MyAvroReducer.class.getName());
...
Schema schema;
...
// Defines additional single output 'avro1' for the job
AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class,
    schema);
// Defines additional output 'avro2' for the job; because the Schema is
// null, the default output schema is used
AvroMultipleOutputs.addNamedOutput(job, "avro2",
    AvroOutputFormat.class,
    null);
...
JobClient.runJob(job);
...
Usage in Reducer:
public class MyAvroReducer extends
    AvroReducer<K, V, OUT> {
  private AvroMultipleOutputs amos;

  public void configure(JobConf conf) {
    ...
    amos = new AvroMultipleOutputs(conf);
  }

  public void reduce(K key, Iterable<V> values,
      AvroCollector<OUT> collector, Reporter reporter)
      throws IOException {
    ...
    // Two-argument form: collector for a single named output
    amos.getCollector("avro1", reporter).collect(datum);
    // Three-argument form: collector for a multi named output
    // (see addMultiNamedOutput)
    amos.getCollector("avro2", "A", reporter).collect(datum);
    amos.getCollector("avro3", "B", reporter).collect(datum);
    ...
  }

  public void close() throws IOException {
    amos.close();
    ...
  }
}
| Constructor Summary |
|---|
| AvroMultipleOutputs(org.apache.hadoop.mapred.JobConf job)<br>Creates and initializes multiple named outputs support. It should be instantiated in the Mapper/Reducer configure method. |

| Method Summary | |
|---|---|
| static void | addMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput, Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass, org.apache.avro.Schema schema)<br>Adds a multi named output for the job. |
| static void | addNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput, Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass, org.apache.avro.Schema schema)<br>Adds a named output for the job. |
| void | close()<br>Closes all the opened named outputs. |
| AvroCollector | getCollector(String namedOutput, org.apache.hadoop.mapred.Reporter reporter)<br>Gets the output collector for a named output. |
| AvroCollector | getCollector(String namedOutput, String multiName, org.apache.hadoop.mapred.Reporter reporter)<br>Gets the output collector for a multi named output. |
| static boolean | getCountersEnabled(org.apache.hadoop.mapred.JobConf conf)<br>Returns whether the counters for the named outputs are enabled. |
| static Class<? extends org.apache.hadoop.mapred.OutputFormat> | getNamedOutputFormatClass(org.apache.hadoop.mapred.JobConf conf, String namedOutput)<br>Returns the named output OutputFormat. |
| Iterator<String> | getNamedOutputs()<br>Returns an iterator over the defined named outputs. |
| static List<String> | getNamedOutputsList(org.apache.hadoop.mapred.JobConf conf)<br>Returns the list of channel names. |
| static boolean | isMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput)<br>Returns whether a named output is a multi named output. |
| static void | setCountersEnabled(org.apache.hadoop.mapred.JobConf conf, boolean enabled)<br>Enables or disables counters for the named outputs. |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

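| Constructor Detail |
|---|

public AvroMultipleOutputs(org.apache.hadoop.mapred.JobConf job)
Creates and initializes multiple named outputs support. It should be instantiated in the Mapper/Reducer configure method.
Parameters:
job - the job configuration object

| Method Detail |
|---|
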
public static List<String> getNamedOutputsList(org.apache.hadoop.mapred.JobConf conf)
Returns the list of channel names.
Parameters:
conf - job conf
public static boolean isMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf,
    String namedOutput)
Returns whether a named output is a multi named output.
Parameters:
conf - job conf
namedOutput - named output
Returns:
true if the named output is multi, false if it is single. If the named output is not defined, it returns false.
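The return contract above (true for multi, false for single, false when the name is not defined) can be illustrated with a small in-memory registry. This is a sketch under stated assumptions: `NamedOutputRegistrySketch` is a hypothetical class that keeps its state in a map, whereas the real methods are static and read the job configuration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the job configuration, not part of Avro. */
final class NamedOutputRegistrySketch {
    // Maps each defined named output to whether it is a multi named output.
    private final Map<String, Boolean> multiByName = new HashMap<>();

    void addNamedOutput(String name) {
        multiByName.put(name, false);
    }

    void addMultiNamedOutput(String name) {
        multiByName.put(name, true);
    }

    /** True only for defined multi named outputs; undefined names yield false. */
    boolean isMultiNamedOutput(String name) {
        return multiByName.getOrDefault(name, false);
    }

    public static void main(String[] args) {
        NamedOutputRegistrySketch registry = new NamedOutputRegistrySketch();
        registry.addNamedOutput("avro1");
        registry.addMultiNamedOutput("avro2");
        System.out.println(registry.isMultiNamedOutput("avro1"));
        System.out.println(registry.isMultiNamedOutput("avro2"));
        System.out.println(registry.isMultiNamedOutput("undefined"));
    }
}
```

Because an undefined name and a single named output both yield false, a caller that needs to tell them apart would also have to consult the list of defined names.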
public static Class<? extends org.apache.hadoop.mapred.OutputFormat> getNamedOutputFormatClass(org.apache.hadoop.mapred.JobConf conf,
    String namedOutput)
Returns the named output OutputFormat.
Parameters:
conf - job conf
namedOutput - named output
public static void addNamedOutput(org.apache.hadoop.mapred.JobConf conf,
    String namedOutput,
    Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass,
    org.apache.avro.Schema schema)
Adds a named output for the job.
Parameters:
conf - job conf to add the named output to
namedOutput - named output name; it has to be a word made of letters and numbers only, and it cannot be the word 'part', as that is reserved for the default output
outputFormatClass - OutputFormat class
schema - Schema to use for this namedOutput
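The naming rule for namedOutput can be sketched as a small standalone check. This is a minimal illustration rather than Avro's own validation code: the class and method names are hypothetical, and two details are assumptions not stated in the text, namely that "letters and numbers" means ASCII characters only and that an empty name is rejected.

```java
/** Hypothetical helper, not part of Avro: checks the stated naming rule. */
final class NamedOutputNameSketch {

    static boolean isValidNamedOutput(String name) {
        // Assumption: null and empty names are not "a word".
        if (name == null || name.isEmpty()) {
            return false;
        }
        // 'part' is reserved for the default output.
        if (name.equals("part")) {
            return false;
        }
        // Assumption: only ASCII letters and digits are accepted.
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"avro1", "part", "avro_1"}) {
            System.out.println(name + " valid: " + isValidNamedOutput(name));
        }
    }
}
```

Under these assumptions, a name such as avro1 passes, while part and avro_1 are rejected, the latter because of the underscore.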
public static void addMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf,
    String namedOutput,
    Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass,
    org.apache.avro.Schema schema)
Adds a multi named output for the job.
Parameters:
conf - job conf to add the named output to
namedOutput - named output name; it has to be a word made of letters and numbers only, and it cannot be the word 'part', as that is reserved for the default output
outputFormatClass - OutputFormat class
schema - Schema to use for this namedOutput
public static void setCountersEnabled(org.apache.hadoop.mapred.JobConf conf,
    boolean enabled)
Enables or disables counters for the named outputs. Counters are disabled by default. The counters group is the AvroMultipleOutputs class name.
The names of the counters are the same as the named outputs. For multi
named outputs, the name of the counter is the concatenation of the named
output, an underscore '_', and the multiname.
Parameters:
conf - job conf in which to enable or disable the counters
enabled - indicates whether the counters will be enabled
public static boolean getCountersEnabled(org.apache.hadoop.mapred.JobConf conf)
Returns whether the counters for the named outputs are enabled. Counters are disabled by default. The counters group is the AvroMultipleOutputs class name.
The names of the counters are the same as the named outputs. For multi
named outputs, the name of the counter is the concatenation of the named
output, an underscore '_', and the multiname.
Parameters:
conf - job conf to read the counters setting from
Returns:
true if the counters are enabled, false otherwise
public Iterator<String> getNamedOutputs()
Returns an iterator over the defined named outputs.
public AvroCollector getCollector(String namedOutput,
    org.apache.hadoop.mapred.Reporter reporter)
    throws IOException
Gets the output collector for a named output.
Parameters:
namedOutput - the named output name
reporter - the reporter
Throws:
IOException - thrown if the output collector could not be created
public AvroCollector getCollector(String namedOutput,
    String multiName,
    org.apache.hadoop.mapred.Reporter reporter)
    throws IOException
Gets the output collector for a multi named output.
Parameters:
namedOutput - the named output name
multiName - the multi name part
reporter - the reporter
Throws:
IOException - thrown if the output collector could not be created
public void close()
    throws IOException
Closes all the opened named outputs.
Subclasses that override this method must invoke super.close() at the
end of their close().
Throws:
IOException - thrown if any of the MultipleOutput files
could not be closed properly.
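The requirement to call super.close() at the end of an overriding close() can be sketched with two small stand-in classes. Both classes are hypothetical and exist only to show the call order; they record events in a list instead of closing real output files.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a class that owns the named outputs. */
class OutputsSketch implements Closeable {
    final List<String> log = new ArrayList<>();

    @Override
    public void close() throws IOException {
        // In the real class, this is where all opened named outputs are closed.
        log.add("base: named outputs closed");
    }
}

/** Hypothetical subclass that holds an extra resource of its own. */
class BufferedOutputsSketch extends OutputsSketch {
    @Override
    public void close() throws IOException {
        // Release subclass state first ...
        log.add("subclass: own resources released");
        // ... then delegate, so the named outputs are closed last.
        super.close();
    }

    public static void main(String[] args) throws IOException {
        BufferedOutputsSketch outputs = new BufferedOutputsSketch();
        outputs.close();
        outputs.log.forEach(System.out::println);
    }
}
```

Calling super.close() last means anything the subclass still needs to write can go out before the underlying outputs are closed.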