Rack Awareness / Rack Topology (reference: http://www.slideshare.net/tutorialvillage/hadoop-hdfs-concepts)
Hadoop's default policy for replicating blocks (replication factor 3) is, roughly speaking, the following: the first replica is stored on one node, e.g. N1 in rack1; the second replica goes to a node in a different rack, e.g. N2 in rack2; the third replica is placed on another node in the same rack as the second, e.g. N3 in rack2. If a higher replication factor has been configured, the extra replicas are assigned to randomly chosen nodes.
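As a rough illustration of this heuristic (this is not HDFS's actual BlockPlacementPolicyDefault code; the function, the rack dictionary and the node names are invented for the example), a minimal Python sketch could look like this:

#!/usr/bin/python
# Illustrative sketch of the default 3-replica placement heuristic described above.
# 'racks' is assumed to be a dict mapping a rack name to the list of its DataNodes,
# with at least two racks and at least two nodes in the remote rack.
import random

def place_replicas(racks, writer_node, writer_rack):
    # 1st replica: the node the writer is running on.
    placement = [(writer_node, writer_rack)]
    # 2nd replica: a random node in a different (remote) rack.
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote_rack])
    placement.append((second, remote_rack))
    # 3rd replica: a different node in the same remote rack as the 2nd.
    third = random.choice([n for n in racks[remote_rack] if n != second])
    placement.append((third, remote_rack))
    return placement

if __name__ == '__main__':
    # Hypothetical two-rack layout, two nodes per rack.
    racks = {'/rack1': ['n1', 'n2'], '/rack2': ['n3', 'n4']}
    print(place_replicas(racks, 'n1', '/rack1'))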
Rack topology describes how the machines are physically connected in racks in our data centre, giving Hadoop a notion of how close or far apart our nodes are from one another, always speaking in terms of network connectivity. Understanding this can become especially critical when we define failure domains for our system.
Below are the steps needed to implement this feature. I start from a small test cluster made up of:
- vlihdp01.domain => NameNode and DataNode
- vlihdp02.domain => DataNode
- vlihdp03.domain => DataNode
The first step is to create a mapping file, topology.csv, that assigns each node (by hostname or IP) to the rack it sits in.
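A plausible version for the cluster above (the rack assignments match the -printTopology output shown later; listing both hostnames and IPs is my assumption, since the NameNode may pass either form to the script):

$ vim /opt/hadoop/etc/hadoop/topology.csv
vlihdp01.domain,/rack1
10.0.3.11,/rack1
vlihdp02.domain,/rack1
10.0.3.12,/rack1
vlihdp03.domain,/rack2
10.0.3.13,/rack2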
Next we create a script, fully customizable, that, given the previous file, returns the rack of each node (or list of nodes) passed to it as an argument. You can use an existing example or the one shown below, adapted from the book Hadoop Operations:
$ vim /opt/hadoop/etc/hadoop/topology.py
#!/usr/bin/python

import sys

class RackTopology:

    # Make sure you include the absolute path to topology.csv.
    DEFAULT_TOPOLOGY_FILE = '/opt/hadoop/etc/hadoop/topology.csv'
    DEFAULT_RACK = '/default-rack'

    def __init__(self, filename=DEFAULT_TOPOLOGY_FILE):
        self._filename = filename
        self._mapping = dict()
        self._load_topology(filename)

    def _load_topology(self, filename):
        '''
        Load a CSV-ish mapping file. Each line should be two comma-separated
        fields: the first is the hostname or IP and the second the rack it
        belongs to; anything else is discarded. Each field is stripped of
        whitespace. If the file fails to load for any reason, every host is
        mapped to the default rack.
        '''
        try:
            f = open(filename, 'r')
            for line in f:
                fields = line.split(',')
                if len(fields) == 2:
                    self._mapping[fields[0].strip()] = fields[1].strip()
        except:
            pass

    def rack_of(self, host):
        '''
        Look up a hostname or IP address in the mapping and return its rack.
        '''
        if host in self._mapping:
            return self._mapping[host]
        else:
            return RackTopology.DEFAULT_RACK

if __name__ == '__main__':
    app = RackTopology()
    logFile = open('/tmp/topology.log', 'a')
    for node in sys.argv[1:]:
        rack = app.rack_of(node)
        logFile.write(node + ' => ' + rack + '\n')
        print(rack)
    logFile.close()
Important! Make sure the DEFAULT_TOPOLOGY_FILE variable is set to the full path of the topology file created earlier. In my case: /opt/hadoop/etc/hadoop/topology.csv
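A quick sanity check (assuming the script lives at the path used throughout this post) is to grep for the variable and confirm it points at the right file:

$ grep DEFAULT_TOPOLOGY_FILE /opt/hadoop/etc/hadoop/topology.py
    DEFAULT_TOPOLOGY_FILE = '/opt/hadoop/etc/hadoop/topology.csv'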
Next, give the script execute permissions:
$ chmod 755 /opt/hadoop/etc/hadoop/topology.py
We can run a few quick tests to check that it works as expected:
$ python /opt/hadoop/etc/hadoop/topology.py vlihdp01.domain
/rack1
$ python /opt/hadoop/etc/hadoop/topology.py vlihdp02.domain
/rack1
$ python /opt/hadoop/etc/hadoop/topology.py vlihdp03.domain
/rack2
$ python /opt/hadoop/etc/hadoop/topology.py vlihdp01.domain vlihdp02.domain vlihdp03.domain
/rack1
/rack1
/rack2
Finally, we modify Hadoop's configuration, core-site.xml, adding the net.topology.script.file.name property and setting its value to the full path of the script we just created:
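The property block inside the <configuration> element of core-site.xml would look roughly like this, with the paths used in this post:

<property>
  <name>net.topology.script.file.name</name>
  <value>/opt/hadoop/etc/hadoop/topology.py</value>
</property>

Keep in mind that the NameNode reads this setting at startup, so it needs to be restarted before the new rack mapping shows up.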
All that is left is to verify that it works correctly.
$ hadoop dfsadmin -printTopology
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/11/25 22:55:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Rack: /rack1
10.0.3.11:50010 (vlihdp01.domain)
10.0.3.12:50010 (vlihdp02.domain)
Rack: /rack2
10.0.3.13:50010 (vlihdp03.domain)
$ hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/11/25 22:57:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 144813367296 (134.87 GB)
Present Capacity: 132053340160 (122.98 GB)
DFS Remaining: 132053200896 (122.98 GB)
DFS Used: 139264 (136 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 10.0.3.13:50010 (vlihdp03.domain)
Hostname: vlihdp03.domain
Rack: /rack2
Decommission Status : Normal
Configured Capacity: 48271122432 (44.96 GB)
DFS Used: 49152 (48 KB)
Non DFS Used: 4243726336 (3.95 GB)
DFS Remaining: 44027346944 (41.00 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.21%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 25 22:57:53 CET 2015
Name: 10.0.3.12:50010 (vlihdp02.domain)
Hostname: vlihdp02.domain
Rack: /rack1
Decommission Status : Normal
Configured Capacity: 48271122432 (44.96 GB)
DFS Used: 49152 (48 KB)
Non DFS Used: 4243742720 (3.95 GB)
DFS Remaining: 44027330560 (41.00 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.21%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 25 22:57:54 CET 2015
Name: 10.0.3.11:50010 (vlihdp01.domain)
Hostname: vlihdp01.domain
Rack: /rack1
Decommission Status : Normal
Configured Capacity: 48271122432 (44.96 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 4272558080 (3.98 GB)
DFS Remaining: 43998523392 (40.98 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.15%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 25 22:57:53 CET 2015