Troubleshooting


This page is a collection of high-level guides and tips regarding how to diagnose issues encountered in Alluxio.

Note: this doc is not intended to be the full list of Alluxio questions. Feel free to post questions on the Alluxio Mailing List.

Where are the Alluxio logs?

Alluxio generates Master, Worker and Client logs under the directory ${ALLUXIO_HOME}/logs. They are named master.log, master.out, worker.log, worker.out and user_${USER}.log. Files suffixed with .log are generated by log4j; files suffixed with .out are generated by redirecting the stdout and stderr of the corresponding process.

The master and worker logs are useful for understanding what the Alluxio Master and Workers are doing, especially when you run into issues. If you do not understand the error messages, search for them in the Mailing List, in case the problem has been discussed before.

The client-side logs are also helpful when Alluxio service is running but the client cannot connect to the servers. Alluxio client emits logging messages through log4j, so the location of the logs is determined by the client side log4j configuration used by the application. For more information about logging, please check out this page.
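As a starting point for reading these logs, the sketch below lists error-level lines from the log4j-generated files. The function name scan_alluxio_logs is hypothetical, and the pattern assumes the log level appears as a space-delimited word (for example " ERROR "), which depends on your log4j layout.

```shell
# Print ERROR and FATAL lines, with file name and line number, from every
# log4j-generated .log file in an Alluxio logs directory.
# The default directory follows the ${ALLUXIO_HOME}/logs layout described above.
# Assumption: the log4j layout writes the level as a space-delimited word.
scan_alluxio_logs() (
  log_dir="${1:-${ALLUXIO_HOME}/logs}"
  # -H prints the file name, -n the line number, -E enables alternation.
  grep -HnE ' (ERROR|FATAL) ' "$log_dir"/*.log
)

# Usage:
#   scan_alluxio_logs             # uses ${ALLUXIO_HOME}/logs
#   scan_alluxio_logs /tmp/logs   # or an explicit directory
```

The .out files are skipped on purpose, since they hold raw stdout and stderr rather than leveled log4j output.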

Alluxio remote debug

Java remote debugging makes it easier to debug Alluxio at the source level without modifying any code. You will need to append the JVM remote debugging parameters and start a debugging server. There are several ways to append the remote debugging parameters; you can export the following configuration properties in shell or alluxio-env.sh:

  export ALLUXIO_WORKER_JAVA_OPTS="$ALLUXIO_JAVA_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=6606"
  export ALLUXIO_MASTER_JAVA_OPTS="$ALLUXIO_JAVA_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=6607"
  export ALLUXIO_USER_DEBUG_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=6609"

If you want to debug shell commands, you can add the -debug flag to start a debug server with the JVM debug parameters ALLUXIO_USER_DEBUG_JAVA_OPTS, such as alluxio fs -debug ls /.

The suspend = y/n setting decides whether the JVM process waits until the debugger connects. If you want to debug a shell command, set suspend = y so the command does not finish before the debugger attaches. Otherwise, set suspend = n to avoid unnecessary waiting time.

After starting the master or worker, use Eclipse, IntelliJ IDEA, or another Java IDE. Create a new Java remote debug configuration, set the debug server’s host and port, and start the debug session. Once execution reaches a breakpoint you have set, the IDE enters debug mode and you can inspect the current context’s variables, call stack, and thread list, and evaluate expressions.
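To make the port and suspend choices explicit, the JDWP option string can be assembled by a small helper. This is only a sketch: jdwp_opts is a hypothetical function name, and the resulting string has the same shape as the exports shown earlier in this section.

```shell
# Build a JDWP agent option string of the same shape as the exports above.
# $1: port the debug server listens on
# $2: "y" to make the JVM wait for a debugger before running, "n" otherwise
jdwp_opts() (
  port="$1"
  suspend="${2:-n}"
  case "$suspend" in
    y|n) ;;
    *) echo "suspend must be y or n" >&2; exit 1 ;;
  esac
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${suspend},address=${port}"
)

# Usage (worker, not suspended, port 6606 as in the example above):
#   export ALLUXIO_WORKER_JAVA_OPTS="$ALLUXIO_JAVA_OPTS $(jdwp_opts 6606 n)"
```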

Setup FAQ

Q: I’m new to Alluxio and cannot set up Alluxio on my local machine. What should I do?

A: Check ${ALLUXIO_HOME}/logs to see if there are any master or worker logs. Look for any errors in these logs. Double check if you missed any configuration steps in Running-Alluxio-Locally.

Typical issues:

  • ALLUXIO_UNDERFS_ADDRESS is not configured correctly.
  • If running ssh localhost fails, make sure the public SSH key for the host is added in ~/.ssh/authorized_keys.
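For the SSH issue, one way to check whether a public key is already authorized is sketched below. The function name key_is_authorized and the key file name in the usage line are assumptions; substitute the key you actually use.

```shell
# Succeed if the key in a public key file appears in an authorized_keys file.
# Only the base64 key body (second field of the .pub file) is compared, so a
# differing comment or an options prefix in authorized_keys does not matter.
key_is_authorized() (
  pub_file="$1"
  auth_file="${2:-$HOME/.ssh/authorized_keys}"
  key_body=$(awk '{print $2; exit}' "$pub_file")
  [ -n "$key_body" ] && grep -qF -- "$key_body" "$auth_file"
)

# Usage (id_rsa.pub is an assumed file name):
#   key_is_authorized ~/.ssh/id_rsa.pub || cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```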

Q: I’m trying to deploy Alluxio in a cluster with Spark and HDFS. Are there any suggestions?

A: Please follow Running-Alluxio-on-a-Cluster and Configuring-Alluxio-with-HDFS.

Tips:

  • The best performance gains occur when Alluxio workers are co-located with the nodes of the computation frameworks.
  • You can use Mesos and Yarn integration if you are already using Mesos or Yarn to manage your cluster.
  • If the under storage is remote (like S3 or remote HDFS), using Alluxio can be especially beneficial.

Q: Why do I see “Unsupported major.minor version 52.0” error when I start Alluxio?

A: Alluxio requires a Java 8 runtime to function properly. If this error is seen when starting the Alluxio master or worker, please set up your environment so that the default Java version is 8. If you see this error while using an application to access files on Alluxio, please make sure your application is running on Java 8.
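Class file version 52.0 corresponds to Java 8, so the error means an older JVM is trying to load classes built for Java 8. A minimal sketch for checking the default Java version follows; java_major is a hypothetical helper, and the usage lines assume the usual quoted version string printed by java -version.

```shell
# Convert a Java version string to its major version number.
# Legacy strings such as "1.8.0_181" map to 8; newer ones such as "11.0.2" map to 11.
java_major() (
  version="$1"
  case "$version" in
    1.*) version="${version#1.}" ;;    # drop the legacy "1." prefix
  esac
  printf '%s\n' "${version%%[!0-9]*}"  # keep the leading digits only
)

# Usage: extract the quoted version from `java -version` (printed on stderr).
#   v=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   [ "$(java_major "$v")" -eq 8 ] || echo "default Java is $v, expected Java 8"
```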

Usage FAQ

Q: Why do I see exceptions like “No FileSystem for scheme: alluxio”?

A: This error message is seen when your applications (e.g., MapReduce, Spark) try to access Alluxio as an HDFS-compatible file system, but the alluxio:// scheme is not recognized by the application. Please make sure your HDFS configuration file core-site.xml (in your default hadoop installation or spark/conf/ if you customize this file for Spark) has the following property:

  <configuration>
    <property>
      <name>fs.alluxio.impl</name>
      <value>alluxio.hadoop.FileSystem</value>
    </property>
  </configuration>
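A quick way to confirm that a given core-site.xml carries this property is sketched below. The function name has_alluxio_fs_impl is hypothetical, and the check is a plain text match rather than an XML parse, so it cannot tell a commented-out property from an active one.

```shell
# Succeed if a Hadoop configuration file mentions both the fs.alluxio.impl
# name element and the alluxio.hadoop.FileSystem value element.
# -F matches the strings literally, so the dots are not treated as wildcards.
has_alluxio_fs_impl() (
  conf_file="$1"
  grep -qF '<name>fs.alluxio.impl</name>' "$conf_file" &&
    grep -qF '<value>alluxio.hadoop.FileSystem</value>' "$conf_file"
)

# Usage (the path is an assumption; point it at the core-site.xml your job reads):
#   has_alluxio_fs_impl "$HADOOP_CONF_DIR/core-site.xml" || echo "fs.alluxio.impl is missing"
```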

Q: Why do I see exceptions like “java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found”?

A: This error message is seen when your applications (e.g., MapReduce, Spark) try to access Alluxio as an HDFS-compatible file system and the alluxio:// scheme has been configured correctly, but the Alluxio client jar is not found on the classpath of your application. Depending on the computation framework, users usually need to add the Alluxio client jar to the framework's classpath through environment variables or properties on all nodes running this framework. Here are some examples:

  • For MapReduce jobs, you can add the client jar to the front of $HADOOP_CLASSPATH:
    export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.2-client.jar:${HADOOP_CLASSPATH}
  • For Spark jobs, you can add the client jar to the front of $SPARK_CLASSPATH:
    export SPARK_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.2-client.jar:${SPARK_CLASSPATH}

Alternatively, add the following lines to spark/conf/spark-defaults.conf:

  spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.2-client.jar
  spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.2-client.jar

If the corresponding classpath has been set but exceptions still exist, users can check whether the path is valid by:

  ls /<PATH_TO_ALLUXIO>/client/alluxio-1.8.2-client.jar
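The same validity check can be applied to a whole classpath at once. The sketch below, with the hypothetical name missing_classpath_entries, prints every entry of a colon-separated classpath that does not exist on the local file system. It only checks the node it runs on, so it would need to be repeated on each node running the framework.

```shell
# Print every entry of a colon-separated classpath that does not exist on disk.
# Exit status is 0 when all entries exist and 1 when at least one is missing.
missing_classpath_entries() (
  status=0
  IFS=:
  set -f                      # keep wildcard entries such as /lib/* unexpanded
  for entry in $1; do
    case "$entry" in
      ''|*\**) continue ;;    # skip empty and wildcard entries
    esac
    if [ ! -e "$entry" ]; then
      printf '%s\n' "$entry"
      status=1
    fi
  done
  exit "$status"
)

# Usage:
#   missing_classpath_entries "$HADOOP_CLASSPATH"
```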

Q: I’m seeing error messages like “Frame size (67108864) larger than max length (16777216)”. What is wrong?

A: This problem can have several causes.

  • Please double check that the port in the Alluxio master address is correct. The default listening port for the Alluxio master is 19998. A common mistake that causes this error is using the wrong port in the master address (e.g., port 19999, which is the default Web UI port for the Alluxio master).
  • Please ensure that the security settings of Alluxio client and master are consistent. Alluxio provides different approaches to authenticate users by configuring alluxio.security.authentication.type. This error happens if this property is configured with different values across servers and clients (e.g., one uses the default value NOSASL while the other is customized to SIMPLE). Please read Configuration-Settings for how to customize Alluxio clusters and applications.
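To compare the authentication setting on both sides, the property can be read from each properties file. The sketch below is an illustration only: auth_type and same_auth_type are hypothetical names, the parsing handles simple key=value lines, and it does not see values passed as -D options or set elsewhere.

```shell
# Print the value of alluxio.security.authentication.type from a properties
# file, or NOSASL (the default named above) when the property is not set.
# Lines starting with "#" are ignored; the last assignment in the file wins.
auth_type() (
  value=$(sed -n '/^[[:space:]]*alluxio\.security\.authentication\.type[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;}' "$1" | tail -n 1)
  printf '%s\n' "${value:-NOSASL}"
)

# Succeed when two properties files resolve to the same authentication type.
same_auth_type() (
  [ "$(auth_type "$1")" = "$(auth_type "$2")" ]
)

# Usage (both paths are assumptions; use your client and master config files):
#   same_auth_type client/alluxio-site.properties master/alluxio-site.properties
```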

Q: I’m copying or writing data to Alluxio while seeing error messages like “Failed to cache: Not enough space to store block on worker”. Why?

A: This error indicates that Alluxio workers do not have enough space left to complete your write request.

  • For Alluxio version 1.6.0 and above, copyFromLocal uses RoundRobinPolicy by default. You can change the location policy for this command through the alluxio.user.file.copyfromlocal.write.location.policy.class property.
  • Before version 1.6.0, copyFromLocal applies LocalFirstPolicy by default and stores data on the local worker (see location policy), so you will see the above error once the local worker runs out of space. To distribute the data of your file across different workers, change the policy to RoundRobinPolicy:

    bin/alluxio fs -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy copyFromLocal foo /alluxio/path/foo

  • Check whether any files are unnecessarily pinned in memory and unpin them to release space. See Command-Line-Interface for more details.
  • Increase the capacity of workers by changing the alluxio.worker.memory.size property. See Configuration for more details.

Q: I’m writing a new file/directory to Alluxio and seeing journal errors in my application

A: When you see errors like “Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try”, the Alluxio master failed to update journal files stored in the HDFS directory given by the alluxio.master.journal.folder property. There can be multiple reasons for this type of error, typically that some HDFS datanodes serving the journal files are under heavy load or running out of disk space. When the journal directory is set to be in HDFS, please ensure the HDFS deployment is reachable and healthy so that Alluxio can store journals.

Q: I’m seeing that client connection was rejected by master

A: When you see errors from applications like "alluxio.exception.status.UnavailableException: Failed to connect to BlockMasterClient @ hostname:19998 after 13 attempts" and also find the following warning messages in logs/master.log: "WARN TThreadPoolServer - Task has been rejected by ExecutorService 9 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess@22fba58c rejected from java.util.concurrent.ThreadPoolExecutor@19593091[Running, pool size = 2048, active threads = 2048, queued tasks = 0, completed tasks = 14]", it indicates that the Alluxio master server has run out of threads in its thread pool to serve new incoming client requests.

To solve this issue, you can try:

  • Increase the thread pool size on the master to serve client requests by increasing alluxio.master.worker.threads.max. You can set this property to a larger value in conf/alluxio-site.properties. Note that this value should be no larger than the maximum number of open files allowed by the system. You can check the system limit using "ulimit -n" on Linux, or the equivalent on other systems.
  • Decrease the connection pool size that each client uses to send requests to the master by decreasing alluxio.user.block.master.client.threads (default: 10) and alluxio.user.file.master.client.threads (default: 10). You can set these properties to smaller values in conf/alluxio-site.properties. Note that reducing these two values may add latency to requests served by the master.
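Before raising alluxio.master.worker.threads.max, the proposed value can be compared with the open-file limit. A minimal sketch follows; threads_max_fits is a hypothetical name, and the check is only meaningful when run as the user that starts the Alluxio master, on the master node.

```shell
# Succeed if a proposed alluxio.master.worker.threads.max value does not exceed
# the open-file limit. $2 defaults to the current shell's `ulimit -n`.
threads_max_fits() (
  desired="$1"
  limit="${2:-$(ulimit -n)}"
  [ "$limit" = "unlimited" ] || [ "$desired" -le "$limit" ]
)

# Usage:
#   threads_max_fits 4096 || echo "raise the open-file limit first (current: $(ulimit -n))"
```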

Q: I added some files in under file system. How can I reveal the files in Alluxio?

A: By default, Alluxio loads the list of files the first time a directory is visited. Afterwards, Alluxio keeps using the cached file list regardless of changes in the under file system. To reveal new files from the under file system, you can use the command alluxio fs ls -f /some/path to manually discover the new content inside a specific folder. Another way to refresh a directory is to use UFS sync. You can either use it on the command line by running alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=${SOME_INTERVAL} /path or set the same configuration property in the masters’ alluxio-site.properties. The value of the configuration property determines the minimum interval between two syncs. You can read more about loading files from the under file system here.

Q: I see an error “Block ?????? is unavailable in both Alluxio and UFS” while reading some file. Where is my file?

A: When writing files to Alluxio, one of several write types can be used to tell the Alluxio worker how the data should be stored:

  • MUST_CACHE: data will be stored in Alluxio only
  • CACHE_THROUGH: data will be cached in Alluxio as well as written to UFS
  • THROUGH: data will be written to UFS only

By default the write type used by Alluxio client is MUST_CACHE, therefore a new file written to Alluxio is only stored in Alluxio worker storage, and can be evicted when Alluxio worker storage is full and some new data needs to be cached. To make sure data is persisted, either use CACHE_THROUGH or THROUGH write type, or pin the files you would like to preserve.

Another possible cause for this error is that the block exists in the file system, but no worker has connected to master. In that case the error will go away once at least one worker containing this block is connected.
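The persistence behavior of the three write types listed above can be captured in a small lookup, for example to guard a script that writes data that must survive eviction. The function name write_type_persists is hypothetical, and the property name in the usage comment (alluxio.user.file.writetype.default) is an assumption to verify against the configuration reference for your Alluxio version.

```shell
# Succeed if the given write type also writes data to the UFS.
# Exit status: 0 = persisted to UFS, 1 = Alluxio only, 2 = not one of the
# three write types described above.
write_type_persists() (
  case "$1" in
    CACHE_THROUGH|THROUGH) exit 0 ;;
    MUST_CACHE)            exit 1 ;;
    *) echo "unknown write type: $1" >&2; exit 2 ;;
  esac
)

# Usage (property name is an assumption, see the note above):
#   wt=CACHE_THROUGH
#   write_type_persists "$wt" &&
#     bin/alluxio fs -Dalluxio.user.file.writetype.default="$wt" copyFromLocal foo /alluxio/path/foo
```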

Q: I’m running an Alluxio shell command and it hangs without giving any output. What’s going on?

A: Most Alluxio shell commands need to connect to the Alluxio master to execute. If a command fails to connect to the master, it keeps retrying several times and appears to “hang” for a long time. It is also possible that some commands take a long time to execute, such as persisting a large file on a slow UFS. If you want to know what happens under the hood, check the user log (stored as ${ALLUXIO_HOME}/logs/user_${USER_NAME}.log by default) or the master log (stored as ${ALLUXIO_HOME}/logs/master.log on the master node by default).
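When scripting around commands that may hang, one option is to bound them with a time limit and then look at the logs. The sketch below assumes the coreutils timeout command is available (it is not installed by default on every system); run_with_limit is a hypothetical name.

```shell
# Run a command with a time limit in seconds and report when the limit is hit.
# Relies on the coreutils `timeout` command, which exits with 124 on timeout.
run_with_limit() (
  limit="$1"; shift
  rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}s: $*" >&2
  fi
  exit "$rc"
)

# Usage: give the command 30 seconds, then inspect the user log named above.
#   run_with_limit 30 bin/alluxio fs ls / || echo "check the user log and master log"
```

A timeout here does not distinguish a connection problem from a genuinely slow operation; the logs remain the place to tell the two apart.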

Performance FAQ

Q: I tested Alluxio/Spark against HDFS/Spark (running simple word count of GBs of files). Why is there no discernible performance difference?

A: Alluxio accelerates your system performance by leveraging temporal or spatial locality using distributed in-memory storage (and tiered storage). If your workloads don’t have any locality, you will not see a noticeable performance boost.

For a comprehensive guide on tuning performance of Alluxio cluster, please check out this page.

Environment

Alluxio can be configured in a variety of modes and in different production environments. Please make sure the Alluxio version being deployed is up-to-date and supported.

When posting questions on the Mailing List, please attach the full environment information, including

  • Alluxio version
  • OS version
  • Java version
  • UnderFileSystem type and version
  • Computing framework type and version
  • Cluster information, e.g. the number of nodes, memory size in each node, intra-datacenter or cross-datacenter
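Part of this information can be gathered with a short script. The sketch below covers only the items that are easy to query locally; print_env_report is a hypothetical name, and it assumes the alluxio launcher under ${ALLUXIO_HOME}/bin accepts a version subcommand. The under file system, computing framework, and cluster details still need to be added by hand.

```shell
# Print the locally discoverable parts of the environment report:
# Alluxio version, OS version, and Java version.
print_env_report() (
  alluxio_bin="${ALLUXIO_HOME:-}/bin/alluxio"
  if [ -x "$alluxio_bin" ]; then
    # Assumption: the launcher supports a `version` subcommand.
    printf 'Alluxio version: %s\n' "$("$alluxio_bin" version 2>&1 | head -n 1)"
  else
    printf 'Alluxio version: unknown (%s not found)\n' "$alluxio_bin"
  fi
  printf 'OS version: %s\n' "$(uname -sr)"
  if command -v java >/dev/null 2>&1; then
    # `java -version` prints to stderr; keep only the first line.
    printf 'Java version: %s\n' "$(java -version 2>&1 | head -n 1)"
  else
    printf 'Java version: java not found on PATH\n'
  fi
)

# Usage:
#   print_env_report > env-report.txt
```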