testucker

Author

Sarah Urbut

Published

September 1, 2024

Comparison of Tensor Decomposition and Our Model

Let

\[\Theta = \sum_{f=1}^r \lambda_f a_f \otimes b_f \otimes c_f\]

Where:

\(\Theta \in \mathbb{R}^{N \times T \times L}\) is the parameter tensor \(A = (a_1, \ldots, a_r) \in \mathbb{R}^{N \times r}\) is the subject-topic matrix \(B = (b_1, \ldots, b_r) \in \mathbb{R}^{T \times r}\) is the age-topic matrix \(C = (c_1, \ldots, c_r) \in \mathbb{R}^{L \times r}\) is the disease-topic matrix \(\lambda_f\) represents the importance of each topic

\(\Theta \in \mathbb{R}^{N \times T \times L}\) is the parameter tensor \(A = (a_1, \ldots, a_r) \in \mathbb{R}^{N \times r}\) is the subject-topic matrix \(B = (b_1, \ldots, b_r) \in \mathbb{R}^{T \times r}\) is the age-topic matrix \(C = (c_1, \ldots, c_r) \in \mathbb{R}^{L \times r}\) is the disease-topic matrix \(\lambda_f\) represents the importance of each topic

For a specific entry: \[\theta_{i,t,l} = \sum_{f=1}^r \lambda_f a_{i,f} b_{t,f} c_{l,f}\]

Where:

Only \(b_{t,f}\) varies with time \(c_{l,f}\) (disease loading) is constant over time Time variation is projected onto the same factor (\(b_{t,f}\)) for both topics and diseases

[ _{i,t,l} = d(t) + k {ik}(t) {kd}(t) ]

Where:

In our approach: \[ \lambda_{ik}(t) \sim GP(\Gamma_k g_i, K_{\ell_\lambda, \sigma^2_\lambda}(t, t')) \] \[ \phi_{kd}(t) \sim GP(0, K_{\ell_\phi, \sigma^2_\phi}(t, t')) \] Where:

\(\lambda_{ik}(t)\) represents the topic weight for individual \(i\) and topic \(k\) at time \(t\) \(\phi_{kd}(t)\) represents the disease loading for topic \(k\) and disease \(d\) at time \(t\) \(K_{\ell, \sigma^2}(t, t')\) is the covariance function for the Gaussian Process

The key difference in our model is that \(\lambda_{ik}(t)\) and \(\phi_{kd}(t)\) are modeled as separate Gaussian Processes, allowing for distinct temporal patterns.

In contrast, the Tucker decomposition approach you mentioned earlier constrains the time variation pattern: \[ \text{Lambda}[i, k, t] = U_1[i, k] \cdot U_3[t, k] \] \[ \text{Phi}[k, d, t] = U_2[d, k] \cdot U_3[t, k] \] Here, \(U_3[t, k]\) imposes the same temporal pattern on both topic weights and disease loadings for a given topic \(k\). Your approach provides more flexibility by allowing:

Different temporal patterns for topic weights and disease loadings Individual-specific temporal patterns for topic weights Disease-specific temporal patterns for loadings within each topic

This increased flexibility in modeling temporal patterns separately for topic weights and disease loadings is a key advantage of our approach over standard tensor decomposition methods.

Key Differences:

  1. Both λ_ik(t) (topic weights) and φ_kd(t) (disease loadings) are modeled as separate Gaussian Processes, allowing for distinct temporal patterns.

  2. The covariance structure U_k for φ_k captures disease correlation patterns over time, which is different from projecting both topics and diseases onto the same time factor (U3) as in standard tensor decomposition.

  3. Our model allows for:

  • Different temporal patterns for topic weights and disease loadings
  • Individual-specific temporal patterns for topic weights
  • Disease-specific temporal patterns for loadings within each topic
  • Explicit modeling of disease correlations over time through U_k

This formulation provides significantly more flexibility in modeling temporal patterns and disease interactions compared to the standard tensor decomposition approach. It allows for a more nuanced representation of how both topic weights and disease loadings evolve over time, as well as how diseases correlate within topics across time.

Code
library(rTensor)
library(ggplot2)
library(reshape2)
library(dplyr)
library(data.table)

set.seed(123)

# Dimensions
N <- 50  # Number of individuals
D <- 20  # Number of diseases
T <- 30  # Number of time points
K <- 3   # Number of latent factors/topics

# Create synthetic data
create_synthetic_data <- function(N, D, T, K) {
  # Individual-topic preferences (varying over time)
  individual_preferences <- array(0, dim = c(N, K, T))
  for(i in 1:N) {
    for(k in 1:K) {
      individual_preferences[i,k,] <- cumsum(rnorm(T, 0, 0.1)) + sin(seq(0, runif(1,min = 0.5,max = 5)*pi, length.out = T))
    }
  }
  
  # Disease-topic associations (varying over time)
  disease_associations <- array(0, dim = c(D, K, T))
  for(d in 1:D) {
    for(k in 1:K) {
      disease_associations[d,k,] <- cumsum(rnorm(T, 0, 0.1)) + cos(seq(0, runif(1,min = 0.5,max = 5)*pi, length.out = T))
    }
  }
  
  # Create tensor
  tensor <- array(0, dim = c(N, D, T))
  for(i in 1:N) {
    for(d in 1:D) {
      for(t in 1:T) {
        tensor[i,d,t] <- sum(individual_preferences[i,,t] * disease_associations[d,,t])
      }
    }
  }
  
  return(list(tensor = tensor, 
              individual_preferences = individual_preferences,
              disease_associations = disease_associations))
}

# Generate data
data <- create_synthetic_data(N, D, T, K)

# Apply Tucker decomposition
tucker_result <- tucker(as.tensor(data$tensor), ranks = c(K, K, K))

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |======================================================================| 100%
Code
# Extract components
U1 <- tucker_result$U[[1]]  # Individual factors
U2 <- tucker_result$U[[2]]  # Disease factors
U3 <- tucker_result$U[[3]]  # Time factors
G <- tucker_result$Z        # Core tensor

# Reconstruct the tensor
reconstructed_tensor <- ttl(G, list(U1, U2, U3), ms = c(1,2,3))

# Compare original and reconstructed tensors
mse <- mean((data$tensor - as.array(reconstructed_tensor@data))^2)
cor_val <- cor(as.vector(data$tensor), as.vector(reconstructed_tensor@data))

print(paste("Mean Squared Error:", mse))
[1] "Mean Squared Error: 0.773091375056652"
Code
print(paste("Correlation:", cor_val))
[1] "Correlation: 0.619830928948207"
Code
rm(r)
mtensor=melt(reconstructed_tensor@data)
mdata=melt(data$tensor)
colnames(mtensor)=c("N","D","T","value")
colnames(mdata)=c("N","D","T","value")
mtensor$type="reconstructed"
mdata$type="original"

r=rbind(mtensor,mdata)
ggplot(r[r$N%in%sample(N,1),],aes(x=T,y=value,group=as.factor(type),col=as.factor(type)))+
  geom_smooth(se=FALSE)+facet_wrap(~as.factor(D),scales="free_y")+theme_classic()

Code
image(as.array(reconstructed_tensor@data)[1,,],main="Reconstructed tensor")

Code
image(as.array(data$tensor)[1,,],main="Original Data")

Tucker Simulation

Now we simulate under the tucker situation in which the disease loadings are constant over time and the time variation is projected onto the same factor for both topics and diseases.

Code
## tucker compoativble

# Dimensions
N <- 50  # Number of individuals
D <- 20  # Number of diseases
T <- 30  # Number of time points
K <- 3   # Number of latent factors/topics

# Create Tucker-compatible synthetic data
create_tucker_compatible_data <- function(N, D, T, K) {
  # Fixed individual-topic preferences
  U1 <- matrix(runif(N * K), N, K)
  
  # Fixed disease-topic associations
  U2 <- matrix(runif(D * K), D, K)
  
  # Time-varying topic strengths
  U3 <- matrix(0, T, K)
  for(k in 1:K) {
    U3[,k] <- cumsum(rnorm(T, 0, 0.1)) + sin(seq(0, 2*pi, length.out = T))
  }
  
  # Core tensor
  G <- array(runif(K^3), dim = c(K, K, K))
  
  # Create tensor
  tensor <- ttl(as.tensor(G), list(U1, U2, U3), ms = c(1,2,3))
  
  return(list(tensor = tensor@data, U1 = U1, U2 = U2, U3 = U3, G = G))
}

# Generate Tucker-compatible data
tucker_data <- create_tucker_compatible_data(N, D, T, K)

# Apply Tucker decomposition
tucker_result <- tucker(as.tensor(tucker_data$tensor), ranks = c(K, K, K))

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======================================================================| 100%
Code
# Reconstruct the tensor
reconstructed_tensor <- ttl(tucker_result$Z, tucker_result$U, ms = c(1,2,3))

here we see that if data is simulated this way, tucker decomp perfectly reproudces it.

Code
# Analyze core tensor


# Compare original and reconstructed tensors
mse <- mean((tucker_data$tensor - as.array(reconstructed_tensor@data))^2)
cor_val <- cor(as.vector(tucker_data$tensor), as.vector(reconstructed_tensor@data))
print(paste("Mean Squared Error:", mse))
[1] "Mean Squared Error: 1.90911964619841e-29"
Code
print(paste("Correlation:", cor_val))
[1] "Correlation: 1"
Code
mtensor=melt(reconstructed_tensor@data)
mdata=melt(tucker_data$tensor)
colnames(mtensor)=c("N","D","T","value")
colnames(mdata)=c("N","D","T","value")
mtensor$type="reconstructed"
mdata$type="original"

r=rbind(mtensor,mdata)
ggplot(r[r$N%in%sample(N,1),],aes(x=T,y=value,group=as.factor(type),col=as.factor(type)))+
  geom_point()+facet_wrap(~interaction(as.factor(D),as.factor(type)),scales="free_y")+theme_classic()

Code
# Function to plot overall patterns for an individual
plot_individual_overall <- function(data, result, individual_id) {
  original <- apply(data$U1[individual_id,] %o% t(data$U3), 2, sum)
  reconstructed <- apply(result$U[[1]][individual_id,] %o% t(result$U[[3]]), 2, sum)
  
  df <- data.frame(
    Time = 1:T,
    Original = original,
    Reconstructed = reconstructed
  )
  
  ggplot(df, aes(x = Time)) +
    geom_line(aes(y = Original, color = "Original")) +
    geom_line(aes(y = Reconstructed, color = "Reconstructed")) +
    ggtitle(paste("Overall Pattern for Individual", individual_id)) +
    theme_minimal()
}

# Function to plot overall patterns for a disease
plot_disease_overall <- function(data, result, disease_id) {
  original <- apply(data$U2[disease_id,] %o% t(data$U3), 2, sum)
  reconstructed <- apply(result$U[[2]][disease_id,] %o% t(result$U[[3]]), 2, sum)
  
  df <- data.frame(
    Time = 1:T,
    Original = original,
    Reconstructed = reconstructed
  )
  
  ggplot(df, aes(x = Time)) +
    geom_line(aes(y = Original, color = "Original")) +
    geom_line(aes(y = Reconstructed, color = "Reconstructed")) +
    ggtitle(paste("Overall Pattern for Disease", disease_id)) +
    theme_minimal()
}

# Generate and print overall pattern plots
individual_overall_plot <- plot_individual_overall(tucker_data, tucker_result, 1)
disease_overall_plot <- plot_disease_overall(tucker_data, tucker_result, 1)
#print(individual_overall_plot)
#print(disease_overall_plot)

# Visualize multiple slices
plot_multiple_slices <- function(original, reconstructed, n_slices = 4) {
  slices <- sample(dim(original)[1], n_slices)
  par(mfrow = c(n_slices, 2), mar = c(2,2,2,2))
  for (i in slices) {
    image(t(original[i,,]), main = paste("Original Slice", i))
    image(t(reconstructed[i,,]), main = paste("Reconstructed Slice", i))
  }
}

plot_multiple_slices(as.array(tucker_data$tensor), as.array(reconstructed_tensor@data))