Thursday, July 18, 2013

Creating "Fake" Text with R

In this post I'll show how to create fake text that looks real by simulating the probability distribution of the words in a larger text.
First of all, we need a text document from which to estimate that word distribution. Here we'll use either "War and Peace" or "Don Quixote", depending on whether we want the generated text to be in English or Spanish:
lang <- TRUE # TRUE for English, FALSE for Spanish

if (lang) {
    fileName <- "warandpeace.txt"
    allowedChars <- c(LETTERS, "'")
} else {
    fileName <- "quijote.txt"
    allowedChars <- c(LETTERS, "Ñ", "Á", "É", "Í", "Ó", "Ú")
}

allowedChars
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "'"

filterSpecials <- TRUE
removeSpecialsDuplicated <- TRUE
reference <- readChar(fileName, file.info(fileName)$size)
reference <- toupper(reference)
Once the text file has been read, we clean it by replacing special characters with blank spaces:
# Individual characters
txt <- strsplit(reference, split = "", fixed = TRUE)[[1]]


if (filterSpecials) {
    txt[!(txt %in% allowedChars)] <- " "
}


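if (removeSpecialsDuplicated) {
    # Drop a special character when the next character is identical, so that
    # runs of blanks collapse to one. The last character has no successor.
    n <- length(txt)
    dup <- txt[-n] == txt[-1] & !(txt[-n] %in% allowedChars)
    txt <- txt[!c(dup, FALSE)]
}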

txt2 <- paste(txt, collapse = "")
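As a side note, the same cleaning can probably be done in one step with a regular expression. The sketch below is not the code used in this post: `cleanText` and its `allowed` argument are hypothetical names, and it assumes plain ASCII input like the English text.

```r
# Hypothetical helper: upper-case the text, then replace every run of
# characters outside `allowed` with a single space.
cleanText <- function(x, allowed = "A-Z'") {
  pattern <- paste0("[^", allowed, "]+")
  gsub(pattern, " ", toupper(x))
}

cleanText("Well, Prince... so Genoa and Lucca!")
# "WELL PRINCE SO GENOA AND LUCCA "
```

For the Spanish text the accented letters would have to be added to `allowed`, and the result may depend on the file encoding, so treat this as a starting point only.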
We extract the words of the text, drop the repetitions, and assign each distinct word a unique index:
# Split by words
txt <- strsplit(txt2, split = " ", fixed = FALSE)[[1]]

# Unique words
uniqueWords <- unique(txt)
head(uniqueWords)
## [1] "THE"       "PROJECT"   "GUTENBERG" "EBOOK"     "OF"        "WAR"

# Assign each word a unique index
indices <- match(txt, uniqueWords)

# Numeric indices of the first and second word of each bigram
firstWords <- indices[-length(indices)]
secondWords <- indices[-1]
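To see what these index vectors contain, here is a toy example on a made-up five-word sentence. The variable names (`words`, `vocab`, `idx`, `first`, `second`) are illustrative only and are not used elsewhere in the post.

```r
words <- c("THE", "CAT", "SAW", "THE", "DOG")

vocab <- unique(words)        # "THE" "CAT" "SAW" "DOG"
idx <- match(words, vocab)    # 1 2 3 1 4

first <- idx[-length(idx)]    # 1 2 3 1  (every word except the last)
second <- idx[-1]             # 2 3 1 4  (every word except the first)

# Each row is one bigram: a word and the word that follows it
bigrams <- data.frame(first = vocab[first], second = vocab[second])
print(bigrams)
```

Pairing the vector with a copy of itself shifted by one position is what turns a sequence of words into a list of bigrams.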
We build the transition matrix, which records how often each word is followed by each other word. Normalized by row, these counts give the probability that a given word is followed by another:
# Build the transition matrix
library(Matrix)
## Loading required package: lattice
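# Duplicated (first, second) pairs are summed, so each entry is a bigram count.
# Setting dims explicitly keeps the matrix square even if the last word is new.
nWords <- length(uniqueWords)
trans.mat <- sparseMatrix(firstWords, secondWords, x = rep(1, length.out = length(firstWords)),
    dims = c(nWords, nWords), dimnames = list(uniqueWords, uniqueWords))

# Top-left corner, with each row divided by its total count to get probabilities
Tr2 <- as.matrix(trans.mat[1:30, 1:30]) / rowSums(trans.mat)[1:30]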
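The matrix `trans.mat` stores bigram counts. As a small illustration of how counts turn into transition probabilities, the sketch below uses base R's `table` and `prop.table` on a made-up sentence. A dense table like this is only reasonable for a toy vocabulary; for a whole novel the sparse matrix is the practical choice.

```r
words <- c("THE", "CAT", "SAW", "THE", "DOG", "SAW", "THE", "CAT")
vocab <- unique(words)

# Factors with shared levels give a square table, one row and column per word
first <- factor(words[-length(words)], levels = vocab)
second <- factor(words[-1], levels = vocab)

counts <- table(first, second)            # how often `second` follows `first`
probs <- prop.table(counts, margin = 1)   # divide each row by its total

probs["THE", "CAT"]   # 2 of the 3 words following "THE" are "CAT"
probs["THE", "DOG"]   # the remaining 1 of 3
```

Every row of `probs` sums to 1 here because each word in the toy sentence has at least one successor. A word that only appears at the very end of a text would give a row of zeros and needs special handling.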
Let's plot the first rows and columns of the transition matrix:
# Plot the top-left corner of the transition matrix
library(ggplot2)
library(reshape2)

Tr2.long <- melt(Tr2)
ggplot(Tr2.long, aes(Var2, Var1)) +
    geom_tile(aes(fill = value)) +
    scale_fill_gradient(low = "white", high = "black", limits = c(0, 1)) +
    labs(x = "Probability of Second Word", y = "Conditioning on First Word", fill = "Prob") +
    scale_y_discrete(limits = rev(levels(Tr2.long$Var1))) +
    coord_equal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
[Figure: heat map of the transition matrix for the first 30 words]
Finally we build the fake text, starting from one word and choosing each following word according to the transition matrix:
# Build fake text
newTxtLength <- 400
newTxt <- character(newTxtLength)

# Seed word to begin the fake text: here, a word that follows "A" in the
# reference text. You could also pick one randomly
newTxt[1] <- sample(uniqueWords, 1, prob = trans.mat["A", ])

for (j in 2:newTxtLength) {
    # Look at the corresponding row of the matrix
    tFreq <- trans.mat[newTxt[j - 1], ]

    # Pick a new word; sample() rescales prob to sum to 1, and squaring the
    # counts favours the most frequent successors
    newTxt[j] <- sample(uniqueWords, 1, prob = tFreq^2)
}

# Collapse it all into a single string
(newTxt <- paste(newTxt, collapse = " "))
## [1] "CHANNEL AND IT WAS A MOST OF THE EMPEROR ALEXANDER THE RUSSIANS WERE AS IF IT WAS BEING THE FRENCH ARMY TO SHOW YOU WOULD BE A MINUTE TO THE SAME TIME HE HAD BEEN REVEALED TO THE WAR IS THE RUSSIAN ARMY TO BE A LONG TIME AND THE PRINCESS MARY FELT THAT THE OLD MAN HE DID NOT TO THE COMMANDER IN THE FRENCH TO THE SAME EVENING AND MORE SKILLFUL COMMANDER IN THE DOOR OF THE ROOM AND THE FRENCH SOLDIERS AND THE FRONT OF THE OFFICERS WHO HAD BEEN SENT TO THE MEN AND LAID DOWN TO THE TABLE WITH A BIT OF THE ENEMY AND DID NOT THE FRENCH ARMY IS A MINUTE I MAY I AM SURE THAT COULD NOT HAVE BEEN IN THE FIRST TO THE ROOM AND WITH A DAY EVIDENTLY EXPECTING SOMETHING TO THE FRENCH WERE HEARD THE NEWS OF THE EMPEROR WAS IN THE ENEMY AND WAS TO HIS FACE WAS A LARGE SUMS OF THE SAME FEELING THAT HAD BEEN IN HER HAND AND THE SAKE OF THE ROOM AND WITH HIS HEAD AND A FRENCH AND SO THAT THE WHOLE ARMY WAS A WHITE FACE WAS NOT YET HE IS IT HE HAD BEEN RECEIVED THE EMPEROR ALEXANDER PRINCE ANDREW'S FACE WAS NOT UNDERSTAND THAT THE ENEMY IN THE OLD PRINCE ANDREW WAS THE FRENCH AND THE ROSTOVS ARRIVED AT THE OLD MAN IN THE SAME TIME TO THE FRENCH EMPEROR AND THE ROOM WITH THE SAME TONE OF THE ROOM AND THAT HIS EYES AND WAS A MILITARY SCIENCE YOU SIR SAID HE WOULD HAVE YOU ARE YOU ARE THE COUNTESS ROSTOVA WHICH WAS GOING TO THE ROOM AND I AM NOT THE EMPEROR WOULD NOT LISTEN TO HIM WITH HIS SHOULDER HE HAD NOT AT THE EMPEROR AND THE FRENCH ARMY BUT I AM I HAVE DONE FOR THE LEFT FLANK AND TO THE FRENCH AND SO THAT THE FRENCH ARMY THEY HAD MET IN THE ROOM WITH YOU THE EMPEROR AND THE WHOLE LIFE AND RAN WITH A SOLDIER WITH A WONDERFUL BOY I ASKED THE PRINCESS MARY AND THE ROOM AND THE MEN AND IN THE SAME TIME THE DAY THE CROWD OF HIS LINE OF THE DOOR OF THE CARRIAGE AND THE WAR AND THE STAFF WERE THE COUNTESS MARY AND AGAIN WITH A BONAPARTIST SAYS THAT THE COUNTESS AND THE RIVER AND"
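Note that the loop samples with `tFreq^2` rather than `tFreq`. A quick worked example, using made-up counts for three hypothetical successor words, suggests what the squaring does to the sampling weights:

```r
# Made-up successor counts for some word (illustrative only)
counts <- c(THE = 6, A = 3, HIS = 1)

plain <- counts / sum(counts)          # 0.600 0.300 0.100
squared <- counts^2 / sum(counts^2)    # 36/46, 9/46, 1/46

round(rbind(plain, squared), 3)
```

Squaring shifts weight toward the most common successor (from 0.6 to about 0.78 here), so the generated text should stick more closely to frequent phrases. Sampling with `prob = tFreq` instead would follow the empirical bigram frequencies and would likely read as more varied but less fluent.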
What did you think? What other uses do you have for this approach?