Tidyr separate

Tidyr separate DEFAULT

Using separate() in tidyr on multiple columns?

John Mola's profile photo

John Mola

unread,
Mar 2, 2017, 4:08:39 AM3/2/17

Reply to author

Sign in to reply to author

Forward

Sign in to forward

Delete

You do not have permission to delete messages in this group

Link

Report message as abuse

Sign in to report message as abuse

Show original message

Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message

to [email protected]

Hi all,

Not a very good drug user, so please be gentle with me. 

Here is some dummy data:

Ind = c("SNP1","SNP1","SNP2","SNP2","SNP3")

SNP1 = c("AA","TT","AT","TT","TT")

SNP2 = c("GC","GG","GG","CC","GC")

SNP3 = c("GG","GC","GG","CC","GC")

df = data.frame(Ind,SNP1,SNP2,SNP3)

df_filt= distinct(df, Ind, .keep_all = TRUE)

df_filt

> df_filt

   Ind SNP1 SNP2 SNP3

1 SNP1   AA   GC   GG

2 SNP2   AT   GG   GG

3 SNP3   TT   GC   GC

(Dummy data loaded in a bit weird, because there's an intermediate step in there, where I only wanted the unique "SNP" rows)

So now, I'd like to split the AA's, GC's, etc into separate columns. I manage to do this with separate() in one column:

> separate(data = df_filt, col = SNP1, into = c("SNP1.1","SNP1.2"), sep=c(1))

   Ind SNP1.1 SNP1.2 SNP2 SNP3

1 SNP1      A      A   GC   GG

2 SNP2      A      T   GG   GG

3 SNP3      T      T   GC   GC

But I can't seem to figure out a way to automate this across many (eventually thousands) of columns at once. Separate does not allow you to call a vector of column names, for instance.

Any thoughts?

Thanks!

John

--
John Mola's profile photo

John Mola

unread,
Mar 2, 2017, 4:12:08 AM3/2/17

Reply to author

Sign in to reply to author

Forward

Sign in to forward

Delete

You do not have permission to delete messages in this group

Link

Report message as abuse

Sign in to report message as abuse

Show original message

Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message

to [email protected]

Oh yeah. Apologies for the column/row names matching. I was messing with my messing with data. 

Here:

> Ind = c("Bee1","Bee1","Bee2","Bee2","Bee3")

> SNP1 = c("AA","TT","AT","TT","TT")

> SNP2 = c("GC","GG","GG","CC","GC")

> SNP3 = c("GG","GC","GG","CC","GC")

> df = data.frame(Ind,SNP1,SNP2,SNP3)

> df_filt= distinct(df, Ind, .keep_all = TRUE)

> df_filt

   Ind SNP1 SNP2 SNP3

1 Bee1   AA   GC   GG

2 Bee2   AT   GG   GG

3 Bee3   TT   GC   GC

> separate(data = df_filt, col = SNP1, into = c("SNP1.1","SNP1.2"), sep=c(1))

   Ind SNP1.1 SNP1.2 SNP2 SNP3

1 Bee1      A      A   GC   GG

2 Bee2      A      T   GG   GG

3 Bee3      T      T   GC   GC

Vince S. Buffalo's profile photo

Vince S. Buffalo

unread,
Mar 2, 2017, 4:22:52 AM3/2/17

Reply to author

Sign in to reply to author

Forward

Sign in to forward

Delete

You do not have permission to delete messages in this group

Link

Report message as abuse

Sign in to report message as abuse

Show original message

Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message

to [email protected]

So a few things —

First I'd use a tibble:

df <- tibble(Ind,SNP1,SNP2,SNP3)

df_filt <- distinct(df, Ind, .keep_all = TRUE)

and since you don't have a separator, you can't use spread() — you need to use extract() with a grouping regular expression. This is a bit permissive, but works:

df_filt %>% gather(snp, value, -Ind) %>% extract(value, into=c('h1', 'h2'), '(.)(.)')

# A tibble: 9 × 4

    Ind   snp    h1    h2

* <chr> <chr> <chr> <chr>

1  Bee1  SNP1     A     A

2  Bee2  SNP1     A     T

3  Bee3  SNP1     T     T

4  Bee1  SNP2     G     C

5  Bee2  SNP2     G     G

6  Bee3  SNP2     G     C

7  Bee1  SNP3     G     G

8  Bee2  SNP3     G     G

9  Bee3  SNP3     G     C

Note how I use gather here to gather all SNP columns. It's easier to apply this operation to long data and then recast to wide data using spread(). 

HTH,

Vince

Jaime Ashander's profile photo

Jaime Ashander

unread,
Mar 2, 2017, 4:55:52 AM3/2/17

Reply to author

Sign in to reply to author

Forward

Sign in to forward

Delete

You do not have permission to delete messages in this group

Link

Report message as abuse

Sign in to report message as abuse

Show original message

Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message

to [email protected]

As mentioned, the key thing is using gather to make the data long. You could still use
separate in place of Vince's extract, you'd just need to use the new column
name you passed to gather (value in this case) so this would work too:

df_filt %>% gather(snp, value, -Ind) %>% separate(col = value, into = c("SNP1.1","SNP1.2"), sep=1)

(when numeric, sep is interpreted as a position in the string)

John Mola's profile photo

John Mola

unread,
Mar 2, 2017, 5:39:58 AM3/2/17

Reply to author

Sign in to reply to author

Forward

Sign in to forward

Delete

You do not have permission to delete messages in this group

Link

Report message as abuse

Sign in to report message as abuse

Show original message

Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message

to [email protected]

By god. It works. Thank you very much y'all!

Cheers,

John

Sours: https://groups.google.com/

Given either a regular expression or a vector of character positions, turns a single character column into multiple columns.

separate(data, col, into, sep ="[^[:alnum:]]+", remove =TRUE, convert =FALSE, extra ="warn", fill ="warn", ...)

Arguments

data

A data frame.

col

Column name or position. This is passed to .

This argument is passed by expression and supports quasiquotation (you can unquote column names or column positions).

into

Names of new variables to create as character vector. Use to omit the variable in the output.

sep

Separator between columns.

If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of should be one less than .

remove

If , remove input column from output data frame.

convert

If , will run with on new columns. This is useful if the component columns are integer, numeric or logical.

NB: this will cause string s to be converted to s.

extra

If is a character vector, this controls what happens when there are too many pieces. There are three valid options:

  • "warn" (the default): emit a warning and drop extra values.

  • "drop": drop any extra values without a warning.

  • "merge": only splits at most times

fill

If is a character vector, this controls what happens when there are not enough pieces. There are three valid options:

  • "warn" (the default): emit a warning and fill from the right

  • "right": fill with missing values on the right

  • "left": fill with missing values on the left

...

Additional arguments passed on to methods.

See also

, the complement, which uses regular expression capturing groups.

Examples

library(dplyr)# If you want to split by any non-alphanumeric value (the default):df<-data.frame(x =c(NA, "x.y", "x.z", "y.z"))df%>%separate(x, c("A", "B"))

#> A B #> 1 <NA> <NA> #> 2 x y #> 3 x z #> 4 y z

# If you just want the second variable:df%>%separate(x, c(NA, "B"))

#> B #> 1 <NA> #> 2 y #> 3 z #> 4 z

# If every row doesn't split into the same number of pieces, use# the extra and fill arguments to control what happens:df<-data.frame(x =c("x", "x y", "x y z", NA))df%>%separate(x, c("a", "b"))

#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].

#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].

#> a b #> 1 x <NA> #> 2 x y #> 3 x y #> 4 <NA> <NA>

# The same behaviour as previous, but drops the c without warnings:df%>%separate(x, c("a", "b"), extra ="drop", fill ="right")

#> a b #> 1 x <NA> #> 2 x y #> 3 x y #> 4 <NA> <NA>

# Opposite of previous, keeping the c and filling left:df%>%separate(x, c("a", "b"), extra ="merge", fill ="left")

#> a b #> 1 <NA> x #> 2 x y #> 3 x y z #> 4 <NA> <NA>

# Or you can keep all three:df%>%separate(x, c("a", "b", "c"))

#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].

#> a b c #> 1 x <NA> <NA> #> 2 x y <NA> #> 3 x y z #> 4 <NA> <NA> <NA>

# To only split a specified number of times use extra = "merge":df<-data.frame(x =c("x: 123", "y: error: 7"))df%>%separate(x, c("key", "value"), ": ", extra ="merge")

#> key value #> 1 x 123 #> 2 y error: 7

# Use regular expressions to separate on multiple characters:df<-data.frame(x =c(NA, "x?y", "x.z", "y:z"))df%>%separate(x, c("A","B"), sep ="([.?:])")

#> A B #> 1 <NA> <NA> #> 2 x y #> 3 x z #> 4 y z

# convert = TRUE detects column classes:df<-data.frame(x =c("x:1", "x:2", "y:4", "z", NA))df%>%separate(x, c("key","value"), ":")%>%str

#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].

#> 'data.frame': 5 obs. of 2 variables: #> $ key : chr "x" "x" "y" "z" ... #> $ value: chr "1" "2" "4" NA ...

df%>%separate(x, c("key","value"), ":", convert =TRUE)%>%str

#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].

#> 'data.frame': 5 obs. of 2 variables: #> $ key : chr "x" "x" "y" "z" ... #> $ value: int 1 2 4 NA NA

Sours: https://tidyr.tidyverse.org/reference/separate.html
  1. Delta flight 543
  2. Schwinn bike aluminum frame
  3. Brat tv show auditions
#' Separate a character column into multiple columns with a regular#' expression or numeric locations#'#' Given either a regular expression or a vector of character positions,#' `separate()` turns a single character column into multiple columns.#'#' @inheritParams extract#' @param sep Separator between columns.#'#' If character, `sep` is interpreted as a regular expression. The default#' value is a regular expression that matches any sequence of#' non-alphanumeric values.#'#' If numeric, `sep` is interpreted as character positions to split at. Positive#' values start at 1 at the far-left of the string; negative value start at -1 at#' the far-right of the string. The length of `sep` should be one less than#' `into`.#' @param extra If `sep` is a character vector, this controls what#' happens when there are too many pieces. There are three valid options:#'#' * "warn" (the default): emit a warning and drop extra values.#' * "drop": drop any extra values without a warning.#' * "merge": only splits at most `length(into)` times#' @param fill If `sep` is a character vector, this controls what#' happens when there are not enough pieces. There are three valid options:#'#' * "warn" (the default): emit a warning and fill from the right#' * "right": fill with missing values on the right#' * "left": fill with missing values on the left#' @seealso [unite()], the complement, [extract()] which uses regular#' expression capturing groups.#' @export#' @examples#' library(dplyr)#' # If you want to split by any non-alphanumeric value (the default):#' df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))#' df %>% separate(x, c("A", "B"))#'#' # If you just want the second variable:#' df %>% separate(x, c(NA, "B"))#'#' # If every row doesn't split into the same number of pieces, use#' # the extra and fill arguments to control what happens:#' df <- data.frame(x = c("x", "x y", "x y z", NA))#' df %>% separate(x, c("a", "b"))#' # The same behaviour as previous, but drops the c without warnings:#' df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")#' # Opposite of previous, keeping the c and filling left:#' df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")#' # Or you can keep all three:#' df %>% separate(x, c("a", "b", "c"))#'#' # To only split a specified number of times use extra = "merge":#' df <- data.frame(x = c("x: 123", "y: error: 7"))#' df %>% separate(x, c("key", "value"), ": ", extra = "merge")#'#' # Use regular expressions to separate on multiple characters:#' df <- data.frame(x = c(NA, "x?y", "x.z", "y:z"))#' df %>% separate(x, c("A","B"), sep = "([.?:])")#'#' # convert = TRUE detects column classes:#' df <- data.frame(x = c("x:1", "x:2", "y:4", "z", NA))#' df %>% separate(x, c("key","value"), ":") %>% str#' df %>% separate(x, c("key","value"), ":", convert = TRUE) %>% strseparate<-function(data, col, into, sep="[^[:alnum:]]+", remove=TRUE,convert=FALSE, extra="warn", fill="warn", ...) {ellipsis::check_dots_used() UseMethod("separate")}#' @exportseparate.data.frame<-function(data, col, into, sep="[^[:alnum:]]+",remove=TRUE, convert=FALSE,extra="warn", fill="warn", ...) { check_present(col)var<-tidyselect::vars_pull(names(data), !! enquo(col))value<- as.character(data[[var]])new_cols<- str_separate(value,into=into,sep=sep,convert=convert,extra=extra,fill=fill )out<- append_df(data, new_cols, var, remove=remove) reconstruct_tibble(data, out, if (remove) varelseNULL)}str_separate<-function(x, into, sep, convert=FALSE, extra="warn", fill="warn") {if (!is.character(into)) { abort("`into` must be a character vector") }if (is.numeric(sep)) {out<- strsep(x, sep) } elseif (is_character(sep)) {out<- str_split_fixed(x, sep, length(into), extra=extra, fill=fill) } else { abort("`sep` must be either numeric or character") } names(out) <- as_utf8_character(into)out<-out[!is.na(names(out))]if (convert) {out[] <- map(out, type.convert, as.is=TRUE) } as_tibble(out)}strsep<-function(x, sep) {nchar<- nchar(x)pos<- map(sep, function(i) {if (i>=0) return(i) pmax(0, nchar+i) })pos<- c(list(0), pos, list(nchar)) map(1:(length(pos) -1), function(i) { substr(x, pos[[i]] +1, pos[[i+1]]) })}str_split_fixed<-function(value, sep, n, extra="warn", fill="warn") {if (extra=="error") { warn(glue("`extra = \"error\"` is deprecated. \\ Please use `extra = \"warn\"` instead" ))extra<-"warn" }extra<- arg_match(extra, c("warn", "merge", "drop"))fill<- arg_match(fill, c("warn", "left", "right"))n_max<-if (extra=="merge") nelse-1Lpieces<- str_split_n(value, sep, n_max=n_max)simp<- simplifyPieces(pieces, n, fill=="left")n_big<- length(simp$too_big)if (extra=="warn"&&n_big>0) {idx<- list_indices(simp$too_big) warn(glue("Expected {n} pieces. Additional pieces discarded in {n_big} rows [{idx}].")) }n_sml<- length(simp$too_sml)if (fill=="warn"&&n_sml>0) {idx<- list_indices(simp$too_sml) warn(glue("Expected {n} pieces. Missing pieces filled with `NA` in {n_sml} rows [{idx}].")) }simp$strings}str_split_n<-function(x, pattern, n_max=-1) {if (is.factor(x)) {x<- as.character(x) }m<- gregexpr(pattern, x, perl=TRUE)if (n_max>0) {m<- lapply(m, function(x) slice_match(x, seq_along(x) <n_max)) } regmatches(x, m, invert=TRUE)}slice_match<-function(x, i) { structure(x[i],match.length= attr(x, "match.length")[i],index.type= attr(x, "index.type"),useBytes= attr(x, "useBytes") )}list_indices<-function(x, max=20) {if (length(x) >max) {x<- c(x[seq_len(max)], "...") } paste(x, collapse=", ")}
Sours: https://github.com/tidyverse/tidyr/blob/master/R/separate.R
Tidy Data and tidyr -- Pt 2 Intro to Data Wrangling with R and the Tidyverse

Reshaping Your Data with tidyr

Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together. This leads to difficult-to-read nested functions and/or choppy code. R Studio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:

  • More efficient code
  • Easier to remember syntax
  • Easier to read syntax

is a one such package which was built for the sole purpose of simplifying the process of creating tidy data. This tutorial provides you with the basic understanding of the four fundamental functions of data tidying that tidyr provides:

Packages Utilized

%>% Operator

Although not required, the tidyr and dplyr packages make use of the pipe operator developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating .

This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value)
or
data %>% filter(variable == numeric_value)

Both functions complete the same task and the benefit of using is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For more info check out the tutorial.

gather( ) function:

Objective: Reshaping wide format to long format

Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

gather() function

This function is a complement to

Example

We’ll start with the following data set:

This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.

These all produce the same results:

Also note that if you do not supply arguments for na.rm or convert values then the defaults are used.

separate( ) function:

Objective: Splitting a single variable into two

Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don’t care about. Some examples include:

In each of these cases, our objective may be to separate characters within the variable string. This can be accomplished using the function which turns a single character column into multiple columns.

This function is a complement to

Example

We can go back to our long_DF dataframe we created above in which way may desire to clean up or separate the Quarter variable.

By applying the function we get the following:

These produce the same results:

unite( ) function:

Objective: Merging two variables into one

Description: There may be a time in which we would like to combine the values of two variables. The function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.

This function is a complement to

Example

Using the separate_DF dataframe we created above, we can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_DF dataframe.

These produce the same results:

spread( ) function:

Objective: Reshaping long format to wide format

Description: There are times when we are required to turn long formatted data into wide formatted data. The function spreads a key-value pair across multiple columns.

This function is a complement to

Example

Additional Resources

Sours: https://uc-r.github.io/tidyr

Separate tidyr

.

Cleaning Data Using tidyr: Separating Data From Within A Column

.

You will also like:

.



834 835 836 837 838