[00:27] *** BlueMaxim has joined #archiveteam-bs
[00:32] *** bsmith093 has joined #archiveteam-bs
[01:44] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[01:44] *** pizzaiolo has joined #archiveteam-bs
[01:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:01] *** dashcloud has joined #archiveteam-bs
[02:03] *** RichardG_ is now known as RichardG
[02:08] *** BubuAnabe has joined #archiveteam-bs
[02:11] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[02:17] *** pizzaiolo has quit IRC (Ping timeout: 268 seconds)
[02:17] *** pizzaiolo has joined #archiveteam-bs
[02:55] *** tsp_ has quit IRC (Read error: Connection reset by peer)
[02:59] *** dashcloud has quit IRC (Read error: Operation timed out)
[03:00] *** dashcloud has joined #archiveteam-bs
[03:22] *** qw3rty has joined #archiveteam-bs
[03:22] *** ZexaronS has quit IRC (Leaving)
[03:25] *** qw3rty2 has quit IRC (Read error: Operation timed out)
[03:38] *** ZexaronS has joined #archiveteam-bs
[04:06] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:12] *** Sk1d has joined #archiveteam-bs
[04:26] *** godane has quit IRC (Read error: Operation timed out)
[04:30] *** godane has joined #archiveteam-bs
[05:09] *** j08nY has joined #archiveteam-bs
[05:34] *** Fletcher_ has joined #archiveteam-bs
[05:36] *** j08nY has quit IRC (Read error: Operation timed out)
[06:21] *** Yoshimura has quit IRC (Ping timeout: 260 seconds)
[06:22] *** Yoshimura has joined #archiveteam-bs
[06:27] <godane> code i could use to grab libgen.pw pdfs : curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%2F|/|g' | sed 's|%3A|:|g' | sed 's|%26|\&|g' | sed 's|.*chl=||g' | sed 's|pdf.*|pdf|g'
[06:36] <godane> better code: curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|.*chl=||g' | sed 's|.*libgen.pw|http://libgen.pw|g' | sed 's|pdf.*|pdf|g' | sed 's|%2F|/|g' | sed 's| |_|g' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%26|\&|g'
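The pipelines above pull a percent-encoded link out of a `chl=` query value and decode a handful of escapes with `sed`. A minimal sketch of the same idea in a single `sed` call follows. The function name `extract_pdf_link` and the sample line are hypothetical, and the assumption that the link sits in an `img src=` attribute and ends in `pdf` is taken from the commands above, not checked against the live site.

```shell
# Hypothetical helper: read HTML on stdin, print the decoded chl= link.
# Assumes the link is a percent-encoded chl= value inside an <img src=...>
# attribute and ends in "pdf"; neither is confirmed for the live site.
extract_pdf_link() {
  grep 'img src=' | sed \
    -e 's|.*chl=||' \
    -e 's|pdf.*|pdf|' \
    -e 's|%3A|:|g' \
    -e 's|%2F|/|g' \
    -e 's|%3F|?|g' \
    -e 's|%3D|=|g' \
    -e 's|%26|\&|g'
}

# Made-up sample line, not real output from libgen.pw.
sample='<img src="https://example.com/chart?cht=qr&chl=http%3A%2F%2Fexample.com%2Fget.php%3Fid%3D1%26name%3Dbook.pdf&chs=200">'
printf '%s\n' "$sample" | extract_pdf_link
# prints: http://example.com/get.php?id=1&name=book.pdf
```

Only the five escapes from the original commands are decoded, and only in upper-case hex; the `&` in the last replacement is escaped because a bare `&` in a `sed` replacement stands for the matched text.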
[07:39] *** brayden has quit IRC (Read error: Connection reset by peer)
[07:39] *** brayden has joined #archiveteam-bs
[07:39] *** swebb sets mode: +o brayden
[07:50] *** Jonison has joined #archiveteam-bs
[08:16] *** j08nY has joined #archiveteam-bs
[08:46] *** icedice has joined #archiveteam-bs
[09:53] *** SHODAN_UI has joined #archiveteam-bs
[10:30] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[12:02] *** pizzaiolo has quit IRC (Remote host closed the connection)
[12:28] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[12:35] *** SHODAN_UI has joined #archiveteam-bs
[12:40] *** Mateon1 has joined #archiveteam-bs
[12:42] *** pizzaiolo has joined #archiveteam-bs
[12:44] *** j08nY has quit IRC (Read error: Operation timed out)
[12:44] *** dashcloud has quit IRC (Remote host closed the connection)
[12:46] *** dashcloud has joined #archiveteam-bs
[13:25] *** eccfill has quit IRC (Read error: Connection reset by peer)
[13:26] *** eccfill has joined #archiveteam-bs
[13:31] *** Yurume has quit IRC (Read error: Operation timed out)
[13:35] *** Yurume has joined #archiveteam-bs
[13:43] *** bitBaron has joined #archiveteam-bs
[13:55] *** icedice has quit IRC (Quit: Leaving)
[14:13] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[14:14] *** j08nY has joined #archiveteam-bs
[14:40] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[15:33] *** ZexaronS has quit IRC (Quit: Leaving)
[15:37] *** Jonison has quit IRC (Quit: Leaving)
[15:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:00] *** dashcloud has joined #archiveteam-bs
[16:03] <odemg> godane, first line output a url with spaces, second didn't get anything
[16:32] *** bitBaron has joined #archiveteam-bs
[16:32] *** bitBaron has quit IRC (Client Quit)
[16:34] *** Stiletto has quit IRC (Read error: Operation timed out)
[16:34] *** bitBaron has joined #archiveteam-bs
[16:43] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[16:53] <eccfill> godane: look into djvu for magazine compression
[16:57] *** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
[16:58] *** BubuAnabe has joined #archiveteam-bs
[17:16] *** Odd0002 has quit IRC (Read error: Operation timed out)
[17:27] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[17:28] *** bitBaron has joined #archiveteam-bs
[17:30] *** bitBaron has quit IRC (Client Quit)
[17:32] *** bitBaron has joined #archiveteam-bs
[17:32] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[17:33] *** bitBaron has joined #archiveteam-bs
[17:42] <Sanqui> CLICK for Photos 📷
[17:48] *** Jonison has joined #archiveteam-bs
[18:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:19] *** dashcloud has joined #archiveteam-bs
[18:33] *** icedice has joined #archiveteam-bs
[18:46] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[18:46] *** bitBaron_ has joined #archiveteam-bs
[18:46] <tuluu> Hi, is 'grab-site' the appropriate way to back up a webpage and create a WARC file?
[18:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:59] *** dashcloud has joined #archiveteam-bs
[19:03] <Meroje> for one page webrecorder is probably easier to use
[19:16] *** SHODAN_UI has joined #archiveteam-bs
[19:34] <ivan> tuluu: I use it but then I wrote it so I could be biased
[19:35] <ivan> (is there a better self-serve crawling setup you can do? heritrix3? httrack with a proxy?)
[19:38] <JAA> Chromium/Firefox headless with warcprox, in theory. Can't wait to try that out when Firefox headless is released in September.
[19:38] <ivan> why not headless Chrome now?
[19:39] <ivan> I mean, that sounds cool but it's not a "crawl this entire site and linked pages" solution
[19:39] <ivan> maybe there's something on github that does that though
[19:39] <JAA> Because exactly that. I want to script it to follow links etc. And based on a quick glance at the documentation, Chromium's headless mode isn't exactly powerful.
[19:40] <JAA> It really looks like a programmatic interface to the rendering engine, basically.
[19:40] <JAA> But not a full browser which supports extensions (for the link following part etc.).
[19:41] <ivan> what's missing? you can do pretty much anything you'd be able to do from the devtools
[19:41] <JAA> You can?
[19:41] <ivan> yes
[19:41] <JAA> As I said, I just quickly skimmed the documentation.
[19:41] <JAA> If you can, sweet.
[19:43] *** TheLovina has quit IRC (Read error: Connection reset by peer)
[19:48] *** TheLovina has joined #archiveteam-bs
[19:59] <tuluu> ivan: thanks
[19:59] <eccfill> JAA: yeah, there's a repl mode
[20:00] <eccfill> and chrome headless speaks devtools protocol, so the same things that remote chrome debuggers use are available
[20:05] <tuluu> grab-site seems to work well. I will use that for now.
[20:29] <JAA> I guess I'll play around with headless Chromium in the next weeks then.
[20:29] *** Stilett0 has joined #archiveteam-bs
[20:33] <JAA> I'll still wait for headless Firefox before trying to build anything real though. I'm curious to see how they stack up against each other, in particular regarding resource usage.
[21:01] *** icedice has quit IRC (Quit: Leaving)
[21:21] *** logchfoo3 starts logging #archiveteam-bs at Mon Jul 10 21:21:32 2017
[21:21] *** logchfoo3 has joined #archiveteam-bs
[21:22] *** powerArch has quit IRC (Remote host closed the connection)
[21:29] *** Mateon1 has quit IRC (Quit: Mateon1)
[21:31] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[21:33] *** powerArch has joined #archiveteam-bs
[21:34] *** bwn_ is now known as bwn
[21:34] *** LeG0ax is now known as Ing3b0rg
[21:44] *** Famicoman has joined #archiveteam-bs
[21:57] *** Jonison has quit IRC (Read error: Connection reset by peer)
[22:11] *** purplebot has joined #archiveteam-bs
[22:12] *** voidsta has joined #archiveteam-bs
[22:13] *** i0npulse has joined #archiveteam-bs
[22:37] *** Hecatz has joined #archiveteam-bs
[22:41] <JAA> Argh, I hate wix.com sites. They're very annoying to archive as well.
[22:55] *** kimmer has joined #archiveteam-bs
[22:56] <kimmer> soo, have rate limits been implemented on the URL shortener services?
[22:57] <xmc> most shorteners will tell scrapers to get stuffed if they ask for too many urls
[22:57] <xmc> that's why more ip addresses is better
[22:58] <kimmer> alright.. guess it was putting a bit of a load on their servers if we just hammered them.
[22:58] *** Odd0002 has joined #archiveteam-bs
[23:07] <dashcloud> Wanted to make sure people were aware of https://twitter.com/LinkArchiver - if you follow the account, anytime you tweet a link, it gets archived on IA, and if you @ the account, it'll tweet back the archive link
[23:07] *** agrecasci has joined #archiveteam-bs
[23:11] <JAA> Nice. Would be even better if it also archived the tweet.
[23:11] <JAA> But I guess that's a separate project.
[23:11] <arkiver> yeah, awesome project
[23:12] <arkiver> also a great way I think to get more people to know the wayback machine
[23:20] <arkiver> joepie91: finally fixed :) https://github.com/jjjake/internetarchive/issues/63
[23:21] <arkiver> (it should be at least)
[23:35] *** j08nY has quit IRC (Quit: Leaving)
[23:47] <DFJustin> oh looks like they fixed the bookreader for special character filenames at some point too
[23:47] <mundus201> What is the process for extracting all of the files from a .megawarc.warc.gz?
[23:51] <arkiver> https://github.com/chfoo/warcat
[23:52] <arkiver> mundus201 ^
[23:54] <mundus201> Gracias arkiver