[00:27] *** BlueMaxim has joined #archiveteam-bs
[00:32] *** bsmith093 has joined #archiveteam-bs
[01:44] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[01:44] *** pizzaiolo has joined #archiveteam-bs
[01:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:01] *** dashcloud has joined #archiveteam-bs
[02:03] *** RichardG_ is now known as RichardG
[02:08] *** BubuAnabe has joined #archiveteam-bs
[02:11] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[02:17] *** pizzaiolo has quit IRC (Ping timeout: 268 seconds)
[02:17] *** pizzaiolo has joined #archiveteam-bs
[02:55] *** tsp_ has quit IRC (Read error: Connection reset by peer)
[02:59] *** dashcloud has quit IRC (Read error: Operation timed out)
[03:00] *** dashcloud has joined #archiveteam-bs
[03:22] *** qw3rty has joined #archiveteam-bs
[03:22] *** ZexaronS has quit IRC (Leaving)
[03:25] *** qw3rty2 has quit IRC (Read error: Operation timed out)
[03:38] *** ZexaronS has joined #archiveteam-bs
[04:06] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:12] *** Sk1d has joined #archiveteam-bs
[04:26] *** godane has quit IRC (Read error: Operation timed out)
[04:30] *** godane has joined #archiveteam-bs
[05:09] *** j08nY has joined #archiveteam-bs
[05:34] *** Fletcher_ has joined #archiveteam-bs
[05:36] *** j08nY has quit IRC (Read error: Operation timed out)
[06:21] *** Yoshimura has quit IRC (Ping timeout: 260 seconds)
[06:22] *** Yoshimura has joined #archiveteam-bs
[06:27] code i could use to grab libgen.pw pdfs : curl -s https://libgen.pw/download.php?id=184042 | grep 'img src=' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%2F|/|g' | sed 's|%3A|:|g' | sed 's|%26|\&|g' | sed 's|.*chl=||g' | sed 's|pdf.*|pdf|g'
[06:36] better code: curl -s https://libgen.pw/download.php?id=184042 | grep 'img src=' | sed 's|.*chl=||g' | sed 's|.*libgen.pw|http://libgen.pw|g' | sed 's|pdf.*|pdf|g' | sed 's|%2F|/|g' | sed 's| |_|g' | sed 's|%26|&|g' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%26|\&|g'
[07:39] *** brayden has quit IRC (Read error: Connection reset by peer)
[07:39] *** brayden has joined #archiveteam-bs
[07:39] *** swebb sets mode: +o brayden
[07:50] *** Jonison has joined #archiveteam-bs
[08:16] *** j08nY has joined #archiveteam-bs
[08:46] *** icedice has joined #archiveteam-bs
[09:53] *** SHODAN_UI has joined #archiveteam-bs
[10:30] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[12:02] *** pizzaiolo has quit IRC (Remote host closed the connection)
[12:28] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[12:35] *** SHODAN_UI has joined #archiveteam-bs
[12:40] *** Mateon1 has joined #archiveteam-bs
[12:42] *** pizzaiolo has joined #archiveteam-bs
[12:44] *** j08nY has quit IRC (Read error: Operation timed out)
[12:44] *** dashcloud has quit IRC (Remote host closed the connection)
[12:46] *** dashcloud has joined #archiveteam-bs
[13:25] *** eccfill has quit IRC (Read error: Connection reset by peer)
[13:26] *** eccfill has joined #archiveteam-bs
[13:31] *** Yurume has quit IRC (Read error: Operation timed out)
[13:35] *** Yurume has joined #archiveteam-bs
[13:43] *** bitBaron has joined #archiveteam-bs
[13:55] *** icedice has quit IRC (Quit: Leaving)
[14:13] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[14:14] *** j08nY has joined #archiveteam-bs
[14:40] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[15:33] *** ZexaronS has quit IRC (Quit: Leaving)
[15:37] *** Jonison has quit IRC (Quit: Leaving)
[15:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:00] *** dashcloud has joined #archiveteam-bs
[16:03] godane, first line output url with spaces, second didn't get anything
[16:32] *** bitBaron has joined #archiveteam-bs
[16:32] *** bitBaron has quit IRC (Client Quit)
[16:34] *** Stiletto has quit IRC (Read error: Operation timed out)
[16:34] *** bitBaron has joined #archiveteam-bs
[16:43] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[16:53] godane: look into djvu for magazine compression
[16:57] *** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
[16:58] *** BubuAnabe has joined #archiveteam-bs
[17:16] *** Odd0002 has quit IRC (Read error: Operation timed out)
[17:27] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[17:28] *** bitBaron has joined #archiveteam-bs
[17:30] *** bitBaron has quit IRC (Client Quit)
[17:32] *** bitBaron has joined #archiveteam-bs
[17:32] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[17:33] *** bitBaron has joined #archiveteam-bs
[17:48] *** Jonison has joined #archiveteam-bs
[18:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:19] *** dashcloud has joined #archiveteam-bs
[18:33] *** icedice has joined #archiveteam-bs
[18:46] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[18:46] *** bitBaron_ has joined #archiveteam-bs
[18:46] Hi, is 'grab-site' the appropriate way to back up a webpage and create a WARC file?
[18:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:59] *** dashcloud has joined #archiveteam-bs
[19:03] for one page webrecorder is probably easier to use
[19:16] *** SHODAN_UI has joined #archiveteam-bs
[19:34] tuluu: I use it but then I wrote it so I could be biased
[19:35] (is there a better self-serve crawling setup you can do? heritrix3? httrack with a proxy?)
[19:38] Chromium/Firefox headless with warcprox, in theory. Can't wait to try that out when Firefox headless is released in September.
[19:38] why not headless Chrome now?
[19:39] I mean, that sounds cool but it's not a "crawl this entire site and linked pages" solution
[19:39] maybe there's something on github that does that though
[19:39] Because exactly that. I want to script it to follow links etc. And based on a quick glance at the documentation, Chromium's headless mode isn't exactly powerful.
[19:40] It really looks like a programmatic interface to the rendering engine, basically.
[19:40] But not a full browser which supports extensions (for the link following part etc.).
[19:41] what's missing? you can do pretty much anything you'd be able to do from the devtools
[19:41] You can?
[19:41] yes
[19:41] As I said, I just quickly skimmed the documentation.
[19:41] If you can, sweet.
[19:43] *** TheLovina has quit IRC (Read error: Connection reset by peer)
[19:48] *** TheLovina has joined #archiveteam-bs
[19:59] ivan: thanks
[19:59] JAA: yeah, there's a repl mode
[20:00] and chrome headless speaks devtools protocol, so the same things that remote chrome debuggers use are available
[20:05] grab-site seems to work well. I will use that for now.
[20:29] I guess I'll play around with headless Chromium in the next weeks then.
[20:29] *** Stilett0 has joined #archiveteam-bs
[20:33] I'll still wait for headless Firefox before trying to build anything real though. I'm curious to see how they stack up against each other, in particular regarding resource usage.
[21:01] *** icedice has quit IRC (Quit: Leaving)
[21:21] *** logchfoo3 starts logging #archiveteam-bs at Mon Jul 10 21:21:32 2017
[21:21] *** logchfoo3 has joined #archiveteam-bs
[21:22] *** powerArch has quit IRC (Remote host closed the connection)
[21:29] *** Mateon1 has quit IRC (Quit: Mateon1)
[21:31] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[21:33] *** powerArch has joined #archiveteam-bs
[21:34] *** bwn_ is now known as bwn
[21:34] *** LeG0ax is now known as Ing3b0rg
[21:44] *** Famicoman has joined #archiveteam-bs
[21:57] *** Jonison has quit IRC (Read error: Connection reset by peer)
[22:11] *** purplebot has joined #archiveteam-bs
[22:12] *** voidsta has joined #archiveteam-bs
[22:13] *** i0npulse has joined #archiveteam-bs
[22:37] *** Hecatz has joined #archiveteam-bs
[22:41] Argh, I hate wix.com sites. They're very annoying to archive as well.
[22:55] *** kimmer has joined #archiveteam-bs
[22:56] so, have rate limits been implemented on the URL shortener services?
[22:57] most shorteners will tell scrapers to get stuffed if they ask for too many urls
[22:57] that's why more ip addresses is better
[22:58] all right.. guess it was putting a bit of a load on their servers if we just hammered them.
[22:58] *** Odd0002 has joined #archiveteam-bs
[23:07] Wanted to make sure people were aware of https://twitter.com/LinkArchiver - if you follow the account, anytime you tweet a link, it gets archived on IA, and if you @ the account, it'll tweet back the archive link
[23:07] *** agrecasci has joined #archiveteam-bs
[23:11] Nice. Would be even better if it also archived the tweet.
[23:11] But I guess that's a separate project.
[23:11] yeah, awesome project
[23:12] also a great way I think to get more people to know the wayback machine
[23:20] joepie91: finally fixed :) https://github.com/jjjake/internetarchive/issues/63
[23:21] (it should be at least)
[23:35] *** j08nY has quit IRC (Quit: Leaving)
[23:47] oh looks like they fixed the bookreader for special character filenames at some point too
[23:47] What is the process for extracting all of the files from a .megawarc.warc.gz?
[23:51] https://github.com/chfoo/warcat
[23:52] mundus201 ^
[23:54] Gracias arkiver
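
Editor's note on the [23:47] question: warcat is the tool recommended in the channel. For readers who only want to see what is inside a .warc.gz, here is a minimal Python sketch using just the standard library. It is not warcat's implementation, and it assumes the file is a concatenation of gzip members holding well-formed WARC records that each carry a Content-Length header. The names iter_warc_records and list_records are hypothetical helpers introduced for this illustration.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` is any binary file-like object with readline() and read().
    Header names are lower-cased; payload is the raw record block as bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Assumption: every record declares its block size in Content-Length.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def list_records(path):
    """Print the type, target URI and block size of every record in a .warc.gz."""
    # gzip.open reads multi-member gzip files as one continuous stream.
    with gzip.open(path, "rb") as stream:
        for headers, payload in iter_warc_records(stream):
            print(headers.get("warc-type"), headers.get("warc-target-uri"), len(payload))
```

Calling list_records on a local file (for example a hypothetical example.megawarc.warc.gz) prints one line per record. For "response" records the payload still contains the HTTP status line and headers ahead of the body, so writing out the actual files needs a further split on the first blank line, plus handling of transfer encodings that this sketch does not attempt.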