[00:02] quick question
[00:03] I can't remember if this was us or not, but a while back there was an archiving tool to basically scrape images from certain image hosting websites; one of them, I believe, was prntscr
[00:04] have we given any thought to archiving sites like imgur?
[00:44] *** icedice has quit IRC (Ping timeout: 740 seconds)
[00:54] *** j08nY has quit IRC (Quit: Leaving)
[00:58] *** BlueMaxim has joined #archiveteam-bs
[01:36] *** schbirid2 has joined #archiveteam-bs
[01:40] *** schbirid has quit IRC (Read error: Operation timed out)
[02:05] *** fie has quit IRC (Read error: Operation timed out)
[02:19] *** fie has joined #archiveteam-bs
[02:28] *** kristian_ has quit IRC (Quit: Leaving)
[02:57] *** wacky_ has quit IRC (Read error: Operation timed out)
[02:57] *** wacky_ has joined #archiveteam-bs
[02:57] *** HUBI has quit IRC (Read error: Operation timed out)
[02:57] *** balrog has quit IRC (Read error: Operation timed out)
[02:57] *** HUBI has joined #archiveteam-bs
[02:57] *** sep332 has joined #archiveteam-bs
[02:58] *** midas1 has quit IRC (Read error: Operation timed out)
[02:58] *** midas1 has joined #archiveteam-bs
[02:58] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:58] *** kanzure has quit IRC (Read error: Operation timed out)
[02:58] *** chazchaz_ has quit IRC (Read error: Operation timed out)
[02:58] *** kurt has quit IRC (Read error: Operation timed out)
[02:58] *** whydomain has quit IRC (Read error: Operation timed out)
[02:58] *** closure has quit IRC (Read error: Operation timed out)
[02:58] *** sep332_ has quit IRC (Read error: Operation timed out)
[02:58] *** kanzure has joined #archiveteam-bs
[02:59] *** closure has joined #archiveteam-bs
[02:59] *** whydomain has joined #archiveteam-bs
[02:59] *** Yurume has quit IRC (Read error: Operation timed out)
[02:59] *** jerrystie has quit IRC (Read error: Operation timed out)
[03:00] *** balrog has joined #archiveteam-bs
[03:00] *** swebb sets mode: +o balrog
[03:00] *** Yurume has joined #archiveteam-bs
[03:01] SketchCow: i'm watching your twitch stream
[03:02] *** dashcloud has joined #archiveteam-bs
[03:02] *** c4rc4s has quit IRC (Ping timeout: 506 seconds)
[03:04] *** c4rc4s has joined #archiveteam-bs
[03:06] godane: He has a twitch?
[03:06] also SketchCow https://ubidestates.hibid.com/catalog/103245/radioshack-auction--1/
[03:06] All kinds of old radio shack crap up for sale, including lots of old books
[03:06] https://www.twitch.tv/textfilesdotcom
[03:12] *** JerryStie has joined #archiveteam-bs
[03:13] *** kurt has joined #archiveteam-bs
[03:55] *** qw3rty2 has joined #archiveteam-bs
[03:56] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[04:02] *** qw3rty has quit IRC (Read error: Operation timed out)
[04:30] *** kristian_ has joined #archiveteam-bs
[04:37] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:44] *** Sk1d has joined #archiveteam-bs
[04:56] *** Stiletto has quit IRC (Read error: Operation timed out)
[04:57] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:57] *** BlueMaxim has joined #archiveteam-bs
[05:01] *** Stilett0 has joined #archiveteam-bs
[06:19] *** godane has quit IRC (Ping timeout: 245 seconds)
[06:38] *** bwn has quit IRC (Read error: Connection reset by peer)
[06:48] *** bwn has joined #archiveteam-bs
[07:14] *** bwn has quit IRC (Read error: Operation timed out)
[07:22] *** bwn has joined #archiveteam-bs
[08:17] *** godane has joined #archiveteam-bs
[09:10] *** SHODAN_UI has joined #archiveteam-bs
[09:13] *** kristian_ has quit IRC (Quit: Leaving)
[09:24] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[09:38] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[09:46] *** j08nY has joined #archiveteam-bs
[10:29] Whoa, my Tilt API grab has retrieved about 360k URLs (about 1 GB as .warc.gz) and discovered over 150k additional users and about 4600 campaigns already. Currently, there are about 2.2 million URLs in the queue, and it's running at about 30k URLs per hour. Well, this will take a few days.
[10:40] *** Jonison has joined #archiveteam-bs
[11:31] *** pizzaiolo has joined #archiveteam-bs
[11:44] *** SHODAN_UI has joined #archiveteam-bs
[12:33] *** thuban3 has joined #archiveteam-bs
[12:35] *** thuban2 has quit IRC (Read error: Operation timed out)
[13:20] *** thuban4 has joined #archiveteam-bs
[13:25] *** thuban3 has quit IRC (Read error: Operation timed out)
[13:33] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[13:47] *** pizzaiolo has quit IRC (Ping timeout: 260 seconds)
[13:54] *** pnJay has joined #archiveteam-bs
[14:05] *** SHODAN_UI has joined #archiveteam-bs
[14:44] *** pizzaiolo has joined #archiveteam-bs
[14:53] *** thuban has joined #archiveteam-bs
[14:58] *** thuban4 has quit IRC (Read error: Operation timed out)
[15:02] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:03] *** dashcloud has joined #archiveteam-bs
[15:14] *** username1 has joined #archiveteam-bs
[15:16] *** schbirid2 has quit IRC (Read error: Operation timed out)
[15:55] *** schbirid2 has joined #archiveteam-bs
[15:58] *** username1 has quit IRC (Read error: Operation timed out)
[16:01] *** schbirid2 has quit IRC (Read error: Operation timed out)
[16:05] *** TheLovina has joined #archiveteam-bs
[16:08] *** schbirid2 has joined #archiveteam-bs
[16:13] *** sep332_ has joined #archiveteam-bs
[16:14] *** sep332 has quit IRC (Read error: Operation timed out)
[16:19] *** mgrytbak has quit IRC (Read error: Operation timed out)
[16:19] *** useretail has quit IRC (Read error: Operation timed out)
[16:19] *** mgrytbak has joined #archiveteam-bs
[16:20] *** sep332_ has quit IRC (Read error: Operation timed out)
[16:20] *** useretail has joined #archiveteam-bs
[16:30] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[17:08] *** odemg_ has joined #archiveteam-bs
[17:11] *** odemg has quit IRC (Read error: Operation timed out)
[17:34] *** bitBaron has joined #archiveteam-bs
[17:37] *** bitBaron has quit IRC (Client Quit)
[17:38] *** bitBaron has joined #archiveteam-bs
[17:38] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[17:38] *** bitBaron has joined #archiveteam-bs
[17:53] HCross2: yeah I just saw the grab-site thing you mentioned in the other channel, crazy stuff
[17:56] HCross2: if you have time and you have evidence that it's making thousands of useless DNS lookups this might be a good bug to file on crbug
[17:56] I think it would make sense for chromium to give up on predicting at some point
[17:56] ivan: I've got crawl logs and packet captures which match up
[17:58] huh https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-DNS-Prefetch-Control
[17:58] https://www.chromium.org/developers/design-documents/dns-prefetching has it too
[18:00] *** timmc has left
[18:02] HCross2: can you add after the other meta tags in your installed libgrabsite/dashboard.html and tell me if that stops the lookups?
[18:06] JAA: re: queue slowness wpull might be doing an fsync frequently, there's some PRAGMA in grab-site that turns it off, you might try that
[18:07] libgrabsite/plugin.py NoFsyncSQLTable
[18:09] HCross2: heh easy to test with chrome://net-internals/#dns I am seeing what you saw now
[18:10] ivan: I saw my home DNS server go ballistic
[18:12] meta tag seems to work
[18:18] ivan: Thanks, I'll have a look.
[18:19] HCross2: alright fixed in grab-site 1.2.3 thanks for finding this
[18:19] *** SHODAN_UI has joined #archiveteam-bs
[18:29] this should probably be fixed on the archivebot dashboard too
[18:30] HCross2: can I credit you somehow? github username?
[18:33] hah found you
[18:35] HarryC145
[18:36] How many of you are there??? :P
[18:38] *** bitBaron_ has joined #archiveteam-bs
[18:39] *** bitBaron has quit IRC (Ping timeout: 250 seconds)
[18:44] *** bitBaron has joined #archiveteam-bs
[18:44] *** bitBaron_ has quit IRC (Read error: Connection reset by peer)
[18:46] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[18:46] *** Boppen has joined #archiveteam-bs
[19:18] ivan: also credit madyoda in github please, he helped me initially diagnose it
[19:47] done
[19:55] *** kristian_ has joined #archiveteam-bs
[20:21] *** kristian_ has quit IRC (Quit: Leaving)
[20:39] phantomjs + Magento web shops = terrible idea
[21:00] *** pnJay has quit IRC (Quit: Leaving)
[21:00] *** schbirid2 has quit IRC (Quit: Leaving)
[21:24] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[21:48] *** Asparagir has quit IRC (Asparagir)
[22:04] *** bitBaron has quit IRC (Read error: Operation timed out)
[22:04] *** powerArch has quit IRC (Remote host closed the connection)
[22:04] *** bitBaron has joined #archiveteam-bs
[22:05] *** powerArch has joined #archiveteam-bs
[22:40] *** Soni has quit IRC (Ping timeout: 272 seconds)
[22:42] *** Jonison has quit IRC (Read error: Connection reset by peer)
[22:43] Just over 24 hours into my Tilt API grab: 735k URLs retrieved for a .warc.gz of about 2 GiB, 3.1M URLs queued, 270k users and 8k campaigns newly discovered (seeded with 30k users and 53k campaigns).
[22:46] Something I forgot to mention before: I also extract all URLs from the JSON so I can grab those later. They're mostly images and shortened links like http://til.tt/yeWm . 670k such URLs found so far.
[22:48] *** Soni has joined #archiveteam-bs
[23:42] *** andai_ has joined #archiveteam-bs
[23:42] *** andai has quit IRC (Read error: Connection reset by peer)
[23:44] *** Crusher_ has joined #archiveteam-bs
[23:45] *** andai_ has quit IRC (Read error: Connection reset by peer)
[23:45] *** andai has joined #archiveteam-bs
[23:45] Any word on when the Vine project is going to continue?
[23:49] *** Crusher_ has quit IRC (Read error: Connection reset by peer)
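
Editor's note on the 22:46 message (extracting all URLs from the API's JSON for a later grab): the log does not show how this was done, so the following is only a minimal sketch of one way to do it. The regex, the function name extract_urls, and the sample document are all hypothetical; the real Tilt API responses may be shaped quite differently.

```python
import json
import re

# Hypothetical pattern: anything that looks like an http(s) URL inside a string value.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(node):
    """Recursively collect URL-like substrings from every string value in decoded JSON."""
    found = set()
    if isinstance(node, str):
        found.update(URL_RE.findall(node))
    elif isinstance(node, dict):
        # Only values are scanned; keys are assumed not to carry URLs.
        for value in node.values():
            found |= extract_urls(value)
    elif isinstance(node, list):
        for item in node:
            found |= extract_urls(item)
    return found

# Made-up sample document, not a real API response.
sample = json.loads(
    '{"user": {"img": "https://example.com/a.png"},'
    ' "links": ["see http://til.tt/yeWm now", 3]}'
)
urls = extract_urls(sample)
print(sorted(urls))
```

Collecting into a set deduplicates as it goes, which matters when the same image or short link appears in many responses.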
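
Editor's note on the 18:06 message (turning off frequent fsyncs on the crawl queue database with a PRAGMA): the log does not name the PRAGMA, so this sketch assumes SQLite's `PRAGMA synchronous=OFF`, which tells SQLite not to wait for data to reach the disk. It uses Python's standard sqlite3 module directly; the helper name open_queue_db and the table layout are hypothetical and are not taken from wpull or grab-site.

```python
import os
import sqlite3
import tempfile

def open_queue_db(path):
    """Open a SQLite database and disable synchronous disk flushes (assumed PRAGMA)."""
    conn = sqlite3.connect(path)
    # Must run outside a transaction; right after connect() is the safe place.
    conn.execute("PRAGMA synchronous=OFF")
    return conn

with tempfile.TemporaryDirectory() as tmp:
    conn = open_queue_db(os.path.join(tmp, "queue.db"))
    # Hypothetical queue table, for illustration only.
    conn.execute("CREATE TABLE queue (url TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO queue VALUES (?, ?)", ("http://example.com/", "todo"))
    conn.commit()
    # Reading the PRAGMA back reports the current mode (0 means OFF).
    sync_mode = conn.execute("PRAGMA synchronous").fetchone()[0]
    row_count = conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
    conn.close()

print(sync_mode, row_count)
```

The trade-off is durability: with synchronous writes off, a power loss or OS crash mid-write can corrupt the database, which may be acceptable for a crawl queue that can be rebuilt.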