[00:17] *** ndiddy has joined #archiveteam-bs
[00:33] *** powerKitt has quit IRC (Ping timeout: 268 seconds)
[00:37] *** GLaDOS has joined #archiveteam-bs
[01:02] *** ndiddy has quit IRC ()
[01:06] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[01:10] *** sheaf has quit IRC (Quit: sheaf)
[01:11] *** Sk1d has joined #archiveteam-bs
[01:36] *** schbirid2 has joined #archiveteam-bs
[01:39] *** schbirid has quit IRC (Read error: Operation timed out)
[02:04] *** Asparagir has joined #archiveteam-bs
[02:09] *** Odd0002 has joined #archiveteam-bs
[03:43] Somebody2: that email i sent to info@archive i never received a response to yet, but that was friday so maybe they'll answer it on monday
[03:44] yes, it is a job, not a lifestyle
[03:44] well
[03:44] you know what i mean
[03:49] Lord_Nigh: It may be longer than that, if it isn't a simple fix.
[03:50] i'm guessing its a regression in the robots.txt parser and its a simple/stupid bug, heck the source code to it is probably available, maybe i can fix it...
[03:51] Heh, I'm not sure where the source code for the new version of the Wayback Machine is.
[03:51] If you do come up with a patch, that might be likely to get a response sooner
[04:04] the latest dump of fanfiction.net, 16gb compressed, 54gb uncompressed. 745K stories, https://archive.org/details/Fanfictiondotnet1011dump
[04:20] Somebody2: i'm not sure either
[04:20] where the source is
[07:27] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:38] *** Honno has joined #archiveteam-bs
[08:48] I noticed WaybackMachine added an "About this capture" thingy
[08:48] So you can now identify ArchiveBot crawls
[08:55] *** GE has joined #archiveteam-bs
[09:14] *** Honno_ has joined #archiveteam-bs
[09:15] *** Honno__ has joined #archiveteam-bs
[09:19] *** Honno has quit IRC (Ping timeout: 370 seconds)
[09:20] *** Honno_ has quit IRC (Ping timeout: 370 seconds)
[09:23] *** Honno_ has joined #archiveteam-bs
[09:28] *** Honno__ has quit IRC (Ping timeout: 370 seconds)
[10:30] *** GE has quit IRC (Remote host closed the connection)
[11:30] bsmith093: nice. how do you generate metadata.sqlite?
[12:10] *** Jonison has joined #archiveteam-bs
[12:10] *** Jonison has quit IRC (Read error: Connection reset by peer)
[12:47] *** GE has joined #archiveteam-bs
[13:22] *** sheaf has joined #archiveteam-bs
[15:00] *** Fletcher has joined #archiveteam-bs
[15:00] *** kurt has joined #archiveteam-bs
[15:00] *** kvieta has joined #archiveteam-bs
[15:00] *** espes__ has joined #archiveteam-bs
[15:00] *** SilSte has joined #archiveteam-bs
[15:00] *** Kenshin has joined #archiveteam-bs
[15:00] *** w0rp has joined #archiveteam-bs
[15:00] *** dashcloud has joined #archiveteam-bs
[15:00] *** HP has joined #archiveteam-bs
[15:00] *** antonizoo has joined #archiveteam-bs
[15:00] *** tapedrive has joined #archiveteam-bs
[15:00] *** eprillios has joined #archiveteam-bs
[15:00] *** chfoo has joined #archiveteam-bs
[15:00] *** cf has joined #archiveteam-bs
[15:00] *** joepie91 has joined #archiveteam-bs
[15:00] *** brayden has joined #archiveteam-bs
[15:00] *** hub.dk sets mode: +oo Fletcher brayden
[15:00] *** swebb sets mode: +o brayden
[15:00] *** jmtd has joined #archiveteam-bs
[15:01] *** Smiley has joined #archiveteam-bs
[15:13] *** Asparagir has quit IRC (Asparagir)
[15:40] *** RichardG has joined #archiveteam-bs
[16:07] *** RichardG has quit IRC (Read error: Operation timed out)
[16:07] *** RichardG has joined #archiveteam-bs
[16:31] *** powerArch has quit IRC (Remote host closed the connection)
[16:33] *** RedType_ has quit IRC (Read error: Operation timed out)
[16:34] *** icedice has joined #archiveteam-bs
[17:02] *** RichardG has quit IRC (Read error: Operation timed out)
[17:02] *** RichardG has joined #archiveteam-bs
[17:29] *** RichardG has quit IRC (Read error: Operation timed out)
[17:30] *** RichardG has joined #archiveteam-bs
[17:33] *** ndiddy has joined #archiveteam-bs
[17:37] *** Asparagir has joined #archiveteam-bs
[17:38] *** Honno__ has joined #archiveteam-bs
[17:39] *** GE has quit IRC (Remote host closed the connection)
[17:43] *** Honno_ has quit IRC (Ping timeout: 370 seconds)
[18:12] *** RichardG has quit IRC (Read error: Operation timed out)
[18:12] *** RichardG has joined #archiveteam-bs
[18:59] *** RichardG has quit IRC (Read error: Operation timed out)
[18:59] *** RichardG has joined #archiveteam-bs
[19:13] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:36] *** GE has joined #archiveteam-bs
[19:40] *** C4K3_ is now known as C4K3
[19:59] *** BartoCH has joined #archiveteam-bs
[20:04] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[20:08] *** BartoCH has joined #archiveteam-bs
[20:19] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[20:34] *** BartoCH has joined #archiveteam-bs
[20:44] *** powerArch has joined #archiveteam-bs
[21:11] *** RichardG has quit IRC (Read error: Operation timed out)
[21:11] *** RichardG has joined #archiveteam-bs
[21:37] *** RedType has joined #archiveteam-bs
[21:37] *** RichardG has quit IRC (Read error: Operation timed out)
[21:37] *** RichardG has joined #archiveteam-bs
[21:54] *** icedice has quit IRC (Ping timeout: 250 seconds)
[21:58] im in yr internet archive archiving yr internets
[21:58] no seriously, I'm working at Funston today, come say hi if you're around
[21:58] ohai!
[21:59] *** xmc sets mode: +o Asparagir
[21:59] Now I'm all super-powerful, thanks!
[21:59] Step two, get me one of those orbs
[21:59] Step three, profit.
[22:00] The WiFi here is about 165 Mbps. :-O
[22:14] *** GE has quit IRC (Remote host closed the connection)
[22:20] Alright, my setup for Razer Arena with wpull and PhantomJS seems to work in principle. The main problems are that it still doesn't capture everything (wpull doesn't seem to extract links from the DOM generated by PhantomJS) and that the grab will be quite large due to duplication (each page grabs all the JavaScript, imagery, etc. again through PhantomJS).
[22:22] I think arkiver has a script to dedup WARCs
[22:24] Yeah, I guess that shouldn't be too difficult. I'm more concerned about the "doesn't capture everything" part.
[22:44] *** dashcloud has quit IRC (Remote host closed the connection)
[22:46] *** dashcloud has joined #archiveteam-bs
[23:00] If anyone has any ideas, please let me know. For the record, I'm using wpull 1.2.3 with PhantomJS 2.1.1 and the options --phantomjs --phantomjs-exe /path/to/phantomjs --no-phantomjs-snapshot .
[23:01] Otherwise, I'll just grab the actual data through the API and ignore the interface.
[23:22] *** Ravenloft has joined #archiveteam-bs
[23:35] *** BlueMaxim has joined #archiveteam-bs
[23:56] hmm, is there anywhere other than archive.org that I could go to look for or upload old, late 90's/early 2000's PC games? Archive doesn't seem to have them, and there's almost no information on the internet about these games
[23:57] archive.org is a good place to upload
[23:57] i don't know where a good place to find is though