[00:00] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[00:08] *** nertzy has joined #archiveteam
[00:12] *** JesseW has joined #archiveteam
[00:17] *** Guest1247 has quit IRC (Ping timeout: 240 seconds)
[00:26] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:39] *** dashcloud has joined #archiveteam
[00:50] *** Famicoman has quit IRC (Read error: Operation timed out)
[00:52] *** primus104 has quit IRC (Leaving.)
[00:52] *** VADemon has quit IRC (Read error: Connection reset by peer)
[00:53] *** w3r4 has joined #archiveteam
[00:54] hey, I have yuku on the warrior, and I think it's got stuck on a particular forum
[00:54] *** nightpool has joined #archiveteam
[00:55] one of the items has not moved on from "36=301 http://av1611godsword.yuku.com/forum/previous/topic/530. " for about 4 hours
[00:56] *** Fletcher has quit IRC (Read error: Operation timed out)
[00:56] arkiver: ping ^
[00:57] hey all I don't know if you've heard but OTW (http://archiveofourown.org/) is going through some organizational instability right now.
[00:57] someone suggested spidering ao3 and fanlore and etc. and this seemed like the best place to ask for tips
[00:58] *** vtyl has joined #archiveteam
[00:58] anyone have pointers about where I could start code-wise? i'm a pretty competent coder but would like to get a handle on current best practices or w/e
[00:59] nightpool: there's a tool called grab-site. Check that out. Also, wpull.
[00:59] *** lytv has quit IRC (Ping timeout: 606 seconds)
[00:59] *** Fletcher has joined #archiveteam
[01:00] we also have a system called archivebot that can grab things using shared servers
[01:00] probably don't need anything like that yet. this is just precautionary :D
[01:01] hopefully they can figure out their internal politics stuff
[01:01] but ao3 is super central to a lot of fan cultures so I'd like to be prepared
[01:02] *** w3r4 has quit IRC (Ping timeout: 240 seconds)
[01:02] and for particularly big and urgent jobs, we can write customized code that is then automatically distributed to a network of volunteers running virtual machines (called the ArchiveTeam Warrior) that can distribute the job of grabbing a whole site; we're currently working on grabbing all of a set of forums called Yuku, and a video site called screenr.
[01:03] AO3 is really important, yes.
[01:03] check out the wiki, and do ask if you have more questions
[01:04] thx :D
[01:04] i hope i'm just overreacting
[01:04] *** philpem has quit IRC (Read error: Operation timed out)
[01:06] I hope so too.
[01:06] but it's good to have independent copies in any case
[01:33] *** nightpool has quit IRC (Ping timeout: 258 seconds)
[01:33] *** nightpool has joined #archiveteam
[01:47] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[01:55] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[02:11] *** zenguy_pc has joined #archiveteam
[02:11] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[02:21] *** zhongfu has quit IRC (Read error: Operation timed out)
[02:22] *** zhongfu has joined #archiveteam
[02:28] *** zenguy_pc has joined #archiveteam
[02:30] *** nightpool has quit IRC (Read error: Operation timed out)
[02:49] *** nightpool has joined #archiveteam
[02:52] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[02:52] *** Stiletto has joined #archiveteam
[02:58] Who has access to the machine the urlteam tracker is on? It appears to be down, or otherwise not working...
[03:03] *** ex-parrot has left
[03:21] *** chfoo has joined #archiveteam
[03:30] *** Famicoman has joined #archiveteam
[03:39] *** Elegance_ has quit IRC (Read error: Operation timed out)
[03:50] *** Elegance has joined #archiveteam
[03:50] *** Elegance has quit IRC (Connection closed)
[03:56] *** nightpool has quit IRC (Ping timeout: 252 seconds)
[03:59] *** Stilett0 has joined #archiveteam
[04:02] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[04:10] *** nightpool has joined #archiveteam
[04:28] *** Guest1247 has joined #archiveteam
[04:28] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[04:33] urlteam tracker is down again. :-(
[04:35] *** Guest1247 has quit IRC (Ping timeout: 240 seconds)
[04:39] *** Guest1247 has joined #archiveteam
[04:41] and it's back
[04:44] *** Guest1247 has quit IRC (Ping timeout: 240 seconds)
[05:11] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:17] *** remsen has joined #archiveteam
[05:29] *** remsen has quit IRC (Leaving)
[05:34] *** xk_id has quit IRC (Remote host closed the connection)
[06:06] heyy let me know if this is the wrong place to ask this, but is there a good way to write a grab-site ignore regex to *only* grab a certain set of routes?
[06:07] *** xk_id has joined #archiveteam
[06:11] nevermind, looks like I can do this with wpull_args
[06:16] *** Sk1d has joined #archiveteam
[06:22] *** slyphic is now known as slyphic|a
[06:36] *** remsen has joined #archiveteam
[06:54] w3r4: I think you can abort that item and start over
[06:55] so uh are all the screenr items supposed to be 0.1 MB total?
[06:56] *** arkiver2 has joined #archiveteam
[06:57] yes, videos have already been grabbed
[07:00] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[07:17] *** nightpool has quit IRC (Read error: Operation timed out)
[07:20] *** remsen has quit IRC (Leaving)
[07:24] *** JesseW has quit IRC (Leaving.)
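A sketch for the 06:06 question about grabbing only a certain set of routes: one common way to turn an ignore list into an allow list is a negative lookahead. The snippet below is only an illustration, assuming ignore patterns are ordinary Python regexes searched against the full URL. The host example.org, the route prefixes, and the helper names are hypothetical and do not come from the channel. The approach actually settled on in the log was wpull_args, which is not shown here.

```python
import re

# Hypothetical routes that should be kept; everything else on the host is ignored.
ALLOWED_PREFIXES = ("/works/", "/tags/")


def build_ignore_pattern(host, allowed_prefixes):
    """Build a regex matching every URL on `host` EXCEPT the allowed routes."""
    allowed = "|".join(re.escape(prefix) for prefix in allowed_prefixes)
    # The negative lookahead (?!...) makes the match fail when the path
    # starts with one of the allowed prefixes.
    return rf"^https?://{re.escape(host)}(?!(?:{allowed}))"


IGNORE = re.compile(build_ignore_pattern("example.org", ALLOWED_PREFIXES))


def is_ignored(url):
    """Return True when the URL would be skipped under the ignore pattern."""
    return IGNORE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "https://example.org/works/123",
        "http://example.org/users/alice",
        "http://other.net/anything",
    ):
        print(url, "ignored" if is_ignored(url) else "kept")
```

Under this pattern the bare site root also counts as ignored, and URLs on other hosts are left alone, so off-site filtering would have to be handled separately.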
[07:24] *** za3k has joined #archiveteam
[07:24] All right, downloaded a list of github.com repositories and small amounts of metadata. It looks like the first pass is missing some data and has some repeats (I think due to low error rates from github) but I'm doing another run.
[07:25] It does give accurate statistics at least, the wiki's updated a bit.
[07:25] Downloads are at za3k.com/github, but you should wait 1-2 weeks for the second pass of data because this seems flakey
[07:26] Short version is 28 million repos, 120TB
[07:26] *** nightpool has joined #archiveteam
[07:30] *** nightpool has quit IRC (Ping timeout: 183 seconds)
[07:31] *** za3k has quit IRC (Quit: http://chat.efnet.org (EOF))
[07:42] *** remsen has joined #archiveteam
[07:43] *** primus104 has joined #archiveteam
[07:48] *** WinterFox has quit IRC (Read error: Operation timed out)
[07:49] *** Elegance has joined #archiveteam
[07:54] *** WinterFox has joined #archiveteam
[08:04] *** remsen2 has joined #archiveteam
[08:07] *** godane has left
[08:08] *** remsen has quit IRC (Read error: Operation timed out)
[08:19] *** R5M has joined #archiveteam
[08:20] *** nightpool has joined #archiveteam
[08:22] *** R5M has quit IRC (Client Quit)
[08:23] *** nightpool has quit IRC (Ping timeout: 183 seconds)
[08:25] *** remsen2 has quit IRC (Read error: Operation timed out)
[08:30] *** arkiver2 has joined #archiveteam
[08:50] https://archive.org/details/archivebot is coming along nicely.
[08:50] https://archive.org/details/archivebot
[09:14] *** nightpool has joined #archiveteam
[09:15] *** godane has joined #archiveteam
[09:22] *** nightpool has quit IRC (Read error: Operation timed out)
[09:28] 1318, impressive
[09:28] *** schbirid has joined #archiveteam
[09:38] Well, I've got the thing generating thumbnails.
[09:38] It's going by most viewed down, so the page will look nice and then filter down.
[09:41] *** atomotic has joined #archiveteam
[09:47] SketchCow: dailymail.co.uk is up to 2011
[09:47] Great
[09:49] i'm also uploading some twilight dvds i have
[09:50] from the same guy that uploaded the others
[09:51] *** remsen has joined #archiveteam
[09:51] A cache of 60gb of Geocities sites is coming to me.
[10:08] *** nightpool has joined #archiveteam
[10:12] *** Ungstein has joined #archiveteam
[10:16] *** nightpool has quit IRC (Read error: Operation timed out)
[10:18] *** Ungstein has quit IRC (Read error: Connection reset by peer)
[10:45] *** primus104 has quit IRC (Leaving.)
[11:15] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[11:45] *** kyan has quit IRC (Remote host closed the connection)
[11:46] *** kyan has joined #archiveteam
[11:53] *** kyan is now known as kyna
[11:53] *** kyna is now known as kyan
[11:54] *** WinterFox has quit IRC (Remote host closed the connection)
[11:54] *** arkiver2 has joined #archiveteam
[11:59] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:01] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[12:13] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[12:21] *** oli has quit IRC (Read error: Operation timed out)
[12:24] *** oli has joined #archiveteam
[12:55] *** arkiver2 has joined #archiveteam
[13:05] 43G Blizzcon-2015-Virtual-Ticket-DirecTV finally done :)
[13:06] cool
[13:08] *** atomotic has joined #archiveteam
[13:10] *** toad1 has quit IRC (Read error: Operation timed out)
[13:17] *** nertzy has joined #archiveteam
[13:17] godane: where/how should i upload it? they're under directv's copyright afaik
[13:18] just upload it
[13:19] where, archive.org?
[13:19] yes
[13:19] mmk
[13:19] i'm off to bed
[13:20] bbl
[13:30] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:30] *** atomotic has joined #archiveteam
[13:30] *** arkiver2 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
[13:35] *** vtyl has quit IRC (Read error: Operation timed out)
[13:38] *** lytv has joined #archiveteam
[13:49] *** slyphic|a is now known as slyphic
[13:54] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[14:00] *** primus104 has joined #archiveteam
[14:04] *** toad1 has joined #archiveteam
[14:34] *** Guest1247 has joined #archiveteam
[14:40] *** jonimus has joined #archiveteam
[14:47] *** primus104 has quit IRC (Leaving.)
[14:52] *** Ungstein has joined #archiveteam
[14:53] *** Ungstein has quit IRC (Client Quit)
[14:54] *** Ungstein has joined #archiveteam
[15:01] *** Guest1247 has quit IRC (Quit: Page closed)
[15:01] *** Guest1247 has joined #archiveteam
[15:02] *** scyther has joined #archiveteam
[15:06] *** Ungstein has quit IRC (Quit: Leaving.)
[15:09] *** Ungstein has joined #archiveteam
[15:18] *** nertzy has joined #archiveteam
[15:21] *** ersi has joined #archiveteam
[15:21] *** swebb sets mode: +o ersi
[15:25] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:28] so I have a large archive ill have to share at some point, and a lot of the files inside it are duplicates. rather than manually create symlinks for everything, is there a way i can deduplicate the files in an archive?
[15:31] http://www.digitalvolcano.co.uk/duplicatecleaner.html
[15:32] Guest1247: I don't want to delete them
[15:32] ^ I recommend this software.
[15:32] they need to exist, but they can exist as symlinks, or can be re-created upon extraction (as symlinks even maybe)
[15:32] They end up in the trash bin
[15:32] yes I really don't need that
[15:32] sec
[15:33] Guest1247: to give you an idea of what it looks like: http://dpaste.com/2CNJZDW
[15:33] You can just cut and paste it in some folder after that
[15:35] jleclanch, eh, I think I've seen an app for that
[15:35] maybe pmatch (a paranoid ruby app for finding duplicates)
[15:35] kyan: someone just linked http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl
[15:36] I think it lets you retrieve a list of the duplicates that you could then stick into a bash script
[15:36] which is what i need essentially, but with symlinks instead of hardlinks
[15:36] ill see if i can modify it, my perl is .. well, it's perl
[15:36] Unfortunately I don't know jack about perl
[15:37] aside from one file-extracty thing I wrote in high school
[15:41] *** scyther has quit IRC (Quit: Leaving)
[15:45] I'm gonna try to hack something together, i'll let you know if I get it working
[15:48] kyan: ill stick to hard links actually after all
[15:48] so this script should do
[15:48] okeydokey
[15:49] good luck :)
[15:49] I'm probably going to keep doing this though just for me lol
[15:49] because I've wanted something to do duplicates->symlink conversion for a while but I was always too lazy to start it
[15:51] !ao http://free-electrons.com/community/tools/utils/clink/
[15:52] ...oh -_-
[15:52] *** Ungstein has quit IRC (Quit: Leaving.)
[16:06] *** Fletcher has quit IRC (Ping timeout: 252 seconds)
[16:06] *** diacope has quit IRC (Ping timeout: 252 seconds)
[16:16] if you use a "solid" archive format like .tar.gz or .7z then the duplicate files should not take any extra space
[16:19] That depends on how good the data size and the format's dictionary size.
[16:19] The gzip format's dictionary is only 32k in size so any duplicate files larger than that will not deduplicate
[16:19] See https://superuser.com/questions/479074/why-doesnt-gzip-compression-eliminate-duplicate-chunks-of-data
[16:19] true, I think with rar and 7z it sorts files by checksum to facilitate it though
[16:19] *depends on how big the data size
[16:20] *** JesseW has joined #archiveteam
[16:22] *** Ungstein has joined #archiveteam
[16:31] *** Ungstein1 has joined #archiveteam
[16:32] *** nightpool has joined #archiveteam
[16:32] *** Ungstein has quit IRC (Ping timeout: 252 seconds)
[16:42] *** primus104 has joined #archiveteam
[16:45] *** primus105 has joined #archiveteam
[16:46] I don't know how well it scales, but I just tested 7Zip by creating a folder, placing a file in the folder, archiving the folder, then duplicating that file within a subfolder, and re-archiving. Size difference: 56 bytes.
[16:48] So, inside the folder, I placed a test file (PNG image I had laying around), and then created an archive, measured the size at 1,597,149 bytes. Then, I made a subfolder within that same folder, copied the PNG file to the subfolder, and rearchived the original folder (with both images included). The second archive was 1,597,505 bytes.
[16:48] Sorry, typo, the size difference was 356 bytes.
[16:49] *** primus104 has quit IRC (Read error: Operation timed out)
[16:49] *** kyan_ has joined #archiveteam
[16:50] For reference, the original file I used was a PNG file that is 1,596,862 bytes in size.
[16:54] *** kyan_ has quit IRC (Leaving)
[16:54] *** kyan has quit IRC (Ping timeout: 615 seconds)
[16:55] *** JesseW has quit IRC (Leaving.)
[17:03] *** atomotic has joined #archiveteam
[17:10] *** kyan has joined #archiveteam
[17:14] *** Ungstein1 has quit IRC (Quit: Leaving.)
[17:15] Could someone please verify me in #archivebot so that I can use the archive commands?
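The 16:19 point about gzip's 32 KB window can be checked with a quick experiment. This is a minimal sketch, assuming Python's standard zlib module (DEFLATE, the algorithm behind gzip) and lzma module (the LZMA family used by 7z) are fair stand-ins for the .tar.gz and .7z archivers discussed. The 200 KB block size and the helper name are arbitrary choices for illustration.

```python
import lzma
import os
import zlib


def compressed_sizes(block_size):
    """Compress one random block and two back-to-back copies of it."""
    block = os.urandom(block_size)  # random bytes: incompressible on their own
    doubled = block + block         # stands in for an archive holding a duplicate file
    return {
        "zlib_single": len(zlib.compress(block, 9)),
        "zlib_doubled": len(zlib.compress(doubled, 9)),
        "lzma_single": len(lzma.compress(block)),
        "lzma_doubled": len(lzma.compress(doubled)),
    }


# 200 KB is well beyond DEFLATE's 32 KB window but far inside
# the multi-megabyte dictionary of lzma's default preset.
sizes = compressed_sizes(200_000)

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name}: {size} bytes")
```

When the window is large enough to reach the first copy, the second copy costs almost nothing; with DEFLATE it costs about as much as the first. That is consistent with the small size difference reported for the 7-Zip test above, though a real archiver's result also depends on its dictionary size setting and on how it orders files.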
[17:19] Guest1247: Anyone can use !ao but only users with voice (+v) or op (+o) permission can use !a
[17:20] I have op and I tried voicing you but for some reason it won't let me
[17:39] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:40] *** MrRadar has quit IRC (Read error: Operation timed out)
[17:40] I had !a access yesterday, but that was at home and now I'm in my student apartment, so I have a different IP address.
[17:43] We gotta get you a permanent name and place, buddy
[17:43] It's time!
[17:43] You tried before you buy'd.
[17:43] Guest1247, bro do you even voice? (you have voice in #archivebot now)
[17:45] Sorry, hadn't read up on the log in #archivebot when I posted that
[17:56] *** nightpool has quit IRC (Read error: Operation timed out)
[18:06] *** MrRadar has joined #archiveteam
[18:07] *** philpem has joined #archiveteam
[18:22] *** scyther has joined #archiveteam
[18:23] https://defacto2.wordpress.com/2015/11/23/files-files-and-more-files/
[18:30] *** Guest1247 has quit IRC (Ping timeout: 240 seconds)
[18:53] *** primus105 has quit IRC (Leaving.)
[18:56] Dear Archive Team,
[18:56] On November 27th, the Kerbal Space Program forums are to be updated to new software and in the process some of their subforums and all of the blogs on the site are scheduled for deletion. As these things are important to many members of the active community, I ask you to get in touch with the operators of these forums and if possible obtain the data for preservation.
[18:56] With Best Regards
[18:56] Someone get the fuck on thaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaat
[18:56] Kai Wolter
[18:57] it's already in ArchiveBot
[18:57] wtf
[19:13] *** Guest1247 has joined #archiveteam
[19:15] *** nightpool has joined #archiveteam
[19:24] *** RedType_ has quit IRC (leaving)
[19:31] *** primus104 has joined #archiveteam
[19:35] *** RedType has joined #archiveteam
[19:42] *** Start has quit IRC (Read error: Connection reset by peer)
[19:43] *** Start has joined #archiveteam
[19:53] *** SimpBrai1 has quit IRC (Leaving)
[19:54] *** SimpBrain has joined #archiveteam
[20:11] *** diacope has joined #archiveteam
[20:11] *** Fletcher has joined #archiveteam
[20:34] *** RedType has quit IRC (Changing server)
[20:35] *** RedType has joined #archiveteam
[20:36] *** RedType has quit IRC (Remote host closed the connection)
[20:41] *** schbirid has quit IRC (Quit: Leaving)
[20:45] *** nightpool has quit IRC (Read error: Operation timed out)
[20:51] *** atomotic has joined #archiveteam
[21:12] *** Ghost_of_ has joined #archiveteam
[21:15] *** Zebranky_ is now known as Zebranky
[21:26] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:37] *** nightpool has joined #archiveteam
[21:42] *** scyther has quit IRC (Leaving)
[21:53] *** nightpool has quit IRC (Read error: Operation timed out)
[21:53] *** nightpool has joined #archiveteam
[21:58] *** nightpool has quit IRC (Read error: Operation timed out)
[21:59] *** BlueMaxim has joined #archiveteam
[22:09] *** RedType has joined #archiveteam
[22:23] *** SilSte has quit IRC (Read error: Connection reset by peer)
[22:43] *** nightpool has joined #archiveteam
[22:44] *** remsen has quit IRC (Read error: Operation timed out)
[22:46] *** Guest1247 has quit IRC (Ping timeout: 240 seconds)
[23:08] *** bwn has joined #archiveteam
[23:16] *** Guest1247 has joined #archiveteam
[23:27] *** remsen has joined #archiveteam
[23:29] *** Stilett0 is now known as Stiletto
[23:47] *** aaaaaaaaa has joined #archiveteam
[23:47] *** swebb sets mode: +o aaaaaaaaa
[23:51] *** Guest1247 has quit IRC (Ping timeout: 243 seconds)
[23:53] *** nightpool has quit IRC (Read error: Operation timed out)
[23:56] looks like we're going to start the Google Code project very soon
[23:56] Probably tomorrow
[23:56] I double tested everything and it's all working very well
[23:56] *** nightpool has joined #archiveteam
[23:57] We'll start slow and see how it goes, but I think Google Code has a lot of bandwidth, so this might be a project where we can go all out
[23:58] SketchCow: so FOS will be getting a lot of new data tomorrow.
[23:58] How much space does FOS have available at the moment for Google Code?
[23:58] If needed, we'll also search for another rsync target
[23:58] *** Selanda has quit IRC (Read error: Connection reset by peer)