#archiveteam 2015-11-23,Mon

Time Nickname Message
00:00 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
00:08 🔗 nertzy has joined #archiveteam
00:12 🔗 JesseW has joined #archiveteam
00:17 🔗 Guest1247 has quit IRC (Ping timeout: 240 seconds)
00:26 🔗 dashcloud has quit IRC (Read error: Operation timed out)
00:39 🔗 dashcloud has joined #archiveteam
00:50 🔗 Famicoman has quit IRC (Read error: Operation timed out)
00:52 🔗 primus104 has quit IRC (Leaving.)
00:52 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
00:53 🔗 w3r4 has joined #archiveteam
00:54 🔗 w3r4 hey, I have yuku on the warrior, and I think it's got stuck on a particular forum
00:54 🔗 nightpool has joined #archiveteam
00:55 🔗 w3r4 one of the items has not moved on from "36=301 http://av1611godsword.yuku.com/forum/previous/topic/530. " for about 4 hours
00:56 🔗 Fletcher has quit IRC (Read error: Operation timed out)
00:56 🔗 JesseW arkiver: ping ^
00:57 🔗 nightpool hey all I don't know if you've heard but OTW (http://archiveofourown.org/) is going through some organizational instability right now.
00:57 🔗 nightpool someone suggested spidering ao3 and fanlore and etc. and this seemed like the best place to ask for tips
00:58 🔗 vtyl has joined #archiveteam
00:58 🔗 nightpool anyone have pointers about where I could start code-wise? i'm a pretty competent coder but would like to get a handle on current best practices or w/e
00:59 🔗 JesseW nightpool: there's a tool called grab-site. Check that out. Also, wpull.
00:59 🔗 lytv has quit IRC (Ping timeout: 606 seconds)
00:59 🔗 Fletcher has joined #archiveteam
01:00 🔗 JesseW we also have a system called archivebot that can grab things using shared servers
01:00 🔗 nightpool probably don't need anything like that yet. this is just precautionary :D
01:01 🔗 nightpool hopefully they can figure out their internal politics stuff
01:01 🔗 nightpool but ao3 is super central to a lot of fan cultures so I'd like to be prepared
01:02 🔗 w3r4 has quit IRC (Ping timeout: 240 seconds)
01:02 🔗 JesseW and for particularly big and urgent jobs, we can write customized code that is then automatically distributed to a network of volunteers running virtual machines (called the ArchiveTeam Warrior), which share the job of grabbing a whole site; we're currently working on grabbing a set of forums called Yuku and a video site called Screenr.
01:03 🔗 JesseW AO3 is really important, yes.
01:03 🔗 JesseW check out the wiki, and do ask if you have more questions
01:04 🔗 nightpool thx :D
01:04 🔗 nightpool i hope i'm just overreacting
01:04 🔗 philpem has quit IRC (Read error: Operation timed out)
01:06 🔗 JesseW I hope so too.
01:06 🔗 JesseW but it's good to have independent copies in any case
01:33 🔗 nightpool has quit IRC (Ping timeout: 258 seconds)
01:33 🔗 nightpool has joined #archiveteam
01:47 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
01:55 🔗 zenguy_pc has quit IRC (Read error: Connection reset by peer)
02:11 🔗 zenguy_pc has joined #archiveteam
02:11 🔗 zenguy_pc has quit IRC (Read error: Connection reset by peer)
02:21 🔗 zhongfu has quit IRC (Read error: Operation timed out)
02:22 🔗 zhongfu has joined #archiveteam
02:28 🔗 zenguy_pc has joined #archiveteam
02:30 🔗 nightpool has quit IRC (Read error: Operation timed out)
02:49 🔗 nightpool has joined #archiveteam
02:52 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
02:52 🔗 Stiletto has joined #archiveteam
02:58 🔗 JesseW Who has access to the machine the urlteam tracker is on? It appears to be down, or otherwise not working...
03:03 🔗 ex-parrot has left
03:21 🔗 chfoo has joined #archiveteam
03:30 🔗 Famicoman has joined #archiveteam
03:39 🔗 Elegance_ has quit IRC (Read error: Operation timed out)
03:50 🔗 Elegance has joined #archiveteam
03:50 🔗 Elegance has quit IRC (Connection closed)
03:56 🔗 nightpool has quit IRC (Ping timeout: 252 seconds)
03:59 🔗 Stilett0 has joined #archiveteam
04:02 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
04:10 🔗 nightpool has joined #archiveteam
04:28 🔗 Guest1247 has joined #archiveteam
04:28 🔗 aaaaaaaaa has quit IRC (Read error: Operation timed out)
04:33 🔗 JesseW urlteam tracker is down again. :-(
04:35 🔗 Guest1247 has quit IRC (Ping timeout: 240 seconds)
04:39 🔗 Guest1247 has joined #archiveteam
04:41 🔗 JesseW and it's back
04:44 🔗 Guest1247 has quit IRC (Ping timeout: 240 seconds)
05:11 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:17 🔗 remsen has joined #archiveteam
05:29 🔗 remsen has quit IRC (Leaving)
05:34 🔗 xk_id has quit IRC (Remote host closed the connection)
06:06 🔗 nightpool heyy let me know if this is the wrong place to ask this, but is there a good way to write a grab-site ignore regex to *only* grab a certain set of routes?
06:07 🔗 xk_id has joined #archiveteam
06:11 🔗 nightpool nevermind, looks like I can do this with wpull_args
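One way to approach the question above, independent of any particular crawler, is an ignore pattern with a negative lookahead: it matches (and so skips) every URL on the host except those under the routes you want. The sketch below is only an illustration. The host `example.org` and the routes `works/` and `tags/` are hypothetical, and it assumes the crawler skips any URL its ignore pattern matches; check the tool's own documentation for how it actually applies patterns.

```python
import re

# Hypothetical host and routes, chosen only for illustration.
ALLOWED_ROUTES = ("works/", "tags/")

# An ignore pattern matches the URLs to skip. The negative lookahead makes it
# match every URL on the host whose path does NOT start with an allowed route.
IGNORE_EVERYTHING_ELSE = re.compile(
    r"^https?://example\.org/(?!"
    + "|".join(re.escape(route) for route in ALLOWED_ROUTES)
    + r")"
)


def is_ignored(url: str) -> bool:
    """Return True when the URL falls outside the allowed routes."""
    return IGNORE_EVERYTHING_ELSE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.org/works/123",
        "http://example.org/tags/fluff",
        "http://example.org/users/someone",
    ):
        print(url, "ignored" if is_ignored(url) else "kept")
```

Note that this pattern also matches the site root, so a crawl would need to start from a URL under one of the allowed routes.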
06:16 🔗 Sk1d has joined #archiveteam
06:22 🔗 slyphic is now known as slyphic|a
06:36 🔗 remsen has joined #archiveteam
06:54 🔗 arkiver w3r4: I think you can abort that item and start over
06:55 🔗 wp494 so uh are all the screenr items supposed to be 0.1 MB total?
06:56 🔗 arkiver2 has joined #archiveteam
06:57 🔗 arkiver yes, videos have already been grabbed
07:00 🔗 arkiver2 has quit IRC (Ping timeout: 252 seconds)
07:17 🔗 nightpool has quit IRC (Read error: Operation timed out)
07:20 🔗 remsen has quit IRC (Leaving)
07:24 🔗 JesseW has quit IRC (Leaving.)
07:24 🔗 za3k has joined #archiveteam
07:24 🔗 za3k All right, downloaded a list of github.com repositories and small amounts of metadata. It looks like the first pass is missing some data and has some repeats (I think due to low error rates from github) but I'm doing another run.
07:25 🔗 za3k It does give accurate statistics at least, the wiki's updated a bit.
07:25 🔗 za3k Downloads are at za3k.com/github, but you should wait 1-2 weeks for the second pass of data because this seems flakey
07:26 🔗 za3k Short version is 28 million repos, 120TB
07:26 🔗 nightpool has joined #archiveteam
07:30 🔗 nightpool has quit IRC (Ping timeout: 183 seconds)
07:31 🔗 za3k has quit IRC (Quit: http://chat.efnet.org (EOF))
07:42 🔗 remsen has joined #archiveteam
07:43 🔗 primus104 has joined #archiveteam
07:48 🔗 WinterFox has quit IRC (Read error: Operation timed out)
07:49 🔗 Elegance has joined #archiveteam
07:54 🔗 WinterFox has joined #archiveteam
08:04 🔗 remsen2 has joined #archiveteam
08:07 🔗 godane has left
08:08 🔗 remsen has quit IRC (Read error: Operation timed out)
08:19 🔗 R5M has joined #archiveteam
08:20 🔗 nightpool has joined #archiveteam
08:22 🔗 R5M has quit IRC (Client Quit)
08:23 🔗 nightpool has quit IRC (Ping timeout: 183 seconds)
08:25 🔗 remsen2 has quit IRC (Read error: Operation timed out)
08:30 🔗 arkiver2 has joined #archiveteam
08:50 🔗 SketchCow https://archive.org/details/archivebot is coming along nicely.
08:50 🔗 SketchCow https://archive.org/details/archivebot
09:14 🔗 nightpool has joined #archiveteam
09:15 🔗 godane has joined #archiveteam
09:22 🔗 nightpool has quit IRC (Read error: Operation timed out)
09:28 🔗 Nemo_bis 1318, impressive
09:28 🔗 schbirid has joined #archiveteam
09:38 🔗 SketchCow Well, I've got the thing generating thumbnails.
09:38 🔗 SketchCow It's going by most viewed down, so the page will look nice and then filter down.
09:41 🔗 atomotic has joined #archiveteam
09:47 🔗 godane SketchCow: dailymail.co.uk is up to 2011
09:47 🔗 SketchCow Great
09:49 🔗 godane i'm also uploading some twilight dvds i have
09:50 🔗 godane from the same guy that uploaded the others
09:51 🔗 remsen has joined #archiveteam
09:51 🔗 SketchCow A cache of 60gb of Geocities sites is coming to me.
10:08 🔗 nightpool has joined #archiveteam
10:12 🔗 Ungstein has joined #archiveteam
10:16 🔗 nightpool has quit IRC (Read error: Operation timed out)
10:18 🔗 Ungstein has quit IRC (Read error: Connection reset by peer)
10:45 🔗 primus104 has quit IRC (Leaving.)
11:15 🔗 arkiver2 has quit IRC (Ping timeout: 252 seconds)
11:45 🔗 kyan has quit IRC (Remote host closed the connection)
11:46 🔗 kyan has joined #archiveteam
11:53 🔗 kyan is now known as kyna
11:53 🔗 kyna is now known as kyan
11:54 🔗 WinterFox has quit IRC (Remote host closed the connection)
11:54 🔗 arkiver2 has joined #archiveteam
11:59 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
12:01 🔗 arkiver2 has quit IRC (Ping timeout: 252 seconds)
12:13 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
12:21 🔗 oli has quit IRC (Read error: Operation timed out)
12:24 🔗 oli has joined #archiveteam
12:55 🔗 arkiver2 has joined #archiveteam
13:05 🔗 jleclanch 43G Blizzcon-2015-Virtual-Ticket-DirecTV finally done :)
13:06 🔗 godane cool
13:08 🔗 atomotic has joined #archiveteam
13:10 🔗 toad1 has quit IRC (Read error: Operation timed out)
13:17 🔗 nertzy has joined #archiveteam
13:17 🔗 jleclanch godane: where/how should i upload it? they're under directv's copyright afaik
13:18 🔗 godane just upload it
13:19 🔗 jleclanch where, archive.org?
13:19 🔗 godane yes
13:19 🔗 jleclanch mmk
13:19 🔗 godane i'm off to bed
13:20 🔗 godane bbl
13:30 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:30 🔗 atomotic has joined #archiveteam
13:30 🔗 arkiver2 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
13:35 🔗 vtyl has quit IRC (Read error: Operation timed out)
13:38 🔗 lytv has joined #archiveteam
13:49 🔗 slyphic|a is now known as slyphic
13:54 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
14:00 🔗 primus104 has joined #archiveteam
14:04 🔗 toad1 has joined #archiveteam
14:34 🔗 Guest1247 has joined #archiveteam
14:40 🔗 jonimus has joined #archiveteam
14:47 🔗 primus104 has quit IRC (Leaving.)
14:52 🔗 Ungstein has joined #archiveteam
14:53 🔗 Ungstein has quit IRC (Client Quit)
14:54 🔗 Ungstein has joined #archiveteam
15:01 🔗 Guest1247 has quit IRC (Quit: Page closed)
15:01 🔗 Guest1247 has joined #archiveteam
15:02 🔗 scyther has joined #archiveteam
15:06 🔗 Ungstein has quit IRC (Quit: Leaving.)
15:09 🔗 Ungstein has joined #archiveteam
15:18 🔗 nertzy has joined #archiveteam
15:21 🔗 ersi has joined #archiveteam
15:21 🔗 swebb sets mode: +o ersi
15:25 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:28 🔗 jleclanch so I have a large archive i'll have to share at some point, and a lot of the files inside it are duplicates. rather than manually create symlinks for everything, is there a way i can deduplicate the files in an archive?
15:31 🔗 Guest1247 http://www.digitalvolcano.co.uk/duplicatecleaner.html
15:32 🔗 jleclanch Guest1247: I don't want to delete them
15:32 🔗 Guest1247 ^ I recommend this software.
15:32 🔗 jleclanch they need to exist, but they can exist as symlinks, or can be re-created upon extraction (as symlinks even maybe)
15:32 🔗 Guest1247 They end up in the trash bin
15:32 🔗 jleclanch yes I really don't need that
15:32 🔗 jleclanch sec
15:33 🔗 jleclanch Guest1247: to give you an idea of what it looks like: http://dpaste.com/2CNJZDW
15:33 🔗 Guest1247 You can just cut and paste it in some folder after that
15:35 🔗 kyan jleclanch, eh, I think I've seen an app for that
15:35 🔗 kyan maybe pmatch (a paranoid ruby app for finding duplicates)
15:35 🔗 jleclanch kyan: someone just linked http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl
15:36 🔗 kyan I think it lets you retrieve a list of the duplicates that you could then stick into a bash script
15:36 🔗 jleclanch which is what i need essentially, but with symlinks instead of hardlinks
15:36 🔗 jleclanch i'll see if i can modify it, my perl is .. well, it's perl
15:36 🔗 kyan Unfortunately I don't know jack about perl
15:37 🔗 kyan aside from one file-extracty thing I wrote in high school
15:41 🔗 scyther has quit IRC (Quit: Leaving)
15:45 🔗 kyan I'm gonna try to hack something together, i'll let you know if I get it working
15:48 🔗 jleclanch kyan: i'll stick to hard links actually after all
15:48 🔗 jleclanch so this script should do
15:48 🔗 kyan okeydokey
15:49 🔗 kyan good luck :)
15:49 🔗 kyan I'm probably going to keep doing this though just for me lol
15:49 🔗 kyan because I've wanted something to do duplicates->symlink conversion for a while but I was always too lazy to start it
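The duplicates-to-symlink conversion discussed above could be sketched roughly as follows. This is not the linked Perl script or any existing tool, just a minimal illustration: it hashes every regular file under a directory with SHA-256, keeps the first copy of each distinct content, and replaces later copies with relative symlinks so the tree can still be moved or archived as a unit. The function names are made up for this example.

```python
from __future__ import annotations

import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root: Path) -> list[Path]:
    """Replace later copies of identical files under root with relative symlinks."""
    first_seen: dict[str, Path] = {}
    replaced: list[Path] = []
    # Materialise and sort the file list first, so the walk is deterministic
    # and is not affected by the changes made below.
    files = sorted(p for p in root.rglob("*") if p.is_file() and not p.is_symlink())
    for path in files:
        original = first_seen.setdefault(sha256_of(path), path)
        if original is path:
            continue  # first file with this content: keep it as the real copy
        target = os.path.relpath(original, start=path.parent)
        path.unlink()
        path.symlink_to(target)
        replaced.append(path)
    return replaced
```

Calling `symlink_duplicates(Path("some/directory"))` returns the list of paths that were turned into symlinks. It modifies the tree in place, so it is safer to try it on a copy first; swapping `path.symlink_to(target)` for `os.link(original, path)` would give hard links instead.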
15:51 🔗 kyan !ao http://free-electrons.com/community/tools/utils/clink/
15:52 🔗 kyan ...oh -_-
15:52 🔗 Ungstein has quit IRC (Quit: Leaving.)
16:06 🔗 Fletcher has quit IRC (Ping timeout: 252 seconds)
16:06 🔗 diacope has quit IRC (Ping timeout: 252 seconds)
16:16 🔗 DFJustin if you use a "solid" archive format like .tar.gz or .7z then the duplicate files should not take any extra space
16:19 🔗 MrRadar That depends on how big the data is and on the format's dictionary size.
16:19 🔗 MrRadar The gzip format's dictionary is only 32k in size so any duplicate files larger than that will not deduplicate
16:19 🔗 MrRadar See https://superuser.com/questions/479074/why-doesnt-gzip-compression-eliminate-duplicate-chunks-of-data
16:19 🔗 DFJustin true, I think with rar and 7z it sorts files by checksum to facilitate it though
16:19 🔗 MrRadar *depends on how big the data size
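The window-size point above can be checked with a small experiment. The sketch below uses Python's standard `gzip` and `lzma` modules as stand-ins for .tar.gz and .7z/.xz style compression; the 100 KiB random payload is an assumption chosen so that the two copies sit further apart than gzip's 32 KiB window but well inside the default xz dictionary.

```python
import gzip
import lzma
import os

# 100 KiB of random bytes stands in for one incompressible file (an assumption
# for illustration); the second payload holds two identical copies back to back.
single = os.urandom(100 * 1024)
doubled = single + single


def report(name, compress):
    """Compress one copy and two copies, print both sizes, and return them."""
    one, two = len(compress(single)), len(compress(doubled))
    print(f"{name}: one copy {one} bytes, two copies {two} bytes")
    return one, two


# gzip cannot refer back further than 32 KiB, so the second copy costs
# about as much as the first.
gzip_sizes = report("gzip", gzip.compress)

# xz uses a much larger dictionary by default, so the second copy is encoded
# as back-references and adds comparatively little.
xz_sizes = report("xz", lzma.compress)
```

With real archives the result also depends on file order and on the dictionary size chosen when the archive is created, so this only illustrates the mechanism.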
16:20 🔗 JesseW has joined #archiveteam
16:22 🔗 Ungstein has joined #archiveteam
16:31 🔗 Ungstein1 has joined #archiveteam
16:32 🔗 nightpool has joined #archiveteam
16:32 🔗 Ungstein has quit IRC (Ping timeout: 252 seconds)
16:42 🔗 primus104 has joined #archiveteam
16:45 🔗 primus105 has joined #archiveteam
16:46 🔗 phuzion I don't know how well it scales, but I just tested 7Zip by creating a folder, placing a file in the folder, archiving the folder, then duplicating that file within a subfolder, and re-archiving. Size difference: 56 bytes.
16:48 🔗 phuzion So, inside the folder, I placed a test file (PNG image I had lying around), and then created an archive, measured the size at 1,597,149 bytes. Then, I made a subfolder within that same folder, copied the PNG file to the subfolder, and rearchived the original folder (with both images included). The second archive was 1,597,505 bytes.
16:48 🔗 phuzion Sorry, typo, the size difference was 356 bytes.
16:49 🔗 primus104 has quit IRC (Read error: Operation timed out)
16:49 🔗 kyan_ has joined #archiveteam
16:50 🔗 phuzion For reference, the original file I used was a PNG file that is 1,596,862 bytes in size.
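For anyone wanting to double-check the figures quoted above, the arithmetic can be written out as below. The three sizes are taken from the messages; the variable names are just labels for this example.

```python
# Sizes in bytes, as reported in the messages above.
original_png = 1_596_862      # the PNG file on its own
one_copy_archive = 1_597_149  # archive containing a single copy
two_copy_archive = 1_597_505  # archive containing the copy in the subfolder too

# Extra bytes the archive adds on top of the single file.
container_overhead = one_copy_archive - original_png

# Extra bytes caused by adding the duplicate file.
duplicate_cost = two_copy_archive - one_copy_archive

print(container_overhead)  # 287
print(duplicate_cost)      # 356
print(f"duplicate adds {duplicate_cost / original_png:.4%} of the file size")
```

This agrees with the corrected difference of 356 bytes: the second copy costs a tiny fraction of the file's own size in this test.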
16:54 🔗 kyan_ has quit IRC (Leaving)
16:54 🔗 kyan has quit IRC (Ping timeout: 615 seconds)
16:55 🔗 JesseW has quit IRC (Leaving.)
17:03 🔗 atomotic has joined #archiveteam
17:10 🔗 kyan has joined #archiveteam
17:14 🔗 Ungstein1 has quit IRC (Quit: Leaving.)
17:15 🔗 Guest1247 Could someone please verify me in #archivebot so that I can use the archive commands?
17:19 🔗 MrRadar Guest1247: Anyone can use !ao but only users with voice (+v) or op (+o) permission can use !a
17:20 🔗 MrRadar I have op and I tried voicing you but for some reason it won't let me
17:39 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
17:40 🔗 MrRadar has quit IRC (Read error: Operation timed out)
17:40 🔗 Guest1247 I had !a access yesterday, but that was at home and now I'm in my student apartment, so I have a different IP address.
17:43 🔗 SketchCow We gotta get you a permanent name and place, buddy
17:43 🔗 SketchCow It's time!
17:43 🔗 SketchCow You tried before you buy'd.
17:43 🔗 kyan Guest1247, bro do you even voice? (you have voice in #archivebot now)
17:45 🔗 Guest1247 Sorry, hadn't read up on the log in #archivebot when I posted that
17:56 🔗 nightpool has quit IRC (Read error: Operation timed out)
18:06 🔗 MrRadar has joined #archiveteam
18:07 🔗 philpem has joined #archiveteam
18:22 🔗 scyther has joined #archiveteam
18:23 🔗 schbirid https://defacto2.wordpress.com/2015/11/23/files-files-and-more-files/
18:30 🔗 Guest1247 has quit IRC (Ping timeout: 240 seconds)
18:53 🔗 primus105 has quit IRC (Leaving.)
18:56 🔗 SketchCow Dear Archive Team,
18:56 🔗 SketchCow On November 27th, the Kerbal Space Program forums are to be updated to new software and in the process some of their subforums and all of the blogs on the site are scheduled for deletion. As these things are important to many members of the active community, I ask you to get in touch with the operators of these forums and if possible obtain the data for preservation.
18:56 🔗 SketchCow With Best Regards
18:56 🔗 SketchCow Someone get the fuck on thaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaat
18:56 🔗 SketchCow Kai Wolter
18:57 🔗 HCross it's already in ArchiveBot
18:57 🔗 schbirid wtf
19:13 🔗 Guest1247 has joined #archiveteam
19:15 🔗 nightpool has joined #archiveteam
19:24 🔗 RedType_ has quit IRC (leaving)
19:31 🔗 primus104 has joined #archiveteam
19:35 🔗 RedType has joined #archiveteam
19:42 🔗 Start has quit IRC (Read error: Connection reset by peer)
19:43 🔗 Start has joined #archiveteam
19:53 🔗 SimpBrai1 has quit IRC (Leaving)
19:54 🔗 SimpBrain has joined #archiveteam
20:11 🔗 diacope has joined #archiveteam
20:11 🔗 Fletcher has joined #archiveteam
20:34 🔗 RedType has quit IRC (Changing server)
20:35 🔗 RedType has joined #archiveteam
20:36 🔗 RedType has quit IRC (Remote host closed the connection)
20:41 🔗 schbirid has quit IRC (Quit: Leaving)
20:45 🔗 nightpool has quit IRC (Read error: Operation timed out)
20:51 🔗 atomotic has joined #archiveteam
21:12 🔗 Ghost_of_ has joined #archiveteam
21:15 🔗 Zebranky_ is now known as Zebranky
21:26 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:37 🔗 nightpool has joined #archiveteam
21:42 🔗 scyther has quit IRC (Leaving)
21:53 🔗 nightpool has quit IRC (Read error: Operation timed out)
21:53 🔗 nightpool has joined #archiveteam
21:58 🔗 nightpool has quit IRC (Read error: Operation timed out)
21:59 🔗 BlueMaxim has joined #archiveteam
22:09 🔗 RedType has joined #archiveteam
22:23 🔗 SilSte has quit IRC (Read error: Connection reset by peer)
22:43 🔗 nightpool has joined #archiveteam
22:44 🔗 remsen has quit IRC (Read error: Operation timed out)
22:46 🔗 Guest1247 has quit IRC (Ping timeout: 240 seconds)
23:08 🔗 bwn has joined #archiveteam
23:16 🔗 Guest1247 has joined #archiveteam
23:27 🔗 remsen has joined #archiveteam
23:29 🔗 Stilett0 is now known as Stiletto
23:47 🔗 aaaaaaaaa has joined #archiveteam
23:47 🔗 swebb sets mode: +o aaaaaaaaa
23:51 🔗 Guest1247 has quit IRC (Ping timeout: 243 seconds)
23:53 🔗 nightpool has quit IRC (Read error: Operation timed out)
23:56 🔗 arkiver looks like we're going to start the Google Code project very soon
23:56 🔗 arkiver Probably tomorrow
23:56 🔗 arkiver I double tested everything and it's all working very well
23:56 🔗 nightpool has joined #archiveteam
23:57 🔗 arkiver We'll start slow and see how it goes, but I think Google Code has a lot of bandwidth, so this might be a project where we can go all out
23:58 🔗 arkiver SketchCow: so FOS will be getting a lot of new data tomorrow.
23:58 🔗 arkiver How much space does FOS have available at the moment for Google Code?
23:58 🔗 arkiver If needed, we'll also search for another rsync target
23:58 🔗 Selanda has quit IRC (Read error: Connection reset by peer)
