#archiveteam 2014-02-03,Mon


Time Nickname Message
03:57 🔗 dashcloud hi folks, travis goodspeed would like to know if anyone can send him paper copies or high-res scans of the Japanese, Brazilian, German, or Arabic (Jordan) variants of Byte Magazine
04:40 🔗 SketchCow Good luck on THAT, Travis
05:48 🔗 yipdw y
05:49 🔗 yipdw oops
06:24 🔗 SketchCow http://devslovebacon.com/conferences/bacon-2014/talks/from-colo-to-yolo-confessions-of-the-angriest-archivist
07:31 🔗 yipdw SketchCow: that reminds me -- is audio or video of your Build talk available anywhere?
07:31 🔗 yipdw seems like the official videos are not up yet
08:06 🔗 SketchCow No.
08:06 🔗 SketchCow Not up yet.
08:06 🔗 SketchCow No idea why they're taking so long, EXCEPT.
08:06 🔗 SketchCow This is the last year, so he might be working on a deluxe version of the talk, with the new year as the chaser.
08:07 🔗 asdfsadf there some place I can go to read historical b threads?
08:07 🔗 namespace asdfsadf: /b/?
08:08 🔗 asdfsadf good one
08:08 🔗 namespace asdfsadf: It was a serious question.
08:08 🔗 namespace What's a b thread?
08:08 🔗 asdfsadf i mean like previous years
08:08 🔗 asdfsadf oh yea
08:08 🔗 namespace Oh okay you do mean /b/.
08:09 🔗 namespace Uh.
08:09 🔗 asdfsadf never mind
08:09 🔗 asdfsadf i thought you meant the joke was all threads are historical haha
08:09 🔗 namespace asdfsadf: :P
08:09 🔗 namespace asdfsadf: Nah. I think there is but I don't remember where it is.
08:44 🔗 Atluxity myopera.com is about to die and I have the list of all non-banned users.
08:44 🔗 Atluxity 16 457 047 of them
08:45 🔗 Atluxity https://docs.google.com/uc?id=0B8aRlPij6kNrTTdRNHdOdDcxWDA&export=download
08:45 🔗 Atluxity I dont know how to take this information from the userlist to a project
08:46 🔗 namespace Atluxity: You would need a way for us to use that list to grab the content with wget.
08:46 🔗 namespace I don't know much about the process beyond that.
08:46 🔗 namespace I think there was an engineering channel for archive team.
08:54 🔗 midas I think there was a grab already up
08:54 🔗 midas https://archive.org/details/files.myopera.com-initialgrab
09:12 🔗 namespace midas: So then we just have to use the method we used last time to grab new data?
09:17 🔗 midas probably yeah
10:38 🔗 arkiver Atluxity: I will start grabbing some users now and see how fast I can get my speed
10:44 🔗 Atluxity cool
10:56 🔗 arkiver looks like there are no limits!!!! :D
10:56 🔗 arkiver going very fast
10:56 🔗 arkiver 100 links per second or something like that
10:57 🔗 arkiver Atluxity: does my opera also host videos?
10:59 🔗 Atluxity I was told there were no limits, the guy just said "please be gentle". I don't know how gentle we can be if we are to get it all by March 1st
10:59 🔗 Atluxity I don't know if myopera hosts videos
11:00 🔗 Atluxity it would surprise me
11:00 🔗 Atluxity did the list of usernames help?
11:00 🔗 arkiver yep
11:00 🔗 arkiver running a test on chooseopera
11:00 🔗 arkiver it looks like it's quite big
11:00 🔗 arkiver so a good test
11:05 🔗 joepie91 arkiver; what's the status on wallbase?
11:05 🔗 arkiver joepie91: well they have limits
11:06 🔗 joepie91 :(?
11:06 🔗 arkiver so the crawl is limited but it's going fine
11:06 🔗 arkiver around 40000 items per 12 hours now
11:06 🔗 arkiver know*
11:06 🔗 joepie91 how long do you expect it to take to grab everything including metadata?
11:06 🔗 arkiver and they say they don't have plans to shut it
11:06 🔗 joepie91 mhmm
11:06 🔗 arkiver I'm first doing all the wallpaper pages with the wallpapers and stuff
11:07 🔗 arkiver and then I'll start doing the other things like search terms and stuff
11:07 🔗 arkiver (zoom works!!!!)
11:08 🔗 arkiver doing around 63 pages per minute
11:09 🔗 arkiver when is my opera going to shut down?
11:09 🔗 arkiver ^ Atluxity
11:11 🔗 arkiver will upload the first warc of chooseopera when finished
11:11 🔗 arkiver it is a lot bigger than the average blog
11:11 🔗 arkiver but a good test to start with
11:12 🔗 Atluxity arkiver: March 1st
11:12 🔗 arkiver hmm march 1st
11:12 🔗 arkiver thank you
11:14 🔗 arkiver and do you know when my opera was created?
11:15 🔗 Atluxity I don't have that information easily available, do you want me to research it?
11:17 🔗 arkiver got it
11:17 🔗 arkiver http://en.wikipedia.org/wiki/My_Opera
11:17 🔗 arkiver 2001
11:17 🔗 arkiver well I need that since there are urls for a calendar
11:17 🔗 arkiver that are going forever
11:17 🔗 arkiver so even to the middle ages
11:17 🔗 arkiver going to limit it to a time
11:17 🔗 arkiver when it was created
11:19 🔗 Atluxity a wise move
11:20 🔗 Atluxity is this just something you have put on a server you have access to or can it be made to a warrior-project?
11:23 🔗 arkiver I don't know how to create a warrior project
11:23 🔗 arkiver you should ask chfoo about that
11:24 🔗 arkiver I'm just excluding everything with /archive/monthly/ now
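The two options arkiver mentions (cutting the endless calendar off at the site's 2001 creation year, or dropping every /archive/monthly/ URL outright) can be sketched as a URL filter. This is only an illustration: the `?day=YYYYMMDD` query layout and the names `keep_url`, `FIRST_YEAR` and `LAST_YEAR` are assumptions, not something taken from arkiver's actual crawler configuration.

```python
from urllib.parse import urlsplit, parse_qs

FIRST_YEAR = 2001  # My Opera creation year, per the Wikipedia link in the log
LAST_YEAR = 2014   # year of the shutdown; nothing later can exist


def keep_url(url, exclude_all=True):
    """Return True if the crawler should fetch this URL."""
    parts = urlsplit(url)
    if "/archive/monthly/" not in parts.path:
        return True  # not a calendar page, always keep
    if exclude_all:
        return False  # the approach arkiver settled on: drop all calendar pages
    # Alternative: keep calendar pages only inside the site's lifetime.
    # The "day" query parameter is a hypothetical layout for illustration.
    day = parse_qs(parts.query).get("day", [""])[0]
    if len(day) < 4 or not day[:4].isdigit():
        return False
    return FIRST_YEAR <= int(day[:4]) <= LAST_YEAR
```

Dropping the calendar pages entirely is the simpler choice and cannot loop; the date window keeps more pages but depends on guessing the URL layout correctly.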
11:27 🔗 joepie91 PANIC
11:27 🔗 joepie91 http://www.heise.de/open/meldung/BerliOS-Entwicklerplattform-macht-zu-2104211.html?wt_mc=rss.open.beitrag.atom
11:27 🔗 joepie91 BerliOS shutting down April 30
11:36 🔗 joepie91 http://developer.berlios.de/forum/forum.php?forum_id=39220
11:40 🔗 ersi Haven't we grabbed berlios before? :o
11:41 🔗 joepie91 ersi: the response I got in another channel was "oh, shutting down again?" so quite possibly
11:42 🔗 joepie91 but figure a grab would be important regardless
11:56 🔗 arkiver chfoo: can we create a warrior project from my opera?
12:11 🔗 ersi joepie91: Indeed
12:33 🔗 midas for some sites i almost start to think "shouldnt you be dead yet?"
12:37 🔗 Nemo_bis Yes, that's what the Yahoo! CEO thinks of every website loaded in their browser.
12:38 🔗 midas true story
12:38 🔗 joepie91 midas: haha
12:38 🔗 joepie91 "euhm.. why is this still around?"
12:39 🔗 midas yeah, the last time berliOS said they would shutdown was like 2 years ago?
12:40 🔗 midas 2011
13:49 🔗 arkiver hmm
13:49 🔗 arkiver I can try to get a script running here to download most of the channels
13:49 🔗 arkiver chooseopera is downloaded
13:49 🔗 arkiver 1,4 GB
13:50 🔗 arkiver took a few minutes only
13:50 🔗 arkiver and i think that's one of the biggest accounts...
13:52 🔗 arkiver and 45639 urls
14:01 🔗 arkiver working on batch script for WarcMiddleware
14:12 🔗 arkiver working for me... :D
14:12 🔗 arkiver testing with the first 45 accounts from the list
14:18 🔗 arkiver 45 accounts done
14:18 🔗 arkiver going to do a crawl of the first 1 million accounts now
14:21 🔗 arkiver 100.000 accounts*
14:26 🔗 ersi okay godane2
14:34 🔗 arkiver doing a test on the first 100.000 accounts
14:34 🔗 arkiver if that works, I'll put all of the millions of accounts in it
14:34 🔗 arkiver and it should be going then
14:42 🔗 arkiver but I'm not sure if I can make it before the deadline
14:42 🔗 arkiver even if it's going this fast
14:42 🔗 arkiver so
14:42 🔗 arkiver the best thing is to split it up between people I think
14:49 🔗 arkiver I can try to make it go even faster...
14:51 🔗 arkiver will try that tonight or tomorrow
14:51 🔗 arkiver shall I upload some warc.gz examples to show that the warc's work?
14:53 🔗 arkiver according to what I see I should be able to run multiple sessions
14:53 🔗 arkiver need 17 sessions then
14:53 🔗 arkiver will try that
15:14 🔗 arkiver going to start 30 sessions tomorrow
15:14 🔗 arkiver to even have some days left before the deadline to be sure everything is there
15:14 🔗 arkiver will keep you guys informed every day about the progress
15:14 🔗 arkiver I'm also still doing wallbase.cc BTW
15:18 🔗 midas nice arkiver
15:18 🔗 midas if you need any help, let me know
15:18 🔗 arkiver thank you midas
15:18 🔗 arkiver I'll keep that in mind!! :D
15:18 🔗 midas :p
15:18 🔗 arkiver tonight I will upload some warc's for people to see
15:18 🔗 arkiver buuuuuut
15:19 🔗 arkiver someone does need to do the forums
15:19 🔗 arkiver and the things beside the accounts
15:20 🔗 midas my FTP grab just passed the 9TB...
15:22 🔗 arkiver midas, wow that's a lot
15:22 🔗 arkiver which ftp server?
15:23 🔗 midas 5 ftp's
15:23 🔗 midas ftp.tu-chemnitz.de ftp.uni-muenster.de gatekeeper.dec.com
15:23 🔗 midas ftp.uni-erlangen.de ftp.warwick.ac.uk
15:23 🔗 arkiver haha good job man!
15:23 🔗 midas she's still going ;-)
15:23 🔗 midas it's going to take me weeks to upload this
15:27 🔗 arkiver lol
15:27 🔗 arkiver know that
15:27 🔗 arkiver download faster than upload...
15:27 🔗 arkiver :/
15:27 🔗 arkiver horrible...
15:30 🔗 midas yeah
15:30 🔗 midas 180Mbit down, 100mbit up
15:31 🔗 midas still, 32TB/mnd should be doable at max speed, im guessing ill hit 20TB/mnd
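midas's 32TB-per-month figure can be checked with a few lines of arithmetic. This is a rough sketch assuming a 30-day month, decimal units (1 TB = 10^12 bytes) and a fully saturated uplink; the function names are made up for the example.

```python
def tb_per_month(mbit_per_s, days=30):
    """Data volume in TB that a link of the given speed moves in `days` days."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return bytes_per_s * 86_400 * days / 1e12


def days_to_upload(size_tb, tb_month, days=30):
    """Days needed to upload `size_tb` at a sustained monthly rate."""
    return size_tb / tb_month * days


max_rate = tb_per_month(100)            # 100 Mbit/s up, about 32.4 TB/month
best_case = days_to_upload(9, max_rate) # the 9 TB grabbed so far, at full speed
realistic = days_to_upload(9, 20)       # at the 20 TB/month midas expects
print(round(max_rate, 1), round(best_case, 1), round(realistic, 1))
```

So 100 Mbit/s does come out at roughly 32 TB a month, and 9 TB takes somewhere between one and two weeks to upload depending on the sustained rate, more as the grab keeps growing.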
15:39 🔗 arkiver FYI: I need to exclude all accounts that have |, &, <, >, (, ), ^ and @
15:39 🔗 arkiver so if someone can search for the channels with those things in it and download them
15:39 🔗 arkiver would be great
15:39 🔗 arkiver since I can't do them here with this script
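Splitting the user list into accounts the batch script can handle and accounts that need another tool could look like the sketch below. The character set is the one arkiver lists; the one-name-per-line input and the function name `split_usernames` are assumptions about the list format, not a description of the real file.

```python
import re

# Characters arkiver says the batch script cannot handle.
UNSAFE_CHARS = re.compile(r"[|&<>()^@]")


def split_usernames(names):
    """Split an iterable of usernames into (safe, unsafe) lists."""
    safe, unsafe = [], []
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        if UNSAFE_CHARS.search(name):
            unsafe.append(name)
        else:
            safe.append(name)
    return safe, unsafe


safe, unsafe = split_usernames(["chooseopera", "a&b", "4bass8", "x@y", ""])
print(len(safe), "safe,", len(unsafe), "to grab some other way")
```

The `unsafe` list is what someone else would need to pick up with a downloader that does not pass usernames through a shell.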
15:50 🔗 Nemo_bis /mnd? you german, midas? :)
15:50 🔗 midas dutch :p i mean month, /mo?
15:55 🔗 arkiver midas: also dutch?? :D
15:58 🔗 midas jup :p
15:58 🔗 arkiver haha me too! :)
16:05 🔗 midas lots of dutchies here
16:06 🔗 ersi spoorwaggen
16:06 🔗 ersi and shit yo
16:13 🔗 arkiver spoortwaggen?
16:13 🔗 arkiver spoorwaggen*
16:13 🔗 arkiver you mean spoorwagen? :)
16:14 🔗 arkiver downloading 1680 accounts per second
16:14 🔗 arkiver per hour*
16:14 🔗 ersi yeah, I did
16:14 🔗 arkiver going to get that up tomorrow to around 50000-60000
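Whether those rates are enough for the March 1st deadline is a quick calculation. The sketch below uses only numbers from this log (16 457 047 accounts, the log date of 2014-02-03, and the quoted rates) and assumes a constant crawl rate around the clock, which is optimistic.

```python
from datetime import date

TOTAL_ACCOUNTS = 16_457_047   # size of Atluxity's user list
TODAY = date(2014, 2, 3)      # date of this log
DEADLINE = date(2014, 3, 1)   # shutdown date given by Atluxity

hours_left = (DEADLINE - TODAY).days * 24
required_rate = TOTAL_ACCOUNTS / hours_left  # accounts per hour needed


def days_needed(accounts_per_hour):
    """Days to get through the whole list at a constant rate."""
    return TOTAL_ACCOUNTS / accounts_per_hour / 24


print(f"{hours_left} hours left, need about {required_rate:.0f} accounts/hour")
for rate in (1680, 50_000, 60_000):
    print(f"{rate} accounts/hour -> {days_needed(rate):.1f} days")
```

At the current 1680 accounts per hour the list would take over a year; at 50000-60000 per hour it fits in roughly two weeks, which leaves some slack before the deadline.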
17:35 🔗 chfoo anyone can create a project. i'm only writing the recent grab scripts, but someone else adds it master list of warrior projects.
18:02 🔗 yipdw if there is no BerliOS channel, join #honeynutberlios
18:08 🔗 SketchCow https://archive.org/details/businesscase now in 1.0
18:37 🔗 DFJustin switch back to browser, ctrl-t and start typing url, realize inputs are going into atari 800 visicalc #archiveteamproblems
19:12 🔗 Tony_ hello
19:15 🔗 joepie91 DFJustin: hahaha
19:54 🔗 SketchCow Pretty much.
20:21 🔗 arkiver chfoo: I'm already going quite well with that website. :D
20:21 🔗 arkiver today just testing
20:22 🔗 arkiver tomorrow I'll try to do up to 50000-60000 accounts per hour.
20:30 🔗 chfoo sounds good
20:33 🔗 arkiver :)
20:33 🔗 arkiver well
20:33 🔗 arkiver chfoo, this one has no limits, so it isn't too hard... :)
20:33 🔗 arkiver hehe, most accounts are just empty
20:33 🔗 arkiver created and never had anything done with them
20:33 🔗 arkiver but
20:33 🔗 arkiver chfoo
20:34 🔗 arkiver I download many thousands of accounts now as a test
20:34 🔗 arkiver and here it looks like they are working
20:34 🔗 arkiver would you mind if I upload some warc's so that you can also test them?
20:34 🔗 arkiver (just in case)
20:35 🔗 chfoo arkiver: sure, maybe a few others can take a look at them too
20:35 🔗 arkiver yes that sounds good
20:36 🔗 arkiver just to be 100% sure they work
20:36 🔗 arkiver imagine: downloaded millions of accounts and they don't work...
20:36 🔗 arkiver O.o
20:36 🔗 arkiver going to pack up and upload some
20:40 🔗 arkiver chfoo hang on uploading...
20:47 🔗 arkiver chfoo: https://www.filepicker.io/api/file/ZizgWffKT1e9PGaCpiLP
20:47 🔗 arkiver some warc's
20:47 🔗 arkiver I added two big warc's
20:47 🔗 arkiver and a lot of small warc's
20:47 🔗 arkiver you'll see most of them are just empty accounts
20:56 🔗 Smiley D:
20:57 🔗 Smiley oh right
20:57 🔗 Smiley most are empty because theres nothing to download D:
20:59 🔗 arkiver no no
20:59 🔗 arkiver they are not empty
20:59 🔗 arkiver just the account has been created
20:59 🔗 arkiver and the creator has done nothing
21:00 🔗 arkiver Smiley: like this one:
21:00 🔗 arkiver http://my.opera.com/4bass8/
21:00 🔗 arkiver going to stop the test now
21:01 🔗 arkiver and start testing with more multiple crawlers tomorrow
21:07 🔗 arkiver chfoo: and? do they work for you?
21:08 🔗 chfoo arkiver: i'm a bit concerned that you are requesting gzip compression "Accept-Encoding: x-gzip,gzip,deflate"
21:09 🔗 arkiver hmm
21:09 🔗 arkiver I could view them well with warc proxy
21:09 🔗 arkiver and I uploaded some to the IA
21:09 🔗 arkiver https://archive.org/details/arkiver20140131-1
21:10 🔗 arkiver my new packs
21:10 🔗 arkiver they are indexed good
21:10 🔗 arkiver https://archive.org/download/arkiver20140131-1/arkiver20140131-1.cdx.gz
21:10 🔗 arkiver but because my item is not in the web collection I can't view them in the wayback machine
21:10 🔗 arkiver would it be possible for you to quickly upload those items to the wayback machine and see if they work there?
21:11 🔗 chfoo arkiver: i can't. i'm not affiliated in any way.
21:11 🔗 arkiver :/
21:11 🔗 arkiver ah
21:11 🔗 arkiver but they do index
21:12 🔗 arkiver so they should work right?
21:12 🔗 arkiver and they work in the warc proxy
21:13 🔗 chfoo arkiver: it depends on how the wayback machine handles it. i have no idea actually.
21:14 🔗 arkiver hmm
21:17 🔗 chfoo and i noticed that "WARC-Payload-Digest" is missing as well (but that's optional)
21:18 🔗 arkiver yep
21:18 🔗 arkiver maybe SketchCow wants to move the items (temporarily) to the web section of the IA just to see if they work??
21:22 🔗 Dud1 So using wget/warc is the best way to archive a site?
21:22 🔗 chfoo arkiver: hold on, the warc file isn't valid. there's duplicate WARC-Record-ID
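The duplicate WARC-Record-ID problem chfoo spotted is something that can be checked before uploading. Below is a minimal standard-library sketch, not the tool chfoo used: it walks the record headers, skips each body using Content-Length, and reports any record ID seen more than once. The helper names and the in-memory sample records are made up for the demonstration.

```python
import gzip
import io
from collections import Counter


def iter_warc_headers(stream):
    """Yield one dict of lower-cased header fields per WARC record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = line.partition(b":")
            key = name.strip().lower().decode("ascii", "replace")
            headers[key] = value.strip().decode("utf-8", "replace")
        # Skip the record body so payload bytes are never parsed as headers.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def duplicate_record_ids(stream):
    """Return the sorted WARC-Record-ID values that occur more than once."""
    counts = Counter(h.get("warc-record-id") for h in iter_warc_headers(stream))
    return sorted(rid for rid, n in counts.items() if rid and n > 1)


def make_record(record_id, body=b"hello"):
    """Build a tiny fake record, only for the demonstration below."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Record-ID: {record_id}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body + b"\r\n\r\n"


sample = (
    make_record("<urn:uuid:1>")
    + make_record("<urn:uuid:2>")
    + make_record("<urn:uuid:1>")
)
print(duplicate_record_ids(io.BytesIO(sample)))
```

For a real file, passing `gzip.open(path, "rb")` as the stream should work for .warc.gz files, since Python's gzip module reads through concatenated gzip members. An empty list means no record ID was reused.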
21:23 🔗 arkiver hmm
21:27 🔗 arkiver testing....
21:27 🔗 arkiver inserting in older uploaded pack
21:27 🔗 arkiver gosh
21:28 🔗 arkiver I hope it works
21:30 🔗 SketchCow Nobody can shove them into wayback but employees.
21:30 🔗 SketchCow Why would I move them in temporarily?
21:31 🔗 ivan` Dud1: or wpull. or use archivebot.
21:43 🔗 arkiver SketchCow: to test if they work
21:43 🔗 arkiver the warc's from my opera
21:45 🔗 SketchCow Yeah, but if they work, they're in.
21:45 🔗 SketchCow Why not have them in.
21:46 🔗 Jonimus no reason to pull them out if they work
21:53 🔗 arkiver SketchCow: ah, yes, of course the warc's I'm producing now seem to be working in warc proxy
21:54 🔗 arkiver so if you want to do that
21:54 🔗 arkiver would be nice!!
21:58 🔗 SketchCow Give me your internet archive account e-mail.
