#archiveteam 2013-10-18,Fri


Time Nickname Message
00:40 πŸ”— joepie91 hmm
00:40 πŸ”— joepie91 I want to try my hand at setting up a seesaw script
00:40 πŸ”— joepie91 but for the tracker I'd need to have an IA collection...
00:40 πŸ”— joepie91 is that strictly necessary or can I put it elsewhere for now?
00:41 πŸ”— omf_ you can stick it in opensource as a texts mediatype
00:41 πŸ”— omf_ you also need an upload target
00:42 πŸ”— joepie91 :P
00:42 πŸ”— joepie91 omf_: as in, rsync target? that'd be my OVH box
00:43 πŸ”— * joepie91 is setting up the megawarc factory
00:43 πŸ”— omf_ on that box you have to setup rsync and a megawarc factory if needed
00:43 πŸ”— joepie91 yes, I'm following the instructions for that atm
00:43 πŸ”— joepie91 hence my question about collections
00:43 πŸ”— joepie91 "Bother or ask politely someone about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on EFNet."
00:44 πŸ”— joepie91 but 'opensource' as target collection would do also?
00:45 πŸ”— omf_ opensource is an open collection anyone can stick things in
00:45 πŸ”— ATZ0 what's the frequency, kenneths?
00:45 πŸ”— omf_ To make sure it is kept track of I set the keywords "archiveteam webcrawl"
00:45 πŸ”— joepie91 omf_: no idea how to do that in the config.sh
00:45 πŸ”— omf_ and you can add the project name as a keyword as well
00:47 πŸ”— joepie91 omf_: is it sufficient to have the same prefix for each item name? I can't figure out where to set keywords :P
00:48 πŸ”— omf_ you modify the curl command in upload-one to add the proper s3 headers
00:49 πŸ”— omf_ like this
00:49 πŸ”— omf_ --header 'x-archive-meta-subject:archiveteam;webcrawl' \
00:51 πŸ”— joepie91 I see, thanks
00:52 πŸ”— omf_ here is the one we used for zapd. It is nothing special but you get the idea --> http://paste.archivingyoursh.it/velajudige.pl
00:54 πŸ”— omf_ Each s3 header matches an option on the web interface for creating/editing an item
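
A rough sketch of what omf_ describes, not the actual upload-one script: each x-archive-meta-* header sent to archive.org's S3-compatible endpoint sets the corresponding item metadata field (collection, mediatype, subject keywords). The identifier, filename and credentials below are placeholders.

    # Hedged sketch: upload one file to an archive.org item via the S3-like API,
    # setting metadata through x-archive-meta-* headers. All names are placeholders.
    import requests

    access_key = "IA_ACCESS_KEY"                      # placeholder credentials
    secret_key = "IA_SECRET_KEY"
    identifier = "archiveteam_example_20131018"       # hypothetical item name
    filename = "example-20131018.warc.gz"

    headers = {
        "authorization": "LOW %s:%s" % (access_key, secret_key),
        "x-archive-auto-make-bucket": "1",
        "x-archive-meta-collection": "opensource",
        "x-archive-meta-mediatype": "texts",
        "x-archive-meta-subject": "archiveteam;webcrawl",   # keywords, semicolon-separated
    }

    with open(filename, "rb") as f:
        r = requests.put("https://s3.us.archive.org/%s/%s" % (identifier, filename),
                         data=f, headers=headers)
    r.raise_for_status()
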
01:08 πŸ”— joepie91 omf_: halp!
01:08 πŸ”— joepie91 'ruby' was not found, cannot install rubygems unless ruby is present (Do you have an RVM ruby installed & selected?)
01:08 πŸ”— joepie91 when running `rvm rubygems current`
01:08 πŸ”— joepie91 after the rvm install 2.0
01:09 πŸ”— * joepie91 knows 0 about ruby
01:11 πŸ”— joepie91 wtf ruby
07:13 πŸ”— Lord_Nigh nintendo says fullscreenmario.com is illegal, takedown may happen at some point (ref: http://www.washingtonpost.com/blogs/the-switch/wp/2013/10/17/nintendo-says-this-amazing-super-mario-site-is-illegal-heres-why-it-shouldnt-be/ )
07:13 πŸ”— Lord_Nigh fullscreenmario's engine code is at https://github.com/Diogenesthecynic/FullScreenMario.git
07:52 πŸ”— tephra Lord_Nigh: I cloned the repo yesterday :)
07:53 πŸ”— Lord_Nigh I got it today, they fixed a bug on one of the levels
07:55 πŸ”— tephra nioce
08:00 πŸ”— godane i'm grabbing all the imaginingamerica videos
08:01 πŸ”— godane will archive.org take a zip file of videos and display the videos?
08:32 πŸ”— godane looks like this video disappeared: http://www.youtube.com/watch?v=hT_rY-Lk8nc
08:33 πŸ”— godane it goes from close to 2 hours to just 1 second now
14:38 πŸ”— joepie91 right
14:38 πŸ”— joepie91 I'm giving up on setting up the archiveteam tracker for now :|
14:39 πŸ”— joepie91 it's a disaster to set up... it expects upstart to exist according to the guide (which it doesn't on Debian, and installing it would break sysvinit), I have issues with rvm and the absence of a login shell, and so on
14:39 πŸ”— joepie91 D:
14:41 πŸ”— omf_ joepie91, you were trying to install the universal tracker, why?
14:41 πŸ”— joepie91 omf_: because that's what the guide says?
14:42 πŸ”— joepie91 http://www.archiveteam.org/index.php?title=Tracker_Setup
14:42 πŸ”— joepie91 I want to get a tracker / project running
14:42 πŸ”— joepie91 but I've spent some 4 hours on this now
14:42 πŸ”— joepie91 and I still don't have a working setup
14:42 πŸ”— omf_ We already have a tracker instance it is tracker.archiveteam.org
14:43 πŸ”— joepie91 which I don't have any form of admin access to, nor do I expect it to be appreciated to use it while testing stuff
14:44 πŸ”— omf_ We all test shit using that instance, we just don't put the projects in the projects.json
14:44 πŸ”— omf_ All you need to know about the tracker is that it takes a list of items to send out; beyond that, nothing more is needed to start up a new project
14:46 πŸ”— omf_ bam new instance --> http://tracker.archiveteam.org/isoprey
14:46 πŸ”— omf_ now let me create you an account
14:53 πŸ”— omf_ You add items using the Queues page
14:53 πŸ”— omf_ Claims page is for monitoring
15:33 πŸ”— yipdw joepie91: you don't need upstart
15:58 πŸ”— joepie91 unrelated note; if scraping a site, you'll want to pretend to be Firefox, not Chrome
15:58 πŸ”— joepie91 :P
15:58 πŸ”— joepie91 Chrome auto-updates in nearly every case so if the spoofed useragent you're using is an outdated Chrome version it's very very easy for a server admin to single you out
15:58 πŸ”— joepie91 FF is much less rigorous with that
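
As a small illustration of joepie91's point, a scraper might send a Firefox-style User-Agent explicitly; the UA string below is only an example and should match whatever Firefox release is current.

    # Illustration only: spoofing a Firefox User-Agent with requests.
    import requests

    FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0"
    response = requests.get("http://example.com/page",
                            headers={"User-Agent": FIREFOX_UA})
    print(response.status_code)
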
16:38 πŸ”— omf_ Anyone remember which warrior project had the pipeline that required multiple ids to do the crawling
16:38 πŸ”— omf_ was that puush or xanga? or something else?
16:38 πŸ”— antomatic I think Puush specifies a number of IDs within a range
16:39 πŸ”— antomatic Xanga was straightforward 1:1 items
16:40 πŸ”— antomatic Might have been Patch that issued multiple items-per-item ?
16:44 πŸ”— joepie91 it was puush indeed
16:45 πŸ”— joepie91 mmm, is there a particular reason for the MoveFiles task existing? cc omf_
16:45 πŸ”— joepie91 can't immediately see the point of it, and considering rm'ing it from my script
16:53 πŸ”— antomatic It may not be vital but I think I'm right in saying that it moves the .warc.gz files from the location they're temporarily downloaded to, to a location where they can be assumed 'finished and ready to upload'.
16:54 πŸ”— antomatic Can be useful if a script crashes sometimes.
16:54 πŸ”— antomatic What's finished can be shown with ls data/*/*.gz, whereas partial downloads are left at data/*/*/*.gz
16:55 πŸ”— joepie91 right.
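
A minimal sketch of the MoveFiles pattern antomatic describes, in the style typically seen in ArchiveTeam pipeline.py files; it assumes the item carries item_dir, data_dir and warc_file_base keys set by earlier tasks, and is illustrative rather than joepie91's actual pipeline.

    import os
    import shutil

    from seesaw.task import SimpleTask


    class MoveFiles(SimpleTask):
        """Move the finished .warc.gz out of the per-item working directory
        so that only complete files sit in the upload location."""

        def __init__(self):
            SimpleTask.__init__(self, "MoveFiles")

        def process(self, item):
            # item_dir holds in-progress downloads; data_dir holds finished ones
            os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
                      "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
            shutil.rmtree("%(item_dir)s" % item)
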
17:11 πŸ”— yipdw antomatic: yeah, patch.com did items-per-item
17:12 πŸ”— yipdw I don't recommend taking the patch.com pipeline as an example, though -- it's not that I think it's bad, but it's doing some really specialized stuff that requires substantial additional server support
17:20 πŸ”— joepie91 in seesaw, what's the conceptual idea behind Task.process and Task.enqueue, and how do they differ/relate?
17:24 πŸ”— joepie91 cc yipdw
17:54 πŸ”— joepie91 also, what's the "realize" thing?
18:02 πŸ”— joepie91 OH
18:02 πŸ”— joepie91 realize does the item interpolation?
18:41 πŸ”— yipdw joepie91: item interpolation is best handled by the ItemInterpolation object
18:41 πŸ”— joepie91 yipdw: what I meant was that realize appears to handle the processing of ItemInterpolation objects and such
18:41 πŸ”— joepie91 in the actual Task code
18:41 πŸ”— yipdw oh
18:42 πŸ”— yipdw yeah, maybe -- I try to keep above that level in the seesaw code
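
For reference, a tiny sketch of what joepie91 is poking at: in seesaw, task arguments can be deferred values such as ItemInterpolation, and realize() resolves them against a concrete item when the task runs — assuming seesaw-kit's ItemInterpolation and realize behave as the discussion above suggests.

    from seesaw.config import ItemInterpolation
    from seesaw.util import realize

    # A deferred value: the actual path is only known once an item exists.
    wget_log = ItemInterpolation("%(item_dir)s/wget.log")

    item = {"item_dir": "data/example-item"}   # stand-in for a seesaw Item
    print(realize(wget_log, item))             # -> data/example-item/wget.log
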
18:42 πŸ”— joepie91 I'm still trying to figure out what all this does
18:42 πŸ”— joepie91 :P
18:42 πŸ”— joepie91 right
18:42 πŸ”— yipdw I've not gone that far into it
18:42 πŸ”— yipdw luckily, you don't need to go that far to write custom tasks
18:42 πŸ”— joepie91 yipdw: idk, I'm trying to do ranges
18:42 πŸ”— joepie91 and the WgetDownloadMany thing in puush seemed faaaaar too complex in use for me
18:43 πŸ”— yipdw joepie91: you can write a SimpleTask subclass that expands the range and adds the expanded list to an item key
18:44 πŸ”— yipdw from there you can feed them in as URLs to wget (if they're URLs) or process them further into a wget-usable form
18:44 πŸ”— yipdw that's what the ExpandItem task in the patch pipeline does
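
A hedged sketch of what yipdw suggests: a SimpleTask subclass that expands a range-style item name into a list of URLs stored on an item key, for a later download task to consume. The URL pattern and key names are invented for illustration.

    from seesaw.task import SimpleTask


    class ExpandItemRange(SimpleTask):
        """Turn an item name like '1000-1099' into a URL list on item['urls']."""

        def __init__(self):
            SimpleTask.__init__(self, "ExpandItemRange")

        def process(self, item):
            start, end = item["item_name"].split("-")
            item["urls"] = [
                "http://example.com/%d" % n        # hypothetical URL pattern
                for n in range(int(start), int(end) + 1)
            ]

A later task can then hand item['urls'] to wget, for instance by writing them to a list file and passing --input-file.
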
18:49 πŸ”— joepie91 yipdw: right, my setup is a bit more complex though :P
18:49 πŸ”— joepie91 it tries to separately download a .torrent file first and depending on whether that succeeds it attempts to do a recursive grab of other pages
20:01 πŸ”— Lord_Nigh http://www.dopplr.com/ shutting down
20:02 πŸ”— Lord_Nigh godane: that video can be downloaded as .mp4 here, 1.6gb
20:02 πŸ”— Lord_Nigh http://www.youtube.com/watch?v=hT_rY-Lk8nc <-i mean that one
20:21 πŸ”— godane i just tried and i'm only getting 128kb
20:21 πŸ”— godane also it will not play in browser
20:41 πŸ”— ersi joepie91: There's a #warrior channel that you could use for seesaw discussions
22:13 πŸ”— lemonkey http://www.dopplr.com/
22:13 πŸ”— lemonkey nm lord_nigh already posted
23:18 πŸ”— SketchCow Let's doooo it
23:35 πŸ”— kyan guys there's no way I'm going to be able to download all of fisheye.toolserver.org in a month… I'm only at 43k URLs (been downloading for a day or two now) and have over 3 million queued already. Is there a way to distribute it so multiple requests can be happening at once??
23:38 πŸ”— joepie91 kyan: what exactly is the situation with fisheye.toolserver.org?
23:39 πŸ”— joepie91 also, re: isohunt
23:39 πŸ”— joepie91 <cayce>7 calendar days from the signing of the judgement
23:39 πŸ”— joepie91 <cayce>Filed 10/17/13
23:39 πŸ”— joepie91 <cayce>It's pretty much a legal cease and desist, but he's got 7 days to do it. Nothing stated that he can't do stuff in the interim, as long as he makes that deadline.
23:39 πŸ”— joepie91 <cayce>better hurry the fuck up with that grab, you've got 7 days
23:39 πŸ”— joepie91 <cayce>especially since the only applicable parties is him and his company
23:39 πŸ”— joepie91 <cayce>joepie91:) someone should ask him. He's required to shut it down within 7 days and not continue operating it, but there's nothing in there about not making a backup or somesuch.
23:39 πŸ”— joepie91 <cayce>yeah, okay
23:39 πŸ”— kyan joepie91: it's a website that's shutting down, but it's a) really slow and b) really big
23:39 πŸ”— joepie91 kyan: it just looks like a repository viewer to me?
23:40 πŸ”— kyan joepie91: it is, but for some reason things can't be exported via SVN normally
23:40 πŸ”— kyan joepie91: apparently the history can only be obtained through the web diff interface
23:41 πŸ”— joepie91 that makes no sense...
23:41 πŸ”— joepie91 :|
23:42 πŸ”— balrog kyan: are these svn repos?
23:42 πŸ”— balrog did you try svnsync?
23:42 πŸ”— balrog if you can svn co -r <rev>, then svnsync will do the job
23:42 πŸ”— kyan balrog: IDK, I just know someone else ran into issues with doing it the normal way and so they switched over to wget
23:42 πŸ”— kyan and then wget borked because the site was so big
23:42 πŸ”— balrog link me a repo that has issues
23:43 πŸ”— balrog ugh
23:43 πŸ”— kyan so I tried taking it on with Heritrix
23:43 πŸ”— balrog using wget for this...
23:43 πŸ”— kyan and it's going ok, but not fast enough with a deadline
23:43 πŸ”— kyan let me see if i can find the chatlogs about it
23:45 πŸ”— kyan here we go http://badcheese.com/~steve/atlogs/?chan=archiveteam&day=2013-10-16
23:45 πŸ”— kyan 10 or 15 lines down
23:46 πŸ”— balrog I suggested svnsync there...
23:46 πŸ”— balrog Nemo_bis: ping
23:47 πŸ”— balrog Nemo_bis: svnsync DOES give you history
23:47 πŸ”— balrog it works by using svn co -r to check out each rev and build a new svn repo from those checkouts
23:47 πŸ”— balrog with all metadata and such
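
Roughly, the mirroring workflow balrog describes, wrapped in Python for consistency with the other sketches; the repository URL and local path are placeholders, and svnsync needs the mirror to accept revision-property changes, hence the hook stub.

    import os
    import stat
    import subprocess

    SOURCE = "https://svn.example.org/somerepo"        # placeholder source repo URL
    MIRROR = os.path.abspath("somerepo-mirror")

    # Create an empty local repository to mirror into.
    subprocess.check_call(["svnadmin", "create", MIRROR])

    # svnsync refuses to run unless revprop changes are allowed on the mirror.
    hook = os.path.join(MIRROR, "hooks", "pre-revprop-change")
    with open(hook, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(hook, os.stat(hook).st_mode | stat.S_IXUSR)

    # Point the mirror at the source and replay every revision, history included.
    subprocess.check_call(["svnsync", "init", "file://" + MIRROR, SOURCE])
    subprocess.check_call(["svnsync", "sync", "file://" + MIRROR])
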
23:49 πŸ”— kyan balrog: "at least one root refuses svn export"… not sure what that indicates
23:49 πŸ”— balrog kyan: he was using svn export
23:49 πŸ”— balrog I'd like to know which repo failed it
23:49 πŸ”— balrog kyan: are you good with terminal/command line?
23:50 πŸ”— kyan balrog: not really. I can do enough to get by usually
23:50 πŸ”— balrog ah :/ ok
23:50 πŸ”— * kyan is, however, an EXPERT at writing unusable spaghetti code in php
