[00:40] hmm
[00:40] I want to try my hand at setting up a seesaw script
[00:40] but for the tracker I'd need to have an IA collection...
[00:40] is that strictly necessary or can I put it elsewhere for now?
[00:41] you can stick it in opensource as a texts mediatype
[00:41] you also need an upload target
[00:42] :P
[00:42] omf_: as in, rsync target? that'd be my OVH box
[00:43] * joepie91 is setting up the megawarc factory
[00:43] on that box you have to setup rsync and a megawarc factory if needed
[00:43] yes, I'm following the instructions for that atm
[00:43] hence my question about collections
[00:43] "Bother or ask politely someone about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on EFNet."
[00:44] but 'opensource' as target collection would do also?
[00:45] opensource is an open collection anyone can stick things in
[00:45] what's the frequency, kenneths?
[00:45] To make sure it is kept track of I set the keywords "archiveteam webcrawl"
[00:45] omf_: no idea how to do that in the config.sh
[00:45] and you can add the project name as a keyword as well
[00:47] omf_: is it sufficient to have the same prefix for each item name? I can't figure out where to set keywords :P
[00:48] you modify the curl command in upload-one to add the proper s3 headers
[00:49] like this
[00:49] --header 'x-archive-meta-subject:archiveteam;webcrawl' \
[00:51] I see, thanks
[00:52] here is the one we used for zapd. It is nothing special but you get the idea --> http://paste.archivingyoursh.it/velajudige.pl
[00:54] Each s3 header matches an option on the web interface for creating/editing an item
[01:08] omf_: halp!
[01:08] 'ruby' was not found, cannot install rubygems unless ruby is present (Do you have an RVM ruby installed & selected?)
[01:08] when running `rvm rubygems current`
[01:08] after the rvm install 2.0
[01:09] * joepie91 knows 0 about ruby
[01:11] wtf ruby
[07:13] nintendo says fullscreenmario.com is illegal, takedown may happen at some point (ref: http://www.washingtonpost.com/blogs/the-switch/wp/2013/10/17/nintendo-says-this-amazing-super-mario-site-is-illegal-heres-why-it-shouldnt-be/ )
[07:13] fullscreenmario's engine code is at https://github.com/Diogenesthecynic/FullScreenMario.git
[07:52] Lord_Nigh: I cloned the repo yesterday :)
[07:53] I got it today, they fixed a bug on one of the levels
[07:55] nice
[08:00] i'm grabbing all the imaginingamerica videos
[08:01] will archive.org take a zip file of videos and display the videos?
[08:32] looks like this video disappeared: http://www.youtube.com/watch?v=hT_rY-Lk8nc
[08:33] it goes from close to 2 hours to just 1 second now
[14:38] right
[14:38] I'm giving up on setting up the archiveteam tracker for now :|
[14:39] it's a disaster to set up... it expects upstart to exist according to the guide (which it doesn't on Debian, and installing it would break sysvinit), I have issues with rvm and the absence of a login shell, and so on
[14:39] D:
[14:41] joepie91, you were trying to install the universal tracker, why?
[14:41] omf_: because that's what the guide says?
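For reference, a minimal sketch of what the modified upload-one call at 00:48-00:49 amounts to, expressed here with python-requests instead of curl; the access keys, item name, filename and mediatype are placeholders rather than values from the actual zapd script.

```python
# Hedged sketch: an Internet Archive S3-style upload with metadata headers,
# equivalent in spirit to the curl call in upload-one with extra s3 headers.
# ACCESS_KEY/SECRET_KEY, the item name and the filename are placeholders.
import requests

headers = {
    "authorization": "LOW ACCESS_KEY:SECRET_KEY",
    "x-archive-auto-make-bucket": "1",
    # each x-archive-meta-* header matches a field on the item create/edit page
    "x-archive-meta-mediatype": "texts",
    "x-archive-meta-collection": "opensource",
    "x-archive-meta-subject": "archiveteam;webcrawl",
}

with open("example.warc.gz", "rb") as f:
    r = requests.put(
        "https://s3.us.archive.org/archiveteam_example_item/example.warc.gz",
        data=f,
        headers=headers,
    )
r.raise_for_status()
```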
[14:42] http://www.archiveteam.org/index.php?title=Tracker_Setup
[14:42] I want to get a tracker / project running
[14:42] but I've spent some 4 hours on this now
[14:42] and I still don't have a working setup
[14:42] We already have a tracker instance, it is tracker.archiveteam.org
[14:43] which I don't have any form of admin access to, nor do I expect it to be appreciated to use it while testing stuff
[14:44] We all test shit using that instance, we just don't put the projects in the projects.json
[14:44] All you need to know about the tracker is it takes a list of items to send out, beyond that is not needed to start up a new project
[14:46] bam new instance --> http://tracker.archiveteam.org/isoprey
[14:46] now let me create you an account
[14:53] You add items using the Queues page
[14:53] Claims page is for monitoring
[15:33] joepie91: you don't need upstart
[15:58] unrelated note; if scraping a site, you'll want to pretend to be Firefox, not Chrome
[15:58] :P
[15:58] Chrome auto-updates in nearly every case so if the spoofed useragent you're using is an outdated Chrome version it's very very easy for a server admin to single you out
[15:58] FF is much less rigorous with that
[16:38] Anyone remember which warrior project had the pipeline that required multiple ids to do the crawling
[16:38] was that puush or xanga? or something else?
[16:38] I think Puush specifies a number of IDs within a range
[16:39] Xanga was straightforward 1:1 items
[16:40] Might have been Patch that issued multiple items-per-item?
[16:44] it was puush indeed
[16:45] mmm, is there a particular reason for the MoveFiles task existing? cc omf_
[16:45] can't immediately see the point of it, and considering rm'ing it from my script
[16:53] It may not be vital but I think I'm right in saying that it moves the .warc.gz files from the location that they're temporarily downloading in, to a location where they can be assumed as 'finished and ready to upload'.
[16:54] Can be useful if a script crashes sometimes.
[16:54] What's finished can be shown with ls data/*/*.gz, whereas partial downloads are left at data/*/*/*.gz
[16:55] right.
[17:11] antomatic: yeah, patch.com did items-per-item
[17:12] I don't recommend taking the patch.com pipeline as an example, though -- it's not that I think it's bad, but it's doing some really specialized stuff that requires substantial additional server support
[17:20] in seesaw, what's the conceptual idea behind Task.process and Task.enqueue, and how do they differ/relate?
[17:24] cc yipdw
[17:54] also, what's the "realize" thing?
[18:02] OH
[18:02] realize does the item interpolation?
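A rough sketch of the MoveFiles task described at 16:53-16:54: it promotes a finished warc from the per-item working directory up into data/ so it can be treated as ready to upload. The item keys used here (item_dir, warc_file_base, data_dir) follow the convention many warrior pipelines use and are assumed to have been set by earlier tasks.

```python
# Hedged sketch of a MoveFiles-style task, assuming earlier tasks set the
# item_dir, warc_file_base and data_dir keys on the item.
import os
import shutil

from seesaw.task import SimpleTask


class MoveFiles(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "MoveFiles")

    def process(self, item):
        # move the finished warc out of the temporary per-item directory
        os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
                  "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
        # remove the working directory; anything still in there is a
        # partial download left over from a crashed run
        shutil.rmtree("%(item_dir)s" % item)
```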
[18:41] joepie91: item interpolation is best handled by the ItemInterpolation object
[18:41] yipdw: what I meant was that realize appears to handle the processing of ItemInterpolation objects and such
[18:41] in the actual Task code
[18:41] oh
[18:42] yeah, maybe -- I try to keep above that level in the seesaw code
[18:42] I'm still trying to figure out what all this does
[18:42] :P
[18:42] right
[18:42] I've not gone that far into it
[18:42] luckily, you don't need to go that far to write custom tasks
[18:42] yipdw: idk, I'm trying to do ranges
[18:42] and the WgetDownloadMany thing in puush seemed faaaaar too complex in use for me
[18:43] joepie91: you can write a SimpleTask subclass that expands the range and adds the expanded list to an item key
[18:44] from there you can feed them in as URLs to wget (if they're URLs) or process them further into a wget-usable form
[18:44] that's what the ExpandItem task in the patch pipeline does
[18:49] yipdw: right, my setup is a bit more complex though :P
[18:49] it tries to separately download a .torrent file first and depending on whether that succeeds it attempts to do a recursive grab of other pages
[20:01] http://www.dopplr.com/ shutting down
[20:02] godane: that video can be downloaded as .mp4 here, 1.6gb
[20:02] http://www.youtube.com/watch?v=hT_rY-Lk8nc <- i mean that one
[20:21] i just tried and i'm only getting 128kb
[20:21] also it will not play in browser
[20:41] joepie91: There's a #warrior channel that you could use for seesaw discussions
[22:13] http://www.dopplr.com/
[22:13] nm lord_nigh already posted
[23:18] Let's doooo it
[23:35] guys there's no way I'm going to be able to download all of fisheye.toolserver.org in a month… I'm only at 43k URLs (been downloading for a day or two now) and have over 3 million queued already. Is there a way to distribute it so multiple requests can be happening at once??
[23:38] kyan: what exactly is the situation with fisheye.toolserver.org?
[23:39] also, re: isohunt
[23:39] 7 calendar days from the signing of the judgement
[23:39] Filed 10/17/13
[23:39] It's pretty much a legal cease and desist, but he's got 7 days to do it. Nothing stated that he can't do stuff in the interim, as long as he makes that deadline.
[23:39] better hurry the fuck up with that grab, you've got 7 days
[23:39] especially since the only applicable parties are him and his company
[23:39] joepie91:) someone should ask him. He's required to shut it down within 7 days and not continue operating it, but there's nothing in there about not making a backup or somesuch.
[23:39] yeah, okay
[23:39] joepie91: it's a website that's shutting down, but it's a) really slow and b) really big
[23:39] kyan: it just looks like a repository viewer to me?
[23:40] joepie91: it is, but for some reason things can't be exported via SVN normally
[23:40] joepie91: apparently the history can only be obtained through the web diff interface
[23:41] that makes no sense...
[23:41] :|
[23:42] kyan: are these svn repos?
[23:42] did you try svnsync?
[23:42] if you can svn co -r , then svnsync will do the job
[23:42] balrog: IDK, I just know someone else ran into issues with doing it the normal way and so they switched over to wget
[23:42] and then wget borked because the site was so big
[23:42] link me a repo that has issues
[23:43] ugh
[23:43] so I tried taking it on with Heritrix
[23:43] using wget for this...
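A minimal sketch of the kind of SimpleTask subclass yipdw suggests at 18:43: expand a range-style item name into a list of URLs stored on an item key, so a later wget task can consume them. The item-name format ("START-END") and the URL pattern are invented for illustration; the real ExpandItem in the patch pipeline does something more involved.

```python
# Hedged sketch of a range-expanding SimpleTask; the item name format and
# the URL template are hypothetical examples, not from any real pipeline.
from seesaw.task import SimpleTask


class ExpandItem(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "ExpandItem")

    def process(self, item):
        # an item name like "1000-1099" becomes a list of URLs on the item
        start, end = item["item_name"].split("-", 1)
        item["item_urls"] = [
            "http://example.com/%d" % n
            for n in range(int(start), int(end) + 1)
        ]
```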
[23:43] and it's going ok, but not fast enough with a deadline
[23:43] let me see if i can find the chatlogs about it
[23:45] here we go http://badcheese.com/~steve/atlogs/?chan=archiveteam&day=2013-10-16
[23:45] 10 or 15 lines down
[23:46] I suggested svnsync there...
[23:46] Nemo_bis: ping
[23:47] Nemo_bis: svnsync DOES give you history
[23:47] it works by using svn co -r to check out each rev and build a new svn repo from those checkouts
[23:47] with all metadata and such
[23:49] balrog: "at least one root refuses svn export"… not sure what that indicates
[23:49] kyan: he was using svn export
[23:49] I'd like to know which repo failed it
[23:49] kyan: are you good with terminal/command line?
[23:50] balrog: not really. I can do enough to get by usually
[23:50] ah :/ ok
[23:50] * kyan is, however, an EXPERT at writing unusable spaghetti code in php
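A hedged sketch of the svnsync mirroring balrog suggests at 23:42-23:47, wrapped in Python for consistency with the other examples; the source repository URL is a placeholder. svnsync keeps its bookkeeping in revision properties, so the mirror needs a pre-revprop-change hook that allows the change.

```python
# Hedged sketch: mirror an svn repository with full history via svnsync.
# The source URL below is a placeholder, not the actual toolserver repo URL.
import os
import subprocess

mirror = os.path.abspath("mirror-repo")
source = "https://svn.example.org/some-repo"  # placeholder

# create an empty local repository to receive the mirror
subprocess.check_call(["svnadmin", "create", mirror])

# allow revprop changes on the mirror (required by svnsync)
hook = os.path.join(mirror, "hooks", "pre-revprop-change")
with open(hook, "w") as f:
    f.write("#!/bin/sh\nexit 0\n")
os.chmod(hook, 0o755)

# point the mirror at the source and pull every revision across
subprocess.check_call(["svnsync", "init", "file://" + mirror, source])
subprocess.check_call(["svnsync", "sync", "file://" + mirror])
```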