00:40 <joepie91> hmm
00:40 <joepie91> I want to try my hand at setting up a seesaw script
00:40 <joepie91> but for the tracker I'd need to have an IA collection...
00:40 <joepie91> is that strictly necessary or can I put it elsewhere for now?
00:41 <omf_> you can stick it in opensource as a texts mediatype
00:41 <omf_> you also need an upload target
00:42 <joepie91> :P
00:42 <joepie91> omf_: as in, rsync target? that'd be my OVH box
00:43 * joepie91 is setting up the megawarc factory
00:43 <omf_> on that box you have to setup rsync and a megawarc factory if needed
00:43 <joepie91> yes, I'm following the instructions for that atm
00:43 <joepie91> hence my question about collections
00:43 <joepie91> "Bother or ask politely someone about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on EFNet."
00:44 <joepie91> but 'opensource' as target collection would do also?
00:45 <omf_> opensource is an open collection anyone can stick things in
00:45 <ATZ0> what's the frequency, kenneths?
00:45 <omf_> To make sure it is kept track of I set the keywords "archiveteam webcrawl"
00:45 <joepie91> omf_: no idea how to do that in the config.sh
00:45 <omf_> and you can add the project name as a keyword as well
00:47 <joepie91> omf_: it is sufficient to have the same prefix for each item name? I can't figure out where to set keywords :P
00:47 <joepie91> is it *
00:48 <omf_> you modify the curl command in upload-one to add the proper s3 headers
00:49 <omf_> like this
00:49 <omf_> --header 'x-archive-meta-subject:archiveteam;webcrawl' \
00:51 <joepie91> I see, thanks
00:52 <omf_> here is the one we used for zapd. It is nothing special but you get the idea --> http://paste.archivingyoursh.it/velajudige.pl
00:54 <omf_> Each s3 header matches an option on the web interface for creating/editing an item
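For reference, a fuller sketch of the upload-one curl call omf_ describes, against the Internet Archive S3 API; the item name, file name, and credentials below are placeholders, not the actual zapd script:

    # sketch of an upload-one style curl call against the IA S3 API;
    # item name, file and credentials are placeholders
    curl --location \
         --header 'x-amz-auto-make-bucket:1' \
         --header 'x-archive-meta-collection:opensource' \
         --header 'x-archive-meta-mediatype:texts' \
         --header 'x-archive-meta-subject:archiveteam;webcrawl' \
         --header "authorization: LOW ${S3_ACCESS_KEY}:${S3_SECRET_KEY}" \
         --upload-file myproject-000001.warc.gz \
         http://s3.us.archive.org/archiveteam_myproject_000001/myproject-000001.warc.gz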
01:08 <joepie91> omf_: halp!
01:08 <joepie91> 'ruby' was not found, cannot install rubygems unless ruby is present (Do you have an RVM ruby installed & selected?)
01:08 <joepie91> when running `rvm rubygems current`
01:08 <joepie91> after the rvm install 2.0
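This error usually means the freshly installed ruby is not selected in the current (non-login) shell; a likely fix, assuming a standard rvm install in ~/.rvm:

    # load rvm into a non-login shell, then select the new ruby
    source ~/.rvm/scripts/rvm
    rvm use 2.0 --default
    rvm rubygems current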
01:09 * joepie91 knows 0 about ruby
01:11 <joepie91> wtf ruby
07:13 <Lord_Nigh> nintendo says fullscreenmario.com is illegal, takedown may happen at some point (ref: http://www.washingtonpost.com/blogs/the-switch/wp/2013/10/17/nintendo-says-this-amazing-super-mario-site-is-illegal-heres-why-it-shouldnt-be/ )
07:13 <Lord_Nigh> fullscreenmario's engine code is at https://github.com/Diogenesthecynic/FullScreenMario.git
07:52 <tephra> Lord_Nigh: I cloned the repo yesterday :)
07:53 <Lord_Nigh> I got it today, they fixed a bug on one of the levels
07:55 <tephra> nioce
08:00 <godane> i'm grabbing all the imaginingamerica videos
08:01 <godane> will archive.org take a zip file of videos and display the videos?
08:32 <godane> looks like this video disappeared: http://www.youtube.com/watch?v=hT_rY-Lk8nc
08:33 <godane> it goes from close to 2 hours to just 1 second now
14:38 <joepie91> righ
14:38 <joepie91> right *
14:38 <joepie91> I'm giving up on setting up the archiveteam tracker for now :|
14:39 <joepie91> it's a disaster to set up... it expects upstart to exist according to the guide (which it doesn't on Debian, and installing it would break sysvinit), I have issues with rvm and the absence of a login shell, and so on
14:39 <joepie91> D:
14:41 <omf_> joepie91, you were trying to install the universal tracker, why?
14:41 <joepie91> omf_: because that's what the guide says?
14:42 <joepie91> http://www.archiveteam.org/index.php?title=Tracker_Setup
14:42 <joepie91> I want to get a tracker / project running
14:42 <joepie91> but I've spent some 4 hours on this now
14:42 <joepie91> and I still don't have a working setup
14:42 <omf_> We already have a tracker instance, it is tracker.archiveteam.org
14:43 <joepie91> which I don't have any form of admin access to, nor do I expect it to be appreciated to use it while testing stuff
14:44 <omf_> We all test shit using that instance, we just don't put the projects in the projects.json
14:44 <omf_> All you need to know about the tracker is that it takes a list of items to send out; beyond that, nothing else is needed to start up a new project
14:46 <omf_> bam new instance --> http://tracker.archiveteam.org/isoprey
14:46 <omf_> now let me create you an account
14:53 <omf_> You add items using the Queues page
14:53 <omf_> Claims page is for monitoring
15:33 <yipdw> joepie91: you don't need upstart
15:58 <joepie91> unrelated note; if scraping a site, you'll want to pretend to be Firefox, not Chrome
15:58 <joepie91> :P
15:58 <joepie91> Chrome auto-updates in nearly every case so if the spoofed useragent you're using is an outdated Chrome version it's very very easy for a server admin to single you out
15:58 <joepie91> FF is much less rigorous with that
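For instance, a sketch with wget — the user agent string is a then-current Firefox ESR, and the URL is a placeholder:

    # spoof a Firefox ESR user agent rather than a quickly-dated Chrome one
    wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0" \
         http://example.com/some/page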
16:38 <omf_> Anyone remember which warrior project had the pipeline that required multiple ids to do the crawling
16:38 <omf_> was that puush or xanga? or something else?
16:38 <antomatic> I think Puush specifies a number of IDs within a range
16:39 <antomatic> Xanga was straightforward 1:1 items
16:40 <antomatic> Might have been Patch that issued multiple items-per-item ?
16:44 <joepie91> it was puush indeed
16:45 <joepie91> mmm, is there a particular reason for the MoveFiles task existing? cc omf_
16:45 <joepie91> can't immediately see the point of it, and considering rm'ing it from my script
16:53 <antomatic> It may not be vital but I think I'm right in saying that it moves the .warc.gz files from the location they're temporarily downloaded to, to a location where they can be assumed to be 'finished and ready to upload'.
16:54 <antomatic> Can be useful if a script crashes sometimes.
16:54 <antomatic> What's finished can be shown with ls data/*/*.gz, whereas partial downloads are left at data/*/*/*.gz
16:55 <joepie91> right.
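A rough shell equivalent of the move antomatic describes — the real MoveFiles is a Python task in the seesaw pipeline, and the paths here are placeholders:

    # move finished warcs out of the in-progress directory, only after
    # the download step has exited cleanly (paths are placeholders)
    item_dir="data/myitem/files"     # wget writes in-progress .warc.gz here
    done_dir="data/myitem"           # the upload step only considers this level
    mv "$item_dir"/*.warc.gz "$done_dir"/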
17:11 <yipdw> antomatic: yeah, patch.com did items-per-item
17:12 <yipdw> I don't recommend taking the patch.com pipeline as an example, though -- it's not that I think it's bad, but it's doing some really specialized stuff that requires substantial additional server support
17:20 <joepie91> in seesaw, what's the conceptual idea behind Task.process and Task.enqueue, and how do they differ/relate?
17:24 <joepie91> cc yipdw
17:54 <joepie91> also, what's the "realize" thing?
18:02 <joepie91> OH
18:02 <joepie91> realize does the item interpolation?
18:41 <yipdw> joepie91: item interpolation is best handled by the ItemInterpolation object
18:41 <joepie91> yipdw: what I meant was that realize appears to handle the processing of ItemInterpolation objects and such
18:41 <joepie91> in the actual Task code
18:42 <yipdw> oh
18:42 <yipdw> yeah, maybe -- I try to keep above that level in the seesaw code
18:42 <joepie91> I'm still trying to figure out what all this does
18:42 <joepie91> :P
18:42 <joepie91> right
18:42 <yipdw> I've not gone that far into it
18:42 <yipdw> luckily, you don't need to go that far to write custom tasks
18:42 <joepie91> yipdw: idk, I'm trying to do ranges
18:42 <joepie91> and the WgetDownloadMany thing in puush seemed faaaaar too complex in use for me
18:43 <yipdw> joepie91: you can write a SimpleTask subclass that expands the range and adds the expanded list to an item key
18:44 <yipdw> from there you can feed them in as URLs to wget (if they're URLs) or process them further into a wget-usable form
18:44 <yipdw> that's what the ExpandItem task in the patch pipeline does
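A conceptual sketch, in shell, of what such an expansion task does — the real ExpandItem/SimpleTask is Python inside the seesaw pipeline, and the range format and URL scheme here are made up:

    # expand a range item like "1000-1019" into one URL per id,
    # which a later task could hand to wget via --input-file
    item_name="1000-1019"
    start="${item_name%-*}"
    end="${item_name#*-}"
    for id in $(seq "$start" "$end"); do
        echo "http://example.com/i/${id}"
    done > item_urls.txt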
18:49 <joepie91> yipdw: right, my setup is a bit more complex though :P
18:49 <joepie91> it tries to separately download a .torrent file first and depending on whether that succeeds it attempts to do a recursive grab of other pages
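In shell terms, that torrent-first logic would look roughly like this (all URLs and names are placeholders; the actual script is a seesaw pipeline):

    # grab the .torrent first; only if it exists, recursively grab the pages
    if wget -q "http://example.com/torrents/${item}.torrent" -O "${item}.torrent"; then
        wget --recursive --level=2 --warc-file="${item}" \
             "http://example.com/details/${item}"
    fi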
20:01 <Lord_Nigh> http://www.dopplr.com/ shutting down
20:02 <Lord_Nigh> godane: that video can be downloaded as .mp4 here, 1.6gb
20:02 <Lord_Nigh> http://www.youtube.com/watch?v=hT_rY-Lk8nc <-i mean that one
20:21 <godane> i just tried and i'm only getting 128kb
20:21 <godane> also it will not play in browser
20:41 <ersi> joepie91: There's a #warrior channel that you could use for seesaw discussions
22:13 <lemonkey> http://www.dopplr.com/
22:13 <lemonkey> nm lord_nigh already posted
23:18 <SketchCow> Let's doooo it
23:35 <kyan> guys there's no way I'm going to be able to download all of fisheye.toolserver.org in a month… I'm only at 43k URLs (been downloading for a day or two now) and have over 3 million queued already. Is there a way to distribute it so multiple requests can be happening at once??
23:38 <joepie91> kyan: what exactly is the situation with fisheye.toolserver.org?
23:39 <joepie91> also, re: isohunt
23:39 <joepie91> <cayce>7 calendar days from the signing of the judgement
23:39 <joepie91> <cayce>Filed 10/17/13
23:39 <joepie91> <cayce>It's pretty much a legal cease and desist, but he's got 7 days to do it. Nothing stated that he can't do stuff in the interim, as long as he makes that deadline.
23:39 <joepie91> <cayce>better hurry the fuck up with that grab, you've got 7 days
23:39 <joepie91> <cayce>especially since the only applicable parties is him and his company
23:39 <joepie91> <cayce>joepie91:) someone should ask him. He's required to shut it down within 7 days and not continue operating it, but there's nothing in there about not making a backup or somesuch.
23:39 <joepie91> <cayce>yeah, okay
23:39 <kyan> joepie91: it's a website that's shutting down, but it's a) really slow and b) really big
23:40 <joepie91> kyan: it just looks like a repository viewer to me?
23:40 <kyan> joepie91: it is, but for some reason things can't be exported via SVN normally
23:41 <kyan> joepie91: apparently the history can only be obtained through the web diff interface
23:41 <joepie91> that makes no sense...
23:42 <joepie91> :|
23:42 <balrog> kyan: are these svn repos?
23:42 <balrog> did you try svnsync?
23:42 <balrog> if you can svn co -r <rev>, then svnsync will do the job
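For reference, mirroring a repository with full history via svnsync looks roughly like this (the source URL is a placeholder):

    # create a local mirror repo and let svnsync replay every revision into it
    svnadmin create mirror-repo
    # svnsync must be allowed to set revision properties on the mirror:
    echo '#!/bin/sh' > mirror-repo/hooks/pre-revprop-change
    chmod +x mirror-repo/hooks/pre-revprop-change
    svnsync init file://"$PWD"/mirror-repo http://svn.example.org/repo
    svnsync sync file://"$PWD"/mirror-repo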
23:42 <kyan> balrog: IDK, I just know someone else ran into issues with doing it the normal way and so they switched over to wget
23:42 <kyan> and then wget borked because the site was so big
23:43 <balrog> link me a repo that has issues
23:43 <balrog> ugh
23:43 <kyan> so I tried taking it on with Heritrix
23:43 <balrog> using wget for this...
23:43 <kyan> and it's going ok, but not fast enough with a deadline
23:43 <kyan> let me see if i can find the chatlogs about it
23:45 <kyan> here we go http://badcheese.com/~steve/atlogs/?chan=archiveteam&day=2013-10-16
23:45 <kyan> 10 or 15 lines down
23:46 <balrog> I suggested svnsync there...
23:46 <balrog> Nemo_bis: ping
23:47 <balrog> Nemo_bis: svnsync DOES give you history
23:47 <balrog> it works by using svn co -r to check out each rev and build a new svn repo from those checkouts
23:47 <balrog> with all metadata and such
23:49 <kyan> balrog: "at least one root refuses svn export"… not sure what that indicates
23:49 <balrog> kyan: he was using svn export
23:49 <balrog> I'd like to know which repo failed it
23:49 <balrog> kyan: are you good with terminal/command line?
23:50 <kyan> balrog: not really. I can do enough to get by usually
23:50 <balrog> ah :/ ok
23:50 * kyan is, however, an EXPERT at writing unusable spaghetti code in php