Time |
Nickname |
Message |
00:17
🔗
|
n00b221 |
Hey guys just a thought but I noticed something today, archive.org now gives you the option to archive any page if you put it into the Wayback Machine and it is not already archived, and also auto-archives various pages if they are linked to from within an archived page but not archived themselves, so would it be possible to write a script to use this to expand the archive.org archives as well, more or less a a |
00:17
🔗
|
n00b221 |
them if the option to archive them pops up |
00:27
🔗
|
n00b221 |
If there had been something like that during the isohunt bit, the entire site could have been archived in a few hours |
00:30
🔗
|
odie5533 |
n00b221: I'd imagine at this point IA doesn't want to do that. |
00:30
🔗
|
odie5533 |
and would perhaps rather we use archivebot or warrior projects |
00:41
🔗
|
odie5533 |
Does anyone know who manages the Warrior project list and rsync targets? |
06:57
🔗
|
ersi |
odie5533: Anyone else |
06:58
🔗
|
ersi |
We are a bunch with access to the tracker/project list. And SketchCow usually provides rsync targets |
15:39
🔗
|
odie5533 |
So how does one go about adding a Warrior project to the tracker? |
15:44
🔗
|
GLaDOS |
All that needs to be done is editing projects.json (located at http://warriorhq.archiveteam.org/projects.json for AT) |
15:47
🔗
|
odie5533 |
GLaDOS: Doesn't it need to be added to e.g. tracker.archiveteam.org/project/> |
15:47
🔗
|
GLaDOS |
Warrior projects don't need a tracker at that location (e.g. URLTeam, and some other projects which had others hosting trackers) |
15:50
🔗
|
odie5533 |
GLaDOS: Is the tracker.archiveteam.org tracker for only certain projects, or can others request their project be hosted on it? |
15:52
🔗
|
GLaDOS |
You can request a project to be hosted on it |
15:53
🔗
|
GLaDOS |
as long as it doesn't take too much RAM up ;) |
15:57
🔗
|
odie5533 |
Is anyone around that has created a Seesaw pipeline before? chfoo, you there? |
15:58
🔗
|
odie5533 |
In what dev environment is a pipeline created? Can I use the Warrior to create Seesaw pipelines? |
15:59
🔗
|
GLaDOS |
You can use the warrior. |
15:59
🔗
|
GLaDOS |
Hell, it's probably best to do so, seeing as that's where it'll be run from |
15:59
🔗
|
odie5533 |
It has no gui, so I'd have to install one, or edit in vim/emacs |
15:59
🔗
|
GLaDOS |
Just remember to have the script install any packages you may install! |
16:00
🔗
|
GLaDOS |
Oh, right, people usually use IDEs. |
16:00
🔗
|
odie5533 |
I've been liking using Eclipse for my Python development lately |
16:00
🔗
|
GLaDOS |
You could just install the seesaw packages from pypi |
16:00
🔗
|
GLaDOS |
That's all the code that the warrior runs |
16:01
🔗
|
chfoo |
i just do it in ubuntu and assume it works in the warrior |
16:01
🔗
|
GLaDOS |
WELL, NOW I SLEEP |
16:01
🔗
|
odie5533 |
GLaDOS: thanks for the help. g'night |
16:01
🔗
|
odie5533 |
chfoo: you ran pip install seesaw and that was it? |
16:02
🔗
|
odie5533 |
also get-wget-lua.sh which seems to get and build the wget-lua branch |
16:03
🔗
|
chfoo |
yeah, i don't recall having to do anything special |
16:05
🔗
|
odie5533 |
chfoo: Are the Redis items that the Pipeline gets typically just urls? |
16:06
🔗
|
chfoo |
i would try to get the item names as short as possible |
16:06
🔗
|
odie5533 |
What do you mean? |
16:08
🔗
|
chfoo |
the way redis has its data structures is that it's optimized for speed which means it uses up a lot of memory |
16:09
🔗
|
odie5533 |
So what is a typical item for the Pipeline? Not a URL? |
16:11
🔗
|
chfoo |
it depends. for something like a blog, it's usually just the username and then the pipeline interpolates that into username.example.com and wget will crawl the entire domain |
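[Editor's sketch] The username-to-hostname interpolation chfoo describes could look roughly like the following. This is a minimal illustration, not the actual Seesaw pipeline code: the function names `item_to_url` and `url_to_item` are hypothetical, and `example.com` is the placeholder domain from the message above.

```python
from urllib.parse import urlsplit


def item_to_url(item_name: str, domain: str = "example.com") -> str:
    """Expand a short tracker item name (a username) into the URL to crawl."""
    return f"http://{item_name}.{domain}/"


def url_to_item(url: str, domain: str = "example.com") -> str:
    """Reduce a full URL to the short item name that would be stored in the tracker."""
    host = urlsplit(url).hostname or ""
    suffix = "." + domain
    if not host.endswith(suffix):
        raise ValueError(f"{url!r} is not a subdomain of {domain}")
    return host[: -len(suffix)]


if __name__ == "__main__":
    print(item_to_url("alice"))
    print(url_to_item("http://alice.example.com/post/1"))
```

Storing only the short name keeps each tracker item small; the pipeline rebuilds the full URL when it claims the item.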
16:12
🔗
|
odie5533 |
What does the blip.tv grabber use? |
16:12
🔗
|
chfoo |
it's using a full url, but it shouldn't be |
16:13
🔗
|
odie5533 |
should have parsed off at least the http://blip.tv/ part? |
16:13
🔗
|
chfoo |
yeah, it's a bit redundant |
16:13
🔗
|
chfoo |
but i guess since there aren't millions of items to load, it's not using up much memory |
16:15
🔗
|
odie5533 |
At what number of urls would using full urls become excessive? |
16:15
🔗
|
chfoo |
i'm not sure, i'm not a redis expert. |
16:17
🔗
|
chfoo |
i can calculate the memory usage for the puush tracker though for a rough estimate |
16:19
🔗
|
chfoo |
it looks like 11-character item names use 94 bytes per name (568.86MB for 6376877 items) |
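[Editor's sketch] The per-item figure can be checked with a little arithmetic. The sketch below assumes "MB" here means 2^20 bytes, which is the reading that reproduces roughly 94 bytes per name; with 10^6 bytes per MB the result would be closer to 89. The helper names are hypothetical, and the one-million-item projection is only an illustration that assumes the same per-item cost.

```python
BYTES_PER_MIB = 1024 * 1024  # assumption: the quoted "MB" is 2**20 bytes


def bytes_per_item(total_mib: float, item_count: int) -> float:
    """Average memory cost of one tracker item, in bytes."""
    return total_mib * BYTES_PER_MIB / item_count


def estimated_mib(item_count: int, bytes_each: float) -> float:
    """Projected memory use, in MiB, for a given number of items."""
    return item_count * bytes_each / BYTES_PER_MIB


if __name__ == "__main__":
    per_item = bytes_per_item(568.86, 6_376_877)
    print(f"{per_item:.2f} bytes per item")
    # Projection for a hypothetical one-million-item project at 94 bytes each.
    print(f"{estimated_mib(1_000_000, 94):.1f} MiB")
```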
16:44
🔗
|
odie5533 |
chfoo: have you ever run the universal-tracker yourself? |
16:50
🔗
|
chfoo |
odie5533: yeah, i'm running it for the puush project |
16:50
🔗
|
odie5533 |
on a vps? |
16:51
🔗
|
chfoo |
yeah |
16:51
🔗
|
odie5533 |
For the puush items, are they just the img filename like 4ddxg ? |
16:54
🔗
|
chfoo |
yeah, just prefix puu.sh/ to get the url. though currently the items are ranges of 13 images |
17:56
🔗
|
odie5533 |
chfoo: Is the tracker hosted on your personal VPS? |
18:59
🔗
|
odie5533 |
chfoo: what license, if any, is your blip.tv pipeline released under? |
23:22
🔗
|
GLaDOS |
odie5533: generally, treat anything from Archive Team as being licenced under the WTFPL |