01:22 -- brayden_ has joined #archiveteam-bs
01:22 -- swebb sets mode: +o brayden_
01:26 -- brayden has quit IRC (Read error: Operation timed out)
03:25 <Odd0002> I wonder if archive wants video files from a university course I just took...
03:39 -- pizzaiolo has quit IRC (pizzaiolo)
03:56 <Somebody2> Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam.
04:17 -- Sk1d has quit IRC (Ping timeout: 250 seconds)
04:24 -- Sk1d has joined #archiveteam-bs
04:24 -- Sk1d has quit IRC (Connection Closed)
04:35 -- ploop has joined #archiveteam-bs
04:37 <ploop> Somebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance
04:38 <ploop> additionally i have no idea how i should be handling various errors so if my internet cuts out for a few seconds or something i end up with the script either crashing or missing files
04:38 -- BlueMaxim has joined #archiveteam-bs
04:39 <ploop> and it occurred to me that downloading webpages is not something that i should be having problems with, since plenty of other people's software does it without issue
04:41 <Somebody2> well, you've come to the right place.
04:41 <ploop> the easy part is figuring out that i need to download x.com/fileid/x where x is {1..5000000} and maybe do some mime detection to give it a good filename or something
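The naming step ploop describes can be sketched with the Python standard library: guess a file extension from the response's Content-Type header. The `filename_for` helper is hypothetical, mirroring the id-based URL scheme in the example above:

```python
import mimetypes

def filename_for(file_id, content_type):
    """Derive a filename like '42.png' from a sequential id and the
    Content-Type header of the HTTP response."""
    # Drop any parameters: 'text/html; charset=utf-8' -> 'text/html'
    mime = content_type.split(';')[0].strip()
    # Fall back to a generic extension when the type is unknown
    ext = mimetypes.guess_extension(mime) or '.bin'
    return f"{file_id}{ext}"
```

This keeps the id as the stable identifier and treats the extension as a convenience, so a wrong or missing Content-Type degrades to `.bin` rather than losing the file.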
04:42 <ploop> but somehow i struggle with http, which should be the easier part
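The error handling ploop is missing (a brief connection drop crashing the script or silently losing files) is usually solved with retries and exponential backoff. A minimal sketch, where `fetch` stands in for whatever actually performs the HTTP request:

```python
import time

def fetch_with_retries(fetch, url, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    fetch: any callable that returns the response body or raises on error.
    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...).
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error rather than lose the file
            sleep(base_delay * (2 ** attempt))
```

With this shape, a dropped connection costs a few seconds of waiting instead of a crash, and a URL that still fails after every attempt raises loudly so it can be logged and re-queued instead of silently skipped.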
04:42 <Somebody2> Look over the docs for wpull; there's also grab-site that offers an interface over it.
04:43 <Somebody2> You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization.
04:44 <Somebody2> I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions.
04:47 -- Aranje has quit IRC (Quit: Three sheets to the wind)
04:51 <ploop> this looks interesting
04:53 <Somebody2> I hope so. :-) It serves us pretty well.
07:26 <godane> there is a thunderstorm outside
07:26 -- GE has joined #archiveteam-bs
07:26 <godane> like, monsoon-like rain is going on where i live
07:44 -- Jonison has joined #archiveteam-bs
07:53 -- schbirid has joined #archiveteam-bs
08:05 -- espes___ has joined #archiveteam-bs
08:06 -- will has quit IRC (Ping timeout: 250 seconds)
08:07 -- luckcolor has quit IRC (Remote host closed the connection)
08:08 -- midas has quit IRC (hub.se irc.underworld.no)
08:08 -- Jonimus has quit IRC (hub.se irc.underworld.no)
08:08 -- JensRex has quit IRC (hub.se irc.underworld.no)
08:08 -- Lord_Nigh has quit IRC (hub.se irc.underworld.no)
08:08 -- alfiepate has quit IRC (hub.se irc.underworld.no)
08:08 -- Riviera has quit IRC (hub.se irc.underworld.no)
08:08 -- espes__ has quit IRC (hub.se irc.underworld.no)
08:08 -- tammy_ has quit IRC (hub.se irc.underworld.no)
08:08 -- i0npulse has quit IRC (hub.se irc.underworld.no)
08:08 -- purplebot has quit IRC (hub.se irc.underworld.no)
08:08 -- Rai-chan has quit IRC (hub.se irc.underworld.no)
08:08 -- medowar has quit IRC (hub.se irc.underworld.no)
08:08 -- Hecatz has quit IRC (hub.se irc.underworld.no)
08:09 -- LordNigh2 has joined #archiveteam-bs
08:09 -- luckcolor has joined #archiveteam-bs
08:09 -- will has joined #archiveteam-bs
08:10 -- alfie has joined #archiveteam-bs
08:11 <t2t2> I think #noanswers needs requeuing, 70k items out
08:17 -- midas1 has joined #archiveteam-bs
08:17 -- Jonimoose has joined #archiveteam-bs
08:17 -- swebb sets mode: +o Jonimoose
08:23 -- LordNigh2 is now known as Lord_Nigh
08:53 -- GE has quit IRC (Remote host closed the connection)
09:12 -- Jonison has quit IRC (Read error: Connection reset by peer)
09:18 -- Jonison has joined #archiveteam-bs
09:19 -- Somebody2 has quit IRC (Read error: Operation timed out)
09:20 -- Jonimoose has quit IRC (west.us.hub irc.Prison.NET)
09:21 -- xmc has quit IRC (Read error: Operation timed out)
09:21 -- Somebody2 has joined #archiveteam-bs
09:24 -- midas1 is now known as midas
09:26 -- xmc has joined #archiveteam-bs
09:26 -- swebb sets mode: +o xmc
09:43 -- deathy has quit IRC (Remote host closed the connection)
09:43 -- HCross2 has quit IRC (Remote host closed the connection)
09:47 -- JAA has joined #archiveteam-bs
09:52 -- deathy has joined #archiveteam-bs
09:57 <JAA> Server: IIS/4.1
09:57 <JAA> X-Powered-By: Visual Basic 2.0 on Rails
09:57 <JAA> I lol'd
10:20 -- HCross2 has joined #archiveteam-bs
10:28 -- JAA has quit IRC (Quit: Page closed)
10:34 -- Jonimoose has joined #archiveteam-bs
10:34 -- irc.Prison.NET sets mode: +o Jonimoose
10:34 -- swebb sets mode: +o Jonimoose
10:36 -- purplebot has joined #archiveteam-bs
10:36 -- Rai-chan has joined #archiveteam-bs
10:36 -- medowar has joined #archiveteam-bs
10:36 -- Hecatz has joined #archiveteam-bs
10:39 -- i0npulse has joined #archiveteam-bs
10:39 -- tammy_ has joined #archiveteam-bs
11:03 -- JensRex has joined #archiveteam-bs
11:03 -- dashcloud has quit IRC (Read error: Connection reset by peer)
11:04 -- dashcloud has joined #archiveteam-bs
11:32 <HCross2> Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps
11:33 <HCross2> Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please?
12:30 -- pizzaiolo has joined #archiveteam-bs
13:05 -- BlueMaxim has quit IRC (Quit: Leaving)
14:02 -- JensRex has quit IRC (Remote host closed the connection)
14:03 -- JensRex has joined #archiveteam-bs
14:20 -- Yurume has quit IRC (Remote host closed the connection)
14:20 -- antomati_ is now known as antomatic
14:24 -- Ravenloft has quit IRC (Read error: Operation timed out)
14:31 -- Yurume has joined #archiveteam-bs
14:44 -- Dark_Star has quit IRC (Read error: Operation timed out)
14:44 -- hook54321 has quit IRC (Ping timeout: 250 seconds)
14:44 -- godane has quit IRC (Ping timeout: 250 seconds)
14:44 -- kanzure has quit IRC (Ping timeout: 250 seconds)
14:44 -- kanzure has joined #archiveteam-bs
14:44 -- alembic has quit IRC (Ping timeout: 260 seconds)
14:47 -- godane has joined #archiveteam-bs
14:58 -- logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017
14:58 -- logchfoo0 has joined #archiveteam-bs
14:59 -- hook54321 has joined #archiveteam-bs
15:00 -- alembic has joined #archiveteam-bs
15:07 -- Ctrl-S___ has joined #archiveteam-bs
15:12 -- kvieta has quit IRC (Ping timeout: 370 seconds)
15:12 -- GE has joined #archiveteam-bs
15:13 -- nightpool has joined #archiveteam-bs
15:26 -- icedice has joined #archiveteam-bs
15:26 -- icedice2 has joined #archiveteam-bs
15:31 -- yipdw has quit IRC (Read error: Operation timed out)
15:33 -- me_ has joined #archiveteam-bs
15:36 -- icedice2 has quit IRC (Quit: Leaving)
17:28 <arkiver> HCross2: yes, just give it a list of items
17:28 <arkiver> or a directory where it can find all items
17:28 <arkiver> files*
17:48 <HCross2> I meant concurrent - I fed it a directory and off it went
17:49 <HCross2> So I point it at a directory and it uploads say 5 files at once
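What HCross2 describes can be sketched with a thread pool: walk the directory and keep at most five uploads in flight. `do_upload` is a stand-in for whatever performs a single-file upload (for instance a call into the internetarchive library); the pool size and flat directory layout are assumptions:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def upload_directory(directory, do_upload, workers=5):
    """Upload every regular file in 'directory', at most 'workers' at a time.

    do_upload: callable taking one file path and performing the upload.
    Returns the per-file results in sorted path order.
    """
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, so any upload exception is re-raised here
        return list(pool.map(do_upload, paths))
```

The same pattern works for any uploader that is safe to call from multiple threads; bump `workers` only as far as the receiving end is happy with.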
17:55 -- GE has quit IRC (Remote host closed the connection)
18:02 -- namespace has joined #archiveteam-bs
18:02 <namespace> But yeah.
18:02 <namespace> It's not so much that piracy sites have no cultural value; quite the contrary, they're some of the largest 'open' repositories of cultural value out there.
18:02 <xmc> traditionally we don't care much about legal risk, because the real risk seems low
18:03 <namespace> They're just radioactive to touch.
18:03 <namespace> Yeah but.
18:03 <namespace> Piracy sites are one of the cases where it's not.
18:03 <namespace> Especially if they just shut down because someone else was suing them or whatever.
18:04 <xmc> i see no evidence, only fear
18:04 * namespace shrugs
18:04 <namespace> Not gonna argue this when it's not even my decision lol.
18:05 <xmc> it's the decision of every member for themselves, of whether they want to participate in that sort of project
18:06 <DFJustin> we've archived shitloads of pirated everything and nothing has happened so far
18:06 <xmc> we've even archived people being scared about it in irc!
18:06 <xmc> hehe
18:07 <xmc> i think we've received a few takedowns on things, but no other fallout
18:08 <xmc> i know that a ftpsite i archived got darked
18:08 <SketchCow> FEEEAR
18:08 <SketchCow> Did someone call for fear? I work in fear.
18:09 <xmc> yes, hello, fear department, we need a delivery
18:09 <SketchCow> Did you want regular fear or extra spicy fear
18:09 <xmc> well what did the requisition form say
18:09 <xmc> come ON we have standardized forms for a *reason*
18:10 <SketchCow> Form unintelligible, blood streaks covering checkboxes
18:10 <MrRadar> While people are here: is there a list of people who have access to the tracker for different projects? Yahoo Answers needs a requeue and I'm not sure who is best to ping
18:10 <SketchCow> Ping arkiver or yipdw or I'm not sure who else
18:12 -- me_ is now known as yipdw
18:12 <yipdw> the claims page is 500ing out
18:12 <yipdw> one sec
18:13 <xmc> yahooanswers has admins set as arkiver and medowar, for the record
18:14 <xmc> (they, and anyone set as global-admin, can jiggle it)
18:14 <yipdw> oh
18:15 <yipdw> it's because someone named pronerdJay has something like 100,000 claims and the page is going FML
18:15 <yipdw> i haven't come across something so quintessentially AT in a while
18:15 <yipdw> er, maybe it's closer to 50,000
18:16 <yipdw> either way
18:16 <xmc> haha
18:16 <yipdw> $ ruby release-claims.rb yahooanswers pronerdJay
18:16 <yipdw> /home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now
18:16 <yipdw> /home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError)
18:16 <yipdw> fuck Rub
18:16 <yipdw> y
18:16 <xmc> that's the rub
18:16 <yipdw> wait what how is that stack trace possible
18:17 <yipdw> is hgetall recursing to build a hash??
18:17 <yipdw> oh, no, it uses Hash[] and passes the reply in using a splat
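For context on the bug yipdw found: that version of redis-rb built the hash as `Hash[*reply]`, so a reply with ~100,000 elements became ~100,000 call arguments and blew the stack. The safe pattern is to pair the flat reply by slicing instead of splatting it into a call; a Python analogue of the fixed approach (the helper name is made up):

```python
def pairs_to_dict(flat_reply):
    """Turn a flat HGETALL-style reply ['k1', 'v1', 'k2', 'v2', ...]
    into a dict without passing the whole list as call arguments."""
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))
```

Because this only iterates, the reply size no longer matters: a two-hundred-thousand-element list builds the same way a four-element one does.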
18:17 <yipdw> fuck Ruby
18:17 <xmc> archiveteam: finding bugs in standard system tools since 2009
18:20 <yipdw> I think newer versions of redis-rb fix this
18:20 <yipdw> oh, but that script is using the tracker gem bundle and I can't update it without affecting the world
18:21 <yipdw> bleh I'll write something
18:21 <icedice> Is Yahoo Answers going down?
18:22 <yipdw> I have some places where Yahoo Answers can go
18:22 <MrRadar> icedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it
18:22 <icedice> Ah, right
18:23 <icedice> Yahoo sold out to Verizon
18:23 <yipdw> ok, it looks like release-stale worked
18:24 <yipdw> the spice is flowing again on yahooanswers and I'm getting out of jwz mode
18:24 <MrRadar> Thanks yipdw
18:25 <arkiver> yipdw: we already have a way of handling too many out items
18:25 <arkiver> Requeue on the Workarounds page
18:25 <yipdw> there's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude
18:25 <yipdw> oh, right
18:27 <yipdw> I guess that page does the same as release-stale, huh
18:44 <arkiver> I guess so
18:44 <SketchCow> https://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017*
18:44 <SketchCow> I'm uploading 10,000 zines
18:44 <SketchCow> Should I ask permission
18:44 * SketchCow bites nails
19:06 -- ndiddy has quit IRC ()
19:15 <HCross2> Even more data.gov has just started the slow march up to the IA
19:36 <namespace> SketchCow: lolno
19:39 <t2t2> BTW the tracker also has stale items for yuku, almost a year old
19:59 -- GE has joined #archiveteam-bs
19:59 <icedice> Is there any way to find the Imgur link that was posted in OP's (now deleted) post?
19:59 <icedice> https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/
20:02 <icedice> Nothing on Archive.org
20:06 <MrRadar> icedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/
20:30 <icedice> Thanks!
20:32 -- schbirid has quit IRC (Quit: Leaving)
20:46 -- kvieta has joined #archiveteam-bs
20:54 -- kvieta has quit IRC (Read error: Operation timed out)
20:56 -- Ravenloft has joined #archiveteam-bs
21:04 -- kvieta has joined #archiveteam-bs
21:04 -- tuluu_ has joined #archiveteam-bs
21:07 -- tuluu has quit IRC (Ping timeout: 250 seconds)
21:10 -- Jonison has quit IRC (Read error: Connection reset by peer)
21:58 -- ndiddy has joined #archiveteam-bs
21:59 -- espes__ has joined #archiveteam-bs
22:02 -- espes___ has quit IRC (Ping timeout: 250 seconds)
22:02 -- midas has quit IRC (Ping timeout: 250 seconds)
22:03 -- Gfy has quit IRC (Ping timeout: 250 seconds)
22:03 -- mls has quit IRC (Ping timeout: 250 seconds)
22:04 -- midas has joined #archiveteam-bs
22:05 -- tsr has quit IRC (Ping timeout: 250 seconds)
22:06 -- Gfy has joined #archiveteam-bs
22:08 -- andai has quit IRC (Ping timeout: 250 seconds)
22:10 -- Kaz has quit IRC (Ping timeout: 250 seconds)
22:11 -- GE has quit IRC (Remote host closed the connection)
22:11 -- Aoede has quit IRC (Ping timeout: 250 seconds)
22:11 -- hook54321 has quit IRC (Ping timeout: 250 seconds)
22:13 -- C4K3 has quit IRC (Ping timeout: 250 seconds)
22:13 -- tsr has joined #archiveteam-bs
22:13 -- HP_ has joined #archiveteam-bs
22:14 -- C4K3 has joined #archiveteam-bs
22:14 -- hook54321 has joined #archiveteam-bs
22:14 -- andai has joined #archiveteam-bs
22:14 -- HP has quit IRC (Ping timeout: 250 seconds)
22:15 -- nightpool has quit IRC (Ping timeout: 250 seconds)
22:16 -- Kaz has joined #archiveteam-bs
22:17 -- mls has joined #archiveteam-bs
22:17 -- andai has quit IRC (Ping timeout: 250 seconds)
22:17 -- SN4T14 has quit IRC (Ping timeout: 250 seconds)
22:21 -- SN4T14 has joined #archiveteam-bs
22:21 -- mls has quit IRC (Ping timeout: 250 seconds)
22:22 -- mls has joined #archiveteam-bs
22:22 -- Aoede has joined #archiveteam-bs
22:27 -- andai has joined #archiveteam-bs
22:46 -- nightpool has joined #archiveteam-bs
22:48 -- Aoede has quit IRC (Ping timeout: 250 seconds)
22:57 -- Aoede has joined #archiveteam-bs
22:58 -- andai has quit IRC (Ping timeout: 250 seconds)
23:05 -- andai has joined #archiveteam-bs
23:06 -- sun_rise has joined #archiveteam-bs
23:06 <sun_rise> I have questions about what is/is not appropriate for archiveteam/bot and not sure where to pose them
23:09 <xmc> here is a good place to ask
23:09 <sun_rise> Three people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web.
23:09 <xmc> archive it
23:10 <xmc> this is 100% okay
23:12 <xmc> unless they want it removed, which, well, doesn't sound like they do
23:12 <sun_rise> "it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this on my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot.
23:14 <sun_rise> It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting.
23:14 <xmc> yep, that'll happen
23:14 <xmc> if you want, you can next time run your jobs with --no-offsite-links
23:14 <xmc> by default archivebot will fetch every page on the site you submit, and every page that is linked to
23:14 <xmc> in order to present context
23:14 <xmc> (along with images and script and stylesheets used on these pages)
23:15 <sun_rise> I think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though.
23:20 <xmc> mm, possibly
23:20 <sun_rise> Ultimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else?
23:21 <sun_rise> Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control?
23:21 <sun_rise> If I made a donation to offset my usage is there some guide to how much things generally cost?
23:21 <xmc> feel free to use archivebot
23:22 <xmc> you sound like someone who's fairly conscious of the resources they're using
23:22 <xmc> if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things
23:23 <xmc> that said, everyone who cares about something fills up the queue eventually
23:23 <xmc> we have a cost shameboard that kind of tries to be a forever-cost of data storage
23:23 <sun_rise> I saw this but wasn't sure how quickly that would fill up. There are some high scorers!
23:23 <xmc> but if you throw some chum towards https://archive.org/donate/ it'll probably be fine
23:24 <xmc> hehe
23:24 <sun_rise> I noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks?
23:24 <xmc> yep
23:24 <xmc> whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA
23:25 <sun_rise> if I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain?
23:26 <xmc> jobs go into warc files named by the url you submit, no matter whether you use it as a list of urls or a single website
23:26 <xmc> if you're doing less than a few dozen sites, i'd suggest one !a per site
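Putting xmc's advice together, the in-channel submissions would look roughly like this (a sketch only; the list-job syntax and exact flags should be checked against the ArchiveBot documentation, and both URLs are placeholders):

```
!a http://example-survivor-site.org/ --no-offsite-links
!a < http://pastebin.example/raw/abc123
```

The first form crawls one site without the default one-hop offsite context; the second feeds ArchiveBot a hosted list of URLs as a single job.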
23:28 <xmc> like, one day i did all the campaign websites for my city's election
23:29 -- dashcloud has quit IRC (Remote host closed the connection)
23:29 <DFJustin> we've asked before about what wouldn't be appropriate and sketchcow weighed in:
23:29 <DFJustin> <SketchCow> In another channel, regarding uploading stuff of dubious value or duplication to archive.org:
23:29 <DFJustin> <SketchCow> General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
23:29 <DFJustin> <SketchCow> I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use.
23:29 <DFJustin> <SketchCow> If the uploader can't even come up with a use case, that's dubious value.
23:29
🔗
|
DFJustin |
<SketchCow> Example: 14gb quicktime movie aimed at a blank wall for an hour, no change |
23:30
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:31
🔗
|
DFJustin |
so if it's in any way useful and it's not already archived, go hog wild, if it's gonna be mainly duplicated data then be careful about getting up into tens or hundreds of gigs |
23:33 <DFJustin> small sites don't matter, except don't do so many at the same time that there aren't any archivebot slots free for emergencies
23:33 -- dashcloud has joined #archiveteam-bs
23:33 <DFJustin> this is admittedly hampered by the fact that we don't actually have a readout for the number of free slots
23:34 <sun_rise> so submitting a list of urls might be more polite?
23:35 <DFJustin> or come in and feed one in every so often as previous ones finish
23:35 <sun_rise> I'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get
23:40 <DFJustin> having a ton of sites on one job can be a problem because the jobs do crash from time to time
23:41 <DFJustin> what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not
23:48 <DFJustin> if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go
23:48 <sun_rise> ok