#archiveteam-bs 2017-05-02,Tue


***brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
brayden has quit IRC (Read error: Operation timed out)
[01:22]
........................ (idle for 1h59mn)
Odd0002I wonder if archive wants video files from a university course I just took... [03:25]
***pizzaiolo has quit IRC (pizzaiolo) [03:39]
.... (idle for 17mn)
Somebody2Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam. [03:56]
..... (idle for 21mn)
***Sk1d has quit IRC (Ping timeout: 250 seconds) [04:17]
Sk1d has joined #archiveteam-bs
Sk1d has quit IRC (Connection Closed)
[04:24]
ploop has joined #archiveteam-bs [04:35]
ploopSomebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance
additionally, I have no idea how I should be handling various errors, so if my internet cuts out for a few seconds or something, I end up with the script either crashing or missing files
[04:37]
***BlueMaxim has joined #archiveteam-bs [04:38]
ploopand it occurred to me that downloading webpages is not something that I should be having problems with, since plenty of other people's software does it without issue [04:39]
Somebody2well, you've come to the right place. [04:41]
ploopthe easy part is figuring out that I need to download x.com/fileid/x where x is {1..5000000} and maybe do some MIME detection to give it a good filename or something
but somehow I struggle with HTTP, which should be the easier part
[04:41]
Somebody2Look over the docs for wpull; there's also grab-site that offers an interface over it.
You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization.
I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions.
[04:42]
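
(A minimal sketch of the loop ploop describes: walk a numeric ID range, retry transient network errors with backoff, and pick a filename extension via MIME detection. It assumes the third-party requests library; x.com/fileid/N and the 1..5000000 range are placeholders taken from the chat, not a real target.)

    import mimetypes
    import time

    import requests

    session = requests.Session()

    def fetch(file_id, retries=5):
        """Fetch one file, retrying transient errors with backoff."""
        for attempt in range(retries):
            try:
                resp = session.get(f"https://x.com/fileid/{file_id}", timeout=30)
                if resp.status_code == 404:
                    return  # no such file; skip instead of crashing
                resp.raise_for_status()
            except requests.RequestException:
                time.sleep(2 ** attempt)  # connection blip? back off and retry
                continue
            # guess an extension from the Content-Type header
            ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
            ext = mimetypes.guess_extension(ctype) or ".bin"
            with open(f"{file_id}{ext}", "wb") as f:
                f.write(resp.content)
            return
        print(f"giving up on {file_id}")

    for file_id in range(1, 5_000_001):
        fetch(file_id)

(As Somebody2 notes, wpull and grab-site already handle these concerns — retries, resumption, WARC output — without a hand-rolled per-site script.)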
***Aranje has quit IRC (Quit: Three sheets to the wind) [04:47]
ploopthis looks interesting [04:51]
Somebody2I hope so. :-) It serves us pretty well. [04:53]
............................... (idle for 2h33mn)
godanethere is a thunderstorm outside [07:26]
***GE has joined #archiveteam-bs [07:26]
godanelike monsoon like rain is going on where i live [07:26]
.... (idle for 18mn)
***Jonison has joined #archiveteam-bs [07:44]
schbirid has joined #archiveteam-bs [07:53]
espes___ has joined #archiveteam-bs
will has quit IRC (Ping timeout: 250 seconds)
luckcolor has quit IRC (Remote host closed the connection)
midas has quit IRC (hub.se irc.underworld.no)
Jonimus has quit IRC (hub.se irc.underworld.no)
JensRex has quit IRC (hub.se irc.underworld.no)
Lord_Nigh has quit IRC (hub.se irc.underworld.no)
alfiepate has quit IRC (hub.se irc.underworld.no)
Riviera has quit IRC (hub.se irc.underworld.no)
espes__ has quit IRC (hub.se irc.underworld.no)
tammy_ has quit IRC (hub.se irc.underworld.no)
i0npulse has quit IRC (hub.se irc.underworld.no)
purplebot has quit IRC (hub.se irc.underworld.no)
Rai-chan has quit IRC (hub.se irc.underworld.no)
medowar has quit IRC (hub.se irc.underworld.no)
Hecatz has quit IRC (hub.se irc.underworld.no)
LordNigh2 has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
will has joined #archiveteam-bs
alfie has joined #archiveteam-bs
[08:05]
t2t2I think #noanswers needs requeuing, 70k items out [08:11]
***midas1 has joined #archiveteam-bs
Jonimoose has joined #archiveteam-bs
swebb sets mode: +o Jonimoose
[08:17]
LordNigh2 is now known as Lord_Nigh [08:23]
....... (idle for 30mn)
GE has quit IRC (Remote host closed the connection) [08:53]
.... (idle for 19mn)
Jonison has quit IRC (Read error: Connection reset by peer) [09:12]
Jonison has joined #archiveteam-bs
Somebody2 has quit IRC (Read error: Operation timed out)
Jonimoose has quit IRC (west.us.hub irc.Prison.NET)
xmc has quit IRC (Read error: Operation timed out)
Somebody2 has joined #archiveteam-bs
midas1 is now known as midas
xmc has joined #archiveteam-bs
swebb sets mode: +o xmc
[09:18]
.... (idle for 17mn)
deathy has quit IRC (Remote host closed the connection)
HCross2 has quit IRC (Remote host closed the connection)
JAA has joined #archiveteam-bs
[09:43]
deathy has joined #archiveteam-bs [09:52]
JAAServer: IIS/4.1
X-Powered-By: Visual Basic 2.0 on Rails
I lol'd
[09:57]
..... (idle for 23mn)
***HCross2 has joined #archiveteam-bs [10:20]
JAA has quit IRC (Quit: Page closed) [10:28]
Jonimoose has joined #archiveteam-bs
irc.Prison.NET sets mode: +o Jonimoose
swebb sets mode: +o Jonimoose
purplebot has joined #archiveteam-bs
Rai-chan has joined #archiveteam-bs
medowar has joined #archiveteam-bs
Hecatz has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
tammy_ has joined #archiveteam-bs
[10:34]
..... (idle for 24mn)
JensRex has joined #archiveteam-bs
dashcloud has quit IRC (Read error: Connection reset by peer)
dashcloud has joined #archiveteam-bs
[11:03]
...... (idle for 28mn)
HCross2Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps
Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please?
[11:32]
............ (idle for 57mn)
***pizzaiolo has joined #archiveteam-bs [12:30]
........ (idle for 35mn)
BlueMaxim has quit IRC (Quit: Leaving) [13:05]
............ (idle for 57mn)
JensRex has quit IRC (Remote host closed the connection)
JensRex has joined #archiveteam-bs
[14:02]
.... (idle for 17mn)
Yurume has quit IRC (Remote host closed the connection)
antomati_ is now known as antomatic
Ravenloft has quit IRC (Read error: Operation timed out)
[14:20]
Yurume has joined #archiveteam-bs [14:31]
Dark_Star has quit IRC (Read error: Operation timed out)
hook54321 has quit IRC (Ping timeout: 250 seconds)
godane has quit IRC (Ping timeout: 250 seconds)
kanzure has quit IRC (Ping timeout: 250 seconds)
kanzure has joined #archiveteam-bs
alembic has quit IRC (Ping timeout: 260 seconds)
godane has joined #archiveteam-bs
[14:44]
logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017
logchfoo0 has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
alembic has joined #archiveteam-bs
[14:58]
Ctrl-S___ has joined #archiveteam-bs [15:07]
kvieta has quit IRC (Ping timeout: 370 seconds)
GE has joined #archiveteam-bs
nightpool has joined #archiveteam-bs
[15:12]
icedice has joined #archiveteam-bs
icedice2 has joined #archiveteam-bs
[15:26]
yipdw has quit IRC (Read error: Operation timed out)
me_ has joined #archiveteam-bs
icedice2 has quit IRC (Quit: Leaving)
[15:31]
....................... (idle for 1h52mn)
arkiverHCross2: yes, just give it a list of items
or a directory where it can find all the files
[17:28]
..... (idle for 20mn)
HCross2I meant concurrent - I fed it a directory and off it went
So I point it at a directory and it uploads, say, 5 files at once?
[17:48]
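
(For reference, the tool in question is the internetarchive Python library. A single upload() call accepts a list of files for one item but sends them sequentially; for the "5 files at once" behaviour HCross2 describes, one option is a thread pool with one upload() call per file. A hedged sketch: "my-item" and the directory name are placeholders, and the retries argument assumes a library version that supports it.)

    import os
    from concurrent.futures import ThreadPoolExecutor

    from internetarchive import upload

    directory = "data-gov-chunk1"  # placeholder local directory
    paths = [os.path.join(directory, name)
             for name in sorted(os.listdir(directory))]

    def upload_one(path):
        # each call is an independent HTTP PUT into the same item
        upload("my-item", files=[path], retries=5)
        return path

    # five concurrent uploads, mirroring "say 5 files at once"
    with ThreadPoolExecutor(max_workers=5) as pool:
        for done in pool.map(upload_one, paths):
            print("uploaded", done)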
***GE has quit IRC (Remote host closed the connection) [17:55]
namespace has joined #archiveteam-bs [18:02]
namespaceBut yeah.
It's not so much that piracy sites have no cultural value; quite the contrary, they're some of the largest 'open' repositories of cultural value out there.
[18:02]
xmctraditionally we don't care much about legal risk, because the real risk seems low [18:02]
namespaceThey're just radioactive to touch.
Yeah but.
Piracy sites are one of the cases where it's not.
Especially if they just shut down because someone else was suing them or whatever.
[18:03]
xmci see no evidence, only fear [18:03]
namespacenamespace shrugs
Not gonna argue this when it's not even my decision lol.
[18:04]
xmcit's the decision of every member for themselves, of whether they want to participate in that sort of project [18:05]
DFJustinwe've archived shitloads of pirated everything and nothing has happened so far [18:06]
xmcwe've even archived people being scared about it in irc!
hehe
i think we've received a few takedowns on things, but no other fallout
i know that a ftpsite i archived got darked
[18:06]
SketchCowFEEEAR
Did someone call for fear? I work in fear.
[18:08]
xmcyes, hello, fear department, we need a delivery [18:09]
SketchCowDid you want regular fear or extra spicy fear [18:09]
xmcwell what did the requisition form say
come ON we have standardized forms for a *reason*
[18:09]
SketchCowForm unintelligible, blood streaks covering checkboxes [18:10]
MrRadarWhile people are here: is there a list of people who have access to the tracker for different projects? Yahoo Answers needs a requeue and I'm not sure who is best to ping [18:10]
SketchCowPing arkiver or yipdw or I'm not sure who else [18:10]
***me_ is now known as yipdw [18:12]
yipdwthe claims page is 500ing out
one sec
[18:12]
xmcyahooanswers has admins set as arkiver and medowar, for the record
(they, and anyone set as global-admin, can jiggle it)
[18:13]
yipdwoh
it's because someone named pronerdJay has something like 100,000 claims and the page is going FML
I haven't come across something so quintessentially AT in a while
er, maybe it's closer to 50,000
either way
[18:14]
xmchaha [18:16]
yipdw$ ruby release-claims.rb yahooanswers pronerdJay
/home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now
/home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError)
fuck Ruby
[18:16]
xmcthat's the rub [18:16]
yipdwwait what how is that stack trace possible
is hgetall recursing to build a hash??
oh, no, it uses Hash[] and passes the reply in using a splat
fuck Ruby
[18:16]
xmcarchiveteam: finding bugs in standard system tools since 2009 [18:17]
yipdwI think newer versions of redis-rb fix this
oh, but that script is using the tracker gem bundle and I can't update it without affecting the world
bleh I'll write something
[18:17]
icediceIs Yahoo Answers going down? [18:21]
yipdwI have some places where Yahoo Answers can go [18:21]
MrRadaricedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it [18:22]
icediceAh, right
Yahoo sold out to Verizon
[18:22]
yipdwok, it looks like release-stale worked
the spice is flowing again on yahooanswers and I'm getting out of jwz mode
[18:23]
MrRadarThanks yipdw [18:24]
arkiveryipdw: we already have a way of handling too many out items
Requeue on the Workarounds page
[18:24]
yipdwthere's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude
oh, right
I guess that page does the same as release-stale, huh
[18:25]
arkiverI guess so [18:27]
.... (idle for 17mn)
SketchCowhttps://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017*
I'm uploading 10,000 zines
Should I ask permission
SketchCow bites nails
[18:44]
..... (idle for 22mn)
***ndiddy has quit IRC () [19:06]
HCross2Even more data.gov has just started the slow march up to the IA [19:06]
namespaceSketchCow: lolno [19:15]
..... (idle for 21mn)
t2t2BTW the tracker also has stale items for yuku, almost a year old [19:36]
***GE has joined #archiveteam-bs [19:39]
..... (idle for 20mn)
icediceIs there any way to find the Imgur link that was posted in OP's (now deleted) post?
https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/
Nothing on Archive.org
[19:59]
MrRadaricedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/ [20:02]
icediceThanks! [20:06]
..... (idle for 24mn)
***schbirid has quit IRC (Quit: Leaving)
kvieta has joined #archiveteam-bs
[20:30]
kvieta has quit IRC (Read error: Operation timed out) [20:46]
Ravenloft has joined #archiveteam-bs
kvieta has joined #archiveteam-bs
[20:54]
tuluu_ has joined #archiveteam-bs
tuluu has quit IRC (Ping timeout: 250 seconds)
Jonison has quit IRC (Read error: Connection reset by peer)
ndiddy has joined #archiveteam-bs
[21:04]
.......... (idle for 48mn)
espes__ has joined #archiveteam-bs
espes___ has quit IRC (Ping timeout: 250 seconds)
midas has quit IRC (Ping timeout: 250 seconds)
Gfy has quit IRC (Ping timeout: 250 seconds)
mls has quit IRC (Ping timeout: 250 seconds)
midas has joined #archiveteam-bs
tsr has quit IRC (Ping timeout: 250 seconds)
Gfy has joined #archiveteam-bs
andai has quit IRC (Ping timeout: 250 seconds)
Kaz has quit IRC (Ping timeout: 250 seconds)
GE has quit IRC (Remote host closed the connection)
Aoede has quit IRC (Ping timeout: 250 seconds)
hook54321 has quit IRC (Ping timeout: 250 seconds)
C4K3 has quit IRC (Ping timeout: 250 seconds)
tsr has joined #archiveteam-bs
HP_ has joined #archiveteam-bs
C4K3 has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
andai has joined #archiveteam-bs
HP has quit IRC (Ping timeout: 250 seconds)
nightpool has quit IRC (Ping timeout: 250 seconds)
Kaz has joined #archiveteam-bs
mls has joined #archiveteam-bs
andai has quit IRC (Ping timeout: 250 seconds)
SN4T14 has quit IRC (Ping timeout: 250 seconds)
SN4T14 has joined #archiveteam-bs
mls has quit IRC (Ping timeout: 250 seconds)
mls has joined #archiveteam-bs
Aoede has joined #archiveteam-bs
andai has joined #archiveteam-bs
[21:58]
nightpool has joined #archiveteam-bs [22:27]
.... (idle for 19mn)
Aoede has quit IRC (Ping timeout: 250 seconds)
Aoede has joined #archiveteam-bs
[22:46]
andai has quit IRC (Ping timeout: 250 seconds)
andai has joined #archiveteam-bs
[22:57]
sun_rise has joined #archiveteam-bs [23:05]
sun_riseI have questions about what is/is not appropriate for archiveteam/bot, and I'm not sure where to pose them [23:06]
xmchere is a good place to ask [23:06]
sun_riseThree people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web. [23:09]
xmcarchive it
this is 100% okay
unless they want it removed, which, well, doesn't sound like they do
[23:09]
sun_rise"it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this own my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot.
It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting.
[23:12]
xmcyep, that'll happen
if you want, you can next time run your jobs with --no-offsite-links
by default archivebot will fetch every page on the site you submit, and every page that is linked to
in order to present context
(along with the images, scripts, and stylesheets used on those pages)
[23:14]
sun_riseI think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though. [23:14]
xmcmm, possibly [23:15]
sun_riseUltimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else?
Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control?
If I made a donation to offset my usage is there some guide to how much things generally cost?
[23:20]
xmcfeel free to use archivebot
you sound like someone who's fairly conscious of the resources they're using
if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things
that said, everyone who cares about something fills up the queue eventually
we have a cost shameboard that tries to estimate the forever-cost of storing the data
[23:21]
sun_riseI saw this but wasn't sure how quickly that would fill up. There are some high scorers! [23:23]
xmcbut if you throw some chum towards https://archive.org/donate/ it'll probably be fine
hehe
[23:23]
sun_riseI noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks? [23:24]
xmcyep
whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA
[23:24]
sun_riseif I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain? [23:24]
xmcjobs go into warc files named by the url you submit, regardless of whether you use it as a list of urls or a single website
if you're doing fewer than a few dozen sites, I'd suggest one !a per site
like, one day i did all the campaign websites for my city's election
[23:25]
***dashcloud has quit IRC (Remote host closed the connection) [23:28]
DFJustinwe've asked before about what wouldn't be appropriate and sketchcow weighed in:
<SketchCow> In another channel, regarding uploading stuff of dubious value or duplication to archive.org:
<SketchCow> General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
<SketchCow> I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use.
<SketchCow> If the uploader can'te ven come up with a use case, that's dubious value.
<SketchCow> Example: 14gb quicktime movie aimed at a blank wall for an hour, no change
[23:29]
***BlueMaxim has joined #archiveteam-bs [23:30]
DFJustinso if it's in any way useful and it's not already archived, go hog wild; if it's gonna be mainly duplicated data, then be careful about getting up into tens or hundreds of gigs
small sites don't matter, except don't do so many at the same time that there aren't any archivebot slots free for emergencies
[23:31]
***dashcloud has joined #archiveteam-bs [23:33]
DFJustinthis is admittedly hampered by the fact that we don't actually have a readout for the number of free slots [23:33]
sun_riseso submitting a list of urls might be more polite? [23:33]
DFJustinor come in and feed one in every so often as previous ones finish [23:34]
sun_riseI'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get [23:35]
DFJustinhaving a ton of sites on one job can be a problem because the jobs do crash from time to time [23:35]
what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not
if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go
[23:40]
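
(DFJustin's pre-check can be automated with the Wayback Machine availability API, which reports the closest snapshot for a URL. A minimal sketch, assuming the requests library; example.com is a placeholder.)

    import requests

    def latest_snapshot(url):
        """Return the timestamp of the closest Wayback capture, or None."""
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=30)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["timestamp"]  # e.g. "20170502120000"
        return None

    ts = latest_snapshot("http://example.com/")
    print("last capture:", ts or "never archived - due for a go")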
sun_riseok [23:48]
