#archiveteam-bs 2017-11-05,Sun


Time	Nickname	Message
00:33 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
01:19 ^🔗		schbirid has quit IRC (Read error: Operation timed out)
01:31 ^🔗		schbirid has joined #archiveteam-bs
02:08 ^🔗		Darkstar has quit IRC (Ping timeout: 506 seconds)
02:25 ^🔗	Somebody2	Regarding the whole, "IA doesn't distribute everything!" conversation -- Please DO upload as much as possible to as many different places as possible!
02:25 ^🔗	Somebody2	No matter what else, more copies are a GOOD THING.
02:25 ^🔗		Darkstar has joined #archiveteam-bs
02:30 ^🔗	Rai-chan	^
02:30 ^🔗	Ceryn	Somebody2: Do you have some resource on what the options are? Who will take data, pros and cons, where you might find similar data already? I now know of IA, obviously.
02:35 ^🔗		drumstick has quit IRC (Ping timeout: 248 seconds)
02:43 ^🔗		pizzaiolo has quit IRC (Remote host closed the connection)
02:44 ^🔗		drumstick has joined #archiveteam-bs
02:56 ^🔗	JensRex	chfoo: Why Docker for the Warrior?
02:56 ^🔗	JensRex	Docker is stateless. Doesn't seem like a good fit.
02:57 ^🔗		schbirid has quit IRC (Read error: Operation timed out)
03:04 ^🔗		MadArchiv has joined #archiveteam-bs
03:09 ^🔗		schbirid has joined #archiveteam-bs
03:18 ^🔗		MadArchiv has quit IRC (Read error: Operation timed out)
03:42 ^🔗		jspiros has quit IRC (Ping timeout: 492 seconds)
03:56 ^🔗		Asparagir has quit IRC (Asparagir)
04:11 ^🔗		qw3rty5 has joined #archiveteam-bs
04:18 ^🔗		qw3rty4 has quit IRC (Read error: Operation timed out)
04:25 ^🔗		jspiros has joined #archiveteam-bs
04:30 ^🔗	Somebody2	Ceryn: I don't have much, but there is some on the archiveteam wiki (and we should add more)
04:33 ^🔗	Somebody2	A lot depends on how much data you are looking for a home for.
04:36 ^🔗	Somebody2	There are quite a few places where you can stash a few kilobytes (i.e. a couple pages of text), while there are many fewer places to drop a petabyte in need of a home.
04:43 ^🔗	Ceryn	I'll scour the wiki I guess.
04:44 ^🔗	Ceryn	If you know of many such sites and they're not in the wiki yet I'd like to know of them.
04:48 ^🔗	Somebody2	Ceryn: yes, that's a good idea.
04:48 ^🔗	Somebody2	Eh, you've got me interested; I'll go write up a wiki page.
04:49 ^🔗	Somebody2	Or better, add more stuff to http://archiveteam.org/index.php?title=Valhalla
04:49 ^🔗	Somebody2	which is (I think) the right place for this
04:50 ^🔗	Somebody2	well, kinda
04:51 ^🔗	Somebody2	Sigh, I'll make a new page http://archiveteam.org/index.php?title=Places_to_store_data
04:52 ^🔗	Ceryn	Hah! Hook, line and sinker!
04:53 ^🔗	Somebody2	:-P
04:53 ^🔗	Ceryn	Thanks. :)
05:04 ^🔗		Asparagir has joined #archiveteam-bs
05:09 ^🔗		drumstick has quit IRC (Read error: Operation timed out)
05:10 ^🔗		drumstick has joined #archiveteam-bs
05:20 ^🔗	Somebody2	Ceryn: OK, wrote up the intro; comments welcomed; I'll add more specific suggestions soon.
05:20 ^🔗	Ceryn	Somebody2: Cool! Reading.
05:25 ^🔗	Ceryn	Somebody2: Looks good (y). I think you should leave the general information up top and put the IA stuff down under places to store data. Some captions would probably be useful too.
05:26 ^🔗	Ceryn	Somebody2: When I hear "video" I think "movie", and that's on the order of ~20GB. Maybe call that a photo album instead?
05:29 ^🔗	Ceryn	Somebody2: Once people know where to store data, it would also be relevant to know what the commonly preferred data formats are for given types of data. Assuming there's anything resembling a consensus.
05:29 ^🔗	Ceryn	Somebody2: And obviously the "Places to store data" will need that list of suggested places to store data before it really becomes relevant. :)
05:45 ^🔗	Somebody2	Regarding formats -- ha. HA. Hahahahah AhahaaahaHAHAHAa. No, no there really isn't anything resembling a consensus.
05:45 ^🔗	Somebody2	And we have an entire wiki devoted to that -- the fileformats wiki.
05:46 ^🔗	Somebody2	I'll change video to video clip -- I was thinking of short youtube clips.
05:46 ^🔗	Somebody2	I'm not sure what you mean by "captions"?
05:47 ^🔗	Ceryn	Haha. There ought to be one.
05:47 ^🔗	Ceryn	Surely one general approach is better than the rest.
05:47 ^🔗	Ceryn	By captions I mean section titles.
05:48 ^🔗	Somebody2	Ah, yeah I'm planning on making sections for each of the size groups
05:51 ^🔗	Ceryn	(y)
06:24 ^🔗	Somebody2	Ceryn: add some more
06:25 ^🔗	Somebody2	er, I have added some more
06:25 ^🔗	Ceryn	Right.
06:27 ^🔗	Ceryn	Haha. Having your data .accesslog'ed.
06:27 ^🔗	Ceryn	Forceful Data Archiving Attack.
06:30 ^🔗	Ceryn	Interesting ways to obscurely store bytes data.
06:31 ^🔗	Ceryn	If you actually want to post something for storage, however, are sources that don't explicitly attempt to provide long term storage even relevant?
06:32 ^🔗	Ceryn	(I do like the ideas. They're original. Just questioning practicality in actual use case scenarios.)
06:39 ^🔗	wp494	I'm gonna call the top-end "petabytes" category More Than A Motherfucking Shitload
06:39 ^🔗	wp494	based on https://www.youtube.com/watch?v=Y0Z0raWIHXk
06:40 ^🔗	Somebody2	wp494: I like that name.
06:41 ^🔗	Somebody2	Ceryn: I think they are, because everything is temporary; an additional copy is an additional copy as long as it stays around, however long that is.
06:42 ^🔗	wp494	I don't think there would be much kerfuffle if we used the rest of penn's scale to fill the middle in either
06:43 ^🔗		Pixi has quit IRC (Quit: Pixi)
06:43 ^🔗	wp494	but yeah, that name definitely should be used on the top end
06:45 ^🔗	Somebody2	Please do add Penn's scale to the page.
07:02 ^🔗		SketchCow has quit IRC (Read error: Connection reset by peer)
07:02 ^🔗		SketchCow has joined #archiveteam-bs
07:02 ^🔗		swebb sets mode: +o SketchCow
07:18 ^🔗	Somebody2	wp494: thanks
07:18 ^🔗	Somebody2	OK, I've more or less dumped my brain out onto the page now. I may add more later, but may not.
07:27 ^🔗		Asparagir has quit IRC (Asparagir)
07:34 ^🔗		REiN^ has joined #archiveteam-bs
07:36 ^🔗		Valentin- has joined #archiveteam-bs
07:38 ^🔗		Valentine has quit IRC (Ping timeout: 506 seconds)
09:12 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
09:46 ^🔗	godane	i'm digitizing the pilot of The Tick tape
09:50 ^🔗	godane	!ao http://www.sacbee.com/news/state/california/fires/article182675911.html
09:50 ^🔗	godane	i put in archivebot channel
10:01 ^🔗		jschwart has joined #archiveteam-bs
10:32 ^🔗		godane has left
10:32 ^🔗		godane has joined #archiveteam-bs
11:02 ^🔗		pizzaiolo has joined #archiveteam-bs
11:04 ^🔗		odemg has quit IRC (Read error: Operation timed out)
11:32 ^🔗		drumstick has quit IRC (Read error: Operation timed out)
11:49 ^🔗		pizzaiolo has quit IRC (pizzaiolo)
12:29 ^🔗		odemg has joined #archiveteam-bs
12:48 ^🔗		pizzaiolo has joined #archiveteam-bs
12:50 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
12:50 ^🔗		Mateon1 has joined #archiveteam-bs
13:02 ^🔗	godane	SketchCow: i'm uploading 3 more tapes to FOS
13:03 ^🔗	godane	i also uploaded 2 Guys and A Girl on We channel for 2003-08-11 to 2003-08-13
13:24 ^🔗		MadArchiv has joined #archiveteam-bs
13:26 ^🔗	MadArchiv	Can someone please explain to me what this whole thing with the tapes is? I've seen you people talk about it for days now but I still don't really know what it is about. Are you guys trying to digitize TV stuff or something?
13:36 ^🔗	godane	i'm officially at 1.1 Million items as of today
13:39 ^🔗		MadArchiv has quit IRC (Ping timeout: 246 seconds)
13:45 ^🔗	godane	SketchCow : This guy has some magazines you will like: https://archive.org/details/@neil_parsons_48
14:19 ^🔗	godane	so i found an iomega zip drive install tape
14:19 ^🔗	godane	also i found another Felicity tape
14:21 ^🔗	godane	btw there are at least 3 more tapes with Felicity on them, going by their labels
14:23 ^🔗	godane	that's not including that tape i found and uploaded that had the last 2 episodes of Season 2 of Felicity
14:38 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
14:39 ^🔗		Mateon1 has joined #archiveteam-bs
14:45 ^🔗	godane	SketchCow: so i found the note with g4 Pulse tape
14:45 ^🔗	godane	thanks
14:45 ^🔗	godane	also found the 4 tapes of porn
14:54 ^🔗	Ceryn	Lol. How much data is on these tapes total?
15:01 ^🔗		TheLovina has joined #archiveteam-bs
15:24 ^🔗	godane	don't know yet
15:25 ^🔗	godane	this Felicity tape may have episode from S02E16 to S02E21
15:25 ^🔗	godane	i only say that cause i have S02E22 and S02E23 from the same channel and month i think
15:44 ^🔗	SketchCow	Don't forget the porn!
16:30 ^🔗	JAA	https://theintercept.com/2017/11/02/war-crimes-youtube-facebook-syria-rohingya/
16:31 ^🔗		Pixi has joined #archiveteam-bs
16:32 ^🔗		icedice has joined #archiveteam-bs
16:32 ^🔗	icedice	Hi
16:32 ^🔗		icedice has quit IRC (Remote host closed the connection)
16:53 ^🔗		Asparagir has joined #archiveteam-bs
17:11 ^🔗		icedice has joined #archiveteam-bs
17:29 ^🔗		dashcloud has joined #archiveteam-bs
17:32 ^🔗		icedice2 has joined #archiveteam-bs
17:34 ^🔗		icedice has quit IRC (Ping timeout: 245 seconds)
18:20 ^🔗		icedice2 has quit IRC (Quit: Leaving)
18:20 ^🔗		icedice has joined #archiveteam-bs
19:54 ^🔗	odemg	https://www.ebay.com/itm/IBM-17R7063-LTO7-INTERNAL-SAS-ULTRIUM-15000-TAPE-DRIVE-NEW-SEALED-/142566860443
20:12 ^🔗		pizzaiolo has quit IRC (Remote host closed the connection)
20:25 ^🔗	Asparagir	What does ArchiveTeam think about joining Open Collective? It's a way for open source projects to get community funding and donations, but without having to laboriously incorporate as 501(c)(3) and all that. https://opencollective.com/
20:26 ^🔗	Asparagir	If we had even $200/month, that could go to... (1) actually paying some of the amazing coders who build our open source software, get old bugs finally taken care of, get crucial new features finally developed and PAID for. (Hello, a good scraper for Instagram feeds! Or XML!)
20:27 ^🔗	Asparagir	Or (2) could pay for far more ArchiveBot servers, which are $20/month. Imagine a world without 29841394692347 !pending jobs...
20:27 ^🔗	Asparagir	Or (3) [your idea here]
20:28 ^🔗	Asparagir	A lot of us donate a lot of time and money, whether it's hours of coding or $$$ per month for servers, to keep this ship floating.
20:28 ^🔗	Asparagir	Open Collective could send funds from the web community to cover some of that.
20:29 ^🔗	Asparagir	Other open source groups using it (or something like this, not saying this is the be-all end-all solution) are raising serious $$$ for sustaining their projects. Why not ArchiveTeam?
20:30 ^🔗	Asparagir	SketchCow, would love your thoughts on this, too ^
20:31 ^🔗	Asparagir	Sort of "Patreon for open source online groups, with lots of transparency into where every $ goes"
20:32 ^🔗	Asparagir	Note: requirement for sign-up is a GitHub repository with at least 100 stars. Our ArchiveBot repo just hit 108.
20:34 ^🔗	Asparagir	yipdw, you too ^
20:56 ^🔗	zino	Might be useful if someone takes care of it. I'm not going close to anything dealing with money, too much work.
20:56 ^🔗	zino	Ironic sidenote: Yahoo is listed as a supporter, with no dollars contributed.
20:56 ^🔗	Frogging	lol
21:02 ^🔗	zino	In fairness, that is probably an indicator that they once contributed some amount of money and have now stopped doing that. Cloudflare is listed as "$500 contributed", but they are contributing $500 per month, not in total over time.
21:42 ^🔗	kisspunch	Asparagir: I like the idea of helping with that list of things but dislike that it makes archiveteam sound like a Thing instead of a bunch of people who do whatever they want
21:42 ^🔗	JensRex	chfoo: Since you're showing old Warrior2 some love, consider replacing /etc/apt/sources.list to contain (only) "deb http://archive.debian.org/debian squeeze main".
21:42 ^🔗	JensRex	Current default contents are invalid and broken.
21:43 ^🔗	kisspunch	I love the idea of having a community "wanted" list for things we'd like to be done (and possibly would give a bounty for)
21:44 ^🔗	kisspunch	Like bugs + co
21:44 ^🔗	kisspunch	Not a list of sites, that would be endless
21:46 ^🔗	chfoo	JensRex, i should but i don't want to touch anything unnecessary in the old warrior. i just want it able to boot up properly.
21:48 ^🔗	Asparagir	kisspunch: JAA and I wrote up a long list of our top to-do items in #archivebot like a month or two ago.
21:48 ^🔗	Asparagir	Any one of those top ten items getting built or fixed would seriously help us all.
21:49 ^🔗	Asparagir	Let me see if I can find the log...
21:49 ^🔗	JensRex	Asparagir: That stuff should be in the wiki.
21:51 ^🔗	JAA	Maybe, but then in five years someone will get confused by the list because it never got updated.
21:52 ^🔗	JAA	Asparagir: Found it, 2017-10-10 23:51:22 UTC
21:52 ^🔗	Asparagir	Here's what we were discussing...
21:52 ^🔗	Asparagir	My long-term goals for ArchiveTeam, in no particular order:
21:52 ^🔗	kisspunch	It should be somewhere persistent, I don't know it needs to be updated
21:52 ^🔗	Asparagir	1) Have the ability to scale up to lots of pipelines, easily
21:52 ^🔗	JensRex	TODO += Update TODO.
21:52 ^🔗	Asparagir	2) Find ways for more people to participate in suggesting sites to archive, even going out to Twitter for suggestions, not just us IRC folks
21:52 ^🔗	Asparagir	3) Proactively start reaching out to different communities asking them for suggestions of at-risk content, or particularly unique user-generated content, like message boards
21:52 ^🔗	Asparagir	4) Find someone to build us a proper Instagram scraper (for individual users' feeds, or hashtags or locations, or all of the above)
21:52 ^🔗	Asparagir	5) Fix the current youtube-dl issue, and figure out a way to do auto-update on youtube-dl on everyone's pipeline once a month
21:53 ^🔗	Asparagir	From JAA -- "- Fix the various wpull bugs, in particular the FTP crashes, jobs not terminating, crashes not being reported back as failed jobs to here, etc."
21:53 ^🔗	Asparagir	7) Find a way to implement an --urgent flag that takes precedence over stuff in queue
21:53 ^🔗	Asparagir	From JAA: "Experiment with headless browsers so we can let PhantomJS die already."
21:53 ^🔗	Asparagir	8) Find a way to cancel stuff in queue. Right now you can cancel jobs that are pending but it doesn't go into effect until that item gets to top of queue.
21:53 ^🔗	Asparagir	9) Find a way for us to get free server space, maybe Amazon AWS credits or Digital Ocean credits. But for that we'd probably need to be a real 501(c)(3) and that's a big deal.
21:54 ^🔗	Asparagir	(note: this OpenCollective idea neatly sidesteps the 501(c)(3) problems)
21:54 ^🔗	JAA	Oh right, I totally forgot about the Instagram scraper.
21:54 ^🔗	kisspunch	I want to see finished crawls as my top request :) I can't tell if things never got added or are already done
21:54 ^🔗	Asparagir	More from JAA:
21:54 ^🔗	Asparagir	Also, !pending not listing all jobs is quite annoying.
21:54 ^🔗	Asparagir	And more metadata in the JSON uploaded to IA
21:54 ^🔗	Asparagir	[end list]
21:54 ^🔗	Asparagir	I think that was our main I WANT THIS NOW list
21:54 ^🔗		schbirid has quit IRC (Quit: Leaving)
21:54 ^🔗	kisspunch	Reaching out to communities to find stuff in need of scraping sounds important
21:55 ^🔗	JAA	Regarding maintaining wpull, there was a little bit of discussion in #newsgrabber the other day.
21:55 ^🔗	Asparagir	And of course, this is in addition to a long long list of feature requests and bug reports in GitHub on several projects.
21:55 ^🔗	JensRex	Seriously though, the list should be somewhere where it doesn't just scroll by and is forgotten.
21:56 ^🔗	Asparagir	And we can't keep flogging a dead horse and hoping that people will be super-generous and magically swoop down, like the Open Source Archiver Fairy, to fix our problems.
21:56 ^🔗	Asparagir	I mean, the fact that ArchiveTam has gotten this far on pure volunteerism is astonishing and awesome.
21:56 ^🔗	kisspunch	I'm most likely to do the long list of things on other projects
21:56 ^🔗	JAA	How about an issue tracker? Oh, right.
21:56 ^🔗	Asparagir	ArchiveTeam, evem :-)
21:56 ^🔗	JAA	;-)
21:56 ^🔗	kisspunch	I tend to want to fix fundamental tools, ArchiveBot is lower impact
21:57 ^🔗	kisspunch	Maybe I could document wpull better
21:57 ^🔗	Asparagir	That's good too! wpull, the Warrior, documentation, all need help
21:57 ^🔗	Asparagir	everything
21:57 ^🔗	JAA	Indeed
21:57 ^🔗	kisspunch	Oh right--I'm supposed to make a windows IA.bak client
21:57 ^🔗	kisspunch	That's the thing I'm supposed to do for archiveteam
21:57 ^🔗	JAA	I'm not really convinced yet that it's a good thing that wpull is mostly compatible with/a drop-in replacement for wget.
21:58 ^🔗	Asparagir	But...think how much help we could get, and how much progress we could make, if we could pay someone here (not me!) say $500 for one week of all-you-can-eat bug fixes.
21:58 ^🔗	Asparagir	Or more.
21:58 ^🔗	kisspunch	"totally compatible" would be dubious, "mostly compatible" is aggravating, especially since the docs just write "totally compatible" and not details
21:58 ^🔗	JensRex	Regarding wpull and youtube-dl. I think the conclusion was that the precompiled wpull for Newsgrabber is terrible somehow, and breaks when using --youtube-dl. Can't use wpull from pip, because it's Python3 only, and Newsgrabber is Python2.
21:58 ^🔗	Asparagir	Or have the community pay the hosting bills for five new servers!
21:58 ^🔗	JAA	JensRex: Normal wpull is broken, too.
21:59 ^🔗	JensRex	JAA: Interesting.
21:59 ^🔗	JAA	Well, the most current version on FalconK's fork, at least.
21:59 ^🔗	JAA	The last version by Chris (2.0.1) is so crashy that it's not very usable.
21:59 ^🔗	kisspunch	ivan: ^ want to do a week of bugfixes for archiveteam
21:59 ^🔗	JAA	FalconK fixed some of those bugs and completely broke youtube-dl in the process.
21:59 ^🔗	JensRex	So the rabbithole of terribleness goes deeper.
22:00 ^🔗	JAA	But I believe it was already broken before that, judging from his commit message.
22:00 ^🔗	JAA	It does.
22:01 ^🔗	Asparagir	Yeah. And most people here are already burnt out from full time jobs; asking them to keep giving free labor and free code and free urgent fixes is not fair to them or sustainable to ArchiveTeam.
22:01 ^🔗	Asparagir	But luckily, there's this concept where people exchange money for services...
22:02 ^🔗	JensRex	I have all the time in the world, but I'm just some guy who knows enough Linux to be dangerous, and make unhelpful bug reports.
22:02 ^🔗	JAA	kisspunch: The docs do list the differences (though that list is not complete, I think). But it also means that there's a lot of luggage from wget's CLI. For example, some of the option names are just wrong because the option doesn't do what it seems it should: I'd expect --waitretry to specify the time that has to pass before an errored URL is reattempted. Nope, it does some linear backoff
22:02 ^🔗	JAA	and that's the maximum time it waits...
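A minimal sketch of the linear backoff JAA describes, where the configured value acts as a ceiling rather than a fixed delay. It assumes a 1-second step per consecutive failure, which is how wget documents `--waitretry`; the function name and parameters are hypothetical and this is not wpull's actual code.

```python
def linear_backoff_delay(failures: int, max_wait: float, step: float = 1.0) -> float:
    """Seconds to wait before retrying a URL that has failed `failures` times in a row.

    The delay grows by `step` seconds per failure (assumed: 1 second) and is
    capped at `max_wait`, the value that would be passed as --waitretry.
    """
    if failures < 1:
        return 0.0  # no failure yet, so no retry delay
    return min(failures * step, max_wait)


if __name__ == "__main__":
    # With a cap of 5, the delay climbs one step per failure and then stays at the cap.
    for n in range(1, 8):
        print(n, linear_backoff_delay(n, max_wait=5))
```

Under these assumptions, a value of 5 only means "never wait longer than 5 seconds", so the first retry still happens after about one second.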
22:02 ^🔗	kisspunch	JensRex: please improve documentation on everything then!
22:02 ^🔗	JensRex	kisspunch: What needs documenting?
22:02 ^🔗	kisspunch	wpull
22:02 ^🔗	JensRex	groan
22:02 ^🔗	kisspunch	I don't remember, but probably warrior
22:03 ^🔗	JensRex	I do have edit permissions on the Wiki. I'll keep it in mind.
22:03 ^🔗	kisspunch	Just /collecting/ what has been archived, to what degree, when, by who, and where there are copies, would be my #2 after IA.bak
22:03 ^🔗	JAA	Asparagir: I'm not coding nearly enough in my job, and I'm willing to work on wpull in general. The problem is that I'm often busy trying to find sites that are at risk or archiving those sites (or freeing space on my disks so I can archive them). There's just so much to do...
22:04 ^🔗	kisspunch	The wiki only has big projects and is usually missing some fraction of that (especially, for finished projects, that they finished and where the data is)
22:04 ^🔗	JAA	Plus the situation with wpull isn't really clear right now: whether chfoo might resume maintenance, whether the repo is passed to AT as a whole, or whether it needs to be forked.
22:04 ^🔗	kisspunch	I'm thinking of giving up and just being IA.bak instead of trying to make some distributed thing, there's some efficiency-of-batching there
22:05 ^🔗	kisspunch	Like asking people to mail me HDDs
22:05 ^🔗	Asparagir	Wait, do you guys not know about this site? https://archive.fart.website/archivebot/viewer/ You can look up any domain fed into ArchiveBot lately. This doesn't cover all the stuff archived through the Warrior or other projects, but it's a start.
22:05 ^🔗	JensRex	IA.bak... the ArchiveTeam white whale.
22:05 ^🔗	kisspunch	Asparagir: I don't really follow archivebot generally, thanks!
22:05 ^🔗	JAA	Asparagir: It's broken though.
22:05 ^🔗	kisspunch	I mostly follow the warrior projects
22:05 ^🔗	Asparagir	Broken? Aggggggh.
22:06 ^🔗	JAA	Asparagir: Yep, doesn't display all jobs.
22:06 ^🔗	JAA	See https://github.com/ArchiveTeam/ArchiveBot/issues/282
22:06 ^🔗	Asparagir	But I guess this proves the point: it's a 90% awesome tool! But funding a few hours of hardcore work on it would get us up to "usable".
22:06 ^🔗	JAA	Indeed
22:10 ^🔗	zino	Since we are discussing archivebot wishes: A way to shut down the pipeline, do service on the machine or update parts of the pipeline and then resume the jobs when the machine is up again is nr 1 on my list.
22:11 ^🔗	JAA	Yes
22:11 ^🔗	JAA	There was some discussion previously about splitting up jobs to begin with.
22:11 ^🔗	JAA	So that you don't have one huge multi-million URL job, but blocks of e.g. 10k URLs.
22:11 ^🔗	JAA	Less potential for crashes that way.
22:11 ^🔗	JAA	However, this would be a major redesign obviously.
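The idea of splitting one huge job into blocks of e.g. 10k URLs can be sketched as follows. This only illustrates the batching step; `split_into_blocks` and its parameters are hypothetical names, and nothing here reflects how ArchiveBot actually queues URLs.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_blocks(urls: Iterable[str], block_size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most `block_size` URLs.

    Works on any iterable, so a multi-million URL list never has to be
    held in memory all at once.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    iterator = iter(urls)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return  # input exhausted
        yield block
```

Each block could then be handed to a separate, smaller job, so a crash would only lose the block in progress rather than the whole crawl.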
22:11 ^🔗	zino	Yea, we talked a bit about that. Would help a lot.
22:12 ^🔗	Asparagir	Right -- we can and do segment jobs by (estimated) WARC size, so that WARC's get uploaded in chunks (500 MB, I think?). But we don't do it yet by job size, i.e. number of URL's.
22:12 ^🔗	Asparagir	That wouldn't be exact either, because of course some of those URL's might be video files or something, and might be bigger than you'd think.
22:13 ^🔗	zino	Asparagir: chunk is a few gigs. We can't segment jobs on chunks though, the job must complete all chunks on the same pipeline.
22:15 ^🔗	JAA	Yeah, there was some discussion about that as well. Parallelising jobs across multiple machines.
22:15 ^🔗	JAA	I'm not sure it would work in all cases though.
22:18 ^🔗	Asparagir	My two cents: work on fixing our considerable technical debt first, before moving on to building out new features, which will probably break in new and exciting ways. :-)
22:18 ^🔗	JAA	Yeah
22:19 ^🔗	zino	Maybe, but if the new features mitigate the failures we have, that brings more robustness.
22:19 ^🔗	zino	I'd rather have a way to kill and restart the pipeline on the same job than have a mythical wpull that doesn't hang.
22:21 ^🔗	zino	That would solve both the wpull problems and let me start 10 pipelines when needed without having to worry that I need to keep those machines up and unpatched for the next 3 months.
22:21 ^🔗	Asparagir	Fair point.
22:21 ^🔗	Asparagir	And I'd like to be able to reboot the dashboard to clear out jobs that we know for sure have died and gone to job heaven.
22:22 ^🔗	Asparagir	But which hang around cluttering the dashboard as zombies...
22:22 ^🔗	Asparagir	Minor issue, I know, but would also be helpful to day-to-day work.
22:22 ^🔗	JAA	I'd like it if the dashboard was documented better so people with access to the control node (like me) can do that sort of maintenance without fearing that it'll break everything.
22:22 ^🔗	Asparagir	Yes
22:22 ^🔗	Asparagir	Needs documentation badly
22:24 ^🔗	Asparagir	Buuuuut yeah, to circle back to the original question...how do people feel about the larger issue, of ArchiveTeam posting on OpenCollective (or somewhere else, like Patreon) to raise money from the Internet to PAY for some of this work? Instead of hoping that the Archive Coder Fairy will do it for free, forever?
22:24 ^🔗	Asparagir	I mean, I do like that this is totally decentralized and people can hack away at what they want and are interested in.
22:25 ^🔗	Asparagir	But.
22:25 ^🔗	Asparagir	I mean, look at this thread.
22:28 ^🔗	zino	The question is, do we have a Coder Fairy that is willing to work for money?
22:32 ^🔗	Asparagir	I think that's a question for people like yipdw, FalconK, astrid, JAA, and others who do some of the heavy lifting, code-wise. And the lurkers around here, of whom there are many (hiiiii, we see you, we won't bite)
22:32 ^🔗	Asparagir	And the people on this list: https://github.com/orgs/ArchiveTeam/people
22:33 ^🔗	Asparagir	JesseW and chfoo too. Lots of people. If even one or two say "yes, I will do annoying task XXX for $YYY" then we're good!
22:35 ^🔗	Asparagir	I want SketchCow to weigh in on this too, but according to Twitter he's doing "a little mold remediation work so I'll be away for a while" right now
22:40 ^🔗	zino	So regarding restart. Conceptually something like this would be needed:
22:40 ^🔗	zino	1. pipeline needs to save how to spawn currently running wpulls
22:40 ^🔗	zino	2. at pipeline startup, check the save file and just resume them
22:40 ^🔗	zino	3. restart crashed wpulls, up to a limit
22:40 ^🔗	zino	This would solve:
22:40 ^🔗	zino	1. Machine or pipeline maintenance, kill the pipeline instead of STOP:ing it.
22:40 ^🔗	zino	2. Crashing wpulls
22:40 ^🔗	zino	3. Locked wpulls, just kill the locked one
22:40 ^🔗	zino	The big question mark is do we need to rewire anything in the
22:40 ^🔗	zino	controller communication, or is that stateless? If the output from
22:40 ^🔗	zino	wpull is currently just piped to the controller without channel
22:40 ^🔗	zino	negotiation that will break.
22:40 ^🔗	zino	JAA, do you have any insight in how that works now?
22:41 ^🔗	JAA	Nothing special needs to be done to respawn wpull itself. You just rerun the same command in the same directory and it'll continue based on the database.
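zino's three steps, combined with JAA's point that rerunning the same command in the same directory is enough to resume, can be sketched roughly like this. It is only a sketch under assumptions: the JSON state file, the `argv`/`cwd` job layout, the restart limit and all function names are hypothetical, jobs run one after another for simplicity, and controller communication is left out entirely.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Tuple

MAX_RESTARTS = 3  # hypothetical limit on restarts of a crashed process


def save_jobs(state_file: Path, jobs: List[Dict]) -> None:
    """Step 1: record how to spawn each running job (command line and directory)."""
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(jobs))
    tmp.replace(state_file)  # swap in the new file so a crash never leaves a half-written one


def load_jobs(state_file: Path) -> List[Dict]:
    """Step 2: at startup, read back the jobs that were running before shutdown."""
    if not state_file.exists():
        return []
    return json.loads(state_file.read_text())


def run_with_restarts(job: Dict, max_restarts: int = MAX_RESTARTS) -> Tuple[int, bool]:
    """Step 3: rerun the same command in the same directory until it exits cleanly.

    Returns (number of attempts, whether the last attempt exited with status 0).
    """
    attempts = 0
    succeeded = False
    while attempts <= max_restarts and not succeeded:
        attempts += 1
        result = subprocess.run(job["argv"], cwd=job["cwd"])
        succeeded = result.returncode == 0
    return attempts, succeeded
```

A pipeline built this way would call `save_jobs` whenever a job starts or finishes, and on startup call `load_jobs` and pass each entry to `run_with_restarts`. Killing a locked process would then simply count as one more crash.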
22:42 ^🔗		pizzaiolo has joined #archiveteam-bs
22:42 ^🔗	zino	Yea, I mean how the communication with the controller works. I don't know how hard it would be to restart that info-stream, or if it's possible at all right now.
22:43 ^🔗	JAA	But I don't know much about the communication. I think it's all one-way communication, i.e. the control node runs a Redis database and a process on the pipeline (the wpull plugin?) connects to that database.
22:44 ^🔗	JAA	If you add an ignore or change the job's settings, that's written to the database by the control node, and it takes effect on the pipeline as soon as it notices that something has changed (the settings watcher).
22:44 ^🔗	JAA	The logs go back by the wpull plugin (?) writing to the same database. The control node then forwards that to the people looking at the dashboard.
22:45 ^🔗		drumstick has joined #archiveteam-bs
22:45 ^🔗	zino	If that's how it works this should not be THAT hard to fix. I'll have a look another night.
22:46 ^🔗	JAA	This should basically mean that it should be possible to resume jobs without too much effort. I'm not sure if anything even needs to be changed on the control node apart from the IRC bot handling a few additional commands.
22:46 ^🔗	JAA	We'd have to look into it in more detail though regarding how it should work exactly.
22:47 ^🔗	JAA	For example, it would be nice if we could !pause a job also on a pipeline that doesn't need maintenance/reboot, e.g. in case of a ban, and if the pipeline then started another job perhaps.
22:47 ^🔗	zino	Yea. We really should have a test setup of the whole system to stage tests on.
22:47 ^🔗	JAA	But I'm not sure what !resume in that case should do exactly, etc.
22:48 ^🔗	JAA	Yeah, I've been wondering about that, how to test any code written for ArchiveBot.
22:49 ^🔗	zino	I'm scared to test anything as is. One typo and you hose all jobs the pipeline manages to ingest before you stop it.
22:50 ^🔗	JAA	We'd probably need a full parallel test setup.
22:50 ^🔗	zino	Yep
22:51 ^🔗	JAA	jrwr was able to set it up for the Tor version, so it shouldn't be too difficult.
22:51 ^🔗	JAA	Maybe he can tell us what to look out for.
22:51 ^🔗	JAA	There are some instructions in the repo, but no idea how complete those are.
22:52 ^🔗	zino	And I'd be happy to set that up, so maybe we could pump jrwr for some info.
22:52 ^🔗	zino	Anyways, time to sleep. To be continued.
22:52 ^🔗	JAA	Good night!
22:54 ^🔗	JAA	Asparagir: To get back to that question above: For me, it's more a matter of time than of money. And as far as I know, it's not possible to transfer time (yet?). :-/
22:56 ^🔗	Asparagir	TO-DO #1765765: invent Hermione's Time-Turner
22:57 ^🔗	JAA	:-)
23:06 ^🔗		drumstick has quit IRC (Quit: Leaving)
23:11 ^🔗	kisspunch	How does archiveteam feel about making a single gateway clone of requester-pays content? I'm happy to pay to get this stuff (already grabbed ArXiV, I guess imdb switched to this recently), but I don't have somewhere to distribute it with enough storage space
23:11 ^🔗	kisspunch	Torrents might be a good option
23:13 ^🔗	JAA	(IMDB claims they'll add a free gateway. Not sure if that exists by now or still not.)
23:14 ^🔗	JAA	What's wrong with putting it on IA?
23:14 ^🔗	kisspunch	Putting on IA is also a good option, main issue is ones that update often
23:15 ^🔗	JAA	Hmm. You can also update IA items as much as you like though.
23:15 ^🔗	kisspunch	Both ArXiV and IMDB have an additive-update process, IMDB also has a mutating "summary"
23:15 ^🔗	kisspunch	Apparently I need to learn how to put shit on IA
23:16 ^🔗	kisspunch	Maybe I should mirror githubarchive (timeline) and ghtorrent to IA
23:17 ^🔗	kisspunch	The timeline in particular is pretty small
23:19 ^🔗	Frogging	Asparagir: fwiw, Internet Archive has paid employees that work on this kind of stuff. maybe not so open though, unfortunately.
23:20 ^🔗	Frogging	also I'd love a time turner. too many times do I find out about something only after it's gone forever :(
23:26 ^🔗	JensRex	FUCK! 94% done uploading an 8GB warc at 200 kbs, and my ISP takes a shit.
23:27 ^🔗	JensRex	Still down. Quassel on mobile.
23:27 ^🔗		jschwart has quit IRC (Konversation terminated!)
23:35 ^🔗		pizzaiolo has quit IRC (pizzaiolo)
23:36 ^🔗		BlueMaxim has joined #archiveteam-bs
