#archiveteam-bs 2017-11-05,Sun


Time Nickname Message
00:33 🔗 dashcloud has quit IRC (Remote host closed the connection)
01:19 🔗 schbirid has quit IRC (Read error: Operation timed out)
01:31 🔗 schbirid has joined #archiveteam-bs
02:08 🔗 Darkstar has quit IRC (Ping timeout: 506 seconds)
02:25 🔗 Somebody2 Regarding the whole, "IA doesn't distribute everything!" conversation -- Please DO upload as much as possible to as many different places as possible!
02:25 🔗 Somebody2 No matter *what* else, more copies are a GOOD THING.
02:25 🔗 Darkstar has joined #archiveteam-bs
02:30 🔗 Rai-chan ^
02:30 🔗 Ceryn Somebody2: Do you have some resource on what the options are? Who will take data, pros and cons, where you might find similar data already? I now know of IA, obviously.
02:35 🔗 drumstick has quit IRC (Ping timeout: 248 seconds)
02:43 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
02:44 🔗 drumstick has joined #archiveteam-bs
02:56 🔗 JensRex chfoo: Why Docker for the Warrior?
02:56 🔗 JensRex Docker is stateless. Doesn't seem like a good fit.
02:57 🔗 schbirid has quit IRC (Read error: Operation timed out)
03:04 🔗 MadArchiv has joined #archiveteam-bs
03:09 🔗 schbirid has joined #archiveteam-bs
03:18 🔗 MadArchiv has quit IRC (Read error: Operation timed out)
03:42 🔗 jspiros has quit IRC (Ping timeout: 492 seconds)
03:56 🔗 Asparagir has quit IRC (Asparagir)
04:11 🔗 qw3rty5 has joined #archiveteam-bs
04:18 🔗 qw3rty4 has quit IRC (Read error: Operation timed out)
04:25 🔗 jspiros has joined #archiveteam-bs
04:30 🔗 Somebody2 Ceryn: I don't have much, but there is some on the archiveteam wiki (and we should add more)
04:33 🔗 Somebody2 A lot depends on how much data you are looking for a home for.
04:36 🔗 Somebody2 There are quite a few places where you can stash a few kilobytes (i.e. a couple pages of text), while there are many fewer places to drop a petabyte in need of a home.
04:43 🔗 Ceryn I'll scour the wiki I guess.
04:44 🔗 Ceryn If you know of many such sites and they're not in the wiki yet I'd like to know of them.
04:48 🔗 Somebody2 Ceryn: yes, that's a good idea.
04:48 🔗 Somebody2 Eh, you've got me interested; I'll go write up a wiki page.
04:49 🔗 Somebody2 Or better, add more stuff to http://archiveteam.org/index.php?title=Valhalla
04:49 🔗 Somebody2 which is (I think) the right place for this
04:50 🔗 Somebody2 well, kinda
04:51 🔗 Somebody2 Sigh, I'll make a new page http://archiveteam.org/index.php?title=Places_to_store_data
04:52 🔗 Ceryn Hah! Hook, line and sinker!
04:53 🔗 Somebody2 :-P
04:53 🔗 Ceryn Thanks. :)
05:04 🔗 Asparagir has joined #archiveteam-bs
05:09 🔗 drumstick has quit IRC (Read error: Operation timed out)
05:10 🔗 drumstick has joined #archiveteam-bs
05:20 🔗 Somebody2 Ceryn: OK, wrote up the intro; comments welcomed; I'll add more specific suggestions soon.
05:20 🔗 Ceryn Somebody2: Cool! Reading.
05:25 🔗 Ceryn Somebody2: Looks good (y). I think you should leave the general information up top and put the IA stuff down under places to store data. Some captions would probably be useful too.
05:26 🔗 Ceryn Somebody2: When I hear "video" I think "movie", and that's on the order of ~20GB. Maybe call that a photo album instead?
05:29 🔗 Ceryn Somebody2: Once people know where to store data, it would also be relevant to know what the commonly preferred data formats are for given types of data. Assuming there's anything resembling a consensus.
05:29 🔗 Ceryn Somebody2: And obviously the "Places to store data" will need that list of suggested places to store data before it really becomes relevant. :)
05:45 🔗 Somebody2 Regarding formats -- ha. HA. Hahahahah AhahaaahaHAHAHAa. No, no there *really* isn't anything resembling a consensus.
05:45 🔗 Somebody2 And we have an entire wiki devoted to that -- the fileformats wiki.
05:46 🔗 Somebody2 I'll change video to video clip -- I was thinking of short youtube clips.
05:46 🔗 Somebody2 I'm not sure what you mean by "captions"?
05:47 🔗 Ceryn Haha. There ought to be one.
05:47 🔗 Ceryn Surely one general approach is better than the rest.
05:47 🔗 Ceryn By captions I mean section titles.
05:48 🔗 Somebody2 Ah, yeah I'm planning on making sections for each of the size groups
05:51 🔗 Ceryn (y)
06:24 🔗 Somebody2 Ceryn: add some more
06:25 🔗 Somebody2 er, I have added some more
06:25 🔗 Ceryn Right.
06:27 🔗 Ceryn Haha. Having your data .accesslog'ed.
06:27 🔗 Ceryn Forceful Data Archiving Attack.
06:27 🔗 Ceryn Interesting ways to obscurely store bytes of data.
06:31 🔗 Ceryn If you actually want to post something for storage, however, are sources that don't explicitly attempt to provide long term storage even relevant?
06:32 🔗 Ceryn (I do like the ideas. They're original. Just questioning practicality in actual use case scenarios.)
06:39 🔗 wp494 I'm gonna call the top-end "petabytes" category More Than A Motherfucking Shitload
06:39 🔗 wp494 based on https://www.youtube.com/watch?v=Y0Z0raWIHXk
06:40 🔗 Somebody2 wp494: I like that name.
06:41 🔗 Somebody2 Ceryn: I think they are, because everything is temporary; an additional copy is an additional copy as long as it stays around, however long that is.
06:42 🔗 wp494 I don't think there would be much kerfuffle if we used the rest of penn's scale to fill the middle in either
06:43 🔗 Pixi has quit IRC (Quit: Pixi)
06:43 🔗 wp494 but yeah, that name definitely should be used on the top end
06:45 🔗 Somebody2 Please do add Penn's scale to the page.
07:02 🔗 SketchCow has quit IRC (Read error: Connection reset by peer)
07:02 🔗 SketchCow has joined #archiveteam-bs
07:02 🔗 swebb sets mode: +o SketchCow
07:18 🔗 Somebody2 wp494: thanks
07:18 🔗 Somebody2 OK, I've more or less dumped my brain out onto the page now. I may add more later, but may not.
07:27 🔗 Asparagir has quit IRC (Asparagir)
07:34 🔗 REiN^ has joined #archiveteam-bs
07:36 🔗 Valentin- has joined #archiveteam-bs
07:38 🔗 Valentine has quit IRC (Ping timeout: 506 seconds)
09:12 🔗 BlueMaxim has quit IRC (Quit: Leaving)
09:46 🔗 godane i'm digitizing the pilot of The Tick tape
09:50 🔗 godane !ao http://www.sacbee.com/news/state/california/fires/article182675911.html
09:50 🔗 godane i put in archivebot channel
10:01 🔗 jschwart has joined #archiveteam-bs
10:32 🔗 godane has left
10:32 🔗 godane has joined #archiveteam-bs
11:02 🔗 pizzaiolo has joined #archiveteam-bs
11:04 🔗 odemg has quit IRC (Read error: Operation timed out)
11:32 🔗 drumstick has quit IRC (Read error: Operation timed out)
11:49 🔗 pizzaiolo has quit IRC (pizzaiolo)
12:29 🔗 odemg has joined #archiveteam-bs
12:48 🔗 pizzaiolo has joined #archiveteam-bs
12:50 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
12:50 🔗 Mateon1 has joined #archiveteam-bs
13:02 🔗 godane SketchCow: i'm uploading 3 more tapes to FOS
13:03 🔗 godane i also uploaded 2 Guys and A Girl on the We channel for 2003-08-11 to 2003-08-13
13:24 🔗 MadArchiv has joined #archiveteam-bs
13:26 🔗 MadArchiv Can someone please explain to me what this whole thing with the tapes is about? I've seen you people talk about it for days now but I still don't really know what it's about. Are you guys trying to digitize tv stuff or something?
13:36 🔗 godane i'm officially at 1.1 Million items as of today
13:39 🔗 MadArchiv has quit IRC (Ping timeout: 246 seconds)
13:45 🔗 godane SketchCow : This guy has some magazines you will like: https://archive.org/details/@neil_parsons_48
14:19 🔗 godane so i found an iomega zipdrive install tape
14:19 🔗 godane also i found another Felicity tape
14:21 🔗 godane btw there are at least 3 more tapes with Felicity on them, by their labels
14:23 🔗 godane that's not including the tape i found and uploaded that had the last 2 episodes of Season 2 of Felicity
14:38 🔗 Mateon1 has quit IRC (Remote host closed the connection)
14:39 🔗 Mateon1 has joined #archiveteam-bs
14:45 🔗 godane SketchCow: so i found the note with g4 Pulse tape
14:45 🔗 godane thanks
14:45 🔗 godane also found the 4 tapes of porn
14:54 🔗 Ceryn Lol. How much data is on these tapes total?
15:01 🔗 TheLovina has joined #archiveteam-bs
15:24 🔗 godane don't know yet
15:25 🔗 godane this Felicity tape may have episodes from S02E16 to S02E21
15:25 🔗 godane i only say that cause i have S02E22 and S02E23 from the same channel and month i think
15:44 🔗 SketchCow Don't forget the porn!
16:30 🔗 JAA https://theintercept.com/2017/11/02/war-crimes-youtube-facebook-syria-rohingya/
16:31 🔗 Pixi has joined #archiveteam-bs
16:32 🔗 icedice has joined #archiveteam-bs
16:32 🔗 icedice Hi
16:32 🔗 icedice has quit IRC (Remote host closed the connection)
16:53 🔗 Asparagir has joined #archiveteam-bs
17:11 🔗 icedice has joined #archiveteam-bs
17:29 🔗 dashcloud has joined #archiveteam-bs
17:32 🔗 icedice2 has joined #archiveteam-bs
17:34 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
18:20 🔗 icedice2 has quit IRC (Quit: Leaving)
18:20 🔗 icedice has joined #archiveteam-bs
19:54 🔗 odemg https://www.ebay.com/itm/IBM-17R7063-LTO7-INTERNAL-SAS-ULTRIUM-15000-TAPE-DRIVE-NEW-SEALED-/142566860443
20:12 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
20:25 🔗 Asparagir What does ArchiveTeam think about joining Open Collective? It's a way for open source projects to get community funding and donations, but without having to laboriously incorporate as 501(c)(3) and all that. https://opencollective.com/
20:26 🔗 Asparagir If we had even $200/month, that could go to... (1) actually paying some of the amazing coders who build our open source software, get old bugs finally taken care of, get crucial new features finally developed and PAID for. (Hello, a good scraper for Instagram feeds! Or XML!)
20:27 🔗 Asparagir Or (2) could pay for far more ArchiveBot servers, which are $20/month. Imagine a world without 29841394692347 !pending jobs...
20:27 🔗 Asparagir Or (3) [your idea here]
20:28 🔗 Asparagir A lot of us donate a lot of time and money, whether it's hours of coding or $$$ per month for servers, to keep this ship floating.
20:28 🔗 Asparagir Open Collective could send funds from the web community to cover some of that.
20:29 🔗 Asparagir Other open source groups using it (or something like this, not saying this is the be-all end-all solution) are raising serious $$$ for sustaining their projects. Why not ArchiveTeam?
20:30 🔗 Asparagir SketchCow, would love your thoughts on this, too ^
20:31 🔗 Asparagir Sort of "Patreon for open source online groups, with lots of transparency into where every $ goes"
20:32 🔗 Asparagir Note: requirement for sign-up is a GitHub repository with at least 100 stars. Our ArchiveBot repo just hit 108.
20:34 🔗 Asparagir yipdw, you too ^
20:56 🔗 zino Might be useful if someone takes care of it. I'm not going close to anything dealing with money, too much work.
20:56 🔗 zino Ironic sidenote: Yahoo is listed as a supporter, with no dollars contributed.
20:56 🔗 Frogging lol
21:02 🔗 zino In fairness, that is probably an indicator that they once contributed some amount of money and have now stopped doing that. Cloudflare is listed as "$500 contributed", but they are contributing $500 per month, not in total over time.
21:42 🔗 kisspunch Asparagir: I like the idea of helping with that list of things but dislike that it makes archiveteam sound like a Thing instead of a bunch of people who do whatever they want
21:42 🔗 JensRex chfoo: Since you're showing old Warrior2 some love, consider replacing /etc/apt/sources.list to contain (only) "deb http://archive.debian.org/debian squeeze main".
21:42 🔗 JensRex Current default contents are invalid and broken.
21:43 🔗 kisspunch I love the idea of having a community "wanted" list for things we'd like to be done (and possibly would give a bounty for)
21:44 🔗 kisspunch Like bugs + co
21:44 🔗 kisspunch Not a list of sites, that would be endless
21:46 🔗 chfoo JensRex, i should but i don't want to touch anything unnecessary in the old warrior. i just want it able to boot up properly.
21:48 🔗 Asparagir kisspunch: JAA and I wrote up a long list of our top to-do items in #archivebot like a month or two ago.
21:48 🔗 Asparagir Any one of those top ten items getting built or fixed would seriously help us all.
21:49 🔗 Asparagir Let me see if I can find the log...
21:49 🔗 JensRex Asparagir: That stuff should be in the wiki.
21:51 🔗 JAA Maybe, but then in five years someone will get confused by the list because it never got updated.
21:52 🔗 JAA Asparagir: Found it, 2017-10-10 23:51:22 UTC
21:52 🔗 Asparagir Here's what we were discussing...
21:52 🔗 Asparagir My long-term goals for ArchiveTeam, in no particular order:
21:52 🔗 kisspunch It should be somewhere persistent, I don't know it needs to be updated
21:52 🔗 Asparagir 1) Have the ability to scale up to lots of pipelines, easily
21:52 🔗 JensRex TODO += Update TODO.
21:52 🔗 Asparagir 2) Find ways for more people to participate in suggesting sites to archive, even going out to Twitter for suggestions, not just us IRC folks
21:52 🔗 Asparagir 3) Proactively start reaching out to different communities asking them for suggestions of at-risk content, or particularly unique user-generated content, like message boards
21:52 🔗 Asparagir 4) Find someone to build us a proper Instagram scraper (for individual users' feeds, or hashtags or locations, or all of the above)
21:52 🔗 Asparagir 5) Fix the current youtube-dl issue, and figure out a way to do auto-update on youtube-dl on everyone's pipeline once a month
21:53 🔗 Asparagir From JAA -- "- Fix the various wpull bugs, in particular the FTP crashes, jobs not terminating, crashes not being reported back as failed jobs to here, etc."
21:53 🔗 Asparagir 7) Find a way to implement an --urgent flag that takes precedence over stuff in queue
21:53 🔗 Asparagir From JAA: "Experiment with headless browsers so we can let PhantomJS die already."
21:53 🔗 Asparagir 8) Find a way to cancel stuff in queue. Right now you can cancel jobs that are pending but it doesn't go into effect until that item gets to top of queue.
21:53 🔗 Asparagir 9) Find a way for us to get free server space, maybe Amazon AWS credits or Digital Ocean credits. But for that we'd probably need to be a real 501(c)(3) and that's a big deal.
21:54 🔗 Asparagir (note: this OpenCollective idea neatly sidesteps the 501(c)(3) problems)
21:54 🔗 JAA Oh right, I totally forgot about the Instagram scraper.
21:54 🔗 kisspunch I want to see finished crawls as my top request :) I can't tell if things never got added or are already done
21:54 🔗 Asparagir More from JAA:
21:54 🔗 Asparagir Also, !pending not listing all jobs is quite annoying.
21:54 🔗 Asparagir And more metadata in the JSON uploaded to IA
21:54 🔗 Asparagir [end list]
21:54 🔗 Asparagir I think that was our main I WANT THIS NOW list
21:54 🔗 schbirid has quit IRC (Quit: Leaving)
21:54 🔗 kisspunch Reaching out to communities to find stuff in need of scraping sounds important
21:55 🔗 JAA Regarding maintaining wpull, there was a little bit of discussion in #newsgrabber the other day.
21:55 🔗 Asparagir And of course, this is in addition to a long long list of feature requests and bug reports in GitHub on several projects.
21:55 🔗 JensRex Seriously though, the list should be somewhere where it doesn't just scroll by and is forgotten.
21:56 🔗 Asparagir And we can't keep flogging a dead horse and hoping that people will be super-generous and magically swoop down, like the Open Source Archiver Fairy, to fix our problems.
21:56 🔗 Asparagir I mean, the fact that ArchiveTam has gotten this far on pure volunteerism is astonishing and awesome.
21:56 🔗 kisspunch I'm most likely to do the long list of things on other projects
21:56 🔗 JAA How about an issue tracker? Oh, right.
21:56 🔗 Asparagir ArchiveTeam, evem :-)
21:56 🔗 JAA ;-)
21:56 🔗 kisspunch I tend to want to fix fundamental tools, ArchiveBot is lower impact
21:57 🔗 kisspunch Maybe I could document wpull better
21:57 🔗 Asparagir That's good too! wpull, the Warrior, documentation, all need help
21:57 🔗 Asparagir everything
21:57 🔗 JAA Indeed
21:57 🔗 kisspunch Oh right--I'm supposed to make a windows IA.bak client
21:57 🔗 kisspunch That's the thing I'm supposed to do for archiveteam
21:57 🔗 JAA I'm not really convinced yet that it's a good thing that wpull is mostly compatible with/a drop-in replacement for wget.
21:58 🔗 Asparagir But...think how much help we could get, and how much progress we could make, if we could pay someone here (not me!) say $500 for one week of all-you-can-eat bug fixes.
21:58 🔗 Asparagir Or more.
21:58 🔗 kisspunch "totally compatible" would be dubious, "mostly compatible" is aggravating, especially since the docs just write "totally compatible" and not details
21:58 🔗 JensRex Regarding wpull and youtube-dl. I think the conclusion was that the precompiled wpull for Newsgrabber is terrible somehow, and breaks when using --youtube-dl. Can't use wpull from pip, because it's Python3 only, and Newsgrabber is Python2.
21:58 🔗 Asparagir Or have the community pay the hosting bills for five new servers!
21:58 🔗 JAA JensRex: Normal wpull is broken, too.
21:59 🔗 JensRex JAA: Interesting.
21:59 🔗 JAA Well, the most current version on FalconK's fork, at least.
21:59 🔗 JAA The last version by Chris (2.0.1) is so crashy that it's not very usable.
21:59 🔗 kisspunch ivan: ^ want to do a week of bugfixes for archiveteam
21:59 🔗 JAA FalconK fixed some of those bugs and completely broke youtube-dl in the process.
21:59 🔗 JensRex So the rabbithole of terribleness goes deeper.
22:00 🔗 JAA But I believe it was already broken before that, judging from his commit message.
22:00 🔗 JAA It does.
22:01 🔗 Asparagir Yeah. And most people here are already burnt out from full time jobs; asking them to keep giving free labor and free code and free urgent fixes is not fair to them or sustainable to ArchiveTeam.
22:01 🔗 Asparagir But luckily, there's this concept where people exchange money for services...
22:02 🔗 JensRex I have all the time in the world, but I'm just some guy who knows enough Linux to be dangerous, and make unhelpful bug reports.
22:02 🔗 JAA kisspunch: The docs do list the differences (though that list is not complete, I think). But it also means that there's a lot of luggage from wget's CLI. For example, some of the option names are just wrong because the option doesn't do what it seems it should. For example, I'd expect --waitretry to specify the time that has to pass before an errored URL is reattempted. Nope, it does some linear backoff and that's the maximum time it waits...
22:02 🔗 kisspunch JensRex: please improve documentation on everything then!
22:02 🔗 JensRex kisspunch: What needs documenting?
22:02 🔗 kisspunch wpull
22:02 🔗 JensRex *groan*
22:02 🔗 kisspunch I don't remember, but probably warrior
22:03 🔗 JensRex I do have edit permissions on the Wiki. I'll keep it in mind.
22:03 🔗 kisspunch Just /collecting/ what has been archived, to what degree, when, by who, and where there are copies, would be my #2 after IA.bak
22:03 🔗 JAA Asparagir: I'm not coding nearly enough in my job, and I'm willing to work on wpull in general. The problem is that I'm often busy trying to find sites that are at risk or archiving those sites (or freeing space on my disks so I can archive them). There's just so much to do...
22:04 🔗 kisspunch The wiki only has big projects and is usually missing some fraction of that (especially, for finished projects that it finished and where it is)
22:04 🔗 JAA Plus the situation with wpull isn't really clear right now: whether chfoo might resume maintenance, whether the repo is passed to AT as a whole, or whether it needs to be forked.
22:04 🔗 kisspunch I'm thinking of giving up and just being IA.bak instead of trying to make some distributed thing, there's some efficiency-of-batching there
22:05 🔗 kisspunch Like asking people to mail me HDDs
22:05 🔗 Asparagir Wait, do you guys not know about this site? https://archive.fart.website/archivebot/viewer/ You can look up any domain fed into ArchiveBot lately. This doesn't cover all the stuff archived through the Warrior or other projects, but it's a start.
22:05 🔗 JensRex IA.bak... the ArchiveTeam white whale.
22:05 🔗 kisspunch Asparagir: I don't really follow archivebot generally, thanks!
22:05 🔗 JAA Asparagir: It's broken though.
22:05 🔗 kisspunch I mostly follow the warrior projects
22:05 🔗 Asparagir Broken? Aggggggh.
22:06 🔗 JAA Asparagir: Yep, doesn't display all jobs.
22:06 🔗 JAA See https://github.com/ArchiveTeam/ArchiveBot/issues/282
22:06 🔗 Asparagir But I guess this proves the point: it's a 90% awesome tool! But funding a few hours of hardcore work on it would get us up to "usable".
22:06 🔗 JAA Indeed
22:10 🔗 zino Since we are discussing archivebot wishes: A way to shut down the pipeline, do service on the machine or update parts of the pipeline, and then resume the jobs when the machine is up again is nr 1 on my list.
22:11 🔗 JAA Yes
22:11 🔗 JAA There was some discussion previously about splitting up jobs to begin with.
22:11 🔗 JAA So that you don't have one huge multi-million URL job, but blocks of e.g. 10k URLs.
22:11 🔗 JAA Less potential for crashes that way.
22:11 🔗 JAA However, this would be a major redesign obviously.
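The job-splitting idea floated here could be sketched like this (10k is the example figure from the discussion, not a real setting; the helper name is hypothetical):

```python
def chunk_urls(urls, size=10_000):
    """Split one big job's URL list into fixed-size blocks, so a crash
    loses at most one block instead of a multi-million URL job."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```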
22:11 🔗 zino Yea, we talked a bit about that. Would help a lot.
22:12 🔗 Asparagir Right -- we can and do segment jobs by (estimated) WARC size, so that WARCs get uploaded in chunks (500 MB, I think?). But we don't do it yet by job size, i.e. number of URLs.
22:12 🔗 Asparagir That wouldn't be exact either, because of course some of those URLs might be video files or something, and might be bigger than you'd think.
22:13 🔗 zino Asparagir: a chunk is a few gigs. We can't segment jobs on chunks though; the job must complete all chunks on the same pipeline.
22:15 🔗 JAA Yeah, there was some discussion about that as well. Parallelising jobs across multiple machines.
22:15 🔗 JAA I'm not sure it would work in all cases though.
22:18 🔗 Asparagir My two cents: work on fixing our considerable technical debt first, before moving on to building out new features, which will probably break in new and exciting ways. :-)
22:18 🔗 JAA Yeah
22:19 🔗 zino Maybe, but if the new features mitigate the failures we have, that brings more robustness.
22:19 🔗 zino I'd rather have a way to kill and restart the pipeline on the same job than have a mythical wpull that doesn't hang.
22:21 🔗 zino That would solve both the wpull problems and let me start 10 pipelines when needed without having to worry that I need to keep those machines up and unpatched for the next 3 months.
22:21 🔗 Asparagir Fair point.
22:21 🔗 Asparagir And I'd like to be able to reboot the dashboard to clear out jobs that we know for sure have died and gone to job heaven.
22:22 🔗 Asparagir But which hang around cluttering the dashboard as zombies...
22:22 🔗 Asparagir Minor issue, I know, but would also be helpful to day-to-day work.
22:22 🔗 JAA I'd like it if the dashboard was documented better so people with access to the control node (like me) can do that sort of maintenance without fearing that it'll break everything.
22:22 🔗 Asparagir Yes
22:22 🔗 Asparagir Needs documentation badly
22:24 🔗 Asparagir Buuuuut yeah, to circle back to the original question...how do people feel about the larger issue, of ArchiveTeam posting on OpenCollective (or somewhere else, like Patreon) to raise money from the Internet to PAY for some of this work? Instead of hoping that the Archive Coder Fairy will do it for free, forever?
22:24 🔗 Asparagir I mean, I do like that this is totally decentralized and people can hack away at what they want and are interested in.
22:25 🔗 Asparagir But.
22:25 🔗 Asparagir I mean, look at this thread.
22:28 🔗 zino The question is, do we have a Coder Fairy that is willing to work for money?
22:32 🔗 Asparagir I think that's a question for people like yipdw, FalconK, astrid, JAA, and others who do some of the heavy lifting, code-wise. And the lurkers around here, of whom there are many (hiiiii, we see you, we won't bite)
22:32 🔗 Asparagir And the people on this list: https://github.com/orgs/ArchiveTeam/people
22:33 🔗 Asparagir JesseW and chfoo too. Lots of people. If even one or two say "yes, I will do annoying task XXX for $YYY" then we're good!
22:35 🔗 Asparagir I want SketchCow to weigh in on this too, but according to Twitter he's doing "a little mold remediation work so I'll be away for a while" right now
22:40 🔗 zino So regarding restart. Conceptually something like this would be needed:
22:40 🔗 zino 1. pipeline needs to save how to spawn currently running wpulls
22:40 🔗 zino 2. at pipeline startup, check the save file and just resume them
22:40 🔗 zino 3. restart crashed wpulls, up to a limit
22:40 🔗 zino This would solve:
22:40 🔗 zino 1. Machine or pipeline maintenance, kill the pipeline instead of STOP:ing it.
22:40 🔗 zino 2. Crashing wpulls
22:40 🔗 zino 3. Locked wpulls, just kill the locked one
22:40 🔗 zino The big question mark is: do we need to rewire anything in the controller communication, or is that stateless? If the output from wpull is currently just piped to the controller without channel negotiation, that will break.
22:40 🔗 zino JAA, do you have any insight in how that works now?
22:41 🔗 JAA Nothing special needs to be done to respawn wpull itself. You just rerun the same command in the same directory and it'll continue based on the database.
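zino's save/respawn idea, combined with the fact JAA notes (rerunning the same wpull command in the same directory resumes from its database), could look roughly like this. The save-file name and record structure are hypothetical, not part of the actual pipeline code:

```python
import json
import os
import subprocess

STATE_FILE = "running-jobs.json"  # hypothetical save file

def save_jobs(jobs, path=STATE_FILE):
    """Record how to respawn each running wpull: its argv and working dir."""
    with open(path, "w") as f:
        json.dump(jobs, f)

def resume_jobs(path=STATE_FILE):
    """At pipeline startup, respawn every recorded wpull process.

    Rerunning the same command in the same directory lets wpull continue
    from its own database, so no extra state transfer is needed.
    """
    if not os.path.exists(path):
        return []
    with open(path) as f:
        jobs = json.load(f)
    return [subprocess.Popen(j["argv"], cwd=j["cwd"]) for j in jobs]
```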
22:42 🔗 pizzaiolo has joined #archiveteam-bs
22:42 🔗 zino Yea, I mean how the communication with the controller works. I don't know how hard it would be to restart that info-stream, or if it's possible at all right now.
22:43 🔗 JAA But I don't know much about the communication. I *think* it's all one-way communication, i.e. the control node runs a Redis database and a process on the pipeline (the wpull plugin?) connects to that database.
22:44 🔗 JAA If you add an ignore or change the job's settings, that's written to the database by the control node, and it takes effect on the pipeline as soon as it notices that something has changed (the settings watcher).
22:44 🔗 JAA The logs go back by the wpull plugin (?) writing to the same database. The control node then forwards that to the people looking at the dashboard.
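The one-way Redis flow JAA describes might look roughly like this. Key names and helper functions are illustrative assumptions, not ArchiveBot's actual schema, and a tiny in-memory stand-in replaces a real Redis connection so the sketch is self-contained:

```python
import json
from collections import defaultdict

class FakeRedis:
    """In-memory stand-in so this sketch runs without a Redis server."""
    def __init__(self):
        self.lists = defaultdict(list)
        self.kv = {}
    def rpush(self, key, value):
        self.lists[key].append(value)
    def lrange(self, key, start, stop):
        end = None if stop == -1 else stop + 1
        return self.lists[key][start:end]
    def set(self, key, value):
        self.kv[key] = value
    def get(self, key):
        return self.kv.get(key)

def push_log(r, job_id, line):
    """Pipeline -> control node: append a log line for the dashboard."""
    r.rpush("job:%s:log" % job_id, line)

def read_settings(r, job_id):
    """Control node -> pipeline: pick up settings changes (ignores, etc.)."""
    raw = r.get("job:%s:settings" % job_id)
    return json.loads(raw) if raw else {}
```

Because both directions are plain database reads/writes with no channel negotiation, a respawned wpull could in principle reconnect and carry on.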
22:45 🔗 drumstick has joined #archiveteam-bs
22:45 🔗 zino If that's how it works this should not be THAT hard to fix. I'll have a look another night.
22:46 🔗 JAA This should basically mean that it should be possible to resume jobs without too much effort. I'm not sure if anything even needs to be changed on the control node apart from the IRC bot handling a few additional commands.
22:46 🔗 JAA We'd have to look into it in more detail though regarding how it should work exactly.
22:47 🔗 JAA For example, it would be nice if we could !pause a job also on a pipeline that doesn't need maintenance/reboot, e.g. in case of a ban, and if the pipeline then started another job perhaps.
22:47 🔗 zino Yea. We really should have a test setup of the whole system to stage tests on.
22:47 🔗 JAA But I'm not sure what !resume in that case should do exactly, etc.
22:48 🔗 JAA Yeah, I've been wondering about that, how to test any code written for ArchiveBot.
22:49 🔗 zino I'm scared to test anything as is. One typo and you hose all jobs the pipeline manages to ingest before you stop it.
22:50 🔗 JAA We'd probably need a full parallel test setup.
22:50 🔗 zino Yep
22:51 🔗 JAA jrwr was able to set it up for the Tor version, so it shouldn't be too difficult.
22:51 🔗 JAA Maybe he can tell us what to look out for.
22:51 🔗 JAA There are some instructions in the repo, but no idea how complete those are.
22:52 🔗 zino And I'd be happy to set that up, so maybe we could pump jrwr for some info.
22:52 🔗 zino Anyways, time to sleep. To be continued.
22:52 🔗 JAA Good night!
22:54 🔗 JAA Asparagir: To get back to that question above: For me, it's more a matter of time than of money. And as far as I know, it's not possible to transfer time (yet?). :-/
22:56 🔗 Asparagir TO-DO #1765765: invent Hermione's Time-Turner
22:57 🔗 JAA :-)
23:06 🔗 drumstick has quit IRC (Quit: Leaving)
23:11 🔗 kisspunch How does archiveteam feel about making a single gateway clone of requester-pays content. I'm happy to pay to get this stuff (already grabbed ArXiV, I guess imdb switched to this recently), but I don't have somewhere to distribute it with enough storage space
23:11 🔗 kisspunch Torrents might be a good option
23:13 🔗 JAA (IMDB claims they'll add a free gateway. Not sure if that exists by now or still not.)
23:14 🔗 JAA What's wrong with putting it on IA?
23:14 🔗 kisspunch Putting on IA is also a good option, main issue is ones that update often
23:15 🔗 JAA Hmm. You can also update IA items as much as you like though.
23:15 🔗 kisspunch Both ArXiV and IMDB have an additive-update process, IMDB also has a mutating "summary"
23:15 🔗 kisspunch Apparently I need to learn how to put shit on IA
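For the "how to put stuff on IA" question: uploads (and re-uploads to update an existing item) can go through the `ia` CLI from the internetarchive package. This tiny helper just builds such an invocation; the identifier and filenames are made-up examples:

```python
def ia_upload_cmd(identifier, files, metadata=None):
    """Build an `ia upload` command line for an item.

    Re-running against an existing identifier adds/replaces files,
    which covers the additive-update case discussed above.
    """
    cmd = ["ia", "upload", identifier, *files]
    for key, value in (metadata or {}).items():
        cmd.append("--metadata=%s:%s" % (key, value))
    return cmd
```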
23:16 🔗 kisspunch Maybe I should mirror githubarchive (timeline) and ghtorrent to IA
23:17 🔗 kisspunch The timeline in particular is pretty small
23:19 🔗 Frogging Asparagir: fwiw, Internet Archive has paid employees that work on this kind of stuff. maybe not so open though, unfortunately.
23:20 🔗 Frogging also I'd love a time turner. too many times do I find out about something only after it's gone forever :(
23:26 🔗 JensRex FUCK! 94% done uploading an 8GB warc at 200 kb/s, and my ISP takes a shit.
23:27 🔗 JensRex Still down. Quassel on mobile.
23:27 🔗 jschwart has quit IRC (Konversation terminated!)
23:35 🔗 pizzaiolo has quit IRC (pizzaiolo)
23:36 🔗 BlueMaxim has joined #archiveteam-bs
