Time |
Nickname |
Message |
00:33
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
01:19
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
01:31
🔗
|
|
schbirid has joined #archiveteam-bs |
02:08
🔗
|
|
Darkstar has quit IRC (Ping timeout: 506 seconds) |
02:25
🔗
|
Somebody2 |
Regarding the whole, "IA doesn't distribute everything!" conversation -- Please DO upload as much as possible to as many different places as possible! |
02:25
🔗
|
Somebody2 |
No matter *what* else, more copies are a GOOD THING. |
02:25
🔗
|
|
Darkstar has joined #archiveteam-bs |
02:30
🔗
|
Rai-chan |
^ |
02:30
🔗
|
Ceryn |
Somebody2: Do you have some resource on what the options are? Who will take data, pros and cons, where you might find similar data already? I now know of IA, obviously. |
02:35
🔗
|
|
drumstick has quit IRC (Ping timeout: 248 seconds) |
02:43
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
02:44
🔗
|
|
drumstick has joined #archiveteam-bs |
02:56
🔗
|
JensRex |
chfoo: Why Docker for the Warrior? |
02:56
🔗
|
JensRex |
Docker is stateless. Doesn't seem like a good fit. |
02:57
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
03:04
🔗
|
|
MadArchiv has joined #archiveteam-bs |
03:09
🔗
|
|
schbirid has joined #archiveteam-bs |
03:18
🔗
|
|
MadArchiv has quit IRC (Read error: Operation timed out) |
03:42
🔗
|
|
jspiros has quit IRC (Ping timeout: 492 seconds) |
03:56
🔗
|
|
Asparagir has quit IRC (Asparagir) |
04:11
🔗
|
|
qw3rty5 has joined #archiveteam-bs |
04:18
🔗
|
|
qw3rty4 has quit IRC (Read error: Operation timed out) |
04:25
🔗
|
|
jspiros has joined #archiveteam-bs |
04:30
🔗
|
Somebody2 |
Ceryn: I don't have much, but there is some on the archiveteam wiki (and we should add more) |
04:33
🔗
|
Somebody2 |
A lot depends on how much data you are looking for a home for. |
04:36
🔗
|
Somebody2 |
There are quite a few places where you can stash a few kilobytes (i.e. a couple pages of text) while there are many fewer places to drop a petabyte in need of a home. |
04:43
🔗
|
Ceryn |
I'll scour the wiki I guess. |
04:44
🔗
|
Ceryn |
If you know of many such sites and they're not in the wiki yet I'd like to know of them. |
04:48
🔗
|
Somebody2 |
Ceryn: yes, that's a good idea. |
04:48
🔗
|
Somebody2 |
Eh, you've got me interested; I'll go write up a wiki page. |
04:49
🔗
|
Somebody2 |
Or better, add more stuff to http://archiveteam.org/index.php?title=Valhalla |
04:49
🔗
|
Somebody2 |
which is (I think) the right place for this |
04:50
🔗
|
Somebody2 |
well, kinda |
04:51
🔗
|
Somebody2 |
Sigh, I'll make a new page http://archiveteam.org/index.php?title=Places_to_store_data |
04:52
🔗
|
Ceryn |
Hah! Hook, line and sinker! |
04:53
🔗
|
Somebody2 |
:-P |
04:53
🔗
|
Ceryn |
Thanks. :) |
05:04
🔗
|
|
Asparagir has joined #archiveteam-bs |
05:09
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
05:10
🔗
|
|
drumstick has joined #archiveteam-bs |
05:20
🔗
|
Somebody2 |
Ceryn: OK, wrote up the intro; comments welcomed; I'll add more specific suggestions soon. |
05:20
🔗
|
Ceryn |
Somebody2: Cool! Reading. |
05:25
🔗
|
Ceryn |
Somebody2: Looks good (y). I think you should leave the general information up top and put the IA stuff down under places to store data. Some captions would probably be useful too. |
05:26
🔗
|
Ceryn |
Somebody2: When I hear "video" I think "movie", and that's on the order of ~20GB. Maybe call that a photo album instead? |
05:29
🔗
|
Ceryn |
Somebody2: Once people know where to store data, it would also be relevant to know what the commonly preferred data formats are for given types of data. Assuming there's anything resembling a consensus. |
05:29
🔗
|
Ceryn |
Somebody2: And obviously the "Places to store data" will need that list of suggested places to store data before it really becomes relevant. :) |
05:45
🔗
|
Somebody2 |
Regarding formats -- ha. HA. Hahahahah AhahaaahaHAHAHAa. No, no there *really* isn't anything resembling a consensus. |
05:45
🔗
|
Somebody2 |
And we have an entire wiki devoted to that -- the fileformats wiki. |
05:46
🔗
|
Somebody2 |
I'll change video to video clip -- I was thinking of short youtube clips. |
05:46
🔗
|
Somebody2 |
I'm not sure what you mean by "captions"? |
05:47
🔗
|
Ceryn |
Haha. There ought to be one. |
05:47
🔗
|
Ceryn |
Surely one general approach is better than the rest. |
05:47
🔗
|
Ceryn |
By captions I mean section titles. |
05:48
🔗
|
Somebody2 |
Ah, yeah I'm planning on making sections for each of the size groups |
05:51
🔗
|
Ceryn |
(y) |
06:24
🔗
|
Somebody2 |
Ceryn: add some more |
06:25
🔗
|
Somebody2 |
er, I have added some more |
06:25
🔗
|
Ceryn |
Right. |
06:27
🔗
|
Ceryn |
Haha. Having your data .accesslog'ed. |
06:27
🔗
|
Ceryn |
Forceful Data Archiving Attack. |
06:30
🔗
|
Ceryn |
Interesting ways to obscurely store bytes data. |
06:31
🔗
|
Ceryn |
If you actually want to post something for storage, however, are sources that don't explicitly attempt to provide long term storage even relevant? |
06:32
🔗
|
Ceryn |
(I do like the ideas. They're original. Just questioning practicality in actual use case scenarios.) |
06:39
🔗
|
wp494 |
I'm gonna call the top-end "petabytes" category More Than A Motherfucking Shitload |
06:39
🔗
|
wp494 |
based on https://www.youtube.com/watch?v=Y0Z0raWIHXk |
06:40
🔗
|
Somebody2 |
wp494: I like that name. |
06:41
🔗
|
Somebody2 |
Ceryn: I think they are, because everything is temporary; an additional copy is an additional copy as long as it stays around, however long that is. |
06:42
🔗
|
wp494 |
I don't think there would be much kerfuffle if we used the rest of penn's scale to fill the middle in either |
06:43
🔗
|
|
Pixi has quit IRC (Quit: Pixi) |
06:43
🔗
|
wp494 |
but yeah, that name definitely should be used on the top end |
06:45
🔗
|
Somebody2 |
Please do add Penn's scale to the page. |
07:02
🔗
|
|
SketchCow has quit IRC (Read error: Connection reset by peer) |
07:02
🔗
|
|
SketchCow has joined #archiveteam-bs |
07:02
🔗
|
|
swebb sets mode: +o SketchCow |
07:18
🔗
|
Somebody2 |
wp494: thanks |
07:18
🔗
|
Somebody2 |
OK, I've more or less dumped my brain out onto the page now. I may add more later, but may not. |
07:27
🔗
|
|
Asparagir has quit IRC (Asparagir) |
07:34
🔗
|
|
REiN^ has joined #archiveteam-bs |
07:36
🔗
|
|
Valentin- has joined #archiveteam-bs |
07:38
🔗
|
|
Valentine has quit IRC (Ping timeout: 506 seconds) |
09:12
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
09:46
🔗
|
godane |
i'm digitizing the pilot of The Tick tape |
09:50
🔗
|
godane |
!ao http://www.sacbee.com/news/state/california/fires/article182675911.html |
09:50
🔗
|
godane |
i put in archivebot channel |
10:01
🔗
|
|
jschwart has joined #archiveteam-bs |
10:32
🔗
|
|
godane has left |
10:32
🔗
|
|
godane has joined #archiveteam-bs |
11:02
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:04
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
11:32
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
11:49
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |
12:29
🔗
|
|
odemg has joined #archiveteam-bs |
12:48
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:50
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
12:50
🔗
|
|
Mateon1 has joined #archiveteam-bs |
13:02
🔗
|
godane |
SketchCow: i'm uploading 3 more tapes to FOS |
13:03
🔗
|
godane |
i also upload 2 Guys and A Girl on We channel for 2003-08-11 to 2003-08-13 |
13:24
🔗
|
|
MadArchiv has joined #archiveteam-bs |
13:26
🔗
|
MadArchiv |
Can someone please explain to me what this whole thing with the tapes is? I've seen you people talk about it for days now but I still don't really know what it is about. Are you guys trying to digitize TV stuff or something? |
13:36
🔗
|
godane |
i'm officially at 1.1 Million items as of today |
13:39
🔗
|
|
MadArchiv has quit IRC (Ping timeout: 246 seconds) |
13:45
🔗
|
godane |
SketchCow : This guy has some magazines you will like: https://archive.org/details/@neil_parsons_48 |
14:19
🔗
|
godane |
so found a iomega zipdrive install tape |
14:19
🔗
|
godane |
also i found another Felicity tape |
14:21
🔗
|
godane |
btw there are at least 3 more tapes with Felicity on them by their label |
14:23
🔗
|
godane |
thats not including that tape i found and uploaded that had the last 2 episodes of Season 2 of Felicity |
14:38
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
14:39
🔗
|
|
Mateon1 has joined #archiveteam-bs |
14:45
🔗
|
godane |
SketchCow: so i found the note with g4 Pulse tape |
14:45
🔗
|
godane |
thanks |
14:45
🔗
|
godane |
also found the 4 tapes of porn |
14:54
🔗
|
Ceryn |
Lol. How much data is on these tapes total? |
15:01
🔗
|
|
TheLovina has joined #archiveteam-bs |
15:24
🔗
|
godane |
don't know yet |
15:25
🔗
|
godane |
this Felicity tape may have episode from S02E16 to S02E21 |
15:25
🔗
|
godane |
i only say that cause i have S02E22 and S02E23 from the same channel and month i think |
15:44
🔗
|
SketchCow |
Don't forget the porn! |
16:30
🔗
|
JAA |
https://theintercept.com/2017/11/02/war-crimes-youtube-facebook-syria-rohingya/ |
16:31
🔗
|
|
Pixi has joined #archiveteam-bs |
16:32
🔗
|
|
icedice has joined #archiveteam-bs |
16:32
🔗
|
icedice |
Hi |
16:32
🔗
|
|
icedice has quit IRC (Remote host closed the connection) |
16:53
🔗
|
|
Asparagir has joined #archiveteam-bs |
17:11
🔗
|
|
icedice has joined #archiveteam-bs |
17:29
🔗
|
|
dashcloud has joined #archiveteam-bs |
17:32
🔗
|
|
icedice2 has joined #archiveteam-bs |
17:34
🔗
|
|
icedice has quit IRC (Ping timeout: 245 seconds) |
18:20
🔗
|
|
icedice2 has quit IRC (Quit: Leaving) |
18:20
🔗
|
|
icedice has joined #archiveteam-bs |
19:54
🔗
|
odemg |
https://www.ebay.com/itm/IBM-17R7063-LTO7-INTERNAL-SAS-ULTRIUM-15000-TAPE-DRIVE-NEW-SEALED-/142566860443 |
20:12
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
20:25
🔗
|
Asparagir |
What does ArchiveTeam think about joining Open Collective? It's a way for open source projects to get community funding and donations, but without having to laboriously incorporate as 501(c)(3) and all that. https://opencollective.com/ |
20:26
🔗
|
Asparagir |
If we had even $200/month, that could go to... (1) actually paying some of the amazing coders who build our open source software, get old bugs finally taken care of, get crucial new features finally developed and PAID for. (Hello, a good scraper for Instagram feeds! Or XML!) |
20:27
🔗
|
Asparagir |
Or (2) could pay for far more ArchiveBot servers, which are $20/month. Imagine a world without 29841394692347 !pending jobs... |
20:27
🔗
|
Asparagir |
Or (3) [your idea here] |
20:28
🔗
|
Asparagir |
A lot of us donate a lot of time and money, whether it's hours of coding or $$$ per month for servers, to keep this ship floating. |
20:28
🔗
|
Asparagir |
Open Collective could send funds from the web community to cover some of that. |
20:29
🔗
|
Asparagir |
Other open source groups using it (or something like this, not saying this is the be-all end-all solution) are raising serious $$$ for sustaining their projects. Why not ArchiveTeam? |
20:30
🔗
|
Asparagir |
SketchCow, would love your thoughts on this, too ^ |
20:31
🔗
|
Asparagir |
Sort of "Patreon for open source online groups, with lots of transparency into where every $ goes" |
20:32
🔗
|
Asparagir |
Note: requirement for sign-up is a GitHub repository with at least 100 stars. Our ArchiveBot repo just hit 108. |
20:34
🔗
|
Asparagir |
yipdw, you too ^ |
20:56
🔗
|
zino |
Might be useful if someone takes care of it. I'm not going close to anything dealing with money, too much work. |
20:56
🔗
|
zino |
Ironic sidenote: Yahoo is listed as a supporter, with no dollars contributed. |
20:56
🔗
|
Frogging |
lol |
21:02
🔗
|
zino |
In fairness, that is probably an indicator that they once contributed some amount of money and have now stopped doing that. Cloudflare is listed as "$500 contributed", but they are contributing $500 per month, not in total over time. |
21:42
🔗
|
kisspunch |
Asparagir: I like the idea of helping with that list of things but dislike that it makes archiveteam sound like a Thing instead of a bunch of people who do whatever they want |
21:42
🔗
|
JensRex |
chfoo: Since you're showing old Warrior2 some love, consider replacing /etc/apt/sources.list to contain (only) "deb http://archive.debian.org/debian squeeze main". |
21:42
🔗
|
JensRex |
Current default contents are invalid and broken. |
21:43
🔗
|
kisspunch |
I love the idea of having a community "wanted" list for things we'd like to be done (and possibly would give a bounty for) |
21:44
🔗
|
kisspunch |
Like bugs + co |
21:44
🔗
|
kisspunch |
Not a list of sites, that would be endless |
21:46
🔗
|
chfoo |
JensRex, i should but i don't want to touch anything unnecessary in the old warrior. i just want it able to boot up properly. |
21:48
🔗
|
Asparagir |
kisspunch: JAA and I wrote up a long list of our top to-do items in #archivebot like a month or two ago. |
21:48
🔗
|
Asparagir |
Any one of those top ten items getting built or fixed would seriously help us all. |
21:49
🔗
|
Asparagir |
Let me see if I can find the log... |
21:49
🔗
|
JensRex |
Asparagir: That stuff should be in the wiki. |
21:51
🔗
|
JAA |
Maybe, but then in five years someone will get confused by the list because it never got updated. |
21:52
🔗
|
JAA |
Asparagir: Found it, 2017-10-10 23:51:22 UTC |
21:52
🔗
|
Asparagir |
Here's what we were discussing... |
21:52
🔗
|
Asparagir |
My long-term goals for ArchiveTeam, in no particular order: |
21:52
🔗
|
kisspunch |
It should be somewhere persistent, I don't know it needs to be updated |
21:52
🔗
|
Asparagir |
1) Have the ability to scale up to lots of pipelines, easily |
21:52
🔗
|
JensRex |
TODO += Update TODO. |
21:52
🔗
|
Asparagir |
2) Find ways for more people to participate in suggesting sites to archive, even going out to Twitter for suggestions, not just us IRC folks |
21:52
🔗
|
Asparagir |
3) Proactively start reaching out to different communities asking them for suggestions of at-risk content, or particularly unique user-generated content, like message boards |
21:52
🔗
|
Asparagir |
4) Find someone to build us a proper Instagram scraper (for individual users' feeds, or hashtags or locations, or all of the above) |
21:52
🔗
|
Asparagir |
5) Fix the current youtube-dl issue, and figure out a way to do auto-update on youtube-dl on everyone's pipeline once a month |
21:53
🔗
|
Asparagir |
From JAA -- "- Fix the various wpull bugs, in particular the FTP crashes, jobs not terminating, crashes not being reported back as failed jobs to here, etc." |
21:53
🔗
|
Asparagir |
7) Find a way to implement an --urgent flag that takes precedence over stuff in queue |
21:53
🔗
|
Asparagir |
From JAA: "Experiment with headless browsers so we can let PhantomJS die already." |
21:53
🔗
|
Asparagir |
8) Find a way to cancel stuff in queue. Right now you can cancel jobs that are pending but it doesn't go into effect until that item gets to top of queue. |
21:53
🔗
|
Asparagir |
9) Find a way for us to get free server space, maybe Amazon AWS credits or Digital Ocean credits. But for that we'd probably need to be a real 501(c)(3) and that's a big deal. |
21:54
🔗
|
Asparagir |
(note: this OpenCollective idea neatly sidesteps the 501(c)(3) problems) |
21:54
🔗
|
JAA |
Oh right, I totally forgot about the Instagram scraper. |
21:54
🔗
|
kisspunch |
I want to see finished crawls as my top request :) I can't tell if things never got added or are already done |
21:54
🔗
|
Asparagir |
More from JAA: |
21:54
🔗
|
Asparagir |
Also, !pending not listing all jobs is quite annoying. |
21:54
🔗
|
Asparagir |
And more metadata in the JSON uploaded to IA |
21:54
🔗
|
Asparagir |
[end list] |
21:54
🔗
|
Asparagir |
I think that was our main I WANT THIS NOW list |
21:54
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:54
🔗
|
kisspunch |
Reaching out to communities to find stuff in need of scraping sounds important |
21:55
🔗
|
JAA |
Regarding maintaining wpull, there was a little bit of discussion in #newsgrabber the other day. |
21:55
🔗
|
Asparagir |
And of course, this is in addition to a long long list of feature requests and bug reports in GitHub on several projects. |
21:55
🔗
|
JensRex |
Seriously though, the list should be somewhere where it doesn't just scroll by and is forgotten. |
21:56
🔗
|
Asparagir |
And we can't keep flogging a dead horse and hoping that people will be super-generous and magically swoop down, like the Open Source Archiver Fairy, to fix our problems. |
21:56
🔗
|
Asparagir |
I mean, the fact that ArchiveTam has gotten this far on pure volunteerism is astonishing and awesome. |
21:56
🔗
|
kisspunch |
I'm most likely to do the long list of things on other projects |
21:56
🔗
|
JAA |
How about an issue tracker? Oh, right. |
21:56
🔗
|
Asparagir |
ArchiveTeam, evem :-) |
21:56
🔗
|
JAA |
;-) |
21:56
🔗
|
kisspunch |
I tend to want to fix fundamental tools, ArchiveBot is lower impact |
21:57
🔗
|
kisspunch |
Maybe I could document wpull better |
21:57
🔗
|
Asparagir |
That's good too! wpull, the Warrior, documentation, all need help |
21:57
🔗
|
Asparagir |
everything |
21:57
🔗
|
JAA |
Indeed |
21:57
🔗
|
kisspunch |
Oh right--I'm supposed to make a windows IA.bak client |
21:57
🔗
|
kisspunch |
That's the thing I'm supposed to do for archiveteam |
21:57
🔗
|
JAA |
I'm not really convinced yet that it's a good thing that wpull is mostly compatible with/a drop-in replacement for wget. |
21:58
🔗
|
Asparagir |
But...think how much help we could get, and how much progress we could make, if we could pay someone here (not me!) say $500 for one week of all-you-can-eat bug fixes. |
21:58
🔗
|
Asparagir |
Or more. |
21:58
🔗
|
kisspunch |
"totally compatible" would be dubious, "mostly compatible" is aggravating, especially since the docs just write "totally compatible" and not details |
21:58
🔗
|
JensRex |
Regarding wpull and youtube-dl. I think the conclusion was that the precompiled wpull for Newsgrabber is terrible somehow, and breaks when using --youtube-dl. Can't use wpull from pip, because it's Python3 only, and Newsgrabber is Python2. |
21:58
🔗
|
Asparagir |
Or have the community pay the hosting bills for five new servers! |
21:58
🔗
|
JAA |
JensRex: Normal wpull is broken, too. |
21:59
🔗
|
JensRex |
JAA: Interesting. |
21:59
🔗
|
JAA |
Well, the most current version on FalconK's fork, at least. |
21:59
🔗
|
JAA |
The last version by Chris (2.0.1) is so crashy that it's not very usable. |
21:59
🔗
|
kisspunch |
ivan: ^ want to do a week of bugfixes for archiveteam |
21:59
🔗
|
JAA |
FalconK fixed some of those bugs and completely broke youtube-dl in the process. |
21:59
🔗
|
JensRex |
So the rabbithole of terribleness goes deeper. |
22:00
🔗
|
JAA |
But I believe it was already broken before that, judging from his commit message. |
22:00
🔗
|
JAA |
It does. |
22:01
🔗
|
Asparagir |
Yeah. And most people here are already burnt out from full time jobs; asking them to keep giving free labor and free code and free urgent fixes is not fair to them or sustainable to ArchiveTeam. |
22:01
🔗
|
Asparagir |
But luckily, there's this concept where people exchange money for services... |
22:02
🔗
|
JensRex |
I have all the time in the world, but I'm just some guy who knows enough Linux to be dangerous, and make unhelpful bug reports. |
22:02
🔗
|
JAA |
kisspunch: The docs do list the differences (though that list is not complete, I think). But it also means that there's a lot of luggage from wget's CLI. For example, some of the option names are just wrong because the option doesn't do what it seems it should: I'd expect --waitretry to specify the time that has to pass before an errored URL is reattempted. Nope, it does some linear backoff and that's the maximum time it waits... |
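The behaviour JAA describes can be sketched in a few lines of Python. This follows the wget manual's description of `--waitretry` (wait 1 second after the first failure on a file, 2 seconds after the second, and so on up to the given value). The function name `waitretry_delay` is hypothetical, and this is not wpull's actual code, which may differ in detail.

```python
def waitretry_delay(failures: int, waitretry: int) -> int:
    """Seconds to wait before the next retry of one URL.

    failures:  how often this URL has failed so far (1 = first failure).
    waitretry: the value given to --waitretry. It acts as a cap on the
               delay, not as a fixed wait between attempts.
    """
    if failures < 1:
        return 0
    # Linear backoff: 1s, 2s, 3s, ... capped at the --waitretry value.
    return min(failures, waitretry)


# Delays for seven consecutive failures with --waitretry=5.
delays = [waitretry_delay(n, 5) for n in range(1, 8)]
print(delays)
```

So with `--waitretry=5` the first retries happen after 1, 2, 3 and 4 seconds, and only later ones wait the full 5 seconds, which is why the option name is easy to misread.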
22:02
🔗
|
kisspunch |
JensRex: please improve documentation on everything then! |
22:02
🔗
|
JensRex |
kisspunch: What needs documenting? |
22:02
🔗
|
kisspunch |
wpull |
22:02
🔗
|
JensRex |
*groan* |
22:02
🔗
|
kisspunch |
I don't remember, but probably warrior |
22:03
🔗
|
JensRex |
I do have edit permissions on the Wiki. I'll keep it in mind. |
22:03
🔗
|
kisspunch |
Just /collecting/ what has been archived, to what degree, when, by who, and where there are copies, would be my #2 after IA.bak |
22:03
🔗
|
JAA |
Asparagir: I'm not coding nearly enough in my job, and I'm willing to work on wpull in general. The problem is that I'm often busy trying to find sites that are at risk or archiving those sites (or freeing space on my disks so I can archive them). There's just so much to do... |
22:04
🔗
|
kisspunch |
The wiki only has big projects and is usually missing some fraction of that (especially, for finished projects, that it finished and where the data is) |
22:04
🔗
|
JAA |
Plus the situation with wpull isn't really clear right now: whether chfoo might resume maintenance, whether the repo is passed to AT as a whole, or whether it needs to be forked. |
22:04
🔗
|
kisspunch |
I'm thinking of giving up and just being IA.bak instead of trying to make some distributed thing, there's some efficiency-of-batching there |
22:05
🔗
|
kisspunch |
Like asking people to mail me HDDs |
22:05
🔗
|
Asparagir |
Wait, do you guys not know about this site? https://archive.fart.website/archivebot/viewer/ You can look up any domain fed into ArchiveBot lately. This doesn't cover all the stuff archived through the Warrior or other projects, but it's a start. |
22:05
🔗
|
JensRex |
IA.bak... the ArchiveTeam white whale. |
22:05
🔗
|
kisspunch |
Asparagir: I don't really follow archivebot generally, thanks! |
22:05
🔗
|
JAA |
Asparagir: It's broken though. |
22:05
🔗
|
kisspunch |
I mostly follow the warrior projects |
22:05
🔗
|
Asparagir |
Broken? Aggggggh. |
22:06
🔗
|
JAA |
Asparagir: Yep, doesn't display all jobs. |
22:06
🔗
|
JAA |
See https://github.com/ArchiveTeam/ArchiveBot/issues/282 |
22:06
🔗
|
Asparagir |
But I guess this proves the point: it's a 90% awesome tool! But funding a few hours of hardcore work on it would get us up to "usable". |
22:06
🔗
|
JAA |
Indeed |
22:10
🔗
|
zino |
Since we are discussing archivebot wishes: A way to shut down the pipeline, do service on the machine or update parts of the pipeline and then resume the jobs when the machine is up again is nr 1 on my list. |
22:11
🔗
|
JAA |
Yes |
22:11
🔗
|
JAA |
There was some discussion previously about splitting up jobs to begin with. |
22:11
🔗
|
JAA |
So that you don't have one huge multi-million URL job, but blocks of e.g. 10k URLs. |
22:11
🔗
|
JAA |
Less potential for crashes that way. |
22:11
🔗
|
JAA |
However, this would be a major redesign obviously. |
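The block idea can be illustrated with a small Python sketch. It assumes the URL list is known up front, which is not true of a real recursive crawl that discovers URLs as it goes, so this only shows the splitting step. `split_into_blocks` and the example URLs are hypothetical and not part of ArchiveBot.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_blocks(urls: Iterable[str], block_size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most block_size URLs."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    iterator = iter(urls)
    while True:
        # Take the next block_size URLs without loading the whole list.
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


# Hypothetical job with 25,000 URLs.
urls = (f"http://example.com/page/{i}" for i in range(25_000))
blocks = list(split_into_blocks(urls))
print([len(block) for block in blocks])
```

Each block could then fail or be retried on its own, which is where the reduced crash impact would come from.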
22:11
🔗
|
zino |
Yea, we talked a bit about that. Would help a lot. |
22:12
🔗
|
Asparagir |
Right -- we can and do segment jobs by (estimated) WARC size, so that WARC's get uploaded in chunks (500 MB, I think?). But we don't do it yet by job size, i.e. number of URL's. |
22:12
🔗
|
Asparagir |
That wouldn't be exact either, because of course some of those URL's might be video files or something, and might be bigger than you'd think. |
22:13
🔗
|
zino |
Asparagir: chunk is a few gigs. We can't segment jobs on chunks though, the job must complete all chunks on the same pipeline. |
22:15
🔗
|
JAA |
Yeah, there was some discussion about that as well. Parallelising jobs across multiple machines. |
22:15
🔗
|
JAA |
I'm not sure it would work in all cases though. |
22:18
🔗
|
Asparagir |
My two cents: work on fixing our considerable technical debt first, before moving on to building out new features, which will probably break in new and exciting ways. :-) |
22:18
🔗
|
JAA |
Yeah |
22:19
🔗
|
zino |
Maybe, but if the new features mitigate the failures we have, that brings more robustness. |
22:19
🔗
|
zino |
I'd rather have a way to kill and restart the pipeline on the same job than have a mythical wpull that doesn't hang. |
22:21
🔗
|
zino |
That would solve both the wpull problems and let me start 10 pipelines when needed without having to worry that I need to keep those machines up and unpatched for the next 3 months. |
22:21
🔗
|
Asparagir |
Fair point. |
22:21
🔗
|
Asparagir |
And I'd like to be able to reboot the dashboard to clear out jobs that we know for sure have died and gone to job heaven. |
22:22
🔗
|
Asparagir |
But which hang around cluttering the dashboard as zombies... |
22:22
🔗
|
Asparagir |
Minor issue, I know, but would also be helpful to day-to-day work. |
22:22
🔗
|
JAA |
I'd like it if the dashboard was documented better so people with access to the control node (like me) can do that sort of maintenance without fearing that it'll break everything. |
22:22
🔗
|
Asparagir |
Yes |
22:22
🔗
|
Asparagir |
Needs documentation badly |
22:24
🔗
|
Asparagir |
Buuuuut yeah, to circle back to the original question...how do people feel about the larger issue, of ArchiveTeam posting on OpenCollective (or somewhere else, like Patreon) to raise money from the Internet to PAY for some of this work? Instead of hoping that the Archive Coder Fairy will do it for free, forever? |
22:24
🔗
|
Asparagir |
I mean, I do like that this is totally decentralized and people can hack away at what they want and are interested in. |
22:25
🔗
|
Asparagir |
But. |
22:25
🔗
|
Asparagir |
I mean, look at this thread. |
22:28
🔗
|
zino |
The question is, do we have a Coder Fairy that is willing to work for money? |
22:32
🔗
|
Asparagir |
I think that's a question for people like yipdw, FalconK, astrid, JAA, and others who do some of the heavy lifting, code-wise. And the lurkers around here, of whom there are many (hiiiii, we see you, we won't bite) |
22:32
🔗
|
Asparagir |
And the people on this list: https://github.com/orgs/ArchiveTeam/people |
22:33
🔗
|
Asparagir |
JesseW and chfoo too. Lots of people. If even one or two say "yes, I will do annoying task XXX for $YYY" then we're good! |
22:35
🔗
|
Asparagir |
I want SketchCow to weigh in on this too, but according to Twitter he's doing "a little mold remediation work so I'll be away for a while" right now |
22:40
🔗
|
zino |
So regarding restart. Conceptually something like this would be needed: |
22:40
🔗
|
zino |
1. pipeline needs to save how to spawn currently running wpulls |
22:40
🔗
|
zino |
2. at pipeline startup, check the save file and just resume them |
22:40
🔗
|
zino |
3. restart crashed wpulls, up to a limit |
22:40
🔗
|
zino |
This would solve: |
22:40
🔗
|
zino |
1. Machine or pipeline maintenance, kill the pipeline instead of STOP:ing it. |
22:40
🔗
|
zino |
2. Crashing wpulls |
22:40
🔗
|
zino |
3. Locked wpulls, just kill the locked one |
22:40
🔗
|
zino |
The big question mark is do we need to rewire anything in the |
22:40
🔗
|
zino |
controller communication, or is that stateless? If the output from |
22:40
🔗
|
zino |
wpull is currently just piped to the controller without channel |
22:40
🔗
|
zino |
negotiation that will break. |
22:40
🔗
|
zino |
JAA, do you have any insight into how that works now? |
22:41
🔗
|
JAA |
Nothing special needs to be done to respawn wpull itself. You just rerun the same command in the same directory and it'll continue based on the database. |
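zino's three restart steps, combined with JAA's note that rerunning the same command in the same directory resumes the crawl, might look roughly like the Python sketch below. Everything here is an assumption for illustration: the state file layout, the function names and the `Job` fields are hypothetical, and the communication with the control node is left out entirely.

```python
import json
import os
import subprocess
import tempfile
from typing import Any, Callable, Dict, List

Job = Dict[str, Any]  # hypothetical layout: {"ident": str, "argv": [str], "cwd": str}


def save_jobs(path: str, jobs: List[Job]) -> None:
    """Step 1: record how each running wpull was spawned (atomic write)."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(jobs, fh)
    os.replace(tmp, path)


def load_jobs(path: str) -> List[Job]:
    """Step 2: at pipeline startup, read back the jobs to resume."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def run_subprocess(job: Job) -> int:
    """Rerun the saved command in the saved directory; return its exit code."""
    return subprocess.run(job["argv"], cwd=job["cwd"]).returncode


def run_with_restarts(job: Job, run: Callable[[Job], int], max_restarts: int = 3) -> bool:
    """Step 3: restart a crashed job, up to max_restarts extra attempts."""
    for _attempt in range(max_restarts + 1):
        if run(job) == 0:
            return True
    return False


# Demo with a fake runner that crashes twice and then exits cleanly.
state_file = os.path.join(tempfile.mkdtemp(), "running_jobs.json")
save_jobs(state_file, [{"ident": "job1", "argv": ["wpull", "http://example.com/"], "cwd": "/tmp/job1"}])
exit_codes = iter([1, 1, 0])
results = [run_with_restarts(job, lambda _job: next(exit_codes)) for job in load_jobs(state_file)]
print(results)
```

In a real pipeline `run_subprocess` would be passed as `run`; the fake runner only keeps the demo independent of wpull. A locked wpull could be handled the same way by killing the process and letting the restart loop pick it up.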
22:42
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
22:42
🔗
|
zino |
Yea, I mean how the communication with the controller works. I don't know how hard it would be to restart that info-stream, or if it's possible at all right now. |
22:43
🔗
|
JAA |
But I don't know much about the communication. I *think* it's all one-way communication, i.e. the control node runs a Redis database and a process on the pipeline (the wpull plugin?) connects to that database. |
22:44
🔗
|
JAA |
If you add an ignore or change the job's settings, that's written to the database by the control node, and it takes effect on the pipeline as soon as it notices that something has changed (the settings watcher). |
22:44
🔗
|
JAA |
The logs go back by the wpull plugin (?) writing to the same database. The control node then forwards that to the people looking at the dashboard. |
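The flow JAA describes (hedged with "I *think*") can be modelled in Python with an in-memory stand-in for the Redis database. This is only a sketch of the idea: `FakeStore`, `SettingsWatcher`, `add_ignore`, `write_log` and the version counter are all hypothetical and do not reflect ArchiveBot's real key layout or code.

```python
from typing import Any, Dict, List, Optional


class FakeStore:
    """In-memory stand-in for the Redis database on the control node."""

    def __init__(self) -> None:
        self.settings: Dict[str, Dict[str, Any]] = {}
        self.logs: Dict[str, List[str]] = {}


def add_ignore(store: FakeStore, ident: str, pattern: str) -> None:
    """Control node side: write a new ignore pattern for a job."""
    job = store.settings.setdefault(ident, {"version": 0, "ignores": []})
    job["ignores"].append(pattern)
    job["version"] += 1


def write_log(store: FakeStore, ident: str, line: str) -> None:
    """Pipeline side: push a log line for the dashboard to forward."""
    store.logs.setdefault(ident, []).append(line)


class SettingsWatcher:
    """Pipeline side: notices when a job's settings changed in the store."""

    def __init__(self, store: FakeStore, ident: str) -> None:
        self.store = store
        self.ident = ident
        self.seen_version: Optional[int] = None
        self.ignores: List[str] = []

    def poll(self) -> bool:
        """Return True if new settings were picked up."""
        current = self.store.settings.get(self.ident, {})
        version = current.get("version")
        if version == self.seen_version:
            return False
        self.seen_version = version
        self.ignores = list(current.get("ignores", []))
        return True


store = FakeStore()
watcher = SettingsWatcher(store, "job1")
first = watcher.poll()    # nothing written yet
add_ignore(store, "job1", r"^https?://example\.com/calendar/")
second = watcher.poll()   # the change is noticed
third = watcher.poll()    # nothing new since the last poll
write_log(store, "job1", "ignore pattern applied")
print(first, second, third, watcher.ignores, store.logs["job1"])
```

If the real system works like this, all state lives in the database and the pipeline only polls it, so a restarted pipeline process could simply reconnect and carry on.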
22:45
🔗
|
|
drumstick has joined #archiveteam-bs |
22:45
🔗
|
zino |
If that's how it works this should not be THAT hard to fix. I'll have a look another night. |
22:46
🔗
|
JAA |
This should basically mean that it should be possible to resume jobs without too much effort. I'm not sure if anything even needs to be changed on the control node apart from the IRC bot handling a few additional commands. |
22:46
🔗
|
JAA |
We'd have to look into it in more detail though regarding how it should work exactly. |
22:47
🔗
|
JAA |
For example, it would be nice if we could !pause a job also on a pipeline that doesn't need maintenance/reboot, e.g. in case of a ban, and if the pipeline then started another job perhaps. |
22:47
🔗
|
zino |
Yea. We really should have a test setup of the whole system to stage tests on. |
22:47
🔗
|
JAA |
But I'm not sure what !resume in that case should do exactly, etc. |
22:48
🔗
|
JAA |
Yeah, I've been wondering about that, how to test any code written for ArchiveBot. |
22:49
🔗
|
zino |
I'm scared to test anything as is. One typo and you hose all jobs the pipeline manages to ingest before you stop it. |
22:50
🔗
|
JAA |
We'd probably need a full parallel test setup. |
22:50
🔗
|
zino |
Yep |
22:51
🔗
|
JAA |
jrwr was able to set it up for the Tor version, so it shouldn't be too difficult. |
22:51
🔗
|
JAA |
Maybe he can tell us what to look out for. |
22:51
🔗
|
JAA |
There are some instructions in the repo, but no idea how complete those are. |
22:52
🔗
|
zino |
And I'd be happy to set that up, so maybe we could pump jrwr for some info. |
22:52
🔗
|
zino |
Anyways, time to sleep. To be continued. |
22:52
🔗
|
JAA |
Good night! |
22:54
🔗
|
JAA |
Asparagir: To get back to that question above: For me, it's more a matter of time than of money. And as far as I know, it's not possible to transfer time (yet?). :-/ |
22:56
🔗
|
Asparagir |
TO-DO #1765765: invent Hermione's Time-Turner |
22:57
🔗
|
JAA |
:-) |
23:06
🔗
|
|
drumstick has quit IRC (Quit: Leaving) |
23:11
🔗
|
kisspunch |
How does archiveteam feel about making a single gateway clone of requester-pays content. I'm happy to pay to get this stuff (already grabbed ArXiV, I guess imdb switched to this recently), but I don't have somewhere to distribute it with enough storage space |
23:11
🔗
|
kisspunch |
Torrents might be a good option |
23:13
🔗
|
JAA |
(IMDB claims they'll add a free gateway. Not sure if that exists by now or still not.) |
23:14
🔗
|
JAA |
What's wrong with putting it on IA? |
23:14
🔗
|
kisspunch |
Putting on IA is also a good option, main issue is ones that update often |
23:15
🔗
|
JAA |
Hmm. You can also update IA items as much as you like though. |
23:15
🔗
|
kisspunch |
Both ArXiV and IMDB have an additive-update process, IMDB also has a mutating "summary" |
23:15
🔗
|
kisspunch |
Apparently I need to learn how to put shit on IA |
23:16
🔗
|
kisspunch |
Maybe I should mirror githubarchive (timeline) and ghtorrent to IA |
23:17
🔗
|
kisspunch |
The timeline in particular is pretty small |
23:19
🔗
|
Frogging |
Asparagir: fwiw, Internet Archive has paid employees that work on this kind of stuff. maybe not so open though, unfortunately. |
23:20
🔗
|
Frogging |
also I'd love a time turner. too many times do I find out about something only after it's gone forever :( |
23:26
🔗
|
JensRex |
FUCK! 94% done uploading a 8GB warc at 200 kbs, and my ISP takes a shit. |
23:27
🔗
|
JensRex |
Still down. Quassel on mobile. |
23:27
🔗
|
|
jschwart has quit IRC (Konversation terminated!) |
23:35
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |
23:36
🔗
|
|
BlueMaxim has joined #archiveteam-bs |