#archiveteam-bs 2018-11-05,Mon

00:45 πŸ”— dashcloud has quit IRC (Read error: Connection reset by peer)
00:52 πŸ”— VerifiedJ has quit IRC (Leaving)
01:17 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:19 πŸ”— Sk1d has joined #archiveteam-bs
01:27 πŸ”— dashcloud has joined #archiveteam-bs
01:31 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:34 πŸ”— Sk1d has joined #archiveteam-bs
02:29 πŸ”— Stiletto has joined #archiveteam-bs
02:31 πŸ”— Stilett0 has quit IRC (Ping timeout: 265 seconds)
02:35 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
02:39 πŸ”— Sk1d has joined #archiveteam-bs
02:51 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
02:56 πŸ”— Sk1d has joined #archiveteam-bs
04:25 πŸ”— Stilett0 has joined #archiveteam-bs
04:30 πŸ”— Stiletto has quit IRC (Read error: Operation timed out)
04:30 πŸ”— Stiletto has joined #archiveteam-bs
04:33 πŸ”— Stilett0 has quit IRC (Read error: Operation timed out)
04:47 πŸ”— qw3rty111 has joined #archiveteam-bs
04:50 πŸ”— qw3rty119 has quit IRC (Read error: Operation timed out)
05:09 πŸ”— BlueMaxim has joined #archiveteam-bs
05:12 πŸ”— BlueMax has quit IRC (Ping timeout: 260 seconds)
05:42 πŸ”— jodizzle has joined #archiveteam-bs
06:42 πŸ”— fenn has quit IRC (Remote host closed the connection)
06:51 πŸ”— Mateon1 has quit IRC (Ping timeout: 255 seconds)
06:51 πŸ”— Mateon1 has joined #archiveteam-bs
08:05 πŸ”— Lord_Nigh frhed is safe, sort of: https://web.archive.org/web/20181105080330/http://www.snipersparadise.net/download/unrealeditor/frhed-v1.1.zip
08:06 πŸ”— Lord_Nigh by 'sort of' i mean that zip file is slightly modified, someone added a .txt file that the original zip did not have
08:06 πŸ”— Lord_Nigh the .exe inside is exactly the same crc32 though
08:06 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
08:09 πŸ”— Lord_Nigh https://web.archive.org/web/20181105080857/http://unrealtexture.com/Unreal/Downloads/Tutorials/frhed-v1.1.zip is an untainted original archive
08:10 πŸ”— Sk1d has joined #archiveteam-bs
08:12 πŸ”— Lord_Nigh https://web.archive.org/web/20181105081155/https://sobac.com/ZIPfiles/File%20Utilities/frhed-v1.1.zip is another untainted one
08:12 πŸ”— Lord_Nigh and NONE of those sites were properly archived, so i fed em to the bot
08:13 πŸ”— Flashfire Lord_Nigh what collection are they from?
08:36 πŸ”— Flashfire I am going to make a controversial page
08:45 πŸ”— Lord_Nigh Flashfire: random sites; the first two are old unreal/unreal tournament modding files archives
08:45 πŸ”— Lord_Nigh the latter is weird
08:46 πŸ”— Flashfire Alright, also Lord_Nigh, what do you think? I decided to start another wiki page
08:46 πŸ”— Lord_Nigh ? why should that bother me?
08:46 πŸ”— Lord_Nigh I barely have anything to do with the wiki
08:47 πŸ”— Flashfire lol I dont know
08:47 πŸ”— Flashfire Just wanted an opinion
09:38 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
09:40 πŸ”— Sk1d has joined #archiveteam-bs
09:54 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
09:56 πŸ”— Sk1d has joined #archiveteam-bs
10:02 πŸ”— godane so i placed a bid on this set of magazines : https://www.ebay.com/itm/PC-Computing-magazine-45-issues-1993-1997-no-reserve/192702649323
10:03 πŸ”— godane SketchCow: looks like there are only 11 issues of it scanned : https://archive.org/search.php?query=%22PC%20Computing%20Magazine%22&and[]=mediatype%3A%22texts%22
10:05 πŸ”— Flashfire Good luck godane
10:05 πŸ”— godane the only issue to overlap is the 1997-06 issue
10:06 πŸ”— godane i was between that and the PCM magazine set : https://www.ebay.com/itm/PCM-Person-Computing-Magazine-for-Tandy-Computer-Users/323534021660
10:06 πŸ”— godane that one has another 6+ days
10:07 πŸ”— godane for the PC Computing one i have only 6 and a half hours left
10:09 πŸ”— Flashfire I will continue adding to the FTP list until it can no longer be ignored
10:12 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
10:15 πŸ”— Sk1d has joined #archiveteam-bs
10:41 πŸ”— SmileyG has joined #archiveteam-bs
10:41 πŸ”— Smiley has quit IRC (Read error: Operation timed out)
10:54 πŸ”— BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:18 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
11:21 πŸ”— Sk1d has joined #archiveteam-bs
11:35 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
11:39 πŸ”— Sk1d has joined #archiveteam-bs
11:51 πŸ”— Kaz how often is the wiki backed up?
11:52 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
11:53 πŸ”— JAA Kaz: Our wiki, you mean? Weekly dumps at https://archiveteam.org/dumps/ which I believe are collected by SketchCow and uploaded to IA periodically (every few months or so).
11:54 πŸ”— Kaz cool - yeah, just wanted to confirm the backups were actually being put somewhere other than on the host itself
11:54 πŸ”— Kaz seems like there's been a lot of activity on it recently
11:56 πŸ”— Sk1d has joined #archiveteam-bs
12:18 πŸ”— NetwideRo has joined #archiveteam-bs
12:21 πŸ”— NetwideRo I'm trying to archive a few sites from an old ARG, which requires passwords out of books to be sent in a form before you can get some of the content. I've been using wget to make WARCs of all the easy static content, is there a recommended best way to archive the interactions that send passwords?
12:23 πŸ”— JAA NetwideRo: Depends on how that login works. If it's an HTTP basic auth, then you can just pass a URL of the form http://username:password@example.org/path. If it's a form, then you need to use --post-data and --save-cookies (and likely --keep-session-cookies) to "log in" and save the cookies to a file, then --load-cookies plus --warc-* etc. to do the actual archival.
12:24 πŸ”— JAA Assuming the form uses POST. If it uses GET, then just use the appropriate URL instead of --post-data.
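A minimal sketch of the two-step wget approach JAA describes, assuming a hypothetical form at http://example.org/login.php with a single "password" field (URL and field name are placeholders for the actual site):

    # Step 1: POST the form and save the session cookies to a file
    wget --post-data='password=SECRET' \
         --save-cookies=cookies.txt --keep-session-cookies \
         -O login-response.html http://example.org/login.php

    # Step 2: do the actual archival with the saved cookies
    wget --load-cookies=cookies.txt \
         --warc-file=site --recursive --page-requisites \
         http://example.org/members/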
12:25 πŸ”— SketchCow I checked the backup situation, and yeah.
12:25 πŸ”— SketchCow It's not automatically shoved into the item, but it was on FOS and ultimately I'm likely to put it in automatically.
12:35 πŸ”— JAA Found a wonderful way to foster depression: going through sites marked as "closing" and archival "upcoming" on the wiki which are long gone and lost. :-|
12:36 πŸ”— NetwideRo Thanks JAA. Is it possible to save the password POST followed by a GET using the cookies in the same WARC? The user is supposed to send one of several passwords to "lib/login.php" and then load "chapter.php" in the same session to get the page corresponding to the password they sent. I want to save all of the possible interactions.
12:38 πŸ”— JAA NetwideRo: I don't think that's possible with a single wget invocation. While you can specify more than one URL on the command line, that would cause wget to POST all of those URLs (with the same data).
12:39 πŸ”— JAA Of course, you can also use --warc-* with the initial POST request. Since wget doesn't have a --warc-append option (unlike wpull), you'll have to merge the WARCs manually afterwards.
12:39 πŸ”— JAA Regarding "in the same session", that's why I referred to --keep-session-cookies. That allows you to continue the same "session" across multiple invocations of wget.
12:40 πŸ”— JAA (Specifically, "sessions" work through cookies that don't have an explicit expiration date. --keep-session-cookies causes wget to still write those cookies to the cookie file, and so when you --load-cookies them again on the next wget process, it'll have the same cookies and therefore behave as if it's the same session.)
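Building on that, a sketch of NetwideRo's case: record each password's POST and the follow-up GET in separate WARCs, then merge. Concatenating .warc.gz files works because a gzipped WARC is a series of independent gzip members (the password value is a placeholder):

    # Record the login POST in its own WARC, saving the session cookies
    wget --post-data='password=SECRET' \
         --save-cookies=cookies.txt --keep-session-cookies \
         --warc-file=login http://example.org/lib/login.php

    # Fetch the chapter in the same "session"
    wget --load-cookies=cookies.txt --warc-file=chapter \
         http://example.org/chapter.php

    # Merge the two into a single WARC
    cat login.warc.gz chapter.warc.gz > merged.warc.gz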
12:41 πŸ”— NetwideRo Merging should work. Thanks for your help
12:42 πŸ”— NetwideRo has left
12:42 πŸ”— betamax Well, my attempt to get all the campaign sites for the midterms is going terribly
12:42 πŸ”— betamax been running for 12+ hours and only got ~110 of 6,000
12:43 πŸ”— JAA betamax: Homepages or entire websites?
12:43 πŸ”— betamax entire websites
12:43 πŸ”— betamax it's because many have event calendars
12:43 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
12:43 πŸ”— betamax with a page for every day
12:43 πŸ”— JAA Ah, right.
12:43 πŸ”— JAA Calendars are the worst.
12:43 πŸ”— betamax and they go back forever, so I have to notice it's stuck on one and cancel the archiving
12:44 πŸ”— betamax don't suppose there's any easy way round this?
12:45 πŸ”— VerifiedJ has joined #archiveteam-bs
12:46 πŸ”— Sk1d has joined #archiveteam-bs
12:46 πŸ”— JAA betamax: You could limit the depth of archival using --level (assuming you're using wget or wpull for this). There's a small chance it might miss some content with this though.
12:48 πŸ”— JAA Other than that, there is no generic way to handle this. You can add patterns to filter those calendars out beyond certain dates, but the required pattern will depend on the exact calendar software used.
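A sketch of both options with wpull; the exclusion regex is only a guess at a typical calendar URL layout and would need adjusting per site:

    # Cap the recursion depth
    wpull --recursive --level 3 --warc-file=site http://example.org/

    # Or filter out dated calendar pages by URL pattern
    wpull --recursive --warc-file=site \
          --reject-regex='/(calendar|events)/[0-9]{4}-[0-9]{2}' \
          http://example.org/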
12:51 πŸ”— alex____ has quit IRC (Ping timeout: 360 seconds)
12:54 πŸ”— alex__ has joined #archiveteam-bs
12:59 πŸ”— betamax JAA: is there a way to abort wget gracefully without damaging the WARC it's creating?
13:00 πŸ”— betamax (if so, I'll make it stop after a max of 3 minutes, as most of the time it takes <1 minute to archive a site)
13:00 πŸ”— JAA betamax: No idea. I don't use wget for archival.
13:03 πŸ”— betamax ah, well. I'll do it non-gracefully and make a list of all the ones that were aborted, to be done later
13:06 πŸ”— JAA The number of dead paste links on our wiki is eye-watering.
13:28 πŸ”— Stilett0 has joined #archiveteam-bs
13:28 πŸ”— Stiletto has quit IRC (Ping timeout: 265 seconds)
13:54 πŸ”— hiroi The wiki page watch emails have β€œlocalhost” set as the host name for the wiki. Can someone with admin access fix it? Thx
13:55 πŸ”— Kaz cc jrwr SketchCow ^
13:55 πŸ”— dashcloud has quit IRC (No Ping reply in 180 seconds.)
13:57 πŸ”— dashcloud has joined #archiveteam-bs
14:44 πŸ”— SketchCow Need jrwr to do it, he's the smart one.
14:44 πŸ”— SketchCow JAA: I'd like to set up archiving for a couple outlink parts of the archive.
14:44 πŸ”— SketchCow I also wanted to do so for justsolve.archiveteam.org
14:45 πŸ”— SketchCow JAA: And yes, as for missed opportunities, let's cut back on them now
14:45 πŸ”— SketchCow Hence I'm back, baby
14:47 πŸ”— * anarcat waves
14:47 πŸ”— anarcat so a couple of fellows here have setup a server to mirror ... er... brasil
14:47 πŸ”— anarcat we've got 160GB of videos and ... hmm... about 30GB of sites so far
14:47 πŸ”— anarcat i think a lot of that stuff should go in a collection in IA
14:48 πŸ”— anarcat maybe not the videos, but definitely the site crawls
14:48 πŸ”— anarcat we've been running grab-site around the clock to save sites before the political turnover over there
14:48 πŸ”— anarcat and i've been wondering how to better coordinate with you folks
14:49 πŸ”— anarcat so far i've thrown a few sites at archivebot, but have mostly stopped because i don't want to clog the queue
14:49 πŸ”— SketchCow Well, first of all, if you do it outside of archiveteam structure, it won't go in wayback.
14:49 πŸ”— anarcat ah
14:49 πŸ”— SketchCow But you can certainly upload the WARCs you're making and they'll go into warczone on IA
14:49 πŸ”— SketchCow Which is something
14:49 πŸ”— anarcat right
14:49 πŸ”— anarcat i've uploaded smaller WARC files to IA before and someone (you i think?) has been nice enough to add it to IA
14:50 πŸ”— SketchCow Yeah, that got shutdown, but you can upload them to the general sets and I'll have scripts that see them and fling them in WARCzone
14:50 πŸ”— anarcat git-annex says we got 34.76 GB of site data so far
14:50 πŸ”— jrwr I HAVE BEEN SUMMONED
14:50 πŸ”— jrwr Whats up SketchCow
14:51 πŸ”— anarcat SketchCow: could we get a collection to regroup that stuff?
14:51 πŸ”— SketchCow 1. Apparently we have a situation where e-mails from the archiveteam wiki have the hostname defined as "localhost"
14:51 πŸ”— SketchCow < hiroi> The wiki page watch emails have "localhost" set as the host name for the wiki. Can someone with admin access fix it? Thx
14:51 πŸ”— jrwr Oh, Ya I can fix that SketchCow, I didn't keep the logins for the box, so let me dig into my archives and see if I can find them
14:51 πŸ”— SketchCow That'd be good.
14:51 πŸ”— SketchCow Also, is it JAA or jrwr who is responsible for this storify blob on FOS
14:52 πŸ”— SketchCow Because work needs to start on that soon or you'll have to do it later sometime
14:52 πŸ”— jrwr not me
14:55 πŸ”— dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
14:55 πŸ”— jrwr So, I have everything but the cpanel password
14:55 πŸ”— jrwr I think we used Hangouts for that SketchCow
14:56 πŸ”— dashcloud has joined #archiveteam-bs
15:00 πŸ”— jrwr nvm, I dug it up, that was interesting how long it keeps those
15:00 πŸ”— SketchCow Yeah
15:01 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:04 πŸ”— Sk1d has joined #archiveteam-bs
15:05 πŸ”— jrwr @hiroi can I get a headers dump from the email, I've got the senders set in mediawiki
15:06 πŸ”— anarcat SketchCow: so i dump everything in "opensource" and tell you about items?
15:06 πŸ”— SketchCow Yes
15:06 πŸ”— anarcat SketchCow: wouldn't it be easier to group this into collections?
15:06 πŸ”— SketchCow Or I'll find them
15:06 πŸ”— SketchCow And then at some point, we'll make a big-ol collection and you can then keep uploading there
15:06 πŸ”— anarcat ah
15:06 πŸ”— anarcat why not make the collection first?
15:09 πŸ”— jrwr nvm, I was dumb -- should be fixed now SketchCow Kaz hiroi
15:12 πŸ”— SketchCow OK, great
15:12 πŸ”— SketchCow anarcat: Silly IA rule, we don't make new collections before items arrive unless there's contracts and shiznat
15:12 πŸ”— SketchCow Cuts down on a lot of "TBD" collections
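For reference, a sketch of such an upload with the internetarchive ("ia") command-line tool, dumping into the general opensource collection as discussed above; the identifier, filename, and title are placeholders:

    # Upload a crawl as a new item in the 'opensource' collection
    ia upload example-crawl-20181105 site.warc.gz \
       --metadata='collection:opensource' \
       --metadata='title:Crawl of example.gov.br (2018-11-05)'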
15:13 πŸ”— anarcat SketchCow: what if we create an archiveteam project here?
15:13 πŸ”— anarcat SketchCow: what we're doing is basically https://archiveteam.org/index.php?title=Government_Backup but for .br
15:13 πŸ”— anarcat i've been hesitant in creating a wiki page and channel and all that stuff because i'm unfamiliar with the process
15:13 πŸ”— anarcat but if that's the way to go, i'm all in
15:13 πŸ”— anarcat we've got stuff to throw at warriors too
15:13 πŸ”— anarcat if i can figure out how that works
15:14 πŸ”— anarcat and large CKAN datasets and so on
15:15 πŸ”— jrwr updated the wiki to 1.31.1
15:16 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:20 πŸ”— jrwr rebuilding caches now
15:21 πŸ”— Sk1d has joined #archiveteam-bs
15:21 πŸ”— jrwr SketchCow: have the backups been showing up in FOS from Archiveteam wiki
15:21 πŸ”— SketchCow Yes
15:22 πŸ”— SketchCow And I finally dumped them into the IA infrastructure
15:22 πŸ”— jrwr nice
15:22 πŸ”— jrwr should help in case of explosions
15:22 πŸ”— SketchCow teamarchive2:/1/CHFOO/warrior/at.org-wiki-dumps# ls -l
15:22 πŸ”— SketchCow total 485716
15:22 πŸ”— SketchCow -rw-r--r-- 1 1001 1001 497361153 Nov 2 08:01 archiveteam-dump-2018-11-02.xml.gz
15:22 πŸ”— SketchCow -rw-r--r-- 1 root root 107 Nov 2 00:06 UPLOAD_TO_INTERNETARCHIVE.sh
15:23 πŸ”— SketchCow I'll be making that .sh run automatically daily shortly
15:23 πŸ”— JAA SketchCow: Storify is me, yeah. Sorry about that.
15:23 πŸ”— SketchCow Move on it, cutie. I have a guy from Poland sending me 15tb of crap
15:23 πŸ”— SketchCow I can stay on top of him because he's "only" doing 3-10gb a day, but who knows
15:24 πŸ”— jrwr cool
15:25 πŸ”— * anarcat got 180GB at day six :)
15:25 πŸ”— anarcat i'll start uploading to archive.org now
15:26 πŸ”— SketchCow OK, I made it automatic before I forget to make it automatic
15:26 πŸ”— SketchCow More a "hey, the backups aren't working" false-alarm thing than "disk space", obviously
15:27 πŸ”— jrwr cool
15:27 πŸ”— SketchCow Currently, the "inbox" folder at FOS (that's the archivebot and similar shiznat) is at 89mb, which is good cheese
15:28 πŸ”— SketchCow The ARCHIVETEAM folder (which is the one that is handed all the stuff from inbox) is at 720k
15:29 πŸ”— * jrwr has never seen SketchCow use the word shiznat and believes IRC is corrupting him
15:29 πŸ”— SketchCow The ARCHIVEBOT folder is 240gb
15:29 πŸ”— SketchCow Mostly due to an infrastructure backup (S3 is going barf)
15:30 πŸ”— SketchCow https://twitter.com/textfiles/status/597925997288099840
15:30 πŸ”— SketchCow (Me saying Shiznat in Twitter in 2015)
15:30 πŸ”— SketchCow Also, I've been on IRC since 1998.
15:31 πŸ”— SketchCow So literally 20 years
15:31 πŸ”— anarcat kids these days
15:32 πŸ”— * jrwr has been on since 2000
15:32 πŸ”— jrwr it ruined my grammer
15:32 πŸ”— jrwr since I was (I'm going to date myself) 10 years old on a webtv talking on Talk City
15:33 πŸ”— * jut was born in 2000
15:35 πŸ”— SketchCow Yes, you've only known a time with IRC
15:35 πŸ”— SketchCow It's sad.
15:35 πŸ”— SketchCow So sad
15:36 πŸ”— SketchCow So, in 1995, babies never died and we all were fit and healthy
15:36 πŸ”— SketchCow I'm sorry you missed it
15:36 πŸ”— anarcat that is kind of sad
15:37 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:39 πŸ”— jut I missed many things; what is sadder is that a 10 year old only knows a world with smartphones
15:39 πŸ”— jrwr I missed out on BBSes
15:39 πŸ”— jrwr I would of ran one
15:39 πŸ”— jrwr would of been king of the 777 seas
15:40 πŸ”— JAA would have*
15:40 πŸ”— Sk1d has joined #archiveteam-bs
15:40 πŸ”— SketchCow Shhh, he'd totally of fit in back there
15:40 πŸ”— jrwr I do infosec CTFs and I still do scavenger hunts on Textfiles
15:41 πŸ”— jrwr fun to go pull crap out of old posts or disk images and make people hunt them down using a set of clues
15:42 πŸ”— jrwr I do have a 6502 badge that I really want to port CP/M to
16:19 πŸ”— anarcat SketchCow: a tiny test crawl https://archive.org/download/memorialanistia.org.br
16:19 πŸ”— anarcat https://archive.org/details/cnv.memoriasreveladas.gov.br is a larger one being uploaded
16:20 πŸ”— anarcat i have 31 more crawls to upload, totaling 30+GB
16:20 πŸ”— anarcat SketchCow: do those look okay to you? should i keep going?
16:38 πŸ”— SimpBrain has quit IRC (Ping timeout: 252 seconds)
17:07 πŸ”— anarcat gaah why is IA's s3 failing now (403 Forbidden. "The request signature we calculated does not match the signature you provided.") - yet i create a few buckets already without problems
17:07 πŸ”— anarcat are there quotas or something?
17:17 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
17:17 πŸ”— wp494 has joined #archiveteam-bs
17:20 πŸ”— adinbied Any idea on what the issue with the tracker is? Something with the Redis side?
17:25 πŸ”— PurpleSym betamax: Yeah, same for the chromebot grab. About 1/4th for Facebook, nothing for Twitter yet. I need a bigger machine.
17:35 πŸ”— adinbied Also, I've now got the angelfire scripts done and working correctly (as far as I can tell) - if someone (JAA, arkiver, or whoever) can take a look at everything and create an ArchiveTeam github repo for them to get ready for Warriors, that would be great.
17:35 πŸ”— adinbied https://github.com/adinbied/angelfire-grab
17:35 πŸ”— arkiver sure
17:36 πŸ”— adinbied https://github.com/adinbied/angelfire-items
17:38 πŸ”— arkiver it looks pretty good already, but needs a few changes
17:40 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
17:40 πŸ”— arkiver adinbied: I've made https://github.com/orgs/ArchiveTeam/teams/angelfire
17:40 πŸ”— adinbied OK, feel free to make any adjustments you need to or let me know what needs changing - I'm still learning the basics of Lua
17:40 πŸ”— adinbied Sounds good - thanks
17:41 πŸ”— arkiver and forked the repos into the archiveteam github
17:41 πŸ”— arkiver and the team
17:44 πŸ”— Sk1d has joined #archiveteam-bs
17:48 πŸ”— arkiver do you have an example of a large user?
17:48 πŸ”— arkiver adinbied: ^
17:49 πŸ”— adinbied I have it somewhere, hold on....
17:50 πŸ”— jodizzle betamax: Did you grab candidate/campaign Instagrams as well? There are at least a few Instagrams listed on Vote411.
17:51 πŸ”— arkiver so the scripts look fine overall
17:51 πŸ”— adinbied arkiver, id2/silas has ~2300 URLS in its sitemap and is one of the largest
17:51 πŸ”— arkiver but i think they could go off to other users, since there's no check in the allowed function that a page is from a certain user
17:52 πŸ”— adinbied Although alot of those are 404ing
17:52 πŸ”— JAA adinbied, arkiver: Please move this to #angelonfire.
17:52 πŸ”— adinbied Got it
17:53 πŸ”— SketchCow anarcat: I wish you folks had metadata to go with these items.
17:53 πŸ”— SketchCow They certainly parse as WARCs.
17:57 πŸ”— anarcat SketchCow: which metadata would you like to see in there?
17:57 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
18:00 πŸ”— Sk1d has joined #archiveteam-bs
18:13 πŸ”— adinbied Heads up, the ./get-wget-lua.sh script is also broken as it queries warriorhq.archiveteam.org
18:14 πŸ”— adinbied Getting a connection refused
18:15 πŸ”— JAA Yup, same server.
18:16 πŸ”— * anarcat waves at JAA
18:16 πŸ”— anarcat JAA: pm?
18:25 πŸ”— betamax jodizzle: no, unless my hastily-written scraper mistook an instagram url for a campaign website
18:25 πŸ”— betamax so maybe one or two, but I didn't specifically target them
18:26 πŸ”— betamax if you have any lists of instagram pages, I have a spare machine I can run snscrape on them to get the links, though
18:40 πŸ”— jodizzle betamax: Don't have them all, but I can try doing some of my own scraping later today. It seems like Instagram links are pretty rare.
18:40 πŸ”— SimpBrain has joined #archiveteam-bs
18:41 πŸ”— jodizzle Not that the candidates don't have Instagrams, but that the links aren't on Vote411 for whatever reason (I think candidates probably submit their own info to Vote411)
18:44 πŸ”— JAA anarcat: Anytime.
18:44 πŸ”— jodizzle From a grep of your files it doesn't seem like you picked up any
18:45 πŸ”— betamax yeah, I think Vote411 reaches out to candidates to get info, so quite a lot are empty or incomplete
19:46 πŸ”— odemg has joined #archiveteam-bs
19:46 πŸ”— odemg arkiver, can we stick this on the tracker, distribute this shii, it's got a nice api https://case.law/about/
19:47 πŸ”— odemg `40 million pages of US court decisions`
19:47 πŸ”— odemg `currently you may view or download no more than 500 cases per day`
19:47 πŸ”— arkiver cool
19:47 πŸ”— arkiver we should totally get that
19:47 πŸ”— odemg <3
19:50 πŸ”— t2t2 has quit IRC (Remote host closed the connection)
19:52 πŸ”— t2t2 has joined #archiveteam-bs
20:06 πŸ”— Kaz 500/day/account, that could be an issue
20:19 πŸ”— odemg Kaz, slow but if we get enough people/ips we're good
20:19 πŸ”— odemg ipv6?
20:20 πŸ”— Kaz yeah, adding the accounts will be a pain though, unless we just make the warriors pull in a list from somewhere
20:20 πŸ”— odemg accounts? We dont need accounts to hit the api right?
20:20 πŸ”— odemg doesn't seem so
20:22 πŸ”— Kaz https://usercontent.irccloud-cdn.com/file/AHCDUhWc/image.png
20:22 πŸ”— Kaz not too sure what that means
20:26 πŸ”— jut What does "bulk download" mean?
20:26 πŸ”— jut Because it sounds like what we would like to do
20:27 πŸ”— Kaz bulk download sounds like something we're not going to get
20:27 πŸ”— Kaz because it sounds like they'll ask for justification
20:27 πŸ”— Kaz and "we want to download all your data, stick it up elsewhere and distribute it ourselves" probably doesn't tick all their boxes
20:30 πŸ”— jut Well IA has an option to set items dark doesn't it? So "we want to download all your data, stick it up elsewhere just in case"
20:35 πŸ”— PurpleSym sets mode: +o arkiver
20:36 πŸ”— PurpleSym sets mode: +oooo HCross Kaz dxrt joepie91_
20:36 πŸ”— dxrt sets mode: +o dxrt_
20:43 πŸ”— icedice has joined #archiveteam-bs
20:51 πŸ”— betamax so I have a new method for getting the campaign sites:
20:51 πŸ”— betamax run wget, mirroring and creating a warc
20:51 πŸ”— betamax if it doesn't finish in 120 seconds, cancel it, and add it to a list of 'long' sites
20:51 πŸ”— betamax this works because the majority of sites are small and so finish in ~40 seconds
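A bash sketch of that loop using coreutils timeout, assuming a sites.txt list of URLs; timeout exits with status 124 when the time limit is hit:

    while read -r url; do
        name="${url//[^A-Za-z0-9]/_}"   # crude WARC name derived from the URL
        timeout 120 wget --recursive --page-requisites \
                --warc-file="$name" "$url"
        # 124 means wget was killed at the 120-second mark: note it for later
        [ $? -eq 124 ] && echo "$url" >> long-sites.txt
    done < sites.txt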
20:58 πŸ”— anarcat has quit IRC (Read error: Connection reset by peer)
21:03 πŸ”— icedice has quit IRC (Quit: Leaving)
21:05 πŸ”— anarcat has joined #archiveteam-bs
21:17 πŸ”— Ryz has joined #archiveteam-bs
21:18 πŸ”— Ryz Huh, this might be one of their A/B test results they've been pulling; I couldn't get a snapshot because I thought the hiding of search result counts was permanent, like YouTube's
21:20 πŸ”— Ryz I mentioned this in the main channel because it may affect how the archiving process would work if there's no quick glimpse of how large the website could be
21:30 πŸ”— schbirid has quit IRC (Remote host closed the connection)
21:35 πŸ”— BlueMax has joined #archiveteam-bs
21:46 πŸ”— igglybee has joined #archiveteam-bs
21:48 πŸ”— igglybee Is ArchiveBot going to change to SlimerJS?
22:04 πŸ”— Flashfire Arkiver any news on #effteepee
22:37 πŸ”— Flashfire http://ftp.arnes.si/ftp/software/unix/perl/CPAN/modules/by-category/15_World_Wide_Web_HTML_HTTP_CGI/WWW/CLSCOTT/
22:37 πŸ”— Flashfire Bebo is listed as lost on our wiki
22:37 πŸ”— Flashfire I found this while scanning for FTP servers
22:37 πŸ”— Flashfire Anyone think its useful?
22:38 πŸ”— JAA These particular files are also on CPAN though: https://cpan.metacpan.org/authors/id/C/CL/CLSCOTT/
22:38 πŸ”— JAA (At least that's what it looks like; didn't check.)
22:39 πŸ”— Flashfire It looks to be a mirror of CPAN on an FTP server, but that's beside the point. If "Version 1" is listed as lost and these are early API modules, can data be extracted from them?
22:40 πŸ”— JAA < Flashfire> Bebo is listed as lost on our wiki -- I think that refers to the site, not this Perl module.
22:41 πŸ”— Flashfire Ok then. My bad. Thought it could be useful
22:54 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
22:57 πŸ”— Sk1d has joined #archiveteam-bs
23:07 πŸ”— igglybee https://slimerjs.org/
23:15 πŸ”— JAA igglybee: Not very likely. There are massive issues with the way PhantomJS, SlimerJS & Co. work. I'd rather spend my time developing a usable tool based on an actual browser (Firefox or Chromium Headless) than trying to replace PhantomJS with another mediocre solution. Others might disagree though. I wouldn't be opposed to accepting a PR that does all the necessary work to integrate SlimerJS (or
23:15 πŸ”— JAA whatever else) into wpull in general though.
23:16 πŸ”— VerifiedJ has quit IRC (Quit: Leaving)
23:16 πŸ”— igglybee JAA: It apparently has the same API as PhantomJS, so there should be minimal work needed
23:17 πŸ”— igglybee https://github.com/laurentj/slimerjs/blob/master/API_COMPAT.md
23:18 πŸ”— JAA I see. Well, that would still leave those issues we had with PhantomJS, the main one being the massive duplication in the archives because all page requisites (images, stylesheets, scripts, fonts, etc.) were redownloaded for every single page.
23:19 πŸ”— igglybee Wouldn't it still be better than nothing though?
23:20 πŸ”— igglybee and don't browser-based things have the same issue?
23:21 πŸ”— igglybee I'm assuming someone should be able to plop it in and it would mostly work
23:22 πŸ”— ivan do you want to maintain it? :-)
23:22 πŸ”— JAA Well, you can try it out by specifying SlimerJS through --phantomjs-exe. If it is indeed compatible, then it should work.
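That is, a sketch of pointing wpull's existing PhantomJS integration at SlimerJS instead, assuming a slimerjs binary on PATH and that the compatibility claim holds:

    wpull --recursive --warc-file=site \
          --phantomjs --phantomjs-exe slimerjs \
          http://example.org/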
23:25 πŸ”— hook54321 I doubt I'll be able to figure it out, but if no one else wants to try it I can I guess
23:25 πŸ”— hook54321 What's the correct wpull fork to use?
23:26 πŸ”— JAA hook54321: ArchiveTeam/wpull, but install from Git, not from PyPI.
23:26 πŸ”— JAA 2.0.3 is currently not on PyPI.
23:30 πŸ”— Ing3b0rg has quit IRC (Quit: woopwoop)
23:30 πŸ”— anarcat has quit IRC (Read error: Connection reset by peer)
23:31 πŸ”— Ing3b0rg has joined #archiveteam-bs
23:32 πŸ”— bakJAA has quit IRC (Ping timeout: 260 seconds)
23:32 πŸ”— plue has quit IRC (Ping timeout: 260 seconds)
23:33 πŸ”— moufu are there page requisites phantomjs can't download, but wpull can? if not, why not just exclude -p when --phantomjs is used?
23:33 πŸ”— betamax has quit IRC (Remote host closed the connection)
23:34 πŸ”— betamax has joined #archiveteam-bs
23:34 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:34 πŸ”— hook54321 I don't know what that means
23:34 πŸ”— plue has joined #archiveteam-bs
23:35 πŸ”— bakJAA has joined #archiveteam-bs
23:35 πŸ”— swebb sets mode: +o bakJAA
23:35 πŸ”— JAA sets mode: +o bakJAA
23:35 πŸ”— moufu I was referring to the duplication issue JAA mentioned
23:35 πŸ”— anarcat has joined #archiveteam-bs
23:36 πŸ”— hook54321 JAA: I tried to install it, this was the result.
23:36 πŸ”— hook54321 https://www.irccloud.com/pastebin/5wRqVIbL/
23:37 πŸ”— JAA moufu: I *think* that PhantomJS will download page requisites regardless of whether or not --page-requisites is specified. I also think that PhantomJS will only download the page requisites actually necessary for display, e.g. if there are multiple font formats, wpull would fetch all but PhantomJS itself wouldn't. But I don't actually know.
23:38 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
23:38 πŸ”— JAA hook54321: pip uninstall tornado && pip install tornado==4.5.3 (with --user as appropriate; assuming you're in a venv or something)
23:38 πŸ”— Sk1d has joined #archiveteam-bs
23:38 πŸ”— JAA And while you're at it, make sure you have html5lib==0.9999999 (seven nines).
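Putting JAA's pointers together, a sketch of the install (inside a venv, or with --user as appropriate):

    # ArchiveTeam's wpull fork, from Git since 2.0.3 isn't on PyPI
    pip install git+https://github.com/ArchiveTeam/wpull
    # Pin the dependencies mentioned above
    pip install tornado==4.5.3 html5lib==0.9999999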
23:38 πŸ”— hook54321 πŸ‘€
23:38 πŸ”— hook54321 ok
23:39 πŸ”— BlueMax has joined #archiveteam-bs
23:40 πŸ”— moufu makes sense
23:42 πŸ”— Odd0002 has quit IRC (Ping timeout: 260 seconds)
23:45 πŸ”— Odd0002 has joined #archiveteam-bs
23:48 πŸ”— hook54321 https://www.irccloud.com/pastebin/jslcmSyH/
23:49 πŸ”— JAA Oh, SlimerJS is a Python script? Hmm.
23:49 πŸ”— JAA I guess it's not marked as executable (and is lacking the appropriate shebang maybe)?
23:50 πŸ”— Flashfire IT WILL NEVER STOP
23:50 πŸ”— hook54321 ?
23:54 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:55 πŸ”— anarcat has quit IRC (Read error: Operation timed out)
23:56 πŸ”— Flashfire Oh I am adding ftp sites to the list until someone restarts the scripts
23:56 πŸ”— hook54321 JAA: What did you expect it to be?
23:56 πŸ”— hook54321 https://www.irccloud.com/pastebin/gdDkfyRq/
23:56 πŸ”— hook54321 wait
23:56 πŸ”— Flashfire I know that ftp://ftp.aztech.com/ contains stuff cause that's how I found it but it doesn't appear indexed...
23:56 πŸ”— hook54321 https://www.irccloud.com/pastebin/3gwQ7D1o/
23:56 πŸ”— hook54321 agh, I need firefox 59
23:56 πŸ”— anarcat has joined #archiveteam-bs
23:57 πŸ”— JAA :-/
23:58 πŸ”— anarcat ffs what's up with this fucking router
23:59 πŸ”— anarcat https://forum.turris.cz/t/omnia-unreachable/8616
23:59 πŸ”— Sk1d has joined #archiveteam-bs
