#archiveteam-bs 2017-12-21,Thu

zinojrwr, not that it helps, but I'm sending all my symphaty and good thoughts. [00:05]
jrwrIm going to hack up some cgi-bin
and compile php7 on this box
wouldn't be the first time I've worked around shit like this (LOOKING AT YOU, FERAL)
[00:05]
........... (idle for 51mn)
so I turned on file based cache
it should help /some/
I'm working with apache some to get this working its being a PITA, I suspect a apache module doing this
[00:56]
***BnAboyZ has joined #archiveteam-bs [01:08]
.... (idle for 19mn)
Somebody2BTW, regarding WARC uploads going into the Wayback Machine -- I've now gotten confirmation that it is still a trusted-uploaders-only process (which isn't surprising).
JAA is trusted, and ivan as well, presumably.
[01:27]
jrwrSo SketchCow, I'm stuck, the PHP is too old to update mediawiki, Apache is not behaving with the CGI override due to mod_security being forced on, and overall the entire account is limited to 7 (confirmed) connections concurrent (thats whats causing the resource limit pages currently)
I've added the static file cache and it is helping
[01:42]
SketchCowThere'll be some roughness as we figure out what to do. [01:44]
jrwrYa
its using 2.6 as its kernel....
[01:44]
SketchCowBut if I have intelligent requests for the host, I'm sure they can help. [01:44]
jrwrOk, So the main one is can I have my limits increased for the number of CGI scripts run at one time. I keep getting resource limit errors on top of this error log: [Wed Dec 20 20:41:37 2017] [error] mod_hostinglimits:Error on LVE enter: LVE(527) HANDLER(application/x-httpd-php5) HOSTNAME(archiveteam.org) URL(/index.php) TID(318310) errno (7) Read more: http://e.cloudlinux.com/MHL-E2BIG min_uid (0) [01:46]
SketchCowWell, assemble them all in one place for me.
I mean after a day of looking it COMPLETELY over
And then I'll bring it to TQ and see what they thing
think
[01:46]
jrwrOk [01:47]
SketchCowNo sense in piecemealing
Also, let me glance at the cpanel
[01:47]
jrwrOk
Ya, its pretty much those two issues I have with it, I'm compiling them into a google sheet for tracking
[01:49]
jacketchaWow. I was having problems compiling WARC files in javascript and was going to ask if there was a preexisting API for something like that, but I can barely even read what you guys are saying. [01:58]
jrwrIm poking the poor wiki very hard
whats up jacketcha
WARC reading in javascript, hrm
[01:58]
jacketchayeah [01:59]
jrwrWell, the WARC standard is pretty simple overall
its all about the indexing and lookups that make it fast and JavaScript is not that great at it.
https://www.npmjs.com/package/node-warc
have some node
its /javascript/
[01:59]
jacketchathanks
I was planning to add it to my chrome extension
[02:00]
jrwrah
im logging off for now SketchCow, See ya in the morning
[02:01]
***robink has quit IRC (Ping timeout: 246 seconds) [02:13]
jacketchaok, so can somebody explain to me how warc files work
sorry for being dumb
i whonestly have no idea
*honestly
[02:15]
Froggingdo you have a more specific question? [02:16]
jacketchaHow is the data structured? I am going to assume that it isn't just copying in the HTML source code after the headers are added. [02:20]
FroggingIt stores the full response headers and body
That includes responses containing binary data, HTML, CSS, plain text, whatever
[02:20]
jacketchaIs there any specific order to that? [02:22]
FroggingRecords can be in any order AFAIK [02:24]
jacketchaGreat, so it'll work just fine with asynchronous saving. Thanks, that was actually really helpful. [02:25]
Frogginghttp://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [02:28]
jacketchathanks! [02:28]
Froggingthere's WARC 1.1 (which is the latest version) linked there too [02:29]
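A minimal sketch of what Frogging describes, reading response records (which can appear in any order) out of a WARC. It assumes Python's warcio package, which is not mentioned in the chat (the chat only links node-warc for JavaScript):

```python
# pip install warcio   (an assumption; any WARC-reading library works the same way)
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            body = record.content_stream().read()  # response body: HTML, CSS, binary, whatever
            print(url, status, len(body))
```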
***bithippo has joined #archiveteam-bs [02:38]
bithippoThinking about grabbing Imgur. All of it. Anything I should keep in mind prior to putting it in cold storage?
(iterating over ever permutation of image urls based on how Imgur generates image urls)
[02:43]
jacketchaIs the way Imgur generates urls known?
Better question, is the source of data for the RNG Imgur uses known?
[02:47]
***robink has joined #archiveteam-bs [02:48]
bithippohttps://blog.imgur.com/2013/01/18/more-characters-in-filenames/
"Choosing 5 characters from 26 lowercase letters + 26 uppercase letters + 10 numerical digests, leaves us with 916,132,832 possible combinations (625). Upgrading to 7 characters gives us 3,521,614,606,208 (3.52 trillion) possibilities."
404->check back in the future, 200->WARC gz
Etag header on a request is the MD5 of the image
[02:50]
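A sketch of the probe-and-save loop bithippo describes (404 means check back later, 200 means save it, ETag is reportedly the image MD5). The i.imgur.com URL pattern, the .jpg extension, and the delay are assumptions, not from the chat:

```python
import itertools
import string
import time

import requests

ALPHABET = string.ascii_letters + string.digits    # 62 characters, per the blog post quoted above
IMAGE_URL = "https://i.imgur.com/{}.jpg"            # assumed direct-image URL pattern

def probe(image_id):
    # 200 -> grab it (the chat notes the ETag header is the MD5 of the image);
    # 404 -> come back later. In practice Imgur may redirect removed IDs rather
    # than return a plain 404, so this check is a simplification.
    r = requests.head(IMAGE_URL.format(image_id), timeout=30)
    return r.headers.get("ETag") if r.status_code == 200 else None

for combo in itertools.product(ALPHABET, repeat=7):  # 62**7 ≈ 3.52 trillion candidate IDs
    image_id = "".join(combo)
    etag = probe(image_id)
    if etag:
        print(image_id, etag)
    time.sleep(0.5)  # throttling and IP bans, not bandwidth, are the real constraint
```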
jacketchaThat is still 3 trillion web requests [02:53]
bithippo¯\_(ツ)_/¯
Alternatives? Besides waiting until Imgur runs out of runway and then its more pressing :/
(ie Twitpic 2.0)
My only frustration is that the URL isn't deterministic from a hash of the image, so it's possible an image exists, is deleted, and then replaced without any way to know
[02:53]
jacketchaUnless it was already archived
Look, it's a good idea in practice, but here's the thing
[02:58]
bithippoAhh, truth [02:58]
jacketchaImgur gets around 17.3611111111 new images per second
That would place it at around 2687000000 images today
[02:58]
bithippoLe sigh. [03:01]
jacketchaIt gets worse
That means you have a roughly 0.076300228743465619423705658937702969914635168416766807608280491104220976146849283303277891127907878357259164917744009861455363906114286574925748247085571170136781627322552461470824865159611513732687195% chance of getting an image every time you send a request
Don't be fooled by the high precision, the accuracy of your plan is very low.
But, there is a way to make it higher.
Much higher, in fact
If you can figure out the source of the data that Imgur uses for its random number generation algorithms, you can at least grab the newest images
[03:01]
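The hit rate jacketcha quotes follows directly from those two figures (his image-count estimate divided by the 7-character ID space):

```python
existing_images = 2687000000.0   # jacketcha's estimate of images on Imgur at the time
id_space = 62 ** 7               # 3,521,614,606,208 possible 7-character IDs
print(existing_images / id_space)  # ≈ 0.000763, i.e. roughly a 0.076% chance per random request
```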
bithippoSounds workable using their API to get latest image paths and then working backwards [03:06]
jacketchapossibly
But if it is randomly generated, even pseudorandomly generated, you're still screwed.
or
you know
you could just email them
and ask
[03:07]
bithippo"Hi. I will take one Imgur pls. Will be over with hard drives shortly."
Appreciate the input!
[03:10]
jacketchaNo problem [03:11]
hold up
If a Warrior or ArchiveBot finds a WARC file, is it added to the collection of WARC files or is it added into the WARC file of the site it is located on?
[03:19]
godanei'm up to 18k items this month
this year has been slower then last year
i just hope i can get it past 100k for the year
https://archive.org/details/@chris85?&and[]=addeddate:2017
it 97,682 so far
[03:23]
jacketchawoah
maybe I should start counting mine
and putting it in actual warc files
[03:24]
***pizzaiolo has quit IRC (Remote host closed the connection) [03:33]
robink has quit IRC (Ping timeout: 246 seconds) [03:45]
Somebody2jacketcha: if a HTTP request returns a WARC file, and that HTTP request and response is being stored into a WARC file,
then, yes, you'll have nested WARC-formatted data
AFAIK, no WARC-recording tool will automatically un-nest it (and that would probably not be a good idea in any case)
[03:48]
jacketchawait
so that means that there possibly could be an archive of the entire internet floating around the wayback machine somewhere, but nobody would ever know because it was nested.
[03:55]
SketchCowThis is.....
See, this is one of the things
You are asking... well, you're asking for a college course in how WARC works
It's sort of on topic and sort of off
It's certainly sucking all the air out of the room
It's nice to see people talking
[04:01]
jacketchaso nested warc files are basically politics
got it
[04:01]
***bithippo has quit IRC (Ping timeout: 260 seconds) [04:11]
SketchCowNo.
You're wandering into a welding shop going "So.... why cold welds"
[04:14]
jacketchathat seems very accurate [04:16]
***bithippo has joined #archiveteam-bs [04:20]
kyounko has joined #archiveteam-bs [04:33]
..... (idle for 22mn)
qw3rty117 has joined #archiveteam-bs [04:55]
qw3rty116 has quit IRC (Read error: Operation timed out) [05:01]
..... (idle for 21mn)
bithippo has quit IRC (Quit: Page closed) [05:22]
Stiletto has quit IRC (Read error: Operation timed out) [05:29]
Stilett0 has joined #archiveteam-bs
BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[05:34]
...... (idle for 26mn)
wp494 has quit IRC (Ping timeout: 250 seconds)
wp494 has joined #archiveteam-bs
[06:02]
zgrant has left
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
wp494 has joined #archiveteam-bs
[06:12]
kimmer1 has joined #archiveteam-bs
midas2 has quit IRC (Ping timeout: 1212 seconds)
kimmer12 has quit IRC (Ping timeout: 633 seconds)
[06:27]
midas2 has joined #archiveteam-bs [06:47]
........ (idle for 37mn)
wp494_ has joined #archiveteam-bs [07:24]
ZexaronS has quit IRC (Read error: Connection reset by peer)
ZexaronS has joined #archiveteam-bs
wp494 has quit IRC (Read error: Operation timed out)
[07:29]
wp494_ has quit IRC (Ping timeout: 633 seconds)
odemg has quit IRC (Read error: Operation timed out)
[07:39]
odemg has joined #archiveteam-bs [07:46]
wp494 has joined #archiveteam-bs [07:53]
robink has joined #archiveteam-bs [08:03]
jacketchaI wonder how many times jquery has been archived
By this point, there must be at least a hundred copies made of it each week
[08:06]
............. (idle for 1h2mn)
***jacketcha has quit IRC (Read error: Connection reset by peer)
jacketcha has joined #archiveteam-bs
[09:09]
Mateon1 has quit IRC (Ping timeout: 245 seconds)
Mateon1 has joined #archiveteam-bs
[09:16]
.... (idle for 19mn)
jrwrSWEET BABY JESUS
someone, I got php7 to run
on this holy shit old host
[09:35]
PurpleSymIs this a dedicated server, jrwr? [09:39]
jrwrnot even close
its a shared host running linux 2.6 on a old cpanel
running a god old apache + php 5.3
I override the mod_security and mod_suphp all to fux to get PHP scripts to run with a custom statically linked php binary I made
[09:39]
PurpleSymWtf? 2.6 EOL’d years ago. [09:41]
jrwrI'm making do with what I have
its where its staying
I'm making its own little world on this webhost
jrwr compiles memcached
now
comes the fun part
I'm going to update mediawiki
[09:43]
jacketchagood luck
i can't even update windows without doing a ritual to please the tech gods
[09:46]
jrwrthis is dark magic
php does not like doing this
I have a plan
to compile php with memcached, and then run a little memcached server so mediawiki can cache objects
[09:50]
jacketchanah
you don't even need php
just do what I do and use 437 IFTTT applets as your server
with a touch of github pages
[09:53]
jrwrlol
this is the archiveteam wiki I'm working on
[09:54]
jacketchaIs the ArchiveTeam wiki archived? [09:55]
IglooMostly :p [09:55]
jacketchaYou know what I want to try? My school has unlimited storage on all google accounts under its organization. I wonder how far they would let me push that. [09:57]
jrwrIts staying where it is for now
for ~reasons~
[09:58]
jacketchais it because you missed a semicolon somewhere but there isn't a really good php linter yet
oh no
i just remembered i have midterms
gn
[09:58]
.... (idle for 18mn)
jrwroh man
thats a ton better
[10:19]
***jacketcha has quit IRC (Remote host closed the connection) [10:19]
jrwrArchive team is now running on mediawiki 1.30.0 [10:19]
***jacketcha has joined #archiveteam-bs
jacketcha has quit IRC (Remote host closed the connection)
[10:21]
jacket has joined #archiveteam-bs [10:29]
fie has joined #archiveteam-bs [10:43]
fie has quit IRC (Read error: Connection reset by peer) [10:48]
.... (idle for 19mn)
pizzaiolo has joined #archiveteam-bs [11:07]
jrwrIgloo: better huh? [11:14]
Igloojrwr: miles and miles [11:16]
jrwrYa
Response times are sub 200ms
Before they were 1400ms
[11:16]
JAAjrwr: Well done! Much, much better. <3 [11:20]
jrwrThanks [11:20]
JAASomebody2: Ah, makes sense. Thanks for checking. [11:21]
jrwrI woke up from a strange dream at 3am (flying a airplane and somewhat crashing it)
And then had a brainwave on how to get php working correctly
Been up since then, work is going to be hell today
[11:21]
.......... (idle for 46mn)
***BlueMaxim has quit IRC (Leaving) [12:08]
....... (idle for 32mn)
IglooOk, So it looks like we can iterate through the numbers [12:40]
JAAFor user profiles, maybe. For characters, no way. [12:41]
Igloo212 million users? Unlikely [12:41]
JAABut it should be fairly simple to scrape them from https://www.saintsrow.com/community/characters/mostrecent
The question is, how do we get the actual characters (not just the images)?
[12:42]
Smileyis there a log of what has gone before?
I have a 'archiveteam' account registered along with my personal one
not sure why
maybe i just suggested scraping for SR3
[12:42]
Igloohttps://www.saintsrow.com/users/show/212300001 appears to be the lowest. https://www.saintsrow.com/users/show/213056573 latest
2300001 3056573 ~705,000 user profiles?
Those are easy
[12:46]
........ (idle for 35mn)
jrwrSketchCow: email fixed
confirmed working with password resets being sent to a gmail account
[13:22]
SketchCowGreat [13:23]
jrwrhttps://usercontent.irccloud-cdn.com/file/brVInVWJ/image.png
you can see when I dropped in the php changes
[13:32]
..... (idle for 20mn)
joepie91: it doesnt work like that
this whole box is from 2011
[13:52]
joepie91ah, just an old cpanel then that doesn't support it, or? [13:53]
jrwrya
I have methods and apis
im patching it in
[14:02]
***icedice has joined #archiveteam-bs [14:05]
icedice has quit IRC (Ping timeout: 250 seconds) [14:13]
JAAjrwr: LOL, that graph is beautiful! [14:13]
.... (idle for 15mn)
jrwrThanks
its going up on my wall
[14:28]
........... (idle for 50mn)
JAA: Igloo
guess what
SSL BITCHES
[15:18]
JAAYiss [15:19]
jrwrhttps://www.ssllabs.com/ssltest/analyze.html?d=archiveteam.org
fucking A rating!
[15:19]
MrRadar2:D :D :D https://i.imgur.com/CloHYLR.png [15:20]
jrwrwith Strict Transport Security (HSTS) on (left it pretty short just in case) [15:20]
IglooJust need a redirect now ;-) [15:20]
JAA^ [15:20]
jrwrna, not going to enforce it
HSTS is enough
[15:21]
***zgrant has joined #archiveteam-bs [15:23]
jrwrfuck it
done
SketchCow: SSL is now installed
anything else?
[15:24]
Igloo:)
hehe, all my home stuff with LE gets A rating too
Which is bonza
[15:24]
SketchCowI think that's all I can think of
Someone proposed some sort of theme upgrade
But it all seems just fine to me now.
[15:26]
jrwrah
its fine
I /might/ get bored and add in a new editor but the new editor requires all kinds of crazy
[15:31]
SketchCowIf people come up with things, we'll consider them now that it's possible
Generally, someone complaining they can't work on the Wiki because they miss a gimgaw is focused on the wrong things.
[15:32]
jrwrYa
I am using the file based cache built into mw
so bots and stuff all get served static pages
I feel like I just refurbed my 1984 Chrysler lebaron convertible (I own one) https://drive.google.com/file/d/1AQqXNiluKTk5xuCYStfVexiH1LLUOYaLLQ/view?usp=sharing
[15:33]
IglooNice car [15:43]
jrwr900$
runs great, and talks to you
https://www.youtube.com/watch?v=nGuRS-L2BN0
[15:45]
I love the old DEC speech Synths
sound better then software ones
[15:53]
........ (idle for 35mn)
godaneso another box of tapes i bought is shipped [16:28]
..... (idle for 22mn)
***dd0a13f37 has joined #archiveteam-bs [16:50]
godaneso this happened: http://mashable.com/2017/12/20/sesame-street-irc-macarthur-grant-refugee-middle-east/#fIS9la5_bSq7 [16:53]
...... (idle for 25mn)
***schbirid has joined #archiveteam-bs
jacket has quit IRC (Read error: Connection reset by peer)
jacket has joined #archiveteam-bs
[17:18]
dd0a13f37aria2c is a mystery
if I have it use 1 connection or 10, I still get about 2r/s
if I split it up across 6 command windows, 12r/s
might have to do with the fact that it's split across multiple IPs though
Anyone know a good tool to do this automatically? Split up http requests over multiple proxies?
[17:23]
....... (idle for 34mn)
***bithippo has joined #archiveteam-bs [17:59]
......... (idle for 43mn)
ola_norskis there some sort of code available to look at on how IA get urls from warcs?
or does it convert first, somehow, then do _that_ ?
i was kind of expecting warcs to be a kind of archive with an index, not all data being in a single file :/
e.g containing something i could open in gedit etc..
[18:42]
dd0a13f37iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [18:50]
bithippohttps://github.com/recrm/ArchiveTools/wiki/warc-extractor [18:51]
dd0a13f37Can someone help me?
I'm trying to archive a site with aria2c. When using multiple concurrent connections, I get about 2r/s. The speed is around 50 kbit, despite my internet connection being much faster.
When using multiple instances over multiple IPs, it's not much better. The individual speeds for some drop down to 5kbit.
[18:52]
bithippoRecommend using https://github.com/ludios/grab-site to archive instead [18:54]
dd0a13f37I am currently running 50 instances of aria2 with 20 concurrent connections each. I get 5r/s total. [18:54]
bithippoIf you need a WARC file, request/response headers, etc. [18:54]
dd0a13f37This is abysmally low, what gives? [18:54]
bithippoYou are most likely being throttled by IP [18:54]
dd0a13f37But I'm spreading it over 50 different IPs. [18:54]
bithippo(even if distributing across multiple IPs)
Sliding window of bytes in the webserver config. Initial requests are fast, subsequent requests slow down if you try to firehose
What's the hostname?
[18:54]
dd0a13f37ratsit.se
Or my hostname?
[18:55]
bithippoNope, site hostname. Checking something. [18:55]
dd0a13f37So how can it throttle them to 2-5 KiB/s when using 50 different IPs, but 50 KiB/s when using 1? [18:55]
bithippoAre these all anonymous web requests? Or are you signed in/setting a cookie to be logged in to fetch data? [18:57]
dd0a13f37These are all anonymous requests from tor exit nodes. No cookies are stored. [18:58]
bithippoCould be throttling by tor IPs. I did that at my last gig on our Nginx servers.
(Tor requests were notoriously bad scraping actors in our case)
[18:58]
dd0a13f37But I used tor IPs before too. And those were at 50KiB/s, not 2-5KiB/s. [18:58]
bithippoI don't have a good answer unfortunately :/ Lots of variables that could be causing it. What's the purpose of using Tor to perform the requests? [18:59]
dd0a13f37The only logical explanation is my connection being the bottleneck, but that would put it at around 2 mbit, which is way too slow
Because I don't want to get in any trouble for the scraping, and they could ban my IP
Now it jumped up to 12 resp/s, which was my previous peak when using 6 different IPs.
[18:59]
bithippoIs a cloud provider VM out of the question with a slow concurrency rate?
2 requests per second, say.
[19:01]
dd0a13f37It could be on their end too.
I could just leave the computer on over night, but I would prefer not to pay any money.
Could it be they just serve 12 connections at the same time?
Now down to 7 again. Sure is a mystery what is going on...
And now back up to 14. At this speed, it will take 6 hours, which is slow but acceptable.
[19:01]
bithippoI can rip it for you and provide a torrent file when I'm done. [19:07]
dd0a13f37It went up to 34 now.
How? With grab-site?
They might block the IP being used in that case
[19:08]
bithippo10 second wait between requests
It'll take a while, but it'll finish eventually.
Have to head out, leave me a note here if that's a plan
[19:09]
dd0a13f3710 seconds would take half a year for the whole site, and it would change during the time, so I don't think that's a good idea
But if it continues to be this fast then it should be done in a few hours, which is good.
[19:11]
Well, the only logical explanation is some advanced throttling algorithm in place. I can't find any other explanation for why it's so slow. [19:21]
https://pastebin.com/VkCS1yJ1 It apparently got faster over time, with a peak of 46 resp/s, before slowing down. [19:27]
.... (idle for 18mn)
ola_norskany sqlite geninouses savvy to very basic sqlite reational database structure in here who wouldn't mind if ask some questions?
relational*
[19:45]
dd0a13f37I have a basic knowledge, shoot [19:46]
ola_norskdd0a13f37: thanks. if you please take a look at the SQL here, (just page search for 'sqlqueries') https://github.com/DuckHP/twario-warrior-tool/blob/master/src/twario/sqlitetwario.py
i'm sure that sql could be done better, and i think you'll agree. Sadly my sql is quite shit
to optimize storage etc..i mean
and speed etc
[19:47]
dd0a13f37The schema? [19:50]
ola_norskaye [19:50]
dd0a13f37You can add a constraint for TweetUserId so it has to have a corresponding entry in Users
And Users should either have id INTEGER PRIMARY KEY, or TweetUserID should be a username
[19:50]
ola_norskyeah been thinking that so i made a 'users' table
ok
[19:51]
dd0a13f37Display name isn't stable, but it might be overkill to provision for that
search for foreign key constraint
[19:52]
ola_norskaye, i'm not even sure yet if 'tweep' reads display names [19:52]
dd0a13f37https://sqlite.org/foreignkeys.html http://www.sqlitetutorial.net/sqlite-foreign-key/
Well, if you want to archive avatars etc it might be neat to have. You could have three tables, but it might be overkill
tweets - tweet text, date, username
users - username (not unique), date, avatar, displayname
or wait, that makes two
[19:52]
***Valentine has quit IRC (Read error: Connection reset by peer) [19:54]
ola_norskavatar might be doable [19:54]
dd0a13f37And then just do SELECT * FROM users WHERE username = ... LIMIT 1 [19:54]
ola_norskty [19:54]
dd0a13f37not sure about the syntax [19:55]
ola_norskthe requests sql i can figure out i think, but i suck at schema/structure :/ [19:58]
***BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io) [19:58]
ola_norski'll check out that link you posted. thanks [19:59]
dd0a13f37Well, have one users table that for each username can have multiple entries (e.g. if they change their avatar you get a new entry with same username)
And one tweets table, since they are immutable
[20:00]
***Valentine has joined #archiveteam-bs [20:00]
ola_norske.g if a tweet is identical, just have a 'content' table perhaps, and refereance that? [20:03]
dd0a13f37If you're ever building your own scraper, it seems like mobile.twitter.com is more pleasant to work with
view-source:https://twitter.com/jack view-source:https://mobile.twitter.com/jack
[20:03]
ola_norski'm just "re-doing" a tool called 'tweep' [20:03]
dd0a13f37As in, modifying it? [20:04]
ola_norskaye, this: https://github.com/haccer/tweep ..it seems to work quite well, but could use some tweaking [20:05]
dd0a13f37If you want a complete archive, you could probably crawl pretty nicely. Start off by the timeline, then see what accounts and hashtags you find. Then traverse those accounts and hashtags, see what accounts and hashtags you find.
>The --fruit feature will display Tweets that might contain sensitive info
uh
[20:05]
ola_norskaye, have not tested that yet, but i've been thinking of removing it
basically 'user' and 'search words' is my focus
not exactly too keen on archiving 'doxing' tweets
[20:07]
dd0a13f37Well, why bother taking it out? Just don't use it, or remove all documentation references to it if you're really concerned. [20:08]
ola_norskaye [20:08]
dd0a13f37I think mobile.twitter.com is better. It shows 30 tweets/page instead of 20, and the pages are faster to download [20:09]
hook54321JAA: Are you still grabbing the Catalonia cameras that update every 5 minutes or so? [20:10]
ola_norskdd0a13f37: it seems to require a signin/account [20:10]
JAAhook54321: Yeah
I think so, at least.
Let me check.
[20:10]
hook54321lol [20:10]
ola_norskdd0a13f37: i deliberaly made myself banned from twitter :/ [20:10]
hook54321I need to start recording the cameras I was recording again [20:10]
JAAYep, it's still grabbing... something. [20:11]
dd0a13f37mobile.twitter.com doesn't need an account [20:11]
JAAHaven't looked at the content in a long time though. [20:11]
dd0a13f37https://mobile.twitter.com/jack?max_id=938593014343024639 works just fine for me [20:11]
ola_norskdd0a13f37: so it's just because i'm using desktop browser then?
dd0a13f37: that link worked btw
dd0a13f37: doh, i got a "join today to see it all" when scrolling
[20:12]
hook54321JAA: Should I grab this whole youtube channel? https://www.youtube.com/user/gencat [20:13]
dd0a13f37I am using tor browser with JS disabled. [20:14]
ola_norskdd0a13f37: i think if 'twario/tweep' is made a bit less agressive, it wouldn't need to be 'torified' [20:16]
dd0a13f37https://mobile.twitter.com/jack?max_id=743833014343024639 I can go quite a bit back
Why castrate your perfectly working tweet scraping tool? Requests can use proxies, or multiple.
[20:16]
ola_norskdd0a13f37: with original 'tweep' it seemed to stop at half a year or so back in time at search word
they could, but it would eventually get noticed i think if it's running continuously :/
[20:17]
dd0a13f37different users go differently far back https://mobile.twitter.com/realDonaldTrump?max_id=793833014343024639 [20:18]
ola_norski don't mean users, but e.g one word [20:19]
dd0a13f37There are many tor exit nodes. [20:19]
ola_norskhow could a python script be _fully_ torifyed? If it could be done without using a virtual machine, that would be cool :D [20:21]
dd0a13f37torsocks python ./myscript [20:21]
ola_norskty [20:21]
dd0a13f37Or you can just have requests use a proxy
torsocks -i for guaranteed fresh ip
[20:22]
***BnAboyZ has joined #archiveteam-bs [20:24]
hook54321What collection should I upload that channel to? There's like 400 videos.... [20:27]
ola_norskdd0a13f37: will definetly test that. And i'm guessing just the tiny bit of extra time storing to an sqlitedb counts as tiny bit of it being nice-ifyed :D [20:28]
dd0a13f37The time the request takes will, unless you're using twisted/multithreading [20:29]
ola_norskdd0a13f37: the reason 'local capture time' column is in tweets i think i put in for exactly that purpose, since JAA pointed out that 'tweep' itself does not seem to be correct at keeping times
aye
[20:30]
dd0a13f37The mobile search url query string is ... interesting...
https://mobile.twitter.com/hashtag/EU?src=hash
https://mobile.twitter.com/search?q=EU&next_cursor=TWEET-943937901217370114-943937901217370114-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWQABAAAAIAAAAAAAAQgAAAAAAAJAAAAAAAAAAABAAAAQAAAAAAAAAAiAAAQAAAAAAABAAAAAAAAAACBAAIAAAAAQAAAAAAIAAAAAACAAIAAAAAAAAAAAAAAACAAAhAAAAAAAAACAgAAAAAAAAAAAIAAAAAAAAAAAAAAAAgAAIAAAAAAFAAIIAAACCAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAEAAAAAACgCAAAAgAABwAAABAAAAAAAAAAIAAAAARAAEAAAAAAAA
AAAAAAIAAAAgAAAAAAAAAAAAAAAAACAAAABAAAAABAAAAAAAAQAAAAQAAEAABAAAEAAEAQAAAAAgAAAAAAAAAAAAAwACAAAAAAAAAAAAAAAAABQAAAAAAAAAAAAAAAAAACQAACAAAAAAAAAIAAQACAAAAFABAAAAAAAQkAAAEAAAAAAAAoAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAIAAAAAICAAAAAAAAAAAAAAEAAAAAAEAAACAAAAAAAAAEAAAAAAAAAgAAAAAAQAEAAQAAAAAAAAABAUAAAEAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAIQAAAACACAQAAQAAIAAAAIAAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQBAAAAwAAAAAAAAAAAAAAAAAA
AAAAAQAAAAAAAAAAAgAAAAEAAAAAAACABAAAAAAAAAAAAAAEAAQAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAQAAAAAAAAAAEAAAAAAAAAQAAAAAABABAAAAAAACAAAQAAAAAAAAAAAAAAAAAgAAAAAIAABACIAAAAAAAAAIAAAQAAAAAAAAA%3D%3D-R-0
(that was one link)
[20:31]
hook54321oh dear [20:31]
dd0a13f37the base64 encoded part is some kind of bitmask [20:31]
ola_norskmy eyes!
i think i've seen that garbled shit before, at tweep crashing :/
when adding 'loggin' module, that looks exactly like the output given on the line where it stopped
logging*
[20:31]
dd0a13f37hmm, strange
because there is no base64 encoding or anything of the sort in tweep
[20:34]
ola_norskmaybe i have the output..one sec
seems i've deleted the log, I'll risk trying to run the same command one more time. brb
[20:34]
jrwrgood news
I enabled the new editor toolbar in the wiki (cc SketchCow )
[20:38]
ola_norskdd0a13f37: could it be compressed stuff, like in 'header: gz' crap? [20:40]
dd0a13f37No, it's base64
run base64 -d | xxd, then paste it in
You'll see most of the bytes only have one bit set
[20:41]
SketchCowHurrah [20:42]
dd0a13f37Since it doesn't do anything if you change the numbers at the beginning (max id), the max_id parameter is in there too [20:42]
ola_norskdd0a13f37: all i know that mess of "AAAAAAAA" was the end of the log line when i last tested tweep. And also where it apprently failed. [20:42]
dd0a13f37Not at the beginning, since that's the same across requests [20:43]
ola_norski'm running the same command now, and will pastebin (when) it fails [20:43]
jrwrSketchCow: its snazzy [20:44]
ola_norskdd0a13f37: for all i know it might've been some nasty character(s) that did it [20:45]
jrwrmakes it a /little/ simpler to edit pages [20:46]
***bithippo has quit IRC (Quit: Page closed) [20:48]
BartoCH has quit IRC (Ping timeout: 260 seconds) [20:54]
dd0a13f37Oh, regular twitter has that same AAAAAAAAAAA mess, just not as the requested URL [20:59]
JAAI think it does. Load a page in your browser (with JS enabled), enable dev console, scroll to the bottom, check out the requests that happen in the background.
I think tweep just tries to imitate what the browser would do.
[21:01]
ola_norskhmm
it has not crashed yet here like last time, but seems like a lot of people love to tweet 'netneutrality' these days. So it's not even done with this month. I think last time it crashed at about this years month of may tweets
holy shit people have been tweeting 'netneutrality' lol
[21:02]
dd0a13f37Twitter gets 6k tweets/sec, with 20 tweets/request archiving this is in the realm of possibility [21:05]
ola_norskdd0a13f37: aye :D and using webarchive.io, or wget/curl with requests to web.archive.org/save/ is quite futile :D [21:06]
dd0a13f37You would need to do a few hundred requests per second. The problem is archiving all those avatars, if you saturate a 1gbit/s line you can afford to archive 20kbit avatars assuming no overhead or IP bans
But the avatars aren't very important, are they?
[21:07]
ola_norskreconstructing the links to the tweets is more important [21:07]
dd0a13f37That's possible too, all the info is in the HTML [21:08]
ola_norskand 'tweep' captures the id
aye
[21:08]
dd0a13f37Does tweep have a mode where it can just show you all the tweets being done? [21:08]
ola_norskit does by default [21:09]
dd0a13f37Without narrowing down to a hashtag? Does it get 100%? [21:09]
ola_norski don't know. It does a fuck of a lot of tweets though :D
if why it stopped could be worked out, i bet it could do 100%
[21:10]
jrwrfor the wikidump nerds At 04:00 on Friday. a copy of the wiki's XML + Images are uploaded to the IA
for good measure
[21:12]
ola_norskright now i'm just doing "python tweep -s 'netneutrality' > tweets.txt" ..to see if it eventually stops like last time. For all i know, piping to a textfile is what did it. [21:12]
dd0a13f37But can you just run python tweep > t.txt? [21:13]
SketchCowAND THE WINNER OF THE "DO YOU WANT THIS" SWEEPSTAKES FOR DECEMBER 21 IS [21:14]
ola_norskwith '-s <search word>' , yes [21:14]
SketchCow...hundreds of gigs of funeral recordings in mp3 [21:14]
***BartoCH has joined #archiveteam-bs [21:14]
dd0a13f37But without -s parameter?
You could cheat and just use the X most common words, but that's not a nice solution
[21:14]
ola_norskthen it asks for parameters i think .. I think it's either '-u (user)' or '-s (word(s))' possible
either one of those are required.. _i think_
[21:15]
dd0a13f37Then a full scrape is difficult, or at least harder [21:16]
jrwrSketchCow: your collection never ceases to amaze me [21:16]
JAAjrwr: Yay, finally. The last such dumps were uploaded in 2014 or 15. [21:16]
jrwrthey get dumped here after processing https://www.archiveteam.org/dumps/ [21:17]
ola_norskdd0a13f37: with a 'users' table i could be easier though perhaps..or :D [21:17]
jrwronly keeps one [21:17]
ola_norskdd0a13f37: it* [21:17]
jrwrthe backup log for it is in https://www.archiveteam.org/backup.log [21:17]
dd0a13f37Well, there's still a few users who are never mentioned by others, never use certain hashtags, and never use certain words [21:18]
ola_norskdd0a13f37: yeah [21:18]
JAANeat [21:18]
ola_norskdd0a13f37: not to mentioned banned, yet mentioned, and private..etc. i guess
dd0a13f37: i don't have much experience using tweep, so i don't even know how it behaves on finding disabled accounts, or banned users :/
[21:19]
dd0a13f37If you're scraping in realtime, that doesn't matter.. it would be one hell of a tweet to get banned in under 5 milliseconds [21:22]
ola_norskdd0a13f37: i think it just goes back in time from point of start [21:22]
dd0a13f37You'll never keep up, better to go in realtime [21:23]
ola_norskdd0a13f37: that would need some genoius at python threads i think..and perhaps faster bandwitch than mine :D
bandwidth*
[21:23]
dd0a13f37twisted-http is fast, no? [21:24]
***jacketcha has joined #archiveteam-bs
pizzaiolo has quit IRC (Read error: Operation timed out)
[21:24]
ola_norskdd0a13f37: i do not know..tweep uses 'request' / 'urllib(3?)' i think [21:25]
dd0a13f37one results page is 8.5k gzipped, contains 20 tweets, at 6k tweets/sec this gives 20 mbit/s [21:26]
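The 20 mbit/s figure follows from those numbers:

```python
tweets_per_sec = 6000.0   # dd0a13f37's firehose figure
tweets_per_page = 20      # tweets on one results page
page_kb = 8.5             # gzipped page size

pages_per_sec = tweets_per_sec / tweets_per_page        # 300 requests/s
mbit_per_sec = pages_per_sec * page_kb * 8 / 1000.0     # ≈ 20.4 Mbit/s
print(pages_per_sec, mbit_per_sec)
```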
ola_norskdd0a13f37: it wouldn't surprise me if in certain senarios a hashtag were quicker than i could process [21:26]
dd0a13f37Yeah, requests without threading. [21:26]
***jacket has quit IRC (Ping timeout: 248 seconds) [21:27]
dd0a13f37But I think caching will make such attempts impossible, if you do the same query multiple times you'll get the same result [21:27]
ola_norskwhen using crontab wget, i had to cut time from 5 minutes to 3 minutes between each web.archive.org/save/ request..just to have a chance [21:27]
dd0a13f37You can do those in the background. Fire and forget. But IA won't like it [21:29]
ola_norskdd0a13f37: going "upwards" in time in a twitter feed is most likely the best solution. But my grasp of how to do that..is weak :D [21:29]
dd0a13f37I think archiving twitter is an insanity project anyway, better to just wait for library of congress to get their shit together [21:30]
ola_norskdd0a13f37: i just focous on hastags, like netneutrality :D [21:30]
dd0a13f37that's probably possible [21:30]
ola_norskdd0a13f37: entire twitter, or twitter by even years or months..yeah, some congress would've have to do that :D [21:31]
jrwrand now I rest from poking the wiki really hard over the last 24hr [21:31]
***jacketcha has quit IRC (Read error: Operation timed out) [21:33]
ola_norskdd0a13f37: tweets containing 'netneutrality' been scrolling on my screen for 'since i said i started the command' , and i'm still on 2017-12-19 :/
dd0a13f37: though i expect it will speed up when getting past the 14th a bit
[21:33]
dd0a13f37You could archive faster if you modify it to use twisted [21:34]
ola_norskjust by the protocol stuff or using threading? [21:35]
dd0a13f37What?
>Twisted is an event-driven networking engine
https://twistedmatrix.com/documents/current/api/twisted.web.client.html
[21:36]
ola_norskso its beatifulsoup that's bottleneck, or? [21:37]
dd0a13f37No, requests
and that it's not using requests with threads
[21:38]
ola_norskcould it "re-use" already established connections? because that is one thing that pisses me off about tweep. It seems to do one connection per damn tweet [21:40]
dd0a13f37yeah [21:40]
ola_norsk..or at least try, like wget
ty
[21:40]
jrwranyway JAA I figured once a week is a good backup for a low traffic wiki [21:41]
dd0a13f37or apparently asyncio is recommended [21:42]
JAAYeah, sounds reasonable.
ola_norsk, dd0a13f37: It would probably be easiest to reimplement the whole thing based on aiohttp or similar.
[21:42]
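A minimal sketch of the aiohttp approach JAA suggests: a single shared session keeps a connection pool, so the scraper stops opening one connection per tweet. The URL list and cursor handling are placeholders, not tweep's actual logic:

```python
import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    # one ClientSession = one connection pool, so requests reuse connections
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, page in zip(urls, pages):
            print(url, len(page))

urls = ["https://mobile.twitter.com/jack?max_id=938593014343024639"]  # placeholder list
asyncio.get_event_loop().run_until_complete(main(urls))
```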
ola_norskola_norsk taking notes of all :D [21:44]
JAAI've written scrapers with aiohttp before, it's really nice. [21:44]
ola_norskgot git? :D [21:45]
JAAHTTP/2 support would be even better.
No, haven't shared it yet.
It's on my list for the holidays, uploading all my grabs and the corresponding code.
[21:45]
ola_norskJAA: feel free to punch in some stuff :) https://github.com/DuckHP/twario-warrior-tool
i have to the get database thingy working first i guess, before i do anything else :/
(that, and making sure it doesn't freeze)
[21:46]
dd0a13f37http://www.sqlalchemy.org/ [21:47]
JAAWhat did you change so far?
Also, port to Python 3 please.
[21:48]
ola_norskJAA: i've barely (not really) touched tweep itself so far :/ [21:48]
JAAAh ok [21:49]
ola_norskbah...I've not python'ed in years..2.7 is new to me :D [21:49]
JAAOh please, Python 3 was released in 2008. :-P [21:50]
ola_norskwhy is there not a python script that converts e.g 'print "shit"' ? [21:51]
dd0a13f372to3? [21:51]
JAA2to3
That should handle the most obvious stuff.
[21:51]
ola_norskgood [21:52]
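For reference, 2to3 ships with Python and rewrites the obvious syntax differences in place (`2to3 -w tweep.py` writes the changes back to the file). A sketch of the kind of change it makes:

```python
import sys

# Python 2 source:
#     print "shit"
#     print >> sys.stderr, "oops"
#
# what `2to3 -w script.py` turns it into:
print("shit")
print("oops", file=sys.stderr)
```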
***icedice has joined #archiveteam-bs [21:53]
ola_norskthe need of porting an interpreted language...that's a travesty by itself :( [21:54]
JAAWell, it's necessary because they cleaned up a ton of poorly designed stuff in Python 3. [21:54]
dd0a13f37it truly boggles the mind
yet c can remain source compatible for 28 years and counting
[21:55]
JAAYeah, let's compare C to Python...
And I doubt that C was as stable in the early stages of development.
Let's discuss that again when Python is 45 years old.
[21:55]
ola_norsk:D [21:56]
dd0a13f37But python is old by now. The 2 to 3 migration was a complete catastrophe. [21:56]
JAAWell yeah, many of those things (e.g. string vs. unicode distinction) should've been fixed earlier.
But they waited and accumulated all those things and then made one big backwards-incompatible release.
Which makes sense, otherwise you'd have to keep changing the code all the time.
[21:57]
hook54321Should these videos be uploaded to community video, or a different collection? https://www.youtube.com/user/gencat [21:58]
JAAAnyway, this is getting way too offtopic for this channel. [21:58]
dd0a13f37C was standardized in 1989, and k&r was released in 1978 - 11 years [21:58]
ola_norsk:D [21:58]
dd0a13f37python was released in 1991
it was not standardized by 2002
[21:58]
***JAA changes topic to: Lengthy Archive Team and archive discussions here | Offtopic: #archiveteam-ot | <godane> SketchCow: your porn tapes are getting digitized right now
schbirid has quit IRC (Quit: Leaving)
[21:59]
dd0a13f37Or, to be fair, python2 was released in 2000, and it wasn't standardized by 2011 [21:59]
JAA-> #archiveteam-ot [22:00]
dd0a13f37I didn't know we had an offtopic channel for the offtopic channel [22:01]
JAAThis isn't the offtopic channel, it was always about lengthy discussions (because #archiveteam is limited to announcements).
And -ot is new, just opened last week I think.
[22:01]
DFJustinoh great another channel [22:02]
hook54321lol [22:02]
dd0a13f37#archiveteam-ot-bs when? [22:04]
***icedice2 has joined #archiveteam-bs [22:06]
JAAhook54321: Community video sounds reasonable to me. Are you uploading each video as its own item? If so, you should probably ask info@ to create a collection of all of them in the end. [22:08]
***icedice has quit IRC (Ping timeout: 250 seconds) [22:08]
hook54321Each of them there own item yeah. I'm using tubeup to do it. I'll email info@ when it's done I guess. [22:09]
***ola_norsk has quit IRC (R.I.P dear known Python :( https://youtu.be/uy9Mc_ozoP4) [22:10]
Smileyis -bs not the off topic channel tho?!
Smiley so confuse. fuck that.
[22:12]
dd0a13f37When do Igloo's pipelines upload? As part of
Archiveteam: Archivebot GO Pack?
[22:14]
JAAYes
All pipelines do, except astrid's and FalconK's.
[22:15]
dd0a13f37So why can't I find a certain !ao job in it? [22:16]
JAALet's go to #archivebot. [22:16]
***pizzaiolo has joined #archiveteam-bs [22:19]
...... (idle for 27mn)
icedice has joined #archiveteam-bs
icedice2 has quit IRC (Ping timeout: 245 seconds)
icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
icedice2 has quit IRC (Client Quit)
kristian_ has joined #archiveteam-bs
icedice has joined #archiveteam-bs
[22:46]
icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
[23:06]
......... (idle for 40mn)
jacketcha has joined #archiveteam-bs [23:48]
jacketchaHey, does anybody know if there is a node.js implementation of the Warrior program? [23:56]
